markitdown

Version: v0.0.1
Published: Feb 18, 2026 License: Apache-2.0 Imports: 44 Imported by: 0

README

markitdown

A pure-Go library and CLI that converts documents to Markdown. It is a Go port of the Python markitdown library.

Features

  • Pure Go, no CGO, no external runtime dependencies
  • 12 format converters: PDF, DOCX, PPTX, XLSX, XLS, HTML, RSS/Atom, CSV, EPUB, Jupyter, plain text, ZIP
  • Deterministic output with golden test suite
  • PDF extraction via PDFium (WebAssembly, no CGO) with heading/bold/italic detection

Supported formats

Format          Extensions                 Notes
PDF             .pdf                       Text extraction via PDFium (WebAssembly, no CGO)
Word            .docx                      Headings, tables, lists, hyperlinks, comments, math (OMML to LaTeX)
PowerPoint      .pptx                      Slides, tables, notes, image alt text
Excel           .xlsx                      Multi-sheet markdown tables
Excel (legacy)  .xls                       Multi-sheet markdown tables
HTML            .html, .htm                Full HTML-to-Markdown conversion
RSS/Atom        .xml, .rss, .atom          Feed items with titles, dates, content
CSV             .csv                       Markdown table with auto charset detection
EPUB            .epub                      Metadata, table of contents, chapter content
Jupyter         .ipynb                     Markdown + fenced code cells with output
Plain text      .txt, .md, .json, .jsonl   Charset detection and UTF-8 conversion
ZIP             .zip                       Recursively converts supported files inside

Install

go get github.com/conductor-oss/markitdown

Library quick start

package main

import (
	"fmt"
	"log"

	markitdown "github.com/conductor-oss/markitdown"
)

func main() {
	m := markitdown.New()

	// Convert a local file
	result, err := m.ConvertFile("report.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result.Markdown)
}

More examples

Each snippet below stands alone; all but the last reuse m from the quick start.

// Convert a URL
result, err := m.ConvertURL("https://example.com/page.html")

// Convert with auto-detection (file path or URL)
result, err := m.Convert("report.pdf")
result, err := m.Convert("https://example.com/page.html")

// Convert from a reader with metadata hints
f, err := os.Open("data.csv")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
result, err := m.ConvertReader(f, markitdown.StreamInfo{
	Extension: ".csv",
	MIMEType:  "text/csv",
	Charset:   "shift_jis",
})

// Options
m := markitdown.New(
	markitdown.WithKeepDataURIs(true), // preserve base64 data URIs in output
)

CLI quick start

Build:

go build -o markitdown ./cmd/markitdown

Convert a file to stdout:

./markitdown report.pdf

Convert and write to a file:

./markitdown -o output.md report.docx

Convert from stdin with format hint:

cat data.csv | ./markitdown -x csv

Convert a URL:

./markitdown https://example.com/page.html

CLI flags

Usage: markitdown [flags] [source]

Arguments:
  source    File path or URL to convert (reads stdin if omitted)

Flags:
  -o, --output string       Output file (default: stdout)
  -x, --extension string    File extension hint for stdin input (e.g. "pdf", ".csv")
  -m, --mime-type string    MIME type hint
  -c, --charset string      Charset hint (e.g. "shift_jis", "utf-8")
  -v, --version             Show version
      --keep-data-uris      Keep full base64-encoded data URIs in output

Notes

  • PDF extraction is text-based; image-only PDFs produce no output without OCR.
  • DOCX math equations (OMML) are converted to LaTeX notation.
  • CJK charset detection works without hints but is most reliable when Charset is provided in StreamInfo.

Acknowledgements

This project is a Go port of Microsoft's markitdown Python library. The original project provides the reference implementation, test fixtures, and design that this port is based on.

Documentation

Constants

const (
	// PrioritySpecific is for format-specific converters (PDF, DOCX, etc.).
	PrioritySpecific = 0.0
	// PriorityGeneric is for fallback converters (PlainText, HTML, ZIP).
	PriorityGeneric = 10.0
)

Functions

func IsUnsupportedFormat

func IsUnsupportedFormat(err error) bool

IsUnsupportedFormat reports whether the error is an UnsupportedFormatError.

Types

type ConversionError

type ConversionError struct {
	Attempts []FailedConversionAttempt
}

ConversionError is returned when a converter accepted the input but failed to convert it.

func (*ConversionError) Error

func (e *ConversionError) Error() string

func (*ConversionError) Unwrap

func (e *ConversionError) Unwrap() error

type CsvConverter

type CsvConverter struct{}

CsvConverter handles CSV files.

func NewCsvConverter

func NewCsvConverter() *CsvConverter

NewCsvConverter creates a new CsvConverter.

func (*CsvConverter) Accepts

func (c *CsvConverter) Accepts(info StreamInfo) bool

func (*CsvConverter) Convert

func (c *CsvConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type DocumentConverter

type DocumentConverter interface {
	// Accepts returns true if this converter can handle the given input.
	// It MUST NOT change the read position of reader.
	Accepts(info StreamInfo) bool

	// Convert performs the actual document-to-markdown conversion.
	Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
}

DocumentConverter is the interface all format converters implement.

type DocumentConverterResult

type DocumentConverterResult struct {
	Markdown string
	Title    string
}

DocumentConverterResult holds the output of a conversion.

type DocxConverter

type DocxConverter struct {
	// contains filtered or unexported fields
}

DocxConverter handles DOCX files.

func NewDocxConverter

func NewDocxConverter(m *MarkItDown) *DocxConverter

NewDocxConverter creates a new DocxConverter.

func (*DocxConverter) Accepts

func (c *DocxConverter) Accepts(info StreamInfo) bool

func (*DocxConverter) Convert

func (c *DocxConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type EpubConverter

type EpubConverter struct {
	// contains filtered or unexported fields
}

EpubConverter handles EPUB files.

func NewEpubConverter

func NewEpubConverter(m *MarkItDown) *EpubConverter

NewEpubConverter creates a new EpubConverter.

func (*EpubConverter) Accepts

func (c *EpubConverter) Accepts(info StreamInfo) bool

func (*EpubConverter) Convert

func (c *EpubConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type FailedConversionAttempt

type FailedConversionAttempt struct {
	Converter string
	Err       error
}

FailedConversionAttempt records a converter that accepted but failed.

type HTMLConverter

type HTMLConverter struct {
	// contains filtered or unexported fields
}

HTMLConverter handles HTML files.

func NewHTMLConverter

func NewHTMLConverter(m *MarkItDown) *HTMLConverter

NewHTMLConverter creates a new HTMLConverter.

func (*HTMLConverter) Accepts

func (c *HTMLConverter) Accepts(info StreamInfo) bool

func (*HTMLConverter) Convert

func (c *HTMLConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

func (*HTMLConverter) ConvertString

func (c *HTMLConverter) ConvertString(htmlStr string) (*DocumentConverterResult, error)

ConvertString converts an HTML string to markdown.

type IpynbConverter

type IpynbConverter struct{}

IpynbConverter handles Jupyter notebook files.

func NewIpynbConverter

func NewIpynbConverter() *IpynbConverter

NewIpynbConverter creates a new IpynbConverter.

func (*IpynbConverter) Accepts

func (c *IpynbConverter) Accepts(info StreamInfo) bool

func (*IpynbConverter) Convert

func (c *IpynbConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type MarkItDown

type MarkItDown struct {
	// contains filtered or unexported fields
}

MarkItDown is the main document-to-markdown conversion engine.

func New

func New(opts ...Option) *MarkItDown

New creates a new MarkItDown instance with the given options.

func (*MarkItDown) Convert

func (m *MarkItDown) Convert(source string) (*DocumentConverterResult, error)

Convert auto-detects the source type (file path or URL) and converts it.

func (*MarkItDown) ConvertFile

func (m *MarkItDown) ConvertFile(path string) (*DocumentConverterResult, error)

ConvertFile converts a local file to markdown.

func (*MarkItDown) ConvertReader

func (m *MarkItDown) ConvertReader(r io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

ConvertReader converts a stream to markdown using the provided StreamInfo.

func (*MarkItDown) ConvertURL

func (m *MarkItDown) ConvertURL(url string) (*DocumentConverterResult, error)

ConvertURL fetches a URL and converts the response to markdown.

func (*MarkItDown) RegisterConverter

func (m *MarkItDown) RegisterConverter(name string, c DocumentConverter, priority float64)

RegisterConverter adds a custom converter with the given priority. Lower priority values are tried first.
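
To make the ordering rule concrete, the following sketch sorts a small registry by ascending priority. The entry type and tryOrder function are hypothetical, and keeping registration order for equal priorities (via a stable sort) is an assumption; the documentation above only states that lower values are tried first.

```go
package main

import (
	"fmt"
	"sort"
)

// Values copied from the documented constants.
const (
	PrioritySpecific = 0.0
	PriorityGeneric  = 10.0
)

// entry is a hypothetical registry record.
type entry struct {
	name     string
	priority float64
}

// tryOrder returns converter names in the order they would be tried:
// ascending priority. Ties keep registration order (an assumption).
func tryOrder(entries []entry) []string {
	sorted := append([]entry(nil), entries...) // copy; leave input untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].priority < sorted[j].priority
	})
	names := make([]string, len(sorted))
	for i, e := range sorted {
		names[i] = e.name
	}
	return names
}

func main() {
	fmt.Println(tryOrder([]entry{
		{"plaintext", PriorityGeneric},
		{"pdf", PrioritySpecific},
		{"custom-log", -1.0}, // below PrioritySpecific, so tried before built-ins
		{"html", PriorityGeneric},
	}))
}
```

Under these assumptions, a custom converter registered with a value below PrioritySpecific gets the first chance at an input, and one registered above PriorityGeneric acts as a last resort.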

type Option

type Option func(*MarkItDown)

Option configures a MarkItDown instance.
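
Option follows the functional-options pattern. The sketch below shows how that pattern generally works, using a stand-in engine type because MarkItDown's fields are unexported; the field names and the zero-value defaults are assumptions, and only the shapes of New, WithKeepDataURIs, and WithStyleMap come from the documentation.

```go
package main

import "fmt"

// engine stands in for MarkItDown; its fields are hypothetical.
type engine struct {
	keepDataURIs bool
	styleMap     string
}

// Option mutates the engine while it is being constructed.
type Option func(*engine)

func WithKeepDataURIs(keep bool) Option {
	return func(e *engine) { e.keepDataURIs = keep }
}

func WithStyleMap(styleMap string) Option {
	return func(e *engine) { e.styleMap = styleMap }
}

// New starts from defaults and applies the options in order,
// so a later option overrides an earlier one.
func New(opts ...Option) *engine {
	e := &engine{}
	for _, opt := range opts {
		opt(e)
	}
	return e
}

func main() {
	e := New(
		WithKeepDataURIs(true),
		WithStyleMap("placeholder style map"), // the real syntax is not shown here
	)
	fmt.Println(e.keepDataURIs, e.styleMap)
}
```

The pattern lets New() keep working unchanged when new options are added later, which is presumably why the package exposes options as functions rather than a config struct.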

func WithKeepDataURIs

func WithKeepDataURIs(keep bool) Option

WithKeepDataURIs configures whether to keep full data URIs in output (default: false, which truncates them to data:mime/type;base64...).
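
The effect of the default can be pictured with a small stand-alone sketch. The regular expression and the truncateDataURIs function below are assumptions written for this example; they reproduce the documented output shape (data:mime/type;base64...) but are not the library's implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// dataURI matches a base64 data URI and captures its MIME type.
var dataURI = regexp.MustCompile(`data:([a-zA-Z0-9.+-]+/[a-zA-Z0-9.+-]+);base64,[A-Za-z0-9+/=]+`)

// truncateDataURIs is a hypothetical helper. With keep=false it drops the
// base64 payload and leaves "data:<mime>;base64..." in its place.
func truncateDataURIs(markdown string, keep bool) string {
	if keep {
		return markdown
	}
	return dataURI.ReplaceAllString(markdown, "data:${1};base64...")
}

func main() {
	md := "![logo](data:image/png;base64,iVBORw0KGgo=)"
	fmt.Println(truncateDataURIs(md, false))
	fmt.Println(truncateDataURIs(md, true))
}
```

Truncation keeps embedded images from dominating the Markdown output; passing true is useful when the output must round-trip the image data.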

func WithStyleMap

func WithStyleMap(styleMap string) Option

WithStyleMap sets custom style mapping for DOCX conversion.

type PdfConverter

type PdfConverter struct{}

PdfConverter handles PDF files using the PDFium library via WebAssembly.

func NewPdfConverter

func NewPdfConverter() *PdfConverter

NewPdfConverter creates a new PdfConverter.

func (*PdfConverter) Accepts

func (c *PdfConverter) Accepts(info StreamInfo) bool

func (*PdfConverter) Convert

func (c *PdfConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type PlainTextConverter

type PlainTextConverter struct{}

PlainTextConverter handles plain text, markdown, JSON, and JSONL files.

func NewPlainTextConverter

func NewPlainTextConverter() *PlainTextConverter

NewPlainTextConverter creates a new PlainTextConverter.

func (*PlainTextConverter) Accepts

func (c *PlainTextConverter) Accepts(info StreamInfo) bool

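func (*PlainTextConverter) Convert

func (c *PlainTextConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)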

type PptxConverter

type PptxConverter struct {
	// contains filtered or unexported fields
}

PptxConverter handles PPTX files.

func NewPptxConverter

func NewPptxConverter(m *MarkItDown) *PptxConverter

NewPptxConverter creates a new PptxConverter.

func (*PptxConverter) Accepts

func (c *PptxConverter) Accepts(info StreamInfo) bool

func (*PptxConverter) Convert

func (c *PptxConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type RSSConverter

type RSSConverter struct{}

RSSConverter handles RSS and Atom feed files.

func NewRSSConverter

func NewRSSConverter() *RSSConverter

NewRSSConverter creates a new RSSConverter.

func (*RSSConverter) Accepts

func (c *RSSConverter) Accepts(info StreamInfo) bool

func (*RSSConverter) Convert

func (c *RSSConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type StreamInfo

type StreamInfo struct {
	MIMEType  string
	Extension string
	Charset   string
	Filename  string
	LocalPath string
	URL       string
}

StreamInfo holds metadata about the input being converted.
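
When calling ConvertReader on a file you opened yourself, the hints have to be filled in by hand. The sketch below derives them from a path; infoFromPath is a hypothetical helper, the StreamInfo declaration is a local copy of the documented struct, and lower-casing the extension is an assumption about what matches best.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Local copy of the documented struct so the sketch compiles on its own.
type StreamInfo struct {
	MIMEType  string
	Extension string
	Charset   string
	Filename  string
	LocalPath string
	URL       string
}

// infoFromPath is a hypothetical helper that fills the path-derived
// fields. charset may be empty to leave detection to the converter.
func infoFromPath(path, charset string) StreamInfo {
	return StreamInfo{
		Extension: strings.ToLower(filepath.Ext(path)), // keeps the leading dot, e.g. ".csv"
		Charset:   charset,
		Filename:  filepath.Base(path),
		LocalPath: path,
	}
}

func main() {
	info := infoFromPath("exports/Sales.CSV", "shift_jis")
	fmt.Printf("%+v\n", info)
}
```

MIMEType and URL are left empty here; they are more natural to fill in when the input comes from an HTTP response.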

type UnsupportedFormatError

type UnsupportedFormatError struct {
	Extension string
	MIMEType  string
}

UnsupportedFormatError is returned when no converter can handle the input format.

func (*UnsupportedFormatError) Error

func (e *UnsupportedFormatError) Error() string

type XlsConverter

type XlsConverter struct{}

XlsConverter handles legacy XLS files.

func NewXlsConverter

func NewXlsConverter() *XlsConverter

NewXlsConverter creates a new XlsConverter.

func (*XlsConverter) Accepts

func (c *XlsConverter) Accepts(info StreamInfo) bool

func (*XlsConverter) Convert

func (c *XlsConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type XlsxConverter

type XlsxConverter struct{}

XlsxConverter handles XLSX files.

func NewXlsxConverter

func NewXlsxConverter() *XlsxConverter

NewXlsxConverter creates a new XlsxConverter.

func (*XlsxConverter) Accepts

func (c *XlsxConverter) Accepts(info StreamInfo) bool

func (*XlsxConverter) Convert

func (c *XlsxConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type ZipConverter

type ZipConverter struct {
	// contains filtered or unexported fields
}

ZipConverter handles ZIP files by recursively converting their contents.

func NewZipConverter

func NewZipConverter(m *MarkItDown) *ZipConverter

NewZipConverter creates a new ZipConverter.

func (*ZipConverter) Accepts

func (c *ZipConverter) Accepts(info StreamInfo) bool

func (*ZipConverter) Convert

func (c *ZipConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

Directories

Path            Synopsis
cmd/markitdown  command
internal
