markitdown

package module

Version: v0.0.1 (latest) | Published: Feb 18, 2026 | License: Apache-2.0 | Imports: 44 | Imported by: 0

Repository

github.com/conductor-oss/markitdown

README ¶

markitdown

A pure-Go library and CLI that converts documents to Markdown. It is a Go port of the Python markitdown library.

Features

Pure Go, no CGO, no external runtime dependencies
12 format converters: PDF, DOCX, PPTX, XLSX, XLS, HTML, RSS/Atom, CSV, EPUB, Jupyter, plain text, ZIP
Deterministic output with golden test suite
PDF extraction via PDFium (WebAssembly, no CGO) with heading/bold/italic detection

Supported formats

Format	Extensions	Notes
PDF	`.pdf`	Text extraction via PDFium (WebAssembly, no CGO)
Word	`.docx`	Headings, tables, lists, hyperlinks, comments, math (OMML to LaTeX)
PowerPoint	`.pptx`	Slides, tables, notes, image alt text
Excel	`.xlsx`	Multi-sheet markdown tables
Excel (legacy)	`.xls`	Multi-sheet markdown tables
HTML	`.html`, `.htm`	Full HTML-to-Markdown conversion
RSS/Atom	`.xml`, `.rss`, `.atom`	Feed items with titles, dates, content
CSV	`.csv`	Markdown table with auto charset detection
EPUB	`.epub`	Metadata, table of contents, chapter content
Jupyter	`.ipynb`	Markdown + fenced code cells with output
Plain text	`.txt`, `.md`, `.json`, `.jsonl`	Charset detection and UTF-8 conversion
ZIP	`.zip`	Recursively converts supported files inside

Install

go get github.com/conductor-oss/markitdown

Library quick start

package main

import (
	"fmt"
	"log"

	markitdown "github.com/conductor-oss/markitdown"
)

func main() {
	m := markitdown.New()

	// Convert a local file
	result, err := m.ConvertFile("report.pdf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result.Markdown)
}

More examples

// Convert a URL
result, err := m.ConvertURL("https://example.com/page.html")

// Convert with auto-detection (file path or URL)
result, err := m.Convert("report.pdf")
result, err := m.Convert("https://example.com/page.html")

// Convert from a reader with metadata hints
f, err := os.Open("data.csv")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
result, err := m.ConvertReader(f, markitdown.StreamInfo{
	Extension: ".csv",
	MIMEType:  "text/csv",
	Charset:   "shift_jis",
})

// Options
m := markitdown.New(
	markitdown.WithKeepDataURIs(true), // preserve base64 data URIs in output
)

CLI quick start

Build:

go build -o markitdown ./cmd/markitdown

Convert a file to stdout:

./markitdown report.pdf

Convert and write to a file:

./markitdown -o output.md report.docx

Convert from stdin with format hint:

cat data.csv | ./markitdown -x csv

Convert a URL:

./markitdown https://example.com/page.html

CLI flags

Usage: markitdown [flags] [source]

Arguments:
  source    File path or URL to convert (reads stdin if omitted)

Flags:
  -o, --output string       Output file (default: stdout)
  -x, --extension string    File extension hint for stdin input (e.g. "pdf", ".csv")
  -m, --mime-type string    MIME type hint
  -c, --charset string      Charset hint (e.g. "shift_jis", "utf-8")
  -v, --version             Show version
      --keep-data-uris      Keep full base64-encoded data URIs in output

Notes

PDF extraction is text-based; image-only PDFs produce no output without OCR.
DOCX math equations (OMML) are converted to LaTeX notation.
CJK charset detection works without hints but is most reliable when Charset is provided in StreamInfo.

Acknowledgements

This project is a Go port of Microsoft's markitdown Python library. The original project provides the reference implementation, test fixtures, and design that this port is based on.

Documentation ¶

Index ¶

Constants
func IsUnsupportedFormat(err error) bool
type ConversionError
- func (e *ConversionError) Error() string
- func (e *ConversionError) Unwrap() error
type CsvConverter
- func NewCsvConverter() *CsvConverter
- func (c *CsvConverter) Accepts(info StreamInfo) bool
- func (c *CsvConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
type DocumentConverter
type DocumentConverterResult
type DocxConverter
- func NewDocxConverter(m *MarkItDown) *DocxConverter
- func (c *DocxConverter) Accepts(info StreamInfo) bool
- func (c *DocxConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
type EpubConverter
- func NewEpubConverter(m *MarkItDown) *EpubConverter
- func (c *EpubConverter) Accepts(info StreamInfo) bool
- func (c *EpubConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
type FailedConversionAttempt
type HTMLConverter
- func NewHTMLConverter(m *MarkItDown) *HTMLConverter
- func (c *HTMLConverter) Accepts(info StreamInfo) bool
- func (c *HTMLConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
- func (c *HTMLConverter) ConvertString(htmlStr string) (*DocumentConverterResult, error)
type IpynbConverter
- func NewIpynbConverter() *IpynbConverter
- func (c *IpynbConverter) Accepts(info StreamInfo) bool
- func (c *IpynbConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
type MarkItDown
- func New(opts ...Option) *MarkItDown
- func (m *MarkItDown) Convert(source string) (*DocumentConverterResult, error)
- func (m *MarkItDown) ConvertFile(path string) (*DocumentConverterResult, error)
- func (m *MarkItDown) ConvertReader(r io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
- func (m *MarkItDown) ConvertURL(url string) (*DocumentConverterResult, error)
- func (m *MarkItDown) RegisterConverter(name string, c DocumentConverter, priority float64)
type Option
- func WithKeepDataURIs(keep bool) Option
- func WithStyleMap(styleMap string) Option
type PdfConverter
- func NewPdfConverter() *PdfConverter
- func (c *PdfConverter) Accepts(info StreamInfo) bool
- func (c *PdfConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
type PlainTextConverter
- func NewPlainTextConverter() *PlainTextConverter
- func (c *PlainTextConverter) Accepts(info StreamInfo) bool
- func (c *PlainTextConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
type PptxConverter
- func NewPptxConverter(m *MarkItDown) *PptxConverter
- func (c *PptxConverter) Accepts(info StreamInfo) bool
- func (c *PptxConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
type RSSConverter
- func NewRSSConverter() *RSSConverter
- func (c *RSSConverter) Accepts(info StreamInfo) bool
- func (c *RSSConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
type StreamInfo
type UnsupportedFormatError
- func (e *UnsupportedFormatError) Error() string
type XlsConverter
- func NewXlsConverter() *XlsConverter
- func (c *XlsConverter) Accepts(info StreamInfo) bool
- func (c *XlsConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
type XlsxConverter
- func NewXlsxConverter() *XlsxConverter
- func (c *XlsxConverter) Accepts(info StreamInfo) bool
- func (c *XlsxConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
type ZipConverter
- func NewZipConverter(m *MarkItDown) *ZipConverter
- func (c *ZipConverter) Accepts(info StreamInfo) bool
- func (c *ZipConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

Constants ¶

const (
	// PrioritySpecific is for format-specific converters (PDF, DOCX, etc.).
	PrioritySpecific = 0.0
	// PriorityGeneric is for fallback converters (PlainText, HTML, ZIP).
	PriorityGeneric = 10.0
)

Functions ¶

func IsUnsupportedFormat ¶

func IsUnsupportedFormat(err error) bool

IsUnsupportedFormat reports whether the error is an UnsupportedFormatError.
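A caller can use this check to separate "no converter for this format" from other failures. The sketch below is self-contained: `UnsupportedFormatError` is a local stand-in with the fields listed on this page, and `isUnsupportedFormat` and `describe` are hypothetical helpers. Building the check on `errors.As` is an assumption about how such a function might work, not the package's confirmed implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// UnsupportedFormatError is a local stand-in mirroring the documented fields.
type UnsupportedFormatError struct {
	Extension string
	MIMEType  string
}

// Error returns an illustrative message; the real text may differ.
func (e *UnsupportedFormatError) Error() string {
	return fmt.Sprintf("unsupported format: extension=%q mime=%q", e.Extension, e.MIMEType)
}

// isUnsupportedFormat is a hypothetical equivalent of IsUnsupportedFormat.
// errors.As also matches errors that wrap an *UnsupportedFormatError.
func isUnsupportedFormat(err error) bool {
	var target *UnsupportedFormatError
	return errors.As(err, &target)
}

// describe shows how a caller might branch on the result.
func describe(err error) string {
	switch {
	case err == nil:
		return "ok"
	case isUnsupportedFormat(err):
		return "skip: no converter for this format"
	default:
		return "fail: " + err.Error()
	}
}

func main() {
	wrapped := fmt.Errorf("convert notes.xyz: %w", &UnsupportedFormatError{Extension: ".xyz"})
	fmt.Println(describe(wrapped))
	fmt.Println(describe(errors.New("disk full")))
}
```

In a batch job this kind of branch lets unsupported files be skipped while real failures still stop the run.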

Types ¶

type ConversionError ¶

type ConversionError struct {
	Attempts []FailedConversionAttempt
}

ConversionError is returned when a converter accepted the input but failed to convert it.

func (*ConversionError) Error ¶

func (e *ConversionError) Error() string

func (*ConversionError) Unwrap ¶

func (e *ConversionError) Unwrap() error

type CsvConverter ¶

type CsvConverter struct{}

CsvConverter handles CSV files.

func NewCsvConverter ¶

func NewCsvConverter() *CsvConverter

NewCsvConverter creates a new CsvConverter.

func (*CsvConverter) Accepts ¶

func (c *CsvConverter) Accepts(info StreamInfo) bool

func (*CsvConverter) Convert ¶

func (c *CsvConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type DocumentConverter ¶

type DocumentConverter interface {
	// Accepts returns true if this converter can handle the given input.
	// It MUST NOT change the read position of reader.
	Accepts(info StreamInfo) bool

	// Convert performs the actual document-to-markdown conversion.
	Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
}

DocumentConverter is the interface all format converters implement.
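To illustrate the interface, here is a minimal sketch of a custom converter. The types `StreamInfo`, `DocumentConverterResult`, and `DocumentConverter` are re-declared locally so the example compiles on its own; `LogConverter`, its `.log` extension, and the bullet-list output are all hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Local stand-ins mirroring the documented types.
type StreamInfo struct {
	MIMEType, Extension, Charset, Filename, LocalPath, URL string
}

type DocumentConverterResult struct {
	Markdown string
	Title    string
}

type DocumentConverter interface {
	Accepts(info StreamInfo) bool
	Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)
}

// LogConverter is a hypothetical converter for ".log" files.
type LogConverter struct{}

// Accepts decides from metadata only; it never touches the stream.
func (LogConverter) Accepts(info StreamInfo) bool {
	return strings.EqualFold(info.Extension, ".log")
}

// Convert reads the whole stream and emits one Markdown list item per line.
func (LogConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error) {
	data, err := io.ReadAll(reader)
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", info.Filename, err)
	}
	var b strings.Builder
	for _, line := range strings.Split(strings.TrimRight(string(data), "\n"), "\n") {
		b.WriteString("- " + line + "\n")
	}
	return &DocumentConverterResult{Markdown: b.String(), Title: info.Filename}, nil
}

// Compile-time check that LogConverter satisfies the interface.
var _ DocumentConverter = LogConverter{}

// convertLog runs the converter on an in-memory string.
func convertLog(name, content string) (string, error) {
	var c DocumentConverter = LogConverter{}
	info := StreamInfo{Extension: ".log", Filename: name}
	if !c.Accepts(info) {
		return "", fmt.Errorf("%s not accepted", name)
	}
	res, err := c.Convert(strings.NewReader(content), info)
	if err != nil {
		return "", err
	}
	return res.Markdown, nil
}

func main() {
	out, err := convertLog("app.log", "start\nstop\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

With the real package, a converter of this shape would presumably be handed to `RegisterConverter` together with a name and a priority.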

type DocumentConverterResult ¶

type DocumentConverterResult struct {
	Markdown string
	Title    string
}

DocumentConverterResult holds the output of a conversion.

type DocxConverter ¶

type DocxConverter struct {
	// contains filtered or unexported fields
}

DocxConverter handles DOCX files.

func NewDocxConverter ¶

func NewDocxConverter(m *MarkItDown) *DocxConverter

NewDocxConverter creates a new DocxConverter.

func (*DocxConverter) Accepts ¶

func (c *DocxConverter) Accepts(info StreamInfo) bool

func (*DocxConverter) Convert ¶

func (c *DocxConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type EpubConverter ¶

type EpubConverter struct {
	// contains filtered or unexported fields
}

EpubConverter handles EPUB files.

func NewEpubConverter ¶

func NewEpubConverter(m *MarkItDown) *EpubConverter

NewEpubConverter creates a new EpubConverter.

func (*EpubConverter) Accepts ¶

func (c *EpubConverter) Accepts(info StreamInfo) bool

func (*EpubConverter) Convert ¶

func (c *EpubConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type FailedConversionAttempt ¶

type FailedConversionAttempt struct {
	Converter string
	Err       error
}

FailedConversionAttempt records a converter that accepted but failed.
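Because `ConversionError` exposes its `Attempts`, a caller can report which converters were tried and why each failed. The sketch below uses local stand-ins for both types; the error message text and the `summarize` helper are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Local stand-ins mirroring the documented types.
type FailedConversionAttempt struct {
	Converter string
	Err       error
}

type ConversionError struct {
	Attempts []FailedConversionAttempt
}

// Error returns an illustrative message; the real text may differ.
func (e *ConversionError) Error() string {
	return fmt.Sprintf("conversion failed after %d attempt(s)", len(e.Attempts))
}

// summarize lists each failed attempt on its own line.
// It returns "" when err is not (and does not wrap) a *ConversionError.
func summarize(err error) string {
	var ce *ConversionError
	if !errors.As(err, &ce) {
		return ""
	}
	var b strings.Builder
	for _, a := range ce.Attempts {
		fmt.Fprintf(&b, "%s: %v\n", a.Converter, a.Err)
	}
	return b.String()
}

func main() {
	err := &ConversionError{Attempts: []FailedConversionAttempt{
		{Converter: "pdf", Err: errors.New("bad xref table")},
		{Converter: "plaintext", Err: errors.New("invalid UTF-8")},
	}}
	fmt.Print(summarize(err))
}
```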

type HTMLConverter ¶

type HTMLConverter struct {
	// contains filtered or unexported fields
}

HTMLConverter handles HTML files.

func NewHTMLConverter ¶

func NewHTMLConverter(m *MarkItDown) *HTMLConverter

NewHTMLConverter creates a new HTMLConverter.

func (*HTMLConverter) Accepts ¶

func (c *HTMLConverter) Accepts(info StreamInfo) bool

func (*HTMLConverter) Convert ¶

func (c *HTMLConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

func (*HTMLConverter) ConvertString ¶

func (c *HTMLConverter) ConvertString(htmlStr string) (*DocumentConverterResult, error)

ConvertString converts an HTML string to markdown.

type IpynbConverter ¶

type IpynbConverter struct{}

IpynbConverter handles Jupyter notebook files.

func NewIpynbConverter ¶

func NewIpynbConverter() *IpynbConverter

NewIpynbConverter creates a new IpynbConverter.

func (*IpynbConverter) Accepts ¶

func (c *IpynbConverter) Accepts(info StreamInfo) bool

func (*IpynbConverter) Convert ¶

func (c *IpynbConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type MarkItDown ¶

type MarkItDown struct {
	// contains filtered or unexported fields
}

MarkItDown is the main document-to-markdown conversion engine.

func New ¶

func New(opts ...Option) *MarkItDown

New creates a new MarkItDown instance with the given options.

func (*MarkItDown) Convert ¶

func (m *MarkItDown) Convert(source string) (*DocumentConverterResult, error)

Convert auto-detects the source type (file path or URL) and converts it.
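This page does not say how the source type is detected. One plausible approach, sketched below, treats only absolute `http` and `https` URLs as remote and everything else as a file path. The `looksLikeURL` helper and the choice of schemes are assumptions, not the package's documented behavior.

```go
package main

import (
	"fmt"
	"net/url"
)

// looksLikeURL is a hypothetical check: only absolute http(s) URLs with a
// host count as URLs; anything else is treated as a local file path.
func looksLikeURL(source string) bool {
	u, err := url.Parse(source)
	if err != nil {
		return false
	}
	// url.Parse lowercases the scheme, so "HTTPS://..." also matches.
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

func main() {
	for _, s := range []string{
		"https://example.com/page.html",
		"report.pdf",
		`C:\docs\report.pdf`,
	} {
		fmt.Printf("%-32s url=%v\n", s, looksLikeURL(s))
	}
}
```

Checking the scheme explicitly keeps Windows paths such as `C:\docs\report.pdf`, which parse with scheme `c`, on the file-path side.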

func (*MarkItDown) ConvertFile ¶

func (m *MarkItDown) ConvertFile(path string) (*DocumentConverterResult, error)

ConvertFile converts a local file to markdown.

func (*MarkItDown) ConvertReader ¶

func (m *MarkItDown) ConvertReader(r io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

ConvertReader converts a stream to markdown using the provided StreamInfo.
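`ConvertReader` takes an `io.ReadSeeker`, so a plain `io.Reader` such as a pipe or an HTTP response body has to be made seekable first. A minimal sketch of one way to do that is to buffer the input in memory; `toReadSeeker` and `readTwice` are hypothetical helpers, and buffering assumes the input fits in memory.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// toReadSeeker reads r fully into memory and returns a seekable view of it.
func toReadSeeker(r io.Reader) (io.ReadSeeker, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	return bytes.NewReader(data), nil
}

// readTwice shows that the buffered stream can be rewound and read again.
func readTwice(input string) (string, error) {
	// io.MultiReader hides Seek, standing in for a pipe or network stream.
	rs, err := toReadSeeker(io.MultiReader(strings.NewReader(input)))
	if err != nil {
		return "", err
	}
	first, err := io.ReadAll(rs)
	if err != nil {
		return "", err
	}
	if _, err := rs.Seek(0, io.SeekStart); err != nil {
		return "", err
	}
	second, err := io.ReadAll(rs)
	if err != nil {
		return "", err
	}
	return string(first) + "|" + string(second), nil
}

func main() {
	out, err := readTwice("a,b\n1,2\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", out)
}
```

The resulting reader could then be passed to `ConvertReader` along with a `StreamInfo` that carries the extension or MIME type hint.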

func (*MarkItDown) ConvertURL ¶

func (m *MarkItDown) ConvertURL(url string) (*DocumentConverterResult, error)

ConvertURL fetches a URL and converts the response to markdown.

func (*MarkItDown) RegisterConverter ¶

func (m *MarkItDown) RegisterConverter(name string, c DocumentConverter, priority float64)

RegisterConverter adds a custom converter with the given priority. Lower priority values are tried first.
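The ordering rule can be sketched with a small registry that keeps entries sorted by ascending priority. The `registration` and `registry` types are hypothetical, and keeping registration order among equal priorities is an assumption; only the two constants and the "lower values first" rule come from this page.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Constants copied from the package documentation.
const (
	PrioritySpecific = 0.0
	PriorityGeneric  = 10.0
)

// registration pairs a converter name with its priority (hypothetical type).
type registration struct {
	name     string
	priority float64
}

// registry is a hypothetical stand-in for the engine's converter list.
type registry struct {
	regs []registration
}

// register appends an entry and re-sorts by ascending priority.
// The stable sort keeps registration order among equal priorities.
func (r *registry) register(name string, priority float64) {
	r.regs = append(r.regs, registration{name: name, priority: priority})
	sort.SliceStable(r.regs, func(i, j int) bool {
		return r.regs[i].priority < r.regs[j].priority
	})
}

// order returns the converter names in the order they would be tried.
func (r *registry) order() string {
	names := make([]string, 0, len(r.regs))
	for _, reg := range r.regs {
		names = append(names, reg.name)
	}
	return strings.Join(names, " > ")
}

func main() {
	r := &registry{}
	r.register("plaintext", PriorityGeneric)
	r.register("pdf", PrioritySpecific)
	r.register("custom", -1) // below PrioritySpecific, so it is tried first
	fmt.Println(r.order())
}
```

Under this scheme, a custom converter that should pre-empt a built-in one would be registered with a value below `PrioritySpecific`.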

type Option ¶

type Option func(*MarkItDown)

Option configures a MarkItDown instance.
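`Option` follows Go's functional options pattern: each option is a function that mutates the instance under construction. The sketch below shows the pattern with a hypothetical `engine` type and an invented `keepDataURIs` field, since the real struct's fields are unexported.

```go
package main

import "fmt"

// engine is a hypothetical stand-in for MarkItDown; its field is invented.
type engine struct {
	keepDataURIs bool
}

// option mirrors the shape of the documented Option type.
type option func(*engine)

// withKeepDataURIs returns an option that sets the flag on the engine.
func withKeepDataURIs(keep bool) option {
	return func(e *engine) { e.keepDataURIs = keep }
}

// newEngine starts from defaults and applies each option in order.
func newEngine(opts ...option) *engine {
	e := &engine{} // zero value: keepDataURIs is false
	for _, opt := range opts {
		opt(e)
	}
	return e
}

func main() {
	fmt.Println(newEngine().keepDataURIs)
	fmt.Println(newEngine(withKeepDataURIs(true)).keepDataURIs)
}
```

The pattern lets new settings be added later without changing the signature of the constructor.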

func WithKeepDataURIs ¶

func WithKeepDataURIs(keep bool) Option

WithKeepDataURIs configures whether to keep full data URIs in output (default: false, which truncates them to data:mime/type;base64...).
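The default truncation can be pictured as a search-and-replace over the generated Markdown. The sketch below is one way to get the `data:mime/type;base64...` form described above; the regular expression and the `truncateDataURIs` helper are assumptions, and the package may implement this differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// dataURI matches a base64 data URI and captures its MIME type.
var dataURI = regexp.MustCompile(`data:([-A-Za-z0-9.+/]+);base64,[A-Za-z0-9+/=]+`)

// truncateDataURIs replaces each payload with "..." and keeps the MIME type.
func truncateDataURIs(markdown string) string {
	return dataURI.ReplaceAllString(markdown, "data:${1};base64...")
}

func main() {
	in := "![logo](data:image/png;base64,iVBORw0KGgo=)"
	fmt.Println(truncateDataURIs(in))
}
```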

func WithStyleMap ¶

func WithStyleMap(styleMap string) Option

WithStyleMap sets custom style mapping for DOCX conversion.

type PdfConverter ¶

type PdfConverter struct{}

PdfConverter handles PDF files using the PDFium library via WebAssembly.

func NewPdfConverter ¶

func NewPdfConverter() *PdfConverter

NewPdfConverter creates a new PdfConverter.

func (*PdfConverter) Accepts ¶

func (c *PdfConverter) Accepts(info StreamInfo) bool

func (*PdfConverter) Convert ¶

func (c *PdfConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type PlainTextConverter ¶

type PlainTextConverter struct{}

PlainTextConverter handles plain text, markdown, JSON, and JSONL files.

func NewPlainTextConverter ¶

func NewPlainTextConverter() *PlainTextConverter

NewPlainTextConverter creates a new PlainTextConverter.

func (*PlainTextConverter) Accepts ¶

func (c *PlainTextConverter) Accepts(info StreamInfo) bool

func (*PlainTextConverter) Convert ¶

func (c *PlainTextConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type PptxConverter ¶

type PptxConverter struct {
	// contains filtered or unexported fields
}

PptxConverter handles PPTX files.

func NewPptxConverter ¶

func NewPptxConverter(m *MarkItDown) *PptxConverter

NewPptxConverter creates a new PptxConverter.

func (*PptxConverter) Accepts ¶

func (c *PptxConverter) Accepts(info StreamInfo) bool

func (*PptxConverter) Convert ¶

func (c *PptxConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type RSSConverter ¶

type RSSConverter struct{}

RSSConverter handles RSS and Atom feed files.

func NewRSSConverter ¶

func NewRSSConverter() *RSSConverter

NewRSSConverter creates a new RSSConverter.

func (*RSSConverter) Accepts ¶

func (c *RSSConverter) Accepts(info StreamInfo) bool

func (*RSSConverter) Convert ¶

func (c *RSSConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type StreamInfo ¶

type StreamInfo struct {
	MIMEType  string
	Extension string
	Charset   string
	Filename  string
	LocalPath string
	URL       string
}

StreamInfo holds metadata about the input being converted.
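When calling `ConvertReader` on a file opened by hand, the hints can be derived from the path. The sketch below uses a local copy of `StreamInfo` and a hypothetical `infoFromPath` helper; the lowercase, dot-prefixed extension follows the README example (`".csv"`), but whether the package requires that form is an assumption.

```go
package main

import (
	"fmt"
	"mime"
	"path/filepath"
	"strings"
)

// StreamInfo is a local stand-in mirroring the documented fields.
type StreamInfo struct {
	MIMEType  string
	Extension string
	Charset   string
	Filename  string
	LocalPath string
	URL       string
}

// infoFromPath is a hypothetical helper that fills StreamInfo from a path.
func infoFromPath(path string) StreamInfo {
	ext := strings.ToLower(filepath.Ext(path))
	return StreamInfo{
		MIMEType:  mime.TypeByExtension(ext), // empty if the extension is unknown
		Extension: ext,
		Filename:  filepath.Base(path),
		LocalPath: path,
	}
}

func main() {
	info := infoFromPath("docs/Report.PDF")
	fmt.Println(info.Extension, info.Filename, info.MIMEType)
}
```

`Charset` is left empty here; per the notes above, supplying it is most useful for CJK text.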

type UnsupportedFormatError ¶

type UnsupportedFormatError struct {
	Extension string
	MIMEType  string
}

UnsupportedFormatError is returned when no converter can handle the input format.

func (*UnsupportedFormatError) Error ¶

func (e *UnsupportedFormatError) Error() string

type XlsConverter ¶

type XlsConverter struct{}

XlsConverter handles legacy XLS files.

func NewXlsConverter ¶

func NewXlsConverter() *XlsConverter

NewXlsConverter creates a new XlsConverter.

func (*XlsConverter) Accepts ¶

func (c *XlsConverter) Accepts(info StreamInfo) bool

func (*XlsConverter) Convert ¶

func (c *XlsConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type XlsxConverter ¶

type XlsxConverter struct{}

XlsxConverter handles XLSX files.

func NewXlsxConverter ¶

func NewXlsxConverter() *XlsxConverter

NewXlsxConverter creates a new XlsxConverter.

func (*XlsxConverter) Accepts ¶

func (c *XlsxConverter) Accepts(info StreamInfo) bool

func (*XlsxConverter) Convert ¶

func (c *XlsxConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

type ZipConverter ¶

type ZipConverter struct {
	// contains filtered or unexported fields
}

ZipConverter handles ZIP files by recursively converting their contents.
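The recursive idea can be sketched with the standard `archive/zip` package: walk the entries, hand each file back to a conversion callback, and recurse into nested archives. Everything below is hypothetical, including `convertFunc`, the `## File:` heading format, and the policy of skipping entries that fail; the real converter's output may look different.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"path"
	"strings"
)

// convertFunc stands in for a call back into the conversion engine.
type convertFunc func(r io.ReadSeeker, ext string) (string, error)

// convertZip converts each regular file in the archive and joins the results.
// Nested .zip entries are handled by calling convertZip again.
func convertZip(data []byte, convert convertFunc) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "", err
	}
	var out strings.Builder
	for _, f := range zr.File {
		if f.FileInfo().IsDir() {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		body, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return "", err
		}
		ext := strings.ToLower(path.Ext(f.Name))
		var md string
		if ext == ".zip" {
			md, err = convertZip(body, convert)
		} else {
			md, err = convert(bytes.NewReader(body), ext)
		}
		if err != nil {
			continue // assumed policy: skip entries that cannot be converted
		}
		fmt.Fprintf(&out, "## File: %s\n\n%s\n\n", f.Name, md)
	}
	return out.String(), nil
}

// demoZip builds a one-file archive in memory and converts it.
func demoZip() (string, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("notes.txt")
	if err != nil {
		return "", err
	}
	if _, err := w.Write([]byte("hello")); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	// The callback simply echoes the file content as Markdown.
	return convertZip(buf.Bytes(), func(r io.ReadSeeker, ext string) (string, error) {
		b, err := io.ReadAll(r)
		return string(b), err
	})
}

func main() {
	out, err := demoZip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```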

func NewZipConverter ¶

func NewZipConverter(m *MarkItDown) *ZipConverter

NewZipConverter creates a new ZipConverter.

func (*ZipConverter) Accepts ¶

func (c *ZipConverter) Accepts(info StreamInfo) bool

func (*ZipConverter) Convert ¶

func (c *ZipConverter) Convert(reader io.ReadSeeker, info StreamInfo) (*DocumentConverterResult, error)

Directories ¶

Path	Synopsis
cmd/markitdown	command
internal/docxmath
internal/ooxml
