docmill

Package module, version v0.1.3 (latest). Published: May 11, 2026. License: MIT. Imports: 17. Imported by: 0.

Repository

github.com/ivanvanderbyl/docmill

README ¶

docmill

Fast PDF to Markdown conversion using pdfium text extraction with intelligent layout and style analysis.

Features

Fast extraction: Uses native pdfium for text extraction (orders of magnitude faster than LLM processing)
Rich metadata: Extracts font size, weight, style, colour, and positioning information
Intelligent structure detection:
- Headings (H1-H6) based on font size and weight
- Paragraphs with proper line breaking and spacing
- Bullet and numbered lists with nested items
- Code blocks (monospace font detection)
- Bold and italic inline formatting
- Text alignment (left, centre, right)
- Table detection with markdown table output
- Multi-column layout handling with rotated text support
Page-aware: Handles multi-page documents with page separators
Flexible API: Convert from file path, bytes, or io.ReadSeeker
Configurable: Customisable heading detection, table extraction, and formatting options
Performance metrics: Optional timing and statistics logging

Architecture

The converter works in three stages:

Extraction: Extract all characters with rich metadata (font, size, position, colour)
Structure Analysis: Group characters → words → lines → paragraphs, detect document structure
Markdown Conversion: Convert structured document to clean markdown
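
The first grouping step of stage two (characters into words) can be sketched as follows. This is a minimal illustration and not docmill's implementation: the `char` type, the `groupWords` function, and the fixed gap threshold are all assumptions made for the example.

```go
package main

import "fmt"

// char is a hypothetical, simplified stand-in for an extracted character:
// just the rune and its horizontal extent on the line.
type char struct {
	r      rune
	x0, x1 float64
}

// groupWords splits a left-to-right run of characters into words. A new word
// starts at a space rune or when the horizontal gap to the previous character
// exceeds maxGap (an assumed, fixed threshold).
func groupWords(chars []char, maxGap float64) []string {
	var words []string
	var cur []rune
	flush := func() {
		if len(cur) > 0 {
			words = append(words, string(cur))
			cur = nil
		}
	}
	prevX1 := 0.0
	for _, c := range chars {
		if c.r == ' ' {
			flush()
			prevX1 = c.x1
			continue
		}
		if len(cur) > 0 && c.x0-prevX1 > maxGap {
			flush()
		}
		cur = append(cur, c.r)
		prevX1 = c.x1
	}
	flush()
	return words
}

func main() {
	chars := []char{
		{'P', 0, 5}, {'D', 5, 10}, {'F', 10, 15},
		{'t', 22, 25}, {'o', 25, 30}, // 7-unit gap after "PDF" starts a new word
		{' ', 30, 33},
		{'M', 33, 40}, {'D', 40, 46},
	}
	fmt.Println(groupWords(chars, 3))
}
```

The same idea applies one level up: words are merged into lines by comparing baselines, and lines into paragraphs by comparing vertical gaps.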

Installation

go get github.com/ivanvanderbyl/docmill

Usage

Basic Conversion

package main

import (
    "fmt"
    "log"

    "github.com/ivanvanderbyl/docmill"
)

func main() {
    // New returns a Converter that manages its own pdfium pool.
    converter, err := docmill.New()
    if err != nil {
        log.Fatal(err)
    }
    defer converter.Close()

    markdown, err := converter.ConvertFile("document.pdf")
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(markdown)
}

Custom Configuration

config := docmill.DefaultConfig()
config.IncludePageBreaks = true
config.DetectTables = true
config.UseSegmentBasedTables = true  // Better for PDFs without ruling lines
config.UseAdaptiveThresholds = true
config.MinHeadingFontSize = 1.2      // Adjust heading detection sensitivity
config.EnableMetricsLogging = true   // Enable performance metrics

converter, err := docmill.NewWithConfig(config)
if err != nil {
    log.Fatal(err)
}
defer converter.Close()

markdown, err := converter.ConvertFile("document.pdf")

Convert from Bytes

pdfBytes, err := os.ReadFile("document.pdf")
if err != nil {
    log.Fatal(err)
}

markdown, err := converter.ConvertBytes(pdfBytes)

Convert from io.ReadSeeker

file, err := os.Open("document.pdf")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

markdown, err := converter.ConvertReader(file)

Convert Specific Pages

// Convert pages 0-4 (first 5 pages, 0-indexed)
markdown, err := converter.ConvertPageRange("document.pdf", 0, 4)

Get Document Info

info, err := converter.GetDocumentInfo("document.pdf")
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Document has %d pages\n", info.PageCount)

Command Line Tool

A CLI tool is provided for quick conversions.

Installation

Download pre-built binary

Pre-built binaries are available for Linux, macOS, and Windows (amd64 and arm64) from GitHub Releases.

macOS / Linux:

# Download the latest release (adjust OS and ARCH as needed)
# OS: linux, darwin  ARCH: amd64, arm64
curl -sL "https://github.com/ivanvanderbyl/docmill/releases/latest/download/docmill_$(uname -s | tr '[:upper:]' '[:lower:]')_$(uname -m | sed 's/x86_64/amd64/' | sed 's/aarch64/arm64/').tar.gz" -o /tmp/docmill.tar.gz && tar xzf /tmp/docmill.tar.gz -C /usr/local/bin docmill && rm /tmp/docmill.tar.gz

Windows:

Download the appropriate .zip from the releases page and add docmill.exe to your PATH.

Install with `go install`

go install github.com/ivanvanderbyl/docmill/cmd/docmill@latest

Build from source

git clone https://github.com/ivanvanderbyl/docmill.git
cd docmill
go install ./cmd/docmill

Usage

# Convert to file
docmill -i input.pdf -o output.md

# Convert specific pages (0-indexed)
docmill -i input.pdf -o output.md --start-page 0 --end-page 4

# Output to stdout
docmill -i input.pdf

# Enable metrics logging
docmill -i input.pdf -o output.md --metrics

Options

-i, --input - Input PDF file path (required)
-o, --output - Output markdown file path (default: stdout)
--start-page - Start page number, 0-indexed (default: all pages)
--end-page - End page number, 0-indexed (default: all pages)
-m, --metrics - Enable processing time and statistics logging
--page-breaks - Add --- separators between pages (default: true)
--min-heading-font-size - Minimum font size multiplier to detect headings, 0 disables (default: 1.15)
--detect-tables - Enable table detection and extraction (default: true)
--segment-tables - Use PDF-TREX segment-based table detection, better for tables without ruling lines (default: false)
--adaptive-thresholds - Enable document-specific threshold calculation based on spacing distribution (default: true)
--max-concurrency - Maximum pages processed concurrently during structure detection (default: 10)
--chunk - Output as JSON chunks instead of markdown
--chunk-max-tokens - Maximum tokens per chunk (default: 512)
--chunk-overlap - Number of overlap tokens between chunks (default: 0)
--chunk-repeat-headings - Repeat heading hierarchy at the start of each chunk

Configuration Options

Config Struct

type Config struct {
    // IncludePageBreaks adds "---" separators between pages (default: true)
    IncludePageBreaks bool

    // MinHeadingFontSize is the minimum font size multiplier to detect headings
    // A value of 0 disables size-based heading detection (default: 1.15x body text)
    MinHeadingFontSize float64

    // DetectTables enables table detection and extraction (default: true)
    DetectTables bool

    // TableSettings configures table detection behavior
    TableSettings TableSettings

    // UseSegmentBasedTables enables PDF-TREX segment-based table detection
    // This works better for tables without ruling lines (default: false)
    UseSegmentBasedTables bool

    // UseAdaptiveThresholds enables document-specific threshold calculation
    // Based on spacing distribution analysis (default: true)
    UseAdaptiveThresholds bool

    // EnableMetricsLogging enables processing time and statistics logging (default: false)
    EnableMetricsLogging bool
}

Table Settings

Table detection can be configured using TableSettings:

config := docmill.DefaultConfig()
config.TableSettings = docmill.TableSettings{
    VerticalStrategy:   "lines",  // "text", "lines", "lines_strict", "explicit"
    HorizontalStrategy: "lines",
    SnapTolerance:      3.0,      // Tolerance for snapping close edges
    EdgeMinLength:      3.0,      // Minimum edge length to consider
    MinWordsVertical:   3,        // Minimum words for text-based detection
    MinWordsHorizontal: 1,
}

Markdown Output Features

Headings

Headings are detected based on:

Font size relative to body text (configurable threshold)
Bold font weight
Single-line paragraphs

# Large Heading (H1)
## Medium Heading (H2)
### Smaller Heading (H3)
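
A minimal sketch of how the three signals above might be combined. How docmill actually weighs them is not documented here, so the `para` type, the `looksLikeHeading` function, and the rule "single line and (large or bold)" are assumptions for illustration only.

```go
package main

import "fmt"

// para is a hypothetical, simplified paragraph holding only the fields the
// heuristic needs.
type para struct {
	fontSize  float64
	bold      bool
	lineCount int
}

// looksLikeHeading assumes a single-line paragraph is a heading when it is
// either large enough relative to body text or bold. A minRatio of 0
// disables the size check, mirroring the documented MinHeadingFontSize
// behaviour.
func looksLikeHeading(p para, bodySize, minRatio float64) bool {
	if p.lineCount != 1 {
		return false
	}
	large := minRatio > 0 && bodySize > 0 && p.fontSize >= bodySize*minRatio
	return large || p.bold
}

func main() {
	body := 10.0
	fmt.Println(looksLikeHeading(para{14, false, 1}, body, 1.15)) // true: well above 1.15x body size
	fmt.Println(looksLikeHeading(para{10, false, 1}, body, 1.15)) // false: body-sized and not bold
	fmt.Println(looksLikeHeading(para{14, false, 3}, body, 1.15)) // false: spans several lines
}
```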

Lists

Bullet and numbered lists with proper nesting:

* First item
* Second item
  * Nested item
  * Another nested item

1. Numbered item
2. Another item
   1. Nested numbered item

Tables

Tables are detected and converted to markdown tables:

| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Cell 1   | Cell 2   | Cell 3   |
| Cell 4   | Cell 5   | Cell 6   |

Inline Formatting

Bold, italic, and code are preserved:

This is **bold** text and *italic* text with `code`.

Code Blocks

Monospace paragraphs are converted to code blocks:

```
func main() {
    fmt.Println("Hello")
}
```

Page Breaks

Multi-page documents include page separators (when IncludePageBreaks is enabled):

Content from page 1

---

Content from page 2

Multi-Column Layouts

The converter detects multi-column layouts and rotated text and maintains reading order where possible. Complex multi-column layouts may not always preserve perfect reading order (see Current Limitations).

Performance Metrics

When EnableMetricsLogging is enabled, the converter logs detailed timing and statistics:

Processing PDF with 10 pages...
Document opened in 45ms
Page 1 extracted in 23ms
Page 2 extracted in 18ms
...
Total conversion time: 234ms
Statistics:
  - Total paragraphs: 145
  - Total tables: 8
  - Total headings: 23
  - Total words: 3,456
  - Total characters: 18,234

Typical conversion speeds (varies by PDF complexity):

Simple text PDF: ~10-50ms per page
Complex formatted PDF with tables: ~50-200ms per page

For comparison, LLM-based extraction:

LLM API call: ~1-5 seconds per page
Cost: per-call API fees, versus no API cost for docmill

Use Cases

docmill is a good fit when:

PDFs contain extractable text (not scanned images)
You need fast conversion without LLM API costs
Document structure is relatively standard
Formatting (bold, italic, headings, tables) must be preserved
You are batch processing large numbers of documents
You are building document search or indexing systems
You are extracting structured data from reports

Fall back to LLM processing when:

PDFs are scanned images requiring OCR
Complex semantic analysis is required
Extracting the information requires understanding the content
Documents have highly irregular layouts

Integration with LLM Pipeline

This package is designed to be a fast first pass before LLM processing:

// Try fast extraction first
markdown, err := converter.ConvertFile(pdfPath)
if err != nil || len(markdown) < 100 {
    // Fall back to LLM-based extraction
    return llmExtractor.Extract(pdfPath)
}

// Use extracted markdown as LLM context for further analysis
response, err := llmClient.Analyze(ctx, llm.AnalyzeRequest{
    Context: markdown,
    Task:    "Extract key financial metrics from this report",
})

Capabilities

Supported Features

✅ Text extraction with font metadata
✅ Heading detection (H1-H6)
✅ Paragraph detection with proper spacing
✅ List detection (bullet and numbered)
✅ Table detection and markdown table output
✅ Bold and italic inline formatting
✅ Code block detection (monospace fonts)
✅ Multi-column layout handling
✅ Rotated text support
✅ Page break markers
✅ Configurable thresholds and settings
✅ Performance metrics and logging

Current Limitations

❌ No OCR support (requires extractable text in PDF)
❌ Hyperlinks are not extracted
❌ Images are not extracted (text only)
⚠️ Complex multi-column layouts may not always preserve perfect reading order
⚠️ Tables without clear structure may require segment-based detection

Experimental Features

PDF-TREX segment-based table detection (enable with UseSegmentBasedTables: true)
Adaptive threshold calculation based on document analysis

Contributing

Contributions are welcome! Areas for improvement:

Hyperlink extraction
Image placeholder insertion
Enhanced multi-column layout detection
Custom markdown formatting options
Additional table detection strategies

License

MIT License - see LICENSE file for details

Documentation ¶

Index ¶

type AdaptiveThresholds
type Alignment
type Block
type CellBBox
type Chunk
type ChunkConfig
- func DefaultChunkConfig() ChunkConfig
type Column
type Config
- func DefaultConfig() Config
type Converter
- func New() (*Converter, error)
- func NewConverter(instance pdfium.Pdfium) *Converter
- func NewConverterWithConfig(instance pdfium.Pdfium, config Config) *Converter
- func NewWithConfig(config Config) (*Converter, error)
- func (c *Converter) Close()
- func (c *Converter) ConvertBytes(pdfBytes []byte) (string, error)
- func (c *Converter) ConvertBytesChunks(pdfBytes []byte, cc ChunkConfig) ([]Chunk, error)
- func (c *Converter) ConvertFile(filePath string) (string, error)
- func (c *Converter) ConvertFileChunks(filePath string, cc ChunkConfig) ([]Chunk, error)
- func (c *Converter) ConvertFileWithMetrics(filePath string) (string, ProcessingMetrics, error)
- func (c *Converter) ConvertPageRange(filePath string, startPage, endPage int) (string, error)
- func (c *Converter) ConvertReader(reader io.ReadSeeker) (string, error)
- func (c *Converter) GetDocumentInfo(filePath string) (*DocumentInfo, error)
type Document
- func (d *Document) ToChunks(config Config, cc ChunkConfig) []Chunk
- func (d *Document) ToMarkdown(config Config) string
type DocumentInfo
type DocumentStatistics
type DocumentStats
type Edge
type EnrichedChar
type EnrichedWord
- func (w EnrichedWord) IsBulletOrNumber() bool
type HeadingContext
type Line
type LineType
type Page
- func ExtractPage(instance pdfium.Pdfium, page references.FPDF_PAGE, pageNumber int, ...) (*Page, error)
- func (p *Page) ToMarkdown() string
type PageExtractor
type PageMetrics
type PageQuality
type Paragraph
- func (p Paragraph) CenterX() float64
- func (p Paragraph) Text() string
type Point
type ProcessingMetrics
type RGBA
type Rect
- func (r Rect) CenterX() float64
- func (r Rect) CenterY() float64
- func (r Rect) Height() float64
- func (r Rect) Width() float64
type Segment
type SegmentTableCell
type SegmentTableRow
type Table
- func DetectTables(page *Page, settings TableSettings) []Table
- func DetectTablesSegmentBased(page *Page, thresholds AdaptiveThresholds) []Table
type TableArea
type TableCell
type TableColumn
type TableRow
type TableSettings
- func DefaultTableSettings() TableSettings
type TaggedLine
type TextBlock

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type AdaptiveThresholds ¶

type AdaptiveThresholds struct {
	HorizontalThreshold float64 // hT: for horizontal clustering
	VerticalThreshold   float64 // vT: for vertical clustering
}

AdaptiveThresholds contains document-specific threshold values
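
The documentation does not say how these thresholds are derived from the spacing distribution. As a rough sketch of the general idea only, one plausible approach is to scale the median of the observed gaps. Both the use of the median and the scaling factor are assumptions, and `median` and `thresholdFromGaps` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of values, or 0 for an empty slice. The input
// slice is not modified.
func median(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	s := append([]float64(nil), values...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// thresholdFromGaps derives a clustering threshold from observed gaps by
// scaling their median. Gaps larger than the result would be treated as
// separators between clusters.
func thresholdFromGaps(gaps []float64, factor float64) float64 {
	return median(gaps) * factor
}

func main() {
	// Hypothetical horizontal gaps between neighbouring words on a page:
	// mostly ordinary word spacing plus two wide column gaps.
	gaps := []float64{2.1, 2.0, 2.3, 1.9, 14.0, 2.2, 15.5}
	fmt.Println(thresholdFromGaps(gaps, 1.5))
}
```

The median is used here because it is insensitive to a few very wide gaps, which a mean would not be.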

type Alignment ¶

type Alignment int

Alignment represents text alignment.

const (
	AlignmentLeft Alignment = iota
	AlignmentCenter
	AlignmentRight
	AlignmentJustified
)

type Block ¶

type Block struct {
	Segments    []Segment
	Box         Rect
	LineIndices []int // Which lines this block spans
}

Block represents vertically aligned segments across multiple lines

type CellBBox ¶

type CellBBox struct {
	X0     float64
	Top    float64
	X1     float64
	Bottom float64
}

CellBBox represents a table cell as a bounding box.

type Chunk ¶

type Chunk struct {
	Index      int    `json:"index"`
	Text       string `json:"text"`
	TokenCount int    `json:"token_count"`

	StartPage int `json:"start_page"`
	EndPage   int `json:"end_page"`

	HeadingPath []HeadingContext `json:"heading_path,omitempty"`
}

type ChunkConfig ¶

type ChunkConfig struct {
	MaxTokens      int
	OverlapTokens  int
	RepeatHeadings bool
	EstimateTokens func(s string) int
}

func DefaultChunkConfig ¶

func DefaultChunkConfig() ChunkConfig
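
To illustrate how MaxTokens and OverlapTokens interact, here is a minimal sliding-window sketch. It is not docmill's chunker: `chunkWords` is a hypothetical function, and counting whitespace-separated words as tokens is an assumption (ChunkConfig accepts an EstimateTokens function for that purpose).

```go
package main

import (
	"fmt"
	"strings"
)

// chunkWords splits text into chunks of at most maxTokens whitespace-separated
// words, repeating the last overlap words of each chunk at the start of the
// next one.
func chunkWords(text string, maxTokens, overlap int) []string {
	words := strings.Fields(text)
	if maxTokens <= 0 || len(words) == 0 {
		return nil
	}
	if overlap < 0 || overlap >= maxTokens {
		overlap = 0 // an overlap this large would never advance the window
	}
	var chunks []string
	step := maxTokens - overlap
	for start := 0; start < len(words); start += step {
		end := start + maxTokens
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
		if end == len(words) {
			break
		}
	}
	return chunks
}

func main() {
	for i, c := range chunkWords("one two three four five six seven", 3, 1) {
		fmt.Printf("%d: %s\n", i, c)
	}
}
```

With a window of 3 and an overlap of 1, each chunk starts with the last word of the previous chunk, so context that straddles a boundary appears in both chunks.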

type Column ¶

type Column struct {
	Box        Rect
	Words      []EnrichedWord
	Paragraphs []Paragraph
	Index      int // Column number (0-indexed from left to right)
}

Column represents a vertical column of text in a multi-column layout.

type Config ¶

type Config struct {
	// IncludePageBreaks adds "---" separators between pages (default: true)
	IncludePageBreaks bool

	// MinHeadingFontSize is the minimum font size multiplier to detect headings
	// A value of 0 disables size-based heading detection (default: 1.15x body text)
	MinHeadingFontSize float64

	// DetectTables enables table detection and extraction (default: true)
	DetectTables bool

	// TableSettings configures table detection behavior (default: DefaultTableSettings())
	TableSettings TableSettings

	// UseSegmentBasedTables enables PDF-TREX segment-based table detection
	// This works better for tables without ruling lines (default: false)
	UseSegmentBasedTables bool

	// UseAdaptiveThresholds enables document-specific threshold calculation
	// Based on spacing distribution analysis (default: true)
	UseAdaptiveThresholds bool

	// EnableMetricsLogging enables processing time and statistics logging (default: false)
	EnableMetricsLogging bool

	// MaxConcurrency controls how many pages are processed concurrently during
	// the structure detection phase. PDFium extraction is always sequential,
	// but paragraph/table/heading detection runs in parallel. (default: 10)
	MaxConcurrency int
}

Config controls markdown conversion behavior.

func DefaultConfig ¶

func DefaultConfig() Config

DefaultConfig returns the default converter configuration.

type Converter ¶

type Converter struct {
	// contains filtered or unexported fields
}

Converter converts PDFs to markdown using pdfium text extraction.

func New ¶

func New() (*Converter, error)

New creates a new PDF to markdown converter with default configuration. The returned Converter manages its own pdfium pool and must be closed with Close when no longer needed.

func NewConverter ¶

func NewConverter(instance pdfium.Pdfium) *Converter

NewConverter creates a new PDF to markdown converter with default configuration. The caller is responsible for managing the pdfium pool lifecycle.

func NewConverterWithConfig ¶

func NewConverterWithConfig(instance pdfium.Pdfium, config Config) *Converter

NewConverterWithConfig creates a new PDF to markdown converter with custom configuration. The caller is responsible for managing the pdfium pool lifecycle.

func NewWithConfig ¶

func NewWithConfig(config Config) (*Converter, error)

NewWithConfig creates a new PDF to markdown converter with custom configuration. The returned Converter manages its own pdfium pool and must be closed with Close when no longer needed.

func (*Converter) Close ¶

func (c *Converter) Close()

Close releases resources held by the Converter. Only required for converters created with New or NewWithConfig.

func (*Converter) ConvertBytes ¶

func (c *Converter) ConvertBytes(pdfBytes []byte) (string, error)

ConvertBytes converts PDF bytes to markdown.

func (*Converter) ConvertBytesChunks ¶

func (c *Converter) ConvertBytesChunks(pdfBytes []byte, cc ChunkConfig) ([]Chunk, error)

func (*Converter) ConvertFile ¶

func (c *Converter) ConvertFile(filePath string) (string, error)

ConvertFile converts a PDF file to markdown.

func (*Converter) ConvertFileChunks ¶

func (c *Converter) ConvertFileChunks(filePath string, cc ChunkConfig) ([]Chunk, error)

func (*Converter) ConvertFileWithMetrics ¶

func (c *Converter) ConvertFileWithMetrics(filePath string) (string, ProcessingMetrics, error)

ConvertFileWithMetrics converts a PDF and returns both markdown and metrics

func (*Converter) ConvertPageRange ¶

func (c *Converter) ConvertPageRange(filePath string, startPage, endPage int) (string, error)

ConvertPageRange converts a specific range of pages to markdown.

func (*Converter) ConvertReader ¶

func (c *Converter) ConvertReader(reader io.ReadSeeker) (string, error)

ConvertReader converts a PDF from an io.ReadSeeker to markdown.

func (*Converter) GetDocumentInfo ¶

func (c *Converter) GetDocumentInfo(filePath string) (*DocumentInfo, error)

GetDocumentInfo returns basic information about a PDF without converting it.

type Document ¶

type Document struct {
	Pages []Page
	Stats DocumentStats
}

Document represents the complete extracted document structure.

func (*Document) ToChunks ¶

func (d *Document) ToChunks(config Config, cc ChunkConfig) []Chunk

func (*Document) ToMarkdown ¶

func (d *Document) ToMarkdown(config Config) string

ToMarkdown converts a document to markdown format.

type DocumentInfo ¶

type DocumentInfo struct {
	PageCount int
}

DocumentInfo contains basic information about a PDF document.

type DocumentStatistics ¶

type DocumentStatistics struct {
	TotalPages      int
	TotalParagraphs int
	TotalTables     int
	TotalHeadings   int
	TotalWords      int
	TotalCharacters int
}

DocumentStatistics contains document-level statistics

type DocumentStats ¶

type DocumentStats struct {
	MostUsedFontSize float64         // Most common font size (body text)
	MostUsedFontName string          // Most common font name
	MostUsedLineGap  float64         // Most common line spacing
	FontSizeFreq     map[float64]int // Frequency map of font sizes
	FontNameFreq     map[string]int  // Frequency map of font names
	MaxFontSize      float64         // Largest font size in document
}

DocumentStats holds document-wide font and spacing statistics. These are calculated across all pages as hints for structure detection.
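
A frequency map such as FontSizeFreq reduces to a "most used" value by taking its mode. The sketch below shows that reduction. `mostUsedFontSize` is a hypothetical helper, and the tie-breaking rule is a choice made for this example rather than documented behaviour.

```go
package main

import "fmt"

// mostUsedFontSize returns the size with the highest count in freq. Ties are
// broken in favour of the smaller size so the result does not depend on map
// iteration order. An empty map yields 0.
func mostUsedFontSize(freq map[float64]int) float64 {
	best, bestCount := 0.0, 0
	for size, n := range freq {
		if n > bestCount || (n == bestCount && size < best) {
			best, bestCount = size, n
		}
	}
	return best
}

func main() {
	// Count how often each font size occurs, as a frequency map would.
	freq := map[float64]int{}
	for _, size := range []float64{10, 10, 10, 14, 10, 18, 14} {
		freq[size]++
	}
	fmt.Println(mostUsedFontSize(freq)) // 10
}
```

The resulting body-text size is the baseline against which a heading multiplier such as MinHeadingFontSize is applied.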

type Edge ¶

type Edge struct {
	X0          float64 // Left x coordinate
	X1          float64 // Right x coordinate
	Top         float64 // Top y coordinate
	Bottom      float64 // Bottom y coordinate
	Width       float64 // Width (for horizontal edges)
	Height      float64 // Height (for vertical edges)
	Orientation string  // "h" for horizontal, "v" for vertical
}

Edge represents a horizontal or vertical line segment used for table detection. Based on pdfplumber's edge structure.

type EnrichedChar ¶

type EnrichedChar struct {
	Text       rune
	Box        Rect
	FontSize   float64
	FontWeight int
	FontName   string
	FontFlags  int
	FillColor  RGBA
	Angle      float32
	IsHyphen   bool
}

EnrichedChar represents a single character with all its metadata.

type EnrichedWord ¶

type EnrichedWord struct {
	Text        string
	Box         Rect
	FontSize    float64 // Average font size
	FontWeight  int     // Dominant font weight
	FontName    string  // Dominant font name
	FontFlags   int     // Dominant font flags
	FillColor   RGBA    // Dominant fill color
	IsBold      bool
	IsItalic    bool
	IsMonospace bool
	Baseline    float64 // Y-coordinate of the text baseline
	XHeight     float64 // Height of lowercase letters
	Rotation    float64 // Rotation angle in degrees (0, 90, 180, 270, etc.)
}

EnrichedWord represents a word with aggregated style information.

func (EnrichedWord) IsBulletOrNumber ¶

func (w EnrichedWord) IsBulletOrNumber() bool

IsBulletOrNumber checks if the word looks like a list marker.
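
The exact rules behind IsBulletOrNumber are not documented. The following sketch shows what such a check could look like. `looksLikeListMarker` and the set of accepted markers are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// numbered matches short markers such as "1.", "12)", "a." or "(iv)".
var numbered = regexp.MustCompile(`^\(?([0-9]{1,3}|[a-zA-Z]|[ivxIVX]{1,4})[.)]$`)

// looksLikeListMarker reports whether word is a common bullet glyph or a
// short numbered marker.
func looksLikeListMarker(word string) bool {
	switch word {
	case "•", "◦", "▪", "-", "*", "–":
		return true
	}
	return numbered.MatchString(word)
}

func main() {
	for _, w := range []string{"•", "1.", "12)", "(iv)", "Hello", "2024."} {
		fmt.Printf("%q -> %v\n", w, looksLikeListMarker(w))
	}
}
```

Limiting numbers to three digits keeps a sentence that ends in a year, such as "2024.", from being mistaken for a list marker.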

type HeadingContext ¶

type HeadingContext struct {
	Level int    `json:"level"`
	Text  string `json:"text"`
	Page  int    `json:"page"`
}

type Line ¶

type Line struct {
	Words    []EnrichedWord
	Box      Rect
	Baseline float64 // Y-coordinate of the baseline
}

Line represents a horizontal line of text.

type LineType ¶

type LineType string

LineType represents the classification of a line in table detection

const (
	TextLine    LineType = "TxL" // Text line (single segment spanning > 50% width)
	TableLine   LineType = "TbL" // Table line (multiple segments)
	UnknownLine LineType = "UnL" // Unknown line (single segment spanning < 50% width)
)
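
The rules in the constant comments translate directly into a small classifier. In this sketch `classifyLine` is a hypothetical function. Measuring against the page width, and treating empty lines and a span of exactly 50% as unknown, are assumptions.

```go
package main

import "fmt"

// LineType and its constants are redeclared here so the example is
// self-contained.
type LineType string

const (
	TextLine    LineType = "TxL"
	TableLine   LineType = "TbL"
	UnknownLine LineType = "UnL"
)

// classifyLine takes the widths of the segments found on one line. Several
// segments make a table line. A single segment is a text line when it spans
// more than half of pageWidth, and unknown otherwise.
func classifyLine(segmentWidths []float64, pageWidth float64) LineType {
	switch {
	case len(segmentWidths) > 1:
		return TableLine
	case len(segmentWidths) == 1 && segmentWidths[0] > 0.5*pageWidth:
		return TextLine
	default:
		return UnknownLine
	}
}

func main() {
	page := 600.0
	fmt.Println(classifyLine([]float64{480}, page))         // TxL
	fmt.Println(classifyLine([]float64{90, 110, 80}, page)) // TbL
	fmt.Println(classifyLine([]float64{120}, page))         // UnL
}
```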

type Page ¶

type Page struct {
	Number     int
	Width      float64
	Height     float64
	Quality    PageQuality
	Paragraphs []Paragraph
	Tables     []Table
	Lines      []Edge   // Explicit line objects extracted from PDF
	Columns    []Column // Detected column layout
}

Page represents all extracted content from a PDF page.

func ExtractPage ¶

func ExtractPage(instance pdfium.Pdfium, page references.FPDF_PAGE, pageNumber int, config Config) (*Page, error)

ExtractPage extracts all enriched text from a PDF page.

func (*Page) ToMarkdown ¶

func (p *Page) ToMarkdown() string

ToMarkdown converts a single page to markdown.

type PageExtractor ¶

type PageExtractor struct {
	// contains filtered or unexported fields
}

PageExtractor provides context for extracting text from a page.

type PageMetrics ¶

type PageMetrics struct {
	PageNumber int
	Duration   time.Duration
}

PageMetrics contains timing for a single page

type PageQuality ¶

type PageQuality struct {
	AlnumRatio           float64
	MeaningfulWordRatio  float64
	ReplacementCharRatio float64
	FragmentedWordRatio  float64
	PUARatio             float64
	WordCount            int
	CharCount            int
	NonWhitespaceCount   int
	IsLowQuality         bool
}
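
PageQuality carries no doc comment, so the meaning of its fields has to be inferred from their names. The sketch below shows how three of the ratios might be computed over non-whitespace characters. The `quality` type, the `measure` function, and these definitions are assumptions, and docmill's own calculation and cut-offs may differ.

```go
package main

import (
	"fmt"
	"unicode"
)

// quality holds three ratios named after PageQuality fields. Each is a
// fraction of the non-whitespace characters in the text.
type quality struct {
	alnumRatio       float64
	replacementRatio float64
	puaRatio         float64
}

// measure counts letters and digits, U+FFFD replacement characters, and
// private-use-area code points (Unicode category Co).
func measure(text string) quality {
	var total, alnum, repl, pua int
	for _, r := range text {
		if unicode.IsSpace(r) {
			continue
		}
		total++
		switch {
		case r == unicode.ReplacementChar:
			repl++
		case unicode.Is(unicode.Co, r):
			pua++
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			alnum++
		}
	}
	if total == 0 {
		return quality{}
	}
	n := float64(total)
	return quality{float64(alnum) / n, float64(repl) / n, float64(pua) / n}
}

func main() {
	fmt.Printf("%+v\n", measure("Quarterly revenue grew 12%"))
	fmt.Printf("%+v\n", measure("\uFFFD\uFFFD\uE000\uE001 ab"))
}
```

A page dominated by replacement or private-use characters usually indicates a font without a usable Unicode mapping, which is one case where falling back to OCR or an LLM makes sense.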

type Paragraph ¶

type Paragraph struct {
	Lines        []Line
	Box          Rect
	Alignment    Alignment
	IsHeading    bool
	HeadingLevel int // 1-6 for markdown headings
	IsList       bool
	IsCode       bool
	Indent       float64 // Left indentation
}

Paragraph represents a block of text.

func (Paragraph) CenterX ¶

func (p Paragraph) CenterX() float64

CenterX returns the horizontal center of a paragraph's bounding box

func (Paragraph) Text ¶

func (p Paragraph) Text() string

Text returns the full text of the paragraph.

type Point ¶

type Point struct {
	X float64
	Y float64
}

Point represents an (x, y) coordinate where edges intersect.

type ProcessingMetrics ¶

type ProcessingMetrics struct {
	TotalTime       time.Duration
	DocumentOpen    time.Duration
	PageExtractions []PageMetrics
	Statistics      DocumentStatistics
}

ProcessingMetrics contains timing and statistics for PDF conversion

type RGBA ¶

type RGBA struct {
	R, G, B, A uint
}

RGBA represents a color.

type Rect ¶

type Rect struct {
	X0 float64 // Left
	Y0 float64 // Top (after conversion from PDF coordinates)
	X1 float64 // Right
	Y1 float64 // Bottom (after conversion from PDF coordinates)
}

Rect represents a bounding box in PDF coordinates.

func (Rect) CenterX ¶

func (r Rect) CenterX() float64

CenterX returns the horizontal center of the rectangle

func (Rect) CenterY ¶

func (r Rect) CenterY() float64

CenterY returns the vertical center of the rectangle.

func (Rect) Height ¶

func (r Rect) Height() float64

Height returns the height of the rectangle.

func (Rect) Width ¶

func (r Rect) Width() float64

Width returns the width of the rectangle.
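
The four Rect helpers are simple enough to reconstruct from their signatures. The sketch below redeclares the type so it is self-contained. The method bodies are an assumption, and they rely on the field comments' convention that Y0 is the top and Y1 the bottom, so that Y1 >= Y0.

```go
package main

import "fmt"

// Rect mirrors the documented fields: X0/X1 are left/right, Y0/Y1 are
// top/bottom after conversion from PDF coordinates.
type Rect struct {
	X0, Y0, X1, Y1 float64
}

// Width is the horizontal extent of the rectangle.
func (r Rect) Width() float64 { return r.X1 - r.X0 }

// Height is the vertical extent, assuming Y1 (bottom) >= Y0 (top).
func (r Rect) Height() float64 { return r.Y1 - r.Y0 }

// CenterX is the midpoint between the left and right edges.
func (r Rect) CenterX() float64 { return (r.X0 + r.X1) / 2 }

// CenterY is the midpoint between the top and bottom edges.
func (r Rect) CenterY() float64 { return (r.Y0 + r.Y1) / 2 }

func main() {
	line := Rect{X0: 72, Y0: 100, X1: 540, Y1: 124}
	fmt.Println(line.Width(), line.Height(), line.CenterX(), line.CenterY())
	// prints: 468 24 306 112
}
```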

type Segment ¶

type Segment struct {
	Words []EnrichedWord
	Box   Rect
}

Segment represents a group of horizontally adjacent content elements. Based on the PDF-TREX algorithm.

type SegmentTableCell ¶

type SegmentTableCell struct {
	Content string
	Row     int
	Column  int
	Box     Rect
}

SegmentTableCell represents a final table cell with 2D coordinates. Used internally by segment-based table detection.

type SegmentTableRow ¶

type SegmentTableRow struct {
	Lines    []TaggedLine
	Segments []Segment
	Box      Rect
}

SegmentTableRow represents a logical table row (may span multiple lines). Used internally by segment-based table detection.

type Table ¶

type Table struct {
	BBox    CellBBox
	Rows    []TableRow
	Cells   []CellBBox // Raw cell bounding boxes
	NumRows int
	NumCols int
}

Table represents a detected table with its structure and content.

func DetectTables ¶

func DetectTables(page *Page, settings TableSettings) []Table

DetectTables finds tables in a page using word alignment or explicit lines. Based on pdfplumber's TableFinder supporting multiple strategies.

func DetectTablesSegmentBased ¶

func DetectTablesSegmentBased(page *Page, thresholds AdaptiveThresholds) []Table

DetectTablesSegmentBased detects tables using a segment-based approach. This is an alternative to line-based detection for PDFs without ruling lines.

type TableArea ¶

type TableArea struct {
	Lines []TaggedLine
	Box   Rect
}

TableArea represents a region containing table lines

type TableCell ¶

type TableCell struct {
	BBox    CellBBox
	Content string
	Words   []EnrichedWord
}

TableCell represents a detected table cell with its content.

type TableColumn ¶

type TableColumn struct {
	Segments []Segment
	Box      Rect
}

TableColumn represents a logical table column

type TableRow ¶

type TableRow struct {
	Cells []TableCell
	BBox  CellBBox
}

TableRow represents a row of cells in a table.

type TableSettings ¶

type TableSettings struct {
	// Strategy for detecting table edges: "text", "lines", "lines_strict", "explicit"
	VerticalStrategy   string
	HorizontalStrategy string

	// Tolerances for snapping close edges together
	SnapTolerance  float64
	SnapXTolerance float64
	SnapYTolerance float64

	// Tolerances for joining edges on the same line
	JoinTolerance  float64
	JoinXTolerance float64
	JoinYTolerance float64

	// Minimum edge length to consider
	EdgeMinLength float64

	// Minimum number of words required to infer edges from text alignment
	MinWordsVertical   int
	MinWordsHorizontal int

	// Tolerances for finding edge intersections
	IntersectionTolerance  float64
	IntersectionXTolerance float64
	IntersectionYTolerance float64
}

TableSettings configures table detection behavior. Based on pdfplumber's TableSettings.

func DefaultTableSettings ¶

func DefaultTableSettings() TableSettings

DefaultTableSettings returns default settings for table detection. Uses "lines" strategy by default to detect explicit line objects in PDFs.
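
To make the role of a snap tolerance concrete, the sketch below clusters nearby edge coordinates and replaces each cluster with its mean. This is the general technique used by pdfplumber-style table finders. `snap` is a hypothetical function and is not taken from docmill's source.

```go
package main

import (
	"fmt"
	"sort"
)

// snap sorts coords, groups neighbours that lie within tol of each other,
// and replaces every member of a group with the group mean. The result is
// returned in ascending order, and the input slice is not modified.
func snap(coords []float64, tol float64) []float64 {
	sorted := append([]float64(nil), coords...)
	sort.Float64s(sorted)
	out := make([]float64, 0, len(sorted))
	for i := 0; i < len(sorted); {
		j, sum := i, 0.0
		// Extend the cluster while the next value is within tol of the previous one.
		for j < len(sorted) && (j == i || sorted[j]-sorted[j-1] <= tol) {
			sum += sorted[j]
			j++
		}
		mean := sum / float64(j-i)
		for k := i; k < j; k++ {
			out = append(out, mean)
		}
		i = j
	}
	return out
}

func main() {
	// x positions of nearly aligned vertical edges from two table borders.
	fmt.Println(snap([]float64{100, 101.5, 99, 200, 202}, 3))
}
```

A larger tolerance merges slightly misaligned ruling lines into one edge, at the risk of merging genuinely distinct columns that sit close together.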

type TaggedLine ¶

type TaggedLine struct {
	Line     Line
	Segments []Segment
	Type     LineType
}

TaggedLine is a line with its type classification

type TextBlock ¶

type TextBlock struct {
	Words            []EnrichedWord
	Lines            []Line
	Rotation         float64 // Rotation angle in degrees
	ReadingDirection string  // "ltr", "rtl", "ttb", "btt"
}

TextBlock represents a block of text with consistent rotation/orientation.

Directories ¶

Path	Synopsis
cmd
docmill command
example
