docmill

package module
v0.1.3
Published: May 11, 2026 License: MIT Imports: 17 Imported by: 0

README

docmill

Fast PDF to Markdown conversion using pdfium text extraction with intelligent layout and style analysis.

Features

  • Fast extraction: Uses native pdfium for text extraction (orders of magnitude faster than LLM processing)
  • Rich metadata: Extracts font size, weight, style, colour, and positioning information
  • Intelligent structure detection:
    • Headings (H1-H6) based on font size and weight
    • Paragraphs with proper line breaking and spacing
    • Bullet and numbered lists with nested items
    • Code blocks (monospace font detection)
    • Bold and italic inline formatting
    • Text alignment (left, centre, right)
    • Table detection with markdown table output
    • Multi-column layout handling with rotated text support
  • Page-aware: Handles multi-page documents with page separators
  • Flexible API: Convert from file path, bytes, or io.ReadSeeker
  • Configurable: Customisable heading detection, table extraction, and formatting options
  • Performance metrics: Optional timing and statistics logging

Architecture

The converter works in three stages:

  1. Extraction: Extract all characters with rich metadata (font, size, position, colour)
  2. Structure Analysis: Group characters → words → lines → paragraphs, detect document structure
  3. Markdown Conversion: Convert structured document to clean markdown
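
The grouping in stage 2 can be pictured with a minimal sketch. It is not docmill's implementation: the `char` type, the baseline tolerance, and the gap threshold are all assumptions chosen for illustration, and Y is assumed to grow downwards from the top of the page.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// char is a hypothetical, simplified stand-in for an extracted character.
type char struct {
	Text   rune
	X0, X1 float64 // left and right edges
	Y      float64 // baseline, measured from the top of the page
}

// groupLines buckets characters whose baselines differ by at most tol,
// then sorts each line from left to right.
func groupLines(chars []char, tol float64) [][]char {
	sorted := append([]char(nil), chars...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y < sorted[j].Y })
	var lines [][]char
	for _, c := range sorted {
		if n := len(lines); n > 0 && math.Abs(c.Y-lines[n-1][0].Y) <= tol {
			lines[n-1] = append(lines[n-1], c)
			continue
		}
		lines = append(lines, []char{c})
	}
	for _, ln := range lines {
		sort.SliceStable(ln, func(i, j int) bool { return ln[i].X0 < ln[j].X0 })
	}
	return lines
}

// groupWords splits one left-to-right line into words. A word ends at a
// space character or when the gap to the previous character exceeds maxGap.
func groupWords(line []char, maxGap float64) []string {
	var words []string
	var cur strings.Builder
	prevX1 := 0.0
	for _, c := range line {
		if cur.Len() > 0 && (c.Text == ' ' || c.X0-prevX1 > maxGap) {
			words = append(words, cur.String())
			cur.Reset()
		}
		if c.Text != ' ' {
			cur.WriteRune(c.Text)
			prevX1 = c.X1
		}
	}
	if cur.Len() > 0 {
		words = append(words, cur.String())
	}
	return words
}

func main() {
	chars := []char{
		{'o', 0, 5, 20}, {'k', 5, 10, 20}, // second line
		{'H', 0, 5, 10}, {'i', 5, 8, 10}, {'y', 14, 19, 10}, {'o', 19, 24, 10}, // first line
	}
	for _, ln := range groupLines(chars, 1) {
		fmt.Println(groupWords(ln, 2))
	}
}
```

Grouping lines into paragraphs would follow the same pattern, using vertical gaps between lines instead of horizontal gaps between characters.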

Installation

go get github.com/ivanvanderbyl/docmill

Usage

Basic Conversion
import (
    "fmt"
    "log"

    "github.com/ivanvanderbyl/docmill"
)

converter, err := docmill.New()
if err != nil {
    log.Fatal(err)
}
defer converter.Close()

markdown, err := converter.ConvertFile("document.pdf")
if err != nil {
    log.Fatal(err)
}

fmt.Println(markdown)
Custom Configuration
config := docmill.DefaultConfig()
config.IncludePageBreaks = true
config.DetectTables = true
config.UseSegmentBasedTables = true  // Better for PDFs without ruling lines
config.UseAdaptiveThresholds = true
config.MinHeadingFontSize = 1.2      // Adjust heading detection sensitivity
config.EnableMetricsLogging = true   // Enable performance metrics

converter, err := docmill.NewWithConfig(config)
if err != nil {
    log.Fatal(err)
}
defer converter.Close()

markdown, err := converter.ConvertFile("document.pdf")
Convert from Bytes
pdfBytes, err := os.ReadFile("document.pdf")
if err != nil {
    log.Fatal(err)
}

markdown, err := converter.ConvertBytes(pdfBytes)
Convert from io.ReadSeeker
file, err := os.Open("document.pdf")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

markdown, err := converter.ConvertReader(file)
Convert Specific Pages
// Convert pages 0-4 (first 5 pages, 0-indexed)
markdown, err := converter.ConvertPageRange("document.pdf", 0, 4)
Get Document Info
info, err := converter.GetDocumentInfo("document.pdf")
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Document has %d pages\n", info.PageCount)

Command Line Tool

A CLI tool is provided for quick conversions:

Installation
Download pre-built binary

Pre-built binaries are available for Linux, macOS, and Windows (amd64 and arm64) from GitHub Releases.

macOS / Linux:

# Download the latest release (adjust OS and ARCH as needed)
# OS: linux, darwin  ARCH: amd64, arm64
curl -sL "https://github.com/ivanvanderbyl/docmill/releases/latest/download/docmill_$(uname -s | tr '[:upper:]' '[:lower:]')_$(uname -m | sed 's/x86_64/amd64/' | sed 's/aarch64/arm64/').tar.gz" -o /tmp/docmill.tar.gz && tar xzf /tmp/docmill.tar.gz -C /usr/local/bin docmill && rm /tmp/docmill.tar.gz

Windows:

Download the appropriate .zip from the releases page and add docmill.exe to your PATH.

Install with go install
go install github.com/ivanvanderbyl/docmill/cmd/docmill@latest
Build from source
git clone https://github.com/ivanvanderbyl/docmill.git
cd docmill
go install ./cmd/docmill
Usage
# Convert to file
docmill -i input.pdf -o output.md

# Convert specific pages (0-indexed)
docmill -i input.pdf -o output.md --start-page 0 --end-page 4

# Output to stdout
docmill -i input.pdf

# Enable metrics logging
docmill -i input.pdf -o output.md --metrics
Options
  • -i, --input - Input PDF file path (required)
  • -o, --output - Output markdown file path (default: stdout)
  • --start-page - Start page number, 0-indexed (default: all pages)
  • --end-page - End page number, 0-indexed (default: all pages)
  • -m, --metrics - Enable processing time and statistics logging
  • --page-breaks - Add --- separators between pages (default: true)
  • --min-heading-font-size - Minimum font size multiplier to detect headings, 0 disables (default: 1.15)
  • --detect-tables - Enable table detection and extraction (default: true)
  • --segment-tables - Use PDF-TREX segment-based table detection, better for tables without ruling lines (default: false)
  • --adaptive-thresholds - Enable document-specific threshold calculation based on spacing distribution (default: true)
  • --max-concurrency - Maximum pages processed concurrently during structure detection (default: 10)
  • --chunk - Output as JSON chunks instead of markdown
  • --chunk-max-tokens - Maximum tokens per chunk (default: 512)
  • --chunk-overlap - Number of overlap tokens between chunks (default: 0)
  • --chunk-repeat-headings - Repeat heading hierarchy at the start of each chunk
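
To make the interaction between --chunk-max-tokens and --chunk-overlap concrete, here is a rough sketch of token-bounded chunking with overlap. It is an illustration only, not docmill's chunker: `chunkParagraphs` is a hypothetical helper, tokens are approximated as whitespace-separated words, and the budget is soft (a single oversized paragraph is kept whole).

```go
package main

import (
	"fmt"
	"strings"
)

// chunkParagraphs packs paragraphs into chunks of roughly maxTokens words.
// When a chunk is closed, its last overlap words are repeated at the start
// of the next chunk. Paragraphs are never split.
func chunkParagraphs(paras []string, maxTokens, overlap int) []string {
	var chunks []string
	var cur []string // words in the chunk being built
	fresh := 0       // words added since the last flush, excluding overlap
	for _, p := range paras {
		words := strings.Fields(p)
		if fresh > 0 && len(cur)+len(words) > maxTokens {
			chunks = append(chunks, strings.Join(cur, " "))
			tail := cur
			if overlap < len(cur) {
				tail = cur[len(cur)-overlap:]
			}
			cur = append([]string(nil), tail...)
			fresh = 0
		}
		cur = append(cur, words...)
		fresh += len(words)
	}
	if fresh > 0 {
		chunks = append(chunks, strings.Join(cur, " "))
	}
	return chunks
}

func main() {
	paras := []string{"one two three", "four five", "six seven eight"}
	for i, c := range chunkParagraphs(paras, 5, 1) {
		fmt.Printf("%d: %s\n", i, c)
	}
}
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of duplicated tokens.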

Configuration Options

Config Struct
type Config struct {
    // IncludePageBreaks adds "---" separators between pages (default: true)
    IncludePageBreaks bool

    // MinHeadingFontSize is the minimum font size multiplier to detect headings
    // A value of 0 disables size-based heading detection (default: 1.15x body text)
    MinHeadingFontSize float64

    // DetectTables enables table detection and extraction (default: true)
    DetectTables bool

    // TableSettings configures table detection behavior
    TableSettings TableSettings

    // UseSegmentBasedTables enables PDF-TREX segment-based table detection
    // This works better for tables without ruling lines (default: false)
    UseSegmentBasedTables bool

    // UseAdaptiveThresholds enables document-specific threshold calculation
    // Based on spacing distribution analysis (default: true)
    UseAdaptiveThresholds bool

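    // EnableMetricsLogging enables processing time and statistics logging (default: false)
    EnableMetricsLogging bool

    // MaxConcurrency controls how many pages are processed concurrently during
    // the structure detection phase (default: 10)
    MaxConcurrency int
}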
Table Settings

Table detection can be configured using TableSettings:

config := docmill.DefaultConfig()
config.TableSettings = docmill.TableSettings{
    VerticalStrategy:   "lines",  // "text", "lines", "lines_strict", "explicit"
    HorizontalStrategy: "lines",
    SnapTolerance:      3.0,      // Tolerance for snapping close edges
    EdgeMinLength:      3.0,      // Minimum edge length to consider
    MinWordsVertical:   3,        // Minimum words for text-based detection
    MinWordsHorizontal: 1,
}

Markdown Output Features

Headings

Headings are detected based on:

  • Font size relative to body text (configurable threshold)
  • Bold font weight
  • Single-line paragraphs
# Large Heading (H1)
## Medium Heading (H2)
### Smaller Heading (H3)
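
The three signals above can be combined in many ways. The sketch below shows one possibility and is not docmill's actual rule set: `headingLevel` is a hypothetical helper, and the 2.0 and 1.5 cut-offs and the bold-only H4 fallback are assumptions. Only the idea of a minimum size multiplier relative to body text (where 0 disables size-based detection) comes from the configuration described in this README.

```go
package main

import "fmt"

// headingLevel returns a markdown heading level (1-6), or 0 for body text.
// minRatio plays the role of a font size multiplier threshold; 0 disables
// size-based detection. The level cut-offs are illustrative assumptions.
func headingLevel(fontSize, bodySize float64, bold, singleLine bool, minRatio float64) int {
	if !singleLine || bodySize <= 0 {
		return 0
	}
	ratio := fontSize / bodySize
	if minRatio > 0 && ratio >= minRatio {
		switch {
		case ratio >= 2.0:
			return 1
		case ratio >= 1.5:
			return 2
		default:
			return 3
		}
	}
	if bold {
		return 4 // body-sized but bold, single-line paragraph
	}
	return 0
}

func main() {
	fmt.Println(headingLevel(24, 12, false, true, 1.15)) // large title
	fmt.Println(headingLevel(14, 12, false, true, 1.15)) // slightly larger than body
	fmt.Println(headingLevel(12, 12, true, true, 1.15))  // bold body-sized line
	fmt.Println(headingLevel(12, 12, false, true, 1.15)) // plain body text
}
```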
Lists

Bullet and numbered lists with proper nesting:

* First item
* Second item
  * Nested item
  * Another nested item

1. Numbered item
2. Another item
   1. Nested numbered item
Tables

Tables are detected and converted to markdown tables:

| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Cell 1   | Cell 2   | Cell 3   |
| Cell 4   | Cell 5   | Cell 6   |
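
For readers who want to post-process detected tables, the following sketch renders a grid of cell strings as a markdown table. `renderTable` is a hypothetical helper rather than part of docmill, and it assumes the first row is the header.

```go
package main

import (
	"fmt"
	"strings"
)

// renderTable renders rows as a markdown table, treating rows[0] as the header.
func renderTable(rows [][]string) string {
	if len(rows) == 0 {
		return ""
	}
	var b strings.Builder
	writeRow := func(cells []string) {
		b.WriteString("|")
		for _, c := range cells {
			// Escape pipes so cell text cannot break the column structure.
			b.WriteString(" " + strings.ReplaceAll(c, "|", "\\|") + " |")
		}
		b.WriteString("\n")
	}
	writeRow(rows[0])
	b.WriteString("|")
	for range rows[0] {
		b.WriteString("---|")
	}
	b.WriteString("\n")
	for _, r := range rows[1:] {
		writeRow(r)
	}
	return b.String()
}

func main() {
	fmt.Print(renderTable([][]string{
		{"Header 1", "Header 2"},
		{"Cell 1", "Cell 2"},
	}))
}
```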
Inline Formatting

Bold, italic, and code are preserved:

This is **bold** text and *italic* text with `code`.
Code Blocks

Monospace paragraphs are converted to code blocks:

```
func main() {
    fmt.Println("Hello")
}
```
Page Breaks

Multi-page documents include page separators (when IncludePageBreaks is enabled):

Content from page 1

---

Content from page 2
Multi-Column Layouts

The converter detects multi-column layouts and rotated text, and preserves reading order where possible.

Performance Metrics

When EnableMetricsLogging is enabled, the converter logs detailed timing and statistics:

Processing PDF with 10 pages...
Document opened in 45ms
Page 1 extracted in 23ms
Page 2 extracted in 18ms
...
Total conversion time: 234ms
Statistics:
  - Total paragraphs: 145
  - Total tables: 8
  - Total headings: 23
  - Total words: 3,456
  - Total characters: 18,234

Typical conversion speeds (these vary with PDF complexity):

  • Simple text PDF: ~10-50ms per page
  • Complex formatted PDF with tables: ~50-200ms per page

Compared with LLM-based extraction:

  • Latency: an LLM API call typically takes ~1-5 seconds per page
  • Cost: docmill runs locally at no per-page cost, whereas LLM extraction incurs API costs

Use Cases

docmill is a good fit when:

  • PDFs contain extractable text (not scanned images)
  • You need fast conversion without LLM API costs
  • Document structure is relatively standard
  • You want to preserve formatting (bold, italic, headings, tables)
  • You are batch processing large numbers of documents
  • You are building document search/indexing systems
  • You are extracting structured data from reports

Fall back to LLM processing when:

  • PDFs are scanned images requiring OCR
  • Complex semantic analysis is required
  • You need to extract specific information that requires semantic understanding
  • Documents have highly irregular layouts

Integration with LLM Pipeline

This package is designed to be a fast first pass before LLM processing:

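// Try fast extraction first.
// llmExtractor, llmClient, and llm.AnalyzeRequest are placeholders for your
// own LLM integration; they are not part of docmill.
markdown, err := converter.ConvertFile(pdfPath)
if err != nil || len(markdown) < 100 {
    // Conversion failed or produced almost no text (for example a scanned PDF):
    // fall back to LLM-based extraction.
    return llmExtractor.Extract(pdfPath)
}

// Use extracted markdown as LLM context for further analysis
response, err := llmClient.Analyze(ctx, llm.AnalyzeRequest{
    Context: markdown,
    Task:    "Extract key financial metrics from this report",
})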
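
The fallback decision in the snippet above can be factored into a small helper that is easy to test with fakes. This is a sketch under assumptions: `extractFunc` and `extractWithFallback` are hypothetical names, and the minimum-length check mirrors the `len(markdown) < 100` heuristic rather than any documented docmill behaviour.

```go
package main

import (
	"errors"
	"fmt"
)

// extractFunc has the same shape as Converter.ConvertFile: path in, text out.
type extractFunc func(path string) (string, error)

// extractWithFallback runs fast first and uses slow when fast fails or
// returns fewer than minLen bytes of text.
func extractWithFallback(path string, fast, slow extractFunc, minLen int) (text string, usedFallback bool, err error) {
	text, err = fast(path)
	if err == nil && len(text) >= minLen {
		return text, false, nil
	}
	text, err = slow(path)
	return text, true, err
}

func main() {
	// Fake extractors standing in for docmill and an LLM-based extractor.
	fast := func(path string) (string, error) { return "", errors.New("no extractable text") }
	slow := func(path string) (string, error) { return "text recovered by the slower path", nil }

	text, fellBack, err := extractWithFallback("scan.pdf", fast, slow, 100)
	fmt.Println(text, fellBack, err)
}
```

Passing the extractors as function values keeps the decision logic independent of both docmill and any particular LLM client.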

Capabilities

Supported Features
  • ✅ Text extraction with font metadata
  • ✅ Heading detection (H1-H6)
  • ✅ Paragraph detection with proper spacing
  • ✅ List detection (bullet and numbered)
  • ✅ Table detection and markdown table output
  • ✅ Bold and italic inline formatting
  • ✅ Code block detection (monospace fonts)
  • ✅ Multi-column layout handling
  • ✅ Rotated text support
  • ✅ Page break markers
  • ✅ Configurable thresholds and settings
  • ✅ Performance metrics and logging
Current Limitations
  • ❌ No OCR support (requires extractable text in PDF)
  • ❌ Hyperlinks are not extracted
  • ❌ Images are not extracted (text only)
  • ⚠️ Complex multi-column layouts may not always preserve perfect reading order
  • ⚠️ Tables without clear structure may require segment-based detection
Experimental Features
  • PDF-TREX segment-based table detection (enable with UseSegmentBasedTables: true)
  • Adaptive threshold calculation based on document analysis

Contributing

Contributions are welcome! Areas for improvement:

  • Hyperlink extraction
  • Image placeholder insertion
  • Enhanced multi-column layout detection
  • Custom markdown formatting options
  • Additional table detection strategies

License

MIT License - see LICENSE file for details

Documentation

Types

type AdaptiveThresholds

type AdaptiveThresholds struct {
	HorizontalThreshold float64 // hT: for horizontal clustering
	VerticalThreshold   float64 // vT: for vertical clustering
}

AdaptiveThresholds contains document-specific threshold values
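
The package does not document how these values are derived beyond "spacing distribution analysis". One plausible approach, shown purely as an assumption, is to scale the median observed gap. The `thresholds` type, `estimateThresholds`, and the scaling factor below are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// thresholds mirrors the two fields of AdaptiveThresholds for illustration.
type thresholds struct {
	Horizontal float64
	Vertical   float64
}

// median returns the median of v, or 0 for an empty slice.
func median(v []float64) float64 {
	if len(v) == 0 {
		return 0
	}
	s := append([]float64(nil), v...)
	sort.Float64s(s)
	m := len(s) / 2
	if len(s)%2 == 1 {
		return s[m]
	}
	return (s[m-1] + s[m]) / 2
}

// estimateThresholds scales the median word gap and median line gap by
// factor. Gaps larger than the result would be treated as separators.
func estimateThresholds(wordGaps, lineGaps []float64, factor float64) thresholds {
	return thresholds{
		Horizontal: median(wordGaps) * factor,
		Vertical:   median(lineGaps) * factor,
	}
}

func main() {
	wordGaps := []float64{2, 2, 10} // one wide gap, e.g. between table cells
	lineGaps := []float64{12, 14}
	fmt.Printf("%+v\n", estimateThresholds(wordGaps, lineGaps, 1.5))
}
```

Using the median keeps a few unusually wide gaps from inflating the threshold.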

type Alignment

type Alignment int

Alignment represents text alignment.

const (
	AlignmentLeft Alignment = iota
	AlignmentCenter
	AlignmentRight
	AlignmentJustified
)

type Block

type Block struct {
	Segments    []Segment
	Box         Rect
	LineIndices []int // Which lines this block spans
}

Block represents vertically aligned segments across multiple lines

type CellBBox

type CellBBox struct {
	X0     float64
	Top    float64
	X1     float64
	Bottom float64
}

CellBBox represents a table cell as a bounding box.

type Chunk

type Chunk struct {
	Index      int    `json:"index"`
	Text       string `json:"text"`
	TokenCount int    `json:"token_count"`

	StartPage int `json:"start_page"`
	EndPage   int `json:"end_page"`

	HeadingPath []HeadingContext `json:"heading_path,omitempty"`
}

type ChunkConfig

type ChunkConfig struct {
	MaxTokens      int
	OverlapTokens  int
	RepeatHeadings bool
	EstimateTokens func(s string) int
}

func DefaultChunkConfig

func DefaultChunkConfig() ChunkConfig

type Column

type Column struct {
	Box        Rect
	Words      []EnrichedWord
	Paragraphs []Paragraph
	Index      int // Column number (0-indexed from left to right)
}

Column represents a vertical column of text in a multi-column layout.

type Config

type Config struct {
	// IncludePageBreaks adds "---" separators between pages (default: true)
	IncludePageBreaks bool

	// MinHeadingFontSize is the minimum font size multiplier to detect headings
	// A value of 0 disables size-based heading detection (default: 1.15x body text)
	MinHeadingFontSize float64

	// DetectTables enables table detection and extraction (default: false)
	DetectTables bool

	// TableSettings configures table detection behavior (default: DefaultTableSettings())
	TableSettings TableSettings

	// UseSegmentBasedTables enables PDF-TREX segment-based table detection
	// This works better for tables without ruling lines (default: true)
	UseSegmentBasedTables bool

	// UseAdaptiveThresholds enables document-specific threshold calculation
	// Based on spacing distribution analysis (default: true)
	UseAdaptiveThresholds bool

	// EnableMetricsLogging enables processing time and statistics logging (default: false)
	EnableMetricsLogging bool

	// MaxConcurrency controls how many pages are processed concurrently during
	// the structure detection phase. PDFium extraction is always sequential,
	// but paragraph/table/heading detection runs in parallel. (default: 10)
	MaxConcurrency int
}

Config controls markdown conversion behavior.

func DefaultConfig

func DefaultConfig() Config

DefaultConfig returns the default converter configuration.

type Converter

type Converter struct {
	// contains filtered or unexported fields
}

Converter converts PDFs to markdown using pdfium text extraction.

func New

func New() (*Converter, error)

New creates a new PDF to markdown converter with default configuration. The returned Converter manages its own pdfium pool and must be closed with Close when no longer needed.

func NewConverter

func NewConverter(instance pdfium.Pdfium) *Converter

NewConverter creates a new PDF to markdown converter with default configuration. The caller is responsible for managing the pdfium pool lifecycle.

func NewConverterWithConfig

func NewConverterWithConfig(instance pdfium.Pdfium, config Config) *Converter

NewConverterWithConfig creates a new PDF to markdown converter with custom configuration. The caller is responsible for managing the pdfium pool lifecycle.

func NewWithConfig

func NewWithConfig(config Config) (*Converter, error)

NewWithConfig creates a new PDF to markdown converter with custom configuration. The returned Converter manages its own pdfium pool and must be closed with Close when no longer needed.

func (*Converter) Close

func (c *Converter) Close()

Close releases resources held by the Converter. Only required for converters created with New or NewWithConfig.

func (*Converter) ConvertBytes

func (c *Converter) ConvertBytes(pdfBytes []byte) (string, error)

ConvertBytes converts PDF bytes to markdown.

func (*Converter) ConvertBytesChunks

func (c *Converter) ConvertBytesChunks(pdfBytes []byte, cc ChunkConfig) ([]Chunk, error)

func (*Converter) ConvertFile

func (c *Converter) ConvertFile(filePath string) (string, error)

ConvertFile converts a PDF file to markdown.

func (*Converter) ConvertFileChunks

func (c *Converter) ConvertFileChunks(filePath string, cc ChunkConfig) ([]Chunk, error)

func (*Converter) ConvertFileWithMetrics

func (c *Converter) ConvertFileWithMetrics(filePath string) (string, ProcessingMetrics, error)

ConvertFileWithMetrics converts a PDF and returns both markdown and metrics

func (*Converter) ConvertPageRange

func (c *Converter) ConvertPageRange(filePath string, startPage, endPage int) (string, error)

ConvertPageRange converts a specific range of pages to markdown.

func (*Converter) ConvertReader

func (c *Converter) ConvertReader(reader io.ReadSeeker) (string, error)

ConvertReader converts a PDF from an io.ReadSeeker to markdown.

func (*Converter) GetDocumentInfo

func (c *Converter) GetDocumentInfo(filePath string) (*DocumentInfo, error)

GetDocumentInfo returns basic information about a PDF without converting it.

type Document

type Document struct {
	Pages []Page
	Stats DocumentStats
}

Document represents the complete extracted document structure.

func (*Document) ToChunks

func (d *Document) ToChunks(config Config, cc ChunkConfig) []Chunk

func (*Document) ToMarkdown

func (d *Document) ToMarkdown(config Config) string

ToMarkdown converts a document to markdown format.

type DocumentInfo

type DocumentInfo struct {
	PageCount int
}

DocumentInfo contains basic information about a PDF document.

type DocumentStatistics

type DocumentStatistics struct {
	TotalPages      int
	TotalParagraphs int
	TotalTables     int
	TotalHeadings   int
	TotalWords      int
	TotalCharacters int
}

DocumentStatistics contains document-level statistics

type DocumentStats

type DocumentStats struct {
	MostUsedFontSize float64         // Most common font size (body text)
	MostUsedFontName string          // Most common font name
	MostUsedLineGap  float64         // Most common line spacing
	FontSizeFreq     map[float64]int // Frequency map of font sizes
	FontNameFreq     map[string]int  // Frequency map of font names
	MaxFontSize      float64         // Largest font size in document
}

DocumentStats holds document-wide font and spacing statistics. These are calculated across all pages as hints for structure detection.
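
As an illustration of how a frequency map yields the body-text size, the sketch below picks the most common font size from a list of observed sizes. `mostUsedFontSize` is a hypothetical helper; rounding to one decimal place and breaking ties towards the smaller size are assumptions, not documented behaviour.

```go
package main

import (
	"fmt"
	"math"
)

// mostUsedFontSize builds a frequency map of font sizes and returns the
// most common one. Ties go to the smaller size so the result is deterministic.
func mostUsedFontSize(sizes []float64) (size float64, freq map[float64]int) {
	freq = make(map[float64]int)
	for _, s := range sizes {
		// Round to one decimal so 11.98 and 12.02 count as the same size.
		freq[math.Round(s*10)/10]++
	}
	best := 0
	for s, n := range freq {
		if n > best || (n == best && s < size) {
			size, best = s, n
		}
	}
	return size, freq
}

func main() {
	size, freq := mostUsedFontSize([]float64{12, 12.02, 11.98, 18, 12, 9})
	fmt.Println(size, freq[size])
}
```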

type Edge

type Edge struct {
	X0          float64 // Left x coordinate
	X1          float64 // Right x coordinate
	Top         float64 // Top y coordinate
	Bottom      float64 // Bottom y coordinate
	Width       float64 // Width (for horizontal edges)
	Height      float64 // Height (for vertical edges)
	Orientation string  // "h" for horizontal, "v" for vertical
}

Edge represents a horizontal or vertical line segment used for table detection. Based on pdfplumber's edge structure.

type EnrichedChar

type EnrichedChar struct {
	Text       rune
	Box        Rect
	FontSize   float64
	FontWeight int
	FontName   string
	FontFlags  int
	FillColor  RGBA
	Angle      float32
	IsHyphen   bool
}

EnrichedChar represents a single character with all its metadata.

type EnrichedWord

type EnrichedWord struct {
	Text        string
	Box         Rect
	FontSize    float64 // Average font size
	FontWeight  int     // Dominant font weight
	FontName    string  // Dominant font name
	FontFlags   int     // Dominant font flags
	FillColor   RGBA    // Dominant fill color
	IsBold      bool
	IsItalic    bool
	IsMonospace bool
	Baseline    float64 // Y-coordinate of the text baseline
	XHeight     float64 // Height of lowercase letters
	Rotation    float64 // Rotation angle in degrees (0, 90, 180, 270, etc.)
}

EnrichedWord represents a word with aggregated style information.

func (EnrichedWord) IsBulletOrNumber

func (w EnrichedWord) IsBulletOrNumber() bool

IsBulletOrNumber checks if the word looks like a list marker.
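
The exact rules used by IsBulletOrNumber are not documented here. A minimal sketch of this kind of check, with an assumed set of bullet glyphs and an assumed numbering pattern, might look like this (`looksLikeListMarker` is a hypothetical name):

```go
package main

import (
	"fmt"
	"regexp"
)

// numbered matches markers such as "1.", "12)", "a." or "B)".
var numbered = regexp.MustCompile(`^(\d{1,3}|[a-zA-Z])[.)]$`)

// looksLikeListMarker reports whether word resembles a bullet or a
// numbering token at the start of a list item.
func looksLikeListMarker(word string) bool {
	switch word {
	case "•", "◦", "▪", "-", "*", "–":
		return true
	}
	return numbered.MatchString(word)
}

func main() {
	for _, w := range []string{"•", "1.", "12)", "a.", "Hello", "3.14"} {
		fmt.Printf("%q -> %v\n", w, looksLikeListMarker(w))
	}
}
```

A check like this produces false positives on its own (a lone initial such as "A." matches), so it is usually combined with position and indentation.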

type HeadingContext

type HeadingContext struct {
	Level int    `json:"level"`
	Text  string `json:"text"`
	Page  int    `json:"page"`
}

type Line

type Line struct {
	Words    []EnrichedWord
	Box      Rect
	Baseline float64 // Y-coordinate of the baseline
}

Line represents a horizontal line of text.

type LineType

type LineType string

LineType represents the classification of a line in table detection

const (
	TextLine    LineType = "TxL" // Text line (single segment spanning > 50% width)
	TableLine   LineType = "TbL" // Table line (multiple segments)
	UnknownLine LineType = "UnL" // Unknown line (single segment spanning < 50% width)
)
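
The comments on these constants describe a simple decision rule, sketched below. The types are local stand-ins, and two details are assumptions: the width is measured against the page (or column) width passed in, and a single segment spanning exactly 50% is treated as unknown.

```go
package main

import "fmt"

type lineType string

const (
	textLine    lineType = "TxL"
	tableLine   lineType = "TbL"
	unknownLine lineType = "UnL"
)

// segment is a simplified horizontal run of adjacent words.
type segment struct {
	X0, X1 float64
}

// classifyLine tags a line by its number of segments and, for a single
// segment, by how much of the available width it covers.
func classifyLine(segs []segment, width float64) lineType {
	switch {
	case len(segs) > 1:
		return tableLine
	case len(segs) == 1 && segs[0].X1-segs[0].X0 > 0.5*width:
		return textLine
	default:
		return unknownLine
	}
}

func main() {
	fmt.Println(classifyLine([]segment{{50, 550}}, 600))             // one wide segment
	fmt.Println(classifyLine([]segment{{50, 150}, {300, 400}}, 600)) // two segments
	fmt.Println(classifyLine([]segment{{50, 150}}, 600))             // one narrow segment
}
```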

type Page

type Page struct {
	Number     int
	Width      float64
	Height     float64
	Quality    PageQuality
	Paragraphs []Paragraph
	Tables     []Table
	Lines      []Edge   // Explicit line objects extracted from PDF
	Columns    []Column // Detected column layout
}

Page represents all extracted content from a PDF page.

func ExtractPage

func ExtractPage(instance pdfium.Pdfium, page references.FPDF_PAGE, pageNumber int, config Config) (*Page, error)

ExtractPage extracts all enriched text from a PDF page.

func (*Page) ToMarkdown

func (p *Page) ToMarkdown() string

ToMarkdown converts a single page to markdown.

type PageExtractor

type PageExtractor struct {
	// contains filtered or unexported fields
}

PageExtractor provides context for extracting text from a page.

type PageMetrics

type PageMetrics struct {
	PageNumber int
	Duration   time.Duration
}

PageMetrics contains timing for a single page

type PageQuality

type PageQuality struct {
	AlnumRatio           float64
	MeaningfulWordRatio  float64
	ReplacementCharRatio float64
	FragmentedWordRatio  float64
	PUARatio             float64
	WordCount            int
	CharCount            int
	NonWhitespaceCount   int
	IsLowQuality         bool
}

type Paragraph

type Paragraph struct {
	Lines        []Line
	Box          Rect
	Alignment    Alignment
	IsHeading    bool
	HeadingLevel int // 1-6 for markdown headings
	IsList       bool
	IsCode       bool
	Indent       float64 // Left indentation
}

Paragraph represents a block of text.

func (Paragraph) CenterX

func (p Paragraph) CenterX() float64

CenterX returns the horizontal center of a paragraph's bounding box

func (Paragraph) Text

func (p Paragraph) Text() string

Text returns the full text of the paragraph.

type Point

type Point struct {
	X float64
	Y float64
}

Point represents an (x, y) coordinate where edges intersect.
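
In line-based table detection, intersection points are typically found by pairing every vertical edge with every horizontal edge and checking that they cross within a tolerance. The sketch below follows that general idea with simplified local types; it is not docmill's code, and `intersections` is a hypothetical name.

```go
package main

import "fmt"

// edge and point are simplified local versions of the Edge and Point types.
type edge struct {
	X0, X1, Top, Bottom float64
	Orientation         string // "h" or "v"
}

type point struct {
	X, Y float64
}

// intersections returns every point where a vertical edge crosses a
// horizontal edge, allowing each edge to fall short by up to tol.
func intersections(edges []edge, tol float64) []point {
	var pts []point
	for _, v := range edges {
		if v.Orientation != "v" {
			continue
		}
		for _, h := range edges {
			if h.Orientation != "h" {
				continue
			}
			crossesX := v.X0 >= h.X0-tol && v.X0 <= h.X1+tol
			crossesY := h.Top >= v.Top-tol && h.Top <= v.Bottom+tol
			if crossesX && crossesY {
				pts = append(pts, point{v.X0, h.Top})
			}
		}
	}
	return pts
}

func main() {
	// Four edges forming a 10x10 box.
	box := []edge{
		{0, 10, 0, 0, "h"}, {0, 10, 10, 10, "h"},
		{0, 0, 0, 10, "v"}, {10, 10, 0, 10, "v"},
	}
	fmt.Println(intersections(box, 0.5))
}
```

Cells can then be derived from groups of four intersection points that are connected by edges.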

type ProcessingMetrics

type ProcessingMetrics struct {
	TotalTime       time.Duration
	DocumentOpen    time.Duration
	PageExtractions []PageMetrics
	Statistics      DocumentStatistics
}

ProcessingMetrics contains timing and statistics for PDF conversion
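
A value of this shape is easy to summarise after a conversion. The sketch below defines a local copy of PageMetrics so that it runs on its own, and `summarize` is a hypothetical helper; with docmill the slice would come from the PageExtractions field returned by ConvertFileWithMetrics.

```go
package main

import (
	"fmt"
	"time"
)

// PageMetrics is a local copy of the documented struct, for illustration.
type PageMetrics struct {
	PageNumber int
	Duration   time.Duration
}

// summarize returns the slowest page and the mean extraction time per page.
func summarize(pages []PageMetrics) (slowest PageMetrics, mean time.Duration) {
	if len(pages) == 0 {
		return PageMetrics{}, 0
	}
	var total time.Duration
	for _, p := range pages {
		total += p.Duration
		if p.Duration > slowest.Duration {
			slowest = p
		}
	}
	return slowest, total / time.Duration(len(pages))
}

func main() {
	pages := []PageMetrics{
		{1, 23 * time.Millisecond},
		{2, 18 * time.Millisecond},
		{3, 40 * time.Millisecond},
	}
	slowest, mean := summarize(pages)
	fmt.Printf("slowest: page %d (%v), mean: %v\n", slowest.PageNumber, slowest.Duration, mean)
}
```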

type RGBA

type RGBA struct {
	R, G, B, A uint
}

RGBA represents a color.

type Rect

type Rect struct {
	X0 float64 // Left
	Y0 float64 // Top (after conversion from PDF coordinates)
	X1 float64 // Right
	Y1 float64 // Bottom (after conversion from PDF coordinates)
}

Rect represents a bounding box. Y0 and Y1 are the top and bottom edges after conversion from PDF coordinates.

func (Rect) CenterX

func (r Rect) CenterX() float64

CenterX returns the horizontal center of the rectangle

func (Rect) CenterY

func (r Rect) CenterY() float64

CenterY returns the vertical center of the rectangle.

func (Rect) Height

func (r Rect) Height() float64

Height returns the height of the rectangle.

func (Rect) Width

func (r Rect) Width() float64

Width returns the width of the rectangle.
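
The field comments on Rect mention a conversion from PDF coordinates. PDF user space places the origin at the bottom-left of the page by default, so a top-left-origin box is obtained by flipping the Y axis against the page height. The sketch below shows that flip together with straightforward versions of the four helpers; `fromPDF` is a hypothetical constructor, and the local `Rect` only mirrors the documented fields.

```go
package main

import "fmt"

// Rect mirrors the documented fields: Y0 is the top edge and Y1 the bottom
// edge, both measured downwards from the top of the page.
type Rect struct {
	X0, Y0, X1, Y1 float64
}

func (r Rect) Width() float64   { return r.X1 - r.X0 }
func (r Rect) Height() float64  { return r.Y1 - r.Y0 }
func (r Rect) CenterX() float64 { return (r.X0 + r.X1) / 2 }
func (r Rect) CenterY() float64 { return (r.Y0 + r.Y1) / 2 }

// fromPDF converts a box given in bottom-left-origin PDF coordinates into a
// top-left-origin Rect by flipping the Y axis against the page height.
func fromPDF(left, bottom, right, top, pageHeight float64) Rect {
	return Rect{X0: left, Y0: pageHeight - top, X1: right, Y1: pageHeight - bottom}
}

func main() {
	// A 100x20 box near the top of a US Letter page (792 points tall).
	r := fromPDF(10, 700, 110, 720, 792)
	fmt.Println(r, r.Width(), r.Height(), r.CenterX(), r.CenterY())
}
```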

type Segment

type Segment struct {
	Words []EnrichedWord
	Box   Rect
}

Segment represents a group of horizontally adjacent content elements. Based on the PDF-TREX algorithm.

type SegmentTableCell

type SegmentTableCell struct {
	Content string
	Row     int
	Column  int
	Box     Rect
}

SegmentTableCell represents a final table cell with 2D coordinates. Used internally by segment-based table detection.

type SegmentTableRow

type SegmentTableRow struct {
	Lines    []TaggedLine
	Segments []Segment
	Box      Rect
}

SegmentTableRow represents a logical table row (may span multiple lines). Used internally by segment-based table detection.

type Table

type Table struct {
	BBox    CellBBox
	Rows    []TableRow
	Cells   []CellBBox // Raw cell bounding boxes
	NumRows int
	NumCols int
}

Table represents a detected table with its structure and content.

func DetectTables

func DetectTables(page *Page, settings TableSettings) []Table

DetectTables finds tables in a page using word alignment or explicit lines. Based on pdfplumber's TableFinder supporting multiple strategies.

func DetectTablesSegmentBased

func DetectTablesSegmentBased(page *Page, thresholds AdaptiveThresholds) []Table

DetectTablesSegmentBased detects tables using a segment-based approach. This is an alternative to line-based detection for PDFs without ruling lines.

type TableArea

type TableArea struct {
	Lines []TaggedLine
	Box   Rect
}

TableArea represents a region containing table lines

type TableCell

type TableCell struct {
	BBox    CellBBox
	Content string
	Words   []EnrichedWord
}

TableCell represents a detected table cell with its content.

type TableColumn

type TableColumn struct {
	Segments []Segment
	Box      Rect
}

TableColumn represents a logical table column

type TableRow

type TableRow struct {
	Cells []TableCell
	BBox  CellBBox
}

TableRow represents a row of cells in a table.

type TableSettings

type TableSettings struct {
	// Strategy for detecting table edges: "text", "lines", "lines_strict", "explicit"
	VerticalStrategy   string
	HorizontalStrategy string

	// Tolerances for snapping close edges together
	SnapTolerance  float64
	SnapXTolerance float64
	SnapYTolerance float64

	// Tolerances for joining edges on the same line
	JoinTolerance  float64
	JoinXTolerance float64
	JoinYTolerance float64

	// Minimum edge length to consider
	EdgeMinLength float64

	// Minimum number of words required to infer edges from text alignment
	MinWordsVertical   int
	MinWordsHorizontal int

	// Tolerances for finding edge intersections
	IntersectionTolerance  float64
	IntersectionXTolerance float64
	IntersectionYTolerance float64
}

TableSettings configures table detection behavior. Based on pdfplumber's TableSettings.

func DefaultTableSettings

func DefaultTableSettings() TableSettings

DefaultTableSettings returns default settings for table detection. Uses "lines" strategy by default to detect explicit line objects in PDFs.

type TaggedLine

type TaggedLine struct {
	Line     Line
	Segments []Segment
	Type     LineType
}

TaggedLine is a line with its type classification

type TextBlock

type TextBlock struct {
	Words            []EnrichedWord
	Lines            []Line
	Rotation         float64 // Rotation angle in degrees
	ReadingDirection string  // "ltr", "rtl", "ttb", "btt"
}

TextBlock represents a block of text with consistent rotation/orientation.

Directories

Path Synopsis
cmd
docmill command
