Documentation ¶
Index ¶
- type AdaptiveThresholds
- type Alignment
- type Block
- type CellBBox
- type Chunk
- type ChunkConfig
- type Column
- type Config
- type Converter
- func (c *Converter) Close()
- func (c *Converter) ConvertBytes(pdfBytes []byte) (string, error)
- func (c *Converter) ConvertBytesChunks(pdfBytes []byte, cc ChunkConfig) ([]Chunk, error)
- func (c *Converter) ConvertFile(filePath string) (string, error)
- func (c *Converter) ConvertFileChunks(filePath string, cc ChunkConfig) ([]Chunk, error)
- func (c *Converter) ConvertFileWithMetrics(filePath string) (string, ProcessingMetrics, error)
- func (c *Converter) ConvertPageRange(filePath string, startPage, endPage int) (string, error)
- func (c *Converter) ConvertReader(reader io.ReadSeeker) (string, error)
- func (c *Converter) GetDocumentInfo(filePath string) (*DocumentInfo, error)
- type Document
- type DocumentInfo
- type DocumentStatistics
- type DocumentStats
- type Edge
- type EnrichedChar
- type EnrichedWord
- type HeadingContext
- type Line
- type LineType
- type Page
- type PageExtractor
- type PageMetrics
- type PageQuality
- type Paragraph
- type Point
- type ProcessingMetrics
- type RGBA
- type Rect
- type Segment
- type SegmentTableCell
- type SegmentTableRow
- type Table
- type TableArea
- type TableCell
- type TableColumn
- type TableRow
- type TableSettings
- type TaggedLine
- type TextBlock
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type AdaptiveThresholds ¶
type AdaptiveThresholds struct {
HorizontalThreshold float64 // hT: for horizontal clustering
VerticalThreshold float64 // vT: for vertical clustering
}
AdaptiveThresholds contains document-specific threshold values.
type ChunkConfig ¶
type ChunkConfig struct {
MaxTokens int
OverlapTokens int
RepeatHeadings bool
EstimateTokens func(s string) int
}
func DefaultChunkConfig ¶
func DefaultChunkConfig() ChunkConfig
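ChunkConfig carries no doc comment here, but its EstimateTokens field accepts any `func(s string) int`. The following is a minimal sketch of such an estimator that counts whitespace-separated words. The `chunkConfig` type is a local copy of the documented fields so the example compiles on its own, and the numeric values are illustrative, not the values returned by DefaultChunkConfig.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkConfig is a local stand-in mirroring the documented ChunkConfig fields.
type chunkConfig struct {
	MaxTokens      int
	OverlapTokens  int
	RepeatHeadings bool
	EstimateTokens func(s string) int
}

// wordCountEstimator approximates the token count as the number of
// whitespace-separated fields. Real tokenizers usually yield more tokens.
func wordCountEstimator(s string) int {
	return len(strings.Fields(s))
}

func main() {
	cc := chunkConfig{
		MaxTokens:      512, // illustrative, not the library default
		OverlapTokens:  64,  // illustrative, not the library default
		RepeatHeadings: true,
		EstimateTokens: wordCountEstimator,
	}
	fmt.Println(cc.EstimateTokens("## Results\n\nThe table shows three rows."))
}
```

With the real package, a function of this shape could presumably be assigned to ChunkConfig.EstimateTokens before calling ConvertFileChunks or ConvertBytesChunks.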
type Column ¶
type Column struct {
Box Rect
Words []EnrichedWord
Paragraphs []Paragraph
Index int // Column number (0-indexed from left to right)
}
Column represents a vertical column of text in a multi-column layout.
type Config ¶
type Config struct {
// IncludePageBreaks adds "---" separators between pages (default: true)
IncludePageBreaks bool
// MinHeadingFontSize is the minimum font size difference to detect headings
// A value of 0 disables size-based heading detection (default: 1.15x body text)
MinHeadingFontSize float64
// DetectTables enables table detection and extraction (default: false)
DetectTables bool
// TableSettings configures table detection behavior (default: DefaultTableSettings())
TableSettings TableSettings
// UseSegmentBasedTables enables PDF-TREX segment-based table detection
// This works better for tables without ruling lines (default: true)
UseSegmentBasedTables bool
// UseAdaptiveThresholds enables document-specific threshold calculation
// Based on spacing distribution analysis (default: true)
UseAdaptiveThresholds bool
// EnableMetricsLogging enables processing time and statistics logging (default: false)
EnableMetricsLogging bool
// MaxConcurrency controls how many pages are processed concurrently during
// the structure detection phase. PDFium extraction is always sequential,
// but paragraph/table/heading detection runs in parallel. (default: 10)
MaxConcurrency int
}
Config controls markdown conversion behavior.
func DefaultConfig ¶
func DefaultConfig() Config
DefaultConfig returns the default converter configuration.
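The MaxConcurrency comment describes a two-stage pipeline: sequential extraction followed by parallel structure detection. A bounded worker pool is one common way to implement the second stage. The sketch below is an assumption about the general pattern, not the package's code; `processPages` and the string stand-ins for pages are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// processPages applies analyze to every page with at most maxConcurrency
// goroutines running at once. Results keep the input order.
func processPages(pages []string, maxConcurrency int, analyze func(string) string) []string {
	if maxConcurrency < 1 {
		maxConcurrency = 1
	}
	results := make([]string, len(pages))
	sem := make(chan struct{}, maxConcurrency) // counting semaphore
	var wg sync.WaitGroup
	for i, p := range pages {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrency workers are busy
		go func(i int, p string) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = analyze(p) // each goroutine writes only its own slot
		}(i, p)
	}
	wg.Wait()
	return results
}

func main() {
	// Stage 1 (sequential): stand-in for raw text extracted page by page.
	pages := []string{"page one", "page two", "page three"}
	// Stage 2 (parallel): stand-in for paragraph/table/heading detection.
	out := processPages(pages, 2, func(s string) string { return "[" + s + "]" })
	fmt.Println(out)
}
```

Writing each result into a preallocated slot avoids a mutex and preserves page order regardless of which goroutine finishes first.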
type Converter ¶
type Converter struct {
// contains filtered or unexported fields
}
Converter converts PDFs to markdown using pdfium text extraction.
func New ¶
New creates a new PDF to markdown converter with default configuration. The returned Converter manages its own pdfium pool and must be closed with Close when no longer needed.
func NewConverter ¶
NewConverter creates a new PDF to markdown converter with default configuration. The caller is responsible for managing the pdfium pool lifecycle.
func NewConverterWithConfig ¶
NewConverterWithConfig creates a new PDF to markdown converter with custom configuration. The caller is responsible for managing the pdfium pool lifecycle.
func NewWithConfig ¶
NewWithConfig creates a new PDF to markdown converter with custom configuration. The returned Converter manages its own pdfium pool and must be closed with Close when no longer needed.
func (*Converter) Close ¶
func (c *Converter) Close()
Close releases resources held by the Converter. Only required for converters created with New or NewWithConfig.
func (*Converter) ConvertBytes ¶
ConvertBytes converts PDF bytes to markdown.
func (*Converter) ConvertBytesChunks ¶
func (c *Converter) ConvertBytesChunks(pdfBytes []byte, cc ChunkConfig) ([]Chunk, error)
func (*Converter) ConvertFile ¶
ConvertFile converts a PDF file to markdown.
func (*Converter) ConvertFileChunks ¶
func (c *Converter) ConvertFileChunks(filePath string, cc ChunkConfig) ([]Chunk, error)
func (*Converter) ConvertFileWithMetrics ¶
func (c *Converter) ConvertFileWithMetrics(filePath string) (string, ProcessingMetrics, error)
ConvertFileWithMetrics converts a PDF and returns both the markdown and the processing metrics.
func (*Converter) ConvertPageRange ¶
ConvertPageRange converts a specific range of pages to markdown.
func (*Converter) ConvertReader ¶
func (c *Converter) ConvertReader(reader io.ReadSeeker) (string, error)
ConvertReader converts a PDF from an io.ReadSeeker to markdown.
func (*Converter) GetDocumentInfo ¶
func (c *Converter) GetDocumentInfo(filePath string) (*DocumentInfo, error)
GetDocumentInfo returns basic information about a PDF without converting it.
type Document ¶
type Document struct {
Pages []Page
Stats DocumentStats
}
Document represents the complete extracted document structure.
func (*Document) ToMarkdown ¶
ToMarkdown converts a document to markdown format.
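Config.IncludePageBreaks is documented as adding "---" separators between pages. A minimal sketch of that joining step follows, assuming each page has already been rendered to a markdown string. `joinPages` is a hypothetical helper, and the exact blank-line padding around the separator is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// joinPages concatenates rendered pages, optionally inserting a markdown
// thematic break ("---") between them.
func joinPages(pages []string, includePageBreaks bool) string {
	sep := "\n\n"
	if includePageBreaks {
		sep = "\n\n---\n\n"
	}
	return strings.Join(pages, sep)
}

func main() {
	fmt.Println(joinPages([]string{"# Title", "Second page text."}, true))
}
```

The blank lines around "---" keep markdown renderers from reading it as a setext heading underline for the preceding line.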
type DocumentInfo ¶
type DocumentInfo struct {
PageCount int
}
DocumentInfo contains basic information about a PDF document.
type DocumentStatistics ¶
type DocumentStatistics struct {
TotalPages int
TotalParagraphs int
TotalTables int
TotalHeadings int
TotalWords int
TotalCharacters int
}
DocumentStatistics contains document-level counts of pages, paragraphs, tables, headings, words, and characters.
type DocumentStats ¶
type DocumentStats struct {
MostUsedFontSize float64 // Most common font size (body text)
MostUsedFontName string // Most common font name
MostUsedLineGap float64 // Most common line spacing
FontSizeFreq map[float64]int // Frequency map of font sizes
FontNameFreq map[string]int // Frequency map of font names
MaxFontSize float64 // Largest font size in document
}
DocumentStats holds document-wide font and spacing statistics. These are calculated across all pages as hints for structure detection.
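The relationship between FontSizeFreq and MostUsedFontSize can be illustrated by picking the most frequent key of the map. This is a sketch of that idea under the assumption that the body font is simply the mode; `mostUsedFontSize` is a hypothetical helper, and the package may break ties or round sizes differently.

```go
package main

import "fmt"

// mostUsedFontSize returns the font size with the highest count.
// Ties go to the smaller size so the result does not depend on Go's
// randomized map iteration order. An empty map yields 0.
func mostUsedFontSize(freq map[float64]int) float64 {
	best, bestCount := 0.0, 0
	for size, count := range freq {
		if count > bestCount || (count == bestCount && size < best) {
			best, bestCount = size, count
		}
	}
	return best
}

func main() {
	freq := map[float64]int{10: 1200, 12: 85, 18: 14}
	fmt.Println(mostUsedFontSize(freq))
}
```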
type Edge ¶
type Edge struct {
X0 float64 // Left x coordinate
X1 float64 // Right x coordinate
Top float64 // Top y coordinate
Bottom float64 // Bottom y coordinate
Width float64 // Width (for horizontal edges)
Height float64 // Height (for vertical edges)
Orientation string // "h" for horizontal, "v" for vertical
}
Edge represents a horizontal or vertical line segment used for table detection. Based on pdfplumber's edge structure.
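To make the field layout concrete, here is a sketch that builds an axis-aligned edge from two endpoints and sets Orientation to "h" or "v". The `edge` type is a local copy of the documented fields; `newEdge` and its tolerance parameter are hypothetical and not taken from the package.

```go
package main

import (
	"fmt"
	"math"
)

// edge is a local stand-in mirroring the documented Edge fields.
type edge struct {
	X0, X1, Top, Bottom float64
	Width, Height       float64
	Orientation         string
}

// newEdge normalizes the endpoints and classifies the segment. ok is false
// when the segment is neither horizontal nor vertical within tol.
func newEdge(x0, y0, x1, y1, tol float64) (e edge, ok bool) {
	e = edge{
		X0: math.Min(x0, x1), X1: math.Max(x0, x1),
		Top: math.Min(y0, y1), Bottom: math.Max(y0, y1),
	}
	e.Width, e.Height = e.X1-e.X0, e.Bottom-e.Top
	switch {
	case e.Height <= tol && e.Width > tol:
		e.Orientation = "h"
	case e.Width <= tol && e.Height > tol:
		e.Orientation = "v"
	default:
		return edge{}, false
	}
	return e, true
}

func main() {
	if e, ok := newEdge(50, 100, 300, 100.2, 0.5); ok {
		fmt.Println(e.Orientation, e.Width)
	}
	_, ok := newEdge(0, 0, 10, 10, 0.5)
	fmt.Println(ok)
}
```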
type EnrichedChar ¶
type EnrichedChar struct {
Text rune
Box Rect
FontSize float64
FontWeight int
FontName string
FontFlags int
FillColor RGBA
Angle float32
IsHyphen bool
}
EnrichedChar represents a single character with all its metadata.
type EnrichedWord ¶
type EnrichedWord struct {
Text string
Box Rect
FontSize float64 // Average font size
FontWeight int // Dominant font weight
FontName string // Dominant font name
FontFlags int // Dominant font flags
FillColor RGBA // Dominant fill color
IsBold bool
IsItalic bool
IsMonospace bool
Baseline float64 // Y-coordinate of the text baseline
XHeight float64 // Height of lowercase letters
Rotation float64 // Rotation angle in degrees (0, 90, 180, 270, etc.)
}
EnrichedWord represents a word with aggregated style information.
func (EnrichedWord) IsBulletOrNumber ¶
func (w EnrichedWord) IsBulletOrNumber() bool
IsBulletOrNumber checks if the word looks like a list marker.
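The exact rules behind IsBulletOrNumber are not documented. As an illustration of what a list-marker check of this kind might look like, the sketch below accepts a few bullet glyphs and short numbered or lettered markers. The glyph set and the regular expression are assumptions, and `looksLikeListMarker` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// numbered matches markers such as "1.", "12)", "a." or "(iv)".
var numbered = regexp.MustCompile(`^\(?([0-9]{1,3}|[A-Za-z]|[ivxIVX]{1,4})[.)]$`)

// looksLikeListMarker reports whether word is a bullet glyph or a short
// numbered/lettered marker.
func looksLikeListMarker(word string) bool {
	switch word {
	case "•", "-", "*", "◦", "▪":
		return true
	}
	return numbered.MatchString(word)
}

func main() {
	for _, w := range []string{"•", "1.", "(iv)", "3.14", "Hello"} {
		fmt.Printf("%q -> %v\n", w, looksLikeListMarker(w))
	}
}
```

Anchoring the pattern at both ends keeps ordinary numbers such as "3.14" from being treated as markers.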
type HeadingContext ¶
type Line ¶
type Line struct {
Words []EnrichedWord
Box Rect
Baseline float64 // Y-coordinate of the baseline
}
Line represents a horizontal line of text.
type LineType ¶
type LineType string
LineType represents the classification of a line in table detection.
type Page ¶
type Page struct {
Number int
Width float64
Height float64
Quality PageQuality
Paragraphs []Paragraph
Tables []Table
Lines []Edge // Explicit line objects extracted from PDF
Columns []Column // Detected column layout
}
Page represents all extracted content from a PDF page.
func ExtractPage ¶
func ExtractPage(instance pdfium.Pdfium, page references.FPDF_PAGE, pageNumber int, config Config) (*Page, error)
ExtractPage extracts all enriched text from a PDF page.
func (*Page) ToMarkdown ¶
ToMarkdown converts a single page to markdown.
type PageExtractor ¶
type PageExtractor struct {
// contains filtered or unexported fields
}
PageExtractor provides context for extracting text from a page.
type PageMetrics ¶
PageMetrics contains timing information for a single page.
type PageQuality ¶
type Paragraph ¶
type Paragraph struct {
Lines []Line
Box Rect
Alignment Alignment
IsHeading bool
HeadingLevel int // 1-6 for markdown headings
IsList bool
IsCode bool
Indent float64 // Left indentation
}
Paragraph represents a block of text.
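The IsHeading, HeadingLevel, and IsList flags suggest how a paragraph maps onto markdown syntax. The sketch below shows one such mapping. The `paragraph` type is a simplified stand-in (a single Text field replaces Lines), and `render` is a hypothetical helper rather than the package's ToMarkdown logic.

```go
package main

import (
	"fmt"
	"strings"
)

// paragraph is a simplified stand-in for the documented Paragraph type.
// Text replaces the real Lines field to keep the sketch short.
type paragraph struct {
	Text         string
	IsHeading    bool
	HeadingLevel int
	IsList       bool
}

// render maps the flags to a markdown line. Heading levels are clamped to 1-6.
func render(p paragraph) string {
	switch {
	case p.IsHeading:
		level := p.HeadingLevel
		if level < 1 {
			level = 1
		}
		if level > 6 {
			level = 6
		}
		return strings.Repeat("#", level) + " " + p.Text
	case p.IsList:
		return "- " + p.Text
	default:
		return p.Text
	}
}

func main() {
	fmt.Println(render(paragraph{Text: "Methods", IsHeading: true, HeadingLevel: 2}))
	fmt.Println(render(paragraph{Text: "first item", IsList: true}))
}
```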
type ProcessingMetrics ¶
type ProcessingMetrics struct {
TotalTime time.Duration
DocumentOpen time.Duration
PageExtractions []PageMetrics
Statistics DocumentStatistics
}
ProcessingMetrics contains timing and statistics for a PDF conversion.
type Rect ¶
type Rect struct {
X0 float64 // Left
Y0 float64 // Top (after conversion from PDF coordinates)
X1 float64 // Right
Y1 float64 // Bottom (after conversion from PDF coordinates)
}
Rect represents a bounding box in PDF coordinates.
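The field comments indicate that Y0 is the top and Y1 the bottom after conversion from PDF coordinates. PDF user space by default places the origin at the bottom-left with y increasing upward, so a conversion to a top-left origin flips y against the page height. The sketch below shows that flip; `fromPDF` is a hypothetical helper, and whether the package converts in exactly this way is an assumption.

```go
package main

import "fmt"

// rect is a local stand-in mirroring the documented Rect fields.
type rect struct{ X0, Y0, X1, Y1 float64 }

// fromPDF converts a box given in PDF user space (origin bottom-left, y up)
// to a top-left-origin box where Y0 is the top edge and Y1 the bottom edge.
func fromPDF(left, bottom, right, top, pageHeight float64) rect {
	return rect{X0: left, Y0: pageHeight - top, X1: right, Y1: pageHeight - bottom}
}

func main() {
	// A 100x20 pt box whose bottom edge sits 700 pt above the bottom of a
	// US Letter page (792 pt tall).
	r := fromPDF(72, 700, 172, 720, 792)
	fmt.Printf("%+v\n", r)
}
```

After the flip, Y0 is less than Y1, so boxes higher on the page sort first when ordered by Y0.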
type Segment ¶
type Segment struct {
Words []EnrichedWord
Box Rect
}
Segment represents a group of horizontally adjacent content elements. Based on the PDF-TREX algorithm.
type SegmentTableCell ¶
SegmentTableCell represents a final table cell with 2D coordinates. Used internally by segment-based table detection.
type SegmentTableRow ¶
type SegmentTableRow struct {
Lines []TaggedLine
Segments []Segment
Box Rect
}
SegmentTableRow represents a logical table row (which may span multiple lines). Used internally by segment-based table detection.
type Table ¶
type Table struct {
BBox CellBBox
Rows []TableRow
Cells []CellBBox // Raw cell bounding boxes
NumRows int
NumCols int
}
Table represents a detected table with its structure and content.
func DetectTables ¶
func DetectTables(page *Page, settings TableSettings) []Table
DetectTables finds tables in a page using word alignment or explicit lines. Based on pdfplumber's TableFinder supporting multiple strategies.
func DetectTablesSegmentBased ¶
func DetectTablesSegmentBased(page *Page, thresholds AdaptiveThresholds) []Table
DetectTablesSegmentBased detects tables using a segment-based approach. This is an alternative to line-based detection for PDFs without ruling lines.
type TableArea ¶
type TableArea struct {
Lines []TaggedLine
Box Rect
}
TableArea represents a region containing table lines.
type TableCell ¶
type TableCell struct {
BBox CellBBox
Content string
Words []EnrichedWord
}
TableCell represents a detected table cell with its content.
type TableColumn ¶
TableColumn represents a logical table column.
type TableSettings ¶
type TableSettings struct {
// Strategy for detecting table edges: "text", "lines", "lines_strict", "explicit"
VerticalStrategy string
HorizontalStrategy string
// Tolerances for snapping close edges together
SnapTolerance float64
SnapXTolerance float64
SnapYTolerance float64
// Tolerances for joining edges on the same line
JoinTolerance float64
JoinXTolerance float64
JoinYTolerance float64
// Minimum edge length to consider
EdgeMinLength float64
// Minimum number of words required to infer edges from text alignment
MinWordsVertical int
MinWordsHorizontal int
// Tolerances for finding edge intersections
IntersectionTolerance float64
IntersectionXTolerance float64
IntersectionYTolerance float64
}
TableSettings configures table detection behavior. Based on pdfplumber's TableSettings.
func DefaultTableSettings ¶
func DefaultTableSettings() TableSettings
DefaultTableSettings returns default settings for table detection. Uses the "lines" strategy by default to detect explicit line objects in PDFs.
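The snap tolerances are described as "snapping close edges together". A minimal sketch of that idea follows: coordinates whose neighbours lie within the tolerance are grouped, and each group collapses to its mean. This follows the general clustering approach used in pdfplumber-style edge snapping; `snap` is a hypothetical helper, and the package's handling may differ in detail.

```go
package main

import (
	"fmt"
	"sort"
)

// snap sorts the coordinates, groups values whose neighbours differ by at
// most tol, and returns one representative (the mean) per group.
func snap(coords []float64, tol float64) []float64 {
	if len(coords) == 0 {
		return nil
	}
	sorted := append([]float64(nil), coords...)
	sort.Float64s(sorted)

	var out []float64
	prev := sorted[0]
	sum, n := sorted[0], 1.0
	for _, c := range sorted[1:] {
		if c-prev <= tol {
			sum += c
			n++
		} else {
			out = append(out, sum/n) // close the current group
			sum, n = c, 1
		}
		prev = c
	}
	return append(out, sum/n)
}

func main() {
	// Three nearly coincident vertical rules and one distant rule.
	fmt.Println(snap([]float64{102, 100, 200, 101}, 3))
}
```

A larger tolerance merges more edges, which helps with slightly misaligned ruling lines but can fuse narrow columns.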
type TaggedLine ¶
TaggedLine is a line with its type classification.