Overview ¶
Package extractor implements PDF content extraction use cases.
This is the Application layer in DDD/Clean Architecture. It orchestrates domain logic and infrastructure for extracting content from PDFs.
Index ¶
- type CMapParser
- type CMapTable
- type CellExtractor
- type Color
- type ContentParser
- type FontDecoder
- type GraphicsElement
- type GraphicsParser
- type GraphicsState
- type GraphicsType
- type ImageExtractor
- type Matrix
- type Operator
- type Point
- type Rectangle
- type TextChunk
- type TextElement
- func (te *TextElement) Bottom() float64
- func (te *TextElement) CenterX() float64
- func (te *TextElement) CenterY() float64
- func (te *TextElement) Left() float64
- func (te *TextElement) Right() float64
- func (te *TextElement) String() string
- func (te *TextElement) Top() float64
- func (te *TextElement) VerticalOverlapRatio(other *TextElement) float64
- type TextExtractor
- type TextState
- func (ts *TextState) AdvanceX(width float64)
- func (ts *TextState) MoveToNextLine()
- func (ts *TextState) Reset()
- func (ts *TextState) SetFont(fontName string, fontSize float64)
- func (ts *TextState) SetTextMatrix(a, b, c, d, e, f float64)
- func (ts *TextState) String() string
- func (ts *TextState) Translate(tx, ty float64)
- func (ts *TextState) TranslateSetLeading(tx, ty float64)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type CMapParser ¶
type CMapParser struct {
// contains filtered or unexported fields
}
CMapParser parses CMap (Character Map) streams from PDF ToUnicode entries.
CMap Format (simplified):
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
% Single character mappings
3 beginbfchar
<0001> <0412> % Glyph 0x01 → U+0412 'В'
<0002> <044B> % Glyph 0x02 → U+044B 'ы'
<0003> <043F> % Glyph 0x03 → U+043F 'п'
endbfchar
% Range mappings
1 beginbfrange
<0010> <0020> <0430> % Glyphs 0x10-0x20 → U+0430-U+0440
endbfrange
endcmap
Reference: PDF 1.7 specification, Section 9.10.3 (ToUnicode CMaps).
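To make the format above concrete, here is a minimal sketch of how bfchar and bfrange sections could be read. It is not the package's CMapParser: parseToUnicode and hexToken are hypothetical names, and the sketch assumes one entry per line and ignores array-form bfrange entries and multi-code-point targets.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// hexToken matches one <hex> token such as <0412>.
var hexToken = regexp.MustCompile(`<([0-9A-Fa-f]+)>`)

// parseToUnicode is a hypothetical helper (not the package API). It scans a
// ToUnicode CMap line by line and collects bfchar and bfrange entries.
func parseToUnicode(data string) map[uint16]rune {
	table := make(map[uint16]rune)
	section := "" // "", "bfchar", or "bfrange"
	for _, line := range strings.Split(data, "\n") {
		// Drop trailing comments, which start with '%'.
		if i := strings.IndexByte(line, '%'); i >= 0 {
			line = line[:i]
		}
		switch {
		case strings.Contains(line, "beginbfchar"):
			section = "bfchar"
			continue
		case strings.Contains(line, "beginbfrange"):
			section = "bfrange"
			continue
		case strings.Contains(line, "endbfchar"), strings.Contains(line, "endbfrange"):
			section = ""
			continue
		}
		// Collect the hex values on this line; skip the line if any is invalid.
		var vals []uint64
		for _, m := range hexToken.FindAllStringSubmatch(line, -1) {
			v, err := strconv.ParseUint(m[1], 16, 32)
			if err != nil {
				vals = nil
				break
			}
			vals = append(vals, v)
		}
		switch {
		case section == "bfchar" && len(vals) == 2:
			table[uint16(vals[0])] = rune(vals[1])
		case section == "bfrange" && len(vals) == 3 && vals[0] <= vals[1]:
			// Consecutive glyph IDs map to consecutive code points.
			for g := vals[0]; g <= vals[1]; g++ {
				table[uint16(g)] = rune(vals[2] + (g - vals[0]))
			}
		}
	}
	return table
}

func main() {
	cmap := "3 beginbfchar\n<0001> <0412>\n<0002> <044B>\n<0003> <043F>\nendbfchar\n" +
		"1 beginbfrange\n<0010> <0020> <0430>\nendbfrange\n"
	table := parseToUnicode(cmap)
	fmt.Println(len(table), string(table[0x01]), string(table[0x20]))
}
```

Lines outside a bfchar or bfrange section are skipped, which mirrors the "silently ignore what is unsupported" behavior described for Parse.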
func NewCMapParser ¶
func NewCMapParser(data []byte) *CMapParser
NewCMapParser creates a new CMapParser for the given stream data.
The stream should be the decoded content of a ToUnicode CMap stream.
func (*CMapParser) Parse ¶
func (p *CMapParser) Parse() (*CMapTable, error)
Parse parses the CMap stream and returns a CMapTable.
The parser handles:
- beginbfchar/endbfchar: Single character mappings
- beginbfrange/endbfrange: Range mappings
Unsupported operators are silently ignored for graceful degradation.
type CMapTable ¶
type CMapTable struct {
// contains filtered or unexported fields
}
CMapTable represents a Character Map that maps glyph IDs to Unicode code points.
CMap (Character Map) defines the mapping between character codes (glyph IDs) used in a PDF font and the corresponding Unicode values. This is essential for extracting readable text from PDFs, especially for custom encodings and non-Latin scripts like Cyrillic, Chinese, Japanese, etc.
The mapping is stored as glyph ID (uint16) → Unicode rune (int32).
Reference: PDF 1.7 specification, Section 9.10.3 (ToUnicode CMaps).
func NewCMapTable ¶
NewCMapTable creates a new empty CMapTable.
func ParseCMapStream ¶
ParseCMapStream is a convenience function that parses a CMap stream.
This is equivalent to:
parser := NewCMapParser(data)
return parser.Parse()
func (*CMapTable) AddMapping ¶
AddMapping adds a single glyph ID to Unicode mapping.
func (*CMapTable) AddRangeMapping ¶
AddRangeMapping adds a range of glyph IDs to consecutive Unicode values.
For example: AddRangeMapping(0x10, 0x20, 0x0430) maps:
- Glyph 0x10 → U+0430 ('а')
- Glyph 0x11 → U+0431 ('б')
- ...
- Glyph 0x20 → U+0440 ('р')
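The range example above can be sketched with a plain map. This is an illustration only: addRange and the map-based table are hypothetical stand-ins, since the internals of CMapTable are not shown in this documentation.

```go
package main

import "fmt"

// addRange is a hypothetical stand-in for a range mapping: glyph IDs
// start..end (inclusive) map to consecutive code points beginning at base.
func addRange(table map[uint16]rune, start, end uint16, base rune) {
	// Iterate in uint32 so the loop terminates even when end is 0xFFFF.
	for g := uint32(start); g <= uint32(end); g++ {
		table[uint16(g)] = base + rune(g-uint32(start))
	}
}

func main() {
	table := make(map[uint16]rune)
	addRange(table, 0x10, 0x20, 0x0430)

	// Lookup in the same "value, found" style that GetUnicode describes.
	if r, ok := table[0x11]; ok {
		fmt.Printf("glyph 0x11 -> U+%04X %c\n", r, r)
	}
	if _, ok := table[0x21]; !ok {
		fmt.Println("glyph 0x21 has no mapping")
	}
}
```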
func (*CMapTable) GetUnicode ¶
GetUnicode returns the Unicode code point for a given glyph ID.
Returns the Unicode rune and true if mapping exists, or 0 and false if not found.
type CellExtractor ¶
type CellExtractor struct {
// contains filtered or unexported fields
}
CellExtractor extracts text content from a rectangular cell region.
The extractor:
- Finds all text elements within cell bounds
- Sorts text by position (top to bottom, left to right)
- Joins text with proper spacing and line breaks
- Handles multi-line content
This is a critical component for table extraction (Phase 2.7).
func NewCellExtractor ¶
func NewCellExtractor(textElements []*TextElement) *CellExtractor
NewCellExtractor creates a new CellExtractor with the given text elements.
func (*CellExtractor) ExtractCellContent ¶
func (ce *CellExtractor) ExtractCellContent(bounds Rectangle) string
ExtractCellContent extracts text from a rectangular region (cell bounds).
Algorithm:
- Find all text elements within the cell bounds
- Group text elements by line (based on Y position)
- Sort lines from top to bottom
- Within each line, sort elements left to right
- Join text with appropriate spacing
Parameters:
- bounds: The rectangular region to extract text from
Returns the extracted text, or empty string if no text is found.
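The algorithm steps above can be sketched as follows. This is a simplified illustration, not the package's implementation: elem, rect, cellText, and the lineTol parameter are hypothetical, containment is assumed to use the element's center point, and words on a line are assumed to be joined with a single space.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// elem and rect are hypothetical, trimmed-down stand-ins for TextElement and Rectangle.
type elem struct {
	Text                string
	X, Y, Width, Height float64
}

type rect struct{ X, Y, Width, Height float64 }

// cellText returns the text inside b, one output line per detected text line.
// lineTol is an assumed tolerance for treating Y positions as the same line.
func cellText(elems []elem, b rect, lineTol float64) string {
	// Step 1: keep elements whose center lies inside the bounds.
	var in []elem
	for _, e := range elems {
		cx, cy := e.X+e.Width/2, e.Y+e.Height/2
		if cx >= b.X && cx <= b.X+b.Width && cy >= b.Y && cy <= b.Y+b.Height {
			in = append(in, e)
		}
	}
	// Steps 2-3: sort top to bottom (larger Y is higher in PDF space) and
	// start a new line whenever Y drops by more than lineTol.
	sort.SliceStable(in, func(i, j int) bool { return in[i].Y > in[j].Y })
	var lines [][]elem
	for _, e := range in {
		n := len(lines)
		if n > 0 && lines[n-1][0].Y-e.Y <= lineTol {
			lines[n-1] = append(lines[n-1], e)
		} else {
			lines = append(lines, []elem{e})
		}
	}
	// Steps 4-5: order each line left to right and join the pieces.
	out := make([]string, 0, len(lines))
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].X < line[j].X })
		words := make([]string, len(line))
		for i, e := range line {
			words[i] = e.Text
		}
		out = append(out, strings.Join(words, " "))
	}
	return strings.Join(out, "\n")
}

func main() {
	elems := []elem{
		{"World", 60, 100, 30, 10},
		{"Hello", 10, 100.5, 30, 10},
		{"Next", 10, 85, 30, 10},
		{"Outside", 500, 100, 30, 10},
	}
	fmt.Println(cellText(elems, rect{0, 80, 200, 40}, 2))
}
```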
func (*CellExtractor) FindElementsInBounds ¶
func (ce *CellExtractor) FindElementsInBounds(bounds Rectangle) []*TextElement
FindElementsInBounds returns all text elements that are within the bounds.
An element is considered "within" if its center point is inside the bounds. This handles cases where text might slightly overlap cell boundaries.
This method is exported for use by other extractors (e.g., table alignment detection).
type Color ¶
type Color struct {
R, G, B float64 // RGB values (0.0 - 1.0)
}
Color represents an RGB color.
type ContentParser ¶
type ContentParser struct {
// contains filtered or unexported fields
}
ContentParser parses PDF content streams into operators.
Content streams contain a sequence of operators that describe page graphics and text. The parser reads the stream and extracts operators with their operands.
Example content stream:
BT
/F1 12 Tf
100 200 Td
(Hello, World!) Tj
ET
This would be parsed into operators:
- Operator{Name: "BT"}
- Operator{Name: "Tf", Operands: ["/F1", 12]}
- Operator{Name: "Td", Operands: [100, 200]}
- Operator{Name: "Tj", Operands: ["(Hello, World!)"]}
- Operator{Name: "ET"}
Reference: PDF 1.7 specification, Section 7.8.2 (Content Streams).
func NewContentParser ¶
func NewContentParser(content []byte) *ContentParser
NewContentParser creates a new ContentParser for the given content stream.
func (*ContentParser) ParseOperators ¶
func (cp *ContentParser) ParseOperators() ([]*Operator, error)
ParseOperators parses all operators from the content stream.
Returns a slice of operators in the order they appear in the stream. Returns an error if parsing fails.
Content streams are sequences of objects followed by operators (keywords). Example: "100 200 Td" means: push 100, push 200, execute Td operator.
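The operand-then-operator structure can be illustrated with a small sketch. It is not ContentParser: op, tokenize, isOperand, and parseOps are hypothetical, operands are kept as raw strings, and only numbers, names, and simple literal strings are recognized (no nested parentheses, escapes, arrays, dictionaries, or hex strings).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// op is a hypothetical, simplified operator record with string operands.
type op struct {
	Name     string
	Operands []string
}

// isOperand reports whether a token is a number, a /Name, or a (string).
func isOperand(tok string) bool {
	if strings.HasPrefix(tok, "/") || strings.HasPrefix(tok, "(") {
		return true
	}
	_, err := strconv.ParseFloat(tok, 64)
	return err == nil
}

// tokenize splits on whitespace but keeps a (literal string) in one token.
func tokenize(s string) []string {
	var toks []string
	var cur strings.Builder
	inString := false
	flush := func() {
		if cur.Len() > 0 {
			toks = append(toks, cur.String())
			cur.Reset()
		}
	}
	for _, r := range s {
		switch {
		case inString:
			cur.WriteRune(r)
			if r == ')' {
				inString = false
				flush()
			}
		case r == '(':
			flush()
			inString = true
			cur.WriteRune(r)
		case r == ' ' || r == '\n' || r == '\t' || r == '\r':
			flush()
		default:
			cur.WriteRune(r)
		}
	}
	flush()
	return toks
}

// parseOps pushes operands until a keyword arrives, then emits an operator
// that takes everything pushed so far.
func parseOps(s string) []op {
	var ops []op
	var stack []string
	for _, tok := range tokenize(s) {
		if isOperand(tok) {
			stack = append(stack, tok)
			continue
		}
		ops = append(ops, op{Name: tok, Operands: stack})
		stack = nil
	}
	return ops
}

func main() {
	for _, o := range parseOps("BT /F1 12 Tf 100 200 Td (Hello, World!) Tj ET") {
		fmt.Printf("%s %q\n", o.Name, o.Operands)
	}
}
```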
type FontDecoder ¶
type FontDecoder struct {
// contains filtered or unexported fields
}
FontDecoder decodes glyph byte sequences to Unicode strings using CMap tables.
PDF fonts can use various encodings for text strings:
- Built-in encodings: WinAnsiEncoding, MacRomanEncoding, etc.
- Custom encodings: ToUnicode CMap (most common for non-Latin scripts)
- Identity encodings: Direct mapping (no decoding needed)
The FontDecoder handles all these cases and converts raw glyph bytes to readable Unicode text.
Reference: PDF 1.7 specification, Section 9.6.6 (Character Encoding).
func NewFontDecoder ¶
func NewFontDecoder(cmap *CMapTable, encoding string, use2ByteGlyphs bool) *FontDecoder
NewFontDecoder creates a new FontDecoder with the given CMap and encoding.
Parameters:
- cmap: ToUnicode CMap table (can be nil if not available)
- encoding: Base encoding name (e.g., "WinAnsiEncoding", "Identity-H")
- use2ByteGlyphs: true for 2-byte glyphs (CIDFonts), false for 1-byte glyphs
func NewFontDecoderWithCMap ¶
func NewFontDecoderWithCMap(cmap *CMapTable) *FontDecoder
NewFontDecoderWithCMap creates a FontDecoder that uses only CMap decoding.
This is a convenience constructor for the most common case: custom fonts with ToUnicode CMap (e.g., embedded fonts with Cyrillic text).
func NewFontDecoderWithCustomEncoding ¶
func NewFontDecoderWithCustomEncoding(differences map[uint16]string, baseEncoding string, use2ByteGlyphs bool) *FontDecoder
NewFontDecoderWithCustomEncoding creates a FontDecoder with custom glyph mappings.
This is used when a font has a /Encoding dictionary with /Differences array but no ToUnicode CMap.
Parameters:
- differences: Map of glyph ID → glyph name (from /Encoding/Differences)
- baseEncoding: Base encoding name (e.g., "WinAnsiEncoding")
- use2ByteGlyphs: true for 2-byte glyphs, false for 1-byte glyphs
Returns:
- FontDecoder configured with custom glyph mappings
func (*FontDecoder) DecodeString ¶
func (d *FontDecoder) DecodeString(glyphBytes []byte) string
DecodeString decodes a glyph byte sequence to a Unicode string.
The decoding process:
- Split bytes into glyphs (1-byte or 2-byte depending on font)
- For each glyph ID, look up Unicode in CMap table
- If CMap lookup fails, try built-in encoding (WinAnsi, MacRoman)
- If all else fails, treat as ISO-8859-1 (Latin-1) for ASCII compatibility
Returns the decoded Unicode string. Invalid glyphs are replaced with Unicode replacement character (U+FFFD).
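A minimal sketch of the lookup-then-fallback chain described above. The decode function and the map-based table are hypothetical; the built-in encoding step (WinAnsi, MacRoman) is omitted, and 2-byte glyph IDs are assumed to be big-endian.

```go
package main

import "fmt"

// decode is a hypothetical illustration of the fallback chain:
// CMap lookup, then Latin-1 for single-byte values, then U+FFFD.
func decode(glyphs []byte, cmap map[uint16]rune, twoByte bool) string {
	step := 1
	if twoByte {
		step = 2
	}
	out := make([]rune, 0, len(glyphs)/step)
	for i := 0; i < len(glyphs); i += step {
		if i+step > len(glyphs) {
			// A trailing odd byte in 2-byte mode cannot form a glyph ID.
			out = append(out, '\uFFFD')
			break
		}
		var id uint16
		if twoByte {
			id = uint16(glyphs[i])<<8 | uint16(glyphs[i+1])
		} else {
			id = uint16(glyphs[i])
		}
		if r, ok := cmap[id]; ok {
			out = append(out, r) // CMap hit
		} else if id <= 0xFF {
			out = append(out, rune(id)) // Latin-1 fallback
		} else {
			out = append(out, '\uFFFD') // no mapping available
		}
	}
	return string(out)
}

func main() {
	cmap := map[uint16]rune{0x0001: 0x0412, 0x0002: 0x044B}
	fmt.Println(decode([]byte{0x00, 0x01, 0x00, 0x02}, cmap, true))
	fmt.Println(decode([]byte("Hi"), nil, false))
}
```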
func (*FontDecoder) Encoding ¶
func (d *FontDecoder) Encoding() string
Encoding returns the base encoding name.
func (*FontDecoder) HasCMap ¶
func (d *FontDecoder) HasCMap() bool
HasCMap returns true if this decoder has a CMap table.
func (*FontDecoder) String ¶
func (d *FontDecoder) String() string
String returns a string representation of the decoder's configuration.
type GraphicsElement ¶
type GraphicsElement struct {
Type GraphicsType // Type of graphics element
Points []Point // Points defining the element
Color Color // Stroke/fill color
Width float64 // Line width
}
GraphicsElement represents a graphics element extracted from a PDF page.
Graphics elements include lines, rectangles, and paths that can be used to detect ruling lines in tables (lattice mode detection).
PDF Graphics Operators (Section 8.5):
- Path construction: m (moveto), l (lineto), re (rectangle), c (curve)
- Path painting: S (stroke), s (close/stroke), f (fill), F (fill)
Reference: PDF 1.7 specification, Section 8.5 (Path Construction and Painting).
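As an illustration of how the path operators listed above could turn into elements, here is a hedged sketch. pathOp, shape, and collectShapes are hypothetical names, operands are assumed to be already parsed into numbers, and curves and the current transformation matrix are ignored.

```go
package main

import "fmt"

type point struct{ X, Y float64 }

// pathOp is a hypothetical, already-parsed operator with numeric operands.
type pathOp struct {
	Name string
	Args []float64
}

// shape is a simplified stand-in for a graphics element.
type shape struct {
	Kind   string // "line", "rectangle", or "path"
	Points []point
}

// collectShapes walks the operators and emits one shape per painting operator.
func collectShapes(ops []pathOp) []shape {
	var shapes []shape
	var path []point
	isRect := false
	for _, op := range ops {
		switch op.Name {
		case "m": // moveto: begin a new subpath
			if len(op.Args) == 2 {
				path = []point{{op.Args[0], op.Args[1]}}
				isRect = false
			}
		case "l": // lineto: extend the current path
			if len(op.Args) == 2 {
				path = append(path, point{op.Args[0], op.Args[1]})
				isRect = false
			}
		case "re": // rectangle: x y width height
			if len(op.Args) == 4 {
				x, y, w, h := op.Args[0], op.Args[1], op.Args[2], op.Args[3]
				path = []point{{x, y}, {x + w, y}, {x + w, y + h}, {x, y + h}}
				isRect = true
			}
		case "S", "s", "f", "F": // painting operators end the current path
			if len(path) == 0 {
				continue
			}
			kind := "path"
			switch {
			case isRect:
				kind = "rectangle"
			case len(path) == 2:
				kind = "line"
			}
			shapes = append(shapes, shape{Kind: kind, Points: path})
			path, isRect = nil, false
		}
	}
	return shapes
}

func main() {
	ops := []pathOp{
		{"m", []float64{0, 0}}, {"l", []float64{100, 0}}, {"S", nil},
		{"re", []float64{10, 10, 50, 20}}, {"f", nil},
	}
	for _, s := range collectShapes(ops) {
		fmt.Println(s.Kind, s.Points)
	}
}
```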
func (*GraphicsElement) String ¶
func (ge *GraphicsElement) String() string
String returns a string representation of the graphics element.
type GraphicsParser ¶
type GraphicsParser struct {
// contains filtered or unexported fields
}
GraphicsParser extracts graphics elements from PDF content streams.
The parser processes graphics operators to extract lines, rectangles, and other shapes that can be used for table detection (ruling lines).
Graphics State (Section 8.4):
- Current transformation matrix (CTM)
- Current path
- Line width, color, etc.
Reference: PDF 1.7 specification, Section 8 (Graphics).
func NewGraphicsParser ¶
func NewGraphicsParser(reader *parser.Reader) *GraphicsParser
NewGraphicsParser creates a new GraphicsParser for the given PDF reader.
func (*GraphicsParser) ParseFromPage ¶
func (gp *GraphicsParser) ParseFromPage(pageNum int) ([]*GraphicsElement, error)
ParseFromPage extracts all graphics elements from the specified page.
Page numbers are 0-based (first page is 0).
Returns a slice of GraphicsElements, or error if extraction fails.
type GraphicsState ¶
type GraphicsState struct {
CurrentPath []Point // Points in current path
LineWidth float64 // Current line width
StrokeColor Color // Current stroke color
FillColor Color // Current fill color
}
GraphicsState tracks the current graphics state during parsing.
func NewGraphicsState ¶
func NewGraphicsState() *GraphicsState
NewGraphicsState creates a new graphics state with defaults.
type GraphicsType ¶
type GraphicsType int
GraphicsType represents the type of graphics element.
const (
	// GraphicsTypeLine represents a straight line.
	GraphicsTypeLine GraphicsType = iota
	// GraphicsTypeRectangle represents a rectangle.
	GraphicsTypeRectangle
	// GraphicsTypePath represents a generic path.
	GraphicsTypePath
)
func (GraphicsType) String ¶
func (gt GraphicsType) String() string
String returns a string representation of the graphics type.
type ImageExtractor ¶
type ImageExtractor struct {
// contains filtered or unexported fields
}
ImageExtractor extracts images from PDF pages.
This is an application service that coordinates image extraction from PDF documents using the domain model and infrastructure services.
Example:
reader, err := parser.OpenPDF("document.pdf")
if err != nil {
	log.Fatal(err)
}
defer reader.Close()
extractor := NewImageExtractor(reader)
images, err := extractor.ExtractFromPage(0)
if err != nil {
	log.Fatal(err)
}
for i, img := range images { // i is the image's index on the page
	img.SaveToFile(fmt.Sprintf("image_%d.jpg", i))
}
func NewImageExtractor ¶
func NewImageExtractor(reader *parser.Reader) *ImageExtractor
NewImageExtractor creates a new image extractor.
Parameters:
- reader: PDF reader providing access to document structure
Returns a configured ImageExtractor ready to extract images.
func (*ImageExtractor) ExtractFromDocument ¶
func (e *ImageExtractor) ExtractFromDocument() ([]*types.Image, error)
ExtractFromDocument extracts all images from all pages in the document.
This iterates through all pages and extracts images from each.
Returns a slice of all images found in the document, or error if extraction fails.
func (*ImageExtractor) ExtractFromPage ¶
func (e *ImageExtractor) ExtractFromPage(pageIndex int) ([]*types.Image, error)
ExtractFromPage extracts all images from a specific page.
This finds all image XObjects in the page's resources and extracts them.
Parameters:
- pageIndex: 0-based page index
Returns a slice of images found on the page, or error if extraction fails.
type Matrix ¶
type Matrix struct {
A, B, C, D, E, F float64
}
Matrix represents a transformation matrix used in PDF graphics and text.
PDF uses 3x3 transformation matrices in homogeneous coordinate space:
[ a b 0 ]
[ c d 0 ]
[ e f 1 ]
The matrix is specified by six numbers: [a b c d e f]
Transformations:
- Translation: [1 0 0 1 tx ty] - moves by (tx, ty)
- Scaling: [sx 0 0 sy 0 0] - scales by (sx, sy)
- Rotation: [cos θ sin θ -sin θ cos θ 0 0] - rotates by θ
- Skewing: [1 tan α tan β 1 0 0] - skews by angles α, β
Reference: PDF 1.7 specification, Section 8.3.3 (Common Transformations).
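The transformations listed above can be tried out with a small sketch. The lowercase matrix type and the translate, scale, rotate, apply, and near helpers are hypothetical and only mirror the six-number layout of Matrix; apply assumes the PDF row-vector convention [x y 1] × M.

```go
package main

import (
	"fmt"
	"math"
)

// matrix mirrors the six-number form [a b c d e f]; it is a local stand-in.
type matrix struct{ A, B, C, D, E, F float64 }

func translate(tx, ty float64) matrix { return matrix{1, 0, 0, 1, tx, ty} }

func scale(sx, sy float64) matrix { return matrix{sx, 0, 0, sy, 0, 0} }

// rotate builds a counterclockwise rotation by theta radians.
func rotate(theta float64) matrix {
	s, c := math.Sin(theta), math.Cos(theta)
	return matrix{c, s, -s, c, 0, 0}
}

// apply computes [x y 1] × m:
// x' = a*x + c*y + e, y' = b*x + d*y + f.
func (m matrix) apply(x, y float64) (float64, float64) {
	return m.A*x + m.C*y + m.E, m.B*x + m.D*y + m.F
}

// near compares floats with a small tolerance (rotation is not exact).
func near(a, b float64) bool { return math.Abs(a-b) < 1e-9 }

func main() {
	x, y := translate(10, 20).apply(1, 2)
	fmt.Println(x, y)
	x, y = scale(2, 3).apply(1, 2)
	fmt.Println(x, y)
	x, y = rotate(math.Pi / 2).apply(1, 0)
	fmt.Printf("%.3f %.3f\n", x, y)
}
```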
func Identity ¶
func Identity() Matrix
Identity returns the identity matrix [1 0 0 1 0 0].
The identity matrix performs no transformation.
func Translation ¶
Translation creates a translation matrix that moves by (tx, ty).
func (Matrix) IsIdentity ¶
IsIdentity checks if the matrix is the identity matrix.
func (Matrix) Multiply ¶
Multiply multiplies this matrix by another matrix (m * other).
Matrix multiplication is used to combine transformations. The order matters: with the row-vector convention PDF uses ([x y 1] × M), the product m × other given below transforms a point by m first, then by other.
The formula for matrix multiplication:
[ a1 b1 0 ]   [ a2 b2 0 ]   [ a1*a2+b1*c2     a1*b2+b1*d2     0 ]
[ c1 d1 0 ] × [ c2 d2 0 ] = [ c1*a2+d1*c2     c1*b2+d1*d2     0 ]
[ e1 f1 1 ]   [ e2 f2 1 ]   [ e1*a2+f1*c2+e2  e1*b2+f1*d2+f2  1 ]
Reference: PDF 1.7 specification, Section 8.3.4 (Transformation Matrices).
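The formula above translates directly into code. The sketch below uses a local, hypothetical matrix type rather than the package's Matrix, and assumes points are transformed as row vectors ([x y 1] × M); whether the package's Multiply behaves identically is an assumption.

```go
package main

import "fmt"

// matrix mirrors the six-number form [a b c d e f]; it is a local stand-in.
type matrix struct{ A, B, C, D, E, F float64 }

// multiply returns m × o, term by term as in the formula above.
func (m matrix) multiply(o matrix) matrix {
	return matrix{
		A: m.A*o.A + m.B*o.C,
		B: m.A*o.B + m.B*o.D,
		C: m.C*o.A + m.D*o.C,
		D: m.C*o.B + m.D*o.D,
		E: m.E*o.A + m.F*o.C + o.E,
		F: m.E*o.B + m.F*o.D + o.F,
	}
}

// apply computes [x y 1] × m.
func (m matrix) apply(x, y float64) (float64, float64) {
	return m.A*x + m.C*y + m.E, m.B*x + m.D*y + m.F
}

func main() {
	scale := matrix{2, 0, 0, 2, 0, 0}
	shift := matrix{1, 0, 0, 1, 10, 0}

	// Scale, then shift: (1,1) becomes (2,2), then (12,2).
	x, y := scale.multiply(shift).apply(1, 1)
	fmt.Println(x, y)

	// Shift, then scale: (1,1) becomes (11,1), then (22,2).
	x, y = shift.multiply(scale).apply(1, 1)
	fmt.Println(x, y)
}
```

The two results differ, which shows why the operand order in a product matters when combining transformations.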
type Operator ¶
type Operator struct {
Name string // Operator name (e.g., "Tj", "Tm", "BT")
Operands []parser.PdfObject // Operands for the operator
}
Operator represents a PDF content stream operator with its operands.
PDF content streams consist of a sequence of operators and their operands. The general format is:
operand1 operand2 ... operandN operator
For example:
- "100 200 Td" - Move text position to (100, 200)
- "(Hello) Tj" - Show text "Hello"
- "/F1 12 Tf" - Set font F1 with size 12
Reference: PDF 1.7 specification, Section 7.8.2 (Content Streams).
func NewOperator ¶
NewOperator creates a new Operator with the given name and operands.
type Point ¶
type Point struct {
X, Y float64
}
Point represents a 2D point in PDF coordinate space.
PDF coordinates are in points (1/72 inch), with origin at bottom-left.
type Rectangle ¶
type Rectangle struct {
X float64 // Bottom-left X coordinate
Y float64 // Bottom-left Y coordinate
Width float64 // Width
Height float64 // Height
}
Rectangle represents a rectangular bounding box.
This is a simplified version for text extraction. The full Rectangle value object is in domain/types.
func NewRectangle ¶
NewRectangle creates a new Rectangle.
type TextChunk ¶
type TextChunk struct {
Elements []*TextElement // Text elements in this chunk
Bounds Rectangle // Bounding box of all elements
}
TextChunk represents a group of text elements.
A chunk is used to group related text elements (e.g., text on the same line, text in the same cell, text in the same paragraph).
This is useful for table extraction where we need to group text into cells.
func NewTextChunk ¶
func NewTextChunk(elements []*TextElement) *TextChunk
NewTextChunk creates a new TextChunk with the given elements.
The bounding box is calculated from the elements.
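One plausible way to compute such a bounding box is to take the union of the element rectangles. The sketch below is an assumption about the approach, not the package's code; box and union are hypothetical names, and an empty input is assumed to yield a zero rectangle.

```go
package main

import (
	"fmt"
	"math"
)

// box is a local stand-in for Rectangle (bottom-left origin plus size).
type box struct{ X, Y, Width, Height float64 }

// union returns the smallest box that contains every input box.
func union(boxes []box) box {
	if len(boxes) == 0 {
		return box{}
	}
	minX, minY := boxes[0].X, boxes[0].Y
	maxX, maxY := boxes[0].X+boxes[0].Width, boxes[0].Y+boxes[0].Height
	for _, b := range boxes[1:] {
		minX = math.Min(minX, b.X)
		minY = math.Min(minY, b.Y)
		maxX = math.Max(maxX, b.X+b.Width)
		maxY = math.Max(maxY, b.Y+b.Height)
	}
	return box{minX, minY, maxX - minX, maxY - minY}
}

func main() {
	fmt.Println(union([]box{{10, 100, 30, 10}, {50, 95, 20, 12}}))
}
```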
func (*TextChunk) Add ¶
func (tc *TextChunk) Add(elem *TextElement)
Add adds a text element to the chunk and updates bounds.
type TextElement ¶
type TextElement struct {
Text string // The actual text content
X float64 // X coordinate (bottom-left, in points)
Y float64 // Y coordinate (bottom-left, in points)
Width float64 // Width of text (in points)
Height float64 // Height of text (in points)
FontName string // Font name (e.g., "/F1", "/Helvetica")
FontSize float64 // Font size in points
}
TextElement represents a single piece of text extracted from a PDF page.
Each TextElement has position information (X, Y coordinates) which is critical for table extraction and layout analysis. The coordinates represent the bottom-left corner of the text element in PDF coordinate space.
PDF Coordinate System (Section 8.3.2):
- Origin (0,0) is at bottom-left of page
- X increases to the right
- Y increases upward
- Coordinates are in points (1 point = 1/72 inch)
Reference: PDF 1.7 specification, Section 9.4 (Text Objects).
func NewTextElement ¶
func NewTextElement(text string, x, y, width, height float64, fontName string, fontSize float64) *TextElement
NewTextElement creates a new TextElement with the given properties.
func (*TextElement) Bottom ¶
func (te *TextElement) Bottom() float64
Bottom returns the Y coordinate of the bottom edge (same as Y).
func (*TextElement) CenterX ¶
func (te *TextElement) CenterX() float64
CenterX returns the X coordinate of the center of the text.
func (*TextElement) CenterY ¶
func (te *TextElement) CenterY() float64
CenterY returns the Y coordinate of the center of the text.
func (*TextElement) Left ¶
func (te *TextElement) Left() float64
Left returns the X coordinate of the left edge (same as X).
func (*TextElement) Right ¶
func (te *TextElement) Right() float64
Right returns the X coordinate of the right edge of the text.
func (*TextElement) String ¶
func (te *TextElement) String() string
String returns a string representation of the text element.
func (*TextElement) Top ¶
func (te *TextElement) Top() float64
Top returns the Y coordinate of the top edge of the text.
func (*TextElement) VerticalOverlapRatio ¶
func (te *TextElement) VerticalOverlapRatio(other *TextElement) float64
VerticalOverlapRatio calculates the vertical overlap ratio between this element and another.
Returns a value between 0.0 (no overlap) and 1.0 (complete overlap). Based on Tabula's algorithm (tabula-java/Rectangle.java:73-90).
This is used for row detection in tables without ruling lines (Stream mode). Elements with overlap < threshold (e.g., 0.1) are considered separate rows.
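A simplified sketch of a vertical overlap ratio follows. It assumes the ratio is the overlapping height divided by the smaller of the two heights, which is how the idea is commonly described; the exact branches of the Tabula code are not reproduced here, and span and overlapRatio are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// span is the vertical extent of an element in PDF space (Top > Bottom).
type span struct{ Bottom, Top float64 }

// overlapRatio returns the overlapping height divided by the smaller height,
// giving 0.0 for disjoint spans and 1.0 when one span contains the other.
func overlapRatio(a, b span) float64 {
	overlap := math.Min(a.Top, b.Top) - math.Max(a.Bottom, b.Bottom)
	if overlap <= 0 {
		return 0
	}
	minHeight := math.Min(a.Top-a.Bottom, b.Top-b.Bottom)
	if minHeight <= 0 {
		return 0
	}
	return overlap / minHeight
}

func main() {
	fmt.Println(overlapRatio(span{100, 110}, span{105, 115})) // partial overlap
	fmt.Println(overlapRatio(span{100, 110}, span{120, 130})) // separate rows
	fmt.Println(overlapRatio(span{100, 110}, span{102, 106})) // contained
}
```

With a threshold such as 0.1, the first and third pairs would be placed on the same row and the second pair on different rows.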
type TextExtractor ¶
type TextExtractor struct {
// contains filtered or unexported fields
}
TextExtractor extracts text with positional information from PDF pages.
The extractor processes PDF content streams and interprets text operators to extract text along with its X,Y coordinates. This is critical for table extraction, as we need to know where each piece of text is located.
Text Extraction Process:
- Get page's content stream(s)
- Decode stream (handle FlateDecode, etc.)
- Parse content operators
- Track text state (font, position, matrix)
- Extract text with coordinates when text showing operators are encountered
- Decode glyph bytes to Unicode using font CMap/encoding
Reference: PDF 1.7 specification, Section 9.4 (Text Objects).
func NewTextExtractor ¶
func NewTextExtractor(reader *parser.Reader) *TextExtractor
NewTextExtractor creates a new TextExtractor for the given PDF reader.
func (*TextExtractor) ExtractFromPage ¶
func (te *TextExtractor) ExtractFromPage(pageNum int) ([]*TextElement, error)
ExtractFromPage extracts all text elements from the specified page.
Page numbers are 0-based (first page is 0).
Returns a slice of TextElements with position information, or error if extraction fails.
type TextState ¶
type TextState struct {
// Text matrices (Section 9.4.2)
Tm Matrix // Current text matrix
Tlm Matrix // Text line matrix (start of line)
// Text state parameters (Section 9.3)
FontName string // Current font name (from Tf operator)
FontSize float64 // Current font size in points (from Tf operator)
CharSpace float64 // Character spacing (from Tc operator)
WordSpace float64 // Word spacing (from Tw operator)
HorizScale float64 // Horizontal scaling as percentage (from Tz operator, 100 = normal)
Leading float64 // Text leading in points (from TL operator)
Rise float64 // Text rise in points (from Ts operator)
// Current position (derived from Tm)
CurrentX float64
CurrentY float64
}
TextState tracks the current text state during content stream parsing.
The PDF text state includes all parameters that affect how text is rendered:
- Text matrix (Tm): Current text position and transformation
- Text line matrix (Tlm): Position of the start of the current line
- Font and size
- Character/word spacing
- Horizontal scaling
- Text leading (line spacing)
- Text rise (vertical offset)
These parameters are modified by text operators (Tf, Tc, Tw, Tz, TL, Ts, Tm, Td, etc.) and affect how text showing operators (Tj, TJ, ', ") render text.
Reference: PDF 1.7 specification, Section 9.3 (Text State Parameters).
func NewTextState ¶
func NewTextState() *TextState
NewTextState creates a new TextState with default values.
Default values:
- Identity matrices for Tm and Tlm
- Empty font name, 0 font size
- 0 character spacing, word spacing, rise
- 100% horizontal scaling
- 0 leading
Reference: PDF 1.7 specification, Section 9.3.1 (Text State Parameters and Operators).
func (*TextState) AdvanceX ¶
AdvanceX advances the current X position by the given width.
This is used when showing text to move the text position by the width of the text. The width should account for character spacing, word spacing, and horizontal scaling.
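The width passed to AdvanceX could be computed per glyph along the lines of the horizontal displacement formula in the PDF specification, tx = (w0 × Tfs + Tc + Tw) × Th. The sketch below is illustrative: glyphAdvance and its parameters are hypothetical, glyph widths are assumed to be in 1/1000 text-space units, and TJ position adjustments are left out.

```go
package main

import "fmt"

// glyphAdvance returns the horizontal displacement for one glyph.
// width1000 is the glyph width in 1/1000 units, horizScalePct is the Tz value
// (100 = normal), and wordSpace applies only when the glyph is a space.
func glyphAdvance(width1000, fontSize, charSpace, wordSpace, horizScalePct float64, isSpace bool) float64 {
	tx := width1000/1000*fontSize + charSpace
	if isSpace {
		tx += wordSpace
	}
	return tx * horizScalePct / 100
}

func main() {
	// A 500-unit glyph at 12 pt with no extra spacing.
	fmt.Println(glyphAdvance(500, 12, 0, 0, 100, false))
	// A 250-unit space at 12 pt with Tc=1, Tw=2, and 50% horizontal scaling.
	fmt.Println(glyphAdvance(250, 12, 1, 2, 50, true))
}
```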
func (*TextState) MoveToNextLine ¶
func (ts *TextState) MoveToNextLine()
MoveToNextLine moves to the start of the next line (T* operator).
The T* operator is equivalent to: 0 -Tl Td, where Tl is the current leading.
Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).
func (*TextState) Reset ¶
func (ts *TextState) Reset()
Reset resets the text state to default values.
This is called when a BT (Begin Text) operator is encountered. According to the PDF spec, BT initializes the text matrix and line matrix to identity.
Reference: PDF 1.7 specification, Section 9.4.1 (Text Objects).
func (*TextState) SetFont ¶
SetFont sets the current font and size (Tf operator).
The Tf operator takes a font name and size as operands, written before the operator:
/FontName size Tf
Reference: PDF 1.7 specification, Section 9.3 (Text State Parameters).
func (*TextState) SetTextMatrix ¶
SetTextMatrix sets the text matrix (Tm operator).
The Tm operator replaces the current text matrix with a new matrix specified by six numbers: a b c d e f Tm
This also updates the text line matrix (Tlm = Tm).
Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).
func (*TextState) Translate ¶
Translate moves the text position by (tx, ty) (Td operator).
The Td operator updates both the text matrix and text line matrix:
Tlm = [1 0 0 1 tx ty] × Tlm
Tm = Tlm
Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).
func (*TextState) TranslateSetLeading ¶
TranslateSetLeading moves the text position and sets leading (TD operator).
The TD operator is equivalent to:
- -ty TL (set leading to -ty)
- tx ty Td (translate)
Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).
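The three positioning operators (Td, TD, T*) can be sketched together. This is an illustration with hypothetical names (matrix, textState, td, tdSetLeading, nextLine), not the package's TextState; it follows the operand order given in the PDF specification, with the translation matrix on the left of the product.

```go
package main

import "fmt"

// matrix mirrors the six-number form [a b c d e f]; it is a local stand-in.
type matrix struct{ A, B, C, D, E, F float64 }

// multiply returns m × o under the row-vector convention.
func (m matrix) multiply(o matrix) matrix {
	return matrix{
		A: m.A*o.A + m.B*o.C, B: m.A*o.B + m.B*o.D,
		C: m.C*o.A + m.D*o.C, D: m.C*o.B + m.D*o.D,
		E: m.E*o.A + m.F*o.C + o.E, F: m.E*o.B + m.F*o.D + o.F,
	}
}

// textState keeps only the fields needed for line positioning.
type textState struct {
	Tm, Tlm matrix
	Leading float64
}

// td models "tx ty Td": move to the next line start, offset by (tx, ty).
func (ts *textState) td(tx, ty float64) {
	ts.Tlm = matrix{1, 0, 0, 1, tx, ty}.multiply(ts.Tlm)
	ts.Tm = ts.Tlm
}

// tdSetLeading models "tx ty TD": set the leading to -ty, then do Td.
func (ts *textState) tdSetLeading(tx, ty float64) {
	ts.Leading = -ty
	ts.td(tx, ty)
}

// nextLine models "T*": the same as "0 -leading Td".
func (ts *textState) nextLine() { ts.td(0, -ts.Leading) }

func main() {
	identity := matrix{1, 0, 0, 1, 0, 0}
	ts := &textState{Tm: identity, Tlm: identity}
	ts.td(100, 700)         // first line starts at (100, 700)
	ts.tdSetLeading(0, -14) // next line, 14 pt lower; leading becomes 14
	ts.nextLine()           // another 14 pt lower
	fmt.Println(ts.Tm.E, ts.Tm.F, ts.Leading)
}
```

Because the offset is multiplied into the line matrix, a scaled text matrix scales the offset as well, which a plain addition to the current position would miss.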