extractor package

v0.6.0 | Published: Feb 24, 2026 | License: MIT | Imports: 15 | Imported by: 0

Repository

github.com/coregx/gxpdf

Documentation ¶

Overview ¶

Package extractor implements PDF content extraction use cases.

This is the Application layer in DDD/Clean Architecture. It orchestrates domain logic and infrastructure for extracting content from PDFs.

Index ¶

type CMapParser
- func NewCMapParser(data []byte) *CMapParser
- func (p *CMapParser) Parse() (*CMapTable, error)
type CMapTable
- func NewCMapTable(name string) *CMapTable
- func ParseCMapStream(data []byte) (*CMapTable, error)
- func (t *CMapTable) AddMapping(glyphID uint16, unicode rune)
- func (t *CMapTable) AddRangeMapping(startGlyphID, endGlyphID uint16, startUnicode rune)
- func (t *CMapTable) GetUnicode(glyphID uint16) (rune, bool)
- func (t *CMapTable) Name() string
- func (t *CMapTable) Size() int
type CellExtractor
- func NewCellExtractor(textElements []*TextElement) *CellExtractor
- func (ce *CellExtractor) ExtractCellContent(bounds Rectangle) string
- func (ce *CellExtractor) FindElementsInBounds(bounds Rectangle) []*TextElement
type Color
- func NewColor(r, g, b float64) Color
- func (c Color) IsBlack() bool
- func (c Color) String() string
type ContentParser
- func NewContentParser(content []byte) *ContentParser
- func (cp *ContentParser) ParseOperators() ([]*Operator, error)
type FontDecoder
- func NewFontDecoder(cmap *CMapTable, encoding string, use2ByteGlyphs bool) *FontDecoder
- func NewFontDecoderWithCMap(cmap *CMapTable) *FontDecoder
- func NewFontDecoderWithCustomEncoding(differences map[uint16]string, baseEncoding string, use2ByteGlyphs bool) *FontDecoder
- func (d *FontDecoder) DecodeString(glyphBytes []byte) string
- func (d *FontDecoder) Encoding() string
- func (d *FontDecoder) HasCMap() bool
- func (d *FontDecoder) String() string
type GraphicsElement
- func (ge *GraphicsElement) String() string
type GraphicsParser
- func NewGraphicsParser(reader *parser.Reader) *GraphicsParser
- func (gp *GraphicsParser) ParseFromPage(pageNum int) ([]*GraphicsElement, error)
type GraphicsState
- func NewGraphicsState() *GraphicsState
type GraphicsType
- func (gt GraphicsType) String() string
type ImageExtractor
- func NewImageExtractor(reader *parser.Reader) *ImageExtractor
- func (e *ImageExtractor) ExtractFromDocument() ([]*types.Image, error)
- func (e *ImageExtractor) ExtractFromPage(pageIndex int) ([]*types.Image, error)
type Matrix
- func Identity() Matrix
- func NewMatrix(a, b, c, d, e, f float64) Matrix
- func Rotation(angle float64) Matrix
- func Scaling(sx, sy float64) Matrix
- func Translation(tx, ty float64) Matrix
- func (m Matrix) IsIdentity() bool
- func (m Matrix) Multiply(other Matrix) Matrix
- func (m Matrix) String() string
- func (m Matrix) Transform(x, y float64) (float64, float64)
type Operator
- func NewOperator(name string, operands []parser.PdfObject) *Operator
- func (op *Operator) String() string
type Point
- func NewPoint(x, y float64) Point
- func (p Point) String() string
type Rectangle
- func NewRectangle(x, y, width, height float64) Rectangle
- func (r Rectangle) Bottom() float64
- func (r Rectangle) Contains(x, y float64) bool
- func (r Rectangle) Left() float64
- func (r Rectangle) Right() float64
- func (r Rectangle) String() string
- func (r Rectangle) Top() float64
type TextChunk
- func NewTextChunk(elements []*TextElement) *TextChunk
- func (tc *TextChunk) Add(elem *TextElement)
- func (tc *TextChunk) Len() int
- func (tc *TextChunk) String() string
- func (tc *TextChunk) Text() string
type TextElement
- func NewTextElement(text string, x, y, width, height float64, fontName string, fontSize float64) *TextElement
- func (te *TextElement) Bottom() float64
- func (te *TextElement) CenterX() float64
- func (te *TextElement) CenterY() float64
- func (te *TextElement) Left() float64
- func (te *TextElement) Right() float64
- func (te *TextElement) String() string
- func (te *TextElement) Top() float64
- func (te *TextElement) VerticalOverlapRatio(other *TextElement) float64
type TextExtractor
- func NewTextExtractor(reader *parser.Reader) *TextExtractor
- func (te *TextExtractor) ExtractFromPage(pageNum int) ([]*TextElement, error)
type TextState
- func NewTextState() *TextState
- func (ts *TextState) AdvanceX(width float64)
- func (ts *TextState) MoveToNextLine()
- func (ts *TextState) Reset()
- func (ts *TextState) SetFont(fontName string, fontSize float64)
- func (ts *TextState) SetTextMatrix(a, b, c, d, e, f float64)
- func (ts *TextState) String() string
- func (ts *TextState) Translate(tx, ty float64)
- func (ts *TextState) TranslateSetLeading(tx, ty float64)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type CMapParser ¶

type CMapParser struct {
	// contains filtered or unexported fields
}

CMapParser parses CMap (Character Map) streams from PDF ToUnicode entries.

CMap Format (simplified):

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def

% Single character mappings
3 beginbfchar
<0001> <0412>  % Glyph 0x01 → U+0412 'В'
<0002> <044B>  % Glyph 0x02 → U+044B 'ы'
<0003> <043F>  % Glyph 0x03 → U+043F 'п'
endbfchar

% Range mappings
1 beginbfrange
<0010> <0020> <0430>  % Glyphs 0x10-0x20 → U+0430-U+0440
endbfrange

endcmap

Reference: PDF 1.7 specification, Section 9.10.3 (ToUnicode CMaps).

func NewCMapParser ¶

func NewCMapParser(data []byte) *CMapParser

NewCMapParser creates a new CMapParser for the given stream data.

The stream should be the decoded content of a ToUnicode CMap stream.

func (*CMapParser) Parse ¶

func (p *CMapParser) Parse() (*CMapTable, error)

Parse parses the CMap stream and returns a CMapTable.

The parser handles:

- beginbfchar/endbfchar: Single character mappings
- beginbfrange/endbfrange: Range mappings

Unsupported operators are silently ignored for graceful degradation.
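
As a rough illustration of the two supported sections, the sketch below scans a decoded CMap stream line by line and collects bfchar and bfrange entries into a plain map. It is not the package's parser: parseToUnicode and parseHexTokens are hypothetical helpers, the line-oriented scanning is an assumption, and the array form of bfrange is not handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// hexToken matches one hex string such as <0412>.
var hexToken = regexp.MustCompile(`<([0-9A-Fa-f]+)>`)

// parseHexTokens returns the numeric values of all <...> tokens on a line,
// ignoring anything after a % comment marker.
func parseHexTokens(line string) []uint32 {
	if i := strings.IndexByte(line, '%'); i >= 0 {
		line = line[:i]
	}
	var out []uint32
	for _, m := range hexToken.FindAllStringSubmatch(line, -1) {
		v, err := strconv.ParseUint(m[1], 16, 32)
		if err != nil {
			continue // skip malformed tokens
		}
		out = append(out, uint32(v))
	}
	return out
}

// parseToUnicode (hypothetical) collects bfchar and bfrange mappings.
// Lines outside those two sections are ignored.
func parseToUnicode(data string) map[uint16]rune {
	table := make(map[uint16]rune)
	section := ""
	for _, line := range strings.Split(data, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case strings.HasSuffix(line, "beginbfchar"):
			section = "bfchar"
		case strings.HasSuffix(line, "beginbfrange"):
			section = "bfrange"
		case line == "endbfchar" || line == "endbfrange":
			section = ""
		default:
			vals := parseHexTokens(line)
			if section == "bfchar" && len(vals) == 2 {
				table[uint16(vals[0])] = rune(vals[1])
			}
			// Only 16-bit, well-ordered ranges are expanded.
			if section == "bfrange" && len(vals) == 3 &&
				vals[0] <= vals[1] && vals[1] <= 0xFFFF {
				for g := vals[0]; g <= vals[1]; g++ {
					table[uint16(g)] = rune(vals[2] + (g - vals[0]))
				}
			}
		}
	}
	return table
}

func main() {
	stream := "2 beginbfchar\n<0001> <0412>\n<0002> <044B>\nendbfchar\n" +
		"1 beginbfrange\n<0010> <0020> <0430>\nendbfrange\n"
	table := parseToUnicode(stream)
	fmt.Printf("%d mappings, glyph 0x11 -> %c\n", len(table), table[0x11])
}
```

Anything the sketch does not recognize falls through without an error, which mirrors the "silently ignored" behavior described above.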

type CMapTable ¶

type CMapTable struct {
	// contains filtered or unexported fields
}

CMapTable represents a Character Map that maps glyph IDs to Unicode code points.

CMap (Character Map) defines the mapping between character codes (glyph IDs) used in a PDF font and the corresponding Unicode values. This is essential for extracting readable text from PDFs, especially for custom encodings and non-Latin scripts like Cyrillic, Chinese, Japanese, etc.

The mapping is stored as glyph ID (uint16) → Unicode rune (int32).

Reference: PDF 1.7 specification, Section 9.10.3 (ToUnicode CMaps).

func NewCMapTable ¶

func NewCMapTable(name string) *CMapTable

NewCMapTable creates a new empty CMapTable.

func ParseCMapStream ¶

func ParseCMapStream(data []byte) (*CMapTable, error)

ParseCMapStream is a convenience function that parses a CMap stream.

This is equivalent to:

parser := NewCMapParser(data)
return parser.Parse()

func (*CMapTable) AddMapping ¶

func (t *CMapTable) AddMapping(glyphID uint16, unicode rune)

AddMapping adds a single glyph ID to Unicode mapping.

func (*CMapTable) AddRangeMapping ¶

func (t *CMapTable) AddRangeMapping(startGlyphID, endGlyphID uint16, startUnicode rune)

AddRangeMapping adds a range of glyph IDs to consecutive Unicode values.

For example: AddRangeMapping(0x10, 0x20, 0x0430) maps:

Glyph 0x10 → U+0430 ('а')
Glyph 0x11 → U+0431 ('б')
...
Glyph 0x20 → U+0440 ('р')
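
The range expansion can be sketched with a small stand-in type. The table type below is hypothetical and map-backed; it only mirrors the shape of AddRangeMapping and GetUnicode and makes no claim about how CMapTable stores its data.

```go
package main

import "fmt"

// table is a stand-in for CMapTable, backed by a plain map.
type table struct{ m map[uint16]rune }

func newTable() *table { return &table{m: make(map[uint16]rune)} }

// addRange maps start..end (inclusive) to consecutive code points.
// The int counter avoids uint16 wrap-around when end is 0xFFFF.
func (t *table) addRange(start, end uint16, first rune) {
	if end < start {
		return
	}
	for i := 0; i <= int(end-start); i++ {
		t.m[start+uint16(i)] = first + rune(i)
	}
}

// get returns the mapped rune and whether a mapping exists.
func (t *table) get(g uint16) (rune, bool) {
	r, ok := t.m[g]
	return r, ok
}

func main() {
	t := newTable()
	t.addRange(0x10, 0x20, 0x0430)
	for _, g := range []uint16{0x10, 0x11, 0x20, 0x21} {
		if r, ok := t.get(g); ok {
			fmt.Printf("glyph 0x%02X -> U+%04X %q\n", g, r, r)
		} else {
			fmt.Printf("glyph 0x%02X -> no mapping\n", g)
		}
	}
}
```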

func (*CMapTable) GetUnicode ¶

func (t *CMapTable) GetUnicode(glyphID uint16) (rune, bool)

GetUnicode returns the Unicode code point for a given glyph ID.

Returns the Unicode rune and true if mapping exists, or 0 and false if not found.

func (*CMapTable) Name ¶

func (t *CMapTable) Name() string

Name returns the CMap name.

func (*CMapTable) Size ¶

func (t *CMapTable) Size() int

Size returns the number of mappings in the table.

type CellExtractor ¶

type CellExtractor struct {
	// contains filtered or unexported fields
}

CellExtractor extracts text content from a rectangular cell region.

The extractor:

- Finds all text elements within cell bounds
- Sorts text by position (top to bottom, left to right)
- Joins text with proper spacing and line breaks
- Handles multi-line content

This is a critical component for table extraction (Phase 2.7).

func NewCellExtractor ¶

func NewCellExtractor(textElements []*TextElement) *CellExtractor

NewCellExtractor creates a new CellExtractor with the given text elements.

func (*CellExtractor) ExtractCellContent ¶

func (ce *CellExtractor) ExtractCellContent(bounds Rectangle) string

ExtractCellContent extracts text from a rectangular region (cell bounds).

Algorithm:

1. Find all text elements within the cell bounds
2. Group text elements by line (based on Y position)
3. Sort lines from top to bottom
4. Within each line, sort elements left to right
5. Join text with appropriate spacing

Parameters:

bounds: The rectangular region to extract text from

Returns the extracted text, or empty string if no text is found.
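
The steps above can be sketched as follows. This is a minimal illustration, not the package's implementation: elem, rect, and cellText are hypothetical names, the line tolerance parameter is an assumption, and joining with a single space or newline is a simplification of "appropriate spacing".

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type elem struct {
	Text                string
	X, Y, Width, Height float64 // bottom-left origin, in points
}

type rect struct{ X, Y, Width, Height float64 }

func (r rect) contains(x, y float64) bool {
	return x >= r.X && x <= r.X+r.Width && y >= r.Y && y <= r.Y+r.Height
}

// cellText follows the five steps; lineTol is an assumed tolerance (in
// points) for treating two elements as part of the same line.
func cellText(elems []elem, bounds rect, lineTol float64) string {
	// Step 1: keep elements whose center lies inside the bounds.
	var in []elem
	for _, e := range elems {
		if bounds.contains(e.X+e.Width/2, e.Y+e.Height/2) {
			in = append(in, e)
		}
	}
	// Steps 2-3: sort top to bottom (larger Y is higher on the page),
	// then start a new line whenever the Y gap exceeds the tolerance.
	sort.SliceStable(in, func(i, j int) bool { return in[i].Y > in[j].Y })
	var lines [][]elem
	for _, e := range in {
		n := len(lines)
		if n > 0 && lines[n-1][0].Y-e.Y <= lineTol {
			lines[n-1] = append(lines[n-1], e)
		} else {
			lines = append(lines, []elem{e})
		}
	}
	// Steps 4-5: order each line left to right and join the pieces.
	var out []string
	for _, ln := range lines {
		sort.SliceStable(ln, func(i, j int) bool { return ln[i].X < ln[j].X })
		words := make([]string, len(ln))
		for i, e := range ln {
			words[i] = e.Text
		}
		out = append(out, strings.Join(words, " "))
	}
	return strings.Join(out, "\n")
}

func main() {
	elems := []elem{
		{"World", 60, 100, 30, 10},
		{"Hello", 10, 100.5, 30, 10},
		{"second", 10, 85, 40, 10},
		{"outside", 300, 100, 40, 10},
	}
	fmt.Println(cellText(elems, rect{0, 80, 200, 40}, 2))
}
```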

func (*CellExtractor) FindElementsInBounds ¶

func (ce *CellExtractor) FindElementsInBounds(bounds Rectangle) []*TextElement

FindElementsInBounds returns all text elements that are within the bounds.

An element is considered "within" if its center point is inside the bounds. This handles cases where text might slightly overlap cell boundaries.

This method is exported for use by other extractors (e.g., table alignment detection).

type Color ¶

type Color struct {
	R, G, B float64 // RGB values (0.0 - 1.0)
}

Color represents an RGB color.

func NewColor ¶

func NewColor(r, g, b float64) Color

NewColor creates a new Color.

func (Color) IsBlack ¶

func (c Color) IsBlack() bool

IsBlack returns true if the color is black (or very dark).

func (Color) String ¶

func (c Color) String() string

String returns a string representation of the color.

type ContentParser ¶

type ContentParser struct {
	// contains filtered or unexported fields
}

ContentParser parses PDF content streams into operators.

Content streams contain a sequence of operators that describe page graphics and text. The parser reads the stream and extracts operators with their operands.

Example content stream:

BT
  /F1 12 Tf
  100 200 Td
  (Hello, World!) Tj
ET

This would be parsed into operators:

Operator{Name: "BT"}
Operator{Name: "Tf", Operands: ["/F1", 12]}
Operator{Name: "Td", Operands: [100, 200]}
Operator{Name: "Tj", Operands: ["(Hello, World!)"]}
Operator{Name: "ET"}

Reference: PDF 1.7 specification, Section 7.8.2 (Content Streams).

func NewContentParser ¶

func NewContentParser(content []byte) *ContentParser

NewContentParser creates a new ContentParser for the given content stream.

func (*ContentParser) ParseOperators ¶

func (cp *ContentParser) ParseOperators() ([]*Operator, error)

ParseOperators parses all operators from the content stream.

Returns a slice of operators in the order they appear in the stream. Returns error if parsing fails.

Content streams are sequences of objects followed by operators (keywords). Example: "100 200 Td" means: push 100, push 200, execute Td operator.
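
The operand-then-operator structure can be sketched with a small tokenizer. The code below is an illustration under simplifying assumptions, not the package's parser: op, tokenize, and parseOps are hypothetical names, operands are kept as raw strings rather than parser.PdfObject values, and arrays, dictionaries, string escapes, and nested parentheses are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

type op struct {
	Name     string
	Operands []string
}

// tokenize splits a content stream on whitespace while keeping a literal
// string such as (Hello, World!) together as one token.
func tokenize(s string) []string {
	var toks []string
	for i := 0; i < len(s); {
		switch c := s[i]; {
		case c == ' ' || c == '\n' || c == '\r' || c == '\t':
			i++
		case c == '(':
			end := strings.IndexByte(s[i:], ')')
			if end < 0 {
				end = len(s) - i - 1 // unterminated string: take the rest
			}
			toks = append(toks, s[i:i+end+1])
			i += end + 1
		default:
			j := i
			for j < len(s) && !strings.ContainsRune(" \n\r\t(", rune(s[j])) {
				j++
			}
			toks = append(toks, s[i:j])
			i = j
		}
	}
	return toks
}

// isOperand treats numbers, names (/F1) and literal strings as operands.
func isOperand(t string) bool {
	return strings.ContainsRune("0123456789+-./(", rune(t[0]))
}

// parseOps pushes operands until an operator keyword appears, then emits
// the operator together with everything pushed so far.
func parseOps(s string) []op {
	var ops []op
	var stack []string
	for _, t := range tokenize(s) {
		if isOperand(t) {
			stack = append(stack, t)
			continue
		}
		ops = append(ops, op{Name: t, Operands: stack})
		stack = nil
	}
	return ops
}

func main() {
	stream := "BT\n  /F1 12 Tf\n  100 200 Td\n  (Hello, World!) Tj\nET"
	for _, o := range parseOps(stream) {
		fmt.Printf("%s %q\n", o.Name, o.Operands)
	}
}
```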

type FontDecoder ¶

type FontDecoder struct {
	// contains filtered or unexported fields
}

FontDecoder decodes glyph byte sequences to Unicode strings using CMap tables.

PDF fonts can use various encodings for text strings:

- Built-in encodings: WinAnsiEncoding, MacRomanEncoding, etc.
- Custom encodings: ToUnicode CMap (most common for non-Latin scripts)
- Identity encodings: Direct mapping (no decoding needed)

The FontDecoder handles all these cases and converts raw glyph bytes to readable Unicode text.

Reference: PDF 1.7 specification, Section 9.6.6 (Character Encoding).

func NewFontDecoder ¶

func NewFontDecoder(cmap *CMapTable, encoding string, use2ByteGlyphs bool) *FontDecoder

NewFontDecoder creates a new FontDecoder with the given CMap and encoding.

Parameters:

- cmap: ToUnicode CMap table (can be nil if not available)
- encoding: Base encoding name (e.g., "WinAnsiEncoding", "Identity-H")
- use2ByteGlyphs: true for 2-byte glyphs (CIDFonts), false for 1-byte glyphs

func NewFontDecoderWithCMap ¶

func NewFontDecoderWithCMap(cmap *CMapTable) *FontDecoder

NewFontDecoderWithCMap creates a FontDecoder that uses only CMap decoding.

This is a convenience constructor for the most common case: custom fonts with ToUnicode CMap (e.g., embedded fonts with Cyrillic text).

func NewFontDecoderWithCustomEncoding ¶

func NewFontDecoderWithCustomEncoding(differences map[uint16]string, baseEncoding string, use2ByteGlyphs bool) *FontDecoder

NewFontDecoderWithCustomEncoding creates a FontDecoder with custom glyph mappings.

This is used when a font has a /Encoding dictionary with /Differences array but no ToUnicode CMap.

Parameters:

- differences: Map of glyph ID → glyph name (from /Encoding/Differences)
- baseEncoding: Base encoding name (e.g., "WinAnsiEncoding")
- use2ByteGlyphs: true for 2-byte glyphs, false for 1-byte glyphs

Returns:

FontDecoder configured with custom glyph mappings

func (*FontDecoder) DecodeString ¶

func (d *FontDecoder) DecodeString(glyphBytes []byte) string

DecodeString decodes a glyph byte sequence to a Unicode string.

The decoding process:

1. Split bytes into glyphs (1-byte or 2-byte depending on font)
2. For each glyph ID, look up Unicode in CMap table
3. If CMap lookup fails, try built-in encoding (WinAnsi, MacRoman)
4. If all else fails, treat as ISO-8859-1 (Latin-1) for ASCII compatibility

Returns the decoded Unicode string. Invalid glyphs are replaced with Unicode replacement character (U+FFFD).
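
A simplified sketch of that lookup order is shown below. The decode function is hypothetical: it reads 2-byte glyph IDs as big-endian, skips the built-in encoding tables (step 3) entirely, and drops a trailing odd byte, all of which are assumptions made to keep the example short.

```go
package main

import "fmt"

// decode tries the ToUnicode table first, then falls back to Latin-1 for
// single-byte values, and finally emits U+FFFD for anything else.
func decode(glyphBytes []byte, cmap map[uint16]rune, twoByte bool) string {
	step := 1
	if twoByte {
		step = 2
	}
	var out []rune
	for i := 0; i+step <= len(glyphBytes); i += step {
		id := uint16(glyphBytes[i])
		if twoByte {
			// Assumed big-endian: high byte first.
			id = uint16(glyphBytes[i])<<8 | uint16(glyphBytes[i+1])
		}
		if r, ok := cmap[id]; ok {
			out = append(out, r)
		} else if id <= 0xFF {
			out = append(out, rune(id)) // Latin-1 values equal Unicode code points
		} else {
			out = append(out, '\uFFFD')
		}
	}
	return string(out)
}

func main() {
	cmap := map[uint16]rune{1: 'В', 2: 'ы'}
	fmt.Println(decode([]byte{0x00, 0x01, 0x00, 0x02}, cmap, true))
	fmt.Println(decode([]byte("Hi"), nil, false))
	fmt.Printf("%q\n", decode([]byte{0x12, 0x34}, nil, true))
}
```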

func (*FontDecoder) Encoding ¶

func (d *FontDecoder) Encoding() string

Encoding returns the base encoding name.

func (*FontDecoder) HasCMap ¶

func (d *FontDecoder) HasCMap() bool

HasCMap returns true if this decoder has a CMap table.

func (*FontDecoder) String ¶

func (d *FontDecoder) String() string

String returns a string representation of the decoder's configuration.

type GraphicsElement ¶

type GraphicsElement struct {
	Type   GraphicsType // Type of graphics element
	Points []Point      // Points defining the element
	Color  Color        // Stroke/fill color
	Width  float64      // Line width
}

GraphicsElement represents a graphics element extracted from a PDF page.

Graphics elements include lines, rectangles, and paths that can be used to detect ruling lines in tables (lattice mode detection).

PDF Graphics Operators (Section 8.5):

- Path construction: m (moveto), l (lineto), re (rectangle), c (curve)
- Path painting: S (stroke), s (close/stroke), f (fill), F (fill)

Reference: PDF 1.7 specification, Section 8.5 (Path Construction and Painting).
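
To illustrate how path operators turn into candidate ruling lines, the sketch below expands an `x y w h re` rectangle into its corner points (the PDF specification defines re as a moveto, three linetos, and a closepath) and classifies each edge. The names point, rectPath, and orientation are hypothetical, and the tolerance-based classification is an assumption rather than the package's detection logic.

```go
package main

import (
	"fmt"
	"math"
)

type point struct{ X, Y float64 }

// rectPath expands "x y w h re" into a closed path: start at (x, y), go to
// (x+w, y), (x+w, y+h), (x, y+h), and back to the start.
func rectPath(x, y, w, h float64) []point {
	return []point{{x, y}, {x + w, y}, {x + w, y + h}, {x, y + h}, {x, y}}
}

// orientation classifies a segment; tol is an assumed tolerance in points.
func orientation(a, b point, tol float64) string {
	switch {
	case math.Abs(a.Y-b.Y) <= tol:
		return "horizontal"
	case math.Abs(a.X-b.X) <= tol:
		return "vertical"
	default:
		return "diagonal"
	}
}

func main() {
	path := rectPath(10, 20, 100, 50)
	for i := 0; i+1 < len(path); i++ {
		fmt.Printf("%v -> %v: %s\n", path[i], path[i+1], orientation(path[i], path[i+1], 0.5))
	}
}
```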

func (*GraphicsElement) String ¶

func (ge *GraphicsElement) String() string

String returns a string representation of the graphics element.

type GraphicsParser ¶

type GraphicsParser struct {
	// contains filtered or unexported fields
}

GraphicsParser extracts graphics elements from PDF content streams.

The parser processes graphics operators to extract lines, rectangles, and other shapes that can be used for table detection (ruling lines).

Graphics State (Section 8.4):

- Current transformation matrix (CTM)
- Current path
- Line width, color, etc.

Reference: PDF 1.7 specification, Section 8 (Graphics).

func NewGraphicsParser ¶

func NewGraphicsParser(reader *parser.Reader) *GraphicsParser

NewGraphicsParser creates a new GraphicsParser for the given PDF reader.

func (*GraphicsParser) ParseFromPage ¶

func (gp *GraphicsParser) ParseFromPage(pageNum int) ([]*GraphicsElement, error)

ParseFromPage extracts all graphics elements from the specified page.

Page numbers are 0-based (first page is 0).

Returns a slice of GraphicsElements, or error if extraction fails.

type GraphicsState ¶

type GraphicsState struct {
	CurrentPath []Point // Points in current path
	LineWidth   float64 // Current line width
	StrokeColor Color   // Current stroke color
	FillColor   Color   // Current fill color
}

GraphicsState tracks the current graphics state during parsing.

func NewGraphicsState ¶

func NewGraphicsState() *GraphicsState

NewGraphicsState creates a new graphics state with defaults.

type GraphicsType ¶

type GraphicsType int

GraphicsType represents the type of graphics element.

const (
	// GraphicsTypeLine represents a straight line.
	GraphicsTypeLine GraphicsType = iota
	// GraphicsTypeRectangle represents a rectangle.
	GraphicsTypeRectangle
	// GraphicsTypePath represents a generic path.
	GraphicsTypePath
)

func (GraphicsType) String ¶

func (gt GraphicsType) String() string

String returns a string representation of the graphics type.

type ImageExtractor ¶

type ImageExtractor struct {
	// contains filtered or unexported fields
}

ImageExtractor extracts images from PDF pages.

This is an application service that coordinates image extraction from PDF documents using the domain model and infrastructure services.

Example:

reader, err := parser.OpenPDF("document.pdf")
if err != nil {
    log.Fatal(err)
}
defer reader.Close()

extractor := NewImageExtractor(reader)
images, err := extractor.ExtractFromPage(0) // 0-based page index
if err != nil {
    log.Fatal(err)
}
for i, img := range images {
    img.SaveToFile(fmt.Sprintf("image_%d.jpg", i))
}

func NewImageExtractor ¶

func NewImageExtractor(reader *parser.Reader) *ImageExtractor

NewImageExtractor creates a new image extractor.

Parameters:

reader: PDF reader providing access to document structure

Returns a configured ImageExtractor ready to extract images.

func (*ImageExtractor) ExtractFromDocument ¶

func (e *ImageExtractor) ExtractFromDocument() ([]*types.Image, error)

ExtractFromDocument extracts all images from all pages in the document.

This iterates through all pages and extracts images from each.

Returns a slice of all images found in the document, or error if extraction fails.

func (*ImageExtractor) ExtractFromPage ¶

func (e *ImageExtractor) ExtractFromPage(pageIndex int) ([]*types.Image, error)

ExtractFromPage extracts all images from a specific page.

This finds all image XObjects in the page's resources and extracts them.

Parameters:

pageIndex: 0-based page index

Returns a slice of images found on the page, or error if extraction fails.

type Matrix ¶

type Matrix struct {
	A, B, C, D, E, F float64
}

Matrix represents a transformation matrix used in PDF graphics and text.

PDF uses 3x3 transformation matrices in homogeneous coordinate space:

[ a  b  0 ]
[ c  d  0 ]
[ e  f  1 ]

The matrix is specified by six numbers: [a b c d e f]

Transformations:

- Translation: [1 0 0 1 tx ty] - moves by (tx, ty)
- Scaling: [sx 0 0 sy 0 0] - scales by (sx, sy)
- Rotation: [cos θ sin θ -sin θ cos θ 0 0] - rotates by θ
- Skewing: [1 tan α tan β 1 0 0] - skews by angles α, β

Reference: PDF 1.7 specification, Section 8.3.3 (Common Transformations).

func Identity ¶

func Identity() Matrix

Identity returns the identity matrix [1 0 0 1 0 0].

The identity matrix performs no transformation.

func NewMatrix ¶

func NewMatrix(a, b, c, d, e, f float64) Matrix

NewMatrix creates a new Matrix with the given values.

func Rotation ¶

func Rotation(angle float64) Matrix

Rotation creates a rotation matrix that rotates by angle (in radians).

func Scaling ¶

func Scaling(sx, sy float64) Matrix

Scaling creates a scaling matrix that scales by (sx, sy).

func Translation ¶

func Translation(tx, ty float64) Matrix

Translation creates a translation matrix that moves by (tx, ty).

func (Matrix) IsIdentity ¶

func (m Matrix) IsIdentity() bool

IsIdentity checks if the matrix is the identity matrix.

func (Matrix) Multiply ¶

func (m Matrix) Multiply(other Matrix) Matrix

Multiply multiplies this matrix by another matrix (m * other).

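Matrix multiplication is used to combine transformations. The order matters: under the row-vector convention of the formula below (the same convention Transform uses), the product m × other applies m to a point first, then other.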

The formula for matrix multiplication:

[ a1 b1 0 ]   [ a2 b2 0 ]   [ a1*a2+b1*c2  a1*b2+b1*d2  0 ]
[ c1 d1 0 ] × [ c2 d2 0 ] = [ c1*a2+d1*c2  c1*b2+d1*d2  0 ]
[ e1 f1 1 ]   [ e2 f2 1 ]   [ e1*a2+f1*c2+e2  e1*b2+f1*d2+f2  1 ]

Reference: PDF 1.7 specification, Section 8.3.4 (Transformation Matrices).
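
The product formula can be checked with a small stand-alone sketch. The matrix type below is a hypothetical copy of the six-number layout, written directly from the formula above; it is not the package's code. Under the row-vector convention of that formula, the left operand acts on a point first.

```go
package main

import "fmt"

type matrix struct{ A, B, C, D, E, F float64 }

// multiply returns m × o, term by term as in the 3x3 formula.
func (m matrix) multiply(o matrix) matrix {
	return matrix{
		A: m.A*o.A + m.B*o.C,
		B: m.A*o.B + m.B*o.D,
		C: m.C*o.A + m.D*o.C,
		D: m.C*o.B + m.D*o.D,
		E: m.E*o.A + m.F*o.C + o.E,
		F: m.E*o.B + m.F*o.D + o.F,
	}
}

// transform applies [x y 1] × m.
func (m matrix) transform(x, y float64) (float64, float64) {
	return m.A*x + m.C*y + m.E, m.B*x + m.D*y + m.F
}

func main() {
	translate := matrix{1, 0, 0, 1, 10, 20}
	scale := matrix{2, 0, 0, 2, 0, 0}

	// translate × scale: the point is moved first, then scaled.
	x1, y1 := translate.multiply(scale).transform(1, 1)
	// scale × translate: the point is scaled first, then moved.
	x2, y2 := scale.multiply(translate).transform(1, 1)

	fmt.Println(x1, y1) // 22 42
	fmt.Println(x2, y2) // 12 22
}
```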

func (Matrix) String ¶

func (m Matrix) String() string

String returns a string representation of the matrix.

func (Matrix) Transform ¶

func (m Matrix) Transform(x, y float64) (float64, float64)

Transform applies the matrix transformation to a point (x, y).

The transformation formula is:

x' = a*x + c*y + e
y' = b*x + d*y + f

This is used to convert text coordinates from text space to user space.

Reference: PDF 1.7 specification, Section 8.3.2 (Coordinate Spaces).

type Operator ¶

type Operator struct {
	Name     string             // Operator name (e.g., "Tj", "Tm", "BT")
	Operands []parser.PdfObject // Operands for the operator
}

Operator represents a PDF content stream operator with its operands.

PDF content streams consist of a sequence of operators and their operands. The general format is:

operand1 operand2 ... operandN operator

For example:

- "100 200 Td" - Move text position to (100, 200)
- "(Hello) Tj" - Show text "Hello"
- "/F1 12 Tf" - Set font F1 with size 12

Reference: PDF 1.7 specification, Section 7.8.2 (Content Streams).

func NewOperator ¶

func NewOperator(name string, operands []parser.PdfObject) *Operator

NewOperator creates a new Operator with the given name and operands.

func (*Operator) String ¶

func (op *Operator) String() string

String returns a string representation of the operator.

type Point ¶

type Point struct {
	X, Y float64
}

Point represents a 2D point in PDF coordinate space.

PDF coordinates are in points (1/72 inch), with origin at bottom-left.

func NewPoint ¶

func NewPoint(x, y float64) Point

NewPoint creates a new Point.

func (Point) String ¶

func (p Point) String() string

String returns a string representation of the point.

type Rectangle ¶

type Rectangle struct {
	X      float64 // Bottom-left X coordinate
	Y      float64 // Bottom-left Y coordinate
	Width  float64 // Width
	Height float64 // Height
}

Rectangle represents a rectangular bounding box.

This is a simplified version for text extraction. The full Rectangle value object is in domain/types.

func NewRectangle ¶

func NewRectangle(x, y, width, height float64) Rectangle

NewRectangle creates a new Rectangle.

func (Rectangle) Bottom ¶

func (r Rectangle) Bottom() float64

Bottom returns the Y coordinate of the bottom edge.

func (Rectangle) Contains ¶

func (r Rectangle) Contains(x, y float64) bool

Contains checks if a point (x, y) is inside the rectangle.

func (Rectangle) Left ¶

func (r Rectangle) Left() float64

Left returns the X coordinate of the left edge.

func (Rectangle) Right ¶

func (r Rectangle) Right() float64

Right returns the X coordinate of the right edge.

func (Rectangle) String ¶

func (r Rectangle) String() string

String returns a string representation of the rectangle.

func (Rectangle) Top ¶

func (r Rectangle) Top() float64

Top returns the Y coordinate of the top edge.

type TextChunk ¶

type TextChunk struct {
	Elements []*TextElement // Text elements in this chunk
	Bounds   Rectangle      // Bounding box of all elements
}

TextChunk represents a group of text elements.

A chunk is used to group related text elements (e.g., text on the same line, text in the same cell, text in the same paragraph).

This is useful for table extraction where we need to group text into cells.

func NewTextChunk ¶

func NewTextChunk(elements []*TextElement) *TextChunk

NewTextChunk creates a new TextChunk with the given elements.

The bounding box is calculated from the elements.
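
One plausible way to compute such a bounding box is the union of the element rectangles, sketched below. The rect type and union function are hypothetical, and returning a zero rectangle for an empty input is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

type rect struct{ X, Y, Width, Height float64 } // X, Y is the bottom-left corner

// union returns the smallest rectangle that covers every input rectangle.
func union(rs []rect) rect {
	if len(rs) == 0 {
		return rect{}
	}
	minX, minY := rs[0].X, rs[0].Y
	maxX, maxY := rs[0].X+rs[0].Width, rs[0].Y+rs[0].Height
	for _, r := range rs[1:] {
		minX = math.Min(minX, r.X)
		minY = math.Min(minY, r.Y)
		maxX = math.Max(maxX, r.X+r.Width)
		maxY = math.Max(maxY, r.Y+r.Height)
	}
	return rect{minX, minY, maxX - minX, maxY - minY}
}

func main() {
	words := []rect{{10, 100, 30, 10}, {50, 98, 40, 12}}
	fmt.Printf("%+v\n", union(words))
}
```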

func (*TextChunk) Add ¶

func (tc *TextChunk) Add(elem *TextElement)

Add adds a text element to the chunk and updates bounds.

func (*TextChunk) Len ¶

func (tc *TextChunk) Len() int

Len returns the number of elements in the chunk.

func (*TextChunk) String ¶

func (tc *TextChunk) String() string

String returns a string representation of the chunk.

func (*TextChunk) Text ¶

func (tc *TextChunk) Text() string

Text returns the concatenated text of all elements.

type TextElement ¶

type TextElement struct {
	Text     string  // The actual text content
	X        float64 // X coordinate (bottom-left, in points)
	Y        float64 // Y coordinate (bottom-left, in points)
	Width    float64 // Width of text (in points)
	Height   float64 // Height of text (in points)
	FontName string  // Font name (e.g., "/F1", "/Helvetica")
	FontSize float64 // Font size in points
}

TextElement represents a single piece of text extracted from a PDF page.

Each TextElement has position information (X, Y coordinates) which is critical for table extraction and layout analysis. The coordinates represent the bottom-left corner of the text element in PDF coordinate space.

PDF Coordinate System (Section 8.3.2):

- Origin (0,0) is at bottom-left of page
- X increases to the right
- Y increases upward
- Coordinates are in points (1 point = 1/72 inch)

Reference: PDF 1.7 specification, Section 9.4 (Text Objects).
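
Many image and UI toolkits place the origin at the top-left instead, so a conversion is often needed. The helpers below are a hypothetical sketch: toTopLeft and toInches are not part of the package, and the page height of 792 points (US Letter) is only an example value.

```go
package main

import "fmt"

const pointsPerInch = 72.0

// toTopLeft converts the bottom edge Y of a box (bottom-left origin) into
// the Y of its top edge measured downward from the top of the page.
func toTopLeft(y, height, pageHeight float64) float64 {
	return pageHeight - (y + height)
}

// toInches converts a length in points to inches.
func toInches(pts float64) float64 { return pts / pointsPerInch }

func main() {
	const pageHeight = 792.0 // example: US Letter, 11 inches tall
	fmt.Println(toTopLeft(720, 12, pageHeight)) // 60
	fmt.Println(toInches(144))                  // 2
}
```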

func NewTextElement ¶

func NewTextElement(text string, x, y, width, height float64, fontName string, fontSize float64) *TextElement

NewTextElement creates a new TextElement with the given properties.

func (*TextElement) Bottom ¶

func (te *TextElement) Bottom() float64

Bottom returns the Y coordinate of the bottom edge (same as Y).

func (*TextElement) CenterX ¶

func (te *TextElement) CenterX() float64

CenterX returns the X coordinate of the center of the text.

func (*TextElement) CenterY ¶

func (te *TextElement) CenterY() float64

CenterY returns the Y coordinate of the center of the text.

func (*TextElement) Left ¶

func (te *TextElement) Left() float64

Left returns the X coordinate of the left edge (same as X).

func (*TextElement) Right ¶

func (te *TextElement) Right() float64

Right returns the X coordinate of the right edge of the text.

func (*TextElement) String ¶

func (te *TextElement) String() string

String returns a string representation of the text element.

func (*TextElement) Top ¶

func (te *TextElement) Top() float64

Top returns the Y coordinate of the top edge of the text.

func (*TextElement) VerticalOverlapRatio ¶

func (te *TextElement) VerticalOverlapRatio(other *TextElement) float64

VerticalOverlapRatio calculates the vertical overlap ratio between this element and another.

Returns a value between 0.0 (no overlap) and 1.0 (complete overlap). Based on Tabula's algorithm (tabula-java/Rectangle.java:73-90).

This is used for row detection in tables without ruling lines (Stream mode). Elements with overlap < threshold (e.g., 0.1) are considered separate rows.

type TextExtractor ¶

type TextExtractor struct {
	// contains filtered or unexported fields
}

TextExtractor extracts text with positional information from PDF pages.

The extractor processes PDF content streams and interprets text operators to extract text along with its X,Y coordinates. This is critical for table extraction, as we need to know where each piece of text is located.

Text Extraction Process:

1. Get page's content stream(s)
2. Decode stream (handle FlateDecode, etc.)
3. Parse content operators
4. Track text state (font, position, matrix)
5. Extract text with coordinates when text showing operators are encountered
6. Decode glyph bytes to Unicode using font CMap/encoding

Reference: PDF 1.7 specification, Section 9.4 (Text Objects).

func NewTextExtractor ¶

func NewTextExtractor(reader *parser.Reader) *TextExtractor

NewTextExtractor creates a new TextExtractor for the given PDF reader.

func (*TextExtractor) ExtractFromPage ¶

func (te *TextExtractor) ExtractFromPage(pageNum int) ([]*TextElement, error)

ExtractFromPage extracts all text elements from the specified page.

Page numbers are 0-based (first page is 0).

Returns a slice of TextElements with position information, or error if extraction fails.

type TextState ¶

type TextState struct {
	// Text matrices (Section 9.4.2)
	Tm  Matrix // Current text matrix
	Tlm Matrix // Text line matrix (start of line)

	// Text state parameters (Section 9.3)
	FontName   string  // Current font name (from Tf operator)
	FontSize   float64 // Current font size in points (from Tf operator)
	CharSpace  float64 // Character spacing (from Tc operator)
	WordSpace  float64 // Word spacing (from Tw operator)
	HorizScale float64 // Horizontal scaling as percentage (from Tz operator, 100 = normal)
	Leading    float64 // Text leading in points (from TL operator)
	Rise       float64 // Text rise in points (from Ts operator)

	// Current position (derived from Tm)
	CurrentX float64
	CurrentY float64
}

TextState tracks the current text state during content stream parsing.

The PDF text state includes all parameters that affect how text is rendered:

- Text matrix (Tm): Current text position and transformation
- Text line matrix (Tlm): Position of the start of the current line
- Font and size
- Character/word spacing
- Horizontal scaling
- Text leading (line spacing)
- Text rise (vertical offset)

These parameters are modified by text operators (Tf, Tc, Tw, Tz, TL, Ts, Tm, Td, etc.) and affect how text showing operators (Tj, TJ, ', ") render text.

Reference: PDF 1.7 specification, Section 9.3 (Text State Parameters).

func NewTextState ¶

func NewTextState() *TextState

NewTextState creates a new TextState with default values.

Default values:

- Identity matrices for Tm and Tlm
- Empty font name, 0 font size
- 0 character spacing, word spacing, rise
- 100% horizontal scaling
- 0 leading

Reference: PDF 1.7 specification, Section 9.3.1 (Text State Parameters and Operators).

func (*TextState) AdvanceX ¶

func (ts *TextState) AdvanceX(width float64)

AdvanceX advances the current X position by the given width.

This is used when showing text to move the text position by the width of the text. The width should account for character spacing, word spacing, and horizontal scaling.
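
The PDF specification gives the horizontal displacement of a glyph as tx = ((w0 - Tj/1000) × Tfs + Tc + Tw) × Th. The sketch below computes that value for the case without a TJ adjustment; the advance function and its parameter names are hypothetical, glyph widths are assumed to be in 1/1000 text-space units, and horizontal scaling is a percentage as in the HorizScale field.

```go
package main

import "fmt"

// advance returns the horizontal displacement for one glyph.
// Word spacing is applied only when the glyph is a space, as the
// specification applies Tw to the single-byte character code 32.
func advance(glyphWidth, fontSize, charSpace, wordSpace, horizScale float64, isSpace bool) float64 {
	tx := glyphWidth/1000*fontSize + charSpace
	if isSpace {
		tx += wordSpace
	}
	return tx * horizScale / 100
}

func main() {
	// A 500-unit glyph at 12 pt with no extra spacing.
	fmt.Println(advance(500, 12, 0, 0, 100, false))
	// A space with character and word spacing at 50% horizontal scaling.
	fmt.Println(advance(500, 12, 1, 2, 50, true))
}
```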

func (*TextState) MoveToNextLine ¶

func (ts *TextState) MoveToNextLine()

MoveToNextLine moves to the start of the next line (T* operator).

The T* operator is equivalent to: 0 -Tl Td where Tl is the current leading.

Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).

func (*TextState) Reset ¶

func (ts *TextState) Reset()

Reset resets the text state to default values.

This is called when a BT (Begin Text) operator is encountered. According to the PDF spec, BT initializes the text matrix and line matrix to identity.

Reference: PDF 1.7 specification, Section 9.4.1 (Text Objects).

func (*TextState) SetFont ¶

func (ts *TextState) SetFont(fontName string, fontSize float64)

SetFont sets the current font and size (Tf operator).

The Tf operator takes a font name and size as operands, written before the operator:

/FontName size Tf

Reference: PDF 1.7 specification, Section 9.3 (Text State Parameters).

func (*TextState) SetTextMatrix ¶

func (ts *TextState) SetTextMatrix(a, b, c, d, e, f float64)

SetTextMatrix sets the text matrix (Tm operator).

The Tm operator replaces the current text matrix with a new matrix specified by six numbers: a b c d e f Tm

This also updates the text line matrix (Tlm = Tm).

Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).

func (*TextState) String ¶

func (ts *TextState) String() string

String returns a string representation of the text state.

func (*TextState) Translate ¶

func (ts *TextState) Translate(tx, ty float64)

Translate moves the text position by (tx, ty) (Td operator).

The Td operator updates both the text matrix and text line matrix:

Tlm = [1 0 0 1 tx ty] × Tlm
Tm = Tlm

Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).
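
The sketch below follows the PDF specification's definition of Td, in which the translation matrix is the left operand of the product, so the offset is expressed in the units of the current line matrix. The matrix and textState types are hypothetical stand-ins, not the package's types.

```go
package main

import "fmt"

type matrix struct{ A, B, C, D, E, F float64 }

// multiply returns m × o for the six-number PDF matrix layout.
func (m matrix) multiply(o matrix) matrix {
	return matrix{
		A: m.A*o.A + m.B*o.C,
		B: m.A*o.B + m.B*o.D,
		C: m.C*o.A + m.D*o.C,
		D: m.C*o.B + m.D*o.D,
		E: m.E*o.A + m.F*o.C + o.E,
		F: m.E*o.B + m.F*o.D + o.F,
	}
}

type textState struct{ Tm, Tlm matrix }

// translate implements "tx ty Td": Tlm = [1 0 0 1 tx ty] × Tlm, Tm = Tlm.
func (ts *textState) translate(tx, ty float64) {
	ts.Tlm = matrix{1, 0, 0, 1, tx, ty}.multiply(ts.Tlm)
	ts.Tm = ts.Tlm
}

func main() {
	// Line matrix that scales by 2 and starts at (100, 200).
	start := matrix{2, 0, 0, 2, 100, 200}
	ts := textState{Tm: start, Tlm: start}

	ts.translate(10, 0)
	// The 10-unit offset is scaled by 2, so the line origin moves to x = 120.
	fmt.Println(ts.Tlm.E, ts.Tlm.F)
}
```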

func (*TextState) TranslateSetLeading ¶

func (ts *TextState) TranslateSetLeading(tx, ty float64)

TranslateSetLeading moves the text position and sets leading (TD operator).

The TD operator is equivalent to:

-ty TL (set leading to -ty)
tx ty Td (translate)

Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).
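
The relationship between TD, TL, and T* can be sketched with a simplified state that tracks only the line origin. The lineState type is hypothetical and assumes an unscaled, unrotated text line matrix, so offsets can be added directly.

```go
package main

import "fmt"

// lineState tracks the start of the current line and the leading.
type lineState struct {
	X, Y    float64
	Leading float64
}

// td implements "tx ty Td" for an unscaled line matrix.
func (s *lineState) td(tx, ty float64) {
	s.X += tx
	s.Y += ty
}

// tdSetLeading implements "tx ty TD": set leading to -ty, then translate.
func (s *lineState) tdSetLeading(tx, ty float64) {
	s.Leading = -ty
	s.td(tx, ty)
}

// nextLine implements "T*": equivalent to "0 -Tl Td".
func (s *lineState) nextLine() {
	s.td(0, -s.Leading)
}

func main() {
	s := lineState{X: 72, Y: 700}
	s.tdSetLeading(0, -14) // moves down 14 points and records leading = 14
	fmt.Println(s.Y, s.Leading)
	s.nextLine() // moves down by the recorded leading
	fmt.Println(s.Y)
}
```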
