extractor

package
v0.6.0 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 24, 2026 License: MIT Imports: 15 Imported by: 0

Documentation

Overview

Package extractor implements PDF content extraction use cases.

This is the Application layer in DDD/Clean Architecture. It orchestrates domain logic and infrastructure for extracting content from PDFs.

Types

type CMapParser

type CMapParser struct {
	// contains filtered or unexported fields
}

CMapParser parses CMap (Character Map) streams from PDF ToUnicode entries.

CMap Format (simplified):

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def

% Single character mappings
3 beginbfchar
<0001> <0412>  % Glyph 0x01 → U+0412 'В'
<0002> <044B>  % Glyph 0x02 → U+044B 'ы'
<0003> <043F>  % Glyph 0x03 → U+043F 'п'
endbfchar

% Range mappings
1 beginbfrange
<0010> <0020> <0430>  % Glyphs 0x10-0x20 → U+0430-U+0440
endbfrange

endcmap

Reference: PDF 1.7 specification, Section 9.10.3 (ToUnicode CMaps).

func NewCMapParser

func NewCMapParser(data []byte) *CMapParser

NewCMapParser creates a new CMapParser for the given stream data.

The stream should be the decoded content of a ToUnicode CMap stream.

func (*CMapParser) Parse

func (p *CMapParser) Parse() (*CMapTable, error)

Parse parses the CMap stream and returns a CMapTable.

The parser handles:

  • beginbfchar/endbfchar: Single character mappings
  • beginbfrange/endbfrange: Range mappings

Unsupported operators are silently ignored for graceful degradation.
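
The two supported sections can be illustrated with a small standalone sketch. It does not use this package: parseHex and parseToUnicode are hypothetical helpers, and the sketch assumes one mapping per line, 16-bit source codes, and single-code-point destinations. Anything outside a bfchar or bfrange section is skipped, mirroring the "ignore what is unsupported" behavior described above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseHex converts a token such as "<0412>" to its numeric value.
func parseHex(tok string) (uint32, error) {
	tok = strings.TrimSuffix(strings.TrimPrefix(tok, "<"), ">")
	v, err := strconv.ParseUint(tok, 16, 32)
	return uint32(v), err
}

// parseToUnicode is a hypothetical, line-oriented sketch (not the package's
// parser). It assumes one mapping per line and ignores unknown content.
func parseToUnicode(data string) map[uint16]rune {
	table := make(map[uint16]rune)
	section := ""
	for _, line := range strings.Split(data, "\n") {
		if i := strings.Index(line, "%"); i >= 0 {
			line = line[:i] // strip trailing comments
		}
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue
		}
		switch last := fields[len(fields)-1]; last {
		case "beginbfchar", "beginbfrange":
			section = last
			continue
		case "endbfchar", "endbfrange":
			section = ""
			continue
		}
		vals := make([]uint32, 0, 3)
		for _, f := range fields {
			v, err := parseHex(f)
			if err != nil {
				vals = nil // not a mapping line
				break
			}
			vals = append(vals, v)
		}
		switch {
		case section == "beginbfchar" && len(vals) == 2:
			table[uint16(vals[0])] = rune(vals[1])
		case section == "beginbfrange" && len(vals) == 3 &&
			vals[0] <= vals[1] && vals[1] <= 0xFFFF:
			for g := vals[0]; g <= vals[1]; g++ {
				table[uint16(g)] = rune(vals[2] + (g - vals[0]))
			}
		}
	}
	return table
}

func main() {
	sample := "3 beginbfchar\n<0001> <0412>\n<0002> <044B>\n<0003> <043F>\nendbfchar\n" +
		"1 beginbfrange\n<0010> <0020> <0430>\nendbfrange\n"
	table := parseToUnicode(sample)
	fmt.Printf("%d mappings, glyph 0x0001 -> %c\n", len(table), table[0x0001])
}
```

Real ToUnicode streams may place several mappings on one line or use array destinations in bfrange; a token-based parser would be needed to cover those cases.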

type CMapTable

type CMapTable struct {
	// contains filtered or unexported fields
}

CMapTable represents a Character Map that maps glyph IDs to Unicode code points.

CMap (Character Map) defines the mapping between character codes (glyph IDs) used in a PDF font and the corresponding Unicode values. This is essential for extracting readable text from PDFs, especially for custom encodings and non-Latin scripts like Cyrillic, Chinese, Japanese, etc.

The mapping is stored as glyph ID (uint16) → Unicode rune (int32).

Reference: PDF 1.7 specification, Section 9.10.3 (ToUnicode CMaps).

func NewCMapTable

func NewCMapTable(name string) *CMapTable

NewCMapTable creates a new empty CMapTable.

func ParseCMapStream

func ParseCMapStream(data []byte) (*CMapTable, error)

ParseCMapStream is a convenience function that parses a CMap stream.

This is equivalent to:

parser := NewCMapParser(data)
return parser.Parse()

func (*CMapTable) AddMapping

func (t *CMapTable) AddMapping(glyphID uint16, unicode rune)

AddMapping adds a single glyph ID to Unicode mapping.

func (*CMapTable) AddRangeMapping

func (t *CMapTable) AddRangeMapping(startGlyphID, endGlyphID uint16, startUnicode rune)

AddRangeMapping adds a range of glyph IDs to consecutive Unicode values.

For example: AddRangeMapping(0x10, 0x20, 0x0430) maps:

  • Glyph 0x10 → U+0430 ('а')
  • Glyph 0x11 → U+0431 ('б')
  • ...
  • Glyph 0x20 → U+0440 ('р')
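
The expansion in this example can be sketched on a plain map. The helper addRange below is a hypothetical stand-in, not the package's implementation; it only shows how consecutive glyph IDs pair with consecutive code points.

```go
package main

import "fmt"

// addRange is a hypothetical stand-in for AddRangeMapping on a plain map.
func addRange(table map[uint16]rune, start, end uint16, startUnicode rune) {
	if end < start {
		return // empty or inverted range: nothing to add
	}
	// Iterate with an int so the loop terminates even when end == 0xFFFF.
	for g := int(start); g <= int(end); g++ {
		table[uint16(g)] = startUnicode + rune(g-int(start))
	}
}

func main() {
	table := make(map[uint16]rune)
	addRange(table, 0x10, 0x20, 0x0430)
	fmt.Printf("%d mappings, glyph 0x11 -> %c\n", len(table), table[0x11])
}
```

Looping over an int rather than a uint16 is a deliberate choice: a uint16 counter would wrap around to 0 after 0xFFFF and never exit the loop.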

func (*CMapTable) GetUnicode

func (t *CMapTable) GetUnicode(glyphID uint16) (rune, bool)

GetUnicode returns the Unicode code point for a given glyph ID.

Returns the Unicode rune and true if mapping exists, or 0 and false if not found.

func (*CMapTable) Name

func (t *CMapTable) Name() string

Name returns the CMap name.

func (*CMapTable) Size

func (t *CMapTable) Size() int

Size returns the number of mappings in the table.

type CellExtractor

type CellExtractor struct {
	// contains filtered or unexported fields
}

CellExtractor extracts text content from a rectangular cell region.

The extractor:

  • Finds all text elements within cell bounds
  • Sorts text by position (top to bottom, left to right)
  • Joins text with proper spacing and line breaks
  • Handles multi-line content

This is a critical component for table extraction (Phase 2.7).

func NewCellExtractor

func NewCellExtractor(textElements []*TextElement) *CellExtractor

NewCellExtractor creates a new CellExtractor with the given text elements.

func (*CellExtractor) ExtractCellContent

func (ce *CellExtractor) ExtractCellContent(bounds Rectangle) string

ExtractCellContent extracts text from a rectangular region (cell bounds).

Algorithm:

  1. Find all text elements within the cell bounds
  2. Group text elements by line (based on Y position)
  3. Sort lines from top to bottom
  4. Within each line, sort elements left to right
  5. Join text with appropriate spacing

Parameters:

  • bounds: The rectangular region to extract text from

Returns the extracted text, or empty string if no text is found.
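
A minimal sketch of the five steps, written against local stand-in types rather than this package's TextElement and Rectangle. The names elem, rect, cellText, and the lineTol tolerance are assumptions made for illustration, as is the choice to join words with a single space and lines with a newline.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

type elem struct {
	Text                string
	X, Y, Width, Height float64
}

type rect struct{ X, Y, Width, Height float64 }

// contains reports whether the element's center point lies inside r.
func (r rect) contains(e elem) bool {
	cx, cy := e.X+e.Width/2, e.Y+e.Height/2
	return cx >= r.X && cx <= r.X+r.Width && cy >= r.Y && cy <= r.Y+r.Height
}

// cellText is a hypothetical sketch; lineTol is the assumed maximum Y
// difference (in points) for two elements to count as the same line.
func cellText(elems []elem, bounds rect, lineTol float64) string {
	// Step 1: keep only the elements inside the cell.
	var inside []elem
	for _, e := range elems {
		if bounds.contains(e) {
			inside = append(inside, e)
		}
	}
	// Steps 2 and 3: order top to bottom (Y grows upward), then group by line.
	sort.Slice(inside, func(i, j int) bool { return inside[i].Y > inside[j].Y })
	var lines [][]elem
	for _, e := range inside {
		n := len(lines)
		if n > 0 && math.Abs(lines[n-1][0].Y-e.Y) <= lineTol {
			lines[n-1] = append(lines[n-1], e)
		} else {
			lines = append(lines, []elem{e})
		}
	}
	// Steps 4 and 5: order each line left to right and join the text.
	parts := make([]string, 0, len(lines))
	for _, line := range lines {
		sort.Slice(line, func(i, j int) bool { return line[i].X < line[j].X })
		words := make([]string, len(line))
		for i, e := range line {
			words[i] = e.Text
		}
		parts = append(parts, strings.Join(words, " "))
	}
	return strings.Join(parts, "\n")
}

func main() {
	elems := []elem{
		{"World", 50, 100, 30, 10},
		{"Hello", 10, 100.5, 30, 10},
		{"below", 10, 80, 30, 10},
	}
	fmt.Println(cellText(elems, rect{0, 70, 200, 50}, 2))
}
```

Grouping after a plain Y sort avoids using a tolerance inside the sort comparator, which would not define a consistent ordering.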

func (*CellExtractor) FindElementsInBounds

func (ce *CellExtractor) FindElementsInBounds(bounds Rectangle) []*TextElement

FindElementsInBounds returns all text elements that are within the bounds.

An element is considered "within" if its center point is inside the bounds. This handles cases where text might slightly overlap cell boundaries.

This method is exported for use by other extractors (e.g., table alignment detection).

type Color

type Color struct {
	R, G, B float64 // RGB values (0.0 - 1.0)
}

Color represents an RGB color.

func NewColor

func NewColor(r, g, b float64) Color

NewColor creates a new Color.

func (Color) IsBlack

func (c Color) IsBlack() bool

IsBlack returns true if the color is black (or very dark).

func (Color) String

func (c Color) String() string

String returns a string representation of the color.

type ContentParser

type ContentParser struct {
	// contains filtered or unexported fields
}

ContentParser parses PDF content streams into operators.

Content streams contain a sequence of operators that describe page graphics and text. The parser reads the stream and extracts operators with their operands.

Example content stream:

BT
  /F1 12 Tf
  100 200 Td
  (Hello, World!) Tj
ET

This would be parsed into operators:

  • Operator{Name: "BT"}
  • Operator{Name: "Tf", Operands: ["/F1", 12]}
  • Operator{Name: "Td", Operands: [100, 200]}
  • Operator{Name: "Tj", Operands: ["(Hello, World!)"]}
  • Operator{Name: "ET"}

Reference: PDF 1.7 specification, Section 7.8 (Content Streams).

func NewContentParser

func NewContentParser(content []byte) *ContentParser

NewContentParser creates a new ContentParser for the given content stream.

func (*ContentParser) ParseOperators

func (cp *ContentParser) ParseOperators() ([]*Operator, error)

ParseOperators parses all operators from the content stream.

Returns a slice of operators in the order they appear in the stream. Returns error if parsing fails.

Content streams are sequences of objects followed by operators (keywords). Example: "100 200 Td" means: push 100, push 200, execute Td operator.
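
The operand-then-operator structure can be sketched with a small standalone tokenizer. This is an illustration under simplifying assumptions, not the package's parser: operands are kept as raw strings, and only numbers, names (/F1), and literal strings in parentheses are recognized. Arrays, dictionaries, hex strings, comments, and inline images are not handled. The names op, tokenize, and parseOperators are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type op struct {
	Name     string
	Operands []string
}

// tokenize splits a content stream into tokens, keeping literal strings
// "(...)" intact, including nested parentheses and backslash escapes.
func tokenize(s string) []string {
	var toks []string
	for i := 0; i < len(s); {
		c := s[i]
		switch {
		case c == ' ' || c == '\n' || c == '\r' || c == '\t':
			i++
		case c == '(':
			depth, j := 0, i
			for ; j < len(s); j++ {
				if s[j] == '\\' {
					j++ // skip the escaped byte
				} else if s[j] == '(' {
					depth++
				} else if s[j] == ')' {
					depth--
					if depth == 0 {
						break
					}
				}
			}
			if j >= len(s) {
				j = len(s) - 1 // unterminated string: take the rest
			}
			toks = append(toks, s[i:j+1])
			i = j + 1
		default:
			j := i
			for j < len(s) && !strings.ContainsRune(" \n\r\t(", rune(s[j])) {
				j++
			}
			toks = append(toks, s[i:j])
			i = j
		}
	}
	return toks
}

// parseOperators collects operands until a keyword appears, then emits
// an operator that owns the operands gathered so far.
func parseOperators(s string) []op {
	var ops []op
	var operands []string
	for _, t := range tokenize(s) {
		_, numErr := strconv.ParseFloat(t, 64)
		if numErr == nil || t[0] == '/' || t[0] == '(' {
			operands = append(operands, t)
			continue
		}
		ops = append(ops, op{Name: t, Operands: operands})
		operands = nil
	}
	return ops
}

func main() {
	stream := "BT\n  /F1 12 Tf\n  100 200 Td\n  (Hello, World!) Tj\nET"
	for _, o := range parseOperators(stream) {
		fmt.Printf("%s %q\n", o.Name, o.Operands)
	}
}
```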

type FontDecoder

type FontDecoder struct {
	// contains filtered or unexported fields
}

FontDecoder decodes glyph byte sequences to Unicode strings using CMap tables.

PDF fonts can use various encodings for text strings:

  • Built-in encodings: WinAnsiEncoding, MacRomanEncoding, etc.
  • Custom encodings: ToUnicode CMap (most common for non-Latin scripts)
  • Identity encodings: Direct mapping (no decoding needed)

The FontDecoder handles all these cases and converts raw glyph bytes to readable Unicode text.

Reference: PDF 1.7 specification, Section 9.6.6 (Character Encoding).

func NewFontDecoder

func NewFontDecoder(cmap *CMapTable, encoding string, use2ByteGlyphs bool) *FontDecoder

NewFontDecoder creates a new FontDecoder with the given CMap and encoding.

Parameters:

  • cmap: ToUnicode CMap table (can be nil if not available)
  • encoding: Base encoding name (e.g., "WinAnsiEncoding", "Identity-H")
  • use2ByteGlyphs: true for 2-byte glyphs (CIDFonts), false for 1-byte glyphs

func NewFontDecoderWithCMap

func NewFontDecoderWithCMap(cmap *CMapTable) *FontDecoder

NewFontDecoderWithCMap creates a FontDecoder that uses only CMap decoding.

This is a convenience constructor for the most common case: custom fonts with ToUnicode CMap (e.g., embedded fonts with Cyrillic text).

func NewFontDecoderWithCustomEncoding

func NewFontDecoderWithCustomEncoding(differences map[uint16]string, baseEncoding string, use2ByteGlyphs bool) *FontDecoder

NewFontDecoderWithCustomEncoding creates a FontDecoder with custom glyph mappings.

This is used when a font has a /Encoding dictionary with /Differences array but no ToUnicode CMap.

Parameters:

  • differences: Map of glyph ID → glyph name (from /Encoding/Differences)
  • baseEncoding: Base encoding name (e.g., "WinAnsiEncoding")
  • use2ByteGlyphs: true for 2-byte glyphs, false for 1-byte glyphs

Returns:

  • FontDecoder configured with custom glyph mappings

func (*FontDecoder) DecodeString

func (d *FontDecoder) DecodeString(glyphBytes []byte) string

DecodeString decodes a glyph byte sequence to a Unicode string.

The decoding process:

  1. Split bytes into glyphs (1-byte or 2-byte depending on font)
  2. For each glyph ID, look up Unicode in CMap table
  3. If CMap lookup fails, try built-in encoding (WinAnsi, MacRoman)
  4. If all else fails, treat as ISO-8859-1 (Latin-1) for ASCII compatibility

Returns the decoded Unicode string. Invalid glyphs are replaced with Unicode replacement character (U+FFFD).
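
The lookup chain can be sketched as a standalone function. This is an assumption-laden illustration, not the FontDecoder implementation: decode is a hypothetical helper, cmap stands in for a ToUnicode table, base stands in for a built-in encoding table, and 2-byte glyph IDs are assumed to be big-endian.

```go
package main

import "fmt"

// decode is a hypothetical sketch of the fallback chain:
// CMap lookup, then base encoding, then Latin-1, then U+FFFD.
func decode(glyphBytes []byte, cmap, base map[uint16]rune, twoByte bool) string {
	step := 1
	if twoByte {
		step = 2
	}
	out := make([]rune, 0, len(glyphBytes)/step)
	for i := 0; i < len(glyphBytes); i += step {
		if i+step > len(glyphBytes) {
			out = append(out, '\uFFFD') // truncated trailing glyph
			break
		}
		id := uint16(glyphBytes[i])
		if twoByte {
			id = uint16(glyphBytes[i])<<8 | uint16(glyphBytes[i+1])
		}
		if r, ok := cmap[id]; ok {
			out = append(out, r)
		} else if r, ok := base[id]; ok {
			out = append(out, r)
		} else if id <= 0xFF {
			out = append(out, rune(id)) // Latin-1: byte value equals code point
		} else {
			out = append(out, '\uFFFD')
		}
	}
	return string(out)
}

func main() {
	cmap := map[uint16]rune{0x0001: 'В', 0x0002: 'ы'}
	fmt.Println(decode([]byte{0x00, 0x01, 0x00, 0x02}, cmap, nil, true))
	fmt.Println(decode([]byte("Hi"), nil, nil, false))
}
```

Lookups on a nil map are safe in Go, so a missing CMap or base table simply falls through to the next step.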

func (*FontDecoder) Encoding

func (d *FontDecoder) Encoding() string

Encoding returns the base encoding name.

func (*FontDecoder) HasCMap

func (d *FontDecoder) HasCMap() bool

HasCMap returns true if this decoder has a CMap table.

func (*FontDecoder) String

func (d *FontDecoder) String() string

String returns a string representation of the decoder's configuration.

type GraphicsElement

type GraphicsElement struct {
	Type   GraphicsType // Type of graphics element
	Points []Point      // Points defining the element
	Color  Color        // Stroke/fill color
	Width  float64      // Line width
}

GraphicsElement represents a graphics element extracted from a PDF page.

Graphics elements include lines, rectangles, and paths that can be used to detect ruling lines in tables (lattice mode detection).

PDF Graphics Operators (Section 8.5):

  • Path construction: m (moveto), l (lineto), re (rectangle), c (curve)
  • Path painting: S (stroke), s (close/stroke), f (fill), F (fill)

Reference: PDF 1.7 specification, Section 8.5 (Path Construction and Painting).

func (*GraphicsElement) String

func (ge *GraphicsElement) String() string

String returns a string representation of the graphics element.

type GraphicsParser

type GraphicsParser struct {
	// contains filtered or unexported fields
}

GraphicsParser extracts graphics elements from PDF content streams.

The parser processes graphics operators to extract lines, rectangles, and other shapes that can be used for table detection (ruling lines).

Graphics State (Section 8.4):

  • Current transformation matrix (CTM)
  • Current path
  • Line width, color, etc.

Reference: PDF 1.7 specification, Section 8 (Graphics).

func NewGraphicsParser

func NewGraphicsParser(reader *parser.Reader) *GraphicsParser

NewGraphicsParser creates a new GraphicsParser for the given PDF reader.

func (*GraphicsParser) ParseFromPage

func (gp *GraphicsParser) ParseFromPage(pageNum int) ([]*GraphicsElement, error)

ParseFromPage extracts all graphics elements from the specified page.

Page numbers are 0-based (first page is 0).

Returns a slice of GraphicsElements, or error if extraction fails.

type GraphicsState

type GraphicsState struct {
	CurrentPath []Point // Points in current path
	LineWidth   float64 // Current line width
	StrokeColor Color   // Current stroke color
	FillColor   Color   // Current fill color
}

GraphicsState tracks the current graphics state during parsing.

func NewGraphicsState

func NewGraphicsState() *GraphicsState

NewGraphicsState creates a new graphics state with defaults.

type GraphicsType

type GraphicsType int

GraphicsType represents the type of graphics element.

const (
	// GraphicsTypeLine represents a straight line.
	GraphicsTypeLine GraphicsType = iota
	// GraphicsTypeRectangle represents a rectangle.
	GraphicsTypeRectangle
	// GraphicsTypePath represents a generic path.
	GraphicsTypePath
)

func (GraphicsType) String

func (gt GraphicsType) String() string

String returns a string representation of the graphics type.

type ImageExtractor

type ImageExtractor struct {
	// contains filtered or unexported fields
}

ImageExtractor extracts images from PDF pages.

This is an application service that coordinates image extraction from PDF documents using the domain model and infrastructure services.

Example:

reader, _ := parser.OpenPDF("document.pdf")
defer reader.Close()

extractor := NewImageExtractor(reader)
images, _ := extractor.ExtractFromPage(0)
for i, img := range images {
    img.SaveToFile(fmt.Sprintf("image_%d.jpg", i))
}

func NewImageExtractor

func NewImageExtractor(reader *parser.Reader) *ImageExtractor

NewImageExtractor creates a new image extractor.

Parameters:

  • reader: PDF reader providing access to document structure

Returns a configured ImageExtractor ready to extract images.

func (*ImageExtractor) ExtractFromDocument

func (e *ImageExtractor) ExtractFromDocument() ([]*types.Image, error)

ExtractFromDocument extracts all images from all pages in the document.

This iterates through all pages and extracts images from each.

Returns a slice of all images found in the document, or error if extraction fails.

func (*ImageExtractor) ExtractFromPage

func (e *ImageExtractor) ExtractFromPage(pageIndex int) ([]*types.Image, error)

ExtractFromPage extracts all images from a specific page.

This finds all image XObjects in the page's resources and extracts them.

Parameters:

  • pageIndex: 0-based page index

Returns a slice of images found on the page, or error if extraction fails.

type Matrix

type Matrix struct {
	A, B, C, D, E, F float64
}

Matrix represents a transformation matrix used in PDF graphics and text.

PDF uses 3x3 transformation matrices in homogeneous coordinate space:

[ a  b  0 ]
[ c  d  0 ]
[ e  f  1 ]

The matrix is specified by six numbers: [a b c d e f]

Transformations:

  • Translation: [1 0 0 1 tx ty] - moves by (tx, ty)
  • Scaling: [sx 0 0 sy 0 0] - scales by (sx, sy)
  • Rotation: [cos θ sin θ -sin θ cos θ 0 0] - rotates by θ
  • Skewing: [1 tan α tan β 1 0 0] - skews by angles α, β

Reference: PDF 1.7 specification, Section 8.3.3 (Common Transformations).

func Identity

func Identity() Matrix

Identity returns the identity matrix [1 0 0 1 0 0].

The identity matrix performs no transformation.

func NewMatrix

func NewMatrix(a, b, c, d, e, f float64) Matrix

NewMatrix creates a new Matrix with the given values.

func Rotation

func Rotation(angle float64) Matrix

Rotation creates a rotation matrix that rotates by angle (in radians).

func Scaling

func Scaling(sx, sy float64) Matrix

Scaling creates a scaling matrix that scales by (sx, sy).

func Translation

func Translation(tx, ty float64) Matrix

Translation creates a translation matrix that moves by (tx, ty).

func (Matrix) IsIdentity

func (m Matrix) IsIdentity() bool

IsIdentity checks if the matrix is the identity matrix.

func (Matrix) Multiply

func (m Matrix) Multiply(other Matrix) Matrix

Multiply multiplies this matrix by another matrix (m × other).

Matrix multiplication is used to combine transformations. The order matters: with the row-vector convention used by Transform, the product m.Multiply(other) applies m first, then other.

The formula for matrix multiplication:

[ a1 b1 0 ]   [ a2 b2 0 ]   [ a1*a2+b1*c2  a1*b2+b1*d2  0 ]
[ c1 d1 0 ] × [ c2 d2 0 ] = [ c1*a2+d1*c2  c1*b2+d1*d2  0 ]
[ e1 f1 1 ]   [ e2 f2 1 ]   [ e1*a2+f1*c2+e2  e1*b2+f1*d2+f2  1 ]

Reference: PDF 1.7 specification, Section 8.3.4 (Transformation Matrices).
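
The formula translates directly into code. The sketch below uses a local Matrix type with the same six fields as the documented struct; it is a standalone illustration of the formula, not the package's source.

```go
package main

import "fmt"

// Matrix mirrors the documented six-field layout [a b c d e f].
type Matrix struct{ A, B, C, D, E, F float64 }

// Multiply returns m × other, term by term as in the formula above.
func (m Matrix) Multiply(o Matrix) Matrix {
	return Matrix{
		A: m.A*o.A + m.B*o.C,
		B: m.A*o.B + m.B*o.D,
		C: m.C*o.A + m.D*o.C,
		D: m.C*o.B + m.D*o.D,
		E: m.E*o.A + m.F*o.C + o.E,
		F: m.E*o.B + m.F*o.D + o.F,
	}
}

func main() {
	translate := Matrix{1, 0, 0, 1, 10, 20}
	scale := Matrix{2, 0, 0, 2, 0, 0}
	fmt.Println(translate.Multiply(scale))
	fmt.Println(scale.Multiply(translate))
}
```

The two products differ in their e and f components, which shows that the operands cannot be swapped freely.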

func (Matrix) String

func (m Matrix) String() string

String returns a string representation of the matrix.

func (Matrix) Transform

func (m Matrix) Transform(x, y float64) (float64, float64)

Transform applies the matrix transformation to a point (x, y).

The transformation formula is:

x' = a*x + c*y + e
y' = b*x + d*y + f

This is used to convert text coordinates from text space to user space.

Reference: PDF 1.7 specification, Section 8.3.2 (Coordinate Spaces).
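
As a worked example of the formula, the standalone sketch below applies a translation and a 90° rotation to sample points. The local Matrix type mirrors the documented fields and is an illustration only.

```go
package main

import (
	"fmt"
	"math"
)

// Matrix mirrors the documented six-field layout [a b c d e f].
type Matrix struct{ A, B, C, D, E, F float64 }

// Transform computes x' = a*x + c*y + e and y' = b*x + d*y + f.
func (m Matrix) Transform(x, y float64) (float64, float64) {
	return m.A*x + m.C*y + m.E, m.B*x + m.D*y + m.F
}

func main() {
	// Translation by (100, 200): the point (5, 7) moves to (105, 207).
	tx, ty := Matrix{1, 0, 0, 1, 100, 200}.Transform(5, 7)
	fmt.Println(tx, ty)

	// Counterclockwise rotation by 90 degrees: (1, 0) lands on (0, 1),
	// up to floating-point rounding in the cosine term.
	s, c := math.Sincos(math.Pi / 2)
	rx, ry := Matrix{c, s, -s, c, 0, 0}.Transform(1, 0)
	fmt.Printf("%.3f %.3f\n", rx, ry)
}
```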

type Operator

type Operator struct {
	Name     string             // Operator name (e.g., "Tj", "Tm", "BT")
	Operands []parser.PdfObject // Operands for the operator
}

Operator represents a PDF content stream operator with its operands.

PDF content streams consist of a sequence of operators and their operands. The general format is:

operand1 operand2 ... operandN operator

For example:

  • "100 200 Td" - Move text position to (100, 200)
  • "(Hello) Tj" - Show text "Hello"
  • "/F1 12 Tf" - Set font F1 with size 12

Reference: PDF 1.7 specification, Section 7.8.2 (Content Streams).

func NewOperator

func NewOperator(name string, operands []parser.PdfObject) *Operator

NewOperator creates a new Operator with the given name and operands.

func (*Operator) String

func (op *Operator) String() string

String returns a string representation of the operator.

type Point

type Point struct {
	X, Y float64
}

Point represents a 2D point in PDF coordinate space.

PDF coordinates are in points (1/72 inch), with origin at bottom-left.

func NewPoint

func NewPoint(x, y float64) Point

NewPoint creates a new Point.

func (Point) String

func (p Point) String() string

String returns a string representation of the point.

type Rectangle

type Rectangle struct {
	X      float64 // Bottom-left X coordinate
	Y      float64 // Bottom-left Y coordinate
	Width  float64 // Width
	Height float64 // Height
}

Rectangle represents a rectangular bounding box.

This is a simplified version for text extraction. The full Rectangle value object is in domain/types.

func NewRectangle

func NewRectangle(x, y, width, height float64) Rectangle

NewRectangle creates a new Rectangle.

func (Rectangle) Bottom

func (r Rectangle) Bottom() float64

Bottom returns the Y coordinate of the bottom edge.

func (Rectangle) Contains

func (r Rectangle) Contains(x, y float64) bool

Contains checks if a point (x, y) is inside the rectangle.

func (Rectangle) Left

func (r Rectangle) Left() float64

Left returns the X coordinate of the left edge.

func (Rectangle) Right

func (r Rectangle) Right() float64

Right returns the X coordinate of the right edge.

func (Rectangle) String

func (r Rectangle) String() string

String returns a string representation of the rectangle.

func (Rectangle) Top

func (r Rectangle) Top() float64

Top returns the Y coordinate of the top edge.
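
Because the rectangle is anchored at its bottom-left corner and Y grows upward, the edge accessors reduce to simple sums. The sketch below uses a local type with the documented fields; treating points on the edges as inside is an assumption, since the documentation does not say whether Contains is inclusive.

```go
package main

import "fmt"

// Rectangle mirrors the documented fields: bottom-left corner plus size.
type Rectangle struct{ X, Y, Width, Height float64 }

func (r Rectangle) Left() float64   { return r.X }
func (r Rectangle) Right() float64  { return r.X + r.Width }
func (r Rectangle) Bottom() float64 { return r.Y }
func (r Rectangle) Top() float64    { return r.Y + r.Height } // Y increases upward

// Contains treats the edges as inside (an assumption of this sketch).
func (r Rectangle) Contains(x, y float64) bool {
	return x >= r.Left() && x <= r.Right() && y >= r.Bottom() && y <= r.Top()
}

func main() {
	r := Rectangle{X: 10, Y: 20, Width: 100, Height: 50}
	fmt.Println(r.Right(), r.Top(), r.Contains(60, 45), r.Contains(60, 75))
}
```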

type TextChunk

type TextChunk struct {
	Elements []*TextElement // Text elements in this chunk
	Bounds   Rectangle      // Bounding box of all elements
}

TextChunk represents a group of text elements.

A chunk is used to group related text elements (e.g., text on the same line, text in the same cell, text in the same paragraph).

This is useful for table extraction where we need to group text into cells.

func NewTextChunk

func NewTextChunk(elements []*TextElement) *TextChunk

NewTextChunk creates a new TextChunk with the given elements.

The bounding box is calculated from the elements.

func (*TextChunk) Add

func (tc *TextChunk) Add(elem *TextElement)

Add adds a text element to the chunk and updates bounds.

func (*TextChunk) Len

func (tc *TextChunk) Len() int

Len returns the number of elements in the chunk.

func (*TextChunk) String

func (tc *TextChunk) String() string

String returns a string representation of the chunk.

func (*TextChunk) Text

func (tc *TextChunk) Text() string

Text returns the concatenated text of all elements.

type TextElement

type TextElement struct {
	Text     string  // The actual text content
	X        float64 // X coordinate (bottom-left, in points)
	Y        float64 // Y coordinate (bottom-left, in points)
	Width    float64 // Width of text (in points)
	Height   float64 // Height of text (in points)
	FontName string  // Font name (e.g., "/F1", "/Helvetica")
	FontSize float64 // Font size in points
}

TextElement represents a single piece of text extracted from a PDF page.

Each TextElement has position information (X, Y coordinates) which is critical for table extraction and layout analysis. The coordinates represent the bottom-left corner of the text element in PDF coordinate space.

PDF Coordinate System (Section 8.3.2):

  • Origin (0,0) is at bottom-left of page
  • X increases to the right
  • Y increases upward
  • Coordinates are in points (1 point = 1/72 inch)

Reference: PDF 1.7 specification, Section 9.4 (Text Objects).

func NewTextElement

func NewTextElement(text string, x, y, width, height float64, fontName string, fontSize float64) *TextElement

NewTextElement creates a new TextElement with the given properties.

func (*TextElement) Bottom

func (te *TextElement) Bottom() float64

Bottom returns the Y coordinate of the bottom edge (same as Y).

func (*TextElement) CenterX

func (te *TextElement) CenterX() float64

CenterX returns the X coordinate of the center of the text.

func (*TextElement) CenterY

func (te *TextElement) CenterY() float64

CenterY returns the Y coordinate of the center of the text.

func (*TextElement) Left

func (te *TextElement) Left() float64

Left returns the X coordinate of the left edge (same as X).

func (*TextElement) Right

func (te *TextElement) Right() float64

Right returns the X coordinate of the right edge of the text.

func (*TextElement) String

func (te *TextElement) String() string

String returns a string representation of the text element.

func (*TextElement) Top

func (te *TextElement) Top() float64

Top returns the Y coordinate of the top edge of the text.

func (*TextElement) VerticalOverlapRatio

func (te *TextElement) VerticalOverlapRatio(other *TextElement) float64

VerticalOverlapRatio calculates the vertical overlap ratio between this element and another.

Returns a value between 0.0 (no overlap) and 1.0 (complete overlap). Based on Tabula's algorithm (tabula-java/Rectangle.java:73-90).

This is used for row detection in tables without ruling lines (Stream mode). Elements with overlap < threshold (e.g., 0.1) are considered separate rows.
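
One way to compute such a ratio is to divide the shared vertical extent by the height of the shorter element. The sketch below assumes that formulation; it is not a transcription of the Tabula code and may differ from it in edge cases. The box type and function name are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// box holds only the vertical extent: bottom edge Y and Height.
type box struct{ Y, Height float64 }

// verticalOverlapRatio returns the shared vertical extent divided by the
// height of the shorter box, or 0 when the boxes do not overlap.
func verticalOverlapRatio(a, b box) float64 {
	overlap := math.Min(a.Y+a.Height, b.Y+b.Height) - math.Max(a.Y, b.Y)
	shorter := math.Min(a.Height, b.Height)
	if overlap <= 0 || shorter <= 0 {
		return 0
	}
	return overlap / shorter
}

func main() {
	a := box{Y: 100, Height: 10}
	b := box{Y: 105, Height: 10}
	c := box{Y: 200, Height: 10}
	fmt.Println(verticalOverlapRatio(a, b), verticalOverlapRatio(a, c))
}
```

With a threshold of 0.1, the first pair in this example would be placed in the same row and the second pair in different rows.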

type TextExtractor

type TextExtractor struct {
	// contains filtered or unexported fields
}

TextExtractor extracts text with positional information from PDF pages.

The extractor processes PDF content streams and interprets text operators to extract text along with its X,Y coordinates. This is critical for table extraction, as we need to know where each piece of text is located.

Text Extraction Process:

  1. Get page's content stream(s)
  2. Decode stream (handle FlateDecode, etc.)
  3. Parse content operators
  4. Track text state (font, position, matrix)
  5. Extract text with coordinates when text showing operators are encountered
  6. Decode glyph bytes to Unicode using font CMap/encoding

Reference: PDF 1.7 specification, Section 9.4 (Text Objects).

func NewTextExtractor

func NewTextExtractor(reader *parser.Reader) *TextExtractor

NewTextExtractor creates a new TextExtractor for the given PDF reader.

func (*TextExtractor) ExtractFromPage

func (te *TextExtractor) ExtractFromPage(pageNum int) ([]*TextElement, error)

ExtractFromPage extracts all text elements from the specified page.

Page numbers are 0-based (first page is 0).

Returns a slice of TextElements with position information, or error if extraction fails.

type TextState

type TextState struct {
	// Text matrices (Section 9.4.2)
	Tm  Matrix // Current text matrix
	Tlm Matrix // Text line matrix (start of line)

	// Text state parameters (Section 9.3)
	FontName   string  // Current font name (from Tf operator)
	FontSize   float64 // Current font size in points (from Tf operator)
	CharSpace  float64 // Character spacing (from Tc operator)
	WordSpace  float64 // Word spacing (from Tw operator)
	HorizScale float64 // Horizontal scaling as percentage (from Tz operator, 100 = normal)
	Leading    float64 // Text leading in points (from TL operator)
	Rise       float64 // Text rise in points (from Ts operator)

	// Current position (derived from Tm)
	CurrentX float64
	CurrentY float64
}

TextState tracks the current text state during content stream parsing.

The PDF text state includes all parameters that affect how text is rendered:

  • Text matrix (Tm): Current text position and transformation
  • Text line matrix (Tlm): Position of the start of the current line
  • Font and size
  • Character/word spacing
  • Horizontal scaling
  • Text leading (line spacing)
  • Text rise (vertical offset)

These parameters are modified by text operators (Tf, Tc, Tw, Tz, TL, Ts, Tm, Td, etc.) and affect how text showing operators (Tj, TJ, ', ") render text.

Reference: PDF 1.7 specification, Section 9.3 (Text State Parameters).

func NewTextState

func NewTextState() *TextState

NewTextState creates a new TextState with default values.

Default values:

  • Identity matrices for Tm and Tlm
  • Empty font name, 0 font size
  • 0 character spacing, word spacing, rise
  • 100% horizontal scaling
  • 0 leading

Reference: PDF 1.7 specification, Section 9.3.1 (Text State Parameters and Operators).

func (*TextState) AdvanceX

func (ts *TextState) AdvanceX(width float64)

AdvanceX advances the current X position by the given width.

This is used when showing text to move the text position by the width of the text. The width should account for character spacing, word spacing, and horizontal scaling.

func (*TextState) MoveToNextLine

func (ts *TextState) MoveToNextLine()

MoveToNextLine moves to the start of the next line (T* operator).

The T* operator is equivalent to: 0 -Tl Td, where Tl is the current leading.

Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).

func (*TextState) Reset

func (ts *TextState) Reset()

Reset resets the text state to default values.

This is called when a BT (Begin Text) operator is encountered. According to the PDF spec, BT initializes the text matrix and line matrix to identity.

Reference: PDF 1.7 specification, Section 9.4.1 (Text Objects).

func (*TextState) SetFont

func (ts *TextState) SetFont(fontName string, fontSize float64)

SetFont sets the current font and size (Tf operator).

The Tf operator takes a font name and size:

/FontName size Tf

Reference: PDF 1.7 specification, Section 9.3 (Text State Parameters).

func (*TextState) SetTextMatrix

func (ts *TextState) SetTextMatrix(a, b, c, d, e, f float64)

SetTextMatrix sets the text matrix (Tm operator).

The Tm operator replaces the current text matrix with a new matrix specified by six numbers: a b c d e f Tm

This also updates the text line matrix (Tlm = Tm).

Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).

func (*TextState) String

func (ts *TextState) String() string

String returns a string representation of the text state.

func (*TextState) Translate

func (ts *TextState) Translate(tx, ty float64)

Translate moves the text position by (tx, ty) (Td operator).

The Td operator updates both the text matrix and text line matrix:

Tlm = [1 0 0 1 tx ty] × Tlm
Tm = Tlm

Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).

func (*TextState) TranslateSetLeading

func (ts *TextState) TranslateSetLeading(tx, ty float64)

TranslateSetLeading moves the text position and sets leading (TD operator).

The TD operator is equivalent to:

  • -ty TL (set leading to -ty)
  • tx ty Td (translate)

Reference: PDF 1.7 specification, Section 9.4.2 (Text Positioning Operators).
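
The relationship between Td, TD, and T* can be sketched on a reduced text state. The code follows the PDF specification's definition of Td, in which the translation matrix is the left operand of the product. The textState type and its lowercase methods are hypothetical stand-ins for TextState and its methods, not the package's source.

```go
package main

import "fmt"

// Matrix mirrors the documented six-field layout [a b c d e f].
type Matrix struct{ A, B, C, D, E, F float64 }

// Multiply returns m × o.
func (m Matrix) Multiply(o Matrix) Matrix {
	return Matrix{
		A: m.A*o.A + m.B*o.C, B: m.A*o.B + m.B*o.D,
		C: m.C*o.A + m.D*o.C, D: m.C*o.B + m.D*o.D,
		E: m.E*o.A + m.F*o.C + o.E, F: m.E*o.B + m.F*o.D + o.F,
	}
}

// textState keeps only the fields needed for the positioning operators.
type textState struct {
	Tm, Tlm Matrix
	Leading float64
}

// translate models Td: Tlm = [1 0 0 1 tx ty] × Tlm, then Tm = Tlm.
func (ts *textState) translate(tx, ty float64) {
	ts.Tlm = Matrix{1, 0, 0, 1, tx, ty}.Multiply(ts.Tlm)
	ts.Tm = ts.Tlm
}

// translateSetLeading models TD: set the leading to -ty, then apply Td.
func (ts *textState) translateSetLeading(tx, ty float64) {
	ts.Leading = -ty
	ts.translate(tx, ty)
}

// nextLine models T*: move down by the current leading.
func (ts *textState) nextLine() { ts.translate(0, -ts.Leading) }

func main() {
	identity := Matrix{1, 0, 0, 1, 0, 0}
	ts := &textState{Tm: identity, Tlm: identity}
	ts.translate(100, 700)         // 100 700 Td
	ts.translateSetLeading(0, -14) // 0 -14 TD
	ts.nextLine()                  // T*
	fmt.Println(ts.Tm.E, ts.Tm.F, ts.Leading)
}
```

Because the offset is multiplied into the existing line matrix, a scaled text matrix also scales the distance moved by Td.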
