text-extraction

command

Version: v0.6.0 (latest) | Published: Feb 24, 2026 | License: MIT

Repository

github.com/coregx/gxpdf

README

Text Extraction Example

This example demonstrates how to extract text with positional information from PDF documents using GoPDF.

Features

Extract text from PDF pages
Get X, Y coordinates for each text element
Font information (name and size)
Text bounding boxes
Statistics (font usage, page coverage)

Usage

go run main.go <pdf-file>

Example Output

Opening PDF: sample.pdf
PDF has 1 pages

Extracting text from page 1...

Found 25 text elements:

Position details for first 20 elements:
-------------------------------------------------------
[1] Text: "Hello, World!"
    Position: (100.00, 700.00)
    Size: 72.00 x 12.00
    Font: Helvetica, Size: 12.0pt

[2] Text: "This is a sample PDF."
    Position: (100.00, 686.00)
    Size: 108.00 x 12.00
    Font: Helvetica, Size: 12.0pt

...

-------------------------------------------------------
All extracted text:
-------------------------------------------------------
Hello, World!
This is a sample PDF.
...

-------------------------------------------------------

Statistics:
  Total text elements: 25
  Unique fonts: 2
    Helvetica: 20 elements
    Helvetica-Bold: 5 elements
  Text bounding box:
    Bottom-left: (72.00, 100.00)
    Top-right: (540.00, 720.00)
    Dimensions: 468.00 x 620.00 points
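
The statistics block can be derived from the extracted elements alone. The following is a minimal sketch of that aggregation, not the example's actual code: the TextElement type, its field names, and computeStats are assumptions made for illustration, and Position is assumed to be the lower-left corner of each element.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// TextElement is a hypothetical stand-in for the extractor's element type.
type TextElement struct {
	Text          string
	X, Y          float64 // assumed lower-left corner, in points
	Width, Height float64
	FontName      string
}

// Stats holds the figures printed in the "Statistics" block.
type Stats struct {
	Total                  int
	FontCounts             map[string]int
	MinX, MinY, MaxX, MaxY float64
}

// computeStats counts elements per font and grows one bounding box
// around every element's rectangle.
func computeStats(elems []TextElement) Stats {
	s := Stats{FontCounts: map[string]int{}}
	for i, e := range elems {
		s.Total++
		s.FontCounts[e.FontName]++
		right, top := e.X+e.Width, e.Y+e.Height
		if i == 0 {
			s.MinX, s.MinY, s.MaxX, s.MaxY = e.X, e.Y, right, top
			continue
		}
		s.MinX, s.MinY = math.Min(s.MinX, e.X), math.Min(s.MinY, e.Y)
		s.MaxX, s.MaxY = math.Max(s.MaxX, right), math.Max(s.MaxY, top)
	}
	return s
}

func main() {
	s := computeStats([]TextElement{
		{"Hello, World!", 100, 700, 72, 12, "Helvetica"},
		{"This is a sample PDF.", 100, 686, 108, 12, "Helvetica"},
	})
	fonts := make([]string, 0, len(s.FontCounts))
	for name := range s.FontCounts {
		fonts = append(fonts, name)
	}
	sort.Strings(fonts) // map iteration order is random; sort for stable output
	fmt.Printf("Total text elements: %d\n", s.Total)
	for _, name := range fonts {
		fmt.Printf("  %s: %d elements\n", name, s.FontCounts[name])
	}
	fmt.Printf("Bottom-left: (%.2f, %.2f)\n", s.MinX, s.MinY)
	fmt.Printf("Top-right: (%.2f, %.2f)\n", s.MaxX, s.MaxY)
}
```

The bounding box is seeded from the first element rather than from zero values, so a page whose text sits entirely away from the origin still reports the correct lower-left corner.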

Phase 2.5 Implementation

This example demonstrates Phase 2.5 of the GoPDF roadmap: Text Extraction with Positional Information.

What's Implemented:

Text Elements - Each piece of text with position data
Matrix Transformations - PDF text matrix handling
Text State Tracking - Font, size, spacing, leading
Content Stream Parsing - Parse PDF operators
Text Operators - BT, ET, Tj, TJ, Tm, Td, Tf, etc.
FlateDecode Support - Decompress compressed streams
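
To illustrate the matrix transformation item, here is a small sketch of the six-number PDF matrix [a b c d e f] under the row-vector convention used by the PDF specification. The Matrix type and its method names are assumptions for illustration; the library's own types may look different.

```go
package main

import "fmt"

// Matrix holds the six variable entries [a b c d e f] of a PDF
// transformation matrix. The type name and methods are hypothetical.
type Matrix struct{ A, B, C, D, E, F float64 }

// Translate returns the matrix [1 0 0 1 tx ty].
func Translate(tx, ty float64) Matrix { return Matrix{1, 0, 0, 1, tx, ty} }

// Mul returns m x n: a point is mapped through m first, then through n.
func (m Matrix) Mul(n Matrix) Matrix {
	return Matrix{
		A: m.A*n.A + m.B*n.C,
		B: m.A*n.B + m.B*n.D,
		C: m.C*n.A + m.D*n.C,
		D: m.C*n.B + m.D*n.D,
		E: m.E*n.A + m.F*n.C + n.E,
		F: m.E*n.B + m.F*n.D + n.F,
	}
}

// Apply maps the point (x, y) through m.
func (m Matrix) Apply(x, y float64) (float64, float64) {
	return m.A*x + m.C*y + m.E, m.B*x + m.D*y + m.F
}

func main() {
	tm := Matrix{1, 0, 0, 1, 100, 700} // as set by "1 0 0 1 100 700 Tm"
	tm = Translate(0, -14).Mul(tm)     // as applied by "0 -14 Td"
	x, y := tm.Apply(0, 0)             // origin of the new text line
	fmt.Printf("line origin: (%.2f, %.2f)\n", x, y)
}
```

Because Td pre-multiplies the current line matrix, its offsets are expressed in text space: under a scaled matrix the same "0 -14 Td" moves the line by more than 14 points on the page.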

Critical for Table Extraction:

The positional information extracted here (X, Y coordinates) is essential for Phase 2.6 (Table Detection). Table extraction algorithms need to know where text is located to determine:

Column boundaries (vertical alignment)
Row boundaries (horizontal alignment)
Cell grouping (proximity analysis)
Table regions (spatial clustering)
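
As a rough illustration of how X coordinates can reveal column boundaries, the sketch below clusters the left edges of text elements with a fixed tolerance. It is not the Phase 2.6 algorithm; columnStarts and the tolerance value are assumptions chosen for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// columnStarts clusters left-edge X coordinates. In sorted order, a value
// within tol points of its predecessor joins the same cluster; each cluster
// is reported by its smallest X. The function name is hypothetical.
func columnStarts(xs []float64, tol float64) []float64 {
	if len(xs) == 0 {
		return nil
	}
	sorted := append([]float64(nil), xs...) // leave the caller's slice untouched
	sort.Float64s(sorted)
	cols := []float64{sorted[0]}
	for i := 1; i < len(sorted); i++ {
		if sorted[i]-sorted[i-1] > tol {
			cols = append(cols, sorted[i])
		}
	}
	return cols
}

func main() {
	// Left edges of text elements gathered from several rows of a page.
	xs := []float64{72, 200.5, 72.3, 350, 200, 349.8, 72.1}
	fmt.Println(columnStarts(xs, 2))
}
```

A fixed tolerance is the simplest choice; right-aligned or centered columns would need the right edge or the midpoint of each element instead of the left edge.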

Implementation Details

Text Extraction Process:

Open PDF file with parser.OpenPDF()
Create extractor.NewTextExtractor(reader)
Call ExtractFromPage(pageNum) to get text elements
Process elements with position data
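
The last two steps can be sketched without the library itself. This README does not show the import paths or signatures of parser.OpenPDF, extractor.NewTextExtractor, or ExtractFromPage, so the code below hides them behind a hypothetical pageExtractor interface and a fake implementation; every type and signature here is an assumption.

```go
package main

import (
	"fmt"
)

// TextElement is a hypothetical element shape; the field names are assumptions.
type TextElement struct {
	Text string
	X, Y float64
}

// pageExtractor is a hypothetical interface with the one call this sketch
// needs. The real extractor's method signature may differ.
type pageExtractor interface {
	ExtractFromPage(pageNum int) ([]TextElement, error)
}

// fakeExtractor stands in for a real extractor so the sketch runs without a PDF.
type fakeExtractor struct{ pages map[int][]TextElement }

func (f fakeExtractor) ExtractFromPage(n int) ([]TextElement, error) {
	elems, ok := f.pages[n]
	if !ok {
		return nil, fmt.Errorf("page %d does not exist", n)
	}
	return elems, nil
}

// dumpPage extracts one page and formats each element with its position.
func dumpPage(ex pageExtractor, pageNum int) ([]string, error) {
	elems, err := ex.ExtractFromPage(pageNum)
	if err != nil {
		return nil, fmt.Errorf("extracting page %d: %w", pageNum, err)
	}
	lines := make([]string, 0, len(elems))
	for i, e := range elems {
		lines = append(lines, fmt.Sprintf("[%d] %q at (%.2f, %.2f)", i+1, e.Text, e.X, e.Y))
	}
	return lines, nil
}

func main() {
	ex := fakeExtractor{pages: map[int][]TextElement{
		1: {{"Hello, World!", 100, 700}, {"This is a sample PDF.", 100, 686}},
	}}
	lines, err := dumpPage(ex, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

Depending on a small interface rather than a concrete extractor keeps the processing step testable without a PDF file on disk.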

Text Operators Supported:

BT/ET - Text object delimiters
Tf - Set font and size
Tm - Set text matrix
Td/TD - Move text position
T* - Move to next line
Tj - Show text string
TJ - Show text with positioning
Tc/Tw/Tz/TL/Ts - Text state parameters
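
To show how these operators interact, here is a deliberately small interpreter for a handful of them (BT, Tf, TL, Td, TD, T*, Tj). It is a sketch under heavy simplifications, not the library's parser: it ignores Tm, TJ arrays, string escapes, nested parentheses and hex strings, and it assumes the text matrix is a pure translation. All names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// shown is one string drawn by Tj plus the text state in effect at that moment.
type shown struct {
	Text     string
	X, Y     float64
	Font     string
	FontSize float64
}

// tokenize splits a content stream on whitespace, keeping "(...)" strings whole.
func tokenize(src string) []string {
	var toks []string
	for i := 0; i < len(src); {
		switch c := src[i]; {
		case c == ' ' || c == '\n' || c == '\r' || c == '\t':
			i++
		case c == '(':
			end := strings.IndexByte(src[i:], ')')
			if end < 0 {
				return toks // unterminated string: stop early
			}
			toks = append(toks, src[i:i+end+1])
			i += end + 1
		default:
			j := i
			for j < len(src) && !strings.ContainsRune(" \n\r\t(", rune(src[j])) {
				j++
			}
			toks = append(toks, src[i:j])
			i = j
		}
	}
	return toks
}

// interpret runs the operators and returns every string shown by Tj.
func interpret(src string) []shown {
	var out []shown
	var ops []string // operand stack, cleared after each operator
	var font string
	var size, leading, x, y float64
	// arg returns the k-th operand from the end as a number (0 if missing).
	arg := func(k int) float64 {
		if k >= len(ops) {
			return 0
		}
		v, _ := strconv.ParseFloat(ops[len(ops)-1-k], 64)
		return v
	}
	for _, tok := range tokenize(src) {
		if strings.ContainsRune("0123456789+-./(", rune(tok[0])) {
			ops = append(ops, tok) // numbers, names and strings are operands
			continue
		}
		switch tok {
		case "BT":
			x, y = 0, 0 // a text object starts at the origin
		case "Tf":
			if len(ops) >= 2 {
				font, size = strings.TrimPrefix(ops[len(ops)-2], "/"), arg(0)
			}
		case "TL":
			leading = arg(0)
		case "Td", "TD":
			if tok == "TD" {
				leading = -arg(0) // TD also sets the leading to -ty
			}
			x, y = x+arg(1), y+arg(0)
		case "T*":
			y -= leading
		case "Tj":
			if len(ops) >= 1 {
				text := strings.Trim(ops[len(ops)-1], "()")
				out = append(out, shown{text, x, y, font, size})
			}
		}
		ops = ops[:0] // unknown operators are skipped, operands discarded
	}
	return out
}

func main() {
	stream := "BT /F1 12 Tf 14 TL 100 700 Td (Hello, World!) Tj T* (This is a sample PDF.) Tj ET"
	for _, s := range interpret(stream) {
		fmt.Printf("%q at (%.2f, %.2f) font %s %.1fpt\n", s.Text, s.X, s.Y, s.Font, s.FontSize)
	}
}
```

PDF uses postfix notation, so operands are collected first and consumed when the operator token arrives; that is why the loop keeps a small operand stack.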

Stream Decoding:

FlateDecode - zlib decompression (most common)
Other filters can be added in future phases
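
FlateDecode data is a zlib stream, so Go's standard compress/zlib package can decode it. The sketch below shows the idea with a round trip; flateDecode and flateEncode are hypothetical helper names, and predictor parameters (DecodeParms) are not handled.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// flateDecode inflates the raw bytes of a FlateDecode stream.
// Simplification: no predictor post-processing is applied.
func flateDecode(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, fmt.Errorf("flate: reading zlib header: %w", err)
	}
	defer r.Close()
	out, err := io.ReadAll(r)
	if err != nil {
		return nil, fmt.Errorf("flate: inflating: %w", err)
	}
	return out, nil
}

// flateEncode compresses data; it exists only to build input for the demo.
func flateEncode(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil { // Close flushes the final block and checksum
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	content := []byte("BT /F1 12 Tf 100 700 Td (Hello, World!) Tj ET")
	packed, err := flateEncode(content)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	plain, err := flateDecode(packed)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%d compressed bytes -> %q\n", len(packed), plain)
}
```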

Next Steps

After text extraction, the next phase is:

Phase 2.6 - Table Detection:

Use text positions to detect table regions
Identify ruling lines (lattice mode)
Analyze whitespace (stream mode)
Group text into cells
Build table structure
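
One building block for grouping text into cells is splitting elements into rows by their Y coordinate. The following is an illustrative sketch only, not the planned Phase 2.6 design; TextElement, groupRows and the tolerance are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// TextElement is a hypothetical element shape; the field names are assumptions.
type TextElement struct {
	Text string
	X, Y float64
}

// groupRows sorts elements top to bottom (PDF Y grows upward) and starts a
// new row whenever an element's Y differs from the row's first element by
// more than tol points. Each row is then ordered left to right.
func groupRows(elems []TextElement, tol float64) [][]TextElement {
	sorted := append([]TextElement(nil), elems...) // keep the input unchanged
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })
	var rows [][]TextElement
	for _, e := range sorted {
		n := len(rows)
		if n > 0 && math.Abs(rows[n-1][0].Y-e.Y) <= tol {
			rows[n-1] = append(rows[n-1], e)
			continue
		}
		rows = append(rows, []TextElement{e})
	}
	for _, r := range rows {
		sort.Slice(r, func(i, j int) bool { return r[i].X < r[j].X })
	}
	return rows
}

func main() {
	elems := []TextElement{
		{"Qty", 300, 700}, {"Item", 72, 700.4},
		{"2", 300, 686}, {"Widget", 72, 685.8},
	}
	for i, row := range groupRows(elems, 1) {
		fmt.Printf("row %d:", i+1)
		for _, e := range row {
			fmt.Printf(" %q", e.Text)
		}
		fmt.Println()
	}
}
```

Comparing against the first element of each row, rather than the previous element, keeps a slowly drifting baseline from merging two distinct rows.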

Testing

Test with your own PDF files:

# Simple text document
go run main.go document.pdf

# Invoice with tables
go run main.go invoice.pdf

# Financial report
go run main.go report.pdf

Requirements

Go 1.25+
PDF file (unencrypted)

Limitations (Phase 2.5)

Font width estimation is approximate (full font metrics in Phase 3)
Encryption not supported (Phase 4)
Some advanced encoders not supported (Phase 4)

Reference

PDF 1.7 Specification, Section 9.4 (Text Objects)
PDF 1.7 Specification, Section 9.3 (Text State)
GoPDF ROADMAP.md, Phase 2.5

License

MIT License - See LICENSE in project root

Documentation

Overview

Package main demonstrates text extraction with positional information.

This example shows how to extract text from PDF pages with X, Y coordinates. Positional information is critical for table extraction and layout analysis.

Usage:

go run main.go <pdf-file>

Example:

go run main.go sample.pdf

Source Files

main.go