pdfenhancer

package

Version: v1.3.1 (latest) | Published: Jan 16, 2026 | License: AGPL-3.0 | Imports: 9 | Imported by: 0

Repository

github.com/platinummonkey/legible

README ¶

PDF Enhancer

Package pdfenhancer provides utilities for reading, validating, and enhancing PDF files using the pdfcpu library.

Library Choice: pdfcpu

This package uses pdfcpu v0.11.1 as the PDF manipulation library.

Rationale

Why pdfcpu?

Pure Go: No CGO dependencies, making it cross-platform compatible and easy to build
Open Source: Apache 2.0 license, well-maintained with active development
Comprehensive API: Provides low-level access to PDF internals needed for text layer addition
PDF Compliance: Supports PDF versions 1.0 through 2.0
Feature-Rich: Includes validation, optimization, merging, splitting, and more
Production Ready: Actively used in production environments

Alternatives Considered:

gopdf: Limited to PDF creation, doesn't support reading/modification
gofpdf: Primarily for PDF generation, not manipulation of existing files
unidoc/unipdf: Commercial license required for production use
CGO-based libraries (poppler, mupdf): Cross-platform compilation challenges

Current Implementation

The current implementation provides:

✅ PDF validation and reading
✅ Page count extraction
✅ PDF optimization
✅ PDF merging and splitting
✅ Page dimension extraction
✅ Text layer addition: Fully implemented with invisible OCR text overlay

Text Layer Addition - Implementation Details

The package adds an invisible OCR text layer to PDFs using low-level PDF content stream manipulation. This makes PDFs searchable while preserving their original visual appearance.

Key Features:

PDF Content Stream Creation
- Creates new content streams with proper PDF text operators
- Uses BT/ET (Begin/End Text) to define text objects
- Sets font with Tf operator (Helvetica 10pt)
- Sets text rendering mode to invisible with 3 Tr (mode 3: neither fill nor stroke)
- Positions text using Tm operator (text matrix)
- Renders text with Tj operator (show text string)
Coordinate System Conversion
- Automatically converts OCR coordinates (top-left origin) to PDF coordinates (bottom-left origin)
- OCR: (0,0) is top-left, Y increases downward
- PDF: (0,0) is bottom-left, Y increases upward
- Conversion formula: PDF_Y = PageHeight - OCR_Y - OCR_Height
Text Encoding and Escaping
- Properly escapes special characters in PDF strings
- Handles parentheses, backslashes, newlines, tabs, carriage returns
- Uses standard Helvetica font (no embedding needed)
- Compatible with PDF string encoding requirements
Content Stream Integration
- Appends new content streams to existing page contents
- Handles both single content stream and content array cases
- Preserves existing page content and appearance
- Uses proper PDF indirect reference management
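
The features above can be illustrated with a minimal sketch. It is not the package's actual code: the Word type, the buildTextStream helper, and the font resource name "F1" are assumptions made for illustration, and OCR units are assumed to equal PDF points. It escapes each word, flips the Y axis, and emits the operators in the order described.

```go
package main

import (
	"fmt"
	"strings"
)

// Word is a hypothetical OCR word box (top-left origin, units assumed to be PDF points).
type Word struct {
	Text                string
	X, Y, Width, Height float64
}

// escapePDFString escapes characters that are special inside a PDF literal string.
func escapePDFString(s string) string {
	r := strings.NewReplacer(
		`\`, `\\`,
		`(`, `\(`,
		`)`, `\)`,
		"\n", `\n`,
		"\r", `\r`,
		"\t", `\t`,
	)
	return r.Replace(s)
}

// buildTextStream returns content-stream operators that place each word invisibly.
// fontName is the resource name the page is assumed to map to Helvetica.
func buildTextStream(words []Word, pageHeight float64, fontName string) string {
	var b strings.Builder
	b.WriteString("BT\n")                    // begin text object
	fmt.Fprintf(&b, "/%s 10 Tf\n", fontName) // select the font at 10pt
	b.WriteString("3 Tr\n")                  // rendering mode 3: invisible
	for _, w := range words {
		if w.Text == "" {
			continue // nothing to show for empty words
		}
		// Flip the Y axis: OCR measures from the top, PDF from the bottom.
		pdfY := pageHeight - w.Y - w.Height
		fmt.Fprintf(&b, "1 0 0 1 %.2f %.2f Tm\n", w.X, pdfY)  // set text matrix
		fmt.Fprintf(&b, "(%s) Tj\n", escapePDFString(w.Text)) // show the string
	}
	b.WriteString("ET\n") // end text object
	return b.String()
}

func main() {
	words := []Word{{Text: "Hello (world)", X: 72, Y: 100, Width: 80, Height: 12}}
	fmt.Print(buildTextStream(words, 792, "F1"))
}
```

Note that PDF operators are postfix, so the operand comes first (3 Tr, not Tr 3). The sketch also assumes the text can be represented in the font's standard encoding.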

Implementation Methods:

AddTextLayer(): Main entry point, processes all pages (pdf.go:72-108)
addTextToPage(): Adds text to a single page (pdf.go:113-152)
createTextContentStream(): Generates PDF content stream with text operators (pdf.go:154-201)
escapePDFString(): Escapes special characters for PDF strings (pdf.go:203-213)
appendContentStream(): Adds content stream to page dictionary (pdf.go:215-258)
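
As a rough illustration of the "single content stream versus content array" handling attributed to appendContentStream(), the sketch below normalizes a page's /Contents value to an array and appends the new stream at the end. The Ref type and appendContents function are hypothetical stand-ins, not pdfcpu types or the package's real code.

```go
package main

import "fmt"

// Ref stands in for a PDF indirect reference such as "12 0 R" (hypothetical type).
type Ref struct{ ObjNr, GenNr int }

// appendContents returns the new value for a page's /Contents entry after
// adding newStream. contents may be nil, a single Ref, or a []Ref.
func appendContents(contents interface{}, newStream Ref) ([]Ref, error) {
	switch c := contents.(type) {
	case nil:
		// The page had no content yet.
		return []Ref{newStream}, nil
	case Ref:
		// A single stream becomes a two-element array.
		return []Ref{c, newStream}, nil
	case []Ref:
		// Copy so the caller's slice is never modified in place.
		out := make([]Ref, 0, len(c)+1)
		out = append(out, c...)
		return append(out, newStream), nil
	default:
		return nil, fmt.Errorf("unsupported /Contents type %T", contents)
	}
}

func main() {
	single, _ := appendContents(Ref{4, 0}, Ref{9, 0})
	array, _ := appendContents([]Ref{{4, 0}, {5, 0}}, Ref{9, 0})
	fmt.Println(single, array)
}
```

Appending rather than prepending keeps the existing streams first, so the original page content is drawn unchanged before the invisible text.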

Usage Example

import (
    "log"

    "github.com/platinummonkey/legible/internal/pdfenhancer"
)

// Create enhancer
enhancer := pdfenhancer.New(&pdfenhancer.Config{})

// Validate PDF
if err := enhancer.ValidatePDF("input.pdf"); err != nil {
    log.Fatal(err)
}

// Get page count
pageCount, err := enhancer.GetPageCount("input.pdf")
if err != nil {
    log.Fatal(err)
}
log.Printf("pages: %d", pageCount)

// Optimize PDF
if err := enhancer.OptimizePDF("input.pdf", "output.pdf"); err != nil {
    log.Fatal(err)
}

// Add text layer (when OCR data is available)
// ocr is the project's OCR package; its import is omitted here
ocrResults := ocr.NewDocumentOCR("doc-id", "eng")
// ... populate OCR results ...
if err := enhancer.AddTextLayer("input.pdf", "output.pdf", ocrResults); err != nil {
    log.Fatal(err)
}

Testing

The package includes comprehensive tests with high coverage:

Test Coverage:

PDF validation (valid and invalid files)
Page counting and extraction
PDF information retrieval
Optimization operations
Merging and splitting
Text layer addition with OCR data
Content stream generation and text positioning
Coordinate system conversion
Special character escaping
Empty OCR handling
Multiple words positioning

Test Approach:

Test PDFs are generated programmatically using minimal valid PDF syntax
Mock OCR data is used to test text layer addition without a Tesseract dependency
Edge cases tested: empty text, special characters, multiple words, no OCR data
Integration tests verify generated PDFs are valid and can be read
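
To show what generating a test PDF from "minimal valid PDF syntax" can look like, here is a hedged sketch that builds a one-page blank PDF in memory. The minimalPDF function is hypothetical and not taken from the package's test suite. The key detail is that the cross-reference table must record the exact byte offset of every object.

```go
package main

import (
	"bytes"
	"fmt"
)

// minimalPDF returns the bytes of a one-page, blank PDF of the given size in points.
func minimalPDF(width, height int) []byte {
	objects := []string{
		"<< /Type /Catalog /Pages 2 0 R >>",
		"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
		fmt.Sprintf("<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] /Resources << >> >>",
			width, height),
	}

	var buf bytes.Buffer
	buf.WriteString("%PDF-1.4\n")

	// Record each object's byte offset for the cross-reference table.
	offsets := make([]int, len(objects))
	for i, body := range objects {
		offsets[i] = buf.Len()
		fmt.Fprintf(&buf, "%d 0 obj\n%s\nendobj\n", i+1, body)
	}

	xrefStart := buf.Len()
	fmt.Fprintf(&buf, "xref\n0 %d\n", len(objects)+1)
	buf.WriteString("0000000000 65535 f \n") // entry 0 heads the free list
	for _, off := range offsets {
		fmt.Fprintf(&buf, "%010d 00000 n \n", off) // each entry is exactly 20 bytes
	}
	fmt.Fprintf(&buf, "trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n",
		len(objects)+1, xrefStart)
	return buf.Bytes()
}

func main() {
	pdf := minimalPDF(612, 792) // US Letter
	fmt.Printf("generated %d bytes\n", len(pdf))
}
```

Writing the result to a temporary file gives a fixture that needs no checked-in binary files.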

Coordinate System Reference

Use CompareCoordinateSystems(pageHeight) to get detailed information about the coordinate system differences between PDF and OCR coordinate spaces.
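
For a quick numeric check of the conversion formula PDF_Y = PageHeight - OCR_Y - OCR_Height, the following sketch works through two boxes on a 792-point-tall page. The ocrToPDFY helper is hypothetical and assumes OCR units already match PDF points (no DPI scaling).

```go
package main

import "fmt"

// ocrToPDFY converts the top edge of an OCR box (top-left origin) into the
// bottom edge of the same box in PDF space (bottom-left origin).
func ocrToPDFY(pageHeight, ocrY, ocrHeight int) int {
	return pageHeight - ocrY - ocrHeight
}

func main() {
	const pageHeight = 792 // US Letter height in points

	// A 12-unit-tall box whose top edge is 100 units below the top of the page:
	// 792 - 100 - 12 = 680.
	fmt.Println(ocrToPDFY(pageHeight, 100, 12))

	// A box whose bottom edge touches the bottom of the page: 792 - 780 - 12 = 0.
	fmt.Println(ocrToPDFY(pageHeight, 780, 12))
}
```

Subtracting the box height as well as the Y offset matters because PDF text is positioned from the bottom of the box, while OCR reports the top.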

Future Enhancements

Advanced Text Rendering
- Support for different text rendering modes (visible, invisible, outline)
- Text sizing to exactly match bounding box dimensions
- Rotated text support for angled words
Font Handling
- Custom font embedding for better Unicode support
- Font subsetting for reduced file size
- Multi-language font support
Performance Optimizations
- Batch processing for multiple PDFs
- Parallel page processing
- Streaming for large files
- Memory-efficient content stream building
Quality Improvements
- Confidence-based text filtering (only add high-confidence words)
- Text layer validation and verification
- OCR accuracy metrics in output
Monitoring and Progress
- Progress callbacks for long operations
- Detailed logging of text addition statistics
- Performance profiling and metrics

Documentation ¶

Overview ¶

Package pdfenhancer provides PDF manipulation and OCR text layer addition.

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Config ¶

type Config struct {
	Logger *logger.Logger
}

Config holds configuration for the PDF enhancer

type PDFEnhancer ¶

type PDFEnhancer struct {
	// contains filtered or unexported fields
}

PDFEnhancer provides utilities for reading and enhancing PDF files

func New ¶

func New(cfg *Config) *PDFEnhancer

New creates a new PDF enhancer instance

func (*PDFEnhancer) AddTextLayer ¶

func (pe *PDFEnhancer) AddTextLayer(inputPath, outputPath string, ocrResults *ocr.DocumentOCR) error

AddTextLayer adds an invisible OCR text layer to a PDF. This makes the PDF searchable while preserving the original appearance.

func (*PDFEnhancer) CompareCoordinateSystems ¶

func (pe *PDFEnhancer) CompareCoordinateSystems(pageHeight int) string

CompareCoordinateSystems returns information about coordinate system differences between OCR (top-left origin) and PDF (bottom-left origin)

func (*PDFEnhancer) ExtractPageInfo ¶

func (pe *PDFEnhancer) ExtractPageInfo(pdfPath string, pageNum int) (*PageInfo, error)

ExtractPageInfo extracts basic information about a PDF page

func (*PDFEnhancer) GetPDFInfo ¶

func (pe *PDFEnhancer) GetPDFInfo(pdfPath string) (*PDFInfo, error)

GetPDFInfo returns basic information about a PDF file

func (*PDFEnhancer) GetPageCount ¶

func (pe *PDFEnhancer) GetPageCount(pdfPath string) (int, error)

GetPageCount returns the number of pages in a PDF file

func (*PDFEnhancer) MergePDFs ¶

func (pe *PDFEnhancer) MergePDFs(inputPaths []string, outputPath string) error

MergePDFs merges multiple PDF files into a single output file

func (*PDFEnhancer) OptimizePDF ¶

func (pe *PDFEnhancer) OptimizePDF(inputPath, outputPath string) error

OptimizePDF optimizes a PDF file by compressing and removing unnecessary data

func (*PDFEnhancer) SplitPDF ¶

func (pe *PDFEnhancer) SplitPDF(inputPath, outputDir string) error

SplitPDF splits a PDF into individual pages

func (*PDFEnhancer) ValidatePDF ¶

func (pe *PDFEnhancer) ValidatePDF(pdfPath string) error

ValidatePDF checks if a file is a valid PDF

type PDFInfo ¶

type PDFInfo struct {
	PageCount  int
	PDFVersion string
	FileSize   int64
	Encrypted  bool
	Linearized bool
}

PDFInfo contains information about a PDF file

type PageInfo ¶

type PageInfo struct {
	PageNumber int
	Width      int
	Height     int
}

PageInfo contains basic information about a PDF page

Source Files ¶

pdf.go
