pdfenhancer

v1.3.1
Published: Jan 16, 2026 License: AGPL-3.0 Imports: 9 Imported by: 0

README

PDF Enhancer

Package pdfenhancer provides utilities for reading, validating, and enhancing PDF files using the pdfcpu library.

Library Choice: pdfcpu

This package uses pdfcpu v0.11.1 as the PDF manipulation library.

Rationale

Why pdfcpu?

  1. Pure Go: No CGO dependencies, making it cross-platform compatible and easy to build
  2. Open Source: Apache 2.0 license, well-maintained with active development
  3. Comprehensive API: Provides low-level access to PDF internals needed for text layer addition
  4. PDF Compliance: Supports PDF versions 1.0 through 2.0
  5. Feature-Rich: Includes validation, optimization, merging, splitting, and more
  6. Production Ready: Actively used in production environments

Alternatives Considered:

  • gopdf: Limited to PDF creation, doesn't support reading/modification
  • gofpdf: Primarily for PDF generation, not manipulation of existing files
  • unidoc/unipdf: Commercial license required for production use
  • CGO-based libraries (poppler, mupdf): Cross-platform compilation challenges

Current Implementation

The current implementation provides:

  • ✅ PDF validation and reading
  • ✅ Page count extraction
  • ✅ PDF optimization
  • ✅ PDF merging and splitting
  • ✅ Page dimension extraction
  • ✅ Text layer addition (invisible OCR text overlay)

Text Layer Addition - Implementation Details

The package adds an invisible OCR text layer to PDFs using low-level PDF content stream manipulation. This makes PDFs searchable while preserving their original visual appearance.

Key Features:

  1. PDF Content Stream Creation

    • Creates new content streams with proper PDF text operators
    • Uses BT/ET (Begin/End Text) to define text objects
    • Sets font with Tf operator (Helvetica 10pt)
    • Sets text rendering mode to invisible with 3 Tr (mode 3: neither fill nor stroke)
    • Positions text using Tm operator (text matrix)
    • Renders text with Tj operator (show text string)
  2. Coordinate System Conversion

    • Automatically converts OCR coordinates (top-left origin) to PDF coordinates (bottom-left origin)
    • OCR: (0,0) is top-left, Y increases downward
    • PDF: (0,0) is bottom-left, Y increases upward
    • Conversion formula: PDF_Y = PageHeight - OCR_Y - OCR_Height
  3. Text Encoding and Escaping

    • Properly escapes special characters in PDF strings
    • Handles parentheses, backslashes, newlines, tabs, carriage returns
    • Uses standard Helvetica font (no embedding needed)
    • Compatible with PDF string encoding requirements
  4. Content Stream Integration

    • Appends new content streams to existing page contents
    • Handles both single content stream and content array cases
    • Preserves existing page content and appearance
    • Uses proper PDF indirect reference management
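The operators and escaping rules listed above can be sketched in a few lines of Go. This is an illustrative sketch only, not the package's actual createTextContentStream or escapePDFString code: the Word type, the buildTextStream name, and the /F1 font resource name are assumptions, and the coordinates are taken to be already in PDF space.

```go
package main

import (
	"fmt"
	"strings"
)

// Word is a hypothetical OCR word whose position is already expressed
// in PDF coordinates (bottom-left origin, points).
type Word struct {
	Text string
	X, Y float64
}

// escapePDFString escapes the characters that are special inside a
// PDF literal string: backslash, parentheses, and common control characters.
func escapePDFString(s string) string {
	r := strings.NewReplacer(
		`\`, `\\`,
		`(`, `\(`,
		`)`, `\)`,
		"\n", `\n`,
		"\r", `\r`,
		"\t", `\t`,
	)
	return r.Replace(s)
}

// buildTextStream returns a content stream that places each word invisibly.
// The font resource name /F1 is an assumption; it must map to Helvetica
// in the page's resource dictionary.
func buildTextStream(words []Word) string {
	var b strings.Builder
	b.WriteString("BT\n")        // begin text object
	b.WriteString("/F1 10 Tf\n") // select font /F1 at 10pt
	b.WriteString("3 Tr\n")      // rendering mode 3: neither fill nor stroke
	for _, w := range words {
		if w.Text == "" {
			continue // nothing to show for empty OCR words
		}
		// Text matrix: identity scale/rotation, translated to (X, Y).
		fmt.Fprintf(&b, "1 0 0 1 %.2f %.2f Tm\n", w.X, w.Y)
		fmt.Fprintf(&b, "(%s) Tj\n", escapePDFString(w.Text))
	}
	b.WriteString("ET\n") // end text object
	return b.String()
}

func main() {
	fmt.Print(buildTextStream([]Word{{Text: "Hello (world)", X: 72, Y: 680}}))
}
```

Note that PDF operators are postfix, so the operands come first: `3 Tr`, `/F1 10 Tf`, `(text) Tj`.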

Implementation Methods:

  • AddTextLayer(): Main entry point, processes all pages (pdf.go:72-108)
  • addTextToPage(): Adds text to a single page (pdf.go:113-152)
  • createTextContentStream(): Generates PDF content stream with text operators (pdf.go:154-201)
  • escapePDFString(): Escapes special characters for PDF strings (pdf.go:203-213)
  • appendContentStream(): Adds content stream to page dictionary (pdf.go:215-258)
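The single-stream versus content-array handling mentioned above can be illustrated with a toy model. The Ref and Page types below are hypothetical stand-ins, not pdfcpu types, and this is not the code behind appendContentStream(); it only shows the case analysis under the assumption that a page's Contents entry is absent, one indirect reference, or an array of references.

```go
package main

import "fmt"

// Ref is a hypothetical stand-in for a PDF indirect reference
// (only the object number is modeled).
type Ref int

// Page is a hypothetical, simplified page dictionary.
// Contents is nil, a single Ref, or a []Ref.
type Page struct {
	Contents interface{}
}

// appendContent adds ref after any existing content, so the original
// drawing instructions are preserved and the new stream runs last.
func appendContent(p *Page, ref Ref) error {
	switch c := p.Contents.(type) {
	case nil:
		// Page had no content: the new stream becomes the only one.
		p.Contents = ref
	case Ref:
		// Single stream: promote to an array of two references.
		p.Contents = []Ref{c, ref}
	case []Ref:
		// Existing array: copy, then append, to avoid aliasing the old slice.
		p.Contents = append(append([]Ref{}, c...), ref)
	default:
		return fmt.Errorf("unsupported Contents type %T", c)
	}
	return nil
}

func main() {
	p := &Page{Contents: Ref(5)}
	if err := appendContent(p, Ref(12)); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p.Contents)
}
```

Appending rather than prepending keeps the original page content first in the array, so the invisible text operators run after it.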

Usage Example

import (
    "log"

    "github.com/platinummonkey/legible/internal/pdfenhancer"
)

// Create enhancer
enhancer := pdfenhancer.New(&pdfenhancer.Config{})

// Validate PDF
if err := enhancer.ValidatePDF("input.pdf"); err != nil {
    log.Fatal(err)
}

// Get page count
pageCount, err := enhancer.GetPageCount("input.pdf")
if err != nil {
    log.Fatal(err)
}
log.Printf("input.pdf has %d pages", pageCount)

// Optimize PDF
if err := enhancer.OptimizePDF("input.pdf", "output.pdf"); err != nil {
    log.Fatal(err)
}

// Add text layer (when OCR data is available).
// ocr refers to the project's OCR package; its import is not shown here.
ocrResults := ocr.NewDocumentOCR("doc-id", "eng")
// ... populate OCR results ...
if err := enhancer.AddTextLayer("input.pdf", "output.pdf", ocrResults); err != nil {
    log.Fatal(err)
}

Testing

The package includes comprehensive tests with high coverage:

Test Coverage:

  • PDF validation (valid and invalid files)
  • Page counting and extraction
  • PDF information retrieval
  • Optimization operations
  • Merging and splitting
  • Text layer addition with OCR data
  • Content stream generation and text positioning
  • Coordinate system conversion
  • Special character escaping
  • Empty OCR handling
  • Multiple words positioning

Test Approach:

  • Test PDFs are generated programmatically using minimal valid PDF syntax
  • Mock OCR data used to test text layer addition without Tesseract dependency
  • Edge cases tested: empty text, special characters, multiple words, no OCR data
  • Integration tests verify generated PDFs are valid and can be read
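As an illustration of generating a test PDF from minimal syntax, the sketch below builds a one-page blank document and computes the cross-reference offsets while writing. The minimalPDF helper and its object layout are assumptions for illustration; the package's own test fixtures may be built differently.

```go
package main

import (
	"bytes"
	"fmt"
)

// minimalPDF returns a one-page, blank PDF (612x792 points).
// Object bodies are written in order so that object i+1 is objects[i].
func minimalPDF() []byte {
	objects := []string{
		"<< /Type /Catalog /Pages 2 0 R >>",
		"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
		"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
	}

	var buf bytes.Buffer
	buf.WriteString("%PDF-1.4\n")

	// Record the byte offset of each object for the xref table.
	offsets := make([]int, len(objects))
	for i, body := range objects {
		offsets[i] = buf.Len()
		fmt.Fprintf(&buf, "%d 0 obj\n%s\nendobj\n", i+1, body)
	}

	// Cross-reference table: each entry must be exactly 20 bytes long.
	xrefStart := buf.Len()
	fmt.Fprintf(&buf, "xref\n0 %d\n", len(objects)+1)
	buf.WriteString("0000000000 65535 f \n") // entry for the reserved object 0
	for _, off := range offsets {
		fmt.Fprintf(&buf, "%010d 00000 n \n", off)
	}

	// Trailer points at the catalog and at the start of the xref table.
	fmt.Fprintf(&buf, "trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n",
		len(objects)+1, xrefStart)
	return buf.Bytes()
}

func main() {
	pdf := minimalPDF()
	fmt.Printf("generated %d bytes\n", len(pdf))
}
```

In a test, the returned bytes could be written to a temporary file with os.WriteFile and then passed to ValidatePDF or GetPageCount.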

Coordinate System Reference

Use CompareCoordinateSystems(pageHeight) to get detailed information about the coordinate system differences between PDF and OCR coordinate spaces.
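For reference, the conversion formula PDF_Y = PageHeight - OCR_Y - OCR_Height can be written as a small standalone function. The ocrToPDFY name is hypothetical, and the sketch assumes OCR coordinates are already in the same units as the PDF page (no DPI scaling).

```go
package main

import "fmt"

// ocrToPDFY converts the top edge of an OCR bounding box (top-left origin,
// Y growing downward) to the PDF Y coordinate of the box's bottom edge
// (bottom-left origin, Y growing upward).
func ocrToPDFY(pageHeight, ocrY, ocrHeight float64) float64 {
	return pageHeight - ocrY - ocrHeight
}

func main() {
	// A 12-unit-tall box whose top edge is 100 units below the top
	// of a 792-unit-tall page.
	fmt.Println(ocrToPDFY(792, 100, 12)) // prints 680
}
```

The OCR_Height term is subtracted because the OCR Y value refers to the top of the box, while the PDF position refers to its bottom.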

Future Enhancements

  1. Advanced Text Rendering

    • Support for different text rendering modes (visible, invisible, outline)
    • Text sizing to exactly match bounding box dimensions
    • Rotated text support for angled words
  2. Font Handling

    • Custom font embedding for better Unicode support
    • Font subsetting for reduced file size
    • Multi-language font support
  3. Performance Optimizations

    • Batch processing for multiple PDFs
    • Parallel page processing
    • Streaming for large files
    • Memory-efficient content stream building
  4. Quality Improvements

    • Confidence-based text filtering (only add high-confidence words)
    • Text layer validation and verification
    • OCR accuracy metrics in output
  5. Monitoring and Progress

    • Progress callbacks for long operations
    • Detailed logging of text addition statistics
    • Performance profiling and metrics

Documentation

Overview

Package pdfenhancer provides PDF manipulation and OCR text layer addition.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	Logger *logger.Logger
}

Config holds configuration for the PDF enhancer

type PDFEnhancer

type PDFEnhancer struct {
	// contains filtered or unexported fields
}

PDFEnhancer provides utilities for reading and enhancing PDF files

func New

func New(cfg *Config) *PDFEnhancer

New creates a new PDF enhancer instance

func (*PDFEnhancer) AddTextLayer

func (pe *PDFEnhancer) AddTextLayer(inputPath, outputPath string, ocrResults *ocr.DocumentOCR) error

AddTextLayer adds an invisible OCR text layer to a PDF. This makes the PDF searchable while preserving the original appearance.

func (*PDFEnhancer) CompareCoordinateSystems

func (pe *PDFEnhancer) CompareCoordinateSystems(pageHeight int) string

CompareCoordinateSystems returns information about coordinate system differences between OCR (top-left origin) and PDF (bottom-left origin)

func (*PDFEnhancer) ExtractPageInfo

func (pe *PDFEnhancer) ExtractPageInfo(pdfPath string, pageNum int) (*PageInfo, error)

ExtractPageInfo extracts basic information about a PDF page

func (*PDFEnhancer) GetPDFInfo

func (pe *PDFEnhancer) GetPDFInfo(pdfPath string) (*PDFInfo, error)

GetPDFInfo returns basic information about a PDF file

func (*PDFEnhancer) GetPageCount

func (pe *PDFEnhancer) GetPageCount(pdfPath string) (int, error)

GetPageCount returns the number of pages in a PDF file

func (*PDFEnhancer) MergePDFs

func (pe *PDFEnhancer) MergePDFs(inputPaths []string, outputPath string) error

MergePDFs merges multiple PDF files into a single output file

func (*PDFEnhancer) OptimizePDF

func (pe *PDFEnhancer) OptimizePDF(inputPath, outputPath string) error

OptimizePDF optimizes a PDF file by compressing and removing unnecessary data

func (*PDFEnhancer) SplitPDF

func (pe *PDFEnhancer) SplitPDF(inputPath, outputDir string) error

SplitPDF splits a PDF into individual pages

func (*PDFEnhancer) ValidatePDF

func (pe *PDFEnhancer) ValidatePDF(pdfPath string) error

ValidatePDF checks if a file is a valid PDF

type PDFInfo

type PDFInfo struct {
	PageCount  int
	PDFVersion string
	FileSize   int64
	Encrypted  bool
	Linearized bool
}

PDFInfo contains information about a PDF file

type PageInfo

type PageInfo struct {
	PageNumber int
	Width      int
	Height     int
}

PageInfo contains basic information about a PDF page
