extract

package
v0.8.0 Latest
Warning

This package is not in the latest version of its module.

Published: Jan 11, 2026 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package extraction provides comprehensive PDF content extraction. It extracts all content types from a PDF into structured data models that can be serialized to JSON.

Constants

This section is empty.

Variables

This section is empty.

Functions

func CompareTextElements added in v0.8.0

func CompareTextElements(extracted, expected []types.TextElement) bool

CompareTextElements compares extracted text elements with the expected ones. It returns true if they match, allowing for small differences in width calculations.
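The comparison described above can be sketched as a tolerance check. This is a minimal illustration, not the package's implementation: `textElement` is a hypothetical stand-in for `types.TextElement` (whose fields are not shown on this page), and the `widthTolerance` value is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

// textElement is a hypothetical stand-in for types.TextElement; the real
// type's fields are not listed in this documentation.
type textElement struct {
	Text  string
	X, Y  float64
	Width float64
}

// widthTolerance is an assumed value, not taken from the package.
const widthTolerance = 0.5

// compareTextElements reports whether both slices have the same length and
// every pair matches exactly on text and position, and on width within
// widthTolerance.
func compareTextElements(extracted, expected []textElement) bool {
	if len(extracted) != len(expected) {
		return false
	}
	for i := range expected {
		got, want := extracted[i], expected[i]
		if got.Text != want.Text || got.X != want.X || got.Y != want.Y {
			return false
		}
		if math.Abs(got.Width-want.Width) > widthTolerance {
			return false
		}
	}
	return true
}

func main() {
	expected := []textElement{{Text: "Hello", X: 72, Y: 720, Width: 27.5}}
	extracted := []textElement{{Text: "Hello", X: 72, Y: 720, Width: 27.8}}
	fmt.Println(compareTextElements(extracted, expected))
}
```

Comparing widths with a tolerance rather than `==` keeps such a check stable when font metrics are rounded differently between PDF creation and extraction.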

func CreateTestPDFWithComplexText added in v0.8.0

func CreateTestPDFWithComplexText() ([]byte, []types.TextElement, error)

CreateTestPDFWithComplexText creates a PDF with complex text operations for testing

func CreateTestPDFWithGraphics added in v0.8.0

func CreateTestPDFWithGraphics() ([]byte, []types.Graphic, error)

CreateTestPDFWithGraphics creates a PDF with graphics for testing extraction

func CreateTestPDFWithText added in v0.8.0

func CreateTestPDFWithText(texts []TestText) ([]byte, []types.TextElement, error)

CreateTestPDFWithText creates a simple PDF with known text content for testing extraction. It returns the PDF bytes and the expected text elements.

func ExtractAllImages added in v0.8.0

func ExtractAllImages(pdfBytes []byte, password []byte, verbose bool) ([]types.Image, error)

ExtractAllImages extracts all images from a PDF document

func ExtractBookmarks

func ExtractBookmarks(pdfBytes []byte, pdf *parse.PDF, verbose bool) ([]types.Bookmark, error)

ExtractBookmarks extracts bookmarks/outlines from a PDF

func ExtractContent

func ExtractContent(pdfBytes []byte, password []byte, verbose bool) (*types.ContentDocument, error)

ExtractContent extracts all content from a PDF into a ContentDocument. This is the main entry point for content extraction.

func ExtractContentToJSON

func ExtractContentToJSON(pdfBytes []byte, password []byte, verbose bool) (string, error)

ExtractContentToJSON extracts content and returns it as a JSON string.
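As a rough illustration of the "structured model to JSON string" step, the sketch below serializes a small document with the standard `encoding/json` package. The `contentDocument` and `page` types and their fields are hypothetical stand-ins for `types.ContentDocument`, whose actual shape is not shown on this page, and `toJSON` and `fromJSON` are illustrative helpers rather than package functions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// page and contentDocument are hypothetical stand-ins for the package's
// types; the field names and JSON keys here are assumptions.
type page struct {
	Number int      `json:"number"`
	Text   []string `json:"text"`
}

type contentDocument struct {
	Title string `json:"title"`
	Pages []page `json:"pages"`
}

// toJSON serializes the document as indented JSON and returns it as a string.
func toJSON(doc *contentDocument) (string, error) {
	b, err := json.MarshalIndent(doc, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// fromJSON decodes a JSON string back into a contentDocument.
func fromJSON(s string) (*contentDocument, error) {
	var doc contentDocument
	if err := json.Unmarshal([]byte(s), &doc); err != nil {
		return nil, err
	}
	return &doc, nil
}

func main() {
	doc := &contentDocument{
		Title: "Example",
		Pages: []page{{Number: 1, Text: []string{"Hello"}}},
	}
	out, err := toJSON(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Returning a string rather than `[]byte` mirrors the documented signature of ExtractContentToJSON; callers that write to a file or network connection may prefer to work with the bytes directly.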

func ExtractMetadata

func ExtractMetadata(pdfBytes []byte, pdf *parse.PDF, verbose bool) (*types.DocumentMetadata, error)

ExtractMetadata extracts document metadata

func ExtractPages

func ExtractPages(pdfBytes []byte, pdf *parse.PDF, verbose bool) ([]types.Page, error)

ExtractPages extracts all pages from a PDF

func ParseTestPDF added in v0.8.0

func ParseTestPDF(pdfBytes []byte) (*parse.PDF, error)

ParseTestPDF parses a PDF created by test helpers

Types

type TestText added in v0.8.0

type TestText struct {
	Text     string
	X        float64
	Y        float64
	FontSize float64
	Width    float64 // Expected width (approximate)
}

TestText represents text to be added to a test PDF
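A fixture slice for CreateTestPDFWithText could be built along these lines. The `TestText` struct is copied locally from the definition above so the sketch runs on its own; `stackLines` is a hypothetical helper, and both the 1.5x line spacing and the 0.5 em per character width estimate are rough assumptions rather than values taken from the package.

```go
package main

import "fmt"

// TestText is a local copy of the struct documented above, so this sketch
// compiles without importing the package.
type TestText struct {
	Text     string
	X        float64
	Y        float64
	FontSize float64
	Width    float64 // Expected width (approximate)
}

// stackLines is a hypothetical helper that places lines top to bottom,
// starting at topY. PDF user space has its origin at the bottom-left, so Y
// decreases for each following line.
func stackLines(lines []string, x, topY, fontSize float64) []TestText {
	out := make([]TestText, 0, len(lines))
	y := topY
	for _, s := range lines {
		out = append(out, TestText{
			Text:     s,
			X:        x,
			Y:        y,
			FontSize: fontSize,
			// Assumed heuristic: about half an em per character.
			Width: 0.5 * fontSize * float64(len([]rune(s))),
		})
		y -= 1.5 * fontSize // assumed line spacing
	}
	return out
}

func main() {
	for _, t := range stackLines([]string{"Hello", "World!"}, 72, 720, 10) {
		fmt.Printf("%q at (%.0f, %.0f), width ~%.0f\n", t.Text, t.X, t.Y, t.Width)
	}
}
```

Because the width is only an estimate, a comparison against extracted elements should tolerate small differences, as CompareTextElements is documented to do.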
