Documentation ¶
Overview ¶
Package extraction provides comprehensive PDF content extraction. It extracts the content of a PDF (including metadata, pages, bookmarks, and images) into structured data models that can be serialized to JSON.
Index ¶
- func CompareTextElements(extracted, expected []types.TextElement) bool
- func CreateTestPDFWithComplexText() ([]byte, []types.TextElement, error)
- func CreateTestPDFWithGraphics() ([]byte, []types.Graphic, error)
- func CreateTestPDFWithText(texts []TestText) ([]byte, []types.TextElement, error)
- func ExtractAllImages(pdfBytes []byte, password []byte, verbose bool) ([]types.Image, error)
- func ExtractBookmarks(pdfBytes []byte, pdf *parse.PDF, verbose bool) ([]types.Bookmark, error)
- func ExtractContent(pdfBytes []byte, password []byte, verbose bool) (*types.ContentDocument, error)
- func ExtractContentToJSON(pdfBytes []byte, password []byte, verbose bool) (string, error)
- func ExtractMetadata(pdfBytes []byte, pdf *parse.PDF, verbose bool) (*types.DocumentMetadata, error)
- func ExtractPages(pdfBytes []byte, pdf *parse.PDF, verbose bool) ([]types.Page, error)
- func ParseTestPDF(pdfBytes []byte) (*parse.PDF, error)
- type TestText
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CompareTextElements ¶ added in v0.8.0
func CompareTextElements(extracted, expected []types.TextElement) bool
CompareTextElements compares extracted text elements with expected ones. It returns true if they match, allowing for small differences in width calculations.
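The tolerance-based matching described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the `textElement` struct, its `Text` and `Width` fields, and the `widthTolerance` value are all assumptions, since the fields of `types.TextElement` and the actual threshold are not shown in this documentation.

```go
package main

import (
	"fmt"
	"math"
)

// textElement is a hypothetical stand-in for types.TextElement.
// The real type's fields are not listed in this documentation.
type textElement struct {
	Text  string
	Width float64
}

// widthTolerance is an assumed threshold, not a value taken from the package.
const widthTolerance = 0.01

// compareWithTolerance reports whether both slices have the same length and
// every pair of elements has identical text and widths that differ by no
// more than widthTolerance.
func compareWithTolerance(extracted, expected []textElement) bool {
	if len(extracted) != len(expected) {
		return false
	}
	for i := range expected {
		if extracted[i].Text != expected[i].Text {
			return false
		}
		if math.Abs(extracted[i].Width-expected[i].Width) > widthTolerance {
			return false
		}
	}
	return true
}

func main() {
	expected := []textElement{{Text: "Hello", Width: 27.5}}
	extracted := []textElement{{Text: "Hello", Width: 27.504}}
	fmt.Println(compareWithTolerance(extracted, expected))
}
```

Comparing widths with an absolute tolerance rather than `==` avoids spurious failures from floating-point rounding in glyph-width arithmetic.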
func CreateTestPDFWithComplexText ¶ added in v0.8.0
func CreateTestPDFWithComplexText() ([]byte, []types.TextElement, error)
CreateTestPDFWithComplexText creates a PDF with complex text operations for testing
func CreateTestPDFWithGraphics ¶ added in v0.8.0
func CreateTestPDFWithGraphics() ([]byte, []types.Graphic, error)
CreateTestPDFWithGraphics creates a PDF with graphics for testing extraction.
func CreateTestPDFWithText ¶ added in v0.8.0
func CreateTestPDFWithText(texts []TestText) ([]byte, []types.TextElement, error)
CreateTestPDFWithText creates a simple PDF with known text content for testing extraction. It returns the PDF bytes and the expected text elements.
func ExtractAllImages ¶ added in v0.8.0
func ExtractAllImages(pdfBytes []byte, password []byte, verbose bool) ([]types.Image, error)
ExtractAllImages extracts all images from a PDF document.
func ExtractBookmarks ¶
func ExtractBookmarks(pdfBytes []byte, pdf *parse.PDF, verbose bool) ([]types.Bookmark, error)
ExtractBookmarks extracts bookmarks (outlines) from a PDF.
func ExtractContent ¶
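func ExtractContent(pdfBytes []byte, password []byte, verbose bool) (*types.ContentDocument, error)
ExtractContent extracts all content from a PDF into a ContentDocument. This is the main entry point for content extraction.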
func ExtractContentToJSON ¶
func ExtractContentToJSON(pdfBytes []byte, password []byte, verbose bool) (string, error)
ExtractContentToJSON extracts content and returns it as a JSON string.
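A caller might consume the returned JSON string along these lines. This is a sketch under stated assumptions: because the package's import path is not given here, the extractor is passed in as a function value whose type mirrors the documented signature. `decodeExtraction` and `stubExtract` are hypothetical names, the `"pages"` key in the stub output is invented, and the code assumes the top-level JSON value is an object.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// extractJSONFunc mirrors the documented signature of ExtractContentToJSON.
type extractJSONFunc func(pdfBytes []byte, password []byte, verbose bool) (string, error)

// decodeExtraction runs the extractor with verbose disabled and decodes its
// JSON output into a generic map. It assumes the top-level value is an object.
func decodeExtraction(extract extractJSONFunc, pdfBytes, password []byte) (map[string]any, error) {
	out, err := extract(pdfBytes, password, false)
	if err != nil {
		return nil, fmt.Errorf("extract content: %w", err)
	}
	var doc map[string]any
	if err := json.Unmarshal([]byte(out), &doc); err != nil {
		return nil, fmt.Errorf("decode extraction output: %w", err)
	}
	return doc, nil
}

// stubExtract is a placeholder so the sketch runs without the real package.
// The JSON shape it returns is invented for illustration only.
func stubExtract(pdfBytes, password []byte, verbose bool) (string, error) {
	if len(pdfBytes) == 0 {
		return "", errors.New("empty input")
	}
	return `{"pages": []}`, nil
}

func main() {
	doc, err := decodeExtraction(stubExtract, []byte("%PDF-1.7"), nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("top-level keys:", len(doc))
}
```

In real use, `stubExtract` would be replaced by `extraction.ExtractContentToJSON` once the package is imported. Passing `nil` as the password for unencrypted files is also an assumption.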
func ExtractMetadata ¶
func ExtractMetadata(pdfBytes []byte, pdf *parse.PDF, verbose bool) (*types.DocumentMetadata, error)
ExtractMetadata extracts document metadata
func ExtractPages ¶
func ExtractPages(pdfBytes []byte, pdf *parse.PDF, verbose bool) ([]types.Page, error)
ExtractPages extracts all pages from a PDF.