Documentation ¶
Index ¶
- Constants
- func BuildExtractionPrompt(in ExtractionPromptInput) []llm.Message
- func ExtractText(data []byte, mime string, timeout time.Duration) (string, error)
- func ExtractWithProgress(ctx context.Context, data []byte, mime string, extractors []Extractor) <-chan ExtractProgress
- func ExtractorMaxPages(extractors []Extractor) int
- func ExtractorTimeout(extractors []Extractor) time.Duration
- func HasMatchingExtractor(extractors []Extractor, tool string, mime string) bool
- func HasPDFToPPM() bool
- func HasPDFToText() bool
- func HasTesseract() bool
- func ImageOCRAvailable() bool
- func IsImageMIME(mime string) bool
- func IsScanned(extractedText string) bool
- func NeedsOCR(extractors []Extractor, mime string) bool
- func OCRAvailable() bool
- func StripCodeFences(s string) string
- type EntityContext
- type ExtractProgress
- type ExtractionHints
- type ExtractionPromptInput
- type Extractor
- type ImageOCRExtractor
- type MaintenanceHint
- type PDFOCRExtractor
- type PDFTextExtractor
- type Pipeline
- type PlainTextExtractor
- type Result
- type TextSource
Constants ¶
const (
	DocTypeQuote      = "quote"
	DocTypeInvoice    = "invoice"
	DocTypeReceipt    = "receipt"
	DocTypeManual     = "manual"
	DocTypeWarranty   = "warranty"
	DocTypePermit     = "permit"
	DocTypeInspection = "inspection"
	DocTypeContract   = "contract"
	DocTypeOther      = "other"
)
Document type constants for ExtractionHints.DocumentType.
const (
	EntityHintProject     = "project"
	EntityHintAppliance   = "appliance"
	EntityHintVendor      = "vendor"
	EntityHintMaintenance = "maintenance"
	EntityHintQuote       = "quote"
	EntityHintServiceLog  = "service_log"
)
Entity kind hint constants for ExtractionHints.EntityKindHint.
const DefaultMaxExtractPages = 20
DefaultMaxExtractPages is the default page limit for extraction. Front-loaded info (specs, warranty, maintenance) is typically in the first pages.
const DefaultTextTimeout = 30 * time.Second
DefaultTextTimeout is the default timeout for pdftotext.
const MIMEApplicationPDF = "application/pdf"
MIMEApplicationPDF is the MIME type for PDF documents.
Variables ¶
This section is empty.
Functions ¶
func BuildExtractionPrompt ¶
func BuildExtractionPrompt(in ExtractionPromptInput) []llm.Message
BuildExtractionPrompt creates the system and user messages for document extraction. The system prompt defines the JSON schema and rules; the user message contains the document metadata and extracted text from all sources.
func ExtractText ¶
func ExtractText(data []byte, mime string, timeout time.Duration) (string, error)
ExtractText pulls plain text from document content based on MIME type. It returns an empty string, not an error, for unsupported MIME types. PDF extraction uses pdftotext (poppler-utils) when available and returns an empty string for PDFs when the tool is missing. The timeout parameter caps how long pdftotext can run; 0 means DefaultTextTimeout.
This is a convenience wrapper that delegates to PDFTextExtractor and PlainTextExtractor. For full pipeline extraction, use Pipeline.Run.
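The dispatch rules above can be illustrated with a small local stand-in. This is only a sketch: `extractTextSketch` and `effectiveTimeout` are hypothetical names, the PDF branch is stubbed to behave as if pdftotext were missing, and collapsing whitespace for text/* content is an assumption based on the PlainTextExtractor description.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// defaultTextTimeout mirrors the documented DefaultTextTimeout value.
const defaultTextTimeout = 30 * time.Second

// effectiveTimeout applies the documented "0 = default" rule.
func effectiveTimeout(d time.Duration) time.Duration {
	if d == 0 {
		return defaultTextTimeout
	}
	return d
}

// extractTextSketch is a hypothetical stand-in for ExtractText.
// It never shells out; the PDF branch acts as if pdftotext were absent.
func extractTextSketch(data []byte, mime string, timeout time.Duration) (string, error) {
	_ = effectiveTimeout(timeout) // a real implementation would bound the pdftotext run with this
	switch {
	case strings.HasPrefix(mime, "text/"):
		// Assumed normalization: collapse runs of whitespace into single spaces.
		return strings.Join(strings.Fields(string(data)), " "), nil
	case mime == "application/pdf":
		return "", nil // tool treated as unavailable in this sketch
	default:
		return "", nil // unsupported type: empty text, no error
	}
}

func main() {
	text, err := extractTextSketch([]byte("a  b\n c"), "text/plain", 0)
	fmt.Printf("%q %v\n", text, err)
}
```

The point for callers is that an empty result is not a failure, so it should be checked separately from the error value.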
func ExtractWithProgress ¶ added in v1.47.0
func ExtractWithProgress(
	ctx context.Context,
	data []byte,
	mime string,
	extractors []Extractor,
) <-chan ExtractProgress
ExtractWithProgress runs async extraction with per-page progress updates sent on the returned channel. The channel closes when processing completes. The extractors list is consulted to determine whether to run image or PDF OCR. Unsupported types produce a single Done message with empty text.
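A consumer of such a channel typically ranges over it until it closes. The sketch below uses a trimmed local copy of the ExtractProgress fields and a hypothetical producer (`fakeExtract`) in place of the real function, so it shows only the consumption pattern, not the package's behavior.

```go
package main

import "fmt"

// progress is a trimmed local copy of the documented ExtractProgress fields.
type progress struct {
	Phase string
	Page  int
	Total int
	Done  bool
	Text  string
	Err   error
}

// fakeExtract is a hypothetical producer standing in for ExtractWithProgress.
// It emits one update per page, then a final Done message, then closes.
func fakeExtract(pages []string) <-chan progress {
	ch := make(chan progress)
	go func() {
		defer close(ch)
		text := ""
		for i, p := range pages {
			text += p
			ch <- progress{Phase: "extract", Page: i + 1, Total: len(pages)}
		}
		ch <- progress{Done: true, Text: text}
	}()
	return ch
}

// drain reads until the channel closes and returns the final text,
// the number of per-page updates seen, and any reported error.
func drain(ch <-chan progress) (string, int, error) {
	var text string
	var err error
	updates := 0
	for p := range ch {
		switch {
		case p.Err != nil:
			err = p.Err
		case p.Done:
			text = p.Text
		default:
			updates++ // a UI would refresh a progress bar here
		}
	}
	return text, updates, err
}

func main() {
	text, updates, err := drain(fakeExtract([]string{"page one. ", "page two."}))
	fmt.Println(text, updates, err)
}
```

Ranging to completion rather than stopping at the first Done message keeps the producer goroutine from blocking on an unread send.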
func ExtractorMaxPages ¶ added in v1.47.0
func ExtractorMaxPages(extractors []Extractor) int
ExtractorMaxPages returns the max pages from the first PDFOCRExtractor in the list, or 0 (meaning "use default") if none is found.
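One plausible way to implement a "first extractor of a given concrete type" lookup is a type assertion inside a loop. The types, tool names, and `resolvePages` helper below are local stand-ins, not the package's own definitions.

```go
package main

import "fmt"

// extractor is a minimal stand-in for the Extractor interface.
type extractor interface{ Tool() string }

// pdfText and pdfOCR are local stand-ins; the tool names are made up.
type pdfText struct{}

func (*pdfText) Tool() string { return "text" }

type pdfOCR struct{ MaxPages int }

func (*pdfOCR) Tool() string { return "ocr" }

// maxPagesSketch returns MaxPages from the first *pdfOCR in the list,
// or 0 when the list contains none.
func maxPagesSketch(list []extractor) int {
	for _, e := range list {
		if o, ok := e.(*pdfOCR); ok {
			return o.MaxPages
		}
	}
	return 0
}

// resolvePages shows how a caller might apply the "0 means default" rule,
// using the documented DefaultMaxExtractPages value of 20.
func resolvePages(n int) int {
	if n == 0 {
		return 20
	}
	return n
}

func main() {
	list := []extractor{&pdfText{}, &pdfOCR{MaxPages: 5}, &pdfOCR{MaxPages: 9}}
	fmt.Println(maxPagesSketch(list), resolvePages(maxPagesSketch(nil)))
}
```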
func ExtractorTimeout ¶ added in v1.47.0
func ExtractorTimeout(extractors []Extractor) time.Duration
ExtractorTimeout returns the timeout from the first PDFTextExtractor in the list, or 0 (meaning "use default") if none is found.
func HasMatchingExtractor ¶ added in v1.47.0
func HasMatchingExtractor(extractors []Extractor, tool string, mime string) bool
HasMatchingExtractor reports whether any extractor in the list with the given tool name matches the MIME type and is available.
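Read literally, the check requires all three conditions to hold for the same extractor. A minimal sketch against a local copy of the relevant interface methods follows; the `fake` type and the tool name "ocr" are hypothetical test doubles.

```go
package main

import "fmt"

// extractor copies the three Extractor methods this check relies on.
type extractor interface {
	Tool() string
	Matches(mime string) bool
	Available() bool
}

// fake is a hypothetical test double with fixed answers.
type fake struct {
	tool      string
	mime      string
	available bool
}

func (f fake) Tool() string             { return f.tool }
func (f fake) Matches(mime string) bool { return f.mime == mime }
func (f fake) Available() bool          { return f.available }

// hasMatchingSketch reports whether one extractor satisfies all three
// conditions at once: tool name, MIME match, and availability.
func hasMatchingSketch(list []extractor, tool, mime string) bool {
	for _, e := range list {
		if e.Tool() == tool && e.Matches(mime) && e.Available() {
			return true
		}
	}
	return false
}

func main() {
	list := []extractor{
		fake{tool: "ocr", mime: "image/png", available: false},
		fake{tool: "ocr", mime: "application/pdf", available: true},
	}
	fmt.Println(hasMatchingSketch(list, "ocr", "application/pdf"))
	fmt.Println(hasMatchingSketch(list, "ocr", "image/png"))
}
```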
func HasPDFToPPM ¶
func HasPDFToPPM() bool
HasPDFToPPM reports whether the pdftoppm binary (from poppler-utils) is on PATH. The result is cached for the process lifetime.
func HasPDFToText ¶
func HasPDFToText() bool
HasPDFToText reports whether the pdftotext binary (from poppler-utils) is on PATH. The result is cached for the process lifetime.
func HasTesseract ¶
func HasTesseract() bool
HasTesseract reports whether the tesseract binary is on PATH. The result is cached for the process lifetime.
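A process-lifetime cache for a PATH lookup, as described for the three Has* functions above, can be sketched with `exec.LookPath` and `sync.Once`. The `toolCheck` type and its `lookups` counter are illustrative assumptions, not the package's implementation.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// toolCheck caches a single PATH lookup for one binary name.
type toolCheck struct {
	name    string
	once    sync.Once
	found   bool
	lookups int // counts real lookups, for demonstration only
}

// Has performs the lookup on first use and returns the cached answer after.
func (t *toolCheck) Has() bool {
	t.once.Do(func() {
		t.lookups++
		_, err := exec.LookPath(t.name)
		t.found = err == nil
	})
	return t.found
}

func main() {
	tesseract := &toolCheck{name: "tesseract"}
	fmt.Println(tesseract.Has(), tesseract.Has(), tesseract.lookups)
}
```

One consequence of caching this way is that a tool installed while the process is running is not detected until restart.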
func ImageOCRAvailable ¶
func ImageOCRAvailable() bool
ImageOCRAvailable reports whether tesseract is available for direct image OCR (no pdftoppm needed for image files).
func IsImageMIME ¶
func IsImageMIME(mime string) bool
IsImageMIME reports whether the MIME type is an image format that tesseract can process.
func IsScanned ¶
func IsScanned(extractedText string) bool
IsScanned returns true if the extracted text is empty or whitespace-only, indicating the document likely needs OCR.
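The description maps directly onto a whitespace trim. The local `isScannedSketch` below is a sketch of that check, not the package source.

```go
package main

import (
	"fmt"
	"strings"
)

// isScannedSketch treats empty or whitespace-only text as a sign that
// the document has no text layer and probably needs OCR.
func isScannedSketch(extracted string) bool {
	return strings.TrimSpace(extracted) == ""
}

func main() {
	fmt.Println(isScannedSketch(""))
	fmt.Println(isScannedSketch(" \n\t"))
	fmt.Println(isScannedSketch("Invoice #42"))
}
```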
func NeedsOCR ¶ added in v1.47.0
func NeedsOCR(extractors []Extractor, mime string) bool
NeedsOCR reports whether any OCR-capable extractor in the list matches the MIME type and is available. Use this instead of checking tool names directly so callers don't couple to extractor internals.
func OCRAvailable ¶
func OCRAvailable() bool
OCRAvailable reports whether both tesseract and pdftoppm are available, which is the minimum needed to OCR scanned PDFs.
func StripCodeFences ¶
func StripCodeFences(s string) string
StripCodeFences removes markdown code fences that LLMs sometimes wrap around JSON output.
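As a rough idea of what fence stripping involves, the sketch below removes one leading fence line (with an optional language tag) and one trailing fence. The exact rules the package applies are not documented here, so `stripFencesSketch` is an assumption about reasonable behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// fence is the three-backtick markdown code fence marker.
var fence = strings.Repeat("`", 3)

// stripFencesSketch removes a single surrounding code fence, if present.
func stripFencesSketch(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, fence) {
		return s // nothing to strip
	}
	// Drop the opening fence line, including any language tag such as "json".
	if i := strings.IndexByte(s, '\n'); i >= 0 {
		s = s[i+1:]
	} else {
		s = strings.TrimPrefix(s, fence)
	}
	s = strings.TrimSpace(s)
	s = strings.TrimSuffix(s, fence)
	return strings.TrimSpace(s)
}

func main() {
	wrapped := fence + "json\n{\"a\":1}\n" + fence
	fmt.Println(stripFencesSketch(wrapped))
}
```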
Types ¶
type EntityContext ¶
EntityContext provides existing entity names so the LLM can match extracted references against known data instead of hallucinating.
type ExtractProgress ¶ added in v1.47.0
type ExtractProgress struct {
Tool string // extractor tool name (set on Done)
Desc string // human description (set on Done)
Phase string // e.g. "rasterize", "extract"
Page int // current page (1-indexed)
Total int // total pages (0 until known)
Done bool // all phases finished
Text string // accumulated text (set on Done)
Data []byte // structured data (set on Done)
Err error // set on failure
}
ExtractProgress reports incremental progress from ExtractWithProgress.
type ExtractionHints ¶
type ExtractionHints struct {
DocumentType string `json:"document_type"`
TitleSugg string `json:"title_suggestion"`
Summary string `json:"summary"`
VendorHint string `json:"vendor_hint"`
TotalCents *int64 `json:"total_cents"`
LaborCents *int64 `json:"labor_cents"`
MaterialsCents *int64 `json:"materials_cents"`
Date *time.Time `json:"date"`
WarrantyExpiry *time.Time `json:"warranty_expiry"`
EntityKindHint string `json:"entity_kind_hint"`
EntityNameHint string `json:"entity_name_hint"`
Maintenance []MaintenanceHint `json:"maintenance_items"`
Notes string `json:"notes"`
}
ExtractionHints holds structured data extracted from a document by the LLM. Every field is optional -- the model fills what it can. These hints pre-fill form fields; the user confirms before saving.
func ParseExtractionResponse ¶
func ParseExtractionResponse(raw string) (ExtractionHints, error)
ParseExtractionResponse parses the LLM's JSON response into ExtractionHints. Tolerant of markdown fences, partial responses, and minor format variations in money/date fields.
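To illustrate the tolerant-parsing idea, the sketch below decodes a reduced copy of ExtractionHints after discarding anything outside the outermost braces, which covers markdown fences and surrounding chatter. It does not attempt the money/date leniency mentioned above, and `hints` and `parseSketch` are hypothetical local names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// hints is a reduced copy of ExtractionHints with the documented JSON tags.
type hints struct {
	DocumentType string `json:"document_type"`
	VendorHint   string `json:"vendor_hint"`
	TotalCents   *int64 `json:"total_cents"` // nil when the model omits it
}

// parseSketch keeps only the outermost JSON object and decodes it.
func parseSketch(raw string) (hints, error) {
	var h hints
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start < 0 || end < start {
		return h, errors.New("no JSON object in response")
	}
	err := json.Unmarshal([]byte(raw[start:end+1]), &h)
	return h, err
}

func main() {
	fence := strings.Repeat("`", 3)
	raw := fence + "json\n{\"document_type\":\"invoice\",\"total_cents\":12345}\n" + fence
	h, err := parseSketch(raw)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(h.DocumentType, *h.TotalCents)
}
```

Pointer fields such as TotalCents let a caller tell "absent" apart from an explicit zero, which matters when the hints pre-fill a form.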
type ExtractionPromptInput ¶
type ExtractionPromptInput struct {
Filename string
MIME string
SizeBytes int64
Entities EntityContext
Sources []TextSource
}
ExtractionPromptInput holds the inputs for building an extraction prompt.
type Extractor ¶ added in v1.47.0
type Extractor interface {
Tool() string
Matches(mime string) bool
Available() bool
Extract(ctx context.Context, data []byte) (TextSource, error)
}
Extractor extracts text from document bytes.
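A custom implementation of this interface might look like the following sketch. The interface is copied locally so the example compiles on its own; the `textSource` fields (Tool, Text) are an assumption because the real TextSource definition is not shown in this section, and `csvExtractor` and `runCSV` are made-up names.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// textSource is a hypothetical shape for the package's TextSource type.
type textSource struct {
	Tool string
	Text string
}

// extractor is a local copy of the documented Extractor interface.
type extractor interface {
	Tool() string
	Matches(mime string) bool
	Available() bool
	Extract(ctx context.Context, data []byte) (textSource, error)
}

// csvExtractor is an illustrative extractor for text/csv content.
type csvExtractor struct{}

func (csvExtractor) Tool() string             { return "csv" }
func (csvExtractor) Matches(mime string) bool { return mime == "text/csv" }
func (csvExtractor) Available() bool          { return true } // needs no external binary

func (csvExtractor) Extract(ctx context.Context, data []byte) (textSource, error) {
	if err := ctx.Err(); err != nil {
		return textSource{}, err // honor cancellation before doing work
	}
	return textSource{Tool: "csv", Text: strings.TrimSpace(string(data))}, nil
}

// Compile-time check that csvExtractor satisfies the interface.
var _ extractor = csvExtractor{}

// runCSV is a small helper that runs the extractor with a background context.
func runCSV(data string) (textSource, error) {
	return csvExtractor{}.Extract(context.Background(), []byte(data))
}

func main() {
	src, err := runCSV("name,price\nfilter,12\n")
	fmt.Printf("%q %v\n", src.Text, err)
}
```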
func DefaultExtractors ¶ added in v1.47.0
DefaultExtractors returns the standard extractors in priority order: pdftotext, plaintext, PDF OCR, image OCR. Zero values for maxPages and timeout cause the concrete extractors to use their own defaults.
type ImageOCRExtractor ¶ added in v1.47.0
type ImageOCRExtractor struct{}
ImageOCRExtractor wraps ocrImage for direct image OCR.
func (*ImageOCRExtractor) Available ¶ added in v1.47.0
func (e *ImageOCRExtractor) Available() bool
func (*ImageOCRExtractor) Extract ¶ added in v1.47.0
func (e *ImageOCRExtractor) Extract(ctx context.Context, data []byte) (TextSource, error)
func (*ImageOCRExtractor) Matches ¶ added in v1.47.0
func (e *ImageOCRExtractor) Matches(mime string) bool
func (*ImageOCRExtractor) Tool ¶ added in v1.47.0
func (e *ImageOCRExtractor) Tool() string
type MaintenanceHint ¶
type MaintenanceHint struct {
Name string `json:"name"`
IntervalMonths int `json:"interval_months"`
}
MaintenanceHint is a maintenance schedule item extracted from a document (typically an appliance manual).
type PDFOCRExtractor ¶ added in v1.47.0
type PDFOCRExtractor struct {
MaxPages int
}
PDFOCRExtractor wraps ocrPDF for scanned PDF pages.
func (*PDFOCRExtractor) Available ¶ added in v1.47.0
func (e *PDFOCRExtractor) Available() bool
func (*PDFOCRExtractor) Extract ¶ added in v1.47.0
func (e *PDFOCRExtractor) Extract(ctx context.Context, data []byte) (TextSource, error)
func (*PDFOCRExtractor) Matches ¶ added in v1.47.0
func (e *PDFOCRExtractor) Matches(mime string) bool
func (*PDFOCRExtractor) Tool ¶ added in v1.47.0
func (e *PDFOCRExtractor) Tool() string
type PDFTextExtractor ¶ added in v1.47.0
PDFTextExtractor wraps pdftotext for digital PDF text extraction.
func (*PDFTextExtractor) Available ¶ added in v1.47.0
func (e *PDFTextExtractor) Available() bool
func (*PDFTextExtractor) Extract ¶ added in v1.47.0
func (e *PDFTextExtractor) Extract(ctx context.Context, data []byte) (TextSource, error)
func (*PDFTextExtractor) Matches ¶ added in v1.47.0
func (e *PDFTextExtractor) Matches(mime string) bool
func (*PDFTextExtractor) Tool ¶ added in v1.47.0
func (e *PDFTextExtractor) Tool() string
type Pipeline ¶
type Pipeline struct {
LLMClient *llm.Client // nil = skip LLM extraction
Extractors []Extractor // nil = DefaultExtractors(0, 0)
EntityContext EntityContext // existing entities for LLM matching
}
Pipeline orchestrates the document extraction layers: text extraction, OCR, and LLM-powered structured extraction. Each layer is independent and gracefully degrades when its dependencies are unavailable.
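The graceful-degradation idea can be sketched as a loop that skips unavailable layers, records failures without aborting, and runs the LLM step only when a client is configured. Everything below (`step`, `runSketch`, the result shape, the tool names) is a simplified assumption and not the actual Pipeline logic.

```go
package main

import (
	"errors"
	"fmt"
)

type source struct{ Tool, Text string }

// step is a hypothetical extraction layer with an availability flag.
type step struct {
	tool      string
	available bool
	run       func() (string, error)
}

type result struct {
	Sources []source
	LLMUsed bool
	Err     error // first non-fatal error, if any
}

var errBoom = errors.New("boom")

// runSketch runs every available step and then the optional LLM layer.
func runSketch(steps []step, llm func(text string) error) result {
	var res result
	note := func(err error) {
		if res.Err == nil {
			res.Err = err // keep the first error, continue with other layers
		}
	}
	for _, s := range steps {
		if !s.available {
			continue // missing dependency: skip this layer
		}
		text, err := s.run()
		if err != nil {
			note(err)
			continue
		}
		if text != "" {
			res.Sources = append(res.Sources, source{Tool: s.tool, Text: text})
		}
	}
	if llm != nil && len(res.Sources) > 0 {
		if err := llm(res.Sources[0].Text); err != nil {
			note(err)
		} else {
			res.LLMUsed = true
		}
	}
	return res
}

func main() {
	steps := []step{
		{tool: "text", available: true, run: func() (string, error) { return "hello", nil }},
		{tool: "ocr", available: false, run: func() (string, error) { return "", errBoom }},
	}
	res := runSketch(steps, nil) // nil LLM: structured extraction is skipped
	fmt.Println(len(res.Sources), res.LLMUsed, res.Err)
}
```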
type PlainTextExtractor ¶ added in v1.47.0
type PlainTextExtractor struct{}
PlainTextExtractor normalizes whitespace from text/* content.
func (*PlainTextExtractor) Available ¶ added in v1.47.0
func (e *PlainTextExtractor) Available() bool
func (*PlainTextExtractor) Extract ¶ added in v1.47.0
func (e *PlainTextExtractor) Extract(_ context.Context, data []byte) (TextSource, error)
func (*PlainTextExtractor) Matches ¶ added in v1.47.0
func (e *PlainTextExtractor) Matches(mime string) bool
func (*PlainTextExtractor) Tool ¶ added in v1.47.0
func (e *PlainTextExtractor) Tool() string
type Result ¶
type Result struct {
Sources []TextSource // text from each extraction method
Hints *ExtractionHints // nil if LLM unavailable or failed
LLMUsed bool
Err error // non-fatal extraction error; document still saves
}
Result holds the output of a pipeline run.
func (*Result) HasSource ¶ added in v1.47.0
HasSource reports whether any source matches the given tool name.
func (*Result) SourceByTool ¶ added in v1.47.0
func (r *Result) SourceByTool(tool string) *TextSource
SourceByTool returns the first source matching the given tool name, or nil if not found.
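A first-match lookup of this kind is a short loop. The sketch below uses local stand-ins with assumed Tool and Text fields and made-up tool names; it also shows HasSource expressed in terms of the same lookup, which is one possible arrangement rather than the package's confirmed one.

```go
package main

import "fmt"

// textSource and result are local stand-ins with assumed fields.
type textSource struct{ Tool, Text string }

type result struct{ Sources []textSource }

// sourceByTool returns a pointer to the first source with the given tool
// name, or nil when none matches.
func (r *result) sourceByTool(tool string) *textSource {
	for i := range r.Sources {
		if r.Sources[i].Tool == tool {
			return &r.Sources[i] // pointer into the slice, not a copy
		}
	}
	return nil
}

// hasSource reports whether any source matches the tool name.
func (r *result) hasSource(tool string) bool {
	return r.sourceByTool(tool) != nil
}

func main() {
	r := &result{Sources: []textSource{
		{Tool: "text", Text: "first"},
		{Tool: "ocr", Text: "second"},
		{Tool: "ocr", Text: "third"},
	}}
	if s := r.sourceByTool("ocr"); s != nil {
		fmt.Println(s.Text)
	}
	fmt.Println(r.hasSource("llm"))
}
```

Because the lookup can return nil, callers should check the pointer before dereferencing it.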