Documentation ¶
Index ¶
- Constants
- func BuildExtractionPrompt(in ExtractionPromptInput) []llm.Message
- func ExtractText(data []byte, mime string, timeout time.Duration) (string, error)
- func HasPDFToPPM() bool
- func HasPDFToText() bool
- func HasTesseract() bool
- func ImageOCRAvailable() bool
- func IsImageMIME(mime string) bool
- func IsScanned(extractedText string) bool
- func OCR(ctx context.Context, data []byte, mime string, maxPages int) (text string, tsv []byte, err error)
- func OCRAvailable() bool
- func OCRWithProgress(ctx context.Context, data []byte, mime string, maxPages int) <-chan OCRProgress
- func StripCodeFences(s string) string
- type EntityContext
- type ExtractionHints
- type ExtractionPromptInput
- type MaintenanceHint
- type OCRProgress
- type Pipeline
- type Result
Constants ¶
const (
	DocTypeQuote      = "quote"
	DocTypeInvoice    = "invoice"
	DocTypeReceipt    = "receipt"
	DocTypeManual     = "manual"
	DocTypeWarranty   = "warranty"
	DocTypePermit     = "permit"
	DocTypeInspection = "inspection"
	DocTypeContract   = "contract"
	DocTypeOther      = "other"
)
Document type constants for ExtractionHints.DocumentType.
const (
	EntityHintProject     = "project"
	EntityHintAppliance   = "appliance"
	EntityHintVendor      = "vendor"
	EntityHintMaintenance = "maintenance"
	EntityHintQuote       = "quote"
	EntityHintServiceLog  = "service_log"
)
Entity kind hint constants for ExtractionHints.EntityKindHint.
const DefaultMaxOCRPages = 20
DefaultMaxOCRPages is the default page limit for OCR. The most useful information (specs, warranty, maintenance) typically appears in the first pages of a document.
const DefaultTextTimeout = 30 * time.Second
DefaultTextTimeout is the default timeout for pdftotext.
Variables ¶
This section is empty.
Functions ¶
func BuildExtractionPrompt ¶
func BuildExtractionPrompt(in ExtractionPromptInput) []llm.Message
BuildExtractionPrompt creates the system and user messages for document extraction. The system prompt defines the JSON schema and rules; the user message contains the document metadata and extracted text from all sources.
For PDFs, both PdfText and OCRText may be present. The LLM receives both with source labels so it can reconcile differences.
func ExtractText ¶
func ExtractText(data []byte, mime string, timeout time.Duration) (string, error)
ExtractText pulls plain text from document content based on MIME type. It returns an empty string (not an error) for unsupported MIME types. PDF extraction uses pdftotext (poppler-utils) when available and returns an empty string for PDFs when the tool is missing. The timeout parameter caps how long pdftotext can run; 0 means DefaultTextTimeout.
func HasPDFToPPM ¶
func HasPDFToPPM() bool
HasPDFToPPM reports whether the pdftoppm binary (from poppler-utils) is on PATH. The result is cached for the process lifetime.
func HasPDFToText ¶
func HasPDFToText() bool
HasPDFToText reports whether the pdftotext binary (from poppler-utils) is on PATH. The result is cached for the process lifetime.
func HasTesseract ¶
func HasTesseract() bool
HasTesseract reports whether the tesseract binary is on PATH. The result is cached for the process lifetime.
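The three Has* helpers above describe the same behavior: look a binary up on PATH once and reuse the answer for the life of the process. A minimal sketch of that pattern follows, assuming exec.LookPath and sync.Once. The cachedProbe type and its available method are hypothetical names for illustration and are not part of this package.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// cachedProbe is a hypothetical helper: it looks a binary up on PATH once
// and reuses the answer for every later call.
type cachedProbe struct {
	name  string
	once  sync.Once
	found bool
}

// available reports whether the binary was found on PATH. The lookup runs
// only on the first call; later calls return the stored result.
func (p *cachedProbe) available() bool {
	p.once.Do(func() {
		_, err := exec.LookPath(p.name)
		p.found = err == nil
	})
	return p.found
}

// tesseractProbe is an illustrative package-level probe.
var tesseractProbe = &cachedProbe{name: "tesseract"}

func main() {
	fmt.Println("tesseract on PATH:", tesseractProbe.available())
}
```

One consequence of caching for the process lifetime is that installing a tool while the program is running has no effect until it restarts.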
func ImageOCRAvailable ¶
func ImageOCRAvailable() bool
ImageOCRAvailable reports whether tesseract is available for direct image OCR (no pdftoppm needed for image files).
func IsImageMIME ¶
func IsImageMIME(mime string) bool
IsImageMIME reports whether the MIME type is an image format that tesseract can process.
func IsScanned ¶
func IsScanned(extractedText string) bool
IsScanned reports whether the extracted text is empty or whitespace-only, which indicates the document likely needs OCR.
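The check described for IsScanned can be illustrated with a short sketch. The isScanned function below is a hypothetical stand-in written from the description alone (empty or whitespace-only text); the package's real implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// isScanned is a hypothetical stand-in for IsScanned: it treats text that
// is empty after trimming whitespace as a sign that OCR is needed.
func isScanned(extractedText string) bool {
	return strings.TrimSpace(extractedText) == ""
}

func main() {
	fmt.Println(isScanned(" \n\t"))        // whitespace only
	fmt.Println(isScanned("Invoice 2024")) // real text present
}
```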
func OCR ¶
func OCR(
	ctx context.Context,
	data []byte,
	mime string,
	maxPages int,
) (text string, tsv []byte, err error)
OCR runs optical character recognition on the given document data. For PDFs, it rasterizes pages with pdftoppm then runs tesseract. For images, it runs tesseract directly.
Returns the extracted plain text, the raw TSV data (for future use with confidence scores and bounding boxes), and any error.
Callers should check OCRAvailable/ImageOCRAvailable before calling.
func OCRAvailable ¶
func OCRAvailable() bool
OCRAvailable reports whether both tesseract and pdftoppm are available, which is the minimum needed to OCR scanned PDFs.
func OCRWithProgress ¶
func OCRWithProgress(
	ctx context.Context,
	data []byte,
	mime string,
	maxPages int,
) <-chan OCRProgress
OCRWithProgress runs OCR with per-page progress updates sent on the returned channel. The channel closes when processing completes. Only PDF and image MIME types are supported; unsupported types produce a single Done message with empty text.
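A consumer of the progress channel would typically range over it until it closes. The sketch below shows one way to do that. It copies the documented OCRProgress fields into a local type and uses fakeOCR, a hypothetical producer, in place of the real OCRWithProgress so the example runs without tesseract. Whether a failure arrives together with Done is not stated in the documentation, so the loop handles Err and Done independently.

```go
package main

import "fmt"

// OCRProgress mirrors the fields documented for this package's OCRProgress.
type OCRProgress struct {
	Phase string
	Page  int
	Total int
	Done  bool
	Text  string
	TSV   []byte
	Err   error
}

// fakeOCR is a hypothetical stand-in for OCRWithProgress. It emits one
// update per page, then a final Done message, then closes the channel.
func fakeOCR(pages int) <-chan OCRProgress {
	ch := make(chan OCRProgress)
	go func() {
		defer close(ch)
		for p := 1; p <= pages; p++ {
			ch <- OCRProgress{Phase: "ocr", Page: p, Total: pages}
		}
		ch <- OCRProgress{Done: true, Text: "recognized text"}
	}()
	return ch
}

// drain reads updates until the channel closes and returns the final text.
// It keeps reading after an error so the producer is never left blocked.
func drain(ch <-chan OCRProgress) (text string, err error) {
	for up := range ch {
		switch {
		case up.Err != nil:
			err = up.Err
		case up.Done:
			text = up.Text
		default:
			fmt.Printf("%s: page %d of %d\n", up.Phase, up.Page, up.Total)
		}
	}
	return text, err
}

func main() {
	text, err := drain(fakeOCR(3))
	fmt.Println(text, err)
}
```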
func StripCodeFences ¶
func StripCodeFences(s string) string
StripCodeFences removes Markdown code fences that LLMs sometimes wrap around JSON output.
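To make the fence-stripping behavior concrete, here is a minimal sketch. The stripCodeFences function is a hypothetical reimplementation based only on the description above, and it assumes a single fenced block with an optional language tag on the opening line. The package's actual rules may be broader.

```go
package main

import (
	"fmt"
	"strings"
)

// fence is the three-backtick Markdown fence marker.
var fence = strings.Repeat("`", 3)

// stripCodeFences is a hypothetical stand-in for StripCodeFences. It removes
// one leading and one trailing fence and returns unfenced input unchanged
// (apart from trimming surrounding whitespace).
func stripCodeFences(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, fence) {
		return s
	}
	if i := strings.IndexByte(s, '\n'); i >= 0 {
		s = s[i+1:] // drop the whole opening line, e.g. the fence plus "json"
	} else {
		s = strings.TrimPrefix(s, fence)
	}
	s = strings.TrimSuffix(strings.TrimSpace(s), fence)
	return strings.TrimSpace(s)
}

func main() {
	wrapped := fence + "json\n{\"a\": 1}\n" + fence
	fmt.Println(stripCodeFences(wrapped))
}
```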
Types ¶
type EntityContext ¶
EntityContext provides existing entity names so the LLM can match extracted references against known data instead of hallucinating.
type ExtractionHints ¶
type ExtractionHints struct {
	DocumentType   string            `json:"document_type"`
	TitleSugg      string            `json:"title_suggestion"`
	Summary        string            `json:"summary"`
	VendorHint     string            `json:"vendor_hint"`
	TotalCents     *int64            `json:"total_cents"`
	LaborCents     *int64            `json:"labor_cents"`
	MaterialsCents *int64            `json:"materials_cents"`
	Date           *time.Time        `json:"date"`
	WarrantyExpiry *time.Time        `json:"warranty_expiry"`
	EntityKindHint string            `json:"entity_kind_hint"`
	EntityNameHint string            `json:"entity_name_hint"`
	Maintenance    []MaintenanceHint `json:"maintenance_items"`
	Notes          string            `json:"notes"`
}
ExtractionHints holds structured data extracted from a document by the LLM. Every field is optional -- the model fills what it can. These hints pre-fill form fields; the user confirms before saving.
func ParseExtractionResponse ¶
func ParseExtractionResponse(raw string) (ExtractionHints, error)
ParseExtractionResponse parses the LLM's JSON response into ExtractionHints. It tolerates Markdown fences, partial responses, and minor format variations in money and date fields.
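Because every ExtractionHints field is optional, the pointer-typed money fields let a caller tell "absent" apart from "zero". The sketch below illustrates this with plain encoding/json on a local struct that mirrors a subset of the documented fields and JSON tags. The parseHints function is hypothetical and strict; it does not reproduce the tolerance that ParseExtractionResponse is described as having.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hints mirrors a subset of the documented ExtractionHints fields.
type hints struct {
	DocumentType string `json:"document_type"`
	VendorHint   string `json:"vendor_hint"`
	TotalCents   *int64 `json:"total_cents"`
	LaborCents   *int64 `json:"labor_cents"`
}

// parseHints is a hypothetical strict decoder used only for illustration.
func parseHints(raw string) (hints, error) {
	var h hints
	err := json.Unmarshal([]byte(raw), &h)
	return h, err
}

func main() {
	h, err := parseHints(`{"document_type": "invoice", "total_cents": 12345}`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("type:", h.DocumentType)
	if h.TotalCents != nil {
		fmt.Println("total cents:", *h.TotalCents)
	}
	// A nil pointer means the model did not supply the field.
	fmt.Println("labor known:", h.LaborCents != nil)
}
```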
type ExtractionPromptInput ¶
type ExtractionPromptInput struct {
	Filename  string
	MIME      string
	SizeBytes int64
	Entities  EntityContext
	PdfText   string // pdftotext output (PDFs only)
	OCRText   string // tesseract output (scanned PDFs and images)
	Text      string // fallback for non-PDF/non-image files (e.g. text/plain)
}
ExtractionPromptInput holds the inputs for building an extraction prompt.
type MaintenanceHint ¶
type MaintenanceHint struct {
	Name           string `json:"name"`
	IntervalMonths int    `json:"interval_months"`
}
MaintenanceHint is a maintenance schedule item extracted from a document (typically an appliance manual).
type OCRProgress ¶
type OCRProgress struct {
	Phase string // "rasterize" or "ocr"
	Page  int    // current page (1-indexed)
	Total int    // total pages (0 until known)
	Done  bool   // all phases finished
	Text  string // accumulated text (set on Done)
	TSV   []byte // accumulated TSV (set on Done)
	Err   error  // set on failure
}
OCRProgress reports incremental progress from OCRWithProgress.
type Pipeline ¶
type Pipeline struct {
	LLMClient     *llm.Client   // nil = skip LLM extraction
	MaxOCRPages   int           // 0 = DefaultMaxOCRPages
	TextTimeout   time.Duration // 0 = DefaultTextTimeout
	EntityContext EntityContext // existing entities for LLM matching
}
Pipeline orchestrates the document extraction layers: text extraction, OCR, and LLM-powered structured extraction. Each layer is independent and gracefully degrades when its dependencies are unavailable.
type Result ¶
type Result struct {
	ExtractedText string // best available text (pdftotext or OCR, merged)
	PdfText       string // raw pdftotext output (PDFs only)
	OCRText       string // raw OCR output (scanned PDFs and images)
	OCRData       []byte
	Hints         *ExtractionHints // nil if LLM unavailable or failed
	OCRUsed       bool
	LLMUsed       bool
	Err           error // non-fatal extraction error; document still saves
}
Result holds the output of a pipeline run.