extract

package

v1.46.0 Published: Feb 23, 2026 License: Apache-2.0 Imports: 15 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/cpcloud/micasa

Documentation ¶

Index ¶

Constants
func BuildExtractionPrompt(in ExtractionPromptInput) []llm.Message
func ExtractText(data []byte, mime string, timeout time.Duration) (string, error)
func HasPDFToPPM() bool
func HasPDFToText() bool
func HasTesseract() bool
func ImageOCRAvailable() bool
func IsImageMIME(mime string) bool
func IsScanned(extractedText string) bool
func OCR(ctx context.Context, data []byte, mime string, maxPages int) (text string, tsv []byte, err error)
func OCRAvailable() bool
func OCRWithProgress(ctx context.Context, data []byte, mime string, maxPages int) <-chan OCRProgress
func StripCodeFences(s string) string
type EntityContext
type ExtractionHints
- func ParseExtractionResponse(raw string) (ExtractionHints, error)
type ExtractionPromptInput
type MaintenanceHint
type OCRProgress
type Pipeline
- func (p *Pipeline) Run(ctx context.Context, data []byte, filename string, mime string) *Result
type Result

Constants ¶

const (
	DocTypeQuote      = "quote"
	DocTypeInvoice    = "invoice"
	DocTypeReceipt    = "receipt"
	DocTypeManual     = "manual"
	DocTypeWarranty   = "warranty"
	DocTypePermit     = "permit"
	DocTypeInspection = "inspection"
	DocTypeContract   = "contract"
	DocTypeOther      = "other"
)

Document type constants for ExtractionHints.DocumentType.

const (
	EntityHintProject     = "project"
	EntityHintAppliance   = "appliance"
	EntityHintVendor      = "vendor"
	EntityHintMaintenance = "maintenance"
	EntityHintQuote       = "quote"
	EntityHintServiceLog  = "service_log"
)

Entity kind hint constants for ExtractionHints.EntityKindHint.

const DefaultMaxOCRPages = 20

DefaultMaxOCRPages is the default page limit for OCR. The most useful information (specs, warranty, maintenance) typically appears in the first pages.

const DefaultTextTimeout = 30 * time.Second

DefaultTextTimeout is the default timeout for pdftotext.

Functions ¶

func BuildExtractionPrompt ¶

func BuildExtractionPrompt(in ExtractionPromptInput) []llm.Message

BuildExtractionPrompt creates the system and user messages for document extraction. The system prompt defines the JSON schema and rules; the user message contains the document metadata and extracted text from all sources.

For PDFs, both PdfText and OCRText may be present. The LLM receives both with source labels so it can reconcile differences.
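The source-labelling idea can be sketched as below. This is not the package's actual prompt: promptInput mirrors only some fields of ExtractionPromptInput, and buildUserContent and the label wording are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// promptInput mirrors the text-bearing fields of ExtractionPromptInput.
type promptInput struct {
	Filename string
	MIME     string
	PdfText  string
	OCRText  string
	Text     string
}

// buildUserContent assembles a user message body in which every text source
// carries its own label. The label wording is an assumption, not the
// package's real format.
func buildUserContent(in promptInput) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Filename: %s\nMIME: %s\n", in.Filename, in.MIME)
	sections := []struct{ label, body string }{
		{"pdftotext", in.PdfText},
		{"ocr", in.OCRText},
		{"text", in.Text},
	}
	for _, s := range sections {
		if strings.TrimSpace(s.body) == "" {
			continue // omit sources that produced nothing
		}
		fmt.Fprintf(&b, "\n--- source: %s ---\n%s\n", s.label, s.body)
	}
	return b.String()
}

func main() {
	fmt.Print(buildUserContent(promptInput{
		Filename: "invoice.pdf",
		MIME:     "application/pdf",
		PdfText:  "Total: $120.00",
		OCRText:  "Total: $12O.00",
	}))
}
```

Keeping the two sources separate, rather than merging them first, lets the model notice disagreements such as the "0" versus "O" confusion in the sample input.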

func ExtractText ¶

func ExtractText(data []byte, mime string, timeout time.Duration) (string, error)

ExtractText pulls plain text from document content based on MIME type. It returns an empty string (not an error) for unsupported MIME types. PDF extraction uses pdftotext (poppler-utils) when available and returns an empty string for PDFs when the tool is missing. The timeout parameter caps how long pdftotext may run (0 = DefaultTextTimeout).
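A minimal sketch of the timeout rule and the missing-tool behavior described above, using only the standard library. The names effectiveTimeout and pdfToText are hypothetical, and the "pdftotext - -" invocation (stdin to stdout) is an assumption about how the tool might be driven, not a statement of what the package does.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"time"
)

// defaultTextTimeout mirrors DefaultTextTimeout.
const defaultTextTimeout = 30 * time.Second

// effectiveTimeout applies the "0 = default" rule.
func effectiveTimeout(d time.Duration) time.Duration {
	if d <= 0 {
		return defaultTextTimeout
	}
	return d
}

// pdfToText pipes data through pdftotext, bounded by the timeout.
// A missing binary yields empty text and no error.
func pdfToText(data []byte, timeout time.Duration) (string, error) {
	path, err := exec.LookPath("pdftotext")
	if err != nil {
		return "", nil // tool not installed: degrade quietly
	}
	ctx, cancel := context.WithTimeout(context.Background(), effectiveTimeout(timeout))
	defer cancel()

	// Assumption: "-" selects stdin for the PDF and stdout for the text.
	cmd := exec.CommandContext(ctx, path, "-", "-")
	cmd.Stdin = bytes.NewReader(data)
	out, err := cmd.Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext: %w", err)
	}
	return string(out), nil
}

func main() {
	fmt.Println("effective timeout:", effectiveTimeout(0))
}
```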

func HasPDFToPPM ¶

func HasPDFToPPM() bool

HasPDFToPPM reports whether the pdftoppm binary (from poppler-utils) is on PATH. The result is cached for the process lifetime.

func HasPDFToText ¶

func HasPDFToText() bool

HasPDFToText reports whether the pdftotext binary (from poppler-utils) is on PATH. The result is cached for the process lifetime.

func HasTesseract ¶

func HasTesseract() bool

HasTesseract reports whether the tesseract binary is on PATH. The result is cached for the process lifetime.
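The "cached for the process lifetime" behavior shared by HasPDFToPPM, HasPDFToText, and HasTesseract can be approximated with exec.LookPath and sync.Once. This is a sketch of one way to get that behavior; cachedLookup is a hypothetical helper, and the package may implement the cache differently.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// cachedLookup returns a function that searches PATH for name once and
// reuses the answer on every later call.
func cachedLookup(name string) func() bool {
	var (
		once  sync.Once
		found bool
	)
	return func() bool {
		once.Do(func() {
			_, err := exec.LookPath(name)
			found = err == nil
		})
		return found
	}
}

// hasTesseract plays the role of HasTesseract in this sketch.
var hasTesseract = cachedLookup("tesseract")

func main() {
	fmt.Println("tesseract on PATH:", hasTesseract())
}
```

One consequence of caching is that installing a tool while the process is running is not picked up until restart.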

func ImageOCRAvailable ¶

func ImageOCRAvailable() bool

ImageOCRAvailable reports whether tesseract is available for direct image OCR (no pdftoppm needed for image files).

func IsImageMIME ¶

func IsImageMIME(mime string) bool

IsImageMIME reports whether the MIME type is an image format that tesseract can process.

func IsScanned ¶

func IsScanned(extractedText string) bool

IsScanned returns true if the extracted text is empty or whitespace-only, indicating the document likely needs OCR.
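The check described above amounts to a whitespace test. A minimal stand-in, with the hypothetical name isScanned, might look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// isScanned treats empty or whitespace-only text as a sign that the
// document has no text layer and probably needs OCR.
func isScanned(extractedText string) bool {
	return strings.TrimSpace(extractedText) == ""
}

func main() {
	fmt.Println(isScanned(" \n\t"))
	fmt.Println(isScanned("Invoice #42"))
}
```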

func OCR ¶

func OCR(
	ctx context.Context,
	data []byte,
	mime string,
	maxPages int,
) (text string, tsv []byte, err error)

OCR runs optical character recognition on the given document data. For PDFs, it rasterizes pages with pdftoppm then runs tesseract. For images, it runs tesseract directly.

Returns the extracted plain text, the raw TSV data (for future use with confidence scores and bounding boxes), and any error.

Callers should check OCRAvailable/ImageOCRAvailable before calling.

func OCRAvailable ¶

func OCRAvailable() bool

OCRAvailable reports whether both tesseract and pdftoppm are available, which is the minimum needed to OCR scanned PDFs.
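Taken together, OCRAvailable and ImageOCRAvailable suggest a per-MIME gate before calling OCR. The sketch below is illustrative only: tools and canOCR are hypothetical names, and matching images by the "image/" prefix is an assumption that may be broader than what IsImageMIME accepts.

```go
package main

import (
	"fmt"
	"strings"
)

// tools records which external binaries were found on PATH.
type tools struct {
	Tesseract bool
	PDFToPPM  bool
}

// canOCR reports whether OCR can run for the given MIME type.
// PDFs need both tools; images need only tesseract.
func canOCR(mime string, t tools) bool {
	switch {
	case mime == "application/pdf":
		return t.Tesseract && t.PDFToPPM
	case strings.HasPrefix(mime, "image/"): // assumption: any image subtype
		return t.Tesseract
	}
	return false
}

func main() {
	onlyTesseract := tools{Tesseract: true}
	fmt.Println("pdf:", canOCR("application/pdf", onlyTesseract))
	fmt.Println("png:", canOCR("image/png", onlyTesseract))
}
```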

func OCRWithProgress ¶

func OCRWithProgress(
	ctx context.Context,
	data []byte,
	mime string,
	maxPages int,
) <-chan OCRProgress

OCRWithProgress runs OCR with per-page progress updates sent on the returned channel. The channel closes when processing completes. Only PDF and image MIME types are supported; unsupported types produce a single Done message with empty text.
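A caller would typically range over the returned channel until it closes. The sketch below shows that consumption pattern against a fake producer, since the real function needs tesseract installed. ocrProgress mirrors a subset of OCRProgress; fakeOCR and drain are hypothetical helpers.

```go
package main

import "fmt"

// ocrProgress mirrors the OCRProgress fields used in this sketch.
type ocrProgress struct {
	Phase string
	Page  int
	Total int
	Done  bool
	Text  string
	Err   error
}

// fakeOCR stands in for OCRWithProgress: one update per page, then a
// final Done message carrying the accumulated text.
func fakeOCR(pages []string) <-chan ocrProgress {
	ch := make(chan ocrProgress)
	go func() {
		defer close(ch)
		var text string
		for i, p := range pages {
			text += p
			ch <- ocrProgress{Phase: "ocr", Page: i + 1, Total: len(pages)}
		}
		ch <- ocrProgress{Done: true, Text: text}
	}()
	return ch
}

// drain reads until the channel closes, so the producer is never left
// blocked on a send, and returns the final text and any reported error.
func drain(ch <-chan ocrProgress, onPage func(page, total int)) (string, error) {
	var (
		text string
		err  error
	)
	for p := range ch {
		if p.Err != nil {
			err = p.Err
		}
		if p.Done {
			text = p.Text
			continue
		}
		if p.Err == nil {
			onPage(p.Page, p.Total)
		}
	}
	return text, err
}

func main() {
	text, err := drain(fakeOCR([]string{"page one\n", "page two\n"}), func(page, total int) {
		fmt.Printf("page %d of %d\n", page, total)
	})
	fmt.Print(text)
	fmt.Println("error:", err)
}
```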

func StripCodeFences ¶

func StripCodeFences(s string) string

StripCodeFences removes markdown code fences that LLMs sometimes wrap around JSON output.
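One plausible way to strip such fences is shown below. It is a sketch, not the package's implementation: stripCodeFences handles only a single fence wrapping the whole response, and the fence string is built with strings.Repeat purely so the example embeds cleanly in Markdown.

```go
package main

import (
	"fmt"
	"strings"
)

// fence is the three-backtick marker.
var fence = strings.Repeat("`", 3)

// stripCodeFences removes one surrounding fence, including an optional
// language tag such as "json" on the opening line.
func stripCodeFences(s string) string {
	s = strings.TrimSpace(s)
	if !strings.HasPrefix(s, fence) {
		return s // nothing to strip
	}
	if i := strings.IndexByte(s, '\n'); i >= 0 {
		s = s[i+1:] // drop the whole opening line
	} else {
		s = strings.TrimPrefix(s, fence)
	}
	s = strings.TrimSpace(s)
	s = strings.TrimSuffix(s, fence)
	return strings.TrimSpace(s)
}

func main() {
	raw := fence + "json\n{\"document_type\":\"invoice\"}\n" + fence
	fmt.Println(stripCodeFences(raw))
}
```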

Types ¶

type EntityContext ¶

type EntityContext struct {
	Vendors    []string
	Projects   []string
	Appliances []string
}

EntityContext provides existing entity names so the LLM can match extracted references against known data instead of hallucinating.

type ExtractionHints ¶

type ExtractionHints struct {
	DocumentType   string            `json:"document_type"`
	TitleSugg      string            `json:"title_suggestion"`
	Summary        string            `json:"summary"`
	VendorHint     string            `json:"vendor_hint"`
	TotalCents     *int64            `json:"total_cents"`
	LaborCents     *int64            `json:"labor_cents"`
	MaterialsCents *int64            `json:"materials_cents"`
	Date           *time.Time        `json:"date"`
	WarrantyExpiry *time.Time        `json:"warranty_expiry"`
	EntityKindHint string            `json:"entity_kind_hint"`
	EntityNameHint string            `json:"entity_name_hint"`
	Maintenance    []MaintenanceHint `json:"maintenance_items"`
	Notes          string            `json:"notes"`
}

ExtractionHints holds structured data extracted from a document by the LLM. Every field is optional -- the model fills what it can. These hints pre-fill form fields; the user confirms before saving.
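The pointer-typed money fields let a caller tell "not extracted" apart from zero. The sketch below decodes a partial response with encoding/json into a struct that mirrors a subset of ExtractionHints with the same JSON tags. parseHints and formatCents are hypothetical helpers, and formatCents assumes non-negative amounts.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hints mirrors a subset of ExtractionHints.
type hints struct {
	DocumentType string `json:"document_type"`
	VendorHint   string `json:"vendor_hint"`
	TotalCents   *int64 `json:"total_cents"`
	LaborCents   *int64 `json:"labor_cents"`
}

// parseHints decodes a JSON response; absent fields keep their zero
// value, which is nil for the pointer fields.
func parseHints(raw string) (hints, error) {
	var h hints
	err := json.Unmarshal([]byte(raw), &h)
	return h, err
}

// formatCents renders an optional amount, keeping "unknown" distinct
// from zero. Assumes a non-negative value.
func formatCents(c *int64) string {
	if c == nil {
		return "unknown"
	}
	return fmt.Sprintf("$%d.%02d", *c/100, *c%100)
}

func main() {
	h, err := parseHints(`{"document_type":"invoice","total_cents":12050}`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(h.DocumentType, formatCents(h.TotalCents), formatCents(h.LaborCents))
}
```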

func ParseExtractionResponse ¶

func ParseExtractionResponse(raw string) (ExtractionHints, error)

ParseExtractionResponse parses the LLM's JSON response into ExtractionHints. It tolerates Markdown fences, partial responses, and minor format variations in money/date fields.

type ExtractionPromptInput ¶

type ExtractionPromptInput struct {
	Filename  string
	MIME      string
	SizeBytes int64
	Entities  EntityContext
	PdfText   string // pdftotext output (PDFs only)
	OCRText   string // tesseract output (scanned PDFs and images)
	Text      string // fallback for non-PDF/non-image files (e.g. text/plain)
}

ExtractionPromptInput holds the inputs for building an extraction prompt.

type MaintenanceHint ¶

type MaintenanceHint struct {
	Name           string `json:"name"`
	IntervalMonths int    `json:"interval_months"`
}

MaintenanceHint is a maintenance schedule item extracted from a document (typically an appliance manual).

type OCRProgress ¶

type OCRProgress struct {
	Phase string // "rasterize" or "ocr"
	Page  int    // current page (1-indexed)
	Total int    // total pages (0 until known)
	Done  bool   // all phases finished
	Text  string // accumulated text (set on Done)
	TSV   []byte // accumulated TSV (set on Done)
	Err   error  // set on failure
}

OCRProgress reports incremental progress from OCRWithProgress.

type Pipeline ¶

type Pipeline struct {
	LLMClient     *llm.Client   // nil = skip LLM extraction
	MaxOCRPages   int           // 0 = DefaultMaxOCRPages
	TextTimeout   time.Duration // 0 = DefaultTextTimeout
	EntityContext EntityContext // existing entities for LLM matching
}

Pipeline orchestrates the document extraction layers: text extraction, OCR, and LLM-powered structured extraction. Each layer is independent and gracefully degrades when its dependencies are unavailable.

func (*Pipeline) Run ¶

func (p *Pipeline) Run(
	ctx context.Context,
	data []byte,
	filename string,
	mime string,
) *Result

Run executes the extraction pipeline on the given document data. It never returns a Go error -- all failures are captured in Result.Err so the caller can save the document regardless.
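The "never returns a Go error" contract can be illustrated with a small stand-alone sketch in which each layer runs independently and failures accumulate in the result. Everything here (step, result, run, failing, errNoTool) is hypothetical and much simpler than the real Pipeline. In particular, it does not model when the real pipeline decides to run OCR or the LLM.

```go
package main

import (
	"errors"
	"fmt"
)

// step is one extraction layer; a nil fn means its dependency is
// unavailable and the layer is skipped.
type step struct {
	name string
	fn   func(data []byte) (string, error)
}

// result collects per-layer output plus any non-fatal errors.
type result struct {
	Texts map[string]string
	Err   error
}

var errNoTool = errors.New("tool unavailable")

// failing simulates a layer whose external tool is broken.
func failing([]byte) (string, error) { return "", errNoTool }

// run executes every available layer and always returns a result,
// recording failures in Err instead of aborting.
func run(data []byte, steps []step) *result {
	res := &result{Texts: map[string]string{}}
	for _, s := range steps {
		if s.fn == nil {
			continue
		}
		text, err := s.fn(data)
		if err != nil {
			res.Err = errors.Join(res.Err, fmt.Errorf("%s: %w", s.name, err))
			continue
		}
		res.Texts[s.name] = text
	}
	return res
}

func main() {
	res := run([]byte("hello"), []step{
		{name: "text", fn: func(b []byte) (string, error) { return string(b), nil }},
		{name: "ocr", fn: failing},
		{name: "llm"}, // nil fn: layer skipped
	})
	fmt.Println(res.Texts["text"], res.Err)
}
```

Because the result is always non-nil, a caller can save the document first and inspect Err afterwards.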

type Result ¶

type Result struct {
	ExtractedText string // best available text (pdftotext or OCR, merged)
	PdfText       string // raw pdftotext output (PDFs only)
	OCRText       string // raw OCR output (scanned PDFs and images)
	OCRData       []byte
	Hints         *ExtractionHints // nil if LLM unavailable or failed
	OCRUsed       bool
	LLMUsed       bool
	Err           error // non-fatal extraction error; document still saves
}

Result holds the output of a pipeline run.
