Documentation
Overview
Package docparser provides a multi-backend document parser that converts files (PDF, DOCX, images, etc.) into []datasource.NormalizedItem for vectorization. Backends include Docling Serve and Gemini. The active provider is selected via Config.Provider.
Index
Constants
This section is empty.
Variables
This section is empty.
Functions
func DetectMIME
DetectMIME returns the MIME type for a filename based on its extension. It falls back to "application/octet-stream" for unknown types.
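As a rough illustration of extension-based detection with a fallback, the sketch below uses a small lookup table. The name detectMIMESketch and the table contents are hypothetical; the extensions the real DetectMIME recognizes, and whether it lowercases the extension first, are not documented here.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// knownTypes is an illustrative table only; the real package may
// recognize a different set of extensions.
var knownTypes = map[string]string{
	".pdf":  "application/pdf",
	".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	".png":  "image/png",
	".jpg":  "image/jpeg",
	".jpeg": "image/jpeg",
}

// detectMIMESketch is a hypothetical stand-in for DetectMIME. It looks up
// the lowercased file extension and falls back to a generic binary type.
func detectMIMESketch(filename string) string {
	ext := strings.ToLower(filepath.Ext(filename))
	if t, ok := knownTypes[ext]; ok {
		return t
	}
	return "application/octet-stream"
}

func main() {
	fmt.Println(detectMIMESketch("report.PDF")) // application/pdf
	fmt.Println(detectMIMESketch("notes.xyz"))  // application/octet-stream
}
```

Lowercasing the extension is an assumption made so that "report.PDF" and "report.pdf" resolve to the same type.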
func SplitOnPageMarkers
SplitOnPageMarkers splits text on "--- PAGE N ---" markers. It always returns at least one element: if no markers are found, the result is a single element containing the whole text.
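One way such a splitter could work is sketched below. The name splitOnPageMarkersSketch is hypothetical, and several details are assumptions rather than documented behavior: markers are expected on their own line, marker lines are dropped from the output, each segment is trimmed, and empty segments are skipped.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// pageMarker matches a "--- PAGE N ---" line. Requiring the marker to sit
// on its own line is an assumption of this sketch.
var pageMarker = regexp.MustCompile(`(?m)^--- PAGE \d+ ---[ \t]*\r?$`)

// splitOnPageMarkersSketch is a hypothetical stand-in for SplitOnPageMarkers.
// It returns the whole text as a single element when no marker is present.
func splitOnPageMarkersSketch(text string) []string {
	if !pageMarker.MatchString(text) {
		return []string{text}
	}
	var pages []string
	for _, part := range pageMarker.Split(text, -1) {
		// Skipping empty segments is an assumption of this sketch.
		if s := strings.TrimSpace(part); s != "" {
			pages = append(pages, s)
		}
	}
	if len(pages) == 0 {
		// Only markers and whitespace: still return one element.
		return []string{text}
	}
	return pages
}

func main() {
	doc := "--- PAGE 1 ---\nfirst\n--- PAGE 2 ---\nsecond\n"
	for i, page := range splitOnPageMarkersSketch(doc) {
		fmt.Printf("%d: %q\n", i+1, page)
	}
}
```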
Types
type Config
type Config struct {
	// Provider selects the active backend: "docling" or "gemini".
	Provider string        `toml:"provider,omitempty" yaml:"provider,omitempty"`
	Docling  DoclingConfig `toml:"docling,omitempty" yaml:"docling,omitempty"`
	Gemini   GeminiConfig  `toml:"gemini,omitempty" yaml:"gemini,omitempty"`
}
Config selects which parsing backend to use and holds per-provider settings. Only the sub-config matching Provider is used; the rest are ignored.
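The selection rule can be pictured as a switch on Provider that reads only the matching sub-config. The sketch below mirrors the documented structs locally (without tags) so it runs on its own; describeActive is a hypothetical helper, and treating an unrecognized provider as an error is an assumption, since the source does not say what happens in that case.

```go
package main

import "fmt"

// Local mirrors of the documented config types, without struct tags.
type DoclingConfig struct{ BaseURL string }
type GeminiConfig struct{ Model string }

type Config struct {
	Provider string
	Docling  DoclingConfig
	Gemini   GeminiConfig
}

// describeActive is a hypothetical helper that reads only the sub-config
// matching cfg.Provider. Returning an error for an unknown provider is an
// assumption of this sketch.
func describeActive(cfg Config) (string, error) {
	switch cfg.Provider {
	case "docling":
		return "docling at " + cfg.Docling.BaseURL, nil
	case "gemini":
		return "gemini model " + cfg.Gemini.Model, nil
	default:
		return "", fmt.Errorf("unknown provider %q", cfg.Provider)
	}
}

func main() {
	cfg := Config{
		Provider: "docling",
		Docling:  DoclingConfig{BaseURL: "http://localhost:5001"},
		Gemini:   GeminiConfig{Model: "gemini-2.0-flash"}, // ignored: not the active provider
	}
	desc, err := describeActive(cfg)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(desc) // docling at http://localhost:5001
}
```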
type DoclingConfig
type DoclingConfig struct {
	// BaseURL is the Docling Serve REST API base (e.g. "http://localhost:5001").
	BaseURL string `toml:"base_url,omitempty" yaml:"base_url,omitempty"`
}
DoclingConfig holds settings for the Docling Serve sidecar backend.
type GeminiConfig
type GeminiConfig struct {
	// Model is the Gemini model to use for document parsing (e.g. "gemini-2.0-flash").
	Model string `toml:"model,omitempty" yaml:"model,omitempty"`
}
GeminiConfig holds settings for the Gemini file-upload backend.
type ParseRequest
type ParseRequest struct {
	// Reader provides the raw file content.
	Reader io.Reader
	// Filename is the original filename (used for MIME-type detection and metadata).
	Filename string
	// SourceID is a stable ID prefix for generated items (e.g. "gdrive:fileId").
	// Parsed pages/sections are suffixed with ":page:N".
	SourceID string
}
ParseRequest carries the file to parse. Reader provides the file content; Filename is used for MIME detection and metadata. SourceID is a stable prefix for generated item IDs (e.g. "gdrive:abc123").
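The ":page:N" suffix scheme described in the SourceID comment can be illustrated in a few lines. pageItemID is a hypothetical helper, and numbering pages from 1 is an assumption; the source does not state whether N is zero- or one-based.

```go
package main

import "fmt"

// pageItemID is a hypothetical helper that builds an item ID from a stable
// source prefix and a page number, following the ":page:N" suffix scheme.
func pageItemID(sourceID string, page int) string {
	return fmt.Sprintf("%s:page:%d", sourceID, page)
}

func main() {
	// Page numbers starting at 1 is an assumption of this sketch.
	for page := 1; page <= 3; page++ {
		fmt.Println(pageItemID("gdrive:abc123", page))
	}
}
```

Because the prefix is stable, re-parsing the same file yields the same item IDs, which lets a downstream store overwrite earlier entries rather than duplicate them.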
type Provider
type Provider interface {
	// Parse reads a document and returns one NormalizedItem per page or section.
	// Each item's Content contains the extracted text; Metadata includes
	// element_type, page_number, mime_type, and parser backend name.
	Parse(ctx context.Context, req ParseRequest) ([]datasource.NormalizedItem, error)
}
Provider is the interface that all document parsing backends implement. Given a ParseRequest (io.Reader + filename), it returns structured items ready for vectorization. Multi-page documents produce one NormalizedItem per page or logical section.
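To make the contract concrete, here is a minimal sketch of a toy backend that satisfies a Provider-shaped interface. Everything outside the documented method signature is an assumption: NormalizedItem is a local stand-in whose fields (ID, Content, Metadata) are guessed from the comments above, the "parser" metadata key is invented for illustration, and plainTextProvider with its form-feed page breaks is hypothetical, not one of the package's backends.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// NormalizedItem is a local stand-in for datasource.NormalizedItem.
// Its fields are assumptions made for this sketch.
type NormalizedItem struct {
	ID       string
	Content  string
	Metadata map[string]any
}

// ParseRequest mirrors the documented request type.
type ParseRequest struct {
	Reader   io.Reader
	Filename string
	SourceID string
}

// Provider mirrors the documented interface, using the local item type.
type Provider interface {
	Parse(ctx context.Context, req ParseRequest) ([]NormalizedItem, error)
}

// plainTextProvider is a hypothetical backend that treats form feeds as
// page breaks and emits one item per page.
type plainTextProvider struct{}

func (plainTextProvider) Parse(ctx context.Context, req ParseRequest) ([]NormalizedItem, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	raw, err := io.ReadAll(req.Reader)
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", req.Filename, err)
	}
	var items []NormalizedItem
	for i, page := range strings.Split(string(raw), "\f") {
		items = append(items, NormalizedItem{
			ID:      fmt.Sprintf("%s:page:%d", req.SourceID, i+1),
			Content: strings.TrimSpace(page),
			Metadata: map[string]any{
				"element_type": "page",
				"page_number":  i + 1,
				"mime_type":    "text/plain",
				"parser":       "plaintext", // key name is an assumption
			},
		})
	}
	return items, nil
}

// parseSample runs the toy provider on a two-page in-memory document.
func parseSample() ([]NormalizedItem, error) {
	var p Provider = plainTextProvider{}
	return p.Parse(context.Background(), ParseRequest{
		Reader:   strings.NewReader("first page\fsecond page"),
		Filename: "notes.txt",
		SourceID: "local:notes",
	})
}

func main() {
	items, err := parseSample()
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for _, it := range items {
		fmt.Println(it.ID, "->", it.Content)
	}
}
```

Callers depend only on the interface, so a backend like this can be swapped for another without changing the code that consumes the returned items.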