docparser

package

Version: v0.1.8-rc.22 (latest) | Published: Mar 29, 2026 | License: Apache-2.0 | Imports: 17 | Imported by: 0

Repository

github.com/stackgenhq/genie

Documentation ¶

Overview ¶

Package docparser provides a multi-backend document parser that converts files (PDF, DOCX, images, etc.) into []datasource.NormalizedItem for vectorization. Backends include Docling Serve and Gemini. The active provider is selected via Config.Provider.

Index ¶

func DetectMIME(filename string) string
func SplitOnPageMarkers(text string) []string
type Config
- func (cfg Config) New(ctx context.Context, sp security.SecretProvider) (Provider, error)
type DoclingConfig
type GeminiConfig
type ParseRequest
type Provider

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func DetectMIME ¶

func DetectMIME(filename string) string

DetectMIME returns the MIME type for a filename based on its extension. It falls back to "application/octet-stream" for unknown types.
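The contract above (extension lookup with a generic fallback) can be illustrated with a minimal sketch. This is not the package's implementation: the function name detectMIMESketch, the lookup table, and the case-insensitive matching are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeByExt is a hypothetical lookup table; the real package may recognize
// a different set of extensions.
var mimeByExt = map[string]string{
	".pdf":  "application/pdf",
	".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	".png":  "image/png",
	".jpg":  "image/jpeg",
	".jpeg": "image/jpeg",
}

// detectMIMESketch follows the documented contract of DetectMIME: look up the
// file extension and fall back to application/octet-stream when it is unknown.
// Lowercasing the extension is an assumption of this sketch.
func detectMIMESketch(filename string) string {
	ext := strings.ToLower(filepath.Ext(filename))
	if m, ok := mimeByExt[ext]; ok {
		return m
	}
	return "application/octet-stream"
}

func main() {
	for _, name := range []string{"report.PDF", "scan.jpeg", "archive.xyz", "README"} {
		fmt.Printf("%-12s %s\n", name, detectMIMESketch(name))
	}
}
```

A fixed table keeps the result independent of the host's MIME database, which is one plausible reason to prefer it over the standard library's mime.TypeByExtension for this kind of routing decision.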

func SplitOnPageMarkers ¶

func SplitOnPageMarkers(text string) []string

SplitOnPageMarkers splits text on "--- PAGE N ---" markers. It always returns at least one element: if no markers are found, the result is the whole text as a single element.
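A rough sketch of marker-based splitting follows. The docs do not say whether markers must occupy a whole line, whether surrounding whitespace is trimmed, or whether empty pages are dropped; this sketch assumes all three, and the name splitOnPageMarkersSketch is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// pageMarker matches a line consisting only of "--- PAGE N ---".
// Requiring the marker to fill the whole line is an assumption of this sketch.
var pageMarker = regexp.MustCompile(`(?m)^--- PAGE \d+ ---[ \t]*\r?$\n?`)

// splitOnPageMarkersSketch returns the text between markers. With no markers
// it returns the input unchanged as a single element, per the documented
// contract. Trimming and dropping empty pages are assumptions.
func splitOnPageMarkersSketch(text string) []string {
	if !pageMarker.MatchString(text) {
		return []string{text}
	}
	var pages []string
	for _, part := range pageMarker.Split(text, -1) {
		if s := strings.TrimSpace(part); s != "" {
			pages = append(pages, s)
		}
	}
	if len(pages) == 0 {
		// Only markers and whitespace: still return one element.
		return []string{text}
	}
	return pages
}

func main() {
	doc := "--- PAGE 1 ---\nIntroduction\n--- PAGE 2 ---\nMethods\n"
	for i, page := range splitOnPageMarkersSketch(doc) {
		fmt.Printf("page %d: %q\n", i+1, page)
	}
}
```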

Types ¶

type Config ¶

type Config struct {
	// Provider selects the active backend: "docling", "gemini".
	Provider string `toml:"provider,omitempty" yaml:"provider,omitempty"`

	Docling DoclingConfig `toml:"docling,omitempty" yaml:"docling,omitempty"`
	Gemini  GeminiConfig  `toml:"gemini,omitempty" yaml:"gemini,omitempty"`
}

Config selects which parsing backend to use and holds per-provider settings. Only the sub-config matching Provider is used; the rest are ignored.

func (Config) New ¶

func (cfg Config) New(ctx context.Context, sp security.SecretProvider) (Provider, error)

New creates a Provider from the given Config. Only the sub-config matching cfg.Provider is used. Returns an error for unknown or empty provider names.
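The selection behavior described above resembles a small factory switch. The sketch below uses simplified local stand-ins (config, backend, newBackend) rather than the real types, and it omits the context.Context and security.SecretProvider arguments that the real New accepts; the error messages are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified local mirrors of the documented config structs.
type doclingConfig struct{ BaseURL string }

type geminiConfig struct{ Model string }

type config struct {
	Provider string
	Docling  doclingConfig
	Gemini   geminiConfig
}

// backend is a placeholder for the real Provider interface.
type backend interface{ Name() string }

type doclingBackend struct{ baseURL string }

func (doclingBackend) Name() string { return "docling" }

type geminiBackend struct{ model string }

func (geminiBackend) Name() string { return "gemini" }

// newBackend reads only the sub-config that matches cfg.Provider and
// rejects empty or unknown provider names.
func newBackend(cfg config) (backend, error) {
	switch cfg.Provider {
	case "docling":
		return doclingBackend{baseURL: cfg.Docling.BaseURL}, nil
	case "gemini":
		return geminiBackend{model: cfg.Gemini.Model}, nil
	case "":
		return nil, errors.New("docparser sketch: provider name is empty")
	default:
		return nil, fmt.Errorf("docparser sketch: unknown provider %q", cfg.Provider)
	}
}

func main() {
	cfg := config{
		Provider: "docling",
		Docling:  doclingConfig{BaseURL: "http://localhost:5001"},
		Gemini:   geminiConfig{Model: "gemini-2.0-flash"}, // ignored: Provider is "docling"
	}
	b, err := newBackend(cfg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("selected backend:", b.Name())

	if _, err := newBackend(config{Provider: "tesseract"}); err != nil {
		fmt.Println("error:", err)
	}
}
```

Treating the empty name as its own case lets a caller distinguish a missing setting from a typo, though whether the real package makes that distinction is not stated.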

type DoclingConfig ¶

type DoclingConfig struct {
	// BaseURL is the Docling Serve REST API base (e.g. "http://localhost:5001").
	BaseURL string `toml:"base_url,omitempty" yaml:"base_url,omitempty"`
}

DoclingConfig holds settings for the Docling Serve sidecar backend.

type GeminiConfig ¶

type GeminiConfig struct {
	// Model is the Gemini model to use for document parsing (e.g. "gemini-2.0-flash").
	Model string `toml:"model,omitempty" yaml:"model,omitempty"`
}

GeminiConfig holds settings for the Gemini file-upload backend.

type ParseRequest ¶

type ParseRequest struct {
	// Reader provides the raw file content.
	Reader io.Reader
	// Filename is the original filename (used for MIME-type detection and metadata).
	Filename string
	// SourceID is a stable ID prefix for generated items (e.g. "gdrive:fileId").
	// Parsed pages/sections are suffixed with ":page:N".
	SourceID string
}

ParseRequest carries the file to parse. Reader provides the file content; Filename is used for MIME detection and metadata. SourceID is a stable prefix for generated item IDs (e.g. "gdrive:abc123").
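To make the ID convention concrete, here is a small sketch of how a SourceID might be combined with the documented ":page:N" suffix. The helper name pageItemID is hypothetical, and the docs do not state whether N starts at 0 or 1; this sketch assumes 1-based numbering.

```go
package main

import "fmt"

// pageItemID appends the documented ":page:N" suffix to a SourceID.
// 1-based page numbering is an assumption of this sketch.
func pageItemID(sourceID string, page int) string {
	return fmt.Sprintf("%s:page:%d", sourceID, page)
}

func main() {
	for page := 1; page <= 3; page++ {
		fmt.Println(pageItemID("gdrive:abc123", page))
	}
}
```

Because the prefix is stable, re-parsing the same file would yield the same IDs under this scheme, which is presumably what allows downstream stores to overwrite stale entries instead of duplicating them.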

type Provider ¶

type Provider interface {
	// Parse reads a document and returns one NormalizedItem per page or section.
	// Each item's Content contains the extracted text; Metadata includes
	// element_type, page_number, mime_type, and parser backend name.
	Parse(ctx context.Context, req ParseRequest) ([]datasource.NormalizedItem, error)
}

Provider is the interface that all document parsing backends implement. Given a ParseRequest (io.Reader + filename), it returns structured items ready for vectorization. Multi-page documents produce one NormalizedItem per page or logical section.

Directories ¶

Path	Synopsis
docparserfakes	Code generated by counterfeiter.
