docparser

package
v0.1.8-rc.16
Published: Mar 19, 2026 License: Apache-2.0 Imports: 17 Imported by: 0

Documentation

Overview

Package docparser provides a multi-backend document parser that converts files (PDF, DOCX, images, etc.) into []datasource.NormalizedItem for vectorization. Backends include Docling Serve and Gemini. The active provider is selected via Config.Provider.

Functions

func DetectMIME

func DetectMIME(filename string) string

DetectMIME returns the MIME type for a filename based on its extension. Falls back to "application/octet-stream" for unknown types.
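
The lookup described above can be sketched as follows. This is a minimal illustration, not the package's implementation: detectMIME and the mimeByExt table are hypothetical names, the set of extensions is an assumption, and case-insensitive matching is assumed rather than documented.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeByExt is an illustrative table (assumption); the real package may
// recognize a different or larger set of extensions.
var mimeByExt = map[string]string{
	".pdf":  "application/pdf",
	".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	".png":  "image/png",
	".jpg":  "image/jpeg",
	".jpeg": "image/jpeg",
}

// detectMIME is a hypothetical stand-in for DetectMIME: it looks up the
// lowercased file extension and falls back to a generic binary type.
func detectMIME(filename string) string {
	ext := strings.ToLower(filepath.Ext(filename))
	if t, ok := mimeByExt[ext]; ok {
		return t
	}
	return "application/octet-stream"
}

func main() {
	fmt.Println(detectMIME("report.PDF")) // application/pdf
	fmt.Println(detectMIME("notes.xyz"))  // application/octet-stream
}
```

A fixed table keeps the result independent of the host system, unlike the standard library's mime.TypeByExtension, which may consult system MIME files.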

func SplitOnPageMarkers

func SplitOnPageMarkers(text string) []string

SplitOnPageMarkers splits text on "--- PAGE N ---" markers. If no markers are found, the result still contains at least one element: the whole text.
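
One way such a splitter could work is sketched below. The function name splitOnPageMarkers and the regular expression are assumptions, as are the choices to drop the markers themselves, trim whitespace, and skip empty segments; the documentation above does not specify any of these.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// pageMarker matches markers of the form "--- PAGE 12 ---" (assumed exact
// spelling and spacing).
var pageMarker = regexp.MustCompile(`--- PAGE \d+ ---`)

// splitOnPageMarkers is a hypothetical stand-in for SplitOnPageMarkers.
// It drops the markers, trims each segment, and skips empty segments.
// If nothing remains, it returns the original text as the only element.
func splitOnPageMarkers(text string) []string {
	var pages []string
	for _, seg := range pageMarker.Split(text, -1) {
		if s := strings.TrimSpace(seg); s != "" {
			pages = append(pages, s)
		}
	}
	if len(pages) == 0 {
		return []string{text}
	}
	return pages
}

func main() {
	doc := "--- PAGE 1 ---\nIntro\n--- PAGE 2 ---\nDetails"
	for i, p := range splitOnPageMarkers(doc) {
		fmt.Printf("%d: %q\n", i+1, p)
	}
}
```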

Types

type Config

type Config struct {
	// Provider selects the active backend: "docling" or "gemini".
	Provider string `toml:"provider,omitempty" yaml:"provider,omitempty"`

	Docling DoclingConfig `toml:"docling,omitempty" yaml:"docling,omitempty"`
	Gemini  GeminiConfig  `toml:"gemini,omitempty" yaml:"gemini,omitempty"`
}

Config selects which parsing backend to use and holds per-provider settings. Only the sub-config matching Provider is used; the rest are ignored.

func (Config) New

New creates a Provider from the given Config. Only the sub-config matching cfg.Provider is used. Returns an error for unknown or empty provider names.
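
The selection logic described above can be sketched with a switch on the provider name. Everything in this block is a stand-in: the lowercase types mirror the documented structs, newProvider is a hypothetical factory, the error messages are invented, and the provider interface exposes a Name method purely for illustration (the real Provider interface declares Parse).

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for DoclingConfig, GeminiConfig, and Config (assumption).
type doclingConfig struct{ BaseURL string }
type geminiConfig struct{ Model string }
type config struct {
	Provider string
	Docling  doclingConfig
	Gemini   geminiConfig
}

// provider is a simplified stand-in for the Provider interface.
type provider interface{ Name() string }

type doclingProvider struct{ baseURL string }

func (doclingProvider) Name() string { return "docling" }

type geminiProvider struct{ model string }

func (geminiProvider) Name() string { return "gemini" }

// newProvider reads only the sub-config that matches cfg.Provider and
// rejects empty or unknown names.
func newProvider(cfg config) (provider, error) {
	switch cfg.Provider {
	case "docling":
		return doclingProvider{baseURL: cfg.Docling.BaseURL}, nil
	case "gemini":
		return geminiProvider{model: cfg.Gemini.Model}, nil
	case "":
		return nil, errors.New("docparser: provider is empty")
	default:
		return nil, fmt.Errorf("docparser: unknown provider %q", cfg.Provider)
	}
}

func main() {
	p, err := newProvider(config{
		Provider: "gemini",
		Gemini:   geminiConfig{Model: "gemini-2.0-flash"},
	})
	fmt.Println(p.Name(), err)

	_, err = newProvider(config{Provider: "tika"})
	fmt.Println(err)
}
```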

type DoclingConfig

type DoclingConfig struct {
	// BaseURL is the Docling Serve REST API base (e.g. "http://localhost:5001").
	BaseURL string `toml:"base_url,omitempty" yaml:"base_url,omitempty"`
}

DoclingConfig holds settings for the Docling Serve sidecar backend.

type GeminiConfig

type GeminiConfig struct {
	// Model is the Gemini model to use for document parsing (e.g. "gemini-2.0-flash").
	Model string `toml:"model,omitempty" yaml:"model,omitempty"`
}

GeminiConfig holds settings for the Gemini file-upload backend.

type ParseRequest

type ParseRequest struct {
	// Reader provides the raw file content.
	Reader io.Reader
	// Filename is the original filename (used for MIME-type detection and metadata).
	Filename string
	// SourceID is a stable ID prefix for generated items (e.g. "gdrive:fileId").
	// Parsed pages/sections are suffixed with ":page:N".
	SourceID string
}

ParseRequest carries the file to parse. Reader provides the file content; Filename is used for MIME detection and metadata. SourceID is a stable prefix for generated item IDs (e.g. "gdrive:abc123").
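
As a small illustration of the ":page:N" suffix convention, the sketch below builds item IDs from a SourceID. The helper pageItemID is hypothetical, and 1-based page numbering is an assumption; the documentation does not say whether N starts at 0 or 1.

```go
package main

import "fmt"

// pageItemID is a hypothetical helper that appends the ":page:N" suffix
// to a stable source ID. Page numbering from 1 is assumed.
func pageItemID(sourceID string, page int) string {
	return fmt.Sprintf("%s:page:%d", sourceID, page)
}

func main() {
	for page := 1; page <= 3; page++ {
		fmt.Println(pageItemID("gdrive:abc123", page))
	}
}
```

Because the prefix is stable, re-parsing the same file yields the same item IDs, which lets a downstream store overwrite items instead of duplicating them.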

type Provider

type Provider interface {
	// Parse reads a document and returns one NormalizedItem per page or section.
	// Each item's Content contains the extracted text; Metadata includes
	// element_type, page_number, mime_type, and parser backend name.
	Parse(ctx context.Context, req ParseRequest) ([]datasource.NormalizedItem, error)
}

Provider is the interface that all document parsing backends implement. Given a ParseRequest (io.Reader + filename), it returns structured items ready for vectorization. Multi-page documents produce one NormalizedItem per page or logical section.
