parser

package
v0.9.0-alpha.3
This package is not in the latest version of its module.
Published: Mar 12, 2026 License: Apache-2.0 Imports: 5 Imported by: 22

Documentation

Overview

Package parser defines the Parser interface for converting raw byte streams into schema.Document values.

A Parser is not a standalone pipeline component — it is used inside a [document.Loader] to handle format-specific decoding. The loader fetches raw bytes; the parser converts them into documents.

Built-in Implementations

  • TextParser: treats the entire reader as plain text, one document per call
  • ExtParser: selects a parser by file extension (from [Options.URI]), with a configurable fallback for unknown extensions

Use ExtParser when you want format-agnostic loading: pass the source URI via WithURI and ExtParser picks the right sub-parser automatically.
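
The extension-based selection described above can be sketched without importing the package. This is a minimal illustration, not the library's implementation: parseFunc and pickParser are hypothetical stand-ins, and the only behavior taken from the documentation is that the extension of the URI selects the parser and that unknown extensions go to a fallback.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// parseFunc is a hypothetical stand-in for a Parser; it only reports
// which parser was chosen.
type parseFunc func() string

// pickParser looks up the parser registered for the URI's extension and
// returns the fallback when the extension is unknown or missing.
func pickParser(uri string, parsers map[string]parseFunc, fallback parseFunc) parseFunc {
	if p, ok := parsers[filepath.Ext(uri)]; ok {
		return p
	}
	return fallback
}

func main() {
	parsers := map[string]parseFunc{
		".pdf": func() string { return "pdf" },
		".md":  func() string { return "markdown" },
	}
	text := func() string { return "text" }

	for _, uri := range []string{"./testdata/test.pdf", "notes.md", "README", "data.csv"} {
		fmt.Printf("%s -> %s\n", uri, pickParser(uri, parsers, text)())
	}
}
```

In this sketch "README" has no extension and "data.csv" has an unregistered one, so both fall through to the text fallback.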

Reader Contract

The io.Reader passed to [Parser.Parse] is consumed during the call — it cannot be read again. Loaders must not reuse the same reader across multiple Parse calls.
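
The reason a reader cannot be reused is ordinary io.Reader behavior, which the following standard-library-only sketch demonstrates. The helper readTwice is hypothetical and exists only for illustration; it does not call the parser package.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// readTwice drains the same reader twice and returns what each pass saw.
func readTwice(content string) (first, second string, err error) {
	r := strings.NewReader(content)

	b, err := io.ReadAll(r) // first pass consumes everything
	if err != nil {
		return "", "", err
	}
	first = string(b)

	b, err = io.ReadAll(r) // second pass starts at EOF
	if err != nil {
		return "", "", err
	}
	return first, string(b), nil
}

func main() {
	first, second, err := readTwice("hello world")
	if err != nil {
		panic(err)
	}
	fmt.Printf("first=%q second=%q\n", first, second)
	// prints: first="hello world" second=""
}
```

A loader that needs to parse the same bytes more than once would have to create a fresh reader for each Parse call, for example by buffering the bytes and wrapping them again.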

Metadata Propagation

Use WithExtraMeta to attach key-value pairs that are merged into every document's MetaData. This is the standard way to tag documents with source information (URI, content type, etc.) at parse time.
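
As a rough sketch of what "merged into every document's MetaData" means, the snippet below copies extra key-value pairs into each document. The doc type and mergeMeta function are hypothetical stand-ins for schema.Document and the parser's internal logic, and letting extra values overwrite existing keys is an assumption of this sketch, not documented behavior.

```go
package main

import "fmt"

// doc is a hypothetical stand-in for schema.Document.
type doc struct {
	Content  string
	MetaData map[string]any
}

// mergeMeta copies every key of extra into each document's MetaData,
// allocating the map when a document has none yet. In this sketch,
// keys in extra overwrite existing keys (an assumption).
func mergeMeta(docs []*doc, extra map[string]any) {
	for _, d := range docs {
		if d.MetaData == nil {
			d.MetaData = make(map[string]any, len(extra))
		}
		for k, v := range extra {
			d.MetaData[k] = v
		}
	}
}

func main() {
	docs := []*doc{
		{Content: "page 1"},
		{Content: "page 2", MetaData: map[string]any{"page": 2}},
	}
	mergeMeta(docs, map[string]any{
		"_source":      "./testdata/test.pdf",
		"content_type": "application/pdf",
	})
	for _, d := range docs {
		fmt.Println(d.Content, d.MetaData["_source"], d.MetaData["content_type"])
	}
}
```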

See https://www.cloudwego.io/docs/eino/core_modules/components/document_loader_guide/document_parser_interface_guide/

Constants

const (
	// MetaKeySource is the metadata key storing the document's source URI.
	MetaKeySource = "_source"
)

Functions

func GetImplSpecificOptions

func GetImplSpecificOptions[T any](base *T, opts ...Option) *T

GetImplSpecificOptions lets Parser authors extract their own custom options from the unified Option type. T is the type of the implementation-specific options struct. This function should be used within the Parser implementation's Parse method. It is recommended to pass a base T as the first argument, in which the Parser author can set default values for the implementation-specific options.

Types

type ExtParser

type ExtParser struct {
	// contains filtered or unexported fields
}

ExtParser is a parser that uses the file extension to determine which parser to use. You can register your own parsers by calling RegisterParser. Default parser is TextParser. Note:

At parse time, the parser is looked up via filepath.Ext(uri), so callers must:
 	① pass the URI with parser.WithURI on each call
 	② make sure filepath.Ext can extract the expected extension from the URI

Example:

pdf, _ := os.Open("./testdata/test.pdf")
// extParser is an *ExtParser returned by NewExtParser.
docs, err := extParser.Parse(ctx, pdf, parser.WithURI("./testdata/test.pdf"))

func NewExtParser

func NewExtParser(ctx context.Context, conf *ExtParserConfig) (*ExtParser, error)

NewExtParser creates a new ExtParser.

func (*ExtParser) GetParsers

func (p *ExtParser) GetParsers() map[string]Parser

GetParsers returns a copy of the registered parsers map. Modifying the returned map is safe and does not change the parsers registered on the ExtParser.

func (*ExtParser) Parse

func (p *ExtParser) Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*schema.Document, error)

Parse parses the given reader and returns a list of documents.

type ExtParserConfig

type ExtParserConfig struct {
	// ext -> parser.
	// eg: map[string]Parser{
	// 	".pdf": &PDFParser{},
	// 	".md": &MarkdownParser{},
	// }
	Parsers map[string]Parser

	// Fallback parser to use when no other parser is found.
	// Default is TextParser if not set.
	FallbackParser Parser
}

ExtParserConfig defines the configuration for the ExtParser.

type Option

type Option struct {
	// contains filtered or unexported fields
}

Option defines a call option for the Parser component and is part of the component interface signature. Each Parser implementation can define its own options struct and option functions within its own package, then wrap those implementation-specific option functions into this type before passing them to Parse.

func WithExtraMeta

func WithExtraMeta(meta map[string]any) Option

WithExtraMeta attaches extra metadata to the parsed document.

func WithURI

func WithURI(uri string) Option

WithURI specifies the source URI of the document. ExtParser uses it to select the parser.

func WrapImplSpecificOptFn

func WrapImplSpecificOptFn[T any](optFn func(*T)) Option

WrapImplSpecificOptFn wraps implementation-specific option functions into the Option type. T is the type of the implementation-specific options struct. Parser implementations are required to use this function to convert their own option functions into the unified Option type. For example, if the Parser implementation defines its own options struct:

type customOptions struct {
    conf string
}

Then the impl needs to provide an option function as such:

func WithConf(conf string) Option {
    return WrapImplSpecificOptFn(func(o *customOptions) {
        o.conf = conf
    })
}
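
To show how wrapping and extraction fit together, here is a self-contained sketch of the pattern. It does not import the package: Option, wrapOptFn, and getOptions are hypothetical stand-ins that mirror the roles of parser.Option, WrapImplSpecificOptFn, and GetImplSpecificOptions, and the type-assertion mechanism is an assumption about how such a pattern can be built, not a statement about the library's internals.

```go
package main

import "fmt"

// Option is a hypothetical stand-in for parser.Option: it carries an
// implementation-specific option function behind an untyped field.
type Option struct {
	implSpecificOptFn any
}

// wrapOptFn mirrors the role of WrapImplSpecificOptFn.
func wrapOptFn[T any](optFn func(*T)) Option {
	return Option{implSpecificOptFn: optFn}
}

// getOptions mirrors the role of GetImplSpecificOptions: it applies every
// option function whose target type is T and skips all others.
func getOptions[T any](base *T, opts ...Option) *T {
	if base == nil {
		base = new(T)
	}
	for _, opt := range opts {
		if fn, ok := opt.implSpecificOptFn.(func(*T)); ok {
			fn(base)
		}
	}
	return base
}

// customOptions belongs to one hypothetical parser implementation.
type customOptions struct {
	conf string
}

// WithConf sets customOptions.conf through the unified Option type.
func WithConf(conf string) Option {
	return wrapOptFn(func(o *customOptions) { o.conf = conf })
}

// otherOptions belongs to a different hypothetical implementation.
type otherOptions struct {
	pages int
}

func main() {
	foreign := wrapOptFn(func(o *otherOptions) { o.pages = 3 })

	// The base supplies the default; WithConf overrides it; the foreign
	// option targets another struct type and is skipped.
	got := getOptions(&customOptions{conf: "default"}, foreign, WithConf("custom"))
	fmt.Println(got.conf)

	// With no matching options, the default from base is kept.
	fmt.Println(getOptions(&customOptions{conf: "default"}).conf)
}
```

Inside a real Parse method, the equivalent call would pass the opts received by Parse together with a base struct holding the implementation's defaults.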

type Options

type Options struct {
	// uri of source.
	URI string

	// ExtraMeta is merged into each document's MetaData.
	ExtraMeta map[string]any
}

Options configures the document parser with source URI and extra metadata.

func GetCommonOptions

func GetCommonOptions(base *Options, opts ...Option) *Options

GetCommonOptions extracts the parser Options from a list of Option values. The optional base supplies default values.
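
The base-plus-overrides behavior can be illustrated with a plain functional-options sketch. Everything here is a local stand-in: Options copies only the two documented fields, while Option, commonOptions, and the in-order application of options are assumptions made for illustration rather than the package's actual code.

```go
package main

import "fmt"

// Options mirrors the two documented fields of parser.Options.
type Options struct {
	URI       string
	ExtraMeta map[string]any
}

// Option is a hypothetical stand-in for parser.Option.
type Option struct {
	apply func(*Options)
}

// WithURI sets the source URI in this sketch.
func WithURI(uri string) Option {
	return Option{apply: func(o *Options) { o.URI = uri }}
}

// WithExtraMeta sets the extra metadata in this sketch.
func WithExtraMeta(meta map[string]any) Option {
	return Option{apply: func(o *Options) { o.ExtraMeta = meta }}
}

// commonOptions starts from base (or a zero Options) and applies opts in order.
func commonOptions(base *Options, opts ...Option) *Options {
	if base == nil {
		base = &Options{}
	}
	for _, opt := range opts {
		if opt.apply != nil {
			opt.apply(base)
		}
	}
	return base
}

func main() {
	// Only ExtraMeta is passed, so the default URI from base survives.
	o := commonOptions(&Options{URI: "unknown"}, WithExtraMeta(map[string]any{"lang": "en"}))
	fmt.Println(o.URI, o.ExtraMeta["lang"])

	// WithURI overrides the default.
	o = commonOptions(&Options{URI: "unknown"}, WithURI("notes.md"))
	fmt.Println(o.URI)
}
```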

type Parser

type Parser interface {
	Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*schema.Document, error)
}

Parser converts raw content from an io.Reader into schema.Document values.

Parse may return multiple documents from a single reader (e.g. a PDF with per-page splitting). The reader is consumed during Parse and must not be reused.
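
To make the "multiple documents from a single reader" case concrete, the sketch below splits plain text into one document per paragraph. Document and paragraphParser are hypothetical stand-ins, and the Parse signature omits the opts parameter to keep the example free of external imports; a real implementation would satisfy the Parser interface shown above.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// Document is a hypothetical stand-in for schema.Document.
type Document struct {
	Content string
}

// paragraphParser is a hypothetical parser that emits one document per
// blank-line-separated paragraph.
type paragraphParser struct{}

// Parse drains reader, so the caller must not pass the same reader again.
func (paragraphParser) Parse(ctx context.Context, reader io.Reader) ([]*Document, error) {
	data, err := io.ReadAll(reader)
	if err != nil {
		return nil, fmt.Errorf("read content: %w", err)
	}
	var docs []*Document
	for _, para := range strings.Split(string(data), "\n\n") {
		if para = strings.TrimSpace(para); para != "" {
			docs = append(docs, &Document{Content: para})
		}
	}
	return docs, nil
}

// parseString wraps s in a fresh reader and parses it.
func parseString(s string) ([]*Document, error) {
	return paragraphParser{}.Parse(context.Background(), strings.NewReader(s))
}

func main() {
	docs, err := parseString("first paragraph\n\nsecond paragraph\n")
	if err != nil {
		panic(err)
	}
	for i, d := range docs {
		fmt.Printf("%d: %s\n", i, d.Content)
	}
}
```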

Parsers are typically not called directly — they are configured on a [document.Loader] and invoked via [document.WithParserOptions].

type TextParser

type TextParser struct{}

TextParser is a simple parser that reads the text from a reader and returns a single document. Example:

docs, err := TextParser{}.Parse(ctx, strings.NewReader("hello world"))
fmt.Println(docs[0].Content) // "hello world"

func (TextParser) Parse

func (dp TextParser) Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*schema.Document, error)

Parse reads the text from a reader and returns a single document.
