Documentation ¶
Overview ¶
Package parser defines the Parser interface for converting raw byte streams into schema.Document values.
A Parser is not a standalone pipeline component — it is used inside a [document.Loader] to handle format-specific decoding. The loader fetches raw bytes; the parser converts them into documents.
Built-in Implementations ¶
- TextParser: treats the entire reader as plain text, one document per call
- ExtParser: selects a parser by file extension (from [Options.URI]), with a configurable fallback for unknown extensions
Use ExtParser when you want format-agnostic loading: pass the source URI via WithURI and ExtParser picks the right sub-parser automatically.
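The extension-based selection described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: `parseFunc` and `pick` are hypothetical stand-ins for the Parser interface and for ExtParser's internal lookup.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// parseFunc is a hypothetical stand-in for a sub-parser; the real package
// uses the Parser interface instead.
type parseFunc func(data string) string

// pick returns the sub-parser registered for the URI's extension, or the
// fallback when the extension is unknown.
func pick(parsers map[string]parseFunc, fallback parseFunc, uri string) parseFunc {
	if p, ok := parsers[filepath.Ext(uri)]; ok {
		return p
	}
	return fallback
}

func main() {
	parsers := map[string]parseFunc{
		".md": func(data string) string { return "markdown:" + data },
	}
	text := func(data string) string { return "text:" + data }

	fmt.Println(pick(parsers, text, "notes/readme.md")("hi")) // markdown:hi
	fmt.Println(pick(parsers, text, "notes/data.bin")("hi"))  // text:hi
}
```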
Reader Contract ¶
The io.Reader passed to [Parser.Parse] is consumed during the call — it cannot be read again. Loaders must not reuse the same reader across multiple Parse calls.
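The reader contract follows from how io.Reader behaves in general. The sketch below uses only the standard library; `drain` and `readTwice` are hypothetical helpers that stand in for a Parse call reading its input to the end.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// drain reads r to the end and reports how many bytes it yielded.
func drain(r io.Reader) int {
	b, _ := io.ReadAll(r)
	return len(b)
}

// readTwice drains the same reader two times in a row.
func readTwice(s string) (first, second int) {
	r := strings.NewReader(s)
	first = drain(r)
	second = drain(r) // the reader is already at EOF
	return first, second
}

func main() {
	fmt.Println(readTwice("hello world")) // 11 0

	// To parse the same content twice, keep the bytes and wrap them in a
	// fresh reader for each call.
	raw := []byte("hello world")
	fmt.Println(drain(bytes.NewReader(raw)), drain(bytes.NewReader(raw))) // 11 11
}
```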
Metadata Propagation ¶
Use WithExtraMeta to attach key-value pairs that are merged into every document's MetaData. This is the standard way to tag documents with source information (URI, content type, etc.) at parse time.
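As a rough picture of what merging into every document's MetaData means, consider the sketch below. The `document` type and `mergeMeta` helper are hypothetical stand-ins, and the choice that extra keys overwrite existing ones is an assumption of this sketch, not documented behavior.

```go
package main

import "fmt"

// document is a hypothetical stand-in for schema.Document.
type document struct {
	Content  string
	MetaData map[string]any
}

// mergeMeta copies every extra key-value pair into each document's MetaData,
// allocating the map when a document has none. In this sketch, extra keys
// overwrite existing ones.
func mergeMeta(docs []*document, extra map[string]any) {
	for _, d := range docs {
		if d.MetaData == nil {
			d.MetaData = make(map[string]any, len(extra))
		}
		for k, v := range extra {
			d.MetaData[k] = v
		}
	}
}

func main() {
	docs := []*document{{Content: "page 1"}, {Content: "page 2"}}
	mergeMeta(docs, map[string]any{
		"_source":      "./testdata/test.pdf",
		"content_type": "application/pdf",
	})
	fmt.Println(docs[1].MetaData["_source"]) // ./testdata/test.pdf
}
```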
Constants ¶
const (
// MetaKeySource is the metadata key storing the document's source URI.
MetaKeySource = "_source"
)
Functions ¶
func GetImplSpecificOptions ¶
GetImplSpecificOptions lets Parser authors extract their own custom options from the unified Option type. T is the type of the implementation-specific options struct. Call this function inside the Parser implementation's Parse method. It is recommended to pass a base T as the first argument, in which the Parser author can set default values for the implementation-specific options.
Types ¶
type ExtParser ¶
type ExtParser struct {
// contains filtered or unexported fields
}
ExtParser is a parser that selects a sub-parser by file extension. You can register your own parsers by calling RegisterParser. The default parser is TextParser. Note:
Parse finds the matching parser via filepath.Ext(uri), so when using it: (1) you must pass the URI with parser.WithURI on each call; (2) the URI must yield the expected extension when passed through filepath.Ext.
e.g.:
// extParser is the *ExtParser returned by NewExtParser.
pdf, _ := os.Open("./testdata/test.pdf")
docs, err := extParser.Parse(ctx, pdf, parser.WithURI("./testdata/test.pdf"))
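Because the extension comes from filepath.Ext, it is worth checking what that function returns for the URIs you plan to pass. The sketch below shows a few cases; `extOf` is a hypothetical helper (not part of this package) that a caller might use to strip a query string before building the URI.

```go
package main

import (
	"fmt"
	"net/url"
	"path/filepath"
)

// extOf returns the extension of the path component of rawURI, dropping any
// query string or fragment first. It falls back to the raw string when the
// URI cannot be parsed.
func extOf(rawURI string) string {
	u, err := url.Parse(rawURI)
	if err != nil {
		return filepath.Ext(rawURI)
	}
	return filepath.Ext(u.Path)
}

func main() {
	fmt.Println(filepath.Ext("./testdata/test.pdf"))                   // .pdf
	fmt.Println(filepath.Ext("https://example.com/doc.pdf?version=2")) // .pdf?version=2
	fmt.Println(filepath.Ext("README") == "")                          // true
	fmt.Println(extOf("https://example.com/doc.pdf?version=2"))        // .pdf
}
```

A URI whose extension comes out as ".pdf?version=2" will not match a parser registered under ".pdf", so it would be handled by the fallback parser.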
func NewExtParser ¶
func NewExtParser(ctx context.Context, conf *ExtParserConfig) (*ExtParser, error)
NewExtParser creates a new ExtParser.
func (*ExtParser) GetParsers ¶
GetParsers returns a copy of the registered parsers. Modifying the returned copy is safe and does not affect the ExtParser.
type ExtParserConfig ¶
type ExtParserConfig struct {
// ext -> parser.
// eg: map[string]Parser{
// ".pdf": &PDFParser{},
// ".md": &MarkdownParser{},
// }
Parsers map[string]Parser
// Fallback parser to use when no other parser is found.
// Default is TextParser if not set.
FallbackParser Parser
}
ExtParserConfig defines the configuration for the ExtParser.
type Option ¶
type Option struct {
// contains filtered or unexported fields
}
Option defines a call option for the Parser component and is part of the component interface signature. Each Parser implementation can define its own options struct and option funcs within its own package, then wrap those implementation-specific option funcs into this type before passing them to Parse.
func WithExtraMeta ¶
WithExtraMeta attaches extra metadata to the parsed document.
func WithURI ¶
WithURI specifies the source URI of the document. ExtParser uses it to select the parser.
func WrapImplSpecificOptFn ¶
WrapImplSpecificOptFn wraps implementation-specific option functions into the Option type. T is the type of the implementation-specific options struct. Parser implementations are required to use this function to convert their own option functions into the unified Option type. For example, if the Parser implementation defines its own options struct:
type customOptions struct {
	conf string
}
Then the implementation needs to provide an option function like this:
func WithConf(conf string) Option {
	return WrapImplSpecificOptFn(func(o *customOptions) {
		o.conf = conf
	})
}
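To show how wrapping and extracting typed options can fit together, here is a self-contained sketch of the pattern. It is not the package's code: `option`, `wrap`, and `extract` are hypothetical stand-ins for Option, WrapImplSpecificOptFn, and GetImplSpecificOptions, and the internal layout of `option` is an assumption, since the real type's fields are unexported.

```go
package main

import "fmt"

// option is a hypothetical stand-in for the package's Option type.
type option struct {
	implSpecific any
}

// wrap stores a typed option function behind the untyped option value.
func wrap[T any](fn func(*T)) option {
	return option{implSpecific: fn}
}

// extract applies every option whose function targets *T to base and returns
// it. Options meant for other implementations are skipped.
func extract[T any](base *T, opts ...option) *T {
	if base == nil {
		base = new(T)
	}
	for _, o := range opts {
		if fn, ok := o.implSpecific.(func(*T)); ok {
			fn(base)
		}
	}
	return base
}

type customOptions struct{ conf string }
type otherOptions struct{ n int }

func withConf(conf string) option {
	return wrap(func(o *customOptions) { o.conf = conf })
}

func main() {
	got := extract(
		&customOptions{conf: "default"}, // base carrying default values
		withConf("strict"),
		wrap(func(o *otherOptions) { o.n = 1 }), // ignored: targets another type
	)
	fmt.Println(got.conf) // strict
}
```

Keying the lookup on the options struct type is what lets several implementations share one option list without seeing each other's settings.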
type Options ¶
type Options struct {
// uri of source.
URI string
// extra metadata merged into each document.
ExtraMeta map[string]any
}
Options configures the document parser with source URI and extra metadata.
func GetCommonOptions ¶
GetCommonOptions extracts the parser Options from an Option list, optionally starting from a base Options that holds default values.
type Parser ¶
type Parser interface {
Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*schema.Document, error)
}
Parser converts raw content from an io.Reader into schema.Document values.
Parse may return multiple documents from a single reader (e.g. a PDF with per-page splitting). The reader is consumed during Parse and must not be reused.
Parsers are typically not called directly. They are configured on a [document.Loader], and their call options are supplied via [document.WithParserOptions].
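As an illustration of a parser that returns several documents from one reader, the sketch below emits one document per non-empty line. It is self-contained on purpose: `document` is a hypothetical stand-in for schema.Document, and `lineParser.Parse` mirrors the shape of the Parser interface but omits the options parameter.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"strings"
)

// document is a hypothetical stand-in for schema.Document.
type document struct{ Content string }

// lineParser emits one document per non-empty line.
type lineParser struct{}

// Parse consumes r to the end; the caller must not reuse the reader.
func (lineParser) Parse(ctx context.Context, r io.Reader) ([]*document, error) {
	var docs []*document
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		// Stop early if the caller cancelled the context.
		if err := ctx.Err(); err != nil {
			return nil, err
		}
		if line := strings.TrimSpace(sc.Text()); line != "" {
			docs = append(docs, &document{Content: line})
		}
	}
	return docs, sc.Err()
}

// parseString is a small convenience wrapper for trying the parser out.
func parseString(s string) ([]*document, error) {
	return lineParser{}.Parse(context.Background(), strings.NewReader(s))
}

func main() {
	docs, err := parseString("alpha\n\nbeta\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(docs), docs[1].Content) // 2 beta
}
```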
type TextParser ¶
type TextParser struct{}
TextParser is a simple parser that reads the text from a reader and returns a single document. e.g.:
docs, err := (&TextParser{}).Parse(ctx, strings.NewReader("hello world"))
fmt.Println(docs[0].Content) // "hello world"