Documentation ¶
Overview ¶
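Package loader provides a unified DocumentLoader interface and a set of built-in file loaders, serving as the data entry layer of a RAG pipeline.
It bridges raw data sources (local files, external APIs) to the rag.Document type. Each Loader reads one specific format and produces a []rag.Document with metadata, which downstream chunkers, retrievers, and vector stores then consume.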
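Core interfaces and types ¶
- DocumentLoader: the unified loading interface (Load + SupportedTypes)
- LoaderRegistry: routes to the matching Loader by file extension and supports custom registration
- TextLoader: plain text loader (.txt)
- MarkdownLoader: Markdown loader (.md) that splits a file into multiple Documents at top-level headings
- CSVLoader: CSV loader (.csv) with a configurable delimiter and row grouping
- JSONLoader: JSON/JSONL loader (.json / .jsonl) with field mapping
- GitHubSourceAdapter: adapts sources.GitHubSource to DocumentLoader
- ArxivSourceAdapter: adapts sources.ArxivSource to DocumentLoader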
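Main capabilities ¶
- Extension routing: LoaderRegistry selects a Loader automatically from the file extension
- Built-in formats: .txt / .md / .csv / .json / .jsonl are supported out of the box
- Custom extensions: register a Loader for any extension via LoaderRegistry.Register
- External source adaptation: the adapter pattern plugs query-based sources such as GitHub and arXiv into the Loader system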
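Usage example:
// ctx is a context.Context; myXMLLoader is any value that implements DocumentLoader.
registry := loader.NewLoaderRegistry()
docs, err := registry.Load(ctx, "/path/to/data.csv")
// Register a custom Loader for an additional extension.
registry.Register(".xml", myXMLLoader)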
Types ¶
type ArxivSourceAdapter ¶
type ArxivSourceAdapter struct {
// contains filtered or unexported fields
}
ArxivSourceAdapter adapts sources.ArxivSource to the DocumentLoader interface. It searches arXiv papers by query and converts each result into a rag.Document.
func NewArxivSourceAdapter ¶
func NewArxivSourceAdapter(source *sources.ArxivSource, maxResults int) *ArxivSourceAdapter
NewArxivSourceAdapter creates an adapter around an existing ArxivSource.
func (*ArxivSourceAdapter) Load ¶
func (a *ArxivSourceAdapter) Load(ctx context.Context, source string) ([]rag.Document, error)
Load interprets source as a search query and returns matching papers as Documents.
func (*ArxivSourceAdapter) SupportedTypes ¶
func (a *ArxivSourceAdapter) SupportedTypes() []string
SupportedTypes returns an empty slice; this adapter is query-based, not file-based.
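The adapter idea described above can be sketched roughly as follows. This is not the package's implementation: Document, Paper, PaperSearcher, QueryAdapter, fakeSearcher, and demoLoad are all hypothetical stand-ins, since the fields of rag.Document and the methods of sources.ArxivSource are not shown in this documentation.

```go
package main

import (
	"context"
	"fmt"
)

// Document is a stand-in for rag.Document (assumed shape, not the real type).
type Document struct {
	ID       string
	Content  string
	Metadata map[string]string
}

// Paper and PaperSearcher are hypothetical; they only approximate what a
// query-based source such as sources.ArxivSource might expose.
type Paper struct {
	ID, Title, Summary string
}

type PaperSearcher interface {
	Search(ctx context.Context, query string, maxResults int) ([]Paper, error)
}

// QueryAdapter wraps a searcher so it can be used wherever a loader is expected.
type QueryAdapter struct {
	source     PaperSearcher
	maxResults int
}

// Load treats source as a search query rather than a file path.
func (a *QueryAdapter) Load(ctx context.Context, source string) ([]Document, error) {
	papers, err := a.source.Search(ctx, source, a.maxResults)
	if err != nil {
		return nil, fmt.Errorf("search %q: %w", source, err)
	}
	docs := make([]Document, 0, len(papers))
	for _, p := range papers {
		docs = append(docs, Document{
			ID:       p.ID,
			Content:  p.Title + "\n\n" + p.Summary,
			Metadata: map[string]string{"title": p.Title, "query": source},
		})
	}
	return docs, nil
}

// SupportedTypes is empty because the adapter is not tied to file extensions.
func (a *QueryAdapter) SupportedTypes() []string { return []string{} }

// fakeSearcher returns canned results so the sketch runs without network access.
type fakeSearcher struct{}

func (fakeSearcher) Search(_ context.Context, query string, maxResults int) ([]Paper, error) {
	all := []Paper{
		{ID: "p1", Title: "First", Summary: "About " + query},
		{ID: "p2", Title: "Second", Summary: "More about " + query},
	}
	if maxResults > 0 && maxResults < len(all) {
		all = all[:maxResults]
	}
	return all, nil
}

// demoLoad runs a query through the adapter using the fake searcher.
func demoLoad(query string, maxResults int) ([]Document, error) {
	a := &QueryAdapter{source: fakeSearcher{}, maxResults: maxResults}
	return a.Load(context.Background(), query)
}

func main() {
	docs, err := demoLoad("retrieval augmented generation", 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(docs), docs[0].ID)
}
```

Because such an adapter returns no extensions from SupportedTypes, it would presumably be called directly rather than routed through a LoaderRegistry.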
type CSVLoader ¶
type CSVLoader struct {
// contains filtered or unexported fields
}
CSVLoader loads CSV files. Each row (or group of rows) becomes a Document. The first row is treated as a header.
func NewCSVLoader ¶
func NewCSVLoader(config CSVLoaderConfig) *CSVLoader
NewCSVLoader creates a CSVLoader with the given config.
func (*CSVLoader) SupportedTypes ¶
func (l *CSVLoader) SupportedTypes() []string
SupportedTypes returns the extensions handled by CSVLoader.
type CSVLoaderConfig ¶
type CSVLoaderConfig struct {
// Delimiter is the field separator. Defaults to ','.
Delimiter rune
// RowsPerDocument controls how many rows are grouped into a single Document.
// 0 or 1 means each row becomes its own Document.
RowsPerDocument int
// ContentColumns lists column names (from the header) to include in Document.Content.
// If empty, all columns are concatenated.
ContentColumns []string
}
CSVLoaderConfig configures the CSV loader.
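To make the three options concrete, here is a minimal sketch of how a delimiter, row grouping, and column selection could interact. It uses only the standard encoding/csv package; groupRows and contains are hypothetical helpers, and the "name: value" rendering of each row is an assumption, since the documentation does not say how CSVLoader formats Document.Content.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// contains reports whether name appears in list.
func contains(list []string, name string) bool {
	for _, item := range list {
		if item == name {
			return true
		}
	}
	return false
}

// groupRows parses CSV text and renders each group of rows as one content string.
// The first record is treated as the header.
func groupRows(data string, delimiter rune, rowsPerDocument int, contentColumns []string) ([]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	if delimiter != 0 {
		r.Comma = delimiter // the zero value keeps the default ','
	}
	records, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) < 2 {
		return nil, nil // empty input or header only: nothing to emit
	}
	header, rows := records[0], records[1:]
	if rowsPerDocument < 1 {
		rowsPerDocument = 1 // 0 behaves like 1: one row per document
	}

	// Pick the column indexes that contribute to the content.
	var keep []int
	for i, name := range header {
		if len(contentColumns) == 0 || contains(contentColumns, name) {
			keep = append(keep, i)
		}
	}

	var out []string
	for start := 0; start < len(rows); start += rowsPerDocument {
		end := start + rowsPerDocument
		if end > len(rows) {
			end = len(rows) // the last group may be shorter
		}
		var lines []string
		for _, row := range rows[start:end] {
			var fields []string
			for _, i := range keep {
				fields = append(fields, header[i]+": "+row[i])
			}
			lines = append(lines, strings.Join(fields, ", "))
		}
		out = append(out, strings.Join(lines, "\n"))
	}
	return out, nil
}

func main() {
	docs, err := groupRows("name;age\nann;30\nbob;25\ncy;41\n", ';', 2, nil)
	if err != nil {
		panic(err)
	}
	for i, d := range docs {
		fmt.Printf("doc %d:\n%s\n", i, d)
	}
}
```

Larger RowsPerDocument values trade fewer, longer Documents against finer retrieval granularity, so the right setting depends on how the downstream chunker is configured.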
type DocumentLoader ¶
type DocumentLoader interface {
// Load reads the source and returns documents.
// source is typically a file path, but loaders may interpret it as a URL or query.
Load(ctx context.Context, source string) ([]rag.Document, error)
// SupportedTypes returns the file extensions this loader handles (e.g. ".txt", ".md").
SupportedTypes() []string
}
DocumentLoader is the unified interface for loading documents from any source.
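As an illustration of what implementing this interface might look like, the sketch below defines a loader that emits one Document per non-empty line. LineLoader, parseLines, the ".log" extension, and the ID format are hypothetical, and Document is a local stand-in for rag.Document whose real fields are not shown here.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"strings"
)

// Document is a stand-in for rag.Document (assumed shape, not the real type).
type Document struct {
	ID       string
	Content  string
	Metadata map[string]string
}

// DocumentLoader mirrors the interface documented above, using the stand-in type.
type DocumentLoader interface {
	Load(ctx context.Context, source string) ([]Document, error)
	SupportedTypes() []string
}

// LineLoader is a hypothetical loader that turns each non-empty line into a Document.
type LineLoader struct{}

// Load reads the file at source and delegates parsing to parseLines.
func (LineLoader) Load(ctx context.Context, source string) ([]Document, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // respect cancellation before touching the file system
	}
	data, err := os.ReadFile(source)
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", source, err)
	}
	return parseLines(string(data), source), nil
}

// SupportedTypes lists the extensions this loader would be registered for.
func (LineLoader) SupportedTypes() []string { return []string{".log"} }

// parseLines builds one Document per non-empty line, with a line-based ID.
func parseLines(content, source string) []Document {
	var docs []Document
	for i, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		docs = append(docs, Document{
			ID:       fmt.Sprintf("%s#L%d", source, i+1),
			Content:  line,
			Metadata: map[string]string{"source": source},
		})
	}
	return docs
}

// Compile-time check that LineLoader satisfies the interface.
var _ DocumentLoader = LineLoader{}

func main() {
	for _, d := range parseLines("first\n\nsecond\n", "app.log") {
		fmt.Println(d.ID, d.Content)
	}
}
```

Keeping the parsing in a function that takes a string makes the loader easy to test without touching the file system.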
type GitHubSourceAdapter ¶
type GitHubSourceAdapter struct {
// contains filtered or unexported fields
}
GitHubSourceAdapter adapts sources.GitHubSource to the DocumentLoader interface. It searches GitHub repos by query and converts each result into a rag.Document.
func NewGitHubSourceAdapter ¶
func NewGitHubSourceAdapter(source *sources.GitHubSource, maxResults int) *GitHubSourceAdapter
NewGitHubSourceAdapter creates an adapter around an existing GitHubSource.
func (*GitHubSourceAdapter) Load ¶
func (a *GitHubSourceAdapter) Load(ctx context.Context, source string) ([]rag.Document, error)
Load interprets source as a search query and returns matching repos as Documents.
func (*GitHubSourceAdapter) SupportedTypes ¶
func (a *GitHubSourceAdapter) SupportedTypes() []string
SupportedTypes returns an empty slice; this adapter is query-based, not file-based.
type JSONLoader ¶
type JSONLoader struct {
// contains filtered or unexported fields
}
JSONLoader loads JSON (single object or array) and JSONL files.
func NewJSONLoader ¶
func NewJSONLoader(config JSONLoaderConfig) *JSONLoader
NewJSONLoader creates a JSONLoader.
func (*JSONLoader) SupportedTypes ¶
func (l *JSONLoader) SupportedTypes() []string
SupportedTypes returns the extensions handled by JSONLoader.
type JSONLoaderConfig ¶
type JSONLoaderConfig struct {
// ContentField is the JSON field name to use as Document.Content.
// If empty, the entire JSON object is serialized as content.
ContentField string
// IDField is the JSON field name to use as Document.ID.
// If empty, a path-based ID is generated.
IDField string
}
JSONLoaderConfig configures the JSON/JSONL loader.
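The field mapping described by ContentField and IDField could work along these lines. This is a sketch under assumptions, not the loader's actual code: docsFromJSON is a hypothetical helper, Document is a stand-in for rag.Document, the "source#index" ID format is invented for illustration, and falling back to full serialization when ContentField is missing from an object is a guess.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Document is a stand-in for rag.Document (assumed shape, not the real type).
type Document struct {
	ID      string
	Content string
}

// docsFromJSON accepts a single JSON object or an array of objects and maps
// each object to a Document using the configured field names.
func docsFromJSON(data []byte, source, contentField, idField string) ([]Document, error) {
	trimmed := bytes.TrimSpace(data)
	var objects []map[string]any
	if len(trimmed) > 0 && trimmed[0] == '[' {
		if err := json.Unmarshal(trimmed, &objects); err != nil {
			return nil, err
		}
	} else {
		var obj map[string]any
		if err := json.Unmarshal(trimmed, &obj); err != nil {
			return nil, err
		}
		objects = []map[string]any{obj}
	}

	docs := make([]Document, 0, len(objects))
	for i, obj := range objects {
		// Path-based fallback ID (format assumed for this sketch).
		doc := Document{ID: fmt.Sprintf("%s#%d", source, i)}
		if v, ok := obj[idField]; ok && idField != "" {
			doc.ID = fmt.Sprint(v)
		}
		if v, ok := obj[contentField]; ok && contentField != "" {
			doc.Content = fmt.Sprint(v)
		} else {
			// No content field configured or present: serialize the whole object.
			raw, err := json.Marshal(obj)
			if err != nil {
				return nil, err
			}
			doc.Content = string(raw)
		}
		docs = append(docs, doc)
	}
	return docs, nil
}

func main() {
	data := []byte(`[{"id":"a1","text":"hello"},{"text":"world"}]`)
	docs, err := docsFromJSON(data, "data.json", "text", "id")
	if err != nil {
		panic(err)
	}
	for _, d := range docs {
		fmt.Println(d.ID, d.Content)
	}
}
```

A JSONL file could reuse the same per-object mapping by decoding one object per line instead of one array per file.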
type LoaderRegistry ¶
type LoaderRegistry struct {
// contains filtered or unexported fields
}
LoaderRegistry routes Load calls to the appropriate DocumentLoader based on file extension.
func NewLoaderRegistry ¶
func NewLoaderRegistry() *LoaderRegistry
NewLoaderRegistry creates a registry pre-populated with the built-in loaders.
func (*LoaderRegistry) Load ¶
func (r *LoaderRegistry) Load(ctx context.Context, source string) ([]rag.Document, error)
Load determines the loader from the source's file extension and delegates to it.
func (*LoaderRegistry) Register ¶
func (r *LoaderRegistry) Register(ext string, loader DocumentLoader)
Register adds or replaces a loader for the given file extension. ext should include the leading dot (e.g. ".pdf").
func (*LoaderRegistry) SupportedTypes ¶
func (r *LoaderRegistry) SupportedTypes() []string
SupportedTypes returns all registered extensions, sorted.
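Extension-based routing of this kind can be sketched with a map keyed by extension. The code below is a simplified illustration rather than LoaderRegistry itself: Loader is reduced to a function type, Registry and NewRegistry are hypothetical names, and lowercasing the extension is an assumption the real registry may not share.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"
)

// Loader is a simplified stand-in for DocumentLoader that returns contents only.
type Loader func(source string) ([]string, error)

// Registry maps file extensions (with leading dot) to loaders.
type Registry struct {
	loaders map[string]Loader
}

// NewRegistry creates an empty registry; the real one is pre-populated.
func NewRegistry() *Registry {
	return &Registry{loaders: make(map[string]Loader)}
}

// Register adds or replaces the loader for ext.
func (r *Registry) Register(ext string, l Loader) {
	r.loaders[strings.ToLower(ext)] = l
}

// Load picks the loader from the extension of source and delegates to it.
func (r *Registry) Load(source string) ([]string, error) {
	ext := strings.ToLower(filepath.Ext(source))
	l, ok := r.loaders[ext]
	if !ok {
		return nil, fmt.Errorf("no loader registered for extension %q", ext)
	}
	return l(source)
}

// SupportedTypes returns all registered extensions in sorted order.
func (r *Registry) SupportedTypes() []string {
	exts := make([]string, 0, len(r.loaders))
	for ext := range r.loaders {
		exts = append(exts, ext)
	}
	sort.Strings(exts)
	return exts
}

func main() {
	r := NewRegistry()
	r.Register(".txt", func(source string) ([]string, error) {
		return []string{"text from " + source}, nil
	})
	r.Register(".csv", func(source string) ([]string, error) {
		return []string{"rows from " + source}, nil
	})
	fmt.Println(r.SupportedTypes())
	docs, err := r.Load("/path/to/data.csv")
	fmt.Println(docs, err)
	_, err = r.Load("report.pdf")
	fmt.Println(err)
}
```

This sketch is not safe for concurrent Register and Load calls; a registry shared across goroutines would need a mutex or registration before first use.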
type MarkdownLoader ¶
type MarkdownLoader struct{}
MarkdownLoader loads Markdown files, splitting by top-level headings. Each heading section becomes a separate Document with the heading preserved in metadata. If the file has no headings, the entire content is returned as a single Document.
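The splitting rule described above can be approximated as follows. This is a sketch under assumptions: Section and splitByH1 are hypothetical names, text before the first heading is kept as a section with an empty heading, and lines starting with "# " inside fenced code blocks are not special-cased, which the real loader may handle differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Section holds one top-level heading and the text beneath it.
type Section struct {
	Heading string
	Body    string
}

// splitByH1 splits Markdown text at lines that start with "# ".
// Deeper headings ("##", "###") stay inside the body of their section.
func splitByH1(content string) []Section {
	var sections []Section
	var current Section
	var body []string

	// flush stores the section collected so far, skipping fully empty ones.
	flush := func() {
		current.Body = strings.TrimSpace(strings.Join(body, "\n"))
		if current.Heading != "" || current.Body != "" {
			sections = append(sections, current)
		}
		body = nil
	}

	for _, line := range strings.Split(content, "\n") {
		if strings.HasPrefix(line, "# ") {
			flush()
			current = Section{Heading: strings.TrimSpace(line[2:])}
			continue
		}
		body = append(body, line)
	}
	flush()
	return sections
}

func main() {
	md := "intro\n# Setup\ninstall it\n## Details\nmore\n# Usage\nrun it\n"
	for _, s := range splitByH1(md) {
		fmt.Printf("%q -> %q\n", s.Heading, s.Body)
	}
}
```

A file without any "# " line yields a single section with an empty heading, matching the single-Document behavior described above.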
func NewMarkdownLoader ¶
func NewMarkdownLoader() *MarkdownLoader
NewMarkdownLoader creates a MarkdownLoader.
func (*MarkdownLoader) SupportedTypes ¶
func (l *MarkdownLoader) SupportedTypes() []string
SupportedTypes returns the extensions handled by MarkdownLoader.
type TextLoader ¶
type TextLoader struct{}
TextLoader loads plain text files as a single Document.
func (*TextLoader) SupportedTypes ¶
func (l *TextLoader) SupportedTypes() []string
SupportedTypes returns the extensions handled by TextLoader.