loader

package
v1.1.0 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 24, 2026 License: MIT Imports: 12 Imported by: 0

Documentation

Overview

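Package loader provides a unified DocumentLoader interface and built-in file loaders, serving as the data-ingestion layer of a RAG pipeline.

It bridges raw data sources (local files, external APIs) to the rag.Document type. Each Loader reads one specific format and produces a []rag.Document with metadata, ready for the downstream chunker, retriever, and vector store.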

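Core interfaces and types

  • DocumentLoader: the unified loading interface (Load + SupportedTypes)
  • LoaderRegistry: routes to the matching Loader by file extension; supports custom registration
  • TextLoader: plain-text loader (.txt)
  • MarkdownLoader: Markdown loader (.md); splits a file into multiple Documents at top-level headings
  • CSVLoader: CSV loader (.csv); supports a custom delimiter and row grouping
  • JSONLoader: JSON/JSONL loader (.json / .jsonl); supports field mapping
  • GitHubSourceAdapter: adapts sources.GitHubSource to DocumentLoader
  • ArxivSourceAdapter: adapts sources.ArxivSource to DocumentLoader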

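Key capabilities

  • Extension routing: LoaderRegistry selects a Loader automatically from the file extension
  • Built-in formats: .txt / .md / .csv / .json / .jsonl are supported out of the box
  • Custom extensions: register a Loader for any extension via LoaderRegistry.Register
  • External source adaptation: the adapter pattern brings query-based sources such as GitHub and arXiv into the Loader system

Usage example: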

registry := loader.NewLoaderRegistry()
docs, err := registry.Load(ctx, "/path/to/data.csv")

// Register a custom Loader
registry.Register(".xml", myXMLLoader)

Types

type ArxivSourceAdapter

type ArxivSourceAdapter struct {
	// contains filtered or unexported fields
}

ArxivSourceAdapter adapts sources.ArxivSource to the DocumentLoader interface. It searches arXiv papers by query and converts each result into a rag.Document.

func NewArxivSourceAdapter

func NewArxivSourceAdapter(source *sources.ArxivSource, maxResults int) *ArxivSourceAdapter

NewArxivSourceAdapter creates an adapter around an existing ArxivSource.

func (*ArxivSourceAdapter) Load

func (a *ArxivSourceAdapter) Load(ctx context.Context, source string) ([]rag.Document, error)

Load interprets source as a search query and returns matching papers as Documents.

func (*ArxivSourceAdapter) SupportedTypes

func (a *ArxivSourceAdapter) SupportedTypes() []string

SupportedTypes returns an empty slice; this adapter is query-based, not file-based.

type CSVLoader

type CSVLoader struct {
	// contains filtered or unexported fields
}

CSVLoader loads CSV files. Each row (or group of rows) becomes a Document. The first row is treated as a header.

func NewCSVLoader

func NewCSVLoader(config CSVLoaderConfig) *CSVLoader

NewCSVLoader creates a CSVLoader with the given config.

func (*CSVLoader) Load

func (l *CSVLoader) Load(ctx context.Context, source string) ([]rag.Document, error)

Load reads a CSV file and returns Documents.

func (*CSVLoader) SupportedTypes

func (l *CSVLoader) SupportedTypes() []string

SupportedTypes returns the extensions handled by CSVLoader.

type CSVLoaderConfig

type CSVLoaderConfig struct {
	// Delimiter is the field separator. Defaults to ','.
	Delimiter rune
	// RowsPerDocument controls how many rows are grouped into a single Document.
	// 0 or 1 means each row becomes its own Document.
	RowsPerDocument int
	// ContentColumns lists column names (from the header) to include in Document.Content.
	// If empty, all columns are concatenated.
	ContentColumns []string
}

CSVLoaderConfig configures the CSV loader.
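The interaction of Delimiter, RowsPerDocument, and ContentColumns can be illustrated with a small standalone sketch. It does not call this package: Document is a hypothetical stand-in for rag.Document, and groupCSV, the "column: value" content format, and the "rows-N" IDs are assumptions made for illustration, not the behavior of CSVLoader itself.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in for rag.Document.
type Document struct {
	ID      string
	Content string
}

// groupCSV parses CSV text, treats the first row as the header, and groups
// rowsPerDoc data rows into each Document. A zero delimiter means ','.
// An empty contentColumns slice means every column is included.
func groupCSV(data string, delimiter rune, rowsPerDoc int, contentColumns []string) ([]Document, error) {
	r := csv.NewReader(strings.NewReader(data))
	if delimiter != 0 {
		r.Comma = delimiter
	}
	// By default the reader rejects rows whose field count differs from the
	// first row, so header[i] below is always in range.
	records, err := r.ReadAll()
	if err != nil {
		return nil, fmt.Errorf("parse csv: %w", err)
	}
	if len(records) == 0 {
		return nil, nil
	}
	if rowsPerDoc < 1 {
		rowsPerDoc = 1 // 0 or 1 both mean one row per Document
	}
	header := records[0]
	wanted := make(map[string]bool, len(contentColumns))
	for _, name := range contentColumns {
		wanted[name] = true
	}

	var docs []Document
	var lines []string
	flush := func() {
		if len(lines) == 0 {
			return
		}
		docs = append(docs, Document{
			ID:      fmt.Sprintf("rows-%d", len(docs)), // assumed ID scheme
			Content: strings.Join(lines, "\n"),
		})
		lines = nil
	}
	for _, row := range records[1:] {
		var cells []string
		for i, value := range row {
			if len(wanted) == 0 || wanted[header[i]] {
				cells = append(cells, header[i]+": "+value)
			}
		}
		lines = append(lines, strings.Join(cells, ", "))
		if len(lines) == rowsPerDoc {
			flush()
		}
	}
	flush() // emit the final, possibly shorter, group
	return docs, nil
}

func main() {
	data := "name;lang\nalice;go\nbob;rust\ncarol;ocaml\n"
	docs, err := groupCSV(data, ';', 2, nil)
	if err != nil {
		panic(err)
	}
	for _, d := range docs {
		fmt.Printf("%s\n%s\n\n", d.ID, d.Content)
	}
}
```

With three data rows and two rows per document, the last document holds the single leftover row.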

type DocumentLoader

type DocumentLoader interface {
	// Load reads the source and returns documents.
	// source is typically a file path, but loaders may interpret it as a URL or query.
	Load(ctx context.Context, source string) ([]rag.Document, error)

	// SupportedTypes returns the file extensions this loader handles (e.g. ".txt", ".md").
	SupportedTypes() []string
}

DocumentLoader is the unified interface for loading documents from any source.
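A custom loader only needs the two methods above. The sketch below mirrors the interface with local stand-in types so that it compiles on its own; Document, its fields, LineLoader, and the ".log" extension are all hypothetical and are not part of this package. A real implementation would return []rag.Document instead.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Document is a hypothetical stand-in for rag.Document.
type Document struct {
	ID       string
	Content  string
	Metadata map[string]any
}

// DocumentLoader mirrors the interface documented above, using the stand-in type.
type DocumentLoader interface {
	Load(ctx context.Context, source string) ([]Document, error)
	SupportedTypes() []string
}

// LineLoader is a hypothetical custom loader: one Document per non-empty line.
type LineLoader struct{}

// Load reads the file at source and emits a Document for each non-empty line.
func (LineLoader) Load(ctx context.Context, source string) ([]Document, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // respect cancellation before touching the disk
	}
	data, err := os.ReadFile(source)
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", source, err)
	}
	var docs []Document
	for i, line := range strings.Split(string(data), "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		docs = append(docs, Document{
			ID:       fmt.Sprintf("%s#L%d", filepath.Base(source), i+1),
			Content:  line,
			Metadata: map[string]any{"source": source, "line": i + 1},
		})
	}
	return docs, nil
}

// SupportedTypes reports the extensions this loader handles.
func (LineLoader) SupportedTypes() []string { return []string{".log"} }

// loadDemo writes contents to a temporary .log file and loads it back
// through the interface.
func loadDemo(contents string) ([]Document, error) {
	dir, err := os.MkdirTemp("", "loader-sketch")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "app.log")
	if err := os.WriteFile(path, []byte(contents), 0o600); err != nil {
		return nil, err
	}
	var l DocumentLoader = LineLoader{}
	return l.Load(context.Background(), path)
}

func main() {
	docs, err := loadDemo("first\n\nsecond\n")
	if err != nil {
		panic(err)
	}
	for _, d := range docs {
		fmt.Println(d.ID, d.Content)
	}
}
```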

type GitHubSourceAdapter

type GitHubSourceAdapter struct {
	// contains filtered or unexported fields
}

GitHubSourceAdapter adapts sources.GitHubSource to the DocumentLoader interface. It searches GitHub repos by query and converts each result into a rag.Document.

func NewGitHubSourceAdapter

func NewGitHubSourceAdapter(source *sources.GitHubSource, maxResults int) *GitHubSourceAdapter

NewGitHubSourceAdapter creates an adapter around an existing GitHubSource.

func (*GitHubSourceAdapter) Load

func (a *GitHubSourceAdapter) Load(ctx context.Context, source string) ([]rag.Document, error)

Load interprets source as a search query and returns matching repos as Documents.

func (*GitHubSourceAdapter) SupportedTypes

func (a *GitHubSourceAdapter) SupportedTypes() []string

SupportedTypes returns an empty slice; this adapter is query-based, not file-based.
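The adapter idea shared by GitHubSourceAdapter and ArxivSourceAdapter can be sketched in isolation as follows. Everything here is hypothetical: Paper, PaperSource, its Search method, and fakeSource are stand-ins invented for the example, and the real sources.GitHubSource and sources.ArxivSource APIs may look quite different. The point is only that Load treats its source argument as a query and that SupportedTypes returns an empty slice.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in for rag.Document.
type Document struct {
	ID      string
	Content string
}

// Paper and PaperSource are hypothetical stand-ins for a query-based source.
type Paper struct {
	ID, Title, Summary string
}

type PaperSource interface {
	Search(ctx context.Context, query string, maxResults int) ([]Paper, error)
}

// SourceAdapter exposes a PaperSource through Load and SupportedTypes.
type SourceAdapter struct {
	source     PaperSource
	maxResults int
}

func NewSourceAdapter(source PaperSource, maxResults int) *SourceAdapter {
	return &SourceAdapter{source: source, maxResults: maxResults}
}

// Load interprets source as a search query, not a file path.
func (a *SourceAdapter) Load(ctx context.Context, source string) ([]Document, error) {
	papers, err := a.source.Search(ctx, source, a.maxResults)
	if err != nil {
		return nil, fmt.Errorf("search %q: %w", source, err)
	}
	docs := make([]Document, 0, len(papers))
	for _, p := range papers {
		docs = append(docs, Document{ID: p.ID, Content: p.Title + "\n\n" + p.Summary})
	}
	return docs, nil
}

// SupportedTypes is empty because the adapter is query-based, not file-based.
func (a *SourceAdapter) SupportedTypes() []string { return []string{} }

// fakeSource is an in-memory PaperSource used only for this demonstration.
type fakeSource []Paper

func (f fakeSource) Search(_ context.Context, query string, maxResults int) ([]Paper, error) {
	var out []Paper
	for _, p := range f {
		if maxResults > 0 && len(out) >= maxResults {
			break
		}
		if strings.Contains(strings.ToLower(p.Title), strings.ToLower(query)) {
			out = append(out, p)
		}
	}
	return out, nil
}

// searchDemo runs a query against a fixed in-memory catalog.
func searchDemo(query string, maxResults int) ([]Document, error) {
	src := fakeSource{
		{ID: "p1", Title: "Dense Retrieval Basics", Summary: "An overview."},
		{ID: "p2", Title: "Retrieval-Augmented Generation", Summary: "A survey."},
		{ID: "p3", Title: "Tokenizers", Summary: "Notes."},
	}
	return NewSourceAdapter(src, maxResults).Load(context.Background(), query)
}

func main() {
	docs, err := searchDemo("retrieval", 2)
	if err != nil {
		panic(err)
	}
	for _, d := range docs {
		fmt.Println(d.ID)
	}
}
```

Because such an adapter reports no extensions, it would typically be called directly with a query rather than reached through extension-based routing.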

type JSONLoader

type JSONLoader struct {
	// contains filtered or unexported fields
}

JSONLoader loads JSON (single object or array) and JSONL files.

func NewJSONLoader

func NewJSONLoader(config JSONLoaderConfig) *JSONLoader

NewJSONLoader creates a JSONLoader.

func (*JSONLoader) Load

func (l *JSONLoader) Load(ctx context.Context, source string) ([]rag.Document, error)

Load reads a JSON or JSONL file and returns Documents.

func (*JSONLoader) SupportedTypes

func (l *JSONLoader) SupportedTypes() []string

SupportedTypes returns the extensions handled by JSONLoader.

type JSONLoaderConfig

type JSONLoaderConfig struct {
	// ContentField is the JSON field name to use as Document.Content.
	// If empty, the entire JSON object is serialized as content.
	ContentField string
	// IDField is the JSON field name to use as Document.ID.
	// If empty, a path-based ID is generated.
	IDField string
}

JSONLoaderConfig configures the JSON/JSONL loader.
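The field mapping described by ContentField and IDField can be sketched for the JSONL case as below. This is an illustration under assumptions, not the package's implementation: Document stands in for rag.Document, parseJSONL is a hypothetical helper, the "path#line" fallback ID is one possible reading of "path-based ID", and falling back to the raw line when the content field is missing or not a string is a choice made for this example.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in for rag.Document.
type Document struct {
	ID      string
	Content string
}

// parseJSONL turns each non-empty line of JSONL text into a Document.
// contentField and idField play the roles of ContentField and IDField.
func parseJSONL(data, contentField, idField, path string) ([]Document, error) {
	var docs []Document
	// Note: bufio.Scanner rejects lines longer than 64 KiB by default.
	sc := bufio.NewScanner(strings.NewReader(data))
	lineNo := 0
	for sc.Scan() {
		lineNo++
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var obj map[string]any
		if err := json.Unmarshal([]byte(line), &obj); err != nil {
			return nil, fmt.Errorf("%s line %d: %w", path, lineNo, err)
		}
		// Defaults: the whole object as content, a path-based ID (assumed format).
		doc := Document{ID: fmt.Sprintf("%s#%d", path, lineNo), Content: line}
		if contentField != "" {
			if s, ok := obj[contentField].(string); ok {
				doc.Content = s
			}
		}
		if idField != "" {
			if v, ok := obj[idField]; ok {
				doc.ID = fmt.Sprint(v) // numbers and strings both become text
			}
		}
		docs = append(docs, doc)
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	return docs, nil
}

func main() {
	data := `{"id":"a","text":"hello"}
{"id":7,"text":"world"}
`
	docs, err := parseJSONL(data, "text", "id", "notes.jsonl")
	if err != nil {
		panic(err)
	}
	for _, d := range docs {
		fmt.Printf("%s: %s\n", d.ID, d.Content)
	}
}
```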

type LoaderRegistry

type LoaderRegistry struct {
	// contains filtered or unexported fields
}

LoaderRegistry routes Load calls to the appropriate DocumentLoader based on file extension.

func NewLoaderRegistry

func NewLoaderRegistry() *LoaderRegistry

NewLoaderRegistry creates a registry pre-populated with the built-in loaders.

func (*LoaderRegistry) Load

func (r *LoaderRegistry) Load(ctx context.Context, source string) ([]rag.Document, error)

Load determines the loader from the source's file extension and delegates to it.

func (*LoaderRegistry) Register

func (r *LoaderRegistry) Register(ext string, loader DocumentLoader)

Register adds or replaces a loader for the given file extension. ext should include the leading dot (e.g. ".pdf").

func (*LoaderRegistry) SupportedTypes

func (r *LoaderRegistry) SupportedTypes() []string

SupportedTypes returns all registered extensions, sorted.
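Extension-based routing of this kind can be sketched with a map keyed by extension. The code below is a standalone illustration, not the source of LoaderRegistry: Registry, stubLoader, and the helper functions are hypothetical, and both the case-insensitive extension matching and the mutex are assumptions that the documentation above does not confirm.

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"
	"sort"
	"strings"
	"sync"
)

// Document is a hypothetical stand-in for rag.Document.
type Document struct {
	ID      string
	Content string
}

// DocumentLoader mirrors the documented interface, using the stand-in type.
type DocumentLoader interface {
	Load(ctx context.Context, source string) ([]Document, error)
	SupportedTypes() []string
}

// Registry is a hypothetical extension-to-loader map.
type Registry struct {
	mu      sync.RWMutex
	loaders map[string]DocumentLoader
}

func NewRegistry() *Registry {
	return &Registry{loaders: make(map[string]DocumentLoader)}
}

// Register adds or replaces the loader for ext (with leading dot).
func (r *Registry) Register(ext string, l DocumentLoader) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.loaders[strings.ToLower(ext)] = l // assumption: case-insensitive keys
}

// Load picks a loader from the source's extension and delegates to it.
func (r *Registry) Load(ctx context.Context, source string) ([]Document, error) {
	ext := strings.ToLower(filepath.Ext(source))
	r.mu.RLock()
	l, ok := r.loaders[ext]
	r.mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("no loader registered for extension %q", ext)
	}
	return l.Load(ctx, source)
}

// SupportedTypes returns all registered extensions, sorted.
func (r *Registry) SupportedTypes() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	exts := make([]string, 0, len(r.loaders))
	for ext := range r.loaders {
		exts = append(exts, ext)
	}
	sort.Strings(exts)
	return exts
}

// stubLoader is a hypothetical loader that records which loader handled a path.
type stubLoader struct{ name string }

func (s stubLoader) Load(_ context.Context, source string) ([]Document, error) {
	return []Document{{ID: source, Content: "handled by " + s.name}}, nil
}

func (s stubLoader) SupportedTypes() []string { return nil }

// demoRegistry builds a registry with two stub loaders.
func demoRegistry() *Registry {
	r := NewRegistry()
	r.Register(".txt", stubLoader{name: "text"})
	r.Register(".XML", stubLoader{name: "xml"})
	return r
}

// route loads path through r and returns the first Document's content.
func route(r *Registry, path string) (string, error) {
	docs, err := r.Load(context.Background(), path)
	if err != nil {
		return "", err
	}
	return docs[0].Content, nil
}

func main() {
	r := demoRegistry()
	fmt.Println(r.SupportedTypes())
	for _, path := range []string{"/data/feed.xml", "/data/report.pdf"} {
		got, err := route(r, path)
		fmt.Println(path, got, err)
	}
}
```

Returning an error for an unregistered extension, rather than falling back to a default loader, keeps unsupported files from being silently ingested as plain text; whether LoaderRegistry behaves this way is not stated above.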

type MarkdownLoader

type MarkdownLoader struct{}

MarkdownLoader loads Markdown files, splitting by top-level headings. Each heading section becomes a separate Document with the heading preserved in metadata. If the file has no headings, the entire content is returned as a single Document.
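The splitting rule can be sketched as a line scan for "# " prefixes. This is a simplified illustration, not MarkdownLoader's code: Document is a stand-in for rag.Document, splitMarkdown is hypothetical, storing the heading under a "heading" metadata key and leaving it out of Content are assumptions, and the sketch does not recognize fenced code blocks or Setext-style headings.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in for rag.Document.
type Document struct {
	Content  string
	Metadata map[string]string
}

// splitMarkdown starts a new Document at every top-level ("# ") heading.
// Deeper headings such as "## " stay inside the current section. Text before
// the first heading becomes a Document with an empty heading.
func splitMarkdown(text string) []Document {
	var docs []Document
	var body []string
	heading := ""
	seenHeading := false

	flush := func() {
		content := strings.TrimSpace(strings.Join(body, "\n"))
		body = nil
		if !seenHeading && content == "" {
			return // nothing before the first heading
		}
		docs = append(docs, Document{
			Content:  content,
			Metadata: map[string]string{"heading": heading},
		})
	}

	for _, line := range strings.Split(text, "\n") {
		if strings.HasPrefix(line, "# ") {
			flush() // close the previous section
			heading = strings.TrimSpace(strings.TrimPrefix(line, "# "))
			seenHeading = true
			continue
		}
		body = append(body, line)
	}
	flush() // close the final section
	return docs
}

func main() {
	text := "# Setup\ninstall it\n## Details\nmore\n# Usage\nrun it\n"
	for _, d := range splitMarkdown(text) {
		fmt.Printf("[%s] %q\n", d.Metadata["heading"], d.Content)
	}
}
```

A file without any "# " line falls through to the final flush and comes back as a single Document, matching the no-headings case described above.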

func NewMarkdownLoader

func NewMarkdownLoader() *MarkdownLoader

NewMarkdownLoader creates a MarkdownLoader.

func (*MarkdownLoader) Load

func (l *MarkdownLoader) Load(ctx context.Context, source string) ([]rag.Document, error)

Load reads a Markdown file and splits it into Documents by heading.

func (*MarkdownLoader) SupportedTypes

func (l *MarkdownLoader) SupportedTypes() []string

SupportedTypes returns the extensions handled by MarkdownLoader.

type TextLoader

type TextLoader struct{}

TextLoader loads plain text files as a single Document.

func NewTextLoader

func NewTextLoader() *TextLoader

NewTextLoader creates a TextLoader.

func (*TextLoader) Load

func (l *TextLoader) Load(ctx context.Context, source string) ([]rag.Document, error)

Load reads a text file and returns it as a single Document.

func (*TextLoader) SupportedTypes

func (l *TextLoader) SupportedTypes() []string

SupportedTypes returns the extensions handled by TextLoader.
