dataprep

package
v1.0.4 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 17, 2026 License: MIT Imports: 4 Imported by: 0

Documentation

Overview

Package dataprep provides data preparation utilities for RAG systems. It includes chunkers, parsers, and graph extractors to process and transform raw data into structured formats suitable for retrieval and generation.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Chunker

type Chunker interface {
	// Chunk splits a single document into a slice of chunks.
	// The chunks should be optimized for retrieval and generation tasks.
	//
	// Parameters:
	// - ctx: The context for the operation
	// - doc: The document to split
	//
	// Returns:
	// - A slice of chunks
	// - An error if splitting fails
	Chunk(ctx context.Context, doc *entity.Document) ([]*entity.Chunk, error)
}

Chunker defines the interface for document splitting. Implementations of this interface are responsible for breaking down large documents into smaller, more manageable chunks for retrieval.

type GraphExtractor

type GraphExtractor interface {
	// Extract parses a chunk and returns a list of Nodes (Entities) and Edges (Relationships).
	// This forms the foundation for GraphRAG indexing.
	Extract(ctx context.Context, chunk *entity.Chunk) ([]abstraction.Node, []abstraction.Edge, error)
}

GraphExtractor is responsible for LLM-based Entity and Relationship extraction.

type Indexer

type Indexer interface {
	// IndexFile processes a single file into the Vector/Graph stores.
	IndexFile(ctx context.Context, filePath string) error

	// IndexDirectory concurrently processes an entire directory.
	IndexDirectory(ctx context.Context, dirPath string, recursive bool) error
}

Indexer defines the entry point for the offline data preparation pipeline.

type Parser

type Parser interface {
	// ParseStream reads from an io.Reader and streams parsed Document objects.
	// This ensures O(1) memory complexity for handling massive files (e.g., 2GB logs).
	ParseStream(ctx context.Context, r io.Reader, metadata map[string]any) (<-chan *entity.Document, error)

	// GetSupportedTypes returns the file extensions or MIME types this parser supports.
	GetSupportedTypes() []string
}

Parser defines the streaming document parser for Next-Gen RAG.

type SemanticChunker

type SemanticChunker interface {
	Chunker

	// HierarchicalChunk creates Parent-Child relationships, enabling
	// fine-grained retrieval with broad context augmentation.
	// This pattern allows for retrieving specific chunks while maintaining
	// access to broader context.
	//
	// Parameters:
	// - ctx: The context for the operation
	// - doc: The document to split
	//
	// Returns:
	// - A slice of parent chunks
	// - A slice of child chunks
	// - An error if splitting fails
	HierarchicalChunk(ctx context.Context, doc *entity.Document) (parents []*entity.Chunk, children []*entity.Chunk, err error)

	// ContextualChunk injects a document-level summary into each child chunk's content
	// to preserve global context (Anthropic's Contextual Retrieval pattern).
	// This helps maintain context awareness during retrieval and generation.
	//
	// Parameters:
	// - ctx: The context for the operation
	// - doc: The document to split
	// - docSummary: A summary of the document to inject into each chunk
	//
	// Returns:
	// - A slice of chunks with injected context
	// - An error if splitting fails
	ContextualChunk(ctx context.Context, doc *entity.Document, docSummary string) ([]*entity.Chunk, error)
}

SemanticChunker extends Chunker to support Advanced RAG Chunking patterns. It provides additional methods for more sophisticated chunking strategies.
