dataprep

package
v1.0.4 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 17, 2026 License: MIT Imports: 4 Imported by: 0

Documentation

Overview

Package dataprep provides data preparation utilities for RAG systems. It includes chunkers, parsers, and graph extractors to process and transform raw data into structured formats suitable for retrieval and generation.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Chunker

type Chunker interface {
	// Chunk splits a single document into a slice of chunks.
	// The chunks should be optimized for retrieval and generation tasks.
	//
	// Parameters:
	// - ctx: The context for the operation
	// - doc: The document to split
	//
	// Returns:
	// - A slice of chunks
	// - An error if splitting fails
	Chunk(ctx context.Context, doc *entity.Document) ([]*entity.Chunk, error)
}

Chunker defines the interface for document splitting. Implementations of this interface are responsible for breaking down large documents into smaller, more manageable chunks for retrieval.

type GraphExtractor

type GraphExtractor interface {
	// Extract parses a chunk and returns a list of Nodes (Entities) and Edges (Relationships).
	// This forms the foundation for GraphRAG indexing.
	Extract(ctx context.Context, chunk *entity.Chunk) ([]abstraction.Node, []abstraction.Edge, error)
}

GraphExtractor is responsible for LLM-based Entity and Relationship extraction.

type Indexer

type Indexer interface {
	// IndexFile processes a single file into the Vector/Graph stores.
	IndexFile(ctx context.Context, filePath string) error

	// IndexDirectory concurrently processes an entire directory.
	IndexDirectory(ctx context.Context, dirPath string, recursive bool) error
}

Indexer defines the entry point for the offline data preparation pipeline.

type Parser

type Parser interface {
	// ParseStream reads from an io.Reader and streams parsed Document objects.
	// This ensures O(1) memory complexity for handling massive files (e.g., 2GB logs).
	ParseStream(ctx context.Context, r io.Reader, metadata map[string]any) (<-chan *entity.Document, error)

	// GetSupportedTypes returns the file extensions or MIME types this parser supports.
	GetSupportedTypes() []string
}

Parser defines the streaming document parser for Next-Gen RAG.

type SemanticChunker

type SemanticChunker interface {
	Chunker

	// HierarchicalChunk creates Parent-Child relationships, enabling
	// fine-grained retrieval with broad context augmentation.
	// This pattern allows for retrieving specific chunks while maintaining
	// access to broader context.
	//
	// Parameters:
	// - ctx: The context for the operation
	// - doc: The document to split
	//
	// Returns:
	// - A slice of parent chunks
	// - A slice of child chunks
	// - An error if splitting fails
	HierarchicalChunk(ctx context.Context, doc *entity.Document) (parents []*entity.Chunk, children []*entity.Chunk, err error)

	// ContextualChunk injects a document-level summary into each child chunk's content
	// to preserve global context (Anthropic's Contextual Retrieval pattern).
	// This helps maintain context awareness during retrieval and generation.
	//
	// Parameters:
	// - ctx: The context for the operation
	// - doc: The document to split
	// - docSummary: A summary of the document to inject into each chunk
	//
	// Returns:
	// - A slice of chunks with injected context
	// - An error if splitting fails
	ContextualChunk(ctx context.Context, doc *entity.Document, docSummary string) ([]*entity.Chunk, error)
}

SemanticChunker extends Chunker to support Advanced RAG Chunking patterns. It provides additional methods for more sophisticated chunking strategies.
