Documentation ¶
Overview ¶
Package indexing provides document indexing pipeline steps for RAG data preparation.
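This package contains reusable steps for building indexing pipelines:
- Discover: file discovery and validation
- Multi, MultiFactory: multi-parser streaming document parsing
- Chunk: semantic chunking of documents
- Batch, MultimodalEmbed: batch and multimodal embedding generation
- MultiStore: storage of chunks, vectors, and entities across multiple backends
- Entities, ExtractTriples: entity and triple extraction for graph indexing
- DetectCommunities, GenerateSummaries: community detection and summarization over the knowledge graph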
Example usage:
    p := pipeline.New[*core.IndexingContext]()
    p.AddSteps(
        indexing.Discover(),
        indexing.MultiFactory(registry),
        indexing.Chunk(chunker),
        indexing.Batch(embedder, metrics),
        indexing.MultiStore(vectorStore, docStore, graphStore, logger, metrics),
    )
    err := p.Execute(ctx, &core.IndexingContext{FilePath: "document.pdf"})
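To make the step-chaining model concrete, the following is a minimal, self-contained sketch of a generic pipeline that runs steps in order over a shared state. It does not use this package: `Step`, `Pipeline`, `State`, `funcStep`, and `record` are hypothetical stand-ins, and the real `pipeline.Step[T]` interface may have different methods.

```go
package main

import (
	"context"
	"fmt"
)

// Step is a hypothetical stand-in for pipeline.Step[T]; the real interface may differ.
type Step[T any] interface {
	Name() string
	Execute(ctx context.Context, state T) error
}

// Pipeline runs its steps in order and stops at the first error.
type Pipeline[T any] struct {
	steps []Step[T]
}

// AddSteps appends steps to the end of the pipeline.
func (p *Pipeline[T]) AddSteps(steps ...Step[T]) {
	p.steps = append(p.steps, steps...)
}

// Execute passes the same state value to every step.
func (p *Pipeline[T]) Execute(ctx context.Context, state T) error {
	for _, s := range p.steps {
		if err := ctx.Err(); err != nil {
			return err // stop early if the context was cancelled
		}
		if err := s.Execute(ctx, state); err != nil {
			return fmt.Errorf("step %s: %w", s.Name(), err)
		}
	}
	return nil
}

// State is a hypothetical shared state; it is not core.IndexingContext.
type State struct {
	FilePath string
	Trace    []string
}

// funcStep adapts a plain function to the Step interface.
type funcStep struct {
	name string
	fn   func(ctx context.Context, s *State) error
}

func (f funcStep) Name() string                                { return f.name }
func (f funcStep) Execute(ctx context.Context, s *State) error { return f.fn(ctx, s) }

// record returns a step that only appends its own name to the trace.
func record(name string) Step[*State] {
	return funcStep{name: name, fn: func(_ context.Context, s *State) error {
		s.Trace = append(s.Trace, name)
		return nil
	}}
}

// runDemo builds a three-step pipeline and returns the order in which steps ran.
func runDemo() ([]string, error) {
	p := &Pipeline[*State]{}
	p.AddSteps(record("discover"), record("parse"), record("chunk"))
	st := &State{FilePath: "document.pdf"}
	err := p.Execute(context.Background(), st)
	return st.Trace, err
}

func main() {
	trace, err := runDemo()
	fmt.Println(trace, err)
}
```

Because every step shares one type parameter, the compiler rejects a step written for a different state type, which is why all constructors in this package return the same `pipeline.Step[*core.IndexingContext]`.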
Index ¶
- func Batch(embedder embedding.Provider, metrics core.Metrics) pipeline.Step[*core.IndexingContext]
- func Chunk(chunker core.SemanticChunker) pipeline.Step[*core.IndexingContext]
- func DetectCommunities(detector core.CommunityDetector, graphStore core.GraphStore, ...) pipeline.Step[*core.IndexingContext]
- func Discover() pipeline.Step[*core.IndexingContext]
- func Entities(extractor core.EntityExtractor, logger logging.Logger) pipeline.Step[*core.IndexingContext]
- func ExtractTriples(extractor core.TriplesExtractor, graphStore core.GraphStore, ...) pipeline.Step[*core.IndexingContext]
- func GenerateSummaries(llm chat.Client, graphStore core.GraphStore, logger logging.Logger) pipeline.Step[*core.IndexingContext]
- func Multi(parsers ...core.Parser) pipeline.Step[*core.IndexingContext]
- func MultiFactory(registry *types.ParserRegistry) pipeline.Step[*core.IndexingContext]
- func MultiStore(vectorStore core.VectorStore, docStore core.DocStore, ...) pipeline.Step[*core.IndexingContext]
- func MultimodalEmbed(provider embedding.MultimodalProvider, metrics core.Metrics) pipeline.Step[*core.IndexingContext]
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Batch ¶
func Batch(embedder embedding.Provider, metrics core.Metrics) pipeline.Step[*core.IndexingContext]
Batch creates a new batch embedding step with metrics collection.
Parameters:
- embedder: embedding provider implementation
- metrics: metrics collector (optional, can be nil)
Example:
p.AddStep(indexing.Batch(embedder, metrics))
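As an illustration of what a batch embedding step with an optional metrics collector might do internally, here is a self-contained sketch. It is not the package's implementation: `Metrics`, `countingMetrics`, `embedFunc`, `embedInBatches`, and `fakeEmbed` are hypothetical names, and the batch size parameter is an assumption.

```go
package main

import "fmt"

// Metrics is a hypothetical collector interface; core.Metrics may look different.
type Metrics interface {
	RecordBatch(size int)
}

// countingMetrics counts batches and items for demonstration purposes.
type countingMetrics struct {
	batches, items int
}

func (m *countingMetrics) RecordBatch(size int) {
	m.batches++
	m.items += size
}

// embedFunc stands in for a call to an embedding provider.
type embedFunc func(texts []string) ([][]float32, error)

// embedInBatches splits texts into slices of at most batchSize, embeds each
// slice, and reports to metrics only when a collector was supplied.
func embedInBatches(texts []string, batchSize int, embed embedFunc, metrics Metrics) ([][]float32, error) {
	if batchSize <= 0 {
		return nil, fmt.Errorf("batch size must be positive, got %d", batchSize)
	}
	out := make([][]float32, 0, len(texts))
	for start := 0; start < len(texts); start += batchSize {
		end := start + batchSize
		if end > len(texts) {
			end = len(texts)
		}
		vecs, err := embed(texts[start:end])
		if err != nil {
			return nil, fmt.Errorf("batch starting at %d: %w", start, err)
		}
		if len(vecs) != end-start {
			return nil, fmt.Errorf("batch starting at %d: got %d vectors for %d texts", start, len(vecs), end-start)
		}
		out = append(out, vecs...)
		if metrics != nil { // a nil collector simply disables reporting
			metrics.RecordBatch(end - start)
		}
	}
	return out, nil
}

// fakeEmbed returns a one-dimensional vector holding each text's byte length.
func fakeEmbed(texts []string) ([][]float32, error) {
	vecs := make([][]float32, len(texts))
	for i, t := range texts {
		vecs[i] = []float32{float32(len(t))}
	}
	return vecs, nil
}

func main() {
	m := &countingMetrics{}
	vecs, err := embedInBatches([]string{"a", "bb", "ccc"}, 2, fakeEmbed, m)
	fmt.Println(len(vecs), m.batches, m.items, err)
}
```

Note that the nil check only works when the caller passes an untyped `nil`; a nil pointer wrapped in the interface would not compare equal to `nil`.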
func Chunk ¶
func Chunk(chunker core.SemanticChunker) pipeline.Step[*core.IndexingContext]
Chunk creates a semantic chunking step.
Parameters:
- chunker: semantic chunker implementation
Example:
p.AddStep(indexing.Chunk(chunker))
func DetectCommunities ¶ added in v1.1.6
func DetectCommunities(detector core.CommunityDetector, graphStore core.GraphStore, logger logging.Logger) pipeline.Step[*core.IndexingContext]
DetectCommunities creates a step that detects communities in the knowledge graph. Communities are hierarchical groups of related nodes that enable:
- Global search (searching community summaries)
- Hierarchical summarization (multi-level understanding)
func Discover ¶
func Discover() pipeline.Step[*core.IndexingContext]
Discover creates a new file discovery step.
Example:
p.AddStep(indexing.Discover())
func Entities ¶
func Entities(extractor core.EntityExtractor, logger logging.Logger) pipeline.Step[*core.IndexingContext]
Entities creates a new entity extraction step.
Parameters:
- extractor: entity extractor implementation
- logger: structured logger (defaults to NoopLogger when nil)
Example:
p.AddStep(indexing.Entities(extractor, logger))
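The nil-logger default described above can be sketched as follows. This is an illustrative pattern only: `Logger`, `noopLogger`, `memLogger`, `entityStep`, and `newEntityStep` are hypothetical names, and the real `logging.Logger` interface is likely richer.

```go
package main

import "fmt"

// Logger is a hypothetical minimal logging interface.
type Logger interface {
	Info(msg string, kv ...any)
}

// noopLogger discards everything; it is the fallback for a nil logger.
type noopLogger struct{}

func (noopLogger) Info(string, ...any) {}

// memLogger keeps messages in memory so the demo can inspect them.
type memLogger struct {
	lines []string
}

func (l *memLogger) Info(msg string, _ ...any) {
	l.lines = append(l.lines, msg)
}

// entityStep is a stand-in for a step that logs while it works.
type entityStep struct {
	logger Logger
}

// newEntityStep substitutes a no-op logger so later calls never hit a nil interface.
func newEntityStep(logger Logger) *entityStep {
	if logger == nil {
		logger = noopLogger{}
	}
	return &entityStep{logger: logger}
}

// run logs once and returns the number of chunks it was asked to process.
func (s *entityStep) run(chunks []string) int {
	s.logger.Info("extracting entities", "chunks", len(chunks))
	return len(chunks)
}

func main() {
	// Passing nil is safe because the constructor installs noopLogger.
	fmt.Println(newEntityStep(nil).run([]string{"a", "b"}))

	ml := &memLogger{}
	newEntityStep(ml).run([]string{"a"})
	fmt.Println(ml.lines)
}
```

Resolving the default once in the constructor keeps the step body free of repeated nil checks.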
func ExtractTriples ¶
func ExtractTriples(extractor core.TriplesExtractor, graphStore core.GraphStore, logger logging.Logger) pipeline.Step[*core.IndexingContext]
ExtractTriples creates a new step for automated knowledge graph construction. It extracts triples (Subject-Predicate-Object) from chunks and upserts them into the GraphStore. Following Microsoft GraphRAG design: nodes and edges are bound to source chunks.
func GenerateSummaries ¶ added in v1.1.6
func GenerateSummaries(llm chat.Client, graphStore core.GraphStore, logger logging.Logger) pipeline.Step[*core.IndexingContext]
GenerateSummaries creates a step that generates summaries for detected communities. Following Microsoft GraphRAG design, this enables global search by providing high-level descriptions of community content.
func Multi ¶
func Multi(parsers ...core.Parser) pipeline.Step[*core.IndexingContext]
Multi creates a new multi-parser step from the given parsers.
Deprecated: Use MultiFactory to prevent concurrency and state-sharing bugs.
func MultiFactory ¶ added in v1.1.3
func MultiFactory(registry *types.ParserRegistry) pipeline.Step[*core.IndexingContext]
MultiFactory creates a new multi-parser step that dynamically spawns parsers.
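The difference between sharing parser instances and creating them on demand can be sketched with a small registry of constructor functions. This is a guess at the general pattern, not the behavior of `types.ParserRegistry`: `Parser`, `lineParser`, `Registry`, and `parseAll` are hypothetical, and keying factories by file extension is an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
	"sync"
)

// Parser is a hypothetical stand-in for core.Parser.
type Parser interface {
	Parse(content string) []string
}

// lineParser keeps per-document state in buf, so a single instance
// must not be shared between goroutines.
type lineParser struct {
	buf []string
}

func (p *lineParser) Parse(content string) []string {
	p.buf = p.buf[:0]
	for _, line := range strings.Split(content, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			p.buf = append(p.buf, line)
		}
	}
	return p.buf
}

// Registry maps a file extension to a constructor, not to an instance.
type Registry struct {
	factories map[string]func() Parser
}

func NewRegistry() *Registry {
	return &Registry{factories: make(map[string]func() Parser)}
}

func (r *Registry) Register(ext string, f func() Parser) {
	r.factories[strings.ToLower(ext)] = f
}

// New builds a fresh parser for the given path.
func (r *Registry) New(path string) (Parser, error) {
	ext := strings.ToLower(filepath.Ext(path))
	f, ok := r.factories[ext]
	if !ok {
		return nil, fmt.Errorf("no parser registered for %q", ext)
	}
	return f(), nil
}

// parseAll parses documents concurrently; each goroutine owns its parser.
func parseAll(r *Registry, docs map[string]string) (map[string][]string, error) {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
	)
	out := make(map[string][]string, len(docs))
	for path, content := range docs {
		wg.Add(1)
		go func(path, content string) {
			defer wg.Done()
			p, err := r.New(path)
			var lines []string
			if err == nil {
				lines = p.Parse(content)
			}
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			out[path] = lines
		}(path, content)
	}
	wg.Wait()
	return out, firstErr
}

func main() {
	r := NewRegistry()
	r.Register(".txt", func() Parser { return &lineParser{} })
	out, err := parseAll(r, map[string]string{"a.txt": "x\n\ny", "b.txt": "z"})
	fmt.Println(len(out["a.txt"]), len(out["b.txt"]), err)
}
```

Had `parseAll` reused one `lineParser` for every document, the goroutines would race on `buf`; constructing a parser per document avoids that class of bug without extra locking.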
func MultiStore ¶
func MultiStore(vectorStore core.VectorStore, docStore core.DocStore, graphStore core.GraphStore, logger logging.Logger, metrics core.Metrics) pipeline.Step[*core.IndexingContext]
MultiStore creates a step to store chunks, vectors, and entities to multiple backends.
func MultimodalEmbed ¶
func MultimodalEmbed(provider embedding.MultimodalProvider, metrics core.Metrics) pipeline.Step[*core.IndexingContext]
MultimodalEmbed creates a step for multimodal vector generation.
Types ¶
This section is empty.