Documentation
Overview ¶
Package indexing provides document indexing pipeline steps for RAG data preparation.
This package contains reusable steps for building indexing pipelines:
- Discover: file discovery and validation
- Multi: multi-parser streaming document parsing
- Chunk: semantic chunking of documents
- Batch: batch embedding generation
- MultiStore: storing chunks, vectors, and entities to multiple backends
- Entities: entity extraction for graph indexing
Example usage:
	p := pipeline.New[*core.IndexingContext]()
	p.AddSteps(
		indexing.Discover(),
		indexing.Multi(parsers...),
		indexing.Chunk(chunker),
		indexing.Batch(embedder, metrics),
		indexing.MultiStore(vectorStore, docStore, graphStore, logger, metrics),
	)
	err := p.Execute(ctx, &core.IndexingContext{FilePath: "document.pdf"})
Index ¶
- func Batch(embedder embedding.Provider, metrics core.Metrics) pipeline.Step[*core.IndexingContext]
- func Chunk(chunker core.SemanticChunker) pipeline.Step[*core.IndexingContext]
- func Discover() pipeline.Step[*core.IndexingContext]
- func Entities(extractor core.EntityExtractor, logger logging.Logger) pipeline.Step[*core.IndexingContext]
- func ExtractTriples(extractor *base.TriplesExtractor, graphStore store.GraphStore, ...) pipeline.Step[*core.IndexingContext]
- func Multi(parsers ...core.Parser) pipeline.Step[*core.IndexingContext]
- func MultiStore(vectorStore core.VectorStore, docStore store.DocStore, ...) pipeline.Step[*core.IndexingContext]
- func MultimodalEmbed(provider embedding.MultimodalProvider, metrics core.Metrics) pipeline.Step[*core.IndexingContext]
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Batch ¶
func Batch(embedder embedding.Provider, metrics core.Metrics) pipeline.Step[*core.IndexingContext]
Batch creates a new batch embedding step with metrics collection.
Parameters:
- embedder: embedding provider implementation
- metrics: metrics collector (optional, can be nil)
Example:
p.AddStep(indexing.Batch(embedder, metrics))
func Chunk ¶
func Chunk(chunker core.SemanticChunker) pipeline.Step[*core.IndexingContext]
Chunk creates a semantic chunking step.
Parameters:
- chunker: semantic chunker implementation
Example:
p.AddStep(indexing.Chunk(chunker))
func Discover ¶
func Discover() pipeline.Step[*core.IndexingContext]
Discover creates a new file discovery step.
Example:
p.AddStep(indexing.Discover())
func Entities ¶
func Entities(extractor core.EntityExtractor, logger logging.Logger) pipeline.Step[*core.IndexingContext]
Entities creates a new entity extraction step.
Parameters:
- extractor: entity extractor implementation
- logger: structured logger (auto-defaults to NoopLogger if nil)
Example:
p.AddStep(indexing.Entities(extractor, logger))
func ExtractTriples ¶
func ExtractTriples(extractor *base.TriplesExtractor, graphStore store.GraphStore, logger logging.Logger) pipeline.Step[*core.IndexingContext]
ExtractTriples creates a new step for automated knowledge graph construction. It extracts triples (Subject-Predicate-Object) from chunks and upserts them into the GraphStore.
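Example (illustrative; assumes extractor, graphStore, and logger are already constructed):
	p.AddStep(indexing.ExtractTriples(extractor, graphStore, logger))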
func Multi ¶
func Multi(parsers ...core.Parser) pipeline.Step[*core.IndexingContext]
Multi creates a new multi-parser step supporting multiple parsers.
Parameters:
- parsers: variadic list of parsers to use
Example:
p.AddStep(indexing.Multi(parser1, parser2, parser3))
func MultiStore ¶
func MultiStore(vectorStore core.VectorStore, docStore store.DocStore, graphStore store.GraphStore, logger logging.Logger, metrics core.Metrics) pipeline.Step[*core.IndexingContext]
MultiStore creates a step to store chunks, vectors, and entities to multiple backends.
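Example (illustrative; assumes the stores, logger, and metrics are already constructed):
	p.AddStep(indexing.MultiStore(vectorStore, docStore, graphStore, logger, metrics))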
func MultimodalEmbed ¶
func MultimodalEmbed(provider embedding.MultimodalProvider, metrics core.Metrics) pipeline.Step[*core.IndexingContext]
MultimodalEmbed creates a step for multimodal vector generation.
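Example (illustrative; assumes provider and metrics are already constructed):
	p.AddStep(indexing.MultimodalEmbed(provider, metrics))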
Types ¶
This section is empty.