Documentation

Types
type Document
type Document struct {
// Key is the simple path-based identity (e.g., "sub/a.txt")
Key Key
// Type specifies the content type of the document
// (e.g., "application/octet-stream", "application/json")
Type string
// Reader provides streaming access to document content
Reader io.Reader
// Metadata contains additional information about the document
Metadata Metadata
}
Document represents a single input document with metadata. Documents flow through the pipeline from Source → Processor → Sink.
func NewDocument
NewDocument creates a new document with the given key and reader. The content type defaults to application/octet-stream.
func (*Document) EnsureExtension added in v0.2.0
func (d *Document) EnsureExtension()
EnsureExtension updates the document key to have the correct file extension based on the document's content type. If the extension already matches, the key is unchanged.
func (*Document) WithMetadata
WithMetadata adds metadata to the document and returns it for chaining.
type Key added in v0.1.14
type Key string
Key is a simple filesystem-style path used as document identity. Keys are relative paths using forward slashes as separators.
Examples:
"a.txt" - file in root "sub/a.txt" - file in subdirectory "summary/a.txt" - file with emit prefix
type Metadata added in v0.1.14
type Metadata struct {
ContentType string // MIME type (e.g., "text/plain", "application/json")
Extension string // File extension for output (e.g., ".txt", ".md")
Size int64 // Document size in bytes
Custom map[string]string // User-defined attributes
}
Metadata holds document attributes (NOT part of key identity).
type Processor
type Processor interface {
// Process takes input documents and produces zero or more output documents.
// - Single doc processing: len(docs)==1 (normal case)
// - Array processing: len(docs)>1 (array mode after collection)
// - EOF signal: docs[0].Type == ContentEOF (end of stream)
// Return empty slice to filter out documents.
// Return error for processing failures.
Process(ctx context.Context, docs []*Document) ([]*Document, error)
// Close releases any resources held by the processor.
Close() error
}
Processor transforms documents in a pipeline. Processors share a uniform, composable shape: each accepts a slice of documents and returns a slice of documents.
A processor can:
- Transform documents (map)
- Split documents (flatMap - one doc becomes many)
- Filter documents (return empty slice)
- Collect/aggregate documents (ArrayCollector pattern)
Implementations should:
- Be stateless where possible
- Return errors for processing failures
- Release resources in Close()
type Sink
type Sink interface {
// Write stores a document to the sink's destination.
// Returns error if the write fails.
Write(ctx context.Context, doc *Document) error
// Close finalizes output and releases resources.
// Should flush any buffered writes.
Close() error
}
Sink consumes processed documents. Sinks are the exit point of a processing pipeline.
Implementations should:
- Write documents to their destination
- Handle errors gracefully
- Buffer writes if beneficial for performance
- Flush buffers and release resources in Close()
type Source
type Source interface {
// Next returns the next document or io.EOF when complete.
// Other errors indicate retrieval failures.
Next(ctx context.Context) (*Document, error)
// Close releases any resources held by the source.
Close() error
}
Source produces documents from an input source. Sources are the entry point of a processing pipeline.
Implementations should:
- Return documents one at a time via Next()
- Return io.EOF when no more documents are available
- Be safe for use from a single goroutine (concurrent use is not required)
- Release resources in Close()