Documentation ¶
Index ¶
- func LazyScanCSV(path string, opts ...CSVOption) *dataframe.LazyFrame
- func LazyScanParquet(path string, opts ...ParquetOption) *dataframe.LazyFrame
- type CSVOption
- func WithCSVAllocator(alloc memory.Allocator) CSVOption
- func WithCSVChunkSize(value int) CSVOption
- func WithCSVColumnTypes(types map[string]arrow.DataType) CSVOption
- func WithCSVComma(comma rune) CSVOption
- func WithCSVComment(comment rune) CSVOption
- func WithCSVHeader(value bool) CSVOption
- func WithCSVIncludeColumns(columns []string) CSVOption
- func WithCSVLazyQuotes(value bool) CSVOption
- func WithCSVNullValues(values []string) CSVOption
- func WithCSVStringsReplacer(replacer *strings.Replacer) CSVOption
- type CSVOptions
- type ParquetOption
- type ParquetOptions
- type RecordReader
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func LazyScanCSV ¶
func LazyScanCSV(path string, opts ...CSVOption) *dataframe.LazyFrame
LazyScanCSV builds a LazyFrame whose source is a streaming CSV file scan. Unlike df.Lazy(), which starts from an in-memory DataFrame, the returned plan reads the file batch by batch when collected: LazyScanCSV(path).Filter(...).CollectStream(ctx) never materializes the whole file. Calling the eager Collect instead drains the file into a DataFrame. Options mirror scan.ScanCSV.
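To illustrate the streaming-versus-eager distinction described above, here is a minimal sketch that uses only the Go standard library. It is not this package's implementation: scanBatches and countMatching are hypothetical helpers, and plain [][]string batches stand in for Arrow record batches. The point is only that a filter applied batch by batch never needs the whole file in memory.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// scanBatches is a hypothetical stand-in for a streaming CSV scan: it reads
// at most chunkSize data rows at a time and hands each batch to visit, so
// only one batch is held in memory at any moment.
func scanBatches(r io.Reader, chunkSize int, visit func(batch [][]string) error) error {
	cr := csv.NewReader(r)
	if _, err := cr.Read(); err != nil { // skip the header row
		return err
	}
	batch := make([][]string, 0, chunkSize)
	for {
		row, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		batch = append(batch, row)
		if len(batch) == chunkSize {
			if err := visit(batch); err != nil {
				return err
			}
			batch = batch[:0] // reuse the buffer; visit must not retain it
		}
	}
	if len(batch) > 0 {
		return visit(batch) // flush the final, possibly short, batch
	}
	return nil
}

// countMatching streams the input and counts rows whose second column equals
// want, without ever holding more than chunkSize rows.
func countMatching(input string, chunkSize int, want string) (int, error) {
	n := 0
	err := scanBatches(strings.NewReader(input), chunkSize, func(batch [][]string) error {
		for _, row := range batch {
			if row[1] == want {
				n++
			}
		}
		return nil
	})
	return n, err
}

func main() {
	const data = "id,city\n1,Oslo\n2,Rome\n3,Oslo\n4,Lima\n5,Oslo\n"
	n, err := countMatching(data, 2, "Oslo")
	if err != nil {
		panic(err)
	}
	fmt.Println("matching rows:", n)
}
```

An eager equivalent would first append every row to one slice and filter afterwards, which is the behavior the text attributes to Collect.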
func LazyScanParquet ¶
func LazyScanParquet(path string, opts ...ParquetOption) *dataframe.LazyFrame
LazyScanParquet builds a LazyFrame whose source is a streaming Parquet file scan. See LazyScanCSV for the streaming-vs-eager semantics. Options mirror scan.ScanParquet.
Types ¶
type CSVOption ¶
type CSVOption func(*CSVOptions)
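func WithCSVAllocator ¶
func WithCSVAllocator(alloc memory.Allocator) CSVOption
func WithCSVChunkSize ¶
func WithCSVChunkSize(value int) CSVOption
func WithCSVColumnTypes ¶
func WithCSVColumnTypes(types map[string]arrow.DataType) CSVOption
func WithCSVComma ¶
func WithCSVComma(comma rune) CSVOption
func WithCSVComment ¶
func WithCSVComment(comment rune) CSVOption
func WithCSVHeader ¶
func WithCSVHeader(value bool) CSVOption
func WithCSVIncludeColumns ¶
func WithCSVIncludeColumns(columns []string) CSVOption
func WithCSVLazyQuotes ¶
func WithCSVLazyQuotes(value bool) CSVOption
func WithCSVNullValues ¶
func WithCSVNullValues(values []string) CSVOption
func WithCSVStringsReplacer ¶
func WithCSVStringsReplacer(replacer *strings.Replacer) CSVOption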
type CSVOptions ¶
type CSVOptions struct {
	HasHeader       bool
	ChunkSize       int
	NullValues      []string
	ColumnTypes     map[string]arrow.DataType
	IncludeColumns  []string
	Comma           rune
	Comment         rune
	LazyQuotes      bool
	StringsReplacer *strings.Replacer
	Allocator       memory.Allocator
}
func DefaultCSVOptions ¶
func DefaultCSVOptions() CSVOptions
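CSVOption has the shape of Go's functional-options pattern: each option is a function that mutates a CSVOptions value. The sketch below shows how such options are typically resolved, assuming the scan starts from the defaults and applies the options in order. All names are lowercase local stand-ins, the struct is trimmed to three fields, and the default values are invented for illustration; the real defaults are not stated in this documentation.

```go
package main

import "fmt"

// csvOptions is a trimmed, hypothetical stand-in for CSVOptions.
type csvOptions struct {
	HasHeader bool
	ChunkSize int
	Comma     rune
}

// csvOption mirrors the shape of CSVOption: a function that mutates options.
type csvOption func(*csvOptions)

// defaultCSVOptions returns assumed defaults chosen only for this example.
func defaultCSVOptions() csvOptions {
	return csvOptions{HasHeader: true, ChunkSize: 1024, Comma: ','}
}

// withChunkSize and withComma are local analogues of WithCSVChunkSize and
// WithCSVComma.
func withChunkSize(n int) csvOption { return func(o *csvOptions) { o.ChunkSize = n } }
func withComma(c rune) csvOption    { return func(o *csvOptions) { o.Comma = c } }

// resolve starts from the defaults and applies each option in order, so a
// later option overrides an earlier one that touches the same field.
func resolve(opts ...csvOption) csvOptions {
	o := defaultCSVOptions()
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := resolve(withComma(';'), withChunkSize(64))
	fmt.Printf("header=%t chunk=%d comma=%q\n", o.HasHeader, o.ChunkSize, o.Comma)
}
```

Fields that no option touches keep their default value, which is why callers only need to pass the options they care about.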
type ParquetOption ¶
type ParquetOption func(*ParquetOptions)
func WithParquetAllocator ¶
func WithParquetAllocator(alloc memory.Allocator) ParquetOption
func WithParquetBatchSize ¶
func WithParquetBatchSize(size int64) ParquetOption
func WithParquetContext ¶
func WithParquetContext(ctx context.Context) ParquetOption
func WithParquetParallel ¶
func WithParquetParallel(value bool) ParquetOption
func WithParquetReadOptions ¶
func WithParquetReadOptions(opts ...file.ReadOption) ParquetOption
type ParquetOptions ¶
type ParquetOptions struct {
	Allocator      memory.Allocator
	ReadOptions    []file.ReadOption
	ArrowReadProps *pqarrow.ArrowReadProperties
	Context        context.Context
}
func DefaultParquetOptions ¶
func DefaultParquetOptions() ParquetOptions
type RecordReader ¶
type RecordReader interface {
	// Schema returns the schema of every batch produced by this reader.
	Schema() *arrow.Schema
	// Next advances to the next batch, returning false at end-of-stream or on
	// error (inspect Err after a false return).
	Next() bool
	// Record returns the current batch. Valid only between Next()==true and the
	// next Next() call.
	Record() arrow.Record
	// Err returns the first non-EOF error encountered, or nil if the stream
	// ended cleanly.
	Err() error
	// Release frees the reader and any source it owns.
	Release()
}
RecordReader is Cosma's public streaming contract: a forward-only iterator over Arrow record batches. It is the unit of work for execution that does not fit in memory (ADR 0002). Both file scans (scan.ScanCSV / scan.ScanParquet) and lazy plan outputs (LazyFrame.CollectStream) return a RecordReader, so a file on disk and the result of a streamed plan are interchangeable at the type level.
Ownership follows Arrow conventions. Record() is valid only between a Next()==true and the following Next() call; the reader owns the batch and may release it on the next advance. A caller that needs to keep a batch past the next Next() must Retain it. Release frees the reader and any source it owns (e.g. a file handle).
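The contract above implies a fixed consumption pattern: loop on Next, use Record only inside the loop body, check Err once the loop ends, and Release the reader when done. The sketch below shows that pattern against a simplified local interface. The reader, batch, and sliceReader types are hypothetical stand-ins (the real interface also has Schema and returns arrow.Record), so this illustrates the calling convention rather than the library itself.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for an Arrow record batch.
type batch struct{ rows []int }

// reader is a simplified local analogue of RecordReader.
type reader interface {
	Next() bool
	Record() *batch
	Err() error
	Release()
}

// sliceReader serves batches from memory and records whether it was released.
type sliceReader struct {
	batches  []*batch
	pos      int
	cur      *batch
	released bool
}

func (r *sliceReader) Next() bool {
	if r.pos >= len(r.batches) {
		r.cur = nil // the previous batch is no longer valid
		return false
	}
	r.cur = r.batches[r.pos]
	r.pos++
	return true
}

func (r *sliceReader) Record() *batch { return r.cur }
func (r *sliceReader) Err() error     { return nil }
func (r *sliceReader) Release()       { r.released = true; r.cur = nil }

// sumAll drains the reader following the contract: Record is used only
// within the iteration that produced it, Err is checked after the loop, and
// Release runs on every return path.
func sumAll(r reader) (int, error) {
	defer r.Release()
	total := 0
	for r.Next() {
		rec := r.Record() // valid only until the next call to Next
		for _, v := range rec.rows {
			total += v
		}
	}
	if err := r.Err(); err != nil {
		return 0, err
	}
	return total, nil
}

func main() {
	r := &sliceReader{batches: []*batch{{rows: []int{1, 2}}, {rows: []int{3, 4, 5}}}}
	total, err := sumAll(r)
	fmt.Println(total, err, r.released)
}
```

With real Arrow batches, code that needs a batch beyond its iteration would call Retain on it inside the loop and Release it later, as the ownership paragraph describes.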
func ScanCSV ¶
ScanCSV opens a CSV file and returns a streaming RecordReader over its batches. It is a thin adapter over the shared ingestion seam. The returned reader honors ctx cancellation between batches and owns the file handle, closing it on Release.
func ScanParquet ¶
func ScanParquet(ctx context.Context, path string, opts ...ParquetOption) (RecordReader, error)
ScanParquet opens a Parquet file and returns a streaming RecordReader over its batches. It is a thin adapter over the shared ingestion seam. The returned reader honors ctx cancellation between batches and owns the underlying file, closing it on Release.
func WithContext ¶
func WithContext(ctx context.Context, inner RecordReader) RecordReader
WithContext returns a RecordReader that honors ctx cancellation between batches. If inner is already context-bound or ctx is nil, inner is returned unchanged.
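One way such a wrapper could work is sketched below, assuming cancellation is checked at the start of each Next call and surfaced through Err. This is an illustration, not the package's implementation: reader, counter, ctxReader, withContext, and countUntilCancel are hypothetical local names, and the interface is trimmed to Next and Err.

```go
package main

import (
	"context"
	"fmt"
)

// reader is a simplified local analogue of RecordReader.
type reader interface {
	Next() bool
	Err() error
}

// counter yields a fixed number of empty "batches".
type counter struct{ left int }

func (c *counter) Next() bool {
	if c.left == 0 {
		return false
	}
	c.left--
	return true
}

func (c *counter) Err() error { return nil }

// ctxReader stops the stream once its context is done.
type ctxReader struct {
	ctx   context.Context
	inner reader
	err   error
}

// withContext wraps inner unless ctx is nil or inner is already wrapped,
// mirroring the pass-through rule described above.
func withContext(ctx context.Context, inner reader) reader {
	if ctx == nil {
		return inner
	}
	if _, ok := inner.(*ctxReader); ok {
		return inner
	}
	return &ctxReader{ctx: ctx, inner: inner}
}

func (r *ctxReader) Next() bool {
	if r.err != nil {
		return false
	}
	if err := r.ctx.Err(); err != nil { // checked between batches
		r.err = err
		return false
	}
	return r.inner.Next()
}

func (r *ctxReader) Err() error {
	if r.err != nil {
		return r.err
	}
	return r.inner.Err()
}

// countUntilCancel reads up to total batches and cancels the context after
// cancelAfter of them; it returns how many batches were seen and the final Err.
func countUntilCancel(total, cancelAfter int) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	r := withContext(ctx, &counter{left: total})
	n := 0
	for r.Next() {
		n++
		if n == cancelAfter {
			cancel()
		}
	}
	return n, r.Err()
}

func main() {
	n, err := countUntilCancel(5, 2)
	fmt.Println(n, err)
}
```

Because the check happens only between batches, a batch that is already being produced is delivered in full; cancellation takes effect on the following Next.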