scan

package
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 12, 2026 License: Apache-2.0 Imports: 9 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func LazyScanCSV

func LazyScanCSV(path string, opts ...CSVOption) *dataframe.LazyFrame

LazyScanCSV builds a LazyFrame whose source is a streaming CSV file scan. Unlike df.Lazy(), which starts from an in-memory DataFrame, the returned plan reads the file batch-by-batch when collected: LazyScanCSV(path).Filter(...). CollectStream(ctx) never materializes the whole file. Calling Collect (eager) instead drains the file into a DataFrame. Options mirror scan.ScanCSV.

func LazyScanParquet

func LazyScanParquet(path string, opts ...ParquetOption) *dataframe.LazyFrame

LazyScanParquet builds a LazyFrame whose source is a streaming Parquet file scan. See LazyScanCSV for the streaming-vs-eager semantics. Options mirror scan.ScanParquet.

Types

type CSVOption

type CSVOption func(*CSVOptions)

func WithCSVAllocator

func WithCSVAllocator(alloc memory.Allocator) CSVOption

func WithCSVChunkSize

func WithCSVChunkSize(value int) CSVOption

func WithCSVColumnTypes

func WithCSVColumnTypes(types map[string]arrow.DataType) CSVOption

func WithCSVComma

func WithCSVComma(comma rune) CSVOption

func WithCSVComment

func WithCSVComment(comment rune) CSVOption

func WithCSVHeader

func WithCSVHeader(value bool) CSVOption

func WithCSVIncludeColumns

func WithCSVIncludeColumns(columns []string) CSVOption

func WithCSVLazyQuotes

func WithCSVLazyQuotes(value bool) CSVOption

func WithCSVNullValues

func WithCSVNullValues(values []string) CSVOption

func WithCSVStringsReplacer

func WithCSVStringsReplacer(replacer *strings.Replacer) CSVOption

type CSVOptions

type CSVOptions struct {
	HasHeader       bool
	ChunkSize       int
	NullValues      []string
	ColumnTypes     map[string]arrow.DataType
	IncludeColumns  []string
	Comma           rune
	Comment         rune
	LazyQuotes      bool
	StringsReplacer *strings.Replacer
	Allocator       memory.Allocator
}

func DefaultCSVOptions

func DefaultCSVOptions() CSVOptions

type ParquetOption

type ParquetOption func(*ParquetOptions)

func WithParquetAllocator

func WithParquetAllocator(alloc memory.Allocator) ParquetOption

func WithParquetBatchSize

func WithParquetBatchSize(size int64) ParquetOption

func WithParquetContext

func WithParquetContext(ctx context.Context) ParquetOption

func WithParquetParallel

func WithParquetParallel(value bool) ParquetOption

func WithParquetReadOptions

func WithParquetReadOptions(opts ...file.ReadOption) ParquetOption

type ParquetOptions

type ParquetOptions struct {
	Allocator      memory.Allocator
	ReadOptions    []file.ReadOption
	ArrowReadProps *pqarrow.ArrowReadProperties
	Context        context.Context
}

func DefaultParquetOptions

func DefaultParquetOptions() ParquetOptions

type RecordReader

type RecordReader interface {
	// Schema returns the schema of every batch produced by this reader.
	Schema() *arrow.Schema
	// Next advances to the next batch, returning false at end-of-stream or on
	// error (inspect Err after a false return).
	Next() bool
	// Record returns the current batch. Valid only between Next()==true and the
	// next Next() call.
	Record() arrow.Record
	// Err returns the first non-EOF error encountered, or nil if the stream
	// ended cleanly.
	Err() error
	// Release frees the reader and any source it owns.
	Release()
}

RecordReader is Cosma's public streaming contract: a forward-only iterator over Arrow record batches. It is the unit of work for execution that does not fit in memory (ADR 0002). Both file scans (scan.ScanCSV / scan.ScanParquet) and lazy plan outputs (LazyFrame.CollectStream) return a RecordReader, so a file on disk and the result of a streamed plan are interchangeable at the type level.

Ownership follows Arrow conventions. Record() is valid only between a Next()==true and the following Next() call; the reader owns the batch and may release it on the next advance. A caller that needs to keep a batch past the next Next() must Retain it. Release frees the reader and any source it owns (e.g. a file handle).

func ScanCSV

func ScanCSV(ctx context.Context, path string, opts ...CSVOption) (RecordReader, error)

ScanCSV opens a CSV file and returns a streaming RecordReader over its batches. It is a thin adapter over the shared ingestion seam. The returned reader honors ctx cancellation between batches and owns the file handle, closing it on Release.

func ScanParquet

func ScanParquet(ctx context.Context, path string, opts ...ParquetOption) (RecordReader, error)

ScanParquet opens a Parquet file and returns a streaming RecordReader over its batches. It is a thin adapter over the shared ingestion seam. The returned reader honors ctx cancellation between batches and owns the underlying file, closing it on Release.

func WithContext

func WithContext(ctx context.Context, inner RecordReader) RecordReader

WithContext returns a RecordReader that honors ctx cancellation between batches. If inner is already context-bound or ctx is nil, inner is returned unchanged.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL