scan

package

v0.1.0 Latest Latest Go to latest Published: Jun 12, 2026 License: Apache-2.0 Imports: 9 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/karthedew/cosma

Documentation ¶

Index ¶

func LazyScanCSV(path string, opts ...CSVOption) *dataframe.LazyFrame
func LazyScanParquet(path string, opts ...ParquetOption) *dataframe.LazyFrame
type CSVOption
type CSVOptions
- func DefaultCSVOptions() CSVOptions
type ParquetOption
type ParquetOptions
- func DefaultParquetOptions() ParquetOptions
type RecordReader

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func LazyScanCSV ¶

func LazyScanCSV(path string, opts ...CSVOption) *dataframe.LazyFrame

LazyScanCSV builds a LazyFrame whose source is a streaming CSV file scan. Unlike df.Lazy(), which starts from an in-memory DataFrame, the returned plan reads the file batch-by-batch when collected: LazyScanCSV(path).Filter(...). CollectStream(ctx) never materializes the whole file. Calling Collect (eager) instead drains the file into a DataFrame. Options mirror scan.ScanCSV.

func LazyScanParquet ¶

func LazyScanParquet(path string, opts ...ParquetOption) *dataframe.LazyFrame

LazyScanParquet builds a LazyFrame whose source is a streaming Parquet file scan. See LazyScanCSV for the streaming-vs-eager semantics. Options mirror scan.ScanParquet.

Types ¶

type CSVOption ¶

type CSVOption func(*CSVOptions)

func WithCSVAllocator ¶

func WithCSVAllocator(alloc memory.Allocator) CSVOption

func WithCSVChunkSize ¶

func WithCSVChunkSize(value int) CSVOption

func WithCSVColumnTypes ¶

func WithCSVColumnTypes(types map[string]arrow.DataType) CSVOption

func WithCSVComma ¶

func WithCSVComma(comma rune) CSVOption

func WithCSVComment ¶

func WithCSVComment(comment rune) CSVOption

func WithCSVHeader ¶

func WithCSVHeader(value bool) CSVOption

func WithCSVIncludeColumns ¶

func WithCSVIncludeColumns(columns []string) CSVOption

func WithCSVLazyQuotes ¶

func WithCSVLazyQuotes(value bool) CSVOption

func WithCSVNullValues ¶

func WithCSVNullValues(values []string) CSVOption

func WithCSVStringsReplacer ¶

func WithCSVStringsReplacer(replacer *strings.Replacer) CSVOption

type CSVOptions ¶

type CSVOptions struct {
	HasHeader       bool
	ChunkSize       int
	NullValues      []string
	ColumnTypes     map[string]arrow.DataType
	IncludeColumns  []string
	Comma           rune
	Comment         rune
	LazyQuotes      bool
	StringsReplacer *strings.Replacer
	Allocator       memory.Allocator
}

func DefaultCSVOptions ¶

func DefaultCSVOptions() CSVOptions

type ParquetOption ¶

type ParquetOption func(*ParquetOptions)

func WithParquetAllocator ¶

func WithParquetAllocator(alloc memory.Allocator) ParquetOption

func WithParquetBatchSize ¶

func WithParquetBatchSize(size int64) ParquetOption

func WithParquetContext ¶

func WithParquetContext(ctx context.Context) ParquetOption

func WithParquetParallel ¶

func WithParquetParallel(value bool) ParquetOption

func WithParquetReadOptions ¶

func WithParquetReadOptions(opts ...file.ReadOption) ParquetOption

type ParquetOptions ¶

type ParquetOptions struct {
	Allocator      memory.Allocator
	ReadOptions    []file.ReadOption
	ArrowReadProps *pqarrow.ArrowReadProperties
	Context        context.Context
}

func DefaultParquetOptions ¶

func DefaultParquetOptions() ParquetOptions

type RecordReader ¶

type RecordReader interface {
	// Schema returns the schema of every batch produced by this reader.
	Schema() *arrow.Schema
	// Next advances to the next batch, returning false at end-of-stream or on
	// error (inspect Err after a false return).
	Next() bool
	// Record returns the current batch. Valid only between Next()==true and the
	// next Next() call.
	Record() arrow.Record
	// Err returns the first non-EOF error encountered, or nil if the stream
	// ended cleanly.
	Err() error
	// Release frees the reader and any source it owns.
	Release()
}

RecordReader is Cosma's public streaming contract: a forward-only iterator over Arrow record batches. It is the unit of work for execution that does not fit in memory (ADR 0002). Both file scans (scan.ScanCSV / scan.ScanParquet) and lazy plan outputs (LazyFrame.CollectStream) return a RecordReader, so a file on disk and the result of a streamed plan are interchangeable at the type level.

Ownership follows Arrow conventions. Record() is valid only between a Next()==true and the following Next() call; the reader owns the batch and may release it on the next advance. A caller that needs to keep a batch past the next Next() must Retain it. Release frees the reader and any source it owns (e.g. a file handle).

func ScanCSV ¶

func ScanCSV(ctx context.Context, path string, opts ...CSVOption) (RecordReader, error)

ScanCSV opens a CSV file and returns a streaming RecordReader over its batches. It is a thin adapter over the shared ingestion seam. The returned reader honors ctx cancellation between batches and owns the file handle, closing it on Release.

func ScanParquet ¶

func ScanParquet(ctx context.Context, path string, opts ...ParquetOption) (RecordReader, error)

ScanParquet opens a Parquet file and returns a streaming RecordReader over its batches. It is a thin adapter over the shared ingestion seam. The returned reader honors ctx cancellation between batches and owns the underlying file, closing it on Release.

func WithContext ¶

func WithContext(ctx context.Context, inner RecordReader) RecordReader

WithContext returns a RecordReader that honors ctx cancellation between batches. If inner is already context-bound or ctx is nil, inner is returned unchanged.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL