package read

v0.1.1 Latest
Published: May 26, 2026 | License: Apache-2.0 | Imports: 24 | Imported by: 0

Documentation

Overview

Package read provides the batch read pipeline (ReadBuilder, TableScan, Plan, and TableRead) and its streaming counterpart (StreamReadBuilder, TableStream, and TableStreamReader).

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type DataSplit

type DataSplit struct {
	Partition     *manifest.ManifestEntry // representative entry for partition info
	Bucket        int
	Files         []manifest.DataFileMeta
	NeedsMerge    bool // true for primary-key tables that require sort-merge deduplication
	DeletionFiles map[string]manifest.DeletionVectorMeta
}

DataSplit represents a unit of work: a set of data files from one bucket/partition that should be read together.

DeletionFiles maps each data file name to its DeletionVectorMeta, for the files in this split that have an associated deletion vector (DV). A nil map means no DVs are present. DV resolution is performed during planning and only for batch reads; stream delta reads always produce splits with DeletionFiles == nil.
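
Because a nil DeletionFiles map is a valid value, a caller can look up a file name without checking for nil first: indexing a nil map in Go yields the zero value and false. The sketch below illustrates this with simplified stand-in types. The names dataFileMeta, deletionVectorMeta, dataSplit, filesWithDVs, and all their fields are hypothetical; the real manifest types are not shown in this document.

```go
package main

import "fmt"

// Stand-in types for illustration only. The real definitions live in the
// manifest package and their fields are not documented here.
type dataFileMeta struct{ FileName string }
type deletionVectorMeta struct{ Path string }

// dataSplit mirrors the shape of DataSplit's Files and DeletionFiles fields.
type dataSplit struct {
	Files         []dataFileMeta
	DeletionFiles map[string]deletionVectorMeta
}

// filesWithDVs returns the names of the files in the split that have a
// deletion vector. It works unchanged when DeletionFiles is nil, because a
// lookup in a nil map simply reports ok == false.
func filesWithDVs(s dataSplit) []string {
	var names []string
	for _, f := range s.Files {
		if _, ok := s.DeletionFiles[f.FileName]; ok {
			names = append(names, f.FileName)
		}
	}
	return names
}

func main() {
	batch := dataSplit{
		Files:         []dataFileMeta{{"a.parquet"}, {"b.parquet"}},
		DeletionFiles: map[string]deletionVectorMeta{"b.parquet": {Path: "dv-0"}},
	}
	stream := dataSplit{Files: batch.Files} // DeletionFiles left nil

	fmt.Println(filesWithDVs(batch), filesWithDVs(stream))
}
```

Writing to a nil map panics, so this shortcut applies to lookups only.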

type Plan

type Plan struct {
	Splits     []DataSplit
	SnapshotID int64
}

Plan is the output of TableScan.Plan(): a list of splits ready to be read.

type ReadBuilder

type ReadBuilder struct {
	// contains filtered or unexported fields
}

ReadBuilder is the entry point for building a table scan + read pipeline.

func NewReadBuilder

func NewReadBuilder(tbl *table.FileStoreTable) *ReadBuilder

NewReadBuilder creates a ReadBuilder for the given table.

By default all columns are returned (no projection) and no filter is applied. Call ReadBuilder.WithProjection, ReadBuilder.WithFilter, and ReadBuilder.WithLimit to refine the read before calling ReadBuilder.NewScan and ReadBuilder.NewRead.

A ReadBuilder is not safe for concurrent use. Create one per goroutine or protect it with a mutex.

func (*ReadBuilder) NewPredicateBuilder

func (rb *ReadBuilder) NewPredicateBuilder() *predicate.Builder

NewPredicateBuilder returns a predicate.Builder scoped to the (possibly projected) schema.

func (*ReadBuilder) NewRead

func (rb *ReadBuilder) NewRead() *TableRead

NewRead returns a TableRead configured with this builder's settings.

func (*ReadBuilder) NewScan

func (rb *ReadBuilder) NewScan() *TableScan

NewScan returns a TableScan configured with this builder's settings.

func (*ReadBuilder) WithFilter

func (rb *ReadBuilder) WithFilter(p *predicate.Predicate) *ReadBuilder

WithFilter attaches a filter predicate applied during both stats-based file pruning (planning) and row-level evaluation (reading). Passing nil clears any previously set filter. Use ReadBuilder.NewPredicateBuilder to construct predicates scoped to the effective read schema.

func (*ReadBuilder) WithLimit

func (rb *ReadBuilder) WithLimit(n int64) *ReadBuilder

WithLimit sets a row limit hint. The hint lets the read exit early; it does not guarantee any ordering of the returned rows.

func (*ReadBuilder) WithProjection

func (rb *ReadBuilder) WithProjection(cols []string) *ReadBuilder

WithProjection limits the columns returned to the named subset. Column names must match those in the table schema exactly; unknown names are silently ignored. Passing nil or an empty slice returns all columns.
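
The projection rules above (exact name match, unknown names dropped without error, nil or empty meaning all columns) can be sketched as a small standalone function. resolveProjection is a hypothetical helper written for this illustration, not part of the package, and returning the columns in the requested order is an assumption that the documentation does not confirm.

```go
package main

import "fmt"

// resolveProjection applies the documented projection rules to a list of
// schema column names. Hypothetical helper for illustration only.
func resolveProjection(schema, cols []string) []string {
	if len(cols) == 0 { // nil or empty slice: no projection, all columns
		return schema
	}
	known := make(map[string]bool, len(schema))
	for _, name := range schema {
		known[name] = true
	}
	var out []string
	for _, name := range cols {
		if known[name] { // exact, case-sensitive match; unknown names are dropped
			out = append(out, name)
		}
	}
	return out
}

func main() {
	schema := []string{"id", "name", "ts"}
	fmt.Println(resolveProjection(schema, []string{"ts", "ID", "id"})) // [ts id]
	fmt.Println(resolveProjection(schema, nil))                        // [id name ts]
}
```

Since unknown names are ignored silently, a misspelled column will not produce an error; comparing the resulting schema against the requested names is one way to catch typos.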

type StartingFrom

type StartingFrom int

StartingFrom controls where a stream read begins.

const (
	// StartingFromLatest skips all existing data and only emits rows from
	// snapshots that appear after the stream is started.
	StartingFromLatest StartingFrom = iota

	// StartingFromEarliest replays all existing snapshots before emitting new ones.
	StartingFromEarliest
)
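
Because the constants are declared with iota, StartingFromLatest has the value 0, so the zero value of StartingFrom is the same as the documented default. The sketch below redeclares the type locally to stay self-contained; describe is a hypothetical helper, not a method the package provides.

```go
package main

import "fmt"

// Local redeclaration of the documented type and constants, so the example
// compiles on its own.
type StartingFrom int

const (
	StartingFromLatest StartingFrom = iota // 0
	StartingFromEarliest                   // 1
)

// describe is a hypothetical helper that names a StartingFrom value and
// falls back to the numeric form for values outside the declared set.
func describe(s StartingFrom) string {
	switch s {
	case StartingFromLatest:
		return "latest"
	case StartingFromEarliest:
		return "earliest"
	default:
		return fmt.Sprintf("StartingFrom(%d)", int(s))
	}
}

func main() {
	var zero StartingFrom // zero value equals StartingFromLatest
	fmt.Println(describe(zero), describe(StartingFromEarliest))
}
```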

type StreamBatch

type StreamBatch struct {
	Splits     []DataSplit
	SnapshotID int64
}

StreamBatch is the result of a single TableStream.Next() call.

type StreamReadBuilder

type StreamReadBuilder struct {
	// contains filtered or unexported fields
}

StreamReadBuilder is the entry point for building a continuous stream read.

func NewStreamReadBuilder

func NewStreamReadBuilder(tbl *table.FileStoreTable) *StreamReadBuilder

NewStreamReadBuilder creates a StreamReadBuilder for the given table.

By default reads start from the latest snapshot (StartingFromLatest) with no filter and no projection. Use StreamReadBuilder.WithStartingFrom, StreamReadBuilder.WithFilter, and StreamReadBuilder.WithProjection to configure before calling StreamReadBuilder.NewStream.

func (*StreamReadBuilder) NewStream

func (sb *StreamReadBuilder) NewStream() *TableStream

NewStream returns a TableStream that iterates over new snapshots. Each call to TableStream.Next blocks until a new APPEND snapshot arrives.

func (*StreamReadBuilder) WithFilter

WithFilter attaches a filter predicate applied at the row level to each snapshot batch. Stats-based pruning is also applied during planning. Passing nil clears any previously set filter.

func (*StreamReadBuilder) WithPollInterval

func (sb *StreamReadBuilder) WithPollInterval(d time.Duration) *StreamReadBuilder

WithPollInterval sets how often to check for new snapshots (default: 2s).

func (*StreamReadBuilder) WithProjection

func (sb *StreamReadBuilder) WithProjection(cols []string) *StreamReadBuilder

WithProjection limits the columns returned to the named subset. Unknown names are silently ignored. Passing nil or an empty slice returns all columns.

func (*StreamReadBuilder) WithStartingFrom

func (sb *StreamReadBuilder) WithStartingFrom(s StartingFrom) *StreamReadBuilder

WithStartingFrom sets where the stream begins (default: StartingFromLatest).

type TableRead

type TableRead struct {
	// contains filtered or unexported fields
}

TableRead executes the read of splits and produces Arrow data.

func (*TableRead) ToArrow

func (tr *TableRead) ToArrow(ctx context.Context, splits []DataSplit) (arrow.Table, error)

ToArrow reads all splits and materialises the result as a single arrow.Table held entirely in memory.

For large tables or streaming workloads prefer TableRead.ToArrowReader, which streams record batches one at a time without accumulating all data.

The caller must call Release() on the returned table when done to free memory.

func (*TableRead) ToArrowReader

func (tr *TableRead) ToArrowReader(ctx context.Context, splits []DataSplit) (array.RecordReader, error)

ToArrowReader returns an array.RecordReader that streams record batches from all splits. The caller is responsible for calling Release() on each record and Release() on the reader.

type TableScan

type TableScan struct {
	// contains filtered or unexported fields
}

TableScan resolves a snapshot into a Plan of DataSplits.

func (*TableScan) Plan

func (ts *TableScan) Plan(ctx context.Context) (*Plan, error)

Plan resolves the latest snapshot, reads all manifest files, prunes by stats, and returns a Plan of DataSplits ready for reading.

type TableStream

type TableStream struct {
	// contains filtered or unexported fields
}

TableStream iterates over snapshots as they arrive, emitting one batch of splits per new snapshot. Call Next in a loop; each successful call provides splits that can be read with NewRead().ToArrowReader().

func (*TableStream) LastSnapshotID

func (ts *TableStream) LastSnapshotID() int64

LastSnapshotID returns the ID of the last snapshot consumed, or -1 if none yet.

func (*TableStream) Next

func (ts *TableStream) Next(ctx context.Context) (*StreamBatch, error)

Next blocks until a new snapshot is available, then returns its splits.

Only APPEND commits produce splits. COMPACT, OVERWRITE, and ANALYZE snapshots are silently skipped: lastID is advanced, but no batch is returned to the caller. Next returns a non-nil error (including context.Canceled or context.DeadlineExceeded) when the context is done.

type TableStreamReader

type TableStreamReader struct {
	// contains filtered or unexported fields
}

TableStreamReader is a convenience wrapper that implements array.RecordReader across an unbounded stream of snapshots. It blocks inside Next() waiting for new data, and respects context cancellation.

func NewStreamReader

func NewStreamReader(ctx context.Context, sb *StreamReadBuilder) *TableStreamReader

NewStreamReader creates a TableStreamReader from a StreamReadBuilder. The returned reader implements array.RecordReader and can be used wherever a standard Arrow record reader is expected. It blocks inside TableStreamReader.Next until a new snapshot batch arrives or ctx is cancelled.

func (*TableStreamReader) Err

func (r *TableStreamReader) Err() error

Err returns the first error encountered, or nil if the reader stopped because the context was cancelled. Always check Err after Next returns false.

func (*TableStreamReader) Next

func (r *TableStreamReader) Next() bool

Next advances to the next record batch, blocking until new data arrives. Returns false when the context is cancelled or an error occurs; check TableStreamReader.Err to distinguish the two cases.

func (*TableStreamReader) Record

func (r *TableStreamReader) Record() arrow.RecordBatch

Record is an alias for RecordBatch, present for interface compatibility.

func (*TableStreamReader) RecordBatch

func (r *TableStreamReader) RecordBatch() arrow.RecordBatch

RecordBatch returns the current record batch. Valid only after Next returns true.

func (*TableStreamReader) Release

func (r *TableStreamReader) Release()

Release releases the current record batch and any open inner reader. Safe to call multiple times. Call when you are done with the reader.

func (*TableStreamReader) Retain

func (r *TableStreamReader) Retain()

Retain is a no-op; TableStreamReader does not use reference counting.

func (*TableStreamReader) Schema

func (r *TableStreamReader) Schema() *arrow.Schema

Schema returns the Arrow schema of the record batches produced by this reader.

func (*TableStreamReader) SnapshotID

func (r *TableStreamReader) SnapshotID() int64

SnapshotID returns the snapshot ID of the current record batch. Returns -1 before the first successful Next call.
