dataset

package
v0.0.6 Latest
Warning

This package is not in the latest version of its module.

Published: May 22, 2026 | License: MIT | Imports: 14 | Imported by: 0

Documentation

Overview

Package dataset provides columnar data abstractions for the Grammar of Graphics pipeline. Frame verbs execute eagerly via the dataset's engine (memory and arrow backends materialize on each verb); the BigQuery engine is the only backend with internal lazy SQL accumulation. Arrow IPC and Parquet ingest paths support zero-copy reads.

Engine-First Architecture

Every data operation is delegated to an Engine backend. The dataset package defines only interfaces and contracts — no concrete column types, no fallbacks. Engines (Arrow, memory, SQL) implement sub-interfaces (Aggregator, Windower, Joiner, etc.) for the operations they support.

Type System

The type system is aligned with Apache Arrow:

  • Field maps to arrow.Field (name, type, nullable, metadata)
  • Schema maps to arrow.Schema (ordered collection of fields)
  • AnyColumn is the type-erased column interface (engine-native storage)
  • Column is the generic typed access layer
  • GetColumn bridges untyped to typed via a single type assertion
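
The bridge from the type-erased interface to the typed one can be sketched as follows. This is a minimal, self-contained illustration and not the package's implementation: the lower-case names (anyColumn, column, floatColumn, getColumn) and the map used as a table are stand-ins invented for the example.

```go
package main

import "fmt"

// anyColumn is a stand-in for the type-erased column interface.
type anyColumn interface {
	Name() string
	Len() int64
}

// column is a stand-in for the generic typed access layer.
type column[T any] interface {
	anyColumn
	Values() []T
}

// floatColumn is a hypothetical memory-backed float64 column.
type floatColumn struct {
	name string
	data []float64
}

func (c floatColumn) Name() string      { return c.name }
func (c floatColumn) Len() int64        { return int64(len(c.data)) }
func (c floatColumn) Values() []float64 { return c.data }

// getColumn looks a column up by name and performs the single type
// assertion from the untyped interface to the typed one.
func getColumn[T any](cols map[string]anyColumn, name string) (column[T], error) {
	c, ok := cols[name]
	if !ok {
		return nil, fmt.Errorf("column %q not found", name)
	}
	typed, ok := c.(column[T])
	if !ok {
		return nil, fmt.Errorf("column %q does not hold the requested type", name)
	}
	return typed, nil
}

func main() {
	cols := map[string]anyColumn{"x": floatColumn{name: "x", data: []float64{1, 2, 3}}}
	if x, err := getColumn[float64](cols, "x"); err == nil {
		fmt.Println(x.Values())
	}
	// Asking for the wrong element type fails at the assertion.
	_, err := getColumn[string](cols, "x")
	fmt.Println(err)
}
```

After the assertion succeeds, callers work with a []float64 and need no further runtime type checks.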

Index

Constants

This section is empty.

Variables

var (
	// ErrUncollected is returned when an operation requires a collected Dataset.
	ErrUncollected = errors.New("dataset: operation on uncollected Dataset — call Collect(ctx) first")

	// ErrUnsupportedEngine is returned when an engine lacks a required capability.
	ErrUnsupportedEngine = errors.New("dataset: unsupported engine capability")

	// ErrNoEngine is returned when a Dataset has no engine.
	ErrNoEngine = errors.New("dataset: Dataset requires an engine")

	// ErrInvalidSlice is returned when a slice range is invalid.
	ErrInvalidSlice = errors.New("dataset: invalid slice range")

	// ErrNoAggResults is returned when there are no aggregation results.
	ErrNoAggResults = errors.New("dataset: no aggregation results to merge")

	// ErrUnsupportedAggFunc is returned for unknown aggregation functions.
	ErrUnsupportedAggFunc = errors.New("dataset: unknown AggFunc")

	// ErrUnsupportedDType is returned for unsupported data types.
	ErrUnsupportedDType = errors.New("dataset: unsupported DType")

	// ErrTypeMismatch is returned when column types don't match.
	ErrTypeMismatch = errors.New("dataset: column type mismatch")

	// ErrColumnNotNumeric is returned when a numeric column is required.
	ErrColumnNotNumeric = errors.New("dataset: column is not numeric")

	// ErrUnsupportedPredicate is returned for unsupported filter predicates.
	ErrUnsupportedPredicate = errors.New("dataset: unsupported predicate operator")
)

Sentinel errors for the dataset package.

Functions

func Abs

func Abs(s []float64) []float64

Abs applies math.Abs to all elements in place.

func Clamp

func Clamp[T cmp.Ordered](lo, hi T) func([]T) []T

Clamp returns an option that clamps slice elements to the range [lo, hi]. Example: Clamp[int64](-5, 5), Clamp(0.0, 1.0)
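
A generic option of this shape could be written as in the following sketch. It is an illustration under assumptions, not the package source: the lower-case name clamp is invented here, and modifying the slice in place is a choice made for the example.

```go
package main

import (
	"cmp"
	"fmt"
)

// clamp returns a function that limits every element of a slice to [lo, hi].
// Assumption: the slice is modified in place and then returned.
func clamp[T cmp.Ordered](lo, hi T) func([]T) []T {
	return func(s []T) []T {
		for i, v := range s {
			if v < lo {
				s[i] = lo
			} else if v > hi {
				s[i] = hi
			}
		}
		return s
	}
}

func main() {
	// The type parameter can be given explicitly or inferred from the bounds.
	fmt.Println(clamp[int64](-5, 5)([]int64{-9, 0, 9}))
	fmt.Println(clamp(0.0, 1.0)([]float64{-0.5, 0.25, 1.5}))
}
```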

func Clean

func Clean(s []float64) []float64

Clean drops NaN and ±Inf from a float64 slice.

func Close

func Close(ds Table) error

Close releases resources if the dataset implements Closer. Safe to call on any Dataset — returns nil for datasets without resources.

func Names

func Names(ds Table) []string

Names returns the column names from a dataset's schema.

func RegisterCSVReader added in v0.0.6

func RegisterCSVReader(engineName string, r CSVReader)

RegisterCSVReader registers a CSVReader implementation for an engine.

func RegisterCSVWriter added in v0.0.6

func RegisterCSVWriter(engineName string, w CSVWriter)

RegisterCSVWriter registers a CSVWriter implementation for an engine.

func RegisterParquetReader added in v0.0.6

func RegisterParquetReader(engineName string, r ParquetReader)

RegisterParquetReader registers a ParquetReader implementation for an engine.

func RegisterParquetWriter added in v0.0.6

func RegisterParquetWriter(engineName string, w ParquetWriter)

RegisterParquetWriter registers a ParquetWriter implementation for an engine.

func ScalarFloat64 added in v0.0.5

func ScalarFloat64(col AnyColumn) (float64, bool)

ScalarFloat64 extracts a single float64 from a 1-element aggregate column (e.g. the result of Aggregator.Sum). Returns 0, false if the column is empty, not float64, or has zero value.

func Sorted

func Sorted[T cmp.Ordered](s []T) []T

Sorted sorts the slice in place. Generic over any ordered type.

Types

type AggFunc

type AggFunc int

AggFunc identifies an aggregation function.

const (
	AggSum AggFunc = iota
	AggMean
	AggMin
	AggMax
	AggCount
	AggMedian
	AggVariance
	AggStdDev     // population standard deviation = sqrt(variance)
	AggFirst      // first element
	AggLast       // last element
	AggMode       // most frequent value
	AggPercentile // quantile; requires AggSpec.P
)

AggSum is the sum aggregation.

type AggSpec

type AggSpec struct {
	OutputName string  // name of the result column
	InputName  string  // name of the source column
	Fn         AggFunc // which aggregation to apply
	P          float64 // percentile ∈ [0,1]; only used when Fn == AggPercentile
}

AggSpec describes a single aggregation to apply in Summarize.

func Count

func Count(out, in string) AggSpec

Count builds a count aggregation spec.

func First added in v0.0.5

func First(out, in string) AggSpec

First builds a first-element aggregation spec.

func Last added in v0.0.5

func Last(out, in string) AggSpec

Last builds a last-element aggregation spec.

func Max

func Max(out, in string) AggSpec

Max builds a max aggregation spec.

func Mean

func Mean(out, in string) AggSpec

Mean builds a mean aggregation spec.

func Median

func Median(out, in string) AggSpec

Median builds a median aggregation spec.

func Min

func Min(out, in string) AggSpec

Min builds a min aggregation spec.

func Mode added in v0.0.5

func Mode(out, in string) AggSpec

Mode builds a mode (most-frequent-value) aggregation spec.

func Percentile added in v0.0.5

func Percentile(out, in string, p float64) AggSpec

Percentile builds a percentile aggregation spec. p ∈ [0,1].

func StdDev added in v0.0.5

func StdDev(out, in string) AggSpec

StdDev builds a standard deviation aggregation spec.

func Sum

func Sum(out, in string) AggSpec

Sum builds a sum aggregation spec.

func Variance

func Variance(out, in string) AggSpec

Variance builds a variance aggregation spec.

type Aggregator

type Aggregator interface {
	Sum(col AnyColumn) (AnyColumn, error)
	Mean(col AnyColumn) (AnyColumn, error)
	MinMax(col AnyColumn) (mnCol AnyColumn, mxCol AnyColumn, err error)
	Count(col AnyColumn) (AnyColumn, error)
	Median(col AnyColumn) (AnyColumn, error)
	Variance(col AnyColumn) (AnyColumn, error)
	StdDev(col AnyColumn) (AnyColumn, error)                // sqrt(variance)
	First(col AnyColumn) (AnyColumn, error)                 // first element
	Last(col AnyColumn) (AnyColumn, error)                  // last element
	Mode(col AnyColumn) (AnyColumn, error)                  // most frequent value
	Percentile(col AnyColumn, p float64) (AnyColumn, error) // quantile ∈ [0,1]
}

Aggregator provides vectorized aggregation kernels. All methods return a single-element AnyColumn whose type follows Arrow compute kernel type rules:

  • Sum: numeric → same type (int64→int64, float64→float64)
  • Mean: numeric → float64 (always widens)
  • MinMax: any ordered type → (min, max) of same type
  • Count: any → int64
  • Median: numeric → float64
  • Variance: numeric → float64

For Arrow: delegates to arrow/math SIMD operations. For SQL: generates SELECT SUM/AVG/MIN/MAX/COUNT queries.

type AndPred

type AndPred struct{ Preds []Masker }

AndPred combines masks with AND.

func And

func And(preds ...Masker) AndPred

And combines multiple predicates with logical AND.

func (AndPred) Expr

func (p AndPred) Expr() string

Expr returns the SQL representation of the AND combination.

func (AndPred) Mask

func (p AndPred) Mask(ds Table) ([]bool, error)

Mask evaluates all sub-predicates and combines them with AND.
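
The AND combination of several masks amounts to an element-wise conjunction. The sketch below works on plain []bool slices instead of Masker values; andMasks is a hypothetical helper, and reporting mismatched lengths as an error is an assumption of the example.

```go
package main

import "fmt"

// andMasks combines equal-length boolean masks element-wise with AND.
// Assumption: masks of different lengths are reported as an error.
func andMasks(masks ...[]bool) ([]bool, error) {
	if len(masks) == 0 {
		return nil, nil
	}
	out := make([]bool, len(masks[0]))
	copy(out, masks[0])
	for _, m := range masks[1:] {
		if len(m) != len(out) {
			return nil, fmt.Errorf("mask length %d, want %d", len(m), len(out))
		}
		for i := range out {
			out[i] = out[i] && m[i]
		}
	}
	return out, nil
}

func main() {
	m, err := andMasks([]bool{true, true, false}, []bool{true, false, false})
	fmt.Println(m, err)
}
```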

type AnyColumn

type AnyColumn interface {
	Name() string
	Len() int64
	DType() DType
}

AnyColumn is the type-erased column interface. This is what Dataset stores, engines operate on, and maps hold. Every engine-native column type implements this.

func ConstInt64Column added in v0.0.4

func ConstInt64Column(eng Engine, name string, val int64, n int) AnyColumn

ConstInt64Column creates a constant int64 column with the given name and value, repeated n times. Useful for injecting system columns like PANEL.

func ConstStringColumn added in v0.0.4

func ConstStringColumn(eng Engine, name string, val string, n int) AnyColumn

ConstStringColumn creates a constant string column with the given name and value, repeated n times.

func Int64ColumnFromStrings added in v0.0.4

func Int64ColumnFromStrings(eng Engine, name string, values []string) (AnyColumn, []string)

Int64ColumnFromStrings creates an int64 column by mapping distinct string values to 0-based indices, preserving first-occurrence order. Returns the column and the ordered list of distinct values.
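
The encoding described here (distinct strings mapped to 0-based codes in first-occurrence order) can be sketched on plain slices. The helper name indexStrings is hypothetical, and the sketch returns a raw []int64 where the real function returns an engine-native column.

```go
package main

import "fmt"

// indexStrings maps each distinct string to a 0-based code, assigning codes
// in the order in which values are first seen.
func indexStrings(values []string) ([]int64, []string) {
	codes := make([]int64, len(values))
	seen := make(map[string]int64)
	var levels []string
	for i, v := range values {
		code, ok := seen[v]
		if !ok {
			code = int64(len(levels))
			seen[v] = code
			levels = append(levels, v)
		}
		codes[i] = code
	}
	return codes, levels
}

func main() {
	codes, levels := indexStrings([]string{"b", "a", "b", "c"})
	fmt.Println(codes, levels)
}
```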

type BetweenPred

type BetweenPred struct {
	Col    string
	Lo, Hi any
}

BetweenPred selects rows where a column value is between Lo and Hi.

func Between

func Between(col string, lo, hi any) BetweenPred

Between builds a BETWEEN predicate for the given column and bounds.

func (BetweenPred) Expr

func (p BetweenPred) Expr() string

Expr returns the SQL representation of this BETWEEN predicate.

func (BetweenPred) Mask

func (p BetweenPred) Mask(ds Table) ([]bool, error)

Mask evaluates the BETWEEN predicate against each row.

type BoolAppender

type BoolAppender interface {
	Append(v bool)
	AppendNull()
	AppendValues(vs []bool)
	Reserve(n int)
}

BoolAppender streams bool values into a column.

type BoolMask

type BoolMask []bool

BoolMask is a pre-computed boolean mask that implements Masker. Useful when the filter has already been computed externally (e.g. faceting).

func (BoolMask) Expr

func (m BoolMask) Expr() string

Expr returns a constant "TRUE" placeholder for SQL contexts.

func (BoolMask) Mask

func (m BoolMask) Mask(_ Table) ([]bool, error)

Mask returns the pre-computed boolean slice unchanged.

type Builder

type Builder interface {
	Float64(col string) Float64Appender
	Int64(col string) Int64Appender
	String(col string) StringAppender
	Bool(col string) BoolAppender

	Build() (Table, error)
}

Builder provides streaming, typed, zero-boxing construction. Each column has its own typed appender — no any boxing, no allocations per row.

type BuilderFactory

type BuilderFactory interface {
	NewBuilder(schema *Schema) Builder
}

BuilderFactory creates schema-aware builders for streaming construction.

type CSVConfig

type CSVConfig struct {
	HasHeader  bool
	Comma      rune
	Comment    rune
	NullValues []string
	// ChunkSize is the number of rows per batch. 0 means engine default.
	// Arrow default: 65536, Memory default: unlimited.
	ChunkSize int
}

CSVConfig holds engine-agnostic CSV configuration. The dataset/csv facade constructs this from functional options and passes it to the engine's CSVReader/CSVWriter implementation.

type CSVReader

type CSVReader interface {
	ReadCSV(ctx context.Context, eng Engine, r io.Reader, cfg CSVConfig) (Table, error)
}

CSVReader reads CSV data into an engine-native Dataset. Memory engine: uses go-simdcsv + schema inference. Arrow engine: uses arrow/csv.NewInferringReader for zero-copy ingest.

func GetCSVReader added in v0.0.6

func GetCSVReader(engineName string) (CSVReader, bool)

GetCSVReader retrieves a registered CSVReader for an engine.

type CSVWriter

type CSVWriter interface {
	WriteCSV(ctx context.Context, eng Engine, w io.Writer, ds Table, cfg CSVConfig) error
}

CSVWriter writes a Dataset to CSV. Memory engine: uses go-simdcsv Writer. Arrow engine: uses go-simdcsv Writer (generic — CSV output is string-based).

func GetCSVWriter added in v0.0.6

func GetCSVWriter(engineName string) (CSVWriter, bool)

GetCSVWriter retrieves a registered CSVWriter for an engine.

type Caster

type Caster interface {
	Cast(col AnyColumn, target DType) (AnyColumn, error)
}

Caster provides engine-controlled type casting. Casting is an engine operation — the engine knows its native column types and how to convert between them.

type Closer

type Closer interface {
	Close() error
}

Closer is optionally implemented by datasets that hold resources requiring explicit cleanup (e.g., Arrow tables, database connections).

type Column

type Column[T any] interface {
	AnyColumn
	Values() []T
	IsNull() []bool
}

Column is the typed access layer. Engine-specific column types implement both AnyColumn and Column[T] for their native type.

Values returns the underlying typed slice — zero-copy for both Arrow (returns the Arrow buffer) and memory (returns the Go slice).

IsNull returns the null bitmap. nil means no nulls (common case, zero alloc).

func GetColumn

func GetColumn[T any](ds Table, name string) (Column[T], error)

GetColumn retrieves a typed column from a dataset. This is the only place a type assertion occurs — call sites get compile-time type safety from this point forward.

type ColumnFactory

type ColumnFactory interface {
	NewFloat64Column(name string, data []float64) AnyColumn
	NewInt64Column(name string, data []int64) AnyColumn
	NewStringColumn(name string, data []string) AnyColumn
	NewBoolColumn(name string, data []bool) AnyColumn
	NewTimestampColumn(name string, data []int64) AnyColumn

	// FromColumns assembles columns into a Dataset with the given schema.
	// All columns must have the same length.
	FromColumns(schema *Schema, cols ...AnyColumn) (Table, error)
}

ColumnFactory wraps existing typed slices into engine-native columns. Memory engine: wraps the slice (zero-copy). Arrow engine: builds an Arrow array (one allocation).

type ColumnNotFoundError added in v0.0.2

type ColumnNotFoundError struct {
	Name string
}

ColumnNotFoundError indicates a requested column does not exist.

func (*ColumnNotFoundError) Error added in v0.0.2

func (e *ColumnNotFoundError) Error() string

type CompPred

type CompPred struct {
	Col string
	Op  Op
	Val any
}

CompPred compares a column against a scalar value. Implements both Masker (local eval) and Expr() (SQL pushdown).

func Eq

func Eq(col string, val any) CompPred

Eq builds a col == val predicate.

func Ge

func Ge(col string, val any) CompPred

Ge builds a col >= val predicate.

func Gt

func Gt(col string, val any) CompPred

Gt builds a col > val predicate.

func Le

func Le(col string, val any) CompPred

Le builds a col <= val predicate.

func Lt

func Lt(col string, val any) CompPred

Lt builds a col < val predicate.

func Ne

func Ne(col string, val any) CompPred

Ne builds a col != val predicate.

func (CompPred) Expr

func (p CompPred) Expr() string

Expr returns the SQL representation of this comparison.

func (CompPred) Mask

func (p CompPred) Mask(ds Table) ([]bool, error)

Mask evaluates the comparison predicate against each row.
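
The two evaluation paths of a comparison predicate (a local mask and a SQL expression) can be illustrated with a small stand-in type. gtPred below is hypothetical and limited to float64 data; the text that the real Expr method produces is not documented here, so the format string is an assumption.

```go
package main

import "fmt"

// gtPred is a hypothetical "col > val" predicate over float64 data.
type gtPred struct {
	col string
	val float64
}

// mask evaluates the predicate locally and yields one bool per row.
func (p gtPred) mask(values []float64) []bool {
	out := make([]bool, len(values))
	for i, v := range values {
		out[i] = v > p.val
	}
	return out
}

// expr renders the same condition as text for a SQL backend.
// Assumption: the exact output format is chosen for this example only.
func (p gtPred) expr() string {
	return fmt.Sprintf("%s > %v", p.col, p.val)
}

func main() {
	p := gtPred{col: "x", val: 0}
	fmt.Println(p.mask([]float64{-1, 0, 2}))
	fmt.Println(p.expr())
}
```

Keeping both forms on one value lets an in-memory engine call the mask path while a SQL engine pushes the expression down into a WHERE clause.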

type Composer

type Composer interface {
	Stack(datasets ...Table) (Table, error)
	Combine(datasets ...Table) (Table, error)
}

Composer provides row/column binding operations. For Arrow: zero-copy concatenation of Arrow arrays. For SQL: UNION ALL / lateral join.

type DType

type DType int

DType represents the logical data type of a column. This is the type ID — analogous to arrow.Type.

const (
	// DTypeFloat64 is a 64-bit floating point column.
	DTypeFloat64 DType = iota
	// DTypeInt64 is a 64-bit integer column.
	DTypeInt64
	// DTypeString is a string/categorical column.
	DTypeString
	// DTypeBool is a boolean column.
	DTypeBool
	// DTypeTimestamp is a timestamp column stored as int64 nanoseconds
	// since the Unix epoch (1970-01-01T00:00:00Z). This representation
	// is zero-copy compatible with Arrow's TIMESTAMP(ns) type.
	DTypeTimestamp
	// DTypeDate is a date-only column stored as int64 days since the
	// Unix epoch (1970-01-01). Compatible with Arrow's DATE32 type.
	DTypeDate
	// DTypeTime is a time-of-day column stored as int64 nanoseconds
	// since midnight (00:00:00.000000000). Compatible with Arrow's TIME64(ns).
	DTypeTime
	// DTypeUnknown is an unrecognized type.
	DTypeUnknown
)

func (DType) String

func (d DType) String() string

String returns the human-readable name of the DType.

type Dataset

type Dataset struct {
	// contains filtered or unexported fields
}

Dataset is the fluent API for data manipulation. All verbs return a new Dataset that records the operation lazily — no computation happens until Dataset.Collect is called. The chain forms a linked list of op nodes rooted at a materialised Table.

Usage:

result, err := dataset.From(ds).
    Select("x", "y").
    Filter(dataset.Gt("x", 0)).
    Arrange("x").
    Collect(ctx)
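
The idea of recording verbs as a linked list and running them only on collection can be shown with a toy model. Everything in this sketch (the lazy type, from, filter, head, collect) is hypothetical and operates on a bare []float64; it illustrates the deferred-evaluation pattern and is not the package's internal design.

```go
package main

import "fmt"

// lazy is a hypothetical node in a chain of deferred operations.
type lazy struct {
	parent *lazy                     // previous node; nil at the root
	rows   []float64                 // materialised data, set only on the root
	apply  func([]float64) []float64 // deferred operation recorded by a verb
}

// from creates the root node around already materialised data.
func from(rows []float64) lazy { return lazy{rows: rows} }

// filter records a keep-if operation without running it.
func (l lazy) filter(keep func(float64) bool) lazy {
	return lazy{parent: &l, apply: func(in []float64) []float64 {
		var out []float64
		for _, v := range in {
			if keep(v) {
				out = append(out, v)
			}
		}
		return out
	}}
}

// head records a take-first-n operation without running it.
func (l lazy) head(n int) lazy {
	return lazy{parent: &l, apply: func(in []float64) []float64 {
		if n > len(in) {
			return in
		}
		return in[:n]
	}}
}

// collect walks back to the root and then applies each recorded operation in order.
func (l lazy) collect() []float64 {
	if l.parent == nil {
		return l.rows
	}
	return l.apply(l.parent.collect())
}

func main() {
	chain := from([]float64{-1, 2, 3, 4}).
		filter(func(v float64) bool { return v > 0 }).
		head(2)
	// Nothing has been computed until this call.
	fmt.Println(chain.collect())
}
```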

func From

func From(ds Table) Dataset

From wraps a Table in a Dataset for fluent verb chaining.

func NewDataset

func NewDataset(eng Engine, cols ...AnyColumn) (Dataset, error)

NewDataset creates a Dataset from an engine and columns. The schema is inferred from the columns' names and types.

func ReplaceColumn

func ReplaceColumn(ds Dataset, name string, values []float64) Dataset

ReplaceColumn returns a lazy Dataset that replaces a named column with new float64 values when collected.

func (Dataset) AntiJoin

func (f Dataset) AntiJoin(other Table, spec JoinSpec) Dataset

AntiJoin keeps rows from the left that have no match in other.

func (Dataset) Arrange

func (f Dataset) Arrange(cols ...string) Dataset

Arrange sorts the dataset by the named columns (ascending).

func (Dataset) Bools

func (d Dataset) Bools(name string) ([]bool, error)

Bools returns the bool values of the named column.

func (Dataset) Collect

func (f Dataset) Collect(ctx context.Context) (Dataset, error)

Collect materialises the lazy operation chain, returning a new Dataset with the result Table populated. If already materialised, returns self.

This is the single materialisation boundary — all data access must go through a collected Dataset.

func (Dataset) Collected

func (f Dataset) Collected() bool

Collected reports whether the Dataset has been materialised.

func (Dataset) Column

func (f Dataset) Column(name string) (AnyColumn, error)

Column retrieves a named column. Requires a collected Dataset.

func (Dataset) Combine

func (f Dataset) Combine(others ...Table) Dataset

Combine horizontally concatenates this dataset with others (column-bind).

func (Dataset) Distinct

func (f Dataset) Distinct(cols ...string) Dataset

Distinct removes duplicate rows based on the specified columns.

func (Dataset) DropNA

func (f Dataset) DropNA(cols ...string) Dataset

DropNA removes rows with missing values in the specified columns.

func (Dataset) Err

func (f Dataset) Err() error

Err returns the first error encountered in the chain, or nil.

func (Dataset) Fill

func (f Dataset) Fill(col string, dir FillDirection) Dataset

Fill forward- or backward-fills missing values in the named column.

func (Dataset) Filter

func (f Dataset) Filter(mask Masker) Dataset

Filter keeps rows where the Masker evaluates to true.

func (Dataset) Float64

func (d Dataset) Float64(name string, opts ...Float64Opt) ([]float64, error)

Float64 returns the float64 values of the named column, optionally transformed by a chain of Float64Opts. With no opts, the returned slice aliases the underlying column data (zero-copy). Any opt forces a copy before the chain runs, so callers may freely mutate the result.

If the column is int64-backed (DTypeInt64, DTypeTimestamp, DTypeDate, DTypeTime), the values are converted to float64 automatically. This enables all draw functions to work with temporal data without changes.
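
The aliasing rule described above (no opts returns the underlying slice, any opt triggers a copy first) can be sketched like this. float64s and sorted are hypothetical helpers written for the example; only the copy-before-transform behaviour is taken from the description.

```go
package main

import (
	"fmt"
	"sort"
)

// float64s returns data itself when no opts are given (the result aliases
// the input). With opts it copies first, so the original is never modified.
func float64s(data []float64, opts ...func([]float64) []float64) []float64 {
	if len(opts) == 0 {
		return data
	}
	out := make([]float64, len(data))
	copy(out, data)
	for _, opt := range opts {
		out = opt(out)
	}
	return out
}

// sorted is a hypothetical opt that sorts its argument in place.
func sorted(s []float64) []float64 {
	sort.Float64s(s)
	return s
}

func main() {
	data := []float64{3, 1, 2}
	fmt.Println(float64s(data, sorted)) // works on a copy
	fmt.Println(data)                   // original order is preserved
}
```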

func (Dataset) FullJoin

func (f Dataset) FullJoin(other Table, spec JoinSpec) Dataset

FullJoin performs a full outer join against other on the given key spec.

func (Dataset) GroupBy

func (f Dataset) GroupBy(cols ...string) GroupedFrame

GroupBy specifies columns to group by. Returns a GroupedFrame for Summarize.

func (Dataset) Head

func (f Dataset) Head(n int) Dataset

Head returns the first n rows.

func (Dataset) InnerJoin

func (f Dataset) InnerJoin(other Table, spec JoinSpec) Dataset

InnerJoin performs an inner join against other on the given key spec.

func (Dataset) Int64

func (d Dataset) Int64(name string, opts ...Int64Opt) ([]int64, error)

Int64 returns the int64 values of the named column, optionally transformed.

func (Dataset) LeftJoin

func (f Dataset) LeftJoin(other Table, spec JoinSpec) Dataset

LeftJoin performs a left join against other on the given key spec.

func (Dataset) Mutate

func (f Dataset) Mutate(name string, fn MutateFunc) Dataset

Mutate appends or replaces a column using a MutateFunc.

func (Dataset) NumCols

func (f Dataset) NumCols() int64

NumCols returns the number of columns, or 0 if uncollected.

func (Dataset) NumRows

func (f Dataset) NumRows() int64

NumRows returns the number of rows, or 0 if uncollected.

func (Dataset) PivotLonger

func (f Dataset) PivotLonger(spec PivotLongerSpec) Dataset

PivotLonger reshapes wide data to long format.

func (Dataset) PivotWider

func (f Dataset) PivotWider(spec PivotWiderSpec) Dataset

PivotWider reshapes long data to wide format.

func (Dataset) Rename

func (f Dataset) Rename(oldName, newName string) Dataset

Rename renames a column.

func (Dataset) RightJoin

func (f Dataset) RightJoin(other Table, spec JoinSpec) Dataset

RightJoin performs a right join against other on the given key spec.

func (Dataset) Schema

func (f Dataset) Schema() *Schema

Schema returns the schema, or nil if uncollected.

func (Dataset) Select

func (f Dataset) Select(cols ...string) Dataset

Select keeps only the named columns, in the order specified.

func (Dataset) SelectRows

func (f Dataset) SelectRows(indices []int) (Dataset, error)

SelectRows returns a new materialised Dataset containing only the rows at the given indices. This is more efficient than Filter when you already have indices (avoids O(n) bool-mask allocation).

The Dataset must already be materialised (collected) before SelectRows is called.

func (Dataset) SemiJoin

func (f Dataset) SemiJoin(other Table, spec JoinSpec) Dataset

SemiJoin keeps rows from the left that have a match in other.

func (Dataset) Separate

func (f Dataset) Separate(col string, into []string, sep string) Dataset

Separate splits a string column into multiple columns by a separator.

func (Dataset) Slice

func (f Dataset) Slice(start, end int) Dataset

Slice returns rows in the range [start, end).

func (Dataset) Stack

func (f Dataset) Stack(others ...Table) Dataset

Stack vertically concatenates this dataset with others (row-bind).

func (Dataset) Strings

func (d Dataset) Strings(name string, opts ...StringOpt) ([]string, error)

Strings returns the string values of the named column, optionally transformed.

func (Dataset) Table

func (f Dataset) Table() Table

Table returns the underlying Table, or nil if the Dataset is uncollected. Callers must check for nil or call Collect(ctx) before accessing the Table.

func (Dataset) Tail

func (f Dataset) Tail(n int) Dataset

Tail returns the last n rows.

func (Dataset) WithColumn added in v0.0.4

func (f Dataset) WithColumn(col AnyColumn) Dataset

WithColumn appends or replaces a pre-built column in the dataset. This is the simplest way to inject a column that was constructed externally (e.g., via ColumnFactory.NewInt64Column).

type Engine

type Engine interface {
	// Name returns a human-readable identifier (e.g., "arrow", "memory", "sql").
	Name() string
	// Context returns the engine's lifecycle context.
	Context() context.Context
}

Engine is the marker interface that all compute backends implement. Every engine carries a context.Context that governs its lifecycle. Long-running operations should check Context().Err() for cancellation.

func GetEngine

func GetEngine(ds Table) Engine

GetEngine extracts the engine from a dataset. Returns nil if the dataset does not carry an engine.

type Field

type Field struct {
	Name     string
	Dtype    DType
	Nullable bool
	Metadata map[string]string
}

Field describes a single column in a dataset — its name, logical type, nullability, and optional metadata. This maps directly to arrow.Field.

Metadata carries type-specific parameters that DType alone cannot express:

  • Timestamp timezone: {"tz": "America/Sao_Paulo"}
  • Display format: {"format": "2006-01-02"}
  • Units: {"unit": "ns"}

func BoolCol

func BoolCol(name string) Field

BoolCol creates a bool field descriptor.

func DateCol added in v0.0.6

func DateCol(name string) Field

DateCol creates a date-only field descriptor (days since epoch).

func FloatCol

func FloatCol(name string) Field

FloatCol creates a float64 field descriptor.

func IntCol

func IntCol(name string) Field

IntCol creates an int64 field descriptor.

func NullableFloatCol

func NullableFloatCol(name string) Field

NullableFloatCol creates a nullable float64 field.

func NullableIntCol

func NullableIntCol(name string) Field

NullableIntCol creates a nullable int64 field.

func NullableStringCol

func NullableStringCol(name string) Field

NullableStringCol creates a nullable string field.

func StringCol

func StringCol(name string) Field

StringCol creates a string field descriptor.

func TimeCol added in v0.0.6

func TimeCol(name string) Field

TimeCol creates a time-of-day field descriptor (ns since midnight).

func TimestampCol

func TimestampCol(name string) Field

TimestampCol creates a timestamp field descriptor.

func (Field) WithMetadata

func (f Field) WithMetadata(md map[string]string) Field

WithMetadata returns a copy of the field with the given metadata.

func (Field) WithNullable

func (f Field) WithNullable() Field

WithNullable returns a copy of the field with Nullable set.

type FillDirection

type FillDirection int

FillDirection specifies the direction for filling missing values.

const (
	// FillDown fills missing values with the previous non-null value (carry forward).
	FillDown FillDirection = iota
	// FillUp fills missing values with the next non-null value (carry backward).
	FillUp
)
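
The two directions can be illustrated on a value slice with a parallel null mask. fillDown and fillUp are hypothetical helpers; returning new slices, leaving leading (or trailing) nulls unfilled, and assuming both inputs have the same length are choices made for this sketch.

```go
package main

import "fmt"

// fillDown carries the previous non-null value forward.
// Nulls before the first non-null value stay null.
func fillDown(vals []float64, isNull []bool) ([]float64, []bool) {
	out := append([]float64(nil), vals...)
	nulls := append([]bool(nil), isNull...)
	for i := 1; i < len(out); i++ {
		if nulls[i] && !nulls[i-1] {
			out[i], nulls[i] = out[i-1], false
		}
	}
	return out, nulls
}

// fillUp carries the next non-null value backward.
// Nulls after the last non-null value stay null.
func fillUp(vals []float64, isNull []bool) ([]float64, []bool) {
	out := append([]float64(nil), vals...)
	nulls := append([]bool(nil), isNull...)
	for i := len(out) - 2; i >= 0; i-- {
		if nulls[i] && !nulls[i+1] {
			out[i], nulls[i] = out[i+1], false
		}
	}
	return out, nulls
}

func main() {
	vals := []float64{0, 1, 0, 0, 4}
	null := []bool{true, false, true, true, false}
	fmt.Println(fillDown(vals, null))
	fmt.Println(fillUp(vals, null))
}
```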

type Filler

type Filler interface {
	Fill(col AnyColumn, dir FillDirection) (AnyColumn, error)
	DropNA(ds Table, cols ...string) (Table, error)
	ReplaceNA(col AnyColumn, defaultVal float64) (AnyColumn, error)
}

Filler provides missing-value handling operations. For Arrow: streaming fill with zero allocation. For SQL: generates COALESCE / window-based fill.

type Filterer

type Filterer interface {
	Filter(ds Table, mask Masker) (Table, error)
}

Filterer provides mask-based row filtering. For Arrow: boolean mask filtering with zero-copy. For SQL: generates WHERE clauses.

type Float64Appender

type Float64Appender interface {
	Append(v float64)
	AppendNull()
	AppendValues(vs []float64)
	Reserve(n int)
}

Float64Appender streams float64 values into a column.

type Float64Opt

type Float64Opt = func([]float64) []float64

Float64Opt transforms a float64 slice (e.g. Clean, Clamp, Sorted).

func FillNaN

func FillNaN(fill float64) Float64Opt

FillNaN replaces all NaNs with the provided value.

type GroupedFrame

type GroupedFrame struct {
	// contains filtered or unexported fields
}

GroupedFrame holds a Dataset with group-by columns set.

func (GroupedFrame) Summarize

func (gf GroupedFrame) Summarize(specs ...AggSpec) Dataset

Summarize applies aggregations per group, producing a lazy Dataset.

type HasEngine

type HasEngine interface {
	Table
	Engine() Engine
}

HasEngine is implemented by datasets that carry an engine reference. This enables engine propagation through transformations — stat packages and ggplot internals can produce new datasets using the same engine without importing engine-specific packages.

type InPred

type InPred struct {
	Col  string
	Vals []any
}

InPred selects rows where a column value is in a set of values.

func In

func In(col string, vals ...any) InPred

In builds an IN predicate for the given column and value set.

func (InPred) Expr

func (p InPred) Expr() string

Expr returns the SQL representation of this IN predicate.

func (InPred) Mask

func (p InPred) Mask(ds Table) ([]bool, error)

Mask evaluates the IN predicate against each row.

type Int64Appender

type Int64Appender interface {
	Append(v int64)
	AppendNull()
	AppendValues(vs []int64)
	Reserve(n int)
}

Int64Appender streams int64 values into a column.

type Int64Opt

type Int64Opt = func([]int64) []int64

Int64Opt transforms an int64 slice.

type IsNotNullPred

type IsNotNullPred struct{ Col string }

IsNotNullPred selects rows where a column value is not null.

func IsNotNull

func IsNotNull(col string) IsNotNullPred

IsNotNull builds a not-null-check predicate.

func (IsNotNullPred) Expr

func (p IsNotNullPred) Expr() string

Expr returns the SQL representation of an IS NOT NULL check.

func (IsNotNullPred) Mask

func (p IsNotNullPred) Mask(ds Table) ([]bool, error)

Mask evaluates the IS NOT NULL predicate against each row.

type IsNullPred

type IsNullPred struct{ Col string }

IsNullPred selects rows where a column value is null.

func IsNull

func IsNull(col string) IsNullPred

IsNull builds a null-check predicate.

func (IsNullPred) Expr

func (p IsNullPred) Expr() string

Expr returns the SQL representation of an IS NULL check.

func (IsNullPred) Mask

func (p IsNullPred) Mask(ds Table) ([]bool, error)

Mask evaluates the IS NULL predicate against each row.

type JoinSpec

type JoinSpec struct {
	Type      JoinType
	LeftCols  []string
	RightCols []string
}

JoinSpec describes how to match rows between two datasets.

func On

func On(cols ...string) JoinSpec

On creates a JoinSpec matching on columns with the same name in both datasets.

type JoinType

type JoinType int

JoinType identifies the kind of join to perform.

const (
	// JoinLeft keeps all rows from the left dataset; unmatched right rows are null-filled.
	JoinLeft JoinType = iota
	// JoinRight keeps all rows from the right dataset; unmatched left rows are null-filled.
	JoinRight
	// JoinInner keeps only rows with matches in both datasets.
	JoinInner
	// JoinFull keeps all rows from both datasets; unmatched sides are null-filled.
	JoinFull
	// JoinSemi keeps rows from the left that have at least one match in the right.
	// No columns from the right are included.
	JoinSemi
	// JoinAnti keeps rows from the left that have NO match in the right.
	// No columns from the right are included.
	JoinAnti
)
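
The semi and anti variants differ from the other join types in that they only select left rows. A minimal sketch on single string keys follows; semiJoin and antiJoin are hypothetical helpers that return left row indices, whereas a real engine would return a Table and support multi-column keys.

```go
package main

import "fmt"

// keySet builds a lookup set from the right-hand keys.
func keySet(keys []string) map[string]struct{} {
	set := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		set[k] = struct{}{}
	}
	return set
}

// semiJoin returns the indices of left rows whose key appears in right.
// Each left row is emitted at most once, however many right rows match.
func semiJoin(left, right []string) []int {
	set := keySet(right)
	var idx []int
	for i, k := range left {
		if _, ok := set[k]; ok {
			idx = append(idx, i)
		}
	}
	return idx
}

// antiJoin returns the indices of left rows whose key does not appear in right.
func antiJoin(left, right []string) []int {
	set := keySet(right)
	var idx []int
	for i, k := range left {
		if _, ok := set[k]; !ok {
			idx = append(idx, i)
		}
	}
	return idx
}

func main() {
	left := []string{"a", "b", "c", "b"}
	right := []string{"b", "d", "b"}
	fmt.Println(semiJoin(left, right), antiJoin(left, right))
}
```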

type Joiner

type Joiner interface {
	Join(left, right Table, spec JoinSpec) (Table, error)
}

Joiner provides join operations across datasets. For Arrow: hash-join with lazy indexed column views. For SQL: generates JOIN ... ON ... clauses.

type Masker

type Masker interface {
	// Mask computes a boolean mask of length int(ds.NumRows()). True entries are kept.
	Mask(ds Table) ([]bool, error)
}

Masker describes a row-level filter condition that can be lazily evaluated against a dataset to produce a boolean mask.

type MathKernel

type MathKernel interface {
	// Binary arithmetic (column × column, same length required)
	AddCols(a, b AnyColumn) (AnyColumn, error)
	SubCols(a, b AnyColumn) (AnyColumn, error)
	MulCols(a, b AnyColumn) (AnyColumn, error)
	DivCols(a, b AnyColumn) (AnyColumn, error)

	// Scalar arithmetic (column × scalar)
	AddScalar(col AnyColumn, val float64) (AnyColumn, error)
	MulScalar(col AnyColumn, val float64) (AnyColumn, error)

	// Unary numeric
	Abs(col AnyColumn) (AnyColumn, error)
	Neg(col AnyColumn) (AnyColumn, error)
	Sign(col AnyColumn) (AnyColumn, error)
	Sqrt(col AnyColumn) (AnyColumn, error)
	Pow(col AnyColumn, exp float64) (AnyColumn, error)

	// Transcendental — logarithmic
	Exp(col AnyColumn) (AnyColumn, error)
	Ln(col AnyColumn) (AnyColumn, error)
	Log2(col AnyColumn) (AnyColumn, error)
	Log10(col AnyColumn) (AnyColumn, error)

	// Transcendental — trigonometric
	Sin(col AnyColumn) (AnyColumn, error)
	Cos(col AnyColumn) (AnyColumn, error)
	Tan(col AnyColumn) (AnyColumn, error)
	Asin(col AnyColumn) (AnyColumn, error)
	Acos(col AnyColumn) (AnyColumn, error)
	Atan(col AnyColumn) (AnyColumn, error)
	Atan2(y, x AnyColumn) (AnyColumn, error)

	// Transcendental — hyperbolic / special
	Tanh(col AnyColumn) (AnyColumn, error)
	Sigmoid(col AnyColumn) (AnyColumn, error)
	Erf(col AnyColumn) (AnyColumn, error)

	// Rounding
	Round(col AnyColumn) (AnyColumn, error)
	Floor(col AnyColumn) (AnyColumn, error)
	Ceil(col AnyColumn) (AnyColumn, error)

	// Bitwise (int64 columns only)
	BitAnd(a, b AnyColumn) (AnyColumn, error)
	BitOr(a, b AnyColumn) (AnyColumn, error)
	BitXor(a, b AnyColumn) (AnyColumn, error)
	BitNot(col AnyColumn) (AnyColumn, error)
	BitShiftLeft(col AnyColumn, n int) (AnyColumn, error)
	BitShiftRight(col AnyColumn, n int) (AnyColumn, error)
}

MathKernel provides element-wise mathematical transforms on numeric columns.

Engine behavior:

  • Arrow engine: uses the Arrow compute Datum API when available, highway SIMD for gaps.
  • Memory engine: uses highway SIMD on raw slices, falls back to the math stdlib.
  • SQL engine: generates SQL math functions (EXP, LOG, SIN, etc.).

All methods require float64 columns unless noted (bitwise requires int64).

type MutateFunc

type MutateFunc interface {
	// Apply produces a new column from the dataset.
	Apply(ds Table) (AnyColumn, error)
}

MutateFunc describes a column transformation for Mutate.

type NotPred

type NotPred struct{ Pred Masker }

NotPred inverts a mask.

func Not

func Not(pred Masker) NotPred

Not returns a predicate that inverts the given mask.

func (NotPred) Expr

func (p NotPred) Expr() string

Expr returns the SQL NOT(...) expression.

func (NotPred) Mask

func (p NotPred) Mask(ds Table) ([]bool, error)

Mask evaluates the NOT predicate against the dataset rows.

type Op

type Op int

Op identifies a comparison operator.

const (
	OpGt        Op = iota // col > val
	OpLt                  // col < val
	OpGe                  // col >= val
	OpLe                  // col <= val
	OpEq                  // col == val
	OpNe                  // col != val
	OpBetween             // lo <= col <= hi
	OpIn                  // col IN (vals...)
	OpIsNull              // col IS NULL
	OpIsNotNull           // col IS NOT NULL
)

Op values are the comparison operators a predicate can apply; OpGt (greater-than) is the zero value.
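To make the operator semantics concrete, the sketch below evaluates a subset of the operators against a float64 slice and produces a boolean mask. The lowercase `op` type and `compareMask` are hypothetical stand-ins defined locally; the package's own predicate types are not shown in this section and may work differently.

```go
package main

import "fmt"

// op mirrors a subset of the documented Op constants for illustration.
type op int

const (
	opGt op = iota // col > val
	opLt           // col < val
	opGe           // col >= val
	opLe           // col <= val
	opEq           // col == val
	opNe           // col != val
)

// compareMask evaluates "col <op> val" for every row and returns a mask
// with one entry per row. Hypothetical helper, not the package API.
func compareMask(col []float64, o op, val float64) ([]bool, error) {
	mask := make([]bool, len(col))
	for i, v := range col {
		switch o {
		case opGt:
			mask[i] = v > val
		case opLt:
			mask[i] = v < val
		case opGe:
			mask[i] = v >= val
		case opLe:
			mask[i] = v <= val
		case opEq:
			mask[i] = v == val
		case opNe:
			mask[i] = v != val
		default:
			return nil, fmt.Errorf("unsupported operator %d", o)
		}
	}
	return mask, nil
}

func main() {
	mask, err := compareMask([]float64{1, 5, 3}, opGe, 3)
	fmt.Println(mask, err) // [false true true] <nil>
}
```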

type Optimizer

type Optimizer interface {
	Optimize(ops []op) []op
}

Optimizer is optionally implemented by engines that can fuse or reorder operations for efficiency. BigQuery uses this to fuse verb chains into a single SQL query.

type OrPred

type OrPred struct{ Preds []Masker }

OrPred combines masks with OR.

func Or

func Or(preds ...Masker) OrPred

Or combines multiple predicates with logical OR.

func (OrPred) Expr

func (p OrPred) Expr() string

Expr returns the SQL representation of the OR combination.

func (OrPred) Mask

func (p OrPred) Mask(ds Table) ([]bool, error)

Mask evaluates the OR predicate against the dataset rows.
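The effect of NotPred and OrPred on masks can be sketched with two small helpers over []bool. This is an assumption-laden illustration: `orMasks` and `notMask` are hypothetical names, and the choice to reject an empty argument list or mismatched lengths is mine, not documented package behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// orMasks returns the element-wise OR of equal-length masks.
func orMasks(masks ...[]bool) ([]bool, error) {
	if len(masks) == 0 {
		return nil, errors.New("no masks to combine")
	}
	out := make([]bool, len(masks[0]))
	for _, m := range masks {
		if len(m) != len(out) {
			return nil, errors.New("mask length mismatch")
		}
		for i, keep := range m {
			out[i] = out[i] || keep
		}
	}
	return out, nil
}

// notMask returns a new mask with every entry inverted.
func notMask(m []bool) []bool {
	out := make([]bool, len(m))
	for i, keep := range m {
		out[i] = !keep
	}
	return out
}

func main() {
	a := []bool{true, false, false}
	b := []bool{false, false, true}
	either, err := orMasks(a, b)
	fmt.Println(either, err)     // [true false true] <nil>
	fmt.Println(notMask(either)) // [false true false]
}
```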

type ParquetConfig

type ParquetConfig struct {
	// Compression codec: "snappy", "gzip", "zstd", "lz4", "none".
	Compression string
}

ParquetConfig holds engine-agnostic Parquet configuration.

type ParquetReader

type ParquetReader interface {
	ReadParquet(ctx context.Context, eng Engine, r io.ReaderAt, size int64, cfg ParquetConfig) (Table, error)
}

ParquetReader reads Parquet data into an engine-native Dataset. Memory engine: uses parquet-go for struct-based row reading. Arrow engine: uses pqarrow.ReadTable for zero-copy columnar ingest.

func GetParquetReader added in v0.0.6

func GetParquetReader(engineName string) (ParquetReader, bool)

GetParquetReader retrieves a registered ParquetReader for an engine.

type ParquetWriter

type ParquetWriter interface {
	WriteParquet(ctx context.Context, eng Engine, w io.Writer, ds Table, cfg ParquetConfig) error
}

ParquetWriter writes a Dataset to Parquet format. Memory engine: uses parquet-go GenericWriter. Arrow engine: uses pqarrow.WriteTable.

func GetParquetWriter added in v0.0.6

func GetParquetWriter(engineName string) (ParquetWriter, bool)

GetParquetWriter retrieves a registered ParquetWriter for an engine.

type PivotLongerSpec

type PivotLongerSpec struct {
	// Cols are the column names to pivot from wide to long format.
	// These columns are "gathered" into a single name+value pair.
	Cols []string
	// NamesTo is the output column name that will hold the original column names.
	NamesTo string
	// ValuesTo is the output column name that will hold the values.
	ValuesTo string
}

PivotLongerSpec configures a PivotLonger operation.

type PivotWiderSpec

type PivotWiderSpec struct {
	// NamesFrom is the column whose unique values become new column names.
	NamesFrom string
	// ValuesFrom is the column whose values fill the new columns.
	ValuesFrom string
}

PivotWiderSpec configures a PivotWider operation.
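The two pivot directions can be sketched on a map of float64 columns. This is a simplified illustration under stated assumptions: `pivotLonger` and `pivotWider` are hypothetical local functions, identifier columns are left out, and the row-by-row output order is my choice rather than documented behavior.

```go
package main

import "fmt"

// pivotLonger gathers cols of a wide table into parallel name/value
// slices, visiting rows in order and cols in the order given.
// names plays the role of NamesTo, values the role of ValuesTo.
func pivotLonger(wide map[string][]float64, cols []string, nRows int) ([]string, []float64) {
	names := make([]string, 0, nRows*len(cols))
	values := make([]float64, 0, nRows*len(cols))
	for r := 0; r < nRows; r++ {
		for _, c := range cols {
			names = append(names, c)
			values = append(values, wide[c][r])
		}
	}
	return names, values
}

// pivotWider spreads name/value pairs into one column per unique name.
// names plays the role of NamesFrom, values the role of ValuesFrom.
func pivotWider(names []string, values []float64) map[string][]float64 {
	wide := make(map[string][]float64)
	for i, n := range names {
		wide[n] = append(wide[n], values[i])
	}
	return wide
}

func main() {
	wide := map[string][]float64{"q1": {1, 2}, "q2": {3, 4}}
	names, values := pivotLonger(wide, []string{"q1", "q2"}, 2)
	fmt.Println(names, values)             // [q1 q2 q1 q2] [1 3 2 4]
	fmt.Println(pivotWider(names, values)) // map[q1:[1 2] q2:[3 4]]
}
```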

type Reshaper

type Reshaper interface {
	PivotLonger(ds Table, spec PivotLongerSpec) (Table, error)
	PivotWider(ds Table, spec PivotWiderSpec) (Table, error)
	Separate(ds Table, col string, into []string, sep string) (Table, error)
	Concatenate(ds Table, col string, from []string, sep string) (Table, error)
	Complete(ds Table, cols ...string) (Table, error)
}

Reshaper provides reshape/pivot operations. For Arrow: lazy column views (repeatedView, interleavedView). For SQL: generates CASE WHEN / UNPIVOT / CROSSTAB.

type Schema

type Schema struct {
	// contains filtered or unexported fields
}

Schema describes the complete structure of a dataset — an ordered collection of Fields with a name-to-index lookup. This maps directly to arrow.Schema.

func NewSchema

func NewSchema(fields ...Field) *Schema

NewSchema creates a Schema from an ordered list of fields. Panics if any two fields share the same name.

func (*Schema) Field

func (s *Schema) Field(i int) Field

Field returns the field at index i.

func (*Schema) FieldIndex

func (s *Schema) FieldIndex(name string) int

FieldIndex returns the index of the named field, or -1.

func (*Schema) Fields

func (s *Schema) Fields() []Field

Fields returns a copy of the schema's fields.

func (*Schema) HasField

func (s *Schema) HasField(name string) bool

HasField returns true if the schema contains a field with the given name.

func (*Schema) NumFields

func (s *Schema) NumFields() int

NumFields returns the number of fields.
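The documented Schema behavior (ordered fields, a name-to-index lookup, a panic on duplicate names, and -1 for a missing field) can be sketched as follows. The lowercase `field` and `schema` types are hypothetical stand-ins; the real Field also carries a type and metadata, which are omitted here.

```go
package main

import "fmt"

// field is a reduced stand-in for the package's Field.
type field struct {
	Name     string
	Nullable bool
}

// schema keeps fields in order plus a name -> index lookup.
type schema struct {
	fields []field
	index  map[string]int
}

// newSchema copies the fields and panics on duplicate names,
// matching the documented NewSchema contract.
func newSchema(fields ...field) *schema {
	s := &schema{
		fields: append([]field(nil), fields...),
		index:  make(map[string]int, len(fields)),
	}
	for i, f := range fields {
		if _, dup := s.index[f.Name]; dup {
			panic(fmt.Sprintf("duplicate field name %q", f.Name))
		}
		s.index[f.Name] = i
	}
	return s
}

// fieldIndex returns the index of the named field, or -1.
func (s *schema) fieldIndex(name string) int {
	if i, ok := s.index[name]; ok {
		return i
	}
	return -1
}

// numFields returns the number of fields.
func (s *schema) numFields() int { return len(s.fields) }

func main() {
	s := newSchema(field{Name: "x"}, field{Name: "y", Nullable: true})
	fmt.Println(s.numFields(), s.fieldIndex("y"), s.fieldIndex("missing")) // 2 1 -1
}
```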

type Selector

type Selector interface {
	// Select reorders/selects rows by index (scatter-gather).
	// This is the Arrow "Take" kernel.
	Select(col AnyColumn, indices []int) (AnyColumn, error)

	// Slice returns rows [start, end) from a column.
	// For Arrow: zero-copy via array.NewSlice.
	Slice(col AnyColumn, start, end int) (AnyColumn, error)

	// SortIndices returns the permutation that sorts the column ascending.
	// Returns an int slice, not a column; pass it to Select (the Take kernel).
	SortIndices(col AnyColumn) ([]int, error)

	// FilterIndices returns the row indices where mask[i] == true.
	// Returns an int slice for use with Select (the Take kernel).
	FilterIndices(mask []bool) []int
}

Selector provides engine-native column/row manipulation primitives. These are the building blocks for Frame verbs (Select, Arrange, Head, etc.).

For Arrow: zero-copy slicing, compute Take kernel, sort-indices kernel. For Memory: direct slice operations. For SQL: generates ORDER BY, LIMIT/OFFSET, WHERE rowid IN (...).
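How these primitives compose can be sketched on a plain float64 slice: index-producing steps (sort, filter) feed a gather step. The functions below are hypothetical local helpers that mirror SortIndices, FilterIndices and Select in spirit only; the use of a stable sort is my assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// sortIndices returns the permutation that sorts col ascending.
// A stable sort keeps equal values in their original order.
func sortIndices(col []float64) []int {
	idx := make([]int, len(col))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool { return col[idx[a]] < col[idx[b]] })
	return idx
}

// filterIndices returns the positions where mask is true.
func filterIndices(mask []bool) []int {
	out := make([]int, 0, len(mask))
	for i, keep := range mask {
		if keep {
			out = append(out, i)
		}
	}
	return out
}

// take gathers col[indices[i]] into a new slice (the "Take" operation).
func take(col []float64, indices []int) ([]float64, error) {
	out := make([]float64, len(indices))
	for i, j := range indices {
		if j < 0 || j >= len(col) {
			return nil, fmt.Errorf("index %d out of range [0, %d)", j, len(col))
		}
		out[i] = col[j]
	}
	return out, nil
}

func main() {
	col := []float64{3, 1, 2}
	sorted, _ := take(col, sortIndices(col))
	kept, _ := take(col, filterIndices([]bool{true, false, true}))
	fmt.Println(sorted, kept) // [1 2 3] [3 2]
}
```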

type StatKernel added in v0.0.5

type StatKernel interface {
	// Histogram bins a numeric column into equal-width bins.
	// Returns a Table with columns: "x" (bin centers) and "count" (frequencies).
	// nBins <= 0 means auto-select using Sturges' rule.
	Histogram(col AnyColumn, nBins int) (Table, error)

	// KDE computes kernel density estimation over a numeric column.
	// Returns a Table with columns: "x" (grid points) and "density".
	// bandwidth <= 0 means Silverman auto-select. points is the output grid size.
	KDE(ctx context.Context, col AnyColumn, bandwidth float64, points int) (Table, error)

	// LinearFit computes OLS linear regression y = a + b*x.
	// Returns a Table with columns: "x" (grid) and "y" (fitted values).
	// nOut is the number of output grid points.
	LinearFit(xCol, yCol AnyColumn, nOut int) (Table, error)

	// LoessFit computes locally weighted regression (LOESS).
	// Returns a Table with columns: "x" (grid) and "y" (fitted values).
	// nOut is the number of output grid points.
	LoessFit(ctx context.Context, xCol, yCol AnyColumn, nOut int) (Table, error)

	// LinearFitSE computes OLS regression with 95% confidence bands.
	// Returns a Table with columns: "x", "y" (fitted), "ymin", "ymax".
	// nOut is the number of output grid points.
	LinearFitSE(xCol, yCol AnyColumn, nOut int) (Table, error)

	// LoessFitSE computes LOESS with approximate 95% confidence bands.
	// Returns a Table with columns: "x", "y" (fitted), "ymin", "ymax".
	// nOut is the number of output grid points.
	LoessFitSE(ctx context.Context, xCol, yCol AnyColumn, nOut int) (Table, error)

	// Boxplot computes the five-number summary for a numeric column,
	// optionally grouped by a categorical column.
	// Returns a Table with columns: "x", "lower", "q1", "middle", "q3",
	// "upper", "notch_lower", "notch_upper".
	// groupCol may be nil for a single-group boxplot.
	// whisker is "tukey" (1.5*IQR) or "range" (min-max).
	Boxplot(yCol, groupCol AnyColumn, whisker string, notch bool) (Table, error)
}

StatKernel provides statistical compute kernels that produce new Tables. These are higher-level operations that consume one or more columns and produce a complete result table.

For Memory/Arrow: implemented via go-highway SIMD + stdlib math. For SQL: an engine could generate UDFs or fall back to client-side computation.
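The Histogram contract (equal-width bins, bin centers as "x", Sturges' rule when nBins <= 0) can be sketched on a raw slice. This is an illustration rather than the engine's code: `histogram` is a hypothetical helper, Sturges' rule is taken as ceil(log2(n)) + 1, the maximum value is placed in the last bin, and NaN or null handling is ignored.

```go
package main

import (
	"fmt"
	"math"
)

// histogram bins data into equal-width bins and returns bin centers and
// counts. nBins <= 0 selects the bin count with Sturges' rule.
func histogram(data []float64, nBins int) ([]float64, []int) {
	if len(data) == 0 {
		return nil, nil
	}
	if nBins <= 0 {
		nBins = int(math.Ceil(math.Log2(float64(len(data))))) + 1
	}
	lo, hi := data[0], data[0]
	for _, v := range data {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	width := (hi - lo) / float64(nBins)
	if width == 0 {
		width = 1 // all values equal: avoid dividing by zero
	}
	centers := make([]float64, nBins)
	counts := make([]int, nBins)
	for b := range centers {
		centers[b] = lo + (float64(b)+0.5)*width
	}
	for _, v := range data {
		b := int((v - lo) / width)
		if b >= nBins {
			b = nBins - 1 // the maximum falls into the last bin
		}
		counts[b]++
	}
	return centers, counts
}

func main() {
	centers, counts := histogram([]float64{0, 1, 2, 3, 4, 5, 6, 7}, 0)
	fmt.Println(centers, counts)
}
```

With eight values, Sturges' rule gives four bins, so each bin in this example holds two values.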

type StringAppender

type StringAppender interface {
	Append(v string)
	AppendNull()
	AppendValues(vs []string)
	Reserve(n int)
}

StringAppender streams string values into a column.

type StringOpt

type StringOpt = func([]string) []string

StringOpt transforms a string slice.

type Table

type Table interface {
	// Schema returns the dataset's schema.
	Schema() *Schema

	// Column retrieves a named column. Returns [ColumnNotFoundError] if absent.
	// The returned [AnyColumn] can be type-asserted to [Column[T]] for typed
	// access, or use [GetColumn] for a safe generic retrieval.
	Column(name string) (AnyColumn, error)

	// NumRows returns the logical number of rows.
	NumRows() int64

	// NumCols returns the number of columns.
	NumCols() int64
}

Table represents an immutable, columnar data source.

Implementations include in-memory tables, Arrow tables, and BigQuery-backed remote tables. ETL verbs are exposed by wrapping a Table in a Dataset (the fluent API defined in frame.go) via From.

type Windower

type Windower interface {
	Lag(col AnyColumn, n int) (AnyColumn, error)
	Lead(col AnyColumn, n int) (AnyColumn, error)
	CumSum(col AnyColumn) (AnyColumn, error)
	CumMax(col AnyColumn) (AnyColumn, error)
	CumMin(col AnyColumn) (AnyColumn, error)
	Rank(col AnyColumn) (AnyColumn, error)
	DenseRank(col AnyColumn) (AnyColumn, error)
	PercentRank(col AnyColumn) (AnyColumn, error)
	RowNumber(n int) (AnyColumn, error)
}

Windower provides window function kernels. For Arrow: streaming accumulators over Arrow arrays. For SQL: generates OVER() / WINDOW clauses.
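Two of the window kernels, Lag/Lead and CumSum, can be sketched over a float64 slice. The helpers are hypothetical and simplified: NaN stands in for a null entry, and expressing a lead as a lag with a negative offset is a convenience of this sketch, not documented package behavior.

```go
package main

import (
	"fmt"
	"math"
)

// lag shifts values down by n rows; rows without a source value are
// filled with NaN as a stand-in for null. A negative n acts as a lead.
func lag(col []float64, n int) []float64 {
	out := make([]float64, len(col))
	for i := range out {
		if src := i - n; src >= 0 && src < len(col) {
			out[i] = col[src]
		} else {
			out[i] = math.NaN()
		}
	}
	return out
}

// cumSum returns the running total of col.
func cumSum(col []float64) []float64 {
	out := make([]float64, len(col))
	total := 0.0
	for i, v := range col {
		total += v
		out[i] = total
	}
	return out
}

func main() {
	col := []float64{1, 2, 3}
	fmt.Println(lag(col, 1))  // [NaN 1 2]
	fmt.Println(lag(col, -1)) // [2 3 NaN]
	fmt.Println(cumSum(col))  // [1 3 6]
}
```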

Directories

Path Synopsis
Package arrow provides an Apache Arrow-backed compute engine for the dataset package.
csv
Package csv provides the Arrow CSV engine driver.
parquet
Package parquet provides the Arrow Parquet engine driver.
Package bigquery implements a BigQuery SQL pushdown engine for the dataset library.
Package compute provides portable SIMD primitives for the dataset engines.
Package csv provides CSV reading and writing for the dataset package.
Package math provides SIMD-accelerated mathematical transforms for the dataset engines.
Package memory provides a lightweight Go-slice-backed compute engine for the dataset package.
csv
Package csv provides the Memory CSV engine driver.
parquet
Package parquet provides the Memory Parquet engine driver.
Package parquet provides Parquet reading and writing for the dataset package.
Package sort provides SIMD-accelerated sorting for the dataset engines.
