Overview ¶
Package dataset provides zero-copy, lazy-evaluating columnar data abstractions for the Grammar of Graphics pipeline.
Engine-First Architecture ¶
Every data operation is delegated to an Engine backend. The dataset package defines only interfaces and contracts — no concrete column types, no fallbacks. Engines (Arrow, memory, SQL) implement sub-interfaces (Aggregator, Windower, Joiner, etc.) for the operations they support.
Type System ¶
The type system is aligned with Apache Arrow:
Index ¶
- func Close(ds Table) error
- func Names(ds Table) []string
- type AggFunc
- type AggSpec
- type Aggregator
- type AndPred
- type AnyColumn
- type BetweenPred
- type BoolAppender
- type BoolMask
- type Builder
- type BuilderFactory
- type CSVConfig
- type CSVReader
- type CSVWriter
- type Caster
- type Closer
- type Column
- type ColumnFactory
- type CompPred
- type Composer
- type DType
- type Dataset
- func (f Dataset) AntiJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Arrange(cols ...string) Dataset
- func (f Dataset) Collect() (Table, error)
- func (f Dataset) Column(name string) (AnyColumn, error)
- func (f Dataset) Combine(others ...Table) Dataset
- func (f Dataset) Distinct(cols ...string) Dataset
- func (f Dataset) DropNA(cols ...string) Dataset
- func (f Dataset) Err() error
- func (f Dataset) Fill(col string, dir FillDirection) Dataset
- func (f Dataset) Filter(mask Masker) Dataset
- func (f Dataset) FullJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) GroupBy(cols ...string) GroupedFrame
- func (f Dataset) Head(n int) Dataset
- func (f Dataset) InnerJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) LeftJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Mutate(name string, fn MutateFunc) Dataset
- func (f Dataset) NumCols() int64
- func (f Dataset) NumRows() int64
- func (f Dataset) PivotLonger(spec PivotLongerSpec) Dataset
- func (f Dataset) PivotWider(spec PivotWiderSpec) Dataset
- func (f Dataset) Rename(oldName, newName string) Dataset
- func (f Dataset) RightJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Schema() *Schema
- func (f Dataset) Select(cols ...string) Dataset
- func (f Dataset) SemiJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Separate(col string, into []string, sep string) Dataset
- func (f Dataset) Slice(start, end int) Dataset
- func (f Dataset) Stack(others ...Table) Dataset
- func (f Dataset) Table() Table
- func (f Dataset) Tail(n int) Dataset
- type Engine
- type ErrColumnNotFound
- type Field
- type FillDirection
- type Filler
- type Filterer
- type Float64Appender
- type GroupedFrame
- type HasEngine
- type InPred
- type Int64Appender
- type IsNotNullPred
- type IsNullPred
- type JoinSpec
- type JoinType
- type Joiner
- type Masker
- type MathKernel
- type MutateFunc
- type NotPred
- type Op
- type OrPred
- type ParquetConfig
- type ParquetReader
- type ParquetWriter
- type PivotLongerSpec
- type PivotWiderSpec
- type Reshaper
- type Schema
- type Selector
- type StringAppender
- type Table
- type Windower
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
Types ¶
type AggSpec ¶
type AggSpec struct {
	OutputName string  // name of the result column
	InputName  string  // name of the source column
	Fn         AggFunc // which aggregation to apply
}
AggSpec describes a single aggregation to apply in Summarize.
type Aggregator ¶
type Aggregator interface {
	Sum(col AnyColumn) (AnyColumn, error)
	Mean(col AnyColumn) (AnyColumn, error)
	MinMax(col AnyColumn) (min AnyColumn, max AnyColumn, err error)
	Count(col AnyColumn) (AnyColumn, error)
	Median(col AnyColumn) (AnyColumn, error)
	Variance(col AnyColumn) (AnyColumn, error)
}
Aggregator provides vectorized aggregation kernels. All methods return a single-element AnyColumn; the result type follows Arrow compute kernel rules:
- Sum: numeric → same type (int64→int64, float64→float64)
- Mean: numeric → float64 (always widens)
- MinMax: any ordered type → (min, max) of same type
- Count: any → int64
- Median: numeric → float64
- Variance: numeric → float64
For Arrow: delegates to arrow/math SIMD operations. For SQL: generates SELECT SUM/AVG/MIN/MAX/COUNT queries.
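A minimal sketch of these type rules, assuming a memory-style engine over raw slices (the real kernels take and return AnyColumn): Sum preserves the input type, while Mean always widens to float64.

```go
package main

import "fmt"

// sumInt64 preserves the input type: int64 in, int64 out,
// mirroring the rule "Sum: numeric -> same type".
func sumInt64(xs []int64) int64 {
	var s int64
	for _, v := range xs {
		s += v
	}
	return s
}

// meanInt64 always widens to float64, mirroring
// "Mean: numeric -> float64 (always widens)".
func meanInt64(xs []int64) float64 {
	if len(xs) == 0 {
		return 0
	}
	return float64(sumInt64(xs)) / float64(len(xs))
}

func main() {
	xs := []int64{1, 2, 4}
	fmt.Println(sumInt64(xs), meanInt64(xs))
}
```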
type AnyColumn ¶
AnyColumn is the type-erased column interface. This is what Dataset stores, engines operate on, and maps hold. Every engine-native column type implements this.
type BetweenPred ¶
func Between ¶
func Between(col string, lo, hi any) BetweenPred
func (BetweenPred) Expr ¶
func (p BetweenPred) Expr() string
type BoolAppender ¶
BoolAppender streams bool values into a column.
type BoolMask ¶
type BoolMask []bool
BoolMask is a pre-computed boolean mask that implements Masker. Useful when the filter has already been computed externally (e.g. faceting).
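A sketch of how a pre-computed mask can satisfy the Masker contract, using a hypothetical minimal stand-in for Table (only NumRows is needed): the mask returns itself after a length check.

```go
package main

import "fmt"

// table is a minimal stand-in for the package's Table interface,
// reduced to the one method Mask needs.
type table interface {
	NumRows() int64
}

type fixedRows int64

func (f fixedRows) NumRows() int64 { return int64(f) }

// BoolMask mirrors the documented type: a pre-computed mask that
// implements Masker by returning itself, guarded by a length check.
type BoolMask []bool

func (m BoolMask) Mask(ds table) ([]bool, error) {
	if int64(len(m)) != ds.NumRows() {
		return nil, fmt.Errorf("mask length %d != rows %d", len(m), ds.NumRows())
	}
	return m, nil
}

func main() {
	mask := BoolMask{true, false, true}
	kept, err := mask.Mask(fixedRows(3))
	fmt.Println(kept, err)
}
```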
type Builder ¶
type Builder interface {
	Float64(col string) Float64Appender
	Int64(col string) Int64Appender
	String(col string) StringAppender
	Bool(col string) BoolAppender
	Build() (Table, error)
}
Builder provides streaming, typed, zero-boxing construction. Each column has its own typed appender — no any boxing, no allocations per row.
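To illustrate the zero-boxing appender pattern, here is a hypothetical slice-backed Float64Appender; the package's engines provide their own implementations behind Builder.

```go
package main

import "fmt"

// floatAppender is a hypothetical slice-backed implementation of the
// Float64Appender contract: typed appends, no per-row interface boxing.
type floatAppender struct {
	vals  []float64
	nulls []bool
}

func (a *floatAppender) Append(v float64) {
	a.vals = append(a.vals, v)
	a.nulls = append(a.nulls, false)
}

func (a *floatAppender) AppendNull() {
	a.vals = append(a.vals, 0) // placeholder slot; the bitmap marks it null
	a.nulls = append(a.nulls, true)
}

func (a *floatAppender) AppendValues(vs []float64) {
	for _, v := range vs {
		a.Append(v)
	}
}

// Reserve pre-grows the backing slices so the next n appends
// do not reallocate.
func (a *floatAppender) Reserve(n int) {
	vals := make([]float64, len(a.vals), len(a.vals)+n)
	copy(vals, a.vals)
	a.vals = vals
	nulls := make([]bool, len(a.nulls), len(a.nulls)+n)
	copy(nulls, a.nulls)
	a.nulls = nulls
}

func main() {
	var x floatAppender
	x.Reserve(4)
	x.AppendValues([]float64{1.5, 2.5})
	x.AppendNull()
	fmt.Println(x.vals, x.nulls)
}
```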
type BuilderFactory ¶
BuilderFactory creates schema-aware builders for streaming construction.
type CSVConfig ¶
type CSVConfig struct {
	HasHeader  bool
	Comma      rune
	Comment    rune
	NullValues []string
	// ChunkSize is the number of rows per batch. 0 means engine default.
	// Arrow default: 65536, Memory default: unlimited.
	ChunkSize int
}
CSVConfig holds engine-agnostic CSV configuration. The dataset/csv facade constructs this from functional options and passes it to the engine's CSVReader/CSVWriter implementation.
type CSVReader ¶
type CSVReader interface {
	ReadCSV(ctx context.Context, r io.Reader, cfg CSVConfig) (Table, error)
}
CSVReader reads CSV data into an engine-native Dataset. Memory engine: uses go-simdcsv + schema inference. Arrow engine: uses arrow/csv.NewInferringReader for zero-copy ingest.
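The general shape of a ReadCSV implementation can be sketched with the stdlib encoding/csv package; readFloatCSV and its all-float64 inference are illustrative simplifications, not the engine's actual code.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// readFloatCSV is a toy version of a memory engine's ReadCSV: read the
// header row, then decode every cell as float64, treating the configured
// null tokens as nulls (stored here as 0 for brevity; a real engine
// would set a null bitmap).
func readFloatCSV(src string, nullValues []string) (header []string, cols [][]float64, err error) {
	r := csv.NewReader(strings.NewReader(src))
	rows, err := r.ReadAll()
	if err != nil || len(rows) == 0 {
		return nil, nil, err
	}
	header = rows[0]
	cols = make([][]float64, len(header))
	isNull := map[string]bool{}
	for _, nv := range nullValues {
		isNull[nv] = true
	}
	for _, row := range rows[1:] {
		for i, cell := range row {
			if isNull[cell] {
				cols[i] = append(cols[i], 0)
				continue
			}
			v, perr := strconv.ParseFloat(cell, 64)
			if perr != nil {
				return nil, nil, perr
			}
			cols[i] = append(cols[i], v)
		}
	}
	return header, cols, nil
}

func main() {
	h, c, err := readFloatCSV("x,y\n1,2\n3,NA\n", []string{"NA"})
	fmt.Println(h, c, err)
}
```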
type CSVWriter ¶
type CSVWriter interface {
	WriteCSV(ctx context.Context, w io.Writer, ds Table, cfg CSVConfig) error
}
CSVWriter writes a Dataset to CSV. Memory engine: uses go-simdcsv Writer. Arrow engine: uses go-simdcsv Writer (generic — CSV output is string-based).
type Caster ¶
Caster provides engine-controlled type casting. Casting is an engine operation — the engine knows its native column types and how to convert between them.
type Closer ¶
type Closer interface {
	Close() error
}
Closer is optionally implemented by datasets that hold resources requiring explicit cleanup (e.g., Arrow tables, database connections).
type Column ¶
Column is the typed access layer. Engine-specific column types implement both AnyColumn and Column[T] for their native type.
Values returns the underlying typed slice — zero-copy for both Arrow (returns the Arrow buffer) and memory (returns the Go slice).
IsNull returns the null bitmap. nil means no nulls (common case, zero alloc).
type ColumnFactory ¶
type ColumnFactory interface {
	NewFloat64Column(name string, data []float64) AnyColumn
	NewInt64Column(name string, data []int64) AnyColumn
	NewStringColumn(name string, data []string) AnyColumn
	NewBoolColumn(name string, data []bool) AnyColumn
	NewTimestampColumn(name string, data []int64) AnyColumn
	// FromColumns assembles columns into a Dataset with the given schema.
	// All columns must have the same length.
	FromColumns(schema *Schema, cols ...AnyColumn) (Table, error)
}
ColumnFactory wraps existing typed slices into engine-native columns. Memory engine: wraps the slice (zero-copy). Arrow engine: builds an Arrow array (one allocation).
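The zero-copy claim for the memory engine can be demonstrated with a hypothetical float64 column that wraps the caller's slice directly: mutations through the original slice remain visible through the column, because both share one backing array.

```go
package main

import "fmt"

// float64Column is a hypothetical memory-engine column: it wraps the
// caller's slice directly, so construction is zero-copy.
type float64Column struct {
	name string
	data []float64
}

func (c *float64Column) Name() string      { return c.name }
func (c *float64Column) Len() int          { return len(c.data) }
func (c *float64Column) Values() []float64 { return c.data }

func newFloat64Column(name string, data []float64) *float64Column {
	return &float64Column{name: name, data: data} // no copy
}

func main() {
	raw := []float64{1, 2, 3}
	col := newFloat64Column("x", raw)
	raw[0] = 99 // visible through the column: same backing array
	fmt.Println(col.Values()[0])
}
```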
type CompPred ¶
CompPred compares a column against a scalar value. Implements both Masker (local eval) and Expr() (SQL pushdown).
type Composer ¶
type Composer interface {
	Stack(datasets ...Table) (Table, error)
	Combine(datasets ...Table) (Table, error)
}
Composer provides row/column binding operations. For Arrow: zero-copy concatenation of Arrow arrays. For SQL: UNION ALL / lateral join.
type DType ¶
type DType int
DType represents the logical data type of a column. This is the type ID — analogous to arrow.Type.
const (
	// DTypeFloat64 is a 64-bit floating point column.
	DTypeFloat64 DType = iota
	// DTypeInt64 is a 64-bit integer column.
	DTypeInt64
	// DTypeString is a string/categorical column.
	DTypeString
	// DTypeBool is a boolean column.
	DTypeBool
	// DTypeTimestamp is a timestamp column stored as int64 nanoseconds
	// since the Unix epoch (1970-01-01T00:00:00Z). This representation
	// is zero-copy compatible with Arrow's TIMESTAMP(ns) type.
	DTypeTimestamp
	// DTypeUnknown is an unrecognized type.
	DTypeUnknown
)
type Dataset ¶
type Dataset struct {
	// contains filtered or unexported fields
}
Dataset is the fluent API for data manipulation. All verbs return a new Dataset (immutable chain). Every operation delegates to the dataset's engine via sub-interfaces — the Dataset never touches raw data directly.
Usage:
result, err := dataset.From(ds).
	Select("x", "y").
	Filter(dataset.Gt("x", 0)).
	Arrange("x").
	Collect()
func NewDataset ¶
NewDataset creates a Dataset from an engine and columns. The schema is inferred from the columns' names and types.
func ReplaceColumn ¶
ReplaceColumn replaces a named column in a Dataset with new float64 values. All other columns are preserved. Used for discrete-to-numeric remapping.
func (Dataset) Arrange ¶
Arrange sorts the dataset by the named columns (ascending). The engine's Selector.SortIndices computes the permutation; Selector.Select applies it.
func (Dataset) Collect ¶
Collect materializes the pipeline and returns the resulting Table, or the first error encountered in the chain.
func (Dataset) Column ¶
Convenience forwarding methods — allow Dataset to be used where Table is expected.
func (Dataset) Distinct ¶
Distinct removes duplicate rows based on the specified columns. If no columns are specified, all columns are used.
func (Dataset) GroupBy ¶
func (f Dataset) GroupBy(cols ...string) GroupedFrame
GroupBy specifies columns to group by. Returns a GroupedFrame for Summarize.
func (Dataset) Mutate ¶
func (f Dataset) Mutate(name string, fn MutateFunc) Dataset
Mutate appends or replaces a column using a MutateFunc.
func (Dataset) PivotLonger ¶
func (f Dataset) PivotLonger(spec PivotLongerSpec) Dataset
func (Dataset) PivotWider ¶
func (f Dataset) PivotWider(spec PivotWiderSpec) Dataset
func (Dataset) Slice ¶
Slice returns rows in the range [start, end). Engine's Selector.SliceColumn handles this — for Arrow, zero-copy via array.NewSlice.
type Engine ¶
type Engine interface {
	// Name returns a human-readable identifier (e.g., "arrow", "memory", "sql").
	Name() string
}
Engine is the marker interface that all compute backends implement.
type ErrColumnNotFound ¶
type ErrColumnNotFound struct {
	Name string
}
ErrColumnNotFound indicates a requested column does not exist.
func (*ErrColumnNotFound) Error ¶
func (e *ErrColumnNotFound) Error() string
type Field ¶
Field describes a single column in a dataset — its name, logical type, nullability, and optional metadata. This maps directly to arrow.Field.
Metadata carries type-specific parameters that DType alone cannot express:
- Timestamp timezone: {"tz": "America/Sao_Paulo"}
- Display format: {"format": "2006-01-02"}
- Units: {"unit": "ns"}
func NullableFloatCol ¶
NullableFloatCol creates a nullable float64 field.
func NullableIntCol ¶
NullableIntCol creates a nullable int64 field.
func NullableStringCol ¶
NullableStringCol creates a nullable string field.
func TimestampCol ¶
func (Field) WithMetadata ¶
WithMetadata returns a copy of the field with the given metadata.
func (Field) WithNullable ¶
WithNullable returns a copy of the field with Nullable set.
type FillDirection ¶
type FillDirection int
FillDirection specifies the direction for filling missing values.
const (
	// FillDown fills missing values with the previous non-null value (carry forward).
	FillDown FillDirection = iota
	// FillUp fills missing values with the next non-null value (carry backward).
	FillUp
)
type Filler ¶
type Filler interface {
	Fill(col AnyColumn, dir FillDirection) (AnyColumn, error)
	DropNA(ds Table, cols ...string) (Table, error)
	ReplaceNA(col AnyColumn, defaultVal float64) (AnyColumn, error)
}
Filler provides missing-value handling operations. For Arrow: streaming fill with zero allocation. For SQL: generates COALESCE / window-based fill.
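A sketch of the FillDown case, assuming the memory-engine representation of a value slice plus a parallel null bitmap; leading nulls have no predecessor and stay null.

```go
package main

import "fmt"

// fillDown carries the previous non-null value forward, the general
// shape of Filler.Fill with FillDown. Nulls are marked in a parallel
// bitmap; the input is not mutated.
func fillDown(vals []float64, null []bool) ([]float64, []bool) {
	out := append([]float64(nil), vals...)
	outNull := append([]bool(nil), null...)
	haveSeen := false
	var last float64
	for i := range out {
		if !outNull[i] {
			last, haveSeen = out[i], true
			continue
		}
		if haveSeen {
			out[i], outNull[i] = last, false // leading nulls stay null
		}
	}
	return out, outNull
}

func main() {
	v, n := fillDown(
		[]float64{0, 1, 0, 0, 4},
		[]bool{true, false, true, true, false},
	)
	fmt.Println(v, n)
}
```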
type Filterer ¶
Filterer provides mask-based row filtering. For Arrow: boolean mask filtering with zero-copy. For SQL: generates WHERE clauses.
type Float64Appender ¶
type Float64Appender interface {
	Append(v float64)
	AppendNull()
	AppendValues(vs []float64)
	Reserve(n int)
}
Float64Appender streams float64 values into a column.
type GroupedFrame ¶
type GroupedFrame struct {
	// contains filtered or unexported fields
}
GroupedFrame holds a Dataset with its group-by columns set.
func (GroupedFrame) Summarize ¶
func (gf GroupedFrame) Summarize(specs ...AggSpec) Dataset
Summarize applies aggregations per group using the engine's Aggregator. All computation is delegated to the engine — the GroupedFrame only orchestrates grouping.
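What GroupBy plus Summarize orchestrates can be sketched over raw slices: partition row indices by key, then run one aggregation kernel per group. groupMean is a hypothetical helper, not the package API; group order here follows first appearance.

```go
package main

import "fmt"

// groupMean sketches GroupBy("key") followed by a Mean aggregation:
// partition row indices by key, then apply the kernel per group.
func groupMean(keys []string, vals []float64) (outKeys []string, means []float64) {
	idx := map[string][]int{}
	for i, k := range keys {
		if _, seen := idx[k]; !seen {
			outKeys = append(outKeys, k) // preserve first-appearance order
		}
		idx[k] = append(idx[k], i)
	}
	for _, k := range outKeys {
		var s float64
		for _, i := range idx[k] {
			s += vals[i]
		}
		means = append(means, s/float64(len(idx[k])))
	}
	return outKeys, means
}

func main() {
	ks, ms := groupMean([]string{"a", "b", "a"}, []float64{1, 10, 3})
	fmt.Println(ks, ms)
}
```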
type HasEngine ¶
HasEngine is implemented by datasets that carry an engine reference. This enables engine propagation through transformations — stat packages and ggplot internals can produce new datasets using the same engine without importing engine-specific packages.
type Int64Appender ¶
type Int64Appender interface {
	Append(v int64)
	AppendNull()
	AppendValues(vs []int64)
	Reserve(n int)
}
Int64Appender streams int64 values into a column.
type IsNotNullPred ¶
type IsNotNullPred struct{ Col string }
func IsNotNull ¶
func IsNotNull(col string) IsNotNullPred
func (IsNotNullPred) Expr ¶
func (p IsNotNullPred) Expr() string
type IsNullPred ¶
type IsNullPred struct{ Col string }
func IsNull ¶
func IsNull(col string) IsNullPred
func (IsNullPred) Expr ¶
func (p IsNullPred) Expr() string
type JoinType ¶
type JoinType int
JoinType identifies the kind of join to perform.
const (
	// JoinLeft keeps all rows from the left dataset; unmatched right rows are null-filled.
	JoinLeft JoinType = iota
	// JoinRight keeps all rows from the right dataset; unmatched left rows are null-filled.
	JoinRight
	// JoinInner keeps only rows with matches in both datasets.
	JoinInner
	// JoinFull keeps all rows from both datasets; unmatched sides are null-filled.
	JoinFull
	// JoinSemi keeps rows from the left that have at least one match in the right.
	// No columns from the right are included.
	JoinSemi
	// JoinAnti keeps rows from the left that have NO match in the right.
	// No columns from the right are included.
	JoinAnti
)
type Joiner ¶
Joiner provides join operations across datasets. For Arrow: hash-join with lazy indexed column views. For SQL: generates JOIN ... ON ... clauses.
type Masker ¶
type Masker interface {
	// Mask computes a boolean mask of length int(ds.NumRows()). True entries are kept.
	Mask(ds Table) ([]bool, error)
}
Masker describes a row-level filter condition that can be lazily evaluated against a dataset to produce a boolean mask.
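Any type with a Mask method can drive Filter. Below is a hypothetical comparison predicate evaluated against a minimal Table stand-in, roughly the local-eval half of a predicate like Gt("x", 0) from the usage example above.

```go
package main

import "fmt"

// tinyTable is a minimal stand-in for the package's Table, exposing just
// enough for a predicate to read one float64 column.
type tinyTable struct {
	cols map[string][]float64
}

// gtPred is a hypothetical Masker: keep rows where Col > Val.
type gtPred struct {
	Col string
	Val float64
}

func (p gtPred) Mask(ds *tinyTable) ([]bool, error) {
	col, ok := ds.cols[p.Col]
	if !ok {
		return nil, fmt.Errorf("column not found: %s", p.Col)
	}
	mask := make([]bool, len(col))
	for i, v := range col {
		mask[i] = v > p.Val
	}
	return mask, nil
}

func main() {
	tbl := &tinyTable{cols: map[string][]float64{"x": {-1, 0, 2}}}
	m, err := gtPred{Col: "x", Val: 0}.Mask(tbl)
	fmt.Println(m, err)
}
```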
type MathKernel ¶
type MathKernel interface {
	// Binary arithmetic (column × column, same length required)
	AddCols(a, b AnyColumn) (AnyColumn, error)
	SubCols(a, b AnyColumn) (AnyColumn, error)
	MulCols(a, b AnyColumn) (AnyColumn, error)
	DivCols(a, b AnyColumn) (AnyColumn, error)
	// Scalar arithmetic (column × scalar)
	AddScalar(col AnyColumn, val float64) (AnyColumn, error)
	MulScalar(col AnyColumn, val float64) (AnyColumn, error)
	// Unary numeric
	Abs(col AnyColumn) (AnyColumn, error)
	Neg(col AnyColumn) (AnyColumn, error)
	Sign(col AnyColumn) (AnyColumn, error)
	Sqrt(col AnyColumn) (AnyColumn, error)
	Pow(col AnyColumn, exp float64) (AnyColumn, error)
	// Transcendental — logarithmic
	Exp(col AnyColumn) (AnyColumn, error)
	Ln(col AnyColumn) (AnyColumn, error)
	Log2(col AnyColumn) (AnyColumn, error)
	Log10(col AnyColumn) (AnyColumn, error)
	// Transcendental — trigonometric
	Sin(col AnyColumn) (AnyColumn, error)
	Cos(col AnyColumn) (AnyColumn, error)
	Tan(col AnyColumn) (AnyColumn, error)
	Asin(col AnyColumn) (AnyColumn, error)
	Acos(col AnyColumn) (AnyColumn, error)
	Atan(col AnyColumn) (AnyColumn, error)
	Atan2(y, x AnyColumn) (AnyColumn, error)
	// Transcendental — hyperbolic / special
	Tanh(col AnyColumn) (AnyColumn, error)
	Sigmoid(col AnyColumn) (AnyColumn, error)
	Erf(col AnyColumn) (AnyColumn, error)
	// Rounding
	Round(col AnyColumn) (AnyColumn, error)
	Floor(col AnyColumn) (AnyColumn, error)
	Ceil(col AnyColumn) (AnyColumn, error)
	// Bitwise (int64 columns only)
	BitAnd(a, b AnyColumn) (AnyColumn, error)
	BitOr(a, b AnyColumn) (AnyColumn, error)
	BitXor(a, b AnyColumn) (AnyColumn, error)
	BitNot(col AnyColumn) (AnyColumn, error)
	BitShiftLeft(col AnyColumn, n int) (AnyColumn, error)
	BitShiftRight(col AnyColumn, n int) (AnyColumn, error)
}
MathKernel provides element-wise mathematical transforms on numeric columns.
Arrow engine: uses the Arrow compute Datum API when available, highway SIMD for gaps. Memory engine: uses highway SIMD on raw slices, falls back to the math stdlib. SQL engine: generates SQL math functions (EXP, LOG, SIN, etc.).
All methods require float64 columns unless noted (bitwise requires int64).
type MutateFunc ¶
type MutateFunc interface {
	// Apply produces a new column from the dataset.
	Apply(ds Table) (AnyColumn, error)
}
MutateFunc describes a column transformation for Mutate.
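A MutateFunc reads existing columns and returns a whole new one. Here is a sketch with minimal stand-ins, where the "column" is just a named float64 slice and ratio is a hypothetical transform (the real interface returns an AnyColumn).

```go
package main

import "fmt"

// cols is a toy dataset: column name -> values.
type cols map[string][]float64

// ratio builds a MutateFunc-style transform: new column = num / den,
// elementwise, with a length check like an engine kernel would do.
func ratio(num, den string) func(ds cols) ([]float64, error) {
	return func(ds cols) ([]float64, error) {
		a, b := ds[num], ds[den]
		if len(a) != len(b) {
			return nil, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
		}
		out := make([]float64, len(a))
		for i := range a {
			out[i] = a[i] / b[i]
		}
		return out, nil
	}
}

func main() {
	ds := cols{"dist": {10, 30}, "time": {2, 5}}
	speed, err := ratio("dist", "time")(ds)
	fmt.Println(speed, err)
}
```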
type ParquetConfig ¶
type ParquetConfig struct {
	// Compression codec: "snappy", "gzip", "zstd", "lz4", "none".
	Compression string
}
ParquetConfig holds engine-agnostic Parquet configuration.
type ParquetReader ¶
type ParquetReader interface {
	ReadParquet(ctx context.Context, r io.ReaderAt, size int64, cfg ParquetConfig) (Table, error)
}
ParquetReader reads Parquet data into an engine-native Dataset. Memory engine: uses parquet-go for struct-based row reading. Arrow engine: uses pqarrow.ReadTable for zero-copy columnar ingest.
type ParquetWriter ¶
type ParquetWriter interface {
	WriteParquet(ctx context.Context, w io.Writer, ds Table, cfg ParquetConfig) error
}
ParquetWriter writes a Dataset to Parquet format. Memory engine: uses parquet-go GenericWriter. Arrow engine: uses pqarrow.WriteTable.
type PivotLongerSpec ¶
type PivotLongerSpec struct {
	// Cols are the column names to pivot from wide to long format.
	// These columns are "gathered" into a single name+value pair.
	Cols []string
	// NamesTo is the output column name that will hold the original column names.
	NamesTo string
	// ValuesTo is the output column name that will hold the values.
	ValuesTo string
}
PivotLongerSpec configures a PivotLonger operation.
type PivotWiderSpec ¶
type PivotWiderSpec struct {
	// NamesFrom is the column whose unique values become new column names.
	NamesFrom string
	// ValuesFrom is the column whose values fill the new columns.
	ValuesFrom string
}
PivotWiderSpec configures a PivotWider operation.
type Reshaper ¶
type Reshaper interface {
	PivotLonger(ds Table, spec PivotLongerSpec) (Table, error)
	PivotWider(ds Table, spec PivotWiderSpec) (Table, error)
	Separate(ds Table, col string, into []string, sep string) (Table, error)
	Concatenate(ds Table, col string, from []string, sep string) (Table, error)
	Complete(ds Table, cols ...string) (Table, error)
}
Reshaper provides reshape/pivot operations. For Arrow: lazy column views (repeatedView, interleavedView). For SQL: generates CASE WHEN / UNPIVOT / CROSSTAB.
type Schema ¶
type Schema struct {
	// contains filtered or unexported fields
}
Schema describes the complete structure of a dataset — an ordered collection of Fields with a name-to-index lookup. This maps directly to arrow.Schema.
func NewSchema ¶
NewSchema creates a Schema from an ordered list of fields. Panics if any two fields share the same name.
func (*Schema) FieldIndex ¶
FieldIndex returns the index of the named field, or -1.
type Selector ¶
type Selector interface {
	// Select reorders/selects rows by index (scatter-gather).
	// This is the Arrow "Take" kernel.
	Select(col AnyColumn, indices []int) (AnyColumn, error)
	// Slice returns rows [start, end) from a column.
	// For Arrow: zero-copy via array.NewSlice.
	Slice(col AnyColumn, start, end int) (AnyColumn, error)
	// SortIndices returns the permutation that sorts the column ascending.
	// Returns an int slice, not a column — it's metadata for Take().
	SortIndices(col AnyColumn) ([]int, error)
	// FilterIndices returns the row indices where mask[i] == true.
	// Returns an int slice for use with Take().
	FilterIndices(mask []bool) []int
}
Selector provides engine-native column/row manipulation primitives. These are the building blocks for Dataset verbs (Select, Arrange, Head, etc.).
For Arrow: zero-copy slicing, compute Take kernel, sort-indices kernel. For Memory: direct slice operations. For SQL: generates ORDER BY, LIMIT/OFFSET, WHERE rowid IN (...).
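To show how a verb like Arrange composes these primitives: compute the permutation once with SortIndices, then gather every column with the Take-style Select. A sketch over raw slices, assuming plain Go in place of the engine kernels:

```go
package main

import (
	"fmt"
	"sort"
)

// sortIndices mirrors Selector.SortIndices: the permutation that sorts
// the column ascending, returned as plain indices.
func sortIndices(col []float64) []int {
	idx := make([]int, len(col))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool { return col[idx[a]] < col[idx[b]] })
	return idx
}

// take mirrors Selector.Select (the Arrow "Take" kernel): gather rows
// by index into a new column.
func take(col []float64, indices []int) []float64 {
	out := make([]float64, len(indices))
	for i, j := range indices {
		out[i] = col[j]
	}
	return out
}

func main() {
	x := []float64{3, 1, 2}
	y := []float64{30, 10, 20}
	perm := sortIndices(x) // Arrange("x") computes this once...
	// ...then applies it to every column so rows stay aligned.
	fmt.Println(take(x, perm), take(y, perm))
}
```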
type StringAppender ¶
type StringAppender interface {
	Append(v string)
	AppendNull()
	AppendValues(vs []string)
	Reserve(n int)
}
StringAppender streams string values into a column.
type Table ¶
type Table interface {
	// Schema returns the dataset's schema.
	Schema() *Schema
	// Column retrieves a named column. Returns [ErrColumnNotFound] if absent.
	// The returned [AnyColumn] can be type-asserted to [Column[T]] for typed
	// access, or use [GetColumn] for a safe generic retrieval.
	Column(name string) (AnyColumn, error)
	// NumRows returns the logical number of rows.
	NumRows() int64
	// NumCols returns the number of columns.
	NumCols() int64
}
Table represents an immutable, columnar data source.
Implementations include in-memory frames, Arrow tables, and SQL-backed remote tables. All ETL operations are available via [Dataset].
type Windower ¶
type Windower interface {
	Lag(col AnyColumn, n int) (AnyColumn, error)
	Lead(col AnyColumn, n int) (AnyColumn, error)
	CumSum(col AnyColumn) (AnyColumn, error)
	CumMax(col AnyColumn) (AnyColumn, error)
	CumMin(col AnyColumn) (AnyColumn, error)
	Rank(col AnyColumn) (AnyColumn, error)
	DenseRank(col AnyColumn) (AnyColumn, error)
	PercentRank(col AnyColumn) (AnyColumn, error)
	RowNumber(n int) (AnyColumn, error)
}
Windower provides window function kernels. For Arrow: streaming accumulators over Arrow arrays. For SQL: generates OVER() / WINDOW clauses.
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| arrow | Package arrow provides an Apache Arrow-backed compute engine for the dataset package. |
| bigquery | Package bigquery implements a BigQuery SQL pushdown engine for the dataset library. |
| compute | Package compute provides portable SIMD primitives for the dataset engines. |
| csv | Package csv provides CSV reading and writing for the dataset package. |
| math | Package math provides SIMD-accelerated mathematical transforms for the dataset engines. |
| memory | Package memory provides a lightweight Go-slice-backed compute engine for the dataset package. |
| parquet | Package parquet provides Parquet reading and writing for the dataset package. |
| sort | Package sort provides SIMD-accelerated sorting for the dataset engines. |