Documentation ¶
Overview ¶
Package dataset provides columnar data abstractions for the Grammar of Graphics pipeline. Frame verbs execute eagerly via the dataset's engine (memory and arrow backends materialize on each verb); the BigQuery engine is the only backend with internal lazy SQL accumulation. Arrow IPC and Parquet ingest paths support zero-copy reads.
Engine-First Architecture ¶
Every data operation is delegated to an Engine backend. The dataset package defines only interfaces and contracts — no concrete column types, no fallbacks. Engines (Arrow, memory, SQL) implement sub-interfaces (Aggregator, Windower, Joiner, etc.) for the operations they support.
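The capability model above can be sketched with a plain type assertion. This is a minimal illustration, not the package's code: `Engine` and `Aggregator` are simplified stand-ins (the real interfaces operate on AnyColumn), and `sliceEngine`, `bareEngine`, and `sumWith` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Engine and Aggregator are simplified stand-ins for the package's
// interfaces; the real Aggregator works on AnyColumn, not []float64.
type Engine interface{ Name() string }

type Aggregator interface {
	Sum(vals []float64) (float64, error)
}

// Local copy of the sentinel so the sketch compiles on its own.
var ErrUnsupportedEngine = errors.New("dataset: unsupported engine capability")

// sliceEngine (hypothetical) supports aggregation.
type sliceEngine struct{}

func (sliceEngine) Name() string { return "slice" }
func (sliceEngine) Sum(vals []float64) (float64, error) {
	total := 0.0
	for _, v := range vals {
		total += v
	}
	return total, nil
}

// bareEngine (hypothetical) implements only the marker interface.
type bareEngine struct{}

func (bareEngine) Name() string { return "bare" }

// sumWith dispatches to the engine only if it implements Aggregator.
func sumWith(eng Engine, vals []float64) (float64, error) {
	agg, ok := eng.(Aggregator)
	if !ok {
		return 0, fmt.Errorf("%s: %w", eng.Name(), ErrUnsupportedEngine)
	}
	return agg.Sum(vals)
}

func main() {
	fmt.Println(sumWith(sliceEngine{}, []float64{1, 2, 3}))
	fmt.Println(sumWith(bareEngine{}, []float64{1, 2, 3}))
}
```

Callers can branch on the failure with `errors.Is(err, ErrUnsupportedEngine)` because the sentinel is wrapped with `%w`.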
Type System ¶
The type system is aligned with Apache Arrow.
Index ¶
- Variables
- func Abs(s []float64) []float64
- func Clamp[T cmp.Ordered](lo, hi T) func([]T) []T
- func Clean(s []float64) []float64
- func Close(ds Table) error
- func Names(ds Table) []string
- func RegisterCSVReader(engineName string, r CSVReader)
- func RegisterCSVWriter(engineName string, w CSVWriter)
- func RegisterParquetReader(engineName string, r ParquetReader)
- func RegisterParquetWriter(engineName string, w ParquetWriter)
- func ScalarFloat64(col AnyColumn) (float64, bool)
- func Sorted[T cmp.Ordered](s []T) []T
- type AggFunc
- type AggSpec
- func Count(out, in string) AggSpec
- func First(out, in string) AggSpec
- func Last(out, in string) AggSpec
- func Max(out, in string) AggSpec
- func Mean(out, in string) AggSpec
- func Median(out, in string) AggSpec
- func Min(out, in string) AggSpec
- func Mode(out, in string) AggSpec
- func Percentile(out, in string, p float64) AggSpec
- func StdDev(out, in string) AggSpec
- func Sum(out, in string) AggSpec
- func Variance(out, in string) AggSpec
- type Aggregator
- type AndPred
- type AnyColumn
- type BetweenPred
- type BoolAppender
- type BoolMask
- type Builder
- type BuilderFactory
- type CSVConfig
- type CSVReader
- type CSVWriter
- type Caster
- type Closer
- type Column
- type ColumnFactory
- type ColumnNotFoundError
- type CompPred
- type Composer
- type DType
- type Dataset
- func (f Dataset) AntiJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Arrange(cols ...string) Dataset
- func (d Dataset) Bools(name string) ([]bool, error)
- func (f Dataset) Collect(ctx context.Context) (Dataset, error)
- func (f Dataset) Collected() bool
- func (f Dataset) Column(name string) (AnyColumn, error)
- func (f Dataset) Combine(others ...Table) Dataset
- func (f Dataset) Distinct(cols ...string) Dataset
- func (f Dataset) DropNA(cols ...string) Dataset
- func (f Dataset) Err() error
- func (f Dataset) Fill(col string, dir FillDirection) Dataset
- func (f Dataset) Filter(mask Masker) Dataset
- func (d Dataset) Float64(name string, opts ...Float64Opt) ([]float64, error)
- func (f Dataset) FullJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) GroupBy(cols ...string) GroupedFrame
- func (f Dataset) Head(n int) Dataset
- func (f Dataset) InnerJoin(other Table, spec JoinSpec) Dataset
- func (d Dataset) Int64(name string, opts ...Int64Opt) ([]int64, error)
- func (f Dataset) LeftJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Mutate(name string, fn MutateFunc) Dataset
- func (f Dataset) NumCols() int64
- func (f Dataset) NumRows() int64
- func (f Dataset) PivotLonger(spec PivotLongerSpec) Dataset
- func (f Dataset) PivotWider(spec PivotWiderSpec) Dataset
- func (f Dataset) Rename(oldName, newName string) Dataset
- func (f Dataset) RightJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Schema() *Schema
- func (f Dataset) Select(cols ...string) Dataset
- func (f Dataset) SelectRows(indices []int) (Dataset, error)
- func (f Dataset) SemiJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Separate(col string, into []string, sep string) Dataset
- func (f Dataset) Slice(start, end int) Dataset
- func (f Dataset) Stack(others ...Table) Dataset
- func (d Dataset) Strings(name string, opts ...StringOpt) ([]string, error)
- func (f Dataset) Table() Table
- func (f Dataset) Tail(n int) Dataset
- func (f Dataset) WithColumn(col AnyColumn) Dataset
- type Engine
- type Field
- func BoolCol(name string) Field
- func DateCol(name string) Field
- func FloatCol(name string) Field
- func IntCol(name string) Field
- func NullableFloatCol(name string) Field
- func NullableIntCol(name string) Field
- func NullableStringCol(name string) Field
- func StringCol(name string) Field
- func TimeCol(name string) Field
- func TimestampCol(name string) Field
- type FillDirection
- type Filler
- type Filterer
- type Float64Appender
- type Float64Opt
- type GroupedFrame
- type HasEngine
- type InPred
- type Int64Appender
- type Int64Opt
- type IsNotNullPred
- type IsNullPred
- type JoinSpec
- type JoinType
- type Joiner
- type Masker
- type MathKernel
- type MutateFunc
- type NotPred
- type Op
- type Optimizer
- type OrPred
- type ParquetConfig
- type ParquetReader
- type ParquetWriter
- type PivotLongerSpec
- type PivotWiderSpec
- type Reshaper
- type Schema
- type Selector
- type StatKernel
- type StringAppender
- type StringOpt
- type Table
- type Windower
Constants ¶
This section is empty.
Variables ¶
var (
	// ErrUncollected is returned when an operation requires a collected Dataset.
	ErrUncollected = errors.New("dataset: operation on uncollected Dataset — call Collect(ctx) first")
	// ErrUnsupportedEngine is returned when an engine lacks a required capability.
	ErrUnsupportedEngine = errors.New("dataset: unsupported engine capability")
	// ErrNoEngine is returned when a Dataset has no engine.
	ErrNoEngine = errors.New("dataset: Dataset requires an engine")
	// ErrInvalidSlice is returned when a slice range is invalid.
	ErrInvalidSlice = errors.New("dataset: invalid slice range")
	// ErrNoAggResults is returned when there are no aggregation results.
	ErrNoAggResults = errors.New("dataset: no aggregation results to merge")
	// ErrUnsupportedAggFunc is returned for unknown aggregation functions.
	ErrUnsupportedAggFunc = errors.New("dataset: unknown AggFunc")
	// ErrUnsupportedDType is returned for unsupported data types.
	ErrUnsupportedDType = errors.New("dataset: unsupported DType")
	// ErrTypeMismatch is returned when column types don't match.
	ErrTypeMismatch = errors.New("dataset: column type mismatch")
	// ErrColumnNotNumeric is returned when a numeric column is required.
	ErrColumnNotNumeric = errors.New("dataset: column is not numeric")
	// ErrUnsupportedPredicate is returned for unsupported filter predicates.
	ErrUnsupportedPredicate = errors.New("dataset: unsupported predicate operator")
)
Sentinel errors for the dataset package.
Functions ¶
func Clamp ¶
Clamp returns an option that clamps slice elements to the range [lo, hi]. Example: Clamp[int64](-5, 5), Clamp(0.0, 1.0)
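A minimal sketch of how an option with this signature could behave. Clamping in place (rather than returning a fresh slice) is an assumption of the sketch, not something the documentation states.

```go
package main

import (
	"cmp"
	"fmt"
)

// Clamp mirrors the documented signature. Mutating s in place is an
// assumption; the real implementation may allocate instead.
func Clamp[T cmp.Ordered](lo, hi T) func([]T) []T {
	return func(s []T) []T {
		for i, v := range s {
			switch {
			case v < lo:
				s[i] = lo
			case v > hi:
				s[i] = hi
			}
		}
		return s
	}
}

func main() {
	fmt.Println(Clamp[int64](-5, 5)([]int64{-10, 0, 10}))   // [-5 0 5]
	fmt.Println(Clamp(0.0, 1.0)([]float64{-0.5, 0.25, 2})) // [0 0.25 1]
}
```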
func Close ¶
Close releases resources if the dataset implements Closer. Safe to call on any Dataset — returns nil for datasets without resources.
func RegisterCSVReader ¶ added in v0.0.6
RegisterCSVReader registers a CSVReader implementation for an engine.
func RegisterCSVWriter ¶ added in v0.0.6
RegisterCSVWriter registers a CSVWriter implementation for an engine.
func RegisterParquetReader ¶ added in v0.0.6
func RegisterParquetReader(engineName string, r ParquetReader)
RegisterParquetReader registers a ParquetReader implementation for an engine.
func RegisterParquetWriter ¶ added in v0.0.6
func RegisterParquetWriter(engineName string, w ParquetWriter)
RegisterParquetWriter registers a ParquetWriter implementation for an engine.
func ScalarFloat64 ¶ added in v0.0.5
ScalarFloat64 extracts a single float64 from a 1-element aggregate column (e.g. the result of Aggregator.Sum). Returns 0, false if the column is empty, not float64, or has zero value.
Types ¶
type AggFunc ¶
type AggFunc int
AggFunc identifies an aggregation function.
const (
	AggSum AggFunc = iota
	AggMean
	AggMin
	AggMax
	AggCount
	AggMedian
	AggVariance
	AggStdDev     // population standard deviation = sqrt(variance)
	AggFirst      // first element
	AggLast       // last element
	AggMode       // most frequent value
	AggPercentile // quantile; requires AggSpec.P
)
AggSum is the sum aggregation.
type AggSpec ¶
type AggSpec struct {
OutputName string // name of the result column
InputName string // name of the source column
Fn AggFunc // which aggregation to apply
P float64 // percentile ∈ [0,1]; only used when Fn == AggPercentile
}
AggSpec describes a single aggregation to apply in Summarize.
func Percentile ¶ added in v0.0.5
Percentile builds a percentile aggregation spec. p ∈ [0,1].
type Aggregator ¶
type Aggregator interface {
Sum(col AnyColumn) (AnyColumn, error)
Mean(col AnyColumn) (AnyColumn, error)
MinMax(col AnyColumn) (mnCol AnyColumn, mxCol AnyColumn, err error)
Count(col AnyColumn) (AnyColumn, error)
Median(col AnyColumn) (AnyColumn, error)
Variance(col AnyColumn) (AnyColumn, error)
StdDev(col AnyColumn) (AnyColumn, error) // sqrt(variance)
First(col AnyColumn) (AnyColumn, error) // first element
Last(col AnyColumn) (AnyColumn, error) // last element
Mode(col AnyColumn) (AnyColumn, error) // most frequent value
Percentile(col AnyColumn, p float64) (AnyColumn, error) // quantile ∈ [0,1]
}
Aggregator provides vectorized aggregation kernels. All methods return AnyColumn (single-element column) preserving the input type — aligned with Arrow compute kernel type rules:
- Sum: numeric → same type (int64→int64, float64→float64)
- Mean: numeric → float64 (always widens)
- MinMax: any ordered type → (min, max) of same type
- Count: any → int64
- Median: numeric → float64
- Variance: numeric → float64
For Arrow: delegates to arrow/math SIMD operations. For SQL: generates SELECT SUM/AVG/MIN/MAX/COUNT queries.
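The result-type rules listed above can be illustrated on plain slices. This is only a sketch of the typing behaviour (Sum keeps the input type, Mean widens to float64, MinMax keeps any ordered type); `sum`, `mean`, and `minMax` are hypothetical helpers, not engine kernels.

```go
package main

import (
	"cmp"
	"fmt"
)

type number interface{ ~int64 | ~float64 }

// sum keeps the input type: int64 in, int64 out.
func sum[T number](vals []T) T {
	var total T
	for _, v := range vals {
		total += v
	}
	return total
}

// mean always widens to float64. Empty input yields NaN (0/0 at runtime).
func mean[T number](vals []T) float64 {
	return float64(sum(vals)) / float64(len(vals))
}

// minMax works for any ordered type; ok is false for empty input.
func minMax[T cmp.Ordered](vals []T) (lo, hi T, ok bool) {
	if len(vals) == 0 {
		return lo, hi, false
	}
	lo, hi = vals[0], vals[0]
	for _, v := range vals[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	return lo, hi, true
}

func main() {
	ints := []int64{1, 2}
	fmt.Println(sum(ints), mean(ints)) // 3 1.5
	fmt.Println(minMax([]string{"b", "a", "c"}))
}
```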
type AndPred ¶
type AndPred struct{ Preds []Masker }
AndPred combines masks with AND.
type AnyColumn ¶
AnyColumn is the type-erased column interface. This is what Dataset stores, engines operate on, and maps hold. Every engine-native column type implements this.
func ConstInt64Column ¶ added in v0.0.4
ConstInt64Column creates a constant int64 column with the given name and value, repeated n times. Useful for injecting system columns like PANEL.
func ConstStringColumn ¶ added in v0.0.4
ConstStringColumn creates a constant string column with the given name and value, repeated n times.
func Int64ColumnFromStrings ¶ added in v0.0.4
Int64ColumnFromStrings creates an int64 column by mapping distinct string values to 0-based indices, preserving first-occurrence order. Returns the column and the ordered list of distinct values.
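The mapping described for Int64ColumnFromStrings can be sketched on plain slices as follows. `indexStrings` is a hypothetical helper that returns raw slices instead of an AnyColumn.

```go
package main

import "fmt"

// indexStrings assigns each distinct string a 0-based code in order of
// first occurrence and returns the codes plus the ordered distinct values.
func indexStrings(vals []string) (codes []int64, levels []string) {
	seen := make(map[string]int64, len(vals))
	codes = make([]int64, len(vals))
	for i, v := range vals {
		code, ok := seen[v]
		if !ok {
			code = int64(len(levels))
			seen[v] = code
			levels = append(levels, v)
		}
		codes[i] = code
	}
	return codes, levels
}

func main() {
	codes, levels := indexStrings([]string{"b", "a", "b", "c"})
	fmt.Println(codes, levels) // [0 1 0 2] [b a c]
}
```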
type BetweenPred ¶
BetweenPred selects rows where a column value is between Lo and Hi.
func Between ¶
func Between(col string, lo, hi any) BetweenPred
Between builds a BETWEEN predicate for the given column and bounds.
func (BetweenPred) Expr ¶
func (p BetweenPred) Expr() string
Expr returns the SQL representation of this BETWEEN predicate.
type BoolAppender ¶
BoolAppender streams bool values into a column.
type BoolMask ¶
type BoolMask []bool
BoolMask is a pre-computed boolean mask that implements Masker. Useful when the filter has already been computed externally (e.g. faceting).
type Builder ¶
type Builder interface {
Float64(col string) Float64Appender
Int64(col string) Int64Appender
String(col string) StringAppender
Bool(col string) BoolAppender
Build() (Table, error)
}
Builder provides streaming, typed, zero-boxing construction. Each column has its own typed appender — no any boxing, no allocations per row.
type BuilderFactory ¶
BuilderFactory creates schema-aware builders for streaming construction.
type CSVConfig ¶
type CSVConfig struct {
HasHeader bool
Comma rune
Comment rune
NullValues []string
// ChunkSize is the number of rows per batch. 0 means engine default.
// Arrow default: 65536, Memory default: unlimited.
ChunkSize int
}
CSVConfig holds engine-agnostic CSV configuration. The dataset/csv facade constructs this from functional options and passes it to the engine's CSVReader/CSVWriter implementation.
type CSVReader ¶
type CSVReader interface {
ReadCSV(ctx context.Context, eng Engine, r io.Reader, cfg CSVConfig) (Table, error)
}
CSVReader reads CSV data into an engine-native Dataset. Memory engine: uses go-simdcsv + schema inference. Arrow engine: uses arrow/csv.NewInferringReader for zero-copy ingest.
func GetCSVReader ¶ added in v0.0.6
GetCSVReader retrieves a registered CSVReader for an engine.
type CSVWriter ¶
type CSVWriter interface {
WriteCSV(ctx context.Context, eng Engine, w io.Writer, ds Table, cfg CSVConfig) error
}
CSVWriter writes a Dataset to CSV. Memory engine: uses go-simdcsv Writer. Arrow engine: uses go-simdcsv Writer (generic — CSV output is string-based).
func GetCSVWriter ¶ added in v0.0.6
GetCSVWriter retrieves a registered CSVWriter for an engine.
type Caster ¶
Caster provides engine-controlled type casting. Casting is an engine operation — the engine knows its native column types and how to convert between them.
type Closer ¶
type Closer interface {
Close() error
}
Closer is optionally implemented by datasets that hold resources requiring explicit cleanup (e.g., Arrow tables, database connections).
type Column ¶
Column is the typed access layer. Engine-specific column types implement both AnyColumn and Column[T] for their native type.
Values returns the underlying typed slice — zero-copy for both Arrow (returns the Arrow buffer) and memory (returns the Go slice).
IsNull returns the null bitmap. nil means no nulls (common case, zero alloc).
type ColumnFactory ¶
type ColumnFactory interface {
NewFloat64Column(name string, data []float64) AnyColumn
NewInt64Column(name string, data []int64) AnyColumn
NewStringColumn(name string, data []string) AnyColumn
NewBoolColumn(name string, data []bool) AnyColumn
NewTimestampColumn(name string, data []int64) AnyColumn
// FromColumns assembles columns into a Dataset with the given schema.
// All columns must have the same length.
FromColumns(schema *Schema, cols ...AnyColumn) (Table, error)
}
ColumnFactory wraps existing typed slices into engine-native columns. Memory engine: wraps the slice (zero-copy). Arrow engine: builds an Arrow array (one allocation).
type ColumnNotFoundError ¶ added in v0.0.2
type ColumnNotFoundError struct {
Name string
}
ColumnNotFoundError indicates a requested column does not exist.
func (*ColumnNotFoundError) Error ¶ added in v0.0.2
func (e *ColumnNotFoundError) Error() string
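Because Error has a pointer receiver, callers would match this error with `errors.As` and a `*ColumnNotFoundError` target. A minimal sketch, assuming a local copy of the type; the message format, `lookup`, and `missingColumn` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copy of the documented type. The message format is assumed.
type ColumnNotFoundError struct {
	Name string
}

func (e *ColumnNotFoundError) Error() string {
	return "dataset: column not found: " + e.Name
}

// lookup is a hypothetical accessor that wraps the typed error.
func lookup(cols map[string][]float64, name string) ([]float64, error) {
	col, ok := cols[name]
	if !ok {
		return nil, fmt.Errorf("lookup: %w", &ColumnNotFoundError{Name: name})
	}
	return col, nil
}

// missingColumn extracts the column name if err wraps a ColumnNotFoundError.
func missingColumn(err error) (string, bool) {
	var nf *ColumnNotFoundError
	if errors.As(err, &nf) {
		return nf.Name, true
	}
	return "", false
}

func main() {
	_, err := lookup(map[string][]float64{"x": {1, 2}}, "y")
	fmt.Println(missingColumn(err)) // y true
}
```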
type CompPred ¶
CompPred compares a column against a scalar value. Implements both Masker (local eval) and Expr() (SQL pushdown).
type Composer ¶
type Composer interface {
Stack(datasets ...Table) (Table, error)
Combine(datasets ...Table) (Table, error)
}
Composer provides row/column binding operations. For Arrow: zero-copy concatenation of Arrow arrays. For SQL: UNION ALL / lateral join.
type DType ¶
type DType int
DType represents the logical data type of a column. This is the type ID — analogous to arrow.Type.
const (
	// DTypeFloat64 is a 64-bit floating point column.
	DTypeFloat64 DType = iota
	// DTypeInt64 is a 64-bit integer column.
	DTypeInt64
	// DTypeString is a string/categorical column.
	DTypeString
	// DTypeBool is a boolean column.
	DTypeBool
	// DTypeTimestamp is a timestamp column stored as int64 nanoseconds
	// since the Unix epoch (1970-01-01T00:00:00Z). This representation
	// is zero-copy compatible with Arrow's TIMESTAMP(ns) type.
	DTypeTimestamp
	// DTypeDate is a date-only column stored as int64 days since the
	// Unix epoch (1970-01-01). Compatible with Arrow's DATE32 type.
	DTypeDate
	// DTypeTime is a time-of-day column stored as int64 nanoseconds
	// since midnight (00:00:00.000000000). Compatible with Arrow's TIME64(ns).
	DTypeTime
	// DTypeUnknown is an unrecognized type.
	DTypeUnknown
)
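The three temporal encodings above map onto the standard time package roughly as follows. The helper names are hypothetical; only the int64 encodings come from the documentation.

```go
package main

import (
	"fmt"
	"time"
)

// timestampToTime decodes DTypeTimestamp: nanoseconds since the Unix epoch.
func timestampToTime(ns int64) time.Time {
	return time.Unix(0, ns).UTC()
}

// dateToTime decodes DTypeDate: whole days since 1970-01-01.
func dateToTime(days int64) time.Time {
	return time.Unix(days*86400, 0).UTC()
}

// timeOfDay decodes DTypeTime: nanoseconds since midnight.
func timeOfDay(ns int64) time.Duration {
	return time.Duration(ns)
}

func main() {
	fmt.Println(timestampToTime(1_500_000_000).Format(time.RFC3339Nano))
	fmt.Println(dateToTime(1).Format("2006-01-02"))
	fmt.Println(timeOfDay(int64(90 * time.Minute)))
}
```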
type Dataset ¶
type Dataset struct {
// contains filtered or unexported fields
}
Dataset is the fluent API for data manipulation. All verbs return a new Dataset that records the operation lazily — no computation happens until Dataset.Collect is called. The chain forms a linked list of [op] nodes rooted at a materialised Table.
Usage:
	result, err := dataset.From(ds).
		Select("x", "y").
		Filter(dataset.Gt("x", 0)).
		Arrange("x").
		Collect(ctx)
func NewDataset ¶
NewDataset creates a Dataset from an engine and columns. The schema is inferred from the columns' names and types.
func ReplaceColumn ¶
ReplaceColumn returns a lazy Dataset that replaces a named column with new float64 values when collected.
func (Dataset) Collect ¶
Collect materialises the lazy operation chain, returning a new Dataset with the result Table populated. If already materialised, returns self.
This is the single materialisation boundary — all data access must go through a collected Dataset.
func (Dataset) Fill ¶
func (f Dataset) Fill(col string, dir FillDirection) Dataset
Fill forward- or backward-fills missing values in the named column.
func (Dataset) Float64 ¶
func (d Dataset) Float64(name string, opts ...Float64Opt) ([]float64, error)
Float64 returns the float64 values of the named column, optionally transformed by a chain of Float64Opts. With no opts, the returned slice aliases the underlying column data (zero-copy). Any opt forces a copy before the chain runs, so callers may freely mutate the result.
If the column is int64-backed (DTypeInt64, DTypeTimestamp, DTypeDate, DTypeTime), the values are converted to float64 automatically. This enables all draw functions to work with temporal data without changes.
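The aliasing rule described for Float64 (no opts: zero-copy alias; any opt: copy first, then run the chain) can be sketched on a plain slice. Declaring `Float64Opt` as a function type is an assumption, and `float64s` and `fillNaN` are hypothetical stand-ins for the method and for FillNaN.

```go
package main

import (
	"fmt"
	"math"
)

// Float64Opt is assumed to be a slice-to-slice function.
type Float64Opt func([]float64) []float64

// fillNaN is a stand-in for the documented FillNaN option.
func fillNaN(fill float64) Float64Opt {
	return func(s []float64) []float64 {
		for i, v := range s {
			if math.IsNaN(v) {
				s[i] = fill
			}
		}
		return s
	}
}

// float64s returns col itself when no opts are given, and a transformed
// copy otherwise, so opts never touch the source data.
func float64s(col []float64, opts ...Float64Opt) []float64 {
	if len(opts) == 0 {
		return col
	}
	out := make([]float64, len(col))
	copy(out, col)
	for _, opt := range opts {
		out = opt(out)
	}
	return out
}

func main() {
	col := []float64{1, math.NaN(), 3}
	fmt.Println(float64s(col, fillNaN(0))) // [1 0 3]
	fmt.Println(col)                       // [1 NaN 3]
}
```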
func (Dataset) GroupBy ¶
func (f Dataset) GroupBy(cols ...string) GroupedFrame
GroupBy specifies columns to group by. Returns a GroupedFrame for Summarize.
func (Dataset) Mutate ¶
func (f Dataset) Mutate(name string, fn MutateFunc) Dataset
Mutate appends or replaces a column using a MutateFunc.
func (Dataset) PivotLonger ¶
func (f Dataset) PivotLonger(spec PivotLongerSpec) Dataset
PivotLonger reshapes wide data to long format.
func (Dataset) PivotWider ¶
func (f Dataset) PivotWider(spec PivotWiderSpec) Dataset
PivotWider reshapes long data to wide format.
func (Dataset) SelectRows ¶
SelectRows returns a new materialised Dataset containing only the rows at the given indices. This is more efficient than Filter when you already have indices (avoids O(n) bool-mask allocation).
The Dataset must already be materialised (collected) when SelectRows is called.
func (Dataset) Strings ¶
Strings returns the string values of the named column, optionally transformed.
func (Dataset) Table ¶
Table returns the underlying Table, or nil if the Dataset is uncollected. Callers must check for nil or call Collect(ctx) before accessing the Table.
func (Dataset) WithColumn ¶ added in v0.0.4
WithColumn appends or replaces a pre-built column in the dataset. This is the simplest way to inject a column that was constructed externally (e.g., via [ColumnFactory.NewInt64Column]).
type Engine ¶
type Engine interface {
// Name returns a human-readable identifier (e.g., "arrow", "memory", "sql").
Name() string
// Context returns the engine's lifecycle context.
Context() context.Context
}
Engine is the marker interface that all compute backends implement. Every engine carries a context.Context that governs its lifecycle. Long-running operations should check Context().Err() for cancellation.
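A long-running operation might honour the engine's lifecycle context as sketched below. `memEngine`, `newEngine`, and `sumChunks` are hypothetical; only the two interface methods come from the documentation.

```go
package main

import (
	"context"
	"fmt"
)

type Engine interface {
	Name() string
	Context() context.Context
}

// memEngine is a hypothetical engine that only carries a context.
type memEngine struct{ ctx context.Context }

func (e memEngine) Name() string             { return "memory" }
func (e memEngine) Context() context.Context { return e.ctx }

func newEngine() (Engine, context.CancelFunc) {
	ctx, cancel := context.WithCancel(context.Background())
	return memEngine{ctx: ctx}, cancel
}

// sumChunks checks for cancellation once per chunk, so a cancelled
// engine stops work at the next chunk boundary.
func sumChunks(eng Engine, chunks [][]float64) (float64, error) {
	total := 0.0
	for _, chunk := range chunks {
		if err := eng.Context().Err(); err != nil {
			return 0, err
		}
		for _, v := range chunk {
			total += v
		}
	}
	return total, nil
}

func main() {
	eng, cancel := newEngine()
	chunks := [][]float64{{1, 2}, {3}}
	fmt.Println(sumChunks(eng, chunks))
	cancel()
	fmt.Println(sumChunks(eng, chunks))
}
```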
type Field ¶
Field describes a single column in a dataset — its name, logical type, nullability, and optional metadata. This maps directly to arrow.Field.
Metadata carries type-specific parameters that DType alone cannot express:
- Timestamp timezone: {"tz": "America/Sao_Paulo"}
- Display format: {"format": "2006-01-02"}
- Units: {"unit": "ns"}
func NullableFloatCol ¶
NullableFloatCol creates a nullable float64 field.
func NullableIntCol ¶
NullableIntCol creates a nullable int64 field.
func NullableStringCol ¶
NullableStringCol creates a nullable string field.
func TimestampCol ¶
TimestampCol creates a timestamp field descriptor.
func (Field) WithMetadata ¶
WithMetadata returns a copy of the field with the given metadata.
func (Field) WithNullable ¶
WithNullable returns a copy of the field with Nullable set.
type FillDirection ¶
type FillDirection int
FillDirection specifies the direction for filling missing values.
const (
	// FillDown fills missing values with the previous non-null value (carry forward).
	FillDown FillDirection = iota
	// FillUp fills missing values with the next non-null value (carry backward).
	FillUp
)
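The two directions can be sketched on a value slice with a validity mask. Representing nulls as a separate `[]bool` and leaving unfillable leading or trailing gaps null are assumptions of this sketch; `fill` is a hypothetical helper.

```go
package main

import "fmt"

type FillDirection int

const (
	FillDown FillDirection = iota // carry the previous non-null value forward
	FillUp                        // carry the next non-null value backward
)

// fill returns filled copies of vals and valid. Gaps with no neighbour
// in the fill direction (leading for FillDown, trailing for FillUp)
// stay null.
func fill(vals []float64, valid []bool, dir FillDirection) ([]float64, []bool) {
	out := append([]float64(nil), vals...)
	ok := append([]bool(nil), valid...)
	n := len(out)
	for k := 0; k < n; k++ {
		i, src := k, k-1 // walk forward, copy from the left
		if dir == FillUp {
			i, src = n-1-k, n-k // walk backward, copy from the right
		}
		if !ok[i] && src >= 0 && src < n && ok[src] {
			out[i], ok[i] = out[src], true
		}
	}
	return out, ok
}

func main() {
	vals := []float64{1, 0, 0, 4, 0}
	valid := []bool{true, false, false, true, false}
	fmt.Println(fill(vals, valid, FillDown))
	fmt.Println(fill(vals, valid, FillUp))
}
```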
type Filler ¶
type Filler interface {
Fill(col AnyColumn, dir FillDirection) (AnyColumn, error)
DropNA(ds Table, cols ...string) (Table, error)
ReplaceNA(col AnyColumn, defaultVal float64) (AnyColumn, error)
}
Filler provides missing-value handling operations. For Arrow: streaming fill with zero allocation. For SQL: generates COALESCE / window-based fill.
type Filterer ¶
Filterer provides mask-based row filtering. For Arrow: boolean mask filtering with zero-copy. For SQL: generates WHERE clauses.
type Float64Appender ¶
type Float64Appender interface {
Append(v float64)
AppendNull()
AppendValues(vs []float64)
Reserve(n int)
}
Float64Appender streams float64 values into a column.
type Float64Opt ¶
Float64Opt transforms a float64 slice (e.g. Clean, Clamp, Sorted).
func FillNaN ¶
func FillNaN(fill float64) Float64Opt
FillNaN replaces all NaNs with the provided value.
type GroupedFrame ¶
type GroupedFrame struct {
// contains filtered or unexported fields
}
GroupedFrame holds a Frame with group-by columns set.
func (GroupedFrame) Summarize ¶
func (gf GroupedFrame) Summarize(specs ...AggSpec) Dataset
Summarize applies aggregations per group, producing a lazy Dataset.
type HasEngine ¶
HasEngine is implemented by datasets that carry an engine reference. This enables engine propagation through transformations — stat packages and ggplot internals can produce new datasets using the same engine without importing engine-specific packages.
type InPred ¶
InPred selects rows where a column value is in a set of values.
type Int64Appender ¶
type Int64Appender interface {
Append(v int64)
AppendNull()
AppendValues(vs []int64)
Reserve(n int)
}
Int64Appender streams int64 values into a column.
type IsNotNullPred ¶
type IsNotNullPred struct{ Col string }
IsNotNullPred selects rows where a column value is not null.
func IsNotNull ¶
func IsNotNull(col string) IsNotNullPred
IsNotNull builds a not-null-check predicate.
func (IsNotNullPred) Expr ¶
func (p IsNotNullPred) Expr() string
Expr returns the SQL representation of an IS NOT NULL check.
type IsNullPred ¶
type IsNullPred struct{ Col string }
IsNullPred selects rows where a column value is null.
func (IsNullPred) Expr ¶
func (p IsNullPred) Expr() string
Expr returns the SQL representation of an IS NULL check.
type JoinType ¶
type JoinType int
JoinType identifies the kind of join to perform.
const (
	// JoinLeft keeps all rows from the left dataset; unmatched right rows are null-filled.
	JoinLeft JoinType = iota
	// JoinRight keeps all rows from the right dataset; unmatched left rows are null-filled.
	JoinRight
	// JoinInner keeps only rows with matches in both datasets.
	JoinInner
	// JoinFull keeps all rows from both datasets; unmatched sides are null-filled.
	JoinFull
	// JoinSemi keeps rows from the left that have at least one match in the right.
	// No columns from the right are included.
	JoinSemi
	// JoinAnti keeps rows from the left that have NO match in the right.
	// No columns from the right are included.
	JoinAnti
)
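The semi and anti variants differ from the other joins in that each left row appears at most once, however many right rows match. A minimal sketch over single string keys; `semiAnti` is a hypothetical helper that returns left row indices.

```go
package main

import "fmt"

// semiAnti partitions left row indices by whether their key occurs in
// right. Duplicate matches on the right never duplicate a left row.
func semiAnti(left, right []string) (semi, anti []int) {
	inRight := make(map[string]struct{}, len(right))
	for _, k := range right {
		inRight[k] = struct{}{}
	}
	for i, k := range left {
		if _, ok := inRight[k]; ok {
			semi = append(semi, i)
		} else {
			anti = append(anti, i)
		}
	}
	return semi, anti
}

func main() {
	semi, anti := semiAnti([]string{"a", "b", "c", "a"}, []string{"a", "d", "a"})
	fmt.Println(semi, anti) // [0 3] [1 2]
}
```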
type Joiner ¶
Joiner provides join operations across datasets. For Arrow: hash-join with lazy indexed column views. For SQL: generates JOIN ... ON ... clauses.
type Masker ¶
type Masker interface {
// Mask computes a boolean mask of length int(ds.NumRows()). True entries are kept.
Mask(ds Table) ([]bool, error)
}
Masker describes a row-level filter condition that can be lazily evaluated against a dataset to produce a boolean mask.
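A sketch of how a precomputed mask and an AND combinator could satisfy Masker. `Table` is reduced to NumRows here, the length check in BoolMask.Mask is an assumption, and treating AndPred as a Masker itself is inferred from its description rather than stated.

```go
package main

import (
	"errors"
	"fmt"
)

// Table is reduced to the one method this sketch needs.
type Table interface{ NumRows() int64 }

type Masker interface {
	Mask(ds Table) ([]bool, error)
}

// BoolMask returns itself as the mask after a length check (assumed).
type BoolMask []bool

func (m BoolMask) Mask(ds Table) ([]bool, error) {
	if int64(len(m)) != ds.NumRows() {
		return nil, errors.New("mask length does not match row count")
	}
	return m, nil
}

// AndPred keeps a row only if every child mask keeps it.
type AndPred struct{ Preds []Masker }

func (p AndPred) Mask(ds Table) ([]bool, error) {
	out := make([]bool, ds.NumRows())
	for i := range out {
		out[i] = true
	}
	for _, pred := range p.Preds {
		m, err := pred.Mask(ds)
		if err != nil {
			return nil, err
		}
		if len(m) != len(out) {
			return nil, errors.New("child mask has wrong length")
		}
		for i := range out {
			out[i] = out[i] && m[i]
		}
	}
	return out, nil
}

// rows is a hypothetical Table that only knows its row count.
type rows int64

func (r rows) NumRows() int64 { return int64(r) }

func main() {
	both := AndPred{Preds: []Masker{
		BoolMask{true, true, false},
		BoolMask{true, false, false},
	}}
	fmt.Println(both.Mask(rows(3))) // [true false false] <nil>
}
```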
type MathKernel ¶
type MathKernel interface {
// Binary arithmetic (column × column, same length required)
AddCols(a, b AnyColumn) (AnyColumn, error)
SubCols(a, b AnyColumn) (AnyColumn, error)
MulCols(a, b AnyColumn) (AnyColumn, error)
DivCols(a, b AnyColumn) (AnyColumn, error)
// Scalar arithmetic (column × scalar)
AddScalar(col AnyColumn, val float64) (AnyColumn, error)
MulScalar(col AnyColumn, val float64) (AnyColumn, error)
// Unary numeric
Abs(col AnyColumn) (AnyColumn, error)
Neg(col AnyColumn) (AnyColumn, error)
Sign(col AnyColumn) (AnyColumn, error)
Sqrt(col AnyColumn) (AnyColumn, error)
Pow(col AnyColumn, exp float64) (AnyColumn, error)
// Transcendental — logarithmic
Exp(col AnyColumn) (AnyColumn, error)
Ln(col AnyColumn) (AnyColumn, error)
Log2(col AnyColumn) (AnyColumn, error)
Log10(col AnyColumn) (AnyColumn, error)
// Transcendental — trigonometric
Sin(col AnyColumn) (AnyColumn, error)
Cos(col AnyColumn) (AnyColumn, error)
Tan(col AnyColumn) (AnyColumn, error)
Asin(col AnyColumn) (AnyColumn, error)
Acos(col AnyColumn) (AnyColumn, error)
Atan(col AnyColumn) (AnyColumn, error)
Atan2(y, x AnyColumn) (AnyColumn, error)
// Transcendental — hyperbolic / special
Tanh(col AnyColumn) (AnyColumn, error)
Sigmoid(col AnyColumn) (AnyColumn, error)
Erf(col AnyColumn) (AnyColumn, error)
// Rounding
Round(col AnyColumn) (AnyColumn, error)
Floor(col AnyColumn) (AnyColumn, error)
Ceil(col AnyColumn) (AnyColumn, error)
// Bitwise (int64 columns only)
BitAnd(a, b AnyColumn) (AnyColumn, error)
BitOr(a, b AnyColumn) (AnyColumn, error)
BitXor(a, b AnyColumn) (AnyColumn, error)
BitNot(col AnyColumn) (AnyColumn, error)
BitShiftLeft(col AnyColumn, n int) (AnyColumn, error)
BitShiftRight(col AnyColumn, n int) (AnyColumn, error)
}
MathKernel provides element-wise mathematical transforms on numeric columns.
Arrow engine: uses Arrow compute Datum API when available, highway SIMD for gaps. Memory engine: uses highway SIMD on raw slices, falls back to math stdlib. SQL engine: generates MATH functions (EXP, LOG, SIN, etc.)
All methods require float64 columns unless noted (bitwise requires int64).
type MutateFunc ¶
type MutateFunc interface {
// Apply produces a new column from the dataset.
Apply(ds Table) (AnyColumn, error)
}
MutateFunc describes a column transformation for Mutate.
type Optimizer ¶
type Optimizer interface {
Optimize(ops []op) []op
}
Optimizer is optionally implemented by engines that can fuse or reorder operations for efficiency. BigQuery uses this to fuse verb chains into a single SQL query.
type OrPred ¶
type OrPred struct{ Preds []Masker }
OrPred combines masks with OR.
type ParquetConfig ¶
type ParquetConfig struct {
// Compression codec: "snappy", "gzip", "zstd", "lz4", "none".
Compression string
}
ParquetConfig holds engine-agnostic Parquet configuration.
type ParquetReader ¶
type ParquetReader interface {
ReadParquet(ctx context.Context, eng Engine, r io.ReaderAt, size int64, cfg ParquetConfig) (Table, error)
}
ParquetReader reads Parquet data into an engine-native Dataset. Memory engine: uses parquet-go for struct-based row reading. Arrow engine: uses pqarrow.ReadTable for zero-copy columnar ingest.
func GetParquetReader ¶ added in v0.0.6
func GetParquetReader(engineName string) (ParquetReader, bool)
GetParquetReader retrieves a registered ParquetReader for an engine.
type ParquetWriter ¶
type ParquetWriter interface {
WriteParquet(ctx context.Context, eng Engine, w io.Writer, ds Table, cfg ParquetConfig) error
}
ParquetWriter writes a Dataset to Parquet format. Memory engine: uses parquet-go GenericWriter. Arrow engine: uses pqarrow.WriteTable.
func GetParquetWriter ¶ added in v0.0.6
func GetParquetWriter(engineName string) (ParquetWriter, bool)
GetParquetWriter retrieves a registered ParquetWriter for an engine.
type PivotLongerSpec ¶
type PivotLongerSpec struct {
// Cols are the column names to pivot from wide to long format.
// These columns are "gathered" into a single name+value pair.
Cols []string
// NamesTo is the output column name that will hold the original column names.
NamesTo string
// ValuesTo is the output column name that will hold the values.
ValuesTo string
}
PivotLongerSpec configures a PivotLonger operation.
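The wide-to-long reshape can be sketched on a map of float64 columns. Row-major output order (each input row expands into one output row per pivoted column) is an assumption borrowed from common tidy-data tools, and `longRows` and `pivotLonger` are hypothetical.

```go
package main

import "fmt"

type PivotLongerSpec struct {
	Cols     []string
	NamesTo  string
	ValuesTo string
}

// longRows holds the gathered output: the source row index, the
// original column name (NamesTo) and the value (ValuesTo).
type longRows struct {
	Row    []int
	Names  []string
	Values []float64
}

func pivotLonger(wide map[string][]float64, nRows int, spec PivotLongerSpec) (longRows, error) {
	for _, c := range spec.Cols {
		if col, ok := wide[c]; !ok || len(col) != nRows {
			return longRows{}, fmt.Errorf("column %q missing or wrong length", c)
		}
	}
	var out longRows
	for r := 0; r < nRows; r++ {
		for _, c := range spec.Cols {
			out.Row = append(out.Row, r)
			out.Names = append(out.Names, c)
			out.Values = append(out.Values, wide[c][r])
		}
	}
	return out, nil
}

func main() {
	wide := map[string][]float64{"a": {1, 2}, "b": {3, 4}}
	spec := PivotLongerSpec{Cols: []string{"a", "b"}, NamesTo: "key", ValuesTo: "value"}
	long, err := pivotLonger(wide, 2, spec)
	fmt.Println(spec.NamesTo, long.Names, spec.ValuesTo, long.Values, err)
}
```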
type PivotWiderSpec ¶
type PivotWiderSpec struct {
// NamesFrom is the column whose unique values become new column names.
NamesFrom string
// ValuesFrom is the column whose values fill the new columns.
ValuesFrom string
}
PivotWiderSpec configures a PivotWider operation.
type Reshaper ¶
type Reshaper interface {
PivotLonger(ds Table, spec PivotLongerSpec) (Table, error)
PivotWider(ds Table, spec PivotWiderSpec) (Table, error)
Separate(ds Table, col string, into []string, sep string) (Table, error)
Concatenate(ds Table, col string, from []string, sep string) (Table, error)
Complete(ds Table, cols ...string) (Table, error)
}
Reshaper provides reshape/pivot operations. For Arrow: lazy column views (repeatedView, interleavedView). For SQL: generates CASE WHEN / UNPIVOT / CROSSTAB.
type Schema ¶
type Schema struct {
// contains filtered or unexported fields
}
Schema describes the complete structure of a dataset — an ordered collection of Fields with a name-to-index lookup. This maps directly to arrow.Schema.
func NewSchema ¶
NewSchema creates a Schema from an ordered list of fields. Panics if any two fields share the same name.
func (*Schema) FieldIndex ¶
FieldIndex returns the index of the named field, or -1.
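The documented behaviour of NewSchema and FieldIndex (panic on duplicate names, -1 for unknown names) could be implemented along these lines. The variadic NewSchema signature and the reduced Field type are assumptions, since the documentation does not show them.

```go
package main

import "fmt"

// Field is reduced to its name for this sketch.
type Field struct{ Name string }

type Schema struct {
	fields []Field
	index  map[string]int // name -> position
}

// NewSchema builds the name lookup and panics on duplicate names.
// The variadic signature is assumed.
func NewSchema(fields ...Field) *Schema {
	s := &Schema{fields: fields, index: make(map[string]int, len(fields))}
	for i, f := range fields {
		if _, dup := s.index[f.Name]; dup {
			panic(fmt.Sprintf("dataset: duplicate field name %q", f.Name))
		}
		s.index[f.Name] = i
	}
	return s
}

// FieldIndex returns the position of the named field, or -1.
func (s *Schema) FieldIndex(name string) int {
	if i, ok := s.index[name]; ok {
		return i
	}
	return -1
}

func main() {
	s := NewSchema(Field{Name: "x"}, Field{Name: "y"})
	fmt.Println(s.FieldIndex("y"), s.FieldIndex("z")) // 1 -1
}
```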
type Selector ¶
type Selector interface {
// Select reorders/selects rows by index (scatter-gather).
// This is the Arrow "Take" kernel.
Select(col AnyColumn, indices []int) (AnyColumn, error)
// Slice returns rows [start, end) from a column.
// For Arrow: zero-copy via array.NewSlice.
Slice(col AnyColumn, start, end int) (AnyColumn, error)
// SortIndices returns the permutation that sorts the column ascending.
// Returns an int slice, not a column — it's metadata for Take().
SortIndices(col AnyColumn) ([]int, error)
// FilterIndices returns the row indices where mask[i] == true.
// Returns an int slice for use with Take().
FilterIndices(mask []bool) []int
}
Selector provides engine-native column/row manipulation primitives. These are the building blocks for Frame verbs (Select, Arrange, Head, etc.).
For Arrow: zero-copy slicing, compute Take kernel, sort-indices kernel. For Memory: direct slice operations. For SQL: generates ORDER BY, LIMIT/OFFSET, WHERE rowid IN (...).
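The way these primitives compose (compute indices, then Take) can be sketched on plain float64 slices, roughly what a memory-style engine would do. The helper names are hypothetical, and using a stable sort is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// sortIndices returns the permutation that sorts col ascending.
func sortIndices(col []float64) []int {
	idx := make([]int, len(col))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool { return col[idx[a]] < col[idx[b]] })
	return idx
}

// filterIndices returns the positions where mask is true.
func filterIndices(mask []bool) []int {
	var idx []int
	for i, keep := range mask {
		if keep {
			idx = append(idx, i)
		}
	}
	return idx
}

// take gathers col[indices[i]] into a new slice (the "Take" kernel).
func take(col []float64, indices []int) ([]float64, error) {
	out := make([]float64, len(indices))
	for i, j := range indices {
		if j < 0 || j >= len(col) {
			return nil, fmt.Errorf("index %d out of range [0, %d)", j, len(col))
		}
		out[i] = col[j]
	}
	return out, nil
}

func main() {
	col := []float64{3, 1, 2}
	fmt.Println(take(col, sortIndices(col)))                              // Arrange
	fmt.Println(take(col, filterIndices([]bool{true, false, true}))) // Filter
}
```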
type StatKernel ¶ added in v0.0.5
type StatKernel interface {
// Histogram bins a numeric column into equal-width bins.
// Returns a Table with columns: "x" (bin centers) and "count" (frequencies).
// nBins <= 0 means auto-select using Sturges' rule.
Histogram(col AnyColumn, nBins int) (Table, error)
// KDE computes kernel density estimation over a numeric column.
// Returns a Table with columns: "x" (grid points) and "density".
// bandwidth <= 0 means Silverman auto-select. points is the output grid size.
KDE(ctx context.Context, col AnyColumn, bandwidth float64, points int) (Table, error)
// LinearFit computes OLS linear regression y = a + b*x.
// Returns a Table with columns: "x" (grid) and "y" (fitted values).
// nOut is the number of output grid points.
LinearFit(xCol, yCol AnyColumn, nOut int) (Table, error)
// LoessFit computes locally weighted regression (LOESS).
// Returns a Table with columns: "x" (grid) and "y" (fitted values).
// nOut is the number of output grid points.
LoessFit(ctx context.Context, xCol, yCol AnyColumn, nOut int) (Table, error)
// LinearFitSE computes OLS regression with 95% confidence bands.
// Returns a Table with columns: "x", "y" (fitted), "ymin", "ymax".
// nOut is the number of output grid points.
LinearFitSE(xCol, yCol AnyColumn, nOut int) (Table, error)
// LoessFitSE computes LOESS with approximate 95% confidence bands.
// Returns a Table with columns: "x", "y" (fitted), "ymin", "ymax".
// nOut is the number of output grid points.
LoessFitSE(ctx context.Context, xCol, yCol AnyColumn, nOut int) (Table, error)
// Boxplot computes the five-number summary for a numeric column,
// optionally grouped by a categorical column.
// Returns a Table with columns: "x", "lower", "q1", "middle", "q3",
// "upper", "notch_lower", "notch_upper".
// groupCol may be nil for a single-group boxplot.
// whisker is "tukey" (1.5*IQR) or "range" (min-max).
Boxplot(yCol, groupCol AnyColumn, whisker string, notch bool) (Table, error)
}
StatKernel provides statistical compute kernels that produce new Tables. These are higher-level operations that consume one or more columns and produce a complete result table.
For Memory/Arrow: implemented via go-highway SIMD + stdlib math. For SQL: could generate UDFs or client-side fallback.
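As one example of such a kernel, an equal-width histogram with Sturges' rule (k = ceil(log2 n) + 1) might look like the sketch below. It works on a plain slice and returns two slices rather than a Table; placing the maximum in the last bin and ignoring NaN handling are assumptions, and `histogram` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// histogram returns bin centers and counts. nBins <= 0 selects the bin
// count with Sturges' rule. Input is assumed to contain no NaNs.
func histogram(vals []float64, nBins int) (centers []float64, counts []int64) {
	if len(vals) == 0 {
		return nil, nil
	}
	if nBins <= 0 {
		nBins = int(math.Ceil(math.Log2(float64(len(vals))))) + 1
	}
	lo, hi := vals[0], vals[0]
	for _, v := range vals[1:] {
		lo, hi = math.Min(lo, v), math.Max(hi, v)
	}
	width := (hi - lo) / float64(nBins)
	centers = make([]float64, nBins)
	counts = make([]int64, nBins)
	for i := range centers {
		centers[i] = lo + (float64(i)+0.5)*width
	}
	for _, v := range vals {
		b := 0
		if width > 0 {
			b = int((v - lo) / width)
			if b >= nBins { // the maximum falls into the last bin
				b = nBins - 1
			}
		}
		counts[b]++
	}
	return centers, counts
}

func main() {
	centers, counts := histogram([]float64{0, 1, 2, 3, 4, 5, 6, 7}, 0)
	fmt.Println(centers, counts)
}
```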
type StringAppender ¶
type StringAppender interface {
Append(v string)
AppendNull()
AppendValues(vs []string)
Reserve(n int)
}
StringAppender streams string values into a column.
type Table ¶
type Table interface {
// Schema returns the dataset's schema.
Schema() *Schema
// Column retrieves a named column. Returns [ColumnNotFoundError] if absent.
// The returned [AnyColumn] can be type-asserted to [Column[T]] for typed
// access, or use [GetColumn] for a safe generic retrieval.
Column(name string) (AnyColumn, error)
// NumRows returns the logical number of rows.
NumRows() int64
// NumCols returns the number of columns.
NumCols() int64
}
Table represents an immutable, columnar data source.
Implementations include in-memory tables, Arrow tables, and BigQuery-backed remote tables. ETL verbs are exposed by wrapping a Table in a Dataset (the fluent API defined in frame.go) via From.
type Windower ¶
type Windower interface {
Lag(col AnyColumn, n int) (AnyColumn, error)
Lead(col AnyColumn, n int) (AnyColumn, error)
CumSum(col AnyColumn) (AnyColumn, error)
CumMax(col AnyColumn) (AnyColumn, error)
CumMin(col AnyColumn) (AnyColumn, error)
Rank(col AnyColumn) (AnyColumn, error)
DenseRank(col AnyColumn) (AnyColumn, error)
PercentRank(col AnyColumn) (AnyColumn, error)
RowNumber(n int) (AnyColumn, error)
}
Windower provides window function kernels. For Arrow: streaming accumulators over Arrow arrays. For SQL: generates OVER() / WINDOW clauses.
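Two of these kernels, sketched on plain slices. Representing the nulls that Lag introduces with a separate validity mask is an assumption, and `lag` and `cumSum` are hypothetical helpers.

```go
package main

import "fmt"

// lag shifts vals down by n rows. The first n outputs have no source
// row and are marked invalid. Negative n is treated as 0.
func lag(vals []float64, n int) (out []float64, valid []bool) {
	if n < 0 {
		n = 0
	}
	out = make([]float64, len(vals))
	valid = make([]bool, len(vals))
	for i := n; i < len(vals); i++ {
		out[i], valid[i] = vals[i-n], true
	}
	return out, valid
}

// cumSum returns the running total of vals.
func cumSum(vals []float64) []float64 {
	out := make([]float64, len(vals))
	total := 0.0
	for i, v := range vals {
		total += v
		out[i] = total
	}
	return out
}

func main() {
	vals := []float64{1, 2, 3}
	fmt.Println(lag(vals, 1))
	fmt.Println(cumSum(vals)) // [1 3 6]
}
```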
Directories ¶
| Path | Synopsis |
|---|---|
| arrow | Package arrow provides an Apache Arrow-backed compute engine for the dataset package. |
| arrow/csv | Package csv provides the Arrow CSV engine driver. |
| arrow/parquet | Package parquet provides the Arrow Parquet engine driver. |
| bigquery | Package bigquery implements a BigQuery SQL pushdown engine for the dataset library. |
| compute | Package compute provides portable SIMD primitives for the dataset engines. |
| csv | Package csv provides CSV reading and writing for the dataset package. |
| math | Package math provides SIMD-accelerated mathematical transforms for the dataset engines. |
| memory | Package memory provides a lightweight Go-slice-backed compute engine for the dataset package. |
| memory/csv | Package csv provides the Memory CSV engine driver. |
| memory/parquet | Package parquet provides the Memory Parquet engine driver. |
| parquet | Package parquet provides Parquet reading and writing for the dataset package. |
| sort | Package sort provides SIMD-accelerated sorting for the dataset engines. |