Overview ¶
Package dataset provides zero-copy, lazy-evaluating columnar data abstractions for the Grammar of Graphics pipeline.
Engine-First Architecture ¶
Every data operation is delegated to an Engine backend. The dataset package defines only interfaces and contracts — no concrete column types, no fallbacks. Engines (Arrow, memory, SQL) implement sub-interfaces (Aggregator, Windower, Joiner, etc.) for the operations they support.
Type System ¶
The type system is aligned with Apache Arrow:
Index ¶
- func Close(ds Table) error
- func Names(ds Table) []string
- type AggFunc
- type AggSpec
- type Aggregator
- type AndPred
- type AnyColumn
- type BetweenPred
- type BoolAppender
- type BoolMask
- type Builder
- type BuilderFactory
- type CSVConfig
- type CSVReader
- type CSVWriter
- type Caster
- type Closer
- type Column
- type ColumnFactory
- type CompPred
- type Composer
- type DType
- type Dataset
- func (f Dataset) AntiJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Arrange(cols ...string) Dataset
- func (f Dataset) Collect() (Table, error)
- func (f Dataset) Column(name string) (AnyColumn, error)
- func (f Dataset) Combine(others ...Table) Dataset
- func (f Dataset) Distinct(cols ...string) Dataset
- func (f Dataset) DropNA(cols ...string) Dataset
- func (f Dataset) Err() error
- func (f Dataset) Fill(col string, dir FillDirection) Dataset
- func (f Dataset) Filter(mask Masker) Dataset
- func (f Dataset) FullJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) GroupBy(cols ...string) GroupedFrame
- func (f Dataset) Head(n int) Dataset
- func (f Dataset) InnerJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) LeftJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Mutate(name string, fn MutateFunc) Dataset
- func (f Dataset) NumCols() int64
- func (f Dataset) NumRows() int64
- func (f Dataset) PivotLonger(spec PivotLongerSpec) Dataset
- func (f Dataset) PivotWider(spec PivotWiderSpec) Dataset
- func (f Dataset) Rename(oldName, newName string) Dataset
- func (f Dataset) RightJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Schema() *Schema
- func (f Dataset) Select(cols ...string) Dataset
- func (f Dataset) SemiJoin(other Table, spec JoinSpec) Dataset
- func (f Dataset) Separate(col string, into []string, sep string) Dataset
- func (f Dataset) Slice(start, end int) Dataset
- func (f Dataset) Stack(others ...Table) Dataset
- func (f Dataset) Table() Table
- func (f Dataset) Tail(n int) Dataset
- type Engine
- type ErrColumnNotFound
- type Field
- type FillDirection
- type Filler
- type Filterer
- type Float64Appender
- type GroupedFrame
- type HasEngine
- type InPred
- type Int64Appender
- type IsNotNullPred
- type IsNullPred
- type JoinSpec
- type JoinType
- type Joiner
- type Masker
- type MathKernel
- type MutateFunc
- type NotPred
- type Op
- type OrPred
- type ParquetConfig
- type ParquetReader
- type ParquetWriter
- type PivotLongerSpec
- type PivotWiderSpec
- type Reshaper
- type Schema
- type Selector
- type StringAppender
- type Table
- type Windower
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
Types ¶
type AggSpec ¶
type AggSpec struct {
	OutputName string  // name of the result column
	InputName  string  // name of the source column
	Fn         AggFunc // which aggregation to apply
}
AggSpec describes a single aggregation to apply in Summarize.
type Aggregator ¶
type Aggregator interface {
	Sum(col AnyColumn) (AnyColumn, error)
	Mean(col AnyColumn) (AnyColumn, error)
	MinMax(col AnyColumn) (min AnyColumn, max AnyColumn, err error)
	Count(col AnyColumn) (AnyColumn, error)
	Median(col AnyColumn) (AnyColumn, error)
	Variance(col AnyColumn) (AnyColumn, error)
}
Aggregator provides vectorized aggregation kernels. All methods return a single-element AnyColumn; the result type follows Arrow compute kernel rules:
- Sum: numeric → same type (int64→int64, float64→float64)
- Mean: numeric → float64 (always widens)
- MinMax: any ordered type → (min, max) of same type
- Count: any → int64
- Median: numeric → float64
- Variance: numeric → float64
For Arrow: delegates to arrow/math SIMD operations. For SQL: generates SELECT SUM/AVG/MIN/MAX/COUNT queries.
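A minimal sketch of these type rules, assuming a memory-style engine over raw slices (the real kernels take and return AnyColumn): Sum preserves the input type, while Mean always widens to float64.

```go
package main

import "fmt"

// sumInt64 preserves the input type: int64 in, int64 out,
// mirroring the rule "Sum: numeric -> same type".
func sumInt64(xs []int64) int64 {
	var s int64
	for _, v := range xs {
		s += v
	}
	return s
}

// meanInt64 always widens to float64, mirroring
// "Mean: numeric -> float64 (always widens)".
func meanInt64(xs []int64) float64 {
	if len(xs) == 0 {
		return 0
	}
	return float64(sumInt64(xs)) / float64(len(xs))
}

func main() {
	xs := []int64{1, 2, 4}
	fmt.Println(sumInt64(xs), meanInt64(xs))
}
```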
type AnyColumn ¶
AnyColumn is the type-erased column interface. This is what Dataset stores, engines operate on, and maps hold. Every engine-native column type implements this.
type BetweenPred ¶
func Between ¶
func Between(col string, lo, hi any) BetweenPred
func (BetweenPred) Expr ¶
func (p BetweenPred) Expr() string
type BoolAppender ¶
BoolAppender streams bool values into a column.
type BoolMask ¶
type BoolMask []bool
BoolMask is a pre-computed boolean mask that implements Masker. Useful when the filter has already been computed externally (e.g. faceting).
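A sketch of how a pre-computed mask can satisfy the Masker contract, using a hypothetical minimal stand-in for Table (only NumRows is needed): the mask returns itself after a length check.

```go
package main

import "fmt"

// table is a minimal stand-in for the package's Table interface,
// reduced to the one method Mask needs.
type table interface {
	NumRows() int64
}

type fixedRows int64

func (f fixedRows) NumRows() int64 { return int64(f) }

// BoolMask mirrors the documented type: a pre-computed mask that
// implements Masker by returning itself, guarded by a length check.
type BoolMask []bool

func (m BoolMask) Mask(ds table) ([]bool, error) {
	if int64(len(m)) != ds.NumRows() {
		return nil, fmt.Errorf("mask length %d != rows %d", len(m), ds.NumRows())
	}
	return m, nil
}

func main() {
	mask := BoolMask{true, false, true}
	kept, err := mask.Mask(fixedRows(3))
	fmt.Println(kept, err)
}
```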
type Builder ¶
type Builder interface {
	Float64(col string) Float64Appender
	Int64(col string) Int64Appender
	String(col string) StringAppender
	Bool(col string) BoolAppender
	Build() (Table, error)
}
Builder provides streaming, typed, zero-boxing construction. Each column has its own typed appender — no any boxing, no allocations per row.
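To illustrate the zero-boxing appender pattern, here is a hypothetical slice-backed Float64Appender; the package's engines provide their own implementations behind Builder.

```go
package main

import "fmt"

// floatAppender is a hypothetical slice-backed implementation of the
// Float64Appender contract: typed appends, no per-row interface boxing.
type floatAppender struct {
	vals  []float64
	nulls []bool
}

func (a *floatAppender) Append(v float64) {
	a.vals = append(a.vals, v)
	a.nulls = append(a.nulls, false)
}

func (a *floatAppender) AppendNull() {
	a.vals = append(a.vals, 0) // placeholder slot; the bitmap marks it null
	a.nulls = append(a.nulls, true)
}

func (a *floatAppender) AppendValues(vs []float64) {
	for _, v := range vs {
		a.Append(v)
	}
}

// Reserve pre-grows the backing slices so the next n appends
// do not reallocate.
func (a *floatAppender) Reserve(n int) {
	vals := make([]float64, len(a.vals), len(a.vals)+n)
	copy(vals, a.vals)
	a.vals = vals
	nulls := make([]bool, len(a.nulls), len(a.nulls)+n)
	copy(nulls, a.nulls)
	a.nulls = nulls
}

func main() {
	var x floatAppender
	x.Reserve(4)
	x.AppendValues([]float64{1.5, 2.5})
	x.AppendNull()
	fmt.Println(x.vals, x.nulls)
}
```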
type BuilderFactory ¶
BuilderFactory creates schema-aware builders for streaming construction.
type CSVConfig ¶
type CSVConfig struct {
	HasHeader  bool
	Comma      rune
	Comment    rune
	NullValues []string
	// ChunkSize is the number of rows per batch. 0 means engine default.
	// Arrow default: 65536, Memory default: unlimited.
	ChunkSize int
}
CSVConfig holds engine-agnostic CSV configuration. The dataset/csv facade constructs this from functional options and passes it to the engine's CSVReader/CSVWriter implementation.
type CSVReader ¶
type CSVReader interface {
	ReadCSV(ctx context.Context, r io.Reader, cfg CSVConfig) (Table, error)
}
CSVReader reads CSV data into an engine-native Dataset. Memory engine: uses go-simdcsv + schema inference. Arrow engine: uses arrow/csv.NewInferringReader for zero-copy ingest.
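The general shape of a ReadCSV implementation can be sketched with the stdlib encoding/csv package; readFloatCSV and its all-float64 inference are illustrative simplifications, not the engine's actual code.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// readFloatCSV is a toy version of a memory engine's ReadCSV: read the
// header row, then decode every cell as float64, treating the configured
// null tokens as nulls (stored here as 0 for brevity; a real engine
// would set a null bitmap).
func readFloatCSV(src string, nullValues []string) (header []string, cols [][]float64, err error) {
	r := csv.NewReader(strings.NewReader(src))
	rows, err := r.ReadAll()
	if err != nil || len(rows) == 0 {
		return nil, nil, err
	}
	header = rows[0]
	cols = make([][]float64, len(header))
	isNull := map[string]bool{}
	for _, nv := range nullValues {
		isNull[nv] = true
	}
	for _, row := range rows[1:] {
		for i, cell := range row {
			if isNull[cell] {
				cols[i] = append(cols[i], 0)
				continue
			}
			v, perr := strconv.ParseFloat(cell, 64)
			if perr != nil {
				return nil, nil, perr
			}
			cols[i] = append(cols[i], v)
		}
	}
	return header, cols, nil
}

func main() {
	h, c, err := readFloatCSV("x,y\n1,2\n3,NA\n", []string{"NA"})
	fmt.Println(h, c, err)
}
```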
type CSVWriter ¶
type CSVWriter interface {
	WriteCSV(ctx context.Context, w io.Writer, ds Table, cfg CSVConfig) error
}
CSVWriter writes a Dataset to CSV. Memory engine: uses go-simdcsv Writer. Arrow engine: uses go-simdcsv Writer (generic — CSV output is string-based).
type Caster ¶
Caster provides engine-controlled type casting. Casting is an engine operation — the engine knows its native column types and how to convert between them.
type Closer ¶
type Closer interface {
	Close() error
}
Closer is optionally implemented by datasets that hold resources requiring explicit cleanup (e.g., Arrow tables, database connections).
type Column ¶
Column is the typed access layer. Engine-specific column types implement both AnyColumn and Column[T] for their native type.
Values returns the underlying typed slice — zero-copy for both Arrow (returns the Arrow buffer) and memory (returns the Go slice).
IsNull returns the null bitmap. nil means no nulls (common case, zero alloc).
type ColumnFactory ¶
type ColumnFactory interface {
	NewFloat64Column(name string, data []float64) AnyColumn
	NewInt64Column(name string, data []int64) AnyColumn
	NewStringColumn(name string, data []string) AnyColumn
	NewBoolColumn(name string, data []bool) AnyColumn
	NewTimestampColumn(name string, data []int64) AnyColumn
	// FromColumns assembles columns into a Dataset with the given schema.
	// All columns must have the same length.
	FromColumns(schema *Schema, cols ...AnyColumn) (Table, error)
}
ColumnFactory wraps existing typed slices into engine-native columns. Memory engine: wraps the slice (zero-copy). Arrow engine: builds an Arrow array (one allocation).
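The zero-copy claim for the memory engine can be demonstrated with a hypothetical float64 column that wraps the caller's slice directly: mutations through the original slice remain visible through the column, because both share one backing array.

```go
package main

import "fmt"

// float64Column is a hypothetical memory-engine column: it wraps the
// caller's slice directly, so construction is zero-copy.
type float64Column struct {
	name string
	data []float64
}

func (c *float64Column) Name() string      { return c.name }
func (c *float64Column) Len() int          { return len(c.data) }
func (c *float64Column) Values() []float64 { return c.data }

func newFloat64Column(name string, data []float64) *float64Column {
	return &float64Column{name: name, data: data} // no copy
}

func main() {
	raw := []float64{1, 2, 3}
	col := newFloat64Column("x", raw)
	raw[0] = 99 // visible through the column: same backing array
	fmt.Println(col.Values()[0])
}
```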
type CompPred ¶
CompPred compares a column against a scalar value. Implements both Masker (local eval) and Expr() (SQL pushdown).
type Composer ¶
type Composer interface {
	Stack(datasets ...Table) (Table, error)
	Combine(datasets ...Table) (Table, error)
}
Composer provides row/column binding operations. For Arrow: zero-copy concatenation of Arrow arrays. For SQL: UNION ALL / lateral join.
type DType ¶
type DType int
DType represents the logical data type of a column. This is the type ID — analogous to arrow.Type.
const (
	// DTypeFloat64 is a 64-bit floating point column.
	DTypeFloat64 DType = iota
	// DTypeInt64 is a 64-bit integer column.
	DTypeInt64
	// DTypeString is a string/categorical column.
	DTypeString
	// DTypeBool is a boolean column.
	DTypeBool
	// DTypeTimestamp is a timestamp column stored as int64 nanoseconds
	// since the Unix epoch (1970-01-01T00:00:00Z). This representation
	// is zero-copy compatible with Arrow's TIMESTAMP(ns) type.
	DTypeTimestamp
	// DTypeUnknown is an unrecognized type.
	DTypeUnknown
)
type Dataset ¶
type Dataset struct {
	// contains filtered or unexported fields
}
Dataset is the fluent API for data manipulation. All verbs return a new Dataset (immutable chain). Every operation delegates to the dataset's engine via sub-interfaces — the Dataset never touches raw data directly.
Usage:
result, err := dataset.From(ds).
	Select("x", "y").
	Filter(dataset.Gt("x", 0)).
	Arrange("x").
	Collect()
func NewDataset ¶
NewDataset creates a Dataset from an engine and columns. The schema is inferred from the columns' names and types.
func ReplaceColumn ¶
ReplaceColumn replaces a named column in a Dataset with new float64 values. All other columns are preserved. Used for discrete-to-numeric remapping.
func (Dataset) Arrange ¶
Arrange sorts the dataset by the named columns (ascending). The engine's Selector.SortIndices computes the permutation; Selector.Select applies it.
func (Dataset) Collect ¶
Collect materializes the pipeline and returns the resulting Table, or the first error encountered in the chain.
func (Dataset) Column ¶
Convenience forwarding methods — allow Dataset to be used where Table is expected.
func (Dataset) Distinct ¶
Distinct removes duplicate rows based on the specified columns. If no columns are specified, all columns are used.
func (Dataset) GroupBy ¶
func (f Dataset) GroupBy(cols ...string) GroupedFrame
GroupBy specifies columns to group by. Returns a GroupedFrame for Summarize.
func (Dataset) Mutate ¶
func (f Dataset) Mutate(name string, fn MutateFunc) Dataset
Mutate appends or replaces a column using a MutateFunc.
func (Dataset) PivotLonger ¶
func (f Dataset) PivotLonger(spec PivotLongerSpec) Dataset
func (Dataset) PivotWider ¶
func (f Dataset) PivotWider(spec PivotWiderSpec) Dataset
func (Dataset) Slice ¶
Slice returns rows in the range [start, end). Engine's Selector.SliceColumn handles this — for Arrow, zero-copy via array.NewSlice.
type Engine ¶
type Engine interface {
	// Name returns a human-readable identifier (e.g., "arrow", "memory", "sql").
	Name() string
}
Engine is the marker interface that all compute backends implement.
type ErrColumnNotFound ¶
type ErrColumnNotFound struct {
	Name string
}
ErrColumnNotFound indicates a requested column does not exist.
func (*ErrColumnNotFound) Error ¶
func (e *ErrColumnNotFound) Error() string
type Field ¶
Field describes a single column in a dataset — its name, logical type, nullability, and optional metadata. This maps directly to arrow.Field.
Metadata carries type-specific parameters that DType alone cannot express:
- Timestamp timezone: {"tz": "America/Sao_Paulo"}
- Display format: {"format": "2006-01-02"}
- Units: {"unit": "ns"}
func NullableFloatCol ¶
NullableFloatCol creates a nullable float64 field.
func NullableIntCol ¶
NullableIntCol creates a nullable int64 field.
func NullableStringCol ¶
NullableStringCol creates a nullable string field.
func TimestampCol ¶
func (Field) WithMetadata ¶
WithMetadata returns a copy of the field with the given metadata.
func (Field) WithNullable ¶
WithNullable returns a copy of the field with Nullable set.
type FillDirection ¶
type FillDirection int
FillDirection specifies the direction for filling missing values.
const (
	// FillDown fills missing values with the previous non-null value (carry forward).
	FillDown FillDirection = iota
	// FillUp fills missing values with the next non-null value (carry backward).
	FillUp
)
type Filler ¶
type Filler interface {
	Fill(col AnyColumn, dir FillDirection) (AnyColumn, error)
	DropNA(ds Table, cols ...string) (Table, error)
	ReplaceNA(col AnyColumn, defaultVal float64) (AnyColumn, error)
}
Filler provides missing-value handling operations. For Arrow: streaming fill with zero allocation. For SQL: generates COALESCE / window-based fill.
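A sketch of the FillDown case, assuming the memory-engine representation of a value slice plus a parallel null bitmap; leading nulls have no predecessor and stay null.

```go
package main

import "fmt"

// fillDown carries the previous non-null value forward, the general
// shape of Filler.Fill with FillDown. Nulls are marked in a parallel
// bitmap; the input is not mutated.
func fillDown(vals []float64, null []bool) ([]float64, []bool) {
	out := append([]float64(nil), vals...)
	outNull := append([]bool(nil), null...)
	haveSeen := false
	var last float64
	for i := range out {
		if !outNull[i] {
			last, haveSeen = out[i], true
			continue
		}
		if haveSeen {
			out[i], outNull[i] = last, false // leading nulls stay null
		}
	}
	return out, outNull
}

func main() {
	v, n := fillDown(
		[]float64{0, 1, 0, 0, 4},
		[]bool{true, false, true, true, false},
	)
	fmt.Println(v, n)
}
```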
type Filterer ¶
Filterer provides mask-based row filtering. For Arrow: boolean mask filtering with zero-copy. For SQL: generates WHERE clauses.
type Float64Appender ¶
type Float64Appender interface {
	Append(v float64)
	AppendNull()
	AppendValues(vs []float64)
	Reserve(n int)
}
Float64Appender streams float64 values into a column.
type GroupedFrame ¶
type GroupedFrame struct {
	// contains filtered or unexported fields
}
GroupedFrame holds a Dataset with its group-by columns set.
func (GroupedFrame) Summarize ¶
func (gf GroupedFrame) Summarize(specs ...AggSpec) Dataset
Summarize applies aggregations per group using the engine's Aggregator. All computation is delegated to the engine — the GroupedFrame only orchestrates grouping.
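What GroupBy plus Summarize orchestrates can be sketched over raw slices: partition row indices by key, then run one aggregation kernel per group. groupMean is a hypothetical helper, not the package API; group order here follows first appearance.

```go
package main

import "fmt"

// groupMean sketches GroupBy("key") followed by a Mean aggregation:
// partition row indices by key, then apply the kernel per group.
func groupMean(keys []string, vals []float64) (outKeys []string, means []float64) {
	idx := map[string][]int{}
	for i, k := range keys {
		if _, seen := idx[k]; !seen {
			outKeys = append(outKeys, k) // preserve first-appearance order
		}
		idx[k] = append(idx[k], i)
	}
	for _, k := range outKeys {
		var s float64
		for _, i := range idx[k] {
			s += vals[i]
		}
		means = append(means, s/float64(len(idx[k])))
	}
	return outKeys, means
}

func main() {
	ks, ms := groupMean([]string{"a", "b", "a"}, []float64{1, 10, 3})
	fmt.Println(ks, ms)
}
```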
type HasEngine ¶
HasEngine is implemented by datasets that carry an engine reference. This enables engine propagation through transformations — stat packages and ggplot internals can produce new datasets using the same engine without importing engine-specific packages.
type Int64Appender ¶
type Int64Appender interface {
	Append(v int64)
	AppendNull()
	AppendValues(vs []int64)
	Reserve(n int)
}
Int64Appender streams int64 values into a column.
type IsNotNullPred ¶
type IsNotNullPred struct{ Col string }
func IsNotNull ¶
func IsNotNull(col string) IsNotNullPred
func (IsNotNullPred) Expr ¶
func (p IsNotNullPred) Expr() string
type IsNullPred ¶
type IsNullPred struct{ Col string }
func IsNull ¶
func IsNull(col string) IsNullPred
func (IsNullPred) Expr ¶
func (p IsNullPred) Expr() string
type JoinType ¶
type JoinType int
JoinType identifies the kind of join to perform.
const (
	// JoinLeft keeps all rows from the left dataset; unmatched right rows are null-filled.
	JoinLeft JoinType = iota
	// JoinRight keeps all rows from the right dataset; unmatched left rows are null-filled.
	JoinRight
	// JoinInner keeps only rows with matches in both datasets.
	JoinInner
	// JoinFull keeps all rows from both datasets; unmatched sides are null-filled.
	JoinFull
	// JoinSemi keeps rows from the left that have at least one match in the right.
	// No columns from the right are included.
	JoinSemi
	// JoinAnti keeps rows from the left that have NO match in the right.
	// No columns from the right are included.
	JoinAnti
)
type Joiner ¶
Joiner provides join operations across datasets. For Arrow: hash-join with lazy indexed column views. For SQL: generates JOIN ... ON ... clauses.
type Masker ¶
type Masker interface {
	// Mask computes a boolean mask of length int(ds.NumRows()). True entries are kept.
	Mask(ds Table) ([]bool, error)
}
Masker describes a row-level filter condition that can be lazily evaluated against a dataset to produce a boolean mask.
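Any type with a Mask method can drive Filter. Below is a hypothetical comparison predicate evaluated against a minimal Table stand-in, roughly the local-eval half of a predicate like Gt("x", 0) from the usage example above.

```go
package main

import "fmt"

// tinyTable is a minimal stand-in for the package's Table, exposing just
// enough for a predicate to read one float64 column.
type tinyTable struct {
	cols map[string][]float64
}

// gtPred is a hypothetical Masker: keep rows where Col > Val.
type gtPred struct {
	Col string
	Val float64
}

func (p gtPred) Mask(ds *tinyTable) ([]bool, error) {
	col, ok := ds.cols[p.Col]
	if !ok {
		return nil, fmt.Errorf("column not found: %s", p.Col)
	}
	mask := make([]bool, len(col))
	for i, v := range col {
		mask[i] = v > p.Val
	}
	return mask, nil
}

func main() {
	tbl := &tinyTable{cols: map[string][]float64{"x": {-1, 0, 2}}}
	m, err := gtPred{Col: "x", Val: 0}.Mask(tbl)
	fmt.Println(m, err)
}
```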
type MathKernel ¶
type MathKernel interface {
	// Binary arithmetic (column × column, same length required)
	AddCols(a, b AnyColumn) (AnyColumn, error)
	SubCols(a, b AnyColumn) (AnyColumn, error)
	MulCols(a, b AnyColumn) (AnyColumn, error)
	DivCols(a, b AnyColumn) (AnyColumn, error)
	// Scalar arithmetic (column × scalar)
	AddScalar(col AnyColumn, val float64) (AnyColumn, error)
	MulScalar(col AnyColumn, val float64) (AnyColumn, error)
	// Unary numeric
	Abs(col AnyColumn) (AnyColumn, error)
	Neg(col AnyColumn) (AnyColumn, error)
	Sign(col AnyColumn) (AnyColumn, error)
	Sqrt(col AnyColumn) (AnyColumn, error)
	Pow(col AnyColumn, exp float64) (AnyColumn, error)
	// Transcendental — logarithmic
	Exp(col AnyColumn) (AnyColumn, error)
	Ln(col AnyColumn) (AnyColumn, error)
	Log2(col AnyColumn) (AnyColumn, error)
	Log10(col AnyColumn) (AnyColumn, error)
	// Transcendental — trigonometric
	Sin(col AnyColumn) (AnyColumn, error)
	Cos(col AnyColumn) (AnyColumn, error)
	Tan(col AnyColumn) (AnyColumn, error)
	Asin(col AnyColumn) (AnyColumn, error)
	Acos(col AnyColumn) (AnyColumn, error)
	Atan(col AnyColumn) (AnyColumn, error)
	Atan2(y, x AnyColumn) (AnyColumn, error)
	// Transcendental — hyperbolic / special
	Tanh(col AnyColumn) (AnyColumn, error)
	Sigmoid(col AnyColumn) (AnyColumn, error)
	Erf(col AnyColumn) (AnyColumn, error)
	// Rounding
	Round(col AnyColumn) (AnyColumn, error)
	Floor(col AnyColumn) (AnyColumn, error)
	Ceil(col AnyColumn) (AnyColumn, error)
	// Bitwise (int64 columns only)
	BitAnd(a, b AnyColumn) (AnyColumn, error)
	BitOr(a, b AnyColumn) (AnyColumn, error)
	BitXor(a, b AnyColumn) (AnyColumn, error)
	BitNot(col AnyColumn) (AnyColumn, error)
	BitShiftLeft(col AnyColumn, n int) (AnyColumn, error)
	BitShiftRight(col AnyColumn, n int) (AnyColumn, error)
}
MathKernel provides element-wise mathematical transforms on numeric columns.
Arrow engine: uses the Arrow compute Datum API when available, highway SIMD for gaps. Memory engine: uses highway SIMD on raw slices, falls back to the math stdlib. SQL engine: generates SQL math functions (EXP, LOG, SIN, etc.).
All methods require float64 columns unless noted (bitwise requires int64).
type MutateFunc ¶
type MutateFunc interface {
	// Apply produces a new column from the dataset.
	Apply(ds Table) (AnyColumn, error)
}
MutateFunc describes a column transformation for Mutate.
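A MutateFunc reads existing columns and returns a whole new one. Here is a sketch with minimal stand-ins, where the "column" is just a named float64 slice and ratio is a hypothetical transform (the real interface returns an AnyColumn).

```go
package main

import "fmt"

// cols is a toy dataset: column name -> values.
type cols map[string][]float64

// ratio builds a MutateFunc-style transform: new column = num / den,
// elementwise, with a length check like an engine kernel would do.
func ratio(num, den string) func(ds cols) ([]float64, error) {
	return func(ds cols) ([]float64, error) {
		a, b := ds[num], ds[den]
		if len(a) != len(b) {
			return nil, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
		}
		out := make([]float64, len(a))
		for i := range a {
			out[i] = a[i] / b[i]
		}
		return out, nil
	}
}

func main() {
	ds := cols{"dist": {10, 30}, "time": {2, 5}}
	speed, err := ratio("dist", "time")(ds)
	fmt.Println(speed, err)
}
```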
type ParquetConfig ¶
type ParquetConfig struct {
	// Compression codec: "snappy", "gzip", "zstd", "lz4", "none".
	Compression string
}
ParquetConfig holds engine-agnostic Parquet configuration.
type ParquetReader ¶
type ParquetReader interface {
	ReadParquet(ctx context.Context, r io.ReaderAt, size int64, cfg ParquetConfig) (Table, error)
}
ParquetReader reads Parquet data into an engine-native Dataset. Memory engine: uses parquet-go for struct-based row reading. Arrow engine: uses pqarrow.ReadTable for zero-copy columnar ingest.
type ParquetWriter ¶
type ParquetWriter interface {
	WriteParquet(ctx context.Context, w io.Writer, ds Table, cfg ParquetConfig) error
}
ParquetWriter writes a Dataset to Parquet format. Memory engine: uses parquet-go GenericWriter. Arrow engine: uses pqarrow.WriteTable.
type PivotLongerSpec ¶
type PivotLongerSpec struct {
	// Cols are the column names to pivot from wide to long format.
	// These columns are "gathered" into a single name+value pair.
	Cols []string
	// NamesTo is the output column name that will hold the original column names.
	NamesTo string
	// ValuesTo is the output column name that will hold the values.
	ValuesTo string
}
PivotLongerSpec configures a PivotLonger operation.
type PivotWiderSpec ¶
type PivotWiderSpec struct {
	// NamesFrom is the column whose unique values become new column names.
	NamesFrom string
	// ValuesFrom is the column whose values fill the new columns.
	ValuesFrom string
}
PivotWiderSpec configures a PivotWider operation.
type Reshaper ¶
type Reshaper interface {
	PivotLonger(ds Table, spec PivotLongerSpec) (Table, error)
	PivotWider(ds Table, spec PivotWiderSpec) (Table, error)
	Separate(ds Table, col string, into []string, sep string) (Table, error)
	Concatenate(ds Table, col string, from []string, sep string) (Table, error)
	Complete(ds Table, cols ...string) (Table, error)
}
Reshaper provides reshape/pivot operations. For Arrow: lazy column views (repeatedView, interleavedView). For SQL: generates CASE WHEN / UNPIVOT / CROSSTAB.
type Schema ¶
type Schema struct {
	// contains filtered or unexported fields
}
Schema describes the complete structure of a dataset — an ordered collection of Fields with a name-to-index lookup. This maps directly to arrow.Schema.
func NewSchema ¶
NewSchema creates a Schema from an ordered list of fields. Panics if any two fields share the same name.
func (*Schema) FieldIndex ¶
FieldIndex returns the index of the named field, or -1.
type Selector ¶
type Selector interface {
	// Select reorders/selects rows by index (scatter-gather).
	// This is the Arrow "Take" kernel.
	Select(col AnyColumn, indices []int) (AnyColumn, error)
	// Slice returns rows [start, end) from a column.
	// For Arrow: zero-copy via array.NewSlice.
	Slice(col AnyColumn, start, end int) (AnyColumn, error)
	// SortIndices returns the permutation that sorts the column ascending.
	// Returns an int slice, not a column — it's metadata for Take().
	SortIndices(col AnyColumn) ([]int, error)
	// FilterIndices returns the row indices where mask[i] == true.
	// Returns an int slice for use with Take().
	FilterIndices(mask []bool) []int
}
Selector provides engine-native column/row manipulation primitives. These are the building blocks for Dataset verbs (Select, Arrange, Head, etc.).
For Arrow: zero-copy slicing, compute Take kernel, sort-indices kernel. For Memory: direct slice operations. For SQL: generates ORDER BY, LIMIT/OFFSET, WHERE rowid IN (...).
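To show how a verb like Arrange composes these primitives: compute the permutation once with SortIndices, then gather every column with the Take-style Select. A sketch over raw slices, assuming plain Go in place of the engine kernels:

```go
package main

import (
	"fmt"
	"sort"
)

// sortIndices mirrors Selector.SortIndices: the permutation that sorts
// the column ascending, returned as plain indices.
func sortIndices(col []float64) []int {
	idx := make([]int, len(col))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool { return col[idx[a]] < col[idx[b]] })
	return idx
}

// take mirrors Selector.Select (the Arrow "Take" kernel): gather rows
// by index into a new column.
func take(col []float64, indices []int) []float64 {
	out := make([]float64, len(indices))
	for i, j := range indices {
		out[i] = col[j]
	}
	return out
}

func main() {
	x := []float64{3, 1, 2}
	y := []float64{30, 10, 20}
	perm := sortIndices(x) // Arrange("x") computes this once...
	// ...then applies it to every column so rows stay aligned.
	fmt.Println(take(x, perm), take(y, perm))
}
```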
type StringAppender ¶
type StringAppender interface {
	Append(v string)
	AppendNull()
	AppendValues(vs []string)
	Reserve(n int)
}
StringAppender streams string values into a column.
type Table ¶
type Table interface {
	// Schema returns the dataset's schema.
	Schema() *Schema
	// Column retrieves a named column. Returns [ErrColumnNotFound] if absent.
	// The returned [AnyColumn] can be type-asserted to [Column[T]] for typed
	// access, or use [GetColumn] for a safe generic retrieval.
	Column(name string) (AnyColumn, error)
	// NumRows returns the logical number of rows.
	NumRows() int64
	// NumCols returns the number of columns.
	NumCols() int64
}
Table represents an immutable, columnar data source.
Implementations include in-memory frames, Arrow tables, and SQL-backed remote tables. All ETL operations are available via [Dataset].
type Windower ¶
type Windower interface {
	Lag(col AnyColumn, n int) (AnyColumn, error)
	Lead(col AnyColumn, n int) (AnyColumn, error)
	CumSum(col AnyColumn) (AnyColumn, error)
	CumMax(col AnyColumn) (AnyColumn, error)
	CumMin(col AnyColumn) (AnyColumn, error)
	Rank(col AnyColumn) (AnyColumn, error)
	DenseRank(col AnyColumn) (AnyColumn, error)
	PercentRank(col AnyColumn) (AnyColumn, error)
	RowNumber(n int) (AnyColumn, error)
}
Windower provides window function kernels. For Arrow: streaming accumulators over Arrow arrays. For SQL: generates OVER() / WINDOW clauses.
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| arrow | Package arrow provides an Apache Arrow-backed compute engine for the dataset package. |
| bigquery | Package bigquery implements a BigQuery SQL pushdown engine for the dataset library. |
| compute | Package compute provides portable SIMD primitives for the dataset engines. |
| csv | Package csv provides CSV reading and writing for the dataset package. |
| math | Package math provides SIMD-accelerated mathematical transforms for the dataset engines. |
| memory | Package memory provides a lightweight Go-slice-backed compute engine for the dataset package. |
| parquet | Package parquet provides Parquet reading and writing for the dataset package. |
| sort | Package sort provides SIMD-accelerated sorting for the dataset engines. |