arrow

package
v0.0.5 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 21, 2026 License: MIT Imports: 24 Imported by: 0

Documentation

Overview

Package arrow provides an Apache Arrow-backed compute engine for the dataset package. It implements dataset.ColumnFactory, dataset.BuilderFactory, dataset.Aggregator, and dataset.Caster using Arrow arrays and arrow/math SIMD kernels.

Usage:

eng := arrow.NewEngine(ctx, memory.DefaultAllocator)
f := eng.(dataset.ColumnFactory)
ds, _ := f.FromColumns(
    dataset.NewSchema(dataset.FloatCol("x"), dataset.StringCol("label")),
    f.NewFloat64Column("x", []float64{1, 2, 3}),
    f.NewStringColumn("label", []string{"a", "b", "c"}),
)

Index

Constants

This section is empty.

Variables

View Source
var (
	// ErrUnsupportedType is returned for unsupported column types.
	ErrUnsupportedType = errors.New("arrow: unsupported column type")

	// ErrLengthMismatch is returned when column lengths don't match.
	ErrLengthMismatch = errors.New("arrow: column length mismatch")

	// ErrEmptyColumn is returned when an operation requires non-empty data.
	ErrEmptyColumn = errors.New("arrow: empty column")

	// ErrRequiresFloat64 is returned when a float64 column is required.
	ErrRequiresFloat64 = errors.New("arrow: operation requires float64 column")

	// ErrRequiresInt64 is returned when an int64 column is required.
	ErrRequiresInt64 = errors.New("arrow: operation requires int64 column")

	// ErrRequiresNumeric is returned when a numeric column is required.
	ErrRequiresNumeric = errors.New("arrow: operation requires numeric column")

	// ErrJoinKeyMismatch is returned when join key types don't match.
	ErrJoinKeyMismatch = errors.New("arrow: join key type mismatch")

	// ErrTakeTypeMismatch is returned when a Take/Slice result has unexpected type.
	ErrTakeTypeMismatch = errors.New("arrow: unexpected result type from Take/Slice")

	// ErrComputeTypeMismatch is returned when a compute kernel result has unexpected type.
	ErrComputeTypeMismatch = errors.New("arrow: unexpected result type from compute kernel")

	// ErrOutOfRange is returned when a parameter value is out of the expected range.
	ErrOutOfRange = errors.New("arrow: value out of range")
)

Sentinel errors for the arrow engine package.

Functions

This section is empty.

Types

type Engine

type Engine struct {
	// contains filtered or unexported fields
}

Engine is the Arrow compute backend.

func NewEngine

func NewEngine(ctx context.Context, alloc memory.Allocator) *Engine

NewEngine creates an Arrow engine with the given lifecycle context and memory allocator.

func (*Engine) Abs

func (e *Engine) Abs(col dataset.AnyColumn) (dataset.AnyColumn, error)

Abs returns the absolute value of each element (Arrow native).

func (*Engine) Acos

func (e *Engine) Acos(col dataset.AnyColumn) (dataset.AnyColumn, error)

Acos returns the arc cosine of each element (Arrow native).

func (*Engine) AddCols

func (e *Engine) AddCols(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

AddCols returns the element-wise sum of two float64 columns (Arrow native).

func (*Engine) AddScalar

func (e *Engine) AddScalar(col dataset.AnyColumn, val float64) (dataset.AnyColumn, error)

AddScalar adds a scalar value to every element of a float64 column.

func (*Engine) Alloc

func (e *Engine) Alloc() memory.Allocator

Alloc returns the engine's memory allocator.

func (*Engine) Asin

func (e *Engine) Asin(col dataset.AnyColumn) (dataset.AnyColumn, error)

Asin returns the arc sine of each element (Arrow native).

func (*Engine) Atan

func (e *Engine) Atan(col dataset.AnyColumn) (dataset.AnyColumn, error)

Atan returns the arc tangent of each element (Arrow native).

func (*Engine) Atan2

func (e *Engine) Atan2(y, x dataset.AnyColumn) (dataset.AnyColumn, error)

Atan2 returns the two-argument arc tangent of y/x (Arrow native).

func (*Engine) BitAnd

func (e *Engine) BitAnd(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

BitAnd returns the bitwise AND of two int64 columns.

func (*Engine) BitNot

func (e *Engine) BitNot(col dataset.AnyColumn) (dataset.AnyColumn, error)

BitNot returns the bitwise NOT of each int64 element.

func (*Engine) BitOr

func (e *Engine) BitOr(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

BitOr returns the bitwise OR of two int64 columns.

func (*Engine) BitShiftLeft

func (e *Engine) BitShiftLeft(col dataset.AnyColumn, n int) (dataset.AnyColumn, error)

BitShiftLeft shifts each int64 element left by n bits.

func (*Engine) BitShiftRight

func (e *Engine) BitShiftRight(col dataset.AnyColumn, n int) (dataset.AnyColumn, error)

BitShiftRight shifts each int64 element right by n bits.

func (*Engine) BitXor

func (e *Engine) BitXor(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

BitXor returns the bitwise XOR of two int64 columns.

func (*Engine) Boxplot added in v0.0.5

func (e *Engine) Boxplot(yCol, groupCol dataset.AnyColumn, whisker string, notch bool) (dataset.Table, error)

Boxplot computes the five-number summary for a numeric column.

func (*Engine) Cast

func (e *Engine) Cast(col dataset.AnyColumn, target dataset.DType) (dataset.AnyColumn, error)

Cast converts a column to the target DType.

func (*Engine) Ceil

func (e *Engine) Ceil(col dataset.AnyColumn) (dataset.AnyColumn, error)

Ceil — Arrow lacks this, use stdlib

func (*Engine) Combine

func (e *Engine) Combine(datasets ...dataset.Table) (dataset.Table, error)

Combine horizontally concatenates datasets of equal row count.

func (*Engine) Complete

func (e *Engine) Complete(ds dataset.Table, cols ...string) (dataset.Table, error)

Complete generates all combinations of the specified columns' unique values.

func (*Engine) Concatenate

func (e *Engine) Concatenate(ds dataset.Table, col string, from []string, sep string) (dataset.Table, error)

Concatenate joins multiple string columns into one with a separator.

func (*Engine) Context

func (e *Engine) Context() context.Context

Context returns the engine's lifecycle context.

func (*Engine) Cos

func (e *Engine) Cos(col dataset.AnyColumn) (dataset.AnyColumn, error)

Cos returns the cosine of each element (Arrow native).

func (*Engine) Count

func (e *Engine) Count(col dataset.AnyColumn) (dataset.AnyColumn, error)

Count returns a single-element int64 column with the non-null count.

func (*Engine) CumMax

func (e *Engine) CumMax(col dataset.AnyColumn) (dataset.AnyColumn, error)

CumMax returns the cumulative maximum of a numeric column.

func (*Engine) CumMin

func (e *Engine) CumMin(col dataset.AnyColumn) (dataset.AnyColumn, error)

CumMin returns the cumulative minimum of a numeric column.

func (*Engine) CumSum

func (e *Engine) CumSum(col dataset.AnyColumn) (dataset.AnyColumn, error)

CumSum returns the cumulative sum of the column.

func (*Engine) DenseRank

func (e *Engine) DenseRank(col dataset.AnyColumn) (dataset.AnyColumn, error)

DenseRank returns the dense rank (no gaps between ranks for ties).

func (*Engine) DivCols

func (e *Engine) DivCols(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

DivCols returns the element-wise quotient of two float64 columns.

func (*Engine) DropNA

func (e *Engine) DropNA(ds dataset.Table, cols ...string) (dataset.Table, error)

DropNA removes rows containing null values in the specified columns.

func (*Engine) Erf

func (e *Engine) Erf(col dataset.AnyColumn) (dataset.AnyColumn, error)

Erf returns the error function of each element (highway).

func (*Engine) Exp

func (e *Engine) Exp(col dataset.AnyColumn) (dataset.AnyColumn, error)

Exp — Arrow lacks this, use highway

func (*Engine) Fill

Fill forward- or backward-fills null values in a column.

func (*Engine) Filter

func (e *Engine) Filter(ds dataset.Table, mask dataset.Masker) (dataset.Table, error)

Filter returns a new Table keeping only rows where mask is true.

func (*Engine) FilterIndices

func (e *Engine) FilterIndices(mask []bool) []int

FilterIndices returns the indices where mask is true.

func (*Engine) First added in v0.0.5

func (e *Engine) First(col dataset.AnyColumn) (dataset.AnyColumn, error)

First returns the first element of a column as a single-row column.

func (*Engine) Floor

func (e *Engine) Floor(col dataset.AnyColumn) (dataset.AnyColumn, error)

Floor — Arrow lacks this, use stdlib

func (*Engine) FromColumns

func (e *Engine) FromColumns(schema *dataset.Schema, cols ...dataset.AnyColumn) (dataset.Table, error)

FromColumns builds a Table from pre-built Arrow columns.

func (*Engine) Histogram added in v0.0.5

func (e *Engine) Histogram(col dataset.AnyColumn, nBins int) (dataset.Table, error)

Histogram bins a numeric column into equal-width bins.

func (*Engine) Join

func (e *Engine) Join(left, right dataset.Table, spec dataset.JoinSpec) (dataset.Table, error)

Join implements the Joiner interface with a hash-join algorithm. It supports Inner, Left, Right, Full, Semi, and Anti joins.

func (*Engine) KDE added in v0.0.5

func (e *Engine) KDE(ctx context.Context, col dataset.AnyColumn, bandwidth float64, points int) (dataset.Table, error)

KDE computes kernel density estimation over a numeric column.

func (*Engine) Lag

func (e *Engine) Lag(col dataset.AnyColumn, offset int) (dataset.AnyColumn, error)

Lag shifts column values down by offset positions.

func (*Engine) Last added in v0.0.5

func (e *Engine) Last(col dataset.AnyColumn) (dataset.AnyColumn, error)

Last returns the last element of a column as a single-row column.

func (*Engine) Lead

func (e *Engine) Lead(col dataset.AnyColumn, offset int) (dataset.AnyColumn, error)

Lead shifts column values up by offset positions.

func (*Engine) LinearFit added in v0.0.5

func (e *Engine) LinearFit(xCol, yCol dataset.AnyColumn, nOut int) (dataset.Table, error)

LinearFit computes OLS linear regression y = a + b*x.

func (*Engine) LinearFitSE added in v0.0.5

func (e *Engine) LinearFitSE(xCol, yCol dataset.AnyColumn, nOut int) (dataset.Table, error)

LinearFitSE computes OLS regression with 95% confidence bands.

func (*Engine) Ln

Ln — Arrow native

func (*Engine) LoessFit added in v0.0.5

func (e *Engine) LoessFit(ctx context.Context, xCol, yCol dataset.AnyColumn, nOut int) (dataset.Table, error)

LoessFit computes locally weighted regression (LOESS).

func (*Engine) LoessFitSE added in v0.0.5

func (e *Engine) LoessFitSE(ctx context.Context, xCol, yCol dataset.AnyColumn, nOut int) (dataset.Table, error)

LoessFitSE computes LOESS with approximate 95% confidence bands.

func (*Engine) Log2

func (e *Engine) Log2(col dataset.AnyColumn) (dataset.AnyColumn, error)

Log2 — Arrow native

func (*Engine) Log10

func (e *Engine) Log10(col dataset.AnyColumn) (dataset.AnyColumn, error)

Log10 — Arrow native

func (*Engine) Mean

func (e *Engine) Mean(col dataset.AnyColumn) (dataset.AnyColumn, error)

Mean returns a single-element column containing the arithmetic mean.

func (*Engine) Median

func (e *Engine) Median(col dataset.AnyColumn) (dataset.AnyColumn, error)

Median returns a single-element column containing the median.

func (*Engine) MinMax

MinMax returns two single-element columns containing the min and max.

func (*Engine) Mode added in v0.0.5

func (e *Engine) Mode(col dataset.AnyColumn) (dataset.AnyColumn, error)

Mode returns the most frequent value as a single-row column. For ties, the first sorted value wins (deterministic). Float64 and int64 use a sort-based scan (O(n log n), no map overhead). Strings iterate over the Arrow array directly (no materialization to []string).

func (*Engine) MulCols

func (e *Engine) MulCols(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

MulCols returns the element-wise product of two float64 columns.

func (*Engine) MulScalar

func (e *Engine) MulScalar(col dataset.AnyColumn, val float64) (dataset.AnyColumn, error)

MulScalar multiplies every element of a float64 column by a scalar.

func (*Engine) Name

func (e *Engine) Name() string

Name returns "arrow".

func (*Engine) Neg

func (e *Engine) Neg(col dataset.AnyColumn) (dataset.AnyColumn, error)

Neg returns the negation of each element (Arrow native).

func (*Engine) NewBoolColumn

func (e *Engine) NewBoolColumn(name string, data []bool) dataset.AnyColumn

NewBoolColumn creates a bool column backed by an Arrow array.

func (*Engine) NewBuilder

func (e *Engine) NewBuilder(schema *dataset.Schema) dataset.Builder

NewBuilder creates a row-at-a-time Table builder.

func (*Engine) NewFloat64Column

func (e *Engine) NewFloat64Column(name string, data []float64) dataset.AnyColumn

NewFloat64Column creates a float64 column backed by an Arrow array.

func (*Engine) NewInt64Column

func (e *Engine) NewInt64Column(name string, data []int64) dataset.AnyColumn

NewInt64Column creates an int64 column backed by an Arrow array.

func (*Engine) NewStringColumn

func (e *Engine) NewStringColumn(name string, data []string) dataset.AnyColumn

NewStringColumn creates a string column backed by an Arrow array.

func (*Engine) NewTimestampColumn

func (e *Engine) NewTimestampColumn(name string, data []int64) dataset.AnyColumn

NewTimestampColumn creates a timestamp column stored as Arrow int64.

func (*Engine) PercentRank

func (e *Engine) PercentRank(col dataset.AnyColumn) (dataset.AnyColumn, error)

PercentRank returns the percent rank ((rank-1) / (n-1)).

func (*Engine) Percentile added in v0.0.5

func (e *Engine) Percentile(col dataset.AnyColumn, p float64) (dataset.AnyColumn, error)

Percentile returns the p-th quantile as a single-row float64 column. p ∈ [0,1]. Uses sort-based R-7 linear interpolation. Float64: zero-copy slice → copy → sort → interpolate. Int64: zero-copy → convert to float64 → sort → interpolate.

func (*Engine) PivotLonger

func (e *Engine) PivotLonger(ds dataset.Table, spec dataset.PivotLongerSpec) (dataset.Table, error)

PivotLonger reshapes a wide dataset to long format.

func (*Engine) PivotWider

func (e *Engine) PivotWider(ds dataset.Table, spec dataset.PivotWiderSpec) (dataset.Table, error)

PivotWider reshapes a long dataset to wide format.

func (*Engine) Pow

func (e *Engine) Pow(col dataset.AnyColumn, exp float64) (dataset.AnyColumn, error)

Pow raises each element to the given exponent (Arrow native).

func (*Engine) Rank

func (e *Engine) Rank(col dataset.AnyColumn) (dataset.AnyColumn, error)

Rank returns the 1-based rank of each element (ties get the same rank).

func (*Engine) ReadCSV

func (e *Engine) ReadCSV(_ context.Context, r io.Reader, cfg dataset.CSVConfig) (dataset.Table, error)

ReadCSV reads CSV data using arrow/csv.NewInferringReader with chunked streaming. Default chunk is 64K rows per batch to bound memory for large files.

func (*Engine) ReadParquet

func (e *Engine) ReadParquet(ctx context.Context, r io.ReaderAt, size int64, _ dataset.ParquetConfig) (dataset.Table, error)

ReadParquet reads Parquet data using pqarrow for zero-copy columnar ingest.

func (*Engine) ReplaceNA

func (e *Engine) ReplaceNA(col dataset.AnyColumn, defaultVal float64) (dataset.AnyColumn, error)

ReplaceNA replaces null values with a default float64.

func (*Engine) Round

func (e *Engine) Round(col dataset.AnyColumn) (dataset.AnyColumn, error)

Round — use stdlib for predictable half-away-from-zero behavior

func (*Engine) RowNumber

func (e *Engine) RowNumber(n int) (dataset.AnyColumn, error)

RowNumber returns a 1-based sequential row-number column of length n.

func (*Engine) Select

func (e *Engine) Select(col dataset.AnyColumn, indices []int) (dataset.AnyColumn, error)

Select gathers elements at the given indices.

func (*Engine) Separate

func (e *Engine) Separate(ds dataset.Table, col string, into []string, sep string) (dataset.Table, error)

Separate splits a string column by a delimiter into multiple columns.

func (*Engine) Sigmoid

func (e *Engine) Sigmoid(col dataset.AnyColumn) (dataset.AnyColumn, error)

Sigmoid returns the logistic sigmoid of each element (highway).

func (*Engine) Sign

func (e *Engine) Sign(col dataset.AnyColumn) (dataset.AnyColumn, error)

Sign returns the sign (-1, 0, or 1) of each element (Arrow native).

func (*Engine) Sin

func (e *Engine) Sin(col dataset.AnyColumn) (dataset.AnyColumn, error)

Sin returns the sine of each element (Arrow native).

func (*Engine) Slice

func (e *Engine) Slice(col dataset.AnyColumn, start, end int) (dataset.AnyColumn, error)

Slice returns a contiguous sub-range [start, end) of the column.

func (*Engine) SortIndices

func (e *Engine) SortIndices(col dataset.AnyColumn) ([]int, error)

SortIndices uses Arrow compute's SortIndicesArray kernel. Arrow's implementation handles null placement and type dispatch natively.

func (*Engine) Sqrt

func (e *Engine) Sqrt(col dataset.AnyColumn) (dataset.AnyColumn, error)

Sqrt — Arrow lacks this, use highway/stdlib

func (*Engine) Stack

func (e *Engine) Stack(datasets ...dataset.Table) (dataset.Table, error)

Stack vertically concatenates datasets with identical schemas.

func (*Engine) StdDev added in v0.0.5

func (e *Engine) StdDev(col dataset.AnyColumn) (dataset.AnyColumn, error)

StdDev returns the sample standard deviation as a single-row float64 column.

func (*Engine) SubCols

func (e *Engine) SubCols(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

SubCols returns the element-wise difference of two float64 columns.

func (*Engine) Sum

func (e *Engine) Sum(col dataset.AnyColumn) (dataset.AnyColumn, error)

Sum returns a single-element column containing the sum.

func (*Engine) Tan

func (e *Engine) Tan(col dataset.AnyColumn) (dataset.AnyColumn, error)

Tan returns the tangent of each element (Arrow native).

func (*Engine) Tanh

func (e *Engine) Tanh(col dataset.AnyColumn) (dataset.AnyColumn, error)

Tanh returns the hyperbolic tangent of each element (highway).

func (*Engine) Variance

func (e *Engine) Variance(col dataset.AnyColumn) (dataset.AnyColumn, error)

Variance returns a single-element column containing the sample variance.

func (*Engine) WriteCSV

func (e *Engine) WriteCSV(_ context.Context, w io.Writer, ds dataset.Table, cfg dataset.CSVConfig) error

WriteCSV writes a Dataset as CSV using go-simdcsv (generic string-based output).

func (*Engine) WriteParquet

func (e *Engine) WriteParquet(_ context.Context, w io.Writer, ds dataset.Table, _ dataset.ParquetConfig) error

WriteParquet writes a Dataset as Parquet using pqarrow.WriteTable.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL