arrow

package
v0.0.8 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 24, 2026 License: MIT Imports: 19 Imported by: 0

Documentation

Overview

Package arrow provides an Apache Arrow-backed compute engine for the dataset package. It implements dataset.ColumnFactory, dataset.BuilderFactory, dataset.Aggregator, and dataset.Caster using Arrow arrays and arrow/math SIMD kernels.

Usage:

eng := arrow.NewEngine(ctx, memory.DefaultAllocator)
f := eng.(dataset.ColumnFactory)
ds, _ := f.FromColumns(
    dataset.NewSchema(dataset.FloatCol("x"), dataset.StringCol("label")),
    f.NewFloat64Column("x", []float64{1, 2, 3}),
    f.NewStringColumn("label", []string{"a", "b", "c"}),
)

Index

Constants

This section is empty.

Variables

View Source
var (
	// ErrUnsupportedType is returned for unsupported column types.
	ErrUnsupportedType = errors.New("arrow: unsupported column type")

	// ErrLengthMismatch is returned when column lengths don't match.
	ErrLengthMismatch = errors.New("arrow: column length mismatch")

	// ErrEmptyColumn is returned when an operation requires non-empty data.
	ErrEmptyColumn = errors.New("arrow: empty column")

	// ErrRequiresFloat64 is returned when a float64 column is required.
	ErrRequiresFloat64 = errors.New("arrow: operation requires float64 column")

	// ErrRequiresInt64 is returned when an int64 column is required.
	ErrRequiresInt64 = errors.New("arrow: operation requires int64 column")

	// ErrRequiresNumeric is returned when a numeric column is required.
	ErrRequiresNumeric = errors.New("arrow: operation requires numeric column")

	// ErrJoinKeyMismatch is returned when join key types don't match.
	ErrJoinKeyMismatch = errors.New("arrow: join key type mismatch")

	// ErrTakeTypeMismatch is returned when a Take/Slice result has unexpected type.
	ErrTakeTypeMismatch = errors.New("arrow: unexpected result type from Take/Slice")

	// ErrComputeTypeMismatch is returned when a compute kernel result has unexpected type.
	ErrComputeTypeMismatch = errors.New("arrow: unexpected result type from compute kernel")

	// ErrOutOfRange is returned when a parameter value is out of the expected range.
	ErrOutOfRange = errors.New("arrow: value out of range")
)

Sentinel errors for the arrow engine package.

Functions

This section is empty.

Types

type Engine

type Engine struct {
	// contains filtered or unexported fields
}

Engine is the Arrow compute backend.

func NewEngine

func NewEngine(ctx context.Context, alloc memory.Allocator) *Engine

NewEngine creates an Arrow engine with the given lifecycle context and memory allocator.

func (*Engine) Abs

func (e *Engine) Abs(col dataset.AnyColumn) (dataset.AnyColumn, error)

Abs returns the absolute value of each element (Arrow native).

func (*Engine) Acos

func (e *Engine) Acos(col dataset.AnyColumn) (dataset.AnyColumn, error)

Acos returns the arc cosine of each element (Arrow native).

func (*Engine) AddCols

func (e *Engine) AddCols(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

AddCols returns the element-wise sum of two float64 columns (Arrow native).

func (*Engine) AddScalar

func (e *Engine) AddScalar(col dataset.AnyColumn, val float64) (dataset.AnyColumn, error)

AddScalar adds a scalar value to every element of a float64 column.

func (*Engine) Alloc

func (e *Engine) Alloc() memory.Allocator

Alloc returns the engine's memory allocator.

func (*Engine) Asin

func (e *Engine) Asin(col dataset.AnyColumn) (dataset.AnyColumn, error)

Asin returns the arc sine of each element (Arrow native).

func (*Engine) Atan

func (e *Engine) Atan(col dataset.AnyColumn) (dataset.AnyColumn, error)

Atan returns the arc tangent of each element (Arrow native).

func (*Engine) Atan2

func (e *Engine) Atan2(y, x dataset.AnyColumn) (dataset.AnyColumn, error)

Atan2 returns the two-argument arc tangent of y/x (Arrow native).

func (*Engine) BitAnd

func (e *Engine) BitAnd(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

BitAnd returns the bitwise AND of two int64 columns.

func (*Engine) BitNot

func (e *Engine) BitNot(col dataset.AnyColumn) (dataset.AnyColumn, error)

BitNot returns the bitwise NOT of each int64 element.

func (*Engine) BitOr

func (e *Engine) BitOr(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

BitOr returns the bitwise OR of two int64 columns.

func (*Engine) BitShiftLeft

func (e *Engine) BitShiftLeft(col dataset.AnyColumn, n int) (dataset.AnyColumn, error)

BitShiftLeft shifts each int64 element left by n bits.

func (*Engine) BitShiftRight

func (e *Engine) BitShiftRight(col dataset.AnyColumn, n int) (dataset.AnyColumn, error)

BitShiftRight shifts each int64 element right by n bits.

func (*Engine) BitXor

func (e *Engine) BitXor(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

BitXor returns the bitwise XOR of two int64 columns.

func (*Engine) Boxplot added in v0.0.5

func (e *Engine) Boxplot(yCol, groupCol dataset.AnyColumn, whisker string, notch bool) (dataset.Table, error)

Boxplot computes the five-number summary for a numeric column.

func (*Engine) Cast

func (e *Engine) Cast(col dataset.AnyColumn, target dataset.DType) (dataset.AnyColumn, error)

Cast converts a column to the target DType.

func (*Engine) Ceil

func (e *Engine) Ceil(col dataset.AnyColumn) (dataset.AnyColumn, error)

Ceil — Arrow lacks this, use stdlib

func (*Engine) Combine

func (e *Engine) Combine(datasets ...dataset.Table) (dataset.Table, error)

Combine horizontally concatenates datasets of equal row count.

func (*Engine) Complete

func (e *Engine) Complete(ds dataset.Table, cols ...string) (dataset.Table, error)

Complete generates all combinations of the specified columns' unique values.

func (*Engine) Concatenate

func (e *Engine) Concatenate(ds dataset.Table, col string, from []string, sep string) (dataset.Table, error)

Concatenate joins multiple string columns into one with a separator.

func (*Engine) Context

func (e *Engine) Context() context.Context

Context returns the engine's lifecycle context.

func (*Engine) Cos

func (e *Engine) Cos(col dataset.AnyColumn) (dataset.AnyColumn, error)

Cos returns the cosine of each element (Arrow native).

func (*Engine) Count

func (e *Engine) Count(col dataset.AnyColumn) (dataset.AnyColumn, error)

Count returns a single-element int64 column with the non-null count.

func (*Engine) CumMax

func (e *Engine) CumMax(col dataset.AnyColumn) (dataset.AnyColumn, error)

CumMax returns the cumulative maximum of a numeric column.

func (*Engine) CumMin

func (e *Engine) CumMin(col dataset.AnyColumn) (dataset.AnyColumn, error)

CumMin returns the cumulative minimum of a numeric column.

func (*Engine) CumSum

func (e *Engine) CumSum(col dataset.AnyColumn) (dataset.AnyColumn, error)

CumSum returns the cumulative sum of the column.

func (*Engine) DenseRank

func (e *Engine) DenseRank(col dataset.AnyColumn) (dataset.AnyColumn, error)

DenseRank returns the dense rank (no gaps between ranks for ties).

func (*Engine) DivCols

func (e *Engine) DivCols(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

DivCols returns the element-wise quotient of two float64 columns.

func (*Engine) DropNA

func (e *Engine) DropNA(ds dataset.Table, cols ...string) (dataset.Table, error)

DropNA removes rows containing null values in the specified columns.

func (*Engine) Erf

func (e *Engine) Erf(col dataset.AnyColumn) (dataset.AnyColumn, error)

Erf returns the error function of each element (highway).

func (*Engine) Exp

func (e *Engine) Exp(col dataset.AnyColumn) (dataset.AnyColumn, error)

Exp — Arrow lacks this, use highway

func (*Engine) Fill

Fill forward- or backward-fills null values in a column.

func (*Engine) Filter

func (e *Engine) Filter(ds dataset.Table, mask dataset.Masker) (dataset.Table, error)

Filter returns a new Table keeping only rows where mask is true.

func (*Engine) FilterIndices

func (e *Engine) FilterIndices(mask []bool) []int

FilterIndices returns the indices where mask is true.

func (*Engine) First added in v0.0.5

func (e *Engine) First(col dataset.AnyColumn) (dataset.AnyColumn, error)

First returns the first element of a column as a single-row column.

func (*Engine) Floor

func (e *Engine) Floor(col dataset.AnyColumn) (dataset.AnyColumn, error)

Floor — Arrow lacks this, use stdlib

func (*Engine) FromColumns

func (e *Engine) FromColumns(schema *dataset.Schema, cols ...dataset.AnyColumn) (dataset.Table, error)

FromColumns builds a Table from pre-built Arrow columns.

func (*Engine) Histogram added in v0.0.5

func (e *Engine) Histogram(col dataset.AnyColumn, nBins int) (dataset.Table, error)

Histogram bins a numeric column into equal-width bins.

func (*Engine) Join

func (e *Engine) Join(left, right dataset.Table, spec dataset.JoinSpec) (dataset.Table, error)

Join implements the Joiner interface with a hash-join algorithm. It supports Inner, Left, Right, Full, Semi, and Anti joins.

func (*Engine) KDE added in v0.0.5

func (e *Engine) KDE(ctx context.Context, col dataset.AnyColumn, bandwidth float64, points int) (dataset.Table, error)

KDE computes kernel density estimation over a numeric column.

func (*Engine) Lag

func (e *Engine) Lag(col dataset.AnyColumn, offset int) (dataset.AnyColumn, error)

Lag shifts column values down by offset positions.

func (*Engine) Last added in v0.0.5

func (e *Engine) Last(col dataset.AnyColumn) (dataset.AnyColumn, error)

Last returns the last element of a column as a single-row column.

func (*Engine) Lead

func (e *Engine) Lead(col dataset.AnyColumn, offset int) (dataset.AnyColumn, error)

Lead shifts column values up by offset positions.

func (*Engine) LinearFit added in v0.0.5

func (e *Engine) LinearFit(xCol, yCol dataset.AnyColumn, nOut int) (dataset.Table, error)

LinearFit computes OLS linear regression y = a + b*x.

func (*Engine) LinearFitSE added in v0.0.5

func (e *Engine) LinearFitSE(xCol, yCol dataset.AnyColumn, nOut int) (dataset.Table, error)

LinearFitSE computes OLS regression with 95% confidence bands.

func (*Engine) Ln

Ln — Arrow native

func (*Engine) LoessFit added in v0.0.5

func (e *Engine) LoessFit(ctx context.Context, xCol, yCol dataset.AnyColumn, nOut int) (dataset.Table, error)

LoessFit computes locally weighted regression (LOESS).

func (*Engine) LoessFitSE added in v0.0.5

func (e *Engine) LoessFitSE(ctx context.Context, xCol, yCol dataset.AnyColumn, nOut int) (dataset.Table, error)

LoessFitSE computes LOESS with approximate 95% confidence bands.

func (*Engine) Log2

func (e *Engine) Log2(col dataset.AnyColumn) (dataset.AnyColumn, error)

Log2 — Arrow native

func (*Engine) Log10

func (e *Engine) Log10(col dataset.AnyColumn) (dataset.AnyColumn, error)

Log10 — Arrow native

func (*Engine) Mean

func (e *Engine) Mean(col dataset.AnyColumn) (dataset.AnyColumn, error)

Mean returns a single-element column containing the arithmetic mean.

func (*Engine) Median

func (e *Engine) Median(col dataset.AnyColumn) (dataset.AnyColumn, error)

Median returns a single-element column containing the median.

func (*Engine) MinMax

MinMax returns two single-element columns containing the min and max.

func (*Engine) Mode added in v0.0.5

func (e *Engine) Mode(col dataset.AnyColumn) (dataset.AnyColumn, error)

Mode returns the most frequent value as a single-row column. For ties, the first sorted value wins (deterministic). Float64 and int64 use a sort-based scan (O(n log n), no map overhead). Strings iterate over the Arrow array directly (no materialization to []string).

func (*Engine) MulCols

func (e *Engine) MulCols(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

MulCols returns the element-wise product of two float64 columns.

func (*Engine) MulScalar

func (e *Engine) MulScalar(col dataset.AnyColumn, val float64) (dataset.AnyColumn, error)

MulScalar multiplies every element of a float64 column by a scalar.

func (*Engine) Name

func (e *Engine) Name() string

Name returns "arrow".

func (*Engine) Neg

func (e *Engine) Neg(col dataset.AnyColumn) (dataset.AnyColumn, error)

Neg returns the negation of each element (Arrow native).

func (*Engine) NewBoolColumn

func (e *Engine) NewBoolColumn(name string, data []bool) dataset.AnyColumn

NewBoolColumn creates a bool column backed by an Arrow array.

func (*Engine) NewBuilder

func (e *Engine) NewBuilder(schema *dataset.Schema) dataset.Builder

NewBuilder creates a row-at-a-time Table builder.

func (*Engine) NewDateColumn added in v0.0.6

func (e *Engine) NewDateColumn(name string, data []int64) dataset.AnyColumn

NewDateColumn creates a date column (days since epoch) stored as Arrow int64.

func (*Engine) NewFloat64Column

func (e *Engine) NewFloat64Column(name string, data []float64) dataset.AnyColumn

NewFloat64Column creates a float64 column backed by an Arrow array.

func (*Engine) NewInt64Column

func (e *Engine) NewInt64Column(name string, data []int64) dataset.AnyColumn

NewInt64Column creates an int64 column backed by an Arrow array.

func (*Engine) NewStringColumn

func (e *Engine) NewStringColumn(name string, data []string) dataset.AnyColumn

NewStringColumn creates a string column backed by an Arrow array.

func (*Engine) NewTimeColumn added in v0.0.6

func (e *Engine) NewTimeColumn(name string, data []int64) dataset.AnyColumn

NewTimeColumn creates a time-of-day column (nanoseconds since midnight) stored as Arrow int64.

func (*Engine) NewTimestampColumn

func (e *Engine) NewTimestampColumn(name string, data []int64) dataset.AnyColumn

NewTimestampColumn creates a timestamp column stored as Arrow int64.

func (*Engine) PercentRank

func (e *Engine) PercentRank(col dataset.AnyColumn) (dataset.AnyColumn, error)

PercentRank returns the percent rank ((rank-1) / (n-1)).

func (*Engine) Percentile added in v0.0.5

func (e *Engine) Percentile(col dataset.AnyColumn, p float64) (dataset.AnyColumn, error)

Percentile returns the p-th quantile as a single-row float64 column. p ∈ [0,1]. Uses sort-based R-7 linear interpolation. Float64: zero-copy slice → copy → sort → interpolate. Int64: zero-copy → convert to float64 → sort → interpolate.

func (*Engine) PivotLonger

func (e *Engine) PivotLonger(ds dataset.Table, spec dataset.PivotLongerSpec) (dataset.Table, error)

PivotLonger reshapes a wide dataset to long format.

func (*Engine) PivotWider

func (e *Engine) PivotWider(ds dataset.Table, spec dataset.PivotWiderSpec) (dataset.Table, error)

PivotWider reshapes a long dataset to wide format.

func (*Engine) Pow

func (e *Engine) Pow(col dataset.AnyColumn, exp float64) (dataset.AnyColumn, error)

Pow raises each element to the given exponent (Arrow native).

func (*Engine) Rank

func (e *Engine) Rank(col dataset.AnyColumn) (dataset.AnyColumn, error)

Rank returns the 1-based rank of each element (ties get the same rank).

func (*Engine) ReplaceNA

func (e *Engine) ReplaceNA(col dataset.AnyColumn, defaultVal float64) (dataset.AnyColumn, error)

ReplaceNA replaces null values with a default float64.

func (*Engine) Round

func (e *Engine) Round(col dataset.AnyColumn) (dataset.AnyColumn, error)

Round — use stdlib for predictable half-away-from-zero behavior

func (*Engine) RowNumber

func (e *Engine) RowNumber(n int) (dataset.AnyColumn, error)

RowNumber returns a 1-based sequential row-number column of length n.

func (*Engine) Select

func (e *Engine) Select(col dataset.AnyColumn, indices []int) (dataset.AnyColumn, error)

Select gathers elements at the given indices.

func (*Engine) Separate

func (e *Engine) Separate(ds dataset.Table, col string, into []string, sep string) (dataset.Table, error)

Separate splits a string column by a delimiter into multiple columns.

func (*Engine) Sigmoid

func (e *Engine) Sigmoid(col dataset.AnyColumn) (dataset.AnyColumn, error)

Sigmoid returns the logistic sigmoid of each element (highway).

func (*Engine) Sign

func (e *Engine) Sign(col dataset.AnyColumn) (dataset.AnyColumn, error)

Sign returns the sign (-1, 0, or 1) of each element (Arrow native).

func (*Engine) Sin

func (e *Engine) Sin(col dataset.AnyColumn) (dataset.AnyColumn, error)

Sin returns the sine of each element (Arrow native).

func (*Engine) Slice

func (e *Engine) Slice(col dataset.AnyColumn, start, end int) (dataset.AnyColumn, error)

Slice returns a contiguous sub-range [start, end) of the column.

func (*Engine) SortIndices

func (e *Engine) SortIndices(col dataset.AnyColumn) ([]int, error)

SortIndices uses Arrow compute's SortIndicesArray kernel. Arrow's implementation handles null placement and type dispatch natively.

func (*Engine) Sqrt

func (e *Engine) Sqrt(col dataset.AnyColumn) (dataset.AnyColumn, error)

Sqrt — Arrow lacks this, use highway/stdlib

func (*Engine) Stack

func (e *Engine) Stack(datasets ...dataset.Table) (dataset.Table, error)

Stack vertically concatenates datasets with identical schemas.

func (*Engine) StdDev added in v0.0.5

func (e *Engine) StdDev(col dataset.AnyColumn) (dataset.AnyColumn, error)

StdDev returns the sample standard deviation as a single-row float64 column.

func (*Engine) SubCols

func (e *Engine) SubCols(a, b dataset.AnyColumn) (dataset.AnyColumn, error)

SubCols returns the element-wise difference of two float64 columns.

func (*Engine) Sum

func (e *Engine) Sum(col dataset.AnyColumn) (dataset.AnyColumn, error)

Sum returns a single-element column containing the sum.

func (*Engine) Tan

func (e *Engine) Tan(col dataset.AnyColumn) (dataset.AnyColumn, error)

Tan returns the tangent of each element (Arrow native).

func (*Engine) Tanh

func (e *Engine) Tanh(col dataset.AnyColumn) (dataset.AnyColumn, error)

Tanh returns the hyperbolic tangent of each element (highway).

func (*Engine) Variance

func (e *Engine) Variance(col dataset.AnyColumn) (dataset.AnyColumn, error)

Variance returns a single-element column containing the sample variance.

Directories

Path Synopsis
Package csv provides the Arrow CSV engine driver.
Package csv provides the Arrow CSV engine driver.
Package parquet provides the Arrow Parquet engine driver.
Package parquet provides the Arrow Parquet engine driver.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL