dataset

package
v3.7.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 26, 2026 License: AGPL-3.0 Imports: 29 Imported by: 0

Documentation

Overview

Package dataset contains utilities for working with datasets. Datasets hold columnar data across multiple pages.

Index

Constants

This section is empty.

Variables

View Source
var (
	// Column statistics
	StatPrimaryColumns   = xcap.NewStatisticInt64("primary.columns", xcap.AggregationTypeSum)
	StatSecondaryColumns = xcap.NewStatisticInt64("secondary.columns", xcap.AggregationTypeSum)

	// Page statistics
	StatPrimaryColumnPages   = xcap.NewStatisticInt64("primary.column.pages", xcap.AggregationTypeSum)
	StatSecondaryColumnPages = xcap.NewStatisticInt64("secondary.column.pages", xcap.AggregationTypeSum)

	// Row statistics
	StatMaxRows           = xcap.NewStatisticInt64("row.max", xcap.AggregationTypeSum)
	StatRowsAfterPruning  = xcap.NewStatisticInt64("rows.after.pruning", xcap.AggregationTypeSum)
	StatPrimaryRowsRead   = xcap.NewStatisticInt64("primary.rows.read", xcap.AggregationTypeSum)
	StatSecondaryRowsRead = xcap.NewStatisticInt64("secondary.rows.read", xcap.AggregationTypeSum)
	StatPrimaryRowBytes   = xcap.NewStatisticInt64("primary.row.read.bytes", xcap.AggregationTypeSum)
	StatSecondaryRowBytes = xcap.NewStatisticInt64("secondary.row.read.bytes", xcap.AggregationTypeSum)

	// Download/Page scan statistics
	StatPagesScanned         = xcap.NewStatisticInt64("pages.scanned", xcap.AggregationTypeSum)
	StatPagesFoundInCache    = xcap.NewStatisticInt64("pages.cache.hit", xcap.AggregationTypeSum)
	StatPageDownloadRequests = xcap.NewStatisticInt64("pages.download.requests", xcap.AggregationTypeSum)
	StatPageDownloadTime     = xcap.NewStatisticInt64("pages.download.duration.ns", xcap.AggregationTypeSum)

	// Page download byte statistics
	StatPrimaryPagesDownloaded           = xcap.NewStatisticInt64("primary.pages.downloaded", xcap.AggregationTypeSum)
	StatSecondaryPagesDownloaded         = xcap.NewStatisticInt64("secondary.pages.downloaded", xcap.AggregationTypeSum)
	StatPrimaryColumnBytes               = xcap.NewStatisticInt64("primary.pages.compressed.bytes", xcap.AggregationTypeSum)
	StatSecondaryColumnBytes             = xcap.NewStatisticInt64("secondary.pages.compressed.bytes", xcap.AggregationTypeSum)
	StatPrimaryColumnUncompressedBytes   = xcap.NewStatisticInt64("primary.pages.uncompressed.bytes", xcap.AggregationTypeSum)
	StatSecondaryColumnUncompressedBytes = xcap.NewStatisticInt64("secondary.pages.uncompressed.bytes", xcap.AggregationTypeSum)

	// Read operation statistics
	StatReadCalls = xcap.NewStatisticInt64("read.calls", xcap.AggregationTypeSum)
)

xcap statistics for dataset reader operations.

Functions

func CompareValues

func CompareValues(a, b *Value) int

CompareValues returns -1 if a<b, 0 if a==b, or 1 if a>b. CompareValues panics if a and b are not the same type.

As a special case, either a or b may be nil. Two nil values are equal, and a nil value is always less than a non-nil value.

func WalkPredicate added in v3.5.0

func WalkPredicate(p Predicate, fn func(p Predicate) bool)

WalkPredicate traverses a predicate in depth-first order: it starts by calling fn(p). If fn(p) returns true, WalkPredicate is invoked recursively with fn for each of the non-nil children of p, followed by a call of fn(nil).

Types

type AndPredicate added in v3.5.0

type AndPredicate struct{ Left, Right Predicate }

An AndPredicate is a Predicate which asserts that a row may only be included if both the Left and Right Predicate are true.

type BinaryValueSet added in v3.6.0

type BinaryValueSet struct {
	// contains filtered or unexported fields
}

func NewBinaryValueSet added in v3.6.0

func NewBinaryValueSet(values []Value) BinaryValueSet

func (BinaryValueSet) Contains added in v3.6.0

func (s BinaryValueSet) Contains(value Value) bool

func (BinaryValueSet) Iter added in v3.6.0

func (s BinaryValueSet) Iter() iter.Seq[Value]

func (BinaryValueSet) Size added in v3.6.0

func (s BinaryValueSet) Size() int

type BuilderOptions

type BuilderOptions struct {
	// PageSizeHint is the soft limit for the size of the page. Builders try to
	// fill pages as close to this size as possible, but the actual size may be
	// slightly larger or smaller.
	PageSizeHint int

	// PageMaxRowCount is the limit for the number of rows of the page.
	// When 0 or a negative number, then builders use the [BuilderOptions.PageSizeHint]
	// option to determine when a page needs to be flushed.
	PageMaxRowCount int

	// Type is the type of data in the column. Type.Physical is used for
	// encoding; Type.Logical is used as a hint to readers.
	Type ColumnType

	// Encoding is the encoding algorithm to use for values.
	Encoding datasetmd.EncodingType

	// Compression is the compression algorithm to use for values.
	Compression datasetmd.CompressionType

	// CompressionOptions holds optional configuration for compression.
	CompressionOptions *CompressionOptions

	// StatisticsOptions holds optional configuration for statistics.
	Statistics StatisticsOptions
}

BuilderOptions configures common settings for building pages.

type Column

type Column interface {
	// ColumnDesc returns the metadata for the Column.
	ColumnDesc() *ColumnDesc

	// ListPages returns the set of ordered pages in the column.
	ListPages(ctx context.Context) result.Seq[Page]
}

A Column represents a sequence of values within a dataset. Columns are split up across one or more [Page]s to limit the amount of memory needed to read a portion of the column at a time.

type ColumnBuilder

type ColumnBuilder struct {
	// contains filtered or unexported fields
}

A ColumnBuilder builds a sequence of Value entries of a common type into a column. Values are accumulated into a buffer and then flushed into [MemPage]s once the size of data exceeds a configurable limit.

func NewColumnBuilder

func NewColumnBuilder(tag string, opts BuilderOptions) (*ColumnBuilder, error)

NewColumnBuilder creates a new ColumnBuilder from the optional tag and provided options. NewColumnBuilder returns an error if the options are invalid.

func (*ColumnBuilder) Append

func (cb *ColumnBuilder) Append(row int, value Value) error

Append adds a new value into cb with the given zero-indexed row number. If the row number is higher than the current number of rows in cb, null values are added up to the new row.

Append returns an error if the row number is out-of-order.

func (*ColumnBuilder) Backfill

func (cb *ColumnBuilder) Backfill(row int)

Backfill adds NULLs into cb up to (but not including) the provided row number. If values exist up to the provided row number, Backfill does nothing.

func (*ColumnBuilder) EstimatedSize

func (cb *ColumnBuilder) EstimatedSize() int

EstimatedSize returns the estimated size of all data in cb. EstimatedSize includes the compressed size of all cut pages in cb, followed by the size estimate of the in-progress page.

Because compression isn't considered for the in-progress page, EstimatedSize tends to overestimate the actual size after flushing.

func (*ColumnBuilder) Flush

func (cb *ColumnBuilder) Flush() (*MemColumn, error)

Flush converts data in cb into a MemColumn. Afterwards, cb is reset to a fresh state and can be reused.

func (*ColumnBuilder) Reset

func (cb *ColumnBuilder) Reset()

Reset clears all data in cb and resets it to a fresh state.

type ColumnDesc added in v3.6.0

type ColumnDesc struct {
	Type        ColumnType                // Type of values in the column.
	Tag         string                    // Optional string to distinguish columns with the same type.
	Compression datasetmd.CompressionType // Compression used for the column.

	PagesCount       int // Total number of pages in the column.
	RowsCount        int // Total number of rows in the column.
	ValuesCount      int // Total number of non-NULL values in the column.
	CompressedSize   int // Total size of all pages in the column after compression.
	UncompressedSize int // Total size of all pages in the column before compression.

	Statistics *datasetmd.Statistics // Optional statistics for the column.
}

ColumnDesc describes a column.

type ColumnType added in v3.6.0

type ColumnType struct {
	// Physical is the type of values physically stored within the column.
	Physical datasetmd.PhysicalType

	// Logical is a custom string indicating how dataset-derived sections
	// should interpret the physical type.
	Logical string
}

ColumnType represents the type of data stored in a column. Column types are represented as a tuple of a physical and logical type.

type CompressionOptions

type CompressionOptions struct {
	// Zstd holds encoding options for Zstd compression. Only used for
	// [datasetmd.COMPRESSION_TYPE_ZSTD].
	Zstd []zstd.EOption
	// contains filtered or unexported fields
}

CompressionOptions customizes the compressor used when building pages. CompressionOptions cache byte compressors to reduce total allocations. As an optimization, callers should reuse CompressionOptions pointers wherever possible.

type Dataset

type Dataset interface {
	// ListColumns returns the set of [Column]s in the Dataset. The order of
	// Columns in the returned sequence must be consistent across calls.
	ListColumns(ctx context.Context) result.Seq[Column]

	// ListPages retrieves a set of [Pages] given a list of [Column]s.
	// Implementations of Dataset may use ListPages to optimize for batch reads.
	// The order of [Pages] in the returned sequence must match the order of the
	// columns argument.
	ListPages(ctx context.Context, columns []Column) result.Seq[Pages]

	// ReadPages returns the set of [PageData] for the specified slice of pages.
	// Implementations of Dataset may use ReadPages to optimize for batch reads.
	// The order of [PageData] in the returned sequence must match the order of
	// the pages argument.
	ReadPages(ctx context.Context, pages []Page) result.Seq[PageData]
}

A Dataset holds a collection of [Columns], each of which is split into a set of Pages and further split into a sequence of [Values].

Dataset is read-only; callers must not modify any of the values returned by methods in Dataset.

func FromMemory

func FromMemory(columns []*MemColumn) Dataset

FromMemory returns an in-memory Dataset from the given list of [MemColumn]s.

type EqualPredicate added in v3.5.0

type EqualPredicate struct {
	Column Column // Column to check.
	Value  Value  // Value to check equality for.
}

An EqualPredicate is a Predicate which asserts that a row may only be included if the Value of the Column is equal to the Value.

type FalsePredicate added in v3.5.0

type FalsePredicate struct{}

FalsePredicate is a Predicate which always returns false.

type FuncPredicate added in v3.5.0

type FuncPredicate struct {
	Column Column // Column to check.

	// Keep is invoked with the column and value pair to check. Keep is given
	// the Column instance to allow for reusing the same function across
	// multiple columns, if necessary.
	//
	// If Keep returns true, the row is kept.
	Keep func(column Column, value Value) bool
}

FuncPredicate is a Predicate which asserts that a row may only be included if the Value of the Column passes the Keep function.

Instances of FuncPredicate are ineligible for page filtering and should only be used when there isn't a more explicit Predicate implementation.

type GreaterThanPredicate added in v3.5.0

type GreaterThanPredicate struct {
	Column Column // Column to check.
	Value  Value  // Value for which rows in Column must be greater than.
}

A GreaterThanPredicate is a Predicate which asserts that a row may only be included if the Value of the Column is greater than the provided Value.

type InPredicate added in v3.6.0

type InPredicate struct {
	Column Column   // Column to check.
	Values ValueSet // Set of values to check.
}

An InPredicate is a Predicate which asserts that a row may only be included if the Value of the Column is present in the provided Values.

type Int64Set added in v3.6.0

type Int64Set struct {
	// contains filtered or unexported fields
}

func NewInt64ValueSet added in v3.6.0

func NewInt64ValueSet(values []Value) Int64Set

func (Int64Set) Contains added in v3.6.0

func (s Int64Set) Contains(value Value) bool

func (Int64Set) Iter added in v3.6.0

func (s Int64Set) Iter() iter.Seq[Value]

func (Int64Set) Size added in v3.6.0

func (s Int64Set) Size() int

type InvalidTypeError added in v3.6.0

type InvalidTypeError struct {
	Expected datasetmd.PhysicalType
	Actual   datasetmd.PhysicalType
}

InvalidTypeError is used as a panic value when using Value methods with the incorrect type.

func (*InvalidTypeError) Error added in v3.6.0

func (e *InvalidTypeError) Error() string

Error returns a string representation denoting the expected and actual types.

type LessThanPredicate added in v3.5.0

type LessThanPredicate struct {
	Column Column // Column to check.
	Value  Value  // Value for which rows in Column must be less than.
}

A LessThanPredicate is a Predicate which asserts that a row may only be included if the Value of the Column is less than the provided Value.

type MemColumn

type MemColumn struct {
	Desc  ColumnDesc // Description of the column.
	Pages []*MemPage // The set of pages in the column.
}

MemColumn holds a set of pages of a common type.

func (*MemColumn) ColumnDesc added in v3.6.0

func (c *MemColumn) ColumnDesc() *ColumnDesc

ColumnDesc implements Column and returns c.Desc.

func (*MemColumn) ListPages

func (c *MemColumn) ListPages(_ context.Context) result.Seq[Page]

ListPages implements Column and iterates through c.Pages.

type MemPage

type MemPage struct {
	Desc PageDesc // Description of the page.
	Data PageData // Data for the page.
}

MemPage holds an encoded (and optionally compressed) sequence of Value entries of a common type. Use ColumnBuilder to construct sets of pages.

func (*MemPage) PageDesc added in v3.6.0

func (p *MemPage) PageDesc() *PageDesc

PageDesc implements Page and returns p.Desc.

func (*MemPage) ReadPage

func (p *MemPage) ReadPage(_ context.Context) (PageData, error)

ReadPage implements Page and returns p.Data.

type NotPredicate added in v3.5.0

type NotPredicate struct{ Inner Predicate }

A NotePredicate is a Predicate which asserts that a row may only be included if the inner Predicate is false.

type OrPredicate added in v3.5.0

type OrPredicate struct{ Left, Right Predicate }

An OrPredicate is a Predicate which asserts that a row may only be included if either the Left or Right Predicate are true.

type Page

type Page interface {
	// PageDesc returns the metadata for the Page.
	PageDesc() *PageDesc

	// ReadPage returns the [PageData] for the Page.
	ReadPage(ctx context.Context) (PageData, error)
}

A Page holds an encoded and optionally compressed sequence of [Value]s within a Column.

type PageData

type PageData []byte

PageData holds the raw data for a page. Data is formatted as:

<uvarint(presence-bitmap-size)> <presence-bitmap> <values-data>

The presence-bitmap is a bitmap-encoded sequence of booleans, where values describe which rows are present (1) or nil (0). The presence bitmap is always stored uncompressed.

values-data is then the encoded and optionally compressed sequence of non-NULL values.

type PageDesc added in v3.6.0

type PageDesc struct {
	UncompressedSize int    // UncompressedSize is the size of a page before compression.
	CompressedSize   int    // CompressedSize is the size of a page after compression.
	CRC32            uint32 // CRC32 checksum of the page after encoding and compression.
	RowCount         int    // RowCount is the number of rows in the page, including NULLs.
	ValuesCount      int    // ValuesCount is the number of non-NULL values in the page.

	Encoding datasetmd.EncodingType // Encoding used for values in the page.
	Stats    *datasetmd.Statistics  // Optional statistics for the page.
}

PageDesc describes a page.

type Pages

type Pages []Page

Pages is a set of [Page]s.

type Predicate added in v3.5.0

type Predicate interface {
	// contains filtered or unexported methods
}

Predicate is an expression used to filter rows in a RowReader.

type Row

type Row struct {
	Index  int     // Index of the row in the dataset.
	Values []Value // Values for the row, one per [Column].
}

A Row in a Dataset is a set of values across multiple columns with the same row number.

func (Row) Size added in v3.6.0

func (r Row) Size() int64

Size returns the size of all values in the row.

func (Row) SizeOfColumns added in v3.6.0

func (r Row) SizeOfColumns(idxs []int) int64

SizeOfColumns returns the size of values in the row for the given column indices.

type RowReader added in v3.7.0

type RowReader struct {
	// contains filtered or unexported fields
}

A RowReader reads [Row]s from a Dataset.

func NewRowReader added in v3.7.0

func NewRowReader(opts RowReaderOptions) *RowReader

NewRowReader creates a new RowReader from the provided options.

Call RowReader.Open before calling RowReader.Read.

func (*RowReader) Close added in v3.7.0

func (r *RowReader) Close() error

Close closes the RowReader. Closed RowReaders can be reused by calling RowReader.Reset.

func (*RowReader) Open added in v3.7.0

func (r *RowReader) Open(ctx context.Context) error

Open initializes RowReader resources.

Open must be called before RowReader.Read. Open is safe to call multiple times.

func (*RowReader) Read added in v3.7.0

func (r *RowReader) Read(ctx context.Context, s []Row) (int, error)

Read reads up to the next len(s) rows from r and stores them into s. It returns the number of rows read and any error encountered. At the end of the Dataset, Read returns 0, io.EOF.

Read returns an error if RowReader.Open was not called first.

func (*RowReader) Reset added in v3.7.0

func (r *RowReader) Reset(opts RowReaderOptions)

Reset discards any state and resets the RowReader with a new set of options. This permits reusing a RowReader rather than allocating a new one.

type RowReaderOptions added in v3.7.0

type RowReaderOptions struct {
	Dataset Dataset // Dataset to read from.

	// Columns to read from the Dataset. It is invalid to provide a Column that
	// is not in Dataset.
	//
	// The set of Columns can include columns not used in Predicate; such columns
	// are considered non-predicate columns.
	Columns []Column

	// Predicates filter the data returned by a RowReader. Predicates are
	// optional; if nil, all rows from Columns are returned.
	//
	// Expressions in Predicate may only reference columns in Columns.
	// Holds a list of predicates that can be sequentially applied to the dataset.
	Predicates []Predicate

	// Prefetch enables bulk retrieving pages from the dataset when reading
	// starts. To reduce read latency, this option should only be disabled when
	// the entire Dataset is already held in memory.
	Prefetch bool
}

RowReaderOptions configures how a RowReader will read [Row]s.

type StatisticsOptions added in v3.5.0

type StatisticsOptions struct {
	// StoreRangeStats indicates whether to store value range statistics for the
	// column and pages.
	StoreRangeStats bool

	// StoreCardinalityStats indicates whether to store cardinality estimations,
	// facilitated by hyperloglog
	StoreCardinalityStats bool
}

StatisticsOptions customizes the collection of statistics for a column.

type TruePredicate added in v3.6.0

type TruePredicate struct{}

TruePredicate is a Predicate which always returns true.

type Uint64ValueSet added in v3.6.0

type Uint64ValueSet struct {
	// contains filtered or unexported fields
}

func NewUint64ValueSet added in v3.6.0

func NewUint64ValueSet(values []Value) Uint64ValueSet

func (Uint64ValueSet) Contains added in v3.6.0

func (s Uint64ValueSet) Contains(value Value) bool

func (Uint64ValueSet) Iter added in v3.6.0

func (s Uint64ValueSet) Iter() iter.Seq[Value]

func (Uint64ValueSet) Size added in v3.6.0

func (s Uint64ValueSet) Size() int

type UnsupportedTypeError added in v3.6.0

type UnsupportedTypeError struct {
	Got datasetmd.PhysicalType
}

UnsupportedTypeError is used as a panic value when using Value methods with an unsupported type.

func (*UnsupportedTypeError) Error added in v3.6.0

func (e *UnsupportedTypeError) Error() string

Error returns a string representation denoting the unsupported type.

type Value

type Value struct {
	// contains filtered or unexported fields
}

A Value represents a single value within a dataset. Unlike [any], Values can be constructed without allocations. The zero Value corresponds to nil.

func BinaryValue added in v3.6.0

func BinaryValue(v []byte) Value

BinaryValue returns a Value for a byte slice representing a string.

func Int64Value

func Int64Value(v int64) Value

Int64Value rerturns a Value for an int64.

func Uint64Value

func Uint64Value(v uint64) Value

Uint64Value returns a Value for a uint64.

func (*Value) Binary added in v3.6.0

func (v *Value) Binary() []byte

ByteSlice returns v's value as binary data. If v is not a string, ByteSlice returns a byte slice of the form "PHYSICAL_TYPE_T", where T is the underlying type of v.

func (*Value) Buffer added in v3.6.0

func (v *Value) Buffer() []byte

Buffer returns any memory that was allocated for v, even if v is currently null.

If Value does not hold underlying memory, Buffer returns nil.

func (*Value) Int64

func (v *Value) Int64() int64

Int64 returns v's value as an int64. It panics if v is not a datasetmd.PHYSICAL_TYPE_INT64.

func (*Value) IsNil

func (v *Value) IsNil() bool

IsNil returns whether v is nil.

func (*Value) IsZero

func (v *Value) IsZero() bool

IsZero reports whether v is the zero value.

func (Value) MarshalBinary added in v3.5.0

func (v Value) MarshalBinary() (data []byte, err error)

MarshalBinary encodes v into a binary representation. Non-NULL values encode first with the type (encoded as uvarint), followed by an encoded value, where:

NULL values encode as nil.

func (Value) Size added in v3.6.0

func (v Value) Size() int

Size returns the size of v in bytes when encoded.

func (*Value) Type

func (v *Value) Type() datasetmd.PhysicalType

Type returns the datasetmd.PhysicalType of v. If v is nil, Type returns datasetmd.PHYSICAL_TYPE_UNSPECIFIED.

func (*Value) Uint64

func (v *Value) Uint64() uint64

Uint64 returns v's value as a uint64. It panics if v is not a datasetmd.PHYSICAL_TYPE_UINT64.

func (*Value) UnmarshalBinary added in v3.5.0

func (v *Value) UnmarshalBinary(data []byte) error

UnmarshalBinary decodes a Value from a binary representation. See Value.MarshalBinary for the encoding format.

func (*Value) Zero added in v3.6.0

func (v *Value) Zero()

Zero sets Value to its zero state while retaining any underlying memory if Value was a datasetmd.PHYSICAL_TYPE_BINARY. After calling Zero, Value.IsNil and Value.IsZero will both report true.

However, Value.Binary will continue to return the underlying memory.

type ValueSet added in v3.6.0

type ValueSet interface {
	Contains(value Value) bool
	Iter() iter.Seq[Value]
	Size() int
}

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL