arrow

package
v0.8.3 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 18, 2026 License: MIT Imports: 13 Imported by: 0

Documentation

Overview

Package arrow provides Arrow IPC (Feather V2) import and export for the pulse I/O pipeline, plus shared Arrow<->Pulse type-mapping helpers used by both this package and io/parquet.

Format scope:

  • This package implements the Arrow IPC FILE format (random-access, footer-indexed). The standard file extensions are .arrow and .feather; Feather V2 is identical to the Arrow IPC file format on disk.
  • The streaming variant (Arrow IPC stream, .arrows) is intentionally NOT supported. The file format is sufficient for batch import/export and trivially seekable, which lets the reader iterate batches without speculative buffering.

Memory model:

The reader materializes the full table into memory at init time, mirroring the policy in io/parquet. Files larger than available memory are out of scope for v1; users that need bounded streaming should use NDJSON or CSV.

Type policy:

All values written by Writer are encoded as Arrow String. Type-promotion on write is intentionally deferred until a real consumer demands it; this matches the parquet writer policy and keeps the export path simple.

Index

Constants

View Source
const (
	PulseTypeMetadataKey = "pulse:type"
)

Pulse field metadata key carried as an Arrow Field.Metadata pair so a reader can recover the original Pulse type for fields that map to a generic Arrow type. Decimal128 carries its precision/scale natively in arrow.Decimal128Type and does not need an extension marker.

Variables

This section is empty.

Functions

func AppendDecimal128

func AppendDecimal128(b array.Builder, f encoding.Field, v any) error

AppendDecimal128 appends a decimal value to an Arrow Decimal128 builder, accepting either an encoding.Decimal128 typed value or a canonical decimal string at the field's scale. nil and "" append a null. Used by the Arrow IPC and Parquet writers.

func FieldFromPulse

func FieldFromPulse(f encoding.Field) arrow.Field

FieldFromPulse builds an Arrow Field for a Pulse schema entry, including per-type details (decimal128 precision/scale).

func FormatValue

func FormatValue(arr arrow.Array, idx int) string

FormatValue renders a single element of an Arrow array as a string suitable for emission through the pio.Reader interface. Lifted from the original io/parquet implementation. Null handling is the caller's responsibility: FormatValue assumes idx is a non-null position.

Formatting rules:

  • Integers: base-10, unsigned where possible.
  • Float32 / Float64: shortest round-trippable decimal at the given precision (strconv 'f', precision -1).
  • Boolean: lowercase "true" / "false".
  • String: the raw value verbatim.
  • Date32: rendered as YYYY-MM-DD using days-since-epoch.
  • Date64: rendered as YYYY-MM-DD; the sub-day milliseconds are dropped.
  • Dictionary with a string value type: the resolved string label.
  • Anything else: falls back to fmt-formatting GetOneForMarshal, which yields the Arrow library's canonical text representation.

func PulseFieldFromArrow

func PulseFieldFromArrow(af arrow.Field) (encoding.Field, bool)

PulseFieldFromArrow inverts FieldFromPulse for decimal128 columns. Returns (field, true) if a Pulse decimal type was identified; (zero, false) otherwise — the caller should fall back to TypeToPulse.

func TypeFromPulse

func TypeFromPulse(ft encoding.FieldType) arrow.DataType

TypeFromPulse maps a Pulse FieldType to an Arrow DataType for use when constructing an Arrow schema for export. Lifted from the original io/parquet implementation.

Mapping rules:

  • Unsigned integer widths map to the matching Arrow primitive.
  • Float widths map to the matching Arrow primitive.
  • Date maps to Arrow Date32 (days-since-epoch), the more compact of the two Arrow date types and the natural fit for Pulse's storage.
  • All bool variants (packed and nullable) collapse to Arrow Boolean; nullability is carried by the field's Nullable flag, not the data type.
  • All categorical widths collapse to Arrow String. Writers that want dictionary encoding configure that as a column-encoding hint at write time, not as a data-type choice.
  • Anything unrecognized falls back to Float64.

func TypeToPulse

func TypeToPulse(dt arrow.DataType, nullable bool) encoding.FieldType

TypeToPulse maps an Arrow data type (with its nullability flag) to a Pulse FieldType. Lifted from the original io/parquet implementation so both the Arrow IPC reader and the Parquet reader share a single source of truth.

Mapping rules:

  • Signed and unsigned integers of the same width collapse to Pulse's unsigned type at that width (Pulse has no signed integer types).
  • Nullable variants are chosen for u8/u16 widths and for bool when the Arrow field is marked nullable.
  • Both Date32 and Date64 collapse to FieldTypeDate; Pulse stores dates as days-since-epoch and discards the sub-day precision Date64 carries.
  • String and binary (in both standard and large variants), and any dictionary-encoded type, collapse to FieldTypeCategoricalU8. The import pipeline upgrades to wider categorical widths when the dictionary overflows.
  • Anything unrecognized falls back to FieldTypeF64. This is a deliberate conservative default: it lets unknown numeric Arrow types load without error, at the cost of precision for unusual cases.

Types

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader reads Arrow IPC (Feather V2) data and implements pio.Reader and pio.ResetReader. The full file is materialized in memory at init time — see the package doc for the rationale.

func NewReader

func NewReader(fs afero.Fs, path string) *Reader

NewReader creates an Arrow IPC reader that loads from the given filesystem path on first read. The file is not opened until ReadHeader, ReadRows, or InferPulseSchema is called.

func NewReaderFromBytes

func NewReaderFromBytes(data []byte) *Reader

NewReaderFromBytes creates an Arrow IPC reader over an in-memory byte slice. The slice is not copied; callers must not mutate it after handing it to the reader.

func (*Reader) Close

func (r *Reader) Close() error

Close releases all materialized record batches and the underlying file reader. Safe to call multiple times.

func (*Reader) InferPulseSchema

func (r *Reader) InferPulseSchema() (*encoding.Schema, error)

InferPulseSchema builds a Pulse schema from the Arrow file's schema using the shared TypeToPulse mapping.

func (*Reader) ReadHeader

func (r *Reader) ReadHeader() ([]string, error)

ReadHeader returns column names from the Arrow schema.

func (*Reader) ReadRows

func (r *Reader) ReadRows(ctx context.Context, fn func(row []string) error) error

ReadRows iterates rows across all materialized record batches, calling fn for each row. Null cells produce empty strings.

func (*Reader) Reset

func (r *Reader) Reset() error

Reset rewinds the reader to the beginning. The next Read call will re-open the underlying file and re-materialize batches.

type Writer

type Writer struct {
	// contains filtered or unexported fields
}

Writer writes tabular data as Arrow IPC (Feather V2), accumulating rows into an Arrow RecordBuilder and flushing in batch-sized chunks to bound peak memory.

func NewWriter

func NewWriter(fs afero.Fs, path string) *Writer

NewWriter creates an Arrow IPC writer targeting a filesystem path. The file is materialized in an internal buffer and flushed to fs on Close.

func NewWriterToBuffer

func NewWriterToBuffer() *Writer

NewWriterToBuffer creates an Arrow IPC writer that writes to an internal buffer accessible via Bytes after Close.

func (*Writer) Bytes

func (w *Writer) Bytes() []byte

Bytes returns the buffered Arrow IPC output. Only meaningful after Close.

func (*Writer) Close

func (w *Writer) Close() error

Close flushes any pending batch, closes the underlying Arrow IPC writer, and writes the buffered file to fs if a path was configured. If no header was written, Close is a no-op and Bytes returns an empty slice. If a header was written but no rows, an empty record batch is emitted so the file is structurally valid.

func (*Writer) SetPulseSchema

func (w *Writer) SetPulseSchema(s *encoding.Schema)

SetPulseSchema records the source .pulse schema so subsequent initWriter can build native typed Arrow columns. Implements pio.SchemaAwareWriter.

func (*Writer) WriteHeader

func (w *Writer) WriteHeader(columns []string) error

WriteHeader records the column names. The Arrow schema and writer are constructed lazily on the first WriteRow so that callers can WriteHeader and immediately Close to produce a schema-only file.

func (*Writer) WriteRow

func (w *Writer) WriteRow(values []any) error

WriteRow writes a single row of values. With a Pulse schema set the row dispatches per-column to the native typed builder for that column's Pulse type. Without a schema every value is stringified via fmt.Sprintf and appended to a StringBuilder. Buffered rows flush in defaultRecordBatchSize chunks; batch boundaries fall between rows.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL