arrow

package
v0.11.0 Latest
Warning: This package is not in the latest version of its module.

Published: May 27, 2026 License: MIT Imports: 13 Imported by: 0

Documentation

Overview

Package arrow provides Arrow IPC (Feather V2) import and export for the pulse I/O pipeline, plus shared Arrow<->Pulse type-mapping helpers used by both this package and io/parquet.

Format scope:

  • This package implements the Arrow IPC FILE format (random-access, footer-indexed). The standard file extensions are .arrow and .feather; Feather V2 is identical to the Arrow IPC file format on disk.
  • The streaming variant (Arrow IPC stream, .arrows) is intentionally NOT supported. The file format is sufficient for batch import/export and trivially seekable, which lets the reader iterate batches without speculative buffering.

Memory model:

The reader materializes the full table into memory at init time, mirroring the policy in io/parquet. Files larger than available memory are out of scope for v1; users who need bounded streaming should use NDJSON or CSV.

Type policy:

Unless a Pulse schema has been supplied through SetPulseSchema, all values written by Writer are encoded as Arrow String. Type-promotion on write is intentionally deferred until a real consumer demands it; this matches the parquet writer policy and keeps the export path simple.

Index

Constants

const (
	PulseTypeMetadataKey = "pulse:type"
)

Pulse field metadata key carried as an Arrow Field.Metadata pair so a reader can recover the original Pulse type for fields that map to a generic Arrow type. Decimal128 carries its precision/scale natively in arrow.Decimal128Type and does not need an extension marker.

Variables

This section is empty.

Functions

func AppendDecimal128

func AppendDecimal128(b array.Builder, f encoding.Field, v any) error

AppendDecimal128 appends a decimal value to an Arrow Decimal128 builder, accepting either an encoding.Decimal128 typed value or a canonical decimal string at the field's scale. nil and "" append a null. Used by the Arrow IPC and Parquet writers.
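
The input handling described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: parseDecimal and toUnscaled are hypothetical helpers, and the sketch assumes that "canonical decimal string at the field's scale" means the string carries exactly as many fractional digits as the scale. It does not check the Decimal128 precision limit.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// parseDecimal (hypothetical) converts a decimal string such as "-12.30"
// into its unscaled integer at the given scale (-1230 for scale 2).
// Assumption: the fractional part must have exactly `scale` digits.
func parseDecimal(s string, scale int) (*big.Int, error) {
	intPart, fracPart, _ := strings.Cut(s, ".")
	if len(fracPart) != scale {
		return nil, fmt.Errorf("decimal %q: want %d fractional digits, got %d", s, scale, len(fracPart))
	}
	n, ok := new(big.Int).SetString(intPart+fracPart, 10)
	if !ok {
		return nil, fmt.Errorf("decimal %q: not a base-10 number", s)
	}
	return n, nil
}

// toUnscaled (hypothetical) mirrors the accepted inputs: nil and "" stand
// for a null (returned as nil, nil); a non-empty string is parsed at scale.
func toUnscaled(v any, scale int) (*big.Int, error) {
	switch x := v.(type) {
	case nil:
		return nil, nil
	case string:
		if x == "" {
			return nil, nil
		}
		return parseDecimal(x, scale)
	default:
		return nil, fmt.Errorf("unsupported decimal value of type %T", v)
	}
}

func main() {
	for _, v := range []any{"-12.30", "", nil, "12.3"} {
		n, err := toUnscaled(v, 2)
		fmt.Println(n, err)
	}
}
```

The real function also accepts an encoding.Decimal128 typed value, which this sketch leaves out because that type's layout is not shown here.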

func FieldFromPulse

func FieldFromPulse(f encoding.Field) arrow.Field

FieldFromPulse builds an Arrow Field for a Pulse schema entry, including per-type details (decimal128 precision/scale). Arrow's nullable flag is driven by the Pulse field's Nullable attribute.

func FormatValue

func FormatValue(arr arrow.Array, idx int) string

FormatValue renders a single element of an Arrow array as a string suitable for emission through the pio.Reader interface. Lifted from the original io/parquet implementation. Null handling is the caller's responsibility: FormatValue assumes idx is a non-null position.

Formatting rules:

  • Integers: base-10, unsigned where possible.
  • Float32 / Float64: the shortest decimal that round-trips to the same value (strconv.FormatFloat with format 'f' and precision -1).
  • Boolean: lowercase "true" / "false".
  • String: the raw value verbatim.
  • Date32: rendered as YYYY-MM-DD using days-since-epoch.
  • Date64: rendered as YYYY-MM-DD; the sub-day milliseconds are dropped.
  • Dictionary with a string value type: the resolved string label.
  • Anything else: falls back to fmt-formatting GetOneForMarshal, which yields the Arrow library's canonical text representation.
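
The float and date rules above map directly onto standard-library calls. The following is a minimal sketch of those rules only; the helper names are hypothetical and the code does not touch Arrow arrays.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// formatFloat64 renders v as the shortest decimal that parses back to v.
func formatFloat64(v float64) string {
	return strconv.FormatFloat(v, 'f', -1, 64)
}

// formatFloat32 does the same for float32; bitSize 32 keeps the digit
// count minimal for the narrower type.
func formatFloat32(v float32) string {
	return strconv.FormatFloat(float64(v), 'f', -1, 32)
}

// formatDate32 renders a days-since-Unix-epoch count as YYYY-MM-DD.
func formatDate32(days int32) string {
	return time.Unix(int64(days)*86400, 0).UTC().Format("2006-01-02")
}

// formatDate64 renders milliseconds-since-epoch as YYYY-MM-DD; the time of
// day is dropped by the layout string.
func formatDate64(ms int64) string {
	return time.UnixMilli(ms).UTC().Format("2006-01-02")
}

func main() {
	fmt.Println(formatFloat64(0.1))        // 0.1
	fmt.Println(formatFloat32(0.1))        // 0.1
	fmt.Println(strconv.FormatBool(true))  // true
	fmt.Println(formatDate32(19000))       // 2022-01-08
	fmt.Println(formatDate64(90_000_000))  // 1970-01-02
}
```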

func PulseFieldFromArrow

func PulseFieldFromArrow(af arrow.Field) (encoding.Field, bool)

PulseFieldFromArrow inverts FieldFromPulse for decimal128 columns. It returns (field, true) if a Pulse decimal type was identified and (zero, false) otherwise, in which case the caller should fall back to TypeToPulse.

func TypeFromPulse

func TypeFromPulse(ft encoding.FieldType) arrow.DataType

TypeFromPulse maps a Pulse FieldType to an Arrow DataType for use when constructing an Arrow schema for export. Nullability is carried by the arrow.Field's Nullable flag, not the data type.

func TypeToPulse

func TypeToPulse(dt arrow.DataType) encoding.FieldType

TypeToPulse maps an Arrow data type to a Pulse FieldType. Nullability is carried separately by encoding.Field.Nullable (driven by the Arrow Field's nullable flag at the call site).

Mapping rules:

  • Signed and unsigned integers of the same width collapse to Pulse's unsigned type at that width (Pulse has no signed integer types).
  • Both Date32 and Date64 collapse to FieldTypeDate; Pulse stores dates as days-since-epoch and discards the sub-day precision Date64 carries.
  • String and binary (in both standard and large variants), and any dictionary-encoded type, collapse to FieldTypeCategoricalU8. The import pipeline upgrades to wider categorical widths when the dictionary overflows.
  • Anything unrecognized falls back to FieldTypeF64.
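
The four rules above can be pictured as a single switch. This sketch uses plain strings as illustrative Arrow type labels and a local fieldType stand-in; the unsigned constant names (typeU8 through typeU64) are hypothetical, since only FieldTypeDate, FieldTypeCategoricalU8, and FieldTypeF64 are named above. It covers only the listed rules, so types the real function may handle separately (booleans or floats, for example) simply hit the fallback here.

```go
package main

import "fmt"

// fieldType is a local stand-in for encoding.FieldType.
type fieldType string

const (
	typeU8            fieldType = "U8"  // hypothetical name
	typeU16           fieldType = "U16" // hypothetical name
	typeU32           fieldType = "U32" // hypothetical name
	typeU64           fieldType = "U64" // hypothetical name
	typeDate          fieldType = "Date"
	typeCategoricalU8 fieldType = "CategoricalU8"
	typeF64           fieldType = "F64"
)

// typeToPulse applies the listed mapping rules to an Arrow type label.
func typeToPulse(arrowType string) fieldType {
	switch arrowType {
	case "int8", "uint8": // signed and unsigned collapse per width
		return typeU8
	case "int16", "uint16":
		return typeU16
	case "int32", "uint32":
		return typeU32
	case "int64", "uint64":
		return typeU64
	case "date32", "date64": // sub-day precision is discarded
		return typeDate
	case "utf8", "large_utf8", "binary", "large_binary", "dictionary":
		return typeCategoricalU8
	default: // anything unrecognized
		return typeF64
	}
}

func main() {
	for _, t := range []string{"int32", "uint32", "date64", "large_utf8", "struct"} {
		fmt.Printf("%-10s -> %s\n", t, typeToPulse(t))
	}
}
```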

Types

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader reads Arrow IPC (Feather V2) data and implements pio.Reader and pio.ResetReader. The full file is materialized in memory at init time; see the package doc for the rationale.

func NewReader

func NewReader(fs afero.Fs, path string) *Reader

NewReader creates an Arrow IPC reader that loads from the given filesystem path on first read. The file is not opened until ReadHeader, ReadRows, or InferPulseSchema is called.

func NewReaderFromBytes

func NewReaderFromBytes(data []byte) *Reader

NewReaderFromBytes creates an Arrow IPC reader over an in-memory byte slice. The slice is not copied; callers must not mutate it after handing it to the reader.

func (*Reader) Close

func (r *Reader) Close() error

Close releases all materialized record batches and the underlying file reader. Safe to call multiple times.

func (*Reader) InferPulseSchema

func (r *Reader) InferPulseSchema() (*encoding.Schema, error)

InferPulseSchema builds a Pulse schema from the Arrow file's schema using the shared TypeToPulse mapping.

func (*Reader) ReadHeader

func (r *Reader) ReadHeader() ([]string, error)

ReadHeader returns column names from the Arrow schema.

func (*Reader) ReadRows

func (r *Reader) ReadRows(ctx context.Context, fn func(row []string) error) error

ReadRows iterates rows across all materialized record batches, calling fn for each row. Null cells produce empty strings.

func (*Reader) Reset

func (r *Reader) Reset() error

Reset rewinds the reader to the beginning. The next ReadHeader, ReadRows, or InferPulseSchema call re-opens the underlying file and re-materializes the batches.

type Writer

type Writer struct {
	// contains filtered or unexported fields
}

Writer writes tabular data as Arrow IPC (Feather V2), accumulating rows into an Arrow RecordBuilder and flushing in batch-sized chunks to bound peak memory.

func NewWriter

func NewWriter(fs afero.Fs, path string) *Writer

NewWriter creates an Arrow IPC writer targeting a filesystem path. The file is materialized in an internal buffer and flushed to fs on Close.

func NewWriterToBuffer

func NewWriterToBuffer() *Writer

NewWriterToBuffer creates an Arrow IPC writer that writes to an internal buffer accessible via Bytes after Close.

func (*Writer) Bytes

func (w *Writer) Bytes() []byte

Bytes returns the buffered Arrow IPC output. Only meaningful after Close.

func (*Writer) Close

func (w *Writer) Close() error

Close flushes any pending batch, closes the underlying Arrow IPC writer, and writes the buffered file to fs if a path was configured. If no header was written, Close is a no-op and Bytes returns an empty slice. If a header was written but no rows, an empty record batch is emitted so the file is structurally valid.

func (*Writer) SetPulseSchema

func (w *Writer) SetPulseSchema(s *encoding.Schema)

SetPulseSchema records the source .pulse schema so that the writer, when it is lazily initialized (the unexported initWriter step), can build native typed Arrow columns. Implements pio.SchemaAwareWriter.

func (*Writer) WriteHeader

func (w *Writer) WriteHeader(columns []string) error

WriteHeader records the column names. The Arrow schema and writer are constructed lazily on the first WriteRow so that callers can WriteHeader and immediately Close to produce a schema-only file.

func (*Writer) WriteRow

func (w *Writer) WriteRow(values []any) error

WriteRow writes a single row of values. With a Pulse schema set, each value is dispatched to the native typed builder for its column's Pulse type. Without a schema, every value is stringified via fmt.Sprintf and appended to a StringBuilder. Buffered rows flush in defaultRecordBatchSize chunks; batch boundaries fall between rows.
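
The schema-less path and the chunked flushing can be sketched in a few lines. This is an illustration under stated assumptions, not the package's code: sketchWriter is a hypothetical type, batchSize stands in for the unexported defaultRecordBatchSize (its real value is not given here), and the "%v" verb is assumed because the text names fmt.Sprintf but not the format.

```go
package main

import "fmt"

// batchSize is a hypothetical stand-in for defaultRecordBatchSize.
const batchSize = 3

// sketchWriter (hypothetical) buffers stringified rows and moves them to
// batches in fixed-size chunks, so a batch never splits a row.
type sketchWriter struct {
	pending [][]string
	batches [][][]string
}

// WriteRow stringifies every value and flushes once a full batch is buffered.
func (w *sketchWriter) WriteRow(values []any) error {
	row := make([]string, len(values))
	for i, v := range values {
		row[i] = fmt.Sprintf("%v", v) // assumed verb
	}
	w.pending = append(w.pending, row)
	if len(w.pending) == batchSize {
		w.flush()
	}
	return nil
}

// flush moves the buffered rows into a completed batch.
func (w *sketchWriter) flush() {
	if len(w.pending) == 0 {
		return
	}
	w.batches = append(w.batches, w.pending)
	w.pending = nil
}

// Close flushes the final, possibly short, batch.
func (w *sketchWriter) Close() error {
	w.flush()
	return nil
}

func main() {
	w := &sketchWriter{}
	for i := 0; i < 7; i++ {
		_ = w.WriteRow([]any{i, 1.5, true})
	}
	_ = w.Close()
	for i, b := range w.batches {
		fmt.Printf("batch %d: %d rows\n", i, len(b))
	}
}
```

Unlike the documented Close, this sketch does not emit an empty batch when no rows were written.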
