parquetreader

package
v0.0.17
Published: Feb 24, 2026 License: Apache-2.0 Imports: 10 Imported by: 0

Documentation

Overview

Package parquetreader provides utilities for reading individual rows from Parquet files stored on S3-compatible object storage.

Index key format (produced by the DPS Benthos output dimo_parquet_writer):

[{full_uri}|]{object_key}#{row_offset}

Full URI form: s3://bucket/prefix/.../file.parquet#row (bucket and key are parsed).

Relative form: prefix/.../file.parquet#row (caller must supply bucket).

Use ParseIndexKey to split a key into bucket (if present), object key, and 0-based row offset. Use IsParquetRef to distinguish Parquet references from legacy JSON object keys, which contain no '#'.

Constants

This section is empty.

Variables

This section is empty.

Functions

func IsParquetRef

func IsParquetRef(indexKey string) bool

IsParquetRef reports whether the given index_key uses the new Parquet reference format, that is, whether it contains '#'. Legacy keys, which point at individual JSON files on S3, do not contain '#'.
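The check described above can be sketched in a few lines. This is a minimal stand-in, not the package's implementation: isParquetRef is a hypothetical local function, and the sample keys are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// isParquetRef is a hypothetical stand-in for IsParquetRef. It reports
// whether the key contains '#', the separator between the object key and
// the row offset in the Parquet reference format.
func isParquetRef(indexKey string) bool {
	return strings.Contains(indexKey, "#")
}

func main() {
	// Illustrative keys only; none of these come from a real bucket.
	keys := []string{
		"s3://bucket/prefix/file.parquet#12", // full URI form
		"prefix/file.parquet#0",              // relative form
		"prefix/event.json",                  // legacy JSON object key
	}
	for _, k := range keys {
		fmt.Printf("%-40s parquet=%v\n", k, isParquetRef(k))
	}
}
```

A caller would typically branch on this result: Parquet references go through ParseIndexKey and a Reader, while legacy keys are fetched as whole JSON objects.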

Types

type IndexKeyRef

type IndexKeyRef struct {
	// Bucket is set when index_key is a full s3://bucket/key#row URI; otherwise empty.
	Bucket string
	// ObjectKey is the S3 object key (path) of the Parquet file, without bucket.
	ObjectKey string
	// RowOffset is the 0-based row index within the Parquet file.
	RowOffset int
}

IndexKeyRef holds the parsed components of a Parquet index_key reference.

func ParseIndexKey

func ParseIndexKey(indexKey string) (IndexKeyRef, error)

ParseIndexKey parses an index_key string into an IndexKeyRef: bucket (only for s3:// URIs), object key, and 0-based row offset. Two forms are supported: "s3://bucket/key.parquet#row", which sets Bucket, and "key.parquet#row", which leaves Bucket empty.
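To make the two supported forms concrete, here is a rough sketch of how such a parser could work. It is not the package's code: parseIndexKey and indexKeyRef are hypothetical local names, and details such as splitting on the last '#', rejecting negative offsets, and the error messages are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// indexKeyRef mirrors the fields of IndexKeyRef for this sketch.
type indexKeyRef struct {
	Bucket    string
	ObjectKey string
	RowOffset int
}

// parseIndexKey is a hypothetical stand-in for ParseIndexKey.
// Assumption: the row offset follows the last '#' in the key.
func parseIndexKey(indexKey string) (indexKeyRef, error) {
	i := strings.LastIndex(indexKey, "#")
	if i < 0 {
		return indexKeyRef{}, errors.New("index key has no '#' row separator")
	}
	path, rowStr := indexKey[:i], indexKey[i+1:]
	if path == "" {
		return indexKeyRef{}, errors.New("index key has an empty object key")
	}
	row, err := strconv.Atoi(rowStr)
	if err != nil || row < 0 {
		return indexKeyRef{}, fmt.Errorf("invalid row offset %q", rowStr)
	}

	ref := indexKeyRef{ObjectKey: path, RowOffset: row}
	if strings.HasPrefix(path, "s3://") {
		// Full URI form: the first path segment is the bucket.
		rest := strings.TrimPrefix(path, "s3://")
		bucket, key, found := strings.Cut(rest, "/")
		if !found || bucket == "" || key == "" {
			return indexKeyRef{}, fmt.Errorf("malformed s3 URI %q", path)
		}
		ref.Bucket, ref.ObjectKey = bucket, key
	}
	return ref, nil
}

func main() {
	for _, k := range []string{
		"s3://bucket/prefix/file.parquet#12",
		"prefix/file.parquet#0",
		"prefix/event.json",
	} {
		ref, err := parseIndexKey(k)
		fmt.Printf("%q -> %+v, err=%v\n", k, ref, err)
	}
}
```

In the relative form the returned Bucket stays empty, so the caller has to supply the bucket separately when reading.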

type ObjectGetter

type ObjectGetter interface {
	GetObject(ctx context.Context, params *s3.GetObjectInput, optFns ...func(*s3.Options)) (*s3.GetObjectOutput, error)
}

ObjectGetter is an interface for fetching objects from S3-compatible storage. *s3.Client implements it; the interface also lets callers substitute a fake in tests or plug in an alternate backend.
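The testing use case can be illustrated with an in-memory fake. To keep the sketch free of external dependencies, getObjectInput, getObjectOutput, and objectGetter below are simplified local stand-ins for s3.GetObjectInput, s3.GetObjectOutput, and ObjectGetter; the real types use pointer fields and an extra variadic options parameter, so a fake for the real interface would need the AWS SDK signature shown above. memoryGetter and fetch are hypothetical names.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
)

// Simplified stand-ins for the AWS SDK request and response types.
type getObjectInput struct {
	Bucket string
	Key    string
}

type getObjectOutput struct {
	Body io.ReadCloser
}

// objectGetter is a reduced version of ObjectGetter for this sketch.
type objectGetter interface {
	GetObject(ctx context.Context, params *getObjectInput) (*getObjectOutput, error)
}

// memoryGetter serves objects from a map keyed by "bucket/key".
type memoryGetter struct {
	objects map[string][]byte
}

func (m *memoryGetter) GetObject(ctx context.Context, params *getObjectInput) (*getObjectOutput, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	data, ok := m.objects[params.Bucket+"/"+params.Key]
	if !ok {
		return nil, fmt.Errorf("object %s/%s not found", params.Bucket, params.Key)
	}
	return &getObjectOutput{Body: io.NopCloser(bytes.NewReader(data))}, nil
}

// Compile-time check that the fake satisfies the interface.
var _ objectGetter = (*memoryGetter)(nil)

// fetch reads a whole object through any objectGetter.
func fetch(g objectGetter, bucket, key string) ([]byte, error) {
	out, err := g.GetObject(context.Background(), &getObjectInput{Bucket: bucket, Key: key})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()
	return io.ReadAll(out.Body)
}

func main() {
	g := &memoryGetter{objects: map[string][]byte{
		"bucket/prefix/file.parquet": []byte("placeholder bytes"),
	}}
	data, err := fetch(g, "bucket", "prefix/file.parquet")
	fmt.Println(len(data), err)
}
```

Depending on the narrow interface rather than on *s3.Client directly is what makes this kind of substitution possible.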

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader reads individual row payloads from Parquet files on S3-compatible storage. Schema is compatible with CloudEvent Parquet files written by dimo_parquet_writer.

func New

func New(objGetter ObjectGetter) *Reader

New returns a Reader that uses the given ObjectGetter (e.g. *s3.Client).

func (*Reader) ReadData

func (r *Reader) ReadData(ctx context.Context, bucket string, ref IndexKeyRef) ([]byte, error)

ReadData reads the "data" column value for the row at ref.RowOffset in the Parquet file at ref.ObjectKey. The bucket argument is used only when ref.Bucket is empty (relative key). It returns (nil, nil) when the data value is null.
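The bucket precedence described above can be sketched as a small helper. This is an illustration only: effectiveBucket and indexKeyRef are hypothetical local names, and the bucket names are made up.

```go
package main

import "fmt"

// indexKeyRef mirrors the fields of IndexKeyRef for this sketch.
type indexKeyRef struct {
	Bucket    string
	ObjectKey string
	RowOffset int
}

// effectiveBucket returns the bucket embedded in the reference when there
// is one, and falls back to the caller-supplied bucket otherwise.
func effectiveBucket(bucket string, ref indexKeyRef) string {
	if ref.Bucket != "" {
		return ref.Bucket
	}
	return bucket
}

func main() {
	full := indexKeyRef{Bucket: "archive", ObjectKey: "prefix/file.parquet", RowOffset: 3}
	relative := indexKeyRef{ObjectKey: "prefix/file.parquet", RowOffset: 3}

	fmt.Println(effectiveBucket("default", full))
	fmt.Println(effectiveBucket("default", relative))
}
```

Because a null data value yields a nil slice with a nil error, callers should check the returned slice for nil as well as checking the error.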
