puffin

package
v0.5.0
Published: Feb 27, 2026 License: Apache-2.0 Imports: 9 Imported by: 0

Documentation

Overview

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Package puffin provides reading and writing of Puffin files.

Puffin is a file format designed to store statistics and indexes for Iceberg tables. A Puffin file contains blobs (opaque byte sequences), such as Apache DataSketches sketches or deletion vectors, each with associated metadata.

File structure:

[Magic] [Blob]* [Magic] [Footer Payload] [Footer Payload Size] [Flags] [Magic]

See the specification at https://iceberg.apache.org/puffin-spec/
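
The layout above can be walked backwards from the end of the file to locate the footer payload. The following is a minimal sketch using only the standard library, not the package's own implementation. It assumes, per the Puffin specification, that the magic is the four bytes "PFA1" and that the footer payload size and flags are each 4 bytes, with the size stored little-endian. The names footerPayload and buildFile are hypothetical helpers introduced for illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// magic is the 4-byte Puffin marker (assumed from the Puffin specification).
var magic = []byte("PFA1")

// trailerSize covers [Footer Payload Size] [Flags] [Magic], 4 bytes each.
const trailerSize = 12

// footerPayload extracts the footer payload from an in-memory Puffin file.
// Hypothetical helper; it is not part of the puffin package.
func footerPayload(file []byte) ([]byte, error) {
	// Smallest possible file: header magic + footer magic + trailer.
	if len(file) < 2*len(magic)+trailerSize {
		return nil, errors.New("file too small for a Puffin footer")
	}
	if !bytes.HasPrefix(file, magic) || !bytes.HasSuffix(file, magic) {
		return nil, errors.New("missing magic at start or end of file")
	}
	trailer := file[len(file)-trailerSize:]
	size := int64(binary.LittleEndian.Uint32(trailer[0:4]))
	flags := binary.LittleEndian.Uint32(trailer[4:8])
	if flags&1 != 0 { // bit 0: compressed footer
		return nil, errors.New("compressed footer is not handled in this sketch")
	}
	end := int64(len(file) - trailerSize)
	start := end - size
	if start < int64(2*len(magic)) {
		return nil, errors.New("footer payload size exceeds file size")
	}
	// The payload is preceded by the footer's opening magic.
	if !bytes.Equal(file[start-4:start], magic) {
		return nil, errors.New("missing magic before footer payload")
	}
	return file[start:end], nil
}

// buildFile assembles a tiny uncompressed file with one blob, for the demo.
func buildFile(blob, payload []byte) []byte {
	var buf bytes.Buffer
	buf.Write(magic)
	buf.Write(blob)
	buf.Write(magic)
	buf.Write(payload)
	var tail [8]byte // payload size followed by flags (left as zero)
	binary.LittleEndian.PutUint32(tail[0:4], uint32(len(payload)))
	buf.Write(tail[:])
	buf.Write(magic)
	return buf.Bytes()
}

func main() {
	file := buildFile([]byte("blob"), []byte(`{"blobs":[]}`))
	payload, err := footerPayload(file)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload)) // {"blobs":[]}
}
```

Reading from the end means the footer can be found with one small read of the trailer, without scanning the blob region.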

Constants

const (
	// Footer layout: [Magic] [FooterPayload] [FooterPayloadSize] [Flags] [Magic]
	// MagicSize is the number of bytes in the magic marker.
	MagicSize = 4

	// FooterFlagCompressed indicates a compressed footer; unsupported in this implementation.
	FooterFlagCompressed = 1 // bit 0

	// DefaultMaxBlobSize is the maximum blob size allowed when reading (256 MB).
	// The limit guards against out-of-memory conditions caused by oversized blobs.
	// Override with WithMaxBlobSize when creating a reader.
	DefaultMaxBlobSize = 256 << 20

	// CreatedBy is a human-readable identification of the application writing the file, along with its version.
	// Example: "Trino version 381".
	CreatedBy = "created-by"
)

Variables

This section is empty.

Functions

This section is empty.

Types

type BlobData

type BlobData struct {
	Metadata BlobMetadata
	Data     []byte
}

BlobData pairs a blob's metadata with its content.

type BlobMetadata

type BlobMetadata struct {
	Type             BlobType          `json:"type"`
	Fields           []int32           `json:"fields"`
	SnapshotID       int64             `json:"snapshot-id"`
	SequenceNumber   int64             `json:"sequence-number"`
	Offset           int64             `json:"offset"`
	Length           int64             `json:"length"`
	CompressionCodec *string           `json:"compression-codec,omitempty"`
	Properties       map[string]string `json:"properties,omitempty"`
}
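
The struct tags determine how each blob entry appears in the footer JSON. As an illustration only, the sketch below declares a local blobMetadata type that mirrors the tags shown above (it is a stand-in, not the package type, and Type is a plain string here) and marshals one entry with encoding/json. The field values are made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// blobMetadata is a local stand-in that copies the JSON tags of BlobMetadata.
type blobMetadata struct {
	Type             string            `json:"type"`
	Fields           []int32           `json:"fields"`
	SnapshotID       int64             `json:"snapshot-id"`
	SequenceNumber   int64             `json:"sequence-number"`
	Offset           int64             `json:"offset"`
	Length           int64             `json:"length"`
	CompressionCodec *string           `json:"compression-codec,omitempty"`
	Properties       map[string]string `json:"properties,omitempty"`
}

// encode marshals one entry; fields tagged omitempty vanish when unset.
func encode(m blobMetadata) string {
	b, err := json.Marshal(m)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	m := blobMetadata{
		Type:           "apache-datasketches-theta-v1",
		Fields:         []int32{1},
		SnapshotID:     123,
		SequenceNumber: 1,
		Offset:         4,
		Length:         32,
	}
	fmt.Println(encode(m))
	// {"type":"apache-datasketches-theta-v1","fields":[1],"snapshot-id":123,"sequence-number":1,"offset":4,"length":32}
}
```

Because CompressionCodec is a pointer and Properties is a map, both tagged omitempty, an uncompressed blob without properties produces no "compression-codec" or "properties" key.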

type BlobMetadataInput

type BlobMetadataInput struct {
	Type           BlobType
	SnapshotID     int64
	SequenceNumber int64
	Fields         []int32
	Properties     map[string]string
}

BlobMetadataInput contains fields the caller provides when adding a blob. Offset, Length, and CompressionCodec are set by the writer.

type BlobType

type BlobType string
const (
	// BlobTypeDataSketchesTheta is a serialized compact Theta sketch
	// produced by the Apache DataSketches library.
	BlobTypeDataSketchesTheta BlobType = "apache-datasketches-theta-v1"

	// BlobTypeDeletionVector is a serialized deletion vector per the
	// Iceberg spec. Requires snapshot-id and sequence-number to be -1.
	BlobTypeDeletionVector BlobType = "deletion-vector-v1"
)

type Footer

type Footer struct {
	Blobs      []BlobMetadata    `json:"blobs"`
	Properties map[string]string `json:"properties,omitempty"`
}

Footer describes the blobs and file-level properties stored in a Puffin file.

type Reader

type Reader struct {
	// contains filtered or unexported fields
}

Reader reads blobs and metadata from a Puffin file.

Usage:

r, err := puffin.NewReader(file)
if err != nil {
    return err
}
for i := range r.Blobs() {
    blob, err := r.ReadBlob(i)
    if err != nil {
        return err
    }
    // process blob.Data
}

func NewReader

func NewReader(r ReaderAtSeeker, opts ...ReaderOption) (*Reader, error)

NewReader creates a new Puffin file reader. The file size is auto-detected using Seek. It validates magic bytes and reads the footer eagerly. The caller is responsible for closing the underlying reader.

func (*Reader) Blobs

func (r *Reader) Blobs() []BlobMetadata

Blobs returns the blob metadata entries from the footer.

func (*Reader) Properties

func (r *Reader) Properties() map[string]string

Properties returns the file-level properties from the footer.

func (*Reader) ReadAllBlobs

func (r *Reader) ReadAllBlobs() ([]*BlobData, error)

ReadAllBlobs reads all blobs from the file.

func (*Reader) ReadAt

func (r *Reader) ReadAt(p []byte, off int64) (n int, err error)

ReadAt implements io.ReaderAt, reading from the blob data region. It validates that the read range lies within the blob data region. This is useful for deletion vectors, where a manifest stores an offset and length that point directly into the Puffin file.

func (*Reader) ReadBlob

func (r *Reader) ReadBlob(index int) (*BlobData, error)

ReadBlob reads the content of a specific blob by index. The footer is read automatically if not already cached.

func (*Reader) ReadBlobByMetadata

func (r *Reader) ReadBlobByMetadata(meta BlobMetadata) ([]byte, error)

ReadBlobByMetadata reads a blob using its metadata directly. This is useful when you have metadata from an external source.

type ReaderAtSeeker

type ReaderAtSeeker interface {
	io.ReaderAt
	io.Seeker
}

ReaderAtSeeker combines io.ReaderAt and io.Seeker for reading Puffin files. This interface is implemented by *os.File, *bytes.Reader, and similar types.

type ReaderOption

type ReaderOption func(*Reader)

ReaderOption configures a Reader.

func WithMaxBlobSize

func WithMaxBlobSize(size int64) ReaderOption

WithMaxBlobSize sets the maximum blob size allowed when reading. This prevents OOM attacks from malicious files with huge blob lengths. Default is DefaultMaxBlobSize (256 MB).
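
ReaderOption follows the functional options pattern. The sketch below reproduces that pattern with lowercase stand-in types (reader, readerOption, withMaxBlobSize, newReader, and checkLength are all hypothetical names for this example) to show how a size limit can be applied before a buffer is allocated. It is an assumption that the package enforces the limit in a comparable way.

```go
package main

import "fmt"

// defaultMaxBlobSize mirrors DefaultMaxBlobSize (256 MB).
const defaultMaxBlobSize = 256 << 20

// reader is a stand-in that holds only the configurable limit.
type reader struct {
	maxBlobSize int64
}

// readerOption mutates a reader during construction.
type readerOption func(*reader)

// withMaxBlobSize returns an option that overrides the default limit.
func withMaxBlobSize(size int64) readerOption {
	return func(r *reader) { r.maxBlobSize = size }
}

// newReader applies the default first, then each option in order.
func newReader(opts ...readerOption) *reader {
	r := &reader{maxBlobSize: defaultMaxBlobSize}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// checkLength rejects a declared blob length before any allocation happens.
func (r *reader) checkLength(length int64) error {
	if length < 0 || length > r.maxBlobSize {
		return fmt.Errorf("blob length %d is outside [0, %d]", length, r.maxBlobSize)
	}
	return nil
}

func main() {
	r := newReader(withMaxBlobSize(1 << 20)) // 1 MB limit
	fmt.Println(r.checkLength(512))          // <nil>
	fmt.Println(r.checkLength(2<<20) != nil) // true
}
```

Checking the length taken from the footer before calling make is what keeps a corrupt or hostile length field from triggering a very large allocation.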

type Writer

type Writer struct {
	// contains filtered or unexported fields
}

Writer writes blobs and metadata to a Puffin file.

Usage:

w, err := puffin.NewWriter(file)
if err != nil {
    return err
}
_, err = w.AddBlob(puffin.BlobMetadataInput{
    Type:       puffin.BlobTypeDataSketchesTheta,
    SnapshotID: 123,
    Fields:     []int32{1},
}, sketchBytes)
if err != nil {
    return err
}
return w.Finish()

func NewWriter

func NewWriter(w io.Writer) (*Writer, error)

NewWriter creates a new Writer and writes the file header magic. The caller is responsible for closing the underlying writer after Finish returns.

func (*Writer) AddBlob

func (w *Writer) AddBlob(input BlobMetadataInput, data []byte) (BlobMetadata, error)

AddBlob writes blob data and records its metadata for the footer. It returns the complete BlobMetadata, including the computed Offset and Length. The input.Type field is required; use constants such as BlobTypeDataSketchesTheta.

func (*Writer) AddProperties

func (w *Writer) AddProperties(props map[string]string) error

AddProperties merges the provided properties into the file-level properties written to the footer. It can be called multiple times before Finish.

func (*Writer) ClearProperties

func (w *Writer) ClearProperties()

ClearProperties clears the file-level properties.

func (*Writer) Finish

func (w *Writer) Finish() error

Finish writes the footer and completes the Puffin file structure. Must be called exactly once after all blobs are written. After Finish returns, no further operations are allowed on the writer.

func (*Writer) SetCreatedBy

func (w *Writer) SetCreatedBy(createdBy string) error

SetCreatedBy overrides the default "created-by" property written to the footer. The default value is "iceberg-go". Example: "MyApp version 1.2.3".
