Documentation ¶
Overview ¶
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Package puffin provides reading and writing of Puffin files.
Puffin is a file format designed to store statistics and indexes for Iceberg tables. A Puffin file contains blobs (opaque byte sequences), such as Apache DataSketches sketches or deletion vectors, each with associated metadata.
File structure:
[Magic] [Blob]* [Magic] [Footer Payload] [Footer Payload Size] [Flags] [Magic]
See the specification at https://iceberg.apache.org/puffin-spec/
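The layout above can be walked from the end of the file using only the standard library. The following is a minimal sketch, not this package's implementation: the magic bytes ("PFA1") and the 4-byte little-endian size and flags fields are taken from the Puffin specification rather than from this page, and `footerPayload` and `buildFile` are hypothetical helper names introduced for illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// magic is the 4-byte Puffin marker ("PFA1") as given in the Puffin spec.
var magic = []byte{0x50, 0x46, 0x41, 0x31}

// trailerSize covers [Footer Payload Size] [Flags] [Magic], 4 bytes each.
const trailerSize = 12

// footerPayload (hypothetical helper) returns the raw footer payload and the
// flags word from a complete in-memory Puffin file image.
func footerPayload(file []byte) ([]byte, uint32, error) {
	// Smallest possible file: header magic, footer magic, empty payload, trailer.
	if len(file) < 2*len(magic)+trailerSize {
		return nil, 0, fmt.Errorf("file too small: %d bytes", len(file))
	}
	if !bytes.HasPrefix(file, magic) || !bytes.HasSuffix(file, magic) {
		return nil, 0, fmt.Errorf("missing magic at start or end of file")
	}
	trailer := file[len(file)-trailerSize:]
	size := int(binary.LittleEndian.Uint32(trailer[0:4]))
	flags := binary.LittleEndian.Uint32(trailer[4:8])

	payloadEnd := len(file) - trailerSize
	payloadStart := payloadEnd - size
	// The payload must leave room for the header magic and the footer magic.
	if size < 0 || payloadStart < 2*len(magic) {
		return nil, 0, fmt.Errorf("footer payload size %d exceeds file", size)
	}
	if !bytes.Equal(file[payloadStart-len(magic):payloadStart], magic) {
		return nil, 0, fmt.Errorf("missing magic before footer payload")
	}
	return file[payloadStart:payloadEnd], flags, nil
}

// buildFile (hypothetical helper) assembles a file image with one blob and an
// uncompressed footer payload, with no flags set.
func buildFile(blob, payload []byte) []byte {
	var buf bytes.Buffer
	buf.Write(magic)
	buf.Write(blob)
	buf.Write(magic)
	buf.Write(payload)
	var tail [8]byte // bytes 4..7 stay zero: no flags
	binary.LittleEndian.PutUint32(tail[0:4], uint32(len(payload)))
	buf.Write(tail[:])
	buf.Write(magic)
	return buf.Bytes()
}

func main() {
	file := buildFile([]byte("abc"), []byte(`{"blobs":[]}`))
	payload, flags, err := footerPayload(file)
	if err != nil {
		panic(err)
	}
	fmt.Printf("payload=%s flags=%d\n", payload, flags)
}
```

Reading the fixed-size trailer first is what lets a reader locate the footer without scanning the blobs. If bit 0 of the flags word is set, the payload is compressed and would need decompressing before JSON decoding, which this sketch does not attempt.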
Index ¶
- Constants
- type BlobData
- type BlobMetadata
- type BlobMetadataInput
- type BlobType
- type Footer
- type Reader
- func (r *Reader) Blobs() []BlobMetadata
- func (r *Reader) Properties() map[string]string
- func (r *Reader) ReadAllBlobs() ([]*BlobData, error)
- func (r *Reader) ReadAt(p []byte, off int64) (n int, err error)
- func (r *Reader) ReadBlob(index int) (*BlobData, error)
- func (r *Reader) ReadBlobByMetadata(meta BlobMetadata) ([]byte, error)
- type ReaderAtSeeker
- type ReaderOption
- type Writer
Constants ¶
const (
	// [Magic] [FooterPayload] [FooterPayloadSize] [Flags] [Magic]

	// MagicSize is the number of bytes in the magic marker.
	MagicSize = 4

	FooterFlagCompressed = 1 // bit 0

	// Prevents OOM
	// DefaultMaxBlobSize is the maximum blob size allowed when reading (256 MB).
	// Override with WithMaxBlobSize when creating a reader.
	DefaultMaxBlobSize = 256 << 20

	// CreatedBy is a human-readable identification of the application writing the file, along with its version.
	// Example: "Trino version 381".
	CreatedBy = "created-by"
)
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type BlobData ¶
type BlobData struct {
Metadata BlobMetadata
Data []byte
}
BlobData pairs a blob's metadata with its content.
type BlobMetadata ¶
type BlobMetadata struct {
Type BlobType `json:"type"`
Fields []int32 `json:"fields"`
SnapshotID int64 `json:"snapshot-id"`
SequenceNumber int64 `json:"sequence-number"`
Offset int64 `json:"offset"`
Length int64 `json:"length"`
CompressionCodec *string `json:"compression-codec,omitempty"`
Properties map[string]string `json:"properties,omitempty"`
}
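The struct tags determine how each entry appears in the footer JSON. The sketch below copies the declaration locally so it compiles on its own (real code would import the puffin package instead) and marshals one entry with the standard `encoding/json` package; `encode` is a hypothetical helper and the field values are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlobType and BlobMetadata are local copies of the declarations above,
// repeated here only so the example is self-contained.
type BlobType string

type BlobMetadata struct {
	Type             BlobType          `json:"type"`
	Fields           []int32           `json:"fields"`
	SnapshotID       int64             `json:"snapshot-id"`
	SequenceNumber   int64             `json:"sequence-number"`
	Offset           int64             `json:"offset"`
	Length           int64             `json:"length"`
	CompressionCodec *string           `json:"compression-codec,omitempty"`
	Properties       map[string]string `json:"properties,omitempty"`
}

// encode (hypothetical helper) renders one metadata entry as JSON.
func encode(m BlobMetadata) (string, error) {
	b, err := json.Marshal(m)
	if err != nil {
		return "", fmt.Errorf("marshal blob metadata: %w", err)
	}
	return string(b), nil
}

func main() {
	s, err := encode(BlobMetadata{
		Type:           "apache-datasketches-theta-v1",
		Fields:         []int32{1},
		SnapshotID:     123,
		SequenceNumber: 4,
		Offset:         4,
		Length:         16,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
	// {"type":"apache-datasketches-theta-v1","fields":[1],"snapshot-id":123,"sequence-number":4,"offset":4,"length":16}
}
```

Because CompressionCodec is a pointer and Properties is a map, both tagged `omitempty`, they are left out of the JSON when unset, so an uncompressed blob carries no `compression-codec` key.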
type BlobMetadataInput ¶
type BlobMetadataInput struct {
Type BlobType
SnapshotID int64
SequenceNumber int64
Fields []int32
Properties map[string]string
}
BlobMetadataInput contains fields the caller provides when adding a blob. Offset, Length, and CompressionCodec are set by the writer.
type BlobType ¶
type BlobType string
const (
	// BlobTypeDataSketchesTheta is a serialized compact Theta sketch
	// produced by the Apache DataSketches library.
	BlobTypeDataSketchesTheta BlobType = "apache-datasketches-theta-v1"

	// BlobTypeDeletionVector is a serialized deletion vector per the
	// Iceberg spec. Requires snapshot-id and sequence-number to be -1.
	BlobTypeDeletionVector BlobType = "deletion-vector-v1"
)
type Footer ¶
type Footer struct {
}
Footer describes the blobs and file-level properties stored in a Puffin file.
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
Reader reads blobs and metadata from a Puffin file.
Usage:
r, err := puffin.NewReader(file)
if err != nil {
	return err
}
for i := range r.Blobs() {
	blob, err := r.ReadBlob(i)
	if err != nil {
		return err
	}
	_ = blob.Data // process blob.Data here
}
func NewReader ¶
func NewReader(r ReaderAtSeeker, opts ...ReaderOption) (*Reader, error)
NewReader creates a new Puffin file reader. The file size is auto-detected using Seek. It validates magic bytes and reads the footer eagerly. The caller is responsible for closing the underlying reader.
func (*Reader) Blobs ¶
func (r *Reader) Blobs() []BlobMetadata
Blobs returns the blob metadata entries from the footer.
func (*Reader) Properties ¶
Properties returns the file-level properties from the footer.
func (*Reader) ReadAllBlobs ¶
ReadAllBlobs reads all blobs from the file.
func (*Reader) ReadAt ¶
ReadAt implements io.ReaderAt, reading from the blob data region. It validates that the read range is within the blob data region. This is useful for the deletion vector use case, where a manifest stores an offset and length pointing directly into the Puffin file.
func (*Reader) ReadBlob ¶
ReadBlob reads the content of a specific blob by index. The footer is read automatically if not already cached.
func (*Reader) ReadBlobByMetadata ¶
func (r *Reader) ReadBlobByMetadata(meta BlobMetadata) ([]byte, error)
ReadBlobByMetadata reads a blob using its metadata directly. This is useful when you have metadata from an external source.
type ReaderAtSeeker ¶
ReaderAtSeeker combines io.ReaderAt and io.Seeker for reading Puffin files. This interface is implemented by *os.File, *bytes.Reader, and similar types.
type ReaderOption ¶
type ReaderOption func(*Reader)
ReaderOption configures a Reader.
func WithMaxBlobSize ¶
func WithMaxBlobSize(size int64) ReaderOption
WithMaxBlobSize sets the maximum blob size allowed when reading. This prevents OOM attacks from malicious files with huge blob lengths. Default is DefaultMaxBlobSize (256 MB).
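ReaderOption follows the functional options pattern. The sketch below illustrates how such an option and a size guard might fit together; `reader`, `option`, `withMaxBlobSize`, `newReader`, and `checkLength` are hypothetical stand-ins written for this example, not the package's internals.

```go
package main

import "fmt"

// defaultMaxBlobSize mirrors DefaultMaxBlobSize (256 << 20 bytes).
const defaultMaxBlobSize = 256 << 20

// reader and option are hypothetical stand-ins for Reader and ReaderOption.
type reader struct {
	maxBlobSize int64
}

type option func(*reader)

// withMaxBlobSize returns an option that overrides the size limit.
func withMaxBlobSize(size int64) option {
	return func(r *reader) { r.maxBlobSize = size }
}

// newReader applies the default first, then each option in order.
func newReader(opts ...option) *reader {
	r := &reader{maxBlobSize: defaultMaxBlobSize}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// checkLength rejects a declared blob length before any buffer is allocated.
func (r *reader) checkLength(length int64) error {
	if length < 0 || length > r.maxBlobSize {
		return fmt.Errorf("blob length %d outside [0, %d]", length, r.maxBlobSize)
	}
	return nil
}

func main() {
	fmt.Println(newReader().checkLength(1 << 20))                     // <nil>
	fmt.Println(newReader(withMaxBlobSize(1024)).checkLength(2048)) // blob length 2048 outside [0, 1024]
}
```

The point of checking before allocation is that the length comes from the file's footer, so a corrupt or hostile file could otherwise trigger a very large `make([]byte, length)`.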
type Writer ¶
type Writer struct {
// contains filtered or unexported fields
}
Writer writes blobs and metadata to a Puffin file.
Usage:
w, err := puffin.NewWriter(file)
if err != nil {
	return err
}
_, err = w.AddBlob(puffin.BlobMetadataInput{
	Type:       puffin.BlobTypeDataSketchesTheta,
	SnapshotID: 123,
	Fields:     []int32{1},
}, sketchBytes)
if err != nil {
	return err
}
return w.Finish()
func NewWriter ¶
NewWriter creates a new Writer and writes the file header magic. The caller is responsible for closing the underlying writer after Finish returns.
func (*Writer) AddBlob ¶
func (w *Writer) AddBlob(input BlobMetadataInput, data []byte) (BlobMetadata, error)
AddBlob writes blob data and records its metadata for the footer. It returns the complete BlobMetadata, including the computed Offset and Length. The input.Type field is required; use constants such as BlobTypeDataSketchesTheta.
func (*Writer) AddProperties ¶
AddProperties merges the provided properties into the file-level properties written to the footer. It can be called multiple times before Finish.
func (*Writer) Finish ¶
Finish writes the footer and completes the Puffin file structure. Must be called exactly once after all blobs are written. After Finish returns, no further operations are allowed on the writer.
func (*Writer) SetCreatedBy ¶
SetCreatedBy overrides the default "created-by" property written to the footer. The default value is "iceberg-go". Example: "MyApp version 1.2.3".