lakeorm

package module
v0.0.0-...-74b1f9c
Warning

This package is not in the latest version of its module.

Published: Apr 23, 2026 | License: MIT | Imports: 22 | Imported by: 0

README

lake-orm

lake-orm is the batteries-included ORM for building on the lakehouse in Go.

The problem

Building reliable data systems on top of lakehouse technologies is harder than it should be.

Data is often written to staging tables, replayed through pipelines, and fixed downstream after the fact. Failures are handled asynchronously, and correctness is inferred later rather than enforced at the boundary.

Across organisations, the same ingestion and validation systems are rebuilt repeatedly, yet data quality issues still persist.

This leads to systems that are:

  • difficult to reason about
  • hard to debug
  • prone to silent data quality issues

lake-orm takes a different approach: validate data at the application boundary and write it correctly once, using deterministic, request-scoped operations.

It implements transactional, synchronous semantics at the application layer without sacrificing batch performance.

A data-quality-first approach to the lakehouse.

Designed for production workloads and compatible with Databricks, Delta, and Iceberg.

Write correct data once, so you don’t have to fix it later.

CQRS-shaped, untyped at the core, typed at the edge. Typed structs in, tagged columns out.


Quick start

package main

import (
    "context"

    lakeorm "github.com/datalake-go/lake-orm"
    "github.com/datalake-go/lake-orm/backends"
    "github.com/datalake-go/lake-orm/dialects/iceberg"
    "github.com/datalake-go/lake-orm/drivers/spark"
)

type User struct {
    ID    string `lake:"id,pk"         validate:"required"`
    Email string `lake:"email,mergeKey" validate:"required,email"`
}

func main() {
    ctx := context.Background()

    // Errors are discarded for brevity; handle them in real code.
    store, _ := backends.S3("s3://bucket/lake")
    drv := spark.Remote("sc://host:15002")
    db, _ := lakeorm.Open(drv, iceberg.Dialect(), store)
    defer db.Close()

    _ = db.Migrate(ctx, &User{})
    _ = db.Insert(ctx, []*User{{ID: "u1", Email: "alice@example.com"}})

    users, _ := lakeorm.Query[User](ctx, db, drv.FromSQL("SELECT * FROM users WHERE id = ?", "u1"))
    _ = users
}

lake-orm scopes every write to a single ingest_id.

Small writes can go direct; large writes are staged as Parquet on object storage and materialised with a single operation. This staging layer is temporary and request-scoped—it is not a bronze table or long-lived source of truth.

Each operation is bounded:

  • Insert paths materialise exactly one ingest
  • Merge paths reconcile exactly one ingest against the target table
  • All reads during a merge are filtered by ingest_id

If any step fails, the request fails. There is no partially committed state and no downstream recovery process.

This keeps write semantics deterministic, synchronous, and easy to debug.


Philosophy

Write correct data once. Prefer synchronous workflows with retries at the client, so you don’t have to fix it later.
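
As a rough illustration of "retries at the client", the sketch below wraps a synchronous write in a bounded retry loop with a doubling pause. The names retryInsert and demoRetry are hypothetical and not part of lake-orm; the write callback stands in for a call such as db.Insert.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryInsert calls write up to attempts times, doubling the pause between
// tries. write stands in for a synchronous, all-or-nothing call such as
// db.Insert; the helper itself is hypothetical and not part of lake-orm.
func retryInsert(ctx context.Context, attempts int, pause time.Duration, write func(context.Context) error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = write(ctx); err == nil {
			return nil
		}
		if i == attempts {
			break
		}
		select {
		case <-ctx.Done():
			// The caller gave up; report both the last failure and the reason.
			return errors.Join(err, ctx.Err())
		case <-time.After(pause):
			pause *= 2
		}
	}
	return fmt.Errorf("write failed after %d attempts: %w", attempts, err)
}

// demoRetry simulates a write that fails twice with a transient error and
// then succeeds, returning how many calls were made.
func demoRetry() (int, error) {
	calls := 0
	err := retryInsert(context.Background(), 5, time.Millisecond, func(context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("transient commit conflict")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demoRetry()
	fmt.Println("calls:", calls, "err:", err)
}
```

In practice a caller would probably retry only errors known to be transient (a commit conflict, say) and return validation failures immediately, since retrying those cannot succeed.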

Rethinking the medallion architecture

The bronze layer commonly described as "best practice" exists for two practical reasons, neither of which is data quality.

The first is that Spark is pull-based: until Spark Connect, external applications could not push data into a running cluster. Data had to land in object storage first so Spark could read it.

The second is Delta's optimistic concurrency control (OCC): concurrent MERGE operations conflict at the file level, so teams batch incoming data into append-only tables and merge in bulk to minimise contention.

These are engine constraints, not principles of OLAP.

Software engineers have validated inputs at the application boundary for decades. A Go struct, a protobuf message, a Pydantic model — the type system is the contract, and the contract is enforced before anything touches storage.

Spark Connect, which runs gRPC over HTTP/2, lets a stateful Go application dial directly into a cluster, construct a DataFrame from already-validated inputs, and write straight to Iceberg or Delta.

No bronze. No autoloader. No asynchronous pipeline that fails hours later.

lake-orm pushes validation upstream and writes validated, tagged structs directly into the lakehouse.

Read the long form at https://callumdempseyleach.tech/writing/medallion-architecture/

Migrations: authoring, not evolution

The ORM emits ALTER TABLE statements when your struct drifts from the most-recent migration's recorded state. MigrateGenerate produces goose-format .sql files with -- DESTRUCTIVE: <reason> comments on risky ops and a lakeorm.sum manifest that catches post-generation edits.

But the library actively discourages schema evolution as a routine operation. Every emitted Up block carries a block comment reminding the reviewer that evolution is a tax on every downstream consumer and a common source of silently wrong data. The right move when a schema needs to change is almost always to rethink the model, not to ship a column rename.

Schemas are contracts. Define them up front; clean data at the application boundary; let lake-orm enforce the shape. If your data doesn't fit, fix the producer, not the table. See the FAQ in the wiki for the long answer on handling unstructured upstream data.
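
The -- DESTRUCTIVE: <reason> comments mentioned above lend themselves to a simple review gate. The sketch below is one way a CI step might list them; destructiveReasons and the sample migration text are invented for this example, and real generated files may place the comment differently.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

const marker = "-- DESTRUCTIVE:"

// destructiveReasons returns the reason text of every line in a migration
// file that starts with the DESTRUCTIVE marker.
func destructiveReasons(sql string) []string {
	var reasons []string
	sc := bufio.NewScanner(strings.NewReader(sql))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, marker) {
			reasons = append(reasons, strings.TrimSpace(strings.TrimPrefix(line, marker)))
		}
	}
	return reasons
}

// sampleMigration is invented for this example; it is not output copied
// from MigrateGenerate.
const sampleMigration = `-- +goose Up
-- DESTRUCTIVE: drops column legacy_code
ALTER TABLE users DROP COLUMN legacy_code;
ALTER TABLE users ADD COLUMN tier STRING;
`

func main() {
	for _, r := range destructiveReasons(sampleMigration) {
		fmt.Println("needs review:", r)
	}
}
```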


Features

  • Go structs as the schema contract. Tag a field lake:"id,pk" and the column, primary-key, merge-key, partition, and nullability all flow from the struct. lake:"...", lakeorm:"...", and spark:"..." parse equivalently.

  • Validation at the application boundary. lakeorm.Validate(records) wraps go-playground/validator — rules live on the standard validate:"required,email,uuid,..." struct tag, errors unwrap to validator.ValidationErrors for per-field HTTP-400 responses.

  • Strict JSON decoding. lakeorm.FromJSON[T](payload) decodes HTTP / queue payloads straight into your tagged model and rejects any field the struct doesn't declare — the ingest-boundary corollary to "no schema evolution". The struct is the semantic layer.

  • Two write paths, one API. Small batches go direct; large ones use Parquet staging plus a single Spark operation.

  • Typed reads — Query[T] / QueryStream[T] / QueryFirst[T]. One shape for every read, whether the query comes from a SQL string, a native Spark DataFrame chain, or a pre-opened *sql.Rows. The projection struct is the contract; the driver builds the Source (drv.FromSQL(...), drv.FromDataFrame(...), drv.FromTable(...), drv.FromRows(...)).

  • Iceberg and Delta dialects. Pluggable via a Dialect interface.

  • Multiple drivers.

    • Spark Connect (self-hosted)
    • Databricks SQL (warehouse)
    • Databricks Connect (interactive clusters)
  • Pluggable backends. S3, GCS, file, memory.

  • Migration authoring support. Django-style file generation for schema drift; schema evolution itself is deliberately discouraged — see Philosophy.

  • UUIDv7 ingest IDs. Batch-scoped identity and reconciliation.

  • No telemetry. Only connects to configured endpoints.
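
To make "Go structs as the schema contract" concrete, here is a minimal sketch of reading lake:"..." tags with reflection. The column type and parseColumns are hypothetical simplifications; lake-orm's own ParseSchema and LakeSchema are richer, and only the pk and mergeKey options shown in this README are handled.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// column is a hypothetical, simplified view of one resolved field.
type column struct {
	Name     string
	PK       bool
	MergeKey bool
}

// parseColumns reads the lake:"name,opt,..." tag on each field of the
// struct type T and returns one column per tagged field.
func parseColumns[T any]() ([]column, error) {
	t := reflect.TypeOf((*T)(nil)).Elem()
	if t.Kind() != reflect.Struct {
		return nil, fmt.Errorf("%s is not a struct", t)
	}
	var cols []column
	for i := 0; i < t.NumField(); i++ {
		tag, ok := t.Field(i).Tag.Lookup("lake")
		if !ok {
			continue // untagged fields are not part of the contract
		}
		parts := strings.Split(tag, ",")
		c := column{Name: strings.TrimSpace(parts[0])}
		if c.Name == "" {
			return nil, fmt.Errorf("field %s: empty column name", t.Field(i).Name)
		}
		for _, opt := range parts[1:] {
			switch strings.TrimSpace(opt) {
			case "pk":
				c.PK = true
			case "mergeKey":
				c.MergeKey = true
			}
		}
		cols = append(cols, c)
	}
	return cols, nil
}

type User struct {
	ID    string `lake:"id,pk" validate:"required"`
	Email string `lake:"email,mergeKey" validate:"required,email"`
}

func main() {
	cols, err := parseColumns[User]()
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, c := range cols {
		fmt.Printf("%s pk=%t mergeKey=%t\n", c.Name, c.PK, c.MergeKey)
	}
}
```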


Architecture

Lakehouse systems are analytical and columnar. Traditional ORMs assume stable row shapes, but joins produce dynamic projections.

lake-orm uses a CQRS-style approach:

  • Write-side: one struct per table
  • Read-side: one struct per query shape

Typing is applied at the materialization edge.

Go structs (lake:"..." tags)
    │
    ▼
lakeorm.Client ──► Dialect ──► Driver ──► Spark Connect
    │                                      │
    │   writes: Insert (plan → finalize)   ▼
    │   reads:  Query[T] over              Iceberg / Delta
    │          drivers.Source              │
    │          (drv.FromSQL /               ▼
    │           drv.FromDataFrame /        Object storage
    │           drv.FromTable /
    │           drv.FromRows)

Large writes stream Parquet directly to object storage and are materialised with a single Spark operation.


Upserts

Declare a mergeKey on a struct field and Insert flips from append semantics to upsert semantics. On the fast path the driver emits MERGE INTO instead of INSERT INTO, with the MERGE source filtered by the current operation's _ingest_id so retry-on-OCC-conflict is idempotent.

Single mergeKey:

type Customer struct {
    ID    types.SortableID `lake:"id,pk"`
    Email string           `lake:"email,mergeKey"` // upsert identity
    Tier  string           `lake:"tier"`
}

Emitted SQL on the large-batch path (Iceberg / Delta):

MERGE INTO customers AS target
USING (SELECT * FROM <staging> WHERE _ingest_id = '<batch>') AS source
ON target.email = source.email
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Composite mergeKey — declare mergeKey on multiple fields and every key becomes part of the ON clause, AND-joined. Useful when identity is a tuple (e.g. (user_id, date) for a daily position table, or (tenant_id, resource_id) for multi-tenant rows):

type Position struct {
    UserID types.SortableID `lake:"user_id,mergeKey"`
    Date   time.Time        `lake:"date,mergeKey"`
    Value  int64            `lake:"value"`
    Note   string           `lake:"note"`
}

Emitted SQL:

MERGE INTO positions AS target
USING (SELECT * FROM <staging> WHERE _ingest_id = '<batch>') AS source
ON target.user_id = source.user_id AND target.date = source.date
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
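
The two statements above differ only in their ON clause. A minimal sketch of how such a statement could be assembled from a list of merge keys follows; buildMerge is a hypothetical helper written for this example, not the code lake-orm's dialects use.

```go
package main

import (
	"fmt"
	"strings"
)

// buildMerge renders a MERGE statement whose ON clause AND-joins one
// equality per merge key. Identifiers and the ingest ID are interpolated
// as-is, so they must come from trusted schema metadata, never user input.
func buildMerge(table, staging, ingestID string, mergeKeys []string) (string, error) {
	if len(mergeKeys) == 0 {
		return "", fmt.Errorf("table %s: at least one merge key is required", table)
	}
	on := make([]string, len(mergeKeys))
	for i, k := range mergeKeys {
		on[i] = fmt.Sprintf("target.%s = source.%s", k, k)
	}
	return fmt.Sprintf(`MERGE INTO %s AS target
USING (SELECT * FROM %s WHERE _ingest_id = '%s') AS source
ON %s
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *`, table, staging, ingestID, strings.Join(on, " AND ")), nil
}

func main() {
	sql, err := buildMerge("positions", "stg", "b1", []string{"user_id", "date"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sql)
}
```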

WHEN MATCHED THEN UPDATE SET * updates every non-key column, which is usually what you want. If you need to update only a subset (e.g. keep created_at immutable on upsert), write the MERGE yourself via Client.Exec or drop to the concrete driver (db.Driver().(*spark.Driver).Session(ctx)) — the auto-routed path doesn't split non-key columns into update vs insert buckets.

For a full walkthrough see examples/ (the bulk and upsert examples) and the wiki's Advanced Functions page.


Ingesting from JSON

lakeorm.FromJSON[T](payload) is the strict-decoding counterpart to Validate. Use it at the HTTP handler / queue consumer boundary when the upstream payload is JSON — it rejects any field your model doesn't declare, then runs the registered validator. If the decode or validation fails, return a 400 and don't touch the lakehouse.

type Book struct {
    ID     string `json:"id"     lake:"id,pk"        validate:"required,uuid"`
    Title  string `json:"title"  lake:"title"        validate:"required"`
    Author string `json:"author" lake:"author"`
}

func handleCreateBook(w http.ResponseWriter, r *http.Request) {
    book, err := lakeorm.FromJSON[Book](readBody(r))
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    if err := db.Insert(r.Context(), []*Book{book}); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    w.WriteHeader(http.StatusCreated)
}

Rejection semantics:

  • Extra top-level keys — a JSON key that doesn't correspond to a declared struct field → decode error (encoding/json's DisallowUnknownFields).
  • Extra keys inside a declared nested struct — same strict rule propagates down.
  • Type mismatches — JSON object where the struct expects a string, JSON number where it expects a bool, etc. → decode error.
  • Missing required fields → validation error (validate:"required" enforcement via the standard validator).
  • Trailing data after the first JSON value — rejected. One payload, one call.
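
Most of the rejection rules above can be approximated with encoding/json alone. The sketch below is not FromJSON's implementation; decodeStrict and rejects are hypothetical names, and the required-field check is left out because it belongs to the validator.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

type Book struct {
	ID     string `json:"id"`
	Title  string `json:"title"`
	Author string `json:"author"`
}

// decodeStrict decodes exactly one JSON value into T, rejecting unknown
// keys (at any nesting depth), type mismatches, and trailing data.
func decodeStrict[T any](payload []byte) (*T, error) {
	dec := json.NewDecoder(bytes.NewReader(payload))
	dec.DisallowUnknownFields()
	var v T
	if err := dec.Decode(&v); err != nil {
		return nil, fmt.Errorf("decode: %w", err)
	}
	// A second Decode must hit io.EOF; anything else means trailing data.
	var extra json.RawMessage
	if err := dec.Decode(&extra); !errors.Is(err, io.EOF) {
		return nil, errors.New("decode: unexpected data after the first JSON value")
	}
	return &v, nil
}

// rejects reports whether decodeStrict refuses the payload for a Book.
func rejects(payload string) bool {
	_, err := decodeStrict[Book]([]byte(payload))
	return err != nil
}

func main() {
	for _, p := range []string{
		`{"id":"1","title":"Dune"}`,
		`{"id":"1","isbn":"x"}`,
		`{"id":1}`,
		`{"id":"1"} {"id":"2"}`,
	} {
		fmt.Printf("%-28s rejected=%t\n", p, rejects(p))
	}
}
```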

Why strict. lake-orm's write semantics treat the struct as the schema contract. Silent drop-on-ingest of unknown fields would make two very different problems indistinguishable at the ingest boundary: your producer is sending the wrong shape (fix the producer) vs. your model is out of date (update the model + ship the migration). Forcing the decode to error surfaces the question before the data lands.

FromJSON is a regular Go function, not a framework primitive. If you need looser semantics (e.g. you're legitimately accepting third-party JSON you don't control), decode with encoding/json directly, transform in Go code, and hand the validated result to Insert. The library doesn't hide anything; it just refuses to do the unsafe thing by default.


Examples

  • basic usage
  • bulk inserts
  • streaming reads
  • joins (CQRS)
  • validation
  • ingest ID usage
  • migrations
  • Databricks integration

Install

go get github.com/datalake-go/lake-orm

  • lakehouse — full runtime composition
  • spark-connect-go — Spark execution layer

Under the hood

A brief for readers who want to know what's actually happening when they call Insert or Query[T]. Depth lives in the wiki; this section is the five-minute version.

  • Struct tags are the contract. ParseSchema reads lake:"..." (or lakeorm:"...", or spark:"...") once, caches the resolved LakeSchema per Go type. Column names, primary key, merge keys, partition intent, nullability all live there. The schema is what the Dialect consults to emit DDL and what the parquet writer uses to synthesise its row shape.
  • Composition at Open. Three orthogonal pieces plug together into a Client: a Driver (transport — Spark Connect / Databricks SQL warehouse / Databricks Connect), a Dialect (DDL grammar, INSERT vs MERGE planning — Iceberg / Delta / DuckDB), and a Backend (where bytes live — S3, GCS, file, memory). Swapping one leaves the other two untouched.
  • Validation is pass-through. lakeorm.Validate(records) calls validator.Struct() from go-playground/validator. Validate at the boundary before Insert — the same rules run either way, but surfacing failures early gives the HTTP handler a clean 400.
  • Writes are request-scoped. Client.Insert generates a UUIDv7 ingest_id, hands a WriteRequest to the Dialect, gets back an ExecutionPlan. Plan kinds: KindDirectIngest (small batch, one bulk INSERT); KindParquetIngest (large append — stage parquet to <warehouse>/<ingest_id>/part-*.parquet through the Backend, then one INSERT INTO target SELECT * FROM parquet.<staging>); KindParquetMerge (merge-key present — same staging, but the driver emits MERGE INTO target USING (SELECT * FROM staging WHERE _ingest_id = '<batch>') ON target.<mergeKey> = source.<mergeKey>). The _ingest_id filter on the MERGE source bounds the operation to this batch and makes retry-after-OCC-conflict idempotent.
  • _ingest_id is system-managed. Every table lake-orm creates carries the column; the parquet writer stamps each row with the current batch's ingest_id. User structs don't declare it — declaring _ingest_id is a schema-parse error. Reconciliation queries use a projection struct that adds the column back on the read side.
  • Reads are typed at the edge. Query[T] / QueryStream[T] / QueryFirst[T] take a drivers.Source — a closure the concrete driver builds via its own conversion helper (drv.FromSQL, drv.FromDataFrame, drv.FromTable, drv.FromRows). The driver's drivers.Convertible implementation invokes the source, walks its native rows (Spark DataFrame / *sql.Rows), and scans each one into the caller-supplied Go struct through lake-orm's reflection scanner. Unmapped columns are dropped silently — that's why SELECT * against a table with _ingest_id returns clean rows into a struct that doesn't declare it. See examples/arbitrary_spark_query for the pattern in full: a native Spark GroupBy + Agg chain with zero SQL, decoded into a projection struct.
  • Migrations are file-based. MigrateGenerate replays the most recent migration file's State-JSON header to reconstruct the prior schema, diffs against the current struct, emits one goose-format .sql file per changed table with -- DESTRUCTIVE: <reason> comments on risky ops. lakeorm.sum manifest catches post-generation edits. Apply-time execution happens through lake-goose (separate binary), not lake-orm.
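
The three plan kinds in the write bullet above can be pictured as a small decision function. This is a sketch under stated assumptions: the README does not spell out the exact routing rules, so treating any merge key as a merge and comparing the batch size against a byte threshold is a guess, and choosePlan is a hypothetical name.

```go
package main

import "fmt"

type planKind int

const (
	kindDirectIngest planKind = iota
	kindParquetIngest
	kindParquetMerge
)

func (k planKind) String() string {
	switch k {
	case kindDirectIngest:
		return "KindDirectIngest"
	case kindParquetIngest:
		return "KindParquetIngest"
	case kindParquetMerge:
		return "KindParquetMerge"
	}
	return "unknown"
}

// choosePlan picks a plan kind. Assumption: a merge key always selects the
// merge plan, and otherwise the byte threshold separates direct from
// staged ingest.
func choosePlan(batchBytes, thresholdBytes int, hasMergeKey bool) planKind {
	switch {
	case hasMergeKey:
		return kindParquetMerge
	case batchBytes >= thresholdBytes:
		return kindParquetIngest
	default:
		return kindDirectIngest
	}
}

func main() {
	const threshold = 128 << 20 // 128 MiB, the documented default crossover
	fmt.Println(choosePlan(4<<10, threshold, false))
	fmt.Println(choosePlan(512<<20, threshold, false))
	fmt.Println(choosePlan(4<<10, threshold, true))
}
```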

A lakehouse is the sum of its parts. lake-orm gives those parts structure.

Documentation

Overview

Package lakeorm is the Go ORM for data lakes. Typed structs are the schema contract; validation happens at the application boundary; writes go directly to the target Iceberg or Delta table via Spark Connect — no bronze landing zone by default.

Entry points:

lakeorm.Open(driver, dialect, backend, opts...)
lakeorm.Query[T](ctx, db, source)         // typed read; source is a drivers.Source
structs.Validate(records)                 // boundary-validate before I/O

Composition is always three pieces: a Driver (transport — how we connect and execute), a Dialect (the data-dialect — DDL, DML, capabilities, semantics; Iceberg vs Delta), and a Backend (where the bytes live). Each is a stable public interface; concrete implementations live under drivers/, dialects/, and backends/.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Query

func Query[T any](ctx context.Context, db Client, source drivers.Source) ([]T, error)

Query runs source against db's driver and materialises every decoded row into []T. The driver must implement drivers.Convertible (all v0 drivers do).

Source is driver-native — build one with the driver's own conversion helper:

drv := db.Driver().(*spark.Driver)
users, err := lakeorm.Query[User](ctx, db,
    drv.FromSQL("SELECT * FROM users WHERE country = ?", "UK"))

DuckDB / Databricks use the same shape with their own FromSQL / FromRows / FromTable helpers; anything the helpers don't cover can be expressed as a bare drivers.Source closure.

The `spark:"..."` / `lake:"..."` tags on T bind result columns to fields. A field in T the source doesn't project surfaces as a schema-mismatch error.
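
The binding rule described above (extra columns are ignored, a field with no matching column is an error) can be sketched with reflection over a generic row. scanRow and the map-based row are hypothetical stand-ins for lake-orm's scanner and the driver's native rows.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// scanRow copies values from a column-name to value row into a new T,
// matching on the first element of each field's lake tag. Columns without
// a matching field are ignored; a tagged field with no column is an error.
func scanRow[T any](row map[string]any) (T, error) {
	var out, zero T
	v := reflect.ValueOf(&out).Elem()
	if v.Kind() != reflect.Struct {
		return zero, fmt.Errorf("%s is not a struct", v.Type())
	}
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		tag, ok := f.Tag.Lookup("lake")
		if !ok || !f.IsExported() {
			continue
		}
		name, _, _ := strings.Cut(tag, ",")
		val, present := row[name]
		if !present {
			return zero, fmt.Errorf("schema mismatch: column %q not in result", name)
		}
		rv := reflect.ValueOf(val)
		if !rv.IsValid() || !rv.Type().AssignableTo(f.Type) {
			return zero, fmt.Errorf("schema mismatch: column %q is %T, field wants %s", name, val, f.Type)
		}
		v.Field(i).Set(rv)
	}
	return out, nil
}

type userRow struct {
	ID      string `lake:"id"`
	Country string `lake:"country"`
}

func main() {
	u, err := scanRow[userRow](map[string]any{
		"id": "u1", "country": "UK", "_ingest_id": "ignored",
	})
	fmt.Println(u, err)

	_, err = scanRow[userRow](map[string]any{"id": "u1"})
	fmt.Println(err)
}
```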

func QueryFirst

func QueryFirst[T any](ctx context.Context, db Client, source drivers.Source) (*T, error)

QueryFirst runs source and returns the first decoded row, or errors.ErrNoRows if source yielded zero rows.

func QueryStream

func QueryStream[T any](ctx context.Context, db Client, source drivers.Source) iter.Seq2[T, error]

QueryStream runs source and yields T values one at a time with constant memory. Rangeable via Go 1.23's iter.Seq2:

drv := db.Driver().(*spark.Driver)
for row, err := range lakeorm.QueryStream[User](ctx, db,
    drv.FromSQL("SELECT * FROM users")) {
    if err != nil { break }
    // use row
}

A schema mismatch surfaces as the error value of the first pair the iterator yields.
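
For readers new to push iterators, the sketch below shows how a one-row-at-a-time stream with an error slot can be built and consumed. It does not import the iter package so that it also compiles on older toolchains; the returned function has the same underlying type as iter.Seq2[T, error]. streamRows, collect, and sliceSource are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// streamRows returns a push iterator over rows produced by next. Its type
// is the underlying type of iter.Seq2[T, error], so on Go 1.23+ callers can
// range over it directly.
func streamRows[T any](next func() (T, bool, error)) func(yield func(T, error) bool) {
	return func(yield func(T, error) bool) {
		for {
			row, ok, err := next()
			if err != nil {
				var zero T
				yield(zero, err) // report the failure, then stop
				return
			}
			if !ok {
				return // source exhausted
			}
			if !yield(row, nil) {
				return // consumer stopped early
			}
		}
	}
}

// collect drains a stream and stops at the first error, as a range loop
// with a break would.
func collect[T any](seq func(yield func(T, error) bool)) ([]T, error) {
	var out []T
	var firstErr error
	seq(func(v T, err error) bool {
		if err != nil {
			firstErr = err
			return false
		}
		out = append(out, v)
		return true
	})
	return out, firstErr
}

// sliceSource yields rows one by one and fails at index failAt (use -1 to
// never fail).
func sliceSource(rows []string, failAt int) func() (string, bool, error) {
	i := 0
	return func() (string, bool, error) {
		if i == failAt {
			return "", false, errors.New("schema mismatch")
		}
		if i >= len(rows) {
			return "", false, nil
		}
		row := rows[i]
		i++
		return row, true, nil
	}
}

func main() {
	rows, err := collect(streamRows(sliceSource([]string{"u1", "u2", "u3"}, -1)))
	fmt.Println(rows, err)
}
```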

Types

type CleanupReport

type CleanupReport struct {
	Deleted []string
	Failed  []string
	Scanned int
}

CleanupReport summarises CleanupStaging's pass. Deleted carries prefix URIs that were removed; Failed carries prefixes that matched the TTL cutoff but whose delete call errored — the operator can retry selectively.

type Client

type Client interface {
	// Insert writes records (pointer, slice, or slice-of-pointer) to
	// the target table. Validation runs first; then the Dialect
	// plans KindDirectIngest (small batch), KindParquetIngest (large
	// append), or KindParquetMerge (struct carries mergeKey) and the
	// Driver executes.
	Insert(ctx context.Context, records any, opts ...InsertOption) error

	// Driver returns the underlying driver. Callers type-assert to
	// the concrete driver type to reach per-driver conversion
	// helpers (spark.Driver.FromSQL, duckdb.Driver.FromRows, etc.)
	// or the raw native handle (spark.Driver.Session,
	// duckdb.Driver.DB).
	//
	// The reason Client exposes this instead of wrapping every
	// driver-specific helper behind a Client method is that read
	// grammar is driver-specific: a Spark DataFrame and a *sql.Rows
	// have different acquisition shapes, and hiding that costs the
	// caller power without buying portability.
	Driver() drivers.Driver

	// Exec runs a raw SQL statement that returns no rows (DDL, DML
	// that the typed surface doesn't cover).
	Exec(ctx context.Context, sql string, args ...any) (drivers.ExecResult, error)

	// Migrate is the bootstrap path — idempotent CREATE TABLE IF NOT
	// EXISTS derived from struct tags. Sufficient for dev and fresh
	// tables; ALTER TABLE-shaped schema evolution goes through
	// MigrateGenerate + lake-goose.
	Migrate(ctx context.Context, models ...any) error

	// MigrateGenerate writes iceberg/delta-dialect .sql files for any
	// pending struct diffs into dir, in goose's migration-file
	// format. Destructive operations (DROP COLUMN, RENAME COLUMN,
	// type narrowings, NOT-NULL tightenings) land with a
	// `-- DESTRUCTIVE: <reason>` informational comment so the
	// reviewer notices them in the PR diff.
	//
	// Execution is not this library's job — it belongs to lake-goose
	// running against the Spark Connect database/sql driver. A
	// lakeorm.sum manifest is emitted alongside so downstream tooling
	// can detect post-generation edits.
	//
	// Returns the list of generated file paths.
	MigrateGenerate(ctx context.Context, dir string, models ...any) ([]string, error)

	// CleanupStaging walks the Backend's _staging/ namespace and
	// deletes prefixes older than olderThan. Intended as a periodic
	// janitor (e.g. goroutine on a ticker) to sweep orphaned parquet
	// from aborted Insert calls.
	CleanupStaging(ctx context.Context, olderThan time.Duration) (*CleanupReport, error)

	// MetricsRegistry returns the Prometheus registry lakeorm writes
	// runtime metrics into. Consumed by the composed runtime
	// (lakehouse) to expose /metrics. Returns nil at v0.
	MetricsRegistry() *prometheus.Registry

	// Close releases the underlying driver, session pool, and any
	// backend resources. Safe to call once at process shutdown.
	Close() error
}

Client is the main entry point. All methods are safe to call concurrently; the internal session pool serializes per-session state as needed.

Writes bind to persisted types via Insert (auto-routes to MERGE when the struct carries a mergeKey). Reads run through the drivers.Convertible capability — build a driver-native Source with the concrete driver's conversion helpers, reach the driver via Driver(), then feed the Source to lakeorm.Query[T] / QueryStream[T] / QueryFirst[T] for typed decode:

drv := db.Driver().(*spark.Driver)
users, _ := lakeorm.Query[User](ctx, db,
    drv.FromSQL("SELECT * FROM users"))

Migrate bootstraps tables from struct tags; MigrateGenerate writes .sql files for the lake-goose migration runner; CleanupStaging sweeps orphan staging prefixes; Exec is the raw-SQL escape hatch for DDL / one-off DML the typed surface doesn't cover.

func Open

func Open(driver drivers.Driver, dialect dialects.Dialect, backend backends.Backend, opts ...ClientOption) (Client, error)

Open composes a Client from a Driver, a Dialect, a Backend, and options. All three positional arguments are required.

type ClientOption

type ClientOption func(*clientConfig)

ClientOption configures the Client at construction time. Functional options so the struct stays extensible without API breakage.

func WithCompression

func WithCompression(c Compression) ClientOption

WithCompression selects the compression codec for fast-path parquet parts. Default ZSTD — 2-3x smaller than Snappy on the shapes we care about (KSUIDs, repeated S3 paths), natively readable by all engines.

func WithDefaultCatalog

func WithDefaultCatalog(name string) ClientOption

WithDefaultCatalog / WithDefaultDatabase set fallbacks for unqualified table names.

func WithDefaultDatabase

func WithDefaultDatabase(name string) ClientOption

func WithDefaultIdempotencyTTL

func WithDefaultIdempotencyTTL(d time.Duration) ClientOption

WithDefaultIdempotencyTTL sets how long idempotency tokens remain valid in the dedup table. Default 24h.

func WithFastPathThreshold

func WithFastPathThreshold(bytes int) ClientOption

WithFastPathThreshold is the advisory byte crossover the Dialect uses to choose between gRPC and object-storage ingest. Default 128 MiB.

func WithLogger

func WithLogger(l zerolog.Logger) ClientOption

WithLogger sets the zerolog logger for the Client. Passed into every driver/format/backend component via the plumbing in lakeorm.Open.

func WithMaxInflightIngests

func WithMaxInflightIngests(n int) ClientOption

WithMaxInflightIngests caps the number of fast-path ingests that can be running concurrently. The bounded semaphore propagates backpressure into Write() — without it a bursty Go producer OOMs the process even when the partition writer "flushes correctly." Default 4.
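
A bounded semaphore of the kind described above is commonly built from a buffered channel. The sketch below is an illustration, not lake-orm's implementation; limiter and peakConcurrency are hypothetical names, and the sleep only exists to make tasks overlap.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter caps how many tasks run at once, using a buffered channel as a
// counting semaphore.
type limiter struct{ slots chan struct{} }

func newLimiter(n int) *limiter { return &limiter{slots: make(chan struct{}, n)} }

// run blocks until a slot is free, so a bursty producer is slowed down
// instead of queuing unbounded work in memory.
func (l *limiter) run(task func()) {
	l.slots <- struct{}{}
	defer func() { <-l.slots }()
	task()
}

// peakConcurrency pushes tasks through a limiter of size limit and reports
// the highest number observed running at the same time.
func peakConcurrency(limit, tasks int) int {
	l := newLimiter(limit)
	var (
		mu            sync.Mutex
		running, peak int
		wg            sync.WaitGroup
	)
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.run(func() {
				mu.Lock()
				running++
				if running > peak {
					peak = running
				}
				mu.Unlock()

				time.Sleep(time.Millisecond) // simulate an ingest in flight

				mu.Lock()
				running--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak concurrent ingests:", peakConcurrency(4, 32))
}
```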

func WithSessionPoolSize

func WithSessionPoolSize(n int) ClientOption

WithSessionPoolSize controls how many Spark Connect sessions the Client borrows in flight. Default 8 — large enough for moderate concurrency, small enough to bound cluster-side overhead. Increase for high-fanout services; decrease on tight clusters.
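
The options above follow the functional-options pattern. A minimal sketch of how defaults and overrides can compose is shown here; the clientConfig field names and newConfig are hypothetical, and only the option names and default values are taken from this page.

```go
package main

import "fmt"

// clientConfig and its fields are hypothetical; the real struct is
// unexported and its layout is not documented.
type clientConfig struct {
	sessionPoolSize    int
	maxInflightIngests int
	fastPathThreshold  int // bytes
}

// ClientOption mutates the config during construction.
type ClientOption func(*clientConfig)

func WithSessionPoolSize(n int) ClientOption {
	return func(c *clientConfig) { c.sessionPoolSize = n }
}

func WithMaxInflightIngests(n int) ClientOption {
	return func(c *clientConfig) { c.maxInflightIngests = n }
}

func WithFastPathThreshold(bytes int) ClientOption {
	return func(c *clientConfig) { c.fastPathThreshold = bytes }
}

// newConfig starts from the documented defaults and applies each option in
// order, so later options win.
func newConfig(opts ...ClientOption) clientConfig {
	cfg := clientConfig{
		sessionPoolSize:    8,
		maxInflightIngests: 4,
		fastPathThreshold:  128 << 20, // 128 MiB
	}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := newConfig(WithSessionPoolSize(16), WithMaxInflightIngests(2))
	fmt.Printf("%+v\n", cfg)
}
```

New options can be added later without changing the constructor's signature, which is the extensibility the ClientOption description refers to.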

type Compression

type Compression int

Compression selects the codec used by the fast-path partition writer.

const (
	CompressionZSTD Compression = iota
	CompressionSnappy
	CompressionGzip
	CompressionUncompressed
)

type ConflictStrategy

type ConflictStrategy int

ConflictStrategy for Insert.

const (
	ErrorOnConflict ConflictStrategy = iota
	IgnoreOnConflict
	UpdateOnConflict
)

type InsertOption

type InsertOption func(*insertConfig)

InsertOption configures a single Insert call.

func OnConflict

func OnConflict(strategy ConflictStrategy) InsertOption

OnConflict sets the behaviour for primary-key collisions.

func ViaGRPC

func ViaGRPC() InsertOption

ViaGRPC forces the direct gRPC ingest path regardless of batch size. Useful when latency matters more than throughput.

func ViaObjectStorage

func ViaObjectStorage() InsertOption

ViaObjectStorage forces the S3-Parquet fast path regardless of batch size. Useful for benchmarks or when you know the batch is large.

func WithIdempotencyKey

func WithIdempotencyKey(key string) InsertOption

WithIdempotencyKey sets the idempotency token. If omitted, a SortableID is generated automatically.

type StagingLister

type StagingLister interface {
	ListStagingPrefixes(ctx context.Context) ([]StagingPrefix, error)
}

StagingLister is the optional backend extension Client.CleanupStaging relies on. Backends that can enumerate their _staging/ namespace implement this; backends that can't leave it unimplemented and CleanupStaging returns lkerrors.ErrNotImplemented.

Kept as a separate interface (not folded into Backend) so the Backend contract stays narrow: the fast-path write path only needs StagingPrefix / StagingLocation / CleanupStaging by-URI, and that's the minimum any backend must implement.

type StagingPrefix

type StagingPrefix struct {
	URI      string
	IngestID string
}

StagingPrefix is one entry a StagingLister backend yields. URI is the absolute backend URI; IngestID is the final path segment, which CleanupStaging parses as a UUIDv7.

func NewStagingPrefix

func NewStagingPrefix(uri, ingestID string) StagingPrefix

NewStagingPrefix builds a StagingPrefix from its absolute backend URI and the trailing ingest_id path segment.
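
Because a UUIDv7 carries its creation time, a janitor can decide whether a staging prefix is past its TTL from the ingest ID alone. The sketch below shows that check under the RFC 9562 layout (first 48 bits are a big-endian Unix timestamp in milliseconds); ingestTime and expired are hypothetical helpers, not the code CleanupStaging runs.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// ingestTime extracts the creation time embedded in a UUIDv7. Per RFC 9562
// the first 48 bits hold a big-endian Unix timestamp in milliseconds.
func ingestTime(id string) (time.Time, error) {
	raw, err := hex.DecodeString(strings.ReplaceAll(id, "-", ""))
	if err != nil || len(raw) != 16 {
		return time.Time{}, fmt.Errorf("ingest id %q is not a UUID", id)
	}
	if raw[6]>>4 != 7 {
		return time.Time{}, fmt.Errorf("ingest id %q is not a version 7 UUID", id)
	}
	var ms int64
	for _, b := range raw[:6] {
		ms = ms<<8 | int64(b)
	}
	return time.UnixMilli(ms).UTC(), nil
}

// expired reports whether the ingest was created more than olderThan
// before now.
func expired(id string, now time.Time, olderThan time.Duration) (bool, error) {
	created, err := ingestTime(id)
	if err != nil {
		return false, err
	}
	return created.Before(now.Add(-olderThan)), nil
}

func main() {
	// Example value from RFC 9562, Appendix A.6.
	const id = "017f22e2-79b0-7cc3-98c4-dc0c0c07398f"
	created, err := ingestTime(id)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	old, _ := expired(id, created.Add(48*time.Hour), 24*time.Hour)
	fmt.Println(created, "expired after 48h with a 24h TTL:", old)
}
```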

Directories

Path Synopsis
Package backends defines the Backend interface — the abstraction lake-orm uses to put bytes somewhere — and exposes the public constructors for every concrete backend in a single, ergonomic surface.
file
Package file provides a local-disk Backend.
gcs
Package gcs provides a GCS Backend.
memory
Package memory provides an in-memory Backend, useful for tests and for the fast-path regression harness.
s3
Package s3 provides an S3-compatible Backend built directly on aws-sdk-go-v2/service/s3.
cmd
lakeorm command
Command lakeorm is the lakeorm CLI.
Package dialects defines the Dialect interface — the data-dialect opinion lake-orm delegates DDL, DML, and write planning to — plus the trio of concrete dialects lake-orm ships.
delta
Package delta is the Delta Lake Dialect.
duckdb
Package duckdb is the lake-orm Dialect for embedded DuckDB.
iceberg
Package iceberg implements lakeorm's default Dialect — Apache Iceberg tables written via Spark Connect.
Package drivers is the contract shared by every lake-orm driver.
databricks
Package databricks is lake-orm's native Databricks driver.
databricksconnect
Package databricksconnect is lake-orm's Databricks Connect driver.
duckdb
Package duckdb is lake-orm's embedded DuckDB driver.
spark
Package spark provides lakeorm's generic Spark Connect driver.
Package errors is the public error surface of lake-orm — sentinel values callers check with errors.Is, typed wrappers callers check with errors.As, and constructors that build them.
examples
arbitrary_spark_query command
Example: arbitrary native Spark query → typed Go struct.
basic command
Example: the hero path for lakeorm.
bulk command
Example: the fast-path write — parquet through object storage with a Spark `INSERT ...
databricks command
Example: lake-orm against a Databricks SQL warehouse using the native `databricks-sql-go` driver.
databricksconnect command
Example: lake-orm against a Databricks cluster via Databricks Connect (Spark Connect over a Databricks workspace).
delta command
Example: writing against a Delta Lake table (instead of Iceberg).
duckdb command
Example: the pure-Go embedded path.
ingest-id command
Example: `_ingest_id` as a system-managed column for batch reconciliation.
joins command
Example: reading joins and aggregates with CQRS-style output types.
lake-tag command
Example: the canonical `lake:"..."` tag.
migrations command
Example: Django-style offline migration authoring.
stream command
Example: streaming a large result back with constant memory.
typed-helpers command
Example: the typed read API — Query[T] / QueryStream[T] / QueryFirst[T] — built on the drivers.Convertible capability and per-driver From* conversion helpers.
validation command
Example: validation at the application boundary via go-playground/validator.
internal
migrations
Package migrations is lakeorm's authoring side of schema evolution.
parquet
Package parquet is lakeorm's parquet-writing layer.
Package local is the lake-k8s docker-compose on-ramp.
Package structs owns the "Go struct is the schema contract" invariant.
Package teste2e is the black-box end-to-end suite for lakeorm.
Package testutils is the shared test harness for lakeorm.
Package types defines the small value types lakeorm exposes on its public API: SortableID (KSUID-backed), ObjectURI/Location (typed storage addresses), and SparkTableName.
