db-toolkit

module
v0.0.0-...-bd113ff Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 2, 2026 License: MIT

README

db-toolkit

Arrow-based data I/O for Go — like io.Reader/io.Writer, but for databases and file formats.

Overview

db-toolkit provides a unified streaming interface for moving data between databases and file formats using Apache Arrow as the in-memory representation. It follows a pull model: readers produce BatchStreams of Arrow records, writers consume them, and dio.Copy wires them together.

    Reader           BatchStream           Writer
┌────────────┐    ┌──────────────┐    ┌────────────┐
│ Source     │───▶│ Schema()     │───▶│ Destination│
│ Schema()   │    │ Next(ctx)    │    │            │
│ Read(ctx)  │    │ Close()      │    │ WriteStream│
└────────────┘    └──────────────┘    └────────────┘

Features:

  • Pull-model streaming — readers produce batches on demand, keeping memory usage bounded
  • Filter pushdown — push predicates down to the source (SQL WHERE clauses, Parquet row-group pruning)
  • Column projection — read only the columns you need
  • Platform registry — database platforms self-register via init(), activated with a blank import

Package Layout

Package Description
dio Core interfaces: Reader, Writer, BatchStream, Filter, Copy
dio/csv CSV reader and writer
dio/parquet Parquet reader and writer with row-group pruning via filter pushdown
dio/gen Synthetic data generation for testing
platform Platform registry (Register, Get, List)
platform/postgres PostgreSQL reader, writer, and filter pushdown

Quick Start

package main

import (
    "context"
    "os"

    "github.com/rnestertsov/db-toolkit/dio"
    "github.com/rnestertsov/db-toolkit/dio/csv"
    "github.com/rnestertsov/db-toolkit/platform"
    "github.com/rnestertsov/db-toolkit/platform/postgres"
)

func main() {
    ctx := context.Background()

    pg, err := platform.Get("postgres")
    if err != nil {
        panic(err)
    }

    conn, err := pg.OpenConnection(ctx, postgres.ConnectionConfig{
        Name: "mydb",
        User: "myuser",
        Host: "localhost",
        Port: "5432",
    })
    if err != nil {
        panic(err)
    }
    defer conn.Close(ctx)

    reader, err := conn.Query(ctx, "SELECT id, name, email FROM users")
    if err != nil {
        panic(err)
    }

    writer := csv.NewWriter(os.Stdout)

    if err := dio.Copy(ctx, reader, writer); err != nil {
        panic(err)
    }
}

Key Concepts

BatchStream

A BatchStream is a pull-based iterator over Arrow record batches. Call Next(ctx) to get the next batch; it returns io.EOF when exhausted. Callers must call Release() on each returned record.

stream, err := reader.Read(ctx)
defer stream.Close()

for {
    rec, err := stream.Next(ctx)
    if err == io.EOF {
        break
    }
    // process rec
    rec.Release()
}
Filter Pushdown

Pass filters as ReadOptions to push predicates to the source. SQL databases translate them into WHERE clauses; Parquet readers use them for row-group pruning.

stream, err := reader.Read(ctx,
    dio.WithFilter([]dio.Filter{
        dio.Eq{Column: 0, Value: 42},
        dio.GtEq{Column: 1, Value: "2024-01-01"},
    }),
)

Readers optionally implement dio.FilterClassifier to report how precisely they handle each filter (FilterExact, FilterInexact, or FilterUnsupported).

Column Projection

Read only the columns you need:

stream, err := reader.Read(ctx,
    dio.WithProjection([]int{0, 2, 5}),
)
Platform Registry

Platforms register themselves at import time. Activate a platform with a blank import, then retrieve it by name:

import _ "github.com/rnestertsov/db-toolkit/platform/postgres"

pg, err := platform.Get("postgres")

Supported Platforms

Platform Package Status
PostgreSQL platform/postgres Supported

Adding a new platform? See docs/adding-a-platform.md.

Development

Prerequisites
  • Go 1.24+
  • Docker (for integration tests using testcontainers)
Build
make build
Test
make test

Integration tests spin up real database instances via testcontainers, so Docker must be running.

License

This project is licensed under the MIT License — see the LICENSE file for details.

Directories

Path Synopsis
dio
Package dio provides a simple and easy to use API for reading and writing records from/to different data sources.
Package dio provides a simple and easy to use API for reading and writing records from/to different data sources.
csv
gen
parquet
ABOUTME: Row group pruning logic for Parquet filter pushdown.
ABOUTME: Row group pruning logic for Parquet filter pushdown.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL