# db-toolkit

Arrow-based data I/O for Go — like io.Reader/io.Writer, but for databases and file formats.
## Overview
db-toolkit provides a unified streaming interface for moving data between databases and file formats using Apache Arrow as the in-memory representation. It follows a pull model: readers produce BatchStreams of Arrow records, writers consume them, and dio.Copy wires them together.
```
    Reader             BatchStream            Writer
┌────────────┐      ┌──────────────┐      ┌────────────┐
│ Source     │      │ Schema()     │      │ Destination│
│ Schema()   │─────▶│ Next(ctx)    │─────▶│            │
│ Read(ctx)  │      │ Close()      │      │ WriteStream│
└────────────┘      └──────────────┘      └────────────┘
```
Features:
- Pull-model streaming — readers produce batches on demand, keeping memory usage bounded
- Filter pushdown — push predicates down to the source (SQL WHERE clauses, Parquet row-group pruning)
- Column projection — read only the columns you need
- Platform registry — database platforms self-register via init(), activated with a blank import
## Package Layout
| Package | Description |
|---------|-------------|
| `dio` | Core interfaces: Reader, Writer, BatchStream, Filter, Copy |
| `dio/csv` | CSV reader and writer |
| `dio/parquet` | Parquet reader and writer with row-group pruning via filter pushdown |
| `dio/gen` | Synthetic data generation for testing |
| `platform` | Platform registry (Register, Get, List) |
| `platform/postgres` | PostgreSQL reader, writer, and filter pushdown |
## Quick Start
```go
package main

import (
	"context"
	"os"

	"github.com/rnestertsov/db-toolkit/dio"
	"github.com/rnestertsov/db-toolkit/dio/csv"
	"github.com/rnestertsov/db-toolkit/platform"
	"github.com/rnestertsov/db-toolkit/platform/postgres"
)

func main() {
	ctx := context.Background()

	pg, err := platform.Get("postgres")
	if err != nil {
		panic(err)
	}

	conn, err := pg.OpenConnection(ctx, postgres.ConnectionConfig{
		Name: "mydb",
		User: "myuser",
		Host: "localhost",
		Port: "5432",
	})
	if err != nil {
		panic(err)
	}
	defer conn.Close(ctx)

	reader, err := conn.Query(ctx, "SELECT id, name, email FROM users")
	if err != nil {
		panic(err)
	}

	writer := csv.NewWriter(os.Stdout)
	if err := dio.Copy(ctx, reader, writer); err != nil {
		panic(err)
	}
}
```
## Key Concepts

### BatchStream
A BatchStream is a pull-based iterator over Arrow record batches. Call Next(ctx) to get the next batch; it returns io.EOF when exhausted. Callers must call Release() on each returned record.
```go
stream, err := reader.Read(ctx)
if err != nil {
	return err
}
defer stream.Close()

for {
	rec, err := stream.Next(ctx)
	if errors.Is(err, io.EOF) {
		break
	}
	if err != nil {
		return err
	}
	// process rec
	rec.Release()
}
```
### Filter Pushdown
Pass filters as ReadOptions to push predicates to the source. SQL databases translate them into WHERE clauses; Parquet readers use them for row-group pruning.
```go
stream, err := reader.Read(ctx,
	dio.WithFilter([]dio.Filter{
		dio.Eq{Column: 0, Value: 42},
		dio.GtEq{Column: 1, Value: "2024-01-01"},
	}),
)
```
Readers optionally implement dio.FilterClassifier to report how precisely they handle each filter (FilterExact, FilterInexact, or FilterUnsupported).
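To make the three precision levels concrete, here is a self-contained sketch of the kind of classification a Parquet-style reader might report. The `Classification` constants and the `Eq`/`Regex` filter types below are illustrative stand-ins, not the actual dio API:

```go
package main

import "fmt"

// Classification mirrors the three precision levels described above.
type Classification int

const (
	FilterExact       Classification = iota // source applies the filter fully
	FilterInexact                           // source prunes, but false positives remain
	FilterUnsupported                       // source cannot apply the filter
)

type Filter interface{ isFilter() }

// Eq is an equality predicate on a column (illustrative).
type Eq struct {
	Column int
	Value  any
}

func (Eq) isFilter() {}

// Regex is a pattern predicate that typically cannot be pushed down (illustrative).
type Regex struct {
	Column  int
	Pattern string
}

func (Regex) isFilter() {}

// classify sketches a Parquet-style report: equality can prune row groups
// via min/max statistics but may keep false positives (inexact), while a
// regex match cannot be pushed down at all.
func classify(f Filter) Classification {
	switch f.(type) {
	case Eq:
		return FilterInexact
	case Regex:
		return FilterUnsupported
	default:
		return FilterUnsupported
	}
}

func main() {
	fmt.Println(classify(Eq{Column: 0, Value: 42}) == FilterInexact)        // true
	fmt.Println(classify(Regex{Column: 1, Pattern: "^a"}) == FilterUnsupported) // true
}
```

An inexact classification tells the caller it must still re-apply the predicate to the returned batches; an exact one lets it skip that second pass.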
### Column Projection
Read only the columns you need:
```go
stream, err := reader.Read(ctx,
	dio.WithProjection([]int{0, 2, 5}),
)
```
### Platform Registry

Platforms register themselves at import time. Activate a platform with a blank import, then retrieve it by name:
```go
import _ "github.com/rnestertsov/db-toolkit/platform/postgres"

pg, err := platform.Get("postgres")
```
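The registry itself follows the same pattern as database/sql driver registration. A minimal self-contained sketch, assuming a simplified `Platform` interface (the real platform package's types may differ):

```go
package main

import (
	"fmt"
	"sync"
)

// Platform is a simplified stand-in for the registry's platform interface.
type Platform interface {
	Name() string
}

var (
	mu        sync.RWMutex
	platforms = map[string]Platform{}
)

// Register is called from a driver package's init(), so a blank
// import is enough to make the platform available by name.
func Register(p Platform) {
	mu.Lock()
	defer mu.Unlock()
	platforms[p.Name()] = p
}

// Get looks up a previously registered platform by name.
func Get(name string) (Platform, error) {
	mu.RLock()
	defer mu.RUnlock()
	p, ok := platforms[name]
	if !ok {
		return nil, fmt.Errorf("platform %q not registered", name)
	}
	return p, nil
}

// fakePostgres plays the role of a driver package; in the real library
// the Register call would live in that package's init().
type fakePostgres struct{}

func (fakePostgres) Name() string { return "postgres" }

func init() { Register(fakePostgres{}) }

func main() {
	p, err := Get("postgres")
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Name()) // postgres
}
```

Keeping the registry behind a mutex makes Register safe even if multiple driver packages initialize concurrently.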
| Platform | Package | Status |
|----------|---------|--------|
| PostgreSQL | `platform/postgres` | Supported |
Adding a new platform? See docs/adding-a-platform.md.
## Development

### Prerequisites
- Go 1.24+
- Docker (for integration tests using testcontainers)
### Build

```sh
make build
```
### Test

```sh
make test
```
Integration tests spin up real database instances via testcontainers, so Docker must be running.
## License
This project is licensed under the MIT License — see the LICENSE file for details.