treedb package v0.1.0

Published: Jan 23, 2026 License: MIT Imports: 15 Imported by: 0

README
TreeDB

TreeDB is a high-performance, persistent key-value store optimized for the Cosmos SDK workload. It features a B+Tree index backed by a memory-mapped file (index.db) and a value log (data-*.slab) for storing large values efficiently.

Features

  • ACID Transactions: Atomic commits using Copy-On-Write (COW) and redundant superblocks (Meta Pages).
  • Snapshot Isolation: Lock-free concurrent readers using Multi-Version Concurrency Control (MVCC) and Reference Counting.
  • Hybrid Storage:
    • Index: Memory-mapped B+Tree for keys and small values.
    • Slabs: Append-only log for large values (Contract code, blobs) to reduce write amplification and memory pressure.
  • Compaction: Background compaction mechanism ("Ghost Copy") to reclaim space from dead records in slabs.
  • Crash Recovery: Automatic recovery from torn writes using strict write-ordering and checksum verification.
  • Lifecycle Management: Safe page reclamation using a Graveyard and Reader Registry to protect active snapshots.

Architecture

Storage Layout
  • Pages: 4KB fixed-size blocks in index.db.
  • Nodes: Slotted pages supporting variable-length keys.
  • Slabs: Append-only files (data-0000.slab) storing value records with CRC checksums.
Write Path ("The Zipper")

Writes are batched and applied using a recursive "Zipper" merge algorithm. This creates a new version of the tree path (COW) without modifying existing on-disk pages, ensuring crash safety and snapshot isolation.

Read Path

Readers acquire a Snapshot which pins the version of the tree and the active slab files. This guarantees a consistent view of the database even while writers are committing new versions.

Usage

package main

import (
	"errors"
	"fmt"
	"log"

	treedb "github.com/snissn/gomap/TreeDB"
)

func main() {
	// Open the database (recommended: cached wrapper)
	opts := treedb.Options{Dir: "./my-db-data"}
	database, err := treedb.Open(opts)
	if err != nil {
		if errors.Is(err, treedb.ErrLocked) {
			log.Fatal("database is already open in another process")
		}
		log.Fatal(err)
	}
	defer database.Close()

	// Set a key-value pair
	if err := database.Set([]byte("key1"), []byte("value1")); err != nil {
		log.Fatal(err)
	}

	// Get a value
	val, err := database.Get([]byte("key1"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Got: %s\n", val)

	// Iterate
	it, err := database.Iterator(nil, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer it.Close()
	for ; it.Valid(); it.Next() {
		fmt.Printf("%s = %s\n", it.Key(), it.Value())
	}
	
	// Atomic Batch
	batch := database.NewBatch()
	defer batch.Close()
	batch.Set([]byte("k2"), []byte("v2"))
	batch.Delete([]byte("key1"))
	if err := batch.Write(); err != nil { // atomic commit
		log.Fatal(err)
	}
}

Profiles (Durable / Fast / Bench)

If you want a simple, documented “bundle” of options, start with a profile and then override a few workload-specific knobs:

opts := treedb.OptionsFor(treedb.ProfileDurable, "./my-db-data")
opts.FlushThreshold = 128 << 20 // optional tuning
db, err := treedb.Open(opts)

Profiles are intended to make intent explicit:

  • ProfileDurable: safest defaults (recommended).
  • ProfileFast: relax durability/integrity knobs for throughput.
  • ProfileBench: deterministic benchmarking profile (not production).

Unsafe profiles require an explicit acknowledgement:

opts := treedb.OptionsFor(treedb.ProfileFast, "./my-db-data")
opts.AllowUnsafe = true
db, err := treedb.Open(opts)

Details: docs/TREEDB_PROFILES.md.

Durability & Safety Notes

  • Safe defaults keep WAL, fsync, and read checksums enabled; unsafe toggles require AllowUnsafe.
  • With RelaxedSync enabled, SetSync/WriteSync are crash-consistent only (no fsync) and may not survive power loss.
  • Page checksums are verified once and cached until the page is rewritten; use VerifyOnRead for paranoid always-verify behavior. DisableReadChecksum disables slab/value-log CRC checks entirely.
  • CRC checksums detect accidental corruption, not malicious tampering; use filesystem encryption/HMAC if your threat model includes adversarial disk access.
  • GetUnsafe on a Snapshot and iterator Key()/Value() return short-lived views; use Get, KeyCopy, or ValueCopy for stable bytes.
  • TreeDB does not provide encryption-at-rest or secure deletion; deleted data may remain on disk until compacted. Use OS/disk encryption for confidentiality.
  • Value-log segments are retained conservatively; large values can keep wal/ growth high until value-log GC is implemented.
  • Optional guardrail: MaxValueLogRetainedBytes emits a warning when retained value-log bytes exceed the threshold.
  • Optional hard cap: MaxValueLogRetainedBytesHard disables value-log pointers for new large values once retained bytes exceed the threshold.
  • On-disk format is considered alpha and may change without backward-compatibility guarantees.
Durability Matrix (Cached Mode)

| Options                  | WAL | Sync boundary      | Power-loss durability | Notes                         |
|--------------------------|-----|--------------------|-----------------------|-------------------------------|
| Defaults                 | on  | fsync              | yes                   | safest default                |
| RelaxedSync              | on  | flush-only         | no                    | crash-consistent only         |
| DisableWAL               | off | backend checkpoint | yes (if not relaxed)  | durable only after checkpoint |
| DisableWAL + RelaxedSync | off | flush-only         | no                    | fastest, least safe           |

Tuning (Cached Mode)

treedb.Open defaults to cached mode (memtable + WAL + background flush). The most important knobs:

  • Options.FlushThreshold + Options.MaxQueuedMemtables (throughput vs. backlog/memory)
  • Adaptive backpressure: SlowdownBacklogSeconds, StopBacklogSeconds, MaxBacklogBytes
  • Cached-mode auto checkpointing: BackgroundCheckpointInterval, BackgroundCheckpointIdleDuration, MaxWALBytes
  • Background pruning: PruneInterval, PruneMaxPages, PruneMaxDuration
  • Background index vacuum: BackgroundIndexVacuumInterval, BackgroundIndexVacuumSpanRatioPPM
  • Optional background slab compaction: BackgroundCompactionInterval + related knobs
  • Optional flush build parallelism: FlushBuildConcurrency
  • Offline index vacuum (backend index): treedb.VacuumIndexOffline(opts) (requires the DB to be closed)

Details: docs/TREEDB_TUNING.md.

Exclusive Open (Process Lock)

TreeDB acquires an exclusive lock on Options.Dir. If another process has the database open, treedb.Open/treedb.OpenBackend returns treedb.ErrLocked.

Testing

TreeDB includes a comprehensive test suite covering unit functionality, integration, fuzzing, and crash recovery.

Unit & Integration Tests
go test -v ./...
Fuzz Testing

Model-based fuzzing verifies consistency against a simple in-memory map model.

go test -v ./db/fuzz_test.go
Crash Simulation

The verify_crash.sh script compiles a stress tool, runs it, kills it (kill -9), and verifies database integrity upon restart.

./verify_crash.sh

License

MIT

Documentation

Constants

const (
	// ModeCached opens TreeDB with the write-back caching layer enabled.
	ModeCached = db.ModeCached
	// ModeBackend opens TreeDB in backend-only mode (no caching layer).
	ModeBackend = db.ModeBackend
)

Variables

var (
	// ErrLocked indicates the database directory is already opened by another process.
	ErrLocked = db.ErrLocked
	// ErrUnsafeOptions indicates unsafe durability/integrity options were set without acknowledgement.
	ErrUnsafeOptions = db.ErrUnsafeOptions
	// ErrMemtableFull indicates the cached memtable has reached its hard cap.
	ErrMemtableFull = caching.ErrMemtableFull
	// ErrMemtableValueLogPointers indicates memtable value-log pointers require WAL/value-log enabled.
	ErrMemtableValueLogPointers = caching.ErrMemtableValueLogPointers

	// ErrClosed indicates the DB handle has been closed.
	ErrClosed = errors.New("treedb: db is closed")

	// ErrKeyNotFound indicates the key does not exist.
	ErrKeyNotFound = tree.ErrKeyNotFound
)

Functions

func ApplyProfile

func ApplyProfile(opts *Options, profile Profile)

ApplyProfile applies a profile to opts without overwriting explicit caller overrides.

For numeric/duration fields, "explicit override" means the field is already set to a non-zero value. For background intervals, note TreeDB conventions:

  • `0` means "use default"
  • `<0` means "disable"

For booleans, Go does not provide a way to distinguish “unset” from “explicit false”, so profiles set boolean policy knobs to match the profile. If you want the opposite policy, apply the profile and then override the boolean explicitly.

func VacuumIndexOffline

func VacuumIndexOffline(opts Options) error

VacuumIndexOffline rewrites `index.db` into a fresh file and swaps it in. This is intended to reclaim space and restore locality after long churn.

It is an offline operation: it acquires the exclusive open lock for opts.Dir.

Types

type Batch

type Batch interface {
	Set(key, value []byte) error
	Delete(key []byte) error
	Write() error
	WriteSync() error
	Close() error
	Replay(func(batch.Entry) error) error
	GetByteSize() (int, error)
}

Batch is the public batch contract returned by TreeDB. Both cached and backend implementations satisfy it.

Example
package main

import (
	"fmt"
	"os"

	treedb "github.com/snissn/gomap/TreeDB"
)

func main() {
	dir, err := os.MkdirTemp("", "treedb-batch-example-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	db, err := treedb.Open(treedb.Options{Dir: dir})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	batch := db.NewBatch()
	if batch == nil {
		panic("batch creation failed")
	}

	batch.Set([]byte("key1"), []byte("value1"))
	batch.Set([]byte("key2"), []byte("value2"))
	batch.Delete([]byte("key1"))

	// WriteSync ensures durability.
	if err := batch.WriteSync(); err != nil {
		panic(err)
	}

	val, _ := db.Get([]byte("key2"))
	fmt.Println("key2:", string(val))
	val, _ = db.Get([]byte("key1"))
	fmt.Println("key1:", val)

}
Output:

key2: value2
key1: []

type DB

type DB struct {
	// contains filtered or unexported fields
}

DB is the public TreeDB handle. It can represent either cached mode (default) or backend-only mode depending on Options.

func Open

func Open(opts Options) (*DB, error)

Open opens TreeDB. By default it enables caching (write-back layer). To open the backend-only engine, set opts.Mode = ModeBackend.

Example
package main

import (
	"fmt"
	"os"

	treedb "github.com/snissn/gomap/TreeDB"
)

func main() {
	dir, err := os.MkdirTemp("", "treedb-example-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	db, err := treedb.Open(treedb.Options{Dir: dir, ChunkSize: 64 * 1024})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if err := db.SetSync([]byte("k"), []byte("v")); err != nil {
		panic(err)
	}

	val, err := db.Get([]byte("k"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(val))

}
Output:

v
Example (BackendMode)
package main

import (
	"fmt"
	"os"

	treedb "github.com/snissn/gomap/TreeDB"
)

func main() {
	dir, err := os.MkdirTemp("", "treedb-backend-example-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	opts := treedb.Options{Dir: dir, ChunkSize: 64 * 1024, Mode: treedb.ModeBackend}
	db, err := treedb.Open(opts)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if err := db.SetSync([]byte("k"), []byte("v")); err != nil {
		panic(err)
	}

	val, err := db.Get([]byte("k"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(val))

}
Output:

v

func OpenBackend

func OpenBackend(opts Options) (*DB, error)

OpenBackend opens TreeDB in backend-only mode (no caching).

func OpenCached

func OpenCached(opts Options) (*DB, error)

OpenCached is an explicit cached-mode opener (alias of Open with ModeCached).

func (*DB) AcquireSnapshot

func (db *DB) AcquireSnapshot() *Snapshot

AcquireSnapshot returns a new snapshot.

func (*DB) Checkpoint

func (db *DB) Checkpoint() error

Checkpoint forces a durable backend boundary and trims cached-mode WAL segments, so long-running cached-mode workloads do not accumulate unbounded `wal/` growth.

In cached mode this flushes queued memtables with backend sync and resets the WAL to a fresh segment. In backend mode it forces a sync boundary.

func (*DB) Close

func (db *DB) Close() error

Close closes the DB.

func (*DB) CompactCandidates

func (db *DB) CompactCandidates(opts compaction.Options) error

CompactCandidates runs slab compaction based on the provided selection options. In cached mode it will also perform bounded flush assist when the caching layer is under backpressure, so compaction does not starve the foreground flush path.

func (*DB) CompactIndex

func (db *DB) CompactIndex() error

CompactIndex performs an in-place index vacuum (bulk rebuild) on the backend. In cached mode it first drains the caching layer so the backend reflects all buffered writes before rebuilding.

func (*DB) Delete

func (db *DB) Delete(key []byte) error

Delete removes a key without forcing an fsync boundary.

func (*DB) DeleteRange

func (db *DB) DeleteRange(start, end []byte) error

DeleteRange removes all keys in the range [start, end).

This is primarily used by benchmark suites and maintenance tooling. In cached mode, it may use fast paths that avoid per-key tombstones when safe.

func (*DB) DeleteSync

func (db *DB) DeleteSync(key []byte) error

DeleteSync removes a key and forces a durability boundary.

func (*DB) DurabilityMode

func (db *DB) DurabilityMode() string

DurabilityMode reports the effective durability/integrity policy string.

func (*DB) FragmentationReport

func (db *DB) FragmentationReport() (map[string]string, error)

FragmentationReport returns best-effort structural stats about the on-disk user index that help diagnose scan regressions after churn.

Note: In cached mode this reflects the backend state only; queued memtables are not included unless the caller has explicitly drained the cache (e.g. via close+reopen or a maintenance operation that drains).

func (*DB) Get

func (db *DB) Get(key []byte) ([]byte, error)

Get returns the value for a key.

Semantics: Returns a safe copy of the value.

func (*DB) GetAppend

func (db *DB) GetAppend(key, dst []byte) ([]byte, error)

GetAppend appends the value for the key to dst and returns the new slice. It avoids internal allocations by using the provided buffer. If the key is not found, it returns dst and ErrKeyNotFound.

func (*DB) GetUnsafe

func (db *DB) GetUnsafe(key []byte) ([]byte, error)

GetUnsafe returns the value for a key.

Semantics: Returns a safe copy of the value. For zero-copy views tied to a snapshot lifetime, use AcquireSnapshot().GetUnsafe.

func (*DB) Has

func (db *DB) Has(key []byte) (bool, error)

Has reports whether a key exists in the database.

func (*DB) Iterator

func (db *DB) Iterator(start, end []byte) (Iterator, error)

Iterator returns a forward iterator over the range [start, end).

func (*DB) NewBatch

func (db *DB) NewBatch() Batch

NewBatch creates a new batch for buffered writes.

func (*DB) NewBatchWithSize

func (db *DB) NewBatchWithSize(size int) Batch

NewBatchWithSize creates a new batch with a hint for the expected entry size.

func (*DB) Print

func (db *DB) Print() error

Print dumps best-effort debug output for the underlying backend.

func (*DB) ReverseIterator

func (db *DB) ReverseIterator(start, end []byte) (Iterator, error)

ReverseIterator returns a reverse iterator over the range [start, end).

func (*DB) Set

func (db *DB) Set(key, value []byte) error

Set writes a key/value pair without forcing an fsync boundary.

func (*DB) SetSync

func (db *DB) SetSync(key, value []byte) error

SetSync writes a key/value pair and forces a durability boundary. With Options.RelaxedSync enabled, Sync operations are crash-consistent only (no fsync) and may not survive power loss.

func (*DB) Stats

func (db *DB) Stats() map[string]string

Stats returns diagnostic stats for the active backend and cached layer.

func (*DB) VacuumIndexOnline

func (db *DB) VacuumIndexOnline(ctx context.Context) error

VacuumIndexOnline rebuilds the user index into a new file and swaps it in with a short writer pause. Disk space from the old index is reclaimed once any old snapshots/iterators drain.

type Iterator

type Iterator interface {
	Valid() bool
	Next()
	Key() []byte
	Value() []byte
	KeyCopy(dst []byte) []byte
	ValueCopy(dst []byte) []byte
	Close() error
	Error() error
}

Iterator is the public iterator contract returned by TreeDB.

Semantics (performance-first; callers must treat slices as read-only):

  • Key() and Value() return views valid until the next Next()/Close().
  • Use KeyCopy/ValueCopy if you need stable bytes.
Example
package main

import (
	"fmt"
	"os"

	treedb "github.com/snissn/gomap/TreeDB"
)

func main() {
	dir, err := os.MkdirTemp("", "treedb-iter-example-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	db, err := treedb.Open(treedb.Options{Dir: dir})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Insert keys in random order
	db.SetSync([]byte("b"), []byte("2"))
	db.SetSync([]byte("a"), []byte("1"))
	db.SetSync([]byte("c"), []byte("3"))

	// Iterate over all keys
	it, err := db.Iterator(nil, nil)
	if err != nil {
		panic(err)
	}
	defer it.Close()

	for ; it.Valid(); it.Next() {
		fmt.Printf("%s: %s\n", it.Key(), it.Value())
	}

}
Output:

a: 1
b: 2
c: 3

type Mode

type Mode = db.Mode

Mode selects which TreeDB implementation to open.

type Options

type Options = db.Options

Options configures TreeDB. It is re-exported from TreeDB/db for convenience.

func OptionsFor

func OptionsFor(profile Profile, dir string) Options

OptionsFor returns a copy of Options pre-filled for the given Profile.

The returned Options still follow TreeDB's normal defaulting rules for fields left as zero values (e.g. ChunkSize, KeepRecent, backpressure thresholds).

type Profile

type Profile string

Profile is a documented, high-level preset for TreeDB Options.

Why profiles exist

TreeDB exposes many low-level knobs because different workloads want different trade-offs (durability vs. throughput, steady-state vs. benchmark determinism, background maintenance vs. predictable latency).

In practice, most callers want one of a small number of "bundles":

  • "Durable": the safest defaults; favors crash-recovery and integrity.
  • "Fast": higher throughput by relaxing durability/integrity knobs.
  • "Bench": a deterministic variant intended for benchmarking; disables background workers that can otherwise inject "random" work (e.g. index vacuum firing mid-run).

Profiles are intentionally conservative:

  • They set the *meaningful policy knobs* (durability/integrity/background), while leaving the many throughput/capacity tuning knobs at their usual defaults.
  • They do not require TreeDB internals to infer "intent" from combinations of flags. You pick the intent explicitly.

How to use

  1. New DB (recommended):

     opts := treedb.OptionsFor(treedb.ProfileDurable, "/path/to/db")
     opts.FlushThreshold = 128 << 20 // optional tuning
     db, err := treedb.Open(opts)

  2. Existing Options (merge without clobbering explicit overrides):

     opts := treedb.Options{Dir: "/path/to/db"}
     treedb.ApplyProfile(&opts, treedb.ProfileBench)
     opts.FlushThreshold = 64 << 20 // explicit overrides always win
     db, err := treedb.Open(opts)

Note: Profiles are a convenience API. TreeDB still honors the established "0 uses a default; <0 disables" convention for background knobs, and callers can always override any field directly after applying a profile.

const (
	// ProfileDurable is the recommended default for production use when you care
	// about durability and corruption detection.
	//
	// It keeps WAL and checksums enabled and leaves background maintenance at
	// their default settings.
	ProfileDurable Profile = "durable"

	// ProfileFast prioritizes throughput by relaxing durability/integrity knobs.
	//
	// This profile is appropriate when:
	//   - you are running on top of an external durability boundary (e.g. you
	//     snapshot at higher layers), or
	//   - you are exploring performance limits, and crashes/corruption detection
	//     are acceptable trade-offs.
	//
	// Background maintenance is left enabled by default; it is generally helpful
	// for keeping the index compact and read-friendly.
	ProfileFast Profile = "fast"

	// ProfileBench is a "fast + deterministic" profile intended specifically for
	// benchmarking.
	//
	// It disables background workers that can inject heavy work mid-run (e.g.
	// background index vacuum). This makes comparisons more stable across runs.
	//
	// IMPORTANT: This is not a recommended production profile.
	ProfileBench Profile = "bench"
)

type Snapshot

type Snapshot = db.Snapshot

Snapshot is a consistent point-in-time view of the database.

Directories

Path Synopsis
cmd
db_histogram command
debug_open command
stress command
treemap command
verify command
internal
crc
wal
