opaque

package module
v0.1.0 Latest
Published: Mar 16, 2026 License: MIT Imports: 18 Imported by: 0

README

Opaque

Privacy-preserving vector search using homomorphic encryption.

Search encrypted vectors without revealing your query. The server computes on encrypted data and never sees what you're searching for.

Install

go get github.com/Prasad-178/opaque

Quick Start

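package main

import (
	"context"
	"fmt"
	"log"

	"github.com/Prasad-178/opaque"
)

func main() {
	db, err := opaque.NewDB(opaque.Config{Dimension: 128, NumClusters: 16})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()

	// vector1, vector2, and queryVector stand for your own []float64 slices
	// of length 128 (the configured Dimension).
	if err := db.Add(ctx, "doc-1", vector1); err != nil {
		log.Fatal(err)
	}
	if err := db.Add(ctx, "doc-2", vector2); err != nil {
		log.Fatal(err)
	}

	// k-means clustering + HE engine init
	if err := db.Build(ctx); err != nil {
		log.Fatal(err)
	}

	results, err := db.Search(ctx, queryVector, 10)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range results {
		fmt.Printf("  %s: %.4f\n", r.ID, r.Score)
	}
}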

Features

  • Homomorphic encryption — queries are encrypted with CKKS; the server scores centroids without seeing the query
  • AES-256-GCM — vectors encrypted at rest, decrypted only client-side
  • Decoy requests — real bucket fetches are mixed with fake ones to hide access patterns
  • Metadata & filtered search — attach key-value metadata to vectors, filter at search time
  • CRUD operations — Add, Update, Delete vectors with soft-delete and compaction via Rebuild
  • Persistence — Save/Load database state to disk
  • File-backed storage — memory or file-backed blob store for large datasets
  • Progress callbacks: OnBuildProgress hook for observability during index builds
  • Batch operations — AddBatch, AddBatchWithMetadata for bulk ingestion

Configuration

Field                 Default          Description
Dimension             (required)       Vector dimension
NumClusters           64               K-means clusters. More = faster search, less privacy
TopClusters           NumClusters/2    Clusters probed per search. More = better recall
NumDecoys             8                Decoy clusters for access pattern hiding
WorkerPoolSize        min(NumCPU, 8)   Parallel HE engines (~50MB each)
Storage               Memory           opaque.Memory or opaque.File
StoragePath           ""               Directory for file storage
ProbeThreshold        0.95             Multi-probe inclusion threshold
ProbeStrategy         "threshold"      "threshold" or "gap" (adaptive score-gap detection)
GapMultiplier         2.0              Gap sensitivity for "gap" strategy
RedundantAssignments  1                Clusters per vector (2 = better boundary recall, 2x storage)
NumKMeansInit         1                K-means initializations (higher = better centroids)
NormalizedStorage     true             Pre-normalize vectors for faster search

Examples

Example       Description
basic         Minimal workflow: create, add, build, search
persistence   Save and load database state
metadata      Attach metadata and use filtered search
large-scale   10K+ vectors with batch operations
file-storage  File-backed blob store for large datasets
http-server   HTTP API wrapping Opaque for self-hosted deployment

Run any example:

go run ./examples/basic/

Performance

Benchmarked on Apple M4 Pro, 100K 128-dimensional vectors, 64 clusters.

Metric     Standard (64 HE ops)  Batch (1 HE op)
Recall@10  96.0%                 96.0%
Latency    2.56s                 190ms

SIFT10K (real dataset): 95% Recall@1, 96% Recall@10 scanning 14.9% of data.

See BENCHMARKS.md for full details.

Architecture

Opaque uses a three-level privacy pipeline:

  1. HE centroid scoring — server scores encrypted query against all centroids, can't see query or results
  2. Decoy-based fetch — client requests real + fake buckets, server can't tell them apart
  3. Local AES decrypt + rank — all final scoring happens client-side
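
The decoy-based fetch in step 2 can be illustrated with a small standalone sketch. It is not Opaque's implementation: fetchPlan is a hypothetical helper, cluster IDs are assumed to be the integers 0..numClusters-1, and decoys are assumed to be drawn uniformly from the clusters that are not already being fetched.

```go
package main

import (
	"fmt"
	"math/rand"
)

// fetchPlan mixes the real cluster IDs with up to numDecoys distinct decoy IDs
// drawn from the remaining clusters, then shuffles the result so that request
// order does not reveal which IDs are real.
// math/rand keeps the sketch short; a production system would likely want a
// cryptographically secure source instead.
func fetchPlan(numClusters int, real []int, numDecoys int) []int {
	isReal := make(map[int]bool, len(real))
	for _, id := range real {
		isReal[id] = true
	}

	// Every cluster that is not already being fetched is a decoy candidate.
	var candidates []int
	for id := 0; id < numClusters; id++ {
		if !isReal[id] {
			candidates = append(candidates, id)
		}
	}
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if numDecoys < 0 {
		numDecoys = 0
	}
	if numDecoys > len(candidates) {
		numDecoys = len(candidates)
	}

	// Real IDs first, decoys appended, then one more shuffle over the whole plan.
	plan := append(append([]int{}, real...), candidates[:numDecoys]...)
	rand.Shuffle(len(plan), func(i, j int) {
		plan[i], plan[j] = plan[j], plan[i]
	})
	return plan
}

func main() {
	// 4 real clusters plus 8 decoys out of 64 clusters: 12 fetches in total.
	plan := fetchPlan(64, []int{3, 17, 42, 50}, 8)
	fmt.Println(len(plan), plan)
}
```

With 4 real clusters and 8 decoys, the server sees 12 bucket requests. How much this hides depends on how decoys are chosen across repeated queries, which this sketch does not model.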

See docs/ARCHITECTURE.md for the full system design, threat model, and crypto details.

Self-Hosting

Docker (gRPC search service)
docker build -t opaque .
docker run -p 50051:50051 opaque
HTTP API (example)
go run ./examples/http-server/

# Add vectors
curl -X POST localhost:8080/vectors -d '{"vectors":[{"id":"v1","values":[0.1,0.2,...]}]}'

# Build index
curl -X POST localhost:8080/admin/build

# Search
curl -X POST localhost:8080/search -d '{"vector":[0.1,0.2,...],"top_k":5}'

Development

make test-fast    # go test -short ./...
make test         # full test suite
make lint         # go vet ./...
make test-bench   # crypto/LSH micro-benchmarks
make test-sift    # SIFT10K accuracy
make test-100k    # 100K vector benchmark

License

MIT

Documentation

Overview

Package opaque provides privacy-preserving vector search using homomorphic encryption.

Opaque encrypts vectors with AES-256-GCM, scores queries against cluster centroids using CKKS homomorphic encryption (so the server never sees the query), and hides access patterns with decoy bucket fetches.
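
To make the at-rest encryption step concrete, here is a minimal sketch of sealing and opening a vector with AES-256-GCM using only the Go standard library. The helper names sealVector and openVector, the little-endian float64 encoding, and the nonce-prefixed layout are assumptions chosen for illustration; Opaque's actual blob format may differ.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// newGCM builds an AES-256-GCM AEAD; the key must be exactly 32 bytes.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealVector encodes vec as little-endian float64 values and encrypts it.
// A fresh random nonce is prepended to the returned ciphertext.
func sealVector(key []byte, vec []float64) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	plain := make([]byte, 8*len(vec))
	for i, v := range vec {
		binary.LittleEndian.PutUint64(plain[8*i:], math.Float64bits(v))
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plain, nil), nil
}

// openVector reverses sealVector and fails if the ciphertext was modified.
func openVector(key, sealed []byte) ([]float64, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	plain, err := gcm.Open(nil, nonce, ct, nil)
	if err != nil {
		return nil, err
	}
	if len(plain)%8 != 0 {
		return nil, errors.New("plaintext is not a whole number of float64 values")
	}
	vec := make([]float64, len(plain)/8)
	for i := range vec {
		vec[i] = math.Float64frombits(binary.LittleEndian.Uint64(plain[8*i:]))
	}
	return vec, nil
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := sealVector(key, []float64{0.5, -1.25, 3})
	if err != nil {
		panic(err)
	}
	vec, err := openVector(key, sealed)
	fmt.Println(vec, err)
}
```

Because GCM is authenticated, a storage server that flips bits in a blob causes decryption to fail rather than silently return a corrupted vector.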

Quick Start

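ctx := context.Background()

db, err := opaque.NewDB(opaque.Config{
    Dimension:   128,
    NumClusters: 64,
})
if err != nil {
    log.Fatal(err)
}
defer db.Close()

// Add vectors (vector1 and vector2 are []float64 slices of length 128)
if err := db.Add(ctx, "doc-1", vector1); err != nil {
    log.Fatal(err)
}
if err := db.Add(ctx, "doc-2", vector2); err != nil {
    log.Fatal(err)
}

// Build the index (runs k-means clustering + initializes HE engines)
if err := db.Build(ctx); err != nil {
    log.Fatal(err)
}

// Search
results, err := db.Search(ctx, queryVector, 10)
if err != nil {
    log.Fatal(err)
}
for _, r := range results {
    fmt.Println(r.ID, r.Score)
}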

Lifecycle

The DB follows a three-phase lifecycle:

  1. Add vectors with DB.Add or DB.AddBatch
  2. Build the index with DB.Build (expensive: k-means clustering + HE engine initialization)
  3. Search with DB.Search (safe for concurrent use)

K-means clustering requires all vectors upfront, so DB.Build must be called after all vectors are added. To add vectors after building, use DB.Add followed by DB.Rebuild.

Constants

This section is empty.

Variables

var (
	// ErrNotBuilt is returned when Search is called before Build.
	ErrNotBuilt = errors.New("opaque: index not built")

	// ErrAlreadyBuilt is returned when Add/AddBatch is called after Build.
	ErrAlreadyBuilt = errors.New("opaque: index already built; use Rebuild to add vectors")

	// ErrDimensionMismatch is returned when a vector has the wrong dimension.
	ErrDimensionMismatch = errors.New("opaque: dimension mismatch")

	// ErrNotFound is returned when a vector ID is not found.
	ErrNotFound = errors.New("opaque: vector not found")

	// ErrEmptyID is returned when an empty vector ID is provided.
	ErrEmptyID = errors.New("opaque: empty vector ID")

	// ErrNoVectors is returned when Build is called with no buffered vectors.
	ErrNoVectors = errors.New("opaque: no vectors added")

	// ErrNotReady is returned when an operation requires a built index but
	// the DB is not in the ready state (e.g., Save before Build).
	ErrNotReady = errors.New("opaque: database not ready")

	// ErrClosed is returned when an operation is attempted on a closed DB.
	ErrClosed = errors.New("opaque: database is closed")
)

Sentinel errors for programmatic error handling. Use errors.Is to check:

if errors.Is(err, opaque.ErrNotBuilt) { ... }
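
The reason to prefer errors.Is over == is that callers often wrap errors with extra context. The sketch below uses a local stand-in sentinel (errNotBuilt) and a hypothetical search helper rather than the real package, so that it runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotBuilt is a local stand-in for a sentinel such as opaque.ErrNotBuilt.
var errNotBuilt = errors.New("index not built")

// search is a hypothetical helper that wraps the sentinel with context via %w.
func search(built bool) error {
	if !built {
		return fmt.Errorf("search for %q: %w", "doc-1", errNotBuilt)
	}
	return nil
}

// isNotBuilt reports whether err is, or wraps, errNotBuilt.
func isNotBuilt(err error) bool {
	return errors.Is(err, errNotBuilt)
}

func main() {
	err := search(false)
	// Direct comparison misses the wrapped sentinel; errors.Is walks the chain.
	fmt.Println(err == errNotBuilt, isNotBuilt(err)) // prints "false true"
}
```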

Functions

This section is empty.

Types

type ClusterStats

type ClusterStats struct {
	NumClusters   int     // Number of clusters
	MinSize       int     // Smallest cluster size
	MaxSize       int     // Largest cluster size
	AvgSize       float64 // Average cluster size
	EmptyClusters int     // Number of empty clusters (should be 0)
	Iterations    int     // K-means iterations until convergence
}

ClusterStats contains statistics about k-means clustering quality.

type Config

type Config struct {
	// Dimension is the length of each vector. Required.
	// All vectors added to the DB must have exactly this many elements.
	Dimension int

	// NumClusters is the number of k-means clusters used to partition vectors.
	// More clusters means faster search (fewer vectors per cluster) but weaker
	// privacy (smaller anonymity sets per cluster). Must be >= 2.
	// Default: 64.
	NumClusters int

	// TopClusters is the number of clusters probed during each search.
	// Higher values improve recall at the cost of more computation, bandwidth,
	// and weaker access pattern privacy (more clusters probed = easier to infer intent).
	// Must be <= NumClusters.
	// Default: max(NumClusters / 16, 4). For 64 clusters this is 4 (~6% of data).
	TopClusters int

	// NumDecoys is the number of extra clusters fetched per search to hide
	// which clusters are actually relevant. Higher values provide better
	// access pattern privacy at the cost of additional bandwidth.
	// Default: 8.
	NumDecoys int

	// WorkerPoolSize is the number of parallel CKKS homomorphic encryption engines.
	// Each engine consumes ~50MB of memory but enables parallel centroid scoring.
	// Set to 0 for automatic sizing (min(NumCPU, 8)).
	// Default: 0 (automatic).
	WorkerPoolSize int

	// Storage selects the backend for encrypted blob storage.
	// Default: Memory.
	Storage StorageBackend

	// StoragePath is the directory for file-backed storage.
	// Required when Storage is [File], ignored otherwise.
	StoragePath string

	// ProbeThreshold controls multi-probe cluster selection during search.
	// Clusters scoring within this fraction of the top cluster score are also probed,
	// beyond the TopClusters limit. For example, 0.95 means clusters within 5% of
	// the best score are included.
	// Set to 1.0 to disable multi-probe (strict top-K only).
	// Default: 0.95.
	ProbeThreshold float64

	// RedundantAssignments assigns each vector to multiple clusters during indexing.
	// Improves recall for vectors near cluster boundaries at the cost of increased storage.
	// A value of 2 means each vector is stored in its 2 nearest clusters.
	// Default: 1 (no redundancy).
	RedundantAssignments int

	// PCADimension enables optional PCA dimensionality reduction.
	// When set to a positive value, vectors are projected to this dimension
	// before clustering and encryption, reducing latency and bandwidth.
	// The PCA transform is applied client-side, so it has no privacy impact.
	// Must be less than Dimension. Set to 0 to disable (default).
	// Default: 0 (disabled).
	PCADimension int

	// NumKMeansInit is the number of k-means clustering initializations to run.
	// Multiple runs with different seeds are executed in parallel, and the result
	// with the lowest inertia (best cluster quality) is kept.
	// Higher values improve cluster quality at the cost of more CPU during Build.
	// Default: 1 (single initialization).
	NumKMeansInit int

	// NormalizedStorage stores vectors pre-normalized during Build, skipping
	// per-vector normalization during Search. This reduces local scoring latency
	// by 10-15%. Stored vectors lose original magnitudes (direction is preserved).
	// Default: true for new databases.
	NormalizedStorage *bool

	// ProbeStrategy selects the cluster probing method during search.
	// "threshold" (default) uses ProbeThreshold ratio to include nearby clusters.
	// "gap" uses adaptive score-gap detection to find natural breaks in the score distribution.
	// Default: "" (uses "threshold").
	ProbeStrategy string

	// GapMultiplier controls gap-based probing sensitivity when ProbeStrategy is "gap".
	// Expansion stops when the gap between consecutive cluster scores exceeds
	// GapMultiplier times the median gap. Lower values probe fewer clusters.
	// Default: 2.0.
	GapMultiplier float64

	// OnBuildProgress is called during [DB.Build] and [DB.Rebuild] to report progress.
	// The phase parameter identifies the current step, and pct is a value between 0 and 1
	// indicating completion within that phase.
	//
	// Phases reported (in order):
	//   - "pca": PCA fitting and dimensionality reduction (only if PCADimension > 0)
	//   - "clustering": k-means clustering of vectors
	//   - "encrypting": AES-256-GCM encryption of vectors
	//   - "indexing": blob storage and HE engine initialization
	//
	// The callback is invoked synchronously from the Build goroutine. Keep it fast.
	// A nil callback disables progress reporting (default).
	OnBuildProgress func(phase string, pct float64) `json:"-"`
}

Config controls the behavior of a DB instance.

Only Config.Dimension is required. All other fields have sensible defaults.
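
The two probing strategies documented on ProbeThreshold and GapMultiplier can be sketched as follows. This is one reading of the field comments, not the library's code: thresholdProbe and gapProbe are hypothetical names, scores are assumed to be positive and already sorted in descending order, the gap strategy is assumed to start expanding from a minimum cluster count, and the median is taken as the upper median.

```go
package main

import (
	"fmt"
	"sort"
)

// thresholdProbe returns how many of the best-scoring clusters to probe:
// at least topClusters, plus any further clusters whose score is at least
// threshold times the best score.
func thresholdProbe(scores []float64, topClusters int, threshold float64) int {
	n := topClusters
	if n < 0 {
		n = 0
	}
	if n > len(scores) {
		n = len(scores)
	}
	for n < len(scores) && scores[n] >= threshold*scores[0] {
		n++
	}
	return n
}

// gapProbe starts from minClusters and keeps expanding until the drop to the
// next score exceeds multiplier times the median gap between consecutive scores.
func gapProbe(scores []float64, minClusters int, multiplier float64) int {
	if len(scores) < 2 {
		return len(scores)
	}
	gaps := make([]float64, len(scores)-1)
	for i := range gaps {
		gaps[i] = scores[i] - scores[i+1]
	}
	sorted := append([]float64(nil), gaps...)
	sort.Float64s(sorted)
	median := sorted[len(sorted)/2]

	n := minClusters
	if n < 1 {
		n = 1
	}
	if n > len(scores) {
		n = len(scores)
	}
	// gaps[n-1] is the drop from the last selected cluster to the next candidate.
	for n < len(scores) && gaps[n-1] <= multiplier*median {
		n++
	}
	return n
}

func main() {
	scores := []float64{0.90, 0.88, 0.87, 0.60, 0.58, 0.55}
	// Both strategies stop at the large drop between 0.87 and 0.60.
	fmt.Println(thresholdProbe(scores, 1, 0.95), gapProbe(scores, 1, 2.0)) // prints "3 3"
}
```

The threshold strategy needs a ratio that suits the score scale, while the gap strategy adapts to the distribution of each query's scores, at the cost of probing more clusters when the scores fall off smoothly.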

type DB

type DB struct {
	// contains filtered or unexported fields
}

DB is a privacy-preserving vector search database.

It encrypts stored vectors with AES-256-GCM, scores queries against cluster centroids using CKKS homomorphic encryption, and fetches decoy clusters to hide access patterns.

A DB must be built before searching. After DB.Build completes, DB.Search is safe for concurrent use from multiple goroutines.

func Load

func Load(path string) (*DB, error)

Load restores a DB from a directory previously created by DB.Save.

The returned DB is immediately ready for DB.Search — no Build is needed. The blob store is opened in file mode from the saved directory.

To add new vectors after loading, use DB.Add followed by DB.Rebuild.

func NewDB

func NewDB(cfg Config) (*DB, error)

NewDB creates a new vector search database with the given configuration.

Only Config.Dimension is required; all other fields use sensible defaults if zero. No expensive initialization happens here; the heavy work is deferred to DB.Build.
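
The "defaults if zero" behavior can be pictured with a small sketch that fills in the defaults documented on the Config fields. The config type and withDefaults function are local stand-ins, not the library's own, and the final clamp of TopClusters to NumClusters is an assumption added so that the documented "must be <= NumClusters" constraint holds for very small cluster counts.

```go
package main

import (
	"fmt"
	"runtime"
)

// config mirrors a few numeric fields of the documented Config struct.
type config struct {
	NumClusters    int
	TopClusters    int
	NumDecoys      int
	WorkerPoolSize int
	ProbeThreshold float64
	GapMultiplier  float64
}

// withDefaults replaces zero-valued fields with their documented defaults.
func withDefaults(c config) config {
	if c.NumClusters == 0 {
		c.NumClusters = 64
	}
	if c.TopClusters == 0 {
		// Documented default: max(NumClusters / 16, 4).
		c.TopClusters = c.NumClusters / 16
		if c.TopClusters < 4 {
			c.TopClusters = 4
		}
		// Assumption: clamp so TopClusters <= NumClusters holds for tiny indexes.
		if c.TopClusters > c.NumClusters {
			c.TopClusters = c.NumClusters
		}
	}
	if c.NumDecoys == 0 {
		c.NumDecoys = 8
	}
	if c.WorkerPoolSize == 0 {
		// Documented automatic sizing: min(NumCPU, 8).
		c.WorkerPoolSize = runtime.NumCPU()
		if c.WorkerPoolSize > 8 {
			c.WorkerPoolSize = 8
		}
	}
	if c.ProbeThreshold == 0 {
		c.ProbeThreshold = 0.95
	}
	if c.GapMultiplier == 0 {
		c.GapMultiplier = 2.0
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", withDefaults(config{}))
	fmt.Printf("%+v\n", withDefaults(config{NumClusters: 256, NumDecoys: 16}))
}
```

A zero-means-default scheme cannot express an explicit zero or false, which is presumably why NormalizedStorage is declared as *bool: a nil pointer can mean "use the default of true" while a pointer to false turns the feature off.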

func (*DB) Add

func (db *DB) Add(ctx context.Context, id string, vector []float64) error

Add buffers a single vector for indexing. The id must be unique within the DB.

Before DB.Build, vectors are buffered for the initial index build. After Build, vectors are buffered for the next DB.Rebuild.

The vector is copied internally, so the caller may modify the slice after Add returns.

func (*DB) AddBatch

func (db *DB) AddBatch(ctx context.Context, ids []string, vectors [][]float64) error

AddBatch buffers multiple vectors for indexing. The ids and vectors slices must have the same length. Each vector must have exactly Config.Dimension elements.

This is equivalent to calling DB.Add for each vector, but acquires the lock once.

func (*DB) AddBatchWithMetadata

func (db *DB) AddBatchWithMetadata(ctx context.Context, ids []string, vectors [][]float64, metadatas []Metadata) error

AddBatchWithMetadata buffers multiple vectors with associated metadata. The metadatas slice must have the same length as ids and vectors. Use nil for vectors without metadata.

func (*DB) AddWithMetadata

func (db *DB) AddWithMetadata(ctx context.Context, id string, vector []float64, meta Metadata) error

AddWithMetadata buffers a vector with associated metadata for indexing.

Metadata is encrypted alongside the vector and can be used for filtered search with DB.SearchWithFilter. The id must be unique within the DB.

Both the vector and metadata are copied internally.

func (*DB) Build

func (db *DB) Build(ctx context.Context) error

Build creates the search index from all buffered vectors.

This is the most expensive operation in the lifecycle:

  • Runs k-means clustering to partition vectors into clusters
  • Encrypts each vector with AES-256-GCM
  • Initializes the CKKS homomorphic encryption engine pool
  • Pre-encodes cluster centroids as HE plaintexts

After Build returns successfully, DB.Search is ready for use. Build must only be called once; use DB.Rebuild to re-index after adding new vectors.

func (*DB) Close

func (db *DB) Close() error

Close releases all resources held by the DB, including the blob store and HE engine pool. The DB must not be used after Close is called.

func (*DB) ClusterStats

func (db *DB) ClusterStats() ClusterStats

ClusterStats returns statistics about the k-means clustering from the most recent Build. Returns a zero value if the index has not been built yet.

func (*DB) Count

func (db *DB) Count(ctx context.Context) int

Count returns the number of indexed vectors (in the built index only). Returns 0 if the index has not been built.

func (*DB) Delete

func (db *DB) Delete(ctx context.Context, id string) error

Delete soft-deletes a vector by ID. The vector is excluded from future DB.Search results immediately. The underlying storage is reclaimed on the next DB.Rebuild.

Returns ErrEmptyID if the ID is empty, or ErrNotFound if the ID does not exist in either the pending vectors or the built index.

Delete is safe for concurrent use with DB.Search.
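
Soft deletion is commonly implemented with a tombstone set that readers consult under a shared lock. The sketch below shows that general pattern under stated assumptions; the index type and its methods are hypothetical and say nothing about how Opaque stores its state internally.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNotFound = errors.New("vector not found")

// index is a toy stand-in for a built index that supports soft deletes.
type index struct {
	mu      sync.RWMutex
	ids     []string        // IDs physically present in the index
	deleted map[string]bool // tombstones; space is reclaimed by Rebuild
}

func newIndex(ids ...string) *index {
	return &index{ids: ids, deleted: make(map[string]bool)}
}

// Delete marks id as deleted without touching the stored data.
func (ix *index) Delete(id string) error {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	for _, have := range ix.ids {
		if have == id && !ix.deleted[id] {
			ix.deleted[id] = true
			return nil
		}
	}
	return errNotFound
}

// Live returns the IDs a search would still consider. Readers take only a
// read lock, so many searches can run concurrently with each other.
func (ix *index) Live() []string {
	ix.mu.RLock()
	defer ix.mu.RUnlock()
	var live []string
	for _, id := range ix.ids {
		if !ix.deleted[id] {
			live = append(live, id)
		}
	}
	return live
}

// Rebuild compacts the index by dropping tombstoned entries.
func (ix *index) Rebuild() {
	live := ix.Live()
	ix.mu.Lock()
	defer ix.mu.Unlock()
	ix.ids = live
	ix.deleted = make(map[string]bool)
}

func main() {
	ix := newIndex("a", "b", "c")
	fmt.Println(ix.Delete("b"), ix.Live()) // prints "<nil> [a c]"
	ix.Rebuild()
	fmt.Println(len(ix.ids)) // prints "2"
}
```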

func (*DB) Get

func (db *DB) Get(ctx context.Context, id string) ([]float64, error)

Get retrieves a vector by ID, decrypting it from the blob store.

Returns ErrNotReady if the index has not been built, or ErrNotFound if no vector with the given ID exists. Get is safe for concurrent use.

func (*DB) GetConfig

func (db *DB) GetConfig() Config

GetConfig returns a copy of the DB's current configuration.

func (*DB) GetMetadata

func (db *DB) GetMetadata(ctx context.Context, id string) (Metadata, error)

GetMetadata retrieves the metadata for a vector by ID.

Returns nil if the vector has no metadata, or ErrNotFound if the ID does not exist.

func (*DB) Has

func (db *DB) Has(ctx context.Context, id string) bool

Has reports whether a vector with the given ID exists in the DB.

It checks both the built index (blob store) and pending vectors. Has is safe for concurrent use.

func (*DB) IsReady

func (db *DB) IsReady() bool

IsReady reports whether the index has been built and the DB is ready for search.

func (*DB) List

func (db *DB) List(ctx context.Context, offset, limit int) ([]string, error)

List returns a paginated slice of vector IDs from the built index.

IDs are returned in sorted order. offset and limit control pagination. Returns ErrNotReady if the index has not been built.
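
Offset/limit pagination over sorted IDs can be sketched as below. The page function is a hypothetical helper, and its handling of out-of-range arguments (negative offsets treated as 0, an empty result past the end) is an assumption; the documentation does not say how List treats those cases.

```go
package main

import (
	"fmt"
	"sort"
)

// page returns up to limit IDs starting at offset, in sorted order.
// The input slice is copied so the caller's ordering is left untouched.
func page(ids []string, offset, limit int) []string {
	sorted := append([]string(nil), ids...)
	sort.Strings(sorted)

	if offset < 0 {
		offset = 0
	}
	if offset >= len(sorted) || limit <= 0 {
		return nil
	}
	// Clamp limit against the remaining length to avoid integer overflow
	// in offset+limit when limit is very large.
	if limit > len(sorted)-offset {
		limit = len(sorted) - offset
	}
	return sorted[offset : offset+limit]
}

func main() {
	ids := []string{"d", "a", "c", "b", "e"}
	fmt.Println(page(ids, 0, 2)) // prints "[a b]"
	fmt.Println(page(ids, 4, 10)) // prints "[e]"
}
```

Sorting before slicing is what makes pages stable: the same offset and limit return the same IDs as long as the set of IDs does not change between calls.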

func (*DB) Rebuild

func (db *DB) Rebuild(ctx context.Context) error

Rebuild re-indexes all vectors including any added since the last Build.

This performs a full rebuild: the old index is discarded and a new one is created from all accumulated vectors. Use this after adding vectors to a built DB:

db.Build(ctx)           // initial build
// ... later ...
db.Rebuild(ctx)         // add pending vectors, rebuild from scratch

Rebuild is not safe for concurrent use with Search.

func (*DB) Save

func (db *DB) Save(path string) error

Save persists a built DB to the given directory path.

The directory must not already contain a saved DB (no metadata.json). After Save, the DB can be restored with Load in a new process.

Save is safe for concurrent use with DB.Search — it acquires a read lock.

func (*DB) Search

func (db *DB) Search(ctx context.Context, query []float64, topK int) ([]Result, error)

Search returns the topK most similar vectors to the query.

Results are sorted by descending cosine similarity score. The query vector must have exactly Config.Dimension elements.

Search uses SIMD-optimized batch HE operations internally for best performance. It is safe for concurrent use from multiple goroutines after DB.Build completes.
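
The final client-side ranking step amounts to scoring candidates by cosine similarity and keeping the best topK. The sketch below shows that step in isolation, in plaintext and without any encryption; result, cosine, and topK are local names for illustration, and all vectors are assumed to have the same length as the query.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

type result struct {
	ID    string
	Score float64
}

// cosine returns the cosine similarity of a and b, or 0 if either is all zeros.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scores every candidate against the query and returns the k best,
// sorted by descending score.
func topK(query []float64, ids []string, vecs [][]float64, k int) []result {
	results := make([]result, 0, len(vecs))
	for i, v := range vecs {
		results = append(results, result{ID: ids[i], Score: cosine(query, v)})
	}
	sort.Slice(results, func(i, j int) bool {
		return results[i].Score > results[j].Score
	})
	if k < 0 {
		k = 0
	}
	if k < len(results) {
		results = results[:k]
	}
	return results
}

func main() {
	ids := []string{"a", "b", "c", "d"}
	vecs := [][]float64{{1, 0}, {0, 1}, {1, 1}, {-1, 0}}
	for _, r := range topK([]float64{1, 0}, ids, vecs, 2) {
		fmt.Printf("%s: %.4f\n", r.ID, r.Score)
	}
}
```

If stored vectors are already unit length, as with NormalizedStorage, the per-vector norm computation drops out and the score reduces to a dot product with a normalized query.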

func (*DB) SearchWithFilter

func (db *DB) SearchWithFilter(ctx context.Context, query []float64, topK int, filter Filter) ([]Result, error)

SearchWithFilter returns the topK most similar vectors matching the filter.

This runs a normal search and then post-filters results by metadata. Filtered-out results are not replaced, so fewer than topK results may be returned. For better recall with filters, increase topK.

All conditions in Filter.Where must match (AND logic). Matching uses exact equality for string, int, float64, and bool values.
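
Post-filtering with AND logic and exact equality can be sketched in a few lines. The hit type and the matches and postFilter helpers are hypothetical, and the comparison relies on Go's interface equality, which is an assumption about the semantics rather than a description of the library's implementation.

```go
package main

import "fmt"

type metadata map[string]any

type hit struct {
	ID   string
	Meta metadata
}

// matches reports whether meta satisfies every condition in where.
// Interface equality compares dynamic type and value, so int 2024 and
// float64 2024 are different. Values must be comparable types (string,
// int, float64, bool); comparing slices or maps this way would panic.
func matches(meta metadata, where map[string]any) bool {
	for key, want := range where {
		got, ok := meta[key]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// postFilter keeps only the hits that match. Dropped hits are not replaced,
// so the output can be shorter than the input.
func postFilter(hits []hit, where map[string]any) []hit {
	kept := make([]hit, 0, len(hits))
	for _, h := range hits {
		if matches(h.Meta, where) {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	hits := []hit{
		{ID: "doc-1", Meta: metadata{"lang": "go", "year": 2024}},
		{ID: "doc-2", Meta: metadata{"lang": "rust", "year": 2024}},
		{ID: "doc-3", Meta: nil},
	}
	kept := postFilter(hits, map[string]any{"lang": "go", "year": 2024})
	fmt.Println(len(kept), kept[0].ID) // prints "1 doc-1"
}
```

The type-sensitive comparison matters when metadata has passed through JSON, because encoding/json decodes all numbers into float64 when the target is an any value; a filter written with an int literal would then stop matching.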

func (*DB) Size

func (db *DB) Size() int

Size returns the total number of vectors in the DB (both pending and indexed).

func (*DB) Stats

func (db *DB) Stats(ctx context.Context) DBStats

Stats returns aggregate statistics about the database.

func (*DB) Update

func (db *DB) Update(ctx context.Context, id string, vector []float64) error

Update replaces a vector's data. This is equivalent to DB.Delete followed by DB.Add — the old vector is soft-deleted and the new one is buffered for the next DB.Rebuild.

The updated vector takes effect in search results after Rebuild. Until then, the old vector is excluded from search (soft-deleted) and the new one is pending.

Returns ErrEmptyID if the ID is empty, ErrNotFound if the ID does not exist, or ErrDimensionMismatch if the vector has the wrong length.

type DBStats

type DBStats struct {
	// TotalVectors is the total number of vectors (pending + indexed).
	TotalVectors int

	// IndexedVectors is the number of vectors in the built index.
	// Zero if the index has not been built.
	IndexedVectors int

	// PendingVectors is the number of vectors buffered but not yet indexed.
	PendingVectors int

	// ClusterStats contains k-means clustering statistics (zero if not built).
	ClusterStats ClusterStats

	// StorageBackend is the storage backend in use.
	StorageBackend StorageBackend

	// HasPCA is true if PCA dimensionality reduction is enabled.
	HasPCA bool

	// IsReady is true if the index is built and ready for search.
	IsReady bool
}

DBStats contains aggregate statistics about the database.

type Filter

type Filter struct {
	// Where contains exact-match conditions. A result must match ALL conditions.
	// Supported value types: string, int, float64, bool.
	Where map[string]any
}

Filter specifies criteria for filtered search.

type Metadata

type Metadata map[string]any

Metadata is a map of key-value pairs attached to a vector. Keys are strings; values can be string, int, float64, or bool.

Metadata is stored encrypted alongside vectors and can be used for filtered search via DB.SearchWithFilter.

type Result

type Result struct {
	// ID is the identifier passed to [DB.Add] when the vector was indexed.
	ID string

	// Score is the cosine similarity between the query and this vector.
	// Higher is more similar. Range: [-1, 1] for normalized vectors.
	Score float64
}

Result is a single search result containing the vector ID and its similarity score.

type StorageBackend

type StorageBackend int

StorageBackend selects where encrypted vector blobs are stored.

const (
	// Memory stores all data in RAM. Fast but not persistent across restarts.
	Memory StorageBackend = iota

	// File stores encrypted blobs on disk at the path specified by [Config.StoragePath].
	// Persistent across restarts, slower than memory for large datasets.
	File
)

Directories

Path                        Synopsis
api
cmd
  cli (command)             Command cli provides a command-line interface for testing Opaque locally.
  devserver (command)       Development server for local testing of the privacy-preserving vector search.
  search-service (command)  Command search-service runs the Opaque search gRPC server.
examples
  basic (command)           Example sdk-basic demonstrates the simplest Opaque workflow: create a DB, add vectors, build the index, and search.
  file-storage (command)    Example sdk-file-storage demonstrates using file-backed storage instead of in-memory storage.
  http-server (command)     Example http-server wraps Opaque in a lightweight HTTP API, demonstrating a realistic self-hosted deployment pattern.
  large-scale (command)     Example sdk-large-scale demonstrates tuning Opaque for larger datasets.
  metadata (command)        Example sdk-metadata demonstrates adding metadata to vectors and using filtered search to narrow results by metadata fields.
  persistence (command)     Example sdk-persistence demonstrates saving a built index to disk and loading it back in a new process.
internal
  service                   Package service implements the Opaque search service.
  session                   Package session provides session management for client keys.
  store                     Package store provides vector storage backends.
pkg
  auth                      Package auth provides token-based authentication and key distribution for Tier 2.5 hierarchical private search (Option B).
  blob                      Package blob provides encrypted blob storage for Tier 2 data-private search.
  cache                     Package cache provides caching for expensive HE operations.
  client                    Package client provides the Opaque SDK for privacy-preserving search.
  cluster                   Package cluster provides clustering algorithms for vector indexing.
  crypto                    Package crypto provides homomorphic encryption operations using Lattigo CKKS scheme.
  embeddings                Package embeddings provides a client for the local embedding service.
  encrypt                   Package encrypt provides symmetric encryption for Tier 2 data-private storage.
  enterprise                Package enterprise provides per-enterprise configuration and secret management for Tier 2.5 hierarchical private search.
  grpcserver                Package grpcserver implements the gRPC service for privacy-preserving vector search.
  hierarchical              Package hierarchical implements a three-level privacy-preserving vector search.
  lsh                       Package lsh provides locality-sensitive hashing for approximate nearest neighbor search.
  pca                       Package pca provides Principal Component Analysis for dimensionality reduction.
  server                    Package server provides the REST API server for privacy-preserving vector search.
