needle

package module
v0.4.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 20, 2025 License: MIT Imports: 8 Imported by: 0

README

Needle

Go Report Card Go Reference

High-Performance HNSW Vector Search in Go

Needle is a high-performance implementation of Hierarchical Navigable Small World (HNSW) graphs for approximate nearest neighbor search in Go. It uses Apache Arrow for efficient vector storage and provides a thread-safe API for concurrent operations.

Features

  • Efficient Vector Storage: Uses Apache Arrow for chunked vector storage with zero-copy operations
  • Thread-Safe: Supports concurrent insertions and searches with mutex locks
  • Memory Efficient: Zero-allocation search operations using object pools and optimized memory management
  • High Performance: Optimized neighbor selection with partial sorting and efficient graph traversal
  • Batch Operations: Efficient batch vector addition with minimal allocations
  • Configurable: Adjustable parameters for search quality vs. speed trade-offs
  • Production Ready: Comprehensive test suite and benchmarks

Installation

go get github.com/TFMV/needle

Usage

package main

import (
    "fmt"
    "github.com/apache/arrow-go/v18/arrow/memory"
    "github.com/TFMV/needle"
)

func main() {
    // Create a new HNSW index
    dim := 128          // vector dimension
    m := 16            // max connections per node
    efConstruction := 200 // construction search width
    efSearch := 100    // search width
    chunkSize := 1000  // vectors per chunk
    g := needle.NewGraph(dim, m, efConstruction, efSearch, chunkSize, memory.DefaultAllocator)

    // Add vectors
    vec1 := []float64{1.0, 2.0, 3.0} // your vector data
    g.Add(1, vec1)

    // Search for nearest neighbors
    query := []float64{1.1, 2.1, 3.1}
    results, err := g.Search(query, 10) // find 10 nearest neighbors
    if err != nil {
        panic(err)
    }
    fmt.Println("Nearest neighbors:", results)
}

Parameters

  • dim: Vector dimension
  • m: Maximum number of connections per node (higher values = better recall, more memory)
  • efConstruction: Search width during construction (higher values = better graph quality)
  • efSearch: Search width during queries (higher values = better recall, slower)
  • chunkSize: Number of vectors per Arrow chunk (affects memory usage)

Performance Characteristics

  • Single-threaded Add: ~110.6μs per vector
  • Single-threaded Search: ~12.6μs per query
  • Batch Add: ~2.8μs per vector (100 vectors/batch)
  • Concurrent Add: ~191.1μs per vector
  • Concurrent Search: ~28.4μs per query
  • High Volume Search: ~116μs per query (with 100,000 vectors)

Implementation Details

Vector Storage

Vectors are stored in Apache Arrow chunks for efficient memory management and vectorized operations. The chunked storage provides:

  • Zero-copy vector access
  • Efficient memory usage with configurable chunk sizes
  • Support for large datasets
  • Optimized vector operations
Graph Structure

The HNSW graph is implemented with:

  • Hierarchical layers for fast approximate search
  • Bidirectional connections for better search quality
  • Thread-safe operations using mutex locks
  • Object pools for zero-allocation search operations
  • Optimized neighbor selection with partial sorting
  • Efficient memory management for neighbor lists
Search Algorithm

The search process is optimized for performance:

  1. Starts at the top layer
  2. Navigates down through layers using greedy search
  3. Uses priority queues for efficient neighbor selection
  4. Implements partial sorting for better performance
  5. Leverages object pools to minimize allocations
  6. Supports concurrent operations with thread safety

Testing

The implementation includes:

  • Unit tests for core functionality
  • Edge case tests
  • High-volume tests (10,000 vectors)
  • Concurrent operation tests
  • Benchmarks for various operations

Run tests with:

go test -v -bench=. -benchmem

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Graph

type Graph struct {
	// contains filtered or unexported fields
}

Graph is the main HNSW index structure.

func NewGraph

func NewGraph(dim, m, efConstruction, efSearch, chunkSize int, alloc memory.Allocator) *Graph

NewGraph initializes an HNSW index.

func (*Graph) Add

func (g *Graph) Add(id int, vec []float64) error

Add inserts a single point into the index.

func (*Graph) AddBatch added in v0.2.0

func (g *Graph) AddBatch(items []struct {
	ID  int
	Vec []float64
}) error

AddBatch inserts multiple points into the index efficiently.

func (*Graph) Len

func (g *Graph) Len() int

Len returns the number of indexed points.

func (*Graph) Search

func (g *Graph) Search(query []float64, k int) ([]int, error)

Search finds the k nearest neighbors to query.

type Node

type Node struct {
	ID int // user-provided ID
	// contains filtered or unexported fields
}

Node represents a point in the HNSW graph.

func NewNode

func NewNode(id, idx, level int) *Node

NewNode creates a new node with properly initialized neighbor slices.

type VisitedList added in v0.4.0

type VisitedList struct {
	// contains filtered or unexported fields
}

VisitedList represents a reusable visited tracking structure inspired by hnswlib

func NewVisitedList added in v0.4.0

func NewVisitedList(capacity int) *VisitedList

NewVisitedList creates a new visited list with the given capacity

func (*VisitedList) IsVisited added in v0.4.0

func (vl *VisitedList) IsVisited(idx int) bool

IsVisited checks if an index has been visited

func (*VisitedList) Reset added in v0.4.0

func (vl *VisitedList) Reset()

Reset prepares the visited list for reuse

func (*VisitedList) Visit added in v0.4.0

func (vl *VisitedList) Visit(idx int)

Visit marks an index as visited

type VisitedListPool added in v0.4.0

type VisitedListPool struct {
	// contains filtered or unexported fields
}

VisitedListPool manages a pool of VisitedList instances

func NewVisitedListPool added in v0.4.0

func NewVisitedListPool(maxSize int) *VisitedListPool

NewVisitedListPool creates a new visited list pool

func (*VisitedListPool) Get added in v0.4.0

func (vlp *VisitedListPool) Get() *VisitedList

Get retrieves a visited list from the pool

func (*VisitedListPool) Return added in v0.4.0

func (vlp *VisitedListPool) Return(vl *VisitedList)

Return returns a visited list to the pool

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL