datasets

package
Version: v0.0.0-...-8055bb4
Published: Jan 5, 2026 | License: MIT | Imports: 8 | Imported by: 0

Documentation

Constants

This section is empty.

Variables

var CIDatasetConfigs = map[string]DatasetConfig{

	"ivf_recall":   {N: 5000, D: 128, NQ: 50, K: 10},
	"hnsw_recall":  {N: 5000, D: 128, NQ: 50, K: 10},
	"pq_recall":    {N: 5000, D: 128, NQ: 50, K: 10},
	"ivfpq_best":   {N: 10000, D: 256, NQ: 100, K: 10},
	"ivf_optimal":  {N: 10000, D: 256, NQ: 100, K: 10},
	"ivf_training": {N: 10000, D: 128, NQ: 100, K: 10},

	"semantic_search":  {N: 10000, D: 384, NQ: 100, K: 10},
	"image_similarity": {N: 10000, D: 512, NQ: 50, K: 10},
	"recommendations":  {N: 10000, D: 256, NQ: 50, K: 50},

	"param_sweep":      {N: 5000, D: 128, NQ: 50, K: 10},
	"high_dimensional": {N: 5000, D: 1536, NQ: 50, K: 10},
}

CI-friendly dataset sizes (small, fast tests)

var LocalDatasetConfigs = map[string]DatasetConfig{

	"ivf_recall":   {N: 10000, D: 128, NQ: 100, K: 10},
	"hnsw_recall":  {N: 10000, D: 256, NQ: 100, K: 10},
	"pq_recall":    {N: 10000, D: 128, NQ: 100, K: 10},
	"ivfpq_best":   {N: 100000, D: 256, NQ: 100, K: 10},
	"ivf_optimal":  {N: 100000, D: 256, NQ: 100, K: 10},
	"ivf_training": {N: 50000, D: 128, NQ: 100, K: 10},

	"semantic_search":  {N: 50000, D: 768, NQ: 500, K: 10},
	"image_similarity": {N: 50000, D: 2048, NQ: 100, K: 10},
	"recommendations":  {N: 50000, D: 256, NQ: 100, K: 50},

	"param_sweep":      {N: 10000, D: 256, NQ: 100, K: 10},
	"high_dimensional": {N: 10000, D: 1536, NQ: 100, K: 10},
}

Local testing dataset sizes (medium, realistic tests)

Functions

func GenerateStandardSizes

func GenerateStandardSizes() []struct {
	Name string
	N    int
	D    int
}

GenerateStandardSizes returns standard dataset sizes for testing

func IsCI

func IsCI() bool

IsCI detects if running in a CI environment

func IsDatasetAvailable

func IsDatasetAvailable(name string, testdataPath string) bool

IsDatasetAvailable checks if a dataset is downloaded

func SaveFVecs

func SaveFVecs(filename string, vectors []float32, n, d int) error

SaveFVecs saves vectors to .fvecs format (useful for creating test datasets)

func SaveGroundTruth

func SaveGroundTruth(filename string, groundTruth [][]int64) error

SaveGroundTruth saves ground truth to .ivecs format

Types

type DataDistribution

type DataDistribution int

DataDistribution represents different data distribution types

const (
	// UniformRandom generates uniformly distributed random vectors
	UniformRandom DataDistribution = iota
	// GaussianClustered generates vectors in Gaussian clusters
	GaussianClustered
	// PowerLaw generates vectors with power-law distance distribution
	PowerLaw
	// Normalized generates normalized (unit length) vectors
	Normalized
	// Sparse generates sparse vectors (many zeros)
	Sparse
)

type DatasetConfig

type DatasetConfig struct {
	N  int // Number of vectors
	D  int // Dimension
	NQ int // Number of queries
	K  int // Number of neighbors for recall
}

DatasetConfig defines the size parameters for a dataset

func GetDatasetConfig

func GetDatasetConfig(name string) DatasetConfig

GetDatasetConfig returns the appropriate dataset configuration based on the environment. If name is not found, it returns a default small configuration.
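The lookup described here can be sketched as a map selection with a fallback. The example below copies two entries from CIDatasetConfigs and LocalDatasetConfigs and takes the environment as an explicit argument so the choice is easy to test. The function `pickConfig` and the values in `fallbackConfig` are assumptions; the package's actual default is not documented.

```go
package main

import "fmt"

// DatasetConfig mirrors the struct documented above.
type DatasetConfig struct {
	N, D, NQ, K int
}

// Two entries copied from CIDatasetConfigs.
var ciConfigs = map[string]DatasetConfig{
	"ivf_recall":      {N: 5000, D: 128, NQ: 50, K: 10},
	"semantic_search": {N: 10000, D: 384, NQ: 100, K: 10},
}

// The same two entries copied from LocalDatasetConfigs.
var localConfigs = map[string]DatasetConfig{
	"ivf_recall":      {N: 10000, D: 128, NQ: 100, K: 10},
	"semantic_search": {N: 50000, D: 768, NQ: 500, K: 10},
}

// fallbackConfig is a made-up default used only for this sketch.
var fallbackConfig = DatasetConfig{N: 1000, D: 64, NQ: 10, K: 10}

// pickConfig is a hypothetical version of GetDatasetConfig: it selects the
// CI or local table and falls back to a small default for unknown names.
func pickConfig(name string, inCI bool) DatasetConfig {
	configs := localConfigs
	if inCI {
		configs = ciConfigs
	}
	if cfg, ok := configs[name]; ok {
		return cfg
	}
	return fallbackConfig
}

func main() {
	fmt.Printf("%+v\n", pickConfig("semantic_search", true))
	fmt.Printf("%+v\n", pickConfig("semantic_search", false))
	fmt.Printf("%+v\n", pickConfig("no_such_dataset", true))
}
```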

type DatasetInfo

type DatasetInfo struct {
	Name        string
	Description string
	BaseFile    string // Filename for base vectors
	QueryFile   string // Filename for query vectors
	GTFile      string // Filename for ground truth
	N           int    // Number of vectors
	NQ          int    // Number of queries
	D           int    // Dimension
	Format      string // fvecs, bvecs, etc.
	URL         string // Download URL
}

DatasetInfo contains metadata about available datasets

func AvailableDatasets

func AvailableDatasets() []DatasetInfo

AvailableDatasets returns information about standard benchmark datasets

type GeneratorConfig

type GeneratorConfig struct {
	N            int              // Number of vectors
	D            int              // Dimension
	Distribution DataDistribution // Distribution type
	NumClusters  int              // For clustered data
	Sparsity     float64          // For sparse data (0.0-1.0)
	Seed         int64            // Random seed for reproducibility
}

GeneratorConfig configures synthetic data generation

type RealDataset

type RealDataset struct {
	Name        string
	Vectors     []float32 // Base vectors
	Queries     []float32 // Query vectors
	GroundTruth [][]int64 // Ground truth nearest neighbors for each query
	N           int       // Number of base vectors
	NQ          int       // Number of query vectors
	D           int       // Dimension
	K           int       // Number of neighbors in ground truth
}

RealDataset represents a real-world dataset for testing
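The GroundTruth and K fields exist so that an index's results can be scored. A minimal sketch of recall@K over such ground truth follows. The function `recallAtK` is not part of the documented API; it is included only to illustrate how the fields are typically used.

```go
package main

import "fmt"

// recallAtK is a hypothetical helper. For each query it counts how many of
// the first k ground-truth IDs appear among the first k returned IDs, then
// divides the total hits by the total number of ground-truth IDs considered.
func recallAtK(results, groundTruth [][]int64, k int) float64 {
	hits, total := 0, 0
	for q, truth := range groundTruth {
		if len(truth) > k {
			truth = truth[:k]
		}
		want := make(map[int64]bool, len(truth))
		for _, id := range truth {
			want[id] = true
		}
		total += len(truth)
		if q >= len(results) {
			continue // a missing result list counts as zero hits
		}
		got := results[q]
		if len(got) > k {
			got = got[:k]
		}
		for _, id := range got {
			if want[id] {
				hits++
				delete(want, id) // count a duplicated result only once
			}
		}
	}
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

func main() {
	groundTruth := [][]int64{{1, 2, 3}, {4, 5, 6}}
	results := [][]int64{{1, 3, 9}, {6, 5, 4}}
	fmt.Printf("recall@3 = %.3f\n", recallAtK(results, groundTruth, 3))
}
```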

func CreateSubset

func CreateSubset(source *RealDataset, nBase, nQuery int) *RealDataset

CreateSubset creates a smaller subset from a large dataset. Useful for creating SIFT10K from SIFT1M, etc.

func LoadDataset

func LoadDataset(name string, testdataPath string) (*RealDataset, error)

LoadDataset loads a dataset from the testdata directory

type SyntheticDataset

type SyntheticDataset struct {
	Vectors    []float32 // Flattened vectors (N*D)
	Queries    []float32 // Query vectors
	Labels     []int     // Cluster labels (for clustered data)
	N          int       // Number of vectors
	D          int       // Dimension
	NumQueries int       // Number of query vectors
}

SyntheticDataset contains generated vectors and metadata

func GenerateClusteredDataWithGroundTruth

func GenerateClusteredDataWithGroundTruth(n, d, numClusters int, seed int64) *SyntheticDataset

GenerateClusteredDataWithGroundTruth creates clustered data with known nearest neighbors. This is superior to random data for testing because:
- Recall is predictable (vectors in the same cluster should be nearest neighbors)
- Tests are reproducible with a fixed seed
- Indexes can be validated on whether they correctly identify cluster membership
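A rough sketch of seeded Gaussian cluster generation follows. The round-robin cluster assignment, the center spread of 10, the unit noise, and the helper name `clusteredData` are all assumptions chosen for illustration; the package's actual generator may differ.

```go
package main

import (
	"fmt"
	"math/rand"
)

// clusteredData is a hypothetical generator. It returns n flattened
// d-dimensional vectors plus one cluster label per vector. The same seed
// always produces the same output.
func clusteredData(n, d, numClusters int, seed int64) ([]float32, []int) {
	rng := rand.New(rand.NewSource(seed))

	// Spread the centers widely (std 10) relative to the within-cluster
	// noise (std 1) so that clusters rarely overlap.
	centers := make([]float32, numClusters*d)
	for i := range centers {
		centers[i] = float32(rng.NormFloat64()) * 10
	}

	vectors := make([]float32, n*d)
	labels := make([]int, n)
	for i := 0; i < n; i++ {
		c := i % numClusters // round-robin assignment (assumption)
		labels[i] = c
		for j := 0; j < d; j++ {
			vectors[i*d+j] = centers[c*d+j] + float32(rng.NormFloat64())
		}
	}
	return vectors, labels
}

func main() {
	vectors, labels := clusteredData(1000, 16, 10, 42)
	fmt.Println(len(vectors), "values,", len(labels), "labels")
}
```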

func GenerateCorrelatedVectors

func GenerateCorrelatedVectors(n, d, intrinsicDim int) *SyntheticDataset

GenerateCorrelatedVectors creates vectors with correlated dimensions (useful for testing PCA, dimensionality reduction)
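One standard way to obtain correlated dimensions is to sample a low-dimensional latent vector and project it into d dimensions through a fixed random matrix, so the data lies in a subspace of dimension at most intrinsicDim. The sketch below assumes that approach. The helper `correlatedVectors` and its extra `seed` parameter are hypothetical and not part of the documented signature.

```go
package main

import (
	"fmt"
	"math/rand"
)

// correlatedVectors is a hypothetical generator: each output vector is
// basis * latent, where basis is a fixed random d x intrinsicDim matrix and
// latent is a fresh Gaussian vector of length intrinsicDim.
func correlatedVectors(n, d, intrinsicDim int, seed int64) []float32 {
	rng := rand.New(rand.NewSource(seed))

	basis := make([]float32, d*intrinsicDim)
	for i := range basis {
		basis[i] = float32(rng.NormFloat64())
	}

	vectors := make([]float32, n*d)
	latent := make([]float32, intrinsicDim)
	for i := 0; i < n; i++ {
		for k := range latent {
			latent[k] = float32(rng.NormFloat64())
		}
		for j := 0; j < d; j++ {
			var sum float32
			for k := 0; k < intrinsicDim; k++ {
				sum += basis[j*intrinsicDim+k] * latent[k]
			}
			vectors[i*d+j] = sum
		}
	}
	return vectors
}

func main() {
	vectors := correlatedVectors(1000, 128, 8, 42)
	fmt.Println(len(vectors), "values in a subspace of dimension at most 8")
}
```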

func GenerateRealisticEmbeddings

func GenerateRealisticEmbeddings(n, d int) *SyntheticDataset

GenerateRealisticEmbeddings creates vectors that simulate real embeddings (e.g., BERT, OpenAI, etc.) with realistic properties

func GenerateSyntheticData

func GenerateSyntheticData(config GeneratorConfig) *SyntheticDataset

GenerateSyntheticData creates synthetic vectors based on configuration

func (*SyntheticDataset) GeneratePerturbedQueries

func (d *SyntheticDataset) GeneratePerturbedQueries(numQueries int, noiseLevel float32)

GeneratePerturbedQueries creates query vectors as noisy perturbations of actual vectors. This ensures queries have known nearest neighbors (the vectors they were perturbed from). noiseLevel controls the amount of noise (0.0 = identical, 0.1 = 10% noise, etc.).
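The idea can be sketched as follows, assuming the noise is additive Gaussian with standard deviation noiseLevel. The documentation's "10% noise" could instead be relative to vector magnitude, so treat the noise model as an assumption. The helper `perturbedQueries` and its `seed` parameter are hypothetical; it also returns the source index of each query so the expected nearest neighbor is explicit.

```go
package main

import (
	"fmt"
	"math/rand"
)

// perturbedQueries is a hypothetical helper. Each query copies a randomly
// chosen base vector and adds Gaussian noise scaled by noiseLevel. sources[q]
// is the index of the base vector that query q was derived from.
func perturbedQueries(vectors []float32, n, d, numQueries int, noiseLevel float32, seed int64) ([]float32, []int) {
	rng := rand.New(rand.NewSource(seed))
	queries := make([]float32, numQueries*d)
	sources := make([]int, numQueries)
	for q := 0; q < numQueries; q++ {
		src := rng.Intn(n)
		sources[q] = src
		for j := 0; j < d; j++ {
			noise := float32(rng.NormFloat64()) * noiseLevel
			queries[q*d+j] = vectors[src*d+j] + noise
		}
	}
	return queries, sources
}

func main() {
	base := []float32{1, 2, 3, 4, 5, 6} // three 2-dimensional vectors
	queries, sources := perturbedQueries(base, 3, 2, 4, 0.1, 42)
	fmt.Println(len(queries)/2, "queries derived from base vectors", sources)
}
```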

func (*SyntheticDataset) GenerateQueries

func (d *SyntheticDataset) GenerateQueries(numQueries int, distribution DataDistribution)

GenerateQueries creates query vectors from the same distribution

func (*SyntheticDataset) GenerateQueriesFromClusters

func (d *SyntheticDataset) GenerateQueriesFromClusters(numQueries int, noiseLevel float32)

GenerateQueriesFromClusters creates queries that are close to specific clusters. This allows for predictable recall testing:
- Query i will have its K nearest neighbors in cluster i % numClusters
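The `i % numClusters` rule can be illustrated with a sketch that derives each cluster's centroid from the labels and places query i near centroid `i % numClusters`. Computing centroids from labels, the additive Gaussian noise, and the helper `queriesFromClusters` are assumptions; the package may keep the original cluster centers instead.

```go
package main

import (
	"fmt"
	"math/rand"
)

// queriesFromClusters is a hypothetical helper. It averages the vectors of
// each cluster to get a centroid, then emits query q as the centroid of
// cluster q % numClusters plus Gaussian noise scaled by noiseLevel.
func queriesFromClusters(vectors []float32, labels []int, d, numClusters, numQueries int, noiseLevel float32, seed int64) []float32 {
	centroids := make([]float32, numClusters*d)
	counts := make([]int, numClusters)
	for i, c := range labels {
		counts[c]++
		for j := 0; j < d; j++ {
			centroids[c*d+j] += vectors[i*d+j]
		}
	}
	for c := 0; c < numClusters; c++ {
		if counts[c] == 0 {
			continue // leave an empty cluster's centroid at the origin
		}
		for j := 0; j < d; j++ {
			centroids[c*d+j] /= float32(counts[c])
		}
	}

	rng := rand.New(rand.NewSource(seed))
	queries := make([]float32, numQueries*d)
	for q := 0; q < numQueries; q++ {
		c := q % numClusters
		for j := 0; j < d; j++ {
			queries[q*d+j] = centroids[c*d+j] + float32(rng.NormFloat64())*noiseLevel
		}
	}
	return queries
}

func main() {
	// Two well-separated 2-dimensional clusters with two vectors each.
	vectors := []float32{0, 0, 2, 2, 10, 10, 12, 12}
	labels := []int{0, 0, 1, 1}
	queries := queriesFromClusters(vectors, labels, 2, 2, 3, 0.05, 42)
	fmt.Println(queries)
}
```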
