Documentation ¶
Index ¶
- Variables
- func GenerateStandardSizes() []struct{ ... }
- func IsCI() bool
- func IsDatasetAvailable(name string, testdataPath string) bool
- func SaveFVecs(filename string, vectors []float32, n, d int) error
- func SaveGroundTruth(filename string, groundTruth [][]int64) error
- type DataDistribution
- type DatasetConfig
- type DatasetInfo
- type GeneratorConfig
- type RealDataset
- type SyntheticDataset
Constants ¶
This section is empty.
Variables ¶
var CIDatasetConfigs = map[string]DatasetConfig{
"ivf_recall": {N: 5000, D: 128, NQ: 50, K: 10},
"hnsw_recall": {N: 5000, D: 128, NQ: 50, K: 10},
"pq_recall": {N: 5000, D: 128, NQ: 50, K: 10},
"ivfpq_best": {N: 10000, D: 256, NQ: 100, K: 10},
"ivf_optimal": {N: 10000, D: 256, NQ: 100, K: 10},
"ivf_training": {N: 10000, D: 128, NQ: 100, K: 10},
"semantic_search": {N: 10000, D: 384, NQ: 100, K: 10},
"image_similarity": {N: 10000, D: 512, NQ: 50, K: 10},
"recommendations": {N: 10000, D: 256, NQ: 50, K: 50},
"param_sweep": {N: 5000, D: 128, NQ: 50, K: 10},
"high_dimensional": {N: 5000, D: 1536, NQ: 50, K: 10},
}
CI-friendly dataset sizes (small, fast tests)
var LocalDatasetConfigs = map[string]DatasetConfig{
"ivf_recall": {N: 10000, D: 128, NQ: 100, K: 10},
"hnsw_recall": {N: 10000, D: 256, NQ: 100, K: 10},
"pq_recall": {N: 10000, D: 128, NQ: 100, K: 10},
"ivfpq_best": {N: 100000, D: 256, NQ: 100, K: 10},
"ivf_optimal": {N: 100000, D: 256, NQ: 100, K: 10},
"ivf_training": {N: 50000, D: 128, NQ: 100, K: 10},
"semantic_search": {N: 50000, D: 768, NQ: 500, K: 10},
"image_similarity": {N: 50000, D: 2048, NQ: 100, K: 10},
"recommendations": {N: 50000, D: 256, NQ: 100, K: 50},
"param_sweep": {N: 10000, D: 256, NQ: 100, K: 10},
"high_dimensional": {N: 10000, D: 1536, NQ: 100, K: 10},
}
Local testing dataset sizes (medium, realistic tests)
Functions ¶
func GenerateStandardSizes ¶
func GenerateStandardSizes() []struct{ ... }
GenerateStandardSizes returns standard dataset sizes for testing
func IsDatasetAvailable ¶
func IsDatasetAvailable(name string, testdataPath string) bool
IsDatasetAvailable checks whether a dataset has been downloaded
func SaveGroundTruth ¶
func SaveGroundTruth(filename string, groundTruth [][]int64) error
SaveGroundTruth saves ground truth in .ivecs format
Types ¶
type DataDistribution ¶
type DataDistribution int
DataDistribution represents different data distribution types
const (
	// UniformRandom generates uniformly distributed random vectors
	UniformRandom DataDistribution = iota
	// GaussianClustered generates vectors in Gaussian clusters
	GaussianClustered
	// PowerLaw generates vectors with power-law distance distribution
	PowerLaw
	// Normalized generates normalized (unit length) vectors
	Normalized
	// Sparse generates sparse vectors (many zeros)
	Sparse
)
type DatasetConfig ¶
type DatasetConfig struct {
N int // Number of vectors
D int // Dimension
NQ int // Number of queries
K int // Number of neighbors for recall
}
DatasetConfig defines the size parameters for a dataset
func GetDatasetConfig ¶
func GetDatasetConfig(name string) DatasetConfig
GetDatasetConfig returns the appropriate dataset configuration based on the environment. If name is not found, it returns a default small configuration.
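The lookup described here could look roughly like the sketch below. It is not the package's code: the maps are two-entry excerpts of CIDatasetConfigs and LocalDatasetConfigs, the helper names isCI and getConfig are hypothetical, detecting CI through the CI environment variable is an assumption, and the fallback values are invented for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// DatasetConfig mirrors the struct documented above.
type DatasetConfig struct {
	N, D, NQ, K int
}

// Two-entry excerpts of the documented maps, kept short for illustration.
var ciConfigs = map[string]DatasetConfig{
	"ivf_recall":       {N: 5000, D: 128, NQ: 50, K: 10},
	"high_dimensional": {N: 5000, D: 1536, NQ: 50, K: 10},
}

var localConfigs = map[string]DatasetConfig{
	"ivf_recall":       {N: 10000, D: 128, NQ: 100, K: 10},
	"high_dimensional": {N: 10000, D: 1536, NQ: 100, K: 10},
}

// isCI is an assumption: many CI systems set the CI variable, but the
// package's IsCI may check something else.
func isCI() bool {
	return os.Getenv("CI") != ""
}

// getConfig picks a map by environment and falls back to a small default.
// The default values here are hypothetical.
func getConfig(name string, ci bool) DatasetConfig {
	configs := localConfigs
	if ci {
		configs = ciConfigs
	}
	if cfg, ok := configs[name]; ok {
		return cfg
	}
	return DatasetConfig{N: 1000, D: 64, NQ: 10, K: 10}
}

func main() {
	cfg := getConfig("ivf_recall", isCI())
	fmt.Printf("N=%d D=%d NQ=%d K=%d\n", cfg.N, cfg.D, cfg.NQ, cfg.K)
}
```

Passing the CI flag as a parameter, instead of reading the environment inside getConfig, makes both branches testable without touching environment variables.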
type DatasetInfo ¶
type DatasetInfo struct {
Name string
Description string
BaseFile string // Filename for base vectors
QueryFile string // Filename for query vectors
GTFile string // Filename for ground truth
N int // Number of vectors
NQ int // Number of queries
D int // Dimension
Format string // fvecs, bvecs, etc.
URL string // Download URL
}
DatasetInfo contains metadata about available datasets
func AvailableDatasets ¶
func AvailableDatasets() []DatasetInfo
AvailableDatasets returns information about standard benchmark datasets
type GeneratorConfig ¶
type GeneratorConfig struct {
N int // Number of vectors
D int // Dimension
Distribution DataDistribution // Distribution type
NumClusters int // For clustered data
Sparsity float64 // For sparse data (0.0-1.0)
Seed int64 // Random seed for reproducibility
}
GeneratorConfig configures synthetic data generation
type RealDataset ¶
type RealDataset struct {
Name string
Vectors []float32 // Base vectors
Queries []float32 // Query vectors
GroundTruth [][]int64 // Ground truth nearest neighbors for each query
N int // Number of base vectors
NQ int // Number of query vectors
D int // Dimension
K int // Number of neighbors in ground truth
}
RealDataset represents a real-world dataset for testing
func CreateSubset ¶
func CreateSubset(source *RealDataset, nBase, nQuery int) *RealDataset
CreateSubset creates a smaller subset from a large dataset. It is useful for creating, for example, SIFT10K from SIFT1M.
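One way such a subset could be built is sketched below, assuming the subset keeps the first nBase base vectors and the first nQuery queries. The dataset type is a trimmed stand-in for RealDataset, the function name subset is hypothetical, and filtering the existing ground truth is an assumption; the package may recompute neighbors instead.

```go
package main

import "fmt"

// dataset is a trimmed-down stand-in for the RealDataset type documented above.
type dataset struct {
	Vectors, Queries []float32
	GroundTruth      [][]int64
	N, NQ, D         int
}

// subset keeps the first nBase base vectors and the first nQuery queries.
// Ground-truth IDs that point outside the kept range are dropped, so a
// list may end up shorter than the original K.
func subset(src *dataset, nBase, nQuery int) *dataset {
	if nBase > src.N {
		nBase = src.N
	}
	if nQuery > src.NQ {
		nQuery = src.NQ
	}
	out := &dataset{
		Vectors: src.Vectors[:nBase*src.D], // shares memory with src
		Queries: src.Queries[:nQuery*src.D],
		N:       nBase,
		NQ:      nQuery,
		D:       src.D,
	}
	for q := 0; q < nQuery && q < len(src.GroundTruth); q++ {
		var kept []int64
		for _, id := range src.GroundTruth[q] {
			if id < int64(nBase) {
				kept = append(kept, id)
			}
		}
		out.GroundTruth = append(out.GroundTruth, kept)
	}
	return out
}

func main() {
	src := &dataset{
		Vectors:     []float32{0, 0, 1, 1, 2, 2, 3, 3},
		Queries:     []float32{0.1, 0.1, 2.9, 2.9},
		GroundTruth: [][]int64{{0, 1, 2}, {3, 2, 1}},
		N:           4,
		NQ:          2,
		D:           2,
	}
	small := subset(src, 2, 1)
	fmt.Println(small.N, small.NQ, small.GroundTruth)
}
```

Filtering preserves the neighbor order among surviving IDs, but when full-length ground truth is required, recomputing it by brute force over the subset is the safer choice.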
func LoadDataset ¶
func LoadDataset(name string, testdataPath string) (*RealDataset, error)
LoadDataset loads a dataset from the testdata directory
type SyntheticDataset ¶
type SyntheticDataset struct {
Vectors []float32 // Flattened vectors (N*D)
Queries []float32 // Query vectors
Labels []int // Cluster labels (for clustered data)
N int // Number of vectors
D int // Dimension
NumQueries int // Number of query vectors
}
SyntheticDataset contains generated vectors and metadata
func GenerateClusteredDataWithGroundTruth ¶
func GenerateClusteredDataWithGroundTruth(n, d, numClusters int, seed int64) *SyntheticDataset
GenerateClusteredDataWithGroundTruth creates clustered data with known nearest neighbors. This is better suited to testing than random data because:
- Recall is predictable (vectors in the same cluster should be nearest neighbors)
- Tests are reproducible with a fixed seed
- It can validate that indexes correctly identify cluster membership
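The idea can be sketched as follows: draw widely separated centroids, then place each vector near one of them with a seeded generator. The function name clustered, the round-robin label assignment, and the noise scales are assumptions for illustration, not the package's actual generator.

```go
package main

import (
	"fmt"
	"math/rand"
)

// clustered returns n vectors of dimension d (flattened, length n*d) and
// one cluster label per vector. The same seed yields the same data.
func clustered(n, d, numClusters int, seed int64) ([]float32, []int) {
	rng := rand.New(rand.NewSource(seed))

	// Spread centroids widely relative to the unit-variance noise below,
	// so that same-cluster vectors are very likely each other's neighbors.
	centroids := make([]float32, numClusters*d)
	for i := range centroids {
		centroids[i] = float32(rng.NormFloat64() * 10)
	}

	vectors := make([]float32, n*d)
	labels := make([]int, n)
	for i := 0; i < n; i++ {
		c := i % numClusters // round-robin assignment (an assumption)
		labels[i] = c
		for j := 0; j < d; j++ {
			vectors[i*d+j] = centroids[c*d+j] + float32(rng.NormFloat64())
		}
	}
	return vectors, labels
}

func main() {
	vectors, labels := clustered(6, 4, 3, 42)
	fmt.Println(len(vectors), labels)
}
```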
func GenerateCorrelatedVectors ¶
func GenerateCorrelatedVectors(n, d, intrinsicDim int) *SyntheticDataset
GenerateCorrelatedVectors creates vectors with correlated dimensions (useful for testing PCA and other dimensionality reduction methods)
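A common way to obtain correlated dimensions is to sample a low-dimensional latent vector and project it into d dimensions with a fixed random matrix. The sketch below follows that approach; it is an assumption about what intrinsicDim means, and both the function name correlated and the seed parameter are hypothetical additions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// correlated returns n vectors of dimension d (flattened) that all lie in
// a random intrinsicDim-dimensional linear subspace.
func correlated(n, d, intrinsicDim int, seed int64) []float32 {
	rng := rand.New(rand.NewSource(seed))

	// basis is an intrinsicDim x d projection from latent to observed space.
	basis := make([]float64, intrinsicDim*d)
	for i := range basis {
		basis[i] = rng.NormFloat64()
	}

	out := make([]float32, n*d)
	latent := make([]float64, intrinsicDim)
	for i := 0; i < n; i++ {
		for k := range latent {
			latent[k] = rng.NormFloat64()
		}
		for j := 0; j < d; j++ {
			var sum float64
			for k, z := range latent {
				sum += z * basis[k*d+j]
			}
			out[i*d+j] = float32(sum)
		}
	}
	return out
}

func main() {
	vectors := correlated(100, 16, 4, 42)
	fmt.Println("values generated:", len(vectors))
}
```

With this construction, PCA on the output should find at most intrinsicDim components with non-negligible variance, which gives a test a known answer to check against.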
func GenerateRealisticEmbeddings ¶
func GenerateRealisticEmbeddings(n, d int) *SyntheticDataset
GenerateRealisticEmbeddings creates vectors that simulate real embeddings (e.g., BERT or OpenAI) with realistic properties
func GenerateSyntheticData ¶
func GenerateSyntheticData(config GeneratorConfig) *SyntheticDataset
GenerateSyntheticData creates synthetic vectors based on configuration
func (*SyntheticDataset) GeneratePerturbedQueries ¶
func (d *SyntheticDataset) GeneratePerturbedQueries(numQueries int, noiseLevel float32)
GeneratePerturbedQueries creates query vectors as noisy perturbations of actual vectors. This ensures that queries have known nearest neighbors (the vectors they were perturbed from). noiseLevel controls the amount of noise (0.0 = identical, 0.1 = 10% noise, etc.).
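A minimal sketch of the perturbation idea follows. The function name perturbedQueries is hypothetical, and scaling unit Gaussian noise by noiseLevel is an assumption; the package may define "10% noise" relative to each vector's magnitude instead. Returning the source index of each query is also an addition, made so that the expected nearest neighbor can be checked.

```go
package main

import (
	"fmt"
	"math/rand"
)

// perturbedQueries picks numQueries base vectors at random and adds
// Gaussian noise scaled by noiseLevel. It returns the flattened queries
// and, for each query, the index of the base vector it was derived from.
func perturbedQueries(vectors []float32, n, d, numQueries int, noiseLevel float32, seed int64) ([]float32, []int) {
	rng := rand.New(rand.NewSource(seed))
	queries := make([]float32, numQueries*d)
	sources := make([]int, numQueries)
	for q := 0; q < numQueries; q++ {
		src := rng.Intn(n)
		sources[q] = src
		for j := 0; j < d; j++ {
			noise := float32(rng.NormFloat64()) * noiseLevel
			queries[q*d+j] = vectors[src*d+j] + noise
		}
	}
	return queries, sources
}

func main() {
	base := []float32{0, 0, 10, 10, 20, 20} // three 2-dimensional vectors
	queries, sources := perturbedQueries(base, 3, 2, 4, 0.1, 42)
	fmt.Println(len(queries), sources)
}
```

When the noise is small relative to the spacing between base vectors, sources[q] is the expected top-1 result for query q, which gives a cheap recall check without brute-force ground truth.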
func (*SyntheticDataset) GenerateQueries ¶
func (d *SyntheticDataset) GenerateQueries(numQueries int, distribution DataDistribution)
GenerateQueries creates query vectors from the same distribution
func (*SyntheticDataset) GenerateQueriesFromClusters ¶
func (d *SyntheticDataset) GenerateQueriesFromClusters(numQueries int, noiseLevel float32)
GenerateQueriesFromClusters creates queries that are close to specific clusters. This allows for predictable recall testing:
- Query i will have its K nearest neighbors in cluster i % numClusters
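The cluster-targeted queries could be produced along the lines of the sketch below: estimate each cluster's centroid from the labels, then place query i near the centroid of cluster i % numClusters. The function name queriesFromClusters, the centroid-from-labels step, and the noise scaling are assumptions, since the documentation does not say how the package derives its cluster centers.

```go
package main

import (
	"fmt"
	"math/rand"
)

// queriesFromClusters returns numQueries flattened queries, where query i
// is the mean of cluster i % numClusters plus Gaussian noise scaled by
// noiseLevel.
func queriesFromClusters(vectors []float32, labels []int, d, numClusters, numQueries int, noiseLevel float32, seed int64) []float32 {
	// Estimate centroids as per-cluster means of the labeled vectors.
	centroids := make([]float32, numClusters*d)
	counts := make([]int, numClusters)
	for i, c := range labels {
		counts[c]++
		for j := 0; j < d; j++ {
			centroids[c*d+j] += vectors[i*d+j]
		}
	}
	for c, count := range counts {
		if count == 0 {
			continue // empty cluster keeps a zero centroid
		}
		for j := 0; j < d; j++ {
			centroids[c*d+j] /= float32(count)
		}
	}

	rng := rand.New(rand.NewSource(seed))
	queries := make([]float32, numQueries*d)
	for q := 0; q < numQueries; q++ {
		c := q % numClusters
		for j := 0; j < d; j++ {
			noise := float32(rng.NormFloat64()) * noiseLevel
			queries[q*d+j] = centroids[c*d+j] + noise
		}
	}
	return queries
}

func main() {
	vectors := []float32{0, 0, 2, 2, 10, 10, 12, 12}
	labels := []int{0, 0, 1, 1}
	queries := queriesFromClusters(vectors, labels, 2, 2, 3, 0.1, 42)
	fmt.Println("query values:", len(queries))
}
```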