datasets

package

Version: v0.0.0-...-8055bb4 | Published: Jan 5, 2026 | License: MIT | Imports: 8 | Imported by: 0


Repository

github.com/NerdMeNot/faiss-go


Documentation ¶

Index ¶

Variables
func GenerateStandardSizes() []struct{ ... }
func IsCI() bool
func IsDatasetAvailable(name string, testdataPath string) bool
func SaveFVecs(filename string, vectors []float32, n, d int) error
func SaveGroundTruth(filename string, groundTruth [][]int64) error
type DataDistribution
type DatasetConfig
- func GetDatasetConfig(name string) DatasetConfig
type DatasetInfo
- func AvailableDatasets() []DatasetInfo
type GeneratorConfig
type RealDataset
- func CreateSubset(source *RealDataset, nBase, nQuery int) *RealDataset
- func LoadDataset(name string, testdataPath string) (*RealDataset, error)
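type SyntheticDataset
- func GenerateClusteredDataWithGroundTruth(n, d, numClusters int, seed int64) *SyntheticDataset
- func GenerateCorrelatedVectors(n, d, intrinsicDim int) *SyntheticDataset
- func GenerateRealisticEmbeddings(n, d int) *SyntheticDataset
- func GenerateSyntheticData(config GeneratorConfig) *SyntheticDataset
- func (d *SyntheticDataset) GeneratePerturbedQueries(numQueries int, noiseLevel float32)
- func (d *SyntheticDataset) GenerateQueries(numQueries int, distribution DataDistribution)
- func (d *SyntheticDataset) GenerateQueriesFromClusters(numQueries int, noiseLevel float32)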

Constants ¶

This section is empty.

Variables ¶


var CIDatasetConfigs = map[string]DatasetConfig{

	"ivf_recall":   {N: 5000, D: 128, NQ: 50, K: 10},
	"hnsw_recall":  {N: 5000, D: 128, NQ: 50, K: 10},
	"pq_recall":    {N: 5000, D: 128, NQ: 50, K: 10},
	"ivfpq_best":   {N: 10000, D: 256, NQ: 100, K: 10},
	"ivf_optimal":  {N: 10000, D: 256, NQ: 100, K: 10},
	"ivf_training": {N: 10000, D: 128, NQ: 100, K: 10},

	"semantic_search":  {N: 10000, D: 384, NQ: 100, K: 10},
	"image_similarity": {N: 10000, D: 512, NQ: 50, K: 10},
	"recommendations":  {N: 10000, D: 256, NQ: 50, K: 50},

	"param_sweep":      {N: 5000, D: 128, NQ: 50, K: 10},
	"high_dimensional": {N: 5000, D: 1536, NQ: 50, K: 10},
}

CI-friendly dataset sizes (small, fast tests)


var LocalDatasetConfigs = map[string]DatasetConfig{

	"ivf_recall":   {N: 10000, D: 128, NQ: 100, K: 10},
	"hnsw_recall":  {N: 10000, D: 256, NQ: 100, K: 10},
	"pq_recall":    {N: 10000, D: 128, NQ: 100, K: 10},
	"ivfpq_best":   {N: 100000, D: 256, NQ: 100, K: 10},
	"ivf_optimal":  {N: 100000, D: 256, NQ: 100, K: 10},
	"ivf_training": {N: 50000, D: 128, NQ: 100, K: 10},

	"semantic_search":  {N: 50000, D: 768, NQ: 500, K: 10},
	"image_similarity": {N: 50000, D: 2048, NQ: 100, K: 10},
	"recommendations":  {N: 50000, D: 256, NQ: 100, K: 50},

	"param_sweep":      {N: 10000, D: 256, NQ: 100, K: 10},
	"high_dimensional": {N: 10000, D: 1536, NQ: 100, K: 10},
}

Local testing dataset sizes (medium, realistic tests)

Functions ¶

func GenerateStandardSizes ¶

func GenerateStandardSizes() []struct {
	Name string
	N    int
	D    int
}

GenerateStandardSizes returns standard dataset sizes for testing

func IsCI ¶

func IsCI() bool

IsCI detects if running in a CI environment
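The page does not say which signals IsCI inspects. A minimal sketch, assuming detection relies on the conventional `CI` and `GITHUB_ACTIONS` environment variables; the helper name `detectCI` and the variable list are assumptions for illustration, not the package's implementation:

```go
package main

import (
	"fmt"
	"os"
)

// detectCI is a hypothetical stand-in for IsCI. It reports true when any
// commonly used CI environment variable is non-empty. The lookup function is
// a parameter so the logic can be exercised without touching the process
// environment.
func detectCI(getenv func(string) string) bool {
	for _, key := range []string{"CI", "GITHUB_ACTIONS"} {
		if getenv(key) != "" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("running in CI:", detectCI(os.Getenv))
}
```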

func IsDatasetAvailable ¶

func IsDatasetAvailable(name string, testdataPath string) bool

IsDatasetAvailable checks if a dataset is downloaded

func SaveFVecs ¶

func SaveFVecs(filename string, vectors []float32, n, d int) error

SaveFVecs saves vectors to .fvecs format (useful for creating test datasets)
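For reference, the .fvecs layout stores each vector as a little-endian int32 dimension followed by that many float32 values. The sketch below encodes a flattened slice in that layout; `encodeFVecs` is a hypothetical helper, and whether SaveFVecs validates its input the same way is an assumption:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeFVecs is a hypothetical helper that serializes n vectors of
// dimension d from a flattened slice into the .fvecs layout:
// per vector, an int32 dimension followed by d float32 values.
func encodeFVecs(vectors []float32, n, d int) ([]byte, error) {
	if n < 0 || d <= 0 || len(vectors) < n*d {
		return nil, fmt.Errorf("invalid shape: n=%d d=%d len=%d", n, d, len(vectors))
	}
	var buf bytes.Buffer
	for i := 0; i < n; i++ {
		// Record header: the dimension of this vector.
		if err := binary.Write(&buf, binary.LittleEndian, int32(d)); err != nil {
			return nil, err
		}
		// Record body: the d components of vector i.
		if err := binary.Write(&buf, binary.LittleEndian, vectors[i*d:(i+1)*d]); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := encodeFVecs([]float32{1, 2, 3, 4, 5, 6}, 2, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(data)) // 2 records * (4 + 3*4) bytes = 32
}
```

The resulting bytes can be written to disk with `os.WriteFile`.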

func SaveGroundTruth ¶

func SaveGroundTruth(filename string, groundTruth [][]int64) error

SaveGroundTruth saves ground truth to .ivecs format
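The .ivecs layout mirrors .fvecs but stores int32 values: each row is a little-endian int32 count followed by that many int32 IDs. Because the documented signature takes `[][]int64`, the IDs have to be narrowed somewhere. The sketch below does so with a range check; `encodeIVecs` is a hypothetical helper, and how SaveGroundTruth handles out-of-range IDs is not documented:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"math"
)

// encodeIVecs is a hypothetical helper that serializes ground-truth rows
// into the .ivecs layout: per row, an int32 count followed by int32 IDs.
func encodeIVecs(groundTruth [][]int64) ([]byte, error) {
	var buf bytes.Buffer
	for q, ids := range groundTruth {
		row := make([]int32, len(ids))
		for j, id := range ids {
			// .ivecs holds 32-bit integers, so reject IDs that do not fit.
			if id < math.MinInt32 || id > math.MaxInt32 {
				return nil, fmt.Errorf("query %d: id %d does not fit in int32", q, id)
			}
			row[j] = int32(id)
		}
		if err := binary.Write(&buf, binary.LittleEndian, int32(len(row))); err != nil {
			return nil, err
		}
		if err := binary.Write(&buf, binary.LittleEndian, row); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := encodeIVecs([][]int64{{7, 8}, {9, 10}})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(data)) // 2 rows * (4 + 2*4) bytes = 24
}
```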

Types ¶

type DataDistribution ¶

type DataDistribution int

DataDistribution represents different data distribution types

const (
	// UniformRandom generates uniformly distributed random vectors
	UniformRandom DataDistribution = iota
	// GaussianClustered generates vectors in Gaussian clusters
	GaussianClustered
	// PowerLaw generates vectors with power-law distance distribution
	PowerLaw
	// Normalized generates normalized (unit length) vectors
	Normalized
	// Sparse generates sparse vectors (many zeros)
	Sparse
)

type DatasetConfig ¶

type DatasetConfig struct {
	N  int // Number of vectors
	D  int // Dimension
	NQ int // Number of queries
	K  int // Number of neighbors for recall
}

DatasetConfig defines the size parameters for a dataset
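The K field sets how many neighbors count toward recall. As an illustration of how recall@K is commonly computed against ground truth, here is a minimal sketch; `recallAtK` and `firstK` are hypothetical helpers that are not part of this package, and result IDs are assumed to be unique within each row:

```go
package main

import "fmt"

// firstK returns at most the first k entries of ids.
func firstK(ids []int64, k int) []int64 {
	if len(ids) > k {
		return ids[:k]
	}
	return ids
}

// recallAtK returns the fraction of true top-k neighbors that appear in the
// returned top-k results, pooled over all queries.
func recallAtK(results, groundTruth [][]int64, k int) float64 {
	if k <= 0 {
		return 0
	}
	hits, total := 0, 0
	for q := 0; q < len(results) && q < len(groundTruth); q++ {
		truth := make(map[int64]struct{}, k)
		for _, id := range firstK(groundTruth[q], k) {
			truth[id] = struct{}{}
		}
		for _, id := range firstK(results[q], k) {
			if _, ok := truth[id]; ok {
				hits++
			}
		}
		total += len(truth)
	}
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

func main() {
	results := [][]int64{{1, 2, 3}, {4, 5, 6}}
	truth := [][]int64{{1, 2, 9}, {4, 8, 7}}
	fmt.Println(recallAtK(results, truth, 3)) // 3 of 6 true neighbors found: 0.5
}
```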

func GetDatasetConfig ¶

func GetDatasetConfig(name string) DatasetConfig

GetDatasetConfig returns the appropriate dataset configuration for the current environment. If name is not found, it returns a default small configuration.
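The lookup described here can be sketched as a table selection followed by a fallback. This is an illustration only: `pickConfig` is a hypothetical stand-in, the environment flag is passed in explicitly rather than detected, and the fallback values are assumed because the real defaults are not documented on this page.

```go
package main

import "fmt"

// DatasetConfig mirrors the documented struct.
type DatasetConfig struct {
	N, D, NQ, K int
}

// One entry from each documented table, enough to show the lookup.
var ciConfigs = map[string]DatasetConfig{
	"ivf_recall": {N: 5000, D: 128, NQ: 50, K: 10},
}

var localConfigs = map[string]DatasetConfig{
	"ivf_recall": {N: 10000, D: 128, NQ: 100, K: 10},
}

// pickConfig is a hypothetical stand-in for GetDatasetConfig: it chooses the
// CI or local table and falls back to a small default for unknown names.
func pickConfig(name string, inCI bool) DatasetConfig {
	table := localConfigs
	if inCI {
		table = ciConfigs
	}
	if cfg, ok := table[name]; ok {
		return cfg
	}
	// Assumed fallback values; the package's actual defaults may differ.
	return DatasetConfig{N: 1000, D: 64, NQ: 10, K: 10}
}

func main() {
	fmt.Printf("%+v\n", pickConfig("ivf_recall", true))
	fmt.Printf("%+v\n", pickConfig("ivf_recall", false))
	fmt.Printf("%+v\n", pickConfig("unknown", false))
}
```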

type DatasetInfo ¶

type DatasetInfo struct {
	Name        string
	Description string
	BaseFile    string // Filename for base vectors
	QueryFile   string // Filename for query vectors
	GTFile      string // Filename for ground truth
	N           int    // Number of vectors
	NQ          int    // Number of queries
	D           int    // Dimension
	Format      string // fvecs, bvecs, etc.
	URL         string // Download URL
}

DatasetInfo contains metadata about available datasets

func AvailableDatasets ¶

func AvailableDatasets() []DatasetInfo

AvailableDatasets returns information about standard benchmark datasets

type GeneratorConfig ¶

type GeneratorConfig struct {
	N            int              // Number of vectors
	D            int              // Dimension
	Distribution DataDistribution // Distribution type
	NumClusters  int              // For clustered data
	Sparsity     float64          // For sparse data (0.0-1.0)
	Seed         int64            // Random seed for reproducibility
}

GeneratorConfig configures synthetic data generation

type RealDataset ¶

type RealDataset struct {
	Name        string
	Vectors     []float32 // Base vectors
	Queries     []float32 // Query vectors
	GroundTruth [][]int64 // Ground truth nearest neighbors for each query
	N           int       // Number of base vectors
	NQ          int       // Number of query vectors
	D           int       // Dimension
	K           int       // Number of neighbors in ground truth
}

RealDataset represents a real-world dataset for testing

func CreateSubset ¶

func CreateSubset(source *RealDataset, nBase, nQuery int) *RealDataset

CreateSubset creates a smaller subset from a large dataset. Useful for creating SIFT10K from SIFT1M, etc.
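One plausible way to build such a subset is to keep the first nBase base vectors and the first nQuery queries, then drop ground-truth IDs that point outside the retained base vectors. The sketch below takes that approach; it is an assumption about the strategy, not the package's confirmed behavior, and `subset` is a hypothetical name. Filtering keeps each row exact because the surviving IDs are still the nearest retained vectors, in order, although rows may end up shorter than the original K.

```go
package main

import "fmt"

// RealDataset mirrors the documented struct.
type RealDataset struct {
	Name        string
	Vectors     []float32
	Queries     []float32
	GroundTruth [][]int64
	N, NQ, D, K int
}

// subset is a hypothetical stand-in for CreateSubset. The returned slices
// share backing storage with the source dataset.
func subset(src *RealDataset, nBase, nQuery int) *RealDataset {
	if nBase > src.N {
		nBase = src.N
	}
	if nQuery > src.NQ {
		nQuery = src.NQ
	}
	out := &RealDataset{
		Name:    src.Name,
		Vectors: src.Vectors[:nBase*src.D],
		Queries: src.Queries[:nQuery*src.D],
		N:       nBase,
		NQ:      nQuery,
		D:       src.D,
	}
	for q := 0; q < nQuery && q < len(src.GroundTruth); q++ {
		kept := []int64{}
		for _, id := range src.GroundTruth[q] {
			if id >= 0 && id < int64(nBase) {
				kept = append(kept, id) // neighbor survives the cut
			}
		}
		out.GroundTruth = append(out.GroundTruth, kept)
		// K is the shortest surviving row, so every query has K neighbors.
		if q == 0 || len(kept) < out.K {
			out.K = len(kept)
		}
	}
	return out
}

func main() {
	src := &RealDataset{
		Name:        "toy",
		Vectors:     []float32{0, 0, 1, 1, 2, 2, 3, 3},
		Queries:     []float32{0, 1, 2, 3},
		GroundTruth: [][]int64{{3, 0, 1}, {2, 1, 0}},
		N:           4, NQ: 2, D: 2, K: 3,
	}
	small := subset(src, 2, 1)
	fmt.Println(small.N, small.NQ, small.K, small.GroundTruth)
}
```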

func LoadDataset ¶

func LoadDataset(name string, testdataPath string) (*RealDataset, error)

LoadDataset loads a dataset from the testdata directory

type SyntheticDataset ¶

type SyntheticDataset struct {
	Vectors    []float32 // Flattened vectors (N*D)
	Queries    []float32 // Query vectors
	Labels     []int     // Cluster labels (for clustered data)
	N          int       // Number of vectors
	D          int       // Dimension
	NumQueries int       // Number of query vectors
}

SyntheticDataset contains generated vectors and metadata

func GenerateClusteredDataWithGroundTruth ¶

func GenerateClusteredDataWithGroundTruth(n, d, numClusters int, seed int64) *SyntheticDataset

GenerateClusteredDataWithGroundTruth creates clustered data with known nearest neighbors. This is better than random data for testing because:
- Recall is predictable (vectors in the same cluster should be nearest neighbors)
- Tests are reproducible with a fixed seed
- Indexes can be validated for correctly identifying cluster membership
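To make the idea concrete, the following sketch draws cluster centers from a seeded generator and scatters points tightly around them. The function name `clustered`, the round-robin label assignment, and the spread constants (10 for centers, 0.5 within a cluster) are all assumptions for illustration; the package may generate its data differently.

```go
package main

import (
	"fmt"
	"math/rand"
)

// clustered is a hypothetical sketch of seeded Gaussian cluster generation.
// It returns flattened vectors (n*d values) and one cluster label per vector.
func clustered(n, d, numClusters int, seed int64) ([]float32, []int) {
	if numClusters < 1 {
		numClusters = 1
	}
	rng := rand.New(rand.NewSource(seed)) // fixed seed gives reproducible data

	// Widely separated centers (assumed scale of 10).
	centers := make([]float32, numClusters*d)
	for i := range centers {
		centers[i] = float32(rng.NormFloat64()) * 10
	}

	vectors := make([]float32, n*d)
	labels := make([]int, n)
	for i := 0; i < n; i++ {
		c := i % numClusters // round-robin cluster assignment
		labels[i] = c
		for j := 0; j < d; j++ {
			// Small within-cluster spread (assumed scale of 0.5).
			vectors[i*d+j] = centers[c*d+j] + float32(rng.NormFloat64())*0.5
		}
	}
	return vectors, labels
}

func main() {
	vectors, labels := clustered(1000, 16, 10, 42)
	fmt.Println(len(vectors), len(labels))
}
```

Because centers are far apart relative to the within-cluster spread, the nearest neighbors of a vector are very likely to carry the same label, which is what makes recall predictable.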

func GenerateCorrelatedVectors ¶

func GenerateCorrelatedVectors(n, d, intrinsicDim int) *SyntheticDataset

GenerateCorrelatedVectors creates vectors with correlated dimensions (useful for testing PCA, dimensionality reduction)
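A common way to obtain correlated dimensions is to sample a low-dimensional latent vector and project it into d dimensions through a fixed random mixing matrix, so the data lies in a subspace of dimension intrinsicDim. The sketch below takes that approach. It is an assumption about the technique, not the package's confirmed method, and the `seed` parameter and the name `correlatedVectors` are additions for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// correlatedVectors is a hypothetical sketch: each output vector is a random
// latent vector of length intrinsicDim multiplied by a fixed
// intrinsicDim x d mixing matrix.
func correlatedVectors(n, d, intrinsicDim int, seed int64) []float32 {
	rng := rand.New(rand.NewSource(seed))

	// Fixed mixing matrix, stored row-major (intrinsicDim rows, d columns).
	mix := make([]float64, intrinsicDim*d)
	for i := range mix {
		mix[i] = rng.NormFloat64()
	}

	vectors := make([]float32, n*d)
	latent := make([]float64, intrinsicDim)
	for i := 0; i < n; i++ {
		for k := range latent {
			latent[k] = rng.NormFloat64()
		}
		for j := 0; j < d; j++ {
			var sum float64
			for k, z := range latent {
				sum += z * mix[k*d+j]
			}
			vectors[i*d+j] = float32(sum)
		}
	}
	return vectors
}

func main() {
	vecs := correlatedVectors(1000, 64, 8, 1)
	fmt.Println(len(vecs)) // 1000 * 64 = 64000
}
```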

func GenerateRealisticEmbeddings ¶

func GenerateRealisticEmbeddings(n, d int) *SyntheticDataset

GenerateRealisticEmbeddings creates vectors that simulate real embeddings (e.g., BERT, OpenAI, etc.) with realistic properties

func GenerateSyntheticData ¶

func GenerateSyntheticData(config GeneratorConfig) *SyntheticDataset

GenerateSyntheticData creates synthetic vectors based on configuration

func (*SyntheticDataset) GeneratePerturbedQueries ¶

func (d *SyntheticDataset) GeneratePerturbedQueries(numQueries int, noiseLevel float32)

GeneratePerturbedQueries creates query vectors as noisy perturbations of actual vectors. This ensures queries have known nearest neighbors (the vectors they were perturbed from). noiseLevel controls the amount of noise (0.0 = identical, 0.1 = 10% noise, etc.).
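The perturbation idea can be sketched as follows: pick a base vector at random, copy it, and add Gaussian noise scaled by noiseLevel. How the package scales the noise is not documented, so the scaling here is an assumption, as are the name `perturbedQueries`, the `seed` parameter, and the returned source indices.

```go
package main

import (
	"fmt"
	"math/rand"
)

// perturbedQueries is a hypothetical sketch. It returns flattened queries
// (numQueries*d values) and, for each query, the index of the base vector
// it was derived from, which is its expected nearest neighbor.
func perturbedQueries(vectors []float32, n, d, numQueries int, noiseLevel float32, seed int64) ([]float32, []int) {
	rng := rand.New(rand.NewSource(seed))
	queries := make([]float32, numQueries*d)
	sources := make([]int, numQueries)
	for q := 0; q < numQueries; q++ {
		src := rng.Intn(n) // base vector to perturb
		sources[q] = src
		for j := 0; j < d; j++ {
			noise := float32(rng.NormFloat64()) * noiseLevel
			queries[q*d+j] = vectors[src*d+j] + noise
		}
	}
	return queries, sources
}

func main() {
	base := []float32{1, 2, 3, 4, 5, 6} // three 2-dimensional vectors
	queries, sources := perturbedQueries(base, 3, 2, 4, 0.1, 7)
	fmt.Println(len(queries), sources)
}
```

With a small noiseLevel, a correct index should return `sources[q]` as the top result for query q.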

func (*SyntheticDataset) GenerateQueries ¶

func (d *SyntheticDataset) GenerateQueries(numQueries int, distribution DataDistribution)

GenerateQueries creates query vectors from the same distribution

func (*SyntheticDataset) GenerateQueriesFromClusters ¶

func (d *SyntheticDataset) GenerateQueriesFromClusters(numQueries int, noiseLevel float32)

GenerateQueriesFromClusters creates queries that are close to specific clusters. This allows for predictable recall testing: query i will have its K nearest neighbors in cluster i % numClusters.
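The `i % numClusters` rule can be illustrated by placing query i near the centroid of cluster `i % numClusters`. In the sketch below the centroids are recomputed from the vectors and their labels. That choice, the noise scaling, the `seed` parameter, and the name `queriesFromClusters` are assumptions for illustration rather than the package's implementation.

```go
package main

import (
	"fmt"
	"math/rand"
)

// queriesFromClusters is a hypothetical sketch. Query q is the centroid of
// cluster q % numClusters plus Gaussian noise scaled by noiseLevel.
func queriesFromClusters(vectors []float32, labels []int, d, numClusters, numQueries int, noiseLevel float32, seed int64) []float32 {
	// Compute one centroid per cluster from the labeled vectors.
	centroids := make([]float32, numClusters*d)
	counts := make([]int, numClusters)
	for i, c := range labels {
		counts[c]++
		for j := 0; j < d; j++ {
			centroids[c*d+j] += vectors[i*d+j]
		}
	}
	for c := 0; c < numClusters; c++ {
		if counts[c] == 0 {
			continue // empty cluster keeps a zero centroid
		}
		for j := 0; j < d; j++ {
			centroids[c*d+j] /= float32(counts[c])
		}
	}

	rng := rand.New(rand.NewSource(seed))
	queries := make([]float32, numQueries*d)
	for q := 0; q < numQueries; q++ {
		c := q % numClusters // the documented query-to-cluster mapping
		for j := 0; j < d; j++ {
			queries[q*d+j] = centroids[c*d+j] + float32(rng.NormFloat64())*noiseLevel
		}
	}
	return queries
}

func main() {
	vectors := []float32{0, 0, 2, 2, 10, 10, 12, 12}
	labels := []int{0, 0, 1, 1}
	fmt.Println(queriesFromClusters(vectors, labels, 2, 2, 3, 0, 1)) // [1 1 11 11 1 1]
}
```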


Directories ¶

Path	Synopsis
groundtruth
