benchmarks

package
v0.8.0 Latest
Published: Mar 29, 2026 License: MIT Imports: 24 Imported by: 0

Documentation

Overview

ContextBench runner — measures context retrieval quality on the ContextBench dataset (1,136 tasks, 66 repos, 8 languages).

For each task the runner:

  1. Clones the repo at base_commit (full depth, not shallow)
  2. Checks out base_commit and re-indexes with Synapses
  3. Calls Synapses tools (search, prepare_context, get_impact) guided by the problem statement to retrieve context
  4. Compares retrieved file+line ranges against gold_context annotations
  5. Computes Context Precision, Recall, and F1
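The scoring in step 5 treats gold and retrieved context as sets of file+line pairs. A minimal sketch of that computation (illustrative only; the runner's actual types and field names differ):

```go
package main

import "fmt"

// lineSet expands inclusive [start, end] ranges into a set of "file:line" keys.
func lineSet(ranges map[string][][2]int) map[string]bool {
	s := make(map[string]bool)
	for file, rs := range ranges {
		for _, r := range rs {
			for ln := r[0]; ln <= r[1]; ln++ {
				s[fmt.Sprintf("%s:%d", file, ln)] = true
			}
		}
	}
	return s
}

// scoreContext computes Context Precision, Recall, and F1 from gold and
// retrieved file+line ranges.
func scoreContext(gold, retrieved map[string][][2]int) (p, r, f1 float64) {
	gs, rs := lineSet(gold), lineSet(retrieved)
	hits := 0
	for k := range rs {
		if gs[k] {
			hits++
		}
	}
	if len(rs) > 0 {
		p = float64(hits) / float64(len(rs))
	}
	if len(gs) > 0 {
		r = float64(hits) / float64(len(gs))
	}
	if p+r > 0 {
		f1 = 2 * p * r / (p + r)
	}
	return
}

func main() {
	gold := map[string][][2]int{"a.py": {{10, 19}}} // 10 gold lines
	got := map[string][][2]int{"a.py": {{15, 24}}}  // 10 retrieved, 5 overlap
	p, r, f1 := scoreContext(gold, got)
	fmt.Printf("P=%.2f R=%.2f F1=%.2f\n", p, r, f1) // P=0.50 R=0.50 F1=0.50
}
```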

Dataset: huggingface.co/datasets/Contextbench/ContextBench

Export to JSONL:

python -c "
from datasets import load_dataset
ds = load_dataset('Contextbench/ContextBench', 'default', split='train')
ds.to_json('contextbench.jsonl')
"

Gold context format per task (JSON array):

[{"file": "path/to/file.py", "start_line": 10, "end_line": 25, "content": "..."}]
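Decoding that array into the package's GoldContextBlock type (defined under Types below) is a single json.Unmarshal; the helper name here is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GoldContextBlock is one annotated context region (mirrors the package type).
type GoldContextBlock struct {
	File      string `json:"file"`
	StartLine int    `json:"start_line"`
	EndLine   int    `json:"end_line"`
	Content   string `json:"content"`
}

// parseGoldContext decodes the gold_context JSON array string from a task.
func parseGoldContext(raw string) ([]GoldContextBlock, error) {
	var blocks []GoldContextBlock
	err := json.Unmarshal([]byte(raw), &blocks)
	return blocks, err
}

func main() {
	raw := `[{"file": "path/to/file.py", "start_line": 10, "end_line": 25, "content": "..."}]`
	blocks, err := parseGoldContext(raw)
	if err != nil {
		panic(err)
	}
	// 16 gold lines: 10 through 25 inclusive.
	fmt.Println(blocks[0].File, blocks[0].EndLine-blocks[0].StartLine+1)
}
```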

featurebench.go implements the FeatureBench benchmark runner. It measures whether giving Claude Code access to Synapses MCP tools improves its ability to implement features (pass@1 delta on FeatureBench tasks).

Unlike swebench.go which uses the Go agent loop + Anthropic API directly, this runner shells out to `claude -p` so it works with Max subscriptions (OAuth) without requiring an API key. Synapses tools are loaded via .mcp.json.

GraphBench — Graph Accuracy Benchmark (Benchmark A).

Tests whether Synapses' structural graph correctly represents code relationships. Unlike ContextBench (which conflates graph quality with retrieval strategy), GraphBench isolates graph correctness.

Query types:

  • find_callers(symbol) — who calls this? via get_impact depth=1
  • find_callees(symbol) — what does this call? via get_context format=json
  • find_imports(file) — what does this file import? via get_context format=json
  • impact_analysis(symbol) — what's affected? via get_impact depth=3
  • find_implementations(iface) — who implements this? via get_context format=json

All daemon responses are parsed as structured JSON (not regex on Markdown). Metrics: per-test Precision, Recall, F1. Aggregated by query_type and language.

NLBench — Natural Language Parsing Benchmark.

Tests whether Synapses correctly parses documentation files (.md, .txt, etc.) into knowledge graph nodes and can surface them via search and get_context.

Query types:

  • find_doc_entities(file) — search for a doc file, check expected entity names
  • doc_explains_code(symbol) — get_context for a code entity, check cross_domain docs
  • concept_search(query) — search for a concept, check expected names in results

Metrics: per-test Precision, Recall, F1. Aggregated by query_type and language.

Package benchmarks implements external benchmark runners.

RepoBench-R (arxiv.org/abs/2306.03091, ICLR 2024):

Given a code completion point and a list of candidate snippets from other files in the same repo, rank the most relevant snippet highest.

Dataset: huggingface.co/datasets/tianyang/repobench-r

Each record:

  • context — code up to the completion point (the query)
  • import_statement — imports at file top
  • gold_snippet_index — index of the correct answer in candidate_code
  • candidate_code — list of snippet strings to rank

This runner implements Approach B from BENCHMARK.md: for each sample, rank all candidates against the query using the chosen retrieval mode, then score Acc@k for k in {1, 3, 5, 10}.
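Acc@k counts a sample as correct when the gold candidate index appears in the top-k of the ranked list. A minimal sketch (hypothetical helper, not the runner's code):

```go
package main

import "fmt"

// accAtK returns the fraction of samples whose gold candidate index appears
// in the top-k of that sample's ranked index list.
func accAtK(rankings [][]int, gold []int, k int) float64 {
	hits := 0
	for i, ranked := range rankings {
		top := ranked
		if len(top) > k {
			top = top[:k]
		}
		for _, idx := range top {
			if idx == gold[i] {
				hits++
				break
			}
		}
	}
	return float64(hits) / float64(len(rankings))
}

func main() {
	// Two samples: gold index 2 ranked 1st, gold index 0 ranked 4th.
	rankings := [][]int{{2, 5, 1, 0}, {3, 1, 2, 0}}
	gold := []int{2, 0}
	for _, k := range []int{1, 3, 5} {
		fmt.Printf("Acc@%d = %.2f\n", k, accAtK(rankings, gold, k))
	}
}
```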

Retrieval modes:

  • fts-only — local BM25 over tokenised text
  • vector-only — local TF-IDF cosine similarity
  • hybrid-rrf — RRF merge of BM25 + TF-IDF ranks (default)
  • hybrid-convex — convex combination of BM25 + TF-IDF scores
  • hybrid-anchor — hybrid-rrf + anchor boost (first candidate gets a small boost)
  • next-hint — hybrid-convex with next-line identifiers injected into query (V2-A5)
  • bm25-lenorm — hybrid-convex with adaptive BM25 length normalisation b-param (V2-A6)
  • hybrid-ngram — hybrid-convex + word-bigram overlap signal (V2-A9)
  • cluster-hybrid — TF-IDF k-means cluster pre-filter + hybrid-convex on top cluster (V2-A10)
  • synapses-search — call Synapses search tool and rerank candidates by overlap

The local modes work without a running daemon and are the fastest to run. The synapses-search mode requires a running daemon with the repo indexed.
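For reference, the RRF merge behind hybrid-rrf can be sketched as follows (k=60 is the conventional RRF constant; the runner's actual value is an assumption here):

```go
package main

import (
	"fmt"
	"sort"
)

// rrfMerge fuses two rankings of candidate indices with Reciprocal Rank
// Fusion: score(i) = sum over lists of 1/(k + rank_i), higher is better.
func rrfMerge(bm25, tfidf []int, k float64) []int {
	score := make(map[int]float64)
	for rank, idx := range bm25 {
		score[idx] += 1.0 / (k + float64(rank+1))
	}
	for rank, idx := range tfidf {
		score[idx] += 1.0 / (k + float64(rank+1))
	}
	merged := make([]int, 0, len(score))
	for idx := range score {
		merged = append(merged, idx)
	}
	sort.Slice(merged, func(a, b int) bool { return score[merged[a]] > score[merged[b]] })
	return merged
}

func main() {
	bm25 := []int{0, 2, 1}  // BM25 ranking of candidate indices
	tfidf := []int{2, 1, 0} // TF-IDF ranking of the same candidates
	fmt.Println(rrfMerge(bm25, tfidf, 60)) // candidate 2 wins: top-2 in both lists
}
```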

swebench.go implements the SWE-bench Verified benchmark runner. It measures whether giving an LLM access to Synapses MCP tools improves its ability to solve real coding tasks (Pass@1 delta).

Index

Constants

This section is empty.

Variables

var CodeModelSpecs = map[string]CodeModelSpec{
	"embed-jina-v2-code": {
		ModelID:      "jinaai/jina-embeddings-v2-base-code",
		DirName:      "jinaai_jina-embeddings-v2-base-code",
		OnnxRepoPath: "onnx/model_quantized.onnx",
		OnnxFile:     "model_quantized.onnx",
		Dims:         768,
		Description:  "Jina v2 code (768-dim, mean-pooled, code-optimized)",
	},
	"embed-jina-v3": {
		ModelID:      "jinaai/jina-embeddings-v3",
		DirName:      "jinaai_jina-embeddings-v3",
		OnnxRepoPath: "onnx/model_quantized.onnx",
		OnnxFile:     "model_quantized.onnx",
		Dims:         1024,
		Description:  "Jina v3 (1024-dim, multilingual + code)",
	},
	"embed-codebert": {
		ModelID:      "Xenova/codebert-base",
		DirName:      "Xenova_codebert-base",
		OnnxRepoPath: "onnx/model_quantized.onnx",
		OnnxFile:     "model_quantized.onnx",
		Dims:         768,
		Description:  "CodeBERT base (768-dim, CLS-token, code+NL pairs)",
	},
}

CodeModelSpecs maps each benchmark retrieval mode to its model config. Only models with confirmed ONNX exports compatible with hugot are listed.

Validation notes:

  • jinaai/jina-embeddings-v2-base-code: Sentence Transformer, mean pooling built into the ONNX graph, 768-dim. Confirmed hugot-compatible.
  • jinaai/jina-embeddings-v3: Sentence Transformer, 1024-dim, multilingual + code-optimized. Confirmed ONNX export.
  • Xenova/codebert-base: Xenova's re-export of microsoft/codebert-base with ONNX. Uses CLS-token pooling (standard BERT). 768-dim.
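The pooling difference noted above (mean pooling vs CLS token) determines how per-token outputs collapse into one vector. A minimal sketch of the two strategies (illustrative only — for Jina v2 the mean pooling is baked into the ONNX graph itself):

```go
package main

import "fmt"

// clsPool takes the first (CLS) token's vector — standard BERT / CodeBERT style.
func clsPool(tokens [][]float32) []float32 { return tokens[0] }

// meanPool averages all token vectors — Sentence Transformer / Jina v2 style.
func meanPool(tokens [][]float32) []float32 {
	out := make([]float32, len(tokens[0]))
	for _, t := range tokens {
		for i, v := range t {
			out[i] += v
		}
	}
	for i := range out {
		out[i] /= float32(len(tokens))
	}
	return out
}

func main() {
	tokens := [][]float32{{2, 0}, {0, 2}, {1, 1}} // 3 tokens, 2 dims
	fmt.Println(clsPool(tokens), meanPool(tokens))
}
```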

Functions

func BuildFeatureBenchReport

func BuildFeatureBenchReport(mode, model string, results []FeatureBenchTaskResult) *reporter.FeatureBenchReport

BuildFeatureBenchReport aggregates task results into a reporter-compatible struct.

func ExportPredictions

func ExportPredictions(results []SWEBenchTaskResult, dir string) error

ExportPredictions writes the task results to the given directory for external use.

func IsCodeEmbedMode

func IsCodeEmbedMode(mode string) bool

IsCodeEmbedMode returns true when mode is one of the V2-E1 code embedding modes.

func IsRerankMode

func IsRerankMode(mode string) bool

IsRerankMode returns true if the retrieval mode uses cross-encoder reranking.

func RunContextBench

func RunContextBench(client *agent.SynapsesClient, opts ContextBenchOptions) (*reporter.ContextBenchResult, error)

RunContextBench runs the full ContextBench evaluation.

func RunGraphBench

func RunGraphBench(client *agent.SynapsesClient, opts GraphBenchOptions) (*reporter.GraphBenchResult, error)

RunGraphBench runs the full benchmark and returns a reporter-compatible result.

func RunNLBench

func RunNLBench(client *agent.SynapsesClient, opts NLBenchOptions) (*reporter.NLBenchResult, error)

RunNLBench runs the full NL benchmark and returns a reporter-compatible result.

func RunRepoBench

func RunRepoBench(client *agent.SynapsesClient, opts RepoBenchOptions) (*reporter.RepoBenchResult, error)

RunRepoBench executes the RepoBench-R benchmark across all requested configs and difficulties, using the given retrieval mode.

The dataset must be downloaded separately as JSONL files named:

repobench_<config>_<difficulty>.jsonl

in the current directory, or exported from HuggingFace via:

python -c "
from datasets import load_dataset
for config in ['python_cff','python_cfr','java_cff','java_cfr']:
    for split in ['test_easy','test_hard']:
        ds = load_dataset('tianyang/repobench-r', config, split=split)
        diff = split.replace('test_','')
        ds.to_json(f'repobench_{config}_{diff}.jsonl')
"

func RunSWEBench

func RunSWEBench(mcpClient *agent.SynapsesClient, opts SWEBenchOptions) (*reporter.SWEBenchResult, error)

RunSWEBench runs the SWE-bench benchmark in the specified mode.

func WritePredictions

func WritePredictions(results []SWEBenchTaskResult, outputPath string) error

WritePredictions writes the SWE-bench prediction JSONL for Docker evaluation.

Types

type CodeModelEmbedder

type CodeModelEmbedder struct {
	// contains filtered or unexported fields
}

CodeModelEmbedder wraps a hugot FeatureExtractionPipeline for a specific code embedding model. A single worker goroutine serializes ONNX inference — these models use 300-800 MB RAM each so one instance per benchmark run is appropriate.

The worker-channel design (instead of a mutex) prevents goroutine accumulation when inference is slow: a timed-out caller simply abandons its channel without spawning a new goroutine. Close() drains the in-flight request and waits for the worker to exit before destroying the ONNX session.

func NewCodeModelEmbedder

func NewCodeModelEmbedder(spec CodeModelSpec) (*CodeModelEmbedder, error)

NewCodeModelEmbedder downloads (if needed) and initializes a code embedding model. Returns an error if the model cannot be downloaded or the ONNX pipeline cannot be created — callers should handle this gracefully and skip the mode.

func (*CodeModelEmbedder) Close

func (c *CodeModelEmbedder) Close()

Close stops the worker goroutine and releases ONNX resources. It is safe to call even if inference is in flight: the worker completes the current request, then exits, ensuring session.Destroy() is never called while RunPipeline is executing.

func (*CodeModelEmbedder) EmbedBatch

func (c *CodeModelEmbedder) EmbedBatch(texts []string) ([][]float32, error)

EmbedBatch embeds a batch of texts, returning one float32 vector per text.

Requests are serialized through a single worker goroutine so the ONNX session is never accessed concurrently. If the worker does not respond within codeEmbedTimeout the call returns an error without leaking a goroutine — the abandoned respCh is buffered so the worker can always send when it finishes.

type CodeModelSpec

type CodeModelSpec struct {
	// ModelID is the HuggingFace model identifier (e.g. "jinaai/jina-embeddings-v2-base-code").
	ModelID string
	// DirName is the local directory name under ~/.synapses/models/.
	DirName string
	// OnnxRepoPath is the path within the HF repo to the ONNX file (e.g. "onnx/model_quantized.onnx").
	OnnxRepoPath string
	// OnnxFile is the local filename (e.g. "model_quantized.onnx").
	OnnxFile string
	// Dims is the native embedding dimension. 0 means use the full output dimension.
	Dims int
	// Description is a human-readable label used in benchmark reports.
	Description string
}

CodeModelSpec describes one HuggingFace ONNX embedding model.

type ContextBenchOptions

type ContextBenchOptions struct {
	// DataFile is the path to contextbench.jsonl.
	DataFile string
	// ReposDir is where repos are cloned.
	ReposDir string
	// CacheFile is the JSON cache of cloned+indexed repos (keyed by repo@commit).
	CacheFile string
	// Limit caps the number of tasks (0 = all).
	Limit int
	// Languages filters tasks by language (empty = all).
	Languages []string
	// Sources filters by source field (empty = all). E.g. ["Verified"].
	Sources []string
	// IndexWorkers controls parallel clone+index workers.
	IndexWorkers int
	// SkipIndex skips the synapses index step.
	SkipIndex bool
	// SynapsesBin is the path to the synapses binary (empty = auto-detect).
	SynapsesBin string
}

ContextBenchOptions controls the ContextBench run.

type ContextBenchTask

type ContextBenchTask struct {
	InstanceID       string `json:"instance_id"`
	OriginalInstID   string `json:"original_inst_id"`
	Repo             string `json:"repo"`
	RepoURL          string `json:"repo_url"`
	Language         string `json:"language"`
	BaseCommit       string `json:"base_commit"`
	Source           string `json:"source"`
	GoldContextRaw   string `json:"gold_context"` // JSON array string
	Patch            string `json:"patch"`
	ProblemStatement string `json:"problem_statement"`
}

ContextBenchTask is a single record from the ContextBench dataset.

type ContextBenchTaskResult

type ContextBenchTaskResult struct {
	InstanceID string  `json:"instance_id"`
	Repo       string  `json:"repo"`
	Language   string  `json:"language"`
	Precision  float64 `json:"precision"`
	Recall     float64 `json:"recall"`
	F1         float64 `json:"f1"`
	GoldLines  int     `json:"gold_lines"`
	HitLines   int     `json:"hit_lines"`
	TotalLines int     `json:"total_retrieved_lines"`
	ToolCalls  int     `json:"tool_calls"`
	Error      string  `json:"error,omitempty"`
}

ContextBenchTaskResult holds per-task metrics.

type CrossEncoderReranker

type CrossEncoderReranker struct {
	// contains filtered or unexported fields
}

CrossEncoderReranker wraps a hugot cross-encoder pipeline for in-process second-stage reranking. Thread-safe: a mutex serializes access to the single ONNX session (cross-encoder inference is fast enough that a pool is unnecessary — <200ms for 20 candidates).

func NewCrossEncoderReranker

func NewCrossEncoderReranker() (*CrossEncoderReranker, error)

NewCrossEncoderReranker downloads (if needed) and initializes the cross-encoder model from ~/.synapses/models/.

func (*CrossEncoderReranker) Close

func (r *CrossEncoderReranker) Close()

Close releases ONNX resources.

func (*CrossEncoderReranker) Rerank

func (r *CrossEncoderReranker) Rerank(ctx context.Context, query string, candidates []string, firstStage []rankedItem) []rankedItem

Rerank takes first-stage ranked items and reranks the top-N using the cross-encoder. Items beyond top-N retain their original order appended after the reranked portion.

Score normalization: reranked items get scores in [1.0, 2.0] (cross-encoder sigmoid score + 1.0 offset) and tail items get scores in [0.0, 1.0) (decaying from the lowest reranked score). This ensures reranked items always sort above tail items regardless of first-stage score scale.
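A sketch of that normalization scheme (the exact tail decay is an assumption; only the [1.0, 2.0] / [0.0, 1.0) split is taken from the description above):

```go
package main

import "fmt"

// normalizeScores maps reranked cross-encoder sigmoid scores into [1.0, 2.0]
// and assigns tail items decaying scores below 1.0, so every reranked item
// sorts above every tail item regardless of first-stage score scale.
func normalizeScores(reranked []float64, tailCount int) (top, tail []float64) {
	top = make([]float64, len(reranked))
	lowest := 2.0
	for i, s := range reranked {
		top[i] = s + 1.0 // sigmoid score in [0, 1] shifted to [1.0, 2.0]
		if top[i] < lowest {
			lowest = top[i]
		}
	}
	// Tail scores decay linearly from just below (lowest - 1.0) toward 0.
	tail = make([]float64, tailCount)
	step := (lowest - 1.0) / float64(tailCount+1)
	for i := range tail {
		tail[i] = lowest - 1.0 - float64(i+1)*step
	}
	return
}

func main() {
	top, tail := normalizeScores([]float64{0.9, 0.4}, 2)
	fmt.Println(top, tail) // top in [1.0, 2.0], tail strictly below 1.0
}
```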

type FeatureBenchOptions

type FeatureBenchOptions struct {
	Split       string   // HF split: "lite", "fast", "full"
	TaskIDs     []string // optional: specific task IDs
	Level       int      // 0 = all, 1 or 2
	ReposDir    string   // where repos are cloned
	Limit       int      // max tasks (0 = all)
	Mode        string   // "baseline" or "synapses"
	Model       string   // Claude model (passed via ANTHROPIC_MODEL env)
	Timeout     int      // seconds per task (default 1200 = 20min)
	SynapsesBin string   // path to synapses binary (for init + index)
	OutputDir   string   // where to write predictions JSONL
	Debug       bool     // dump raw stream-json to file for inspection
}

FeatureBenchOptions configures the FeatureBench run.

type FeatureBenchPrediction

type FeatureBenchPrediction struct {
	InstanceID   string                 `json:"instance_id"`
	ModelPatch   string                 `json:"model_patch"`
	TaskMetadata map[string]interface{} `json:"task_metadata,omitempty"`
}

FeatureBenchPrediction is the JSONL format that fb eval expects.

type FeatureBenchTask

type FeatureBenchTask struct {
	InstanceID       string          `json:"instance_id"`
	Repo             string          `json:"repo"`
	BaseCommit       string          `json:"base_commit"`
	ProblemStatement string          `json:"problem_statement"`
	ImageName        string          `json:"image_name"`
	RepoSettings     json.RawMessage `json:"repo_settings"`
	Patch            string          `json:"patch"`
	TestPatch        string          `json:"test_patch"`
	FailToPass       []string        `json:"FAIL_TO_PASS"`
	PassToPass       []string        `json:"PASS_TO_PASS"`
}

FeatureBenchTask is one task from the FeatureBench dataset.

type FeatureBenchTaskResult

type FeatureBenchTaskResult struct {
	InstanceID string            `json:"instance_id"`
	Repo       string            `json:"repo"`
	Mode       string            `json:"mode"`
	ModelPatch string            `json:"model_patch"`
	ToolCalls  map[string]int    `json:"tool_calls,omitempty"`
	Turns      int               `json:"turns"`
	Error      string            `json:"error,omitempty"`
	Duration   string            `json:"duration"`
	Task       *FeatureBenchTask `json:"-"` // for metadata output
}

FeatureBenchTaskResult is the outcome of running Claude on one task.

func RunFeatureBench

func RunFeatureBench(opts FeatureBenchOptions) ([]FeatureBenchTaskResult, error)

RunFeatureBench runs the FeatureBench benchmark.

type GoldContextBlock

type GoldContextBlock struct {
	File      string `json:"file"`
	StartLine int    `json:"start_line"`
	EndLine   int    `json:"end_line"`
	Content   string `json:"content"`
}

GoldContextBlock is one annotated context region.

type GraphBenchOptions

type GraphBenchOptions struct {
	DataFile string // path to graphbench.jsonl
	ReposDir string // where repos are cloned
	Limit    int    // max test suites (0 = all)
}

GraphBenchOptions controls a GraphBench run.

type GraphBenchSuite

type GraphBenchSuite struct {
	Repo     string           `json:"repo"`
	Commit   string           `json:"commit"`
	Language string           `json:"language"`
	Tests    []GraphBenchTest `json:"tests"`
}

GraphBenchSuite is one line from the JSONL file.

type GraphBenchTest

type GraphBenchTest struct {
	QueryType     string   `json:"query_type"`
	Query         string   `json:"query"`
	ExpectedNames []string `json:"expected_names,omitempty"`
	ExpectedFiles []string `json:"expected_files,omitempty"`
}

GraphBenchTest is a single query+expected pair.

type GraphBenchTestResult

type GraphBenchTestResult struct {
	Repo          string   `json:"repo"`
	Language      string   `json:"language"`
	QueryType     string   `json:"query_type"`
	Query         string   `json:"query"`
	ExpectedNames []string `json:"expected_names,omitempty"`
	ExpectedFiles []string `json:"expected_files,omitempty"`
	ActualNames   []string `json:"actual_names"`
	ActualFiles   []string `json:"actual_files"`
	Precision     float64  `json:"precision"`
	Recall        float64  `json:"recall"`
	F1            float64  `json:"f1"`
	Error         string   `json:"error,omitempty"`
	RawResponse   string   `json:"raw_response,omitempty"` // for debugging failures
}

GraphBenchTestResult holds the outcome of one test.

type LocalEmbedder

type LocalEmbedder struct {
	// contains filtered or unexported fields
}

LocalEmbedder wraps the builtin nomic-embed-text-v1.5 ONNX model for direct in-process embedding — no HTTP round-trips, full pool throughput.

func NewLocalEmbedder

func NewLocalEmbedder(poolSize int) (*LocalEmbedder, error)

NewLocalEmbedder creates and warms up a local embedder using the model already downloaded at ~/.synapses/models/.

func (*LocalEmbedder) Close

func (l *LocalEmbedder) Close()

Close releases ONNX resources.

type NLBenchOptions

type NLBenchOptions struct {
	DataFile string // path to nlbench.jsonl
	ReposDir string // where repos are cloned
	Limit    int    // max test suites (0 = all)
}

NLBenchOptions controls an NLBench run.

type NLBenchSuite

type NLBenchSuite struct {
	Repo     string        `json:"repo"`
	Commit   string        `json:"commit"`
	Language string        `json:"language"`
	Tests    []NLBenchTest `json:"tests"`
}

NLBenchSuite is one line from the JSONL file.

type NLBenchTest

type NLBenchTest struct {
	QueryType     string   `json:"query_type"`
	Query         string   `json:"query"`
	ExpectedNames []string `json:"expected_names,omitempty"`
	ExpectedDocs  []string `json:"expected_docs,omitempty"`
	Description   string   `json:"description"`
}

NLBenchTest is a single query+expected pair.

type NLBenchTestResult

type NLBenchTestResult struct {
	Repo          string   `json:"repo"`
	Language      string   `json:"language"`
	QueryType     string   `json:"query_type"`
	Query         string   `json:"query"`
	Description   string   `json:"description"`
	ExpectedNames []string `json:"expected_names,omitempty"`
	ExpectedDocs  []string `json:"expected_docs,omitempty"`
	ActualNames   []string `json:"actual_names"`
	Precision     float64  `json:"precision"`
	Recall        float64  `json:"recall"`
	F1            float64  `json:"f1"`
	Error         string   `json:"error,omitempty"`
	RawResponse   string   `json:"raw_response,omitempty"`
}

NLBenchTestResult holds the outcome of one test.

type RepoBenchOptions

type RepoBenchOptions struct {
	// Configs to run, e.g. ["python_cff", "python_cfr", "java_cff", "java_cfr"].
	Configs []string
	// Difficulties: ["easy", "hard"].
	Difficulties []string
	// RetrievalMode: fts-only | vector-only | hybrid-rrf | hybrid-convex | hybrid-anchor |
	//   next-hint | bm25-lenorm | hybrid-ngram | cluster-hybrid |
	//   synapses-search | synapses-embed | synapses-embed-local |
	//   rerank-bm25 | rerank-tfidf | rerank-hybrid | rerank-convex |
	//   embed-codebert | embed-jina-v2-code | embed-jina-v3
	RetrievalMode string
	// LimitPerSet caps samples per config×difficulty (0 = all).
	LimitPerSet int
	// ReposDir is the root directory where repos are cloned (for synapses-embed).
	ReposDir string
	// RepoCache maps repo names to local paths (for per-repo project routing).
	RepoCache *indexer.Cache
	// LocalEmbedder is used for synapses-embed-local mode (in-process ONNX).
	LocalEmbedder *LocalEmbedder
	// Reranker is used for rerank-* modes (in-process ONNX cross-encoder).
	Reranker *CrossEncoderReranker
	// CodeEmbedder is used for embed-codebert / embed-jina-v2-code / embed-jina-v3 modes.
	CodeEmbedder *CodeModelEmbedder
}

RepoBenchOptions controls what to run.

type RepoBenchSample

type RepoBenchSample struct {
	// Code is the current file's code up to the completion point — the query.
	Code string `json:"code"`
	// Context is the list of candidate snippets from other files to rank.
	Context            []string `json:"context"`
	ImportStatement    string   `json:"import_statement"`
	GoldenSnippetIndex int      `json:"golden_snippet_index"`
	NextLine           string   `json:"next_line"`
	Repo               string   `json:"repo_name"`
	File               string   `json:"file_path"`
}

RepoBenchSample is a single record from the RepoBench-R dataset (JSONL format).

Real schema (from pickle inspection):

  • Code = current file context (the query)
  • Context = candidate snippets from other files (the list to rank)
  • GoldenSnippetIndex = index into Context of the correct snippet
  • ImportStatement = imports at the top of the query file

type SWEBenchOptions

type SWEBenchOptions struct {
	DataFile string // path to JSONL dataset
	ReposDir string // directory where repos are cloned
	Limit    int    // max tasks to run (0 = all)
	Mode     string // "baseline" or "synapses"
	Model    string // Claude model name
	MaxTurns int    // max agent loop turns
	Endpoint string // Synapses daemon endpoint (synapses mode only)
}

SWEBenchOptions configures the SWE-bench benchmark run.

type SWEBenchTask

type SWEBenchTask struct {
	InstanceID       string `json:"instance_id"`
	Repo             string `json:"repo"`
	BaseCommit       string `json:"base_commit"`
	ProblemStatement string `json:"problem_statement"`
	Patch            string `json:"patch,omitempty"`      // gold patch (for reference)
	TestPatch        string `json:"test_patch,omitempty"` // test changes (for eval)
}

SWEBenchTask is one task from the SWE-bench dataset.

type SWEBenchTaskResult

type SWEBenchTaskResult struct {
	InstanceID     string           `json:"instance_id"`
	Repo           string           `json:"repo"`
	Mode           string           `json:"mode"`
	GeneratedPatch string           `json:"generated_patch"`
	Pass           bool             `json:"pass"` // set by evaluator (manual or Docker)
	Stats          agent.AgentStats `json:"stats"`
	Error          string           `json:"error,omitempty"`
	Duration       string           `json:"duration"`
}

SWEBenchTaskResult is the outcome of running the agent on one task.
