crosssystem_test

package

v0.6.3 Latest Latest Go to latest Published: May 22, 2026 License: MIT Imports: 5 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/blackwell-systems/knowing

Links

Open Source Insights

README ¶

Cross-System Context Retrieval Benchmark

Rigorous comparison of context retrieval systems on identical tasks across 5 public repositories.

Full specification: docs/research/cross-system-benchmark.md Running results: FINDINGS.md Study overview: bench/CONTEXT-PACKING-STUDY.md

Quick Start

# 1. Clone evaluation repos (shallow, ~2GB total)
./bench/cross-system/scripts/clone-repos.sh

# 2. Index with knowing
./bench/cross-system/scripts/index-repos.sh

# 3. Run benchmark (knowing + grep baseline by default)
GOWORK=off go test ./bench/cross-system/ -run TestCrossSystem -v -timeout 30m

# 4. Results appear in bench/cross-system/results/run-<timestamp>/

Systems Compared

System	Type	Requires
knowing	Content-addressed graph + RWR	Always available (it's us)
grep (baseline)	Raw ripgrep pattern search	`rg` on PATH
GitNexus	Knowledge graph MCP	`npm install -g gitnexus`
Aider	PageRank repo-map	`pip install aider-chat`
CGC	Graph DB (KuzuDB) MCP	`pip install codegraphcontext`

Only installed systems are benchmarked. Missing systems are reported as "unavailable."

Metrics

Metric	What it measures
P@10	Fraction of top-10 results that are relevant
R@10	Fraction of ground truth found in top-10
NDCG@10	Ranking quality (rewards relevant results ranked higher)
MRR	How quickly you get the first relevant result
F1@10	Harmonic mean of P@10 and R@10
Token Eff	Relevant symbols per token consumed

Statistical significance via Wilcoxon signed-rank test (paired, non-parametric). Effect size via Cohen's d. Confidence intervals via bootstrap (10K resamples).

Evaluation Corpus

5 repos pinned to specific versions:

kubernetes (Go, ~3.5M LOC) - v1.30.0
VS Code (TypeScript, ~1M LOC) - 1.90.0
flask (Python, ~15K LOC) - 3.1.0
cargo (Rust, ~150K LOC) - 0.82.0
django (Python, ~300K LOC) - 5.1

100 tasks across 3 difficulty tiers (easy/medium/hard) with hand-labeled ground truth symbols.

Directory Structure

bench/cross-system/
  benchtype/           # shared types (leaf package, no internal imports)
  normalize/           # symbol canonicalization + tests
  metrics/             # P@K, R@K, NDCG, MRR, F1, token efficiency, stats
  adapters/            # system implementations + availability registry
  corpus/
    repos.yaml         # pinned repo definitions
    tasks/<repo>/<tier>/*.yaml  # ground truth fixtures
    repos/             # cloned repos (gitignored)
  scripts/
    clone-repos.sh     # shallow clone all 5 repos
    index-repos.sh     # index with knowing
  results/             # benchmark output (gitignored)
  harness_test.go      # main entry point

Adding a New System

Create adapters/<name>.go implementing benchtype.Adapter
Add to adapters/registry.go AllAdapters() with availability check
Run the benchmark; the new system appears in results automatically

Fairness

knowing's own repo is excluded from evaluation
All systems get the same task descriptions (no system-specific tuning)
Cold start by default (no pre-existing feedback/state)
Token counting uses tiktoken cl100k_base for all systems
Statistical tests are paired (same tasks, different systems)

Documentation ¶

There is no documentation for this package.

Source Files ¶

View all Source files

groundtruth.go

Directories ¶

Path	Synopsis
adapters Package adapters provides system-specific implementations of the benchmark Adapter interface.	Package adapters provides system-specific implementations of the benchmark Adapter interface.
benchtype Package benchtype defines shared types for the cross-system context retrieval benchmark.	Package benchtype defines shared types for the cross-system context retrieval benchmark.
cmd
failure-analysis command Command failure-analysis examines what knowing returns vs ground truth for each task, categorizing misses into: related-but-unlisted, noise, wrong-package, correct-package-wrong-symbol.	Command failure-analysis examines what knowing returns vs ground truth for each task, categorizing misses into: related-but-unlisted, noise, wrong-package, correct-package-wrong-symbol.
validate-fixtures command Command validate-fixtures checks each ground truth symbol against the actual DB contents and reports mismatches.	Command validate-fixtures checks each ground truth symbol against the actual DB contents and reports mismatches.
metrics Package metrics computes retrieval quality metrics for the cross-system benchmark.	Package metrics computes retrieval quality metrics for the cross-system benchmark.
normalize Package normalize provides symbol name canonicalization for cross-system comparison.	Package normalize provides symbol name canonicalization for cross-system comparison.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL