agent_efficiency

package
v0.13.0
Warning

This package is not in the latest version of its module.

Published: Jun 1, 2026 License: MIT Imports: 7 Imported by: 0

README

Agent Efficiency Benchmark

This benchmark measures whether Claude Code with the knowing MCP tools completes tasks with fewer tool calls, fewer tokens, and higher correctness than without them.

How it works

The benchmark has three parts:

  1. Task fixtures (tasks.go): 8 tasks targeting the knowing codebase, each with a description, ground truth (relevant files, key symbols, answer keywords), and a complexity rating.

  2. Transcript analyzer (transcript.go): Parses Claude Code JSONL session transcripts to extract token counts, tool call counts, wall-clock time, files read, and correctness scores against the ground truth.

  3. Comparison engine (compare.go): Computes the deltas between control (without knowing) and treatment (with knowing) sessions, and renders a Markdown report.

The analyzer does NOT launch Claude Code itself. You run the sessions separately, either manually or via the runner script, and then point the analyzer at the resulting JSONL files.

Step 1: Export tasks

Generate tasks.json so the runner script can read task descriptions without a Go toolchain:

GOWORK=off go test ./bench/agent-efficiency/ -run TestExportTasks -v

This writes bench/agent-efficiency/tasks.json.

Step 2: Run sessions

Option A: Automated (requires the claude CLI)

# Control session (no knowing tools)
./bench/agent-efficiency/runner.sh blast-radius-handler control

# Treatment session (with knowing tools)
./bench/agent-efficiency/runner.sh blast-radius-handler treatment

Run both modes for all 8 task IDs:

for task in blast-radius-handler context-engine-scoring node-struct-blast-radius \
            louvain-community-detection snapshot-package-coverage \
            hierarchical-merkle-diff edge-types file-save-to-cache-invalidation; do
  ./bench/agent-efficiency/runner.sh "$task" control
  ./bench/agent-efficiency/runner.sh "$task" treatment
done

Option B: Manual

  1. Open a Claude Code session in the root of your local knowing checkout (for example, /Users/dayna.blackwell/code/knowing).
  2. Paste the task description from tasks.json.
  3. For control sessions: do not use any knowing MCP tools.
  4. For treatment sessions: use knowing MCP tools freely.
  5. Save the session JSONL transcript to: bench/agent-efficiency/transcripts/<task-id>-<mode>.jsonl

Claude Code stores session transcripts in ~/.claude/projects/<project-hash>/. Copy the relevant .jsonl file to the transcripts directory with the naming convention above.

Step 3: Analyze results

GOWORK=off go test ./bench/agent-efficiency/ -run TestAnalyzeTranscripts -v

This reads all transcripts in bench/agent-efficiency/transcripts/, computes metrics, compares control vs. treatment pairs, and writes bench/agent-efficiency/FINDINGS.md.

Transcript naming convention

Transcripts must follow this naming pattern:

transcripts/<task-id>-<mode>.jsonl

Where <mode> is either control or treatment. Examples:

transcripts/blast-radius-handler-control.jsonl
transcripts/blast-radius-handler-treatment.jsonl
transcripts/context-engine-scoring-control.jsonl
transcripts/context-engine-scoring-treatment.jsonl
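
Because task IDs contain hyphens themselves, a file name has to be split on its last hyphen to recover the task ID and mode. The following is a minimal sketch of that split. The helper name splitTranscriptName is hypothetical and is not part of this package's API; the package's own parsing may differ.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// splitTranscriptName is a hypothetical helper, not part of the package API.
// It splits "<task-id>-<mode>.jsonl" on the LAST hyphen, because task IDs
// such as "blast-radius-handler" contain hyphens themselves.
func splitTranscriptName(path string) (taskID, mode string, err error) {
	base := filepath.Base(path)
	if !strings.HasSuffix(base, ".jsonl") {
		return "", "", fmt.Errorf("%q: expected a .jsonl file", base)
	}
	stem := strings.TrimSuffix(base, ".jsonl")
	i := strings.LastIndex(stem, "-")
	if i <= 0 {
		return "", "", fmt.Errorf("%q: expected <task-id>-<mode>.jsonl", base)
	}
	taskID, mode = stem[:i], stem[i+1:]
	if mode != "control" && mode != "treatment" {
		return "", "", fmt.Errorf("%q: mode must be control or treatment, got %q", base, mode)
	}
	return taskID, mode, nil
}

func main() {
	for _, p := range []string{
		"transcripts/blast-radius-handler-control.jsonl",
		"transcripts/context-engine-scoring-treatment.jsonl",
		"transcripts/notes.txt", // rejected: not a .jsonl file
	} {
		taskID, mode, err := splitTranscriptName(p)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("task=%s mode=%s\n", taskID, mode)
	}
}
```

Rejecting unknown modes up front means a misnamed file is reported rather than silently left out of a control/treatment pair.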

Task list

| ID | Description | Complexity |
|---|---|---|
| blast-radius-handler | Find the function that handles the blast_radius MCP tool | low |
| context-engine-scoring | Explain the context engine scoring formula and weights | medium |
| node-struct-blast-radius | What breaks if Node struct changes | medium |
| louvain-community-detection | Walk through the Louvain algorithm implementation | medium |
| snapshot-package-coverage | What test files cover the snapshot package | low |
| hierarchical-merkle-diff | How hierarchical Merkle tree improves diff performance | high |
| edge-types | List all supported edge types and where they are defined | medium |
| file-save-to-cache-invalidation | Trace data flow from git commit to cache invalidation | high |

Metrics collected

| Metric | Description |
|---|---|
| TotalTokens | Input + output tokens across all turns |
| ToolCalls | Total number of tool_use blocks |
| ToolCallsByType | Per-tool breakdown (Read, Grep, knowing_context, etc.) |
| Turns | Number of assistant messages |
| WallClockMs | Time from first user message to last assistant message |
| FilesRead | Unique files opened via Read tool |
| FoundRelevantFiles | Ground-truth relevant files that were actually read |
| FoundKeySymbols | Ground-truth key symbols that appeared in assistant output |
| AnswerCorrectness | Fraction of expected answer keywords present in final response |
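
To make FoundRelevantFiles concrete, here is a rough sketch of how the files an agent read could be checked against the ground-truth list. The helper countFoundFiles and the substring-matching rule are assumptions made for illustration; the package may match paths differently. The ground-truth entries come from the node-struct-blast-radius task, while the paths in filesRead are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// countFoundFiles is a hypothetical helper: it counts how many ground-truth
// paths match at least one file the agent read. Matching by substring is an
// assumption of this sketch; it lets an absolute path match a repo-relative
// entry, and lets a directory entry such as "internal/store/knowing/" match
// any file beneath it.
func countFoundFiles(filesRead, relevant []string) int {
	found := 0
	for _, want := range relevant {
		for _, got := range filesRead {
			if strings.Contains(got, want) {
				found++
				break // count each ground-truth entry at most once
			}
		}
	}
	return found
}

func main() {
	// Ground truth of the node-struct-blast-radius task.
	relevant := []string{
		"internal/types/types.go",
		"internal/store/knowing/",
		"internal/indexer/indexer.go",
		"internal/mcp/handlers.go",
	}
	// Made-up session: the paths and the file name store.go are illustrative.
	filesRead := []string{
		"/repo/internal/types/types.go",
		"/repo/internal/store/knowing/store.go",
	}
	n := countFoundFiles(filesRead, relevant)
	fmt.Printf("found %d of %d relevant files\n", n, len(relevant))
}
```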

Interpreting results

  • Token savings: positive means treatment used fewer tokens.
  • Tool call savings: positive means treatment made fewer tool calls.
  • Time savings: positive means treatment finished faster.
  • Correctness delta: positive means treatment gave more correct answers.

A good result shows knowing tools providing significant token and tool call savings while maintaining or improving correctness.

Documentation

Constants

This section is empty.

Variables

var Tasks = []Task{
	{
		ID:          "blast-radius-handler",
		Description: "What function handles the blast_radius MCP tool in the knowing codebase? In which file is it defined?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/mcp/handlers.go",
			},
			KeySymbols: []string{
				"handleBlastRadius",
			},
			AnswerKeywords: []string{
				"handleBlastRadius",
				"handlers.go",
			},
		},
		Complexity: "low",
	},
	{
		ID:          "context-engine-scoring",
		Description: "How does the context engine score symbols? What is the formula, and what weights are applied to each component?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/context/ranking.go",
				"internal/context/hits.go",
			},
			KeySymbols: []string{
				"RankSymbols",
				"ScoringInput",
				"ScoreComponents",
				"HITSScores",
			},
			AnswerKeywords: []string{
				"RankSymbols",
				"weight",
				"feedback",
				"session",
				"recency",
			},
		},
		Complexity: "medium",
	},
	{
		ID:          "node-struct-blast-radius",
		Description: "If I change the Node struct in internal/types/types.go, what breaks? Which packages and callers depend on it?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/types/types.go",
				"internal/store/knowing/",
				"internal/indexer/indexer.go",
				"internal/mcp/handlers.go",
			},
			KeySymbols: []string{
				"Node",
				"ComputeNodeHash",
				"PutNode",
			},
			AnswerKeywords: []string{
				"Node",
				"types.go",
				"store",
				"indexer",
			},
		},
		Complexity: "medium",
	},
	{
		ID:          "louvain-community-detection",
		Description: "How does the Louvain community detection work in the knowing codebase? Walk me through the algorithm implementation.",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/community/louvain.go",
				"internal/community/algorithm.go",
			},
			KeySymbols: []string{
				"Louvain",
				"Detect",
			},
			AnswerKeywords: []string{
				"modularity",
				"community",
				"louvain",
				"Detect",
			},
		},
		Complexity: "medium",
	},
	{
		ID:          "snapshot-package-coverage",
		Description: "What test files exist for the snapshot package, and what aspects of the package do they cover?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/snapshot/manager_test.go",
				"internal/snapshot/hierarchical_test.go",
				"internal/snapshot/verify_test.go",
				"internal/snapshot/impact_test.go",
				"internal/snapshot/semantic_test.go",
			},
			KeySymbols: []string{
				"HierarchicalTree",
				"DiffHierarchicalTrees",
				"BuildHierarchicalTree",
			},
			AnswerKeywords: []string{
				"snapshot",
				"hierarchical",
				"test",
			},
		},
		Complexity: "low",
	},
	{
		ID:          "hierarchical-merkle-diff",
		Description: "How does the hierarchical Merkle tree improve diff performance in the knowing codebase? What is the algorithmic improvement over a flat diff?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/snapshot/hierarchical.go",
				"internal/snapshot/merkle.go",
				"internal/cache/subgraph.go",
			},
			KeySymbols: []string{
				"HierarchicalTree",
				"BuildHierarchicalTree",
				"DiffHierarchicalTrees",
				"DiffHierarchicalTreesWithOptions",
				"SubgraphCache",
				"InvalidatePackages",
			},
			AnswerKeywords: []string{
				"O(packages)",
				"hierarchical",
				"PackageRoots",
				"EdgeTypeRoots",
				"diff",
			},
		},
		Complexity: "high",
	},
	{
		ID:          "edge-types",
		Description: "What edge types does the knowing graph support and where are they defined? List all supported edge type strings.",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/types/types.go",
			},
			KeySymbols: []string{
				"Edge",
				"EdgeType",
				"ComputeEdgeHash",
			},
			AnswerKeywords: []string{
				"calls",
				"imports",
				"implements",
				"references",
				"types.go",
			},
		},
		Complexity: "medium",
	},
	{
		ID:          "file-save-to-cache-invalidation",
		Description: "Trace the data flow in the knowing daemon from a git commit (file save) to cache invalidation. Which functions are involved and in what order?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/daemon/gitwatcher.go",
				"internal/daemon/daemon.go",
				"internal/snapshot/hierarchical.go",
				"internal/cache/subgraph.go",
			},
			KeySymbols: []string{
				"GitWatcher",
				"CommitEvent",
				"DiffHierarchicalTrees",
				"InvalidatePackages",
			},
			AnswerKeywords: []string{
				"GitWatcher",
				"CommitEvent",
				"reindex",
				"InvalidatePackages",
				"SubgraphCache",
			},
		},
		Complexity: "high",
	},
}

Tasks is the canonical fixture set for the agent efficiency benchmark. Each task targets the knowing codebase and has been verified against the actual source.

Functions

func FormatReport

func FormatReport(results []ComparisonResult) string

FormatReport renders a slice of ComparisonResults as a Markdown report with a summary table and a per-task detail section.

func ScoreCorrectness

func ScoreCorrectness(m *SessionMetrics, gt GroundTruth, allAssistantText string)

ScoreCorrectness computes FoundRelevantFiles, FoundKeySymbols, and AnswerCorrectness against a GroundTruth. It updates m in place.
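
A minimal sketch of the AnswerCorrectness part of this scoring: the fraction of expected keywords that occur in the response text. The helper keywordFraction is hypothetical, and the case-insensitive comparison and the zero result for an empty keyword list are assumptions; ScoreCorrectness itself may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// keywordFraction is a hypothetical helper: it returns the fraction of
// keywords found as substrings of text. Case-insensitive matching and
// returning 0 for an empty keyword list are assumptions of this sketch.
func keywordFraction(text string, keywords []string) float64 {
	if len(keywords) == 0 {
		return 0
	}
	lower := strings.ToLower(text)
	hits := 0
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			hits++
		}
	}
	return float64(hits) / float64(len(keywords))
}

func main() {
	// AnswerKeywords of the blast-radius-handler task.
	keywords := []string{"handleBlastRadius", "handlers.go"}
	// Made-up response that names the function but not the file.
	answer := "The blast_radius tool is handled by handleBlastRadius."
	fmt.Printf("AnswerCorrectness = %.2f\n", keywordFraction(answer, keywords))
}
```

Here only one of the two keywords appears, so the score is 0.50.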

Types

type ComparisonResult

type ComparisonResult struct {
	TaskID    string
	Control   SessionMetrics
	Treatment SessionMetrics

	// TokenSavings is the fraction of tokens saved:
	//   (control.TotalTokens - treatment.TotalTokens) / control.TotalTokens
	// Positive means treatment used fewer tokens.
	TokenSavings float64

	// ToolCallSavings is the fraction of tool calls saved:
	//   (control.ToolCalls - treatment.ToolCalls) / control.ToolCalls
	ToolCallSavings float64

	// TimeSavings is the fraction of wall-clock time saved.
	TimeSavings float64

	// CorrectnessDelta is treatment.AnswerCorrectness - control.AnswerCorrectness.
	// Positive means treatment gave a more correct answer.
	CorrectnessDelta float64
}

ComparisonResult holds the delta between a control (without knowing) and a treatment (with knowing) session for the same task.
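
The savings fields all use the same fraction, (control - treatment) / control. A short worked sketch with made-up numbers follows. The helper name savings and the choice to return 0 when the control value is 0 are assumptions; how Compare treats a zero denominator is not stated here.

```go
package main

import "fmt"

// savings returns (control - treatment) / control.
// Returning 0 when control is 0 is an assumption of this sketch,
// chosen to avoid a division by zero.
func savings(control, treatment float64) float64 {
	if control == 0 {
		return 0
	}
	return (control - treatment) / control
}

func main() {
	// Made-up numbers for illustration, not benchmark results.
	fmt.Printf("token savings:     %+.0f%%\n", 100*savings(40000, 30000)) // treatment used fewer tokens
	fmt.Printf("tool call savings: %+.0f%%\n", 100*savings(8, 12))        // treatment made more calls
}
```

With these inputs the token savings are +25% and the tool call savings are -50%: a negative value means the treatment session used more than the control.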

func Compare

func Compare(control, treatment SessionMetrics) ComparisonResult

Compare computes all deltas between control and treatment metrics.

type GroundTruth

type GroundTruth struct {
	// RelevantFiles are the files the agent should read or find to answer correctly.
	RelevantFiles []string
	// KeySymbols are the symbol names the agent should discover.
	KeySymbols []string
	// AnswerKeywords are substrings that must appear in the final response.
	AnswerKeywords []string
}

GroundTruth describes what a correct agent response must contain.

type SessionMetrics

type SessionMetrics struct {
	SessionID       string
	TaskID          string
	TotalTokens     int
	InputTokens     int
	OutputTokens    int
	ToolCalls       int
	ToolCallsByType map[string]int
	Turns           int
	WallClockMs     int64
	FilesRead       []string

	// Correctness fields: populated by ScoreCorrectness.
	FoundRelevantFiles int
	FoundKeySymbols    int
	AnswerCorrectness  float64
}

SessionMetrics holds all extracted metrics for a single benchmark session.

func ParseTranscript

func ParseTranscript(path string) (SessionMetrics, error)

ParseTranscript reads a Claude Code JSONL file and returns a SessionMetrics. Unknown line types and missing fields are silently skipped.

func ParseTranscriptWithScoring

func ParseTranscriptWithScoring(path string, gt GroundTruth) (SessionMetrics, error)

ParseTranscriptWithScoring parses a transcript and scores correctness against the provided ground truth in a single call.

type Task

type Task struct {
	ID          string
	Description string
	GroundTruth GroundTruth
	Complexity  string // "low", "medium", "high"
}

Task is a benchmark task fixture with a ground truth answer.
