agent_efficiency

package

Version: v0.3.0 (latest) | Published: May 19, 2026 | License: MIT | Imports: 7 | Imported by: 0

Repository

github.com/blackwell-systems/knowing

README

Agent Efficiency Benchmark

This benchmark measures whether Claude Code with knowing MCP tools completes tasks with fewer tool calls, fewer tokens, and higher correctness than Claude Code without them.

How it works

The benchmark has three parts:

Task fixtures (tasks.go): 8 tasks targeting the knowing codebase, each with a description, ground truth (relevant files, key symbols, answer keywords), and a complexity rating.
Transcript analyzer (transcript.go): Parses Claude Code JSONL session transcripts to extract token counts, tool call counts, wall-clock time, files read, and correctness scores against the ground truth.
Comparison engine (compare.go): Computes the deltas between control (without knowing) and treatment (with knowing) sessions, and renders a Markdown report.

The benchmark does NOT run Claude Code automatically. You run sessions manually (or via the runner script) and then point the analyzer at the resulting JSONL files.

Step 1: Export tasks

Generate tasks.json once so that the runner script can read task descriptions directly, without needing a Go toolchain at run time:

GOWORK=off go test ./bench/agent-efficiency/ -run TestExportTasks -v

This writes bench/agent-efficiency/tasks.json.

Step 2: Run sessions

Option A: Automated (requires claude CLI)

# Control session (no knowing tools)
./bench/agent-efficiency/runner.sh blast-radius-handler control

# Treatment session (with knowing tools)
./bench/agent-efficiency/runner.sh blast-radius-handler treatment

Run both modes for all 8 task IDs:

for task in blast-radius-handler context-engine-scoring node-struct-blast-radius \
            louvain-community-detection snapshot-package-coverage \
            hierarchical-merkle-diff edge-types file-save-to-cache-invalidation; do
  ./bench/agent-efficiency/runner.sh "$task" control
  ./bench/agent-efficiency/runner.sh "$task" treatment
done

Option B: Manual

Open a Claude Code session at the root of your knowing checkout (for example, /Users/dayna.blackwell/code/knowing).
Paste the task description from tasks.json.
For control sessions: do not use any knowing MCP tools.
For treatment sessions: use knowing MCP tools freely.
Save the session JSONL transcript to: bench/agent-efficiency/transcripts/<task-id>-<mode>.jsonl

Claude Code stores session transcripts in ~/.claude/projects/<project-hash>/. Copy the relevant .jsonl file to the transcripts directory with the naming convention above.

Step 3: Analyze results

GOWORK=off go test ./bench/agent-efficiency/ -run TestAnalyzeTranscripts -v

This reads all transcripts in bench/agent-efficiency/transcripts/, computes metrics, compares control vs. treatment pairs, and writes bench/agent-efficiency/FINDINGS.md.

Transcript naming convention

Transcripts must follow this naming pattern:

transcripts/<task-id>-<mode>.jsonl

Where <mode> is either control or treatment. Examples:

transcripts/blast-radius-handler-control.jsonl
transcripts/blast-radius-handler-treatment.jsonl
transcripts/context-engine-scoring-control.jsonl
transcripts/context-engine-scoring-treatment.jsonl
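
Because task IDs themselves contain hyphens, the mode has to be taken from the last hyphen-separated segment. The sketch below shows one way to split a transcript name; parseTranscriptName is a hypothetical helper, not part of the documented API.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// parseTranscriptName is a hypothetical helper that splits
// "<task-id>-<mode>.jsonl" into its task ID and mode.
// Task IDs contain hyphens, so the split is on the last hyphen.
func parseTranscriptName(path string) (taskID, mode string, err error) {
	base := filepath.Base(path)
	if !strings.HasSuffix(base, ".jsonl") {
		return "", "", fmt.Errorf("%q: expected a .jsonl file", base)
	}
	stem := strings.TrimSuffix(base, ".jsonl")
	i := strings.LastIndex(stem, "-")
	if i <= 0 {
		return "", "", fmt.Errorf("%q: expected <task-id>-<mode>.jsonl", base)
	}
	taskID, mode = stem[:i], stem[i+1:]
	if mode != "control" && mode != "treatment" {
		return "", "", fmt.Errorf("%q: mode must be control or treatment, got %q", base, mode)
	}
	return taskID, mode, nil
}

func main() {
	id, mode, err := parseTranscriptName("transcripts/file-save-to-cache-invalidation-treatment.jsonl")
	fmt.Println(id, mode, err) // prints: file-save-to-cache-invalidation treatment <nil>
}
```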

Task list

| ID | Description | Complexity |
|---|---|---|
| `blast-radius-handler` | Find the function that handles the blast_radius MCP tool | low |
| `context-engine-scoring` | Explain the context engine scoring formula and weights | medium |
| `node-struct-blast-radius` | What breaks if Node struct changes | medium |
| `louvain-community-detection` | Walk through the Louvain algorithm implementation | medium |
| `snapshot-package-coverage` | What test files cover the snapshot package | low |
| `hierarchical-merkle-diff` | How hierarchical Merkle tree improves diff performance | high |
| `edge-types` | List all supported edge types and where they are defined | medium |
| `file-save-to-cache-invalidation` | Trace data flow from git commit to cache invalidation | high |

Metrics collected

| Metric | Description |
|---|---|
| `TotalTokens` | Input + output tokens across all turns |
| `ToolCalls` | Total number of tool_use blocks |
| `ToolCallsByType` | Per-tool breakdown (Read, Grep, knowing_context, etc.) |
| `Turns` | Number of assistant messages |
| `WallClockMs` | Time from first user message to last assistant message |
| `FilesRead` | Unique files opened via Read tool |
| `FoundRelevantFiles` | Ground-truth relevant files that were actually read |
| `FoundKeySymbols` | Ground-truth key symbols that appeared in assistant output |
| `AnswerCorrectness` | Fraction of expected answer keywords present in final response |
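
To make the AnswerCorrectness metric concrete, here is a minimal sketch of a keyword-fraction score. The function name keywordFraction is hypothetical, and case-insensitive substring matching is an assumption; the package's actual matching rules are not stated here.

```go
package main

import (
	"fmt"
	"strings"
)

// keywordFraction returns the share of keywords found in text.
// Assumption: matching is a case-insensitive substring test; the
// package's real matching rules are not documented here.
func keywordFraction(text string, keywords []string) float64 {
	if len(keywords) == 0 {
		return 0
	}
	lower := strings.ToLower(text)
	found := 0
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			found++
		}
	}
	return float64(found) / float64(len(keywords))
}

func main() {
	answer := "The tool is handled by handleBlastRadius."
	// One of the two expected keywords is present.
	fmt.Println(keywordFraction(answer, []string{"handleBlastRadius", "handlers.go"})) // prints 0.5
}
```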

Interpreting results

Token savings: positive means treatment used fewer tokens.
Tool call savings: positive means treatment made fewer tool calls.
Time savings: positive means treatment finished faster.
Correctness delta: positive means treatment gave more correct answers.

A good result shows knowing tools providing significant token and tool call savings while maintaining or improving correctness.

Documentation

Index

Variables
func FormatReport(results []ComparisonResult) string
func ScoreCorrectness(m *SessionMetrics, gt GroundTruth, allAssistantText string)
type ComparisonResult
- func Compare(control, treatment SessionMetrics) ComparisonResult
type GroundTruth
type SessionMetrics
- func ParseTranscript(path string) (SessionMetrics, error)
- func ParseTranscriptWithScoring(path string, gt GroundTruth) (SessionMetrics, error)
type Task

Constants

This section is empty.

Variables

var Tasks = []Task{
	{
		ID:          "blast-radius-handler",
		Description: "What function handles the blast_radius MCP tool in the knowing codebase? In which file is it defined?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/mcp/handlers.go",
			},
			KeySymbols: []string{
				"handleBlastRadius",
			},
			AnswerKeywords: []string{
				"handleBlastRadius",
				"handlers.go",
			},
		},
		Complexity: "low",
	},
	{
		ID:          "context-engine-scoring",
		Description: "How does the context engine score symbols? What is the formula, and what weights are applied to each component?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/context/ranking.go",
				"internal/context/hits.go",
			},
			KeySymbols: []string{
				"RankSymbols",
				"ScoringInput",
				"ScoreComponents",
				"HITSScores",
			},
			AnswerKeywords: []string{
				"RankSymbols",
				"weight",
				"feedback",
				"session",
				"recency",
			},
		},
		Complexity: "medium",
	},
	{
		ID:          "node-struct-blast-radius",
		Description: "If I change the Node struct in internal/types/types.go, what breaks? Which packages and callers depend on it?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/types/types.go",
				"internal/store/knowing/",
				"internal/indexer/indexer.go",
				"internal/mcp/handlers.go",
			},
			KeySymbols: []string{
				"Node",
				"ComputeNodeHash",
				"PutNode",
			},
			AnswerKeywords: []string{
				"Node",
				"types.go",
				"store",
				"indexer",
			},
		},
		Complexity: "medium",
	},
	{
		ID:          "louvain-community-detection",
		Description: "How does the Louvain community detection work in the knowing codebase? Walk me through the algorithm implementation.",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/community/louvain.go",
				"internal/community/algorithm.go",
			},
			KeySymbols: []string{
				"Louvain",
				"Detect",
			},
			AnswerKeywords: []string{
				"modularity",
				"community",
				"louvain",
				"Detect",
			},
		},
		Complexity: "medium",
	},
	{
		ID:          "snapshot-package-coverage",
		Description: "What test files exist for the snapshot package, and what aspects of the package do they cover?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/snapshot/manager_test.go",
				"internal/snapshot/hierarchical_test.go",
				"internal/snapshot/verify_test.go",
				"internal/snapshot/impact_test.go",
				"internal/snapshot/semantic_test.go",
			},
			KeySymbols: []string{
				"HierarchicalTree",
				"DiffHierarchicalTrees",
				"BuildHierarchicalTree",
			},
			AnswerKeywords: []string{
				"snapshot",
				"hierarchical",
				"test",
			},
		},
		Complexity: "low",
	},
	{
		ID:          "hierarchical-merkle-diff",
		Description: "How does the hierarchical Merkle tree improve diff performance in the knowing codebase? What is the algorithmic improvement over a flat diff?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/snapshot/hierarchical.go",
				"internal/snapshot/merkle.go",
				"internal/cache/subgraph.go",
			},
			KeySymbols: []string{
				"HierarchicalTree",
				"BuildHierarchicalTree",
				"DiffHierarchicalTrees",
				"DiffHierarchicalTreesWithOptions",
				"SubgraphCache",
				"InvalidatePackages",
			},
			AnswerKeywords: []string{
				"O(packages)",
				"hierarchical",
				"PackageRoots",
				"EdgeTypeRoots",
				"diff",
			},
		},
		Complexity: "high",
	},
	{
		ID:          "edge-types",
		Description: "What edge types does the knowing graph support and where are they defined? List all supported edge type strings.",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/types/types.go",
			},
			KeySymbols: []string{
				"Edge",
				"EdgeType",
				"ComputeEdgeHash",
			},
			AnswerKeywords: []string{
				"calls",
				"imports",
				"implements",
				"references",
				"types.go",
			},
		},
		Complexity: "medium",
	},
	{
		ID:          "file-save-to-cache-invalidation",
		Description: "Trace the data flow in the knowing daemon from a git commit (file save) to cache invalidation. Which functions are involved and in what order?",
		GroundTruth: GroundTruth{
			RelevantFiles: []string{
				"internal/daemon/gitwatcher.go",
				"internal/daemon/daemon.go",
				"internal/snapshot/hierarchical.go",
				"internal/cache/subgraph.go",
			},
			KeySymbols: []string{
				"GitWatcher",
				"CommitEvent",
				"DiffHierarchicalTrees",
				"InvalidatePackages",
			},
			AnswerKeywords: []string{
				"GitWatcher",
				"CommitEvent",
				"reindex",
				"InvalidatePackages",
				"SubgraphCache",
			},
		},
		Complexity: "high",
	},
}

Tasks is the canonical fixture set for the agent efficiency benchmark. Each task targets the knowing codebase and has been verified against the actual source.

Functions

func FormatReport

func FormatReport(results []ComparisonResult) string

FormatReport renders a slice of ComparisonResults as a Markdown report with a summary table and a per-task detail section.

func ScoreCorrectness

func ScoreCorrectness(m *SessionMetrics, gt GroundTruth, allAssistantText string)

ScoreCorrectness computes FoundRelevantFiles, FoundKeySymbols, and AnswerCorrectness against a GroundTruth. It updates m in place.
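
The two count-based fields can be pictured with the sketch below. Both helper names are hypothetical, and substring matching is an assumption chosen so that relative ground-truth paths match absolute read paths and a directory entry such as internal/store/knowing/ matches any file beneath it. The /repo prefix and store.go are illustrative values.

```go
package main

import (
	"fmt"
	"strings"
)

// countFoundFiles counts ground-truth paths matched by at least one file read.
// Assumption: a ground-truth entry matches when it is a substring of a read path.
func countFoundFiles(filesRead, relevant []string) int {
	found := 0
	for _, want := range relevant {
		for _, got := range filesRead {
			if strings.Contains(got, want) {
				found++
				break // count each ground-truth entry at most once
			}
		}
	}
	return found
}

// countFoundSymbols counts key symbols that appear in the assistant text.
// Assumption: a case-sensitive substring test.
func countFoundSymbols(text string, symbols []string) int {
	found := 0
	for _, s := range symbols {
		if strings.Contains(text, s) {
			found++
		}
	}
	return found
}

func main() {
	// Illustrative paths; "/repo" and "store.go" are made up for this example.
	read := []string{"/repo/internal/types/types.go", "/repo/internal/store/knowing/store.go"}
	relevant := []string{"internal/types/types.go", "internal/store/knowing/", "internal/indexer/indexer.go"}
	fmt.Println(countFoundFiles(read, relevant)) // prints 2

	text := "Node feeds ComputeNodeHash"
	fmt.Println(countFoundSymbols(text, []string{"Node", "ComputeNodeHash", "PutNode"})) // prints 2
}
```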

Types

type ComparisonResult

type ComparisonResult struct {
	TaskID    string
	Control   SessionMetrics
	Treatment SessionMetrics

	// TokenSavings is the fraction of tokens saved:
	//   (control.TotalTokens - treatment.TotalTokens) / control.TotalTokens
	// Positive means treatment used fewer tokens.
	TokenSavings float64

	// ToolCallSavings is the fraction of tool calls saved:
	//   (control.ToolCalls - treatment.ToolCalls) / control.ToolCalls
	ToolCallSavings float64

	// TimeSavings is the fraction of wall-clock time saved.
	TimeSavings float64

	// CorrectnessDelta is treatment.AnswerCorrectness - control.AnswerCorrectness.
	// Positive means treatment gave a more correct answer.
	CorrectnessDelta float64
}

ComparisonResult holds the delta between a control (without knowing) and a treatment (with knowing) session for the same task.

func Compare

func Compare(control, treatment SessionMetrics) ComparisonResult

Compare computes all deltas between control and treatment metrics.
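
The savings fields follow the formula given in the ComparisonResult comments, (control - treatment) / control. A minimal sketch of that arithmetic is below; the savings helper is hypothetical, and returning 0 when the control value is zero is an assumption, since the package's handling of that case is not described here.

```go
package main

import "fmt"

// savings returns (control - treatment) / control.
// Assumption: a zero control value yields 0 to avoid dividing by zero;
// the package's actual handling of that case is not documented here.
func savings(control, treatment float64) float64 {
	if control == 0 {
		return 0
	}
	return (control - treatment) / control
}

func main() {
	fmt.Println(savings(20, 15)) // prints 0.25: treatment used 25% less
	fmt.Println(savings(10, 12)) // prints -0.2: treatment used 20% more
}
```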

type GroundTruth

type GroundTruth struct {
	// RelevantFiles are the files the agent should read or find to answer correctly.
	RelevantFiles []string
	// KeySymbols are the symbol names the agent should discover.
	KeySymbols []string
	// AnswerKeywords are substrings that must appear in the final response.
	AnswerKeywords []string
}

GroundTruth describes what a correct agent response must contain.

type SessionMetrics

type SessionMetrics struct {
	SessionID       string
	TaskID          string
	TotalTokens     int
	InputTokens     int
	OutputTokens    int
	ToolCalls       int
	ToolCallsByType map[string]int
	Turns           int
	WallClockMs     int64
	FilesRead       []string

	// Correctness fields: populated by ScoreCorrectness.
	FoundRelevantFiles int
	FoundKeySymbols    int
	AnswerCorrectness  float64
}

SessionMetrics holds all extracted metrics for a single benchmark session.

func ParseTranscript

func ParseTranscript(path string) (SessionMetrics, error)

ParseTranscript reads a Claude Code JSONL file and returns a SessionMetrics. Unknown line types and missing fields are silently skipped.

func ParseTranscriptWithScoring

func ParseTranscriptWithScoring(path string, gt GroundTruth) (SessionMetrics, error)

ParseTranscriptWithScoring parses a transcript and scores correctness against the provided ground truth in a single call.

type Task

type Task struct {
	ID          string
	Description string
	GroundTruth GroundTruth
	Complexity  string // "low", "medium", "high"
}

Task is a benchmark task fixture with a ground truth answer.
