Documentation ¶
Overview ¶
Package termbench is the TerminalBench external-benchmark adapter.
Responsibilities:
- Parse a TerminalBench task directory (task.yaml + instruction + tests/) into a typed Task.
- Translate Task → an agent ServiceExecuteRequest payload (prompt, work-dir, timeout). The adapter does NOT execute the task itself — it returns a Plan that the caller (cmd/bench) feeds into the agent service.Execute pipeline.
- Capture the agent's stream of harnesses.Event into the trajectory output format that TerminalBench's Harbor grader expects, namely ATIF v1.4 written to <out>/logs/agent/trajectory.json. The grader then runs the task's own pytest test_outputs.py against the modified workspace; we do not reimplement that grading.
What this package deliberately does NOT do:
It does NOT spin up Docker containers, install Harbor, or run pytest. Doing so requires the full Harbor stack (`harbor run …`) and an x86_64 Docker daemon. Instead, the adapter emits the artifacts Harbor expects so the upstream grader can score the run unmodified. See docs/helix/02-design/external-benchmarks.md for the end-to-end pipeline (agent → harness output → harbor grader).
It does NOT reimplement TerminalBench's grading rubric. Reward comes from /logs/verifier/reward.txt produced by Harbor's verifier.
See SD-008 (docs/helix/02-design/solution-designs/SD-008-terminal-bench-integration.md) for the integration audit; this package is the Go-side companion to the Python adapter at scripts/benchmark/harbor_agent.py.
Index ¶
Constants ¶
const (
	DefaultAgentTimeoutSec = 180
	DefaultTestTimeoutSec  = 30
)
Default timeouts match the TerminalBench task.yaml schema defaults (https://www.tbench.ai/docs/task-overview). Redeclared here so that tasks with missing fields still receive sane bounds without needing the upstream loader.
const ATIFSchemaVersion = "1.4"
ATIFSchemaVersion pins the trajectory schema to the version Harbor's grader currently consumes. SD-008 §4 documents the required fields.
Variables ¶
var ErrNoVerifierOutput = errors.New("termbench: verifier produced no reward.txt")
ErrNoVerifierOutput is returned by ReadGraderResult when the verifier did not produce a reward file. Sentinel so callers can treat "not graded" differently from "graded as failed".
Functions ¶
func WriteHarnessOutput ¶
func WriteHarnessOutput(outDir string, traj *Trajectory) error
WriteHarnessOutput writes the artifacts Harbor's grader expects under outDir, mirroring the container layout from SD-008 §4:
<outDir>/logs/agent/trajectory.json — ATIF v1.4
<outDir>/logs/agent/transcript.txt — flattened messages, debug aid
The function does NOT write reward.txt or ctrf.json — those come from the verifier (pytest run) which is upstream's job.
Types ¶
type AgentInfo ¶
type AgentInfo struct {
Name string `json:"name"`
Version string `json:"version"`
ModelName string `json:"model_name,omitempty"`
}
AgentInfo identifies the executor in trajectory output. Harbor's reporters use Name + Version to label leaderboard entries.
type CaptureOptions ¶
type CaptureOptions struct {
SessionID string
Agent AgentInfo
TaskID string
StartedAt time.Time // used for relative timestamps; defaults to now()
}
CaptureOptions controls how harness events are folded into a trajectory.
type ExecutionPlan ¶
type ExecutionPlan struct {
// Task is the source task (kept for downstream reporting).
Task *Task
// Request is the ServiceExecuteRequest the caller should hand to
// fizeau.New(...).Execute. WorkDir is left to the caller because a
// real Harbor run mounts the task workspace at a container path
// (/app), while a dry-run from cmd/bench may point at a temp dir.
Request fizeau.ServiceExecuteRequest
// Timeout matches the task's MaxAgentTimeoutSec budget, suitable for
// a context.WithTimeout wrapping Execute.
Timeout time.Duration
}
ExecutionPlan is the ServiceExecuteRequest payload + ancillary metadata for one TerminalBench task. The caller (cmd/bench) is responsible for invoking fizeau.Service.Execute with Request and consuming the resulting event channel; this package never spawns a goroutine of its own.
func BuildPlan ¶
func BuildPlan(task *Task, opts PlanOptions) *ExecutionPlan
BuildPlan converts a Task into an ExecutionPlan. It does not execute anything; callers are free to inspect or mutate Request before handing it to fizeau.Service.Execute.
The instruction text is used verbatim as the prompt — TerminalBench tasks are written to be agent-ready, so wrapping them in additional scaffolding would change the contract the upstream grader expects.
type GraderResult ¶
type GraderResult struct {
// TaskID identifies which task produced this result.
TaskID string
// Reward is the canonical pass/fail signal: 1 = passed, 0 = failed.
// Mirrors Harbor's /logs/verifier/reward.txt contract.
Reward int
// CTRFPath, if set, points at the verifier's pytest CTRF JSON. Reporters
// can drill down for per-test detail.
CTRFPath string
// RewardPath is the absolute path the reward was read from.
RewardPath string
// Notes carries human-readable detail (e.g. "missing reward.txt").
Notes string
}
GraderResult is the scored outcome of one TerminalBench task. It is produced by Harbor's verifier (pytest --ctrf), NOT by this package. We only read the artifacts the verifier writes.
func ReadGraderResult ¶
func ReadGraderResult(taskID, outDir string) (*GraderResult, error)
ReadGraderResult inspects an output directory laid out the way Harbor produces it (see SD-008 §4) and returns the verifier's verdict. The contract:
<outDir>/logs/verifier/reward.txt — single integer (required)
<outDir>/logs/verifier/ctrf.json — pytest CTRF report (optional)
If reward.txt is missing the function returns ErrNoVerifierOutput so callers can distinguish "task failed" (reward=0) from "task never graded". Agent logs can still be present without a reward file when the agent ran but the verifier fell over.
func (*GraderResult) Passed ¶
func (g *GraderResult) Passed() bool
Passed is true iff the verifier awarded reward >= 1.
type PlanOptions ¶
type PlanOptions struct {
// Harness is the agent harness label (e.g. "fiz", "claude-code",
// "codex"). Passed through verbatim into ServiceExecuteRequest.
Harness string
// Model is the provider model ID (e.g. "openrouter/qwen/qwen3.6-plus").
Model string
// Provider, if non-empty, pins the named provider (e.g. "vidar",
// "bragi"). Useful when the same model id is ambiguous across
// providers, or when a pin-only catalog entry needs an explicit
// provider hop.
Provider string
// WorkDir is the directory the agent operates in. For a real Harbor
// trial this is /app inside the container; for a Go-side dry-run it
// can be a tempdir seeded with the task workspace.
WorkDir string
// Permissions, if non-empty, overrides the default "safe" preset. The
// TerminalBench tasks routinely require shell + edit access so the
// caller may want "trusted" here.
Permissions string
// Seed enables deterministic sampling (matches cmd/bench parity runs).
// Zero means "leave unset".
Seed int64
// Temperature, if non-nil, overrides the request temperature. The
// default-zero behavior is "leave unset" so the agent's own bench
// path can pin temperature to 0 separately.
Temperature *float32
}
PlanOptions tunes how a Task is converted into a ServiceExecuteRequest. The defaults match what SD-008 §3 + §5 documented for the Harbor smoke run, so callers can leave most fields zero-valued.
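The Temperature field's pointer idiom (nil means "leave unset", a non-nil pointer overrides even with zero) can be sketched as below; the `opts` struct and `effectiveTemp` helper are local illustrations, not part of the package.

```go
package main

import "fmt"

// opts is a local stand-in for the Temperature portion of PlanOptions:
// a *float32 distinguishes "unset" (nil) from an explicit zero.
type opts struct {
	Temperature *float32
}

// effectiveTemp returns the override when one is pinned, otherwise the
// fallback.
func effectiveTemp(o opts, fallback float32) float32 {
	if o.Temperature != nil {
		return *o.Temperature
	}
	return fallback
}

func main() {
	var o opts
	fmt.Println(effectiveTemp(o, 0.7)) // 0.7 — nil means "leave unset"

	zero := float32(0)
	o.Temperature = &zero
	fmt.Println(effectiveTemp(o, 0.7)) // 0 — an explicit zero wins
}
```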
type Task ¶
type Task struct {
// ID is the task directory name (terminal-bench convention).
ID string
// Path is the absolute path to the task directory.
Path string
// Instruction is the natural-language prompt the agent receives. Resolved
// from the descriptions[].instruction matching the chosen difficulty
// key, falling back to the "base" description.
Instruction string
// Difficulty is one of "easy", "medium", "hard" (or empty if not declared).
Difficulty string
// Tags is the categorization labels from task.yaml, copied verbatim.
Tags []string
// MaxAgentTimeoutSec is the per-task agent wall-clock budget Harbor
// enforces. The adapter uses this when building ServiceExecuteRequest
// so the Go-side timeout matches what the grader will tolerate.
MaxAgentTimeoutSec int
// MaxTestTimeoutSec is the verifier's pytest timeout. We surface it so
// reporters can record it; the adapter does not run tests itself.
MaxTestTimeoutSec int
// AuthorEmail comes straight from task.yaml; useful for provenance
// tracking in result artifacts.
AuthorEmail string
}
Task is the typed projection of a TerminalBench task directory we care about for adapter purposes. We only read the fields needed to drive ServiceExecuteRequest; everything else (Dockerfile, test scripts, etc.) stays in the upstream task tree and is the grader's concern.
func LoadTask ¶
LoadTask reads a TerminalBench task directory and returns a Task. The task ID is inferred from the directory name. Two layouts are supported:
- TB1 ("terminal-bench"): <taskDir>/task.yaml with descriptions[] embedded plus tests/test_outputs.py. Documented at https://www.tbench.ai/docs/task-overview.
- TB2 ("terminal-bench-2"): <taskDir>/task.toml plus a sibling instruction.md and tests/test_outputs.py. Documented at the same URL but the schema is reorganized; our reader extracts only the fields we need so we don't pull in a TOML dependency.
The function does NOT verify Dockerfile/environment presence — those are Harbor's concern. We only check the contract surface this adapter consumes (instruction text + timeouts).
type Trajectory ¶
type Trajectory struct {
SchemaVersion string `json:"schema_version"`
SessionID string `json:"session_id"`
TaskID string `json:"task_id"`
Agent AgentInfo `json:"agent"`
Steps []TrajectoryStep `json:"steps"`
FinalMetrics TrajectoryStat `json:"final_metrics"`
FinalStatus string `json:"final_status,omitempty"`
ExitCode int `json:"exit_code"`
DurationMS int64 `json:"duration_ms"`
Error string `json:"error,omitempty"`
}
Trajectory is the top-level ATIF v1.4 document.
func Capture ¶
func Capture(ch <-chan harnesses.Event, opts CaptureOptions) *Trajectory
Capture consumes harness events from ch and returns an ATIF trajectory. The function blocks until ch closes. It is the inverse of the Python adapter's `populate_context_post_run` hook described in SD-008 §4.
Mapping rules:
- text_delta events accumulate into a single "agent" step's message. We do not split on token boundaries because Harbor's grader scores the rendered transcript, not the streaming protocol.
- tool_call + matching tool_result are paired by ID into one TrajectoryTC entry on a "tool" step.
- final carries the exit code, status, total usage, and cost.
Unknown event types (compaction, stall, routing_decision) are recorded as "system" steps with their JSON payload in the message field, so downstream reporters can still see them without breaking the schema.
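The first two mapping rules can be sketched with a local stand-in event type (the real harnesses.Event has a richer shape; the `event` struct and `fold` helper here are illustrative assumptions):

```go
package main

import "fmt"

// event is a minimal stand-in for harnesses.Event: just enough shape
// to show the folding rules.
type event struct {
	Kind string // "text_delta" | "tool_call" | "tool_result" | "final"
	ID   string // pairs tool_call with its tool_result
	Text string
}

// fold accumulates text deltas into one message (no splitting on token
// boundaries) and pairs tool_call/tool_result by ID into [input, output].
func fold(events []event) (message string, tools map[string][2]string) {
	tools = map[string][2]string{}
	for _, ev := range events {
		switch ev.Kind {
		case "text_delta":
			message += ev.Text
		case "tool_call":
			t := tools[ev.ID]
			t[0] = ev.Text
			tools[ev.ID] = t
		case "tool_result":
			t := tools[ev.ID]
			t[1] = ev.Text
			tools[ev.ID] = t
		}
	}
	return message, tools
}

func main() {
	msg, tools := fold([]event{
		{Kind: "text_delta", Text: "Hel"},
		{Kind: "text_delta", Text: "lo"},
		{Kind: "tool_call", ID: "t1", Text: `{"cmd":"ls"}`},
		{Kind: "tool_result", ID: "t1", Text: "README.md"},
	})
	fmt.Println(msg)            // Hello
	fmt.Println(tools["t1"][1]) // README.md
}
```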
type TrajectoryStat ¶
type TrajectoryStat struct {
InputTokens int `json:"input_tokens"`
OutputTokens int `json:"output_tokens"`
Cost float64 `json:"cost"`
}
TrajectoryStat carries per-step or final usage/cost metrics.
type TrajectoryStep ¶
type TrajectoryStep struct {
StepID int `json:"step_id"`
Timestamp string `json:"timestamp"`
Source string `json:"source"` // user|agent|system|tool
Message string `json:"message,omitempty"`
ToolCalls []TrajectoryTC `json:"tool_calls,omitempty"`
Metrics *TrajectoryStat `json:"metrics,omitempty"`
}
TrajectoryStep is one transcript element in ATIF v1.4 form.
type TrajectoryTC ¶
type TrajectoryTC struct {
Name string `json:"name"`
Input json.RawMessage `json:"input,omitempty"`
Output string `json:"output,omitempty"`
Error string `json:"error,omitempty"`
}
TrajectoryTC is one tool invocation as ATIF expects it.