Documentation ¶
Overview ¶
Package termbench is the TerminalBench external-benchmark adapter.
Responsibilities:
- Parse a TerminalBench task directory (task.yaml + instruction + tests/) into a typed Task.
- Translate Task → an agent ServiceExecuteRequest payload (prompt, work-dir, timeout). The adapter does NOT execute the task itself — it returns a Plan that the caller (cmd/bench) feeds into the agent service.Execute pipeline.
- Capture the agent's stream of harnesses.Event into the trajectory output format that TerminalBench's Harbor grader expects, namely ATIF v1.4 written to <out>/logs/agent/trajectory.json. The grader then runs the task's own pytest test_outputs.py against the modified workspace; we do not reimplement that grading.
What this package deliberately does NOT do:
It does NOT spin up Docker containers, install Harbor, or run pytest. Doing so requires the full Harbor stack (`harbor run …`) and an x86_64 Docker daemon. Instead, the adapter emits the artifacts Harbor expects so the upstream grader can score the run unmodified. See docs/helix/02-design/external-benchmarks.md for the end-to-end pipeline (agent → harness output → harbor grader).
It does NOT reimplement TerminalBench's grading rubric. Reward comes from /logs/verifier/reward.txt produced by Harbor's verifier.
See SD-008 (docs/helix/02-design/solution-designs/SD-008-terminal-bench-integration.md) for the integration audit; this package is the Go-side companion to the Python adapter at scripts/benchmark/harbor_agent.py.
Index ¶
Constants ¶
const (
	DefaultAgentTimeoutSec = 180
	DefaultTestTimeoutSec  = 30
)
Default timeouts match the TerminalBench task.yaml schema defaults (https://www.tbench.ai/docs/task-overview). Redeclared here so that tasks with missing fields still receive sane bounds without needing the upstream loader.
const ATIFSchemaVersion = "1.4"
ATIFSchemaVersion pins the trajectory schema to the version Harbor's grader currently consumes. SD-008 §4 documents the required fields.
Variables ¶
var ErrNoVerifierOutput = errors.New("termbench: verifier produced no reward.txt")
ErrNoVerifierOutput is returned by ReadGraderResult when the verifier did not produce a reward file. Sentinel so callers can treat "not graded" differently from "graded as failed".
Functions ¶
func WriteHarnessOutput ¶
func WriteHarnessOutput(outDir string, traj *Trajectory) error
WriteHarnessOutput writes the artifacts Harbor's grader expects under outDir, mirroring the container layout from SD-008 §4:
<outDir>/logs/agent/trajectory.json — ATIF v1.4
<outDir>/logs/agent/transcript.txt — flattened messages, debug aid
The function does NOT write reward.txt or ctrf.json — those come from the verifier (pytest run) which is upstream's job.
Types ¶
type AgentInfo ¶
type AgentInfo struct {
Name string `json:"name"`
Version string `json:"version"`
ModelName string `json:"model_name,omitempty"`
}
AgentInfo identifies the executor in trajectory output. Harbor's reporters use Name + Version to label leaderboard entries.
type CaptureOptions ¶
type CaptureOptions struct {
SessionID string
Agent AgentInfo
TaskID string
StartedAt time.Time // used for relative timestamps; defaults to now()
}
CaptureOptions controls how harness events are folded into a trajectory.
type ExecutionPlan ¶
type ExecutionPlan struct {
// Task is the source task (kept for downstream reporting).
Task *Task
// Request is the ServiceExecuteRequest the caller should hand to
// fizeau.New(...).Execute. WorkDir is left to the caller because a
// real Harbor run mounts the task workspace at a container path
// (/app), while a dry-run from cmd/bench may point at a temp dir.
Request fizeau.ServiceExecuteRequest
// Timeout matches the task's MaxAgentTimeoutSec budget, suitable for
// a context.WithTimeout wrapping Execute.
Timeout time.Duration
}
ExecutionPlan is the ServiceExecuteRequest payload + ancillary metadata for one TerminalBench task. The caller (cmd/bench) is responsible for invoking fizeau.Service.Execute with Request and consuming the resulting event channel; this package never spawns a goroutine of its own.
func BuildPlan ¶
func BuildPlan(task *Task, opts PlanOptions) *ExecutionPlan
BuildPlan converts a Task into an ExecutionPlan. It does not execute anything; callers are free to inspect or mutate Request before handing it to fizeau.Service.Execute.
The instruction text is used verbatim as the prompt — TerminalBench tasks are written to be agent-ready, so wrapping them in additional scaffolding would change the contract the upstream grader expects.
type GraderResult ¶
type GraderResult struct {
// TaskID identifies which task produced this result.
TaskID string
// Reward is the canonical pass/fail signal: 1 = passed, 0 = failed.
// Mirrors Harbor's /logs/verifier/reward.txt contract.
Reward int
// CTRFPath, if set, points at the verifier's pytest CTRF JSON. Reporters
// can drill down for per-test detail.
CTRFPath string
// RewardPath is the absolute path the reward was read from.
RewardPath string
// Notes carries human-readable detail (e.g. "missing reward.txt").
Notes string
}
GraderResult is the scored outcome of one TerminalBench task. It is produced by Harbor's verifier (pytest --ctrf), NOT by this package. We only read the artifacts the verifier writes.
func ReadGraderResult ¶
func ReadGraderResult(taskID, outDir string) (*GraderResult, error)
ReadGraderResult inspects an output directory laid out the way Harbor produces it (see SD-008 §4) and returns the verifier's verdict. The contract:
<outDir>/logs/verifier/reward.txt — single integer (required)
<outDir>/logs/verifier/ctrf.json — pytest CTRF report (optional)
If reward.txt is missing the function returns ErrNoVerifierOutput so callers can distinguish "task failed" (reward=0) from "task never graded". Agent logs can still be present without a reward file when the agent ran but the verifier fell over.
func (*GraderResult) Passed ¶
func (g *GraderResult) Passed() bool
Passed is true iff the verifier awarded reward >= 1.
type PlanOptions ¶
type PlanOptions struct {
// Harness is the agent harness label (e.g. "fiz", "claude-code",
// "codex"). Passed through verbatim into ServiceExecuteRequest.
Harness string
// Model is the provider model ID (e.g. "openrouter/qwen/qwen3.6-plus").
Model string
// Provider, if non-empty, pins the named provider (e.g. "vidar",
// "bragi"). Useful when the same model id is ambiguous across
// providers, or when a pin-only catalog entry needs an explicit
// provider hop.
Provider string
// WorkDir is the directory the agent operates in. For a real Harbor
// trial this is /app inside the container; for a Go-side dry-run it
// can be a tempdir seeded with the task workspace.
WorkDir string
// Permissions, if non-empty, overrides the default "safe" preset. The
// TerminalBench tasks routinely require shell + edit access so the
// caller may want "trusted" here.
Permissions string
// Seed enables deterministic sampling (matches cmd/bench parity runs).
// Zero means "leave unset".
Seed int64
// Temperature, if non-nil, overrides the request temperature. The
// default-zero behavior is "leave unset" so the agent's own bench
// path can pin temperature to 0 separately.
Temperature *float32
}
PlanOptions tunes how a Task is converted into a ServiceExecuteRequest. The defaults match what SD-008 §3 + §5 documented for the Harbor smoke run, so callers can leave most fields zero-valued.
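The Temperature field's pointer idiom (nil means "leave unset", a non-nil pointer overrides even with zero) can be sketched as below; the `opts` struct and `effectiveTemp` helper are local illustrations, not part of the package.

```go
package main

import "fmt"

// opts is a local stand-in for the Temperature portion of PlanOptions:
// a *float32 distinguishes "unset" (nil) from an explicit zero.
type opts struct {
	Temperature *float32
}

// effectiveTemp returns the override when one is pinned, otherwise the
// fallback.
func effectiveTemp(o opts, fallback float32) float32 {
	if o.Temperature != nil {
		return *o.Temperature
	}
	return fallback
}

func main() {
	var o opts
	fmt.Println(effectiveTemp(o, 0.7)) // 0.7 — nil means "leave unset"

	zero := float32(0)
	o.Temperature = &zero
	fmt.Println(effectiveTemp(o, 0.7)) // 0 — an explicit zero wins
}
```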
type Task ¶
type Task struct {
// ID is the task directory name (terminal-bench convention).
ID string
// Path is the absolute path to the task directory.
Path string
// Instruction is the natural-language prompt the agent receives. Resolved
// from the descriptions[].instruction matching the chosen difficulty
// key, falling back to the "base" description.
Instruction string
// Difficulty is one of "easy", "medium", "hard" (or empty if not declared).
Difficulty string
// Tags is the categorization labels from task.yaml, copied verbatim.
Tags []string
// MaxAgentTimeoutSec is the per-task agent wall-clock budget Harbor
// enforces. The adapter uses this when building ServiceExecuteRequest
// so the Go-side timeout matches what the grader will tolerate.
MaxAgentTimeoutSec int
// MaxTestTimeoutSec is the verifier's pytest timeout. We surface it so
// reporters can record it; the adapter does not run tests itself.
MaxTestTimeoutSec int
// AuthorEmail comes straight from task.yaml; useful for provenance
// tracking in result artifacts.
AuthorEmail string
}
Task is the typed projection of a TerminalBench task directory we care about for adapter purposes. We only read the fields needed to drive ServiceExecuteRequest; everything else (Dockerfile, test scripts, etc.) stays in the upstream task tree and is the grader's concern.
func LoadTask ¶
LoadTask reads a TerminalBench task directory and returns a Task. The task ID is inferred from the directory name. Two layouts are supported:
- TB1 ("terminal-bench"): <taskDir>/task.yaml with descriptions[] embedded plus tests/test_outputs.py. Documented at https://www.tbench.ai/docs/task-overview.
- TB2 ("terminal-bench-2"): <taskDir>/task.toml plus a sibling instruction.md and tests/test_outputs.py. Documented at the same URL but the schema is reorganized; our reader extracts only the fields we need so we don't pull in a TOML dependency.
The function does NOT verify Dockerfile/environment presence — those are Harbor's concern. We only check the contract surface this adapter consumes (instruction text + timeouts).
type Trajectory ¶
type Trajectory struct {
SchemaVersion string `json:"schema_version"`
SessionID string `json:"session_id"`
TaskID string `json:"task_id"`
Agent AgentInfo `json:"agent"`
Steps []TrajectoryStep `json:"steps"`
FinalMetrics TrajectoryStat `json:"final_metrics"`
FinalStatus string `json:"final_status,omitempty"`
ExitCode int `json:"exit_code"`
DurationMS int64 `json:"duration_ms"`
Error string `json:"error,omitempty"`
}
Trajectory is the top-level ATIF v1.4 document.
func Capture ¶
func Capture(ch <-chan harnesses.Event, opts CaptureOptions) *Trajectory
Capture consumes harness events from ch and returns an ATIF trajectory. The function blocks until ch closes. It is the inverse of the Python adapter's `populate_context_post_run` hook described in SD-008 §4.
Mapping rules:
- text_delta events accumulate into a single "agent" step's message. We do not split on token boundaries because Harbor's grader scores the rendered transcript, not the streaming protocol.
- tool_call + matching tool_result are paired by ID into one TrajectoryTC entry on a "tool" step.
- final carries the exit code, status, total usage, and cost.
Unknown event types (compaction, stall, routing_decision) are recorded as "system" steps with their JSON payload in the message field, so downstream reporters can still see them without breaking the schema.
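The first two mapping rules can be sketched with a local stand-in event type (the real harnesses.Event has a richer shape; the `event` struct and `fold` helper here are illustrative assumptions):

```go
package main

import "fmt"

// event is a minimal stand-in for harnesses.Event: just enough shape
// to show the folding rules.
type event struct {
	Kind string // "text_delta" | "tool_call" | "tool_result" | "final"
	ID   string // pairs tool_call with its tool_result
	Text string
}

// fold accumulates text deltas into one message (no splitting on token
// boundaries) and pairs tool_call/tool_result by ID into [input, output].
func fold(events []event) (message string, tools map[string][2]string) {
	tools = map[string][2]string{}
	for _, ev := range events {
		switch ev.Kind {
		case "text_delta":
			message += ev.Text
		case "tool_call":
			t := tools[ev.ID]
			t[0] = ev.Text
			tools[ev.ID] = t
		case "tool_result":
			t := tools[ev.ID]
			t[1] = ev.Text
			tools[ev.ID] = t
		}
	}
	return message, tools
}

func main() {
	msg, tools := fold([]event{
		{Kind: "text_delta", Text: "Hel"},
		{Kind: "text_delta", Text: "lo"},
		{Kind: "tool_call", ID: "t1", Text: `{"cmd":"ls"}`},
		{Kind: "tool_result", ID: "t1", Text: "README.md"},
	})
	fmt.Println(msg)            // Hello
	fmt.Println(tools["t1"][1]) // README.md
}
```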
type TrajectoryStat ¶
type TrajectoryStat struct {
InputTokens int `json:"input_tokens"`
OutputTokens int `json:"output_tokens"`
Cost float64 `json:"cost"`
}
TrajectoryStat carries per-step or final usage/cost metrics.
type TrajectoryStep ¶
type TrajectoryStep struct {
StepID int `json:"step_id"`
Timestamp string `json:"timestamp"`
Source string `json:"source"` // user|agent|system|tool
Message string `json:"message,omitempty"`
ToolCalls []TrajectoryTC `json:"tool_calls,omitempty"`
Metrics *TrajectoryStat `json:"metrics,omitempty"`
}
TrajectoryStep is one transcript element in ATIF v1.4 form.
type TrajectoryTC ¶
type TrajectoryTC struct {
Name string `json:"name"`
Input json.RawMessage `json:"input,omitempty"`
Output string `json:"output,omitempty"`
Error string `json:"error,omitempty"`
}
TrajectoryTC is one tool invocation as ATIF expects it.