Documentation ¶
Overview ¶
Package eval is the foundation of the R4 code-task generation and evaluation harness. It defines the pure, I/O-free data contracts the rest of the harness binds to: a versioned TaskSpec, the Generator interface per-language adapters implement, and a language Registry that maps a Language to the Generator that produces tasks for it.
This package is deliberately free of I/O. It performs no sandbox provisioning, no knowledge-graph queries, and no filesystem access; its only (de)serialization helpers operate on in-memory bytes. The I/O concerns live in downstream R4 packages (internal/eval/kgquery, internal/eval/sandbox, internal/eval/gen/<lang>, ...), which import these contracts.
Versioning ¶
Every TaskSpec carries a TaskSpec.TaskSpecVersion. Schema evolution is explicit and auditable: consumers bind to a version, and CurrentTaskSpecVersion names the version this build produces. See decision D4.5 in the R4 spec (.agents/workflow/specs/r4-code-task-generation-eval/design.md).
Index ¶
Constants ¶
const CurrentTaskSpecVersion = 1
CurrentTaskSpecVersion is the TaskSpec schema version this build produces. v1 is the initial schema (R4 spec decision D4.5).
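A minimal sketch of how a consumer might bind to a schema version, assuming it accepts only the version this build produces. The helper checkVersion and the sentinel ErrUnsupportedVersion are hypothetical names introduced for illustration; they are not part of the documented API.

```go
package main

import (
	"errors"
	"fmt"
)

// CurrentTaskSpecVersion mirrors the documented constant.
const CurrentTaskSpecVersion = 1

// ErrUnsupportedVersion is a hypothetical sentinel error for this sketch.
var ErrUnsupportedVersion = errors.New("unsupported task spec version")

// checkVersion is a hypothetical helper. It accepts only the schema
// version this build produces and wraps the sentinel otherwise, so
// callers can test for it with errors.Is.
func checkVersion(v int) error {
	if v != CurrentTaskSpecVersion {
		return fmt.Errorf("%w: got %d, want %d", ErrUnsupportedVersion, v, CurrentTaskSpecVersion)
	}
	return nil
}

func main() {
	for _, v := range []int{1, 2} {
		if err := checkVersion(v); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted version", v)
	}
}
```

Rejecting every other version is the strictest possible policy; a real consumer could instead accept a range of versions it knows how to read.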
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Difficulty ¶
type Difficulty string
Difficulty is the reproducible, KG-derived difficulty band of a task. The band is computed downstream from difficulty signals (node/edge counts, a cyclomatic-complexity proxy) so re-running the generator on the same KG state yields the same band (R4 requirement R2).
const (
	DifficultyEasy   Difficulty = "easy"
	DifficultyMedium Difficulty = "medium"
	DifficultyHard   Difficulty = "hard"
)
Difficulty bands.
func (Difficulty) Valid ¶
func (d Difficulty) Valid() bool
Valid reports whether d is a recognized difficulty band.
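The following sketch shows one plausible shape for Valid, assuming it is a plain membership check over the three documented bands. The switch body is an assumption, since the documentation shows only the signature.

```go
package main

import "fmt"

// Difficulty and its constants mirror the documented declarations.
type Difficulty string

const (
	DifficultyEasy   Difficulty = "easy"
	DifficultyMedium Difficulty = "medium"
	DifficultyHard   Difficulty = "hard"
)

// Valid reports whether d is one of the three documented bands.
// The body is an assumed implementation, not the package's own code.
func (d Difficulty) Valid() bool {
	switch d {
	case DifficultyEasy, DifficultyMedium, DifficultyHard:
		return true
	default:
		return false
	}
}

func main() {
	for _, d := range []Difficulty{"easy", "hard", "", "extreme"} {
		fmt.Printf("%q valid=%t\n", d, d.Valid())
	}
}
```

Under this sketch the empty string is not a valid band, so a caller that treats the zero value as "no constraint" would need to check for it before calling Valid.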
type GenerateOptions ¶
type GenerateOptions struct {
	// Difficulty optionally constrains the band of the generated task. The
	// zero value (empty string) lets the generator choose.
	Difficulty Difficulty
	// TemplateID optionally selects a specific template; empty lets the
	// generator pick.
	TemplateID string
}
GenerateOptions carries the per-call inputs a generator needs to frame a task. It is intentionally small and I/O-free at this layer; downstream generators thread their own KG reader and other collaborators through their constructor, not through this struct.
type GeneratedFrom ¶
type GeneratedFrom struct {
	Kind       GeneratedKind `yaml:"kind"`
	TemplateID string        `yaml:"template_id,omitempty"`
	KGQuery    *KGQuery      `yaml:"kg_query,omitempty"`
}
GeneratedFrom records how a task was produced so a run is reproducible and auditable (R4 requirement R10).
type GeneratedKind ¶
type GeneratedKind string
GeneratedKind names the provenance of a task. v1 generates from the Tree-sitter knowledge graph (KindKGTemplate); KindBenchmarkSeed is reserved for the v2 benchmark-seed adapter that emits the same TaskSpec shape.
const (
	KindKGTemplate    GeneratedKind = "kg_template"
	KindBenchmarkSeed GeneratedKind = "benchmark_seed"
)
Generation provenance kinds.
func (GeneratedKind) Valid ¶
func (k GeneratedKind) Valid() bool
Valid reports whether k is a recognized generation kind.
type Generator ¶
type Generator interface {
	// Language reports the language this generator produces tasks for. A
	// generator handles exactly one language.
	Language() Language
	// Generate synthesizes one TaskSpec. Implementations must return a spec
	// that passes TaskSpec.Validate, or an error.
	Generate(ctx context.Context, opts GenerateOptions) (*TaskSpec, error)
}
Generator produces a versioned TaskSpec for a single language. Each per-language adapter (internal/eval/gen/<lang>) implements this interface and registers itself in a Registry. The interface is the seam between the language-agnostic harness and language-specific task synthesis.
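As an illustration of the seam described above, here is a minimal sketch of a per-language adapter. The types are local mirrors of the documented ones, trimmed to the fields the sketch uses. stubGenerator, generateOnce, the "go" language string, the task ID, and the prompt text are all hypothetical; a real adapter would derive the task from the knowledge graph.

```go
package main

import (
	"context"
	"fmt"
)

// Local mirrors of the documented types, trimmed for this sketch.
type (
	Language   string
	Difficulty string
)

type GenerateOptions struct {
	Difficulty Difficulty
	TemplateID string
}

type TaskSpec struct {
	TaskSpecVersion int
	TaskID          string
	Language        Language
	Difficulty      Difficulty
	Prompt          string
}

type Generator interface {
	Language() Language
	Generate(ctx context.Context, opts GenerateOptions) (*TaskSpec, error)
}

// stubGenerator is a hypothetical adapter that returns a fixed task.
// It ignores opts.TemplateID because it has only one built-in task.
type stubGenerator struct{ lang Language }

func (g stubGenerator) Language() Language { return g.lang }

func (g stubGenerator) Generate(ctx context.Context, opts GenerateOptions) (*TaskSpec, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // honor cancellation before doing any work
	}
	band := opts.Difficulty
	if band == "" {
		band = "easy" // zero value: the generator chooses the band
	}
	return &TaskSpec{
		TaskSpecVersion: 1,
		TaskID:          "stub-0001", // hypothetical ID
		Language:        g.lang,
		Difficulty:      band,
		Prompt:          "Rename the seed symbol and keep the tests passing.",
	}, nil
}

// Compile-time check that stubGenerator satisfies Generator.
var _ Generator = stubGenerator{}

// generateOnce is a hypothetical convenience wrapper for the demo.
func generateOnce(g Generator, opts GenerateOptions) (*TaskSpec, error) {
	return g.Generate(context.Background(), opts)
}

func main() {
	spec, err := generateOnce(stubGenerator{lang: "go"}, GenerateOptions{Difficulty: "hard"})
	if err != nil {
		fmt.Println("generate failed:", err)
		return
	}
	fmt.Printf("%s task %s (%s)\n", spec.Language, spec.TaskID, spec.Difficulty)
}
```

The stub keeps no collaborators, but the same struct is where a real adapter would hold its KG reader, which matches the documented guidance that such dependencies travel through the constructor rather than through GenerateOptions.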
type KGQuery ¶
type KGQuery struct {
	Intent     string `yaml:"intent,omitempty"`
	SeedSymbol string `yaml:"seed_symbol,omitempty"`
}
KGQuery records the knowledge-graph query a kg_template task was framed around. It is metadata only; this package issues no queries.
type Language ¶
type Language string
Language identifies the programming language a task targets. Per R4 decision D4.3 the v1 harness covers Go, Python, and TypeScript; the type is a string so a future language is an additive constant, not a breaking change.
type Registry ¶
type Registry struct {
	// contains filtered or unexported fields
}
Registry maps a Language to the Generator that produces tasks for it. It is safe for concurrent use. A Registry is the lookup surface the harness uses to resolve `da eval gen --language <lang>` to a concrete generator.
func NewRegistry ¶
func NewRegistry() *Registry
NewRegistry returns an empty Registry ready for use.
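The documentation hides the Registry fields and does not list its methods here, so the following is only a sketch of how a concurrency-safe language-to-generator map could look. The field layout, the Register and Lookup method names, the duplicate-registration rule, and the fakeGen type are all assumptions. Generator is trimmed to the single method the sketch needs.

```go
package main

import (
	"fmt"
	"sync"
)

type Language string

// Generator is trimmed to the one method this sketch needs.
type Generator interface {
	Language() Language
}

// Registry uses an assumed layout: a map guarded by a RWMutex.
type Registry struct {
	mu   sync.RWMutex
	gens map[Language]Generator
}

// NewRegistry returns an empty Registry ready for use.
func NewRegistry() *Registry {
	return &Registry{gens: make(map[Language]Generator)}
}

// Register is a hypothetical method. It stores g under its own
// language and rejects nil generators and duplicate languages.
func (r *Registry) Register(g Generator) error {
	if g == nil {
		return fmt.Errorf("nil generator")
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	lang := g.Language()
	if _, ok := r.gens[lang]; ok {
		return fmt.Errorf("generator already registered for %q", lang)
	}
	r.gens[lang] = g
	return nil
}

// Lookup is a hypothetical method. It takes only a read lock, so
// concurrent lookups do not block each other.
func (r *Registry) Lookup(lang Language) (Generator, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	g, ok := r.gens[lang]
	return g, ok
}

// fakeGen is a hypothetical generator used only for the demo.
type fakeGen Language

func (f fakeGen) Language() Language { return Language(f) }

func main() {
	r := NewRegistry()
	if err := r.Register(fakeGen("go")); err != nil {
		fmt.Println("register failed:", err)
	}
	_, ok := r.Lookup("python")
	fmt.Println("python registered:", ok)
}
```

A read-write mutex suits this access pattern because registration typically happens once at startup while lookups happen on every command invocation.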
type SolutionArtifact ¶
SolutionArtifact names a file that the task expects to exist or be modified, together with that file's role (e.g. "target").
type TaskSpec ¶
type TaskSpec struct {
	TaskSpecVersion   int                `yaml:"task_spec_version"`
	TaskID            string             `yaml:"task_id"`
	Language          Language           `yaml:"language"`
	Difficulty        Difficulty         `yaml:"difficulty"`
	DifficultySignals map[string]int     `yaml:"difficulty_signals,omitempty"`
	GeneratedFrom     GeneratedFrom      `yaml:"generated_from"`
	Prompt            string             `yaml:"prompt"`
	SolutionArtifacts []SolutionArtifact `yaml:"solution_artifacts,omitempty"`
	Verification      Verification       `yaml:"verification"`
}
TaskSpec is the versioned, language-agnostic description of a single evaluable programming task. It is the central contract of the R4 harness: generators produce it, sandboxes provision against it, verifiers consume its verification commands, and the scoring bridge records it.
TaskSpec round-trips through YAML via the canonical field tags so the on-disk sidecar (.agents/eval/runs/<run-id>/taskspec.yaml) matches the in-memory shape exactly.
func ParseTaskSpec ¶
ParseTaskSpec decodes a TaskSpec from YAML bytes and validates it. Strict decoding rejects unknown fields so a stale-version sidecar cannot be silently misread.
func (*TaskSpec) MarshalYAML ¶
MarshalYAML serializes the spec to canonical YAML bytes. Map keys in difficulty_signals are emitted in sorted order, so the same spec always produces byte-identical output. This supports reproducibility (R4 requirements R2 and R10).
func (*TaskSpec) SignalKeys ¶
SignalKeys returns the difficulty-signal keys in sorted order. It is a convenience for callers (verifier, dashboard) that need stable iteration.
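Go map iteration order is unspecified, which is why callers need a sorted key list for stable output. The sketch below shows one way SignalKeys could be written, assuming it simply collects and sorts the map keys. The body is an assumption, and the signal names and values in main are made-up illustration data.

```go
package main

import (
	"fmt"
	"sort"
)

// TaskSpec is a local mirror trimmed to the one field this sketch uses.
type TaskSpec struct {
	DifficultySignals map[string]int
}

// SignalKeys returns the difficulty-signal keys in sorted order.
// The body is an assumed implementation, not the package's own code.
func (s *TaskSpec) SignalKeys() []string {
	keys := make([]string, 0, len(s.DifficultySignals))
	for k := range s.DifficultySignals {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Hypothetical signal names and values, for illustration only.
	spec := &TaskSpec{DifficultySignals: map[string]int{
		"node_count":       42,
		"edge_count":       97,
		"cyclomatic_proxy": 6,
	}}
	// Iterating over the sorted keys gives the same order on every run.
	for _, k := range spec.SignalKeys() {
		fmt.Printf("%s=%d\n", k, spec.DifficultySignals[k])
	}
}
```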
type Verification ¶
type Verification struct {
	BuildCmd       []string `yaml:"build_cmd,omitempty"`
	TestCmd        []string `yaml:"test_cmd"`
	TimeoutSeconds int      `yaml:"timeout_seconds,omitempty"`
}
Verification holds the commands the harness runs after the agent finishes. The test command is hidden from the agent (R4 decision D4.7). These fields are data only: this package executes nothing.
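Because these fields are plain data, a downstream runner has to decide how to interpret them. The sketch below shows one possible interpretation, assuming the build command (when present) runs before the test command and that an unset timeout falls back to a default. The helpers steps and timeout, the five-minute default, and the example commands are all hypothetical; the documentation does not specify them. Nothing is executed here.

```go
package main

import (
	"fmt"
	"time"
)

// Verification is a local mirror of the documented struct.
type Verification struct {
	BuildCmd       []string
	TestCmd        []string
	TimeoutSeconds int
}

// defaultTimeout is an assumed fallback; the docs do not state one.
const defaultTimeout = 5 * time.Minute

// steps returns the argv lists in the order this sketch assumes a
// runner would use: the optional build command first, then the tests.
func steps(v Verification) [][]string {
	var out [][]string
	if len(v.BuildCmd) > 0 {
		out = append(out, v.BuildCmd)
	}
	if len(v.TestCmd) > 0 {
		out = append(out, v.TestCmd)
	}
	return out
}

// timeout converts TimeoutSeconds to a time.Duration, using the
// assumed default when the field is zero or negative.
func timeout(v Verification) time.Duration {
	if v.TimeoutSeconds <= 0 {
		return defaultTimeout
	}
	return time.Duration(v.TimeoutSeconds) * time.Second
}

func main() {
	v := Verification{
		BuildCmd:       []string{"go", "build", "./..."},
		TestCmd:        []string{"go", "test", "./..."},
		TimeoutSeconds: 90,
	}
	for i, argv := range steps(v) {
		fmt.Printf("step %d: %v\n", i+1, argv)
	}
	fmt.Println("timeout:", timeout(v))
}
```

Storing each command as an argv slice rather than a single string means a runner can pass it to a process API directly, without shell parsing.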