eval

package
v0.4.2 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 29, 2026 License: Apache-2.0 Imports: 6 Imported by: 0

Documentation

Overview

Package eval is the foundation of the R4 code-task generation and evaluation harness. It defines the pure, I/O-free data contracts the rest of the harness binds to: a versioned TaskSpec, the Generator interface per-language adapters implement, and a language Registry that maps a Language to the Generator that produces tasks for it.

This package is deliberately free of I/O. It performs no sandbox provisioning, no knowledge-graph queries, and no filesystem access; its (de)serialization helpers operate only on in-memory bytes. Those concerns live in downstream R4 packages (internal/eval/kgquery, internal/eval/sandbox, internal/eval/gen/<lang>, ...), which import these contracts.

Versioning

Every TaskSpec carries a TaskSpec.TaskSpecVersion. Schema evolution is explicit and auditable: consumers bind to a version, and CurrentTaskSpecVersion names the version this build produces. See decision D4.5 in the R4 spec (.agents/workflow/specs/r4-code-task-generation-eval/design.md).

Constants

const CurrentTaskSpecVersion = 1

CurrentTaskSpecVersion is the TaskSpec schema version this build produces. v1 is the initial schema (R4 spec decision D4.5).

Variables

This section is empty.

Functions

This section is empty.

Types

type Difficulty

type Difficulty string

Difficulty is the reproducible, KG-derived difficulty band of a task. The band is computed downstream from difficulty signals (node/edge counts, a cyclomatic-complexity proxy) so re-running the generator on the same KG state yields the same band (R4 requirement R2).

const (
	DifficultyEasy   Difficulty = "easy"
	DifficultyMedium Difficulty = "medium"
	DifficultyHard   Difficulty = "hard"
)

Difficulty bands.

func (Difficulty) Valid

func (d Difficulty) Valid() bool

Valid reports whether d is a recognized difficulty band.
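
The membership check behind Valid can be pictured as a switch over the declared constants. This is a minimal sketch that assumes the method simply tests against the three documented bands; the type and constants are re-declared locally so the snippet compiles on its own, and the package's actual source may differ.

```go
package main

import "fmt"

// Difficulty is a local copy of the documented string type.
type Difficulty string

const (
	DifficultyEasy   Difficulty = "easy"
	DifficultyMedium Difficulty = "medium"
	DifficultyHard   Difficulty = "hard"
)

// Valid reports whether d is one of the three documented bands.
// The switch is an assumed implementation, not the package's source.
func (d Difficulty) Valid() bool {
	switch d {
	case DifficultyEasy, DifficultyMedium, DifficultyHard:
		return true
	}
	return false
}

func main() {
	fmt.Println(Difficulty("medium").Valid()) // true
	fmt.Println(Difficulty("extreme").Valid()) // false
}
```

Under this assumption the empty string is not a valid band, so a caller that treats the zero value as "let the generator choose" would check for it before calling Valid.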

type GenerateOptions

type GenerateOptions struct {
	// Difficulty optionally constrains the band of the generated task. The
	// zero value (empty string) lets the generator choose.
	Difficulty Difficulty
	// TemplateID optionally selects a specific template; empty lets the
	// generator pick.
	TemplateID string
}

GenerateOptions carries the per-call inputs a generator needs to frame a task. It is intentionally small and I/O-free at this layer; downstream generators thread their own KG reader and other collaborators through their constructor, not through this struct.

type GeneratedFrom

type GeneratedFrom struct {
	Kind       GeneratedKind `yaml:"kind"`
	TemplateID string        `yaml:"template_id,omitempty"`
	KGQuery    *KGQuery      `yaml:"kg_query,omitempty"`
}

GeneratedFrom records how a task was produced so a run is reproducible and auditable (R4 requirement R10).

type GeneratedKind

type GeneratedKind string

GeneratedKind names the provenance of a task. v1 generates from the Tree-sitter knowledge graph (KindKGTemplate); KindBenchmarkSeed is reserved for the v2 benchmark-seed adapter that emits the same TaskSpec shape.

const (
	KindKGTemplate    GeneratedKind = "kg_template"
	KindBenchmarkSeed GeneratedKind = "benchmark_seed"
)

Generation provenance kinds.

func (GeneratedKind) Valid

func (k GeneratedKind) Valid() bool

Valid reports whether k is a recognized generation kind.

type Generator

type Generator interface {
	// Language reports the language this generator produces tasks for. A
	// generator handles exactly one language.
	Language() Language
	// Generate synthesizes one TaskSpec. Implementations must return a spec
	// that passes TaskSpec.Validate, or an error.
	Generate(ctx context.Context, opts GenerateOptions) (*TaskSpec, error)
}

Generator produces a versioned TaskSpec for a single language. Each per-language adapter (internal/eval/gen/<lang>) implements this interface and registers itself in a Registry. The interface is the seam between the language-agnostic harness and language-specific task synthesis.

type KGQuery

type KGQuery struct {
	Intent     string `yaml:"intent,omitempty"`
	SeedSymbol string `yaml:"seed_symbol,omitempty"`
}

KGQuery records the knowledge-graph query a kg_template task was framed around. It is metadata only — this package issues no queries.

type Language

type Language string

Language identifies the programming language a task targets. Per R4 decision D4.3 the v1 harness covers Go, Python, and TypeScript; the type is a string so a future language is an additive constant, not a breaking change.

const (
	LanguageGo         Language = "go"
	LanguagePython     Language = "python"
	LanguageTypeScript Language = "typescript"
)

Supported v1 languages.

func (Language) Valid

func (l Language) Valid() bool

Valid reports whether l is a recognized v1 language.

type Registry

type Registry struct {
	// contains filtered or unexported fields
}

Registry maps a Language to the Generator that produces tasks for it. It is safe for concurrent use. A Registry is the lookup surface the harness uses to resolve `da eval gen --language <lang>` to a concrete generator.

func NewRegistry

func NewRegistry() *Registry

NewRegistry returns an empty Registry ready for use.

func (*Registry) Languages

func (r *Registry) Languages() []Language

Languages returns the registered languages in sorted order.

func (*Registry) Lookup

func (r *Registry) Lookup(lang Language) (Generator, bool)

Lookup returns the generator registered for lang. The boolean is false when no generator is registered for the language.

func (*Registry) Register

func (r *Registry) Register(g Generator) error

Register adds g to the registry keyed by its Language. It errors on a nil generator, an invalid language, or a duplicate registration so collisions surface at wiring time rather than silently shadowing.
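
One way to get the documented behavior (concurrent safety, sorted Languages, errors on nil or duplicate registration) is a map guarded by a sync.RWMutex. The sketch below assumes that design; the unexported fields, the error messages, and the fixed helper type are hypothetical, and the language check is reduced to a non-empty test where the real package presumably uses Language.Valid.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

type Language string

// Generator is reduced to the one method the registry needs here.
type Generator interface {
	Language() Language
}

// Registry is an assumed implementation: a map guarded by a RWMutex.
type Registry struct {
	mu   sync.RWMutex
	gens map[Language]Generator
}

func NewRegistry() *Registry {
	return &Registry{gens: make(map[Language]Generator)}
}

// Register rejects nil generators, empty languages, and duplicates.
func (r *Registry) Register(g Generator) error {
	if g == nil {
		return errors.New("nil generator")
	}
	lang := g.Language()
	if lang == "" { // simplified stand-in for a validity check
		return errors.New("invalid language")
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, dup := r.gens[lang]; dup {
		return fmt.Errorf("generator already registered for %q", lang)
	}
	r.gens[lang] = g
	return nil
}

func (r *Registry) Lookup(lang Language) (Generator, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	g, ok := r.gens[lang]
	return g, ok
}

// Languages copies the keys under the read lock and sorts the copy.
func (r *Registry) Languages() []Language {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]Language, 0, len(r.gens))
	for l := range r.gens {
		out = append(out, l)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// fixed is a hypothetical generator that only reports its language.
type fixed Language

func (f fixed) Language() Language { return Language(f) }

func main() {
	r := NewRegistry()
	fmt.Println(r.Register(fixed("python")), r.Register(fixed("go")))
	fmt.Println(r.Register(fixed("go")) != nil) // true: duplicate is rejected
	fmt.Println(r.Languages())                  // [go python]
}
```

Returning an error on a duplicate, rather than overwriting, matches the stated goal of surfacing collisions at wiring time.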

type SolutionArtifact

type SolutionArtifact struct {
	Path string `yaml:"path"`
	Role string `yaml:"role,omitempty"`
}

SolutionArtifact names a file the task expects to exist or be modified and its role (e.g. "target").

type TaskSpec

type TaskSpec struct {
	TaskSpecVersion   int                `yaml:"task_spec_version"`
	TaskID            string             `yaml:"task_id"`
	Language          Language           `yaml:"language"`
	Difficulty        Difficulty         `yaml:"difficulty"`
	DifficultySignals map[string]int     `yaml:"difficulty_signals,omitempty"`
	GeneratedFrom     GeneratedFrom      `yaml:"generated_from"`
	Prompt            string             `yaml:"prompt"`
	SolutionArtifacts []SolutionArtifact `yaml:"solution_artifacts,omitempty"`
	Verification      Verification       `yaml:"verification"`
}

TaskSpec is the versioned, language-agnostic description of a single evaluable programming task. It is the central contract of the R4 harness: generators produce it, sandboxes provision against it, verifiers consume its verification commands, and the scoring bridge records it.

TaskSpec round-trips through YAML via the canonical field tags so the on-disk sidecar (.agents/eval/runs/<run-id>/taskspec.yaml) matches the in-memory shape exactly.

func ParseTaskSpec

func ParseTaskSpec(data []byte) (*TaskSpec, error)

ParseTaskSpec decodes a TaskSpec from YAML bytes and validates it. Strict decoding rejects unknown fields so a stale-version sidecar cannot be silently misread.
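
The effect of strict decoding can be shown with the standard library alone. The sketch below swaps in encoding/json and its DisallowUnknownFields switch as a stand-in for the YAML decoder; it is not the package's implementation. YAML decoders such as gopkg.in/yaml.v3 expose a comparable switch (Decoder.KnownFields), but which library this package uses is not stated here. miniSpec and parseStrict are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// miniSpec is a hypothetical two-field stand-in for TaskSpec.
type miniSpec struct {
	TaskSpecVersion int    `json:"task_spec_version"`
	TaskID          string `json:"task_id"`
}

// parseStrict decodes data and fails on any field miniSpec does not declare.
func parseStrict(data []byte) (*miniSpec, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	var s miniSpec
	if err := dec.Decode(&s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	// "hint" is not a declared field, so strict decoding reports an error.
	_, err := parseStrict([]byte(`{"task_spec_version":1,"task_id":"t1","hint":"x"}`))
	fmt.Println(err != nil) // true
}
```

A lenient decoder would drop the unknown field silently, which is the misread the strict mode is meant to prevent.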

func (*TaskSpec) MarshalYAML

func (t *TaskSpec) MarshalYAML() ([]byte, error)

MarshalYAML serializes the spec to canonical YAML bytes. Map keys in difficulty_signals are emitted in sorted order so the same spec always produces byte-identical output, which supports reproducibility (R4 requirements R2 and R10).

func (*TaskSpec) SignalKeys

func (t *TaskSpec) SignalKeys() []string

SignalKeys returns the difficulty-signal keys in sorted order. It is a convenience for callers (verifier, dashboard) that need stable iteration.
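
Stable iteration matters because Go randomizes map iteration order. A sorted-key helper along these lines is one plausible shape for what SignalKeys does; the free function signalKeys and the signal names in the example are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// signalKeys returns the keys of signals in ascending order.
func signalKeys(signals map[string]int) []string {
	keys := make([]string, 0, len(signals))
	for k := range signals {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Signal names are illustrative, not taken from the package.
	signals := map[string]int{"node_count": 12, "edge_count": 30, "cyclomatic_proxy": 4}
	for _, k := range signalKeys(signals) {
		fmt.Printf("%s=%d\n", k, signals[k])
	}
}
```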

func (*TaskSpec) Validate

func (t *TaskSpec) Validate() error

Validate checks structural invariants the harness depends on. It does not validate that referenced files or symbols exist — that is a downstream, I/O-bearing concern.
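
The documentation does not enumerate the invariants, so the sketch below uses an assumed rule set to show the general shape of a purely structural check: supported version, non-empty task ID and prompt, a non-empty test command, and a non-negative timeout. The types are trimmed local stand-ins, and the real method may check more or different fields.

```go
package main

import (
	"errors"
	"fmt"
)

type Verification struct {
	TestCmd        []string
	TimeoutSeconds int
}

type TaskSpec struct {
	TaskSpecVersion int
	TaskID          string
	Prompt          string
	Verification    Verification
}

// Validate applies an assumed rule set for illustration; the package's
// actual invariants may differ. It touches no files and runs no commands.
func (t *TaskSpec) Validate() error {
	switch {
	case t == nil:
		return errors.New("nil task spec")
	case t.TaskSpecVersion != 1:
		return fmt.Errorf("unsupported task_spec_version %d", t.TaskSpecVersion)
	case t.TaskID == "":
		return errors.New("task_id is required")
	case t.Prompt == "":
		return errors.New("prompt is required")
	case len(t.Verification.TestCmd) == 0:
		return errors.New("verification.test_cmd is required")
	case t.Verification.TimeoutSeconds < 0:
		return errors.New("verification.timeout_seconds must not be negative")
	}
	return nil
}

func main() {
	ok := &TaskSpec{TaskSpecVersion: 1, TaskID: "t1", Prompt: "Fix the bug.",
		Verification: Verification{TestCmd: []string{"go", "test", "./..."}}}
	fmt.Println(ok.Validate()) // <nil>

	bad := &TaskSpec{TaskSpecVersion: 1, TaskID: "t2", Prompt: "Fix the bug."}
	fmt.Println(bad.Validate()) // verification.test_cmd is required
}
```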

type Verification

type Verification struct {
	BuildCmd       []string `yaml:"build_cmd,omitempty"`
	TestCmd        []string `yaml:"test_cmd"`
	TimeoutSeconds int      `yaml:"timeout_seconds,omitempty"`
}

Verification holds the commands the harness runs after the agent finishes. The test command is hidden from the agent (R4 decision D4.7); these fields are data only — this package executes nothing.
