package eval

v0.4.2 | Published: Jun 29, 2026 | License: Apache-2.0 | Imports: 6 | Imported by: 0

Repository

github.com/AGOrcha/dot-agents

Documentation

Overview

Package eval is the foundation of the R4 code-task generation and evaluation harness. It defines the pure, I/O-free data contracts that the rest of the harness binds to: a versioned TaskSpec, the Generator interface that per-language adapters implement, and a language Registry that maps each Language to the Generator that produces tasks for it.

This package is deliberately free of I/O. It performs no sandbox provisioning, no knowledge-graph queries, and no filesystem access beyond in-memory (de)serialization helpers. Those concerns live in downstream R4 packages (internal/eval/kgquery, internal/eval/sandbox, internal/eval/gen/<lang>, ...) which import these contracts.

Versioning

Every TaskSpec carries a TaskSpec.TaskSpecVersion. Schema evolution is explicit and auditable: consumers bind to a version, and CurrentTaskSpecVersion names the version this build produces. See decision D4.5 in the R4 spec (.agents/workflow/specs/r4-code-task-generation-eval/design.md).

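Index

Constants
type Difficulty
- func (d Difficulty) Valid() bool
type GenerateOptions
type GeneratedFrom
type GeneratedKind
- func (k GeneratedKind) Valid() bool
type Generator
type KGQuery
type Language
- func (l Language) Valid() bool
type Registry
- func NewRegistry() *Registry
- func (r *Registry) Languages() []Language
- func (r *Registry) Lookup(lang Language) (Generator, bool)
- func (r *Registry) Register(g Generator) error
type SolutionArtifact
type TaskSpec
- func ParseTaskSpec(data []byte) (*TaskSpec, error)
- func (t *TaskSpec) MarshalYAML() ([]byte, error)
- func (t *TaskSpec) SignalKeys() []string
- func (t *TaskSpec) Validate() error
type Verification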

Constants

const CurrentTaskSpecVersion = 1

CurrentTaskSpecVersion is the TaskSpec schema version this build produces. v1 is the initial schema (R4 spec decision D4.5).
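A consumer that binds to a schema version would typically check it before trusting the rest of a spec. This is a sketch under assumptions, not the package's code: `checkVersion` is a hypothetical helper and `taskSpec` is a trimmed stand-in carrying only the version field. Whether the real harness rejects, migrates, or tolerates other versions is not stated on this page.

```go
package main

import "fmt"

// CurrentTaskSpecVersion mirrors the package constant.
const CurrentTaskSpecVersion = 1

// taskSpec is a trimmed, hypothetical stand-in holding only the version field.
type taskSpec struct {
	TaskSpecVersion int
}

// checkVersion is a hypothetical gate: it accepts only specs written at the
// schema version this build produces and names both versions on mismatch.
func checkVersion(t *taskSpec) error {
	if t.TaskSpecVersion != CurrentTaskSpecVersion {
		return fmt.Errorf("unsupported task_spec_version %d (this build reads %d)",
			t.TaskSpecVersion, CurrentTaskSpecVersion)
	}
	return nil
}

func main() {
	fmt.Println(checkVersion(&taskSpec{TaskSpecVersion: 1})) // <nil>
	fmt.Println(checkVersion(&taskSpec{TaskSpecVersion: 2})) // unsupported task_spec_version 2 (this build reads 1)
}
```

Because a spec that omits the version field decodes to 0, this gate rejects the zero value as well rather than silently reading it as v1.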

Variables

This section is empty.

Functions

This section is empty.

Types

type Difficulty

type Difficulty string

Difficulty is the reproducible, KG-derived difficulty band of a task. The band is computed downstream from difficulty signals (node/edge counts, a cyclomatic-complexity proxy) so re-running the generator on the same KG state yields the same band (R4 requirement R2).

const (
	DifficultyEasy   Difficulty = "easy"
	DifficultyMedium Difficulty = "medium"
	DifficultyHard   Difficulty = "hard"
)

Difficulty bands.
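Reproducibility (R2) follows if the band is a pure function of the recorded signals. A minimal sketch of such a function; the signal keys (`nodes`, `edges`, `cyclomatic`), the weighting, and the thresholds are illustrative assumptions, not the harness's actual banding rule, and `bandFor` is hypothetical.

```go
package main

import "fmt"

type Difficulty string

const (
	DifficultyEasy   Difficulty = "easy"
	DifficultyMedium Difficulty = "medium"
	DifficultyHard   Difficulty = "hard"
)

// bandFor is a hypothetical banding rule with invented keys and thresholds.
// It reads only named keys (missing keys count as 0), so the result never
// depends on map iteration order: the same signals always give the same band.
func bandFor(signals map[string]int) Difficulty {
	score := signals["nodes"] + signals["edges"] + 2*signals["cyclomatic"]
	switch {
	case score < 20:
		return DifficultyEasy
	case score < 60:
		return DifficultyMedium
	default:
		return DifficultyHard
	}
}

func main() {
	fmt.Println(bandFor(map[string]int{"nodes": 4, "edges": 6, "cyclomatic": 2}))  // score 14: easy
	fmt.Println(bandFor(map[string]int{"nodes": 30, "edges": 40, "cyclomatic": 9})) // score 88: hard
}
```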

func (Difficulty) Valid

func (d Difficulty) Valid() bool

Valid reports whether d is a recognized difficulty band.
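The page does not show Valid's body; a closed-set switch is one plausible shape. This sketch assumes the empty string is not a band, which fits GenerateOptions treating an empty Difficulty as "let the generator choose". GeneratedKind.Valid and Language.Valid could follow the same pattern.

```go
package main

import "fmt"

type Difficulty string

const (
	DifficultyEasy   Difficulty = "easy"
	DifficultyMedium Difficulty = "medium"
	DifficultyHard   Difficulty = "hard"
)

// Valid reports whether d is one of the three bands. In this sketch the
// comparison is exact and case-sensitive, and "" is not a band.
func (d Difficulty) Valid() bool {
	switch d {
	case DifficultyEasy, DifficultyMedium, DifficultyHard:
		return true
	}
	return false
}

func main() {
	for _, d := range []Difficulty{"easy", "hard", "", "extreme"} {
		fmt.Printf("%q valid=%v\n", d, d.Valid())
	}
}
```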

type GenerateOptions

type GenerateOptions struct {
	// Difficulty optionally constrains the band of the generated task. The
	// zero value (empty string) lets the generator choose.
	Difficulty Difficulty
	// TemplateID optionally selects a specific template; empty lets the
	// generator pick.
	TemplateID string
}

GenerateOptions carries the per-call inputs a generator needs to frame a task. It is intentionally small and I/O-free at this layer; downstream generators thread their own KG reader and other collaborators through their constructor, not through this struct.
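A generator might resolve the optional Difficulty as follows. `resolveDifficulty` and its `fallback` parameter are hypothetical; how a real generator chooses a band when the field is empty is not specified here.

```go
package main

import "fmt"

type Difficulty string

const (
	DifficultyEasy   Difficulty = "easy"
	DifficultyMedium Difficulty = "medium"
	DifficultyHard   Difficulty = "hard"
)

func (d Difficulty) Valid() bool {
	switch d {
	case DifficultyEasy, DifficultyMedium, DifficultyHard:
		return true
	}
	return false
}

type GenerateOptions struct {
	Difficulty Difficulty
	TemplateID string
}

// resolveDifficulty is a hypothetical helper: the zero value defers to the
// generator (represented here by fallback), while a non-empty value must be
// a recognized band.
func resolveDifficulty(opts GenerateOptions, fallback Difficulty) (Difficulty, error) {
	if opts.Difficulty == "" {
		return fallback, nil
	}
	if !opts.Difficulty.Valid() {
		return "", fmt.Errorf("unknown difficulty %q", opts.Difficulty)
	}
	return opts.Difficulty, nil
}

func main() {
	d, err := resolveDifficulty(GenerateOptions{}, DifficultyMedium)
	fmt.Println(d, err) // medium <nil>
	_, err = resolveDifficulty(GenerateOptions{Difficulty: "extreme"}, DifficultyMedium)
	fmt.Println(err) // unknown difficulty "extreme"
}
```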

type GeneratedFrom

type GeneratedFrom struct {
	Kind       GeneratedKind `yaml:"kind"`
	TemplateID string        `yaml:"template_id,omitempty"`
	KGQuery    *KGQuery      `yaml:"kg_query,omitempty"`
}

GeneratedFrom records how a task was produced so a run is reproducible and auditable (R4 requirement R10).

type GeneratedKind

type GeneratedKind string

GeneratedKind names the provenance of a task. v1 generates from the Tree-sitter knowledge graph (KindKGTemplate); KindBenchmarkSeed is reserved for the v2 benchmark-seed adapter that emits the same TaskSpec shape.

const (
	KindKGTemplate    GeneratedKind = "kg_template"
	KindBenchmarkSeed GeneratedKind = "benchmark_seed"
)

Generation provenance kinds.

func (GeneratedKind) Valid

func (k GeneratedKind) Valid() bool

Valid reports whether k is a recognized generation kind.

type Generator

type Generator interface {
	// Language reports the language this generator produces tasks for. A
	// generator handles exactly one language.
	Language() Language
	// Generate synthesizes one TaskSpec. Implementations must return a spec
	// that passes TaskSpec.Validate, or an error.
	Generate(ctx context.Context, opts GenerateOptions) (*TaskSpec, error)
}

Generator produces a versioned TaskSpec for a single language. Each per-language adapter (internal/eval/gen/<lang>) implements this interface and registers itself in a Registry. The interface is the seam between the language-agnostic harness and language-specific task synthesis.

type KGQuery

type KGQuery struct {
	Intent     string `yaml:"intent,omitempty"`
	SeedSymbol string `yaml:"seed_symbol,omitempty"`
}

KGQuery records the knowledge-graph query a kg_template task was framed around. It is metadata only — this package issues no queries.

type Language

type Language string

Language identifies the programming language a task targets. Per R4 decision D4.3 the v1 harness covers Go, Python, and TypeScript; the type is a string so a future language is an additive constant, not a breaking change.

const (
	LanguageGo         Language = "go"
	LanguagePython     Language = "python"
	LanguageTypeScript Language = "typescript"
)

Supported v1 languages.

func (Language) Valid

func (l Language) Valid() bool

Valid reports whether l is a recognized v1 language.

type Registry

type Registry struct {
	// contains filtered or unexported fields
}

Registry maps a Language to the Generator that produces tasks for it. It is safe for concurrent use. A Registry is the lookup surface the harness uses to resolve `da eval gen --language <lang>` to a concrete generator.

func NewRegistry

func NewRegistry() *Registry

NewRegistry returns an empty Registry ready for use.

func (*Registry) Languages

func (r *Registry) Languages() []Language

Languages returns the registered languages in sorted order.

func (*Registry) Lookup

func (r *Registry) Lookup(lang Language) (Generator, bool)

Lookup returns the generator registered for lang. The boolean is false when no generator is registered for the language.

func (*Registry) Register

func (r *Registry) Register(g Generator) error

Register adds g to the registry keyed by its Language. It errors on a nil generator, an invalid language, or a duplicate registration so collisions surface at wiring time rather than silently shadowing.
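One way to get the documented behavior (safe for concurrent use, sorted Languages, errors on nil, invalid, or duplicate registration) is a map guarded by a sync.RWMutex. This is a sketch under that assumption, not the package's source: the error messages and the `fakeGen` generator are hypothetical, and Generator is trimmed to its Language method.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

type Language string

const (
	LanguageGo         Language = "go"
	LanguagePython     Language = "python"
	LanguageTypeScript Language = "typescript"
)

func (l Language) Valid() bool {
	switch l {
	case LanguageGo, LanguagePython, LanguageTypeScript:
		return true
	}
	return false
}

// Generator is trimmed to the one method the registry needs.
type Generator interface{ Language() Language }

// Registry guards its map with an RWMutex: lookups share the read lock,
// registrations take the write lock.
type Registry struct {
	mu   sync.RWMutex
	gens map[Language]Generator
}

func NewRegistry() *Registry { return &Registry{gens: make(map[Language]Generator)} }

// Register rejects nil generators, invalid languages, and duplicates.
func (r *Registry) Register(g Generator) error {
	if g == nil {
		return errors.New("eval: nil generator")
	}
	lang := g.Language()
	if !lang.Valid() {
		return fmt.Errorf("eval: invalid language %q", lang)
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, dup := r.gens[lang]; dup {
		return fmt.Errorf("eval: generator for %q already registered", lang)
	}
	r.gens[lang] = g
	return nil
}

func (r *Registry) Lookup(lang Language) (Generator, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	g, ok := r.gens[lang]
	return g, ok
}

// Languages copies the keys and sorts them so the order is stable.
func (r *Registry) Languages() []Language {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]Language, 0, len(r.gens))
	for l := range r.gens {
		out = append(out, l)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// fakeGen is a hypothetical generator used only to exercise the registry.
type fakeGen Language

func (f fakeGen) Language() Language { return Language(f) }

func main() {
	r := NewRegistry()
	fmt.Println(r.Register(fakeGen(LanguagePython))) // <nil>
	fmt.Println(r.Register(fakeGen(LanguageGo)))     // <nil>
	fmt.Println(r.Register(fakeGen(LanguageGo)))     // eval: generator for "go" already registered
	fmt.Println(r.Register(fakeGen("cobol")))        // eval: invalid language "cobol"
	fmt.Println(r.Languages())                       // [go python]
}
```

In this sketch a typed nil pointer wrapped in the Generator interface would slip past the `g == nil` check; catching that case would need reflection or a convention on the adapter side.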

type SolutionArtifact

type SolutionArtifact struct {
	Path string `yaml:"path"`
	Role string `yaml:"role,omitempty"`
}

SolutionArtifact names a file that the task expects to exist or to be modified, together with the file's role (e.g. "target").

type TaskSpec

type TaskSpec struct {
	TaskSpecVersion   int                `yaml:"task_spec_version"`
	TaskID            string             `yaml:"task_id"`
	Language          Language           `yaml:"language"`
	Difficulty        Difficulty         `yaml:"difficulty"`
	DifficultySignals map[string]int     `yaml:"difficulty_signals,omitempty"`
	GeneratedFrom     GeneratedFrom      `yaml:"generated_from"`
	Prompt            string             `yaml:"prompt"`
	SolutionArtifacts []SolutionArtifact `yaml:"solution_artifacts,omitempty"`
	Verification      Verification       `yaml:"verification"`
}

TaskSpec is the versioned, language-agnostic description of a single evaluable programming task. It is the central contract of the R4 harness: generators produce it, sandboxes provision against it, verifiers consume its verification commands, and the scoring bridge records it.

TaskSpec round-trips through YAML via the canonical field tags so the on-disk sidecar (.agents/eval/runs/<run-id>/taskspec.yaml) matches the in-memory shape exactly.

func ParseTaskSpec

func ParseTaskSpec(data []byte) (*TaskSpec, error)

ParseTaskSpec decodes a TaskSpec from YAML bytes and validates it. Strict decoding rejects unknown fields so a stale-version sidecar cannot be silently misread.
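Go's standard library has no YAML decoder, so this sketch swaps in encoding/json's Decoder.DisallowUnknownFields to illustrate the same strict-decoding idea. gopkg.in/yaml.v3 offers an analogous switch, Decoder.KnownFields(true), but this page does not say which YAML library the package uses. The two-field `spec` type and `parseStrict` are hypothetical, and the validation step that ParseTaskSpec performs after decoding is omitted.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// spec is a trimmed, hypothetical stand-in with two fields.
type spec struct {
	TaskSpecVersion int    `json:"task_spec_version"`
	TaskID          string `json:"task_id"`
}

// parseStrict decodes data and fails on any field the struct does not
// declare, so input written for a different schema is not silently misread.
func parseStrict(data []byte) (*spec, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	var s spec
	if err := dec.Decode(&s); err != nil {
		return nil, fmt.Errorf("parse task spec: %w", err)
	}
	return &s, nil
}

func main() {
	_, err := parseStrict([]byte(`{"task_spec_version":1,"task_id":"t1"}`))
	fmt.Println(err) // <nil>
	_, err = parseStrict([]byte(`{"task_spec_version":2,"task_id":"t1","sandbox_image":"x"}`))
	fmt.Println(err != nil) // true: "sandbox_image" is not a declared field
}
```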

func (*TaskSpec) MarshalYAML

func (t *TaskSpec) MarshalYAML() ([]byte, error)

MarshalYAML serializes the spec to canonical YAML bytes. Map keys in difficulty_signals are emitted in sorted order so the same spec always produces byte-identical output (reproducibility, R4 requirement R2/R10).

func (*TaskSpec) SignalKeys

func (t *TaskSpec) SignalKeys() []string

SignalKeys returns the difficulty-signal keys in sorted order. It is a convenience for callers (verifier, dashboard) that need stable iteration.

func (*TaskSpec) Validate

func (t *TaskSpec) Validate() error

Validate checks structural invariants the harness depends on. It does not validate that referenced files or symbols exist — that is a downstream, I/O-bearing concern.
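This page does not list the exact invariants Validate enforces. The sketch assumes a plausible set drawn from the types above: the current version, a non-empty task_id and prompt, a recognized language, difficulty, and kind, and a non-empty test_cmd (the one Verification field without omitempty). Plain strings stand in for the typed enums, `validate` is hypothetical, and errors.Join requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

const CurrentTaskSpecVersion = 1

// taskSpec is a trimmed, hypothetical stand-in; plain strings replace the
// Language, Difficulty, and GeneratedKind types.
type taskSpec struct {
	TaskSpecVersion int
	TaskID          string
	Language        string
	Difficulty      string
	Kind            string   // GeneratedFrom.Kind
	Prompt          string
	TestCmd         []string // Verification.TestCmd
}

// validate checks assumed structural invariants only; it never touches the
// filesystem. It collects every violation and returns nil when there are none.
func validate(t *taskSpec) error {
	var errs []error
	if t.TaskSpecVersion != CurrentTaskSpecVersion {
		errs = append(errs, fmt.Errorf("task_spec_version: got %d, want %d", t.TaskSpecVersion, CurrentTaskSpecVersion))
	}
	if t.TaskID == "" {
		errs = append(errs, errors.New("task_id: required"))
	}
	switch t.Language {
	case "go", "python", "typescript":
	default:
		errs = append(errs, fmt.Errorf("language: unrecognized %q", t.Language))
	}
	switch t.Difficulty {
	case "easy", "medium", "hard":
	default:
		errs = append(errs, fmt.Errorf("difficulty: unrecognized %q", t.Difficulty))
	}
	switch t.Kind {
	case "kg_template", "benchmark_seed":
	default:
		errs = append(errs, fmt.Errorf("generated_from.kind: unrecognized %q", t.Kind))
	}
	if t.Prompt == "" {
		errs = append(errs, errors.New("prompt: required"))
	}
	if len(t.TestCmd) == 0 {
		errs = append(errs, errors.New("verification.test_cmd: required"))
	}
	return errors.Join(errs...)
}

func main() {
	ok := &taskSpec{
		TaskSpecVersion: 1, TaskID: "t1", Language: "go", Difficulty: "easy",
		Kind: "kg_template", Prompt: "Implement Reverse.", TestCmd: []string{"go", "test", "./..."},
	}
	fmt.Println(validate(ok)) // <nil>
	fmt.Println(validate(&taskSpec{}))
}
```

Reporting all violations at once, rather than stopping at the first, saves a generator author several fix-and-rerun cycles; the real Validate may well stop at the first error instead.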

type Verification

type Verification struct {
	BuildCmd       []string `yaml:"build_cmd,omitempty"`
	TestCmd        []string `yaml:"test_cmd"`
	TimeoutSeconds int      `yaml:"timeout_seconds,omitempty"`
}

Verification holds the commands the harness runs after the agent finishes. The test command is hidden from the agent (R4 decision D4.7); these fields are data only — this package executes nothing.
