# Structured Evaluation

A reusable evaluation framework for LLM-as-Judge and multi-agent workflows.

## Overview

`structured-evaluation` provides standardized types for evaluation reports, enabling:

- LLM-as-Judge assessments with weighted category scores and severity-based findings
- GO/NO-GO summary reports for deterministic checks (CI, tests, validation)
- Multi-agent coordination with DAG-based report aggregation

## Installation

```bash
go get github.com/agentplexus/structured-evaluation
```

## Packages

| Package | Description |
|---------|-------------|
| `evaluation` | `EvaluationReport`, `CategoryScore`, `Finding`, `Severity` types |
| `summary` | `SummaryReport`, `TeamSection`, `TaskResult` for GO/NO-GO checks |
| `combine` | DAG-based report aggregation using Kahn's algorithm |
| `render/box` | Box-format terminal renderer for summary reports |
| `render/detailed` | Detailed terminal renderer for evaluation reports |
| `schema` | JSON Schema generation and embedding |

## Report Types

### Evaluation Report (LLM-as-Judge)

For subjective quality assessments with detailed findings:

```go
import "github.com/agentplexus/structured-evaluation/evaluation"

report := evaluation.NewEvaluationReport("prd", "document.md")
report.AddCategory(evaluation.NewCategoryScore("problem_definition", 0.20, 8.5, "Clear problem statement"))
report.AddFinding(evaluation.Finding{
	Severity:       evaluation.SeverityMedium,
	Category:       "metrics",
	Title:          "Missing baseline metrics",
	Recommendation: "Add current baseline measurements",
})
report.Finalize("sevaluation check document.md")
```

### Summary Report (GO/NO-GO)

For deterministic checks with pass/fail status:

```go
import "github.com/agentplexus/structured-evaluation/summary"

report := summary.NewSummaryReport("my-service", "v1.0.0", "Release Validation")
report.AddTeam(summary.TeamSection{
	ID:   "qa",
	Name: "Quality Assurance",
	Tasks: []summary.TaskResult{
		{ID: "unit-tests", Status: summary.StatusGo, Detail: "Coverage: 92%"},
		{ID: "e2e-tests", Status: summary.StatusWarn, Detail: "2 flaky tests"},
	},
})
```

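Reports can be written to disk as JSON and then rendered or checked by the `sevaluation` CLI described below. A minimal sketch using only the standard library, assuming the report types marshal cleanly with `encoding/json` (the embedded JSON Schemas below describe that shape); the `report.json` filename is arbitrary:

```go
import (
	"encoding/json"
	"log"
	"os"
)

// Serialize the report built above so the CLI can render or check it later,
// e.g. `sevaluation render report.json --format=box`.
data, err := json.MarshalIndent(report, "", "  ")
if err != nil {
	log.Fatal(err)
}
if err := os.WriteFile("report.json", data, 0o644); err != nil {
	log.Fatal(err)
}
```
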
## Severity Levels

Following InfoSec conventions:

| Severity | Icon | Blocking | Description |
|----------|------|----------|-------------|
| Critical | 🔴 | Yes | Must fix before approval |
| High | 🔴 | Yes | Must fix before approval |
| Medium | 🟡 | No | Should fix, tracked |
| Low | 🟢 | No | Nice to fix |
| Info | ⚪ | No | Informational only |

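As an illustration of how the blocking distinction is typically used downstream, the sketch below counts blocking findings in a slice of findings. It assumes `SeverityCritical` and `SeverityHigh` constants mirroring the `SeverityMedium` constant shown earlier (an assumption about the API):

```go
// blockingCount reports how many findings are blocking (Critical or High).
// SeverityCritical and SeverityHigh are assumed to exist, following the
// naming of evaluation.SeverityMedium used above.
func blockingCount(findings []evaluation.Finding) int {
	n := 0
	for _, f := range findings {
		if f.Severity == evaluation.SeverityCritical || f.Severity == evaluation.SeverityHigh {
			n++
		}
	}
	return n
}
```
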
## Pass Criteria

Two built-in criteria sets are provided; both require zero blocking findings and a minimum overall score:

```go
criteria := evaluation.DefaultPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: -1 (unlimited), MinScore: 7.0

strict := evaluation.StrictPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: 3, MinScore: 8.0
```

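If neither built-in matches your policy, the documented thresholds suggest criteria can also be constructed directly. A sketch assuming an exported `evaluation.PassCriteria` struct whose fields match the comments above (the type name is an assumption):

```go
// Custom criteria: tolerate one High finding, cap Medium findings at 5,
// and require an overall score of at least 6.5.
// evaluation.PassCriteria is assumed to be the type returned by
// DefaultPassCriteria(); field names come from the comments above.
criteria := evaluation.PassCriteria{
	MaxCritical: 0,
	MaxHigh:     1,
	MaxMedium:   5,
	MinScore:    6.5,
}
```
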
## CLI

```bash
# Install
go install github.com/agentplexus/structured-evaluation/cmd/sevaluation@latest

# Render reports
sevaluation render report.json --format=detailed
sevaluation render report.json --format=box
sevaluation render report.json --format=json

# Check pass/fail (exit code 0/1)
sevaluation check report.json

# Validate structure
sevaluation validate report.json

# Generate JSON Schema
sevaluation schema generate -o ./schema/
```

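In Go-based CI tooling, the exit code of `sevaluation check` can gate a pipeline step directly. A small sketch using `os/exec`; the command and flags are as documented above, everything else is standard library:

```go
import (
	"log"
	"os"
	"os/exec"
)

// `sevaluation check` exits 0 for GO and 1 for NO-GO, so a NO-GO report
// surfaces here as a non-nil error from Run.
cmd := exec.Command("sevaluation", "check", "report.json")
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil {
	log.Fatalf("evaluation gate failed: %v", err)
}
```
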
## DAG-Based Aggregation

For multi-agent workflows with dependencies:

```go
import "github.com/agentplexus/structured-evaluation/combine"

results := []combine.AgentResult{
	{TeamID: "qa", Tasks: qaTasks},
	{TeamID: "security", Tasks: secTasks, DependsOn: []string{"qa"}},
	{TeamID: "release", Tasks: relTasks, DependsOn: []string{"qa", "security"}},
}

report := combine.AggregateResults(results, "my-project", "v1.0.0", "Release")
// Teams are topologically sorted: qa → security → release
```

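The ordering comes from Kahn's algorithm: repeatedly emit a team whose dependencies have all been emitted. The sketch below is an illustrative, stripped-down version of that idea over team IDs, not the library's implementation:

```go
// topoOrder returns team IDs in dependency order using Kahn's algorithm.
// deps maps a team ID to the IDs of the teams it depends on.
func topoOrder(deps map[string][]string) []string {
	indegree := map[string]int{}        // team -> number of unmet dependencies
	dependents := map[string][]string{} // dependency -> teams waiting on it
	for id, ds := range deps {
		if _, ok := indegree[id]; !ok {
			indegree[id] = 0
		}
		for _, d := range ds {
			if _, ok := indegree[d]; !ok {
				indegree[d] = 0
			}
			indegree[id]++
			dependents[d] = append(dependents[d], id)
		}
	}

	var queue, order []string
	for id, n := range indegree {
		if n == 0 {
			queue = append(queue, id) // teams with no dependencies go first
		}
	}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		order = append(order, id)
		for _, next := range dependents[id] {
			indegree[next]--
			if indegree[next] == 0 {
				queue = append(queue, next)
			}
		}
	}
	return order // shorter than indegree when the dependency graph has a cycle
}
```

With the `qa`/`security`/`release` example above, this yields exactly the `qa → security → release` ordering the library reports.
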
## JSON Schema

Schemas are embedded for runtime validation:

```go
import "github.com/agentplexus/structured-evaluation/schema"

evalSchema := schema.EvaluationSchemaJSON
summarySchema := schema.SummarySchemaJSON
```

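The embedded schema bytes can be fed to any JSON Schema validator. A sketch using the third-party `github.com/xeipuuv/gojsonschema` package (not a dependency of this module), assuming `EvaluationSchemaJSON` is exposed as a byte slice:

```go
import (
	"fmt"
	"os"

	"github.com/xeipuuv/gojsonschema"

	"github.com/agentplexus/structured-evaluation/schema"
)

// validateEvaluationReport checks a report file against the embedded
// evaluation schema and returns an error listing any violations.
func validateEvaluationReport(path string) error {
	reportJSON, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	// EvaluationSchemaJSON is assumed to be []byte; wrap it in []byte(...)
	// if it is exposed as a string instead.
	result, err := gojsonschema.Validate(
		gojsonschema.NewBytesLoader(schema.EvaluationSchemaJSON),
		gojsonschema.NewBytesLoader(reportJSON),
	)
	if err != nil {
		return err
	}
	if !result.Valid() {
		return fmt.Errorf("report does not match schema: %v", result.Errors())
	}
	return nil
}
```
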
## Rubrics (v0.2.0)

Define explicit scoring criteria for consistent evaluations:

```go
rubric := evaluation.NewRubric("quality", "Output quality").
	AddRangeAnchor(8, 10, "Excellent", "Near perfect").
	AddRangeAnchor(5, 7.9, "Good", "Acceptable").
	AddRangeAnchor(0, 4.9, "Poor", "Needs work")

// Use the default PRD rubric
rubricSet := evaluation.DefaultPRDRubricSet()
```

Track LLM judge configuration for reproducibility:

```go
judge := evaluation.NewJudgeMetadata("claude-3-opus").
	WithProvider("anthropic").
	WithPrompt("prd-eval-v1", "1.0").
	WithTemperature(0.0).
	WithTokenUsage(1500, 800)

report.SetJudge(judge)
```

## Pairwise Comparison (v0.2.0)

Compare two outputs instead of absolute scoring:

```go
comparison := evaluation.NewPairwiseComparison(input, outputA, outputB)
comparison.SetWinner(evaluation.WinnerA, "A is more accurate", 0.9)

// Aggregate multiple comparisons
result := evaluation.ComputePairwiseResult(comparisons)
// result.WinRateA, result.OverallWinner
```

## Multi-Judge Aggregation (v0.2.0)

Combine evaluations from multiple judges:

```go
result := evaluation.AggregateEvaluations(evaluations, evaluation.AggregationMean)
// Methods: AggregationMean, AggregationMedian, AggregationConservative, AggregationMajority

// result.Agreement            - inter-judge agreement (0-1)
// result.Disagreements        - categories with significant disagreement
// result.ConsolidatedDecision - final aggregated decision
```

## OmniObserve Integration

Export evaluations to Opik, Phoenix, or Langfuse:

```go
import "github.com/agentplexus/omniobserve/integrations/sevaluation"

// Export to observability platform
err := sevaluation.Export(ctx, provider, traceID, report)
```

## Integration

Designed to work with:

- `github.com/agentplexus/omniobserve` - LLM observability (Opik, Phoenix, Langfuse)
- `github.com/grokify/structured-requirements` - PRD evaluation templates
- `github.com/agentplexus/multi-agent-spec` - Agent coordination
- `github.com/grokify/structured-changelog` - Release validation

## License

MIT License - see LICENSE for details.