Structured Evaluation

A reusable evaluation framework for LLM-as-Judge and multi-agent workflows.
Overview
structured-evaluation provides standardized types for evaluation reports, enabling:
- LLM-as-Judge assessments with categorical scoring and severity-based findings
- Dual-scale support with Likert (1-5) scales for human comparison studies
- Inter-rater reliability metrics for LLM calibration and quality assurance
- GO/NO-GO summary reports for deterministic checks (CI, tests, validation)
- Multi-agent coordination with DAG-based report aggregation
- Claims validation for factual claim extraction and source verification
Architecture
┌───────────────────────────────────────────────────────────┐
│                 SummaryReport (GO/NO-GO)                  │
│   ┌──────────────────────┐     ┌──────────────────────┐   │
│   │   Embedded Reports   │     │    Team Sections     │   │
│   │   (Full-Fidelity)    │     │    (Task Results)    │   │
│   └──────────────────────┘     └──────────────────────┘   │
└───────────────────────────────────────────────────────────┘
                              ▲
              ┌───────────────┴───────────────┐
              │                               │
┌─────────────┴─────────────┐   ┌─────────────┴─────────────┐
│     Rubric (rubric/)      │   │  ClaimsReport (claims/)   │
│  ┌─────────────────────┐  │   │  ┌─────────────────────┐  │
│  │  Category Results   │  │   │  │ Claims + Validation │  │
│  │ (pass/partial/fail) │  │   │  │ (verified/rejected) │  │
│  ├─────────────────────┤  │   │  ├─────────────────────┤  │
│  │      Findings       │  │   │  │       Sources       │  │
│  │  (severity-based)   │  │   │  │ (external/internal) │  │
│  └─────────────────────┘  │   │  └─────────────────────┘  │
│   LLM-as-Judge scoring    │   │     Fact verification     │
└───────────────────────────┘   └───────────────────────────┘
Three complementary report types:
| Package | Purpose | Evaluation Type |
|---------|---------|-----------------|
| rubric/ | Categorical scoring with findings | Subjective (LLM-as-Judge) |
| claims/ | Fact verification with sources | Objective (source-backed) |
| summary/ | GO/NO-GO aggregation | Deterministic |
Installation
go get github.com/plexusone/structured-evaluation
Packages
| Package | Description |
|---------|-------------|
| rubric | Rubric, CategoryResult, Finding, Severity types for LLM-as-Judge |
| claims | ClaimsReport, Claim, Validation, Verdict for source verification |
| summary | SummaryReport, TeamSection, TaskResult for GO/NO-GO checks |
| combine | DAG-based report aggregation using Kahn's algorithm |
| render/box | Box-format terminal renderer for summary reports |
| render/detailed | Detailed terminal renderer for rubric reports |
| render/terminal | ANSI-colored terminal renderer with UTF-8 icons |
| render/markdown | Markdown report renderer |
| schema | JSON Schema generation and embedding |
Report Types
Rubric (LLM-as-Judge)
For subjective quality assessments with detailed findings:
import "github.com/plexusone/structured-evaluation/rubric"
report := rubric.NewRubric("prd", "document.md")
report.AddCategoryResult(rubric.CategoryResult{
	Category:  "problem_definition",
	Score:     rubric.ScorePass,
	Reasoning: "Clear problem statement with measurable goals",
})
report.AddFinding(rubric.Finding{
	Severity:       rubric.SeverityMedium,
	Category:       "metrics",
	Title:          "Missing baseline metrics",
	Recommendation: "Add current baseline measurements",
})
report.Finalize(nil, "sevaluation check document.md")
Summary Report (GO/NO-GO)
For deterministic checks with pass/fail status:
import "github.com/plexusone/structured-evaluation/summary"
report := summary.NewSummaryReport("my-service", "v1.0.0", "Release Validation")
report.AddTeam(summary.TeamSection{
	ID:   "qa",
	Name: "Quality Assurance",
	Tasks: []summary.TaskResult{
		{ID: "unit-tests", Status: summary.StatusGo, Detail: "Coverage: 92%"},
		{ID: "e2e-tests", Status: summary.StatusWarn, Detail: "2 flaky tests"},
	},
})
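A GO/NO-GO report ultimately reduces many task statuses to a single decision. The sketch below shows one plausible rollup rule, in which the worst status wins. The `Status` type, its constants, and the `overall` function are local illustrations rather than the `summary` package's API, and the existence of a NO-GO status and the worst-wins rule are assumptions.

```go
package main

import "fmt"

// Status is a local stand-in for a task status, ordered from best to worst.
type Status int

const (
	StatusGo Status = iota
	StatusWarn
	StatusNoGo
)

// String renders the status the way a report line might show it.
func (s Status) String() string {
	switch s {
	case StatusGo:
		return "GO"
	case StatusWarn:
		return "WARN"
	default:
		return "NO-GO"
	}
}

// overall returns the worst status among the tasks (assumed rollup rule).
// An empty task list is treated as GO.
func overall(tasks []Status) Status {
	worst := StatusGo
	for _, s := range tasks {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	fmt.Println(overall([]Status{StatusGo, StatusWarn}))             // WARN
	fmt.Println(overall([]Status{StatusGo, StatusNoGo, StatusWarn})) // NO-GO
}
```

Ordering the constants from best to worst keeps the rollup a simple maximum, so adding a status later only requires placing it at the right rank.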
Claims Report (v0.6.0)
For factual claim extraction and source validation:
import "github.com/plexusone/structured-evaluation/claims"
report := claims.NewClaimsReport("security-advisory.md")
// External source: CVE from NVD
claim := claims.NewClaim("cvss", "CVSS 8.8 High", claims.ClaimRiskAssessment,
	claims.Location{Section: "severity"})
claim.SetValidation(claims.NewExternalValidation(
	"https://nvd.nist.gov/vuln/detail/CVE-2026-25253",
	claims.ExternalNVD,
))
report.AddClaim(*claim)
// Internal validation: exploit confirmed via code
exploit := claims.NewClaim("exploit", "RCE confirmed", claims.ClaimTechnicalFinding,
	claims.Location{Section: "impact"})
exploit.SetValidation(claims.NewInternalValidation(
	claims.MethodCodeExecution, "poc.py", true,
))
report.AddClaim(*exploit)
report.Finalize()
// report.Decision.Passed, report.Summary.Counts
Severity Levels
Following InfoSec conventions:
| Severity | Icon | Blocking | Description |
|----------|------|----------|-------------|
| Critical | 🔴 | Yes | Must fix before approval |
| High | 🔴 | Yes | Must fix before approval |
| Medium | 🟡 | No | Should fix, tracked |
| Low | 🟢 | No | Nice to fix |
| Info | ⚪ | No | Informational only |
Pass Criteria
Default criteria block on any Critical or High finding; strict criteria additionally cap Medium findings and require every category to pass:
criteria := rubric.DefaultPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: -1 (unlimited), RequireAllPass: false
strict := rubric.StrictPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: 3, RequireAllPass: true
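To make the thresholds concrete, here is a minimal sketch of how criteria like these could be checked against a report's finding counts. `PassCriteria` below is a local type that only borrows the field names from the comments above; `Counts`, `within`, and `passes` are hypothetical helpers, and the library's real evaluation logic may differ.

```go
package main

import "fmt"

// PassCriteria borrows the field names from the comments above.
// It is a local stand-in, not the library's type.
type PassCriteria struct {
	MaxCritical    int // a negative limit means unlimited
	MaxHigh        int
	MaxMedium      int
	RequireAllPass bool
}

// Counts holds findings per severity and the number of categories
// that did not receive a passing score (hypothetical shape).
type Counts struct {
	Critical, High, Medium int
	NonPassingCategories   int
}

// within reports whether n respects limit, treating a negative limit as unlimited.
func within(n, limit int) bool {
	return limit < 0 || n <= limit
}

// passes applies the severity limits first, then the category requirement.
func passes(c PassCriteria, n Counts) bool {
	if !within(n.Critical, c.MaxCritical) ||
		!within(n.High, c.MaxHigh) ||
		!within(n.Medium, c.MaxMedium) {
		return false
	}
	if c.RequireAllPass && n.NonPassingCategories > 0 {
		return false
	}
	return true
}

func main() {
	def := PassCriteria{MaxCritical: 0, MaxHigh: 0, MaxMedium: -1}
	strict := PassCriteria{MaxCritical: 0, MaxHigh: 0, MaxMedium: 3, RequireAllPass: true}

	// Five medium findings and one non-passing category.
	report := Counts{Medium: 5, NonPassingCategories: 1}
	fmt.Println("default:", passes(def, report))
	fmt.Println("strict: ", passes(strict, report))
}
```

With these values the same report clears the default criteria but not the strict ones, which is the practical difference between the two presets.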
CLI
The sevaluation command-line tool renders, checks, and validates reports:
# Install
go install github.com/plexusone/structured-evaluation/cmd/sevaluation@latest
# Render reports
sevaluation render report.json --format=detailed
sevaluation render report.json --format=terminal # ANSI colors + UTF8 icons
sevaluation render report.json --format=markdown # Markdown output
sevaluation render report.json --format=box
sevaluation render report.json --format=json
# Check pass/fail (exit code 0/1)
sevaluation check report.json
# Validate structure
sevaluation validate report.json
# Generate JSON Schema
sevaluation schema generate -o ./schema/
DAG-Based Aggregation
For multi-agent workflows with dependencies:
import "github.com/plexusone/structured-evaluation/combine"
results := []combine.AgentResult{
	{TeamID: "qa", Tasks: qaTasks},
	{TeamID: "security", Tasks: secTasks, DependsOn: []string{"qa"}},
	{TeamID: "release", Tasks: relTasks, DependsOn: []string{"qa", "security"}},
}
report := combine.AggregateResults(results, "my-project", "v1.0.0", "Release")
// Teams are topologically sorted: qa → security → release
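The ordering step can be illustrated with a standalone version of Kahn's algorithm, which the package table names as the technique behind `combine`. This is a sketch under stated assumptions: the `node` type and `topoSort` function are hypothetical and are not the package's implementation or API.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical stand-in for an agent result:
// an ID plus the IDs it depends on.
type node struct {
	ID        string
	DependsOn []string
}

// topoSort orders IDs so that every dependency precedes its dependents,
// using Kahn's algorithm. It fails on unknown dependencies and on cycles.
func topoSort(nodes []node) ([]string, error) {
	indegree := make(map[string]int, len(nodes))
	dependents := make(map[string][]string)
	for _, n := range nodes {
		indegree[n.ID] = 0
	}
	for _, n := range nodes {
		for _, dep := range n.DependsOn {
			if _, ok := indegree[dep]; !ok {
				return nil, fmt.Errorf("unknown dependency %q", dep)
			}
			indegree[n.ID]++
			dependents[dep] = append(dependents[dep], n.ID)
		}
	}

	// Seed the queue with nodes that have no dependencies, in input order.
	var queue []string
	for _, n := range nodes {
		if indegree[n.ID] == 0 {
			queue = append(queue, n.ID)
		}
	}

	var order []string
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		order = append(order, id)
		// Releasing a node may unblock the nodes that depend on it.
		for _, d := range dependents[id] {
			indegree[d]--
			if indegree[d] == 0 {
				queue = append(queue, d)
			}
		}
	}
	if len(order) != len(nodes) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := topoSort([]node{
		{ID: "release", DependsOn: []string{"qa", "security"}},
		{ID: "security", DependsOn: []string{"qa"}},
		{ID: "qa"},
	})
	fmt.Println(order, err) // [qa security release] <nil>
}
```

A useful property of Kahn's algorithm is that cycle detection comes for free: if some nodes never reach an in-degree of zero, the output is shorter than the input.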
JSON Schema
Schemas are embedded for runtime validation:
import "github.com/plexusone/structured-evaluation/schema"
rubricSchema := schema.RubricSchemaJSON
claimsSchema := schema.ClaimsSchemaJSON
summarySchema := schema.SummarySchemaJSON
RubricSet (v0.4.0)
Define explicit criteria for consistent categorical evaluations:
cat := rubric.NewCategory("quality", "Output Quality", "Overall quality assessment").
	WithPassPartialFail(
		[]string{"Meets all requirements, no significant issues"},
		[]string{"Meets most requirements, minor issues"},
		[]string{"Missing key requirements or major issues"},
	)
// Use default PRD rubric
rubricSet := rubric.DefaultPRDRubricSet()
Track LLM judge configuration for reproducibility:
judge := rubric.NewJudgeMetadata("claude-3-opus").
	WithProvider("anthropic").
	WithPrompt("prd-eval-v1", "1.0").
	WithTemperature(0.0).
	WithTokenUsage(1500, 800)
report.SetJudge(judge)
Pairwise Comparison (v0.2.0)
Compare two outputs instead of absolute scoring:
comparison := rubric.NewPairwiseComparison(input, outputA, outputB)
comparison.SetWinner(rubric.WinnerA, "A is more accurate", 0.9)
// Aggregate multiple comparisons
result := rubric.ComputePairwiseResult(comparisons)
// result.WinRateA, result.OverallWinner
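For intuition about what a win rate and overall winner mean here, the following sketch tallies a batch of outcomes. The `Winner` constants, the `Tally` type, and the `tally` function are local illustrations; counting ties in the denominator and declaring a tie on equal win counts are assumptions, not documented library behavior.

```go
package main

import "fmt"

// Winner labels the outcome of one A-versus-B comparison (local stand-in type).
type Winner string

const (
	WinnerA   Winner = "A"
	WinnerB   Winner = "B"
	WinnerTie Winner = "tie"
)

// Tally summarizes a batch of comparisons.
type Tally struct {
	WinRateA, WinRateB, TieRate float64
	Overall                     Winner
}

// tally computes each outcome's share of all comparisons and names the side
// with more wins. Equal win counts yield a tie (assumed convention).
func tally(outcomes []Winner) Tally {
	if len(outcomes) == 0 {
		return Tally{Overall: WinnerTie}
	}
	counts := map[Winner]int{}
	for _, w := range outcomes {
		counts[w]++
	}
	n := float64(len(outcomes))
	res := Tally{
		WinRateA: float64(counts[WinnerA]) / n,
		WinRateB: float64(counts[WinnerB]) / n,
		TieRate:  float64(counts[WinnerTie]) / n,
		Overall:  WinnerTie,
	}
	switch {
	case counts[WinnerA] > counts[WinnerB]:
		res.Overall = WinnerA
	case counts[WinnerB] > counts[WinnerA]:
		res.Overall = WinnerB
	}
	return res
}

func main() {
	res := tally([]Winner{WinnerA, WinnerA, WinnerB, WinnerTie})
	fmt.Printf("A=%.2f B=%.2f tie=%.2f overall=%s\n",
		res.WinRateA, res.WinRateB, res.TieRate, res.Overall)
	// A=0.50 B=0.25 tie=0.25 overall=A
}
```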
Multi-Judge Aggregation (v0.4.0)
Combine evaluations from multiple judges:
result := rubric.AggregateEvaluations(evaluations, rubric.AggregationMajority)
// Methods: AggregationMajority, AggregationConservative, AggregationOptimistic
// result.Agreement - inter-judge agreement (0-1)
// result.Disagreements - categories with significant disagreement
// result.ConsolidatedDecision - final aggregated decision
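The three aggregation methods can be read as three ways of collapsing one category's scores across judges. The sketch below is a hypothetical, self-contained illustration: the `Score` and `Method` types, the tie-break toward the lower score, and the definition of agreement as the share of judges matching the most common score are all assumptions and may not match the library.

```go
package main

import "fmt"

// Score is a local stand-in for a categorical score, ordered worst to best.
type Score int

const (
	Fail Score = iota
	Partial
	Pass
)

// Method selects how per-judge scores are combined.
type Method int

const (
	Majority Method = iota
	Conservative
	Optimistic
)

// aggregate combines one category's scores from several judges and returns
// the combined score plus the share of judges that gave the most common score.
// Majority ties resolve to the lower score (an assumed tie-break).
// Scores must be Fail, Partial, or Pass.
func aggregate(scores []Score, m Method) (Score, float64) {
	if len(scores) == 0 {
		return Fail, 0
	}
	var counts [3]int
	lowest, highest := Pass, Fail
	for _, s := range scores {
		counts[s]++
		if s < lowest {
			lowest = s
		}
		if s > highest {
			highest = s
		}
	}
	// Scan from worst to best so that ties keep the lower score.
	mode := Fail
	for s := Partial; s <= Pass; s++ {
		if counts[s] > counts[mode] {
			mode = s
		}
	}
	agreement := float64(counts[mode]) / float64(len(scores))

	switch m {
	case Conservative:
		return lowest, agreement
	case Optimistic:
		return highest, agreement
	default:
		return mode, agreement
	}
}

func main() {
	judges := []Score{Pass, Pass, Partial}
	for _, m := range []Method{Majority, Conservative, Optimistic} {
		score, agreement := aggregate(judges, m)
		fmt.Printf("method=%d score=%d agreement=%.2f\n", m, score, agreement)
	}
}
```

The conservative method is the natural choice when a false pass is costlier than a false fail, since a single dissenting judge is enough to lower the result.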
Likert Scales (v0.5.0)
Use 1-5 numeric scales for human comparison studies:
// Create a Likert-scale category
cat := rubric.NewCategory("quality", "Content Quality", "Overall quality").
	WithLikert5(rubric.StandardLikert5Anchors())
// Record a Likert score (automatically maps to categorical)
// config is the Likert scale configuration, assumed to be defined elsewhere
result := rubric.NewCategoryResultFromLikert("quality", 4, config, "Good quality")
// result.Score = ScorePass, result.NumericScore = 4.0
// Or record both categorical and numeric (separate variable so both calls compile in one scope)
both := rubric.NewCategoryResultWithNumeric("quality", rubric.ScorePass, 4.5, "reasoning")
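The automatic mapping from a Likert rating to a categorical score can be pictured as a small threshold function. The example above only confirms that a rating of 4 maps to a pass; the remaining thresholds in this sketch (5 also passes, 3 is partial, 1-2 fail) are assumptions, and `fromLikert5` is a hypothetical helper rather than a library function.

```go
package main

import "fmt"

// Score is a local stand-in for the categorical score.
type Score string

const (
	ScorePass    Score = "pass"
	ScorePartial Score = "partial"
	ScoreFail    Score = "fail"
)

// fromLikert5 maps a 1-5 rating onto pass/partial/fail.
// Assumed thresholds: 4-5 pass, 3 partial, 1-2 fail.
func fromLikert5(rating int) (Score, error) {
	switch {
	case rating < 1 || rating > 5:
		return "", fmt.Errorf("likert rating %d is outside 1-5", rating)
	case rating >= 4:
		return ScorePass, nil
	case rating == 3:
		return ScorePartial, nil
	default:
		return ScoreFail, nil
	}
}

func main() {
	for rating := 1; rating <= 5; rating++ {
		score, _ := fromLikert5(rating)
		fmt.Printf("%d -> %s\n", rating, score)
	}
}
```

Keeping both the numeric rating and the mapped category, as the library example does, lets human studies compare on the 1-5 scale while pass criteria still operate on categories.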
Inter-Rater Reliability (v0.5.0)
Compare LLM evaluations with human ground truth:
// Compute IRR metrics
metrics := rubric.ComputeIRRFromResults(humanResults, llmResults)
fmt.Printf("Exact Agreement: %.1f%%\n", metrics.ExactAgreement*100)
fmt.Printf("Adjacent Agreement: %.1f%%\n", metrics.AdjacentAgreement*100)
fmt.Printf("Pearson r: %.3f\n", metrics.PearsonCorrelation)
// Categorical agreement with confusion matrix
agreement := rubric.ComputeCategoricalAgreement(humanResults, llmResults)
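The three metrics printed above have standard definitions that are easy to compute by hand. The sketch below implements them for paired 1-5 ratings under the usual conventions (adjacent agreement means the ratings differ by at most one point). `agreementWithin` and `pearson` are illustrative helpers, not the library's functions, and the sample ratings are made up for the example.

```go
package main

import (
	"fmt"
	"math"
)

// agreementWithin returns the fraction of paired ratings that differ by at
// most tol points: tol 0 gives exact agreement, tol 1 adjacent agreement.
// It returns NaN when the slices are empty or differ in length.
func agreementWithin(a, b []int, tol int) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return math.NaN()
	}
	hits := 0
	for i := range a {
		d := a[i] - b[i]
		if d < 0 {
			d = -d
		}
		if d <= tol {
			hits++
		}
	}
	return float64(hits) / float64(len(a))
}

// pearson returns Pearson's r for paired ratings. The result is NaN when the
// slices are empty, differ in length, or either rater gave constant scores.
func pearson(a, b []int) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return math.NaN()
	}
	n := float64(len(a))
	var sumA, sumB float64
	for i := range a {
		sumA += float64(a[i])
		sumB += float64(b[i])
	}
	meanA, meanB := sumA/n, sumB/n

	var cov, varA, varB float64
	for i := range a {
		da, db := float64(a[i])-meanA, float64(b[i])-meanB
		cov += da * db
		varA += da * da
		varB += db * db
	}
	return cov / math.Sqrt(varA*varB)
}

func main() {
	human := []int{5, 4, 3, 2, 4} // illustrative ratings only
	llm := []int{5, 3, 3, 1, 4}
	fmt.Printf("Exact Agreement: %.1f%%\n", agreementWithin(human, llm, 0)*100)
	fmt.Printf("Adjacent Agreement: %.1f%%\n", agreementWithin(human, llm, 1)*100)
	fmt.Printf("Pearson r: %.3f\n", pearson(human, llm))
}
```

Reporting all three together is informative because they fail differently: a judge that is consistently one point harsh scores low on exact agreement but high on adjacent agreement and correlation.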
Claims Validation (v0.6.0)
Validate factual claims have proper source backing:
import "github.com/plexusone/structured-evaluation/claims"
report := claims.NewClaimsReport("article.md")
// Source types: external (URL), internal (code/lab), derived, subjective
// Reliability tiers: authoritative, high, medium, low
// Verdicts: verified, unverified, needs-review, rejected
// Configure pass criteria
report.SetCriteria(claims.ClaimsCriteria{
	RequireAllVerified:            true,
	AllowSubjectiveWithDisclaimer: false,
	MinReliabilityTier:            claims.ReliabilityHigh,
})
report.Finalize()
if report.IsPassing() {
	fmt.Println("Ready for publication")
}
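As a rough picture of what such criteria could check, the sketch below flags claims that would block a passing result. Everything here is hypothetical: the `Tier`, `Claim`, and `Criteria` types only echo the names shown above, and the semantics (rejected claims always fail, non-verified claims fail only under `RequireAllVerified`, verified claims fail when their source tier is below the minimum) are assumptions rather than the library's documented rules.

```go
package main

import "fmt"

// Tier is a local stand-in for a source reliability tier, ordered low to high.
type Tier int

const (
	TierLow Tier = iota
	TierMedium
	TierHigh
	TierAuthoritative
)

// Claim is a simplified claim: a verdict plus the tier of its backing source.
type Claim struct {
	ID      string
	Verdict string // "verified", "unverified", "needs-review", or "rejected"
	Tier    Tier
}

// Criteria echoes two of the fields shown above.
type Criteria struct {
	RequireAllVerified bool
	MinReliabilityTier Tier
}

// failing returns the IDs of claims that violate the criteria
// under the assumed semantics described in the lead-in.
func failing(claims []Claim, c Criteria) []string {
	var ids []string
	for _, cl := range claims {
		switch {
		case cl.Verdict == "rejected":
			ids = append(ids, cl.ID)
		case cl.Verdict != "verified":
			if c.RequireAllVerified {
				ids = append(ids, cl.ID)
			}
		case cl.Tier < c.MinReliabilityTier:
			ids = append(ids, cl.ID)
		}
	}
	return ids
}

func main() {
	claims := []Claim{
		{ID: "cvss", Verdict: "verified", Tier: TierAuthoritative},
		{ID: "exploit", Verdict: "verified", Tier: TierMedium},
		{ID: "impact", Verdict: "needs-review", Tier: TierHigh},
	}
	blocked := failing(claims, Criteria{RequireAllVerified: true, MinReliabilityTier: TierHigh})
	fmt.Println("passing:", len(blocked) == 0, "blocked by:", blocked)
}
```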
Embedded Reports (v0.6.0)
Archive full-fidelity reports within SummaryReport:
report := summary.NewSummaryReport("project", "v1.0.0", "RELEASE")
// Embed detailed reports
report.EmbedRubricReport("quality-review", rubricReport)
report.EmbedClaimsReport("source-validation", claimsReport)
// Retrieve later
var r rubric.Rubric
report.GetEmbeddedRubricReport("quality-review", &r)
OmniObserve Integration
Export evaluations to Opik, Phoenix, or Langfuse:
import "github.com/plexusone/omniobserve/integrations/sevaluation"
// Export to observability platform
err := sevaluation.Export(ctx, provider, traceID, report)
Integration
Designed to work with:
- github.com/plexusone/omniobserve - LLM observability (Opik, Phoenix, Langfuse)
- github.com/grokify/structured-requirements - PRD evaluation templates
- github.com/plexusone/multi-agent-spec - Agent coordination
- github.com/grokify/structured-changelog - Release validation
License
MIT License - see LICENSE for details.