structured-evaluation

Version: v0.6.0
Published: Jun 1, 2026 License: MIT

README

Structured Evaluation

A reusable evaluation framework for LLM-as-Judge and multi-agent workflows.

Overview

structured-evaluation provides standardized types for evaluation reports, enabling:

  • ⚖️ LLM-as-Judge assessments with categorical scoring and severity-based findings
  • 📊 Dual-scale support with Likert (1-5) scales for human comparison studies
  • 📈 Inter-rater reliability metrics for LLM calibration and quality assurance
  • ✅ GO/NO-GO summary reports for deterministic checks (CI, tests, validation)
  • 🔗 Multi-agent coordination with DAG-based report aggregation
  • 📋 Claims validation for factual claim extraction and source verification

Architecture

┌───────────────────────────────────────────────────────────┐
│                    SummaryReport (GO/NO-GO)               │
│  ┌──────────────────────┐  ┌──────────────────────┐       │
│  │  Embedded Reports    │  │   Team Sections      │       │
│  │  (Full-Fidelity)     │  │   (Task Results)     │       │
│  └──────────────────────┘  └──────────────────────┘       │
└───────────────────────────────────────────────────────────┘
                              ▲
              ┌───────────────┴───────────────┐
              │                               │
┌─────────────┴─────────────┐   ┌─────────────┴─────────────┐
│     Rubric (rubric/)      │   │   ClaimsReport (claims/)  │
│  ┌─────────────────────┐  │   │  ┌─────────────────────┐  │
│  │ Category Results    │  │   │  │ Claims + Validation │  │
│  │ (pass/partial/fail) │  │   │  │ (verified/rejected) │  │
│  ├─────────────────────┤  │   │  ├─────────────────────┤  │
│  │ Findings            │  │   │  │ Sources             │  │
│  │ (severity-based)    │  │   │  │ (external/internal) │  │
│  └─────────────────────┘  │   │  └─────────────────────┘  │
│  LLM-as-Judge scoring     │   │  Fact verification        │
└───────────────────────────┘   └───────────────────────────┘

Three complementary report types:

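| Package  | Purpose                           | Evaluation Type           |
|----------|-----------------------------------|---------------------------|
| rubric/  | Categorical scoring with findings | Subjective (LLM-as-Judge) |
| claims/  | Fact verification with sources    | Objective (source-backed) |
| summary/ | GO/NO-GO aggregation              | Deterministic             |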

Installation

go get github.com/plexusone/structured-evaluation

Packages

| Package         | Description                                                       |
|-----------------|-------------------------------------------------------------------|
| rubric          | Rubric, CategoryResult, Finding, Severity types for LLM-as-Judge  |
| claims          | ClaimsReport, Claim, Validation, Verdict for source verification  |
| summary         | SummaryReport, TeamSection, TaskResult for GO/NO-GO checks        |
| combine         | DAG-based report aggregation using Kahn's algorithm               |
| render/box      | Box-format terminal renderer for summary reports                  |
| render/detailed | Detailed terminal renderer for rubric reports                     |
| render/terminal | ANSI-colored terminal renderer with UTF8 icons                    |
| render/markdown | Markdown report renderer                                          |
| schema          | JSON Schema generation and embedding                              |

Report Types

Rubric (LLM-as-Judge)

For subjective quality assessments with detailed findings:

import "github.com/plexusone/structured-evaluation/rubric"

report := rubric.NewRubric("prd", "document.md")
report.AddCategoryResult(rubric.CategoryResult{
    Category:  "problem_definition",
    Score:     rubric.ScorePass,
    Reasoning: "Clear problem statement with measurable goals",
})
report.AddFinding(rubric.Finding{
    Severity:       rubric.SeverityMedium,
    Category:       "metrics",
    Title:          "Missing baseline metrics",
    Recommendation: "Add current baseline measurements",
})
report.Finalize(nil, "sevaluation check document.md")

Summary Report (GO/NO-GO)

For deterministic checks with pass/fail status:

import "github.com/plexusone/structured-evaluation/summary"

report := summary.NewSummaryReport("my-service", "v1.0.0", "Release Validation")
report.AddTeam(summary.TeamSection{
    ID:   "qa",
    Name: "Quality Assurance",
    Tasks: []summary.TaskResult{
        {ID: "unit-tests", Status: summary.StatusGo, Detail: "Coverage: 92%"},
        {ID: "e2e-tests", Status: summary.StatusWarn, Detail: "2 flaky tests"},
    },
})

Claims Report (v0.6.0)

For factual claim extraction and source validation:

import "github.com/plexusone/structured-evaluation/claims"

report := claims.NewClaimsReport("security-advisory.md")

// External source: CVE from NVD
claim := claims.NewClaim("cvss", "CVSS 8.8 High", claims.ClaimRiskAssessment,
    claims.Location{Section: "severity"})
claim.SetValidation(claims.NewExternalValidation(
    "https://nvd.nist.gov/vuln/detail/CVE-2026-25253",
    claims.ExternalNVD,
))
report.AddClaim(*claim)

// Internal validation: exploit confirmed via code
exploit := claims.NewClaim("exploit", "RCE confirmed", claims.ClaimTechnicalFinding,
    claims.Location{Section: "impact"})
exploit.SetValidation(claims.NewInternalValidation(
    claims.MethodCodeExecution, "poc.py", true,
))
report.AddClaim(*exploit)

report.Finalize()
// report.Decision.Passed, report.Summary.Counts

Severity Levels

Following InfoSec conventions:

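| Severity | Icon | Blocking | Description              |
|----------|------|----------|--------------------------|
| Critical | 🔴   | Yes      | Must fix before approval |
| High     | 🔴   | Yes      | Must fix before approval |
| Medium   | 🟡   | No       | Should fix, tracked      |
| Low      | 🟢   | No       | Nice to fix              |
| Info     | ⚪   | No       | Informational only       |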
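
The blocking rule in the table can be sketched in a few lines of Go. The `Severity` type, its constants, and the `IsBlocking` and `CountBlocking` helpers below are hypothetical stand-ins written for this illustration; they are not taken from the rubric package.

```go
package main

import "fmt"

// Severity is a hypothetical stand-in for a severity level,
// ordered from least to most severe.
type Severity int

const (
	Info Severity = iota
	Low
	Medium
	High
	Critical
)

// IsBlocking reports whether a finding of this severity must be fixed
// before approval. Per the table, only High and Critical block.
func (s Severity) IsBlocking() bool {
	return s >= High
}

// CountBlocking returns how many of the given severities are blocking.
func CountBlocking(findings []Severity) int {
	n := 0
	for _, s := range findings {
		if s.IsBlocking() {
			n++
		}
	}
	return n
}

func main() {
	findings := []Severity{Medium, High, Info, Critical, Low}
	fmt.Println("blocking findings:", CountBlocking(findings))
}
```

Ordering the constants by severity keeps the blocking test to a single comparison.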

Pass Criteria

Default criteria allow no blocking (Critical or High) findings and leave Medium findings uncapped. Strict criteria also cap Medium findings at 3 and require every category to pass:

defaults := rubric.DefaultPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: -1 (unlimited), RequireAllPass: false

strict := rubric.StrictPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: 3, RequireAllPass: true
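
To make the thresholds concrete, here is a minimal sketch of how such criteria could be applied to a report's finding counts. The field names follow the comments above, but the `PassCriteria` and `Counts` structs and the `Passes` method are hypothetical, and treating any negative limit as unlimited is an assumption made for this example.

```go
package main

import "fmt"

// PassCriteria reuses the field names from the comments above; the struct
// itself is a hypothetical stand-in, not the library's type.
type PassCriteria struct {
	MaxCritical    int  // maximum allowed Critical findings
	MaxHigh        int  // maximum allowed High findings
	MaxMedium      int  // maximum allowed Medium findings; -1 means unlimited
	RequireAllPass bool // if true, every category must score "pass"
}

// Counts holds per-severity finding totals for one report.
type Counts struct {
	Critical, High, Medium int
}

// within reports whether n respects limit. A negative limit is treated
// as unlimited (an assumption of this sketch).
func within(n, limit int) bool {
	return limit < 0 || n <= limit
}

// Passes applies the criteria to finding counts and category outcomes.
func (c PassCriteria) Passes(counts Counts, allCategoriesPass bool) bool {
	if !within(counts.Critical, c.MaxCritical) ||
		!within(counts.High, c.MaxHigh) ||
		!within(counts.Medium, c.MaxMedium) {
		return false
	}
	return !c.RequireAllPass || allCategoriesPass
}

func main() {
	def := PassCriteria{MaxCritical: 0, MaxHigh: 0, MaxMedium: -1, RequireAllPass: false}
	strict := PassCriteria{MaxCritical: 0, MaxHigh: 0, MaxMedium: 3, RequireAllPass: true}

	counts := Counts{Medium: 5} // five Medium findings, nothing blocking
	fmt.Println("default:", def.Passes(counts, false))
	fmt.Println("strict: ", strict.Passes(counts, true))
}
```

With five Medium findings the default criteria pass and the strict criteria fail, because the strict cap on Medium findings is 3.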

CLI Tool

# Install
go install github.com/plexusone/structured-evaluation/cmd/sevaluation@latest

# Render reports
sevaluation render report.json --format=detailed
sevaluation render report.json --format=terminal   # ANSI colors + UTF8 icons
sevaluation render report.json --format=markdown   # Markdown output
sevaluation render report.json --format=box
sevaluation render report.json --format=json

# Check pass/fail (exit code 0/1)
sevaluation check report.json

# Validate structure
sevaluation validate report.json

# Generate JSON Schema
sevaluation schema generate -o ./schema/

DAG-Based Aggregation

For multi-agent workflows with dependencies:

import "github.com/plexusone/structured-evaluation/combine"

results := []combine.AgentResult{
    {TeamID: "qa", Tasks: qaTasks},
    {TeamID: "security", Tasks: secTasks, DependsOn: []string{"qa"}},
    {TeamID: "release", Tasks: relTasks, DependsOn: []string{"qa", "security"}},
}

report := combine.AggregateResults(results, "my-project", "v1.0.0", "Release")
// Teams are topologically sorted: qa → security → release
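
The package table names Kahn's algorithm as the ordering method. The following standalone sketch shows that algorithm on the same three teams. `TopoSort` and its map-based input are hypothetical and written only for illustration; the combine package's actual types and tie-breaking rules may differ.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// TopoSort orders team IDs so that every team appears after the teams it
// depends on. deps maps a team ID to the IDs it depends on.
func TopoSort(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int)        // unmet dependency count per team
	dependents := make(map[string][]string) // dependency -> teams waiting on it
	for id, ds := range deps {
		if _, ok := indegree[id]; !ok {
			indegree[id] = 0
		}
		for _, d := range ds {
			if _, ok := indegree[d]; !ok {
				indegree[d] = 0 // dependency that has no entry of its own
			}
			indegree[id]++
			dependents[d] = append(dependents[d], id)
		}
	}

	// Start with teams that have no dependencies; sort for a stable result.
	var ready []string
	for id, n := range indegree {
		if n == 0 {
			ready = append(ready, id)
		}
	}
	sort.Strings(ready)

	order := make([]string, 0, len(indegree))
	for len(ready) > 0 {
		id := ready[0]
		ready = ready[1:]
		order = append(order, id)

		next := dependents[id]
		sort.Strings(next)
		for _, dep := range next {
			indegree[dep]--
			if indegree[dep] == 0 {
				ready = append(ready, dep)
			}
		}
	}

	// Any team left unprocessed is part of a dependency cycle.
	if len(order) != len(indegree) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := TopoSort(map[string][]string{
		"qa":       nil,
		"security": {"qa"},
		"release":  {"qa", "security"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order) // [qa security release]
}
```

The length check at the end is what turns a circular `DependsOn` chain into an error instead of a silently truncated order.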

JSON Schema

Schemas are embedded for runtime validation:

import "github.com/plexusone/structured-evaluation/schema"

rubricSchema := schema.RubricSchemaJSON
claimsSchema := schema.ClaimsSchemaJSON
summarySchema := schema.SummarySchemaJSON

RubricSet (v0.4.0)

Define explicit criteria for consistent categorical evaluations:

cat := rubric.NewCategory("quality", "Output Quality", "Overall quality assessment").
    WithPassPartialFail(
        []string{"Meets all requirements, no significant issues"},
        []string{"Meets most requirements, minor issues"},
        []string{"Missing key requirements or major issues"},
    )

// Use default PRD rubric
rubricSet := rubric.DefaultPRDRubricSet()

Judge Metadata (v0.2.0)

Track LLM judge configuration for reproducibility:

judge := rubric.NewJudgeMetadata("claude-3-opus").
    WithProvider("anthropic").
    WithPrompt("prd-eval-v1", "1.0").
    WithTemperature(0.0).
    WithTokenUsage(1500, 800)

report.SetJudge(judge)

Pairwise Comparison (v0.2.0)

Compare two outputs instead of absolute scoring:

comparison := rubric.NewPairwiseComparison(input, outputA, outputB)
comparison.SetWinner(rubric.WinnerA, "A is more accurate", 0.9)

// Aggregate multiple comparisons
result := rubric.ComputePairwiseResult(comparisons)
// result.WinRateA, result.OverallWinner
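
As a rough illustration of what aggregating comparisons involves, the sketch below computes win rates from a list of outcomes. All names are hypothetical. Counting ties toward neither side and dividing by the total number of comparisons are assumptions of this example, not documented behaviour of `ComputePairwiseResult`.

```go
package main

import "fmt"

// Winner is a hypothetical outcome type for one pairwise comparison.
type Winner string

const (
	WinnerA   Winner = "A"
	WinnerB   Winner = "B"
	WinnerTie Winner = "tie"
)

// PairwiseResult summarizes a batch of comparisons.
type PairwiseResult struct {
	WinRateA, WinRateB float64 // fraction of all comparisons won by each side
	OverallWinner      Winner
}

// Summarize computes win rates over all comparisons (ties count toward
// neither side) and names the side with more wins as the overall winner.
func Summarize(outcomes []Winner) PairwiseResult {
	if len(outcomes) == 0 {
		return PairwiseResult{OverallWinner: WinnerTie}
	}
	var a, b int
	for _, w := range outcomes {
		switch w {
		case WinnerA:
			a++
		case WinnerB:
			b++
		}
	}
	total := float64(len(outcomes))
	res := PairwiseResult{
		WinRateA:      float64(a) / total,
		WinRateB:      float64(b) / total,
		OverallWinner: WinnerTie,
	}
	if a > b {
		res.OverallWinner = WinnerA
	} else if b > a {
		res.OverallWinner = WinnerB
	}
	return res
}

func main() {
	res := Summarize([]Winner{WinnerA, WinnerA, WinnerB, WinnerTie})
	fmt.Printf("A=%.2f B=%.2f winner=%s\n", res.WinRateA, res.WinRateB, res.OverallWinner)
	// prints "A=0.50 B=0.25 winner=A"
}
```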

Multi-Judge Aggregation (v0.4.0)

Combine evaluations from multiple judges:

result := rubric.AggregateEvaluations(evaluations, rubric.AggregationMajority)

// Methods: AggregationMajority, AggregationConservative, AggregationOptimistic
// result.Agreement - inter-judge agreement (0-1)
// result.Disagreements - categories with significant disagreement
// result.ConsolidatedDecision - final aggregated decision
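
The three method names suggest different ways to reduce several judges' scores to one. The sketch below is one plausible reading for a single category: majority takes the most common score, conservative the worst, optimistic the best. The types, the tie-breaking rule, and the agreement measure (the share of judges giving the most common score) are all assumptions for illustration.

```go
package main

import "fmt"

// Score is ordered from worst to best so scores can be compared directly.
type Score int

const (
	Fail Score = iota
	Partial
	Pass
)

// Method selects how the judges' scores are combined.
type Method int

const (
	Majority Method = iota
	Conservative
	Optimistic
)

// Aggregate combines one category's scores from several judges. It returns
// the consolidated score and the share of judges that gave the most common
// score, used here as a simple agreement measure between 0 and 1.
func Aggregate(scores []Score, m Method) (Score, float64) {
	if len(scores) == 0 {
		return Fail, 0
	}
	counts := map[Score]int{}
	lo, hi := scores[0], scores[0]
	for _, s := range scores {
		counts[s]++
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	// Scan from worst to best with a strict ">" so that ties in frequency
	// resolve to the worse score.
	mode, best := Fail, 0
	for _, s := range []Score{Fail, Partial, Pass} {
		if counts[s] > best {
			mode, best = s, counts[s]
		}
	}
	agreement := float64(best) / float64(len(scores))

	switch m {
	case Conservative:
		return lo, agreement
	case Optimistic:
		return hi, agreement
	default:
		return mode, agreement
	}
}

func main() {
	scores := []Score{Pass, Pass, Partial}
	for _, m := range []Method{Majority, Conservative, Optimistic} {
		s, agree := Aggregate(scores, m)
		fmt.Printf("method=%d score=%d agreement=%.2f\n", m, s, agree)
	}
}
```

Resolving frequency ties toward the worse score keeps the majority method from overstating quality when judges split evenly.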

Likert Scales (v0.5.0)

Use 1-5 numeric scales for human comparison studies:

// Create a Likert-scale category
cat := rubric.NewCategory("quality", "Content Quality", "Overall quality").
    WithLikert5(rubric.StandardLikert5Anchors())

// Record a Likert score (automatically maps to categorical).
// config is not shown here; it is assumed to be the category's Likert scale configuration.
result := rubric.NewCategoryResultFromLikert("quality", 4, config, "Good quality")
// result.Score = ScorePass, result.NumericScore = 4.0

// Or record both categorical and numeric
combined := rubric.NewCategoryResultWithNumeric("quality", rubric.ScorePass, 4.5, "reasoning")
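
The Likert-to-categorical mapping can be pictured as a simple threshold function. In the sketch below only the mapping of 4 to pass comes from the example above. The other cut-offs (5 pass, 3 partial, 1-2 fail), the `LikertToScore` helper, and the `ScorePartial` and `ScoreFail` names are assumptions for illustration.

```go
package main

import "fmt"

// Score is a hypothetical categorical score.
type Score string

const (
	ScorePass    Score = "pass"
	ScorePartial Score = "partial"
	ScoreFail    Score = "fail"
)

// LikertToScore maps a 1-5 rating to a categorical score.
// Assumed cut-offs: 4-5 pass, 3 partial, 1-2 fail.
func LikertToScore(rating int) (Score, error) {
	switch {
	case rating < 1 || rating > 5:
		return "", fmt.Errorf("likert rating %d is outside the range 1-5", rating)
	case rating >= 4:
		return ScorePass, nil
	case rating == 3:
		return ScorePartial, nil
	default:
		return ScoreFail, nil
	}
}

func main() {
	for r := 1; r <= 5; r++ {
		s, err := LikertToScore(r)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%d -> %s\n", r, s)
	}
}
```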

Inter-Rater Reliability (v0.5.0)

Compare LLM evaluations with human ground truth:

// Compute IRR metrics
metrics := rubric.ComputeIRRFromResults(humanResults, llmResults)

fmt.Printf("Exact Agreement: %.1f%%\n", metrics.ExactAgreement*100)
fmt.Printf("Adjacent Agreement: %.1f%%\n", metrics.AdjacentAgreement*100)
fmt.Printf("Pearson r: %.3f\n", metrics.PearsonCorrelation)

// Categorical agreement with confusion matrix
agreement := rubric.ComputeCategoricalAgreement(humanResults, llmResults)
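
For readers unfamiliar with these metrics, the sketch below computes them from two slices of paired 1-5 ratings. It uses the common definitions: exact agreement is the share of identical ratings, adjacent agreement is the share within one scale point, and Pearson r is the linear correlation. Whether the library uses exactly these definitions is an assumption. `Metrics` and `ComputeIRR` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Metrics holds three simple agreement statistics for paired ratings.
type Metrics struct {
	ExactAgreement    float64 // share of items with identical ratings
	AdjacentAgreement float64 // share of items whose ratings differ by at most 1
	PearsonR          float64 // linear correlation; NaN if a rater has no variance
}

// ComputeIRR compares human and LLM ratings item by item.
func ComputeIRR(human, llm []int) (Metrics, error) {
	n := len(human)
	if n == 0 || n != len(llm) {
		return Metrics{}, errors.New("rating slices must be non-empty and equal in length")
	}

	var exact, adjacent int
	var sumH, sumL float64
	for i := range human {
		d := human[i] - llm[i]
		if d == 0 {
			exact++
		}
		if d >= -1 && d <= 1 {
			adjacent++
		}
		sumH += float64(human[i])
		sumL += float64(llm[i])
	}

	meanH, meanL := sumH/float64(n), sumL/float64(n)
	var cov, varH, varL float64
	for i := range human {
		dh := float64(human[i]) - meanH
		dl := float64(llm[i]) - meanL
		cov += dh * dl
		varH += dh * dh
		varL += dl * dl
	}

	m := Metrics{
		ExactAgreement:    float64(exact) / float64(n),
		AdjacentAgreement: float64(adjacent) / float64(n),
		PearsonR:          math.NaN(),
	}
	if varH > 0 && varL > 0 {
		m.PearsonR = cov / math.Sqrt(varH*varL)
	}
	return m, nil
}

func main() {
	human := []int{5, 4, 3, 2, 1}
	llm := []int{5, 3, 3, 1, 1}
	m, err := ComputeIRR(human, llm)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("exact=%.1f%% adjacent=%.1f%% r=%.3f\n",
		m.ExactAgreement*100, m.AdjacentAgreement*100, m.PearsonR)
}
```

In this sample the raters agree exactly on three of five items and are within one point on all five, which is why adjacent agreement is often reported alongside exact agreement for ordinal scales.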

Claims Validation (v0.6.0)

Validate factual claims have proper source backing:

import "github.com/plexusone/structured-evaluation/claims"

report := claims.NewClaimsReport("article.md")

// Source types: external (URL), internal (code/lab), derived, subjective
// Reliability tiers: authoritative, high, medium, low
// Verdicts: verified, unverified, needs-review, rejected

// Configure pass criteria
report.SetCriteria(claims.ClaimsCriteria{
    RequireAllVerified:            true,
    AllowSubjectiveWithDisclaimer: false,
    MinReliabilityTier:            claims.ReliabilityHigh,
})

report.Finalize()
if report.IsPassing() {
    fmt.Println("Ready for publication")
}
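
One way to picture how such criteria decide a pass is the sketch below, which flags each claim that violates them. Everything here is hypothetical: the types are simplified stand-ins, rejected claims are assumed to always fail, and the subjective-claim option is left out for brevity.

```go
package main

import "fmt"

// Verdict values follow the list in the comments above.
type Verdict string

const (
	Verified    Verdict = "verified"
	Unverified  Verdict = "unverified"
	NeedsReview Verdict = "needs-review"
	Rejected    Verdict = "rejected"
)

// Tier ranks source reliability; a higher value means a more reliable source.
type Tier int

const (
	TierLow Tier = iota
	TierMedium
	TierHigh
	TierAuthoritative
)

// Claim is a simplified stand-in for a validated claim.
type Claim struct {
	ID      string
	Verdict Verdict
	Tier    Tier
}

// Criteria is a simplified stand-in for the pass criteria.
type Criteria struct {
	RequireAllVerified bool
	MinReliabilityTier Tier
}

// Failing returns the IDs of claims that violate the criteria.
// Rejected claims always fail (an assumption of this sketch).
func Failing(claims []Claim, c Criteria) []string {
	var failed []string
	for _, cl := range claims {
		notVerified := c.RequireAllVerified && cl.Verdict != Verified
		weakSource := cl.Tier < c.MinReliabilityTier
		if cl.Verdict == Rejected || notVerified || weakSource {
			failed = append(failed, cl.ID)
		}
	}
	return failed
}

func main() {
	claims := []Claim{
		{ID: "cvss", Verdict: Verified, Tier: TierAuthoritative},
		{ID: "exploit", Verdict: Verified, Tier: TierHigh},
		{ID: "market-share", Verdict: Unverified, Tier: TierMedium},
	}
	failed := Failing(claims, Criteria{RequireAllVerified: true, MinReliabilityTier: TierHigh})
	fmt.Println("passing:", len(failed) == 0, "failing claims:", failed)
}
```

Returning the failing IDs instead of a bare boolean makes it easier to report which claims still need a source.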

Embedded Reports (v0.6.0)

Archive full-fidelity reports within SummaryReport:

report := summary.NewSummaryReport("project", "v1.0.0", "RELEASE")

// Embed detailed reports
report.EmbedRubricReport("quality-review", rubricReport)
report.EmbedClaimsReport("source-validation", claimsReport)

// Retrieve later
var r rubric.Rubric
report.GetEmbeddedRubricReport("quality-review", &r)
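
A common way to archive arbitrary reports inside a parent document is to keep them as raw JSON keyed by name. The sketch below shows that pattern with `encoding/json`. The `Summary` container, its `Embed` and `Get` methods, and the toy `Rubric` type are hypothetical; whether SummaryReport stores embedded reports this way is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Summary is a hypothetical container that archives full reports by name.
type Summary struct {
	Project  string                     `json:"project"`
	Embedded map[string]json.RawMessage `json:"embedded,omitempty"`
}

// Embed serializes any report value and stores the raw JSON under name.
func (s *Summary) Embed(name string, report any) error {
	raw, err := json.Marshal(report)
	if err != nil {
		return fmt.Errorf("embed %q: %w", name, err)
	}
	if s.Embedded == nil {
		s.Embedded = make(map[string]json.RawMessage)
	}
	s.Embedded[name] = raw
	return nil
}

// Get decodes the report stored under name into out, which must be a pointer.
func (s *Summary) Get(name string, out any) error {
	raw, ok := s.Embedded[name]
	if !ok {
		return fmt.Errorf("no embedded report named %q", name)
	}
	return json.Unmarshal(raw, out)
}

// Rubric is a toy report type used only for this example.
type Rubric struct {
	Document string `json:"document"`
	Passed   bool   `json:"passed"`
}

func main() {
	s := &Summary{Project: "project"}
	if err := s.Embed("quality-review", Rubric{Document: "document.md", Passed: true}); err != nil {
		fmt.Println("error:", err)
		return
	}

	var r Rubric
	if err := s.Get("quality-review", &r); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", r) // {Document:document.md Passed:true}
}
```

Storing `json.RawMessage` means the parent report does not need to know the embedded report's type until it is read back.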

OmniObserve Integration

Export evaluations to Opik, Phoenix, or Langfuse:

import "github.com/plexusone/omniobserve/integrations/sevaluation"

// Export to observability platform
err := sevaluation.Export(ctx, provider, traceID, report)

Integration

Designed to work with:

  • github.com/plexusone/omniobserve - LLM observability (Opik, Phoenix, Langfuse)
  • github.com/grokify/structured-requirements - PRD evaluation templates
  • github.com/plexusone/multi-agent-spec - Agent coordination
  • github.com/grokify/structured-changelog - Release validation

License

MIT License - see LICENSE for details.

Directories

| Path            | Synopsis                                                                                                                 |
|-----------------|--------------------------------------------------------------------------------------------------------------------------|
| claims          | Package claims provides types for claim extraction and source validation.                                                |
| cmd/genschema   | Command genschema generates JSON schemas for all report types.                                                           |
| cmd/sevaluation | Command sevaluation provides CLI tools for working with evaluation reports.                                              |
| combine         | Package combine provides functionality for combining multiple evaluations into a single report, with DAG-based ordering. |
| render/box      | Package box provides box-format terminal rendering for summary reports.                                                  |
| render/detailed | Package detailed provides detailed terminal rendering for evaluation reports.                                            |
| render/markdown | Package markdown provides Markdown rendering for evaluation reports.                                                     |
| render/terminal | Package terminal provides ANSI-colored terminal rendering for evaluation reports.                                        |
| rubric          | Package rubric provides types for rubric-based evaluation reports with categorical scoring and severity-based findings. |
| schema          | Package schema provides JSON Schema generation for structured evaluation types.                                          |
| summary         | Package summary provides types for summary-style evaluation reports with GO/WARN/NO-GO status per task.                  |
