evaluation

package
v0.7.0
Warning

This package is not in the latest version of its module.

Published: Jun 1, 2026 License: MIT Imports: 8 Imported by: 0

Documentation

Overview

Package evaluation provides types and utilities for evaluating vulnerability documentation using LLM-as-Judge workflows.

The schema follows Go-first principles: Go types are the source of truth, JSON Schema is generated from them. Rubric implementations are JSON data files that conform to the schema.

Constants

const (
	CategoryTechnicalAccuracy     = "technical_accuracy"
	CategoryResponsibleDisclosure = "responsible_disclosure"
	CategoryCompleteness          = "completeness"
	CategoryActionability         = "actionability"
	CategoryFrameworkMappings     = "framework_mappings"
	CategorySourceAttribution     = "source_attribution"
	CategoryDetectionContent      = "detection_content"
	CategoryWritingQuality        = "writing_quality"
	CategoryDiagramQuality        = "diagram_quality"
)

Category IDs for vulnerability articles.

const (
	CategorySchemaCompliance    = "schema_compliance"
	CategoryAssetIdentification = "asset_identification"
	CategoryAttackModeling      = "attack_modeling"
	CategoryMappingAccuracy     = "mapping_accuracy"
	CategoryMitigationQuality   = "mitigation_quality"
	CategoryDiagramIntegration  = "diagram_integration"
	CategoryThreatCoverage      = "threat_coverage"
	CategoryCredentialFlows     = "credential_flows"
	CategoryRedBlueContent      = "red_blue_content"
)

Category IDs for threat model JSON files.

const (
	CategoryTrustBoundaries   = "trust_boundaries"
	CategoryAttackFlowClarity = "attack_flow_clarity"
	CategoryNotationStandards = "notation_standards"
	CategoryDataFlowAccuracy  = "data_flow_accuracy"
	CategoryConsistency       = "consistency"
	CategoryRenderingQuality  = "rendering_quality"
	CategoryAccessibility     = "accessibility"
)

Category IDs for diagrams.

Variables

This section is empty.

Functions

func FindingTemplateToFinding

func FindingTemplateToFinding(ft FindingTemplate) rubric.Finding

FindingTemplateToFinding converts a FindingTemplate to a structured-evaluation Finding.

func ListEmbeddedRubrics

func ListEmbeddedRubrics() ([]string, error)

ListEmbeddedRubrics returns the names of all embedded rubrics.

Types

type Category

type Category struct {
	// ID uniquely identifies this category within the rubric.
	ID string `json:"id"`

	// Name is the human-readable category name.
	Name string `json:"name"`

	// Description explains what this category measures.
	Description string `json:"description"`

	// Weight is the relative importance (default 1.0).
	Weight float64 `json:"weight,omitempty"`

	// Required indicates if this category must pass for overall pass.
	Required bool `json:"required,omitempty"`

	// Scale defines how this category is scored.
	Scale Scale `json:"scale"`

	// EvaluationPrompt is a specific prompt for evaluating this category.
	EvaluationPrompt string `json:"evaluationPrompt,omitempty"`

	// Examples provides few-shot examples for LLM evaluation.
	Examples *CategoryExamples `json:"examples,omitempty"`
}

Category is a single evaluation dimension.

type CategoryExamples

type CategoryExamples struct {
	Pass    *Example `json:"pass,omitempty"`
	Partial *Example `json:"partial,omitempty"`
	Fail    *Example `json:"fail,omitempty"`
}

CategoryExamples provides few-shot examples for a category. Research shows 1 example per level improves LLM alignment.

type CategoryResult

type CategoryResult struct {
	// Category is the category ID.
	Category string `json:"category"`

	// Score is the assigned score (e.g., "pass", "partial", "fail").
	Score string `json:"score"`

	// Reasoning explains the score (chain-of-thought).
	Reasoning string `json:"reasoning"`

	// Evidence are specific quotes or observations.
	Evidence []string `json:"evidence,omitempty"`
}

CategoryResult is the evaluation result for a single category.

type ChecklistThreshold

type ChecklistThreshold struct {
	// Required is "all" or a number of required items that must be present.
	Required string `json:"required,omitempty"`

	// Optional is the minimum number of optional items needed.
	Optional int `json:"optional,omitempty"`
}

ChecklistThreshold defines pass criteria for checklist scales.
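The following is a minimal sketch of how a checklist threshold might be applied. The type is copied locally so the example compiles on its own. The helper `checklistPasses` is hypothetical and not part of this package, and treating an empty `Required` the same as "all" is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// ChecklistThreshold mirrors the documented type so the sketch is self-contained.
type ChecklistThreshold struct {
	Required string `json:"required,omitempty"`
	Optional int    `json:"optional,omitempty"`
}

// checklistPasses is a hypothetical helper, not part of the package.
// It reports whether enough required and optional items were found.
func checklistPasses(t ChecklistThreshold, requiredFound, requiredTotal, optionalFound int) (bool, error) {
	need := requiredTotal // assumption: "" behaves like "all"
	if t.Required != "" && t.Required != "all" {
		n, err := strconv.Atoi(t.Required)
		if err != nil {
			return false, fmt.Errorf("invalid required threshold %q: %w", t.Required, err)
		}
		need = n
	}
	return requiredFound >= need && optionalFound >= t.Optional, nil
}

func main() {
	t := ChecklistThreshold{Required: "2", Optional: 1}
	ok, err := checklistPasses(t, 2, 3, 1)
	fmt.Println(ok, err)
}
```

Because `Required` is a string, a caller has to handle a value that is neither "all" nor a number. The sketch returns an error in that case.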

type EvaluationResult

type EvaluationResult struct {
	// RubricID identifies the rubric used.
	RubricID string `json:"rubricId"`

	// RubricVersion is the version of the rubric used.
	RubricVersion string `json:"rubricVersion"`

	// Categories contains per-category results.
	Categories []CategoryResult `json:"categories"`

	// Findings are issues discovered during evaluation.
	Findings []Finding `json:"findings,omitempty"`

	// OverallDecision is pass/conditional/fail.
	OverallDecision string `json:"overallDecision"`

	// Summary is a brief explanation of the decision.
	Summary string `json:"summary"`
}

EvaluationResult is the output from an LLM judge evaluation.
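As an illustration of the JSON shape implied by the struct tags above, the sketch below decodes a judge response into locally copied types. Only a subset of fields is mirrored. The sample payload, the rubric ID in it, and the helper `parseJudgeOutput` are made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of a subset of the documented fields, so the sketch
// compiles without importing the package.
type CategoryResult struct {
	Category  string   `json:"category"`
	Score     string   `json:"score"`
	Reasoning string   `json:"reasoning"`
	Evidence  []string `json:"evidence,omitempty"`
}

type EvaluationResult struct {
	RubricID        string           `json:"rubricId"`
	RubricVersion   string           `json:"rubricVersion"`
	Categories      []CategoryResult `json:"categories"`
	OverallDecision string           `json:"overallDecision"`
	Summary         string           `json:"summary"`
}

// sampleJudgeOutput is invented data for illustration only.
const sampleJudgeOutput = `{
  "rubricId": "example-rubric",
  "rubricVersion": "0.1.0",
  "categories": [
    {"category": "technical_accuracy", "score": "pass", "reasoning": "Details are consistent."},
    {"category": "completeness", "score": "partial", "reasoning": "No remediation steps.", "evidence": ["Mitigation section is empty"]}
  ],
  "overallDecision": "conditional",
  "summary": "Accurate but incomplete."
}`

// parseJudgeOutput is a hypothetical helper that decodes raw judge JSON.
func parseJudgeOutput(raw string) (*EvaluationResult, error) {
	var er EvaluationResult
	if err := json.Unmarshal([]byte(raw), &er); err != nil {
		return nil, fmt.Errorf("decode judge output: %w", err)
	}
	return &er, nil
}

func main() {
	er, err := parseJudgeOutput(sampleJudgeOutput)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range er.Categories {
		fmt.Printf("%s: %s\n", c.Category, c.Score)
	}
	fmt.Println("decision:", er.OverallDecision)
}
```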

func (*EvaluationResult) ToClaimsReport

func (er *EvaluationResult) ToClaimsReport(document string) *claims.ClaimsReport

ToClaimsReport extracts factual claims from an EvaluationResult for source validation. This is useful for verifying CVE details, CVSS scores, and other factual assertions.

func (*EvaluationResult) ToEvaluationReport

func (er *EvaluationResult) ToEvaluationReport(document string) *rubric.Rubric

ToEvaluationReport converts an EvaluationResult to a structured-evaluation EvaluationReport. This enables integration with the broader structured-evaluation ecosystem.

type EvaluationType

type EvaluationType string

EvaluationType defines how evaluation is performed.

const (
	// EvaluationTypeAnalytic scores each category independently (recommended).
	EvaluationTypeAnalytic EvaluationType = "analytic"

	// EvaluationTypeHolistic provides a single overall score.
	EvaluationTypeHolistic EvaluationType = "holistic"
)

type Example

type Example struct {
	// Excerpt is example content from an article.
	Excerpt string `json:"excerpt"`

	// Reasoning explains why this gets this score.
	// Including reasoning improves LLM alignment (chain-of-thought).
	Reasoning string `json:"reasoning"`
}

Example is a few-shot example for LLM evaluation.

type Finding

type Finding struct {
	ID             string   `json:"id,omitempty"`
	Category       string   `json:"category"`
	Severity       Severity `json:"severity"`
	Title          string   `json:"title"`
	Description    string   `json:"description"`
	Recommendation string   `json:"recommendation,omitempty"`
	Evidence       []string `json:"evidence,omitempty"`
}

Finding represents an issue discovered during evaluation. Severity type is defined in findings_catalog.go.

type FindingLimits

type FindingLimits struct {
	Critical int `json:"critical"`
	High     int `json:"high"`
	Medium   int `json:"medium"`
	Low      int `json:"low,omitempty"`
}

FindingLimits sets maximum allowed findings per severity. Use -1 for unlimited.
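One way the limits could be enforced is sketched below, with -1 meaning unlimited as stated above. The types are local copies. `exceededSeverities` is a hypothetical helper, and leaving "info" findings uncapped is an assumption of this sketch.

```go
package main

import "fmt"

type Severity string

const (
	SeverityCritical Severity = "critical"
	SeverityHigh     Severity = "high"
	SeverityMedium   Severity = "medium"
	SeverityLow      Severity = "low"
)

// FindingLimits mirrors the documented type.
type FindingLimits struct {
	Critical int `json:"critical"`
	High     int `json:"high"`
	Medium   int `json:"medium"`
	Low      int `json:"low,omitempty"`
}

// exceededSeverities is a hypothetical helper. It returns, from most to
// least severe, the severities whose finding count is over the limit.
// A limit of -1 (any negative value here) is treated as unlimited.
func exceededSeverities(limits FindingLimits, found []Severity) []Severity {
	counts := make(map[Severity]int)
	for _, s := range found {
		counts[s]++
	}
	checks := []struct {
		sev Severity
		max int
	}{
		{SeverityCritical, limits.Critical},
		{SeverityHigh, limits.High},
		{SeverityMedium, limits.Medium},
		{SeverityLow, limits.Low},
	}
	var over []Severity
	for _, c := range checks {
		if c.max >= 0 && counts[c.sev] > c.max {
			over = append(over, c.sev)
		}
	}
	return over
}

func main() {
	limits := FindingLimits{Critical: 0, High: 0, Medium: 2, Low: -1}
	found := []Severity{SeverityHigh, SeverityMedium, SeverityLow, SeverityLow}
	fmt.Println(exceededSeverities(limits, found))
}
```

In this sketch a `low` field that is absent from the JSON decodes to 0, which allows no low findings. Whether the package itself treats an omitted value that way is not stated here.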

type FindingTemplate

type FindingTemplate struct {
	ID             string   `json:"id"`
	Category       string   `json:"category"`
	Severity       Severity `json:"severity"`
	Title          string   `json:"title"`
	Description    string   `json:"description"`
	Recommendation string   `json:"recommendation"`
	Effort         string   `json:"effort"` // low, medium, high
}

FindingTemplate defines a reusable finding pattern.

func ArticleFindingsCatalog

func ArticleFindingsCatalog() []FindingTemplate

ArticleFindingsCatalog returns common findings for vulnerability articles.

func DiagramFindingsCatalog

func DiagramFindingsCatalog() []FindingTemplate

DiagramFindingsCatalog returns common findings for security diagrams.

func ThreatModelFindingsCatalog

func ThreatModelFindingsCatalog() []FindingTemplate

ThreatModelFindingsCatalog returns common findings for threat model JSON files.

type PassCriteria

type PassCriteria struct {
	// MinCategoriesPassing is "all", "all_required", or a number.
	MinCategoriesPassing string `json:"minCategoriesPassing,omitempty"`

	// MaxFindings limits findings by severity.
	MaxFindings *FindingLimits `json:"maxFindingsSeverity,omitempty"`
}

PassCriteria defines requirements for overall pass/fail determination.
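The three documented forms of `MinCategoriesPassing` ("all", "all_required", or a number) could be interpreted roughly as follows. The `categoryOutcome` type and the `meetsMinCategories` helper are hypothetical, and treating an empty value as "all" is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// categoryOutcome is a hypothetical summary of one category's result.
type categoryOutcome struct {
	Required bool // mirrors Category.Required
	Passed   bool // whether the judge's score counts as passing
}

// meetsMinCategories is a hypothetical helper that interprets the
// MinCategoriesPassing string against per-category outcomes.
func meetsMinCategories(min string, outcomes []categoryOutcome) (bool, error) {
	passed, requiredFailed := 0, 0
	for _, o := range outcomes {
		if o.Passed {
			passed++
		} else if o.Required {
			requiredFailed++
		}
	}
	switch min {
	case "", "all": // assumption: empty defaults to "all"
		return passed == len(outcomes), nil
	case "all_required":
		return requiredFailed == 0, nil
	default:
		n, err := strconv.Atoi(min)
		if err != nil {
			return false, fmt.Errorf("invalid minCategoriesPassing %q: %w", min, err)
		}
		return passed >= n, nil
	}
}

func main() {
	outcomes := []categoryOutcome{
		{Required: true, Passed: true},
		{Required: false, Passed: false},
		{Required: true, Passed: true},
	}
	for _, min := range []string{"all", "all_required", "2"} {
		ok, err := meetsMinCategories(min, outcomes)
		fmt.Println(min, ok, err)
	}
}
```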

type RubricMetadata

type RubricMetadata struct {
	CreatedAt string   `json:"createdAt,omitempty"`
	Author    string   `json:"author,omitempty"`
	BasedOn   []string `json:"basedOn,omitempty"`
}

RubricMetadata contains additional rubric information.

type RubricSet

type RubricSet struct {
	// ID uniquely identifies this rubric set.
	ID string `json:"id"`

	// Name is the human-readable name.
	Name string `json:"name"`

	// Version is the semantic version of this rubric.
	Version string `json:"version"`

	// Description explains what this rubric set evaluates.
	Description string `json:"description,omitempty"`

	// EvaluationType is "analytic" (per-category) or "holistic" (single score).
	// Analytic is recommended for LLM-as-Judge.
	EvaluationType EvaluationType `json:"evaluationType,omitempty"`

	// PassCriteria defines requirements for overall pass/fail.
	PassCriteria PassCriteria `json:"passCriteria"`

	// Categories are the evaluation dimensions.
	Categories []Category `json:"categories"`

	// JudgePromptTemplate is the prompt template for LLM evaluation.
	// Supports placeholders: {article_content}, {categories}, etc.
	JudgePromptTemplate string `json:"judgePromptTemplate,omitempty"`

	// Metadata contains additional information about the rubric.
	Metadata *RubricMetadata `json:"metadata,omitempty"`
}

RubricSet is a collection of rubrics for evaluating a document type.

func DiagramRubric

func DiagramRubric() (*RubricSet, error)

DiagramRubric loads the embedded diagram rubric.

func LoadEmbeddedRubric

func LoadEmbeddedRubric(name string) (*RubricSet, error)

LoadEmbeddedRubric loads a rubric from the embedded rubrics directory. Name should be the filename without path (e.g., "vulnerability-article.rubric.json").

func LoadRubricFromFile

func LoadRubricFromFile(path string) (*RubricSet, error)

LoadRubricFromFile loads a rubric set from a JSON file.
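To show what a rubric JSON data file might look like and how it could be read, here is a rough stand-in that uses only the standard library. It is not the package's implementation. The types mirror a subset of the documented fields, and `loadRubric`, `writeSampleRubric`, and the sample rubric content are invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Local copies of a subset of the documented fields.
type Category struct {
	ID     string  `json:"id"`
	Name   string  `json:"name"`
	Weight float64 `json:"weight,omitempty"`
}

type RubricSet struct {
	ID         string     `json:"id"`
	Name       string     `json:"name"`
	Version    string     `json:"version"`
	Categories []Category `json:"categories"`
}

// sampleRubric is invented data for illustration only.
const sampleRubric = `{
  "id": "example-rubric",
  "name": "Example Rubric",
  "version": "0.1.0",
  "categories": [
    {"id": "completeness", "name": "Completeness", "weight": 2},
    {"id": "writing_quality", "name": "Writing Quality"}
  ]
}`

// loadRubric is a hypothetical stand-in that reads and decodes a rubric file.
func loadRubric(path string) (*RubricSet, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read rubric: %w", err)
	}
	var rs RubricSet
	if err := json.Unmarshal(data, &rs); err != nil {
		return nil, fmt.Errorf("parse rubric %s: %w", path, err)
	}
	return &rs, nil
}

// writeSampleRubric writes sampleRubric into a fresh temp directory.
func writeSampleRubric() (string, error) {
	dir, err := os.MkdirTemp("", "rubric-sketch")
	if err != nil {
		return "", err
	}
	path := filepath.Join(dir, "example.rubric.json")
	if err := os.WriteFile(path, []byte(sampleRubric), 0o600); err != nil {
		return "", err
	}
	return path, nil
}

func main() {
	path, err := writeSampleRubric()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(filepath.Dir(path))

	rs, err := loadRubric(path)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s v%s with %d categories\n", rs.Name, rs.Version, len(rs.Categories))
}
```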

func ThreatModelRubric

func ThreatModelRubric() (*RubricSet, error)

ThreatModelRubric loads the embedded threat model rubric.

func VulnerabilityArticleRubric

func VulnerabilityArticleRubric() (*RubricSet, error)

VulnerabilityArticleRubric loads the embedded vulnerability article rubric.

func (*RubricSet) ToJSON

func (rs *RubricSet) ToJSON() ([]byte, error)

ToJSON serializes a rubric set to JSON.

func (*RubricSet) ToPrompt

func (rs *RubricSet) ToPrompt(content string) string

ToPrompt generates an LLM-ready prompt from the rubric. If content is provided, it's inserted into the template.
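The placeholder substitution described for `JudgePromptTemplate` can be approximated with `strings.NewReplacer`. This is only a sketch of the idea. How `ToPrompt` actually renders `{categories}` is not documented here, so joining category names with commas is an assumption, and `renderPrompt` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// renderPrompt is a hypothetical helper that fills the two placeholders
// named in the RubricSet documentation. Replaced text is not rescanned,
// so placeholder-like strings inside content are left untouched.
func renderPrompt(template, content string, categoryNames []string) string {
	r := strings.NewReplacer(
		"{article_content}", content,
		"{categories}", strings.Join(categoryNames, ", "), // assumption: comma-separated names
	)
	return r.Replace(template)
}

func main() {
	tmpl := "Evaluate the article on: {categories}\n\n{article_content}"
	fmt.Println(renderPrompt(tmpl, "Example article text.", []string{"completeness", "actionability"}))
}
```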

func (*RubricSet) Validate

func (rs *RubricSet) Validate() []string

Validate checks the rubric for common issues.

type Scale

type Scale struct {
	// Type is "categorical", "checklist", or "binary".
	// Categorical with 2-3 options is recommended for LLM-as-Judge.
	Type ScaleType `json:"type"`

	// Options are the scoring options (for categorical scales).
	Options []ScaleOption `json:"options,omitempty"`

	// RequiredItems are items that must be present (for checklist scales).
	RequiredItems []string `json:"requiredItems,omitempty"`

	// OptionalItems are items that add value (for checklist scales).
	OptionalItems []string `json:"optionalItems,omitempty"`

	// PassingThreshold defines pass criteria (for checklist scales).
	PassingThreshold *ChecklistThreshold `json:"passingThreshold,omitempty"`
}

Scale defines the scoring mechanism for a category.

type ScaleOption

type ScaleOption struct {
	// Value is the machine-readable value (e.g., "pass", "partial", "fail").
	Value string `json:"value"`

	// Label is the human-readable label.
	Label string `json:"label"`

	// Criteria are specific requirements for this score level.
	Criteria []string `json:"criteria"`
}

ScaleOption is a single option in a categorical scale.
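A categorical scale with the three values mentioned above (pass, partial, fail) might be built and serialized like this. The types are local copies of a subset of the documented fields. The criteria strings and the helpers `passPartialFail` and `optionValues` are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type ScaleType string

const ScaleTypeCategorical ScaleType = "categorical"

// ScaleOption and Scale mirror a subset of the documented fields.
type ScaleOption struct {
	Value    string   `json:"value"`
	Label    string   `json:"label"`
	Criteria []string `json:"criteria"`
}

type Scale struct {
	Type    ScaleType     `json:"type"`
	Options []ScaleOption `json:"options,omitempty"`
}

// passPartialFail builds a three-level categorical scale.
// The criteria text is invented for this example.
func passPartialFail() Scale {
	return Scale{
		Type: ScaleTypeCategorical,
		Options: []ScaleOption{
			{Value: "pass", Label: "Pass", Criteria: []string{"All claims are supported by evidence"}},
			{Value: "partial", Label: "Partial", Criteria: []string{"Some claims lack evidence"}},
			{Value: "fail", Label: "Fail", Criteria: []string{"Key claims are unsupported"}},
		},
	}
}

// optionValues lists the machine-readable values in declaration order.
func optionValues(s Scale) []string {
	values := make([]string, 0, len(s.Options))
	for _, o := range s.Options {
		values = append(values, o.Value)
	}
	return values
}

func main() {
	out, err := json.MarshalIndent(passPartialFail(), "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```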

type ScaleType

type ScaleType string

ScaleType defines the type of scoring scale.

const (
	// ScaleTypeCategorical uses discrete categories (pass/partial/fail).
	// Recommended for LLM-as-Judge - better calibrated than numeric.
	ScaleTypeCategorical ScaleType = "categorical"

	// ScaleTypeChecklist uses a list of required/optional items.
	ScaleTypeChecklist ScaleType = "checklist"

	// ScaleTypeBinary is simple pass/fail.
	ScaleTypeBinary ScaleType = "binary"

	// ScaleTypeNumeric uses numeric scores (0-10).
	// Less recommended for LLM evaluation due to calibration issues.
	ScaleTypeNumeric ScaleType = "numeric"
)

type Severity

type Severity string

Severity levels following InfoSec conventions.

const (
	SeverityCritical Severity = "critical" // Publication blocker - must fix
	SeverityHigh     Severity = "high"     // Publication blocker - must fix
	SeverityMedium   Severity = "medium"   // Should fix, tracked
	SeverityLow      Severity = "low"      // Nice to fix
	SeverityInfo     Severity = "info"     // Informational only
)
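Based on the comments on the constants above, critical and high findings block publication. A caller might encode that rule as in the sketch below. `blocksPublication` is a hypothetical helper, not a function exported by this package.

```go
package main

import "fmt"

type Severity string

const (
	SeverityCritical Severity = "critical"
	SeverityHigh     Severity = "high"
	SeverityMedium   Severity = "medium"
	SeverityLow      Severity = "low"
	SeverityInfo     Severity = "info"
)

// blocksPublication is a hypothetical helper that reports whether a
// finding of the given severity is a publication blocker.
func blocksPublication(s Severity) bool {
	switch s {
	case SeverityCritical, SeverityHigh:
		return true
	}
	return false
}

func main() {
	all := []Severity{SeverityCritical, SeverityHigh, SeverityMedium, SeverityLow, SeverityInfo}
	for _, s := range all {
		fmt.Printf("%-8s blocks publication: %t\n", s, blocksPublication(s))
	}
}
```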
