Documentation ¶
Overview ¶
Package evals provides the core evaluation framework for PromptPack. Eval definitions travel with packs and can run both during Arena testing and at runtime in production via the SDK.
Index ¶
- Constants
- Variables
- func RegisterDefault(h EvalTypeHandler)
- func ShouldRun(trigger EvalTrigger, samplePct float64, ctx *TriggerContext) bool
- func ShouldRunWhen(when *EvalWhen, toolCalls []ToolCallRecord) (shouldRun bool, reason string)
- func ValidateEvalTypes(defs []EvalDef, registry *EvalTypeRegistry) []string
- func ValidateEvals(defs []EvalDef, scope string) []string
- type EvalContext
- type EvalDef
- type EvalResult
- type EvalRunner
- func (r *EvalRunner) RunConversationEvals(ctx context.Context, defs []EvalDef, evalCtx *EvalContext) []EvalResult
- func (r *EvalRunner) RunSessionEvals(ctx context.Context, defs []EvalDef, evalCtx *EvalContext) []EvalResult
- func (r *EvalRunner) RunTurnEvals(ctx context.Context, defs []EvalDef, evalCtx *EvalContext) []EvalResult
- type EvalTrigger
- type EvalTypeHandler
- type EvalTypeRegistry
- type EvalViolation
- type EvalWhen
- type EvalWorker
- type EventSubscriber
- type Logger
- type MetricCollector
- type MetricCollectorOption
- type MetricDef
- type MetricRecorder
- type MetricResultWriter
- type MetricType
- type Range
- type ResultWriter
- type RunnerOption
- type Threshold
- type ToolCallRecord
- type TriggerContext
- type WorkerOption
Constants ¶
const DefaultEvalTimeout = 30 * time.Second
DefaultEvalTimeout is the per-eval execution timeout.
const DefaultSamplePercentage = 5.0
DefaultSamplePercentage is the default sampling rate when not specified.
Variables ¶
var DefaultBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
DefaultBuckets are the default Prometheus histogram bucket boundaries. These match prometheus.DefBuckets.
var ValidMetricTypes = map[MetricType]bool{
	MetricGauge:     true,
	MetricCounter:   true,
	MetricHistogram: true,
	MetricBoolean:   true,
}
ValidMetricTypes is the set of valid metric type values.
var ValidTriggers = map[EvalTrigger]bool{
	TriggerEveryTurn:              true,
	TriggerOnSessionComplete:      true,
	TriggerSampleTurns:            true,
	TriggerSampleSessions:         true,
	TriggerOnConversationComplete: true,
	TriggerOnWorkflowStep:         true,
}
ValidTriggers is the set of valid trigger values.
Functions ¶
func RegisterDefault ¶
func RegisterDefault(h EvalTypeHandler)
RegisterDefault adds a handler to the default set used by NewEvalTypeRegistry. Call this from handler init() functions or from handlers.RegisterDefaults().
func ShouldRun ¶
func ShouldRun(trigger EvalTrigger, samplePct float64, ctx *TriggerContext) bool
ShouldRun determines whether an eval should fire given its trigger, sampling percentage, and current context. Sampling is deterministic: the same sessionID+turnIndex always produces the same decision.
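The package does not document the exact hash it uses, but the deterministic sampling it describes can be sketched as follows. The FNV-1a hash and the `sessionID:turnIndex` key format here are illustrative assumptions, not the package's actual scheme:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample sketches a hash-based sampling decision of the kind ShouldRun
// describes: the same sessionID+turnIndex pair always lands in the same
// bucket, so the decision is stable across retries and replicas.
// FNV-1a is an assumption; the real hash is not documented.
func shouldSample(sessionID string, turnIndex int, samplePct float64) bool {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s:%d", sessionID, turnIndex)
	bucket := float64(h.Sum64()%10000) / 100.0 // 0.00–99.99
	return bucket < samplePct
}

func main() {
	// Deterministic: repeated calls with the same inputs always agree.
	fmt.Println(shouldSample("sess-42", 3, 5.0) == shouldSample("sess-42", 3, 5.0))
}
```

Because the bucket is derived from the inputs alone, a 5% sample selects the same 5% of turns on every node that evaluates them.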
func ShouldRunWhen ¶ added in v1.3.2
func ShouldRunWhen(when *EvalWhen, toolCalls []ToolCallRecord) (shouldRun bool, reason string)
ShouldRunWhen evaluates EvalWhen preconditions against the current eval context's tool call records. Returns whether the eval should run and a reason string if skipped. When toolCalls is nil (e.g. duplex path), returns true to let the handler itself decide how to handle the missing data.
func ValidateEvalTypes ¶ added in v1.3.3
func ValidateEvalTypes(defs []EvalDef, registry *EvalTypeRegistry) []string
ValidateEvalTypes checks that every EvalDef's Type has a registered handler in the given registry. Returns a list of human-readable error strings for any unknown types. This is safe to call from any package that has access to both the defs and the registry.
func ValidateEvals ¶
func ValidateEvals(defs []EvalDef, scope string) []string
ValidateEvals validates a slice of EvalDef for correctness. The scope parameter is used in error messages (e.g. "pack", "prompt:foo"). It checks:
- IDs are non-empty and unique within the slice
- Type is non-empty
- Trigger is a valid value
- sample_percentage (if set) is in [0, 100]
- Metric name matches Prometheus naming regex
- Metric type is one of gauge/counter/histogram/boolean
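The Prometheus naming rule mentioned above can be checked with the pattern Prometheus itself documents; the package's exact regex is not shown here, so this is a sketch of the same check:

```go
package main

import (
	"fmt"
	"regexp"
)

// metricNameRe is the standard Prometheus metric-name pattern
// ([a-zA-Z_:][a-zA-Z0-9_:]*), anchored to match the whole name.
var metricNameRe = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

func main() {
	fmt.Println(metricNameRe.MatchString("promptpack_eval_score")) // valid
	fmt.Println(metricNameRe.MatchString("9_starts_with_digit"))  // invalid: leading digit
}
```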
Types ¶
type EvalContext ¶
type EvalContext struct {
Messages []types.Message `json:"messages"`
TurnIndex int `json:"turn_index"`
CurrentOutput string `json:"current_output"`
ToolCalls []ToolCallRecord `json:"tool_calls,omitempty"`
SessionID string `json:"session_id"`
PromptID string `json:"prompt_id"`
Variables map[string]any `json:"variables,omitempty"`
Metadata map[string]any `json:"metadata,omitempty"`
Extras map[string]any `json:"extras,omitempty"`
}
EvalContext provides data to eval handlers. For turn-level evals: Messages contains history up to the current turn. For session-level evals: Messages contains the full conversation.
type EvalDef ¶
type EvalDef struct {
ID string `json:"id" yaml:"id"`
Type string `json:"type" yaml:"type"`
Trigger EvalTrigger `json:"trigger" yaml:"trigger"`
Params map[string]any `json:"params" yaml:"params"`
Description string `json:"description,omitempty" yaml:"description,omitempty"`
Enabled *bool `json:"enabled,omitempty" yaml:"enabled,omitempty"`
SamplePercentage *float64 `json:"sample_percentage,omitempty" yaml:"sample_percentage,omitempty"`
Metric *MetricDef `json:"metric,omitempty" yaml:"metric,omitempty"`
Threshold *Threshold `json:"threshold,omitempty" yaml:"threshold,omitempty"`
Message string `json:"message,omitempty" yaml:"message,omitempty"`
When *EvalWhen `json:"when,omitempty" yaml:"when,omitempty"`
}
EvalDef defines a single evaluation within a PromptPack. Evals are defined at pack level and/or prompt level. Prompt-level evals override pack-level evals by ID.
func ResolveEvals ¶
ResolveEvals merges pack-level and prompt-level eval definitions. Prompt-level evals override pack-level evals when they share the same ID. The returned slice preserves pack ordering first, followed by any prompt-only evals (those with no pack counterpart) in their original order.
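The merge semantics described above can be sketched with a reduced local stand-in for EvalDef (the real function operates on the full struct):

```go
package main

import "fmt"

// evalDef is a local stand-in for evals.EvalDef, trimmed to the fields the
// merge cares about.
type evalDef struct{ ID, Type string }

// resolve sketches the documented merge: pack ordering is preserved, a
// prompt-level def replaces the pack-level def sharing its ID, and
// prompt-only defs are appended afterwards in their original order.
func resolve(pack, prompt []evalDef) []evalDef {
	override := make(map[string]evalDef, len(prompt))
	for _, d := range prompt {
		override[d.ID] = d
	}
	out := make([]evalDef, 0, len(pack)+len(prompt))
	for _, d := range pack {
		if o, ok := override[d.ID]; ok {
			out = append(out, o) // prompt-level def wins by ID
			delete(override, d.ID)
			continue
		}
		out = append(out, d)
	}
	for _, d := range prompt { // prompt-only defs, original order
		if _, ok := override[d.ID]; ok {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	pack := []evalDef{{"a", "contains"}, {"b", "regex"}}
	prompt := []evalDef{{"b", "llm_judge"}, {"c", "contains"}}
	for _, d := range resolve(pack, prompt) {
		fmt.Println(d.ID, d.Type)
	}
	// → a contains / b llm_judge / c contains
}
```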
func (*EvalDef) GetSamplePercentage ¶
func (d *EvalDef) GetSamplePercentage() float64
GetSamplePercentage returns the sampling percentage. Defaults to DefaultSamplePercentage when SamplePercentage is nil.
type EvalResult ¶
type EvalResult struct {
EvalID string `json:"eval_id"`
Type string `json:"type"`
Passed bool `json:"passed"`
Score *float64 `json:"score,omitempty"`
MetricValue *float64 `json:"metric_value,omitempty"`
Explanation string `json:"explanation,omitempty"`
DurationMs int64 `json:"duration_ms"`
Error string `json:"error,omitempty"`
Message string `json:"message,omitempty"`
Details map[string]any `json:"details,omitempty"`
Violations []EvalViolation `json:"violations,omitempty"`
Skipped bool `json:"skipped,omitempty"`
SkipReason string `json:"skip_reason,omitempty"`
SessionID string `json:"session_id,omitempty"`
TurnIndex int `json:"turn_index,omitempty"`
}
EvalResult captures the outcome of a single eval execution.
type EvalRunner ¶
type EvalRunner struct {
// contains filtered or unexported fields
}
EvalRunner executes evals in-process. It is the leaf execution unit used by all dispatch modes (in-proc, event-driven, worker).
func NewEvalRunner ¶
func NewEvalRunner(registry *EvalTypeRegistry, opts ...RunnerOption) *EvalRunner
NewEvalRunner creates an EvalRunner with the given registry and options.
func (*EvalRunner) RunConversationEvals ¶ added in v1.3.2
func (r *EvalRunner) RunConversationEvals(ctx context.Context, defs []EvalDef, evalCtx *EvalContext) []EvalResult
RunConversationEvals runs conversation-level evals (on_conversation_complete trigger). Call this when a multi-turn conversation ends (e.g., Arena self-play completion).
func (*EvalRunner) RunSessionEvals ¶
func (r *EvalRunner) RunSessionEvals(ctx context.Context, defs []EvalDef, evalCtx *EvalContext) []EvalResult
RunSessionEvals runs session-level evals (on_session_complete and sample_sessions triggers). Call this when a session ends.
func (*EvalRunner) RunTurnEvals ¶
func (r *EvalRunner) RunTurnEvals(ctx context.Context, defs []EvalDef, evalCtx *EvalContext) []EvalResult
RunTurnEvals runs turn-level evals (every_turn and sample_turns triggers). It filters by enabled state and trigger, then executes matching handlers.
type EvalTrigger ¶
type EvalTrigger string
EvalTrigger determines when an eval fires.
const (
	// TriggerEveryTurn fires the eval after every assistant turn.
	TriggerEveryTurn EvalTrigger = "every_turn"
	// TriggerOnSessionComplete fires the eval when a session ends.
	TriggerOnSessionComplete EvalTrigger = "on_session_complete"
	// TriggerSampleTurns fires the eval on a percentage of turns (hash-based).
	TriggerSampleTurns EvalTrigger = "sample_turns"
	// TriggerSampleSessions fires the eval on a percentage of sessions (hash-based).
	TriggerSampleSessions EvalTrigger = "sample_sessions"
	// TriggerOnConversationComplete fires the eval when a conversation ends.
	TriggerOnConversationComplete EvalTrigger = "on_conversation_complete"
	// TriggerOnWorkflowStep fires the eval after each workflow step.
	TriggerOnWorkflowStep EvalTrigger = "on_workflow_step"
)
type EvalTypeHandler ¶
type EvalTypeHandler interface {
// Type returns the eval type identifier (e.g. "contains", "regex").
Type() string
// Eval executes the evaluation and returns a result.
// The EvalContext carries messages, tool calls, and metadata.
// Params come from the EvalDef.Params map.
Eval(ctx context.Context, evalCtx *EvalContext, params map[string]any) (*EvalResult, error)
}
EvalTypeHandler defines the interface for eval type implementations. Each handler covers a single eval type (e.g. "contains", "llm_judge"). Handlers are stateless — params are passed per invocation.
type EvalTypeRegistry ¶
type EvalTypeRegistry struct {
// contains filtered or unexported fields
}
EvalTypeRegistry provides thread-safe registration and lookup of EvalTypeHandler implementations by type name.
func NewEmptyEvalTypeRegistry ¶
func NewEmptyEvalTypeRegistry() *EvalTypeRegistry
NewEmptyEvalTypeRegistry creates a registry with no handlers registered. Use this in tests to control exactly which handlers are available.
func NewEvalTypeRegistry ¶
func NewEvalTypeRegistry() *EvalTypeRegistry
NewEvalTypeRegistry creates a registry pre-populated with all built-in eval handlers. Call this in production code. Handlers self-register via RegisterDefaults in the handlers package; import _ "github.com/AltairaLabs/PromptKit/runtime/evals/handlers" or call handlers.RegisterDefaults(r) explicitly.
func (*EvalTypeRegistry) Get ¶
func (r *EvalTypeRegistry) Get(evalType string) (EvalTypeHandler, error)
Get returns the handler for the given type, or an error if not found.
func (*EvalTypeRegistry) Has ¶
func (r *EvalTypeRegistry) Has(evalType string) bool
Has returns true if a handler is registered for the given type.
func (*EvalTypeRegistry) Register ¶
func (r *EvalTypeRegistry) Register(handler EvalTypeHandler)
Register adds a handler to the registry. If a handler with the same type is already registered, it is replaced.
func (*EvalTypeRegistry) Types ¶
func (r *EvalTypeRegistry) Types() []string
Types returns a sorted list of all registered eval type names.
type EvalViolation ¶ added in v1.3.2
type EvalViolation struct {
TurnIndex int `json:"turn_index"`
Description string `json:"description"`
Evidence map[string]any `json:"evidence,omitempty"`
}
EvalViolation represents a single eval violation within a conversation or session.
type EvalWhen ¶ added in v1.3.2
type EvalWhen struct {
ToolCalled string `json:"tool_called,omitempty" yaml:"tool_called,omitempty"`
ToolCalledPattern string `json:"tool_called_pattern,omitempty" yaml:"tool_called_pattern,omitempty"`
AnyToolCalled bool `json:"any_tool_called,omitempty" yaml:"any_tool_called,omitempty"`
MinToolCalls int `json:"min_tool_calls,omitempty" yaml:"min_tool_calls,omitempty"`
}
EvalWhen specifies preconditions that must be met for an eval to run.
type EvalWorker ¶
type EvalWorker struct {
// contains filtered or unexported fields
}
EvalWorker is a reusable worker loop for event-driven eval execution. It subscribes to eval events via EventSubscriber, deserializes payloads, calls EvalRunner, and writes results via ResultWriter. Platforms wire this with their own EventSubscriber and ResultWriter implementations.
func NewEvalWorker ¶
func NewEvalWorker(runner *EvalRunner, subscriber EventSubscriber, resultWriter ResultWriter, opts ...WorkerOption) *EvalWorker
NewEvalWorker creates a worker that processes eval events.
type EventSubscriber ¶
type EventSubscriber interface {
Subscribe(
ctx context.Context,
subject string,
handler func(event []byte) error,
) error
}
EventSubscriber subscribes to eval events from an event bus. PromptKit ships this interface only — platforms provide concrete implementations backed by Redis Streams, NATS, Kafka, etc.
type MetricCollector ¶
type MetricCollector struct {
// contains filtered or unexported fields
}
MetricCollector implements MetricRecorder and provides Prometheus text exposition. It is safe for concurrent use.
func NewMetricCollector ¶
func NewMetricCollector(opts ...MetricCollectorOption) *MetricCollector
NewMetricCollector creates a new MetricCollector with the given options.
func (*MetricCollector) Record ¶
func (mc *MetricCollector) Record(result EvalResult, metric *MetricDef) error
Record records an eval result for the given metric definition. Thread-safe.
func (*MetricCollector) Reset ¶
func (mc *MetricCollector) Reset()
Reset clears all metrics. Primarily for testing.
func (*MetricCollector) WritePrometheus ¶
func (mc *MetricCollector) WritePrometheus(w io.Writer) error
WritePrometheus writes all metrics in Prometheus text exposition format.
type MetricCollectorOption ¶
type MetricCollectorOption func(*MetricCollector)
MetricCollectorOption configures a MetricCollector.
func WithBuckets ¶
func WithBuckets(buckets []float64) MetricCollectorOption
WithBuckets sets custom histogram bucket boundaries.
func WithLabels ¶ added in v1.3.3
func WithLabels(labels map[string]string) MetricCollectorOption
WithLabels sets base labels that are merged into every recorded metric. These are typically platform-level labels (e.g. env, tenant_id, region). Base labels take precedence over MetricDef labels on conflict.
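The precedence rule stated above (base labels win over MetricDef labels on conflict) reduces to a merge like this sketch:

```go
package main

import "fmt"

// mergeLabels sketches the documented precedence: MetricDef labels are laid
// down first, then base labels overwrite them on conflict.
func mergeLabels(base, metric map[string]string) map[string]string {
	out := make(map[string]string, len(base)+len(metric))
	for k, v := range metric {
		out[k] = v
	}
	for k, v := range base { // base labels win on conflict
		out[k] = v
	}
	return out
}

func main() {
	got := mergeLabels(
		map[string]string{"env": "prod"},                    // platform base labels
		map[string]string{"env": "dev", "eval_id": "tone"}, // MetricDef labels
	)
	fmt.Println(got["env"], got["eval_id"]) // base "prod" overrides "dev"
}
```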
func WithNamespace ¶
func WithNamespace(ns string) MetricCollectorOption
WithNamespace sets the metric name prefix (e.g. "promptpack").
type MetricDef ¶
type MetricDef struct {
Name string `json:"name" yaml:"name"`
Type MetricType `json:"type" yaml:"type"`
Range *Range `json:"range,omitempty" yaml:"range,omitempty"`
Labels map[string]string `json:"labels,omitempty" yaml:"labels,omitempty"`
// Extra holds additional properties beyond the defined schema fields.
// This supports the RFC's additionalProperties: true on metric.
Extra map[string]any `json:"-" yaml:"-"`
}
MetricDef defines a Prometheus-style metric associated with an eval. The Extra field captures additionalProperties from the schema.
func (MetricDef) MarshalJSON ¶
func (m MetricDef) MarshalJSON() ([]byte, error)
MarshalJSON implements custom JSON marshaling to include Extra fields as top-level properties alongside the known fields.
func (*MetricDef) UnmarshalJSON ¶
func (m *MetricDef) UnmarshalJSON(data []byte) error
UnmarshalJSON implements custom JSON unmarshaling to capture additional properties into the Extra field.
type MetricRecorder ¶
type MetricRecorder interface {
Record(result EvalResult, metric *MetricDef) error
}
MetricRecorder records eval results as metrics. This interface is implemented by MetricCollector and injected into MetricResultWriter to avoid circular dependencies.
type MetricResultWriter ¶
type MetricResultWriter struct {
// contains filtered or unexported fields
}
MetricResultWriter feeds eval results to a MetricRecorder for Prometheus exposition. Only results whose corresponding EvalDef has a Metric definition are recorded.
func NewMetricResultWriter ¶
func NewMetricResultWriter(recorder MetricRecorder, defs []EvalDef) *MetricResultWriter
NewMetricResultWriter creates a writer that records metrics. The defs slice provides the metric definitions keyed by eval ID.
func (*MetricResultWriter) WriteResults ¶
func (w *MetricResultWriter) WriteResults(_ context.Context, results []EvalResult) error
WriteResults records each result that has an associated metric. If the result carries SessionID or TurnIndex, they are injected as additional labels on a per-call copy of the MetricDef so each time series includes its execution context.
type MetricType ¶
type MetricType string
MetricType defines the Prometheus metric type for eval results.
const (
	// MetricGauge represents a gauge metric (set to a value).
	MetricGauge MetricType = "gauge"
	// MetricCounter represents a counter metric (increment only).
	MetricCounter MetricType = "counter"
	// MetricHistogram represents a histogram metric (observe values).
	MetricHistogram MetricType = "histogram"
	// MetricBoolean represents a boolean metric (0 or 1).
	MetricBoolean MetricType = "boolean"
)
type Range ¶
type Range struct {
Min *float64 `json:"min,omitempty" yaml:"min,omitempty"`
Max *float64 `json:"max,omitempty" yaml:"max,omitempty"`
}
Range defines the valid range for a metric value.
type ResultWriter ¶
type ResultWriter interface {
WriteResults(ctx context.Context, results []EvalResult) error
}
ResultWriter controls WHERE eval results go. Implementations may write to Prometheus metrics, message metadata, telemetry spans, databases, or external APIs. Platform-specific writers are implemented outside PromptKit.
type RunnerOption ¶
type RunnerOption func(*EvalRunner)
RunnerOption configures an EvalRunner.
func WithTimeout ¶
func WithTimeout(d time.Duration) RunnerOption
WithTimeout sets the per-eval execution timeout.
type Threshold ¶ added in v1.3.2
type Threshold struct {
Passed *bool `json:"passed,omitempty" yaml:"passed,omitempty"`
MinScore *float64 `json:"min_score,omitempty" yaml:"min_score,omitempty"`
MaxScore *float64 `json:"max_score,omitempty" yaml:"max_score,omitempty"`
}
Threshold defines pass/fail criteria for an eval result.
func (*Threshold) Apply ¶ added in v1.3.2
func (t *Threshold) Apply(result *EvalResult)
Apply adjusts the EvalResult based on threshold criteria.
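The docs do not spell out Apply's exact rules, so the sketch below shows only one plausible rule, the MinScore case, with local stand-in types; the real method also honors Passed and MaxScore:

```go
package main

import "fmt"

// Local stand-ins for evals.EvalResult and evals.Threshold, trimmed to the
// fields this sketch uses.
type evalResult struct {
	Passed bool
	Score  *float64
}

type threshold struct {
	MinScore *float64
}

// apply marks the result failed when a MinScore is set and the result
// carries a score below it. This is an assumed reading of the semantics,
// not the package's documented behavior.
func (t *threshold) apply(r *evalResult) {
	if t.MinScore != nil && r.Score != nil && *r.Score < *t.MinScore {
		r.Passed = false
	}
}

func main() {
	min, score := 0.8, 0.6
	r := &evalResult{Passed: true, Score: &score}
	(&threshold{MinScore: &min}).apply(r)
	fmt.Println(r.Passed) // score 0.6 below min 0.8, so the result fails
}
```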
type ToolCallRecord ¶
type ToolCallRecord = types.ToolCallRecord
ToolCallRecord is an alias for types.ToolCallRecord so existing code referencing evals.ToolCallRecord continues to compile unchanged.
type TriggerContext ¶
type TriggerContext struct {
// SessionID identifies the current session (used for sampling).
SessionID string
// TurnIndex is the current turn number (used for sampling).
TurnIndex int
// IsSessionComplete indicates whether the session has ended.
IsSessionComplete bool
}
TriggerContext provides context for trigger evaluation decisions.
type WorkerOption ¶
type WorkerOption func(*EvalWorker)
WorkerOption configures an EvalWorker.
func WithLogger ¶
func WithLogger(l Logger) WorkerOption
WithLogger sets a custom logger for the worker.