validator

package
v0.15.0
Published: Jun 15, 2026 License: Apache-2.0 Imports: 24 Imported by: 0

Documentation

Overview

Package validator evaluates a recipe's constraints and validation checks against a cluster snapshot and the live cluster.

The validator runs in two phases:

  1. Readiness pre-flight: top-level constraint expressions are evaluated against the snapshot inline (no cluster access required). A malformed expression fails closed so misconfigured rules cannot masquerade as passing.

  2. In-cluster checks: each declared check is materialized as a short-lived Kubernetes Job. RBAC (ServiceAccount + a per-run ClusterRoleBinding to the built-in cluster-admin ClusterRole) is provisioned via server-side apply under the "aicr" field manager so concurrent validators converge on a single owner. Job logs and exit codes are aggregated into a CTRF-formatted report.
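The fail-closed rule in phase 1 can be illustrated with a minimal sketch. The expression syntax ("<key> >= <int>"), the evalMin and passes helpers, and the map-based snapshot are all hypothetical stand-ins chosen for illustration; the package's actual constraint language and snapshot type are not shown in this document.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// evalMin is a hypothetical evaluator for expressions of the form
// "<key> >= <int>". It is not the package's real constraint syntax.
func evalMin(expr string, snapshot map[string]int) (bool, error) {
	parts := strings.Fields(expr)
	if len(parts) != 3 || parts[1] != ">=" {
		return false, fmt.Errorf("malformed expression %q", expr)
	}
	want, err := strconv.Atoi(parts[2])
	if err != nil {
		return false, fmt.Errorf("malformed threshold in %q: %w", expr, err)
	}
	got, ok := snapshot[parts[0]]
	if !ok {
		return false, fmt.Errorf("key %q not in snapshot", parts[0])
	}
	return got >= want, nil
}

// passes fails closed: any evaluation error counts as a failed constraint,
// so a misconfigured rule can never be reported as passing.
func passes(expr string, snapshot map[string]int) bool {
	ok, err := evalMin(expr, snapshot)
	return err == nil && ok
}

func main() {
	snap := map[string]int{"gpu.count": 8}
	fmt.Println(passes("gpu.count >= 8", snap)) // well-formed and satisfied
	fmt.Println(passes("gpu.count >> 8", snap)) // malformed operator: fails
}
```

The point of the wrapper is that the error path and the "constraint not met" path collapse to the same result, which is what keeps a typo in a rule from silently passing.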

Test isolation is mandatory: callers operating without cluster credentials must pass WithNoCluster(true) (or --no-cluster on the CLI). In that mode constraint evaluation still runs, but RBAC creation and Job deployment are skipped and each check is reported as "skipped - no-cluster mode".

Subpackages:

  • catalog: built-in validation check catalog
  • ctrf: Common Test Report Format emitter
  • job: Job lifecycle, RBAC, and result extraction
  • labels: standard label/annotation set applied to validator resources
  • v1: Validation YAML schema and decoders

Shared constraint extraction helpers live in the top-level pkg/constraints package, not under validator.

Package validator provides a container-per-validator execution engine for AICR cluster validation. Each validator is an OCI container image run as a Kubernetes Job, communicating results via exit codes and termination messages.

Constants

const (
	PhaseDeployment  = v1.PhaseDeployment
	PhasePerformance = v1.PhasePerformance
	PhaseConformance = v1.PhaseConformance
)

Re-exported phase constants from pkg/validator/v1.

const PhaseAll = "all"

PhaseAll is the wildcard string accepted by both the `aicr validate --phase` CLI flag and the spec.validate.execution.phases config field to mean "run every phase." It is not a Phase value — the CLI parser collapses it into a nil selection that ValidatePhases interprets as "run all phases."

Variables

PhaseNames is the canonical user-facing vocabulary accepted by the --phase flag and spec.validate.execution.phases: the typed Phase constants in PhaseOrder plus the PhaseAll wildcard. It is the single source of truth, so the CLI parser and the config-load validator stay in sync when a phase is added or removed.

PhaseOrder defines the execution order for ValidatePhases. All phases run by default; set Validator.FailFast to stop after the first phase that reports StatusFailed.

Order rationale: deployment (cheap install/health checks) → conformance → performance. Performance runs LAST because its inference-perf benchmark saturates every GPU on the node and the DynamoGraphDeployment teardown releases those DRA ResourceClaims asynchronously; running it before conformance starved conformance's GPU-needing checks (e.g. dra-support, whose 1-GPU test pod failed "cannot allocate all claims" on single-node clusters). Running performance last also keeps a flaky perf phase from blocking conformance under FailFast.

Note: the readiness phase is not included in PhaseOrder. It remains in pkg/validator and uses inline constraint evaluation (no containers).

Functions

func EnsureDataConfigMaps added in v0.14.0

func EnsureDataConfigMaps(
	ctx context.Context,
	clientset kubernetes.Interface,
	namespace string,
	runID string,
	snap *snapshotter.Snapshot,
	validationInput *v1.ValidationInput,
) error

EnsureDataConfigMaps creates the snapshot and validation ConfigMaps, named aicr-snapshot-{runID} and aicr-validation-{runID}, or updates them if they already exist. External controllers should call it after generating a runID and before rendering validator Jobs. The Jobs mount these ConfigMaps at /data/snapshot and /data/validation.

func WarnPhasesAgainstRecipe added in v0.14.0

func WarnPhasesAgainstRecipe(phases []Phase, rec *recipe.RecipeResult)

WarnPhasesAgainstRecipe warns when a requested phase has no checks defined in the recipe. The phase will still run but produce 0 tests in the CTRF report. This is purely advisory — it emits slog warnings and never fails the run.

Types

type Option

type Option func(*Validator)

Option is a functional option for configuring Validator instances.

func WithCleanup

func WithCleanup(cleanup bool) Option

WithCleanup controls whether to delete Jobs, ConfigMaps, and RBAC after validation. Default: true.

func WithCommit added in v0.12.0

func WithCommit(commit string) Option

WithCommit sets the git commit SHA (typically the CLI build commit). Used for resolving dev-build validator images to SHA-tagged images.

func WithDataProvider added in v0.14.0

func WithDataProvider(dp recipe.DataProvider) Option

WithDataProvider binds the recipe DataProvider used to load the validator catalog. When unset, the catalog loads from the package-global provider.

func WithFailFast added in v0.15.0

func WithFailFast(failFast bool) Option

WithFailFast controls whether ValidatePhases stops after the first phase that reports StatusFailed. Default: false (all phases run regardless of earlier failures). Set true to restore the historical fail-fast behavior.

func WithImagePullSecrets

func WithImagePullSecrets(secrets []string) Option

WithImagePullSecrets sets image pull secrets for validator Jobs.

func WithImageRegistryOverride added in v0.14.0

func WithImageRegistryOverride(override string) Option

WithImageRegistryOverride sets the image registry prefix override for validator container images. When non-empty, replaces the default registry (e.g., ghcr.io/nvidia) with the specified prefix (e.g., localhost:5001). Forwarded to validator Jobs via the AICR_VALIDATOR_IMAGE_REGISTRY env var.

func WithImageTagOverride added in v0.14.0

func WithImageTagOverride(override string) Option

WithImageTagOverride sets the image tag override for validator container images. When non-empty, overrides the resolved tag on every validator image. Intended for feature-branch dev builds whose commit SHA has no published image. Forwarded to validator Jobs via the AICR_VALIDATOR_IMAGE_TAG env var.
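As a rough illustration of how the registry and tag overrides could combine, the sketch below rewrites an image reference using only string operations. The rewriteImage helper and the image name aicr-validators/deployment are hypothetical; only the ghcr.io/nvidia and localhost:5001 prefixes come from the option docs above. The package's actual resolution logic may differ, and digest references (name@sha256:...) are not handled here.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultRegistry is the default prefix named in the option docs.
const defaultRegistry = "ghcr.io/nvidia"

// rewriteImage is a hypothetical helper, not part of the package API.
// An empty override leaves the corresponding part of the reference untouched.
func rewriteImage(image, registryOverride, tagOverride string) string {
	if registryOverride != "" && strings.HasPrefix(image, defaultRegistry+"/") {
		image = registryOverride + strings.TrimPrefix(image, defaultRegistry)
	}
	if tagOverride != "" {
		// A tag separator is a colon after the last slash, so a registry
		// port such as localhost:5001 is not mistaken for a tag.
		lastSlash := strings.LastIndex(image, "/")
		if i := strings.LastIndex(image, ":"); i > lastSlash {
			image = image[:i]
		}
		image += ":" + tagOverride
	}
	return image
}

func main() {
	img := "ghcr.io/nvidia/aicr-validators/deployment:v0.15.0" // hypothetical name
	fmt.Println(rewriteImage(img, "localhost:5001", ""))
	fmt.Println(rewriteImage(img, "localhost:5001", "latest"))
}
```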

func WithNamespace

func WithNamespace(namespace string) Option

WithNamespace sets the Kubernetes namespace for validation Jobs. Default: "aicr-validation".

func WithNoCluster

func WithNoCluster(noCluster bool) Option

WithNoCluster controls cluster access. When true, all validators are reported as skipped and no K8s API calls are made. Default: false.

func WithNodeSelector added in v0.12.0

func WithNodeSelector(nodeSelector map[string]string) Option

WithNodeSelector sets node selector labels to override inner workload scheduling. When set, validators pass these selectors to the workloads they create (e.g., NCCL benchmark pods), replacing platform-specific defaults. Does not affect the orchestrator Job.

func WithRunID

func WithRunID(runID string) Option

WithRunID sets the RunID for this validation run. Used when resuming a previous run.

func WithTolerations added in v0.8.2

func WithTolerations(tolerations []corev1.Toleration) Option

WithTolerations sets tolerations to override inner workload scheduling. When set, validators pass these tolerations to the workloads they create (e.g., NCCL benchmark pods), replacing the default tolerate-all policy. Does not affect the orchestrator Job.

func WithVersion

func WithVersion(version string) Option

WithVersion sets the validator version string (typically the CLI version).

type Phase added in v0.9.0

type Phase = v1.Phase

Phase re-exports pkg/validator/v1.Phase so callers that work in the pkg/validator orchestration layer do not have to import the wire package directly.

func ParsePhase added in v0.13.0

func ParsePhase(s string) (Phase, bool)

ParsePhase converts a user-facing phase name to its typed Phase value. Returns false for PhaseAll (the wildcard, which has no Phase value) and for unrecognized inputs. Callers that want to accept the wildcard handle it separately, typically by collapsing the whole selection to nil (= run every phase).

func ParsePhaseSelection added in v0.14.0

func ParsePhaseSelection(phaseStrs []string) ([]Phase, error)

ParsePhaseSelection parses a list of user-facing phase names (from the `--phase` CLI flag or the spec.validate.execution.phases config field) into typed Phase values. The PhaseAll wildcard collapses the whole selection to nil (= run every phase), matching the documented "Default: all phases" behavior. PhaseAll is exclusive: combining it with any specific phase is a hard error rather than silently treating the selection as wildcard, so a typo like `--phase deployment --phase all` does not mask the user's mistake.

Every entry is parsed before the wildcard collapse, so an invalid phase name surfaces an error even when "all" is also present.
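The wildcard rules described for ParsePhase and ParsePhaseSelection can be sketched as a self-contained reimplementation. This is not the package's source: the lowercase names are local stand-ins, the string values "conformance" and "performance" are assumed by analogy with "deployment" and "all" (the only two spellings this document shows), and the error messages are invented.

```go
package main

import (
	"errors"
	"fmt"
)

type phase string

const (
	phaseDeployment  phase = "deployment"
	phaseConformance phase = "conformance" // assumed spelling
	phasePerformance phase = "performance" // assumed spelling
	phaseAll               = "all"         // wildcard, deliberately not a phase
)

// parsePhase reports false for the wildcard and for unknown names.
func parsePhase(s string) (phase, bool) {
	switch p := phase(s); p {
	case phaseDeployment, phaseConformance, phasePerformance:
		return p, true
	}
	return "", false
}

// parsePhaseSelection validates every entry before collapsing the wildcard,
// so an invalid name is reported even when "all" is also present.
func parsePhaseSelection(names []string) ([]phase, error) {
	var (
		out    []phase
		sawAll bool
	)
	for _, n := range names {
		if n == phaseAll {
			sawAll = true
			continue
		}
		p, ok := parsePhase(n)
		if !ok {
			return nil, fmt.Errorf("unknown phase %q", n)
		}
		out = append(out, p)
	}
	if sawAll && len(out) > 0 {
		return nil, errors.New(`"all" cannot be combined with specific phases`)
	}
	if sawAll {
		return nil, nil // nil selection means "run every phase"
	}
	return out, nil
}

func main() {
	for _, sel := range [][]string{{"all"}, {"deployment", "all"}, {"all", "bogus"}} {
		got, err := parsePhaseSelection(sel)
		fmt.Printf("%v -> %v, err=%v\n", sel, got, err)
	}
}
```

Keeping the wildcard out of the typed phase set means downstream code that ranges over phases never has to special-case "all".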

type PhaseResult

type PhaseResult struct {
	// Phase is the phase that was executed.
	Phase Phase

	// Status is the overall phase status derived from the CTRF summary.
	Status string

	// Report is the CTRF report for this phase.
	Report *ctrf.Report

	// Duration is the wall-clock time for the entire phase.
	Duration time.Duration
}

PhaseResult is the outcome of running all validators in a single phase.

type Validator

type Validator struct {
	// Version is the validator version (typically the CLI version).
	Version string

	// Commit is the git commit SHA from the CLI build. Used to resolve
	// dev-build validator images to SHA-tagged images pushed by on-push CI.
	Commit string

	// Namespace is the Kubernetes namespace for validation Jobs.
	Namespace string

	// RunID is a unique identifier for this validation run.
	RunID string

	// Cleanup controls whether to delete Jobs, ConfigMaps, and RBAC after validation.
	Cleanup bool

	// ImagePullSecrets are secret names for pulling validator images.
	ImagePullSecrets []string

	// NoCluster controls whether to skip cluster operations (dry-run mode).
	NoCluster bool

	// Tolerations are passed to validation workloads (e.g., NCCL benchmark pods)
	// to override their default scheduling constraints. Does not affect the
	// orchestrator Job itself.
	Tolerations []corev1.Toleration

	// NodeSelector is passed to validation workloads (e.g., NCCL benchmark pods)
	// to override platform-specific node selectors. Use when GPU nodes have
	// non-standard labels. Does not affect the orchestrator Job itself.
	NodeSelector map[string]string

	// ImageRegistryOverride, when non-empty, replaces the registry prefix
	// of all validator container images. Forwarded to the validator Job's
	// container env as AICR_VALIDATOR_IMAGE_REGISTRY so inner workloads
	// (e.g., AIPerf benchmark images) resolve from the same registry.
	ImageRegistryOverride string

	// ImageTagOverride, when non-empty, overrides the resolved image tag
	// of all validator container images. Forwarded to the validator Job's
	// container env as AICR_VALIDATOR_IMAGE_TAG. Intended for feature-branch
	// dev builds whose commit SHA has no published image; typical value: "latest".
	ImageTagOverride string

	// FailFast, when true, stops validation after the first phase that reports
	// StatusFailed. By default (false) all phases run and produce results.
	FailFast bool
	// contains filtered or unexported fields
}

Validator orchestrates validation runs using containerized validators.

func New

func New(opts ...Option) *Validator

New creates a new Validator with the provided options.
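The functional-option pattern behind New and the With* functions can be sketched without importing the package. The config type and lowercase helpers below are hypothetical stand-ins for Validator and its options. The defaults shown (namespace "aicr-validation", cleanup enabled, no-cluster and fail-fast disabled) are the ones stated in the option docs; applying them before the options run is an assumption about how New behaves.

```go
package main

import "fmt"

// config is a hypothetical stand-in for Validator; only four fields are mirrored.
type config struct {
	Namespace string
	Cleanup   bool
	NoCluster bool
	FailFast  bool
}

// option has the same shape as Option: a function that mutates the target.
type option func(*config)

func withNamespace(ns string) option { return func(c *config) { c.Namespace = ns } }
func withNoCluster(b bool) option    { return func(c *config) { c.NoCluster = b } }
func withFailFast(b bool) option     { return func(c *config) { c.FailFast = b } }

// newConfig sets the documented defaults first, then applies each option in
// order, so a later option overrides an earlier one.
func newConfig(opts ...option) *config {
	c := &config{Namespace: "aicr-validation", Cleanup: true}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	c := newConfig(withNoCluster(true), withFailFast(true))
	fmt.Printf("%+v\n", *c)
}
```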

func (*Validator) ValidatePhase

func (v *Validator) ValidatePhase(
	ctx context.Context,
	phase Phase,
	validationInput *v1.ValidationInput,
	snap *snapshotter.Snapshot,
) (*PhaseResult, error)

ValidatePhase runs a single validation phase.

func (*Validator) ValidatePhases

func (v *Validator) ValidatePhases(
	ctx context.Context,
	phases []Phase,
	validationInput *v1.ValidationInput,
	snap *snapshotter.Snapshot,
) ([]*PhaseResult, error)

ValidatePhases runs the specified phases sequentially and returns one PhaseResult per phase. Pass nil or empty phases to run all phases. By default all phases run and produce results regardless of failures. Set FailFast to stop after the first phase that reports StatusFailed.
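The difference between the default and FailFast behavior can be sketched as a plain loop. Everything here is a local stand-in: phaseResult is a reduced form of PhaseResult, the "failed" and "passed" status strings are assumed values, and including the failing phase's result in the returned slice is an assumption rather than documented behavior.

```go
package main

import "fmt"

// phaseResult is a reduced, hypothetical form of PhaseResult.
type phaseResult struct {
	Phase  string
	Status string
}

// statusFailed is an assumed string value for StatusFailed.
const statusFailed = "failed"

// runPhases runs each phase in order and collects one result per executed
// phase. With failFast set, it stops after the first failed phase.
func runPhases(phases []string, failFast bool, run func(string) phaseResult) []phaseResult {
	results := make([]phaseResult, 0, len(phases))
	for _, p := range phases {
		r := run(p)
		results = append(results, r)
		if failFast && r.Status == statusFailed {
			break
		}
	}
	return results
}

func main() {
	order := []string{"deployment", "conformance", "performance"}
	// Simulated runner: the conformance phase fails, the others pass.
	run := func(p string) phaseResult {
		if p == "conformance" {
			return phaseResult{Phase: p, Status: statusFailed}
		}
		return phaseResult{Phase: p, Status: "passed"}
	}
	fmt.Println(len(runPhases(order, false, run))) // all three phases run
	fmt.Println(len(runPhases(order, true, run)))  // stops after conformance
}
```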

Directories

Path	Synopsis
catalog	Package catalog provides the declarative validator catalog.
ctrf	Package ctrf provides Go types and utilities for the Common Test Report Format (CTRF).
labels	Package labels provides shared label constants for validation resources.
v1	Package v1 defines AICR's validator input format (v1alpha1).
