Documentation ¶
Overview ¶
Package v1 defines AICR's validator input format (v1alpha1).
Stability ¶
v1alpha1 is unstable and may have breaking changes before v1. Breaking changes at v1+ will require major version bumps (v2.0.0).
API Group ¶
validator.nvidia.com is a non-binding example. AICR ships no CRDs - external projects should use their own API groups.
Usage ¶
This package defines ValidationInput, the input format for AICR's validator plugins. It carries both validation configuration (phases, checks) and recipe context (ComponentRefs, Criteria, Constraints).
ValidationInput supports two usage patterns:
1. Standalone validation.yaml files (with apiVersion/kind/metadata)
2. Embedded in custom resources (metadata fields omitted via omitempty)
For external controllers that want to embed validation configuration, embed ValidationConfig directly (not ValidationInput) to avoid nested spec fields:
type MySpec struct {
	Validation ValidationConfig `json:"validation"`
}
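As a rough illustration of the embedding pattern above, the sketch below marshals a hypothetical MySpec to JSON. ValidationPhase and ValidationConfig here are trimmed local mirrors of the package types (an assumption made so the example compiles on its own), not the real definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ValidationPhase and ValidationConfig are trimmed local mirrors of the
// package types, reduced to the fields this example needs.
type ValidationPhase struct {
	Timeout string   `json:"timeout,omitempty"`
	Checks  []string `json:"checks,omitempty"`
}

type ValidationConfig struct {
	Readiness  *ValidationPhase `json:"readiness,omitempty"`
	Deployment *ValidationPhase `json:"deployment,omitempty"`
}

// MySpec is a hypothetical CR spec that embeds ValidationConfig directly.
type MySpec struct {
	Validation ValidationConfig `json:"validation"`
}

// marshalSpec returns the JSON encoding of spec as a string.
func marshalSpec(spec MySpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	spec := MySpec{Validation: ValidationConfig{
		Readiness: &ValidationPhase{Timeout: "10m"},
	}}
	out, err := marshalSpec(spec)
	if err != nil {
		panic(err)
	}
	// Unset phases are dropped by omitempty, so only readiness appears.
	fmt.Println(out)
}
```

Because the phases sit directly under the validation key, the CR does not gain an extra config or spec level.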
Index ¶
- Constants
- func BuildOrchestratorAffinity(deps []DependencyAffinity, componentRefs []recipe.ComponentRef) (*corev1.Affinity, error)
- func GenerateRunID() string
- func ImagePullPolicy(image string, imageTagOverride string) corev1.PullPolicy
- func RenderPlan(plan JobPlan) *batchv1.Job
- func RenderPlanToApplyConfig(plan JobPlan, jobName string) *applybatchv1.JobApplyConfiguration
- func ValidateDependencyAffinity(deps []DependencyAffinity, componentRefs []recipe.ComponentRef) error
- type CatalogMetadata
- type DependencyAffinity
- type DependencyRequirement
- type EnvVar
- type JobPlan
- type NodeSelection
- type Phase
- type ResourceRequirements
- type ValidationConfig
- type ValidationInput
- type ValidationMetadata
- type ValidationPhase
- type ValidatorCatalog
- type ValidatorEntry
Constants ¶
const (
	// CatalogAPIVersion is the supported catalog API version.
	CatalogAPIVersion = "validator.nvidia.com/v1alpha1"
	// CatalogKind is the supported catalog kind.
	CatalogKind = "ValidatorCatalog"
)
const (
	// KindValidationInput is the Kubernetes kind for ValidationInput resources.
	KindValidationInput = "ValidationInput"
)
Variables ¶
This section is empty.
Functions ¶
func BuildOrchestratorAffinity ¶
func BuildOrchestratorAffinity(
	deps []DependencyAffinity,
	componentRefs []recipe.ComponentRef,
) (*corev1.Affinity, error)
BuildOrchestratorAffinity composes the orchestrator pod's full affinity from the validator's declared dependencies and the resolved recipe's component refs. The result always includes the default prefer-CPU NodeAffinity; each resolvable dependency adds one PodAffinity term.
Resolution rules (per https://github.com/NVIDIA/aicr/issues/933):
- A "required" dependency whose componentRef is missing from componentRefs returns ErrCodeInvalidRequest. The caller should treat this as a recipe misconfiguration and fail the run before deploying any Job.
- A "preferred" dependency whose componentRef is missing is logged at slog.Warn and produces no PodAffinity term. The orchestrator schedules wherever the scheduler picks; this preserves backward-compatible behavior on flat networks where the dependency may not be present.
- Components whose Namespace is empty after recipe resolution are treated as missing (a dependency without a known namespace cannot produce a well-formed PodAffinityTerm).
Note: required only verifies the component is present in the resolved recipe (with IsEnabled() and a non-empty namespace). It does NOT check runtime readiness — if the dependency pods have not yet started or are crashlooping, the orchestrator pod will stay Pending until the Job's activeDeadlineSeconds fires. Operators triaging a hung run should inspect both the Job's pod PodScheduled condition and the dependency component's replica status.
For pre-flight gates that only need to check resolvability, use ValidateDependencyAffinity to avoid allocating the full affinity tree.
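The resolution rules above can be sketched with plain Go types. This is a simplified illustration, not the package's implementation: componentRef, dependency, errInvalidRequest, and resolveNamespaces are hypothetical stand-ins, and the Enabled field stands in for the IsEnabled() check.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// componentRef and dependency are simplified stand-ins for
// recipe.ComponentRef and DependencyAffinity; field names are assumptions.
type componentRef struct {
	Name      string
	Namespace string
	Enabled   bool
}

type dependency struct {
	ComponentRef string
	Required     bool
}

// errInvalidRequest stands in for the ErrCodeInvalidRequest error class.
var errInvalidRequest = errors.New("invalid request")

// resolveNamespaces maps each resolvable dependency to its namespace.
// A missing required dependency is an error; a missing preferred one is
// logged and skipped, so it contributes no affinity term.
func resolveNamespaces(deps []dependency, refs []componentRef) (map[string]string, error) {
	byName := make(map[string]componentRef, len(refs))
	for _, r := range refs {
		byName[r.Name] = r
	}
	resolved := make(map[string]string)
	for _, d := range deps {
		ref, ok := byName[d.ComponentRef]
		// Disabled components and empty namespaces count as missing.
		if ok && ref.Enabled && ref.Namespace != "" {
			resolved[d.ComponentRef] = ref.Namespace
			continue
		}
		if d.Required {
			return nil, fmt.Errorf("%w: required component %q is not resolvable",
				errInvalidRequest, d.ComponentRef)
		}
		slog.Warn("preferred dependency not resolvable; skipping affinity term",
			"componentRef", d.ComponentRef)
	}
	return resolved, nil
}

func main() {
	refs := []componentRef{{Name: "prometheus", Namespace: "monitoring", Enabled: true}}
	deps := []dependency{
		{ComponentRef: "prometheus", Required: true},
		{ComponentRef: "example-cache"}, // preferred and absent: warned, skipped
	}
	resolved, err := resolveNamespaces(deps, refs)
	fmt.Println(resolved, err)
}
```

Note that, as in the runtime-readiness note above, nothing in this check looks at whether the dependency's pods are actually running.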
func GenerateRunID ¶
func GenerateRunID() string
GenerateRunID creates a unique run identifier for validation sessions. Format: {timestamp}-{random-hex} (e.g., "20260514-123045-abc123def456"). External controllers should use this to generate runIDs before creating ConfigMaps and rendering Jobs.
Panics if the system's random number generator fails. Entropy failures are exceptional and we prefer to fail fast rather than generate predictable IDs that could collide across concurrent runs.
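A minimal sketch of an ID generator with the same shape, assuming a UTC timestamp and six random bytes (twelve hex characters, matching the example above). The name generateRunID and both of those choices are assumptions; the package's actual implementation may differ.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// generateRunID returns an ID shaped like "20260514-123045-abc123def456".
// It panics when the random source fails, rather than falling back to a
// predictable value.
func generateRunID() string {
	buf := make([]byte, 6) // 6 random bytes encode to 12 hex characters
	if _, err := rand.Read(buf); err != nil {
		panic(fmt.Sprintf("run ID entropy failure: %v", err))
	}
	// Assumption: the timestamp is taken in UTC.
	stamp := time.Now().UTC().Format("20060102-150405")
	return stamp + "-" + hex.EncodeToString(buf)
}

func main() {
	fmt.Println(generateRunID())
}
```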
func ImagePullPolicy ¶
func ImagePullPolicy(image string, imageTagOverride string) corev1.PullPolicy
ImagePullPolicy determines the pull policy for a container image. Returns Never for local side-loaded images (ko.local, kind.local), Always for :latest tag or when imageTagOverride is set, IfNotPresent for digest-pinned or versioned tags.
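The decision order described above might look like the following sketch. It returns plain strings instead of corev1.PullPolicy so that it compiles without Kubernetes dependencies; the function name pullPolicy, the prefix and suffix matching, and the precedence of the local-image check over the tag override are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// pullPolicy picks a pull policy name for image.
// Assumption: side-loaded local images win over a tag override.
func pullPolicy(image, imageTagOverride string) string {
	// Side-loaded images exist only on the node, so never pull them.
	if strings.HasPrefix(image, "ko.local") || strings.HasPrefix(image, "kind.local") {
		return "Never"
	}
	// Mutable references must be refreshed on every run.
	if imageTagOverride != "" || strings.HasSuffix(image, ":latest") {
		return "Always"
	}
	// Digest-pinned and versioned tags are treated as immutable.
	return "IfNotPresent"
}

func main() {
	for _, img := range []string{
		"ko.local/validator:dev",
		"registry.example.com/validator:latest",
		"registry.example.com/validator:v1.2.3",
	} {
		fmt.Println(img, "->", pullPolicy(img, ""))
	}
}
```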
func RenderPlan ¶
func RenderPlan(plan JobPlan) *batchv1.Job
RenderPlan renders a complete Kubernetes Job from a JobPlan. The returned Job spec matches exactly what the current deployer produces.
func RenderPlanToApplyConfig ¶
func RenderPlanToApplyConfig(plan JobPlan, jobName string) *applybatchv1.JobApplyConfiguration
RenderPlanToApplyConfig renders a Kubernetes Job ApplyConfiguration from a JobPlan. This is used for server-side apply deployment strategy. External controllers can use this to get field ownership tracking and idempotent apply semantics.
The jobName parameter must be provided by the caller (unlike RenderPlan which uses plan.JobName). This allows controllers to use deterministic names for idempotent re-runs.
func ValidateDependencyAffinity ¶
func ValidateDependencyAffinity(
	deps []DependencyAffinity,
	componentRefs []recipe.ComponentRef,
) error
ValidateDependencyAffinity verifies that all dependencies resolve against componentRefs without constructing the affinity tree. Returns the same error class as BuildOrchestratorAffinity (ErrCodeInvalidRequest on any malformed entry or any missing required component); suppresses the slog.Warn that BuildOrchestratorAffinity emits for missing preferred dependencies so pre-flight gates don't duplicate the build-time warning.
Note: only checks recipe membership (componentRef present, enabled, with a resolved namespace). Does NOT verify the dependency's pods are actually running — see the runtime-readiness note on BuildOrchestratorAffinity.
Types ¶
type CatalogMetadata ¶
type CatalogMetadata struct {
	Name    string `json:"name" yaml:"name"`
	Version string `json:"version" yaml:"version"` // SemVer
}
CatalogMetadata contains catalog-level metadata.
type DependencyAffinity ¶
type DependencyAffinity struct {
	// ComponentRef is the name of a recipe component whose pod the orchestrator
	// should co-locate with. The deployer resolves this to a namespace at spawn
	// time using the resolved recipe's componentRefs.
	ComponentRef string `json:"componentRef" yaml:"componentRef"`
	// PodLabelSelector matches the dependency pod's labels (e.g.,
	// {"app.kubernetes.io/name": "prometheus"}). All key/value pairs must match.
	PodLabelSelector map[string]string `json:"podLabelSelector" yaml:"podLabelSelector"`
	// Requirement controls strength. "required" fails pre-flight when the
	// referenced component is absent from the resolved recipe; "preferred"
	// (default) is a high-weight scheduling hint.
	Requirement DependencyRequirement `json:"requirement,omitempty" yaml:"requirement,omitempty"`
	// TopologyKey is the node label whose value defines co-location.
	// Defaults to kubernetes.io/hostname (same node) when empty.
	TopologyKey string `json:"topologyKey,omitempty" yaml:"topologyKey,omitempty"`
}
DependencyAffinity declares a co-location preference for a validator's orchestrator pod with another component's pod.
func (DependencyAffinity) RequirementOrDefault ¶
func (d DependencyAffinity) RequirementOrDefault() DependencyRequirement
RequirementOrDefault returns the requirement strength, defaulting to "preferred" when unset.
func (DependencyAffinity) TopologyKeyOrDefault ¶
func (d DependencyAffinity) TopologyKeyOrDefault() string
TopologyKeyOrDefault returns the topology key, defaulting to kubernetes.io/hostname when unset.
func (DependencyAffinity) Validate ¶
func (d DependencyAffinity) Validate() error
Validate checks that ComponentRef and PodLabelSelector are non-empty and that Requirement is either empty (defaults to preferred), "preferred", or "required".
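The defaulting and validation behavior described for these three methods can be sketched on a local mirror of the type. The lowercase names and the error messages are hypothetical; only the rules themselves come from the descriptions above.

```go
package main

import (
	"errors"
	"fmt"
)

type requirement string

const (
	preferred requirement = "preferred"
	required  requirement = "required"
)

// dependencyAffinity is a local mirror of DependencyAffinity.
type dependencyAffinity struct {
	ComponentRef     string
	PodLabelSelector map[string]string
	Requirement      requirement
	TopologyKey      string
}

// requirementOrDefault falls back to "preferred" when unset.
func (d dependencyAffinity) requirementOrDefault() requirement {
	if d.Requirement == "" {
		return preferred
	}
	return d.Requirement
}

// topologyKeyOrDefault falls back to same-node co-location when unset.
func (d dependencyAffinity) topologyKeyOrDefault() string {
	if d.TopologyKey == "" {
		return "kubernetes.io/hostname"
	}
	return d.TopologyKey
}

// validate rejects empty identifiers and unknown requirement strengths.
func (d dependencyAffinity) validate() error {
	if d.ComponentRef == "" {
		return errors.New("componentRef must not be empty")
	}
	if len(d.PodLabelSelector) == 0 {
		return errors.New("podLabelSelector must not be empty")
	}
	switch d.Requirement {
	case "", preferred, required:
		return nil
	default:
		return fmt.Errorf("unknown requirement %q", d.Requirement)
	}
}

func main() {
	d := dependencyAffinity{
		ComponentRef:     "prometheus",
		PodLabelSelector: map[string]string{"app.kubernetes.io/name": "prometheus"},
	}
	fmt.Println(d.validate(), d.requirementOrDefault(), d.topologyKeyOrDefault())
}
```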
type DependencyRequirement ¶
type DependencyRequirement string
DependencyRequirement is the strength of a dependency affinity.
const (
	// DependencyRequirementPreferred renders as preferredDuringSchedulingIgnoredDuringExecution
	// with a high weight; missing components are tolerated with a warning.
	DependencyRequirementPreferred DependencyRequirement = "preferred"
	// DependencyRequirementRequired renders as requiredDuringSchedulingIgnoredDuringExecution
	// and causes pre-flight failure when the referenced component is absent from the recipe.
	DependencyRequirementRequired DependencyRequirement = "required"
)
type EnvVar ¶
type EnvVar struct {
	Name  string `json:"name" yaml:"name"`
	Value string `json:"value" yaml:"value"`
}
EnvVar is a name/value pair for container environment variables.
type JobPlan ¶
type JobPlan struct {
	// ValidatorName is the unique validator identifier
	ValidatorName string
	// Phase is the validation phase ("deployment", "performance", "conformance")
	Phase string
	// JobName is the generated Kubernetes Job name (unique per invocation)
	JobName string
	// Namespace is the Kubernetes namespace for the Job
	Namespace string
	// Image is the validator container image
	Image string
	// Args are container arguments
	Args []string
	// Env are environment variables for the container
	Env []corev1.EnvVar
	// Volumes are pod volumes (ConfigMaps for snapshot and validation data)
	Volumes []corev1.Volume
	// VolumeMounts are container volume mounts
	VolumeMounts []corev1.VolumeMount
	// Resources are container resource requirements
	Resources corev1.ResourceRequirements
	// Timeout is the maximum execution time (Job activeDeadlineSeconds)
	Timeout int64
	// ServiceAccount is the Kubernetes ServiceAccount name
	ServiceAccount string
	// Tolerations are pod tolerations for scheduling
	Tolerations []corev1.Toleration
	// ImagePullSecrets are secret names for pulling images (empty = no secrets)
	ImagePullSecrets []string
	// Labels are labels to apply to the Job and Pod
	Labels map[string]string
	// Affinity is the orchestrator pod's full affinity (NodeAffinity for
	// prefer-CPU plus any PodAffinity terms derived from the catalog entry's
	// DependencyAffinity). If nil, the renderer falls back to the default
	// prefer-CPU NodeAffinity.
	Affinity *corev1.Affinity
}
JobPlan contains all components needed to build a validator Job. External controllers can use these components to build custom Jobs or call RenderPlan() to get an AICR-identical Job.
func BuildJobPlan ¶
func BuildJobPlan(
	entry ValidatorEntry,
	runID string,
	namespace string,
	version string,
	commit string,
	serviceAccount string,
	imagePullSecrets []string,
	tolerations []corev1.Toleration,
	nodeSelector map[string]string,
	imageRegistryOverride string,
	imageTagOverride string,
	componentRefs []recipe.ComponentRef,
) (JobPlan, error)
BuildJobPlan creates a JobPlan from a validator entry. Exposed as public for verification and testing purposes.
The tolerations and nodeSelector parameters apply to inner workloads spawned by validators (e.g., GPU benchmarks, NCCL tests) and are forwarded via AICR_TOLERATIONS and AICR_NODE_SELECTOR environment variables. The orchestrator Job Pod itself always uses tolerate-all scheduling ({Operator: TolerationOpExists}) and gets affinity from BuildOrchestratorAffinity (prefer-CPU NodeAffinity, plus PodAffinity per entry.DependencyAffinity if any). componentRefs is the resolved recipe's component list and is used to resolve dependencyAffinity componentRefs to namespaces.
Returns ErrCodeInvalidRequest when entry.DependencyAffinity declares a "required" component that is not present in componentRefs.
func Plan ¶
func Plan(
	cat *ValidatorCatalog,
	validationInput *ValidationInput,
	runID string,
	namespace string,
	version string,
	commit string,
	serviceAccount string,
	imagePullSecrets []string,
	tolerations []corev1.Toleration,
	nodeSelector map[string]string,
	imageRegistryOverride string,
	imageTagOverride string,
	componentRefs []recipe.ComponentRef,
) ([]JobPlan, error)
Plan generates job plans for all validators across all phases. Returns a flat list of JobPlans where each plan contains all components needed to build a validator Job. Controllers can group by Phase field.
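Grouping the flat list by the Phase field, as suggested above, could be done along these lines. jobPlan is a two-field local stand-in for JobPlan, and the validator names are made up for the example.

```go
package main

import "fmt"

// jobPlan is a local stand-in carrying only the fields used for grouping.
type jobPlan struct {
	ValidatorName string
	Phase         string
}

// groupByPhase buckets plans by Phase, preserving input order per bucket.
func groupByPhase(plans []jobPlan) map[string][]jobPlan {
	grouped := make(map[string][]jobPlan)
	for _, p := range plans {
		grouped[p.Phase] = append(grouped[p.Phase], p)
	}
	return grouped
}

func main() {
	plans := []jobPlan{
		{ValidatorName: "example-deploy-check", Phase: "deployment"},
		{ValidatorName: "example-bandwidth", Phase: "performance"},
		{ValidatorName: "example-rollout", Phase: "deployment"},
	}
	grouped := groupByPhase(plans)
	// Iterate phases in a fixed order; map iteration order is random.
	for _, phase := range []string{"deployment", "performance", "conformance"} {
		fmt.Println(phase, len(grouped[phase]))
	}
}
```

A controller that runs phases sequentially can then walk the phases in its own order and submit each bucket's Jobs together.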
type NodeSelection ¶
type NodeSelection struct {
	// Selector specifies label-based node selection.
	Selector map[string]string `json:"selector,omitempty" yaml:"selector,omitempty"`
	// MaxNodes limits the number of nodes to validate.
	MaxNodes int `json:"maxNodes,omitempty" yaml:"maxNodes,omitempty"`
	// ExcludeNodes lists node names to exclude from validation.
	ExcludeNodes []string `json:"excludeNodes,omitempty" yaml:"excludeNodes,omitempty"`
}
NodeSelection defines node filtering for validation scope.
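One plausible reading of these three fields is sketched below. The semantics are assumptions, not documented behavior: every selector pair must match the node's labels, excluded names are dropped, and a MaxNodes of zero means no limit. The node type, selectNodes, and the label key are hypothetical.

```go
package main

import "fmt"

// node is a hypothetical, minimal view of a cluster node.
type node struct {
	Name   string
	Labels map[string]string
}

// nodeSelection is a local mirror of NodeSelection.
type nodeSelection struct {
	Selector     map[string]string
	MaxNodes     int
	ExcludeNodes []string
}

// matches reports whether labels contains every selector key/value pair.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// selectNodes returns the names of nodes in scope, in input order.
// Assumption: MaxNodes == 0 means "no limit".
func selectNodes(nodes []node, sel nodeSelection) []string {
	excluded := make(map[string]bool, len(sel.ExcludeNodes))
	for _, name := range sel.ExcludeNodes {
		excluded[name] = true
	}
	var out []string
	for _, n := range nodes {
		if excluded[n.Name] || !matches(n.Labels, sel.Selector) {
			continue
		}
		if sel.MaxNodes > 0 && len(out) >= sel.MaxNodes {
			break
		}
		out = append(out, n.Name)
	}
	return out
}

func main() {
	gpu := map[string]string{"example.com/gpu": "true"}
	nodes := []node{
		{Name: "a", Labels: gpu},
		{Name: "b", Labels: gpu},
		{Name: "c", Labels: map[string]string{"example.com/gpu": "false"}},
		{Name: "d", Labels: gpu},
	}
	sel := nodeSelection{Selector: gpu, MaxNodes: 2, ExcludeNodes: []string{"b"}}
	fmt.Println(selectNodes(nodes, sel))
}
```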
type ResourceRequirements ¶
type ResourceRequirements struct {
	CPU    string `json:"cpu,omitempty" yaml:"cpu,omitempty"`
	Memory string `json:"memory,omitempty" yaml:"memory,omitempty"`
}
ResourceRequirements defines CPU and memory for a validator container.
type ValidationConfig ¶
type ValidationConfig struct {
	// Readiness defines readiness validation phase settings.
	Readiness *ValidationPhase `json:"readiness,omitempty" yaml:"readiness,omitempty"`
	// Deployment defines deployment validation phase settings.
	Deployment *ValidationPhase `json:"deployment,omitempty" yaml:"deployment,omitempty"`
	// Performance defines performance validation phase settings.
	Performance *ValidationPhase `json:"performance,omitempty" yaml:"performance,omitempty"`
	// Conformance defines conformance validation phase settings.
	Conformance *ValidationPhase `json:"conformance,omitempty" yaml:"conformance,omitempty"`
}
ValidationConfig defines validation phases and settings.
type ValidationInput ¶
type ValidationInput struct {
	// APIVersion is the API version (optional, for standalone resource usage).
	APIVersion string `json:"apiVersion,omitempty" yaml:"apiVersion,omitempty"`
	// Kind is always "ValidationInput" (optional, for standalone resource usage).
	Kind string `json:"kind,omitempty" yaml:"kind,omitempty"`
	// Metadata contains validation metadata (optional, for standalone resource usage).
	Metadata *ValidationMetadata `json:"metadata,omitempty" yaml:"metadata,omitempty"`
	// Config defines the validation phases configuration.
	Config ValidationConfig `json:"config" yaml:"config"`
	// ComponentRefs lists the components to validate (optional).
	ComponentRefs []recipe.ComponentRef `json:"componentRefs,omitempty" yaml:"componentRefs,omitempty"`
	// Criteria specifies the cluster characteristics (optional).
	Criteria recipe.Criteria `json:"criteria,omitempty" yaml:"criteria,omitempty"`
	// Constraints are top-level readiness constraints evaluated before validation phases (optional).
	Constraints []recipe.Constraint `json:"constraints,omitempty" yaml:"constraints,omitempty"`
}
ValidationInput is the complete validation input specification. Supports both standalone file usage (with full metadata) and embedded usage in CRs (metadata omitted).
Standalone usage (validation.yaml):
apiVersion: validator.nvidia.com/v1alpha1
kind: ValidationInput
metadata:
  name: my-validation
  version: 1.0.0
config:
  readiness:
    timeout: 10m
componentRefs: [...]
criteria: {...}
Embedded usage (in a CR):
spec:
  validation:
    config:
      readiness:
        timeout: 10m
    componentRefs: [...]
    criteria: {...}
func NewValidationInput ¶
func NewValidationInput() *ValidationInput
NewValidationInput creates a new empty ValidationInput instance.
func ToValidationInput ¶
func ToValidationInput(r *recipe.RecipeResult) *ValidationInput
ToValidationInput converts RecipeResult to ValidationInput for use with validators. This extracts the validation-relevant fields (ValidationConfig, ComponentRefs, Criteria) and discards recipe-specific metadata (AppliedOverlays, DeploymentOrder, etc.). Returns nil if the input RecipeResult is nil.
Populates optional APIVersion/Kind/Metadata fields to support standalone usage. When embedding in CRs, these fields can be omitted via omitempty tags.
func (*ValidationInput) GetComponentRefs ¶
func (i *ValidationInput) GetComponentRefs() []recipe.ComponentRef
GetComponentRefs returns the resolved recipe's component refs in a nil-safe way. Callers can invoke this on a nil *ValidationInput and receive nil rather than panicking — used by the validator deployer to resolve dependencyAffinity componentRefs to namespaces. When the input is nil, deployers fall back to the default (no podAffinity).
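The nil-safe behavior relies on Go allowing method calls on a nil pointer receiver. A minimal sketch with local stand-in types (validationInput and componentRef are not the package's definitions):

```go
package main

import "fmt"

// componentRef and validationInput are local stand-ins for the real types.
type componentRef struct {
	Name      string
	Namespace string
}

type validationInput struct {
	ComponentRefs []componentRef
}

// getComponentRefs is safe to call on a nil receiver: the method checks
// the pointer before dereferencing it.
func (i *validationInput) getComponentRefs() []componentRef {
	if i == nil {
		return nil
	}
	return i.ComponentRefs
}

func main() {
	var missing *validationInput // nil input, e.g. no validation file supplied
	fmt.Println(len(missing.getComponentRefs()))

	present := &validationInput{ComponentRefs: []componentRef{
		{Name: "prometheus", Namespace: "monitoring"},
	}}
	fmt.Println(len(present.getComponentRefs()))
}
```

Callers can therefore pass the result straight into affinity resolution without a separate nil check on the input.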
type ValidationMetadata ¶
type ValidationMetadata struct {
	// Name is a human-readable name for this validation.
	Name string `json:"name,omitempty" yaml:"name,omitempty"`
	// Version is the version of this validation specification.
	Version string `json:"version,omitempty" yaml:"version,omitempty"`
}
ValidationMetadata contains validation-level metadata.
type ValidationPhase ¶
type ValidationPhase struct {
	// Timeout is the maximum duration for this phase (e.g., "10m").
	Timeout string `json:"timeout,omitempty" yaml:"timeout,omitempty"`
	// Constraints are phase-level constraints to evaluate.
	Constraints []recipe.Constraint `json:"constraints,omitempty" yaml:"constraints,omitempty"`
	// Checks are named validation checks to run in this phase.
	Checks []string `json:"checks,omitempty" yaml:"checks,omitempty"`
	// NodeSelection defines which nodes to include in validation.
	NodeSelection *NodeSelection `json:"nodeSelection,omitempty" yaml:"nodeSelection,omitempty"`
	// Infrastructure references a componentRef that provides validation infrastructure.
	// Example: "nccl-doctor" for performance testing.
	Infrastructure string `json:"infrastructure,omitempty" yaml:"infrastructure,omitempty"`
}
ValidationPhase represents a single validation phase configuration.
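Timeout is carried as a string such as "10m". Assuming it follows Go's duration syntax (which the "10m" example suggests but the documentation does not state), a caller could turn it into a whole number of seconds, the unit used by a Job's activeDeadlineSeconds, like this. The helper name deadlineSeconds is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// deadlineSeconds parses a phase timeout such as "10m" and returns it as
// whole seconds. Assumption: the string uses Go duration syntax.
func deadlineSeconds(timeout string) (int64, error) {
	d, err := time.ParseDuration(timeout)
	if err != nil {
		return 0, fmt.Errorf("invalid timeout %q: %w", timeout, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("timeout %q must be positive", timeout)
	}
	return int64(d / time.Second), nil
}

func main() {
	secs, err := deadlineSeconds("10m")
	fmt.Println(secs, err) // prints "600 <nil>"
}
```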
type ValidatorCatalog ¶
type ValidatorCatalog struct {
	// APIVersion is the API version (optional, for standalone resource usage).
	APIVersion string `json:"apiVersion,omitempty" yaml:"apiVersion,omitempty"`
	// Kind is always "ValidatorCatalog" (optional, for standalone resource usage).
	Kind string `json:"kind,omitempty" yaml:"kind,omitempty"`
	// Metadata contains catalog metadata (optional, for standalone resource usage).
	Metadata *CatalogMetadata `json:"metadata,omitempty" yaml:"metadata,omitempty"`
	// Validators is the list of validator entries (required).
	Validators []ValidatorEntry `json:"validators" yaml:"validators"`
}
ValidatorCatalog is the top-level catalog document. Supports both standalone file usage (with full metadata) and embedded usage in CRs (metadata omitted).
Standalone usage (catalog.yaml):
apiVersion: validator.nvidia.com/v1alpha1
kind: ValidatorCatalog
metadata:
  name: default
  version: 1.0.0
validators: [...]
Embedded usage (in a CR):
spec:
  catalog:
    validators: [...]
func (*ValidatorCatalog) ForPhase ¶
func (c *ValidatorCatalog) ForPhase(phase Phase) []ValidatorEntry
ForPhase returns validators filtered by phase.
type ValidatorEntry ¶
type ValidatorEntry struct {
	// Name is the unique identifier for this validator, used in Job names.
	Name string `json:"name" yaml:"name"`
	// Phase is the validation phase: "deployment", "performance", or "conformance".
	Phase string `json:"phase" yaml:"phase"`
	// Description is a human-readable description of what this validator checks.
	Description string `json:"description" yaml:"description"`
	// Image is the OCI image reference for the validator container.
	Image string `json:"image" yaml:"image"`
	// Timeout is the maximum execution time for this validator.
	// Maps to Job activeDeadlineSeconds.
	Timeout time.Duration `json:"timeout" yaml:"timeout"`
	// Args are the container arguments.
	Args []string `json:"args,omitempty" yaml:"args,omitempty"`
	// Env are environment variables to set in the container.
	Env []EnvVar `json:"env,omitempty" yaml:"env,omitempty"`
	// Resources specifies container resource requests/limits.
	// If nil, defaults from pkg/defaults are used.
	Resources *ResourceRequirements `json:"resources,omitempty" yaml:"resources,omitempty"`
	// DependencyAffinity declares co-location preferences for the orchestrator
	// pod of this validator. Each entry references a recipe component by name
	// (componentRef) and a label selector matching that component's pods.
	// The deployer resolves the componentRef to a namespace from the resolved
	// recipe at spawn time and emits a podAffinity term on the orchestrator
	// Pod spec. "required" entries hard-fail the run when the referenced
	// component is absent from the recipe; "preferred" entries (default) emit
	// a structured warning and proceed with no affinity term for that
	// dependency.
	//
	// Motivation: ai-service-metrics queries Prometheus over a Service. On
	// clusters with asymmetric pod-to-pod network reachability (e.g.,
	// multi-Security-Group DGXC EKS), the orchestrator must run on a node
	// that can reach the Prometheus pod. Co-locating with the Prometheus pod
	// makes the dial loopback / same-network and removes the dependency on
	// cluster network topology. See https://github.com/NVIDIA/aicr/issues/933.
	DependencyAffinity []DependencyAffinity `json:"dependencyAffinity,omitempty" yaml:"dependencyAffinity,omitempty"`
}
ValidatorEntry defines a single validator container.
func FilterEntriesByValidation ¶
func FilterEntriesByValidation(entries []ValidatorEntry, phase Phase, validationInput *ValidationInput) []ValidatorEntry
FilterEntriesByValidation filters catalog entries based on the validation's declared checks for the given phase. Returns nil if the validation has no phase configuration or no checks declared.
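The filtering described here and in ForPhase can be sketched as follows. The types are local stand-ins, and one key detail is an assumption: the sketch matches a declared check against the entry's Name, which the documentation does not state explicitly.

```go
package main

import "fmt"

// entry and phaseConfig are local stand-ins for ValidatorEntry and
// ValidationPhase, reduced to the fields used for filtering.
type entry struct {
	Name  string
	Phase string
}

type phaseConfig struct {
	Checks []string
}

// forPhase keeps only the entries registered for phase.
func forPhase(entries []entry, phase string) []entry {
	var out []entry
	for _, e := range entries {
		if e.Phase == phase {
			out = append(out, e)
		}
	}
	return out
}

// filterByChecks returns the phase's entries that appear in cfg.Checks.
// It returns nil when the phase has no configuration or declares no checks.
// Assumption: a check name selects the entry with the same Name.
func filterByChecks(entries []entry, phase string, cfg *phaseConfig) []entry {
	if cfg == nil || len(cfg.Checks) == 0 {
		return nil
	}
	wanted := make(map[string]bool, len(cfg.Checks))
	for _, c := range cfg.Checks {
		wanted[c] = true
	}
	var out []entry
	for _, e := range forPhase(entries, phase) {
		if wanted[e.Name] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	catalog := []entry{
		{Name: "check-a", Phase: "deployment"},
		{Name: "check-b", Phase: "deployment"},
		{Name: "check-c", Phase: "performance"},
	}
	cfg := &phaseConfig{Checks: []string{"check-b", "check-c"}}
	fmt.Println(filterByChecks(catalog, "deployment", cfg))
}
```

In this sketch a check declared for the wrong phase ("check-c" under deployment) is silently ignored; whether the real function does the same is not documented.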