v1

package
v0.14.0 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 1, 2026 License: Apache-2.0 Imports: 19 Imported by: 0

README

Validator v1 API

pkg/validator/v1 is the canonical home of AICR's validator input format and the job-plan API external Kubernetes controllers use to render and deploy AICR validator Jobs.

Stability. This package implements v1alpha1. The schema may have breaking changes before v1. Breaking changes once we reach v1 will require a major version bump (v2.0.0). See doc.go for the full stability contract.

Provenance. This package previously lived at pkg/api/validator/v1 and was relocated under pkg/validator/v1 so the on-disk layout matches its position in the validation pipeline. Re-exported aliases keep older import paths source-compatible during the transition; new code should import this path directly.

Package surface

The package is intentionally narrow and exports four concerns:

  1. ValidationInput (validation_input.go) — the wire format consumed by a recipe's spec.validation block. Carries phases, checks, constraints, criteria, and the resolved component refs.
  2. ValidatorCatalog + ValidatorEntry (catalog.go) — the catalog schema and the Phase/filtering helpers. Catalog loading lives in pkg/validator/catalog; this package owns the types.
  3. JobPlan + planners + renderers (job_plan.go) — the data shape external controllers customize, plus the functions that build one (Plan, BuildJobPlan) and render it into a batchv1.Job or an apply-config (RenderPlan, RenderPlanToApplyConfig).
  4. Affinity helpers (affinity.go, dependency_affinity.go) — support for the catalog's dependencyAffinity declaration.

GenerateRunID and ImagePullPolicy are exported utilities for callers that build their own renderer.

Quick start

1. Generate a run ID

import v1 "github.com/NVIDIA/aicr/pkg/validator/v1"

runID := v1.GenerateRunID()
// Example: "20260514-143052-a1b2c3d4e5f60718"

2. Create the snapshot + validation ConfigMaps

import (
    v1 "github.com/NVIDIA/aicr/pkg/validator/v1"
    "github.com/NVIDIA/aicr/pkg/validator"
)

err := validator.EnsureDataConfigMaps(
    ctx,
    clientset,
    namespace,
    runID,
    snapshot,        // *snapshotter.Snapshot
    validationInput, // *v1.ValidationInput
)

EnsureDataConfigMaps stays in pkg/validator because it touches Kubernetes API objects and is not part of the v1 wire contract.

3. Load the catalog and plan

import (
    v1 "github.com/NVIDIA/aicr/pkg/validator/v1"
    "github.com/NVIDIA/aicr/pkg/validator/catalog"
)

cat, err := catalog.LoadWithDataProvider(ctx, nil, version, commit)
if err != nil {
    return err
}

serviceAccount := "my-validator-sa-" + runID

plans, err := v1.Plan(
    cat,
    validationInput,
    runID,
    namespace,
    version,        // controller version
    commit,         // controller commit SHA
    serviceAccount, // SA name your controller manages
    nil,            // imagePullSecrets
    nil,            // tolerations (forwarded to inner workloads)
    nil,            // nodeSelector (forwarded to inner workloads)
    "",             // imageRegistryOverride
    "",             // imageTagOverride
    componentRefs,  // []recipe.ComponentRef, may be nil
)

Notes:

  • LoadWithDataProvider(ctx, nil, …) uses the embedded catalog. Pass a layered recipe.DataProvider to honor a --data overlay.
  • serviceAccount is yours to name. AICR's own CLI uses aicr-validator-<runID>; external controllers should pick a strategy consistent with their RBAC.
  • tolerations and nodeSelector apply to inner workloads (GPU benchmarks, NCCL tests). The orchestrator pod itself uses tolerate-all scheduling and gets its affinity from BuildOrchestratorAffinity (prefer-CPU NodeAffinity, plus PodAffinity for any dependencyAffinity declarations).
  • componentRefs is the resolved recipe's component list and is used exclusively to resolve dependencyAffinity.componentRef entries to namespaces. Pass nil when no component-targeted affinity applies.
  • Plan returns ErrCodeInvalidRequest when an entry declares a required dependencyAffinity.componentRef that is not present in componentRefs.

4a. Deploy with Create (simple path)

for _, plan := range plans {
    job := v1.RenderPlan(plan)
    if _, err := clientset.BatchV1().Jobs(namespace).Create(
        ctx, job, metav1.CreateOptions{},
    ); err != nil {
        return err
    }
}

4b. Deploy with server-side apply (idempotent)

for _, plan := range plans {
    apply := v1.RenderPlanToApplyConfig(plan, plan.JobName)
    if _, err := clientset.BatchV1().Jobs(namespace).Apply(
        ctx, apply, metav1.ApplyOptions{
            FieldManager: "my-controller",
            Force:        true,
        },
    ); err != nil {
        return err
    }
}

Use the apply path when the same plan may be reconciled more than once or when more than one controller owns fields on the same Job.

JobPlan

type JobPlan struct {
    ValidatorName    string                      // unique validator identifier
    Phase            string                      // "deployment" | "performance" | "conformance"
    JobName          string                      // generated; aicr-{validator}-{hex}
    Namespace        string                      // Kubernetes namespace
    Image            string                      // resolved container image
    Args             []string
    Env              []corev1.EnvVar
    Volumes          []corev1.Volume             // snapshot + validation ConfigMaps
    VolumeMounts     []corev1.VolumeMount
    Resources        corev1.ResourceRequirements
    Timeout          int64                       // activeDeadlineSeconds
    ServiceAccount   string
    Tolerations      []corev1.Toleration         // forwarded; orchestrator pod is tolerate-all
    ImagePullSecrets []string
    Labels           map[string]string
    Affinity         *corev1.Affinity            // orchestrator pod affinity (NodeAffinity + optional PodAffinity)
}

JobName is generated by BuildJobPlan as aicr-{validatorName}-{hex}. It is stable within a plan but unique across invocations.

Affinity, when non-nil, is what the renderer applies to the orchestrator pod. A nil value falls back to the default prefer-CPU NodeAffinity.

Grouping plans by phase

Plan returns a flat list. Controllers that want phase ordering should group:

groups := make(map[string][]v1.JobPlan)
for _, p := range plans {
    groups[p.Phase] = append(groups[p.Phase], p)
}

for _, p := range groups[string(v1.PhaseDeployment)] { /* … */ }
for _, p := range groups[string(v1.PhasePerformance)] { /* … */ }
for _, p := range groups[string(v1.PhaseConformance)] { /* … */ }
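
As an alternative to a map of groups, a controller could keep the flat slice and sort it by phase. The following is a minimal sketch using local stand-in types rather than v1.JobPlan; the deployment, performance, conformance ordering is an assumption taken from the order in which the phases are listed above, and the plan type and sortByPhase helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// plan is a hypothetical stand-in carrying only the fields the sort needs.
type plan struct {
	Name  string
	Phase string
}

// phaseRank assumes deployment runs first, then performance, then conformance.
// Unknown phases sort last.
func phaseRank(phase string) int {
	switch phase {
	case "deployment":
		return 0
	case "performance":
		return 1
	case "conformance":
		return 2
	default:
		return 3
	}
}

// sortByPhase orders plans by phase while keeping the original order
// of plans that share a phase.
func sortByPhase(plans []plan) {
	sort.SliceStable(plans, func(i, j int) bool {
		return phaseRank(plans[i].Phase) < phaseRank(plans[j].Phase)
	})
}

func main() {
	plans := []plan{
		{Name: "c", Phase: "conformance"},
		{Name: "a", Phase: "deployment"},
		{Name: "p", Phase: "performance"},
		{Name: "b", Phase: "deployment"},
	}
	sortByPhase(plans)
	for _, p := range plans {
		fmt.Println(p.Phase, p.Name)
	}
}
```

A stable sort keeps the catalog order within each phase, which may matter if validators in the same phase are expected to start in a predictable sequence.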

Customizing a single plan

plan, err := v1.BuildJobPlan(
    entry,
    runID,
    namespace,
    version, commit,
    serviceAccount,
    nil, tolerations, nodeSelector,
    "", "",
    componentRefs,
)
if err != nil {
    return err
}

plan.Timeout = 600 // 10 minutes
plan.Env = append(plan.Env, corev1.EnvVar{Name: "MY_VAR", Value: "x"})

job := v1.RenderPlan(plan)

API choice

| Scenario | Use |
| --- | --- |
| Single-shot controller | RenderPlan + Create |
| Idempotent reconciliation | RenderPlanToApplyConfig + Apply |
| Multi-controller field ownership | RenderPlanToApplyConfig + Apply |

Server-side apply is the default we recommend for any controller that will run a reconcile loop.

End-to-end example

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"

    "github.com/NVIDIA/aicr/pkg/recipe"
    "github.com/NVIDIA/aicr/pkg/snapshotter"
    "github.com/NVIDIA/aicr/pkg/validator"
    "github.com/NVIDIA/aicr/pkg/validator/catalog"
    v1 "github.com/NVIDIA/aicr/pkg/validator/v1"
)

func RunValidation(
    ctx context.Context,
    clientset kubernetes.Interface,
    namespace, version, commit string,
    validationInput *v1.ValidationInput,
    snapshot *snapshotter.Snapshot,
    componentRefs []recipe.ComponentRef, // matches the Plan signature
) error {
    runID := v1.GenerateRunID()

    if err := validator.EnsureDataConfigMaps(
        ctx, clientset, namespace, runID, snapshot, validationInput,
    ); err != nil {
        return err
    }

    cat, err := catalog.LoadWithDataProvider(ctx, nil, version, commit)
    if err != nil {
        return err
    }

    serviceAccount := "my-validator-sa-" + runID

    plans, err := v1.Plan(
        cat, validationInput, runID, namespace,
        version, commit, serviceAccount,
        nil, nil, nil, "", "",
        componentRefs,
    )
    if err != nil {
        return err
    }

    for _, plan := range plans {
        fmt.Printf("apply %s (validator=%s phase=%s)\n",
            plan.JobName, plan.ValidatorName, plan.Phase)
        apply := v1.RenderPlanToApplyConfig(plan, plan.JobName)
        if _, err := clientset.BatchV1().Jobs(namespace).Apply(
            ctx, apply, metav1.ApplyOptions{
                FieldManager: "my-controller",
                Force:        true,
            },
        ); err != nil {
            return fmt.Errorf("apply %s: %w", plan.ValidatorName, err)
        }
    }
    return nil
}

API reference

Session

GenerateRunID() string

Returns {YYYYMMDD-HHMMSS}-{hex16}, e.g. 20260514-123045-a1b2c3d4e5f60718. Panics on entropy failure (preferred over a predictable ID).

Planning

Plan(cat, validationInput, runID, namespace, version, commit, serviceAccount, imagePullSecrets, tolerations, nodeSelector, imageRegistryOverride, imageTagOverride, componentRefs) ([]JobPlan, error)

Builds one JobPlan per (phase, validator entry) pair that matches validationInput. A nil catalog returns an empty slice and no error.

BuildJobPlan(entry, runID, namespace, version, commit, serviceAccount, imagePullSecrets, tolerations, nodeSelector, imageRegistryOverride, imageTagOverride, componentRefs) (JobPlan, error)

Builds a plan from a single catalog entry. Returns ErrCodeInvalidRequest when a required dependencyAffinity.componentRef is not present in componentRefs.

Parameter notes shared by both functions:

  • tolerations, nodeSelector — forwarded to inner workloads via the AICR_TOLERATIONS and AICR_NODE_SELECTOR env vars. The orchestrator Pod itself is tolerate-all and has no node selector.
  • imageRegistryOverride — replaces the registry prefix on every validator image. Matches AICR_VALIDATOR_IMAGE_REGISTRY. Empty disables the override.
  • imageTagOverride — replaces the tag on every tag-based reference. Digest-pinned references (name@sha256:…) are left untouched. Matches AICR_VALIDATOR_IMAGE_TAG. Empty disables the override.
  • componentRefs — resolved component list from the recipe. Used to resolve dependencyAffinity.componentRef to a namespace. Pass nil when dependencyAffinity is unused.
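
The tag-override rule can be illustrated with plain string handling. This is a sketch under stated assumptions, not the package's implementation: overrideTag is a hypothetical helper, it covers only the tag rule (not the registry rewrite), and it assumes an untagged reference simply gains the override tag.

```go
package main

import (
	"fmt"
	"strings"
)

// overrideTag is a hypothetical helper that mirrors the documented tag rule:
// an empty override is a no-op and digest-pinned references are left untouched.
func overrideTag(image, tag string) string {
	if tag == "" || strings.Contains(image, "@sha256:") {
		return image
	}
	// Only a colon after the last slash separates a tag; an earlier colon
	// belongs to a registry port such as localhost:5000.
	lastSlash := strings.LastIndex(image, "/")
	if colon := strings.LastIndex(image, ":"); colon > lastSlash {
		image = image[:colon]
	}
	// Assumption: untagged references receive the override tag as well.
	return image + ":" + tag
}

func main() {
	fmt.Println(overrideTag("registry.example.com/validators/gpu:v1.2.3", "dev"))
	fmt.Println(overrideTag("registry.example.com/validators/gpu@sha256:abc", "dev"))
	fmt.Println(overrideTag("localhost:5000/gpu", "dev"))
}
```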
Rendering

RenderPlan(plan JobPlan) *batchv1.Job

Materializes a batchv1.Job. Uses plan.JobName and plan.Namespace.

RenderPlanToApplyConfig(plan JobPlan, jobName string) *applybatchv1.JobApplyConfiguration

Materializes an apply-config for server-side apply. Pass plan.JobName as jobName.

Image utilities

ImagePullPolicy(image string, imageTagOverride string) corev1.PullPolicy

| Input | Policy |
| --- | --- |
| ko.local/…, kind.local/… | PullNever |
| digest pinned (…@sha256:…) | PullIfNotPresent |
| imageTagOverride != "" | PullAlways |
| :latest | PullAlways |
| anything else (versioned tag) | PullIfNotPresent |
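
For callers building their own renderer, the table can be read as a decision ladder. The sketch below assumes the rows are evaluated top to bottom, which the source does not state explicitly; pullPolicy is a hypothetical stand-in that returns plain strings instead of corev1.PullPolicy so that it has no Kubernetes dependency.

```go
package main

import (
	"fmt"
	"strings"
)

// pullPolicy is a hypothetical re-statement of the table above.
// Assumption: earlier rows take precedence over later rows.
func pullPolicy(image, imageTagOverride string) string {
	switch {
	case strings.HasPrefix(image, "ko.local/"), strings.HasPrefix(image, "kind.local/"):
		return "Never" // side-loaded images never exist in a remote registry
	case strings.Contains(image, "@sha256:"):
		return "IfNotPresent" // digest-pinned content is immutable
	case imageTagOverride != "":
		return "Always" // an overridden tag may be re-pushed
	case strings.HasSuffix(image, ":latest"):
		return "Always"
	default:
		return "IfNotPresent" // versioned tag
	}
}

func main() {
	fmt.Println(pullPolicy("ko.local/validator:dev", ""))
	fmt.Println(pullPolicy("registry.example.com/validator:latest", ""))
	fmt.Println(pullPolicy("registry.example.com/validator:v1.2.3", ""))
}
```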

RBAC

External controllers own their ServiceAccount, Role/ClusterRole, and binding. Plug the SA name into Plan / BuildJobPlan via the serviceAccount parameter. AICR's own CLI uses aicr-validator-<runID>, but you are free to choose any convention.

Where the moving parts live

| Concern | Package |
| --- | --- |
| Wire types (ValidationInput, catalog schema, JobPlan) | pkg/validator/v1 (this package) |
| Catalog loading + image rewriting | pkg/validator/catalog |
| Job dispatch, watch, log streaming | pkg/validator |
| Snapshot capture | pkg/snapshotter |
| Recipe resolution / overlays | pkg/recipe |

Notes

  • The public, semver-stable consumer surface is pkg/client/v1. This package is the underlying wire format the facade re-exports. Embed ValidationConfig directly (not ValidationInput) if you need to drop the wrapper metadata/apiVersion/kind fields in a custom resource spec.
  • Both Create and server-side Apply deployment strategies are supported.
  • ConfigMap creation (EnsureDataConfigMaps) stays in pkg/validator because it is an in-cluster side effect, not a wire-format concern.

Documentation

Overview

Package v1 defines AICR's validator input format (v1alpha1).

Stability

v1alpha1 is unstable and may have breaking changes before v1. Breaking changes at v1+ will require major version bumps (v2.0.0).

API Group

validator.nvidia.com is a non-binding example. AICR ships no CRDs - external projects should use their own API groups.

Usage

This package defines ValidationInput, the input format for AICR's validator plugins. It carries both validation configuration (phases, checks) and recipe context (ComponentRefs, Criteria, Constraints).

ValidationInput supports two usage patterns:

  1. Standalone validation.yaml files (with apiVersion/kind/metadata)
  2. Embedded in custom resources (metadata fields omitted via omitempty)

For external controllers that want to embed validation configuration, embed ValidationConfig directly (not ValidationInput) to avoid nested spec fields:

type MySpec struct {
    Validation ValidationConfig `json:"validation"`
}
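
To see what the embedded form looks like on the wire, the sketch below marshals a spec of this shape to JSON. The ValidationPhase and ValidationConfig types here are trimmed local copies (only a few fields, JSON tags only) so the example has no dependency on this package; encodeSpec is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed local copies of the package types, for illustration only.
type ValidationPhase struct {
	Timeout string   `json:"timeout,omitempty"`
	Checks  []string `json:"checks,omitempty"`
}

type ValidationConfig struct {
	Readiness  *ValidationPhase `json:"readiness,omitempty"`
	Deployment *ValidationPhase `json:"deployment,omitempty"`
}

type MySpec struct {
	Validation ValidationConfig `json:"validation"`
}

// encodeSpec renders the spec as compact JSON.
func encodeSpec(spec MySpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", fmt.Errorf("marshal spec: %w", err)
	}
	return string(b), nil
}

func main() {
	spec := MySpec{Validation: ValidationConfig{
		Readiness: &ValidationPhase{Timeout: "10m"},
	}}
	out, err := encodeSpec(spec)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"validation":{"readiness":{"timeout":"10m"}}}
}
```

Because every phase pointer carries omitempty, unset phases disappear from the output and the phases sit directly under validation with no wrapper fields.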


Constants

const (
	// CatalogAPIVersion is the supported catalog API version.
	CatalogAPIVersion = "validator.nvidia.com/v1alpha1"

	// CatalogKind is the supported catalog kind.
	CatalogKind = "ValidatorCatalog"
)

const (
	// KindValidationInput is the Kubernetes kind for ValidationInput resources.
	KindValidationInput = "ValidationInput"
)


Functions

func BuildOrchestratorAffinity

func BuildOrchestratorAffinity(
	deps []DependencyAffinity,
	componentRefs []recipe.ComponentRef,
) (*corev1.Affinity, error)

BuildOrchestratorAffinity composes the orchestrator pod's full affinity from the validator's declared dependencies and the resolved recipe's component refs. The result always includes the default prefer-CPU NodeAffinity; each resolvable dependency adds one PodAffinity term.

Resolution rules (per https://github.com/NVIDIA/aicr/issues/933):

  • A "required" dependency whose componentRef is missing from componentRefs returns ErrCodeInvalidRequest. The caller should treat this as a recipe misconfiguration and fail the run before deploying any Job.
  • A "preferred" dependency whose componentRef is missing is logged at slog.Warn and produces no PodAffinity term. The orchestrator schedules wherever the scheduler picks; this preserves backward-compatible behavior on flat networks where the dependency may not be present.
  • Components whose Namespace is empty after recipe resolution are treated as missing (a dependency without a known namespace cannot produce a well-formed PodAffinityTerm).

Note: required only verifies the component is present in the resolved recipe (with IsEnabled() and a non-empty namespace). It does NOT check runtime readiness — if the dependency pods have not yet started or are crashlooping, the orchestrator pod will stay Pending until the Job's activeDeadlineSeconds fires. Operators triaging a hung run should inspect both the Job's pod PodScheduled condition and the dependency component's replica status.

For pre-flight gates that only need to check resolvability, use ValidateDependencyAffinity to avoid allocating the full affinity tree.

func GenerateRunID

func GenerateRunID() string

GenerateRunID creates a unique run identifier for validation sessions. Format: {timestamp}-{random-hex} (e.g., "20260514-123045-abc123def456"). External controllers should use this to generate runIDs before creating ConfigMaps and rendering Jobs.

Panics if the system's random number generator fails. Entropy failures are exceptional and we prefer to fail fast rather than generate predictable IDs that could collide across concurrent runs.

func ImagePullPolicy

func ImagePullPolicy(image string, imageTagOverride string) corev1.PullPolicy

ImagePullPolicy determines the pull policy for a container image. Returns Never for local side-loaded images (ko.local, kind.local), Always for :latest tag or when imageTagOverride is set, IfNotPresent for digest-pinned or versioned tags.

func RenderPlan

func RenderPlan(plan JobPlan) *batchv1.Job

RenderPlan renders a complete Kubernetes Job from a JobPlan. The returned Job spec matches exactly what the current deployer produces.

func RenderPlanToApplyConfig

func RenderPlanToApplyConfig(plan JobPlan, jobName string) *applybatchv1.JobApplyConfiguration

RenderPlanToApplyConfig renders a Kubernetes Job ApplyConfiguration from a JobPlan. This is used for server-side apply deployment strategy. External controllers can use this to get field ownership tracking and idempotent apply semantics.

The jobName parameter must be provided by the caller (unlike RenderPlan which uses plan.JobName). This allows controllers to use deterministic names for idempotent re-runs.

func ValidateDependencyAffinity

func ValidateDependencyAffinity(
	deps []DependencyAffinity,
	componentRefs []recipe.ComponentRef,
) error

ValidateDependencyAffinity verifies that all dependencies resolve against componentRefs without constructing the affinity tree. Returns the same error class as BuildOrchestratorAffinity (ErrCodeInvalidRequest on any malformed entry or any missing required component); suppresses the slog.Warn that BuildOrchestratorAffinity emits for missing preferred dependencies so pre-flight gates don't duplicate the build-time warning.

Note: only checks recipe membership (componentRef present, enabled, with a resolved namespace). Does NOT verify the dependency's pods are actually running — see the runtime-readiness note on BuildOrchestratorAffinity.

Types

type CatalogMetadata

type CatalogMetadata struct {
	Name    string `json:"name" yaml:"name"`
	Version string `json:"version" yaml:"version"` // SemVer
}

CatalogMetadata contains catalog-level metadata.

type DependencyAffinity

type DependencyAffinity struct {
	// ComponentRef is the name of a recipe component whose pod the orchestrator
	// should co-locate with. The deployer resolves this to a namespace at spawn
	// time using the resolved recipe's componentRefs.
	ComponentRef string `json:"componentRef" yaml:"componentRef"`

	// PodLabelSelector matches the dependency pod's labels (e.g.,
	// {"app.kubernetes.io/name": "prometheus"}). All key/value pairs must match.
	PodLabelSelector map[string]string `json:"podLabelSelector" yaml:"podLabelSelector"`

	// Requirement controls strength. "required" hard-fails when the dependency
	// is unschedulable; "preferred" (default) is a high-weight scheduling hint.
	Requirement DependencyRequirement `json:"requirement,omitempty" yaml:"requirement,omitempty"`

	// TopologyKey is the node label whose value defines co-location.
	// Defaults to kubernetes.io/hostname (same node) when empty.
	TopologyKey string `json:"topologyKey,omitempty" yaml:"topologyKey,omitempty"`
}

DependencyAffinity declares a co-location preference for a validator's orchestrator pod with another component's pod.

func (DependencyAffinity) RequirementOrDefault

func (d DependencyAffinity) RequirementOrDefault() DependencyRequirement

RequirementOrDefault returns the requirement strength, defaulting to "preferred" when unset.

func (DependencyAffinity) TopologyKeyOrDefault

func (d DependencyAffinity) TopologyKeyOrDefault() string

TopologyKeyOrDefault returns the topology key, defaulting to kubernetes.io/hostname when unset.

func (DependencyAffinity) Validate

func (d DependencyAffinity) Validate() error

Validate checks that ComponentRef and PodLabelSelector are non-empty and that Requirement is either empty (defaults to preferred), "preferred", or "required".
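
The two defaulting helpers and Validate can be sketched together. The dependencyAffinity type below is a local stand-in with a plain string Requirement, and the error messages are illustrative; only the rules themselves come from the descriptions above.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	requirementPreferred = "preferred"
	requirementRequired  = "required"
	defaultTopologyKey   = "kubernetes.io/hostname"
)

// dependencyAffinity is a hypothetical local copy of DependencyAffinity.
type dependencyAffinity struct {
	ComponentRef     string
	PodLabelSelector map[string]string
	Requirement      string
	TopologyKey      string
}

// requirementOrDefault falls back to "preferred" when unset.
func (d dependencyAffinity) requirementOrDefault() string {
	if d.Requirement == "" {
		return requirementPreferred
	}
	return d.Requirement
}

// topologyKeyOrDefault falls back to same-node co-location when unset.
func (d dependencyAffinity) topologyKeyOrDefault() string {
	if d.TopologyKey == "" {
		return defaultTopologyKey
	}
	return d.TopologyKey
}

// validate enforces the documented rules; messages are illustrative.
func (d dependencyAffinity) validate() error {
	if d.ComponentRef == "" {
		return errors.New("componentRef must not be empty")
	}
	if len(d.PodLabelSelector) == 0 {
		return errors.New("podLabelSelector must not be empty")
	}
	switch d.Requirement {
	case "", requirementPreferred, requirementRequired:
		return nil
	default:
		return fmt.Errorf("requirement %q must be %q or %q",
			d.Requirement, requirementPreferred, requirementRequired)
	}
}

func main() {
	d := dependencyAffinity{
		ComponentRef:     "prometheus",
		PodLabelSelector: map[string]string{"app.kubernetes.io/name": "prometheus"},
	}
	fmt.Println(d.validate(), d.requirementOrDefault(), d.topologyKeyOrDefault())
}
```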

type DependencyRequirement

type DependencyRequirement string

DependencyRequirement is the strength of a dependency affinity.

const (
	// DependencyRequirementPreferred renders as preferredDuringSchedulingIgnoredDuringExecution
	// with a high weight; missing components are tolerated with a warning.
	DependencyRequirementPreferred DependencyRequirement = "preferred"

	// DependencyRequirementRequired renders as requiredDuringSchedulingIgnoredDuringExecution
	// and causes pre-flight failure when the referenced component is absent from the recipe.
	DependencyRequirementRequired DependencyRequirement = "required"
)

type EnvVar

type EnvVar struct {
	Name  string `json:"name" yaml:"name"`
	Value string `json:"value" yaml:"value"`
}

EnvVar is a name/value pair for container environment variables.

type JobPlan

type JobPlan struct {
	// ValidatorName is the unique validator identifier
	ValidatorName string

	// Phase is the validation phase ("deployment", "performance", "conformance")
	Phase string

	// JobName is the generated Kubernetes Job name (unique per invocation)
	JobName string

	// Namespace is the Kubernetes namespace for the Job
	Namespace string

	// Image is the validator container image
	Image string

	// Args are container arguments
	Args []string

	// Env are environment variables for the container
	Env []corev1.EnvVar

	// Volumes are pod volumes (ConfigMaps for snapshot and validation data)
	Volumes []corev1.Volume

	// VolumeMounts are container volume mounts
	VolumeMounts []corev1.VolumeMount

	// Resources are container resource requirements
	Resources corev1.ResourceRequirements

	// Timeout is the maximum execution time (Job activeDeadlineSeconds)
	Timeout int64

	// ServiceAccount is the Kubernetes ServiceAccount name
	ServiceAccount string

	// Tolerations are pod tolerations for scheduling
	Tolerations []corev1.Toleration

	// ImagePullSecrets are secret names for pulling images (empty = no secrets)
	ImagePullSecrets []string

	// Labels are labels to apply to the Job and Pod
	Labels map[string]string

	// Affinity is the orchestrator pod's full affinity (NodeAffinity for
	// prefer-CPU plus any PodAffinity terms derived from the catalog entry's
	// DependencyAffinity). If nil, the renderer falls back to the default
	// prefer-CPU NodeAffinity.
	Affinity *corev1.Affinity
}

JobPlan contains all components needed to build a validator Job. External controllers can use these components to build custom Jobs or call RenderPlan() to get an AICR-identical Job.

func BuildJobPlan

func BuildJobPlan(
	entry ValidatorEntry,
	runID string,
	namespace string,
	version string,
	commit string,
	serviceAccount string,
	imagePullSecrets []string,
	tolerations []corev1.Toleration,
	nodeSelector map[string]string,
	imageRegistryOverride string,
	imageTagOverride string,
	componentRefs []recipe.ComponentRef,
) (JobPlan, error)

BuildJobPlan creates a JobPlan from a validator entry. Exposed as public for verification and testing purposes.

The tolerations and nodeSelector parameters apply to inner workloads spawned by validators (e.g., GPU benchmarks, NCCL tests) and are forwarded via AICR_TOLERATIONS and AICR_NODE_SELECTOR environment variables. The orchestrator Job Pod itself always uses tolerate-all scheduling ({Operator: TolerationOpExists}) and gets affinity from BuildOrchestratorAffinity (prefer-CPU NodeAffinity, plus PodAffinity per entry.DependencyAffinity if any). componentRefs is the resolved recipe's component list and is used to resolve dependencyAffinity componentRefs to namespaces.

Returns ErrCodeInvalidRequest when entry.DependencyAffinity declares a "required" component that is not present in componentRefs.

func Plan

func Plan(
	cat *ValidatorCatalog,
	validationInput *ValidationInput,
	runID string,
	namespace string,
	version string,
	commit string,
	serviceAccount string,
	imagePullSecrets []string,
	tolerations []corev1.Toleration,
	nodeSelector map[string]string,
	imageRegistryOverride string,
	imageTagOverride string,
	componentRefs []recipe.ComponentRef,
) ([]JobPlan, error)

Plan generates job plans for all validators across all phases. Returns a flat list of JobPlans where each plan contains all components needed to build a validator Job. Controllers can group by Phase field.

type NodeSelection

type NodeSelection struct {
	// Selector specifies label-based node selection.
	Selector map[string]string `json:"selector,omitempty" yaml:"selector,omitempty"`

	// MaxNodes limits the number of nodes to validate.
	MaxNodes int `json:"maxNodes,omitempty" yaml:"maxNodes,omitempty"`

	// ExcludeNodes lists node names to exclude from validation.
	ExcludeNodes []string `json:"excludeNodes,omitempty" yaml:"excludeNodes,omitempty"`
}

NodeSelection defines node filtering for validation scope.

type Phase

type Phase string

Phase represents a validation phase.

const (
	// PhaseDeployment is the deployment validation phase.
	PhaseDeployment Phase = "deployment"

	// PhasePerformance is the performance validation phase.
	PhasePerformance Phase = "performance"

	// PhaseConformance is the conformance validation phase.
	PhaseConformance Phase = "conformance"
)

type ResourceRequirements

type ResourceRequirements struct {
	CPU    string `json:"cpu,omitempty" yaml:"cpu,omitempty"`
	Memory string `json:"memory,omitempty" yaml:"memory,omitempty"`
}

ResourceRequirements defines CPU and memory for a validator container.

type ValidationConfig

type ValidationConfig struct {
	// Readiness defines readiness validation phase settings.
	Readiness *ValidationPhase `json:"readiness,omitempty" yaml:"readiness,omitempty"`

	// Deployment defines deployment validation phase settings.
	Deployment *ValidationPhase `json:"deployment,omitempty" yaml:"deployment,omitempty"`

	// Performance defines performance validation phase settings.
	Performance *ValidationPhase `json:"performance,omitempty" yaml:"performance,omitempty"`

	// Conformance defines conformance validation phase settings.
	Conformance *ValidationPhase `json:"conformance,omitempty" yaml:"conformance,omitempty"`
}

ValidationConfig defines validation phases and settings.

type ValidationInput

type ValidationInput struct {
	// APIVersion is the API version (optional, for standalone resource usage).
	APIVersion string `json:"apiVersion,omitempty" yaml:"apiVersion,omitempty"`

	// Kind is always "ValidationInput" (optional, for standalone resource usage).
	Kind string `json:"kind,omitempty" yaml:"kind,omitempty"`

	// Metadata contains validation metadata (optional, for standalone resource usage).
	Metadata *ValidationMetadata `json:"metadata,omitempty" yaml:"metadata,omitempty"`

	// Config defines the validation phases configuration.
	Config ValidationConfig `json:"config" yaml:"config"`

	// ComponentRefs lists the components to validate (optional).
	ComponentRefs []recipe.ComponentRef `json:"componentRefs,omitempty" yaml:"componentRefs,omitempty"`

	// Criteria specifies the cluster characteristics (optional).
	Criteria recipe.Criteria `json:"criteria,omitempty" yaml:"criteria,omitempty"`

	// Constraints are top-level readiness constraints evaluated before validation phases (optional).
	Constraints []recipe.Constraint `json:"constraints,omitempty" yaml:"constraints,omitempty"`
}

ValidationInput is the complete validation input specification. Supports both standalone file usage (with full metadata) and embedded usage in CRs (metadata omitted).

Standalone usage (validation.yaml):

apiVersion: validator.nvidia.com/v1alpha1
kind: ValidationInput
metadata:
  name: my-validation
  version: 1.0.0
config:
  readiness:
    timeout: 10m
componentRefs: [...]
criteria: {...}

Embedded usage (in a CR):

spec:
  validation:
    config:
      readiness:
        timeout: 10m
    componentRefs: [...]
    criteria: {...}

func NewValidationInput

func NewValidationInput() *ValidationInput

NewValidationInput creates a new empty ValidationInput instance.

func ToValidationInput

func ToValidationInput(r *recipe.RecipeResult) *ValidationInput

ToValidationInput converts RecipeResult to ValidationInput for use with validators. This extracts the validation-relevant fields (ValidationConfig, ComponentRefs, Criteria) and discards recipe-specific metadata (AppliedOverlays, DeploymentOrder, etc.). Returns nil if the input RecipeResult is nil.

Populates optional APIVersion/Kind/Metadata fields to support standalone usage. When embedding in CRs, these fields can be omitted via omitempty tags.

func (*ValidationInput) GetComponentRefs

func (i *ValidationInput) GetComponentRefs() []recipe.ComponentRef

GetComponentRefs returns the resolved recipe's component refs in a nil-safe way. Callers can invoke this on a nil *ValidationInput and receive nil rather than panicking — used by the validator deployer to resolve dependencyAffinity componentRefs to namespaces. When the input is nil, deployers fall back to the default (no podAffinity).
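
The nil-safe behavior relies on a Go property: a method with a pointer receiver may be called on a nil pointer as long as it checks the receiver before dereferencing it. A minimal sketch with a hypothetical input type holding string refs:

```go
package main

import "fmt"

// input is a hypothetical stand-in for ValidationInput.
type input struct {
	ComponentRefs []string
}

// GetComponentRefs is safe to call on a nil *input.
func (i *input) GetComponentRefs() []string {
	if i == nil {
		return nil
	}
	return i.ComponentRefs
}

func main() {
	var missing *input
	fmt.Println(len(missing.GetComponentRefs())) // 0

	present := &input{ComponentRefs: []string{"prometheus"}}
	fmt.Println(present.GetComponentRefs()) // [prometheus]
}
```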

type ValidationMetadata

type ValidationMetadata struct {
	// Name is a human-readable name for this validation.
	Name string `json:"name,omitempty" yaml:"name,omitempty"`

	// Version is the version of this validation specification.
	Version string `json:"version,omitempty" yaml:"version,omitempty"`
}

ValidationMetadata contains validation-level metadata.

type ValidationPhase

type ValidationPhase struct {
	// Timeout is the maximum duration for this phase (e.g., "10m").
	Timeout string `json:"timeout,omitempty" yaml:"timeout,omitempty"`

	// Constraints are phase-level constraints to evaluate.
	Constraints []recipe.Constraint `json:"constraints,omitempty" yaml:"constraints,omitempty"`

	// Checks are named validation checks to run in this phase.
	Checks []string `json:"checks,omitempty" yaml:"checks,omitempty"`

	// NodeSelection defines which nodes to include in validation.
	NodeSelection *NodeSelection `json:"nodeSelection,omitempty" yaml:"nodeSelection,omitempty"`

	// Infrastructure references a componentRef that provides validation infrastructure.
	// Example: "nccl-doctor" for performance testing.
	Infrastructure string `json:"infrastructure,omitempty" yaml:"infrastructure,omitempty"`
}

ValidationPhase represents a single validation phase configuration.
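
A controller that wants to carry a phase Timeout such as "10m" into a Job deadline needs whole seconds. The sketch below assumes the string uses Go duration syntax, which the "10m" example is consistent with but the source does not guarantee; deadlineSeconds is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// deadlineSeconds converts a duration string into whole seconds.
// Assumption: the phase timeout uses Go duration syntax ("10m", "90s").
func deadlineSeconds(timeout string) (int64, error) {
	d, err := time.ParseDuration(timeout)
	if err != nil {
		return 0, fmt.Errorf("parse timeout %q: %w", timeout, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("timeout %q must be positive", timeout)
	}
	return int64(d / time.Second), nil
}

func main() {
	secs, err := deadlineSeconds("10m")
	if err != nil {
		panic(err)
	}
	fmt.Println(secs) // 600
}
```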

type ValidatorCatalog

type ValidatorCatalog struct {
	// APIVersion is the API version (optional, for standalone resource usage).
	APIVersion string `json:"apiVersion,omitempty" yaml:"apiVersion,omitempty"`

	// Kind is always "ValidatorCatalog" (optional, for standalone resource usage).
	Kind string `json:"kind,omitempty" yaml:"kind,omitempty"`

	// Metadata contains catalog metadata (optional, for standalone resource usage).
	Metadata *CatalogMetadata `json:"metadata,omitempty" yaml:"metadata,omitempty"`

	// Validators is the list of validator entries (required).
	Validators []ValidatorEntry `json:"validators" yaml:"validators"`
}

ValidatorCatalog is the top-level catalog document. Supports both standalone file usage (with full metadata) and embedded usage in CRs (metadata omitted).

Standalone usage (catalog.yaml):

apiVersion: validator.nvidia.com/v1alpha1
kind: ValidatorCatalog
metadata:
  name: default
  version: 1.0.0
validators: [...]

Embedded usage (in a CR):

spec:
  catalog:
    validators: [...]

func (*ValidatorCatalog) ForPhase

func (c *ValidatorCatalog) ForPhase(phase Phase) []ValidatorEntry

ForPhase returns validators filtered by phase.

type ValidatorEntry

type ValidatorEntry struct {
	// Name is the unique identifier for this validator, used in Job names.
	Name string `json:"name" yaml:"name"`

	// Phase is the validation phase: "deployment", "performance", or "conformance".
	Phase string `json:"phase" yaml:"phase"`

	// Description is a human-readable description of what this validator checks.
	Description string `json:"description" yaml:"description"`

	// Image is the OCI image reference for the validator container.
	Image string `json:"image" yaml:"image"`

	// Timeout is the maximum execution time for this validator.
	// Maps to Job activeDeadlineSeconds.
	Timeout time.Duration `json:"timeout" yaml:"timeout"`

	// Args are the container arguments.
	Args []string `json:"args,omitempty" yaml:"args,omitempty"`

	// Env are environment variables to set in the container.
	Env []EnvVar `json:"env,omitempty" yaml:"env,omitempty"`

	// Resources specifies container resource requests/limits.
	// If nil, defaults from pkg/defaults are used.
	Resources *ResourceRequirements `json:"resources,omitempty" yaml:"resources,omitempty"`

	// DependencyAffinity declares co-location preferences for the orchestrator
	// pod of this validator. Each entry references a recipe component by name
	// (componentRef) and a label selector matching that component's pods.
	// The deployer resolves the componentRef to a namespace from the resolved
	// recipe at spawn time and emits a podAffinity term on the orchestrator
	// Pod spec. "required" entries hard-fail the run when the referenced
	// component is absent from the recipe; "preferred" entries (default) emit
	// a structured warning and proceed with no affinity term for that
	// dependency.
	//
	// Motivation: ai-service-metrics queries Prometheus over a Service. On
	// clusters with asymmetric pod-to-pod network reachability (e.g.,
	// multi-Security-Group DGXC EKS), the orchestrator must run on a node
	// that can reach the Prometheus pod. Co-locating with the Prometheus pod
	// makes the dial loopback / same-network and removes the dependency on
	// cluster network topology. See https://github.com/NVIDIA/aicr/issues/933.
	DependencyAffinity []DependencyAffinity `json:"dependencyAffinity,omitempty" yaml:"dependencyAffinity,omitempty"`
}

ValidatorEntry defines a single validator container.

func FilterEntriesByValidation

func FilterEntriesByValidation(entries []ValidatorEntry, phase Phase, validationInput *ValidationInput) []ValidatorEntry

FilterEntriesByValidation filters catalog entries based on the validation's declared checks for the given phase. Returns nil if the validation has no phase configuration or no checks declared.
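
A rough sketch of this kind of filtering, using local stand-in types. It assumes a declared check selects the catalog entry with the same name, which the source does not spell out, and it takes the phase's check list directly instead of a ValidationInput; entry and filterByChecks are hypothetical.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for ValidatorEntry.
type entry struct {
	Name  string
	Phase string
}

// filterByChecks keeps entries of the given phase whose name appears in checks.
// Assumption: check names correspond to entry names.
func filterByChecks(entries []entry, phase string, checks []string) []entry {
	if len(checks) == 0 {
		return nil // no checks declared for this phase
	}
	wanted := make(map[string]bool, len(checks))
	for _, c := range checks {
		wanted[c] = true
	}
	var out []entry
	for _, e := range entries {
		if e.Phase == phase && wanted[e.Name] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	catalog := []entry{
		{Name: "gpu-operator", Phase: "deployment"},
		{Name: "nccl-bandwidth", Phase: "performance"},
	}
	fmt.Println(filterByChecks(catalog, "deployment", []string{"gpu-operator"}))
}
```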
