job

package

Version: v0.13.0 (latest). Published: May 15, 2026. License: Apache-2.0. Imports: 26. Imported by: 0.

Repository

github.com/NVIDIA/aicr

Documentation ¶

Index ¶

Constants
func CleanupRBAC(ctx context.Context, clientset kubernetes.Interface, namespace, runID string) error
func ClusterRoleBindingName(runID string) string
func EnsureRBAC(ctx context.Context, clientset kubernetes.Interface, namespace, runID string) error
func ServiceAccountName(runID string) string
type Deployer
- func NewDeployer(clientset kubernetes.Interface, factory informers.SharedInformerFactory, ...) *Deployer

Constants ¶

const (
	// ValidatorContainerName is the required name for the validator container.
	// This is part of the validator package contract to ensure sidecar-safety.
	ValidatorContainerName = "validator"
)

Variables ¶

This section is empty.

Functions ¶

func CleanupRBAC ¶

func CleanupRBAC(ctx context.Context, clientset kubernetes.Interface, namespace, runID string) error

CleanupRBAC removes the per-run ServiceAccount and ClusterRoleBinding. It ignores NotFound errors, so it is idempotent. Call it once at the end of a validation run.

When both deletes fail, the returned StructuredError wraps the joined underlying errors via stderrors.Join so callers can inspect individual failures with errors.Is / errors.As.

func ClusterRoleBindingName ¶

func ClusterRoleBindingName(runID string) string

ClusterRoleBindingName returns the per-run ClusterRoleBinding name. The CRB is cluster-scoped, so name uniqueness across concurrent runs (even in different namespaces) is what prevents cross-run cleanup races.
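The following sketch illustrates why a runID-scoped name avoids cross-run cleanup races. It simulates cluster-scoped objects with a map. The crbName helper and its "aicr-validator-" prefix are hypothetical; the format actually produced by ClusterRoleBindingName is not documented here.

```go
package main

import "fmt"

// crbName is a hypothetical naming scheme; the real ClusterRoleBindingName
// may format its result differently.
func crbName(runID string) string {
	return "aicr-validator-" + runID
}

func main() {
	// cluster simulates cluster-scoped objects, shared by every namespace.
	cluster := map[string]bool{}
	cluster[crbName("run-a")] = true // created by run A
	cluster[crbName("run-b")] = true // created by run B, a different namespace

	// Run A finishes and deletes only the binding that carries its own runID.
	delete(cluster, crbName("run-a"))

	// Run B's binding is still present.
	fmt.Println(cluster[crbName("run-b")])
}
```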

func EnsureRBAC ¶

func EnsureRBAC(ctx context.Context, clientset kubernetes.Interface, namespace, runID string) error

EnsureRBAC applies the ServiceAccount and ClusterRoleBinding for validator Jobs using server-side apply. Call once per validation run before deploying any Jobs. The runID scopes the resource names so overlapping runs do not clobber each other.

func ServiceAccountName ¶

func ServiceAccountName(runID string) string

ServiceAccountName returns the per-run ServiceAccount name used by the validator Jobs deployed for runID. Each `aicr validate` invocation generates a unique runID, so the SA created at run start is the same one deleted at run end — overlapping runs cannot clobber each other.

Types ¶

type Deployer ¶

type Deployer struct {
	// contains filtered or unexported fields
}

Deployer manages the lifecycle of a single validator Job.

func NewDeployer ¶

func NewDeployer(
	clientset kubernetes.Interface,
	factory informers.SharedInformerFactory,
	namespace, runID, cliVersion, cliCommit string,
	entry catalog.ValidatorEntry,
	imagePullSecrets []string,
	tolerations []corev1.Toleration,
	nodeSelector map[string]string,
) *Deployer

NewDeployer creates a Deployer for a single validator catalog entry. The factory must be a namespace-scoped SharedInformerFactory started by the caller. cliVersion is the CLI's own version string; an empty value is acceptable for dev builds. It is forwarded to the validator container via the AICR_CLI_VERSION env var so the validator can resolve images it references outside the catalog (e.g. the AIPerf benchmark image used by inference-perf) using the same rewriting rules as catalog.Load. cliCommit is the git commit SHA, forwarded via AICR_CLI_COMMIT for SHA-based image tag resolution in dev builds.

func (*Deployer) CleanupJob ¶

func (d *Deployer) CleanupJob(ctx context.Context) error

CleanupJob deletes the validator Job with foreground propagation (waits for pod deletion).

func (*Deployer) DeployJob ¶

func (d *Deployer) DeployJob(ctx context.Context) error

DeployJob applies the validator Job using server-side apply. A unique name is generated client-side and stored in d.jobName.

func (*Deployer) ExtractResult ¶

func (d *Deployer) ExtractResult(ctx context.Context) *ctrf.ValidatorResult

ExtractResult reads the exit code, termination message, and stdout from the "validator" container in a completed validator pod.

CONTRACT: The container name MUST be "validator". This is a frozen public contract of the validator package to ensure sidecar-safety — ExtractResult will only read from the "validator" container, ignoring any sidecar containers that may be injected by external controllers (e.g., log streaming, result processing).
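The name-based selection in this contract can be sketched as follows. The containerStatus struct is a simplified, hypothetical stand-in for a Kubernetes container status, and findValidator is an illustrative helper, not the package's implementation.

```go
package main

import "fmt"

// validatorContainerName mirrors the documented ValidatorContainerName constant.
const validatorContainerName = "validator"

// containerStatus is a simplified stand-in for a Kubernetes container status.
type containerStatus struct {
	Name     string
	ExitCode int32
	Message  string
}

// findValidator returns the status of the "validator" container and
// ignores any other (sidecar) containers in the pod.
func findValidator(statuses []containerStatus) (containerStatus, bool) {
	for _, s := range statuses {
		if s.Name == validatorContainerName {
			return s, true
		}
	}
	return containerStatus{}, false
}

func main() {
	statuses := []containerStatus{
		{Name: "log-streamer", ExitCode: 0}, // injected sidecar
		{Name: "validator", ExitCode: 1, Message: "check failed"},
	}
	if s, ok := findValidator(statuses); ok {
		fmt.Println(s.Name, s.ExitCode, s.Message)
	}
}
```

Selecting by name rather than by position keeps the result stable when an external controller injects containers before or after the validator.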

Returns a ValidatorResult regardless of how the container terminated — the caller maps the result to a CTRF status.

This method must be called after WaitForCompletion returns, when the Job is in a terminal state (Complete or Failed).

func (*Deployer) HandleTimeout ¶

func (d *Deployer) HandleTimeout(ctx context.Context) *ctrf.ValidatorResult

HandleTimeout extracts whatever result is available when the orchestrator's wait has timed out. It uses a fresh context because the parent context may already be canceled.

func (*Deployer) JobName ¶

func (d *Deployer) JobName() string

JobName returns the Kubernetes Job name generated by DeployJob. It is empty until DeployJob is called.

func (*Deployer) WaitForCompletion ¶

func (d *Deployer) WaitForCompletion(ctx context.Context, timeout time.Duration) error

WaitForCompletion watches the Job until it reaches a terminal state (Complete or Failed). Returns nil for both — the caller uses ExtractResult to determine pass/fail/skip from the exit code.

Returns an error only for infrastructure failures (watch error, timeout). Job failure (exit != 0) is NOT an error return: that decision lives here in the validator orchestrator, not in the shared pod.WaitForJobTerminal helper, which intentionally treats both Complete and Failed Jobs as legitimate completions and lets the caller classify them.

func (*Deployer) WaitForPodTermination ¶

func (d *Deployer) WaitForPodTermination(ctx context.Context) error

WaitForPodTermination watches the Job's pod until it reaches a terminal state. Prevents RBAC cleanup from racing with in-progress pod operations.

Returns the underlying error from pod.WaitForTermination so callers can decide log severity. A nil error means the pod is gone or terminal; a non-nil error means the wait was abandoned (timeout, watch failure, or repeated watch closures) and the cleanup may race with an in-progress pod.
