package job

v0.12.1 (latest) | Published: May 1, 2026 | License: Apache-2.0 | Imports: 26 | Imported by: 0

Repository

github.com/NVIDIA/aicr

Documentation ¶

Index ¶

Constants
func CleanupRBAC(ctx context.Context, clientset kubernetes.Interface, namespace string) error
func EnsureRBAC(ctx context.Context, clientset kubernetes.Interface, namespace string) error
type Deployer
- func NewDeployer(clientset kubernetes.Interface, factory informers.SharedInformerFactory, ...) *Deployer

Constants ¶

const (
	// ServiceAccountName is the name of the ServiceAccount used by all validator Jobs.
	ServiceAccountName = "aicr-validator"

	// ClusterRoleBindingName is the name of the ClusterRoleBinding that grants
	// cluster-admin to the validator ServiceAccount.
	ClusterRoleBindingName = "aicr-validator"
)

Variables ¶

This section is empty.

Functions ¶

func CleanupRBAC ¶

func CleanupRBAC(ctx context.Context, clientset kubernetes.Interface, namespace string) error

CleanupRBAC deletes the validator ServiceAccount and ClusterRoleBinding. NotFound errors are ignored, so the call is idempotent. Call it once at the end of a validation run.

When both deletes fail, the returned StructuredError wraps the underlying errors joined with stderrors.Join, so callers can still inspect each individual failure with errors.Is / errors.As.
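The error-handling pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: structuredError, errNotFound, errForbidden, and cleanup are hypothetical stand-ins for StructuredError, the Kubernetes API errors, and the two delete calls.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for Kubernetes API errors.
var (
	errNotFound  = errors.New("not found")
	errForbidden = errors.New("forbidden")
)

// structuredError is a hypothetical stand-in for the package's StructuredError.
// Unwrap exposes the wrapped cause so errors.Is / errors.As can traverse it.
type structuredError struct {
	msg   string
	cause error
}

func (e *structuredError) Error() string { return e.msg + ": " + e.cause.Error() }
func (e *structuredError) Unwrap() error { return e.cause }

// cleanup runs every delete, ignores NotFound results (which makes repeated
// calls idempotent), and joins whatever failures remain into one error.
func cleanup(deletes ...func() error) error {
	var errs []error
	for _, del := range deletes {
		if err := del(); err != nil && !errors.Is(err, errNotFound) {
			errs = append(errs, err)
		}
	}
	if len(errs) == 0 {
		return nil
	}
	return &structuredError{msg: "rbac cleanup failed", cause: errors.Join(errs...)}
}

func main() {
	err := cleanup(
		func() error { return fmt.Errorf("delete ServiceAccount: %w", errForbidden) },
		func() error { return errNotFound }, // already gone, so ignored
	)
	fmt.Println(errors.Is(err, errForbidden)) // true
	fmt.Println(errors.Is(err, errNotFound))  // false
}
```

Because errors.Is walks both the single-error Unwrap and the multi-error Unwrap produced by errors.Join, a caller can test for a specific failure without knowing which of the two deletes produced it.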

func EnsureRBAC ¶

func EnsureRBAC(ctx context.Context, clientset kubernetes.Interface, namespace string) error

EnsureRBAC applies the ServiceAccount and ClusterRoleBinding for validator Jobs using server-side apply. Call once per validation run before deploying any Jobs.

Types ¶

type Deployer ¶

type Deployer struct {
	// contains filtered or unexported fields
}

Deployer manages the lifecycle of a single validator Job.

func NewDeployer ¶

func NewDeployer(
	clientset kubernetes.Interface,
	factory informers.SharedInformerFactory,
	namespace, runID, cliVersion, cliCommit string,
	entry catalog.ValidatorEntry,
	imagePullSecrets []string,
	tolerations []corev1.Toleration,
	nodeSelector map[string]string,
) *Deployer

NewDeployer creates a Deployer for a single validator catalog entry. The factory must be a namespace-scoped SharedInformerFactory started by the caller.

cliVersion is the CLI's own version string; an empty value is acceptable for dev builds. It is forwarded to the validator container via the AICR_CLI_VERSION env var so that the validator can resolve images it references outside the catalog (e.g. the AIPerf benchmark image used by inference-perf) using the same rewriting rules as catalog.Load.

cliCommit is the git commit SHA. It is forwarded via the AICR_CLI_COMMIT env var for SHA-based image tag resolution in dev builds.
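On the receiving side, a validator container might read the two forwarded variables roughly as follows. Only the variable names AICR_CLI_VERSION and AICR_CLI_COMMIT come from the documentation above; cliInfo, readCLIInfo, and the fallback order in imageTag are assumptions made for illustration and do not reproduce the rewriting rules of catalog.Load.

```go
package main

import (
	"fmt"
	"os"
)

// cliInfo holds the values forwarded to the validator container.
type cliInfo struct {
	Version string // from AICR_CLI_VERSION; may be empty for dev builds
	Commit  string // from AICR_CLI_COMMIT
}

// readCLIInfo reads both variables through getenv, so tests can supply a
// fake environment instead of mutating the process environment.
func readCLIInfo(getenv func(string) string) cliInfo {
	return cliInfo{
		Version: getenv("AICR_CLI_VERSION"),
		Commit:  getenv("AICR_CLI_COMMIT"),
	}
}

// imageTag is a hypothetical resolution rule: prefer the release version,
// fall back to the commit SHA, and report false when neither is set.
func imageTag(info cliInfo) (string, bool) {
	switch {
	case info.Version != "":
		return info.Version, true
	case info.Commit != "":
		return info.Commit, true
	default:
		return "", false
	}
}

func main() {
	info := readCLIInfo(os.Getenv)
	if tag, ok := imageTag(info); ok {
		fmt.Println("resolved tag:", tag)
	} else {
		fmt.Println("no CLI version or commit forwarded")
	}
}
```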

func (*Deployer) CleanupJob ¶

func (d *Deployer) CleanupJob(ctx context.Context) error

CleanupJob deletes the validator Job with foreground propagation (waits for pod deletion).

func (*Deployer) DeployJob ¶

func (d *Deployer) DeployJob(ctx context.Context) error

DeployJob applies the validator Job using server-side apply. A unique name is generated client-side and stored in d.jobName.
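Server-side apply addresses an object by name, which is why the name has to exist before the request is sent. One way to generate such a name client-side is sketched below. The random-hex suffix, the 63-character cap, and the example prefix are assumptions for illustration; the package's actual naming scheme is not documented here.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// maxNameLen keeps the name short enough to also fit in a label value,
// which Kubernetes limits to 63 characters.
const maxNameLen = 63

// uniqueJobName returns prefix + "-" + 8 random hex characters, trimming
// the prefix when needed so the result never exceeds maxNameLen.
func uniqueJobName(prefix string) (string, error) {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generate suffix: %w", err)
	}
	suffix := "-" + hex.EncodeToString(buf)
	if len(prefix) > maxNameLen-len(suffix) {
		prefix = prefix[:maxNameLen-len(suffix)]
	}
	return prefix + suffix, nil
}

func main() {
	// "aicr-validator-example" is a hypothetical prefix.
	name, err := uniqueJobName("aicr-validator-example")
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```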

func (*Deployer) ExtractResult ¶

func (d *Deployer) ExtractResult(ctx context.Context) *ctrf.ValidatorResult

ExtractResult reads the exit code, termination message, and stdout from a completed validator pod. Returns a ValidatorResult regardless of how the container terminated — the caller maps the result to a CTRF status.

This method must be called after WaitForCompletion returns, when the Job is in a terminal state (Complete or Failed).

func (*Deployer) HandleTimeout ¶

func (d *Deployer) HandleTimeout(ctx context.Context) *ctrf.ValidatorResult

HandleTimeout extracts whatever result is available when the orchestrator's wait has timed out. It uses a fresh context because the parent context may already be canceled.

func (*Deployer) JobName ¶

func (d *Deployer) JobName() string

JobName returns the Kubernetes Job name generated by DeployJob. It is empty until DeployJob is called.

func (*Deployer) WaitForCompletion ¶

func (d *Deployer) WaitForCompletion(ctx context.Context, timeout time.Duration) error

WaitForCompletion watches the Job until it reaches a terminal state (Complete or Failed). Returns nil for both — the caller uses ExtractResult to determine pass/fail/skip from the exit code.

Returns an error only for infrastructure failures (watch error, timeout). A Job failure (non-zero exit code) is NOT returned as an error. That classification lives here in the validator orchestrator, not in the shared pod.WaitForJobTerminal helper, which intentionally treats both Complete and Failed Jobs as legitimate completions and lets the caller classify them.

func (*Deployer) WaitForPodTermination ¶

func (d *Deployer) WaitForPodTermination(ctx context.Context) error

WaitForPodTermination watches the Job's pod until it reaches a terminal state. Prevents RBAC cleanup from racing with in-progress pod operations.

Returns the underlying error from pod.WaitForTermination so callers can decide log severity. A nil error means the pod is gone or terminal; a non-nil error means the wait was abandoned (timeout, watch failure, or repeated watch closures) and the cleanup may race with an in-progress pod.
