aicr

package
v0.15.0 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 15, 2026 License: Apache-2.0 Imports: 24 Imported by: 0

Documentation

Overview

Package aicr is the stable, public Go library surface for external consumers of the AI Cluster Runtime.

External projects should import THIS package and use the types and constructors re-exported here. The underlying pkg/* packages are public and will remain importable, but this facade is the contract the project commits to via semver.

Surface

Client exposes the four end-to-end operations the CLI / server share:

  • ResolveRecipe / ResolveRecipeFromCriteria / ResolveRecipeFromSnapshot and LoadRecipe — produce or load a *RecipeResult.
  • BundleComponents — resolve Helm values and stitched manifests for each component in a *RecipeResult.
  • CollectSnapshot — deploy the snapshotter Job and retrieve a *Snapshot.
  • ValidateState — evaluate a resolved recipe against a snapshot, running deployment / conformance / performance phases.

All facade types (Snapshot, AgentConfig, Criteria, RecipeRequest, RecipeResult, ComponentBundle, ComponentRef, PhaseResult, AllowLists) are facade-owned structs translated to and from the upstream pkg/* shapes, so internal field renames don't churn external callers.

Example

client, err := aicr.NewClient(
    aicr.WithRecipeSource(aicr.FilesystemSource("/etc/aicr/recipes")),
)
if err != nil {
    return err
}
defer client.Close()

result, err := client.ResolveRecipe(ctx, aicr.RecipeRequest{
    Service:     "eks",
    Region:      "us-east-1",
    Accelerator: "h100",
    Nodes:       8, // worker-node count, not GPU count
    Intent:      "training",
})
if err != nil {
    return err
}

Stability

This package's exported API follows semver. The underlying pkg/* packages may introduce breaking changes between minor releases; if you depend on them directly, pin AICR to a patch version and audit upgrades.

Concurrency and Client lifecycle

Each Client owns its own DataProvider and per-DataProvider cached metadata store and component registry. Multiple Clients constructed from different sources can resolve recipes concurrently without clobbering each other, a property multi-tenant consumers (e.g., a controller managing one Client per tenant configuration) rely on. This is a v0.12+ guarantee; earlier facade builds mutated a process-global DataProvider via recipe.SetDataProvider and were unsafe to construct concurrently.

**Retain and reuse Client instances.** The recipe package keys its internal caches on DataProvider identity (pointer-equality of the interface value). Each call to NewClient builds a fresh DataProvider, so two Clients constructed from the same recipe source still produce distinct cache entries and do their own directory walk on first use. Long-running consumers should cache Clients keyed by their configuration (e.g., a content hash of the recipe-source settings) rather than constructing one per request.

**Call Close when done.** When a Client is no longer needed (cache eviction, controller shutdown), call Close to drop its metadata store and component registry from the recipe package's internal caches. Without this, memory grows monotonically with the number of unique DataProviders ever observed.
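The retain-and-close guidance above can be sketched as a small memoiser. This is a minimal illustration under stated assumptions, not AICR code: fakeClient stands in for *aicr.Client, clientCache, configKey and newFakeCache are hypothetical names, and the cache key is simply a SHA-256 of the recipe directory path.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"sync"
)

// fakeClient is a hypothetical stand-in for *aicr.Client; it only records
// whether Close was called.
type fakeClient struct{ closed bool }

func (f *fakeClient) Close() error { f.closed = true; return nil }

// clientCache memoises one client per configuration key.
type clientCache struct {
	mu      sync.Mutex
	clients map[string]io.Closer
	build   func(recipeDir string) (io.Closer, error)
}

// configKey hashes the recipe-source settings into a stable cache key.
func configKey(recipeDir string) string {
	sum := sha256.Sum256([]byte(recipeDir))
	return hex.EncodeToString(sum[:])
}

// get returns the cached client for recipeDir, building it on first use.
func (c *clientCache) get(recipeDir string) (io.Closer, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := configKey(recipeDir)
	if cl, ok := c.clients[key]; ok {
		return cl, nil
	}
	cl, err := c.build(recipeDir)
	if err != nil {
		return nil, err
	}
	c.clients[key] = cl
	return cl, nil
}

// evict drops the client for recipeDir from the cache and closes it.
func (c *clientCache) evict(recipeDir string) error {
	key := configKey(recipeDir)
	c.mu.Lock()
	cl, ok := c.clients[key]
	delete(c.clients, key)
	c.mu.Unlock()
	if !ok {
		return nil
	}
	return cl.Close()
}

// newFakeCache wires the cache to the stand-in client constructor.
func newFakeCache() *clientCache {
	return &clientCache{
		clients: map[string]io.Closer{},
		build:   func(string) (io.Closer, error) { return &fakeClient{}, nil },
	}
}

func main() {
	cache := newFakeCache()
	a, _ := cache.get("/etc/aicr/recipes")
	b, _ := cache.get("/etc/aicr/recipes")
	fmt.Println("reused:", a == b)
	_ = cache.evict("/etc/aicr/recipes")
	fmt.Println("closed:", a.(*fakeClient).closed)
}
```

In a real consumer the build function would presumably call aicr.NewClient, and the key would cover every setting that influences the recipe source.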

See docs/integrator/go-library.md for the integration guide.

Index

Constants

const (
	// CatalogSourceEmbedded is the Source value for built-in OSS overlays.
	CatalogSourceEmbedded = recipe.CatalogSourceEmbedded

	// CatalogSourceExternal is the Source value for overlays loaded via --data.
	CatalogSourceExternal = recipe.CatalogSourceExternal
)

CatalogSource constants for CatalogEntry.Source comparisons.

Variables

This section is empty.

Functions

func SelectFromRecipe

func SelectFromRecipe(r *RecipeResult, selector string) (any, error)

SelectFromRecipe hydrates a resolved recipe and extracts a dot-path selector (e.g. "components.gpu-operator.values.driver.version"). An empty selector returns the entire hydrated structure. Mirrors `aicr query`.

The recipe must have been produced by a Client (so its internal pkg/recipe.RecipeResult is populated). A facade RecipeResult constructed outside ResolveRecipe / LoadRecipe / AdoptRecipe is rejected with ErrCodeInvalidRequest.
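To illustrate what a dot-path selector does, here is a minimal sketch that walks nested maps. It is not the facade's implementation: selectPath and sampleRecipe are hypothetical, the hydrated structure is assumed to be nested map[string]any values, and the version string is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// selectPath walks nested map[string]any values along a dot-separated
// selector. An empty selector returns root unchanged.
func selectPath(root any, selector string) (any, error) {
	if selector == "" {
		return root, nil
	}
	cur := root
	for _, key := range strings.Split(selector, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, fmt.Errorf("segment %q: parent is not a map", key)
		}
		next, ok := m[key]
		if !ok {
			return nil, fmt.Errorf("segment %q: not found", key)
		}
		cur = next
	}
	return cur, nil
}

// sampleRecipe is an illustrative hydrated structure (values are invented).
func sampleRecipe() map[string]any {
	return map[string]any{
		"components": map[string]any{
			"gpu-operator": map[string]any{
				"values": map[string]any{
					"driver": map[string]any{"version": "1.2.3"},
				},
			},
		},
	}
}

func main() {
	v, err := selectPath(sampleRecipe(), "components.gpu-operator.values.driver.version")
	fmt.Println(v, err)
}
```

Splitting on every dot means keys that themselves contain dots cannot be addressed in this sketch.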

func ToInternalAllowLists

func ToInternalAllowLists(al *AllowLists) *recipe.AllowLists

ToInternalAllowLists translates a facade AllowLists into the pkg/recipe.AllowLists enum-typed shape the resolver consumes. The string values are wrapped in the corresponding pkg/recipe enum types without validation; registry-strict mode at resolve time rejects unknown values.

Exposed so in-tree adapters (e.g., the REST handler's pre-check) share the same facade→internal projection as the Client's internal backstop, instead of inlining a parallel mapping that can drift if AllowLists gains a field.

Types

type AgentConfig

type AgentConfig struct {
	Kubeconfig         string
	Namespace          string
	Image              string
	ImagePullSecrets   []string
	JobName            string
	ServiceAccountName string
	NodeSelector       map[string]string
	Tolerations        []corev1.Toleration
	Timeout            time.Duration
	Cleanup            bool
	Output             string
	Debug              bool
	Privileged         bool
	RequireGPU         bool
	RuntimeClassName   string
	TemplatePath       string
	MaxNodesPerEntry   int
	OS                 string
	Requests           corev1.ResourceList
	Limits             corev1.ResourceList
}

AgentConfig is the deployment-time configuration for the snapshot-collection Job passed to Client.CollectSnapshot. Facade-owned; field-for-field mirror of pkg/snapshotter.AgentConfig. Tolerations keep k8s.io/api/core/v1.Toleration since kubernetes/api is itself stable.

type AllowLists

type AllowLists struct {
	// Accelerators is the set of accepted accelerator identifiers
	// (e.g., "h100", "b200"). Empty = accept all.
	Accelerators []string

	// Services is the set of accepted service identifiers
	// (e.g., "eks", "gke"). Empty = accept all.
	Services []string

	// Intents is the set of accepted intent identifiers
	// (e.g., "training", "inference"). Empty = accept all.
	Intents []string

	// OSTypes is the set of accepted OS identifiers
	// (e.g., "ubuntu", "rhel"). Empty = accept all.
	OSTypes []string
}

AllowLists fences which criteria values the resolve path accepts on a Client constructed via WithAllowLists. Facade-owned; the typed-enum fields on pkg/recipe.AllowLists project to plain string slices so the facade does not propagate pkg/recipe's enum identifiers across the semver boundary. A nil receiver, or an AllowLists whose slices are all empty, accepts every value (the documented "no fencing" mode). An "any" value on a Criteria field is always accepted regardless of the allowlist, matching the pkg/recipe behavior.
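The acceptance rule described above (empty list accepts everything, "any" always passes) can be sketched for a single dimension as follows. The accepts helper is hypothetical; exact, case-sensitive matching is an assumption, and the text does not say how an unspecified (empty) value is treated, so the sketch handles it like any other string.

```go
package main

import "fmt"

// accepts reports whether value passes one allowlist dimension:
// a nil or empty list accepts everything, and "any" is always accepted.
func accepts(allowed []string, value string) bool {
	if len(allowed) == 0 || value == "any" {
		return true
	}
	for _, a := range allowed {
		if a == value { // assumption: exact, case-sensitive match
			return true
		}
	}
	return false
}

func main() {
	accelerators := []string{"h100", "b200"}
	fmt.Println(accepts(accelerators, "h100")) // listed value
	fmt.Println(accepts(accelerators, "a100")) // not listed
	fmt.Println(accepts(accelerators, "any"))  // wildcard
	fmt.Println(accepts(nil, "a100"))          // no fencing
}
```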

func ParseAllowListsFromEnv

func ParseAllowListsFromEnv() (*AllowLists, error)

ParseAllowListsFromEnv builds an AllowLists from the AICR_ALLOWED_* environment variables (AICR_ALLOWED_ACCELERATORS, AICR_ALLOWED_SERVICES, AICR_ALLOWED_INTENTS, AICR_ALLOWED_OS). Returns nil when none are set — WithAllowLists treats a nil AllowLists as allow-all. Pass the result to WithAllowLists.

func WrapAllowLists

func WrapAllowLists(al *recipe.AllowLists) *AllowLists

WrapAllowLists projects a pkg/recipe.AllowLists into the facade AllowLists shape. Use at the boundary where in-tree callers parse allowlists from configuration (e.g., recipe.ParseAllowListsFromEnv) and hand them to the facade. Returns nil for nil input.

The facade slices are independent copies; mutating either side after wrap does not affect the other.
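A minimal sketch of the enum-to-string projection with independent copies, covering one dimension only. All types here are hypothetical stand-ins (acceleratorType for a pkg/recipe enum, internalAllowLists and facadeAllowLists for the two shapes); returning nil from toInternal on nil input is an assumption, since only WrapAllowLists documents that behavior.

```go
package main

import "fmt"

// acceleratorType is a hypothetical stand-in for an enum-typed identifier.
type acceleratorType string

type internalAllowLists struct{ Accelerators []acceleratorType }
type facadeAllowLists struct{ Accelerators []string }

// wrap projects the enum-typed shape to plain strings, copying the slice
// so later mutation of either side does not leak into the other.
func wrap(in *internalAllowLists) *facadeAllowLists {
	if in == nil {
		return nil
	}
	out := &facadeAllowLists{Accelerators: make([]string, len(in.Accelerators))}
	for i, a := range in.Accelerators {
		out.Accelerators[i] = string(a)
	}
	return out
}

// toInternal wraps strings in the enum type without validating them.
func toInternal(al *facadeAllowLists) *internalAllowLists {
	if al == nil {
		return nil // assumption: nil in, nil out
	}
	out := &internalAllowLists{Accelerators: make([]acceleratorType, len(al.Accelerators))}
	for i, a := range al.Accelerators {
		out.Accelerators[i] = acceleratorType(a)
	}
	return out
}

func main() {
	internal := &internalAllowLists{Accelerators: []acceleratorType{"h100", "b200"}}
	facade := wrap(internal)
	internal.Accelerators[0] = "changed"
	fmt.Println(facade.Accelerators[0], toInternal(facade).Accelerators[1])
}
```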

type BundleArtifact

type BundleArtifact = *result.Output

BundleArtifact summarizes a completed bundle generation: file count, total size, duration, per-bundler results, and the output directory the files were written to. Transparent alias of pkg/bundler/result.Output (#1078 wraps it). Inspect HasErrors() for non-fatal per-bundler failures; the bundle files themselves are on disk under OutputDir.

type BundleAttester

type BundleAttester = attestation.Attester

BundleAttester signs bundle content. Transparent alias of pkg/bundler/attestation.Attester. The zero value of BundleOptions leaves this nil, in which case MakeBundle uses the bundler's no-op attester (the same default bundler.New applies when --attest is not set).

type BundleConfig

type BundleConfig = config.Config

BundleConfig is the bundler configuration — deployer mode, value overrides, node selectors, tolerations, vendoring, app/chart names, etc. Transparent alias of pkg/bundler/config.Config (the alias is tracked by #1078). Construct one with config.NewConfig(config.WithDeployer(...), ...) — the same builder the CLI bundle command and the REST /v1/bundle handler use, so MakeBundle reproduces their exact output byte-for-byte.

type BundleOptions

type BundleOptions struct {
	// Config carries the bundler configuration (deployer mode, value
	// overrides, node selectors/tolerations, vendoring, app/chart
	// names). When nil, MakeBundle uses config.NewConfig() — the same
	// default bundler.New applies (Helm deployer, no overrides).
	Config *BundleConfig

	// Attester signs bundle content. When nil, MakeBundle uses the
	// no-op attester (matching bundler.New's default when --attest is
	// not set). The CLI builds this via attestation.ResolveAttesterLazy
	// when --attest is passed.
	Attester BundleAttester

	// OutputDir is the directory bundle files are written to. Empty
	// means the current directory ("."), matching Make's default.
	OutputDir string

	// Timeout optionally caps the bundle run. When > 0, MakeBundle wraps
	// the caller's context with context.WithTimeout(ctx, Timeout) so the
	// run is bounded by the smaller of this and any tighter parent
	// deadline. When 0 (the zero value), MakeBundle imposes NO
	// facade-level deadline and runs under the caller's ctx as-is —
	// large bundles, --vendor-charts, and attestation/signing can each
	// exceed a fixed cap. The REST /v1/bundle handler sets this to
	// defaults.BundleHandlerTimeout to preserve its 60s request boundary;
	// the CLI bundle command leaves it 0 so long bundles are uncapped.
	Timeout time.Duration
}

BundleOptions configures a MakeBundle call. It mirrors exactly what bundler.New / (*DefaultBundler).Make accept so the facade reproduces the same full deployer-mode bundle artifact the CLI bundle command and REST /v1/bundle handler produce today.
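The opt-in Timeout behavior documented on BundleOptions can be sketched with the standard context package. boundContext, hasDeadlineWithin and demo are hypothetical helpers written for this illustration; only context.WithTimeout is taken from the text.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// boundContext applies an optional cap: timeout > 0 wraps ctx with
// context.WithTimeout; timeout == 0 returns ctx as-is with a no-op cancel.
func boundContext(ctx context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	if timeout > 0 {
		return context.WithTimeout(ctx, timeout)
	}
	return ctx, func() {}
}

// hasDeadlineWithin reports whether ctx has a deadline no further away than limit.
func hasDeadlineWithin(ctx context.Context, limit time.Duration) bool {
	d, ok := ctx.Deadline()
	return ok && time.Until(d) <= limit
}

// demo returns: whether the uncapped context has a deadline, whether the
// capped context expires within its cap, and whether a tighter parent
// deadline wins over a looser cap.
func demo() (bool, bool, bool) {
	base := context.Background()

	uncapped, cancel0 := boundContext(base, 0)
	defer cancel0()
	_, uncappedHasDeadline := uncapped.Deadline()

	capped, cancel1 := boundContext(base, time.Minute)
	defer cancel1()

	parent, cancelParent := context.WithTimeout(base, time.Second)
	defer cancelParent()
	child, cancel2 := boundContext(parent, time.Minute)
	defer cancel2()

	return uncappedHasDeadline,
		hasDeadlineWithin(capped, time.Minute),
		hasDeadlineWithin(child, time.Second)
}

func main() {
	fmt.Println(demo())
}
```

Because context.WithTimeout never extends a parent's deadline, the effective bound is the smaller of the two without any extra comparison in the caller.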

type CatalogEntry added in v0.15.0

type CatalogEntry struct {
	// Name is the overlay name, e.g. "h100-eks-ubuntu-training".
	Name string `json:"name" yaml:"name"`

	// Criteria is the set of dimensions this overlay targets.
	Criteria Criteria `json:"criteria" yaml:"criteria"`

	// IsLeaf is true when this overlay is a catalog leaf (no other
	// overlay inherits from it).
	IsLeaf bool `json:"is_leaf" yaml:"is_leaf"`

	// Source is the data provenance: "embedded" or "external".
	Source string `json:"source" yaml:"source"`
}

CatalogEntry describes one overlay in the recipe catalog, returned by Client.ListCatalog.

IsLeaf is true when the overlay is a leaf — no other overlay in the catalog lists this one as its spec.base. Leaf overlays are the most specific recipes for a given criteria combination.

Source is one of CatalogSourceEmbedded or CatalogSourceExternal.
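The leaf rule stated above (an overlay is a leaf when no other overlay names it as its base) amounts to a set difference. The sketch below assumes the catalog can be reduced to a name-to-base map; leaves and sampleBases are hypothetical, and the overlay names other than "h100-eks-ubuntu-training" are invented.

```go
package main

import (
	"fmt"
	"sort"
)

// leaves returns the names of overlays that no other overlay uses as base.
// bases maps overlay name to its base overlay name ("" when it has none).
func leaves(bases map[string]string) []string {
	isBase := map[string]bool{}
	for _, base := range bases {
		if base != "" {
			isBase[base] = true
		}
	}
	var out []string
	for name := range bases {
		if !isBase[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out) // deterministic order
	return out
}

// sampleBases is an illustrative three-level inheritance chain.
func sampleBases() map[string]string {
	return map[string]string{
		"base":                     "",
		"eks":                      "base",
		"h100-eks-ubuntu-training": "eks",
	}
}

func main() {
	fmt.Println(leaves(sampleBases()))
}
```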

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client is the single entry point for external Go consumers.

Concurrent ResolveRecipe calls are safe — the Builder itself is thread-safe over its read-only state. The mu guards the small window where Close swaps builder/dp to nil; without it, concurrent ResolveRecipe + Close on the same Client is a data race because the field write in Close is unsynchronised against the field read at the top of ResolveRecipe.

func NewClient

func NewClient(opts ...Option) (*Client, error)

NewClient constructs a Client with the supplied functional options. Callers must provide a recipe source via WithRecipeSource.

For FilesystemSource, the external directory is layered OVER the embedded recipe data: files in the directory override their embedded equivalents, and the directory must include a registry.yaml at its root.
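The layering described above can be pictured as an external-first lookup with an embedded fallback. This sketch uses the standard io/fs and testing/fstest packages with in-memory file systems; readLayered, sampleLayers and mustRead are hypothetical, the path overlays/base.yaml is invented, and the real loader may merge data differently.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// readLayered reads name from external first and falls back to embedded
// only when external does not contain the file.
func readLayered(external, embedded fs.FS, name string) ([]byte, error) {
	data, err := fs.ReadFile(external, name)
	if err == nil {
		return data, nil
	}
	if !errors.Is(err, fs.ErrNotExist) {
		return nil, err
	}
	return fs.ReadFile(embedded, name)
}

// sampleLayers builds two in-memory file systems for illustration.
func sampleLayers() (external, embedded fs.FS) {
	embedded = fstest.MapFS{
		"registry.yaml":      {Data: []byte("embedded registry")},
		"overlays/base.yaml": {Data: []byte("embedded base")},
	}
	external = fstest.MapFS{
		"registry.yaml": {Data: []byte("external registry")},
	}
	return external, embedded
}

// mustRead resolves name through the sample layers and panics on error.
func mustRead(name string) string {
	external, embedded := sampleLayers()
	data, err := readLayered(external, embedded, name)
	if err != nil {
		panic(err)
	}
	return string(data)
}

func main() {
	fmt.Println(mustRead("registry.yaml"))
	fmt.Println(mustRead("overlays/base.yaml"))
}
```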

OCI sources are not yet wired through to the loader and return an ErrCodeUnavailable error from NewClient until that gap is closed.

func (*Client) AdoptRecipe

func (c *Client) AdoptRecipe(ctx context.Context, rec *recipe.RecipeResult) (*RecipeResult, error)

AdoptRecipe wraps a raw pkg/recipe.RecipeResult — typically decoded from an external source such as a REST /v1/bundle POST body — into a Client-owned *RecipeResult ready for MakeBundle. The returned RecipeResult is bound to this Client's DataProvider and owner-stamped, so it passes MakeBundle's ownership and provider-isolation checks exactly as a LoadRecipe result does.

Use this when the caller already holds a fully-hydrated RecipeResult (not a criteria request or a file path) and needs to bundle it through the facade. In-process consumers that resolve via ResolveRecipe / LoadRecipe should use those results directly; AdoptRecipe is for the decode-then-bundle boundary.

func (*Client) BundleComponents

func (c *Client) BundleComponents(ctx context.Context, r *RecipeResult) ([]ComponentBundle, error)

BundleComponents resolves Helm values and rendered manifests for each component in a previously-resolved RecipeResult. The returned slice mirrors r.Components 1:1 — same order, same length — so callers correlate by index.

When to call

Call AFTER ResolveRecipe; pass that call's *RecipeResult unchanged. BundleComponents reads the internal pkg/recipe.RecipeResult that ResolveRecipe attached to the facade RecipeResult — it does NOT re-resolve from criteria. A RecipeResult constructed by the caller (rather than returned from ResolveRecipe) has a nil internal field and BundleComponents returns ErrCodeInvalidRequest.

Per-Client DataProvider isolation

Both values-file reads (Helm components) and manifest-file reads (Helm supplemental + Kustomize) are bound to this Client's own DataProvider via the WithProvider variants on the recipe package (recipe.RecipeResult.GetValuesForComponentWithProvider, recipe.GetManifestContentWithProvider). Two Clients constructed from different recipe sources can BundleComponents concurrently without contaminating each other's bundle output.

History: pre-v0.2 the values and manifest paths short-circuited through recipe.GetDataProvider() — the process-global DataProvider singleton. With two Clients A and B pointing at different sources, an eviction+repopulate sequence on A's cache followed by a B BundleComponents call could return values or manifests resolved against A's recipe source. That gap is closed; the metadata store and component registry were already per-Client at the time and stayed correct throughout, so ResolveRecipe results never drifted.

Synchronization

Read-locks Client.mu so a concurrent Close can't race the values load. The lock is held only across the snapshot of c.builder and c.dp; the values and manifest reads themselves run unlocked (consistent with ResolveRecipe's pattern). The DataProvider snapshot is the per-Client provider this Client owns — the same one its Builder is bound to via recipe.WithDataProvider.

func (*Client) Close

func (c *Client) Close() error

Close releases this Client's cached metadata store and component registry from the recipe package's internal caches. Call when a Client is no longer needed (cache eviction in a higher-level memoiser, controller shutdown) to prevent unbounded memory growth — the recipe package keys its caches on DataProvider identity and does not auto-evict, so a process that observes many distinct recipe sources over time would otherwise grow memory monotonically.

Safe to call on a nil receiver and safe to call multiple times (subsequent calls are no-ops). Always returns nil; the signature matches io.Closer so this can stand in for io.Closer in composite cleanup chains.
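The locking protocol around Close can be sketched as below. This is an illustration of the pattern, not the Client's source: client, builder, newClient and errClosed are hypothetical, and the in-flight sync.WaitGroup is modeled on what the MakeBundle documentation describes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errClosed = errors.New("client is nil or closed")

// builder is a hypothetical stand-in for the read-only recipe builder.
type builder struct{ source string }

func (b *builder) build(query string) string { return b.source + ":" + query }

type client struct {
	mu       sync.RWMutex
	inflight sync.WaitGroup
	builder  *builder
}

func newClient(source string) *client {
	return &client{builder: &builder{source: source}}
}

// Resolve holds the read lock only while snapshotting c.builder and
// registering as in-flight; the work itself runs unlocked.
func (c *client) Resolve(query string) (string, error) {
	if c == nil {
		return "", errClosed
	}
	c.mu.RLock()
	b := c.builder
	if b != nil {
		c.inflight.Add(1)
	}
	c.mu.RUnlock()
	if b == nil {
		return "", errClosed
	}
	defer c.inflight.Done()
	return b.build(query), nil
}

// Close is nil-safe and idempotent. It swaps the builder to nil under the
// write lock, then waits for in-flight calls before returning.
func (c *client) Close() error {
	if c == nil {
		return nil
	}
	c.mu.Lock()
	c.builder = nil
	c.mu.Unlock()
	c.inflight.Wait()
	return nil
}

func main() {
	c := newClient("recipes")
	out, err := c.Resolve("h100")
	fmt.Println(out, err)
	_ = c.Close()
	_ = c.Close() // second call is a no-op
	_, err = c.Resolve("h100")
	fmt.Println(err)
}
```

Registering with the WaitGroup while still holding the read lock ensures every Add happens before Close acquires the write lock, so Wait cannot miss a call that already passed the nil check.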

func (*Client) CollectSnapshot

func (c *Client) CollectSnapshot(ctx context.Context, cfg *AgentConfig) (*Snapshot, error)

CollectSnapshot deploys the snapshotter Job to the cluster identified by cfg.Kubeconfig and returns the captured Snapshot.

CollectSnapshot does NOT consult the Client's recipe data provider — the Client is required only to keep the facade surface uniform (every public operation goes through a Client) and to leave room for future per-Client telemetry hooks or cluster-connection caching without breaking signatures. CollectSnapshot is therefore safe even on a Client whose recipe source is unrelated to the target cluster.

cfg.Kubeconfig is the kubeconfig path (empty for in-cluster configuration). cfg.Namespace, cfg.Image, and cfg.ServiceAccountName must be set; other fields fall back to package defaults documented on snapshotter.AgentConfig.

Errors:

  • ErrCodeInvalidRequest when the Client is nil, cfg is nil, or the Client has been Closed.
  • All snapshotter errors propagate unwrapped — they already carry the appropriate pkg/errors codes (ErrCodeInternal for deployment failures, ErrCodeTimeout for context expiry, etc.).

Concurrent CollectSnapshot calls are safe; each call constructs an independent run.

func (*Client) ComputeHealth added in v0.15.0

func (c *Client) ComputeHealth(ctx context.Context, filter *Criteria) (*health.Report, error)

ComputeHealth scores the structural health of every leaf recipe in this Client's catalog and returns a deterministic *health.Report, optionally narrowed by filter. Computation is delegated wholesale to pkg/health.Compute; this facade only binds the Client's own DataProvider and version so health is scored against the same catalog (including any --data overlays) the Client resolves with — never the process-global embedded catalog.

filter narrows enumeration to leaf overlays carrying every explicitly set criteria dimension; nil scores all leaf combos. Empty/"any" filter dimensions place no constraint.

health.Compute applies its own catalog-wide timeout (defaults.HealthComputeTimeout), so this method does not impose the shorter per-operation timeout the resolve methods use.

Returns ErrCodeInvalidRequest on a nil/closed Client or nil context, and propagates the underlying structured code (or ErrCodeInternal) if health computation fails.

func (*Client) CriteriaRegistry

func (c *Client) CriteriaRegistry() *CriteriaRegistry

CriteriaRegistry returns the per-DataProvider criteria registry for THIS Client. CLI/library callers use it to parse criteria values (so --data overlay contributions validate) and to apply strict mode against the same provider the Client resolves with. Call LoadCatalog first so the registry is seeded from the provider's overlays before parsing.

Returns the registry for this Client's provider via recipe.GetCriteriaRegistryFor. On a nil Client this returns a fresh ephemeral registry so callers can defensively call without nil-checking, matching the lenient nil behavior of the other accessors.

func (*Client) ListCatalog added in v0.15.0

func (c *Client) ListCatalog(ctx context.Context, filter *Criteria) ([]CatalogEntry, error)

ListCatalog returns catalog entries for all overlays known to this Client, optionally narrowed by the filter criteria. Call LoadCatalog first so the catalog is fully populated before calling this.

Each entry carries the overlay name, its criteria, whether it is a leaf (IsLeaf=true means no other overlay inherits from it), and its data provenance ("embedded" or "external"). Entries are returned in ascending name order for deterministic output.

When filter is non-nil, only overlays whose criteria carry the exact values specified in each non-empty/non-"any" filter dimension are returned. Setting a filter dimension to "" or "any" places no constraint on that dimension.
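The filter semantics above (only non-empty, non-"any" dimensions constrain, matches are exact, output is sorted by name) can be sketched as follows. The criteria and entry types are trimmed, hypothetical stand-ins for the facade's Criteria and CatalogEntry, and the sample catalog values are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

type criteria struct{ Service, Accelerator, Intent, OS string }

type entry struct {
	Name     string
	Criteria criteria
}

// constrained reports whether a filter dimension places a constraint.
func constrained(v string) bool { return v != "" && v != "any" }

// matches reports whether c carries the exact value of every constrained
// filter dimension; a nil filter matches everything.
func matches(f *criteria, c criteria) bool {
	if f == nil {
		return true
	}
	pairs := [][2]string{
		{f.Service, c.Service},
		{f.Accelerator, c.Accelerator},
		{f.Intent, c.Intent},
		{f.OS, c.OS},
	}
	for _, p := range pairs {
		if constrained(p[0]) && p[0] != p[1] {
			return false
		}
	}
	return true
}

// list filters the catalog and returns entries in ascending name order.
func list(all []entry, f *criteria) []entry {
	var out []entry
	for _, e := range all {
		if matches(f, e.Criteria) {
			out = append(out, e)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

// sampleCatalog is illustrative data, deliberately unsorted.
func sampleCatalog() []entry {
	return []entry{
		{"h100-gke-cos-training", criteria{"gke", "h100", "training", "cos"}},
		{"h100-eks-ubuntu-training", criteria{"eks", "h100", "training", "ubuntu"}},
	}
}

func main() {
	for _, e := range list(sampleCatalog(), &criteria{Service: "eks", Intent: "any"}) {
		fmt.Println(e.Name)
	}
}
```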

Returns ErrCodeInvalidRequest on a nil or closed Client, and propagates ErrCodeInternal if the underlying metadata store cannot be loaded.

func (*Client) LoadCatalog

func (c *Client) LoadCatalog(ctx context.Context) error

LoadCatalog eagerly loads (and caches) this Client's metadata store, which has the side effect of seeding THIS Client's per-provider criteria registry from every overlay's spec.criteria. Call it before parsing criteria through CriteriaRegistry so values contributed by a FilesystemSource --data overlay are admitted by the registry's lookups.

This mirrors the pre-facade eager recipe.LoadCatalog the CLI ran after SetDataProvider, but seeds the Client's OWN provider registry rather than the process-global one — so two Clients built from different sources keep isolated criteria registries.

Errors propagate with their structured codes preserved (a malformed overlay surfaces as ErrCodeInvalidRequest, not masked as ErrCodeInternal) via PropagateOrWrap.

The same guards as the resolve methods apply: nil receiver and nil context are rejected with ErrCodeInvalidRequest, and a closed Client is rejected.

func (*Client) LoadRecipe

func (c *Client) LoadRecipe(ctx context.Context, path, kubeconfig string) (*RecipeResult, error)

LoadRecipe loads a recipe from a file path (or cm:// ConfigMap URI, honoring kubeconfig) through THIS Client's data provider, and returns it as a Client-owned *RecipeResult ready for ValidateState / BundleComponents. Overlay inputs (kind: RecipeMetadata) are hydrated against the Client's provider, so an external --data overlay resolves against the same recipe source the Client was constructed with rather than the package global. An already-hydrated RecipeResult file is returned with its provider bound to the Client's provider.

The returned RecipeResult is owner-stamped with this Client, so it passes ValidateState / BundleComponents' assertOwns check — same as a RecipeResult produced by ResolveRecipe.

Errors:

  • ErrCodeInvalidRequest when the Client is nil, ctx is nil, path is empty, or the Client has been Closed.
  • All loader errors propagate with their structured codes (e.g., ErrCodeInvalidRequest for an overlay without criteria, ErrCodeInternal for a read or parse failure).

func (*Client) MakeBundle

func (c *Client) MakeBundle(ctx context.Context, recipe *RecipeResult, opts BundleOptions) (BundleArtifact, error)

MakeBundle generates the full deployer-mode bundle for a previously resolved or loaded RecipeResult, writing the bundle files under opts.OutputDir and returning a BundleArtifact summary. Unlike BundleComponents (which returns per-component Helm values + manifests in memory), MakeBundle produces the SAME complete artifact the CLI bundle command emits — README, deploy.sh, per-component directories, checksums — in the deployer layout selected by opts.Config.Deployer() (helm, argocd, argocd-helm, flux, helmfile).

When to call

Call AFTER Client.ResolveRecipe or Client.LoadRecipe; pass that call's *RecipeResult unchanged. MakeBundle bundles from recipe.Resolved() (the full pkg/recipe.RecipeResult), which carries this Client's own DataProvider — so provider-scoped lookups (values files, manifest files) resolve against the Client's recipe source rather than the package global.

Allowlist enforcement

When the Client was constructed WithAllowLists, MakeBundle validates the recipe's criteria against the allowlist before bundling — same fencing the resolve path and the REST /v1/bundle handler apply. A recipe whose criteria fall outside the allowlist is rejected with the allowlist's structured error. A recipe with nil Criteria (a loaded, already-hydrated or bare RecipeResult file) skips the check, matching the handler's `recipeResult.Criteria != nil` guard.

Synchronization

Read-locks Client.mu so a concurrent Close can't race the bundle, and registers in the inflight WaitGroup so Close drains before evicting caches — the same protocol as BundleComponents. A facade-level timeout is opt-in via opts.Timeout: when set (> 0) it bounds the run by the smaller of opts.Timeout and any tighter caller deadline; when unset (0) MakeBundle runs under the caller's context with NO added cap. The REST /v1/bundle handler sets opts.Timeout = defaults.BundleHandlerTimeout to keep its 60s request boundary; the CLI bundle command leaves it 0 so large bundles, --vendor-charts, and attestation/signing are uncapped.

Errors:

  • ErrCodeInvalidRequest when the Client, ctx, or recipe is nil, when recipe lacks internal state (constructed outside Resolve/Load), when the recipe was produced by a different Client, or when the Client has been Closed.
  • Allowlist and bundler errors propagate with their structured codes.

func (*Client) ResolveRecipe

func (c *Client) ResolveRecipe(ctx context.Context, req RecipeRequest) (*RecipeResult, error)

ResolveRecipe maps a RecipeRequest to a concrete validated recipe. It wraps pkg/recipe.Builder.BuildFromCriteria with a stable external request shape so AICR's internal Criteria type can evolve without breaking consumers.

Pinned recipe references (req.PinnedName / req.PinnedVersion) are not yet supported by the facade and return ErrCodeUnavailable. The fields are reserved so callers can adopt them without API churn when the underlying builder gains pinning support.

func (*Client) ResolveRecipeFromCriteria

func (c *Client) ResolveRecipeFromCriteria(ctx context.Context, criteria *Criteria) (*RecipeResult, error)

ResolveRecipeFromCriteria resolves a facade Criteria into a facade RecipeResult. The Components projection mirrors ResolveRecipe; callers needing the full upstream recipe (constraints, deployment order, metadata) access it via the returned result's Resolved() helper.

Use this when the caller already speaks the facade Criteria type (e.g., a REST handler that parsed criteria from an HTTP request and translated via WrapCriteria) rather than the RecipeRequest shape ResolveRecipe takes.

Allowlist enforcement (WithAllowLists) applies here just as it does on the shared resolve path: criteria outside the configured allowlist are rejected before the recipe is built.

The same guards and synchronization as ResolveRecipe apply: nil receiver, nil context, and nil criteria are rejected with ErrCodeInvalidRequest; a closed Client is rejected; a facade-level timeout bounds the resolve.

func (*Client) ResolveRecipeFromSnapshot

func (c *Client) ResolveRecipeFromSnapshot(ctx context.Context, criteria *Criteria, snap *Snapshot) (*RecipeResult, error)

ResolveRecipeFromSnapshot resolves a recipe from explicit Criteria and evaluates its constraints against an observed cluster Snapshot, mirroring `aicr recipe --snapshot`. It returns the facade RecipeResult; callers needing the upstream recipe (ComponentRefs, deployment order, per-constraint evaluation results) access it via Resolved().

Unlike ResolveRecipeFromCriteria — which builds the recipe without observing the cluster — this variant threads a constraint evaluator that runs each resolution constraint against snap via pkg/constraints.Evaluate. The CLI's `recipe --snapshot` path does the same: it derives criteria from the snapshot fingerprint, then calls BuildFromCriteriaWithEvaluator so the resolved recipe records whether each constraint passed against the observed state.

Allowlist enforcement (WithAllowLists) applies here just as it does on the shared resolve path: criteria outside the configured allowlist are rejected before the recipe is built.

The same guards and synchronization as ResolveRecipeFromCriteria apply: nil receiver, nil context, nil criteria, and nil snapshot are rejected with ErrCodeInvalidRequest; a closed Client is rejected; a facade-level timeout bounds the resolve. Builder errors propagate as-is (they already carry the appropriate pkg/errors code) rather than being re-wrapped.

func (*Client) ValidateState

func (c *Client) ValidateState(
	ctx context.Context,
	recipe *RecipeResult,
	snap *Snapshot,
	opts ...ValidateOption,
) ([]*PhaseResult, error)

ValidateState evaluates a resolved recipe against an observed cluster snapshot, runs the selected validation phases (by default PhaseDeployment, PhaseConformance, PhasePerformance) in order, and returns one PhaseResult per phase run. Pass WithValidationPhases to restrict the run to a subset.

recipe must come from a prior Client.ResolveRecipe or Client.LoadRecipe call on this Client; it carries the unexported internal recipe state needed to drive constraint evaluation. Passing a RecipeResult constructed by the caller (or one produced by a different Client whose internal state has since been evicted) returns ErrCodeInvalidRequest.

snap is the Snapshot returned by Client.CollectSnapshot or by any other snapshotter source.

opts configure the validator run. Pass WithValidationNoCluster(true) from unit tests so no Kubernetes resources are created and every check reports as "skipped". WithValidationNamespace, WithValidationRunID, WithValidationCleanup, WithValidationTolerations, WithValidationNodeSelector, and WithValidationPhases cover the production-controller knobs. The validator catalog loads through this Client's own DataProvider, so a Client built from FilesystemSource validates against that recipe source rather than the package global.

Errors:

  • ErrCodeInvalidRequest when the Client, recipe, or snap is nil, when recipe lacks internal state, or when the Client has been Closed.
  • All validator errors propagate unwrapped — readiness-check failures surface as ErrCodeInvalidRequest, infrastructure failures as ErrCodeInternal.

All phases run by default and produce results regardless of earlier failures. Pass WithValidationFailFast(true) to stop after the first failed phase (useful for skipping expensive checks like inference-perf when deployment already failed). Callers wanting per-phase control can reach into pkg/validator.ValidatePhase directly.
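The difference between the default run and fail-fast mode can be sketched with a simple loop. The phase and phaseResult types and runPhases are hypothetical stand-ins, not the validator's API; the phase names follow the deployment / conformance / performance order given above.

```go
package main

import "fmt"

type phaseResult struct {
	Phase  string
	Passed bool
}

type phase struct {
	Name string
	Run  func() bool
}

// runPhases executes phases in order and returns one result per phase run.
// With failFast set it stops after the first failed phase; otherwise every
// phase runs regardless of earlier failures.
func runPhases(phases []phase, failFast bool) []phaseResult {
	var out []phaseResult
	for _, p := range phases {
		r := phaseResult{Phase: p.Name, Passed: p.Run()}
		out = append(out, r)
		if failFast && !r.Passed {
			break
		}
	}
	return out
}

// samplePhases simulates a failing deployment phase followed by two passes.
func samplePhases() []phase {
	return []phase{
		{"deployment", func() bool { return false }},
		{"conformance", func() bool { return true }},
		{"performance", func() bool { return true }},
	}
}

func main() {
	fmt.Println(len(runPhases(samplePhases(), false)), "results by default")
	fmt.Println(len(runPhases(samplePhases(), true)), "results with fail-fast")
}
```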

type ComponentBundle

type ComponentBundle struct {
	// Component is the matching ComponentRef from the recipe.
	Component ComponentRef

	// HelmValues are YAML-encoded Helm values, or nil for
	// non-Helm components.
	HelmValues []byte

	// Manifests are rendered manifest bytes. Non-nil for
	// Kustomize components, and also non-nil for Helm components
	// whose recipe attaches supplemental manifestFiles. nil when
	// the component has no manifest files of its own.
	Manifests []byte
}

ComponentBundle is the resolved deployable artifact for one recipe component. The slice returned by Client.BundleComponents mirrors RecipeResult.Components 1:1 — same order, same length — so callers can correlate by index when threading bundles back through their own state.

Component identity (Name, Kind, Version) duplicates the matching ComponentRef so callers passing bundles around without the original RecipeResult retain enough context to dispatch on kind.

HelmValues vs Manifests population — read carefully, the rule is per-Kind, not "exactly one":

  • Helm components: HelmValues holds YAML-encoded values that downstream consumers pass to `helm install --values`. Manifests MAY ALSO be non-nil when the recipe attaches supplemental manifest files to the Helm component (e.g., gpu-operator's overlay attaches a dcgm-exporter manifest; h100-gke-cos-training attaches gke-nccl-tcpxo manifests). Downstream consumers should apply Manifests alongside the Helm release. Skipping Manifests on a Helm component will silently drop those resources.
  • Kustomize / raw-manifest components: Manifests holds the rendered manifest bytes. HelmValues is nil.
  • Components with neither (rare — a recipe component with no valuesFile, no overrides, and no manifestFiles): both fields are nil; the component is still listed for ordering / status purposes.
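A consumer-side sketch of the population rule in the list above: act on whichever fields are non-nil instead of assuming exactly one is set. The types are trimmed, hypothetical copies of ComponentRef and ComponentBundle, and the step names returned by steps are invented labels.

```go
package main

import "fmt"

type componentRef struct{ Name, Kind, Chart string }

type componentBundle struct {
	Component  componentRef
	HelmValues []byte
	Manifests  []byte
}

// steps lists what a consumer would apply for one bundle. Both steps can
// be returned for a Helm component with supplemental manifests.
func steps(b componentBundle) []string {
	var out []string
	if b.HelmValues != nil {
		out = append(out, "helm-release:"+b.Component.Chart)
	}
	if b.Manifests != nil {
		out = append(out, "apply-manifests:"+b.Component.Name)
	}
	return out
}

// sampleBundles covers the three documented cases with made-up payloads.
func sampleBundles() []componentBundle {
	return []componentBundle{
		{ // Helm component with supplemental manifests
			Component:  componentRef{Name: "gpu-operator", Kind: "Helm", Chart: "gpu-operator"},
			HelmValues: []byte("driver: {}\n"),
			Manifests:  []byte("kind: ConfigMap\n"),
		},
		{ // Kustomize component: manifests only (name is invented)
			Component: componentRef{Name: "example-kustomize", Kind: "Kustomize"},
			Manifests: []byte("kind: Namespace\n"),
		},
		{ // neither field set: listed for ordering only
			Component: componentRef{Name: "nfd", Kind: "Helm", Chart: "node-feature-discovery"},
		},
	}
}

func main() {
	for i, b := range sampleBundles() {
		fmt.Println(i, b.Component.Name, steps(b))
	}
}
```

Iterating by index keeps each bundle aligned with the matching entry in RecipeResult.Components, which is the correlation the text recommends.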

type ComponentRef

type ComponentRef struct {
	// Name is the component identifier, e.g. "gpu-operator".
	Name string

	// Kind is the deployment kind, e.g. "Helm" or "Kustomize".
	Kind string

	// Version is the component chart/manifest version.
	Version string

	// Source is the upstream artifact location: a Helm chart
	// repository URL for Helm components (e.g.
	// "https://helm.ngc.nvidia.com/nvidia"), or a Kustomize source
	// repo for Kustomize components. Empty when the recipe
	// registry leaves it unset.
	Source string

	// Chart is the Helm chart name as it appears in the upstream
	// repository (e.g. "gpu-operator"). Empty for non-Helm
	// components. Defaults to Name when the registry leaves it
	// unset.
	Chart string

	// Namespace is the install namespace recommended by the recipe
	// (e.g. "gpu-operator"). Consumers SHOULD honor it so the
	// deployed layout matches what AICR validation expects to find.
	// Empty when the recipe leaves it unset.
	Namespace string
}

ComponentRef identifies a deployable recipe component.

The Name/Chart distinction matters: Name is AICR's identifier (e.g. "nfd"), while Chart is the Helm chart name (e.g. "node-feature-discovery"). Most components have Name == Chart, but the registry's helm.defaultChart override allows them to differ. Consumers building Helm Releases must use Chart, not Name, as spec.forProvider.chart.name.
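As a rough illustration of the Name/Chart split outside a Release resource, the sketch below builds `helm install` arguments from a ComponentRef-shaped value. componentRef is a local mirror of the fields shown above, helmInstallArgs is a hypothetical helper, and the version, repository URL, and namespace in main are placeholders. It assumes Source is an HTTP(S) chart repository URL, which is what `--repo` expects.

```go
package main

import (
	"fmt"
	"strings"
)

// componentRef mirrors the ComponentRef fields shown above.
type componentRef struct {
	Name, Kind, Version, Source, Chart, Namespace string
}

// helmInstallArgs is a hypothetical helper. The release is named
// after Name (AICR's identifier); the chart argument is Chart.
func helmInstallArgs(c componentRef) []string {
	args := []string{"install", c.Name, c.Chart}
	if c.Source != "" {
		args = append(args, "--repo", c.Source)
	}
	if c.Version != "" {
		args = append(args, "--version", c.Version)
	}
	if c.Namespace != "" {
		args = append(args, "--namespace", c.Namespace, "--create-namespace")
	}
	return args
}

func main() {
	nfd := componentRef{
		Name:      "nfd",
		Kind:      "Helm",
		Version:   "0.0.0-example",                  // placeholder
		Source:    "https://charts.example.invalid", // placeholder
		Chart:     "node-feature-discovery",
		Namespace: "node-feature-discovery", // placeholder
	}
	fmt.Println("helm " + strings.Join(helmInstallArgs(nfd), " "))
}
```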

type Criteria

type Criteria struct {
	Service     string
	Accelerator string
	Intent      string
	OS          string
	Platform    string
	Nodes       int
}

Criteria is the facade-owned, semver-stable shape of a recipe-resolution query. It mirrors pkg/recipe.Criteria field-for-field, with the enum-typed pkg/recipe values projected to plain strings so that the facade contract does not pin consumers to pkg/recipe's enum identifiers (an internal enum rename or addition stays internal). Construct one directly or wrap an upstream pkg/recipe.Criteria via WrapCriteria.

Field meanings match the pkg/recipe.Criteria documentation:

  • Service: Kubernetes service flavor (eks/gke/aks/oke/kind/lke/bcm).
  • Accelerator: GPU model identifier (h100/h200/b200/gb200/a100/l40/rtx-pro-6000).
  • Intent: workload intent (training/inference).
  • OS: worker-node OS (ubuntu/rhel/cos/amazonlinux/talos).
  • Platform: framework overlay (dynamo/kubeflow/nim/runai/slurm).
  • Nodes: worker-node count hint (0 = unspecified).

Empty string is the "unspecified" sentinel for every field except Nodes, where 0 plays that role. A non-empty string that the registry does not recognize is rejected at resolve time with ErrCodeInvalidRequest.
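The sentinel rule can be made concrete with a small sketch. criteria is a local mirror of the struct above and specifiedFields is a hypothetical helper; neither is part of the aicr API.

```go
package main

import "fmt"

// criteria mirrors the Criteria fields shown above (local stand-in).
type criteria struct {
	Service     string
	Accelerator string
	Intent      string
	OS          string
	Platform    string
	Nodes       int
}

// specifiedFields names the fields a caller actually set:
// "" means unspecified for the string fields, 0 for Nodes.
func specifiedFields(c criteria) []string {
	var out []string
	stringFields := []struct{ name, value string }{
		{"Service", c.Service},
		{"Accelerator", c.Accelerator},
		{"Intent", c.Intent},
		{"OS", c.OS},
		{"Platform", c.Platform},
	}
	for _, f := range stringFields {
		if f.value != "" {
			out = append(out, f.name)
		}
	}
	if c.Nodes != 0 {
		out = append(out, "Nodes")
	}
	return out
}

func main() {
	c := criteria{Service: "eks", Accelerator: "h100", Intent: "training", Nodes: 8}
	fmt.Println(specifiedFields(c)) // OS and Platform are left unspecified
}
```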

func WrapCriteria

func WrapCriteria(c *recipe.Criteria) *Criteria

WrapCriteria projects a pkg/recipe.Criteria into the facade Criteria shape. Use this at the boundary where in-tree callers (CLI/API handlers) hand a parsed criteria — produced by recipe.ParseCriteriaFromRequest or recipe.BuildCriteriaWithRegistry — to facade methods such as Client.ResolveRecipeFromCriteria. Returns nil for nil input.

Round-trip: WrapCriteria(c) then toInternalCriteria projects back to the pkg/recipe.Criteria enum-typed shape; the round-trip is lossless because the facade carries plain strings for the same set of named enum fields (Service/Accelerator/Intent/OS/Platform) plus Nodes.

type CriteriaRegistry

type CriteriaRegistry = recipe.CriteriaRegistry

CriteriaRegistry is the per-DataProvider set of valid criteria values, returned by Client.CriteriaRegistry so CLI/library callers parse and validate criteria against the SAME provider the Client resolves with.

Intentionally kept as a transparent alias of pkg/recipe.CriteriaRegistry rather than wrapped into a facade-owned type, for two reasons:

  1. The registry is behavior-rich (ParseService/ParseAccelerator/..., SetStrict, Values, AllAcceleratorTypes, etc.) — wrapping it would require translating every method through, with no semver win because these methods are already used to construct pkg/recipe.Criteria instances in CLI / API call paths.
  2. The registry carries mutable shared state (strict mode, registered values) keyed by per-Client DataProvider identity. A facade wrapper would either copy state (breaking the per-Client identity coupling) or hold a pointer (no isolation win over the alias).

External callers receive the same pkg/recipe.CriteriaRegistry the Client's resolve path uses. As a result, this alias is the one place where a change to the underlying API would reach facade consumers; if that happens, the facade can absorb the change by introducing a hand-written wrapper at that point.

type Option

type Option func(*Client)

Option configures a Client.

func WithAllowLists

func WithAllowLists(al *AllowLists) Option

WithAllowLists fences which criteria values the Client's resolve path accepts. A resolve whose criteria fall outside the allowlist is rejected before the recipe is built. Pass nil (or omit the option) to allow all values. Construct an AllowLists directly or via ParseAllowListsFromEnv.

func WithRecipeSource

func WithRecipeSource(s RecipeSourceOption) Option

WithRecipeSource sets the recipe source on the Client. Construct the argument with OCISource or FilesystemSource.

func WithVersion

func WithVersion(version string) Option

WithVersion sets the version string stamped into resolved recipe metadata (RecipeResult.Metadata.Version). Threaded through to the underlying recipe.Builder via recipe.WithVersion. Typically the consuming binary's build version.

type Phase

type Phase string

Phase identifies a single validation phase. Facade-owned so the stable surface does not propagate pkg/validator type-shape changes. Values match pkg/validator/v1 constants verbatim for direct wire compatibility.

const (
	PhaseDeployment  Phase = "deployment"
	PhasePerformance Phase = "performance"
	PhaseConformance Phase = "conformance"
)

Validation phases — string values match pkg/validator/v1 so wire round-trips between facade and validator are byte-identical.

type PhaseResult

type PhaseResult struct {
	Phase     Phase
	Status    string
	Duration  time.Duration
	Summary   ReportSummary
	RawReport []byte
	Report    *ctrf.Report
}

PhaseResult is the outcome of running all validators in a single phase. Facade-owned. Summary holds the CTRF count breakdown for the common pass/fail check; RawReport carries the marshaled CTRF JSON for callers needing per-test detail; Report is the typed CTRF report retained for in-tree consumers that merge per-phase reports via ctrf.MergeReports.

type RecipeRequest

type RecipeRequest struct {
	// Service is the target Kubernetes service identifier, e.g.
	// "eks", "gke", "aks", "oke", "kind", "lke", or "any". Mapped
	// to pkg/recipe CriteriaService. Note that this is the K8s
	// FLAVOR (eks vs gke), not the cloud vendor (aws vs gcp);
	// callers that think in cloud-vendor terms must map first
	// (aws→eks, gcp→gke, etc.).
	Service string

	// Region is the cloud region. Informational only — not part of
	// pkg/recipe.Criteria today; captured on the request so consumers
	// can audit the call without a separate field.
	Region string

	// Accelerator is the GPU model identifier, e.g. "h100", "b200".
	Accelerator string

	// Nodes is the worker-node count hint. Mapped to CriteriaNodes.
	// Note that this is the NUMBER OF NODES, not the number of
	// accelerators — a 64-GPU cluster on 8-GPU nodes has Nodes=8.
	// Zero means "unspecified, AICR picks the default-sized recipe."
	// Negative values are rejected with ErrCodeInvalidRequest.
	Nodes int32

	// Intent is the workload intent. Mapped to CriteriaIntent.
	// Supported values are defined by pkg/recipe.GetCriteriaIntentTypes
	// — today "training" and "inference".
	Intent string

	// OS is the worker-node operating system. Mapped to CriteriaOS.
	// Supported values: "ubuntu", "rhel", "cos", "amazonlinux".
	// Empty means "unspecified" — recipe resolution will not select
	// OS-pinned leaf overlays (e.g., h100-eks-ubuntu-training,
	// h100-gke-cos-training) and will fall back to the OS-agnostic
	// ancestor. Set this when the cluster's OS is known so OS-specific
	// constraints and mixins (kernel version, driver tuning) are
	// included.
	OS string

	// Platform is the workload platform overlay. Mapped to
	// CriteriaPlatform. Supported values are defined by
	// pkg/recipe.GetCriteriaPlatformTypes — today "", "any", "dynamo",
	// "kubeflow", "nim".
	Platform string

	// PinnedName reserves space for future pinned-recipe support.
	// Currently rejected with ErrCodeUnavailable; set the criteria
	// fields above instead.
	PinnedName string

	// PinnedVersion reserves space for future pinned-recipe support.
	// Currently rejected with ErrCodeUnavailable.
	PinnedVersion string
}

RecipeRequest is the stable external request shape. The Client translates this into pkg/recipe.Criteria.
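Two of the field comments above describe conversions a caller may need before filling in a request: cloud vendor to Kubernetes flavor, and GPU count to node count. A minimal sketch follows. Both helpers are hypothetical, and only the aws→eks and gcp→gke pairs come from the field comment; the azure and oci keys are assumptions added for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// vendorToService maps a cloud vendor to the Kubernetes flavor that
// RecipeRequest.Service expects. The "azure" and "oci" keys are
// assumptions of this sketch.
var vendorToService = map[string]string{
	"aws":   "eks",
	"gcp":   "gke",
	"azure": "aks",
	"oci":   "oke",
}

// serviceForVendor returns the Service value for a cloud vendor.
func serviceForVendor(vendor string) (string, error) {
	svc, ok := vendorToService[vendor]
	if !ok {
		return "", fmt.Errorf("no service mapping for vendor %q", vendor)
	}
	return svc, nil
}

// nodesForGPUs converts a total GPU count into a node count,
// rounding up so a partially filled node still counts as a node.
func nodesForGPUs(totalGPUs, gpusPerNode int32) (int32, error) {
	if totalGPUs < 0 || gpusPerNode <= 0 {
		return 0, errors.New("GPU counts must be non-negative, with at least one GPU per node")
	}
	return (totalGPUs + gpusPerNode - 1) / gpusPerNode, nil
}

func main() {
	svc, err := serviceForVendor("aws")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	nodes, err := nodesForGPUs(64, 8)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Service=%s Nodes=%d\n", svc, nodes)
}
```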

type RecipeResult

type RecipeResult struct {
	// Name is a stable identifier derived from the resolved criteria.
	// Because AICR recipes are keyed by criteria (not by a standalone
	// name), this field is the criteria string representation rather
	// than an independent label.
	Name string

	// Version is the recipe metadata version (set by the CLI that
	// generated the recipe data).
	Version string

	// TranslatedAt is the wall-clock time the facade completed the
	// translation of the internal RecipeResult into this shape. This
	// is NOT the time the underlying recipe was built — AICR's
	// internal RecipeResult currently carries no build timestamp.
	TranslatedAt time.Time

	// Components lists the deployable components in the recipe.
	Components []ComponentRef
	// contains filtered or unexported fields
}

RecipeResult is the stable external result shape.

func (*RecipeResult) Resolved

func (r *RecipeResult) Resolved() *recipe.RecipeResult

Resolved returns the complete underlying recipe (the full pkg/recipe.RecipeResult) that this result wraps. The facade RecipeResult exposes only Name/Version/TranslatedAt/Components; callers that need constraints, validation config, deployment order, or metadata (e.g. evidence emission) use this. Returns nil if the result was not produced by the Client.

Lifetime: the returned pointer is borrowed from the facade RecipeResult. Do not mutate; do not retain past the facade RecipeResult's lifetime. Marshal/serialize first if persistence is needed.

type RecipeSourceOption

type RecipeSourceOption struct {
	// contains filtered or unexported fields
}

RecipeSourceOption identifies where recipes are sourced from.

func EmbeddedSource

func EmbeddedSource() RecipeSourceOption

EmbeddedSource uses only AICR's built-in (embedded) recipe data, no overlay.

func FilesystemSource

func FilesystemSource(path string) RecipeSourceOption

FilesystemSource describes a local filesystem path containing AICR recipes.

func OCISource

func OCISource(registry, tag string) RecipeSourceOption

OCISource describes an OCI registry containing AICR recipes.

The tag is optional; if empty, "latest" is assumed by the downstream loader.

type ReportSummary

type ReportSummary struct {
	Tests   int
	Passed  int
	Failed  int
	Skipped int
	Pending int
	Other   int
}

ReportSummary is the high-level pass/fail count breakdown of a validation phase's CTRF report. Facade-owned (not aliased to ctrf.Summary); fields mirror the CTRF spec summary contract.
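One way a caller might use these counts for a quick pass/fail decision is sketched below. reportSummary is a local mirror of the struct above. Both helpers are hypothetical: treating a run with zero tests as "not passed" is a policy choice of this sketch, and the consistency check assumes Tests is the total of the other five counts.

```go
package main

import "fmt"

// reportSummary mirrors the ReportSummary fields shown above.
type reportSummary struct {
	Tests   int
	Passed  int
	Failed  int
	Skipped int
	Pending int
	Other   int
}

// phasePassed applies a simple policy: at least one test ran and
// none failed. Skipped and pending tests do not count as failures.
func phasePassed(s reportSummary) bool {
	return s.Tests > 0 && s.Failed == 0
}

// countsConsistent checks that the per-status counts add up to Tests.
func countsConsistent(s reportSummary) bool {
	return s.Tests == s.Passed+s.Failed+s.Skipped+s.Pending+s.Other
}

func main() {
	s := reportSummary{Tests: 5, Passed: 4, Skipped: 1}
	fmt.Println("passed:", phasePassed(s), "consistent:", countsConsistent(s))
}
```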

type Snapshot

type Snapshot struct {
	APIVersion string
	Kind       string
	CapturedAt time.Time
	// contains filtered or unexported fields
}

Snapshot is the captured cluster-state artifact returned by Client.CollectSnapshot. Facade-owned so the stable surface does not propagate pkg/snapshotter type-shape changes. APIVersion / Kind / CapturedAt are the high-level identifying metadata; the full measurement payload is held in an unexported internal field for zero-copy round-trip through ValidateState. Consumers needing measurement-level inspection import pkg/snapshotter directly.

func WrapSnapshot

func WrapSnapshot(s *snapshotter.Snapshot) *Snapshot

WrapSnapshot wraps a pkg/snapshotter.Snapshot in the facade Snapshot type so callers that load snapshots externally (e.g., the CLI reading a YAML file) can pass them to facade methods. Returns nil for nil input.

type ValidateOption

type ValidateOption func(*validateConfig)

ValidateOption configures a validation run launched via Client.ValidateState. It is a facade-owned functional option type: each WithValidation* factory below captures its argument into an internal validateConfig, and Client.ValidateState translates the captured config into pkg/validator options at call time.

The wrap insulates the facade's semver contract from pkg/validator's own evolving Option signature. Adding a field to pkg/validator's Validator struct, renaming validator.WithXxx, or changing the validator.Option function signature can all be absorbed inside the translation function without breaking facade consumers.
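The capture-then-translate pattern described above can be sketched with local stand-ins. Every name below is hypothetical: the field names inside validateConfig are not shown in this documentation, and the defaults are taken from the option descriptions for WithValidationNamespace, WithValidationCleanup, and WithValidationFailFast.

```go
package main

import "fmt"

// validateConfig stands in for the facade's unexported config.
// Field names are assumptions of this sketch.
type validateConfig struct {
	namespace string
	cleanup   bool
	failFast  bool
}

// validateOption mirrors the shape of ValidateOption.
type validateOption func(*validateConfig)

func withNamespace(ns string) validateOption {
	return func(c *validateConfig) { c.namespace = ns }
}

func withCleanup(cleanup bool) validateOption {
	return func(c *validateConfig) { c.cleanup = cleanup }
}

func withFailFast(failFast bool) validateOption {
	return func(c *validateConfig) { c.failFast = failFast }
}

// newValidateConfig starts from the documented defaults and lets
// each option overwrite the field it captured.
func newValidateConfig(opts ...validateOption) validateConfig {
	cfg := validateConfig{namespace: "aicr-validation", cleanup: true, failFast: false}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := newValidateConfig(withCleanup(false), withFailFast(true))
	fmt.Printf("%+v\n", cfg)
}
```

Because callers only ever see the option constructors, the config struct and its translation into validator options can change without touching call sites.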

func WithValidationCleanup

func WithValidationCleanup(cleanup bool) ValidateOption

WithValidationCleanup controls whether validator-emitted Jobs, ConfigMaps, and RBAC are deleted at the end of the run. Default: true. Set to false to leave artifacts behind for post-mortem inspection.

func WithValidationCommit

func WithValidationCommit(commit string) ValidateOption

WithValidationCommit sets the git commit SHA threaded into the validator (validator.WithCommit). Used to resolve dev-build validator images to SHA-tagged images. An empty string is the "unset" sentinel — no validator option is emitted, matching the validator's own behavior where an empty commit influences nothing.

func WithValidationFailFast added in v0.15.0

func WithValidationFailFast(failFast bool) ValidateOption

WithValidationFailFast controls whether ValidateState stops after the first phase that reports StatusFailed. Default: false (all phases run and produce results). Set true to restore stop-on-first-failure behavior.

func WithValidationImagePullSecrets

func WithValidationImagePullSecrets(secrets []string) ValidateOption

WithValidationImagePullSecrets sets imagePullSecrets on the validator pods. Use this when the validator images live in a private registry whose credentials live in a Secret in the validation namespace.

The input is defensively copied, so a caller that mutates the slice after this returns won't race with ValidateState reading it on a goroutine. A nil input is stored as nil (preserving the "unset" sentinel downstream); an empty input is stored as an empty, non-nil copy.
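A minimal sketch of such a nil-preserving defensive copy is shown below. copySecrets is a hypothetical helper written for illustration, not the facade's actual implementation.

```go
package main

import "fmt"

// copySecrets returns an independent copy of in. A nil input stays
// nil so "unset" remains distinguishable from "explicitly empty".
func copySecrets(in []string) []string {
	if in == nil {
		return nil
	}
	out := make([]string, len(in))
	copy(out, in)
	return out
}

func main() {
	secrets := []string{"regcred"}
	stored := copySecrets(secrets)
	secrets[0] = "changed-later" // caller mutates after handing off
	fmt.Println("stored value:", stored[0])
	fmt.Println("nil stays nil:", copySecrets(nil) == nil)
	fmt.Println("empty stays non-nil:", copySecrets([]string{}) != nil)
}
```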

func WithValidationImageRegistryOverride

func WithValidationImageRegistryOverride(registry string) ValidateOption

WithValidationImageRegistryOverride overrides the registry prefix on validator container images (validator.WithImageRegistryOverride), e.g. to point at a local registry mirror. Empty means "no override" — the validator keeps its default registry.

func WithValidationImageTagOverride

func WithValidationImageTagOverride(tag string) ValidateOption

WithValidationImageTagOverride overrides the tag on every validator container image (validator.WithImageTagOverride), intended for feature-branch dev builds whose commit SHA has no published image. Empty means "no override" — the validator keeps its resolved tag.

func WithValidationNamespace

func WithValidationNamespace(namespace string) ValidateOption

WithValidationNamespace sets the Kubernetes namespace where validation Jobs run. Default: "aicr-validation".

func WithValidationNoCluster

func WithValidationNoCluster(noCluster bool) ValidateOption

WithValidationNoCluster enables dry-run mode: no Kubernetes resources are created, all checks report as "skipped - no-cluster mode (test mode)". Constraints are still evaluated inline (they don't need cluster access). Use this for unit tests that exercise the facade surface without a live cluster.

func WithValidationNodeSelector

func WithValidationNodeSelector(nodeSelector map[string]string) ValidateOption

WithValidationNodeSelector passes a node selector through to the validation workload pods. Use when GPU nodes carry non-standard labels and the platform-default selector wouldn't match. Does NOT affect the orchestrator Job itself.

The input is defensively copied; without this, a caller mutating the map after handing off would race with the validator's map iteration (potential "concurrent map iteration and map write" panic in serializeNodeSelector).

func WithValidationPhases

func WithValidationPhases(phases ...Phase) ValidateOption

WithValidationPhases restricts the run to the named phases, in the order given. Valid values are PhaseDeployment, PhasePerformance, and PhaseConformance. When omitted (or called with no phases), all phases run in their canonical order — the default behavior. ValidateState rejects any unrecognized phase value with ErrCodeInvalidRequest before touching the cluster, so a typo cannot silently produce an empty run.

The input is defensively copied so a caller mutating the slice after this returns won't race with ValidateState reading it.
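The reject-before-running behavior can be sketched as follows. phase and its constants are local mirrors of the Phase type, and validatePhases is a hypothetical helper; the real check lives inside ValidateState and reports ErrCodeInvalidRequest rather than a plain error.

```go
package main

import "fmt"

// phase mirrors the facade Phase type and its three values.
type phase string

const (
	phaseDeployment  phase = "deployment"
	phasePerformance phase = "performance"
	phaseConformance phase = "conformance"
)

// validatePhases returns a copy of phases in the order given, or an
// error naming the first unrecognized value.
func validatePhases(phases []phase) ([]phase, error) {
	out := make([]phase, 0, len(phases))
	for _, p := range phases {
		switch p {
		case phaseDeployment, phasePerformance, phaseConformance:
			out = append(out, p)
		default:
			return nil, fmt.Errorf("unknown validation phase %q", p)
		}
	}
	return out, nil
}

func main() {
	ok, err := validatePhases([]phase{phaseConformance, phaseDeployment})
	fmt.Println(ok, err)

	_, err = validatePhases([]phase{"deploymnt"}) // typo is caught up front
	fmt.Println(err)
}
```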

func WithValidationRunID

func WithValidationRunID(runID string) ValidateOption

WithValidationRunID overrides the auto-generated identifier shared across the Jobs and resources produced by a single validation run. Use this to make repeated runs in the same namespace distinguishable (e.g., a controller's reconcile-key suffix).

func WithValidationTimeout

func WithValidationTimeout(d time.Duration) ValidateOption

WithValidationTimeout opts into a facade-level deadline for the ValidateState run. By default (option unset) ValidateState wraps the caller's context with defaults.ValidationOperationTimeout (75m), which suits controllers that pass an unbounded context. Pass a positive duration to set an explicit cap, or 0 to impose NO facade cap — the run then proceeds under the caller's context unchanged so per-validator timeouts (e.g. the 65m inference-perf check) govern. The CLI validate command passes 0 so an all-phase run isn't cut short by a fixed cap.
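The three cases (option unset, positive duration, zero) can be sketched with the standard context package. All names below are hypothetical: defaultCap stands in for defaults.ValidationOperationTimeout, and treating a negative duration like zero is an assumption, since the text above does not say how negative values are handled.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// defaultCap stands in for defaults.ValidationOperationTimeout.
const defaultCap = 75 * time.Minute

// effectiveCap returns the facade-level cap and whether one applies.
// set reports whether the timeout option was supplied at all.
func effectiveCap(set bool, d time.Duration) (time.Duration, bool) {
	switch {
	case !set:
		return defaultCap, true
	case d > 0:
		return d, true
	default:
		// Zero means "no facade cap". Handling negative values the
		// same way is an assumption of this sketch.
		return 0, false
	}
}

// boundContext wraps ctx with the effective cap, or returns it
// unchanged (with a no-op cancel) when no cap applies.
func boundContext(ctx context.Context, set bool, d time.Duration) (context.Context, context.CancelFunc) {
	if limit, ok := effectiveCap(set, d); ok {
		return context.WithTimeout(ctx, limit)
	}
	return ctx, func() {}
}

func main() {
	ctx, cancel := boundContext(context.Background(), true, 0)
	defer cancel()
	_, hasDeadline := ctx.Deadline()
	fmt.Println("deadline with explicit 0:", hasDeadline)

	ctx, cancelDefault := boundContext(context.Background(), false, 0)
	defer cancelDefault()
	_, hasDeadline = ctx.Deadline()
	fmt.Println("deadline when unset:", hasDeadline)
}
```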

func WithValidationTolerations

func WithValidationTolerations(tolerations []corev1.Toleration) ValidateOption

WithValidationTolerations passes tolerations through to the validation workload pods (e.g. NCCL benchmark pods). Does NOT affect the orchestrator Job itself, which runs with snapshotter.DefaultTolerations.

The input is defensively copied; mutation after this returns won't race with downstream serialization on a validator goroutine.
