Dynamic Accelerator Slicer (DAS) Operator
Dynamic Accelerator Slicer (DAS) is an operator that dynamically partitions GPU accelerators in Kubernetes and
OpenShift. It currently ships with a reference implementation for NVIDIA Multi-Instance GPU (MIG) and is designed to
support additional technologies such as NVIDIA MPS or GPUs from other vendors.
Minimum supported OpenShift versions: 4.18.21 and 4.19.6.
Features
- On-demand partitioning of GPUs via a custom Kubernetes operator.
- Scheduler integration that allocates NVIDIA MIG slices through a plugin located at
pkg/scheduler/plugins/mig/mig.go.
- AllocationClaim custom resource to track slice reservations
  (pkg/apis/dasoperator/v1alpha1/allocation_types.go).
- Emulated mode to exercise the workflow without real hardware.
Getting Started
This project uses just for task automation.
Install just first:
# On macOS
brew install just
# On Fedora/RHEL
dnf install just
# On Ubuntu/Debian
apt install just
# Or via cargo
cargo install just
Quick Start
- Configure your images by editing related_images.your-username.json with your registry:
[
{"name": "instaslice-operator-next", "image": "quay.io/your-username/instaslice-operator:latest"},
{"name": "instaslice-webhook-next", "image": "quay.io/your-username/instaslice-webhook:latest"},
{"name": "instaslice-scheduler-next", "image": "quay.io/your-username/instaslice-scheduler:latest"},
{"name": "instaslice-daemonset-next", "image": "quay.io/your-username/instaslice-daemonset:latest"}
]
- Build and push all images:
just build-push-parallel
- Deploy to OpenShift (with emulated mode for development):
export EMULATED_MODE=enabled
export RELATED_IMAGES=related_images.your-username.json
just deploy-das-ocp
- Test the installation:
kubectl apply -f test/test-pod-emulated.yaml
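After applying the manifest, confirm that the pod schedules and that a slice reservation was created for it (the same
commands appear in the Debugging section below):
kubectl get pods
kubectl get allocationclaims -n das-operator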
OpenShift with GPU Hardware
For OpenShift clusters with GPU hardware:
- Deploy prerequisites:
just deploy-cert-manager-ocp
just deploy-nfd-ocp
just deploy-nvidia-ocp
- Deploy the DAS operator:
export EMULATED_MODE=disabled
export RELATED_IMAGES=related_images.your-username.json
just deploy-das-ocp
- Test with a GPU workload:
kubectl apply -f test/test-pod.yaml
Development Workflow
For local development:
- Run the operator locally (requires the scheduler, webhook, and daemonset images to be built and pushed beforehand):
# Build and push images first
just build-push-parallel
# Run operator locally
# Set EMULATED_MODE to control hardware emulation
EMULATED_MODE=enabled just run-local
- Run tests:
just test-e2e
- Check code quality:
just lint
Operator Bundle Development
- Log in to podman and create a repository for the operator bundle.
- Set BUNDLE_IMAGE to point to your repository and tag of choice.
- Run just bundle-generate to generate the bundle manifests.
- Run just build-push-bundle to build and push the bundle image to your repository.
- Run just deploy-cert-manager-ocp to install cert-manager on OpenShift.
- Run just deploy-nfd-ocp to install Node Feature Discovery (NFD) on OpenShift.
- Run just deploy-nvidia-ocp to install the NVIDIA GPU operator on OpenShift.
- Run operator-sdk run bundle --namespace <namespace> ${BUNDLE_IMAGE} to deploy the operator.
- Apply the DASOperator custom resource to initialize the operator:
  kubectl apply -f deploy/03_instaslice_operator.cr.yaml
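Putting the steps together, a minimal end-to-end sketch of the bundle workflow (the registry, tag, and namespace below
are placeholders; substitute your own):
export BUNDLE_IMAGE=quay.io/your-username/instaslice-operator-bundle:latest

just bundle-generate
just build-push-bundle

just deploy-cert-manager-ocp
just deploy-nfd-ocp
just deploy-nvidia-ocp

operator-sdk run bundle --namespace das-operator ${BUNDLE_IMAGE}
kubectl apply -f deploy/03_instaslice_operator.cr.yaml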
Using a base CSV for bundle generation
Running generate bundle is the first step to publishing an operator to a catalog
and deploying it with OLM. A CSV manifest is generated by collecting data from the
set of manifests passed to this command, such as CRDs, RBAC, etc., and applying
that data to a "base" CSV manifest.
The steps to provide a base CSV:
- Create a base CSV file that contains the desired metadata. The file name can be arbitrary; a common convention is
  {operator-name}.base.clusterserviceversion.yaml.
- Put the base CSV file in the deploy folder. This is the folder from which the generate bundle command collects the
  k8s manifests. Note that the base CSV file can be placed inside a sub-directory within the deploy folder.
- Make sure that the metadata.name of the base CSV matches the package name provided to the generate bundle command;
  otherwise the generate bundle command will ignore the base CSV and generate from an empty CSV.
Layout of an example deploy folder:
tree deploy/
deploy/
├── crds
│ └── foo-operator.crd.yaml
├── base-csv
│ └── foo-operator.base.clusterserviceversion.yaml
├── deployment.yaml
├── role.yaml
├── role_binding.yaml
├── service_account.yaml
└── webhooks.yaml
The bundle generation command:
operator-sdk generate bundle --input-dir deploy --version 0.1.0 --output-dir=bundle --package foo-operator
The base CSV yaml:
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: foo-operator.base
  annotations:
    alm-examples:
    # other annotations can be placed here
spec:
  displayName: Instaslice
  version: 0.0.2
  apiservicedefinitions:
  customresourcedefinitions:
  install:
  installModes:
  - supported: false
    type: OwnNamespace
  - supported: false
    type: SingleNamespace
  - supported: false
    type: MultiNamespace
  - supported: true
    type: AllNamespaces
  maturity: alpha
  minKubeVersion: 1.16.0
  provider:
    name: Codeflare
    url: https://github.com/openshift/instaslice-operator
  relatedImages:
  keywords:
  - Foo
  links:
  - name: My Operator
    url: https://github.com/foo/bar
  maintainers:
  description:
  icon:
- There is no need to provide any permissions or a deployment spec inside the base CSV.
- Note that the metadata.name of the base CSV has the prefix foo-operator., which adheres to the {package name}
  format.
- If there are multiple CSV files inside the deploy folder, the one encountered first in lexical order will be
  selected as the base CSV.
The CSV generation details can be found by inspecting the bundle generation code here:
https://github.com/operator-framework/operator-sdk/blob/0eefc52889ff3dfe4af406038709e6c5ba7398e5/internal/generate/clusterserviceversion/clusterserviceversion.go#L148-L159
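After generation, the resulting bundle directory can be sanity-checked with the operator-sdk validator before the
bundle image is built and pushed (the ./bundle path matches the --output-dir used above):
operator-sdk bundle validate ./bundle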
Emulated Mode
Emulated mode allows the operator to publish synthetic GPU capacity and skip NVML calls. This is handy for development and CI environments with no hardware. Emulated mode is controlled via the EMULATED_MODE environment variable.
Configuration
The EMULATED_MODE environment variable is read by the operator at startup and determines how the daemonset components behave:
disabled (default): Normal operation that requires real MIG-compatible GPU hardware and makes NVML calls
enabled: Emulated mode that simulates MIG-capable GPU capacity without requiring actual hardware
Setting Emulated Mode
For local development:
# Run operator locally with emulation
EMULATED_MODE=enabled just run-local
For deployment:
# Deploy with emulated mode enabled
export EMULATED_MODE=enabled
export RELATED_IMAGES=related_images.your-username.json
just deploy-das-ocp
For production with MIG-compatible GPUs:
# Deploy with emulated mode disabled (default)
export EMULATED_MODE=disabled
export RELATED_IMAGES=related_images.your-username.json
just deploy-das-ocp
How it Works
The operator reads the EMULATED_MODE environment variable at startup and passes this configuration to the daemonset pods running on each node. When emulated mode is enabled:
- The daemonset skips hardware detection and NVML library calls
- Synthetic GPU resources are published to simulate hardware capacity
- MIG slicing operations are simulated rather than performed on real hardware
This allows for testing and development of the operator functionality without requiring physical GPU hardware.
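A quick way to confirm that synthetic capacity is being advertised in emulated mode (a sketch; the exact extended
resource keys under the mig.das.com prefix may differ from what is shown here):
# Operator, webhook, scheduler, and daemonset pods should be running
kubectl get pods -n das-operator

# Look for synthetic MIG capacity in a node's allocatable resources
kubectl get node <node-name> -o jsonpath='{.status.allocatable}' | tr ',' '\n' | grep mig.das.com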
Justfile Usage
This project includes a Justfile for convenient task automation. The Justfile
provides several commands for building, pushing, and deploying the operator components.
Prerequisites
Install just command runner:
# On macOS
brew install just
# On Fedora/RHEL
dnf install just
# On Ubuntu/Debian
apt install just
# Or via cargo
cargo install just
Available Commands
List all available commands:
just
View current configuration:
just info
Development and Testing
Run the operator locally for development:
# Set EMULATED_MODE to 'enabled' for simulated GPUs or 'disabled' for real hardware
EMULATED_MODE=enabled just run-local
Run end-to-end tests:
just test-e2e
Run tests with a specific focus:
just test-e2e focus="GPU slices"
Bundle Operations
Generate operator bundle:
just bundle-generate
Build and push bundle image:
just build-push-bundle
Build and push developer bundle:
just build-push-developer-bundle
NVIDIA GPU Operator Management
Deploy NVIDIA GPU operator to OpenShift:
just deploy-nvidia-ocp
Remove NVIDIA GPU operator from OpenShift:
just undeploy-nvidia-ocp
Cert Manager Operations
Deploy cert-manager for OpenShift:
just deploy-cert-manager-ocp
Remove cert-manager from OpenShift:
just undeploy-cert-manager-ocp
Deploy cert-manager for Kubernetes:
just deploy-cert-manager
Node Feature Discovery
Deploy Node Feature Discovery (NFD) operator for OpenShift:
just deploy-nfd-ocp
Code Quality
Run all linting (markdown and Go):
just lint
Run all linting with automatic fixes:
just lint-fix
Run only Go linting:
just lint-go
Run only markdown linting:
just lint-md
Run Go linting and automatically fix issues:
just lint-go-fix
Run markdown linting and automatically fix issues:
just lint-md-fix
Cleanup
Clean up all deployed Kubernetes resources:
just undeploy
Building and Pushing Images
Build and push individual component images:
just build-push-scheduler # Build and push scheduler image
just build-push-daemonset # Build and push daemonset image
just build-push-operator # Build and push operator image
just build-push-webhook # Build and push webhook image
Build and push all images in parallel:
just build-push-parallel
Deployment
Deploy DAS on OpenShift Container Platform:
just deploy-das-ocp
Generate CRDs and clients:
just regen-crd # Generate CRDs into manifests directory
just regen-crd-k8s # Generate CRDs directly into deploy directory
just generate-clients # Generate client code
just verify-codegen # Verify generated client code is up to date
just generate # Generate all - CRDs and clients
Use custom developer images
Copy related_images.developer.json to related_images.username.json as a template and edit it to point at the developer
image repositories you want to use.
cp related_images.developer.json related_images.username.json
# Edit related_images.username.json with your registry
quay.io/username/image:latest
Then set the RELATED_IMAGES environment variable to related_images.username.json.
RELATED_IMAGES=related_images.username.json just
Configuration
The Justfile uses environment variables for configuration. You can customize these by setting them in your
environment or creating a .env file:
PODMAN - Container runtime (default: podman)
KUBECTL - Kubernetes CLI (default: oc)
EMULATED_MODE - Enable emulated mode (default: disabled)
RELATED_IMAGES - Path to related images JSON file (default: related_images.json)
DEPLOY_DIR - Deployment directory (default: deploy)
OPERATOR_SDK - Operator SDK binary (default: operator-sdk)
OPERATOR_VERSION - Operator version for bundle generation (default: 0.1.0)
GOLANGCI_LINT - Golangci-lint binary (default: golangci-lint)
Example:
export EMULATED_MODE=enabled
just deploy-das-ocp
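The same settings can also be kept in a .env file at the repository root instead of exporting them each time (example
values only, using the variables listed above):
# .env
EMULATED_MODE=enabled
RELATED_IMAGES=related_images.your-username.json
KUBECTL=oc
PODMAN=podman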
Architecture
The diagram below summarizes how the operator components interact. Pods requesting GPU slices are mutated by a
webhook to use the mig.das.com extended resource. The scheduler plugin tracks slice availability and creates
AllocationClaim objects processed by the device plugin on each node.
[Architecture diagram]
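For illustration, a minimal workload asking for one MIG slice might look like the sketch below. Only the mig.das.com
prefix comes from the description above; the profile suffix (mig-1g.5gb), image, and command are assumptions, and in
practice the webhook may fill in the extended resource for you.
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-demo
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubi9   # assumed image; any CUDA workload works
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        mig.das.com/mig-1g.5gb: 1         # assumed profile suffix under the mig.das.com prefix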
MIG scheduler plugin
The plugin integrates with the Kubernetes scheduler and runs through three framework phases:
- Filter – ensures the node is MIG capable and stages AllocationClaims for suitable GPUs.
- Score – prefers nodes with the most free MIG slice slots after considering existing and staged claims.
- PreBind – promotes staged claims on the selected node to created and removes the rest.
Once promoted, the device plugin provisions the slices.
The daemonset advertises GPU resources only after the NVIDIA GPU Operator's
ClusterPolicy reports a Ready state. This prevents the scheduler from
scheduling pods on a node before the GPU Operator has initialized the drivers.
AllocationClaim resource
AllocationClaim is a namespaced CRD that records which MIG slice will be prepared for a pod. Claims start in the
staged state and transition to created once all requests are satisfied. Each claim stores the GPU UUID, slice
position and pod reference.
Example:
$ kubectl get allocationclaims -n das-operator
NAME AGE
8835132e-8a7a-4766-a78f-0cb853d165a2-busy-0 61s
$ kubectl get allocationclaims -n das-operator -o yaml
apiVersion: inference.redhat.com/v1alpha1
kind: AllocationClaim
...
Debugging
All components run in the das-operator namespace:
kubectl get pods -n das-operator
Inspect the active claims:
kubectl get allocationclaims -n das-operator
On the node, verify that the CDI devices were created:
ls -l /var/run/cdi/
Increase verbosity by editing the DASOperator resource and setting operatorLogLevel to Debug or Trace.
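For example (the instance name, namespace, and scope of the DASOperator resource are assumptions; list it first to
find the right name):
# Find the DASOperator instance
kubectl get dasoperators -A

# Edit it and set operatorLogLevel to Debug or Trace
kubectl edit dasoperators <name> -n das-operator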
Running Tests
Unit Tests
Run all unit tests for the project:
make test
Run unit tests with verbose output:
go test -v ./pkg/...
Run unit tests with coverage:
go test -cover ./pkg/...
End-to-End Tests
A running cluster with a valid KUBECONFIG is required:
just test-e2e
You can focus on specific tests:
just test-e2e focus="GPU slices"
Known Issues
Due to kubernetes/kubernetes#128043,
pods may enter an UnexpectedAdmissionError state if admission fails. Pods
managed by higher level controllers such as Deployments will be recreated
automatically. Naked pods, however, must be cleaned up manually with
kubectl delete pod. Using controllers is recommended until the upstream issue
is resolved.
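A sketch for cleaning up affected bare pods manually (pods stuck in UnexpectedAdmissionError report a Failed phase):
# List failed pods across all namespaces
kubectl get pods -A --field-selector=status.phase=Failed

# Delete an affected bare pod
kubectl delete pod <pod-name> -n <namespace>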
Uninstalling
Remove the deployed resources with:
just undeploy
Contributing
Contributions are welcome! Please open issues or pull requests.
License
This project is licensed under the Apache 2.0 License.