KAI-scheduler

Version: v0.4.4 (latest) | Published: Apr 23, 2025 | License: Apache-2.0

Repository

github.com/NVIDIA/KAI-scheduler

README

KAI Scheduler

KAI Scheduler is a robust, efficient, and scalable Kubernetes scheduler that optimizes GPU resource allocation for AI and machine learning workloads.

KAI Scheduler is designed to manage large-scale GPU clusters, including clusters with thousands of nodes, and a high throughput of workloads, which makes it well suited to extensive and demanding environments. It allows administrators of Kubernetes clusters to dynamically allocate GPU resources to workloads.

KAI Scheduler supports the entire AI lifecycle, from small, interactive jobs that require minimal resources to large training and inference, all within the same cluster. It ensures optimal resource allocation while maintaining resource fairness between the different consumers. It can run alongside other schedulers installed on the cluster.

Key Features

Batch Scheduling: Ensure all pods in a group are scheduled simultaneously or not at all.
Bin Packing & Spread Scheduling: Optimize node usage either by minimizing fragmentation (bin-packing) or increasing resiliency and load balancing (spread scheduling).
Workload Priority: Prioritize workloads effectively within queues.
Hierarchical Queues: Manage workloads with two-level queue hierarchies for flexible organizational control.
Resource Distribution: Customize quotas, over-quota weights, limits, and priorities per queue.
Fairness Policies: Ensure equitable resource distribution using Dominant Resource Fairness (DRF) and resource reclamation across queues.
Workload Consolidation: Reallocate running workloads intelligently to reduce fragmentation and increase cluster utilization.
Elastic Workloads: Dynamically scale workloads within defined minimum and maximum pod counts.
Dynamic Resource Allocation (DRA): Support vendor-specific hardware resources through Kubernetes ResourceClaims (e.g., GPUs from NVIDIA or AMD).
GPU Sharing: Allow multiple workloads to efficiently share single or multiple GPUs, maximizing resource utilization.
Cloud & On-premise Support: Fully compatible with dynamic cloud infrastructures (including auto-scalers like Karpenter) as well as static on-premise deployments.
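
To make the batch scheduling and bin-packing/spread features above more concrete, here is a minimal Go sketch. It is an illustration only, not KAI Scheduler's implementation: the names `Strategy`, `pickNode`, and `placeGang` are hypothetical, nodes are modeled as nothing more than a count of free GPUs, and each pod requests a whole number of GPUs.

```go
package main

import "fmt"

// Strategy selects how a node is chosen among those that can fit a pod.
// (Hypothetical type, for illustration only.)
type Strategy int

const (
	BinPack Strategy = iota // prefer the fitting node with the fewest free GPUs
	Spread                  // prefer the fitting node with the most free GPUs
)

// pickNode returns the index of the chosen node, or -1 if no node fits.
// Ties keep the first node encountered.
func pickNode(free []int, need int, s Strategy) int {
	best := -1
	for i, f := range free {
		if f < need {
			continue
		}
		if best == -1 ||
			(s == BinPack && f < free[best]) ||
			(s == Spread && f > free[best]) {
			best = i
		}
	}
	return best
}

// placeGang tries to place every pod of a group. It works on a copy of the
// free-GPU slice, so a failed attempt leaves the input state untouched:
// either all pods are placed, or none are.
func placeGang(free, requests []int, s Strategy) (assignment, updated []int, ok bool) {
	trial := append([]int(nil), free...)
	assignment = make([]int, len(requests))
	for p, need := range requests {
		n := pickNode(trial, need, s)
		if n == -1 {
			return nil, free, false
		}
		trial[n] -= need
		assignment[p] = n
	}
	return assignment, trial, true
}

func main() {
	free := []int{4, 4, 2} // free GPUs per node in a hypothetical cluster
	gang := []int{2, 2}    // two pods requesting 2 GPUs each

	a, after, _ := placeGang(free, gang, BinPack)
	fmt.Println(a, after) // [2 0] [2 4 0]

	a, after, _ = placeGang(free, gang, Spread)
	fmt.Println(a, after) // [0 1] [2 2 2]

	_, after, ok := placeGang(free, []int{4, 4, 4}, BinPack)
	fmt.Println(ok, after) // false [4 4 2]
}
```

In this toy model, bin-packing fills the smallest node first and leaves one node entirely free for a larger job, while spread evens out the load. The third call shows the all-or-nothing behavior: the group does not fit, so no GPUs are consumed.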

Prerequisites

Before installing KAI Scheduler, ensure you have:

A running Kubernetes cluster
Helm CLI installed
NVIDIA GPU-Operator installed in order to schedule workloads that request GPU resources

Installation

KAI Scheduler will be installed in the kai-scheduler namespace. When submitting workloads, make sure to use a separate, dedicated namespace.

Installation Methods

KAI Scheduler can be installed:

From Production (Recommended)
From Source (Build it Yourself)

Install from Production

Locate the latest release version on the releases page. Run the following command after replacing <VERSION> with the desired release version:

helm upgrade -i kai-scheduler oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler -n kai-scheduler --create-namespace --version <VERSION>
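
If the installation is driven from a Go program rather than typed by hand, the command above can be assembled as an argument list. The following is a minimal sketch under that assumption: `helmInstallArgs` is a hypothetical helper, and the version string passed in `main` is only an example value. The sketch prints the command instead of running it; the resulting slice could be passed to `exec.Command("helm", args...)`.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// helmInstallArgs returns the arguments of the documented install command
// for a given release version. The function name is hypothetical.
func helmInstallArgs(version string) ([]string, error) {
	// Reject an empty value or an unreplaced <VERSION> placeholder.
	if strings.TrimSpace(version) == "" || strings.Contains(version, "<") {
		return nil, errors.New("replace <VERSION> with a real release version")
	}
	return []string{
		"upgrade", "-i", "kai-scheduler",
		"oci://ghcr.io/nvidia/kai-scheduler/kai-scheduler",
		"-n", "kai-scheduler", "--create-namespace",
		"--version", version,
	}, nil
}

func main() {
	args, err := helmInstallArgs("v0.4.4") // example version, adjust as needed
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("helm " + strings.Join(args, " "))
}
```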

Build from Source

Follow the instructions here

Quick Start

To start scheduling workloads with KAI Scheduler, continue to the Quick Start example.

Support and Getting Help

We’d love to hear from you! Here's how to reach out:

Technical Questions, Bugs, and Feature Requests: Please open an issue on GitHub for anything related to technical support, bug reports, or feature suggestions. This helps us track and address them efficiently.
General Discussion & Roadmap Topics: For broader conversations (such as roadmap discussions, scheduling strategies, or working group coordination), join the CNCF Slack workspace and drop by the #batch-wg channel.

Directories

Path	Synopsis
cmd
binder command
binder/app
podgrouper command
podgrouper/app
resourcereservation command
resourcereservation/app
scheduler command
scheduler/app
scheduler/app/options
scheduler/profiling
snapshot-tool command
webhookmanager command
webhookmanager/app
pkg
apis/client/clientset/versioned
apis/client/clientset/versioned/fake	This package has the automatically generated fake clientset.
apis/client/clientset/versioned/scheme	This package contains the scheme of the automatically generated clientset.
apis/client/clientset/versioned/typed/scheduling/v1alpha2	This package has the automatically generated typed clients.
apis/client/clientset/versioned/typed/scheduling/v1alpha2/fake	Package fake has the automatically generated clients.
apis/client/clientset/versioned/typed/scheduling/v2	This package has the automatically generated typed clients.
apis/client/clientset/versioned/typed/scheduling/v2/fake	Package fake has the automatically generated clients.
apis/client/clientset/versioned/typed/scheduling/v2alpha2	This package has the automatically generated typed clients.
apis/client/clientset/versioned/typed/scheduling/v2alpha2/fake	Package fake has the automatically generated clients.
apis/client/informers/externalversions
apis/client/informers/externalversions/internalinterfaces
apis/client/informers/externalversions/scheduling
apis/client/informers/externalversions/scheduling/v1alpha2
apis/client/informers/externalversions/scheduling/v2
apis/client/informers/externalversions/scheduling/v2alpha2
apis/client/listers/scheduling/v1alpha2
apis/client/listers/scheduling/v2
apis/client/listers/scheduling/v2alpha2
apis/scheduling/v1alpha2	+groupName=scheduling.run.ai
apis/scheduling/v2	+groupName=scheduling.run.ai
apis/scheduling/v2alpha2	+groupName=scheduling.run.ai
binder/admission/pod-mutator
binder/admission/pod-validator
binder/binding
binder/binding/mock	Package mock_binder is a generated GoMock package.
binder/binding/resourcereservation
binder/binding/resourcereservation/group_mutex
binder/binding/resourcereservation/mock	Package mock_resourcereservation is a generated GoMock package.
binder/common
binder/common/gpusharingconfigmap
binder/controllers
binder/plugins
binder/plugins/gpusharing
binder/plugins/gpusharing/gpu-request
binder/plugins/k8s-plugins
binder/plugins/k8s-plugins/common
binder/plugins/k8s-plugins/common/mock	Package mock_common_plugins is a generated GoMock package.
binder/plugins/k8s-plugins/dynamicresources
binder/plugins/k8s-plugins/volumebinding
binder/plugins/mock	Package mock_plugins is a generated GoMock package.
binder/plugins/state
binder/test_utils
common/constants
common/gpu_operator_discovery
common/k8s_utils
common/resources
env-tests
env-tests/dynamicresource
env-tests/scheduler
podgrouper
podgrouper/podgroup
podgrouper/podgrouper
podgrouper/podgrouper/plugins
podgrouper/podgrouper/plugins/aml
podgrouper/podgrouper/plugins/constants
podgrouper/podgrouper/plugins/cronjobs
podgrouper/podgrouper/plugins/deployment
podgrouper/podgrouper/plugins/job
podgrouper/podgrouper/plugins/knative
podgrouper/podgrouper/plugins/kubeflow
podgrouper/podgrouper/plugins/kubeflow/jax
podgrouper/podgrouper/plugins/kubeflow/mpi
podgrouper/podgrouper/plugins/kubeflow/notebook
podgrouper/podgrouper/plugins/kubeflow/pytorch
podgrouper/podgrouper/plugins/kubeflow/tensorflow
podgrouper/podgrouper/plugins/kubeflow/xgboost
podgrouper/podgrouper/plugins/podjob
podgrouper/podgrouper/plugins/ray
podgrouper/podgrouper/plugins/runaijob
podgrouper/podgrouper/plugins/skiptopowner
podgrouper/podgrouper/plugins/spark
podgrouper/podgrouper/plugins/spotrequest
podgrouper/podgrouper/supported_types
podgrouper/topowner
resourcereservation/discovery
resourcereservation/patcher
resourcereservation/poddetails
scheduler
scheduler/actions
scheduler/actions/allocate
scheduler/actions/common
scheduler/actions/common/solvers
scheduler/actions/common/solvers/accumulated_scenario_filters
scheduler/actions/common/solvers/scenario
scheduler/actions/consolidation
scheduler/actions/integration_tests/integration_tests_utils
scheduler/actions/preempt
scheduler/actions/reclaim
scheduler/actions/stalegangeviction
scheduler/actions/utils
scheduler/api
scheduler/api/bindrequest_info
scheduler/api/common_info
scheduler/api/common_info/resources
scheduler/api/configmap_info
scheduler/api/csidriver_info
scheduler/api/eviction_info
scheduler/api/node_info
scheduler/api/pod_affinity	Package pod_affinity is a generated GoMock package.
scheduler/api/pod_info
scheduler/api/pod_status
scheduler/api/podgroup_info
scheduler/api/queue_info
scheduler/api/reclaimer_info
scheduler/api/resource_info
scheduler/api/storagecapacity_info
scheduler/api/storageclaim_info
scheduler/api/storageclass_info
scheduler/cache	Package cache is a generated GoMock package.
scheduler/cache/cluster_info
scheduler/cache/cluster_info/data_lister	Package data_lister is a generated GoMock package.
scheduler/cache/evictor
scheduler/cache/status_updater
scheduler/conf
scheduler/conf_util
scheduler/constants
scheduler/constants/labels
scheduler/constants/status
scheduler/framework
scheduler/gpu_sharing
scheduler/k8s_internal
scheduler/k8s_internal/plugins
scheduler/k8s_internal/predicates
scheduler/k8s_utils	Package k8s_utils is a generated GoMock package.
scheduler/log
scheduler/metrics
scheduler/plugins
scheduler/plugins/dynamicresources
scheduler/plugins/elastic
scheduler/plugins/gpupack
scheduler/plugins/gpusharingorder
scheduler/plugins/gpuspread
scheduler/plugins/kubeflow
scheduler/plugins/nodeavailability
scheduler/plugins/nodeplacement
scheduler/plugins/nominatednode
scheduler/plugins/podaffinity
scheduler/plugins/predicates
scheduler/plugins/priority
scheduler/plugins/proportion
scheduler/plugins/proportion/capacity_policy
scheduler/plugins/proportion/queue_order
scheduler/plugins/proportion/reclaimable
scheduler/plugins/proportion/reclaimable/strategies
scheduler/plugins/proportion/resource_division
scheduler/plugins/proportion/resource_share
scheduler/plugins/proportion/utils
scheduler/plugins/ray
scheduler/plugins/resourcetype
scheduler/plugins/scores
scheduler/plugins/snapshot
scheduler/plugins/taskorder
scheduler/scheduler_util
scheduler/test_utils
scheduler/test_utils/dra_fake
scheduler/test_utils/jobs_fake
scheduler/test_utils/nodes_fake
scheduler/test_utils/plugins_fake/predicates_fake
scheduler/test_utils/resources_fake
scheduler/test_utils/tasks_fake
scheduler/utils
scheduler/version
webhookmanager
webhookmanager/cert
test
e2e/modules/configurations
e2e/modules/configurations/feature_flags
e2e/modules/constant
e2e/modules/constant/labels
e2e/modules/context
e2e/modules/environment
e2e/modules/resources/capacity
e2e/modules/resources/fillers
e2e/modules/resources/rd
e2e/modules/resources/rd/crd
e2e/modules/resources/rd/mpi
e2e/modules/resources/rd/pod_group
e2e/modules/resources/rd/pytorch
e2e/modules/resources/rd/queue
e2e/modules/utils
e2e/modules/wait
e2e/modules/wait/watcher