operator

module

Version: v0.1.0-alpha.3 Published: Oct 20, 2025 License: Apache-2.0

Repository

github.com/NVIDIA/grove

README

grove

PodCliqueSet CRD and Controller for Network Topology Aware Gang Scheduling & Autoscaling

👷 This project site is currently under active construction

Scaling Groups: MinAvailable and Gang Scheduling

Grove's PodClique Scaling Groups provide gang scheduling and gang termination, controlled by two configuration parameters: replicas and minAvailable.

Overview

Scaling Groups allow you to group multiple PodCliques together and scale them as a unit while maintaining gang scheduling semantics. This is particularly useful for distributed workloads that require coordinated scheduling and graceful scaling behavior.

Key Configuration Parameters

`replicas`

Purpose: Sets the desired number of replicas for the scaling group
Default: 1 if not specified
Behavior: Controls how many instances of the scaling group are created

`minAvailable`

Purpose: Defines the minimum number of ready replicas required for operational stability
Default: 1 if not specified
Behavior: Enables gang scheduling and controls termination policies
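The defaulting rules above can be sketched in Go. This is a minimal illustration, not Grove's actual defaulting code: the `ScalingGroupConfig` type and the `applyDefaults` function are hypothetical names, and pointer fields are assumed here only as a way to distinguish "not specified" from an explicit value.

```go
package main

import "fmt"

// ScalingGroupConfig is a hypothetical, simplified stand-in for one entry of
// podCliqueScalingGroupConfigs. It is NOT Grove's actual API type.
type ScalingGroupConfig struct {
	Name         string
	Replicas     *int32 // nil means "not specified"
	MinAvailable *int32 // nil means "not specified"
}

// applyDefaults sets replicas and minAvailable to 1 when they are unset,
// matching the defaults documented above. Explicit values are left untouched.
func applyDefaults(cfg *ScalingGroupConfig) {
	if cfg.Replicas == nil {
		v := int32(1)
		cfg.Replicas = &v
	}
	if cfg.MinAvailable == nil {
		v := int32(1)
		cfg.MinAvailable = &v
	}
}

func main() {
	cfg := ScalingGroupConfig{Name: "database-cluster"}
	applyDefaults(&cfg)
	fmt.Printf("replicas=%d minAvailable=%d\n", *cfg.Replicas, *cfg.MinAvailable)
	// prints: replicas=1 minAvailable=1
}
```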

Gang Scheduling Behavior

Grove splits the replicas of a scaling group into two gang scheduling tiers based on the minAvailable setting:

Base PodGang (Core Cluster)

Replicas: 0 through (minAvailable - 1)
Scheduling: All pods scheduled together as a single gang
Purpose: Establishes the minimum viable cluster
Gates Removed: Immediately when pods are assigned to the PodGang

Scaled PodGangs (Scale-Out Replicas)

Replicas: minAvailable and above
Scheduling: Each replica gets its own scaled PodGang
Purpose: Provides additional capacity once core functionality is established
Gates Removed: Only after the base PodGang is ready and running

Example Scenarios

Scenario 1: Database Cluster

apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
spec:
  template:
    podCliqueScalingGroupConfigs:
    - name: "database-cluster"
      replicas: 5
      minAvailable: 3
      cliqueNames: ["db-primary", "db-secondary"]

Behavior:

Replicas 0, 1, 2: Form base PodGang, scheduled together (minimum viable cluster)
Replicas 3, 4: Scaled PodGangs, wait for base cluster to be ready
Result: Ensures core 3-node cluster is operational before adding scale-out nodes

Scenario 2: Machine Learning Training

apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
spec:
  template:
    podCliqueScalingGroupConfigs:
    - name: "ml-training"
      replicas: 8
      minAvailable: 4
      cliqueNames: ["parameter-server", "worker"]

Behavior:

Replicas 0, 1, 2, 3: Base PodGang for core training infrastructure
Replicas 4, 5, 6, 7: Scaled PodGangs for additional training capacity
Result: Core training cluster established before scaling out workers

Ready Replica Definition

A scaling group replica is considered "ready" when:

Its associated PodClique has sufficient ready Pods: PodClique.Status.ReadyReplicas >= PodClique.Spec.MinReplicas
A Pod is considered ready when its PodReady condition is True
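The readiness rule can be expressed as a small check. The sketch below uses simplified local types instead of the Kubernetes and Grove API types so that it stays self-contained; `Pod`, `PodCondition`, `isPodReady`, and `isCliqueReady` are hypothetical names. The only fact borrowed from Kubernetes is that the PodReady condition has the type string "Ready".

```go
package main

import "fmt"

// PodCondition and Pod are simplified stand-ins for the Kubernetes types.
type PodCondition struct {
	Type   string
	Status string
}

type Pod struct {
	Name       string
	Conditions []PodCondition
}

// isPodReady reports whether the Pod's "Ready" condition has status "True".
// A Pod without that condition is treated as not ready.
func isPodReady(p Pod) bool {
	for _, c := range p.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

// isCliqueReady applies the rule ReadyReplicas >= MinReplicas, counting the
// ready Pods directly from the given slice.
func isCliqueReady(pods []Pod, minReplicas int) bool {
	ready := 0
	for _, p := range pods {
		if isPodReady(p) {
			ready++
		}
	}
	return ready >= minReplicas
}

func main() {
	pods := []Pod{
		{Name: "db-0", Conditions: []PodCondition{{Type: "Ready", Status: "True"}}},
		{Name: "db-1", Conditions: []PodCondition{{Type: "Ready", Status: "True"}}},
		{Name: "db-2", Conditions: []PodCondition{{Type: "Ready", Status: "False"}}},
	}
	fmt.Println(isCliqueReady(pods, 2)) // true: 2 ready Pods >= 2
	fmt.Println(isCliqueReady(pods, 3)) // false: 2 ready Pods < 3
}
```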

Gang Termination Protection

If the number of ready replicas falls below minAvailable:

Gang termination is triggered for the affected scaling group replica
Purpose: Prevents resource waste and maintains workload integrity
Behavior: Ensures workloads fail fast rather than running in degraded states
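The trigger condition is a strict "less than" comparison. The sketch below illustrates only that decision, assuming per-replica readiness has already been computed. `shouldTerminateGang` is a hypothetical name, and how Grove actually carries out termination is not described in this README.

```go
package main

import "fmt"

// countReady returns how many scaling group replicas are currently ready.
func countReady(replicaReady []bool) int {
	n := 0
	for _, r := range replicaReady {
		if r {
			n++
		}
	}
	return n
}

// shouldTerminateGang reports whether the number of ready replicas has
// fallen below minAvailable. Exactly minAvailable ready replicas is still
// acceptable. Hypothetical helper for illustration only.
func shouldTerminateGang(readyReplicas, minAvailable int) bool {
	return readyReplicas < minAvailable
}

func main() {
	replicaReady := []bool{true, true, false, false, false}
	ready := countReady(replicaReady)
	fmt.Printf("ready=%d minAvailable=3 terminate=%t\n", ready, shouldTerminateGang(ready, 3))
	// prints: ready=2 minAvailable=3 terminate=true
}
```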

Benefits

Efficient Resource Utilization

Core functionality established first before scaling out
Prevents wasteful scheduling of non-essential replicas

Workload Stability

Gang scheduling ensures all-or-nothing scheduling semantics
Termination protection maintains minimum viable cluster size

Graceful Scaling

Base cluster provides stable foundation
Scale-out replicas add capacity without disrupting core functionality

Directories

Path	Synopsis
api module
client module
cmd
opts
initc
cmd command
cmd/opts
internal
internal
client
constants
controller
controller/cert
controller/common
controller/common/component
controller/common/component/utils
controller/podclique
controller/podclique/components
controller/podclique/components/pod
controller/podcliquescalinggroup
controller/podcliquescalinggroup/components
controller/podcliquescalinggroup/components/podclique
controller/podcliqueset
controller/podcliqueset/components
controller/podcliqueset/components/hpa
controller/podcliqueset/components/podclique
controller/podcliqueset/components/podcliquescalinggroup
controller/podcliqueset/components/podcliquesetreplica
controller/podcliqueset/components/podgang
controller/podcliqueset/components/role
controller/podcliqueset/components/rolebinding
controller/podcliqueset/components/satokensecret
controller/podcliqueset/components/service
controller/podcliqueset/components/serviceaccount
controller/utils
errors
expect
index
logger
utils
utils/kubernetes
version
webhook
webhook/admission/pcs/authorization
webhook/admission/pcs/defaulting
webhook/admission/pcs/validation
test
utils
