Workload-Variant-Autoscaler (WVA)

GPU-aware autoscaler for LLM inference workloads with optimal resource allocation.

The Workload-Variant-Autoscaler (WVA) is a Kubernetes controller that performs intelligent autoscaling for inference model servers. It assigns GPU types to models, determines optimal replica counts for given request traffic loads and service classes, and configures batch sizes—all while optimizing for cost and performance.

Key Features
- Intelligent Autoscaling: Optimizes replica count and GPU allocation based on workload, performance models, and SLO requirements
- Cost Optimization: Minimizes infrastructure costs while meeting SLO requirements
- Performance Modeling: Uses queueing theory (M/M/1/k, M/G/1 models) for accurate latency and throughput prediction
- Multi-Model Support: Manages multiple models with different service classes and priorities
Quick Start
Prerequisites
- Kubernetes v1.31.0+ (or OpenShift 4.18+)
- Helm 3.x
- kubectl
Install with Helm (Recommended)
# Install the chart from the local checkout (a published Helm repository is planned)
helm upgrade -i workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  --set-file prometheus.caCert=/tmp/prometheus-ca.crt \
  --set variantAutoscaling.accelerator=L40S \
  --set variantAutoscaling.modelID=unsloth/Meta-Llama-3.1-8B \
  --set vllmService.enabled=true \
  --set vllmService.nodePort=30000 \
  --create-namespace
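Once the chart is installed, a quick sanity check (using the namespace from the command above; the exact CRD name may differ in your release):
# Verify the controller pod is running
kubectl get pods -n workload-variant-autoscaler-system
# Confirm the VariantAutoscaling CRD is registered
kubectl get crd | grep variantautoscaling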
Try it Locally with Kind (No GPU Required!)
# Deploy WVA with llm-d infrastructure on a local Kind cluster
make deploy-llm-d-wva-emulated-on-kind
# This creates a Kind cluster with emulated GPUs and deploys:
# - WVA controller
# - llm-d infrastructure (simulation mode)
# - Prometheus and monitoring stack
# - vLLM emulator for testing
Works on macOS (Apple Silicon and Intel) and Windows with no physical GPUs, making it well suited for development and testing with GPU emulation.
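After the make target completes, you can sanity-check the emulated stack; component namespaces depend on the llm-d installation defaults, so the grep below is only a convenience filter:
# List the Kind cluster created by the make target
kind get clusters
# Check that the WVA controller and monitoring stack are running
kubectl get pods -A | grep -E 'workload-variant-autoscaler|prometheus'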
See the Installation Guide for detailed instructions.
Documentation
- User Guide
- Tutorials
- Integrations
- Design & Architecture
- Developer Guide
- Deployment Options
Architecture
WVA consists of several key components:
- Reconciler: Kubernetes controller that manages VariantAutoscaling resources
- Collector: Gathers cluster state and vLLM server metrics
- Model Analyzer: Performs per-model analysis using queueing theory
- Optimizer: Makes global scaling decisions across models
- Actuator: Emits optimization metrics to Prometheus and updates the VariantAutoscaling status that the external autoscaler acts on
For detailed architecture information, see the design documentation.
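As a rough illustration of the kind of queueing relationship the Model Analyzer works with, the textbook M/M/1/K formulas below relate request arrival rate, service rate, and queue capacity to expected latency; the exact models and parameters WVA implements are described in the design documentation. With arrival rate $\lambda$, per-replica service rate $\mu$, utilization $\rho = \lambda/\mu$, and queue capacity $K$:

$$
P_K = \frac{(1-\rho)\,\rho^{K}}{1-\rho^{K+1}},\qquad
L = \frac{\rho}{1-\rho} - \frac{(K+1)\,\rho^{K+1}}{1-\rho^{K+1}},\qquad
W = \frac{L}{\lambda\,(1-P_K)}
$$

Here $P_K$ is the probability that an arriving request is rejected because the queue is full, $L$ is the mean number of requests in the system, and $W$ is the mean response time obtained from Little's law.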
How It Works
1. Platform admin deploys the llm-d infrastructure (including model servers) and waits for the servers to warm up and start serving requests
2. Platform admin creates a VariantAutoscaling CR for the running deployment
3. WVA continuously monitors request rates and server performance via Prometheus metrics (see the sample query after this list)
4. Model Analyzer estimates latency and throughput using queueing models
5. Optimizer solves for the minimal-cost allocation that meets all SLOs
6. Actuator emits optimization metrics to Prometheus and updates the VariantAutoscaling status
7. External autoscaler (HPA/KEDA) reads the metrics and scales the deployment accordingly
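For example, the per-model request rate WVA observes can be approximated with a PromQL query over the vLLM server metrics; the metric and label names below are illustrative and depend on your vLLM version and scrape configuration:
# Approximate per-model request rate over the last minute
sum by (model_name) (rate(vllm:request_success_total[1m]))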
Important Notes:
- Create the VariantAutoscaling CR only after your deployment has warmed up, to avoid an immediate scale-down
- Configure an HPA stabilization window (120s or more is recommended) for gradual scaling behavior; a sketch of such a configuration follows this list
- WVA updates the VariantAutoscaling status with current and desired allocations on every reconciliation cycle
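Below is a minimal sketch of an HPA that consumes a WVA-emitted metric, assuming an external metrics adapter (for example, one backed by Prometheus) exposes the metric to the HPA. The metric name wva_desired_replicas, the deployment name, and the target value are placeholders for illustration, not the actual names WVA publishes:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-8b-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-8b                      # hypothetical deployment name
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120   # per the recommendation above
  metrics:
    - type: External
      external:
        metric:
          name: wva_desired_replicas    # placeholder metric name
        target:
          type: AverageValue
          averageValue: "1"
With KEDA, the equivalent would be a ScaledObject using a Prometheus trigger against the same metric.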
Example
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-autoscaler
  namespace: llm-inference
spec:
  modelName: "meta/llama-3.1-8b"
  serviceClass: "Premium"
  acceleratorType: "A100"
  minReplicas: 1
  maxBatchSize: 256
More examples in config/samples/.
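A typical flow for trying the example looks like the following; the deployment name, sample file name, and resource short name are assumptions, so check kubectl api-resources and config/samples/ for the exact names in your setup:
# Wait for the model server deployment to finish warming up before creating the CR
kubectl rollout status deployment/llama-8b -n llm-inference      # hypothetical deployment name
# Apply the VariantAutoscaling resource
kubectl apply -f config/samples/llama-8b-autoscaler.yaml         # hypothetical sample file name
# Inspect the current and desired allocations reported in the status
kubectl get variantautoscaling llama-8b-autoscaler -n llm-inference -o yaml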
Contributing
We welcome contributions! See the llm-d Contributing Guide for guidelines.
Join the llm-d autoscaling community meetings to get involved.
License
Apache 2.0 - see LICENSE for details.
For detailed documentation, visit the docs directory.