Workload-Variant-Autoscaler (WVA)

GPU-aware autoscaler for LLM inference workloads with optimal resource allocation.

The Workload-Variant-Autoscaler (WVA) is a Kubernetes controller that performs intelligent autoscaling for inference model servers. It assigns GPU types to models, determines optimal replica counts for given request traffic loads and service classes, and configures batch sizes—all while optimizing for cost and performance.

Key Features
- Intelligent Autoscaling: Optimizes replica count and GPU allocation based on workload, performance models, and SLO requirements
- Cost Optimization: Minimizes infrastructure costs while meeting SLO requirements
- Performance Modeling: Uses queueing theory (M/M/1/k, M/G/1 models) for accurate latency and throughput prediction
- Multi-Model Support: Manages multiple models with different service classes and priorities
Quick Start
Prerequisites
- Kubernetes v1.31.0+ (or OpenShift 4.18+)
- Helm 3.x
- kubectl
Install with Helm (Recommended)
# Install the chart from the local checkout (a published Helm repository is planned)
helm upgrade -i workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  --set-file prometheus.caCert=/tmp/prometheus-ca.crt \
  --set variantAutoscaling.accelerator=L40S \
  --set variantAutoscaling.modelID=unsloth/Meta-Llama-3.1-8B \
  --set vllmService.enabled=true \
  --set vllmService.nodePort=30000 \
  --create-namespace
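Once the chart is installed, a quick sanity check (using the namespace from the command above; the exact CRD name may differ in your release):
# Verify the controller pod is running
kubectl get pods -n workload-variant-autoscaler-system
# Confirm the VariantAutoscaling CRD is registered
kubectl get crd | grep variantautoscaling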
Try it Locally with Kind (No GPU Required!)
# Deploy WVA with llm-d infrastructure on a local Kind cluster
make deploy-llm-d-wva-emulated-on-kind
# This creates a Kind cluster with emulated GPUs and deploys:
# - WVA controller
# - llm-d infrastructure (simulation mode)
# - Prometheus and monitoring stack
# - vLLM emulator for testing
Works on macOS (Apple Silicon and Intel) and Windows with no physical GPUs, making it well suited for development and testing with GPU emulation.
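After the make target completes, you can sanity-check the emulated stack; component namespaces depend on the llm-d installation defaults, so the grep below is only a convenience filter:
# List the Kind cluster created by the make target
kind get clusters
# Check that the WVA controller and monitoring stack are running
kubectl get pods -A | grep -E 'workload-variant-autoscaler|prometheus'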
See the Installation Guide for detailed instructions.
Documentation
- User Guide
- Tutorials
- Integrations
- Design & Architecture
- Developer Guide
- Deployment Options
Architecture
WVA consists of several key components:
- Reconciler: Kubernetes controller that manages VariantAutoscaling resources
- Collector: Gathers cluster state and vLLM server metrics
- Model Analyzer: Performs per-model analysis using queueing theory
- Optimizer: Makes global scaling decisions across models
- Actuator: Emits optimization metrics to Prometheus and updates the VariantAutoscaling status that the external autoscaler acts on
For detailed architecture information, see the design documentation.
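As a rough illustration of the kind of queueing relationship the Model Analyzer works with, the textbook M/M/1/K formulas below relate request arrival rate, service rate, and queue capacity to expected latency; the exact models and parameters WVA implements are described in the design documentation. With arrival rate $\lambda$, per-replica service rate $\mu$, utilization $\rho = \lambda/\mu$, and queue capacity $K$:

$$
P_K = \frac{(1-\rho)\,\rho^{K}}{1-\rho^{K+1}},\qquad
L = \frac{\rho}{1-\rho} - \frac{(K+1)\,\rho^{K+1}}{1-\rho^{K+1}},\qquad
W = \frac{L}{\lambda\,(1-P_K)}
$$

Here $P_K$ is the probability that an arriving request is rejected because the queue is full, $L$ is the mean number of requests in the system, and $W$ is the mean response time obtained from Little's law.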
How It Works
1. Platform admin deploys the llm-d infrastructure (including model servers) and waits for the servers to warm up and start serving requests
2. Platform admin creates a VariantAutoscaling CR for the running deployment
3. WVA continuously monitors request rates and server performance via Prometheus metrics (see the sample query after this list)
4. Model Analyzer estimates latency and throughput using queueing models
5. Optimizer solves for the minimal-cost allocation that meets all SLOs
6. Actuator emits optimization metrics to Prometheus and updates the VariantAutoscaling status
7. External autoscaler (HPA/KEDA) reads the metrics and scales the deployment accordingly
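For example, the per-model request rate WVA observes can be approximated with a PromQL query over the vLLM server metrics; the metric and label names below are illustrative and depend on your vLLM version and scrape configuration:
# Approximate per-model request rate over the last minute
sum by (model_name) (rate(vllm:request_success_total[1m]))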
Important Notes:
- Create the VariantAutoscaling CR only after your deployment has warmed up, to avoid an immediate scale-down
- Configure an HPA stabilization window (120s or more is recommended) for gradual scaling behavior; a sketch of such a configuration follows this list
- WVA updates the VariantAutoscaling status with current and desired allocations on every reconciliation cycle
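Below is a minimal sketch of an HPA that consumes a WVA-emitted metric, assuming an external metrics adapter (for example, one backed by Prometheus) exposes the metric to the HPA. The metric name wva_desired_replicas, the deployment name, and the target value are placeholders for illustration, not the actual names WVA publishes:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-8b-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-8b                      # hypothetical deployment name
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120   # per the recommendation above
  metrics:
    - type: External
      external:
        metric:
          name: wva_desired_replicas    # placeholder metric name
        target:
          type: AverageValue
          averageValue: "1"
With KEDA, the equivalent would be a ScaledObject using a Prometheus trigger against the same metric.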
Example
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-autoscaler
  namespace: llm-inference
spec:
  modelName: "meta/llama-3.1-8b"
  serviceClass: "Premium"
  acceleratorType: "A100"
  minReplicas: 1
  maxBatchSize: 256
More examples in config/samples/.
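A typical flow for trying the example looks like the following; the deployment name, sample file name, and resource short name are assumptions, so check kubectl api-resources and config/samples/ for the exact names in your setup:
# Wait for the model server deployment to finish warming up before creating the CR
kubectl rollout status deployment/llama-8b -n llm-inference      # hypothetical deployment name
# Apply the VariantAutoscaling resource
kubectl apply -f config/samples/llama-8b-autoscaler.yaml         # hypothetical sample file name
# Inspect the current and desired allocations reported in the status
kubectl get variantautoscaling llama-8b-autoscaler -n llm-inference -o yaml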
Contributing
We welcome contributions! See the llm-d Contributing Guide for guidelines.
Join the llm-d autoscaling community meetings to get involved.
License
Apache 2.0 - see LICENSE for details.
For detailed documentation, visit the docs directory.