Load Testing Framework for cd-operator
This directory contains comprehensive load tests to validate cd-operator performance at scale. The framework tests PR pipeline operations, drift detection across multiple clusters, and system behavior under sustained load.
Overview
The load testing framework validates:
- PR Pipeline Performance: Handling 100+ concurrent PRs across multiple repositories
- Drift Detection at Scale: Monitoring 50+ applications across 5+ ArgoCD clusters
- Worker Pool Behavior: Bounded concurrency and goroutine management
- Rate Limiting: Graceful handling of GitHub/ArgoCD API rate limits
- Resource Usage: CPU, memory, and goroutine stability under load
Test Structure
tests/load/
├── suite_test.go         # Test suite setup and global fixtures
├── pr_load_test.go       # PR pipeline load tests
├── drift_load_test.go    # Drift detection load tests
├── helpers/
│   ├── resources.go      # Resource monitoring utilities
│   ├── generators.go     # Mock data generators
│   ├── metrics.go        # Prometheus metrics collection
│   └── assertions.go     # Performance SLO assertions
└── README.md             # This file
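The resource checks described in this README (memory and goroutine stability) rely on snapshots taken before and after each load phase. Below is a minimal sketch of the kind of helper helpers/resources.go could provide, using only the Go runtime package; the Snapshot type and function names are illustrative, not the actual API:

```go
package helpers

import "runtime"

// Snapshot captures point-in-time resource usage for comparison
// before and after a load phase. (Illustrative type, not the real API.)
type Snapshot struct {
	HeapAllocMB float64
	Goroutines  int
}

// TakeSnapshot reads the current heap allocation and goroutine count.
func TakeSnapshot() Snapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return Snapshot{
		HeapAllocMB: float64(m.HeapAlloc) / (1024 * 1024),
		Goroutines:  runtime.NumGoroutine(),
	}
}

// Delta reports growth relative to an earlier snapshot.
func (s Snapshot) Delta(earlier Snapshot) (memMB float64, goroutines int) {
	return s.HeapAllocMB - earlier.HeapAllocMB, s.Goroutines - earlier.Goroutines
}
```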
Prerequisites
Option 1: Using Existing Cluster (Recommended)
For realistic load testing, use a real Kubernetes cluster:
# Ensure kubectl is configured
kubectl cluster-info
# Set environment variable to use existing cluster
export USE_EXISTING_CLUSTER=true
Cluster Requirements:
- Kubernetes 1.26+
- 4+ CPU cores
- 8GB+ RAM
- CRDs installed (make install)
Option 2: Using envtest
For lighter-weight testing (less realistic but faster):
# Use envtest (default)
export USE_ENVTEST=true
Install Dependencies
# Install test dependencies
go mod download
# Install CRDs (if using existing cluster)
make install
Running Load Tests
Basic Usage
Run all load tests with default configuration:
go test -v ./tests/load/... -timeout=30m
Scaling Test Load
Use the LOAD_FACTOR environment variable to scale test workloads:
# Half load (faster, for CI)
LOAD_FACTOR=0.5 go test -v ./tests/load/...
# Baseline load (default)
LOAD_FACTOR=1.0 go test -v ./tests/load/...
# Double load (stress testing)
LOAD_FACTOR=2.0 go test -v ./tests/load/...
# Maximum load (10x baseline)
LOAD_FACTOR=10.0 go test -v ./tests/load/... -timeout=60m
Load factor scales:
- Number of PRs/apps created
- Timeout durations
- Concurrency levels
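The Contributing section below refers to a scaleLoad() helper. The sketch below shows one way such a helper could apply LOAD_FACTOR to counts and timeouts; the loadFactor and scaleTimeout names are assumptions, and the real implementation in suite_test.go may differ:

```go
package load

import (
	"os"
	"strconv"
	"time"
)

// loadFactor reads LOAD_FACTOR from the environment, defaulting to 1.0.
func loadFactor() float64 {
	f, err := strconv.ParseFloat(os.Getenv("LOAD_FACTOR"), 64)
	if err != nil || f <= 0 {
		return 1.0
	}
	return f
}

// scaleLoad multiplies a baseline count (e.g. number of PRs) by LOAD_FACTOR.
func scaleLoad(baseline int) int {
	n := int(float64(baseline) * loadFactor())
	if n < 1 {
		n = 1
	}
	return n
}

// scaleTimeout stretches a baseline timeout so higher load factors get more time.
func scaleTimeout(baseline time.Duration) time.Duration {
	return time.Duration(float64(baseline) * loadFactor())
}
```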
Running Specific Tests
# Only PR pipeline tests
go test -v ./tests/load/... -run TestPRPipeline -timeout=20m
# Only drift detection tests
go test -v ./tests/load/... -run TestDriftPipeline -timeout=20m
# Specific test case
go test -v ./tests/load/... -run "TestPRPipeline/Concurrent_PR_Discovery" -timeout=15m
Ginkgo Focus/Skip
# Focus on specific test
go test -v ./tests/load/... -ginkgo.focus="should handle 100 concurrent PRs"
# Skip specific test
go test -v ./tests/load/... -ginkgo.skip="rate limiting"
Performance Baselines
PR Pipeline
| Metric | Target | Acceptable Range |
|---|---|---|
| Discovery Latency P95 | <5s | 2-10s |
| Qualification Latency P95 | <2s | 1-5s |
| Throughput | >10 PRs/sec | 5-20 PRs/sec |
| Memory Usage | <500MB | 300-800MB |
| Goroutine Growth | <100 | 50-200 |
| Error Rate | <1% | 0-2% |
Drift Detection
| Metric | Target | Acceptable Range |
|---|---|---|
| Drift Check Latency P95 | <10s | 5-20s |
| Multi-Cluster Throughput | >10 checks/sec | 5-15 checks/sec |
| Memory Usage | <500MB | 300-800MB |
| Error Rate | <1% | 0-2% |
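These baselines are enforced in tests via helpers/assertions.go. The sketch below illustrates how a target from the tables could be expressed as an assertion; the SLO type and Check method are hypothetical stand-ins for the real helpers:

```go
package helpers

import "fmt"

// SLO pairs a measured value with its target for reporting.
// (Illustrative; the real assertions.go may model this differently.)
type SLO struct {
	Name   string
	Value  float64
	Target float64
	Unit   string
}

// Check returns an error when the measured value exceeds the target,
// e.g. P95 discovery latency above 5 seconds.
func (s SLO) Check() error {
	if s.Value > s.Target {
		return fmt.Errorf("%s: %.2f %s exceeds target %.2f %s",
			s.Name, s.Value, s.Unit, s.Target, s.Unit)
	}
	return nil
}
```

For example, the PR discovery target above would map to SLO{Name: "PR Discovery Latency P95", Value: p95.Seconds(), Target: 5, Unit: "seconds"}, where p95 is the measured latency.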
Test Scenarios
PR Pipeline Load Tests (pr_load_test.go)
1. Concurrent PR Discovery
Tests the operator's ability to discover and track 100 PRs across 5 repositories simultaneously.
Validates:
- CRD creation throughput
- Discovery loop performance
- State transition speed
- Memory stability
Load: 100 PRs, 5 repositories
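A hedged sketch of how this scenario might be expressed in Ginkgo/Gomega (the framework the focus strings above imply); countTrackedPRs and the setup step are placeholders, and the real spec in pr_load_test.go will differ:

```go
package load

import (
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// countTrackedPRs is a placeholder: the real spec would list the operator's
// PR custom resources and count those that reached the discovered state.
var countTrackedPRs = func() int { return 0 }

var _ = Describe("PR discovery under load", func() {
	It("should handle 100 concurrent PRs", func() {
		const wantPRs = 100 // in the real suite this would be scaled by LOAD_FACTOR

		// ... create wantPRs mock PRs across 5 repositories here ...

		// Poll until the operator has discovered every PR, failing if the
		// SLO window (here 5 minutes) is exceeded.
		Eventually(countTrackedPRs, 5*time.Minute, 5*time.Second).
			Should(BeNumerically(">=", wantPRs))
	})
})
```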
2. PR Qualification Under Load
Tests qualification logic with mixed pass/fail scenarios under high volume.
Validates:
- Validation rule performance
- Layout detection speed
- Mergeable state checks
- Result distribution accuracy
Load: 100 PRs (70% pass, 20% layout fail, 10% not mergeable)
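helpers/generators.go presumably produces this outcome mix. A minimal sketch of weighting mock PRs by outcome; the MockPR type and field names are illustrative only:

```go
package helpers

import "math/rand"

// MockPR is an illustrative stand-in for whatever structure the real
// generators.go emits for a simulated pull request.
type MockPR struct {
	Number      int
	LayoutValid bool
	Mergeable   bool
}

// GeneratePRs returns n mock PRs with roughly 70% passing, 20% failing
// layout validation, and 10% not mergeable, matching the scenario above.
func GeneratePRs(n int, rng *rand.Rand) []MockPR {
	prs := make([]MockPR, 0, n)
	for i := 0; i < n; i++ {
		pr := MockPR{Number: i + 1, LayoutValid: true, Mergeable: true}
		switch r := rng.Float64(); {
		case r < 0.20:
			pr.LayoutValid = false // ~20% fail layout detection
		case r < 0.30:
			pr.Mergeable = false // ~10% are not mergeable
		}
		prs = append(prs, pr)
	}
	return prs
}
```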
3. Worker Pool Concurrency
Tests worker pool behavior under sustained load with periodic PR arrivals.
Validates:
- Goroutine count stability
- Queue depth management
- No unbounded growth
- Graceful backpressure
Load: 200 PRs in 20 batches over 1 minute
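The bounded-concurrency behavior this scenario exercises is typically implemented as a fixed-size worker pool reading from a channel. A generic sketch of the pattern, not the operator's actual worker code:

```go
package load

import "sync"

// processWithPool runs handler over items using at most `workers`
// goroutines, so goroutine count stays bounded regardless of queue depth.
func processWithPool(items []int, workers int, handler func(int)) {
	jobs := make(chan int)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range jobs {
				handler(item)
			}
		}()
	}

	for _, item := range items {
		jobs <- item // blocks when all workers are busy: natural backpressure
	}
	close(jobs)
	wg.Wait()
}
```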
Drift Detection Load Tests (drift_load_test.go)
1. Multi-Cluster Drift Detection
Tests drift monitoring across multiple ArgoCD clusters with many applications.
Validates:
- Multi-cluster query parallelization
- Sync status check accuracy
- Drift detection correctness
- Status update propagation
Load: 50 applications, 5 clusters
2. ArgoCD API Rate Limiting
Tests graceful handling of ArgoCD API rate limits with retry backoff.
Validates:
- Rate limit detection
- Exponential backoff
- Request retry logic
- Error rate under throttling
Load: 30 applications, aggressive rate limiting (10 req/sec)
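A generic sketch of the retry-with-exponential-backoff pattern this scenario validates; the operator's real implementation may instead use a library such as client-go's wait package, and errRateLimited is an illustrative sentinel:

```go
package load

import (
	"context"
	"errors"
	"time"
)

// errRateLimited is an illustrative sentinel for a 429-style response.
var errRateLimited = errors.New("rate limited")

// withBackoff retries fn on rate-limit errors, doubling the delay each
// attempt (100ms, 200ms, 400ms, ...) until maxAttempts is reached.
func withBackoff(ctx context.Context, maxAttempts int, fn func() error) error {
	delay := 100 * time.Millisecond
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil || !errors.Is(err, errRateLimited) {
			return err // success, or a non-retryable error
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```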
3. Parallel Cluster Checks
Tests parallel execution efficiency across multiple clusters with simulated latency.
Validates:
- Parallelization effectiveness
- Latency handling
- Resource efficiency
- Speedup vs serial execution
Load: 25 applications, 5 clusters, 100ms latency per cluster
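The parallelization being measured amounts to fanning one drift check out per cluster. A minimal sketch using a sync.WaitGroup, with checkCluster standing in for the real ArgoCD query; with 5 clusters at ~100ms each, wall time should stay near 100ms rather than the ~500ms of a serial loop:

```go
package load

import "sync"

// checkAllClusters runs checkCluster concurrently for every cluster and
// collects per-cluster errors for later error-rate accounting.
func checkAllClusters(clusters []string, checkCluster func(name string) error) map[string]error {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]error, len(clusters))
	)
	for _, c := range clusters {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			err := checkCluster(name)
			mu.Lock()
			results[name] = err
			mu.Unlock()
		}(c)
	}
	wg.Wait()
	return results
}
```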
Performance Report
After running tests, an HTML performance report is generated:
load-test-report.html
The report includes:
- Test summary (duration, items processed)
- Latency percentiles (P50, P95, P99)
- Resource usage charts (memory, goroutines)
- Throughput analysis
- Error rate breakdown
View the report:
open load-test-report.html # macOS
xdg-open load-test-report.html # Linux
Interpreting Results
Successful Test Run
Performance Summary:
Throughput: 15.23 PRs/sec
P50 Latency: 1.2s
P95 Latency: 3.8s
P99 Latency: 5.1s
Memory Delta: 45.32 MB
Goroutines: 50 -> 75
SLO Report: 5/5 SLOs met (100.0%)
─────────────────────────────────────────────
✓ PR Discovery Latency P95: 3.80 seconds (target: <= 5.00 seconds)
✓ PR Qualification Latency P95: 1.50 seconds (target: <= 2.00 seconds)
✓ Memory Usage: 345.23 MB (target: <= 500.00 MB)
✓ Goroutine Growth: 25 count (target: <= 100 count)
✓ Error Rate: 0.20 percent (target: <= 1.00 percent)
─────────────────────────────────────────────
Warning Signs
High P99 Latency:
- Investigate long tail latencies
- Check for slow API calls
- Review timeout configurations
Memory Growth:
- Check for memory leaks
- Review object caching
- Validate cleanup logic
Goroutine Growth:
- Check for goroutine leaks
- Validate worker pool shutdown
- Review context cancellation
High Error Rate:
- Check API connectivity
- Review retry logic
- Validate rate limiting
Troubleshooting
Test Timeouts
If tests time out, increase the timeout or reduce the load:
# Increase timeout
go test -v ./tests/load/... -timeout=60m
# Reduce load
LOAD_FACTOR=0.5 go test -v ./tests/load/... -timeout=30m
Out of Memory
If tests run out of memory, reduce concurrency:
# Reduce load factor
LOAD_FACTOR=0.3 go test -v ./tests/load/...
CRD Not Found Errors
Install CRDs before running tests:
make install
Connection Refused Errors
Ensure you have a running Kubernetes cluster:
kubectl cluster-info
Slow Test Execution
Load tests are CPU and I/O intensive. To speed up:
- Use fewer test iterations: LOAD_FACTOR=0.5
- Run specific tests: -run TestPRPipeline
- Use existing cluster: USE_EXISTING_CLUSTER=true
- Increase cluster resources
CI/CD Integration
GitHub Actions Example
name: Load Tests
on:
  schedule:
    - cron: '0 2 * * *'  # Nightly
  workflow_dispatch:
jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version: '1.26'
      - name: Setup test cluster
        run: |
          kind create cluster
          make install
      - name: Run load tests
        env:
          USE_EXISTING_CLUSTER: true
          LOAD_FACTOR: 0.5
        run: |
          go test -v ./tests/load/... -timeout=30m
      - name: Upload performance report
        uses: actions/upload-artifact@v3
        if: always()
        with:
          name: load-test-report
          path: load-test-report.html
Best Practices
When to Run Load Tests
- Before major releases - Check for performance regressions
- After optimization work - Measure improvement
- Nightly in CI - Catch performance degradation early
- On-demand - Troubleshoot production issues
Analyzing Results
- Establish Baseline - Run tests on main branch
- Compare Results - Run on feature branch
- Check SLOs - Verify all targets met
- Review Report - Analyze detailed metrics
- Investigate Failures - Debug issues before merge
Scaling Guidelines
| LOAD_FACTOR | Use Case | Duration | Resources |
|---|---|---|---|
| 0.1 | Quick smoke test | 5min | Minimal |
| 0.5 | CI/PR checks | 15min | Standard |
| 1.0 | Nightly regression | 30min | Standard |
| 2.0 | Weekly stress test | 60min | High |
| 5.0+ | Capacity planning | 2hr+ | Maximum |
Contributing
When adding new load tests:
- Follow naming convention: <component>_load_test.go
- Use table-driven tests: For parameterized scenarios
- Add SLO assertions: Define performance targets
- Document scenarios: Explain what's being tested
- Scale with LOAD_FACTOR: Use the scaleLoad() helper
- Generate reports: Capture performance data
References
- E2E Test Framework - End-to-end test patterns
- Integration Tests - Integration test examples
- Worker Pool Implementation - Worker pool code
- Metrics Implementation - Prometheus metrics
- Performance Tuning Guide - Optimization tips
Support
For issues or questions:
- File a GitHub issue with the load-test label
- Include test output and performance report
- Share cluster specs and LOAD_FACTOR used