README
¶
Destructive Node Reboot Tests
⚠️ EXTREMELY DANGEROUS - THESE TESTS WILL REBOOT OPENSHIFT NODES ⚠️
This directory contains standalone tests that validate node self-fencing mechanisms by actually rebooting OpenShift nodes. These tests should only be run in isolated test environments.
⚠️ CRITICAL SAFETY WARNINGS
- THESE TESTS WILL REBOOT THE TARGET NODE IMMEDIATELY
- DO NOT RUN IN PRODUCTION ENVIRONMENTS
- SAVE ALL WORK BEFORE RUNNING
- ENSURE TARGET NODE CAN BE SAFELY REBOOTED
- HAVE MONITORING IN PLACE TO VERIFY REBOOTS
Test Overview
1. Panic Reboot Test (panic-reboot-test.go)
- Purpose: Validates that Go panic() can trigger node reboot
- Method: Calls
panic()after a configurable delay - Expected Result: Node reboots immediately when panic occurs
- Use Case: Tests panic-based self-fencing mechanism
2. Watchdog Reboot Test (watchdog-reboot-test.go)
- Purpose: Validates that watchdog timeout can trigger node reboot
- Method: Opens
/dev/watchdog, pets it several times, then stops - Expected Result: Node reboots when watchdog timeout expires
- Use Case: Tests kernel watchdog-based self-fencing mechanism
Architecture
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ UBI9 Image │───▶│ Go Test Binary │───▶│ Node Reboot │
│ │ │ │ │ │
│ registry.redhat.│ │ panic() or │ │ Immediate │
│ io/ubi9/ubi │ │ watchdog timeout │ │ ungraceful │
│ (built w/Podman)│ │ │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
Prerequisites
Environment Requirements
- Isolated test environment (not production!)
- OpenShift/Kubernetes cluster
- Cluster admin permissions
- Target worker node that can be safely rebooted
- Podman installed for building and pushing container images
Node Requirements
/dev/watchdogdevice available (for watchdog test)- Kernel watchdog functionality enabled
- Privileged containers allowed
- Node can be safely rebooted without data loss
Setup Instructions
1. Build Test Image
# Build the UBI-based image with test binaries
podman build -t sbd-destructive-tests:latest -f test/destructive/Dockerfile .
# Tag for your registry
podman tag sbd-destructive-tests:latest your-registry.com/sbd-destructive-tests:latest
# Push to registry
podman push your-registry.com/sbd-destructive-tests:latest
2. Update Manifests
Edit the YAML files and update:
TARGET_NODE_NAME_HERE→ actual node name to test- Image references to point to your registry
3. Verify Target Node
# Check if watchdog device exists on target node
oc debug node/TARGET_NODE_NAME -- chroot /host ls -la /dev/watchdog*
# Check node details
oc describe node TARGET_NODE_NAME
# Ensure node is Ready and schedulable
oc get nodes TARGET_NODE_NAME
Running the Tests
Option 1: Pre-built UBI Image (Simpler)
If you just want to test with basic binaries compiled on-the-fly:
# 1. Edit the manifests to set target node name
sed -i 's/TARGET_NODE_NAME_HERE/your-target-node/g' test/destructive/*.yaml
# 2. Apply the ConfigMaps for documentation
kubectl apply -f test/destructive/panic-reboot-test.yaml
kubectl apply -f test/destructive/watchdog-reboot-test.yaml
Note: The pre-built option uses basic UBI image and may need the Go binaries to be compiled differently.
Option 2: Custom Built Image (Recommended)
# 1. Build and push your image using Podman (see Setup section)
# 2. Update image references in YAML files
sed -i 's|registry.redhat.io/ubi9/ubi:latest|your-registry.com/sbd-destructive-tests:latest|g' test/destructive/*.yaml
# 3. Set target node name
sed -i 's/TARGET_NODE_NAME_HERE/your-target-node/g' test/destructive/*.yaml
# 4. Run panic test
kubectl apply -f test/destructive/panic-reboot-test.yaml
# 5. Monitor (node will reboot during this)
kubectl logs -f panic-reboot-test
# 6. After node comes back, run watchdog test
kubectl apply -f test/destructive/watchdog-reboot-test.yaml
kubectl logs -f watchdog-reboot-test
Monitoring and Verification
Real-time Monitoring
# Monitor test pod logs
kubectl logs -f panic-reboot-test # or watchdog-reboot-test
# Monitor node status (from another terminal)
watch -n 2 "kubectl get nodes TARGET_NODE_NAME"
# Monitor all events
kubectl get events --sort-by='.lastTimestamp' -w
Expected Behavior
Panic Test Timeline:
- T+0s: Pod starts, shows 30-second countdown
- T+30s: Triggers
panic() - T+30s: Node begins reboot process
- T+30-90s: Node becomes
NotReady - T+2-5min: Node reboots and becomes
Readyagain
Watchdog Test Timeline:
- T+0s: Pod starts, shows 30-second countdown
- T+30s: Opens
/dev/watchdog(activates watchdog) - T+30-60s: Pets watchdog 10 times over 30 seconds
- T+60s: Stops petting watchdog
- T+60-120s: Watchdog timeout triggers reboot
- T+2-5min: Node reboots and becomes
Readyagain
Verification Commands
# Check node uptime after reboot
oc debug node/TARGET_NODE_NAME -- chroot /host uptime
# Check system logs for reboot cause
oc debug node/TARGET_NODE_NAME -- chroot /host journalctl -b -0 | grep -i "reboot\|panic\|watchdog"
# Check kernel messages
oc debug node/TARGET_NODE_NAME -- chroot /host dmesg | grep -i "reboot\|panic\|watchdog"
Troubleshooting
Common Issues
"Pod stays in Pending state"
- Check node selector matches actual node name
- Verify node is schedulable:
kubectl get nodes - Check tolerations if node has taints
"Permission denied accessing /dev/watchdog"
- Ensure
privileged: truein securityContext - Verify
hostPathvolume mounts are correct - Check if watchdog device exists on host
"Node doesn't reboot"
- For panic test: Check if container has proper capabilities
- For watchdog test: Verify
/dev/watchdogexists and is functional - Check system logs:
dmesg | grep watchdog - Some virtualized environments may not support hardware watchdog
"Test binary not found"
- If using custom image, ensure build completed successfully
- Check image pull policy and registry access
- Verify Dockerfile builds binaries correctly
Debug Commands
# Check if watchdog is available
oc debug node/TARGET_NODE_NAME -- chroot /host lsmod | grep watchdog
# Test watchdog manually
oc debug node/TARGET_NODE_NAME -- chroot /host timeout 5 sh -c 'echo 1 > /dev/watchdog'
# Check container image contents
kubectl run debug-image --rm -i --tty --image=your-registry.com/sbd-destructive-tests:latest -- /bin/bash
Cleanup
# Delete test pods (may already be gone due to node reboot)
kubectl delete pod panic-reboot-test watchdog-reboot-test --ignore-not-found
# Delete ConfigMaps
kubectl delete configmap panic-test-instructions watchdog-test-instructions --ignore-not-found
# Clean up any failed pods
kubectl delete pods --field-selector=status.phase=Failed
Security Considerations
- Tests require
privileged: truecontainers - Tests need access to host devices (
/dev/watchdog) - Tests run as root user (UID 0)
- Tests have
SYS_BOOTandSYS_ADMINcapabilities - Never run these tests in production environments
Integration with SBD Operator
These tests validate the same self-fencing mechanisms used by the SBD operator:
- Panic test validates the
rebootMethod: panicsetting in SBD agent - Watchdog test validates the watchdog petting and timeout logic
- Both tests confirm that node reboots work as expected for self-fencing
Contributing
When modifying these tests:
- Ensure all safety warnings remain prominent
- Test in isolated environments only
- Update documentation for any new parameters
- Verify tests work with UBI9 images
- Follow Go best practices for binary compilation
Support
These are destructive tests intended for validation only. For production SBD operator issues, see the main project documentation.
Remember: These tests WILL reboot nodes. Use with extreme caution!