watchdog

package
v0.2.0 Latest
Published: Feb 3, 2026 License: Apache-2.0 Imports: 11 Imported by: 0

README

Watchdog Package

The watchdog package provides a Go interface for interacting with Linux kernel watchdog devices, with automatic fallback to software watchdog (softdog) when hardware watchdog devices are not available.

Overview

Hardware watchdog devices are used to automatically reset the system if the software becomes unresponsive. The application must periodically "pet" or "kick" the watchdog to prevent an automatic system reset.

Features

  • Hardware Watchdog Support: Direct interface to hardware watchdog devices via /dev/watchdog, /dev/watchdog0, etc.
  • Software Watchdog Fallback: Automatic loading and use of the Linux softdog kernel module when no hardware watchdog is present
  • Robust Error Handling: Comprehensive retry logic with exponential backoff for critical operations
  • Device Detection: Automatic scanning for available watchdog devices on the system
  • Logging Integration: Structured logging with logr for debugging and monitoring

Quick Start

Basic Usage with Automatic Fallback
package main

import (
    "time"

    "github.com/go-logr/logr"
    "github.com/medik8s/sbd-operator/pkg/watchdog"
)

func main() {
    logger := logr.Discard() // Use your preferred logger
    
    // This will try hardware watchdog first, fallback to softdog if needed
    wd, err := watchdog.NewWithSoftdogFallback("/dev/watchdog", logger)
    if err != nil {
        panic(err)
    }
    defer wd.Close()
    
    // Check if we're using software watchdog
    if wd.IsSoftdog() {
        logger.Info("Using software watchdog (softdog)")
    }
    
    // Pet the watchdog periodically
    for {
        if err := wd.Pet(); err != nil {
            logger.Error(err, "Failed to pet watchdog")
            break
        }
        time.Sleep(10 * time.Second)
    }
}
Hardware-Only Usage
// For cases where you only want hardware watchdog (no fallback)
wd, err := watchdog.New("/dev/watchdog")
if err != nil {
    panic(err)
}
defer wd.Close()
Test Mode Usage

For development and testing environments, you can enable test mode to prevent actual system reboots:

// Enable test mode - softdog will use soft_noboot=1 parameter
wd, err := watchdog.NewWithSoftdogFallbackAndTestMode("/dev/watchdog", true, logger)
if err != nil {
    panic(err)
}
defer wd.Close()

// In test mode, watchdog timeouts won't cause system reboot
if wd.IsSoftdog() {
    logger.Info("Using software watchdog in test mode (no reboots)")
}

Test mode is useful for:

  • Development: Testing watchdog logic without system resets
  • CI/CD: Running tests that involve watchdog functionality
  • Debugging: Observing watchdog behavior without consequences

Note: Test mode only affects the softdog module. Hardware watchdogs will still cause system resets regardless of the test mode setting.

Softdog Fallback Behavior

The NewWithSoftdogFallback function implements intelligent fallback logic:

  1. Primary Attempt: Try to open the specified watchdog device path
  2. Device Scan: If that fails, scan for other existing watchdog devices
  3. Fallback Decision: Only attempt softdog loading if no hardware watchdog devices exist
  4. Module Loading: Load the softdog kernel module with appropriate timeout and optional test mode
  5. Device Creation: Wait for /dev/watchdog to appear and open it
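The decision in steps 1 to 3 can be sketched with the standard library alone. This is an illustration under assumptions, not the package's implementation: the names fallbackAction, scanWatchdogDevices, and decideFallback are hypothetical, and the /dev/watchdog* glob pattern is assumed from the device paths mentioned in Features.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// fallbackAction is a hypothetical enum naming the outcome of the decision.
type fallbackAction int

const (
	useRequested fallbackAction = iota // the requested path opened successfully
	reportError                        // other watchdog devices exist, so softdog is not loaded
	loadSoftdog                        // no watchdog devices exist, so softdog loading is attempted
)

// scanWatchdogDevices lists entries matching <devDir>/watchdog*.
// The pattern is an assumption for this sketch.
func scanWatchdogDevices(devDir string) ([]string, error) {
	return filepath.Glob(filepath.Join(devDir, "watchdog*"))
}

// decideFallback mirrors steps 1-3: use the requested device if it opened,
// refuse to load softdog when other watchdog devices are present, and
// fall back to softdog only when nothing was found.
func decideFallback(requestedOpened bool, existing []string) fallbackAction {
	switch {
	case requestedOpened:
		return useRequested
	case len(existing) > 0:
		return reportError
	default:
		return loadSoftdog
	}
}

func main() {
	devices, err := scanWatchdogDevices("/dev")
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Println("found devices:", devices)
	fmt.Println("action if the requested path failed to open:", decideFallback(false, devices))
}
```

Keeping the decision separate from the filesystem scan makes the rule easy to check without touching /dev.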

The NewWithSoftdogFallbackAndTestMode function extends this behavior with test mode support:

  • Test Mode Disabled (default): nsenter --target 1 --mount --uts --ipc --net --pid -- modprobe softdog soft_margin=60
  • Test Mode Enabled: nsenter --target 1 --mount --uts --ipc --net --pid -- modprobe softdog soft_margin=60 soft_noboot=1

Note: The package uses nsenter to run modprobe in the host's namespace, ensuring the kernel module is loaded on the host system rather than in the container.
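As a rough sketch, the two command lines above could be assembled as follows. The helper names softdogCommand and commandLine are hypothetical and not part of the package; only the flags and module parameters shown above are used.

```go
package main

import (
	"fmt"
	"strings"
)

// softdogCommand builds the argument vector for loading softdog in the
// host namespaces. It is a hypothetical helper for illustration.
func softdogCommand(timeoutSeconds int, testMode bool) []string {
	args := []string{
		"nsenter", "--target", "1", "--mount", "--uts", "--ipc", "--net", "--pid", "--",
		"modprobe", "softdog", fmt.Sprintf("soft_margin=%d", timeoutSeconds),
	}
	if testMode {
		// soft_noboot=1 keeps softdog from rebooting the machine on timeout.
		args = append(args, "soft_noboot=1")
	}
	return args
}

// commandLine joins the argument vector for logging or comparison.
func commandLine(timeoutSeconds int, testMode bool) string {
	return strings.Join(softdogCommand(timeoutSeconds, testMode), " ")
}

func main() {
	fmt.Println(commandLine(60, false))
	fmt.Println(commandLine(60, true))
}
```

To actually run it, the slice could be passed to exec.Command as args[0] and args[1:]..., which avoids shell quoting issues.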

System Requirements for Softdog
  • Linux kernel with softdog module support
  • modprobe command available in PATH
  • nsenter command available in PATH (for running modprobe in host namespace)
  • Sufficient privileges to load kernel modules (typically requires SYS_MODULE capability)
  • Container environments need privileged: true or SYS_MODULE capability

Error Handling

The package provides detailed error information:

wd, err := watchdog.NewWithSoftdogFallback("/dev/watchdog", logger)
if err != nil {
    if strings.Contains(err.Error(), "failed to load softdog module") {
        // Softdog loading failed - likely permission or modprobe issues
        log.Printf("Cannot load softdog: %v", err)
    } else if strings.Contains(err.Error(), "other watchdog devices exist") {
        // Hardware watchdog present but can't access specified path
        log.Printf("Hardware watchdog access issue: %v", err)
    }
    return err
}

Integration with SBD Operator

In the SBD operator context, the watchdog is used for:

  • System Fencing: Ensuring unhealthy nodes reset themselves via watchdog timeout
  • Heartbeat Monitoring: Regular watchdog petting as a liveness indicator
  • High Availability: Automatic fallback ensures watchdog functionality even on systems without hardware support
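One way to picture petting as a liveness indicator is a loop that pets only while a health check passes, so an unhealthy node stops petting and the watchdog timeout fences it. The sketch below assumes a hypothetical petter interface (anything with Pet() error, which *Watchdog satisfies) and a caller-supplied healthy function; it is not the operator's actual agent loop.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// petter is a hypothetical interface; any type with Pet() error fits.
type petter interface{ Pet() error }

var errUnhealthy = errors.New("health check failed, petting stopped")

// petWhileHealthy pets on every tick while healthy() reports true.
// When the check fails it returns without petting, leaving the
// watchdog timer to expire.
func petWhileHealthy(ctx context.Context, wd petter, interval time.Duration, healthy func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if !healthy() {
				return errUnhealthy
			}
			if err := wd.Pet(); err != nil {
				return fmt.Errorf("pet watchdog: %w", err)
			}
		}
	}
}

// countingPetter is an in-memory stand-in for a real watchdog.
type countingPetter struct{ pets int }

func (c *countingPetter) Pet() error { c.pets++; return nil }

// runDemo reports healthy for three checks and unhealthy on the fourth.
func runDemo() (int, error) {
	wd := &countingPetter{}
	checks := 0
	healthy := func() bool { checks++; return checks <= 3 }
	err := petWhileHealthy(context.Background(), wd, time.Millisecond, healthy)
	return wd.pets, err
}

func main() {
	pets, err := runDemo()
	fmt.Printf("pets=%d, stopped because: %v\n", pets, err)
}
```

In a real agent the interval would need to be comfortably shorter than the watchdog timeout.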
Container Deployment

When deploying in containers, ensure the following:

apiVersion: apps/v1
kind: DaemonSet
spec:
  template:
    spec:
      containers:
      - name: sbd-agent
        securityContext:
          privileged: true
          capabilities:
            add:
            - SYS_ADMIN
            - SYS_MODULE  # Required for loading softdog
        volumeMounts:
        - name: dev
          mountPath: /dev
        - name: modules
          mountPath: /lib/modules
          readOnly: true
      volumes:
      - name: dev
        hostPath:
          path: /dev
      - name: modules
        hostPath:
          path: /lib/modules

Logging

The package uses structured logging to provide visibility into watchdog operations:

INFO Successfully opened hardware watchdog device path="/dev/watchdog"
INFO Failed to open specified watchdog device, checking for alternatives requestedPath="/dev/watchdog" error="..."
INFO No watchdog devices found, attempting to load softdog module
INFO Loading softdog module using nsenter command="nsenter --target 1 --mount --uts --ipc --net --pid -- modprobe softdog soft_margin=60" timeout=60
INFO Successfully loaded and opened softdog watchdog device originalPath="/dev/watchdog" softdogPath="/dev/watchdog"

Testing

The package includes comprehensive tests:

# Run all watchdog tests
go test ./pkg/watchdog -v

# Run softdog integration tests (requires root)
sudo go test ./pkg/watchdog -v -run TestLoadSoftdogModule_Integration

Constants and Configuration

  • Default Softdog Timeout: 60 seconds
  • Retry Configuration: 2 retries with exponential backoff (50ms to 500ms)
  • Module Load Command:
    • Normal mode: nsenter --target 1 --mount --uts --ipc --net --pid -- modprobe softdog soft_margin=60
    • Test mode: nsenter --target 1 --mount --uts --ipc --net --pid -- modprobe softdog soft_margin=60 soft_noboot=1

Platform Support

  • Linux: Full support with hardware and software watchdog
  • Other Platforms: Hardware watchdog support only (no softdog fallback)

Troubleshooting

Common Issues
  1. Permission Denied: Ensure container has SYS_MODULE capability
  2. Module Not Found: Verify softdog module is available in kernel
  3. modprobe or nsenter Not Found: Ensure the kmod package (provides modprobe) and util-linux (provides nsenter), or their equivalents, are installed
  4. Device Access: Check /dev/watchdog permissions and ownership
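A quick preflight check for the command-related issues above might look like the following sketch. It only verifies that modprobe and nsenter resolve in PATH; missingCommands and fakeLookPath are hypothetical helpers, and the check does not cover capabilities or module availability.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingCommands returns the names that lookPath cannot resolve.
// Passing the lookup function in keeps the check testable.
func missingCommands(names []string, lookPath func(string) (string, error)) []string {
	var missing []string
	for _, name := range names {
		if _, err := lookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

// fakeLookPath builds a lookup function from a fixed table, which is
// handy for dry runs on machines without the real tools.
func fakeLookPath(known map[string]string) func(string) (string, error) {
	return func(name string) (string, error) {
		if p, ok := known[name]; ok {
			return p, nil
		}
		return "", exec.ErrNotFound
	}
}

func main() {
	missing := missingCommands([]string{"modprobe", "nsenter"}, exec.LookPath)
	if len(missing) > 0 {
		fmt.Println("missing commands:", missing)
		return
	}
	fmt.Println("modprobe and nsenter found in PATH")
}
```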
Debug Logging

Enable debug logging to see detailed operation flow:

logger := logr.New(/* your debug-enabled logger */)
wd, err := watchdog.NewWithSoftdogFallback("/dev/watchdog", logger)

This will show device scanning, module loading attempts, and fallback decisions.

Linux Watchdog IOCTL Commands

The package uses standard Linux watchdog ioctl commands:

  • WDIOC_KEEPALIVE (0x40045705): Reset the watchdog timer
  • WDIOC_SETTIMEOUT (0x40045706): Set timeout period
  • WDIOC_GETTIMEOUT (0x40045707): Get current timeout

Security Considerations

  • Watchdog operations typically require root privileges
  • Improper use can cause unexpected system resets
  • Always implement proper error handling and logging
  • Consider graceful shutdown procedures

Dependencies

  • golang.org/x/sys/unix: For Linux system calls and ioctl operations
  • Standard Go library packages (os, fmt, etc.)

License

Copyright 2025 - Licensed under the Apache License, Version 2.0

Documentation

Constants

const (
	// WDIOC_KEEPALIVE is the ioctl command to reset/pet the watchdog timer
	// This is equivalent to _IO('W', 5) in C
	WDIOC_KEEPALIVE = 0x40045705

	// WDIOC_SETTIMEOUT is the ioctl command to set the watchdog timeout
	// This is equivalent to _IOWR('W', 6, int) in C
	WDIOC_SETTIMEOUT = 0x40045706

	// WDIOC_GETTIMEOUT is the ioctl command to get the watchdog timeout
	// This is equivalent to _IOR('W', 7, int) in C
	WDIOC_GETTIMEOUT = 0x40045707
)

Linux watchdog ioctl constants. Reference: include/uapi/linux/watchdog.h

const (
	// MaxWatchdogRetries is the maximum number of retry attempts for watchdog operations
	MaxWatchdogRetries = 2
	// InitialWatchdogRetryDelay is the initial delay between watchdog retry attempts
	InitialWatchdogRetryDelay = 50 * time.Millisecond
	// MaxWatchdogRetryDelay is the maximum delay between watchdog retry attempts
	MaxWatchdogRetryDelay = 500 * time.Millisecond
	// WatchdogRetryBackoffFactor is the exponential backoff factor for watchdog retry delays
	WatchdogRetryBackoffFactor = 2.0
)

Retry configuration constants for watchdog operations
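To make the numbers concrete, here is a minimal sketch of how such constants could drive a capped exponential backoff. It assumes that "2 retries" means one initial attempt plus up to two more, and that the delay doubles from 50ms up to the 500ms cap; the lowercase names and the withRetry helper are hypothetical, and the package's internal retry code may differ.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Local copies of the documented values, for illustration only.
const (
	maxRetries    = 2
	initialDelay  = 50 * time.Millisecond
	maxDelay      = 500 * time.Millisecond
	backoffFactor = 2.0
)

var errTransient = errors.New("transient failure")

// retryDelay returns the wait before retry number n (0-based),
// growing by backoffFactor each time and capped at maxDelay.
func retryDelay(n int) time.Duration {
	d := float64(initialDelay)
	for i := 0; i < n; i++ {
		d *= backoffFactor
	}
	if d > float64(maxDelay) {
		d = float64(maxDelay)
	}
	return time.Duration(d)
}

// withRetry runs op once and then up to maxRetries more times,
// sleeping between attempts. It returns the last error on failure.
func withRetry(op func() error, sleep func(time.Duration)) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt < maxRetries {
			sleep(retryDelay(attempt))
		}
	}
	return err
}

// noSleep skips waiting; useful when exercising the logic quickly.
func noSleep(time.Duration) {}

func main() {
	for n := 0; n < 5; n++ {
		fmt.Printf("retry %d waits %v\n", n, retryDelay(n))
	}
	calls := 0
	err := withRetry(func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}, time.Sleep)
	fmt.Printf("calls=%d err=%v\n", calls, err)
}
```

With only two retries, the waits are 50ms and 100ms, so the 500ms cap is never reached in practice under this reading.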

const (
	// SoftdogModule is the name of the Linux software watchdog kernel module
	SoftdogModule = "softdog"
	// DefaultSoftdogTimeout is the default timeout in seconds for the softdog module
	DefaultSoftdogTimeout = 60
	// SoftdogModprobe is the command to load the softdog module
	SoftdogModprobe = "modprobe"
	// NsenterCommand is the command to enter host namespaces
	NsenterCommand = "nsenter"
	// HostPID is the PID of the host init process
	HostPID = "1"
)

Softdog configuration constants

Variables

var (
	// ErrIoctlNotSupported indicates that the watchdog driver doesn't support ioctl operations
	ErrIoctlNotSupported = errors.New("ioctl not supported by watchdog driver")
)

Errors for watchdog operations

Functions

This section is empty.

Types

type Watchdog

type Watchdog struct {
	// contains filtered or unexported fields
}

Watchdog represents a Linux kernel watchdog device interface. It provides methods to interact with hardware watchdog devices through the Linux watchdog subsystem.

func New

func New(path string) (*Watchdog, error)

New creates a new Watchdog instance by opening the watchdog device at the specified path. Common paths include '/dev/watchdog' or '/dev/watchdog0'.

Parameters:

  • path: The filesystem path to the watchdog device (e.g., "/dev/watchdog")

Returns:

  • *Watchdog: A new Watchdog instance if successful
  • error: An error if the device cannot be opened

The device is opened with O_WRONLY flag as required by most watchdog devices. Once opened, the watchdog timer is typically activated and must be periodically reset using the Pet() method to prevent system reset.

func NewWithLogger

func NewWithLogger(path string, logger logr.Logger) (*Watchdog, error)

NewWithLogger creates a new Watchdog instance with a logger for retry operations.

func NewWithSoftdogFallback

func NewWithSoftdogFallback(path string, logger logr.Logger) (*Watchdog, error)

NewWithSoftdogFallback creates a new Watchdog instance, attempting to use the specified path first, and falling back to loading and using the softdog module if no hardware watchdog is available.

This function provides automatic fallback behavior:

  1. Try to open the specified watchdog device path
  2. If that fails and no other watchdog devices exist, try to load the softdog module
  3. If softdog loads successfully, use /dev/watchdog as the device path

Parameters:

  • path: The preferred filesystem path to the watchdog device (e.g., "/dev/watchdog")
  • logger: Logger for debugging and error reporting

Returns:

  • *Watchdog: A new Watchdog instance if successful
  • error: An error if neither hardware nor software watchdog can be initialized

This is the recommended function for production use as it provides the best reliability.

func NewWithSoftdogFallbackAndTestMode

func NewWithSoftdogFallbackAndTestMode(path string, testMode bool, logger logr.Logger) (*Watchdog, error)

NewWithSoftdogFallbackAndTestMode creates a new Watchdog instance with optional test mode support. This is similar to NewWithSoftdogFallback but allows enabling test mode for the softdog module.

Parameters:

  • path: The preferred filesystem path to the watchdog device (e.g., "/dev/watchdog")
  • testMode: If true, enables soft_noboot=1 for softdog (prevents actual reboots during testing)
  • logger: Logger for debugging and error reporting

Returns:

  • *Watchdog: A new Watchdog instance if successful
  • error: An error if neither hardware nor software watchdog can be initialized

Test mode is useful for development and testing environments where you want to test watchdog functionality without triggering actual system reboots.

func (*Watchdog) Close

func (w *Watchdog) Close() error

Close closes the watchdog device file descriptor and releases associated resources.

IMPORTANT: Closing the watchdog device may have different behaviors depending on the specific watchdog driver:

  • Some drivers stop the watchdog timer when the device is closed
  • Others continue running and will reset the system if the device is not reopened and petted
  • Some require writing 'V' to the device before closing to stop the timer

Returns:

  • error: An error if the device cannot be closed properly

This method marks the watchdog as closed and prevents further operations. It's safe to call Close() multiple times.
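For drivers in the third group, the write-'V'-then-close sequence (the kernel's "magic close" convention) can be sketched as below. Whether Close() performs this step is not stated here, so treat magicClose as a hypothetical helper operating on any io.WriteCloser; drivers that do not support magic close, or that are configured so the timer cannot be stopped, will ignore it.

```go
package main

import (
	"fmt"
	"io"
)

// magicClose writes the single character 'V' and then closes the device.
// The device is closed even if the write fails, so the descriptor is
// never leaked; the write error takes precedence in the result.
func magicClose(dev io.WriteCloser) error {
	_, writeErr := dev.Write([]byte("V"))
	closeErr := dev.Close()
	if writeErr != nil {
		return fmt.Errorf("write magic close character: %w", writeErr)
	}
	return closeErr
}

// recorder is an in-memory stand-in for a watchdog device node.
type recorder struct {
	written []byte
	closed  bool
}

func (r *recorder) Write(p []byte) (int, error) {
	r.written = append(r.written, p...)
	return len(p), nil
}

func (r *recorder) Close() error {
	r.closed = true
	return nil
}

func main() {
	dev := &recorder{}
	if err := magicClose(dev); err != nil {
		fmt.Println("magic close failed:", err)
		return
	}
	fmt.Printf("wrote %q, closed=%v\n", dev.written, dev.closed)
}
```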

func (*Watchdog) IsOpen

func (w *Watchdog) IsOpen() bool

IsOpen returns true if the watchdog device is currently open and available for operations.

func (*Watchdog) IsSoftdog

func (w *Watchdog) IsSoftdog() bool

IsSoftdog returns true if this watchdog is using the software watchdog (softdog) module.

func (*Watchdog) Path

func (w *Watchdog) Path() string

Path returns the filesystem path of the watchdog device.

func (*Watchdog) Pet

func (w *Watchdog) Pet() error

Pet resets the watchdog timer, preventing the system from being reset. This method must be called periodically (before the timeout expires) to keep the system running. The frequency depends on the watchdog's configured timeout value.

This method includes retry logic for transient errors, as watchdog petting is critical for system stability. It uses a two-tier approach:

  1. Primary: WDIOC_KEEPALIVE ioctl command (preferred method)
  2. Fallback: Write-based keep-alive when ioctl is not supported (ENOTTY)

Returns:

  • error: An error if the watchdog cannot be petted after retries, or if the device is not open

This method automatically falls back to write-based keep-alive when the WDIOC_KEEPALIVE ioctl is not supported by the watchdog driver (such as some softdog implementations). This ensures compatibility across different kernel configurations and architectures.
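The two-tier approach can be illustrated without touching a real device by passing the two keep-alive strategies in as functions. This is a sketch of the control flow only: petOnce and the stub functions are hypothetical, and the real method also wraps the operation in retries.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// petOnce tries the ioctl-based keep-alive first. If the driver reports
// ENOTTY (ioctl not supported), it falls back to the write-based
// keep-alive. Any other ioctl error is returned unchanged.
func petOnce(ioctlKeepalive, writeKeepalive func() error) (usedFallback bool, err error) {
	err = ioctlKeepalive()
	if err == nil {
		return false, nil
	}
	if errors.Is(err, syscall.ENOTTY) {
		return true, writeKeepalive()
	}
	return false, err
}

// Stub strategies standing in for real device operations.
func succeed() error          { return nil }
func ioctlUnsupported() error { return syscall.ENOTTY }
func ioctlBroken() error      { return syscall.EIO }

func main() {
	fallback, err := petOnce(ioctlUnsupported, succeed)
	fmt.Printf("usedFallback=%v err=%v\n", fallback, err)
}
```

Only ENOTTY triggers the fallback here; treating every ioctl failure as "unsupported" could hide real device errors.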
