devices

package v0.0.4
Published: Jan 12, 2026 License: MIT Imports: 16 Imported by: 0

README

Device Passthrough

This package provides GPU and PCI device passthrough for virtual machines using the Linux VFIO (Virtual Function I/O) framework.

Overview

Device passthrough allows a VM to have direct, near-native access to physical hardware (GPUs, network cards, etc.) by bypassing the host's device drivers and giving the guest exclusive control. For a deep dive into the VFIO framework, see the kernel documentation.

┌─────────────────────────────────────────────────────────────┐
│  Host                                                       │
│  ┌─────────────┐     ┌─────────────────────────────────┐    │
│  │  hypeman    │     │  VFIO Driver                    │    │
│  │  (VMM)      │────▶│  /dev/vfio/<group>              │    │
│  └─────────────┘     └─────────────────────────────────┘    │
│                              │                              │
│  ┌───────────────────────────┼──────────────────────────┐   │
│  │  IOMMU (hardware)         ▼                          │   │
│  │  - Translates guest physical → host physical         │   │
│  │  - Isolates DMA (device can only access VM memory)   │   │
│  └──────────────────────────────────────────────────────┘   │
│                              │                              │
│                              ▼                              │
│                    ┌──────────────┐                         │
│                    │  GPU (PCIe)  │                         │
│                    └──────────────┘                         │
└─────────────────────────────────────────────────────────────┘

Package Structure

lib/devices/
├── types.go        # Device, AvailableDevice, CreateDeviceRequest
├── errors.go       # Error definitions
├── discovery.go    # PCI device discovery from sysfs
├── vfio.go         # VFIO bind/unbind operations
├── manager.go      # Manager interface and implementation
├── manager_test.go # Unit tests
├── gpu_e2e_test.go # End-to-end GPU passthrough test (auto-skips if no GPU)
└── scripts/
    └── gpu-reset.sh  # GPU recovery script (see Troubleshooting)

Example: Full Workflow

# 1. Discover available devices
curl localhost:8080/devices/available
# → [{"pci_address": "0000:a2:00.0", "vendor_name": "NVIDIA Corporation", ...}]

# 2. Register the GPU
curl -X POST localhost:8080/devices \
  -d '{"name": "l4-gpu", "pci_address": "0000:a2:00.0"}'

# 3. Create instance with GPU (auto-binds to VFIO)
curl -X POST localhost:8080/instances \
  -d '{"name": "ml-training", "image": "nvidia/cuda:12.0-base", "devices": ["l4-gpu"]}'

# 4. Inside VM: verify GPU
lspci | grep -i nvidia
nvidia-smi

# 5. Delete instance (auto-unbinds from VFIO)
curl -X DELETE localhost:8080/instances/{id}
# GPU returns to host control

Device Lifecycle

1. Discovery

Discover passthrough-capable devices on the host:

GET /devices/available

Returns PCI devices that are candidates for passthrough (GPUs, 3D controllers). Each device includes its PCI address, vendor/device IDs, IOMMU group, and current driver.

2. Registration

Register a device with a unique name:

POST /devices
{
  "name": "l4-gpu",
  "pci_address": "0000:a2:00.0"
}

Registration does not modify the device's driver binding. The device remains usable by the host until an instance requests it.
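
For code that consumes the Go API directly rather than the HTTP endpoint, registration maps onto Manager.CreateDevice. A minimal sketch, assuming mgr is a constructed Manager (see the Manager interface in the documentation below); the helper name is illustrative:

// Sketch only: register a GPU via the Go API instead of the HTTP endpoint.
func registerGPU(ctx context.Context, mgr devices.Manager) (*devices.Device, error) {
	return mgr.CreateDevice(ctx, devices.CreateDeviceRequest{
		Name:       "l4-gpu",
		PCIAddress: "0000:a2:00.0",
	})
}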

3. Instance Creation (Auto-Bind)

When an instance is created with devices:

POST /instances
{
  "name": "gpu-workload",
  "image": "docker.io/nvidia/cuda:12.0-base",
  "devices": ["l4-gpu"]
}

The system automatically:

  1. Validates the device exists and isn't attached to another instance
  2. Binds to VFIO if not already bound (unbinds native driver like nvidia)
  3. Passes to cloud-hypervisor via the --device flag
  4. Marks as attached to prevent concurrent use

4. Instance Deletion (Auto-Unbind)

When an instance is deleted, the system automatically:

  1. Marks device as detached
  2. Unbinds from VFIO (triggers kernel driver probe to restore native driver)

This returns the device to host control so it can be used by other processes or a new instance.
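
In terms of the Manager interface documented below, the two flows amount to roughly the following. This is a hedged sketch, not the exact internal code; whether BindToVFIO short-circuits an already-bound device or returns ErrAlreadyBound is an implementation detail:

// Sketch only: auto-bind on instance creation, auto-unbind on deletion.
// On creation:
if err := mgr.BindToVFIO(ctx, deviceID); err != nil && !errors.Is(err, devices.ErrAlreadyBound) {
	return err
}
if err := mgr.MarkAttached(ctx, deviceID, instanceID); err != nil {
	return err
}

// On deletion:
if err := mgr.MarkDetached(ctx, deviceID); err != nil {
	return err
}
if err := mgr.UnbindFromVFIO(ctx, deviceID); err != nil {
	return err
}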

5. Unregistration

DELETE /devices/{id}

Removes the device from hypeman's registry. Fails if the device is currently attached to an instance.

Cloud Hypervisor Integration

Cloud-hypervisor receives device passthrough configuration via the VmConfig.Devices field:

vmConfig.Devices = &[]vmm.DeviceConfig{
    {
        Path: "/sys/bus/pci/devices/0000:a2:00.0/",
    },
}
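
The path can also be built with the package helper instead of being hardcoded. A small sketch, assuming dev is a registered *devices.Device:

// Sketch only: derive the cloud-hypervisor device path from the registered device.
vmConfig.Devices = &[]vmm.DeviceConfig{
    {
        Path: devices.GetDeviceSysfsPath(dev.PCIAddress),
    },
}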

Cloud-hypervisor then:

  1. Opens the VFIO group file (/dev/vfio/<group>)
  2. Maps device BARs (memory regions) into guest physical address space
  3. Configures interrupt routing (MSI/MSI-X) to the guest
  4. The guest sees a real PCIe device and loads native drivers

NVIDIA-Specific Options

For multi-GPU configurations, cloud-hypervisor supports GPUDirect P2P:

DeviceConfig{
    Path: "/sys/bus/pci/devices/0000:a2:00.0/",
    XNvGpudirectClique: ptr(int8(0)),  // Enable P2P within clique 0
}
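
The ptr helper above is not part of this package; it stands in for any small generic pointer helper, for example:

// ptr returns a pointer to v (generic helper, shown only for completeness).
func ptr[T any](v T) *T { return &v }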

This is not currently exposed through the hypeman API but could be added for HPC workloads.

Constraints and Limitations

IOMMU Requirements
  • IOMMU must be enabled in BIOS and kernel (intel_iommu=on or amd_iommu=on)
  • All devices in an IOMMU group must be passed through together
  • Some motherboards place many devices in the same group (ACS override may help)

VFIO Module Requirements

The following kernel modules must be loaded:

modprobe vfio_pci
modprobe vfio_iommu_type1

Driver Binding
  • Binding to VFIO unloads the native driver (e.g., nvidia, amdgpu)
  • Host processes using the device will lose access
  • Some drivers (like NVIDIA) may resist unbinding if in use

Single Attachment

A device can only be attached to one instance at a time. Attempts to attach an already-attached device will fail.
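
Callers can detect this condition through the package's sentinel errors. A minimal sketch, assuming the attach path surfaces ErrInUse for an already-attached device:

// Sketch only: distinguish the single-attachment conflict from other failures.
if err := mgr.MarkAttached(ctx, deviceID, instanceID); errors.Is(err, devices.ErrInUse) {
	return fmt.Errorf("device %s is already attached to another instance: %w", deviceID, err)
} else if err != nil {
	return err
}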

No Hot-Plug

Devices must be specified at instance creation time. Hot-adding devices to a running VM is not currently supported (though cloud-hypervisor has this capability).

Guest Driver Requirements

The guest must have appropriate drivers:

  • NVIDIA GPUs: Install NVIDIA drivers in the guest image
  • AMD GPUs: Install amdgpu/ROCm in the guest image

Performance Considerations
  • ACS (Access Control Services): Required for proper isolation on some systems
  • Huge Pages: Recommended for GPU workloads (hugepages=on in cloud-hypervisor)
  • CPU Pinning: Can improve latency for GPU compute workloads

Troubleshooting

GPU Reset Script

If GPU passthrough tests fail or hang, the GPU may be left in a bad state (still bound to vfio-pci, or stuck without a driver). Use the provided reset script:

# Reset all NVIDIA GPUs to their native driver
sudo ./lib/devices/scripts/gpu-reset.sh

# Reset a specific GPU
sudo ./lib/devices/scripts/gpu-reset.sh 0000:a2:00.0

The script will:

  1. Kill any stuck cloud-hypervisor processes holding the GPU
  2. Unbind from vfio-pci if still bound
  3. Clear driver_override
  4. Trigger driver probe to rebind to the nvidia driver
  5. Restart nvidia-persistenced

Common Issues

VFIO Bind Hangs

Symptom: BindToVFIO hangs indefinitely.

Cause: The nvidia-persistenced service keeps /dev/nvidia* open, preventing driver unbind.

Solution: The code now automatically stops nvidia-persistenced before unbinding. If you're testing manually:

sudo systemctl stop nvidia-persistenced
# ... do VFIO bind/unbind ...
sudo systemctl start nvidia-persistenced

VM Exec Fails After Boot

Symptom: VM boots but exec commands time out.

Cause: Usually the container's main process exited (e.g., the alpine image runs /bin/sh, which exits immediately), causing init to exit and the guest kernel to panic.

Solution: Use an image with a long-running process (e.g., nginx:alpine) or ensure your container has a persistent entrypoint.

GPU Not Restored After Test

Symptom: GPU has no driver bound, nvidia-smi fails.

Solution:

# Trigger kernel driver probe
sudo sh -c 'echo 0000:a2:00.0 > /sys/bus/pci/drivers_probe'
# Restart nvidia-persistenced
sudo systemctl start nvidia-persistenced
# Verify
nvidia-smi

If that fails, a system reboot may be necessary.

VFIO Modules Not Loaded

Symptom: ErrVFIONotAvailable error.

Solution:

sudo modprobe vfio_pci vfio_iommu_type1
# Verify
ls /dev/vfio/

Add to /etc/modules-load.d/vfio.conf for persistence across reboots.

IOMMU Not Enabled

Symptom: No IOMMU groups found, passthrough fails.

Solution: Add kernel parameter to bootloader:

  • Intel: intel_iommu=on iommu=pt
  • AMD: amd_iommu=on iommu=pt

Then reboot.

Running the E2E Test

The GPU passthrough E2E test automatically detects GPU availability and skips if prerequisites aren't met.

Why GPU tests require root: Unlike network tests, which can use Linux capabilities (CAP_NET_ADMIN), GPU passthrough requires writing to sysfs files (/sys/bus/pci/drivers/*/unbind, etc.) that are protected by standard Unix file permissions (owned by root, mode 0200). Capabilities like CAP_NET_ADMIN do not bypass discretionary access control (DAC) for file writes.
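
In practice this means the E2E test guards on the effective UID and skips when it isn't 0. A minimal sketch of such a guard (not necessarily the exact check the test uses):

// Sketch only: skip GPU passthrough tests unless running as root.
func requireRoot(t *testing.T) {
	t.Helper()
	if os.Geteuid() != 0 {
		t.Skip("GPU passthrough tests require root for sysfs driver bind/unbind")
	}
}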

Prerequisites for the test to run (not skip):

  • Root permissions (sudo) - required for sysfs driver operations
  • NVIDIA GPU on host
  • IOMMU enabled (intel_iommu=on or amd_iommu=on)
  • vfio_pci and vfio_iommu_type1 modules loaded
  • /sbin in PATH (for mkfs.ext4)

# Prepare the environment
sudo modprobe vfio_pci vfio_iommu_type1

# Run via make - test auto-skips if not root or no GPU
make test

# Or run directly with sudo
sudo env PATH=$PATH:/sbin:/usr/sbin \
  go test -v -run TestGPUPassthrough -timeout 5m ./lib/devices/...

The test will:

  1. Check prerequisites and skip if not met (not root, no GPU, no IOMMU, etc.)
  2. Discover available NVIDIA GPUs
  3. Register the first GPU found
  4. Create a VM with GPU passthrough
  5. Verify the GPU is visible inside the VM
  6. Clean up (delete VM, unbind from VFIO, restore nvidia driver)

Future Plans: GPU Sharing Across Multiple VMs

The Problem

With current VFIO passthrough, a GPU is assigned exclusively to one VM. To share a single GPU across multiple VMs (e.g., give each VM a "slice"), you need NVIDIA's vGPU (GRID) technology.

Why MIG Alone Doesn't Help

MIG (Multi-Instance GPU) partitions a GPU into isolated instances at the hardware level, but:

  • MIG partitions are not separate PCI devices—the GPU remains one PCI endpoint
  • MIG partitions are accessed via CUDA APIs (CUDA_VISIBLE_DEVICES=MIG-<uuid>)
  • You can only VFIO-passthrough the whole GPU to one VM
  • MIG is useful for workload isolation within a single host or VM, not for multi-VM sharing

Physical GPU (0000:a2:00.0) ─── still ONE PCI device
    └── MIG partitions (logical, not separate devices)
        ├── MIG Instance 0 ─┐
        ├── MIG Instance 1 ─┼── All accessed via CUDA on the same GPU
        └── MIG Instance 2 ─┘

Supported MIG Hardware: A100, A30, H100, H200 (NOT L4 or consumer GPUs)

vGPU/mdev: The Only Path to Multi-VM GPU Sharing

To assign GPU shares to separate VMs, NVIDIA requires their vGPU (GRID) technology, which uses the Linux mediated device (mdev) framework.

Cloud-Hypervisor mdev Support Status

Cloud-hypervisor does support mdev passthrough:

cloud-hypervisor --device path=/sys/bus/mdev/devices/<uuid>/

However, NVIDIA's proprietary vGPU manager has a QEMU-specific quirk: it reads the VMM process's /proc/<pid>/cmdline looking for a -uuid argument to map mdev UUIDs to VMs. This doesn't work out-of-the-box with cloud-hypervisor.

Workarounds (from cloud-hypervisor#5319):

  • Patch CH to accept a dummy -uuid flag
  • Use wrapper scripts that inject the UUID into the process name
  • Wait for NVIDIA to fix their driver's VMM assumptions

vGPU Requirements
  • Hardware: Datacenter GPUs (A100, L40, etc.)
  • Licensing: NVIDIA GRID subscription ($$/GPU/year)
  • Host Software: NVIDIA vGPU Manager installed on host
  • Guest Drivers: vGPU-aware guest drivers

Design Changes for mdev/vGPU Support

1. New Device Type: MdevDevice
type MdevDevice struct {
    UUID         string   // mdev instance UUID
    ParentGPU    string   // PCI address of parent GPU  
    Type         string   // vGPU type (e.g., "nvidia-256")
    Available    bool     // Not assigned to a VM
}

2. Discovery Extensions
// List mdev types supported by a GPU
func (m *manager) ListMdevTypes(ctx context.Context, pciAddress string) ([]MdevType, error)

// List existing mdev instances
func (m *manager) ListMdevInstances(ctx context.Context) ([]MdevDevice, error)

// Create an mdev instance
func (m *manager) CreateMdevInstance(ctx context.Context, pciAddress, mdevType string) (*MdevDevice, error)

// Destroy an mdev instance
func (m *manager) DestroyMdevInstance(ctx context.Context, uuid string) error

3. Passthrough Mechanism

mdev devices use a different sysfs path:

# mdev device path
/sys/bus/mdev/devices/<uuid>/

# vs VFIO-PCI (current)
/sys/bus/pci/devices/0000:a2:00.0/

Cloud-hypervisor's --device flag already accepts mdev paths.
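
From hypeman's side, an mdev device would ride through the same DeviceConfig plumbing with the mdev path substituted. A hypothetical sketch (mdev support is not implemented; mdev.UUID refers to the proposed MdevDevice type above):

// Hypothetical sketch: pass an mdev instance instead of a PCI device.
vmConfig.Devices = &[]vmm.DeviceConfig{
    {
        Path: "/sys/bus/mdev/devices/" + mdev.UUID + "/",
    },
}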

4. NVIDIA vGPU Workaround

To work around NVIDIA's QEMU-specific UUID detection, we may need to:

  • Add a --platform uuid=<uuid> option to cloud-hypervisor invocation
  • Or use a wrapper that sets the process name appropriately

Implementation Phases

Phase 1: mdev Discovery & Passthrough

  • Detect mdev-capable GPUs
  • List available mdev types and instances
  • Pass mdev devices to VMs (path already works)

Phase 2: mdev Lifecycle Management

  • Create/destroy mdev instances via sysfs
  • API endpoints for mdev management

Phase 3: NVIDIA vGPU Integration

  • Implement UUID workaround for NVIDIA's driver
  • Test with GRID licensing
  • Document guest driver requirements

How vGPU + MIG Work Together

vGPU creates mdev devices that can be backed by MIG partitions, giving you both hardware isolation (MIG) and multi-VM assignment (vGPU):

Physical GPU (one PCI device)
    │
    ├── Without vGPU: VFIO passthrough gives whole GPU to ONE VM
    │
    └── With vGPU (GRID license required):
        └── MIG Mode enabled on host
            ├── MIG Instance 0 ──→ vGPU mdev A ──→ VM 1
            ├── MIG Instance 1 ──→ vGPU mdev B ──→ VM 2
            └── MIG Instance 2 ──→ vGPU mdev C ──→ VM 3

Without vGPU, MIG is only useful for workload isolation on the host or within a single VM that owns the whole GPU.

Documentation


Constants

This section is empty.

Variables

var (
	// ErrNotFound is returned when a device is not found
	ErrNotFound = errors.New("device not found")

	// ErrInUse is returned when a device is currently attached to an instance
	ErrInUse = errors.New("device is in use")

	// ErrNotBound is returned when a VFIO operation requires the device to be bound
	ErrNotBound = errors.New("device is not bound to VFIO")

	// ErrAlreadyBound is returned when trying to bind a device that's already bound to VFIO
	ErrAlreadyBound = errors.New("device is already bound to VFIO")

	// ErrAlreadyExists is returned when trying to register a device that already exists
	ErrAlreadyExists = errors.New("device already exists")

	// ErrInvalidName is returned when the device name doesn't match the required pattern
	ErrInvalidName = errors.New("device name must match pattern ^[a-zA-Z0-9][a-zA-Z0-9_.-]+$")

	// ErrNameExists is returned when a device with the same name already exists
	ErrNameExists = errors.New("device name already exists")

	// ErrInvalidPCIAddress is returned when the PCI address format is invalid
	ErrInvalidPCIAddress = errors.New("invalid PCI address format")

	// ErrDeviceNotFound is returned when the PCI device doesn't exist on the host
	ErrDeviceNotFound = errors.New("PCI device not found on host")

	// ErrVFIONotAvailable is returned when VFIO modules are not loaded
	ErrVFIONotAvailable = errors.New("VFIO is not available (modules not loaded)")

	// ErrIOMMUGroupConflict is returned when not all devices in IOMMU group can be passed through
	ErrIOMMUGroupConflict = errors.New("IOMMU group contains other devices that must also be passed through")
)

var DeviceNamePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9_.-]+$`)

DeviceNamePattern is the regex pattern for valid device names. Names must start with an alphanumeric character, followed by alphanumeric characters, underscores, dots, or dashes.

Functions

func GetDeviceSysfsPath

func GetDeviceSysfsPath(pciAddress string) string

GetDeviceSysfsPath returns the sysfs path for a PCI device (used by cloud-hypervisor)

func GetIOMMUGroupDevices

func GetIOMMUGroupDevices(iommuGroup int) ([]string, error)

GetIOMMUGroupDevices returns all PCI devices in the same IOMMU group

func ValidateDeviceName

func ValidateDeviceName(name string) bool

ValidateDeviceName validates that a device name matches the required pattern

func ValidatePCIAddress

func ValidatePCIAddress(addr string) bool

ValidatePCIAddress validates that a string is a valid PCI address format

Types

type AvailableDevice

type AvailableDevice struct {
	PCIAddress    string  `json:"pci_address"`
	VendorID      string  `json:"vendor_id"`
	DeviceID      string  `json:"device_id"`
	VendorName    string  `json:"vendor_name"`
	DeviceName    string  `json:"device_name"`
	IOMMUGroup    int     `json:"iommu_group"`
	CurrentDriver *string `json:"current_driver"` // nil if no driver bound
}

AvailableDevice represents a PCI device discovered on the host

func DiscoverAvailableDevices

func DiscoverAvailableDevices() ([]AvailableDevice, error)

DiscoverAvailableDevices scans sysfs for PCI devices that can be used for passthrough. It filters for devices that are likely candidates (GPUs, network cards, etc.).
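
A minimal usage sketch (error handling abbreviated):

devs, err := devices.DiscoverAvailableDevices()
if err != nil {
	log.Fatal(err)
}
for _, d := range devs {
	fmt.Printf("%s  %s %s (IOMMU group %d)\n", d.PCIAddress, d.VendorName, d.DeviceName, d.IOMMUGroup)
}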

func GetDeviceInfo

func GetDeviceInfo(pciAddress string) (*AvailableDevice, error)

GetDeviceInfo reads information about a specific PCI device

type CreateDeviceRequest

type CreateDeviceRequest struct {
	Name       string `json:"name,omitempty"` // optional: globally unique name (auto-generated if not provided)
	PCIAddress string `json:"pci_address"`    // required: PCI address (e.g., "0000:a2:00.0")
}

CreateDeviceRequest is the request to register a new device

type Device

type Device struct {
	Id          string     `json:"id"`            // cuid2 identifier
	Name        string     `json:"name"`          // user-provided globally unique name
	Type        DeviceType `json:"type"`          // gpu or pci
	PCIAddress  string     `json:"pci_address"`   // e.g., "0000:a2:00.0"
	VendorID    string     `json:"vendor_id"`     // e.g., "10de"
	DeviceID    string     `json:"device_id"`     // e.g., "27b8"
	IOMMUGroup  int        `json:"iommu_group"`   // IOMMU group number
	BoundToVFIO bool       `json:"bound_to_vfio"` // whether device is bound to vfio-pci
	AttachedTo  *string    `json:"attached_to"`   // instance ID if attached, nil otherwise
	CreatedAt   time.Time  `json:"created_at"`
}

Device represents a registered PCI device for passthrough

type DeviceType

type DeviceType string

DeviceType represents the type of PCI device

const (
	DeviceTypeGPU     DeviceType = "gpu"
	DeviceTypeGeneric DeviceType = "pci"
)

func DetermineDeviceType

func DetermineDeviceType(device *AvailableDevice) DeviceType

DetermineDeviceType determines the DeviceType based on device properties

type InstanceLivenessChecker

type InstanceLivenessChecker interface {
	// IsInstanceRunning returns true if the instance exists and is in a running state
	// (i.e., has an active VMM process). Returns false if the instance doesn't exist
	// or is stopped/standby/unknown.
	IsInstanceRunning(ctx context.Context, instanceID string) bool

	// GetInstanceDevices returns the list of device IDs attached to an instance.
	// Returns nil if the instance doesn't exist.
	GetInstanceDevices(ctx context.Context, instanceID string) []string

	// ListAllInstanceDevices returns a map of instanceID -> []deviceIDs for all instances.
	ListAllInstanceDevices(ctx context.Context) map[string][]string

	// DetectSuspiciousVMMProcesses finds cloud-hypervisor processes that don't match
	// known instances and logs warnings. Returns the count of suspicious processes found.
	DetectSuspiciousVMMProcesses(ctx context.Context) int
}

InstanceLivenessChecker provides a way to check if an instance is running. This interface allows devices to query instance state without a circular dependency.

type Manager

type Manager interface {
	// ListDevices returns all registered devices
	ListDevices(ctx context.Context) ([]Device, error)

	// ListAvailableDevices discovers passthrough-capable devices on the host
	ListAvailableDevices(ctx context.Context) ([]AvailableDevice, error)

	// CreateDevice registers a new device for passthrough
	CreateDevice(ctx context.Context, req CreateDeviceRequest) (*Device, error)

	// GetDevice returns a device by ID or name
	GetDevice(ctx context.Context, idOrName string) (*Device, error)

	// DeleteDevice unregisters a device
	DeleteDevice(ctx context.Context, id string) error

	// BindToVFIO binds a device to vfio-pci driver
	BindToVFIO(ctx context.Context, id string) error

	// UnbindFromVFIO unbinds a device from vfio-pci driver
	UnbindFromVFIO(ctx context.Context, id string) error

	// MarkAttached marks a device as attached to an instance
	MarkAttached(ctx context.Context, deviceID, instanceID string) error

	// MarkDetached marks a device as detached from an instance
	MarkDetached(ctx context.Context, deviceID string) error

	// ReconcileDevices cleans up stale device state on startup.
	// It detects devices with AttachedTo referencing non-existent instances
	// and clears the orphaned attachment state.
	ReconcileDevices(ctx context.Context) error

	// SetLivenessChecker sets the instance liveness checker after construction.
	// This allows breaking the circular dependency between device and instance managers.
	SetLivenessChecker(checker InstanceLivenessChecker)
}

Manager provides device management operations

func NewManager

func NewManager(p *paths.Paths) Manager

NewManager creates a new device manager. Use SetLivenessChecker after construction to enable accurate orphan detection.
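
A minimal construction sketch, assuming p is an initialized *paths.Paths and instMgr implements InstanceLivenessChecker:

mgr := devices.NewManager(p)
mgr.SetLivenessChecker(instMgr) // enables accurate orphan detection
if err := mgr.ReconcileDevices(ctx); err != nil {
	log.Printf("device reconciliation: %v", err)
}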

type VFIOBinder

type VFIOBinder struct{}

VFIOBinder handles binding and unbinding devices to/from VFIO

func NewVFIOBinder

func NewVFIOBinder() *VFIOBinder

NewVFIOBinder creates a new VFIOBinder

func (*VFIOBinder) BindToVFIO

func (v *VFIOBinder) BindToVFIO(pciAddress string) error

BindToVFIO binds a PCI device to the vfio-pci driver. This requires:

  1. Stopping any processes using the device (e.g., nvidia-persistenced for NVIDIA GPUs)
  2. Unbinding the device from its current driver (if any)
  3. Binding it to vfio-pci

func (*VFIOBinder) CheckIOMMUGroupSafe

func (v *VFIOBinder) CheckIOMMUGroupSafe(pciAddress string, allowedDevices []string) error

CheckIOMMUGroupSafe checks if all devices in the IOMMU group are safe to pass through. It returns an error if there are other devices in the group that aren't being passed through.

func (*VFIOBinder) GetVFIOGroupPath

func (v *VFIOBinder) GetVFIOGroupPath(pciAddress string) (string, error)

GetVFIOGroupPath returns the path to the VFIO group device for a PCI device

func (*VFIOBinder) IsDeviceBoundToVFIO

func (v *VFIOBinder) IsDeviceBoundToVFIO(pciAddress string) bool

IsDeviceBoundToVFIO checks if a device is currently bound to vfio-pci

func (*VFIOBinder) IsVFIOAvailable

func (v *VFIOBinder) IsVFIOAvailable() bool

IsVFIOAvailable checks if VFIO is available on the system

func (*VFIOBinder) UnbindFromVFIO

func (v *VFIOBinder) UnbindFromVFIO(pciAddress string) error

UnbindFromVFIO unbinds a device from vfio-pci and restores the original driver
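
A minimal bind/unbind round trip using VFIOBinder directly; assumes root, loaded VFIO modules, and a PCI address that exists on the host:

binder := devices.NewVFIOBinder()
if !binder.IsVFIOAvailable() {
	log.Fatal("VFIO modules not loaded")
}
if err := binder.BindToVFIO("0000:a2:00.0"); err != nil {
	log.Fatal(err)
}
if group, err := binder.GetVFIOGroupPath("0000:a2:00.0"); err == nil {
	log.Printf("VFIO group device: %s", group)
}
// ... start the VM with the device, then tear down ...
if err := binder.UnbindFromVFIO("0000:a2:00.0"); err != nil {
	log.Fatal(err)
}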
