devices

package v0.0.4
Published: Jan 12, 2026 License: MIT Imports: 16 Imported by: 0

README

Device Passthrough

This package provides GPU and PCI device passthrough for virtual machines using the Linux VFIO (Virtual Function I/O) framework.

Overview

Device passthrough allows a VM to have direct, near-native access to physical hardware (GPUs, network cards, etc.) by bypassing the host's device drivers and giving the guest exclusive control. For a deep dive into the VFIO framework, see the kernel documentation.

┌─────────────────────────────────────────────────────────────┐
│  Host                                                       │
│  ┌─────────────┐     ┌─────────────────────────────────┐    │
│  │  hypeman    │     │  VFIO Driver                    │    │
│  │  (VMM)      │────▶│  /dev/vfio/<group>              │    │
│  └─────────────┘     └─────────────────────────────────┘    │
│                              │                              │
│  ┌───────────────────────────┼──────────────────────────┐   │
│  │  IOMMU (hardware)         ▼                          │   │
│  │  - Translates guest physical → host physical         │   │
│  │  - Isolates DMA (device can only access VM memory)   │   │
│  └──────────────────────────────────────────────────────┘   │
│                              │                              │
│                              ▼                              │
│                    ┌──────────────┐                         │
│                    │  GPU (PCIe)  │                         │
│                    └──────────────┘                         │
└─────────────────────────────────────────────────────────────┘

Package Structure

lib/devices/
├── types.go        # Device, AvailableDevice, CreateDeviceRequest
├── errors.go       # Error definitions
├── discovery.go    # PCI device discovery from sysfs
├── vfio.go         # VFIO bind/unbind operations
├── manager.go      # Manager interface and implementation
├── manager_test.go # Unit tests
├── gpu_e2e_test.go # End-to-end GPU passthrough test (auto-skips if no GPU)
└── scripts/
    └── gpu-reset.sh  # GPU recovery script (see Troubleshooting)

Example: Full Workflow

# 1. Discover available devices
curl localhost:8080/devices/available
# → [{"pci_address": "0000:a2:00.0", "vendor_name": "NVIDIA Corporation", ...}]

# 2. Register the GPU
curl -X POST localhost:8080/devices \
  -d '{"name": "l4-gpu", "pci_address": "0000:a2:00.0"}'

# 3. Create instance with GPU (auto-binds to VFIO)
curl -X POST localhost:8080/instances \
  -d '{"name": "ml-training", "image": "nvidia/cuda:12.0-base", "devices": ["l4-gpu"]}'

# 4. Inside VM: verify GPU
lspci | grep -i nvidia
nvidia-smi

# 5. Delete instance (auto-unbinds from VFIO)
curl -X DELETE localhost:8080/instances/{id}
# GPU returns to host control

Device Lifecycle

1. Discovery

Discover passthrough-capable devices on the host:

GET /devices/available

Returns PCI devices that are candidates for passthrough (GPUs, 3D controllers). Each device includes its PCI address, vendor/device IDs, IOMMU group, and current driver.

2. Registration

Register a device with a unique name:

POST /devices
{
  "name": "l4-gpu",
  "pci_address": "0000:a2:00.0"
}

Registration does not modify the device's driver binding. The device remains usable by the host until an instance requests it.
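
For code that consumes the Go API directly rather than the HTTP endpoint, registration maps onto Manager.CreateDevice. A minimal sketch, assuming mgr is a constructed Manager (see the Manager interface in the documentation below); the helper name is illustrative:

// Sketch only: register a GPU via the Go API instead of the HTTP endpoint.
func registerGPU(ctx context.Context, mgr devices.Manager) (*devices.Device, error) {
	return mgr.CreateDevice(ctx, devices.CreateDeviceRequest{
		Name:       "l4-gpu",
		PCIAddress: "0000:a2:00.0",
	})
}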

3. Instance Creation (Auto-Bind)

When an instance is created with devices:

POST /instances
{
  "name": "gpu-workload",
  "image": "docker.io/nvidia/cuda:12.0-base",
  "devices": ["l4-gpu"]
}

The system automatically:

  1. Validates the device exists and isn't attached to another instance
  2. Binds to VFIO if not already bound (unbinds native driver like nvidia)
  3. Passes to cloud-hypervisor via the --device flag
  4. Marks as attached to prevent concurrent use

4. Instance Deletion (Auto-Unbind)

When an instance is deleted, the system automatically:

  1. Marks device as detached
  2. Unbinds from VFIO (triggers kernel driver probe to restore native driver)

This returns the device to host control so it can be used by other processes or a new instance.
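
In terms of the Manager interface documented below, the two flows amount to roughly the following. This is a hedged sketch, not the exact internal code; whether BindToVFIO short-circuits an already-bound device or returns ErrAlreadyBound is an implementation detail:

// Sketch only: auto-bind on instance creation, auto-unbind on deletion.
// On creation:
if err := mgr.BindToVFIO(ctx, deviceID); err != nil && !errors.Is(err, devices.ErrAlreadyBound) {
	return err
}
if err := mgr.MarkAttached(ctx, deviceID, instanceID); err != nil {
	return err
}

// On deletion:
if err := mgr.MarkDetached(ctx, deviceID); err != nil {
	return err
}
if err := mgr.UnbindFromVFIO(ctx, deviceID); err != nil {
	return err
}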

5. Unregistration

DELETE /devices/{id}

Removes the device from hypeman's registry. Fails if the device is currently attached to an instance.

Cloud Hypervisor Integration

Cloud-hypervisor receives device passthrough configuration via the VmConfig.Devices field:

vmConfig.Devices = &[]vmm.DeviceConfig{
    {
        Path: "/sys/bus/pci/devices/0000:a2:00.0/",
    },
}
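
The path can also be built with the package helper instead of being hardcoded. A small sketch, assuming dev is a registered *devices.Device:

// Sketch only: derive the cloud-hypervisor device path from the registered device.
vmConfig.Devices = &[]vmm.DeviceConfig{
    {
        Path: devices.GetDeviceSysfsPath(dev.PCIAddress),
    },
}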

Cloud-hypervisor then:

  1. Opens the VFIO group file (/dev/vfio/<group>)
  2. Maps device BARs (memory regions) into guest physical address space
  3. Configures interrupt routing (MSI/MSI-X) to the guest
  4. The guest sees a real PCIe device and loads native drivers

NVIDIA-Specific Options

For multi-GPU configurations, cloud-hypervisor supports GPUDirect P2P:

DeviceConfig{
    Path: "/sys/bus/pci/devices/0000:a2:00.0/",
    XNvGpudirectClique: ptr(int8(0)),  // Enable P2P within clique 0
}
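
The ptr helper above is not part of this package; it stands in for any small generic pointer helper, for example:

// ptr returns a pointer to v (generic helper, shown only for completeness).
func ptr[T any](v T) *T { return &v }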

This is not currently exposed through the hypeman API but could be added for HPC workloads.

Constraints and Limitations

IOMMU Requirements
  • IOMMU must be enabled in BIOS and kernel (intel_iommu=on or amd_iommu=on)
  • All devices in an IOMMU group must be passed through together
  • Some motherboards place many devices in the same group (ACS override may help)

VFIO Module Requirements

The following kernel modules must be loaded:

modprobe vfio_pci
modprobe vfio_iommu_type1

Driver Binding
  • Binding to VFIO unloads the native driver (e.g., nvidia, amdgpu)
  • Host processes using the device will lose access
  • Some drivers (like NVIDIA) may resist unbinding if in use

Single Attachment

A device can only be attached to one instance at a time. Attempts to attach an already-attached device will fail.
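
Callers can detect this condition through the package's sentinel errors. A minimal sketch, assuming the attach path surfaces ErrInUse for an already-attached device:

// Sketch only: distinguish the single-attachment conflict from other failures.
if err := mgr.MarkAttached(ctx, deviceID, instanceID); errors.Is(err, devices.ErrInUse) {
	return fmt.Errorf("device %s is already attached to another instance: %w", deviceID, err)
} else if err != nil {
	return err
}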

No Hot-Plug

Devices must be specified at instance creation time. Hot-adding devices to a running VM is not currently supported (though cloud-hypervisor has this capability).

Guest Driver Requirements

The guest must have appropriate drivers:

  • NVIDIA GPUs: Install NVIDIA drivers in the guest image
  • AMD GPUs: Install amdgpu/ROCm in the guest image

Performance Considerations
  • ACS (Access Control Services): Required for proper isolation on some systems
  • Huge Pages: Recommended for GPU workloads (hugepages=on in cloud-hypervisor)
  • CPU Pinning: Can improve latency for GPU compute workloads

Troubleshooting

GPU Reset Script

If GPU passthrough tests fail or hang, the GPU may be left in a bad state (still bound to vfio-pci, or stuck without a driver). Use the provided reset script:

# Reset all NVIDIA GPUs to their native driver
sudo ./lib/devices/scripts/gpu-reset.sh

# Reset a specific GPU
sudo ./lib/devices/scripts/gpu-reset.sh 0000:a2:00.0

The script will:

  1. Kill any stuck cloud-hypervisor processes holding the GPU
  2. Unbind from vfio-pci if still bound
  3. Clear driver_override
  4. Trigger driver probe to rebind to the nvidia driver
  5. Restart nvidia-persistenced

Common Issues

VFIO Bind Hangs

Symptom: BindToVFIO hangs indefinitely.

Cause: The nvidia-persistenced service keeps /dev/nvidia* open, preventing driver unbind.

Solution: The code now automatically stops nvidia-persistenced before unbinding. If you're testing manually:

sudo systemctl stop nvidia-persistenced
# ... do VFIO bind/unbind ...
sudo systemctl start nvidia-persistenced

VM Exec Fails After Boot

Symptom: VM boots but exec commands time out.

Cause: Usually the container's main process exited (e.g., the alpine image runs /bin/sh, which exits immediately), causing init to exit and the guest kernel to panic.

Solution: Use an image with a long-running process (e.g., nginx:alpine) or ensure your container has a persistent entrypoint.

GPU Not Restored After Test

Symptom: GPU has no driver bound, nvidia-smi fails.

Solution:

# Trigger kernel driver probe
sudo sh -c 'echo 0000:a2:00.0 > /sys/bus/pci/drivers_probe'
# Restart nvidia-persistenced
sudo systemctl start nvidia-persistenced
# Verify
nvidia-smi

If that fails, a system reboot may be necessary.

VFIO Modules Not Loaded

Symptom: ErrVFIONotAvailable error.

Solution:

sudo modprobe vfio_pci vfio_iommu_type1
# Verify
ls /dev/vfio/

Add to /etc/modules-load.d/vfio.conf for persistence across reboots.

IOMMU Not Enabled

Symptom: No IOMMU groups found, passthrough fails.

Solution: Add kernel parameter to bootloader:

  • Intel: intel_iommu=on iommu=pt
  • AMD: amd_iommu=on iommu=pt

Then reboot.

Running the E2E Test

The GPU passthrough E2E test automatically detects GPU availability and skips if prerequisites aren't met.

Why GPU tests require root: Unlike network tests, which can use Linux capabilities (CAP_NET_ADMIN), GPU passthrough requires writing to sysfs files (/sys/bus/pci/drivers/*/unbind, etc.) that are protected by standard Unix file permissions (owned by root, mode 0200). Capabilities like CAP_NET_ADMIN do not bypass discretionary access control (DAC) for file writes.
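
In practice this means the E2E test guards on the effective UID and skips when it isn't 0. A minimal sketch of such a guard (not necessarily the exact check the test uses):

// Sketch only: skip GPU passthrough tests unless running as root.
func requireRoot(t *testing.T) {
	t.Helper()
	if os.Geteuid() != 0 {
		t.Skip("GPU passthrough tests require root for sysfs driver bind/unbind")
	}
}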

Prerequisites for the test to run (not skip):

  • Root permissions (sudo) - required for sysfs driver operations
  • NVIDIA GPU on host
  • IOMMU enabled (intel_iommu=on or amd_iommu=on)
  • vfio_pci and vfio_iommu_type1 modules loaded
  • /sbin in PATH (for mkfs.ext4)

# Prepare the environment
sudo modprobe vfio_pci vfio_iommu_type1

# Run via make - test auto-skips if not root or no GPU
make test

# Or run directly with sudo
sudo env PATH=$PATH:/sbin:/usr/sbin \
  go test -v -run TestGPUPassthrough -timeout 5m ./lib/devices/...

The test will:

  1. Check prerequisites and skip if not met (not root, no GPU, no IOMMU, etc.)
  2. Discover available NVIDIA GPUs
  3. Register the first GPU found
  4. Create a VM with GPU passthrough
  5. Verify the GPU is visible inside the VM
  6. Clean up (delete VM, unbind from VFIO, restore nvidia driver)

Future Plans: GPU Sharing Across Multiple VMs

The Problem

With current VFIO passthrough, a GPU is assigned exclusively to one VM. To share a single GPU across multiple VMs (e.g., give each VM a "slice"), you need NVIDIA's vGPU (GRID) technology.

Why MIG Alone Doesn't Help

MIG (Multi-Instance GPU) partitions a GPU into isolated instances at the hardware level, but:

  • MIG partitions are not separate PCI devices—the GPU remains one PCI endpoint
  • MIG partitions are accessed via CUDA APIs (CUDA_VISIBLE_DEVICES=MIG-<uuid>)
  • You can only VFIO-passthrough the whole GPU to one VM
  • MIG is useful for workload isolation within a single host or VM, not for multi-VM sharing

Physical GPU (0000:a2:00.0) ─── still ONE PCI device
    └── MIG partitions (logical, not separate devices)
        ├── MIG Instance 0 ─┐
        ├── MIG Instance 1 ─┼── All accessed via CUDA on the same GPU
        └── MIG Instance 2 ─┘

Supported MIG Hardware: A100, A30, H100, H200 (NOT L4 or consumer GPUs)

vGPU/mdev: The Only Path to Multi-VM GPU Sharing

To assign GPU shares to separate VMs, NVIDIA requires their vGPU (GRID) technology, which uses the Linux mediated device (mdev) framework.

Cloud-Hypervisor mdev Support Status

Cloud-hypervisor does support mdev passthrough:

cloud-hypervisor --device path=/sys/bus/mdev/devices/<uuid>/

However, NVIDIA's proprietary vGPU manager has a QEMU-specific quirk: it reads the VMM process's /proc/<pid>/cmdline looking for a -uuid argument to map mdev UUIDs to VMs. This doesn't work out-of-the-box with cloud-hypervisor.

Workarounds (from cloud-hypervisor#5319):

  • Patch CH to accept a dummy -uuid flag
  • Use wrapper scripts that inject the UUID into the process name
  • Wait for NVIDIA to fix their driver's VMM assumptions

vGPU Requirements
  • Hardware: Datacenter GPUs (A100, L40, etc.)
  • Licensing: NVIDIA GRID subscription ($$/GPU/year)
  • Host Software: NVIDIA vGPU Manager installed on host
  • Guest Drivers: vGPU-aware guest drivers

Design Changes for mdev/vGPU Support

1. New Device Type: MdevDevice
type MdevDevice struct {
    UUID         string   // mdev instance UUID
    ParentGPU    string   // PCI address of parent GPU  
    Type         string   // vGPU type (e.g., "nvidia-256")
    Available    bool     // Not assigned to a VM
}

2. Discovery Extensions
// List mdev types supported by a GPU
func (m *manager) ListMdevTypes(ctx context.Context, pciAddress string) ([]MdevType, error)

// List existing mdev instances
func (m *manager) ListMdevInstances(ctx context.Context) ([]MdevDevice, error)

// Create an mdev instance
func (m *manager) CreateMdevInstance(ctx context.Context, pciAddress, mdevType string) (*MdevDevice, error)

// Destroy an mdev instance
func (m *manager) DestroyMdevInstance(ctx context.Context, uuid string) error

3. Passthrough Mechanism

mdev devices use a different sysfs path:

# mdev device path
/sys/bus/mdev/devices/<uuid>/

# vs VFIO-PCI (current)
/sys/bus/pci/devices/0000:a2:00.0/

Cloud-hypervisor's --device flag already accepts mdev paths.
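
From hypeman's side, an mdev device would ride through the same DeviceConfig plumbing with the mdev path substituted. A hypothetical sketch (mdev support is not implemented; mdev.UUID refers to the proposed MdevDevice type above):

// Hypothetical sketch: pass an mdev instance instead of a PCI device.
vmConfig.Devices = &[]vmm.DeviceConfig{
    {
        Path: "/sys/bus/mdev/devices/" + mdev.UUID + "/",
    },
}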

4. NVIDIA vGPU Workaround

To work around NVIDIA's QEMU-specific UUID detection, we may need to:

  • Add a --platform uuid=<uuid> option to cloud-hypervisor invocation
  • Or use a wrapper that sets the process name appropriately

Implementation Phases

Phase 1: mdev Discovery & Passthrough

  • Detect mdev-capable GPUs
  • List available mdev types and instances
  • Pass mdev devices to VMs (path already works)

Phase 2: mdev Lifecycle Management

  • Create/destroy mdev instances via sysfs
  • API endpoints for mdev management

Phase 3: NVIDIA vGPU Integration

  • Implement UUID workaround for NVIDIA's driver
  • Test with GRID licensing
  • Document guest driver requirements

How vGPU + MIG Work Together

vGPU creates mdev devices that can be backed by MIG partitions, giving you both hardware isolation (MIG) and multi-VM assignment (vGPU):

Physical GPU (one PCI device)
    │
    ├── Without vGPU: VFIO passthrough gives whole GPU to ONE VM
    │
    └── With vGPU (GRID license required):
        └── MIG Mode enabled on host
            ├── MIG Instance 0 ──→ vGPU mdev A ──→ VM 1
            ├── MIG Instance 1 ──→ vGPU mdev B ──→ VM 2
            └── MIG Instance 2 ──→ vGPU mdev C ──→ VM 3

Without vGPU, MIG is only useful for workload isolation on the host or within a single VM that owns the whole GPU.

Documentation


Constants

This section is empty.

Variables

var (
	// ErrNotFound is returned when a device is not found
	ErrNotFound = errors.New("device not found")

	// ErrInUse is returned when a device is currently attached to an instance
	ErrInUse = errors.New("device is in use")

	// ErrNotBound is returned when a VFIO operation requires the device to be bound
	ErrNotBound = errors.New("device is not bound to VFIO")

	// ErrAlreadyBound is returned when trying to bind a device that's already bound to VFIO
	ErrAlreadyBound = errors.New("device is already bound to VFIO")

	// ErrAlreadyExists is returned when trying to register a device that already exists
	ErrAlreadyExists = errors.New("device already exists")

	// ErrInvalidName is returned when the device name doesn't match the required pattern
	ErrInvalidName = errors.New("device name must match pattern ^[a-zA-Z0-9][a-zA-Z0-9_.-]+$")

	// ErrNameExists is returned when a device with the same name already exists
	ErrNameExists = errors.New("device name already exists")

	// ErrInvalidPCIAddress is returned when the PCI address format is invalid
	ErrInvalidPCIAddress = errors.New("invalid PCI address format")

	// ErrDeviceNotFound is returned when the PCI device doesn't exist on the host
	ErrDeviceNotFound = errors.New("PCI device not found on host")

	// ErrVFIONotAvailable is returned when VFIO modules are not loaded
	ErrVFIONotAvailable = errors.New("VFIO is not available (modules not loaded)")

	// ErrIOMMUGroupConflict is returned when not all devices in IOMMU group can be passed through
	ErrIOMMUGroupConflict = errors.New("IOMMU group contains other devices that must also be passed through")
)

var DeviceNamePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9_.-]+$`)

DeviceNamePattern is the regex pattern for valid device names. Names must start with an alphanumeric character, followed by alphanumeric characters, underscores, dots, or dashes.

Functions

func GetDeviceSysfsPath

func GetDeviceSysfsPath(pciAddress string) string

GetDeviceSysfsPath returns the sysfs path for a PCI device (used by cloud-hypervisor)

func GetIOMMUGroupDevices

func GetIOMMUGroupDevices(iommuGroup int) ([]string, error)

GetIOMMUGroupDevices returns all PCI devices in the same IOMMU group

func ValidateDeviceName

func ValidateDeviceName(name string) bool

ValidateDeviceName validates that a device name matches the required pattern

func ValidatePCIAddress

func ValidatePCIAddress(addr string) bool

ValidatePCIAddress validates that a string is a valid PCI address format

Types

type AvailableDevice

type AvailableDevice struct {
	PCIAddress    string  `json:"pci_address"`
	VendorID      string  `json:"vendor_id"`
	DeviceID      string  `json:"device_id"`
	VendorName    string  `json:"vendor_name"`
	DeviceName    string  `json:"device_name"`
	IOMMUGroup    int     `json:"iommu_group"`
	CurrentDriver *string `json:"current_driver"` // nil if no driver bound
}

AvailableDevice represents a PCI device discovered on the host

func DiscoverAvailableDevices

func DiscoverAvailableDevices() ([]AvailableDevice, error)

DiscoverAvailableDevices scans sysfs for PCI devices that can be used for passthrough. It filters for devices that are likely candidates (GPUs, network cards, etc.).
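
A minimal usage sketch (error handling abbreviated):

devs, err := devices.DiscoverAvailableDevices()
if err != nil {
	log.Fatal(err)
}
for _, d := range devs {
	fmt.Printf("%s  %s %s (IOMMU group %d)\n", d.PCIAddress, d.VendorName, d.DeviceName, d.IOMMUGroup)
}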

func GetDeviceInfo

func GetDeviceInfo(pciAddress string) (*AvailableDevice, error)

GetDeviceInfo reads information about a specific PCI device

type CreateDeviceRequest

type CreateDeviceRequest struct {
	Name       string `json:"name,omitempty"` // optional: globally unique name (auto-generated if not provided)
	PCIAddress string `json:"pci_address"`    // required: PCI address (e.g., "0000:a2:00.0")
}

CreateDeviceRequest is the request to register a new device

type Device

type Device struct {
	Id          string     `json:"id"`            // cuid2 identifier
	Name        string     `json:"name"`          // user-provided globally unique name
	Type        DeviceType `json:"type"`          // gpu or pci
	PCIAddress  string     `json:"pci_address"`   // e.g., "0000:a2:00.0"
	VendorID    string     `json:"vendor_id"`     // e.g., "10de"
	DeviceID    string     `json:"device_id"`     // e.g., "27b8"
	IOMMUGroup  int        `json:"iommu_group"`   // IOMMU group number
	BoundToVFIO bool       `json:"bound_to_vfio"` // whether device is bound to vfio-pci
	AttachedTo  *string    `json:"attached_to"`   // instance ID if attached, nil otherwise
	CreatedAt   time.Time  `json:"created_at"`
}

Device represents a registered PCI device for passthrough

type DeviceType

type DeviceType string

DeviceType represents the type of PCI device

const (
	DeviceTypeGPU     DeviceType = "gpu"
	DeviceTypeGeneric DeviceType = "pci"
)

func DetermineDeviceType

func DetermineDeviceType(device *AvailableDevice) DeviceType

DetermineDeviceType determines the DeviceType based on device properties

type InstanceLivenessChecker

type InstanceLivenessChecker interface {
	// IsInstanceRunning returns true if the instance exists and is in a running state
	// (i.e., has an active VMM process). Returns false if the instance doesn't exist
	// or is stopped/standby/unknown.
	IsInstanceRunning(ctx context.Context, instanceID string) bool

	// GetInstanceDevices returns the list of device IDs attached to an instance.
	// Returns nil if the instance doesn't exist.
	GetInstanceDevices(ctx context.Context, instanceID string) []string

	// ListAllInstanceDevices returns a map of instanceID -> []deviceIDs for all instances.
	ListAllInstanceDevices(ctx context.Context) map[string][]string

	// DetectSuspiciousVMMProcesses finds cloud-hypervisor processes that don't match
	// known instances and logs warnings. Returns the count of suspicious processes found.
	DetectSuspiciousVMMProcesses(ctx context.Context) int
}

InstanceLivenessChecker provides a way to check if an instance is running. This interface allows devices to query instance state without a circular dependency.

type Manager

type Manager interface {
	// ListDevices returns all registered devices
	ListDevices(ctx context.Context) ([]Device, error)

	// ListAvailableDevices discovers passthrough-capable devices on the host
	ListAvailableDevices(ctx context.Context) ([]AvailableDevice, error)

	// CreateDevice registers a new device for passthrough
	CreateDevice(ctx context.Context, req CreateDeviceRequest) (*Device, error)

	// GetDevice returns a device by ID or name
	GetDevice(ctx context.Context, idOrName string) (*Device, error)

	// DeleteDevice unregisters a device
	DeleteDevice(ctx context.Context, id string) error

	// BindToVFIO binds a device to vfio-pci driver
	BindToVFIO(ctx context.Context, id string) error

	// UnbindFromVFIO unbinds a device from vfio-pci driver
	UnbindFromVFIO(ctx context.Context, id string) error

	// MarkAttached marks a device as attached to an instance
	MarkAttached(ctx context.Context, deviceID, instanceID string) error

	// MarkDetached marks a device as detached from an instance
	MarkDetached(ctx context.Context, deviceID string) error

	// ReconcileDevices cleans up stale device state on startup.
	// It detects devices with AttachedTo referencing non-existent instances
	// and clears the orphaned attachment state.
	ReconcileDevices(ctx context.Context) error

	// SetLivenessChecker sets the instance liveness checker after construction.
	// This allows breaking the circular dependency between device and instance managers.
	SetLivenessChecker(checker InstanceLivenessChecker)
}

Manager provides device management operations

func NewManager

func NewManager(p *paths.Paths) Manager

NewManager creates a new device manager. Use SetLivenessChecker after construction to enable accurate orphan detection.
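
A minimal construction sketch, assuming p is an initialized *paths.Paths and instMgr implements InstanceLivenessChecker:

mgr := devices.NewManager(p)
mgr.SetLivenessChecker(instMgr) // enables accurate orphan detection
if err := mgr.ReconcileDevices(ctx); err != nil {
	log.Printf("device reconciliation: %v", err)
}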

type VFIOBinder

type VFIOBinder struct{}

VFIOBinder handles binding and unbinding devices to/from VFIO

func NewVFIOBinder

func NewVFIOBinder() *VFIOBinder

NewVFIOBinder creates a new VFIOBinder

func (*VFIOBinder) BindToVFIO

func (v *VFIOBinder) BindToVFIO(pciAddress string) error

BindToVFIO binds a PCI device to the vfio-pci driver. This requires:

  1. Stopping any processes using the device (e.g., nvidia-persistenced for NVIDIA GPUs)
  2. Unbinding the device from its current driver (if any)
  3. Binding it to vfio-pci

func (*VFIOBinder) CheckIOMMUGroupSafe

func (v *VFIOBinder) CheckIOMMUGroupSafe(pciAddress string, allowedDevices []string) error

CheckIOMMUGroupSafe checks if all devices in the IOMMU group are safe to pass through. It returns an error if there are other devices in the group that aren't being passed through.

func (*VFIOBinder) GetVFIOGroupPath

func (v *VFIOBinder) GetVFIOGroupPath(pciAddress string) (string, error)

GetVFIOGroupPath returns the path to the VFIO group device for a PCI device

func (*VFIOBinder) IsDeviceBoundToVFIO

func (v *VFIOBinder) IsDeviceBoundToVFIO(pciAddress string) bool

IsDeviceBoundToVFIO checks if a device is currently bound to vfio-pci

func (*VFIOBinder) IsVFIOAvailable

func (v *VFIOBinder) IsVFIOAvailable() bool

IsVFIOAvailable checks if VFIO is available on the system

func (*VFIOBinder) UnbindFromVFIO

func (v *VFIOBinder) UnbindFromVFIO(pciAddress string) error

UnbindFromVFIO unbinds a device from vfio-pci and restores the original driver
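
A minimal bind/unbind round trip using VFIOBinder directly; assumes root, loaded VFIO modules, and a PCI address that exists on the host:

binder := devices.NewVFIOBinder()
if !binder.IsVFIOAvailable() {
	log.Fatal("VFIO modules not loaded")
}
if err := binder.BindToVFIO("0000:a2:00.0"); err != nil {
	log.Fatal(err)
}
if group, err := binder.GetVFIOGroupPath("0000:a2:00.0"); err == nil {
	log.Printf("VFIO group device: %s", group)
}
// ... start the VM with the device, then tear down ...
if err := binder.UnbindFromVFIO("0000:a2:00.0"); err != nil {
	log.Fatal(err)
}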
