device

Version: v0.0.1
Published: Feb 12, 2026 License: Apache-2.0 Imports: 15 Imported by: 0

Documentation

Overview

Package device provides domestic chip device detection and management.

The file config_loader.go provides configuration-based device loading.

This module loads device and chip information from configuration files, making them available to the device detection and management system.

This package handles detection and management of Chinese-made chip devices including:

  • Hardware detection and capability querying
  • Device availability checking
  • Device metadata and properties
  • Thread-safe access to device information

The package currently supports Huawei Ascend NPU chips (910B, 310P). In production deployments, this package integrates with vendor-specific drivers and libraries for actual hardware detection.

Constants

This section is empty.

Variables

var KnownChips = LoadChipsFromConfig()

KnownChips holds the chip models loaded from configuration. It is populated once, by LoadChipsFromConfig, when the package is initialized.

var KnownVendors = LoadVendorsFromConfig()

KnownVendors holds the vendors loaded from configuration. It is populated once, by LoadVendorsFromConfig, when the package is initialized.

Functions

func FindAIChips

func FindAIChips() (map[string][]DetectedChip, error)

FindAIChips scans the system for known AI chips.

This function combines PCI device scanning with the chip configuration to identify AI accelerators present in the system.

For multi-chip cards (where chips_per_device > 1 in config), each physical PCI device is expanded into multiple logical chips with consecutive indices. For example, if 4 dual-chip cards are detected, this returns 8 logical chips with indices 0-7, allowing the allocator to treat them as independent devices.

Returns:

  • Map of device type to slice of detected chips (logical chips with consecutive indices)
  • Error if scanning fails
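
The expansion of multi-chip cards into logical chips can be illustrated with a minimal sketch. The types physicalCard and logicalChip, the Index field, and the function expandLogicalChips are hypothetical names introduced for this example; they are not part of the package, and the real implementation may differ.

```go
package main

import "fmt"

// physicalCard is a hypothetical stand-in for one scanned PCI device.
type physicalCard struct {
	BusAddress     string
	ChipsPerDevice int // mirrors chips_per_device from the config
}

// logicalChip is a hypothetical, trimmed-down counterpart of DetectedChip.
type logicalChip struct {
	Index               int // consecutive logical index across all cards
	PhysicalDeviceIndex int // which physical card the chip sits on
	ChipIndex           int // position within that card
	BusAddress          string
}

// expandLogicalChips turns each physical card into ChipsPerDevice logical
// chips and numbers them consecutively in scan order.
func expandLogicalChips(cards []physicalCard) []logicalChip {
	var chips []logicalChip
	for physIdx, card := range cards {
		n := card.ChipsPerDevice
		if n < 1 {
			n = 1 // assumption: a missing or invalid value means a single-chip card
		}
		for chipIdx := 0; chipIdx < n; chipIdx++ {
			chips = append(chips, logicalChip{
				Index:               len(chips),
				PhysicalDeviceIndex: physIdx,
				ChipIndex:           chipIdx,
				BusAddress:          card.BusAddress,
			})
		}
	}
	return chips
}

func main() {
	// Four dual-chip cards, as in the example above.
	cards := []physicalCard{
		{"0000:01:00.0", 2}, {"0000:02:00.0", 2},
		{"0000:03:00.0", 2}, {"0000:04:00.0", 2},
	}
	for _, c := range expandLogicalChips(cards) {
		fmt.Printf("logical %d -> card %d, chip %d (%s)\n",
			c.Index, c.PhysicalDeviceIndex, c.ChipIndex, c.BusAddress)
	}
}
```

With four dual-chip cards the sketch yields eight logical chips indexed 0 through 7, matching the behavior described above.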

Types

type Allocator

type Allocator struct {
	// contains filtered or unexported fields
}

Allocator manages the allocation and release of physical devices. It scans for available AI accelerators and dynamically tracks which devices are allocated by querying running Docker containers.

The allocator supports topology-aware allocation to optimize device placement for high-speed interconnected devices (e.g., NVLink, HCCS).

func NewAllocator

func NewAllocator() (*Allocator, error)

NewAllocator creates and initializes a new Allocator.

The allocator scans the system for AI accelerators and dynamically tracks device allocation by querying running Docker containers.

Returns:

  • Configured allocator
  • Error if device scanning or Docker client creation fails

func (*Allocator) Allocate

func (a *Allocator) Allocate(instanceID string, count int) ([]DeviceInfo, error)

Allocate attempts to allocate 'count' devices for a given instance.

This method selects free devices by checking current Docker container allocations. The device allocation is tracked in the container labels, not in a separate state file.

Parameters:

  • instanceID: Unique identifier for the instance
  • count: Number of devices to allocate

Returns:

  • Slice of allocated DeviceInfo
  • Error if insufficient devices are available
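
The selection step can be sketched as follows, under the assumption that the set of in-use device indices has already been derived from container labels. The type deviceInfo and the function pickFree are hypothetical; the sketch does not talk to Docker and does not show the topology-aware ordering the allocator may apply.

```go
package main

import (
	"errors"
	"fmt"
)

// deviceInfo is a hypothetical, trimmed-down counterpart of DeviceInfo.
type deviceInfo struct {
	Index      int
	BusAddress string
}

// errInsufficient is returned when fewer free devices exist than requested.
var errInsufficient = errors.New("insufficient free devices")

// pickFree returns the first count devices whose index is not marked in inUse.
// inUse stands in for the allocations read from running containers.
func pickFree(all []deviceInfo, inUse map[int]bool, count int) ([]deviceInfo, error) {
	if count <= 0 {
		return nil, fmt.Errorf("count must be positive, got %d", count)
	}
	picked := make([]deviceInfo, 0, count)
	for _, d := range all {
		if inUse[d.Index] {
			continue // already claimed by another instance
		}
		picked = append(picked, d)
		if len(picked) == count {
			return picked, nil
		}
	}
	return nil, fmt.Errorf("%w: want %d, found %d", errInsufficient, count, len(picked))
}

func main() {
	all := []deviceInfo{{0, "0000:01:00.0"}, {1, "0000:02:00.0"}, {2, "0000:03:00.0"}}
	inUse := map[int]bool{1: true} // device 1 is held by a running container

	picked, err := pickFree(all, inUse, 2)
	if err != nil {
		fmt.Println("allocation failed:", err)
		return
	}
	for _, d := range picked {
		fmt.Printf("allocated device %d at %s\n", d.Index, d.BusAddress)
	}
}
```

Because the in-use set is recomputed from containers on every call, there is no separate state file that could drift out of sync with what is actually running.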

func (*Allocator) GetAllDevices

func (a *Allocator) GetAllDevices() []DeviceInfo

GetAllDevices returns information about all detected devices.

Returns:

  • Slice of all DeviceInfo (both allocated and free)

func (*Allocator) GetAllocations

func (a *Allocator) GetAllocations() map[string][]DeviceInfo

GetAllocations returns a map of current device allocations from Docker containers.

Returns:

  • Map from instanceID to slice of allocated devices

func (*Allocator) GetAvailableDevices

func (a *Allocator) GetAvailableDevices() []DeviceInfo

GetAvailableDevices returns information about all available (free) devices.

Returns:

  • Slice of DeviceInfo for devices that are currently unallocated

func (*Allocator) Release

func (a *Allocator) Release(instanceID string) error

Release frees devices previously allocated to an instance.

Since devices are tracked via Docker containers, this method only logs the release. The actual device freeing happens when the container is stopped/removed.

Parameters:

  • instanceID: Unique identifier for the instance

Returns:

  • Always returns nil (kept for API compatibility)

type ChipModel

type ChipModel struct {
	// VendorID is the PCI vendor ID
	VendorID string

	// DeviceID is the PCI device ID
	DeviceID string

	// ModelName is the human-readable model name
	ModelName string

	// ConfigKey is the key used in runtime configuration (e.g., "ascend-910b")
	ConfigKey string

	// DeviceType is the corresponding xw device type
	DeviceType api.DeviceType

	// Generation is the chip generation (optional)
	Generation string

	// Capabilities lists the chip's capabilities
	Capabilities []string
}

ChipModel represents a specific chip model, identified by its PCI vendor and device IDs.

func GetChipByID

func GetChipByID(vendorID, deviceID string) *ChipModel

GetChipByID looks up a chip model by PCI vendor and device IDs.

This function searches the configured chip models for a match with the specified PCI identifiers. It's commonly used during device detection to identify discovered hardware.

Parameters:

  • vendorID: PCI vendor ID (e.g., "0x19e5")
  • deviceID: PCI device ID (e.g., "0xd802")

Returns:

  • Pointer to ChipModel if found
  • nil if not found

Example:

chip := GetChipByID("0x19e5", "0xd802")
if chip != nil {
    fmt.Printf("Found: %s\n", chip.ModelName)
}

func GetChipsByDeviceType

func GetChipsByDeviceType(deviceType api.DeviceType) []ChipModel

GetChipsByDeviceType returns all chip models for a specific device type.

This function filters chip models by device type, useful for showing all chips that support a particular device type.

Parameters:

  • deviceType: The device type to filter by

Returns:

  • Slice of ChipModel structs matching the device type

Example:

chips := GetChipsByDeviceType(api.DeviceType("ascend-910b"))
for _, chip := range chips {
    fmt.Printf("- %s\n", chip.ModelName)
}

func GetChipsByVendor

func GetChipsByVendor(vendorID string) []ChipModel

GetChipsByVendor returns all chip models for a specific vendor.

This function filters chip models by vendor ID, returning only those that match. Useful for displaying vendor-specific chip information.

Parameters:

  • vendorID: PCI vendor ID to filter by

Returns:

  • Slice of ChipModel structs matching the vendor

Example:

chips := GetChipsByVendor("0x19e5")
fmt.Printf("Huawei has %d chip model(s)\n", len(chips))

func LoadChipsFromConfig

func LoadChipsFromConfig() []ChipModel

LoadChipsFromConfig loads chip model information from device configuration.

This function reads the device configuration file and extracts chip model details, including PCI IDs, capabilities, and device types.

Returns:

  • Slice of ChipModel structs
  • Empty slice if configuration loading fails

Example:

chips := LoadChipsFromConfig()
for _, chip := range chips {
    fmt.Printf("Chip: %s (%s:%s)\n", chip.ModelName, chip.VendorID, chip.DeviceID)
}

type ChipVendor

type ChipVendor struct {
	// VendorID is the PCI vendor ID (e.g., "0x19e5" for Huawei)
	VendorID string

	// VendorName is the human-readable vendor name
	VendorName string
}

ChipVendor represents a chip vendor by its PCI vendor ID and name.

func LoadVendorsFromConfig

func LoadVendorsFromConfig() []ChipVendor

LoadVendorsFromConfig loads vendor information from device configuration.

This function reads the device configuration file and extracts vendor information, making it available for device identification and display.

Returns:

  • Slice of ChipVendor structs
  • Empty slice if configuration loading fails

Example:

vendors := LoadVendorsFromConfig()
for _, vendor := range vendors {
    fmt.Printf("Vendor: %s (%s)\n", vendor.VendorName, vendor.VendorID)
}

type DetectedChip

type DetectedChip struct {
	// VendorID is the PCI vendor ID
	VendorID string `json:"vendor_id"`

	// DeviceID is the PCI device ID
	DeviceID string `json:"device_id"`

	// BusAddress is the PCI bus address
	BusAddress string `json:"bus_address"`

	// ModelName is the chip model name
	ModelName string `json:"model_name"`

	// ConfigKey is the base model config key (e.g., "ascend-910b")
	// Used for sandbox selection and image lookup
	ConfigKey string `json:"config_key"`

	// VariantKey is the specific variant key if matched (e.g., "ascend-910b1")
	// Used for runtime_params matching, empty if no variant matched
	VariantKey string `json:"variant_key,omitempty"`

	// DeviceType is the xw device type (same as VariantKey if variant matched, otherwise ConfigKey)
	DeviceType api.DeviceType `json:"device_type"`

	// Generation is the chip generation
	Generation string `json:"generation"`

	// Capabilities lists the chip's capabilities
	Capabilities []string `json:"capabilities"`

	// PhysicalDeviceIndex is the index of the physical PCI device (0-based)
	// Used to identify which physical card this chip belongs to
	PhysicalDeviceIndex int `json:"physical_device_index"`

	// ChipIndex is the chip index within a multi-chip card (0-based)
	// For single-chip cards: 0
	// For dual-chip cards: 0 or 1
	ChipIndex int `json:"chip_index"`

	// ChipsPerDevice indicates total chips on this physical device
	ChipsPerDevice int `json:"chips_per_device"`
}

DetectedChip represents a detected AI chip, including its PCI identifiers, model information, and position on the physical card.

type Device

type Device struct {
	// Type is the device type identifying the chip architecture.
	// Example: DeviceTypeAscend
	Type api.DeviceType `json:"type"`

	// Name is the human-readable device name.
	// Example: "Huawei Ascend 910B"
	Name string `json:"name"`

	// Available indicates if the device is currently available for use.
	// A device may be unavailable if it's in use, has an error, or lacks drivers.
	Available bool `json:"available"`

	// Properties contains device-specific metadata and capabilities.
	// Common keys: "vendor", "version", "memory", "cores"
	// Values are stored as strings for flexibility.
	Properties map[string]string `json:"properties"`
}

Device represents a detected domestic chip device with its metadata.

A Device instance contains information about a specific hardware device including its type, availability status, and vendor-specific properties. This information is used to determine model compatibility and optimize model execution.

type DeviceInfo

type DeviceInfo struct {
	Type       string            `json:"type"`                  // Device type (e.g., "ascend", "cuda")
	Index      int               `json:"index"`                 // Device index (0-based)
	BusAddress string            `json:"bus_address"`           // PCI bus address
	ModelName  string            `json:"model_name"`            // Device model name
	ConfigKey  string            `json:"config_key"`            // Base model config key (for sandbox, image lookup)
	VariantKey string            `json:"variant_key,omitempty"` // Specific variant key (for runtime_params)
	Properties map[string]string `json:"properties"`            // Additional properties
}

DeviceInfo represents information about a device for runtime use. This is a simplified version focused on runtime needs.

type DeviceTopology

type DeviceTopology struct {
	// contains filtered or unexported fields
}

DeviceTopology provides distance information between logical chips.

Topology enables distance-aware device allocation to minimize inter-chip communication latency by preferring chips with shorter distances.

func NewDeviceTopology

func NewDeviceTopology(topologyConfig *config.TopologyConfig) *DeviceTopology

NewDeviceTopology creates a topology from configuration.

Parameters:

  • topologyConfig: Topology configuration from devices.yaml

Returns:

  • Initialized DeviceTopology or nil if no topology configured

func (*DeviceTopology) GetDistance

func (dt *DeviceTopology) GetDistance(chipA, chipB int) int

GetDistance calculates the distance between two logical chips.

Distance rules:

  • Same box: distance = 0 (high-speed interconnect)
  • Different boxes: distance = |box_a - box_b|
  • Unknown chip: distance = 999 (avoid allocation)

Parameters:

  • chipA: Logical chip index A
  • chipB: Logical chip index B

Returns:

  • Distance value (0 = closest, higher = farther)
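
The distance rules above can be written out as a short sketch. The type topology, its boxOf map, and the constant unknownDistance are hypothetical; how the real DeviceTopology stores the chip-to-box mapping is not documented here.

```go
package main

import "fmt"

// unknownDistance is the sentinel used for chips missing from the topology.
const unknownDistance = 999

// topology is a hypothetical stand-in for DeviceTopology. boxOf maps a
// logical chip index to the box that contains it.
type topology struct {
	boxOf map[int]int
}

// distance applies the documented rules: 0 within a box, |box_a - box_b|
// across boxes, and 999 when either chip is unknown.
func (t *topology) distance(chipA, chipB int) int {
	boxA, okA := t.boxOf[chipA]
	boxB, okB := t.boxOf[chipB]
	if !okA || !okB {
		return unknownDistance
	}
	if boxA > boxB {
		return boxA - boxB
	}
	return boxB - boxA // also yields 0 when both chips share a box
}

func main() {
	// Assumed layout: chips 0-1 in box 0, chips 2-3 in box 1, chip 4 in box 3.
	topo := &topology{boxOf: map[int]int{0: 0, 1: 0, 2: 1, 3: 1, 4: 3}}

	fmt.Println("same box:   ", topo.distance(0, 1))
	fmt.Println("adjacent:   ", topo.distance(1, 2))
	fmt.Println("far apart:  ", topo.distance(0, 4))
	fmt.Println("unknown chip:", topo.distance(0, 42))
}
```

An allocator that prefers small distances will naturally keep a multi-device request inside one box when enough free chips are available there.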

type Manager

type Manager struct {
	// contains filtered or unexported fields
}

Manager manages device detection and maintains device availability state.

The Manager provides thread-safe access to information about detected hardware devices. It performs initial device detection at creation and maintains a registry of available devices.

In production, the Manager would integrate with vendor-specific APIs and drivers to perform actual hardware probing and capability detection.
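
One common way to provide this kind of thread-safe access in Go is a map guarded by sync.RWMutex. The sketch below shows that pattern only as an assumption; the Manager's fields are unexported and its actual synchronization strategy is not documented. The names registry, device, set, and isAvailable are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// device is a hypothetical, trimmed-down counterpart of Device.
type device struct {
	Type      string
	Name      string
	Available bool
}

// registry guards a device map with a read-write mutex so that many readers
// can query it concurrently while writes stay exclusive.
type registry struct {
	mu      sync.RWMutex
	devices map[string]*device
}

func newRegistry() *registry {
	return &registry{devices: make(map[string]*device)}
}

// set stores or replaces the entry for d.Type under the write lock.
func (r *registry) set(d *device) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.devices[d.Type] = d
}

// isAvailable reports whether a device of the given type exists and is
// marked available. It takes only the read lock.
func (r *registry) isAvailable(deviceType string) bool {
	r.mu.RLock()
	defer r.mu.RUnlock()
	d, ok := r.devices[deviceType]
	return ok && d.Available
}

func main() {
	r := newRegistry()
	r.set(&device{Type: "ascend-910b", Name: "Huawei Ascend 910B", Available: true})

	// Several goroutines may query the registry at the same time.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = r.isAvailable("ascend-910b")
		}()
	}
	wg.Wait()

	fmt.Println("ascend-910b available:", r.isAvailable("ascend-910b"))
}
```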

func NewManager

func NewManager() *Manager

NewManager creates and initializes a new device manager.

The manager is created with an empty devices map and immediately performs device detection through detectDevices(). This identifies all available domestic chip devices on the system.

Device detection happens synchronously during initialization to ensure device information is available immediately after creation.

Returns:

  • A pointer to a fully initialized Manager with detected devices.

Example:

manager := device.NewManager()
if manager.IsAvailable(api.DeviceType("ascend-910b")) {
    fmt.Println("Ascend 910B NPU is available")
}

func (*Manager) GetDetectedDeviceTypes

func (m *Manager) GetDetectedDeviceTypes() []api.DeviceType

GetDetectedDeviceTypes returns the types of all detected devices.

This method returns only the device types that have been detected on the current system. It's used to filter models to show only those compatible with available hardware.

Returns:

  • A slice of DeviceType values for detected devices. Returns an empty slice if no devices are detected.

Example:

detected := manager.GetDetectedDeviceTypes()
if len(detected) == 0 {
    fmt.Println("No AI accelerators detected")
} else {
    fmt.Printf("Detected: %v\n", detected)
}

func (*Manager) GetDevice

func (m *Manager) GetDevice(deviceType api.DeviceType) (*Device, error)

GetDevice retrieves detailed information for a specific device type.

This method returns the full Device struct for a specified device type, including all properties and metadata. Unlike IsAvailable(), this method returns detailed information even if the device is unavailable, allowing callers to determine why a device isn't available.

The method is thread-safe and can be called concurrently.

Parameters:

  • deviceType: The device type to retrieve information for

Returns:

  • A pointer to the Device struct if the device type is detected
  • An error if the device type was not detected on the system

Example:

// Named dev rather than device to avoid shadowing the package name.
dev, err := manager.GetDevice(api.DeviceType("ascend-910b"))
if err != nil {
    log.Printf("Ascend device not found: %v", err)
    return
}
fmt.Printf("Device version: %s\n", dev.Properties["version"])

func (*Manager) GetSupportedTypes

func (m *Manager) GetSupportedTypes() []api.DeviceType

GetSupportedTypes returns all device types supported by the application.

This method returns a complete list of all device types that the xw application is designed to work with, regardless of whether they are currently detected on the system. This is useful for:

  • Displaying supported hardware to users
  • Validation of configuration files
  • Documentation and help text generation

The method reads device types from configuration. If configuration loading fails, it returns a fallback list of known device types.

Returns:

  • A slice of all supported DeviceType values.

Example:

supported := manager.GetSupportedTypes()
fmt.Printf("This application supports %d device types\n", len(supported))
for _, dt := range supported {
    fmt.Printf("- %s\n", dt)
}

func (*Manager) IsAvailable

func (m *Manager) IsAvailable(deviceType api.DeviceType) bool

IsAvailable checks if a specific device type is currently available.

This method performs a quick check to determine if a device of the specified type exists and is marked as available. It's commonly used before attempting to run models on specific hardware.

The method is thread-safe and can be called concurrently.

Parameters:

  • deviceType: The device type to check (e.g., api.DeviceType("ascend-910b"))

Returns:

  • true if the device type exists and is available
  • false if the device doesn't exist or is unavailable

Example:

if !manager.IsAvailable(api.DeviceType("ascend-910b")) {
    return fmt.Errorf("Ascend 910B device required but not available")
}

func (*Manager) ListAvailable

func (m *Manager) ListAvailable() []*Device

ListAvailable returns all currently available devices.

This method returns only devices that are marked as available, filtering out devices that are unavailable due to errors, being in use, or missing drivers.

The method is thread-safe and can be called concurrently. It returns pointers to Device structs, allowing callers to inspect detailed device properties.

Returns:

  • A slice of pointers to Device structs for all available devices. Returns an empty slice if no devices are available.

Example:

devices := manager.ListAvailable()
for _, dev := range devices {
    fmt.Printf("%s (%s): vendor=%s\n",
        dev.Name, dev.Type, dev.Properties["vendor"])
}

func (*Manager) ListDetectedChips

func (m *Manager) ListDetectedChips() ([]DetectedChip, error)

ListDetectedChips returns detailed information for all detected AI chips.

This method performs a fresh scan of the system and returns per-chip information, including PCI addresses, vendor and device IDs, and capabilities. Unlike ListAvailable(), which returns aggregated Device entries, it returns one entry per physical chip.

Returns:

  • A slice of DetectedChip with details for each physical chip
  • An error if hardware scanning fails

Example:

chips, err := manager.ListDetectedChips()
if err != nil {
    return err
}
for _, chip := range chips {
    fmt.Printf("%s: %s at %s\n", chip.DeviceType, chip.ModelName, chip.BusAddress)
}

type PCIDevice

type PCIDevice struct {
	// VendorID is the PCI vendor ID (e.g., "0x1db7")
	VendorID string

	// DeviceID is the PCI device ID
	DeviceID string

	// SubsystemVendorID is the subsystem vendor ID (optional)
	SubsystemVendorID string

	// SubsystemDeviceID is the subsystem device ID (optional)
	SubsystemDeviceID string

	// BusAddress is the PCI bus address (e.g., "0000:01:00.0")
	BusAddress string

	// Class is the PCI device class
	Class string
}

PCIDevice represents a PCI device with its identifiers.

func ParseLspciOutput

func ParseLspciOutput(output string) []PCIDevice

ParseLspciOutput parses the output of the `lspci -nn` command.

This is an alternative to ScanPCIDevices for systems where sysfs access is restricted. Each line of the output is expected to have the form "bus:dev.fn Class [class]: Vendor [vid:did]".

Parameters:

  • output: The output from lspci -nn command

Returns:

  • Slice of PCIDevice parsed from the output
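
A minimal sketch of parsing lines in this format with a regular expression is shown below. The type pciDevice, the function parseLspci, and the choice to normalize IDs to lowercase with a "0x" prefix are assumptions made for the example; ParseLspciOutput itself may handle the format differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// pciDevice is a hypothetical, trimmed-down counterpart of PCIDevice.
type pciDevice struct {
	BusAddress string
	Class      string
	VendorID   string
	DeviceID   string
}

// lspciLine matches "bus:dev.fn Class [class]: Vendor [vid:did]".
// Group 1: bus address, 2: class code, 3: vendor ID, 4: device ID.
// The greedy ".*" before the final bracket picks the last [vid:did] pair,
// so bracketed marketing names earlier in the line are skipped.
var lspciLine = regexp.MustCompile(
	`^(\S+)\s+.*?\[([0-9a-fA-F]{4})\]:\s+.*\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]`)

// parseLspci extracts one pciDevice per matching line and ignores the rest.
func parseLspci(output string) []pciDevice {
	var devs []pciDevice
	for _, line := range strings.Split(output, "\n") {
		m := lspciLine.FindStringSubmatch(strings.TrimSpace(line))
		if m == nil {
			continue // blank or unrecognized line
		}
		devs = append(devs, pciDevice{
			BusAddress: m[1],
			Class:      "0x" + strings.ToLower(m[2]),
			VendorID:   "0x" + strings.ToLower(m[3]),
			DeviceID:   "0x" + strings.ToLower(m[4]),
		})
	}
	return devs
}

func main() {
	// Sample text in the documented format; not captured from a real machine.
	sample := `00:00.0 Host bridge [0600]: Example Vendor Host Bridge [1234:5678]
01:00.0 Processing accelerators [1200]: Huawei Technologies Co., Ltd. Device [19e5:d802] (rev 20)`

	for _, d := range parseLspci(sample) {
		fmt.Printf("%s class=%s vendor=%s device=%s\n",
			d.BusAddress, d.Class, d.VendorID, d.DeviceID)
	}
}
```

Normalizing to the "0x"-prefixed form keeps the parsed IDs directly comparable with the string IDs used elsewhere in this package, such as "0x19e5".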

func ScanPCIDevices

func ScanPCIDevices() ([]PCIDevice, error)

ScanPCIDevices scans the system for PCI devices.

This function reads PCI device information from /sys/bus/pci/devices, which is the standard location on Linux systems.

Returns:

  • Slice of PCIDevice found on the system
  • Error if scanning fails
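
Reading PCI identifiers from sysfs can be sketched as follows. On Linux, each entry under /sys/bus/pci/devices is a directory named after the bus address and contains small text files such as vendor, device, and class. The type pciDevice and the functions readAttr and scanPCI are hypothetical; the root directory is a parameter only so the sketch can be pointed at a different tree.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// pciDevice is a hypothetical, trimmed-down counterpart of PCIDevice.
type pciDevice struct {
	BusAddress string
	VendorID   string
	DeviceID   string
	Class      string
}

// readAttr returns the trimmed contents of one sysfs attribute file,
// or an empty string if the file cannot be read.
func readAttr(dir, name string) string {
	data, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(data))
}

// scanPCI lists the entries under root and reads the vendor, device, and
// class attributes of each. Entries without vendor or device are skipped.
func scanPCI(root string) ([]pciDevice, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", root, err)
	}
	var devs []pciDevice
	for _, e := range entries {
		dir := filepath.Join(root, e.Name())
		vendor := readAttr(dir, "vendor")
		device := readAttr(dir, "device")
		if vendor == "" || device == "" {
			continue
		}
		devs = append(devs, pciDevice{
			BusAddress: e.Name(), // e.g. "0000:01:00.0"
			VendorID:   vendor,   // sysfs already uses the "0x19e5" form
			DeviceID:   device,
			Class:      readAttr(dir, "class"),
		})
	}
	return devs, nil
}

func main() {
	devs, err := scanPCI("/sys/bus/pci/devices")
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	for _, d := range devs {
		fmt.Printf("%s vendor=%s device=%s class=%s\n",
			d.BusAddress, d.VendorID, d.DeviceID, d.Class)
	}
}
```

On systems without sysfs, or where the directory is not readable, the sketch returns an error, which is the situation in which the lspci-based parser is the fallback.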
