gpu

package
v0.0.0-...-31909ca
Published: Apr 14, 2026 License: Apache-2.0 Imports: 16 Imported by: 0

Documentation

Index

Constants

View Source
const (

	// NVIDIA GPU Feature Discovery (GFD) label keys
	LabelGPUCount   = "nvidia.com/gpu.count"
	LabelGPUProduct = "nvidia.com/gpu.product"
	LabelGPUMemory  = "nvidia.com/gpu.memory"
	// DCGM exporter label constants
	LabelApp                     = "app"
	LabelAppKubernetesName       = "app.kubernetes.io/name"
	LabelValueNvidiaDCGMExporter = "nvidia-dcgm-exporter"
	LabelValueDCGMExporter       = "dcgm-exporter"
	LabelValueGPUOperator        = "gpu-operator"
	GPUOperatorNamespace         = "gpu-operator"

	CloudProviderGCP     = "gcp"
	CloudProviderAWS     = "aws"
	CloudProviderAKS     = "aks"
	CloudProviderOther   = "other"
	CloudProviderUnknown = "unknown"
)

Variables

This section is empty.

Functions

func GetCloudProviderInfo

func GetCloudProviderInfo(ctx context.Context, k8sClient client.Reader) (string, error)
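GetCloudProviderInfo has no doc comment, but one common way to classify the provider of a cluster is to inspect the `Node.Spec.ProviderID` prefix, which Kubernetes cloud integrations populate (e.g. `aws://`, `gce://`, `azure://`). The sketch below illustrates that approach with a hypothetical helper, `detectProviderFromID`; the real function may rely on different signals (node labels, instance metadata).

```go
package main

import (
	"fmt"
	"strings"
)

// detectProviderFromID is a hypothetical helper illustrating provider
// classification from the Node.Spec.ProviderID prefix. The result strings
// match the CloudProvider* constants defined in this package.
func detectProviderFromID(providerID string) string {
	switch {
	case strings.HasPrefix(providerID, "aws://"):
		return "aws"
	case strings.HasPrefix(providerID, "gce://"):
		return "gcp"
	case strings.HasPrefix(providerID, "azure://"):
		return "aks"
	case providerID == "":
		return "unknown" // ProviderID unset (e.g. bare metal without a cloud controller)
	default:
		return "other"
	}
}

func main() {
	fmt.Println(detectProviderFromID("aws://us-east-1a/i-0abc123")) // aws
	fmt.Println(detectProviderFromID(""))                           // unknown
}
```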

func InferHardwareSystem

func InferHardwareSystem(gpuProduct string) nvidiacomv1beta1.GPUSKUType

InferHardwareSystem maps GPU product name to hardware system identifier. Returns empty string if the GPU model cannot be confidently mapped.

This is a best-effort mapping based on common NVIDIA datacenter GPU naming patterns. The system identifier is used by the profiler for performance estimation and configuration.

Limitations:

  • Cannot distinguish SXM vs. PCIe variants from labels alone (assumes SXM for datacenter GPUs)
  • New GPU models require code updates (gracefully returns empty string)
  • Non-standard SKU names may not match

Users can manually override the system in their profiling config (hardware.system) if auto-detection is incorrect or unavailable.
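The best-effort mapping described above can be sketched as substring matching on the product label. This is a simplified stand-in: the real InferHardwareSystem returns an nvidiacomv1beta1.GPUSKUType rather than a string, and the exact model list and identifiers here are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// inferSystem sketches the product-name-to-system mapping. Model names and
// system identifiers are illustrative; the real function's table may differ.
func inferSystem(gpuProduct string) string {
	p := strings.ToUpper(gpuProduct)
	switch {
	case strings.Contains(p, "H200"):
		return "h200_sxm"
	case strings.Contains(p, "H100"):
		// SXM is assumed: PCIe variants cannot be distinguished from labels alone.
		return "h100_sxm"
	case strings.Contains(p, "A100"):
		return "a100_sxm"
	default:
		// Unrecognized model: return empty so callers fall back to manual
		// configuration (hardware.system).
		return ""
	}
}

func main() {
	fmt.Println(inferSystem("NVIDIA-H100-SXM5-80GB")) // h100_sxm
	fmt.Println(inferSystem("NVIDIA-T4"))             // (empty: not mapped)
}
```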

Types

type GPUDiscovery

type GPUDiscovery struct {
	Scraper ScrapeMetricsFunc
}

func NewGPUDiscovery

func NewGPUDiscovery(scraper ScrapeMetricsFunc) *GPUDiscovery

func (*GPUDiscovery) DiscoverGPUsFromDCGM

func (g *GPUDiscovery) DiscoverGPUsFromDCGM(ctx context.Context, k8sClient client.Reader, cache *GPUDiscoveryCache) (*GPUInfo, error)

DiscoverGPUsFromDCGM discovers GPU information by scraping metrics directly from DCGM exporter pods running in the cluster.

The function performs the following:

  1. Returns cached GPU information if still valid.
  2. Lists DCGM exporter pods across all namespaces using supported labels.
  3. If no pods are found, attempts to find if GPU operator is installed and DCGM is enabled via Helm.
  4. Warns the user accordingly (e.g., if no exporter pods or GPU operator installation is found).
  5. Scrapes each running pod's metrics endpoint (http://<podIP>:9400/metrics).
  6. Selects the "best" GPU node: highest GPU count, with highest VRAM per GPU as the tie-breaker.
  7. Caches the result for a short duration to avoid repeated scraping.

Behavior Notes:

  • Scrapes pods directly instead of using a Service ClusterIP to avoid load-balancing ambiguity in multi-node clusters.
  • If at least one pod is successfully scraped, partial failures are tolerated.
  • If all pods fail to scrape, an aggregated error is returned.
  • Assumes DCGM exporter runs as a DaemonSet (one pod per GPU node).
  • Designed for homogeneous clusters; heterogeneous cluster aggregation is not yet implemented.

Returns:

  • *GPUInfo for the selected node
  • error if no GPU data can be retrieved

TODO: The current implementation selects a single "best" GPU node (highest GPU count, tie-broken by VRAM). This works for homogeneous clusters where all GPU nodes are identical. For heterogeneous GPU support (mixed GPU models or capacities), this logic does not represent the full cluster GPU inventory. Future improvements should aggregate and return GPU information for all nodes instead of selecting only one.
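The selection rule in step 6 above can be sketched as a single pass over per-node results. `nodeGPUs` is a hypothetical stand-in for the per-node fields of GPUInfo; field names are assumptions.

```go
package main

import "fmt"

// nodeGPUs stands in for the per-node subset of GPUInfo.
type nodeGPUs struct {
	Node  string
	Count int // GPUs on the node
	VRAM  int // VRAM per GPU in MiB
}

// pickBest mirrors the documented rule: highest GPU count wins,
// ties broken by higher VRAM per GPU.
func pickBest(nodes []nodeGPUs) *nodeGPUs {
	var best *nodeGPUs
	for i := range nodes {
		n := &nodes[i]
		if best == nil || n.Count > best.Count ||
			(n.Count == best.Count && n.VRAM > best.VRAM) {
			best = n
		}
	}
	return best
}

func main() {
	nodes := []nodeGPUs{
		{"node-a", 4, 40960},
		{"node-b", 8, 81920},
		{"node-c", 8, 40960},
	}
	// node-b and node-c tie on count; node-b wins on VRAM.
	fmt.Println(pickBest(nodes).Node) // node-b
}
```

As the TODO notes, this collapses a heterogeneous cluster to one node's configuration; aggregating all nodes would require returning the full slice instead.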

type GPUDiscoveryCache

type GPUDiscoveryCache struct {
	// contains filtered or unexported fields
}

func NewGPUDiscoveryCache

func NewGPUDiscoveryCache() *GPUDiscoveryCache

NewGPUDiscoveryCache creates a new GPUDiscoveryCache instance.

The cache stores a single discovered GPUInfo value with an expiration time. It is safe for concurrent use and is intended to reduce repeated DCGM scraping during reconciliation loops.

func (*GPUDiscoveryCache) Get

func (c *GPUDiscoveryCache) Get() (*GPUInfo, bool)

Get returns the cached GPUInfo if it exists and has not expired.

The boolean return value indicates whether a valid cached value was found. If the cache is empty or expired, it returns (nil, false).

This method is safe for concurrent use.

func (*GPUDiscoveryCache) Set

func (c *GPUDiscoveryCache) Set(info *GPUInfo, ttl time.Duration)

Set stores the provided GPUInfo in the cache with the given TTL (time-to-live).

The cached value will be considered valid until the TTL duration elapses. After expiration, Get will return (nil, false) until a new value is set.

This method is safe for concurrent use.

type GPUInfo

type GPUInfo struct {
	NodeName      string                      // Name of the node with this GPU configuration
	GPUsPerNode   int                         // Maximum GPUs per node found in the cluster
	NodesWithGPUs int                         // Number of nodes that have GPUs
	Model         string                      // GPU product name (e.g., "H100-SXM5-80GB")
	VRAMPerGPU    int                         // VRAM in MiB per GPU
	System        nvidiacomv1beta1.GPUSKUType // AIC hardware system identifier (e.g., "h100_sxm", "h200_sxm"), empty if unknown
	MIGEnabled    bool                        // True if MIG is enabled (inferred from model or additional labels, not implemented in this version)
	MIGProfiles   map[string]int              // Optional: map of MIG profile name to count (requires additional label parsing, not implemented in this version)
	CloudProvider string                      // Cloud provider: aws | gcp | aks | other | unknown
}

GPUInfo contains discovered GPU configuration from cluster nodes.

func DiscoverGPUs

func DiscoverGPUs(ctx context.Context, k8sClient client.Reader) (*GPUInfo, error)

DiscoverGPUs queries Kubernetes nodes to determine GPU configuration. It extracts GPU information from NVIDIA GPU Feature Discovery (GFD) labels and returns aggregated GPU info, preferring nodes with higher GPU count, then higher VRAM if counts are equal.

This function requires cluster-wide node read permissions and expects nodes to have GFD labels. If no nodes with GPU labels are found, it returns an error.
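Extraction from GFD labels can be sketched with the label keys listed in the Constants section. `parseGFDLabels` is a hypothetical helper with simplified error handling; the real DiscoverGPUs reads these labels from Node objects via the Kubernetes client.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseGFDLabels extracts GPU facts from a node's GFD labels, using the
// label keys defined in this package's constants. Illustrative sketch:
// parse errors on count/memory are ignored here for brevity.
func parseGFDLabels(labels map[string]string) (count int, product string, memMiB int, ok bool) {
	product, ok = labels["nvidia.com/gpu.product"]
	if !ok {
		return 0, "", 0, false // node has no GFD labels: not a GPU node
	}
	count, _ = strconv.Atoi(labels["nvidia.com/gpu.count"])
	memMiB, _ = strconv.Atoi(labels["nvidia.com/gpu.memory"])
	return count, product, memMiB, true
}

func main() {
	labels := map[string]string{
		"nvidia.com/gpu.count":   "8",
		"nvidia.com/gpu.product": "NVIDIA-H100-SXM5-80GB",
		"nvidia.com/gpu.memory":  "81559",
	}
	count, product, mem, _ := parseGFDLabels(labels)
	fmt.Println(count, product, mem)
}
```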

func ScrapeMetricsEndpoint

func ScrapeMetricsEndpoint(ctx context.Context, endpoint string) (*GPUInfo, error)

ScrapeMetricsEndpoint retrieves and parses Prometheus metrics from a DCGM exporter pod endpoint.

The function performs an HTTP GET request against the provided endpoint (expected format: http://<podIP>:9400/metrics), validates the response, and parses the Prometheus text exposition format into metric families.

Parsed metric families are passed to parseMetrics to extract high-level GPU information.

Returns:

  • *GPUInfo derived from the parsed metrics
  • error if the HTTP request fails, the response is non-200, or metric parsing fails

This function does not implement retries or fallback logic. Error handling and multi-pod aggregation are managed by the caller.
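The parsing step can be illustrated with a stdlib-only sketch. The real implementation parses the full Prometheus text exposition format into metric families; here, counting series of a per-GPU gauge such as DCGM_FI_DEV_FB_TOTAL (framebuffer size in MiB) conveys the idea. The metric name and sample payload are assumptions about typical DCGM exporter output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseDCGMText is a simplified sketch: each DCGM_FI_DEV_FB_TOTAL series
// corresponds to one GPU, and its value is the VRAM in MiB for that GPU.
func parseDCGMText(body string) (gpus int, vramMiB float64) {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "DCGM_FI_DEV_FB_TOTAL{") {
			gpus++
			fields := strings.Fields(line)
			// The sample value is the last whitespace-separated field.
			fmt.Sscanf(fields[len(fields)-1], "%f", &vramMiB)
		}
	}
	return gpus, vramMiB
}

func main() {
	body := `# TYPE DCGM_FI_DEV_FB_TOTAL gauge
DCGM_FI_DEV_FB_TOTAL{gpu="0",modelName="NVIDIA H100"} 81559
DCGM_FI_DEV_FB_TOTAL{gpu="1",modelName="NVIDIA H100"} 81559
`
	gpus, vram := parseDCGMText(body)
	fmt.Println(gpus, vram)
}
```

A production parser should use a proper exposition-format library rather than line prefixes, since label order and escaping are not guaranteed.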

type ScrapeMetricsFunc

type ScrapeMetricsFunc func(ctx context.Context, endpoint string) (*GPUInfo, error)
