gpu

package
v0.0.0-...-0096472
Published: Jun 4, 2026 License: Apache-2.0, Apache-2.0 Imports: 18 Imported by: 0

Documentation

Overview

SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Index

Constants

const (

	// NVIDIA GPU Feature Discovery (GFD) label keys
	LabelGPUCount   = "nvidia.com/gpu.count"
	LabelGPUProduct = "nvidia.com/gpu.product"
	LabelGPUMemory  = "nvidia.com/gpu.memory"
	// DCGM exporter label constants
	LabelApp                        = "app"
	LabelAppKubernetesName          = "app.kubernetes.io/name"
	LabelValueNvidiaDCGMExporter    = "nvidia-dcgm-exporter"
	LabelValueNvidiaNetworkOperator = "nvidia-network-operator"
	LabelValueDCGMExporter          = "dcgm-exporter"
	LabelValueGPUOperator           = "gpu-operator"
	GPUOperatorNamespace            = "gpu-operator"

	CloudProviderGCP            = "gcp"
	CloudProviderAWS            = "aws"
	CloudProviderAKS            = "aks"
	CloudProviderOther          = "other"
	CloudProviderUnknown        = "unknown"
	LabelNVIDIARDMAPresent      = "nvidia.com/rdma.present"
	LabelNFDRDMAAvailable       = "feature.node.kubernetes.io/rdma.available"
	LabelNFDNetworkSRIOVCapable = "feature.node.kubernetes.io/network-sriov.capable"
)
const (
	LabelNVLink = "nvlink"
)

--- GPU model tokens ---

Variables

This section is empty.

Functions

func GetCloudProviderInfo

func GetCloudProviderInfo(ctx context.Context, k8sClient client.Reader) (string, error)

GetCloudProviderInfo attempts to infer the cloud provider of the Kubernetes cluster.

The function inspects the first node in the cluster (assumes homogeneous node setup) and uses a combination of ProviderID and node labels to detect the provider.

Detection logic:

  • Primary detection uses node.Spec.ProviderID:
      • "azure" → AKS
      • "aws" → AWS
      • "gce" → GCP
  • Secondary detection uses node labels and instance type prefixes:
      • AKS: "kubernetes.azure.com/cluster" label or instance type starting with "standard_"
      • AWS: "eks.amazonaws.com/nodegroup" label or known AWS instance type prefix
      • GCP: "cloud.google.com/gke-nodepool" label or known GCP machine series prefix
  • If none match, returns "other".

Parameters:

  • ctx: Context for logging, cancellation, or timeout.
  • k8sClient: Kubernetes client for reading Node objects.

Returns:

  • A string identifying the cloud provider ("aks", "aws", "gcp", "other", or "unknown").
  • An error if no nodes are found or listing fails.
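
The detection order described above can be sketched as a small pure function. This is an illustrative sketch, not the package's implementation: detectProvider is a hypothetical helper, the ProviderID check is assumed to be a prefix match, and the AWS and GCP instance type prefix tables are left out because the documentation does not list them.

```go
package main

import (
	"fmt"
	"strings"
)

// detectProvider is a hypothetical helper that mirrors the documented
// detection order. It is not part of the gpu package.
func detectProvider(providerID string, labels map[string]string, instanceType string) string {
	// Primary: the ProviderID scheme (assumed here to be a prefix match).
	switch {
	case strings.HasPrefix(providerID, "azure"):
		return "aks"
	case strings.HasPrefix(providerID, "aws"):
		return "aws"
	case strings.HasPrefix(providerID, "gce"):
		return "gcp"
	}

	// Secondary: well-known node labels and the AKS instance type prefix.
	if _, ok := labels["kubernetes.azure.com/cluster"]; ok {
		return "aks"
	}
	if strings.HasPrefix(strings.ToLower(instanceType), "standard_") {
		return "aks"
	}
	if _, ok := labels["eks.amazonaws.com/nodegroup"]; ok {
		return "aws"
	}
	if _, ok := labels["cloud.google.com/gke-nodepool"]; ok {
		return "gcp"
	}

	// Nothing matched.
	return "other"
}

func main() {
	fmt.Println(detectProvider("aws:///us-east-1a/i-0abc", nil, ""))
	fmt.Println(detectProvider("", map[string]string{"cloud.google.com/gke-nodepool": "gpu-pool"}, ""))
	fmt.Println(detectProvider("", nil, "Standard_ND96asr_v4"))
	fmt.Println(detectProvider("", nil, ""))
}
```

Keeping the decision logic separate from the Node listing makes it easy to test without a cluster.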

func InferHardwareSystem

func InferHardwareSystem(gpuProduct string) nvidiacomv1beta1.GPUSKUType

InferHardwareSystem attempts to infer a normalized GPU SKU type from a free-form product string (e.g. "NVIDIA H100 SXM", "A100-PCIE").

The function performs three main steps:

  1. Normalize the input string to a consistent format.
  2. Detect the GPU form factor (SXM vs PCIe).
  3. Match the normalized string against known GPU tokens and return the corresponding SKU type.

Matching is based on substring checks and is tolerant of variations in formatting (case, spaces, dashes). If no known GPU is detected, an empty SKU type is returned.

Limitations:

  • Cannot distinguish SXM vs. PCIe variants from labels alone (assumes SXM for datacenter GPUs)
  • New GPU models require code updates (gracefully returns empty string)
  • Non-standard SKU names may not match

Users can manually override the system in their profiling config (hardware.system) if auto-detection is incorrect or unavailable.
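
The three steps (normalize, detect form factor, match tokens) can be sketched as follows. This is a minimal sketch under stated assumptions: inferSKU is a hypothetical stand-in for InferHardwareSystem, it returns a plain string instead of nvidiacomv1beta1.GPUSKUType, and the token list and the "a100_pcie" value are illustrative (only "h100_sxm" and "h200_sxm" appear in this documentation).

```go
package main

import (
	"fmt"
	"strings"
)

// inferSKU is a hypothetical stand-in for InferHardwareSystem.
// It returns "" when no known token is found.
func inferSKU(product string) string {
	// Step 1: normalize case and strip spaces, dashes, and underscores.
	n := strings.ToLower(product)
	n = strings.NewReplacer(" ", "", "-", "", "_", "").Replace(n)

	// Step 2: detect the form factor; assume SXM unless PCIe is named.
	form := "sxm"
	if strings.Contains(n, "pcie") {
		form = "pcie"
	}

	// Step 3: substring match against known model tokens (illustrative list).
	for _, tok := range []string{"h200", "h100", "a100"} {
		if strings.Contains(n, tok) {
			return tok + "_" + form
		}
	}
	return ""
}

func main() {
	for _, p := range []string{"NVIDIA H100 SXM", "A100-PCIE", "nvidia-h200", "Tesla T4"} {
		fmt.Printf("%q -> %q\n", p, inferSKU(p))
	}
}
```

Because the default form factor is SXM, a product string that omits the form factor resolves to the SXM variant, which matches the first limitation listed above.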

Types

type GPUDiscovery

type GPUDiscovery struct {
	Scraper ScrapeMetricsFunc
	// contains filtered or unexported fields
}

func NewGPUDiscovery

func NewGPUDiscovery(scraper ScrapeMetricsFunc) *GPUDiscovery

func (*GPUDiscovery) DiscoverGPUsFromDCGM

func (g *GPUDiscovery) DiscoverGPUsFromDCGM(ctx context.Context, k8sClient client.Reader, cache *GPUDiscoveryCache) (*GPUInfo, error)

DiscoverGPUsFromDCGM is a convenience wrapper that calls DiscoverGPUsFromDCGMFiltered with no SKU filter. See DiscoverGPUsFromDCGMFiltered for full documentation.

func (*GPUDiscovery) DiscoverGPUsFromDCGMFiltered

func (g *GPUDiscovery) DiscoverGPUsFromDCGMFiltered(ctx context.Context, k8sClient client.Reader, cache *GPUDiscoveryCache, filterSKU nvidiacomv1beta1.GPUSKUType) (*GPUInfo, error)

DiscoverGPUsFromDCGMFiltered discovers GPU information by scraping metrics directly from DCGM exporter pods running in the cluster.

When filterSKU is non-empty, only nodes whose inferred SKU matches are considered. When empty, the best node is selected first (highest GPU count, then VRAM) and then only nodes with the same SKU are counted.

The function performs the following:

  1. Returns cached GPU information if still valid (keyed by filterSKU).
  2. Lists DCGM exporter pods across all namespaces using supported labels.
  3. If no pods are found, checks whether the GPU Operator is installed and whether DCGM is enabled via Helm.
  4. Warns the user accordingly.
  5. Scrapes each running pod's metrics endpoint (http://<podIP>:9400/metrics).
  6. Selects the "best" GPU node (filtered by SKU when set), ranking by highest GPU count and using highest VRAM per GPU as the tie-breaker.
  7. Counts only nodes matching the selected SKU for NodesWithGPUs.
  8. Caches the result per SKU for a short duration to avoid repeated scraping.

Behavior Notes:

  • Scrapes pods directly instead of using a Service ClusterIP to avoid load-balancing ambiguity in multi-node clusters.
  • If at least one pod is successfully scraped, partial failures are tolerated.
  • If all pods fail to scrape, an aggregated error is returned.
  • Assumes DCGM exporter runs as a DaemonSet (one pod per GPU node).
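
The partial-failure behavior in these notes can be sketched with an injected scrape function. This is an illustrative sketch, not the package's code: scrapeFunc, scrapeAll, and fakeScrape are hypothetical names, the scraper returns only a GPU count to keep the example self-contained, and errors.Join assumes Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// scrapeFunc is a simplified, hypothetical analogue of ScrapeMetricsFunc.
type scrapeFunc func(endpoint string) (int, error)

// scrapeAll scrapes every pod IP directly. It tolerates partial failures
// and returns an aggregated error only when no pod could be scraped.
func scrapeAll(podIPs []string, scrape scrapeFunc) (map[string]int, error) {
	results := make(map[string]int)
	var errs []error
	for _, ip := range podIPs {
		endpoint := fmt.Sprintf("http://%s:9400/metrics", ip)
		n, err := scrape(endpoint)
		if err != nil {
			errs = append(errs, fmt.Errorf("scrape %s: %w", endpoint, err))
			continue
		}
		results[ip] = n
	}
	if len(results) == 0 {
		if len(errs) == 0 {
			return nil, errors.New("no DCGM exporter pods to scrape")
		}
		return nil, errors.Join(errs...)
	}
	return results, nil
}

// fakeScrape simulates one unreachable pod; every other pod reports 8 GPUs.
func fakeScrape(endpoint string) (int, error) {
	if endpoint == "http://10.0.0.2:9400/metrics" {
		return 0, errors.New("connection refused")
	}
	return 8, nil
}

func main() {
	res, err := scrapeAll([]string{"10.0.0.1", "10.0.0.2"}, fakeScrape)
	fmt.Println(len(res), err) // one pod succeeded, so err is nil

	_, err = scrapeAll([]string{"10.0.0.2"}, fakeScrape)
	fmt.Println(err != nil) // every pod failed, so the errors are aggregated
}
```

Passing the scraper as a function value is also what makes the discovery logic testable without a live DCGM exporter.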

Returns:

  • *GPUInfo for the selected node
  • error if no GPU data can be retrieved
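
The selection and counting rules (highest GPU count, then VRAM, then count only nodes with the selected SKU) can be sketched as below. This is a hedged sketch: nodeGPU, selectBest, and sampleNodes are hypothetical, SKUs are plain strings, and the VRAM figures are made-up round numbers.

```go
package main

import "fmt"

// nodeGPU is a hypothetical per-node summary used only for this sketch.
type nodeGPU struct {
	Name    string
	GPUs    int
	VRAMMiB int
	SKU     string
}

// selectBest picks the node with the most GPUs (VRAM breaks ties) among
// nodes matching filterSKU, or among all nodes when filterSKU is empty.
// It also returns how many nodes share the selected node's SKU.
func selectBest(nodes []nodeGPU, filterSKU string) (nodeGPU, int, bool) {
	var best nodeGPU
	found := false
	for _, n := range nodes {
		if filterSKU != "" && n.SKU != filterSKU {
			continue
		}
		if !found || n.GPUs > best.GPUs ||
			(n.GPUs == best.GPUs && n.VRAMMiB > best.VRAMMiB) {
			best, found = n, true
		}
	}
	if !found {
		return nodeGPU{}, 0, false
	}
	count := 0
	for _, n := range nodes {
		if n.SKU == best.SKU {
			count++
		}
	}
	return best, count, true
}

// sampleNodes returns illustrative data; the VRAM values are not real specs.
func sampleNodes() []nodeGPU {
	return []nodeGPU{
		{"node-a", 8, 80000, "h100_sxm"},
		{"node-b", 8, 141000, "h200_sxm"},
		{"node-c", 8, 80000, "h100_sxm"},
		{"node-d", 4, 40000, "a100_pcie"},
	}
}

func main() {
	best, count, _ := selectBest(sampleNodes(), "")
	fmt.Println(best.Name, best.SKU, count)

	best, count, _ = selectBest(sampleNodes(), "h100_sxm")
	fmt.Println(best.Name, best.SKU, count)
}
```

With no filter, the tie on GPU count is broken by VRAM, so only nodes of the winning SKU are counted even if other SKUs are more numerous.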

type GPUDiscoveryCache

type GPUDiscoveryCache struct {
	// contains filtered or unexported fields
}

GPUDiscoveryCache caches discovery results keyed by SKU filter. Bounded by the GPUSKUType enum plus empty for unfiltered discovery.

func NewGPUDiscoveryCache

func NewGPUDiscoveryCache() *GPUDiscoveryCache

NewGPUDiscoveryCache creates a new GPUDiscoveryCache instance.

The cache stores discovered GPUInfo values keyed by SKU filter with an expiration time. It is safe for concurrent use and is intended to reduce repeated DCGM scraping during reconciliation loops.

func (*GPUDiscoveryCache) Get

Get returns the cached GPUInfo for the given SKU filter if it exists and has not expired.

The boolean return value indicates whether a valid cached value was found. If the cache is empty or expired, it returns (nil, false).

This method is safe for concurrent use.

func (*GPUDiscoveryCache) Set

Set stores the provided GPUInfo in the cache with the given TTL (time-to-live).

The cached value will be considered valid until the TTL duration elapses. After expiration, Get will return (nil, false) until a new value is set.

This method is safe for concurrent use.
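
A TTL cache keyed by SKU filter, as described for Get and Set, might look like the sketch below. The method signatures are not shown in this documentation, so everything here is an assumption: skuCache, gpuInfo, and the Get/Set parameter lists are hypothetical, and SKU keys are plain strings with "" standing for unfiltered discovery.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// gpuInfo is a trimmed, hypothetical stand-in for GPUInfo.
type gpuInfo struct {
	Model       string
	GPUsPerNode int
}

type cacheEntry struct {
	info      *gpuInfo
	expiresAt time.Time
}

// skuCache is a hypothetical TTL cache keyed by SKU filter.
type skuCache struct {
	mu      sync.RWMutex
	entries map[string]cacheEntry
}

func newSKUCache() *skuCache {
	return &skuCache{entries: make(map[string]cacheEntry)}
}

// Get returns the cached value for sku if present and not yet expired.
func (c *skuCache) Get(sku string) (*gpuInfo, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[sku]
	if !ok || !time.Now().Before(e.expiresAt) {
		return nil, false
	}
	return e.info, true
}

// Set stores info under sku and marks it valid for ttl from now.
func (c *skuCache) Set(sku string, info *gpuInfo, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[sku] = cacheEntry{info: info, expiresAt: time.Now().Add(ttl)}
}

func main() {
	c := newSKUCache()
	c.Set("h100_sxm", &gpuInfo{Model: "H100-SXM5-80GB", GPUsPerNode: 8}, time.Minute)

	if info, ok := c.Get("h100_sxm"); ok {
		fmt.Println("hit:", info.Model)
	}
	if _, ok := c.Get(""); !ok {
		fmt.Println("miss for unfiltered key")
	}
}
```

Storing an absolute expiry time rather than the TTL keeps Get to a single comparison, and the read-write mutex lets concurrent reconcilers read without blocking each other.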

type GPUInfo

type GPUInfo struct {
	NodeName         string                      // Name of the node with this GPU configuration
	GPUsPerNode      int                         // Maximum GPUs per node found in the cluster
	NodesWithGPUs    int                         // Number of nodes that have GPUs
	Model            string                      // GPU product name (e.g., "H100-SXM5-80GB")
	VRAMPerGPU       int                         // VRAM in MiB per GPU
	System           nvidiacomv1beta1.GPUSKUType // AIC hardware system identifier (e.g., "h100_sxm", "h200_sxm"), empty if unknown
	MIGEnabled       bool                        // True if MIG is enabled (inferred from model or additional labels, not implemented in this version)
	MIGProfiles      map[string]int              // Optional: map of MIG profile name to count (requires additional label parsing, not implemented in this version)
	CloudProvider    string                      // aws | gcp | aks | other | unknown
	RDMAEnabled      bool                        // Indicates whether RDMA is enabled for this node (e.g., via InfiniBand, RoCE, or similar high-speed networking)
	RDMAType         string                      // Type of RDMA transport detected (e.g., "infiniband", "roce", "rdma", "sriov", or "none")
	Interconnect     string                      // Primary GPU-to-GPU interconnect technology used within the node (e.g., "nvlink" for high-bandwidth links or "pcie" for standard bus-based communication)
	InterconnectTier string                      // Qualitative or platform-specific classification of the interconnect (e.g., NVLink generation, topology tier, or vendor-defined performance level)
	NVLinkLinks      int                         // Number of NVLink connections per GPU (0 if NVLink is not present or interconnect is PCIe-only)
}

GPUInfo contains the GPU configuration discovered from cluster nodes.

func DiscoverGPUs

func DiscoverGPUs(ctx context.Context, k8sClient client.Reader) (*GPUInfo, error)

DiscoverGPUs queries Kubernetes nodes to determine GPU configuration. It is a convenience wrapper around DiscoverGPUsFiltered with no SKU filter. See DiscoverGPUsFiltered for full documentation.

func DiscoverGPUsFiltered

func DiscoverGPUsFiltered(ctx context.Context, k8sClient client.Reader, filterSKU nvidiacomv1beta1.GPUSKUType) (*GPUInfo, error)

DiscoverGPUsFiltered queries Kubernetes nodes to determine GPU configuration. It extracts GPU information from NVIDIA GPU Feature Discovery (GFD) labels and returns aggregated GPU info, preferring nodes with higher GPU count, then higher VRAM if counts are equal.

When filterSKU is non-empty, only nodes whose inferred SKU matches are considered for selection and counting. When empty, the best node is selected first and then only nodes with the same SKU are counted.

This function requires cluster-wide node read permissions and expects nodes to have GFD labels. If no nodes with GPU labels are found, it returns an error.

func ScrapeMetricsEndpoint

func ScrapeMetricsEndpoint(ctx context.Context, endpoint string) (*GPUInfo, error)

ScrapeMetricsEndpoint retrieves and parses Prometheus metrics from a DCGM exporter pod endpoint.

The function performs an HTTP GET request against the provided endpoint (expected format: http://<podIP>:9400/metrics), validates the response, and parses the Prometheus text exposition format into metric families.

Parsed metric families are passed to parseMetrics to extract high-level GPU information.

Returns:

  • *GPUInfo derived from the parsed metrics
  • error if the HTTP request fails, the response is non-200, or metric parsing fails

This function does not implement retries or fallback logic. Error handling and multi-pod aggregation are managed by the caller.
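
The fetch-and-validate part of this flow can be sketched with the standard library alone. This is a simplified sketch, not the package's implementation: fetchMetricLines and fetchFromFake are hypothetical helpers, the metric name example_gauge is made up, and the body is only split into sample lines rather than parsed into Prometheus metric families.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// fetchMetricLines GETs endpoint, requires a 200 response, and returns the
// sample lines, skipping blank lines and "#" comment lines (HELP/TYPE).
func fetchMetricLines(ctx context.Context, endpoint string) ([]string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("GET %s: %w", endpoint, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("GET %s: unexpected status %d", endpoint, resp.StatusCode)
	}

	var lines []string
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		lines = append(lines, line)
	}
	if err := sc.Err(); err != nil {
		return nil, fmt.Errorf("read body: %w", err)
	}
	return lines, nil
}

// fetchFromFake serves status and body from an in-process test server and
// fetches from it, so the sketch runs without a real DCGM exporter.
func fetchFromFake(status int, body string) ([]string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
		fmt.Fprint(w, body)
	}))
	defer srv.Close()
	return fetchMetricLines(context.Background(), srv.URL+"/metrics")
}

func main() {
	body := "# TYPE example_gauge gauge\nexample_gauge{gpu=\"0\"} 1\nexample_gauge{gpu=\"1\"} 1\n"
	lines, err := fetchFromFake(http.StatusOK, body)
	fmt.Println(len(lines), err)

	_, err = fetchFromFake(http.StatusServiceUnavailable, "")
	fmt.Println(err != nil)
}
```

Building the request with the caller's context means cancellation and timeouts propagate to the HTTP call, which matters because retries are left to the caller.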

type ScrapeMetricsFunc

type ScrapeMetricsFunc func(ctx context.Context, endpoint string) (*GPUInfo, error)
