Documentation
¶
Index ¶
Constants ¶
const (
	// NVIDIA GPU Feature Discovery (GFD) label keys
	LabelGPUCount   = "nvidia.com/gpu.count"
	LabelGPUProduct = "nvidia.com/gpu.product"
	LabelGPUMemory  = "nvidia.com/gpu.memory"

	// DCGM exporter label constants
	LabelApp                     = "app"
	LabelAppKubernetesName       = "app.kubernetes.io/name"
	LabelValueNvidiaDCGMExporter = "nvidia-dcgm-exporter"
	LabelValueDCGMExporter       = "dcgm-exporter"
	LabelValueGPUOperator        = "gpu-operator"
	GPUOperatorNamespace         = "gpu-operator"

	CloudProviderGCP     = "gcp"
	CloudProviderAWS     = "aws"
	CloudProviderAKS     = "aks"
	CloudProviderOther   = "other"
	CloudProviderUnknown = "unknown"
)
Variables ¶
This section is empty.
Functions ¶
func GetCloudProviderInfo ¶
func InferHardwareSystem ¶
func InferHardwareSystem(gpuProduct string) nvidiacomv1beta1.GPUSKUType
InferHardwareSystem maps a GPU product name to a hardware system identifier. It returns an empty string if the GPU model cannot be confidently mapped.
This is a best-effort mapping based on common NVIDIA datacenter GPU naming patterns. The system identifier is used by the profiler for performance estimation and configuration.
Limitations:
- Cannot distinguish SXM vs. PCIe variants from labels alone (assumes SXM for datacenter GPUs)
- New GPU models require code updates (gracefully returns empty string)
- Non-standard SKU names may not match
Users can manually override the system in their profiling config (hardware.system) if auto-detection is incorrect or unavailable.
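The best-effort, pattern-based mapping described above might look like the following. This is an illustrative sketch, not the package's actual implementation: the specific substring checks are assumptions, and the returned identifiers ("h100_sxm", "h200_sxm", "a100_sxm") follow the examples given in the GPUInfo System field comment.

```go
package main

import "strings"

// inferHardwareSystem sketches a best-effort mapping from a GFD GPU product
// label to a hardware system identifier. It assumes SXM for datacenter GPUs
// and returns "" when no known pattern matches, mirroring the documented
// graceful-fallback behavior.
func inferHardwareSystem(gpuProduct string) string {
	p := strings.ToUpper(gpuProduct)
	switch {
	case strings.Contains(p, "H200"):
		return "h200_sxm"
	case strings.Contains(p, "H100"):
		return "h100_sxm"
	case strings.Contains(p, "A100"):
		return "a100_sxm"
	default:
		// Unknown or non-standard SKU: the caller can override the
		// system via the profiling config (hardware.system).
		return ""
	}
}
```

A caller would treat an empty result as "unknown" and fall back to any manually configured system.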
Types ¶
type GPUDiscovery ¶
type GPUDiscovery struct {
Scraper ScrapeMetricsFunc
}
func NewGPUDiscovery ¶
func NewGPUDiscovery(scraper ScrapeMetricsFunc) *GPUDiscovery
func (*GPUDiscovery) DiscoverGPUsFromDCGM ¶
func (g *GPUDiscovery) DiscoverGPUsFromDCGM(ctx context.Context, k8sClient client.Reader, cache *GPUDiscoveryCache) (*GPUInfo, error)
DiscoverGPUsFromDCGM discovers GPU information by scraping metrics directly from DCGM exporter pods running in the cluster.
The function performs the following:
- Returns cached GPU information if still valid.
- Lists DCGM exporter pods across all namespaces using supported labels.
- If no pods are found, checks whether the GPU Operator is installed and DCGM is enabled via Helm, and warns the user accordingly.
- Scrapes each running pod's metrics endpoint (http://<podIP>:9400/metrics).
- Selects the "best" GPU node based on:
  - Highest GPU count
  - Highest VRAM per GPU (tie-breaker)
- Caches the result for a short duration to avoid repeated scraping.
Behavior Notes:
- Scrapes pods directly instead of using a Service ClusterIP to avoid load-balancing ambiguity in multi-node clusters.
- If at least one pod is successfully scraped, partial failures are tolerated.
- If all pods fail to scrape, an aggregated error is returned.
- Assumes DCGM exporter runs as a DaemonSet (one pod per GPU node).
- Designed for homogeneous clusters; heterogeneous cluster aggregation is not yet implemented.
Returns:
- *GPUInfo for the selected node
- error if no GPU data can be retrieved
TODO: The current implementation selects a single "best" GPU node (highest GPU count, tie-broken by VRAM). This works for homogeneous clusters where all GPU nodes are identical. For heterogeneous clusters (mixed GPU models or capacities), this logic does not represent the full cluster GPU inventory. Future improvements should aggregate and return GPU information for all nodes instead of selecting only one.
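The selection rule described above (highest GPU count, tie-broken by VRAM per GPU) can be sketched as follows. The nodeGPUs type and selectBestNode function are hypothetical names for illustration; the package's internal representation may differ.

```go
package main

// nodeGPUs is a hypothetical per-node summary used to illustrate the
// best-node selection rule.
type nodeGPUs struct {
	Name       string
	GPUCount   int
	VRAMPerGPU int // MiB
}

// selectBestNode returns the node with the highest GPU count, breaking
// ties by VRAM per GPU. It returns nil for an empty slice, which the
// caller would treat as "no GPU data retrieved".
func selectBestNode(nodes []nodeGPUs) *nodeGPUs {
	var best *nodeGPUs
	for i := range nodes {
		n := &nodes[i]
		if best == nil ||
			n.GPUCount > best.GPUCount ||
			(n.GPUCount == best.GPUCount && n.VRAMPerGPU > best.VRAMPerGPU) {
			best = n
		}
	}
	return best
}
```

This single-winner rule is exactly why heterogeneous clusters are not yet represented: information about every non-winning node is discarded.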
type GPUDiscoveryCache ¶
type GPUDiscoveryCache struct {
// contains filtered or unexported fields
}
func NewGPUDiscoveryCache ¶
func NewGPUDiscoveryCache() *GPUDiscoveryCache
NewGPUDiscoveryCache creates a new GPUDiscoveryCache instance.
The cache stores a single discovered GPUInfo value with an expiration time. It is safe for concurrent use and is intended to reduce repeated DCGM scraping during reconciliation loops.
func (*GPUDiscoveryCache) Get ¶
func (c *GPUDiscoveryCache) Get() (*GPUInfo, bool)
Get returns the cached GPUInfo if it exists and has not expired.
The boolean return value indicates whether a valid cached value was found. If the cache is empty or expired, it returns (nil, false).
This method is safe for concurrent use.
func (*GPUDiscoveryCache) Set ¶
func (c *GPUDiscoveryCache) Set(info *GPUInfo, ttl time.Duration)
Set stores the provided GPUInfo in the cache with the given TTL (time-to-live).
The cached value will be considered valid until the TTL duration elapses. After expiration, Get will return (nil, false) until a new value is set.
This method is safe for concurrent use.
type GPUInfo ¶
type GPUInfo struct {
NodeName string // Name of the node with this GPU configuration
GPUsPerNode int // Maximum GPUs per node found in the cluster
NodesWithGPUs int // Number of nodes that have GPUs
Model string // GPU product name (e.g., "H100-SXM5-80GB")
VRAMPerGPU int // VRAM in MiB per GPU
System nvidiacomv1beta1.GPUSKUType // AIC hardware system identifier (e.g., "h100_sxm", "h200_sxm"), empty if unknown
MIGEnabled bool // True if MIG is enabled (inferred from model or additional labels, not implemented in this version)
MIGProfiles map[string]int // Optional: map of MIG profile name to count (requires additional label parsing, not implemented in this version)
CloudProvider string // One of: aws | gcp | aks | other | unknown
}
GPUInfo contains discovered GPU configuration from cluster nodes
func DiscoverGPUs ¶
DiscoverGPUs queries Kubernetes nodes to determine GPU configuration. It extracts GPU information from NVIDIA GPU Feature Discovery (GFD) labels and returns aggregated GPU info, preferring nodes with higher GPU count, then higher VRAM if counts are equal.
This function requires cluster-wide node read permissions and expects nodes to have GFD labels. If no nodes with GPU labels are found, it returns an error.
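Extracting GPU information from a single node's GFD labels might be sketched as below. This is an illustrative helper, not the package's actual code; it uses the label keys from the Constants section and treats missing or malformed labels as "this node has no usable GPU info".

```go
package main

import "strconv"

// parseGFDLabels extracts GPU count, product name, and per-GPU memory (MiB)
// from a node's GFD labels (nvidia.com/gpu.count, nvidia.com/gpu.product,
// nvidia.com/gpu.memory). It returns ok=false if any label is absent or
// fails to parse, so callers can skip non-GPU nodes.
func parseGFDLabels(labels map[string]string) (count int, product string, memMiB int, ok bool) {
	countStr, hasCount := labels["nvidia.com/gpu.count"]
	productStr, hasProduct := labels["nvidia.com/gpu.product"]
	memStr, hasMem := labels["nvidia.com/gpu.memory"]
	if !hasCount || !hasProduct || !hasMem {
		return 0, "", 0, false
	}
	c, err1 := strconv.Atoi(countStr)
	m, err2 := strconv.Atoi(memStr)
	if err1 != nil || err2 != nil {
		return 0, "", 0, false
	}
	return c, productStr, m, true
}
```

DiscoverGPUs would run this over every node's labels and then apply the same highest-count, VRAM-tie-break preference described above; if no node yields ok=true, it returns an error.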
func ScrapeMetricsEndpoint ¶
ScrapeMetricsEndpoint retrieves and parses Prometheus metrics from a DCGM exporter pod endpoint.
The function performs an HTTP GET request against the provided endpoint (expected format: http://<podIP>:9400/metrics), validates the response, and parses the Prometheus text exposition format into metric families.
Parsed metric families are passed to parseMetrics to extract high-level GPU information.
Returns:
- *GPUInfo derived from the parsed metrics
- error if the HTTP request fails, the response is non-200, or metric parsing fails
This function does not implement retries or fallback logic. Error handling and multi-pod aggregation are managed by the caller.