Documentation ¶
Index ¶
- Variables
- func RemoveNodeMetrics(nodeName string)
- func RemoveWorkerMetrics(workerName string, deletionTime time.Time)
- func SetNodeMetrics(node *tfv1.GPUNode, poolObj *tfv1.GPUPool, gpuModels []string)
- func SetWorkerMetricsByWorkload(pod *corev1.Pod, workload *tfv1.TensorFusionWorkload, now time.Time)
- type MetricsRecorder
- type NodeMetrics
- type RawBillingPricing
- type WorkerMetrics
Constants ¶
This section is empty.
Variables ¶
var NodeMetricsMap = map[string]*NodeMetrics{}
Functions ¶
func RemoveNodeMetrics ¶ added in v1.33.4
func RemoveNodeMetrics(nodeName string)
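func RemoveWorkerMetrics ¶ added in v1.33.4
func RemoveWorkerMetrics(workerName string, deletionTime time.Time)
func SetNodeMetrics ¶ added in v1.33.4
func SetNodeMetrics(node *tfv1.GPUNode, poolObj *tfv1.GPUPool, gpuModels []string)
func SetWorkerMetricsByWorkload ¶ added in v1.33.4
func SetWorkerMetricsByWorkload(pod *corev1.Pod, workload *tfv1.TensorFusionWorkload, now time.Time)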
Types ¶
type MetricsRecorder ¶ added in v1.33.4
type MetricsRecorder struct {
	MetricsOutputPath string

	// Raw billing result for node and workers
	HourlyUnitPriceMap map[string]float64

	// Worker level unit price map, key is pool name, second level key is QoS level
	WorkerUnitPriceMap map[string]map[string]RawBillingPricing
}
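The two-level layout of WorkerUnitPriceMap (pool name first, then QoS level) can be read safely with a small helper. The following is a minimal sketch, not code from this package: the helper name lookupByPoolAndQoS is hypothetical, and because the fields of RawBillingPricing are not shown on this page, a type parameter V stands in for it and the example uses plain float64 values.

```go
package main

import "fmt"

// lookupByPoolAndQoS reads a two-level map keyed by pool name, then QoS level.
// V stands in for RawBillingPricing, whose fields are not shown on this page.
func lookupByPoolAndQoS[V any](m map[string]map[string]V, pool, qos string) (V, bool) {
	var zero V
	byQoS, ok := m[pool]
	if !ok {
		return zero, false
	}
	v, ok := byQoS[qos]
	return v, ok
}

func main() {
	// Placeholder data: a float64 price instead of a RawBillingPricing value.
	prices := map[string]map[string]float64{
		"pool-a": {"high": 2.5, "low": 1.0},
	}
	if p, ok := lookupByPoolAndQoS(prices, "pool-a", "high"); ok {
		fmt.Println("price:", p)
	}
	if _, ok := lookupByPoolAndQoS(prices, "pool-b", "high"); !ok {
		fmt.Println("no pricing configured for pool-b")
	}
}
```

Returning a second boolean lets the caller tell "pool or QoS level not configured" apart from a legitimately zero price.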
func (*MetricsRecorder) RecordMetrics ¶ added in v1.33.4
func (mr *MetricsRecorder) RecordMetrics(writer io.Writer)
func (*MetricsRecorder) Start ¶ added in v1.33.4
func (mr *MetricsRecorder) Start()
Start starts the metrics recorder. Only the leader container fills the metrics map; followers have no metric points, so metrics are written by a single controller instance. Known issue: with a one-minute interval, some metrics may be missed or billing may be slightly inaccurate.
type NodeMetrics ¶ added in v1.33.4
type NodeMetrics struct {
	NodeName                           string    `json:"nodeName"`
	PoolName                           string    `json:"poolName"`
	AllocatedTflops                    float64   `json:"allocatedTflops"`
	AllocatedTflopsPercent             float64   `json:"allocatedTflopsPercent"`
	AllocatedVramBytes                 float64   `json:"allocatedVramBytes"`
	AllocatedVramPercent               float64   `json:"allocatedVramPercent"`
	AllocatedTflopsPercentToVirtualCap float64   `json:"allocatedTflopsPercentToVirtualCap"`
	AllocatedVramPercentToVirtualCap   float64   `json:"allocatedVramPercentToVirtualCap"`
	RawCost                            float64   `json:"rawCost"`
	LastRecordTime                     time.Time `json:"lastRecordTime"`

	// additional field for raw cost calculation since each GPU has different price
	GPUModels []string `json:"gpuModels"`
}
type RawBillingPricing ¶ added in v1.33.4
type WorkerMetrics ¶ added in v1.33.4
type WorkerMetrics struct {
	WorkerName       string    `json:"workerName"`
	WorkloadName     string    `json:"workloadName"`
	PoolName         string    `json:"poolName"`
	Namespace        string    `json:"namespace"`
	QoS              string    `json:"qos"`
	TflopsRequest    float64   `json:"tflopsRequest"`
	TflopsLimit      float64   `json:"tflopsLimit"`
	VramBytesRequest float64   `json:"vramBytesRequest"`
	VramBytesLimit   float64   `json:"vramBytesLimit"`
	GPUCount         int       `json:"gpuCount"`
	RawCost          float64   `json:"rawCost"`
	LastRecordTime   time.Time `json:"lastRecordTime"`

	// For more accurate metrics, should record the deletion timestamp to calculate duration for the last metrics
	DeletionTimestamp time.Time `json:"deletionTimestamp"`
}
Metrics are stored in a map keyed by worker name, with the metrics as the value. By default, metrics are updated every minute.