metrics

package
v1.33.5
Warning

This package is not in the latest version of its module.

Published: Jun 7, 2025 License: Apache-2.0 Imports: 9 Imported by: 0

Documentation

Constants

This section is empty.

Variables

var NodeMetricsMap = map[string]*NodeMetrics{}
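The declaration shows only the map type. The following is a minimal sketch of how a map of this shape could be kept up to date, assuming the key is the node name (suggested by RemoveNodeMetrics(nodeName string), but not stated in the documentation). The trimmed NodeMetrics struct, nodeMetricsMap, setNode, and removeNode are hypothetical stand-ins, not the package's own code.

```go
package main

import "fmt"

// NodeMetrics is a trimmed stand-in for the documented struct,
// redeclared so that this sketch compiles on its own.
type NodeMetrics struct {
	NodeName        string
	PoolName        string
	AllocatedTflops float64
}

// nodeMetricsMap stands in for the package-level NodeMetricsMap.
// ASSUMPTION: entries are keyed by node name.
var nodeMetricsMap = map[string]*NodeMetrics{}

// setNode is a hypothetical helper: it creates the entry on first use
// and updates it in place afterwards.
func setNode(name, pool string, tflops float64) {
	m, ok := nodeMetricsMap[name]
	if !ok {
		m = &NodeMetrics{NodeName: name}
		nodeMetricsMap[name] = m
	}
	m.PoolName = pool
	m.AllocatedTflops = tflops
}

// removeNode is a hypothetical helper: it drops the entry for a node.
func removeNode(name string) {
	delete(nodeMetricsMap, name)
}

func main() {
	setNode("node-1", "pool-a", 120)
	setNode("node-1", "pool-a", 150) // second call updates the same entry
	fmt.Println(len(nodeMetricsMap), nodeMetricsMap["node-1"].AllocatedTflops) // prints "1 150"
	removeNode("node-1")
	fmt.Println(len(nodeMetricsMap)) // prints "0"
}
```

Go maps are not safe for concurrent writes, so code that touches a shared map like this from several goroutines needs its own synchronization.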

Functions

func RemoveNodeMetrics added in v1.33.4

func RemoveNodeMetrics(nodeName string)

func RemoveWorkerMetrics added in v1.33.4

func RemoveWorkerMetrics(workerName string, deletionTime time.Time)

func SetNodeMetrics added in v1.33.4

func SetNodeMetrics(node *tfv1.GPUNode, poolObj *tfv1.GPUPool, gpuModels []string)

func SetWorkerMetricsByWorkload added in v1.33.4

func SetWorkerMetricsByWorkload(pod *corev1.Pod, workload *tfv1.TensorFusionWorkload, now time.Time)

Types

type MetricsRecorder added in v1.33.4

type MetricsRecorder struct {
	MetricsOutputPath string

	// Raw billing result for node and workers
	HourlyUnitPriceMap map[string]float64

	// Worker level unit price map, key is pool name, second level key is QoS level
	WorkerUnitPriceMap map[string]map[string]RawBillingPricing
}
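The comment on WorkerUnitPriceMap describes a two-level map: pool name first, then QoS level. The sketch below shows a lookup against a map of that shape. RawBillingPricing is redeclared locally so the example is self-contained; lookupWorkerPrice, the pool name, the QoS label, and the prices are hypothetical.

```go
package main

import "fmt"

// RawBillingPricing mirrors the documented struct; redeclared here
// so that the sketch compiles without the real package.
type RawBillingPricing struct {
	TflopsPerSecond float64
	VramPerSecond   float64

	TflopsOverRequestPerSecond float64
	VramOverRequestPerSecond   float64
}

// lookupWorkerPrice is a hypothetical helper, not part of the package.
// It resolves a price by pool name, then by QoS level, and reports
// whether both levels were present.
func lookupWorkerPrice(prices map[string]map[string]RawBillingPricing, pool, qos string) (RawBillingPricing, bool) {
	byQoS, ok := prices[pool]
	if !ok {
		return RawBillingPricing{}, false
	}
	p, ok := byQoS[qos]
	return p, ok
}

func main() {
	// Illustrative values only; pool and QoS names are made up.
	prices := map[string]map[string]RawBillingPricing{
		"pool-a": {
			"medium": {TflopsPerSecond: 0.5, VramPerSecond: 0.25},
		},
	}

	if p, ok := lookupWorkerPrice(prices, "pool-a", "medium"); ok {
		fmt.Println(p.TflopsPerSecond, p.VramPerSecond)
	}
	if _, ok := lookupWorkerPrice(prices, "pool-b", "medium"); !ok {
		fmt.Println("no price configured for pool-b")
	}
}
```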

func (*MetricsRecorder) RecordMetrics added in v1.33.4

func (mr *MetricsRecorder) RecordMetrics(writer io.Writer)

func (*MetricsRecorder) Start added in v1.33.4

func (mr *MetricsRecorder) Start()

Start starts the metrics recorder. Only the leader container fills the metrics map, so followers have no metric points and the recorder prints metrics in a single controller instance only. Known issue: with a one-minute interval, some metrics may be missed and billing may be inaccurate.

type NodeMetrics added in v1.33.4

type NodeMetrics struct {
	NodeName string `json:"nodeName"`
	PoolName string `json:"poolName"`

	AllocatedTflops        float64 `json:"allocatedTflops"`
	AllocatedTflopsPercent float64 `json:"allocatedTflopsPercent"`
	AllocatedVramBytes     float64 `json:"allocatedVramBytes"`
	AllocatedVramPercent   float64 `json:"allocatedVramPercent"`

	AllocatedTflopsPercentToVirtualCap float64 `json:"allocatedTflopsPercentToVirtualCap"`
	AllocatedVramPercentToVirtualCap   float64 `json:"allocatedVramPercentToVirtualCap"`

	RawCost float64 `json:"rawCost"`

	LastRecordTime time.Time `json:"lastRecordTime"`

	// additional field for raw cost calculation since each GPU has different price
	GPUModels []string `json:"gpuModels"`
}

type RawBillingPricing added in v1.33.4

type RawBillingPricing struct {
	TflopsPerSecond float64
	VramPerSecond   float64

	TflopsOverRequestPerSecond float64
	VramOverRequestPerSecond   float64
}
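The documentation does not say how these four rates are combined or what units they use. One plausible reading, assumed here purely for illustration, is that requested capacity is billed at the base rate and the gap between limit and request is billed at the over-request rate. estimateCost and nonNegative are hypothetical helpers, and the rates are arbitrary numbers chosen to keep the arithmetic easy to follow.

```go
package main

import "fmt"

// RawBillingPricing mirrors the documented struct; redeclared here
// so that the sketch compiles without the real package.
type RawBillingPricing struct {
	TflopsPerSecond float64
	VramPerSecond   float64

	TflopsOverRequestPerSecond float64
	VramOverRequestPerSecond   float64
}

// nonNegative clamps negative values to zero.
func nonNegative(x float64) float64 {
	if x < 0 {
		return 0
	}
	return x
}

// estimateCost is a hypothetical helper.
// ASSUMPTION: request is billed at the base rate, and (limit - request)
// is billed at the over-request rate. The real package may differ.
func estimateCost(p RawBillingPricing, tflopsReq, tflopsLim, vramReq, vramLim, seconds float64) float64 {
	perSecond := tflopsReq*p.TflopsPerSecond +
		nonNegative(tflopsLim-tflopsReq)*p.TflopsOverRequestPerSecond +
		vramReq*p.VramPerSecond +
		nonNegative(vramLim-vramReq)*p.VramOverRequestPerSecond
	return perSecond * seconds
}

func main() {
	// Made-up rates and units, for arithmetic only.
	p := RawBillingPricing{
		TflopsPerSecond:            0.5,
		VramPerSecond:              0.125,
		TflopsOverRequestPerSecond: 0.25,
		VramOverRequestPerSecond:   0.0625,
	}
	// (10*0.5 + 5*0.25 + 4*0.125 + 0*0.0625) * 60 = 6.75 * 60
	fmt.Println(estimateCost(p, 10, 15, 4, 4, 60)) // prints "405"
}
```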

type WorkerMetrics added in v1.33.4

type WorkerMetrics struct {
	WorkerName   string `json:"workerName"`
	WorkloadName string `json:"workloadName"`
	PoolName     string `json:"poolName"`
	Namespace    string `json:"namespace"`
	QoS          string `json:"qos"`

	TflopsRequest    float64 `json:"tflopsRequest"`
	TflopsLimit      float64 `json:"tflopsLimit"`
	VramBytesRequest float64 `json:"vramBytesRequest"`
	VramBytesLimit   float64 `json:"vramBytesLimit"`
	GPUCount         int     `json:"gpuCount"`
	RawCost          float64 `json:"rawCost"`

	LastRecordTime time.Time `json:"lastRecordTime"`

	// For more accurate metrics, should record the deletion timestamp to calculate duration for the last metrics
	DeletionTimestamp time.Time `json:"deletionTimestamp"`
}

Metrics are stored in a map whose key is the worker name and whose value is the worker's metrics. By default, metrics are updated every minute.
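The DeletionTimestamp comment and the one-minute update interval together imply that a worker deleted between two updates should be measured only up to its deletion time. A minimal sketch of that calculation follows; billableDuration, at, and notDeleted are hypothetical names, and the package's actual logic may differ.

```go
package main

import (
	"fmt"
	"time"
)

// notDeleted is the zero time.Time, used here to mean
// "no deletion timestamp has been recorded".
var notDeleted time.Time

// at builds a timestamp a given number of seconds after the Unix epoch.
func at(sec int) time.Time {
	return time.Unix(int64(sec), 0).UTC()
}

// billableDuration is a hypothetical helper. It measures from the last
// record time to now, or to the deletion time when the worker was
// deleted before now.
func billableDuration(lastRecord, deletion, now time.Time) time.Duration {
	end := now
	if !deletion.IsZero() && deletion.Before(now) {
		end = deletion
	}
	if end.Before(lastRecord) {
		return 0
	}
	return end.Sub(lastRecord)
}

func main() {
	last, now := at(0), at(60)

	// Worker still running: the whole one-minute interval counts.
	fmt.Println(billableDuration(last, notDeleted, now)) // prints "1m0s"

	// Worker deleted 25 seconds into the interval: only 25 seconds count.
	fmt.Println(billableDuration(last, at(25), now)) // prints "25s"
}
```

Without the deletion timestamp, the second case would be measured as a full minute, which is the kind of inaccuracy the field comment is guarding against.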