Documentation
¶
Overview ¶
Package metricdispatch provides a unified interface for loading and caching job metric data.
This package serves as a central dispatcher that routes metric data requests to the appropriate backend based on job state. For running jobs, data is fetched from the metric store (e.g., cc-metric-store). For completed jobs, data is retrieved from the file-based job archive.
Key Features ¶
- Automatic backend selection based on job state (running vs. archived)
- LRU cache for performance optimization (128 MB default cache size)
- Data resampling using the Largest-Triangle-Three-Buckets (LTTB) algorithm for archived data
- Automatic statistics series generation for jobs with many nodes
- Support for scoped metrics (node, socket, accelerator, core)
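The Largest-Triangle-Three-Buckets resampling mentioned above can be sketched in a self-contained form. The Point type and the lttb function below are illustrative stand-ins, not the package's actual resampler, which operates on schema.Float series:

```go
package main

import "fmt"

// Point is an illustrative (timestamp, value) pair.
type Point struct{ X, Y float64 }

// lttb is a sketch of Largest-Triangle-Three-Buckets downsampling: keep
// the first and last points and, per bucket, the point forming the
// largest triangle with the previously kept point and the next bucket's
// average.
func lttb(data []Point, threshold int) []Point {
	if threshold >= len(data) || threshold < 3 {
		return data // nothing to downsample
	}
	sampled := make([]Point, 0, threshold)
	sampled = append(sampled, data[0]) // always keep the first point

	// Bucket width over the interior points (first and last excluded).
	every := float64(len(data)-2) / float64(threshold-2)
	a := 0 // index of the most recently kept point

	for i := 0; i < threshold-2; i++ {
		// Average of the next bucket: the third triangle vertex.
		nextStart := int(float64(i+1)*every) + 1
		nextEnd := int(float64(i+2)*every) + 1
		if nextEnd > len(data) {
			nextEnd = len(data)
		}
		var avgX, avgY float64
		for _, p := range data[nextStart:nextEnd] {
			avgX += p.X
			avgY += p.Y
		}
		n := float64(nextEnd - nextStart)
		avgX, avgY = avgX/n, avgY/n

		// Keep the current bucket's point with the largest triangle area.
		start := int(float64(i)*every) + 1
		maxArea, maxIdx := -1.0, start
		for j := start; j < nextStart; j++ {
			area := (data[a].X-avgX)*(data[j].Y-data[a].Y) -
				(data[a].X-data[j].X)*(avgY-data[a].Y)
			if area < 0 {
				area = -area
			}
			if area > maxArea {
				maxArea, maxIdx = area, j
			}
		}
		sampled = append(sampled, data[maxIdx])
		a = maxIdx
	}
	return append(sampled, data[len(data)-1]) // always keep the last point
}

func main() {
	data := make([]Point, 10)
	for i := range data {
		data[i] = Point{X: float64(i), Y: float64(i * i)}
	}
	fmt.Println(len(lttb(data, 5))) // 5
}
```

Note that LTTB preserves visual shape (peaks and outliers) better than plain bucket averaging, which is why it suits plotting archived series at reduced resolution.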
Cache Behavior ¶
Cached data has different TTL (time-to-live) values depending on job state:
- Running jobs: 2 minutes (data changes frequently)
- Completed jobs: 5 hours (data is static)
The cache key is based on job ID, state, requested metrics, scopes, and resolution.
Usage ¶
The primary entry point is LoadData, which automatically handles both running and archived jobs:
jobData, err := metricdispatch.LoadData(job, metrics, scopes, ctx, resolution)
if err != nil {
	// Handle error
}
For statistics only, use LoadJobStats, LoadScopedJobStats, or LoadAverages depending on the required format.
Copyright (C) NHR@FAU, University Erlangen-Nuremberg. All rights reserved. Use of this source code is governed by an MIT-style license that can be found in the LICENSE file.
Index ¶
- func Init(rawConfig json.RawMessage) error
- func LoadAverages(job *schema.Job, metrics []string, data [][]schema.Float, ctx context.Context) error
- func LoadData(job *schema.Job, metrics []string, scopes []schema.MetricScope, ...) (schema.JobData, error)
- func LoadJobStats(job *schema.Job, metrics []string, ctx context.Context) (map[string]schema.MetricStatistics, error)
- func LoadNodeData(cluster string, metrics, nodes []string, scopes []schema.MetricScope, ...) (map[string]map[string][]*schema.JobMetric, error)
- func LoadNodeListData(cluster, subCluster string, nodes []string, metrics []string, ...) (map[string]schema.JobData, error)
- func LoadScopedJobStats(job *schema.Job, metrics []string, scopes []schema.MetricScope, ...) (schema.ScopedJobStats, error)
- type CCMetricStoreConfig
- type MetricDataRepository
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Init ¶
func Init(rawConfig json.RawMessage) error
func LoadAverages ¶
func LoadAverages(
	job *schema.Job,
	metrics []string,
	data [][]schema.Float,
	ctx context.Context,
) error
LoadAverages computes average values for the specified metrics across all nodes of a job. For running jobs, it loads statistics from the metric store. For completed jobs, it uses the pre-calculated averages from the job archive. The results are appended to the data slice.
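One plausible shape of that append contract is sketched below, with plain float64 standing in for schema.Float: one average per requested metric is appended to the corresponding inner slice. This is an illustration, not the package's code:

```go
package main

import "fmt"

// appendAverages appends one freshly computed average per metric to
// that metric's inner slice, mirroring the documented behavior of
// LoadAverages. Plain float64 stands in for schema.Float.
func appendAverages(data [][]float64, avgs []float64) [][]float64 {
	for i, avg := range avgs {
		data[i] = append(data[i], avg) // one new value per metric series
	}
	return data
}

func main() {
	// Two metrics, each already holding averages from previously loaded jobs.
	data := [][]float64{{100.0}, {42.0}}
	data = appendAverages(data, []float64{110.0, 40.5})
	fmt.Println(data) // [[100 110] [42 40.5]]
}
```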
func LoadData ¶
func LoadData(
	job *schema.Job,
	metrics []string,
	scopes []schema.MetricScope,
	ctx context.Context,
	resolution int,
) (schema.JobData, error)
LoadData retrieves metric data for a job from the appropriate backend (memory store for running jobs, archive for completed jobs) and applies caching, resampling, and statistics generation as needed.
For running jobs or when archive is disabled, data is fetched from the metric store. For completed archived jobs, data is loaded from the job archive and resampled if needed.
Parameters:
- job: The job for which to load metric data
- metrics: List of metric names to load (nil loads all metrics for the cluster)
- scopes: Metric scopes to include (nil defaults to node scope)
- ctx: Context for cancellation and timeouts
- resolution: Target number of data points for resampling (only applies to archived data)
Returns the loaded job data and any error encountered. If only some metrics fail (a partial error), the function returns the successfully loaded data and logs a warning.
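The backend selection described above amounts to a simple dispatch on job state. A minimal sketch, with illustrative names (the real dispatcher returns a repository, not a string):

```go
package main

import "fmt"

// backendFor sketches the documented routing rule: running jobs (or any
// job when the archive is disabled) read from the metric store, while
// completed jobs read from the job archive.
func backendFor(state string, archiveDisabled bool) string {
	if state == "running" || archiveDisabled {
		return "metricstore"
	}
	return "archive"
}

func main() {
	fmt.Println(backendFor("running", false))   // metricstore
	fmt.Println(backendFor("completed", false)) // archive
	fmt.Println(backendFor("completed", true))  // metricstore
}
```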
func LoadJobStats ¶
func LoadJobStats(
	job *schema.Job,
	metrics []string,
	ctx context.Context,
) (map[string]schema.MetricStatistics, error)
LoadJobStats retrieves aggregated statistics (min/avg/max) for each requested metric across all job nodes. For running jobs, statistics are computed from the metric store. For completed jobs, pre-calculated statistics are loaded from the job archive.
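The min/avg/max aggregation can be illustrated with a self-contained sketch; the MetricStatistics struct mirrors the documented triple, while computeStats is hypothetical (the real values come from the metric store or the job archive):

```go
package main

import "fmt"

// MetricStatistics mirrors the documented min/avg/max triple.
type MetricStatistics struct {
	Min, Avg, Max float64
}

// computeStats aggregates one metric's values across all nodes of a
// job. Illustrative only; callers should never pass an empty slice.
func computeStats(values []float64) MetricStatistics {
	s := MetricStatistics{Min: values[0], Max: values[0]}
	sum := 0.0
	for _, v := range values {
		if v < s.Min {
			s.Min = v
		}
		if v > s.Max {
			s.Max = v
		}
		sum += v
	}
	s.Avg = sum / float64(len(values))
	return s
}

func main() {
	fmt.Println(computeStats([]float64{3, 1, 2})) // {1 2 3}
}
```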
func LoadNodeData ¶
func LoadNodeData(
	cluster string,
	metrics, nodes []string,
	scopes []schema.MetricScope,
	from, to time.Time,
	ctx context.Context,
) (map[string]map[string][]*schema.JobMetric, error)
LoadNodeData retrieves metric data for specific nodes in a cluster within a time range. This is used for node monitoring views and system status pages. Data is always fetched from the metric store (never the archive), since this serves current and recent node status monitoring.
Returns a nested map structure: node -> metric -> scoped data.

FIXME: Add support for subcluster-specific cc-metric-stores
func LoadNodeListData ¶
func LoadNodeListData(
	cluster, subCluster string,
	nodes []string,
	metrics []string,
	scopes []schema.MetricScope,
	resolution int,
	from, to time.Time,
	ctx context.Context,
) (map[string]schema.JobData, error)
LoadNodeListData retrieves time-series metric data for multiple nodes within a time range, with optional resampling and automatic statistics generation for large datasets. This is used for comparing multiple nodes or displaying node status over time.
Returns a map of node names to their job-like metric data structures.
func LoadScopedJobStats ¶
func LoadScopedJobStats(
	job *schema.Job,
	metrics []string,
	scopes []schema.MetricScope,
	ctx context.Context,
) (schema.ScopedJobStats, error)
LoadScopedJobStats retrieves job statistics organized by metric scope (node, socket, core, accelerator). For running jobs, statistics are computed from the metric store. For completed jobs, pre-calculated statistics are loaded from the job archive.
Types ¶
type CCMetricStoreConfig ¶
type MetricDataRepository ¶
type MetricDataRepository interface {
	// Return the JobData for the given job, restricted to the requested metrics.
	LoadData(job *schema.Job,
		metrics []string,
		scopes []schema.MetricScope,
		ctx context.Context,
		resolution int) (schema.JobData, error)

	// Return a map of metrics to a map of nodes to the metric statistics of the job. Node scope only.
	LoadStats(job *schema.Job,
		metrics []string,
		ctx context.Context) (map[string]map[string]schema.MetricStatistics, error)

	// Return a map of metrics to a map of scopes to the scoped metric statistics of the job.
	LoadScopedStats(job *schema.Job,
		metrics []string,
		scopes []schema.MetricScope,
		ctx context.Context) (schema.ScopedJobStats, error)

	// Return a map of hosts to a map of metrics at the requested scopes (currently only node) for each host.
	LoadNodeData(cluster string,
		metrics, nodes []string,
		scopes []schema.MetricScope,
		from, to time.Time,
		ctx context.Context) (map[string]map[string][]*schema.JobMetric, error)

	// Return a map of hosts to a map of metrics to a map of scopes for multiple nodes.
	LoadNodeListData(cluster, subCluster string,
		nodes []string,
		metrics []string,
		scopes []schema.MetricScope,
		resolution int,
		from, to time.Time,
		ctx context.Context) (map[string]schema.JobData, error)

	// HealthCheck evaluates the monitoring state for a set of nodes against expected metrics.
	HealthCheck(cluster string,
		nodes []string,
		metrics []string) (map[string]metricstore.HealthCheckResult, error)
}
func GetHealthCheckRepo ¶
func GetHealthCheckRepo(cluster string) (MetricDataRepository, error)
GetHealthCheckRepo returns the MetricDataRepository for performing health checks on a cluster. It uses the same fallback logic as GetMetricDataRepo: cluster → wildcard → internal.
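The cluster -> wildcard -> internal fallback chain can be sketched as a lookup over a registry map. The string values and the "*" / "internal" names below are illustrative; the real functions return a MetricDataRepository:

```go
package main

import "fmt"

// repoFor sketches the documented fallback chain: an exact cluster
// match wins, then a wildcard entry, then the internal repository.
func repoFor(repos map[string]string, cluster string) string {
	if r, ok := repos[cluster]; ok {
		return r // exact cluster match
	}
	if r, ok := repos["*"]; ok {
		return r // wildcard fallback
	}
	return "internal" // final fallback
}

func main() {
	repos := map[string]string{
		"fritz": "cc-metric-store-a",
		"*":     "cc-metric-store-default",
	}
	fmt.Println(repoFor(repos, "fritz")) // cc-metric-store-a
	fmt.Println(repoFor(repos, "alex"))  // cc-metric-store-default
}
```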
func GetMetricDataRepo ¶
func GetMetricDataRepo(cluster string, subcluster string) (MetricDataRepository, error)