# sys package
The sys package provides lightweight runtime visibility into host and container resources, including:
- CPU count and load estimation
- memory statistics
- container-aware reporting on Linux (cgroup v2 and v1)
- simple, dependency-free fallbacks when preferred sources are unavailable
## Table of Contents

- [Initialization](#initialization)
- [CPU reporting](#cpu-reporting)
- [Memory reporting](#memory-reporting)
- [Container detection](#container-detection)
- [Fallback](#fallback)
- [Example: testing sys package inside a constrained container](#example-testing-sys-package-inside-a-constrained-container)
- [Current limitations and future plans](#current-limitations-and-future-plans)
## Initialization

Package initialization is split into two phases:

- `init()`: sets `NumCPU` to `runtime.NumCPU()` - always safe, no dependencies.
- `Init(contForced bool)`: performs container auto-detection, applies the cgroup-aware CPU count, and adjusts `GOMAXPROCS`. External modules may skip calling `Init()` and still get a sane `NumCPU()`.

A package-level `cgroupVer` variable (0, 1, or 2) is set once during `Init()` and drives all subsequent CPU and memory reads. This avoids re-probing cgroup paths on every sample.

The `ForceContainerCPUMem` feature flag (in `cmn/feat`) can override failed auto-detection for deployments where the heuristics miss. Requires restart.
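A minimal usage sketch, assuming the entry points documented above (exact signatures may differ):

```go
package main

import (
	"fmt"

	"github.com/NVIDIA/aistore/sys"
)

func main() {
	// Optional: run container auto-detection, cgroup-aware CPU
	// accounting, and the GOMAXPROCS adjustment; passing true
	// would force container accounting.
	sys.Init(false)

	// Safe even without Init(): package init() has already set a
	// sane default from runtime.NumCPU().
	fmt.Println("effective CPUs:", sys.NumCPU())
	fmt.Println("max parallelism:", sys.MaxParallelism())
}
```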
## CPU reporting

| Function | Description |
| --- | --- |
| `NumCPU()` | effective CPU count (container-aware) |
| `MaxParallelism()` | derives internal parallelism from `NumCPU()` |
| `Refresh(now, periodic)` | samples CPU, updates ring and EMA; returns `(util, throttled, error)` |
| `CPU(periodic)` | returns smoothed CPU utilization percentage (0–100) plus an `isExtreme` boolean |
| `HighLoadWM()` | high-load watermark derived from CPU count |
| `LoadAverage()` | system load averages (fallback only) |
### Effective CPU count

At startup, the package initializes a process-wide CPU count:

- default: `runtime.NumCPU()`
- container override: cgroup-based CPU quota when detected

The cgroup version is determined once during the package's `Init()` call:

1. Try cgroup v2 (`cpu.max`).
2. Fall back to cgroup v1 (`cpu.cfs_quota_us` / `cpu.cfs_period_us`).
3. If both fail or are missing, keep `runtime.NumCPU()`.
Errors from both paths are aggregated and reported to stderr (nlog is not yet available at init time).
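For illustration, a self-contained sketch of the quota-to-CPU-count derivation; the paths are the standard cgroup mounts, and the round-up is an assumption - the package's actual probing and rounding may differ:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// cgroupCPUs derives an effective CPU count from the cgroup CPU
// quota, falling back to runtime.NumCPU() when no limit is found.
func cgroupCPUs() int {
	// cgroup v2: cpu.max holds "<quota> <period>" or "max <period>".
	if b, err := os.ReadFile("/sys/fs/cgroup/cpu.max"); err == nil {
		f := strings.Fields(string(b))
		if len(f) == 2 && f[0] != "max" {
			if quota, err := strconv.ParseInt(f[0], 10, 64); err == nil && quota > 0 {
				if period, err := strconv.ParseInt(f[1], 10, 64); err == nil && period > 0 {
					return int((quota + period - 1) / period) // round up (assumption)
				}
			}
		}
	}
	// cgroup v1: quota and period live in separate files; quota is -1 when unlimited.
	quota := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	period := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if quota > 0 && period > 0 {
		return int((quota + period - 1) / period)
	}
	// Both probes failed or no limit is set: keep the host count.
	return runtime.NumCPU()
}

func readInt(path string) int64 {
	b, err := os.ReadFile(path)
	if err != nil {
		return -1
	}
	v, _ := strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
	return v
}

func main() { fmt.Println("effective CPUs:", cgroupCPUs()) }
```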
### Utilization model

CPU utilization is sampled as a delta of cumulative CPU time over wall-clock time. Utilization is computed as:

```
(delta_cpu_usage * 100) / (delta_wall_time * NumCPU)
```
All values are integer percentages. Utilization and throttling are computed atomically in a single pass from the same time delta.
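For example, 600ms of cumulative CPU time accrued over a 1s window on a 4-CPU host yields (600 * 100) / (1000 * 4) = 15% utilization.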
### Sampling and smoothing

CPU samples are collected in a circular ring of 4 entries, each containing cumulative usage, throttled time, and a monotonic timestamp. The ring is advanced on each `Refresh()` call; instantaneous utilization is computed as a delta between the current and previous ring entries.

`Refresh()` is called from multiple paths:

- on-demand (gated at `minIvalLong`, 8s): by `CPU()` callers that need a current value but aren't periodic
- periodic (gated at `MinIvalShort`, 2s): piggybacked on three existing ticks:
  - `ios.refresh()` (disk stats),
  - stats-runner (`periodic.stats_time`), and
  - the memsys housekeeping callback

When gated, `Refresh()` returns the cached EMA value without reading `/proc/stat` or cgroup files.
Raw instantaneous utilization is smoothed using a time-scaled exponential moving average (EMA). The smoothing alpha is adjusted based on the elapsed time since the previous sample; for details, see the `compute()` method in `sys/cpu.go`.
The smoothed value and throttled percentage are stored atomically; `CPU()` reads them without locking.

This approach is reminiscent of the disk utilization smoothing in the `ios` package, but it is much simpler: CPU has a single global value (not per-mountpath), so no ring walk or per-device map lookup is needed.
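A minimal sketch of the time-scaled smoothing idea, with an assumed alpha schedule (the authoritative version is `compute()` in `sys/cpu.go`):

```go
package main

import (
	"fmt"
	"time"
)

// ema blends the previous smoothed value with a new raw sample.
// The longer the gap since the last sample, the larger alpha gets,
// so stale history decays faster. The schedule below (half weight
// at elapsed == halfLife) is an assumption for illustration.
func ema(prev, raw int64, elapsed, halfLife time.Duration) int64 {
	alpha := float64(elapsed) / float64(elapsed+halfLife)
	return int64(alpha*float64(raw) + (1-alpha)*float64(prev))
}

func main() {
	prev := int64(20) // previous EMA, percent
	// A fresh 2s-old sample moves the EMA halfway; an 8s-old gap
	// weights the new sample much more heavily.
	fmt.Println(ema(prev, 80, 2*time.Second, 2*time.Second)) // 50
	fmt.Println(ema(prev, 80, 8*time.Second, 2*time.Second)) // 68
}
```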
### Linux source hierarchy

The cgroup version is determined at init time and stored in `cgroupVer`. The `cpu.read()` method switches on it - there is no fallback cascade per sample:

| cgroupVer | Source | Fields |
| --- | --- | --- |
| 2 | `cpu.stat` | `usage_usec`, `throttled_usec` |
| 1 | `cpuacct.usage` | cumulative nanoseconds |
| 0 | `/proc/stat` | aggregate jiffy line |

If all sources fail at read time, `CPU()` falls back to `/proc/loadavg` converted to a percentage.
The aggregate `cpu` line from `/proc/stat` is parsed using a whitelist of fields:

- included: `user` (1), `nice` (2), `system` (3), `irq` (6), `softirq` (7), `steal` (8)

Explicitly excluded:

- `idle` (4), `iowait` (5): not active CPU time
- `guest` (9), `guest_nice` (10): already included in `user` and `nice` by the kernel - summing them would double-count
Steal is included because it represents CPU time unavailable to the node, which is what load-based throttling and worker tuning need to make decisions.
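For illustration, a self-contained sketch of this whitelist, assuming the canonical field order of the aggregate `cpu` line:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// activeJiffies sums the whitelisted fields of the aggregate "cpu"
// line: user, nice, system, irq, softirq, steal. It skips idle and
// iowait, and the guest fields the kernel already folds into user/nice.
func activeJiffies() (uint64, error) {
	f, err := os.Open("/proc/stat")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 9 || fields[0] != "cpu" {
			continue // per-CPU lines ("cpu0", ...) and other stats
		}
		var sum uint64
		for _, i := range []int{1, 2, 3, 6, 7, 8} { // user, nice, system, irq, softirq, steal
			v, err := strconv.ParseUint(fields[i], 10, 64)
			if err != nil {
				return 0, err
			}
			sum += v
		}
		return sum, nil
	}
	return 0, sc.Err()
}

func main() {
	jiffies, err := activeJiffies()
	fmt.Println("active jiffies:", jiffies, err)
}
```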
### CPU starvation vs utilization

`CPU()` distinguishes between:

- high utilization: the CPU is busy
- CPU starvation: the container is being throttled
In cgroup v2 environments, `throttled_usec` from `cpu.stat` is tracked as a percentage of wall-clock time. If throttling exceeds the extreme threshold (>10%), the system is reported as under extreme CPU pressure - even if raw utilization appears moderate.

This is intentional: throttling indicates lack of CPU availability, not just high usage.

Operational thresholds are compile-time constants: `HighLoad` (85%), `ExtremeLoad` (95%), and `throttleExtremeThresh` (10%).
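A sketch of the resulting classification, using the documented constants; the exact comparison operators are an assumption:

```go
package main

import "fmt"

// Compile-time thresholds as documented above (lowercased here
// to keep the sketch self-contained).
const (
	highLoad              = 85 // percent: approaching saturation
	extremeLoad           = 95 // percent: saturated
	throttleExtremeThresh = 10 // percent of wall-clock time spent throttled
)

// isExtreme: either raw utilization or cgroup v2 throttling can
// trip the "extreme" flag independently.
func isExtreme(utilPct, throttledPct int64) bool {
	return utilPct >= extremeLoad || throttledPct > throttleExtremeThresh
}

func main() {
	// Moderate utilization with heavy throttling still reports extreme.
	fmt.Println(isExtreme(40, 15)) // true
	fmt.Println(isExtreme(90, 0))  // false: high, but not extreme
}
```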
## Memory reporting

| Function | Description |
| --- | --- |
| `MemStat.Get()` | populates memory statistics |
| `MemStat.Str()` | formats a compact summary |
`Get()` switches on `cgroupVer` and calls the appropriate stateless reader:

| cgroupVer | Reader | Source |
| --- | --- | --- |
| 2 | `readMemCgroupV2()` | `memory.max`, `memory.current`, `memory.stat` |
| 1 | `readMemCgroupV1()` | `memory.limit_in_bytes`, `memory.usage_in_bytes`, `memory.stat` |
| 0 | `readMemHost()` | `/proc/meminfo` |
All readers are stateless free functions returning `(MemStat, error)`.
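A hypothetical usage sketch of the two documented methods; the exact `Get()` signature is an assumption:

```go
package main

import (
	"fmt"

	"github.com/NVIDIA/aistore/sys"
)

func main() {
	var mem sys.MemStat
	// Get() populates the struct from the cgroupVer-selected source
	// (assumption: it returns an error on failure).
	if err := mem.Get(); err != nil {
		fmt.Println("memory stats unavailable:", err)
		return
	}
	fmt.Println(mem.Str()) // compact one-line summary
}
```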
### Host memory

Host memory is read from `/proc/meminfo`. Fields used:

`MemTotal`, `MemFree`, `MemAvailable`, `Cached`, `Buffers`, `SwapTotal`, `SwapFree`

If `MemAvailable` is not present (older kernels), `ActualFree` falls back to `MemFree + BuffCache`.
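A sketch of that fallback with illustrative names (the function itself is not part of the package):

```go
package main

import "fmt"

// actualFree mirrors the fallback described above: prefer the
// kernel's own MemAvailable estimate when present, otherwise
// approximate with MemFree plus reclaimable buffers/cache.
func actualFree(memAvailable, memFree, buffCache uint64) uint64 {
	if memAvailable > 0 {
		return memAvailable
	}
	return memFree + buffCache // coarser approximation for older kernels
}

func main() {
	// No MemAvailable: 2 GiB free + 20 GiB buff/cache.
	fmt.Println(actualFree(0, 2<<30, 20<<30) >> 30, "GiB") // 22 GiB
}
```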
### Container memory (cgroup v2)

- `memory.max` - limit in bytes, or `max` (no limit → fall back to host)
- `memory.current` - current usage including kernel caches
- `memory.stat` - `inactive_file` used as reclaimable cache (`BuffCache`)

Derived: `ActualUsed = Used - BuffCache`, `ActualFree = Total - ActualUsed`.

Usage is capped at the limit to handle transient kernel overshoot before OOM.
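A sketch of this derivation with illustrative names, including the overshoot cap:

```go
package main

import "fmt"

// deriveV2 computes the derived fields from raw cgroup v2 readings;
// the names are illustrative, not the package's own.
func deriveV2(limit, current, inactiveFile uint64) (used, buffCache, actualUsed, actualFree uint64) {
	used = current
	if used > limit {
		used = limit // cap transient kernel overshoot before OOM
	}
	buffCache = inactiveFile // reclaimable page cache
	if buffCache > used {
		buffCache = used // guard the unsigned subtraction below
	}
	actualUsed = used - buffCache
	actualFree = limit - actualUsed
	return
}

func main() {
	// 512 MiB limit, 200 MiB current, 150 MiB reclaimable.
	used, bc, au, af := deriveV2(512<<20, 200<<20, 150<<20)
	fmt.Println(used>>20, bc>>20, au>>20, af>>20, "MiB") // 200 150 50 462 MiB
}
```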
### Container memory (cgroup v1)

- `memory.limit_in_bytes` - values greater than `MaxInt64/2` treated as "no limit" (fall back to host)
- `memory.usage_in_bytes` - current usage
- `memory.stat` - `total_cache` used as reclaimable cache
Swap statistics are always host statistics regardless of cgroup version.
## Container detection

Container detection uses a best-effort heuristic at init time:

- check for `/.dockerenv`
- scan `/proc/1/cgroup` for markers: `docker`, `containerd`, `kubepods`, `kube`, `lxc`, `libpod`, `podman`
If auto-detection fails but the deployment is known to be containerized, set the `ForceContainerCPUMem` feature flag. This forces cgroup-based CPU and memory accounting. Requires restart.
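A self-contained sketch of the heuristic (the package's actual marker list may evolve):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// inContainer applies the two best-effort checks described above.
func inContainer() bool {
	// Docker drops a sentinel file at the filesystem root.
	if _, err := os.Stat("/.dockerenv"); err == nil {
		return true
	}
	// Otherwise scan PID 1's cgroup membership for runtime markers.
	b, err := os.ReadFile("/proc/1/cgroup")
	if err != nil {
		return false
	}
	s := string(b)
	for _, marker := range []string{"docker", "containerd", "kubepods", "kube", "lxc", "libpod", "podman"} {
		if strings.Contains(s, marker) {
			return true
		}
	}
	return false
}

func main() { fmt.Println("containerized:", inContainer()) }
```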
## Fallback
The package follows these rules:
- preferred source first, determined once at init time
- degrade to older or coarser source when the preferred source is unavailable
- for CPU: preserve a usable percentage whenever possible
- for memory: prefer host stats over failing when container-specific files cannot be read
## Example: testing sys package inside a constrained container
A simple way to validate container-aware CPU and memory reporting is to compare the same test run on the host and inside a Docker container with explicit CPU and memory limits.
### Host run

```console
go test -v -tags=debug
```
Example output:

```console
=== RUN TestNumCPU
--- PASS: TestNumCPU (0.00s)
=== RUN TestLoadAvg
sys_test.go:41: Load average: 0.63, 0.49, 0.49
--- PASS: TestLoadAvg (0.00s)
=== RUN TestMaxProcs
--- PASS: TestMaxProcs (0.00s)
=== RUN TestMemoryStats
sys_test.go:76: Memory stats: {used 29GiB, free 2GiB, buffcache 20GiB, actfree 23GiB}
sys_test.go:79: Either swap is off or failed to read its stats
--- PASS: TestMemoryStats (0.00s)
=== RUN TestProcAndMaxLoad
sys_test.go:110: First call: load=0, extreme=false
...
sys_test.go:133: Second call: load=3, extreme=false
sys_test.go:145: Process CPU usage: 1.85%
--- PASS: TestProcAndMaxLoad (5.69s)
PASS
ok github.com/NVIDIA/aistore/sys 5.696s
```
### Container run

```console
docker run --rm \
  --cpus=1.5 \
  --memory=512m \
  -v "$PWD":/src -w /src \
  -v "$HOME/go/pkg/mod":/go/pkg/mod \
  -v "$HOME/.cache/go-build":/root/.cache/go-build \
  golang:1.26 \
  go test ./sys -run . -v -count=1 2>&1
```
Example output:

```console
=== RUN TestNumCPU
--- PASS: TestNumCPU (0.00s)
=== RUN TestLoadAvg
sys_test.go:41: Load average: 0.36, 0.41, 0.47
--- PASS: TestLoadAvg (0.00s)
=== RUN TestMaxProcs
--- PASS: TestMaxProcs (0.00s)
=== RUN TestMemoryStats
sys_test.go:76: Memory stats: {used 29MiB, free 483MiB, buffcache 152KiB, actfree 483MiB}
sys_test.go:79: Either swap is off or failed to read its stats
--- PASS: TestMemoryStats (0.00s)
=== RUN TestProcAndMaxLoad
sys_test.go:110: First call: load=0, extreme=false
...
sys_test.go:133: Second call: load=15, extreme=false
sys_test.go:145: Process CPU usage: 14.82%
--- PASS: TestProcAndMaxLoad (5.70s)
```
The comparison illustrates several points:

- On the host, `TestMemoryStats` reports host-scale memory totals.
- Inside the container, the same test reports memory bounded by the cgroup limit (`--memory=512m`) rather than the host's physical RAM.
- `TestNumCPU` and `TestMaxProcs` exercise init-time CPU detection and the container-aware CPU count.
- `TestProcAndMaxLoad` burns CPU in-process and verifies that `CPU()` reports non-zero utilization on a subsequent sample.
- Swap may report as zero or be unavailable inside short-lived containers; this is not unusual.
To further confirm container-scoped memory accounting, rerun the container example with a different limit (for example, `--memory=4G`). `TestMemoryStats` should then report a total close to 4 GiB instead of 512 MiB.

The container example above assumes cgroup v2 - the default on modern Linux distributions and container runtimes.
## Current limitations and future plans

### cgroup v1 deprecation

Note that cgroup v1 support is deprecated and will be removed in a future (post-4.4) release.

All major container runtimes and orchestrators now default to cgroup v2.