Documentation ¶
Overview ¶
Package capacity is modeld's hardware capacity planner. It resolves the effective context window a model can actually be served at on this device, using the model's KV-cache footprint and the device's free memory rather than the model's trained ceiling alone. modeld owns this calculation because it owns the hardware (see docs/blueprints/modeld-interface-boundary.md and plan-llamacpp.md:16); the runtime consumes the resolved number and never computes it.
Constants ¶
const DefaultHeadroomFrac = 0.1
DefaultHeadroomFrac is the fraction of free memory reserved for activations, the compute graph, and fragmentation; the remainder is available for model weights and the KV cache.
Variables ¶
This section is empty.
Functions ¶
func HeadroomFromEnv ¶
func HeadroomFromEnv() float64
HeadroomFromEnv reads CONTENOX_MODELD_MEM_HEADROOM (a fraction in (0,1)), falling back to DefaultHeadroomFrac.
func KVBytesPerToken ¶
KVBytesPerToken is the memory one token of context costs in the KV cache: K and V, across every layer and KV head, at the KV precision.
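As a rough worked example of that cost, the sketch below multiplies the factors named above. The function signature is not shown on this page, so the parameter list here is hypothetical. The per-head dimension is an assumed input (the description implies it but does not name it), and the KV precision is taken as a whole number of bytes per element.

```go
package main

import "fmt"

// kvBytesPerToken is a hypothetical helper, not the package's signature.
// It counts one K and one V vector per layer and KV head.
// Assumptions: headDim is the per-head dimension, and bytesPerElem is the
// KV precision in whole bytes (for example 2 for fp16).
func kvBytesPerToken(layers, kvHeads, headDim, bytesPerElem int64) int64 {
	const kAndV = 2 // one key vector plus one value vector
	return kAndV * layers * kvHeads * headDim * bytesPerElem
}

func main() {
	// Illustrative model shape: 32 layers, 8 KV heads, head dim 128, fp16.
	fmt.Println(kvBytesPerToken(32, 8, 128, 2)) // 131072 bytes (128 KiB) per token
}
```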
Types ¶
type MemorySource ¶
MemorySource reports the free memory of the device a backend serves on. modeld picks the source by device: system RAM for CPU devices; for GPU devices, VRAM is read through a CGO seam (ov::Core or ggml) that each backend fills in when a GPU device is selected.
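The declaration of MemorySource is not shown on this page, so the following is only one plausible shape for the per-device selection described above. The interface method, the `fixedSource` stand-in, the device names, and the nil-means-unfilled convention are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// memorySource is a hypothetical shape for the documented MemorySource.
type memorySource interface {
	FreeBytes() (int64, error)
}

// fixedSource returns a constant; it stands in for a real RAM or VRAM probe.
type fixedSource int64

func (f fixedSource) FreeBytes() (int64, error) { return int64(f), nil }

// pickSource chooses a source by device name. The "CPU" name and treating
// a nil vram source as "seam not filled in" are assumptions of this sketch.
func pickSource(device string, ram, vram memorySource) (memorySource, error) {
	if device == "CPU" {
		return ram, nil
	}
	if vram == nil {
		return nil, errors.New("no VRAM source filled in for device " + device)
	}
	return vram, nil
}

func main() {
	ram := fixedSource(32 << 30) // pretend 32 GiB of system RAM is free

	src, err := pickSource("CPU", ram, nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	free, _ := src.FreeBytes()
	fmt.Println("CPU free bytes:", free)

	// A GPU device without a backend-provided source is reported as an error.
	_, err = pickSource("GPU.0", ram, nil)
	fmt.Println("GPU lookup:", err)
}
```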
type ModelCapacity ¶
type ModelCapacity struct {
	ModelMaxContext  int
	EffectiveContext int
	KVBytesPerToken  int64
	FreeBytes        int64
	WeightsBytes     int64
}
ModelCapacity is the resolved result reported to the runtime. EffectiveContext is the window modeld will actually serve and the value the cache identity (manifest context_size) must use; the remaining fields explain how it was derived.
func Resolve ¶
func Resolve(p Params) ModelCapacity
Resolve computes the effective context window:
    effective = clamp(request, 0,
        min(modelMax, (free*(1-headroom) - weights) / kvBytesPerToken))
Unknown inputs degrade gracefully: with no KV cost it falls back to the model ceiling (clamped by request); with no ceiling it uses the memory budget.
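The formula and its fallbacks can be sketched as below. This is an illustrative reading of the description, not the package's code: `params` and `resolve` are local stand-ins that copy the documented field names, a negative memory budget is clamped to zero, and the case where both the KV cost and the ceiling are unknown (not covered by the text) is assumed to pass the request through.

```go
package main

import "fmt"

const defaultHeadroomFrac = 0.1 // mirrors the documented DefaultHeadroomFrac

// params is a local stand-in using the documented Params field names.
type params struct {
	ModelMaxCtx     int
	KVBytesPerToken int64
	WeightsBytes    int64
	FreeBytes       int64
	Request         int
	HeadroomFrac    float64
}

// resolve returns the effective context window under the stated assumptions.
func resolve(p params) int {
	headroom := p.HeadroomFrac
	if headroom <= 0 || headroom >= 1 {
		headroom = defaultHeadroomFrac
	}

	limit, haveLimit := 0, false
	if p.KVBytesPerToken > 0 {
		// Memory side: tokens that fit after headroom and weights.
		budget := int64(float64(p.FreeBytes)*(1-headroom)) - p.WeightsBytes
		if budget < 0 {
			budget = 0 // assumption: weights that do not fit yield a zero window
		}
		limit, haveLimit = int(budget/p.KVBytesPerToken), true
	}
	if p.ModelMaxCtx > 0 && (!haveLimit || p.ModelMaxCtx < limit) {
		// Ceiling side: the trained maximum caps the memory-derived limit.
		limit, haveLimit = p.ModelMaxCtx, true
	}
	if !haveLimit {
		// Assumption: with neither side known, pass the request through.
		if p.Request < 0 {
			return 0
		}
		return p.Request
	}
	if p.Request > 0 && p.Request < limit {
		return p.Request
	}
	return limit // Request == 0 means "use the resolved max"
}

func main() {
	p := params{
		ModelMaxCtx:     131072,
		KVBytesPerToken: 131072,  // 128 KiB per token
		WeightsBytes:    8 << 30, // 8 GiB of weights
		FreeBytes:       16 << 30, // 16 GiB free
	}
	fmt.Println(resolve(p)) // 52428: memory-bound, below the model ceiling

	p.Request = 8192
	fmt.Println(resolve(p)) // 8192: the request fits and is honored
}
```

In the first call, 90% of 16 GiB minus 8 GiB of weights leaves about 6.4 GiB, which at 128 KiB per token holds 52428 tokens, so memory rather than the trained ceiling sets the window.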
type Params ¶
type Params struct {
	ModelMaxCtx     int     // model's trained context ceiling (0 = unknown)
	KVBytesPerToken int64   // 0 = unknown (cannot budget by memory)
	WeightsBytes    int64   // resident model weight footprint
	FreeBytes       int64   // device free memory
	Request         int     // requested window (0 = use the resolved max)
	HeadroomFrac    float64 // <=0 or >=1 falls back to DefaultHeadroomFrac
}
Params are the inputs to a capacity resolution. Zero values mean "unknown": an unknown ModelMaxCtx or KVBytesPerToken disables that side of the clamp rather than producing a bogus window.