Documentation ¶
Overview ¶
Package capacity is modeld's hardware capacity planner: it resolves the EFFECTIVE context window a model can actually be served at on this device, from the model's KV-cache footprint and the device's free memory — not the model's trained ceiling alone. modeld owns this because it owns the hardware (see docs/blueprints/modeld-interface-boundary.md and plan-llamacpp.md:16); the runtime consumes the resolved number and never computes it.
Index ¶
Constants ¶
const DefaultHeadroomFrac = 0.1
DefaultHeadroomFrac of free memory is reserved for activations, the compute graph, and fragmentation, leaving the rest for model weights + KV cache.
const DefaultMaxResidentFrac = 0.8
DefaultMaxResidentFrac is the launch-time cap used when the user did not set a memory ceiling. modeld will not grow past this fraction of the memory that was free when the backend service was created; per-call current free memory can still clamp lower.
Variables ¶
This section is empty.
Functions ¶
func HeadroomFromEnv ¶
func HeadroomFromEnv() float64
HeadroomFromEnv reads CONTENOX_MODELD_MEM_HEADROOM (a fraction in (0,1)), falling back to DefaultHeadroomFrac.
func KVBytesPerToken ¶
KVBytesPerToken is the memory one token of context costs in the KV cache: K and V, across every layer and KV head, at the KV precision.
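The signature is not shown on this page, so the following is only a sketch of the arithmetic the description implies. The helper name `kvBytesPerToken`, its parameter list, and the per-head dimension factor are assumptions made for illustration.

```go
package main

import "fmt"

// kvBytesPerToken is a hypothetical helper, not the package's function.
// It multiplies 2 (one K and one V tensor) by the layer count, the number
// of KV heads, the per-head dimension, and the bytes per cached element.
// It returns 0 ("unknown") if any input is non-positive.
func kvBytesPerToken(layers, kvHeads, headDim, bytesPerElem int64) int64 {
	if layers <= 0 || kvHeads <= 0 || headDim <= 0 || bytesPerElem <= 0 {
		return 0
	}
	return 2 * layers * kvHeads * headDim * bytesPerElem
}

func main() {
	// Illustrative model shape: 32 layers, 8 KV heads, head dimension 128,
	// 16-bit KV cache (2 bytes per element).
	fmt.Println(kvBytesPerToken(32, 8, 128, 2)) // prints 131072 (128 KiB per token)
}
```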
func ParseBytes ¶ added in v0.32.5
ParseBytes parses byte strings used by modeld memory settings.
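The page does not show the signature or the accepted grammar. As a rough sketch of how strings such as "8GiB" or "512MiB" can be turned into byte counts, the hypothetical `parseBytes` below accepts a plain integer or an integer followed by a binary suffix. The supported suffix set and the error behaviour are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// units lists the binary suffixes this sketch understands (an assumption).
var units = []struct {
	suffix string
	mult   int64
}{
	{"KiB", 1 << 10},
	{"MiB", 1 << 20},
	{"GiB", 1 << 30},
	{"TiB", 1 << 40},
}

// parseBytes is a hypothetical stand-in for ParseBytes.
func parseBytes(s string) (int64, error) {
	num := strings.TrimSpace(s)
	mult := int64(1)
	for _, u := range units {
		if strings.HasSuffix(num, u.suffix) {
			num = strings.TrimSpace(strings.TrimSuffix(num, u.suffix))
			mult = u.mult
			break
		}
	}
	n, err := strconv.ParseInt(num, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse byte size %q: %w", s, err)
	}
	// Reject negatives and values that would overflow int64 once scaled.
	if n < 0 || n > math.MaxInt64/mult {
		return 0, fmt.Errorf("byte size %q out of range", s)
	}
	return n * mult, nil
}

func main() {
	for _, in := range []string{"8GiB", "512MiB", "4096", "lots"} {
		n, err := parseBytes(in)
		fmt.Println(in, n, err)
	}
}
```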
Types ¶
type DeviceSnapshot ¶ added in v0.32.5
type DeviceSnapshot struct {
Kind string `json:"kind,omitempty"`
DeviceID string `json:"device_id,omitempty"`
TotalBytes int64 `json:"total_bytes,omitempty"`
FreeBytes int64 `json:"free_bytes,omitempty"`
}
DeviceSnapshot describes the memory pool the backend will allocate from.
func Snapshot ¶ added in v0.32.5
func Snapshot(src MemorySource) (DeviceSnapshot, error)
Snapshot returns a DeviceSnapshot for either a richer source with Snapshot or a legacy FreeBytes-only source.
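One common Go pattern for supporting both kinds of source is an optional-interface check. The sketch below assumes hypothetical interface shapes (`memorySource` with a `FreeBytes` method, `snapshotter` with a `Snapshot` method), since the MemorySource definition is not shown on this page.

```go
package main

import "fmt"

// deviceSnapshot mirrors the documented DeviceSnapshot fields.
type deviceSnapshot struct {
	Kind       string
	DeviceID   string
	TotalBytes int64
	FreeBytes  int64
}

// memorySource is an ASSUMED shape for the legacy, FreeBytes-only source.
type memorySource interface {
	FreeBytes() (int64, error)
}

// snapshotter is a hypothetical optional interface for richer sources.
type snapshotter interface {
	Snapshot() (deviceSnapshot, error)
}

// snapshot prefers the richer method and otherwise wraps FreeBytes.
func snapshot(src memorySource) (deviceSnapshot, error) {
	if s, ok := src.(snapshotter); ok {
		return s.Snapshot()
	}
	free, err := src.FreeBytes()
	if err != nil {
		return deviceSnapshot{}, err
	}
	return deviceSnapshot{FreeBytes: free}, nil
}

// legacyRAM only knows how much memory is free.
type legacyRAM struct{ free int64 }

func (l legacyRAM) FreeBytes() (int64, error) { return l.free, nil }

// richGPU can describe the whole memory pool.
type richGPU struct{ snap deviceSnapshot }

func (g richGPU) FreeBytes() (int64, error)         { return g.snap.FreeBytes, nil }
func (g richGPU) Snapshot() (deviceSnapshot, error) { return g.snap, nil }

func main() {
	a, _ := snapshot(legacyRAM{free: 8 << 30})
	b, _ := snapshot(richGPU{snap: deviceSnapshot{
		Kind: "gpu", DeviceID: "0", TotalBytes: 24 << 30, FreeBytes: 20 << 30,
	}})
	fmt.Printf("%+v\n%+v\n", a, b)
}
```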
func (DeviceSnapshot) Key ¶ added in v0.32.5
func (d DeviceSnapshot) Key() string
Key identifies the memory pool for launch-default budgeting. Kind+ID is the normal path; total/shared are included so anonymous test or fallback sources still get stable separation when possible.
type LaunchDefaults ¶ added in v0.32.5
type LaunchDefaults struct {
// contains filtered or unexported fields
}
LaunchDefaults records the first observed free-memory snapshot per memory pool. It lets services apply the "80% of launch-free memory" default lazily for the actual selected device, while keeping an explicit MaxResidentBytes as a hard user cap.
func (*LaunchDefaults) Policy ¶ added in v0.32.5
func (d *LaunchDefaults) Policy(base Policy, st DeviceSnapshot) Policy
Policy returns base with a default MaxResidentBytes filled from the first snapshot seen for this device. It is intentionally sticky per memory pool: if memory later gets tighter, Resolve clamps on current FreeBytes; if memory gets freer, modeld does not grow past the launch budget.
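The sticky behaviour can be illustrated with a small map keyed by memory pool. This is a sketch under assumptions, not the package's implementation: `launchDefaults`, `memPolicy`, and `apply` are hypothetical names, the pool key is passed in as a plain string instead of being derived from a DeviceSnapshot, and the 0.8 factor mirrors DefaultMaxResidentFrac.

```go
package main

import (
	"fmt"
	"sync"
)

// defaultMaxResidentFrac mirrors the documented DefaultMaxResidentFrac.
const defaultMaxResidentFrac = 0.8

// memPolicy is a reduced, hypothetical version of Policy.
type memPolicy struct {
	MaxResidentBytes int64
}

// launchDefaults remembers the first free-memory reading per pool key.
type launchDefaults struct {
	mu         sync.Mutex
	launchFree map[string]int64
}

// apply fills MaxResidentBytes from the first reading seen for poolKey.
// An explicit (non-zero) user cap is returned unchanged.
func (d *launchDefaults) apply(base memPolicy, poolKey string, freeBytes int64) memPolicy {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.launchFree == nil {
		d.launchFree = make(map[string]int64)
	}
	first, seen := d.launchFree[poolKey]
	if !seen {
		first = freeBytes
		d.launchFree[poolKey] = freeBytes
	}
	if base.MaxResidentBytes > 0 {
		return base
	}
	base.MaxResidentBytes = int64(float64(first) * defaultMaxResidentFrac)
	return base
}

func main() {
	var d launchDefaults
	// First observation fixes the budget for this pool.
	fmt.Println(d.apply(memPolicy{}, "gpu:0", 10_000).MaxResidentBytes)
	// Later, more memory is free, but the default does not grow.
	fmt.Println(d.apply(memPolicy{}, "gpu:0", 20_000).MaxResidentBytes)
	// An explicit user cap always wins.
	fmt.Println(d.apply(memPolicy{MaxResidentBytes: 500}, "gpu:0", 20_000).MaxResidentBytes)
}
```

The default only sets an upper bound; a lower current FreeBytes is still enforced separately when the context window is resolved.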
type MemorySource ¶
MemorySource reports the free memory of the device a backend serves on. modeld picks the source by device: system RAM for CPU; GPU VRAM (ov::Core / ggml) is a CGO seam filled per backend when a GPU device is selected.
type ModelCapacity ¶
type ModelCapacity struct {
ModelMaxContext int
EffectiveContext int
KVBytesPerToken int64
FreeBytes int64
WeightsBytes int64
OverheadBytes int64
ReservedBytes int64
UserLimitBytes int64
MinFreeBytes int64
UsableBytes int64
RequiredBytes int64
Clamped bool
Reason string
}
ModelCapacity is the resolved result reported to the runtime. EffectiveContext is the window modeld will actually serve and the value the cache identity (manifest context_size) must use; the rest explain how it was derived.
func Resolve ¶
func Resolve(p Params) ModelCapacity
Resolve computes the physical hot context window:
    usable    = min(free - minFree, userLimit - reserved) * (1 - headroom)
    effective = clamp(request, 0, min(modelMax, (usable - weights - overhead) / kvBytesPerToken))
Unknown inputs degrade gracefully: with no KV cost it falls back to the model ceiling (clamped by request); with no ceiling it uses the memory budget.
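The formula and its fallbacks can be traced with a small standalone sketch. It is not the package's Resolve: `params` and `resolveContext` are hypothetical local names, only the effective window is returned, negative budgets are clamped to zero, and returning the raw request when neither bound is known is an assumption.

```go
package main

import "fmt"

// params is a local copy of the documented Params fields.
type params struct {
	ModelMaxCtx     int
	KVBytesPerToken int64
	WeightsBytes    int64
	OverheadBytes   int64
	FreeBytes       int64
	ReservedBytes   int64
	UserLimitBytes  int64
	MinFreeBytes    int64
	Request         int
	HeadroomFrac    float64
}

// resolveContext is a hypothetical sketch of the documented formula.
func resolveContext(p params) int {
	headroom := p.HeadroomFrac
	if headroom <= 0 || headroom >= 1 {
		headroom = 0.1 // documented DefaultHeadroomFrac
	}

	// usable = min(free - minFree, userLimit - reserved) * (1 - headroom)
	budget := p.FreeBytes - p.MinFreeBytes
	if p.UserLimitBytes > 0 { // 0 means "no user cap"
		if left := p.UserLimitBytes - p.ReservedBytes; left < budget {
			budget = left
		}
	}
	if budget < 0 {
		budget = 0
	}
	usable := int64(float64(budget) * (1 - headroom))

	limit := -1 // -1 means "no known bound yet"
	if p.ModelMaxCtx > 0 {
		limit = p.ModelMaxCtx
	}
	if p.KVBytesPerToken > 0 {
		fit := (usable - p.WeightsBytes - p.OverheadBytes) / p.KVBytesPerToken
		if fit < 0 {
			fit = 0
		}
		if limit < 0 || fit < int64(limit) {
			limit = int(fit)
		}
	}
	if limit < 0 {
		return p.Request // assumption: nothing known, pass the request through
	}
	if p.Request > 0 && p.Request < limit {
		return p.Request
	}
	return limit
}

func main() {
	p := params{
		ModelMaxCtx:     131072,
		KVBytesPerToken: 131072, // 128 KiB per token
		WeightsBytes:    4 << 30,
		OverheadBytes:   512 << 20,
		FreeBytes:       16 << 30,
		MinFreeBytes:    2 << 30,
	}
	fmt.Println("memory-bound window:", resolveContext(p))
	p.Request = 8192
	fmt.Println("with a smaller request:", resolveContext(p))
}
```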
type Params ¶
type Params struct {
ModelMaxCtx int // model's trained context ceiling (0 = unknown)
KVBytesPerToken int64 // 0 = unknown (cannot budget by memory)
WeightsBytes int64 // resident model weight footprint
OverheadBytes int64 // fixed runtime buffers (compute graph, staging)
FreeBytes int64 // device free memory
ReservedBytes int64 // memory already reserved by resident sessions
UserLimitBytes int64 // user cap for modeld resident memory (0 = no cap)
MinFreeBytes int64 // memory to leave free for the desktop/other workloads
Request int // requested window (0 = use the resolved max)
HeadroomFrac float64 // <=0 or >=1 falls back to DefaultHeadroomFrac
}
Params are the inputs to a capacity resolution. Zero values mean "unknown": an unknown ModelMaxCtx or KVBytesPerToken disables that side of the clamp rather than producing a bogus window.
type Policy ¶ added in v0.32.5
type Policy struct {
MaxResidentBytes int64 `json:"max_resident_bytes,omitempty"`
MinFreeBytes int64 `json:"min_free_bytes,omitempty"`
HeadroomFrac float64 `json:"headroom_frac,omitempty"`
}
Policy is the user/operator memory policy modeld applies before opening a resident session. MaxResidentBytes is a hard ceiling on modeld's resident footprint for the served device; MinFreeBytes preserves memory for the desktop or other local workloads that may share the same device.
func LoadPolicy ¶ added in v0.32.5
LoadPolicy reads <dataRoot>/modeld.json and then applies env overrides. The JSON accepts either numeric byte fields or string fields ("8GiB", "512MiB"):
{"memory":{"max_resident":"8GiB","reserve_free":"2GiB","headroom_frac":0.15}}
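Accepting either a number or a string for the same JSON field is usually done with a custom `json.Unmarshaler`. The sketch below decodes the example above under that approach. The `byteSize` and `policyFile` types and the `decodePolicy` helper are hypothetical, only the MiB and GiB suffixes are handled, and environment overrides are left out because their variable names are not listed here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// byteSize is a hypothetical type that decodes from a JSON number
// (bytes) or from a string such as "8GiB".
type byteSize int64

func (b *byteSize) UnmarshalJSON(data []byte) error {
	// Try the numeric form first.
	var n int64
	if err := json.Unmarshal(data, &n); err == nil {
		*b = byteSize(n)
		return nil
	}
	// Otherwise expect a string with an optional binary suffix.
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return fmt.Errorf("byte size must be a number or a string: %w", err)
	}
	mult := int64(1)
	switch {
	case strings.HasSuffix(s, "GiB"):
		mult, s = 1<<30, strings.TrimSuffix(s, "GiB")
	case strings.HasSuffix(s, "MiB"):
		mult, s = 1<<20, strings.TrimSuffix(s, "MiB")
	}
	v, err := strconv.ParseInt(strings.TrimSpace(s), 10, 64)
	if err != nil {
		return fmt.Errorf("invalid byte size: %w", err)
	}
	*b = byteSize(v * mult)
	return nil
}

// policyFile models only the keys shown in the documented example.
type policyFile struct {
	Memory struct {
		MaxResident  byteSize `json:"max_resident"`
		ReserveFree  byteSize `json:"reserve_free"`
		HeadroomFrac float64  `json:"headroom_frac"`
	} `json:"memory"`
}

// decodePolicy parses the file contents; reading from disk is omitted.
func decodePolicy(data []byte) (policyFile, error) {
	var f policyFile
	err := json.Unmarshal(data, &f)
	return f, err
}

func main() {
	doc := `{"memory":{"max_resident":"8GiB","reserve_free":"2GiB","headroom_frac":0.15}}`
	f, err := decodePolicy([]byte(doc))
	fmt.Println(f.Memory.MaxResident, f.Memory.ReserveFree, f.Memory.HeadroomFrac, err)
}
```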
func WithLaunchDefaults ¶ added in v0.32.5
func WithLaunchDefaults(p Policy, launch DeviceSnapshot) Policy
WithLaunchDefaults fills missing policy values from the launch-time device snapshot. The default resident cap is intentionally a fixed ceiling derived from launch-time free memory, not a moving target: if memory later gets tighter, the current FreeBytes in Resolve clamps lower; if memory later frees up, modeld does not opportunistically consume more than the launch budget.