Documentation ¶
Overview ¶
Package transport defines the contract the modeld daemon implements and the runtime calls: a persistent, manifest-keyed warm-reuse inference session.
The boundary is the backend-neutral session seam already validated on llama.cpp and OpenVINO — EnsurePrefix / PrefillSuffix / Decode. The runtime owns this contract; modeld implements it per backend.
An earlier draft put a lower token-level Evaluate/Generate boundary here. Stress-checking it against the real llama.Session and OpenVINO GenAISession showed both backends sit at this higher, manifest-keyed altitude: OpenVINO GenAI holds the tokenizer and chat template internally and caches a string prefix, so a token-only daemon could not honor its proven prefix reuse. See docs/blueprints/modeld-interface-boundary.md.
Index ¶
- Variables
- type ActiveModel
- type Config
- type ContextManifest
- type ContextReport
- type DaemonStatus
- type DecodeConfig
- type DeviceInfo
- type EmbedRequest
- type EmbedResult
- type Fence
- type LoadModelRequest
- type MemoryService
- type ModelController
- type ModelInfo
- type OpenSessionRequest
- type Option
- type PrefixInput
- type PrefixStatus
- type Service
- type Session
- type SessionSnapshot
- type SlotState
- type StreamChunk
- type StructuredOutputConfig
- type SuffixInput
- type SuffixStatus
- type ToolCall
- type UnloadModelRequest
Constants ¶
This section is empty.
Variables ¶
var (
	ErrNotOwner            = errors.New("instance is not the local runtime owner")
	ErrStaleFence          = errors.New("stale owner fence token")
	ErrSessionClosed       = errors.New("session is closed")
	ErrContextOverflow     = errors.New("exceeded the session context window")
	ErrModelBusy           = errors.New("modeld active model slot is busy")
	ErrModelNotActive      = errors.New("requested model is not active in modeld")
	ErrModelSwitchRequired = errors.New("modeld active model slot must be switched")
	ErrModelLoadFailed     = errors.New("modeld failed to load model")
	ErrInsufficientMemory  = errors.New("insufficient memory for requested model")
	ErrSlotGenerationStale = errors.New("stale modeld slot generation")
	// ErrBackendMismatch means the requested model Type is not the backend this
	// daemon serves (e.g. a llama model requested from an openvino-mode modeld).
	ErrBackendMismatch = errors.New("model type not served by this modeld backend")
	// ErrUnsupportedFeature means the backend is healthy but does not implement
	// the requested product surface for this model or backend mode.
	ErrUnsupportedFeature = errors.New("unsupported transport feature")
)
Canonical errors expected to cross the boundary.
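Because these are sentinel values, a caller would typically match them with errors.Is rather than by comparing message strings. The sketch below assumes the transport preserves sentinel identity across the boundary, which this page does not state; wrapOp and recoveryFor are hypothetical helpers, and the two sentinels are redeclared locally only so the example compiles on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// Local copies of two sentinels from the list above; real callers would use
// the package's own variables.
var (
	ErrStaleFence      = errors.New("stale owner fence token")
	ErrContextOverflow = errors.New("exceeded the session context window")
)

// wrapOp adds call-site context while keeping the sentinel matchable via %w.
func wrapOp(op string, err error) error {
	return fmt.Errorf("%s: %w", op, err)
}

// recoveryFor is a hypothetical policy table, not part of the package.
func recoveryFor(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrStaleFence):
		return "reopen-session"
	case errors.Is(err, ErrContextOverflow):
		return "trim-context"
	default:
		return "fail"
	}
}

func main() {
	err := wrapOp("ensure prefix", ErrContextOverflow)
	fmt.Println(err, "->", recoveryFor(err))
}
```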
Functions ¶
This section is empty.
Types ¶
type ActiveModel ¶ added in v0.32.5
type ActiveModel struct {
ModelName string `json:"model_name,omitempty"`
Type string `json:"type,omitempty"`
Digest string `json:"digest,omitempty"`
Path string `json:"path,omitempty"`
Config Config `json:"config"`
Generation uint64 `json:"generation"`
}
ActiveModel describes the model identity and runtime config currently loaded in modeld's single active slot.
type Config ¶
type Config struct {
NumCtx int // context window in tokens
NumBatch int // prefill batch size
NumThreads int // CPU threads (0 = NumCPU)
NumGpuLayers int // layers offloaded to the GPU (0 = CPU only)
TensorSplit []float32 // multi-GPU split
FlashAttn bool
KVCacheType string // "", "q8_0", "q4_0"
PromptFormat string // profile-declared prompt format, e.g. "chatml" or "llama3"
PromptTemplateDigest string // digest of the declared/rendered prompt template
DisableBOS bool
ReasoningFormat string // backend-native reasoning format for chat-template parsing/rendering
}
Config is the explicit hardware/runtime configuration for a session. Every knob is a tested setting, not a magic default.
type ContextManifest ¶
type ContextManifest = contextasm.ContextManifest
ContextManifest is the shared, backend-neutral cache key: profile, model, tokenizer/template digests, BOS policy, and stable/volatile hashes. Reuse is valid only when the manifest matches; byte equality alone is not enough.
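To illustrate why byte equality is not enough, here is a minimal sketch of a manifest-keyed reuse check. The manifestKey struct and its field names are hypothetical stand-ins chosen from the description above; the real fields live in contextasm.ContextManifest and are not shown on this page.

```go
package main

import "fmt"

// manifestKey is a hypothetical stand-in for the cache key described above.
type manifestKey struct {
	Profile         string
	Model           string
	TokenizerDigest string
	TemplateDigest  string
	DisableBOS      bool
	StableHash      string // hash of the stable prefix bytes
}

// reusable reports whether resident state may serve the requested manifest.
// Every field must match; an identical StableHash alone does not qualify.
func reusable(resident, requested manifestKey) bool {
	return resident == requested
}

func main() {
	resident := manifestKey{
		Profile:         "coder",
		Model:           "qwen2.5-1.5b",
		TokenizerDigest: "sha256:tok-a",
		TemplateDigest:  "sha256:tmpl-a",
		StableHash:      "sha256:stable-1",
	}
	retemplated := resident
	retemplated.TemplateDigest = "sha256:tmpl-b" // same bytes, new template

	fmt.Println(reusable(resident, resident))    // same manifest
	fmt.Println(reusable(resident, retemplated)) // same text, different key
}
```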
type ContextReport ¶
type ContextReport struct {
ResidentTokens int
PrefixTokens int
NumCtx int
AvailableTokens int
StableByteHash string
StableTokenHash string
ManifestDigest string
Manifest ContextManifest
Closed bool
}
ContextReport explains the session's resident context (explain-context).
type DaemonStatus ¶ added in v0.32.5
type DaemonStatus struct {
OwnerInstanceID string `json:"owner_instance_id,omitempty"`
Backend string `json:"backend,omitempty"`
State SlotState `json:"state,omitempty"`
Active *ActiveModel `json:"active,omitempty"`
BusyOperation string `json:"busy_operation,omitempty"`
LastError string `json:"last_error,omitempty"`
}
DaemonStatus reports the owner-local modeld slot state. It is intentionally about resident compute state, not the offline installed-model library.
type DecodeConfig ¶
type DecodeConfig struct {
MaxTokens int
Temperature *float64
TopP *float64
TopK int
Seed *int
ParserProtocols []string
ReasoningFormat string
StructuredOutput StructuredOutputConfig
}
DecodeConfig controls a single decode pass.
type DeviceInfo ¶ added in v0.32.5
type EmbedRequest ¶ added in v0.32.5
type EmbedRequest struct {
Fence
ModelName string // logical model name, e.g. "bge-small-en"
Type string // backend type the model targets: "llama" | "openvino"
Digest string // content digest; part of the model identity
Path string // runtime-resolved filesystem location (GGUF file or IR dir)
Config Config
Text string
}
EmbedRequest asks the owner to compute a one-shot embedding for Text.
type EmbedResult ¶ added in v0.32.5
type EmbedResult struct {
Vector []float32 `json:"vector,omitempty"`
}
type Fence ¶
type Fence struct {
OwnerInstanceID string
}
Fence carries the owner identity a client expects to be serving it. It is supplied once, at OpenSession; the returned Session is bound to that owner epoch, so a takeover invalidates the session rather than every method needing a fence. It is a freshness check, not an authentication secret.
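The epoch binding can be pictured with a small sketch. The daemon and boundSession types are hypothetical, and returning ErrStaleFence from a session method after a takeover is an assumption; the page only says that a takeover invalidates the session.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrStaleFence = errors.New("stale owner fence token")

type Fence struct{ OwnerInstanceID string }

// daemon is a hypothetical owner; current changes when another instance
// takes over.
type daemon struct{ current string }

// boundSession remembers the owner epoch it was opened under.
type boundSession struct {
	d     *daemon
	epoch string
}

// open checks the fence once and binds the session to the current epoch.
func (d *daemon) open(f Fence) (*boundSession, error) {
	if f.OwnerInstanceID != d.current {
		return nil, ErrStaleFence
	}
	return &boundSession{d: d, epoch: d.current}, nil
}

// use stands in for any session method: it takes no fence of its own.
func (s *boundSession) use() error {
	if s.d.current != s.epoch {
		return ErrStaleFence // assumption: the page does not name this error
	}
	return nil
}

func main() {
	d := &daemon{current: "owner-a"}
	s, _ := d.open(Fence{OwnerInstanceID: "owner-a"})
	fmt.Println(s.use()) // still the same epoch
	d.current = "owner-b" // takeover
	fmt.Println(s.use()) // the bound session is now invalid
}
```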
type LoadModelRequest ¶ added in v0.32.5
type LoadModelRequest struct {
Fence
ModelName string
Type string
Digest string
Path string
Config Config
// ExpectedGeneration, when non-zero, makes load/switch conditional on the
// caller's view of the active slot.
ExpectedGeneration uint64
}
LoadModelRequest explicitly activates modeld's single local model slot. A different active model may be switched only when the slot has no open session holder and is not busy.
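A conditional load behaves much like a compare-and-swap on the slot generation. The following sketch assumes the generation increments on every successful load or switch and that a held or busy slot answers with ErrModelBusy; neither detail is confirmed here, and the slot type is hypothetical. Model identity is reduced to a name for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrModelBusy           = errors.New("modeld active model slot is busy")
	ErrSlotGenerationStale = errors.New("stale modeld slot generation")
)

// slot is a hypothetical model of modeld's single active slot.
type slot struct {
	model      string
	generation uint64
	holders    int // open sessions holding the slot
	busy       bool
}

// load activates model, optionally conditional on expectedGeneration.
func (s *slot) load(model string, expectedGeneration uint64) (uint64, error) {
	if expectedGeneration != 0 && expectedGeneration != s.generation {
		return s.generation, ErrSlotGenerationStale
	}
	if s.model == model {
		return s.generation, nil // already active, nothing to switch
	}
	if s.model != "" && (s.holders > 0 || s.busy) {
		return s.generation, ErrModelBusy // assumption: error choice
	}
	s.model = model
	s.generation++ // assumption: every load or switch bumps the generation
	return s.generation, nil
}

func main() {
	s := &slot{}
	fmt.Println(s.load("qwen2.5-1.5b", 0)) // unconditional first load
	fmt.Println(s.load("bge-small-en", 7)) // caller's view is out of date
	fmt.Println(s.load("bge-small-en", 1)) // conditional switch succeeds
}
```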
type MemoryService ¶
type MemoryService struct {
// contains filtered or unexported fields
}
MemoryService is an in-process, in-memory Service. It does no real inference: it models the warm-reuse contract so the runtime wrapper can be built and tested against the boundary before any CGO backend exists. Reuse is keyed on the manifest (a changed stable segment OR a changed profile/template/runtime digest invalidates the resident prefix), and token counts are byte-length proxies. See docs/blueprints/modeld-interface-boundary.md.
It is safe for concurrent use.
func NewMemoryService ¶
func NewMemoryService(opts ...Option) *MemoryService
NewMemoryService returns an in-memory Service.
func (*MemoryService) Describe ¶ added in v0.32.3
func (m *MemoryService) Describe(_ context.Context, req OpenSessionRequest) (ModelInfo, error)
Describe reports the requested context window back; the in-memory service has no real model to inspect, so it echoes Config.NumCtx (0 when unset).
func (*MemoryService) Embed ¶ added in v0.32.5
func (m *MemoryService) Embed(_ context.Context, req EmbedRequest) (EmbedResult, error)
Embed is intentionally unsupported by the memory service; it only models the warm session contract.
func (*MemoryService) OpenSession ¶
func (m *MemoryService) OpenSession(_ context.Context, req OpenSessionRequest) (Session, error)
OpenSession binds a session to the owner epoch (the fence) and the requested context window.
type ModelController ¶ added in v0.32.5
type ModelController interface {
Status(ctx context.Context) (DaemonStatus, error)
LoadModel(ctx context.Context, req LoadModelRequest) (ActiveModel, error)
UnloadModel(ctx context.Context, req UnloadModelRequest) error
}
ModelController is implemented by modeld services that expose explicit single-slot control. Service remains the compute contract; this interface is the daemon lifecycle/control extension.
type ModelInfo ¶ added in v0.32.3
type ModelInfo struct {
ModelMaxContext int `json:"model_max_context"`
EffectiveContext int `json:"effective_context"`
KVBytesPerToken int64 `json:"kv_bytes_per_token,omitempty"`
FreeBytes int64 `json:"free_bytes,omitempty"`
WeightsBytes int64 `json:"weights_bytes,omitempty"`
OverheadBytes int64 `json:"overhead_bytes,omitempty"`
ReservedBytes int64 `json:"reserved_bytes,omitempty"`
UserLimitBytes int64 `json:"user_limit_bytes,omitempty"`
MinFreeBytes int64 `json:"min_free_bytes,omitempty"`
UsableBytes int64 `json:"usable_bytes,omitempty"`
RequiredBytes int64 `json:"required_bytes,omitempty"`
Clamped bool `json:"clamped,omitempty"`
// Reason explains why EffectiveContext was lower than the requested or model
// dense context. It is telemetry/debug text, not a stable API enum yet.
Reason string `json:"reason,omitempty"`
// DeviceKind/DeviceID identify the memory pool modeld used for the capacity
// decision. Physical hot context is separate from future planner-level
// effective context, which may exceed the model's dense trained window.
DeviceKind string `json:"device_kind,omitempty"`
DeviceID string `json:"device_id,omitempty"`
DeviceTotalBytes int64 `json:"device_total_bytes,omitempty"`
// RequestedGpuLayers is what the profile/env asked for. ResolvedGpuLayers is
// what modeld will actually open after applying the device memory budget.
RequestedGpuLayers int `json:"requested_gpu_layers,omitempty"`
ResolvedGpuLayers int `json:"resolved_gpu_layers,omitempty"`
// Runtime identity and device inventory explain which native runtime modeld
// actually linked and what memory pools it can allocate from.
RuntimeName string `json:"runtime_name,omitempty"`
RuntimeDigest string `json:"runtime_digest,omitempty"`
RuntimeSystemInfo string `json:"runtime_system_info,omitempty"`
SupportsGPUOffload bool `json:"supports_gpu_offload,omitempty"`
Devices []DeviceInfo `json:"devices,omitempty"`
}
ModelInfo is what the daemon reports about a model: capabilities resolved from the model metadata AND the device's memory by the backend adapter — never guessed by the runtime. The runtime is the consumer (capabilities, cache identity); it does not parse model files or probe hardware itself.
EffectiveContext is the window modeld will actually serve on this device — min(model ceiling, what fits in free memory) — and is the value the runtime uses for NumCtx, display, and the cache-identity manifest. ModelMaxContext and the byte fields explain how it was derived (telemetry / explain-context).
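As a worked example of the min(model ceiling, what fits in free memory) rule, the sketch below derives a window from a byte budget. The accounting (usable minus weights minus overhead, divided by KV bytes per token) is an assumption built from the field names above, not the daemon's confirmed formula, and effectiveContext is a hypothetical function.

```go
package main

import "fmt"

// effectiveContext returns the smaller of the model ceiling and the number of
// tokens whose KV cache fits in the remaining byte budget. The budget formula
// is an illustrative assumption.
func effectiveContext(modelMax int, usableBytes, weightsBytes, overheadBytes, kvBytesPerToken int64) (effective int, clamped bool) {
	if kvBytesPerToken <= 0 {
		return modelMax, false // nothing to size against
	}
	budget := usableBytes - weightsBytes - overheadBytes
	if budget < 0 {
		budget = 0
	}
	fit := budget / kvBytesPerToken
	if fit < int64(modelMax) {
		return int(fit), true
	}
	return modelMax, false
}

func main() {
	const mib = int64(1) << 20
	// 8192 MiB usable, 4096 MiB weights, 512 MiB overhead, 128 KiB per token:
	// (8192-4096-512) MiB / 128 KiB = 3584 * 8 = 28672 tokens.
	eff, clamped := effectiveContext(32768, 8192*mib, 4096*mib, 512*mib, 128<<10)
	fmt.Println(eff, clamped) // 28672 true
}
```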
type OpenSessionRequest ¶
type OpenSessionRequest struct {
Fence
ModelName string // logical model name, e.g. "qwen2.5-1.5b"
Type string // backend type the model targets: "llama" | "openvino"
Digest string // content digest; part of the cache identity
Path string // runtime-resolved filesystem location (GGUF file or IR dir)
Config Config
}
OpenSessionRequest asks the owner to open a session for a model. The model is identified by a typed handle, not an opaque path: ModelName + Type + Digest is the cache identity, and Type lets the daemon reject a model it does not serve (see ErrBackendMismatch) instead of failing deep in the engine. Path is the runtime-resolved on-disk location the daemon loads from — a hint, not identity.
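The split between identity and location can be sketched as follows. The handle type, cacheIdentity, and admit are hypothetical helpers; only the rule that ModelName + Type + Digest forms the identity, and that a Type mismatch yields ErrBackendMismatch, comes from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrBackendMismatch = errors.New("model type not served by this modeld backend")

// handle mirrors the typed model fields of OpenSessionRequest.
type handle struct {
	ModelName, Type, Digest, Path string
}

// cacheIdentity deliberately ignores Path: moving a model on disk must not
// change its identity. The NUL separator is an arbitrary choice.
func cacheIdentity(h handle) string {
	return h.ModelName + "\x00" + h.Type + "\x00" + h.Digest
}

// admit rejects a model early when the daemon serves a different backend.
func admit(servedType string, h handle) error {
	if h.Type != servedType {
		return ErrBackendMismatch
	}
	return nil
}

func main() {
	a := handle{"qwen2.5-1.5b", "llama", "sha256:abc", "/models/a.gguf"}
	b := a
	b.Path = "/mnt/cache/a.gguf" // relocated, same model

	fmt.Println(cacheIdentity(a) == cacheIdentity(b))
	fmt.Println(admit("openvino", a))
}
```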
type Option ¶
type Option func(*MemoryService)
Option configures a MemoryService.
func WithOwnerFence ¶
WithOwnerFence makes OpenSession reject a request whose Fence does not match ownerInstanceID with ErrStaleFence. With no fence configured (the default), the fence is ignored, keeping the unwired placeholder path simple.
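Option follows the functional-options pattern. The sketch below reproduces that pattern with lower-case stand-in types; it assumes WithOwnerFence takes the owner instance ID as a string, which the missing signature on this page does not confirm.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrStaleFence = errors.New("stale owner fence token")

// service and option are stand-ins for MemoryService and Option.
type service struct{ ownerFence string }

type option func(*service)

// withOwnerFence is a stand-in for WithOwnerFence (signature assumed).
func withOwnerFence(ownerInstanceID string) option {
	return func(s *service) { s.ownerFence = ownerInstanceID }
}

func newService(opts ...option) *service {
	s := &service{}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// open ignores the fence unless one was configured.
func (s *service) open(fenceID string) error {
	if s.ownerFence != "" && fenceID != s.ownerFence {
		return ErrStaleFence
	}
	return nil
}

func main() {
	unfenced := newService()
	fenced := newService(withOwnerFence("owner-a"))

	fmt.Println(unfenced.open("anything")) // fence ignored by default
	fmt.Println(fenced.open("owner-b"))    // mismatch is rejected
}
```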
type PrefixInput ¶
type PrefixInput struct {
Text string
Manifest ContextManifest
// Tools is a JSON array of tool definitions to render into the prompt via the
// model's own GGUF chat template (model-native tool calls). "" means no tools.
// The daemon renders it; the runtime never sees the model's tool format.
Tools string `json:",omitempty"`
}
PrefixInput is the stable prefix text plus the manifest that makes reuse valid: tokenizer, template, runtime config, BOS policy, and model identity are part of the cache key, not just the text.
type PrefixStatus ¶
type PrefixStatus struct {
ReusedTokens int
PrefilledTokens int
DroppedTokens int
PrefixTokens int
ResidentTokens int
AvailableTokens int
StableByteHash string
StableTokenHash string
ManifestDigest string
}
PrefixStatus reports what EnsurePrefix reused versus had to (re)compute. ReusedTokens > 0 is a warm hit.
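One plausible way to arrive at these counters is a longest-common-prefix comparison between the resident tokens and the requested prefix. The sketch assumes DroppedTokens counts resident tokens discarded past the divergence point; the page does not define that field, and ensurePrefix here is a hypothetical free function over token IDs.

```go
package main

import "fmt"

// status carries the subset of PrefixStatus counters this sketch computes.
type status struct {
	Reused, Prefilled, Dropped, Prefix int
}

// ensurePrefix keeps the longest matching head of resident, drops the rest,
// and prefills only the divergent tail of want.
func ensurePrefix(resident, want []int) ([]int, status) {
	n := 0
	for n < len(resident) && n < len(want) && resident[n] == want[n] {
		n++
	}
	st := status{
		Reused:    n,
		Prefilled: len(want) - n,
		Dropped:   len(resident) - n, // assumption about the field's meaning
		Prefix:    len(want),
	}
	next := append(append([]int{}, resident[:n]...), want[n:]...)
	return next, st
}

func main() {
	resident := []int{1, 2, 3, 4, 5}
	next, st := ensurePrefix(resident, []int{1, 2, 3, 9})
	fmt.Println(next, st) // [1 2 3 9] {3 1 2 4}
}
```

A cold session (no resident tokens) reports Reused of 0 and prefills the whole prefix.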
type Service ¶
type Service interface {
OpenSession(ctx context.Context, req OpenSessionRequest) (Session, error)
// Describe reports a model's capabilities from its on-disk metadata. The
// daemon is the authority because it owns the model format and hardware;
// Config carries the requested context/runtime knobs for capacity planning.
Describe(ctx context.Context, req OpenSessionRequest) (ModelInfo, error)
// Embed computes a one-shot embedding for a model. It uses the same typed
// handle and owner fence as Describe, but does not create a persistent decode
// session because embedding pipelines do not participate in KV reuse.
Embed(ctx context.Context, req EmbedRequest) (EmbedResult, error)
}
Service is the entry point modeld serves: it opens persistent sessions on the owned hardware, and reports model capabilities it reads from the model itself. Opening is where the model is made resident and the session is bound to the owner epoch.
type Session ¶
type Session interface {
// EnsurePrefix makes the resident KV equal `prefix`, reusing the longest
// already-resident matching prefix and prefilling only the divergent tail
// (this also drops any previous suffix and generated tokens).
EnsurePrefix(ctx context.Context, prefix PrefixInput) (PrefixStatus, error)
// PrefillSuffix prefills the volatile suffix (diff / test output / user
// turn) after the stable prefix, leaving the stable KV untouched.
PrefillSuffix(ctx context.Context, suffix SuffixInput) (SuffixStatus, error)
// Decode streams generated text from the current resident state.
Decode(ctx context.Context, cfg DecodeConfig) (<-chan StreamChunk, error)
// ExplainContext reports the resident context for observability.
ExplainContext() ContextReport
// Snapshot captures backend state for durability/branching. State is opaque
// backend data; the manifest and bookkeeping fields are the compatibility
// gate needed before Restore may trust it.
Snapshot(ctx context.Context) (SessionSnapshot, error)
// Restore replaces the resident session state from a compatible snapshot.
Restore(ctx context.Context, snap SessionSnapshot) error
// Close releases the session's resources.
Close() error
}
Session is a persistent, workspace-scoped inference session. The hot coding loop is EnsurePrefix -> PrefillSuffix -> Decode: keep the stable prefix's KV hot, re-prefill only the changed suffix, decode.
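The hot loop can be exercised against a toy session. Everything below is a stand-in: fakeSession uses simplified method signatures, counts bytes instead of tokens, and echoes the suffix words as its "generation"; chunk is a hypothetical shape for StreamChunk, whose fields are not shown on this page.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// chunk is a hypothetical stand-in for StreamChunk.
type chunk struct {
	Text string
	Err  error
}

type fakeSession struct{ prefix, suffix string }

// EnsurePrefix returns how many leading bytes were already resident and
// drops any previous suffix.
func (s *fakeSession) EnsurePrefix(_ context.Context, text string) int {
	n := 0
	for n < len(s.prefix) && n < len(text) && s.prefix[n] == text[n] {
		n++
	}
	s.prefix, s.suffix = text, ""
	return n
}

func (s *fakeSession) PrefillSuffix(_ context.Context, text string) { s.suffix = text }

// Decode streams the suffix words back, then closes the channel.
func (s *fakeSession) Decode(ctx context.Context, maxTokens int) <-chan chunk {
	words := strings.Fields(s.suffix)
	out := make(chan chunk)
	go func() {
		defer close(out)
		for i, w := range words {
			if i >= maxTokens {
				return
			}
			select {
			case out <- chunk{Text: w}:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}

// runTurn is one pass of EnsurePrefix -> PrefillSuffix -> Decode.
func runTurn(ctx context.Context, s *fakeSession, stable, volatile string) (int, string, error) {
	reused := s.EnsurePrefix(ctx, stable)
	s.PrefillSuffix(ctx, volatile)
	var words []string
	for c := range s.Decode(ctx, 16) {
		if c.Err != nil {
			return reused, "", c.Err
		}
		words = append(words, c.Text)
	}
	return reused, strings.Join(words, " "), nil
}

// demo runs a cold turn and a warm turn over the same stable prefix.
func demo() (coldReused, warmReused int, out string) {
	ctx := context.Background()
	s := &fakeSession{}
	coldReused, _, _ = runTurn(ctx, s, "SYSTEM rules", "fix the bug")
	warmReused, out, _ = runTurn(ctx, s, "SYSTEM rules", "add a test")
	return
}

func main() {
	fmt.Println(demo()) // 0 12 add a test
}
```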
type SessionSnapshot ¶ added in v0.32.5
type SessionSnapshot struct {
State []byte `json:"state,omitempty"`
ResidentTokens int `json:"resident_tokens,omitempty"`
PrefixTokens int `json:"prefix_tokens,omitempty"`
NumCtx int `json:"num_ctx,omitempty"`
ResidentTokenIDs []int `json:"resident_token_ids,omitempty"`
StableText string `json:"stable_text,omitempty"`
PrefixText string `json:"prefix_text,omitempty"`
Tools string `json:"tools,omitempty"`
Manifest ContextManifest `json:"manifest"`
}
type SlotState ¶ added in v0.32.5
type SlotState string
SlotState is the daemon-visible lifecycle state of the single active local model slot. The empty string is treated as SlotEmpty by older callers.
const (
	SlotEmpty        SlotState = "Empty"
	SlotLoading      SlotState = "Loading"
	SlotReady        SlotState = "Ready"
	SlotBusy         SlotState = "Busy"
	SlotSwitching    SlotState = "Switching"
	SlotUnloading    SlotState = "Unloading"
	SlotFailed       SlotState = "Failed"
	SlotShuttingDown SlotState = "ShuttingDown"
	SlotLostOwner    SlotState = "LostOwner"
)
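Since DaemonStatus tags State with omitempty, a status payload may arrive without a state field at all. A caller could normalize the empty string as sketched here; normalize and decodeState are hypothetical helpers, and the status struct mirrors only two DaemonStatus fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type SlotState string

const (
	SlotEmpty SlotState = "Empty"
	SlotReady SlotState = "Ready"
)

// status mirrors two fields of DaemonStatus with the same JSON tags.
type status struct {
	Backend string    `json:"backend,omitempty"`
	State   SlotState `json:"state,omitempty"`
}

// normalize maps the empty string to SlotEmpty, as the doc comment describes.
func normalize(s SlotState) SlotState {
	if s == "" {
		return SlotEmpty
	}
	return s
}

// decodeState parses a status payload and returns its normalized state.
func decodeState(payload string) (SlotState, error) {
	var st status
	if err := json.Unmarshal([]byte(payload), &st); err != nil {
		return "", err
	}
	return normalize(st.State), nil
}

func main() {
	fmt.Println(decodeState(`{"backend":"llama"}`))
	fmt.Println(decodeState(`{"backend":"llama","state":"Ready"}`))
}
```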
type StreamChunk ¶
StreamChunk is a decoded text delta, parsed model output, or a terminal error.
type StructuredOutputConfig ¶ added in v0.32.5
StructuredOutputConfig carries a backend-typed structured-output request for decode calls that need the native backend to constrain generation.
type SuffixInput ¶
type SuffixInput struct {
Text string
Manifest ContextManifest
// EnableThinking controls model-owned chat-template rendering for the
// assistant generation prompt when a backend supports it. nil means backend
// default.
EnableThinking *bool `json:",omitempty"`
}
SuffixInput is the volatile text appended after the stable prefix. It carries the same manifest so a suffix cannot be prefilled against resident KV from a different profile/template/runtime.
type SuffixStatus ¶
type SuffixStatus struct {
SuffixTokens int
PrefixTokens int
ResidentTokens int
AvailableTokens int
ManifestDigest string
}
SuffixStatus reports the volatile suffix added after the stable prefix.
type ToolCall ¶ added in v0.32.5
type ToolCall struct {
ID string `json:"id,omitempty"`
Type string `json:"type"`
Function struct {
Name string `json:"name"`
Arguments string `json:"arguments"`
} `json:"function"`
}
ToolCall is a backend-neutral parsed function call emitted by a model.
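A consumer would usually decode a ToolCall in two steps: the envelope first, then the arguments. The sketch assumes Arguments holds a JSON object serialized as a string, a common convention that this page does not confirm; decodeCall is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ToolCall is copied from the declaration above.
type ToolCall struct {
	ID       string `json:"id,omitempty"`
	Type     string `json:"type"`
	Function struct {
		Name      string `json:"name"`
		Arguments string `json:"arguments"`
	} `json:"function"`
}

// decodeCall parses the envelope, then the arguments string as a JSON object
// (assumption: Arguments is JSON text).
func decodeCall(raw string) (string, map[string]any, error) {
	var tc ToolCall
	if err := json.Unmarshal([]byte(raw), &tc); err != nil {
		return "", nil, err
	}
	args := map[string]any{}
	if err := json.Unmarshal([]byte(tc.Function.Arguments), &args); err != nil {
		return tc.Function.Name, nil, err
	}
	return tc.Function.Name, args, nil
}

func main() {
	raw := `{"id":"call_1","type":"function","function":{"name":"read_file","arguments":"{\"path\":\"main.go\"}"}}`
	name, args, err := decodeCall(raw)
	fmt.Println(name, args["path"], err)
}
```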
type UnloadModelRequest ¶ added in v0.32.5
UnloadModelRequest explicitly releases modeld's active slot. It is idempotent when the slot is already empty unless ExpectedGeneration is set and stale.
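The idempotency rule can be sketched as below. The slot type is hypothetical, and the sketch assumes the generation counter survives an unload so that a stale ExpectedGeneration can still be detected on an empty slot.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrSlotGenerationStale = errors.New("stale modeld slot generation")

// slot is a hypothetical model of the active slot.
type slot struct {
	model      string
	generation uint64
}

// unload releases the slot. An already-empty slot is not an error unless the
// caller pinned a generation that no longer matches.
func (s *slot) unload(expectedGeneration uint64) error {
	if expectedGeneration != 0 && expectedGeneration != s.generation {
		return ErrSlotGenerationStale
	}
	s.model = "" // assumption: generation is left unchanged by unload
	return nil
}

func main() {
	s := &slot{model: "qwen2.5-1.5b", generation: 3}
	fmt.Println(s.unload(0)) // releases the slot
	fmt.Println(s.unload(0)) // already empty, still succeeds
	fmt.Println(s.unload(2)) // pinned to an old generation
}
```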