llama

package

Version: v0.32.0 (latest) | Published: Jun 17, 2026 | License: Apache-2.0 | Imports: 16 | Imported by: 0

Repository

github.com/contenox/runtime

Documentation ¶

Overview ¶

Package llama is the graduated local coding-node runtime: a persistent, workspace-scoped inference session that keeps the KV cache of a stable prefix hot and re-prefills only the changed suffix (the live warm-reuse hot path). It is distinct from the toy fixed-constant `local` provider.

This package defines the backend-neutral session contract. Backend adapters implement it: llama.cpp today (./llamasession), OpenVINO later. Product code talks to Session, never to llama.cpp or OpenVINO concepts. Snapshot/restore (durability, branching, crash recovery) is a separate, later concern; the hot coding loop is EnsurePrefix -> PrefillSuffix -> Decode on a live session.

Index ¶

Variables
func EmbedAvailable() bool
func HashTokenIDs(tokens []int) string
func NewContextOverflowError(stage string, resident, additional, numCtx int) error
func NewManifestMismatchError(reason string) error
func NewUnsupportedFeatureError(feature string) error
func SessionAvailable() bool
func SetEmbedFunc(f EmbedFunc)
func SetSessionFactory(f SessionFactory)
type Config
type ContextManifest
type ContextOverflowError
- func (e *ContextOverflowError) Error() string
- func (e *ContextOverflowError) Is(target error) bool
type ContextReport
type DecodeConfig
type EmbedFunc
type ManifestMismatchError
type ManifestSegment
type PrefixInput
type PrefixStatus
type Session
type SessionFactory
type StreamChunk
type SuffixInput
type SuffixStatus
type TokenizeFunc
type UnsupportedFeatureError
- func (e *UnsupportedFeatureError) Error() string
- func (e *UnsupportedFeatureError) Is(target error) bool

Variables ¶

var (
	// ErrSessionUnavailable means no native llama backend was compiled into
	// this binary.
	ErrSessionUnavailable = errors.New("llama: session backend unavailable")
	// ErrSessionClosed means the caller used a closed persistent session.
	ErrSessionClosed = errors.New("llama: session closed")
	// ErrContextOverflow means a prefix, suffix, or decode would exceed NumCtx.
	ErrContextOverflow = errors.New("llama: context overflow")
	// ErrUnsupportedFeature marks explicit product-surface gaps such as tools.
	ErrUnsupportedFeature = errors.New("llama: unsupported feature")
	// ErrSessionFatal means the backend marked the session unusable and callers
	// must evict it instead of trying to reuse resident KV.
	ErrSessionFatal = errors.New("llama: session fatal")
)

var ErrManifestMismatch = contextasm.ErrManifestMismatch

Functions ¶

func EmbedAvailable ¶

func EmbedAvailable() bool

EmbedAvailable reports whether an embedding backend is compiled into this build.

func HashTokenIDs ¶

func HashTokenIDs(tokens []int) string

func NewContextOverflowError ¶

func NewContextOverflowError(stage string, resident, additional, numCtx int) error

func NewManifestMismatchError ¶

func NewManifestMismatchError(reason string) error

func NewUnsupportedFeatureError ¶

func NewUnsupportedFeatureError(feature string) error

func SessionAvailable ¶

func SessionAvailable() bool

SessionAvailable reports whether a session backend is compiled into this build.

func SetEmbedFunc ¶

func SetEmbedFunc(f EmbedFunc)

SetEmbedFunc registers the native embedding backend.

func SetSessionFactory ¶

func SetSessionFactory(f SessionFactory)

SetSessionFactory registers the backend that creates sessions. The llama.cpp adapter (./llamasession) calls this from its init when built with the 'llamanode' tag, so the provider never imports the CGo package directly (no import cycle, default build stays CGo-free).

Types ¶

type Config ¶

type Config struct {
	NumCtx       int       // context window in tokens
	NumBatch     int       // prefill batch size
	NumThreads   int       // CPU threads (0 = NumCPU)
	NumGpuLayers int       // layers offloaded to GPU (0 = CPU only)
	TensorSplit  []float32 // multi-GPU split
	FlashAttn    bool
	KVCacheType  string // "", "q8_0", "q4_0"

	PromptFormat         string // profile-declared prompt format, e.g. "chatml" or "llama3"
	PromptTemplateDigest string // digest of the declared/rendered prompt template
	DisableBOS           bool   // false means tokenize the stable prefix with backend BOS handling
}

Config is the explicit runtime configuration for a local session. The toy constants (4096 ctx, 512 batch, Flash Attention off, 0 GPU layers, fresh context per call) are gone: every knob is an explicit, tested setting rather than a magic default.
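
As an illustration of setting every knob explicitly, the sketch below fills in a local copy of some Config fields and runs a pre-flight check. The values are arbitrary examples, and `validate` is a hypothetical helper: it accepts only the KVCacheType strings listed in the field comment, which may be narrower or wider than what a real backend allows.

```go
package main

import "fmt"

// config mirrors a subset of the documented llama.Config fields.
type config struct {
	NumCtx       int
	NumBatch     int
	NumThreads   int
	NumGpuLayers int
	FlashAttn    bool
	KVCacheType  string
	PromptFormat string
}

// validate is a hypothetical pre-flight check, not part of the package.
func validate(c config) error {
	if c.NumCtx <= 0 {
		return fmt.Errorf("NumCtx must be positive, got %d", c.NumCtx)
	}
	if c.NumBatch <= 0 || c.NumBatch > c.NumCtx {
		return fmt.Errorf("NumBatch must be in 1..%d, got %d", c.NumCtx, c.NumBatch)
	}
	switch c.KVCacheType {
	case "", "q8_0", "q4_0": // values named in the Config field comment
	default:
		return fmt.Errorf("unknown KVCacheType %q", c.KVCacheType)
	}
	return nil
}

func main() {
	cfg := config{
		NumCtx:       16384,
		NumBatch:     1024,
		NumThreads:   0, // 0 = NumCPU, per the field comment
		NumGpuLayers: 32,
		FlashAttn:    true,
		KVCacheType:  "q8_0",
		PromptFormat: "chatml",
	}
	if err := validate(cfg); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```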

type ContextManifest ¶

type ContextManifest = contextasm.ContextManifest

type ContextOverflowError ¶

type ContextOverflowError struct {
	Stage            string
	ResidentTokens   int
	AdditionalTokens int
	NumCtx           int
}

ContextOverflowError carries token counts for an overflow at a specific primitive boundary.

func (*ContextOverflowError) Error ¶

func (e *ContextOverflowError) Error() string

func (*ContextOverflowError) Is ¶

func (e *ContextOverflowError) Is(target error) bool

type ContextReport ¶

type ContextReport struct {
	ResidentTokens  int
	PrefixTokens    int
	NumCtx          int
	AvailableTokens int
	StableByteHash  string
	StableTokenHash string
	ManifestDigest  string
	Manifest        ContextManifest
	Closed          bool
}

ContextReport explains the session's resident context (explain-context).

type DecodeConfig ¶

type DecodeConfig struct {
	MaxTokens   int
	Temperature *float64
	TopP        *float64
	TopK        int
	Seed        *int
}

DecodeConfig controls a single decode pass.
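
Temperature, TopP, and Seed are pointers, which presumably lets a caller distinguish "unset" from an explicit zero such as temperature 0. A minimal sketch on a local copy of the struct follows; `ptr` and `describe` are hypothetical helpers, and "backend default" is an assumed reading of what a nil field means.

```go
package main

import "fmt"

// decodeConfig mirrors the documented llama.DecodeConfig fields.
type decodeConfig struct {
	MaxTokens   int
	Temperature *float64
	TopP        *float64
	TopK        int
	Seed        *int
}

// ptr returns a pointer to v, handy for filling optional fields inline.
func ptr[T any](v T) *T { return &v }

// describe shows how nil and explicit-zero temperatures stay distinguishable.
func describe(c decodeConfig) string {
	temp := "backend default"
	if c.Temperature != nil {
		temp = fmt.Sprintf("%.2f", *c.Temperature)
	}
	return fmt.Sprintf("max=%d temperature=%s", c.MaxTokens, temp)
}

func main() {
	unset := decodeConfig{MaxTokens: 256}
	greedy := decodeConfig{MaxTokens: 256, Temperature: ptr(0.0), Seed: ptr(42)}

	fmt.Println(describe(unset))
	fmt.Println(describe(greedy))
}
```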

type EmbedFunc ¶

type EmbedFunc func(ctx context.Context, modelPath string, cfg Config, input string) ([]float64, error)

EmbedFunc computes a single embedding via the native backend. The llama.cpp adapter registers one from its init when built with the 'llamanode' tag.

type ManifestMismatchError ¶

type ManifestMismatchError = contextasm.ManifestMismatchError

type ManifestSegment ¶

type ManifestSegment = contextasm.ManifestSegment

type PrefixInput ¶

type PrefixInput struct {
	Text     string
	Manifest ContextManifest
}

PrefixInput is the stable prefix text plus the profile/runtime manifest that makes reuse valid. Byte equality alone is not enough: tokenizer, template, runtime config, BOS policy, and model identity are part of the cache key.

type PrefixStatus ¶

type PrefixStatus struct {
	ReusedTokens    int // tokens kept warm from the resident prefix
	PrefilledTokens int // divergent tail that had to be (re)prefilled
	DroppedTokens   int // old resident tokens removed before prefill
	PrefixTokens    int // total prefix tokens now resident
	ResidentTokens  int // total resident tokens after EnsurePrefix
	AvailableTokens int // remaining context capacity
	StableByteHash  string
	StableTokenHash string
	ManifestDigest  string
}

PrefixStatus reports what EnsurePrefix reused versus had to (re)compute. This is the live-reuse signal: ReusedTokens > 0 means a warm hit.

type Session ¶

type Session interface {
	// EnsurePrefix makes the resident KV equal `prefix`, reusing the longest
	// already-resident matching token prefix and prefilling only the divergent
	// tail (this also drops any previous suffix and generated tokens).
	EnsurePrefix(ctx context.Context, prefix PrefixInput) (PrefixStatus, error)

	// PrefillSuffix prefills the volatile suffix (diff / test output / user turn)
	// after the stable prefix, leaving the stable KV untouched.
	PrefillSuffix(ctx context.Context, suffix SuffixInput) (SuffixStatus, error)

	// Decode streams generated text from the current resident state.
	Decode(ctx context.Context, cfg DecodeConfig) (<-chan StreamChunk, error)

	// ExplainContext reports the resident context for observability.
	ExplainContext() ContextReport

	// Close releases the session's resources.
	Close() error
}

Session is a persistent, workspace-scoped inference session.

The hot coding loop is: keep the stable prefix's KV hot, prefill only the changed suffix, decode. EnsurePrefix does token-level longest-common-prefix reuse, so an unchanged stable workspace context stays warm across turns and only the divergent tail is recomputed.

type SessionFactory ¶

type SessionFactory func(modelPath string, cfg Config) (Session, error)

SessionFactory creates a backend session for a model with explicit config.

type StreamChunk ¶

type StreamChunk struct {
	Text  string
	Error error
}

StreamChunk is a decoded text delta or a terminal error.

type SuffixInput ¶

type SuffixInput struct {
	Text     string
	Manifest ContextManifest
}

SuffixInput is the volatile text appended after the stable prefix. It carries the same manifest so direct Session callers cannot accidentally prefill a suffix against resident KV from a different profile/template/runtime.

type SuffixStatus ¶

type SuffixStatus struct {
	SuffixTokens    int
	PrefixTokens    int
	ResidentTokens  int
	AvailableTokens int
	ManifestDigest  string
}

SuffixStatus reports the volatile suffix that was added after the stable prefix. It is the measurement point for suffix-growth TTFT curves.

type TokenizeFunc ¶

type TokenizeFunc = contextasm.TokenizeFunc

type UnsupportedFeatureError ¶

type UnsupportedFeatureError struct {
	Feature string
}

UnsupportedFeatureError describes a deliberately unsupported surface.

func (*UnsupportedFeatureError) Error ¶

func (e *UnsupportedFeatureError) Error() string

func (*UnsupportedFeatureError) Is ¶

func (e *UnsupportedFeatureError) Is(target error) bool

Directories ¶

Path	Synopsis
llamasession
