inference

package v1.0.10
Published: Dec 17, 2025 License: Apache-2.0 Imports: 6 Imported by: 2

Documentation

Constants

const ExperimentalEndpointsPrefix = "/exp/vDD4.40"

ExperimentalEndpointsPrefix is used to prefix all InferencePrefix routes on the Docker socket while they are still in their experimental stage. This prefix doesn't apply to endpoints on model-runner.docker.internal.

const (
	// OriginOllamaCompletion indicates the request came from the Ollama /api/chat or /api/generate endpoints
	OriginOllamaCompletion = "ollama/completion"
)

Valid origin values for the RequestOriginHeader.

const RequestOriginHeader = "X-Request-Origin"

RequestOriginHeader is the HTTP header used to track the origin of inference requests. This header is set internally by proxy handlers (e.g., Ollama compatibility layer) to provide more granular tracking of model usage by source.
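
For illustration, a proxy handler might tag a forwarded request as in the sketch below; the import path is an assumption based on the module layout.

package example

import (
	"net/http"

	"github.com/docker/model-runner/pkg/inference" // assumed import path
)

// tagOrigin marks a proxied request so that downstream handlers can attribute
// model usage to the Ollama compatibility layer.
func tagOrigin(req *http.Request) {
	req.Header.Set(inference.RequestOriginHeader, inference.OriginOllamaCompletion)
}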

Variables

var InferencePrefix = "/engines"

InferencePrefix is the prefix for inference related routes.

var ModelsPrefix = "/models"

ModelsPrefix is the prefix for all model manager related routes.
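
As an illustrative sketch, the prefixes compose with ExperimentalEndpointsPrefix for routes served on the Docker socket; the concrete route segments and the import path are assumptions.

package main

import (
	"fmt"
	"path"

	"github.com/docker/model-runner/pkg/inference" // assumed import path
)

func main() {
	// Docker-socket routes carry the experimental prefix; routes on
	// model-runner.docker.internal do not. The trailing segments are
	// illustrative only.
	dockerSocketRoute := path.Join(inference.ExperimentalEndpointsPrefix, inference.InferencePrefix, "v1", "chat", "completions")
	internalRoute := path.Join(inference.ModelsPrefix, "create")
	fmt.Println(dockerSocketRoute) // /exp/vDD4.40/engines/v1/chat/completions
	fmt.Println(internalRoute)     // /models/create
}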

Functions

func ValidateRuntimeFlags added in v1.0.10

func ValidateRuntimeFlags(flags []string) error

ValidateRuntimeFlags ensures runtime flags don't contain paths (forward slash "/" or backslash "\") to prevent malicious users from overwriting host files via arguments like --log-file /some/path, --output-file /etc/passwd, or --log-file C:\Windows\file.

This validation rejects any flag or value containing "/" or "\" to block:

  - Unix/Linux/macOS absolute paths: /var/log/file, /etc/passwd
  - Unix/Linux/macOS relative paths: ../file.txt, ./config
  - Windows absolute paths: C:\Users\file, D:\data\file
  - Windows relative paths: ..\file.txt, .\config
  - UNC paths: \\network\share\file

Returns an error if any flag contains a forward slash or backslash.
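
A usage sketch, assuming the import path shown in the example:

package main

import (
	"fmt"

	"github.com/docker/model-runner/pkg/inference" // assumed import path
)

func main() {
	// Flags without path separators pass validation.
	if err := inference.ValidateRuntimeFlags([]string{"--threads", "4"}); err != nil {
		fmt.Println("unexpected:", err)
	}

	// Anything containing "/" or "\" is rejected.
	if err := inference.ValidateRuntimeFlags([]string{"--log-file", "/etc/passwd"}); err != nil {
		fmt.Println("rejected:", err)
	}
}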

Types

type Backend

type Backend interface {
	// Name returns the backend name. It must be all lowercase and usable as a
	// path component in an HTTP request path and a Unix domain socket path. It
	// should also be suitable for presenting to users (at least in logs). The
	// package providing the backend implementation should also expose a
	// constant called Name which matches the value returned by this method.
	Name() string
	// UsesExternalModelManagement should return true if the backend uses an
	// external model management system and false if the backend uses the shared
	// model manager.
	UsesExternalModelManagement() bool
	// Install ensures that the backend is installed. It should return a nil
	// error if installation succeeds or if the backend is already installed.
	// The provided HTTP client should be used for any HTTP operations.
	Install(ctx context.Context, httpClient *http.Client) error
	// Run runs an OpenAI API web server on the specified Unix domain socket
	// for the specified model using the backend. It should start any
	// process(es) necessary for the backend to function for the model. It
	// should not return until either the process(es) fail or the provided
	// context is cancelled. By the time Run returns, any process(es) it has
	// spawned must terminate.
	//
	// Backend implementations should be "one-shot" (i.e. returning from Run
	// after the failure of an underlying process). Backends should not attempt
	// to perform restarts on failure. Backends should only return a nil error
	// in the case of context cancellation, otherwise they should return the
	// error that caused them to fail.
	//
	// Run will be provided with the path to a Unix domain socket on which the
	// backend should listen for incoming OpenAI API requests and a model name
	// to be loaded. Backends should not load multiple models at once and should
	// instead load only the specified model. Backends should still respond to
	// OpenAI API requests for other models with a 421 error code.
	Run(ctx context.Context, socket, model string, modelRef string, mode BackendMode, config *BackendConfiguration) error
	// Status returns a description of the backend's state.
	Status() string
	// GetDiskUsage returns the disk usage of the backend.
	GetDiskUsage() (int64, error)
	// GetRequiredMemoryForModel returns the required working memory for a given
	// model.
	GetRequiredMemoryForModel(ctx context.Context, model string, config *BackendConfiguration) (RequiredMemory, error)
}

Backend is the interface implemented by inference engine backends. Backend implementations need not be safe for concurrent invocation of the following methods, though their underlying server implementations do need to support concurrent API requests.
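
The following skeleton shows the shape of a conforming implementation. It is a sketch only: the mybackend package and its no-op method bodies are hypothetical, and the import path is assumed; only the interface shape comes from this package.

package mybackend

import (
	"context"
	"errors"
	"net/http"

	"github.com/docker/model-runner/pkg/inference" // assumed import path
)

// Name is exposed as a constant, matching the convention described in the
// Backend.Name documentation.
const Name = "mybackend"

type backend struct{}

// New returns a hypothetical backend implementation.
func New() inference.Backend { return &backend{} }

func (b *backend) Name() string                      { return Name }
func (b *backend) UsesExternalModelManagement() bool { return false }

func (b *backend) Install(ctx context.Context, httpClient *http.Client) error {
	// Download or verify the engine here, using httpClient for any HTTP
	// operations. Return nil if installation succeeds or is already done.
	return nil
}

func (b *backend) Run(ctx context.Context, socket, model, modelRef string, mode inference.BackendMode, config *inference.BackendConfiguration) error {
	// Start the engine process listening on the Unix socket, load only the
	// requested model, and block until the process exits or ctx is cancelled.
	<-ctx.Done()
	return nil // nil only on context cancellation; otherwise return the failure.
}

func (b *backend) Status() string { return "not running" }

func (b *backend) GetDiskUsage() (int64, error) { return 0, nil }

func (b *backend) GetRequiredMemoryForModel(ctx context.Context, model string, config *inference.BackendConfiguration) (inference.RequiredMemory, error) {
	return inference.RequiredMemory{}, errors.New("not implemented")
}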

type BackendConfiguration

type BackendConfiguration struct {
	// Shared configuration across all backends
	ContextSize  *int32                     `json:"context-size,omitempty"`
	RuntimeFlags []string                   `json:"runtime-flags,omitempty"`
	Speculative  *SpeculativeDecodingConfig `json:"speculative,omitempty"`

	// Backend-specific configuration
	VLLM     *VLLMConfig     `json:"vllm,omitempty"`
	LlamaCpp *LlamaCppConfig `json:"llamacpp,omitempty"`
}
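
Given these field tags, a configuration serializes roughly as in the following sketch; the values and the import path are illustrative assumptions.

package main

import (
	"encoding/json"
	"fmt"

	"github.com/docker/model-runner/pkg/inference" // assumed import path
)

func main() {
	ctx := int32(8192)
	budget := int32(1024)
	cfg := inference.BackendConfiguration{
		ContextSize:  &ctx,
		RuntimeFlags: []string{"--threads", "8"},
		LlamaCpp:     &inference.LlamaCppConfig{ReasoningBudget: &budget},
	}
	out, _ := json.MarshalIndent(cfg, "", "  ")
	fmt.Println(string(out))
	// Prints JSON similar to:
	// {
	//   "context-size": 8192,
	//   "runtime-flags": ["--threads", "8"],
	//   "llamacpp": {"reasoning-budget": 1024}
	// }
}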

type BackendMode

type BackendMode uint8

BackendMode encodes the mode in which a backend should operate.

const (
	// BackendModeCompletion indicates that the backend should run in chat
	// completion mode.
	BackendModeCompletion BackendMode = iota
	// BackendModeEmbedding indicates that the backend should run in embedding
	// mode.
	BackendModeEmbedding
	// BackendModeReranking indicates that the backend should run in reranking
	// mode.
	BackendModeReranking
)

func ParseBackendMode added in v1.0.8

func ParseBackendMode(mode string) (BackendMode, bool)

ParseBackendMode converts a string mode to BackendMode. It returns the parsed mode and a boolean indicating if the mode was known. For unknown modes, it returns BackendModeCompletion and false.
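
A usage sketch; the accepted spelling "embedding" and the import path are assumptions.

package main

import (
	"fmt"

	"github.com/docker/model-runner/pkg/inference" // assumed import path
)

func main() {
	// Unknown strings fall back to BackendModeCompletion with ok == false,
	// as documented above.
	mode, ok := inference.ParseBackendMode("embedding")
	fmt.Println(mode.String(), ok)
}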

func (BackendMode) MarshalJSON added in v1.0.10

func (m BackendMode) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler for BackendMode.

func (BackendMode) String

func (m BackendMode) String() string

String implements Stringer.String for BackendMode.

func (*BackendMode) UnmarshalJSON added in v1.0.10

func (m *BackendMode) UnmarshalJSON(data []byte) error

UnmarshalJSON implements json.Unmarshaler for BackendMode.
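
Together with MarshalJSON, this lets BackendMode values round-trip through JSON; a minimal sketch, with the import path assumed.

package main

import (
	"encoding/json"
	"fmt"

	"github.com/docker/model-runner/pkg/inference" // assumed import path
)

func main() {
	data, err := json.Marshal(inference.BackendModeEmbedding) // uses MarshalJSON
	if err != nil {
		panic(err)
	}
	var m inference.BackendMode
	if err := json.Unmarshal(data, &m); err != nil { // uses UnmarshalJSON
		panic(err)
	}
	fmt.Println(m.String())
}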

type ErrGGUFParse

type ErrGGUFParse struct {
	Err error
}

func (*ErrGGUFParse) Error

func (e *ErrGGUFParse) Error() string

type HFOverrides added in v1.0.7

type HFOverrides map[string]interface{}

HFOverrides contains HuggingFace model configuration overrides. Uses map[string]interface{} for flexibility, with validation to prevent injection attacks. This matches vLLM's --hf-overrides which accepts "a JSON string parsed into a dictionary".

func (HFOverrides) Validate added in v1.0.7

func (h HFOverrides) Validate() error

Validate ensures all keys and values in HFOverrides are safe. Keys must be alphanumeric with underscores only (no special characters that could be exploited). Values can be primitives (string, bool, number), arrays, or nested objects. Nested objects have their keys validated recursively.
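
A usage sketch; the override keys shown are illustrative HuggingFace configuration fields, not values this package prescribes, and the import path is assumed.

package main

import (
	"fmt"

	"github.com/docker/model-runner/pkg/inference" // assumed import path
)

func main() {
	// Alphanumeric/underscore keys with primitive and nested-object values
	// satisfy the documented validation rules.
	overrides := inference.HFOverrides{
		"rope_scaling": map[string]interface{}{
			"factor":    2.0,
			"rope_type": "linear",
		},
	}
	if err := overrides.Validate(); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("ok")
}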

type LlamaCppConfig added in v1.0.7

type LlamaCppConfig struct {
	// ReasoningBudget sets the reasoning budget for reasoning models.
	// Maps to llama.cpp's --reasoning-budget flag.
	ReasoningBudget *int32 `json:"reasoning-budget,omitempty"`
}

LlamaCppConfig contains llama.cpp-specific configuration options.

type RequiredMemory

type RequiredMemory struct {
	RAM  uint64
	VRAM uint64 // TODO(p1-0tr): for now assume we are working with single GPU set-ups
}
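
One plausible use, sketched below, is a preflight check against available resources; the canFit helper is hypothetical, and the import path is assumed.

package example

import (
	"context"
	"fmt"

	"github.com/docker/model-runner/pkg/inference" // assumed import path
)

// canFit is a hypothetical preflight check: it compares a backend's memory
// estimate for a model against the host's available RAM and VRAM.
func canFit(ctx context.Context, b inference.Backend, model string, availableRAM, availableVRAM uint64) (bool, error) {
	req, err := b.GetRequiredMemoryForModel(ctx, model, nil)
	if err != nil {
		return false, fmt.Errorf("estimating memory for %s: %w", model, err)
	}
	return req.RAM <= availableRAM && req.VRAM <= availableVRAM, nil
}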

type SpeculativeDecodingConfig

type SpeculativeDecodingConfig struct {
	DraftModel        string  `json:"draft_model,omitempty"`
	NumTokens         int     `json:"num_tokens,omitempty"`
	MinAcceptanceRate float64 `json:"min_acceptance_rate,omitempty"`
}

type VLLMConfig added in v1.0.7

type VLLMConfig struct {
	// HFOverrides contains HuggingFace model configuration overrides.
	// This maps to vLLM's --hf-overrides flag which accepts a JSON dictionary.
	HFOverrides HFOverrides `json:"hf-overrides,omitempty"`
	// GPUMemoryUtilization sets the fraction of GPU memory to be used for the model executor.
	// Must be between 0.0 and 1.0. If not specified, vLLM uses its default value of 0.9.
	// This maps to vLLM's --gpu-memory-utilization flag.
	GPUMemoryUtilization *float64 `json:"gpu-memory-utilization,omitempty"`
}

VLLMConfig contains vLLM-specific configuration options.
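
For instance, a configuration capping vLLM at 80% of GPU memory could be sketched as follows; the 0.8 value is illustrative and the import path is assumed.

package example

import "github.com/docker/model-runner/pkg/inference" // assumed import path

// vllmConfig builds a BackendConfiguration that caps vLLM's GPU memory use.
// The 0.8 value is illustrative; vLLM defaults to 0.9 when unset.
func vllmConfig() *inference.BackendConfiguration {
	util := 0.8
	return &inference.BackendConfiguration{
		VLLM: &inference.VLLMConfig{
			GPUMemoryUtilization: &util,
		},
	}
}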

Directories

Path Synopsis
mlx
