package scheduling

v1.0.4
Published: Nov 20, 2025 License: Apache-2.0

Documentation

Constants

This section is empty.

Variables

var ErrBackendNotFound = errors.New("backend not found")

ErrBackendNotFound indicates that an unknown backend was requested. When returned while handling an HTTP request, it should be paired with a 404 (Not Found) response status.
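
A minimal handler sketch of that pairing (the useBackend helper is hypothetical; uses errors and net/http):

if err := useBackend(backendName); errors.Is(err, ErrBackendNotFound) {
	http.Error(w, err.Error(), http.StatusNotFound)
	return
}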

var RunnerSocketPath = func(slot int) (string, error) {
	return fmt.Sprintf("inference-runner-%d.sock", slot), nil
}

RunnerSocketPath determines the Unix domain socket path used to communicate with runners at the specified slot. It can be overridden during init().
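
For example, an embedding program could override the default during init(); the directory below is purely illustrative (uses fmt and path/filepath):

func init() {
	RunnerSocketPath = func(slot int) (string, error) {
		// Place runner sockets under a fixed, application-chosen directory.
		return filepath.Join("/run/inference", fmt.Sprintf("runner-%d.sock", slot)), nil
	}
}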

Functions

This section is empty.

Types

type BackendStatus

type BackendStatus struct {
	// BackendName is the name of the backend
	BackendName string `json:"backend_name"`
	// ModelName is the name of the model loaded in the backend
	ModelName string `json:"model_name"`
	// Mode is the mode the backend is operating in
	Mode string `json:"mode"`
	// LastUsed represents when this (backend, model, mode) tuple was last used
	LastUsed time.Time `json:"last_used,omitempty"`
	// InUse indicates whether this backend is currently handling a request
	InUse bool `json:"in_use,omitempty"`
}

BackendStatus represents information about a running backend.
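
An illustrative status entry and the JSON it marshals to (the field values are examples, not defined by the package; uses encoding/json and time):

status := BackendStatus{
	BackendName: "llama.cpp",
	ModelName:   "ai/smollm2",
	Mode:        "completion",
	LastUsed:    time.Now(),
	InUse:       true,
}
out, _ := json.Marshal(status)
// {"backend_name":"llama.cpp","model_name":"ai/smollm2","mode":"completion",...}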

type ConfigureRequest

type ConfigureRequest struct {
	Model           string                               `json:"model"`
	ContextSize     int64                                `json:"context-size,omitempty"`
	RuntimeFlags    []string                             `json:"runtime-flags,omitempty"`
	RawRuntimeFlags string                               `json:"raw-runtime-flags,omitempty"`
	Speculative     *inference.SpeculativeDecodingConfig `json:"speculative,omitempty"`
}

ConfigureRequest specifies per-model runtime configuration options.
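
A client-side sketch of sending a configuration (the endpoint path is a placeholder, not defined by this package; uses bytes, encoding/json, and net/http):

body, _ := json.Marshal(ConfigureRequest{
	Model:       "ai/smollm2", // illustrative model reference
	ContextSize: 8192,
})
resp, err := http.Post("http://localhost/configure", "application/json", bytes.NewReader(body))

Note that the wire format uses hyphenated keys (context-size, runtime-flags) rather than the underscored keys used elsewhere in this package.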

type DiskUsage

type DiskUsage struct {
	ModelsDiskUsage         int64 `json:"models_disk_usage"`
	DefaultBackendDiskUsage int64 `json:"default_backend_disk_usage"`
}

DiskUsage represents the disk usage of the models and default backend.
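
A sketch of fetching and decoding the disk-usage report (the endpoint path is a placeholder; the values are presumably byte counts):

resp, err := http.Get("http://localhost/disk-usage")
if err != nil {
	return err
}
defer resp.Body.Close()

var usage DiskUsage
if err := json.NewDecoder(resp.Body).Decode(&usage); err != nil {
	return err
}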

type OpenAIErrorResponse

type OpenAIErrorResponse struct {
	Type           string  `json:"type"` // always "error"
	Code           *string `json:"code"`
	Message        string  `json:"message"`
	Param          *string `json:"param"`
	SequenceNumber int     `json:"sequence_number"`
}

OpenAIErrorResponse is used to format an OpenAI API compatible error response (see https://platform.openai.com/docs/api-reference/responses-streaming/error)
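
A server-side sketch of emitting such an error (the code and message values are illustrative; uses encoding/json and net/http):

code := "model_not_found"
errResp := OpenAIErrorResponse{
	Type:    "error",
	Code:    &code,
	Message: "the requested model is not available",
}
w.WriteHeader(http.StatusNotFound)
json.NewEncoder(w).Encode(errResp)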

type OpenAIInferenceRequest

type OpenAIInferenceRequest struct {
	// Model is the requested model name.
	Model string `json:"model"`
}

OpenAIInferenceRequest is used to extract the model specification from either a chat completion or embedding request in the OpenAI API.
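
Because the struct declares only the model field, it can be unmarshaled against a full chat-completion or embedding payload to recover just the routing information:

var req OpenAIInferenceRequest
if err := json.Unmarshal(body, &req); err != nil {
	http.Error(w, "invalid request body", http.StatusBadRequest)
	return
}
// req.Model names the requested model; all other request fields are ignored.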

type Scheduler

type Scheduler struct {
	// contains filtered or unexported fields
}

Scheduler is used to coordinate inference scheduling across multiple backends and models.

func NewScheduler

func NewScheduler(
	log logging.Logger,
	backends map[string]inference.Backend,
	defaultBackend inference.Backend,
	modelManager *models.Manager,
	httpClient *http.Client,
	allowedOrigins []string,
	tracker *metrics.Tracker,
	sysMemInfo memory.SystemMemoryInfo,
) *Scheduler

NewScheduler creates a new inference scheduler.
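
A wiring sketch; logger, llamaCpp, modelManager, tracker, and sysMemInfo are assumed to be provided by the embedding application:

sched := scheduling.NewScheduler(
	logger,
	map[string]inference.Backend{"llama.cpp": llamaCpp},
	llamaCpp, // also used as the default backend
	modelManager,
	http.DefaultClient,
	[]string{"http://localhost:8080"}, // allowed origins
	tracker,
	sysMemInfo,
)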

func (*Scheduler) Configure

func (s *Scheduler) Configure(w http.ResponseWriter, r *http.Request)

func (*Scheduler) GetAllActiveRunners

func (s *Scheduler) GetAllActiveRunners() []metrics.ActiveRunner

GetAllActiveRunners returns information about all active runners.

func (*Scheduler) GetBackendStatus

func (s *Scheduler) GetBackendStatus(w http.ResponseWriter, r *http.Request)

func (*Scheduler) GetDiskUsage

func (s *Scheduler) GetDiskUsage(w http.ResponseWriter, _ *http.Request)

func (*Scheduler) GetLlamaCppSocket

func (s *Scheduler) GetLlamaCppSocket() (string, error)

GetLlamaCppSocket returns the Unix socket path for an active llama.cpp runner.
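
A sketch of dialing the returned socket (uses net):

sock, err := sched.GetLlamaCppSocket()
if err != nil {
	return err
}
conn, err := net.Dial("unix", sock)
if err != nil {
	return err
}
defer conn.Close()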

func (*Scheduler) GetRunningBackends

func (s *Scheduler) GetRunningBackends(w http.ResponseWriter, r *http.Request)

GetRunningBackends returns information about all running backends.

func (*Scheduler) RebuildRoutes

func (s *Scheduler) RebuildRoutes(allowedOrigins []string)

func (*Scheduler) ResetInstaller

func (s *Scheduler) ResetInstaller(httpClient *http.Client)

func (*Scheduler) Run

func (s *Scheduler) Run(ctx context.Context) error

Run is the scheduler's main run loop. By the time it returns, all inference backends will have been unloaded from memory.
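
A typical lifecycle sketch: run the loop in a goroutine and shut it down by cancelling the context (uses context):

ctx, cancel := context.WithCancel(context.Background())
errs := make(chan error, 1)
go func() { errs <- sched.Run(ctx) }()

// ... serve traffic ...

cancel()
if err := <-errs; err != nil {
	// Once Run returns, all backends have been unloaded.
	return err
}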

func (*Scheduler) ServeHTTP

func (s *Scheduler) ServeHTTP(w http.ResponseWriter, r *http.Request)

ServeHTTP implements net/http.Handler.ServeHTTP.
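
Because Scheduler implements http.Handler, it can back a server directly (the address is illustrative; uses log and net/http):

srv := &http.Server{Addr: "localhost:8080", Handler: sched}
log.Fatal(srv.ListenAndServe())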

func (*Scheduler) Unload

func (s *Scheduler) Unload(w http.ResponseWriter, r *http.Request)

Unload unloads the specified (backend, model) runners from memory. Currently, this doesn't work for runners that are handling an OpenAI request.

type UnloadRequest

type UnloadRequest struct {
	All     bool     `json:"all"`
	Backend string   `json:"backend"`
	Models  []string `json:"models"`
}

UnloadRequest is used to specify which models to unload.
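
A client-side sketch of issuing the request (the endpoint path is a placeholder; the backend and model names are illustrative):

body, _ := json.Marshal(UnloadRequest{
	Backend: "llama.cpp",
	Models:  []string{"ai/smollm2"},
})
resp, err := http.Post("http://localhost/unload", "application/json", bytes.NewReader(body))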

type UnloadResponse

type UnloadResponse struct {
	UnloadedRunners int `json:"unloaded_runners"`
}

UnloadResponse is used to return the number of unloaded (backend, model) runners.
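
Continuing the sketch above, the response decodes as:

var result UnloadResponse
if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
	return err
}
// result.UnloadedRunners reports how many (backend, model) runners were stopped.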
