routingalgorithms

package
v0.4.0
Published: Aug 5, 2025 License: Apache-2.0 Imports: 26 Imported by: 0

README

Routing Algorithms

Prefix Cache Aware

Below is the pseudo-code for prefix-cache aware routing.

func prefix_cache_routing(input_tokens []int, ready_pods []*v1.Pod) {
    if check_load_imbalance(ready_pods) {
        target_pod = select_pod_with_least_running_requests(ready_pods)
    } else {
        match_pods, prefix_hashes = match_prefix(input_tokens, ready_pods)
        if len(match_pods) > 0 {
            target_pod = select_least_loaded_match_pod(match_pods, ready_pods)
        }
    }

    // if no target pod is selected, fall back to the pod with the fewest running requests
    if target_pod == nil {
        target_pod = select_pod_with_least_running_requests(ready_pods)
    }
    return target_pod
}

func check_load_imbalance(ready_pods) {
    // find the pods with the min and max number of running requests
    min_pod = select_pod_min_running_requests(ready_pods)
    max_pod = select_pod_max_running_requests(ready_pods)

    // if the difference between the max & min running request counts exceeds the
    // configurable threshold ABS_RUNNING_REQUEST_COUNT
    // (see AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT below),
    // the load is imbalanced
    if max_pod.running_requests - min_pod.running_requests > ABS_RUNNING_REQUEST_COUNT {
        return true
    }
    return false
}

func match_prefix(input_tokens, ready_pods) {
    // input_tokens are split into blocks based on the configurable block size and
    // a hash is calculated for each token_block
    hashes = calculate_hashes(input_tokens)

    // check whether each token_block already exists on a ready pod (prefix match);
    // if present, calculate pod_name: prefix_match_percent
    match_pods_with_prefix_match_percent = check_hashes_on_ready_pods(hashes, ready_pods)

    return match_pods_with_prefix_match_percent, hashes
}

func select_least_loaded_match_pod(match_pods_with_prefix_match_percent, ready_pods) {
    mean = calculate_mean_running_request(ready_pods)
    std_dev = calculate_std_dev_running_request(ready_pods)

    // sort match_pods by decreasing prefix_match_percent and, for equal
    // prefix_match_percent, by increasing running_request count
    sort(match_pods_with_prefix_match_percent)

    // select the match pod with the highest prefix match whose
    // running_request count is below mean + load_factor*std_dev
    for _, pod := range match_pods_with_prefix_match_percent {
        if pod.running_request < mean + load_factor*std_dev {
            return pod
        }
    }
    return nil
}

// selects the pod with the fewest running requests, similar to the least-request routing algorithm
func select_pod_with_least_running_requests(ready_pods) {
    return select_pod_min_running_requests(ready_pods)
}
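
To make the match_prefix step above concrete, the sketch below splits a tokenized prompt into fixed-size blocks and computes a cumulative hash per block, so a block's hash depends on all preceding blocks. The hash function, helper name, and token encoding are illustrative assumptions, not the router's actual implementation.

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hashTokenBlocks is a hypothetical helper: it splits tokens into blocks of
// blockSize and returns one cumulative SHA-256 digest per complete block.
func hashTokenBlocks(tokens []int, blockSize int) [][32]byte {
	var hashes [][32]byte
	h := sha256.New()
	for i := 0; i+blockSize <= len(tokens); i += blockSize {
		for _, tok := range tokens[i : i+blockSize] {
			var buf [8]byte
			binary.LittleEndian.PutUint64(buf[:], uint64(tok))
			h.Write(buf[:]) // feed every token of the block into the running hash
		}
		var digest [32]byte
		copy(digest[:], h.Sum(nil)) // cumulative: covers this block and all earlier ones
		hashes = append(hashes, digest)
	}
	return hashes
}

func main() {
	tokens := make([]int, 300)                     // e.g. a 300-token prompt
	fmt.Println(len(hashTokenBlocks(tokens, 128))) // 2 complete 128-token blocks
}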

Configurations

  • AIBRIX_PREFIX_CACHE_TOKENIZER_TYPE

    The AIBrix gateway implements two tokenizers: character and tiktoken. The default tokenizer is character.

    Tokenizer Type | Details
    character      | splits input text into characters
    tiktoken       | open-source openai/tiktoken tokenizer
  • AIBRIX_PREFIX_CACHE_BLOCK_SIZE

    The tokenized input request is split into blocks, and the hash value of each block is cached for future matches. The block size (i.e. the number of tokens per block) determines how effective the prefix match will be. The default is the character tokenizer with a block size of 128 (tokens per block).

    Tokenizer Type | Block Size Recommendation
    character      | 128
    tiktoken       | 16
  • AIBRIX_PREFIX_CACHE_BLOCK_NUMBER

    Maximum number of prefix cache blocks. Default is 200000.

  • AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT

    Before evaluating the prefix cache match, the router checks whether running requests are imbalanced across pods. Imbalance is measured as the absolute difference between the max & min running request counts across pods. For example, if imbalance_abs_count = 16 and the running requests per pod are [p1: 1, p2: 2, p3: 20], the difference is 19, so the scenario is flagged as imbalanced. When flagged as imbalanced, the prefix match is ignored and the request is routed to the pod with the fewest running requests, which in the example above is pod p1. The default is 16 and should be adjusted based on GPU hardware & prompt length (see the sketch after this list).

  • AIBRIX_PREFIX_CACHE_STANDARD_DEVIATION_FACTOR

    After the prefix match is evaluated, the pods with a matching prefix cache are selected. The selected pods are re-evaluated to prevent a hotspot scenario where the bulk of prefix-matching requests is routed to the same pod. The check is:

      prefix_match_pod.running_requests <= mean + load_factor * standard_deviation

    load_factor determines the number of standard deviations allowed above the mean. The default is 2, as illustrated in the sketch after this list.
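
The sketch below illustrates both checks with the example numbers from this list: running requests of [1, 2, 20], an imbalance threshold of 16, and a load factor of 2. The helper names and the plain float math are illustrative assumptions, not the router's actual code.

package main

import (
	"fmt"
	"math"
)

// imbalanced reports whether max-min running requests exceeds the absolute
// threshold (AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT).
func imbalanced(running []float64, absCount float64) bool {
	min, max := running[0], running[0]
	for _, r := range running {
		min = math.Min(min, r)
		max = math.Max(max, r)
	}
	return max-min > absCount
}

// underHotspotLimit reports whether a prefix-matched pod is still acceptable:
// its running requests must not exceed mean + loadFactor*stdDev over all ready pods.
func underHotspotLimit(podRunning float64, running []float64, loadFactor float64) bool {
	var sum float64
	for _, r := range running {
		sum += r
	}
	mean := sum / float64(len(running))
	var variance float64
	for _, r := range running {
		variance += (r - mean) * (r - mean)
	}
	stdDev := math.Sqrt(variance / float64(len(running)))
	return podRunning <= mean+loadFactor*stdDev
}

func main() {
	running := []float64{1, 2, 20}                 // p1, p2, p3
	fmt.Println(imbalanced(running, 16))           // true: 20 - 1 = 19 > 16
	fmt.Println(underHotspotLimit(20, running, 2)) // true: 20 <= mean + 2*stddev ≈ 25.1
}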

Virtual Token Counter (VTC)

The Virtual Token Counter (VTC) is a fair scheduling algorithm for LLM serving based on the paper "Fairness in Serving Large Language Models" (Sheng et al.). VTC aims to provide fairness among clients by tracking the service (weighted token count) each client has received and prioritizing those who have received less service. It integrates with continuous batching and handles challenges unique to LLM serving, like variable token costs and unknown output lengths. The research paper and reference implementation artifact can be found at Fairness in Serving Large Language Models (Sheng et al.).

vtc-basic

The vtc-basic variant implements a simplified version of VTC. It routes requests by combining two scores: a fairness score based on the user’s token count (normalized against all users) within a time window, and a utilization score based on the pod’s current load. The pod with the lowest combined score is selected. This approach adapts dynamically to system load and balances fairness with efficient resource use.
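
A minimal sketch of the combined-score idea, assuming a simple linear combination with the default weights from the table below; the way the fairness term is mapped onto individual pods here (a "virtual slot" derived from the user's normalized token count) is an illustrative assumption and may differ from the actual vtc-basic implementation.

package main

import (
	"fmt"
	"math"
)

// selectPod picks the pod index with the lowest combined score of a fairness
// term (derived from the user's recent token usage) and a utilization term
// (the pod's load normalized by maxPodLoad).
func selectPod(loads []float64, userTokens, minTokens, maxTokens, maxPodLoad float64) int {
	const fairnessWeight, utilizationWeight = 1.0, 1.0 // defaults from the table below

	// normalize the user's token usage within the adaptive [minTokens, maxTokens] range
	normTokens := (userTokens - minTokens) / (maxTokens - minTokens)
	normTokens = math.Max(0, math.Min(1, normTokens))

	best, bestScore := -1, math.Inf(1)
	for i, load := range loads {
		// fairness: distance between this pod and the user's "virtual" slot (assumption)
		slot := normTokens * float64(len(loads)-1)
		fairness := math.Abs(slot-float64(i)) / float64(len(loads)-1)

		// utilization: pod load normalized by AIBRIX_ROUTER_VTC_BASIC_MAX_POD_LOAD
		utilization := load / maxPodLoad

		score := fairnessWeight*fairness + utilizationWeight*utilization
		if score < bestScore {
			best, bestScore = i, score
		}
	}
	return best
}

func main() {
	loads := []float64{30, 10, 55}                       // running requests per pod
	fmt.Println(selectPod(loads, 2500, 1000, 8000, 100)) // index of the pod with the lowest score
}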

End-to-end tests for this algorithm can be found in vtc_routing_test.go.

Environment Variables
Variable                                    | Description                                                              | Default
AIBRIX_ROUTING_ALGORITHM                    | Set to vtc-basic to enable this routing strategy.                        | prefix-aware
AIBRIX_ROUTER_VTC_TOKEN_TRACKER_TIME_UNIT   | Time unit of the sliding window for tracking user token usage.          | minutes
AIBRIX_ROUTER_VTC_TOKEN_TRACKER_WINDOW_SIZE | Size of the sliding window for tracking user token usage.               | 5
AIBRIX_ROUTER_VTC_TOKEN_TRACKER_MIN_TOKENS  | Sensible min default value for adaptive token tracking (see vtc_basic). | 1000
AIBRIX_ROUTER_VTC_TOKEN_TRACKER_MAX_TOKENS  | Sensible max default value for adaptive token tracking (see vtc_basic). | 8000
AIBRIX_ROUTER_VTC_BASIC_INPUT_TOKEN_WEIGHT  | Weight applied to input tokens in fairness calculations.                | 1.0
AIBRIX_ROUTER_VTC_BASIC_OUTPUT_TOKEN_WEIGHT | Weight applied to output tokens in fairness calculations.               | 2.0
AIBRIX_ROUTER_VTC_BASIC_MAX_POD_LOAD        | Normalization factor for pod load in utilization score calculation.     | 100.0
AIBRIX_ROUTER_VTC_BASIC_FAIRNESS_WEIGHT     | Weight applied to fairness score in combined score calculation.         | 1.0
AIBRIX_ROUTER_VTC_BASIC_UTILIZATION_WEIGHT  | Weight applied to utilization score in combined score calculation.      | 1.0
Other VTC variants

The VTC paper and reference implementation (slora/server/router) describe other VTC variants like vtc-fair (pure fairness), vtc-max-fair, and vtc-pred (using length prediction). Adapting these variants for use within the Aibrix Kubernetes routing context, which involves selecting specific pods rather than managing a single request queue, is a potential area for future enhancement.

Prefill-Decode Disaggregation

Below is the pseudo-code for prefill-decode disaggregated routing.

// Route takes all ready pods for the model, which are later filtered into prefill and decode pods.
func Route(readyPods *[]v1.Pod) {
    prefillPods, decodePods := filterPrefillAndDecodePods(readyPods)

    doPrefillRequest(prefillPods)

    // randomly selects a decode worker
    return selectDecode(decodePods)
}

func filterPrefillAndDecodePods(readyPods *[]v1.Pod) {
    prefillPods := filterBasedOfLabelSelector("role-name=prefill")
    decodePods := filterBasedOfLabelSelector("role-name=decode")

    return prefillPods, decodePods
}

// selects a prefill worker and sends the prefill request (based on the engine)
func doPrefillRequest(prefillPods) {

    // check whether any prefill pod already has a shared prefix so its kv-cache can be reused;
    // if there is no prefix sharing, use load balancing to select the prefill worker
    prefillPod := evaluatePrefixCache(prefillPods)

    // get inference engine using pod labels
    llmEngine := labelValue("model.aibrix.ai/engine")
    
    if llmEngine == "sglang" {
        async prefill_request() // for sglang send async prefill request
    } else {
        sync prefill_request()
    }

    return
}
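
As a concrete counterpart to filterPrefillAndDecodePods above, the sketch below partitions ready pods by the role-name pod label (the PDRoleIdentifier constant listed further down). The helper is illustrative, not the router's actual filtering code.

package routing

import (
	v1 "k8s.io/api/core/v1"
)

// splitByRole is a hypothetical helper that partitions ready pods into prefill
// and decode groups using the "role-name" pod label.
func splitByRole(readyPods []*v1.Pod) (prefill, decode []*v1.Pod) {
	for _, pod := range readyPods {
		switch pod.Labels["role-name"] {
		case "prefill":
			prefill = append(prefill, pod)
		case "decode":
			decode = append(decode, pod)
		}
	}
	return prefill, decode
}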
  • AIBRIX_PREFILL_REQUEST_TIMEOUT

    The AIBrix gateway is an external processing plugin for Envoy proxy that handles inference request validation, selection of the target worker, and more; request forwarding to the inference engine, however, is done by Envoy proxy.

    Prefill-decode disaggregation is a special scenario in which the prefill request is sent directly to the inference engine by the AIBrix gateway plugin, while the decode request is forwarded by Envoy proxy. If needed, the prefill request timeout can be configured using this environment variable; the default is 60s (see the sketch below).
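
A minimal sketch of bounding a direct prefill call with that timeout, assuming the value is read from AIBRIX_PREFILL_REQUEST_TIMEOUT as a Go duration string; the prefill endpoint URL and payload are placeholders, and this is not the gateway's actual request path.

package main

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"os"
	"time"
)

// prefillTimeout reads AIBRIX_PREFILL_REQUEST_TIMEOUT (e.g. "60s") and falls
// back to 60 seconds; the value format here is an assumption for illustration.
func prefillTimeout() time.Duration {
	if v := os.Getenv("AIBRIX_PREFILL_REQUEST_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 60 * time.Second
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), prefillTimeout())
	defer cancel()

	// hypothetical prefill endpoint and payload, shown only to illustrate the timeout wiring
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"http://prefill-pod:8000/v1/completions", bytes.NewBufferString(`{"prompt":"..."}`))
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("prefill request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("prefill status:", resp.Status)
}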

Documentation

Overview

Copyright 2024 The Aibrix Team.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


Index

Constants

const (
	RouterPD                      types.RoutingAlgorithm = "pd"
	VLLMEngine                    string                 = "vllm"
	SGLangEngine                  string                 = "sglang"
	SGLangBootstrapPort           int64                  = 8998
	SGLangBootstrapPortIdentifier string                 = "model.aibrix.ai/sglang-bootstrap-port"
	LLMEngineIdentifier           string                 = constants.ModelLabelEngine
	PDRoleIdentifier              string                 = "role-name"
	RoleReplicaIndex              string                 = "stormservice.orchestration.aibrix.ai/role-replica-index"
	PodGroupIndex                 string                 = "stormservice.orchestration.aibrix.ai/pod-group-index"
)
const (
	PREBLE_TARGET_GPU             = "AIBRIX_ROUTER_PREBLE_TARGET_GPU"
	PREBLE_DECODING_LENGTH        = "AIBRIX_ROUTER_PREBLE_DECODING_LENGTH"
	PREBLE_SLIDING_WINDOW_PERIOD  = "AIBRIX_ROUTER_PREBLE_SLIDING_WINDOW_PERIOD"
	PREBLE_EVICTION_LOOP_INTERVAL = "AIBRIX_ROUTER_PREBLE_EVICTION_LOOP_INTERVAL"
)
const (
	// Default routing algorithm for the slo family of algorithms; set to RouterSLOLeastLoadPulling.
	RouterSLO types.RoutingAlgorithm = "slo"

	// SLO-aware routing algorithm that uses SLOQueue and packLoadRouter as the backend.
	RouterSLOPackLoad types.RoutingAlgorithm = "slo-pack-load"

	// SLO-aware routing algorithm that uses SLOQueue and leastLoadRouter (push mode) as the backend.
	RouterSLOLeastLoad types.RoutingAlgorithm = "slo-least-load"

	// SLO-aware routing algorithm that uses SLOQueue and leastLoadRouter (pull mode) as the backend.
	RouterSLOLeastLoadPulling types.RoutingAlgorithm = "slo-least-load-pulling"
)
const DefaultFallbackAlgorithm types.RoutingAlgorithm = RouterRandom

const RouterLeastBusyTime types.RoutingAlgorithm = "least-busy-time"

const RouterLeastGpuCache types.RoutingAlgorithm = "least-gpu-cache"

const RouterLeastKvCache types.RoutingAlgorithm = "least-kv-cache"

const RouterLeastLatency types.RoutingAlgorithm = "least-latency"

const RouterLeastRequest types.RoutingAlgorithm = "least-request"

const (
	RouterNotSet = ""
)

const RouterPrefixCachePreble types.RoutingAlgorithm = "prefix-cache-preble"

const RouterRandom types.RoutingAlgorithm = "random"

const RouterThroughput types.RoutingAlgorithm = "throughput"

const RouterUtil types.RoutingAlgorithm = "least-utilization"

Variables

var (
	RandomRouter             = &randomRouter{}
	RandomRouterProviderFunc = func(_ *types.RoutingContext) (types.Router, error) { return RandomRouter, nil }
)

var (
	ErrInitTimeout           = errors.New("router initialization timeout")
	ErrFallbackNotSupported  = errors.New("router not support fallback")
	ErrFallbackNotRegistered = errors.New("fallback router not registered")
)

var (
	ErrorNoAvailablePod = fmt.Errorf("no pod available")
)

var ModelRouterFactory = NewSLORouter

var (
	RouterPrefixCache types.RoutingAlgorithm = "prefix-cache"
)

Functions

func Init

func Init()

func NewLeastBusyTimeRouter

func NewLeastBusyTimeRouter() (types.Router, error)

func NewLeastExpectedLatencyRouter

func NewLeastExpectedLatencyRouter() (types.Router, error)

func NewLeastGpuCacheRouter added in v0.4.0

func NewLeastGpuCacheRouter() (types.Router, error)

func NewLeastKvCacheRouter

func NewLeastKvCacheRouter() (types.Router, error)

func NewLeastLoadPullingRouter added in v0.4.0

func NewLeastLoadPullingRouter(provider cache.CappedLoadProvider) (types.Router, error)

NewLeastLoadPullingRouter creates a leastLoadRouter instance in pull mode.

func NewLeastLoadRouter added in v0.4.0

func NewLeastLoadRouter(provider cache.LoadProvider) (types.Router, error)

NewLeastLoadRouter creates a leastLoadRouter instance in push mode.

func NewLeastRequestRouter

func NewLeastRequestRouter() (types.Router, error)

func NewLeastUtilRouter added in v0.4.0

func NewLeastUtilRouter() (types.Router, error)

func NewPDRouter added in v0.4.0

func NewPDRouter() (types.Router, error)

func NewPackLoadRouter added in v0.4.0

func NewPackLoadRouter(provider cache.CappedLoadProvider) (types.Router, error)

NewPackLoadRouter creates a packLoadRouter instance.

func NewPrefixCacheAndLoadRouter

func NewPrefixCacheAndLoadRouter() (types.Router, error)

func NewPrefixCacheRouter

func NewPrefixCacheRouter() (types.Router, error)

func NewQueueRouter added in v0.4.0

func NewQueueRouter(backend types.Router, queue types.RouterQueue[*types.RoutingContext]) (types.QueueRouter, error)

func NewRandomRouter

func NewRandomRouter() (types.Router, error)

func NewSLORouter added in v0.4.0

func NewSLORouter(modelName string) (types.QueueRouter, error)

func NewThroughputRouter

func NewThroughputRouter() (types.Router, error)

func Register

func Register(algorithm types.RoutingAlgorithm, constructor types.RouterConstructor)

func RegisterProvider added in v0.4.0

func RegisterProvider(algorithm types.RoutingAlgorithm, provider types.RouterProviderFunc)

func Select

func Select(ctx *types.RoutingContext) (types.Router, error)

func SelectRandomPodAsFallback

func SelectRandomPodAsFallback(ctx *types.RoutingContext, pods []*v1.Pod, randomFunc func(int) int) (*v1.Pod, error)

SelectRandomPodAsFallback selects a pod randomly as a fallback. This method should only be used when all other selection mechanisms have failed. For example, if no pods meet the required criteria (e.g., valid metrics or specific conditions), this method can be called to randomly select a pod from the provided list.
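
A hedged usage sketch: rand.Intn satisfies the randomFunc parameter, while the wrapper function and import paths below are assumptions for illustration, not code from this package.

package example

import (
	"math/rand"

	v1 "k8s.io/api/core/v1"

	// the import paths below are assumptions; substitute the module's actual paths
	routing "github.com/vllm-project/aibrix/pkg/plugins/gateway/algorithms"
	"github.com/vllm-project/aibrix/pkg/types"
)

// fallbackPod shows the documented last-resort call: rand.Intn supplies the
// random index function used to pick a pod from the provided list.
func fallbackPod(ctx *types.RoutingContext, readyPods []*v1.Pod) (*v1.Pod, error) {
	return routing.SelectRandomPodAsFallback(ctx, readyPods, rand.Intn)
}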

func SetFallback added in v0.4.0

func SetFallback(router types.Router, fallback types.RoutingAlgorithm) error

func Validate

func Validate(algorithms string) (types.RoutingAlgorithm, bool)
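
A hedged sketch of the documented call order for these package-level functions: Validate the requested algorithm name first, then Select a router and Route the request. The wrapper, the import paths, and the assumption that types.Router exposes the Route signature shown for FallbackRouter below are illustrative; the RoutingContext and PodList are taken as inputs rather than constructed here.

package example

import (
	"fmt"

	// the import paths below are assumptions; substitute the module's actual paths
	routing "github.com/vllm-project/aibrix/pkg/plugins/gateway/algorithms"
	"github.com/vllm-project/aibrix/pkg/types"
)

// routeOnce is a hypothetical wrapper: it validates the requested algorithm name,
// selects the corresponding router, and routes the request to a target pod address.
func routeOnce(ctx *types.RoutingContext, pods types.PodList, requested string) (string, error) {
	if _, ok := routing.Validate(requested); !ok {
		return "", fmt.Errorf("unsupported routing algorithm %q", requested)
	}
	router, err := routing.Select(ctx)
	if err != nil {
		return "", err
	}
	return router.Route(ctx, pods)
}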

Types

type FallbackRouter added in v0.4.0

type FallbackRouter struct {
	// contains filtered or unexported fields
}

func (*FallbackRouter) Route added in v0.4.0

func (r *FallbackRouter) Route(ctx *types.RoutingContext, pods types.PodList) (string, error)

func (*FallbackRouter) SetFallback added in v0.4.0

func (r *FallbackRouter) SetFallback(fallback types.RoutingAlgorithm, provider types.RouterProviderFunc)

type PrefillTimeParams

type PrefillTimeParams struct {
	NumRequests      int
	NumBatchedTokens int
	TotalContext     int
	InputIDLens      []int
	NumUniqueKV      int
	SeqLens          []int
}

type PrefixCacheMetrics added in v0.4.0

type PrefixCacheMetrics struct {
	// contains filtered or unexported fields
}

PrefixCacheMetrics holds all prefix cache metrics

type RouterManager added in v0.4.0

type RouterManager struct {
	// contains filtered or unexported fields
}

func NewRouterManager added in v0.4.0

func NewRouterManager() *RouterManager

func (*RouterManager) Init added in v0.4.0

func (rm *RouterManager) Init()

func (*RouterManager) Register added in v0.4.0

func (rm *RouterManager) Register(algorithm types.RoutingAlgorithm, constructor types.RouterConstructor)

func (*RouterManager) RegisterProvider added in v0.4.0

func (rm *RouterManager) RegisterProvider(algorithm types.RoutingAlgorithm, provider types.RouterProviderFunc)

func (*RouterManager) Select added in v0.4.0

func (rm *RouterManager) Select(ctx *types.RoutingContext) (types.Router, error)

Select returns the router for the user-provided routing algorithm supported by the gateway; no error is reported, and it falls back to the random router. Call Validate before this function to ensure expected behavior.

func (*RouterManager) SetFallback added in v0.4.0

func (rm *RouterManager) SetFallback(router types.Router, fallback types.RoutingAlgorithm) error

func (*RouterManager) Validate added in v0.4.0

func (rm *RouterManager) Validate(algorithms string) (types.RoutingAlgorithm, bool)

Validate validates whether the user-provided routing algorithm is supported by the gateway.

type SLORouter added in v0.4.0

type SLORouter struct {
	FallbackRouter
	*queue.SLOQueue
}

SLORouter is a router that adds the FallbackRouter mechanism to the queue.

func (*SLORouter) Route added in v0.4.0

func (r *SLORouter) Route(ctx *types.RoutingContext, pods types.PodList) (string, error)

type SlidingWindowHistogram

type SlidingWindowHistogram struct {
	// contains filtered or unexported fields
}

type TokenizerPool added in v0.4.0

type TokenizerPool struct {
	// contains filtered or unexported fields
}

TokenizerPool manages model-specific tokenizers with caching and health checking

func NewTokenizerPool added in v0.4.0

func NewTokenizerPool(config TokenizerPoolConfig, cache cache.Cache) *TokenizerPool

NewTokenizerPool creates a new TokenizerPool instance

func (*TokenizerPool) Close added in v0.4.0

func (p *TokenizerPool) Close() error

Close gracefully shuts down the TokenizerPool

func (*TokenizerPool) GetTokenizer added in v0.4.0

func (p *TokenizerPool) GetTokenizer(model string, pods []*v1.Pod) tokenizer.Tokenizer

GetTokenizer returns a tokenizer for the specified model

type TokenizerPoolConfig added in v0.4.0

type TokenizerPoolConfig struct {
	EnableVLLMRemote     bool                // Feature flag
	EndpointTemplate     string              // "http://%s:8000"
	HealthCheckPeriod    time.Duration       // Default: 30s
	TokenizerTTL         time.Duration       // Default: 5m
	MaxTokenizersPerPool int                 // Default: 100
	DefaultTokenizer     tokenizer.Tokenizer // Default when remote fails
	ModelServiceMap      map[string]string   // Model -> Service endpoint mapping
	Timeout              time.Duration       // Request timeout
}

TokenizerPoolConfig represents configuration for the TokenizerPool
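
A hedged wiring sketch for the pool, using only the fields and defaults listed in TokenizerPoolConfig; the cache instance, model name, request timeout, and import paths are placeholders or assumptions.

package example

import (
	"time"

	v1 "k8s.io/api/core/v1"

	// the import paths below are assumptions; substitute the module's actual paths
	"github.com/vllm-project/aibrix/pkg/cache"
	routing "github.com/vllm-project/aibrix/pkg/plugins/gateway/algorithms"
)

// newPool builds a TokenizerPool with the documented defaults; c is a
// cache.Cache obtained elsewhere in the gateway.
func newPool(c cache.Cache) *routing.TokenizerPool {
	return routing.NewTokenizerPool(routing.TokenizerPoolConfig{
		EnableVLLMRemote:     false,            // feature flag off
		EndpointTemplate:     "http://%s:8000",
		HealthCheckPeriod:    30 * time.Second, // documented default
		TokenizerTTL:         5 * time.Minute,  // documented default
		MaxTokenizersPerPool: 100,              // documented default
		Timeout:              5 * time.Second,  // placeholder request timeout
	}, c)
}

// lookupAndClose shows the lookup path and graceful shutdown.
func lookupAndClose(p *routing.TokenizerPool, pods []*v1.Pod) {
	tok := p.GetTokenizer("my-model", pods) // "my-model" is a placeholder model name
	_ = tok
	_ = p.Close()
}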

type TokenizerPoolInterface added in v0.4.0

type TokenizerPoolInterface interface {
	GetTokenizer(model string, pods []*v1.Pod) tokenizer.Tokenizer
	Close() error
}

TokenizerPoolInterface defines the interface for tokenizer pools

type TokenizerPoolMetrics added in v0.4.0

type TokenizerPoolMetrics struct {
	// contains filtered or unexported fields
}

TokenizerPoolMetrics contains Prometheus metrics for the pool

Directories

Path Synopsis
vtc	Package vtc implements the Virtual Token Counter routing algorithms focused on fairness and utilization
