routingalgorithms

package
v0.4.0
Published: Aug 5, 2025 License: Apache-2.0 Imports: 26 Imported by: 0

README

Routing Algorithms

Prefix Cache Aware

Below is the pseudo-code for prefix-cache aware routing.

func prefix_cache_routing(input_tokens []int, ready_pods []*v1.Pod) {
    if check_load_imbalance(ready_pods) {
        target_pod = select_pod_with_least_running_requests(ready_pods)
    } else {
        match_pods, prefix_hashes = match_prefix(input_tokens, ready_pods)
        if len(match_pods) > 0 {
            target_pod = select_least_loaded_match_pod(match_pods, ready_pods)
        }
    }

    // if no target pod is selected, fall back to the pod with the fewest running requests
    if target_pod == nil {
        target_pod = select_pod_with_least_running_requests(ready_pods)
    }
    return target_pod
}

func check_load_imbalance(ready_pods) {
    // find the pods with the min and max number of running requests
    min_pod = select_pod_min_running_requests(ready_pods)
    max_pod = select_pod_max_running_requests(ready_pods)

    // if the difference between the max & min running request counts exceeds the
    // configurable threshold ABS_RUNNING_REQUEST_COUNT
    // (see AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT below),
    // the load is imbalanced
    if max_pod.running_requests - min_pod.running_requests > ABS_RUNNING_REQUEST_COUNT {
        return true
    }
    return false
}

func match_prefix(input_tokens, ready_pods) {
    // input_tokens are split into blocks based on the configurable block size and
    // a hash is calculated for each token_block
    hashes = calculate_hashes(input_tokens)

    // check whether each token_block already exists on a ready pod (prefix match);
    // if present, calculate pod_name: prefix_match_percent
    match_pods_with_prefix_match_percent = check_hashes_on_ready_pods(hashes, ready_pods)

    return match_pods_with_prefix_match_percent, hashes
}

func select_least_loaded_match_pod(match_pods_with_prefix_match_percent, ready_pods) {
    mean = calculate_mean_running_request(ready_pods)
    std_dev = calculate_std_dev_running_request(ready_pods)

    // sort match_pods by decreasing prefix_match_percent and, for equal
    // prefix_match_percent, by increasing running_request count
    sort(match_pods_with_prefix_match_percent)

    // select the match pod with the highest prefix match whose
    // running_request count is below mean + load_factor*std_dev
    for _, pod := range match_pods_with_prefix_match_percent {
        if pod.running_request < mean + load_factor*std_dev {
            return pod
        }
    }
    return nil
}

// selects the pod with the fewest running requests, similar to the least-request routing algorithm
func select_pod_with_least_running_requests(ready_pods) {
    return select_pod_min_running_requests(ready_pods)
}
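
To make the match_prefix step above concrete, the sketch below splits a tokenized prompt into fixed-size blocks and computes a cumulative hash per block, so a block's hash depends on all preceding blocks. The hash function, helper name, and token encoding are illustrative assumptions, not the router's actual implementation.

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hashTokenBlocks is a hypothetical helper: it splits tokens into blocks of
// blockSize and returns one cumulative SHA-256 digest per complete block.
func hashTokenBlocks(tokens []int, blockSize int) [][32]byte {
	var hashes [][32]byte
	h := sha256.New()
	for i := 0; i+blockSize <= len(tokens); i += blockSize {
		for _, tok := range tokens[i : i+blockSize] {
			var buf [8]byte
			binary.LittleEndian.PutUint64(buf[:], uint64(tok))
			h.Write(buf[:]) // feed every token of the block into the running hash
		}
		var digest [32]byte
		copy(digest[:], h.Sum(nil)) // cumulative: covers this block and all earlier ones
		hashes = append(hashes, digest)
	}
	return hashes
}

func main() {
	tokens := make([]int, 300)                     // e.g. a 300-token prompt
	fmt.Println(len(hashTokenBlocks(tokens, 128))) // 2 complete 128-token blocks
}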

Configurations

  • AIBRIX_PREFIX_CACHE_TOKENIZER_TYPE

    The AIBrix gateway implements two tokenizers: character and tiktoken. The default tokenizer is character.

    Tokenizer Type | Details
    character      | splits input text into characters
    tiktoken       | open-source openai/tiktoken tokenizer
  • AIBRIX_PREFIX_CACHE_BLOCK_SIZE

    The tokenized input request is split into blocks, and the hash value of each block is cached for future matches. The block size (i.e. the number of tokens per block) determines how effective the prefix match will be. The default is the character tokenizer with a block size of 128 (tokens per block).

    Tokenizer Type | Block Size Recommendation
    character      | 128
    tiktoken       | 16
  • AIBRIX_PREFIX_CACHE_BLOCK_NUMBER

    Maximum number of prefix cache blocks. Default is 200000.

  • AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT

    Before evaluating the prefix cache match, the router checks whether running requests are imbalanced across pods. Imbalance is measured as the absolute difference between the max & min running request counts across pods. For example, if imbalance_abs_count = 16 and the running requests per pod are [p1: 1, p2: 2, p3: 20], the difference is 19, so the scenario is flagged as imbalanced. When flagged as imbalanced, the prefix match is ignored and the request is routed to the pod with the fewest running requests, which in the example above is pod p1. The default is 16 and should be adjusted based on GPU hardware & prompt length (see the sketch after this list).

  • AIBRIX_PREFIX_CACHE_STANDARD_DEVIATION_FACTOR

    After the prefix match is evaluated, the pods with a matching prefix cache are selected. The selected pods are re-evaluated to prevent a hotspot scenario where the bulk of prefix-matching requests is routed to the same pod. The check is:

      prefix_match_pod.running_requests <= mean + load_factor * standard_deviation

    load_factor determines the number of standard deviations allowed above the mean. The default is 2, as illustrated in the sketch after this list.
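
The sketch below illustrates both checks with the example numbers from this list: running requests of [1, 2, 20], an imbalance threshold of 16, and a load factor of 2. The helper names and the plain float math are illustrative assumptions, not the router's actual code.

package main

import (
	"fmt"
	"math"
)

// imbalanced reports whether max-min running requests exceeds the absolute
// threshold (AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT).
func imbalanced(running []float64, absCount float64) bool {
	min, max := running[0], running[0]
	for _, r := range running {
		min = math.Min(min, r)
		max = math.Max(max, r)
	}
	return max-min > absCount
}

// underHotspotLimit reports whether a prefix-matched pod is still acceptable:
// its running requests must not exceed mean + loadFactor*stdDev over all ready pods.
func underHotspotLimit(podRunning float64, running []float64, loadFactor float64) bool {
	var sum float64
	for _, r := range running {
		sum += r
	}
	mean := sum / float64(len(running))
	var variance float64
	for _, r := range running {
		variance += (r - mean) * (r - mean)
	}
	stdDev := math.Sqrt(variance / float64(len(running)))
	return podRunning <= mean+loadFactor*stdDev
}

func main() {
	running := []float64{1, 2, 20}                 // p1, p2, p3
	fmt.Println(imbalanced(running, 16))           // true: 20 - 1 = 19 > 16
	fmt.Println(underHotspotLimit(20, running, 2)) // true: 20 <= mean + 2*stddev ≈ 25.1
}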

Virtual Token Counter (VTC)

The Virtual Token Counter (VTC) is a fair scheduling algorithm for LLM serving based on the paper "Fairness in Serving Large Language Models" (Sheng et al.). VTC aims to provide fairness among clients by tracking the service (weighted token count) each client has received and prioritizing those who have received less service. It integrates with continuous batching and handles challenges unique to LLM serving, like variable token costs and unknown output lengths. The research paper and reference implementation artifact can be found at Fairness in Serving Large Language Models (Sheng et al.).

vtc-basic

The vtc-basic variant implements a simplified version of VTC. It routes requests by combining two scores: a fairness score based on the user’s token count (normalized against all users) within a time window, and a utilization score based on the pod’s current load. The pod with the lowest combined score is selected. This approach adapts dynamically to system load and balances fairness with efficient resource use.
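
A minimal sketch of the combined-score idea, assuming a simple linear combination with the default weights from the table below; the way the fairness term is mapped onto individual pods here (a "virtual slot" derived from the user's normalized token count) is an illustrative assumption and may differ from the actual vtc-basic implementation.

package main

import (
	"fmt"
	"math"
)

// selectPod picks the pod index with the lowest combined score of a fairness
// term (derived from the user's recent token usage) and a utilization term
// (the pod's load normalized by maxPodLoad).
func selectPod(loads []float64, userTokens, minTokens, maxTokens, maxPodLoad float64) int {
	const fairnessWeight, utilizationWeight = 1.0, 1.0 // defaults from the table below

	// normalize the user's token usage within the adaptive [minTokens, maxTokens] range
	normTokens := (userTokens - minTokens) / (maxTokens - minTokens)
	normTokens = math.Max(0, math.Min(1, normTokens))

	best, bestScore := -1, math.Inf(1)
	for i, load := range loads {
		// fairness: distance between this pod and the user's "virtual" slot (assumption)
		slot := normTokens * float64(len(loads)-1)
		fairness := math.Abs(slot-float64(i)) / float64(len(loads)-1)

		// utilization: pod load normalized by AIBRIX_ROUTER_VTC_BASIC_MAX_POD_LOAD
		utilization := load / maxPodLoad

		score := fairnessWeight*fairness + utilizationWeight*utilization
		if score < bestScore {
			best, bestScore = i, score
		}
	}
	return best
}

func main() {
	loads := []float64{30, 10, 55}                       // running requests per pod
	fmt.Println(selectPod(loads, 2500, 1000, 8000, 100)) // index of the pod with the lowest score
}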

End-to-end tests for this algorithm can be found in vtc_routing_test.go.

Environment Variables
Variable                                    | Description                                                              | Default
AIBRIX_ROUTING_ALGORITHM                    | Set to vtc-basic to enable this routing strategy.                        | prefix-aware
AIBRIX_ROUTER_VTC_TOKEN_TRACKER_TIME_UNIT   | Time unit of the sliding window for tracking user token usage.          | minutes
AIBRIX_ROUTER_VTC_TOKEN_TRACKER_WINDOW_SIZE | Size of the sliding window for tracking user token usage.               | 5
AIBRIX_ROUTER_VTC_TOKEN_TRACKER_MIN_TOKENS  | Sensible min default value for adaptive token tracking (see vtc_basic). | 1000
AIBRIX_ROUTER_VTC_TOKEN_TRACKER_MAX_TOKENS  | Sensible max default value for adaptive token tracking (see vtc_basic). | 8000
AIBRIX_ROUTER_VTC_BASIC_INPUT_TOKEN_WEIGHT  | Weight applied to input tokens in fairness calculations.                | 1.0
AIBRIX_ROUTER_VTC_BASIC_OUTPUT_TOKEN_WEIGHT | Weight applied to output tokens in fairness calculations.               | 2.0
AIBRIX_ROUTER_VTC_BASIC_MAX_POD_LOAD        | Normalization factor for pod load in utilization score calculation.     | 100.0
AIBRIX_ROUTER_VTC_BASIC_FAIRNESS_WEIGHT     | Weight applied to fairness score in combined score calculation.         | 1.0
AIBRIX_ROUTER_VTC_BASIC_UTILIZATION_WEIGHT  | Weight applied to utilization score in combined score calculation.      | 1.0
Other VTC variants

The VTC paper and reference implementation (slora/server/router) describe other VTC variants like vtc-fair (pure fairness), vtc-max-fair, and vtc-pred (using length prediction). Adapting these variants for use within the Aibrix Kubernetes routing context, which involves selecting specific pods rather than managing a single request queue, is a potential area for future enhancement.

Prefill-Decode Disaggregation

Below is the pseudo-code for prefill-decode disaggregated routing.

// Route takes all ready pods for the model, which are later filtered into prefill and decode pods.
func Route(readyPods *[]v1.Pod) {
    prefillPods, decodePods := filterPrefillAndDecodePods(readyPods)

    doPrefillRequest(prefillPods)

    // randomly selects a decode worker
    return selectDecode(decodePods)
}

func filterPrefillAndDecodePods(readyPods *[]v1.Pod) {
    prefillPods := filterBasedOfLabelSelector("role-name=prefill")
    decodePods := filterBasedOfLabelSelector("role-name=decode")

    return prefillPods, decodePods
}

// selects a prefill worker and sends the prefill request (based on the engine)
func doPrefillRequest(prefillPods) {

    // check whether any prefill pod already has a shared prefix so its kv-cache can be reused;
    // if there is no prefix sharing, use load balancing to select the prefill worker
    prefillPod := evaluatePrefixCache(prefillPods)

    // get inference engine using pod labels
    llmEngine := labelValue("model.aibrix.ai/engine")
    
    if llmEngine == "sglang" {
        async prefill_request() // for sglang send async prefill request
    } else {
        sync prefill_request()
    }

    return
}
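
As a concrete counterpart to filterPrefillAndDecodePods above, the sketch below partitions ready pods by the role-name pod label (the PDRoleIdentifier constant listed further down). The helper is illustrative, not the router's actual filtering code.

package routing

import (
	v1 "k8s.io/api/core/v1"
)

// splitByRole is a hypothetical helper that partitions ready pods into prefill
// and decode groups using the "role-name" pod label.
func splitByRole(readyPods []*v1.Pod) (prefill, decode []*v1.Pod) {
	for _, pod := range readyPods {
		switch pod.Labels["role-name"] {
		case "prefill":
			prefill = append(prefill, pod)
		case "decode":
			decode = append(decode, pod)
		}
	}
	return prefill, decode
}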
  • AIBRIX_PREFILL_REQUEST_TIMEOUT

    The AIBrix gateway is an external processing plugin for Envoy proxy that handles inference request validation, selection of the target worker, and more; request forwarding to the inference engine, however, is done by Envoy proxy.

    Prefill-decode disaggregation is a special scenario in which the prefill request is sent directly to the inference engine by the AIBrix gateway plugin, while the decode request is forwarded by Envoy proxy. If needed, the prefill request timeout can be configured using this environment variable; the default is 60s (see the sketch below).
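
A minimal sketch of bounding a direct prefill call with that timeout, assuming the value is read from AIBRIX_PREFILL_REQUEST_TIMEOUT as a Go duration string; the prefill endpoint URL and payload are placeholders, and this is not the gateway's actual request path.

package main

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"os"
	"time"
)

// prefillTimeout reads AIBRIX_PREFILL_REQUEST_TIMEOUT (e.g. "60s") and falls
// back to 60 seconds; the value format here is an assumption for illustration.
func prefillTimeout() time.Duration {
	if v := os.Getenv("AIBRIX_PREFILL_REQUEST_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 60 * time.Second
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), prefillTimeout())
	defer cancel()

	// hypothetical prefill endpoint and payload, shown only to illustrate the timeout wiring
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"http://prefill-pod:8000/v1/completions", bytes.NewBufferString(`{"prompt":"..."}`))
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("prefill request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("prefill status:", resp.Status)
}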

Documentation

Overview

Copyright 2024 The Aibrix Team.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


Index

Constants

const (
	RouterPD                      types.RoutingAlgorithm = "pd"
	VLLMEngine                    string                 = "vllm"
	SGLangEngine                  string                 = "sglang"
	SGLangBootstrapPort           int64                  = 8998
	SGLangBootstrapPortIdentifier string                 = "model.aibrix.ai/sglang-bootstrap-port"
	LLMEngineIdentifier           string                 = constants.ModelLabelEngine
	PDRoleIdentifier              string                 = "role-name"
	RoleReplicaIndex              string                 = "stormservice.orchestration.aibrix.ai/role-replica-index"
	PodGroupIndex                 string                 = "stormservice.orchestration.aibrix.ai/pod-group-index"
)
const (
	PREBLE_TARGET_GPU             = "AIBRIX_ROUTER_PREBLE_TARGET_GPU"
	PREBLE_DECODING_LENGTH        = "AIBRIX_ROUTER_PREBLE_DECODING_LENGTH"
	PREBLE_SLIDING_WINDOW_PERIOD  = "AIBRIX_ROUTER_PREBLE_SLIDING_WINDOW_PERIOD"
	PREBLE_EVICTION_LOOP_INTERVAL = "AIBRIX_ROUTER_PREBLE_EVICTION_LOOP_INTERVAL"
)
const (
	// Default routing algorithm for the slo family of algorithms; set to RouterSLOLeastLoadPulling.
	RouterSLO types.RoutingAlgorithm = "slo"

	// SLO-aware routing algorithm that uses SLOQueue and packLoadRouter as the backend.
	RouterSLOPackLoad types.RoutingAlgorithm = "slo-pack-load"

	// SLO-aware routing algorithm that uses SLOQueue and leastLoadRouter (push mode) as the backend.
	RouterSLOLeastLoad types.RoutingAlgorithm = "slo-least-load"

	// SLO-aware routing algorithm that uses SLOQueue and leastLoadRouter (pull mode) as the backend.
	RouterSLOLeastLoadPulling types.RoutingAlgorithm = "slo-least-load-pulling"
)
const DefaultFallbackAlgorithm types.RoutingAlgorithm = RouterRandom

const RouterLeastBusyTime types.RoutingAlgorithm = "least-busy-time"

const RouterLeastGpuCache types.RoutingAlgorithm = "least-gpu-cache"

const RouterLeastKvCache types.RoutingAlgorithm = "least-kv-cache"

const RouterLeastLatency types.RoutingAlgorithm = "least-latency"

const RouterLeastRequest types.RoutingAlgorithm = "least-request"

const (
	RouterNotSet = ""
)

const RouterPrefixCachePreble types.RoutingAlgorithm = "prefix-cache-preble"

const RouterRandom types.RoutingAlgorithm = "random"

const RouterThroughput types.RoutingAlgorithm = "throughput"

const RouterUtil types.RoutingAlgorithm = "least-utilization"

Variables

var (
	RandomRouter             = &randomRouter{}
	RandomRouterProviderFunc = func(_ *types.RoutingContext) (types.Router, error) { return RandomRouter, nil }
)

var (
	ErrInitTimeout           = errors.New("router initialization timeout")
	ErrFallbackNotSupported  = errors.New("router not support fallback")
	ErrFallbackNotRegistered = errors.New("fallback router not registered")
)

var (
	ErrorNoAvailablePod = fmt.Errorf("no pod available")
)

var ModelRouterFactory = NewSLORouter

var (
	RouterPrefixCache types.RoutingAlgorithm = "prefix-cache"
)

Functions

func Init

func Init()

func NewLeastBusyTimeRouter

func NewLeastBusyTimeRouter() (types.Router, error)

func NewLeastExpectedLatencyRouter

func NewLeastExpectedLatencyRouter() (types.Router, error)

func NewLeastGpuCacheRouter added in v0.4.0

func NewLeastGpuCacheRouter() (types.Router, error)

func NewLeastKvCacheRouter

func NewLeastKvCacheRouter() (types.Router, error)

func NewLeastLoadPullingRouter added in v0.4.0

func NewLeastLoadPullingRouter(provider cache.CappedLoadProvider) (types.Router, error)

NewLeastLoadPullingRouter creates a leastLoadRouter instance in pull mode.

func NewLeastLoadRouter added in v0.4.0

func NewLeastLoadRouter(provider cache.LoadProvider) (types.Router, error)

NewLeastLoadRouter creates a leastLoadRouter instance in push mode.

func NewLeastRequestRouter

func NewLeastRequestRouter() (types.Router, error)

func NewLeastUtilRouter added in v0.4.0

func NewLeastUtilRouter() (types.Router, error)

func NewPDRouter added in v0.4.0

func NewPDRouter() (types.Router, error)

func NewPackLoadRouter added in v0.4.0

func NewPackLoadRouter(provider cache.CappedLoadProvider) (types.Router, error)

NewPackLoadRouter creates a packLoadRouter instance.

func NewPrefixCacheAndLoadRouter

func NewPrefixCacheAndLoadRouter() (types.Router, error)

func NewPrefixCacheRouter

func NewPrefixCacheRouter() (types.Router, error)

func NewQueueRouter added in v0.4.0

func NewQueueRouter(backend types.Router, queue types.RouterQueue[*types.RoutingContext]) (types.QueueRouter, error)

func NewRandomRouter

func NewRandomRouter() (types.Router, error)

func NewSLORouter added in v0.4.0

func NewSLORouter(modelName string) (types.QueueRouter, error)

func NewThroughputRouter

func NewThroughputRouter() (types.Router, error)

func Register

func Register(algorithm types.RoutingAlgorithm, constructor types.RouterConstructor)

func RegisterProvider added in v0.4.0

func RegisterProvider(algorithm types.RoutingAlgorithm, provider types.RouterProviderFunc)

func Select

func Select(ctx *types.RoutingContext) (types.Router, error)

func SelectRandomPodAsFallback

func SelectRandomPodAsFallback(ctx *types.RoutingContext, pods []*v1.Pod, randomFunc func(int) int) (*v1.Pod, error)

SelectRandomPodAsFallback selects a pod randomly as a fallback. This method should only be used when all other selection mechanisms have failed. For example, if no pods meet the required criteria (e.g., valid metrics or specific conditions), this method can be called to randomly select a pod from the provided list.
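
A hedged usage sketch: rand.Intn satisfies the randomFunc parameter, while the wrapper function and import paths below are assumptions for illustration, not code from this package.

package example

import (
	"math/rand"

	v1 "k8s.io/api/core/v1"

	// the import paths below are assumptions; substitute the module's actual paths
	routing "github.com/vllm-project/aibrix/pkg/plugins/gateway/algorithms"
	"github.com/vllm-project/aibrix/pkg/types"
)

// fallbackPod shows the documented last-resort call: rand.Intn supplies the
// random index function used to pick a pod from the provided list.
func fallbackPod(ctx *types.RoutingContext, readyPods []*v1.Pod) (*v1.Pod, error) {
	return routing.SelectRandomPodAsFallback(ctx, readyPods, rand.Intn)
}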

func SetFallback added in v0.4.0

func SetFallback(router types.Router, fallback types.RoutingAlgorithm) error

func Validate

func Validate(algorithms string) (types.RoutingAlgorithm, bool)
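
A hedged sketch of the documented call order for these package-level functions: Validate the requested algorithm name first, then Select a router and Route the request. The wrapper, the import paths, and the assumption that types.Router exposes the Route signature shown for FallbackRouter below are illustrative; the RoutingContext and PodList are taken as inputs rather than constructed here.

package example

import (
	"fmt"

	// the import paths below are assumptions; substitute the module's actual paths
	routing "github.com/vllm-project/aibrix/pkg/plugins/gateway/algorithms"
	"github.com/vllm-project/aibrix/pkg/types"
)

// routeOnce is a hypothetical wrapper: it validates the requested algorithm name,
// selects the corresponding router, and routes the request to a target pod address.
func routeOnce(ctx *types.RoutingContext, pods types.PodList, requested string) (string, error) {
	if _, ok := routing.Validate(requested); !ok {
		return "", fmt.Errorf("unsupported routing algorithm %q", requested)
	}
	router, err := routing.Select(ctx)
	if err != nil {
		return "", err
	}
	return router.Route(ctx, pods)
}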

Types

type FallbackRouter added in v0.4.0

type FallbackRouter struct {
	// contains filtered or unexported fields
}

func (*FallbackRouter) Route added in v0.4.0

func (r *FallbackRouter) Route(ctx *types.RoutingContext, pods types.PodList) (string, error)

func (*FallbackRouter) SetFallback added in v0.4.0

func (r *FallbackRouter) SetFallback(fallback types.RoutingAlgorithm, provider types.RouterProviderFunc)

type PrefillTimeParams

type PrefillTimeParams struct {
	NumRequests      int
	NumBatchedTokens int
	TotalContext     int
	InputIDLens      []int
	NumUniqueKV      int
	SeqLens          []int
}

type PrefixCacheMetrics added in v0.4.0

type PrefixCacheMetrics struct {
	// contains filtered or unexported fields
}

PrefixCacheMetrics holds all prefix cache metrics

type RouterManager added in v0.4.0

type RouterManager struct {
	// contains filtered or unexported fields
}

func NewRouterManager added in v0.4.0

func NewRouterManager() *RouterManager

func (*RouterManager) Init added in v0.4.0

func (rm *RouterManager) Init()

func (*RouterManager) Register added in v0.4.0

func (rm *RouterManager) Register(algorithm types.RoutingAlgorithm, constructor types.RouterConstructor)

func (*RouterManager) RegisterProvider added in v0.4.0

func (rm *RouterManager) RegisterProvider(algorithm types.RoutingAlgorithm, provider types.RouterProviderFunc)

func (*RouterManager) Select added in v0.4.0

func (rm *RouterManager) Select(ctx *types.RoutingContext) (types.Router, error)

Select returns the router for the user-provided routing algorithm supported by the gateway; no error is reported, and it falls back to the random router. Call Validate before this function to ensure expected behavior.

func (*RouterManager) SetFallback added in v0.4.0

func (rm *RouterManager) SetFallback(router types.Router, fallback types.RoutingAlgorithm) error

func (*RouterManager) Validate added in v0.4.0

func (rm *RouterManager) Validate(algorithms string) (types.RoutingAlgorithm, bool)

Validate validates whether the user-provided routing algorithm is supported by the gateway.

type SLORouter added in v0.4.0

type SLORouter struct {
	FallbackRouter
	*queue.SLOQueue
}

SLORouter is a router that adds the FallbackRouter mechanism to the queue.

func (*SLORouter) Route added in v0.4.0

func (r *SLORouter) Route(ctx *types.RoutingContext, pods types.PodList) (string, error)

type SlidingWindowHistogram

type SlidingWindowHistogram struct {
	// contains filtered or unexported fields
}

type TokenizerPool added in v0.4.0

type TokenizerPool struct {
	// contains filtered or unexported fields
}

TokenizerPool manages model-specific tokenizers with caching and health checking

func NewTokenizerPool added in v0.4.0

func NewTokenizerPool(config TokenizerPoolConfig, cache cache.Cache) *TokenizerPool

NewTokenizerPool creates a new TokenizerPool instance

func (*TokenizerPool) Close added in v0.4.0

func (p *TokenizerPool) Close() error

Close gracefully shuts down the TokenizerPool

func (*TokenizerPool) GetTokenizer added in v0.4.0

func (p *TokenizerPool) GetTokenizer(model string, pods []*v1.Pod) tokenizer.Tokenizer

GetTokenizer returns a tokenizer for the specified model

type TokenizerPoolConfig added in v0.4.0

type TokenizerPoolConfig struct {
	EnableVLLMRemote     bool                // Feature flag
	EndpointTemplate     string              // "http://%s:8000"
	HealthCheckPeriod    time.Duration       // Default: 30s
	TokenizerTTL         time.Duration       // Default: 5m
	MaxTokenizersPerPool int                 // Default: 100
	DefaultTokenizer     tokenizer.Tokenizer // Default when remote fails
	ModelServiceMap      map[string]string   // Model -> Service endpoint mapping
	Timeout              time.Duration       // Request timeout
}

TokenizerPoolConfig represents configuration for the TokenizerPool
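
A hedged wiring sketch for the pool, using only the fields and defaults listed in TokenizerPoolConfig; the cache instance, model name, request timeout, and import paths are placeholders or assumptions.

package example

import (
	"time"

	v1 "k8s.io/api/core/v1"

	// the import paths below are assumptions; substitute the module's actual paths
	"github.com/vllm-project/aibrix/pkg/cache"
	routing "github.com/vllm-project/aibrix/pkg/plugins/gateway/algorithms"
)

// newPool builds a TokenizerPool with the documented defaults; c is a
// cache.Cache obtained elsewhere in the gateway.
func newPool(c cache.Cache) *routing.TokenizerPool {
	return routing.NewTokenizerPool(routing.TokenizerPoolConfig{
		EnableVLLMRemote:     false,            // feature flag off
		EndpointTemplate:     "http://%s:8000",
		HealthCheckPeriod:    30 * time.Second, // documented default
		TokenizerTTL:         5 * time.Minute,  // documented default
		MaxTokenizersPerPool: 100,              // documented default
		Timeout:              5 * time.Second,  // placeholder request timeout
	}, c)
}

// lookupAndClose shows the lookup path and graceful shutdown.
func lookupAndClose(p *routing.TokenizerPool, pods []*v1.Pod) {
	tok := p.GetTokenizer("my-model", pods) // "my-model" is a placeholder model name
	_ = tok
	_ = p.Close()
}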

type TokenizerPoolInterface added in v0.4.0

type TokenizerPoolInterface interface {
	GetTokenizer(model string, pods []*v1.Pod) tokenizer.Tokenizer
	Close() error
}

TokenizerPoolInterface defines the interface for tokenizer pools

type TokenizerPoolMetrics added in v0.4.0

type TokenizerPoolMetrics struct {
	// contains filtered or unexported fields
}

TokenizerPoolMetrics contains Prometheus metrics for the pool

Directories

Path Synopsis
vtc	Package vtc implements the Virtual Token Counter routing algorithms focused on fairness and utilization
