analyzer

package
v0.0.4
Published: Nov 15, 2025 License: Apache-2.0 Imports: 3 Imported by: 0

README

Queue Analyzer

The queue analyzer is used to analyze and size inference servers. It uses a queueing model that captures the statistical behavior of request queueing and processing (prefill and decode).

The configuration of the model includes:

  • queueing parameters: max batch size and max queue length
  • processing parameters: constants used to calculate prefill and decode times

The traffic load on the model includes:

  • request rate
  • average request size (average number of input and output tokens)

The model is used for:

  • analysis: evaluate performance metrics given load
  • sizing: evaluate max request rate to achieve a given target performance

The model may be used for different scenarios by setting the number of tokens:

  • prefill only: inputTokens > 0, outputTokens = 1
  • decode only: inputTokens = 0, outputTokens > 0
  • mixed: inputTokens > 0, outputTokens > 1

Units of performance metrics:

  • rate: requests/sec, except internal to the queueing model (lambda)
  • time: msec
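
The EvalTTFT and EvalITL documentation describes lambda as requests/msec, while external rates are in requests/sec. A minimal sketch of the conversion between the two follows; the helper names toLambda and toRatePerSec are hypothetical and are not part of this package.

```go
package main

import "fmt"

// toLambda converts requests/sec to requests/msec (hypothetical helper).
func toLambda(ratePerSec float32) float32 {
	return ratePerSec / 1000
}

// toRatePerSec converts requests/msec back to requests/sec (hypothetical helper).
func toRatePerSec(lambda float32) float32 {
	return lambda * 1000
}

func main() {
	lambda := toLambda(2000) // 2000 requests/sec
	fmt.Printf("lambda=%v req/msec, rate=%v req/sec\n", lambda, toRatePerSec(lambda))
}
```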

Timing metrics are defined as follows:

  • AvgRespTime: average request response time (aka latency)
  • AvgWaitTime: average request queueing time
  • AvgPrefillTime: average request prefill time (processing input tokens and generating first output token)
  • AvgTokenTime: average token decode time (generating time of a subsequent output token)
  • TTFT: AvgWaitTime + AvgPrefillTime
  • ITL: AvgTokenTime

Target metrics are defined as follows:

  • TTFT: max sum of queueing and prefill time (msec)
  • ITL: max decode time (msec)
  • TPS: min token generation rate (tokens/sec)

Target values are positive; a target set to zero is not considered.

Documentation

Constants

const Epsilon = float32(0.001)

small disturbance around a value

const StabilitySafetyFraction = float32(0.1)

fraction of maximum server throughput to provide stability (running this fraction below the maximum)
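
One plausible reading of this constant is that the highest usable request rate is the maximum throughput reduced by the safety fraction. The sketch below illustrates that reading only; the helper maxStableRate and the exact formula are assumptions, not the package's confirmed behavior.

```go
package main

import "fmt"

// Mirrors the documented value of StabilitySafetyFraction.
const stabilitySafetyFraction = float32(0.1)

// maxStableRate returns a rate that sits the safety fraction below maxRate.
// Hypothetical helper: the package may apply the fraction differently.
func maxStableRate(maxRate float32) float32 {
	return maxRate * (1 - stabilitySafetyFraction)
}

func main() {
	fmt.Printf("max stable rate for 10 req/sec: %.2f\n", maxStableRate(10))
}
```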

Variables

This section is empty.

Functions

func BinarySearch

func BinarySearch(xMin float32, xMax float32, yTarget float32,
	eval func(float32) (float32, error)) (float32, int, error)

Binary search: find xStar in a range [xMin, xMax] such that f(xStar)=yTarget. Function f() must be monotonically increasing or decreasing over the range. Returns an indicator of whether target is below (-1), within (0), or above (+1) the bounded region. Returns an error if the function cannot be evaluated or the target is not found.
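
The description above can be illustrated with a generic bisection over a monotonic function. This is a sketch under stated assumptions, not the package's implementation: the name bisect, the absolute tolerance tol, the iteration cap, and the choice to return the nearest range boundary when the target is out of range are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

const tol = float32(1e-3) // absolute tolerance on y; an assumption of this sketch

// bisect looks for x in [xMin, xMax] with eval(x) close to yTarget.
// The indicator is -1 if yTarget is below the values eval takes on the range,
// +1 if above, and 0 if within.
func bisect(xMin, xMax, yTarget float32, eval func(float32) (float32, error)) (float32, int, error) {
	if xMin > xMax {
		return 0, 0, errors.New("invalid range")
	}
	yMin, err := eval(xMin)
	if err != nil {
		return 0, 0, err
	}
	yMax, err := eval(xMax)
	if err != nil {
		return 0, 0, err
	}
	increasing := yMin <= yMax
	lo, hi := yMin, yMax
	xLo, xHi := xMin, xMax // x values where eval attains lo and hi
	if !increasing {
		lo, hi = yMax, yMin
		xLo, xHi = xMax, xMin
	}
	if yTarget < lo {
		return xLo, -1, nil
	}
	if yTarget > hi {
		return xHi, +1, nil
	}
	a, b := xMin, xMax
	for i := 0; i < 100; i++ {
		mid := (a + b) / 2
		y, err := eval(mid)
		if err != nil {
			return 0, 0, err
		}
		if d := y - yTarget; d < tol && d > -tol {
			return mid, 0, nil
		}
		// Move the bound that keeps the target bracketed.
		if (y < yTarget) == increasing {
			a = mid
		} else {
			b = mid
		}
	}
	return 0, 0, errors.New("target not found")
}

func main() {
	square := func(x float32) (float32, error) { return x * x, nil }
	x, ind, err := bisect(0, 10, 25, square)
	fmt.Println(x, ind, err)
}
```

The comparison `(y < yTarget) == increasing` lets one loop handle both increasing and decreasing functions.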

func EffectiveConcurrency

func EffectiveConcurrency(avgServiceTime float32, serviceParms *ServiceParms, requestSize *RequestSize, maxBatchSize int) float32

calculate effective average number of requests in service (n), given average request service time

  • n has to satisfy: prefillTime(n) + totalDecodeTime(n) = avgServiceTime
  • prefillTime(n) = gamma + delta * inTokens * n
  • totalDecodeTime(n) = (alpha + beta * n) * (outTokens - 1)
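
Because both terms above are linear in n, the condition can be solved in closed form: n = (avgServiceTime - gamma - alpha * (outTokens - 1)) / (delta * inTokens + beta * (outTokens - 1)). The sketch below evaluates that expression. The params struct, the function name, and the clamping of n to [0, maxBatchSize] are assumptions made for illustration; the package may solve or bound n differently.

```go
package main

import "fmt"

// params holds the four processing constants (hypothetical struct for this sketch).
type params struct {
	Alpha, Beta  float32 // decode base and slope (msec)
	Gamma, Delta float32 // prefill base and slope (msec)
}

// effectiveConcurrency solves
//   gamma + delta*inTokens*n + (alpha + beta*n)*(outTokens-1) = avgServiceTime
// for n, then clamps the result to [0, maxBatchSize] (clamping is an assumption).
func effectiveConcurrency(avgServiceTime float32, p params, inTokens, outTokens, maxBatchSize int) float32 {
	steps := float32(outTokens - 1)
	denom := p.Delta*float32(inTokens) + p.Beta*steps
	if denom <= 0 {
		return 0 // service time does not depend on n; nothing to solve
	}
	n := (avgServiceTime - p.Gamma - p.Alpha*steps) / denom
	if n < 0 {
		n = 0
	}
	if limit := float32(maxBatchSize); n > limit {
		n = limit
	}
	return n
}

func main() {
	p := params{Alpha: 20, Beta: 1, Gamma: 10, Delta: 0.01}
	// With n = 4: prefill = 10 + 0.01*100*4 = 14, decode = (20 + 4)*10 = 240.
	fmt.Printf("n = %.3f\n", effectiveConcurrency(254, p, 100, 11, 8))
}
```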

func EvalITL

func EvalITL(x float32) (float32, error)

Function used in binary search (target ITL)

  • x is lambda req/msec

func EvalServTime

func EvalServTime(x float32) (float32, error)

Function used in binary search (target service time)

func EvalTTFT

func EvalTTFT(x float32) (float32, error)

Function used in binary search (target TTFT)

  • x is lambda req/msec

func EvalWaitingTime

func EvalWaitingTime(x float32) (float32, error)

Function used in binary search (target waiting time)

func WithinTolerance

func WithinTolerance(x, value, tolerance float32) bool

Reports whether x lies within a given relative tolerance of value
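
A relative tolerance check is commonly written as |x - value| / |value| <= tolerance. The sketch below follows that common form; the function name and the handling of value = 0 and negative tolerances are assumptions and may differ from the package's behavior.

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance reports whether x is within a relative tolerance of value.
// Hypothetical stand-in: edge cases are handled by assumption.
func withinTolerance(x, value, tolerance float32) bool {
	if x == value {
		return true
	}
	if value == 0 || tolerance < 0 {
		return false // a relative distance is undefined around zero
	}
	rel := math.Abs(float64(x-value)) / math.Abs(float64(value))
	return rel <= float64(tolerance)
}

func main() {
	fmt.Println(withinTolerance(101, 100, 0.02), withinTolerance(110, 100, 0.02))
}
```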

Types

type AnalysisMetrics

type AnalysisMetrics struct {
	Throughput     float32 // effective throughput (requests/sec)
	AvgRespTime    float32 // average request response time (aka latency) (msec)
	AvgWaitTime    float32 // average request queueing time (msec)
	AvgNumInServ   float32 // average number of requests in service
	AvgPrefillTime float32 // average request prefill time (msec)
	AvgTokenTime   float32 // average token decode time (msec)
	MaxRate        float32 // maximum throughput (requests/sec)
	Rho            float32 // utilization
}

analysis solution metrics data

func (*AnalysisMetrics) String

func (am *AnalysisMetrics) String() string

type Configuration

type Configuration struct {
	MaxBatchSize int           // maximum batch size (limit on the number of requests concurrently receiving service >0)
	MaxQueueSize int           // maximum queue size (limit on the number of requests queued for service >=0)
	ServiceParms *ServiceParms // request processing parameters
}

queue configuration parameters

func (*Configuration) String

func (c *Configuration) String() string

type DecodeParms

type DecodeParms struct {
	Alpha float32 // base
	Beta  float32 // slope
}

decode time = alpha + beta * batchSize (msec); batchSize > 0

func (*DecodeParms) DecodeTime

func (p *DecodeParms) DecodeTime(batchSize float32) float32
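
The linear decode formula can be evaluated as follows. This is a standalone sketch with a local type that mirrors the documented fields; it does not call the package, and the parameter values are made up for illustration.

```go
package main

import "fmt"

// decodeParms mirrors the documented Alpha (base) and Beta (slope) fields.
type decodeParms struct {
	Alpha float32
	Beta  float32
}

// decodeTime returns alpha + beta * batchSize in msec.
func (p decodeParms) decodeTime(batchSize float32) float32 {
	return p.Alpha + p.Beta*batchSize
}

func main() {
	p := decodeParms{Alpha: 20, Beta: 0.5} // illustrative values only
	fmt.Printf("decode time at batch size 8: %v msec\n", p.decodeTime(8))
}
```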

func (*DecodeParms) String

func (p *DecodeParms) String() string

type MM1KModel

type MM1KModel struct {
	QueueModel     // extends base class
	K          int // limit on number in system
	// contains filtered or unexported fields
}

M/M/1/K Finite storage single server queue
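
For reference, the textbook M/M/1/K steady-state distribution is p_n proportional to rho^n for n = 0..K, with rho = lambda / mu, and the throughput is lambda * (1 - p_K) because arrivals that find the system full are lost. The sketch below computes both from first principles. The function names are hypothetical, and it is not taken from the package's Solve or GetProbabilities code.

```go
package main

import "fmt"

// mm1kProbabilities returns the steady-state probabilities p_0..p_K of an
// M/M/1/K queue. Normalizing the geometric terms also covers rho == 1.
func mm1kProbabilities(lambda, mu float64, k int) []float64 {
	rho := lambda / mu
	p := make([]float64, k+1)
	sum, term := 0.0, 1.0
	for n := 0; n <= k; n++ {
		p[n] = term
		sum += term
		term *= rho
	}
	for n := range p {
		p[n] /= sum
	}
	return p
}

// mm1kThroughput returns the rate of accepted requests, lambda * (1 - p_K).
func mm1kThroughput(lambda, mu float64, k int) float64 {
	p := mm1kProbabilities(lambda, mu, k)
	return lambda * (1 - p[k])
}

func main() {
	fmt.Println(mm1kProbabilities(1, 2, 3))
	fmt.Println(mm1kThroughput(1, 2, 3))
}
```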

func NewMM1KModel

func NewMM1KModel(K int) *MM1KModel

func (*MM1KModel) ComputeRho

func (m *MM1KModel) ComputeRho() float32

Compute utilization of queueing model

func (*MM1KModel) GetProbabilities

func (m *MM1KModel) GetProbabilities() []float64

func (*MM1KModel) GetRhoMax

func (m *MM1KModel) GetRhoMax() float32

Compute the maximum utilization of queueing model

func (*MM1KModel) GetThroughput

func (m *MM1KModel) GetThroughput() float32

func (*MM1KModel) Solve

func (m *MM1KModel) Solve(lambda float32, mu float32)

Solve queueing model given arrival and service rates

func (*MM1KModel) String

func (m *MM1KModel) String() string

type MM1ModelStateDependent

type MM1ModelStateDependent struct {
	MM1KModel // extends base class
	// contains filtered or unexported fields
}

M/M/1 model with state dependent service rate

model as global variable, accessed by eval functions
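
A state-dependent single-server queue is a birth-death process, so its steady state satisfies p_n = p_(n-1) * lambda / mu_n. The sketch below applies that recursion. How the package indexes servRate is not documented here, so the sketch assumes servRate[i-1] is the service rate with i requests in service and that the last entry applies to all larger states; the function names are hypothetical.

```go
package main

import "fmt"

// stateProbabilities returns p_0..p_K for a birth-death queue with arrival
// rate lambda and state-dependent service rates (indexing is an assumption).
func stateProbabilities(lambda float64, servRate []float64, k int) []float64 {
	if len(servRate) == 0 {
		return nil
	}
	p := make([]float64, k+1)
	p[0] = 1
	sum := 1.0
	for n := 1; n <= k; n++ {
		idx := n - 1
		if idx >= len(servRate) {
			idx = len(servRate) - 1 // beyond the batch limit, reuse the last rate
		}
		p[n] = p[n-1] * lambda / servRate[idx]
		sum += p[n]
	}
	for n := range p {
		p[n] /= sum
	}
	return p
}

// avgNumInService returns the mean of min(n, maxBatch) under p.
func avgNumInService(p []float64, maxBatch int) float64 {
	avg := 0.0
	for n, pn := range p {
		inServ := n
		if inServ > maxBatch {
			inServ = maxBatch
		}
		avg += float64(inServ) * pn
	}
	return avg
}

func main() {
	p := stateProbabilities(1, []float64{1, 2}, 3)
	fmt.Println(p, avgNumInService(p, 2))
}
```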

func NewMM1ModelStateDependent

func NewMM1ModelStateDependent(K int, servRate []float32) *MM1ModelStateDependent

func (*MM1ModelStateDependent) ComputeRho

func (m *MM1ModelStateDependent) ComputeRho() float32

Compute utilization of queueing model

func (*MM1ModelStateDependent) GetAvgNumInServers

func (m *MM1ModelStateDependent) GetAvgNumInServers() float32

func (*MM1ModelStateDependent) Solve

func (m *MM1ModelStateDependent) Solve(lambda float32, mu float32)

Solve queueing model given arrival and service rates

func (*MM1ModelStateDependent) String

func (m *MM1ModelStateDependent) String() string

type PrefillParms

type PrefillParms struct {
	Gamma float32 // base
	Delta float32 // slope
}

prefill time = gamma + delta * inputTokens * batchSize (msec); inputTokens > 0

func (*PrefillParms) PrefillTime

func (p *PrefillParms) PrefillTime(avgInputTokens int, batchSize float32) float32
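
The prefill formula can be evaluated in the same way. This standalone sketch uses a local type that mirrors the documented fields and the documented argument order; the parameter values are illustrative only.

```go
package main

import "fmt"

// prefillParms mirrors the documented Gamma (base) and Delta (slope) fields.
type prefillParms struct {
	Gamma float32
	Delta float32
}

// prefillTime returns gamma + delta * inputTokens * batchSize in msec.
func (p prefillParms) prefillTime(avgInputTokens int, batchSize float32) float32 {
	return p.Gamma + p.Delta*float32(avgInputTokens)*batchSize
}

func main() {
	p := prefillParms{Gamma: 10, Delta: 0.25} // illustrative values only
	fmt.Printf("prefill time for 100 tokens at batch size 4: %v msec\n", p.prefillTime(100, 4))
}
```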

func (*PrefillParms) String

func (p *PrefillParms) String() string

type QueueAnalyzer

type QueueAnalyzer struct {
	MaxBatchSize int                     // maximum batch size
	MaxQueueSize int                     // maximum queue size
	ServiceParms *ServiceParms           // request processing parameters
	RequestSize  *RequestSize            // number of input and output tokens per request
	Model        *MM1ModelStateDependent // queueing model
	RateRange    *RateRange              // range of request rates for model stability
}

Analyzer of inference server queue

func BuildModel

func BuildModel(qConfig *Configuration, requestSize *RequestSize) (modelData *QueueAnalyzer)

build queueing model using service rates, leaving arrival rate as parameter

func NewQueueAnalyzer

func NewQueueAnalyzer(qConfig *Configuration, requestSize *RequestSize) (*QueueAnalyzer, error)

create a new queue analyzer from config

func (*QueueAnalyzer) Analyze

func (qa *QueueAnalyzer) Analyze(requestRate float32) (metrics *AnalysisMetrics, err error)

evaluate performance metrics given request rate

func (*QueueAnalyzer) Size

func (qa *QueueAnalyzer) Size(targetPerf *TargetPerf) (targetRate *TargetRate, metrics *AnalysisMetrics, achieved *TargetPerf, err error)

evaluate max request rates to achieve a given target performance; returns:

  • max request rates
  • performance metrics at min of max request rates
  • achieved values of targets
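
The phrase "min of max request rates" suggests that the binding rate is the smallest of the per-target maximum rates, skipping targets that are zero (not considered). The sketch below shows that selection only. The candidate type and bindingRate function are hypothetical, and how the package actually treats unset targets is an assumption.

```go
package main

import "fmt"

// candidate pairs a target value with the max request rate that meets it.
type candidate struct {
	target float32 // target value; zero means the target is not considered
	rate   float32 // max request rate for this target (requests/sec)
}

// bindingRate returns the smallest rate among considered targets, and false
// if no target is set.
func bindingRate(cs []candidate) (float32, bool) {
	var best float32
	found := false
	for _, c := range cs {
		if c.target <= 0 {
			continue // target not considered
		}
		if !found || c.rate < best {
			best = c.rate
			found = true
		}
	}
	return best, found
}

func main() {
	cs := []candidate{
		{target: 100, rate: 4},  // e.g. a TTFT target
		{target: 0, rate: 1},    // unset target, ignored
		{target: 50, rate: 2.5}, // e.g. an ITL target
	}
	fmt.Println(bindingRate(cs))
}
```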

func (*QueueAnalyzer) String

func (qa *QueueAnalyzer) String() string

type QueueModel

type QueueModel struct {
	ComputeRho func() float32 // compute utilization of queueing model
	GetRhoMax  func() float32 // compute the maximum utilization of queueing model
	// contains filtered or unexported fields
}

Basic Queueing Model (Abstract Class)

func (*QueueModel) GetAvgNumInSystem

func (m *QueueModel) GetAvgNumInSystem() float32

func (*QueueModel) GetAvgQueueLength

func (m *QueueModel) GetAvgQueueLength() float32

func (*QueueModel) GetAvgRespTime

func (m *QueueModel) GetAvgRespTime() float32

func (*QueueModel) GetAvgServTime

func (m *QueueModel) GetAvgServTime() float32

func (*QueueModel) GetAvgWaitTime

func (m *QueueModel) GetAvgWaitTime() float32

func (*QueueModel) GetLambda

func (m *QueueModel) GetLambda() float32

func (*QueueModel) GetMu

func (m *QueueModel) GetMu() float32

func (*QueueModel) GetRho

func (m *QueueModel) GetRho() float32

func (*QueueModel) IsValid

func (m *QueueModel) IsValid() bool

func (*QueueModel) Solve

func (m *QueueModel) Solve(lambda float32, mu float32)

Solve queueing model given arrival and service rates

func (*QueueModel) String

func (m *QueueModel) String() string

type RateRange

type RateRange struct {
	Min float32 // lowest rate (slightly larger than zero)
	Max float32 // highest rate (slightly less than maximum service rate)
}

range of request rates (requests/sec)

func (*RateRange) String

func (rr *RateRange) String() string

type RequestSize

type RequestSize struct {
	AvgInputTokens  int // average number of input tokens per request
	AvgOutputTokens int // average number of output tokens per request
}

request tokens data

func (*RequestSize) String

func (rq *RequestSize) String() string

type ServiceParms

type ServiceParms struct {
	Prefill *PrefillParms // parameters to calculate prefill time
	Decode  *DecodeParms  // parameters to calculate decode time
}

request processing parameters

func (*ServiceParms) String

func (sp *ServiceParms) String() string

type TargetPerf

type TargetPerf struct {
	TargetTTFT float32 // target time to first token (queueing + prefill) (msec)
	TargetITL  float32 // target inter-token latency (msec)
	TargetTPS  float32 // target token generation throughput (tokens/sec)
}

queue performance targets

func (*TargetPerf) String

func (tp *TargetPerf) String() string

type TargetRate

type TargetRate struct {
	RateTargetTTFT float32 // max request rate for target TTFT (requests/sec)
	RateTargetITL  float32 // max request rate for target ITL (requests/sec)
	RateTargetTPS  float32 // max request rate for target TPS (requests/sec)
}

queue max request rates to achieve performance targets

func (*TargetRate) String

func (tr *TargetRate) String() string
