analyzer

package

v0.0.4 (latest) Published: Nov 15, 2025 License: Apache-2.0 Imports: 3 Imported by: 0

README ¶

Queue Analyzer

The queue analyzer is used to analyze and size inference servers. It utilizes a queueing model which captures the queueing and processing (prefill and decode) statistical behavior of requests.

The configuration of the model includes:

queueing parameters: max batch size and max queue length
processing parameters: constants used to calculate prefill and decode times

The traffic load on the model includes:

request rate
average request size (average number of input and output tokens)

The model is used for:

analysis: evaluate performance metrics given load
sizing: evaluate max request rate to achieve a given target performance

The model may be used for different scenarios by setting the number of tokens:

prefill only: inputTokens > 0, outputTokens = 1
decode only: inputTokens = 0, outputTokens > 0
mixed: inputTokens > 0, outputTokens > 1

Units of performance metrics:

rate: requests/sec, except inside the queueing model, where lambda is in requests/msec
time: msec
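
Because the public metrics use requests/sec while the queueing model works in requests/msec, a conversion is needed at the boundary. The following is a minimal sketch of that conversion; the names msecPerSec, toLambda and toRequestRate are hypothetical and are not part of this package.

```go
package main

// msecPerSec relates the per-second rates of the public metrics to the
// per-millisecond arrival rate (lambda) used inside the queueing model.
const msecPerSec = 1000

// toLambda converts a request rate in requests/sec to requests/msec.
// Hypothetical helper, shown only to illustrate the unit convention.
func toLambda(requestsPerSec float32) float32 {
	return requestsPerSec / msecPerSec
}

// toRequestRate converts lambda in requests/msec back to requests/sec.
func toRequestRate(lambda float32) float32 {
	return lambda * msecPerSec
}
```

For example, toLambda(50) yields 0.05 requests/msec, and toRequestRate(0.5) yields 500 requests/sec.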

Timing metrics are defined as follows:

AvgRespTime: average request response time (aka latency)
AvgWaitTime: average request queueing time
AvgPrefillTime: average request prefill time (processing input tokens and generating first output token)
AvgTokenTime: average token decode time (time to generate each subsequent output token)
TTFT: AvgWaitTime + AvgPrefillTime
ITL: AvgTokenTime
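
The two derived metrics above can be sketched in a few lines. The timing struct below is a hypothetical stand-in that mirrors three fields of AnalysisMetrics; it is not the package type.

```go
package main

// timing is a hypothetical stand-in for the timing fields of
// AnalysisMetrics. All values are in msec.
type timing struct {
	AvgWaitTime    float32 // average request queueing time
	AvgPrefillTime float32 // average request prefill time
	AvgTokenTime   float32 // average token decode time
}

// ttft returns the time to first token: queueing time plus prefill time.
func ttft(t timing) float32 {
	return t.AvgWaitTime + t.AvgPrefillTime
}

// itl returns the inter-token latency, which equals the average
// decode time of a single output token.
func itl(t timing) float32 {
	return t.AvgTokenTime
}
```

With AvgWaitTime = 12.5, AvgPrefillTime = 30 and AvgTokenTime = 8, ttft gives 42.5 msec and itl gives 8 msec.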

Target metrics are defined as follows:

TTFT: max sum of queueing and prefill time (msec)
ITL: max decode time (msec)
TPS: min token generation rate (tokens/sec)

Target values are positive; a target set to zero is not considered.
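
As an illustration of these rules, the sketch below checks achieved values against targets, treating TTFT and ITL as upper bounds, TPS as a lower bound, and skipping any target that is zero. The targets struct and meetsTargets function are hypothetical names, not part of this package.

```go
package main

// targets is a hypothetical stand-in for a set of performance targets
// or achieved values: TTFT and ITL in msec, TPS in tokens/sec.
type targets struct {
	TTFT float32
	ITL  float32
	TPS  float32
}

// meetsTargets reports whether the achieved values satisfy every active
// target. A target of zero is not considered.
func meetsTargets(target, achieved targets) bool {
	if target.TTFT > 0 && achieved.TTFT > target.TTFT {
		return false // TTFT is a maximum
	}
	if target.ITL > 0 && achieved.ITL > target.ITL {
		return false // ITL is a maximum
	}
	if target.TPS > 0 && achieved.TPS < target.TPS {
		return false // TPS is a minimum
	}
	return true
}
```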

Documentation ¶

Index ¶

Constants
func BinarySearch(xMin float32, xMax float32, yTarget float32, ...) (float32, int, error)
func EffectiveConcurrency(avgServiceTime float32, serviceParms *ServiceParms, requestSize *RequestSize, ...) float32
func EvalITL(x float32) (float32, error)
func EvalServTime(x float32) (float32, error)
func EvalTTFT(x float32) (float32, error)
func EvalWaitingTime(x float32) (float32, error)
func WithinTolerance(x, value, tolerance float32) bool
type AnalysisMetrics
- func (am *AnalysisMetrics) String() string
type Configuration
- func (c *Configuration) String() string
type DecodeParms
- func (p *DecodeParms) DecodeTime(batchSize float32) float32
- func (p *DecodeParms) String() string
type MM1KModel
- func NewMM1KModel(K int) *MM1KModel
- func (m *MM1KModel) ComputeRho() float32
- func (m *MM1KModel) GetProbabilities() []float64
- func (m *MM1KModel) GetRhoMax() float32
- func (m *MM1KModel) GetThroughput() float32
- func (m *MM1KModel) Solve(lambda float32, mu float32)
- func (m *MM1KModel) String() string
type MM1ModelStateDependent
- func NewMM1ModelStateDependent(K int, servRate []float32) *MM1ModelStateDependent
- func (m *MM1ModelStateDependent) ComputeRho() float32
- func (m *MM1ModelStateDependent) GetAvgNumInServers() float32
- func (m *MM1ModelStateDependent) Solve(lambda float32, mu float32)
- func (m *MM1ModelStateDependent) String() string
type PrefillParms
- func (p *PrefillParms) PrefillTime(avgInputTokens int, batchSize float32) float32
- func (p *PrefillParms) String() string
type QueueAnalyzer
- func BuildModel(qConfig *Configuration, requestSize *RequestSize) (modelData *QueueAnalyzer)
- func NewQueueAnalyzer(qConfig *Configuration, requestSize *RequestSize) (*QueueAnalyzer, error)
- func (qa *QueueAnalyzer) Analyze(requestRate float32) (metrics *AnalysisMetrics, err error)
- func (qa *QueueAnalyzer) Size(targetPerf *TargetPerf) (targetRate *TargetRate, metrics *AnalysisMetrics, achieved *TargetPerf, ...)
- func (qa *QueueAnalyzer) String() string
type QueueModel
- func (m *QueueModel) GetAvgNumInSystem() float32
- func (m *QueueModel) GetAvgQueueLength() float32
- func (m *QueueModel) GetAvgRespTime() float32
- func (m *QueueModel) GetAvgServTime() float32
- func (m *QueueModel) GetAvgWaitTime() float32
- func (m *QueueModel) GetLambda() float32
- func (m *QueueModel) GetMu() float32
- func (m *QueueModel) GetRho() float32
- func (m *QueueModel) IsValid() bool
- func (m *QueueModel) Solve(lambda float32, mu float32)
- func (m *QueueModel) String() string
type RateRange
- func (rr *RateRange) String() string
type RequestSize
- func (rq *RequestSize) String() string
type ServiceParms
- func (sp *ServiceParms) String() string
type TargetPerf
- func (tp *TargetPerf) String() string
type TargetRate
- func (tr *TargetRate) String() string

Constants ¶

const Epsilon = float32(0.001)

small disturbance around a value

const StabilitySafetyFraction = float32(0.1)

fraction of the maximum server throughput held back for stability (the server runs this fraction below the maximum)

Variables ¶

This section is empty.

Functions ¶

func BinarySearch ¶

func BinarySearch(xMin float32, xMax float32, yTarget float32,
	eval func(float32) (float32, error)) (float32, int, error)

Binary search: find xStar in a range [xMin, xMax] such that f(xStar)=yTarget. Function f() must be monotonically increasing or decreasing over the range. Returns an indicator of whether target is below (-1), within (0), or above (+1) the bounded region. Returns an error if the function cannot be evaluated or the target is not found.
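
The search described above can be sketched as a bisection over a monotone function. This is a self-contained illustration, not the package's implementation: the name bisect, the relative tolerance, the iteration limit, and the choice to return the nearest endpoint when the target lies outside the range are all assumptions.

```go
package main

import (
	"errors"
	"math"
)

// bisect looks for x in [xMin, xMax] with eval(x) close to yTarget, for a
// monotonically increasing or decreasing eval. The int result is -1 if the
// target is below the values reached on the range, +1 if above, 0 if within.
// Hypothetical sketch; tolerance and iteration limit are assumed values.
func bisect(xMin, xMax, yTarget float32, eval func(float32) (float32, error)) (float32, int, error) {
	if xMin > xMax {
		return 0, 0, errors.New("invalid range: xMin > xMax")
	}
	yMin, err := eval(xMin)
	if err != nil {
		return 0, 0, err
	}
	yMax, err := eval(xMax)
	if err != nil {
		return 0, 0, err
	}
	increasing := yMin <= yMax

	// Report targets outside the bounded region (assumed behavior:
	// return the endpoint closest to the target).
	lowest, highest := yMin, yMax
	xLowest, xHighest := xMin, xMax
	if !increasing {
		lowest, highest = yMax, yMin
		xLowest, xHighest = xMax, xMin
	}
	if yTarget < lowest {
		return xLowest, -1, nil
	}
	if yTarget > highest {
		return xHighest, +1, nil
	}

	const relTol = 1e-4
	const maxIter = 100
	lo, hi := xMin, xMax
	for i := 0; i < maxIter; i++ {
		mid := lo + (hi-lo)/2
		y, err := eval(mid)
		if err != nil {
			return 0, 0, err
		}
		scale := math.Max(1, math.Abs(float64(yTarget)))
		if math.Abs(float64(y-yTarget)) <= relTol*scale {
			return mid, 0, nil
		}
		// Move toward the target, accounting for the direction of eval.
		if (y < yTarget) == increasing {
			lo = mid
		} else {
			hi = mid
		}
	}
	return 0, 0, errors.New("target not found within iteration limit")
}
```

Passing the evaluation function as a parameter keeps the search independent of the model; in this package the Eval functions play that role.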

func EffectiveConcurrency ¶

func EffectiveConcurrency(avgServiceTime float32, serviceParms *ServiceParms, requestSize *RequestSize, maxBatchSize int) float32

calculate effective average number of requests in service (n), given average request service time

n has to satisfy: prefillTime(n) + totalDecodeTime(n) = avgServiceTime
prefillTime(n) = gamma + delta * inTokens * n
totalDecodeTime(n) = (alpha + beta * n) * (outTokens - 1)
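
Since both terms are linear in n, the condition above has a closed-form solution: n = (avgServiceTime - gamma - alpha * (outTokens - 1)) / (delta * inTokens + beta * (outTokens - 1)). The sketch below applies it. The params struct, the function names, and the clamping of n to [0, maxBatchSize] are assumptions made for illustration; the package may solve this differently.

```go
package main

// params is a hypothetical stand-in for the prefill and decode constants.
type params struct {
	Alpha, Beta  float32 // decode base and slope (msec)
	Gamma, Delta float32 // prefill base and slope (msec)
}

// serviceTime returns prefillTime(n) + totalDecodeTime(n) in msec.
func serviceTime(p params, inTokens, outTokens int, n float32) float32 {
	prefill := p.Gamma + p.Delta*float32(inTokens)*n
	decode := (p.Alpha + p.Beta*n) * float32(outTokens-1)
	return prefill + decode
}

// effectiveConcurrency solves serviceTime(n) = avgServiceTime for n.
func effectiveConcurrency(avgServiceTime float32, p params, inTokens, outTokens, maxBatchSize int) float32 {
	slope := p.Delta*float32(inTokens) + p.Beta*float32(outTokens-1)
	if slope <= 0 {
		return 0 // service time does not grow with n; no unique solution
	}
	n := (avgServiceTime - p.Gamma - p.Alpha*float32(outTokens-1)) / slope
	// Assumption: keep n within the feasible range of batch sizes.
	if n < 0 {
		n = 0
	}
	if limit := float32(maxBatchSize); n > limit {
		n = limit
	}
	return n
}
```

With alpha = 5, beta = 1, gamma = 10, delta = 0.01, inTokens = 100 and outTokens = 11, the service time is 60 + 11 * n msec, so an average service time of 104 msec corresponds to n = 4.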

func EvalITL ¶

func EvalITL(x float32) (float32, error)

Function used in binary search (target ITL)

x is lambda (requests/msec)

func EvalServTime ¶

func EvalServTime(x float32) (float32, error)

Function used in binary search (target service time)

func EvalTTFT ¶

func EvalTTFT(x float32) (float32, error)

Function used in binary search (target TTFT)

x is lambda (requests/msec)

func EvalWaitingTime ¶

func EvalWaitingTime(x float32) (float32, error)

Function used in binary search (target waiting time)

func WithinTolerance ¶

func WithinTolerance(x, value, tolerance float32) bool

Reports whether x is within the given relative tolerance of value
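
One plausible reading of this check is |x - value| <= tolerance * |value|. The sketch below implements that reading; the exact formula used by the package is not shown here, so treat this as an assumption.

```go
package main

import "math"

// withinTolerance reports whether x deviates from value by at most the
// fraction tolerance of |value|. Hypothetical sketch of a relative check.
func withinTolerance(x, value, tolerance float32) bool {
	diff := math.Abs(float64(x - value))
	return diff <= float64(tolerance)*math.Abs(float64(value))
}
```

Under this reading, 101 is within a tolerance of 0.02 of 100, while 105 is not. When value is zero, only x equal to zero passes.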

Types ¶

type AnalysisMetrics ¶

type AnalysisMetrics struct {
	Throughput     float32 // effective throughput (requests/sec)
	AvgRespTime    float32 // average request response time (aka latency) (msec)
	AvgWaitTime    float32 // average request queueing time (msec)
	AvgNumInServ   float32 // average number of requests in service
	AvgPrefillTime float32 // average request prefill time (msec)
	AvgTokenTime   float32 // average token decode time (msec)
	MaxRate        float32 // maximum throughput (requests/sec)
	Rho            float32 // utilization
}

analysis solution metrics data

func (*AnalysisMetrics) String ¶

func (am *AnalysisMetrics) String() string

type Configuration ¶

type Configuration struct {
	MaxBatchSize int           // maximum batch size (limit on the number of requests concurrently receiving service; > 0)
	MaxQueueSize int           // maximum queue size (limit on the number of requests queued for service; >= 0)
	ServiceParms *ServiceParms // request processing parameters
}

queue configuration parameters

func (*Configuration) String ¶

func (c *Configuration) String() string

type DecodeParms ¶

type DecodeParms struct {
	Alpha float32 // base
	Beta  float32 // slope
}

decode time = alpha + beta * batchSize (msec); batchSize > 0

func (*DecodeParms) DecodeTime ¶

func (p *DecodeParms) DecodeTime(batchSize float32) float32
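
A minimal sketch of the linear decode model, decode time = alpha + beta * batchSize, follows. The lowercase type and method are hypothetical mirrors of DecodeParms and DecodeTime, written out so the example is self-contained.

```go
package main

// decodeParms is a hypothetical mirror of DecodeParms.
type decodeParms struct {
	Alpha float32 // base (msec)
	Beta  float32 // slope (msec per request in the batch)
}

// decodeTime returns the time to decode one token (msec) when batchSize
// requests are in service; batchSize is expected to be > 0.
func (p decodeParms) decodeTime(batchSize float32) float32 {
	return p.Alpha + p.Beta*batchSize
}
```

With Alpha = 5 and Beta = 0.5, a batch of 8 requests gives a decode time of 9 msec per token.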

func (*DecodeParms) String ¶

func (p *DecodeParms) String() string

type MM1KModel ¶

type MM1KModel struct {
	QueueModel     // extends base class
	K          int // limit on number in system
	// contains filtered or unexported fields
}

M/M/1/K Finite storage single server queue
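
For reference, the standard M/M/1/K steady state with rho = lambda / mu is p(n) = (1 - rho) * rho^n / (1 - rho^(K+1)) for n = 0..K, or 1 / (K + 1) when rho = 1, and the effective throughput is lambda * (1 - p(K)). The sketch below computes both using the textbook formulas. The function names are hypothetical and the package's internal computation may differ.

```go
package main

import "math"

// mm1kProbabilities returns the steady-state probabilities p[0..K] of an
// M/M/1/K queue with arrival rate lambda and service rate mu.
func mm1kProbabilities(lambda, mu float64, K int) []float64 {
	p := make([]float64, K+1)
	rho := lambda / mu
	if math.Abs(rho-1) < 1e-12 {
		// All states are equally likely when rho = 1.
		for n := range p {
			p[n] = 1 / float64(K+1)
		}
		return p
	}
	norm := (1 - rho) / (1 - math.Pow(rho, float64(K+1)))
	for n := range p {
		p[n] = norm * math.Pow(rho, float64(n))
	}
	return p
}

// mm1kThroughput returns the rate of accepted requests: arrivals that
// find the system full (state K) are lost.
func mm1kThroughput(lambda, mu float64, K int) float64 {
	p := mm1kProbabilities(lambda, mu, K)
	return lambda * (1 - p[K])
}
```

For lambda = 1, mu = 2 and K = 2, the probabilities are 4/7, 2/7 and 1/7, and the throughput is 6/7 of the arrival rate.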

func NewMM1KModel ¶

func NewMM1KModel(K int) *MM1KModel

func (*MM1KModel) ComputeRho ¶

func (m *MM1KModel) ComputeRho() float32

Compute utilization of queueing model

func (*MM1KModel) GetProbabilities ¶

func (m *MM1KModel) GetProbabilities() []float64

func (*MM1KModel) GetRhoMax ¶

func (m *MM1KModel) GetRhoMax() float32

Compute the maximum utilization of queueing model

func (*MM1KModel) GetThroughput ¶

func (m *MM1KModel) GetThroughput() float32

func (*MM1KModel) Solve ¶

func (m *MM1KModel) Solve(lambda float32, mu float32)

Solve queueing model given arrival and service rates

func (*MM1KModel) String ¶

func (m *MM1KModel) String() string

type MM1ModelStateDependent ¶

type MM1ModelStateDependent struct {
	MM1KModel // extends base class
	// contains filtered or unexported fields
}

M/M/1 model with state dependent service rate
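
A state-dependent single-server queue is a birth-death process, so its steady state follows p(n) = p(n-1) * lambda / mu(n), normalized to sum to one. The sketch below assumes servRate[i-1] is the service rate with i requests in the system and that the last entry applies to all higher states (requests waiting in the queue). That indexing, and the function names, are assumptions for illustration rather than the package's documented behavior.

```go
package main

// stateDependentProbabilities returns p[0..K] for a birth-death queue with
// arrival rate lambda and per-state service rates. It returns nil if no
// service rates are given.
func stateDependentProbabilities(lambda float64, servRate []float64, K int) []float64 {
	if len(servRate) == 0 {
		return nil
	}
	p := make([]float64, K+1)
	p[0] = 1
	sum := 1.0
	for n := 1; n <= K; n++ {
		i := n
		if i > len(servRate) {
			i = len(servRate) // assumed: rate saturates beyond the last entry
		}
		p[n] = p[n-1] * lambda / servRate[i-1]
		sum += p[n]
	}
	for n := range p {
		p[n] /= sum
	}
	return p
}

// avgNumInService returns the mean number of requests being served,
// where at most maxInService requests are served at once.
func avgNumInService(p []float64, maxInService int) float64 {
	avg := 0.0
	for n, pn := range p {
		inService := n
		if inService > maxInService {
			inService = maxInService
		}
		avg += float64(inService) * pn
	}
	return avg
}
```

For lambda = 1, servRate = [1, 2] and K = 3, the unnormalized weights are 1, 1, 0.5 and 0.25, so p(0) = 1 / 2.75 and the average number in service is 2.5 / 2.75.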

var Model *MM1ModelStateDependent

model as global variable, accessed by eval functions

func NewMM1ModelStateDependent ¶

func NewMM1ModelStateDependent(K int, servRate []float32) *MM1ModelStateDependent

func (*MM1ModelStateDependent) ComputeRho ¶

func (m *MM1ModelStateDependent) ComputeRho() float32

Compute utilization of queueing model

func (*MM1ModelStateDependent) GetAvgNumInServers ¶

func (m *MM1ModelStateDependent) GetAvgNumInServers() float32

func (*MM1ModelStateDependent) Solve ¶

func (m *MM1ModelStateDependent) Solve(lambda float32, mu float32)

Solve queueing model given arrival and service rates

func (*MM1ModelStateDependent) String ¶

func (m *MM1ModelStateDependent) String() string

type PrefillParms ¶

type PrefillParms struct {
	Gamma float32 // base
	Delta float32 // slope
}

prefill time = gamma + delta * inputTokens * batchSize (msec); inputTokens > 0

func (*PrefillParms) PrefillTime ¶

func (p *PrefillParms) PrefillTime(avgInputTokens int, batchSize float32) float32
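
A minimal sketch of the prefill model, prefill time = gamma + delta * inputTokens * batchSize, follows. The lowercase type and method are hypothetical mirrors of PrefillParms and PrefillTime, written out so the example is self-contained.

```go
package main

// prefillParms is a hypothetical mirror of PrefillParms.
type prefillParms struct {
	Gamma float32 // base (msec)
	Delta float32 // slope (msec per input token per request in the batch)
}

// prefillTime returns the prefill time (msec) for requests averaging
// avgInputTokens input tokens when batchSize requests are in service.
func (p prefillParms) prefillTime(avgInputTokens int, batchSize float32) float32 {
	return p.Gamma + p.Delta*float32(avgInputTokens)*batchSize
}
```

With Gamma = 10 and Delta = 0.25, 100 input tokens at a batch size of 2 give a prefill time of 60 msec.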

func (*PrefillParms) String ¶

func (p *PrefillParms) String() string

type QueueAnalyzer ¶

type QueueAnalyzer struct {
	MaxBatchSize int                     // maximum batch size
	MaxQueueSize int                     // maximum queue size
	ServiceParms *ServiceParms           // request processing parameters
	RequestSize  *RequestSize            // number of input and output tokens per request
	Model        *MM1ModelStateDependent // queueing model
	RateRange    *RateRange              // range of request rates for model stability
}

Analyzer of inference server queue

func BuildModel ¶

func BuildModel(qConfig *Configuration, requestSize *RequestSize) (modelData *QueueAnalyzer)

build queueing model using service rates, leaving arrival rate as parameter

func NewQueueAnalyzer ¶

func NewQueueAnalyzer(qConfig *Configuration, requestSize *RequestSize) (*QueueAnalyzer, error)

create a new queue analyzer from config

func (*QueueAnalyzer) Analyze ¶

func (qa *QueueAnalyzer) Analyze(requestRate float32) (metrics *AnalysisMetrics, err error)

evaluate performance metrics given request rate

func (*QueueAnalyzer) Size ¶

func (qa *QueueAnalyzer) Size(targetPerf *TargetPerf) (targetRate *TargetRate, metrics *AnalysisMetrics, achieved *TargetPerf, err error)

evaluate the max request rates that achieve a given target performance. Returns:

max request rates (one per target)
performance metrics at the minimum of the max request rates
achieved values of the targets
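
The "min of max request rates" is the rate of the most restrictive target. A sketch of that selection follows. It assumes a rate of zero marks a target that was not considered; that convention and the name bindingRate are assumptions, not documented behavior of Size.

```go
package main

// bindingRate returns the smallest positive rate among the given max
// request rates (requests/sec). The bool is false when no rate is
// positive, that is, when no target was considered.
func bindingRate(rates ...float32) (float32, bool) {
	var lowest float32
	found := false
	for _, r := range rates {
		if r <= 0 {
			continue // assumed: zero means the target was not considered
		}
		if !found || r < lowest {
			lowest = r
			found = true
		}
	}
	return lowest, found
}
```

For rates of 3.5 (TTFT), 0 (ITL not set) and 2 (TPS), the binding rate is 2 requests/sec, so the metrics would be evaluated at that rate.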

func (*QueueAnalyzer) String ¶

func (qa *QueueAnalyzer) String() string

type QueueModel ¶

type QueueModel struct {
	ComputeRho func() float32 // compute utilization of queueing model
	GetRhoMax  func() float32 // compute the maximum utilization of queueing model
	// contains filtered or unexported fields
}

Basic Queueing Model (Abstract Class)

func (*QueueModel) GetAvgNumInSystem ¶

func (m *QueueModel) GetAvgNumInSystem() float32

func (*QueueModel) GetAvgQueueLength ¶

func (m *QueueModel) GetAvgQueueLength() float32

func (*QueueModel) GetAvgRespTime ¶

func (m *QueueModel) GetAvgRespTime() float32

func (*QueueModel) GetAvgServTime ¶

func (m *QueueModel) GetAvgServTime() float32

func (*QueueModel) GetAvgWaitTime ¶

func (m *QueueModel) GetAvgWaitTime() float32

func (*QueueModel) GetLambda ¶

func (m *QueueModel) GetLambda() float32

func (*QueueModel) GetMu ¶

func (m *QueueModel) GetMu() float32

func (*QueueModel) GetRho ¶

func (m *QueueModel) GetRho() float32

func (*QueueModel) IsValid ¶

func (m *QueueModel) IsValid() bool

func (*QueueModel) Solve ¶

func (m *QueueModel) Solve(lambda float32, mu float32)

Solve queueing model given arrival and service rates

func (*QueueModel) String ¶

func (m *QueueModel) String() string

type RateRange ¶

type RateRange struct {
	Min float32 // lowest rate (slightly larger than zero)
	Max float32 // highest rate (slightly less than maximum service rate)
}

range of request rates (requests/sec)

func (*RateRange) String ¶

func (rr *RateRange) String() string

type RequestSize ¶

type RequestSize struct {
	AvgInputTokens  int // average number of input tokens per request
	AvgOutputTokens int // average number of output tokens per request
}

request tokens data

func (*RequestSize) String ¶

func (rq *RequestSize) String() string

type ServiceParms ¶

type ServiceParms struct {
	Prefill *PrefillParms // parameters to calculate prefill time
	Decode  *DecodeParms  // parameters to calculate decode time
}

request processing parameters

func (*ServiceParms) String ¶

func (sp *ServiceParms) String() string

type TargetPerf ¶

type TargetPerf struct {
	TargetTTFT float32 // target time to first token (queueing + prefill) (msec)
	TargetITL  float32 // target inter-token latency (msec)
	TargetTPS  float32 // target token generation throughput (tokens/sec)
}

queue performance targets

func (*TargetPerf) String ¶

func (tp *TargetPerf) String() string

type TargetRate ¶

type TargetRate struct {
	RateTargetTTFT float32 // max request rate for target TTFT (requests/sec)
	RateTargetITL  float32 // max request rate for target ITL (requests/sec)
	RateTargetTPS  float32 // max request rate for target TPS (requests/sec)
}

queue max request rates to achieve performance targets

func (*TargetRate) String ¶

func (tr *TargetRate) String() string
