metrics

package
v1.4.3
Warning

This package is not in the latest version of its module.

Published: Jun 9, 2026 License: MIT Imports: 12 Imported by: 0

Documentation

Overview

Package metrics is the SIP stack observability surface: a dependency-free Prometheus registry plus typed SIP counters.

Registry (registry.go, labels.go, async.go, app.go):

  • counters, gauges, summary-style histograms
  • cardinality whitelist via RegisterLabels
  • optional async ObserveAsync drain for hot paths
  • VoiceServer app helpers (CallStarted, Handler, …)

SIP helpers (metrics.go, voice_attach.go):

  • INVITE/BYE/transaction/session-timer/STIR/DTLS/QoS counters
  • voice-attach counters at the OnACK seam

Cardinality discipline:

  • Label keys come from a small enum set (direction, scenario, code_class, method, result, reason_class). Per-call identifiers (Call-ID, phone, SSRC) are NEVER labelled — those belong in the CDR record, not the metrics registry.
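  • Each metric declares its allowed keys via RegisterLabels in init(). The registry enforces this softly: undeclared keys are dropped and counted in voiceserver_metrics_unknown_label_total rather than rejected.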

Hot-path discipline:

  • All exported helpers take only enums/integers and look up the pre-allocated label map. Zero string formatting at the call site, zero allocation in the steady state.

Registry is a tiny, dependency-free Prometheus exposition backend. It provides counters, gauges, and summary-style histograms with P50/P90/P95/P99 quantiles. See package doc in metrics.go.


Constants

const (
	// Calls.
	MetricActiveCalls = "voiceserver_active_calls"
	MetricCallsTotal  = "voiceserver_calls_total"

	// Recognizer / synthesizer errors.
	MetricASRErrors = "voiceserver_asr_errors_total"
	MetricTTSErrors = "voiceserver_tts_errors_total"

	// User-interrupts-AI events.
	MetricBargeInTotal = "voiceserver_barge_in_total"

	// Latencies (milliseconds).
	MetricE2EFirstByteMs = "voiceserver_e2e_first_byte_ms"
	MetricTTSFirstByteMs = "voiceserver_tts_first_byte_ms"
	MetricLLMFirstByteMs = "voiceserver_llm_first_byte_ms"

	// Dialog plane.
	MetricDialogReconnectTotal = "voiceserver_dialog_reconnect_total"
)

Metric name constants, kept in one place so dashboards can grep a single source of truth. Names loosely follow the Prometheus convention `<namespace>_<subsystem>_<name>_<unit>`: counters carry a `_total` suffix and latencies are in milliseconds (`_ms`).

const (
	// INVITE responses, classified by direction and response class.
	// One series per (direction, class) — bounded to 2 × 6 = 12 series.
	MetricInviteResultTotal = "sip_invite_result_total"

	// BYE events, classified by who initiated and reason class.
	MetricByeTotal = "sip_bye_total"

	// Transaction-level timeouts (timer F / B fired, RFC 3261).
	MetricTransactionTimeoutTotal = "sip_transaction_timeout_total"

	// RFC 4028 session-timer refresher events. result = ok / 422-bumped /
	// 422-gave-up / 481 / role-swap.
	MetricSessionTimerRefreshTotal = "sip_session_timer_refresh_total"

	// DTLS-SRTP handshake outcomes. result = ok / fail / timeout /
	// fingerprint-mismatch.
	MetricDTLSHandshakeTotal = "sip_dtls_handshake_total"

	// RFC 8224 STIR verification outcomes.
	MetricSTIRVerifyTotal = "sip_stir_verify_total"

	// RTCP-derived per-call QoS roll-ups, recorded ONCE at call end.
	MetricCallRTTMs        = "sip_call_rtt_ms"
	MetricCallJitterMs     = "sip_call_jitter_ms"
	MetricCallLossFraction = "sip_call_loss_fraction"
	MetricCallMOSEstimate  = "sip_call_mos_estimate"
)

SIP-layer metric names, kept in one place so dashboards can grep a single source of truth.

const (
	DirectionInbound  = "inbound"
	DirectionOutbound = "outbound"
)

Direction enum.

const (
	ByeByLocal  = "local"
	ByeByRemote = "remote"

	// Reason classes — bounded enum.
	ByeReasonNormal       = "normal"        // BYE answered 200 OK, no special cause
	ByeReasonTimerExpired = "timer-expired" // RFC 4028 session-timer expired
	ByeReasonError        = "error"         // unexpected (pipeline failure, etc.)
	ByeReasonUserHangup   = "user-hangup"   // explicit hangup intent
)

Bye classification.

const (
	RefreshResultOK         = "ok"          // peer accepted with 200
	Refresh422Bumped        = "422-bumped"  // got 422, retried with peer Min-SE
	Refresh422GaveUp        = "422-gave-up" // second 422, stopped
	Refresh481DialogGone    = "481"         // dialog disappeared
	RefreshRoleSwappedToUAS = "role-swap"   // peer flipped refresher to itself
)

Refresher event classification.

const (
	DTLSResultOK                  = "ok"
	DTLSResultFail                = "fail"
	DTLSResultTimeout             = "timeout"
	DTLSResultFingerprintMismatch = "fingerprint-mismatch"
)
const (
	STIRResultVerified = "verified"
	STIRResultFailed   = "failed"
	STIRResultSoftFail = "soft-fail" // verifier rejected but call continued
	STIRResultNoIdent  = "no-identity"
)
const (
	// MetricVoiceAttachTotal counts voice-attach attempts at the OnACK
	// seam, classified by resolved engine.Mode and final outcome.
	//
	// labels:
	//   mode   = "cascaded" | "realtime"
	//   result = "ok" | "config_error"
	//
	// "config_error" is the umbrella for every failure path that
	// played scripts/config_error.wav (no tenant id, env load error,
	// missing/incomplete credentials). Granular reasons live in the
	// log lines emitted by AttachCascadedLegacy / AttachRealtimeLegacy;
	// they're not labels because the cardinality blows up fast.
	MetricVoiceAttachTotal = "sip_voice_attach_total"

	// MetricVoiceAttachModeFallbackTotal counts implicit mode
	// fallbacks made by ResolveAttachMode (today: tenant persisted
	// voice_mode="pipeline" but pipeline creds are unusable and
	// realtime is ready → we auto-select realtime).
	//
	// labels:
	//   from = "pipeline"
	//   to   = "realtime"
	MetricVoiceAttachModeFallbackTotal = "sip_voice_attach_mode_fallback_total"

	// MetricVoiceAttachNativeTotal counts decisions made by the
	// PR-9d feature flag to route a cascaded call through the
	// native cascaded.Engine (engine.ModeCascadedNative) instead of
	// the legacy bridge. Independent from MetricVoiceAttachTotal so
	// dashboards can monitor the opt-in rollout without disturbing
	// the existing per-mode chart.
	//
	// labels:
	//   result = "ok" | "err"
	MetricVoiceAttachNativeTotal = "sip_voice_attach_native_total"
)
const (
	VoiceAttachModeCascaded = "cascaded"
	VoiceAttachModeRealtime = "realtime"
)

Voice-attach mode enum. Mirrors engine.Mode but kept as plain strings here so this package doesn't import pkg/dialog/engine (which would create an import cycle once engines start emitting metrics directly). The constants MUST stay in sync with engine.Mode's string values.

const (
	VoiceAttachResultOK          = "ok"
	VoiceAttachResultConfigError = "config_error"
)

Voice-attach result enum.

const MetricObserveDroppedTotal = "voiceserver_metrics_observe_dropped_total"

MetricObserveDroppedTotal counts samples lost because the async Observe buffer was full. If this is non-zero in production the drain goroutine isn't keeping up — usually a downstream stall rather than a real load issue.

const MetricUnknownLabelTotal = "voiceserver_metrics_unknown_label_total"

MetricUnknownLabelTotal counts soft-whitelist violations. Visible via /metrics so on-call can spot "someone is shipping a metric the declared whitelist doesn't cover" without grepping logs.

Variables

var (
	LabelsTransportSIP    = map[string]string{"transport": "sip"}
	LabelsTransportWebRTC = map[string]string{"transport": "webrtc"}
)

LabelsTransportSIP / LabelsTransportWebRTC are the pre-built label sets for the two transports in use today. Any metric labelled by transport should declare its whitelist as RegisterLabels(metric, "transport").

var Default = NewRegistry()

Default is the process-wide registry. Use this for application-level metrics so a single /metrics handler serves everything.

Functions

func ASRError added in v1.4.3

func ASRError(transport string)

ASRError bumps the ASR error counter. Called from the recognizer error callback in the gateway client.

func AsyncDroppedCount added in v1.4.3

func AsyncDroppedCount() uint64

AsyncDroppedCount returns the total samples dropped since process start. Exposed for tests and self-observability tooling.

func BYE

func BYE(direction, by, reasonClass string)

BYE bumps the BYE counter for the given direction (inbound / outbound), initiator (local / remote), and reason class. For backwards compatibility, existing two-argument callers can use the Bye() shim, which defaults direction to outbound. Hot path; zero allocation for any known combination.

func BargeIn added in v1.4.3

func BargeIn(transport string)

BargeIn counts how often the VAD interrupted the AI's TTS because the user started talking. Good predictor of conversation health — a high rate usually means the AI is too verbose or VAD is too twitchy.

func Bye

func Bye(by, reasonClass string)

Bye is the outbound-default shim. Kept for existing callers that don't yet care about direction. New callers should prefer BYE() with an explicit direction.

func CallEnded added in v1.4.3

func CallEnded(transport, status string)

CallEnded mirrors CallStarted. status is a short classification like "ok", "dialog-hangup", "ice-failed", "pipeline-error" — use the same vocabulary you use in call_events.kind so dashboards line up.

func CallStarted added in v1.4.3

func CallStarted(transport string)

CallStarted increments the active-calls gauge and the calls_total counter for the given transport. Call at the moment the session becomes "live" (ASR/TTS wired + dialog plane connected).

func DTLSHandshake

func DTLSHandshake(result string)

DTLSHandshake reports the outcome of one DTLS-SRTP handshake.

func DialogReconnect added in v1.4.3

func DialogReconnect(transport, outcome string)

DialogReconnect counts reconnect attempts to the dialog plane whatever their outcome; the outcome itself is recorded as a label (see LabelsDialogOutcome). A growing counter means the dialog app is flaky; pair it with the ok/fail counts to get a success rate.

func Handler added in v1.4.3

func Handler() http.Handler

Handler returns an http.Handler that writes the Default registry in Prometheus text exposition format. Mount at /metrics — no auth by default; add middleware if the listener is internet-exposed.

func InviteResult

func InviteResult(direction string, code int)

InviteResult bumps the INVITE result counter. `code` is the SIP status code (100..699); it's classified to its hundreds class so the label cardinality stays bounded at 6 per direction.
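The hundreds-class bucketing can be sketched as below. The label values ("1xx" through "6xx"), the helper `codeClass`, and the table `inviteLabels` are assumptions; only the 2 × 6 = 12 bound and the `direction` / `code_class` keys come from the package doc.

```go
package main

import "fmt"

const (
	DirectionInbound  = "inbound"
	DirectionOutbound = "outbound"
)

// classNames holds the six hundreds classes. The "1xx" spelling is an
// assumption; the package's actual label values are not documented.
var classNames = [6]string{"1xx", "2xx", "3xx", "4xx", "5xx", "6xx"}

// codeClass maps a SIP status code to its hundreds class, or "" when the
// code is outside 100..699 (how the package treats such codes is not
// documented; returning "" lets the caller drop the update).
func codeClass(code int) string {
	if code < 100 || code > 699 {
		return ""
	}
	return classNames[code/100-1]
}

// inviteLabels holds the 2 x 6 = 12 pre-built label maps, keyed by
// (direction, class).
var inviteLabels = func() map[[2]string]map[string]string {
	m := make(map[[2]string]map[string]string, 12)
	for _, dir := range []string{DirectionInbound, DirectionOutbound} {
		for _, cls := range classNames {
			m[[2]string{dir, cls}] = map[string]string{"direction": dir, "code_class": cls}
		}
	}
	return m
}()

func main() {
	fmt.Println(codeClass(180), codeClass(486)) // 1xx 4xx
	fmt.Println(len(inviteLabels))              // 12
}
```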

func LabelsCall added in v1.4.3

func LabelsCall(transport, status string) map[string]string

LabelsCall composes a 2-key label set for the common (transport, status) shape used by voiceserver_calls_total. We pre-build the known combinations rather than allocating per-call. Add more statuses here if dashboards need to slice on them.

Return type is map[string]string to fit the existing API; the same map is returned for the same (transport, status) pair, so map-key dedupe inside the registry stays cheap.

func LabelsDialogOutcome added in v1.4.3

func LabelsDialogOutcome(transport, outcome string) map[string]string

LabelsDialogOutcome is used by DialogReconnect — bounded set of outcomes per the original API contract.

func ObserveAsync added in v1.4.3

func ObserveAsync(name, help string, v float64)

ObserveAsync queues a histogram sample on the global async drain. Hot-path safe: non-blocking, zero allocation, drops on full (incrementing the dropped-samples counter).

This is the recommended call for any observation that fires more than ~10x/sec per process. For one-off latencies (per turn, per call) the synchronous Default.Observe is fine and slightly more accurate (no buffering reorder concerns).
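A sketch of the non-blocking drain pattern described here: a buffered channel, a `select` with a `default` arm that counts drops, and a drain loop meant to run in its own goroutine. `asyncObserver`, its buffer size, and the `sink` callback are hypothetical; the package's buffer and goroutine are unexported. Requires Go 1.19+ for atomic.Uint64.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sample is one queued histogram observation.
type sample struct {
	name, help string
	v          float64
}

// asyncObserver (hypothetical) pairs a bounded queue with a drop counter.
type asyncObserver struct {
	ch      chan sample
	dropped atomic.Uint64
}

func newAsyncObserver(size int) *asyncObserver {
	return &asyncObserver{ch: make(chan sample, size)}
}

// Observe never blocks: when the buffer is full the sample is discarded
// and counted instead.
func (a *asyncObserver) Observe(name, help string, v float64) {
	select {
	case a.ch <- sample{name, help, v}:
	default:
		a.dropped.Add(1)
	}
}

// Drain forwards queued samples to sink (e.g. a synchronous registry
// Observe) until ch is closed. Run it in its own goroutine.
func (a *asyncObserver) Drain(sink func(name, help string, v float64)) {
	for s := range a.ch {
		sink(s.name, s.help, s.v)
	}
}

func main() {
	a := newAsyncObserver(2)
	for i := 0; i < 5; i++ {
		a.Observe("demo_ms", "demo latency", float64(i))
	}
	// No drain is running yet, so 2 samples fit and 3 are dropped.
	fmt.Println(a.dropped.Load()) // 3
}
```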

func ObserveCallQoS

func ObserveCallQoS(rttMs uint32, jitterMs float64, lossFraction float64, mosEstimate float64)

ObserveCallQoS records the per-call RTCP-derived metrics. Call this ONCE per call at cleanup (after the last RTCPSnapshot). All inputs are optional; zero / negative values are skipped so "no data" doesn't pollute the distribution.

Hot path? No — this runs at most once per call (~0.02 Hz/leg). Cardinality? Zero labels — these are global distributions.
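The skip rule can be sketched as below. `observeCallQoS` and its `observe` callback are hypothetical stand-ins for ObserveCallQoS and Default.Observe; the metric names come from the constants above.

```go
package main

import "fmt"

// observeCallQoS (hypothetical) re-creates the documented contract:
// every input is optional, and zero or negative values are skipped so
// "no data" never lands in a distribution.
func observeCallQoS(observe func(name string, v float64), rttMs uint32, jitterMs, lossFraction, mosEstimate float64) {
	if rttMs > 0 {
		observe("sip_call_rtt_ms", float64(rttMs))
	}
	if jitterMs > 0 {
		observe("sip_call_jitter_ms", jitterMs)
	}
	if lossFraction > 0 {
		observe("sip_call_loss_fraction", lossFraction)
	}
	if mosEstimate > 0 {
		observe("sip_call_mos_estimate", mosEstimate)
	}
}

func main() {
	// lossFraction is 0 here, so only three of the four metrics are recorded.
	observeCallQoS(func(name string, v float64) {
		fmt.Println(name, v)
	}, 42, 3.5, 0, 4.2)
}
```

As documented, the rule also skips a genuine 0 loss fraction, so calls with no packet loss do not appear in sip_call_loss_fraction at all.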

func ObserveE2EFirstByte added in v1.4.3

func ObserveE2EFirstByte(ms int)

ObserveE2EFirstByte records the user-perceived latency from ASR final to first audible AI byte. Only meaningful values (>0) should be passed — 0 means "no ASR final preceded this turn" which shouldn't skew the distribution.

func ObserveLLMFirstByte added in v1.4.3

func ObserveLLMFirstByte(ms int)

ObserveLLMFirstByte records the dialog app's reported time to first LLM token (ms). Comes from CommandMeta.LLMFirstMs on tts.speak.

func ObserveTTSFirstByte added in v1.4.3

func ObserveTTSFirstByte(ms int)

ObserveTTSFirstByte records Speak -> first PCM frame latency (ms). Measures the TTS engine's cold-start / TTFB across all turns.

func RegisterLabels added in v1.4.3

func RegisterLabels(metric string, keys ...string)

RegisterLabels declares the allowed label keys for a metric. Subsequent updates with extra keys will have those keys dropped (soft defense). Calling RegisterLabels twice for the same metric REPLACES the whitelist (last write wins) — intended for tests.

Safe to call from init().
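A sketch of the soft-whitelist semantics described above: re-registering replaces the key set, undeclared keys are dropped, and each drop is counted. `labelWhitelist` and its fields are hypothetical; the package keeps this state unexported, and passing labels through unchanged for a metric with no whitelist is an assumption. Requires Go 1.19+ for atomic.Uint64.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// labelWhitelist (hypothetical) maps metric name -> allowed label keys.
type labelWhitelist struct {
	mu      sync.RWMutex
	allowed map[string]map[string]struct{}
	unknown atomic.Uint64 // stands in for voiceserver_metrics_unknown_label_total
}

// Register replaces the whitelist for metric (last write wins).
func (w *labelWhitelist) Register(metric string, keys ...string) {
	set := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		set[k] = struct{}{}
	}
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.allowed == nil {
		w.allowed = make(map[string]map[string]struct{})
	}
	w.allowed[metric] = set
}

// Filter drops undeclared keys and counts them. The input map is never
// mutated; when nothing is dropped it is returned as-is (no copy).
// Reading set after RUnlock is safe because Register swaps in a new set
// rather than mutating an existing one.
func (w *labelWhitelist) Filter(metric string, labels map[string]string) map[string]string {
	w.mu.RLock()
	set, ok := w.allowed[metric]
	w.mu.RUnlock()
	if !ok {
		return labels
	}
	dropped := 0
	for k := range labels {
		if _, allowed := set[k]; !allowed {
			dropped++
		}
	}
	if dropped == 0 {
		return labels
	}
	w.unknown.Add(uint64(dropped))
	out := make(map[string]string, len(labels)-dropped)
	for k, v := range labels {
		if _, allowed := set[k]; allowed {
			out[k] = v
		}
	}
	return out
}

func main() {
	var wl labelWhitelist
	wl.Register("voiceserver_calls_total", "transport", "status")
	got := wl.Filter("voiceserver_calls_total", map[string]string{"transport": "sip", "call_id": "abc"})
	fmt.Println(got, wl.unknown.Load()) // map[transport:sip] 1
}
```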

func STIRVerify

func STIRVerify(result string)

STIRVerify reports one STIR (RFC 8224) verification outcome.

func SessionTimerRefresh

func SessionTimerRefresh(result string)

SessionTimerRefresh counts one refresher state transition. Hot path: called from the outbound refresher's response handler.

func TTSError added in v1.4.3

func TTSError(transport string)

TTSError bumps the TTS error counter. Called when Speak returns an error or is interrupted / drained before producing any audio.

func TransactionTimeout

func TransactionTimeout(method string)

TransactionTimeout reports a transaction-layer timeout (timer B/F fired). Method is the SIP method name (UPPER); we collapse the long tail into "other" to keep cardinality bounded.
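A sketch of the long-tail collapse. The `knownMethods` set is hypothetical, since the package does not document which methods it keeps. No case folding is done: RFC 3261 method names are case-sensitive, which matches the requirement that callers pass the method in upper case.

```go
package main

import "fmt"

// knownMethods is a hypothetical bounded set of method label values.
var knownMethods = map[string]struct{}{
	"INVITE": {}, "BYE": {}, "CANCEL": {}, "UPDATE": {}, "OPTIONS": {},
}

// methodLabel returns method if it is in knownMethods and "other"
// otherwise, so the label never takes more than len(knownMethods)+1 values.
func methodLabel(method string) string {
	if _, ok := knownMethods[method]; ok {
		return method
	}
	return "other"
}

func main() {
	for _, m := range []string{"INVITE", "MESSAGE", "invite"} {
		fmt.Println(m, "->", methodLabel(m))
	}
}
```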

func VoiceAttach

func VoiceAttach(mode string, ok bool)

VoiceAttach bumps the voice-attach counter for one OnACK dispatch; ok selects the result label ("ok" or "config_error"). Unknown mode strings are dropped silently: the goal is hot-path safety, not enforcement (dashboards alert on missing series, not on rejected inputs).
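A sketch of how (mode, ok) can map onto pre-allocated label maps using the `mode` and `result` keys documented on MetricVoiceAttachTotal. `attachLabels` and `voiceAttachLabels` are hypothetical names; the package's internal table is unexported.

```go
package main

import "fmt"

// Values copied from the package's exported constants.
const (
	VoiceAttachModeCascaded      = "cascaded"
	VoiceAttachModeRealtime      = "realtime"
	VoiceAttachResultOK          = "ok"
	VoiceAttachResultConfigError = "config_error"
)

// attachLabels (hypothetical): per mode, index 0 is the config_error
// labels and index 1 is the ok labels. 2 modes x 2 results = 4 series.
var attachLabels = map[string][2]map[string]string{
	VoiceAttachModeCascaded: {
		{"mode": VoiceAttachModeCascaded, "result": VoiceAttachResultConfigError},
		{"mode": VoiceAttachModeCascaded, "result": VoiceAttachResultOK},
	},
	VoiceAttachModeRealtime: {
		{"mode": VoiceAttachModeRealtime, "result": VoiceAttachResultConfigError},
		{"mode": VoiceAttachModeRealtime, "result": VoiceAttachResultOK},
	},
}

// voiceAttachLabels returns the shared label map for (mode, ok); an
// unknown mode yields (nil, false) and the caller skips the update.
func voiceAttachLabels(mode string, ok bool) (map[string]string, bool) {
	pair, known := attachLabels[mode]
	if !known {
		return nil, false
	}
	if ok {
		return pair[1], true
	}
	return pair[0], true
}

func main() {
	labels, ok := voiceAttachLabels(VoiceAttachModeCascaded, true)
	fmt.Println(labels, ok) // map[mode:cascaded result:ok] true
	_, ok = voiceAttachLabels("pipeline", true)
	fmt.Println(ok) // false
}
```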

func VoiceAttachModeFallback

func VoiceAttachModeFallback(from, to string)

VoiceAttachModeFallback bumps the mode-fallback counter. Today this is only called when ResolveAttachMode promotes "pipeline" to "realtime" because pipeline creds are unusable. Future fallbacks would add new pre-allocated label maps and a switch arm.

func VoiceAttachNative

func VoiceAttachNative(ok bool)

VoiceAttachNative bumps the native-cascaded routing counter. ok reflects whether the native attach succeeded (engine.New + Attach both returned nil). Hot-path: same allocation profile as VoiceAttach.

Types

type Registry added in v1.4.3

type Registry struct {
	// contains filtered or unexported fields
}

Registry is the single source of truth for VoiceServer process-level metrics. A call site imports the package and updates the Default registry, either through the typed helpers or directly through methods such as IncCounter(); a single HTTP handler then serialises the registry to Prometheus text format on each /metrics scrape.

func NewRegistry added in v1.4.3

func NewRegistry() *Registry

NewRegistry returns an empty, ready-to-use registry.

func (*Registry) AddCounter added in v1.4.3

func (r *Registry) AddCounter(name, help string, labels map[string]string, n uint64)

AddCounter adds `n` to the counter. Prometheus counters are monotonic, and because n is a uint64 a negative increment cannot be expressed. Note that converting a negative int to uint64 at the call site wraps to a very large value rather than being ignored.

Labels are filtered through the cardinality whitelist registered via RegisterLabels (see labels.go). Unknown keys are dropped and reported via metrics_unknown_label_total.

func (*Registry) AddGauge added in v1.4.3

func (r *Registry) AddGauge(name, help string, labels map[string]string, v float64)

AddGauge increments (v > 0) / decrements (v < 0) a gauge atomically. Labels run through the cardinality whitelist (see labels.go).
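One common way to update a float64 gauge atomically is to store its IEEE-754 bits in a uint64 and retry a compare-and-swap until it succeeds. This sketch shows that technique; the type `atomicFloat64` is hypothetical and the package's actual gauge representation is not documented. Requires Go 1.19+ for atomic.Uint64.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
)

// atomicFloat64 (hypothetical) stores a float64 as its bit pattern so it
// can be updated without a mutex.
type atomicFloat64 struct{ bits atomic.Uint64 }

// Add applies delta with a compare-and-swap retry loop: if another
// goroutine changed the value between Load and CompareAndSwap, retry.
func (f *atomicFloat64) Add(delta float64) {
	for {
		old := f.bits.Load()
		next := math.Float64bits(math.Float64frombits(old) + delta)
		if f.bits.CompareAndSwap(old, next) {
			return
		}
	}
}

func (f *atomicFloat64) Set(v float64) { f.bits.Store(math.Float64bits(v)) }
func (f *atomicFloat64) Load() float64 { return math.Float64frombits(f.bits.Load()) }

func main() {
	var g atomicFloat64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Add(1)  // like CallStarted
			g.Add(-1) // like CallEnded
		}()
	}
	wg.Wait()
	fmt.Println(g.Load()) // 0
}
```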

func (*Registry) IncCounter added in v1.4.3

func (r *Registry) IncCounter(name, help string, labels map[string]string)

IncCounter bumps a counter by 1. Safe to call from hot paths.

func (*Registry) Observe added in v1.4.3

func (r *Registry) Observe(name, help string, v float64)

Observe records one sample into a histogram. The registry keeps only the `maxSamples` most recent observations to bound memory (ObserveN sets a custom cap); older values are evicted in FIFO order. Quantiles are computed at scrape time from the live buffer, so a /metrics request costs O(n log n) in the buffer size, which is fine for n up to a few thousand.

func (*Registry) ObserveN added in v1.4.3

func (r *Registry) ObserveN(name, help string, v float64, maxSamples int)

ObserveN is Observe with a custom buffer cap. Use when you want finer control over memory vs resolution (e.g. 8192 for a hot call latency signal you scrape every 10s).

func (*Registry) SetGauge added in v1.4.3

func (r *Registry) SetGauge(name, help string, labels map[string]string, v float64)

SetGauge stores a value for a gauge. Labels run through the cardinality whitelist (see labels.go).

func (*Registry) WritePromText added in v1.4.3

func (r *Registry) WritePromText(w io.Writer)

WritePromText serialises the registry in Prometheus text exposition format (v0.0.4). Safe to call concurrently with metric updates; snapshot is point-in-time per metric.
