Documentation
Overview
Package metrics is the SIP stack observability surface: a dependency-free Prometheus registry plus typed SIP counters.
Registry (registry.go, labels.go, async.go, app.go):
- counters, gauges, summary-style histograms
- cardinality whitelist via RegisterLabels
- optional async ObserveAsync drain for hot paths
- VoiceServer app helpers (CallStarted, Handler, …)
SIP helpers (metrics.go, voice_attach.go):
- INVITE/BYE/transaction/session-timer/STIR/DTLS/QoS counters
- voice-attach counters at the OnACK seam
Cardinality discipline:
- Label keys come from a small enum set (direction, scenario, code_class, method, result, reason_class). Per-call identifiers (Call-ID, phone, SSRC) are NEVER labelled — those belong in the CDR record, not the metrics registry.
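- Each metric declares its allowed keys via RegisterLabels in init(); the registry enforces this softly: keys outside the whitelist are dropped and counted in voiceserver_metrics_unknown_label_total rather than rejected.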
Hot-path discipline:
- All exported helpers take only enums/integers and look up the pre-allocated label map. Zero string formatting at the call site, zero allocation in the steady state.
Registry is a tiny, dependency-free Prometheus exposition backend. It provides counters, gauges, and summary-style histograms with P50/P90/P95/P99 quantiles. See package doc in metrics.go.
Index
- Constants
- Variables
- func ASRError(transport string)
- func AsyncDroppedCount() uint64
- func BYE(direction, by, reasonClass string)
- func BargeIn(transport string)
- func Bye(by, reasonClass string)
- func CallEnded(transport, status string)
- func CallStarted(transport string)
- func DTLSHandshake(result string)
- func DialogReconnect(transport, outcome string)
- func Handler() http.Handler
- func InviteResult(direction string, code int)
- func LabelsCall(transport, status string) map[string]string
- func LabelsDialogOutcome(transport, outcome string) map[string]string
- func ObserveAsync(name, help string, v float64)
- func ObserveCallQoS(rttMs uint32, jitterMs float64, lossFraction float64, mosEstimate float64)
- func ObserveE2EFirstByte(ms int)
- func ObserveLLMFirstByte(ms int)
- func ObserveTTSFirstByte(ms int)
- func RegisterLabels(metric string, keys ...string)
- func STIRVerify(result string)
- func SessionTimerRefresh(result string)
- func TTSError(transport string)
- func TransactionTimeout(method string)
- func VoiceAttach(mode string, ok bool)
- func VoiceAttachModeFallback(from, to string)
- func VoiceAttachNative(ok bool)
- type Registry
- func NewRegistry() *Registry
- func (r *Registry) AddCounter(name, help string, labels map[string]string, n uint64)
- func (r *Registry) AddGauge(name, help string, labels map[string]string, v float64)
- func (r *Registry) IncCounter(name, help string, labels map[string]string)
- func (r *Registry) Observe(name, help string, v float64)
- func (r *Registry) ObserveN(name, help string, v float64, maxSamples int)
- func (r *Registry) SetGauge(name, help string, labels map[string]string, v float64)
- func (r *Registry) WritePromText(w io.Writer)
Constants
const (
    // Calls.
    MetricActiveCalls = "voiceserver_active_calls"
    MetricCallsTotal  = "voiceserver_calls_total"
    // Recognizer / synthesizer errors.
    MetricASRErrors = "voiceserver_asr_errors_total"
    MetricTTSErrors = "voiceserver_tts_errors_total"
    // User-interrupts-AI events.
    MetricBargeInTotal = "voiceserver_barge_in_total"
    // Latencies (milliseconds).
    MetricE2EFirstByteMs = "voiceserver_e2e_first_byte_ms"
    MetricTTSFirstByteMs = "voiceserver_tts_first_byte_ms"
    MetricLLMFirstByteMs = "voiceserver_llm_first_byte_ms"
    // Dialog plane.
    MetricDialogReconnectTotal = "voiceserver_dialog_reconnect_total"
)
Metric name constants. Kept in one place so dashboards can grep for a single source of truth. Names follow Prometheus convention: `<namespace>_<subsystem>_<name>_<unit>`.
const (
    // INVITE responses, classified by direction and response class.
    // One series per (direction, class): bounded to 2 × 6 = 12 series.
    MetricInviteResultTotal = "sip_invite_result_total"
    // BYE events, classified by who initiated and reason class.
    MetricByeTotal = "sip_bye_total"
    // Transaction-level timeouts (timer F / B fired, RFC 3261).
    MetricTransactionTimeoutTotal = "sip_transaction_timeout_total"
    // RFC 4028 session-timer refresher events. result = ok / 422 /
    // 481 / role-swap / gave-up.
    MetricSessionTimerRefreshTotal = "sip_session_timer_refresh_total"
    // DTLS-SRTP handshake outcomes. result = ok / fail / timeout /
    // fingerprint-mismatch.
    MetricDTLSHandshakeTotal = "sip_dtls_handshake_total"
    // RFC 8224 STIR verification outcomes.
    MetricSTIRVerifyTotal = "sip_stir_verify_total"
    // RTCP-derived per-call QoS roll-ups, recorded ONCE at call end.
    MetricCallRTTMs        = "sip_call_rtt_ms"
    MetricCallJitterMs     = "sip_call_jitter_ms"
    MetricCallLossFraction = "sip_call_loss_fraction"
    MetricCallMOSEstimate  = "sip_call_mos_estimate"
)
Metric names. Single source of truth so dashboards can grep.
const (
    DirectionInbound  = "inbound"
    DirectionOutbound = "outbound"
)
Direction enum.
const (
    ByeByLocal  = "local"
    ByeByRemote = "remote"
    // Reason classes: bounded enum.
    ByeReasonNormal       = "normal"        // 200 OK BYE no special cause
    ByeReasonTimerExpired = "timer-expired" // RFC 4028 session-timer expired
    ByeReasonError        = "error"         // unexpected (pipeline failure, etc.)
    ByeReasonUserHangup   = "user-hangup"   // explicit hangup intent
)
Bye classification.
const (
    RefreshResultOK         = "ok"          // peer accepted with 200
    Refresh422Bumped        = "422-bumped"  // got 422, retried with peer Min-SE
    Refresh422GaveUp        = "422-gave-up" // second 422, stopped
    Refresh481DialogGone    = "481"         // dialog disappeared
    RefreshRoleSwappedToUAS = "role-swap"   // peer flipped refresher to itself
)
Refresher event classification.
const (
    DTLSResultOK                  = "ok"
    DTLSResultFail                = "fail"
    DTLSResultTimeout             = "timeout"
    DTLSResultFingerprintMismatch = "fingerprint-mismatch"
)
const (
    STIRResultVerified = "verified"
    STIRResultFailed   = "failed"
    STIRResultSoftFail = "soft-fail" // verifier rejected but call continued
    STIRResultNoIdent  = "no-identity"
)
const (
    // MetricVoiceAttachTotal counts voice-attach attempts at the OnACK
    // seam, classified by resolved engine.Mode and final outcome.
    //
    // labels:
    //   mode   = "cascaded" | "realtime"
    //   result = "ok" | "config_error"
    //
    // "config_error" is the umbrella for every failure path that
    // played scripts/config_error.wav (no tenant id, env load error,
    // missing/incomplete credentials). Granular reasons live in the
    // log lines emitted by AttachCascadedLegacy / AttachRealtimeLegacy;
    // they're not labels because the cardinality blows up fast.
    MetricVoiceAttachTotal = "sip_voice_attach_total"
    // MetricVoiceAttachModeFallbackTotal counts implicit mode
    // fallbacks made by ResolveAttachMode (today: tenant persisted
    // voice_mode="pipeline" but pipeline creds are unusable and
    // realtime is ready → we auto-select realtime).
    //
    // labels:
    //   from = "pipeline"
    //   to   = "realtime"
    MetricVoiceAttachModeFallbackTotal = "sip_voice_attach_mode_fallback_total"
    // MetricVoiceAttachNativeTotal counts decisions made by the
    // PR-9d feature flag to route a cascaded call through the
    // native cascaded.Engine (engine.ModeCascadedNative) instead of
    // the legacy bridge. Independent from MetricVoiceAttachTotal so
    // dashboards can monitor opt-in rollout without churn-affecting
    // the existing per-mode chart.
    //
    // labels:
    //   result = "ok" | "err"
    MetricVoiceAttachNativeTotal = "sip_voice_attach_native_total"
)
const (
    VoiceAttachModeCascaded = "cascaded"
    VoiceAttachModeRealtime = "realtime"
)
Voice-attach mode enum. Mirrors engine.Mode but kept as plain strings here so this package doesn't import pkg/dialog/engine (which would create an import cycle once engines start emitting metrics directly). The constants MUST stay in sync with engine.Mode's string values.
const (
    VoiceAttachResultOK          = "ok"
    VoiceAttachResultConfigError = "config_error"
)
Voice-attach result enum.
const MetricObserveDroppedTotal = "voiceserver_metrics_observe_dropped_total"
MetricObserveDroppedTotal counts samples lost because the async Observe buffer was full. If this is non-zero in production the drain goroutine isn't keeping up — usually a downstream stall rather than a real load issue.
const MetricUnknownLabelTotal = "voiceserver_metrics_unknown_label_total"
MetricUnknownLabelTotal counts soft-whitelist violations. Visible via /metrics so on-call can spot "someone is shipping a metric the declared whitelist doesn't cover" without grepping logs.
Variables
var (
    LabelsTransportSIP    = map[string]string{"transport": "sip"}
    LabelsTransportWebRTC = map[string]string{"transport": "webrtc"}
)
LabelsTransportSIP / LabelsTransportWebRTC are the two transports we use today. The whitelist for any metric labelled by transport should be: RegisterLabels(metric, "transport").
var Default = NewRegistry()
Default is the process-wide registry. Use this for application-level metrics so a single /metrics handler serves everything.
Functions
func ASRError (added in v1.4.3)
func ASRError(transport string)
ASRError bumps the ASR error counter. Called from the recognizer error callback in the gateway client.
func AsyncDroppedCount (added in v1.4.3)
func AsyncDroppedCount() uint64
AsyncDroppedCount returns the total samples dropped since process start. Exposed for tests and self-observability tooling.
func BYE
func BYE(direction, by, reasonClass string)
BYE bumps the BYE counter for the given direction (inbound / outbound), initiator (local / remote), and reason class. Backwards-compat shim: existing two-argument call sites still work via the Bye() helper, which defaults direction to outbound. Hot path; zero allocation for any known combination.
func BargeIn (added in v1.4.3)
func BargeIn(transport string)
BargeIn counts how often the VAD interrupted the AI's TTS because the user started talking. Good predictor of conversation health — a high rate usually means the AI is too verbose or VAD is too twitchy.
func Bye
func Bye(by, reasonClass string)
Bye is the outbound-default shim. Kept for existing callers that don't yet care about direction. New callers should prefer BYE() with an explicit direction.
func CallEnded (added in v1.4.3)
func CallEnded(transport, status string)
CallEnded mirrors CallStarted. status is a short classification like "ok", "dialog-hangup", "ice-failed", "pipeline-error" — use the same vocabulary you use in call_events.kind so dashboards line up.
func CallStarted (added in v1.4.3)
func CallStarted(transport string)
CallStarted increments the active-calls gauge and the calls_total counter for the given transport. Call at the moment the session becomes "live" (ASR/TTS wired + dialog plane connected).
func DTLSHandshake
func DTLSHandshake(result string)
DTLSHandshake reports the outcome of one DTLS-SRTP handshake.
func DialogReconnect (added in v1.4.3)
func DialogReconnect(transport, outcome string)
DialogReconnect counts reconnect attempts to the dialog plane regardless of outcome. A growing counter means the dialog app is flaky; pair with the ok/fail counters for success rate.
func Handler (added in v1.4.3)
func Handler() http.Handler
Handler returns an http.Handler that writes the Default registry in Prometheus text exposition format. Mount at /metrics — no auth by default; add middleware if the listener is internet-exposed.
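Because the handler ships without authentication, an exposed listener needs a wrapper. The following is a minimal sketch of one such wrapper, not part of this package: `requireBearer`, `metricsStub`, and `scrape` are hypothetical names, and `metricsStub` merely stands in for the value returned by Handler so the example runs on its own.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// metricsStub stands in for metrics.Handler(); it is NOT the real handler.
func metricsStub() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		io.WriteString(w, "voiceserver_active_calls 0\n")
	})
}

// requireBearer rejects requests whose Authorization header does not carry
// the expected bearer token. The comparison is constant-time.
func requireBearer(token string, next http.Handler) http.Handler {
	want := []byte("Bearer " + token)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := []byte(r.Header.Get("Authorization"))
		if subtle.ConstantTimeCompare(got, want) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// scrape performs one in-memory request and returns the HTTP status code.
func scrape(h http.Handler, auth string) int {
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if auth != "" {
		req.Header.Set("Authorization", auth)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", requireBearer("s3cret", metricsStub()))
	// Unauthenticated scrape first, then an authenticated one.
	fmt.Println(scrape(mux, ""), scrape(mux, "Bearer s3cret"))
}
```

In a real deployment the stub would be replaced by the handler this package returns; the token scheme is only one option, and network-level restrictions may be a better fit.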
func InviteResult
func InviteResult(direction string, code int)
InviteResult bumps the INVITE result counter. `code` is the SIP status code (100..699); it's classified to its hundreds class so the label cardinality stays bounded at 6 per direction.
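The hundreds-class bucketing can be sketched as follows. This is an illustration under assumptions rather than the package's code: `codeClasses`, `inviteLabels`, and `labelsFor` are hypothetical names, the class strings ("1xx" to "6xx") are assumed, and returning nil for out-of-range codes is a choice made for the sketch.

```go
package main

import "fmt"

// codeClasses holds the six possible label values, indexed by code/100 - 1.
// The exact strings are an assumption for this sketch.
var codeClasses = [6]string{"1xx", "2xx", "3xx", "4xx", "5xx", "6xx"}

// inviteLabels is filled once at start-up: one shared map per
// (direction, class) pair, so recording a result never builds a map.
var inviteLabels = map[string]*[6]map[string]string{}

func init() {
	for _, dir := range []string{"inbound", "outbound"} {
		row := new([6]map[string]string)
		for i, class := range codeClasses {
			row[i] = map[string]string{"direction": dir, "code_class": class}
		}
		inviteLabels[dir] = row
	}
}

// labelsFor returns the shared label map for (direction, code), or nil when
// the direction is unknown or the code is outside 100..699.
func labelsFor(direction string, code int) map[string]string {
	row := inviteLabels[direction]
	if row == nil || code < 100 || code > 699 {
		return nil
	}
	return row[code/100-1]
}

func main() {
	for _, code := range []int{180, 200, 486, 603, 42} {
		fmt.Println(code, labelsFor("inbound", code))
	}
}
```

Two directions times six classes gives the 12 series mentioned next to MetricInviteResultTotal, and every lookup returns a map that already exists.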
func LabelsCall (added in v1.4.3)
func LabelsCall(transport, status string) map[string]string
LabelsCall composes a 2-key label set for the common (transport, status) shape used by voiceserver_calls_total. We pre-build the known combinations rather than allocating per-call. Add more statuses here if dashboards need to slice on them.
Return type is map[string]string to fit the existing API; pointer identity is preserved across calls so map-key dedupe inside the registry stays cheap.
func LabelsDialogOutcome (added in v1.4.3)
func LabelsDialogOutcome(transport, outcome string) map[string]string
LabelsDialogOutcome is used by DialogReconnect — bounded set of outcomes per the original API contract.
func ObserveAsync (added in v1.4.3)
func ObserveAsync(name, help string, v float64)
ObserveAsync queues a histogram sample on the global async drain. Hot-path safe: non-blocking, zero allocation, drops on full (incrementing the dropped-samples counter).
This is the recommended call for any observation that fires more than ~10x/sec per process. For one-off latencies (per turn, per call) the synchronous Default.Observe is fine and slightly more accurate (no buffering reorder concerns).
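The non-blocking, drop-on-full behaviour can be illustrated with a buffered channel and a `select` that has a `default` arm. This is a sketch of the general technique, assuming a channel-based queue; `sample`, `asyncQueue`, and its methods are hypothetical names, and the real drain may be built differently.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sample is one queued observation (hypothetical type).
type sample struct {
	name string
	v    float64
}

// asyncQueue is a bounded buffer plus a counter of discarded samples.
type asyncQueue struct {
	ch      chan sample
	dropped atomic.Uint64
}

func newAsyncQueue(size int) *asyncQueue {
	return &asyncQueue{ch: make(chan sample, size)}
}

// observe never blocks: when the buffer is full the sample is discarded
// and the drop counter is incremented instead.
func (q *asyncQueue) observe(name string, v float64) {
	select {
	case q.ch <- sample{name: name, v: v}:
	default:
		q.dropped.Add(1)
	}
}

// drain hands every queued sample to sink until the channel is closed.
func (q *asyncQueue) drain(sink func(sample)) {
	for s := range q.ch {
		sink(s)
	}
}

func main() {
	q := newAsyncQueue(2)
	// No drain is running yet, so only the first two samples fit.
	for i := 0; i < 5; i++ {
		q.observe("demo_latency_ms", float64(i))
	}
	close(q.ch)
	kept := 0
	q.drain(func(sample) { kept++ })
	fmt.Println("kept:", kept, "dropped:", q.dropped.Load())
}
```

The trade-off is explicit: the caller never waits, and a persistent non-zero drop count signals that the consumer side is stalled.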
func ObserveCallQoS
func ObserveCallQoS(rttMs uint32, jitterMs float64, lossFraction float64, mosEstimate float64)
ObserveCallQoS records the per-call RTCP-derived metrics. Call this ONCE per call at cleanup (after the last RTCPSnapshot). All inputs are optional; zero / negative values are skipped so "no data" doesn't pollute the distribution.
Hot path? No — this runs at most once per call (~0.02 Hz/leg). Cardinality? Zero labels — these are global distributions.
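The "skip zero and negative inputs" rule can be sketched as a pure function. `qosSamples` is a hypothetical helper that returns the samples that would be recorded; the metric names are taken from the constants in this package, and everything else is an assumption for illustration.

```go
package main

import "fmt"

// qosSamples returns the observations that pass the "positive values only"
// filter, keyed by metric name. Inputs that are zero or negative are treated
// as "no data" and left out.
func qosSamples(rttMs uint32, jitterMs, lossFraction, mosEstimate float64) map[string]float64 {
	out := make(map[string]float64, 4)
	if rttMs > 0 {
		out["sip_call_rtt_ms"] = float64(rttMs)
	}
	if jitterMs > 0 {
		out["sip_call_jitter_ms"] = jitterMs
	}
	if lossFraction > 0 {
		out["sip_call_loss_fraction"] = lossFraction
	}
	if mosEstimate > 0 {
		out["sip_call_mos_estimate"] = mosEstimate
	}
	return out
}

func main() {
	// Jitter and loss are unknown for this call, so only two samples remain.
	got := qosSamples(42, 0, -1, 4.1)
	fmt.Println(len(got), got["sip_call_rtt_ms"], got["sip_call_mos_estimate"])
}
```

One consequence of this rule, as documented, is that a measured loss of exactly zero is indistinguishable from "no data" and is not recorded.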
func ObserveE2EFirstByte (added in v1.4.3)
func ObserveE2EFirstByte(ms int)
ObserveE2EFirstByte records the user-perceived latency from ASR final to first audible AI byte. Only meaningful values (>0) should be passed — 0 means "no ASR final preceded this turn" which shouldn't skew the distribution.
func ObserveLLMFirstByte (added in v1.4.3)
func ObserveLLMFirstByte(ms int)
ObserveLLMFirstByte records the dialog app's reported time to first LLM token (ms). Comes from CommandMeta.LLMFirstMs on tts.speak.
func ObserveTTSFirstByte (added in v1.4.3)
func ObserveTTSFirstByte(ms int)
ObserveTTSFirstByte records Speak -> first PCM frame latency (ms). Measures the TTS engine's cold-start / TTFB across all turns.
func RegisterLabels (added in v1.4.3)
func RegisterLabels(metric string, keys ...string)
RegisterLabels declares the allowed label keys for a metric. Subsequent updates with extra keys will have those keys dropped (soft defense). Calling RegisterLabels twice for the same metric REPLACES the whitelist (last write wins) — intended for tests.
Safe to call from init().
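A soft whitelist of this kind might look like the sketch below. The `whitelist` type and its methods are hypothetical, and two behaviours are assumptions made for the example: metrics with no registered whitelist pass through unfiltered, and each dropped key increments the counter by one.

```go
package main

import (
	"fmt"
	"sync"
)

// whitelist maps a metric name to its allowed label keys and counts
// every key that had to be dropped.
type whitelist struct {
	mu      sync.Mutex
	allowed map[string]map[string]struct{}
	unknown uint64
}

func newWhitelist() *whitelist {
	return &whitelist{allowed: map[string]map[string]struct{}{}}
}

// register replaces the allowed keys for metric (last write wins).
func (w *whitelist) register(metric string, keys ...string) {
	set := make(map[string]struct{}, len(keys))
	for _, k := range keys {
		set[k] = struct{}{}
	}
	w.mu.Lock()
	w.allowed[metric] = set
	w.mu.Unlock()
}

// filter returns labels restricted to the registered keys. When nothing has
// to be dropped the input map is returned as-is, so the common case does
// not allocate.
func (w *whitelist) filter(metric string, labels map[string]string) map[string]string {
	w.mu.Lock()
	defer w.mu.Unlock()
	allowed, ok := w.allowed[metric]
	if !ok {
		return labels // assumption: unregistered metrics are not filtered
	}
	extra := 0
	for k := range labels {
		if _, ok := allowed[k]; !ok {
			extra++
		}
	}
	if extra == 0 {
		return labels
	}
	w.unknown += uint64(extra)
	out := make(map[string]string, len(labels)-extra)
	for k, v := range labels {
		if _, ok := allowed[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	w := newWhitelist()
	w.register("voiceserver_calls_total", "transport", "status")
	in := map[string]string{"transport": "sip", "status": "ok", "call_id": "abc"}
	fmt.Println(w.filter("voiceserver_calls_total", in), w.unknown)
}
```

Dropping the key instead of rejecting the update keeps the call site working while the counter makes the violation visible on /metrics.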
func STIRVerify
func STIRVerify(result string)
STIRVerify reports one STIR (RFC 8224) verification outcome.
func SessionTimerRefresh
func SessionTimerRefresh(result string)
SessionTimerRefresh counts one refresher state transition. Hot path: called from the outbound refresher response handler.
func TTSError (added in v1.4.3)
func TTSError(transport string)
TTSError bumps the TTS error counter. Called when Speak returns an error or is interrupted / drained before producing any audio.
func TransactionTimeout
func TransactionTimeout(method string)
TransactionTimeout reports a transaction-layer timeout (timer B/F fired). method is the SIP method name in upper case; the long tail of rarely seen methods is collapsed into "other" to keep cardinality bounded.
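Collapsing the long tail into a single label value can be sketched with a pre-built lookup table. The set of methods treated as known here (INVITE, BYE, CANCEL, REGISTER, OPTIONS) is an assumption for illustration, as are the names `methodLabels` and `labelsForMethod`.

```go
package main

import "fmt"

// methodLabels holds one shared label map per known method plus "other".
// Which methods count as known is an assumption for this sketch.
var methodLabels = func() map[string]map[string]string {
	m := map[string]map[string]string{}
	for _, name := range []string{"INVITE", "BYE", "CANCEL", "REGISTER", "OPTIONS", "other"} {
		m[name] = map[string]string{"method": name}
	}
	return m
}()

// labelsForMethod returns the shared label map for a known method and
// falls back to the "other" bucket for anything else.
func labelsForMethod(method string) map[string]string {
	if l, ok := methodLabels[method]; ok {
		return l
	}
	return methodLabels["other"]
}

func main() {
	fmt.Println(labelsForMethod("INVITE")["method"], labelsForMethod("PUBLISH")["method"])
}
```

With this shape the number of series is fixed by the table, regardless of what method names arrive on the wire.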
func VoiceAttach
func VoiceAttach(mode string, ok bool)
VoiceAttach bumps the voice-attach counter for one OnACK dispatch. Unknown mode / result strings are dropped silently — the goal is hot-path safety, not enforcement (dashboards alert on missing series, not on rejected inputs).
func VoiceAttachModeFallback
func VoiceAttachModeFallback(from, to string)
VoiceAttachModeFallback bumps the mode-fallback counter. Today this is only called when ResolveAttachMode promotes "pipeline" to "realtime" because pipeline creds are unusable. Future fallbacks would add new pre-allocated label maps and a switch arm.
func VoiceAttachNative
func VoiceAttachNative(ok bool)
VoiceAttachNative bumps the native-cascaded routing counter. ok reflects whether the native attach succeeded (engine.New + Attach both returned nil). Hot-path: same allocation profile as VoiceAttach.
Types
type Registry (added in v1.4.3)
type Registry struct {
// contains filtered or unexported fields
}
Registry is the single source of truth for VoiceServer process-level metrics. A call site imports the package and mutates the Default registry via helpers like IncCounter(); a single HTTP handler then serialises the registry to Prometheus text format on each /metrics scrape.
func NewRegistry (added in v1.4.3)
func NewRegistry() *Registry
NewRegistry returns an empty, ready-to-use registry.
func (*Registry) AddCounter (added in v1.4.3)
func (r *Registry) AddCounter(name, help string, labels map[string]string, n uint64)
AddCounter adds `n` to the counter. Prometheus counters are monotonic, and n is a uint64, so a call site cannot pass a negative delta and corrupt the series.
Labels are filtered through the cardinality whitelist registered via RegisterLabels (see labels.go). Unknown keys are dropped and reported via voiceserver_metrics_unknown_label_total (MetricUnknownLabelTotal).
func (*Registry) AddGauge (added in v1.4.3)
func (r *Registry) AddGauge(name, help string, labels map[string]string, v float64)
AddGauge increments (v > 0) / decrements (v < 0) a gauge atomically. Labels run through the cardinality whitelist (see labels.go).
func (*Registry) IncCounter (added in v1.4.3)
func (r *Registry) IncCounter(name, help string, labels map[string]string)
IncCounter bumps a counter by 1. Safe to call from hot paths.
func (*Registry) Observe (added in v1.4.3)
func (r *Registry) Observe(name, help string, v float64)
Observe records one sample into a histogram. The registry keeps at most a fixed number of the most recent observations to bound memory (ObserveN takes that cap explicitly as `maxSamples`); older values are dropped in FIFO order. Quantiles are computed at scrape time from the live buffer, so a /metrics request is O(n log n) in buffer size, which is fine for n up to a few thousand.
func (*Registry) ObserveN (added in v1.4.3)
func (r *Registry) ObserveN(name, help string, v float64, maxSamples int)
ObserveN is Observe with a custom buffer cap. Use when you want finer control over memory vs resolution (e.g. 8192 for a hot call latency signal you scrape every 10s).
func (*Registry) SetGauge (added in v1.4.3)
func (r *Registry) SetGauge(name, help string, labels map[string]string, v float64)
SetGauge stores a value for a gauge. Labels run through the cardinality whitelist (see labels.go).
func (*Registry) WritePromText (added in v1.4.3)
func (r *Registry) WritePromText(w io.Writer)
WritePromText serialises the registry in Prometheus text exposition format (v0.0.4). Safe to call concurrently with metric updates; snapshot is point-in-time per metric.