voiceagent

package
v0.40.7 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 28, 2026 License: Apache-2.0 Imports: 32 Imported by: 0

Documentation

Overview

Package voiceagent implements the Voice Agent WebSocket surface on the Server-Target. It provides three things:

  1. A Session Manager that tracks active Voice Agent sessions, enforces per-identity and global concurrency limits, and mints HMAC-signed one-time tickets so browser clients can upgrade a WebSocket without sending Authorization headers.
  2. A wire protocol (see protocol.go) that carries control frames as JSON and audio frames as binary.
  3. An adapter that bridges the WebSocket to the Framework kernel's internal/voiceagent.Session without the kernel needing to know anything about HTTP.

This file covers the Session Manager and ticket machinery. The wire protocol, WebSocket handler, and adapter live in sibling files.

Index

Constants

View Source
const (
	MsgStart        = "start"
	MsgAudioEnd     = "audio_end"
	MsgText         = "text"
	MsgToolResponse = "tool_response"
	MsgPing         = "ping"
	MsgStop         = "stop"
	MsgAdvanceStep  = "advance_step"
)

Client-to-server message types.

View Source
const (
	MediaTransportWebSocket = "websocket"
	MediaTransportLiveKit   = "livekit"
)
View Source
const (
	MsgState            = "state"
	MsgInputTranscript  = "input_transcript"
	MsgOutputTranscript = "output_transcript"
	MsgToolCall         = "tool_call"
	MsgSequenceStep     = "sequence_step"
	MsgInterrupted      = "interrupted"
	MsgError            = "error"
	MsgSessionEnd       = "session_end"
	MsgPong             = "pong"
)

Server-to-client message types.

Variables

View Source
var (
	ErrSessionNotFound       = errors.New("voiceagent: session not found")
	ErrSessionAlreadyActive  = errors.New("voiceagent: session already has an active WS connection")
	ErrSessionExpired        = errors.New("voiceagent: session ticket expired")
	ErrInvalidTicket         = errors.New("voiceagent: ticket signature or payload invalid")
	ErrGlobalLimitExceeded   = errors.New("voiceagent: global session limit exceeded")
	ErrIdentityLimitExceeded = errors.New("voiceagent: per-identity session limit exceeded")
)

Common errors reported by the session manager.

View Source
var NewAgentFlowAdapter = cascaded.NewAgentFlowAdapter

NewAgentFlowAdapter re-exports the cross-platform helper used by internal/server/core/voiceagent_wiring.go.

Functions

func RenderHostInstructionUpdate added in v0.28.2

func RenderHostInstructionUpdate(cfg LiveConfigFrame) string

Types

type ActivityDetectionFrame

type ActivityDetectionFrame struct {
	Automatic         bool   `json:"automatic"`
	StartSensitivity  string `json:"start_sensitivity,omitempty"`
	EndSensitivity    string `json:"end_sensitivity,omitempty"`
	PrefixPaddingMs   int32  `json:"prefix_padding_ms,omitempty"`
	SilenceDurationMs int32  `json:"silence_duration_ms,omitempty"`
	ActivityHandling  string `json:"activity_handling,omitempty"`
	TurnCoverage      string `json:"turn_coverage,omitempty"`
}

ActivityDetectionFrame maps to voiceagent.ActivityDetectionPolicy.

type Adapter

type Adapter struct {
	Session  *ManagedSession
	Conn     *websocket.Conn
	Provider LiveProviderAdapter
	Persona  PersonaResolver
	// MediaBridge starts an optional LiveKit media bridge for sessions that
	// keep this WebSocket as control transport and move audio through LiveKit.
	MediaBridge MediaBridgeFactory
	// IdleTimeout terminates a session whose readPump and writePump have
	// both been silent for the duration. Zero disables the server-side
	// idle watchdog. Defaults to 15 minutes when set by the WS handler.
	IdleTimeout time.Duration
	// OnClose runs after both pumps have returned. Typically removes the
	// session from the manager.
	OnClose func()
	// contains filtered or unexported fields
}

Adapter bridges a live WebSocket conversation to the Framework kernel's voiceagent session. One Adapter instance handles exactly one session, which matches the manager's concurrency model.

The adapter owns two long-lived goroutines:

readPump:  WebSocket → kernel (control + audio frames from client)
writePump: kernel → WebSocket (audio + transcript + tool-call frames)

When either pump errors or the provider reports session_end, the adapter closes the socket, calls OnClose, and returns from Run.

func (*Adapter) Run

func (a *Adapter) Run(parent context.Context)

Run blocks until the session ends. The first frame from the client MUST be a StartFrame; if it isn't, the adapter closes with an error.

type AdvanceStepFrame added in v0.28.2

type AdvanceStepFrame struct {
	Type   string `json:"type"` // "advance_step"
	StepID string `json:"step_id,omitempty"`
	Reason string `json:"reason,omitempty"`
}

AdvanceStepFrame asks the server-side workflow runner to move from the active sequence step to the next step. StepID is reserved for future direct jumps; v1 advances linearly through the authored sequence.

type CascadedAgent

type CascadedAgent = cascaded.Agent

Re-export the cascaded types under their historical server-package names. These aliases are intentionally type aliases (not new types) so existing code that passes app.STTRouter / app.TTSRouter etc. continues to type-check without conversions.

type CascadedConfig

type CascadedConfig = cascaded.Config

Re-export the cascaded types under their historical server-package names. These aliases are intentionally type aliases (not new types) so existing code that passes app.STTRouter / app.TTSRouter etc. continues to type-check without conversions.

type CascadedDeps

type CascadedDeps = cascaded.Deps

Re-export the cascaded types under their historical server-package names. These aliases are intentionally type aliases (not new types) so existing code that passes app.STTRouter / app.TTSRouter etc. continues to type-check without conversions.

type CascadedProvider

type CascadedProvider struct {
	// contains filtered or unexported fields
}

CascadedProvider wraps cascaded.Provider and adapts its SessionConfig/Message types to the server's LiveConfigFrame/ LiveMessage types so the result satisfies LiveProviderAdapter.

func NewCascadedProvider

func NewCascadedProvider(deps CascadedDeps) *CascadedProvider

NewCascadedProvider preserves the historical constructor name.

func (*CascadedProvider) Close

func (p *CascadedProvider) Close() error

Close delegates.

func (*CascadedProvider) Connect

func (p *CascadedProvider) Connect(ctx context.Context, cfg LiveConfigFrame) error

Connect translates the rich server LiveConfigFrame into a minimal cascaded.SessionConfig and forwards.

func (*CascadedProvider) Name

func (p *CascadedProvider) Name() string

Name returns the provider identifier.

func (*CascadedProvider) Receive

func (p *CascadedProvider) Receive(ctx context.Context) (*LiveMessage, error)

Receive blocks for the next cascaded.Message and translates it into the server's LiveMessage type.

func (*CascadedProvider) SendAudio

func (p *CascadedProvider) SendAudio(chunk []byte) error

SendAudio delegates.

func (*CascadedProvider) SendAudioStreamEnd

func (p *CascadedProvider) SendAudioStreamEnd() error

SendAudioStreamEnd delegates.

func (*CascadedProvider) SendText

func (p *CascadedProvider) SendText(text string) error

SendText delegates.

func (*CascadedProvider) UpdateInstructions added in v0.28.2

func (p *CascadedProvider) UpdateInstructions(ctx context.Context, cfg LiveConfigFrame) error

UpdateInstructions translates rich LiveConfigFrame to SessionConfig.

type CascadedSTT

type CascadedSTT = cascaded.STT

Re-export the cascaded types under their historical server-package names. These aliases are intentionally type aliases (not new types) so existing code that passes app.STTRouter / app.TTSRouter etc. continues to type-check without conversions.

type CascadedTTS

type CascadedTTS = cascaded.TTS

Re-export the cascaded types under their historical server-package names. These aliases are intentionally type aliases (not new types) so existing code that passes app.STTRouter / app.TTSRouter etc. continues to type-check without conversions.

type ErrorFrame

type ErrorFrame struct {
	Type    string `json:"type"` // "error"
	Code    string `json:"code"`
	Message string `json:"message"`
}

type Handler

type Handler struct {
	// contains filtered or unexported fields
}

Handler exposes both the HTTP session-creation endpoint and the WS upgrade endpoint under /v1/voiceagent/*.

func New

func New(opts HandlerOptions) (*Handler, error)

New constructs a handler. All options except MaxAllowedClockSkew are required — the adapter cannot function without a manager, provider, and persona resolver.

func (*Handler) Mount

func (h *Handler) Mount(mux *http.ServeMux)

Mount wires the voiceagent endpoints onto mux:

POST   /v1/voiceagent/sessions        — create session + mint ticket
GET    /v1/voiceagent/sessions        — list caller's active sessions
DELETE /v1/voiceagent/sessions/{id}   — force close a session
GET    /v1/voiceagent/sessions/{id}/ws — upgrade to WebSocket with Sec-WebSocket-Protocol: ticket.<ticket>

type HandlerOptions

type HandlerOptions struct {
	Manager             *SessionManager
	Provider            ProviderFactory
	Persona             PersonaResolver
	PublicURL           string
	AllowedOrigins      []string
	MaxAllowedClockSkew time.Duration
	// IdleTimeout terminates a session that hasn't seen any activity
	// (client frame OR provider message) within the duration. Zero
	// disables the server-side idle watchdog. Defaults to 15 minutes
	// when zero is passed; pass a negative value to disable explicitly.
	IdleTimeout time.Duration
	Store       store.Store
	LiveKit     *LiveKitTokenIssuer
	MediaBridge MediaBridgeFactory
	// ReadLimit caps per-frame bytes the upgraded WebSocket will read.
	// Zero or negative falls back to defaultWSReadLimitBytes (64 KiB).
	ReadLimit int64
}

HandlerOptions configures the WebSocket handler.

type Identity

type Identity struct {
	UserID string
	OrgID  string
	Plan   string
	Role   string
}

Identity is the caller's resolved identity from the auth middleware. Stored on each session so closes can verify ownership.

type InterruptedFrame

type InterruptedFrame struct {
	Type string `json:"type"` // "interrupted"
}

type LiveConfigFrame

type LiveConfigFrame struct {
	PersonaID string
	RoleID    string
	// Sequence metadata is internal runtime state derived from the
	// persona/role resolver. It lets the WebSocket adapter report and advance
	// workflow steps without coupling to the persona package.
	SequenceID         string
	SequenceCompletion string
	SequenceMaxTurns   int
	StepID             string
	StepIndex          int
	StepCount          int
	StepInstruction    string
	StepExitCriteria   string
	StepMaxTurns       int

	Model string
	// FallbackModel is forwarded to providers that support same-provider
	// fallback (kernel Gemini Live retries the fallback when the primary
	// connect fails). Empty disables the fallback.
	FallbackModel    string
	APIKey           string
	Voice            string
	SystemPrompt     string
	RefinementPrompt string
	Locale           string
	// Raw activity-detection passthrough; adapter translates to the
	// kernel's internal types.
	Automatic         bool
	StartSensitivity  string
	EndSensitivity    string
	PrefixPaddingMs   int32
	SilenceDurationMs int32
	ActivityHandling  string
	TurnCoverage      string
}

LiveConfigFrame is the subset of configuration the adapter derives from a StartFrame and the persona/role resolver. Kept as a separate type so the test double doesn't need to depend on the kernel's concrete LiveConfig (which embeds Google genai types).

type LiveInstructionUpdater added in v0.28.2

type LiveInstructionUpdater interface {
	UpdateInstructions(ctx context.Context, cfg LiveConfigFrame) error
}

LiveInstructionUpdater is implemented by providers that can update their active host instructions without treating the update as a user turn.

type LiveKitJoinInfo added in v0.31.0

type LiveKitJoinInfo struct {
	URL                 string `json:"url"`
	Room                string `json:"room"`
	Token               string `json:"token"`
	TokenExpiresAt      string `json:"token_expires_at"`
	ParticipantIdentity string `json:"participant_identity"`
}

type LiveKitMediaBridgeFactory added in v0.31.0

type LiveKitMediaBridgeFactory struct {
	Issuer         *LiveKitTokenIssuer
	ConnectTimeout time.Duration
	// contains filtered or unexported fields
}

LiveKitMediaBridgeFactory joins the existing session room as an agent participant. It subscribes to client microphone Opus tracks, decodes them into 16 kHz S16 mono PCM for the realtime provider, and publishes provider PCM output as an Opus track back into the room.

func NewLiveKitMediaBridgeFactory added in v0.31.0

func NewLiveKitMediaBridgeFactory(issuer *LiveKitTokenIssuer) *LiveKitMediaBridgeFactory

func (*LiveKitMediaBridgeFactory) Start added in v0.31.0

type LiveKitTokenIssuer added in v0.31.0

type LiveKitTokenIssuer struct {
	URL        string
	APIKey     string
	APISecret  string
	TokenTTL   time.Duration
	RoomPrefix string
	Clock      func() time.Time
}

LiveKitTokenIssuer mints short-lived LiveKit join tokens for authenticated SpeechKit voice-agent sessions. It deliberately does not create rooms via the LiveKit admin API: a scoped join token is enough for LiveKit to lazily create self-hosted rooms, while keeping this path independent of Twirp.

func (*LiveKitTokenIssuer) Enabled added in v0.31.0

func (i *LiveKitTokenIssuer) Enabled() bool

func (*LiveKitTokenIssuer) IssueJoinToken added in v0.31.0

func (i *LiveKitTokenIssuer) IssueJoinToken(_ context.Context, sessionID string, owner Identity) (LiveKitJoinInfo, error)

type LiveKitTransportSupport added in v0.31.0

type LiveKitTransportSupport interface {
	SupportsLiveKitTransport() bool
}

LiveKitTransportSupport marks providers whose realtime APIs consume and emit raw PCM audio without SpeechKit-side transcoding.

type LiveMessage

type LiveMessage struct {
	Audio                []byte
	OutputTranscript     string
	OutputTranscriptDone bool
	InputTranscript      string
	InputTranscriptDone  bool
	ToolCalls            []ToolCall
	Interrupted          bool
	GoAway               bool
}

LiveMessage is the subset of kernel/internal/voiceagent.LiveMessage the adapter relays to the client. Matching field names keep the translation trivial.

type LiveProviderAdapter

type LiveProviderAdapter interface {
	Connect(ctx context.Context, cfg LiveConfigFrame) error
	SendAudio(chunk []byte) error
	SendAudioStreamEnd() error
	SendText(text string) error
	Receive(ctx context.Context) (*LiveMessage, error)
	Close() error
	Name() string
}

LiveProviderAdapter is the minimal slice of the kernel's LiveProvider interface the Server-Target adapter needs. Keeping it narrow makes it trivial for tests to stub without pulling in the full genai SDK.

type LiveToolResponder added in v0.28.2

type LiveToolResponder interface {
	SendToolResponse(ToolResponseFrame) error
}

LiveToolResponder is implemented by providers that accept host-side tool results from the client.

type ManagedSession

type ManagedSession struct {
	ID          string
	Owner       Identity
	CreatedAt   time.Time
	State       State
	HasWSClient bool // true once the client has upgraded
}

ManagedSession is the manager's record for one active or pending session. The WebSocket handler and adapter pull additional fields (conn, pumps) onto this struct at handshake time.

type MediaBridge added in v0.31.0

type MediaBridge interface {
	SendAudio([]byte) error
	Close() error
}

MediaBridge moves provider output audio to the selected media transport and delivers inbound transport audio to the provider.

type MediaBridgeFactory added in v0.31.0

type MediaBridgeFactory interface {
	Start(ctx context.Context, req MediaBridgeRequest) (MediaBridge, error)
}

MediaBridgeFactory creates the media transport sidecar for a session. The WebSocket stays the control channel; a bridge only carries PCM audio.

type MediaBridgeRequest added in v0.31.0

type MediaBridgeRequest struct {
	SessionID string
	Owner     Identity
	Provider  LiveProviderAdapter
}

MediaBridgeRequest is the minimum session/provider state needed to attach LiveKit media to an already accepted Voice Agent session.

type Options

type Options struct {
	// TicketSecret signs the one-time WS upgrade tickets. Must be at least
	// 16 bytes; when empty, the manager generates a random secret at
	// construction time (acceptable only for single-process deployments).
	TicketSecret []byte
	// TicketTTL limits how long a minted ticket remains valid. Defaults to
	// 90 seconds.
	TicketTTL time.Duration
	// MaxGlobalSessions caps concurrent sessions across all callers.
	// Defaults to 100.
	MaxGlobalSessions int
	// MaxPerIdentitySessions caps concurrent sessions per caller.
	// Defaults to 3.
	MaxPerIdentitySessions int
	// Clock is a time source; nil means time.Now. Tests can override.
	Clock func() time.Time
}

Options configures a SessionManager.

type PersonaResolver

type PersonaResolver interface {
	Resolve(StartFrame) (LiveConfigFrame, error)
}

PersonaResolver derives a LiveConfigFrame from a StartFrame. The server's persona registry implements this in M5; for M4 a stub resolver is used that echoes the StartFrame through with sensible defaults.

type PongFrame

type PongFrame struct {
	Type string `json:"type"` // "pong"
}

type ProviderFactory

type ProviderFactory interface {
	NewProvider() LiveProviderAdapter
}

ProviderFactory builds a Framework kernel voice-agent provider on demand. Each WebSocket session gets its own provider instance so concurrent sessions don't share Gemini Live state.

The concrete production implementation returns a *voiceagent.GeminiLive; tests supply a fake that records frames and replies with canned audio.

type SequenceRunner added in v0.28.2

type SequenceRunner struct {
	// contains filtered or unexported fields
}

SequenceRunner owns per-session workflow state for one Voice Agent WebSocket. It is small on purpose: persona resolution remains the source of truth for prompts and defaults.

func NewSequenceRunner added in v0.28.2

func NewSequenceRunner(start StartFrame, initial LiveConfigFrame, resolver StepResolver) *SequenceRunner

func (*SequenceRunner) Active added in v0.28.2

func (r *SequenceRunner) Active() bool

func (*SequenceRunner) Advance added in v0.28.2

func (*SequenceRunner) Current added in v0.28.2

func (r *SequenceRunner) Current() LiveConfigFrame

func (*SequenceRunner) InitialEnteredFrame added in v0.28.2

func (r *SequenceRunner) InitialEnteredFrame() *SequenceStepFrame

func (*SequenceRunner) RecordUserTurn added in v0.28.2

func (r *SequenceRunner) RecordUserTurn() bool

type SequenceStepFrame

type SequenceStepFrame struct {
	Type       string `json:"type"` // "sequence_step"
	SequenceID string `json:"sequence_id,omitempty"`
	StepID     string `json:"step_id"`
	StepIndex  int    `json:"step_index,omitempty"`
	Status     string `json:"status"` // "entered" | "completed" | "sequence_completed"
	Reason     string `json:"reason,omitempty"`
}

type SequenceTransition added in v0.28.2

type SequenceTransition struct {
	Completed         *SequenceStepFrame
	Entered           *SequenceStepFrame
	NextConfig        LiveConfigFrame
	SequenceCompleted bool
}

SequenceTransition describes the frames and provider update produced by an advance operation.

type SessionEndFrame

type SessionEndFrame struct {
	Type   string `json:"type"`   // "session_end"
	Reason string `json:"reason"` // "idle" | "go_away" | "client" | "error" | "shutdown"
}

type SessionManager

type SessionManager struct {
	// contains filtered or unexported fields
}

SessionManager tracks active Voice Agent sessions.

func NewSessionManager

func NewSessionManager(opts Options) (*SessionManager, error)

NewSessionManager constructs a manager with opts. Unset fields receive sensible defaults.

func (*SessionManager) Attach

func (m *SessionManager) Attach(id string) error

Attach moves a session from pending to active. Returns an error when the session is already attached to another WS client or has been closed.

func (*SessionManager) Create

func (m *SessionManager) Create(owner Identity) (*ManagedSession, string, error)

Create registers a new pending session for the given caller and returns the session record plus a one-time WS upgrade ticket. Fails with a limit error when concurrency caps are exceeded.

func (*SessionManager) Get

func (m *SessionManager) Get(id string) (*ManagedSession, error)

Get returns the session with the given ID, or ErrSessionNotFound.

func (*SessionManager) List

func (m *SessionManager) List(userID string) []*ManagedSession

List returns a snapshot of sessions owned by the given user (or all when userID is empty — reserved for admin callers).

func (*SessionManager) Remove

func (m *SessionManager) Remove(id string)

Remove closes and removes the session. Safe to call multiple times.

func (*SessionManager) Stats added in v0.40.6

func (m *SessionManager) Stats(userID string) SessionStats

Stats returns a point-in-time session and capacity snapshot.

func (*SessionManager) VerifyTicket

func (m *SessionManager) VerifyTicket(sessionID, ticket string) error

VerifyTicket validates a ticket string against its expected session ID. Returns nil when the ticket is cryptographically valid, un-expired, and bound to this sessionID. Mutates nothing; callers must follow up with Attach to mark the session active.

type SessionStats added in v0.40.6

type SessionStats struct {
	TotalSessions          int `json:"total_sessions"`
	ActiveSessions         int `json:"active_sessions"`
	PendingSessions        int `json:"pending_sessions"`
	IdentitySessions       int `json:"identity_sessions,omitempty"`
	MaxGlobalSessions      int `json:"max_global_sessions"`
	MaxPerIdentitySessions int `json:"max_per_identity_sessions"`
}

SessionStats reports current session occupancy and configured capacity.

type StartFrame

type StartFrame struct {
	Type       string `json:"type"` // must be "start"
	PersonaID  string `json:"persona_id,omitempty"`
	RoleID     string `json:"role_id,omitempty"`
	SequenceID string `json:"sequence_id,omitempty"`
	// MediaTransport selects where microphone and model audio move. Empty
	// defaults to "websocket" for existing clients. "livekit" keeps this
	// WebSocket as the control channel and moves audio through LiveKit tracks.
	MediaTransport string `json:"media_transport,omitempty"`
	Voice          string `json:"voice,omitempty"`
	Locale         string `json:"locale,omitempty"`
	Model          string `json:"model,omitempty"`
	Thinking       string `json:"thinking,omitempty"` // "off" | "low" | "medium" | "high"
	// Raw activity-detection policy override. Pipeline translates these to
	// the kernel's internal enums.
	ActivityDetection *ActivityDetectionFrame `json:"activity_detection,omitempty"`
	// Optional durable instruction layered on top of the role's prompt.
	SystemPromptOverride string `json:"system_prompt_override,omitempty"`
}

StartFrame is the mandatory first client frame. Fields marked optional fall back to persona/role defaults when omitted.

type State

type State string

State captures the session's manager-level lifecycle. The Framework kernel (internal/voiceagent.Session) has its own finer-grained state; this one only distinguishes pending ticket vs. active connection.

const (
	StatePendingWS State = "pending_ws"
	StateActive    State = "active"
	StateClosed    State = "closed"
)

type StateFrame

type StateFrame struct {
	Type  string `json:"type"` // "state"
	State string `json:"state"`
}

type StepResolver added in v0.28.2

type StepResolver interface {
	ResolveStep(StartFrame, int) (LiveConfigFrame, error)
}

StepResolver is the optional extension implemented by persona resolvers that can compose a LiveConfigFrame for a specific sequence step.

type TextFrame

type TextFrame struct {
	Type string `json:"type"` // "text"
	Text string `json:"text"`
}

TextFrame carries an injected text turn from the client.

type ToolCall added in v0.28.2

type ToolCall struct {
	ID   string
	Name string
	Args map[string]any
}

type ToolCallFrame

type ToolCallFrame struct {
	Type string         `json:"type"` // "tool_call"
	ID   string         `json:"id"`
	Name string         `json:"name"`
	Args map[string]any `json:"args,omitempty"`
}

type ToolResponseFrame

type ToolResponseFrame struct {
	Type     string         `json:"type"` // "tool_response"
	ID       string         `json:"id"`
	Name     string         `json:"name"`
	Response map[string]any `json:"response"`
}

ToolResponseFrame resolves a tool call issued by the server.

type TranscriptFrame

type TranscriptFrame struct {
	Type string `json:"type"` // "input_transcript" | "output_transcript"
	Text string `json:"text"`
	Done bool   `json:"done"`
}

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL