Documentation ¶
Overview ¶
Package audio provides voice activity detection (VAD), turn detection, and audio session management for real-time voice AI applications.
The package separates three concerns that are common in voice AI pipelines:
- VAD (Voice Activity Detection): Detects when someone is speaking vs. silent
- Turn Detection: Determines when a speaker has finished their turn
- Interruption Handling: Manages the user interrupting bot output
Architecture ¶
Audio processing follows a two-stage approach:
- VADAnalyzer detects voice activity in real-time
- TurnDetector uses VAD output plus additional signals to detect turn boundaries
Usage Example ¶
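vad := audio.NewSimpleVAD(audio.DefaultVADParams())
detector := audio.NewSilenceDetector(500 * time.Millisecond)
for chunk := range audioStream {
	// Analyze updates the VAD state; the returned probability is not needed here.
	if _, err := vad.Analyze(ctx, chunk); err != nil {
		break // handle the error as appropriate
	}
	// State is part of the VADAnalyzer interface.
	ended, err := detector.ProcessVADState(ctx, vad.State())
	if err != nil {
		break
	}
	if ended {
		// User finished speaking
	}
}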
Package audio also provides audio processing utilities, such as PCM16 resampling.
Index ¶
- Constants
- func Resample24kTo16k(input []byte) ([]byte, error)
- func ResamplePCM16(input []byte, fromRate, toRate int) ([]byte, error)
- type AccumulatingTurnDetector
- type InterruptionCallback
- type InterruptionHandler
- func (h *InterruptionHandler) IsBotSpeaking() bool
- func (h *InterruptionHandler) NotifySentenceBoundary()
- func (h *InterruptionHandler) OnInterrupt(callback InterruptionCallback)
- func (h *InterruptionHandler) ProcessAudio(ctx context.Context, audio []byte) (bool, error)
- func (h *InterruptionHandler) ProcessVADState(ctx context.Context, state VADState) (bool, error)
- func (h *InterruptionHandler) Reset()
- func (h *InterruptionHandler) SetBotSpeaking(speaking bool)
- func (h *InterruptionHandler) WasInterrupted() bool
- type InterruptionStrategy
- type SilenceDetector
- func (d *SilenceDetector) GetAccumulatedAudio() []byte
- func (d *SilenceDetector) IsUserSpeaking() bool
- func (d *SilenceDetector) Name() string
- func (d *SilenceDetector) OnTurnComplete(callback TurnCallback)
- func (d *SilenceDetector) ProcessAudio(ctx context.Context, audio []byte) (bool, error)
- func (d *SilenceDetector) ProcessVADState(ctx context.Context, state VADState) (bool, error)
- func (d *SilenceDetector) Reset()
- func (d *SilenceDetector) SetTranscript(transcript string)
- type SimpleVAD
- type TurnCallback
- type TurnDetector
- type VADAnalyzer
- type VADEvent
- type VADParams
- type VADState
- type ValidationError
Constants ¶
const (
	SampleRate24kHz = 24000 // Common TTS output rate
	SampleRate16kHz = 16000 // Common STT/ASR input rate
)
Standard audio sample rates for common use cases.
const (
	DefaultVADConfidence = 0.5
	DefaultVADStartSecs  = 0.2
	DefaultVADStopSecs   = 0.8
	DefaultVADMinVolume  = 0.01
	DefaultVADSampleRate = 16000
)
Default VAD parameter values.
Variables ¶
This section is empty.
Functions ¶
func Resample24kTo16k ¶
func Resample24kTo16k(input []byte) ([]byte, error)
Resample24kTo16k is a convenience function for the common case of resampling from 24kHz (TTS output) to 16kHz (Gemini input).
Types ¶
type AccumulatingTurnDetector ¶
type AccumulatingTurnDetector interface {
TurnDetector
// OnTurnComplete registers a callback for when a complete turn is detected.
OnTurnComplete(callback TurnCallback)
// GetAccumulatedAudio returns audio accumulated so far (may be incomplete turn).
GetAccumulatedAudio() []byte
// SetTranscript sets the transcript for the current turn (from external STT).
SetTranscript(transcript string)
}
AccumulatingTurnDetector is a TurnDetector that accumulates audio during a turn.
type InterruptionCallback ¶
type InterruptionCallback func()
InterruptionCallback is called when the user interrupts the bot.
type InterruptionHandler ¶
type InterruptionHandler struct {
// contains filtered or unexported fields
}
InterruptionHandler manages user interruption logic during bot output.
func NewInterruptionHandler ¶
func NewInterruptionHandler(strategy InterruptionStrategy, vad VADAnalyzer) *InterruptionHandler
NewInterruptionHandler creates an InterruptionHandler with the given strategy and VAD.
func (*InterruptionHandler) IsBotSpeaking ¶
func (h *InterruptionHandler) IsBotSpeaking() bool
IsBotSpeaking returns true if the bot is currently outputting audio.
func (*InterruptionHandler) NotifySentenceBoundary ¶
func (h *InterruptionHandler) NotifySentenceBoundary()
NotifySentenceBoundary notifies the handler of a sentence boundary. For deferred interruption strategy, this may trigger the pending interruption.
func (*InterruptionHandler) OnInterrupt ¶
func (h *InterruptionHandler) OnInterrupt(callback InterruptionCallback)
OnInterrupt registers a callback for when interruption occurs.
func (*InterruptionHandler) ProcessAudio ¶
func (h *InterruptionHandler) ProcessAudio(ctx context.Context, audio []byte) (bool, error)
ProcessAudio processes audio and detects user interruption. Returns true if an interruption was detected and should be acted upon.
func (*InterruptionHandler) ProcessVADState ¶
func (h *InterruptionHandler) ProcessVADState(ctx context.Context, state VADState) (bool, error)
ProcessVADState processes a VAD state update for interruption detection. Returns true if an interruption was detected and should be acted upon.
func (*InterruptionHandler) Reset ¶
func (h *InterruptionHandler) Reset()
Reset clears interruption state for a new turn.
func (*InterruptionHandler) SetBotSpeaking ¶
func (h *InterruptionHandler) SetBotSpeaking(speaking bool)
SetBotSpeaking sets whether the bot is currently outputting audio.
func (*InterruptionHandler) WasInterrupted ¶
func (h *InterruptionHandler) WasInterrupted() bool
WasInterrupted returns true if an interruption occurred.
type InterruptionStrategy ¶
type InterruptionStrategy int
InterruptionStrategy determines how to handle the user interrupting the bot.
const (
	// InterruptionIgnore ignores user speech during bot output.
	InterruptionIgnore InterruptionStrategy = iota
	// InterruptionImmediate immediately stops bot and starts listening.
	InterruptionImmediate
	// InterruptionDeferred waits for bot's current sentence, then switches.
	InterruptionDeferred
)
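To make the difference between the three strategies concrete, here is a minimal sketch of the decision logic they describe. The handler type and its methods are hypothetical stand-ins written for this example; they mirror the documented behavior but are not the package's InterruptionHandler.

```go
package main

import "fmt"

// strategy mirrors the three documented strategies (local stand-in type).
type strategy int

const (
	ignore strategy = iota
	immediate
	deferred
)

// handler is a hypothetical illustration of interruption bookkeeping.
type handler struct {
	strat       strategy
	botSpeaking bool
	pending     bool // deferred interruption waiting for a sentence boundary
	interrupted bool
}

// userSpeech is called when the user starts talking. It reports whether the
// interruption should be acted upon right now.
func (h *handler) userSpeech() bool {
	if !h.botSpeaking {
		return false // nothing to interrupt
	}
	switch h.strat {
	case immediate:
		h.interrupted = true
		return true
	case deferred:
		h.pending = true // remember it, act at the next sentence boundary
	}
	return false // ignore, or deferred and still waiting
}

// sentenceBoundary is called when the bot finishes a sentence. It reports
// whether a pending deferred interruption fires.
func (h *handler) sentenceBoundary() bool {
	if h.strat == deferred && h.pending {
		h.pending = false
		h.interrupted = true
		return true
	}
	return false
}

func main() {
	h := &handler{strat: deferred, botSpeaking: true}
	fmt.Println(h.userSpeech())       // false: interruption is held back
	fmt.Println(h.sentenceBoundary()) // true: fires at the sentence boundary
}
```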
func (InterruptionStrategy) String ¶
func (s InterruptionStrategy) String() string
String returns a human-readable representation of the interruption strategy.
type SilenceDetector ¶
type SilenceDetector struct {
// Threshold is the silence duration required to trigger turn end.
Threshold time.Duration
// contains filtered or unexported fields
}
SilenceDetector detects turn boundaries based on silence duration. It triggers end-of-turn when silence exceeds a configurable threshold.
func NewSilenceDetector ¶
func NewSilenceDetector(threshold time.Duration) *SilenceDetector
NewSilenceDetector creates a SilenceDetector with the given threshold. threshold is the duration of silence required to trigger end-of-turn.
func (*SilenceDetector) GetAccumulatedAudio ¶
func (d *SilenceDetector) GetAccumulatedAudio() []byte
GetAccumulatedAudio returns audio accumulated so far.
func (*SilenceDetector) IsUserSpeaking ¶
func (d *SilenceDetector) IsUserSpeaking() bool
IsUserSpeaking returns true if user is currently speaking.
func (*SilenceDetector) Name ¶
func (d *SilenceDetector) Name() string
Name returns the detector identifier.
func (*SilenceDetector) OnTurnComplete ¶
func (d *SilenceDetector) OnTurnComplete(callback TurnCallback)
OnTurnComplete registers a callback for when a complete turn is detected.
func (*SilenceDetector) ProcessAudio ¶
func (d *SilenceDetector) ProcessAudio(ctx context.Context, audio []byte) (bool, error)
ProcessAudio processes an incoming audio chunk. This implementation delegates to ProcessVADState and expects VAD to be run separately. Returns true if end of turn is detected.
func (*SilenceDetector) ProcessVADState ¶
func (d *SilenceDetector) ProcessVADState(ctx context.Context, state VADState) (bool, error)
ProcessVADState processes a VAD state update and detects turn boundaries. Returns true if end of turn is detected.
func (*SilenceDetector) Reset ¶
func (d *SilenceDetector) Reset()
Reset clears state for a new conversation.
func (*SilenceDetector) SetTranscript ¶
func (d *SilenceDetector) SetTranscript(transcript string)
SetTranscript sets the transcript for the current turn.
type SimpleVAD ¶
type SimpleVAD struct {
// contains filtered or unexported fields
}
SimpleVAD is a basic voice activity detector using RMS (Root Mean Square) analysis. It provides a lightweight VAD implementation without requiring external ML models. For more accurate detection, consider using SileroVAD.
func NewSimpleVAD ¶
NewSimpleVAD creates a SimpleVAD analyzer with the given parameters.
func (*SimpleVAD) Analyze ¶
Analyze processes audio and returns voice probability based on RMS volume.
func (*SimpleVAD) OnStateChange ¶
OnStateChange returns a channel that receives state transitions.
type TurnCallback ¶
TurnCallback is called when a complete user turn is detected. audio contains the accumulated audio for the turn. transcript contains any accumulated transcript (may be empty).
type TurnDetector ¶
type TurnDetector interface {
// Name returns the detector identifier.
Name() string
// ProcessAudio processes an incoming audio chunk.
// Returns true if end of turn is detected.
ProcessAudio(ctx context.Context, audio []byte) (bool, error)
// ProcessVADState processes a VAD state update.
// Returns true if end of turn is detected based on VAD state.
ProcessVADState(ctx context.Context, state VADState) (bool, error)
// IsUserSpeaking returns true if user is currently speaking.
IsUserSpeaking() bool
// Reset clears state for a new conversation.
Reset()
}
TurnDetector determines when a speaker has finished their turn. This is separate from VAD - VAD detects voice activity, turn detection determines conversation boundaries.
type VADAnalyzer ¶
type VADAnalyzer interface {
// Name returns the analyzer identifier.
Name() string
// Analyze processes audio and returns voice probability (0.0-1.0).
// audio should be raw PCM samples at the configured sample rate.
Analyze(ctx context.Context, audio []byte) (float64, error)
// State returns the current VAD state based on accumulated analysis.
State() VADState
// OnStateChange returns a channel that receives state transitions.
// The channel is buffered and may drop events if not consumed.
OnStateChange() <-chan VADEvent
// Reset clears accumulated state for a new conversation.
Reset()
}
VADAnalyzer analyzes audio for voice activity.
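The OnStateChange comment notes that the channel is buffered and may drop events if not consumed. One common way to get that behavior in Go is a non-blocking send, sketched below. The notifier and event types are hypothetical and exist only for this example; the package may implement dropping differently.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a state-transition event.
type event struct {
	from, to string
}

// notifier publishes events on a buffered channel without ever blocking
// the producer.
type notifier struct {
	events  chan event
	dropped int
}

func newNotifier(size int) *notifier {
	return &notifier{events: make(chan event, size)}
}

// emit tries to enqueue e. If the buffer is full, the event is dropped and
// emit returns false instead of blocking audio processing.
func (n *notifier) emit(e event) bool {
	select {
	case n.events <- e:
		return true
	default:
		n.dropped++
		return false
	}
}

func main() {
	n := newNotifier(1)
	fmt.Println(n.emit(event{"quiet", "starting"}))    // true: buffered
	fmt.Println(n.emit(event{"starting", "speaking"})) // false: buffer full, dropped
	fmt.Println(<-n.events)                            // {quiet starting}
}
```

Consumers that cannot afford to miss a transition can read State directly after each Analyze call instead of relying on the channel.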
type VADEvent ¶
type VADEvent struct {
State VADState
PrevState VADState
Timestamp time.Time
Duration time.Duration // How long in the previous state
Confidence float64 // Voice confidence at transition
}
VADEvent represents a state transition in VAD.
type VADParams ¶
type VADParams struct {
// Confidence threshold for voice detection (0.0-1.0, default: 0.5).
// Higher values require more confidence before triggering.
Confidence float64
// StartSecs is seconds of speech required to trigger VADStateSpeaking (default: 0.2).
// Prevents false starts from brief noise.
StartSecs float64
// StopSecs is seconds of silence required to trigger VADStateQuiet (default: 0.8).
// Allows natural pauses without ending turn.
StopSecs float64
// MinVolume is the minimum RMS volume threshold (default: 0.01).
// Audio below this is treated as silence.
MinVolume float64
// SampleRate is the audio sample rate in Hz (default: 16000).
SampleRate int
}
VADParams configures voice activity detection behavior.
func DefaultVADParams ¶
func DefaultVADParams() VADParams
DefaultVADParams returns sensible defaults for voice activity detection.
type VADState ¶
type VADState int
VADState represents the current voice activity state.
const (
	// VADStateQuiet indicates no voice activity detected.
	VADStateQuiet VADState = iota
	// VADStateStarting indicates voice is starting (within start threshold).
	VADStateStarting
	// VADStateSpeaking indicates active speech.
	VADStateSpeaking
	// VADStateStopping indicates voice is stopping (within stop threshold).
	VADStateStopping
)
type ValidationError ¶
ValidationError represents a parameter validation error.
func (*ValidationError) Error ¶
func (e *ValidationError) Error() string