audio

package
v1.1.6 Latest
Warning

This package is not in the latest version of its module.

Published: Dec 23, 2025 License: Apache-2.0 Imports: 6 Imported by: 0

Documentation

Overview

Package audio provides voice activity detection (VAD), turn detection, and audio session management for real-time voice AI applications.

The package follows industry-standard patterns for voice AI:

  • VAD (Voice Activity Detection): Detects when someone is speaking vs. silent
  • Turn Detection: Determines when a speaker has finished their turn
  • Interruption Handling: Manages user interruptions of bot output

Architecture

Audio processing follows a two-stage approach:

  1. VADAnalyzer detects voice activity in real-time
  2. TurnDetector uses VAD output plus additional signals to detect turn boundaries

Usage Example

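// NewSimpleVAD returns an error as its second value.
vad, err := audio.NewSimpleVAD(audio.DefaultVADParams())
if err != nil {
    // Handle invalid VAD parameters
}
detector := audio.NewSilenceDetector(500 * time.Millisecond)

for chunk := range audioStream {
    if _, err := vad.Analyze(ctx, chunk); err != nil {
        continue // Skip chunks that fail analysis
    }
    // Feed the current VAD state to the turn detector.
    turnEnded, err := detector.ProcessVADState(ctx, vad.State())
    if err == nil && turnEnded {
        // User finished speaking
    }
}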

Package audio provides audio processing utilities.

Constants

const (
	SampleRate24kHz = 24000 // Common TTS output rate
	SampleRate16kHz = 16000 // Common STT/ASR input rate
)

Standard audio sample rates for common use cases.

const (
	DefaultVADConfidence = 0.5
	DefaultVADStartSecs  = 0.2
	DefaultVADStopSecs   = 0.8
	DefaultVADMinVolume  = 0.01
	DefaultVADSampleRate = 16000
)

Default VAD parameter values.

Variables

This section is empty.

Functions

func Resample24kTo16k

func Resample24kTo16k(input []byte) ([]byte, error)

Resample24kTo16k is a convenience function for the common case of resampling from 24kHz (TTS output) to 16kHz (Gemini input).

func ResamplePCM16

func ResamplePCM16(input []byte, fromRate, toRate int) ([]byte, error)

ResamplePCM16 resamples PCM16 audio data from one sample rate to another. Uses linear interpolation for reasonable quality resampling. Input and output are little-endian 16-bit signed PCM samples.
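To make the linear-interpolation step concrete, here is a minimal sketch of how such a resampler could work. The helper resampleLinear is hypothetical and is not the package's implementation; it assumes mono audio and rounds each interpolated sample to the nearest integer, neither of which the documentation confirms.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// resampleLinear is a hypothetical helper, not the package's ResamplePCM16.
// It resamples mono little-endian PCM16 audio using linear interpolation.
func resampleLinear(input []byte, fromRate, toRate int) ([]byte, error) {
	if fromRate <= 0 || toRate <= 0 {
		return nil, errors.New("sample rates must be positive")
	}
	if len(input)%2 != 0 {
		return nil, errors.New("PCM16 input must have an even byte length")
	}
	n := len(input) / 2
	outN := n * toRate / fromRate
	out := make([]byte, outN*2)
	for i := 0; i < outN; i++ {
		// Position of output sample i on the input timeline.
		pos := float64(i) * float64(fromRate) / float64(toRate)
		idx := int(pos)
		frac := pos - float64(idx)
		a := float64(int16(binary.LittleEndian.Uint16(input[idx*2:])))
		b := a // Past the last input sample, hold the final value.
		if idx+1 < n {
			b = float64(int16(binary.LittleEndian.Uint16(input[(idx+1)*2:])))
		}
		// Blend the two neighbouring input samples.
		s := int16(math.Round(a + (b-a)*frac))
		binary.LittleEndian.PutUint16(out[i*2:], uint16(s))
	}
	return out, nil
}

func main() {
	// Three samples at 24 kHz: 0, 100, 200.
	in := []byte{0, 0, 100, 0, 200, 0}
	out, err := resampleLinear(in, 24000, 16000)
	if err != nil {
		fmt.Println("resample failed:", err)
		return
	}
	fmt.Println(len(out)/2, "samples out")
}
```

Going from 24kHz to 16kHz keeps two output samples for every three input samples, so the second output sample falls halfway between the second and third inputs.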

Types

type AccumulatingTurnDetector

type AccumulatingTurnDetector interface {
	TurnDetector

	// OnTurnComplete registers a callback for when a complete turn is detected.
	OnTurnComplete(callback TurnCallback)

	// GetAccumulatedAudio returns audio accumulated so far (may be incomplete turn).
	GetAccumulatedAudio() []byte

	// SetTranscript sets the transcript for the current turn (from external STT).
	SetTranscript(transcript string)
}

AccumulatingTurnDetector is a TurnDetector that accumulates audio during a turn.

type InterruptionCallback

type InterruptionCallback func()

InterruptionCallback is called when the user interrupts the bot.

type InterruptionHandler

type InterruptionHandler struct {
	// contains filtered or unexported fields
}

InterruptionHandler manages user interruption logic during bot output.

func NewInterruptionHandler

func NewInterruptionHandler(strategy InterruptionStrategy, vad VADAnalyzer) *InterruptionHandler

NewInterruptionHandler creates an InterruptionHandler with the given strategy and VAD.

func (*InterruptionHandler) IsBotSpeaking

func (h *InterruptionHandler) IsBotSpeaking() bool

IsBotSpeaking returns true if the bot is currently outputting audio.

func (*InterruptionHandler) NotifySentenceBoundary

func (h *InterruptionHandler) NotifySentenceBoundary()

NotifySentenceBoundary notifies the handler of a sentence boundary. For deferred interruption strategy, this may trigger the pending interruption.

func (*InterruptionHandler) OnInterrupt

func (h *InterruptionHandler) OnInterrupt(callback InterruptionCallback)

OnInterrupt registers a callback for when interruption occurs.

func (*InterruptionHandler) ProcessAudio

func (h *InterruptionHandler) ProcessAudio(ctx context.Context, audio []byte) (bool, error)

ProcessAudio processes audio and detects user interruption. Returns true if an interruption was detected and should be acted upon.

func (*InterruptionHandler) ProcessVADState

func (h *InterruptionHandler) ProcessVADState(ctx context.Context, state VADState) (bool, error)

ProcessVADState processes a VAD state update for interruption detection. Returns true if an interruption was detected and should be acted upon.

func (*InterruptionHandler) Reset

func (h *InterruptionHandler) Reset()

Reset clears interruption state for a new turn.

func (*InterruptionHandler) SetBotSpeaking

func (h *InterruptionHandler) SetBotSpeaking(speaking bool)

SetBotSpeaking sets whether the bot is currently outputting audio.

func (*InterruptionHandler) WasInterrupted

func (h *InterruptionHandler) WasInterrupted() bool

WasInterrupted returns true if an interruption occurred.

type InterruptionStrategy

type InterruptionStrategy int

InterruptionStrategy determines how to handle the user interrupting the bot.

const (
	// InterruptionIgnore ignores user speech during bot output.
	InterruptionIgnore InterruptionStrategy = iota
	// InterruptionImmediate immediately stops bot and starts listening.
	InterruptionImmediate
	// InterruptionDeferred waits for bot's current sentence, then switches.
	InterruptionDeferred
)

func (InterruptionStrategy) String

func (s InterruptionStrategy) String() string

String returns a human-readable representation of the interruption strategy.
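The three strategies can be read as a small decision procedure. The sketch below models it with hypothetical local types (strategy, handler) rather than the package's InterruptionHandler, whose internals are not shown here; it assumes that a deferred interruption is held until the next sentence boundary, as the NotifySentenceBoundary description suggests.

```go
package main

import "fmt"

// strategy is a hypothetical local mirror of the three documented strategies.
type strategy int

const (
	ignore strategy = iota
	immediate
	deferred
)

// handler is a hypothetical model, not the package's InterruptionHandler.
type handler struct {
	strategy    strategy
	botSpeaking bool
	pending     bool // a deferred interruption is waiting for a sentence boundary
}

// onUserSpeech reports whether the bot should be interrupted right now.
func (h *handler) onUserSpeech() bool {
	if !h.botSpeaking {
		return false // nothing to interrupt
	}
	switch h.strategy {
	case immediate:
		return true
	case deferred:
		h.pending = true // remember it, act at the next sentence boundary
	}
	return false
}

// onSentenceBoundary reports whether a pending deferred interruption fires.
func (h *handler) onSentenceBoundary() bool {
	if h.pending {
		h.pending = false
		return true
	}
	return false
}

func main() {
	h := &handler{strategy: deferred, botSpeaking: true}
	fmt.Println("interrupt on speech:", h.onUserSpeech())
	fmt.Println("interrupt at boundary:", h.onSentenceBoundary())
}
```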

type SilenceDetector

type SilenceDetector struct {
	// Threshold is the silence duration required to trigger turn end.
	Threshold time.Duration
	// contains filtered or unexported fields
}

SilenceDetector detects turn boundaries based on silence duration. It triggers end-of-turn when silence exceeds a configurable threshold.

func NewSilenceDetector

func NewSilenceDetector(threshold time.Duration) *SilenceDetector

NewSilenceDetector creates a SilenceDetector with the given threshold. threshold is the duration of silence required to trigger end-of-turn.
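As an illustration of silence-based turn detection, the following sketch tracks how long the user has been silent after speaking and reports a turn end once that duration reaches the threshold. The silenceTracker type, the firstTurnEnd helper, and the 100ms frame spacing are assumptions made for the example; the package's SilenceDetector may differ, for instance in whether it compares with "greater than" or "greater than or equal".

```go
package main

import (
	"fmt"
	"time"
)

const (
	frameStep = 100 * time.Millisecond // assumed spacing between VAD updates
	threshold = 500 * time.Millisecond // silence needed to end a turn
)

// silenceTracker is a hypothetical model, not the package's SilenceDetector.
type silenceTracker struct {
	threshold    time.Duration
	heardSpeech  bool // the user has spoken during this turn
	inSilence    bool // a silence run is in progress
	silenceStart time.Time
}

// update consumes one speaking/silent observation and reports a turn end.
func (s *silenceTracker) update(speaking bool, now time.Time) bool {
	if speaking {
		s.heardSpeech = true
		s.inSilence = false
		return false
	}
	if !s.heardSpeech {
		return false // silence before any speech is not a turn end
	}
	if !s.inSilence {
		s.inSilence = true
		s.silenceStart = now
		return false
	}
	if now.Sub(s.silenceStart) >= s.threshold {
		s.heardSpeech = false
		s.inSilence = false
		return true
	}
	return false
}

// firstTurnEnd replays a trace of frames and returns the index of the frame
// at which the turn ends, or -1 if it never does.
func firstTurnEnd(speaking []bool) int {
	tr := &silenceTracker{threshold: threshold}
	base := time.Unix(0, 0)
	for i, sp := range speaking {
		if tr.update(sp, base.Add(time.Duration(i)*frameStep)) {
			return i
		}
	}
	return -1
}

func main() {
	frames := []bool{true, true, false, false, false, false, false, false}
	fmt.Println("turn ended at frame", firstTurnEnd(frames))
}
```

A short pause that ends before the threshold resets the silence run, which is why brief hesitations do not end the turn in this model.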

func (*SilenceDetector) GetAccumulatedAudio

func (d *SilenceDetector) GetAccumulatedAudio() []byte

GetAccumulatedAudio returns audio accumulated so far.

func (*SilenceDetector) IsUserSpeaking

func (d *SilenceDetector) IsUserSpeaking() bool

IsUserSpeaking returns true if user is currently speaking.

func (*SilenceDetector) Name

func (d *SilenceDetector) Name() string

Name returns the detector identifier.

func (*SilenceDetector) OnTurnComplete

func (d *SilenceDetector) OnTurnComplete(callback TurnCallback)

OnTurnComplete registers a callback for when a complete turn is detected.

func (*SilenceDetector) ProcessAudio

func (d *SilenceDetector) ProcessAudio(ctx context.Context, audio []byte) (bool, error)

ProcessAudio processes an incoming audio chunk. This implementation delegates to ProcessVADState and expects VAD to be run separately. Returns true if end of turn is detected.

func (*SilenceDetector) ProcessVADState

func (d *SilenceDetector) ProcessVADState(ctx context.Context, state VADState) (bool, error)

ProcessVADState processes a VAD state update and detects turn boundaries. Returns true if end of turn is detected.

func (*SilenceDetector) Reset

func (d *SilenceDetector) Reset()

Reset clears state for a new conversation.

func (*SilenceDetector) SetTranscript

func (d *SilenceDetector) SetTranscript(transcript string)

SetTranscript sets the transcript for the current turn.

type SimpleVAD

type SimpleVAD struct {
	// contains filtered or unexported fields
}

SimpleVAD is a basic voice activity detector using RMS (Root Mean Square) analysis. It provides a lightweight VAD implementation without requiring external ML models. For more accurate detection, consider using SileroVAD.

func NewSimpleVAD

func NewSimpleVAD(params VADParams) (*SimpleVAD, error)

NewSimpleVAD creates a SimpleVAD analyzer with the given parameters.

func (*SimpleVAD) Analyze

func (v *SimpleVAD) Analyze(ctx context.Context, audio []byte) (float64, error)

Analyze processes audio and returns voice probability based on RMS volume.
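For readers unfamiliar with RMS volume, this sketch shows one way to compute it from little-endian PCM16 bytes and compare it against a MinVolume-style threshold. The helpers rmsPCM16 and isSilent are hypothetical, and the normalization by 32768 is an assumption; how SimpleVAD maps RMS to a voice probability is not documented here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// rmsPCM16 is a hypothetical helper: it returns the RMS of little-endian
// PCM16 samples, normalized so that full scale is about 1.0.
// A trailing odd byte, if any, is ignored.
func rmsPCM16(audio []byte) float64 {
	n := len(audio) / 2
	if n == 0 {
		return 0
	}
	var sum float64
	for i := 0; i < n; i++ {
		s := float64(int16(binary.LittleEndian.Uint16(audio[i*2:]))) / 32768.0
		sum += s * s
	}
	return math.Sqrt(sum / float64(n))
}

// isSilent treats audio whose RMS is below minVolume as silence,
// following the documented meaning of MinVolume.
func isSilent(audio []byte, minVolume float64) bool {
	return rmsPCM16(audio) < minVolume
}

func main() {
	quiet := make([]byte, 320)             // all-zero samples
	loud := []byte{0x00, 0x40, 0x00, 0xC0} // samples +16384 and -16384
	fmt.Println(isSilent(quiet, 0.01), isSilent(loud, 0.01))
}
```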

func (*SimpleVAD) Name

func (v *SimpleVAD) Name() string

Name returns the analyzer identifier.

func (*SimpleVAD) OnStateChange

func (v *SimpleVAD) OnStateChange() <-chan VADEvent

OnStateChange returns a channel that receives state transitions.

func (*SimpleVAD) Reset

func (v *SimpleVAD) Reset()

Reset clears accumulated state for a new conversation.

func (*SimpleVAD) State

func (v *SimpleVAD) State() VADState

State returns the current VAD state.

type TurnCallback

type TurnCallback func(audio []byte, transcript string)

TurnCallback is called when a complete user turn is detected. audio contains the accumulated audio for the turn. transcript contains any accumulated transcript (may be empty).

type TurnDetector

type TurnDetector interface {
	// Name returns the detector identifier.
	Name() string

	// ProcessAudio processes an incoming audio chunk.
	// Returns true if end of turn is detected.
	ProcessAudio(ctx context.Context, audio []byte) (bool, error)

	// ProcessVADState processes a VAD state update.
	// Returns true if end of turn is detected based on VAD state.
	ProcessVADState(ctx context.Context, state VADState) (bool, error)

	// IsUserSpeaking returns true if user is currently speaking.
	IsUserSpeaking() bool

	// Reset clears state for a new conversation.
	Reset()
}

TurnDetector determines when a speaker has finished their turn. This is separate from VAD - VAD detects voice activity, turn detection determines conversation boundaries.

type VADAnalyzer

type VADAnalyzer interface {
	// Name returns the analyzer identifier.
	Name() string

	// Analyze processes audio and returns voice probability (0.0-1.0).
	// audio should be raw PCM samples at the configured sample rate.
	Analyze(ctx context.Context, audio []byte) (float64, error)

	// State returns the current VAD state based on accumulated analysis.
	State() VADState

	// OnStateChange returns a channel that receives state transitions.
	// The channel is buffered and may drop events if not consumed.
	OnStateChange() <-chan VADEvent

	// Reset clears accumulated state for a new conversation.
	Reset()
}

VADAnalyzer analyzes audio for voice activity.
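The note that the OnStateChange channel "may drop events if not consumed" typically corresponds to a non-blocking send on a buffered channel. The sketch below shows that pattern with hypothetical event and notifier types; it is an illustration of the idiom, not the package's code.

```go
package main

import "fmt"

// event is a hypothetical stand-in for VADEvent.
type event struct{ state string }

// notifier is a hypothetical producer that owns a buffered event channel.
type notifier struct{ ch chan event }

func newNotifier(size int) *notifier {
	return &notifier{ch: make(chan event, size)}
}

// emit tries to send without blocking. It reports false when the buffer is
// full and the event was dropped.
func (n *notifier) emit(e event) bool {
	select {
	case n.ch <- e:
		return true
	default:
		return false
	}
}

func main() {
	n := newNotifier(1)
	fmt.Println("first sent:", n.emit(event{state: "speaking"}))
	fmt.Println("second sent:", n.emit(event{state: "quiet"}))
	fmt.Println("received:", (<-n.ch).state)
}
```

With this design the producer never stalls on a slow consumer, so callers that need every transition should drain the channel promptly or poll State instead.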

type VADEvent

type VADEvent struct {
	State      VADState
	PrevState  VADState
	Timestamp  time.Time
	Duration   time.Duration // How long in the previous state
	Confidence float64       // Voice confidence at transition
}

VADEvent represents a state transition in VAD.

type VADParams

type VADParams struct {
	// Confidence threshold for voice detection (0.0-1.0, default: 0.5).
	// Higher values require more confidence before triggering.
	Confidence float64

	// StartSecs is seconds of speech required to trigger VADStateSpeaking (default: 0.2).
	// Prevents false starts from brief noise.
	StartSecs float64

	// StopSecs is seconds of silence required to trigger VADStateQuiet (default: 0.8).
	// Allows natural pauses without ending turn.
	StopSecs float64

	// MinVolume is the minimum RMS volume threshold (default: 0.01).
	// Audio below this is treated as silence.
	MinVolume float64

	// SampleRate is the audio sample rate in Hz (default: 16000).
	SampleRate int
}

VADParams configures voice activity detection behavior.

func DefaultVADParams

func DefaultVADParams() VADParams

DefaultVADParams returns sensible defaults for voice activity detection.

func (VADParams) Validate

func (p VADParams) Validate() error

Validate checks that VAD parameters are within acceptable ranges.

type VADState

type VADState int

VADState represents the current voice activity state.

const (
	// VADStateQuiet indicates no voice activity detected.
	VADStateQuiet VADState = iota
	// VADStateStarting indicates voice is starting (within start threshold).
	VADStateStarting
	// VADStateSpeaking indicates active speech.
	VADStateSpeaking
	// VADStateStopping indicates voice is stopping (within stop threshold).
	VADStateStopping
)

func (VADState) String

func (s VADState) String() string

String returns a human-readable representation of the VAD state.
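The four states and the StartSecs and StopSecs parameters suggest a simple state machine. The following sketch is one plausible reading of it, using hypothetical local types and integer milliseconds to avoid floating-point accumulation error; the exact transition rules of SimpleVAD are assumptions here.

```go
package main

import "fmt"

// state is a hypothetical local mirror of the four documented VAD states.
type state int

const (
	quiet state = iota
	starting
	speaking
	stopping
)

func (s state) String() string {
	return [...]string{"quiet", "starting", "speaking", "stopping"}[s]
}

// machine is a hypothetical model of the VAD state transitions.
type machine struct {
	state     state
	startMs   int // voiced time needed to confirm speech (StartSecs)
	stopMs    int // silent time needed to confirm quiet (StopSecs)
	elapsedMs int // time spent in the current transitional state
}

// step consumes one frame of frameMs milliseconds and returns the new state.
func (m *machine) step(voiced bool, frameMs int) state {
	switch m.state {
	case quiet:
		if voiced {
			m.state, m.elapsedMs = starting, frameMs
		}
	case starting:
		if !voiced {
			m.state, m.elapsedMs = quiet, 0 // brief noise, not speech
		} else if m.elapsedMs += frameMs; m.elapsedMs >= m.startMs {
			m.state, m.elapsedMs = speaking, 0
		}
	case speaking:
		if !voiced {
			m.state, m.elapsedMs = stopping, frameMs
		}
	case stopping:
		if voiced {
			m.state, m.elapsedMs = speaking, 0 // natural pause, keep speaking
		} else if m.elapsedMs += frameMs; m.elapsedMs >= m.stopMs {
			m.state, m.elapsedMs = quiet, 0
		}
	}
	return m.state
}

func main() {
	m := &machine{startMs: 200, stopMs: 800} // the documented defaults
	for _, voiced := range []bool{true, true, false, true, false} {
		fmt.Println(m.step(voiced, 100))
	}
}
```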

type ValidationError

type ValidationError struct {
	Field   string
	Message string
}

ValidationError represents a parameter validation error.

func (*ValidationError) Error

func (e *ValidationError) Error() string
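A caller would usually inspect such an error with errors.As to find the offending field. The sketch below demonstrates the pattern with a local validationError type that mirrors the documented Field and Message fields. The Error string format, the validateConfidence helper, and the assumption that Validate returns a *ValidationError are all hypothetical; only the 0.0-1.0 range for Confidence comes from the VADParams documentation.

```go
package main

import (
	"errors"
	"fmt"
)

// validationError is a local mirror of the documented ValidationError fields.
type validationError struct {
	Field   string
	Message string
}

// Error formats the error. The format is an assumption for this sketch.
func (e *validationError) Error() string {
	return e.Field + ": " + e.Message
}

// validateConfidence is a hypothetical check of the documented 0.0-1.0 range.
func validateConfidence(c float64) error {
	if c < 0 || c > 1 {
		return &validationError{
			Field:   "Confidence",
			Message: "must be between 0.0 and 1.0",
		}
	}
	return nil
}

// failedField returns the name of the offending field, or "" if err is not
// (and does not wrap) a *validationError.
func failedField(err error) string {
	var ve *validationError
	if errors.As(err, &ve) {
		return ve.Field
	}
	return ""
}

func main() {
	err := validateConfidence(1.5)
	fmt.Println("field:", failedField(err))
	fmt.Println("error:", err)
}
```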
