Documentation ¶
Overview ¶
Package realtime provides a unified interface for real-time voice-to-voice providers.
Real-time providers enable native voice-to-voice conversations with roughly 100-300ms latency by handling audio input and output directly, without separate speech-to-text (STT) and text-to-speech (TTS) steps.
Supported Providers ¶
The following providers implement the Provider interface:
- OpenAI Realtime API (github.com/plexusone/omni-openai/omnivoice/realtime)
- Gemini Live API (github.com/plexusone/omni-google/omnivoice)
Audio Format ¶
Input audio should be PCM16 (signed 16-bit little-endian) at the sample rate the provider expects:
- OpenAI Realtime: 24kHz mono
- Gemini Live: 16kHz mono
Output audio is PCM16 24kHz mono for both providers.
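As a minimal sketch of the format described above, the following shows how signed 16-bit samples map to little-endian bytes and how large a mono chunk is at each input rate. The helper names encodePCM16 and chunkBytes are hypothetical and not part of this package.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodePCM16 packs signed 16-bit samples into little-endian bytes.
// Hypothetical helper for illustration only.
func encodePCM16(samples []int16) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		binary.LittleEndian.PutUint16(out[2*i:], uint16(s))
	}
	return out
}

// chunkBytes returns the size in bytes of a mono PCM16 chunk that
// covers the given number of milliseconds at the given sample rate.
func chunkBytes(sampleRate, millis int) int {
	return sampleRate * millis / 1000 * 2
}

func main() {
	fmt.Println(encodePCM16([]int16{1, -2}))
	fmt.Println(chunkBytes(24000, 20), chunkBytes(16000, 20))
}
```

A 20ms chunk is therefore 960 bytes at 24kHz and 640 bytes at 16kHz, which is useful when sizing buffers for the input channel.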
Usage ¶
	provider := openairealtime.NewProvider(apiKey,
		openairealtime.WithVoice("alloy"),
		openairealtime.WithInstructions("You are a helpful assistant."),
	)

	audioIn := make(chan []byte, 100)
	audioCh, transcriptCh, err := provider.ProcessAudioStream(ctx, audioIn, realtime.ProcessConfig{
		OnFunctionCall: func(id, name, args string) (any, error) {
			return handleFunction(name, args)
		},
	})
	if err != nil {
		log.Fatalf("start session: %v", err)
	}

	// Send audio from microphone
	go func() {
		defer close(audioIn)
		for chunk := range microphoneAudio {
			audioIn <- chunk
		}
	}()

	// Receive audio and transcripts until the session ends
	for {
		select {
		case audio, ok := <-audioCh:
			if !ok {
				return
			}
			playAudio(audio.Audio)
		case transcript, ok := <-transcriptCh:
			if !ok {
				// A closed channel is always ready; stop selecting on it
				// so the loop does not log empty transcripts.
				transcriptCh = nil
				continue
			}
			log.Printf("[%s] %s", transcript.Role(), transcript.Text)
		}
	}
Integration with Telephony ¶
Real-time providers integrate with telephony gateways (Twilio, Telnyx, Plivo) by connecting the gateway's audio streams to the provider:
	gateway.OnCall(func(session gateway.Session) {
		audioIn := make(chan []byte, 100)

		// Start the session first so a failure does not leave the
		// forwarding goroutine blocked on a channel nobody reads.
		audioCh, _, err := provider.ProcessAudioStream(ctx, audioIn, config)
		if err != nil {
			log.Printf("start session: %v", err)
			return
		}

		// Forward gateway audio to provider
		go func() {
			defer close(audioIn)
			for chunk := range session.AudioIn() {
				// Convert mulaw 8kHz to PCM16 24kHz
				pcm := codec.MulawToPCM16(chunk)
				resampled := resample(pcm, 8000, 24000)
				audioIn <- resampled
			}
		}()

		// Forward provider audio to gateway
		for audio := range audioCh {
			// Convert PCM16 24kHz to mulaw 8kHz
			resampled := resample(audio.Audio, 24000, 8000)
			mulaw := codec.PCM16ToMulaw(resampled)
			session.SendAudio(mulaw)
		}
	})
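The telephony example calls a resample helper that is not defined in this documentation. The sketch below shows one possible shape for it, using linear interpolation over mono PCM16 little-endian bytes. The names resampleLinear, encode, and decode are hypothetical. Linear interpolation applies no low-pass filter, so downsampling this way can alias; a production gateway would likely use a filtered resampler.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decode unpacks little-endian PCM16 bytes into samples.
func decode(pcm []byte) []int16 {
	out := make([]int16, len(pcm)/2)
	for i := range out {
		out[i] = int16(binary.LittleEndian.Uint16(pcm[2*i:]))
	}
	return out
}

// encode packs samples into little-endian PCM16 bytes.
func encode(samples []int16) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		binary.LittleEndian.PutUint16(out[2*i:], uint16(s))
	}
	return out
}

// resampleLinear converts mono PCM16 audio between sample rates by
// linearly interpolating between neighbouring input samples.
// Hypothetical helper; not part of the realtime package.
func resampleLinear(pcm []byte, fromRate, toRate int) []byte {
	in := decode(pcm)
	if len(in) == 0 || fromRate <= 0 || toRate <= 0 {
		return nil
	}
	out := make([]int16, len(in)*toRate/fromRate)
	for i := range out {
		// Position of output sample i on the input timeline.
		pos := float64(i) * float64(fromRate) / float64(toRate)
		j := int(pos)
		frac := pos - float64(j)
		a := float64(in[j])
		b := a // hold the last sample at the end of the buffer
		if j+1 < len(in) {
			b = float64(in[j+1])
		}
		out[i] = int16(math.Round(a + (b-a)*frac))
	}
	return encode(out)
}

func main() {
	up := resampleLinear(encode([]int16{0, 300}), 8000, 24000)
	fmt.Println(decode(up))
	down := resampleLinear(up, 24000, 8000)
	fmt.Println(decode(down))
}
```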
Index ¶
Constants ¶
This section is empty.
Variables ¶
	var (
		// ErrSessionClosed is returned when operating on a closed session.
		ErrSessionClosed = errors.New("session closed")

		// ErrConnectionFailed is returned when the WebSocket connection fails.
		ErrConnectionFailed = errors.New("connection failed")

		// ErrAuthenticationFailed is returned when API authentication fails.
		ErrAuthenticationFailed = errors.New("authentication failed")

		// ErrRateLimited is returned when the provider rate limits the request.
		ErrRateLimited = errors.New("rate limited")

		// ErrInvalidConfig is returned when the configuration is invalid.
		ErrInvalidConfig = errors.New("invalid configuration")

		ErrProviderUnavailable = errors.New("provider unavailable")

		// ErrContextCancelled is returned when the context is cancelled.
		ErrContextCancelled = errors.New("context cancelled")
	)
Common errors returned by real-time providers.
Functions ¶
This section is empty.
Types ¶
type AudioChunk ¶
	type AudioChunk struct {
		// Audio is the raw audio data.
		// Format is PCM16 (signed 16-bit little-endian) at 24kHz mono.
		Audio []byte

		// IsFinal indicates this is the last chunk for the current turn.
		// Use this to know when the model has finished speaking.
		IsFinal bool
	}
AudioChunk represents a chunk of audio data from the model.
type Client ¶
Client provides a unified interface across multiple real-time providers. It supports provider selection and fallback.
func NewClient ¶
NewClient creates a new real-time client with the specified providers. The first provider is set as the primary.
func (*Client) ProcessAudioStream ¶
func (c *Client) ProcessAudioStream(ctx context.Context, audioIn <-chan []byte, config ProcessConfig) (<-chan AudioChunk, <-chan Transcript, error)
ProcessAudioStream uses the primary provider with fallback on connection errors.
type FunctionDeclaration ¶
	type FunctionDeclaration struct {
		// Name is the function name.
		Name string `json:"name"`

		// Description explains what the function does.
		Description string `json:"description"`

		// Parameters is a JSON Schema describing the function parameters.
		// Use json.RawMessage for flexibility across providers.
		Parameters json.RawMessage `json:"parameters,omitempty"`
	}
FunctionDeclaration describes a function the model can call.
type ProcessConfig ¶
	type ProcessConfig struct {
		// Instructions is the system prompt for the conversation.
		Instructions string

		// Voice is the voice identifier for audio output.
		// Provider-specific (e.g., "alloy", "Puck").
		Voice string

		// Functions are functions the model can call during the conversation.
		Functions []FunctionDeclaration

		// OnFunctionCall is called when the model invokes a function.
		// The handler should execute the function and return the result.
		//
		// Parameters:
		//   - id: unique identifier for this function call
		//   - name: function name being called
		//   - args: JSON-encoded function arguments
		//
		// Returns:
		//   - result: any JSON-serializable value to return to the model
		//   - error: if non-nil, sent as error response to the model
		OnFunctionCall func(id, name, args string) (result any, err error)

		// Temperature controls response randomness (0.0 to 2.0).
		// Default varies by provider.
		Temperature float64

		// Extensions holds provider-specific settings.
		// Keys should be namespaced by provider (e.g., "openai.turn_detection").
		Extensions map[string]any
	}
ProcessConfig configures a real-time audio processing session.
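One way to write an OnFunctionCall handler is to dispatch on the function name and decode the JSON arguments per function. The sketch below matches the documented callback signature; newDispatcher, getWeather, and the placeholder result value are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// handler processes the JSON-encoded arguments of one function.
type handler func(args json.RawMessage) (any, error)

// newDispatcher returns a callback with the OnFunctionCall signature
// that routes each call to the handler registered under its name.
func newDispatcher(handlers map[string]handler) func(id, name, args string) (any, error) {
	return func(id, name, args string) (any, error) {
		h, ok := handlers[name]
		if !ok {
			return nil, fmt.Errorf("unknown function %q (call %s)", name, id)
		}
		return h(json.RawMessage(args))
	}
}

// getWeather decodes its arguments and returns a placeholder result;
// a real handler would query a weather service here.
func getWeather(args json.RawMessage) (any, error) {
	var in struct {
		City string `json:"city"`
	}
	if err := json.Unmarshal(args, &in); err != nil {
		return nil, fmt.Errorf("decode args: %w", err)
	}
	return map[string]any{"city": in.City, "temp_c": 21}, nil
}

func main() {
	onCall := newDispatcher(map[string]handler{"get_weather": getWeather})
	fmt.Println(onCall("call-1", "get_weather", `{"city":"Oslo"}`))
	fmt.Println(onCall("call-2", "missing", `{}`))
}
```

The returned closure can be assigned directly to the OnFunctionCall field. Returning an error for unknown names relies on the documented behaviour that a non-nil error is sent to the model as an error response.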
type Provider ¶
	type Provider interface {
		// ProcessAudioStream starts a real-time voice session.
		//
		// audioIn receives raw audio chunks from the user (microphone, telephony).
		// The audio format depends on the provider (typically PCM16 16-24kHz mono).
		//
		// Returns two channels:
		//   - audioCh: audio chunks from the model (PCM16 24kHz mono)
		//   - transcriptCh: transcript updates (both user input and model output)
		//
		// Both channels are closed when the session ends (context cancelled,
		// audioIn closed, or error).
		ProcessAudioStream(ctx context.Context, audioIn <-chan []byte, config ProcessConfig) (
			audioCh <-chan AudioChunk,
			transcriptCh <-chan Transcript,
			err error,
		)

		// Name returns the provider name (e.g., "openai-realtime", "gemini-live").
		Name() string

		// Close releases any resources held by the provider.
		Close() error
	}
Provider defines the interface for real-time voice-to-voice providers.
Real-time providers handle bidirectional audio streaming, enabling native voice conversations with low latency (~100-300ms). Unlike traditional STT+LLM+TTS pipelines, real-time providers process audio directly.
type Transcript ¶
	type Transcript struct {
		// Text is the transcript text.
		Text string

		// IsFinal indicates this is a final (non-interim) transcript.
		// Interim transcripts may be revised; final transcripts are stable.
		IsFinal bool

		// IsInput indicates this is user input transcription.
		// If false, this is model output transcription.
		IsInput bool

		// ItemID is a provider-specific identifier for this transcript item.
		// Can be used to correlate with audio chunks.
		ItemID string
	}
Transcript represents a transcript update during the conversation.
func (Transcript) Role ¶
func (t Transcript) Role() string
Role returns the role associated with this transcript.
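Since interim transcripts may be revised, a conversation log usually keeps only final updates. The sketch below filters on IsFinal and labels each line from IsInput. The exact strings returned by Role are not shown in this documentation, so the "user" and "model" labels used by the hypothetical speaker helper are assumptions.

```go
package main

import "fmt"

// Transcript is a local copy of the documented struct.
type Transcript struct {
	Text    string
	IsFinal bool
	IsInput bool
	ItemID  string
}

// speaker derives a label from IsInput. The label strings are
// assumptions and may differ from what Role returns.
func speaker(t Transcript) string {
	if t.IsInput {
		return "user"
	}
	return "model"
}

// finalLines keeps only final transcripts and formats them for a log.
func finalLines(updates []Transcript) []string {
	var lines []string
	for _, t := range updates {
		if !t.IsFinal {
			continue // interim text may still change
		}
		lines = append(lines, fmt.Sprintf("[%s] %s", speaker(t), t.Text))
	}
	return lines
}

func main() {
	updates := []Transcript{
		{Text: "What's the", IsInput: true, ItemID: "item-1"},
		{Text: "What's the weather?", IsInput: true, IsFinal: true, ItemID: "item-1"},
		{Text: "It is sunny.", IsFinal: true, ItemID: "item-2"},
	}
	for _, line := range finalLines(updates) {
		fmt.Println(line)
	}
}
```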