omnivoice-core

module

v0.5.0 (latest) · Published: Feb 28, 2026 · License: MIT

Repository

github.com/plexusone/omnivoice-core

README

OmniVoice

Voice abstraction layer for AgentPlexus supporting TTS, STT, and Voice Agents across multiple providers and transport protocols.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              OmniVoice                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────────────┐  │
│  │     TTS     │    │     STT     │    │          Voice Agent            │  │
│  │             │    │             │    │                                 │  │
│  │ Text → Audio│    │ Audio → Text│    │  Real-time bidirectional voice  │  │
│  └──────┬──────┘    └──────┬──────┘    └───────────────┬─────────────────┘  │
│         │                  │                           │                    │
│         ▼                  ▼                           ▼                    │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         Provider Layer                              │    │
│  ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤    │
│  │ ElevenLabs  │  Deepgram   │ Google Cloud│    AWS      │   Azure     │    │
│  │ Cartesia    │  Whisper    │ AssemblyAI  │   Polly     │   Speech    │    │
│  └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         Transport Layer                             │    │
│  ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤    │
│  │   WebRTC    │     SIP     │    PSTN     │  WebSocket  │    HTTP     │    │
│  └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Call System Integration                        │    │
│  ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤    │
│  │   Twilio    │ RingCentral │    Zoom     │   LiveKit   │   Daily     │    │
│  └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Package Structure

omnivoice/
├── tts/                    # Text-to-Speech
│   ├── tts.go              # Interface definitions
│   ├── elevenlabs/         # ElevenLabs provider
│   ├── polly/              # AWS Polly provider
│   ├── google/             # Google Cloud TTS
│   ├── azure/              # Azure Speech
│   └── cartesia/           # Cartesia provider
│
├── stt/                    # Speech-to-Text
│   ├── stt.go              # Interface definitions
│   ├── whisper/            # OpenAI Whisper
│   ├── deepgram/           # Deepgram provider
│   ├── google/             # Google Speech-to-Text
│   ├── azure/              # Azure Speech
│   └── assemblyai/         # AssemblyAI provider
│
├── agent/                  # Voice Agent orchestration
│   ├── agent.go            # Interface definitions
│   ├── session.go          # Conversation session management
│   ├── elevenlabs/         # ElevenLabs Agents
│   ├── vapi/               # Vapi.ai
│   ├── retell/             # Retell AI
│   └── custom/             # Custom agent (STT + LLM + TTS)
│
├── transport/              # Audio transport protocols
│   ├── transport.go        # Interface definitions
│   ├── webrtc/             # WebRTC transport
│   ├── websocket/          # WebSocket streaming
│   ├── sip/                # SIP protocol
│   └── http/               # HTTP-based (batch)
│
├── callsystem/             # Call system integrations
│   ├── callsystem.go       # Interface definitions
│   ├── twilio/             # Twilio ConversationRelay
│   ├── ringcentral/        # RingCentral Voice API
│   ├── zoom/               # Zoom SDK integration
│   ├── livekit/            # LiveKit rooms
│   └── daily/              # Daily.co
│
├── subtitle/               # Subtitle generation
│   └── subtitle.go         # SRT/VTT from transcription results
│
└── examples/
    ├── simple-tts/         # Basic TTS example
    ├── voice-agent/        # Voice agent with Twilio
    └── multi-provider/     # Provider fallback example

Call System Integration

How Voice Agents Connect to Phone/Video Calls

Voice AI agents need a transport layer to receive and send audio. The choice depends on the use case:

┌───────────────────────────────────────────────────────────────────────┐
│                        Call System Options                            │
├────────────────┬───────────────┬─────────────────┬────────────────────┤
│    Platform    │   Protocol    │   Best For      │   Complexity       │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ Twilio         │ WebRTC/SIP/   │ Phone calls,    │ Medium - managed   │
│ Conversation-  │ PSTN          │ IVR, call       │ infrastructure     │
│ Relay          │               │ centers         │                    │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ RingCentral    │ WebRTC/SIP    │ Enterprise PBX, │ Medium - native    │
│ Voice API      │               │ business phones │ AI receptionist    │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ Zoom SDK       │ Proprietary   │ Video meetings  │ High - requires    │
│                │ (via SDK)     │ with voice bots │ native SDK         │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ LiveKit        │ WebRTC        │ Custom apps,    │ Low - open source  │
│                │               │ real-time AI    │ WebRTC rooms       │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ Daily.co       │ WebRTC        │ Embedded video, │ Low - simple API   │
│                │               │ browser-based   │                    │
├────────────────┼───────────────┼─────────────────┼────────────────────┤
│ WebSocket      │ WS/WSS        │ Web apps,       │ Low - direct       │
│ (Direct)       │               │ custom UIs      │ streaming          │
└────────────────┴───────────────┴─────────────────┴────────────────────┘

Wiring Diagram: Voice Agent in a Phone Call

┌────────────────────────────────────────────────────────────────────────────────┐
│                     PSTN/WebRTC Call Flow                                      │
│                                                                                │
│   ┌─────────┐         ┌─────────────┐          ┌───────────────────────────┐   │
│   │  User   │◄───────►│   Twilio    │◄────────►│        OmniVoice          │   │
│   │ (Phone) │  PSTN   │ Conversation│ WebSocket│                           │   │
│   │         │         │   Relay     │          │  ┌─────────────────────┐  │   │
│   └─────────┘         └─────────────┘          │  │   Voice Agent       │  │   │
│                                                │  │                     │  │   │
│                                                │  │  ┌───────┐          │  │   │
│                         Audio In ─────────────►│  │  │  STT  │──┐       │  │   │
│                                                │  │  └───────┘  │       │  │   │
│                                                │  │             ▼       │  │   │
│                                                │  │  ┌───────────────┐  │  │   │
│                                                │  │  │  LLM / Agent  │  │  │   │
│                                                │  │  │  (Eino, etc.) │  │  │   │
│                                                │  │  └───────────────┘  │  │   │
│                                                │  │             │       │  │   │
│                                                │  │             ▼       │  │   │
│                                                │  │  ┌───────┐          │  │   │
│                         Audio Out ◄────────────│  │  │  TTS  │◄─┘       │  │   │
│                                                │  │  └───────┘          │  │   │
│                                                │  └─────────────────────┘  │   │
│                                                └───────────────────────────┘   │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘

Wiring Diagram: Voice Agent in a Zoom Meeting

┌────────────────────────────────────────────────────────────────────────────┐
│                     Zoom Meeting Flow                                      │
│                                                                            │
│   ┌────────────────────────────────────────────────────────────────────┐   │
│   │                         Zoom Meeting                               │   │
│   │                                                                    │   │
│   │   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────────────┐   │   │
│   │   │ User 1  │  │ User 2  │  │ User 3  │  │     Bot Client      │   │   │
│   │   │ (Human) │  │ (Human) │  │ (Human) │  │   (Zoom SDK)        │   │   │
│   │   └─────────┘  └─────────┘  └─────────┘  └──────────┬──────────┘   │   │
│   │                                                     │              │   │
│   └─────────────────────────────────────────────────────┼──────────────┘   │
│                                                         │                  │
│                                        Raw Audio Stream │                  │
│                                                         ▼                  │
│   ┌────────────────────────────────────────────────────────────────────┐   │
│   │                        OmniVoice Agent                             │   │
│   │                                                                    │   │
│   │   Option A: Use Recall.ai (recommended)                            │   │
│   │   ┌─────────────┐                                                  │   │
│   │   │  Recall.ai  │──► Handles Zoom SDK complexity                   │   │
│   │   │     Bot     │──► Provides audio stream via WebSocket           │   │
│   │   └─────────────┘                                                  │   │
│   │                                                                    │   │
│   │   Option B: Self-hosted Zoom SDK Bot                               │   │
│   │   ┌─────────────┐                                                  │   │
│   │   │ Zoom Linux  │──► Complex: requires native SDK                  │   │
│   │   │   SDK Bot   │──► One instance per meeting                      │   │
│   │   └─────────────┘──► Months of engineering                         │   │
│   │                                                                    │   │
│   └────────────────────────────────────────────────────────────────────┘   │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Use Case Recommendations

Use Case	Call System	Transport	Notes
IVR / Call Center	Twilio ConversationRelay	PSTN/SIP	Best managed solution
Business Phone	RingCentral	WebRTC/SIP	Native AI Receptionist available
Custom Web App	LiveKit or Daily	WebRTC	Open source, flexible
Zoom Meetings	Recall.ai + Zoom	SDK → WebSocket	Avoid building Zoom bot yourself
Browser Widget	Direct WebSocket	WebSocket	ElevenLabs widget or custom
Mobile App	LiveKit	WebRTC	Cross-platform support

Latency Considerations

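For natural conversation, aim for a total round-trip latency under 500ms. Run strictly in sequence, the per-stage ranges below add up to 400-1000ms, so reaching that target generally means overlapping stages rather than waiting for each one to finish: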

User speaks → STT (100-300ms) → LLM (200-500ms) → TTS (100-200ms) → User hears

Target: < 500ms total for "instant" feel
Acceptable: < 1000ms for natural conversation
Poor: > 1500ms feels laggy

Optimization Strategies

Streaming STT: Start processing before the user finishes speaking
Streaming TTS: Start playing audio before the full response is generated
Edge inference: Use providers with edge nodes (Deepgram, ElevenLabs)
Turn detection: Use voice activity detection (VAD) for quick turn-taking
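
As one illustration of the turn-detection strategy, here is a very simple energy-based VAD: a frame counts as speech when its RMS amplitude crosses a threshold, and the turn is considered over after a run of quiet frames. This is a sketch, not what OmniVoice or any listed provider actually does. `TurnDetector`, its fields, and the threshold value are assumptions for the example, and production systems typically use more robust (often model-based) detectors.

```go
package main

import (
	"fmt"
	"math"
)

// rms returns the root-mean-square amplitude of a frame of 16-bit PCM samples.
func rms(frame []int16) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		v := float64(s)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// TurnDetector reports end-of-turn after HangoverFrames consecutive quiet
// frames that follow at least one speech frame.
type TurnDetector struct {
	Threshold      float64 // RMS level at or above which a frame counts as speech (illustrative)
	HangoverFrames int     // quiet frames required before the turn is considered over

	speaking bool
	quiet    int
}

// Push feeds one frame and reports whether the speaker's turn just ended.
func (d *TurnDetector) Push(frame []int16) bool {
	if rms(frame) >= d.Threshold {
		d.speaking = true
		d.quiet = 0
		return false
	}
	if !d.speaking {
		return false // silence before any speech is not a turn
	}
	d.quiet++
	if d.quiet >= d.HangoverFrames {
		d.speaking = false
		d.quiet = 0
		return true
	}
	return false
}

func main() {
	loud := []int16{4000, -4000, 4000, -4000}
	silent := []int16{10, -10, 10, -10}
	d := &TurnDetector{Threshold: 500, HangoverFrames: 2}
	for i, f := range [][]int16{silent, loud, loud, silent, silent, silent} {
		fmt.Printf("frame %d: turn ended = %v\n", i, d.Push(f))
	}
}
```

A shorter hangover makes the agent respond sooner but risks cutting in during natural pauses.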

Provider Comparison

TTS Providers

Provider	Latency	Quality	Voices	Streaming	Price
ElevenLabs	Low	Excellent	5000+	Yes	$$$
Cartesia	Very Low	Good	100+	Yes	$$
AWS Polly	Low	Good	60+	Yes	$
Google TTS	Low	Good	200+	Yes	$
Azure Speech	Low	Excellent	400+	Yes	$$

STT Providers

Provider	Latency	Accuracy	Streaming	Languages	Price
Deepgram	Very Low	Excellent	Yes	30+	$$
Whisper (OpenAI)	Medium	Excellent	No*	50+	$
Google Speech	Low	Excellent	Yes	125+	$$
AssemblyAI	Low	Excellent	Yes	20+	$$
Azure Speech	Low	Excellent	Yes	100+	$$

*Whisper requires self-hosting for streaming (e.g., faster-whisper)

Voice Agent Platforms

Provider	Customization	Latency	Telephony	Price
ElevenLabs Agents	Medium	Low	Via Twilio	$$$
Vapi	High	Low	Built-in	$$
Retell AI	High	Low	Built-in	$$
Custom (OmniVoice)	Full	Variable	Via integration	Variable

Provider Conformance Testing

OmniVoice includes conformance test suites that provider implementations can use to verify they correctly implement the TTS and STT interfaces with consistent behavior.

Using Conformance Tests

Provider implementations should import the providertest packages and run the conformance tests:

// In your provider's conformance_test.go (same package as your provider,
// which supplies New, WithAPIKey, and apiKey).
import (
    "testing"

    "github.com/plexusone/omnivoice-core/stt/providertest"
    // or for TTS:
    // "github.com/plexusone/omnivoice-core/tts/providertest"
)

func TestConformance(t *testing.T) {
    // Construct the provider under test.
    p, err := New(WithAPIKey(apiKey))
    if err != nil {
        t.Fatal(err)
    }

    // Run the conformance suite against the provider.
    providertest.RunAll(t, providertest.Config{
        Provider:      p,
        TestAudioFile: "/path/to/test.mp3",
        TestAudioURL:  "https://example.com/test.mp3",
        // ...
    })
}

Test Categories

Category	Description	API Required
Interface	Verify provider implements interface contract (Name, etc.)	No
Behavior	Verify edge case handling (empty input, context cancellation)	Sometimes
Integration	Verify actual synthesis/transcription works	Yes
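
To make the first two categories concrete, the sketch below shows the kind of checks an Interface or Behavior test might perform without touching a real API. It is not the providertest implementation. The `Provider` interface, `CheckInterface`, `CheckBehavior`, and the expectation that empty input returns an error are all assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Provider is a hypothetical TTS-like contract for this sketch only.
type Provider interface {
	Name() string
	Synthesize(ctx context.Context, text string) ([]byte, error)
}

// CheckInterface mirrors the "Interface" category: no API call is needed.
func CheckInterface(p Provider) error {
	if p.Name() == "" {
		return errors.New("Name() must not be empty")
	}
	return nil
}

// CheckBehavior mirrors the "Behavior" category: edge cases such as empty
// input and an already-cancelled context. The expected outcomes are assumptions.
func CheckBehavior(p Provider) []error {
	var errs []error
	if _, err := p.Synthesize(context.Background(), ""); err == nil {
		errs = append(errs, errors.New("empty input should return an error"))
	}
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	if _, err := p.Synthesize(ctx, "hello"); !errors.Is(err, context.Canceled) {
		errs = append(errs, fmt.Errorf("cancelled context: got %v, want context.Canceled", err))
	}
	return errs
}

// stub is an offline provider that satisfies both checks.
type stub struct{}

func (stub) Name() string { return "stub" }

func (stub) Synthesize(ctx context.Context, text string) ([]byte, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	if text == "" {
		return nil, errors.New("empty text")
	}
	return []byte{0x00}, nil
}

func main() {
	fmt.Println(CheckInterface(stub{}), CheckBehavior(stub{}))
}
```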

STT Integration Tests

Test	Description
`Transcribe`	Batch transcription from audio bytes
`TranscribeFile`	Batch transcription from local file path
`TranscribeURL`	Batch transcription from remote URL
`TranscribeStream`	Real-time streaming transcription

TTS Integration Tests

Test	Description
`Synthesize`	Returns valid audio bytes
`SynthesizeStream`	Streams audio chunks
`SynthesizeFromReader`	Handles streaming text input

See Provider Conformance Testing TRD for detailed design documentation.

Directories

Path	Synopsis
agent	Package agent provides voice agent orchestration for real-time conversations.
audio/codec	Package codec provides audio codec implementations for telephony.
callsystem	Package callsystem provides integrations with telephony and meeting platforms.
callsystem/providertest	Package providertest provides conformance tests for CallSystem provider implementations.
examples/simple-tts (command)	Example: Simple TTS with provider fallback
examples/twilio-agent (command)	Example: Voice agent handling inbound Twilio calls
examples/zoom-agent (command)	Example: Voice agent in Zoom meetings
mcp	Package mcp provides an MCP (Model Context Protocol) server for voice interactions.
pipeline	Package pipeline provides components for connecting voice processing stages.
stt	Package stt provides a unified interface for Speech-to-Text providers.
stt/providertest	Package providertest provides conformance tests for STT provider implementations.
subtitle	Package subtitle generates SRT and WebVTT subtitles from STT transcription results.
transport	Package transport provides audio transport protocols for voice agents.
transport/providertest	Package providertest provides conformance tests for Transport provider implementations.
tts	Package tts provides a unified interface for Text-to-Speech providers.
tts/providertest	Package providertest provides conformance tests for TTS provider implementations.