README
¶
Minimal voice pipeline example
This example runs a voice agent pipeline: STT → LLM → TTS → transport.
- STT: Audio from the client (e.g.
AudioRawFrame) is transcribed to text (TranscriptionFrame). - LLM: User text is sent to the LLM; responses are streamed as
LLMTextFrame. - TTS: LLM text is converted to speech (
TTSAudioRawFrame) and sent to the client.
Config
Use a config that sets provider and model so the server builds the voice pipeline. By default,
the network transport is a WebSocket server on /ws:
{
"host": "localhost",
"port": 8080,
"model": "gpt-3.5-turbo",
"provider": "openai",
"plugins": [],
"transport": "websocket"
}
For Groq (faster, cheaper), with optional task-specific models:
{
"host": "localhost",
"port": 8080,
"model": "llama-3.1-8b-instant",
"provider": "groq",
"stt_model": "whisper-large-v3-turbo",
"tts_model": "canopylabs/orpheus-v1-english",
"tts_voice": "alloy",
"plugins": [],
"transport": "websocket"
}
You can mix providers per task (e.g. OpenAI for STT, Groq for LLM and TTS) by setting stt_provider, llm_provider, and tts_provider; put the corresponding keys in api_keys (e.g. openai, groq).
To enable SmallWebRTC signaling (mirroring the small_webrtc network transport), set:
{
"transport": "both",
"webrtc_ice_servers": ["stun:stun.l.google.com:19302"]
}
This will keep the WebSocket endpoint at /ws and also expose a WebRTC offer/answer endpoint at
/webrtc/offer that accepts a JSON body like:
{
"offer": "{ \"type\": \"offer\", \"sdp\": \"...\" }"
}
The response contains an answer field with the JSON SDP to use as the remote description on the client.
WebRTC + Sarvam STT/TTS + Groq LLM
For a WebRTC-based real-time voice pipeline with Sarvam (STT and TTS) and Groq (LLM):
{
"host": "localhost",
"port": 8080,
"transport": "both",
"stt_provider": "sarvam",
"llm_provider": "groq",
"tts_provider": "sarvam",
"model": "llama-3.1-8b-instant",
"stt_model": "saarika:v2.5",
"tts_model": "bulbul:v2",
"webrtc_ice_servers": ["stun:stun.l.google.com:19302"],
"api_keys": {
"sarvam": "<SARVAM_API_KEY>",
"groq": "<GROQ_API_KEY>"
}
}
- Run the server:
go run ./cmd/voxray -config config.json(from repo root). - Run the WebRTC client: Open
tests/frontend/webrtc-voice.htmlin a browser (or serve it), set the server URL tohttp://localhost:8080, click Start, allow the microphone, then speak. The synthesized reply is played back over the remote audio track.
See tests/frontend/README.md for the JavaScript WebRTC client details.
Run
-
Set API keys:
- OpenAI:
OPENAI_API_KEY - Groq:
GROQ_API_KEY
- OpenAI:
-
Start the server:
cd ../.. go run ./cmd/voxray --config config.json -
Connect a WebSocket client to
ws://localhost:8080/ws. -
Send a
StartFrame, then sendAudioRawFrame(or for testing,TranscriptionFramewith user text). The pipeline will run STT (if audio), LLM, TTS and send audio back.
Frame flow
- Client →
StartFrame, thenAudioRawFrame(PCM 16-bit mono, e.g. 16 kHz) orTranscriptionFrame(text). - Server →
TranscriptionFrame(if STT),LLMTextFrame(streamed),TTSAudioRawFrame(output audio).
See pkg/frames/serialize for JSON envelope format.
Documentation
¶
There is no documentation for this package.