Documentation ¶
Overview ¶
Package tts provides text-to-speech services for converting text responses to audio.
The package defines a common Service interface that abstracts TTS providers, so voice AI applications can convert text-only LLM responses to speech.
The WebSocket streaming implementation for Cartesia TTS is excluded from coverage testing because WebSocket connections are difficult to mock.
Architecture ¶
The package provides:
- Service interface for TTS providers
- SynthesisConfig for voice/format configuration
- Voice and AudioFormat types for provider capabilities
- Multiple provider implementations (OpenAI, ElevenLabs, etc.)
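Because every provider satisfies the same Service interface, calling code can be written once and handed any provider. The following is a minimal sketch of that idea, not the package's own code. It redeclares trimmed versions of Service and SynthesisConfig locally so that it compiles on its own; fakeService, synthesizeToBytes, and demo are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// SynthesisConfig and Service are trimmed local mirrors of the package types,
// redeclared here only so the sketch compiles on its own.
type SynthesisConfig struct {
	Voice string
}

type Service interface {
	Name() string
	Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
}

// fakeService is a hypothetical provider that "synthesizes" by echoing the
// input text back as bytes.
type fakeService struct{}

func (fakeService) Name() string { return "fake" }

func (fakeService) Synthesize(_ context.Context, text string, _ SynthesisConfig) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(text)), nil
}

// synthesizeToBytes accepts any Service, reads the full result into memory,
// and closes the reader before returning.
func synthesizeToBytes(ctx context.Context, svc Service, text string, cfg SynthesisConfig) ([]byte, error) {
	r, err := svc.Synthesize(ctx, text, cfg)
	if err != nil {
		return nil, fmt.Errorf("%s: %w", svc.Name(), err)
	}
	defer r.Close()
	return io.ReadAll(r)
}

// demo runs the helper against the fake provider and reports the byte count.
func demo() (int, error) {
	audio, err := synthesizeToBytes(context.Background(), fakeService{}, "Hello world", SynthesisConfig{Voice: "alloy"})
	return len(audio), err
}

func main() {
	n, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n, "bytes")
}
```

Swapping fakeService for a real provider should require no change to synthesizeToBytes, which is the point of coding against the interface.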
Usage ¶
Basic usage with OpenAI TTS:
	service := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
	reader, err := service.Synthesize(ctx, "Hello world", tts.SynthesisConfig{
		Voice:  "alloy",
		Format: tts.FormatMP3,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer reader.Close()
	// Stream audio to a speaker or save it to a file.
	if _, err := io.Copy(audioOutput, reader); err != nil {
		log.Printf("copying audio failed: %v", err)
	}
Streaming TTS ¶
For low-latency applications, use StreamingService:
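	streamer := tts.NewCartesia(os.Getenv("CARTESIA_API_KEY"))
	chunks, err := streamer.SynthesizeStream(ctx, "Hello world", config)
	if err != nil {
		log.Fatal(err)
	}
	for chunk := range chunks {
		if chunk.Error != nil {
			log.Printf("synthesis failed: %v", chunk.Error)
			break
		}
		// Play each audio chunk as soon as it arrives.
		speaker.Write(chunk.Data)
	}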
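When the audio is needed as a single buffer rather than played live, the chunk channel can be drained into memory. The sketch below assumes the AudioChunk fields documented on this page and mirrors the type locally; collectChunks, feed, and errBoom are hypothetical helpers, and the choice to keep draining after an error is this sketch's own, made so a producer is never left blocked on a send.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// AudioChunk is a local mirror of the package's AudioChunk type.
type AudioChunk struct {
	Data  []byte
	Index int
	Final bool
	Error error
}

// collectChunks appends chunk data in arrival order until the channel closes.
// After the first error it stops appending but keeps draining the channel.
func collectChunks(chunks <-chan AudioChunk) ([]byte, error) {
	var buf bytes.Buffer
	var firstErr error
	for chunk := range chunks {
		if firstErr != nil {
			continue // discard the rest so the producer can finish
		}
		if chunk.Error != nil {
			firstErr = fmt.Errorf("chunk %d: %w", chunk.Index, chunk.Error)
			continue
		}
		buf.Write(chunk.Data)
	}
	return buf.Bytes(), firstErr
}

// feed is a hypothetical helper that returns a pre-filled, closed channel.
func feed(chunks ...AudioChunk) <-chan AudioChunk {
	ch := make(chan AudioChunk, len(chunks))
	for _, c := range chunks {
		ch <- c
	}
	close(ch)
	return ch
}

// errBoom is a made-up error used to simulate a failed chunk.
var errBoom = errors.New("boom")

func main() {
	audio, err := collectChunks(feed(
		AudioChunk{Data: []byte("ab"), Index: 0},
		AudioChunk{Data: []byte("cd"), Index: 1, Final: true},
	))
	fmt.Println(string(audio), err)

	_, err = collectChunks(feed(AudioChunk{Index: 0, Error: errBoom}))
	fmt.Println(err)
}
```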
Available Providers ¶
The package includes implementations for:
- OpenAI TTS (tts-1, tts-1-hd models)
- ElevenLabs (high-quality voice cloning)
- Cartesia (ultra-low latency streaming)
- Google Cloud Text-to-Speech (multi-language)
Index ¶
- Constants
- Variables
- type AudioChunk
- type AudioFormat
- type CartesiaOption
- type CartesiaService
- func (s *CartesiaService) Name() string
- func (s *CartesiaService) SupportedFormats() []AudioFormat
- func (s *CartesiaService) SupportedVoices() []Voice
- func (s *CartesiaService) Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
- func (s *CartesiaService) SynthesizeStream(ctx context.Context, text string, config SynthesisConfig) (<-chan AudioChunk, error)
- type ElevenLabsOption
- type ElevenLabsService
- type OpenAIOption
- type OpenAIService
- type Service
- type StreamingService
- type SynthesisConfig
- type SynthesisError
- type Voice
Constants ¶
const (
	// ElevenLabsModelMultilingual is the multilingual v2 model.
	ElevenLabsModelMultilingual = "eleven_multilingual_v2"
	// ElevenLabsModelTurbo is the fast turbo v2.5 model.
	ElevenLabsModelTurbo = "eleven_turbo_v2_5"
	// ElevenLabsModelEnglish is the English monolingual v1 model.
	ElevenLabsModelEnglish = "eleven_monolingual_v1"
	// ElevenLabsModelMultilingualV1 is the older multilingual v1 model.
	ElevenLabsModelMultilingualV1 = "eleven_multilingual_v1"
)
const (
	// ModelTTS1 is the OpenAI TTS model optimized for speed.
	ModelTTS1 = "tts-1"
	// ModelTTS1HD is the OpenAI TTS model optimized for quality.
	ModelTTS1HD = "tts-1-hd"
)
const (
	VoiceAlloy   = "alloy"   // Neutral voice.
	VoiceEcho    = "echo"    // Male voice.
	VoiceFable   = "fable"   // British accent.
	VoiceOnyx    = "onyx"    // Deep male voice.
	VoiceNova    = "nova"    // Female voice.
	VoiceShimmer = "shimmer" // Soft female voice.
)
OpenAI voices.
const (
	// CartesiaModelSonic is the latest Sonic model for Cartesia TTS.
	CartesiaModelSonic = "sonic-2024-10-01"
)
Variables ¶
var (
	// ErrInvalidVoice is returned when the requested voice is not available.
	ErrInvalidVoice = errors.New("invalid or unsupported voice")
	// ErrInvalidFormat is returned when the requested format is not supported.
	ErrInvalidFormat = errors.New("invalid or unsupported audio format")
	// ErrEmptyText is returned when attempting to synthesize empty text.
	ErrEmptyText = errors.New("text cannot be empty")
	// ErrSynthesisFailed is returned when TTS synthesis fails.
	ErrSynthesisFailed = errors.New("speech synthesis failed")
	// ErrRateLimited is returned when API rate limits are exceeded.
	ErrRateLimited = errors.New("rate limit exceeded")
	// ErrQuotaExceeded is returned when account quota is exceeded.
	ErrQuotaExceeded = errors.New("quota exceeded")
	ErrServiceUnavailable = errors.New("TTS service unavailable")
)
Common TTS errors.
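Sentinel errors like these are normally matched with errors.Is, which also sees through wrapping. The sketch below mirrors three of the sentinels locally and assumes two things that this page does not state: that providers wrap the sentinels with %w, and that rate limiting and service unavailability are the transient cases worth retrying. shouldRetry and wrapped are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// Local mirrors of three of the package's sentinel errors.
var (
	ErrEmptyText          = errors.New("text cannot be empty")
	ErrRateLimited        = errors.New("rate limit exceeded")
	ErrServiceUnavailable = errors.New("TTS service unavailable")
)

// shouldRetry reports whether err looks transient. Which sentinels count as
// transient is this sketch's assumption, not documented package behavior.
func shouldRetry(err error) bool {
	return errors.Is(err, ErrRateLimited) || errors.Is(err, ErrServiceUnavailable)
}

// wrapped simulates a provider adding context around a sentinel error.
func wrapped(err error) error {
	return fmt.Errorf("provider request failed: %w", err)
}

func main() {
	fmt.Println(shouldRetry(wrapped(ErrRateLimited)))
	fmt.Println(shouldRetry(wrapped(ErrEmptyText)))
}
```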
var (
	// FormatMP3 is MP3 format (most compatible).
	FormatMP3 = AudioFormat{
		Name:       "mp3",
		MIMEType:   "audio/mpeg",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}
	// FormatOpus is Opus format (best for streaming).
	FormatOpus = AudioFormat{
		Name:       "opus",
		MIMEType:   "audio/opus",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}
	// FormatAAC is AAC format.
	FormatAAC = AudioFormat{
		Name:       "aac",
		MIMEType:   "audio/aac",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}
	// FormatFLAC is FLAC format (lossless).
	FormatFLAC = AudioFormat{
		Name:       "flac",
		MIMEType:   "audio/flac",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}
	// FormatPCM16 is raw 16-bit PCM (for processing).
	FormatPCM16 = AudioFormat{
		Name:       "pcm",
		MIMEType:   "audio/pcm",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}
	// FormatWAV is WAV format (PCM with header).
	FormatWAV = AudioFormat{
		Name:       "wav",
		MIMEType:   "audio/wav",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}
)
Common audio formats.
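For the uncompressed formats, the three numeric fields determine the data rate: sample rate times bytes per sample times channels. The sketch below works that arithmetic through on a local mirror of AudioFormat. The 24000 Hz sample rate is an assumed value for illustration, since the unexported sampleRateDefault is not shown on this page; pcmBytesPerSecond is a hypothetical helper.

```go
package main

import "fmt"

// AudioFormat is a local mirror of the package's AudioFormat type.
type AudioFormat struct {
	Name       string
	MIMEType   string
	SampleRate int
	BitDepth   int
	Channels   int
}

// pcmBytesPerSecond returns the uncompressed data rate in bytes per second.
// It yields 0 for formats declared with BitDepth 0 (the compressed ones above),
// because their size cannot be derived from these fields.
func pcmBytesPerSecond(f AudioFormat) int {
	return f.SampleRate * f.BitDepth * f.Channels / 8
}

func main() {
	// 24000 Hz is an assumed sample rate, not the package's documented default.
	pcm := AudioFormat{Name: "pcm", MIMEType: "audio/pcm", SampleRate: 24000, BitDepth: 16, Channels: 1}
	mp3 := AudioFormat{Name: "mp3", MIMEType: "audio/mpeg", SampleRate: 24000, BitDepth: 0, Channels: 1}

	fmt.Println(pcmBytesPerSecond(pcm)) // 24000 * 16 * 1 / 8 = 48000
	fmt.Println(pcmBytesPerSecond(mp3)) // 0: compressed, rate not derivable
}
```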
Functions ¶
This section is empty.
Types ¶
type AudioChunk ¶
type AudioChunk struct {
	// Data is the raw audio bytes.
	Data []byte
	// Index is the chunk sequence number (0-indexed).
	Index int
	// Final indicates this is the last chunk.
	Final bool
	// Error is set if an error occurred during synthesis.
	Error error
}
AudioChunk represents a chunk of synthesized audio data.
type AudioFormat ¶
type AudioFormat struct {
	// Name is the format identifier ("mp3", "opus", "aac", "flac", "pcm", "wav").
	Name string
	// MIMEType is the content type (e.g., "audio/mpeg").
	MIMEType string
	// SampleRate is the audio sample rate in Hz.
	SampleRate int
	// BitDepth is the bits per sample (for PCM formats).
	BitDepth int
	// Channels is the number of audio channels (1=mono, 2=stereo).
	Channels int
}
AudioFormat describes an audio output format.
type CartesiaOption ¶
type CartesiaOption func(*CartesiaService)
CartesiaOption configures the Cartesia TTS service.
func WithCartesiaBaseURL ¶
func WithCartesiaBaseURL(url string) CartesiaOption
WithCartesiaBaseURL sets a custom base URL.
func WithCartesiaClient ¶
func WithCartesiaClient(client *http.Client) CartesiaOption
WithCartesiaClient sets a custom HTTP client.
func WithCartesiaModel ¶
func WithCartesiaModel(model string) CartesiaOption
WithCartesiaModel sets the TTS model.
func WithCartesiaWSURL ¶
func WithCartesiaWSURL(url string) CartesiaOption
WithCartesiaWSURL sets a custom WebSocket URL.
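The With... functions above follow Go's functional options pattern: each returns a closure that mutates the service, and the constructor applies them in order over its defaults. The sketch below shows the pattern in isolation. The service struct, its fields, and the placeholder base URL are all hypothetical, since CartesiaService's real fields are unexported and not shown here.

```go
package main

import "fmt"

// service is a hypothetical stand-in for CartesiaService.
type service struct {
	apiKey  string
	baseURL string
	model   string
}

// option mirrors the shape of CartesiaOption: a function that edits the service.
type option func(*service)

func withBaseURL(url string) option {
	return func(s *service) { s.baseURL = url }
}

func withModel(model string) option {
	return func(s *service) { s.model = model }
}

// newService sets defaults first, then lets each option override them in order.
// The default base URL is a placeholder, not Cartesia's real endpoint.
func newService(apiKey string, opts ...option) *service {
	s := &service{
		apiKey:  apiKey,
		baseURL: "https://api.example.invalid",
		model:   "sonic-2024-10-01",
	}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	def := newService("key")
	custom := newService("key", withModel("custom-model"), withBaseURL("http://localhost:8080"))
	fmt.Println(def.model, def.baseURL)
	fmt.Println(custom.model, custom.baseURL)
}
```

One consequence of this design is that new settings can be added later without changing the constructor's signature or breaking existing callers.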
type CartesiaService ¶
type CartesiaService struct {
	// contains filtered or unexported fields
}
CartesiaService implements TTS using Cartesia's ultra-low latency API. Cartesia specializes in real-time streaming TTS with <100ms first-byte latency.
func NewCartesia ¶
func NewCartesia(apiKey string, opts ...CartesiaOption) *CartesiaService
NewCartesia creates a Cartesia TTS service.
func (*CartesiaService) Name ¶
func (s *CartesiaService) Name() string
Name returns the provider identifier.
func (*CartesiaService) SupportedFormats ¶
func (s *CartesiaService) SupportedFormats() []AudioFormat
SupportedFormats returns audio formats supported by Cartesia.
func (*CartesiaService) SupportedVoices ¶
func (s *CartesiaService) SupportedVoices() []Voice
SupportedVoices returns a sample of available Cartesia voices.
func (*CartesiaService) Synthesize ¶
func (s *CartesiaService) Synthesize(
	ctx context.Context,
	text string,
	config SynthesisConfig,
) (io.ReadCloser, error)
Synthesize converts text to audio using Cartesia's REST API. For streaming output, use SynthesizeStream instead.
func (*CartesiaService) SynthesizeStream ¶
func (s *CartesiaService) SynthesizeStream(
	ctx context.Context,
	text string,
	config SynthesisConfig,
) (<-chan AudioChunk, error)
SynthesizeStream converts text to audio with streaming output via WebSocket. This provides ultra-low latency (<100ms first-byte) for real-time applications.
type ElevenLabsOption ¶
type ElevenLabsOption func(*ElevenLabsService)
ElevenLabsOption configures the ElevenLabs TTS service.
func WithElevenLabsBaseURL ¶
func WithElevenLabsBaseURL(url string) ElevenLabsOption
WithElevenLabsBaseURL sets a custom base URL.
func WithElevenLabsClient ¶
func WithElevenLabsClient(client *http.Client) ElevenLabsOption
WithElevenLabsClient sets a custom HTTP client.
func WithElevenLabsModel ¶
func WithElevenLabsModel(model string) ElevenLabsOption
WithElevenLabsModel sets the TTS model.
type ElevenLabsService ¶
type ElevenLabsService struct {
	// contains filtered or unexported fields
}
ElevenLabsService implements TTS using ElevenLabs' API. ElevenLabs specializes in high-quality voice cloning and natural-sounding speech.
func NewElevenLabs ¶
func NewElevenLabs(apiKey string, opts ...ElevenLabsOption) *ElevenLabsService
NewElevenLabs creates an ElevenLabs TTS service.
func (*ElevenLabsService) Name ¶
func (s *ElevenLabsService) Name() string
Name returns the provider identifier.
func (*ElevenLabsService) SupportedFormats ¶
func (s *ElevenLabsService) SupportedFormats() []AudioFormat
SupportedFormats returns audio formats supported by ElevenLabs.
func (*ElevenLabsService) SupportedVoices ¶
func (s *ElevenLabsService) SupportedVoices() []Voice
SupportedVoices returns a sample of available ElevenLabs voices. Note: ElevenLabs has many more voices including custom cloned voices. Use the ElevenLabs API to get a complete list of available voices.
func (*ElevenLabsService) Synthesize ¶
func (s *ElevenLabsService) Synthesize(
	ctx context.Context,
	text string,
	config SynthesisConfig,
) (io.ReadCloser, error)
Synthesize converts text to audio using ElevenLabs' TTS API.
type OpenAIOption ¶
type OpenAIOption func(*OpenAIService)
OpenAIOption configures the OpenAI TTS service.
func WithOpenAIBaseURL ¶
func WithOpenAIBaseURL(url string) OpenAIOption
WithOpenAIBaseURL sets a custom base URL (for testing or proxies).
func WithOpenAIClient ¶
func WithOpenAIClient(client *http.Client) OpenAIOption
WithOpenAIClient sets a custom HTTP client.
func WithOpenAIModel ¶
func WithOpenAIModel(model string) OpenAIOption
WithOpenAIModel sets the TTS model to use.
type OpenAIService ¶
type OpenAIService struct {
	// contains filtered or unexported fields
}
OpenAIService implements TTS using OpenAI's text-to-speech API.
func NewOpenAI ¶
func NewOpenAI(apiKey string, opts ...OpenAIOption) *OpenAIService
NewOpenAI creates an OpenAI TTS service.
func (*OpenAIService) Name ¶
func (s *OpenAIService) Name() string
Name returns the provider identifier.
func (*OpenAIService) SupportedFormats ¶
func (s *OpenAIService) SupportedFormats() []AudioFormat
SupportedFormats returns audio formats supported by OpenAI TTS.
func (*OpenAIService) SupportedVoices ¶
func (s *OpenAIService) SupportedVoices() []Voice
SupportedVoices returns available OpenAI voices.
func (*OpenAIService) Synthesize ¶
func (s *OpenAIService) Synthesize(
	ctx context.Context,
	text string,
	config SynthesisConfig,
) (io.ReadCloser, error)
Synthesize converts text to audio using OpenAI's TTS API.
type Service ¶
type Service interface {
	// Name returns the provider identifier (for logging/debugging).
	Name() string
	// Synthesize converts text to audio.
	// Returns a reader for streaming audio data.
	// The caller is responsible for closing the reader.
	Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
	// SupportedVoices returns available voices for this provider.
	SupportedVoices() []Voice
	// SupportedFormats returns supported audio output formats.
	SupportedFormats() []AudioFormat
}
Service converts text to speech audio. This interface abstracts different TTS providers (OpenAI, ElevenLabs, etc.), enabling voice AI applications to use any provider interchangeably.
type StreamingService ¶
type StreamingService interface {
	Service
	// SynthesizeStream converts text to audio with streaming output.
	// Returns a channel that receives audio chunks as they're generated.
	// The channel is closed when synthesis completes or an error occurs.
	SynthesizeStream(ctx context.Context, text string, config SynthesisConfig) (<-chan AudioChunk, error)
}
StreamingService extends Service with streaming synthesis capabilities. Streaming TTS provides lower latency by returning audio chunks as they're generated.
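Since StreamingService embeds Service, code that holds a plain Service can use a type assertion to discover at run time whether streaming is available and fall back to Synthesize otherwise. The sketch below illustrates that under stated assumptions: the interfaces and types are trimmed local mirrors, and bufferedFake, streamingFake, speak, and pathFor are hypothetical names.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"strings"
)

// Trimmed local mirrors of the package types, kept to what the sketch needs.
type SynthesisConfig struct{ Voice string }

type AudioChunk struct {
	Data  []byte
	Error error
}

type Service interface {
	Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
}

type StreamingService interface {
	Service
	SynthesizeStream(ctx context.Context, text string, config SynthesisConfig) (<-chan AudioChunk, error)
}

// bufferedFake is a hypothetical provider that implements only Service.
type bufferedFake struct{}

func (bufferedFake) Synthesize(_ context.Context, text string, _ SynthesisConfig) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(text)), nil
}

// streamingFake adds SynthesizeStream, emitting one chunk per word.
type streamingFake struct{ bufferedFake }

func (streamingFake) SynthesizeStream(_ context.Context, text string, _ SynthesisConfig) (<-chan AudioChunk, error) {
	words := strings.Fields(text)
	ch := make(chan AudioChunk, len(words))
	for _, w := range words {
		ch <- AudioChunk{Data: []byte(w)}
	}
	close(ch)
	return ch, nil
}

// speak prefers streaming when the provider supports it and otherwise falls
// back to the buffered Synthesize call. It also reports which path it took.
func speak(ctx context.Context, svc Service, text string, cfg SynthesisConfig) ([]byte, string, error) {
	if ss, ok := svc.(StreamingService); ok {
		chunks, err := ss.SynthesizeStream(ctx, text, cfg)
		if err != nil {
			return nil, "stream", err
		}
		var buf bytes.Buffer
		for c := range chunks {
			if c.Error != nil {
				return buf.Bytes(), "stream", c.Error
			}
			buf.Write(c.Data)
		}
		return buf.Bytes(), "stream", nil
	}
	r, err := svc.Synthesize(ctx, text, cfg)
	if err != nil {
		return nil, "buffered", err
	}
	defer r.Close()
	audio, err := io.ReadAll(r)
	return audio, "buffered", err
}

// pathFor wraps speak so callers in this demo need no context plumbing.
func pathFor(svc Service) string {
	_, path, _ := speak(context.Background(), svc, "hello there", SynthesisConfig{})
	return path
}

func main() {
	fmt.Println(pathFor(bufferedFake{}))
	fmt.Println(pathFor(streamingFake{}))
}
```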
type SynthesisConfig ¶
type SynthesisConfig struct {
	// Voice is the voice ID to use for synthesis.
	// Available voices vary by provider; use SupportedVoices() to list options.
	Voice string
	// Format is the output audio format.
	// Default is MP3 for most providers.
	Format AudioFormat
	// Speed is the speech rate multiplier (0.25-4.0, default 1.0).
	// Not all providers support speed adjustment.
	Speed float64
	// Pitch adjusts the voice pitch (-20 to 20 semitones, default 0).
	// Not all providers support pitch adjustment.
	Pitch float64
	// Language is the language code for synthesis (e.g., "en-US").
	// Required for some providers, optional for others.
	Language string
	// Model is the TTS model to use (provider-specific).
	// For OpenAI: "tts-1" (fast) or "tts-1-hd" (high quality).
	Model string
}
SynthesisConfig configures text-to-speech synthesis.
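The documented ranges for Speed and Pitch can be checked before a request is sent. The sketch below is one possible pre-flight check on a trimmed local mirror of SynthesisConfig. Whether the package itself validates these fields is not stated on this page, and treating a zero Speed as "unset" is this sketch's assumption; normalize and the two error values are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// SynthesisConfig is a trimmed local mirror with only the fields checked here.
type SynthesisConfig struct {
	Voice string
	Speed float64
	Pitch float64
}

// Hypothetical validation errors for this sketch.
var (
	errSpeedRange = errors.New("speed must be between 0.25 and 4.0")
	errPitchRange = errors.New("pitch must be between -20 and 20 semitones")
)

// normalize fills in the documented default speed of 1.0 when Speed is zero,
// then rejects values outside the documented ranges.
func normalize(cfg SynthesisConfig) (SynthesisConfig, error) {
	if cfg.Speed == 0 {
		cfg.Speed = 1.0
	}
	if cfg.Speed < 0.25 || cfg.Speed > 4.0 {
		return cfg, errSpeedRange
	}
	if cfg.Pitch < -20 || cfg.Pitch > 20 {
		return cfg, errPitchRange
	}
	return cfg, nil
}

func main() {
	cfg, err := normalize(SynthesisConfig{Voice: "alloy"})
	fmt.Println(cfg.Speed, err)

	_, err = normalize(SynthesisConfig{Voice: "alloy", Speed: 5})
	fmt.Println(err)
}
```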
func DefaultSynthesisConfig ¶
func DefaultSynthesisConfig() SynthesisConfig
DefaultSynthesisConfig returns sensible defaults for synthesis.
type SynthesisError ¶
type SynthesisError struct {
	// Provider is the TTS provider that returned the error.
	Provider string
	// Code is the provider-specific error code.
	Code string
	// Message is the error message.
	Message string
	// Cause is the underlying error (if any).
	Cause error
	// Retryable indicates if the error is transient and retry may succeed.
	Retryable bool
}
SynthesisError provides detailed error information from TTS providers.
func NewSynthesisError ¶
func NewSynthesisError(provider, code, message string, cause error, retryable bool) *SynthesisError
NewSynthesisError creates a new SynthesisError.
func (*SynthesisError) Error ¶
func (e *SynthesisError) Error() string
Error implements the error interface.
func (*SynthesisError) Unwrap ¶
func (e *SynthesisError) Unwrap() error
Unwrap returns the underlying error.
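Because SynthesisError implements both Error and Unwrap, callers can use errors.As to reach its fields and errors.Is to match its cause, even when the error has been wrapped further. The sketch below mirrors the type locally to show both. The string format returned by Error is an assumption (the real format is not documented here), and errTimeout, isRetryable, causedBy, and wrap are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// SynthesisError is a local mirror of the package's SynthesisError type.
type SynthesisError struct {
	Provider  string
	Code      string
	Message   string
	Cause     error
	Retryable bool
}

// Error uses an assumed format; the package's actual output may differ.
func (e *SynthesisError) Error() string {
	return fmt.Sprintf("%s [%s]: %s", e.Provider, e.Code, e.Message)
}

// Unwrap exposes the cause so errors.Is and errors.As can walk the chain.
func (e *SynthesisError) Unwrap() error { return e.Cause }

// errTimeout is a hypothetical underlying cause.
var errTimeout = errors.New("timeout")

// isRetryable finds a SynthesisError anywhere in the chain and reads its flag.
func isRetryable(err error) bool {
	var se *SynthesisError
	return errors.As(err, &se) && se.Retryable
}

// causedBy reports whether target appears anywhere in err's chain.
func causedBy(err, target error) bool { return errors.Is(err, target) }

// wrap simulates a caller adding its own context to the error.
func wrap(err error) error { return fmt.Errorf("synthesize: %w", err) }

func main() {
	err := wrap(&SynthesisError{
		Provider:  "cartesia",
		Code:      "503",
		Message:   "upstream timeout",
		Cause:     errTimeout,
		Retryable: true,
	})
	fmt.Println(err)
	fmt.Println(isRetryable(err), causedBy(err, errTimeout))
}
```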
type Voice ¶
type Voice struct {
	// ID is the provider-specific voice identifier.
	ID string
	// Name is a human-readable voice name.
	Name string
	// Language is the primary language code (e.g., "en", "es", "fr").
	Language string
	// Gender is the voice gender ("male", "female", "neutral").
	Gender string
	// Description provides additional voice characteristics.
	Description string
	// Preview is a URL to a voice sample (if available).
	Preview string
}
Voice describes a TTS voice available from a provider.
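A common use of the Voice metadata is narrowing the list returned by SupportedVoices to one language. The sketch below filters a local mirror of Voice by language code. The sample voices and their IDs are invented for illustration, voicesForLanguage is a hypothetical helper, and matching "en" against regional codes such as "en-GB" is a design choice of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// Voice is a local mirror of the package's Voice type.
type Voice struct {
	ID          string
	Name        string
	Language    string
	Gender      string
	Description string
	Preview     string
}

// sampleVoices is invented data standing in for a SupportedVoices() result.
var sampleVoices = []Voice{
	{ID: "v1", Name: "Sample One", Language: "en", Gender: "neutral"},
	{ID: "v2", Name: "Sample Two", Language: "en-GB", Gender: "female"},
	{ID: "v3", Name: "Sample Three", Language: "es", Gender: "male"},
}

// voicesForLanguage returns voices whose Language equals lang or is a regional
// variant of it (for example "en" matches "en" and "en-GB"), ignoring case.
func voicesForLanguage(voices []Voice, lang string) []Voice {
	lang = strings.ToLower(lang)
	var out []Voice
	for _, v := range voices {
		l := strings.ToLower(v.Language)
		if l == lang || strings.HasPrefix(l, lang+"-") {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	for _, v := range voicesForLanguage(sampleVoices, "en") {
		fmt.Println(v.ID, v.Name, v.Language)
	}
}
```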