Documentation ¶
Overview ¶
Package tts provides text-to-speech services for converting text responses to audio.
The package defines a common Service interface that abstracts TTS providers, so voice AI applications can convert text-only LLM responses to speech.
The WebSocket streaming implementation for Cartesia TTS is kept in a separate file, which is excluded from coverage testing because WebSocket connections are difficult to mock.
Architecture ¶
The package provides:
- Service interface for TTS providers
- SynthesisConfig for voice/format configuration
- Voice and AudioFormat types for provider capabilities
- Multiple provider implementations (OpenAI, ElevenLabs, etc.)
Usage ¶
Basic usage with OpenAI TTS:
service := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
reader, err := service.Synthesize(ctx, "Hello world", tts.SynthesisConfig{
	Voice:  "alloy",
	Format: tts.FormatMP3,
})
if err != nil {
	log.Fatal(err)
}
defer reader.Close()
// Stream audio to a speaker or save it to a file.
if _, err := io.Copy(audioOutput, reader); err != nil {
	log.Fatal(err)
}
Streaming TTS ¶
For low-latency applications, use StreamingService:
streamer := tts.NewCartesia(os.Getenv("CARTESIA_API_KEY"))
chunks, err := streamer.SynthesizeStream(ctx, "Hello world", config)
if err != nil {
	log.Fatal(err)
}
for chunk := range chunks {
	if chunk.Error != nil {
		log.Fatal(chunk.Error)
	}
	// Play each audio chunk as soon as it arrives.
	speaker.Write(chunk.Data)
}
Available Providers ¶
The package includes implementations for:
- OpenAI TTS (tts-1, tts-1-hd models)
- ElevenLabs (high-quality voice cloning)
- Cartesia (ultra-low latency streaming)
- Google Cloud Text-to-Speech (multi-language)
Index ¶
- Constants
- Variables
- func APIKeyFromCredential(c credentials.Credential) string
- func RegisterFactory(providerType string, factory Factory)
- func ResolveCredential(ctx context.Context, providerType string, cfgDir string, ...) (credentials.Credential, error)
- func SynthesizeWithRetry(ctx context.Context, svc Service, text string, config SynthesisConfig, ...) (io.ReadCloser, error)
- type AudioChunk
- type AudioFormat
- type CartesiaOption
- type CartesiaService
- func (s *CartesiaService) Name() string
- func (s *CartesiaService) SupportedFormats() []AudioFormat
- func (s *CartesiaService) SupportedVoices() []Voice
- func (s *CartesiaService) Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
- func (s *CartesiaService) SynthesizeStream(ctx context.Context, text string, config SynthesisConfig) (<-chan AudioChunk, error)
- type ElevenLabsOption
- type ElevenLabsService
- type Factory
- type OpenAIOption
- type OpenAIService
- type ProviderSpec
- type RetryConfig
- type Service
- type StreamingService
- type SynthesisConfig
- type SynthesisError
- type Voice
Constants ¶
const (
	// ElevenLabsModelMultilingual is the multilingual v2 model.
	ElevenLabsModelMultilingual = "eleven_multilingual_v2"
	// ElevenLabsModelTurbo is the fast turbo v2.5 model.
	ElevenLabsModelTurbo = "eleven_turbo_v2_5"
	// ElevenLabsModelEnglish is the English monolingual v1 model.
	ElevenLabsModelEnglish = "eleven_monolingual_v1"
	// ElevenLabsModelMultilingualV1 is the older multilingual v1 model.
	ElevenLabsModelMultilingualV1 = "eleven_multilingual_v1"
)
const (
	// ModelTTS1 is the OpenAI TTS model optimized for speed.
	ModelTTS1 = "tts-1"
	// ModelTTS1HD is the OpenAI TTS model optimized for quality.
	ModelTTS1HD = "tts-1-hd"
)
const (
	VoiceAlloy   = "alloy"   // Neutral voice.
	VoiceEcho    = "echo"    // Male voice.
	VoiceFable   = "fable"   // British accent.
	VoiceOnyx    = "onyx"    // Deep male voice.
	VoiceNova    = "nova"    // Female voice.
	VoiceShimmer = "shimmer" // Soft female voice.
)
OpenAI voices.
const (
// CartesiaModelSonic is the latest Sonic model for Cartesia TTS.
CartesiaModelSonic = "sonic-2024-10-01"
)
Variables ¶
var (
	// ErrInvalidVoice is returned when the requested voice is not available.
	ErrInvalidVoice = errors.New("invalid or unsupported voice")
	// ErrInvalidFormat is returned when the requested format is not supported.
	ErrInvalidFormat = errors.New("invalid or unsupported audio format")
	// ErrEmptyText is returned when attempting to synthesize empty text.
	ErrEmptyText = errors.New("text cannot be empty")
	// ErrSynthesisFailed is returned when TTS synthesis fails.
	ErrSynthesisFailed = errors.New("speech synthesis failed")
	// ErrRateLimited is returned when API rate limits are exceeded.
	ErrRateLimited = errors.New("rate limit exceeded")
	// ErrQuotaExceeded is returned when account quota is exceeded.
	ErrQuotaExceeded = errors.New("quota exceeded")
	ErrServiceUnavailable = errors.New("TTS service unavailable")
)
Common TTS errors.
var (
	// FormatMP3 is MP3 format (most compatible).
	FormatMP3 = AudioFormat{
		Name:       "mp3",
		MIMEType:   "audio/mpeg",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}
	// FormatOpus is Opus format (best for streaming).
	FormatOpus = AudioFormat{
		Name:       "opus",
		MIMEType:   "audio/opus",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}
	// FormatAAC is AAC format.
	FormatAAC = AudioFormat{
		Name:       "aac",
		MIMEType:   "audio/aac",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}
	// FormatFLAC is FLAC format (lossless).
	FormatFLAC = AudioFormat{
		Name:       "flac",
		MIMEType:   "audio/flac",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}
	// FormatPCM16 is raw 16-bit PCM (for processing).
	FormatPCM16 = AudioFormat{
		Name:       "pcm",
		MIMEType:   "audio/pcm",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}
	// FormatWAV is WAV format (PCM with header).
	FormatWAV = AudioFormat{
		Name:       "wav",
		MIMEType:   "audio/wav",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}
)
Common audio formats.
Functions ¶
func APIKeyFromCredential ¶ added in v1.4.5
func APIKeyFromCredential(c credentials.Credential) string
APIKeyFromCredential returns the raw API key from an APIKey credential, or "" for any other credential shape (or nil). TTS providers want the key string for their constructors.
func RegisterFactory ¶ added in v1.4.5
func RegisterFactory(providerType string, factory Factory)
RegisterFactory registers a factory for the given provider type. Typically called from per-provider package init().
func ResolveCredential ¶ added in v1.4.5
func ResolveCredential(
	ctx context.Context,
	providerType string,
	cfgDir string,
	cred *credentials.CredentialConfig,
) (credentials.Credential, error)
ResolveCredential resolves a TTS provider's credential block into a concrete Credential, applying the same fallback chain as chat providers. Exposed as a helper for the SDK runtime-config layer.
func SynthesizeWithRetry ¶ added in v1.4.2
func SynthesizeWithRetry(
	ctx context.Context,
	svc Service,
	text string,
	config SynthesisConfig,
	retry RetryConfig,
) (io.ReadCloser, error)
SynthesizeWithRetry calls svc.Synthesize with bounded retry on transient errors. Only errors where SynthesisError.Retryable is true are retried; all others are returned immediately. Uses full jitter backoff to avoid synchronized retries across concurrent callers.
Types ¶
type AudioChunk ¶
type AudioChunk struct {
// Data is the raw audio bytes.
Data []byte
// Index is the chunk sequence number (0-indexed).
Index int
// Final indicates this is the last chunk.
Final bool
// Error is set if an error occurred during synthesis.
Error error
}
AudioChunk represents a chunk of synthesized audio data.
type AudioFormat ¶
type AudioFormat struct {
// Name is the format identifier ("mp3", "opus", "pcm", "aac", "flac", "wav").
Name string
// MIMEType is the content type (e.g., "audio/mpeg").
MIMEType string
// SampleRate is the audio sample rate in Hz.
SampleRate int
// BitDepth is the bits per sample (for PCM formats).
BitDepth int
// Channels is the number of audio channels (1=mono, 2=stereo).
Channels int
}
AudioFormat describes an audio output format.
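For uncompressed PCM, the three numeric fields determine the data rate: bytes per second = SampleRate × (BitDepth / 8) × Channels. The sketch below works through that arithmetic. The audioFormat type is a local mirror, pcmBytesPerSecond is a hypothetical helper, and the 24000 Hz sample rate is an assumed value, since the value of sampleRateDefault is not shown here.

```go
package main

import "fmt"

// audioFormat mirrors the numeric fields of the documented AudioFormat.
type audioFormat struct {
	Name       string
	SampleRate int
	BitDepth   int
	Channels   int
}

// pcmBytesPerSecond returns the data rate of uncompressed PCM audio.
// It returns 0 when BitDepth is 0, as in the compressed formats above.
func pcmBytesPerSecond(f audioFormat) int {
	return f.SampleRate * (f.BitDepth / 8) * f.Channels
}

func main() {
	// 24000 Hz and 16 bits are assumed values for illustration.
	pcm := audioFormat{Name: "pcm", SampleRate: 24000, BitDepth: 16, Channels: 1}
	fmt.Println(pcmBytesPerSecond(pcm)) // prints 48000
}
```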
type CartesiaOption ¶
type CartesiaOption func(*CartesiaService)
CartesiaOption configures the Cartesia TTS service.
func WithCartesiaBaseURL ¶
func WithCartesiaBaseURL(url string) CartesiaOption
WithCartesiaBaseURL sets a custom base URL.
func WithCartesiaClient ¶
func WithCartesiaClient(client *http.Client) CartesiaOption
WithCartesiaClient sets a custom HTTP client.
func WithCartesiaModel ¶
func WithCartesiaModel(model string) CartesiaOption
WithCartesiaModel sets the TTS model.
func WithCartesiaWSURL ¶
func WithCartesiaWSURL(url string) CartesiaOption
WithCartesiaWSURL sets a custom WebSocket URL.
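The option functions above follow Go's functional options pattern. A minimal sketch of how such a constructor typically works is shown here. The service type, its fields, the default values, and the lowercase option names are all hypothetical, and the real CartesiaService fields are unexported and may differ.

```go
package main

import "fmt"

// service is a hypothetical stand-in for CartesiaService.
type service struct {
	apiKey  string
	baseURL string
	model   string
}

// option mutates a service during construction.
type option func(*service)

func withBaseURL(url string) option { return func(s *service) { s.baseURL = url } }
func withModel(model string) option { return func(s *service) { s.model = model } }

// newService sets defaults first and then applies options in order,
// so a later option overrides an earlier one.
func newService(apiKey string, opts ...option) *service {
	s := &service{
		apiKey:  apiKey,
		baseURL: "https://api.example.invalid", // placeholder default
		model:   "default-model",               // placeholder default
	}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	s := newService("key", withModel("sonic-2024-10-01"))
	fmt.Println(s.baseURL, s.model)
}
```

Callers that pass no options get the defaults, which is why NewCartesia can take only an API key in the common case.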
type CartesiaService ¶
type CartesiaService struct {
// contains filtered or unexported fields
}
CartesiaService implements TTS using Cartesia's ultra-low latency API. Cartesia specializes in real-time streaming TTS with <100ms first-byte latency.
func NewCartesia ¶
func NewCartesia(apiKey string, opts ...CartesiaOption) *CartesiaService
NewCartesia creates a Cartesia TTS service.
func (*CartesiaService) Name ¶
func (s *CartesiaService) Name() string
Name returns the provider identifier.
func (*CartesiaService) SupportedFormats ¶
func (s *CartesiaService) SupportedFormats() []AudioFormat
SupportedFormats returns audio formats supported by Cartesia.
func (*CartesiaService) SupportedVoices ¶
func (s *CartesiaService) SupportedVoices() []Voice
SupportedVoices returns a sample of available Cartesia voices.
func (*CartesiaService) Synthesize ¶
func (s *CartesiaService) Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
Synthesize converts text to audio using Cartesia's REST API. For streaming output, use SynthesizeStream instead.
func (*CartesiaService) SynthesizeStream ¶
func (s *CartesiaService) SynthesizeStream(ctx context.Context, text string, config SynthesisConfig) (<-chan AudioChunk, error)
SynthesizeStream converts text to audio with streaming output via WebSocket. This provides ultra-low latency (<100ms first-byte) for real-time applications.
type ElevenLabsOption ¶
type ElevenLabsOption func(*ElevenLabsService)
ElevenLabsOption configures the ElevenLabs TTS service.
func WithElevenLabsBaseURL ¶
func WithElevenLabsBaseURL(url string) ElevenLabsOption
WithElevenLabsBaseURL sets a custom base URL.
func WithElevenLabsClient ¶
func WithElevenLabsClient(client *http.Client) ElevenLabsOption
WithElevenLabsClient sets a custom HTTP client.
func WithElevenLabsModel ¶
func WithElevenLabsModel(model string) ElevenLabsOption
WithElevenLabsModel sets the TTS model.
type ElevenLabsService ¶
type ElevenLabsService struct {
// contains filtered or unexported fields
}
ElevenLabsService implements TTS using ElevenLabs' API. ElevenLabs specializes in high-quality voice cloning and natural-sounding speech.
func NewElevenLabs ¶
func NewElevenLabs(apiKey string, opts ...ElevenLabsOption) *ElevenLabsService
NewElevenLabs creates an ElevenLabs TTS service.
func (*ElevenLabsService) Name ¶
func (s *ElevenLabsService) Name() string
Name returns the provider identifier.
func (*ElevenLabsService) SupportedFormats ¶
func (s *ElevenLabsService) SupportedFormats() []AudioFormat
SupportedFormats returns audio formats supported by ElevenLabs.
func (*ElevenLabsService) SupportedVoices ¶
func (s *ElevenLabsService) SupportedVoices() []Voice
SupportedVoices returns a sample of available ElevenLabs voices. Note: ElevenLabs has many more voices including custom cloned voices. Use the ElevenLabs API to get a complete list of available voices.
func (*ElevenLabsService) Synthesize ¶
func (s *ElevenLabsService) Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
Synthesize converts text to audio using ElevenLabs' TTS API.
type Factory ¶ added in v1.4.5
type Factory func(spec ProviderSpec) (Service, error)
Factory builds a Service from a spec. Per-provider packages register one of these via init() so this package never needs to import them (avoiding a cycle — implementations already import this package for the Service interface).
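The registration flow described above might look roughly like the sketch below. All names are lowercase local stand-ins for the documented ones, and the mutex-guarded map and the "unknown provider" error are assumptions about how such a registry is commonly built, not a description of the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// providerSpec and service are reduced stand-ins for ProviderSpec and Service.
type providerSpec struct{ Type, Model string }

type service interface{ Name() string }

// factory mirrors the documented Factory signature.
type factory func(spec providerSpec) (service, error)

var (
	mu        sync.RWMutex
	factories = map[string]factory{}
)

// registerFactory stores f under providerType, replacing any earlier entry.
func registerFactory(providerType string, f factory) {
	mu.Lock()
	defer mu.Unlock()
	factories[providerType] = f
}

// createFromSpec looks up the factory for spec.Type and invokes it.
func createFromSpec(spec providerSpec) (service, error) {
	mu.RLock()
	f, ok := factories[spec.Type]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("unknown TTS provider type %q", spec.Type)
	}
	return f(spec)
}

// fakeService is a made-up provider used only for this example.
type fakeService struct{ name string }

func (f fakeService) Name() string { return f.name }

// A provider package would register itself like this from its own init().
func init() {
	registerFactory("fake", func(spec providerSpec) (service, error) {
		return fakeService{name: "fake:" + spec.Model}, nil
	})
}

func main() {
	svc, err := createFromSpec(providerSpec{Type: "fake", Model: "m1"})
	fmt.Println(svc.Name(), err) // prints "fake:m1 <nil>"
}
```

Because the registry only stores function values, the core package never imports a provider package, which is how the import cycle is avoided.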
type OpenAIOption ¶
type OpenAIOption func(*OpenAIService)
OpenAIOption configures the OpenAI TTS service.
func WithOpenAIBaseURL ¶
func WithOpenAIBaseURL(url string) OpenAIOption
WithOpenAIBaseURL sets a custom base URL (for testing or proxies).
func WithOpenAIClient ¶
func WithOpenAIClient(client *http.Client) OpenAIOption
WithOpenAIClient sets a custom HTTP client.
func WithOpenAIModel ¶
func WithOpenAIModel(model string) OpenAIOption
WithOpenAIModel sets the TTS model to use.
type OpenAIService ¶
type OpenAIService struct {
// contains filtered or unexported fields
}
OpenAIService implements TTS using OpenAI's text-to-speech API.
func NewOpenAI ¶
func NewOpenAI(apiKey string, opts ...OpenAIOption) *OpenAIService
NewOpenAI creates an OpenAI TTS service.
func (*OpenAIService) Name ¶
func (s *OpenAIService) Name() string
Name returns the provider identifier.
func (*OpenAIService) SupportedFormats ¶
func (s *OpenAIService) SupportedFormats() []AudioFormat
SupportedFormats returns audio formats supported by OpenAI TTS.
func (*OpenAIService) SupportedVoices ¶
func (s *OpenAIService) SupportedVoices() []Voice
SupportedVoices returns available OpenAI voices.
func (*OpenAIService) Synthesize ¶
func (s *OpenAIService) Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
Synthesize converts text to audio using OpenAI's TTS API.
type ProviderSpec ¶ added in v1.4.5
type ProviderSpec struct {
// ID is a stable identifier; informational only at this layer.
ID string
// Type selects the implementation: openai, elevenlabs, cartesia.
Type string
// Model overrides the provider's default voice/model. Empty uses
// the per-provider default.
Model string
// BaseURL overrides the provider's default API endpoint.
BaseURL string
// Credential carries the resolved API key.
Credential credentials.Credential
// AdditionalConfig carries provider-specific extras (cartesia
// websocket URL, etc.). Unknown keys are ignored.
AdditionalConfig map[string]any
}
ProviderSpec is the runtime form of a TTS-provider declaration, used by CreateFromSpec to construct a Service implementation. The SDK's runtime-config layer translates pkg/config.TTSProviderConfig into this struct after resolving credentials.
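Since AdditionalConfig is a map[string]any, a factory has to type-assert each value it reads. The sketch below shows one defensive way to do that. The stringOption helper and the "ws_url" key are hypothetical; the keys each provider actually recognizes are not listed here.

```go
package main

import "fmt"

// stringOption returns cfg[key] when it is a non-empty string.
// Missing keys and values of other types are ignored.
func stringOption(cfg map[string]any, key string) (string, bool) {
	v, ok := cfg[key].(string)
	if !ok || v == "" {
		return "", false
	}
	return v, true
}

func main() {
	cfg := map[string]any{
		"ws_url":  "wss://example.invalid/tts", // hypothetical key and URL
		"timeout": 30,                          // not a string, so ignored
	}
	if url, ok := stringOption(cfg, "ws_url"); ok {
		fmt.Println("websocket URL:", url)
	}
	_, ok := stringOption(cfg, "timeout")
	fmt.Println(ok) // prints "false"
}
```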
type RetryConfig ¶ added in v1.4.2
type RetryConfig struct {
// MaxAttempts is the total number of attempts including the initial
// call. 3 means "initial + up to 2 retries". Values < 1 are
// treated as 1 (no retry).
MaxAttempts int
// InitialDelay is the base backoff before the first retry.
InitialDelay time.Duration
// MaxDelay caps the per-attempt backoff.
MaxDelay time.Duration
}
RetryConfig configures bounded retry for TTS synthesis calls. Defaults are on (unlike streaming retry) because TTS calls are one-shot and idempotent — retry has no content-duplication risk, and the alternative is silence.
func DefaultRetryConfig ¶ added in v1.4.2
func DefaultRetryConfig() RetryConfig
DefaultRetryConfig returns sensible defaults for TTS retry.
type Service ¶
type Service interface {
// Name returns the provider identifier (for logging/debugging).
Name() string
// Synthesize converts text to audio.
// Returns a reader for streaming audio data.
// The caller is responsible for closing the reader.
Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
// SupportedVoices returns available voices for this provider.
SupportedVoices() []Voice
// SupportedFormats returns supported audio output formats.
SupportedFormats() []AudioFormat
}
Service converts text to speech audio. This interface abstracts different TTS providers (OpenAI, ElevenLabs, etc.) enabling voice AI applications to use any provider interchangeably.
func CreateFromSpec ¶ added in v1.4.5
func CreateFromSpec(spec ProviderSpec) (Service, error)
CreateFromSpec returns a Service implementation for the given spec.
type StreamingService ¶
type StreamingService interface {
Service
// SynthesizeStream converts text to audio with streaming output.
// Returns a channel that receives audio chunks as they're generated.
// The channel is closed when synthesis completes or an error occurs.
SynthesizeStream(ctx context.Context, text string, config SynthesisConfig) (<-chan AudioChunk, error)
}
StreamingService extends Service with streaming synthesis capabilities. Streaming TTS provides lower latency by returning audio chunks as they're generated.
type SynthesisConfig ¶
type SynthesisConfig struct {
// Voice is the voice ID to use for synthesis.
// Available voices vary by provider - use SupportedVoices() to list options.
Voice string
// Format is the output audio format.
// Default is MP3 for most providers.
Format AudioFormat
// Speed is the speech rate multiplier (0.25-4.0, default 1.0).
// Not all providers support speed adjustment.
Speed float64
// Pitch adjusts the voice pitch (-20 to 20 semitones, default 0).
// Not all providers support pitch adjustment.
Pitch float64
// Language is the language code for synthesis (e.g., "en-US").
// Required for some providers, optional for others.
Language string
// Model is the TTS model to use (provider-specific).
// For OpenAI: "tts-1" (fast) or "tts-1-hd" (high quality).
Model string
}
SynthesisConfig configures text-to-speech synthesis.
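The documented ranges for Speed and Pitch can be checked before a request is sent. The sketch below is one hypothetical way to do so: the normalize helper is not part of the package, and treating a zero Speed as "use the default of 1.0" is an assumption.

```go
package main

import "fmt"

// synthesisConfig mirrors the Speed and Pitch fields of SynthesisConfig.
type synthesisConfig struct {
	Speed float64
	Pitch float64
}

// normalize fills in the default speed and rejects out-of-range values.
func normalize(c synthesisConfig) (synthesisConfig, error) {
	if c.Speed == 0 {
		c.Speed = 1.0 // assumption: the zero value means "use the default"
	}
	if c.Speed < 0.25 || c.Speed > 4.0 {
		return c, fmt.Errorf("speed %.2f is outside 0.25-4.0", c.Speed)
	}
	if c.Pitch < -20 || c.Pitch > 20 {
		return c, fmt.Errorf("pitch %.1f is outside -20 to 20 semitones", c.Pitch)
	}
	return c, nil
}

func main() {
	c, err := normalize(synthesisConfig{})
	fmt.Println(c.Speed, err) // prints "1 <nil>"

	_, err = normalize(synthesisConfig{Speed: 5})
	fmt.Println(err) // prints "speed 5.00 is outside 0.25-4.0"
}
```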
func DefaultSynthesisConfig ¶
func DefaultSynthesisConfig() SynthesisConfig
DefaultSynthesisConfig returns sensible defaults for synthesis.
type SynthesisError ¶
type SynthesisError struct {
// Provider is the TTS provider that returned the error.
Provider string
// Code is the provider-specific error code.
Code string
// Message is the error message.
Message string
// Cause is the underlying error (if any).
Cause error
// Retryable indicates if the error is transient and retry may succeed.
Retryable bool
}
SynthesisError provides detailed error information from TTS providers.
func NewSynthesisError ¶
func NewSynthesisError(provider, code, message string, cause error, retryable bool) *SynthesisError
NewSynthesisError creates a new SynthesisError.
func (*SynthesisError) Error ¶
func (e *SynthesisError) Error() string
Error implements the error interface.
func (*SynthesisError) Unwrap ¶
func (e *SynthesisError) Unwrap() error
Unwrap returns the underlying error.
type Voice ¶
type Voice struct {
// ID is the provider-specific voice identifier.
ID string
// Name is a human-readable voice name.
Name string
// Language is the primary language code (e.g., "en", "es", "fr").
Language string
// Gender is the voice gender ("male", "female", "neutral").
Gender string
// Description provides additional voice characteristics.
Description string
// Preview is a URL to a voice sample (if available).
Preview string
}
Voice describes a TTS voice available from a provider.
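A typical use of this metadata is narrowing the result of SupportedVoices() to a language or gender. The sketch below shows one hypothetical way to do that. The voice type is a reduced mirror, filterVoices is not part of the package, and the sample voices are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// voice mirrors a subset of the documented Voice fields.
type voice struct {
	ID, Name, Language, Gender string
}

// filterVoices keeps voices whose Language equals lang or starts with
// lang + "-" (case-insensitive). An empty gender matches any voice.
func filterVoices(voices []voice, lang, gender string) []voice {
	l := strings.ToLower(lang)
	var out []voice
	for _, v := range voices {
		vl := strings.ToLower(v.Language)
		if vl != l && !strings.HasPrefix(vl, l+"-") {
			continue
		}
		if gender != "" && !strings.EqualFold(v.Gender, gender) {
			continue
		}
		out = append(out, v)
	}
	return out
}

func main() {
	// Made-up sample data, not real provider voices.
	voices := []voice{
		{ID: "v1", Name: "Ana", Language: "es", Gender: "female"},
		{ID: "v2", Name: "Ben", Language: "en-US", Gender: "male"},
		{ID: "v3", Name: "Cy", Language: "en", Gender: "neutral"},
	}
	for _, v := range filterVoices(voices, "en", "") {
		fmt.Println(v.ID) // prints "v2" then "v3"
	}
}
```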