tts

v1.1.6
Warning

This package is not in the latest version of its module.

Published: Dec 23, 2025 License: Apache-2.0 Imports: 10 Imported by: 0

Documentation

Overview

Package tts provides text-to-speech services for converting text responses to audio.

The package defines a common Service interface that abstracts TTS providers, enabling voice AI applications to convert text-only LLM responses to speech.

The WebSocket streaming implementation for Cartesia TTS lives in a separate file, which is excluded from coverage testing because WebSocket connections are difficult to mock.

Architecture

The package provides:

  • Service interface for TTS providers
  • SynthesisConfig for voice/format configuration
  • Voice and AudioFormat types for provider capabilities
  • Multiple provider implementations (OpenAI, ElevenLabs, etc.)

Usage

Basic usage with OpenAI TTS:

service := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
reader, err := service.Synthesize(ctx, "Hello world", tts.SynthesisConfig{
    Voice:  "alloy",
    Format: tts.FormatMP3,
})
if err != nil {
    log.Fatal(err)
}
defer reader.Close()

// Stream audio to speaker or save to file
if _, err := io.Copy(audioOutput, reader); err != nil {
    log.Printf("copy audio: %v", err)
}

Streaming TTS

For low-latency applications, use StreamingService:

streamer := tts.NewCartesia(os.Getenv("CARTESIA_API_KEY"))
chunks, err := streamer.SynthesizeStream(ctx, "Hello world", config)
if err != nil {
    log.Fatal(err)
}
for chunk := range chunks {
    if chunk.Error != nil {
        log.Fatal(chunk.Error)
    }
    // Play audio chunk immediately
    speaker.Write(chunk.Data)
}

Available Providers

The package includes implementations for:

  • OpenAI TTS (tts-1, tts-1-hd models)
  • ElevenLabs (high-quality voice cloning)
  • Cartesia (ultra-low latency streaming)
  • Google Cloud Text-to-Speech (multi-language)


Constants

const (

	// ElevenLabsModelMultilingual is the multilingual v2 model.
	ElevenLabsModelMultilingual = "eleven_multilingual_v2"
	// ElevenLabsModelTurbo is the fast turbo v2.5 model.
	ElevenLabsModelTurbo = "eleven_turbo_v2_5"
	// ElevenLabsModelEnglish is the English monolingual v1 model.
	ElevenLabsModelEnglish = "eleven_monolingual_v1"
	// ElevenLabsModelMultilingualV1 is the older multilingual v1 model.
	ElevenLabsModelMultilingualV1 = "eleven_multilingual_v1"
)
const (

	// ModelTTS1 is the OpenAI TTS model optimized for speed.
	ModelTTS1 = "tts-1"
	// ModelTTS1HD is the OpenAI TTS model optimized for quality.
	ModelTTS1HD = "tts-1-hd"
)
const (
	VoiceAlloy   = "alloy"   // Neutral voice.
	VoiceEcho    = "echo"    // Male voice.
	VoiceFable   = "fable"   // British accent.
	VoiceOnyx    = "onyx"    // Deep male voice.
	VoiceNova    = "nova"    // Female voice.
	VoiceShimmer = "shimmer" // Soft female voice.
)

OpenAI voices.

const (

	// CartesiaModelSonic is the latest Sonic model for Cartesia TTS.
	CartesiaModelSonic = "sonic-2024-10-01"
)

Variables

var (
	// ErrInvalidVoice is returned when the requested voice is not available.
	ErrInvalidVoice = errors.New("invalid or unsupported voice")

	// ErrInvalidFormat is returned when the requested format is not supported.
	ErrInvalidFormat = errors.New("invalid or unsupported audio format")

	// ErrEmptyText is returned when attempting to synthesize empty text.
	ErrEmptyText = errors.New("text cannot be empty")

	// ErrSynthesisFailed is returned when TTS synthesis fails.
	ErrSynthesisFailed = errors.New("speech synthesis failed")

	// ErrRateLimited is returned when API rate limits are exceeded.
	ErrRateLimited = errors.New("rate limit exceeded")

	// ErrQuotaExceeded is returned when account quota is exceeded.
	ErrQuotaExceeded = errors.New("quota exceeded")

	// ErrServiceUnavailable is returned when the TTS service is unavailable.
	ErrServiceUnavailable = errors.New("TTS service unavailable")
)

Common TTS errors.
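
Because these are sentinel values created with errors.New, callers can match them with errors.Is even after wrapping. The following is a minimal sketch, not the package's own retry logic: it redeclares three of the errors locally so it runs on its own, and shouldRetry is a hypothetical helper whose policy (retry on rate limiting and unavailability) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for three of the sentinel errors listed above.
// Real code would reference the package's own variables instead.
var (
	ErrInvalidVoice       = errors.New("invalid or unsupported voice")
	ErrRateLimited        = errors.New("rate limit exceeded")
	ErrServiceUnavailable = errors.New("TTS service unavailable")
)

// shouldRetry is a hypothetical helper. Treating rate limiting and
// service unavailability as transient is a policy assumed for this sketch.
func shouldRetry(err error) bool {
	return errors.Is(err, ErrRateLimited) || errors.Is(err, ErrServiceUnavailable)
}

func main() {
	// errors.Is sees through wrapping added with %w.
	wrapped := fmt.Errorf("synthesize greeting: %w", ErrRateLimited)
	fmt.Println(shouldRetry(wrapped))
	fmt.Println(shouldRetry(ErrInvalidVoice))
}
```

Matching on the sentinel rather than on the error string keeps the check stable if a provider changes its message text.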

var (
	// FormatMP3 is MP3 format (most compatible).
	FormatMP3 = AudioFormat{
		Name:       "mp3",
		MIMEType:   "audio/mpeg",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}

	// FormatOpus is Opus format (best for streaming).
	FormatOpus = AudioFormat{
		Name:       "opus",
		MIMEType:   "audio/opus",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}

	// FormatAAC is AAC format.
	FormatAAC = AudioFormat{
		Name:       "aac",
		MIMEType:   "audio/aac",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}

	// FormatFLAC is FLAC format (lossless).
	FormatFLAC = AudioFormat{
		Name:       "flac",
		MIMEType:   "audio/flac",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}

	// FormatPCM16 is raw 16-bit PCM (for processing).
	FormatPCM16 = AudioFormat{
		Name:       "pcm",
		MIMEType:   "audio/pcm",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}

	// FormatWAV is WAV format (PCM with header).
	FormatWAV = AudioFormat{
		Name:       "wav",
		MIMEType:   "audio/wav",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}
)

Common audio formats.
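
FormatWAV is described as PCM with a header, which suggests how FormatPCM16 output could be made playable by tools that expect a WAV file. The sketch below builds the standard 44-byte RIFF/WAVE header for uncompressed PCM. It is illustrative only: wavHeader is a hypothetical helper, and the 24000 Hz sample rate is an assumed value, since sampleRateDefault is unexported and not shown here.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// wavHeader returns the canonical 44-byte RIFF/WAVE header for
// uncompressed PCM audio with dataLen bytes of sample data.
func wavHeader(sampleRate, bitDepth, channels, dataLen int) []byte {
	le := binary.LittleEndian
	blockAlign := channels * bitDepth / 8 // bytes per sample frame
	byteRate := sampleRate * blockAlign   // bytes per second

	h := make([]byte, 44)
	copy(h[0:4], "RIFF")
	le.PutUint32(h[4:8], uint32(36+dataLen)) // file size minus 8 bytes
	copy(h[8:12], "WAVE")
	copy(h[12:16], "fmt ")
	le.PutUint32(h[16:20], 16) // size of the fmt chunk body
	le.PutUint16(h[20:22], 1)  // audio format 1 = linear PCM
	le.PutUint16(h[22:24], uint16(channels))
	le.PutUint32(h[24:28], uint32(sampleRate))
	le.PutUint32(h[28:32], uint32(byteRate))
	le.PutUint16(h[32:34], uint16(blockAlign))
	le.PutUint16(h[34:36], uint16(bitDepth))
	copy(h[36:40], "data")
	le.PutUint32(h[40:44], uint32(dataLen))
	return h
}

func main() {
	// Assumed parameters: 24 kHz, 16-bit, mono, one second of audio.
	h := wavHeader(24000, 16, 1, 48000)
	fmt.Println(len(h), string(h[0:4]), string(h[8:12]))
}
```

Writing this header followed by the raw PCM bytes yields a file that common players accept, provided the parameters match the audio actually returned.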

Functions

This section is empty.

Types

type AudioChunk

type AudioChunk struct {
	// Data is the raw audio bytes.
	Data []byte

	// Index is the chunk sequence number (0-indexed).
	Index int

	// Final indicates this is the last chunk.
	Final bool

	// Error is set if an error occurred during synthesis.
	Error error
}

AudioChunk represents a chunk of synthesized audio data.
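
A consumer typically ranges over the chunk channel, checks Error on every chunk, and appends Data in arrival order. The sketch below shows that loop against a fake stream. AudioChunk is redeclared locally so the example is self-contained, and fakeStream, errFake, and collect are hypothetical names introduced for illustration.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// AudioChunk mirrors the documented struct (local stand-in).
type AudioChunk struct {
	Data  []byte
	Index int
	Final bool
	Error error
}

// errFake is a made-up error used to exercise the failure path.
var errFake = errors.New("fake synthesis failure")

// fakeStream returns a closed, pre-filled channel that plays the role
// of a provider's chunk stream.
func fakeStream(chunks ...AudioChunk) <-chan AudioChunk {
	ch := make(chan AudioChunk, len(chunks))
	for _, c := range chunks {
		ch <- c
	}
	close(ch)
	return ch
}

// collect concatenates chunk data until the channel is closed.
// It returns what it has so far together with the first chunk error.
// With a real provider the caller would usually also cancel the
// context on error so the producer can stop.
func collect(chunks <-chan AudioChunk) ([]byte, error) {
	var buf bytes.Buffer
	for c := range chunks {
		if c.Error != nil {
			return buf.Bytes(), c.Error
		}
		buf.Write(c.Data)
	}
	return buf.Bytes(), nil
}

func main() {
	audio, err := collect(fakeStream(
		AudioChunk{Data: []byte("ab"), Index: 0},
		AudioChunk{Data: []byte("cd"), Index: 1, Final: true},
	))
	fmt.Printf("%q %v\n", audio, err)

	_, err = collect(fakeStream(AudioChunk{Error: errFake}))
	fmt.Println(err)
}
```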

type AudioFormat

type AudioFormat struct {
	// Name is the format identifier ("mp3", "opus", "pcm", "wav", "aac", "flac").
	Name string

	// MIMEType is the content type (e.g., "audio/mpeg").
	MIMEType string

	// SampleRate is the audio sample rate in Hz.
	SampleRate int

	// BitDepth is the bits per sample (for PCM formats).
	BitDepth int

	// Channels is the number of audio channels (1=mono, 2=stereo).
	Channels int
}

AudioFormat describes an audio output format.
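
For PCM-based formats the three numeric fields determine the data rate: SampleRate × (BitDepth / 8) × Channels bytes per second. The sketch below works through that arithmetic with a local copy of the struct. The helper names are hypothetical, and the 24000 Hz and 16-bit values are assumptions, since sampleRateDefault and bitDepthDefault are unexported.

```go
package main

import (
	"fmt"
	"time"
)

// AudioFormat mirrors the documented struct (local stand-in).
type AudioFormat struct {
	Name       string
	MIMEType   string
	SampleRate int
	BitDepth   int
	Channels   int
}

// bytesPerSecond returns the uncompressed data rate. It is 0 when
// BitDepth is 0, as in the compressed formats declared above.
func bytesPerSecond(f AudioFormat) int {
	return f.SampleRate * (f.BitDepth / 8) * f.Channels
}

// pcmDuration estimates how long n bytes of raw audio play for.
// It returns 0 when the data rate is unknown.
func pcmDuration(f AudioFormat, n int) time.Duration {
	bps := bytesPerSecond(f)
	if bps == 0 {
		return 0
	}
	return time.Duration(n) * time.Second / time.Duration(bps)
}

func main() {
	// Assumed values: 24 kHz, 16-bit, mono.
	pcm := AudioFormat{Name: "pcm", MIMEType: "audio/pcm", SampleRate: 24000, BitDepth: 16, Channels: 1}
	fmt.Println(bytesPerSecond(pcm))
	fmt.Println(pcmDuration(pcm, 96000))
}
```

The calculation does not apply to MP3, Opus, or AAC, whose size depends on the encoder bitrate rather than on bit depth.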

func (AudioFormat) String

func (f AudioFormat) String() string

String returns the format name.

type CartesiaOption

type CartesiaOption func(*CartesiaService)

CartesiaOption configures the Cartesia TTS service.

func WithCartesiaBaseURL

func WithCartesiaBaseURL(url string) CartesiaOption

WithCartesiaBaseURL sets a custom base URL.

func WithCartesiaClient

func WithCartesiaClient(client *http.Client) CartesiaOption

WithCartesiaClient sets a custom HTTP client.

func WithCartesiaModel

func WithCartesiaModel(model string) CartesiaOption

WithCartesiaModel sets the TTS model.

func WithCartesiaWSURL

func WithCartesiaWSURL(url string) CartesiaOption

WithCartesiaWSURL sets a custom WebSocket URL.

type CartesiaService

type CartesiaService struct {
	// contains filtered or unexported fields
}

CartesiaService implements TTS using Cartesia's ultra-low latency API. Cartesia specializes in real-time streaming TTS with <100ms first-byte latency.

func NewCartesia

func NewCartesia(apiKey string, opts ...CartesiaOption) *CartesiaService

NewCartesia creates a Cartesia TTS service.
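
CartesiaOption, ElevenLabsOption, and OpenAIOption all have the shape func(*T), which is the functional options pattern: the constructor sets defaults and then applies each option in order. The sketch below shows the pattern on a hypothetical exampleService. The field names and default values are assumptions made for illustration and are not taken from the package.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// exampleService is a hypothetical type standing in for a provider service.
type exampleService struct {
	apiKey  string
	baseURL string
	model   string
	client  *http.Client
}

// Option has the same shape as the documented option types.
type Option func(*exampleService)

// WithBaseURL overrides the API endpoint.
func WithBaseURL(url string) Option {
	return func(s *exampleService) { s.baseURL = url }
}

// WithModel overrides the model name.
func WithModel(model string) Option {
	return func(s *exampleService) { s.model = model }
}

// WithClient overrides the HTTP client.
func WithClient(c *http.Client) Option {
	return func(s *exampleService) { s.client = c }
}

// New sets placeholder defaults, then applies options in order, so a
// later option wins over an earlier one that sets the same field.
func New(apiKey string, opts ...Option) *exampleService {
	s := &exampleService{
		apiKey:  apiKey,
		baseURL: "https://tts.example.invalid", // placeholder default
		model:   "example-model",               // placeholder default
		client:  http.DefaultClient,
	}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	s := New("key",
		WithModel("fast"),
		WithClient(&http.Client{Timeout: 5 * time.Second}),
	)
	fmt.Println(s.baseURL, s.model, s.client.Timeout)
}
```

The pattern lets a constructor gain new settings without changing its signature, which is presumably why each provider exposes the same trio of base URL, client, and model options.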

func (*CartesiaService) Name

func (s *CartesiaService) Name() string

Name returns the provider identifier.

func (*CartesiaService) SupportedFormats

func (s *CartesiaService) SupportedFormats() []AudioFormat

SupportedFormats returns audio formats supported by Cartesia.

func (*CartesiaService) SupportedVoices

func (s *CartesiaService) SupportedVoices() []Voice

SupportedVoices returns a sample of available Cartesia voices.

func (*CartesiaService) Synthesize

func (s *CartesiaService) Synthesize(
	ctx context.Context, text string, config SynthesisConfig,
) (io.ReadCloser, error)

Synthesize converts text to audio using Cartesia's REST API. For streaming output, use SynthesizeStream instead.

func (*CartesiaService) SynthesizeStream

func (s *CartesiaService) SynthesizeStream(
	ctx context.Context, text string, config SynthesisConfig,
) (<-chan AudioChunk, error)

SynthesizeStream converts text to audio with streaming output via WebSocket. This provides ultra-low latency (<100ms first-byte) for real-time applications.

type ElevenLabsOption

type ElevenLabsOption func(*ElevenLabsService)

ElevenLabsOption configures the ElevenLabs TTS service.

func WithElevenLabsBaseURL

func WithElevenLabsBaseURL(url string) ElevenLabsOption

WithElevenLabsBaseURL sets a custom base URL.

func WithElevenLabsClient

func WithElevenLabsClient(client *http.Client) ElevenLabsOption

WithElevenLabsClient sets a custom HTTP client.

func WithElevenLabsModel

func WithElevenLabsModel(model string) ElevenLabsOption

WithElevenLabsModel sets the TTS model.

type ElevenLabsService

type ElevenLabsService struct {
	// contains filtered or unexported fields
}

ElevenLabsService implements TTS using ElevenLabs' API. ElevenLabs specializes in high-quality voice cloning and natural-sounding speech.

func NewElevenLabs

func NewElevenLabs(apiKey string, opts ...ElevenLabsOption) *ElevenLabsService

NewElevenLabs creates an ElevenLabs TTS service.

func (*ElevenLabsService) Name

func (s *ElevenLabsService) Name() string

Name returns the provider identifier.

func (*ElevenLabsService) SupportedFormats

func (s *ElevenLabsService) SupportedFormats() []AudioFormat

SupportedFormats returns audio formats supported by ElevenLabs.

func (*ElevenLabsService) SupportedVoices

func (s *ElevenLabsService) SupportedVoices() []Voice

SupportedVoices returns a sample of available ElevenLabs voices. Note: ElevenLabs has many more voices including custom cloned voices. Use the ElevenLabs API to get a complete list of available voices.

func (*ElevenLabsService) Synthesize

func (s *ElevenLabsService) Synthesize(
	ctx context.Context, text string, config SynthesisConfig,
) (io.ReadCloser, error)

Synthesize converts text to audio using ElevenLabs' TTS API.

type OpenAIOption

type OpenAIOption func(*OpenAIService)

OpenAIOption configures the OpenAI TTS service.

func WithOpenAIBaseURL

func WithOpenAIBaseURL(url string) OpenAIOption

WithOpenAIBaseURL sets a custom base URL (for testing or proxies).

func WithOpenAIClient

func WithOpenAIClient(client *http.Client) OpenAIOption

WithOpenAIClient sets a custom HTTP client.

func WithOpenAIModel

func WithOpenAIModel(model string) OpenAIOption

WithOpenAIModel sets the TTS model to use.

type OpenAIService

type OpenAIService struct {
	// contains filtered or unexported fields
}

OpenAIService implements TTS using OpenAI's text-to-speech API.

func NewOpenAI

func NewOpenAI(apiKey string, opts ...OpenAIOption) *OpenAIService

NewOpenAI creates an OpenAI TTS service.

func (*OpenAIService) Name

func (s *OpenAIService) Name() string

Name returns the provider identifier.

func (*OpenAIService) SupportedFormats

func (s *OpenAIService) SupportedFormats() []AudioFormat

SupportedFormats returns audio formats supported by OpenAI TTS.

func (*OpenAIService) SupportedVoices

func (s *OpenAIService) SupportedVoices() []Voice

SupportedVoices returns available OpenAI voices.

func (*OpenAIService) Synthesize

func (s *OpenAIService) Synthesize(
	ctx context.Context, text string, config SynthesisConfig,
) (io.ReadCloser, error)

Synthesize converts text to audio using OpenAI's TTS API.

type Service

type Service interface {
	// Name returns the provider identifier (for logging/debugging).
	Name() string

	// Synthesize converts text to audio.
	// Returns a reader for streaming audio data.
	// The caller is responsible for closing the reader.
	Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)

	// SupportedVoices returns available voices for this provider.
	SupportedVoices() []Voice

	// SupportedFormats returns supported audio output formats.
	SupportedFormats() []AudioFormat
}

Service converts text to speech audio. This interface abstracts different TTS providers (OpenAI, ElevenLabs, etc.) enabling voice AI applications to use any provider interchangeably.
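
Since callers depend only on the interface, a small in-memory fake can stand in for a real provider in tests. The sketch below redeclares Service with trimmed-down supporting types so it compiles on its own. fakeService, speak, and demo are hypothetical names, and the "audio:" payload is made up.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// Trimmed local stand-ins for the documented types.
type Voice struct{ ID, Name string }
type AudioFormat struct{ Name string }
type SynthesisConfig struct {
	Voice  string
	Format AudioFormat
}

// Service mirrors the documented interface.
type Service interface {
	Name() string
	Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
	SupportedVoices() []Voice
	SupportedFormats() []AudioFormat
}

// fakeService is a hypothetical provider that returns canned bytes.
type fakeService struct{}

func (fakeService) Name() string { return "fake" }

func (fakeService) Synthesize(_ context.Context, text string, _ SynthesisConfig) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader("audio:" + text)), nil
}

func (fakeService) SupportedVoices() []Voice { return []Voice{{ID: "v1", Name: "Test"}} }

func (fakeService) SupportedFormats() []AudioFormat { return []AudioFormat{{Name: "mp3"}} }

// speak works with any Service and closes the reader it receives,
// as the interface contract requires of the caller.
func speak(ctx context.Context, svc Service, text string, cfg SynthesisConfig) ([]byte, error) {
	r, err := svc.Synthesize(ctx, text, cfg)
	if err != nil {
		return nil, fmt.Errorf("%s: %w", svc.Name(), err)
	}
	defer r.Close()
	return io.ReadAll(r)
}

// demo runs speak against the fake and returns the bytes as a string.
func demo() (string, error) {
	out, err := speak(context.Background(), fakeService{}, "hello", SynthesisConfig{Voice: "v1"})
	return string(out), err
}

func main() {
	fmt.Println(demo())
}
```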

type StreamingService

type StreamingService interface {
	Service

	// SynthesizeStream converts text to audio with streaming output.
	// Returns a channel that receives audio chunks as they're generated.
	// The channel is closed when synthesis completes or an error occurs.
	SynthesizeStream(ctx context.Context, text string, config SynthesisConfig) (<-chan AudioChunk, error)
}

StreamingService extends Service with streaming synthesis capabilities. Streaming TTS provides lower latency by returning audio chunks as they're generated.
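
Code that holds a plain Service can detect streaming support at run time with a type assertion and fall back to Synthesize otherwise. The following is a sketch under stated assumptions: the interfaces are trimmed to the methods used here, and batchFake, streamFake, synthesizeAll, and demo are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// Trimmed local stand-ins for the documented types.
type SynthesisConfig struct{ Voice string }

type AudioChunk struct {
	Data  []byte
	Error error
}

type Service interface {
	Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)
}

type StreamingService interface {
	Service
	SynthesizeStream(ctx context.Context, text string, config SynthesisConfig) (<-chan AudioChunk, error)
}

// batchFake implements only Service.
type batchFake struct{}

func (batchFake) Synthesize(_ context.Context, text string, _ SynthesisConfig) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader("batch:" + text)), nil
}

// streamFake adds SynthesizeStream on top of batchFake.
type streamFake struct{ batchFake }

func (streamFake) SynthesizeStream(_ context.Context, text string, _ SynthesisConfig) (<-chan AudioChunk, error) {
	ch := make(chan AudioChunk, 2)
	ch <- AudioChunk{Data: []byte("stream:")}
	ch <- AudioChunk{Data: []byte(text)}
	close(ch)
	return ch, nil
}

// synthesizeAll prefers the streaming path when the provider offers it.
func synthesizeAll(ctx context.Context, svc Service, text string, cfg SynthesisConfig) ([]byte, error) {
	if ss, ok := svc.(StreamingService); ok {
		chunks, err := ss.SynthesizeStream(ctx, text, cfg)
		if err != nil {
			return nil, err
		}
		var out []byte
		for c := range chunks {
			if c.Error != nil {
				return out, c.Error
			}
			out = append(out, c.Data...)
		}
		return out, nil
	}
	r, err := svc.Synthesize(ctx, text, cfg)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

// demo exercises both paths and returns their outputs.
func demo() (string, string, error) {
	ctx := context.Background()
	a, err := synthesizeAll(ctx, batchFake{}, "hi", SynthesisConfig{})
	if err != nil {
		return "", "", err
	}
	b, err := synthesizeAll(ctx, streamFake{}, "hi", SynthesisConfig{})
	return string(a), string(b), err
}

func main() {
	fmt.Println(demo())
}
```

Among the providers documented here, only CartesiaService declares SynthesizeStream, so a fallback of this kind would be needed if the provider is chosen at run time.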

type SynthesisConfig

type SynthesisConfig struct {
	// Voice is the voice ID to use for synthesis.
	// Available voices vary by provider - use SupportedVoices() to list options.
	Voice string

	// Format is the output audio format.
	// Default is MP3 for most providers.
	Format AudioFormat

	// Speed is the speech rate multiplier (0.25-4.0, default 1.0).
	// Not all providers support speed adjustment.
	Speed float64

	// Pitch adjusts the voice pitch (-20 to 20 semitones, default 0).
	// Not all providers support pitch adjustment.
	Pitch float64

	// Language is the language code for synthesis (e.g., "en-US").
	// Required for some providers, optional for others.
	Language string

	// Model is the TTS model to use (provider-specific).
	// For OpenAI: "tts-1" (fast) or "tts-1-hd" (high quality).
	Model string
}

SynthesisConfig configures text-to-speech synthesis.
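
The field comments give numeric ranges for Speed (0.25 to 4.0) and Pitch (-20 to 20 semitones). Whether the package enforces them is not stated, so a caller may want to check before sending a request. The sketch below is one way to do that with a trimmed local copy of the struct. validate is a hypothetical helper, and treating a zero Speed as "unset" is an assumption.

```go
package main

import "fmt"

// SynthesisConfig is trimmed to the fields checked in this sketch.
type SynthesisConfig struct {
	Voice string
	Speed float64
	Pitch float64
}

// validate checks the documented ranges. A zero Speed is treated as
// "not set" here, on the assumption that the provider default applies.
func validate(c SynthesisConfig) error {
	if c.Speed != 0 && (c.Speed < 0.25 || c.Speed > 4.0) {
		return fmt.Errorf("speed %.2f outside 0.25-4.0", c.Speed)
	}
	if c.Pitch < -20 || c.Pitch > 20 {
		return fmt.Errorf("pitch %.1f outside -20 to 20 semitones", c.Pitch)
	}
	return nil
}

func main() {
	fmt.Println(validate(SynthesisConfig{Voice: "alloy", Speed: 1.0}))
	fmt.Println(validate(SynthesisConfig{Voice: "alloy", Speed: 5.0}))
	fmt.Println(validate(SynthesisConfig{Voice: "alloy", Pitch: -21}))
}
```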

func DefaultSynthesisConfig

func DefaultSynthesisConfig() SynthesisConfig

DefaultSynthesisConfig returns sensible defaults for synthesis.

type SynthesisError

type SynthesisError struct {
	// Provider is the TTS provider that returned the error.
	Provider string

	// Code is the provider-specific error code.
	Code string

	// Message is the error message.
	Message string

	// Cause is the underlying error (if any).
	Cause error

	// Retryable indicates if the error is transient and retry may succeed.
	Retryable bool
}

SynthesisError provides detailed error information from TTS providers.

func NewSynthesisError

func NewSynthesisError(provider, code, message string, cause error, retryable bool) *SynthesisError

NewSynthesisError creates a new SynthesisError.

func (*SynthesisError) Error

func (e *SynthesisError) Error() string

Error implements the error interface.

func (*SynthesisError) Unwrap

func (e *SynthesisError) Unwrap() error

Unwrap returns the underlying error.
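
Because SynthesisError implements Unwrap, it works with both errors.As (to read Retryable and Code) and errors.Is (to match a sentinel stored in Cause). The sketch below redeclares the type locally. The Error string format is an assumption, since the real format is not shown, and isRetryable is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRateLimited is a local stand-in for the package's sentinel.
var ErrRateLimited = errors.New("rate limit exceeded")

// SynthesisError mirrors the documented struct (local stand-in).
type SynthesisError struct {
	Provider  string
	Code      string
	Message   string
	Cause     error
	Retryable bool
}

// Error uses an assumed format; the package's actual output may differ.
func (e *SynthesisError) Error() string {
	return fmt.Sprintf("%s [%s]: %s", e.Provider, e.Code, e.Message)
}

// Unwrap exposes Cause to errors.Is and errors.As.
func (e *SynthesisError) Unwrap() error { return e.Cause }

// isRetryable reports whether err contains a SynthesisError marked
// as retryable anywhere in its chain.
func isRetryable(err error) bool {
	var se *SynthesisError
	return errors.As(err, &se) && se.Retryable
}

func main() {
	err := fmt.Errorf("speak: %w", &SynthesisError{
		Provider:  "example",
		Code:      "429",
		Message:   "slow down",
		Cause:     ErrRateLimited,
		Retryable: true,
	})
	fmt.Println(isRetryable(err))
	fmt.Println(errors.Is(err, ErrRateLimited))
}
```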

type Voice

type Voice struct {
	// ID is the provider-specific voice identifier.
	ID string

	// Name is a human-readable voice name.
	Name string

	// Language is the primary language code (e.g., "en", "es", "fr").
	Language string

	// Gender is the voice gender ("male", "female", "neutral").
	Gender string

	// Description provides additional voice characteristics.
	Description string

	// Preview is a URL to a voice sample (if available).
	Preview string
}

Voice describes a TTS voice available from a provider.
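
The slice returned by SupportedVoices can be narrowed by these fields before choosing a SynthesisConfig.Voice. The following sketch filters by language and gender using a local copy of the struct. filterVoices and sampleVoices are hypothetical, and the sample entries reuse the OpenAI voice names listed earlier with an assumed "en" language code.

```go
package main

import (
	"fmt"
	"strings"
)

// Voice mirrors the documented struct (local stand-in).
type Voice struct {
	ID          string
	Name        string
	Language    string
	Gender      string
	Description string
	Preview     string
}

// sampleVoices returns illustrative data, not a provider's real catalog.
func sampleVoices() []Voice {
	return []Voice{
		{ID: "alloy", Name: "Alloy", Language: "en", Gender: "neutral"},
		{ID: "onyx", Name: "Onyx", Language: "en", Gender: "male"},
		{ID: "nova", Name: "Nova", Language: "en", Gender: "female"},
	}
}

// filterVoices keeps voices matching both criteria, ignoring case.
// An empty criterion matches every voice.
func filterVoices(voices []Voice, language, gender string) []Voice {
	var out []Voice
	for _, v := range voices {
		if language != "" && !strings.EqualFold(v.Language, language) {
			continue
		}
		if gender != "" && !strings.EqualFold(v.Gender, gender) {
			continue
		}
		out = append(out, v)
	}
	return out
}

func main() {
	for _, v := range filterVoices(sampleVoices(), "en", "female") {
		fmt.Println(v.ID, v.Name)
	}
}
```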
