tts

package
v1.4.5 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 15, 2026 License: Apache-2.0 Imports: 13 Imported by: 2

Documentation

Overview

Package tts provides text-to-speech services for converting text responses to audio.

The WebSocket streaming implementation for Cartesia TTS lives in its own file, which is excluded from coverage testing because WebSocket connections are difficult to mock.

The package defines a common Service interface that abstracts TTS providers, enabling voice AI applications to convert text-only LLM responses to speech.

Architecture

The package provides:

  • Service interface for TTS providers
  • SynthesisConfig for voice/format configuration
  • Voice and AudioFormat types for provider capabilities
  • Multiple provider implementations (OpenAI, ElevenLabs, etc.)

Usage

Basic usage with OpenAI TTS:

service := tts.NewOpenAI(os.Getenv("OPENAI_API_KEY"))
reader, err := service.Synthesize(ctx, "Hello world", tts.SynthesisConfig{
    Voice:  "alloy",
    Format: tts.FormatMP3,
})
if err != nil {
    log.Fatal(err)
}
defer reader.Close()

// Stream audio to speaker or save to file
if _, err := io.Copy(audioOutput, reader); err != nil {
    log.Fatal(err)
}

Streaming TTS

For low-latency applications, use StreamingService:

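streamer := tts.NewCartesia(os.Getenv("CARTESIA_API_KEY"))
chunks, err := streamer.SynthesizeStream(ctx, "Hello world", config)
if err != nil {
    log.Fatal(err)
}
for chunk := range chunks {
    // A chunk may carry a synthesis error instead of audio
    if chunk.Error != nil {
        log.Fatal(chunk.Error)
    }
    // Play audio chunk immediately
    speaker.Write(chunk.Data)
}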

Available Providers

The package includes implementations for:

  • OpenAI TTS (tts-1, tts-1-hd models)
  • ElevenLabs (high-quality voice cloning)
  • Cartesia (ultra-low latency streaming)
  • Google Cloud Text-to-Speech (multi-language)

Constants

const (

	// ElevenLabsModelMultilingual is the multilingual v2 model.
	ElevenLabsModelMultilingual = "eleven_multilingual_v2"
	// ElevenLabsModelTurbo is the fast turbo v2.5 model.
	ElevenLabsModelTurbo = "eleven_turbo_v2_5"
	// ElevenLabsModelEnglish is the English monolingual v1 model.
	ElevenLabsModelEnglish = "eleven_monolingual_v1"
	// ElevenLabsModelMultilingualV1 is the older multilingual v1 model.
	ElevenLabsModelMultilingualV1 = "eleven_multilingual_v1"
)
const (

	// ModelTTS1 is the OpenAI TTS model optimized for speed.
	ModelTTS1 = "tts-1"
	// ModelTTS1HD is the OpenAI TTS model optimized for quality.
	ModelTTS1HD = "tts-1-hd"
)
const (
	VoiceAlloy   = "alloy"   // Neutral voice.
	VoiceEcho    = "echo"    // Male voice.
	VoiceFable   = "fable"   // British accent.
	VoiceOnyx    = "onyx"    // Deep male voice.
	VoiceNova    = "nova"    // Female voice.
	VoiceShimmer = "shimmer" // Soft female voice.
)

OpenAI voices.

const (

	// CartesiaModelSonic is the latest Sonic model for Cartesia TTS.
	CartesiaModelSonic = "sonic-2024-10-01"
)

Variables

var (
	// ErrInvalidVoice is returned when the requested voice is not available.
	ErrInvalidVoice = errors.New("invalid or unsupported voice")

	// ErrInvalidFormat is returned when the requested format is not supported.
	ErrInvalidFormat = errors.New("invalid or unsupported audio format")

	// ErrEmptyText is returned when attempting to synthesize empty text.
	ErrEmptyText = errors.New("text cannot be empty")

	// ErrSynthesisFailed is returned when TTS synthesis fails.
	ErrSynthesisFailed = errors.New("speech synthesis failed")

	// ErrRateLimited is returned when API rate limits are exceeded.
	ErrRateLimited = errors.New("rate limit exceeded")

	// ErrQuotaExceeded is returned when account quota is exceeded.
	ErrQuotaExceeded = errors.New("quota exceeded")

	// ErrServiceUnavailable is returned when the TTS service is unavailable.
	ErrServiceUnavailable = errors.New("TTS service unavailable")
)

Common TTS errors.

var (
	// FormatMP3 is MP3 format (most compatible).
	FormatMP3 = AudioFormat{
		Name:       "mp3",
		MIMEType:   "audio/mpeg",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}

	// FormatOpus is Opus format (best for streaming).
	FormatOpus = AudioFormat{
		Name:       "opus",
		MIMEType:   "audio/opus",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}

	// FormatAAC is AAC format.
	FormatAAC = AudioFormat{
		Name:       "aac",
		MIMEType:   "audio/aac",
		SampleRate: sampleRateDefault,
		BitDepth:   0,
		Channels:   1,
	}

	// FormatFLAC is FLAC format (lossless).
	FormatFLAC = AudioFormat{
		Name:       "flac",
		MIMEType:   "audio/flac",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}

	// FormatPCM16 is raw 16-bit PCM (for processing).
	FormatPCM16 = AudioFormat{
		Name:       "pcm",
		MIMEType:   "audio/pcm",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}

	// FormatWAV is WAV format (PCM with header).
	FormatWAV = AudioFormat{
		Name:       "wav",
		MIMEType:   "audio/wav",
		SampleRate: sampleRateDefault,
		BitDepth:   bitDepthDefault,
		Channels:   1,
	}
)

Common audio formats.

Functions

func APIKeyFromCredential added in v1.4.5

func APIKeyFromCredential(c credentials.Credential) string

APIKeyFromCredential returns the raw API key from an APIKey credential, or "" for any other credential shape (or nil). TTS providers want the key string for their constructors.

func RegisterFactory added in v1.4.5

func RegisterFactory(providerType string, factory Factory)

RegisterFactory registers a factory for the given provider type. Typically called from per-provider package init().

func ResolveCredential added in v1.4.5

func ResolveCredential(ctx context.Context, providerType string,
	cfgDir string, cred *credentials.CredentialConfig,
) (credentials.Credential, error)

ResolveCredential resolves a TTS provider's credential block into a concrete Credential, applying the same fallback chain as chat providers. Exposed as a helper for the SDK runtime-config layer.

func SynthesizeWithRetry added in v1.4.2

func SynthesizeWithRetry(
	ctx context.Context,
	svc Service,
	text string,
	config SynthesisConfig,
	retry RetryConfig,
) (io.ReadCloser, error)

SynthesizeWithRetry calls svc.Synthesize with bounded retry on transient errors. Only errors where SynthesisError.Retryable is true are retried; all others are returned immediately. Uses full jitter backoff to avoid synchronized retries across concurrent callers.
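The retry behaviour described above can be sketched roughly as follows. This is an illustration of bounded retry with full jitter backoff, not the package's actual implementation: synthesisError, fullJitter, retry, and demo are local stand-ins invented for the example, and the doubling schedule is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// synthesisError is a local stand-in that mirrors only the Retryable
// field of the documented SynthesisError type.
type synthesisError struct {
	Message   string
	Retryable bool
}

func (e *synthesisError) Error() string { return e.Message }

// fullJitter returns a random delay in [0, min(maxDelay, initial*2^retry)].
// Doubling per retry is an assumption made for this sketch.
func fullJitter(initial, maxDelay time.Duration, retry int) time.Duration {
	ceiling := initial
	for i := 0; i < retry && ceiling < maxDelay; i++ {
		ceiling *= 2
	}
	if ceiling > maxDelay {
		ceiling = maxDelay
	}
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}

// retry calls fn up to maxAttempts times and returns the number of
// attempts made. Only errors marked Retryable trigger another attempt.
func retry(ctx context.Context, maxAttempts int, initial, maxDelay time.Duration, fn func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1 // values < 1 mean "no retry"
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return attempt, nil
		}
		var se *synthesisError
		if !errors.As(err, &se) || !se.Retryable || attempt == maxAttempts {
			return attempt, err
		}
		select {
		case <-time.After(fullJitter(initial, maxDelay, attempt-1)):
		case <-ctx.Done():
			return attempt, ctx.Err()
		}
	}
	return maxAttempts, err
}

// demo fails twice with a retryable error, then succeeds.
func demo() (int, error) {
	calls := 0
	return retry(context.Background(), 3, time.Millisecond, 5*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return &synthesisError{Message: "rate limited", Retryable: true}
		}
		return nil
	})
}

func main() {
	attempts, err := demo()
	fmt.Println(attempts, err)
}
```

Drawing the delay uniformly from zero up to the current ceiling, rather than adding a small random offset to a fixed delay, is what spreads concurrent callers apart after a shared failure.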

Types

type AudioChunk

type AudioChunk struct {
	// Data is the raw audio bytes.
	Data []byte

	// Index is the chunk sequence number (0-indexed).
	Index int

	// Final indicates this is the last chunk.
	Final bool

	// Error is set if an error occurred during synthesis.
	Error error
}

AudioChunk represents a chunk of synthesized audio data.
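As a rough illustration of how the fields fit together, the sketch below drains a channel of chunks into one buffer, stopping on the first chunk that carries an Error or is marked Final. The audioChunk type is a local copy of the struct above, and collect and demo are hypothetical helpers, not part of the package.

```go
package main

import (
	"bytes"
	"fmt"
)

// audioChunk is a local copy of the documented AudioChunk struct.
type audioChunk struct {
	Data  []byte
	Index int
	Final bool
	Error error
}

// collect appends chunk data in arrival order. It returns early with the
// audio gathered so far if a chunk reports an error.
func collect(chunks <-chan audioChunk) ([]byte, error) {
	var buf bytes.Buffer
	for c := range chunks {
		if c.Error != nil {
			return buf.Bytes(), c.Error
		}
		buf.Write(c.Data)
		if c.Final {
			break
		}
	}
	return buf.Bytes(), nil
}

// demo feeds two pre-built chunks through a buffered channel.
func demo() ([]byte, error) {
	ch := make(chan audioChunk, 2)
	ch <- audioChunk{Data: []byte("ab"), Index: 0}
	ch <- audioChunk{Data: []byte("cd"), Index: 1, Final: true}
	close(ch)
	return collect(ch)
}

func main() {
	audio, err := demo()
	fmt.Printf("%d bytes, err=%v\n", len(audio), err)
}
```

A real-time player would write each chunk to the output device as it arrives instead of buffering; buffering is used here only to keep the example checkable.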

type AudioFormat

type AudioFormat struct {
	// Name is the format identifier ("mp3", "opus", "pcm", "aac", "flac", "wav").
	Name string

	// MIMEType is the content type (e.g., "audio/mpeg").
	MIMEType string

	// SampleRate is the audio sample rate in Hz.
	SampleRate int

	// BitDepth is the bits per sample (for PCM formats).
	BitDepth int

	// Channels is the number of audio channels (1=mono, 2=stereo).
	Channels int
}

AudioFormat describes an audio output format.
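For uncompressed formats the three numeric fields determine the data rate: bytes per second = SampleRate × (BitDepth / 8) × Channels. The sketch below works that arithmetic through. The audioFormat type is a trimmed local copy, pcmBytes is a hypothetical helper, and the 24000 Hz and 16-bit values are assumptions, since the values of sampleRateDefault and bitDepthDefault are not shown on this page.

```go
package main

import "fmt"

// audioFormat is a trimmed local copy of the documented AudioFormat struct.
type audioFormat struct {
	Name       string
	SampleRate int
	BitDepth   int
	Channels   int
}

// pcmBytes returns the raw size in bytes of ms milliseconds of
// uncompressed audio. It returns 0 when BitDepth is 0, which is how the
// predefined compressed formats (mp3, opus, aac) are declared.
func pcmBytes(f audioFormat, ms int) int {
	return f.SampleRate * (f.BitDepth / 8) * f.Channels * ms / 1000
}

func main() {
	// Assumed values: 24 kHz, 16-bit, mono.
	pcm := audioFormat{Name: "pcm", SampleRate: 24000, BitDepth: 16, Channels: 1}
	fmt.Println(pcmBytes(pcm, 1000)) // 24000 * 2 * 1 = 48000 bytes per second
}
```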

func (AudioFormat) String

func (f AudioFormat) String() string

String returns the format name.

type CartesiaOption

type CartesiaOption func(*CartesiaService)

CartesiaOption configures the Cartesia TTS service.

func WithCartesiaBaseURL

func WithCartesiaBaseURL(url string) CartesiaOption

WithCartesiaBaseURL sets a custom base URL.

func WithCartesiaClient

func WithCartesiaClient(client *http.Client) CartesiaOption

WithCartesiaClient sets a custom HTTP client.

func WithCartesiaModel

func WithCartesiaModel(model string) CartesiaOption

WithCartesiaModel sets the TTS model.

func WithCartesiaWSURL

func WithCartesiaWSURL(url string) CartesiaOption

WithCartesiaWSURL sets a custom WebSocket URL.
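The With... functions follow Go's functional options pattern: each returns a closure that mutates the service before the constructor hands it back. A minimal sketch of the pattern, using local stand-in types, is shown below. The field names, the placeholder base URL, and the choice of "sonic-2024-10-01" as the default model are assumptions made for illustration.

```go
package main

import "fmt"

// cartesiaService and cartesiaOption are local stand-ins for the
// documented types; the real struct's fields are unexported and unknown.
type cartesiaService struct {
	apiKey  string
	baseURL string
	model   string
}

type cartesiaOption func(*cartesiaService)

// withBaseURL overrides the API endpoint, e.g. to point at a test server.
func withBaseURL(url string) cartesiaOption {
	return func(s *cartesiaService) { s.baseURL = url }
}

// withModel overrides the default model.
func withModel(model string) cartesiaOption {
	return func(s *cartesiaService) { s.model = model }
}

// newCartesia applies defaults first, then each option in order, so later
// options win over earlier ones.
func newCartesia(apiKey string, opts ...cartesiaOption) *cartesiaService {
	s := &cartesiaService{
		apiKey:  apiKey,
		baseURL: "https://api.example.invalid", // placeholder, not the real endpoint
		model:   "sonic-2024-10-01",            // assumed default
	}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	s := newCartesia("key", withBaseURL("http://localhost:8080"), withModel("custom"))
	fmt.Println(s.baseURL, s.model)
}
```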

type CartesiaService

type CartesiaService struct {
	// contains filtered or unexported fields
}

CartesiaService implements TTS using Cartesia's ultra-low latency API. Cartesia specializes in real-time streaming TTS with <100ms first-byte latency.

func NewCartesia

func NewCartesia(apiKey string, opts ...CartesiaOption) *CartesiaService

NewCartesia creates a Cartesia TTS service.

func (*CartesiaService) Name

func (s *CartesiaService) Name() string

Name returns the provider identifier.

func (*CartesiaService) SupportedFormats

func (s *CartesiaService) SupportedFormats() []AudioFormat

SupportedFormats returns audio formats supported by Cartesia.

func (*CartesiaService) SupportedVoices

func (s *CartesiaService) SupportedVoices() []Voice

SupportedVoices returns a sample of available Cartesia voices.

func (*CartesiaService) Synthesize

func (s *CartesiaService) Synthesize(
	ctx context.Context, text string, config SynthesisConfig,
) (io.ReadCloser, error)

Synthesize converts text to audio using Cartesia's REST API. For streaming output, use SynthesizeStream instead.

func (*CartesiaService) SynthesizeStream

func (s *CartesiaService) SynthesizeStream(
	ctx context.Context, text string, config SynthesisConfig,
) (<-chan AudioChunk, error)

SynthesizeStream converts text to audio with streaming output via WebSocket. This provides ultra-low latency (<100ms first-byte) for real-time applications.

type ElevenLabsOption

type ElevenLabsOption func(*ElevenLabsService)

ElevenLabsOption configures the ElevenLabs TTS service.

func WithElevenLabsBaseURL

func WithElevenLabsBaseURL(url string) ElevenLabsOption

WithElevenLabsBaseURL sets a custom base URL.

func WithElevenLabsClient

func WithElevenLabsClient(client *http.Client) ElevenLabsOption

WithElevenLabsClient sets a custom HTTP client.

func WithElevenLabsModel

func WithElevenLabsModel(model string) ElevenLabsOption

WithElevenLabsModel sets the TTS model.

type ElevenLabsService

type ElevenLabsService struct {
	// contains filtered or unexported fields
}

ElevenLabsService implements TTS using ElevenLabs' API. ElevenLabs specializes in high-quality voice cloning and natural-sounding speech.

func NewElevenLabs

func NewElevenLabs(apiKey string, opts ...ElevenLabsOption) *ElevenLabsService

NewElevenLabs creates an ElevenLabs TTS service.

func (*ElevenLabsService) Name

func (s *ElevenLabsService) Name() string

Name returns the provider identifier.

func (*ElevenLabsService) SupportedFormats

func (s *ElevenLabsService) SupportedFormats() []AudioFormat

SupportedFormats returns audio formats supported by ElevenLabs.

func (*ElevenLabsService) SupportedVoices

func (s *ElevenLabsService) SupportedVoices() []Voice

SupportedVoices returns a sample of available ElevenLabs voices. Note: ElevenLabs has many more voices including custom cloned voices. Use the ElevenLabs API to get a complete list of available voices.

func (*ElevenLabsService) Synthesize

func (s *ElevenLabsService) Synthesize(
	ctx context.Context, text string, config SynthesisConfig,
) (io.ReadCloser, error)

Synthesize converts text to audio using ElevenLabs' TTS API.

type Factory added in v1.4.5

type Factory func(spec ProviderSpec) (Service, error)

Factory builds a Service from a spec. Per-provider packages register one of these via init() so this package never needs to import them (avoiding a cycle — implementations already import this package for the Service interface).
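The registration scheme can be pictured with a small self-contained sketch. Everything here (service, providerSpec, factory, registerFactory, createFromSpec, the "fake" provider, and the mutex-guarded map) is a local stand-in written for illustration; the package's internal registry may be organised differently.

```go
package main

import (
	"fmt"
	"sync"
)

// Trimmed local stand-ins for Service, ProviderSpec and Factory.
type service interface{ Name() string }

type providerSpec struct {
	Type  string
	Model string
}

type factory func(spec providerSpec) (service, error)

var (
	mu        sync.RWMutex
	factories = map[string]factory{}
)

// registerFactory records a factory under its provider type.
func registerFactory(providerType string, f factory) {
	mu.Lock()
	defer mu.Unlock()
	factories[providerType] = f
}

// createFromSpec looks up the factory for spec.Type and invokes it.
func createFromSpec(spec providerSpec) (service, error) {
	mu.RLock()
	f, ok := factories[spec.Type]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("unknown TTS provider type %q", spec.Type)
	}
	return f(spec)
}

// fake is a hypothetical provider used only for this example.
type fake struct{ model string }

func (f fake) Name() string { return "fake/" + f.model }

// In a real layout this init would live in the provider's own package,
// which imports the registry package rather than the other way round.
func init() {
	registerFactory("fake", func(spec providerSpec) (service, error) {
		return fake{model: spec.Model}, nil
	})
}

func main() {
	svc, err := createFromSpec(providerSpec{Type: "fake", Model: "m1"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(svc.Name())
}
```

Because registration happens in init(), an application opts in to a provider simply by importing its package, typically with a blank import.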

type OpenAIOption

type OpenAIOption func(*OpenAIService)

OpenAIOption configures the OpenAI TTS service.

func WithOpenAIBaseURL

func WithOpenAIBaseURL(url string) OpenAIOption

WithOpenAIBaseURL sets a custom base URL (for testing or proxies).

func WithOpenAIClient

func WithOpenAIClient(client *http.Client) OpenAIOption

WithOpenAIClient sets a custom HTTP client.

func WithOpenAIModel

func WithOpenAIModel(model string) OpenAIOption

WithOpenAIModel sets the TTS model to use.

type OpenAIService

type OpenAIService struct {
	// contains filtered or unexported fields
}

OpenAIService implements TTS using OpenAI's text-to-speech API.

func NewOpenAI

func NewOpenAI(apiKey string, opts ...OpenAIOption) *OpenAIService

NewOpenAI creates an OpenAI TTS service.

func (*OpenAIService) Name

func (s *OpenAIService) Name() string

Name returns the provider identifier.

func (*OpenAIService) SupportedFormats

func (s *OpenAIService) SupportedFormats() []AudioFormat

SupportedFormats returns audio formats supported by OpenAI TTS.

func (*OpenAIService) SupportedVoices

func (s *OpenAIService) SupportedVoices() []Voice

SupportedVoices returns available OpenAI voices.

func (*OpenAIService) Synthesize

func (s *OpenAIService) Synthesize(
	ctx context.Context, text string, config SynthesisConfig,
) (io.ReadCloser, error)

Synthesize converts text to audio using OpenAI's TTS API.

type ProviderSpec added in v1.4.5

type ProviderSpec struct {
	// ID is a stable identifier; informational only at this layer.
	ID string
	// Type selects the implementation: openai, elevenlabs, cartesia.
	Type string
	// Model overrides the provider's default voice/model. Empty uses
	// the per-provider default.
	Model string
	// BaseURL overrides the provider's default API endpoint.
	BaseURL string
	// Credential carries the resolved API key.
	Credential credentials.Credential
	// AdditionalConfig carries provider-specific extras (cartesia
	// websocket URL, etc.). Unknown keys are ignored.
	AdditionalConfig map[string]any
}

ProviderSpec is the runtime form of a TTS-provider declaration, used by CreateFromSpec to construct a Service implementation. The SDK's runtime-config layer translates pkg/config.TTSProviderConfig into this struct after resolving credentials.

type RetryConfig added in v1.4.2

type RetryConfig struct {
	// MaxAttempts is the total number of attempts including the initial
	// call. 3 means "initial + up to 2 retries". Values < 1 are
	// treated as 1 (no retry).
	MaxAttempts int
	// InitialDelay is the base backoff before the first retry.
	InitialDelay time.Duration
	// MaxDelay caps the per-attempt backoff.
	MaxDelay time.Duration
}

RetryConfig configures bounded retry for TTS synthesis calls. Defaults are on (unlike streaming retry) because TTS calls are one-shot and idempotent — retry has no content-duplication risk, and the alternative is silence.

func DefaultRetryConfig added in v1.4.2

func DefaultRetryConfig() RetryConfig

DefaultRetryConfig returns sensible defaults for TTS retry.

type Service

type Service interface {
	// Name returns the provider identifier (for logging/debugging).
	Name() string

	// Synthesize converts text to audio.
	// Returns a reader for streaming audio data.
	// The caller is responsible for closing the reader.
	Synthesize(ctx context.Context, text string, config SynthesisConfig) (io.ReadCloser, error)

	// SupportedVoices returns available voices for this provider.
	SupportedVoices() []Voice

	// SupportedFormats returns supported audio output formats.
	SupportedFormats() []AudioFormat
}

Service converts text to speech audio. This interface abstracts different TTS providers (OpenAI, ElevenLabs, etc.) enabling voice AI applications to use any provider interchangeably.

func CreateFromSpec added in v1.4.5

func CreateFromSpec(spec ProviderSpec) (Service, error)

CreateFromSpec returns a Service implementation for the given spec.

type StreamingService

type StreamingService interface {
	Service

	// SynthesizeStream converts text to audio with streaming output.
	// Returns a channel that receives audio chunks as they're generated.
	// The channel is closed when synthesis completes or an error occurs.
	SynthesizeStream(ctx context.Context, text string, config SynthesisConfig) (<-chan AudioChunk, error)
}

StreamingService extends Service with streaming synthesis capabilities. Streaming TTS provides lower latency by returning audio chunks as they're generated.
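Because StreamingService embeds Service, a caller holding a plain Service can probe for streaming support with a type assertion and fall back to one-shot synthesis otherwise. The sketch below shows that shape with simplified local interfaces: the context and SynthesisConfig parameters are dropped, chunks are plain byte slices, and basic, streamer, and speak are hypothetical names.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Simplified stand-ins for Service and StreamingService.
type service interface {
	Synthesize(text string) (io.ReadCloser, error)
}

type streamingService interface {
	service
	SynthesizeStream(text string) (<-chan []byte, error)
}

// basic supports one-shot synthesis only; it echoes the text as "audio".
type basic struct{}

func (basic) Synthesize(text string) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader(text)), nil
}

// streamer adds streaming on top of basic.
type streamer struct{ basic }

func (streamer) SynthesizeStream(text string) (<-chan []byte, error) {
	ch := make(chan []byte, 1)
	ch <- []byte(text)
	close(ch)
	return ch, nil
}

// speak prefers streaming when the provider supports it and reports
// which path was taken.
func speak(svc service, text string) (string, []byte, error) {
	if ss, ok := svc.(streamingService); ok {
		chunks, err := ss.SynthesizeStream(text)
		if err != nil {
			return "stream", nil, err
		}
		var out []byte
		for c := range chunks {
			out = append(out, c...)
		}
		return "stream", out, nil
	}
	r, err := svc.Synthesize(text)
	if err != nil {
		return "oneshot", nil, err
	}
	defer r.Close()
	out, err := io.ReadAll(r)
	return "oneshot", out, err
}

func main() {
	for _, svc := range []service{basic{}, streamer{}} {
		mode, audio, err := speak(svc, "hello")
		fmt.Println(mode, len(audio), err)
	}
}
```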

type SynthesisConfig

type SynthesisConfig struct {
	// Voice is the voice ID to use for synthesis.
	// Available voices vary by provider - use SupportedVoices() to list options.
	Voice string

	// Format is the output audio format.
	// Default is MP3 for most providers.
	Format AudioFormat

	// Speed is the speech rate multiplier (0.25-4.0, default 1.0).
	// Not all providers support speed adjustment.
	Speed float64

	// Pitch adjusts the voice pitch (-20 to 20 semitones, default 0).
	// Not all providers support pitch adjustment.
	Pitch float64

	// Language is the language code for synthesis (e.g., "en-US").
	// Required for some providers, optional for others.
	Language string

	// Model is the TTS model to use (provider-specific).
	// For OpenAI: "tts-1" (fast) or "tts-1-hd" (high quality).
	Model string
}

SynthesisConfig configures text-to-speech synthesis.
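The documented ranges for Speed and Pitch can be checked before a request is sent. The following is a sketch only: synthesisConfig is a trimmed local copy, validate is a hypothetical helper, and treating a zero Speed as "use the default of 1.0" is an assumption. Whether the package or its providers perform similar validation is not stated here.

```go
package main

import "fmt"

// synthesisConfig is a trimmed local copy of the documented struct.
type synthesisConfig struct {
	Speed float64
	Pitch float64
}

// validate returns the first range violation it finds, or nil.
func validate(c synthesisConfig) error {
	// Assumption: zero means "unset", so only non-zero speeds are checked.
	if c.Speed != 0 && (c.Speed < 0.25 || c.Speed > 4.0) {
		return fmt.Errorf("speed %v outside 0.25-4.0", c.Speed)
	}
	if c.Pitch < -20 || c.Pitch > 20 {
		return fmt.Errorf("pitch %v outside -20 to 20 semitones", c.Pitch)
	}
	return nil
}

func main() {
	fmt.Println(validate(synthesisConfig{Speed: 1.0}))
	fmt.Println(validate(synthesisConfig{Speed: 5.0}))
}
```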

func DefaultSynthesisConfig

func DefaultSynthesisConfig() SynthesisConfig

DefaultSynthesisConfig returns sensible defaults for synthesis.

type SynthesisError

type SynthesisError struct {
	// Provider is the TTS provider that returned the error.
	Provider string

	// Code is the provider-specific error code.
	Code string

	// Message is the error message.
	Message string

	// Cause is the underlying error (if any).
	Cause error

	// Retryable indicates if the error is transient and retry may succeed.
	Retryable bool
}

SynthesisError provides detailed error information from TTS providers.

func NewSynthesisError

func NewSynthesisError(provider, code, message string, cause error, retryable bool) *SynthesisError

NewSynthesisError creates a new SynthesisError.

func (*SynthesisError) Error

func (e *SynthesisError) Error() string

Error implements the error interface.

func (*SynthesisError) Unwrap

func (e *SynthesisError) Unwrap() error

Unwrap returns the underlying error.
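Since SynthesisError implements Unwrap, callers can inspect failures with the standard errors.As and errors.Is helpers. The sketch below uses a local copy of the struct and a local sentinel. The Error() message format and the idea that providers set Cause to a sentinel such as ErrRateLimited are assumptions; classify and sample are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// errRateLimited stands in for the documented ErrRateLimited sentinel.
var errRateLimited = errors.New("rate limit exceeded")

// synthesisError is a local copy of the documented SynthesisError struct.
type synthesisError struct {
	Provider  string
	Code      string
	Message   string
	Cause     error
	Retryable bool
}

// Error uses an assumed message format.
func (e *synthesisError) Error() string {
	return fmt.Sprintf("%s: %s (%s)", e.Provider, e.Message, e.Code)
}

func (e *synthesisError) Unwrap() error { return e.Cause }

// classify reports whether err is marked retryable and whether a
// rate-limit sentinel appears anywhere in its chain.
func classify(err error) (retryable, rateLimited bool) {
	var se *synthesisError
	if errors.As(err, &se) {
		retryable = se.Retryable
	}
	return retryable, errors.Is(err, errRateLimited)
}

// sample builds a wrapped error the way a caller might receive it.
func sample() error {
	return fmt.Errorf("synthesize: %w", &synthesisError{
		Provider:  "openai",
		Code:      "429",
		Message:   "too many requests",
		Cause:     errRateLimited,
		Retryable: true,
	})
}

func main() {
	retryable, rateLimited := classify(sample())
	fmt.Println(retryable, rateLimited)
}
```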

type Voice

type Voice struct {
	// ID is the provider-specific voice identifier.
	ID string

	// Name is a human-readable voice name.
	Name string

	// Language is the primary language code (e.g., "en", "es", "fr").
	Language string

	// Gender is the voice gender ("male", "female", "neutral").
	Gender string

	// Description provides additional voice characteristics.
	Description string

	// Preview is a URL to a voice sample (if available).
	Preview string
}

Voice describes a TTS voice available from a provider.
