Documentation ¶
Overview ¶
Package voice provides interfaces and types for Text-to-Speech synthesis. It defines a modular interface that supports multiple TTS backends.
Index ¶
- Constants
- func AnalyzeSpeechRate(segments []CaptionedSegment) float64
- func AnalyzeWAVAudio(data []byte) (int, int, int, error)
- func ComputeWAVDuration(data []byte, info wavInfo) int
- func DetectLeadingSilence(data []byte, info wavInfo, threshold ...int16) int
- func DetectTrailingSilence(data []byte, info wavInfo, threshold ...int16) int
- func ValidateWAVConsistency(segments [][]byte, expectedInfo wavInfo) error
- type Audio
- type CaptionedAudio
- type CaptionedChunk
- type CaptionedDialogueResult
- type CaptionedSegment
- type CaptionedSynthesizer
- type DialogueSegment
- type DialogueSynthesizer
- func (ds *DialogueSynthesizer) StreamDialogue(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
- func (ds *DialogueSynthesizer) StreamDialogueParallel(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
- func (ds *DialogueSynthesizer) StreamDialogueParallelWithLimit(ctx context.Context, segments []DialogueSegment, concurrencyLimit int) (io.ReadCloser, error)
- func (ds *DialogueSynthesizer) SynthesizeDialogue(ctx context.Context, segments []DialogueSegment) ([]*Audio, error)
- type DialogueSynthesizerCaptioned
- func (ds *DialogueSynthesizerCaptioned) CalculatePerfectPause(prev, curr *CaptionedSegment) int
- func (ds *DialogueSynthesizerCaptioned) GenerateSRT(segments []CaptionedSegment) string
- func (ds *DialogueSynthesizerCaptioned) GenerateSRTWithSpeakers(segments []CaptionedSegment) string
- func (ds *DialogueSynthesizerCaptioned) StreamDialogueCaptioned(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
- func (ds *DialogueSynthesizerCaptioned) SynthesizeDialogueCaptioned(ctx context.Context, segments []DialogueSegment) (*CaptionedDialogueResult, error)
- type EstimatedCaptionedSynthesizer
- func (e *EstimatedCaptionedSynthesizer) Stream(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
- func (e *EstimatedCaptionedSynthesizer) StreamCaptioned(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
- func (e *EstimatedCaptionedSynthesizer) Synthesize(ctx context.Context, text string, opts ...Option) (*Audio, error)
- func (e *EstimatedCaptionedSynthesizer) SynthesizeCaptioned(ctx context.Context, text string, opts ...Option) (*CaptionedAudio, error)
- func (e *EstimatedCaptionedSynthesizer) WithSilentThreshold(threshold int16) *EstimatedCaptionedSynthesizer
- type FormatMismatchError
- type Option
- type SynthesizeOptions
- type Synthesizer
- type WordTimestamp
Constants ¶
const (
	SilenceThresholdDefault = 500
	SilenceWindowMs         = 10
)
const (
	PauseMultQuestion     = 1.3
	PauseMultExclamation  = 1.2
	PauseMultEllipsis     = 1.4
	PauseMultDash         = 1.5
	PauseMultComma        = 0.7
	PauseMultShortResp    = 0.6
	PauseMultLongSentence = 1.2
	PauseMultWait         = 1.3
	PauseMultContinuation = 0.8
	PauseMultTransition   = 1.1
	PauseMultEmotional    = 1.2
	PauseMultSameSpeaker  = 0.25
	PauseMultMin          = 0.5
	PauseMultMax          = 1.8
	PauseMultInterruption = -0.3
	RoomToneAmplitude     = 30
	RoomToneMaxPauseMs    = 3000
	MaxTrailingSilenceMs  = 500
	MaxLeadingSilenceMs   = 300
	MinPauseMs            = 50
)
Pause multipliers for context-aware dialogue pacing. These values reflect natural speech patterns and are tuned to produce a smooth, natural-sounding conversation flow.
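To illustrate how a base pause and these multipliers might combine, here is a minimal sketch. The constant values are copied from the block above; the idea of multiplying combined context multipliers and clamping the result is an assumption about the package's internals, not its actual code:

```go
package main

import "fmt"

// Constants mirroring the package's pause multipliers (values from the docs above).
const (
	PauseMultQuestion = 1.3
	PauseMultComma    = 0.7
	PauseMultMin      = 0.5
	PauseMultMax      = 1.8
)

// clampMult keeps a combined pause multiplier inside [PauseMultMin, PauseMultMax]
// before it scales the base pause duration.
func clampMult(m float64) float64 {
	if m < PauseMultMin {
		return PauseMultMin
	}
	if m > PauseMultMax {
		return PauseMultMax
	}
	return m
}

func main() {
	base := 250 // base pause in ms
	// A question followed by a comma-continued thought: multipliers combine.
	m := clampMult(PauseMultQuestion * PauseMultComma)
	fmt.Printf("pause: %dms\n", int(float64(base)*m))
}
```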
Variables ¶
This section is empty.
Functions ¶
func AnalyzeSpeechRate ¶
func AnalyzeSpeechRate(segments []CaptionedSegment) float64
AnalyzeSpeechRate calculates words per minute for a speaker. This enables automatic speed adjustment for consistent pacing.
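A minimal sketch of the words-per-minute calculation this function likely performs, using a local copy of the WordTimestamp type (the exact formula inside AnalyzeSpeechRate is an assumption):

```go
package main

import "fmt"

type WordTimestamp struct {
	Word    string
	StartMs int
	EndMs   int
}

// speechRateWPM divides the word count by the elapsed speech time
// (first word start to last word end) and scales to one minute.
func speechRateWPM(ts []WordTimestamp) float64 {
	if len(ts) == 0 {
		return 0
	}
	speechMs := ts[len(ts)-1].EndMs - ts[0].StartMs
	if speechMs <= 0 {
		return 0
	}
	return float64(len(ts)) / float64(speechMs) * 60000
}

func main() {
	ts := []WordTimestamp{
		{Word: "hello", StartMs: 0, EndMs: 400},
		{Word: "world", StartMs: 450, EndMs: 900},
	}
	fmt.Printf("%.0f wpm\n", speechRateWPM(ts)) // 2 words over 900ms
}
```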
func ComputeWAVDuration ¶
func DetectLeadingSilence ¶
func DetectTrailingSilence ¶
func ValidateWAVConsistency ¶
Types ¶
type Audio ¶
type Audio struct {
// Data contains the raw audio bytes.
Data []byte
// Format specifies the audio format (e.g., "mp3", "wav", "opus").
Format string
}
Audio represents synthesized audio data with its format.
type CaptionedAudio ¶
type CaptionedAudio struct {
// Data contains the raw audio bytes.
Data []byte
// Format specifies the audio format (e.g., "mp3", "wav", "opus").
Format string
// Timestamps contains word-level timing information.
// Words are ordered chronologically as they appear in the audio.
Timestamps []WordTimestamp
// DurationMs is the total duration of the audio in milliseconds.
// This is the end time of the last word plus any trailing silence.
DurationMs int
}
CaptionedAudio represents synthesized audio with word-level timestamps. This provides precise timing information for each word, enabling advanced features like subtitle generation, speech analysis, and perfect synchronization.
type CaptionedChunk ¶
type CaptionedChunk struct {
// Audio contains base64-encoded audio data for this chunk.
Audio string `json:"audio"`
// Timestamps contains word-level timing for words in this chunk.
Timestamps []WordTimestamp `json:"timestamps"`
}
CaptionedChunk represents a single chunk from a captioned stream. It contains both audio data (base64 encoded) and word timestamps for incremental processing during streaming synthesis.
type CaptionedDialogueResult ¶
type CaptionedDialogueResult struct {
// Audio is the complete dialogue audio.
Audio []byte
// Format is the audio format (e.g., "wav").
Format string
// Segments contains timing information for each segment.
Segments []CaptionedSegment
// TotalDurationMs is the total dialogue duration in milliseconds.
TotalDurationMs int
// Subtitles is the SRT-format subtitle string, if enabled.
Subtitles string
}
CaptionedDialogueResult contains the synthesis output with timing information.
type CaptionedSegment ¶
type CaptionedSegment struct {
// Speaker is the segment speaker.
Speaker string
// Text is the spoken text.
Text string
// Audio is the segment audio data.
Audio []byte
// Timestamps contains word-level timing.
Timestamps []WordTimestamp
// StartMs is when this segment starts in the full dialogue.
StartMs int
// EndMs is when this segment ends in the full dialogue.
EndMs int
// DurationMs is the total segment duration including trailing silence.
DurationMs int
// SpeechDurationMs is the actual speech duration without trailing silence.
SpeechDurationMs int
// TrailingSilenceMs is the silence at the end of the audio.
TrailingSilenceMs int
// LeadingSilenceMs is the silence at the start of the audio.
LeadingSilenceMs int
}
CaptionedSegment represents one speaker's segment with timing details.
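The silence fields can be derived from the word timestamps and the total duration; the sketch below shows one plausible relation consistent with the field documentation (the helper silenceFromTimestamps is hypothetical):

```go
package main

import "fmt"

type WordTimestamp struct {
	Word    string
	StartMs int
	EndMs   int
}

// silenceFromTimestamps derives leading and trailing silence from word timing:
// leading silence runs until the first word starts, trailing silence runs from
// the last word's end to the total duration.
func silenceFromTimestamps(ts []WordTimestamp, durationMs int) (leading, trailing int) {
	if len(ts) == 0 {
		return 0, durationMs
	}
	leading = ts[0].StartMs
	trailing = durationMs - ts[len(ts)-1].EndMs
	return leading, trailing
}

func main() {
	ts := []WordTimestamp{{"hi", 120, 500}, {"there", 550, 980}}
	lead, trail := silenceFromTimestamps(ts, 1200)
	fmt.Println(lead, trail) // 120 220
}
```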
type CaptionedSynthesizer ¶
type CaptionedSynthesizer interface {
Synthesizer
// SynthesizeCaptioned generates audio from text with word-level timestamps.
// This is similar to Synthesize but returns timing information for each word,
// enabling precise synchronization and analysis.
//
// The returned CaptionedAudio contains both the audio data and timestamps.
// Not all TTS providers support this feature.
//
// Example:
//
// audio, err := synth.SynthesizeCaptioned(ctx, "Hello world", opts...)
// for _, ts := range audio.Timestamps {
//     fmt.Printf("%d-%dms: %s\n", ts.StartMs, ts.EndMs, ts.Word)
// }
SynthesizeCaptioned(ctx context.Context, text string, opts ...Option) (*CaptionedAudio, error)
// StreamCaptioned generates audio from text with timestamps streamed incrementally.
// Each chunk contains a JSON object with "audio" (base64) and "timestamps" fields.
// This is useful for long texts where you want to process audio and timing
// information as it's generated, rather than waiting for complete synthesis.
//
// The returned ReadCloser streams JSON objects, one per chunk.
//
// Example:
//
// stream, err := synth.StreamCaptioned(ctx, longText, opts...)
// defer stream.Close()
// decoder := json.NewDecoder(stream)
// for {
//     var chunk CaptionedChunk
//     if err := decoder.Decode(&chunk); err != nil {
//         if err == io.EOF { break }
//         return err
//     }
//     // Process chunk.Audio and chunk.Timestamps
// }
StreamCaptioned(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
}
CaptionedSynthesizer extends the basic Synthesizer interface with timestamp-aware synthesis capabilities.
Implementations that support word-level timestamps (like Kokoro-FastAPI) should implement this interface in addition to Synthesizer. This enables more sophisticated audio processing like:
- Exact pause calculation based on actual speech duration
- Automatic subtitle generation (SRT, VTT)
- Speech rate analysis and normalization
- Perfect synchronization for background music/ambience
- Quality control for podcast production
type DialogueSegment ¶
type DialogueSegment struct {
// Speaker identifies who is speaking (e.g., "Alice", "Bob", "Narrator").
// Must match a key in the DialogueSynthesizer.VoiceMap.
Speaker string
// Text is the content spoken by this speaker.
// Punctuation and sentence structure affect pause timing between segments.
Text string
}
DialogueSegment represents a single speaker's line in a multi-speaker dialogue. Each segment specifies who is speaking (Speaker) and what they say (Text), which allows the synthesizer to select the appropriate voice and apply context-aware pacing based on the conversation flow.
type DialogueSynthesizer ¶
type DialogueSynthesizer struct {
// Syn is the underlying TTS engine used to generate audio for each segment.
Syn Synthesizer
// VoiceMap maps speaker names to voice identifiers.
// For OpenAI: alloy, echo, fable, onyx, nova, shimmer.
// For Kokoro: af_bella, af_sky, am_adam, etc.
// Use "+" for voice mixing: "af_bella(3)+af_heart(1)" for 75%/25% mix.
VoiceMap map[string]string
// SpeedMap maps speaker names to speech speed multipliers.
// Values typically range from 0.8 to 1.2, where 1.0 is normal speed.
// Speakers not in the map default to 1.0.
// Use higher values for energetic speakers, lower for thoughtful speakers.
SpeedMap map[string]float64
// Format specifies the output audio format (e.g., "wav", "mp3").
// Default is "wav" for better concatenation support and crossfade quality.
// Note: Compressed formats (mp3, opus) may introduce artifacts during processing.
Format string
// CrossfadeMs specifies crossfade duration in milliseconds (default: 50).
// Higher values (80-100ms) create smoother transitions but may reduce clarity.
// Lower values (20-40ms) are faster but may sound abrupt on speaker changes.
// Set to 0 to disable crossfading (useful for compressed formats).
// Note: Crossfading requires buffering segments in memory.
CrossfadeMs int
// PauseMsMin specifies minimum pause duration between segments in milliseconds (default: 200).
// This is the base pause duration before context-aware adjustments.
// Set both PauseMsMin and PauseMsMax to 0 to disable pauses between segments.
// Recommended: 150-250ms for natural conversation, 300-500ms for dramatic effect.
PauseMsMin int
// PauseMsMax specifies maximum pause duration between segments in milliseconds (default: 300).
// A random value between PauseMsMin and PauseMsMax provides natural variation.
// Context-aware adjustments (questions, exclamations, etc.) can extend beyond this maximum
// by up to 2x to accommodate natural speech patterns.
PauseMsMax int
// NormalizeVolume enables peak volume normalization per segment (default: true).
// Ensures consistent loudness across different speakers/voices, which is critical
// for dialogue where different native volumes could be jarring.
// Normalization targets 95% of maximum amplitude to avoid clipping while maintaining
// consistent perceived volume across all speakers.
NormalizeVolume bool
}
DialogueSynthesizer generates audio for multi-speaker dialogues with natural conversation flow. It maps speakers to voice IDs and synthesizes each segment with the appropriate voice, then concatenates the results into a single audio stream with intelligent pacing.
The synthesizer automatically adjusts pause durations based on conversational context:
- Questions and exclamations get longer pauses for processing time
- Comma-terminated segments get shorter pauses as thoughts continue
- Short responses get minimal pauses for quick back-and-forth
- Transition words ("so", "well") get appropriate pauses
Audio is processed with crossfading between speakers and volume normalization to ensure consistent loudness across different voices.
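The peak volume normalization mentioned above can be sketched as follows, targeting a fraction of full scale (e.g. 0.95) to avoid clipping. This is an illustration of the technique, not the package's implementation:

```go
package main

import "fmt"

// normalizePeak scales 16-bit PCM samples in place so the loudest sample
// reaches target (a fraction of int16 full scale, e.g. 0.95).
func normalizePeak(samples []int16, target float64) {
	peak := 0
	for _, s := range samples {
		a := int(s)
		if a < 0 {
			a = -a // absolute value in int to avoid int16 overflow at -32768
		}
		if a > peak {
			peak = a
		}
	}
	if peak == 0 {
		return // all silence; nothing to scale
	}
	gain := target * 32767 / float64(peak)
	for i, s := range samples {
		samples[i] = int16(float64(s) * gain)
	}
}

func main() {
	samples := []int16{0, 1000, -2000, 500}
	normalizePeak(samples, 0.95)
	fmt.Println(samples)
}
```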
Example:
	syn, _ := openai.NewSynthesizer(openai.WithBaseURL("http://localhost:8880/v1"))
	ds := voice.NewDialogueSynthesizer(syn, map[string]string{
		"Alice": "af_bella",
		"Bob":   "am_adam",
	})
	ds.SpeedMap = map[string]float64{
		"Bob": 0.95, // Bob speaks slightly slower
	}
	stream, _ := ds.StreamDialogue(ctx, []voice.DialogueSegment{
		{Speaker: "Alice", Text: "What do you think?"},
		{Speaker: "Bob", Text: "I think it's great!"},
	})
func NewDialogueSynthesizer ¶
func NewDialogueSynthesizer(syn Synthesizer, voiceMap map[string]string, format ...string) *DialogueSynthesizer
NewDialogueSynthesizer creates a new dialogue synthesizer for multi-speaker audio generation. The synthesizer applies context-aware pacing, crossfading, and volume normalization to create natural-sounding dialogues.
The format defaults to "wav" which supports reliable concatenation and processing. For dialogue synthesis, WAV is recommended over compressed formats like MP3 to avoid quality degradation through multiple processing steps.
Default settings:
- CrossfadeMs: 50ms (smooth transitions between speakers)
- PauseMsMin: 200ms (minimum pause between segments)
- PauseMsMax: 300ms (maximum pause, randomized for naturalness)
- NormalizeVolume: true (consistent loudness across voices)
Example:
	ds := voice.NewDialogueSynthesizer(synthesizer, map[string]string{
		"Alice": "af_bella",
		"Bob":   "am_adam",
	})
	ds.SpeedMap = map[string]float64{
		"Alice": 1.05, // slightly faster
		"Bob":   0.95, // slightly slower
	}
func (*DialogueSynthesizer) StreamDialogue ¶
func (ds *DialogueSynthesizer) StreamDialogue(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
StreamDialogue generates audio for all segments and streams them as a single concatenated audio stream. Segments are synthesized sequentially with context-aware pauses and crossfading between speakers.
The returned io.ReadCloser streams the complete dialogue audio. The caller must close the ReadCloser when done reading.
Example:
	stream, err := ds.StreamDialogue(ctx, []voice.DialogueSegment{
		{Speaker: "Alice", Text: "Hello?"},
		{Speaker: "Bob", Text: "Hi there!"},
	})
	defer stream.Close()
	io.Copy(outputFile, stream)
func (*DialogueSynthesizer) StreamDialogueParallel ¶
func (ds *DialogueSynthesizer) StreamDialogueParallel(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
StreamDialogueParallel generates audio for segments in parallel and streams them in order. This is significantly faster than sequential synthesis for dialogues with many segments, as all segments are synthesized concurrently, then assembled in order with proper crossfading and pause timing.
The concurrency limit (exposed via StreamDialogueParallelWithLimit) controls how many segments are synthesized simultaneously. A value of 0 or negative means no limit, which may overwhelm the API for large dialogues. Recommended: 5-10 for Kokoro-FastAPI; for OpenAI, use lower values (3-5) due to rate limits.
Returns a ReadCloser that streams the concatenated audio with natural transitions.
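The ordered, bounded-concurrency pattern these methods describe can be sketched with a buffered channel as a semaphore; synthesize here is a hypothetical stand-in for the real per-segment TTS call:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// synthesize stands in for a per-segment TTS call (hypothetical).
func synthesize(text string) string { return strings.ToUpper(text) }

// parallelOrdered runs synthesize concurrently with at most limit calls in
// flight, writing each result into its input slot so output order matches
// input order regardless of completion order.
func parallelOrdered(texts []string, limit int) []string {
	out := make([]string, len(texts))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, t := range texts {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a concurrency slot
			defer func() { <-sem }() // release it
			out[i] = synthesize(t)
		}(i, t)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(parallelOrdered([]string{"hello", "world"}, 2))
}
```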
func (*DialogueSynthesizer) StreamDialogueParallelWithLimit ¶
func (ds *DialogueSynthesizer) StreamDialogueParallelWithLimit(ctx context.Context, segments []DialogueSegment, concurrencyLimit int) (io.ReadCloser, error)
StreamDialogueParallelWithLimit generates audio with controlled concurrency. Use this for large dialogues or when dealing with rate-limited APIs. ConcurrencyLimit of 5-10 is recommended for most use cases.
func (*DialogueSynthesizer) SynthesizeDialogue ¶
func (ds *DialogueSynthesizer) SynthesizeDialogue(ctx context.Context, segments []DialogueSegment) ([]*Audio, error)
SynthesizeDialogue generates audio for all segments and returns individual audio files. This is useful when you want to process each speaker's audio separately, apply custom audio processing, or store segments individually.
Returns a slice of Audio objects, one per segment, in the same order as the input.
type DialogueSynthesizerCaptioned ¶
type DialogueSynthesizerCaptioned struct {
// Syn is the captioned synthesizer used to generate audio with timestamps.
Syn CaptionedSynthesizer
// VoiceMap maps speaker names to voice identifiers.
VoiceMap map[string]string
// SpeedMap maps speaker names to speech speed multipliers.
// Values typically range from 0.8 to 1.2, where 1.0 is normal speed.
SpeedMap map[string]float64
// Format specifies the output audio format (e.g., "wav", "mp3").
// Default is "wav" for best quality with crossfading.
Format string
// CrossfadeMs specifies crossfade duration in milliseconds (default: 50).
// Set to 0 to disable crossfading.
CrossfadeMs int
// TargetPauseMs is the target pause between segments (default: 250).
// This is the desired gap between the END of one speech and START of the next.
TargetPauseMs int
// NormalizeVolume enables peak volume normalization per segment (default: true).
NormalizeVolume bool
// GenerateSubtitles enables automatic subtitle generation (default: true).
// When enabled, returns both audio and SRT-format subtitles.
GenerateSubtitles bool
}
DialogueSynthesizerCaptioned generates multi-speaker dialogue with perfect timing using word-level timestamps from captioned synthesis.
This synthesizer provides superior dialogue quality compared to DialogueSynthesizer by using actual speech duration and timing information instead of heuristics. It eliminates problems like:
- Double-pausing (built-in silence + added silence)
- Cutting words during crossfade
- Inconsistent speech rates between speakers
- Manual subtitle timing
Requirements: The underlying synthesizer must implement CaptionedSynthesizer interface. Compatible providers: Kokoro-FastAPI (with /dev/captioned_speech endpoint).
func NewDialogueSynthesizerCaptioned ¶
func NewDialogueSynthesizerCaptioned(syn CaptionedSynthesizer, voiceMap map[string]string, format ...string) (*DialogueSynthesizerCaptioned, error)
NewDialogueSynthesizerCaptioned creates a new captioned dialogue synthesizer for multi-speaker audio generation. The synthesizer uses word-level timestamps for perfect pause calculation and optional subtitle generation.
Prerequisites: The synthesizer parameter must implement CaptionedSynthesizer. This is supported by Kokoro-FastAPI and similar providers with timestamp capabilities.
The format defaults to "wav" which preserves quality through multiple processing steps. For subtitle generation and timestamp analysis, WAV is strongly recommended.
Returns an error if the synthesizer is nil or voiceMap is empty.
Example:
	syn, _ := openai.NewSynthesizer(openai.WithBaseURL("http://localhost:8880/v1"))
	ds, err := voice.NewDialogueSynthesizerCaptioned(syn, map[string]string{
		"Alice": "af_bella",
		"Bob":   "am_adam",
	})
func (*DialogueSynthesizerCaptioned) CalculatePerfectPause ¶
func (ds *DialogueSynthesizerCaptioned) CalculatePerfectPause(prev, curr *CaptionedSegment) int
CalculatePerfectPause calculates the exact pause needed between two segments. It uses word-level timestamps to avoid double-pausing and applies context-aware adjustments based on dialogue content.
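A sketch of the idea: the silence to insert is the target pause minus the silence already baked into the two segments' boundaries, floored at MinPauseMs. The exact formula used by CalculatePerfectPause is an assumption:

```go
package main

import "fmt"

const MinPauseMs = 50 // from the package constants

// perfectPause computes the silence to insert so the audible gap between two
// segments is close to targetMs, subtracting the previous segment's trailing
// silence and the next segment's leading silence to avoid double-pausing.
func perfectPause(prevTrailingMs, currLeadingMs, targetMs int) int {
	pause := targetMs - prevTrailingMs - currLeadingMs
	if pause < MinPauseMs {
		return MinPauseMs
	}
	return pause
}

func main() {
	// Previous segment ends with 120ms of silence, next starts with 40ms.
	fmt.Println(perfectPause(120, 40, 250)) // 90
}
```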
func (*DialogueSynthesizerCaptioned) GenerateSRT ¶
func (ds *DialogueSynthesizerCaptioned) GenerateSRT(segments []CaptionedSegment) string
GenerateSRT creates SRT-format subtitles from captioned segments. This automatically generates perfectly timed subtitles without manual adjustment. This is a convenience method that wraps the internal generateSRT function.
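The SRT cue format this method emits can be sketched as follows; how words are grouped into cues is the package's concern, so the hypothetical srtEntry below renders just one cue:

```go
package main

import "fmt"

// srtTime formats milliseconds as an SRT timestamp (HH:MM:SS,mmm).
func srtTime(ms int) string {
	return fmt.Sprintf("%02d:%02d:%02d,%03d",
		ms/3600000, ms/60000%60, ms/1000%60, ms%1000)
}

// srtEntry renders one numbered SRT cue: index, time range, text, blank line.
func srtEntry(index, startMs, endMs int, text string) string {
	return fmt.Sprintf("%d\n%s --> %s\n%s\n",
		index, srtTime(startMs), srtTime(endMs), text)
}

func main() {
	fmt.Print(srtEntry(1, 1500, 3250, "Hello there!"))
}
```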
func (*DialogueSynthesizerCaptioned) GenerateSRTWithSpeakers ¶
func (ds *DialogueSynthesizerCaptioned) GenerateSRTWithSpeakers(segments []CaptionedSegment) string
GenerateSRTWithSpeakers creates SRT-format subtitles with speaker labels. Each subtitle line includes "[Speaker]: word" format, useful for multi-speaker content.
func (*DialogueSynthesizerCaptioned) StreamDialogueCaptioned ¶
func (ds *DialogueSynthesizerCaptioned) StreamDialogueCaptioned(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
StreamDialogueCaptioned streams dialogue with timestamps, calculating perfect pauses on-the-fly using speech duration information.
IMPLEMENTATION STATUS: This method is currently a stub and returns an error. Streaming captioned dialogue requires buffering segments anyway to calculate perfect pauses, so there's no significant benefit over SynthesizeDialogueCaptioned.
FUTURE WORK: If streaming is needed for very long dialogues, consider:
1. Using a heuristic pause calculation instead of perfect pause
2. Buffering N segments ahead for pause calculation while streaming
3. Using a separate goroutine for synthesis and another for assembly
For now, use SynthesizeDialogueCaptioned which provides the full feature set.
func (*DialogueSynthesizerCaptioned) SynthesizeDialogueCaptioned ¶
func (ds *DialogueSynthesizerCaptioned) SynthesizeDialogueCaptioned(ctx context.Context, segments []DialogueSegment) (*CaptionedDialogueResult, error)
SynthesizeDialogueCaptioned generates dialogue with perfect timing using timestamps. This method provides superior audio quality by:
- Calculating exact pauses from actual speech duration
- Avoiding double-pausing (built-in silence + added silence)
- Crossfading at word boundaries instead of random positions
- Generating subtitles automatically (if enabled)
Returns complete dialogue audio and detailed timing information for each segment.
type EstimatedCaptionedSynthesizer ¶
type EstimatedCaptionedSynthesizer struct {
Syn Synthesizer
Format string
SilentThreshold int16
}
func NewEstimatedCaptionedSynthesizer ¶
func NewEstimatedCaptionedSynthesizer(syn Synthesizer, format ...string) *EstimatedCaptionedSynthesizer
func (*EstimatedCaptionedSynthesizer) Stream ¶
func (e *EstimatedCaptionedSynthesizer) Stream(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
func (*EstimatedCaptionedSynthesizer) StreamCaptioned ¶
func (e *EstimatedCaptionedSynthesizer) StreamCaptioned(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
func (*EstimatedCaptionedSynthesizer) Synthesize ¶
func (*EstimatedCaptionedSynthesizer) SynthesizeCaptioned ¶
func (e *EstimatedCaptionedSynthesizer) SynthesizeCaptioned(ctx context.Context, text string, opts ...Option) (*CaptionedAudio, error)
func (*EstimatedCaptionedSynthesizer) WithSilentThreshold ¶
func (e *EstimatedCaptionedSynthesizer) WithSilentThreshold(threshold int16) *EstimatedCaptionedSynthesizer
type FormatMismatchError ¶
func (*FormatMismatchError) Error ¶
func (e *FormatMismatchError) Error() string
type Option ¶
type Option func(*SynthesizeOptions)
Option is a functional option for configuring synthesis parameters.
type SynthesizeOptions ¶
type SynthesizeOptions struct {
// Model specifies the TTS model to use (e.g., "tts-1", "kokoro").
Model string
// Voice specifies the voice identifier (e.g., "alloy", "af_bella").
Voice string
// Format specifies the output audio format (e.g., "mp3", "wav").
Format string
// Speed specifies the speech speed (0.25 to 4.0, where 1.0 is normal).
Speed float64
}
SynthesizeOptions configures text-to-speech synthesis parameters.
type Synthesizer ¶
type Synthesizer interface {
// Synthesize generates audio from text and returns the complete audio data.
// Use this for shorter texts where buffering the entire response is acceptable.
Synthesize(ctx context.Context, text string, opts ...Option) (*Audio, error)
// Stream generates audio from text and returns a stream for reading audio chunks.
// Use this for longer texts or when you want to process audio as it arrives.
// The caller is responsible for closing the returned ReadCloser.
Stream(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
}
Synthesizer is the interface for Text-to-Speech providers. Implementations convert text into audio data, supporting both buffered synthesis and streaming modes.
type WordTimestamp ¶
type WordTimestamp struct {
// Word is the text content of this segment.
Word string
// StartMs is the start time in milliseconds from the beginning of the audio.
StartMs int
// EndMs is the end time in milliseconds from the beginning of the audio.
EndMs int
}
WordTimestamp represents a single word with its timing information. This enables precise synchronization of audio with text, useful for generating subtitles, chapter markers, and analyzing speech patterns.
func EstimateWordTimestamps ¶
func EstimateWordTimestamps(text string, totalDurationMs int, leadingSilenceMs int) []WordTimestamp
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| openai | Package openai provides an OpenAI-compatible Text-to-Speech implementation. |