voice

package
v0.34.4
Published: Mar 11, 2026 License: MIT Imports: 12 Imported by: 0

README

Voice Generation Package

The voice package provides a modular Text-to-Speech (TTS) interface for generating audio from text. It supports both buffered synthesis and streaming modes, compatible with OpenAI's API and local OpenAI-compatible servers like Kokoro-FastAPI.

Features

  • Dual Mode Operation: Buffered synthesis (Synthesize) for short texts, streaming (Stream) for efficient processing of longer content
  • OpenAI Compatible: Works with OpenAI cloud API and local servers (Kokoro, etc.)
  • Dialogue Synthesis: Multi-speaker dialogue generation with natural conversation flow
  • Context-Aware Pacing: Intelligent pause calculation based on dialogue context (questions, exclamations, transitions)
  • Word-Level Timestamps: Captioned synthesis interface for precise synchronization and subtitle generation (where supported)
  • Voice Mixing: Combine multiple voices with weighted ratios for unique character voices
  • Audio Processing: Crossfading, volume normalization (LUFS-style), and zero-crossing optimization
  • Functional Options: Flexible configuration using the functional options pattern
  • Context Support: Proper cancellation and timeout handling
  • Multiple Voices: Support for various voice identifiers per provider

Installation

go get github.com/sevigo/goframe/voice
go get github.com/sevigo/goframe/voice/openai

Usage

Cloud OpenAI
package main

import (
    "context"
    "fmt"
    "log"
    "os"

    "github.com/sevigo/goframe/voice/openai"
)

func main() {
    synthesizer, err := openai.NewSynthesizer(
        openai.WithAPIKey(os.Getenv("OPENAI_API_KEY")),
        openai.WithModel("tts-1"),
        openai.WithVoice("alloy"),
        openai.WithFormat("mp3"),
    )
    if err != nil {
        log.Fatal(err)
    }

    audio, err := synthesizer.Synthesize(context.Background(), "Hello, world!")
    if err != nil {
        log.Fatal(err)
    }

    err = os.WriteFile("output.mp3", audio.Data, 0600)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Audio saved to output.mp3")
}
Local Kokoro Container
package main

import (
    "context"
    "fmt"
    "io"
    "log"
    "os"

    "github.com/sevigo/goframe/voice/openai"
)

func main() {
    // Kokoro runs locally - no API key required
    synthesizer, err := openai.NewSynthesizer(
        openai.WithBaseURL("http://localhost:8880/v1"),
        openai.WithModel("kokoro"),
        openai.WithVoice("af_bella"),
        openai.WithFormat("wav"),
    )
    if err != nil {
        log.Fatal(err)
    }

    // Use streaming for efficient processing
    stream, err := synthesizer.Stream(context.Background(), "Hello from local TTS!")
    if err != nil {
        log.Fatal(err)
    }
    defer stream.Close()

    file, err := os.Create("output.wav")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    written, err := io.Copy(file, stream)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Saved %d bytes to output.wav\n", written)
}
Per-Request Options

Override default settings for individual requests:

audio, err := synthesizer.Synthesize(ctx, text,
    voice.WithVoice("echo"),
    voice.WithModel("tts-1-hd"),
    voice.WithSpeed(1.2),
)

Configuration Options

Synthesizer Options
| Option | Description | Default |
| --- | --- | --- |
| WithAPIKey(key) | OpenAI/compatible API key | None |
| WithBaseURL(url) | API base URL | https://api.openai.com/v1 |
| WithModel(model) | TTS model name | tts-1 |
| WithVoice(voice) | Voice identifier | alloy |
| WithFormat(format) | Output format (mp3, wav, etc.) | mp3 |
| WithSpeed(speed) | Speech speed multiplier (0.25-4.0) | 1.0 |
| WithHTTPClient(client) | Custom HTTP client | Default shared client |
| WithLogger(logger) | Custom structured logger | slog.Default() |
DialogueSynthesizer Settings
type DialogueSynthesizer struct {
    Syn             Synthesizer          // Underlying TTS engine
    VoiceMap        map[string]string    // Speaker -> voice ID mapping
    SpeedMap        map[string]float64   // Speaker -> speed multiplier
    Format          string               // Output format ("wav" recommended)
    CrossfadeMs     int                  // Crossfade duration (default: 50)
    PauseMsMin      int                  // Minimum pause (default: 200)
    PauseMsMax      int                  // Maximum pause (default: 300)
    NormalizeVolume bool                 // Enable volume normalization (default: true)
}
SpeedMap Guidelines

Recommended speed multipliers for different character types:

| Character Type | Speed | Example |
| --- | --- | --- |
| Energetic host | 1.05-1.10 | News anchor, podcast host |
| Normal speaker | 1.00 | Most conversational voices |
| Thoughtful guest | 0.90-0.95 | Expert, professor |
| Slow narrator | 0.85-0.90 | Storyteller, documentary |

Supported Voices

OpenAI
  • alloy, echo, fable, onyx, nova, shimmer
Kokoro (Kokoro-FastAPI)
  • American Female: af_bella, af_sarah, af_sky, af_heart, af_nicole
  • American Male: am_adam, am_michael
  • British Female: bf_emma, bf_isabella
  • British Male: bm_george, bm_lewis
Voice Combinations (Kokoro only)

Mix voices using + notation with optional weights:

// Single voice
voice: "af_bella"

// Equal mix (50%/50%)
voice: "af_bella+af_sky"

// Weighted mix (67%/33%)
voice: "af_bella(2)+af_heart(1)"

// Complex mix (40%/30%/30%)
voice: "af_sky(4)+af_nicole(3)+af_heart(3)"

Combined voices are automatically cached for future use.
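The + notation maps naturally to normalized mixing weights. The helper below is a hypothetical sketch of such parsing, not part of the package API (parseVoiceMix is an invented name):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// parseVoiceMix splits a combination like "af_bella(2)+af_heart(1)"
// into voice names and normalized weights. A bare name gets weight 1.
func parseVoiceMix(spec string) (map[string]float64, error) {
	re := regexp.MustCompile(`^([a-z_]+)(?:\((\d+)\))?$`)
	weights := map[string]float64{}
	total := 0.0
	for _, part := range strings.Split(spec, "+") {
		m := re.FindStringSubmatch(strings.TrimSpace(part))
		if m == nil {
			return nil, fmt.Errorf("invalid voice component: %q", part)
		}
		w := 1.0
		if m[2] != "" {
			n, _ := strconv.Atoi(m[2])
			w = float64(n)
		}
		weights[m[1]] += w
		total += w
	}
	for v := range weights {
		weights[v] /= total // normalize so the ratios sum to 1
	}
	return weights, nil
}

func main() {
	w, _ := parseVoiceMix("af_bella(2)+af_heart(1)")
	fmt.Printf("%.2f %.2f\n", w["af_bella"], w["af_heart"]) // 67%/33% mix
}
```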

Supported Formats

  • wav (recommended for dialogue) - Lossless, supports crossfading and normalization
  • mp3 - Not recommended for dialogue (compression artifacts at transitions)
  • opus - Good for streaming
  • flac - Lossless compression
  • pcm - Raw audio, useful for streaming
Why WAV for Dialogue?

WAV format preserves audio quality through multiple processing steps:

  1. Crossfading between speakers
  2. Volume normalization across different voices
  3. Pause insertion with silence padding
  4. Zero-crossing optimization

Other formats (MP3, Opus) apply lossy compression, which can introduce artifacts when processing the audio multiple times.

Examples

Multi-Speaker Dialogue with Natural Pacing

Generate dialogue with multiple speakers using different voices, with context-aware pause calculation for natural conversation flow:

import "github.com/sevigo/goframe/voice"

synthesizer, _ := openai.NewSynthesizer(
    openai.WithBaseURL("http://localhost:8880/v1"),
    openai.WithModel("kokoro"),
)

dialogueSyn := voice.NewDialogueSynthesizer(synthesizer, map[string]string{
    "Alice": "af_bella",
    "Bob":   "am_adam",
})

// Customize speaker pacing
dialogueSyn.SpeedMap = map[string]float64{
    "Alice": 1.0,  // normal pace
    "Bob":   0.95, // slightly slower, thoughtful
}

dialogue := []voice.DialogueSegment{
    {Speaker: "Alice", Text: "What do you think about Tokyo?"},      // Longer pause (question)
    {Speaker: "Bob", Text: "I think it's amazing!"},                 // Medium pause (exclamation)
    {Speaker: "Alice", Text: "Yeah."},                                // Short pause (brief response)
    {Speaker: "Bob", Text: "And then we went to the station,"},       // Very short pause (continuing thought)
    {Speaker: "Alice", Text: "and the train was already there."},    // Flows naturally
}

stream, _ := dialogueSyn.StreamDialogue(ctx, dialogue)
defer stream.Close()
io.Copy(outputFile, stream)
Context-Aware Pause Calculation

The dialogue synthesizer automatically adjusts pauses based on conversational context:

| Context | Pause Multiplier | Reason |
| --- | --- | --- |
| Questions (?) | 1.3x | Listener needs time to process |
| Exclamations (!) | 1.2x | Emotional impact |
| Ellipsis (...) | 1.4x | Thoughtful/pensive |
| Em dashes (—) | 1.5x | Dramatic interruption |
| Commas (,) | 0.7x | Continuing same thought |
| Short responses (≤3 words) | 0.6x | Quick back-and-forth |
| Long sentences (>20 words) | 1.2x | Complex idea, needs processing |
| Transition words ("so", "well") | 1.1x | Slight pause for transition |
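As a rough illustration, these multipliers can be approximated with a small string heuristic. The sketch below is simplified relative to the package's actual logic (see the PauseMult* constants) and covers only a few of the cases:

```go
package main

import (
	"fmt"
	"strings"
)

// pauseMultiplier approximates the context-aware pause adjustment
// applied after the given segment text. Simplified illustration only.
func pauseMultiplier(text string) float64 {
	t := strings.TrimSpace(text)
	mult := 1.0
	switch {
	case strings.HasSuffix(t, "..."):
		mult = 1.4 // thoughtful/pensive
	case strings.HasSuffix(t, "?"):
		mult = 1.3 // listener needs time to process
	case strings.HasSuffix(t, "!"):
		mult = 1.2 // emotional impact
	case strings.HasSuffix(t, ","):
		mult = 0.7 // continuing the same thought
	}
	if len(strings.Fields(t)) <= 3 {
		mult *= 0.6 // quick back-and-forth
	}
	return mult
}

func main() {
	fmt.Println(pauseMultiplier("What do you think about Tokyo?")) // question
	fmt.Println(pauseMultiplier("Yeah."))                          // short response
}
```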

Benefits:

  • ✅ Podcast-quality natural flow
  • ✅ No robotic fixed-pause timing
  • ✅ Questions sound like questions
  • ✅ Emotional responses have appropriate pauses
  • ✅ Quick back-and-forth feels conversational
Voice Mixing for Unique Characters

Combine multiple voices with weighted ratios:

dialogueSyn := voice.NewDialogueSynthesizer(synthesizer, map[string]string{
    "Narrator": "af_bella(3)+af_heart(1)",  // 75% bella, 25% heart
    "Host":     "af_sky(2)+af_nicole(1)",   // 67% sky, 33% nicole
    "Guest":    "am_adam",                  // pure voice
})
Parallel Dialogue Synthesis with Concurrency Limiting

For faster generation, synthesize all segments in parallel with controlled concurrency:

// Unlimited concurrency (may overwhelm API for large dialogues)
stream, err := dialogueSyn.StreamDialogueParallel(ctx, dialogue)

// Limited concurrency (recommended for large dialogues)
// Process 10 segments at a time to avoid overwhelming the API
stream, err := dialogueSyn.StreamDialogueParallelWithLimit(ctx, dialogue, 10)

Concurrency Recommendations:

  • Kokoro-FastAPI (local): 5-10 concurrent requests
  • OpenAI API: 3-5 concurrent requests (rate limits)
  • Large dialogues (>50 segments): Always use limiting
  • Small dialogues (<10 segments): Unlimited is fine
Audio Processing Features

The DialogueSynthesizer includes professional audio processing:

  1. Crossfading (default: 50ms): Smooth transitions between speakers using equal-power curves
  2. Volume Normalization: Peak normalization to 95% for consistent loudness across voices
  3. Zero-Crossing Optimization: Minimizes clicks at splice points
  4. Configurable Pauses: Set custom pause ranges:
dialogueSyn.PauseMsMin = 150  // minimum pause between segments
dialogueSyn.PauseMsMax = 350  // maximum pause (randomized for naturalness)
dialogueSyn.CrossfadeMs = 50  // crossfade duration
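Equal-power crossfading uses cosine/sine gain curves so the two gains always satisfy gOut² + gIn² = 1, which keeps perceived loudness constant through the transition. A simplified sketch over raw 16-bit samples (the package itself operates on WAV data):

```go
package main

import (
	"fmt"
	"math"
)

// crossfade blends the tail of a into the head of b over n samples
// using equal-power (cos/sin) gain curves.
func crossfade(a, b []int16, n int) []int16 {
	out := make([]int16, 0, len(a)+len(b)-n)
	out = append(out, a[:len(a)-n]...)
	for i := 0; i < n; i++ {
		t := float64(i) / float64(n)
		gOut := math.Cos(t * math.Pi / 2) // fades 1 -> 0
		gIn := math.Sin(t * math.Pi / 2)  // fades 0 -> 1
		mixed := gOut*float64(a[len(a)-n+i]) + gIn*float64(b[i])
		out = append(out, int16(mixed))
	}
	out = append(out, b[n:]...)
	return out
}

func main() {
	a := []int16{1000, 1000, 1000, 1000}
	b := []int16{-1000, -1000, -1000, -1000}
	fmt.Println(len(crossfade(a, b, 2))) // overlap shortens the result by n
}
```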
Streaming with Progress

See examples/kokoro-streaming/main.go for a streaming example with real-time progress:

go run ./examples/kokoro-streaming/main.go
Multiple Voices

See examples/kokoro-tts/main.go for generating audio with multiple voices:

go run ./examples/kokoro-tts/main.go
Dialogue Synthesis (Podcast-Quality)

See examples/kokoro-dialogue/main.go for multi-speaker dialogue generation with context-aware pacing:

# Start Kokoro container first
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest

# Run the example
go run ./examples/kokoro-dialogue/main.go

# Output: dialogue.wav with natural conversation flow
ffplay dialogue.wav

The dialogue example demonstrates:

  • 3 different speakers with unique voice profiles
  • Context-aware pacing (questions, exclamations, transitions)
  • Voice mixing for character variety
  • Per-speaker speed customization
  • Volume normalization across speakers
  • Crossfading for smooth transitions

Testing Without API Credits

For local testing without spending API credits, use Kokoro-FastAPI:

# Start Kokoro container
docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest

# Test with local server
synthesizer, _ := openai.NewSynthesizer(
    openai.WithBaseURL("http://localhost:8880/v1"),
)

Errors

| Error | Description |
| --- | --- |
| ErrAPIKeyRequired | API key required when using default OpenAI endpoint |
| openai: text cannot be empty | Empty input text |
| openai: request failed with status N | HTTP error from API |
| voice: no segments provided | Empty dialogue segment list |
| voice: no voice mapping for speaker X | Speaker not found in VoiceMap |
| voice: data too short for WAV header | Invalid WAV audio data |
| voice: unsupported WAV format N | Non-PCM WAV format |

Architecture

Package Structure
voice/
├── voice.go           # Core interfaces and types
├── dialogue.go        # Dialogue synthesis with context-aware pacing
├── dialogue_test.go   # Tests for pause calculation
└── openai/
    └── openai.go      # OpenAI-compatible TTS implementation
Key Types

voice.Synthesizer - Single-voice TTS interface

  • Synthesize(ctx, text, opts) - Buffer entire audio
  • Stream(ctx, text, opts) - Stream audio chunks

voice.CaptionedSynthesizer - TTS with word-level timestamps (where supported)

  • SynthesizeCaptioned(ctx, text, opts) - Audio + word timing information
  • StreamCaptioned(ctx, text, opts) - Stream audio + timing incrementally

Note: DialogueSynthesizerCaptioned.StreamDialogueCaptioned() is not yet implemented because timestamp-based pause calculation requires buffering all segments. For streaming dialogue synthesis, use DialogueSynthesizer.StreamDialogueParallel() instead.

voice.DialogueSynthesizer - Multi-speaker dialogue synthesis

  • StreamDialogue(ctx, segments) - Sequential synthesis
  • StreamDialogueParallel(ctx, segments) - Parallel synthesis (faster)
  • SynthesizeDialogue(ctx, segments) - Return individual segments

voice.DialogueSegment - Single speaker's line

type DialogueSegment struct {
    Speaker string
    Text    string
}

voice.CaptionedAudio - Audio with word timestamps

type CaptionedAudio struct {
    Data       []byte            // Raw audio bytes
    Format     string            // Audio format
    Timestamps []WordTimestamp   // Word-level timing
    DurationMs int               // Total duration
}

Advanced Features

Word-Level Timestamps (Kokoro-FastAPI)

For providers that support captioned synthesis (currently Kokoro-FastAPI), you can get precise word timing:

// Check if synthesizer supports captions
if cs, ok := synthesizer.(voice.CaptionedSynthesizer); ok {
    audio, err := cs.SynthesizeCaptioned(ctx, "Hello world", opts...)
    if err != nil {
        log.Fatal(err)
    }

    // Use timestamps for precise timing
    for _, ts := range audio.Timestamps {
        fmt.Printf("%d-%dms: %s\n", ts.StartMs, ts.EndMs, ts.Word)
    }
}

Benefits:

  • ✅ Generate subtitles (SRT, VTT) automatically
  • ✅ Calculate exact pauses between dialogue turns
  • ✅ Analyze speech rate per speaker
  • ✅ Create chapter markers for podcasts
  • ✅ Perfect synchronization for background audio

Limitation: Captioned dialogue synthesis (DialogueSynthesizerCaptioned) does not support streaming. This is because calculating perfect pauses requires buffering all segments to detect built-in silence. For streaming dialogue, use heuristic-based DialogueSynthesizer instead.

Subtitle Generation
// Convert captions to SRT format
func generateSRT(segments []voice.CaptionedAudio) string {
    var srt strings.Builder
    index := 1
    timeOffset := 0

    for _, seg := range segments {
        for _, ts := range seg.Timestamps {
            start := formatSRTTime(timeOffset + ts.StartMs)
            end := formatSRTTime(timeOffset + ts.EndMs)
            srt.WriteString(fmt.Sprintf("%d\n%s --> %s\n%s\n\n",
                index, start, end, ts.Word))
            index++
        }
        timeOffset += seg.DurationMs
    }
    return srt.String()
}
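The formatSRTTime helper referenced above is left undefined. SRT timestamps use the HH:MM:SS,mmm form, so one possible implementation is:

```go
package main

import "fmt"

// formatSRTTime converts a millisecond offset to SRT's HH:MM:SS,mmm form.
func formatSRTTime(ms int) string {
	h := ms / 3600000
	m := ms / 60000 % 60
	s := ms / 1000 % 60
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms%1000)
}

func main() {
	fmt.Println(formatSRTTime(3723456)) // 1h 2m 3s 456ms
}
```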

Performance

Dialogue Synthesis Performance
  • Sequential: Processes segments one at a time (~300-600ms per segment)
  • Parallel: Synthesizes all segments concurrently, then assembles in order
    • 3 segments: ~3x faster than sequential
    • 10 segments: ~10x faster than sequential
    • Overhead: ~50-100ms for assembly (crossfading, pauses)
  • Captioned Synthesis: Adds ~10-20ms overhead for timestamp generation
Memory Usage
  • WAV format: Requires buffering entire segment for crossfading
  • Non-WAV formats: Streams directly without buffering
  • Peak memory: ~2MB per minute of audio (WAV)
  • Captioned mode: Additional ~1KB per 100 words for timestamp storage
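The ~2MB per minute figure is consistent with, for example, 16 kHz mono 16-bit PCM; the actual sample rate depends on the provider and is an assumption here:

```go
package main

import "fmt"

// wavBytesPerMinute computes raw PCM size for one minute of audio.
func wavBytesPerMinute(sampleRate, bytesPerSample int) int {
	return sampleRate * bytesPerSample * 60
}

func main() {
	// Illustrative parameters only (16 kHz mono, 16-bit samples):
	per := wavBytesPerMinute(16000, 2)
	fmt.Printf("%.1f MB/min\n", float64(per)/1e6) // ~1.9 MB, i.e. roughly 2MB
}
```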
Optimization Tips
  1. Use StreamDialogueParallel for dialogue with >3 segments
  2. Use wav format for best quality with crossfading
  3. Adjust CrossfadeMs (20-50ms is sufficient for most cases)
  4. Set appropriate SpeedMap values to avoid re-synthesis
  5. Use captioned synthesis for subtitle generation instead of separate processing

Documentation

Overview

Package voice provides interfaces and types for Text-to-Speech synthesis. It defines a modular interface that supports multiple TTS backends.


Index

Constants

const (
	SilenceThresholdDefault = 500
	SilenceWindowMs         = 10
)
const (
	PauseMultQuestion     = 1.3
	PauseMultExclamation  = 1.2
	PauseMultEllipsis     = 1.4
	PauseMultDash         = 1.5
	PauseMultComma        = 0.7
	PauseMultShortResp    = 0.6
	PauseMultLongSentence = 1.2
	PauseMultWait         = 1.3
	PauseMultContinuation = 0.8
	PauseMultTransition   = 1.1
	PauseMultEmotional    = 1.2
	PauseMultSameSpeaker  = 0.25
	PauseMultMin          = 0.5
	PauseMultMax          = 1.8
	PauseMultInterruption = -0.3

	RoomToneAmplitude    = 30
	RoomToneMaxPauseMs   = 3000
	MaxTrailingSilenceMs = 500
	MaxLeadingSilenceMs  = 300
	MinPauseMs           = 50
)

Pause multipliers for context-aware dialogue pacing. These values are modeled on natural speech patterns to produce natural-sounding conversation flow.

Variables

This section is empty.

Functions

func AnalyzeSpeechRate

func AnalyzeSpeechRate(segments []CaptionedSegment) float64

AnalyzeSpeechRate calculates words per minute for a speaker. This enables automatic speed adjustment for consistent pacing.
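The calculation itself only needs word counts and speech durations. A hypothetical sketch (the struct mirrors just the relevant CaptionedSegment fields; the function is illustrative, not the package's implementation):

```go
package main

import "fmt"

// segment mirrors only the fields needed for rate analysis.
type segment struct {
	WordCount        int
	SpeechDurationMs int
}

// speechRate returns words per minute across all segments.
func speechRate(segs []segment) float64 {
	words, ms := 0, 0
	for _, s := range segs {
		words += s.WordCount
		ms += s.SpeechDurationMs
	}
	if ms == 0 {
		return 0
	}
	return float64(words) / (float64(ms) / 60000.0)
}

func main() {
	segs := []segment{{WordCount: 30, SpeechDurationMs: 12000}} // 30 words in 12s
	fmt.Println(speechRate(segs))
}
```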

func AnalyzeWAVAudio

func AnalyzeWAVAudio(data []byte) (int, int, int, error)

func ComputeWAVDuration

func ComputeWAVDuration(data []byte, info wavInfo) int

func DetectLeadingSilence

func DetectLeadingSilence(data []byte, info wavInfo, threshold ...int16) int

func DetectTrailingSilence

func DetectTrailingSilence(data []byte, info wavInfo, threshold ...int16) int
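Trailing-silence detection typically scans backwards for samples whose absolute amplitude stays below the threshold. A simplified sketch over raw 16-bit samples, rather than the package's WAV-header-aware form:

```go
package main

import "fmt"

// trailingSilenceMs counts how long the signal stays below threshold
// at the end of a 16-bit PCM buffer.
func trailingSilenceMs(samples []int16, sampleRate int, threshold int16) int {
	n := 0
	for i := len(samples) - 1; i >= 0; i-- {
		v := samples[i]
		if v < 0 {
			v = -v
		}
		if v >= threshold { // first loud sample ends the silent run
			break
		}
		n++
	}
	return n * 1000 / sampleRate
}

func main() {
	s := make([]int16, 16000) // 1s at 16kHz, all zero (silent)
	s[7999] = 2000            // last loud sample at the 0.5s mark
	fmt.Println(trailingSilenceMs(s, 16000, 500))
}
```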

func ValidateWAVConsistency

func ValidateWAVConsistency(segments [][]byte, expectedInfo wavInfo) error

Types

type Audio

type Audio struct {
	// Data contains the raw audio bytes.
	Data []byte
	// Format specifies the audio format (e.g., "mp3", "wav", "opus").
	Format string
}

Audio represents synthesized audio data with its format.

type CaptionedAudio

type CaptionedAudio struct {
	// Data contains the raw audio bytes.
	Data []byte
	// Format specifies the audio format (e.g., "mp3", "wav", "opus").
	Format string
	// Timestamps contains word-level timing information.
	// Words are ordered chronologically as they appear in the audio.
	Timestamps []WordTimestamp
	// DurationMs is the total duration of the audio in milliseconds.
	// This is the end time of the last word plus any trailing silence.
	DurationMs int
}

CaptionedAudio represents synthesized audio with word-level timestamps. This provides precise timing information for each word, enabling advanced features like subtitle generation, speech analysis, and perfect synchronization.

type CaptionedChunk

type CaptionedChunk struct {
	// Audio contains base64-encoded audio data for this chunk.
	Audio string `json:"audio"`
	// Timestamps contains word-level timing for words in this chunk.
	Timestamps []WordTimestamp `json:"timestamps"`
}

CaptionedChunk represents a single chunk from a captioned stream. It contains both audio data (base64 encoded) and word timestamps for incremental processing during streaming synthesis.

type CaptionedDialogueResult

type CaptionedDialogueResult struct {
	// Audio is the complete dialogue audio.
	Audio []byte
	// Format is the audio format (e.g., "wav").
	Format string
	// Segments contains timing information for each segment.
	Segments []CaptionedSegment
	// TotalDurationMs is the total dialogue duration in milliseconds.
	TotalDurationMs int
	// Subtitles is the SRT-format subtitle string, if enabled.
	Subtitles string
}

CaptionedDialogueResult contains the synthesis output with timing information.

type CaptionedSegment

type CaptionedSegment struct {
	// Speaker is the segment speaker.
	Speaker string
	// Text is the spoken text.
	Text string
	// Audio is the segment audio data.
	Audio []byte
	// Timestamps contains word-level timing.
	Timestamps []WordTimestamp
	// StartMs is when this segment starts in the full dialogue.
	StartMs int
	// EndMs is when this segment ends in the full dialogue.
	EndMs int
	// DurationMs is the total segment duration including trailing silence.
	DurationMs int
	// SpeechDurationMs is the actual speech duration without trailing silence.
	SpeechDurationMs int
	// TrailingSilenceMs is the silence at the end of the audio.
	TrailingSilenceMs int
	// LeadingSilenceMs is the silence at the start of the audio.
	LeadingSilenceMs int
}

CaptionedSegment represents one speaker's segment with timing details.

type CaptionedSynthesizer

type CaptionedSynthesizer interface {
	Synthesizer

	// SynthesizeCaptioned generates audio from text with word-level timestamps.
	// This is similar to Synthesize but returns timing information for each word,
	// enabling precise synchronization and analysis.
	//
	// The returned CaptionedAudio contains both the audio data and timestamps.
	// Not all TTS providers support this feature.
	//
	// Example:
	//
	//	audio, err := synth.SynthesizeCaptioned(ctx, "Hello world", opts...)
	//	for _, ts := range audio.Timestamps {
	//	    fmt.Printf("%d-%dms: %s\n", ts.StartMs, ts.EndMs, ts.Word)
	//	}
	SynthesizeCaptioned(ctx context.Context, text string, opts ...Option) (*CaptionedAudio, error)

	// StreamCaptioned generates audio from text with timestamps streamed incrementally.
	// Each chunk contains a JSON object with "audio" (base64) and "timestamps" fields.
	// This is useful for long texts where you want to process audio and timing
	// information as it's generated, rather than waiting for complete synthesis.
	//
	// The returned ReadCloser streams JSON objects, one per chunk.
	//
	// Example:
	//
	//	stream, err := synth.StreamCaptioned(ctx, longText, opts...)
	//	defer stream.Close()
	//	decoder := json.NewDecoder(stream)
	//	for {
	//	    var chunk CaptionedChunk
	//	    if err := decoder.Decode(&chunk); err != nil {
	//	        if err == io.EOF { break }
	//	        return err
	//	    }
	//	    // Process chunk.Audio and chunk.Timestamps
	//	}
	StreamCaptioned(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
}

CaptionedSynthesizer extends the basic Synthesizer interface with timestamp-aware synthesis capabilities.

Implementations that support word-level timestamps (like Kokoro-FastAPI) should implement this interface in addition to Synthesizer. This enables more sophisticated audio processing like:

  • Exact pause calculation based on actual speech duration
  • Automatic subtitle generation (SRT, VTT)
  • Speech rate analysis and normalization
  • Perfect synchronization for background music/ambience
  • Quality control for podcast production

type DialogueSegment

type DialogueSegment struct {
	// Speaker identifies who is speaking (e.g., "Alice", "Bob", "Narrator").
	// Must match a key in the DialogueSynthesizer.VoiceMap.
	Speaker string
	// Text is the content spoken by this speaker.
	// Punctuation and sentence structure affect pause timing between segments.
	Text string
}

DialogueSegment represents a single speaker's line in a multi-speaker dialogue. Each segment specifies who is speaking (Speaker) and what they say (Text), which allows the synthesizer to select the appropriate voice and apply context-aware pacing based on the conversation flow.

type DialogueSynthesizer

type DialogueSynthesizer struct {
	// Syn is the underlying TTS engine used to generate audio for each segment.
	Syn Synthesizer
	// VoiceMap maps speaker names to voice identifiers.
	// For OpenAI: alloy, echo, fable, onyx, nova, shimmer.
	// For Kokoro: af_bella, af_sky, am_adam, etc.
	// Use "+" for voice mixing: "af_bella(3)+af_heart(1)" for 75%/25% mix.
	VoiceMap map[string]string
	// SpeedMap maps speaker names to speech speed multipliers.
	// Values typically range from 0.8 to 1.2, where 1.0 is normal speed.
	// Speakers not in the map default to 1.0.
	// Use higher values for energetic speakers, lower for thoughtful speakers.
	SpeedMap map[string]float64
	// Format specifies the output audio format (e.g., "wav", "mp3").
	// Default is "wav" for better concatenation support and crossfade quality.
	// Note: Compressed formats (mp3, opus) may introduce artifacts during processing.
	Format string
	// CrossfadeMs specifies crossfade duration in milliseconds (default: 50).
	// Higher values (80-100ms) create smoother transitions but may reduce clarity.
	// Lower values (20-40ms) are faster but may sound abrupt on speaker changes.
	// Set to 0 to disable crossfading (useful for compressed formats).
	// Note: Crossfading requires buffering segments in memory.
	CrossfadeMs int
	// PauseMsMin specifies minimum pause duration between segments in milliseconds (default: 200).
	// This is the base pause duration before context-aware adjustments.
	// Set both PauseMsMin and PauseMsMax to 0 to disable pauses between segments.
	// Recommended: 150-250ms for natural conversation, 300-500ms for dramatic effect.
	PauseMsMin int
	// PauseMsMax specifies maximum pause duration between segments in milliseconds (default: 300).
	// A random value between PauseMsMin and PauseMsMax provides natural variation.
	// Context-aware adjustments (questions, exclamations, etc.) can extend beyond this maximum
	// by up to 2x to accommodate natural speech patterns.
	PauseMsMax int
	// NormalizeVolume enables peak volume normalization per segment (default: true).
	// Ensures consistent loudness across different speakers/voices, which is critical
	// for dialogue where different native volumes could be jarring.
	// Normalization targets 95% of maximum amplitude to avoid clipping while maintaining
	// consistent perceived volume across all speakers.
	NormalizeVolume bool
}

DialogueSynthesizer generates audio for multi-speaker dialogues with natural conversation flow. It maps speakers to voice IDs and synthesizes each segment with the appropriate voice, then concatenates the results into a single audio stream with intelligent pacing.

The synthesizer automatically adjusts pause durations based on conversational context:

  • Questions and exclamations get longer pauses for processing time
  • Comma-terminated segments get shorter pauses as thoughts continue
  • Short responses get minimal pauses for quick back-and-forth
  • Transition words ("so", "well") get appropriate pauses

Audio is processed with crossfading between speakers and volume normalization to ensure consistent loudness across different voices.

Example:

syn, _ := openai.NewSynthesizer(openai.WithBaseURL("http://localhost:8880/v1"))
ds := voice.NewDialogueSynthesizer(syn, map[string]string{
    "Alice": "af_bella",
    "Bob":   "am_adam",
})
ds.SpeedMap = map[string]float64{
    "Bob": 0.95, // Bob speaks slightly slower
}
stream, _ := ds.StreamDialogue(ctx, []voice.DialogueSegment{
    {Speaker: "Alice", Text: "What do you think?"},
    {Speaker: "Bob", Text: "I think it's great!"},
})

func NewDialogueSynthesizer

func NewDialogueSynthesizer(syn Synthesizer, voiceMap map[string]string, format ...string) *DialogueSynthesizer

NewDialogueSynthesizer creates a new dialogue synthesizer for multi-speaker audio generation. The synthesizer applies context-aware pacing, crossfading, and volume normalization to create natural-sounding dialogues.

The format defaults to "wav" which supports reliable concatenation and processing. For dialogue synthesis, WAV is recommended over compressed formats like MP3 to avoid quality degradation through multiple processing steps.

Default settings:

  • CrossfadeMs: 50ms (smooth transitions between speakers)
  • PauseMsMin: 200ms (minimum pause between segments)
  • PauseMsMax: 300ms (maximum pause, randomized for naturalness)
  • NormalizeVolume: true (consistent loudness across voices)

Example:

ds := voice.NewDialogueSynthesizer(synthesizer, map[string]string{
    "Alice": "af_bella",
    "Bob":   "am_adam",
})
ds.SpeedMap = map[string]float64{
    "Alice": 1.05, // slightly faster
    "Bob":   0.95, // slightly slower
}

func (*DialogueSynthesizer) StreamDialogue

func (ds *DialogueSynthesizer) StreamDialogue(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)

StreamDialogue generates audio for all segments and streams them as a single concatenated audio stream. Segments are synthesized sequentially with context-aware pauses and crossfading between speakers.

The returned io.ReadCloser streams the complete dialogue audio. The caller must close the ReadCloser when done reading.

Example:

stream, err := ds.StreamDialogue(ctx, []voice.DialogueSegment{
    {Speaker: "Alice", Text: "Hello?"},
    {Speaker: "Bob", Text: "Hi there!"},
})
defer stream.Close()
io.Copy(outputFile, stream)

func (*DialogueSynthesizer) StreamDialogueParallel

func (ds *DialogueSynthesizer) StreamDialogueParallel(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)

StreamDialogueParallel generates audio for segments in parallel and streams them in order. This is significantly faster than sequential synthesis for dialogues with many segments, as all segments are synthesized concurrently, then assembled in order with proper crossfading and pause timing.

Concurrency is unlimited, which may overwhelm the API for large dialogues; use StreamDialogueParallelWithLimit to cap how many segments are synthesized simultaneously. For Kokoro-FastAPI, a limit of 5-10 works well; for OpenAI, use lower values (3-5) due to rate limits.

Returns a ReadCloser that streams the concatenated audio with natural transitions.

func (*DialogueSynthesizer) StreamDialogueParallelWithLimit

func (ds *DialogueSynthesizer) StreamDialogueParallelWithLimit(ctx context.Context, segments []DialogueSegment, concurrencyLimit int) (io.ReadCloser, error)

StreamDialogueParallelWithLimit generates audio with controlled concurrency. Use this for large dialogues or when dealing with rate-limited APIs. ConcurrencyLimit of 5-10 is recommended for most use cases.

func (*DialogueSynthesizer) SynthesizeDialogue

func (ds *DialogueSynthesizer) SynthesizeDialogue(ctx context.Context, segments []DialogueSegment) ([]*Audio, error)

SynthesizeDialogue generates audio for all segments and returns individual audio files. This is useful when you want to process each speaker's audio separately, apply custom audio processing, or store segments individually.

Returns a slice of Audio objects, one per segment, in the same order as the input.

type DialogueSynthesizerCaptioned

type DialogueSynthesizerCaptioned struct {
	// Syn is the captioned synthesizer used to generate audio with timestamps.
	Syn CaptionedSynthesizer
	// VoiceMap maps speaker names to voice identifiers.
	VoiceMap map[string]string
	// SpeedMap maps speaker names to speech speed multipliers.
	// Values typically range from 0.8 to 1.2, where 1.0 is normal speed.
	SpeedMap map[string]float64
	// Format specifies the output audio format (e.g., "wav", "mp3").
	// Default is "wav" for best quality with crossfading.
	Format string
	// CrossfadeMs specifies crossfade duration in milliseconds (default: 50).
	// Set to 0 to disable crossfading.
	CrossfadeMs int
	// TargetPauseMs is the target pause between segments (default: 250).
	// This is the desired gap between the END of one speech and START of the next.
	TargetPauseMs int
	// NormalizeVolume enables peak volume normalization per segment (default: true).
	NormalizeVolume bool
	// GenerateSubtitles enables automatic subtitle generation (default: true).
	// When enabled, returns both audio and SRT-format subtitles.
	GenerateSubtitles bool
}

DialogueSynthesizerCaptioned generates multi-speaker dialogue with perfect timing using word-level timestamps from captioned synthesis.

This synthesizer provides superior dialogue quality compared to DialogueSynthesizer by using actual speech duration and timing information instead of heuristics. It eliminates problems like:

  • Double-pausing (built-in silence + added silence)
  • Cutting words during crossfade
  • Inconsistent speech rates between speakers
  • Manual subtitle timing

Requirements: The underlying synthesizer must implement CaptionedSynthesizer interface. Compatible providers: Kokoro-FastAPI (with /dev/captioned_speech endpoint).

func NewDialogueSynthesizerCaptioned

func NewDialogueSynthesizerCaptioned(syn CaptionedSynthesizer, voiceMap map[string]string, format ...string) (*DialogueSynthesizerCaptioned, error)

NewDialogueSynthesizerCaptioned creates a new captioned dialogue synthesizer for multi-speaker audio generation. The synthesizer uses word-level timestamps for perfect pause calculation and optional subtitle generation.

Prerequisites: The synthesizer parameter must implement CaptionedSynthesizer. This is supported by Kokoro-FastAPI and similar providers with timestamp capabilities.

The format defaults to "wav" which preserves quality through multiple processing steps. For subtitle generation and timestamp analysis, WAV is strongly recommended.

Returns an error if the synthesizer is nil or voiceMap is empty.

Example:

syn, _ := openai.NewSynthesizer(openai.WithBaseURL("http://localhost:8880/v1"))
ds, err := voice.NewDialogueSynthesizerCaptioned(syn, map[string]string{
    "Alice": "af_bella",
    "Bob":   "am_adam",
})

func (*DialogueSynthesizerCaptioned) CalculatePerfectPause

func (ds *DialogueSynthesizerCaptioned) CalculatePerfectPause(prev, curr *CaptionedSegment) int

CalculatePerfectPause calculates the exact pause needed between two segments. It uses word-level timestamps to avoid double-pausing and applies context-aware adjustments based on dialogue content.

func (*DialogueSynthesizerCaptioned) GenerateSRT

func (ds *DialogueSynthesizerCaptioned) GenerateSRT(segments []CaptionedSegment) string

GenerateSRT creates SRT-format subtitles from captioned segments. This automatically generates perfectly timed subtitles without manual adjustment. This is a convenience method that wraps the internal generateSRT function.

func (*DialogueSynthesizerCaptioned) GenerateSRTWithSpeakers

func (ds *DialogueSynthesizerCaptioned) GenerateSRTWithSpeakers(segments []CaptionedSegment) string

GenerateSRTWithSpeakers creates SRT-format subtitles with speaker labels. Each subtitle line includes "[Speaker]: word" format, useful for multi-speaker content.

func (*DialogueSynthesizerCaptioned) StreamDialogueCaptioned

func (ds *DialogueSynthesizerCaptioned) StreamDialogueCaptioned(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)

StreamDialogueCaptioned streams dialogue with timestamps, calculating perfect pauses on-the-fly using speech duration information.

IMPLEMENTATION STATUS: This method is currently a stub and returns an error. Streaming captioned dialogue requires buffering segments anyway to calculate perfect pauses, so there's no significant benefit over SynthesizeDialogueCaptioned.

FUTURE WORK: If streaming is needed for very long dialogues, consider:

  • Using a heuristic pause calculation instead of perfect pause
  • Buffering N segments ahead for pause calculation while streaming
  • Using a separate goroutine for synthesis and another for assembly

For now, use SynthesizeDialogueCaptioned which provides the full feature set.

func (*DialogueSynthesizerCaptioned) SynthesizeDialogueCaptioned

func (ds *DialogueSynthesizerCaptioned) SynthesizeDialogueCaptioned(ctx context.Context, segments []DialogueSegment) (*CaptionedDialogueResult, error)

SynthesizeDialogueCaptioned generates dialogue with perfect timing using timestamps. This method provides superior audio quality by:

  • Calculating exact pauses from actual speech duration
  • Avoiding double-pausing (built-in silence + added silence)
  • Crossfading at word boundaries instead of random positions
  • Generating subtitles automatically (if enabled)

Returns complete dialogue audio and detailed timing information for each segment.

type EstimatedCaptionedSynthesizer

type EstimatedCaptionedSynthesizer struct {
	Syn             Synthesizer
	Format          string
	SilentThreshold int16
}

func NewEstimatedCaptionedSynthesizer

func NewEstimatedCaptionedSynthesizer(syn Synthesizer, format ...string) *EstimatedCaptionedSynthesizer

func (*EstimatedCaptionedSynthesizer) Stream

func (e *EstimatedCaptionedSynthesizer) Stream(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)

func (*EstimatedCaptionedSynthesizer) StreamCaptioned

func (e *EstimatedCaptionedSynthesizer) StreamCaptioned(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)

func (*EstimatedCaptionedSynthesizer) Synthesize

func (e *EstimatedCaptionedSynthesizer) Synthesize(ctx context.Context, text string, opts ...Option) (*Audio, error)

func (*EstimatedCaptionedSynthesizer) SynthesizeCaptioned

func (e *EstimatedCaptionedSynthesizer) SynthesizeCaptioned(ctx context.Context, text string, opts ...Option) (*CaptionedAudio, error)

func (*EstimatedCaptionedSynthesizer) WithSilentThreshold

func (e *EstimatedCaptionedSynthesizer) WithSilentThreshold(threshold int16) *EstimatedCaptionedSynthesizer

type FormatMismatchError

type FormatMismatchError struct {
	SegmentIndex int
	Property     string
	Expected     int
	Actual       int
}

func (*FormatMismatchError) Error

func (e *FormatMismatchError) Error() string

type Option

type Option func(*SynthesizeOptions)

Option is a functional option for configuring synthesis parameters.

func WithFormat

func WithFormat(format string) Option

WithFormat sets the output audio format.

func WithModel

func WithModel(model string) Option

WithModel sets the TTS model for synthesis.

func WithSpeed

func WithSpeed(speed float64) Option

WithSpeed sets the speech speed multiplier. Valid range is 0.25 to 4.0, where 1.0 is normal speed.

func WithVoice

func WithVoice(voice string) Option

WithVoice sets the voice identifier for synthesis.

type SynthesizeOptions

type SynthesizeOptions struct {
	// Model specifies the TTS model to use (e.g., "tts-1", "kokoro").
	Model string
	// Voice specifies the voice identifier (e.g., "alloy", "af_bella").
	Voice string
	// Format specifies the output audio format (e.g., "mp3", "wav").
	Format string
	// Speed specifies the speech speed (0.25 to 4.0, where 1.0 is normal).
	Speed float64
}

SynthesizeOptions configures text-to-speech synthesis parameters.

type Synthesizer

type Synthesizer interface {
	// Synthesize generates audio from text and returns the complete audio data.
	// Use this for shorter texts where buffering the entire response is acceptable.
	Synthesize(ctx context.Context, text string, opts ...Option) (*Audio, error)

	// Stream generates audio from text and returns a stream for reading audio chunks.
	// Use this for longer texts or when you want to process audio as it arrives.
	// The caller is responsible for closing the returned ReadCloser.
	Stream(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
}

Synthesizer is the interface for Text-to-Speech providers. Implementations convert text into audio data, supporting both buffered synthesis and streaming modes.

type WordTimestamp

type WordTimestamp struct {
	// Word is the text content of this segment.
	Word string
	// StartMs is the start time in milliseconds from the beginning of the audio.
	StartMs int
	// EndMs is the end time in milliseconds from the beginning of the audio.
	EndMs int
}

WordTimestamp represents a single word with its timing information. This enables precise synchronization of audio with text, useful for generating subtitles, chapter markers, and analyzing speech patterns.

func EstimateWordTimestamps

func EstimateWordTimestamps(text string, totalDurationMs int, leadingSilenceMs int) []WordTimestamp

Directories

Path Synopsis
Package openai provides an OpenAI-compatible Text-to-Speech implementation.
