Documentation ¶
Overview ¶
Package voice provides interfaces and types for Text-to-Speech synthesis. It defines a modular interface that supports multiple TTS backends.
Index ¶
- Constants
- func AnalyzeSpeechRate(segments []CaptionedSegment) float64
- func AnalyzeWAVAudio(data []byte) (int, int, int, error)
- func ComputeWAVDuration(data []byte, info wavInfo) int
- func DetectLeadingSilence(data []byte, info wavInfo, threshold ...int16) int
- func DetectTrailingSilence(data []byte, info wavInfo, threshold ...int16) int
- func ValidateWAVConsistency(segments [][]byte, expectedInfo wavInfo) error
- type Audio
- type CaptionedAudio
- type CaptionedChunk
- type CaptionedDialogueResult
- type CaptionedSegment
- type CaptionedSynthesizer
- type DialogueSegment
- type DialogueSynthesizer
- func (ds *DialogueSynthesizer) StreamDialogue(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
- func (ds *DialogueSynthesizer) StreamDialogueParallel(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
- func (ds *DialogueSynthesizer) StreamDialogueParallelWithLimit(ctx context.Context, segments []DialogueSegment, concurrencyLimit int) (io.ReadCloser, error)
- func (ds *DialogueSynthesizer) SynthesizeDialogue(ctx context.Context, segments []DialogueSegment) ([]*Audio, error)
- type DialogueSynthesizerCaptioned
- func (ds *DialogueSynthesizerCaptioned) CalculatePerfectPause(prev, curr *CaptionedSegment) int
- func (ds *DialogueSynthesizerCaptioned) GenerateSRT(segments []CaptionedSegment) string
- func (ds *DialogueSynthesizerCaptioned) GenerateSRTWithSpeakers(segments []CaptionedSegment) string
- func (ds *DialogueSynthesizerCaptioned) StreamDialogueCaptioned(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
- func (ds *DialogueSynthesizerCaptioned) SynthesizeDialogueCaptioned(ctx context.Context, segments []DialogueSegment) (*CaptionedDialogueResult, error)
- type EstimatedCaptionedSynthesizer
- func (e *EstimatedCaptionedSynthesizer) Stream(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
- func (e *EstimatedCaptionedSynthesizer) StreamCaptioned(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
- func (e *EstimatedCaptionedSynthesizer) Synthesize(ctx context.Context, text string, opts ...Option) (*Audio, error)
- func (e *EstimatedCaptionedSynthesizer) SynthesizeCaptioned(ctx context.Context, text string, opts ...Option) (*CaptionedAudio, error)
- func (e *EstimatedCaptionedSynthesizer) WithSilentThreshold(threshold int16) *EstimatedCaptionedSynthesizer
- type FormatMismatchError
- type Option
- type SynthesizeOptions
- type Synthesizer
- type WordTimestamp
Constants ¶
const (
	SilenceThresholdDefault = 500
	SilenceWindowMs         = 10
)
const (
	PauseMultQuestion     = 1.3
	PauseMultExclamation  = 1.2
	PauseMultEllipsis     = 1.4
	PauseMultDash         = 1.5
	PauseMultComma        = 0.7
	PauseMultShortResp    = 0.6
	PauseMultLongSentence = 1.2
	PauseMultWait         = 1.3
	PauseMultContinuation = 0.8
	PauseMultTransition   = 1.1
	PauseMultEmotional    = 1.2
	PauseMultSameSpeaker  = 0.25
	PauseMultMin          = 0.5
	PauseMultMax          = 1.8
	PauseMultInterruption = -0.3
	RoomToneAmplitude     = 30
	RoomToneMaxPauseMs    = 3000
	MaxTrailingSilenceMs  = 500
	MaxLeadingSilenceMs   = 300
	MinPauseMs            = 50
)
Pause multipliers for context-aware dialogue pacing. These values reflect natural speech patterns and are tuned to produce a smooth, natural-sounding conversation flow.
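To illustrate how a base pause and these multipliers might combine, here is a minimal sketch. The constant values are copied from the block above; the idea of multiplying combined context multipliers and clamping the result is an assumption about the package's internals, not its actual code:

```go
package main

import "fmt"

// Constants mirroring the package's pause multipliers (values from the docs above).
const (
	PauseMultQuestion = 1.3
	PauseMultComma    = 0.7
	PauseMultMin      = 0.5
	PauseMultMax      = 1.8
)

// clampMult keeps a combined pause multiplier inside [PauseMultMin, PauseMultMax]
// before it scales the base pause duration.
func clampMult(m float64) float64 {
	if m < PauseMultMin {
		return PauseMultMin
	}
	if m > PauseMultMax {
		return PauseMultMax
	}
	return m
}

func main() {
	base := 250 // base pause in ms
	// A question followed by a comma-continued thought: multipliers combine.
	m := clampMult(PauseMultQuestion * PauseMultComma)
	fmt.Printf("pause: %dms\n", int(float64(base)*m))
}
```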
Variables ¶
This section is empty.
Functions ¶
func AnalyzeSpeechRate ¶
func AnalyzeSpeechRate(segments []CaptionedSegment) float64
AnalyzeSpeechRate calculates words per minute for a speaker. This enables automatic speed adjustment for consistent pacing.
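A minimal sketch of the words-per-minute calculation this function likely performs, using a local copy of the WordTimestamp type (the exact formula inside AnalyzeSpeechRate is an assumption):

```go
package main

import "fmt"

type WordTimestamp struct {
	Word    string
	StartMs int
	EndMs   int
}

// speechRateWPM divides the word count by the elapsed speech time
// (first word start to last word end) and scales to one minute.
func speechRateWPM(ts []WordTimestamp) float64 {
	if len(ts) == 0 {
		return 0
	}
	speechMs := ts[len(ts)-1].EndMs - ts[0].StartMs
	if speechMs <= 0 {
		return 0
	}
	return float64(len(ts)) / float64(speechMs) * 60000
}

func main() {
	ts := []WordTimestamp{
		{Word: "hello", StartMs: 0, EndMs: 400},
		{Word: "world", StartMs: 450, EndMs: 900},
	}
	fmt.Printf("%.0f wpm\n", speechRateWPM(ts)) // 2 words over 900ms
}
```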
func ComputeWAVDuration ¶
func DetectLeadingSilence ¶
func DetectTrailingSilence ¶
func ValidateWAVConsistency ¶
Types ¶
type Audio ¶
type Audio struct {
// Data contains the raw audio bytes.
Data []byte
// Format specifies the audio format (e.g., "mp3", "wav", "opus").
Format string
}
Audio represents synthesized audio data with its format.
type CaptionedAudio ¶
type CaptionedAudio struct {
// Data contains the raw audio bytes.
Data []byte
// Format specifies the audio format (e.g., "mp3", "wav", "opus").
Format string
// Timestamps contains word-level timing information.
// Words are ordered chronologically as they appear in the audio.
Timestamps []WordTimestamp
// DurationMs is the total duration of the audio in milliseconds.
// This is the end time of the last word plus any trailing silence.
DurationMs int
}
CaptionedAudio represents synthesized audio with word-level timestamps. This provides precise timing information for each word, enabling advanced features like subtitle generation, speech analysis, and perfect synchronization.
type CaptionedChunk ¶
type CaptionedChunk struct {
// Audio contains base64-encoded audio data for this chunk.
Audio string `json:"audio"`
// Timestamps contains word-level timing for words in this chunk.
Timestamps []WordTimestamp `json:"timestamps"`
}
CaptionedChunk represents a single chunk from a captioned stream. It contains both audio data (base64 encoded) and word timestamps for incremental processing during streaming synthesis.
type CaptionedDialogueResult ¶
type CaptionedDialogueResult struct {
// Audio is the complete dialogue audio.
Audio []byte
// Format is the audio format (e.g., "wav").
Format string
// Segments contains timing information for each segment.
Segments []CaptionedSegment
// TotalDurationMs is the total dialogue duration in milliseconds.
TotalDurationMs int
// Subtitles is the SRT-format subtitle string, if enabled.
Subtitles string
}
CaptionedDialogueResult contains the synthesis output with timing information.
type CaptionedSegment ¶
type CaptionedSegment struct {
// Speaker is the segment speaker.
Speaker string
// Text is the spoken text.
Text string
// Audio is the segment audio data.
Audio []byte
// Timestamps contains word-level timing.
Timestamps []WordTimestamp
// StartMs is when this segment starts in the full dialogue.
StartMs int
// EndMs is when this segment ends in the full dialogue.
EndMs int
// DurationMs is the total segment duration including trailing silence.
DurationMs int
// SpeechDurationMs is the actual speech duration without trailing silence.
SpeechDurationMs int
// TrailingSilenceMs is the silence at the end of the audio.
TrailingSilenceMs int
// LeadingSilenceMs is the silence at the start of the audio.
LeadingSilenceMs int
}
CaptionedSegment represents one speaker's segment with timing details.
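The silence fields can be derived from the word timestamps and the total duration; the sketch below shows one plausible relation consistent with the field documentation (the helper silenceFromTimestamps is hypothetical):

```go
package main

import "fmt"

type WordTimestamp struct {
	Word    string
	StartMs int
	EndMs   int
}

// silenceFromTimestamps derives leading and trailing silence from word timing:
// leading silence runs until the first word starts, trailing silence runs from
// the last word's end to the total duration.
func silenceFromTimestamps(ts []WordTimestamp, durationMs int) (leading, trailing int) {
	if len(ts) == 0 {
		return 0, durationMs
	}
	leading = ts[0].StartMs
	trailing = durationMs - ts[len(ts)-1].EndMs
	return leading, trailing
}

func main() {
	ts := []WordTimestamp{{"hi", 120, 500}, {"there", 550, 980}}
	lead, trail := silenceFromTimestamps(ts, 1200)
	fmt.Println(lead, trail) // 120 220
}
```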
type CaptionedSynthesizer ¶
type CaptionedSynthesizer interface {
Synthesizer
// SynthesizeCaptioned generates audio from text with word-level timestamps.
// This is similar to Synthesize but returns timing information for each word,
// enabling precise synchronization and analysis.
//
// The returned CaptionedAudio contains both the audio data and timestamps.
// Not all TTS providers support this feature.
//
// Example:
//
// audio, err := synth.SynthesizeCaptioned(ctx, "Hello world", opts...)
// for _, ts := range audio.Timestamps {
//     fmt.Printf("%d-%dms: %s\n", ts.StartMs, ts.EndMs, ts.Word)
// }
SynthesizeCaptioned(ctx context.Context, text string, opts ...Option) (*CaptionedAudio, error)
// StreamCaptioned generates audio from text with timestamps streamed incrementally.
// Each chunk contains a JSON object with "audio" (base64) and "timestamps" fields.
// This is useful for long texts where you want to process audio and timing
// information as it's generated, rather than waiting for complete synthesis.
//
// The returned ReadCloser streams JSON objects, one per chunk.
//
// Example:
//
// stream, err := synth.StreamCaptioned(ctx, longText, opts...)
// defer stream.Close()
// decoder := json.NewDecoder(stream)
// for {
//     var chunk CaptionedChunk
//     if err := decoder.Decode(&chunk); err != nil {
//         if err == io.EOF { break }
//         return err
//     }
//     // Process chunk.Audio and chunk.Timestamps
// }
StreamCaptioned(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
}
CaptionedSynthesizer extends the basic Synthesizer interface with timestamp-aware synthesis capabilities.
Implementations that support word-level timestamps (like Kokoro-FastAPI) should implement this interface in addition to Synthesizer. This enables more sophisticated audio processing like:
- Exact pause calculation based on actual speech duration
- Automatic subtitle generation (SRT, VTT)
- Speech rate analysis and normalization
- Perfect synchronization for background music/ambience
- Quality control for podcast production
type DialogueSegment ¶
type DialogueSegment struct {
// Speaker identifies who is speaking (e.g., "Alice", "Bob", "Narrator").
// Must match a key in the DialogueSynthesizer.VoiceMap.
Speaker string
// Text is the content spoken by this speaker.
// Punctuation and sentence structure affect pause timing between segments.
Text string
}
DialogueSegment represents a single speaker's line in a multi-speaker dialogue. Each segment specifies who is speaking (Speaker) and what they say (Text), which allows the synthesizer to select the appropriate voice and apply context-aware pacing based on the conversation flow.
type DialogueSynthesizer ¶
type DialogueSynthesizer struct {
// Syn is the underlying TTS engine used to generate audio for each segment.
Syn Synthesizer
// VoiceMap maps speaker names to voice identifiers.
// For OpenAI: alloy, echo, fable, onyx, nova, shimmer.
// For Kokoro: af_bella, af_sky, am_adam, etc.
// Use "+" for voice mixing: "af_bella(3)+af_heart(1)" for 75%/25% mix.
VoiceMap map[string]string
// SpeedMap maps speaker names to speech speed multipliers.
// Values typically range from 0.8 to 1.2, where 1.0 is normal speed.
// Speakers not in the map default to 1.0.
// Use higher values for energetic speakers, lower for thoughtful speakers.
SpeedMap map[string]float64
// Format specifies the output audio format (e.g., "wav", "mp3").
// Default is "wav" for better concatenation support and crossfade quality.
// Note: Compressed formats (mp3, opus) may introduce artifacts during processing.
Format string
// CrossfadeMs specifies crossfade duration in milliseconds (default: 50).
// Higher values (80-100ms) create smoother transitions but may reduce clarity.
// Lower values (20-40ms) are faster but may sound abrupt on speaker changes.
// Set to 0 to disable crossfading (useful for compressed formats).
// Note: Crossfading requires buffering segments in memory.
CrossfadeMs int
// PauseMsMin specifies minimum pause duration between segments in milliseconds (default: 200).
// This is the base pause duration before context-aware adjustments.
// Set both PauseMsMin and PauseMsMax to 0 to disable pauses between segments.
// Recommended: 150-250ms for natural conversation, 300-500ms for dramatic effect.
PauseMsMin int
// PauseMsMax specifies maximum pause duration between segments in milliseconds (default: 300).
// A random value between PauseMsMin and PauseMsMax provides natural variation.
// Context-aware adjustments (questions, exclamations, etc.) can extend beyond this maximum
// by up to 2x to accommodate natural speech patterns.
PauseMsMax int
// NormalizeVolume enables peak volume normalization per segment (default: true).
// Ensures consistent loudness across different speakers/voices, which is critical
// for dialogue where different native volumes could be jarring.
// Normalization targets 95% of maximum amplitude to avoid clipping while maintaining
// consistent perceived volume across all speakers.
NormalizeVolume bool
}
DialogueSynthesizer generates audio for multi-speaker dialogues with natural conversation flow. It maps speakers to voice IDs and synthesizes each segment with the appropriate voice, then concatenates the results into a single audio stream with intelligent pacing.
The synthesizer automatically adjusts pause durations based on conversational context:
- Questions and exclamations get longer pauses for processing time
- Comma-terminated segments get shorter pauses as thoughts continue
- Short responses get minimal pauses for quick back-and-forth
- Transition words ("so", "well") get appropriate pauses
Audio is processed with crossfading between speakers and volume normalization to ensure consistent loudness across different voices.
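The peak volume normalization mentioned above can be sketched as follows, targeting a fraction of full scale (e.g. 0.95) to avoid clipping. This is an illustration of the technique, not the package's implementation:

```go
package main

import "fmt"

// normalizePeak scales 16-bit PCM samples in place so the loudest sample
// reaches target (a fraction of int16 full scale, e.g. 0.95).
func normalizePeak(samples []int16, target float64) {
	peak := 0
	for _, s := range samples {
		a := int(s)
		if a < 0 {
			a = -a // absolute value in int to avoid int16 overflow at -32768
		}
		if a > peak {
			peak = a
		}
	}
	if peak == 0 {
		return // all silence; nothing to scale
	}
	gain := target * 32767 / float64(peak)
	for i, s := range samples {
		samples[i] = int16(float64(s) * gain)
	}
}

func main() {
	samples := []int16{0, 1000, -2000, 500}
	normalizePeak(samples, 0.95)
	fmt.Println(samples)
}
```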
Example:
	syn, _ := openai.NewSynthesizer(openai.WithBaseURL("http://localhost:8880/v1"))
	ds := voice.NewDialogueSynthesizer(syn, map[string]string{
		"Alice": "af_bella",
		"Bob":   "am_adam",
	})
	ds.SpeedMap = map[string]float64{
		"Bob": 0.95, // Bob speaks slightly slower
	}
	stream, _ := ds.StreamDialogue(ctx, []voice.DialogueSegment{
		{Speaker: "Alice", Text: "What do you think?"},
		{Speaker: "Bob", Text: "I think it's great!"},
	})
func NewDialogueSynthesizer ¶
func NewDialogueSynthesizer(syn Synthesizer, voiceMap map[string]string, format ...string) *DialogueSynthesizer
NewDialogueSynthesizer creates a new dialogue synthesizer for multi-speaker audio generation. The synthesizer applies context-aware pacing, crossfading, and volume normalization to create natural-sounding dialogues.
The format defaults to "wav" which supports reliable concatenation and processing. For dialogue synthesis, WAV is recommended over compressed formats like MP3 to avoid quality degradation through multiple processing steps.
Default settings:
- CrossfadeMs: 50ms (smooth transitions between speakers)
- PauseMsMin: 200ms (minimum pause between segments)
- PauseMsMax: 300ms (maximum pause, randomized for naturalness)
- NormalizeVolume: true (consistent loudness across voices)
Example:
	ds := voice.NewDialogueSynthesizer(synthesizer, map[string]string{
		"Alice": "af_bella",
		"Bob":   "am_adam",
	})
	ds.SpeedMap = map[string]float64{
		"Alice": 1.05, // slightly faster
		"Bob":   0.95, // slightly slower
	}
func (*DialogueSynthesizer) StreamDialogue ¶
func (ds *DialogueSynthesizer) StreamDialogue(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
StreamDialogue generates audio for all segments and streams them as a single concatenated audio stream. Segments are synthesized sequentially with context-aware pauses and crossfading between speakers.
The returned io.ReadCloser streams the complete dialogue audio. The caller must close the ReadCloser when done reading.
Example:
	stream, err := ds.StreamDialogue(ctx, []voice.DialogueSegment{
		{Speaker: "Alice", Text: "Hello?"},
		{Speaker: "Bob", Text: "Hi there!"},
	})
	defer stream.Close()
	io.Copy(outputFile, stream)
func (*DialogueSynthesizer) StreamDialogueParallel ¶
func (ds *DialogueSynthesizer) StreamDialogueParallel(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
StreamDialogueParallel generates audio for segments in parallel and streams them in order. This is significantly faster than sequential synthesis for dialogues with many segments, as all segments are synthesized concurrently, then assembled in order with proper crossfading and pause timing.
The concurrency limit (exposed via StreamDialogueParallelWithLimit) controls how many segments are synthesized simultaneously. A value of 0 or negative means no limit, which may overwhelm the API for large dialogues. Recommended: 5-10 for Kokoro-FastAPI; for OpenAI, use lower values (3-5) due to rate limits.
Returns a ReadCloser that streams the concatenated audio with natural transitions.
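The ordered, bounded-concurrency pattern these methods describe can be sketched with a buffered channel as a semaphore; synthesize here is a hypothetical stand-in for the real per-segment TTS call:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// synthesize stands in for a per-segment TTS call (hypothetical).
func synthesize(text string) string { return strings.ToUpper(text) }

// parallelOrdered runs synthesize concurrently with at most limit calls in
// flight, writing each result into its input slot so output order matches
// input order regardless of completion order.
func parallelOrdered(texts []string, limit int) []string {
	out := make([]string, len(texts))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, t := range texts {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a concurrency slot
			defer func() { <-sem }() // release it
			out[i] = synthesize(t)
		}(i, t)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(parallelOrdered([]string{"hello", "world"}, 2))
}
```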
func (*DialogueSynthesizer) StreamDialogueParallelWithLimit ¶
func (ds *DialogueSynthesizer) StreamDialogueParallelWithLimit(ctx context.Context, segments []DialogueSegment, concurrencyLimit int) (io.ReadCloser, error)
StreamDialogueParallelWithLimit generates audio with controlled concurrency. Use this for large dialogues or when dealing with rate-limited APIs. ConcurrencyLimit of 5-10 is recommended for most use cases.
func (*DialogueSynthesizer) SynthesizeDialogue ¶
func (ds *DialogueSynthesizer) SynthesizeDialogue(ctx context.Context, segments []DialogueSegment) ([]*Audio, error)
SynthesizeDialogue generates audio for all segments and returns individual audio files. This is useful when you want to process each speaker's audio separately, apply custom audio processing, or store segments individually.
Returns a slice of Audio objects, one per segment, in the same order as the input.
type DialogueSynthesizerCaptioned ¶
type DialogueSynthesizerCaptioned struct {
// Syn is the captioned synthesizer used to generate audio with timestamps.
Syn CaptionedSynthesizer
// VoiceMap maps speaker names to voice identifiers.
VoiceMap map[string]string
// SpeedMap maps speaker names to speech speed multipliers.
// Values typically range from 0.8 to 1.2, where 1.0 is normal speed.
SpeedMap map[string]float64
// Format specifies the output audio format (e.g., "wav", "mp3").
// Default is "wav" for best quality with crossfading.
Format string
// CrossfadeMs specifies crossfade duration in milliseconds (default: 50).
// Set to 0 to disable crossfading.
CrossfadeMs int
// TargetPauseMs is the target pause between segments (default: 250).
// This is the desired gap between the END of one speech and START of the next.
TargetPauseMs int
// NormalizeVolume enables peak volume normalization per segment (default: true).
NormalizeVolume bool
// GenerateSubtitles enables automatic subtitle generation (default: true).
// When enabled, returns both audio and SRT-format subtitles.
GenerateSubtitles bool
}
DialogueSynthesizerCaptioned generates multi-speaker dialogue with perfect timing using word-level timestamps from captioned synthesis.
This synthesizer provides superior dialogue quality compared to DialogueSynthesizer by using actual speech duration and timing information instead of heuristics. It eliminates problems like:
- Double-pausing (built-in silence + added silence)
- Cutting words during crossfade
- Inconsistent speech rates between speakers
- Manual subtitle timing
Requirements: The underlying synthesizer must implement CaptionedSynthesizer interface. Compatible providers: Kokoro-FastAPI (with /dev/captioned_speech endpoint).
func NewDialogueSynthesizerCaptioned ¶
func NewDialogueSynthesizerCaptioned(syn CaptionedSynthesizer, voiceMap map[string]string, format ...string) (*DialogueSynthesizerCaptioned, error)
NewDialogueSynthesizerCaptioned creates a new captioned dialogue synthesizer for multi-speaker audio generation. The synthesizer uses word-level timestamps for perfect pause calculation and optional subtitle generation.
Prerequisites: The synthesizer parameter must implement CaptionedSynthesizer. This is supported by Kokoro-FastAPI and similar providers with timestamp capabilities.
The format defaults to "wav" which preserves quality through multiple processing steps. For subtitle generation and timestamp analysis, WAV is strongly recommended.
Returns an error if the synthesizer is nil or voiceMap is empty.
Example:
	syn, _ := openai.NewSynthesizer(openai.WithBaseURL("http://localhost:8880/v1"))
	ds, err := voice.NewDialogueSynthesizerCaptioned(syn, map[string]string{
		"Alice": "af_bella",
		"Bob":   "am_adam",
	})
func (*DialogueSynthesizerCaptioned) CalculatePerfectPause ¶
func (ds *DialogueSynthesizerCaptioned) CalculatePerfectPause(prev, curr *CaptionedSegment) int
CalculatePerfectPause calculates the exact pause needed between two segments. It uses word-level timestamps to avoid double-pausing and applies context-aware adjustments based on dialogue content.
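A sketch of the idea: the silence to insert is the target pause minus the silence already baked into the two segments' boundaries, floored at MinPauseMs. The exact formula used by CalculatePerfectPause is an assumption:

```go
package main

import "fmt"

const MinPauseMs = 50 // from the package constants

// perfectPause computes the silence to insert so the audible gap between two
// segments is close to targetMs, subtracting the previous segment's trailing
// silence and the next segment's leading silence to avoid double-pausing.
func perfectPause(prevTrailingMs, currLeadingMs, targetMs int) int {
	pause := targetMs - prevTrailingMs - currLeadingMs
	if pause < MinPauseMs {
		return MinPauseMs
	}
	return pause
}

func main() {
	// Previous segment ends with 120ms of silence, next starts with 40ms.
	fmt.Println(perfectPause(120, 40, 250)) // 90
}
```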
func (*DialogueSynthesizerCaptioned) GenerateSRT ¶
func (ds *DialogueSynthesizerCaptioned) GenerateSRT(segments []CaptionedSegment) string
GenerateSRT creates SRT-format subtitles from captioned segments. This automatically generates perfectly timed subtitles without manual adjustment. This is a convenience method that wraps the internal generateSRT function.
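The SRT cue format this method emits can be sketched as follows; how words are grouped into cues is the package's concern, so the hypothetical srtEntry below renders just one cue:

```go
package main

import "fmt"

// srtTime formats milliseconds as an SRT timestamp (HH:MM:SS,mmm).
func srtTime(ms int) string {
	return fmt.Sprintf("%02d:%02d:%02d,%03d",
		ms/3600000, ms/60000%60, ms/1000%60, ms%1000)
}

// srtEntry renders one numbered SRT cue: index, time range, text, blank line.
func srtEntry(index, startMs, endMs int, text string) string {
	return fmt.Sprintf("%d\n%s --> %s\n%s\n",
		index, srtTime(startMs), srtTime(endMs), text)
}

func main() {
	fmt.Print(srtEntry(1, 1500, 3250, "Hello there!"))
}
```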
func (*DialogueSynthesizerCaptioned) GenerateSRTWithSpeakers ¶
func (ds *DialogueSynthesizerCaptioned) GenerateSRTWithSpeakers(segments []CaptionedSegment) string
GenerateSRTWithSpeakers creates SRT-format subtitles with speaker labels. Each subtitle line includes "[Speaker]: word" format, useful for multi-speaker content.
func (*DialogueSynthesizerCaptioned) StreamDialogueCaptioned ¶
func (ds *DialogueSynthesizerCaptioned) StreamDialogueCaptioned(ctx context.Context, segments []DialogueSegment) (io.ReadCloser, error)
StreamDialogueCaptioned streams dialogue with timestamps, calculating perfect pauses on-the-fly using speech duration information.
IMPLEMENTATION STATUS: This method is currently a stub and returns an error. Streaming captioned dialogue requires buffering segments anyway to calculate perfect pauses, so there's no significant benefit over SynthesizeDialogueCaptioned.
FUTURE WORK: If streaming is needed for very long dialogues, consider:
1. Using a heuristic pause calculation instead of perfect pause
2. Buffering N segments ahead for pause calculation while streaming
3. Using a separate goroutine for synthesis and another for assembly
For now, use SynthesizeDialogueCaptioned which provides the full feature set.
func (*DialogueSynthesizerCaptioned) SynthesizeDialogueCaptioned ¶
func (ds *DialogueSynthesizerCaptioned) SynthesizeDialogueCaptioned(ctx context.Context, segments []DialogueSegment) (*CaptionedDialogueResult, error)
SynthesizeDialogueCaptioned generates dialogue with perfect timing using timestamps. This method provides superior audio quality by:
- Calculating exact pauses from actual speech duration
- Avoiding double-pausing (built-in silence + added silence)
- Crossfading at word boundaries instead of random positions
- Generating subtitles automatically (if enabled)
Returns complete dialogue audio and detailed timing information for each segment.
type EstimatedCaptionedSynthesizer ¶
type EstimatedCaptionedSynthesizer struct {
Syn Synthesizer
Format string
SilentThreshold int16
}
func NewEstimatedCaptionedSynthesizer ¶
func NewEstimatedCaptionedSynthesizer(syn Synthesizer, format ...string) *EstimatedCaptionedSynthesizer
func (*EstimatedCaptionedSynthesizer) Stream ¶
func (e *EstimatedCaptionedSynthesizer) Stream(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
func (*EstimatedCaptionedSynthesizer) StreamCaptioned ¶
func (e *EstimatedCaptionedSynthesizer) StreamCaptioned(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
func (*EstimatedCaptionedSynthesizer) Synthesize ¶
func (*EstimatedCaptionedSynthesizer) SynthesizeCaptioned ¶
func (e *EstimatedCaptionedSynthesizer) SynthesizeCaptioned(ctx context.Context, text string, opts ...Option) (*CaptionedAudio, error)
func (*EstimatedCaptionedSynthesizer) WithSilentThreshold ¶
func (e *EstimatedCaptionedSynthesizer) WithSilentThreshold(threshold int16) *EstimatedCaptionedSynthesizer
type FormatMismatchError ¶
func (*FormatMismatchError) Error ¶
func (e *FormatMismatchError) Error() string
type Option ¶
type Option func(*SynthesizeOptions)
Option is a functional option for configuring synthesis parameters.
type SynthesizeOptions ¶
type SynthesizeOptions struct {
// Model specifies the TTS model to use (e.g., "tts-1", "kokoro").
Model string
// Voice specifies the voice identifier (e.g., "alloy", "af_bella").
Voice string
// Format specifies the output audio format (e.g., "mp3", "wav").
Format string
// Speed specifies the speech speed (0.25 to 4.0, where 1.0 is normal).
Speed float64
}
SynthesizeOptions configures text-to-speech synthesis parameters.
type Synthesizer ¶
type Synthesizer interface {
// Synthesize generates audio from text and returns the complete audio data.
// Use this for shorter texts where buffering the entire response is acceptable.
Synthesize(ctx context.Context, text string, opts ...Option) (*Audio, error)
// Stream generates audio from text and returns a stream for reading audio chunks.
// Use this for longer texts or when you want to process audio as it arrives.
// The caller is responsible for closing the returned ReadCloser.
Stream(ctx context.Context, text string, opts ...Option) (io.ReadCloser, error)
}
Synthesizer is the interface for Text-to-Speech providers. Implementations convert text into audio data, supporting both buffered synthesis and streaming modes.
type WordTimestamp ¶
type WordTimestamp struct {
// Word is the text content of this segment.
Word string
// StartMs is the start time in milliseconds from the beginning of the audio.
StartMs int
// EndMs is the end time in milliseconds from the beginning of the audio.
EndMs int
}
WordTimestamp represents a single word with its timing information. This enables precise synchronization of audio with text, useful for generating subtitles, chapter markers, and analyzing speech patterns.
func EstimateWordTimestamps ¶
func EstimateWordTimestamps(text string, totalDurationMs int, leadingSilenceMs int) []WordTimestamp
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| openai | Package openai provides an OpenAI-compatible Text-to-Speech implementation. |