Documentation ¶
Overview ¶
Package ttsscript provides a structured format for authoring multilingual TTS (Text-to-Speech) scripts that can be compiled to various output formats.
This package is engine-agnostic and can be used with any TTS provider including ElevenLabs, Google Cloud TTS, Amazon Polly, Azure TTS, and others.
Why Use ttsscript? ¶
Instead of storing raw SSML (which is engine-specific and hard to edit), store your scripts in a structured JSON format that:
- Supports multiple languages in a single file
- Handles pronunciations/acronyms separately from content
- Can be compiled to any TTS engine format
- Is easy to edit and version control
Basic Usage ¶
Create a script JSON file:
{
  "title": "My Course",
  "default_voices": {"en": "voice-id", "es": "voice-id-2"},
  "pronunciations": {
    "API": {"en": "A P I", "es": "A P I"},
    "SDK": {"en": "S D K", "es": "S D K"}
  },
  "slides": [
    {
      "title": "Introduction",
      "segments": [
        {
          "text": {"en": "Welcome to the API course", "es": "Bienvenidos al curso de API"},
          "pause_after": "500ms"
        }
      ]
    }
  ]
}
Load and compile for ElevenLabs:
script, err := ttsscript.LoadScript("script.json")
if err != nil {
	log.Fatal(err)
}
compiler := ttsscript.NewCompiler()
segments, err := compiler.Compile(script, "en")
if err != nil {
	log.Fatal(err)
}
formatter := ttsscript.NewElevenLabsFormatter()
jobs := formatter.Format(segments)
for _, job := range jobs {
	// Generate TTS for each segment (client and ctx are set up elsewhere).
	audio, err := client.TextToSpeech().Simple(ctx, job.VoiceID, job.Text)
	if err != nil {
		log.Fatal(err)
	}
	// Save audio along with job.PauseBeforeMs and job.PauseAfterMs for post-processing.
	_ = audio
}
Compile to SSML for Google TTS:
formatter := ttsscript.NewSSMLFormatter()
ssml, err := formatter.FormatScript(script, "en")
if err != nil {
	log.Fatal(err)
}
// Use ssml with Google Cloud TTS API
Script Structure ¶
A Script contains:
- Metadata (title, description, default language)
- Default voices per language
- Global pronunciations
- Slides/sections containing segments
Each Segment contains:
- Text in multiple languages
- Voice overrides per language
- Pause before/after
- Prosody settings (rate, pitch, emphasis)
- Segment-specific pronunciations
Compilation Process ¶
1. Load the script from JSON
2. Create a Compiler and optionally add additional pronunciations
3. Compile for a specific language to get CompiledSegments
4. Format the segments for your target TTS engine
Formatters ¶
- SSMLFormatter: Outputs W3C SSML compatible with Google, Amazon, Azure
- ElevenLabsFormatter: Outputs segments ready for ElevenLabs TTS API
Pronunciation Handling ¶
Pronunciations are applied at compile time with this priority:
1. Compiler-level (added via AddPronunciation)
2. Segment-level (in segment.pronunciations)
3. Script-level (in script.pronunciations)
This allows overrides at any level. Terms are matched case-insensitively with word boundaries.
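The priority and matching rules above can be sketched in a few lines of standalone Go. This is an illustration only, not the package's implementation: `pronMap`, `mergeForLanguage`, and `applyPronunciations` are hypothetical names, and the use of a single `regexp` with `\b` is an assumption about how word-boundary matching might be done.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// pronMap mirrors the documented shape: term -> language -> spoken form.
type pronMap map[string]map[string]string

// mergeForLanguage flattens pronunciation maps for one language.
// Layers are passed from lowest to highest priority, so later layers win.
// Terms are lower-cased so that lookups are case-insensitive.
func mergeForLanguage(lang string, layers ...pronMap) map[string]string {
	merged := make(map[string]string)
	for _, layer := range layers {
		for term, byLang := range layer {
			if spoken, ok := byLang[lang]; ok {
				merged[strings.ToLower(term)] = spoken
			}
		}
	}
	return merged
}

// applyPronunciations replaces whole-word, case-insensitive matches of each term.
func applyPronunciations(text string, rules map[string]string) string {
	if len(rules) == 0 {
		return text
	}
	terms := make([]string, 0, len(rules))
	for term := range rules {
		terms = append(terms, regexp.QuoteMeta(term))
	}
	// Longest first, so a longer term is tried before a shorter one.
	sort.Slice(terms, func(i, j int) bool { return len(terms[i]) > len(terms[j]) })
	re := regexp.MustCompile(`(?i)\b(` + strings.Join(terms, "|") + `)\b`)
	return re.ReplaceAllStringFunc(text, func(match string) string {
		return rules[strings.ToLower(match)]
	})
}

func main() {
	scriptLevel := pronMap{"API": {"en": "A P I"}, "SDK": {"en": "S D K"}}
	segmentLevel := pronMap{"API": {"en": "ay pee eye"}}
	compilerLevel := pronMap{"SDK": {"en": "software development kit"}}

	// Script-level is lowest priority, compiler-level is highest.
	rules := mergeForLanguage("en", scriptLevel, segmentLevel, compilerLevel)
	fmt.Println(applyPronunciations("The api ships with the SDK, not a rapid tool.", rules))
}
```

In this sketch the word "rapid" is left alone even though it contains "api", because the match requires a word boundary on both sides.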
Index ¶
- func CombineText(segments []CompiledSegment) string
- func EscapeSSML(s string) string
- func FormatDuration(ms int) string
- func GroupBySlide(segments []CompiledSegment) map[int][]CompiledSegment
- func GroupByVoice(segments []CompiledSegment) map[string][]CompiledSegment
- func ParseDuration(s string) int
- func SSMLBreak(duration string) string
- func SSMLEmphasis(text, level string) string
- func SSMLPhoneme(text, alphabet, ph string) string
- func SSMLProsody(text, rate, pitch, volume string) string
- func SSMLSayAs(text, interpretAs, format string) string
- func SSMLSub(text, alias string) string
- type BatchConfig
- type CompiledSegment
- type Compiler
- type ElevenLabsFormatter
- func (f *ElevenLabsFormatter) CombineForSingleRequest(segments []ElevenLabsSegment) string
- func (f *ElevenLabsFormatter) Format(segments []CompiledSegment) []ElevenLabsSegment
- func (f *ElevenLabsFormatter) FormatScript(script *Script, language string) ([]ElevenLabsSegment, error)
- func (f *ElevenLabsFormatter) GroupByVoice(segments []ElevenLabsSegment) map[string][]ElevenLabsSegment
- type ElevenLabsSegment
- type ManifestEntry
- type SSMLFormatter
- type Script
- type Segment
- type Slide
- type TTSRequest
Functions ¶
func CombineText ¶
func CombineText(segments []CompiledSegment) string
CombineText combines all segment texts into a single string with pause markers.
func FormatDuration ¶
func FormatDuration(ms int) string
FormatDuration formats milliseconds as a duration string.
func GroupBySlide ¶
func GroupBySlide(segments []CompiledSegment) map[int][]CompiledSegment
GroupBySlide groups compiled segments by slide index.
func GroupByVoice ¶
func GroupByVoice(segments []CompiledSegment) map[string][]CompiledSegment
GroupByVoice groups compiled segments by voice ID. Useful for batch processing with the same voice.
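As a rough illustration of the grouping described above, the following standalone sketch buckets segments by voice ID. The `segment` type and `groupByVoice` function are local stand-ins invented for this example; the package's own function operates on CompiledSegment.

```go
package main

import "fmt"

// segment is a local stand-in carrying only the fields this sketch needs.
type segment struct {
	VoiceID string
	Text    string
}

// groupByVoice buckets segments by voice ID, preserving order inside each bucket.
func groupByVoice(segments []segment) map[string][]segment {
	groups := make(map[string][]segment)
	for _, seg := range segments {
		groups[seg.VoiceID] = append(groups[seg.VoiceID], seg)
	}
	return groups
}

func main() {
	groups := groupByVoice([]segment{
		{VoiceID: "voice-a", Text: "Welcome."},
		{VoiceID: "voice-b", Text: "Bienvenidos."},
		{VoiceID: "voice-a", Text: "Let's begin."},
	})
	for voice, segs := range groups {
		fmt.Println(voice, len(segs))
	}
}
```

Go map iteration order is not defined, so code that needs a stable processing order should sort the voice IDs before looping.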
func ParseDuration ¶
func ParseDuration(s string) int
ParseDuration parses a duration string like "500ms" or "1s" to milliseconds.
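One way such a conversion could work is shown below, using the standard library's `time.ParseDuration`. This is a sketch under assumptions: `parseDurationMs` is a hypothetical helper, and returning 0 for invalid or negative input is a choice made for this example, not documented behavior of the package.

```go
package main

import (
	"fmt"
	"time"
)

// parseDurationMs converts strings such as "500ms" or "1s" to milliseconds.
// Invalid or negative input yields 0 in this sketch; the package may behave differently.
func parseDurationMs(s string) int {
	d, err := time.ParseDuration(s)
	if err != nil || d < 0 {
		return 0
	}
	return int(d / time.Millisecond)
}

func main() {
	for _, s := range []string{"500ms", "1s", "1.5s", "soon"} {
		fmt.Printf("%q -> %d ms\n", s, parseDurationMs(s))
	}
}
```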
func SSMLEmphasis ¶
func SSMLEmphasis(text, level string) string
SSMLEmphasis wraps text in emphasis tags.
func SSMLPhoneme ¶
func SSMLPhoneme(text, alphabet, ph string) string
SSMLPhoneme wraps text with phonetic pronunciation.
func SSMLProsody ¶
func SSMLProsody(text, rate, pitch, volume string) string
SSMLProsody wraps text in prosody tags.
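To make the SSML helpers more concrete, here is a standalone sketch of what wrapping text in `<prosody>` and `<emphasis>` elements can look like. The functions `escape`, `prosody`, and `emphasis` are hypothetical and written for this example; whether the package escapes its inputs or omits empty attributes in the same way is an assumption.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// escape returns s with XML special characters escaped.
func escape(s string) string {
	var b strings.Builder
	// Writes to a strings.Builder do not fail, so the error is ignored.
	_ = xml.EscapeText(&b, []byte(s))
	return b.String()
}

// prosody wraps text in a <prosody> element, skipping attributes that are empty.
func prosody(text, rate, pitch, volume string) string {
	var b strings.Builder
	b.WriteString("<prosody")
	for _, attr := range [][2]string{{"rate", rate}, {"pitch", pitch}, {"volume", volume}} {
		if attr[1] != "" {
			fmt.Fprintf(&b, ` %s="%s"`, attr[0], escape(attr[1]))
		}
	}
	b.WriteString(">" + escape(text) + "</prosody>")
	return b.String()
}

// emphasis wraps text in an <emphasis> element with the given level.
func emphasis(text, level string) string {
	return fmt.Sprintf(`<emphasis level="%s">%s</emphasis>`, escape(level), escape(text))
}

func main() {
	fmt.Println(prosody("Terms & conditions", "slow", "+10%", ""))
	fmt.Println(emphasis("important", "strong"))
}
```

Escaping the text matters because characters such as `&` and `<` would otherwise produce malformed SSML.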
Types ¶
type BatchConfig ¶
type BatchConfig struct {
	// OutputDir is the directory for output files.
	OutputDir string
	// FilePrefix is added before each filename.
	FilePrefix string
	// FileSuffix is added after each filename (before extension).
	FileSuffix string
	// IncludeLanguageInFilename adds language code to filename.
	IncludeLanguageInFilename bool
}
BatchConfig contains configuration for batch TTS processing.
func NewBatchConfig ¶
func NewBatchConfig(outputDir string) *BatchConfig
NewBatchConfig creates a batch config with defaults.
func (*BatchConfig) GenerateFilename ¶
func (c *BatchConfig) GenerateFilename(seg ElevenLabsSegment, language string) string
GenerateFilename generates an output filename for a segment.
type CompiledSegment ¶
type CompiledSegment struct {
	// SlideIndex is the 0-based slide index.
	SlideIndex int
	// SegmentIndex is the 0-based segment index within the slide.
	// For title segments, this is -1.
	SegmentIndex int
	// SlideTitle is the slide title (if any).
	SlideTitle string
	// IsTitleSegment indicates this segment was generated from a slide title.
	IsTitleSegment bool
	// IsSectionHeader indicates this segment belongs to a section header slide.
	IsSectionHeader bool
	// Text is the processed text with pronunciations applied.
	Text string
	// OriginalText is the text before pronunciation substitutions.
	OriginalText string
	// VoiceID is the voice to use for this segment.
	VoiceID string
	// Language is the language code.
	Language string
	// PauseBeforeMs is the pause before in milliseconds.
	PauseBeforeMs int
	// PauseAfterMs is the pause after in milliseconds.
	PauseAfterMs int
	// Emphasis is the emphasis level.
	Emphasis string
	// Rate is the speaking rate.
	Rate string
	// Pitch is the pitch adjustment.
	Pitch string
}
CompiledSegment represents a compiled segment ready for TTS.
type Compiler ¶
type Compiler struct {
	// AdditionalPronunciations are extra pronunciations to apply.
	AdditionalPronunciations map[string]map[string]string
	// DefaultPauseAfterSlide is the pause after each slide if not specified.
	DefaultPauseAfterSlide string
	// DefaultPauseAfterSegment is the pause after each segment if not specified.
	DefaultPauseAfterSegment string
}
Compiler compiles scripts to various output formats.
func NewCompiler ¶
func NewCompiler() *Compiler
NewCompiler creates a new script compiler with default settings.
func (*Compiler) AddPronunciation ¶
AddPronunciation adds a pronunciation rule.
func (*Compiler) AddPronunciations ¶
AddPronunciations adds multiple pronunciation rules for a language.
type ElevenLabsFormatter ¶
type ElevenLabsFormatter struct {
	// UsePauseMarkers includes [pause:Xms] markers in text output.
	// When false, pauses are tracked separately for post-processing.
	UsePauseMarkers bool
	// PauseMarkerFormat is the format for pause markers (default: "[pause:%s]").
	PauseMarkerFormat string
}
ElevenLabsFormatter formats compiled segments for ElevenLabs TTS.
func NewElevenLabsFormatter ¶
func NewElevenLabsFormatter() *ElevenLabsFormatter
NewElevenLabsFormatter creates a new ElevenLabs formatter.
func (*ElevenLabsFormatter) CombineForSingleRequest ¶
func (f *ElevenLabsFormatter) CombineForSingleRequest(segments []ElevenLabsSegment) string
CombineForSingleRequest combines segments into a single text block. Useful when you want to generate all audio in one API call. Note: This loses per-segment voice control.
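The idea of joining segments into one text block with pause markers can be sketched as follows. This is an assumption-laden illustration: `segment` and `combine` are hypothetical, the space separator is a guess, and only the default marker format "[pause:%s]" is taken from the ElevenLabsFormatter field documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// segment is a local stand-in with only the fields this sketch needs.
type segment struct {
	Text         string
	PauseAfterMs int
}

// combine joins segment texts with spaces. When markerFormat is non-empty, a marker
// such as "[pause:500ms]" follows every segment that has a pause after it.
func combine(segments []segment, markerFormat string) string {
	parts := make([]string, 0, len(segments)*2)
	for _, seg := range segments {
		parts = append(parts, seg.Text)
		if markerFormat != "" && seg.PauseAfterMs > 0 {
			pause := fmt.Sprintf("%dms", seg.PauseAfterMs)
			parts = append(parts, fmt.Sprintf(markerFormat, pause))
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	segs := []segment{
		{Text: "Welcome to the course.", PauseAfterMs: 500},
		{Text: "Let's begin."},
	}
	fmt.Println(combine(segs, "[pause:%s]"))
	fmt.Println(combine(segs, "")) // markers disabled
}
```

Because the result is a single string, it carries no voice IDs, which is the loss of per-segment voice control noted above.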
func (*ElevenLabsFormatter) Format ¶
func (f *ElevenLabsFormatter) Format(segments []CompiledSegment) []ElevenLabsSegment
Format formats compiled segments for ElevenLabs.
func (*ElevenLabsFormatter) FormatScript ¶
func (f *ElevenLabsFormatter) FormatScript(script *Script, language string) ([]ElevenLabsSegment, error)
FormatScript compiles and formats a script for ElevenLabs.
func (*ElevenLabsFormatter) GroupByVoice ¶
func (f *ElevenLabsFormatter) GroupByVoice(segments []ElevenLabsSegment) map[string][]ElevenLabsSegment
GroupByVoice groups segments by voice ID for batch processing.
type ElevenLabsSegment ¶
type ElevenLabsSegment struct {
	// Text is the text to generate speech for.
	Text string
	// VoiceID is the ElevenLabs voice ID.
	VoiceID string
	// SlideIndex is the source slide index.
	SlideIndex int
	// SegmentIndex is the source segment index (-1 for title segments).
	SegmentIndex int
	// SlideTitle is the slide title for reference.
	SlideTitle string
	// IsTitleSegment indicates this segment was generated from a slide title.
	IsTitleSegment bool
	// IsSectionHeader indicates this segment belongs to a section header slide.
	IsSectionHeader bool
	// PauseBeforeMs is silence to add before this segment.
	PauseBeforeMs int
	// PauseAfterMs is silence to add after this segment.
	PauseAfterMs int
	// SuggestedFilename is a suggested output filename.
	SuggestedFilename string
}
ElevenLabsSegment represents a segment ready for ElevenLabs TTS.
type ManifestEntry ¶
type ManifestEntry struct {
	SlideIndex      int    `json:"slide_index"`
	SegmentIndex    int    `json:"segment_index"`
	SlideTitle      string `json:"slide_title,omitempty"`
	IsTitleSegment  bool   `json:"is_title_segment,omitempty"`
	IsSectionHeader bool   `json:"is_section_header,omitempty"`
	Text            string `json:"text"`
	VoiceID         string `json:"voice_id"`
	Language        string `json:"language"`
	OutputFile      string `json:"output_file"`
	PauseBeforeMs   int    `json:"pause_before_ms,omitempty"`
	PauseAfterMs    int    `json:"pause_after_ms,omitempty"`
}
ManifestEntry represents an entry in a generation manifest.
func GenerateManifest ¶
func GenerateManifest(segments []ElevenLabsSegment, config *BatchConfig, language string) []ManifestEntry
GenerateManifest creates a manifest of all segments for tracking.
type SSMLFormatter ¶
type SSMLFormatter struct {
	// Version is the SSML version (default: "1.1").
	Version string
	// IncludeComments includes slide title comments in output.
	IncludeComments bool
	// IndentSpaces is the number of spaces for indentation.
	IndentSpaces int
}
SSMLFormatter formats compiled segments as SSML. Compatible with Google Cloud TTS, Amazon Polly, Azure TTS, and others.
func NewSSMLFormatter ¶
func NewSSMLFormatter() *SSMLFormatter
NewSSMLFormatter creates a new SSML formatter with default settings.
func (*SSMLFormatter) Format ¶
func (f *SSMLFormatter) Format(segments []CompiledSegment, language string) string
Format formats compiled segments as SSML.
func (*SSMLFormatter) FormatScript ¶
func (f *SSMLFormatter) FormatScript(script *Script, language string) (string, error)
FormatScript compiles and formats a script as SSML.
type Script ¶
type Script struct {
	// Title is the script title.
	Title string `json:"title,omitempty"`
	// Description is an optional description.
	Description string `json:"description,omitempty"`
	// DefaultLanguage is the primary language code (e.g., "en-US").
	DefaultLanguage string `json:"default_language,omitempty"`
	// DefaultVoices maps language codes to default voice IDs.
	DefaultVoices map[string]string `json:"default_voices,omitempty"`
	// Pronunciations maps terms to their pronunciation by language.
	// Example: {"ADK": {"en": "A D K", "es": "A D K"}}
	Pronunciations map[string]map[string]string `json:"pronunciations,omitempty"`
	// Slides contains the ordered list of slides/sections.
	Slides []Slide `json:"slides"`
}
Script represents a multilingual TTS script with slides/segments. This is the canonical format for authoring TTS content that can be compiled to SSML (Google TTS, Amazon Polly) or ElevenLabs-compatible text.
func LoadScript ¶
LoadScript loads a script from a JSON file.
func ParseScript ¶
ParseScript parses a script from JSON data.
func (*Script) SegmentCount ¶
SegmentCount returns the total number of segments across all slides.
func (*Script) SlideCount ¶
SlideCount returns the number of slides.
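Parsing and counting can be illustrated without the package by decoding into local types that mirror a subset of the documented JSON tags. The names `script`, `slide`, `segment`, `parseScript`, and `segmentCount` below are stand-ins for this sketch and do not claim to match the package's internals.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of a subset of the documented fields and JSON tags.
type segment struct {
	Text       map[string]string `json:"text"`
	PauseAfter string            `json:"pause_after,omitempty"`
}

type slide struct {
	Title    string    `json:"title,omitempty"`
	Segments []segment `json:"segments"`
}

type script struct {
	Title         string            `json:"title,omitempty"`
	DefaultVoices map[string]string `json:"default_voices,omitempty"`
	Slides        []slide           `json:"slides"`
}

// parseScript decodes JSON data into the mirror types.
func parseScript(data []byte) (*script, error) {
	var s script
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("parse script: %w", err)
	}
	return &s, nil
}

// segmentCount totals the segments across all slides.
func segmentCount(s *script) int {
	n := 0
	for _, sl := range s.Slides {
		n += len(sl.Segments)
	}
	return n
}

func main() {
	data := []byte(`{
		"title": "My Course",
		"default_voices": {"en": "voice-id"},
		"slides": [{"title": "Introduction", "segments": [
			{"text": {"en": "Welcome to the API course"}, "pause_after": "500ms"}
		]}]
	}`)
	s, err := parseScript(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s.Title, len(s.Slides), segmentCount(s))
}
```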
type Segment ¶
type Segment struct {
	// Text contains the text content by language code.
	// Example: {"en": "Hello world", "es": "Hola mundo"}
	Text map[string]string `json:"text"`
	// Voice overrides the default voice for this segment by language.
	// Example: {"en": "voice-id-1", "es": "voice-id-2"}
	Voice map[string]string `json:"voice,omitempty"`
	// PauseBefore is the pause duration before this segment (e.g., "500ms", "1s").
	PauseBefore string `json:"pause_before,omitempty"`
	// PauseAfter is the pause duration after this segment (e.g., "500ms", "1s").
	PauseAfter string `json:"pause_after,omitempty"`
	// Emphasis indicates the emphasis level ("strong", "moderate", "reduced").
	Emphasis string `json:"emphasis,omitempty"`
	// Rate is the speaking rate ("slow", "medium", "fast", or percentage like "80%").
	Rate string `json:"rate,omitempty"`
	// Pitch adjusts the pitch ("low", "medium", "high", or percentage like "+10%").
	Pitch string `json:"pitch,omitempty"`
	// Pronunciations are segment-specific pronunciation overrides.
	Pronunciations map[string]map[string]string `json:"pronunciations,omitempty"`
}
Segment represents a single audio segment within a slide.
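The per-language voice override can be pictured as a lookup with a fallback: use the segment's voice for the language when one is set, otherwise the script's default. The helper `resolveVoice` below is hypothetical and only sketches that rule; how the compiler treats a language with no voice at either level is not specified here.

```go
package main

import "fmt"

// resolveVoice returns the segment-level voice for lang when one is set,
// and falls back to the script-level default otherwise.
// An empty result means no voice is configured for that language.
func resolveVoice(lang string, segmentVoice, defaultVoices map[string]string) string {
	if v := segmentVoice[lang]; v != "" {
		return v
	}
	return defaultVoices[lang]
}

func main() {
	defaults := map[string]string{"en": "voice-id", "es": "voice-id-2"}
	override := map[string]string{"en": "narrator-en"}

	fmt.Println(resolveVoice("en", override, defaults)) // segment override
	fmt.Println(resolveVoice("es", override, defaults)) // script default
}
```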
type Slide ¶
type Slide struct {
	// Title is the slide title (optional).
	Title string `json:"title,omitempty"`
	// Notes are speaker notes or comments (not rendered to audio).
	Notes string `json:"notes,omitempty"`
	// IsSectionHeader marks this slide as the start of a new section.
	// Section headers can have their titles spoken and use longer transition pauses.
	IsSectionHeader bool `json:"is_section_header,omitempty"`
	// SpeakTitle causes the slide title to be spoken before the segments.
	// If true, the title is converted to a segment. Defaults to true for section headers.
	SpeakTitle *bool `json:"speak_title,omitempty"`
	// TitleVoice overrides the voice used for speaking the title, by language.
	// If not set, uses the segment voice or default voice.
	TitleVoice map[string]string `json:"title_voice,omitempty"`
	// TitlePauseAfter is the pause after the spoken title (e.g., "500ms").
	// Defaults to "500ms" for section headers, "300ms" for regular slides.
	TitlePauseAfter string `json:"title_pause_after,omitempty"`
	// Segments are the audio segments for this slide.
	Segments []Segment `json:"segments"`
}
Slide represents a slide or section of the script.
func (*Slide) ShouldSpeakTitle ¶
ShouldSpeakTitle returns true if the slide title should be spoken. Returns true if SpeakTitle is explicitly true, or if the slide is a section header and SpeakTitle is not explicitly false.
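The rule stated above is a three-state decision driven by a `*bool`: unset, explicitly true, or explicitly false. A minimal standalone sketch follows, using a local `slide` type and a hypothetical `boolPtr` helper rather than the package's own code.

```go
package main

import "fmt"

// slide is a local stand-in with the two fields the rule depends on.
type slide struct {
	IsSectionHeader bool
	SpeakTitle      *bool // nil means "not set"
}

// shouldSpeakTitle lets an explicit value win; otherwise section headers speak their title.
func shouldSpeakTitle(s slide) bool {
	if s.SpeakTitle != nil {
		return *s.SpeakTitle
	}
	return s.IsSectionHeader
}

// boolPtr returns a pointer to b, which is convenient for optional fields.
func boolPtr(b bool) *bool { return &b }

func main() {
	cases := []slide{
		{IsSectionHeader: true},                             // unset, section header
		{IsSectionHeader: true, SpeakTitle: boolPtr(false)}, // explicitly disabled
		{SpeakTitle: boolPtr(true)},                         // explicitly enabled
		{},                                                  // unset, regular slide
	}
	for _, c := range cases {
		fmt.Println(shouldSpeakTitle(c))
	}
}
```

Using a pointer rather than a plain bool is what lets the JSON distinguish "speak_title omitted" from "speak_title": false.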
type TTSRequest ¶
type TTSRequest struct {
	VoiceID  string
	Text     string
	ModelID  string
	Segment  ElevenLabsSegment
	Language string
}
TTSRequest represents a request to the ElevenLabs TTS API. This is a simplified version for use with ttsscript.
func GenerateTTSRequests ¶
func GenerateTTSRequests(segments []ElevenLabsSegment, modelID, language string) []TTSRequest
GenerateTTSRequests creates TTS requests from formatted segments.