ttsscript

package
v0.8.1
Published: Feb 15, 2026 License: MIT Imports: 6 Imported by: 0

Documentation

Overview

Package ttsscript provides a structured format for authoring multilingual TTS (Text-to-Speech) scripts that can be compiled to various output formats.

This package is engine-agnostic and can be used with any TTS provider including ElevenLabs, Google Cloud TTS, Amazon Polly, Azure TTS, and others.

Why Use ttsscript?

Instead of storing raw SSML (which is engine-specific and hard to edit), store your scripts in a structured JSON format that:

  • Supports multiple languages in a single file
  • Handles pronunciations/acronyms separately from content
  • Can be compiled to any TTS engine format
  • Is easy to edit and version control

Basic Usage

Create a script JSON file:

{
  "title": "My Course",
  "default_voices": {"en": "voice-id", "es": "voice-id-2"},
  "pronunciations": {
    "API": {"en": "A P I", "es": "A P I"},
    "SDK": {"en": "S D K", "es": "S D K"}
  },
  "slides": [
    {
      "title": "Introduction",
      "segments": [
        {
          "text": {"en": "Welcome to the API course", "es": "Bienvenidos al curso de API"},
          "pause_after": "500ms"
        }
      ]
    }
  ]
}

Load and compile for ElevenLabs:

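script, err := ttsscript.LoadScript("script.json")
if err != nil {
    log.Fatal(err)
}
compiler := ttsscript.NewCompiler()
segments, err := compiler.Compile(script, "en")
if err != nil {
    log.Fatal(err)
}

formatter := ttsscript.NewElevenLabsFormatter()
jobs := formatter.Format(segments)

// client (an ElevenLabs API client) and ctx are assumed to be set up elsewhere.
for _, job := range jobs {
    // Generate TTS for each segment
    audio, err := client.TextToSpeech().Simple(ctx, job.VoiceID, job.Text)
    if err != nil {
        log.Fatal(err)
    }
    // Save audio with job.PauseBeforeMs and job.PauseAfterMs for post-processing
    _ = audio
}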

Compile to SSML for Google TTS:

formatter := ttsscript.NewSSMLFormatter()
ssml, err := formatter.FormatScript(script, "en")
if err != nil {
    log.Fatal(err)
}
// Use ssml with Google Cloud TTS API

Script Structure

A Script contains:

  • Metadata (title, description, default language)
  • Default voices per language
  • Global pronunciations
  • Slides/sections containing segments

Each Segment contains:

  • Text in multiple languages
  • Voice overrides per language
  • Pause before/after
  • Prosody settings (rate, pitch, emphasis)
  • Segment-specific pronunciations

Compilation Process

  1. Load the script from JSON
  2. Create a Compiler and optionally add additional pronunciations
  3. Compile for a specific language to get CompiledSegments
  4. Format the segments for your target TTS engine

Formatters

  • SSMLFormatter: Outputs W3C SSML compatible with Google, Amazon, Azure
  • ElevenLabsFormatter: Outputs segments ready for ElevenLabs TTS API

Pronunciation Handling

Pronunciations are applied at compile time with this priority:

  1. Compiler-level (added via AddPronunciation)
  2. Segment-level (in segment.pronunciations)
  3. Script-level (in script.pronunciations)

This allows overrides at any level. Terms are matched case-insensitively with word boundaries.
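
The priority order and matching rule above can be illustrated with a minimal, self-contained sketch. It does not use the package itself: resolve and applyPronunciation are hypothetical helpers, and the regular expression is only one plausible way to get case-insensitive, word-bounded matching; the package's internals may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// resolve returns the replacement for term, checking layers in priority
// order: layers[0] wins over layers[1], and so on.
// Hypothetical helper, not part of ttsscript.
func resolve(term string, layers ...map[string]string) (string, bool) {
	for _, layer := range layers {
		if r, ok := layer[term]; ok {
			return r, true
		}
	}
	return "", false
}

// applyPronunciation replaces whole-word, case-insensitive matches of term.
// Hypothetical helper, not part of ttsscript.
func applyPronunciation(text, term, replacement string) string {
	re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(term) + `\b`)
	return re.ReplaceAllLiteralString(text, replacement)
}

func main() {
	// Three layers for one language, highest priority first.
	compilerLevel := map[string]string{"API": "A P I"}
	segmentLevel := map[string]string{"API": "ay pee eye", "SDK": "S D K"}
	scriptLevel := map[string]string{"SDK": "software development kit"}

	text := "The api and the SDK. Rapid APIs."
	for _, term := range []string{"API", "SDK"} {
		if r, ok := resolve(term, compilerLevel, segmentLevel, scriptLevel); ok {
			text = applyPronunciation(text, term, r)
		}
	}
	fmt.Println(text) // prints "The A P I and the S D K. Rapid APIs."
}
```

Note that "Rapid" and "APIs" are left alone because the match must start and end on a word boundary.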

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CombineText

func CombineText(segments []CompiledSegment) string

CombineText combines all segment texts into a single string with pause markers.

func EscapeSSML

func EscapeSSML(s string) string

EscapeSSML escapes special characters for SSML.
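
As a rough idea of what such escaping involves, the sketch below replaces the five characters that are special in XML. escapeSSML here is a hypothetical stand-in written for illustration; which characters the package's EscapeSSML actually handles is not stated in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// xmlEscaper replaces the five characters that are special in XML.
// Each input position is replaced at most once, so "&" is never double-escaped.
var xmlEscaper = strings.NewReplacer(
	"&", "&amp;",
	"<", "&lt;",
	">", "&gt;",
	`"`, "&quot;",
	"'", "&apos;",
)

// escapeSSML is a hypothetical stand-in for ttsscript.EscapeSSML.
func escapeSSML(s string) string {
	return xmlEscaper.Replace(s)
}

func main() {
	fmt.Println(escapeSSML("Q&A: is 3 < 5?")) // prints "Q&amp;A: is 3 &lt; 5?"
}
```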

func FormatDuration

func FormatDuration(ms int) string

FormatDuration formats milliseconds as a duration string.

func GroupBySlide

func GroupBySlide(segments []CompiledSegment) map[int][]CompiledSegment

GroupBySlide groups compiled segments by slide index.

func GroupByVoice

func GroupByVoice(segments []CompiledSegment) map[string][]CompiledSegment

GroupByVoice groups compiled segments by voice ID. Useful for batch processing with the same voice.
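
Grouping by voice amounts to bucketing segments into a map keyed by voice ID. The sketch below shows the idea with a reduced stand-in type (segment) and a hypothetical groupByVoice function; it assumes the original order is preserved within each group, which this documentation does not confirm for the package.

```go
package main

import "fmt"

// segment is a reduced stand-in for CompiledSegment with only the
// fields this sketch needs.
type segment struct {
	VoiceID string
	Text    string
}

// groupByVoice buckets segments by voice ID, keeping input order
// within each bucket. Hypothetical helper, not part of ttsscript.
func groupByVoice(segs []segment) map[string][]segment {
	groups := make(map[string][]segment)
	for _, s := range segs {
		groups[s.VoiceID] = append(groups[s.VoiceID], s)
	}
	return groups
}

func main() {
	groups := groupByVoice([]segment{
		{VoiceID: "voice-a", Text: "Hello"},
		{VoiceID: "voice-b", Text: "Hola"},
		{VoiceID: "voice-a", Text: "Goodbye"},
	})
	fmt.Println(len(groups["voice-a"]), len(groups["voice-b"])) // prints "2 1"
}
```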

func ParseDuration

func ParseDuration(s string) int

ParseDuration parses a duration string like "500ms" or "1s" to milliseconds.
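
The conversion between duration strings and milliseconds can be sketched with the standard library. parseDurationMs and formatDurationMs below are hypothetical stand-ins for ParseDuration and FormatDuration; returning 0 for invalid input and always formatting as "Nms" are assumptions, since the package's exact behavior is not documented here.

```go
package main

import (
	"fmt"
	"time"
)

// parseDurationMs converts strings such as "500ms" or "1s" to milliseconds.
// Assumption: invalid input yields 0; the real function may behave differently.
func parseDurationMs(s string) int {
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0
	}
	return int(d.Milliseconds())
}

// formatDurationMs renders milliseconds as a string such as "500ms".
// Assumption: the output always uses the "ms" unit.
func formatDurationMs(ms int) string {
	return fmt.Sprintf("%dms", ms)
}

func main() {
	fmt.Println(parseDurationMs("500ms"), parseDurationMs("1s")) // prints "500 1000"
	fmt.Println(formatDurationMs(parseDurationMs("1s")))         // prints "1000ms"
}
```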

func SSMLBreak

func SSMLBreak(duration string) string

SSMLBreak generates an SSML break element.

func SSMLEmphasis

func SSMLEmphasis(text, level string) string

SSMLEmphasis wraps text in emphasis tags.

func SSMLPhoneme

func SSMLPhoneme(text, alphabet, ph string) string

SSMLPhoneme wraps text with phonetic pronunciation.

func SSMLProsody

func SSMLProsody(text, rate, pitch, volume string) string

SSMLProsody wraps text in prosody tags.

func SSMLSayAs

func SSMLSayAs(text, interpretAs, format string) string

SSMLSayAs wraps text in say-as tags for specific interpretation.

func SSMLSub

func SSMLSub(text, alias string) string

SSMLSub provides an alias for a word.
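
For readers unfamiliar with SSML, the sketch below shows the standard W3C elements that three of these helpers correspond to. ssmlBreak, ssmlSub, and ssmlEmphasis are hypothetical stand-ins written for illustration; the exact strings the package emits (attribute order, escaping, whitespace) may differ, and the inputs here are assumed to be already escaped.

```go
package main

import "fmt"

// ssmlBreak builds a <break> element with a time attribute.
func ssmlBreak(duration string) string {
	return fmt.Sprintf(`<break time="%s"/>`, duration)
}

// ssmlSub builds a <sub> element: alias is spoken in place of text.
func ssmlSub(text, alias string) string {
	return fmt.Sprintf(`<sub alias="%s">%s</sub>`, alias, text)
}

// ssmlEmphasis wraps text in an <emphasis> element with the given level.
func ssmlEmphasis(text, level string) string {
	return fmt.Sprintf(`<emphasis level="%s">%s</emphasis>`, level, text)
}

func main() {
	body := ssmlEmphasis("Welcome", "strong") +
		" to the " + ssmlSub("API", "A P I") + " course." +
		ssmlBreak("500ms")
	fmt.Println("<speak>" + body + "</speak>")
}
```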

Types

type BatchConfig

type BatchConfig struct {
	// OutputDir is the directory for output files.
	OutputDir string

	// FilePrefix is added before each filename.
	FilePrefix string

	// FileSuffix is added after each filename (before extension).
	FileSuffix string

	// IncludeLanguageInFilename adds language code to filename.
	IncludeLanguageInFilename bool
}

BatchConfig contains configuration for batch TTS processing.

func NewBatchConfig

func NewBatchConfig(outputDir string) *BatchConfig

NewBatchConfig creates a batch config with defaults.

func (*BatchConfig) GenerateFilename

func (c *BatchConfig) GenerateFilename(seg ElevenLabsSegment, language string) string

GenerateFilename generates an output filename for a segment.

type CompiledSegment

type CompiledSegment struct {
	// SlideIndex is the 0-based slide index.
	SlideIndex int

	// SegmentIndex is the 0-based segment index within the slide.
	// For title segments, this is -1.
	SegmentIndex int

	// SlideTitle is the slide title (if any).
	SlideTitle string

	// IsTitleSegment indicates this segment was generated from a slide title.
	IsTitleSegment bool

	// IsSectionHeader indicates this segment belongs to a section header slide.
	IsSectionHeader bool

	// Text is the processed text with pronunciations applied.
	Text string

	// OriginalText is the text before pronunciation substitutions.
	OriginalText string

	// VoiceID is the voice to use for this segment.
	VoiceID string

	// Language is the language code.
	Language string

	// PauseBeforeMs is the pause before in milliseconds.
	PauseBeforeMs int

	// PauseAfterMs is the pause after in milliseconds.
	PauseAfterMs int

	// Emphasis is the emphasis level.
	Emphasis string

	// Rate is the speaking rate.
	Rate string

	// Pitch is the pitch adjustment.
	Pitch string
}

CompiledSegment represents a compiled segment ready for TTS.

type Compiler

type Compiler struct {
	// AdditionalPronunciations are extra pronunciations to apply.
	AdditionalPronunciations map[string]map[string]string

	// DefaultPauseAfterSlide is the pause after each slide if not specified.
	DefaultPauseAfterSlide string

	// DefaultPauseAfterSegment is the pause after each segment if not specified.
	DefaultPauseAfterSegment string
}

Compiler compiles scripts to various output formats.

func NewCompiler

func NewCompiler() *Compiler

NewCompiler creates a new script compiler with default settings.

func (*Compiler) AddPronunciation

func (c *Compiler) AddPronunciation(term, language, replacement string)

AddPronunciation adds a pronunciation rule.

func (*Compiler) AddPronunciations

func (c *Compiler) AddPronunciations(language string, rules map[string]string)

AddPronunciations adds multiple pronunciation rules for a language.

func (*Compiler) Compile

func (c *Compiler) Compile(script *Script, language string) ([]CompiledSegment, error)

Compile compiles the script for the specified language. Returns a slice of compiled segments ready for TTS processing.

type ElevenLabsFormatter

type ElevenLabsFormatter struct {
	// UsePauseMarkers includes [pause:Xms] markers in text output.
	// When false, pauses are tracked separately for post-processing.
	UsePauseMarkers bool

	// PauseMarkerFormat is the format for pause markers (default: "[pause:%s]").
	PauseMarkerFormat string
}

ElevenLabsFormatter formats compiled segments for ElevenLabs TTS.
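
To make the pause-marker option more concrete, here is a minimal sketch of how the documented default format "[pause:%s]" could be applied when joining segment texts. The segment type and withPauseMarkers function are hypothetical; placing the marker after the text and rendering the duration as "Nms" are assumptions, not confirmed behavior of the formatter.

```go
package main

import (
	"fmt"
	"strings"
)

// segment is a reduced stand-in with only the fields this sketch needs.
type segment struct {
	Text         string
	PauseAfterMs int
}

// withPauseMarkers joins segment texts, appending a marker after each
// segment that has a non-zero pause. Hypothetical helper, not part of ttsscript.
func withPauseMarkers(segs []segment, markerFormat string) string {
	parts := make([]string, 0, len(segs))
	for _, s := range segs {
		part := s.Text
		if s.PauseAfterMs > 0 {
			pause := fmt.Sprintf("%dms", s.PauseAfterMs)
			part += " " + fmt.Sprintf(markerFormat, pause)
		}
		parts = append(parts, part)
	}
	return strings.Join(parts, " ")
}

func main() {
	out := withPauseMarkers([]segment{
		{Text: "Welcome to the A P I course.", PauseAfterMs: 500},
		{Text: "Let's begin."},
	}, "[pause:%s]")
	fmt.Println(out) // prints "Welcome to the A P I course. [pause:500ms] Let's begin."
}
```

When UsePauseMarkers is false, the same pause values would instead travel alongside the text (for example in PauseBeforeMs and PauseAfterMs) so that silence can be added during post-processing.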

func NewElevenLabsFormatter

func NewElevenLabsFormatter() *ElevenLabsFormatter

NewElevenLabsFormatter creates a new ElevenLabs formatter.

func (*ElevenLabsFormatter) CombineForSingleRequest

func (f *ElevenLabsFormatter) CombineForSingleRequest(segments []ElevenLabsSegment) string

CombineForSingleRequest combines segments into a single text block. Useful when you want to generate all audio in one API call. Note: This loses per-segment voice control.

func (*ElevenLabsFormatter) Format

func (f *ElevenLabsFormatter) Format(segments []CompiledSegment) []ElevenLabsSegment

Format formats compiled segments for ElevenLabs.

func (*ElevenLabsFormatter) FormatScript

func (f *ElevenLabsFormatter) FormatScript(script *Script, language string) ([]ElevenLabsSegment, error)

FormatScript compiles and formats a script for ElevenLabs.

func (*ElevenLabsFormatter) GroupByVoice

func (f *ElevenLabsFormatter) GroupByVoice(segments []ElevenLabsSegment) map[string][]ElevenLabsSegment

GroupByVoice groups segments by voice ID for batch processing.

type ElevenLabsSegment

type ElevenLabsSegment struct {
	// Text is the text to generate speech for.
	Text string

	// VoiceID is the ElevenLabs voice ID.
	VoiceID string

	// SlideIndex is the source slide index.
	SlideIndex int

	// SegmentIndex is the source segment index (-1 for title segments).
	SegmentIndex int

	// SlideTitle is the slide title for reference.
	SlideTitle string

	// IsTitleSegment indicates this segment was generated from a slide title.
	IsTitleSegment bool

	// IsSectionHeader indicates this segment belongs to a section header slide.
	IsSectionHeader bool

	// PauseBeforeMs is silence to add before this segment.
	PauseBeforeMs int

	// PauseAfterMs is silence to add after this segment.
	PauseAfterMs int

	// SuggestedFilename is a suggested output filename.
	SuggestedFilename string
}

ElevenLabsSegment represents a segment ready for ElevenLabs TTS.

type ManifestEntry

type ManifestEntry struct {
	SlideIndex      int    `json:"slide_index"`
	SegmentIndex    int    `json:"segment_index"`
	SlideTitle      string `json:"slide_title,omitempty"`
	IsTitleSegment  bool   `json:"is_title_segment,omitempty"`
	IsSectionHeader bool   `json:"is_section_header,omitempty"`
	Text            string `json:"text"`
	VoiceID         string `json:"voice_id"`
	Language        string `json:"language"`
	OutputFile      string `json:"output_file"`
	PauseBeforeMs   int    `json:"pause_before_ms,omitempty"`
	PauseAfterMs    int    `json:"pause_after_ms,omitempty"`
}

ManifestEntry represents an entry in a generation manifest.

func GenerateManifest

func GenerateManifest(segments []ElevenLabsSegment, config *BatchConfig, language string) []ManifestEntry

GenerateManifest creates a manifest of all segments for tracking.
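
The struct tags on ManifestEntry determine what a serialized manifest looks like. The sketch below copies those tags onto a local manifestEntry type and marshals one title-segment entry with encoding/json; the output file name is a made-up example, and marshalEntry is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// manifestEntry mirrors the documented ManifestEntry fields and JSON tags.
type manifestEntry struct {
	SlideIndex      int    `json:"slide_index"`
	SegmentIndex    int    `json:"segment_index"`
	SlideTitle      string `json:"slide_title,omitempty"`
	IsTitleSegment  bool   `json:"is_title_segment,omitempty"`
	IsSectionHeader bool   `json:"is_section_header,omitempty"`
	Text            string `json:"text"`
	VoiceID         string `json:"voice_id"`
	Language        string `json:"language"`
	OutputFile      string `json:"output_file"`
	PauseBeforeMs   int    `json:"pause_before_ms,omitempty"`
	PauseAfterMs    int    `json:"pause_after_ms,omitempty"`
}

// marshalEntry encodes one entry as compact JSON.
func marshalEntry(e manifestEntry) (string, error) {
	data, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	out, err := marshalEntry(manifestEntry{
		SegmentIndex:   -1, // title segments use -1
		SlideTitle:     "Introduction",
		IsTitleSegment: true,
		Text:           "Introduction",
		VoiceID:        "voice-id",
		Language:       "en",
		OutputFile:     "intro.mp3", // hypothetical file name
		PauseAfterMs:   300,
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out)
}
```

Fields tagged omitempty (such as is_section_header and pause_before_ms) are dropped when they hold their zero value, while slide_index and segment_index are always written.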

type SSMLFormatter

type SSMLFormatter struct {
	// Version is the SSML version (default: "1.1").
	Version string

	// IncludeComments includes slide title comments in output.
	IncludeComments bool

	// IndentSpaces is the number of spaces for indentation.
	IndentSpaces int
}

SSMLFormatter formats compiled segments as SSML. Compatible with Google Cloud TTS, Amazon Polly, Azure TTS, and others.

func NewSSMLFormatter

func NewSSMLFormatter() *SSMLFormatter

NewSSMLFormatter creates a new SSML formatter with default settings.

func (*SSMLFormatter) Format

func (f *SSMLFormatter) Format(segments []CompiledSegment, language string) string

Format formats compiled segments as SSML.

func (*SSMLFormatter) FormatScript

func (f *SSMLFormatter) FormatScript(script *Script, language string) (string, error)

FormatScript compiles and formats a script as SSML.

type Script

type Script struct {
	// Title is the script title.
	Title string `json:"title,omitempty"`

	// Description is an optional description.
	Description string `json:"description,omitempty"`

	// DefaultLanguage is the primary language code (e.g., "en-US").
	DefaultLanguage string `json:"default_language,omitempty"`

	// DefaultVoices maps language codes to default voice IDs.
	DefaultVoices map[string]string `json:"default_voices,omitempty"`

	// Pronunciations maps terms to their pronunciation by language.
	// Example: {"ADK": {"en": "A D K", "es": "A D K"}}
	Pronunciations map[string]map[string]string `json:"pronunciations,omitempty"`

	// Slides contains the ordered list of slides/sections.
	Slides []Slide `json:"slides"`
}

Script represents a multilingual TTS script with slides/segments. This is the canonical format for authoring TTS content that can be compiled to SSML (Google TTS, Amazon Polly) or ElevenLabs-compatible text.

func LoadScript

func LoadScript(filePath string) (*Script, error)

LoadScript loads a script from a JSON file.

func ParseScript

func ParseScript(data []byte) (*Script, error)

ParseScript parses a script from JSON data.
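
Because the script format is plain JSON with the field tags shown on Script, Slide, and Segment, parsing can be pictured as a single json.Unmarshal call. The sketch below uses reduced local copies of those types (script, slide, segment) and a hypothetical parseScript function; the package's ParseScript may do additional work that is not described here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Reduced stand-ins for Segment, Slide, and Script, using the documented tags.
type segment struct {
	Text       map[string]string `json:"text"`
	PauseAfter string            `json:"pause_after,omitempty"`
}

type slide struct {
	Title    string    `json:"title,omitempty"`
	Segments []segment `json:"segments"`
}

type script struct {
	Title          string                       `json:"title,omitempty"`
	DefaultVoices  map[string]string            `json:"default_voices,omitempty"`
	Pronunciations map[string]map[string]string `json:"pronunciations,omitempty"`
	Slides         []slide                      `json:"slides"`
}

// parseScript decodes JSON data into a script. Hypothetical helper.
func parseScript(data []byte) (*script, error) {
	var s script
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("parse script: %w", err)
	}
	return &s, nil
}

// sample is a shortened version of the script from Basic Usage.
const sample = `{
  "title": "My Course",
  "default_voices": {"en": "voice-id", "es": "voice-id-2"},
  "pronunciations": {"API": {"en": "A P I", "es": "A P I"}},
  "slides": [{
    "title": "Introduction",
    "segments": [{
      "text": {"en": "Welcome to the API course", "es": "Bienvenidos al curso de API"},
      "pause_after": "500ms"
    }]
  }]
}`

func main() {
	s, err := parseScript([]byte(sample))
	if err != nil {
		log.Fatal(err)
	}
	seg := s.Slides[0].Segments[0]
	fmt.Println(s.Title, len(s.Slides), seg.Text["es"], seg.PauseAfter)
}
```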

func (*Script) Languages

func (s *Script) Languages() []string

Languages returns all language codes used in the script.

func (*Script) Save

func (s *Script) Save(filePath string) error

Save saves a script to a JSON file.

func (*Script) SegmentCount

func (s *Script) SegmentCount() int

SegmentCount returns the total number of segments across all slides.

func (*Script) SlideCount

func (s *Script) SlideCount() int

SlideCount returns the number of slides.

func (*Script) Validate

func (s *Script) Validate() []string

Validate checks the script for common issues.

type Segment

type Segment struct {
	// Text contains the text content by language code.
	// Example: {"en": "Hello world", "es": "Hola mundo"}
	Text map[string]string `json:"text"`

	// Voice overrides the default voice for this segment by language.
	// Example: {"en": "voice-id-1", "es": "voice-id-2"}
	Voice map[string]string `json:"voice,omitempty"`

	// PauseBefore is the pause duration before this segment (e.g., "500ms", "1s").
	PauseBefore string `json:"pause_before,omitempty"`

	// PauseAfter is the pause duration after this segment (e.g., "500ms", "1s").
	PauseAfter string `json:"pause_after,omitempty"`

	// Emphasis indicates the emphasis level ("strong", "moderate", "reduced").
	Emphasis string `json:"emphasis,omitempty"`

	// Rate is the speaking rate ("slow", "medium", "fast", or percentage like "80%").
	Rate string `json:"rate,omitempty"`

	// Pitch adjusts the pitch ("low", "medium", "high", or percentage like "+10%").
	Pitch string `json:"pitch,omitempty"`

	// Pronunciations are segment-specific pronunciation overrides.
	Pronunciations map[string]map[string]string `json:"pronunciations,omitempty"`
}

Segment represents a single audio segment within a slide.

type Slide

type Slide struct {
	// Title is the slide title (optional).
	Title string `json:"title,omitempty"`

	// Notes are speaker notes or comments (not rendered to audio).
	Notes string `json:"notes,omitempty"`

	// IsSectionHeader marks this slide as the start of a new section.
	// Section headers can have their titles spoken and use longer transition pauses.
	IsSectionHeader bool `json:"is_section_header,omitempty"`

	// SpeakTitle causes the slide title to be spoken before the segments.
	// If true, the title is converted to a segment. Defaults to true for section headers.
	SpeakTitle *bool `json:"speak_title,omitempty"`

	// TitleVoice overrides the voice used for speaking the title, by language.
	// If not set, uses the segment voice or default voice.
	TitleVoice map[string]string `json:"title_voice,omitempty"`

	// TitlePauseAfter is the pause after the spoken title (e.g., "500ms").
	// Defaults to "500ms" for section headers, "300ms" for regular slides.
	TitlePauseAfter string `json:"title_pause_after,omitempty"`

	// Segments are the audio segments for this slide.
	Segments []Segment `json:"segments"`
}

Slide represents a slide or section of the script.

func (*Slide) ShouldSpeakTitle

func (s *Slide) ShouldSpeakTitle() bool

ShouldSpeakTitle returns true if the slide title should be spoken. Returns true if SpeakTitle is explicitly true, or if the slide is a section header and SpeakTitle is not explicitly false.
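
The rule above is a tri-state check on a *bool: an explicit value wins, and an unset value falls back to IsSectionHeader. A minimal sketch with a reduced stand-in type (slide) and a hypothetical shouldSpeakTitle function:

```go
package main

import "fmt"

// slide is a reduced stand-in for Slide with only the relevant fields.
type slide struct {
	IsSectionHeader bool
	SpeakTitle      *bool // nil means "not set"
}

// shouldSpeakTitle follows the documented rule: an explicit SpeakTitle
// wins; otherwise the title is spoken only for section headers.
func shouldSpeakTitle(s slide) bool {
	if s.SpeakTitle != nil {
		return *s.SpeakTitle
	}
	return s.IsSectionHeader
}

func main() {
	yes, no := true, false
	fmt.Println(shouldSpeakTitle(slide{}))                                       // false
	fmt.Println(shouldSpeakTitle(slide{IsSectionHeader: true}))                  // true
	fmt.Println(shouldSpeakTitle(slide{SpeakTitle: &yes}))                       // true
	fmt.Println(shouldSpeakTitle(slide{IsSectionHeader: true, SpeakTitle: &no})) // false
}
```

Using a pointer lets the JSON distinguish an omitted speak_title from an explicit false, which a plain bool could not do.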

type TTSRequest

type TTSRequest struct {
	VoiceID  string
	Text     string
	ModelID  string
	Segment  ElevenLabsSegment
	Language string
}

TTSRequest represents a request to the ElevenLabs TTS API. This is a simplified version for use with ttsscript.

func GenerateTTSRequests

func GenerateTTSRequests(segments []ElevenLabsSegment, modelID, language string) []TTSRequest

GenerateTTSRequests creates TTS requests from formatted segments.
