md2audio

Version: v0.1.1
Published: Oct 24, 2025 License: MIT


Convert Markdown H2 sections into individual audio files using one of several TTS (Text-to-Speech) providers: the macOS say command or the ElevenLabs API.

Features

  • Multiple TTS Providers: macOS say command and ElevenLabs API
  • Process files or directories recursively with structure mirroring
  • Target duration control: Adjust timing with annotations like (8s)
  • Multiple formats: AIFF, M4A, and MP3 output
  • Voice caching: Fast lookups with SQLite WAL mode
  • Developer-friendly: Debug mode, dry-run preview, progress indicators

Prerequisites

For macOS say Provider

  • macOS (uses built-in say command)
  • Go 1.25 or later (to build the tool)

For ElevenLabs Provider

  • Any OS (Windows, macOS, Linux)
  • Go 1.25 or later (to build the tool)
  • ElevenLabs API key (obtained from ElevenLabs)
  • Set ELEVENLABS_API_KEY environment variable or create .env file

Installation

Using go install

go install github.com/indaco/md2audio/cmd/md2audio@latest

Building from source

git clone https://github.com/indaco/md2audio.git
cd md2audio
go build -o md2audio ./cmd/md2audio

The binary will be created in the current directory. You can move it to a location in your PATH:

sudo mv md2audio /usr/local/bin/

TTS Providers

md2audio supports multiple Text-to-Speech providers. Choose the one that best fits your needs:

macOS say (Default)

  • Platform: macOS only
  • Cost: Free (built-in)
  • Setup: No configuration needed
  • Quality: Good for local development and testing
  • Formats: AIFF, M4A
  • Voices: ~70 voices in various languages

ElevenLabs

  • Platform: Cross-platform (works on any OS)
  • Cost: Paid API (see ElevenLabs pricing)
  • Setup: Requires API key
  • Quality: Premium, highly realistic voices
  • Formats: MP3
  • Voices: Multiple professional voices with emotional control

Setting up ElevenLabs

  1. Get your API key from ElevenLabs

  2. Set the environment variable:

    export ELEVENLABS_API_KEY='your-api-key'
    
  3. Or create a .env file in your project directory:

    # Copy the example file
    cp .env.example .env
    # Then edit .env and add your API key
    

    Or create it directly:

    echo 'ELEVENLABS_API_KEY=your-api-key' > .env
    
  4. (Optional) Configure voice settings in .env:

    # Voice quality settings (all optional, with sensible defaults)
    ELEVENLABS_STABILITY=0.5              # Voice consistency (0.0-1.0, default: 0.5)
    ELEVENLABS_SIMILARITY_BOOST=0.5       # Voice similarity (0.0-1.0, default: 0.5)
    ELEVENLABS_STYLE=0.0                  # Voice style/emotion (0.0-1.0, default: 0.0)
    ELEVENLABS_USE_SPEAKER_BOOST=true     # Boost similarity (true/false, default: true)
    ELEVENLABS_SPEED=1.0                  # Default speed for non-timed sections (0.7-1.2, default: 1.0)
    

    Note:

    • ELEVENLABS_SPEED only applies to sections WITHOUT timing annotations
    • Sections with (5s) timing will calculate speed automatically
    • Higher stability = more consistent but less expressive
    • Higher similarity_boost = closer to original voice characteristics
    • Style adds emotional range (0 = disabled, higher = more expressive)
  5. List available voices:

    ./md2audio -provider elevenlabs -list-voices
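
The .env files shown in the steps above are plain KEY=VALUE lines with optional comments. As a rough illustration of how such a file could be read, here is a minimal Go sketch. The function name parseDotEnv and its rules (a " #" sequence starts an inline comment, surrounding quotes are stripped) are assumptions made for this example, not the behavior of md2audio's own env package.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDotEnv is a hypothetical helper: it reads KEY=VALUE lines, skips
// blank lines and full-line comments, and drops trailing " # ..." comments.
func parseDotEnv(content string) map[string]string {
	vars := make(map[string]string)
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue // not a KEY=VALUE line
		}
		// Assumption: a space followed by '#' starts an inline comment.
		if i := strings.Index(value, " #"); i >= 0 {
			value = value[:i]
		}
		// Remove surrounding whitespace and single or double quotes.
		value = strings.Trim(strings.TrimSpace(value), `'"`)
		vars[strings.TrimSpace(key)] = value
	}
	return vars
}

func main() {
	env := parseDotEnv("# Voice settings\nELEVENLABS_API_KEY=your-api-key\nELEVENLABS_SPEED=1.1   # faster\n")
	fmt.Println(env["ELEVENLABS_API_KEY"], env["ELEVENLABS_SPEED"])
}
```

A real loader would also need to decide whether values already set in the process environment take precedence over the file; the sketch leaves that choice open.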
    

Usage

Basic Examples

Using macOS say Provider (Default)

# Check version
./md2audio -version

# List available voices for say provider
./md2audio -list-voices

# Process a single markdown file with voice preset
./md2audio -f script.md -p british-female

# Process entire directory recursively
./md2audio -d ./docs -p british-female

# Use specific voice with slower rate for clarity
./md2audio -f script.md -v Kate -r 170

# Generate M4A files instead of AIFF
./md2audio -d ./content -p british-female -format m4a

# Custom output directory and prefix
./md2audio -f script.md -o ./voiceovers -prefix demo

# Preview what would be generated (dry-run mode)
./md2audio -f script.md -p british-female -dry-run

# Enable debug logging to troubleshoot issues
./md2audio -f script.md -p british-female -debug

# Combine dry-run with debug for detailed preview
./md2audio -d ./docs -p british-female -dry-run -debug

Using ElevenLabs Provider

# List available ElevenLabs voices (cached for faster access)
./md2audio -provider elevenlabs -list-voices

# Refresh voice cache (when new voices are available)
./md2audio -provider elevenlabs -list-voices -refresh-cache

# Export voices to JSON for reference
./md2audio -provider elevenlabs -export-voices elevenlabs_voices.json

# Process a single file with ElevenLabs
./md2audio -provider elevenlabs \
  -elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \
  -f script.md

# Process entire directory with ElevenLabs
./md2audio -provider elevenlabs \
  -elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \
  -d ./docs \
  -o ./audio_output

# Use specific ElevenLabs model
./md2audio -provider elevenlabs \
  -elevenlabs-voice-id YOUR_VOICE_ID \
  -elevenlabs-model eleven_multilingual_v2 \
  -f script.md

Debug Mode

Enable debug logging to troubleshoot issues or understand what's happening under the hood:

# Enable debug logging
./md2audio -f script.md -p british-female -debug

Debug mode shows:

  • Cache hits/misses for voice lookups
  • API request details (ElevenLabs)
  • File processing progress
  • Internal operation details

When to use debug mode:

  • Troubleshooting API issues with ElevenLabs
  • Understanding cache behavior
  • Investigating performance problems
  • Reporting bugs with detailed logs

Dry-Run Mode

Preview what would be generated without creating any audio files:

# Dry-run mode - shows what would be generated
./md2audio -f script.md -p british-female -dry-run

# Combine with debug for maximum visibility
./md2audio -d ./docs -provider elevenlabs -elevenlabs-voice-id YOUR_ID -dry-run -debug

Dry-run mode shows:

  • Which sections would be processed
  • Output file paths that would be created
  • Timing information for timed sections
  • Preview of text content

When to use dry-run mode:

  • Testing markdown format before generation
  • Verifying output paths and filenames
  • Checking section count and structure
  • Planning batch processing jobs

Example output:

💡 DRY-RUN MODE: No files will be created

ℹ Section 1/3:
  - title: Introduction
  💡 Target duration: 8.0 seconds
  💡 Text: Welcome to this demonstration...
  Would create: ./audio_sections/section_01_introduction.aiff

ℹ Section 2/3:
  - title: Main Content
  💡 Text: Here is the main content...
  Would create: ./audio_sections/section_02_main_content.aiff

✔ Would generate 3 audio files

Voice Caching

To improve performance, md2audio caches voice lists from providers. This is especially useful for ElevenLabs to avoid repeated API calls:

# First call - fetches from API and caches (slower)
./md2audio -provider elevenlabs -list-voices

# Subsequent calls - uses cache (instant)
./md2audio -provider elevenlabs -list-voices

# Force refresh when new voices are available
./md2audio -provider elevenlabs -list-voices -refresh-cache

# Export cached voices to JSON file for reference
./md2audio -provider elevenlabs -export-voices elevenlabs_voices.json
./md2audio -provider say -export-voices say_voices.json

Cache Details:

  • Location: ~/.md2audio/voice_cache.db (SQLite database)
  • Duration: 30 days (voices don't change frequently)
  • Benefits: Instant voice listing, reduced API calls, offline access to voice list
  • Refresh: Use -refresh-cache flag when you know new voices are available
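
The cache rules above (a 30-day lifetime, with -refresh-cache forcing a new fetch) come down to a freshness check. The Go sketch below shows one way to express that decision. The names isFresh and shouldFetch, and the use of Unix timestamps in seconds, are assumptions for illustration; the README does not describe the cache schema.

```go
package main

import (
	"fmt"
	"time"
)

// cacheTTL mirrors the 30-day cache duration stated above.
const cacheTTL = 30 * 24 * time.Hour

// isFresh reports whether a voice list fetched at fetchedAtUnix is still
// usable at nowUnix. Both arguments are Unix timestamps in seconds
// (an assumption about how the fetch time would be stored).
func isFresh(fetchedAtUnix, nowUnix int64) bool {
	age := time.Duration(nowUnix-fetchedAtUnix) * time.Second
	return age >= 0 && age < cacheTTL
}

// shouldFetch combines the TTL check with a refresh flag such as -refresh-cache.
func shouldFetch(fetchedAtUnix, nowUnix int64, refresh bool) bool {
	return refresh || !isFresh(fetchedAtUnix, nowUnix)
}

func main() {
	now := time.Now().Unix()
	tenDaysAgo := now - 10*24*60*60
	fmt.Println(shouldFetch(tenDaysAgo, now, false)) // false: cache is still fresh
	fmt.Println(shouldFetch(tenDaysAgo, now, true))  // true: refresh was requested
}
```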

Command Line Options

General Options

| Flag | Description | Default |
| --- | --- | --- |
| -f | Input markdown file (use -f or -d) | - |
| -d | Input directory (recursive, use -f or -d) | - |
| -o | Output directory | ./audio_sections |
| -format | Output format | aiff |
| -prefix | Filename prefix | section |
| -list-voices | List all available voices (uses cache if available) | - |
| -refresh-cache | Force refresh of voice cache | false |
| -export-voices | Export cached voices to JSON file | - |
| -provider | TTS provider (say or elevenlabs) | say |
| -version | Print version and exit | - |
| -debug | Enable debug logging | false |
| -dry-run | Show what would be generated without creating files | false |

macOS say Provider Options

| Flag | Description | Default |
| --- | --- | --- |
| -p | Voice preset (see Voice Presets below) | Kate (if not set) |
| -v | Specific voice name (overrides -p) | - |
| -r | Speaking rate (lower = slower) | 180 |

ElevenLabs Provider Options

| Flag | Description | Default |
| --- | --- | --- |
| -elevenlabs-voice-id | ElevenLabs voice ID (required) | - |
| -elevenlabs-model | ElevenLabs model ID | eleven_multilingual_v2 |
| -elevenlabs-api-key | ElevenLabs API key (prefer env var) | ELEVENLABS_API_KEY env |

Voice Presets

  • british-female -> Kate
  • british-male -> Daniel
  • us-female -> Samantha
  • us-male -> Alex
  • australian-female -> Karen
  • indian-female -> Veena
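
The preset list above, together with the option notes that -v overrides -p and that Kate is used when no preset is set, can be summarized as a small lookup. This Go sketch is illustrative only: resolveVoice is a hypothetical name, and returning an error for an unknown preset is an assumption about behavior the README does not specify.

```go
package main

import "fmt"

// presets copies the preset-to-voice mapping listed above.
var presets = map[string]string{
	"british-female":    "Kate",
	"british-male":      "Daniel",
	"us-female":         "Samantha",
	"us-male":           "Alex",
	"australian-female": "Karen",
	"indian-female":     "Veena",
}

// defaultVoice is the voice used when neither -v nor -p is given.
const defaultVoice = "Kate"

// resolveVoice picks the voice name: an explicit voice (-v) wins over a
// preset (-p), and an empty preset falls back to the default voice.
func resolveVoice(voice, preset string) (string, error) {
	if voice != "" {
		return voice, nil
	}
	if preset == "" {
		return defaultVoice, nil
	}
	if v, ok := presets[preset]; ok {
		return v, nil
	}
	// Assumption: an unknown preset is reported as an error.
	return "", fmt.Errorf("unknown voice preset %q", preset)
}

func main() {
	v, _ := resolveVoice("", "british-male")
	fmt.Println(v) // Daniel
}
```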

ElevenLabs Voice Settings

ElevenLabs voice quality can be fine-tuned using environment variables. All settings are optional and have sensible defaults:

| Setting | Range | Default | Description |
| --- | --- | --- | --- |
| ELEVENLABS_STABILITY | 0.0-1.0 | 0.5 | Voice consistency. Higher = more consistent but less expressive |
| ELEVENLABS_SIMILARITY_BOOST | 0.0-1.0 | 0.5 | Voice similarity to original. Higher = closer to voice characteristics |
| ELEVENLABS_STYLE | 0.0-1.0 | 0.0 | Emotional range. 0 = disabled, higher = more expressive |
| ELEVENLABS_USE_SPEAKER_BOOST | true/false | true | Boost similarity of synthesized speech |
| ELEVENLABS_SPEED | 0.7-1.2 | 1.0 | Default speaking speed (only for sections without timing annotations) |

Speed Behavior:

  • Sections with timing annotations like ## Scene 1 (5s) → Speed is calculated automatically to fit duration
  • Sections without timing annotations → Uses ELEVENLABS_SPEED setting (default: 1.0)

Example .env configuration:

ELEVENLABS_API_KEY=your-api-key
ELEVENLABS_STABILITY=0.7           # More consistent voice
ELEVENLABS_SIMILARITY_BOOST=0.8    # Closer to original voice
ELEVENLABS_STYLE=0.3               # Slight emotional variation
ELEVENLABS_SPEED=1.1               # 10% faster for non-timed sections
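
To make the ranges and defaults in the settings table concrete, the following Go sketch reads each numeric setting and falls back to its documented default. The floatSetting type and loadFloat function are hypothetical, and treating out-of-range or unparsable values as "use the default" is an assumption; the tool itself might clamp or reject such values instead.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// floatSetting describes one numeric setting from the table above.
type floatSetting struct {
	name     string
	min, max float64
	def      float64
}

var settings = []floatSetting{
	{"ELEVENLABS_STABILITY", 0.0, 1.0, 0.5},
	{"ELEVENLABS_SIMILARITY_BOOST", 0.0, 1.0, 0.5},
	{"ELEVENLABS_STYLE", 0.0, 1.0, 0.0},
	{"ELEVENLABS_SPEED", 0.7, 1.2, 1.0},
}

// loadFloat returns the value of s from lookup, or the default when the
// variable is unset, unparsable, or outside [min, max] (an assumption).
func loadFloat(s floatSetting, lookup func(string) (string, bool)) float64 {
	raw, ok := lookup(s.name)
	if !ok {
		return s.def
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil || !(v >= s.min && v <= s.max) {
		return s.def
	}
	return v
}

func main() {
	// os.LookupEnv has the same signature as the lookup parameter.
	for _, s := range settings {
		fmt.Printf("%s=%.2f\n", s.name, loadFloat(s, os.LookupEnv))
	}
}
```

Passing the lookup function as a parameter keeps the logic testable without touching the real process environment.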

Markdown Format

The script expects H2 headers (##) to denote sections. You can optionally specify target duration for each section:

## Scene 1: Introduction (8s)

This is the content for scene 1. It will be converted to audio that lasts approximately 8 seconds.

## Scene 2: Main Demo (12s)

This is the content for scene 2. The speaking rate will be automatically adjusted to fit 12 seconds.

## Scene 3: Conclusion

This section has no timing specified, so it will use the default speaking rate (-r flag).
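
As an illustration of the format described above, this Go sketch splits a markdown document into sections at H2 headers and skips sections with no content. The section type and splitSections function are hypothetical names for this example; md2audio's parser package may work differently.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// section holds one H2 header and the text that follows it.
type section struct {
	Title string
	Body  string
}

// splitSections starts a new section at every line beginning with "## "
// and drops sections whose body is empty. Text before the first H2 is ignored.
func splitSections(markdown string) []section {
	var sections []section
	var title string
	var body []string
	inSection := false

	flush := func() {
		text := strings.TrimSpace(strings.Join(body, "\n"))
		if inSection && text != "" {
			sections = append(sections, section{Title: title, Body: text})
		}
		body = nil
	}

	sc := bufio.NewScanner(strings.NewReader(markdown))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "## ") {
			flush()
			title = strings.TrimSpace(strings.TrimPrefix(line, "## "))
			inSection = true
			continue
		}
		if inSection {
			body = append(body, line)
		}
	}
	flush()
	return sections
}

func main() {
	doc := "## Scene 1: Introduction (8s)\n\nWelcome.\n\n## Scene 3: Conclusion\n\nGoodbye.\n"
	for i, s := range splitSections(doc) {
		fmt.Printf("%d: %s\n", i+1, s.Title)
	}
}
```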

Timing Formats Supported

  • (8s) - Target duration of 8 seconds
  • (10.5s) - Target duration of 10.5 seconds
  • (0-8s) - Range format, uses end time (8 seconds)
  • (15 seconds) - Also works with "seconds" spelled out
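
The four annotation formats listed above can be recognized with a single regular expression. The Go sketch below is one possible approach, not the tool's actual parser: parseTiming and the pattern are assumptions, including the requirement that the annotation sit at the end of the header.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// timingRe matches a trailing annotation such as (8s), (10.5s), (0-8s) or
// (15 seconds). For the range form, only the end time is captured.
var timingRe = regexp.MustCompile(`\((?:\d+(?:\.\d+)?\s*-\s*)?(\d+(?:\.\d+)?)\s*(?:seconds|s)\)\s*$`)

// parseTiming returns the target duration in seconds, or false when the
// header carries no timing annotation.
func parseTiming(title string) (float64, bool) {
	m := timingRe.FindStringSubmatch(title)
	if m == nil {
		return 0, false
	}
	secs, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, false
	}
	return secs, true
}

func main() {
	for _, h := range []string{"Scene 1 (8s)", "Scene 2 (0-8s)", "Scene 3"} {
		secs, ok := parseTiming(h)
		fmt.Println(h, secs, ok)
	}
}
```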

How it works (macOS say provider only):

  • The script counts the words in your text
  • Calculates the required words-per-minute (WPM) to fit the target duration
  • Automatically adjusts the speaking rate for that section
  • Shows you the actual duration vs target after generation

Important Notes:

  • Timing is supported with both providers, but with different accuracy:

    • macOS say provider: Uses -r (rate) parameter for speed control

      • Very wide range of speaking rates (90-360 wpm)
      • Actual duration may differ from target (typically within 1-3 seconds)
      • Applies 0.95 adjustment factor for better accuracy
    • ElevenLabs provider: Uses speed parameter (NEW!)

      • Limited range: 0.7x (slower) to 1.2x (faster) of natural pace
      • More accurate natural-sounding speech
      • If the target duration requires a speed outside this range, the speed is clamped to the nearest limit, so the audio runs longer or shorter than the target
      • Example: 5s target → 5.75s actual (within 15% for typical content)
  • Timing accuracy tip: Test with your content and adjust timing annotations as needed. For very tight timing requirements, consider the say provider's wider speed range.
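
The effect of the 0.7x to 1.2x limit can be worked through numerically. The Go sketch below assumes the speed is chosen as natural duration divided by target duration; that formula, the function name speedFor, and the 6.9-second natural duration in the example are all assumptions for illustration, since the README does not state how the speed is derived.

```go
package main

import (
	"fmt"
	"math"
)

const (
	minSpeed = 0.7 // documented lower limit
	maxSpeed = 1.2 // documented upper limit
)

// speedFor returns the clamped speed multiplier and the duration that
// results from it. naturalSeconds is the clip length at speed 1.0;
// targetSeconds must be positive.
func speedFor(naturalSeconds, targetSeconds float64) (speed, actual float64) {
	speed = naturalSeconds / targetSeconds
	speed = math.Max(minSpeed, math.Min(maxSpeed, speed))
	return speed, naturalSeconds / speed
}

func main() {
	// A hypothetical clip that takes 6.9 s at natural pace, with a 5 s target:
	// the required speed of 1.38 exceeds the limit and is clamped to 1.2.
	speed, actual := speedFor(6.9, 5.0)
	fmt.Printf("speed=%.2f actual=%.2fs\n", speed, actual)
}
```

Under these assumptions the clip overshoots the 5-second target by 15%, which illustrates why tight timings may be easier to hit with the say provider's wider range.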

Directory Processing

Process entire directory trees recursively with the -d flag:

./md2audio -d ./docs -p british-female -o ./audio_output

Input structure:

docs/
├── intro.md
├── chapter1/
│   ├── part1.md
│   └── part2.md
└── chapter2/
    └── overview.md

Output structure (mirrors input):

audio_output/
├── intro/
│   ├── section_01_welcome.aiff
│   └── section_02_overview.aiff
├── chapter1/
│   ├── part1/
│   │   ├── section_01_title.aiff
│   │   └── section_02_title.aiff
│   └── part2/
│       └── section_01_title.aiff
└── chapter2/
    └── overview/
        └── section_01_title.aiff

Key features:

  • Processes all .md files recursively
  • Creates mirror directory structure
  • Each markdown file gets its own subdirectory
  • Preserves folder hierarchy from input
  • Continues processing even if individual files fail
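
The mirroring described above maps each markdown file to an output subdirectory named after the file, at the same relative position. A small Go sketch of that path mapping follows; outputDirFor is a hypothetical name and the code only reflects the layout shown in the trees above, not the tool's processor package.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// outputDirFor returns the directory that should receive the audio files
// for mdPath: the file's path relative to inputRoot, without its extension,
// placed under outputRoot.
func outputDirFor(inputRoot, mdPath, outputRoot string) (string, error) {
	rel, err := filepath.Rel(inputRoot, mdPath)
	if err != nil {
		return "", err
	}
	rel = strings.TrimSuffix(rel, filepath.Ext(rel))
	return filepath.Join(outputRoot, rel), nil
}

func main() {
	files := []string{"docs/intro.md", "docs/chapter1/part1.md", "docs/chapter2/overview.md"}
	for _, f := range files {
		dir, err := outputDirFor("docs", f, "audio_output")
		if err != nil {
			fmt.Println("skip:", err) // keep going when one file fails
			continue
		}
		fmt.Println(f, "->", dir)
	}
}
```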

Example with examples folder:

# Process the included examples
./md2audio -d ./examples -p british-female -format m4a

# Results in organized audio files matching the examples structure

Output

Files are named using the pattern:

{prefix}_{number}_{sanitized_title}.{format}

Example outputs:

  • section_01_scene_1_introduction.aiff
  • section_02_scene_2_main_demo.aiff
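
The naming pattern above can be reproduced with a short formatting function. In this Go sketch the sanitization rule (lowercase, runs of other characters collapsed to a single underscore) is an assumption chosen to match the two example outputs, and it expects the timing annotation to have been removed from the title beforehand. The names sanitizeTitle and outputName are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonAlnum matches runs of characters that are not lowercase letters or digits.
var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// sanitizeTitle lowercases the title and replaces every run of other
// characters with one underscore, trimming underscores at both ends.
func sanitizeTitle(title string) string {
	s := nonAlnum.ReplaceAllString(strings.ToLower(title), "_")
	return strings.Trim(s, "_")
}

// outputName builds {prefix}_{number}_{sanitized_title}.{format} with a
// two-digit section number.
func outputName(prefix string, number int, title, format string) string {
	return fmt.Sprintf("%s_%02d_%s.%s", prefix, number, sanitizeTitle(title), format)
}

func main() {
	fmt.Println(outputName("section", 1, "Scene 1: Introduction", "aiff"))
	// section_01_scene_1_introduction.aiff
}
```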

Tips for Video Editing

  1. Generate separate files per section (this is automatic)
  2. Add timing to your markdown headers to match your screen recording
  3. Import all audio files into your video editing software
  4. Place each audio clip on the timeline where needed
  5. The audio is generated to fit your specified durations; compare actual and target durations, since they can differ by a few seconds

Timing Tips

  • Be realistic: Very short durations with lots of text will sound rushed
  • Test first: Generate one section to verify the pacing feels natural
  • Adjust if needed: If timing is off, adjust the duration in your markdown and regenerate
  • Word count matters: ~2-3 words per second is natural speech
  • Override if needed: The -r flag still works for sections without timing

Troubleshooting

Voice not found:

  • Run ./md2audio -list-voices to see available voices
  • Use the exact voice name with -v flag

No sections found:

  • Ensure your markdown uses ## for headers (H2)
  • Check there's content after each header

Audio quality:

  • AIFF format is higher quality but larger
  • M4A format is compressed and smaller
  • Adjust rate with -r flag for clarity

Example Workflow

# 1. Check your markdown format
cat examples/demo_script.md

# 2. List available voices
./md2audio -list-voices

# 3. Generate audio files
./md2audio -f examples/demo_script.md -p british-female -r 175 -format m4a

# 4. Import the files from ./audio_sections into your video editor

Notes

  • The script automatically cleans markdown formatting (links, bold, italic)
  • Empty sections are skipped
  • Section titles are sanitized for safe filenames
  • Speaking rate default is 180 (macOS default is 200)
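
The first note above says markdown formatting such as links, bold, and italic is cleaned before speech generation. As a rough idea of what that cleaning can look like, here is a Go sketch using regular expressions. The cleanMarkdown function and its patterns are assumptions that cover only asterisk-style emphasis and inline links; the tool's text package may handle more cases.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	linkRe   = regexp.MustCompile(`\[([^\]]+)\]\([^)]*\)`) // [text](url)
	boldRe   = regexp.MustCompile(`\*\*([^*]+)\*\*`)       // **text**
	italicRe = regexp.MustCompile(`\*([^*]+)\*`)           // *text*
)

// cleanMarkdown keeps the visible text of links and emphasis and drops
// the markup. Bold is handled before italic so "**" is not half-consumed.
func cleanMarkdown(s string) string {
	s = linkRe.ReplaceAllString(s, "${1}")
	s = boldRe.ReplaceAllString(s, "${1}")
	s = italicRe.ReplaceAllString(s, "${1}")
	return s
}

func main() {
	fmt.Println(cleanMarkdown("See **bold**, *italic* and [the docs](https://example.com)."))
	// See bold, italic and the docs.
}
```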

For Developers

Interested in contributing or understanding the codebase?

See the Contributing Guide for detailed information about:

  • Project architecture and package organization
  • Development tools and workflow
  • Code quality standards
  • Setting up your development environment

Contributing

Contributions are welcome!

See the Contributing Guide for setup instructions.

License

This project is licensed under the MIT License – see the LICENSE file for details.

Directories

| Path | Synopsis |
| --- | --- |
| cmd/md2audio | command |
| internal/audio | Package audio provides audio generation orchestration. |
| internal/cache | Package cache provides SQLite-based caching for TTS provider voice lists. |
| internal/cli | Package cli provides command-line interface utilities. |
| internal/config | Package config provides configuration management for the md2audio CLI. |
| internal/env | Package env provides environment variable loading from .env files. |
| internal/logger | Package logger provides structured, color-coded console logging. |
| internal/parser | Package parser provides markdown file parsing and section extraction functionality. |
| internal/processor | Package processor orchestrates file and directory processing. |
| internal/text | Package text provides text processing utilities for markdown content. |
| internal/tts | Package tts provides the text-to-speech provider abstraction and implementations. |
| internal/utils | Package utils provides utility functions for audio processing and common operations. |
