md2audio
Convert markdown H2 sections to individual audio files using multiple TTS (Text-to-Speech) providers, including the macOS say command and the ElevenLabs API.
Features
- Multiple TTS Providers: macOS say command and ElevenLabs API
- Process files or directories recursively with structure mirroring
- Target duration control: Adjust timing with annotations like (8s)
- Multiple formats: AIFF, M4A, and MP3 output
- Voice caching: Fast lookups with SQLite WAL mode
- Developer-friendly: Debug mode, dry-run preview, progress indicators
Prerequisites
For macOS say Provider
- macOS (uses built-in say command)
- Go 1.25 or later (to build the tool)
For ElevenLabs Provider
- Any OS (Windows, macOS, Linux)
- Go 1.25 or later (to build the tool)
- ElevenLabs API key (Get one here)
- Set the ELEVENLABS_API_KEY environment variable or create a .env file
Installation
Using go install
go install github.com/indaco/md2audio/cmd/md2audio@latest
Building from source
git clone https://github.com/indaco/md2audio.git
cd md2audio
go build -o md2audio ./cmd/md2audio
The binary will be created in the current directory. You can move it to a location in your PATH:
sudo mv md2audio /usr/local/bin/
TTS Providers
md2audio supports multiple Text-to-Speech providers. Choose the one that best fits your needs:
macOS say (Default)
- Platform: macOS only
- Cost: Free (built-in)
- Setup: No configuration needed
- Quality: Good for local development and testing
- Formats: AIFF, M4A
- Voices: ~70 voices in various languages
ElevenLabs
- Platform: Cross-platform (works on any OS)
- Cost: Paid API (Pricing)
- Setup: Requires API key
- Quality: Premium, highly realistic voices
- Formats: MP3
- Voices: Multiple professional voices with emotional control
Setting up ElevenLabs
- Get your API key from ElevenLabs
- Set the environment variable:
export ELEVENLABS_API_KEY='your-api-key'
- Or create a .env file in your project directory:
# Copy the example file
cp .env.example .env
# Then edit .env and add your API key
Or create it directly:
echo 'ELEVENLABS_API_KEY=your-api-key' > .env
- (Optional) Configure voice settings in .env:
# Voice quality settings (all optional, with sensible defaults)
ELEVENLABS_STABILITY=0.5 # Voice consistency (0.0-1.0, default: 0.5)
ELEVENLABS_SIMILARITY_BOOST=0.5 # Voice similarity (0.0-1.0, default: 0.5)
ELEVENLABS_STYLE=0.0 # Voice style/emotion (0.0-1.0, default: 0.0)
ELEVENLABS_USE_SPEAKER_BOOST=true # Boost similarity (true/false, default: true)
ELEVENLABS_SPEED=1.0 # Default speed for non-timed sections (0.7-1.2, default: 1.0)
Note:
- ELEVENLABS_SPEED only applies to sections WITHOUT timing annotations
- For sections with a timing annotation such as (5s), the speed is calculated automatically
- Higher stability = more consistent but less expressive
- Higher similarity_boost = closer to original voice characteristics
- Style adds emotional range (0 = disabled, higher = more expressive)
- List available voices:
./md2audio -provider elevenlabs -list-voices
Usage
Basic Examples
Using macOS say Provider (Default)
# Check version
./md2audio -version
# List available voices for say provider
./md2audio -list-voices
# Process a single markdown file with voice preset
./md2audio -f script.md -p british-female
# Process entire directory recursively
./md2audio -d ./docs -p british-female
# Use specific voice with slower rate for clarity
./md2audio -f script.md -v Kate -r 170
# Generate M4A files instead of AIFF
./md2audio -d ./content -p british-female -format m4a
# Custom output directory and prefix
./md2audio -f script.md -o ./voiceovers -prefix demo
# Preview what would be generated (dry-run mode)
./md2audio -f script.md -p british-female -dry-run
# Enable debug logging to troubleshoot issues
./md2audio -f script.md -p british-female -debug
# Combine dry-run with debug for detailed preview
./md2audio -d ./docs -p british-female -dry-run -debug
Using ElevenLabs Provider
# List available ElevenLabs voices (cached for faster access)
./md2audio -provider elevenlabs -list-voices
# Refresh voice cache (when new voices are available)
./md2audio -provider elevenlabs -list-voices -refresh-cache
# Export voices to JSON for reference
./md2audio -provider elevenlabs -export-voices elevenlabs_voices.json
# Process a single file with ElevenLabs
./md2audio -provider elevenlabs \
-elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \
-f script.md
# Process entire directory with ElevenLabs
./md2audio -provider elevenlabs \
-elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \
-d ./docs \
-o ./audio_output
# Use specific ElevenLabs model
./md2audio -provider elevenlabs \
-elevenlabs-voice-id YOUR_VOICE_ID \
-elevenlabs-model eleven_multilingual_v2 \
-f script.md
Debug Mode
Enable debug logging to troubleshoot issues or understand what's happening under the hood:
# Enable debug logging
./md2audio -f script.md -p british-female -debug
Debug mode shows:
- Cache hits/misses for voice lookups
- API request details (ElevenLabs)
- File processing progress
- Internal operation details
When to use debug mode:
- Troubleshooting API issues with ElevenLabs
- Understanding cache behavior
- Investigating performance problems
- Reporting bugs with detailed logs
Dry-Run Mode
Preview what would be generated without creating any audio files:
# Dry-run mode - shows what would be generated
./md2audio -f script.md -p british-female -dry-run
# Combine with debug for maximum visibility
./md2audio -d ./docs -provider elevenlabs -elevenlabs-voice-id YOUR_ID -dry-run -debug
Dry-run mode shows:
- Which sections would be processed
- Output file paths that would be created
- Timing information for timed sections
- Preview of text content
When to use dry-run mode:
- Testing markdown format before generation
- Verifying output paths and filenames
- Checking section count and structure
- Planning batch processing jobs
Example output:
💡 DRY-RUN MODE: No files will be created
ℹ Section 1/3:
- title: Introduction
💡 Target duration: 8.0 seconds
💡 Text: Welcome to this demonstration...
Would create: ./audio_sections/section_01_introduction.aiff
ℹ Section 2/3:
- title: Main Content
💡 Text: Here is the main content...
Would create: ./audio_sections/section_02_main_content.aiff
✔ Would generate 3 audio files
Voice Caching
To improve performance, md2audio caches voice lists from providers. This is especially useful for ElevenLabs to avoid repeated API calls:
# First call - fetches from API and caches (slower)
./md2audio -provider elevenlabs -list-voices
# Subsequent calls - uses cache (instant)
./md2audio -provider elevenlabs -list-voices
# Force refresh when new voices are available
./md2audio -provider elevenlabs -list-voices -refresh-cache
# Export cached voices to JSON file for reference
./md2audio -provider elevenlabs -export-voices elevenlabs_voices.json
./md2audio -provider say -export-voices say_voices.json
Cache Details:
- Location: ~/.md2audio/voice_cache.db (SQLite database)
- Duration: 30 days (voices don't change frequently)
- Benefits: Instant voice listing, reduced API calls, offline access to voice list
- Refresh: Use the -refresh-cache flag when you know new voices are available
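The expiry rule above can be sketched in a few lines of Go. This is only an illustration of a 30-day freshness check combined with a forced refresh; `cacheTTL` and `isCacheFresh` are hypothetical names, and md2audio's actual cache code may be organized differently.

```go
package main

import (
	"fmt"
	"time"
)

// cacheTTL mirrors the 30-day cache duration stated in this README.
const cacheTTL = 30 * 24 * time.Hour

// isCacheFresh reports whether a cached voice list of the given age can be
// reused. Hypothetical helper: a forced refresh (as with -refresh-cache)
// always invalidates the entry.
func isCacheFresh(age time.Duration, forceRefresh bool) bool {
	return !forceRefresh && age >= 0 && age < cacheTTL
}

func main() {
	cachedAt := time.Now().Add(-10 * 24 * time.Hour) // cached ten days ago
	fmt.Println(isCacheFresh(time.Since(cachedAt), false))
	fmt.Println(isCacheFresh(time.Since(cachedAt), true))
}
```

Passing the age rather than two timestamps keeps the check easy to exercise without touching the clock.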
Command Line Options
General Options
| Flag | Description | Default |
|---|---|---|
| -f | Input markdown file (use -f or -d) | - |
| -d | Input directory (recursive, use -f or -d) | - |
| -o | Output directory | ./audio_sections |
| -format | Output format | aiff |
| -prefix | Filename prefix | section |
| -list-voices | List all available voices (uses cache if available) | - |
| -refresh-cache | Force refresh of voice cache | false |
| -export-voices | Export cached voices to JSON file | - |
| -provider | TTS provider (say or elevenlabs) | say |
| -version | Print version and exit | - |
| -debug | Enable debug logging | false |
| -dry-run | Show what would be generated without creating files | false |
macOS say Provider Options
| Flag | Description | Default |
|---|---|---|
| -p | Voice preset (see Voice Presets below) | Kate (if not set) |
| -v | Specific voice name (overrides -p) | - |
| -r | Speaking rate (lower = slower) | 180 |
ElevenLabs Provider Options
| Flag | Description | Default |
|---|---|---|
| -elevenlabs-voice-id | ElevenLabs voice ID (required) | - |
| -elevenlabs-model | ElevenLabs model ID | eleven_multilingual_v2 |
| -elevenlabs-api-key | ElevenLabs API key (prefer env var) | ELEVENLABS_API_KEY env |
Voice Presets
british-female -> Kate
british-male -> Daniel
us-female -> Samantha
us-male -> Alex
australian-female -> Karen
indian-female -> Veena
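As a rough sketch of how the -p and -v flags could interact, the Go snippet below resolves a voice name from the preset table above. The precedence (-v overrides -p, Kate as fallback) comes from the options table; the `resolveVoice` function and the error for unknown presets are assumptions made for illustration, not md2audio's actual behavior.

```go
package main

import "fmt"

// presets copies the preset table from this README.
var presets = map[string]string{
	"british-female":    "Kate",
	"british-male":      "Daniel",
	"us-female":         "Samantha",
	"us-male":           "Alex",
	"australian-female": "Karen",
	"indian-female":     "Veena",
}

// defaultVoice is the documented fallback when neither flag is set.
const defaultVoice = "Kate"

// resolveVoice picks the voice to use: an explicit voice (-v) wins over a
// preset (-p). Rejecting unknown presets is an assumption of this sketch.
func resolveVoice(voice, preset string) (string, error) {
	if voice != "" {
		return voice, nil
	}
	if preset == "" {
		return defaultVoice, nil
	}
	name, ok := presets[preset]
	if !ok {
		return "", fmt.Errorf("unknown voice preset %q", preset)
	}
	return name, nil
}

func main() {
	v, err := resolveVoice("", "us-male")
	fmt.Println(v, err)
}
```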
ElevenLabs Voice Settings
ElevenLabs voice quality can be fine-tuned using environment variables. All settings are optional and have sensible defaults:
| Setting | Range | Default | Description |
|---|---|---|---|
| ELEVENLABS_STABILITY | 0.0-1.0 | 0.5 | Voice consistency. Higher = more consistent but less expressive |
| ELEVENLABS_SIMILARITY_BOOST | 0.0-1.0 | 0.5 | Voice similarity to original. Higher = closer to voice characteristics |
| ELEVENLABS_STYLE | 0.0-1.0 | 0.0 | Emotional range. 0 = disabled, higher = more expressive |
| ELEVENLABS_USE_SPEAKER_BOOST | true/false | true | Boost similarity of synthesized speech |
| ELEVENLABS_SPEED | 0.7-1.2 | 1.0 | Default speaking speed (only for sections without timing annotations) |
Speed Behavior:
- Sections with timing annotations like ## Scene 1 (5s): speed is calculated automatically to fit the duration
- Sections without timing annotations: the ELEVENLABS_SPEED setting is used (default: 1.0)
Example .env configuration:
ELEVENLABS_API_KEY=your-api-key
ELEVENLABS_STABILITY=0.7 # More consistent voice
ELEVENLABS_SIMILARITY_BOOST=0.8 # Closer to original voice
ELEVENLABS_STYLE=0.3 # Slight emotional variation
ELEVENLABS_SPEED=1.1 # 10% faster for non-timed sections
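To make the "optional with defaults" behavior concrete, here is a minimal Go sketch that reads one numeric setting, falls back to its default when unset, and rejects values outside the documented range. `parseFloatSetting` is a hypothetical helper; whether md2audio rejects or clamps out-of-range values is not stated in this README, so the error is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseFloatSetting converts the raw value of a setting into a float64.
// An empty value yields def; anything unparsable or outside [lo, hi] is
// reported as an error (an assumption of this sketch).
func parseFloatSetting(raw string, def, lo, hi float64) (float64, error) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return def, nil
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("%q is not a number", raw)
	}
	if v < lo || v > hi {
		return 0, fmt.Errorf("%v is outside the range %v-%v", v, lo, hi)
	}
	return v, nil
}

func main() {
	// Range and default for ELEVENLABS_SPEED are taken from the table above.
	speed, err := parseFloatSetting(os.Getenv("ELEVENLABS_SPEED"), 1.0, 0.7, 1.2)
	if err != nil {
		fmt.Println("ELEVENLABS_SPEED:", err)
		return
	}
	fmt.Println("speed:", speed)
}
```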
md2audio expects H2 headers (##) to denote sections. You can optionally specify a target duration for each section:
## Scene 1: Introduction (8s)
This is the content for scene 1. It will be converted to audio that lasts exactly 8 seconds.
## Scene 2: Main Demo (12s)
This is the content for scene 2. The speaking rate will be automatically adjusted to fit 12 seconds.
## Scene 3: Conclusion
This section has no timing specified, so it will use the default speaking rate (-r flag).
- (8s) - Target duration of 8 seconds
- (10.5s) - Target duration of 10.5 seconds
- (0-8s) - Range format, uses end time (8 seconds)
- (15 seconds) - Also works with "seconds" spelled out
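The four annotation formats above can be recognized with a single regular expression. The Go sketch below is one possible way to do it, not md2audio's actual parser: `timingRe` and `parseHeader` are hypothetical names, and the function expects the header text with the leading ## already removed.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// timingRe matches a trailing annotation such as (8s), (10.5s), (0-8s) or
// (15 seconds). For the range form only the end time is captured.
var timingRe = regexp.MustCompile(
	`\s*\((?:\d+(?:\.\d+)?\s*-\s*)?(\d+(?:\.\d+)?)\s*(?:seconds|s)\)\s*$`)

// parseHeader splits a header into its title and target duration in seconds.
// timed is false when no usable annotation is present.
func parseHeader(header string) (title string, seconds float64, timed bool) {
	header = strings.TrimSpace(header)
	m := timingRe.FindStringSubmatchIndex(header)
	if m == nil {
		return header, 0, false
	}
	secs, err := strconv.ParseFloat(header[m[2]:m[3]], 64)
	if err != nil || secs <= 0 {
		return header, 0, false
	}
	return strings.TrimSpace(header[:m[0]]), secs, true
}

func main() {
	for _, h := range []string{
		"Scene 1: Introduction (8s)",
		"Scene 2 (0-8s)",
		"Scene 3: Conclusion",
	} {
		title, secs, timed := parseHeader(h)
		fmt.Printf("%q -> title=%q seconds=%v timed=%v\n", h, title, secs, timed)
	}
}
```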
How it works (macOS say provider only):
- md2audio counts the words in your text
- Calculates the required words-per-minute (WPM) to fit the target duration
- Automatically adjusts the speaking rate for that section
- Shows you the actual duration vs target after generation
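The calculation in the steps above boils down to words divided by seconds, times 60. The following Go sketch shows that arithmetic under simple assumptions: words are split on whitespace and the result is rounded to the nearest integer. `requiredWPM` is a hypothetical name, and the tool may count words or round differently.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"strings"
)

// requiredWPM returns the words-per-minute rate needed to speak text in
// targetSeconds. Words are counted by splitting on whitespace.
func requiredWPM(text string, targetSeconds float64) (int, error) {
	if targetSeconds <= 0 {
		return 0, errors.New("target duration must be positive")
	}
	words := len(strings.Fields(text))
	if words == 0 {
		return 0, errors.New("text contains no words")
	}
	return int(math.Round(float64(words) / targetSeconds * 60)), nil
}

func main() {
	text := "one two three four five six seven eight nine ten"
	wpm, err := requiredWPM(text, 4) // 10 words in 4 seconds
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The result could then be passed to the say command's -r option.
	fmt.Println("required rate:", wpm)
}
```

Ten words in four seconds works out to 150 WPM, a little below the tool's documented default rate of 180.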
Important Notes:
- Timing is supported with both providers, but with different accuracy.
- Timing accuracy tip: Test with your content and adjust timing annotations as needed. For very tight timing requirements, consider the say provider's wider speed range.
Directory Processing
Process entire directory trees recursively with the -d flag:
./md2audio -d ./docs -p british-female -o ./audio_output
Input structure:
docs/
├── intro.md
├── chapter1/
│   ├── part1.md
│   └── part2.md
└── chapter2/
    └── overview.md
Output structure (mirrors input):
audio_output/
├── intro/
│   ├── section_01_welcome.aiff
│   └── section_02_overview.aiff
├── chapter1/
│   ├── part1/
│   │   ├── section_01_title.aiff
│   │   └── section_02_title.aiff
│   └── part2/
│       └── section_01_title.aiff
└── chapter2/
    └── overview/
        └── section_01_title.aiff
Key features:
- Processes all .md files recursively
- Creates mirror directory structure
- Each markdown file gets its own subdirectory
- Preserves folder hierarchy from input
- Continues processing even if individual files fail
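The mirroring shown in the trees above can be expressed as a small path computation: take the markdown file's path relative to the input root, drop the extension, and join it onto the output root. The Go sketch below illustrates this; `outputDirFor` is a hypothetical helper and does not claim to match md2audio's internals.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// outputDirFor maps a markdown file under inputRoot to the directory that
// should hold its audio files under outputRoot, mirroring the folder
// hierarchy and giving each file its own subdirectory.
func outputDirFor(inputRoot, outputRoot, mdPath string) (string, error) {
	rel, err := filepath.Rel(inputRoot, mdPath)
	if err != nil {
		return "", err
	}
	// Refuse files that are not inside the input root.
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("%s is outside %s", mdPath, inputRoot)
	}
	rel = strings.TrimSuffix(rel, filepath.Ext(rel))
	return filepath.Join(outputRoot, rel), nil
}

func main() {
	dir, err := outputDirFor("docs", "audio_output", "docs/chapter1/part1.md")
	fmt.Println(dir, err)
}
```

In a full walk of the input tree, a function like this would be called once per .md file, with errors logged rather than returned so that one failing file does not stop the rest.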
Example with examples folder:
# Process the included examples
./md2audio -d ./examples -p british-female -format m4a
# Results in organized audio files matching the examples structure
Output
Files are named using the pattern:
{prefix}_{number}_{sanitized_title}.{format}
Example outputs:
section_01_scene_1_introduction.aiff
section_02_scene_2_main_demo.aiff
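A naming scheme consistent with the examples above could look like the Go sketch below. The sanitization rule (lowercase, runs of other characters collapsed to a single underscore) is inferred from the sample filenames and is an assumption; `sanitizeTitle` and `outputName` are hypothetical names, and the title is expected with any timing annotation already removed.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonAlnum matches runs of characters other than ASCII letters and digits.
var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// sanitizeTitle lowercases a section title and replaces every run of other
// characters with a single underscore. Non-ASCII letters are also replaced,
// which is a limitation of this sketch.
func sanitizeTitle(title string) string {
	s := nonAlnum.ReplaceAllString(strings.ToLower(title), "_")
	return strings.Trim(s, "_")
}

// outputName builds {prefix}_{number}_{sanitized_title}.{format} with a
// zero-padded two-digit section number.
func outputName(prefix string, index int, title, format string) string {
	return fmt.Sprintf("%s_%02d_%s.%s", prefix, index, sanitizeTitle(title), format)
}

func main() {
	fmt.Println(outputName("section", 1, "Scene 1: Introduction", "aiff"))
	fmt.Println(outputName("demo", 2, "Scene 2: Main Demo", "m4a"))
}
```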
Tips for Video Editing
- Generate separate files per section (this is automatic)
- Add timing to your markdown headers to match your screen recording
- Import all audio files into your video editing software
- Place each audio clip on the timeline where needed
- The audio will match your specified durations automatically
Timing Tips
- Be realistic: Very short durations with lots of text will sound rushed
- Test first: Generate one section to verify the pacing feels natural
- Adjust if needed: If timing is off, adjust the duration in your markdown and regenerate
- Word count matters: ~2-3 words per second is natural speech
- Override if needed: The -r flag still works for sections without timing
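The 2-3 words per second guideline above lends itself to a quick sanity check before generating audio. The Go sketch below flags sections whose target duration would require more than 3 words per second; `wordsPerSecond` and `soundsRushed` are hypothetical helpers, and the threshold is simply the upper end of the range given in the tips.

```go
package main

import (
	"fmt"
	"strings"
)

// wordsPerSecond returns the speaking pace needed to fit text into
// targetSeconds, or 0 when the duration is not positive.
func wordsPerSecond(text string, targetSeconds float64) float64 {
	if targetSeconds <= 0 {
		return 0
	}
	return float64(len(strings.Fields(text))) / targetSeconds
}

// soundsRushed reports whether the required pace exceeds 3 words per second,
// the upper end of the natural range mentioned in the tips.
func soundsRushed(text string, targetSeconds float64) bool {
	return wordsPerSecond(text, targetSeconds) > 3
}

func main() {
	text := "Welcome to this short demonstration of the new dashboard and its features"
	fmt.Printf("pace: %.1f words/s, rushed: %v\n",
		wordsPerSecond(text, 3), soundsRushed(text, 3))
}
```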
Troubleshooting
Voice not found:
- Run ./md2audio -list-voices to see available voices
- Use the exact voice name with the -v flag
No sections found:
- Ensure your markdown uses ## for headers (H2)
- Check there's content after each header
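To see why both conditions matter, here is a minimal Go sketch of H2-based section splitting. It is an illustration only: `splitSections` is a hypothetical function, it treats any line starting with "## " as a header, and it skips sections with no content. It does not handle fenced code blocks that happen to contain such lines.

```go
package main

import (
	"fmt"
	"strings"
)

// section is one H2 header together with the text that follows it.
type section struct {
	Title string
	Body  string
}

// splitSections collects the text under each "## " header. Text before the
// first header is ignored, and sections without content are skipped.
func splitSections(markdown string) []section {
	var out []section
	var cur *section
	flush := func() {
		if cur == nil {
			return
		}
		cur.Body = strings.TrimSpace(cur.Body)
		if cur.Body != "" {
			out = append(out, *cur)
		}
	}
	for _, line := range strings.Split(markdown, "\n") {
		if strings.HasPrefix(line, "## ") {
			flush()
			cur = &section{Title: strings.TrimSpace(line[3:])}
			continue
		}
		if cur != nil {
			cur.Body += line + "\n"
		}
	}
	flush()
	return out
}

func main() {
	doc := "# Title\n## Scene 1 (8s)\nHello there.\n## Empty\n\n## Scene 2\nGoodbye.\n"
	for i, s := range splitSections(doc) {
		fmt.Printf("%d: %q -> %q\n", i+1, s.Title, s.Body)
	}
}
```

A document that uses only # or ### headers yields no sections with this approach, which matches the symptom described above.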
Audio quality:
- AIFF format is higher quality but larger
- M4A format is compressed and smaller
- Adjust rate with the -r flag for clarity
Example Workflow
# 1. Check your markdown format
cat examples/demo_script.md
# 2. List available voices
./md2audio -list-voices
# 3. Generate audio files
./md2audio -f examples/demo_script.md -p british-female -r 175 -format m4a
# 4. Import the files from ./audio_sections into your video editor
Notes
- md2audio automatically cleans markdown formatting (links, bold, italic)
- Empty sections are skipped
- Section titles are sanitized for safe filenames
- Speaking rate default is 180 (macOS default is 200)
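As an illustration of the cleanup mentioned in the first note, the Go sketch below strips links, bold, and italic markers with regular expressions so that only the spoken text remains. The patterns and the `cleanMarkdown` name are assumptions for this example; md2audio's own cleaning rules may cover more cases (images and inline code are not handled here).

```go
package main

import (
	"fmt"
	"regexp"
)

// cleaners are applied in order: links first, then bold, then italic.
// Underscore forms require word boundaries so snake_case words survive.
var cleaners = []*regexp.Regexp{
	regexp.MustCompile(`\[([^\]]+)\]\([^)]*\)`), // [text](url) -> text
	regexp.MustCompile(`\*\*(.+?)\*\*`),         // **bold**
	regexp.MustCompile(`\b__(.+?)__\b`),         // __bold__
	regexp.MustCompile(`\*(.+?)\*`),             // *italic*
	regexp.MustCompile(`\b_(.+?)_\b`),           // _italic_
}

// cleanMarkdown removes the formatting markers above and keeps the inner text.
func cleanMarkdown(text string) string {
	for _, re := range cleaners {
		text = re.ReplaceAllString(text, "${1}")
	}
	return text
}

func main() {
	fmt.Println(cleanMarkdown("See the **bold** and *italic* [docs](https://example.com)."))
}
```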
For Developers
Interested in contributing or understanding the codebase?
See the Contributing Guide for detailed information about:
- Project architecture and package organization
- Development tools and workflow
- Code quality standards
- Setting up your development environment
Contributing
Contributions are welcome!
See the Contributing Guide for setup instructions.
License
This project is licensed under the MIT License; see the LICENSE file for details.