Documentation ¶
Index ¶
- Constants
- Variables
- func EncodeFFmpeg(ctx context.Context, format string, pcm []byte, sampleRate uint32) ([]byte, error)
- func Extension(format string) string
- func FFmpegAvailable() bool
- func IsValidModel(name string) bool
- func IsValidVoice(name string) bool
- func MimeType(format string) string
- func Sanitize(s string) string
- func StartCacheSweep()
- func SupportedFormats() []string
- type Service
- type SynthesizeRequest
- type SynthesizeResult
- type VoiceInfo
Constants ¶
const DefaultModel = "gemini-3.1-flash-tts-preview"
DefaultModel is the Gemini TTS model used by default.
const DefaultStylePrompt = "落ち着いた日本語で、淡々と短く読み上げて。"
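DefaultStylePrompt is used when the agent has no custom style prompt. In English it reads roughly: "Read this aloud in calm Japanese, briefly and in a matter-of-fact tone."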
const DefaultVoice = "Kore"
DefaultVoice is the default voice from the 30-voice Gemini TTS catalogue.
const MaxChars = 800
MaxChars caps input text length to bound cost and latency. Anything longer is truncated with a trailing ellipsis before being sent.
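The clipping rule above can be sketched as follows. This is only an illustration: clipText is a hypothetical helper, and counting runes rather than bytes is an assumption (the documentation does not say which unit MaxChars uses).

```go
package main

import "fmt"

// maxChars mirrors the documented MaxChars constant.
const maxChars = 800

// clipText is a hypothetical helper. It truncates s to at most limit runes
// and replaces the last kept position with a single ellipsis rune when
// clipping occurs. Counting runes (an assumption) keeps multi-byte text such
// as Japanese from being cut in the middle of a character.
// A non-positive limit disables clipping.
func clipText(s string, limit int) string {
	r := []rune(s)
	if limit <= 0 || len(r) <= limit {
		return s
	}
	return string(r[:limit-1]) + "…"
}

func main() {
	fmt.Println(clipText("short text", maxChars)) // unchanged
	fmt.Println(clipText("こんにちは世界", 4))           // こんに…
}
```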
const SystemInstruction = `` /* 375-byte string literal not displayed */
SystemInstruction is prepended to every TTS request to keep the model in narrator mode and avoid refusal/rewriting of the input text.
Variables ¶
var Models = []string{
"gemini-3.1-flash-tts-preview",
"gemini-2.5-flash-preview-tts",
"gemini-2.5-pro-preview-tts",
}
Models is the set of TTS models we accept from clients.
var VoiceCatalog = []VoiceInfo{
{"Zephyr", "Bright", "F"},
{"Puck", "Upbeat", "M"},
{"Charon", "Informative", "M"},
{"Kore", "Firm", "F"},
{"Fenrir", "Excitable", "M"},
{"Leda", "Youthful", "F"},
{"Orus", "Firm", "M"},
{"Aoede", "Breezy", "F"},
{"Callirrhoe", "Easy-going", "F"},
{"Autonoe", "Bright", "F"},
{"Enceladus", "Breathy", "M"},
{"Iapetus", "Clear", "M"},
{"Umbriel", "Easy-going", "M"},
{"Algieba", "Smooth", "M"},
{"Despina", "Smooth", "F"},
{"Erinome", "Clear", "F"},
{"Algenib", "Gravelly", "M"},
{"Rasalgethi", "Informative", "M"},
{"Laomedeia", "Upbeat", "F"},
{"Achernar", "Soft", "F"},
{"Alnilam", "Firm", "M"},
{"Schedar", "Even", "M"},
{"Gacrux", "Mature", "F"},
{"Pulcherrima", "Forward", "F"},
{"Achird", "Friendly", "M"},
{"Zubenelgenubi", "Casual", "M"},
{"Vindemiatrix", "Gentle", "F"},
{"Sadachbia", "Lively", "M"},
{"Sadaltager", "Knowledgeable", "M"},
{"Sulafat", "Warm", "F"},
}
VoiceCatalog is the canonical, ordered list of voices supported by Gemini TTS as of the speech-generation docs in 2026-05.
var Voices = func() []string {
	out := make([]string, len(VoiceCatalog))
	for i, v := range VoiceCatalog {
		out[i] = v.Name
	}
	return out
}()
Voices is a flat list of voice ids preserved for backwards compatibility with the existing IsValidVoice/Service signatures.
Functions ¶
func EncodeFFmpeg ¶
func EncodeFFmpeg(ctx context.Context, format string, pcm []byte, sampleRate uint32) ([]byte, error)
EncodeFFmpeg pipes raw PCM (16-bit little-endian mono at sampleRate Hz) through ffmpeg and returns the encoded bytes in the requested container.
format must be "opus" (Ogg/Opus, 24 kbps, voip-tuned) or "mp3" (64 kbps, libmp3lame). For "wav", use pcmToWAV directly; it does not require ffmpeg.
A 30 s hard timeout protects against a stuck subprocess; the goroutine piping stdin is bounded by ctx as well.
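A minimal sketch of how such a pipe-through-ffmpeg call might look. The exact command line the package uses is not documented here, so the flags below are assumptions chosen to match the description (s16le mono input on stdin, Ogg/Opus at 24 kbps with the voip application, or MP3 at 64 kbps via libmp3lame). ffmpegArgs and encodeSketch are hypothetical names.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"strconv"
	"time"
)

// ffmpegArgs builds an illustrative argument list. The real flags used by
// EncodeFFmpeg are not documented, so treat every flag here as an assumption.
func ffmpegArgs(format string, sampleRate uint32) ([]string, error) {
	// Input: raw signed 16-bit little-endian mono PCM read from stdin.
	args := []string{"-hide_banner", "-loglevel", "error",
		"-f", "s16le", "-ar", strconv.FormatUint(uint64(sampleRate), 10),
		"-ac", "1", "-i", "pipe:0"}
	switch format {
	case "opus":
		args = append(args, "-c:a", "libopus", "-b:a", "24k", "-application", "voip", "-f", "ogg")
	case "mp3":
		args = append(args, "-c:a", "libmp3lame", "-b:a", "64k", "-f", "mp3")
	default:
		return nil, fmt.Errorf("unsupported format %q", format)
	}
	return append(args, "pipe:1"), nil // encoded bytes go to stdout
}

// encodeSketch runs ffmpeg with a 30 s ceiling layered on top of the
// caller's context, feeding pcm on stdin and collecting stdout.
func encodeSketch(ctx context.Context, format string, pcm []byte, sampleRate uint32) ([]byte, error) {
	args, err := ffmpegArgs(format, sampleRate)
	if err != nil {
		return nil, err
	}
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "ffmpeg", args...)
	cmd.Stdin = bytes.NewReader(pcm) // os/exec copies this into the child's stdin
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	out, err := cmd.Output()
	if err != nil {
		return nil, fmt.Errorf("ffmpeg: %w: %s", err, stderr.String())
	}
	return out, nil
}

func main() {
	args, err := ffmpegArgs("opus", 24000)
	fmt.Println(args, err)
}
```

Because the command is created with exec.CommandContext, cancelling ctx or hitting the timeout kills the subprocess, which in turn unblocks the stdin copy.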
func FFmpegAvailable ¶
func FFmpegAvailable() bool
FFmpegAvailable returns whether an ffmpeg binary is on PATH. The lookup is performed once and cached for the process lifetime.
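One common way to get a once-per-process lookup is sync.Once around exec.LookPath. The sketch below assumes that approach; the package may implement the caching differently, and the lower-case names are illustrative.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

var (
	ffmpegOnce  sync.Once
	ffmpegFound bool
)

// ffmpegAvailable searches PATH for "ffmpeg" on the first call only and
// returns the remembered answer on every later call.
func ffmpegAvailable() bool {
	ffmpegOnce.Do(func() {
		_, err := exec.LookPath("ffmpeg")
		ffmpegFound = err == nil
	})
	return ffmpegFound
}

func main() {
	fmt.Println("ffmpeg on PATH:", ffmpegAvailable())
}
```

A consequence of caching for the process lifetime is that installing ffmpeg after startup is not noticed until the process restarts.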
func IsValidModel ¶
func IsValidModel(name string) bool
IsValidModel reports whether the given model id is in the accepted list.
func IsValidVoice ¶
func IsValidVoice(name string) bool
IsValidVoice reports whether the given name is in the canonical voice list. Empty string is rejected.
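Both validators amount to a membership test against a fixed list. A minimal sketch follows, assuming exact, case-sensitive matching (the documentation does not state how case is handled). The contains helper is hypothetical and the voice list is shortened for brevity.

```go
package main

import "fmt"

// Stand-ins for the package's Models and Voices lists (voices shortened).
var models = []string{
	"gemini-3.1-flash-tts-preview",
	"gemini-2.5-flash-preview-tts",
	"gemini-2.5-pro-preview-tts",
}
var voices = []string{"Zephyr", "Puck", "Charon", "Kore"}

// contains reports whether name is a non-empty, exact match of an entry.
func contains(list []string, name string) bool {
	if name == "" {
		return false
	}
	for _, v := range list {
		if v == name {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(contains(voices, "Kore"))                         // true
	fmt.Println(contains(voices, ""))                             // false
	fmt.Println(contains(models, "gemini-2.5-pro-preview-tts"))   // true
	fmt.Println(contains(models, "gemini-1.0-does-not-exist"))    // false
}
```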
func Sanitize ¶
func Sanitize(s string) string
Sanitize prepares text for TTS by removing markdown noise that wastes audio token budget and degrades narration quality. The transformation is intentionally conservative: TTS narrators handle punctuation and natural language fine, so only content that is genuinely unhelpful when spoken is stripped (long code blocks, URLs, inline backticks).
The result is also length-clipped to MaxChars so a runaway agent reply can't blow up the audio token bill.
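A rough sketch of this kind of cleanup, under several assumptions: the real rules are not documented in detail, this version drops every fenced block rather than only long ones, and the regular expressions and the sanitizeSketch name are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fence is three backticks, built at runtime to keep this listing readable.
var fence = strings.Repeat("`", 3)

var (
	fencedCode = regexp.MustCompile("(?s)" + fence + ".*?" + fence) // whole fenced blocks
	urlPattern = regexp.MustCompile(`https?://\S+`)                 // bare URLs
	spaces     = regexp.MustCompile(`\s+`)                          // whitespace runs
)

// sanitizeSketch removes fenced code, URLs and inline backticks, collapses
// whitespace, then clips the result to limit runes with a trailing ellipsis.
func sanitizeSketch(s string, limit int) string {
	s = fencedCode.ReplaceAllString(s, " ")
	s = urlPattern.ReplaceAllString(s, " ")
	s = strings.ReplaceAll(s, "`", "")
	s = strings.TrimSpace(spaces.ReplaceAllString(s, " "))
	if r := []rune(s); limit > 0 && len(r) > limit {
		s = string(r[:limit-1]) + "…"
	}
	return s
}

func main() {
	in := "Run " + fence + "go\nx := 1\n" + fence + " then see https://example.com/a and use `foo`."
	fmt.Println(sanitizeSketch(in, 800)) // Run then see and use foo.
}
```

Note that the URL pattern in this sketch also swallows punctuation attached to the end of a URL; a production rule would likely be more careful.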
func StartCacheSweep ¶
func StartCacheSweep()
func SupportedFormats ¶
func SupportedFormats() []string
SupportedFormats reports which output container formats kojo can emit in the current environment. WAV is always available; opus and mp3 require ffmpeg.
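The rule can be sketched as below. To keep the example testable without touching PATH, this hypothetical variant takes ffmpeg availability as a parameter; the ordering of the returned slice is also an assumption.

```go
package main

import "fmt"

// supportedFormats returns "wav" unconditionally and adds the
// ffmpeg-backed formats only when hasFFmpeg is true.
func supportedFormats(hasFFmpeg bool) []string {
	formats := []string{"wav"}
	if hasFFmpeg {
		formats = append(formats, "opus", "mp3")
	}
	return formats
}

func main() {
	fmt.Println(supportedFormats(false)) // [wav]
	fmt.Println(supportedFormats(true))  // [wav opus mp3]
}
```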
Types ¶
type Service ¶
type Service struct {
// contains filtered or unexported fields
}
Service performs Gemini TTS synthesis and caches results on disk.
The API key is fetched lazily via getAPIKey on every request, so a key rotation in the credential store takes effect on the next call without needing to rewire the service.
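The lazy-fetch behaviour can be illustrated with a stand-in type that stores the key provider rather than the key. The provider signature func() (string, error) and the names service and currentKey are assumptions for this sketch, not the package's actual definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// service is a hypothetical stand-in for Service: it keeps the function
// that returns the key, never a copy of the key itself.
type service struct {
	getAPIKey func() (string, error)
}

// currentKey resolves the key at call time, so a rotated key is picked up
// on the next request without rebuilding the service.
func (s *service) currentKey() (string, error) {
	key, err := s.getAPIKey()
	if err != nil {
		return "", fmt.Errorf("fetch api key: %w", err)
	}
	if key == "" {
		return "", errors.New("api key is empty")
	}
	return key, nil
}

func main() {
	stored := "key-v1" // pretend credential store
	svc := &service{getAPIKey: func() (string, error) { return stored, nil }}

	k1, _ := svc.currentKey()
	stored = "key-v2" // rotation happens outside the service
	k2, _ := svc.currentKey()
	fmt.Println(k1, k2) // key-v1 key-v2
}
```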
func NewService ¶
NewService constructs a Service. apiKeyFn must return the current Gemini Developer API key.
func (*Service) LookupCached ¶
LookupCached returns the bytes for a previously synthesized hash. It is used by the GET /audio endpoint to serve the file directly with a long browser cache. format must match the on-disk extension.
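A sketch of what a hash-keyed lookup could look like. The on-disk layout (dir/hash.ext), the use of the format name as the file extension, and the helper names are all assumptions; the real method's signature is not shown in this documentation. Validating the hash as hex before building a path keeps request input from carrying path separators.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// cachePath maps a hash and format to a file path. It accepts only a
// 64-character hex string (a sha256 digest) and a known format.
func cachePath(dir, hash, format string) (string, error) {
	if b, err := hex.DecodeString(hash); err != nil || len(b) != 32 {
		return "", fmt.Errorf("invalid hash %q", hash)
	}
	switch format {
	case "opus", "mp3", "wav":
	default:
		return "", fmt.Errorf("unsupported format %q", format)
	}
	return filepath.Join(dir, hash+"."+format), nil
}

// lookupCached returns the cached bytes, or an error if the entry is absent.
func lookupCached(dir, hash, format string) ([]byte, error) {
	p, err := cachePath(dir, hash, format)
	if err != nil {
		return nil, err
	}
	return os.ReadFile(p)
}

func main() {
	dir, err := os.MkdirTemp("", "tts-cache")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	hash := "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
	if err := os.WriteFile(filepath.Join(dir, hash+".wav"), []byte("RIFF"), 0o644); err != nil {
		panic(err)
	}
	data, err := lookupCached(dir, hash, "wav")
	fmt.Println(len(data), err)
}
```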
func (*Service) Synthesize ¶
func (s *Service) Synthesize(ctx context.Context, req SynthesizeRequest) (*SynthesizeResult, error)
Synthesize is the main entry point. The flow is:
- Sanitize and validate input.
- Hash the request and return the cached file if present.
- Call Gemini :generateContent with safetySettings=OFF.
- Decode the inline-data audio (raw 24 kHz LE16 PCM).
- Encode to the requested container (ffmpeg for opus/mp3, in-process WAV header for wav).
- Persist to cache and return.
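For the wav branch of the encode step, wrapping raw PCM in a container needs only a 44-byte RIFF/WAVE header. The sketch below shows the standard layout for 16-bit little-endian mono PCM; pcmToWAVSketch is an illustrative name, and the package's own pcmToWAV may differ in detail.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// pcmToWAVSketch prepends a canonical 44-byte RIFF/WAVE header to raw
// 16-bit little-endian mono PCM samples.
func pcmToWAVSketch(pcm []byte, sampleRate uint32) []byte {
	const (
		channels      = 1
		bitsPerSample = 16
	)
	blockAlign := uint16(channels * bitsPerSample / 8) // bytes per sample frame
	byteRate := sampleRate * uint32(blockAlign)

	var buf bytes.Buffer
	le := binary.LittleEndian
	buf.WriteString("RIFF")
	binary.Write(&buf, le, uint32(36+len(pcm))) // size of everything after this field
	buf.WriteString("WAVEfmt ")
	binary.Write(&buf, le, uint32(16)) // fmt chunk size for PCM
	binary.Write(&buf, le, uint16(1))  // audio format 1 = linear PCM
	binary.Write(&buf, le, uint16(channels))
	binary.Write(&buf, le, sampleRate)
	binary.Write(&buf, le, byteRate)
	binary.Write(&buf, le, blockAlign)
	binary.Write(&buf, le, uint16(bitsPerSample))
	buf.WriteString("data")
	binary.Write(&buf, le, uint32(len(pcm)))
	buf.Write(pcm)
	return buf.Bytes()
}

func main() {
	wav := pcmToWAVSketch(make([]byte, 4), 24000)
	fmt.Println(len(wav), string(wav[:4]), string(wav[8:12])) // 48 RIFF WAVE
}
```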
type SynthesizeRequest ¶
type SynthesizeRequest struct {
Model string // "" = DefaultModel
Voice string // "" = DefaultVoice
StylePrompt string // "" = DefaultStylePrompt
Text string // raw, will be sanitized inside Synthesize
Format string // "opus" | "mp3" | "wav"
}
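SynthesizeRequest is the input to Service.Synthesize. The configuration fields (Model, Voice, StylePrompt, Format) are expected to be sanitized already at this layer, so callers pass agent-derived configuration directly; Text is the exception and is sanitized inside Synthesize.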
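The empty-string defaults noted in the struct comments can be sketched like this. withDefaults is a hypothetical helper; how and where the package actually applies the defaults is not documented, and no default for Format is stated, so it is left untouched.

```go
package main

import "fmt"

// Values copied from the documented constants.
const (
	DefaultModel       = "gemini-3.1-flash-tts-preview"
	DefaultStylePrompt = "落ち着いた日本語で、淡々と短く読み上げて。"
	DefaultVoice       = "Kore"
)

// SynthesizeRequest mirrors the documented struct.
type SynthesizeRequest struct {
	Model       string
	Voice       string
	StylePrompt string
	Text        string
	Format      string
}

// withDefaults fills in the documented defaults for empty fields and
// leaves explicitly set fields alone.
func withDefaults(req SynthesizeRequest) SynthesizeRequest {
	if req.Model == "" {
		req.Model = DefaultModel
	}
	if req.Voice == "" {
		req.Voice = DefaultVoice
	}
	if req.StylePrompt == "" {
		req.StylePrompt = DefaultStylePrompt
	}
	return req
}

func main() {
	req := withDefaults(SynthesizeRequest{Text: "hello", Format: "wav", Voice: "Puck"})
	fmt.Println(req.Model, req.Voice) // gemini-3.1-flash-tts-preview Puck
}
```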
type SynthesizeResult ¶
SynthesizeResult is what Service.Synthesize returns. AudioBytes is the fully encoded payload ready to be served as MimeType(Format). Hash is the cache key (hex sha256) so handlers can build the audio URL.
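A sketch of deriving a hex sha256 cache key from a request. Which fields feed the hash, and how they are delimited, is not documented; this version assumes all five request fields joined with NUL separators so that adjacent fields cannot run together. The request type and cacheKey name are illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// request is a stand-in for SynthesizeRequest.
type request struct {
	Model, Voice, StylePrompt, Text, Format string
}

// cacheKey hashes every field followed by a NUL byte and returns the
// digest as 64 lowercase hex characters.
func cacheKey(r request) string {
	h := sha256.New()
	for _, f := range []string{r.Model, r.Voice, r.StylePrompt, r.Text, r.Format} {
		h.Write([]byte(f))
		h.Write([]byte{0}) // separator: {"ab","c"} and {"a","bc"} hash differently
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	key := cacheKey(request{Model: "m", Voice: "Kore", Text: "hello", Format: "opus"})
	fmt.Println(len(key), key)
}
```

Including Format in the key means the same text cached as opus and as wav occupies two separate entries.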
type VoiceInfo ¶
type VoiceInfo struct {
Name string `json:"name"`
Trait string `json:"trait"`
Gender string `json:"gender,omitempty"` // "F" | "M" | ""
}
VoiceInfo pairs a voice id with the descriptive trait Google publishes in the Gemini TTS docs and the gender label Google publishes for the matching Cloud Text-to-Speech Chirp3-HD voice. Both are "official"; Gender is "F" / "M" / "" (unknown).
Trait source: https://ai.google.dev/gemini-api/docs/speech-generation
Gender source: https://docs.cloud.google.com/text-to-speech/docs/list-voices-and-types (the same voice names appear under Chirp3-HD with ssmlGender annotated)
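The struct tags determine the JSON shape a client sees. The short example below re-declares VoiceInfo as documented and shows the effect of omitempty on an unknown gender; the encode helper is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VoiceInfo is re-declared here exactly as documented.
type VoiceInfo struct {
	Name   string `json:"name"`
	Trait  string `json:"trait"`
	Gender string `json:"gender,omitempty"` // "F" | "M" | ""
}

// encode marshals v to a JSON string, returning "" on error.
func encode(v VoiceInfo) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(VoiceInfo{"Kore", "Firm", "F"}))         // {"name":"Kore","trait":"Firm","gender":"F"}
	fmt.Println(encode(VoiceInfo{Name: "Kore", Trait: "Firm"})) // {"name":"Kore","trait":"Firm"}
}
```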