tts

package
v0.19.0 Latest
Warning

This package is not in the latest version of its module.

Published: May 7, 2026 License: MIT Imports: 19 Imported by: 0

Documentation

Constants

const DefaultModel = "gemini-3.1-flash-tts-preview"

DefaultModel is the Gemini TTS model used by default.

const DefaultStylePrompt = "落ち着いた日本語で、淡々と短く読み上げて。"

DefaultStylePrompt is used when the agent has no custom style prompt. In English it reads roughly: "Read this aloud in calm Japanese, plainly and briefly."

const DefaultVoice = "Kore"

DefaultVoice is the default voice from the 30-voice Gemini TTS catalogue.

const MaxChars = 800

MaxChars caps input text length to bound cost and latency. Anything longer is truncated with a trailing ellipsis before being sent.
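The truncation rule above could look roughly like the sketch below. It assumes the limit is counted in runes rather than bytes (so multi-byte Japanese text is never cut mid-character) and that the ellipsis is the single character "…". The helper name clip is hypothetical, and the package's actual logic may differ.

```go
package main

import "fmt"

// maxChars mirrors the documented MaxChars constant.
const maxChars = 800

// clip is a hypothetical helper: it keeps at most limit runes of s and
// appends an ellipsis only when something was actually cut off.
func clip(s string, limit int) string {
	r := []rune(s)
	if len(r) <= limit {
		return s
	}
	return string(r[:limit]) + "…"
}

func main() {
	fmt.Println(clip("short text", maxChars)) // unchanged, under the limit
	fmt.Println(clip("こんにちは世界", 5))          // cut after five runes
}
```

Whether the ellipsis itself counts toward the 800-character budget is not stated in the documentation; this sketch does not count it.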

const SystemInstruction = `` /* 375-byte string literal not displayed */

SystemInstruction is prepended to every TTS request to keep the model in narrator mode and avoid refusal/rewriting of the input text.

Variables

var Models = []string{
	"gemini-3.1-flash-tts-preview",
	"gemini-2.5-flash-preview-tts",
	"gemini-2.5-pro-preview-tts",
}

Models is the set of TTS models we accept from clients.

var VoiceCatalog = []VoiceInfo{
	{"Zephyr", "Bright", "F"},
	{"Puck", "Upbeat", "M"},
	{"Charon", "Informative", "M"},
	{"Kore", "Firm", "F"},
	{"Fenrir", "Excitable", "M"},
	{"Leda", "Youthful", "F"},
	{"Orus", "Firm", "M"},
	{"Aoede", "Breezy", "F"},
	{"Callirrhoe", "Easy-going", "F"},
	{"Autonoe", "Bright", "F"},
	{"Enceladus", "Breathy", "M"},
	{"Iapetus", "Clear", "M"},
	{"Umbriel", "Easy-going", "M"},
	{"Algieba", "Smooth", "M"},
	{"Despina", "Smooth", "F"},
	{"Erinome", "Clear", "F"},
	{"Algenib", "Gravelly", "M"},
	{"Rasalgethi", "Informative", "M"},
	{"Laomedeia", "Upbeat", "F"},
	{"Achernar", "Soft", "F"},
	{"Alnilam", "Firm", "M"},
	{"Schedar", "Even", "M"},
	{"Gacrux", "Mature", "F"},
	{"Pulcherrima", "Forward", "F"},
	{"Achird", "Friendly", "M"},
	{"Zubenelgenubi", "Casual", "M"},
	{"Vindemiatrix", "Gentle", "F"},
	{"Sadachbia", "Lively", "M"},
	{"Sadaltager", "Knowledgeable", "M"},
	{"Sulafat", "Warm", "F"},
}

VoiceCatalog is the canonical, ordered list of voices supported by Gemini TTS as of the speech-generation docs in 2026-05.

var Voices = func() []string {
	out := make([]string, len(VoiceCatalog))
	for i, v := range VoiceCatalog {
		out[i] = v.Name
	}
	return out
}()

Voices is a flat list of voice ids preserved for backwards compatibility with the existing IsValidVoice/Service signatures.

Functions

func EncodeFFmpeg

func EncodeFFmpeg(ctx context.Context, format string, pcm []byte, sampleRate uint32) ([]byte, error)

EncodeFFmpeg pipes raw PCM (16-bit little-endian mono at sampleRate Hz) through ffmpeg and returns the encoded bytes in the requested container.

format must be "opus" (Ogg/Opus at 24 kbps, tuned for VoIP) or "mp3" (64 kbps via libmp3lame). "wav" is not handled here: WAV output is produced in-process by the unexported pcmToWAV helper, which does not require ffmpeg.

A 30 s hard timeout protects against a stuck subprocess; the goroutine piping stdin is bounded by ctx as well.
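For orientation, a pipeline of this shape might be sketched as follows. The ffmpeg flags are an assumption chosen to match the documented bitrates and codecs, not the flags the package actually passes, and the names ffmpegArgs and encode are hypothetical.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"strconv"
	"time"
)

// ffmpegArgs builds an argument list that reads raw 16-bit little-endian
// mono PCM from stdin and writes the encoded stream to stdout.
// The exact flags are assumed for illustration.
func ffmpegArgs(format string, sampleRate uint32) ([]string, error) {
	in := []string{"-hide_banner", "-loglevel", "error",
		"-f", "s16le", "-ar", strconv.FormatUint(uint64(sampleRate), 10),
		"-ac", "1", "-i", "pipe:0"}
	switch format {
	case "opus":
		return append(in, "-c:a", "libopus", "-b:a", "24k",
			"-application", "voip", "-f", "ogg", "pipe:1"), nil
	case "mp3":
		return append(in, "-c:a", "libmp3lame", "-b:a", "64k",
			"-f", "mp3", "pipe:1"), nil
	}
	return nil, fmt.Errorf("unsupported format %q", format)
}

// encode runs ffmpeg under a 30 s timeout layered on the caller's context,
// so either cancellation or the deadline kills the subprocess.
func encode(ctx context.Context, format string, pcm []byte, sampleRate uint32) ([]byte, error) {
	args, err := ffmpegArgs(format, sampleRate)
	if err != nil {
		return nil, err
	}
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "ffmpeg", args...)
	cmd.Stdin = bytes.NewReader(pcm)
	var out, stderr bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("ffmpeg: %w: %s", err, stderr.String())
	}
	return out.Bytes(), nil
}

func main() {
	args, _ := ffmpegArgs("opus", 24000)
	fmt.Println(args) // the command line that would be run
	_ = encode        // calling encode requires ffmpeg on PATH
}
```

Assigning a reader to cmd.Stdin lets os/exec manage the copying goroutine, which is one simple way to keep the stdin pipe bounded by the same context.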

func Extension

func Extension(format string) string

Extension returns the file extension (no leading dot) for a format.

func FFmpegAvailable

func FFmpegAvailable() bool

FFmpegAvailable returns whether an ffmpeg binary is on PATH. The lookup is performed once and cached for the process lifetime.
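A once-only cached lookup of this kind is commonly written with sync.Once and exec.LookPath. The sketch below illustrates that pattern under the assumption that the package does something similar; the lower-case names are hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

var (
	ffmpegOnce  sync.Once
	ffmpegFound bool
)

// ffmpegAvailable searches PATH on the first call only; every later call
// returns the cached answer, even if PATH changes afterwards.
func ffmpegAvailable() bool {
	ffmpegOnce.Do(func() {
		_, err := exec.LookPath("ffmpeg")
		ffmpegFound = err == nil
	})
	return ffmpegFound
}

func main() {
	fmt.Println("ffmpeg available:", ffmpegAvailable())
}
```

A consequence of caching for the process lifetime is that installing ffmpeg while the process is running has no effect until a restart.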

func IsValidModel

func IsValidModel(name string) bool

IsValidModel reports whether the given model id is in the accepted list.

func IsValidVoice

func IsValidVoice(name string) bool

IsValidVoice reports whether the given name is in the canonical voice list. Empty string is rejected.

func MimeType

func MimeType(format string) string

MimeType returns the HTTP Content-Type for a supported format.

func Sanitize

func Sanitize(s string) string

Sanitize prepares text for TTS by removing markdown noise that wastes audio token budget and degrades narration quality. The transformation is intentionally conservative — TTS narrators handle punctuation and natural language fine; we only strip what is genuinely unhelpful when spoken (long code blocks, URLs, inline backticks).

The result is also length-clipped to MaxChars so a runaway agent reply can't blow up the audio token bill.
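A conservative cleanup pass in this spirit might look like the sketch below. The regular expressions, the decision to drop every fenced block regardless of length, and the helper name sanitize are all assumptions for illustration; the real function may apply different rules.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fence is three backticks, built at run time to keep the source readable.
var fence = strings.Repeat("`", 3)

var (
	fencedCode = regexp.MustCompile("(?s)" + fence + ".*?" + fence)
	urlPattern = regexp.MustCompile(`https?://\S+`)
	spaceRuns  = regexp.MustCompile(`[ \t]+`)
)

// sanitize removes fenced code blocks and URLs, drops inline backticks while
// keeping the text between them, collapses runs of spaces, and finally clips
// the result to limit runes with a trailing ellipsis.
func sanitize(s string, limit int) string {
	s = fencedCode.ReplaceAllString(s, " ")
	s = urlPattern.ReplaceAllString(s, " ")
	s = strings.ReplaceAll(s, "`", "")
	s = strings.TrimSpace(spaceRuns.ReplaceAllString(s, " "))
	if r := []rune(s); len(r) > limit {
		s = string(r[:limit]) + "…"
	}
	return s
}

func main() {
	in := "Run `go test` and see https://example.com/docs for details."
	fmt.Println(sanitize(in, 800))
}
```

Punctuation and ordinary prose pass through untouched, which matches the stated goal of stripping only what is unhelpful when spoken.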

func StartCacheSweep

func StartCacheSweep()

func SupportedFormats

func SupportedFormats() []string

SupportedFormats reports which output container formats kojo can emit in the current environment. WAV is always available; opus and mp3 require ffmpeg.

Types

type Service

type Service struct {
	// contains filtered or unexported fields
}

Service performs Gemini TTS synthesis and caches results on disk.

The API key is fetched lazily via getAPIKey on every request, so a key rotation in the credential store takes effect on the next call without needing to rewire the service.

func NewService

func NewService(apiKeyFn func() (string, error)) *Service

NewService constructs a Service. apiKeyFn must return the current Gemini Developer API key.
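The effect of fetching the key on every request can be illustrated with a small stand-in type. This is a sketch only: the service type, its currentKey method, and the key values are hypothetical and do not reproduce the package's internals.

```go
package main

import (
	"errors"
	"fmt"
)

// service is a stand-in that stores the key callback, not the key itself.
type service struct {
	apiKeyFn func() (string, error)
}

func newService(apiKeyFn func() (string, error)) *service {
	return &service{apiKeyFn: apiKeyFn}
}

// currentKey would be called at the start of each request, so the value is
// never cached inside the service.
func (s *service) currentKey() (string, error) {
	key, err := s.apiKeyFn()
	if err != nil {
		return "", fmt.Errorf("fetch API key: %w", err)
	}
	if key == "" {
		return "", errors.New("API key is empty")
	}
	return key, nil
}

func main() {
	stored := "key-v1" // stands in for the credential store
	svc := newService(func() (string, error) { return stored, nil })

	before, _ := svc.currentKey()
	stored = "key-v2" // simulate a key rotation
	after, _ := svc.currentKey()
	fmt.Println(before, after) // prints "key-v1 key-v2"
}
```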

func (*Service) LookupCached

func (s *Service) LookupCached(hash, format string) ([]byte, bool)

LookupCached returns the bytes for a previously synthesized hash. It is used by the GET /audio endpoint to serve the file directly with a long browser cache. format must match the on-disk extension.
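A disk lookup keyed by hash and format could be sketched as below. The file layout (one file named hash.format inside a cache directory), the strict hex check, and the explicit dir parameter are assumptions made for the example; the real method takes only hash and format and may store files differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
)

// hexHash accepts only a lower-case hex SHA-256 digest, which also rules out
// path separators and ".." in a value that may come from a URL.
var hexHash = regexp.MustCompile(`^[0-9a-f]{64}$`)

// lookupCached returns the cached bytes, or false when the key is malformed,
// the format is unknown, or no file exists.
func lookupCached(dir, hash, format string) ([]byte, bool) {
	if !hexHash.MatchString(hash) {
		return nil, false
	}
	switch format {
	case "opus", "mp3", "wav":
	default:
		return nil, false
	}
	data, err := os.ReadFile(filepath.Join(dir, hash+"."+format))
	if err != nil {
		return nil, false
	}
	return data, true
}

func main() {
	dir, err := os.MkdirTemp("", "tts-cache")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	hash := fmt.Sprintf("%064x", 1) // placeholder digest for the demo
	if err := os.WriteFile(filepath.Join(dir, hash+".wav"), []byte("RIFF"), 0o600); err != nil {
		panic(err)
	}
	data, ok := lookupCached(dir, hash, "wav")
	fmt.Println(ok, len(data))
	_, ok = lookupCached(dir, hash, "mp3") // same hash, different extension
	fmt.Println(ok)
}
```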

func (*Service) Synthesize

func (s *Service) Synthesize(ctx context.Context, req SynthesizeRequest) (*SynthesizeResult, error)

Synthesize is the main entry point. The flow is:

  1. Sanitize and validate input.
  2. Hash the request and return the cached file if present.
  3. Call Gemini :generateContent with safetySettings=OFF.
  4. Decode the inline-data audio (raw 24 kHz LE16 PCM).
  5. Encode to the requested container (ffmpeg for opus/mp3, in-process WAV header for wav).
  6. Persist to cache and return.
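Step 5's in-process WAV path amounts to prepending a 44-byte RIFF header to the PCM. The following is a minimal sketch of that header for 16-bit mono audio; the function name wavFromPCM is hypothetical and is not the package's own pcmToWAV.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// wavFromPCM wraps raw 16-bit little-endian mono PCM in a canonical
// 44-byte RIFF/WAVE header.
func wavFromPCM(pcm []byte, sampleRate uint32) []byte {
	const (
		channels      = 1
		bitsPerSample = 16
		headerSize    = 44
	)
	blockAlign := uint16(channels * bitsPerSample / 8) // bytes per frame
	byteRate := sampleRate * uint32(blockAlign)
	le := binary.LittleEndian

	out := make([]byte, 0, headerSize+len(pcm))
	out = append(out, "RIFF"...)
	out = le.AppendUint32(out, uint32(headerSize-8+len(pcm))) // RIFF chunk size
	out = append(out, "WAVEfmt "...)
	out = le.AppendUint32(out, 16) // fmt chunk size
	out = le.AppendUint16(out, 1)  // audio format 1 = PCM
	out = le.AppendUint16(out, channels)
	out = le.AppendUint32(out, sampleRate)
	out = le.AppendUint32(out, byteRate)
	out = le.AppendUint16(out, blockAlign)
	out = le.AppendUint16(out, bitsPerSample)
	out = append(out, "data"...)
	out = le.AppendUint32(out, uint32(len(pcm))) // data chunk size
	return append(out, pcm...)
}

func main() {
	pcm := make([]byte, 4800) // 0.1 s of silence at 24 kHz
	wav := wavFromPCM(pcm, 24000)
	fmt.Println(len(wav), string(wav[:4]))
}
```

Because the header is fixed-size and computed from the payload length alone, this path needs no external encoder, which is why WAV can always be offered.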

type SynthesizeRequest

type SynthesizeRequest struct {
	Model       string // "" = DefaultModel
	Voice       string // "" = DefaultVoice
	StylePrompt string // "" = DefaultStylePrompt
	Text        string // raw, will be sanitized inside Synthesize
	Format      string // "opus" | "mp3" | "wav"
}

SynthesizeRequest is the input to Service.Synthesize. Synthesize sanitizes and validates the fields itself, so callers can pass agent-derived configuration directly.
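The empty-string defaults noted in the field comments could be resolved along these lines. This is a sketch under assumptions: the request type and normalize function are hypothetical, the voice list is abbreviated, and rejecting an empty Format is a guess, since the documentation names no default format.

```go
package main

import (
	"fmt"
	"slices"
)

// These mirror the documented constants.
const (
	defaultModel = "gemini-3.1-flash-tts-preview"
	defaultVoice = "Kore"
	defaultStyle = "落ち着いた日本語で、淡々と短く読み上げて。"
)

var (
	models  = []string{"gemini-3.1-flash-tts-preview", "gemini-2.5-flash-preview-tts", "gemini-2.5-pro-preview-tts"}
	voices  = []string{"Zephyr", "Puck", "Charon", "Kore"} // abbreviated for the example
	formats = []string{"opus", "mp3", "wav"}
)

// request is a stand-in for SynthesizeRequest.
type request struct {
	Model, Voice, StylePrompt, Text, Format string
}

// normalize fills empty fields with defaults, then validates the result.
func normalize(req request) (request, error) {
	if req.Model == "" {
		req.Model = defaultModel
	}
	if req.Voice == "" {
		req.Voice = defaultVoice
	}
	if req.StylePrompt == "" {
		req.StylePrompt = defaultStyle
	}
	switch {
	case !slices.Contains(models, req.Model):
		return req, fmt.Errorf("unknown model %q", req.Model)
	case !slices.Contains(voices, req.Voice):
		return req, fmt.Errorf("unknown voice %q", req.Voice)
	case !slices.Contains(formats, req.Format):
		return req, fmt.Errorf("unknown format %q", req.Format)
	}
	return req, nil
}

func main() {
	req, err := normalize(request{Text: "hello", Format: "wav"})
	fmt.Println(req.Model, req.Voice, err)
}
```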

type SynthesizeResult

type SynthesizeResult struct {
	Hash       string
	Format     string
	AudioBytes []byte
	Cached     bool
}

SynthesizeResult is what Service.Synthesize returns. AudioBytes is the fully encoded payload ready to be served as MimeType(Format). Hash is the cache key (hex sha256) so handlers can build the audio URL.
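A hex SHA-256 cache key over the request fields might be derived as in this sketch. Which fields feed the hash, and how they are delimited, is not documented; the choice below (model, voice, style prompt, and text, each length-prefixed) and the name cacheKey are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey hashes the fields that affect the synthesized audio. Each field
// is length-prefixed so that ("ab", "c") and ("a", "bc") cannot collide.
func cacheKey(model, voice, style, text string) string {
	h := sha256.New()
	for _, field := range []string{model, voice, style, text} {
		fmt.Fprintf(h, "%d:%s", len(field), field)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	key := cacheKey("gemini-3.1-flash-tts-preview", "Kore", "calm", "hello")
	fmt.Println(len(key), key)
}
```

A deterministic key means an identical request maps to the same file, which is what allows a previously synthesized result to be returned with Cached set to true.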

type VoiceInfo

type VoiceInfo struct {
	Name   string `json:"name"`
	Trait  string `json:"trait"`
	Gender string `json:"gender,omitempty"` // "F" | "M" | ""
}

VoiceInfo pairs a voice id with the descriptive trait Google publishes in the Gemini TTS docs and the gender label Google publishes for the matching Cloud Text-to-Speech Chirp3-HD voice. Both are "official"; Gender is "F" / "M" / "" (unknown).

Trait source: https://ai.google.dev/gemini-api/docs/speech-generation

Gender source: https://docs.cloud.google.com/text-to-speech/docs/list-voices-and-types (the same voice names appear under Chirp3-HD with ssmlGender annotated)
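The struct tags determine how a VoiceInfo appears in JSON. The short example below copies the type as documented and shows the effect of omitempty on an unknown gender; the encodeVoice helper and the "Example" voice are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VoiceInfo is copied from the documented type.
type VoiceInfo struct {
	Name   string `json:"name"`
	Trait  string `json:"trait"`
	Gender string `json:"gender,omitempty"` // "F" | "M" | ""
}

// encodeVoice marshals a single entry; an empty string signals an error.
func encodeVoice(v VoiceInfo) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encodeVoice(VoiceInfo{"Kore", "Firm", "F"}))
	// prints {"name":"Kore","trait":"Firm","gender":"F"}

	fmt.Println(encodeVoice(VoiceInfo{Name: "Example", Trait: "Neutral"}))
	// prints {"name":"Example","trait":"Neutral"}
}
```

Clients should therefore treat a missing gender key as "unknown" instead of expecting an empty string.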
