tts

package

Version: v0.19.0 (latest) | Published: May 7, 2026 | License: MIT | Imports: 19 | Imported by: 0

Repository

github.com/loppo-llc/kojo

Documentation ¶

Index ¶

Constants
Variables
func EncodeFFmpeg(ctx context.Context, format string, pcm []byte, sampleRate uint32) ([]byte, error)
func Extension(format string) string
func FFmpegAvailable() bool
func IsValidModel(name string) bool
func IsValidVoice(name string) bool
func MimeType(format string) string
func Sanitize(s string) string
func StartCacheSweep()
func SupportedFormats() []string
type Service
- func NewService(apiKeyFn func() (string, error)) *Service
- func (s *Service) LookupCached(hash, format string) ([]byte, bool)
- func (s *Service) Synthesize(ctx context.Context, req SynthesizeRequest) (*SynthesizeResult, error)
type SynthesizeRequest
type SynthesizeResult
type VoiceInfo

Constants ¶

const DefaultModel = "gemini-3.1-flash-tts-preview"

DefaultModel is the Gemini TTS model used by default.

const DefaultStylePrompt = "落ち着いた日本語で、淡々と短く読み上げて。"

DefaultStylePrompt is used when the agent has no custom style prompt.

const DefaultVoice = "Kore"

DefaultVoice is the default voice from the 30-voice Gemini TTS catalogue.

const MaxChars = 800

MaxChars caps input text length to bound cost and latency. Anything longer is truncated with a trailing ellipsis before being sent.
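
The clipping rule can be sketched as follows. This is a minimal illustration, not the package's code: clipText and maxChars are hypothetical names, and it assumes that the limit counts runes rather than bytes, that the ellipsis is the single character "…", and that the ellipsis counts toward the limit. The documentation above does not confirm any of these.

```go
package main

import "fmt"

// maxChars mirrors the documented MaxChars value for illustration only.
const maxChars = 800

// clipText is a hypothetical helper. It truncates s to at most limit runes
// and marks the cut with a trailing ellipsis.
// Assumptions: the limit counts runes, and the ellipsis counts toward it.
func clipText(s string, limit int) string {
	r := []rune(s)
	if len(r) <= limit {
		return s
	}
	if limit < 1 {
		return ""
	}
	return string(r[:limit-1]) + "…"
}

func main() {
	// Seven runes clipped to five: four kept plus the ellipsis.
	fmt.Println(clipText("こんにちは世界", 5)) // prints "こんにち…"
	fmt.Println(clipText("short text", maxChars))
}
```

Slicing by rune rather than by byte keeps multi-byte text such as Japanese from being cut in the middle of a character.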

const SystemInstruction = `` /* 375-byte string literal not displayed */

SystemInstruction is prepended to every TTS request to keep the model in narrator mode and avoid refusal/rewriting of the input text.

Variables ¶

var Models = []string{
	"gemini-3.1-flash-tts-preview",
	"gemini-2.5-flash-preview-tts",
	"gemini-2.5-pro-preview-tts",
}

Models is the set of TTS models we accept from clients.

var VoiceCatalog = []VoiceInfo{
	{"Zephyr", "Bright", "F"},
	{"Puck", "Upbeat", "M"},
	{"Charon", "Informative", "M"},
	{"Kore", "Firm", "F"},
	{"Fenrir", "Excitable", "M"},
	{"Leda", "Youthful", "F"},
	{"Orus", "Firm", "M"},
	{"Aoede", "Breezy", "F"},
	{"Callirrhoe", "Easy-going", "F"},
	{"Autonoe", "Bright", "F"},
	{"Enceladus", "Breathy", "M"},
	{"Iapetus", "Clear", "M"},
	{"Umbriel", "Easy-going", "M"},
	{"Algieba", "Smooth", "M"},
	{"Despina", "Smooth", "F"},
	{"Erinome", "Clear", "F"},
	{"Algenib", "Gravelly", "M"},
	{"Rasalgethi", "Informative", "M"},
	{"Laomedeia", "Upbeat", "F"},
	{"Achernar", "Soft", "F"},
	{"Alnilam", "Firm", "M"},
	{"Schedar", "Even", "M"},
	{"Gacrux", "Mature", "F"},
	{"Pulcherrima", "Forward", "F"},
	{"Achird", "Friendly", "M"},
	{"Zubenelgenubi", "Casual", "M"},
	{"Vindemiatrix", "Gentle", "F"},
	{"Sadachbia", "Lively", "M"},
	{"Sadaltager", "Knowledgeable", "M"},
	{"Sulafat", "Warm", "F"},
}

VoiceCatalog is the canonical, ordered list of voices supported by Gemini TTS as of the speech-generation docs in 2026-05.

var Voices = func() []string {
	out := make([]string, len(VoiceCatalog))
	for i, v := range VoiceCatalog {
		out[i] = v.Name
	}
	return out
}()

Voices is a flat list of voice ids preserved for backwards compatibility with the existing IsValidVoice/Service signatures.

Functions ¶

func EncodeFFmpeg ¶

func EncodeFFmpeg(ctx context.Context, format string, pcm []byte, sampleRate uint32) ([]byte, error)

EncodeFFmpeg pipes raw PCM (16-bit little-endian mono at sampleRate Hz) through ffmpeg and returns the encoded bytes in the requested container.

format must be "opus" (Ogg/Opus 24 kbps voip-tuned) or "mp3" (64 kbps libmp3lame). For "wav" use pcmToWAV directly — it does not require ffmpeg.

A 30 s hard timeout protects against a stuck subprocess; the goroutine piping stdin is bounded by ctx as well.
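
One way such an encoder could be wired up is sketched below. The function names ffmpegArgs and encode are hypothetical, and the exact ffmpeg flags are an assumption chosen to match the documented bitrates and codecs; the package's real command line is not shown in this documentation. The sketch relies on os/exec to copy stdin and to kill the subprocess when the context expires.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"strconv"
	"time"
)

// ffmpegArgs builds an argument list that reads 16-bit little-endian mono
// PCM from stdin and writes the encoded stream to stdout.
// Assumption: these flags approximate the documented settings.
func ffmpegArgs(format string, sampleRate uint32) ([]string, error) {
	args := []string{
		"-hide_banner", "-loglevel", "error",
		"-f", "s16le",
		"-ar", strconv.FormatUint(uint64(sampleRate), 10),
		"-ac", "1",
		"-i", "pipe:0",
	}
	switch format {
	case "opus":
		args = append(args, "-c:a", "libopus", "-b:a", "24k", "-application", "voip", "-f", "ogg")
	case "mp3":
		args = append(args, "-c:a", "libmp3lame", "-b:a", "64k", "-f", "mp3")
	default:
		return nil, fmt.Errorf("unsupported format %q", format)
	}
	return append(args, "pipe:1"), nil
}

// encode runs ffmpeg under a 30 s cap layered on top of the caller's ctx.
func encode(ctx context.Context, format string, pcm []byte, sampleRate uint32) ([]byte, error) {
	args, err := ffmpegArgs(format, sampleRate)
	if err != nil {
		return nil, err
	}
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "ffmpeg", args...)
	cmd.Stdin = bytes.NewReader(pcm)
	var out, errBuf bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &errBuf
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("ffmpeg: %w: %s", err, errBuf.String())
	}
	return out.Bytes(), nil
}

func main() {
	args, err := ffmpegArgs("opus", 24000)
	if err != nil {
		panic(err)
	}
	fmt.Println(args)
}
```

Keeping the argument builder separate from the process call makes the format handling testable on machines without ffmpeg installed.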

func Extension ¶

func Extension(format string) string

Extension returns the file extension (no leading dot) for a format.

func FFmpegAvailable ¶

func FFmpegAvailable() bool

FFmpegAvailable returns whether an ffmpeg binary is on PATH. The lookup is performed once and cached for the process lifetime.
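
A once-per-process lookup of this kind is commonly written with sync.Once and exec.LookPath. The sketch below uses hypothetical lowercase names and is an assumption about the approach, not the package's source.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

var (
	ffmpegOnce  sync.Once
	ffmpegFound bool
)

// ffmpegAvailable searches PATH for "ffmpeg" on the first call only and
// returns the remembered answer on every later call.
func ffmpegAvailable() bool {
	ffmpegOnce.Do(func() {
		_, err := exec.LookPath("ffmpeg")
		ffmpegFound = err == nil
	})
	return ffmpegFound
}

func main() {
	fmt.Println("ffmpeg available:", ffmpegAvailable())
}
```

A consequence of caching for the process lifetime is that installing ffmpeg while the process is running has no effect until a restart.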

func IsValidModel ¶

func IsValidModel(name string) bool

IsValidModel reports whether the given model id is in the accepted list.

func IsValidVoice ¶

func IsValidVoice(name string) bool

IsValidVoice reports whether the given name is in the canonical voice list. Empty string is rejected.

func MimeType ¶

func MimeType(format string) string

MimeType returns the HTTP Content-Type for a supported format.

func Sanitize ¶

func Sanitize(s string) string

Sanitize prepares text for TTS by removing markdown noise that wastes audio token budget and degrades narration quality. The transformation is intentionally conservative — TTS narrators handle punctuation and natural language fine; we only strip what is genuinely unhelpful when spoken (long code blocks, URLs, inline backticks).

The result is also length-clipped to MaxChars so a runaway agent reply can't blow up the audio token bill.
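
The stripping step might look roughly like the sketch below. stripForSpeech is a hypothetical name, and several choices are assumptions: every fenced block is removed regardless of length, inline backtick characters are dropped while the text between them is kept, and whitespace is collapsed afterwards. Length clipping is left out.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// fence is three backticks, built at runtime to keep this listing readable.
	fence       = strings.Repeat("`", 3)
	fencedBlock = regexp.MustCompile("(?s)" + fence + ".*?" + fence)
	urlPattern  = regexp.MustCompile(`https?://\S+`)
)

// stripForSpeech removes fenced code blocks and URLs, drops inline backtick
// characters, and collapses runs of whitespace into single spaces.
func stripForSpeech(s string) string {
	s = fencedBlock.ReplaceAllString(s, " ")
	s = urlPattern.ReplaceAllString(s, " ")
	s = strings.ReplaceAll(s, "`", "")
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	in := "Run `make test` first, see https://example.com/docs for details."
	fmt.Println(stripForSpeech(in))
}
```

Punctuation is left untouched, in line with the conservative intent described above.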

func StartCacheSweep ¶

func StartCacheSweep()

func SupportedFormats ¶

func SupportedFormats() []string

SupportedFormats reports which output container formats kojo can emit in the current environment. WAV is always available; opus and mp3 require ffmpeg.
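
The availability rule reduces to a small branch. In this sketch supportedFormats is a hypothetical helper that takes the ffmpeg check as a parameter, and the order of the returned slice is an assumption.

```go
package main

import "fmt"

// supportedFormats lists the formats that can be emitted. wav needs no
// external tool; opus and mp3 are offered only when ffmpeg is present.
// Assumption: the ordering here is illustrative, not the package's.
func supportedFormats(hasFFmpeg bool) []string {
	if hasFFmpeg {
		return []string{"opus", "mp3", "wav"}
	}
	return []string{"wav"}
}

func main() {
	fmt.Println(supportedFormats(false)) // prints "[wav]"
	fmt.Println(supportedFormats(true))  // prints "[opus mp3 wav]"
}
```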

Types ¶

type Service ¶

type Service struct {
	// contains filtered or unexported fields
}

Service performs Gemini TTS synthesis and caches results on disk.

The API key is fetched lazily via getAPIKey on every request, so a key rotation in the credential store takes effect on the next call without needing to rewire the service.

func NewService ¶

func NewService(apiKeyFn func() (string, error)) *Service

NewService constructs a Service. apiKeyFn must return the current Gemini Developer API key.
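
The effect of fetching the key on every request can be seen in a small stand-in. keyedClient and currentKey are hypothetical names used only for this sketch; the real Service keeps its fields unexported.

```go
package main

import (
	"errors"
	"fmt"
)

// keyedClient is a hypothetical stand-in for Service: it stores the key
// function rather than the key itself.
type keyedClient struct {
	apiKeyFn func() (string, error)
}

func newKeyedClient(fn func() (string, error)) *keyedClient {
	return &keyedClient{apiKeyFn: fn}
}

// currentKey would be called at the start of each request, so a rotated
// key is picked up without rebuilding the client.
func (c *keyedClient) currentKey() (string, error) {
	key, err := c.apiKeyFn()
	if err != nil {
		return "", fmt.Errorf("fetch api key: %w", err)
	}
	if key == "" {
		return "", errors.New("api key is empty")
	}
	return key, nil
}

func main() {
	stored := "key-v1"
	c := newKeyedClient(func() (string, error) { return stored, nil })

	k1, _ := c.currentKey()
	stored = "key-v2" // simulate a rotation in the credential store
	k2, _ := c.currentKey()
	fmt.Println(k1, k2) // prints "key-v1 key-v2"
}
```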

func (*Service) LookupCached ¶

func (s *Service) LookupCached(hash, format string) ([]byte, bool)

LookupCached returns the bytes for a previously synthesized hash. It is used by the GET /audio endpoint to serve the file directly with a long browser cache. format must match the on-disk extension.

func (*Service) Synthesize ¶

func (s *Service) Synthesize(ctx context.Context, req SynthesizeRequest) (*SynthesizeResult, error)

Synthesize is the main entry point. The flow is:

1. Sanitize and validate input.
2. Hash the request and return the cached file if present.
3. Call Gemini :generateContent with safetySettings=OFF.
4. Decode the inline-data audio (raw 24 kHz LE16 PCM).
5. Encode to the requested container (ffmpeg for opus/mp3, in-process WAV header for wav).
6. Persist to cache and return.

type SynthesizeRequest ¶

type SynthesizeRequest struct {
	Model       string // "" = DefaultModel
	Voice       string // "" = DefaultVoice
	StylePrompt string // "" = DefaultStylePrompt
	Text        string // raw, will be sanitized inside Synthesize
	Format      string // "opus" | "mp3" | "wav"
}

SynthesizeRequest is the input to Service.Synthesize. Callers pass agent-derived configuration directly: Text is raw and is sanitized inside Synthesize, and an empty Model, Voice, or StylePrompt falls back to the corresponding default.

type SynthesizeResult ¶

type SynthesizeResult struct {
	Hash       string
	Format     string
	AudioBytes []byte
	Cached     bool
}

SynthesizeResult is what Service.Synthesize returns. AudioBytes is the fully encoded payload ready to be served as MimeType(Format). Hash is the cache key (hex sha256) so handlers can build the audio URL.
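
A hex sha256 cache key could be derived as in the sketch below. Which fields feed the hash is not documented; this version assumes Model, Voice, StylePrompt, and Text, and leaves Format out on the guess that the file extension carries it. The request type and cacheKey are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// request holds the fields assumed to influence the synthesized audio.
type request struct {
	Model, Voice, StylePrompt, Text string
}

// cacheKey hashes the fields into a 64-character hex string.
func cacheKey(r request) string {
	h := sha256.New()
	for _, field := range []string{r.Model, r.Voice, r.StylePrompt, r.Text} {
		// Length-prefix each field so that ("ab", "c") and ("a", "bc")
		// cannot produce the same byte stream.
		fmt.Fprintf(h, "%d:%s|", len(field), field)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	key := cacheKey(request{Model: "m", Voice: "Kore", Text: "hello"})
	fmt.Println(len(key)) // prints "64"
}
```

Identical requests map to the same key, which is what lets a repeated synthesis be served from disk.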

type VoiceInfo ¶

type VoiceInfo struct {
	Name   string `json:"name"`
	Trait  string `json:"trait"`
	Gender string `json:"gender,omitempty"` // "F" | "M" | ""
}

VoiceInfo pairs a voice id with the descriptive trait Google publishes in the Gemini TTS docs and the gender label Google publishes for the matching Cloud Text-to-Speech Chirp3-HD voice. Both are "official"; Gender is "F" / "M" / "" (unknown).

Trait source: https://ai.google.dev/gemini-api/docs/speech-generation

Gender source: https://docs.cloud.google.com/text-to-speech/docs/list-voices-and-types (the same voice names appear under Chirp3-HD with ssmlGender annotated)
