Documentation ¶
Overview ¶
Package engine defines the VoiceEngine interface and its supporting types.
A VoiceEngine is responsible for the core conversational loop of a single NPC: it receives an audio frame from the player, runs STT → LLM → TTS (or an equivalent end-to-end model), and returns a Response containing the NPC's reply text, a streaming audio channel, and any tool calls the model requested.
Context injection ([VoiceEngine.InjectContext]) lets the orchestrator push scene changes, identity updates, and recent utterances into a live session without tearing down and re-creating the engine — important for low-latency voice loops where re-initialisation costs are unacceptable.
Implementations are provided by provider-specific packages. The interface is intentionally narrow so that the orchestrator remains provider-agnostic.
This package lives under internal/ because it encapsulates application-private processing logic and is not intended to be imported by external code.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ContextUpdate ¶
type ContextUpdate struct {
// Identity is an updated NPC persona / system prompt fragment. If non-empty,
// the engine replaces or amends its current identity context.
Identity string
// Scene is an updated description of the current in-game scene sent as
// additional context to the LLM.
Scene string
// RecentUtterances are the latest transcript entries to append to the
// engine's conversation history before the next process call.
RecentUtterances []memory.TranscriptEntry
}
ContextUpdate carries a mid-session context refresh pushed via [VoiceEngine.InjectContext]. Fields are merged into the engine's running state; zero values are ignored.
type PromptContext ¶
type PromptContext struct {
// SystemPrompt is the full NPC persona / system instruction sent as the
// first message in the LLM conversation history.
SystemPrompt string
// HotContext is a short, dynamically generated string injected just before
// the player's utterance. Typical contents: current location, active quest,
// visible objects. Kept intentionally short to fit within latency budgets.
HotContext string
// PreFetchResults holds pre-fetched tool results that the orchestrator
// resolved speculatively before the player finished speaking. Passed as
// context so the LLM can reference them without issuing additional tool calls.
PreFetchResults []string
// Messages is the recent conversation history. The engine may truncate or
// summarise this list to stay within the model's context window.
Messages []llm.Message
// BudgetTier controls which tools are offered to the LLM based on latency
// constraints. See [mcp.BudgetTier] for tier definitions.
BudgetTier mcp.BudgetTier
}
PromptContext bundles everything the VoiceEngine needs to build the LLM prompt for a single [VoiceEngine.Process] call.
type Response ¶
type Response struct {
// Text is the NPC's reply in plain text (already cleaned of SSML / markup).
// Useful for logging, transcript recording, and subtitle display.
Text string
// Audio is a read-only channel that streams raw audio bytes (e.g., Opus
// packets or PCM chunks) as they are produced by the TTS stage. The channel
// is closed when synthesis completes or when a mid-stream error occurs.
// After the channel closes, call [Response.Err] to check whether synthesis
// completed cleanly. Callers must drain the channel even if they do not use
// the audio, to avoid blocking the engine's internal pipeline.
Audio <-chan []byte
// SampleRate is the sample rate in Hz of the PCM data on the Audio channel
// (e.g., 22050, 24000, 48000). Set by the engine based on its TTS provider
// configuration.
SampleRate int
// Channels is the number of audio channels (1 = mono, 2 = stereo).
Channels int
// ToolCalls lists any tool invocations the LLM requested during generation.
// The orchestrator is responsible for executing them and, if needed, feeding
// results back to the engine via a follow-up [VoiceEngine.Process] call.
ToolCalls []llm.ToolCall
// FinalText is closed when the definitive response text is available.
// Read FinalTextValue after <-FinalText returns.
//
// If playback completes naturally, FinalTextValue == Text.
// If interrupted (barge-in or mute), FinalTextValue is the truncated
// text with a "..." suffix reflecting only what was actually heard.
//
// May be nil for engines that do not support transcript truncation —
// callers must check before waiting.
FinalText chan struct{}
// FinalTextValue holds the definitive (possibly truncated) response text.
// Only valid after FinalText is closed.
FinalTextValue string
// NotifyDone is written to by the caller (agent) when the mixer
// finishes with the audio segment. interrupted=true means playback
// was cut short. The engine reads this to decide whether to truncate.
// May be nil if the engine does not support truncation.
NotifyDone chan bool
// contains filtered or unexported fields
}
Response is the result of a successful [VoiceEngine.Process] call.
func (*Response) Err ¶
Err returns the error that caused the Audio channel to close prematurely, or nil if the stream completed successfully. Callers should check Err after the Audio channel is closed.
func (*Response) SetStreamErr ¶
SetStreamErr records a mid-stream error. The engine goroutine should call this before closing the Audio channel so that callers can distinguish a clean completion from a failure.
type VoiceEngine ¶
type VoiceEngine interface {
// Process handles a complete voice interaction: it transcribes input (if the
// engine performs STT internally), generates a response with the LLM using
// prompt, synthesises speech, and returns a [Response]. The call blocks until
// at least the text response is available; audio may continue streaming after
// Process returns.
//
// An error is returned if any pipeline stage fails unrecoverably. Transient
// errors (e.g., a single dropped packet) are handled internally.
Process(ctx context.Context, input audio.AudioFrame, prompt PromptContext) (*Response, error)
// InjectContext pushes an out-of-band context update into the running session.
// The engine merges update into its state and applies it on the next call to
// [VoiceEngine.Process]. InjectContext is non-blocking and returns as soon as
// the update is queued.
InjectContext(ctx context.Context, update ContextUpdate) error
// SetTools replaces the full set of tools offered to the LLM. The new list
// takes effect on the next [VoiceEngine.Process] call. Pass a nil or empty
// slice to disable tool calling.
SetTools(tools []llm.ToolDefinition) error
// OnToolCall registers handler as the synchronous executor for LLM tool calls.
// When the LLM requests a tool during [VoiceEngine.Process], the engine calls
// handler(name, args) where args is a JSON-encoded argument string. handler must
// return a JSON-encoded result string, or a non-nil error if execution fails.
//
// Only one handler may be registered at a time; subsequent calls replace the
// previous registration. handler is called on the engine's internal goroutine
// and must not block for longer than the configured tool budget.
OnToolCall(handler func(name string, args string) (string, error))
// Transcripts returns a read-only channel on which the engine publishes
// [memory.TranscriptEntry] values — one for each final STT result and one
// for each NPC response. The channel is closed when the engine is closed.
Transcripts() <-chan memory.TranscriptEntry
// Close releases all resources held by the engine (connections, goroutines,
// TTS synthesis streams). It closes the [Transcripts] channel and is safe to
// call multiple times; subsequent calls return nil.
Close() error
}
VoiceEngine handles the complete speech-in / speech-out pipeline for one NPC.
A single VoiceEngine instance is owned by one NPC agent. Multiple agents must not share an engine; create one engine per NPC.
All methods that accept a context.Context respect cancellation. Cancelling a context passed to [VoiceEngine.Process] will abort the in-flight STT/LLM/TTS call and close the [Response.Audio] channel.
Implementations must be safe for concurrent use, though callers should avoid issuing concurrent [VoiceEngine.Process] calls for the same NPC unless the implementation explicitly documents support for that pattern.
Directories ¶
| Path | Synopsis |
|---|---|
| cascade | Package cascade implements an experimental dual-model sentence cascade engine. |
| mock | Package mock provides an in-memory mock implementation of engine.VoiceEngine for use in unit tests. |
| s2s | Package s2s provides an engine.VoiceEngine implementation that wraps an s2s.Provider, bridging the turn-based VoiceEngine.Process API with the streaming S2S session interface. |