Documentation
Overview ¶
Package wakeword implements an always-on, on-device keyword spotter that any of the three SpeechKit modes (Dictation, Assist, Voice Agent) can opt into.
The package is platform-neutral kernel code. SpeechKit's framework architecture splits responsibilities across two layers:
Framework kernel (this package): keyword-spotting detector, audio buffering, debounce/cooldown logic, event dispatch contracts. Imported by every client that wants wake-word support.
Per-client adapters: own the OS-level audio source, hotkey mapping and status surface. The Windows Device-Target lives in cmd/speechkit/desktop_wakeword.go and is the reference adapter that ships with the SpeechKit demo. Future client targets (Android via gomobile, Local-Target Go CLI, web via WASM bindings, native iOS) plug their audio source and event sink into the same Pipeline.
Wake-word is never a server-side concern in SpeechKit. The Server-Target (internal/server) exposes Dictation/Assist/Voice Agent over HTTP/WS but has no wake-word surface — running an always-on mic in a server process makes no architectural sense and is intentionally out of scope.
Detection runs via the sherpa-onnx Zipformer KWS model (github.com/k2-fsa/sherpa-onnx-go). The model is Apache-2.0, ships as a single ~17 MB asset with encoder/decoder/joiner ONNX files plus a tokens + keywords file, and supports open-vocabulary keyword spotting — users declare keywords as text and tune a per-keyword threshold without retraining.
Index ¶
Constants ¶
const (
SampleRate = 16000
BytesPerSample = 2
Channels = 1
// FrameSamples is the unit the audio session delivers per
// SetPCMHandler callback (80 ms at 16 kHz).
FrameSamples = 1280
FrameBytes = FrameSamples * BytesPerSample
FrameDur = 80 * time.Millisecond
)
Audio contract for the wake-word pipeline. Matches internal/audio and internal/vad — 16 kHz mono S16 PCM. The detector accepts the same frame cadence the rest of the kernel produces, so a single audio session can fan out to VAD + wake-word without resampling.
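As a rough illustration of the frame arithmetic above, the following sketch cuts a raw S16 mono buffer into whole frames. The constants are copied from this package; splitFrames is a hypothetical helper invented for the example and is not part of the package API.

```go
package main

import "fmt"

// Constants copied from the package's audio contract.
const (
	SampleRate     = 16000
	BytesPerSample = 2
	FrameSamples   = 1280
	FrameBytes     = FrameSamples * BytesPerSample
)

// splitFrames is a hypothetical helper (not part of the package). It cuts a
// raw S16 mono buffer into whole FrameBytes-sized frames and returns any
// trailing partial frame so the caller can carry it into the next callback.
func splitFrames(pcm []byte) (frames [][]byte, rest []byte) {
	for len(pcm) >= FrameBytes {
		frames = append(frames, pcm[:FrameBytes])
		pcm = pcm[FrameBytes:]
	}
	return frames, pcm
}

func main() {
	frames, rest := splitFrames(make([]byte, 2*FrameBytes+100))
	fmt.Println(FrameBytes, len(frames), len(rest)) // 2560 2 100

	// One frame covers FrameSamples/SampleRate seconds of audio.
	fmt.Println(float64(FrameSamples) / SampleRate * 1000) // 80 (milliseconds)
}
```

Carrying the remainder over, rather than padding it, keeps the cadence at exactly FrameSamples per delivered frame.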
const SourceWakeword = "wakeword"
SourceWakeword is the value the Dispatcher writes into HotkeyEvent.Source. Client adapters should compare against this constant rather than the literal string to stay decoupled from the dispatch wording.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type AutoEndConfig ¶ added in v0.35.8
type AutoEndConfig struct {
// SilenceCutoff is the duration without user audio activity after which
// the session ends automatically. Zero disables silence-based auto-end.
SilenceCutoff time.Duration
// ExitPhrases are case-insensitive substring matchers applied to user
// transcript snippets. A match ends the session immediately. An empty
// slice disables exit-phrase auto-end.
//
// This only works when the selected Voice Agent provider delivers user
// transcripts (Gemini Live with EnableInputAudioTranscription, OpenAI
// Realtime with input transcription, Local Cascaded always). Without
// transcripts, SilenceCutoff is the only end guarantee.
ExitPhrases []string
}
AutoEndConfig controls when a wake-word-triggered session ends automatically. It is typically populated from client config (e.g. TOML [wakeword.auto_end]); DefaultAutoEndConfig returns the framework defaults that every client target uses as a baseline.
SpeechKit's Voice Agent is designed for dialogues lasting several hours, so there is deliberately no hard cap on session duration. Auto-end is governed by silence (SilenceCutoff) plus optional exit phrases.
func DefaultAutoEndConfig ¶ added in v0.35.8
func DefaultAutoEndConfig() AutoEndConfig
DefaultAutoEndConfig returns the framework baseline:
- SilenceCutoff: 10s
- ExitPhrases: German and English closing phrases (case-insensitive)
Client targets override individual fields via config, or replace the whole default set if their locale differs.
type AutoEndPolicy ¶ added in v0.35.8
type AutoEndPolicy struct {
// contains filtered or unexported fields
}
AutoEndPolicy is a per-session watcher. It is provider-agnostic and knows nothing about Gemini, OpenAI, or Pipeline details. The client adapter wires:
- Audio frames or VAD onset -> NotifyActivity
- User transcript snippets -> NotifyTranscript
- EndSignal -> session stop path
Lifecycle: NewAutoEndPolicy -> Start -> Notify* calls -> EndSignal fires (exactly once) -> Close (idempotent).
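The lifecycle above can be sketched with a small self-contained watcher. This is not the package's implementation: the watcher type, its fields, and the reason strings "silence" and "exit_phrase" are assumptions made for illustration. It only shows one way to get a channel that fires at most once and tolerates repeated Close calls.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

type endReason string

// watcher is a hypothetical stand-in for AutoEndPolicy, not the real type.
type watcher struct {
	mu      sync.Mutex
	cutoff  time.Duration
	phrases []string
	timer   *time.Timer
	endCh   chan endReason // buffered so firing never blocks
	started bool
	done    bool // true once the watcher fired or was closed
}

func newWatcher(cutoff time.Duration, phrases []string) *watcher {
	return &watcher{cutoff: cutoff, phrases: phrases, endCh: make(chan endReason, 1)}
}

// finish ends the watcher at most once; an empty reason closes without a value.
func (w *watcher) finish(r endReason) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.done {
		return
	}
	w.done = true
	if w.timer != nil {
		w.timer.Stop()
	}
	if r != "" {
		w.endCh <- r
	}
	close(w.endCh)
}

// Start arms the silence timer; repeated calls are no-ops.
func (w *watcher) Start() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.started || w.done {
		return
	}
	w.started = true
	if w.cutoff > 0 {
		w.timer = time.AfterFunc(w.cutoff, func() { w.finish("silence") })
	}
}

// NotifyActivity restarts the silence countdown.
func (w *watcher) NotifyActivity() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.started && !w.done && w.timer != nil {
		w.timer.Reset(w.cutoff)
	}
}

// NotifyTranscript fires on an exit phrase, otherwise counts as activity.
func (w *watcher) NotifyTranscript(text string) {
	lower := strings.ToLower(text)
	for _, p := range w.phrases {
		if p != "" && strings.Contains(lower, strings.ToLower(p)) {
			w.mu.Lock()
			active := w.started && !w.done
			w.mu.Unlock()
			if active {
				w.finish("exit_phrase")
			}
			return
		}
	}
	w.NotifyActivity()
}

// Close is idempotent and safe to call after the watcher fired.
func (w *watcher) Close() { w.finish("") }

func main() {
	w := newWatcher(50*time.Millisecond, []string{"goodbye"})
	w.Start()
	w.NotifyTranscript("OK, Goodbye for now")
	reason, ok := <-w.endCh
	fmt.Println(reason, ok) // exit_phrase true
	w.Close()
}
```

The single `done` flag guarded by the mutex is what makes both the one-shot signal and the idempotent Close cheap to reason about in this sketch.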
func NewAutoEndPolicy ¶ added in v0.35.8
func NewAutoEndPolicy(cfg AutoEndConfig, logger *slog.Logger) *AutoEndPolicy
NewAutoEndPolicy constructs a policy from cfg. logger may be nil (falls back to slog.Default). A zero-value config (SilenceCutoff=0 AND len(ExitPhrases)=0) is replaced by DefaultAutoEndConfig, so clients that do not actively configure auto-end still get the framework default behaviour.
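The zero-value fallback can be pictured roughly as follows. The struct is re-declared locally so the example compiles on its own; resolveAutoEnd and defaultAutoEnd are hypothetical names, and the exit phrases are placeholders because the real default list is not shown in this documentation. Only the 10s cutoff comes from the text.

```go
package main

import (
	"fmt"
	"time"
)

// AutoEndConfig is re-declared here so the sketch is self-contained.
type AutoEndConfig struct {
	SilenceCutoff time.Duration
	ExitPhrases   []string
}

// defaultAutoEnd mirrors the documented 10s cutoff. The phrases are
// placeholders, not the package's actual default list.
func defaultAutoEnd() AutoEndConfig {
	return AutoEndConfig{
		SilenceCutoff: 10 * time.Second,
		ExitPhrases:   []string{"goodbye", "tschuess"},
	}
}

// resolveAutoEnd is a hypothetical helper: only a fully zero config is
// replaced; a config that sets either field is taken as-is.
func resolveAutoEnd(cfg AutoEndConfig) AutoEndConfig {
	if cfg.SilenceCutoff == 0 && len(cfg.ExitPhrases) == 0 {
		return defaultAutoEnd()
	}
	return cfg
}

func main() {
	fmt.Println(resolveAutoEnd(AutoEndConfig{}).SilenceCutoff) // 10s

	// Setting only exit phrases keeps silence-based auto-end disabled.
	custom := resolveAutoEnd(AutoEndConfig{ExitPhrases: []string{"stop"}})
	fmt.Println(custom.SilenceCutoff, custom.ExitPhrases) // 0s [stop]
}
```

One consequence worth noting: under this rule a client cannot disable both triggers by passing a zero config, since that is exactly the case that gets replaced by the defaults.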
func (*AutoEndPolicy) Close ¶ added in v0.35.8
func (p *AutoEndPolicy) Close()
Close stops the silence timer and closes endCh if it has not fired yet. Idempotent: repeated calls are no-ops. Notify* calls after Close are also no-ops and do not panic.
func (*AutoEndPolicy) Config ¶ added in v0.35.8
func (p *AutoEndPolicy) Config() AutoEndConfig
Config returns a copy of the underlying configuration.
func (*AutoEndPolicy) EndSignal ¶ added in v0.35.8
func (p *AutoEndPolicy) EndSignal() <-chan EndReason
EndSignal returns a channel that fires exactly once when an auto-end trigger matches. After it fires, the consumer should call Close. If Close is called before the policy has fired, the channel is closed without a value, and the receiver gets the zero value with ok=false.
func (*AutoEndPolicy) NotifyActivity ¶ added in v0.35.8
func (p *AutoEndPolicy) NotifyActivity()
NotifyActivity resets the silence timer. The client adapter calls it on user audio activity (PCM frame, VAD onset). No-op after Close or before Start.
func (*AutoEndPolicy) NotifyTranscript ¶ added in v0.35.8
func (p *AutoEndPolicy) NotifyTranscript(text string)
NotifyTranscript checks text against the exit-phrase list. A match (case-insensitive substring) fires EndSignal with EndReasonExitPhrase. It also resets the silence timer, since a transcript counts as activity.
func (*AutoEndPolicy) Start ¶ added in v0.35.8
func (p *AutoEndPolicy) Start()
Start arms the silence timer and marks the session as active. Idempotent: repeated calls are no-ops.
type Config ¶
type Config struct {
// Enabled gates the entire wake-word system. When false, no detector is
// loaded and no audio session is opened on its behalf. Default false —
// wake-word is an opt-in feature in line with the dictation industry
// norm (Wispr Flow, Superwhisper, VoiceInk all default to hotkey-only).
Enabled bool
// Phrase is the display label of the active wake phrase (e.g.
// "Hey Quby"). Surfaced in the tray and status feed. Detection is
// driven by the BPE-tokenised Keywords/KeywordsFile fields below, not
// by this label — multiple keywords can map to the same display label.
Phrase string
// ModelDir is the directory holding the sherpa-onnx KWS model assets
// (encoder/decoder/joiner ONNX + tokens.txt). The Windows reference
// adapter resolves a default of <exe_dir>/wakeword-kws when empty.
ModelDir string
// KeywordsFile is the path to a sherpa-onnx keywords.txt file
// (BPE-tokenised, one keyword per line with optional `:boost` and
// trailing `@display-name`). Takes precedence over inline Keywords
// when both are set. The Windows adapter ships a default file beside
// the KWS model.
KeywordsFile string
// Keywords is the inline alternative to KeywordsFile — one
// BPE-tokenised entry per slice element. Use this for programmatic
// callers (tests, ad-hoc clients) that do not want to materialise a
// file on disk.
Keywords []string
// DefaultMode is the runtime mode triggered when a keyword fires.
// One of "dictate" | "assist" | "voice_agent".
DefaultMode string
// Threshold is the minimum acoustic probability for sherpa-onnx to
// emit a keyword hit (0.0–1.0, lower = more sensitive). Each keyword
// can override it via a per-keyword `#threshold` suffix in KeywordsFile
// (the `:boost` suffix adjusts the boosting score instead); this is the
// global default.
Threshold float32
// MinConsecutiveFrames is the number of decoded windows the same
// keyword must fire in before the pipeline emits a DetectionEvent.
// 1 = fire immediately, 2 = require two consecutive hits (default).
MinConsecutiveFrames int
// Cooldown is the minimum interval between two emitted DetectionEvents
// for the same keyword. Prevents the same utterance triggering twice
// and gives the downstream mode time to spin up before another fire.
Cooldown time.Duration
// AutoEnd controls automatic termination of sessions triggered by
// this wake word. It is provider-agnostic and applies equally to
// Gemini Live, OpenAI Realtime and Local Cascaded. When the zero value
// is used (SilenceCutoff=0 and len(ExitPhrases)=0), the pipeline falls
// back to DefaultAutoEndConfig. See AutoEndPolicy and
// DefaultAutoEndConfig for the framework baseline.
AutoEnd AutoEndConfig
}
Config controls the wake-word pipeline behaviour. Fields map 1:1 onto the wakeword block in the user TOML config; see internal/config/config.go for the on-disk schema.
The new sherpa-onnx-backed implementation does not need melspec/embedding/prediction triplets — those legacy fields are kept on the config surface (so older user configs do not error out) but ignored by this package. The relevant inputs are ModelDir + Keywords (or KeywordsFile).
type DetectionEvent ¶
type DetectionEvent struct {
// Phrase is the configured display name (e.g. "Hey Quby").
Phrase string
// Keyword is the raw keyword string sherpa-onnx returned (e.g.
// "HEY QUBY" or the @-suffix when the keywords file labelled it).
Keyword string
// Mode is the runtime mode the wake should trigger ("dictate" |
// "assist" | "voice_agent"), copied from Config.DefaultMode at the
// moment of detection so downstream consumers don't need to re-read
// config.
Mode string
// Probability is informational only — sherpa-onnx KWS already applies
// its threshold internally before reporting a keyword, so this is
// always 1.0 from the pipeline. Kept on the surface for API stability
// with the previous openWakeWord-based detector.
Probability float32
// At is the wall-clock time of the trigger.
At time.Time
}
DetectionEvent is emitted when a keyword fires.
type Detector ¶
type Detector struct {
// contains filtered or unexported fields
}
Detector owns the sherpa-onnx KeywordSpotter for the lifetime of the process. Construction loads the model graph; Close releases it. Callers do not call Detector directly — the Pipeline owns one Detector and the corresponding sherpa stream, and exposes FeedPCM / Close to the audio adapter.
func NewDetector ¶
func NewDetector(cfg DetectorConfig) (*Detector, error)
NewDetector loads the sherpa-onnx KWS model and prepares it for streaming inference. Returns an error if any required file is missing or the native library refuses to construct the spotter.
The caller is responsible for Close() — typically wired into the desktop cleanup stack so the native handles are released on shutdown.
type DetectorConfig ¶
type DetectorConfig struct {
Encoder string // encoder ONNX (zipformer KWS encoder-…onnx)
Decoder string // decoder ONNX
Joiner string // joiner ONNX
Tokens string // tokens.txt (BPE token table)
KeywordsFile string // BPE-tokenised keywords, sherpa-onnx format
Keywords []string // alternative to KeywordsFile (joined with \n)
// NumThreads bounds CPU parallelism. Zero defaults to 1 — the KWS model
// is intentionally small (~3 M parameters) and never benefits from more
// than a couple of threads on desktop hardware.
NumThreads int
// Threshold mirrors Config.Threshold; below 0 or above 1 falls back to
// the sherpa-onnx default (~0.25).
Threshold float32
}
DetectorConfig bundles the file-system inputs the sherpa-onnx KWS engine needs. The Windows reference adapter resolves these from Config.ModelDir against a layout produced by scripts/prepare-wakeword-model.ps1.
All paths must be absolute. Empty Encoder/Decoder/Joiner/Tokens fields fail with a clear error from NewDetector — they are not auto-resolved inside this package so adapter wiring stays explicit.
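An adapter may want to run the same checks before calling NewDetector so that it can report configuration problems early. The sketch below is one way to do that; detectorPaths and validate are hypothetical names, the error wording is invented, and the file names in main are placeholders rather than the actual bundle layout. The example paths are Unix-style, so filepath.IsAbs would reject them on Windows.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// detectorPaths is a local stand-in for the four required model files.
type detectorPaths struct {
	Encoder, Decoder, Joiner, Tokens string
}

// validate is a hypothetical pre-flight check: every required path must be
// non-empty and absolute. Fields are checked in a fixed order so the first
// reported error is deterministic.
func validate(p detectorPaths) error {
	fields := []struct{ name, path string }{
		{"Encoder", p.Encoder},
		{"Decoder", p.Decoder},
		{"Joiner", p.Joiner},
		{"Tokens", p.Tokens},
	}
	for _, f := range fields {
		if f.path == "" {
			return fmt.Errorf("wakeword: %s path is empty", f.name)
		}
		if !filepath.IsAbs(f.path) {
			return fmt.Errorf("wakeword: %s path %q is not absolute", f.name, f.path)
		}
	}
	return nil
}

func main() {
	// Placeholder file names; the real bundle layout may differ.
	ok := detectorPaths{
		Encoder: "/opt/kws/encoder.onnx",
		Decoder: "/opt/kws/decoder.onnx",
		Joiner:  "/opt/kws/joiner.onnx",
		Tokens:  "/opt/kws/tokens.txt",
	}
	fmt.Println(validate(ok))

	bad := ok
	bad.Tokens = "tokens.txt" // relative path is rejected
	fmt.Println(validate(bad))
}
```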
type Dispatcher ¶
type Dispatcher struct {
// contains filtered or unexported fields
}
Dispatcher converts a DetectionEvent into a synthetic KeyDown event on the configured DefaultMode binding. By default it emits *only* a press (no release), which the downstream mode controller interprets as a session-start request in either PTT or Toggle behaviour:
- Toggle: first press starts the session; the user toggles it off later via the actual hotkey or it auto-stops on silence.
- PTT: the press starts the session; auto-stop on silence (see desktopAudioLevelHandler.FastModeSilenceMs) closes it. Wake-trigger plus PTT works best with FastModeSilenceMs set, otherwise the user must press the actual hotkey to release.
SyntheticRelease can be enabled if a downstream consumer specifically requires a balanced press+release pair; the release fires after ReleaseAfter.
func NewDispatcher ¶
func NewDispatcher(sink HotkeySink, opts DispatcherOptions) *Dispatcher
NewDispatcher constructs a Dispatcher with the given sink and options.
func (*Dispatcher) Close ¶
func (d *Dispatcher) Close(ctx context.Context) error
Close blocks further events and waits for in-flight dispatch goroutines (or until ctx is cancelled).
func (*Dispatcher) Emit ¶
func (d *Dispatcher) Emit(ev DetectionEvent)
Emit implements Sink. Called by the wake pipeline when a phrase fires. Non-blocking: actual sink submission runs in a goroutine so a slow downstream channel cannot stall PCM ingestion.
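The non-blocking Emit plus context-bounded Close described here is a common Go pattern. A minimal sketch of it follows, assuming a mutex-guarded closed flag and a sync.WaitGroup; the dispatcher type, its fields, and runDemo are hypothetical and do not reflect the package's internals.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type hotkeyEvent struct {
	KeyDown bool
	Binding string
	Source  string
}

// dispatcher is a hypothetical sketch, not the package's Dispatcher.
type dispatcher struct {
	mu     sync.Mutex
	closed bool
	wg     sync.WaitGroup
	submit func(hotkeyEvent)
}

// emit hands the event to the sink on its own goroutine so the caller
// (the PCM ingestion path) never waits on a slow sink.
func (d *dispatcher) emit(binding string) {
	d.mu.Lock()
	if d.closed {
		d.mu.Unlock()
		return
	}
	d.wg.Add(1) // registered under the lock so Close cannot miss it
	d.mu.Unlock()
	go func() {
		defer d.wg.Done()
		d.submit(hotkeyEvent{KeyDown: true, Binding: binding, Source: "wakeword"})
	}()
}

// Close rejects new events, then waits for in-flight submissions or ctx.
func (d *dispatcher) Close(ctx context.Context) error {
	d.mu.Lock()
	d.closed = true
	d.mu.Unlock()
	done := make(chan struct{})
	go func() { d.wg.Wait(); close(done) }()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runDemo emits once, closes, and returns what the sink received.
func runDemo() []hotkeyEvent {
	var mu sync.Mutex
	var got []hotkeyEvent
	d := &dispatcher{submit: func(ev hotkeyEvent) {
		mu.Lock()
		got = append(got, ev)
		mu.Unlock()
	}}
	d.emit("voice_agent")
	if err := d.Close(context.Background()); err != nil {
		panic(err)
	}
	d.emit("dictate") // dropped: the dispatcher is already closed
	mu.Lock()
	defer mu.Unlock()
	return got
}

func main() {
	fmt.Printf("%+v\n", runDemo())
}
```

Calling wg.Add while holding the same lock that guards the closed flag is the detail that keeps Close from returning while a late goroutine is still starting.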
type DispatcherOptions ¶
type DispatcherOptions struct {
// SyntheticRelease causes the dispatcher to emit a KeyUp event
// ReleaseAfter the KeyDown. Off by default — most desktop consumers
// want toggle/auto-stop semantics rather than a fast tap.
SyntheticRelease bool
// ReleaseAfter is the delay between synthesized KeyDown and KeyUp
// when SyntheticRelease is true. Defaults to 150ms.
ReleaseAfter time.Duration
// Logger is used for structured logs. Defaults to slog.Default().
Logger *slog.Logger
}
DispatcherOptions tweaks Dispatcher behaviour.
type EndReason ¶ added in v0.35.8
type EndReason string
EndReason is the result of the auto-end watcher; the client adapter passes it on to the session.end audit event.
type HotkeyEvent ¶
type HotkeyEvent struct {
KeyDown bool // true = press, false = release
Binding string // mode name: "dictate" | "assist" | "voice_agent"
// Source identifies the origin of the synthesized event. The wake-word
// Dispatcher always sets this to "wakeword". Client adapters use this
// to distinguish wake-word-triggered activations from real hotkey
// presses — e.g. to attach an AutoEndPolicy to wake-word-origin
// sessions but leave hotkey-origin sessions on their existing
// hold-to-talk / toggle semantics.
Source string
}
HotkeyEvent is the minimal shape required to bridge a wake DetectionEvent into the existing mode-hotkey dispatch channel. It intentionally mirrors the Type+Binding pair used by internal/hotkey.Event without importing that package, so the kernel module stays platform-neutral.
type HotkeySink ¶
type HotkeySink interface {
Submit(HotkeyEvent)
}
HotkeySink consumes synthetic key events. The Device-Target supplies an adapter that pushes into the modeHotkeyManager events channel.
type HotkeySinkFunc ¶
type HotkeySinkFunc func(HotkeyEvent)
HotkeySinkFunc adapts a function to HotkeySink.
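A short sketch of how a function adapter of this kind is typically wired, together with the Source check described under HotkeyEvent. The types are re-declared locally so the example compiles on its own; the Submit method on HotkeySinkFunc is assumed from the adapter description, and needsAutoEnd and the "hotkey" source value are hypothetical.

```go
package main

import "fmt"

// Local re-declarations so the sketch is self-contained.
type HotkeyEvent struct {
	KeyDown bool
	Binding string
	Source  string
}

const SourceWakeword = "wakeword"

type HotkeySink interface {
	Submit(HotkeyEvent)
}

type HotkeySinkFunc func(HotkeyEvent)

// Submit is assumed here: it is what lets a plain function satisfy HotkeySink.
func (f HotkeySinkFunc) Submit(ev HotkeyEvent) { f(ev) }

// needsAutoEnd is a hypothetical adapter-side check: only presses that
// originate from the wake word get an auto-end watcher attached.
func needsAutoEnd(ev HotkeyEvent) bool {
	return ev.KeyDown && ev.Source == SourceWakeword
}

func main() {
	var sink HotkeySink = HotkeySinkFunc(func(ev HotkeyEvent) {
		fmt.Println(ev.Binding, needsAutoEnd(ev))
	})
	sink.Submit(HotkeyEvent{KeyDown: true, Binding: "assist", Source: SourceWakeword}) // assist true
	// "hotkey" is a made-up source value for a real key press.
	sink.Submit(HotkeyEvent{KeyDown: true, Binding: "assist", Source: "hotkey"}) // assist false
}
```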
type PhraseCatalogEntry ¶
type PhraseCatalogEntry struct {
// ID is the stable identifier saved in [wakeword] phrase_id. Lowercase
// snake_case. Never change after publishing — config files in the wild
// reference it.
ID string
// DisplayName is the human-readable label rendered in the tray and
// settings UI. Free-form; can include parenthetical pronunciation
// hints (e.g. "Hey Quby (Cubi / Kubi)").
DisplayName string
// KeywordLabel is the @-suffix that must appear at the end of the
// matching line in keywords.txt. The sherpa-onnx KeywordSpotter reports
// hits via this label, which the pipeline forwards as DetectionEvent.Keyword.
// When empty, sherpa reports the raw BPE-decoded keyword.
KeywordLabel string
// Notes is a one-line tradeoff summary surfaced in the settings UI
// next to the entry. Plain text, no markup.
Notes string
}
PhraseCatalogEntry describes a curated wake-phrase shipped with SpeechKit. Under the sherpa-onnx KWS backend the actual detection contract is the keywords.txt file in the bundled KWS model directory (BPE-tokenised, one keyword per line with optional `:boost` and trailing `@display-label`). The catalog here exists only to:
- Give the settings UI a stable ID/label mapping that survives across config saves (the user picks "Hey Quby" from a dropdown, we persist "hey_quby" in the TOML; the adapter then resolves whatever the bundled keywords.txt contains for that phrase).
- Surface tradeoff notes the UI can render next to each entry.
Adding a new entry is non-breaking. Removing or renaming requires a config migration.
func DefaultCatalog ¶
func DefaultCatalog() []PhraseCatalogEntry
DefaultCatalog is SpeechKit's curated wake-phrase list. The corresponding BPE-tokenised keyword lines live in dist/windows/SpeechKit/wakeword-kws/keywords.txt; adding a new entry here requires adding the matching line in that file (and any per-platform bundle counterpart).
Selection criteria (consensus across Picovoice / openWakeWord / sherpa literature):
- 3-4 syllables (disambiguates without feeling sluggish)
- distinct consonant onsets (K, T, P, B help the encoder)
- vowel diversity
- rare in everyday conversation (low false-accept rate)
- pronounceable across DE+EN (SpeechKit's primary languages)
func LookupPhrase ¶
func LookupPhrase(id string) *PhraseCatalogEntry
LookupPhrase returns the catalog entry matching id, or nil if not found. The lookup is case-insensitive and trims surrounding whitespace so the stored config value tolerates manual edits.
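The tolerant lookup described above can be sketched with strings.TrimSpace and strings.EqualFold. The entry type, the catalog slice, and lookup are hypothetical stand-ins; the single "hey_quby" entry is taken from the example mentioned in this documentation, and the real catalog contents are not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is a trimmed-down stand-in for PhraseCatalogEntry.
type entry struct {
	ID          string
	DisplayName string
}

// catalog holds one illustrative entry; the real list may differ.
var catalog = []entry{
	{ID: "hey_quby", DisplayName: "Hey Quby"},
}

// lookup is a hypothetical equivalent of the described behaviour: trim
// surrounding whitespace, compare case-insensitively, return nil on a miss.
func lookup(id string) *entry {
	id = strings.TrimSpace(id)
	for i := range catalog {
		if strings.EqualFold(catalog[i].ID, id) {
			return &catalog[i]
		}
	}
	return nil
}

func main() {
	if e := lookup("  Hey_Quby "); e != nil {
		fmt.Println(e.DisplayName) // Hey Quby
	}
	fmt.Println(lookup("unknown") == nil) // true
}
```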
type Pipeline ¶
type Pipeline struct {
// contains filtered or unexported fields
}
Pipeline streams PCM audio into the sherpa-onnx KeywordSpotter and emits a DetectionEvent via Sink when the spotter reports the same keyword for a configurable number of consecutive decodes (MinConsecutiveFrames) and the keyword is outside its cooldown window.
Pipeline is safe for concurrent calls to FeedPCM and Close. A single Pipeline backs a single audio source — the desktop adapter creates one per process.
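The interplay of MinConsecutiveFrames and Cooldown can be illustrated with a small gate. This is a sketch under assumptions, not the pipeline's code: the gate type, observe, and ms are hypothetical, timestamps are modelled as offsets from stream start, an empty keyword stands for a decode window without a hit, and resetting the streak after a fire is a choice made for this example.

```go
package main

import (
	"fmt"
	"time"
)

// gate is a hypothetical sketch of the debounce and cooldown step.
type gate struct {
	minConsecutive int
	cooldown       time.Duration
	last           string
	streak         int
	lastFire       map[string]time.Duration // per-keyword offset of last fire
}

func newGate(minConsecutive int, cooldown time.Duration) *gate {
	if minConsecutive < 1 {
		minConsecutive = 1 // coerce invalid values to "fire immediately"
	}
	return &gate{
		minConsecutive: minConsecutive,
		cooldown:       cooldown,
		lastFire:       map[string]time.Duration{},
	}
}

// ms converts milliseconds to a time.Duration offset.
func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

// observe records one decode window and reports whether to emit an event.
func (g *gate) observe(keyword string, at time.Duration) bool {
	if keyword == "" { // no hit in this window: the streak is broken
		g.last, g.streak = "", 0
		return false
	}
	if keyword == g.last {
		g.streak++
	} else {
		g.last, g.streak = keyword, 1
	}
	if g.streak < g.minConsecutive {
		return false
	}
	if t, ok := g.lastFire[keyword]; ok && at-t < g.cooldown {
		return false // same keyword fired too recently
	}
	g.lastFire[keyword] = at
	g.streak = 0
	return true
}

func main() {
	g := newGate(2, 2*time.Second)
	for i := 0; i < 4; i++ {
		// One decode window every 80 ms, always the same keyword.
		fmt.Println(g.observe("HEY QUBY", ms(i*80)))
	}
	// Prints: false, true, false, false
}
```

With two required hits and a 2s cooldown, a keyword held across four windows produces a single event: the second window fires, and the fourth is suppressed by the cooldown.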
func NewPipeline ¶
NewPipeline wires a Detector and Sink together with the resolved config. Invalid Config fields are coerced to defaults so the adapter does not need to mirror normalisation logic.
func (*Pipeline) Close ¶
Close releases the streaming handle. Subsequent FeedPCM calls error. The underlying Detector is NOT closed — Pipeline does not own its lifetime; the adapter that constructed both is responsible.
func (*Pipeline) Config ¶
Config returns a copy of the resolved pipeline config (with defaults applied). Useful for UI status displays.
func (*Pipeline) FeedPCM ¶
FeedPCM ingests raw S16 mono PCM at SampleRate. The pipeline normalises to float32 [-1, 1] in place, forwards to sherpa-onnx's streaming spotter, and drains every ready decode window. Returns the number of decode steps completed and the highest keyword score observed across them.
The signature matches the existing audio.Session.SetPCMHandler contract so adapters that were wired against the previous openWakeWord implementation continue to work without changes.
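The normalisation step mentioned for FeedPCM can be sketched as below. This is an illustration only: s16ToFloat32 is a hypothetical helper, little-endian byte order and the divisor 32768 are assumptions, and the sketch allocates a new slice rather than reproducing whatever buffer handling the pipeline uses.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// s16ToFloat32 is a hypothetical helper that converts little-endian signed
// 16-bit PCM into float32 samples in [-1, 1). A trailing odd byte is ignored.
func s16ToFloat32(pcm []byte) []float32 {
	n := len(pcm) / 2
	out := make([]float32, n)
	for i := 0; i < n; i++ {
		s := int16(binary.LittleEndian.Uint16(pcm[2*i:]))
		out[i] = float32(s) / 32768 // assumed scale factor
	}
	return out
}

func main() {
	// Three samples: 0, +32767 (max), -32768 (min).
	pcm := []byte{0x00, 0x00, 0xFF, 0x7F, 0x00, 0x80}
	fmt.Println(s16ToFloat32(pcm))
}
```

Dividing by 32768 maps the most negative sample to exactly -1 and leaves the most positive one just below +1.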
type Sink ¶
type Sink interface {
Emit(DetectionEvent)
}
Sink receives DetectionEvents from the pipeline. The desktop dispatcher implements this to convert events into synthetic hotkey events.
type SinkFunc ¶
type SinkFunc func(DetectionEvent)
SinkFunc adapts a plain function to the Sink interface.