wakeword

package
v0.35.3 Latest
Published: May 19, 2026 | License: Apache-2.0 | Imports: 11 | Imported by: 0

Documentation

Overview

Package wakeword implements an always-on, on-device keyword spotter that any of the three SpeechKit modes (Dictation, Assist, Voice Agent) can opt into.

The package is platform-neutral kernel code. SpeechKit's framework architecture splits responsibilities across two layers:

  • Framework kernel (this package): keyword-spotting detector, audio buffering, debounce/cooldown logic, event dispatch contracts. Imported by every client that wants wake-word support.

  • Per-client adapters: own the OS-level audio source, hotkey mapping and status surface. The Windows Device-Target lives in cmd/speechkit/desktop_wakeword.go and is the reference adapter that ships with the SpeechKit demo. Future client targets (Android via gomobile, Local-Target Go CLI, web via WASM bindings, native iOS) plug their audio source and event sink into the same Pipeline.

Wake-word is never a server-side concern in SpeechKit. The Server-Target (internal/server) exposes Dictation/Assist/Voice Agent over HTTP/WS but has no wake-word surface — running an always-on mic in a server process makes no architectural sense and is intentionally out of scope.

Detection runs via the sherpa-onnx Zipformer KWS model (github.com/k2-fsa/sherpa-onnx-go). The model is Apache-2.0, ships as a single ~17 MB asset with encoder/decoder/joiner ONNX files plus a tokens + keywords file, and supports open-vocabulary keyword spotting — users declare keywords as text and tune a per-keyword threshold without retraining.

Constants

const (
	SampleRate     = 16000
	BytesPerSample = 2
	Channels       = 1

	// FrameSamples is the unit the audio session delivers per
	// SetPCMHandler callback (80 ms at 16 kHz).
	FrameSamples = 1280
	FrameBytes   = FrameSamples * BytesPerSample
	FrameDur     = 80 * time.Millisecond
)

Audio contract for the wake-word pipeline. Matches internal/audio and internal/vad — 16 kHz mono S16 PCM. The detector accepts the same frame cadence the rest of the kernel produces, so a single audio session can fan out to VAD + wake-word without resampling.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	// Enabled gates the entire wake-word system. When false, no detector is
	// loaded and no audio session is opened on its behalf. Default false —
	// wake-word is an opt-in feature in line with the dictation industry
	// norm (Wispr Flow, Superwhisper, VoiceInk all default to hotkey-only).
	Enabled bool

	// Phrase is the display label of the active wake phrase (e.g.
	// "Hey Quby"). Surfaced in the tray and status feed. Detection is
	// driven by the BPE-tokenised Keywords/KeywordsFile fields below, not
	// by this label — multiple keywords can map to the same display label.
	Phrase string

	// ModelDir is the directory holding the sherpa-onnx KWS model assets
	// (encoder/decoder/joiner ONNX + tokens.txt). The Windows reference
	// adapter resolves a default of <exe_dir>/wakeword-kws when empty.
	ModelDir string

	// KeywordsFile is the path to a sherpa-onnx keywords.txt file
	// (BPE-tokenised, one keyword per line with optional `:boost` and
	// trailing `@display-name`). Takes precedence over inline Keywords
	// when both are set. The Windows adapter ships a default file beside
	// the KWS model.
	KeywordsFile string

	// Keywords is the inline alternative to KeywordsFile — one
	// BPE-tokenised entry per slice element. Use this for programmatic
	// callers (tests, ad-hoc clients) that do not want to materialise a
	// file on disk.
	Keywords []string

	// DefaultMode is the runtime mode triggered when a keyword fires.
	// One of "dictate" | "assist" | "voice_agent".
	DefaultMode string

	// Threshold is the minimum acoustic probability for sherpa-onnx to
	// emit a keyword hit (0.0–1.0, lower = more sensitive). Each keyword
	// can override it via the `#threshold` suffix in KeywordsFile (the
	// `:boost` suffix adjusts the boosting score instead); this is the
	// global default.
	Threshold float32

	// MinConsecutiveFrames is the number of decoded windows the same
	// keyword must fire in before the pipeline emits a DetectionEvent.
	// 1 = fire immediately, 2 = require two consecutive hits (default).
	MinConsecutiveFrames int

	// Cooldown is the minimum interval between two emitted DetectionEvents
	// for the same keyword. Prevents the same utterance triggering twice
	// and gives the downstream mode time to spin up before another fire.
	Cooldown time.Duration
}

Config controls the wake-word pipeline behaviour. Fields map 1:1 onto the wakeword block in the user TOML config; see internal/config/config.go for the on-disk schema.

The new sherpa-onnx-backed implementation does not need melspec/embedding/prediction triplets — those legacy fields are kept on the config surface (so older user configs do not error out) but ignored by this package. The relevant inputs are ModelDir + Keywords (or KeywordsFile).

type DetectionEvent

type DetectionEvent struct {
	// Phrase is the configured display name (e.g. "Hey Quby").
	Phrase string

	// Keyword is the raw keyword string sherpa-onnx returned (e.g.
	// "HEY QUBY" or the @-suffix when the keywords file labelled it).
	Keyword string

	// Mode is the runtime mode the wake should trigger ("dictate" |
	// "assist" | "voice_agent"), copied from Config.DefaultMode at the
	// moment of detection so downstream consumers don't need to re-read
	// config.
	Mode string

	// Probability is informational only — sherpa-onnx KWS already applies
	// its threshold internally before reporting a keyword, so this is
	// always 1.0 from the pipeline. Kept on the surface for API stability
	// with the previous openWakeWord-based detector.
	Probability float32

	// At is the wall-clock time of the trigger.
	At time.Time
}

DetectionEvent is emitted when a keyword fires.

type Detector

type Detector struct {
	// contains filtered or unexported fields
}

Detector owns the sherpa-onnx KeywordSpotter for the lifetime of the process. Construction loads the model graph; Close releases it. Callers do not drive a Detector directly: the Pipeline holds one Detector, owns the corresponding sherpa stream, and exposes FeedPCM / Close to the audio adapter.

func NewDetector

func NewDetector(cfg DetectorConfig) (*Detector, error)

NewDetector loads the sherpa-onnx KWS model and prepares it for streaming inference. Returns an error if any required file is missing or the native library refuses to construct the spotter.

The caller is responsible for Close() — typically wired into the desktop cleanup stack so the native handles are released on shutdown.

func (*Detector) Close

func (d *Detector) Close() error

Close releases the underlying KeywordSpotter. Safe to call multiple times — subsequent calls are no-ops.
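
One common way to get this "safe to call multiple times" behaviour in Go is sync.Once. The sketch below is an illustration of that idiom only; nativeHandle is a hypothetical stand-in, and the source does not say how Detector implements Close internally.

```go
package main

import (
	"fmt"
	"sync"
)

// nativeHandle is a hypothetical stand-in for a type that wraps a native
// resource which must be released exactly once.
type nativeHandle struct {
	once     sync.Once
	releases int // counts how often the release body ran
}

// Close runs the release body on the first call only. Later calls are no-ops.
func (h *nativeHandle) Close() error {
	h.once.Do(func() {
		h.releases++ // stands in for the real native release call
	})
	return nil
}

func main() {
	h := &nativeHandle{}
	for i := 0; i < 3; i++ {
		_ = h.Close()
	}
	fmt.Println("release body ran", h.releases, "time(s)")
}
```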

type DetectorConfig

type DetectorConfig struct {
	Encoder      string   // encoder ONNX (zipformer KWS encoder-…onnx)
	Decoder      string   // decoder ONNX
	Joiner       string   // joiner ONNX
	Tokens       string   // tokens.txt (BPE token table)
	KeywordsFile string   // BPE-tokenised keywords, sherpa-onnx format
	Keywords     []string // alternative to KeywordsFile (joined with \n)

	// NumThreads bounds CPU parallelism. Zero defaults to 1 — the KWS model
	// is intentionally small (~3 M parameters) and never benefits from more
	// than a couple of threads on desktop hardware.
	NumThreads int

	// Threshold mirrors Config.Threshold; below 0 or above 1 falls back to
	// the sherpa-onnx default (~0.25).
	Threshold float32
}

DetectorConfig bundles the file-system inputs the sherpa-onnx KWS engine needs. The Windows reference adapter resolves these from Config.ModelDir against a layout produced by scripts/prepare-wakeword-model.ps1.

All paths must be absolute. Empty Encoder/Decoder/Joiner/Tokens fields fail with a clear error from NewDetector — they are not auto-resolved inside this package so adapter wiring stays explicit.

type Dispatcher

type Dispatcher struct {
	// contains filtered or unexported fields
}

Dispatcher converts a DetectionEvent into a synthetic KeyDown event on the configured DefaultMode binding. By default it emits *only* a press (no release), which the downstream mode controller interprets as a session-start request in either PTT or Toggle behaviour:

  • Toggle: first press starts the session; the user toggles it off later via the actual hotkey or it auto-stops on silence.
  • PTT: the press starts the session; auto-stop on silence (see desktopAudioLevelHandler.FastModeSilenceMs) closes it. Wake-trigger plus PTT works best with FastModeSilenceMs set, otherwise the user must press the actual hotkey to release.

SyntheticRelease can be enabled if a downstream consumer specifically requires a balanced press+release pair; the release fires after ReleaseAfter.

func NewDispatcher

func NewDispatcher(sink HotkeySink, opts DispatcherOptions) *Dispatcher

NewDispatcher constructs a Dispatcher with the given sink and options.

func (*Dispatcher) Close

func (d *Dispatcher) Close(ctx context.Context) error

Close blocks further events and waits for in-flight dispatch goroutines (or until ctx is cancelled).

func (*Dispatcher) Emit

func (d *Dispatcher) Emit(ev DetectionEvent)

Emit implements Sink. Called by the wake pipeline when a phrase fires. Non-blocking: actual sink submission runs in a goroutine so a slow downstream channel cannot stall PCM ingestion.

type DispatcherOptions

type DispatcherOptions struct {
	// SyntheticRelease causes the dispatcher to emit a KeyUp event
	// ReleaseAfter the KeyDown. Off by default — most desktop consumers
	// want toggle/auto-stop semantics rather than a fast tap.
	SyntheticRelease bool

	// ReleaseAfter is the delay between synthesized KeyDown and KeyUp
	// when SyntheticRelease is true. Defaults to 150ms.
	ReleaseAfter time.Duration

	// Logger is used for structured logs. Defaults to slog.Default().
	Logger *slog.Logger
}

DispatcherOptions tweaks Dispatcher behaviour.

type HotkeyEvent

type HotkeyEvent struct {
	KeyDown bool   // true = press, false = release
	Binding string // mode name: "dictate" | "assist" | "voice_agent"
}

HotkeyEvent is the minimal shape required to bridge a wake DetectionEvent into the existing mode-hotkey dispatch channel. It intentionally mirrors the Type+Binding pair used by internal/hotkey.Event without importing that package, so the kernel module stays platform-neutral.

type HotkeySink

type HotkeySink interface {
	Submit(HotkeyEvent)
}

HotkeySink consumes synthetic key events. The Device-Target supplies an adapter that pushes into the modeHotkeyManager events channel.

type HotkeySinkFunc

type HotkeySinkFunc func(HotkeyEvent)

HotkeySinkFunc adapts a function to HotkeySink.

func (HotkeySinkFunc) Submit

func (f HotkeySinkFunc) Submit(ev HotkeyEvent)

Submit calls f.

type PhraseCatalogEntry

type PhraseCatalogEntry struct {
	// ID is the stable identifier saved in [wakeword] phrase_id. Lowercase
	// snake_case. Never change after publishing — config files in the wild
	// reference it.
	ID string

	// DisplayName is the human-readable label rendered in the tray and
	// settings UI. Free-form; can include parenthetical pronunciation
	// hints (e.g. "Hey Quby (Cubi / Kubi)").
	DisplayName string

	// KeywordLabel is the @-suffix that must appear at the end of the
	// matching line in keywords.txt. The sherpa-onnx KeywordSpotter reports
	// hits via this label, which the pipeline forwards as DetectionEvent.Keyword.
	// When empty, sherpa reports the raw BPE-decoded keyword.
	KeywordLabel string

	// Notes is a one-line tradeoff summary surfaced in the settings UI
	// next to the entry. Plain text, no markup.
	Notes string
}

PhraseCatalogEntry describes a curated wake-phrase shipped with SpeechKit. Under the sherpa-onnx KWS backend the actual detection contract is the keywords.txt file in the bundled KWS model directory (BPE-tokenised, one keyword per line with optional `:boost` and trailing `@display-label`). The catalog here exists only to:

  • Give the settings UI a stable ID/label mapping that survives across config saves (the user picks "Hey Quby" from a dropdown, we persist "hey_quby" in the TOML; the adapter then resolves whatever the bundled keywords.txt contains for that phrase).
  • Surface tradeoff notes the UI can render next to each entry.

Adding a new entry is non-breaking. Removing or renaming requires a config migration.

func DefaultCatalog

func DefaultCatalog() []PhraseCatalogEntry

DefaultCatalog is SpeechKit's curated wake-phrase list. The corresponding BPE-tokenised keyword lines live in dist/windows/SpeechKit/wakeword-kws/keywords.txt; adding a new entry here requires adding the matching line in that file (and any per-platform bundle counterpart).

Selection criteria (consensus across Picovoice / openWakeWord / sherpa literature):

  • 3-4 syllables (disambiguates without feeling sluggish)
  • distinct consonant onsets (K, T, P, B help the encoder)
  • vowel diversity
  • rare in everyday conversation (low false-accept rate)
  • pronounceable across DE+EN (SpeechKit's primary languages)

func LookupPhrase

func LookupPhrase(id string) *PhraseCatalogEntry

LookupPhrase returns the catalog entry matching id, or nil if not found. The lookup is case-insensitive and trims surrounding whitespace so the stored config value tolerates manual edits.
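
The tolerant lookup described here can be sketched as below. The struct is a local copy of PhraseCatalogEntry; the catalog contents, the KeywordLabel value and the lookup helper are illustrative assumptions, since this page does not list the real catalog entries.

```go
package main

import (
	"fmt"
	"strings"
)

// PhraseCatalogEntry is a local copy of the documented struct.
type PhraseCatalogEntry struct {
	ID           string
	DisplayName  string
	KeywordLabel string
	Notes        string
}

// catalog holds one illustrative entry; the real list lives in DefaultCatalog.
var catalog = []PhraseCatalogEntry{
	{ID: "hey_quby", DisplayName: "Hey Quby", KeywordLabel: "HEY QUBY"},
}

// lookup trims whitespace and compares IDs case-insensitively, so a
// hand-edited config value such as " Hey_Quby " still resolves.
func lookup(id string) *PhraseCatalogEntry {
	want := strings.TrimSpace(id)
	for i := range catalog {
		if strings.EqualFold(catalog[i].ID, want) {
			entry := catalog[i] // copy, so callers cannot mutate the catalog
			return &entry
		}
	}
	return nil
}

func main() {
	if e := lookup("  Hey_Quby \n"); e != nil {
		fmt.Println("found:", e.DisplayName)
	}
	fmt.Println("unknown id is nil:", lookup("unknown_phrase") == nil)
}
```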

type Pipeline

type Pipeline struct {
	// contains filtered or unexported fields
}

Pipeline streams PCM audio into the sherpa-onnx KeywordSpotter and emits a DetectionEvent via Sink when two conditions hold: the spotter has reported the same keyword for the configured number of consecutive decodes (Config.MinConsecutiveFrames), and that keyword is outside its cooldown window (Config.Cooldown).

Pipeline is safe for concurrent calls to FeedPCM and Close. A single Pipeline backs a single audio source — the desktop adapter creates one per process.

func NewPipeline

func NewPipeline(detector *Detector, sink Sink, cfg Config) (*Pipeline, error)

NewPipeline wires a Detector and Sink together with the resolved config. Invalid Config fields are coerced to defaults so the adapter does not need to mirror normalisation logic.

func (*Pipeline) Close

func (p *Pipeline) Close() error

Close releases the streaming handle. Subsequent FeedPCM calls error. The underlying Detector is NOT closed — Pipeline does not own its lifetime; the adapter that constructed both is responsible.

func (*Pipeline) Config

func (p *Pipeline) Config() Config

Config returns a copy of the resolved pipeline config (with defaults applied). Useful for UI status displays.

func (*Pipeline) FeedPCM

func (p *Pipeline) FeedPCM(pcm []byte) (decodes int, peakProb float32, err error)

FeedPCM ingests raw S16 mono PCM at SampleRate. The pipeline normalises to float32 [-1, 1] in place, forwards to sherpa-onnx's streaming spotter, and drains every ready decode window. Returns the number of decode steps completed and the highest keyword score observed across them.
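
The S16-to-float32 normalisation step can be sketched as follows. Little-endian byte order is an assumption (the documentation only says S16 mono), and s16ToFloat32 is a hypothetical helper rather than the pipeline's own conversion routine.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// s16ToFloat32 converts signed 16-bit PCM bytes into float32 samples in
// [-1, 1). A trailing odd byte, if any, is ignored.
func s16ToFloat32(pcm []byte) []float32 {
	n := len(pcm) / 2
	out := make([]float32, n)
	for i := 0; i < n; i++ {
		// Assumption: samples are little-endian.
		s := int16(binary.LittleEndian.Uint16(pcm[2*i:]))
		out[i] = float32(s) / 32768
	}
	return out
}

func main() {
	// Three samples: 0, +32767 (max), -32768 (min).
	pcm := []byte{0x00, 0x00, 0xFF, 0x7F, 0x00, 0x80}
	fmt.Println(s16ToFloat32(pcm))
}
```

Dividing by 32768 maps the most negative sample to exactly -1 and keeps the most positive sample just below 1, which keeps the conversion symmetric around zero.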

The signature matches the existing audio.Session.SetPCMHandler contract so adapters that were wired against the previous openWakeWord implementation continue to work without changes.

func (*Pipeline) Reset

func (p *Pipeline) Reset()

Reset clears the rolling keyword-spotter state and the per-keyword debounce + cooldown maps. Adapters call this after a detection-triggered mode change to avoid the same utterance re-firing once the cooldown lapses.

type Sink

type Sink interface {
	Emit(DetectionEvent)
}

Sink receives DetectionEvents from the pipeline. The desktop dispatcher implements this to convert events into synthetic hotkey events.

type SinkFunc

type SinkFunc func(DetectionEvent)

SinkFunc adapts a plain function to the Sink interface.

func (SinkFunc) Emit

func (f SinkFunc) Emit(ev DetectionEvent)

Emit calls f.
