package wakeword

v0.40.2 (latest) | Published: May 26, 2026 | License: Apache-2.0 | Imports: 19 | Imported by: 0

Repository

github.com/kombifyio/SpeechKit

Documentation ¶

Overview ¶

Package wakeword implements an always-on, on-device keyword spotter that any of the three SpeechKit modes (Dictation, Assist, Voice Agent) can opt into.

The package is platform-neutral kernel code. SpeechKit's framework architecture splits responsibilities across two layers:

Framework kernel (this package): keyword-spotting detector, audio buffering, debounce/cooldown logic, event dispatch contracts. Imported by every client that wants wake-word support.
Per-client adapters: own the OS-level audio source, hotkey mapping and status surface. The Windows Device-Target lives in cmd/speechkit/desktop_wakeword.go and is the reference adapter that ships with the SpeechKit demo. Future client targets (Android via gomobile, Local-Target Go CLI, web via WASM bindings, native iOS) plug their audio source and event sink into the same Pipeline.

Wake-word is never a server-side concern in SpeechKit. The Server-Target (internal/server) exposes Dictation/Assist/Voice Agent over HTTP/WS but has no wake-word surface — running an always-on mic in a server process makes no architectural sense and is intentionally out of scope.

Detection runs via the sherpa-onnx Zipformer KWS model (github.com/k2-fsa/sherpa-onnx-go). The model is Apache-2.0, ships as a single ~17 MB asset with encoder/decoder/joiner ONNX files plus a tokens + keywords file, and supports open-vocabulary keyword spotting — users declare keywords as text and tune a per-keyword threshold without retraining.

Index ¶

Constants
type AutoEndConfig
- func DefaultAutoEndConfig() AutoEndConfig
type AutoEndPolicy
- func NewAutoEndPolicy(cfg AutoEndConfig, logger *slog.Logger) *AutoEndPolicy
- func (p *AutoEndPolicy) Close()
- func (p *AutoEndPolicy) Config() AutoEndConfig
- func (p *AutoEndPolicy) EndSignal() <-chan EndReason
- func (p *AutoEndPolicy) NotifyActivity()
- func (p *AutoEndPolicy) NotifyTranscript(text string)
- func (p *AutoEndPolicy) Start()
type Config
type DetectionEvent
type Detector
- func NewDetector(cfg DetectorConfig) (*Detector, error)
- func (d *Detector) Close() error
type DetectorConfig
type Dispatcher
- func NewDispatcher(sink HotkeySink, opts DispatcherOptions) *Dispatcher
- func (d *Dispatcher) Close(ctx context.Context) error
- func (d *Dispatcher) Emit(ev DetectionEvent)
type DispatcherOptions
type EndReason
type HotkeyEvent
type HotkeySink
type HotkeySinkFunc
- func (f HotkeySinkFunc) Submit(ev HotkeyEvent)
type PhraseCatalogEntry
- func DefaultCatalog() []PhraseCatalogEntry
- func LookupPhrase(id string) *PhraseCatalogEntry
type Pipeline
- func NewPipeline(detector *Detector, sink Sink, cfg Config) (*Pipeline, error)
- func (p *Pipeline) Close() error
- func (p *Pipeline) Config() Config
- func (p *Pipeline) FeedPCM(pcm []byte) (decodes int, peakProb float32, err error)
- func (p *Pipeline) Reset()
type Sink
type SinkFunc
- func (f SinkFunc) Emit(ev DetectionEvent)
type TrainingCapture
- func NewTrainingCapture(cfg TrainingCaptureConfig) (*TrainingCapture, error)
- func (c *TrainingCapture) Close() error
- func (c *TrainingCapture) Enabled() bool
- func (c *TrainingCapture) Ingest(pcm []byte)
- func (c *TrainingCapture) Trigger(ev DetectionEvent)
type TrainingCaptureConfig
type TrainingRecord
type TrainingRecordHandler
type TrainingUploader
- func NewTrainingUploader(cfg TrainingUploaderConfig) (*TrainingUploader, error)
- func (u *TrainingUploader) Close()
- func (u *TrainingUploader) Run(ctx context.Context) error
type TrainingUploaderConfig

Constants ¶

const (
	SampleRate     = 16000
	BytesPerSample = 2
	Channels       = 1

	// FrameSamples is the unit the audio session delivers per
	// SetPCMHandler callback (80 ms at 16 kHz).
	FrameSamples = 1280
	FrameBytes   = FrameSamples * BytesPerSample
	FrameDur     = 80 * time.Millisecond
)

Audio contract for the wake-word pipeline. Matches internal/audio and internal/vad — 16 kHz mono S16 PCM. The detector accepts the same frame cadence the rest of the kernel produces, so a single audio session can fan out to VAD + wake-word without resampling.

const SourceWakeword = "wakeword"

SourceWakeword is the value the Dispatcher writes into HotkeyEvent.Source. Client adapters should compare against this constant rather than the literal string to stay decoupled from the dispatch wording.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type AutoEndConfig ¶ added in v0.35.8

type AutoEndConfig struct {
	// SilenceCutoff is the duration without user audio activity after which
	// the session ends automatically. Zero disables silence-based auto-end.
	SilenceCutoff time.Duration

	// ExitPhrases are case-insensitive substring matchers applied to user
	// transcript snippets. A match ends the session immediately. An empty
	// slice disables exit-phrase auto-end.
	//
	// This only works when the selected Voice Agent provider delivers user
	// transcripts (Gemini Live with EnableInputAudioTranscription, OpenAI
	// Realtime with input transcription, Local Cascaded always). Without
	// transcripts, SilenceCutoff alone guarantees that the session ends.
	ExitPhrases []string
}

AutoEndConfig controls when a wake-word-triggered session ends automatically. It is typically populated from client config (e.g. TOML [wakeword.auto_end]); DefaultAutoEndConfig returns the framework defaults that every client target uses as a baseline.

SpeechKit's Voice Agent is designed for dialogues lasting several hours, so there is deliberately no hard cap on session duration. Auto-end is governed by silence (SilenceCutoff) plus optional exit phrases.

func DefaultAutoEndConfig ¶ added in v0.35.8

func DefaultAutoEndConfig() AutoEndConfig

DefaultAutoEndConfig returns the framework baseline:

SilenceCutoff: 10s
ExitPhrases: German and English closing phrases (case-insensitive)

Client targets override individual fields via Config, or replace the whole default set if their locale differs.

type AutoEndPolicy ¶ added in v0.35.8

type AutoEndPolicy struct {
	// contains filtered or unexported fields
}

AutoEndPolicy is a per-session watcher. It is provider-agnostic and knows nothing about Gemini, OpenAI, or Pipeline details. The client adapter wires:

Audio frames or VAD onset -> NotifyActivity
User transcript snippets -> NotifyTranscript
EndSignal -> session-stop path

Lifecycle: NewAutoEndPolicy -> Start -> Notify* calls -> EndSignal fires (exactly once) -> Close (idempotent).

func NewAutoEndPolicy ¶ added in v0.35.8

func NewAutoEndPolicy(cfg AutoEndConfig, logger *slog.Logger) *AutoEndPolicy

NewAutoEndPolicy constructs a policy from cfg. logger may be nil (falls back to slog.Default). A zero-value config (SilenceCutoff == 0 and len(ExitPhrases) == 0) is replaced by DefaultAutoEndConfig, so clients that do not actively configure auto-end still get the framework default behaviour.

func (*AutoEndPolicy) Close ¶ added in v0.35.8

func (p *AutoEndPolicy) Close()

Close stops the silence timer and closes the end channel if it has not fired yet. Idempotent: repeated calls are no-ops. Notify* calls after Close are also no-ops and do not panic.

func (*AutoEndPolicy) Config ¶ added in v0.35.8

func (p *AutoEndPolicy) Config() AutoEndConfig

Config returns a copy of the underlying configuration.

func (*AutoEndPolicy) EndSignal ¶ added in v0.35.8

func (p *AutoEndPolicy) EndSignal() <-chan EndReason

EndSignal returns a channel that fires exactly once when an auto-end trigger matches. After it fires, the consumer should call Close. If Close is called before the policy has fired, the channel is closed without a value, and the receiver gets the zero value with ok == false.
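
The fire-once and close-without-value semantics can be sketched with a buffered channel guarded by sync.Once. This is only one way to obtain that behaviour; the package's internals are not shown in this documentation, and onceSignal, fire and shutdown are hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
)

type EndReason string

// onceSignal is a hypothetical stand-in for the policy's end channel.
type onceSignal struct {
	once sync.Once
	ch   chan EndReason
}

func newOnceSignal() *onceSignal {
	// Buffer of one so fire never blocks when nobody is receiving yet.
	return &onceSignal{ch: make(chan EndReason, 1)}
}

// fire delivers reason once and then closes the channel; later calls do nothing.
func (s *onceSignal) fire(reason EndReason) {
	s.once.Do(func() {
		s.ch <- reason
		close(s.ch)
	})
}

// shutdown closes the channel without a value unless fire already ran.
func (s *onceSignal) shutdown() {
	s.once.Do(func() { close(s.ch) })
}

func main() {
	fired := newOnceSignal()
	fired.fire("silence")
	fired.fire("exit_phrase") // ignored: the signal already fired
	r, ok := <-fired.ch
	fmt.Println(r, ok) // silence true

	closed := newOnceSignal()
	closed.shutdown()
	r, ok = <-closed.ch
	fmt.Println(r == "", ok) // true false
}
```

A consumer that receives with the two-value form can tell a real end reason (ok == true) from a plain shutdown (ok == false).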

func (*AutoEndPolicy) NotifyActivity ¶ added in v0.35.8

func (p *AutoEndPolicy) NotifyActivity()

NotifyActivity resets the silence timer. The client adapter calls it on user audio activity (PCM frame, VAD onset). No-op after Close or before Start.

func (*AutoEndPolicy) NotifyTranscript ¶ added in v0.35.8

func (p *AutoEndPolicy) NotifyTranscript(text string)

NotifyTranscript checks text against the exit-phrase list. A match (case-insensitive substring) fires EndSignal with EndReasonExitPhrase. It also resets the silence timer, since a transcript counts as activity.
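
A case-insensitive substring match of this kind might look like the sketch below. matchesExitPhrase is a hypothetical helper; trimming phrases and skipping empty ones are assumptions made here so that a blank entry cannot match every transcript, and the sample phrases are not the package defaults:

```go
package main

import (
	"fmt"
	"strings"
)

// matchesExitPhrase reports whether text contains any phrase, ignoring case.
// Empty phrases are skipped so they cannot match every transcript.
func matchesExitPhrase(text string, phrases []string) bool {
	lower := strings.ToLower(text)
	for _, p := range phrases {
		p = strings.ToLower(strings.TrimSpace(p))
		if p != "" && strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	phrases := []string{"that's all", "danke, das war's"} // illustrative only
	fmt.Println(matchesExitPhrase("OK, That's all for now", phrases)) // true
	fmt.Println(matchesExitPhrase("what's the weather", phrases))     // false
}
```

Because the match is a plain substring test, short phrases can trigger inside longer words, so longer and more distinctive exit phrases are the safer choice.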

func (*AutoEndPolicy) Start ¶ added in v0.35.8

func (p *AutoEndPolicy) Start()

Start arms the silence timer and marks the session as active. Idempotent: repeated calls are no-ops.

type Config ¶

type Config struct {
	// Enabled gates the entire wake-word system. When false, no detector is
	// loaded and no audio session is opened on its behalf. Default false —
	// wake-word is an opt-in feature in line with the dictation industry
	// norm (Wispr Flow, Superwhisper, VoiceInk all default to hotkey-only).
	Enabled bool

	// Phrase is the display label of the active wake phrase (e.g.
	// "Hey Quby"). Surfaced in the tray and status feed. Detection is
	// driven by the BPE-tokenised Keywords/KeywordsFile fields below, not
	// by this label — multiple keywords can map to the same display label.
	Phrase string

	// ModelDir is the directory holding the sherpa-onnx KWS model assets
	// (encoder/decoder/joiner ONNX + tokens.txt). The Windows reference
	// adapter resolves a default of <exe_dir>/wakeword-kws when empty.
	ModelDir string

	// KeywordsFile is the path to a sherpa-onnx keywords.txt file
	// (BPE-tokenised, one keyword per line with optional `:boost` and
	// trailing `@display-name`). Takes precedence over inline Keywords
	// when both are set. The Windows adapter ships a default file beside
	// the KWS model.
	KeywordsFile string

	// Keywords is the inline alternative to KeywordsFile — one
	// BPE-tokenised entry per slice element. Use this for programmatic
	// callers (tests, ad-hoc clients) that do not want to materialise a
	// file on disk.
	Keywords []string

	// DefaultMode is the runtime mode triggered when a keyword fires.
	// One of "dictate" | "assist" | "voice_agent".
	DefaultMode string

	// Threshold is the minimum acoustic probability for sherpa-onnx to
	// emit a keyword hit (0.0–1.0, lower = more sensitive). Each keyword
	// can override it via the `:boost` suffix in KeywordsFile; this is the
	// global default.
	Threshold float32

	// MinConsecutiveFrames is the number of decoded windows the same
	// keyword must fire in before the pipeline emits a DetectionEvent.
	// 1 = fire immediately, 2 = require two consecutive hits (default).
	MinConsecutiveFrames int

	// Cooldown is the minimum interval between two emitted DetectionEvents
	// for the same keyword. Prevents the same utterance triggering twice
	// and gives the downstream mode time to spin up before another fire.
	Cooldown time.Duration

	// AutoEnd controls the automatic ending of sessions triggered by this
	// wake word. It is provider-agnostic and applies equally to Gemini
	// Live, OpenAI Realtime and Local Cascaded. When the zero value is used
	// (SilenceCutoff == 0 and len(ExitPhrases) == 0), the pipeline falls
	// back to DefaultAutoEndConfig. See AutoEndPolicy and
	// DefaultAutoEndConfig for the framework baseline.
	AutoEnd AutoEndConfig
}

Config controls the wake-word pipeline behaviour. Fields map 1:1 onto the wakeword block in the user TOML config; see internal/config/config.go for the on-disk schema.

The new sherpa-onnx-backed implementation does not need melspec/embedding/prediction triplets — those legacy fields are kept on the config surface (so older user configs do not error out) but ignored by this package. The relevant inputs are ModelDir + Keywords (or KeywordsFile).

type DetectionEvent ¶

type DetectionEvent struct {
	// Phrase is the configured display name (e.g. "Hey Quby").
	Phrase string

	// Keyword is the raw keyword string sherpa-onnx returned (e.g.
	// "HEY QUBY" or the @-suffix when the keywords file labelled it).
	Keyword string

	// Mode is the runtime mode the wake should trigger ("dictate" |
	// "assist" | "voice_agent"), copied from Config.DefaultMode at the
	// moment of detection so downstream consumers don't need to re-read
	// config.
	Mode string

	// Probability is informational only — sherpa-onnx KWS already applies
	// its threshold internally before reporting a keyword, so this is
	// always 1.0 from the pipeline. Kept on the surface for API stability
	// with the previous openWakeWord-based detector.
	Probability float32

	// At is the wall-clock time of the trigger.
	At time.Time
}

DetectionEvent is emitted when a keyword fires.

type Detector ¶

type Detector struct {
	// contains filtered or unexported fields
}

Detector owns the sherpa-onnx KeywordSpotter for the lifetime of the process. Construction loads the model graph; Close releases it. Callers do not call Detector directly — the Pipeline owns one Detector and the corresponding sherpa stream, and exposes FeedPCM / Close to the audio adapter.

func NewDetector ¶

func NewDetector(cfg DetectorConfig) (*Detector, error)

NewDetector loads the sherpa-onnx KWS model and prepares it for streaming inference. Returns an error if any required file is missing or the native library refuses to construct the spotter.

The caller is responsible for Close() — typically wired into the desktop cleanup stack so the native handles are released on shutdown.

func (*Detector) Close ¶

func (d *Detector) Close() error

Close releases the underlying KeywordSpotter. Safe to call multiple times — subsequent calls are no-ops.

type DetectorConfig ¶

type DetectorConfig struct {
	Encoder      string   // encoder ONNX (zipformer KWS encoder-…onnx)
	Decoder      string   // decoder ONNX
	Joiner       string   // joiner ONNX
	Tokens       string   // tokens.txt (BPE token table)
	KeywordsFile string   // BPE-tokenised keywords, sherpa-onnx format
	Keywords     []string // alternative to KeywordsFile (joined with \n)

	// NumThreads bounds CPU parallelism. Zero defaults to 1 — the KWS model
	// is intentionally small (~3 M parameters) and never benefits from more
	// than a couple of threads on desktop hardware.
	NumThreads int

	// Threshold mirrors Config.Threshold; below 0 or above 1 falls back to
	// the sherpa-onnx default (~0.25).
	Threshold float32

	// Debug enables sherpa-onnx's verbose C++ logging (ModelConfig.Debug = 1).
	// Output goes to the C++ runtime's stderr — the sidecar's slog stderr pump
	// fans it into the host's log feed. Use only while tuning; the C++ side
	// is chatty.
	Debug bool
}

DetectorConfig bundles the file-system inputs the sherpa-onnx KWS engine needs. The Windows reference adapter resolves these from Config.ModelDir against a layout produced by scripts/prepare-wakeword-model.ps1.

All paths must be absolute. Empty Encoder/Decoder/Joiner/Tokens fields fail with a clear error from NewDetector — they are not auto-resolved inside this package so adapter wiring stays explicit.

type Dispatcher ¶

type Dispatcher struct {
	// contains filtered or unexported fields
}

Dispatcher converts a DetectionEvent into a synthetic KeyDown event on the configured DefaultMode binding. By default it emits *only* a press (no release), which the downstream mode controller interprets as a session-start request in either PTT or Toggle behaviour:

Toggle: first press starts the session; the user toggles it off later via the actual hotkey or it auto-stops on silence.
PTT: the press starts the session; auto-stop on silence (see desktopAudioLevelHandler.FastModeSilenceMs) closes it. Wake-trigger plus PTT works best with FastModeSilenceMs set, otherwise the user must press the actual hotkey to release.

SyntheticRelease can be enabled if a downstream consumer specifically requires a balanced press+release pair; the release fires after ReleaseAfter.

func NewDispatcher ¶

func NewDispatcher(sink HotkeySink, opts DispatcherOptions) *Dispatcher

NewDispatcher constructs a Dispatcher with the given sink and options.

func (*Dispatcher) Close ¶

func (d *Dispatcher) Close(ctx context.Context) error

Close blocks further events and waits for in-flight dispatch goroutines (or until ctx is cancelled).

func (*Dispatcher) Emit ¶

func (d *Dispatcher) Emit(ev DetectionEvent)

Emit implements Sink. Called by the wake pipeline when a phrase fires. Non-blocking: actual sink submission runs in a goroutine so a slow downstream channel cannot stall PCM ingestion.

type DispatcherOptions ¶

type DispatcherOptions struct {
	// SyntheticRelease causes the dispatcher to emit a KeyUp event
	// ReleaseAfter the KeyDown. Off by default — most desktop consumers
	// want toggle/auto-stop semantics rather than a fast tap.
	SyntheticRelease bool

	// ReleaseAfter is the delay between synthesized KeyDown and KeyUp
	// when SyntheticRelease is true. Defaults to 150ms.
	ReleaseAfter time.Duration

	// Logger is used for structured logs. Defaults to slog.Default().
	Logger *slog.Logger
}

DispatcherOptions tweaks Dispatcher behaviour.

type EndReason ¶ added in v0.35.8

type EndReason string

EndReason is the result of the auto-end watcher; the client adapter passes it on to the session.end audit event.

const (
	EndReasonSilence    EndReason = "silence"
	EndReasonExitPhrase EndReason = "exit_phrase"
)

type HotkeyEvent ¶

type HotkeyEvent struct {
	KeyDown bool   // true = press, false = release
	Binding string // mode name: "dictate" | "assist" | "voice_agent"

	// Source identifies the origin of the synthesized event. The wake-word
	// Dispatcher always sets this to "wakeword". Client adapters use this
	// to distinguish wake-word-triggered activations from real hotkey
	// presses — e.g. to attach an AutoEndPolicy to wake-word-origin
	// sessions but leave hotkey-origin sessions on their existing
	// hold-to-talk / toggle semantics.
	Source string
}

HotkeyEvent is the minimal shape required to bridge a wake DetectionEvent into the existing mode-hotkey dispatch channel. It intentionally mirrors the Type+Binding pair used by internal/hotkey.Event without importing that package, so the kernel module stays platform-neutral.
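
As a sketch of how an adapter might use Source, the snippet below decides whether a session should get an auto-end watcher. The types are local copies, and needsAutoEnd and the "keyboard" source value are hypothetical:

```go
package main

import "fmt"

// Local mirrors of the package declarations so the sketch compiles on its own.
const SourceWakeword = "wakeword"

type HotkeyEvent struct {
	KeyDown bool
	Binding string
	Source  string
}

// needsAutoEnd reports whether a session started by ev should get an
// auto-end watcher: only presses that originate from the wake word qualify.
func needsAutoEnd(ev HotkeyEvent) bool {
	return ev.KeyDown && ev.Source == SourceWakeword
}

func main() {
	wake := HotkeyEvent{KeyDown: true, Binding: "voice_agent", Source: SourceWakeword}
	key := HotkeyEvent{KeyDown: true, Binding: "voice_agent", Source: "keyboard"} // hypothetical non-wake source
	fmt.Println(needsAutoEnd(wake), needsAutoEnd(key)) // true false
}
```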

type HotkeySink ¶

type HotkeySink interface {
	Submit(HotkeyEvent)
}

HotkeySink consumes synthetic key events. The Device-Target supplies an adapter that pushes into the modeHotkeyManager events channel.

type HotkeySinkFunc ¶

type HotkeySinkFunc func(HotkeyEvent)

HotkeySinkFunc adapts a function to HotkeySink.

func (HotkeySinkFunc) Submit ¶

func (f HotkeySinkFunc) Submit(ev HotkeyEvent)

Submit calls f.

type PhraseCatalogEntry ¶

type PhraseCatalogEntry struct {
	// ID is the stable identifier saved in [wakeword] phrase_id. Lowercase
	// snake_case. Never change after publishing — config files in the wild
	// reference it.
	ID string

	// DisplayName is the human-readable label rendered in the tray and
	// settings UI. Free-form; can include parenthetical pronunciation
	// hints (e.g. "Hey Quby (Cubi / Kubi)").
	DisplayName string

	// KeywordLabel is the @-suffix that must appear at the end of the
	// matching line in keywords.txt. The sherpa-onnx KeywordSpotter reports
	// hits via this label, which the pipeline forwards as DetectionEvent.Keyword.
	// When empty, sherpa reports the raw BPE-decoded keyword.
	KeywordLabel string

	// Notes is a one-line tradeoff summary surfaced in the settings UI
	// next to the entry. Plain text, no markup.
	Notes string
}

PhraseCatalogEntry describes a curated wake-phrase shipped with SpeechKit. Under the sherpa-onnx KWS backend the actual detection contract is the keywords.txt file in the bundled KWS model directory (BPE-tokenised, one keyword per line with optional `:boost` and trailing `@display-label`). The catalog here exists only to:

Give the settings UI a stable ID/label mapping that survives across config saves (the user picks "Hey Quby" from a dropdown, we persist "hey_quby" in the TOML; the adapter then resolves whatever the bundled keywords.txt contains for that phrase).
Surface tradeoff notes the UI can render next to each entry.

Adding a new entry is non-breaking. Removing or renaming requires a config migration.

func DefaultCatalog ¶

func DefaultCatalog() []PhraseCatalogEntry

DefaultCatalog is SpeechKit's curated wake-phrase list. The corresponding BPE-tokenised keyword lines live in dist/windows/SpeechKit/wakeword-kws/keywords.txt; adding a new entry here requires adding the matching line in that file (and any per-platform bundle counterpart).

Selection criteria (consensus across Picovoice / openWakeWord / sherpa literature):

3-4 syllables (disambiguates without feeling sluggish)
distinct consonant onsets (K, T, P, B help the encoder)
vowel diversity
rare in everyday conversation (low false-accept rate)
pronounceable across DE+EN (SpeechKit's primary languages)

func LookupPhrase ¶

func LookupPhrase(id string) *PhraseCatalogEntry

LookupPhrase returns the catalog entry matching id, or nil if not found. The lookup is case-insensitive and trims surrounding whitespace so the stored config value tolerates manual edits.
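
A tolerant lookup of this kind can be sketched as follows. The entry type, the lookup helper and the one-element catalog are illustrative stand-ins, not the package's data:

```go
package main

import (
	"fmt"
	"strings"
)

type entry struct{ ID, DisplayName string }

// lookup returns a pointer to the entry whose ID equals id after trimming
// whitespace and folding case, or nil when nothing matches.
func lookup(catalog []entry, id string) *entry {
	id = strings.TrimSpace(id)
	for i := range catalog {
		if strings.EqualFold(catalog[i].ID, id) {
			return &catalog[i]
		}
	}
	return nil
}

func main() {
	catalog := []entry{{ID: "hey_quby", DisplayName: "Hey Quby"}} // illustrative catalog
	fmt.Println(lookup(catalog, "  Hey_Quby ").DisplayName)        // Hey Quby
	fmt.Println(lookup(catalog, "unknown") == nil)                // true
}
```

Callers should check for nil before dereferencing, since a hand-edited config may name a phrase that is not in the catalog.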

type Pipeline ¶

type Pipeline struct {
	// contains filtered or unexported fields
}

Pipeline streams PCM audio into the sherpa-onnx KeywordSpotter and emits DetectionEvents via Sink. An event is emitted only when the spotter reports the same keyword for a configurable number of consecutive decodes (Config.MinConsecutiveFrames) and that keyword is outside its cooldown window (Config.Cooldown).

Pipeline is safe for concurrent calls to FeedPCM and Close. A single Pipeline backs a single audio source — the desktop adapter creates one per process.

func NewPipeline ¶

func NewPipeline(detector *Detector, sink Sink, cfg Config) (*Pipeline, error)

NewPipeline wires a Detector and Sink together with the resolved config. Invalid Config fields are coerced to defaults so the adapter does not need to mirror normalisation logic.

func (*Pipeline) Close ¶

func (p *Pipeline) Close() error

Close releases the streaming handle. Subsequent FeedPCM calls error. The underlying Detector is NOT closed — Pipeline does not own its lifetime; the adapter that constructed both is responsible.

func (*Pipeline) Config ¶

func (p *Pipeline) Config() Config

Config returns a copy of the resolved pipeline config (with defaults applied). Useful for UI status displays.

func (*Pipeline) FeedPCM ¶

func (p *Pipeline) FeedPCM(pcm []byte) (decodes int, peakProb float32, err error)

FeedPCM ingests raw S16 mono PCM at SampleRate. The pipeline normalises to float32 [-1, 1] in place, forwards to sherpa-onnx's streaming spotter, and drains every ready decode window. Returns the number of decode steps completed and the highest keyword score observed across them.
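
The S16-to-float32 normalisation step can be sketched as below. s16ToFloat32 is a hypothetical helper; little-endian byte order and the divisor 32768 are assumptions, since the text above only says "S16" and "[-1, 1]":

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// s16ToFloat32 decodes little-endian signed 16-bit PCM into floats by
// dividing each sample by 32768. A trailing odd byte is ignored.
func s16ToFloat32(pcm []byte) []float32 {
	out := make([]float32, len(pcm)/2)
	for i := range out {
		v := int16(binary.LittleEndian.Uint16(pcm[2*i:]))
		out[i] = float32(v) / 32768
	}
	return out
}

func main() {
	// Three samples: 0, 16384 and -32768.
	pcm := []byte{0x00, 0x00, 0x00, 0x40, 0x00, 0x80}
	fmt.Println(s16ToFloat32(pcm)) // [0 0.5 -1]
}
```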

The signature matches the existing audio.Session.SetPCMHandler contract so adapters that were wired against the previous openWakeWord implementation continue to work without changes.

func (*Pipeline) Reset ¶

func (p *Pipeline) Reset()

Reset clears the rolling keyword-spotter state and the per-keyword debounce + cooldown maps. Adapters call this after a detection-triggered mode change to avoid the same utterance re-firing once the cooldown lapses.

type Sink ¶

type Sink interface {
	Emit(DetectionEvent)
}

Sink receives DetectionEvents from the pipeline. The desktop dispatcher implements this to convert events into synthetic hotkey events.

type SinkFunc ¶

type SinkFunc func(DetectionEvent)

SinkFunc adapts a plain function to the Sink interface.

func (SinkFunc) Emit ¶

func (f SinkFunc) Emit(ev DetectionEvent)

Emit calls f.
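
The function-adapter pattern lets a closure satisfy Sink without a named type. A minimal sketch, with trimmed local copies of the types so it compiles on its own:

```go
package main

import "fmt"

// Trimmed local mirrors of the package types.
type DetectionEvent struct{ Phrase, Mode string }

type Sink interface{ Emit(DetectionEvent) }

type SinkFunc func(DetectionEvent)

// Emit calls f, which is what makes SinkFunc satisfy Sink.
func (f SinkFunc) Emit(ev DetectionEvent) { f(ev) }

func main() {
	var seen []string
	// A closure becomes a Sink without declaring a named type.
	var s Sink = SinkFunc(func(ev DetectionEvent) {
		seen = append(seen, ev.Phrase+"->"+ev.Mode)
	})
	s.Emit(DetectionEvent{Phrase: "Hey Quby", Mode: "dictate"})
	fmt.Println(seen) // [Hey Quby->dictate]
}
```

This is handy in tests, where a closure can record events into a slice instead of driving a real dispatcher.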

type TrainingCapture ¶ added in v0.37.8

type TrainingCapture struct {
	// contains filtered or unexported fields
}

TrainingCapture buffers the rolling PCM stream around wake-word detections and writes each detection's audio to disk as a WAV file plus a JSON sidecar describing the trigger. Foundation of v0.37.4's optional training-data pipeline — every boolean defaults to OFF so the structure only acts on host opt-in.

All audio is 16 kHz mono S16 PCM matching the rest of the wake-word kernel (SampleRate / Channels / BytesPerSample). The pre-roll and post-roll knobs are expressed in milliseconds and converted to samples at construction time.

Lifecycle (sidecar usage):

c, err := wakeword.NewTrainingCapture(wakeword.TrainingCaptureConfig{
    Enabled:    cfg.LocalCaptureEnabled,
    Dir:        cfg.LocalCaptureDir,
    PreRollMs:  cfg.PreRollMs,
    PostRollMs: cfg.PostRollMs,
    Backend:    "livekit_openwakeword",
    OnWrite:    func(rec wakeword.TrainingRecord) { emit(...) },
})
if err != nil {
    return err // the constructor rejected the config
}
defer c.Close()
// in PCM handler:
c.Ingest(pcm)
// on detection:
c.Trigger(detectionEvent)

Capture is goroutine-safe. Ingest and Trigger may be called concurrently though typical sidecars use one PCM thread for both.

func NewTrainingCapture ¶ added in v0.37.8

func NewTrainingCapture(cfg TrainingCaptureConfig) (*TrainingCapture, error)

NewTrainingCapture validates the config and returns a ready-to-use capture. When Enabled is false the returned capture is inert.

func (*TrainingCapture) Close ¶ added in v0.37.8

func (c *TrainingCapture) Close() error

Close flushes any still-pending captures with whatever audio they already have so training clips are not silently lost on shutdown. Close is idempotent.

func (*TrainingCapture) Enabled ¶ added in v0.37.8

func (c *TrainingCapture) Enabled() bool

Enabled reports whether the capture actively writes to disk.

func (*TrainingCapture) Ingest ¶ added in v0.37.8

func (c *TrainingCapture) Ingest(pcm []byte)

Ingest pushes a fresh PCM buffer through the ring + any pending post-roll collectors.

func (*TrainingCapture) Trigger ¶ added in v0.37.8

func (c *TrainingCapture) Trigger(ev DetectionEvent)

Trigger reacts to a wake-word DetectionEvent by snapshotting the current pre-roll ring and starting a post-roll collector. When PostRollMs is zero the capture flushes immediately.
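
The pre-roll ring can be pictured as a buffer that keeps only the newest bytes of the stream. The preRoll type and msToBytes helper below are a hypothetical sketch, not the package's internal structure:

```go
package main

import "fmt"

// preRoll keeps only the most recent capacity bytes of a PCM stream.
type preRoll struct {
	buf      []byte
	capacity int
}

// msToBytes converts milliseconds of 16 kHz mono S16 audio to a byte count.
func msToBytes(ms int) int {
	const sampleRate, bytesPerSample = 16000, 2
	return ms * sampleRate / 1000 * bytesPerSample
}

// ingest appends pcm and drops the oldest bytes beyond capacity.
func (r *preRoll) ingest(pcm []byte) {
	r.buf = append(r.buf, pcm...)
	if extra := len(r.buf) - r.capacity; extra > 0 {
		r.buf = append(r.buf[:0], r.buf[extra:]...)
	}
}

// snapshot returns a copy of the buffered audio, oldest byte first.
func (r *preRoll) snapshot() []byte {
	return append([]byte(nil), r.buf...)
}

func main() {
	r := &preRoll{capacity: 4} // tiny capacity to make the eviction visible
	r.ingest([]byte{1, 2, 3})
	r.ingest([]byte{4, 5, 6})
	fmt.Println(r.snapshot())   // [3 4 5 6]
	fmt.Println(msToBytes(500)) // 16000
}
```

Returning a copy from snapshot means later ingest calls cannot overwrite audio that a pending capture still needs to write.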

type TrainingCaptureConfig ¶ added in v0.37.8

type TrainingCaptureConfig struct {
	// Enabled is the master switch. When false NewTrainingCapture
	// still returns a non-nil capture but Ingest/Trigger become
	// no-ops so the sidecar can wire the same call sites
	// regardless of user opt-in state.
	Enabled bool

	// Dir is the filesystem root where captures land. Must already
	// exist (the constructor does NOT create it). Each capture
	// writes two files: <prefix>.wav and <prefix>.json.
	Dir string

	// PreRollMs and PostRollMs determine how much audio gets
	// captured before and after a trigger. Internally converted to
	// samples at the canonical SampleRate.
	PreRollMs  int
	PostRollMs int

	// Backend identifies the detector backend that produced the
	// trigger ("livekit_openwakeword" / "sherpa_kws" /
	// "stt_phrase"). Echoed into the JSON sidecar so labelers
	// know which detector to test against.
	Backend string

	// OnWrite is an optional callback invoked once per completed
	// capture flush. Nil is allowed.
	OnWrite TrainingRecordHandler

	// TimeSource overrides time.Now for tests.
	TimeSource func() time.Time
}

TrainingCaptureConfig is the constructor input.

type TrainingRecord ¶ added in v0.37.8

type TrainingRecord struct {
	ID         string    `json:"id"`
	PhraseID   string    `json:"phrase_id"`
	Phrase     string    `json:"phrase"`
	Backend    string    `json:"backend"`
	Score      float32   `json:"score"`
	CapturedAt time.Time `json:"captured_at"`
	PreRollMs  int       `json:"pre_roll_ms"`
	PostRollMs int       `json:"post_roll_ms"`
	SampleRate int       `json:"sample_rate"`
	AudioPath  string    `json:"audio_path"`
	AudioBytes int       `json:"audio_bytes"`
	Label      string    `json:"label,omitempty"`
	Uploaded   bool      `json:"uploaded"`
}

TrainingRecord is the metadata describing one written capture. The JSON sidecar uses the same field names (lower_snake_case) so the Go struct is the canonical schema.

type TrainingRecordHandler ¶ added in v0.37.8

type TrainingRecordHandler func(TrainingRecord)

TrainingRecordHandler is the OnWrite callback shape.

type TrainingUploader ¶ added in v0.37.8

type TrainingUploader struct {
	// contains filtered or unexported fields
}

TrainingUploader scans a local capture directory for activation recordings produced by TrainingCapture and uploads them to a remote SpeechKit server's POST /v1/wakeword/activations endpoint.

Default behaviour:

Tick interval is configurable; 5 min is a reasonable default for production. Tests can drop to 50 ms.
Skips any record whose sidecar already has `"uploaded": true`.
When OnlyLabeled is true, also skips records whose `label` field is empty. This lets users gate "send to server" behind explicit labelling in the Wails Settings UI (v0.37.6).
After a successful 201 (or a 409 "already exists") the sidecar JSON is rewritten with `uploaded: true` and the WAV file stays on disk so the user can still review it locally.
On 503 (server says feature disabled) the uploader pauses one full backoff window before retrying, so device clients don't hammer a tenant who has the toggle off.
On other 5xx responses or network errors the uploader just logs and waits for the next tick; nothing on disk is mutated, so the retry is automatic.

All work happens on a single goroutine. Run blocks until ctx is cancelled or Close() is called.

func NewTrainingUploader ¶ added in v0.37.8

func NewTrainingUploader(cfg TrainingUploaderConfig) (*TrainingUploader, error)

NewTrainingUploader validates the config and returns a ready uploader. Returns an error when required fields are missing.

func (*TrainingUploader) Close ¶ added in v0.37.8

func (u *TrainingUploader) Close()

Close marks the uploader as closed; the next tick will exit. Safe to call multiple times.

func (*TrainingUploader) Run ¶ added in v0.37.8

func (u *TrainingUploader) Run(ctx context.Context) error

Run scans on every tick until ctx is cancelled or Close is called. It returns ctx.Err() when ctx is cancelled and nil otherwise. Call Run once per uploader; concurrent calls are not supported.

type TrainingUploaderConfig ¶ added in v0.37.8

type TrainingUploaderConfig struct {
	// Dir is the local capture directory that TrainingCapture writes
	// into (one .wav + .json pair per activation).
	Dir string

	// ServerURL is the base URL of the SpeechKit server, e.g.
	// "https://speechkit.example.com" or "http://localhost:8080".
	// The uploader appends "/v1/wakeword/activations" itself.
	ServerURL string

	// BearerToken authenticates the upload requests. Pulled from the
	// configured env var by the caller (we never read env vars
	// directly to keep secrets handling in one place).
	BearerToken string

	// Interval between scans. Zero is rejected by NewTrainingUploader;
	// production should pass at least 30 s.
	Interval time.Duration

	// OnlyLabeled — when true, only records whose sidecar JSON has a
	// non-empty `label` field are uploaded. Lets the user gate
	// sharing behind explicit review.
	OnlyLabeled bool

	// HTTPClient overrides the default *http.Client (tests use a
	// custom transport). When nil a 30 s-timeout client is used.
	HTTPClient *http.Client

	// Logger is optional; defaults to slog.Default.
	Logger *slog.Logger
}

TrainingUploaderConfig configures one uploader instance. All fields are required unless explicitly marked optional.
