Documentation
Overview ¶
Package vad is the voice-activity-detection layer. The Silero ONNX path (silero.go) is currently a no-op stub pending the sherpa-onnx VAD migration; the RMS-level fallback (level_vad.go) arms the dictation silence-cutoff watcher in the meantime. The upstream Silero implementation will return when the ONNX runtime ABI clean-up lands.
Audit 2026-05-24 maintainability sweep.
Constants ¶
const (
	SampleRate     = 16000
	FrameSize      = 512
	BytesPerSample = 2
	FrameBytes     = FrameSize * BytesPerSample
)
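The constants imply a frame of 1024 bytes covering 32 ms of audio (512 samples at 16 kHz, 2 bytes per sample). The following is a minimal sketch of cutting a raw PCM buffer into whole frames. It assumes a single-channel stream, and `splitFrames` is a hypothetical helper, not part of this package; the constants are copied locally so the example stands alone.

```go
package main

import "fmt"

// Local copies of the package constants so the sketch is self-contained.
const (
	SampleRate     = 16000
	FrameSize      = 512
	BytesPerSample = 2
	FrameBytes     = FrameSize * BytesPerSample
)

// splitFrames (hypothetical helper) cuts raw PCM bytes into whole
// FrameBytes-sized frames. Any trailing partial frame is returned separately
// so the caller can buffer it until more audio arrives.
func splitFrames(pcm []byte) (frames [][]byte, rest []byte) {
	for len(pcm) >= FrameBytes {
		frames = append(frames, pcm[:FrameBytes])
		pcm = pcm[FrameBytes:]
	}
	return frames, pcm
}

func main() {
	// Integer arithmetic: 512 samples * 1000 ms / 16000 samples per second.
	frameMillis := FrameSize * 1000 / SampleRate
	fmt.Println("bytes per frame:", FrameBytes, "ms per frame:", frameMillis)

	frames, rest := splitFrames(make([]byte, 2500))
	fmt.Println("whole frames:", len(frames), "leftover bytes:", len(rest))
}
```

Returning the remainder rather than zero-padding it keeps frame boundaries aligned with the capture stream; whether the real pipeline buffers or pads is not stated here.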
Variables ¶
ErrUnavailable is returned by NewSileroVAD until VAD is re-implemented on the unified sherpa-onnx runtime. See the package overview for the rationale.
Functions ¶
This section is empty.
Types ¶
type LevelVAD ¶ added in v0.37.8
type LevelVAD struct {
// contains filtered or unexported fields
}
LevelVAD is a tiny RMS-based voice activity detector that needs no ONNX runtime. It exists as a fallback for the dictation pipeline while SileroVAD is disabled due to the onnxruntime/sherpa-onnx ABI conflict (see silero.go for the full story).
Speech detection is intentionally crude: compute the RMS of the frame, normalise to [0, 1] against the int16 full scale, and apply a piecewise mapping that produces 0 for noise-floor levels, 1 for clearly voiced audio, and a linear ramp in between. Consumers see the same float32 probability the Silero model would have returned, so they can keep using the existing threshold (DictationSegmenter uses 0.5).
Trade-offs:
- No frequency awareness — keyboard typing, fan noise, or paper rustle above the threshold counts as speech and resets the silence timer.
- No hangover smoothing — short consonant gaps register as silence at the frame level. The DictationSegmenter's pause threshold (default 700 ms) absorbs this in practice.
The thresholds are tuned against the desktop logs (`Overlay audio: raw=0.006` for room silence, `raw=0.012-0.026` for normal voice). The wider window here errs toward "speech", so the silence-cutoff timer waits for genuine quiet rather than tripping on a breath.
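The RMS-and-ramp mapping described above can be sketched roughly as follows. This is an illustration, not the package's implementation: the `noiseFloor` and `voiced` values are placeholders chosen to sit between the logged silence and voice levels (the real thresholds are unexported), and `levelProbability` and `constantFrame` are hypothetical names. Little-endian int16 samples are also an assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// Placeholder thresholds for illustration only; the package's real values
// are unexported and may differ.
const (
	noiseFloor = 0.008   // normalised RMS at or below this maps to 0
	voiced     = 0.012   // normalised RMS at or above this maps to 1
	fullScale  = 32768.0 // int16 full scale used for normalisation
)

// levelProbability computes the RMS of a frame of little-endian int16
// samples, normalises it to [0, 1], and applies a floor/ramp/ceiling mapping.
func levelProbability(frame []byte) float32 {
	n := len(frame) / 2
	if n == 0 {
		return 0
	}
	var sumSquares float64
	for i := 0; i < n; i++ {
		s := float64(int16(binary.LittleEndian.Uint16(frame[2*i:])))
		sumSquares += s * s
	}
	rms := math.Sqrt(sumSquares/float64(n)) / fullScale
	switch {
	case rms <= noiseFloor:
		return 0
	case rms >= voiced:
		return 1
	default:
		// Linear ramp between the two thresholds.
		return float32((rms - noiseFloor) / (voiced - noiseFloor))
	}
}

// constantFrame builds a 512-sample test frame in which every sample is v.
func constantFrame(v int16) []byte {
	frame := make([]byte, 512*2)
	for i := 0; i < 512; i++ {
		binary.LittleEndian.PutUint16(frame[2*i:], uint16(v))
	}
	return frame
}

func main() {
	for _, v := range []int16{100, 328, 1000} {
		fmt.Printf("amplitude %d -> %.2f\n", v, levelProbability(constantFrame(v)))
	}
}
```

Because the mapping looks only at energy, any sufficiently loud frame scores as speech, which is the frequency-blindness trade-off listed above.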
func NewLevelVAD ¶ added in v0.37.8
func NewLevelVAD() *LevelVAD
NewLevelVAD constructs a level-based detector with the production defaults. Tests can override the thresholds by mutating the struct directly — there is no setter API yet because the only caller is the dictation bootstrap.
func (*LevelVAD) ProcessFrame ¶ added in v0.37.8
ProcessFrame implements Detector by returning a per-frame speech probability. The contract is the same as the Silero binding: 0 means silence, 1 means active speech, and the consumer compares against its own threshold.
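To illustrate the consumer side of this contract, here is a rough sketch of a silence-cutoff watcher that compares each per-frame probability against 0.5 and fires after 700 ms of continuous quiet. The `silenceWatcher` type is hypothetical and is not the DictationSegmenter. The 32 ms frame duration follows from 512 samples at 16 kHz, and requiring speech before the timer can fire is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"time"
)

const (
	frameDuration   = 32 * time.Millisecond  // 512 samples at 16 kHz
	speechThreshold = 0.5                    // threshold the consumer applies
	pauseThreshold  = 700 * time.Millisecond // quiet time before cut-off
)

// silenceWatcher (hypothetical) tracks how long the detector has reported
// sub-threshold probabilities since the last speech frame.
type silenceWatcher struct {
	heardSpeech bool
	quiet       time.Duration
}

// observe feeds one per-frame probability and reports whether the pause
// threshold has been reached. Any speech frame resets the quiet timer.
func (w *silenceWatcher) observe(p float32) bool {
	if p >= speechThreshold {
		w.heardSpeech = true
		w.quiet = 0
		return false
	}
	if !w.heardSpeech {
		// Assumption: leading silence never triggers the cut-off.
		return false
	}
	w.quiet += frameDuration
	return w.quiet >= pauseThreshold
}

func main() {
	w := &silenceWatcher{}
	w.observe(0.9) // one speech frame arms the watcher
	frames := 0
	for !w.observe(0.1) {
		frames++
	}
	fmt.Println("quiet frames before cut-off fired:", frames+1)
}
```

Accumulating quiet time across frames is what lets a consumer tolerate brief sub-threshold dips, since a single quiet frame only adds 32 ms toward the 700 ms limit.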
type SileroVAD ¶
type SileroVAD struct {
// contains filtered or unexported fields
}
SileroVAD is the public type the package previously exported. It is kept as an empty struct so that existing callers still compile; its constructor, NewSileroVAD, returns ErrUnavailable.
func NewSileroVAD ¶
NewSileroVAD always returns ErrUnavailable in this build. Adapters (cmd/speechkit/dictation_session.go) detect the error, log a warning and degrade to manual-release hotkey behaviour.
func (*SileroVAD) ProcessFrame ¶
ProcessFrame is a no-op for the stub. The real implementation will return the speech probability from the sherpa-onnx VAD model.