SpeechKit

Module version: v0.35.15
Published: May 21, 2026 License: Apache-2.0

README

SpeechKit

🚧 Beta. SpeechKit is in active beta. Public APIs, config keys, and defaults can still change between minor releases. Use it in production only with version pins. Pre-1.0 releases use the v0.MAJOR.MINOR scheme; breaking changes are called out in each release entry.

SpeechKit is a Windows-first voice framework for products that need dictation, voice commands, and realtime voice dialogue without coupling every use case to one desktop app or one hosted API.

The framework currently has three modules:

| Module | What it is | Use it when |
| --- | --- | --- |
| Local-first Go backend | Embeddable Go runtime in pkg/speechkit with mode contracts, provider profiles, routing policy, readiness metadata, and reusable Dictation, Assist, and Voice Agent services. | You want to integrate SpeechKit into your own Go product, internal tool, prototype, or automation host. |
| SpeechKit Server | Linux server runtime in cmd/speechkit-server that wraps the same backend behind HTTP and WebSocket APIs. | You need a durable server process for remote clients, teams, product backends, browsers, or centrally managed model/provider configuration. |
| Windows Client | Wails desktop client in cmd/speechkit for local use, provider testing, and server-connected workflows. | You want to use SpeechKit on a Windows machine, validate providers and models, or connect a workstation to a SpeechKit Server. |

All three modules share the same three strict modes:

| Mode | Purpose | Boundary |
| --- | --- | --- |
| Dictation | Turn speech into text. | STT only. No LLM rewriting, no utilities, no codewords. |
| Assist | Turn speech or text into one useful result. | Codeword, utility, or LLM output with optional TTS and explicit UI surface metadata. |
| Voice Agent | Run realtime audio-to-audio dialogue. | Live conversation for brainstorming, support, and fast follow-ups. |
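
To make the Boundary column concrete, here is a minimal sketch of how a host could reject requests that step outside a mode. The `Mode`, `Stage`, and `CheckBoundary` names are hypothetical and are not taken from pkg/speechkit; the stage sets are one reading of the table, and limiting Voice Agent to a single realtime stage is an assumption.

```go
package main

import "fmt"

// Mode names one of the three SpeechKit modes.
// Illustrative type only; not the pkg/speechkit API.
type Mode string

const (
	Dictation  Mode = "dictation"
	Assist     Mode = "assist"
	VoiceAgent Mode = "voice_agent"
)

// Stage is a hypothetical processing step a request may ask for.
type Stage string

const (
	StageSTT      Stage = "stt"
	StageCodeword Stage = "codeword"
	StageUtility  Stage = "utility"
	StageLLM      Stage = "llm"
	StageTTS      Stage = "tts"
	StageRealtime Stage = "realtime_audio"
)

// allowed encodes one reading of the Boundary column (an assumption).
var allowed = map[Mode]map[Stage]bool{
	Dictation:  {StageSTT: true},
	Assist:     {StageSTT: true, StageCodeword: true, StageUtility: true, StageLLM: true, StageTTS: true},
	VoiceAgent: {StageRealtime: true},
}

// CheckBoundary returns an error if any requested stage falls outside the mode.
func CheckBoundary(m Mode, stages ...Stage) error {
	set, ok := allowed[m]
	if !ok {
		return fmt.Errorf("unknown mode %q", m)
	}
	for _, s := range stages {
		if !set[s] {
			return fmt.Errorf("mode %q does not allow stage %q", m, s)
		}
	}
	return nil
}

func main() {
	// Dictation with STT only is inside the boundary.
	fmt.Println(CheckBoundary(Dictation, StageSTT))
	// Dictation with LLM rewriting is outside it and yields an error.
	fmt.Println(CheckBoundary(Dictation, StageSTT, StageLLM))
}
```

Keeping the boundary in a lookup table rather than in scattered conditionals means a host can audit what each mode may do in one place.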

Why SpeechKit

Local-first Go backend

Use the backend when you want voice features inside another application without adopting the Windows client. The public pkg/speechkit boundary gives host apps stable mode contracts, service interfaces, provider catalogs, and readiness data they can turn into their own setup UI.

Key advantages:

  • One framework kernel for Dictation, Assist, and Voice Agent instead of three unrelated voice pipelines.
  • Local-first provider support with room for managed local runtimes, user-managed local services, cloud providers, and direct vendor APIs.
  • Host policy controls for enabled modes, fixed profiles, fallbacks, and clean vs intelligence behavior.
  • Machine-readable readiness checks for credentials, local runtimes, model artifacts, and mode capability.
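
As a rough illustration of what machine-readable readiness data could look like, the sketch below builds a JSON report from a credential check and a model-artifact check. The `Check` and `Report` types, the field names, the environment variable, and the model path are all hypothetical; the real types in pkg/speechkit may look quite different.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Check is a hypothetical readiness item; not the pkg/speechkit type.
type Check struct {
	Name   string `json:"name"`
	Kind   string `json:"kind"` // e.g. "credential" or "model_artifact"
	Ready  bool   `json:"ready"`
	Detail string `json:"detail,omitempty"`
}

// Report aggregates the checks for one mode.
type Report struct {
	Mode   string  `json:"mode"`
	Ready  bool    `json:"ready"`
	Checks []Check `json:"checks"`
}

// credentialCheck reports whether an environment variable is set and non-empty.
func credentialCheck(envVar string) Check {
	c := Check{Name: envVar, Kind: "credential", Ready: os.Getenv(envVar) != ""}
	if !c.Ready {
		c.Detail = "environment variable is not set"
	}
	return c
}

// artifactCheck reports whether a model file exists on disk.
func artifactCheck(path string) Check {
	_, err := os.Stat(path)
	c := Check{Name: path, Kind: "model_artifact", Ready: err == nil}
	if err != nil {
		c.Detail = err.Error()
	}
	return c
}

// BuildReport marks the mode ready only if every check passes.
func BuildReport(mode string, checks ...Check) Report {
	r := Report{Mode: mode, Ready: true, Checks: checks}
	for _, c := range checks {
		if !c.Ready {
			r.Ready = false
		}
	}
	return r
}

func main() {
	r := BuildReport("dictation",
		credentialCheck("EXAMPLE_STT_API_KEY"),      // hypothetical variable name
		artifactCheck("models/example-whisper.bin"), // hypothetical path
	)
	out, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		fmt.Println("marshal report:", err)
		return
	}
	fmt.Println(string(out))
}
```

A setup UI could render one row per check and use the top-level `ready` flag to enable or disable the mode.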

Start with Framework API or the examples in examples/.

SpeechKit Server

Use the server when SpeechKit should run as a long-lived Linux service. It adapts the same framework kernel to a containerized API surface so other clients can call Dictation, Assist, and Voice Agent without embedding Go code.

Key advantages:

  • One server image, one URL, and one deployment contract for all three modes.
  • HTTP endpoints for Dictation and Assist plus WebSocket sessions for realtime Voice Agent.
  • Built-in health/readiness routes, bearer or edge-auth modes, CORS/origin controls, and OpenAPI contracts.
  • Centralized provider, model, and secret configuration for teams or hosted deployments.

Start with docs/server/README.md and the server OpenAPI file at docs/server/openapi.v1.yaml.

Windows Client

Use the Windows client when you want a ready-to-run desktop experience or a reference host for testing providers, models, and server connections. The app can run local-first on the machine or delegate selected work to a SpeechKit Server.

Key advantages:

  • Global hotkeys for Dictation, Assist, and Voice Agent.
  • Local audio capture, VAD, overlays, settings, provider setup, and optional audio playback in one Wails app.
  • Provider/model test bench for local, cloud, and direct integrations.
  • Server connection support with configurable bearer-token environment variable, request timeout, and local fallback behavior.

Download public builds from GitHub Releases when available. For source builds:

powershell -ExecutionPolicy Bypass -File scripts/build.ps1 -SkipInstaller

Default hotkeys:

  • Dictation: Ctrl+Win
  • Assist: Win+Alt
  • Voice Agent: Ctrl+Shift

Quick Start

Embed the Go backend:

go get github.com/kombifyio/SpeechKit/pkg/speechkit

Build the Windows client locally:

powershell -ExecutionPolicy Bypass -File scripts/build.ps1 -SkipInstaller

Run the server image:

docker pull ghcr.io/kombifyio/speechkit-server:latest

Documentation

This README is the short orientation page. For contracts, deployment steps, or release rules, use the detailed documentation under docs/.

Build

Canonical Windows app build:

powershell -ExecutionPolicy Bypass -File scripts/build.ps1 -SkipInstaller

For local verification before commit, PR, CI, or deploy work, use the repo-local mise contract documented in docs/LOCAL_TESTING.md. Package-manager and raw Go commands are implementation details behind those preflight gates.

mise run preflight:quick
mise run preflight:release
mise run preflight:deploy

Repository Layout

pkg/speechkit/          Local-first Go backend
cmd/speechkit-server/   SpeechKit Server entry point
cmd/speechkit/          Windows Client entry point
frontend/app/           Windows UI source
internal/               Product internals
docs/                   Detailed documentation
deploy/                 Docker and server config
installer/              Windows installer
scripts/                Build, release, export, and verification scripts

Trust

Public releases include checksums and an SBOM. While the no-cost unsigned release path is active, Windows builds are unsigned and each release carries a notice saying so. Download only from the official kombifyio/SpeechKit releases.

License

Apache-2.0. See LICENSE.

Directories

Path Synopsis
cmd
sk-browser-smoke command
Command sk-browser-smoke drives the public smoke page (`/`) of a running speechkit-server instance from a real headless Chrome process and asserts that every mode tile reports OK.
sk-e2e command
Command sk-e2e is a thin end-to-end smoke client for a running speechkit-server instance.
sk-localprobe command
Command sk-localprobe verifies that the SpeechKit kernel libraries produce working Dictation, Assist, and Voice Agent results against LOCAL models only (Whisper.cpp + Gemma via llama-server).
speechkit command
speechkit-cli command
speechkit-mcp command
speechkit-openwakeword command
speechkit-openwakeword hosts the openWakeWord-compatible ONNX frontend in a sibling process.
speechkit-server command
Package main is the canonical kombify SpeechKit Linux container server.
speechkit-wakeword command
speechkit-wakeword is the sidecar binary that hosts SpeechKit's on-device keyword spotter outside the main desktop process.
speechkit/internal/profiles
Package profiles provides pure model-profile selection helpers: the part of the legacy cmd/speechkit model_selection_helpers.go that depends only on config and the model catalog (no *appState, no network, no Wails surface).
speechkit/internal/transcription
Package transcription provides STT model-selection helpers and the vocabulary-dictionary primitives that the desktop adapters consume.
examples
library command
Example: Using SpeechKit as a Go library for speech-to-text.
provider-catalog command
Example: reading SpeechKit's public mode and provider catalog.
voice-agent/game-instructor command
Example: 15-minute Voice-Agent game instructor.
internal
ai
Package ai wires the Genkit runtime and the SpeechKit model catalog into a single LLM/embedding/reranker surface used by Assist and the Voice Agent pipeline-fallback path.
assist
Package assist implements the Assist Mode pipeline: STT transcript → Codeword check → LLM → TTS → Result with both text and audio.
audio
Audio playback via ebitengine/oto only requires cgo on Linux (ALSA/PulseAudio); the Windows and Darwin backends are pure-Go via purego.
auditlog
Package auditlog provides the dedicated audit-event stream for SpeechKit.
auditlogtest
Package auditlogtest provides test-only helpers for resetting the audit log package state between test cases.
auth
Package auth provides the authentication abstraction for SpeechKit.
dictation
Package dictation implements pause-based segmentation for Dictation Mode: it consumes VAD speech-probability frames and emits one transcription request per natural pause.
downloads
Package downloads manages model downloads for SpeechKit: HTTP file downloads and Ollama model pulls with progress tracking.
features
Package features provides runtime feature detection for UI gating.
kombify
Package kombify is the build-tag seam between OSS and kombify builds.
models
Package models defines the SpeechKit model catalog: provider IDs, model identifiers, modality (STT, TTS, Realtime Voice, Assist, Utility, Embedding, Reranker), execution mode (local/cloud/direct), and the readiness metadata that setup UIs and the readiness endpoint consume.
netsec
Package netsec provides centralized network security primitives used by every HTTP-based provider in SpeechKit (STT, TTS, LLM, downloads).
router
Package router implements the STT routing layer.
scaffold
Package scaffold renders embedded starter templates into a target directory so callers can bootstrap a SpeechKit integration without hand-copying boilerplate.
server/assist
Package assist implements the POST /v1/assist/process handler.
server/audio
Package audio normalizes inbound audio payloads to the Framework kernel's canonical PCM format (16 kHz, signed 16-bit little-endian, mono) before they enter the STT router.
server/cli
Package cli holds the small amount of CLI-level glue for the Linux SpeechKit Server entry point.
server/core
Package core is the SpeechKit server bootstrap layer.
server/dictation
Package dictation implements the POST /v1/dictation/transcribe handler.
server/httpx
Package httpx contains tiny cross-handler helpers for JSON error envelopes and status mapping.
server/middleware
Package middleware provides HTTP middleware primitives for the SpeechKit server adapter.
server/persona
Package persona provides the Voice Agent persona / role / sequence catalog for the Server-Target.
server/voiceagent
Package voiceagent implements the Voice Agent WebSocket surface on the Server-Target.
serverclient
Package serverclient is the client-side transport adapter that lets a device-target (cmd/speechkit) or a local-target binary delegate one or more modes (Dictation, Assist, Voice Agent) to a remote SpeechKit Server-Target instead of running the Framework kernel in-process.
shortcuts
Package shortcuts implements pattern-matched intent shortcuts used by Assist Mode.
stt
Package stt defines the SpeechKit speech-to-text provider interface and houses the concrete provider implementations: whisper.cpp (local built-in), HuggingFace, OpenAI, Groq, Google, an OpenAI-compatible adapter (covers Ollama and other compatible servers), and the self-hosted VPS adapter.
tts
Package tts implements the SpeechKit text-to-speech surface: a small provider interface plus concrete adapters for OpenAI, Google, and Hugging Face.
vad
voiceagent/cascaded
Package cascaded implements a turn-based STT -> LLM -> TTS voice agent provider.
voicebehavior
Package voicebehavior contains the shared Voice Agent behavior catalog used by both the local desktop runtime and the Linux server target.
voiceeval
Package voiceeval contains deterministic dialogue checks for Voice Agent workflow tests.
wakeword
Package wakeword implements an always-on, on-device keyword spotter that any of the three SpeechKit modes (Dictation, Assist, Voice Agent) can opt into.
winapi
Package winapi provides shared Windows DLL proc references used by multiple packages.
pkg
speechkit
Package speechkit provides the public SDK for embedding SpeechKit voice capture, transcription, and assist/voice-agent pipelines into host applications.
speechkit/agentkit
Package agentkit provides a small Go harness for building SpeechKit Voice Agent hosts.
speechkit/assist
Package assist provides an embeddable Assist Mode service.
speechkit/client
Package client provides a typed HTTP client for talking to a remote SpeechKit Server (the `cmd/speechkit-server` Linux container or any compatible deployment).
speechkit/dictation
Package dictation provides an embeddable strict Dictation runtime.
speechkit/voiceagent
Package voiceagent provides an embeddable Voice Agent service.
speechkit/voiceagent/live
Package live exposes the low-level Voice Agent realtime-protocol types.
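
The internal/dictation entry in the listing above describes pause-based segmentation: VAD speech-probability frames go in, and one transcription request comes out per natural pause. A minimal sketch of that idea follows. The `Segmenter` type, its field names, and the threshold values are hypothetical and are not copied from the package.

```go
package main

import "fmt"

// Segment is a half-open range of frame indexes [Start, End) containing speech.
type Segment struct{ Start, End int }

// Segmenter is a hypothetical pause-based segmenter; values are illustrative.
type Segmenter struct {
	SpeechThreshold float64 // probability at or above which a frame counts as speech
	PauseFrames     int     // consecutive non-speech frames that close a segment

	inSpeech   bool
	start      int
	lastSpeech int
	silence    int
	index      int
}

// Push consumes one VAD probability and returns a segment when a pause closes it.
func (s *Segmenter) Push(p float64) (Segment, bool) {
	i := s.index
	s.index++

	if p >= s.SpeechThreshold {
		if !s.inSpeech {
			s.inSpeech, s.start = true, i
		}
		s.silence = 0
		s.lastSpeech = i
		return Segment{}, false
	}
	if !s.inSpeech {
		return Segment{}, false // silence before any speech
	}
	s.silence++
	if s.silence < s.PauseFrames {
		return Segment{}, false // short gap, keep the segment open
	}
	// The pause is long enough: close the segment at the last speech frame.
	s.inSpeech, s.silence = false, 0
	return Segment{Start: s.start, End: s.lastSpeech + 1}, true
}

func main() {
	seg := &Segmenter{SpeechThreshold: 0.5, PauseFrames: 3}
	probs := []float64{0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1}
	for _, p := range probs {
		if s, ok := seg.Push(p); ok {
			fmt.Printf("transcribe frames [%d, %d)\n", s.Start, s.End)
		}
	}
}
```

With these inputs the single low frame at index 3 does not split the first utterance, because it is shorter than `PauseFrames`. The sketch does not flush a segment that is still open when the stream ends; a real implementation would need an explicit flush step.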
