SpeechKit

module

v0.48.1 Published: Jun 28, 2026 License: Apache-2.0

Repository

github.com/kombifyio/SpeechKit

README

SpeechKit

🚧 Beta. SpeechKit is in active beta. Public APIs, config keys, and defaults can still change between minor releases. Use it in production only with version pins. Pre-1.0 releases use the v0.MAJOR.MINOR scheme; breaking changes are called out in each release entry.

SpeechKit is a Windows-first voice framework for products that need dictation, voice commands, and realtime voice dialogue without coupling every use case to one desktop app or one hosted API.

The framework currently has four modules:

Module	What it is	Use it when
Local-first Go backend	Embeddable Go runtime in `pkg/speechkit` with mode contracts, provider profiles, routing policy, readiness metadata, and reusable Dictation, Assist, and Voice Agent services.	You want to integrate SpeechKit into your own Go product, internal tool, prototype, or automation host.
Self-host Server	Linux server runtime in `cmd/speechkit-server` that wraps the same backend behind HTTP and WebSocket APIs.	You need a durable process for your own clients, teams, product backends, browsers, or centrally managed model/provider configuration.
Agent tools	`speechkit-mcp` and `speechkit-cli` expose docs, validation, scaffolding, diagnostics, and authenticated server management to coding agents and operators.	You want an agent to inspect the framework, generate starters, validate payloads, or operate a self-hosted server.
Windows Client	Public installer and portable release assets for local use, provider testing, and server-connected workflows.	You want to use SpeechKit on a Windows machine, validate providers and models, or connect a workstation to a SpeechKit Server.

Desktop device support is Windows 10/11 x64 only today. The Windows Client uses WASAPI for local capture/playback. Linux is supported as a server runtime, not as a desktop capture client; macOS and Linux desktop packages are not currently supported.

The runtime modules share the same three strict modes:

Mode	Purpose	Boundary
Dictation	Turn speech into text.	STT only. No LLM rewriting, no utilities, no codewords.
Assist	Turn speech or text into one useful result.	Codeword, utility, or LLM output with optional TTS and explicit UI surface metadata.
Voice Agent	Run realtime audio-to-audio dialogue.	Live conversation for brainstorming, support, and fast follow-ups.
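
The "Boundary" column can be read as a capability check per mode. The sketch below is illustrative only: `Mode`, `Capability`, and `Allows` are hypothetical names chosen for this example and are not taken from the `pkg/speechkit` API.

```go
package main

import "fmt"

// Mode and Capability are hypothetical names for illustration;
// they are not the types exported by pkg/speechkit.
type Mode string

const (
	Dictation  Mode = "dictation"
	Assist     Mode = "assist"
	VoiceAgent Mode = "voice_agent"
)

type Capability int

const (
	STT Capability = iota
	LLM
	Codeword
	Utility
	TTS
	RealtimeAudio
)

// boundaries encodes the "Boundary" column of the mode table.
var boundaries = map[Mode]map[Capability]bool{
	Dictation:  {STT: true},
	Assist:     {STT: true, Codeword: true, Utility: true, LLM: true, TTS: true},
	VoiceAgent: {RealtimeAudio: true},
}

// Allows reports whether a mode may use a capability.
// Unknown modes allow nothing.
func Allows(m Mode, c Capability) bool {
	return boundaries[m][c]
}

func main() {
	fmt.Println(Allows(Dictation, LLM)) // false: Dictation is STT only
	fmt.Println(Allows(Assist, TTS))    // true: Assist may speak its result
}
```

Keeping the boundary in data rather than in scattered conditionals makes it easy for a host to reject a request that would cross a mode boundary before any provider is called.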

Hands-Free is not a fourth mode. It is an activation and voice-output layer for the three modes: wake activation, microphone capture, auto-end policy, and optional speaker output. Assist uses it for Siri/Alexa-style Voice Companion requests, Voice Agent uses it for continuous dialogue, and Dictation uses it only as UI-assisted activation with a visible text target or commit surface.

Speaker diarization, speaker identification, and speaker attribution are also add-on capabilities over the three modes. Provider support and auth status are tracked in the voice capability matrix.

Words and Replacements are the first-class customization axis over the same three modes. Words teach SpeechKit terms to recognize; Replacements define deterministic text, command, snippet, synonym, and template transformations. Native Templates are versioned curated packs of the same Words/Replacements data, for example the default punctuation template and the opt-in developer template. They replace the narrow dictionary concept without creating another mode. See the Words And Replacements standard.
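
To make "deterministic text transformations" concrete, here is a minimal sketch of a replacement pass over a transcript. The `Replacement` struct and `Apply` function are hypothetical; the actual contract lives in `pkg/speechkit/customize` and may look different. Matching is case-sensitive here purely to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Replacement is a hypothetical shape for one deterministic rule.
// It is not the type defined in pkg/speechkit/customize.
type Replacement struct {
	Spoken  string // phrase as it appears in the transcript
	Written string // text that should appear instead
}

// Apply rewrites transcript with rules in a single left-to-right pass.
// Longer spoken phrases are listed first so that "new paragraph" wins
// over "new" when both could match at the same position.
func Apply(transcript string, rules []Replacement) string {
	sorted := append([]Replacement(nil), rules...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return len(sorted[i].Spoken) > len(sorted[j].Spoken)
	})
	pairs := make([]string, 0, 2*len(sorted))
	for _, r := range sorted {
		if r.Spoken == "" {
			continue // an empty pattern would match everywhere
		}
		pairs = append(pairs, r.Spoken, r.Written)
	}
	return strings.NewReplacer(pairs...).Replace(transcript)
}

func main() {
	rules := []Replacement{
		{Spoken: " comma", Written: ","},
		{Spoken: " period", Written: "."},
	}
	fmt.Println(Apply("hello comma world period", rules)) // hello, world.
}
```

Because the pass is a pure function of the transcript and the rule set, the same input always produces the same output, which is the property that distinguishes Replacements from LLM rewriting.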

Why SpeechKit

Local-first Go backend

Use the backend when you want voice features inside another application without adopting the Windows client. The public pkg/speechkit boundary gives host apps stable mode contracts, service interfaces, provider catalogs, and readiness data they can turn into their own setup UI.

Key advantages:

One framework kernel for Dictation, Assist, and Voice Agent instead of three unrelated voice pipelines.
Local-first provider support with room for managed local runtimes, user-managed local services, cloud providers, and direct vendor APIs.
Host policy controls for enabled modes, fixed profiles, fallbacks, and clean vs intelligence behavior.
Machine-readable readiness checks for credentials, local runtimes, model artifacts, and mode capability.
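
As a rough picture of what machine-readable readiness data could look like to a host setup UI, consider the sketch below. `Check`, `Report`, `Evaluate`, and the `EXAMPLE_PROVIDER_API_KEY` variable are all hypothetical and only illustrate the idea; consult the Framework API docs for the real readiness types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Check and Report are hypothetical shapes, not pkg/speechkit types.
type Check struct {
	Name   string `json:"name"`
	Ready  bool   `json:"ready"`
	Detail string `json:"detail,omitempty"`
}

type Report struct {
	Mode   string  `json:"mode"`
	Ready  bool    `json:"ready"`
	Checks []Check `json:"checks"`
}

// Evaluate marks a mode ready only when every individual check passed.
func Evaluate(mode string, checks []Check) Report {
	ready := true
	for _, c := range checks {
		if !c.Ready {
			ready = false
		}
	}
	return Report{Mode: mode, Ready: ready, Checks: checks}
}

func main() {
	// EXAMPLE_PROVIDER_API_KEY is a made-up variable name for this sketch.
	_, hasKey := os.LookupEnv("EXAMPLE_PROVIDER_API_KEY")
	report := Evaluate("dictation", []Check{
		{Name: "credentials", Ready: hasKey, Detail: "EXAMPLE_PROVIDER_API_KEY"},
		{Name: "model_artifact", Ready: true},
	})
	out, err := json.MarshalIndent(report, "", "  ")
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(string(out))
}
```

A host can render each failed check as a setup step instead of surfacing a generic "provider unavailable" error at call time.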

Start with Framework API, Voice Companion, or the examples in examples/.

Self-host Server

Use the server when SpeechKit should run as a long-lived Linux service you operate. It adapts the same framework kernel to a containerized API surface so other clients can call Dictation, Assist, and Voice Agent without embedding Go code.

Key advantages:

One server image, one URL, and one deployment contract for all three modes.
HTTP endpoints for Dictation and Assist plus WebSocket sessions for realtime Voice Agent.
Built-in health/readiness routes, bearer or edge-auth modes, CORS/origin controls, and OpenAPI contracts.
Centralized provider, model, and secret configuration for teams or hosted deployments.

Start with docs/server/README.md and the server OpenAPI file at docs/server/openapi.v1.yaml.

Agent-native integration

Use the MCP server and CLI when a coding agent should work with SpeechKit directly. Docs and test modes work without a running server. Management mode wraps a self-hosted SpeechKit Server and uses the same bearer or edge-auth rules as the HTTP API.

Start with docs/mcp/README.md, docs/agent/mcp/speechkit-mcp.md, and the agent entrypoint at docs/agent/llms.txt.

Windows Client

Use the Windows client when you want a ready-to-run desktop experience or a reference host for testing providers, models, and server connections. The app can run local-first on the machine or delegate selected work to a SpeechKit Server.

Key advantages:

Global hotkeys for Dictation, Assist, and Voice Agent.
Hands-Free settings for wake activation, target mode, auto-end behavior, and voice output.
Local audio capture, VAD, overlays, settings, provider setup, and optional audio playback in one Wails app.
Provider/model test bench for local, cloud, and direct integrations.
Server connection support with configurable bearer-token environment variable, request timeout, and local fallback behavior.

Download public builds from GitHub Releases. A fresh repository clone also carries the current installer metadata in release/latest/windows/, including canonical download URLs and SHA-256 hashes. The GitHub Release assets remain canonical.

Default hotkeys:

Dictation: Ctrl+Win
Assist: Win+Alt
Voice Agent: Ctrl+Shift

Quick Start

Embed the Go backend:

go get github.com/kombifyio/SpeechKit

Import only the components your host needs. A dictation-only app can use pkg/speechkit/dictation; an activation-only integration can use pkg/speechkit/wakeword; spoken output can use pkg/speechkit/tts; one-shot Voice Companion hosts use pkg/speechkit/companion plus Assist/TTS adapters; speaker-aware apps can use pkg/speechkit/speaker; server-connected apps use pkg/speechkit/client. You do not need to load the Windows client or the whole framework for a single component.

To drive the framework from a config.toml instead of building settings by hand, pkg/speechkit/hostconfig turns a config file into the public ModeSettings and a starting RuntimePolicy in one call:

settings, policy, err := hostconfig.Load("config.toml")

Real providers run in-process — no SpeechKit server required: realtime Voice Agents via pkg/speechkit/voiceagent/live (Gemini Live, OpenAI Realtime, Deepgram), speech-to-text via pkg/speechkit/stt (whisper.cpp, OpenAI, Groq, Google, Deepgram, AssemblyAI, Hugging Face, OpenRouter), text-to-speech via pkg/speechkit/tts (OpenAI, Google, Deepgram, Hugging Face, Piper), and a turn-based cascaded Voice Agent via pkg/speechkit/voiceagent/cascaded. Two runnable references:

# in-process Voice Agent (Gemini Live), no server:
GOOGLE_AI_API_KEY=... go run ./examples/voice-agent/in-process
# in-process Assist (host-owned LLM + optional public TTS), no server:
GOOGLE_AI_API_KEY=... go run ./examples/assist/in-process

For a single-prompt Go starter:

speechkit-cli init --template go-assist-voice-companion ./my-companion
speechkit-cli init --template go-voice-agent-companion ./my-agent
speechkit-cli init --template go-dictation-handsfree-ui ./my-dictation-ui

Run the self-host server image:

docker pull ghcr.io/kombifyio/speechkit-server:latest

Use the agent tools:

go run ./cmd/speechkit-mcp --mode=docs,test
go run ./cmd/speechkit-cli status --server http://localhost:8080 --token "$SPEECHKIT_SERVER_TOKEN"

Documentation

This README is the short orientation page. For contracts, deployment steps, or release rules, see the detailed documentation under docs/.

Build

Public source verification:

go test ./pkg/... ./cmd/speechkit-cli/... ./cmd/speechkit-mcp/... ./examples/...
GOOS=linux CGO_ENABLED=0 go test ./cmd/speechkit-server/...
GOOS=linux CGO_ENABLED=0 go build ./cmd/speechkit-server ./cmd/speechkit-mcp ./cmd/speechkit-cli

Windows client builds are shipped as release assets from the maintained release pipeline. The installer is the recommended distribution format for end users. For clone-and-install testing on Windows, use release/latest/windows/INSTALLER-MANIFEST.json to resolve the current SpeechKit-Setup.exe download URL and verify it against SHA256SUMS.txt.

Repository Layout

pkg/speechkit/          Local-first Go backend
cmd/speechkit-server/   Self-host Server entry point
cmd/speechkit-mcp/      MCP server for agent docs, validation, and management
cmd/speechkit-cli/      CLI diagnostics, scaffolding, and quick actions
internal/               Implementation packages for the public binaries
docs/                   Detailed documentation
deploy/                 Docker and server config
scripts/                Public install and release-note helpers

Trust

Public releases include checksums. While the no-cost unsigned release path is active, Windows builds are unsigned and each release carries a notice saying so. Download only from the official kombifyio/SpeechKit releases.

License

Apache-2.0. See LICENSE.

Directories

Path	Synopsis
cmd
speechkit-cli command
speechkit-mcp command
speechkit-mcp/internal/util
speechkit-server command	Package main is the canonical kombify SpeechKit Linux container server.
docs
examples
assist/in-process command	Example: fully in-process Assist — text (or speech) in, one useful result out, no SpeechKit server.
embed-companion command
embed-event-bus command
embed-tts command
library command	Example: Using SpeechKit as a Go library for speech-to-text.
provider-catalog command	Example: reading SpeechKit's public mode and provider catalog.
voice-agent/game-instructor command	Example: 15-minute Voice-Agent game instructor.
voice-agent/in-process command	Example: fully in-process Voice Agent — no SpeechKit server in the path.
voice-agent/provider-switching command	Package main demonstrates provider/profile/model selection without live credentials.
internal
ai	Package ai wires the Genkit runtime and the SpeechKit model catalog into a single LLM/embedding/reranker surface used by Assist and the Voice Agent pipeline-fallback path.
ai/flows
assist	Package assist implements the Assist Mode pipeline: STT transcript → Codeword check → LLM → TTS → Result with both text and audio.
assist/skills/voice_companion	Package voice_companion provides ToolExecutor-compatible skill plugins for SpeechKit's Voice-Companion pattern.
audio	Package audio is the platform-neutral audio I/O kernel.
auditlog	Package auditlog provides the dedicated audit-event stream for SpeechKit.
auditlogtest	Package auditlogtest provides test-only helpers for resetting the audit log package state between test cases.
config	Package config defines SpeechKit's TOML configuration schema and the load/merge/validate helpers around it.
customize
customize/templates
models	Package models defines the SpeechKit model catalog: provider IDs, model identifiers, modality (STT, TTS, Realtime Voice, Assist, Utility, Embedding, Reranker), execution mode (local/cloud/direct), and the readiness metadata that setup UIs and the readiness endpoint consume.
router	Package router implements the STT routing layer.
runtimepath	Package runtimepath resolves exe-relative paths so SpeechKit's portable-mode bundle finds its bundled assets, models, and per-user data dirs without depending on the OS-level installer having registered a fixed location.
scaffold	Package scaffold renders embedded starter templates into a target directory so callers can bootstrap a SpeechKit integration without hand-copying boilerplate.
sdkparity
secrets	Package secrets is the cross-platform credential store with the canonical User > Install > Env > None resolution hierarchy.
server	Package server is the umbrella for the Linux Server-Target HTTP + WebSocket adapter.
server/assist	Package assist implements the POST /v1/assist/process handler.
server/audio	Package audio normalizes inbound audio payloads to the Framework kernel's canonical PCM format (16 kHz, signed 16-bit little-endian, mono) before they enter the STT router.
server/catalog
server/cli	Package cli holds the small amount of CLI-level glue for the Linux SpeechKit Server entry point.
server/configapi
server/core	Package core is the SpeechKit server bootstrap layer.
server/customization
server/dictation	Package dictation implements the POST /v1/dictation/transcribe handler.
server/httpx	Package httpx contains tiny cross-handler helpers for JSON error envelopes and status mapping.
server/middleware	Package middleware provides HTTP middleware primitives for the SpeechKit server adapter.
server/onboarding
server/persona	Package persona provides the Voice Agent persona / role / sequence catalog for the Server-Target.
server/storageauth
server/transcripts
server/ttsapi
server/vocabulary
server/voiceagent	Package voiceagent implements the Voice Agent WebSocket surface on the Server-Target.
server/wakewordtraining	Package wakewordtraining mounts the v0.37.5 REST endpoints that accept wake-word activation training-data uploads from device clients.
shortcuts	Package shortcuts implements pattern-matched intent shortcuts used by Assist Mode.
store	Package store is the durable backend for transcriptions, quick notes, voice-agent session summaries, persona catalog (M5b), and wake-word activation audio.
stt
telemetry	Package telemetry installs the process-wide OpenTelemetry trace pipeline that the Framework kernel's spans (STT routing, TTS, Voice Agent, server lifecycle) feed into.
tts	Package tts re-exports the public pkg/speechkit/tts TTS surface so the existing kernel/adapter call sites (cmd/speechkit, internal/server, internal/assist, internal/voiceagent/cascaded, internal/ttswiring) keep compiling unchanged after the providers moved out to the public package.
ttswiring	Package ttswiring resolves a config.Config into the neutral tts.EnabledProviders input consumed by tts.BuildRouter.
voiceagent	Package voiceagent is the Voice Agent kernel — realtime audio-to-audio session manager backed by Gemini Live, with Persona/Role/Sequence resolution from internal/voicebehavior.
voiceagent/cascaded	Package cascaded re-exports the public pkg/speechkit/voiceagent/cascaded turn-based Voice Agent provider so existing kernel/adapter call sites (internal/server/voiceagent, internal/voiceagent, cmd/sk-localprobe, scripts) compile unchanged.
voiceagentprofile	Package voiceagentprofile re-exports the voicebehavior Profile DTO with JSON tags suitable for HTTP envelope serialisation.
voicebehavior	Package voicebehavior contains the shared Voice Agent behavior catalog used by both the local desktop runtime and the Linux server target.
pkg
speechkit	Package speechkit provides the public SDK for embedding SpeechKit voice capture, transcription, and assist/voice-agent pipelines into host applications.
speechkit/agentkit	Package agentkit provides a small Go harness for building SpeechKit Voice Agent hosts.
speechkit/assist	Package assist provides an embeddable Assist Mode service.
speechkit/assist/genkitadapter	Package genkitadapter keeps Genkit-specific Assist wiring out of the core public assist package.
speechkit/audio
speechkit/client	Package client provides a typed HTTP client for talking to a remote SpeechKit Server (the `cmd/speechkit-server` Linux container or any compatible deployment).
speechkit/companion	Package companion provides small composers for hands-free SpeechKit hosts.
speechkit/customize	Package customize defines SpeechKit's public Words/Replacements contract.
speechkit/deviceagent	Package deviceagent implements the LAN-side SpeechKit device agent contract.
speechkit/dictation	Package dictation provides an embeddable strict Dictation runtime.
speechkit/hostconfig	Package hostconfig turns a SpeechKit TOML configuration file into the public SDK types an embedding host drives the framework with: a speechkit.ModeSettings (which modes are on, their hotkeys and selected provider profiles) and a permissive speechkit.RuntimePolicy (which modes the host exposes and whether fallbacks are allowed).
speechkit/internal/speakercontract
speechkit/lifecycle	Package lifecycle owns mode start/stop orchestration and refcounted shared dependencies for SpeechKit hosts.
speechkit/netsec	Package netsec provides centralized network security primitives used by every HTTP-based provider in SpeechKit (STT, TTS, LLM, downloads).
speechkit/provideropts	Package provideropts defines SpeechKit's provider-neutral voice option vocabulary and the manifest/resolve types used by concrete provider adapters.
speechkit/speaker	Package speaker defines SpeechKit's public speaker diarization and attribution contracts.
speechkit/storage
speechkit/stt	Package stt defines the SpeechKit speech-to-text provider interface and houses the concrete provider implementations: whisper.cpp (local built-in), HuggingFace, OpenAI, Groq, Google, an OpenAI-compatible adapter (covers Ollama and other compatible servers), and the self-hosted VPS adapter.
speechkit/stt/sttcontract	Package sttcontract provides a reusable conformance suite that every stt.STTProvider implementation is expected to satisfy.
speechkit/tts	Package tts exposes the embeddable SpeechKit text-to-speech surface.
speechkit/tts/ttscontract	Package ttscontract provides a reusable conformance suite that every tts.Provider implementation is expected to satisfy.
speechkit/ttsroute	Package ttsroute holds the single source of truth that maps a Voice-Output profile ID (e.g.
speechkit/voiceagent	Package voiceagent provides an embeddable Voice Agent service.
speechkit/voiceagent/cascaded	Package cascaded implements a turn-based STT -> LLM -> TTS voice agent provider.
speechkit/voiceagent/live	Package live exposes the low-level Voice Agent realtime-protocol types.
speechkit/voiceagent/live/livecontract	Package livecontract provides reusable conformance checks for LiveProvider implementations.
speechkit/wakeword	Package wakeword exposes embeddable SpeechKit wake-word contracts.
speechkit/wakeword/sherpa	Package sherpa exposes the sherpa-onnx wake-word detector adapter.