SpeechKit

Module version: v0.40.7
Published: May 28, 2026 License: Apache-2.0

🚧 Beta. SpeechKit is in active beta. Public APIs, config keys, and defaults can still change between minor releases. Use it in production only with version pins. Pre-1.0 releases use the v0.MAJOR.MINOR scheme; breaking changes are called out in each release entry.

SpeechKit is a Windows-first voice framework for products that need dictation, voice commands, and realtime voice dialogue without coupling every use case to one desktop app or one hosted API.

The framework currently has four modules:

| Module | What it is | Use it when |
| --- | --- | --- |
| Local-first Go backend | Embeddable Go runtime in pkg/speechkit with mode contracts, provider profiles, routing policy, readiness metadata, and reusable Dictation, Assist, and Voice Agent services. | You want to integrate SpeechKit into your own Go product, internal tool, prototype, or automation host. |
| Self-host Server | Linux server runtime in cmd/speechkit-server that wraps the same backend behind HTTP and WebSocket APIs. | You need a durable process for your own clients, teams, product backends, browsers, or centrally managed model/provider configuration. |
| Agent tools | speechkit-mcp and speechkit-cli expose docs, validation, scaffolding, diagnostics, and authenticated server management to coding agents and operators. | You want an agent to inspect the framework, generate starters, validate payloads, or operate a self-hosted server. |
| Windows Client | Public installer and portable release assets for local use, provider testing, and server-connected workflows. | You want to use SpeechKit on a Windows machine, validate providers and models, or connect a workstation to a SpeechKit Server. |

Desktop device support is Windows 10/11 x64 only today. The Windows Client uses WASAPI for local capture/playback. Linux is supported as a server runtime, not as a desktop capture client; macOS and Linux desktop packages are not currently supported.

The runtime modules share the same three strict modes:

| Mode | Purpose | Boundary |
| --- | --- | --- |
| Dictation | Turn speech into text. | STT only. No LLM rewriting, no utilities, no codewords. |
| Assist | Turn speech or text into one useful result. | Codeword, utility, or LLM output with optional TTS and explicit UI surface metadata. |
| Voice Agent | Run realtime audio-to-audio dialogue. | Live conversation for brainstorming, support, and fast follow-ups. |

Hands-Free is not a fourth mode. It is an activation and voice-output layer for the three modes: wake activation, microphone capture, auto-end policy, and optional speaker output. Assist uses it for Siri/Alexa-style Voice Companion requests, Voice Agent uses it for continuous dialogue, and Dictation uses it only as UI-assisted activation with a visible text target or commit surface.
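One way to picture the relationship between the three modes and the Hands-Free layer is a small Go sketch. The `Mode` and `HandsFree` types and their fields below are hypothetical illustrations, not the types exported by pkg/speechkit. The only rule it encodes is the one stated above: Hands-Free targets exactly one of the three modes, and Dictation needs a visible text target or commit surface.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode is a hypothetical stand-in for the three strict modes.
type Mode int

const (
	Dictation Mode = iota
	Assist
	VoiceAgent
)

// String returns a readable name for logs and error messages.
func (m Mode) String() string {
	switch m {
	case Dictation:
		return "dictation"
	case Assist:
		return "assist"
	case VoiceAgent:
		return "voice-agent"
	default:
		return "unknown"
	}
}

// HandsFree models the activation layer: it wraps one target mode
// and is never a mode of its own. All field names are assumptions.
type HandsFree struct {
	Target            Mode
	WakeActivation    bool
	AutoEnd           bool
	SpeakerOutput     bool
	VisibleTextTarget bool // a text target or commit surface is on screen
}

// Validate rejects configurations that break the mode boundaries.
func (h HandsFree) Validate() error {
	switch h.Target {
	case Dictation:
		if !h.VisibleTextTarget {
			return errors.New("hands-free dictation needs a visible text target or commit surface")
		}
		return nil
	case Assist, VoiceAgent:
		return nil
	default:
		return fmt.Errorf("hands-free target must be one of the three modes, got %d", int(h.Target))
	}
}

func main() {
	configs := []HandsFree{
		{Target: Assist, WakeActivation: true, SpeakerOutput: true},
		{Target: VoiceAgent, WakeActivation: true, AutoEnd: true, SpeakerOutput: true},
		{Target: Dictation, WakeActivation: true},
	}
	for _, c := range configs {
		if err := c.Validate(); err != nil {
			fmt.Printf("%s: rejected: %v\n", c.Target, err)
			continue
		}
		fmt.Printf("%s: ok\n", c.Target)
	}
}
```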

Why SpeechKit

Local-first Go backend

Use the backend when you want voice features inside another application without adopting the Windows client. The public pkg/speechkit boundary gives host apps stable mode contracts, service interfaces, provider catalogs, and readiness data they can turn into their own setup UI.

Key advantages:

  • One framework kernel for Dictation, Assist, and Voice Agent instead of three unrelated voice pipelines.
  • Local-first provider support with room for managed local runtimes, user-managed local services, cloud providers, and direct vendor APIs.
  • Host policy controls for enabled modes, fixed profiles, fallbacks, and clean vs intelligence behavior.
  • Machine-readable readiness checks for credentials, local runtimes, model artifacts, and mode capability.
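As a rough illustration of what machine-readable readiness data could look like to a host app, the sketch below aggregates per-check results into one JSON document. The `Check` and `Readiness` types, their JSON field names, and the `Evaluate` helper are assumptions made for this example; the actual readiness schema is defined by pkg/speechkit and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Check is a hypothetical single readiness probe result.
// Kind mirrors the four areas named above: credentials, local
// runtimes, model artifacts, and mode capability.
type Check struct {
	Kind   string `json:"kind"`
	OK     bool   `json:"ok"`
	Detail string `json:"detail,omitempty"`
}

// Readiness is a hypothetical per-mode summary a setup UI could render.
type Readiness struct {
	Mode   string  `json:"mode"`
	Ready  bool    `json:"ready"`
	Checks []Check `json:"checks"`
}

// Evaluate marks a mode ready only if at least one check ran
// and every check passed.
func Evaluate(mode string, checks []Check) Readiness {
	ready := len(checks) > 0
	for _, c := range checks {
		if !c.OK {
			ready = false
		}
	}
	return Readiness{Mode: mode, Ready: ready, Checks: checks}
}

func main() {
	r := Evaluate("dictation", []Check{
		{Kind: "credentials", OK: true},
		{Kind: "local_runtime", OK: true},
		{Kind: "model_artifact", OK: false, Detail: "model file not downloaded"},
		{Kind: "mode_capability", OK: true},
	})
	out, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(string(out))
}
```

A host can turn each failing check into a setup step instead of surfacing a generic "not configured" error.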

Start with Framework API, Voice Companion, or the examples in examples/.

Self-host Server

Use the server when SpeechKit should run as a long-lived Linux service you operate. It adapts the same framework kernel to a containerized API surface so other clients can call Dictation, Assist, and Voice Agent without embedding Go code.

Key advantages:

  • One server image, one URL, and one deployment contract for all three modes.
  • HTTP endpoints for Dictation and Assist plus WebSocket sessions for realtime Voice Agent.
  • Built-in health/readiness routes, bearer or edge-auth modes, CORS/origin controls, and OpenAPI contracts.
  • Centralized provider, model, and secret configuration for teams or hosted deployments.

Start with docs/server/README.md and the server OpenAPI file at docs/server/openapi.v1.yaml.

Agent-native integration

Use the MCP server and CLI when a coding agent should work with SpeechKit directly. Docs and test modes work without a running server. Management mode wraps a self-hosted SpeechKit Server and uses the same bearer or edge-auth rules as the HTTP API.

Start with docs/mcp/README.md, docs/agent/mcp/speechkit-mcp.md, and the agent entrypoint at docs/agent/llms.txt.

Windows Client

Use the Windows client when you want a ready-to-run desktop experience or a reference host for testing providers, models, and server connections. The app can run local-first on the machine or delegate selected work to a SpeechKit Server.

Key advantages:

  • Global hotkeys for Dictation, Assist, and Voice Agent.
  • Hands-Free settings for wake activation, target mode, auto-end behavior, and voice output.
  • Local audio capture, VAD, overlays, settings, provider setup, and optional audio playback in one Wails app.
  • Provider/model test bench for local, cloud, and direct integrations.
  • Server connection support with configurable bearer-token environment variable, request timeout, and local fallback behavior.

Download public builds from GitHub Releases. A fresh repository clone also carries the current mirrored installer at release/latest/windows/SpeechKit-Setup.exe for direct Windows test installs. The GitHub Release asset remains canonical; the repo mirror is updated from it with hashes and source metadata in release/latest/windows/.

Default hotkeys:

  • Dictation: Ctrl+Win
  • Assist: Win+Alt
  • Voice Agent: Ctrl+Shift

Quick Start

Embed the Go backend:

go get github.com/kombifyio/SpeechKit

Import only the components your host needs. A dictation-only app can use pkg/speechkit/dictation; an activation-only integration can use pkg/speechkit/wakeword; spoken output can use pkg/speechkit/tts; one-shot Voice Companion hosts use pkg/speechkit/companion plus Assist/TTS adapters; server-connected apps use pkg/speechkit/client. You do not need to load the Windows client or the whole framework for a single component.

For a single-prompt Go starter:

speechkit-cli init --template go-assist-voice-companion ./my-companion
speechkit-cli init --template go-voice-agent-companion ./my-agent
speechkit-cli init --template go-dictation-handsfree-ui ./my-dictation-ui

Run the self-host server image:

docker pull ghcr.io/kombifyio/speechkit-server:latest

Use the agent tools:

go run ./cmd/speechkit-mcp --mode=docs,test
go run ./cmd/speechkit-cli status --server http://localhost:8080 --token "$SPEECHKIT_SERVER_TOKEN"

Documentation

This README is the short orientation page. The detailed docs under docs/ cover contracts, deployment steps, and release rules.

Build

Public source verification:

go test ./pkg/... ./cmd/speechkit-cli/... ./cmd/speechkit-mcp/... ./examples/...
GOOS=linux CGO_ENABLED=0 go test ./cmd/speechkit-server/...
GOOS=linux CGO_ENABLED=0 go build ./cmd/speechkit-server ./cmd/speechkit-mcp ./cmd/speechkit-cli

Windows client builds are shipped as release assets from the maintained release pipeline. The installer is the recommended distribution format for end users. For clone-and-install testing on Windows, use release/latest/windows/SpeechKit-Setup.exe; it is a repository mirror of the latest public release asset with hashes and source metadata next to it.

Repository Layout

pkg/speechkit/          Local-first Go backend
cmd/speechkit-server/   Self-host Server entry point
cmd/speechkit-mcp/      MCP server for agent docs, validation, and management
cmd/speechkit-cli/      CLI diagnostics, scaffolding, and quick actions
internal/               Implementation packages for the public binaries
docs/                   Detailed documentation
deploy/                 Docker and server config
scripts/                Public install and release-note helpers

Trust

Public releases include checksums, an SBOM, and a notice that the Windows binaries are unsigned while the no-cost unsigned release path is active. Download only from the official kombifyio/SpeechKit releases.

License

Apache-2.0. See LICENSE.

Directories

| Path | Synopsis |
| --- | --- |
| cmd/speechkit-cli (command) | |
| cmd/speechkit-mcp (command) | |
| cmd/speechkit-server (command) | Package main is the canonical kombify SpeechKit Linux container server. |
| examples/embed-companion (command) | |
| examples/embed-event-bus (command) | |
| examples/embed-tts (command) | |
| examples/library (command) | Example: Using SpeechKit as a Go library for speech-to-text. |
| examples/provider-catalog (command) | Example: reading SpeechKit's public mode and provider catalog. |
| examples/voice-agent/game-instructor (command) | Example: 15-minute Voice-Agent game instructor. |
| Path | Synopsis |
| --- | --- |
| internal/ai | Package ai wires the Genkit runtime and the SpeechKit model catalog into a single LLM/embedding/reranker surface used by Assist and the Voice Agent pipeline-fallback path. |
| internal/assist | Package assist implements the Assist Mode pipeline: STT transcript → Codeword check → LLM → TTS → Result with both text and audio. |
| internal/assist/skills/voice_companion | Package voice_companion provides ToolExecutor-compatible skill plugins for SpeechKit's Voice-Companion pattern. |
| internal/audio | Package audio is the platform-neutral audio I/O kernel. |
| internal/auditlog | Package auditlog provides the dedicated audit-event stream for SpeechKit. |
| internal/config | Package config defines SpeechKit's TOML configuration schema and the load/merge/validate helpers around it. |
| internal/models | Package models defines the SpeechKit model catalog: provider IDs, model identifiers, modality (STT, TTS, Realtime Voice, Assist, Utility, Embedding, Reranker), execution mode (local/cloud/direct), and the readiness metadata that setup UIs and the readiness endpoint consume. |
| internal/netsec | Package netsec provides centralized network security primitives used by every HTTP-based provider in SpeechKit (STT, TTS, LLM, downloads). |
| internal/router | Package router implements the STT routing layer. |
| internal/runtimepath | Package runtimepath resolves exe-relative paths so SpeechKit's portable-mode bundle finds its bundled assets, models, and per-user data dirs without depending on the OS-level installer having registered a fixed location. |
| internal/scaffold | Package scaffold renders embedded starter templates into a target directory so callers can bootstrap a SpeechKit integration without hand-copying boilerplate. |
| internal/secrets | Package secrets is the cross-platform credential store with the canonical User > Install > Env > None resolution hierarchy. |
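The User > Install > Env > None hierarchy named in the secrets synopsis can be read as "first tier that has a value wins". The sketch below shows that reading with hypothetical `Source`, `Lookup`, and `Resolve` names; internal/secrets is not importable by host apps and its real API is not documented here, so treat this purely as an illustration of the ordering.

```go
package main

import (
	"fmt"
	"os"
)

// Source names the tier a credential was resolved from.
type Source string

const (
	SourceUser    Source = "user"
	SourceInstall Source = "install"
	SourceEnv     Source = "env"
	SourceNone    Source = "none"
)

// Lookup is a hypothetical per-tier getter; os.LookupEnv already fits it.
type Lookup func(key string) (string, bool)

// FromMap adapts an in-memory map to a Lookup for demos and tests.
func FromMap(m map[string]string) Lookup {
	return func(key string) (string, bool) {
		v, ok := m[key]
		return v, ok
	}
}

// Resolve walks the tiers in priority order and returns the first
// non-empty value together with the tier it came from.
func Resolve(key string, user, install, env Lookup) (string, Source) {
	tiers := []struct {
		src    Source
		lookup Lookup
	}{
		{SourceUser, user},
		{SourceInstall, install},
		{SourceEnv, env},
	}
	for _, t := range tiers {
		if t.lookup == nil {
			continue
		}
		if v, ok := t.lookup(key); ok && v != "" {
			return v, t.src
		}
	}
	return "", SourceNone
}

func main() {
	user := FromMap(map[string]string{})
	install := FromMap(map[string]string{"EXAMPLE_API_KEY": "install-level-key"})
	_, src := Resolve("EXAMPLE_API_KEY", user, install, os.LookupEnv)
	fmt.Println("resolved from:", src)
}
```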
| Path | Synopsis |
| --- | --- |
| internal/server | Package server is the umbrella for the Linux Server-Target HTTP + WebSocket adapter. |
| internal/server/assist | Package assist implements the POST /v1/assist/process handler. |
| internal/server/audio | Package audio normalizes inbound audio payloads to the Framework kernel's canonical PCM format (16 kHz, signed 16-bit little-endian, mono) before they enter the STT router. |
| internal/server/cli | Package cli holds the small amount of CLI-level glue for the Linux SpeechKit Server entry point. |
| internal/server/core | Package core is the SpeechKit server bootstrap layer. |
| internal/server/dictation | Package dictation implements the POST /v1/dictation/transcribe handler. |
| internal/server/httpx | Package httpx contains tiny cross-handler helpers for JSON error envelopes and status mapping. |
| internal/server/middleware | Package middleware provides HTTP middleware primitives for the SpeechKit server adapter. |
| internal/server/persona | Package persona provides the Voice Agent persona / role / sequence catalog for the Server-Target. |
| internal/server/voiceagent | Package voiceagent implements the Voice Agent WebSocket surface on the Server-Target. |
| internal/server/wakewordtraining | Package wakewordtraining mounts the v0.37.5 REST endpoints that accept wake-word activation training-data uploads from device clients. |
| internal/shortcuts | Package shortcuts implements pattern-matched intent shortcuts used by Assist Mode. |
| internal/store | Package store is the durable backend for transcriptions, quick notes, voice-agent session summaries, persona catalog (M5b), and wake-word activation audio. |
| internal/stt | Package stt defines the SpeechKit speech-to-text provider interface and houses the concrete provider implementations: whisper.cpp (local built-in), HuggingFace, OpenAI, Groq, Google, an OpenAI-compatible adapter (covers Ollama and other compatible servers), and the self-hosted VPS adapter. |
| internal/tts | Package tts implements the SpeechKit text-to-speech surface: a small provider interface plus concrete adapters for OpenAI, Google, and Hugging Face. |
| internal/voiceagent | Package voiceagent is the Voice Agent kernel: a realtime audio-to-audio session manager backed by Gemini Live, with Persona/Role/Sequence resolution from internal/voicebehavior. |
| internal/voiceagent/cascaded | Package cascaded implements a turn-based STT -> LLM -> TTS voice agent provider. |
| internal/voiceagentprofile | Package voiceagentprofile re-exports the voicebehavior Profile DTO with JSON tags suitable for HTTP envelope serialisation. |
| internal/voicebehavior | Package voicebehavior contains the shared Voice Agent behavior catalog used by both the local desktop runtime and the Linux server target. |
| Path | Synopsis |
| --- | --- |
| pkg/speechkit | Package speechkit provides the public SDK for embedding SpeechKit voice capture, transcription, and assist/voice-agent pipelines into host applications. |
| pkg/speechkit/agentkit | Package agentkit provides a small Go harness for building SpeechKit Voice Agent hosts. |
| pkg/speechkit/assist | Package assist provides an embeddable Assist Mode service. |
| pkg/speechkit/assist/genkitadapter | Package genkitadapter keeps Genkit-specific Assist wiring out of the core public assist package. |
| pkg/speechkit/client | Package client provides a typed HTTP client for talking to a remote SpeechKit Server (the `cmd/speechkit-server` Linux container or any compatible deployment). |
| pkg/speechkit/companion | Package companion provides small composers for hands-free SpeechKit hosts. |
| pkg/speechkit/dictation | Package dictation provides an embeddable strict Dictation runtime. |
| pkg/speechkit/lifecycle | Package lifecycle owns mode start/stop orchestration and refcounted shared dependencies for SpeechKit hosts. |
| pkg/speechkit/tts | Package tts exposes the embeddable SpeechKit text-to-speech surface. |
| pkg/speechkit/voiceagent | Package voiceagent provides an embeddable Voice Agent service. |
| pkg/speechkit/voiceagent/live | Package live exposes the low-level Voice Agent realtime-protocol types. |
| pkg/speechkit/wakeword | Package wakeword exposes embeddable SpeechKit wake-word contracts. |
| pkg/speechkit/wakeword/sherpa | Package sherpa exposes the sherpa-onnx wake-word detector adapter. |
