SpeechKit

Module version: v0.28.0 (latest). Published: Apr 30, 2026. License: Apache-2.0

Repository

github.com/kombifyio/SpeechKit

README

SpeechKit

SpeechKit is a Windows-first speech framework with a desktop reference app, an embeddable Go API, and a containerized Server-Target. It is built for products that need strict speech modes, local-first defaults, optional cloud providers, and host-managed credentials.

The repository treats frontend/app as first-class source. The embedded internal/frontendassets/dist output is generated from that source and should not be edited manually.

What You Get

Variant	Use it when	Ships
Device-Target	You want the Windows reference app	Wails desktop host, local overlay, global hotkeys, Settings UI
Local-Target	You embed SpeechKit into a Go app	`pkg/speechkit`, examples, mode contracts, provider catalog
Server-Target	You expose SpeechKit over HTTP/WebSocket	containerized server runtime, REST endpoints, realtime Voice Agent WebSocket

All variants share the same framework kernel. The Windows app is a reference client, not the source of truth for the framework contract.

Core Features

three strict product modes: Dictation, Assist, and Voice Agent
local-first Dictation with whisper.cpp support and optional cloud STT
six STT provider paths: whisper.cpp, Hugging Face, OpenAI, Groq, Google, and self-hosted VPS
Assist utilities for rewrites, summaries, answers, drafts, optional TTS, and visible result panels
Voice Agent realtime dialogue through Gemini Live or an explicit pipeline fallback
layered Voice Agent prompts: host/framework prompt plus optional personal refinement prompt
local SQLite state by default, with storage contracts prepared for server deployments
host-managed credentials; the framework core does not embed provider tokens
public control-plane and server OpenAPI contracts for integrations

Mode Boundaries

Mode	Intelligence	Contract
Dictation	User Intelligence	Audio in, text out. No LLM rewriting, no tools, no Assist routing.
Assist	Utility Intelligence	One-shot utility or LLM result with optional TTS and result surface metadata.
Voice Agent	Brainstorming Intelligence	Realtime spoken dialogue or explicit pipeline fallback with session summary support.
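
The table can be read as a set of per-mode prohibitions. The following is a minimal sketch of that idea, not the real pkg/speechkit contract types: Mode, Step, forbidden, and CheckStep are hypothetical names introduced for illustration, and only the restrictions the table states for Dictation are encoded.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode and Step are hypothetical stand-ins, not pkg/speechkit types.
type Mode string
type Step string

const (
	Dictation  Mode = "dictation"
	Assist     Mode = "assist"
	VoiceAgent Mode = "voice-agent"
)

const (
	StepSTT           Step = "stt"
	StepLLMRewrite    Step = "llm-rewrite"
	StepTools         Step = "tools"
	StepAssistRouting Step = "assist-routing"
)

// ErrOutsideContract marks a step that a mode's contract rules out.
var ErrOutsideContract = errors.New("step is outside the mode contract")

// forbidden lists only what the table states explicitly:
// Dictation is audio in, text out, with no LLM rewriting, tools, or Assist routing.
var forbidden = map[Mode][]Step{
	Dictation: {StepLLMRewrite, StepTools, StepAssistRouting},
}

// CheckStep returns an error when step s is ruled out for mode m.
func CheckStep(m Mode, s Step) error {
	for _, f := range forbidden[m] {
		if f == s {
			return fmt.Errorf("%s does not allow %s: %w", m, s, ErrOutsideContract)
		}
	}
	return nil
}

func main() {
	for _, m := range []Mode{Dictation, Assist, VoiceAgent} {
		for _, s := range []Step{StepSTT, StepLLMRewrite} {
			fmt.Printf("%-12s %-12s allowed=%v\n", m, s, CheckStep(m, s) == nil)
		}
	}
}
```

A host could run a check like this before dispatching a pipeline step, so that a strict mode fails loudly instead of silently picking up behavior from another mode.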

Default mode hotkeys in the Windows reference app are Win+Alt for Dictation, Ctrl+Win for Assist, and Ctrl+Shift for Voice Agent.

Start Here

Framework API - embeddable Go API, mode contracts, provider catalog, and local control API.
Server-Target guide - server runtime, mode endpoints, auth, and deployment profiles.
Server deploy guide - Docker Compose, Render, and generic OCI deployment notes.
Local OpenAPI - desktop control-plane contract.
Server OpenAPI - HTTP and WebSocket contract for the Server-Target.
Examples - library and provider-catalog examples.
Docs index - architecture, release, trust, and runbook links.

Quick Start

Windows App

Download the latest Windows artifacts from GitHub Releases:

SpeechKit-Setup.exe - installer
SpeechKit-Portable.zip - portable bundle

Public Windows releases include SHA256SUMS.txt, SpeechKit.sbom.json, and UNSIGNED-WINDOWS-RELEASE.txt when the no-cost unsigned release path is active.
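
A downloaded artifact can be checked against SHA256SUMS.txt before it is run. The sketch below assumes the file uses the common sha256sum layout (one `<hex digest>  <file name>` entry per line); the README does not specify the format, and expectedSum and Verify are hypothetical helper names. File names containing spaces are not handled.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// expectedSum finds the digest recorded for name in sha256sum-style text.
// Assumption: each line is "<hex digest>  <file name>".
func expectedSum(sumsText, name string) (string, bool) {
	for _, line := range strings.Split(sumsText, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		// sha256sum writes binary-mode entries as "*name".
		if strings.TrimPrefix(fields[1], "*") == name {
			return strings.ToLower(fields[0]), true
		}
	}
	return "", false
}

// Verify hashes data and compares it with the entry recorded for name.
func Verify(sumsText, name string, data []byte) error {
	want, ok := expectedSum(sumsText, name)
	if !ok {
		return fmt.Errorf("%s: no entry in checksum list", name)
	}
	sum := sha256.Sum256(data)
	if got := hex.EncodeToString(sum[:]); got != want {
		return fmt.Errorf("%s: checksum mismatch: got %s, want %s", name, got, want)
	}
	return nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: verify <SHA256SUMS.txt> <artifact>")
		return
	}
	sums, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	data, err := os.ReadFile(os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := Verify(string(sums), filepath.Base(os.Args[2]), data); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("checksum OK")
}
```

Reading the whole artifact into memory keeps the sketch short; a streaming hash would suit very large files better.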

For local development on Windows, start the local bundle from this repository:

powershell -ExecutionPolicy Bypass -File .\start-dev.ps1

If the bundle is missing, the launcher builds it first via the canonical Windows build script. Use npm run app:dev:detached when you want to start it without keeping the terminal attached.

Go Library

go get github.com/kombifyio/SpeechKit/pkg/speechkit

Use the framework backend in your own Go application by implementing the small host interfaces for audio recording, transcription, persistence, and output delivery. See examples/library/ for a minimal dictation pipeline and examples/provider-catalog/ for the three-mode provider contract.
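
To illustrate the shape of such a host, here is a minimal sketch of a dictation pipeline built on four small interfaces. Every name in it (Recorder, Transcriber, Store, Output, Host, Dictate) is hypothetical and chosen for this example; the real interfaces and signatures live in pkg/speechkit and examples/library/ and may differ.

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical host interfaces, one per responsibility named above.
type Recorder interface {
	Record(ctx context.Context) ([]byte, error)
}
type Transcriber interface {
	Transcribe(ctx context.Context, pcm []byte) (string, error)
}
type Store interface {
	Save(ctx context.Context, text string) error
}
type Output interface {
	Deliver(ctx context.Context, text string) error
}

// Host wires the four pieces together.
type Host struct {
	Rec Recorder
	STT Transcriber
	DB  Store
	Out Output
}

// Dictate runs record -> transcribe -> save -> deliver and stops at the first error.
func (h Host) Dictate(ctx context.Context) (string, error) {
	pcm, err := h.Rec.Record(ctx)
	if err != nil {
		return "", fmt.Errorf("record: %w", err)
	}
	text, err := h.STT.Transcribe(ctx, pcm)
	if err != nil {
		return "", fmt.Errorf("transcribe: %w", err)
	}
	if err := h.DB.Save(ctx, text); err != nil {
		return "", fmt.Errorf("save: %w", err)
	}
	if err := h.Out.Deliver(ctx, text); err != nil {
		return "", fmt.Errorf("deliver: %w", err)
	}
	return text, nil
}

// In-memory fakes so the sketch runs without audio hardware or a provider.
type fixedAudio []byte

func (a fixedAudio) Record(context.Context) ([]byte, error) { return a, nil }

type cannedSTT string

func (s cannedSTT) Transcribe(context.Context, []byte) (string, error) { return string(s), nil }

type memStore struct{ saved []string }

func (m *memStore) Save(_ context.Context, text string) error {
	m.saved = append(m.saved, text)
	return nil
}

type printer struct{}

func (printer) Deliver(_ context.Context, text string) error {
	fmt.Println("deliver:", text)
	return nil
}

// runDemo builds a host from the fakes and runs one dictation pass.
func runDemo() (string, []string, error) {
	store := &memStore{}
	host := Host{Rec: fixedAudio{0, 0}, STT: cannedSTT("hello world"), DB: store, Out: printer{}}
	text, err := host.Dictate(context.Background())
	return text, store.saved, err
}

func main() {
	text, saved, err := runDemo()
	fmt.Println(text, saved, err)
}
```

Keeping each interface to a single method makes it straightforward to swap a fake for a real recorder or provider without touching the pipeline.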

Key public API entry points:

speechkit.DefaultModeContracts()
speechkit.DefaultProviderProfiles()
speechkit.ProfilesForMode(mode)
speechkit.ProviderKindsForMode(mode)
speechkit.ValidateProfileForMode(profile, mode)

Server-Target

docker pull ghcr.io/kombifyio/speechkit-server:latest

Use the Server-Target for Dictation REST, Assist REST, and realtime Voice Agent WebSocket from a containerized deployment. See docs/server/README.md.

Runtime Configuration

The staged Windows bundle includes config.toml next to SpeechKit.exe. For custom setups, start from config.example.toml.

[huggingface]
enabled = false
model = "openai/whisper-large-v3"
token_env = "HF_TOKEN"

[store]
backend = "sqlite"
save_audio = true
audio_retention_days = 7

[shortcuts.locale.de]
summarize = ["kurzfassung", "briefing"]
copy_last = ["kopier den letzten block"]
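
As an illustration of how a host might interpret the [store] keys, the sketch below decides whether a saved clip is past its retention window. The StoreConfig struct and Expired method are hypothetical, and two semantics are assumed rather than documented: clips older than audio_retention_days are eligible for cleanup, and a value of zero or less disables cleanup.

```go
package main

import (
	"fmt"
	"time"
)

// StoreConfig mirrors the [store] keys shown above. The struct is hypothetical.
type StoreConfig struct {
	Backend            string
	SaveAudio          bool
	AudioRetentionDays int
}

// Expired reports whether a clip recorded at recordedAt is older than the
// retention window at time now.
// Assumption: AudioRetentionDays <= 0 means "keep forever".
func (c StoreConfig) Expired(recordedAt, now time.Time) bool {
	if c.AudioRetentionDays <= 0 {
		return false
	}
	cutoff := now.AddDate(0, 0, -c.AudioRetentionDays)
	return recordedAt.Before(cutoff)
}

// demo evaluates clips that are 1, 7, and 8 days old against a 7-day window.
func demo() [3]bool {
	now := time.Date(2026, time.May, 10, 12, 0, 0, 0, time.UTC)
	cfg := StoreConfig{Backend: "sqlite", SaveAudio: true, AudioRetentionDays: 7}
	return [3]bool{
		cfg.Expired(now.AddDate(0, 0, -1), now),
		cfg.Expired(now.AddDate(0, 0, -7), now), // exactly at the cutoff: kept
		cfg.Expired(now.AddDate(0, 0, -8), now),
	}
}

func main() {
	fmt.Println(demo())
}
```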

Public OSS users should rely on explicit configuration and environment variables. Internal development may use private secret managers, but public artifacts must never depend on private defaults.

Provider Credentials

SpeechKit's framework core is tokenless. Hosts decide how credentials are stored and injected.

The Windows reference host resolves Hugging Face credentials in this order:

user token stored from Settings
install token seeded by the installer and migrated on first start
environment variable fallback via token_env
internal development fallback only when explicitly configured
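
The resolution order above can be sketched as an ordered chain in which the first non-empty source wins. This is an illustration under assumptions, not the reference host's code: Source, Resolve, and HuggingFaceChain are hypothetical names, and the development fallback is modeled as an optional function that is only appended when the caller supplies it.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// Source is one place a token may come from. An empty result means "not set here".
type Source struct {
	Name   string
	Lookup func() string
}

// ErrNoToken is returned when every source comes back empty.
var ErrNoToken = errors.New("no credential found")

// Resolve walks the chain in order and returns the first non-empty token
// together with the name of the source that supplied it.
func Resolve(chain []Source) (token, from string, err error) {
	for _, s := range chain {
		if v := strings.TrimSpace(s.Lookup()); v != "" {
			return v, s.Name, nil
		}
	}
	return "", "", ErrNoToken
}

// HuggingFaceChain builds the chain in the order listed above.
// devFallback is nil unless the fallback was explicitly configured.
func HuggingFaceChain(userToken, installToken, tokenEnv string,
	getenv func(string) string, devFallback func() string) []Source {
	chain := []Source{
		{Name: "user token", Lookup: func() string { return userToken }},
		{Name: "install token", Lookup: func() string { return installToken }},
		{Name: "environment", Lookup: func() string { return getenv(tokenEnv) }},
	}
	if devFallback != nil {
		chain = append(chain, Source{Name: "development fallback", Lookup: devFallback})
	}
	return chain
}

func main() {
	chain := HuggingFaceChain("", "", "HF_TOKEN", os.Getenv, nil)
	_, from, err := Resolve(chain)
	if err != nil {
		fmt.Println("no Hugging Face token configured")
		return
	}
	// Report only where the token came from, never the token itself.
	fmt.Println("token resolved from:", from)
}
```

Passing the environment lookup in as a function keeps the chain testable without mutating the process environment.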

Server deployments read secret values only from environment variables whose names are configured in TOML.

Build And Verification

Prerequisites:

Go 1.26+
Node.js 22+
MinGW-w64 for CGo on Windows
NSIS for installer builds
optional: ONNX Runtime DLL for Silero VAD
optional: whisper.cpp server binary for local STT

Canonical Windows app build:

powershell -ExecutionPolicy Bypass -File scripts/build.ps1 -SkipInstaller

Common checks:

go test ./...
go vet ./...
npm --prefix frontend/app run test
npm --prefix frontend/app run build
npm --prefix Website run test
npm --prefix Website run build

Project Structure

pkg/speechkit/          Public framework orchestration API
cmd/speechkit/          Wails desktop host application
cmd/speechkit-server/   Linux Server-Target entry point
cmd/speechkit-voice/    Linux voice-only server entry point
frontend/app/           React/Vite Windows UI sources
Website/                Svelte/Vite public website
internal/audio/         WASAPI capture and playback
internal/stt/           STT provider implementations
internal/tts/           TTS provider implementations
internal/ai/            LLM integration
internal/assist/        Assist mode pipeline
internal/voiceagent/    Voice Agent runtime
internal/server/        Server-Target HTTP/WebSocket adapters
internal/serverclient/  Device-to-server transport adapters
internal/store/         SQLite/Postgres storage contracts
deploy/                 Docker, Render, and server config
docs/                   Architecture, release, server, and runbook docs
examples/               Library usage examples
installer/              NSIS Windows installer
scripts/                Build, release, export, and verification scripts

Release And Trust

SpeechKit is prepared in a private upstream and mirrored into kombifyio/SpeechKit through an allowlisted public export.

Start with:

Contributing

See:

License

Apache-2.0. See LICENSE.

Directories

Path	Synopsis
assets
cmd
cmd/sk-e2e (command)	Command sk-e2e is a thin end-to-end smoke client for a running speechkit-server instance.
cmd/speechkit (command)
cmd/speechkit-server (command)	Package main is the canonical kombify SpeechKit Linux container server.
cmd/speechkit-voice (command)	Package main is the kombify SpeechKit Voice Server, a focused Linux container that exposes ONLY the Voice Agent mode (real-time audio-to-audio dialogue over WebSocket).
examples
examples/library (command)	Example: Using SpeechKit as a Go library for speech-to-text.
examples/provider-catalog (command)	Example: reading SpeechKit's public mode and provider catalog.
internal
internal/ai
internal/ai/flows
internal/assist	Package assist implements the Assist Mode pipeline: STT transcript → Codeword check → LLM → TTS → Result with both text and audio.
internal/audio	Audio playback via ebitengine/oto only requires cgo on Linux (ALSA/PulseAudio); the Windows and Darwin backends are pure-Go via purego.
internal/auth	Package auth provides the authentication abstraction for SpeechKit.
internal/config
internal/dictation
internal/downloads	Package downloads manages model downloads for SpeechKit: HTTP file downloads and Ollama model pulls with progress tracking.
internal/features	Package features provides runtime feature detection for UI gating.
internal/frontendassets
internal/hotkey
internal/kombify	Package kombify is the build-tag seam between OSS and kombify builds.
internal/localllm
internal/models
internal/netsec	Package netsec provides centralized network security primitives used by every HTTP-based provider in SpeechKit (STT, TTS, LLM, downloads).
internal/output
internal/router
internal/runtimepath
internal/secrets
internal/server/assist	Package assist implements the POST /v1/assist/process handler.
internal/server/audio	Package audio normalizes inbound audio payloads to the Framework kernel's canonical PCM format (16 kHz, signed 16-bit little-endian, mono) before they enter the STT router.
internal/server/cli	Package cli holds the small amount of CLI-level glue the speechkit-server and speechkit-voice binaries share.
internal/server/core	Package core is the SpeechKit server bootstrap layer.
internal/server/dictation	Package dictation implements the POST /v1/dictation/transcribe handler.
internal/server/httpx	Package httpx contains tiny cross-handler helpers for JSON error envelopes and status mapping.
internal/server/middleware	Package middleware provides HTTP middleware primitives for the SpeechKit server adapter.
internal/server/persona	Package persona provides the Voice Agent persona / role / sequence catalog for the Server-Target.
internal/server/voiceagent	Package voiceagent implements the Voice Agent WebSocket surface on the Server-Target.
internal/serverclient	Package serverclient is the client-side transport adapter that lets a device-target (cmd/speechkit) or a local-target binary delegate one or more modes (Dictation, Assist, Voice Agent) to a remote SpeechKit Server-Target instead of running the Framework kernel in-process.
internal/shortcuts
internal/store
internal/stt
internal/textactions
internal/tray
internal/tts
internal/vad
internal/voiceagent	Package voiceagent implements the Voice Agent Mode: a real-time, bidirectional voice conversation using native audio-to-audio models (Gemini Live API, OpenAI Realtime API) over WebSocket.
internal/winapi	Package winapi provides shared Windows DLL proc references used by multiple packages.
pkg
pkg/speechkit	Package speechkit provides the public SDK for embedding SpeechKit voice capture and transcription into host applications.
pkg/speechkit/assist	Package assist provides an embeddable Assist service constructor.
pkg/speechkit/dictation	Package dictation provides an embeddable strict Dictation runtime.
pkg/speechkit/voiceagent	Package voiceagent provides an embeddable Voice Agent service constructor.
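
The server audio package synopsis names a canonical PCM format of 16 kHz, signed 16-bit little-endian, mono. As a rough illustration of two of those properties, the sketch below downmixes interleaved stereo samples to mono and serializes them as s16le bytes. DownmixStereo and EncodeS16LE are hypothetical helpers, not functions from the package, and resampling to 16 kHz is deliberately left out.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// DownmixStereo averages interleaved left/right samples into one mono channel.
// A trailing unpaired sample, if any, is dropped.
func DownmixStereo(interleaved []int16) []int16 {
	mono := make([]int16, len(interleaved)/2)
	for i := range mono {
		// Widen to int32 so the sum cannot overflow before halving.
		l, r := int32(interleaved[2*i]), int32(interleaved[2*i+1])
		mono[i] = int16((l + r) / 2)
	}
	return mono
}

// EncodeS16LE serializes samples as signed 16-bit little-endian bytes.
func EncodeS16LE(samples []int16) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		binary.LittleEndian.PutUint16(out[2*i:], uint16(s))
	}
	return out
}

func main() {
	stereo := []int16{100, 300, -32768, -32768, 32767, 32767}
	mono := DownmixStereo(stereo)
	fmt.Println(mono)
	fmt.Printf("% x\n", EncodeS16LE(mono))
}
```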