SpeechKit

Version: v0.22.1 (latest). Published: Apr 20, 2026. License: Apache-2.0.

Repository

github.com/kombifyio/SpeechKit

SpeechKit is a Windows-first speech-to-text framework with a desktop host application. It is designed to be embedded into tools that want local-first dictation, optional cloud providers, and a clean host-managed credential model.

The repository treats frontend/app as first-class source. The embedded internal/frontendassets/dist output is generated from that source and should not be edited manually.

What SpeechKit Is

a Go framework for speech capture, routing, transcription, and desktop integration
a Wails-based Windows desktop host that exercises the framework end to end
a local-first runtime with optional provider integrations such as Hugging Face and self-hosted VPS endpoints

Framework Principles

provider-agnostic core
tokenless framework layer
host-managed credentials and secret storage
local SQLite default for zero-config usage
Windows-first release quality for the first public version
three strict product modes: Dictation, Assist, and Voice Agent
Gemini Live as the standard Voice Agent runtime with a durable framework prompt, an optional personal refinement prompt, and session policy

Three Ways to Use SpeechKit

As a Go Library

Use the framework in your own Go application without any UI:

go get github.com/kombifyio/SpeechKit/pkg/speechkit

Implement a handful of interfaces (Transcriber, AudioRecorder, Persistence) and the framework handles recording lifecycle, job queuing, and transcription routing. See examples/library/ for a working example.

As a Windows Desktop App

Download the installer from the Releases page:

SpeechKit-Setup.exe — Windows installer
SpeechKit-Portable.zip — portable bundle (no install required)

As an Android App

The android/ directory contains a Kotlin-based Android implementation with a custom keyboard (HeliBoard integration) and voice assistant service. Android support is under active development.

Current Feature Set

push-to-talk Dictation with lightweight overlay feedback and no AI/tool routing
local runtime state and history via SQLite
six STT providers: local whisper.cpp, Hugging Face, OpenAI, Groq, Google, self-hosted VPS
Assist mode for one-shot utilities, rewrites, summaries, and answer panels with optional TTS
Voice Agent mode for realtime audio-to-audio dialogue (Gemini Live) with a dedicated live transcript surface and custom orb
layered Voice Agent setup: host supplies API key, framework prompt, optional personal refinement prompt, and Gemini session policy
settings UI for provider, overlay, hotkey, and storage preferences

Provider Credential Model

The framework core does not embed provider tokens.

For Hugging Face, the current host resolution order is:

user token stored from Settings
install token seeded by the installer and migrated on first start
environment variable fallback via token_env
explicit Doppler fallback for internal development only

That keeps the public framework neutral while allowing host apps to choose their own policy.

Prerequisites

Go 1.26+
Node.js 22+
MinGW-w64 for CGo on Windows
NSIS for the canonical Windows build that emits the installer
optional: ONNX Runtime DLL for Silero VAD
optional: whisper.cpp server binary for local STT
optional: Doppler CLI for internal development flows

Quick Start

git clone https://github.com/kombifyio/SpeechKit.git
cd SpeechKit
powershell -ExecutionPolicy Bypass -File scripts/build.ps1

The canonical Windows build produces:

dist/windows/SpeechKit/SpeechKit.exe
dist/windows/SpeechKit-Setup.exe

Runtime Configuration

The staged bundle includes config.toml next to SpeechKit.exe. For custom setups, start from config.example.toml.

[huggingface]
enabled = false
model = "openai/whisper-large-v3"
token_env = "HF_TOKEN"

[store]
backend = "sqlite"
save_audio = true
audio_retention_days = 7

[shortcuts.locale.de]
summarize = ["kurzfassung", "briefing"]
copy_last = ["kopier den letzten block"]

Public OSS users should rely on explicit configuration and environment variables. Internal development may additionally use Doppler, but public artifacts must never depend on private Doppler defaults.

Shortcut aliases are additive. SpeechKit keeps the built-in multilingual defaults and overlays any configured locale-specific aliases on top, so product teams can ship their own command words without changing Go code.
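The additive behavior can be sketched as a merge that copies the built-in aliases and appends configured phrases without dropping or duplicating anything. AliasSet and MergeAliases are hypothetical names, and the default phrases below are invented for the example; only the German overlay values are taken from the config snippet above.

```go
package main

import (
	"fmt"
	"sort"
)

// AliasSet maps a command name to the spoken phrases that trigger it.
type AliasSet map[string][]string

// MergeAliases returns the defaults plus any overlay phrases.
// Inputs are not mutated, defaults keep their order, and duplicates are skipped.
func MergeAliases(defaults, overlay AliasSet) AliasSet {
	merged := make(AliasSet, len(defaults))
	for cmd, phrases := range defaults {
		merged[cmd] = append([]string(nil), phrases...)
	}
	for cmd, phrases := range overlay {
		seen := make(map[string]bool, len(merged[cmd]))
		for _, p := range merged[cmd] {
			seen[p] = true
		}
		for _, p := range phrases {
			if !seen[p] {
				merged[cmd] = append(merged[cmd], p)
				seen[p] = true
			}
		}
	}
	return merged
}

func main() {
	// Hypothetical built-in defaults; the real defaults are not listed in this README.
	defaults := AliasSet{"summarize": {"summarize", "zusammenfassen"}}

	// Overlay taken from the [shortcuts.locale.de] example.
	de := AliasSet{
		"summarize": {"kurzfassung", "briefing"},
		"copy_last": {"kopier den letzten block"},
	}

	merged := MergeAliases(defaults, de)
	cmds := make([]string, 0, len(merged))
	for cmd := range merged {
		cmds = append(cmds, cmd)
	}
	sort.Strings(cmds) // map iteration order is random, so sort for stable output
	for _, cmd := range cmds {
		fmt.Println(cmd, "->", merged[cmd])
	}
}
```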

Default mode hotkeys are Win+Alt for Dictation, Ctrl+Shift+J for Assist, and Ctrl+Shift+K for Voice Agent.

Voice Agent Live Test

For the first end-to-end Voice Agent run, keep the setup minimal:

Set voice_agent_hotkey in config.toml. Set active_mode = "voice_agent" only if you want Voice Agent preselected on startup.
Provide a Gemini API key through the env var referenced by [providers.google].api_key_env (default: GOOGLE_AI_API_KEY).
Keep [voice_agent].framework_prompt = "" if you want the built-in default helper, or supply your own durable framework prompt.
Optionally add [voice_agent].refinement_prompt for personal preferences that should sharpen the framework prompt without replacing it.
Use model = "gemini-2.5-flash-native-audio-preview-12-2025" for the current recommended default Voice Agent runtime.
Launch SpeechKit.exe and press the configured voice_agent_hotkey to start and stop the live session.

Notes:

Native-audio Gemini Live sessions do not rely on speechConfig.languageCode; SpeechKit steers preferred language through the layered prompt assembly and locale-aware defaults.
enable_affective_dialog = true automatically switches the Gemini Live client to v1alpha and is intended for Gemini 2.5 native-audio sessions, not Gemini 3.1 Flash Live.
Non-blocking tool behavior is available in the Voice Agent framework contract, but Gemini 3.1 Flash Live only supports sequential tool execution.
Voice Agent is a realtime-dialog surface. If the live runtime is unavailable, SpeechKit now keeps the mode boundary explicit instead of silently dropping into the Assist capture pipeline.
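The API version note can be expressed as a small selection function. This is a sketch under stated assumptions: VoiceAgentConfig and LiveAPIVersion are hypothetical names, and "v1beta" as the version used when affective dialog is off is an assumption, since this README only states that enabling the flag switches the client to v1alpha.

```go
package main

import "fmt"

const (
	// Assumed default; the README does not name the version used without affective dialog.
	defaultLiveAPIVersion = "v1beta"
	// Stated in the README: enable_affective_dialog switches the client to v1alpha.
	affectiveAPIVersion = "v1alpha"
)

// VoiceAgentConfig mirrors the two settings relevant here (hypothetical struct).
type VoiceAgentConfig struct {
	Model                 string
	EnableAffectiveDialog bool
}

// LiveAPIVersion picks the API version for a Gemini Live session.
func LiveAPIVersion(cfg VoiceAgentConfig) string {
	if cfg.EnableAffectiveDialog {
		return affectiveAPIVersion
	}
	return defaultLiveAPIVersion
}

func main() {
	cfg := VoiceAgentConfig{
		Model:                 "gemini-2.5-flash-native-audio-preview-12-2025",
		EnableAffectiveDialog: true,
	}
	fmt.Println(cfg.Model, "uses API version", LiveAPIVersion(cfg))
}
```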

The Voice Agent now combines two prompt layers on every session:

framework_prompt: the durable host/framework instruction that defines the product behavior and fixed flows
refinement_prompt: the user-level personalization layer that sharpens tone, brevity, naming, or other preferences without replacing the framework layer

Mode Boundaries

Dictation: speech-to-text only, no codeword or utility routing
Assist: one-shot utility mode that either inserts directly when safe or opens a reusable result panel
Voice Agent: realtime spoken dialogue for brainstorming and quick clarification, not a work-product or insertion surface

Build and Verification

powershell -ExecutionPolicy Bypass -File scripts/build.ps1

This is the canonical verification path. It runs:

frontend tests
frontend lint
frontend production build
go vet
go test ./...
bundle build
installer build

Project Structure

pkg/speechkit/          Framework-level orchestration (public API)
cmd/speechkit/          Wails desktop host application
frontend/app/           React/Vite UI sources
internal/audio/         Audio capture (WASAPI)
internal/stt/           STT provider implementations (6 providers)
internal/tts/           TTS provider implementations
internal/ai/            LLM integration via Genkit
internal/assist/        Assist mode pipeline (STT -> LLM -> TTS)
internal/voiceagent/    Voice agent (Gemini Live WebSocket)
internal/vad/           Voice activity detection (Silero ONNX)
internal/config/        Runtime config and secret resolution
internal/router/        Provider routing
internal/store/         Local storage (SQLite / PostgreSQL)
internal/secrets/       Host-side secret storage
internal/frontendassets/ Generated embedded frontend assets
android/                Android app and keyboard integration
examples/               Library usage examples
installer/              NSIS Windows installer
scripts/                Build and release scripts
docs/                   Architecture and contributor docs

OSS Release Hygiene

SpeechKit is prepared in a private upstream and mirrored into a separate release repository. Public publication is allowlist-based.

Start with:

Code Signing

Public Windows releases are expected to be built from kombifyio/SpeechKit, signed, and verified before publication.

See:

Contributing

See:

License

Apache-2.0. See LICENSE.

Directories

Path	Synopsis
assets
cmd/speechkit (command)
examples/library (command)	Example: Using SpeechKit as a Go library for speech-to-text.
internal/ai
internal/ai/flows
internal/assist	Package assist implements the Assist Mode pipeline: STT transcript → Codeword check → LLM → TTS → Result with both text and audio.
internal/audio
internal/auth	Package auth provides the authentication abstraction for SpeechKit.
internal/config
internal/dictation
internal/downloads	Package downloads manages model downloads for SpeechKit: HTTP file downloads (whisper models) and Ollama model pulls with progress tracking.
internal/features	Package features provides runtime feature detection for UI gating.
internal/frontendassets
internal/hotkey
internal/kombify	Package kombify is the build-tag seam between OSS and kombify builds.
internal/models
internal/netsec	Package netsec provides centralized network security primitives used by every HTTP-based provider in SpeechKit (STT, TTS, LLM, downloads).
internal/output
internal/router
internal/runtimepath
internal/secrets
internal/shortcuts
internal/store
internal/stt
internal/textactions
internal/tray
internal/tts
internal/vad
internal/voiceagent	Package voiceagent implements the Voice Agent Mode: a real-time, bidirectional voice conversation using native audio-to-audio models (Gemini Live API, OpenAI Realtime API) over WebSocket.
internal/winapi	Package winapi provides shared Windows DLL proc references used by multiple packages.
pkg/speechkit	Package speechkit provides the public SDK for embedding SpeechKit voice capture and transcription into host applications.
