SpeechKit

module v0.28.0 (Latest)
Published: Apr 30, 2026 License: Apache-2.0

SpeechKit is a Windows-first speech framework with a desktop reference app, an embeddable Go API, and a containerized Server-Target. It is built for products that need strict speech modes, local-first defaults, optional cloud providers, and host-managed credentials.

The repository treats frontend/app as first-class source. The embedded internal/frontendassets/dist output is generated from that source and should not be edited manually.

What You Get

| Variant | Use it when | Ships |
|---|---|---|
| Device-Target | You want the Windows reference app | Wails desktop host, local overlay, global hotkeys, Settings UI |
| Local-Target | You embed SpeechKit into a Go app | pkg/speechkit, examples, mode contracts, provider catalog |
| Server-Target | You expose SpeechKit over HTTP/WebSocket | containerized server runtime, REST endpoints, realtime Voice Agent WebSocket |

All variants share the same framework kernel. The Windows app is a reference client, not the source of truth for the framework contract.

Core Features

  • three strict product modes: Dictation, Assist, and Voice Agent
  • local-first Dictation with whisper.cpp support and optional cloud STT
  • six STT provider paths: whisper.cpp, Hugging Face, OpenAI, Groq, Google, and self-hosted VPS
  • Assist utilities for rewrites, summaries, answers, drafts, optional TTS, and visible result panels
  • Voice Agent realtime dialogue through Gemini Live or an explicit pipeline fallback
  • layered Voice Agent prompts: host/framework prompt plus optional personal refinement prompt
  • local SQLite state by default, with storage contracts prepared for server deployments
  • host-managed credentials; the framework core does not embed provider tokens
  • public control-plane and server OpenAPI contracts for integrations

Mode Boundaries

| Mode | Intelligence | Contract |
|---|---|---|
| Dictation | User Intelligence | Audio in, text out. No LLM rewriting, no tools, no Assist routing. |
| Assist | Utility Intelligence | One-shot utility or LLM result with optional TTS and result surface metadata. |
| Voice Agent | Brainstorming Intelligence | Realtime spoken dialogue or explicit pipeline fallback with session summary support. |
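One way to picture the boundaries in the table above is a per-mode allowlist that rejects anything outside a mode's contract. The sketch below is illustrative only: the `Mode` and `Capability` types, the capability names, and `Check` are hypothetical and are not taken from pkg/speechkit.

```go
package main

import "fmt"

// Mode is one of the three strict product modes named in the README.
type Mode string

const (
	Dictation  Mode = "dictation"
	Assist     Mode = "assist"
	VoiceAgent Mode = "voice_agent"
)

// Capability is a hypothetical label for something a mode may do.
type Capability string

const (
	CapTranscribe       Capability = "transcribe"
	CapLLMResult        Capability = "llm_result"
	CapTTS              Capability = "tts"
	CapRealtimeDialogue Capability = "realtime_dialogue"
	CapSessionSummary   Capability = "session_summary"
)

// allowed is an assumed encoding of the Mode Boundaries table.
// Dictation is audio in, text out, and nothing else.
var allowed = map[Mode]map[Capability]bool{
	Dictation:  {CapTranscribe: true},
	Assist:     {CapLLMResult: true, CapTTS: true},
	VoiceAgent: {CapRealtimeDialogue: true, CapSessionSummary: true},
}

// Check returns an error when a mode is asked to act outside its contract.
func Check(m Mode, c Capability) error {
	caps, ok := allowed[m]
	if !ok {
		return fmt.Errorf("unknown mode %q", m)
	}
	if !caps[c] {
		return fmt.Errorf("mode %q does not allow %q", m, c)
	}
	return nil
}

func main() {
	// Dictation must never route into LLM rewriting.
	fmt.Println(Check(Dictation, CapLLMResult))
	// Assist may produce an LLM result.
	fmt.Println(Check(Assist, CapLLMResult))
}
```

Keeping the allowlist as data rather than scattered conditionals makes a strict boundary easy to audit in one place.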

Default mode hotkeys in the Windows reference app are Win+Alt for Dictation, Ctrl+Win for Assist, and Ctrl+Shift for Voice Agent.

Start Here

  • Framework API - embeddable Go API, mode contracts, provider catalog, and local control API.
  • Server-Target guide - server runtime, mode endpoints, auth, and deployment profiles.
  • Server deploy guide - Docker Compose, Render, and generic OCI deployment notes.
  • Local OpenAPI - desktop control-plane contract.
  • Server OpenAPI - HTTP and WebSocket contract for the Server-Target.
  • Examples - library and provider-catalog examples.
  • Docs index - architecture, release, trust, and runbook links.

Quick Start

Windows App

Download the latest Windows artifacts from GitHub Releases:

  • SpeechKit-Setup.exe - installer
  • SpeechKit-Portable.zip - portable bundle

Public Windows releases include SHA256SUMS.txt, SpeechKit.sbom.json, and UNSIGNED-WINDOWS-RELEASE.txt when the no-cost unsigned release path is active.

For local development on Windows, start the local bundle from this repository:

powershell -ExecutionPolicy Bypass -File .\start-dev.ps1

If the bundle is missing, the launcher builds it first via the canonical Windows build script. Use npm run app:dev:detached when you want to start it without keeping the terminal attached.

Go Library

go get github.com/kombifyio/SpeechKit/pkg/speechkit

Use the framework backend in your own Go application by implementing the small host interfaces for audio recording, transcription, persistence, and output delivery. See examples/library/ for a minimal dictation pipeline and examples/provider-catalog/ for the three-mode provider contract.

Key public API entry points:

  • speechkit.DefaultModeContracts()
  • speechkit.DefaultProviderProfiles()
  • speechkit.ProfilesForMode(mode)
  • speechkit.ProviderKindsForMode(mode)
  • speechkit.ValidateProfileForMode(profile, mode)
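The README lists these function names but not their signatures, so the sketch below does not call the real package. It only illustrates the idea behind filtering and validating provider profiles per mode, using local, hypothetical types (`Profile`, `profilesForMode`, `validateProfileForMode`) and a made-up `example-llm` entry.

```go
package main

import "fmt"

type Mode string

const (
	Dictation  Mode = "dictation"
	Assist     Mode = "assist"
	VoiceAgent Mode = "voice_agent"
)

// Profile is a hypothetical provider profile: a provider and the modes it may serve.
type Profile struct {
	Name  string
	Kind  string
	Modes []Mode
}

// exampleProfiles is illustrative data, not the framework's default catalog.
func exampleProfiles() []Profile {
	return []Profile{
		{Name: "whisper.cpp", Kind: "local_stt", Modes: []Mode{Dictation}},
		{Name: "example-llm", Kind: "llm", Modes: []Mode{Assist}},
		{Name: "Gemini Live", Kind: "realtime", Modes: []Mode{VoiceAgent}},
	}
}

// supports reports whether the profile declares the given mode.
func supports(p Profile, mode Mode) bool {
	for _, m := range p.Modes {
		if m == mode {
			return true
		}
	}
	return false
}

// profilesForMode returns only the profiles that declare the mode.
func profilesForMode(all []Profile, mode Mode) []Profile {
	var out []Profile
	for _, p := range all {
		if supports(p, mode) {
			out = append(out, p)
		}
	}
	return out
}

// validateProfileForMode rejects a profile used outside its declared modes.
func validateProfileForMode(p Profile, mode Mode) error {
	if !supports(p, mode) {
		return fmt.Errorf("profile %q is not valid for mode %q", p.Name, mode)
	}
	return nil
}

func main() {
	all := exampleProfiles()
	for _, p := range profilesForMode(all, Dictation) {
		fmt.Println("dictation profile:", p.Name)
	}
	if err := validateProfileForMode(all[2], Dictation); err != nil {
		fmt.Println("rejected:", err)
	}
}
```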

Server-Target

docker pull ghcr.io/kombifyio/speechkit-server:latest

Use the Server-Target for Dictation REST, Assist REST, and realtime Voice Agent WebSocket from a containerized deployment. See docs/server/README.md.

Runtime Configuration

The staged Windows bundle includes config.toml next to SpeechKit.exe. For custom setups, start from config.example.toml.

[huggingface]
enabled = false
model = "openai/whisper-large-v3"
token_env = "HF_TOKEN"

[store]
backend = "sqlite"
save_audio = true
audio_retention_days = 7

[shortcuts.locale.de]
summarize = ["kurzfassung", "briefing"]
copy_last = ["kopier den letzten block"]

Public OSS users should rely on explicit configuration and environment variables. Internal development may use private secret managers, but public artifacts must never depend on private defaults.
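To illustrate what the `[shortcuts.locale.de]` table in the configuration above could drive, the sketch below maps a spoken phrase to a shortcut action. The exact-match, case-insensitive lookup and the `matchShortcut` helper are assumptions for illustration; the README does not describe how SpeechKit matches phrases.

```go
package main

import (
	"fmt"
	"strings"
)

// deShortcuts mirrors the [shortcuts.locale.de] entries from the sample config.
var deShortcuts = map[string][]string{
	"summarize": {"kurzfassung", "briefing"},
	"copy_last": {"kopier den letzten block"},
}

// matchShortcut returns the action whose phrase list contains the utterance,
// compared case-insensitively after trimming surrounding whitespace.
func matchShortcut(table map[string][]string, utterance string) (string, bool) {
	norm := strings.ToLower(strings.TrimSpace(utterance))
	for action, phrases := range table {
		for _, p := range phrases {
			if norm == p {
				return action, true
			}
		}
	}
	return "", false
}

func main() {
	if action, ok := matchShortcut(deShortcuts, "Briefing"); ok {
		fmt.Println("shortcut:", action)
	}
	if _, ok := matchShortcut(deShortcuts, "guten morgen"); !ok {
		fmt.Println("no shortcut matched")
	}
}
```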

Provider Credentials

SpeechKit's framework core is tokenless. Hosts decide how credentials are stored and injected.

The Windows reference host resolves Hugging Face credentials in this order:

  1. user token stored from Settings
  2. install token seeded by the installer and migrated on first start
  3. environment variable fallback via token_env
  4. internal development fallback only when explicitly configured

Server deployments read secret values only from environment variables whose names are configured in TOML.

Build And Verification

Prerequisites:

  • Go 1.26+
  • Node.js 22+
  • MinGW-w64 for CGo on Windows
  • NSIS for installer builds
  • optional: ONNX Runtime DLL for Silero VAD
  • optional: whisper.cpp server binary for local STT

Canonical Windows app build:

powershell -ExecutionPolicy Bypass -File scripts/build.ps1 -SkipInstaller

Common checks:

go test ./...
go vet ./...
npm --prefix frontend/app run test
npm --prefix frontend/app run build
npm --prefix Website run test
npm --prefix Website run build

Project Structure

pkg/speechkit/          Public framework orchestration API
cmd/speechkit/          Wails desktop host application
cmd/speechkit-server/   Linux Server-Target entry point
cmd/speechkit-voice/    Linux voice-only server entry point
frontend/app/           React/Vite Windows UI sources
Website/                Svelte/Vite public website
internal/audio/         WASAPI capture and playback
internal/stt/           STT provider implementations
internal/tts/           TTS provider implementations
internal/ai/            LLM integration
internal/assist/        Assist mode pipeline
internal/voiceagent/    Voice Agent runtime
internal/server/        Server-Target HTTP/WebSocket adapters
internal/serverclient/  Device-to-server transport adapters
internal/store/         SQLite/Postgres storage contracts
deploy/                 Docker, Render, and server config
docs/                   Architecture, release, server, and runbook docs
examples/               Library usage examples
installer/              NSIS Windows installer
scripts/                Build, release, export, and verification scripts

Release And Trust

SpeechKit is prepared in a private upstream and mirrored into kombifyio/SpeechKit through an allowlisted public export.

Start with:

Contributing

See:

License

Apache-2.0. See LICENSE.

Directories

Path Synopsis
cmd
sk-e2e command
Command sk-e2e is a thin end-to-end smoke client for a running speechkit-server instance.
speechkit command
speechkit-server command
Package main is the canonical kombify SpeechKit Linux container server.
speechkit-voice command
Package main is the kombify SpeechKit Voice Server — a focused Linux container that exposes ONLY the Voice Agent mode (real-time audio-to-audio dialogue over WebSocket).
examples
library command
Example: Using SpeechKit as a Go library for speech-to-text.
provider-catalog command
Example: reading SpeechKit's public mode and provider catalog.
internal
ai
assist
Package assist implements the Assist Mode pipeline: STT transcript → Codeword check → LLM → TTS → Result with both text and audio.
audio
Audio playback via ebitengine/oto only requires cgo on Linux (ALSA/PulseAudio); the Windows and Darwin backends are pure-Go via purego.
auth
Package auth provides the authentication abstraction for SpeechKit.
downloads
Package downloads manages model downloads for SpeechKit — HTTP file downloads and Ollama model pulls with progress tracking.
features
Package features provides runtime feature detection for UI gating.
kombify
Package kombify is the build-tag seam between OSS and kombify builds.
netsec
Package netsec provides centralized network security primitives used by every HTTP-based provider in SpeechKit (STT, TTS, LLM, downloads).
server/assist
Package assist implements the POST /v1/assist/process handler.
server/audio
Package audio normalizes inbound audio payloads to the Framework kernel's canonical PCM format (16 kHz, signed 16-bit little-endian, mono) before they enter the STT router.
server/cli
Package cli holds the small amount of CLI-level glue the speechkit-server and speechkit-voice binaries share.
server/core
Package core is the SpeechKit server bootstrap layer.
server/dictation
Package dictation implements the POST /v1/dictation/transcribe handler.
server/httpx
Package httpx contains tiny cross-handler helpers for JSON error envelopes and status mapping.
server/middleware
Package middleware provides HTTP middleware primitives for the SpeechKit server adapter.
server/persona
Package persona provides the Voice Agent persona / role / sequence catalog for the Server-Target.
server/voiceagent
Package voiceagent implements the Voice Agent WebSocket surface on the Server-Target.
serverclient
Package serverclient is the client-side transport adapter that lets a device-target (cmd/speechkit) or a local-target binary delegate one or more modes (Dictation, Assist, Voice Agent) to a remote SpeechKit Server-Target instead of running the Framework kernel in-process.
stt
tts
vad
voiceagent
Package voiceagent implements the Voice Agent Mode — a real-time, bidirectional voice conversation using native audio-to-audio models (Gemini Live API, OpenAI Realtime API) over WebSocket.
winapi
Package winapi provides shared Windows DLL proc references used by multiple packages.
pkg
speechkit
Package speechkit provides the public SDK for embedding SpeechKit voice capture and transcription into host applications.
speechkit/assist
Package assist provides an embeddable Assist service constructor.
speechkit/dictation
Package dictation provides an embeddable strict Dictation runtime.
speechkit/voiceagent
Package voiceagent provides an embeddable Voice Agent service constructor.
