# talker

A fast, OpenAI-compatible Chat Completion API wrapping local LLM inference using hugot.
## Goal of this project

talker provides a lightweight, entirely local backend that mimics the OpenAI Chat Completions API (`POST /v1/chat/completions`) and Embeddings API (`POST /v1/embeddings`). It lets you point your existing OpenAI-compatible applications at a local, privacy-preserving server running ONNX-based language models, without a complex Python setup.
## Installation

To set up talker, clone the repository and fetch its dependencies:

```bash
git clone https://github.com/siherrmann/talker.git
cd talker
go mod tidy
```

The server requires:

- Go 1.25+
- ONNX-formatted language or embedding models (which can be downloaded automatically)
## Getting Started

### Basic Usage

The simplest way to start the API for testing is with the built-in mock engine. If no model parameters are specified, the server defaults to the mock engine, allowing you to test endpoints immediately:

```bash
go run main.go
```

To run with real models and have them downloaded automatically if they are missing:

```bash
MODEL_FOLDER=./models CHAT_MODEL=HuggingFaceTB/SmolLM-135M-Instruct EMBEDDING_MODEL=BAAI/bge-small-en-v1.5 PORT=8080 go run main.go
```
### Environment Variables

The API behavior can be configured via environment variables:

```bash
MODEL_FOLDER=./models                          # Required for auto-download: the base directory in which to store models.
CHAT_MODEL=HuggingFaceTB/SmolLM-135M-Instruct  # Optional: the Hugging Face repo name for the text generation model.
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5         # Optional: the Hugging Face repo name for the embeddings model.
PORT=8080                                      # Optional: sets the port for the Echo server (default is 8080).
```

If neither `CHAT_MODEL` nor `EMBEDDING_MODEL` is provided, the mock engine is used.
## Features

### Local LLM Inference

- hugot Integration: Native Go inference using the high-performance hugot library (which wraps ONNX Runtime).
- Automatic Downloading: Automatically downloads the requested models from Hugging Face directly into your `MODEL_FOLDER` on startup.

### OpenAI Compatibility

- Standard Endpoints: Strict implementation of both `POST /v1/chat/completions` and `POST /v1/embeddings`.
- Request/Response Models: Fully conforms to the standard OpenAI request and response schemas.
- SSE Streaming: Fully supports Server-Sent Events for real-time streaming when `stream: true` is passed.
- Strict JSON Enforcement: Supports `response_format: {"type": "json_object"}` with automatic struct validation via github.com/siherrmann/validator. If the LLM generates invalid JSON, the engine automatically retries up to 3 times, passing the validation errors back to the model as a prompt.
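The retry-on-invalid-JSON loop can be sketched as follows. This is a minimal illustration, not talker's actual engine code: the function and type names here are hypothetical, and validation is reduced to a plain `json.Unmarshal` (talker additionally validates against a struct via github.com/siherrmann/validator).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// generateFunc stands in for a call to the underlying LLM pipeline.
type generateFunc func(prompt string) string

// generateJSON asks the model for JSON and retries up to maxRetries times,
// feeding each validation error back into the prompt so the model can correct itself.
func generateJSON(gen generateFunc, prompt string, maxRetries int) (map[string]any, error) {
	var lastErr error
	for i := 0; i <= maxRetries; i++ {
		out := gen(prompt)
		var parsed map[string]any
		if err := json.Unmarshal([]byte(out), &parsed); err != nil {
			lastErr = err
			// Pass the error back to the model as part of the next prompt.
			prompt = fmt.Sprintf("%s\nYour previous output was invalid JSON (%v). Respond with valid JSON only.", prompt, err)
			continue
		}
		return parsed, nil
	}
	return nil, fmt.Errorf("no valid JSON after %d retries: %w", maxRetries, lastErr)
}

func main() {
	calls := 0
	// A fake model that fails once, then returns valid JSON.
	gen := generateFunc(func(prompt string) string {
		calls++
		if calls == 1 {
			return "not json"
		}
		return `{"answer": 42}`
	})
	parsed, err := generateJSON(gen, "Answer as JSON.", 3)
	fmt.Println(parsed["answer"], err, calls) // 42 <nil> 2
}
```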
### Robust Architecture

- Echo v5 Framework: Built on top of Echo for rapid and robust HTTP routing.
- Test-Driven: Designed with a highly mockable architecture.
## API Interface

### API Endpoints

- `POST /v1/chat/completions` - Generates chat completions.
- `POST /v1/embeddings` - Generates vector embeddings for a given input.
Example request (non-streaming chat):

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
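Because the endpoint follows the standard OpenAI response schema, the reply is shaped like the following (values here are illustrative):

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "local-model",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you today?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 20, "completion_tokens": 9, "total_tokens": 29}
}
```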
Example request (embeddings):

```bash
curl -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-embedding-model",
    "input": ["First sentence", "Second sentence"]
  }'
```
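Because the endpoints follow the OpenAI schema, any OpenAI-compatible client library works; the request can also be built with the Go standard library alone. A minimal sketch (the struct field names follow the standard OpenAI chat schema; `chat` is a hypothetical helper, not part of talker):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal request/response types matching the OpenAI chat schema.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

// chat sends a completion request to a talker server at baseURL
// and returns the first choice's message content.
func chat(baseURL string, req chatRequest) (string, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return "", err
	}
	resp, err := http.Post(baseURL+"/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", fmt.Errorf("no choices in response")
	}
	return out.Choices[0].Message.Content, nil
}

func main() {
	answer, err := chat("http://localhost:8080", chatRequest{
		Model: "local-model",
		Messages: []chatMessage{
			{Role: "system", Content: "You are a helpful assistant."},
			{Role: "user", Content: "Hello!"},
		},
	})
	fmt.Println(answer, err)
}
```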
## Architecture

talker is built with:

- Echo v5 - fast HTTP framework for Go
- hugot - Go wrapper around ONNX Runtime for local inference pipelines

The application follows a clean architecture with:
- Handlers (`handler/`): Contains `ChatHandler` and `EmbeddingsHandler` for the HTTP lifecycle.
- Core Engine (`core/`): Abstracts the underlying hugot pipeline calls (`HugotEngine`). It seamlessly supports `TextGenerationPipeline` and `FeatureExtractionPipeline` concurrently.
- Models (`model/`): Native Go structs matching the exact schema required by client libraries expecting an OpenAI backend. Includes custom unmarshaling logic for robust handling of dynamic OpenAI fields (e.g., embeddings input as string vs. array).
## Development

### Development Commands

Run the test suite to verify handlers and data parsing logic:

```bash
# Run all tests
go test ./...

# Run the server with the mock engine
go run main.go
```