llm

package
v0.8.0 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 29, 2026 License: MIT Imports: 19 Imported by: 0

Documentation

Overview

Package llm provides the LLM client abstraction for synapses-intelligence. All LLM calls use structured JSON output to ensure deterministic, fast parsing.

Index

Constants

View Source
const HFBaseURL = "https://huggingface.co"

HFBaseURL is the HuggingFace resolve endpoint for file downloads.

Variables

This section is empty.

Functions

func DownloadGGUF

func DownloadGGUF(ctx context.Context, cfg DownloadConfig) (string, error)

DownloadGGUF downloads a GGUF model file from HuggingFace if it doesn't already exist locally. Returns the local path on success.

Progress messages are written to cfg.Progress (if non-nil) in the format:

Downloading sil-coder-Q5_K_M.gguf from huggingface.co/divish/sil-coder
 500 MB / 6.5 GB (7%)
 ...
Download complete: /Users/you/.synapses/models/sil-coder-Q5_K_M.gguf

func ExtractJSON

func ExtractJSON(s string) string

ExtractJSON strips markdown code fences and extracts the JSON object from raw LLM output. Many small models wrap JSON responses in ```json ... ``` blocks despite instructions. This function handles that gracefully so callers always get raw JSON to unmarshal.

func GGUFExists

func GGUFExists(path string) bool

GGUFExists returns true if the GGUF file already exists on disk.

func ListInstalledModels

func ListInstalledModels(ctx context.Context, baseURL string) ([]string, error)

ListInstalledModels returns all model names present in Ollama's local library.

func ParseSILResponse

func ParseSILResponse(raw string) (rootSummary, insight string, concerns []string)

ParseSILResponse parses the labeled output format produced by the fine-tuned SIL model.

Expected output (after optional <think>...</think> block):

ROOT_SUMMARY: One sentence about the root node.
INSIGHT: One sentence about architectural role.
CONCERNS: concern1, concern2, concern3

Falls back to raw text as insight for backward compatibility with standard Ollama models that emit plain text or JSON.

Returns empty strings/nil slice for any field not found in the response.

func RepairJSON

func RepairJSON(s string) string

RepairJSON attempts to fix common JSON bracket mismatches produced by Qwen3.5 models. The most frequent issue is nested arrays-of-objects where the model writes "]]" instead of "]}]" (dropping the closing "}" of the inner object before the outer array bracket).

Only modifies the input when it fails json.Unmarshal AND the fix produces valid JSON — never corrupts already-valid output.

func Truncate

func Truncate(s string, n int) string

Truncate shortens s to at most n runes for use in error messages. Appends "..." when truncation occurs. Uses rune-aware slicing to avoid cutting multi-byte UTF-8 characters mid-sequence.
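The rune-aware behaviour can be sketched as:

```go
// truncate mirrors the documented behaviour: slice by runes, not bytes,
// so multi-byte UTF-8 characters are never split mid-sequence.
func truncate(s string, n int) string {
	r := []rune(s)
	if len(r) <= n {
		return s
	}
	return string(r[:n]) + "..."
}
```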

Types

type DownloadConfig

type DownloadConfig struct {
	// Repo is the HuggingFace repo, e.g. "divish/sil-coder"
	Repo string
	// Filename is the GGUF file name within the repo, e.g. "sil-coder-Q5_K_M.gguf"
	Filename string
	// DestDir is the local directory to save to. Created if it doesn't exist.
	DestDir string
	// Progress is an optional writer for progress messages. May be nil.
	Progress io.Writer
	// SHA256 is the expected SHA-256 hex digest. Required: download fails if empty.
	SHA256 string
}

DownloadConfig holds parameters for a GGUF download.

func (DownloadConfig) DestPath

func (d DownloadConfig) DestPath() string

DestPath returns the full local path where the GGUF will be saved.

func (DownloadConfig) URL

func (d DownloadConfig) URL() string

URL returns the HuggingFace download URL for this file.
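HuggingFace resolve URLs generally follow the pattern sketched below; the "main" revision is an assumption of this sketch, since URL()'s actual construction is unexported:

```go
// hfURL sketches how URL() likely combines HFBaseURL, Repo, and Filename.
// The "main" revision segment is an assumption, not confirmed by the docs.
func hfURL(repo, filename string) string {
	return "https://huggingface.co/" + repo + "/resolve/main/" + filename
}
```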

type HardwareConfig

type HardwareConfig struct {
	// HasMetal is true on Apple Silicon (M1/M2/M3/M4) Macs.
	// llama.cpp uses the Metal framework for GPU acceleration on these devices.
	HasMetal bool

	// HasCUDA is true when an NVIDIA GPU with CUDA support is detected.
	HasCUDA bool

	// GPULayers is the number of transformer layers to offload to the GPU.
	// 0 = CPU-only. Auto-tuned based on detected VRAM.
	GPULayers int

	// AvailableRAMGB is the approximate amount of free system RAM in GB.
	// Used as an anti-OOM guard: if too low the local backend is skipped.
	AvailableRAMGB float64
}

HardwareConfig describes the host machine's LLM-relevant capabilities.

func DetectHardware

func DetectHardware() HardwareConfig

DetectHardware probes the current machine and returns a HardwareConfig. It is safe to call multiple times; results are not cached (cheap probes).
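The anti-OOM guard described for AvailableRAMGB can be sketched as follows; the 1.2× headroom factor here is an assumption of this sketch, not the package's actual threshold:

```go
// HardwareConfig subset relevant to the RAM guard in this sketch.
type HardwareConfig struct {
	AvailableRAMGB float64
}

// useLocalBackend illustrates the guard: skip the local backend unless
// free RAM covers the model size plus some headroom (factor assumed).
func useLocalBackend(hw HardwareConfig, modelSizeGB float64) bool {
	return hw.AvailableRAMGB >= modelSizeGB*1.2
}
```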

type LLMClient

type LLMClient interface {
	// Generate sends a prompt to the LLM and returns the raw response text.
	// The caller is responsible for parsing the JSON response.
	Generate(ctx context.Context, prompt string) (string, error)

	// Available returns true if the backend is reachable and the model is loaded.
	Available(ctx context.Context) bool

	// ModelName returns the configured model identifier.
	ModelName() string

	// ModelPulled returns true if the model is already present locally
	// (no download needed).
	ModelPulled(ctx context.Context) bool

	// PullModel downloads the model, streaming progress to w.
	PullModel(ctx context.Context, w io.Writer) error
}

LLMClient is the interface for all LLM backends. Implementations: OllamaClient (production), MockClient (tests).

type LocalClient

type LocalClient struct {
	// contains filtered or unexported fields
}

LocalClient runs a fine-tuned GGUF model embedded in-process via godeps/gollama. Zero network calls — everything happens in RAM.

gollama is not goroutine-safe per context instance, so all Generate calls are serialised through mu. For high-throughput workloads consider a pool of LocalClient instances, one per goroutine.

func NewLocalClient

func NewLocalClient(ggufPath string, hw HardwareConfig) (*LocalClient, error)

NewLocalClient loads a GGUF model file and returns a ready LocalClient. Returns an error if the model cannot be loaded or available RAM is too low.

Usage:

cli, err := llm.NewLocalClient("/path/to/sil-9b-gguf/model.gguf", llm.DetectHardware())

func (*LocalClient) Available

func (c *LocalClient) Available(_ context.Context) bool

Available returns true if the model is loaded and RAM is sufficient.

func (*LocalClient) Close

func (c *LocalClient) Close()

Close releases GPU/CPU memory held by the llama.cpp model and context. Safe to call multiple times; second and subsequent calls are no-ops.

func (*LocalClient) Generate

func (c *LocalClient) Generate(ctx context.Context, prompt string) (string, error)

Generate runs inference on prompt and returns the decoded response text. Thread-safe: mu is held only for the guard reads; inferSem serialises the actual CGo call outside the lock so context cancellation at the semaphore does not strand the next caller.

func (*LocalClient) ModelName

func (c *LocalClient) ModelName() string

ModelName returns the GGUF file name without path, used for logging.

func (*LocalClient) ModelPulled

func (c *LocalClient) ModelPulled(_ context.Context) bool

ModelPulled always returns true — local GGUF files are already on disk.

func (*LocalClient) PullModel

func (c *LocalClient) PullModel(_ context.Context, _ io.Writer) error

PullModel is a no-op for local files (nothing to download).

func (*LocalClient) WithThinking

func (c *LocalClient) WithThinking(enabled bool) *LocalClient

WithThinking enables or disables extended reasoning mode (Qwen3 <think> blocks).

type MockClient

type MockClient struct {
	Response string
	Err      error
	// contains filtered or unexported fields
}

MockClient is a deterministic LLM client for tests. It returns a fixed response for every Generate call.

func NewMockClient

func NewMockClient(response string) *MockClient

NewMockClient creates a MockClient that always returns the given response.

func NewUnavailableMockClient

func NewUnavailableMockClient() *MockClient

NewUnavailableMockClient creates a MockClient that reports itself unavailable.

func (*MockClient) Available

func (m *MockClient) Available(_ context.Context) bool

Available reports whether the mock client is configured as available.

func (*MockClient) Generate

func (m *MockClient) Generate(_ context.Context, _ string) (string, error)

Generate returns the configured mock response.

func (*MockClient) ModelName

func (m *MockClient) ModelName() string

ModelName returns the mock model name.

func (*MockClient) ModelPulled

func (m *MockClient) ModelPulled(_ context.Context) bool

ModelPulled reports whether the mock model is available.

func (*MockClient) PullModel

func (m *MockClient) PullModel(_ context.Context, _ io.Writer) error

PullModel is a no-op for the mock client.

type ModelWarmer

type ModelWarmer interface {
	WarmUp(ctx context.Context) error
}

ModelWarmer can pre-load a model into memory before the first real request. OllamaClient implements this by sending an empty prompt with keep_alive=-1, which forces Ollama to load the model weights without generating any output.

type OllamaClient

type OllamaClient struct {
	// contains filtered or unexported fields
}

OllamaClient calls the Ollama REST API at POST /api/generate. It keeps a reusable http.Client for connection pooling.

func NewOllamaClient

func NewOllamaClient(baseURL, model string, timeoutMS int) *OllamaClient

NewOllamaClient creates a client targeting the given Ollama base URL and model. timeoutMS is the per-request timeout in milliseconds (applied at HTTP client level — does not cancel the Ollama server-side inference, only the wait).

func (*OllamaClient) Available

func (c *OllamaClient) Available(ctx context.Context) bool

Available checks if Ollama is reachable by calling GET /api/tags. Returns true only if the HTTP call succeeds with a 200 status.

func (*OllamaClient) Generate

func (c *OllamaClient) Generate(ctx context.Context, prompt string) (string, error)

Generate sends a prompt and returns the response text. Uses stream=false for simplicity and lowest latency on small outputs. For Qwen3.x models, sets the Ollama API think: bool field (≥0.6) to control chain-of-thought. Non-Qwen3 models receive no think field — they ignore it. When useChat=true, dispatches to /api/chat instead of /api/generate — required for fine-tuned Qwen3.5 models that need chat-template formatting.

func (*OllamaClient) ModelName

func (c *OllamaClient) ModelName() string

ModelName returns the configured model tag.

func (*OllamaClient) ModelPulled

func (c *OllamaClient) ModelPulled(ctx context.Context) bool

ModelPulled returns true if the configured model is already present in Ollama's local model library (i.e. no pull is needed). Uses a short 3s deadline so startup is not blocked for 30s if Ollama is slow.

func (*OllamaClient) PullModel

func (c *OllamaClient) PullModel(ctx context.Context, w io.Writer) error

PullModel pulls the configured model from the Ollama registry, streaming progress lines to w. Pass os.Stderr for terminal feedback. Blocks until the pull completes or ctx is cancelled.

func (*OllamaClient) WarmUp

func (c *OllamaClient) WarmUp(ctx context.Context) error

WarmUp pre-loads the model into Ollama's memory by sending an empty prompt. Uses the client's configured keepAlive so that the warm model respects the same RAM residency policy as live requests. Pinned tiers (keepAlive=-1) stay loaded; JIT tiers (keepAlive=0) get pre-loaded but are evicted on first real request — this avoids warmup overriding the intended Optimal-mode RAM budget. Implements ModelWarmer. Called in background goroutines at brain startup.

func (*OllamaClient) WithChatMode

func (c *OllamaClient) WithChatMode(enabled bool) *OllamaClient

WithChatMode switches the client from /api/generate to /api/chat. Required for fine-tuned Qwen3.5 models: they need the chat-template message structure to follow instructions correctly. Raw /api/generate prompts cause these models to echo training examples instead of responding. Returns the client to allow chaining.

func (*OllamaClient) WithJSONFormat

func (c *OllamaClient) WithJSONFormat(enabled bool) *OllamaClient

WithJSONFormat enables Ollama's structured JSON output mode by setting "format":"json" in the request body. When enabled, the model is constrained to emit only valid JSON — it will not produce prose, markdown fences, or partial output. Use for tiers that parse structured responses (Orchestrator, Archivist) where base models might otherwise produce free-text. Returns the client to allow chaining.

func (*OllamaClient) WithKeepAlive

func (c *OllamaClient) WithKeepAlive(secs int) *OllamaClient

WithKeepAlive sets how long Ollama keeps the model loaded after a request. Pass -1 to pin the model in RAM indefinitely (hot-tier models called frequently). Pass 0 to evict immediately after each request (one-shot cold tasks). Pass positive seconds for a custom TTL. If WithKeepAlive is never called, Ollama's 5-minute default applies. Returns the client to allow chaining.

func (*OllamaClient) WithNumPredict

func (c *OllamaClient) WithNumPredict(n int) *OllamaClient

WithNumPredict sets the maximum output tokens per request. Default is 400 (sufficient for insight/coordination JSON). Increase for tiers with longer structured outputs, e.g. Archivist (1024). Returns the client to allow chaining.

func (*OllamaClient) WithThinking

func (c *OllamaClient) WithThinking(enabled bool) *OllamaClient

WithThinking configures extended thinking mode for Qwen3.5 models. Call on construction: llm.NewOllamaClient(...).WithThinking(true). Returns the client to allow chaining.
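The With* methods form a chainable builder. A miniature standalone sketch of the pattern (field names and defaults here are assumptions, except the documented 400-token num_predict default):

```go
// ollamaClient mirrors the chainable With* style in miniature. The real
// OllamaClient's unexported fields may be named and typed differently.
type ollamaClient struct {
	keepAliveSecs *int // nil means "never set": Ollama's 5-minute default
	numPredict    int
	jsonFormat    bool
	thinking      bool
}

func newOllamaClient() *ollamaClient { return &ollamaClient{numPredict: 400} }

func (c *ollamaClient) WithKeepAlive(secs int) *ollamaClient { c.keepAliveSecs = &secs; return c }
func (c *ollamaClient) WithNumPredict(n int) *ollamaClient   { c.numPredict = n; return c }
func (c *ollamaClient) WithJSONFormat(on bool) *ollamaClient { c.jsonFormat = on; return c }
func (c *ollamaClient) WithThinking(on bool) *ollamaClient   { c.thinking = on; return c }
```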
