Documentation ¶
Overview ¶
Package llm provides the LLM client abstraction for synapses-intelligence. All LLM calls use structured JSON output to ensure deterministic, fast parsing.
Index ¶
- Constants
- func DownloadGGUF(ctx context.Context, cfg DownloadConfig) (string, error)
- func ExtractJSON(s string) string
- func GGUFExists(path string) bool
- func ListInstalledModels(ctx context.Context, baseURL string) ([]string, error)
- func ParseSILResponse(raw string) (rootSummary, insight string, concerns []string)
- func RepairJSON(s string) string
- func Truncate(s string, n int) string
- type DownloadConfig
- func (d DownloadConfig) DestPath() string
- func (d DownloadConfig) URL() string
- type HardwareConfig
- func DetectHardware() HardwareConfig
- type LLMClient
- type LocalClient
- func NewLocalClient(ggufPath string, hw HardwareConfig) (*LocalClient, error)
- func (c *LocalClient) Available(_ context.Context) bool
- func (c *LocalClient) Close()
- func (c *LocalClient) Generate(ctx context.Context, prompt string) (string, error)
- func (c *LocalClient) ModelName() string
- func (c *LocalClient) ModelPulled(_ context.Context) bool
- func (c *LocalClient) PullModel(_ context.Context, _ io.Writer) error
- func (c *LocalClient) WithThinking(enabled bool) *LocalClient
- type MockClient
- func NewMockClient(response string) *MockClient
- func NewUnavailableMockClient() *MockClient
- func (m *MockClient) Available(_ context.Context) bool
- func (m *MockClient) ModelName() string
- func (m *MockClient) ModelPulled(_ context.Context) bool
- type ModelWarmer
- type OllamaClient
- func NewOllamaClient(baseURL, model string, timeoutMS int) *OllamaClient
- func (c *OllamaClient) Available(ctx context.Context) bool
- func (c *OllamaClient) Generate(ctx context.Context, prompt string) (string, error)
- func (c *OllamaClient) ModelName() string
- func (c *OllamaClient) ModelPulled(ctx context.Context) bool
- func (c *OllamaClient) PullModel(ctx context.Context, w io.Writer) error
- func (c *OllamaClient) WarmUp(ctx context.Context) error
- func (c *OllamaClient) WithChatMode(enabled bool) *OllamaClient
- func (c *OllamaClient) WithJSONFormat(enabled bool) *OllamaClient
- func (c *OllamaClient) WithKeepAlive(secs int) *OllamaClient
- func (c *OllamaClient) WithNumPredict(n int) *OllamaClient
- func (c *OllamaClient) WithThinking(enabled bool) *OllamaClient
Constants ¶
const HFBaseURL = "https://huggingface.co"
HFBaseURL is the HuggingFace resolve endpoint for file downloads.
Variables ¶
This section is empty.
Functions ¶
func DownloadGGUF ¶
func DownloadGGUF(ctx context.Context, cfg DownloadConfig) (string, error)
DownloadGGUF downloads a GGUF model file from HuggingFace if it doesn't already exist locally. Returns the local path on success.
Progress messages are written to cfg.Progress (if non-nil) in the format:
Downloading sil-coder-Q5_K_M.gguf from huggingface.co/divish/sil-coder
500 MB / 6.5 GB (7%)
...
Download complete: /Users/you/.synapses/models/sil-coder-Q5_K_M.gguf
func ExtractJSON ¶
func ExtractJSON(s string) string
ExtractJSON strips markdown code fences and extracts the JSON object from raw LLM output. Many small models wrap JSON responses in ```json ... ``` blocks despite instructions. This function handles that gracefully so callers always get raw JSON to unmarshal.
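The documented behavior can be sketched in a few lines. This `extractJSON` is an illustration of the contract only, not the package's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// extractJSON sketches the documented contract of llm.ExtractJSON:
// strip a ```json (or bare ```) fence and return the outermost {...}
// object. Illustrative only; not the package's real implementation.
func extractJSON(s string) string {
	s = strings.TrimSpace(s)
	// Drop a leading fence, with or without the "json" language tag.
	s = strings.TrimPrefix(s, "```json")
	s = strings.TrimPrefix(s, "```")
	// Drop a trailing fence.
	s = strings.TrimSuffix(strings.TrimSpace(s), "```")
	// Keep everything from the first "{" to the last "}".
	start := strings.Index(s, "{")
	end := strings.LastIndex(s, "}")
	if start == -1 || end == -1 || end < start {
		return strings.TrimSpace(s)
	}
	return s[start : end+1]
}

func main() {
	raw := "```json\n{\"insight\": \"uses a layered design\"}\n```"
	fmt.Println(extractJSON(raw)) // fences stripped, bare JSON remains
}
```

Callers can then pass the result straight to json.Unmarshal without worrying about fence debris.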
func GGUFExists ¶
func GGUFExists(path string) bool
GGUFExists returns true if the GGUF file already exists on disk.
func ListInstalledModels ¶
func ListInstalledModels(ctx context.Context, baseURL string) ([]string, error)
ListInstalledModels returns all model names present in Ollama's local library.
func ParseSILResponse ¶
func ParseSILResponse(raw string) (rootSummary, insight string, concerns []string)
ParseSILResponse parses the labeled output format produced by the fine-tuned SIL model.
Expected output (after optional <think>...</think> block):
ROOT_SUMMARY: One sentence about the root node.
INSIGHT: One sentence about architectural role.
CONCERNS: concern1, concern2, concern3
Falls back to raw text as insight for backward compatibility with standard Ollama models that emit plain text or JSON.
Returns empty strings/nil slice for any field not found in the response.
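A minimal sketch of parsing this labeled format, assuming simple line-prefix matching. It is illustrative only; the real ParseSILResponse may differ in details:

```go
package main

import (
	"fmt"
	"strings"
)

// parseSIL sketches the labeled format documented for
// llm.ParseSILResponse; not the package's implementation.
func parseSIL(raw string) (rootSummary, insight string, concerns []string) {
	// Skip an optional <think>...</think> block.
	if end := strings.Index(raw, "</think>"); end != -1 {
		raw = raw[end+len("</think>"):]
	}
	for _, line := range strings.Split(raw, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case strings.HasPrefix(line, "ROOT_SUMMARY:"):
			rootSummary = strings.TrimSpace(strings.TrimPrefix(line, "ROOT_SUMMARY:"))
		case strings.HasPrefix(line, "INSIGHT:"):
			insight = strings.TrimSpace(strings.TrimPrefix(line, "INSIGHT:"))
		case strings.HasPrefix(line, "CONCERNS:"):
			for _, c := range strings.Split(strings.TrimPrefix(line, "CONCERNS:"), ",") {
				if c = strings.TrimSpace(c); c != "" {
					concerns = append(concerns, c)
				}
			}
		}
	}
	// Fallback: plain text becomes the insight, matching the documented
	// backward-compatibility behavior.
	if rootSummary == "" && insight == "" && concerns == nil {
		insight = strings.TrimSpace(raw)
	}
	return
}

func main() {
	root, ins, cc := parseSIL("ROOT_SUMMARY: Entry point.\nINSIGHT: Orchestrates startup.\nCONCERNS: tight coupling, no tests")
	fmt.Println(root, ins, cc)
}
```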
func RepairJSON ¶
func RepairJSON(s string) string
RepairJSON attempts to fix common JSON bracket mismatches produced by Qwen3.5 models. The most frequent issue is nested arrays-of-objects where the model writes "]]" instead of "]}]" (dropping the closing "}" of the inner object before the outer array bracket).
Only modifies the input when it fails json.Unmarshal AND the fix produces valid JSON — never corrupts already-valid output.
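The validate-then-repair guard described above can be sketched like this (an illustration of the documented invariant, not the package's exact logic):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// repairJSON sketches the documented guard: rewrite only when the input
// fails validation AND the candidate fix validates. Never touches
// already-valid JSON. Not the package's exact implementation.
func repairJSON(s string) string {
	if json.Valid([]byte(s)) {
		return s // already valid: never corrupt good output
	}
	// Most common Qwen3.5 slip: "]]" written where "]}]" was meant.
	fixed := strings.Replace(s, "]]", "]}]", 1)
	if json.Valid([]byte(fixed)) {
		return fixed
	}
	return s // fix didn't validate either: return input unchanged
}

func main() {
	bad := `{"files":[{"deps":["a","b"]]}`
	fmt.Println(repairJSON(bad)) // inner object's "}" restored
}
```

The key design point is the double check: because the rewrite is only accepted when it turns invalid input into valid JSON, valid output containing a legitimate "]]" (for example nested arrays of numbers) passes through untouched.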
func Truncate ¶
func Truncate(s string, n int) string
Types ¶
type DownloadConfig ¶
type DownloadConfig struct {
// Repo is the HuggingFace repo, e.g. "divish/sil-coder"
Repo string
// Filename is the GGUF file name within the repo, e.g. "sil-coder-Q5_K_M.gguf"
Filename string
// DestDir is the local directory to save to. Created if it doesn't exist.
DestDir string
// Progress is an optional writer for progress messages. May be nil.
Progress io.Writer
// SHA256 is the expected SHA-256 hex digest. Required: download fails if empty.
SHA256 string
}
DownloadConfig holds parameters for a GGUF download.
func (DownloadConfig) DestPath ¶
func (d DownloadConfig) DestPath() string
DestPath returns the full local path where the GGUF will be saved.
func (DownloadConfig) URL ¶
func (d DownloadConfig) URL() string
URL returns the HuggingFace download URL for this file.
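A hedged sketch of how these two methods might be built, assuming HuggingFace's standard /resolve/main/ file endpoint and a plain path join; the package's real construction may differ:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// HFBaseURL matches the package constant documented above.
const HFBaseURL = "https://huggingface.co"

// DownloadConfig reproduces the documented fields needed by these
// two methods (Progress and SHA256 omitted for brevity).
type DownloadConfig struct {
	Repo     string // e.g. "divish/sil-coder"
	Filename string // e.g. "sil-coder-Q5_K_M.gguf"
	DestDir  string // local directory to save to
}

// URL assumes the standard HuggingFace main-branch resolve path;
// the package's real method may pin a different revision.
func (d DownloadConfig) URL() string {
	return HFBaseURL + "/" + d.Repo + "/resolve/main/" + d.Filename
}

// DestPath assumes a simple DestDir/Filename join.
func (d DownloadConfig) DestPath() string {
	return filepath.Join(d.DestDir, d.Filename)
}

func main() {
	d := DownloadConfig{Repo: "divish/sil-coder", Filename: "sil-coder-Q5_K_M.gguf", DestDir: "/tmp/models"}
	fmt.Println(d.URL())
	fmt.Println(d.DestPath())
}
```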
type HardwareConfig ¶
type HardwareConfig struct {
// HasMetal is true on Apple Silicon (M1/M2/M3/M4) Macs.
// llama.cpp uses the Metal framework for GPU acceleration on these devices.
HasMetal bool
// HasCUDA is true when an NVIDIA GPU with CUDA support is detected.
HasCUDA bool
// GPULayers is the number of transformer layers to offload to the GPU.
// 0 = CPU-only. Auto-tuned based on detected VRAM.
GPULayers int
// AvailableRAMGB is the approximate amount of free system RAM in GB.
// Used as an anti-OOM guard: if too low the local backend is skipped.
AvailableRAMGB float64
}
HardwareConfig describes the host machine's LLM-relevant capabilities.
func DetectHardware ¶
func DetectHardware() HardwareConfig
DetectHardware probes the current machine and returns a HardwareConfig. It is safe to call multiple times; results are not cached (cheap probes).
type LLMClient ¶
type LLMClient interface {
// Generate sends a prompt to the LLM and returns the raw response text.
// The caller is responsible for parsing the JSON response.
Generate(ctx context.Context, prompt string) (string, error)
// Available returns true if the backend is reachable and the model is loaded.
Available(ctx context.Context) bool
// ModelName returns the configured model identifier.
ModelName() string
// ModelPulled returns true if the model is already present locally
// (no download needed).
ModelPulled(ctx context.Context) bool
// PullModel downloads the model, streaming progress to w.
PullModel(ctx context.Context, w io.Writer) error
}
LLMClient is the interface for all LLM backends. Implementations: OllamaClient (production), MockClient (tests).
type LocalClient ¶
type LocalClient struct {
// contains filtered or unexported fields
}
LocalClient runs a fine-tuned GGUF model embedded in-process via godeps/gollama. Zero network calls — everything happens in RAM.
gollama is not goroutine-safe per context instance, so concurrent Generate calls are serialised internally (see Generate). For high-throughput workloads, consider a pool of LocalClient instances, one per goroutine.
func NewLocalClient ¶
func NewLocalClient(ggufPath string, hw HardwareConfig) (*LocalClient, error)
NewLocalClient loads a GGUF model file and returns a ready LocalClient. Returns an error if the model cannot be loaded or available RAM is too low.
Usage:
cli, err := llm.NewLocalClient("/path/to/sil-9b-gguf/model.gguf", llm.DetectHardware())
func (*LocalClient) Available ¶
func (c *LocalClient) Available(_ context.Context) bool
Available returns true if the model is loaded and RAM is sufficient.
func (*LocalClient) Close ¶
func (c *LocalClient) Close()
Close releases GPU/CPU memory held by the llama.cpp model and context. Safe to call multiple times; second and subsequent calls are no-ops.
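One common way to get this call-it-twice safety in Go is sync.Once. The docs don't show LocalClient's actual mechanism, so this is only a sketch of the pattern:

```go
package main

import (
	"fmt"
	"sync"
)

// client sketches an idempotent Close via sync.Once; the real
// LocalClient may use a different mechanism.
type client struct {
	closeOnce sync.Once
	released  int // counts how many times resources were actually freed
}

// Close releases resources exactly once; later calls are no-ops.
func (c *client) Close() {
	c.closeOnce.Do(func() {
		c.released++ // stand-in for freeing model/context memory
	})
}

func main() {
	c := &client{}
	c.Close()
	c.Close() // safe: second call does nothing
	fmt.Println(c.released)
}
```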
func (*LocalClient) Generate ¶
func (c *LocalClient) Generate(ctx context.Context, prompt string) (string, error)
Generate runs inference on prompt and returns the decoded response text. Thread-safe: mu is held only for the guard reads, while inferSem serialises the actual CGo call outside the lock, so context cancellation at the semaphore does not strand the next caller.
func (*LocalClient) ModelName ¶
func (c *LocalClient) ModelName() string
ModelName returns the GGUF file name without path, used for logging.
func (*LocalClient) ModelPulled ¶
func (c *LocalClient) ModelPulled(_ context.Context) bool
ModelPulled always returns true — local GGUF files are already on disk.
func (*LocalClient) WithThinking ¶
func (c *LocalClient) WithThinking(enabled bool) *LocalClient
WithThinking enables or disables extended reasoning mode (Qwen3 <think> blocks).
type MockClient ¶
type MockClient struct {
// contains filtered or unexported fields
}
MockClient is a deterministic LLM client for tests. It returns a fixed response for every Generate call.
func NewMockClient ¶
func NewMockClient(response string) *MockClient
NewMockClient creates a MockClient that always returns the given response.
func NewUnavailableMockClient ¶
func NewUnavailableMockClient() *MockClient
NewUnavailableMockClient creates a MockClient that reports itself unavailable.
func (*MockClient) Available ¶
func (m *MockClient) Available(_ context.Context) bool
Available reports whether the mock client is configured as available.
func (*MockClient) ModelName ¶
func (m *MockClient) ModelName() string
ModelName returns the mock model name.
func (*MockClient) ModelPulled ¶
func (m *MockClient) ModelPulled(_ context.Context) bool
ModelPulled reports whether the mock model is available.
type ModelWarmer ¶
ModelWarmer can pre-load a model into memory before the first real request. OllamaClient implements this by sending an empty prompt with keep_alive=-1, which forces Ollama to load the model weights without generating any output.
type OllamaClient ¶
type OllamaClient struct {
// contains filtered or unexported fields
}
OllamaClient calls the Ollama REST API at POST /api/generate. It keeps a reusable http.Client for connection pooling.
func NewOllamaClient ¶
func NewOllamaClient(baseURL, model string, timeoutMS int) *OllamaClient
NewOllamaClient creates a client targeting the given Ollama base URL and model. timeoutMS is the per-request timeout in milliseconds (applied at HTTP client level — does not cancel the Ollama server-side inference, only the wait).
func (*OllamaClient) Available ¶
func (c *OllamaClient) Available(ctx context.Context) bool
Available checks if Ollama is reachable by calling GET /api/tags. Returns true only if the HTTP call succeeds with a 200 status.
func (*OllamaClient) Generate ¶
func (c *OllamaClient) Generate(ctx context.Context, prompt string) (string, error)
Generate sends a prompt and returns the response text. Uses stream=false for simplicity and lowest latency on small outputs. For Qwen3.x models, sets the Ollama API think boolean field (Ollama ≥0.6) to control chain-of-thought; non-Qwen3 models receive no think field, as they would ignore it. When useChat=true, dispatches to /api/chat instead of /api/generate, which is required for fine-tuned Qwen3.5 models that need chat-template formatting.
func (*OllamaClient) ModelName ¶
func (c *OllamaClient) ModelName() string
ModelName returns the configured model tag.
func (*OllamaClient) ModelPulled ¶
func (c *OllamaClient) ModelPulled(ctx context.Context) bool
ModelPulled returns true if the configured model is already present in Ollama's local model library (i.e. no pull is needed). Uses a short 3s deadline so startup is not blocked for 30s if Ollama is slow.
func (*OllamaClient) PullModel ¶
func (c *OllamaClient) PullModel(ctx context.Context, w io.Writer) error
PullModel pulls the configured model from the Ollama registry, streaming progress lines to w. Pass os.Stderr for terminal feedback. Blocks until the pull completes or ctx is cancelled.
func (*OllamaClient) WarmUp ¶
func (c *OllamaClient) WarmUp(ctx context.Context) error
WarmUp pre-loads the model into Ollama's memory by sending an empty prompt. It uses the client's configured keepAlive so the warm model respects the same RAM-residency policy as live requests: pinned tiers (keepAlive=-1) stay loaded, while JIT tiers (keepAlive=0) are pre-loaded but evicted on the first real request, so warmup never overrides the intended Optimal-mode RAM budget. Implements ModelWarmer. Called in background goroutines at brain startup.
func (*OllamaClient) WithChatMode ¶
func (c *OllamaClient) WithChatMode(enabled bool) *OllamaClient
WithChatMode switches the client from /api/generate to /api/chat. Required for fine-tuned Qwen3.5 models: they need the chat-template message structure to follow instructions correctly. Raw /api/generate prompts cause these models to echo training examples instead of responding. Returns the client to allow chaining.
func (*OllamaClient) WithJSONFormat ¶
func (c *OllamaClient) WithJSONFormat(enabled bool) *OllamaClient
WithJSONFormat enables Ollama's structured JSON output mode by setting "format":"json" in the request body. When enabled, the model is constrained to emit only valid JSON — it will not produce prose, markdown fences, or partial output. Use for tiers that parse structured responses (Orchestrator, Archivist) where base models might otherwise produce free-text. Returns the client to allow chaining.
func (*OllamaClient) WithKeepAlive ¶
func (c *OllamaClient) WithKeepAlive(secs int) *OllamaClient
WithKeepAlive sets how long Ollama keeps the model loaded after a request. Pass -1 to pin the model in RAM indefinitely (hot-tier models called frequently). Pass 0 to evict immediately after each request (one-shot cold tasks). Pass positive seconds for a custom TTL. If WithKeepAlive is never called, Ollama's five-minute default applies. Returns the client to allow chaining.
func (*OllamaClient) WithNumPredict ¶
func (c *OllamaClient) WithNumPredict(n int) *OllamaClient
WithNumPredict sets the maximum output tokens per request. Default is 400 (sufficient for insight/coordination JSON). Increase for tiers with longer structured outputs, e.g. Archivist (1024). Returns the client to allow chaining.
func (*OllamaClient) WithThinking ¶
func (c *OllamaClient) WithThinking(enabled bool) *OllamaClient
WithThinking configures extended thinking mode for Qwen3.5 models. Call on construction: llm.NewOllamaClient(...).WithThinking(true). Returns the client to allow chaining.
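All of the With* methods form a chainable builder. The sketch below mimics that pattern with an illustrative stand-in struct; the field names are assumptions, not the package's internals:

```go
package main

import "fmt"

// ollamaClient is a stand-in illustrating the chainable With* pattern
// the documented OllamaClient uses; fields are hypothetical.
type ollamaClient struct {
	keepAlive  int
	numPredict int
	thinking   bool
	jsonFormat bool
}

// Each setter mutates the receiver and returns it, enabling chaining.
func (c *ollamaClient) WithKeepAlive(secs int) *ollamaClient { c.keepAlive = secs; return c }
func (c *ollamaClient) WithNumPredict(n int) *ollamaClient   { c.numPredict = n; return c }
func (c *ollamaClient) WithThinking(on bool) *ollamaClient   { c.thinking = on; return c }
func (c *ollamaClient) WithJSONFormat(on bool) *ollamaClient { c.jsonFormat = on; return c }

func main() {
	// Mirrors a documented-style call chain such as:
	// llm.NewOllamaClient(...).WithKeepAlive(-1).WithNumPredict(1024).WithJSONFormat(true)
	c := (&ollamaClient{}).WithKeepAlive(-1).WithNumPredict(1024).WithJSONFormat(true)
	fmt.Println(c.keepAlive, c.numPredict, c.jsonFormat)
}
```

Because every setter returns the same *ollamaClient, configuration reads as a single expression at construction time, which is exactly how the docs above recommend wiring up a tier.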