llm

package
v0.8.0 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 29, 2026 License: MIT Imports: 19 Imported by: 0

Documentation

Overview

Package llm provides the LLM client abstraction for synapses-intelligence. All LLM calls use structured JSON output to ensure deterministic, fast parsing.

Index

Constants

View Source
const HFBaseURL = "https://huggingface.co"

HFBaseURL is the HuggingFace resolve endpoint for file downloads.

Variables

This section is empty.

Functions

func DownloadGGUF

func DownloadGGUF(ctx context.Context, cfg DownloadConfig) (string, error)

DownloadGGUF downloads a GGUF model file from HuggingFace if it doesn't already exist locally. Returns the local path on success.

Progress messages are written to cfg.Progress (if non-nil) in the format:

Downloading sil-coder-Q5_K_M.gguf from huggingface.co/divish/sil-coder
 500 MB / 6.5 GB (7%)
 ...
Download complete: /Users/you/.synapses/models/sil-coder-Q5_K_M.gguf

func ExtractJSON

func ExtractJSON(s string) string

ExtractJSON strips markdown code fences and extracts the JSON object from raw LLM output. Many small models wrap JSON responses in ```json ... ``` blocks despite instructions. This function handles that gracefully so callers always get raw JSON to unmarshal.

func GGUFExists

func GGUFExists(path string) bool

GGUFExists returns true if the GGUF file already exists on disk.

func ListInstalledModels

func ListInstalledModels(ctx context.Context, baseURL string) ([]string, error)

ListInstalledModels returns all model names present in Ollama's local library.

func ParseSILResponse

func ParseSILResponse(raw string) (rootSummary, insight string, concerns []string)

ParseSILResponse parses the labeled output format produced by the fine-tuned SIL model.

Expected output (after optional <think>...</think> block):

ROOT_SUMMARY: One sentence about the root node.
INSIGHT: One sentence about architectural role.
CONCERNS: concern1, concern2, concern3

Falls back to raw text as insight for backward compatibility with standard Ollama models that emit plain text or JSON.

Returns empty strings/nil slice for any field not found in the response.

func RepairJSON

func RepairJSON(s string) string

RepairJSON attempts to fix common JSON bracket mismatches produced by Qwen3.5 models. The most frequent issue is nested arrays-of-objects where the model writes "]]" instead of "]}]" (dropping the closing "}" of the inner object before the outer array bracket).

Only modifies the input when it fails json.Unmarshal AND the fix produces valid JSON — never corrupts already-valid output.

func Truncate

func Truncate(s string, n int) string

Truncate shortens s to at most n runes for use in error messages. Appends "..." when truncation occurs. Uses rune-aware slicing to avoid cutting multi-byte UTF-8 characters mid-sequence.
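The rune-aware behaviour can be sketched as:

```go
// truncate mirrors the documented behaviour: slice by runes, not bytes,
// so multi-byte UTF-8 characters are never split mid-sequence.
func truncate(s string, n int) string {
	r := []rune(s)
	if len(r) <= n {
		return s
	}
	return string(r[:n]) + "..."
}
```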

Types

type DownloadConfig

type DownloadConfig struct {
	// Repo is the HuggingFace repo, e.g. "divish/sil-coder"
	Repo string
	// Filename is the GGUF file name within the repo, e.g. "sil-coder-Q5_K_M.gguf"
	Filename string
	// DestDir is the local directory to save to. Created if it doesn't exist.
	DestDir string
	// Progress is an optional writer for progress messages. May be nil.
	Progress io.Writer
	// SHA256 is the expected SHA-256 hex digest. Required: download fails if empty.
	SHA256 string
}

DownloadConfig holds parameters for a GGUF download.

func (DownloadConfig) DestPath

func (d DownloadConfig) DestPath() string

DestPath returns the full local path where the GGUF will be saved.

func (DownloadConfig) URL

func (d DownloadConfig) URL() string

URL returns the HuggingFace download URL for this file.
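HuggingFace resolve URLs generally follow the pattern sketched below; the "main" revision is an assumption of this sketch, since URL()'s actual construction is unexported:

```go
// hfURL sketches how URL() likely combines HFBaseURL, Repo, and Filename.
// The "main" revision segment is an assumption, not confirmed by the docs.
func hfURL(repo, filename string) string {
	return "https://huggingface.co/" + repo + "/resolve/main/" + filename
}
```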

type HardwareConfig

type HardwareConfig struct {
	// HasMetal is true on Apple Silicon (M1/M2/M3/M4) Macs.
	// llama.cpp uses the Metal framework for GPU acceleration on these devices.
	HasMetal bool

	// HasCUDA is true when an NVIDIA GPU with CUDA support is detected.
	HasCUDA bool

	// GPULayers is the number of transformer layers to offload to the GPU.
	// 0 = CPU-only. Auto-tuned based on detected VRAM.
	GPULayers int

	// AvailableRAMGB is the approximate amount of free system RAM in GB.
	// Used as an anti-OOM guard: if too low the local backend is skipped.
	AvailableRAMGB float64
}

HardwareConfig describes the host machine's LLM-relevant capabilities.

func DetectHardware

func DetectHardware() HardwareConfig

DetectHardware probes the current machine and returns a HardwareConfig. It is safe to call multiple times; results are not cached (cheap probes).
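The anti-OOM guard described for AvailableRAMGB can be sketched as follows; the 1.2× headroom factor here is an assumption of this sketch, not the package's actual threshold:

```go
// HardwareConfig subset relevant to the RAM guard in this sketch.
type HardwareConfig struct {
	AvailableRAMGB float64
}

// useLocalBackend illustrates the guard: skip the local backend unless
// free RAM covers the model size plus some headroom (factor assumed).
func useLocalBackend(hw HardwareConfig, modelSizeGB float64) bool {
	return hw.AvailableRAMGB >= modelSizeGB*1.2
}
```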

type LLMClient

type LLMClient interface {
	// Generate sends a prompt to the LLM and returns the raw response text.
	// The caller is responsible for parsing the JSON response.
	Generate(ctx context.Context, prompt string) (string, error)

	// Available returns true if the backend is reachable and the model is loaded.
	Available(ctx context.Context) bool

	// ModelName returns the configured model identifier.
	ModelName() string

	// ModelPulled returns true if the model is already present locally
	// (no download needed).
	ModelPulled(ctx context.Context) bool

	// PullModel downloads the model, streaming progress to w.
	PullModel(ctx context.Context, w io.Writer) error
}

LLMClient is the interface for all LLM backends. Implementations: OllamaClient (production), MockClient (tests).

type LocalClient

type LocalClient struct {
	// contains filtered or unexported fields
}

LocalClient runs a fine-tuned GGUF model embedded in-process via godeps/gollama. Zero network calls — everything happens in RAM.

gollama is not goroutine-safe per context instance, so all Generate calls are serialised through mu. For high-throughput workloads consider a pool of LocalClient instances, one per goroutine.

func NewLocalClient

func NewLocalClient(ggufPath string, hw HardwareConfig) (*LocalClient, error)

NewLocalClient loads a GGUF model file and returns a ready LocalClient. Returns an error if the model cannot be loaded or available RAM is too low.

Usage:

cli, err := llm.NewLocalClient("/path/to/sil-9b-gguf/model.gguf", llm.DetectHardware())

func (*LocalClient) Available

func (c *LocalClient) Available(_ context.Context) bool

Available returns true if the model is loaded and RAM is sufficient.

func (*LocalClient) Close

func (c *LocalClient) Close()

Close releases GPU/CPU memory held by the llama.cpp model and context. Safe to call multiple times; second and subsequent calls are no-ops.

func (*LocalClient) Generate

func (c *LocalClient) Generate(ctx context.Context, prompt string) (string, error)

Generate runs inference on prompt and returns the decoded response text. Thread-safe: mu is held only for the guard reads; inferSem serialises the actual CGo call outside the lock so context cancellation at the semaphore does not strand the next caller.

func (*LocalClient) ModelName

func (c *LocalClient) ModelName() string

ModelName returns the GGUF file name without path, used for logging.

func (*LocalClient) ModelPulled

func (c *LocalClient) ModelPulled(_ context.Context) bool

ModelPulled always returns true — local GGUF files are already on disk.

func (*LocalClient) PullModel

func (c *LocalClient) PullModel(_ context.Context, _ io.Writer) error

PullModel is a no-op for local files (nothing to download).

func (*LocalClient) WithThinking

func (c *LocalClient) WithThinking(enabled bool) *LocalClient

WithThinking enables or disables extended reasoning mode (Qwen3 <think> blocks).

type MockClient

type MockClient struct {
	Response string
	Err      error
	// contains filtered or unexported fields
}

MockClient is a deterministic LLM client for tests. It returns a fixed response for every Generate call.

func NewMockClient

func NewMockClient(response string) *MockClient

NewMockClient creates a MockClient that always returns the given response.

func NewUnavailableMockClient

func NewUnavailableMockClient() *MockClient

NewUnavailableMockClient creates a MockClient that reports itself unavailable.

func (*MockClient) Available

func (m *MockClient) Available(_ context.Context) bool

Available reports whether the mock client is configured as available.

func (*MockClient) Generate

func (m *MockClient) Generate(_ context.Context, _ string) (string, error)

Generate returns the configured mock response.

func (*MockClient) ModelName

func (m *MockClient) ModelName() string

ModelName returns the mock model name.

func (*MockClient) ModelPulled

func (m *MockClient) ModelPulled(_ context.Context) bool

ModelPulled reports whether the mock model is available.

func (*MockClient) PullModel

func (m *MockClient) PullModel(_ context.Context, _ io.Writer) error

PullModel is a no-op for the mock client.

type ModelWarmer

type ModelWarmer interface {
	WarmUp(ctx context.Context) error
}

ModelWarmer can pre-load a model into memory before the first real request. OllamaClient implements this by sending an empty prompt with keep_alive=-1, which forces Ollama to load the model weights without generating any output.

type OllamaClient

type OllamaClient struct {
	// contains filtered or unexported fields
}

OllamaClient calls the Ollama REST API at POST /api/generate. It keeps a reusable http.Client for connection pooling.

func NewOllamaClient

func NewOllamaClient(baseURL, model string, timeoutMS int) *OllamaClient

NewOllamaClient creates a client targeting the given Ollama base URL and model. timeoutMS is the per-request timeout in milliseconds (applied at HTTP client level — does not cancel the Ollama server-side inference, only the wait).

func (*OllamaClient) Available

func (c *OllamaClient) Available(ctx context.Context) bool

Available checks if Ollama is reachable by calling GET /api/tags. Returns true only if the HTTP call succeeds with a 200 status.

func (*OllamaClient) Generate

func (c *OllamaClient) Generate(ctx context.Context, prompt string) (string, error)

Generate sends a prompt and returns the response text. Uses stream=false for simplicity and lowest latency on small outputs. For Qwen3.x models, sets the Ollama API think: bool field (≥0.6) to control chain-of-thought. Non-Qwen3 models receive no think field — they ignore it. When useChat=true, dispatches to /api/chat instead of /api/generate — required for fine-tuned Qwen3.5 models that need chat-template formatting.

func (*OllamaClient) ModelName

func (c *OllamaClient) ModelName() string

ModelName returns the configured model tag.

func (*OllamaClient) ModelPulled

func (c *OllamaClient) ModelPulled(ctx context.Context) bool

ModelPulled returns true if the configured model is already present in Ollama's local model library (i.e. no pull is needed). Uses a short 3s deadline so startup is not blocked for 30s if Ollama is slow.

func (*OllamaClient) PullModel

func (c *OllamaClient) PullModel(ctx context.Context, w io.Writer) error

PullModel pulls the configured model from the Ollama registry, streaming progress lines to w. Pass os.Stderr for terminal feedback. Blocks until the pull completes or ctx is cancelled.

func (*OllamaClient) WarmUp

func (c *OllamaClient) WarmUp(ctx context.Context) error

WarmUp pre-loads the model into Ollama's memory by sending an empty prompt. Uses the client's configured keepAlive so that the warm model respects the same RAM residency policy as live requests. Pinned tiers (keepAlive=-1) stay loaded; JIT tiers (keepAlive=0) get pre-loaded but are evicted on first real request — this avoids warmup overriding the intended Optimal-mode RAM budget. Implements ModelWarmer. Called in background goroutines at brain startup.

func (*OllamaClient) WithChatMode

func (c *OllamaClient) WithChatMode(enabled bool) *OllamaClient

WithChatMode switches the client from /api/generate to /api/chat. Required for fine-tuned Qwen3.5 models: they need the chat-template message structure to follow instructions correctly. Raw /api/generate prompts cause these models to echo training examples instead of responding. Returns the client to allow chaining.

func (*OllamaClient) WithJSONFormat

func (c *OllamaClient) WithJSONFormat(enabled bool) *OllamaClient

WithJSONFormat enables Ollama's structured JSON output mode by setting "format":"json" in the request body. When enabled, the model is constrained to emit only valid JSON — it will not produce prose, markdown fences, or partial output. Use for tiers that parse structured responses (Orchestrator, Archivist) where base models might otherwise produce free-text. Returns the client to allow chaining.

func (*OllamaClient) WithKeepAlive

func (c *OllamaClient) WithKeepAlive(secs int) *OllamaClient

WithKeepAlive sets how long Ollama keeps the model loaded after a request. Pass -1 to pin the model in RAM indefinitely (hot-tier models called frequently). Pass 0 to evict immediately after each request (one-shot cold tasks). Pass positive seconds for a custom TTL. If WithKeepAlive is never called, Ollama's 5-minute default applies. Returns the client to allow chaining.

func (*OllamaClient) WithNumPredict

func (c *OllamaClient) WithNumPredict(n int) *OllamaClient

WithNumPredict sets the maximum output tokens per request. Default is 400 (sufficient for insight/coordination JSON). Increase for tiers with longer structured outputs, e.g. Archivist (1024). Returns the client to allow chaining.

func (*OllamaClient) WithThinking

func (c *OllamaClient) WithThinking(enabled bool) *OllamaClient

WithThinking configures extended thinking mode for Qwen3.5 models. Call on construction: llm.NewOllamaClient(...).WithThinking(true). Returns the client to allow chaining.
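The With* methods form a chainable builder. A miniature standalone sketch of the pattern (field names and defaults here are assumptions, except the documented 400-token num_predict default):

```go
// ollamaClient mirrors the chainable With* style in miniature. The real
// OllamaClient's unexported fields may be named and typed differently.
type ollamaClient struct {
	keepAliveSecs *int // nil means "never set": Ollama's 5-minute default
	numPredict    int
	jsonFormat    bool
	thinking      bool
}

func newOllamaClient() *ollamaClient { return &ollamaClient{numPredict: 400} }

func (c *ollamaClient) WithKeepAlive(secs int) *ollamaClient { c.keepAliveSecs = &secs; return c }
func (c *ollamaClient) WithNumPredict(n int) *ollamaClient   { c.numPredict = n; return c }
func (c *ollamaClient) WithJSONFormat(on bool) *ollamaClient { c.jsonFormat = on; return c }
func (c *ollamaClient) WithThinking(on bool) *ollamaClient   { c.thinking = on; return c }
```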
