# go-llama

Go bindings and a unified server/CLI for llama.cpp.
Run a local LLM server with a REST API, manage GGUF models, and use the go-llama CLI for chat, completion, embeddings, and tokenization.
## Features
- Command Line Interface: Interactive chat and completion tooling
- HTTP API Server: REST endpoints for chat, completion, embeddings, and model management
- Model Management: Pull, cache, load, unload, and delete GGUF models
- Streaming: Incremental token streaming for chat and completion
- GPU Support: CUDA, Vulkan, and Metal (macOS) acceleration via llama.cpp
- Docker Support: Pre-built images for CPU, CUDA, and Vulkan targets
There is still some work to do on the chat endpoint. The following are not yet included, but will eventually be supported:
- Multi-modal support (images, audio, PDFs, etc.)
- Reasoning/thinking support
- OpenAI- or Anthropic-compatible API
- Tool calling
- Grammar-constrained output (e.g. JSON)
- Text-to-speech (audio output)
## Quick Start
Start the server with Docker:

```bash
docker volume create go-llama
docker run -d --name go-llama \
  -v go-llama:/data -p 8083:8083 \
  ghcr.io/mutablelogic/go-llama run
```
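To check that the container started and that the server is listening on port 8083, standard Docker commands can be used:

```bash
# Confirm the container is running
docker ps --filter name=go-llama

# Follow the server logs
docker logs -f go-llama
```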
Then use the CLI to interact with the server:

```bash
export GOLLAMA_ADDR="localhost:8083"

# Pull a model (Hugging Face URL or hf:// scheme)
go-llama pull hf://bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf

# List models
go-llama models

# Load a model into memory
go-llama load Llama-3.2-1B-Instruct-Q4_K_M.gguf

# Chat (interactive)
go-llama chat Llama-3.2-1B-Instruct-Q4_K_M.gguf "You are a helpful assistant"

# Completion
go-llama complete Llama-3.2-1B-Instruct-Q4_K_M.gguf "Explain KV cache in two sentences"
```
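The same loaded model can also be used for embeddings and tokenization; the commands below follow the forms listed in the CLI table further down (note that not every chat model is necessarily well suited for embeddings):

```bash
# Generate embeddings for a piece of text
go-llama embed Llama-3.2-1B-Instruct-Q4_K_M.gguf "The quick brown fox"

# Convert text to tokens, and tokens back to text
go-llama tokenize Llama-3.2-1B-Instruct-Q4_K_M.gguf "The quick brown fox"
go-llama detokenize Llama-3.2-1B-Instruct-Q4_K_M.gguf 1 2 3
```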
## Model Support
go-llama works with GGUF models supported by llama.cpp. Models can be pulled from Hugging Face using either of the following URL forms:

```
https://huggingface.co/<org>/<repo>/blob/<branch>/<file>.gguf
hf://<org>/<repo>/<file>.gguf
```

The default model cache directory is `${XDG_CACHE_HOME}/go-llama` (or the system temporary directory) and can be overridden with the `GOLLAMA_DIR` environment variable.
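For example, to cache models in a custom directory and pull the Quick Start model using either URL form (the `main` branch is assumed here for the full Hugging Face URL):

```bash
# Store downloaded models in a custom directory
export GOLLAMA_DIR="$HOME/models/gguf"

# Pull with the full Hugging Face URL...
go-llama pull https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/blob/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf

# ...or with the equivalent hf:// shorthand
go-llama pull hf://bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf
```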
## Docker Deployment
Docker containers are published for Linux AMD64 and ARM64. Variants include:

- CPU and Vulkan: `ghcr.io/mutablelogic/go-llama`
- CUDA: `ghcr.io/mutablelogic/go-llama-cuda`

Use the `run` command inside the container to start the server. For GPU usage, ensure the host has the appropriate drivers and runtime.
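As a sketch, running the CUDA variant typically looks like the Quick Start example with the GPU exposed to the container (the `--gpus all` flag assumes the NVIDIA Container Toolkit is installed on the host):

```bash
docker volume create go-llama
docker run -d --name go-llama \
  --gpus all \
  -v go-llama:/data -p 8083:8083 \
  ghcr.io/mutablelogic/go-llama-cuda run
```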
## CLI Usage Examples
Client-only commands:

| Command | Description | Example |
|---------|-------------|---------|
| `models` | List available models | `go-llama models` |
| `model` | Get model details | `go-llama model phi-4-q4_k_m.gguf` |
| `pull` | Download a model | `go-llama pull hf://org/repo/model.gguf` |
| `load` | Load a model into memory | `go-llama load phi-4-q4_k_m.gguf` |
| `unload` | Unload a model from memory | `go-llama unload phi-4-q4_k_m.gguf` |
| `delete` | Delete a model | `go-llama delete phi-4-q4_k_m.gguf` |
| `chat` | Interactive chat | `go-llama chat phi-4-q4_k_m.gguf "system"` |
| `complete` | Text completion | `go-llama complete phi-4-q4_k_m.gguf "prompt"` |
| `embed` | Generate embeddings | `go-llama embed phi-4-q4_k_m.gguf "text"` |
| `tokenize` | Convert text to tokens | `go-llama tokenize phi-4-q4_k_m.gguf "text"` |
| `detokenize` | Convert tokens to text | `go-llama detokenize phi-4-q4_k_m.gguf 1 2 3` |

Use `go-llama --help` or `go-llama <command> --help` for full options.

Server commands:
| Command | Description | Example |
|---------|-------------|---------|
| `gpuinfo` | Show GPU information | `go-llama gpuinfo` |
| `run` | Run the HTTP server | `go-llama run --http.addr localhost:8083` |
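For example, to run the server directly on the host (outside Docker) and point the client commands at it, reusing the address form from the Quick Start:

```bash
# Check which GPU backend (if any) is available
go-llama gpuinfo

# Start the HTTP server
go-llama run --http.addr localhost:8083

# In another shell, point the client at the server
export GOLLAMA_ADDR="localhost:8083"
go-llama models
```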
## Development

### Project Structure
- `cmd/` - CLI and server entrypoint
- `pkg/llamacpp/` - high-level service and HTTP handlers
  - `httpclient/` - client for the server API
  - `httphandler/` - HTTP handlers and routing
  - `schema/` - API types
- `sys/llamacpp/` - native bindings to llama.cpp
- `sys/gguf/` - GGUF parsing helpers
- `third_party/llama.cpp/` - upstream llama.cpp submodule
- `etc/` - Dockerfiles
### Building
```bash
# Build server binary
make go-llama

# Build client-only binary
make go-llama-client

# Build Docker images
make docker
```

Use `GGML_CUDA=1` or `GGML_VULKAN=1` to build GPU variants.
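For instance, a GPU build is a matter of setting the corresponding variable for the same make targets (a sketch, assuming the variables are picked up by the Makefile as described above):

```bash
# CUDA-enabled server binary
GGML_CUDA=1 make go-llama

# Vulkan-enabled Docker images
GGML_VULKAN=1 make docker
```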
## Contributing & License

Please file bug reports and feature requests as GitHub issues. Licensed under Apache 2.0.