llama

package module

README

go-llama


Go bindings and a unified server/CLI for llama.cpp.

Run a local LLM server with a REST API, manage GGUF models, and use the go-llama CLI for chat, completion, embeddings, and tokenization.

Features

  • Command Line Interface: Interactive chat and completion tooling
  • HTTP API Server: REST endpoints for chat, completion, embeddings, and model management
  • Model Management: Pull, cache, load, unload, and delete GGUF models
  • Streaming: Incremental token streaming for chat and completion
  • GPU Support: CUDA, Vulkan, and Metal (macOS) acceleration via llama.cpp
  • Docker Support: Pre-built images for CPU, CUDA, and Vulkan targets

Some work remains on the chat endpoint. The following features are not yet included but will eventually be supported:

  • Multi-modal support (images, audio, PDFs, etc.)
  • Reasoning/thinking support
  • OpenAI- or Anthropic-compatible API
  • Tool calling
  • Grammar-constrained output (JSON format)
  • Text-to-speech (audio output)

Quick Start

Start the server with Docker:

docker volume create go-llama
docker run -d --name go-llama \
  -v go-llama:/data -p 8083:8083 \
  ghcr.io/mutablelogic/go-llama run
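
To check that the container started cleanly, you can follow its logs with plain Docker (nothing here is specific to go-llama):

docker logs -f go-llama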

Then use the CLI to interact with the server:

export GOLLAMA_ADDR="localhost:8083"

# Pull a model (Hugging Face URL or hf:// scheme)
go-llama pull hf://bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf

# List models
go-llama models

# Load a model into memory
go-llama load Llama-3.2-1B-Instruct-Q4_K_M.gguf

# Chat (interactive)
go-llama chat Llama-3.2-1B-Instruct-Q4_K_M.gguf "You are a helpful assistant"
# Completion
go-llama complete Llama-3.2-1B-Instruct-Q4_K_M.gguf "Explain KV cache in two sentences"
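
The same CLI also exposes tokenization and embeddings, for example (the embed call assumes the loaded model can produce embeddings; otherwise the server reports ErrNotEmbeddingModel):

# Convert text to tokens
go-llama tokenize Llama-3.2-1B-Instruct-Q4_K_M.gguf "Hello, world"

# Generate embeddings (assumes an embedding-capable model)
go-llama embed Llama-3.2-1B-Instruct-Q4_K_M.gguf "Hello, world"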

Model Support

go-llama works with GGUF models supported by llama.cpp. Models can be pulled from Hugging Face using:

  • https://huggingface.co/<org>/<repo>/blob/<branch>/<file>.gguf
  • hf://<org>/<repo>/<file>.gguf

The default model cache directory is ${XDG_CACHE_HOME}/go-llama (or system temp) and can be overridden with GOLLAMA_DIR.
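
For example, to cache models under a custom directory and pull using the full Hugging Face URL form (the directory path below is only illustrative):

export GOLLAMA_DIR="$HOME/.cache/go-llama-models"
go-llama pull https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/blob/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf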

Docker Deployment

Docker containers are published for Linux AMD64 and ARM64. Variants include:

  • CPU and Vulkan: ghcr.io/mutablelogic/go-llama
  • CUDA: ghcr.io/mutablelogic/go-llama-cuda

Use the run command inside the container to start the server. For GPU usage, ensure the host has the appropriate drivers and runtime.
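
As a sketch, a CUDA deployment could look like the following; the --gpus flag is standard Docker and assumes the NVIDIA Container Toolkit is installed on the host:

docker volume create go-llama
docker run -d --name go-llama --gpus all \
  -v go-llama:/data -p 8083:8083 \
  ghcr.io/mutablelogic/go-llama-cuda run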

CLI Usage Examples

Client-only commands:

Command      Description                  Example
models       List available models        go-llama models
model        Get model details            go-llama model phi-4-q4_k_m.gguf
pull         Download a model             go-llama pull hf://org/repo/model.gguf
load         Load a model into memory     go-llama load phi-4-q4_k_m.gguf
unload       Unload a model from memory   go-llama unload phi-4-q4_k_m.gguf
delete       Delete a model               go-llama delete phi-4-q4_k_m.gguf
chat         Interactive chat             go-llama chat phi-4-q4_k_m.gguf "system"
complete     Text completion              go-llama complete phi-4-q4_k_m.gguf "prompt"
embed        Generate embeddings          go-llama embed phi-4-q4_k_m.gguf "text"
tokenize     Convert text to tokens       go-llama tokenize phi-4-q4_k_m.gguf "text"
detokenize   Convert tokens to text       go-llama detokenize phi-4-q4_k_m.gguf 1 2 3

Use go-llama --help or go-llama <command> --help for full options.

Server commands:

Command   Description            Example
gpuinfo   Show GPU information   go-llama gpuinfo
run       Run the HTTP server    go-llama run --http.addr localhost:8083
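
For example, to run the server on a specific address and point the client commands at it (both the flag and the environment variable are shown above):

# Start the server
go-llama run --http.addr localhost:8083

# In another shell, direct the client commands at that address
export GOLLAMA_ADDR="localhost:8083"
go-llama models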

Development

Project Structure

  • cmd contains the CLI and server entrypoint
  • pkg/llamacpp contains the high-level service and HTTP handlers
    • httpclient/ - client for the server API
    • httphandler/ - HTTP handlers and routing
    • schema/ - API types
  • sys/llamacpp contains native bindings to llama.cpp
  • sys/gguf contains GGUF parsing helpers
  • third_party/llama.cpp is the upstream llama.cpp submodule
  • etc/ contains Dockerfiles

Building

# Build server binary
make go-llama

# Build client-only binary
make go-llama-client

# Build Docker images
make docker

Use GGML_CUDA=1 or GGML_VULKAN=1 to build GPU variants.
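
For example (each assumes the corresponding SDK or toolkit is installed on the build host):

# CUDA build
GGML_CUDA=1 make go-llama

# Vulkan build
GGML_VULKAN=1 make go-llama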

Contributing & License

Please file bug reports and feature requests via GitHub issues. Licensed under Apache 2.0.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Error

type Error int

Error represents an error code

const (
	ErrSuccess Error = iota
	ErrInvalidContext
	ErrInvalidModel
	ErrInvalidArgument
	ErrIndexOutOfRange
	ErrKeyNotFound
	ErrTypeMismatch
	ErrInvalidToken
	ErrInvalidBatch
	ErrBatchFull
	ErrNoKVSlot
	ErrOpenFailed
	ErrNotFound
	ErrModelNotLoaded
	ErrNotEmbeddingModel
)

func (Error) Error

func (e Error) Error() string

func (Error) String

func (e Error) String() string

func (Error) With

func (e Error) With(msg string) error

With returns a new error wrapping this error with additional context

func (Error) Withf

func (e Error) Withf(format string, args ...any) error

Withf returns a new error wrapping this error with formatted context
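
A minimal sketch of how these error codes might be wrapped with context, assuming the module import path github.com/mutablelogic/go-llama (only the With and Withf signatures above are documented):

package main

import (
	"fmt"

	llama "github.com/mutablelogic/go-llama" // assumed import path
)

func main() {
	// Wrap an error code with additional context
	err := llama.ErrModelNotLoaded.With("phi-4-q4_k_m.gguf")
	fmt.Println(err)

	// Wrap an error code with formatted context
	err = llama.ErrNoKVSlot.Withf("batch of %d tokens", 512)
	fmt.Println(err)
}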

Directories

Path                     Synopsis
cmd
    go-llama             command
pkg
    llamacpp/httpclient  Package httpclient provides a typed Go client for consuming the go-llama REST API.
sys
    pkg-config           command
