llama

package module
v0.0.7
Published: Jan 31, 2026 License: Apache-2.0 Imports: 1 Imported by: 0

README

go-llama

Go bindings and a unified server/CLI for llama.cpp.

Run a local LLM server with a REST API, manage GGUF models, and use the gollama CLI for chat, completion, embeddings, and tokenization.

Features

  • Command Line Interface: Interactive chat and completion tooling
  • HTTP API Server: REST endpoints for chat, completion, embeddings, and model management
  • Model Management: Pull, cache, load, unload, and delete GGUF models
  • Streaming: Incremental token streaming for chat and completion
  • GPU Support: CUDA, Vulkan, and Metal (macOS) acceleration via llama.cpp
  • Docker Support: Pre-built images for CPU, CUDA, and Vulkan targets (WIP)

Quick Start

Start the server with Docker:

docker volume create gollama
docker run -d --name gollama \
  -v gollama:/data -p 8083:8083 \
  ghcr.io/mutablelogic/go-llama run

Then use the CLI to interact with the server:

export GOLLAMA_ADDR="localhost:8083"

# Pull a model (Hugging Face URL or hf:// scheme)
gollama pull https://huggingface.co/unsloth/phi-4-GGUF/blob/main/phi-4-q4_k_m.gguf

# List models
gollama models

# Load a model into memory
gollama load phi-4-q4_k_m.gguf

# Chat (interactive)
gollama chat phi-4-q4_k_m.gguf "You are a helpful assistant"

# Completion
gollama complete phi-4-q4_k_m.gguf "Explain KV cache in two sentences"
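
The same server can also be driven from Go over its REST API, or via the typed client in pkg/llamacpp/httpclient. The sketch below uses only the standard library; the /api/models path is an assumption for illustration only, so check the httphandler and schema packages for the routes and request types the server actually exposes.

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// GOLLAMA_ADDR matches the environment variable used by the CLI above.
	addr := os.Getenv("GOLLAMA_ADDR")
	if addr == "" {
		addr = "localhost:8083"
	}

	// NOTE: "/api/models" is an assumed path for illustration only; see
	// pkg/llamacpp/httphandler for the routes the server actually registers.
	resp, err := http.Get("http://" + addr + "/api/models")
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		os.Exit(1)
	}
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}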

Model Support

gollama works with GGUF models supported by llama.cpp. Models can be pulled from Hugging Face using:

  • https://huggingface.co/<org>/<repo>/blob/<branch>/<file>.gguf
  • hf://<org>/<repo>/<file>.gguf

The default model cache directory is ${XDG_CACHE_HOME}/gollama (or system temp) and can be overridden with GOLLAMA_DIR.
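
If you need to know where downloads will land from Go, the documented lookup order (GOLLAMA_DIR, then ${XDG_CACHE_HOME}/gollama, then the system temp directory) can be mirrored with a few lines of standard library code. This is a minimal sketch of the documented behaviour, not the package's own implementation.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// modelCacheDir mirrors the documented lookup order: GOLLAMA_DIR if set,
// then ${XDG_CACHE_HOME}/gollama, otherwise the system temp directory.
func modelCacheDir() string {
	if dir := os.Getenv("GOLLAMA_DIR"); dir != "" {
		return dir
	}
	if cache := os.Getenv("XDG_CACHE_HOME"); cache != "" {
		return filepath.Join(cache, "gollama")
	}
	return os.TempDir()
}

func main() {
	fmt.Println("model cache directory:", modelCacheDir())
}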

Docker Deployment

Docker containers are published for Linux AMD64 and ARM64. Variants include:

  • CPU: ghcr.io/mutablelogic/go-llama
  • CUDA: ghcr.io/mutablelogic/go-llama-cuda
  • Vulkan: ghcr.io/mutablelogic/go-llama-vulkan

Use the run command inside the container to start the server. For GPU usage, ensure the host has the appropriate drivers and runtime.

CLI Usage Examples

Command      Description                    Example
models       List available models          gollama models
model        Get model details              gollama model phi-4-q4_k_m.gguf
pull         Download a model               gollama pull hf://org/repo/model.gguf
load         Load a model into memory       gollama load phi-4-q4_k_m.gguf
unload       Unload a model from memory     gollama unload phi-4-q4_k_m.gguf
delete       Delete a model                 gollama delete phi-4-q4_k_m.gguf
chat         Interactive chat               gollama chat phi-4-q4_k_m.gguf "system"
complete     Text completion                gollama complete phi-4-q4_k_m.gguf "prompt"
embed        Generate embeddings            gollama embed phi-4-q4_k_m.gguf "text"
tokenize     Convert text to tokens         gollama tokenize phi-4-q4_k_m.gguf "text"
detokenize   Convert tokens to text         gollama detokenize phi-4-q4_k_m.gguf 1 2 3
run          Run the HTTP server            gollama run --http.addr localhost:8083

Use gollama --help or gollama <command> --help for full options.

Development

Project Structure

  • cmd contains the CLI and server entrypoint
  • pkg/llamacpp contains the high-level service and HTTP handlers
    • httpclient/ - client for the server API
    • httphandler/ - HTTP handlers and routing
    • schema/ - API types
  • sys/llamacpp contains native bindings to llama.cpp
  • sys/gguf contains GGUF parsing helpers
  • third_party/llama.cpp is the upstream llama.cpp submodule
  • etc/ contains Dockerfiles

Building

# Build server binary
make gollama

# Build client-only binary
make gollama-client

# Build Docker images
make docker

Use GGML_CUDA=1 or GGML_VULKAN=1 to build GPU variants.

Contributing & License

Please file bug reports and feature requests through GitHub issues. Licensed under Apache 2.0.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Error

type Error int

Error represents an error code

const (
	ErrSuccess Error = iota
	ErrInvalidContext
	ErrInvalidModel
	ErrInvalidArgument
	ErrIndexOutOfRange
	ErrKeyNotFound
	ErrTypeMismatch
	ErrInvalidToken
	ErrInvalidBatch
	ErrBatchFull
	ErrNoKVSlot
	ErrOpenFailed
	ErrNotFound
	ErrModelNotLoaded
	ErrNotEmbeddingModel
)

func (Error) Error

func (e Error) Error() string

func (Error) String

func (e Error) String() string

func (Error) With

func (e Error) With(msg string) error

With returns a new error wrapping this error with additional context

func (Error) Withf

func (e Error) Withf(format string, args ...any) error

Withf returns a new error wrapping this error with formatted context
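
A short usage sketch, assuming the module import path github.com/mutablelogic/go-llama. Since With and Withf are documented as wrapping the sentinel Error value, errors.Is should be able to recover it; the loadModel helper below is purely hypothetical.

package main

import (
	"errors"
	"fmt"

	llama "github.com/mutablelogic/go-llama"
)

// loadModel stands in for an operation that fails with one of the
// package's sentinel Error values, adding context with Withf.
func loadModel(name string) error {
	return llama.ErrModelNotLoaded.Withf("model %q is not loaded", name)
}

func main() {
	err := loadModel("phi-4-q4_k_m.gguf")
	fmt.Println(err)

	// With/Withf wrap the sentinel, so errors.Is should match it
	// (assumption based on the documented wrapping behaviour).
	if errors.Is(err, llama.ErrModelNotLoaded) {
		fmt.Println("load the model first: gollama load phi-4-q4_k_m.gguf")
	}
}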

Directories

Path                     Synopsis
cmd
  gollama                command
pkg
  llamacpp/httpclient    Package httpclient provides a typed Go client for consuming the go-llama REST API.
sys
  pkg-config             command
