llamacpp

package
v0.0.11 (not the latest version of its module)
Published: Feb 1, 2026 License: Apache-2.0 Imports: 15 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ExtractThinkingBlocks

func ExtractThinkingBlocks(text string) []string

ExtractThinkingBlocks returns all thinking blocks found in the text without modifying the original text.

func HasThinkingTags

func HasThinkingTags(text string) bool

HasThinkingTags checks if the text contains any thinking/reasoning tags.

func IsThinkingModelName

func IsThinkingModelName(modelInfo string) bool

IsThinkingModelName checks if a model name or architecture suggests a reasoning model.

func IsThinkingTemplate

func IsThinkingTemplate(template string) bool

IsThinkingTemplate checks if a chat template string suggests a reasoning template by looking for common thinking-related patterns.

func StripThinking

func StripThinking(text string) string

StripThinking removes all thinking/reasoning tags from the text, leaving only the final response content.
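
Taken together, these helpers cover the usual post-processing path for reasoning output. A minimal sketch; the module import path and the raw text are illustrative, and the later sketches in this section assume the same imports plus "context", "errors", "fmt", the module's schema package, and go.opentelemetry.io/otel where used.

package main

import (
	"fmt"

	// Illustrative import path; the module path is not shown on this page.
	llamacpp "github.com/mutablelogic/go-llama/pkg/llamacpp"
)

func main() {
	// Illustrative raw model output containing a thinking block.
	raw := "<think>Compare both options first.</think>Option B is better."

	if llamacpp.HasThinkingTags(raw) {
		// ExtractThinkingBlocks returns the blocks without modifying raw.
		for _, block := range llamacpp.ExtractThinkingBlocks(raw) {
			fmt.Println("thinking:", block)
		}
	}

	// StripThinking leaves only the final response content.
	fmt.Println(llamacpp.StripThinking(raw))
}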

Types

type Llama

type Llama struct {
	sync.RWMutex

	*store.Store
	// contains filtered or unexported fields
}

Llama is a singleton that manages the llama.cpp runtime and models.

func New

func New(path string, opts ...Opt) (*Llama, error)

New creates or returns the singleton Llama instance.
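
A minimal lifecycle sketch; the models path is illustrative, and GPUInfo is documented below.

func run(ctx context.Context) error {
	l, err := llamacpp.New("/var/lib/models") // path is illustrative
	if err != nil {
		return err
	}
	defer l.Close() // releases the llama.cpp runtime

	// GPUInfo (documented below) reports the backend and devices.
	if info := l.GPUInfo(ctx); info != nil {
		fmt.Printf("gpu: %+v\n", info)
	}
	return nil
}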

func (*Llama) Chat added in v0.0.5

func (l *Llama) Chat(ctx context.Context, req schema.ChatRequest, onChunk func(schema.ChatChunk) error) (result *schema.ChatResponse, err error)

Chat generates a response for the given chat messages. If onChunk is provided, it will be called for each generated token. The callback can stop generation early by returning an error.
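
A streaming sketch. The request is taken pre-built because the schema field names are not shown on this page; the ChatChunk field used for printing (Text) is an assumption.

func streamChat(ctx context.Context, l *llamacpp.Llama, req schema.ChatRequest) (*schema.ChatResponse, error) {
	return l.Chat(ctx, req, func(chunk schema.ChatChunk) error {
		fmt.Print(chunk.Text) // assumed field name; adjust to the schema
		return nil            // a non-nil error here stops generation
	})
}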

func (*Llama) Close

func (l *Llama) Close() error

Close releases all resources and cleans up the llama.cpp runtime.

func (*Llama) Complete

func (l *Llama) Complete(ctx context.Context, req schema.CompletionRequest, onChunk func(schema.CompletionChunk) error) (result *schema.CompletionResponse, err error)

Complete generates a completion for the given prompt. If onChunk is provided, it will be called for each generated token. The callback can stop generation early by returning an error.
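
Because the callback stops generation by returning an error, a caller-side token budget is straightforward. A sketch with a local sentinel error; whether Complete wraps the callback's error is not specified here, so the errors.Is check is defensive.

var errBudget = errors.New("token budget reached")

func completeAtMost(ctx context.Context, l *llamacpp.Llama, req schema.CompletionRequest, budget int) (*schema.CompletionResponse, error) {
	n := 0
	res, err := l.Complete(ctx, req, func(chunk schema.CompletionChunk) error {
		n++
		if n >= budget {
			return errBudget // stops generation early
		}
		return nil
	})
	if errors.Is(err, errBudget) {
		err = nil // early stop was requested, not a failure
	}
	// Whether res holds the partial text after an early stop is not
	// documented here.
	return res, err
}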

func (*Llama) DeleteModel

func (l *Llama) DeleteModel(ctx context.Context, name string) (err error)

DeleteModel deletes a model from disk and removes it from the cache if loaded.

func (*Llama) Detokenize

func (l *Llama) Detokenize(ctx context.Context, req schema.DetokenizeRequest) (result *schema.DetokenizeResponse, err error)

Detokenize converts tokens back to text using the specified model. Loads the model if not already cached.

func (*Llama) Embed

func (l *Llama) Embed(ctx context.Context, req schema.EmbedRequest) (result *schema.EmbedResponse, err error)

Embed generates embeddings for one or more texts. Loads the model if not already cached, creates a context with embeddings enabled, and computes embeddings for all input texts in a single batch.

func (*Llama) GPUInfo

func (l *Llama) GPUInfo(ctx context.Context) *schema.GPUInfo

GPUInfo returns information about available GPU devices and the backend.

func (*Llama) GetModel

func (l *Llama) GetModel(ctx context.Context, name string) (result *schema.CachedModel, err error)

GetModel returns a model by name as a CachedModel. If the model is loaded, returns the cached version with LoadedAt and Handle. If not loaded, returns a CachedModel with zero timestamp and nil Handle.

func (*Llama) ListModels

func (l *Llama) ListModels(ctx context.Context) (result []*schema.CachedModel, err error)

ListModels returns all models in the store as CachedModel structures. Models that are loaded have their LoadedAt timestamp and Handle set. Models that are not loaded have zero timestamp and nil Handle.
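
The zero-timestamp convention makes load state easy to report. A sketch assuming LoadedAt is a time.Time (the docs here only say "zero timestamp").

func reportModels(ctx context.Context, l *llamacpp.Llama) error {
	models, err := l.ListModels(ctx)
	if err != nil {
		return err
	}
	for _, m := range models {
		// LoadedAt is the zero time for models that are not loaded
		// (assuming a time.Time field).
		fmt.Printf("%v loaded=%v\n", m, !m.LoadedAt.IsZero())
	}
	return nil
}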

func (*Llama) LoadModel

func (l *Llama) LoadModel(ctx context.Context, req schema.LoadModelRequest) (result *schema.CachedModel, err error)

LoadModel loads a model into memory with the given parameters. Returns a CachedModel with the model handle and load timestamp. If the model is already cached, returns the existing cached model.
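
A load/unload sketch; the request is taken pre-built since the LoadModelRequest fields are not shown on this page.

func loadThenUnload(ctx context.Context, l *llamacpp.Llama, req schema.LoadModelRequest, name string) error {
	cached, err := l.LoadModel(ctx, req)
	if err != nil {
		return err
	}
	fmt.Println("loaded at:", cached.LoadedAt)

	// UnloadModel evicts the model from the cache; the returned
	// CachedModel has a zero timestamp.
	_, err = l.UnloadModel(ctx, name)
	return err
}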

func (*Llama) PullModel

func (l *Llama) PullModel(ctx context.Context, req schema.PullModelRequest, fn PullCallback) (result *schema.CachedModel, err error)

PullModel downloads a model from the given URL and returns the cached model.
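
A progress-printing sketch. Whether total_bytes can be zero for downloads of unknown size is not specified, so the guard is defensive.

func pullWithProgress(ctx context.Context, l *llamacpp.Llama, req schema.PullModelRequest) (*schema.CachedModel, error) {
	return l.PullModel(ctx, req, func(filename string, received, total uint64) error {
		if total > 0 {
			fmt.Printf("\r%s: %3d%%", filename, received*100/total)
		}
		return nil // a non-nil error would abort the download
	})
}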

func (*Llama) Tokenize

func (l *Llama) Tokenize(ctx context.Context, req schema.TokenizeRequest) (result *schema.TokenizeResponse, err error)

Tokenize converts text to tokens using the specified model. Loads the model if not already cached.

func (*Llama) UnloadModel

func (l *Llama) UnloadModel(ctx context.Context, name string) (result *schema.CachedModel, err error)

UnloadModel unloads a model from memory and removes it from the cache. Returns the model (now uncached with zero timestamp) and any error.

func (*Llama) WithContext

func (l *Llama) WithContext(ctx context.Context, req schema.ContextRequest, fn TaskFunc) (err error)

WithContext loads a model (if not already cached), creates an inference context, and calls the function with a Task containing both. The context is freed after the callback returns, but the model remains loaded. Use this for operations that need a context (e.g., completion, embeddings).

Thread-safety: The callback is responsible for acquiring the model's mutex if needed. Use task.CachedModel().Lock()/Unlock() for operations that are not thread-safe (most llama.cpp operations).
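
A sketch of the locking discipline described above; the callback body stands in for a non-thread-safe llama.cpp operation.

func withLockedContext(ctx context.Context, l *llamacpp.Llama, req schema.ContextRequest) error {
	return l.WithContext(ctx, req, func(ctx context.Context, task *llamacpp.Task) error {
		task.CachedModel().Lock()
		defer task.CachedModel().Unlock()

		// task.Context() is valid only inside this callback; it is
		// freed on return, while the model stays loaded.
		_ = task.Context()
		return nil
	})
}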

func (*Llama) WithModel

func (l *Llama) WithModel(ctx context.Context, req schema.LoadModelRequest, fn TaskFunc) (err error)

WithModel loads a model (if not already cached) and calls the function with a Task containing the model. The model remains loaded after the callback returns. Use this for operations that only need the model (e.g., tokenization, metadata access).

Thread-safety: The callback is responsible for acquiring the model's mutex if needed. Use task.CachedModel().Lock()/Unlock() for operations that are not thread-safe (most llama.cpp operations).
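
The model-only variant follows the same pattern, with Task.Context() returning nil.

func withLockedModel(ctx context.Context, l *llamacpp.Llama, req schema.LoadModelRequest) error {
	return l.WithModel(ctx, req, func(ctx context.Context, task *llamacpp.Task) error {
		task.CachedModel().Lock()
		defer task.CachedModel().Unlock()

		_ = task.Model() // tokenization or metadata access goes here
		return nil       // task.Context() is nil for WithModel tasks
	})
}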

type Opt

type Opt func(*opt) error

func WithTracer

func WithTracer(tracer trace.Tracer) Opt

WithTracer sets the OpenTelemetry tracer for distributed tracing of model loading, unloading, and other operations.
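
A construction sketch assuming the tracer comes from the global OpenTelemetry provider (go.opentelemetry.io/otel).

func newTraced(path string) (*llamacpp.Llama, error) {
	// otel.Tracer obtains a trace.Tracer from the global provider.
	return llamacpp.New(path, llamacpp.WithTracer(otel.Tracer("llamacpp")))
}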

type PullCallback

type PullCallback func(filename string, bytes_received uint64, total_bytes uint64) error

PullCallback defines the callback signature for progress updates during model downloads.

type ReasoningResult

type ReasoningResult struct {
	// Thinking contains the model's reasoning/thinking process
	// This is typically hidden from end users
	Thinking string

	// Content contains the final response after reasoning
	// This is what should be shown to users
	Content string

	// HasThinking indicates whether thinking tags were found
	HasThinking bool
}

ReasoningResult contains the parsed output from a reasoning model.

func ParseReasoning

func ParseReasoning(text string) ReasoningResult

ParseReasoning extracts thinking/reasoning blocks from model output. It supports multiple common formats:

  • <think>...</think> (DeepSeek R1)
  • <reasoning>...</reasoning>
  • <scratchpad>...</scratchpad>
  • <thought>...</thought>
  • <internal>...</internal>

Returns a ReasoningResult with separated thinking and content.
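
A sketch; the raw text is illustrative, and exact whitespace trimming of the extracted fields is not specified here.

func splitReasoning() {
	res := llamacpp.ParseReasoning("<think>Check the edge cases.</think>All cases pass.")
	if res.HasThinking {
		fmt.Println("thinking:", res.Thinking) // hidden from end users
	}
	fmt.Println("content:", res.Content) // shown to users
}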

func ParseReasoningWithTag

func ParseReasoningWithTag(text, tag string) ReasoningResult

ParseReasoningWithTag extracts content from a specific tag pattern. The tag should be the tag name without brackets (e.g., "think", not "<think>").
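
For output that uses one specific tag, pass the bare tag name, e.g. to match only <scratchpad> blocks:

func splitScratchpad(raw string) llamacpp.ReasoningResult {
	// The tag is passed without brackets, per the doc above.
	return llamacpp.ParseReasoningWithTag(raw, "scratchpad")
}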

type Task

type Task struct {
	// contains filtered or unexported fields
}

Task provides access to a loaded model and optionally a context for inference operations. Tasks are created via WithModel or WithContext and are valid only within the callback function.

func (*Task) CachedModel

func (t *Task) CachedModel() *schema.CachedModel

CachedModel returns the full CachedModel metadata.

func (*Task) Context

func (t *Task) Context() *llamacpp.Context

Context returns the underlying llamacpp Context, or nil if this task was created with WithModel (no context).

func (*Task) Model

func (t *Task) Model() *llamacpp.Model

Model returns the underlying llamacpp Model handle.

type TaskFunc

type TaskFunc func(context.Context, *Task) error

TaskFunc is a callback function that receives a context and Task.

Directories

Path	Synopsis
httpclient	Package httpclient provides a typed Go client for consuming the go-llama REST API.
