README
¶
Ollama Benchmark Tool
A Go-based command-line tool for benchmarking Ollama models with configurable parameters, warmup phases, TTFT tracking, VRAM monitoring, and benchstat/CSV output.
Features
- Benchmark multiple models in a single run
- Support for both text and image prompts
- Configurable generation parameters (temperature, max tokens, seed, etc.)
- Warmup phase before timed epochs to stabilize measurements
- Time-to-first-token (TTFT) tracking per epoch
- Model metadata display (parameter size, quantization level, family)
- VRAM and CPU memory usage tracking via running process info
- Controlled prompt token length for reproducible benchmarks
- Benchstat and CSV output formats
Building from Source
go build -o ollama-bench ./cmd/bench
./ollama-bench -model gemma3 -epochs 6 -format csv
Using Go Run (without building)
go run ./cmd/bench -model gemma3 -epochs 3
Usage
Basic Example
./ollama-bench -model gemma3 -epochs 6
Benchmark Multiple Models
./ollama-bench -model gemma3,gemma3n -epochs 6 -max-tokens 100 -p "Write me a short story" | tee gemma.bench
benchstat -col /name gemma.bench
With Image Prompt
./ollama-bench -model qwen3-vl -image photo.jpg -epochs 6 -max-tokens 100 -p "Describe this image"
Controlled Prompt Length
./ollama-bench -model gemma3 -epochs 6 -prompt-tokens 512
Advanced Example
./ollama-bench -model llama3 -epochs 10 -temperature 0.7 -max-tokens 500 -seed 42 -warmup 2 -format csv -output results.csv
Command Line Options
| Option | Description | Default |
|---|---|---|
| -model | Comma-separated list of models to benchmark | (required) |
| -epochs | Number of iterations per model | 6 |
| -max-tokens | Maximum tokens for model response | 200 |
| -temperature | Temperature parameter | 0.0 |
| -seed | Random seed | 0 (random) |
| -timeout | Timeout in seconds | 300 |
| -p | Prompt text | (default story prompt) |
| -image | Image file to include in prompt | |
| -k | Keep-alive duration in seconds | 0 |
| -format | Output format (benchstat, csv) | benchstat |
| -output | Output file for results | "" (stdout) |
| -warmup | Number of warmup requests before timing | 1 |
| -prompt-tokens | Generate prompt targeting ~N tokens (0 = use -p) | 0 |
| -v | Verbose mode | false |
| -debug | Show debug information | false |
Output Formats
Benchstat Format (default)
Compatible with Go's benchstat tool for statistical analysis. Uses one value/unit pair per line, standard ns/op for timing metrics, and ns/token for throughput. Each epoch produces one set of lines -- benchstat aggregates across repeated runs to compute statistics.
# Model: gemma3 | Params: 4.3B | Quant: Q4_K_M | Family: gemma3 | Size: 4080218931 | VRAM: 4080218931
BenchmarkModel/name=gemma3/step=prefill 1 78125.00 ns/token 12800.00 token/sec
BenchmarkModel/name=gemma3/step=generate 1 19531.25 ns/token 51200.00 token/sec
BenchmarkModel/name=gemma3/step=ttft 1 45123000 ns/op
BenchmarkModel/name=gemma3/step=load 1 1500000000 ns/op
BenchmarkModel/name=gemma3/step=total 1 2861047625 ns/op
Use with benchstat:
./ollama-bench -model gemma3 -epochs 6 > gemma3.bench
benchstat -col /step gemma3.bench
Compare two runs:
./ollama-bench -model gemma3 -epochs 6 > before.bench
# ... make changes ...
./ollama-bench -model gemma3 -epochs 6 > after.bench
benchstat before.bench after.bench
CSV Format
Machine-readable comma-separated values:
NAME,STEP,COUNT,NS_PER_COUNT,TOKEN_PER_SEC
# Model: gemma3 | Params: 4.3B | Quant: Q4_K_M | Family: gemma3 | Size: 4080218931 | VRAM: 4080218931
gemma3,prefill,128,78125.00,12800.00
gemma3,generate,512,19531.25,51200.00
gemma3,ttft,1,45123000,0
gemma3,load,1,1500000000,0
gemma3,total,1,2861047625,0
Metrics Explained
The tool reports the following metrics for each epoch:
- prefill: Time spent processing the prompt (ns/token)
- generate: Time spent generating the response (ns/token)
- ttft: Time to first token -- latency from request start to first response content
- load: Model loading time (one-time cost)
- total: Total request duration
Additionally, the model info comment line (displayed once per model before epochs) includes:
- Params: Model parameter count (e.g., 4.3B)
- Quant: Quantization level (e.g., Q4_K_M)
- Family: Model family (e.g., gemma3)
- Size: Total model memory in bytes
- VRAM: GPU memory used by the loaded model (when Size > VRAM, the difference is CPU spill)
Documentation
¶
There is no documentation for this package.
Click to show internal directories.
Click to hide internal directories.