Benchmark Your Local AI Setup: Tokens/sec, TTFT & Memory (2026)
Published on February 12, 2026 • 22 min read
Quick Start: Get a Real Tokens/sec Number in 90 Seconds
Run this on any Ollama box and you have a defensible baseline:
ollama run llama3.1:8b --verbose "Write 200 words on caching."
- Read the eval rate line — that is your generation tokens/sec.
- Read the prompt eval rate line — that is your prompt-processing speed.
That is the floor. The rest of this guide is how to turn that single observation into a reproducible benchmark you can defend in a procurement meeting or post on r/LocalLLaMA without getting torn apart.
What you will measure:
- Generation throughput (tokens/sec) under steady-state load
- Time-to-first-token (TTFT) — the latency users actually feel
- Prompt processing speed (prefill tokens/sec) for long-context workloads
- VRAM and unified-memory headroom under real prompts
- Concurrency curves — at what request rate does throughput collapse
Most "Ollama is slow" or "my GPU is faster than yours" arguments online are unfalsifiable because nobody publishes the prompt, the seed, the context length, the quantization, or the warmup state. We fix that here. If you also want to know which models are even worth benchmarking on your hardware, start with the best Ollama models guide and our hardware requirements overview before running anything heavy.
Table of Contents
- Why Most Local AI Benchmarks Are Wrong
- The Five Numbers That Actually Matter
- Setting Up a Clean Test Environment
- Benchmarking Ollama
- Benchmarking llama.cpp Directly
- Benchmarking vLLM for Concurrency
- Cross-Stack Results from Our Lab
- Common Pitfalls That Tank Numbers
- Frequently Asked Questions
Why Most Local AI Benchmarks Are Wrong {#why-wrong}
Open any benchmark thread and you will find the same four mistakes:
- Cold-start measurements. First run loads weights from disk and recompiles kernels. That number is meaningless.
- Mixed quantizations. Comparing Q4_K_M against Q8_0 is comparing different models, not different machines.
- Different prompts. A 4-token prompt and a 4,000-token prompt produce wildly different prefill rates.
- No concurrency control. Single-stream tokens/sec is not the same as multi-user throughput, and most home rigs are tested at concurrency 1.
A defensible benchmark documents the model, the quantization, the context window, the prompt, the temperature, the seed (where supported), the runner version, the hardware, and whether the run was warmed up. We will hit every one of those.
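A lightweight way to enforce that checklist is to refuse to run anything until every field has a value. Here is a minimal sketch of a run-metadata record — the RunConfig name and fields are illustrative, not from any particular harness, so adapt them to whatever you actually track:

# run_config.py — illustrative run-metadata record; adapt fields to your own harness
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class RunConfig:
    model: str            # e.g. "llama3.1:8b-instruct-q4_K_M"
    quantization: str     # e.g. "Q4_K_M"
    context_window: int   # tokens
    prompt_id: str        # "short" / "medium" / "long"
    temperature: float
    seed: int
    runner: str           # e.g. "llama.cpp a1b2c3d (CUDA)"
    hardware: str         # e.g. "RTX 4090 24GB, Ryzen 9 7950X, 64GB DDR5"
    warmed_up: bool

cfg = RunConfig(
    model="llama3.1:8b-instruct-q4_K_M",
    quantization="Q4_K_M",
    context_window=4096,
    prompt_id="medium",
    temperature=0.0,
    seed=42,
    runner="llama.cpp a1b2c3d (CUDA)",
    hardware="RTX 4090 24GB, Ryzen 9 7950X, 64GB DDR5",
    warmed_up=True,
)
print(json.dumps(asdict(cfg), indent=2))  # store this JSON next to every results table

Dump that JSON alongside every results table and half the reproducibility arguments disappear before they start.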
If you want a methodological reference point, llama.cpp's llama-bench tool is the implementation most other local benchmarks copy from.
The Five Numbers That Actually Matter {#five-numbers}
Forget MMLU. We are measuring the rig, not the model. The five numbers that matter for a local deployment:
1. Generation tokens/sec (eval rate)
Steady-state output speed once the model is generating. This is the number you put on a slide.
2. Time-to-first-token (TTFT)
Wall clock from request to first emitted token. Dominates perceived latency for short prompts and chat UX.
3. Prompt processing tokens/sec (prefill rate)
How fast the model can ingest the prompt. Critical for RAG, long-context coding agents, and document summarization.
4. Effective concurrent throughput
Tokens/sec across N concurrent streams. Single-user vs ten-user numbers can differ 4x in either direction.
5. Peak resident memory
VRAM (GPU) or RSS (CPU/Apple Silicon) at full context. Drives model selection and headroom planning.
| Metric | Symbol | What it tells you |
|---|---|---|
| Generation rate | tok/s | Throughput once running |
| TTFT | ms | Latency users feel |
| Prefill rate | tok/s | Long-context viability |
| Concurrent throughput | tok/s @ N | Capacity for multi-user |
| Peak memory | GB | Largest model you can run |
Setting Up a Clean Test Environment {#clean-env}
Before any number is trustworthy:
# 1. Pin the model and quantization explicitly
ollama pull llama3.1:8b-instruct-q4_K_M
# 2. Record the runner version
ollama --version
# llama.cpp:
./llama-cli --version
# vLLM:
python -c "import vllm; print(vllm.__version__)"
# 3. Lock GPU clocks (NVIDIA) so thermals do not skew results
sudo nvidia-smi -pm 1
sudo nvidia-smi --lock-gpu-clocks=1410,1410 # adjust per card
sudo nvidia-smi --lock-memory-clocks=10501,10501
# 4. Drop filesystem cache between runs (Linux)
sync && sudo sysctl -w vm.drop_caches=3
# 5. Disable background indexers (macOS example)
sudo mdutil -a -i off
Standard benchmark prompts
We use three fixed prompts so you can compare directly:
SHORT (about 32 tokens):
"Explain in two sentences why local LLMs reduce egress costs versus hosted APIs."
MEDIUM (about 512 tokens):
[paste a Wikipedia paragraph, then ask] "Summarize the above in five bullets."
LONG (about 4,096 tokens):
[paste a long technical doc] "Extract every numeric claim with its source sentence."
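Token counts depend on the tokenizer, so "about 512 tokens" is worth verifying rather than eyeballing. A quick sanity check with the Hugging Face tokenizer for your benchmark model works; the snippet below assumes you have access to the (gated) Llama 3.1 repo and that your prompts live in hypothetical prompt_*.txt files:

# count_tokens.py — sanity-check prompt sizes against the target buckets
from transformers import AutoTokenizer

# gated repo; any tokenizer matching your benchmark model is fine
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

for name, path in [("short", "prompt_short.txt"),
                   ("medium", "prompt_medium.txt"),
                   ("long", "prompt_long.txt")]:
    with open(path) as f:
        text = f.read()
    print(f"{name}: {len(tok.encode(text))} tokens")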
Lock temperature, top-p, and seed. The ollama run CLI does not expose sampler flags, so pass them through the API options field (as bench-ollama.sh below does) or pin them in a Modelfile:
export OLLAMA_KEEP_ALIVE=30m # keep model resident between runs
# Modelfile route: bake deterministic sampling into a dedicated benchmark tag
cat <<'EOF' > Modelfile
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER temperature 0
PARAMETER top_p 1
PARAMETER seed 42
EOF
ollama create llama3.1-bench -f Modelfile
Benchmarking Ollama {#ollama}
Step 1: Warm up
# Burn one run to warm caches and JIT
ollama run llama3.1:8b-instruct-q4_K_M --verbose "warmup" > /dev/null
Step 2: Capture verbose stats
ollama run llama3.1:8b-instruct-q4_K_M --verbose \
"Explain in two sentences why local LLMs reduce egress costs versus hosted APIs."
You should see, at the bottom:
total duration: 4.812s
load duration: 38.1ms
prompt eval count: 32 token(s)
prompt eval duration: 410ms
prompt eval rate: 78.05 tokens/s
eval count: 186 token(s)
eval duration: 4.36s
eval rate: 42.66 tokens/s
The two numbers to record are prompt eval rate (prefill) and eval rate (generation).
Step 3: Automate with the API and measure TTFT
Ollama streams newline-delimited JSON objects, one per chunk; the first chunk whose response field is non-empty marks your TTFT.
#!/usr/bin/env bash
# bench-ollama.sh
set -euo pipefail

MODEL="${1:-llama3.1:8b-instruct-q4_K_M}"
PROMPT="${2:-Write a 200 word essay on caching strategies.}"
URL="http://127.0.0.1:11434/api/generate"

start=$(date +%s%N)
first_token_ns=""
total_tokens=0

# Read from process substitution rather than a pipe so the counters survive the loop
while IFS= read -r line; do
  if [[ -z "$first_token_ns" ]] && echo "$line" | grep -q '"response":"[^"]'; then
    first_token_ns=$(date +%s%N)
    ttft_ms=$(( (first_token_ns - start) / 1000000 ))
    echo "TTFT: $ttft_ms ms"
  fi
  total_tokens=$((total_tokens + 1))
done < <(curl -sN -X POST "$URL" \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"$MODEL\",\"prompt\":\"$PROMPT\",\"stream\":true,\"options\":{\"temperature\":0,\"seed\":42}}")

end=$(date +%s%N)
elapsed_s=$(echo "scale=3; ($end - $start) / 1000000000" | bc)
echo "Wall: $elapsed_s s, chunks: $total_tokens"
Run five iterations, drop the first, average the rest.
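If you would rather skip shell parsing, the final object Ollama returns (or the whole response when stream is false) carries eval_count and eval_duration in nanoseconds, so you can compute the same rates straight from the API. A sketch of a six-run harness that drops the warmup and reports mean ± standard deviation:

# ollama_iterations.py — run 1 warmup + 5 measured iterations, report mean ± stdev
import requests, statistics

URL = "http://127.0.0.1:11434/api/generate"
MODEL = "llama3.1:8b-instruct-q4_K_M"
PROMPT = "Write a 200 word essay on caching strategies."

rates = []
for i in range(6):  # run 0 is the warmup
    resp = requests.post(URL, json={
        "model": MODEL,
        "prompt": PROMPT,
        "stream": False,
        "options": {"temperature": 0, "seed": 42},
    }, timeout=600).json()
    tok_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)  # durations are nanoseconds
    if i == 0:
        continue  # drop the warmup run
    rates.append(tok_s)

print(f"generation: {statistics.mean(rates):.1f} ± {statistics.stdev(rates):.1f} tok/s (n={len(rates)})")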
Step 4: Concurrency
# Hit Ollama with N parallel streams
seq 1 8 | xargs -P 8 -I {} ./bench-ollama.sh llama3.1:8b-instruct-q4_K_M
Compare aggregate tokens/sec against single-stream. On most consumer cards, throughput peaks somewhere between 4 and 16 concurrent streams, then collapses.
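To trace the whole curve instead of a single point, sweep concurrency levels and record aggregate generation tokens/sec at each. Here is a sketch using a thread pool against the same endpoint — blocking requests are fine because the real work happens server-side. Note that how much batching you actually get depends on Ollama's parallelism settings (OLLAMA_NUM_PARALLEL); with a single slot the server queues requests rather than batching them:

# concurrency_sweep.py — aggregate generation tok/s at increasing concurrency
import time, requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://127.0.0.1:11434/api/generate"
MODEL = "llama3.1:8b-instruct-q4_K_M"
PROMPT = "Write a 200 word essay on caching strategies."

def one_request() -> int:
    r = requests.post(URL, json={"model": MODEL, "prompt": PROMPT, "stream": False,
                                 "options": {"temperature": 0, "seed": 42}}, timeout=600)
    return r.json()["eval_count"]  # generated tokens for this stream

for n in (1, 2, 4, 8, 16):
    start = time.time()
    with ThreadPoolExecutor(max_workers=n) as pool:
        tokens = sum(pool.map(lambda _: one_request(), range(n)))
    elapsed = time.time() - start
    print(f"concurrency {n:2d}: {tokens / elapsed:7.1f} aggregate tok/s")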
Benchmarking llama.cpp Directly {#llama-cpp}
llama.cpp ships a purpose-built benchmark that is more rigorous than wrapping the CLI:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
# Standard sweep: prompt processing 512 / generation 128
./build/bin/llama-bench \
-m /models/Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf \
-p 512 -n 128 -t 8 -ngl 99 -r 5
Output:
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| Meta-Llama-3.1-8B | 4.58 GiB | 8.03B | CUDA | 99 | pp 512 | 1842.31 ± 12.40 |
| Meta-Llama-3.1-8B | 4.58 GiB | 8.03B | CUDA | 99 | tg 128 | 78.04 ± 0.18 |
pp is prefill. tg is generation. The ± is one standard deviation across -r 5 runs — publish that, not a single number.
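If you plan to aggregate many sweeps, recent llama-bench builds can also emit machine-readable output (for example -o json; check ./build/bin/llama-bench --help for the formats your build supports), which is easier to post-process than the markdown table. A sketch, assuming you saved the JSON output to bench.json — field names vary slightly across llama.cpp versions, so adjust to what your build emits:

# parse_llama_bench.py — summarize a llama-bench JSON dump (assumes -o json > bench.json)
import json

with open("bench.json") as f:
    results = json.load(f)  # typically a list of result objects, one per test

for r in results:
    # .get() keeps this robust if your build names fields differently
    print(r.get("model_filename", "?"), r.get("n_prompt"), r.get("n_gen"),
          f'{r.get("avg_ts", 0):.1f} ± {r.get("stddev_ts", 0):.1f} t/s')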
Useful sweeps
# Sweep context lengths to find where prefill collapses
./build/bin/llama-bench -m model.gguf -p 128,512,2048,8192 -n 64
# Compare quantizations on the same hardware
./build/bin/llama-bench \
-m llama-3.1-8b.Q4_K_M.gguf \
-m llama-3.1-8b.Q5_K_M.gguf \
-m llama-3.1-8b.Q8_0.gguf \
-p 512 -n 128
If you are choosing between quants, our GGUF, AWQ and GPTQ comparison walks through the quality vs throughput tradeoff in detail.
Benchmarking vLLM for Concurrency {#vllm}
Ollama and llama.cpp are excellent single-tenant runners. If you are serving multiple users, vLLM's continuous batching changes the shape of the curve completely.
pip install "vllm>=0.6.4"
# Serve
# Note: --quantization awq expects an AWQ-quantized checkpoint, so point --model
# at an AWQ export of Llama 3.1 8B Instruct (or drop the flag to serve fp16 weights)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
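Before pointing the harness at it, a quick smoke test against the OpenAI-compatible endpoint confirms the server is up and tells you the exact model id it registered. A minimal sketch using the openai client — the api_key value is a placeholder, since vLLM does not check it by default:

# vllm_smoke_test.py — confirm the server responds before running the full harness
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")  # key unchecked by default

print([m.id for m in client.models.list().data])  # the model id the server registered

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say 'ready' and nothing else."}],
    max_tokens=5,
    temperature=0,
)
print(resp.choices[0].message.content)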
vLLM ships its own benchmark harness:
git clone https://github.com/vllm-project/vllm
cd vllm/benchmarks
# ShareGPT-style realistic load
python benchmark_serving.py \
--backend openai-chat \
--base-url http://127.0.0.1:8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 500 \
--request-rate 8
The output is what you actually want for capacity planning:
Successful requests: 500
Benchmark duration (s): 62.41
Total input tokens: 141,205
Total generated tokens: 72,418
Request throughput (req/s): 8.01
Output token throughput (tok/s): 1160.46
Mean TTFT (ms): 187
P99 TTFT (ms): 412
Mean TPOT (ms): 22.4
TPOT (time per output token) is vLLM's preferred per-token latency metric — useful when comparing against batched APIs.
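TTFT, TPOT, and output length compose into the end-to-end latency a user actually experiences, which makes back-of-envelope capacity math easy. A worked example using the sample numbers above:

# latency_math.py — end-to-end latency from TTFT and TPOT (numbers from the sample run above)
mean_ttft_ms = 187
mean_tpot_ms = 22.4
output_tokens = 145  # 72,418 generated tokens / 500 requests ≈ 145 per request

# e2e ≈ TTFT + TPOT × (tokens − 1): pay prefill once, then one TPOT per subsequent token
e2e_ms = mean_ttft_ms + mean_tpot_ms * (output_tokens - 1)
print(f"mean end-to-end latency ≈ {e2e_ms / 1000:.1f} s")  # ≈ 3.4 s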
Cross-Stack Results from Our Lab {#results}
Same model (Llama 3.1 8B Instruct, Q4_K_M / AWQ-equivalent), same prompts, three runners. Pin one row to your wall.
Single stream, RTX 4090 (24GB), Ryzen 9 7950X, 64GB DDR5
| Runner | Prefill (tok/s) | Generation (tok/s) | TTFT (ms) | Peak VRAM (GB) |
|---|---|---|---|---|
| Ollama 0.3.x | 1,624 | 71.8 | 92 | 5.9 |
| llama.cpp (CUDA) | 1,842 | 78.0 | 78 | 5.6 |
| vLLM (AWQ) | 4,310 | 84.2 | 64 | 18.4 |
Eight concurrent streams, same hardware
| Runner | Aggregate gen tok/s | P50 TTFT (ms) | P99 TTFT (ms) |
|---|---|---|---|
| Ollama 0.3.x | 142 | 240 | 1,840 |
| llama.cpp server | 318 | 168 | 920 |
| vLLM | 1,160 | 187 | 412 |
MacBook Pro M3 Max (64GB), Llama 3.1 8B Q4_K_M
| Runner | Prefill (tok/s) | Generation (tok/s) | TTFT (ms) | Peak unified mem (GB) |
|---|---|---|---|---|
| Ollama (Metal) | 612 | 38.4 | 142 | 5.7 |
| llama.cpp (Metal) | 681 | 41.2 | 128 | 5.5 |
If you are weighing Apple Silicon against discrete GPUs, our Mac Studio vs PC build comparison uses the exact same prompts so the numbers are comparable.
Common Pitfalls That Tank Numbers {#pitfalls}
1. Forgetting OLLAMA_KEEP_ALIVE
Default is 5 minutes. If your benchmark loop sleeps longer than that, the model unloads and the next "run" pays the load tax. Set OLLAMA_KEEP_ALIVE=30m or longer.
2. Background processes on the GPU
A single Chrome tab decoding video can shave 10-15% off your tokens/sec. Run nvidia-smi and confirm the only process on the card is your runner.
3. Power and thermal throttling
Laptops and SFF builds hit thermal walls fast. Pin clocks, watch nvidia-smi -l 1, and reject any run where GPU temperature crosses 83°C. On Apple Silicon use sudo powermetrics --samplers thermal -i1000 and reject runs that show Throttle: yes.
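On NVIDIA you can automate the rejection rule by polling temperature while the benchmark runs and flagging any run that crosses the threshold. A sketch that wraps the bench-ollama.sh script from earlier — the query fields are standard nvidia-smi options; adjust the threshold and polling interval to your card:

# thermal_guard.py — poll GPU temperature during a run and flag throttling risk
import subprocess, time

THRESHOLD_C = 83
peak = 0

def gpu_temp() -> int:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip().splitlines()[0])

bench = subprocess.Popen(["./bench-ollama.sh"])  # the script from the Ollama section
while bench.poll() is None:
    peak = max(peak, gpu_temp())
    time.sleep(1)

print(f"peak GPU temperature: {peak}°C")
if peak >= THRESHOLD_C:
    print("REJECT: run crossed the thermal threshold; re-run after cooldown")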
4. Mismatched context windows
Running one stack at a 2,048-token context and the other with -c 8192 means you are benchmarking different memory profiles, not different runners. Pin the context explicitly: --ctx-size for llama.cpp, num_ctx in Ollama's options.
5. Ignoring P99 latency
Mean TTFT looks great until one user gets a 3-second wait. Always publish P50 and P99.
6. Single-run reporting
Run at least 5 iterations, drop the first (warmup), report mean and standard deviation. A single number is not a benchmark; it is an anecdote.
7. Not pinning the runner version
llama.cpp ships breaking performance changes monthly. Pin the commit:
git -C llama.cpp rev-parse --short HEAD
Include that hash in your results table.
Putting It All Together: A Reproducible Benchmark Report
Every benchmark you publish should answer these questions on one page:
Hardware: RTX 4090 24GB / Ryzen 9 7950X / 64GB DDR5-6000
OS: Ubuntu 22.04, kernel 6.5.0, NVIDIA driver 550.90.07
Runner: llama.cpp commit a1b2c3d (CUDA, cuBLAS)
Model: Meta-Llama-3.1-8B-Instruct
Quantization: Q4_K_M (4.58 GiB)
Context: --ctx-size 4096
Prompts: 3 fixed prompts, see Appendix
Sampling: temperature 0, top-p 1, seed 42
Iterations: 5 per condition, first dropped
Concurrency: 1, 4, 8, 16
Power state: GPU clocks locked 1410/10501 MHz, ambient 22°C
That is the format auditors, hiring managers, and procurement teams take seriously. It is also the format that survives r/LocalLLaMA scrutiny.
Frequently Asked Questions {#faq}
Q: Which single number should I report if I only have time for one?
A: Generation tokens/sec at concurrency 1 with a fixed 512-token prompt and 128-token output, averaged over 5 runs after a warmup. It is not the whole story, but it is the least lying number you can report in one figure.
Q: How many runs is enough?
A: Five iterations, dropping the first, is the floor for stable means. For latency P99 you need at least 100 requests — with fewer samples, the "99th percentile" is just your single worst observation dressed up as a statistic.
Q: Why does my Ollama benchmark show different numbers than llama-bench on the same model?
A: Ollama applies a default system prompt, default sampler settings, and may chunk differently. Match settings explicitly: same context size, same temperature, same prompt, same quantization file.
Q: Should I benchmark with batch size > 1?
A: Only if your real workload uses batched requests. For chat UIs, single stream and concurrency curves are what matter. For offline pipelines, increase batch until VRAM saturates.
Q: How do I measure VRAM properly?
A: nvidia-smi --query-gpu=memory.used --format=csv -l 1 during the run, take the peak. On Apple Silicon, sudo powermetrics --samplers gpu_power plus memory_pressure gives you the equivalent.
Q: My CPU-only run is faster than my GPU run for tiny models. Why?
A: Sub-3B models often fit in CPU cache, and PCIe transfer overhead can dominate GPU runtime. This is real. For models under ~3B parameters, benchmark both backends and pick the winner.
Q: Can I trust hosted leaderboards for my hardware decision?
A: Use them as a sanity check, never as the deciding number. Your prompt distribution, context length, and concurrency profile dictate which runner wins on your machine.
Q: How often should I re-benchmark?
A: Every minor llama.cpp / vLLM / Ollama bump, every driver update, and after every model swap. Pin the version in your report.
Conclusion
A benchmark is only useful if someone else can reproduce it. Pin the model, pin the quantization, pin the runner version, warm up, run five times, publish mean and standard deviation, include P50 and P99 latency, and disclose your hardware down to the GPU clock lock. Do that once and you will never have to argue about local AI performance with a stranger again — you will just send them your table.
If you are about to make a hardware purchase based on these numbers, pair this guide with our hardware requirements walkthrough and the budget local AI machine build before you click order.
Want printable benchmark templates and a YAML harness that runs the full Ollama / llama.cpp / vLLM matrix nightly? Subscribe to the LocalAimaster newsletter — we ship the harness to readers first.