Benchmark Your Local AI Setup: Tokens/sec, TTFT & Memory (2026)
Published on February 12, 2026 • 22 min read
Quick Start: Get a Real Tokens/sec Number in 90 Seconds
Run this on any Ollama box and you have a defensible baseline:
ollama run llama3.1:8b --verbose "Write 200 words on caching."
- Read the eval rate line — that is your generation tokens/sec.
- Read the prompt eval rate line — that is your prompt-processing speed.
That is the floor. The rest of this guide is how to turn that single observation into a reproducible benchmark you can defend in a procurement meeting or post on r/LocalLLaMA without getting torn apart.
What you will measure:
- Generation throughput (tokens/sec) under steady-state load
- Time-to-first-token (TTFT) — the latency users actually feel
- Prompt processing speed (prefill tokens/sec) for long-context workloads
- VRAM and unified-memory headroom under real prompts
- Concurrency curves — at what request rate does throughput collapse
Most "Ollama is slow" or "my GPU is faster than yours" arguments online are unfalsifiable because nobody publishes the prompt, the seed, the context length, the quantization, or the warmup state. We fix that here. If you also want to know which models are even worth benchmarking on your hardware, start with the best Ollama models guide and our hardware requirements overview before running anything heavy.
Table of Contents
- Why Most Local AI Benchmarks Are Wrong
- The Five Numbers That Actually Matter
- Setting Up a Clean Test Environment
- Benchmarking Ollama
- Benchmarking llama.cpp Directly
- Benchmarking vLLM for Concurrency
- Cross-Stack Results from Our Lab
- Common Pitfalls That Tank Numbers
- Frequently Asked Questions
Why Most Local AI Benchmarks Are Wrong {#why-wrong}
Open any benchmark thread and you will find the same four mistakes:
- Cold-start measurements. First run loads weights from disk and recompiles kernels. That number is meaningless.
- Mixed quantizations. Comparing Q4_K_M against Q8_0 is comparing different models, not different machines.
- Different prompts. A 4-token prompt and a 4,000-token prompt produce wildly different prefill rates.
- No concurrency control. Single-stream tokens/sec is not the same as multi-user throughput, and most home rigs are tested at concurrency 1.
A defensible benchmark documents the model, the quantization, the context window, the prompt, the temperature, the seed (where supported), the runner version, the hardware, and whether the run was warmed up. We will hit every one of those.
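A lightweight way to enforce that checklist is to refuse to run anything until every field has a value. Here is a minimal sketch of a run-metadata record — the RunConfig name and fields are illustrative, not from any particular harness, so adapt them to whatever you actually track:

# run_config.py — illustrative run-metadata record; adapt fields to your own harness
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class RunConfig:
    model: str            # e.g. "llama3.1:8b-instruct-q4_K_M"
    quantization: str     # e.g. "Q4_K_M"
    context_window: int   # tokens
    prompt_id: str        # "short" / "medium" / "long"
    temperature: float
    seed: int
    runner: str           # e.g. "llama.cpp a1b2c3d (CUDA)"
    hardware: str         # e.g. "RTX 4090 24GB, Ryzen 9 7950X, 64GB DDR5"
    warmed_up: bool

cfg = RunConfig(
    model="llama3.1:8b-instruct-q4_K_M",
    quantization="Q4_K_M",
    context_window=4096,
    prompt_id="medium",
    temperature=0.0,
    seed=42,
    runner="llama.cpp a1b2c3d (CUDA)",
    hardware="RTX 4090 24GB, Ryzen 9 7950X, 64GB DDR5",
    warmed_up=True,
)
print(json.dumps(asdict(cfg), indent=2))  # store this JSON next to every results table

Dump that JSON alongside every results table and half the reproducibility arguments disappear before they start.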
If you want a methodological reference point, llama.cpp's llama-bench tool is the implementation most other local benchmarks copy from.
The Five Numbers That Actually Matter {#five-numbers}
Forget MMLU. We are measuring the rig, not the model. The five numbers that matter for a local deployment:
1. Generation tokens/sec (eval rate)
Steady-state output speed once the model is generating. This is the number you put on a slide.
2. Time-to-first-token (TTFT)
Wall clock from request to first emitted token. Dominates perceived latency for short prompts and chat UX.
3. Prompt processing tokens/sec (prefill rate)
How fast the model can ingest the prompt. Critical for RAG, long-context coding agents, and document summarization.
4. Effective concurrent throughput
Tokens/sec across N concurrent streams. Single-user vs ten-user numbers can differ 4x in either direction.
5. Peak resident memory
VRAM (GPU) or RSS (CPU/Apple Silicon) at full context. Drives model selection and headroom planning.
| Metric | Symbol | What it tells you |
|---|---|---|
| Generation rate | tok/s | Throughput once running |
| TTFT | ms | Latency users feel |
| Prefill rate | tok/s | Long-context viability |
| Concurrent throughput | tok/s @ N | Capacity for multi-user |
| Peak memory | GB | Largest model you can run |
Setting Up a Clean Test Environment {#clean-env}
Before any number is trustworthy:
# 1. Pin the model and quantization explicitly
ollama pull llama3.1:8b-instruct-q4_K_M
# 2. Record the runner version
ollama --version
# llama.cpp:
./llama-cli --version
# vLLM:
python -c "import vllm; print(vllm.__version__)"
# 3. Lock GPU clocks (NVIDIA) so thermals do not skew results
sudo nvidia-smi -pm 1
sudo nvidia-smi --lock-gpu-clocks=1410,1410 # adjust per card
sudo nvidia-smi --lock-memory-clocks=10501,10501
# 4. Drop filesystem cache between runs (Linux)
sync && sudo sysctl -w vm.drop_caches=3
# 5. Disable background indexers (macOS example)
sudo mdutil -a -i off
Standard benchmark prompts
We use three fixed prompts so you can compare directly:
SHORT (about 32 tokens):
"Explain in two sentences why local LLMs reduce egress costs versus hosted APIs."
MEDIUM (about 512 tokens):
[paste a Wikipedia paragraph, then ask] "Summarize the above in five bullets."
LONG (about 4,096 tokens):
[paste a long technical doc] "Extract every numeric claim with its source sentence."
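Token counts depend on the tokenizer, so "about 512 tokens" is worth verifying rather than eyeballing. A quick sanity check with the Hugging Face tokenizer for your benchmark model works; the snippet below assumes you have access to the (gated) Llama 3.1 repo and that your prompts live in hypothetical prompt_*.txt files:

# count_tokens.py — sanity-check prompt sizes against the target buckets
from transformers import AutoTokenizer

# gated repo; any tokenizer matching your benchmark model is fine
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

for name, path in [("short", "prompt_short.txt"),
                   ("medium", "prompt_medium.txt"),
                   ("long", "prompt_long.txt")]:
    with open(path) as f:
        text = f.read()
    print(f"{name}: {len(tok.encode(text))} tokens")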
Lock temperature, top-p, and seed. The ollama run CLI does not expose sampler flags, so pass them through the API options field (as bench-ollama.sh below does) or pin them in a Modelfile:
export OLLAMA_KEEP_ALIVE=30m # keep model resident between runs
# Modelfile route: bake deterministic sampling into a dedicated benchmark tag
cat <<'EOF' > Modelfile
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER temperature 0
PARAMETER top_p 1
PARAMETER seed 42
EOF
ollama create llama3.1-bench -f Modelfile
Benchmarking Ollama {#ollama}
Step 1: Warm up
# Burn one run to warm caches and JIT
ollama run llama3.1:8b-instruct-q4_K_M --verbose "warmup" > /dev/null
Step 2: Capture verbose stats
ollama run llama3.1:8b-instruct-q4_K_M --verbose \
"Explain in two sentences why local LLMs reduce egress costs versus hosted APIs."
You should see, at the bottom:
total duration: 4.812s
load duration: 38.1ms
prompt eval count: 32 token(s)
prompt eval duration: 410ms
prompt eval rate: 78.05 tokens/s
eval count: 186 token(s)
eval duration: 4.36s
eval rate: 42.66 tokens/s
The two numbers to record are prompt eval rate (prefill) and eval rate (generation).
Step 3: Automate with the API and measure TTFT
Ollama streams newline-delimited JSON objects, one per chunk; the first chunk whose response field is non-empty marks your TTFT.
#!/usr/bin/env bash
# bench-ollama.sh
set -euo pipefail

MODEL="${1:-llama3.1:8b-instruct-q4_K_M}"
PROMPT="${2:-Write a 200 word essay on caching strategies.}"
URL="http://127.0.0.1:11434/api/generate"

start=$(date +%s%N)
first_token_ns=""
total_tokens=0

# Read from process substitution rather than a pipe so the counters survive the loop
while IFS= read -r line; do
  if [[ -z "$first_token_ns" ]] && echo "$line" | grep -q '"response":"[^"]'; then
    first_token_ns=$(date +%s%N)
    ttft_ms=$(( (first_token_ns - start) / 1000000 ))
    echo "TTFT: $ttft_ms ms"
  fi
  total_tokens=$((total_tokens + 1))
done < <(curl -sN -X POST "$URL" \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"$MODEL\",\"prompt\":\"$PROMPT\",\"stream\":true,\"options\":{\"temperature\":0,\"seed\":42}}")

end=$(date +%s%N)
elapsed_s=$(echo "scale=3; ($end - $start) / 1000000000" | bc)
echo "Wall: $elapsed_s s, chunks: $total_tokens"
Run five iterations, drop the first, average the rest.
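If you would rather skip shell parsing, the final object Ollama returns (or the whole response when stream is false) carries eval_count and eval_duration in nanoseconds, so you can compute the same rates straight from the API. A sketch of a six-run harness that drops the warmup and reports mean ± standard deviation:

# ollama_iterations.py — run 1 warmup + 5 measured iterations, report mean ± stdev
import requests, statistics

URL = "http://127.0.0.1:11434/api/generate"
MODEL = "llama3.1:8b-instruct-q4_K_M"
PROMPT = "Write a 200 word essay on caching strategies."

rates = []
for i in range(6):  # run 0 is the warmup
    resp = requests.post(URL, json={
        "model": MODEL,
        "prompt": PROMPT,
        "stream": False,
        "options": {"temperature": 0, "seed": 42},
    }, timeout=600).json()
    tok_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)  # durations are nanoseconds
    if i == 0:
        continue  # drop the warmup run
    rates.append(tok_s)

print(f"generation: {statistics.mean(rates):.1f} ± {statistics.stdev(rates):.1f} tok/s (n={len(rates)})")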
Step 4: Concurrency
# Hit Ollama with N parallel streams
seq 1 8 | xargs -P 8 -I {} ./bench-ollama.sh llama3.1:8b-instruct-q4_K_M
Compare aggregate tokens/sec against single-stream. On most consumer cards, throughput peaks somewhere between 4 and 16 concurrent streams, then collapses.
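To trace the whole curve instead of a single point, sweep concurrency levels and record aggregate generation tokens/sec at each. Here is a sketch using a thread pool against the same endpoint — blocking requests are fine because the real work happens server-side. Note that how much batching you actually get depends on Ollama's parallelism settings (OLLAMA_NUM_PARALLEL); with a single slot the server queues requests rather than batching them:

# concurrency_sweep.py — aggregate generation tok/s at increasing concurrency
import time, requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://127.0.0.1:11434/api/generate"
MODEL = "llama3.1:8b-instruct-q4_K_M"
PROMPT = "Write a 200 word essay on caching strategies."

def one_request() -> int:
    r = requests.post(URL, json={"model": MODEL, "prompt": PROMPT, "stream": False,
                                 "options": {"temperature": 0, "seed": 42}}, timeout=600)
    return r.json()["eval_count"]  # generated tokens for this stream

for n in (1, 2, 4, 8, 16):
    start = time.time()
    with ThreadPoolExecutor(max_workers=n) as pool:
        tokens = sum(pool.map(lambda _: one_request(), range(n)))
    elapsed = time.time() - start
    print(f"concurrency {n:2d}: {tokens / elapsed:7.1f} aggregate tok/s")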
Benchmarking llama.cpp Directly {#llama-cpp}
llama.cpp ships a purpose-built benchmark that is more rigorous than wrapping the CLI:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
# Standard sweep: prompt processing 512 / generation 128
./build/bin/llama-bench \
-m /models/Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf \
-p 512 -n 128 -t 8 -ngl 99 -r 5
Output:
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| Meta-Llama-3.1-8B | 4.58 GiB | 8.03B | CUDA | 99 | pp 512 | 1842.31 ± 12.40 |
| Meta-Llama-3.1-8B | 4.58 GiB | 8.03B | CUDA | 99 | tg 128 | 78.04 ± 0.18 |
pp is prefill. tg is generation. The ± is one standard deviation across -r 5 runs — publish that, not a single number.
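If you plan to aggregate many sweeps, recent llama-bench builds can also emit machine-readable output (for example -o json; check ./build/bin/llama-bench --help for the formats your build supports), which is easier to post-process than the markdown table. A sketch, assuming you saved the JSON output to bench.json — field names vary slightly across llama.cpp versions, so adjust to what your build emits:

# parse_llama_bench.py — summarize a llama-bench JSON dump (assumes -o json > bench.json)
import json

with open("bench.json") as f:
    results = json.load(f)  # typically a list of result objects, one per test

for r in results:
    # .get() keeps this robust if your build names fields differently
    print(r.get("model_filename", "?"), r.get("n_prompt"), r.get("n_gen"),
          f'{r.get("avg_ts", 0):.1f} ± {r.get("stddev_ts", 0):.1f} t/s')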
Useful sweeps
# Sweep context lengths to find where prefill collapses
./build/bin/llama-bench -m model.gguf -p 128,512,2048,8192 -n 64
# Compare quantizations on the same hardware
./build/bin/llama-bench \
-m llama-3.1-8b.Q4_K_M.gguf \
-m llama-3.1-8b.Q5_K_M.gguf \
-m llama-3.1-8b.Q8_0.gguf \
-p 512 -n 128
If you are choosing between quants, our GGUF, AWQ and GPTQ comparison walks through the quality vs throughput tradeoff in detail.
Benchmarking vLLM for Concurrency {#vllm}
Ollama and llama.cpp are excellent single-tenant runners. If you are serving multiple users, vLLM's continuous batching changes the shape of the curve completely.
pip install "vllm>=0.6.4"
# Serve
# Note: --quantization awq expects an AWQ-quantized checkpoint, so point --model
# at an AWQ export of Llama 3.1 8B Instruct (or drop the flag to serve fp16 weights)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
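Before pointing the harness at it, a quick smoke test against the OpenAI-compatible endpoint confirms the server is up and tells you the exact model id it registered. A minimal sketch using the openai client — the api_key value is a placeholder, since vLLM does not check it by default:

# vllm_smoke_test.py — confirm the server responds before running the full harness
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")  # key unchecked by default

print([m.id for m in client.models.list().data])  # the model id the server registered

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say 'ready' and nothing else."}],
    max_tokens=5,
    temperature=0,
)
print(resp.choices[0].message.content)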
vLLM ships its own benchmark harness:
git clone https://github.com/vllm-project/vllm
cd vllm/benchmarks
# ShareGPT-style realistic load
python benchmark_serving.py \
--backend openai-chat \
--base-url http://127.0.0.1:8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 500 \
--request-rate 8
The output is what you actually want for capacity planning:
Successful requests: 500
Benchmark duration (s): 62.41
Total input tokens: 141,205
Total generated tokens: 72,418
Request throughput (req/s): 8.01
Output token throughput (tok/s): 1160.46
Mean TTFT (ms): 187
P99 TTFT (ms): 412
Mean TPOT (ms): 22.4
TPOT (time per output token) is vLLM's preferred per-token latency metric — useful when comparing against batched APIs.
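TTFT, TPOT, and output length compose into the end-to-end latency a user actually experiences, which makes back-of-envelope capacity math easy. A worked example using the sample numbers above:

# latency_math.py — end-to-end latency from TTFT and TPOT (numbers from the sample run above)
mean_ttft_ms = 187
mean_tpot_ms = 22.4
output_tokens = 145  # 72,418 generated tokens / 500 requests ≈ 145 per request

# e2e ≈ TTFT + TPOT × (tokens − 1): pay prefill once, then one TPOT per subsequent token
e2e_ms = mean_ttft_ms + mean_tpot_ms * (output_tokens - 1)
print(f"mean end-to-end latency ≈ {e2e_ms / 1000:.1f} s")  # ≈ 3.4 s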
Cross-Stack Results from Our Lab {#results}
Same model (Llama 3.1 8B Instruct, Q4_K_M / AWQ-equivalent), same prompts, three runners. Pin one row to your wall.
Single stream, RTX 4090 (24GB), Ryzen 9 7950X, 64GB DDR5
| Runner | Prefill (tok/s) | Generation (tok/s) | TTFT (ms) | Peak VRAM (GB) |
|---|---|---|---|---|
| Ollama 0.3.x | 1,624 | 71.8 | 92 | 5.9 |
| llama.cpp (CUDA) | 1,842 | 78.0 | 78 | 5.6 |
| vLLM (AWQ) | 4,310 | 84.2 | 64 | 18.4 |
Eight concurrent streams, same hardware
| Runner | Aggregate gen tok/s | P50 TTFT (ms) | P99 TTFT (ms) |
|---|---|---|---|
| Ollama 0.3.x | 142 | 240 | 1,840 |
| llama.cpp server | 318 | 168 | 920 |
| vLLM | 1,160 | 187 | 412 |
MacBook Pro M3 Max (64GB), Llama 3.1 8B Q4_K_M
| Runner | Prefill (tok/s) | Generation (tok/s) | TTFT (ms) | Peak unified mem (GB) |
|---|---|---|---|---|
| Ollama (Metal) | 612 | 38.4 | 142 | 5.7 |
| llama.cpp (Metal) | 681 | 41.2 | 128 | 5.5 |
If you are weighing Apple Silicon against discrete GPUs, our Mac Studio vs PC build comparison uses the exact same prompts so the numbers are comparable.
Common Pitfalls That Tank Numbers {#pitfalls}
1. Forgetting OLLAMA_KEEP_ALIVE
Default is 5 minutes. If your benchmark loop sleeps longer than that, the model unloads and the next "run" pays the load tax. Set OLLAMA_KEEP_ALIVE=30m or longer.
2. Background processes on the GPU
A single Chrome tab decoding video can shave 10-15% off your tokens/sec. Run nvidia-smi and confirm the only process on the card is your runner.
3. Power and thermal throttling
Laptops and SFF builds hit thermal walls fast. Pin clocks, watch nvidia-smi -l 1, and reject any run where GPU temperature crosses 83°C. On Apple Silicon use sudo powermetrics --samplers thermal -i1000 and reject runs that show Throttle: yes.
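On NVIDIA you can automate the rejection rule by polling temperature while the benchmark runs and flagging any run that crosses the threshold. A sketch that wraps the bench-ollama.sh script from earlier — the query fields are standard nvidia-smi options; adjust the threshold and polling interval to your card:

# thermal_guard.py — poll GPU temperature during a run and flag throttling risk
import subprocess, time

THRESHOLD_C = 83
peak = 0

def gpu_temp() -> int:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip().splitlines()[0])

bench = subprocess.Popen(["./bench-ollama.sh"])  # the script from the Ollama section
while bench.poll() is None:
    peak = max(peak, gpu_temp())
    time.sleep(1)

print(f"peak GPU temperature: {peak}°C")
if peak >= THRESHOLD_C:
    print("REJECT: run crossed the thermal threshold; re-run after cooldown")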
4. Mismatched context windows
Running one stack at a 2,048-token context and the other with -c 8192 means you are benchmarking different memory profiles, not different runners. Pin the context explicitly: --ctx-size for llama.cpp, num_ctx in Ollama's options.
5. Ignoring P99 latency
Mean TTFT looks great until one user gets a 3-second wait. Always publish P50 and P99.
6. Single-run reporting
Run at least 5 iterations, drop the first (warmup), report mean and standard deviation. A single number is not a benchmark; it is an anecdote.
7. Not pinning the runner version
llama.cpp ships breaking performance changes monthly. Pin the commit:
git -C llama.cpp rev-parse --short HEAD
Include that hash in your results table.
Putting It All Together: A Reproducible Benchmark Report
Every benchmark you publish should answer these questions on one page:
Hardware: RTX 4090 24GB / Ryzen 9 7950X / 64GB DDR5-6000
OS: Ubuntu 22.04, kernel 6.5.0, NVIDIA driver 550.90.07
Runner: llama.cpp commit a1b2c3d (CUDA, cuBLAS)
Model: Meta-Llama-3.1-8B-Instruct
Quantization: Q4_K_M (4.58 GiB)
Context: --ctx-size 4096
Prompts: 3 fixed prompts, see Appendix
Sampling: temperature 0, top-p 1, seed 42
Iterations: 5 per condition, first dropped
Concurrency: 1, 4, 8, 16
Power state: GPU clocks locked 1410/10501 MHz, ambient 22°C
That is the format auditors, hiring managers, and procurement teams take seriously. It is also the format that survives r/LocalLLaMA scrutiny.
Frequently Asked Questions {#faq}
Q: Which single number should I report if I only have time for one?
A: Generation tokens/sec at concurrency 1 with a fixed 512-token prompt and 128-token output, averaged over 5 runs after a warmup. It is not the whole story, but it is the least lying number you can report in one figure.
Q: How many runs is enough?
A: Five iterations, dropping the first, is the floor for stable means. For latency P99 you need at least 100 requests — with fewer samples, the "99th percentile" is just your single worst observation dressed up as a statistic.
Q: Why does my Ollama benchmark show different numbers than llama-bench on the same model?
A: Ollama applies a default system prompt, default sampler settings, and may chunk differently. Match settings explicitly: same context size, same temperature, same prompt, same quantization file.
Q: Should I benchmark with batch size > 1?
A: Only if your real workload uses batched requests. For chat UIs, single stream and concurrency curves are what matter. For offline pipelines, increase batch until VRAM saturates.
Q: How do I measure VRAM properly?
A: nvidia-smi --query-gpu=memory.used --format=csv -l 1 during the run, take the peak. On Apple Silicon, sudo powermetrics --samplers gpu_power plus memory_pressure gives you the equivalent.
Q: My CPU-only run is faster than my GPU run for tiny models. Why?
A: Sub-3B models often fit in CPU cache, and PCIe transfer overhead can dominate GPU runtime. This is real. For models under ~3B parameters, benchmark both backends and pick the winner.
Q: Can I trust hosted leaderboards for my hardware decision?
A: Use them as a sanity check, never as the deciding number. Your prompt distribution, context length, and concurrency profile dictate which runner wins on your machine.
Q: How often should I re-benchmark?
A: Every minor llama.cpp / vLLM / Ollama bump, every driver update, and after every model swap. Pin the version in your report.
Conclusion
A benchmark is only useful if someone else can reproduce it. Pin the model, pin the quantization, pin the runner version, warm up, run five times, publish mean and standard deviation, include P50 and P99 latency, and disclose your hardware down to the GPU clock lock. Do that once and you will never have to argue about local AI performance with a stranger again — you will just send them your table.
If you are about to make a hardware purchase based on these numbers, pair this guide with our hardware requirements walkthrough and the budget local AI machine build before you click order.
Want printable benchmark templates and a YAML harness that runs the full Ollama / llama.cpp / vLLM matrix nightly? Subscribe to the LocalAimaster newsletter — we ship the harness to readers first.