
Benchmark Your Local AI Setup: Tokens/sec, TTFT & Memory (2026)

February 12, 2026
22 min read
LocalAimaster Research Team


Quick Start: Get a Real Tokens/sec Number in 90 Seconds

Run this on any Ollama box and you have a defensible baseline:

  1. ollama run llama3.1:8b --verbose "Write 200 words on caching."
  2. Read the eval rate line — that is your generation tokens/sec.
  3. Read prompt eval rate — that is your prompt-processing speed.

That is the floor. The rest of this guide is how to turn that single observation into a reproducible benchmark you can defend in a procurement meeting or post on r/LocalLLaMA without getting torn apart.


What you will measure:

  • Generation throughput (tokens/sec) under steady-state load
  • Time-to-first-token (TTFT) — the latency users actually feel
  • Prompt processing speed (prefill tokens/sec) for long-context workloads
  • VRAM and unified-memory headroom under real prompts
  • Concurrency curves — at what request rate does throughput collapse

Most "Ollama is slow" or "my GPU is faster than yours" arguments online are unfalsifiable because nobody publishes the prompt, the seed, the context length, the quantization, or the warmup state. We fix that here. If you also want to know which models are even worth benchmarking on your hardware, start with the best Ollama models guide and our hardware requirements overview before running anything heavy.

Table of Contents

  1. Why Most Local AI Benchmarks Are Wrong
  2. The Five Numbers That Actually Matter
  3. Setting Up a Clean Test Environment
  4. Benchmarking Ollama
  5. Benchmarking llama.cpp Directly
  6. Benchmarking vLLM for Concurrency
  7. Cross-Stack Results from Our Lab
  8. Common Pitfalls That Tank Numbers
  9. Frequently Asked Questions

Why Most Local AI Benchmarks Are Wrong {#why-wrong}

Open any benchmark thread and you will find the same four mistakes:

  1. Cold-start measurements. First run loads weights from disk and recompiles kernels. That number is meaningless.
  2. Mixed quantizations. Comparing Q4_K_M against Q8_0 is comparing different models, not different machines.
  3. Different prompts. A 4-token prompt and a 4,000-token prompt produce wildly different prefill rates.
  4. No concurrency control. Single-stream tokens/sec is not the same as multi-user throughput, and most home rigs are tested at concurrency 1.

A defensible benchmark documents the model, the quantization, the context window, the prompt, the temperature, the seed (where supported), the runner version, the hardware, and whether the run was warmed up. We will hit every one of those.
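That checklist is easy to capture as a machine-readable record you attach to every result. A minimal sketch in Python; the field names are our own convention for illustration, not a standard schema:

```python
import json

# Minimal benchmark-metadata record. Field names are illustrative,
# not a standard schema; extend as your setup requires.
record = {
    "model": "llama3.1:8b-instruct",
    "quantization": "Q4_K_M",
    "context_window": 4096,
    "prompt": "SHORT",          # which of your fixed prompts
    "temperature": 0,
    "seed": 42,
    "runner": "llama.cpp",
    "runner_version": "a1b2c3d",
    "hardware": "RTX 4090 24GB / Ryzen 9 7950X / 64GB DDR5",
    "warmed_up": True,
}
print(json.dumps(record, indent=2))
```

Publish the record next to the numbers and anyone can rerun your exact conditions.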

For a deeper dig into evaluation methodology, llama.cpp's llama-bench tool is the reference implementation most other benchmarks copy from.


The Five Numbers That Actually Matter {#five-numbers}

Forget MMLU. We are measuring the rig, not the model. The five numbers that matter for a local deployment:

1. Generation tokens/sec (eval rate)

Steady-state output speed once the model is generating. This is the number you put on a slide.

2. Time-to-first-token (TTFT)

Wall clock from request to first emitted token. Dominates perceived latency for short prompts and chat UX.

3. Prompt processing tokens/sec (prefill rate)

How fast the model can ingest the prompt. Critical for RAG, long-context coding agents, and document summarization.

4. Effective concurrent throughput

Tokens/sec across N concurrent streams. Single-user vs ten-user numbers can differ 4x in either direction.

5. Peak resident memory

VRAM (GPU) or RSS (CPU/Apple Silicon) at full context. Drives model selection and headroom planning.

| Metric | Symbol | What it tells you |
| --- | --- | --- |
| Generation rate | tok/s | Throughput once running |
| TTFT | ms | Latency users feel |
| Prefill rate | tok/s | Long-context viability |
| Concurrent throughput | tok/s @ N | Capacity for multi-user |
| Peak memory | GB | Largest model you can run |
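Given raw timestamps from a single request, every headline number except memory falls out of a few subtractions. A sketch with invented sample values:

```python
# Deriving the headline metrics from raw request timestamps.
# All timing values below are invented sample data.
prompt_tokens = 512
output_tokens = 128
t_request = 0.000       # request sent (seconds)
t_first_token = 0.310   # first token received
t_done = 3.510          # last token received

prefill_rate = prompt_tokens / (t_first_token - t_request)    # tok/s
ttft_ms = (t_first_token - t_request) * 1000                  # ms
# The first token is produced by prefill, so count the remaining ones.
gen_rate = (output_tokens - 1) / (t_done - t_first_token)     # tok/s

print(f"prefill: {prefill_rate:.0f} tok/s")
print(f"TTFT: {ttft_ms:.0f} ms")
print(f"generation: {gen_rate:.1f} tok/s")
```

With these sample values the script reports a 1652 tok/s prefill, 310 ms TTFT, and 39.7 tok/s generation.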

Setting Up a Clean Test Environment {#clean-env}

Before any number is trustworthy:

# 1. Pin the model and quantization explicitly
ollama pull llama3.1:8b-instruct-q4_K_M

# 2. Record the runner version
ollama --version
# llama.cpp:
./llama-cli --version
# vLLM:
python -c "import vllm; print(vllm.__version__)"

# 3. Lock GPU clocks (NVIDIA) so thermals do not skew results
sudo nvidia-smi -pm 1
sudo nvidia-smi --lock-gpu-clocks=1410,1410   # adjust per card
sudo nvidia-smi --lock-memory-clocks=10501,10501

# 4. Drop filesystem cache between runs (Linux)
sync && sudo sysctl -w vm.drop_caches=3

# 5. Disable background indexers (macOS example)
sudo mdutil -a -i off

Standard benchmark prompts

We use three fixed prompts so you can compare directly:

SHORT (about 32 tokens):
"Explain in two sentences why local LLMs reduce egress costs versus hosted APIs."

MEDIUM (about 512 tokens):
[paste a Wikipedia paragraph, then ask] "Summarize the above in five bullets."

LONG (about 4,096 tokens):
[paste a long technical doc] "Extract every numeric claim with its source sentence."

Lock temperature, top-p, and seed. Note that ollama run does not expose sampler flags; set them per request through the API options field (or with /set parameter in the interactive REPL):

export OLLAMA_KEEP_ALIVE=30m   # keep model resident between runs

curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-q4_K_M",
  "prompt": "Explain in two sentences why local LLMs reduce egress costs versus hosted APIs.",
  "options": { "temperature": 0, "top_p": 1, "seed": 42 }
}'

Benchmarking Ollama {#ollama}

Step 1: Warm up

# Burn one run to warm caches and JIT
ollama run llama3.1:8b-instruct-q4_K_M --verbose "warmup" > /dev/null

Step 2: Capture verbose stats

ollama run llama3.1:8b-instruct-q4_K_M --verbose \
  "Explain in two sentences why local LLMs reduce egress costs versus hosted APIs."

You should see, at the bottom:

total duration:       4.812s
load duration:        38.1ms
prompt eval count:    32 token(s)
prompt eval duration: 410ms
prompt eval rate:     78.05 tokens/s
eval count:           186 token(s)
eval duration:        4.36s
eval rate:            42.66 tokens/s

The two numbers to record are prompt eval rate (prefill) and eval rate (generation).

Step 3: Automate with the API and measure TTFT

Ollama streams newline-delimited JSON objects, one per generated chunk. The first object with a non-empty response field marks your TTFT.

# bench-ollama.sh
#!/usr/bin/env bash
set -euo pipefail

MODEL="${1:-llama3.1:8b-instruct-q4_K_M}"
PROMPT="${2:-Write a 200 word essay on caching strategies.}"
URL="http://127.0.0.1:11434/api/generate"

start=$(date +%s%N)
first_token_ns=""
total_tokens=0

# Read via process substitution rather than a pipe: piping curl into
# while would run the loop body in a subshell, losing the counters.
while IFS= read -r line; do
  if [[ -z "$first_token_ns" ]] && grep -q '"response":"[^"]' <<<"$line"; then
    first_token_ns=$(date +%s%N)
    echo "TTFT: $(( (first_token_ns - start) / 1000000 )) ms"
  fi
  total_tokens=$((total_tokens + 1))
done < <(curl -sN -X POST "$URL" \
  -H 'Content-Type: application/json' \
  -d "{\"model\":\"$MODEL\",\"prompt\":\"$PROMPT\",\"stream\":true,\"options\":{\"temperature\":0,\"seed\":42}}")

end=$(date +%s%N)
elapsed_s=$(echo "scale=3; ($end - $start) / 1000000000" | bc)
echo "Wall: $elapsed_s s, chunks: $total_tokens"

Run five iterations, drop the first, average the rest.
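The drop-and-average step is worth scripting so it is done the same way every time. A small sketch using Python's statistics module, with invented sample rates:

```python
import statistics

# Five measured eval rates (tok/s); the first run is the warmup.
runs = [31.2, 42.1, 42.8, 41.9, 42.4]
kept = runs[1:]  # drop the cold first iteration

mean = statistics.mean(kept)
stdev = statistics.stdev(kept)
print(f"{mean:.2f} ± {stdev:.2f} tok/s over {len(kept)} runs")
# → 42.30 ± 0.39 tok/s over 4 runs
```

Note how the warmup run (31.2) would have dragged the mean down by more than two tokens/sec had it been included.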

Step 4: Concurrency

# Hit Ollama with N parallel streams
seq 1 8 | xargs -P 8 -I {} ./bench-ollama.sh llama3.1:8b-instruct-q4_K_M

Compare aggregate tokens/sec against single-stream. On most consumer cards, throughput peaks somewhere between 4 and 16 concurrent streams, then collapses.
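To find that knee, multiply the per-stream rate by the stream count at each level and look for the maximum. A sketch with illustrative numbers (yours will differ):

```python
# Per-stream generation rate (tok/s) measured at each concurrency
# level -- illustrative numbers, not a real sweep.
sweep = {1: 72.0, 2: 55.0, 4: 38.0, 8: 21.0, 16: 9.5}

aggregate = {n: n * per_stream for n, per_stream in sweep.items()}
for n, total in sorted(aggregate.items()):
    print(f"concurrency {n:>2}: {total:7.1f} tok/s aggregate")

best = max(aggregate, key=aggregate.get)
print(f"throughput peaks at concurrency {best}")
# peaks at 8 here: 168 tok/s aggregate, falling back to 152 at 16
```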


Benchmarking llama.cpp Directly {#llama-cpp}

llama.cpp ships a purpose-built benchmark that is more rigorous than wrapping the CLI:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j

# Standard sweep: prompt processing 512 / generation 128
./build/bin/llama-bench \
  -m /models/Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf \
  -p 512 -n 128 -t 8 -ngl 99 -r 5

Output:

| model             |     size | params | backend | ngl |   test |             t/s |
| ----------------- | -------: | -----: | ------- | --: | -----: | --------------: |
| Meta-Llama-3.1-8B | 4.58 GiB |  8.03B | CUDA    |  99 | pp 512 | 1842.31 ± 12.40 |
| Meta-Llama-3.1-8B | 4.58 GiB |  8.03B | CUDA    |  99 | tg 128 |   78.04 ±  0.18 |

pp is prefill. tg is generation. The ± is one standard deviation across -r 5 runs — publish that, not a single number.
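If you post-process results, the mean and deviation can be pulled straight out of a table row with a regex; a small sketch (recent llama-bench builds can also emit structured output formats directly, if you would rather skip the parsing):

```python
import re

# Pull mean and stddev t/s out of a llama-bench markdown row.
row = "| Meta-Llama-3.1-8B | 4.58 GiB | 8.03B | CUDA | 99 | tg 128 | 78.04 ± 0.18 |"

m = re.search(r"([\d.]+)\s*±\s*([\d.]+)\s*\|\s*$", row)
mean, stdev = float(m.group(1)), float(m.group(2))
print(f"generation: {mean} ± {stdev} tok/s")
```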

Useful sweeps

# Sweep context lengths to find where prefill collapses
./build/bin/llama-bench -m model.gguf -p 128,512,2048,8192 -n 64

# Compare quantizations on the same hardware
./build/bin/llama-bench \
  -m llama-3.1-8b.Q4_K_M.gguf \
  -m llama-3.1-8b.Q5_K_M.gguf \
  -m llama-3.1-8b.Q8_0.gguf \
  -p 512 -n 128

If you are choosing between quants, our GGUF, AWQ and GPTQ comparison walks through the quality vs throughput tradeoff in detail.


Benchmarking vLLM for Concurrency {#vllm}

Ollama and llama.cpp are excellent single-tenant runners. If you are serving multiple users, vLLM's continuous batching changes the shape of the curve completely.

pip install "vllm>=0.6.4"

# Serve
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization awq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

vLLM ships its own benchmark harness:

git clone https://github.com/vllm-project/vllm
cd vllm/benchmarks

# ShareGPT-style realistic load
python benchmark_serving.py \
  --backend openai-chat \
  --base-url http://127.0.0.1:8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 500 \
  --request-rate 8

The output is what you actually want for capacity planning:

Successful requests:                     500
Benchmark duration (s):                  62.41
Total input tokens:                      141,205
Total generated tokens:                  72,418
Request throughput (req/s):              8.01
Output token throughput (tok/s):         1160.46
Mean TTFT (ms):                          187
P99 TTFT (ms):                           412
Mean TPOT (ms):                          22.4

TPOT (time per output token) is vLLM's preferred per-token latency metric — useful when comparing against batched APIs.
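TPOT relates to the other numbers by a simple identity: end-to-end latency minus TTFT, spread over the output tokens after the first. A quick check with sample values in the same ballpark as the report above:

```python
# TPOT from end-to-end latency and TTFT -- sample values only.
mean_latency_ms = 3050   # request sent to last token
mean_ttft_ms = 187       # request sent to first token
output_tokens = 128

tpot_ms = (mean_latency_ms - mean_ttft_ms) / (output_tokens - 1)
print(f"TPOT: {tpot_ms:.1f} ms/token")
# → TPOT: 22.5 ms/token
```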


Cross-Stack Results from Our Lab {#results}

Same model (Llama 3.1 8B Instruct, Q4_K_M / AWQ-equivalent), same prompts, three runners. Pin one row to your wall.

Single stream, RTX 4090 (24GB), Ryzen 9 7950X, 64GB DDR5

| Runner | Prefill (tok/s) | Generation (tok/s) | TTFT (ms) | Peak VRAM (GB) |
| --- | --- | --- | --- | --- |
| Ollama 0.3.x | 1,624 | 71.8 | 92 | 5.9 |
| llama.cpp (CUDA) | 1,842 | 78.0 | 78 | 5.6 |
| vLLM (AWQ) | 4,310 | 84.2 | 64 | 18.4 |

Eight concurrent streams, same hardware

| Runner | Aggregate gen tok/s | P50 TTFT (ms) | P99 TTFT (ms) |
| --- | --- | --- | --- |
| Ollama 0.3.x | 142 | 240 | 1,840 |
| llama.cpp server | 318 | 168 | 920 |
| vLLM | 1,160 | 187 | 412 |

MacBook Pro M3 Max (64GB), Llama 3.1 8B Q4_K_M

| Runner | Prefill (tok/s) | Generation (tok/s) | TTFT (ms) | Peak unified mem (GB) |
| --- | --- | --- | --- | --- |
| Ollama (Metal) | 612 | 38.4 | 142 | 5.7 |
| llama.cpp (Metal) | 681 | 41.2 | 128 | 5.5 |

If you are weighing Apple Silicon against discrete GPUs, our Mac Studio vs PC build comparison uses the exact same prompts so the numbers are comparable.


Common Pitfalls That Tank Numbers {#pitfalls}

1. Forgetting OLLAMA_KEEP_ALIVE

Default is 5 minutes. If your benchmark loop sleeps longer than that, the model unloads and the next "run" pays the load tax. Set OLLAMA_KEEP_ALIVE=30m or longer.

2. Background processes on the GPU

A single Chrome tab decoding video can shave 10-15% off your tokens/sec. Run nvidia-smi and confirm the only process on the card is your runner.

3. Power and thermal throttling

Laptops and SFF builds hit thermal walls fast. Pin clocks, watch nvidia-smi -l 1, and reject any run where GPU temperature crosses 83°C. On Apple Silicon use sudo powermetrics --samplers thermal -i 1000 and reject any run that reports thermal throttling.

4. Mismatched context windows

If one runner is launched with a 2,048-token context and another with an 8,192-token context, you are benchmarking different KV-cache memory profiles, not different engines. Pin the context size explicitly (--ctx-size in llama.cpp, num_ctx in Ollama).

5. Ignoring P99 latency

Mean TTFT looks great until one user gets a 3-second wait. Always publish P50 and P99.

6. Single-run reporting

Run at least 5 iterations, drop the first (warmup), report mean and standard deviation. A single number is not a benchmark; it is an anecdote.

7. Not pinning the runner version

llama.cpp ships breaking performance changes monthly. Pin the commit:

git -C llama.cpp rev-parse --short HEAD

Include that hash in your results table.


Putting It All Together: A Reproducible Benchmark Report

Every benchmark you publish should answer these questions on one page:

Hardware:        RTX 4090 24GB / Ryzen 9 7950X / 64GB DDR5-6000
OS:              Ubuntu 22.04, kernel 6.5.0, NVIDIA driver 550.90.07
Runner:          llama.cpp commit a1b2c3d (CUDA, cuBLAS)
Model:           Meta-Llama-3.1-8B-Instruct
Quantization:    Q4_K_M (4.58 GiB)
Context:         --ctx-size 4096
Prompts:         3 fixed prompts, see Appendix
Sampling:        temperature 0, top-p 1, seed 42
Iterations:      5 per condition, first dropped
Concurrency:     1, 4, 8, 16
Power state:     GPU clocks locked 1410/10501 MHz, ambient 22°C

That is the format auditors, hiring managers, and procurement teams take seriously. It is also the format that survives r/LocalLLaMA scrutiny.


Frequently Asked Questions {#faq}

Q: Which single number should I report if I only have time for one?

A: Generation tokens/sec at concurrency 1 with a fixed 512-token prompt and 128-token output, averaged over 5 runs after a warmup. It is not the whole story, but it is the least lying number you can report in one figure.

Q: How many runs is enough?

A: Five iterations, dropping the first, is the floor for stable means. For latency P99 you need at least 100 requests; with fewer samples the 99th percentile is just the slowest request you happened to observe.

Q: Why does my Ollama benchmark show different numbers than llama-bench on the same model?

A: Ollama applies a default system prompt, default sampler settings, and may chunk differently. Match settings explicitly: same context size, same temperature, same prompt, same quantization file.

Q: Should I benchmark with batch size > 1?

A: Only if your real workload uses batched requests. For chat UIs, single stream and concurrency curves are what matter. For offline pipelines, increase batch until VRAM saturates.

Q: How do I measure VRAM properly?

A: nvidia-smi --query-gpu=memory.used --format=csv -l 1 during the run, take the peak. On Apple Silicon, sudo powermetrics --samplers gpu_power plus memory_pressure gives you the equivalent.
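If you log those one-second samples to a file, extracting the peak afterwards is a few lines. A sketch over pasted sample output (the values are invented):

```python
# Peak VRAM from logged `nvidia-smi --query-gpu=memory.used` samples.
samples = """memory.used [MiB]
5321 MiB
6104 MiB
6098 MiB"""

# Skip the header row, take the numeric field, report the maximum.
values = [int(line.split()[0]) for line in samples.splitlines()[1:]]
print(f"peak VRAM: {max(values)} MiB")
# → peak VRAM: 6104 MiB
```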

Q: My CPU-only run is faster than my GPU run for tiny models. Why?

A: For very small models, per-token kernel-launch and PCIe transfer overhead can rival the actual compute, while a modern CPU chews through the small weight matrices quickly. This is real. For models under ~3B parameters, benchmark both backends and pick the winner.

Q: Can I trust hosted leaderboards for my hardware decision?

A: Use them as a sanity check, never as the deciding number. Your prompt distribution, context length, and concurrency profile dictate which runner wins on your machine.

Q: How often should I re-benchmark?

A: Every minor llama.cpp / vLLM / Ollama bump, every driver update, and after every model swap. Pin the version in your report.


Conclusion

A benchmark is only useful if someone else can reproduce it. Pin the model, pin the quantization, pin the runner version, warm up, run five times, publish mean and standard deviation, include P50 and P99 latency, and disclose your hardware down to the GPU clock lock. Do that once and you will never have to argue about local AI performance with a stranger again — you will just send them your table.

If you are about to make a hardware purchase based on these numbers, pair this guide with our hardware requirements walkthrough and the budget local AI machine build before you click order.


Want printable benchmark templates and a YAML harness that runs the full Ollama / llama.cpp / vLLM matrix nightly? Subscribe to the LocalAimaster newsletter — we ship the harness to readers first.


LocalAimaster Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.


Published: February 12, 2026 · Last Updated: April 23, 2026 · Manually Reviewed
Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
