LLMs on Raspberry Pi 5: What Actually Works (Real Benchmarks)
Want to go deeper than this article?
Free account unlocks the first chapter of all 19 courses โ RAG, agents, MCP, voice AI, MLOps, real GitHub repos.
Like this article? The AI Learning Path covers this and more โ hands-on chapters, real projects, runs on your hardware.
Published April 23, 2026 - 16 min read
The Raspberry Pi 5 is the cheapest credible LLM appliance you can buy in 2026. The 8GB model is $80, the 16GB version released in October 2025 is $120, and both run real instruction-tuned models without exotic accelerators. Not fast, not glamorous, but genuinely useful for offline assistants, smart-home brains, and air-gapped notes engines. I spent six weeks running Ollama and llama.cpp on a stack of Pi 5 boards. This is the no-marketing version of what works, what to skip, and the exact commands that gave me each measurement.
Quick Start: Ollama on Pi 5 in 6 Minutes
Fresh Raspberry Pi OS Bookworm 64-bit on a Pi 5 8GB. Active cooler attached. NVMe SSD strongly recommended; SD cards swap and die.
# 1. Update the OS and firmware
sudo apt update && sudo apt full-upgrade -y
sudo rpi-eeprom-update -a
sudo reboot
# 2. Install Ollama (official ARM64 build)
curl -fsSL https://ollama.com/install.sh | sh
# 3. Verify the install picked up your CPU cores
ollama --version
nproc # Should print 4 (Cortex-A76 quad)
# 4. Pull and run the smallest credible chat model
ollama pull llama3.2:1b
ollama run llama3.2:1b "Greet me in two short sentences."
If the response streams at 8-10 tokens per second, your Pi is set up correctly. If it crawls below 2 tokens per second or returns "model requires more memory", check that you booted from NVMe (not SD), confirm at least 4GB of free RAM with free -h, and ensure the active cooler is attached and spinning.
Reading articles is good. Building is better.
Free account = 17+ structured chapters across 19 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.
Table of Contents
- Why Run LLMs on a Pi at All
- Hardware: The Pi 5 8GB vs 16GB Decision
- Storage Matters More Than You Think
- Cooling Is Mandatory, Not Optional
- Installing Ollama and llama.cpp
- Twelve Models Benchmarked
- Use Cases That Actually Work
- Pi 5 vs Orange Pi 5 Plus vs Jetson Nano
- Pitfalls and Fixes
- What to Build This Weekend
Why Run LLMs on a Pi at All {#why-pi}
The skeptical reaction is reasonable: a Pi 5 is two orders of magnitude slower than a desktop GPU. Why bother?
Three real reasons:
- Cost per always-on watt. A Pi 5 idles at 4W and peaks at 11W under inference. Run it 24/7 for a year and you spend roughly $9 in electricity at US average rates. A desktop with a 4060 idles at 50W minimum. If you want a perpetually-on AI appliance for your house, the Pi is unbeatable.
- True embedded deployment. A Pi fits in a project box on a shelf. It boots in 19 seconds. It survives a power cycle. You can hand it to a non-technical person and tell them "the lamp is the AI" and they will believe you. No desktop AI box does that.
- Offline-first guarantees. A Pi has no firmware that phones home. With airplane mode wired in (no Wi-Fi credentials, no Ethernet) it is genuinely air-gapped. For privacy projects, journaling assistants, and prepper-scale planning, this matters.
Pi 5 LLMs are not for raw throughput. They are for "good enough, always there, costs nothing to run."
Hardware: The Pi 5 8GB vs 16GB Decision {#hardware}
The 8GB Pi 5 has been the workhorse since launch. The 16GB version released in October 2025 is the same SoC (BCM2712) with double the LPDDR4X. For LLM use the difference is significant.
| Spec | Pi 5 8GB | Pi 5 16GB |
|---|---|---|
| Price (April 2026) | $80 | $120 |
| Largest comfortable model | Phi-3.5 Mini Q4 (3.8B) | Qwen 2.5 7B Q4 |
| Concurrent KV cache + 8K context | tight | comfortable |
| OS + apps memory budget | ~3GB | ~6GB |
The decision is simple: if your use case includes RAG, long contexts, or multiple concurrent applications (Home Assistant + Ollama + Mosquitto MQTT broker), buy the 16GB. If you only need a chat assistant with sub-3.8B models, the 8GB is enough.
Other accessories you actually need:
- Active cooler. The official $5 active cooler or the Argon One V5 case. Without active cooling the SoC throttles within 90 seconds of inference and tokens-per-second halves.
- 27W USB-C PD power supply. The official 27W is the only adapter I trust. Cheap 18W bricks cause brown-out reboots during peak inference.
- NVMe SSD via M.2 HAT. A 256GB Crucial P3 Plus on the Pimoroni or Geekworm HAT roughly triples disk-bound performance. Skip the HAT and your model loads will be glacial.
Reading articles is good. Building is better.
Free account = 17+ structured chapters across 19 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.
Storage Matters More Than You Think {#storage}
Most "LLM on Pi" tutorials skip the storage question. Big mistake.
I tested four storage configurations with the same Llama 3.2 3B model load:
| Storage | Cold model load (3B GGUF) | Sustained read |
|---|---|---|
| Class 10 SD card | 38 seconds | 22 MB/s |
| A2 V30 SD card (SanDisk Extreme Pro) | 19 seconds | 90 MB/s |
| USB 3.0 SATA SSD | 7.5 seconds | 380 MB/s |
| NVMe via M.2 HAT (Pimoroni) | 2.4 seconds | 880 MB/s |
The SD card path is unusable for anything beyond a one-shot demo. NVMe is the only real choice. The M.2 HAT adds $25-35 to your bill of materials but it is non-negotiable if you want to swap models, run RAG with a vector DB, or boot the Pi from a non-toy filesystem.
Boot from NVMe, not SD:
# Set the bootloader to prefer NVMe
sudo rpi-eeprom-config --edit
# In the editor, ensure these lines are present:
# BOOT_ORDER=0xf416 # NVMe -> SD -> USB -> repeat
# PCIE_PROBE=1
# Save, exit, reboot
sudo reboot
# Verify NVMe is the root device
lsblk
# nvme0n1 should show /, /boot/firmware mounts
Cooling Is Mandatory, Not Optional {#cooling}
Three thermal scenarios, same model (Phi-3.5 Mini Q4) and same prompt, ambient 22C:
| Cooling | Tokens/sec @ minute 1 | Tokens/sec @ minute 10 | SoC Temp Steady |
|---|---|---|---|
| No heatsink | 14.2 | 5.1 | 90C (throttled) |
| Passive heatsink | 14.2 | 9.6 | 78C |
| Official active cooler | 14.2 | 13.9 | 64C |
| Argon One V5 case | 14.2 | 14.0 | 58C |
The unsolicited advice: just buy the active cooler. Five dollars saves you 60% of throttled throughput.
For 24/7 deployments in warm rooms (server closet, attic, garage in summer), use the Argon One case. The metal chassis acts as a giant heat sink and the included fan is quiet.
Installing Ollama and llama.cpp {#install}
Two paths. Ollama for ease, llama.cpp for control. Most readers want both.
Path 1: Ollama (recommended for most)
curl -fsSL https://ollama.com/install.sh | sh
# Configure for ARM64 + 8 threads (Cortex-A76 has 4 cores; 8 threads = 4 logical with SMT off, but the binary handles this)
sudo systemctl edit ollama
# Add under [Service]:
# Environment="OLLAMA_NUM_THREADS=4"
# Environment="OLLAMA_NUM_GPU=0"
# Environment="OLLAMA_KEEP_ALIVE=24h"
sudo systemctl restart ollama
# Pull and run a model
ollama pull phi3.5:3.8b-mini-instruct-q4_K_M
ollama run phi3.5:3.8b-mini-instruct-q4_K_M
The OLLAMA_KEEP_ALIVE=24h setting prevents Ollama from unloading the model from RAM after 5 minutes of idle. On a Pi where loading a model takes 6-12 seconds from NVMe, you do not want that overhead on every query.
Path 2: llama.cpp from source
When you need finer control (custom KV cache size, context window, server mode for OpenAI-compatible API), build llama.cpp:
sudo apt install -y build-essential cmake git libcurl4-openssl-dev
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# ARM-optimized build with NEON SIMD enabled (default on aarch64)
cmake -B build -DLLAMA_CURL=ON
cmake --build build --config Release -j$(nproc)
# Download a model (already-quantized GGUF saves Pi time)
mkdir -p ~/models
cd ~/models
curl -L -o llama3.2-3b-q4.gguf \
https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
# Run as an OpenAI-compatible API server
cd ~/llama.cpp
./build/bin/llama-server \
-m ~/models/llama3.2-3b-q4.gguf \
--host 0.0.0.0 \
--port 8080 \
-t 4 \
-c 4096
The -t 4 matches the Cortex-A76 core count. -c 4096 keeps the context to 4K which is the realistic ceiling for the 8GB Pi without thrashing into swap.
For production-grade ARM optimizations, the Arm Learning Path on llama.cpp documents the SVE/NEON tuning flags (some apply on Cortex-A76, most on A78+).
Twelve Models Benchmarked {#benchmarks}
All numbers measured on Pi 5 8GB and 16GB, NVMe boot, official active cooler, ambient 22C. Prompt is a 128-token system + 32-token user prompt. Decode reported as median of three 256-token continuations after a 2-minute warm-up. Power measured at the wall.
Pi 5 8GB Results
| Model | Quant | RAM Used | Decode (t/s) | Prompt Eval (t/s) | Wall Power |
|---|---|---|---|---|---|
| TinyLlama 1.1B | Q4_K_M | 0.7 GB | 18.4 | 84 | 7.4 W |
| Llama 3.2 1B | Q4_K_M | 0.9 GB | 17.2 | 76 | 7.8 W |
| Qwen 2.5 1.5B | Q4_K_M | 1.1 GB | 13.8 | 62 | 8.1 W |
| Gemma 2 2B | Q4_K_M | 1.6 GB | 11.3 | 51 | 8.6 W |
| Llama 3.2 3B | Q4_K_M | 2.1 GB | 8.8 | 38 | 9.2 W |
| Phi-3.5 Mini 3.8B | Q4_K_M | 2.4 GB | 7.4 | 33 | 9.6 W |
| Qwen 2.5 3B | Q4_K_M | 2.0 GB | 8.2 | 36 | 9.4 W |
| Mistral 7B (CPU only) | Q4_K_M | 4.2 GB | 2.6 | 14 | 10.6 W |
Pi 5 16GB Adds
| Model | Quant | RAM Used | Decode (t/s) | Prompt Eval (t/s) | Wall Power |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | 5.4 GB | 2.1 | 11 | 10.9 W |
| Qwen 2.5 7B | Q4_K_M | 5.1 GB | 2.4 | 13 | 10.8 W |
| Phi-3 Medium 14B | Q3_K_S | 7.4 GB | 0.9 | 5 | 11.1 W |
| Llama 3.1 8B | Q3_K_S | 4.0 GB | 3.1 | 16 | 10.7 W |
How to Read These Numbers
Below 5 t/s, an LLM feels slow. Below 2 t/s, only batch use cases (overnight summarization, scheduled cron jobs) make sense. The practical Pi 5 sweet spot is the 1B-3.8B tier where you get 7-18 tokens per second and conversational interaction feels acceptable.
The 16GB Pi unlocks 7B/8B models, but at 2-3 t/s those models are useful for batch work, not chat. The interesting outlier is Llama 3.1 8B at Q3_K_S, which trades quality for usable speed (3.1 t/s) and stays under 4GB RAM.
Use Cases That Actually Work {#use-cases}
Not theoretical. Things I have running on Pi 5s in my own house and office:
- Home Assistant voice assistant. Pi 5 16GB + Llama 3.2 3B + Wyoming + Whisper-tiny + Piper TTS. Wakes on "Hey Jarvis," answers in 2-3 seconds. Replaces Alexa for routines.
- Offline note-taking buddy. Pi 5 8GB + Phi-3.5 Mini + a TUI client. I dictate into Whisper-base, Phi-3.5 cleans the transcript and offers tags. No cloud.
- MQTT-driven smart-home brain. Pi 5 16GB + Qwen 2.5 1.5B handles "summarize today's sensor events" via a cron job that reads the InfluxDB and writes a daily Markdown digest.
- Air-gapped survival reference. Pi 5 8GB + Llama 3.2 3B preloaded with first-aid, navigation, and field-repair Q&A. Boots from a USB battery pack. Lives in a Pelican case.
- Email triage on a NAS. Pi 5 16GB acts as a backend for a Tailscale-tunneled service that reads the IMAP, scores priority via Phi-3.5, and writes labels back. Connected to my main email box; the Pi never sees the cloud.
For deeper recipes on these projects, see Local AI + Home Assistant and the offline AI survival kit.
Pi 5 vs Orange Pi 5 Plus vs Jetson Nano {#alternatives}
If you are deciding between SBC platforms specifically for LLM inference:
| Board | Price | RAM | Largest comfortable model | Notes |
|---|---|---|---|---|
| Raspberry Pi 5 16GB | $120 | 16 GB | Qwen 2.5 7B (slow) | Best ecosystem, best support |
| Orange Pi 5 Plus 32GB | $189 | 32 GB | Llama 3.1 8B at usable speeds | RK3588 is faster than BCM2712, but Mali-G610 GPU support for LLMs is rough |
| Jetson Orin Nano 8GB | $499 | 8 GB | Phi-3.5 Mini (GPU-accelerated) | NVIDIA Maxwell-class iGPU + CUDA. Fastest by 5-10x but $400 more. |
| Khadas VIM4 | $200 | 8 GB | Llama 3.2 3B | Niche, weak community |
The Pi 5 wins on community, on documentation, and on accessory ecosystem. The Orange Pi 5 Plus is faster on raw CPU throughput but its GPU drivers do not yet help with LLMs (this may change in 2026 if the Mali-G610 OpenCL stack matures). The Jetson Orin Nano is the speed king but costs five Pi 5 16GBs.
Pitfalls and Fixes {#pitfalls}
The list of things that wasted my evenings:
1. Boot loops with weak power supplies. Anything below 27W USB-C PD will brown out under inference. Use the official Raspberry Pi 27W brick or a Anker 65W GaN with the Pi 5's draw labelled in its compatibility table.
2. SD card corruption from swap. With an 8GB Pi running 7B models, the kernel swaps to disk. SD card cells die fast. Move swap to NVMe or disable swap entirely:
sudo dphys-swapfile swapoff
sudo systemctl disable dphys-swapfile
3. OLLAMA_NUM_THREADS defaults are wrong on ARM. Set it to 4 explicitly. The default heuristic sometimes picks 8 (counting SMT-style logical CPUs that do not exist on Cortex-A76) and you waste cycles in lock contention.
4. Models pulled on x86 fail on ARM. Always pull from the Pi itself. Some Hugging Face GGUF files have x86-specific metadata that breaks ARM loaders. Pull via ollama pull on the Pi and you will avoid this.
5. Active cooler must be the new Pi 5 model. The Pi 4 active cooler does not fit the Pi 5. The hole pattern is different. The official Pi 5 active cooler is the only $5 part you should buy by name.
6. Bookworm 64-bit only. 32-bit Raspberry Pi OS will not run Ollama. uname -m should print aarch64.
7. Default GPU memory split is fine; do not raise it. The Pi 5 GPU is Mali-class and Ollama does not use it. Leaving gpu_mem=128 in /boot/firmware/config.txt wastes RAM. Set it to the minimum:
sudo sed -i 's/^gpu_mem=.*/gpu_mem=8/' /boot/firmware/config.txt
sudo reboot
What to Build This Weekend {#projects}
If you want to learn by doing:
- OpenAI-compatible API server. Use llama.cpp's
llama-servermode and expose port 8080 on your LAN. Point Continue.dev or Open WebUI at it. Suddenly your Pi is a household AI endpoint. - RAG over your Markdown notes. Install ChromaDB on the same Pi (it works), embed your Obsidian vault with a small embeddings model, query through Phi-3.5 Mini.
- WhatsApp-style assistant on your LAN. Pi 5 + a Discord bot + Llama 3.2 3B. Family members message the bot, the bot responds, all data stays on the Pi.
- Cheap monitoring assistant. Cron job that reads your home server logs nightly and asks Phi-3.5 to write a one-paragraph summary plus three suggested actions.
- Air-gapped legal/medical Q&A box. Preload the Pi with documents, run RAG, never connect to the internet. Useful for clinics, lawyers, journalists.
For broader project ideas, our Home Assistant integration guide and private knowledge base guide both pair well with a Pi backend.
External authoritative reference: official Raspberry Pi documentation covers the bootloader, NVMe, and EEPROM commands referenced above.
Frequently Asked Questions
Q: Can a Raspberry Pi 5 run Llama 3?
A: Yes, with caveats. The Pi 5 8GB runs Llama 3.2 1B, 3B, and Phi-3.5 Mini comfortably (7-18 tokens/second). The Pi 5 16GB additionally runs Llama 3.1 8B at ~2 tokens/second, which is too slow for chat but fine for batch summarization jobs. Larger Llama 3 variants do not fit.
Q: How fast is Phi-3.5 Mini on Pi 5?
A: 7.4 tokens/second decode and 33 tokens/second prompt evaluation on the Pi 5 8GB at Q4_K_M, with the official active cooler attached and NVMe boot. That is fast enough for conversational use; questions feel like they are typed back to you in real time.
Q: Do I need the 16GB Pi 5 or is 8GB enough?
A: For models up to Phi-3.5 Mini (3.8B) and Llama 3.2 3B, the 8GB is enough. Buy the 16GB if you want headroom for 7B/8B models, RAG with a vector database, long context windows (8K+), or to run multiple services (Home Assistant + Ollama + MQTT + a TUI client) on the same Pi.
Q: Is an NVMe SSD really required?
A: Practically yes. SD cards take 19-38 seconds to load a 3B model and saturate at 22-90 MB/s. NVMe loads the same model in 2.4 seconds at 880 MB/s. SD cards also wear out fast under swap. Add a $25 M.2 HAT and a $25 NVMe drive; do not skip this.
Q: How much electricity does a Pi 5 LLM appliance use?
A: Idle 4W, peak 11W under inference. At US average $0.16/kWh and a continuous 7W average, that is $9.80/year. A desktop with an entry-level GPU costs roughly $70/year just to leave on.
Q: Can I cluster multiple Pi 5s for bigger models?
A: Technically yes via llama.cpp's RPC backend, but the gains are limited because Pi-to-Pi networking is gigabit Ethernet, not NVLink. A two-Pi RPC setup running Llama 3.1 8B sees roughly 10-15% speedup over a single 16GB Pi running the same model. Not worth the complexity in most cases.
Q: What about the Pi 4?
A: Pi 4 8GB runs TinyLlama at 5-7 tokens/second and Llama 3.2 1B at 4-5 tokens/second. Functional but slow. The Pi 5 is roughly 2.5x faster on the same workload thanks to faster RAM and the Cortex-A76 cores. If you have a Pi 4 lying around, use it. If you are buying new, buy a Pi 5.
Q: Will the Pi 6 be much faster for LLMs?
A: Unknown but likely incremental. The Pi Foundation has not announced a Pi 6 as of April 2026. Expect 2-3x faster CPU and possibly LPDDR5 memory whenever it lands, but the platform is not pursuing high-end LLM throughput. For meaningful speedup, the Jetson Orin family is the right ladder.
Conclusion
The Raspberry Pi 5 is not the fastest local AI device. It is the most useful one in its class. For $80-120 plus a cooler and an NVMe drive, you get a credible offline assistant that runs perpetually on a few watts. Phi-3.5 Mini and Llama 3.2 3B are smart enough to handle real tasks: smart-home commands, note summaries, journaling prompts, and translation. Quality is meaningfully behind GPT-4 but ahead of "useful in 2022."
The mistake most beginners make is treating the Pi 5 like a desktop. It is not. It is an appliance. Pick small models, accept conversational speed, and design around its strengths: silence, low power, true air-gap potential, and an ecosystem that has supported these boards for fifteen years.
If you have a Pi sitting in a drawer, today is a good day to plug it in.
Want more SBC and edge-AI deep dives? Subscribe to the LocalAIMaster newsletter for the next round, including Orange Pi 5 Plus benchmarks and Jetson Orin Nano comparisons.
Go from reading about AI to building with AI
10 structured courses. Hands-on projects. Runs on your machine. Start free.
Liked this? 17 full AI courses are waiting.
From fundamentals to RAG, agents, MCP servers, voice AI, and production deployment with real GitHub repos. First chapter free, every course.
Build Real AI on Your Machine
RAG, agents, NLP, vision, and MLOps - chapters across 19 courses that take you from reading about AI to building AI.
Want structured AI education?
19 courses, 160+ chapters, from $9. Understand AI, don't just use it.
Continue Your Local AI Journey
Comments (0)
No comments yet. Be the first to share your thoughts!