Why Is My Local LLM So Slow? 12 Fixes Ranked by Likelihood
Published on April 11, 2026 • 17 min read
Your local model should generate at 20-50 tokens per second on a decent GPU. If you are getting single digits, something is wrong. The frustrating part is that Ollama and other inference engines rarely tell you why it is slow. The model loads, text appears, but at a crawl.
I have ranked these 12 fixes by how often each one is the actual cause when someone reports slow inference. Fix #1 solves the problem for roughly 40% of people. Fix #2 catches another 20%. By the time you reach #6, you have covered over 90% of cases.
Work through them in order. Each fix has a diagnostic command that tells you in seconds whether it applies to you.
Fix 1: Model Too Large for VRAM (Partial CPU Offloading) {#fix-1}
Likelihood: Very High (40% of slow inference cases)
This is the single most common cause of slow local AI. Your model does not fully fit in GPU VRAM, so Ollama silently offloads some layers to system RAM. GPU layers run at full speed. RAM layers run 5-10x slower. The average speed tanks.
Diagnose it:
# Check VRAM usage while model is loaded
nvidia-smi
# Look at "Memory-Usage" column
# If it shows e.g., 7800MiB / 8192MiB, your VRAM is maxed out
# Some layers are being processed on CPU
# Ollama shows the CPU/GPU split for a loaded model
ollama ps
# PROCESSOR column: "100% GPU" is good; "48%/52% CPU/GPU" means offloading
What you will see if this is your problem:
- VRAM usage near 100%
- Generation speed drops from expected 30-40 tok/s to 5-15 tok/s
- First token is slow (prompt evaluation is split between GPU and CPU)
Fix it:
# Option 1: Use a smaller model
ollama pull llama3.2:3b # Instead of 7b
ollama pull phi4-mini # 3.8B, fits easily in 8GB VRAM
# Option 2: Use a more aggressive quantization
ollama pull llama3.2:7b-q3_K_M # ~3.3GB vs ~4.1GB for Q4_K_M
# Option 3: Reduce context window (frees VRAM for model layers)
ollama run llama3.2:7b
# Then inside the session (default num_ctx is 2048):
# >>> /set parameter num_ctx 1024
# Option 4: Force all layers to GPU (will fail if not enough VRAM)
# In a Modelfile:
# PARAMETER num_gpu 999
The math: A 7B Q4_K_M model needs ~4.1GB for weights + ~0.5-1.5GB for KV cache depending on context length. On an 8GB GPU, that leaves almost zero headroom. On a 12GB GPU, it fits comfortably.
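That arithmetic is easy to script. The sketch below is a rough estimator, not Ollama's actual allocator; the ~0.5GB runtime overhead figure is an assumption:

```shell
#!/bin/sh
# Rough VRAM fit check: weights + KV cache + overhead vs. available VRAM.
# The 0.5GB overhead is an assumed figure for driver/runtime allocations.

# estimate_fit <weights_gb> <kv_cache_gb> <vram_gb>
estimate_fit() {
    awk -v w="$1" -v kv="$2" -v vram="$3" 'BEGIN {
        need = w + kv + 0.5
        verdict = (need <= vram) ? "fits" : "partial CPU offload likely"
        printf "need %.1fGB of %.1fGB -> %s\n", need, vram, verdict
    }'
}

estimate_fit 4.1 1.0 8     # 7B Q4_K_M, modest context, 8GB card
estimate_fit 4.1 1.5 6     # same model, long context, 6GB card
```

On the 6GB card the estimate predicts offloading; on the 8GB card it fits, but with little room left once the context window grows.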
For detailed model-to-VRAM mapping, see our VRAM requirements guide.
Fix 2: CPU Inference When GPU Is Available {#fix-2}
Likelihood: High (20% of cases)
Ollama is running entirely on CPU even though you have a perfectly good GPU. This happens when CUDA is not detected, the driver is wrong, or Ollama was installed without GPU support.
Diagnose it:
# Step 1: Is the GPU even visible to the system?
nvidia-smi
# If this fails, your driver is not installed or not loaded
# Step 2: Is Ollama seeing the GPU?
OLLAMA_DEBUG=1 ollama run llama3.2 "test" 2>&1 | head -30
# Look for lines mentioning "CUDA", "GPU", or "metal"
# If you only see "CPU" references, Ollama is not using your GPU
# Step 3: While model is running, check GPU utilization
nvidia-smi
# GPU-Util should be >0%. If it is 0%, Ollama is not using the GPU
Fix it:
# Linux: Install/update NVIDIA driver
sudo apt update
sudo apt install nvidia-driver-550
sudo reboot
# Verify CUDA version after reboot
nvidia-smi
# Must show CUDA Version: 11.8 or higher
# If driver is fine but Ollama still uses CPU, reinstall Ollama
curl -fsSL https://ollama.com/install.sh | sh
# macOS: Metal should be automatic on Apple Silicon
# Verify Metal support:
system_profiler SPDisplaysDataType | grep Metal
Speed difference: CPU inference on a modern 8-core CPU gives 5-10 tok/s for a 7B model. GPU inference on an RTX 3060 gives 35-45 tok/s. That is a 4-7x difference. If you have a GPU and are not using it, this is by far the biggest performance gain available.
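To put a number on your own setup: Ollama's /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds), and tokens per second follows directly. A minimal sketch (the commented curl line assumes a local server on the default port 11434):

```shell
#!/bin/sh
# tok_per_sec <eval_count> <eval_duration_ns>
# Computes tokens/second from the fields Ollama reports per generation.
tok_per_sec() {
    awk -v c="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", c / (ns / 1e9) }'
}

# Real measurement (assumes Ollama is running locally):
# curl -s http://localhost:11434/api/generate \
#   -d '{"model":"llama3.2","prompt":"test","stream":false}'
# then feed eval_count and eval_duration from the JSON into tok_per_sec.

tok_per_sec 128 3200000000   # 128 tokens in 3.2s -> 40.0
```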
Fix 3: Wrong Quantization (Q2 Is Terrible) {#fix-3}
Likelihood: Moderate (10% of cases)
Q2_K quantization reduces model size dramatically but destroys quality and paradoxically can be slower on some hardware because the dequantization overhead is higher. Some people pick Q2 thinking smaller means faster. It often does not.
Diagnose it:
# Check what quantization your model uses
ollama show llama3.2 --modelfile | grep -i "quant\|format"
# Or check the model name - it usually includes the quant level
ollama list
# NAME SIZE
# llama3.2:7b-q2_K 2.7GB ← Q2, likely slow + bad quality
# llama3.2:7b-q4_K_M 4.1GB ← Q4, sweet spot
Fix it:
# Remove the bad quantization
ollama rm llama3.2:7b-q2_K
# Pull the standard Q4_K_M (default for most Ollama models)
ollama pull llama3.2:7b
# Or explicitly request Q4_K_M
ollama pull llama3.2:7b-q4_K_M
Quantization speed comparison (RTX 3060 12GB, Llama 3.2 7B):
| Quantization | Size | tok/s | Quality |
|---|---|---|---|
| Q2_K | 2.7GB | 32 | Poor |
| Q3_K_M | 3.3GB | 38 | Acceptable |
| Q4_K_M | 4.1GB | 42 | Good |
| Q5_K_M | 4.8GB | 40 | Very Good |
| Q8_0 | 7.2GB | 35 | Near-Perfect |
Q4_K_M is actually faster than Q2_K in many scenarios because the compute kernels are better optimized for 4-bit operations. Read our RAM requirements guide for quantization-to-performance mapping across different hardware.
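The size differences come down to effective bits per weight. The bpw figures below are approximations (llama.cpp's K-quants mix block sizes, so real file sizes vary a little):

```shell
#!/bin/sh
# gguf_size_gb <params_in_billions> <effective_bits_per_weight>
# Approximate effective bpw (ballpark figures, not exact):
#   Q2_K ~3.2, Q3_K_M ~3.9, Q4_K_M ~4.9, Q5_K_M ~5.7, Q8_0 ~8.5
gguf_size_gb() {
    awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

gguf_size_gb 7 4.9    # Q4_K_M at a nominal 7B, near the table's 4.1GB
gguf_size_gb 7 8.5    # Q8_0 at 7B -> 7.4
```

Nominal parameter counts ("7B") are rounded, so expect the estimate to land within a few hundred MB of the downloaded file.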
Fix 4: Thermal Throttling {#fix-4}
Likelihood: Moderate (8% of cases, much higher on laptops)
Your GPU generates text at 40 tok/s for the first 30 seconds, then drops to 20 tok/s. The GPU is overheating and reducing its clock speed to protect itself.
Diagnose it:
# Monitor GPU temperature in real time
watch -n 1 'nvidia-smi --query-gpu=temperature.gpu,clocks.gr,power.draw --format=csv,noheader'
# Example output when throttling:
# 87, 1200 MHz, 120 W ← temperature high, clock dropping
# Normal output:
# 65, 1800 MHz, 150 W ← cool, full clock speed
# macOS Apple Silicon
sudo powermetrics --samplers thermal -i 2000 -n 10
Throttling typically starts at:
- NVIDIA desktop GPUs: 83-90C
- NVIDIA laptop GPUs: 80-87C
- Apple Silicon: 95-100C (higher threshold by design; only fanless models like the MacBook Air rely on passive cooling)
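The temperature/clock heuristic can be automated. This sketch parses one line of the nvidia-smi query shown above; the 70%-of-max-clock cutoff is an assumption, so tune it for your card:

```shell
#!/bin/sh
# is_throttling "<temp>, <clock> MHz" <max_clock_mhz> <temp_limit_c>
# Flags throttling when temperature hits the limit or the clock has
# dropped below 70% of its maximum.
is_throttling() {
    echo "$1" | awk -F', ' -v max="$2" -v tlim="$3" '{
        temp = $1 + 0; clock = $2 + 0
        if (temp >= tlim || clock < 0.7 * max) print "throttling"
        else print "ok"
    }'
}

is_throttling "87, 1200 MHz" 1800 83   # hot and downclocked -> throttling
is_throttling "65, 1800 MHz" 1800 83   # cool, full clock    -> ok
```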
Fix it:
# Improve cooling first (no commands needed)
# - Clean dust from GPU heatsink and fans
# - Improve case airflow (remove side panel temporarily to test)
# - Use a laptop cooling pad
# - Ensure fans are actually spinning (check fan curve in NVIDIA Settings)
# Set a power limit to reduce heat (trades ~10% performance for ~10C cooler)
sudo nvidia-smi -pl 120 # Set to 120W instead of 150W default
# Set a more aggressive fan curve (Linux, requires nvidia-settings)
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=80"
# Reduce GPU clock speed manually (nuclear option)
sudo nvidia-smi -lgc 300,1500 # Cap max clock at 1500MHz
For desktop users, cleaning dust and improving case airflow usually solves the problem. For laptop users, a $20 cooling pad makes a measurable difference. See our system requirements guide for recommended cooling solutions.
Fix 5: Background Processes Eating VRAM {#fix-5}
Likelihood: Moderate (7% of cases)
Chrome with 40 tabs, a game running in the background, or even your desktop compositor can consume 1-3GB of VRAM. That pushes your model from "fits in VRAM" to "partially offloaded to RAM" territory.
Diagnose it:
# Check VRAM usage by process
nvidia-smi
# Look at the "Processes" table at the bottom
# Common VRAM hogs:
# Chrome/Firefox: 200-800MB per GPU-accelerated tab
# Discord: 100-300MB
# Desktop compositor (Xorg/Wayland): 100-500MB
# Games: 2-8GB
Fix it:
# Close Chrome (biggest offender)
# Or disable hardware acceleration: chrome://settings → System → "Use hardware acceleration"
# Kill specific GPU processes
# Find PID from nvidia-smi, then:
kill -9 <PID>
# On Linux, the compositor's VRAM footprint (100-500MB) is mostly fixed;
# to reclaim it entirely, stop the display manager and work over SSH or a TTY
sudo systemctl stop gdm    # or sddm / lightdm, depending on your distro
# Verify freed VRAM
nvidia-smi
# Memory-Usage should drop significantly
Quick test: Close everything except your terminal and Ollama. Run the model. If it is significantly faster, something else was eating your VRAM. Add applications back one at a time to find the culprit.
Fix 6: Wrong Ollama GPU Layer Settings {#fix-6}
Likelihood: Moderate (6% of cases)
Ollama decides how many model layers to put on GPU vs CPU based on available VRAM. Sometimes this automatic detection is wrong, especially on multi-GPU systems or when VRAM reporting is inaccurate.
Diagnose it:
# Check how Ollama split the model between CPU and GPU
ollama ps
# PROCESSOR column shows the split, e.g. "25%/75% CPU/GPU"
# The server log (journalctl -u ollama on Linux) shows exact layer counts:
# "offloading 20/32 layers to GPU"
# This means 12 layers are on CPU, slowing things down
Fix it:
# Force all layers to GPU via Modelfile
cat > Modelfile << 'EOF'
FROM llama3.2:7b
PARAMETER num_gpu 999
EOF
ollama create llama3.2-gpu -f Modelfile
ollama run llama3.2-gpu "test"
# If it crashes with OOM, reduce num_gpu
# Try 28 out of 32 layers on GPU:
# PARAMETER num_gpu 28
# Or set it per-request through the API options
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":"test","options":{"num_gpu":999}}'
The tradeoff: Forcing all layers to GPU when there is not enough VRAM will cause out-of-memory errors. You need to find the sweet spot: as many layers on GPU as possible while leaving room for the KV cache. Start with num_gpu 999 and reduce by 2-4 layers if you get OOM errors.
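The sweet spot can be estimated rather than guessed. A sketch, assuming roughly uniform layer sizes and an assumed ~0.5GB runtime overhead:

```shell
#!/bin/sh
# max_gpu_layers <vram_gb> <model_gb> <total_layers> <kv_cache_gb>
# Reserve KV cache + ~0.5GB overhead, then see how many weight layers
# fit in what remains. Treats all layers as equal-sized (approximation).
max_gpu_layers() {
    awk -v vram="$1" -v model="$2" -v layers="$3" -v kv="$4" 'BEGIN {
        per_layer = model / layers
        free = vram - kv - 0.5
        n = int(free / per_layer)
        if (n > layers) n = layers
        if (n < 0) n = 0
        print n
    }'
}

max_gpu_layers 6 4.1 32 1.5    # 6GB card, 7B Q4_K_M, long context -> 31
max_gpu_layers 4 4.1 32 1.5    # 4GB card -> 15, expect heavy offload
```

A result of 31 out of 32 suggests trying num_gpu 31 first instead of starting at 999 and backing off after OOM errors.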
Fix 7: HDD Instead of SSD for Model Storage {#fix-7}
Likelihood: Low-Moderate (4% of cases)
Model loading time and generation speed can be affected by storage speed. An HDD takes 30-60 seconds to load a 7B model. An NVMe SSD takes 2-3 seconds. During inference, if the model does not fully fit in RAM and the system swaps to disk, HDD makes everything 10-50x slower.
Diagnose it:
# Check where Ollama stores models
ls -la ~/.ollama/models/
# Check if that path is on an HDD or SSD
df -h ~/.ollama/models/
# Note the mount point, then:
lsblk -d -o name,rota
# ROTA=1 means HDD (rotational), ROTA=0 means SSD
# Check if swap is being used (indicates RAM pressure + disk dependency)
free -h
# If "Swap" used is >0, disk speed directly impacts inference
Fix it:
# Move models to SSD
sudo systemctl stop ollama
mv ~/.ollama/models/ /path/to/ssd/ollama-models/
ln -s /path/to/ssd/ollama-models/ ~/.ollama/models
# Or set the storage path via environment variable
export OLLAMA_MODELS=/path/to/ssd/ollama-models
# Add to /etc/systemd/system/ollama.service.d/override.conf for persistence
This fix matters most when:
- You are loading/switching between models frequently
- Your model barely fits in RAM and the system uses swap
- You are running on an older machine with an HDD boot drive
Fix 8: Insufficient RAM for Context Window {#fix-8}
Likelihood: Low-Moderate (3% of cases)
The KV cache for the context window lives in VRAM (or RAM if offloaded). Large context windows eat significant memory. Going from 2048 to 32768 context tokens can add 2-4GB of memory usage, pushing model layers out of VRAM.
Diagnose it:
# Check current context window size
ollama show llama3.2 --modelfile | grep num_ctx
# Estimate KV cache size
# Rough formula: KV cache (GB) = (context_length * num_layers * kv_dim * 2 * 2) / 1e9
# where kv_dim = num_kv_heads * head_dim (1024 for Llama's GQA models),
# one factor of 2 for the K and V tensors, one for fp16 bytes
# For a Llama-style 7B model with 8192 context: ~1-1.5GB KV cache
# With 32768 context: ~4-6GB KV cache
# Monitor VRAM during a long conversation
watch -n 2 nvidia-smi
# VRAM usage increases as conversation grows
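The KV cache estimate can be computed directly. A sketch, assuming fp16 K/V tensors and grouped-query attention (kv_dim = num_kv_heads * head_dim, i.e. 1024 for Llama's 8 KV heads of 128); older full multi-head attention models use d_model here instead and need roughly 4x more:

```shell
#!/bin/sh
# kv_cache_gb <context_length> <num_layers> <kv_dim>
# One factor of 2 for the K and V tensors, one for fp16 bytes.
kv_cache_gb() {
    awk -v ctx="$1" -v l="$2" -v d="$3" 'BEGIN {
        printf "%.1f\n", ctx * l * d * 2 * 2 / 1e9
    }'
}

kv_cache_gb 8192 32 1024    # -> 1.1
kv_cache_gb 32768 32 1024   # -> 4.3
```

Real allocations run somewhat higher than this floor because runtimes pad the cache and keep scratch buffers alongside it.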
Fix it:
# Reduce context window
ollama run llama3.2
# >>> /set parameter num_ctx 2048   (down from 8192)
# Or in a Modelfile
cat > Modelfile << 'EOF'
FROM llama3.2:7b
PARAMETER num_ctx 2048
PARAMETER num_predict 512
EOF
ollama create llama3.2-fast -f Modelfile
# Start a new conversation (clears accumulated KV cache)
# Long conversations accumulate context, slowing down over time
Practical advice: Start with 2048 context for interactive chat. Only increase to 4096 or 8192 when you genuinely need to process long documents. Most chat interactions use fewer than 1000 tokens of context.
Fix 9: Power Management Throttling GPU {#fix-9}
Likelihood: Low (2% of cases)
Your OS may be throttling GPU performance through power management. This is especially common on laptops set to "Balanced" or "Power Saver" mode, where the GPU clock speed is artificially limited.
Diagnose it:
# Check current GPU clock speed
nvidia-smi --query-gpu=clocks.gr,clocks.max.gr --format=csv
# If current clock is significantly below max, power management is limiting it
# Check current power state
nvidia-smi --query-gpu=pstate --format=csv
# P0 = max performance, P8 = minimum. You want P0 or P2 during inference.
# Linux: Check CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# "powersave" is bad for AI. "performance" is what you want.
Fix it:
# Linux: Set CPU to performance mode
sudo cpupower frequency-set -g performance
# Or for all cores:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Windows: Set to "High Performance" power plan
# Control Panel → Power Options → High Performance
# NVIDIA: Set GPU to max performance
sudo nvidia-smi -pm 1 # Enable persistent mode
sudo nvidia-smi -ac 5001,1800 # Set memory and GPU clock maximums
# macOS: Disable low power mode
# System Settings → Battery → Low Power Mode: Off
Fix 10: Outdated GPU Drivers {#fix-10}
Likelihood: Low (2% of cases)
Old drivers can have suboptimal CUDA performance or missing optimizations for newer model architectures. This is less common than other issues because Ollama works with a wide range of driver versions, but updating can still provide measurable speed improvements.
Diagnose it:
# Check current driver version
nvidia-smi | head -3
# Driver Version: 535.xx → might benefit from update
# Driver Version: 550.xx → probably fine
# Check if a newer driver is available
apt list --upgradable 2>/dev/null | grep nvidia
Fix it:
# Ubuntu: Update to latest driver
sudo apt update
sudo apt install nvidia-driver-550
sudo reboot
# Or use the NVIDIA PPA for the very latest
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-driver-560
sudo reboot
# Verify the update
nvidia-smi
Expected improvement: Typically 0-10% speed increase. Driver updates rarely cause dramatic improvements but occasionally fix specific performance regressions that affected certain GPU models.
Fix 11: WSL2 Memory Limits on Windows {#fix-11}
Likelihood: Low (1.5% of cases, but 15% among WSL2 users)
By default, WSL2 gets only 50% of your system RAM (or 8GB max on older Windows builds). If you have 32GB of RAM but WSL2 can only use 8GB, your model is memory-constrained even though the host machine has plenty.
Diagnose it:
# Inside WSL2, check available memory
free -h
# If total shows 8GB when your machine has 32GB, WSL2 is limited
# Check current WSL2 config
cat /mnt/c/Users/<YourWindowsUser>/.wslconfig 2>/dev/null || echo "No .wslconfig found"
# Note: your Windows username may differ from your Linux $USER inside WSL2
Fix it:
# In Windows (PowerShell), create/edit .wslconfig
notepad $env:USERPROFILE\.wslconfig
Add these settings:
[wsl2]
memory=24GB
swap=8GB
processors=8
# Restart WSL2 for changes to take effect
wsl --shutdown
# Then reopen your WSL2 terminal
# Verify inside WSL2
free -h
# Should now show ~24GB total
Recommendation: Set WSL2 memory to 75% of your system RAM. Leave the remaining 25% for Windows itself. More details in our Ollama system requirements guide.
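The 75% rule as a one-liner (a trivial sketch; adjust the fraction to taste):

```shell
#!/bin/sh
# wsl_memory_gb <system_ram_gb>  ->  suggested WSL2 memory cap (75% of RAM)
wsl_memory_gb() {
    awk -v r="$1" 'BEGIN { print int(r * 0.75) }'
}

wsl_memory_gb 32   # -> 24, matching the memory=24GB example above
wsl_memory_gb 64   # -> 48
```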
Fix 12: Docker Without GPU Passthrough {#fix-12}
Likelihood: Low (1% of all users, but very common among Docker users)
If you run Ollama inside Docker, the container does not get GPU access by default. Everything runs on CPU. You get 5-10 tok/s instead of 30-45 tok/s.
Diagnose it:
# Check if the Ollama container can see the GPU
docker exec -it ollama nvidia-smi
# If this fails with "command not found" or "no devices", GPU passthrough is not working
# Check how the container was started
docker inspect ollama | grep -i gpu
# If no GPU-related config appears, the container has no GPU access
Fix it:
# Step 1: Install NVIDIA Container Toolkit (if not installed)
# Follow: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Step 2: Recreate the Ollama container with GPU access
docker stop ollama && docker rm ollama
docker run -d --gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
# Step 3: Verify GPU access inside container
docker exec -it ollama nvidia-smi
# Should show your GPU with VRAM info
The critical flag is --gpus all. Without it, no GPU access. For more Docker configuration details, check the Ollama FAQ documentation.
The 60-Second Diagnostic Flowchart {#diagnostic}
Run these commands in order. Stop at the first one that reveals a problem.
# 1. Is the GPU even being used? (10 seconds)
nvidia-smi
# GPU-Util should be >0% while generating. If 0%, go to Fix #2.
# 2. Is VRAM maxed out? (5 seconds)
nvidia-smi | grep "MiB"
# If VRAM usage is >95% of total, go to Fix #1.
# 3. Is something else using VRAM? (5 seconds)
# Check "Processes" section of nvidia-smi output
# If Chrome/Discord/Games appear, go to Fix #5.
# 4. Is the GPU throttling? (10 seconds)
nvidia-smi --query-gpu=temperature.gpu,clocks.gr --format=csv,noheader
# If temp >85C or clock <70% of max, go to Fix #4.
# 5. Is the model using a bad quantization? (5 seconds)
ollama list
# If you see q2_K in the model name, go to Fix #3.
# 6. Is power management limiting the GPU? (5 seconds)
nvidia-smi --query-gpu=pstate --format=csv,noheader
# If P5-P8, go to Fix #9.
# 7. Are you in Docker without GPU? (10 seconds)
docker exec -it ollama nvidia-smi 2>/dev/null || echo "No GPU in container"
# If it fails, go to Fix #12.
If none of these reveal the issue, you likely have a model-specific or configuration-specific problem. Check the Ollama logs:
# Linux
journalctl -u ollama --no-pager | tail -50
# macOS
tail -50 ~/.ollama/logs/server.log
# Windows (in PowerShell)
Get-Content "$env:LOCALAPPDATA\Ollama\server.log" -Tail 50
Expected Performance: Know Your Baseline {#baseline}
Before troubleshooting, know what speed you should expect. These are realistic benchmarks for common hardware running Llama 3.2 7B Q4_K_M:
| GPU | VRAM | Expected tok/s | Notes |
|---|---|---|---|
| RTX 4090 | 24GB | 80-100 | Full GPU, no offloading |
| RTX 3090 | 24GB | 45-55 | Full GPU |
| RTX 4070 Ti | 12GB | 40-50 | Full GPU |
| RTX 3060 | 12GB | 35-42 | Full GPU |
| RTX 3060 | 8GB | 25-35 | Partial offload possible |
| GTX 1060 | 6GB | 8-15 | Heavy offloading to RAM |
| Apple M2 | 16GB unified | 30-35 | Full Metal acceleration |
| Apple M3 Pro | 18GB unified | 38-45 | Full Metal acceleration |
| CPU only (i7-12700) | N/A | 8-12 | No GPU acceleration |
If you are significantly below these numbers, one of the 12 fixes above will resolve it. If you are within 10-15% of these numbers, your setup is working correctly and the remaining optimization is marginal.
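The "within 10-15% of baseline" rule can be encoded directly; the 85% cutoff below is an assumption derived from that range:

```shell
#!/bin/sh
# speed_check <measured_tok_s> <baseline_low_tok_s>
# "healthy" when within ~15% of the low end of the expected range.
speed_check() {
    awk -v m="$1" -v lo="$2" 'BEGIN {
        if (m >= lo * 0.85) print "healthy"; else print "investigate"
    }'
}

speed_check 36 35   # RTX 3060 12GB measuring 36 tok/s -> healthy
speed_check 12 35   # far below baseline -> investigate
```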
For complete VRAM planning, see our VRAM requirements guide.
Conclusion
Slow local AI is almost always one of three things: the model does not fit in VRAM (Fix #1), the GPU is not being used at all (Fix #2), or something is eating your VRAM (Fix #5). Together, these three account for about 70% of all slow inference complaints.
Run the 60-second diagnostic flowchart. It will point you to the right fix. Save this page for the next time your inference speed drops unexpectedly — it usually means something changed in the background, not that your hardware degraded overnight.
For hardware upgrade planning, our troubleshooting guide covers additional edge cases and the system requirements guide helps you plan your next upgrade.