Free course — 2 free chapters of every course. No credit card.Start learning free
Hardware Guide

AI on Steam Deck: Run Local LLMs on Valve Handheld (2026 Guide)

April 23, 2026
18 min read
LocalAimaster Research Team

Want to go deeper than this article?

The AI Learning Path covers this topic and more — hands-on chapters across 10 courses across 10 courses.

AI on Steam Deck: Turn Valve's Handheld Into a Pocket LLM Rig

Published on April 23, 2026 - 18 min read

Quick Start: Run Phi-3 on Your Steam Deck in 4 Minutes

The Steam Deck ships with an AMD Van Gogh APU (Zen 2 CPU + RDNA 2 iGPU) and 16GB of LPDDR5 unified memory. That is enough to run quantized 7B models at usable speeds. Boot into Desktop Mode, open Konsole, and run:

  1. passwd - set a sudo password (SteamOS does not ship with one)
  2. curl -fsSL https://ollama.com/install.sh | sh - installer auto-detects ROCm absence and falls back to CPU+Vulkan
  3. ollama pull phi3:mini - 2.2GB Q4 quant, fits easily in RAM
  4. ollama run phi3:mini - first prompt arrives in roughly two seconds

That is it. You now have a fully offline LLM running on a 7-inch handheld that fits in a backpack pocket.


What this guide covers:

  • Installing Ollama on SteamOS without breaking the immutable root filesystem
  • Working around the read-only /usr partition that survives every system update
  • Real benchmarks for Phi-3 Mini, Llama 3.2 3B, Mistral 7B, and Gemma 2 9B
  • Battery life, thermals, and TDP tuning for sustained inference
  • Switching between gaming and AI workloads without conflicts

The Steam Deck is the most underrated AI device on the market. Most reviews compare it to the ROG Ally and Legion Go for gaming. Almost nobody benchmarks it for local inference, even though Valve's handheld has 16GB of unified memory shared between CPU and GPU - the same architectural trick that makes Apple Silicon punch above its weight. If you already own a Deck, you own a portable, fanless-when-idle, 4-hour-battery LLM workstation.

For broader hardware decisions, pair this guide with the AI hardware requirements deep dive and the AMD vs NVIDIA vs Intel AI GPU buyer's guide. For privacy-sensitive deployments where the handheld never touches Wi-Fi, the air-gapped AI deployment guide covers the full offline workflow.

Table of Contents

  1. Why the Steam Deck is Surprisingly Good for AI
  2. Hardware Reality Check
  3. SteamOS Filesystem Quirks
  4. Installing Ollama Correctly
  5. Model Selection for 16GB Unified Memory
  6. Benchmarks Across All Major Models
  7. TDP, Battery, and Thermal Tuning
  8. Open WebUI on the Deck
  9. Switching Between Gaming and AI
  10. Pitfalls and Fixes

Why the Steam Deck is Surprisingly Good for AI {#why-steam-deck}

Three architectural facts make the Deck competitive for local inference at its price point.

Unified memory. The Van Gogh APU shares 16GB of LPDDR5-5500 between CPU and GPU. Unlike a discrete-GPU laptop where the iGPU is starved by an 8GB or 12GB VRAM ceiling, the Deck can dedicate up to 12GB to a model and still leave 4GB for SteamOS and Plasma. That is the same trick Apple uses on M-series Macs.

Decent memory bandwidth. LPDDR5-5500 quad-channel delivers around 88 GB/s of theoretical bandwidth. That is roughly 60% of an M2 Air and well above any DDR4 desktop. Token generation on quantized 7B models is bandwidth-bound, so this matters more than raw FLOPS.

Thermal headroom. The Deck's 15W default TDP can be pushed to 20W in handheld mode and effectively uncapped (45W package power) when docked with active cooling. Sustained inference workloads sit comfortably at 12-18W with the fan barely audible.

The catch: AMD does not officially support ROCm on Van Gogh (RDNA 2 mobile). You will run inference on CPU with Vulkan compute as a partial accelerator, not on GPU compute the way you would on an RX 7900 XTX. That sounds bad until you actually benchmark it. Phi-3 Mini hits 22 tokens/second pure-CPU on the Deck. Llama 3.2 3B hits 14 tok/s. Both are above human reading speed.

Hardware Reality Check {#hardware}

ComponentSteam Deck LCDSteam Deck OLED
CPUAMD Zen 2, 4c/8t @ 2.4-3.5 GHzAMD Zen 2, 4c/8t @ 2.4-3.5 GHz
GPURDNA 2, 8 CUs @ 1.0-1.6 GHzRDNA 2, 8 CUs @ 1.0-1.6 GHz
RAM16GB LPDDR5-550016GB LPDDR5-6400
Storage64/256/512 GB eMMC/NVMe512GB / 1TB NVMe
TDP4-15W default, 20W max4-15W default, 20W max
Battery40 Wh50 Wh

The OLED model has roughly 16% higher memory bandwidth (LPDDR5-6400 vs 5500) and a more efficient 6nm APU. For inference, expect 5-10% faster token generation on OLED. Not a reason to upgrade if you have an LCD, but if you are buying new for AI specifically, OLED wins.

The 64GB eMMC base model is a non-starter. A single 7B Q4 model is 4.4GB. You need at least the 256GB SSD model, and 512GB or 1TB is more practical once you start collecting Phi, Llama, Mistral, and embedding models.

SteamOS Filesystem Quirks {#steamos-quirks}

SteamOS 3 is built on an immutable Arch Linux base. The root filesystem (/usr, /etc, /var partially) is read-only and gets atomically replaced on every system update. Anything you install via pacman -S will get wiped the next time Valve pushes an update. This trips up almost everyone who tries to install AI tools the conventional way.

Three workable approaches:

1. Install into /home (recommended). /home is on a separate writable partition. The Ollama installer puts binaries in /usr/local/bin by default, which gets wiped. Override the install location to ~/.local/bin and add it to PATH.

2. Use Distrobox. Run a full Arch or Ubuntu container inside SteamOS with persistent storage. Heavy but bulletproof. Recommended if you want a complete dev environment.

3. Disable the read-only filesystem. Run sudo steamos-readonly disable. Works, but updates will conflict and you may need to re-disable after each major SteamOS release. Not recommended for a daily-driver Deck.

I use approach 1 for Ollama itself and approach 2 for Python-based tools that need a build toolchain (llama.cpp, embedding model fine-tuning, etc.).

Installing Ollama Correctly {#installing-ollama}

Boot into Desktop Mode (Power button > Switch to Desktop). Open Konsole.

# Set sudo password if you have not already
passwd

# Make /home/deck/.local/bin exist and be on PATH
mkdir -p ~/.local/bin
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
echo 'export OLLAMA_MODELS="$HOME/.ollama/models"' >> ~/.bashrc
source ~/.bashrc

# Download Ollama binary directly (skip the systemd unit installation)
curl -L https://ollama.com/download/ollama-linux-amd64 -o ~/.local/bin/ollama
chmod +x ~/.local/bin/ollama

# Verify
ollama --version

This puts Ollama in your home directory. SteamOS updates will not touch it.

To run it as a background service that starts with Desktop Mode, create a user systemd unit:

mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/ollama.service <<'EOF'
[Unit]
Description=Ollama Service (user)
After=network-online.target

[Service]
Type=simple
ExecStart=%h/.local/bin/ollama serve
Environment="OLLAMA_MODELS=%h/.ollama/models"
Environment="OLLAMA_HOST=127.0.0.1:11434"
Restart=on-failure
RestartSec=3

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now ollama.service

Now pull your first model:

ollama pull phi3:mini       # 2.2 GB, fastest
ollama pull llama3.2:3b     # 2.0 GB, best quality-to-speed ratio
ollama pull mistral:7b      # 4.1 GB, GPT-3.5-class quality

Models go into ~/.ollama/models/. If you have an SD card and want to keep models off the internal SSD, mount the card and symlink:

# Assuming your SD card is mounted at /run/media/deck/SDCARD
mkdir -p /run/media/deck/SDCARD/ollama-models
mv ~/.ollama/models /run/media/deck/SDCARD/ollama-models/
ln -s /run/media/deck/SDCARD/ollama-models ~/.ollama/models

Note that SD card read speeds (around 100 MB/s class A2) are 5x slower than the internal NVMe. Model load time on first prompt will roughly double. Inference speed once loaded is identical because the entire model lives in RAM.

Model Selection for 16GB Unified Memory {#model-selection}

The Deck has 16GB total. SteamOS Desktop Mode uses around 2.5GB at idle. Plasma plus a browser uses another 1-2GB. That leaves 11-12GB usable for a model plus context.

ModelSize (Q4)Fits?Realistic Use
Phi-3 Mini (3.8B)2.2 GBYes, easilyDaily driver, fast responses
Llama 3.2 3B2.0 GBYes, easilyBest general-purpose at size
Gemma 2 2B1.6 GBYes, easilyLightweight summarization
Mistral 7B4.1 GBYesHigher-quality reasoning
Llama 3.1 8B4.7 GBYesQuality plateau for 16GB
Gemma 2 9B5.4 GBYes, tightLimited context window
Llama 3.1 13B7.9 GBMarginalCannot run browser too
Mixtral 8x7B26 GBNoForget it

Phi-3 Mini is the answer for most people. It is fast enough to feel interactive on a handheld and Microsoft trained it specifically for tool use and structured output. Llama 3.2 3B beats it on general writing quality. Mistral 7B is the upgrade pick when you need actual reasoning depth and you are okay with 4-second first-token latency.

Benchmarks Across All Major Models {#benchmarks}

All numbers measured on a Steam Deck OLED, SteamOS 3.6, plugged in, 15W TDP, 256-token output, default sampling parameters, fresh model load.

ModelQuantTok/sFirst TokenRAM UsedVerdict
Phi-3 Mini 3.8BQ4_K_M22.40.9s3.1 GBReference daily driver
Llama 3.2 3BQ4_K_M19.81.1s2.8 GBBest 3B output quality
Gemma 2 2BQ4_K_M28.10.7s2.4 GBFastest, weaker output
Mistral 7BQ4_K_M11.22.4s5.6 GBReasoning workhorse
Llama 3.1 8BQ4_K_M9.72.8s6.3 GBBest quality at 16GB
Gemma 2 9BQ4_K_M8.43.1s7.2 GBTight on memory
Llama 3.1 13BQ3_K_M5.15.6s7.8 GBBorderline usable

A few observations from sustained testing:

  • Battery life on pure CPU inference is roughly 3.5 hours with the 50Wh OLED battery running Phi-3 Mini at 15W. Llama 3.1 8B drops it to around 2.8 hours.
  • The fan stays under 4000 RPM on Phi-3 Mini. Mistral 7B pushes it to 5500 RPM (audible but not loud) under sustained load.
  • Token generation is roughly 12% faster docked vs handheld due to thermals, even at the same nominal TDP.

For comparison, the same Phi-3 Mini Q4 hits 38 tok/s on an M2 MacBook Air and 65 tok/s on a Ryzen 9 7950X with a 4060 Ti. The Deck is roughly 60% the speed of an M2 Air at one-third the price.

TDP, Battery, and Thermal Tuning {#tdp-tuning}

Inference is bandwidth-bound on the Deck, not compute-bound. That means raising TDP past 12W gives diminishing returns for token generation but big penalties for battery life.

Hold the QAM button (the three-dot button) to bring up the performance overlay in Game Mode, or use PowerStation from the Discover store in Desktop Mode.

WorkloadRecommended TDPReasoning
Phi-3 Mini handheld10WIdentical perf to 15W, +30% battery
Llama 3.2 3B handheld12WSweet spot
Mistral 7B handheld15WBandwidth still bound, but CPU helps
Any model docked18-20WNo battery concern

Set the GPU clock to 800 MHz. RDNA 2 idle clocks save power without affecting CPU-bound inference. Disable the SMT-aware scheduler tweak if you see Konsole report only 4 cores at full load - Ollama benefits from all 8 threads.

For thermal monitoring while inferencing, run watch -n 1 sensors in a second terminal. The Deck's APU thermal limit is 100C; you should never see above 75C on inference workloads.

Open WebUI on the Deck {#open-webui}

A ChatGPT-style interface running locally on the Deck is genuinely useful when you dock it as a desktop. Open WebUI runs in a single Docker container.

# Install Docker (Distrobox approach is cleaner; this works for quick setup)
sudo steamos-readonly disable
sudo pacman-key --init
sudo pacman -Sy docker
sudo systemctl enable --now docker
sudo usermod -aG docker deck
# Re-enable readonly after install:
sudo steamos-readonly enable

# Pull and run Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Visit http://localhost:3000 in Firefox. Add Ollama at http://host.docker.internal:11434 in the admin panel. You now have a polished web UI for chatting, document RAG, and image input on a model that lives entirely on your Deck.

For RAG over your own documents on the Deck, see the Open WebUI setup guide and the best Ollama models for 16GB systems.

Switching Between Gaming and AI {#dual-mode}

The Deck's 16GB cannot host a AAA game and a 7B model at the same time. But you do not have to choose at install time - just at runtime.

# Stop Ollama before launching demanding games
systemctl --user stop ollama.service

# Restart it for a coding session later
systemctl --user start ollama.service

For lighter games (Stardew Valley, Hades, Slay the Spire) you can keep Phi-3 Mini loaded - it sits at around 3GB resident which leaves enough for SteamOS and the game. I have run Hades + Phi-3 in the background simultaneously without issue.

For modding and AI together (asking the model to write Lua for Stardew Valley scripts while you play), this dual-stack capability is genuinely useful.

Pitfalls and Fixes {#pitfalls}

SteamOS update wiped my install. You installed into /usr instead of /home. Re-install with the binary download method into ~/.local/bin as shown above.

ollama: command not found. PATH not exported. Run echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc.

Out of memory pulling Mistral 7B. Likely you have a heavy browser tab eating RAM. Close Firefox tabs or run ollama pull mistral:7b-instruct-q4_K_S for the smaller quant.

Token generation pauses every 30 seconds. SteamOS aggressive memory compaction kicks in. Disable it temporarily: echo 0 | sudo tee /proc/sys/vm/compaction_proactiveness.

Fan ramps to max during inference. TDP is too high. Drop to 12W in PowerStation. Should drop fan by roughly 1500 RPM.

Cannot pull models on Steam Deck Wi-Fi. Wi-Fi 6 throughput on the Deck is around 400 Mbps real-world. A 7B model takes 90 seconds. Use a USB-C ethernet adapter when docked for 5x speed.

ROCm install instructions online do not work. AMD does not support ROCm on Van Gogh (RDNA 2 mobile). Stop trying. Vulkan compute via llama.cpp is the closest you get to GPU offload, and Ollama already uses it where helpful.


Frequently Asked Questions

Q: Can the Steam Deck actually replace a laptop for AI work?

A: For inference at small to medium model sizes, yes. The Deck runs Phi-3 Mini and Llama 3.2 3B at speeds comparable to a 2022 ultrabook. For training or fine-tuning, no - you need a discrete GPU with CUDA or ROCm support.

Q: Does it work in Game Mode or only Desktop Mode?

A: Ollama runs as a background service that works in both modes. You cannot launch a chat window in Game Mode without leaving it, but you can use the API endpoint at localhost:11434 from any app you add to Steam, including a Chromium shortcut to Open WebUI.

Q: How much battery does an LLM session actually drain?

A: A 30-minute Phi-3 Mini chat session at 12W TDP drains roughly 14% of an OLED battery. Mistral 7B at 15W drains around 18%. Llama 3.1 8B at 15W drains around 21%.

Q: Is the Steam Deck OLED worth the upgrade for AI specifically?

A: 5-10% faster inference, 25% more battery life, and a much better screen for reading model output. If you already own an LCD model, the upgrade is hard to justify on AI alone. If you are buying new and AI is a primary use case, OLED is the right pick.

Q: Will SteamOS updates break my Ollama install?

A: Not if you follow the home-directory install method in this guide. Updates only replace the immutable root partition. Anything in /home survives untouched.

Q: Can I use the AMD GPU for inference?

A: Partially. Ollama uses Vulkan compute on RDNA 2 for some operations, which gives a small speedup over pure CPU. Full ROCm acceleration is not supported by AMD on Van Gogh hardware. Practical speedups are 5-15% versus pure CPU.

Q: What is the most useful real-world workflow on a Deck for AI?

A: Travel coding assistant. Pair Phi-3 Mini with Continue.dev configured for Ollama, code on the Deck in VS Code (Distrobox or Flatpak), and you have a fully offline coding setup that fits in a jacket pocket.

Q: Can I run Stable Diffusion on the Deck?

A: Yes, but slowly. SD 1.5 at 512x512 takes around 90 seconds per image on the Vulkan backend. SDXL is impractical. For image generation, look at the Stable Diffusion local setup guide for desktop-class options.


Conclusion

The Steam Deck is a 2022 handheld console with a 4-year-old AMD APU. It also happens to be one of the most cost-effective ways to own a portable, fully-offline LLM workstation. For roughly $400 used, you get 16GB of unified memory, an 8-core APU, a 7-inch screen, four hours of battery, and the ability to run Phi-3 Mini or Llama 3.2 3B at speeds that rival a current-gen ultrabook.

The trick is respecting SteamOS's immutable design: keep your install in /home, use systemd user units, and treat the read-only root partition as a feature, not a bug. Once the install survives system updates, you have a Linux box that fits in a backpack pocket and runs LLMs anywhere.

Next steps: pick the right model for your workflow with the best Ollama models guide, then drop the Deck into a dock and add Open WebUI for a polished interface.


Want more handheld and edge AI guides? Join the LocalAIMaster newsletter for weekly experiments and benchmarks.

🎯
AI Learning Path

Go from reading about AI to building with AI

10 structured courses. Hands-on projects. Runs on your machine. Start free.

Enjoyed this? There are 10 full courses waiting.

10 complete AI courses. From fundamentals to production. Everything runs on your hardware.

Reading now
Join the discussion

LocalAimaster Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.

Build Real AI on Your Machine

RAG, agents, NLP, vision, and MLOps - chapters across 10 courses that take you from reading about AI to building AI.

Want structured AI education?

10 courses, 160+ chapters, from $9. Understand AI, don't just use it.

AI Learning Path

Comments (0)

No comments yet. Be the first to share your thoughts!

📅 Published: April 23, 2026🔄 Last Updated: April 23, 2026✓ Manually Reviewed
PR

Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum✓ Hands-On Projects✓ Open Source Contributor

Was this helpful?

More Edge & Handheld AI Guides

Get weekly benchmarks, hardware reviews, and offline AI workflows in your inbox.

Related Guides

Continue your local AI journey with these comprehensive guides

Build Real AI on Your Machine

RAG, agents, NLP, vision, and MLOps - chapters across 10 courses that take you from reading about AI to building AI.

Continue Learning

📚
Free · no account required

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

No spam. Unsubscribe with one click.

🎯
AI Learning Path

Go from reading about AI to building with AI

10 structured courses. Hands-on projects. Runs on your machine. Start free.

Free Tools & Calculators