Strix Halo / AMD Ryzen AI Max+ 395 for Local AI (2026): 128GB Unified Memory in a Mini PC
Strix Halo — AMD's Ryzen AI Max+ 395 — is the most interesting local-AI hardware platform of 2025-2026. 128 GB of unified memory accessible by a 40-CU RDNA 3.5 iGPU, all in a 65-120 W mini-PC starting around $2,000. It runs Llama 3.1 70B in BF16 entirely on the iGPU. No consumer discrete NVIDIA card can do that. The Mac Studio M4 Max can, but costs twice as much.
This guide covers everything: the hardware spec, system options (Framework Desktop, Asus, HP), ROCm 6.3+ setup with gfx1151, BIOS memory allocation, real benchmarks vs Mac Studio M4 Max and dual-GPU NVIDIA, Ollama / vLLM / llama.cpp recipes, image and video generation, the XDNA 2 NPU situation, and the workloads where Strix Halo wins or loses.
Table of Contents
- What Strix Halo Is
- Why 128 GB Unified Memory Matters
- Hardware Specs
- Available Systems (Framework, Asus, HP)
- vs Mac Studio M4 Max
- vs Multi-GPU NVIDIA Builds
- BIOS: Allocating Memory to iGPU
- ROCm Setup for gfx1151
- Ollama on Strix Halo
- llama.cpp Native Build
- vLLM-ROCm
- PyTorch + Hugging Face
- Image Generation Performance
- Video Generation (Wan 2.2, Hunyuan)
- The XDNA 2 NPU
- Real Benchmarks
- Power, Thermals, Acoustics
- Use Cases Where Strix Halo Wins
- Where Discrete GPUs Still Win
- Buying Advice
- Troubleshooting
What Strix Halo Is {#what-it-is}
Strix Halo is the codename for the high-end variant of AMD's 2025 mobile/desktop AI platform branded as Ryzen AI Max+ 395 (and its lower-tier siblings 390, 385). Architecture:
- CPU: 16 Zen 5 cores / 32 threads (split across two CCDs)
- iGPU: 40 RDNA 3.5 compute units, codenamed gfx1151
- NPU: XDNA 2, ~50 TOPS INT8 (separate from CPU and iGPU)
- Memory: Up to 128 GB LPDDR5X-8000 unified (CPU + iGPU share it)
- Memory bandwidth: ~256 GB/s (256-bit LPDDR5X-8000 interface: 8,000 MT/s × 32 bytes per transfer)
- Process: TSMC N4P
- TDP: Configurable 45-120 W
Released early 2025. Successor (Medusa Halo) expected 2026-2027.
Why 128 GB Unified Memory Matters {#why-unified}
For local LLMs, the biggest constraint is "does the model fit on the GPU?" Discrete consumer GPUs cap at 24 GB (RTX 3090 / 4090 / 7900 XTX) or 32 GB (RTX 5090). At 24 GB you can run:
- 8B models in FP16 ✅
- 14B models in INT4-INT8 ✅
- 32B models in INT4 ✅
- 70B models only with offload (slow) ❌
- 70B models in FP16 ❌
- 405B models ❌
128 GB unified memory unlocks:
- 70B models in BF16 ✅ (no quantization quality loss)
- 100B-class models in INT4 ✅
- 200B+ MoE models in INT4 ✅
- Long contexts (131K+) on big models ✅
The trade-off: unified memory bandwidth (~256 GB/s) is much lower than discrete VRAM (~1,000+ GB/s), so per-token speed on small models is lower. But for big models, discrete GPUs can't run them at all without offloading.
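To see where these cut-offs come from, here is a rough back-of-the-envelope estimator (a sketch; real footprints vary with architecture, context length, and runtime overhead):

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Approximate weight memory: parameters x bits per weight, plus ~10%
    for higher-precision embeddings/norms and runtime buffers."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9 * overhead

examples = [
    ("8B FP16", 8, 16),          # ~18 GB: fits a 24 GB card, with little KV headroom
    ("32B INT4 (AWQ)", 32, 4),   # ~18 GB: fits a 24 GB card
    ("70B Q4_K_M", 70, 4.8),     # ~46 GB: needs offload on 24 GB, easy in a 96 GB UMA carve-out
]
for name, params, bits in examples:
    print(f"{name:>16}: ~{weight_footprint_gb(params, bits):.0f} GB")
```

The KV cache comes on top of the weights and grows linearly with context length, which is why long contexts on big models are listed separately above.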
Hardware Specs {#specs}
| Spec | Ryzen AI Max+ 395 | Ryzen AI Max 390 | Ryzen AI Max 385 |
|---|---|---|---|
| Cores / threads | 16 / 32 | 12 / 24 | 8 / 16 |
| Boost clock (CPU) | 5.1 GHz | 5.0 GHz | 5.0 GHz |
| iGPU | Radeon 8060S | 8050S | 8050S |
| iGPU CUs | 40 | 32 | 32 |
| iGPU clock | 2.9 GHz | 2.8 GHz | 2.7 GHz |
| NPU TOPS (INT8) | 50 | 50 | 50 |
| Total platform TOPS | 126 | 120 | 117 |
| Memory | LPDDR5X-8000 | LPDDR5X-8000 | LPDDR5X-8000 |
| Max memory | 128 GB | 128 GB | 64 GB |
| Memory bandwidth | 256 GB/s | 256 GB/s | 256 GB/s |
| TDP | 45-120 W | 45-120 W | 45-120 W |
For local AI the 395 is the only variant worth considering — fewer iGPU CUs on the 390/385 means proportionally lower throughput. Always pair with the 128 GB memory option.
Available Systems (Framework, Asus, HP) {#systems}
Mid-2026 platforms shipping Strix Halo:
| System | Form factor | Memory options | Starting price |
|---|---|---|---|
| Framework Desktop | Mini-ITX-ish | 32 / 64 / 128 GB | $1,999 (128 GB) |
| Asus ROG NUC AI | Mini-PC | 64 / 128 GB | $2,500-3,000 |
| HP Z2 Mini G9 | Compact desktop | 64 / 128 GB | $3,000+ |
| HP ZBook Studio | Workstation laptop | 64 / 128 GB | $3,500-4,000 |
| Asus ROG Flow Z13 | 2-in-1 tablet | 32 / 64 / 128 GB | $2,500+ |
| HP Omen Transcend | Gaming laptop | 64 / 128 GB | $2,800-3,500 |
| GMKtec EVO-X2 | Mini-PC | 64 / 128 GB | $2,200-2,500 |
Best value: Framework Desktop with 128 GB. $1,999 mid-2026. Mini-ITX-style chassis, replaceable I/O, full Linux support, 65W typical / 120W boost. The closest to "open hardware" in this segment.
vs Mac Studio M4 Max {#vs-mac}
| Aspect | Strix Halo (Framework Desktop 128GB) | Mac Studio M4 Max 128GB |
|---|---|---|
| Price | ~$2,000 | ~$4,000-4,200 |
| Memory bandwidth | 256 GB/s | 546 GB/s (M4 Max, 128 GB config) |
| Llama 3.1 8B Q4_K_M (tok/s) | 48 | 55 |
| Llama 3.1 70B Q4_K_M (tok/s) | 32 | 28 |
| Llama 3.1 70B BF16 (tok/s) | 14 | ~13 |
| SDXL 1024² (sec) | 14-18 | 18-25 |
| Power (idle / load) | 25W / 90W | 15W / 80W |
| OS | Linux / Windows | macOS |
| Software ecosystem | ROCm + open-source | MLX + open-source |
| Form factor | Mini-ITX | Studio (compact desktop) |
Mac wins on per-core efficiency, build quality, and macOS integration. Strix Halo wins on price (~50% less), Linux ecosystem, x86 software compatibility, and slightly higher 70B throughput. For pure $/perf on local LLMs in 2026, Strix Halo is hard to beat.
vs Multi-GPU NVIDIA Builds {#vs-nvidia}
A 2x RTX 4090 PCIe rig costs $3,000-3,500 (GPUs alone) plus system, total $4,500+. It delivers:
- Llama 3.1 8B FP16: 250+ tok/s (vs Strix Halo 48)
- Llama 3.1 70B AWQ: 38 tok/s (vs Strix Halo 32)
- Llama 3.1 70B BF16: doesn't fit in 48 GB
- SDXL 1024²: 4 sec (vs Strix Halo 14-18)
- Power: 600-900 W full-load (vs Strix Halo 90 W)
The dual-4090 wins on small-model speed and image gen. Strix Halo wins on 70B BF16 capability, power, noise, form factor, and price. For most "I want to run 70B locally without compromises" buyers, Strix Halo is the smarter choice.
BIOS: Allocating Memory to iGPU {#bios}
By default, Strix Halo systems allocate ~16-32 GB to the iGPU and leave the rest for system RAM. To run 70B models on the iGPU, you need to allocate ~96-110 GB to it.
Most BIOSes expose this under: Advanced → AMD CBS → NBIO Common Options → GFX Configuration → UMA Frame Buffer Size.
| Workload | Recommended iGPU allocation |
|---|---|
| LLMs only | 96 GB (leaves 32 GB for system) |
| LLM + image gen | 80 GB (more system RAM for ComfyUI buffers) |
| 70B BF16 + 32K context | 110 GB (max practical) |
| Mixed workstation | 64 GB (balance) |
Some systems (Framework Desktop, ROG Flow) expose this in a simpler "AI Memory Reservation" menu. Reboot required after change.
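One quick way to confirm the carve-out took effect after the reboot is to check how much RAM Linux still sees (a sketch; the expected value depends on your allocation, here assumed to be 96 GB on a 128 GB system):

```python
# Read MemTotal from /proc/meminfo (value is in KiB).
with open("/proc/meminfo") as f:
    mem_total_kib = int(next(line for line in f if line.startswith("MemTotal")).split()[1])

print(f"RAM visible to the OS: {mem_total_kib / 1024**2:.1f} GiB")
# With 96 GB reserved for the iGPU on a 128 GB system, expect roughly 32 GiB here;
# rocm-smi (see the ROCm section below) reports the iGPU side.
```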
ROCm Setup for gfx1151 {#rocm}
ROCm 6.3+ has official Strix Halo (gfx1151) support:
```bash
wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/jammy/amdgpu-install_6.3.60300-1_all.deb
sudo apt install ./amdgpu-install*.deb
sudo amdgpu-install --usecase=rocm,hiplibsdk -y
sudo usermod -aG render,video $USER
sudo reboot
```
Some workloads still need the gfx version override:
```bash
echo 'export HSA_OVERRIDE_GFX_VERSION=11.5.1' >> ~/.bashrc
echo 'export HCC_AMDGPU_TARGET=gfx1151' >> ~/.bashrc
source ~/.bashrc
```
Verify:
```bash
rocminfo | grep gfx
# Expected: gfx1151 (Strix Halo iGPU)
rocm-smi --showmeminfo vram
# Should show ~96-110 GB available depending on BIOS allocation
```
Ollama on Strix Halo {#ollama}
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Edit systemd to set the override:
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
Environment="HCC_AMDGPU_TARGET=gfx1151"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Run a 70B model — fits entirely in unified memory
ollama run llama3.1:70b
```
For BF16 70B (large but possible on 128 GB):
```bash
ollama run llama3.1:70b-instruct-fp16
```
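Once the service is running, any HTTP client can talk to it. A minimal sketch against Ollama's native chat endpoint (model name and prompt are just examples):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:70b",
        "messages": [{"role": "user", "content": "Summarize unified memory in two sentences."}],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=600,  # first-token latency on a cold 70B load can be long
)
print(resp.json()["message"]["content"])
```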
llama.cpp Native Build {#llamacpp}
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" \
HIP_PATH="$(hipconfig -R)" \
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

./build/bin/llama-cli -m llama-3.1-70b-instruct-Q4_K_M.gguf -ngl 999 -fa
```
For long context, quantize the KV cache: add --cache-type-k q4_0 --cache-type-v q4_0 to the command above (quantizing the V cache requires flash attention, already enabled via -fa). This roughly halves KV-cache memory at a small quality cost.
vLLM-ROCm {#vllm}
```bash
docker pull rocm/vllm:latest

docker run --device /dev/kfd --device /dev/dri \
  --group-add video --group-add render \
  --security-opt seccomp=unconfined \
  --shm-size 16G \
  -e HSA_OVERRIDE_GFX_VERSION=11.5.1 \
  -p 8000:8000 \
  rocm/vllm:latest \
  vllm serve casperhansen/llama-3.1-70b-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.85
```
vLLM-ROCm on Strix Halo is functional but throughput-bound by the 256 GB/s memory bandwidth — single-stream decode is similar to llama.cpp / Ollama, but multi-user batching has less headroom than discrete GPUs.
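The container exposes an OpenAI-compatible API on port 8000, so existing clients work unchanged. A minimal sketch with the openai Python package (the model name must match whatever was passed to vllm serve; the prompt is illustrative):

```python
from openai import OpenAI

# vLLM ignores the API key unless one was configured at startup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="casperhansen/llama-3.1-70b-instruct-awq",
    messages=[{"role": "user", "content": "Give me three uses for 128 GB of unified memory."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```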
PyTorch + Hugging Face {#pytorch}
```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.3
```
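Before loading a 70B model, a quick sanity check that the ROCm build sees the iGPU is worthwhile (on ROCm, PyTorch exposes the device through the torch.cuda API; the exact name string and visible memory depend on your driver and BIOS allocation):

```python
import torch

print(torch.__version__, torch.version.hip)   # hip is None on CUDA-only builds
print(torch.cuda.is_available())              # should be True
print(torch.cuda.get_device_name(0))          # the Radeon 8060S iGPU
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.0f} GB visible")
```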
Run any HF Transformers model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
The 128 GB unified memory means BF16 70B loads without sharding across devices. Inference runs at roughly 14 tok/s; training is impractical (compute-bound, not memory-bound).
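Continuing the snippet above, a minimal generation call (prompt and sampling settings are illustrative):

```python
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

messages = [{"role": "user", "content": "Explain unified memory in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=200)

# Strip the prompt tokens and decode only the newly generated text.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```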
Image Generation Performance {#image-gen}
| Workflow | Strix Halo | RTX 4090 | M4 Max 128GB |
|---|---|---|---|
| SDXL 1024² | 14-18 sec | 4 sec | 18-25 sec |
| SDXL Lightning (8 steps) | 5 sec | 1.5 sec | 7 sec |
| Flux Schnell (4 steps) | 10 sec | 3 sec | 12 sec |
| Flux Dev FP8 (25 steps) | 30 sec | 12 sec | 35 sec |
| SD 3.5 Large (28 steps) | 20 sec | 6 sec | 25 sec |
Strix Halo is roughly 3-4x slower than an RTX 4090 for image generation (compute-bound, not memory-bound). For LLM-primary use cases this is acceptable; if image generation is your main workload, get a discrete GPU.
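If you do want occasional image generation on the iGPU, the standard diffusers workflow runs on the ROCm PyTorch build. A minimal sketch (depending on your ROCm version, the HSA_OVERRIDE_GFX_VERSION export from earlier may still be required):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# On ROCm, the iGPU is addressed through the "cuda" device string.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    "a red fox in fresh snow, golden hour, photo",
    num_inference_steps=30,
    height=1024,
    width=1024,
).images[0]
image.save("fox.png")
```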
Video Generation (Wan 2.2, Hunyuan) {#video-gen}
| Model | Strix Halo | RTX 4090 |
|---|---|---|
| Wan 2.2 (5 sec, 720p, Q8) | 18-25 min | 6-10 min |
| HunyuanVideo (5 sec, 720p, Q4) | 35-50 min | 12-20 min |
| Mochi (5 sec, 480p) | 12-18 min | 5-8 min |
Slow but functional for occasional generation. The 128 GB memory means you can fit unquantized video models that don't fit on a 4090, but compute time makes routine use impractical.
The XDNA 2 NPU {#npu}
The 50-TOPS XDNA 2 NPU is largely unused for general LLM inference as of mid-2026. AMD's Lemonade SDK and Ryzen AI Software stack target it for specific accelerated paths — Llama 3.1 8B, Phi 3.5 Mini, etc. — but the iGPU is faster for most workloads.
The NPU shines for:
- Always-on background AI (Windows Studio Effects, Recall)
- Battery-sensitive laptop workloads
- Specific quantized models AMD ships optimized kernels for
For mainstream LLM inference, ignore the NPU and use the iGPU. Future releases of AMD's Ryzen AI software and XDNA driver stack may expand NPU coverage.
Real Benchmarks {#benchmarks}
Framework Desktop 128 GB, Ubuntu 24.04, ROCm 6.3, BIOS UMA = 96 GB.
| Workload | tok/s |
|---|---|
| Llama 3.2 1B Q4_K_M | 110 |
| Llama 3.2 3B Q4_K_M | 78 |
| Qwen 2.5 7B Q4_K_M | 52 |
| Llama 3.1 8B Q4_K_M | 48 |
| Qwen 2.5 14B Q4_K_M | 32 |
| Qwen 2.5 32B AWQ-INT4 | 22 |
| Llama 3.3 70B Q4_K_M | 32 |
| Llama 3.3 70B Q5_K_M | 26 |
| Llama 3.3 70B BF16 | 14 |
| Mixtral 8x7B Q4 | 38 |
| DeepSeek V3 235B (INT4 partial-fit) | 8 |
For Mac M4 Max comparison see Apple M4 for AI Guide. For dual-GPU comparison see Multi-GPU Ollama Setup.
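One simple way to sanity-check decode speed on your own unit is to read the timing fields Ollama returns for a non-streamed request (a sketch; results vary with power profile, context length, and quantization):

```python
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:70b", "prompt": "Explain KV caches briefly.", "stream": False},
    timeout=1200,
).json()

# Durations are reported in nanoseconds.
# (prompt_eval_* fields can be absent if the prompt was served from cache.)
decode_tps = r["eval_count"] / (r["eval_duration"] / 1e9)
prefill_tps = r["prompt_eval_count"] / (r["prompt_eval_duration"] / 1e9)
print(f"prefill: {prefill_tps:.1f} tok/s, decode: {decode_tps:.1f} tok/s")
```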
Power, Thermals, Acoustics {#power}
Configurable cTDP from 45 W to 120 W. Sweet spots:
| Profile | cTDP | LLM tok/s (70B Q4) | Noise |
|---|---|---|---|
| Quiet | 65 W | 24 | Whisper-quiet |
| Balanced | 90 W | 30 | Audible at desk |
| Performance | 120 W | 32 | Notable fan |
For 24/7 inference servers, 65-90 W is the sweet spot; throughput gains above 90 W are small (30 to 32 tok/s for an extra 30 W in the table above). Most Strix Halo systems ship with adjustable power profiles in the BIOS or a vendor utility.
Use Cases Where Strix Halo Wins {#use-cases}
- 70B local inference on a single device — no consumer NVIDIA card matches.
- Large model fine-tuning data prep — load full BF16 model for inspection / activation logging.
- Multi-tenant home server — 128 GB lets you run multiple medium models simultaneously.
- Quiet / efficient desktop — 65-90 W for usable 70B inference.
- Privacy / air-gapped 70B — single mini-PC, easy to lock down. See Air-Gapped AI Deployment.
- Long-context RAG / agents — can fit full 131K context on 70B with Q4 KV cache.
- Mac Studio alternative for PC users — same use case, half the price.
Where Discrete GPUs Still Win {#discrete-wins}
- Small model speed — RTX 4090 hits 127 tok/s on 8B vs 48 on Strix Halo.
- Image generation — 3-4x faster on RTX 4090.
- Multi-user concurrent serving — vLLM PagedAttention scales better with VRAM bandwidth.
- Fine-tuning — discrete GPU compute throughput dominates.
- TensorRT-LLM — NVIDIA only.
- FP8 hardware — RDNA 3.5 lacks FP8.
- Latency-sensitive paths — discrete VRAM bandwidth wins on first-token-time.
Buying Advice {#buying}
Buy Strix Halo if:
- You want to run 70B models locally without offload pain.
- You value quiet, low-power, mini-PC form factor.
- You're comfortable with Linux + ROCm or Windows + WSL2.
- $2,000-3,000 budget; Mac Studio M4 Max is too expensive.
- LLMs are the primary workload; image gen is occasional.
Buy a discrete GPU instead if:
- You primarily run 7B-32B models — RTX 4090 / 7900 XTX much faster.
- You do heavy image / video generation.
- You need TensorRT-LLM or ExLlamaV2.
- You serve many concurrent users.
Buy Mac Studio M4 Max instead if:
- You're already in the Apple ecosystem.
- You need MLX-specific frameworks.
- Budget allows the ~2x premium.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| iGPU only sees 16 GB | BIOS UMA too low | Increase to 96+ GB in BIOS |
| Ollama uses CPU only | gfx version mismatch | Set HSA_OVERRIDE_GFX_VERSION=11.5.1 |
| ROCm "no agents found" | Driver / groups | render+video group, reboot |
| Throttling under sustained load | TDP / thermal | Increase cTDP profile or improve cooling |
| WSL2 GPU not visible | AMD WSL driver missing | Install AMD Software for WSL on host |
| Long prompt prefill slow | Compute-bound | Expected — Strix Halo trades compute for memory |
Sources: AMD Ryzen AI Max product page | Framework Desktop announcement | ROCm 6.3 release notes | Internal benchmarks Framework Desktop 128 GB.