Strix Halo / AMD Ryzen AI Max+ 395 for Local AI (2026): 128GB Unified Memory in a Mini PC
Strix Halo — AMD's Ryzen AI Max+ 395 — is the most interesting local-AI hardware platform of 2025-2026. 128 GB of unified memory accessible by a 40-CU RDNA 3.5 iGPU, all in a 65-120 W mini-PC starting around $2,000. It runs Llama 3.1 70B in BF16 entirely on the iGPU. No consumer discrete NVIDIA card can do that. The Mac Studio M4 Max can, but costs twice as much.
This guide covers everything: the hardware spec, system options (Framework Desktop, Asus, HP), ROCm 6.3+ setup with gfx1151, BIOS memory allocation, real benchmarks vs Mac Studio M4 Max and dual-GPU NVIDIA, Ollama / vLLM / llama.cpp recipes, image and video generation, the XDNA 2 NPU situation, and the workloads where Strix Halo wins or loses.
Table of Contents
- What Strix Halo Is
- Why 128 GB Unified Memory Matters
- Hardware Specs
- Available Systems (Framework, Asus, HP)
- vs Mac Studio M4 Max
- vs Multi-GPU NVIDIA Builds
- BIOS: Allocating Memory to iGPU
- ROCm Setup for gfx1151
- Ollama on Strix Halo
- llama.cpp Native Build
- vLLM-ROCm
- PyTorch + Hugging Face
- Image Generation Performance
- Video Generation (Wan 2.2, Hunyuan)
- The XDNA 2 NPU
- Real Benchmarks
- Power, Thermals, Acoustics
- Use Cases Where Strix Halo Wins
- Where Discrete GPUs Still Win
- Buying Advice
- Troubleshooting
What Strix Halo Is {#what-it-is}
Strix Halo is the codename for the high-end variant of AMD's 2025 mobile/desktop AI platform branded as Ryzen AI Max+ 395 (and its lower-tier siblings 390, 385). Architecture:
- CPU: 16 Zen 5 cores / 32 threads (split across two CCDs)
- iGPU: 40 RDNA 3.5 compute units, codenamed gfx1151
- NPU: XDNA 2, ~50 TOPS INT8 (separate from CPU and iGPU)
- Memory: Up to 128 GB LPDDR5X-8000 unified (CPU + iGPU share it)
- Memory bandwidth: ~256 GB/s (256-bit LPDDR5X-8000 interface: 8,000 MT/s × 32 bytes per transfer)
- Process: TSMC N4P
- TDP: Configurable 45-120 W
Released early 2025. Successor (Medusa Halo) expected 2026-2027.
Why 128 GB Unified Memory Matters {#why-unified}
For local LLMs, the biggest constraint is "does the model fit on the GPU?" Discrete consumer GPUs cap at 24 GB (RTX 3090 / 4090 / 7900 XTX) or 32 GB (RTX 5090). At 24 GB you can run:
- 8B models in FP16 ✅
- 14B models in INT4-INT8 ✅
- 32B models in INT4 ✅
- 70B models only with offload (slow) ❌
- 70B models in FP16 ❌
- 405B models ❌
128 GB unified memory unlocks:
- 70B models in BF16 ✅ (no quantization quality loss)
- 100B-class models in INT4 ✅
- 200B+ MoE models in INT4 ✅
- Long contexts (131K+) on big models ✅
The trade-off: unified memory bandwidth (~256 GB/s) is much lower than discrete VRAM (~1,000+ GB/s), so per-token speed on small models is lower. But for big models, discrete GPUs can't run them at all without offloading.
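To see where these cut-offs come from, here is a rough back-of-the-envelope estimator (a sketch; real footprints vary with architecture, context length, and runtime overhead):

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Approximate weight memory: parameters x bits per weight, plus ~10%
    for higher-precision embeddings/norms and runtime buffers."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9 * overhead

examples = [
    ("8B FP16", 8, 16),          # ~18 GB: fits a 24 GB card, with little KV headroom
    ("32B INT4 (AWQ)", 32, 4),   # ~18 GB: fits a 24 GB card
    ("70B Q4_K_M", 70, 4.8),     # ~46 GB: needs offload on 24 GB, easy in a 96 GB UMA carve-out
]
for name, params, bits in examples:
    print(f"{name:>16}: ~{weight_footprint_gb(params, bits):.0f} GB")
```

The KV cache comes on top of the weights and grows linearly with context length, which is why long contexts on big models are listed separately above.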
Hardware Specs {#specs}
| Spec | Ryzen AI Max+ 395 | Ryzen AI Max 390 | Ryzen AI Max 385 |
|---|---|---|---|
| Cores / threads | 16 / 32 | 12 / 24 | 8 / 16 |
| Boost clock (CPU) | 5.1 GHz | 5.0 GHz | 5.0 GHz |
| iGPU | Radeon 8060S | 8050S | 8050S |
| iGPU CUs | 40 | 32 | 32 |
| iGPU clock | 2.9 GHz | 2.8 GHz | 2.7 GHz |
| NPU TOPS (INT8) | 50 | 50 | 50 |
| Total platform TOPS | 126 | 120 | 117 |
| Memory | LPDDR5X-8000 | LPDDR5X-8000 | LPDDR5X-8000 |
| Max memory | 128 GB | 128 GB | 64 GB |
| Memory bandwidth | 256 GB/s | 256 GB/s | 256 GB/s |
| TDP | 45-120 W | 45-120 W | 45-120 W |
For local AI the 395 is the only variant worth considering — fewer iGPU CUs on the 390/385 means proportionally lower throughput. Always pair with the 128 GB memory option.
Available Systems (Framework, Asus, HP) {#systems}
Mid-2026 platforms shipping Strix Halo:
| System | Form factor | Memory options | Starting price |
|---|---|---|---|
| Framework Desktop | Mini-ITX-ish | 32 / 64 / 128 GB | $1,999 (128 GB) |
| Asus ROG NUC AI | Mini-PC | 64 / 128 GB | $2,500-3,000 |
| HP Z2 Mini G9 | Compact desktop | 64 / 128 GB | $3,000+ |
| HP ZBook Studio | Workstation laptop | 64 / 128 GB | $3,500-4,000 |
| Asus ROG Flow Z13 | 2-in-1 tablet | 32 / 64 / 128 GB | $2,500+ |
| HP Omen Transcend | Gaming laptop | 64 / 128 GB | $2,800-3,500 |
| GMKtec EVO-X2 | Mini-PC | 64 / 128 GB | $2,200-2,500 |
Best value: Framework Desktop with 128 GB. $1,999 mid-2026. Mini-ITX-style chassis, replaceable I/O, full Linux support, 65W typical / 120W boost. The closest to "open hardware" in this segment.
vs Mac Studio M4 Max {#vs-mac}
| Aspect | Strix Halo (Framework Desktop 128GB) | Mac Studio M4 Max 128GB |
|---|---|---|
| Price | ~$2,000 | ~$4,000-4,200 |
| Memory bandwidth | 256 GB/s | 546 GB/s (M4 Max, 128 GB config) |
| Llama 3.1 8B Q4_K_M (tok/s) | 48 | 55 |
| Llama 3.1 70B Q4_K_M (tok/s) | 32 | 28 |
| Llama 3.1 70B BF16 (tok/s) | 14 | ~13 |
| SDXL 1024² (sec) | 14-18 | 18-25 |
| Power (idle / load) | 25W / 90W | 15W / 80W |
| OS | Linux / Windows | macOS |
| Software ecosystem | ROCm + open-source | MLX + open-source |
| Form factor | Mini-ITX | Studio (compact desktop) |
Mac wins on per-core efficiency, build quality, and macOS integration. Strix Halo wins on price (~50% less), Linux ecosystem, x86 software compatibility, and slightly higher 70B throughput. For pure $/perf on local LLMs in 2026, Strix Halo is hard to beat.
vs Multi-GPU NVIDIA Builds {#vs-nvidia}
A 2x RTX 4090 PCIe rig costs $3,000-3,500 (GPUs alone) plus system, total $4,500+. It delivers:
- Llama 3.1 8B FP16: 250+ tok/s (vs Strix Halo 48)
- Llama 3.1 70B AWQ: 38 tok/s (vs Strix Halo 32)
- Llama 3.1 70B BF16: doesn't fit in 48 GB
- SDXL 1024²: 4 sec (vs Strix Halo 14-18)
- Power: 600-900 W full-load (vs Strix Halo 90 W)
The dual-4090 wins on small-model speed and image gen. Strix Halo wins on 70B BF16 capability, power, noise, form factor, and price. For most "I want to run 70B locally without compromises" buyers, Strix Halo is the smarter choice.
BIOS: Allocating Memory to iGPU {#bios}
By default, Strix Halo systems allocate ~16-32 GB to the iGPU and leave the rest for system RAM. To run 70B models on the iGPU, you need to allocate ~96-110 GB to it.
Most BIOSes expose this under: Advanced → AMD CBS → NBIO Common Options → GFX Configuration → UMA Frame Buffer Size.
| Workload | Recommended iGPU allocation |
|---|---|
| LLMs only | 96 GB (leaves 32 GB for system) |
| LLM + image gen | 80 GB (more system RAM for ComfyUI buffers) |
| 70B BF16 + 32K context | 110 GB (max practical) |
| Mixed workstation | 64 GB (balance) |
Some systems (Framework Desktop, ROG Flow) expose this in a simpler "AI Memory Reservation" menu. Reboot required after change.
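One quick way to confirm the carve-out took effect after the reboot is to check how much RAM Linux still sees (a sketch; the expected value depends on your allocation, here assumed to be 96 GB on a 128 GB system):

```python
# Read MemTotal from /proc/meminfo (value is in KiB).
with open("/proc/meminfo") as f:
    mem_total_kib = int(next(line for line in f if line.startswith("MemTotal")).split()[1])

print(f"RAM visible to the OS: {mem_total_kib / 1024**2:.1f} GiB")
# With 96 GB reserved for the iGPU on a 128 GB system, expect roughly 32 GiB here;
# rocm-smi (see the ROCm section below) reports the iGPU side.
```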
ROCm Setup for gfx1151 {#rocm}
ROCm 6.3+ has official Strix Halo (gfx1151) support:
```bash
wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/jammy/amdgpu-install_6.3.60300-1_all.deb
sudo apt install ./amdgpu-install*.deb
sudo amdgpu-install --usecase=rocm,hiplibsdk -y
sudo usermod -aG render,video $USER
sudo reboot
```
Some workloads still need the gfx version override:
```bash
echo 'export HSA_OVERRIDE_GFX_VERSION=11.5.1' >> ~/.bashrc
echo 'export HCC_AMDGPU_TARGET=gfx1151' >> ~/.bashrc
source ~/.bashrc
```
Verify:
```bash
rocminfo | grep gfx
# Expected: gfx1151 (Strix Halo iGPU)
rocm-smi --showmeminfo vram
# Should show ~96-110 GB available depending on BIOS allocation
```
Ollama on Strix Halo {#ollama}
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Edit systemd to set the override:
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
Environment="HCC_AMDGPU_TARGET=gfx1151"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Run a 70B model — fits entirely in unified memory
ollama run llama3.1:70b
```
For BF16 70B (large but possible on 128 GB):
```bash
ollama run llama3.1:70b-instruct-fp16
```
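Once the service is running, any HTTP client can talk to it. A minimal sketch against Ollama's native chat endpoint (model name and prompt are just examples):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:70b",
        "messages": [{"role": "user", "content": "Summarize unified memory in two sentences."}],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=600,  # first-token latency on a cold 70B load can be long
)
print(resp.json()["message"]["content"])
```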
llama.cpp Native Build {#llamacpp}
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" \
HIP_PATH="$(hipconfig -R)" \
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

./build/bin/llama-cli -m llama-3.1-70b-instruct-Q4_K_M.gguf -ngl 999 -fa
```
For long context, quantize the KV cache: add --cache-type-k q4_0 --cache-type-v q4_0 to the command above (quantizing the V cache requires flash attention, already enabled via -fa). This roughly halves KV-cache memory at a small quality cost.
vLLM-ROCm {#vllm}
```bash
docker pull rocm/vllm:latest

docker run --device /dev/kfd --device /dev/dri \
  --group-add video --group-add render \
  --security-opt seccomp=unconfined \
  --shm-size 16G \
  -e HSA_OVERRIDE_GFX_VERSION=11.5.1 \
  -p 8000:8000 \
  rocm/vllm:latest \
  vllm serve casperhansen/llama-3.1-70b-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.85
```
vLLM-ROCm on Strix Halo is functional but throughput-bound by the 256 GB/s memory bandwidth — single-stream decode is similar to llama.cpp / Ollama, but multi-user batching has less headroom than discrete GPUs.
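The container exposes an OpenAI-compatible API on port 8000, so existing clients work unchanged. A minimal sketch with the openai Python package (the model name must match whatever was passed to vllm serve; the prompt is illustrative):

```python
from openai import OpenAI

# vLLM ignores the API key unless one was configured at startup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="casperhansen/llama-3.1-70b-instruct-awq",
    messages=[{"role": "user", "content": "Give me three uses for 128 GB of unified memory."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```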
PyTorch + Hugging Face {#pytorch}
```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.3
```
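Before loading a 70B model, a quick sanity check that the ROCm build sees the iGPU is worthwhile (on ROCm, PyTorch exposes the device through the torch.cuda API; the exact name string and visible memory depend on your driver and BIOS allocation):

```python
import torch

print(torch.__version__, torch.version.hip)   # hip is None on CUDA-only builds
print(torch.cuda.is_available())              # should be True
print(torch.cuda.get_device_name(0))          # the Radeon 8060S iGPU
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.0f} GB visible")
```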
Run any HF Transformers model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
The 128 GB unified memory means BF16 70B loads without sharding across devices. Inference runs at roughly 14 tok/s; training is impractical (compute-bound, not memory-bound).
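Continuing the snippet above, a minimal generation call (prompt and sampling settings are illustrative):

```python
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

messages = [{"role": "user", "content": "Explain unified memory in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=200)

# Strip the prompt tokens and decode only the newly generated text.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```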
Image Generation Performance {#image-gen}
| Workflow | Strix Halo | RTX 4090 | M4 Max 128GB |
|---|---|---|---|
| SDXL 1024² | 14-18 sec | 4 sec | 18-25 sec |
| SDXL Lightning (8 steps) | 5 sec | 1.5 sec | 7 sec |
| Flux Schnell (4 steps) | 10 sec | 3 sec | 12 sec |
| Flux Dev FP8 (25 steps) | 30 sec | 12 sec | 35 sec |
| SD 3.5 Large (28 steps) | 20 sec | 6 sec | 25 sec |
Strix Halo is roughly 3-4x slower than an RTX 4090 for image generation (compute-bound, not memory-bound). For LLM-primary use cases this is acceptable; if image generation is your main workload, get a discrete GPU.
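If you do want occasional image generation on the iGPU, the standard diffusers workflow runs on the ROCm PyTorch build. A minimal sketch (depending on your ROCm version, the HSA_OVERRIDE_GFX_VERSION export from earlier may still be required):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# On ROCm, the iGPU is addressed through the "cuda" device string.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    "a red fox in fresh snow, golden hour, photo",
    num_inference_steps=30,
    height=1024,
    width=1024,
).images[0]
image.save("fox.png")
```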
Video Generation (Wan 2.2, Hunyuan) {#video-gen}
| Model | Strix Halo | RTX 4090 |
|---|---|---|
| Wan 2.2 (5 sec, 720p, Q8) | 18-25 min | 6-10 min |
| HunyuanVideo (5 sec, 720p, Q4) | 35-50 min | 12-20 min |
| Mochi (5 sec, 480p) | 12-18 min | 5-8 min |
Slow but functional for occasional generation. The 128 GB memory means you can fit unquantized video models that don't fit on a 4090, but compute time makes routine use impractical.
The XDNA 2 NPU {#npu}
The 50-TOPS XDNA 2 NPU is largely unused for general LLM inference as of mid-2026. AMD's Lemonade SDK and Ryzen AI Software stack target it for specific accelerated paths — Llama 3.1 8B, Phi 3.5 Mini, etc. — but the iGPU is faster for most workloads.
The NPU shines for:
- Always-on background AI (Windows Studio Effects, Recall)
- Battery-sensitive laptop workloads
- Specific quantized models AMD ships optimized kernels for
For mainstream LLM inference, ignore the NPU and use the iGPU. Future releases of AMD's Ryzen AI software and XDNA driver stack may expand NPU coverage.
Real Benchmarks {#benchmarks}
Framework Desktop 128 GB, Ubuntu 24.04, ROCm 6.3, BIOS UMA = 96 GB.
| Workload | tok/s |
|---|---|
| Llama 3.2 1B Q4_K_M | 110 |
| Llama 3.2 3B Q4_K_M | 78 |
| Qwen 2.5 7B Q4_K_M | 52 |
| Llama 3.1 8B Q4_K_M | 48 |
| Qwen 2.5 14B Q4_K_M | 32 |
| Qwen 2.5 32B AWQ-INT4 | 22 |
| Llama 3.3 70B Q4_K_M | 32 |
| Llama 3.3 70B Q5_K_M | 26 |
| Llama 3.3 70B BF16 | 14 |
| Mixtral 8x7B Q4 | 38 |
| DeepSeek V3 235B (INT4 partial-fit) | 8 |
For Mac M4 Max comparison see Apple M4 for AI Guide. For dual-GPU comparison see Multi-GPU Ollama Setup.
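One simple way to sanity-check decode speed on your own unit is to read the timing fields Ollama returns for a non-streamed request (a sketch; results vary with power profile, context length, and quantization):

```python
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:70b", "prompt": "Explain KV caches briefly.", "stream": False},
    timeout=1200,
).json()

# Durations are reported in nanoseconds.
# (prompt_eval_* fields can be absent if the prompt was served from cache.)
decode_tps = r["eval_count"] / (r["eval_duration"] / 1e9)
prefill_tps = r["prompt_eval_count"] / (r["prompt_eval_duration"] / 1e9)
print(f"prefill: {prefill_tps:.1f} tok/s, decode: {decode_tps:.1f} tok/s")
```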
Power, Thermals, Acoustics {#power}
Configurable cTDP from 45 W to 120 W. Sweet spots:
| Profile | cTDP | LLM tok/s (70B Q4) | Noise |
|---|---|---|---|
| Quiet | 65 W | 24 | Whisper-quiet |
| Balanced | 90 W | 30 | Audible at desk |
| Performance | 120 W | 32 | Notable fan |
For 24/7 inference servers, 65-90 W is the sweet spot; throughput gains above 90 W are small (30 to 32 tok/s for an extra 30 W in the table above). Most Strix Halo systems ship with adjustable power profiles in the BIOS or a vendor utility.
Use Cases Where Strix Halo Wins {#use-cases}
- 70B local inference on a single device — no consumer NVIDIA card matches.
- Large model fine-tuning data prep — load full BF16 model for inspection / activation logging.
- Multi-tenant home server — 128 GB lets you run multiple medium models simultaneously.
- Quiet / efficient desktop — 65-90 W for usable 70B inference.
- Privacy / air-gapped 70B — single mini-PC, easy to lock down. See Air-Gapped AI Deployment.
- Long-context RAG / agents — can fit full 131K context on 70B with Q4 KV cache.
- Mac Studio alternative for PC users — same use case, half the price.
Where Discrete GPUs Still Win {#discrete-wins}
- Small model speed — RTX 4090 hits 127 tok/s on 8B vs 48 on Strix Halo.
- Image generation — 3-4x faster on RTX 4090.
- Multi-user concurrent serving — vLLM PagedAttention scales better with VRAM bandwidth.
- Fine-tuning — discrete GPU compute throughput dominates.
- TensorRT-LLM — NVIDIA only.
- FP8 hardware — RDNA 3.5 lacks FP8.
- Latency-sensitive paths — discrete VRAM bandwidth wins on first-token-time.
Buying Advice {#buying}
Buy Strix Halo if:
- You want to run 70B models locally without offload pain.
- You value quiet, low-power, mini-PC form factor.
- You're comfortable with Linux + ROCm or Windows + WSL2.
- $2,000-3,000 budget; Mac Studio M4 Max is too expensive.
- LLMs are the primary workload; image gen is occasional.
Buy a discrete GPU instead if:
- You primarily run 7B-32B models — RTX 4090 / 7900 XTX much faster.
- You do heavy image / video generation.
- You need TensorRT-LLM or ExLlamaV2.
- You serve many concurrent users.
Buy Mac Studio M4 Max instead if:
- You're already in the Apple ecosystem.
- You need MLX-specific frameworks.
- Budget allows the ~2x premium.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| iGPU only sees 16 GB | BIOS UMA too low | Increase to 96+ GB in BIOS |
| Ollama uses CPU only | gfx version mismatch | Set HSA_OVERRIDE_GFX_VERSION=11.5.1 |
| ROCm "no agents found" | Driver / groups | render+video group, reboot |
| Throttling under sustained load | TDP / thermal | Increase cTDP profile or improve cooling |
| WSL2 GPU not visible | AMD WSL driver missing | Install AMD Software for WSL on host |
| Long prompt prefill slow | Compute-bound | Expected — Strix Halo trades compute for memory |
Sources: AMD Ryzen AI Max product page | Framework Desktop announcement | ROCm 6.3 release notes | Internal benchmarks Framework Desktop 128 GB.