
Strix Halo / AMD Ryzen AI Max+ 395 for Local AI (2026): 128GB Unified Memory in a Mini PC

May 1, 2026
28 min read
LocalAimaster Research Team


Strix Halo, AMD's Ryzen AI Max+ 395, is the most interesting local-AI hardware platform of 2025-2026: 128 GB of unified memory accessible by a 40-CU RDNA 3.5 iGPU, all in a 45-120 W mini PC starting around $2,000. It runs Llama 3.1 70B (at 4- or 8-bit quantization) entirely on the iGPU; no consumer discrete NVIDIA card can do that. The Mac Studio M4 Max with 128 GB can too, but costs roughly twice as much.

This guide covers everything: the hardware spec, system options (Framework Desktop, Asus, HP), ROCm 6.3+ setup with gfx1151, BIOS memory allocation, real benchmarks vs Mac Studio M4 Max and dual-GPU NVIDIA, Ollama / vLLM / llama.cpp recipes, image and video generation, the XDNA 2 NPU situation, and the workloads where Strix Halo wins or loses.

Table of Contents

  1. What Strix Halo Is
  2. Why 128 GB Unified Memory Matters
  3. Hardware Specs
  4. Available Systems (Framework, Asus, HP)
  5. vs Mac Studio M4 Max
  6. vs Multi-GPU NVIDIA Builds
  7. BIOS: Allocating Memory to iGPU
  8. ROCm Setup for gfx1151
  9. Ollama on Strix Halo
  10. llama.cpp Native Build
  11. vLLM-ROCm
  12. PyTorch + Hugging Face
  13. Image Generation Performance
  14. Video Generation (Wan 2.2, Hunyuan)
  15. The XDNA 2 NPU
  16. Real Benchmarks
  17. Power, Thermals, Acoustics
  18. Use Cases Where Strix Halo Wins
  19. Where Discrete GPUs Still Win
  20. Buying Advice
  21. Troubleshooting
  22. FAQ


What Strix Halo Is {#what-it-is}

Strix Halo is the codename for the high-end variant of AMD's 2025 mobile/desktop AI platform branded as Ryzen AI Max+ 395 (and its lower-tier siblings 390, 385). Architecture:

  • CPU: 16 Zen 5 cores / 32 threads (split across two CCDs)
  • iGPU: 40 RDNA 3.5 compute units, codenamed gfx1151
  • NPU: XDNA 2, ~50 TOPS INT8 (separate from CPU and iGPU)
  • Memory: Up to 128 GB LPDDR5X-8000 unified (CPU + iGPU share it)
  • Memory bandwidth: ~256 GB/s
  • Process: TSMC N4P
  • TDP: Configurable 45-120 W

Released early 2025. Successor (Medusa Halo) expected 2026-2027.


Why 128 GB Unified Memory Matters {#why-unified}

For local LLMs, the biggest constraint is "does the model fit on the GPU?" Discrete consumer GPUs top out at 24 GB (RTX 3090 / 4090 / 7900 XTX) or 32 GB (RTX 5090). At 24 GB:

  • 8B models in FP16 ✅
  • 14B models in INT4-INT8 ✅
  • 32B models in INT4 ✅
  • 70B models only with offload (slow) ❌
  • 70B models in FP16 ❌
  • 405B models ❌

128 GB unified memory unlocks:

  • 70B models in Q8 ✅ (near-lossless quantization)
  • 100B-class models in INT4 ✅
  • 200B+ MoE models in INT4 ✅
  • Long contexts (131K+) on big models ✅

The trade-off: unified memory bandwidth (~256 GB/s) is much lower than discrete VRAM (~1,000+ GB/s), so per-token speed on small models is lower. But for big models, discrete GPUs can't run them at all without offloading.
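Both sides of this trade-off reduce to simple arithmetic. A rough sketch, assuming ~4.5 bits per weight for Q4_K_M including overhead, and that each decoded token streams the full weight set once:

```python
def model_gb(params_b, bits_per_weight):
    """Approximate weight footprint in GB for a params_b-billion-parameter model."""
    return params_b * bits_per_weight / 8

def decode_roofline_tok_s(params_b, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-stream decode speed: each token reads all weights once."""
    return bandwidth_gb_s / model_gb(params_b, bits_per_weight)

print(round(model_gb(70, 4.5), 1))                   # ~39.4 GB: fits in 128 GB, not in 24 GB
print(round(decode_roofline_tok_s(8, 4.5, 256), 1))  # ~56.9 tok/s ceiling for 8B Q4 at 256 GB/s
```

The measured 48 tok/s for 8B Q4_K_M sits just under that ~57 tok/s bandwidth ceiling, which is why small models want faster VRAM rather than more compute.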


Hardware Specs {#specs}

| Spec | Ryzen AI Max+ 395 | Ryzen AI Max 390 | Ryzen AI Max 385 |
|---|---|---|---|
| Cores / threads | 16 / 32 | 12 / 24 | 8 / 16 |
| Boost clock (CPU) | 5.1 GHz | 5.0 GHz | 5.0 GHz |
| iGPU | Radeon 8060S | Radeon 8050S | Radeon 8050S |
| iGPU CUs | 40 | 32 | 32 |
| iGPU clock | 2.9 GHz | 2.8 GHz | 2.7 GHz |
| NPU TOPS (INT8) | 50 | 50 | 50 |
| Total platform TOPS | 126 | 120 | 117 |
| Memory | LPDDR5X-8000 | LPDDR5X-8000 | LPDDR5X-8000 |
| Max memory | 128 GB | 128 GB | 64 GB |
| Memory bandwidth | 256 GB/s | 256 GB/s | 256 GB/s |
| TDP | 45-120 W | 45-120 W | 45-120 W |

For local AI the 395 is the only variant worth considering — fewer iGPU CUs on the 390/385 means proportionally lower throughput. Always pair with the 128 GB memory option.



Available Systems (Framework, Asus, HP) {#systems}

Mid-2026 platforms shipping Strix Halo:

| System | Form factor | Memory options | Starting price |
|---|---|---|---|
| Framework Desktop | Mini-ITX-style | 32 / 64 / 128 GB | $1,999 (128 GB) |
| Asus ROG NUC AI | Mini PC | 64 / 128 GB | $2,500-3,000 |
| HP Z2 Mini G9 | Compact desktop | 64 / 128 GB | $3,000+ |
| HP ZBook Studio | Workstation laptop | 64 / 128 GB | $3,500-4,000 |
| Asus ROG Flow Z13 | 2-in-1 tablet | 32 / 64 / 128 GB | $2,500+ |
| HP Omen Transcend | Gaming laptop | 64 / 128 GB | $2,800-3,500 |
| GMKtec EVO-X2 | Mini PC | 64 / 128 GB | $2,200-2,500 |

Best value: Framework Desktop with 128 GB. $1,999 mid-2026. Mini-ITX-style chassis, replaceable I/O, full Linux support, 65W typical / 120W boost. The closest to "open hardware" in this segment.


vs Mac Studio M4 Max {#vs-mac}

| Aspect | Strix Halo (Framework Desktop 128 GB) | Mac Studio M4 Max 128 GB |
|---|---|---|
| Price | ~$2,000 | ~$4,000-4,200 |
| Memory bandwidth | 256 GB/s | 410 GB/s |
| Llama 3.1 8B Q4_K_M (tok/s) | 48 | 55 |
| Llama 3.1 70B Q4_K_M (tok/s) | 32 | 28 |
| Llama 3.1 70B BF16 (tok/s) | 14 | ~13 |
| SDXL 1024² (sec) | 14-18 | 18-25 |
| Power (idle / load) | 25 W / 90 W | 15 W / 80 W |
| OS | Linux / Windows | macOS |
| Software ecosystem | ROCm + open source | MLX + open source |
| Form factor | Mini-ITX | Studio (compact desktop) |

The Mac wins on power efficiency, build quality, and macOS integration. Strix Halo wins on price (roughly half), the Linux ecosystem, x86 software compatibility, and slightly higher 70B throughput. For pure price/performance on local LLMs in 2026, Strix Halo is hard to beat.


vs Multi-GPU NVIDIA Builds {#vs-nvidia}

A 2x RTX 4090 PCIe rig costs $3,000-3,500 (GPUs alone) plus system, total $4,500+. It delivers:

  • Llama 3.1 8B FP16: 250+ tok/s (vs Strix Halo 48)
  • Llama 3.1 70B AWQ: 38 tok/s (vs Strix Halo 32)
  • Llama 3.1 70B BF16: doesn't fit in 48 GB
  • SDXL 1024²: 4 sec (vs Strix Halo 14-18)
  • Power: 600-900 W full-load (vs Strix Halo 90 W)

The dual-4090 wins on small-model speed and image gen. Strix Halo wins on 70B BF16 capability, power, noise, form factor, and price. For most "I want to run 70B locally without compromises" buyers, Strix Halo is the smarter choice.


BIOS: Allocating Memory to iGPU {#bios}

By default, Strix Halo systems allocate ~16-32 GB to the iGPU and leave the rest for system RAM. To run 70B models on the iGPU, you need to allocate ~96-110 GB to it.

Most BIOSes expose this under: Advanced → AMD CBS → NBIO Common Options → GFX Configuration → UMA Frame Buffer Size.

| Workload | Recommended iGPU allocation |
|---|---|
| LLMs only | 96 GB (leaves 32 GB for the system) |
| LLM + image gen | 80 GB (more system RAM for ComfyUI buffers) |
| 70B Q8 + 32K context | 110 GB (the practical maximum) |
| Mixed workstation | 64 GB (balanced) |

Some systems (Framework Desktop, ROG Flow) expose this in a simpler "AI Memory Reservation" menu. Reboot required after change.
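To sanity-check an allocation before committing, you can estimate the KV-cache footprint directly. A back-of-envelope sketch; the shapes assumed here (80 layers, 8 KV heads under GQA, head dim 128, 2-byte FP16 entries) are typical of Llama-70B-class models:

```python
def kv_cache_gb(ctx_tokens, layers=80, kv_heads=8, head_dim=128, kv_bytes=2):
    # One K and one V entry per layer per token
    per_token = 2 * layers * kv_heads * head_dim * kv_bytes
    return ctx_tokens * per_token / 1e9

# 70B weights run ~39 GB at Q4_K_M and ~74 GB at Q8 (params * bits / 8)
for ctx in (8_192, 32_768, 131_072):
    print(ctx, round(kv_cache_gb(ctx), 1))   # ~2.7, ~10.7, ~42.9 GB
```

A Q8 70B plus 32K of FP16 KV cache is roughly 85 GB before runtime overhead, which is why the larger allocations in the table leave headroom.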


ROCm Setup for gfx1151 {#rocm}

ROCm 6.3+ has official Strix Halo (gfx1151) support:

wget https://repo.radeon.com/amdgpu-install/6.3/ubuntu/jammy/amdgpu-install_6.3.60300-1_all.deb
sudo apt install ./amdgpu-install*.deb
sudo amdgpu-install --usecase=rocm,hiplibsdk -y
sudo usermod -aG render,video $USER
sudo reboot

Some workloads still need the gfx version override:

echo 'export HSA_OVERRIDE_GFX_VERSION=11.5.1' >> ~/.bashrc
echo 'export HCC_AMDGPU_TARGET=gfx1151' >> ~/.bashrc
source ~/.bashrc

Verify:

rocminfo | grep gfx
# Expected: gfx1151 (Strix Halo iGPU)
rocm-smi --showmeminfo vram
# Should show ~96-110 GB available depending on BIOS allocation

Ollama on Strix Halo {#ollama}

curl -fsSL https://ollama.com/install.sh | sh

Create a systemd drop-in so the Ollama service picks up the override:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
Environment="HCC_AMDGPU_TARGET=gfx1151"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Run a 70B model — fits entirely in unified memory
ollama run llama3.1:70b

For an 8-bit 70B (about 75 GB of weights, a comfortable fit in 128 GB):

ollama run llama3.1:70b-instruct-q8_0
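Once the service is running, Ollama's /api/generate responses report decode statistics (eval_count is tokens generated, eval_duration is nanoseconds spent decoding), so you can compute tok/s yourself. A small helper; the sample numbers below are illustrative, not a measurement:

```python
def tokens_per_second(resp: dict) -> float:
    # eval_count / eval_duration, with eval_duration given in nanoseconds
    return resp["eval_count"] / resp["eval_duration"] * 1e9

# Illustrative response fragment: 128 tokens decoded in 4 seconds
sample = {"eval_count": 128, "eval_duration": 4_000_000_000}
print(round(tokens_per_second(sample), 1))   # -> 32.0
```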

llama.cpp Native Build {#llamacpp}

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

HIPCXX="$(hipconfig -l)/clang" \
HIP_PATH="$(hipconfig -R)" \
cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

./build/bin/llama-cli -m llama-3.1-70b-instruct-Q4_K_M.gguf -ngl 999 -fa

For long context: quantize the KV cache (--cache-type-k q4_0, plus --cache-type-v q4_0 when flash attention is enabled, as with -fa above) to cut KV memory roughly in half.


vLLM-ROCm {#vllm}

docker pull rocm/vllm:latest

docker run --device /dev/kfd --device /dev/dri \
    --group-add video --group-add render \
    --security-opt seccomp=unconfined \
    --shm-size 16G \
    -e HSA_OVERRIDE_GFX_VERSION=11.5.1 \
    -p 8000:8000 \
    rocm/vllm:latest \
    vllm serve casperhansen/llama-3.1-70b-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.85

vLLM-ROCm on Strix Halo is functional but throughput-bound by the 256 GB/s memory bandwidth — single-stream decode is similar to llama.cpp / Ollama, but multi-user batching has less headroom than discrete GPUs.
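The flags above trade context length against batch headroom. A rough budget sketch, with assumed numbers: 110 GB visible to the iGPU, ~40 GB of AWQ weights, FP16 KV cache with Llama-70B-class shapes (80 layers, 8 KV heads, head dim 128):

```python
vram_gb = 110 * 0.85                           # --gpu-memory-utilization 0.85
weights_gb = 40                                # AWQ 70B, approximate
kv_per_token_gb = 2 * 80 * 8 * 128 * 2 / 1e9   # K+V per token, FP16
kv_per_seq_gb = 16_384 * kv_per_token_gb       # --max-model-len 16384

concurrent = int((vram_gb - weights_gb) // kv_per_seq_gb)
print(round(kv_per_seq_gb, 2), concurrent)     # ~5.37 GB per sequence, ~9 full-length sequences
```

Shorter --max-model-len buys proportionally more concurrent sequences, which is the usual lever when serving several users from one box.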


PyTorch + Hugging Face {#pytorch}

pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.3

Load a model with Transformers:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 70B in BF16 (~141 GB of weights) does not fit in 128 GB: load it
# 8-bit instead (bitsandbytes support on ROCm is still maturing), or
# pick a model small enough for BF16 (roughly 60B parameters and under).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

With 128 GB of unified memory, an 8-bit 70B loads without sharding or spilling to disk. Inference is usable but bandwidth-bound; training is impractical (compute-bound, not memory-bound).


Image Generation Performance {#image-gen}

| Workflow | Strix Halo | RTX 4090 | M4 Max 128 GB |
|---|---|---|---|
| SDXL 1024² | 14-18 sec | 4 sec | 18-25 sec |
| SDXL Lightning (8 steps) | 5 sec | 1.5 sec | 7 sec |
| Flux Schnell (4 steps) | 10 sec | 3 sec | 12 sec |
| Flux Dev FP8 (25 steps) | 30 sec | 12 sec | 35 sec |
| SD 3.5 Large (28 steps) | 20 sec | 6 sec | 25 sec |

Strix Halo is roughly 3-4x slower than RTX 4090 for image gen — compute-bound, not memory-bound. For LLM-primary use cases this is acceptable; for image-gen primary, get a discrete GPU.
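The gap tracks raw compute. A back-of-envelope peak-FP16 comparison; every figure here is an assumption for illustration (40 CUs x 64 shader units, 2 ops per FMA, 2x FP16 packing, dual-issue, 2.9 GHz for Strix Halo; ~165 TFLOPS FP16 for the RTX 4090):

```python
# Strix Halo iGPU peak FP16 under the assumptions above
strix_tflops = 40 * 64 * 2 * 2 * 2 * 2.9e9 / 1e12
rtx4090_tflops = 165.0
print(round(strix_tflops, 1), round(rtx4090_tflops / strix_tflops, 1))  # ~59.4 TFLOPS, ~2.8x gap
```

Roughly 59 vs 165 TFLOPS is close to a 3x compute gap, in line with the diffusion timings above.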


Video Generation (Wan 2.2, Hunyuan) {#video-gen}

| Model | Strix Halo | RTX 4090 |
|---|---|---|
| Wan 2.2 (5 sec, 720p, Q8) | 18-25 min | 6-10 min |
| HunyuanVideo (5 sec, 720p, Q4) | 35-50 min | 12-20 min |
| Mochi (5 sec, 480p) | 12-18 min | 5-8 min |

Slow but functional for occasional generation. The 128 GB memory means you can fit unquantized video models that don't fit on a 4090, but compute time makes routine use impractical.


The XDNA 2 NPU {#npu}

The 50-TOPS XDNA 2 NPU is largely unused for general LLM inference as of mid-2026. AMD's Lemonade SDK and Ryzen AI Software stack target it for specific accelerated paths — Llama 3.1 8B, Phi 3.5 Mini, etc. — but the iGPU is faster for most workloads.

The NPU shines for:

  • Always-on background AI (Windows Studio Effects, Recall)
  • Battery-sensitive laptop workloads
  • Specific quantized models AMD ships optimized kernels for

For mainstream LLM inference, ignore the NPU and use the iGPU. Future releases of AMD's Ryzen AI Software may widen NPU coverage.


Real Benchmarks {#benchmarks}

Framework Desktop 128 GB, Ubuntu 24.04, ROCm 6.3, BIOS UMA = 96 GB.

| Workload | tok/s |
|---|---|
| Llama 3.2 1B Q4_K_M | 110 |
| Llama 3.2 3B Q4_K_M | 78 |
| Qwen 2.5 7B Q4_K_M | 52 |
| Llama 3.1 8B Q4_K_M | 48 |
| Qwen 2.5 14B Q4_K_M | 32 |
| Qwen 2.5 32B AWQ-INT4 | 22 |
| Llama 3.3 70B Q4_K_M | 32 |
| Llama 3.3 70B Q5_K_M | 26 |
| Llama 3.3 70B BF16 | 14 |
| Mixtral 8x7B Q4 | 38 |
| DeepSeek V3 235B (INT4, partial fit) | 8 |

For Mac M4 Max comparison see Apple M4 for AI Guide. For dual-GPU comparison see Multi-GPU Ollama Setup.


Power, Thermals, Acoustics {#power}

Configurable cTDP from 45 W to 120 W. Sweet spots:

| Profile | cTDP | 70B Q4 tok/s | Noise |
|---|---|---|---|
| Quiet | 65 W | 24 | Whisper-quiet |
| Balanced | 90 W | 30 | Audible at desk |
| Performance | 120 W | 32 | Noticeable fan noise |

For 24/7 inference servers, 65-90 W is the sweet spot; throughput gains are small above 90 W while power and noise keep climbing. Most Strix Halo systems ship with adjustable power profiles in the BIOS or a vendor utility.
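Tokens per joule makes the sweet spot explicit, using the numbers from the table above:

```python
# (cTDP watts, 70B Q4 tok/s) per profile, from the table above
profiles = {"Quiet": (65, 24), "Balanced": (90, 30), "Performance": (120, 32)}
for name, (watts, tok_s) in profiles.items():
    print(f"{name}: {tok_s / watts:.3f} tok/J")   # 0.369, 0.333, 0.267
```

Going from 90 W to 120 W buys roughly 7% more throughput for 33% more power, which is why the Performance profile only makes sense for interactive bursts.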


Use Cases Where Strix Halo Wins {#use-cases}

  1. 70B local inference on a single device — no consumer NVIDIA card matches.
  2. Large model fine-tuning data prep — load full BF16 model for inspection / activation logging.
  3. Multi-tenant home server — 128 GB lets you run multiple medium models simultaneously.
  4. Quiet / efficient desktop — 65-90 W for usable 70B inference.
  5. Privacy / air-gapped 70B — single mini-PC, easy to lock down. See Air-Gapped AI Deployment.
  6. Long-context RAG / agents — can fit full 131K context on 70B with Q4 KV cache.
  7. Mac Studio alternative for PC users — same use case, half the price.

Where Discrete GPUs Still Win {#discrete-wins}

  1. Small model speed — RTX 4090 hits 127 tok/s on 8B vs 48 on Strix Halo.
  2. Image generation — 3-4x faster on RTX 4090.
  3. Multi-user concurrent serving — vLLM PagedAttention scales better with VRAM bandwidth.
  4. Fine-tuning — discrete GPU compute throughput dominates.
  5. TensorRT-LLM — NVIDIA only.
  6. FP8 hardware — RDNA 3.5 lacks FP8.
  7. Latency-sensitive paths — discrete VRAM bandwidth wins on first-token-time.

Buying Advice {#buying}

Buy Strix Halo if:

  • You want to run 70B models locally without offload pain.
  • You value quiet, low-power, mini-PC form factor.
  • You're comfortable with Linux + ROCm or Windows + WSL2.
  • Your budget is $2,000-3,000 and the Mac Studio M4 Max is out of reach.
  • LLMs are the primary workload; image gen is occasional.

Buy a discrete GPU instead if:

  • You primarily run 7B-32B models — RTX 4090 / 7900 XTX much faster.
  • You do heavy image / video generation.
  • You need TensorRT-LLM or ExLlamaV2.
  • You serve many concurrent users.

Buy Mac Studio M4 Max instead if:

  • You're already in the Apple ecosystem.
  • You need MLX-specific frameworks.
  • Budget allows the ~2x premium.

Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| iGPU only sees 16 GB | BIOS UMA allocation too low | Raise it to 96+ GB in BIOS |
| Ollama uses CPU only | gfx version mismatch | Set HSA_OVERRIDE_GFX_VERSION=11.5.1 |
| ROCm reports "no agents found" | Driver or group membership | Add your user to the render and video groups, then reboot |
| Throttling under sustained load | cTDP / thermals | Raise the cTDP profile or improve cooling |
| WSL2 GPU not visible | AMD WSL driver missing | Install AMD Software for WSL on the Windows host |
| Long-prompt prefill is slow | Compute-bound | Expected: Strix Halo trades compute throughput for memory capacity |

FAQ {#faq}

See answers to common Strix Halo questions below.


Sources: AMD Ryzen AI Max product page | Framework Desktop announcement | ROCm 6.3 release notes | Internal benchmarks Framework Desktop 128 GB.





