Hardware

AMD ROCm Setup for Local LLMs (2026): Radeon, Strix Halo, MI300X

May 1, 2026
34 min read
LocalAimaster Research Team


AMD ROCm has quietly become a credible alternative to NVIDIA CUDA for local LLMs. A Radeon RX 7900 XTX runs Llama 3.1 8B at ~96 tok/s — 75% of an RTX 4090's speed at less than half the price. Strix Halo (Ryzen AI Max+ 395) puts 128GB of unified memory in a 65W mini-PC, running 70B models entirely on the iGPU. And MI300X with 192GB HBM3 is outperforming H100 on long-context inference.

This guide is the complete reference: installing ROCm, configuring your GPU, running Ollama / llama.cpp / vLLM / PyTorch on AMD, FlashAttention support, real benchmarks vs CUDA, and the gotchas nobody tells you about. We cover Radeon RX 7000/9000-series, Strix Halo, and CDNA 3 (MI300X / MI325X).

Table of Contents

  1. The State of ROCm in 2026
  2. Supported AMD Hardware
  3. ROCm vs CUDA: What's Actually Different
  4. Installation: Ubuntu, Fedora, WSL2
  5. Verifying Your Install
  6. Ollama on ROCm
  7. llama.cpp on ROCm
  8. vLLM on ROCm
  9. PyTorch on ROCm
  10. FlashAttention on AMD
  11. Quantization on ROCm
  12. Tuning the Radeon RX 7900 XTX
  13. Strix Halo / Ryzen AI Max+ 395
  14. MI300X / MI325X for Servers
  15. HSA_OVERRIDE_GFX_VERSION: Unofficial Cards
  16. Performance Benchmarks
  17. Troubleshooting
  18. FAQ


The State of ROCm in 2026 {#state-of-rocm}

For two years AMD users were second-class citizens in local AI. That ended in 2024-2025. Today:

  • Ollama, llama.cpp, vLLM, PyTorch, JAX, Triton, ExecuTorch all ship ROCm builds.
  • FlashAttention-2 has an official AMD port for RDNA 3 and CDNA 3.
  • Strix Halo / Ryzen AI Max+ put 128GB of unified memory on a desktop platform.
  • MI300X (192GB HBM3) is in production at major hyperscalers and beats H100 on long-context inference.
  • Radeon RX 7900 XTX at $750-900 hits 75% of RTX 4090 inference speed.

What still lags: training tooling parity, FP8 software maturity (works on MI300X, weak on RDNA 3), some xformers / SDXL paths, and Windows native support.

For pure inference, ROCm is a real choice in 2026. For training large models, NVIDIA still wins.


Supported AMD Hardware {#hardware}

Officially supported in ROCm 6.x

| GPU / APU | Architecture | gfx | VRAM | Best For |
|---|---|---|---|---|
| Radeon RX 7900 XTX | RDNA 3 (Navi 31) | gfx1100 | 24 GB | Best Radeon for LLMs |
| Radeon RX 7900 XT | RDNA 3 | gfx1100 | 20 GB | Mid-range LLM |
| Radeon RX 7900 GRE | RDNA 3 | gfx1100 | 16 GB | Budget 14B models |
| Radeon RX 9070 XT | RDNA 4 (Navi 48) | gfx1201 | 16 GB | New gen, FP8 likely |
| Radeon Pro W7900 | RDNA 3 | gfx1100 | 48 GB | Workstation 70B |
| Radeon Pro W7800 | RDNA 3 | gfx1100 | 32 GB | Workstation 32B |
| Ryzen AI Max+ 395 | RDNA 3.5 + Zen 5 | gfx1151 | up to 128 GB unified | 70B in mini-PC |
| MI210 | CDNA 2 | gfx90a | 64 GB HBM2e | Server inference |
| MI250 / 250X | CDNA 2 | gfx90a | 128 GB HBM2e | Multi-GPU server |
| MI300A | CDNA 3 + Zen 4 | gfx940 | 128 GB HBM3 | APU server |
| MI300X | CDNA 3 | gfx942 | 192 GB HBM3 | 70B FP16 on one GPU |
| MI325X | CDNA 3 | gfx942 | 256 GB HBM3e | Largest single-GPU |

Unofficially supported (HSA_OVERRIDE_GFX_VERSION)

| GPU | gfx | Override |
|---|---|---|
| RX 6800 / 6800 XT / 6900 XT | gfx1030 | 10.3.0 |
| RX 6700 XT / 6750 XT | gfx1031 | 10.3.0 |
| RX 6600 / 6650 XT | gfx1032 | 10.3.0 |
| RX 7800 XT / 7700 XT | gfx1101 | 11.0.0 |
| RX 7600 / 7600 XT | gfx1102 | 11.0.0 |
| Ryzen 7040/8040 (780M iGPU) | gfx1103 | 11.0.3 |
| Ryzen AI 300-series (Strix Point) | gfx1150 | 11.5.0 |
| Ryzen AI Max+ (Strix Halo) | gfx1151 | 11.5.1 |

These work for inference on Ollama / llama.cpp but are unsupported for production. See HSA_OVERRIDE_GFX_VERSION below.


ROCm vs CUDA: What's Actually Different {#rocm-vs-cuda}

| Concept | NVIDIA CUDA | AMD ROCm |
|---|---|---|
| Compute API | CUDA C++ | HIP (CUDA-source-compatible) |
| Monitoring CLI | nvidia-smi | rocm-smi |
| Math library | cuBLAS | hipBLAS / rocBLAS |
| DNN library | cuDNN | MIOpen |
| Collective comms | NCCL | RCCL |
| Profiler | Nsight Systems / Compute | rocprof / Omniperf |
| Compiler | nvcc | hipcc |
| Container GPU access | nvidia-container-toolkit | device passthrough (/dev/kfd, /dev/dri) |
| Tensor core equivalent | Tensor Cores | Matrix Cores (CDNA), WMMA (RDNA) |
| FP8 | Ada / Hopper / Blackwell | MI300X (CDNA 3); RDNA 3 lacks it |

The good news: HIP is mostly source-compatible with CUDA. hipify translates CUDA code automatically, so most frameworks ship both backends from the same codebase.

The bad news: kernels hand-tuned for NVIDIA Tensor Cores (FlashAttention-3, fused MoE kernels, etc.) need separate AMD implementations and tend to lag.
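In practice hipify is mostly mechanical API-name substitution. A toy sketch of the idea — the real hipify-perl/hipify-clang tools handle hundreds of APIs plus header and type edge cases; this mapping is a tiny illustrative subset, not the actual tool:

```python
# Toy illustration of hipify-style CUDA -> HIP source translation.
# The real tools (hipify-perl, hipify-clang) cover far more than this subset.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cuda_runtime.h": "hip/hip_runtime.h",
}

def hipify(source: str) -> str:
    # Longest names first so shorter prefixes don't clobber longer API names
    for cuda_name in sorted(CUDA_TO_HIP, key=len, reverse=True):
        source = source.replace(cuda_name, CUDA_TO_HIP[cuda_name])
    return source

cuda_src = '#include <cuda_runtime.h>\nfloat *d; cudaMalloc(&d, 1024); cudaFree(d);'
print(hipify(cuda_src))
```

Because the translation is this regular, frameworks can maintain one codebase and emit both backends.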



Installation: Ubuntu, Fedora, WSL2 {#installation}

# Add AMDGPU repo and install (ROCm 6.2)
wget https://repo.radeon.com/amdgpu-install/6.2/ubuntu/jammy/amdgpu-install_6.2.60200-1_all.deb
sudo apt install ./amdgpu-install_6.2.60200-1_all.deb
sudo amdgpu-install --usecase=rocm,hiplibsdk -y

# Add yourself to required groups
sudo usermod -aG render,video $USER

# Reboot
sudo reboot

For Ubuntu 24.04, replace jammy with noble in the URL.

Fedora 39+

sudo dnf install rocm-hip rocm-hip-devel rocm-comgr rocm-runtime
sudo usermod -aG render,video $USER
sudo reboot

WSL2 (Windows 10/11)

ROCm in WSL2 supports a narrower set of GPUs (RX 7900 XTX/XT/GRE, Pro W7900/W7800, Strix Halo) with the AMD Software: Adrenalin Edition for WSL driver.

# Inside WSL2 Ubuntu 22.04
wget https://repo.radeon.com/amdgpu-install/6.2/ubuntu/jammy/amdgpu-install_6.2.60200-1_all.deb
sudo apt install ./amdgpu-install*.deb
sudo amdgpu-install -y --usecase=wsl,rocm --no-dkms

Critical: --no-dkms because WSL uses the Windows driver. Skip the Linux kernel module install.

Docker

docker run -it --rm \
    --device /dev/kfd --device /dev/dri \
    --group-add video --group-add render \
    --security-opt seccomp=unconfined \
    rocm/dev-ubuntu-22.04:6.2 bash

The /dev/kfd (Kernel Fusion Driver) and /dev/dri devices are how ROCm reaches the GPU.
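A quick preflight check saves container-debugging time; a small sketch that mirrors the device nodes and groups in the docker flags above (the helper names are mine, not a ROCm API):

```python
import os

# Same device nodes the docker run above passes through, and the groups it adds
REQUIRED_DEVICES = ["/dev/kfd", "/dev/dri"]
REQUIRED_GROUPS = ["render", "video"]

def missing_devices(paths=REQUIRED_DEVICES):
    """Return the device nodes that do not exist on this host."""
    return [p for p in paths if not os.path.exists(p)]

def missing_groups(user_groups, required=REQUIRED_GROUPS):
    """Return required groups the user is not a member of."""
    return [g for g in required if g not in user_groups]

print("missing devices:", missing_devices())
```

If either list is non-empty, fix the driver install or group membership before blaming the container image.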


Verifying Your Install {#verify}

# Should list your GPU
rocminfo | grep -A 5 "Agent"

# GPU utilization, temp, power
rocm-smi

# Detailed
rocm-smi --showallinfo

Expected output for a Radeon RX 7900 XTX:

Agent 2
*******
  Name: gfx1100
  Marketing Name: Radeon RX 7900 XTX
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  ...

If you see only gfx000 or no agents, your driver did not load — check dmesg | grep amdgpu.
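If you script your setup, the gfx target can be pulled out of rocminfo output with a few lines of Python. The sample string mirrors the expected output above; rocminfo's exact spacing can vary between ROCm releases, so treat the regex as a sketch:

```python
import re

def gfx_targets(rocminfo_text: str) -> list:
    """Extract gfx architecture names (e.g. gfx1100, gfx90a) from rocminfo output."""
    return re.findall(r"Name:\s+(gfx[0-9a-f]+)", rocminfo_text)

sample = """Agent 2
*******
  Name:                    gfx1100
  Marketing Name:          Radeon RX 7900 XTX
"""
print(gfx_targets(sample))

# On a real system:
# import subprocess
# out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
# print(gfx_targets(out))
```

The returned string is exactly what AMDGPU_TARGETS and GPU_ARCHS expect later in this guide.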


Ollama on ROCm {#ollama}

Ollama ships a ROCm binary that auto-detects the GPU.

curl -fsSL https://ollama.com/install.sh | sh

The installer prints the detected backend. For unofficially-supported GPUs:

# RX 6800 / 6900 XT
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve

# Strix Halo iGPU
HSA_OVERRIDE_GFX_VERSION=11.5.1 ollama serve

# Set persistently in systemd
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
Environment="HCC_AMDGPU_TARGET=gfx1151"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama

Verify GPU is being used:

ollama run llama3.1:8b "hi"
# In another terminal:
rocm-smi
# Should show >0% GPU utilization during inference
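Once the server is up you can drive it over Ollama's REST API. A minimal sketch assuming the default port 11434 and that llama3.1:8b is already pulled:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "llama3.1:8b") -> dict:
    """Request body for Ollama's /api/generate; stream=False returns one JSON object."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3.1:8b",
             host: str = "http://localhost:11434") -> str:
    body = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running:
# print(generate("Reply with the single word: ready"))
```

Watching rocm-smi while this runs is still the simplest way to confirm the GPU, not the CPU, is doing the work.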

llama.cpp on ROCm {#llamacpp}

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with HIP backend
HIPCXX="$(hipconfig -l)/clang" \
HIP_PATH="$(hipconfig -R)" \
cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1100 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

./build/bin/llama-cli -m model.gguf -ngl 999 -fa

AMDGPU_TARGETS should match your GPU's gfx version. For multiple GPUs, use gfx1100;gfx1101.

llama.cpp Vulkan (alternative)

For unsupported AMD GPUs, the Vulkan backend works on anything with Vulkan 1.2:

cmake -B build-vk -DGGML_VULKAN=ON
cmake --build build-vk -j

Vulkan is 60-80% the speed of HIP/ROCm but works on RX 5000-series, Intel Arc, and even some integrated GPUs. See our Intel Arc A770 guide for similar patterns.


vLLM on ROCm {#vllm}

docker pull rocm/vllm:latest

docker run --device /dev/kfd --device /dev/dri \
    --group-add video --group-add render \
    --security-opt seccomp=unconfined \
    --shm-size 16G \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    rocm/vllm:latest \
    vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 16384

For AWQ quantization (supported on RDNA 3):

vllm serve casperhansen/llama-3.1-8b-instruct-awq \
    --quantization awq \
    --max-model-len 16384

FP8 weights work on MI300X/MI325X but not on RDNA 3. INT8 (W8A8) works on both.

For full vLLM tuning, see our vLLM Complete Setup Guide — most flags are identical between CUDA and ROCm.


PyTorch on ROCm {#pytorch}

# Stable wheel for ROCm 6.2
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2

# Verify
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

torch.cuda.is_available() returns True even on AMD because the HIP runtime maps to the CUDA API. torch.version.hip returns the ROCm version.

Hugging Face Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Most Hugging Face models work unchanged. Exceptions: anything that imports xformers (use attn_implementation="sdpa" instead) or bitsandbytes (use AWQ / GPTQ instead, or the bitsandbytes-rocm fork).


FlashAttention on AMD {#flash-attention}

RDNA 3 (RX 7900-series)

git clone https://github.com/ROCm/flash-attention
cd flash-attention
GPU_ARCHS="gfx1100" python setup.py install

This is AMD's fork; the upstream Tri Dao FlashAttention-2 also has CK (Composable Kernel) support but the AMD fork is generally faster on RDNA 3.

CDNA 3 (MI300X)

pip install flash-attn --no-build-isolation

Upstream FlashAttention-2 supports MI300X via CK kernels in v2.5+.

Use in llama.cpp / vLLM / PyTorch

  • llama.cpp: -fa flag (auto-detected)
  • vLLM: auto-detected
  • PyTorch: torch.nn.functional.scaled_dot_product_attention automatically uses FlashAttention when shapes are favorable

Performance impact

| Workload | RX 7900 XTX (no FA) | RX 7900 XTX (FA2) | Speedup |
|---|---|---|---|
| Llama 3.1 8B, 8K ctx | 58 tok/s | 91 tok/s | 1.57x |
| Llama 3.1 8B, 16K ctx | 22 tok/s | 67 tok/s | 3.05x |
| Llama 3.1 8B, 32K ctx | OOM | 38 tok/s | — |

Same pattern as NVIDIA — FlashAttention is mandatory for long context.
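The reason is arithmetic: naive attention materializes a seq×seq score matrix per head, while FlashAttention streams it in tiles and never stores it. A sketch of the memory math, using Llama 3.1 8B's published config (32 heads) — this is the theoretical activation size, not a measured footprint:

```python
def naive_attention_scores_bytes(seq_len: int, n_heads: int = 32,
                                 dtype_bytes: int = 2) -> int:
    """Memory to materialize one layer's full attention score matrix (naive attention)."""
    return n_heads * seq_len * seq_len * dtype_bytes

for ctx in (8192, 16384, 32768):
    gib = naive_attention_scores_bytes(ctx) / 2**30
    print(f"{ctx:>6} ctx: {gib:6.1f} GiB per layer")
```

At 32K context the score matrix alone is 64 GiB per layer in FP16, which is why the no-FA column shows OOM on a 24 GB card.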


Quantization on ROCm {#quantization}

| Format | RDNA 3 (RX 7900) | RDNA 4 (RX 9070) | CDNA 3 (MI300X) |
|---|---|---|---|
| FP16 / BF16 | ✅ | ✅ | ✅ |
| FP8 | ❌ | ✅ (Navi 48) | ✅ (E4M3, E5M2) |
| INT8 (W8A8) | ✅ | ✅ | ✅ |
| AWQ-INT4 | ✅ | ✅ | ✅ |
| GPTQ-INT4 | ✅ | ✅ | ✅ |
| GGUF Q4_K_M / Q5_K_M | ✅ (llama.cpp) | ✅ (llama.cpp) | ✅ (llama.cpp) |

For RDNA 3, AWQ-INT4 with vLLM or Q5_K_M / Q4_K_M with llama.cpp are the practical defaults. For MI300X, FP8 + AWQ-INT4 + INT8 W8A8 all work.

See AWQ vs GPTQ vs GGUF for the underlying theory.
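A rough weights-size estimate is useful for picking a quant before downloading. A sketch — the 20% overhead factor for KV cache and activations is a rule of thumb, not a measurement:

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate VRAM footprint of model weights in GB."""
    return params_b * bits_per_weight / 8

def fits(params_b: float, bits_per_weight: float, vram_gb: float,
         overhead: float = 1.2) -> bool:
    """Rule of thumb: weights plus ~20% headroom for KV cache and activations."""
    return weights_gb(params_b, bits_per_weight) * overhead <= vram_gb

print(weights_gb(8, 5.5))    # Llama 3.1 8B at Q5_K_M (~5.5 bpw) -> 5.5 GB of weights
print(fits(8, 5.5, 24))      # easily fits a 24 GB RX 7900 XTX
print(fits(70, 4.85, 24))    # 70B Q4_K_M does not fit a single 24 GB card
```

This is why 70B models need either multi-GPU, Strix Halo's unified memory, or a datacenter part.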


Tuning the Radeon RX 7900 XTX {#radeon-7900}

Power and thermals

# Set power cap to 290W (stock 355W) — saves ~15% power, ~3% perf loss
sudo rocm-smi --setpoweroverdrive 290

# Lock GPU clock for predictable latency
sudo rocm-smi --setperflevel high

# Fan curve
sudo rocm-smi --setfan 70%

llama.cpp recipe

./build/bin/llama-server \
    -m llama-3.1-8b-instruct-q5_k_m.gguf \
    -ngl 999 \
    -c 16384 \
    -fa \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    -t 8 \
    --host 0.0.0.0 --port 8080

vLLM recipe

vllm serve casperhansen/llama-3.1-8b-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192

Multi-GPU 7900 XTX

# llama.cpp tensor split for 2x 7900 XTX
./llama-cli -m 70b-q4.gguf -ngl 999 --tensor-split 24,24

# vLLM TP=2
vllm serve <70b-awq-model> --tensor-parallel-size 2 --quantization awq

7900 XTX does not support NVLink/Infinity Fabric Link, so multi-GPU is PCIe-only. PCIe 4.0 x16 (~32 GB/s practical) is the bottleneck for tensor parallelism — expect 1.5-1.7x speedup with TP=2 vs single-GPU on a 70B model that fits split.


Strix Halo / Ryzen AI Max+ 395 {#strix-halo}

The most interesting AMD platform of 2025-2026 for local AI.

Hardware

  • CPU: Zen 5, 16 cores / 32 threads
  • GPU: RDNA 3.5, 40 CUs, gfx1151
  • Memory: Up to 128 GB LPDDR5X-8000, ~256 GB/s bandwidth
  • NPU: XDNA 2, ~50 TOPS (separate accelerator)
  • Power: 55-120W configurable
  • Form factor: Mini-PC (Framework Desktop, Asus, HP), laptops

Why it matters

128 GB of unified memory means a 70B model at 8-bit (~70 GB of weights) fits entirely on the iGPU with room to spare for KV cache; full BF16 weights for 70B (~140 GB) still exceed it, so some quantization remains necessary. No consumer NVIDIA card has this much memory. The closest comparison is a Mac Studio M4 Max with 128 GB unified at $4,000+; Strix Halo systems start around $2,000.

Setup

# Override gfx version
echo 'export HSA_OVERRIDE_GFX_VERSION=11.5.1' >> ~/.bashrc
echo 'export HCC_AMDGPU_TARGET=gfx1151' >> ~/.bashrc
source ~/.bashrc

# Install ROCm 6.3+ (Strix Halo support added)
sudo amdgpu-install --usecase=rocm,hiplibsdk

# Reserve memory for iGPU (BIOS or sysctl)
# Most Strix Halo systems let you allocate 96-110 GB to GPU in BIOS

Performance (Llama 3.1 70B)

| Quant | tok/s | Notes |
|---|---|---|
| BF16 | ~14 | ~140 GB of weights — exceeds 128 GB, needs partial CPU offload |
| Q5_K_M | ~26 | Best balance |
| Q4_K_M | ~32 | Fastest |
| AWQ-INT4 (vLLM) | ~38 | Highest quality at 4-bit |

For comparison: 2x RTX 4090 (PCIe, no NVLink) on the same 70B Q4 hits ~24 tok/s — at much higher cost and power. Strix Halo is the best "70B in a quiet desktop" option in 2026 below Mac Studio prices.

Caveats

  • LPDDR5X bandwidth (~256 GB/s) is much lower than discrete GPU VRAM (1,000+ GB/s), so single-stream throughput on small models that fit in 24 GB lags discrete GPUs significantly.
  • GPU and CPU share the memory bus — heavy CPU workloads steal bandwidth from inference.
  • Long prompt prefill is slow vs discrete GPUs because compute is bandwidth-bound on this platform.
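These caveats come down to one equation: single-stream decode reads every active weight once per token, so tok/s is capped at roughly memory bandwidth ÷ weight bytes. A sketch using the bandwidth figures above (real throughput lands below this ceiling):

```python
def decode_ceiling_toks(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed: each token reads all weights once."""
    return bandwidth_gbps / weights_gb

# Strix Halo (~256 GB/s LPDDR5X) vs RX 7900 XTX (~960 GB/s GDDR6),
# Llama 3.1 8B Q4 at ~4.9 GB of weights
print(decode_ceiling_toks(256, 4.9))
print(decode_ceiling_toks(960, 4.9))
```

The ~52 tok/s ceiling for Strix Halo on an 8B Q4 model matches the measured ~48 tok/s in the benchmark section: the platform runs close to its bandwidth limit, and a discrete card's higher bandwidth is exactly why it pulls ahead on small models.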

MI300X / MI325X for Servers {#mi300x}

Specs

| GPU | VRAM | Bandwidth | Compute (FP16/BF16) | Compute (FP8) |
|---|---|---|---|---|
| MI300X | 192 GB HBM3 | 5.3 TB/s | 1.3 PFLOPS | 2.6 PFLOPS |
| MI325X | 256 GB HBM3e | 6.0 TB/s | 1.3 PFLOPS | 2.6 PFLOPS |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 0.99 PFLOPS | 1.98 PFLOPS |
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 0.99 PFLOPS | 1.98 PFLOPS |
| B200 SXM | 192 GB HBM3e | 8.0 TB/s | ~2.5 PFLOPS | 4.5 PFLOPS |

Why MI300X wins on certain LLM workloads

LLM inference is memory-bandwidth-bound for decode and capacity-bound for long-context KV-cache. MI300X's 192 GB at 5.3 TB/s beats an H100's 80 GB at 3.35 TB/s on both axes. Real-world wins:

  • Llama 3.1 405B FP8 across an 8x MI300X node — ~51 GB of weights per GPU leaves generous KV-cache headroom; on 80 GB H100s the FP8 weights alone need at least 6-way tensor parallelism.
  • Long-context (128K) inference — MI300X holds the KV cache without offload at large batch sizes.
  • Multi-tenant servers — more concurrent requests fit per GPU.
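The KV-cache capacity argument is easy to quantify: per token, the cache stores one key and one value vector for every layer. Using Llama 3.1 70B's published config (80 layers, 8 KV heads under GQA, head dim 128):

```python
def kv_cache_gb(ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 1) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * dtype_bytes / 2**30

# Llama 3.1 70B, FP8 KV cache (1 byte/element), full 128K context
print(f"{kv_cache_gb(131072):.1f} GiB")
```

A full 128K-context FP8 KV cache for 70B is 20 GiB per sequence — trivial next to MI300X's 192 GB, but enough to evict an entire consumer GPU's VRAM.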

vLLM on MI300X

docker run --device=/dev/kfd --device=/dev/dri \
    --security-opt seccomp=unconfined --shm-size 32G \
    --network=host \
    rocm/vllm:latest \
    vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --tensor-parallel-size 8 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.93 \
    --enable-prefix-caching

For multi-GPU MI300X (8x in OAM), use --tensor-parallel-size 8 and ensure RCCL is configured:

export NCCL_DEBUG=WARN     # RCCL respects NCCL_* vars
export RCCL_MSCCL_ENABLE=1 # MSCCL collective optimizations

HSA_OVERRIDE_GFX_VERSION: Unofficial Cards {#hsa-override}

Many older Radeon cards work with ROCm by lying about their gfx version. ROCm ships kernels for officially-supported gfx versions; unsupported cards run those kernels by claiming a compatible architecture.

| Your GPU | Override Value |
|---|---|
| RX 6700 XT, 6750 XT, 6800, 6800 XT, 6900 XT, 6950 XT | 10.3.0 |
| RX 6600, 6600 XT, 6650 XT | 10.3.0 |
| RX 7600, 7600 XT, 7700 XT, 7800 XT | 11.0.0 |
| Ryzen 7040/8040 (780M iGPU) | 11.0.3 |
| Ryzen AI 300-series (Strix Point 890M) | 11.5.0 |

Set globally:

echo 'export HSA_OVERRIDE_GFX_VERSION=10.3.0' >> ~/.bashrc

Or per-application via systemd override (see Ollama section).

Risks: unsupported configurations can crash, hang, or silently produce wrong results on edge-case operations. For inference of standard Llama / Mistral / Qwen models the failure modes are usually clean (crashes, not silent wrong outputs). Do not use HSA override for production.
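The override tables in this guide can live in a small helper so setup scripts pick the right value automatically. The mapping below reproduces this article's tables; the helper itself is a sketch (gfx detection still comes from rocminfo):

```python
# gfx architecture -> HSA_OVERRIDE_GFX_VERSION, per the tables in this guide
GFX_OVERRIDE = {
    "gfx1030": "10.3.0",   # RX 6800 / 6900-class
    "gfx1031": "10.3.0",   # RX 6700 XT / 6750 XT
    "gfx1032": "10.3.0",   # RX 6600-class
    "gfx1101": "11.0.0",   # RX 7800 XT / 7700 XT
    "gfx1102": "11.0.0",   # RX 7600 / 7600 XT
    "gfx1103": "11.0.3",   # Ryzen 7040/8040 780M iGPU
    "gfx1150": "11.5.0",   # Strix Point
    "gfx1151": "11.5.1",   # Strix Halo
}

def override_for(gfx: str):
    """Return the override string for an unofficial card, or None if not needed."""
    return GFX_OVERRIDE.get(gfx)

print(override_for("gfx1031"))   # unofficial card -> needs 10.3.0
print(override_for("gfx1100"))   # officially supported -> None, no override
```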


Performance Benchmarks {#benchmarks}

All benchmarks: Ollama with Q4_K_M quantization at 4K context, single user, 22°C ambient.

Llama 3.1 8B

| GPU | tok/s | Power | $ per tok/s |
|---|---|---|---|
| RTX 4090 (24GB) | 127 | 380W | $14.2 |
| RTX 5080 (16GB) | 168 | 360W | $7.7 |
| RX 7900 XTX (24GB) | 96 | 320W | $8.6 |
| RX 7900 XT (20GB) | 82 | 290W | $8.5 |
| RTX 3090 (24GB) | 95 | 320W | $7.4 (used) |
| Strix Halo iGPU | 48 | 80W | n/a (system price) |

Llama 3.1 70B Q4

| GPU | tok/s | Notes |
|---|---|---|
| 2x RTX 3090 NVLink (48GB) | 28 | Best NVIDIA value |
| 2x RX 7900 XTX (48GB, no NVLink) | 22 | Best AMD discrete |
| 1x MI300X (192GB) | 58 | Single GPU |
| Strix Halo (128GB unified) | 32 | Mini-PC |
| Mac Studio M4 Max (128GB) | 28 | Reference |

MI300X is clearly the throughput leader for 70B. For the home / SMB segment, Strix Halo is genuinely competitive with multi-GPU rigs at lower cost and dramatically lower power.
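The $ per tok/s column is just street price divided by decode throughput, so it is easy to recompute as prices move. A sketch — the prices below are the assumptions behind the table, not current quotes:

```python
def dollars_per_tok_s(price_usd: float, toks: float) -> float:
    """Street price divided by decode throughput; lower is better."""
    return price_usd / toks

# Assumed street prices; tok/s from the Llama 3.1 8B table above
for name, price, toks in [("RTX 4090", 1800, 127),
                          ("RX 7900 XTX", 825, 96),
                          ("RTX 3090 (used)", 700, 95)]:
    print(f"{name:16s} ${dollars_per_tok_s(price, toks):.1f} per tok/s")
```

Plugging in your local prices is worthwhile: the 7900 XTX's value case rests entirely on its street price staying well under half a 4090's.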


Troubleshooting {#troubleshooting}

| Symptom | Likely Cause | Fix |
|---|---|---|
| hipErrorNoBinaryForGpu | gfx mismatch | Set HSA_OVERRIDE_GFX_VERSION or rebuild with the right AMDGPU_TARGETS |
| Ollama uses CPU only | Driver / group permissions | groups should include render and video; rocm-smi should list the GPU |
| HSAKMT_STATUS_KERNEL_ALREADY_OPENED | Old AMD kernel module | Reinstall amdgpu-dkms |
| Crashes mid-inference | Memory pressure / power | Lower -c context size, reduce --gpu-memory-utilization |
| Very slow on RX 7900 | FlashAttention not enabled | Build llama.cpp with HIP, pass -fa |
| WSL2 GPU not detected | Wrong driver | Install AMD Software for WSL on the Windows host |
| rocBLAS warnings | Architecture not in shipped Tensile library | Use a compatible gfx override (e.g. gfx1100) or rebuild rocBLAS for your arch |
| Strix Halo OOM on 70B models | Memory not allocated to iGPU | Increase iGPU reservation in BIOS to 96 GB+ |
| MI300X under-utilized | Not enough concurrency | Increase --max-num-seqs to 256+ in vLLM |

FAQ {#faq}



Sources: AMD ROCm Documentation | Ollama AMD support | vLLM ROCm Docker | llama.cpp HIP build instructions | AMD Strix Halo product page | Internal benchmarks on RX 7900 XTX, Strix Halo, and MI300X.

Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.

✓ Local AI Curriculum✓ Hands-On Projects✓ Open Source Contributor
📚
Free · no account required

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

No spam. Unsubscribe with one click.

🎯
AI Learning Path

Go from reading about AI to building with AI

10 structured courses. Hands-on projects. Runs on your machine. Start free.

Free Tools & Calculators