AMD MI300X Deep Dive (2026): The 192GB GPU That Beats H100 on LLM Inference
The AMD Instinct MI300X is the GPU that finally made AMD competitive with NVIDIA at the datacenter tier for LLM inference. 192 GB of HBM3, 5.3 TB/s of memory bandwidth, native FP8, and 2.6 PFLOPS of FP8 compute — all in a single OAM module. For LLM serving at scale, the MI300X meets or beats the H100 on bandwidth-bound workloads and decisively wins on capacity. For the largest models (405B and beyond), a single eight-GPU MI300X node serves them even in BF16, where an H100 node must drop to FP8 or span multiple nodes.
This deep dive covers everything: architecture, MI325X vs MI300X, software stack (ROCm, vLLM-ROCm, SGLang, TGI), FP8 quantization, real benchmarks vs H100 / H200 / B200, deployment paths (cloud, OEM, AMD Developer Box), and when MI300X is the right choice over NVIDIA.
Table of Contents
- What MI300X Is
- Architecture: CDNA 3
- MI300X vs MI325X vs MI300A
- vs H100 / H200 / B200 Spec Comparison
- Software Stack: ROCm, vLLM, SGLang
- FP8 on CDNA 3
- Deployment Paths: Cloud, OEM, Dev Box
- vLLM-ROCm Setup on MI300X
- Tensor Parallel Across 8x MI300X
- Long Context (128K-1M)
- Llama 3.1 405B Single-Node Inference
- Real Benchmarks
- Network: Infinity Fabric, RoCEv2, RCCL
- MI300X for Training (Honest Assessment)
- Cost Analysis vs H100 Cluster
- When to Choose MI300X
- Common Production Issues
What MI300X Is {#what-it-is}
MI300X is a CDNA 3 datacenter GPU launched at AMD's December 2023 event and shipped in volume from mid-2024. It targets generative AI inference (and to a lesser extent training).
Key specs:
- Memory: 192 GB HBM3, 5.3 TB/s bandwidth
- Compute: 1,300 TFLOPS BF16 / 2,600 TFLOPS FP8 (matrix cores)
- Form factor: OAM module (Open Accelerator Module)
- TBP: ~750 W
- PCIe: Gen 5 x16
The successor MI325X (late 2024) bumps memory to 256 GB HBM3e and bandwidth to 6 TB/s.
For LLM inference — memory-capacity-bound for big models, memory-bandwidth-bound during decode — both axes matter, and the MI300X beat the H100 on both from day one.
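To see why bandwidth matters so much for decode, run the napkin math: generating one token streams every weight byte through the memory system once, so single-stream decode tops out near bandwidth divided by model size. A minimal sketch with illustrative numbers:

```python
# Decode throughput ceiling: each generated token reads all weights once,
# so single-stream decode tops out near bandwidth / model_bytes.
TB = 1e12

def decode_ceiling_tok_s(bandwidth_bytes_per_s: float, model_bytes: float) -> float:
    """Upper bound on single-stream decode tokens/sec (ignores KV-cache reads)."""
    return bandwidth_bytes_per_s / model_bytes

llama_70b_fp8 = 70e9  # ~70 GB of weights at 1 byte/param

for name, bw in [("MI300X", 5.3 * TB), ("H100 SXM", 3.35 * TB)]:
    print(f"{name}: ~{decode_ceiling_tok_s(bw, llama_70b_fp8):.0f} tok/s ceiling")

# MI300X ~76 tok/s vs H100 ~48 tok/s: the bandwidth ratio flows straight
# into decode throughput, kernels and batching permitting.
```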
Architecture: CDNA 3 {#architecture}
CDNA 3 is the third generation of AMD's data-center compute architecture (separate from RDNA used in consumer Radeon GPUs).
- 304 CUs (compute units) per chip
- 1,216 matrix cores for tensor math
- 8 HBM3 stacks providing the 192 GB / 5.3 TB/s
- Chiplet design: 8 XCDs (Accelerator Compute Dies) on TSMC N5, plus 4 IODs (I/O Dies) on N6, all connected via Infinity Fabric
- Native FP8 (E4M3 and E5M2)
- Native sparsity (2:4 structured)
- Coherent unified memory with EPYC CPU when paired in MI300A APU variant
The chiplet design is what enabled the 192 GB HBM stack: a monolithic die large enough to feed eight HBM stacks would blow past reticle and yield limits.
MI300X vs MI325X vs MI300A {#mi300-variants}
| Variant | MI300A | MI300X | MI325X |
|---|---|---|---|
| Type | APU (CPU+GPU) | GPU only | GPU only |
| GPU dies | 6 | 8 | 8 |
| CPU cores | 24 Zen 4 | 0 | 0 |
| Memory | 128 GB HBM3 | 192 GB HBM3 | 256 GB HBM3e |
| Bandwidth | 5.3 TB/s | 5.3 TB/s | 6 TB/s |
| FP16/BF16 | 0.98 PFLOPS | 1.3 PFLOPS | 1.3 PFLOPS |
| FP8 | 1.96 PFLOPS | 2.6 PFLOPS | 2.6 PFLOPS |
| Use case | HPC, El Capitan | Generative AI inference | Largest models |
For LLM inference, MI300X or MI325X. MI300A is for HPC nodes (Lawrence Livermore's El Capitan supercomputer is built on it) and not typically deployed for LLMs.
vs H100 / H200 / B200 Spec Comparison {#vs-nvidia}
| Spec | H100 SXM | H200 SXM | B200 SXM | MI300X | MI325X |
|---|---|---|---|---|---|
| Memory | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 192 GB HBM3 | 256 GB HBM3e |
| Bandwidth | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s | 5.3 TB/s | 6.0 TB/s |
| FP16/BF16 (PFLOPS) | 0.99 | 0.99 | ~2.5 | 1.3 | 1.3 |
| FP8 (PFLOPS) | 1.98 | 1.98 | ~4.5 | 2.6 | 2.6 |
| FP4 | ❌ | ❌ | ✅ | ❌ | ❌ |
| TBP (W) | 700 | 700 | 1000 | 750 | 750 |
| Price (cloud / hour, est) | $4-6 | $5-7 | $9-12 | $3-5 | $4-6 |
| Available | 2023 | 2024 | 2025 | 2024 | 2024 |
MI300X beats H100 on memory, bandwidth, BF16 compute, and FP8 compute. It loses to B200 on bandwidth and raw compute, and lacks FP4 (Blackwell-only). For LLM inference, MI300X / MI325X is the strongest non-Blackwell option.
Software Stack: ROCm, vLLM, SGLang {#software}
| Layer | Component |
|---|---|
| Driver / runtime | ROCm 6.3+ |
| Math libraries | rocBLAS, hipBLAS, MIOpen, hipFFT |
| Collective comms | RCCL (NCCL-compatible) |
| Compiler | hipcc, ROCmCC |
| Kernel libraries | Composable Kernel (CK) |
| FlashAttention | upstream FA-2 v2.5+, FA-3 in development |
| LLM serving | vLLM-ROCm, SGLang-ROCm, TGI-ROCm |
| Quantization | AMD Quark, AutoGPTQ, AutoAWQ |
| Frameworks | PyTorch (ROCm), JAX (ROCm) |
For most LLM inference deployments: PyTorch + vLLM-ROCm. Same code path as the NVIDIA equivalent — same Python, same Hugging Face APIs, same OpenAI-compatible server. The differences are in the kernel implementations under the hood.
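As a concrete example of the shared code path: ROCm builds of PyTorch expose the standard torch.cuda device API, so ordinary Hugging Face code runs unchanged on MI300X. A minimal sketch (the checkpoint is just an example; assumes transformers and accelerate are installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "cuda" maps to the AMD GPU on a ROCm PyTorch build; no code changes needed.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

inputs = tok("The MI300X has", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```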
FP8 on CDNA 3 {#fp8}
CDNA 3 was AMD's first GPU with native FP8. Two formats:
- E4M3 (4 exp / 3 mantissa) — for forward pass, weights, activations
- E5M2 (5 exp / 2 mantissa) — wider range, used for gradients
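PyTorch (2.1+) ships both formats as dtypes, so the range/precision trade-off is easy to inspect directly; note that ROCm builds also expose the FNUZ variants (torch.float8_e4m3fnuz, torch.float8_e5m2fnuz) matching CDNA 3's hardware encoding. A quick sketch:

```python
import torch

# Fewer mantissa bits = coarser steps; more exponent bits = wider range.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    fi = torch.finfo(dtype)
    print(f"{dtype}: max={fi.max}, eps={fi.eps}")

# torch.float8_e4m3fn: max=448.0,   eps=0.125 -> weights / activations
# torch.float8_e5m2:   max=57344.0, eps=0.25  -> gradients (wider range)
```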
vLLM-ROCm flags:
```bash
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768
```
NeuralMagic publishes FP8 quants of popular Llama / Qwen / Mixtral models. AMD's Quark tool produces FP8 from any BF16 model:
```bash
# Conceptual — exact CLI flags depend on the Quark release
python -m amd_quark quantize \
  --model_path meta-llama/Llama-3.1-70B-Instruct \
  --output_dir ./Llama-3.1-70B-Instruct-FP8 \
  --quant_format fp8 \
  --calib_dataset wikitext-2
```
Quality after FP8 quantization: typically within 0.3-0.7% MMLU of BF16. KV-cache FP8 saves ~50% KV memory at <0.5% quality loss.
Deployment Paths: Cloud, OEM, Dev Box {#deployment}
Cloud
| Provider | Instance | GPUs | Notes |
|---|---|---|---|
| Azure | ND MI300X v5 | 8x MI300X | OpenAI inference platform |
| Oracle OCI | BM.GPU.MI300X.8 | 8x MI300X | Bare metal |
| AMD Developer Cloud | varies | 1-8 | Free tier for testing |
| TensorWave | varies | 1-32 | AMD-focused cloud |
| Hot Aisle | varies | 1-8 | AMD-focused, EU presence |
| Lambda Labs | beta | 1-8 | Limited availability |
Hourly rates (mid-2026, best-effort): MI300X ~$3-5/h, 8x MI300X node ~$25-35/h. Often cheaper than H100 for similar capacity.
OEM systems
| OEM | Model | Config |
|---|---|---|
| Dell | PowerEdge XE9680 | 8x MI300X |
| Supermicro | AS-8125GS-TNMR2 | 8x MI300X |
| HPE | Cray XD675 | 8x MI300X |
| Lenovo | ThinkSystem SR685a V3 | 8x MI300X |
| Gigabyte | G593-ZX1-AAX1 | 8x MI300X |
Smaller dev systems
For workstations: AMD partner program "MI300X Developer Box" — 1-2 MI300X cards in a workstation chassis. Pricing varies, typically $25K+ per card.
vLLM-ROCm Setup on MI300X {#vllm-setup}
```bash
# Pull ROCm vLLM image
docker pull rocm/vllm:latest

# Run on a single MI300X
docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --shm-size 32G \
  --network=host \
  rocm/vllm:latest \
  vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.93 \
    --enable-prefix-caching \
    --enable-chunked-prefill
```
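Once the container is up, the server speaks the OpenAI API (port 8000 by default), so the stock openai Python client works as a smoke test:

```python
from openai import OpenAI

# vLLM ignores the API key by default; base_url points at the local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    messages=[{"role": "user", "content": "One sentence on HBM3, please."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```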
For 8x MI300X (most common deployment):
```bash
docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --shm-size 64G \
  --network=host \
  rocm/vllm:latest \
  vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --tensor-parallel-size 8 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.94 \
    --enable-prefix-caching
```
For full vLLM tuning, see vLLM Complete Setup Guide — same flags work, ROCm is the backend.
Tensor Parallel Across 8x MI300X {#tensor-parallel}
```bash
# RCCL tuning (RCCL reads the NCCL_* environment variables)
export NCCL_DEBUG=WARN
export RCCL_MSCCL_ENABLE=1
export NCCL_P2P_LEVEL=SYS   # use Infinity Fabric where available
```
The 8 MI300X GPUs in a node are connected via Infinity Fabric in a fully connected mesh: each GPU has seven links at ~128 GB/s, or ~896 GB/s aggregate per GPU, comparable to H100's 900 GB/s NVLink. TP=8 scaling is roughly linear within the node; cross-node traffic uses RoCE / InfiniBand.
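To confirm collectives are actually riding the Infinity Fabric mesh at full speed, a small torch.distributed probe works on either vendor, since the "nccl" backend is RCCL on ROCm builds. A sketch, assuming a standard torchrun launch (the filename is arbitrary):

```python
# Launch: torchrun --nproc_per_node=8 allreduce_probe.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # RCCL on ROCm
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(256 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")  # 512 MB

for _ in range(5):  # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / iters

# Standard bus-bandwidth formula: ring all-reduce moves 2*(n-1)/n of the buffer.
n = dist.get_world_size()
bus_bw = x.numel() * x.element_size() * 2 * (n - 1) / n / elapsed / 1e9
if dist.get_rank() == 0:
    print(f"all-reduce bus bandwidth: ~{bus_bw:.0f} GB/s")
dist.destroy_process_group()
```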
Long Context (128K-1M) {#long-context}
The 192-256 GB memory makes MI300X / MI325X especially strong for very long contexts:
```bash
# Note: Llama 3.1's native window is 128K; going past it requires a
# RoPE-scaling override (vLLM rejects a longer --max-model-len otherwise)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 1048576 \
  --kv-cache-dtype fp8_e4m3 \
  --enable-prefix-caching
```
With the override in place, 1M-context Llama 3.1 8B fits comfortably on a single MI300X, with roughly 150 GB of KV-cache headroom after weights (see the sketch below). For 70B at long context, use FP8 KV cache. For 405B plus long context, use multi-GPU TP.
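The KV-cache arithmetic behind that claim, using Llama 3.1 8B's published config (32 layers, 8 KV heads via GQA, head dim 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int) -> float:
    # K and V each store layers * kv_heads * head_dim elements per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

seq = 1_048_576  # 1M tokens
print(f"FP16 KV: {kv_cache_gb(32, 8, 128, seq, 2):.0f} GB")  # ~137 GB
print(f"FP8  KV: {kv_cache_gb(32, 8, 128, seq, 1):.0f} GB")  # ~69 GB

# With ~16 GB of BF16 weights, even the FP16 cache squeezes into 192 GB;
# FP8 KV leaves generous headroom for batching.
```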
Llama 3.1 405B Single-Node Inference {#405b}
The headline use case. Llama 3.1 405B in FP8 is ~405 GB of weights, too big for any single GPU, but two MI325X (256 GB each, 512 GB total) hold it with TP=2 and leave ~100 GB for KV cache:
```bash
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --kv-cache-dtype fp8_e4m3
```
On an 8x MI300X node, 405B FP8 has plenty of room: TP=8 delivers ~50-80 tok/s aggregate at moderate concurrency. An H100 node needs all eight GPUs (8 x 80 GB = 640 GB) just to hold the ~405 GB of FP8 weights plus KV cache, and cannot fit 405B in BF16 on one node at all.
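The memory math behind both claims, using capacities from the spec table above:

```python
# Fit check: Llama 3.1 405B at FP8 is ~405 GB of weights (1 byte/param).
weights_gb = 405

configs = {
    "1x MI300X":  1 * 192,
    "2x MI325X":  2 * 256,
    "4x H100":    4 * 80,
    "8x H100":    8 * 80,
    "8x MI300X":  8 * 192,
}
for name, capacity_gb in configs.items():
    headroom = capacity_gb - weights_gb
    verdict = f"fits, ~{headroom} GB left for KV" if headroom > 0 else "does not fit"
    print(f"{name} ({capacity_gb} GB): {verdict}")

# Only the 8x MI300X node (1536 GB) also fits the ~810 GB of BF16 weights.
```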
Real Benchmarks {#benchmarks}
8x MI300X (single node) vs 8x H100 SXM (single node), both FP8, vLLM, 16 concurrent users.
| Model | 8x MI300X | 8x H100 SXM |
|---|---|---|
| Llama 3.1 8B (replicated 8 instances) | 25,000 tok/s | 22,000 tok/s |
| Llama 3.1 70B (TP=8, single instance) | 380 tok/s | 290 tok/s |
| Llama 3.1 70B (TP=2, 4 instances) | 1,250 tok/s | 980 tok/s |
| Llama 3.1 405B (TP=8) | 95 tok/s | 70 tok/s |
| DeepSeek V3 671B MoE | 180 tok/s | 130 tok/s |
The MI300X edge is roughly 15-40% on these serving workloads, largest on the biggest models.
For more benchmark detail: AMD's ROCm performance reports and MLPerf Inference 4.0+ submissions.
Network: Infinity Fabric, RoCEv2, RCCL {#network}
Within an 8x MI300X node, Infinity Fabric links form a full mesh, giving each GPU ~896 GB/s of aggregate peer bandwidth across its seven links. Across nodes, the standard pattern is RoCE v2 over 400 Gb/s Ethernet or InfiniBand HDR/NDR.
RCCL (the NCCL drop-in for ROCm) handles all-reduce, all-gather, etc. Tuning:
```bash
export NCCL_DEBUG=WARN
export NCCL_NET_GDR_LEVEL=PHB   # GPU Direct RDMA
export NCCL_IB_HCA=mlx5
export NCCL_IB_TC=106
export NCCL_IB_GID_INDEX=3
```
For multi-node training / inference clusters, the network is often the primary bottleneck regardless of vendor.
MI300X for Training (Honest Assessment) {#training}
MI300X has the raw compute and memory for LLM training. Software gaps as of mid-2026:
- PyTorch + ROCm — works for most architectures; compile-time issues on bleeding-edge models
- Megatron-LM — community ROCm fork exists, lags upstream
- DeepSpeed — works on ROCm but with caveats for specific optimizers
- NeMo — limited ROCm support
- Custom CUDA kernels — frequently missing AMD ports
For most training workloads on standard architectures (Llama, Qwen, Gemma, Mistral families), MI300X works. For research that depends on specific NVIDIA-only kernels (FlashAttention-3, certain optimizers, MoE-specific code paths), NVIDIA is still the safer choice.
For inference-only deployments, MI300X has fewer software gotchas.
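Before debugging framework gaps, confirm the basics: that your PyTorch is a ROCm build and actually sees the GPU. On ROCm builds torch.version.hip is set (it is None on CUDA builds) and the torch.cuda namespace is the supported API:

```python
import torch

assert torch.cuda.is_available(), "no GPU visible; check /dev/kfd permissions"
print("HIP:", torch.version.hip)                     # e.g. '6.3.x'; None on CUDA builds
print("device:", torch.cuda.get_device_name(0))      # e.g. 'AMD Instinct MI300X'
props = torch.cuda.get_device_properties(0)
print(f"memory: {props.total_memory / 1e9:.0f} GB")  # ~192 GB on MI300X
```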
Cost Analysis vs H100 Cluster {#cost}
Approximate mid-2026 cost-per-million-tokens for Llama 3.1 70B FP8 serving at 90% utilization:
| Setup | Hardware $/hr | tok/s | $/M tokens |
|---|---|---|---|
| 1x MI300X (cloud) | $3.5 | 140 | $7.0 |
| 1x H100 SXM (cloud) | $4.5 | 95 | $13.2 |
| 1x H200 (cloud) | $5.5 | 130 | $11.7 |
| 8x MI300X (TP=8) | $28 | 380 | $20.5 |
| 8x H100 SXM (TP=8) | $36 | 290 | $34.5 |
MI300X is roughly 40-50% cheaper per million tokens at this workload. Numbers vary by model size, concurrency, and provider; benchmark your own workload before committing.
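The $/M-token column is simply hourly rate divided by token throughput; reproduce it with your own provider quotes and measured tok/s:

```python
def dollars_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / (tokens_per_hour / 1e6)

print(f"1x MI300X: ${dollars_per_million_tokens(3.5, 140):.1f}")  # ~$6.9
print(f"1x H100:   ${dollars_per_million_tokens(4.5, 95):.1f}")   # ~$13.2
print(f"8x MI300X: ${dollars_per_million_tokens(28, 380):.1f}")   # ~$20.5
```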
When to Choose MI300X {#when-to-choose}
Pick MI300X / MI325X when:
- You serve large models (70B, 405B, large MoE) at moderate-to-high concurrency.
- You need long-context inference (128K+) and lots of KV cache room.
- You want lower cost per token at scale.
- You're running standard architectures (Llama, Qwen, DeepSeek, Mixtral, Gemma) where ROCm is mature.
- You want to diversify away from NVIDIA dependency for procurement / cost reasons.
- You're deploying via Azure, Oracle, or AMD Developer Cloud where MI300X is a first-class citizen.
Stick with H100 / H200 / B200 when:
- You need bleeding-edge model support the day of release.
- You depend on TensorRT-LLM specifically.
- You train cutting-edge models.
- You need FP4 (Blackwell only).
- Your team has deep NVIDIA expertise and switching cost is high.
- You need software guarantees from the SLA standpoint that only NVIDIA currently provides for some specialized workloads.
Common Production Issues {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Cold start slow | Container + driver init | Pre-warm with health pings |
| RCCL all-reduce slow | Network not tuned | Set NCCL_* env vars; verify Infinity Fabric |
| FP8 accuracy regression | Calibration set too small | Increase calib samples to 1024+ |
| OOM at long context | KV cache too large | FP8 KV; lower max_seq_len |
| Single-stream slower than expected | Compute-bound, kernels not tuned | File issue; try newer ROCm |
| Multi-node sync issues | RoCE / IB config | Check QoS, MTU, GID |
| Need MIG-style partitioning | Not yet on CDNA 3 | Use containers + cgroups |
Sources: AMD Instinct MI300X product page | ROCm 6.3 release notes | vLLM-ROCm performance reports | MLPerf Inference benchmarks | Internal benchmarks 8x MI300X cloud nodes.