AMD MI300X Deep Dive (2026): The 192GB GPU That Beats H100 on LLM Inference
The AMD Instinct MI300X is the GPU that finally made AMD competitive with NVIDIA at the datacenter tier for LLM inference. 192 GB of HBM3, 5.3 TB/s of memory bandwidth, native FP8, and 2.6 PFLOPS of FP8 compute — all in a single OAM module. For LLM serving at scale, the MI300X meets or beats the H100 on bandwidth-bound workloads and decisively wins on capacity. For the largest models (405B and beyond), a single eight-GPU MI300X node serves them even in BF16, where an H100 node must drop to FP8 or span multiple nodes.
This deep dive covers everything: architecture, MI325X vs MI300X, software stack (ROCm, vLLM-ROCm, SGLang, TGI), FP8 quantization, real benchmarks vs H100 / H200 / B200, deployment paths (cloud, OEM, AMD Developer Box), and when MI300X is the right choice over NVIDIA.
Table of Contents
- What MI300X Is
- Architecture: CDNA 3
- MI300X vs MI325X vs MI300A
- vs H100 / H200 / B200 Spec Comparison
- Software Stack: ROCm, vLLM, SGLang
- FP8 on CDNA 3
- Deployment Paths: Cloud, OEM, Dev Box
- vLLM-ROCm Setup on MI300X
- Tensor Parallel Across 8x MI300X
- Long Context (128K-1M)
- Llama 3.1 405B Single-Node Inference
- Real Benchmarks
- Network: Infinity Fabric, RoCEv2, RCCL
- MI300X for Training (Honest Assessment)
- Cost Analysis vs H100 Cluster
- When to Choose MI300X
- Common Production Issues
What MI300X Is {#what-it-is}
MI300X is a CDNA 3 datacenter GPU launched at AMD's December 2023 event and shipped in volume from mid-2024. It targets generative AI inference (and to a lesser extent training).
Key specs:
- Memory: 192 GB HBM3, 5.3 TB/s bandwidth
- Compute: 1,300 TFLOPS BF16 / 2,600 TFLOPS FP8 (matrix cores)
- Form factor: OAM module (Open Accelerator Module)
- TBP: ~750 W
- PCIe: Gen 5 x16
The successor MI325X (late 2024) bumps memory to 256 GB HBM3e and bandwidth to 6 TB/s.
For LLM inference — memory-capacity-bound for big models, memory-bandwidth-bound during decode — both axes matter, and the MI300X beat the H100 on both from day one.
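To see why bandwidth matters so much for decode, run the napkin math: generating one token streams every weight byte through the memory system once, so single-stream decode tops out near bandwidth divided by model size. A minimal sketch with illustrative numbers:

```python
# Decode throughput ceiling: each generated token reads all weights once,
# so single-stream decode tops out near bandwidth / model_bytes.
TB = 1e12

def decode_ceiling_tok_s(bandwidth_bytes_per_s: float, model_bytes: float) -> float:
    """Upper bound on single-stream decode tokens/sec (ignores KV-cache reads)."""
    return bandwidth_bytes_per_s / model_bytes

llama_70b_fp8 = 70e9  # ~70 GB of weights at 1 byte/param

for name, bw in [("MI300X", 5.3 * TB), ("H100 SXM", 3.35 * TB)]:
    print(f"{name}: ~{decode_ceiling_tok_s(bw, llama_70b_fp8):.0f} tok/s ceiling")

# MI300X ~76 tok/s vs H100 ~48 tok/s: the bandwidth ratio flows straight
# into decode throughput, kernels and batching permitting.
```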
Architecture: CDNA 3 {#architecture}
CDNA 3 is the third generation of AMD's data-center compute architecture (separate from RDNA used in consumer Radeon GPUs).
- 304 CUs (compute units) per chip
- 1,216 matrix cores for tensor math
- 8 HBM3 stacks providing the 192 GB / 5.3 TB/s
- Chiplet design: 8 XCDs (Accelerator Compute Dies) on TSMC N5, plus 4 IODs (I/O Dies) on N6, all connected via Infinity Fabric
- Native FP8 (E4M3 and E5M2)
- Native sparsity (2:4 structured)
- Coherent unified memory with EPYC CPU when paired in MI300A APU variant
The chiplet design is what enabled the 192 GB HBM stack: a monolithic die large enough to feed eight HBM stacks would blow past reticle and yield limits.
MI300X vs MI325X vs MI300A {#mi300-variants}
| Variant | MI300A | MI300X | MI325X |
|---|---|---|---|
| Type | APU (CPU+GPU) | GPU only | GPU only |
| GPU dies | 6 | 8 | 8 |
| CPU cores | 24 Zen 4 | 0 | 0 |
| Memory | 128 GB HBM3 | 192 GB HBM3 | 256 GB HBM3e |
| Bandwidth | 5.3 TB/s | 5.3 TB/s | 6 TB/s |
| FP16/BF16 | 0.98 PFLOPS | 1.3 PFLOPS | 1.3 PFLOPS |
| FP8 | 1.96 PFLOPS | 2.6 PFLOPS | 2.6 PFLOPS |
| Use case | HPC, El Capitan | Generative AI inference | Largest models |
For LLM inference, MI300X or MI325X. MI300A is for HPC nodes (Lawrence Livermore's El Capitan supercomputer is built on it) and not typically deployed for LLMs.
vs H100 / H200 / B200 Spec Comparison {#vs-nvidia}
| Spec | H100 SXM | H200 SXM | B200 SXM | MI300X | MI325X |
|---|---|---|---|---|---|
| Memory | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 192 GB HBM3 | 256 GB HBM3e |
| Bandwidth | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s | 5.3 TB/s | 6.0 TB/s |
| FP16/BF16 (PFLOPS) | 0.99 | 0.99 | ~2.5 | 1.3 | 1.3 |
| FP8 (PFLOPS) | 1.98 | 1.98 | ~4.5 | 2.6 | 2.6 |
| FP4 | ❌ | ❌ | ✅ | ❌ | ❌ |
| TBP (W) | 700 | 700 | 1000 | 750 | 750 |
| Price (cloud / hour, est) | $4-6 | $5-7 | $9-12 | $3-5 | $4-6 |
| Available | 2023 | 2024 | 2025 | 2024 | 2024 |
MI300X beats H100 on memory, bandwidth, BF16 compute, and FP8 compute. It loses to B200 on bandwidth and raw compute, and lacks FP4 (Blackwell-only). For LLM inference, MI300X / MI325X is the strongest non-Blackwell option.
Software Stack: ROCm, vLLM, SGLang {#software}
| Layer | Component |
|---|---|
| Driver / runtime | ROCm 6.3+ |
| Math libraries | rocBLAS, hipBLAS, MIOpen, hipFFT |
| Collective comms | RCCL (NCCL-compatible) |
| Compiler | hipcc, ROCmCC |
| Kernel libraries | Composable Kernel (CK) |
| FlashAttention | upstream FA-2 v2.5+, FA-3 in development |
| LLM serving | vLLM-ROCm, SGLang-ROCm, TGI-ROCm |
| Quantization | AMD Quark, AutoGPTQ, AutoAWQ |
| Frameworks | PyTorch (ROCm), JAX (ROCm) |
For most LLM inference deployments: PyTorch + vLLM-ROCm. Same code path as the NVIDIA equivalent — same Python, same Hugging Face APIs, same OpenAI-compatible server. The differences are in the kernel implementations under the hood.
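As a concrete example of the shared code path: ROCm builds of PyTorch expose the standard torch.cuda device API, so ordinary Hugging Face code runs unchanged on MI300X. A minimal sketch (the checkpoint is just an example; assumes transformers and accelerate are installed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "cuda" maps to the AMD GPU on a ROCm PyTorch build; no code changes needed.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

inputs = tok("The MI300X has", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```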
FP8 on CDNA 3 {#fp8}
CDNA 3 was AMD's first GPU with native FP8. Two formats:
- E4M3 (4 exp / 3 mantissa) — for forward pass, weights, activations
- E5M2 (5 exp / 2 mantissa) — wider range, used for gradients
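PyTorch (2.1+) ships both formats as dtypes, so the range/precision trade-off is easy to inspect directly; note that ROCm builds also expose the FNUZ variants (torch.float8_e4m3fnuz, torch.float8_e5m2fnuz) matching CDNA 3's hardware encoding. A quick sketch:

```python
import torch

# Fewer mantissa bits = coarser steps; more exponent bits = wider range.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    fi = torch.finfo(dtype)
    print(f"{dtype}: max={fi.max}, eps={fi.eps}")

# torch.float8_e4m3fn: max=448.0,   eps=0.125 -> weights / activations
# torch.float8_e5m2:   max=57344.0, eps=0.25  -> gradients (wider range)
```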
vLLM-ROCm flags:
```bash
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 32768
```
NeuralMagic publishes FP8 quants of popular Llama / Qwen / Mixtral models. AMD's Quark tool produces FP8 from any BF16 model:
```bash
# Conceptual — exact CLI flags depend on the Quark release
python -m amd_quark quantize \
  --model_path meta-llama/Llama-3.1-70B-Instruct \
  --output_dir ./Llama-3.1-70B-Instruct-FP8 \
  --quant_format fp8 \
  --calib_dataset wikitext-2
```
Quality after FP8 quantization: typically within 0.3-0.7% MMLU of BF16. KV-cache FP8 saves ~50% KV memory at <0.5% quality loss.
Deployment Paths: Cloud, OEM, Dev Box {#deployment}
Cloud
| Provider | Instance | GPUs | Notes |
|---|---|---|---|
| Azure | ND MI300X v5 | 8x MI300X | OpenAI inference platform |
| Oracle OCI | BM.GPU.MI300X.8 | 8x MI300X | Bare metal |
| AMD Developer Cloud | varies | 1-8 | Free tier for testing |
| TensorWave | varies | 1-32 | AMD-focused cloud |
| Hot Aisle | varies | 1-8 | AMD-focused, EU presence |
| Lambda Labs | beta | 1-8 | Limited availability |
Hourly rates (mid-2026, best-effort): MI300X ~$3-5/h, 8x MI300X node ~$25-35/h. Often cheaper than H100 for similar capacity.
OEM systems
| OEM | Model | Config |
|---|---|---|
| Dell | PowerEdge XE9680 | 8x MI300X |
| Supermicro | AS-8125GS-TNMR2 | 8x MI300X |
| HPE | Cray XD675 | 8x MI300X |
| Lenovo | ThinkSystem SR685a V3 | 8x MI300X |
| Gigabyte | G593-ZX1-AAX1 | 8x MI300X |
Smaller dev systems
For workstations: AMD partner program "MI300X Developer Box" — 1-2 MI300X cards in a workstation chassis. Pricing varies, typically $25K+ per card.
vLLM-ROCm Setup on MI300X {#vllm-setup}
```bash
# Pull ROCm vLLM image
docker pull rocm/vllm:latest

# Run on a single MI300X
docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --shm-size 32G \
  --network=host \
  rocm/vllm:latest \
  vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.93 \
    --enable-prefix-caching \
    --enable-chunked-prefill
```
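Once the container is up, the server speaks the OpenAI API (port 8000 by default), so the stock openai Python client works as a smoke test:

```python
from openai import OpenAI

# vLLM ignores the API key by default; base_url points at the local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8",
    messages=[{"role": "user", "content": "One sentence on HBM3, please."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```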
For 8x MI300X (most common deployment):
```bash
docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --shm-size 64G \
  --network=host \
  rocm/vllm:latest \
  vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --tensor-parallel-size 8 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.94 \
    --enable-prefix-caching
```
For full vLLM tuning, see vLLM Complete Setup Guide — same flags work, ROCm is the backend.
Tensor Parallel Across 8x MI300X {#tensor-parallel}
```bash
# RCCL tuning (RCCL reads the NCCL_* environment variables)
export NCCL_DEBUG=WARN
export RCCL_MSCCL_ENABLE=1
export NCCL_P2P_LEVEL=SYS   # use Infinity Fabric where available
```
The 8 MI300X GPUs in a node are connected via Infinity Fabric in a fully connected mesh: each GPU has seven links at ~128 GB/s, or ~896 GB/s aggregate per GPU, comparable to H100's 900 GB/s NVLink. TP=8 scaling is roughly linear within the node; cross-node traffic uses RoCE / InfiniBand.
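To confirm collectives are actually riding the Infinity Fabric mesh at full speed, a small torch.distributed probe works on either vendor, since the "nccl" backend is RCCL on ROCm builds. A sketch, assuming a standard torchrun launch (the filename is arbitrary):

```python
# Launch: torchrun --nproc_per_node=8 allreduce_probe.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # RCCL on ROCm
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(256 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")  # 512 MB

for _ in range(5):  # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / iters

# Standard bus-bandwidth formula: ring all-reduce moves 2*(n-1)/n of the buffer.
n = dist.get_world_size()
bus_bw = x.numel() * x.element_size() * 2 * (n - 1) / n / elapsed / 1e9
if dist.get_rank() == 0:
    print(f"all-reduce bus bandwidth: ~{bus_bw:.0f} GB/s")
dist.destroy_process_group()
```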
Long Context (128K-1M) {#long-context}
The 192-256 GB memory makes MI300X / MI325X especially strong for very long contexts:
```bash
# Note: Llama 3.1's native window is 128K; going past it requires a
# RoPE-scaling override (vLLM rejects a longer --max-model-len otherwise)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 1048576 \
  --kv-cache-dtype fp8_e4m3 \
  --enable-prefix-caching
```
With the override in place, 1M-context Llama 3.1 8B fits comfortably on a single MI300X, with roughly 150 GB of KV-cache headroom after weights (see the sketch below). For 70B at long context, use FP8 KV cache. For 405B plus long context, use multi-GPU TP.
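The KV-cache arithmetic behind that claim, using Llama 3.1 8B's published config (32 layers, 8 KV heads via GQA, head dim 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int) -> float:
    # K and V each store layers * kv_heads * head_dim elements per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

seq = 1_048_576  # 1M tokens
print(f"FP16 KV: {kv_cache_gb(32, 8, 128, seq, 2):.0f} GB")  # ~137 GB
print(f"FP8  KV: {kv_cache_gb(32, 8, 128, seq, 1):.0f} GB")  # ~69 GB

# With ~16 GB of BF16 weights, even the FP16 cache squeezes into 192 GB;
# FP8 KV leaves generous headroom for batching.
```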
Llama 3.1 405B Single-Node Inference {#405b}
The headline use case. Llama 3.1 405B in FP8 is ~405 GB of weights, too big for any single GPU, but two MI325X (256 GB each, 512 GB total) hold it with TP=2 and leave ~100 GB for KV cache:
```bash
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --kv-cache-dtype fp8_e4m3
```
On an 8x MI300X node, 405B FP8 has plenty of room: TP=8 delivers ~50-80 tok/s aggregate at moderate concurrency. An H100 node needs all eight GPUs (8 x 80 GB = 640 GB) just to hold the ~405 GB of FP8 weights plus KV cache, and cannot fit 405B in BF16 on one node at all.
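The memory math behind both claims, using capacities from the spec table above:

```python
# Fit check: Llama 3.1 405B at FP8 is ~405 GB of weights (1 byte/param).
weights_gb = 405

configs = {
    "1x MI300X":  1 * 192,
    "2x MI325X":  2 * 256,
    "4x H100":    4 * 80,
    "8x H100":    8 * 80,
    "8x MI300X":  8 * 192,
}
for name, capacity_gb in configs.items():
    headroom = capacity_gb - weights_gb
    verdict = f"fits, ~{headroom} GB left for KV" if headroom > 0 else "does not fit"
    print(f"{name} ({capacity_gb} GB): {verdict}")

# Only the 8x MI300X node (1536 GB) also fits the ~810 GB of BF16 weights.
```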
Real Benchmarks {#benchmarks}
8x MI300X (single node) vs 8x H100 SXM (single node), both FP8, vLLM, 16 concurrent users.
| Model | 8x MI300X | 8x H100 SXM |
|---|---|---|
| Llama 3.1 8B (replicated 8 instances) | 25,000 tok/s | 22,000 tok/s |
| Llama 3.1 70B (TP=8, single instance) | 380 tok/s | 290 tok/s |
| Llama 3.1 70B (TP=2, 4 instances) | 1,250 tok/s | 980 tok/s |
| Llama 3.1 405B (TP=8) | 95 tok/s | 70 tok/s |
| DeepSeek V3 671B MoE | 180 tok/s | 130 tok/s |
The MI300X edge is roughly 15-40% on these serving workloads, largest on the biggest models.
For more benchmark detail: AMD's ROCm performance reports and MLPerf Inference 4.0+ submissions.
Network: Infinity Fabric, RoCEv2, RCCL {#network}
Within an 8x MI300X node, Infinity Fabric links form a full mesh, giving each GPU ~896 GB/s of aggregate peer bandwidth across its seven links. Across nodes, the standard pattern is RoCE v2 over 400 Gb/s Ethernet or InfiniBand HDR/NDR.
RCCL (the NCCL drop-in for ROCm) handles all-reduce, all-gather, etc. Tuning:
```bash
export NCCL_DEBUG=WARN
export NCCL_NET_GDR_LEVEL=PHB   # GPU Direct RDMA
export NCCL_IB_HCA=mlx5
export NCCL_IB_TC=106
export NCCL_IB_GID_INDEX=3
```
For multi-node training / inference clusters, the network is often the primary bottleneck regardless of vendor.
MI300X for Training (Honest Assessment) {#training}
MI300X has the raw compute and memory for LLM training. Software gaps as of mid-2026:
- PyTorch + ROCm — works for most architectures; compile-time issues on bleeding-edge models
- Megatron-LM — community ROCm fork exists, lags upstream
- DeepSpeed — works on ROCm but with caveats for specific optimizers
- NeMo — limited ROCm support
- Custom CUDA kernels — frequently missing AMD ports
For most training workloads on standard architectures (Llama, Qwen, Gemma, Mistral families), MI300X works. For research that depends on specific NVIDIA-only kernels (FlashAttention-3, certain optimizers, MoE-specific code paths), NVIDIA is still the safer choice.
For inference-only deployments, MI300X has fewer software gotchas.
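Before debugging framework gaps, confirm the basics: that your PyTorch is a ROCm build and actually sees the GPU. On ROCm builds torch.version.hip is set (it is None on CUDA builds) and the torch.cuda namespace is the supported API:

```python
import torch

assert torch.cuda.is_available(), "no GPU visible; check /dev/kfd permissions"
print("HIP:", torch.version.hip)                     # e.g. '6.3.x'; None on CUDA builds
print("device:", torch.cuda.get_device_name(0))      # e.g. 'AMD Instinct MI300X'
props = torch.cuda.get_device_properties(0)
print(f"memory: {props.total_memory / 1e9:.0f} GB")  # ~192 GB on MI300X
```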
Cost Analysis vs H100 Cluster {#cost}
Approximate mid-2026 cost-per-million-tokens for Llama 3.1 70B FP8 serving at 90% utilization:
| Setup | Hardware $/hr | tok/s | $/M tokens |
|---|---|---|---|
| 1x MI300X (cloud) | $3.5 | 140 | $7.0 |
| 1x H100 SXM (cloud) | $4.5 | 95 | $13.2 |
| 1x H200 (cloud) | $5.5 | 130 | $11.7 |
| 8x MI300X (TP=8) | $28 | 380 | $20.5 |
| 8x H100 SXM (TP=8) | $36 | 290 | $34.5 |
MI300X is roughly 40-50% cheaper per million tokens at this workload. Numbers vary by model size, concurrency, and provider; benchmark your own workload before committing.
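The $/M-token column is simply hourly rate divided by token throughput; reproduce it with your own provider quotes and measured tok/s:

```python
def dollars_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / (tokens_per_hour / 1e6)

print(f"1x MI300X: ${dollars_per_million_tokens(3.5, 140):.1f}")  # ~$6.9
print(f"1x H100:   ${dollars_per_million_tokens(4.5, 95):.1f}")   # ~$13.2
print(f"8x MI300X: ${dollars_per_million_tokens(28, 380):.1f}")   # ~$20.5
```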
When to Choose MI300X {#when-to-choose}
Pick MI300X / MI325X when:
- You serve large models (70B, 405B, large MoE) at moderate-to-high concurrency.
- You need long-context inference (128K+) and lots of KV cache room.
- You want lower cost per token at scale.
- You're running standard architectures (Llama, Qwen, DeepSeek, Mixtral, Gemma) where ROCm is mature.
- You want to diversify away from NVIDIA dependency for procurement / cost reasons.
- You're deploying via Azure, Oracle, or AMD Developer Cloud where MI300X is a first-class citizen.
Stick with H100 / H200 / B200 when:
- You need bleeding-edge model support the day of release.
- You depend on TensorRT-LLM specifically.
- You train cutting-edge models.
- You need FP4 (Blackwell only).
- Your team has deep NVIDIA expertise and switching cost is high.
- You need software guarantees from the SLA standpoint that only NVIDIA currently provides for some specialized workloads.
Common Production Issues {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Cold start slow | Container + driver init | Pre-warm with health pings |
| RCCL all-reduce slow | Network not tuned | Set NCCL_* env vars; verify Infinity Fabric |
| FP8 accuracy regression | Calibration set too small | Increase calib samples to 1024+ |
| OOM at long context | KV cache too large | FP8 KV; lower max_seq_len |
| Single-stream slower than expected | Compute-bound, kernels not tuned | File issue; try newer ROCm |
| Multi-node sync issues | RoCE / IB config | Check QoS, MTU, GID |
| Need MIG-style partitioning | Not yet on CDNA 3 | Use containers + cgroups |
Sources: AMD Instinct MI300X product page | ROCm 6.3 release notes | vLLM-ROCm performance reports | MLPerf Inference benchmarks | Internal benchmarks 8x MI300X cloud nodes.