FlashAttention Complete Guide (2026): FA-2, FA-3, Sliding Window, Hopper Optimizations
FlashAttention is the algorithm that made long-context training and inference practical. By tiling attention computation and keeping intermediate values in SRAM (GPU on-chip memory) instead of HBM, it produces mathematically identical attention output with O(N) memory and a 2-4x speedup. Without FlashAttention, training Llama 3 at 8K context would need terabytes of activation memory; with it, the same training fits on standard 8x H100 nodes. Every serving engine in 2026 — vLLM, SGLang, TensorRT-LLM, llama.cpp, ExLlamaV2, MLC-LLM — uses FlashAttention or a direct derivative.
This guide covers the full FlashAttention story: how the IO-aware tiling works, FA-1 vs FA-2 vs FA-3 differences (with the FA-3 Hopper-specific TMA + WGMMA + FP8 paths), feature support for sliding window / ALiBi / GQA / MLA, installation in PyTorch / Hugging Face / vLLM / TensorRT-LLM, AMD ROCm support, and decision trees for which attention path to use on which hardware.
Table of Contents
- The Problem with Vanilla Attention
- How FlashAttention Works
- FA-1 vs FA-2 vs FA-3
- FA-3 Hopper-Specific Optimizations
- Feature Support Matrix
- Sliding Window Attention
- ALiBi and Other Position Biases
- GQA / MQA / MLA / CLA Compatibility
- Installation: PyTorch + Hugging Face
- Installation: vLLM / SGLang / TRT-LLM
- llama.cpp Flash Attention
- AMD ROCm and Triton Variants
- PyTorch SDPA vs flash-attn
- xFormers vs FlashAttention
- Performance Benchmarks
- Troubleshooting
- FAQ
The Problem with Vanilla Attention {#problem}
Vanilla attention computes:
S = Q · K^T # N × N matrix (the attention scores)
P = softmax(S) # N × N
O = P · V # N × d
The N×N matrix S must materialize in HBM. For seq_len=128K: 128K × 128K × 2 bytes = 32 GB just for S. Plus another 32 GB for P. The HBM read/write traffic is O(N²) — even on H100 (3 TB/s), reading 32 GB twice takes ~21 ms per layer. At 80 layers: 1.7 seconds per forward pass for attention alone. Unworkable for long context.
The actual matmul FLOPs are only ~3% of that wall time. The bottleneck is memory bandwidth, not compute.
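To make the arithmetic concrete, here are the same back-of-envelope numbers as a few lines of Python (H100 HBM bandwidth assumed at 3 TB/s; results land within rounding of the figures above):
seq_len = 128 * 1024            # 128K tokens
elem_bytes = 2                  # fp16 / bf16
s_bytes = seq_len ** 2 * elem_bytes      # the N x N score matrix S
io_time = 2 * s_bytes / 3e12             # write S, then read it back for softmax
print(f"S alone: {s_bytes / 2**30:.0f} GiB")            # 32 GiB
print(f"IO per layer: {io_time * 1e3:.0f} ms")          # ~23 ms
print(f"80 layers: {80 * io_time:.2f} s per forward")   # ~1.8 s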
How FlashAttention Works {#how-it-works}
Tile Q into blocks of shape (B_r, d) and K, V into blocks of shape (B_c, d), sized to fit in SRAM (~100 KB per SM on H100). For each Q block, iterate over the K, V blocks and fold each one into a running softmax using the online-softmax algorithm:
For each Q block i:
    O_i = 0; ell_i = 0; m_i = -inf            # running output, softmax denominator, row max
    For each K, V block j:
        S_ij = Q_i · K_j^T
        m_new = max(m_i, rowmax(S_ij))        # per-row maximum
        P_ij = exp(S_ij - m_new)
        ell_new = exp(m_i - m_new) · ell_i + rowsum(P_ij)
        O_i = (exp(m_i - m_new) · ell_i / ell_new) · O_i + (P_ij / ell_new) · V_j
        m_i = m_new; ell_i = ell_new
    Output O_i
The full N×N matrix never exists in HBM; SRAM holds only small tiles at a time. Per the FlashAttention paper's analysis, HBM accesses drop from Θ(N² + Nd) for vanilla attention to Θ(N²d²/M), where M is the SRAM size. Since M (~100 KB per SM on H100) is much larger than d² for typical head dimensions (64-128), this is a severalfold reduction in HBM IO. Empirically: 2-4x speedup on long context.
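Here is a minimal single-head NumPy sketch of that loop, checked against naive attention. The 1/sqrt(d) scaling is omitted on both sides for brevity, and Q is processed in one piece since Q blocks are independent (the real kernel also tiles Q across thread blocks):
import numpy as np

def naive_attention(q, k, v):
    s = q @ k.T
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def flash_attention_ref(q, k, v, block=64):
    # Tiled online softmax over K/V blocks, mirroring the pseudocode above
    n, d = q.shape
    o = np.zeros((n, d))
    ell = np.zeros((n, 1))          # running softmax denominator
    m = np.full((n, 1), -np.inf)    # running row max
    for j in range(0, n, block):
        s = q @ k[j:j + block].T                 # partial scores, shape (n, block)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        scale = np.exp(m - m_new)                # rescale old stats to the new max
        ell_new = scale * ell + p.sum(axis=-1, keepdims=True)
        o = (scale * ell / ell_new) * o + (p @ v[j:j + block]) / ell_new
        m, ell = m_new, ell_new
    return o

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(flash_attention_ref(q, k, v), naive_attention(q, k, v))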
FA-1 vs FA-2 vs FA-3 {#versions}
| Version | Released | Key Innovation | Speedup vs Prev |
|---|---|---|---|
| FA-1 | 2022 (Dao et al.) | IO-aware tiling, online softmax | 2-4x vs vanilla |
| FA-2 | 2023 | Better parallelism, fewer non-matmul ops | 2x vs FA-1 |
| FA-3 | 2024 | Hopper-specific TMA + WGMMA + FP8 | 1.5-2x vs FA-2 (Hopper only) |
For A100, RTX 30/40 series: FA-2 is the standard.
For H100, H200, GH200, B200 (Hopper/Blackwell): FA-3.
Backward compatibility: most engines auto-select. pip install flash-attn installs FA-2 by default; FA-3 is built from the hopper subdirectory of the repo (see installation below).
FA-3 Hopper-Specific Optimizations {#fa3-hopper}
FA-3 takes advantage of three Hopper-specific hardware features:
TMA (Tensor Memory Accelerator)
Hardware async copy engine that moves tensors between HBM and SRAM without occupying SM compute. FA-3 overlaps memory transfer with computation.
WGMMA (Warp Group MMA)
Hopper's bigger matmul instruction (vs Ampere's MMA). Bigger blocks per instruction = less instruction overhead.
FP8 Native
H100 has dedicated FP8 tensor cores. FA-3 in FP8 mode runs the score matmuls in FP8 with higher-precision (FP32) accumulation. Quality loss: <0.1 perplexity. Throughput: 2-3x FA-3 BF16, 4-6x FA-2 BF16.
For H100 production: FA-3 FP8 for attention is the fastest path. Combine with FP8 weights for end-to-end FP8 inference.
Feature Support Matrix {#features}
| Feature | FA-1 | FA-2 | FA-3 |
|---|---|---|---|
| Causal mask | ✓ | ✓ | ✓ |
| Non-causal | ✓ | ✓ | ✓ |
| GQA / MQA | ✓ | ✓ | ✓ |
| Sliding window | ✗ | ✓ (2.4+) | ✓ |
| ALiBi | ✗ | ✓ (2.4+) | ✓ |
| Soft cap (Gemma) | ✗ | ✓ (2.6+) | ✓ |
| INT8 attention | ✗ | ✓ (experimental) | ✓ |
| FP8 attention | ✗ | ✗ | ✓ (Hopper only) |
| Variable-length sequences | ✓ (varlen) | ✓ | ✓ |
| Backward (training) | ✓ | ✓ | ✓ (slower than fwd) |
| MLA (DeepSeek) | ✗ | partial | ✓ (with patches) |
Sliding Window Attention {#sliding-window}
Restrict each token to attend only to the previous W tokens (the window). Used by Mistral 7B (window 4K), Phi-3 medium, and some Gemma variants. With FA-2:
from flash_attn import flash_attn_func
# Causal sliding window with window of 4K
out = flash_attn_func(
    q, k, v,
    causal=True,
    window_size=(4096, 0),  # (left, right) — right=0 for causal
)
Memory: O(N × window) instead of O(N²). For Mistral 7B at 32K context: ~4 GB attention memory vs ~32 GB.
Most engines auto-detect window size from model config — no manual config needed.
ALiBi and Other Position Biases {#alibi}
ALiBi (Press et al., 2022) adds a linear bias to attention scores based on token distance, instead of using positional embeddings. Used by Falcon, BLOOM, MPT.
out = flash_attn_func(
    q, k, v,
    causal=True,
    alibi_slopes=alibi_slopes,  # tensor of shape (num_heads,)
)
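flash-attn does not compute the slopes for you. A minimal helper using the geometric schedule from Press et al. (exact for power-of-2 head counts; the paper interpolates for other counts). The function name here is ours:
import torch

def get_alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric sequence from Press et al. (2022): 2^(-8/n), 2^(-16/n), ...
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)],
                        dtype=torch.float32)

alibi_slopes = get_alibi_slopes(32).cuda()  # flash-attn expects fp32, on-device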
For YaRN / NTK-aware RoPE / LongRoPE: these scale RoPE frequencies before attention; FA itself runs normally. See RoPE / YaRN Long Context Guide.
For Gemma's logit soft-capping (clip attention scores to ±50): FA-2 2.6+ supports the softcap parameter.
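Assuming flash-attn 2.6+, soft-capping is one extra argument to the same call (q, k, v as before):
# Gemma-style soft-capping: scores are smoothly clipped to ±50 via tanh
out = flash_attn_func(q, k, v, causal=True, softcap=50.0)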
GQA / MQA / MLA / CLA Compatibility {#gqa-mla}
| Attention Variant | FA-2 Support | FA-3 Support | Notes |
|---|---|---|---|
| MHA | Full | Full | Standard |
| GQA | Full | Full | Pass num_kv_heads |
| MQA | Full (num_kv_heads=1) | Full | Standard |
| MLA (DeepSeek) | Partial (custom kernels needed) | Full (with FA-3 MLA patches) | vLLM/SGLang ship MLA-aware kernels |
| CLA (Hunyuan) | Full per-layer | Full per-layer | KV sharing handled in model code, not FA |
| Sparse attention | Limited | Limited | Custom kernels typically needed |
For frontier MoE models in 2026 (DeepSeek V3, Hunyuan-Large, Llama 4 MoE): use vLLM/SGLang/TRT-LLM rather than raw flash-attn — they ship the necessary MLA / MoE kernels.
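For plain GQA through raw flash-attn, no head replication is needed: pass K and V with fewer heads and the kernel maps query heads to KV heads internally. A sketch with illustrative shapes:
import torch
from flash_attn import flash_attn_func

# GQA: 32 query heads share 8 KV heads (4:1 grouping); works whenever
# the query head count is a multiple of the KV head count
batch, seqlen, head_dim = 2, 4096, 128
q = torch.randn(batch, seqlen, 32, head_dim, dtype=torch.bfloat16, device="cuda")
k = torch.randn(batch, seqlen, 8, head_dim, dtype=torch.bfloat16, device="cuda")
v = torch.randn(batch, seqlen, 8, head_dim, dtype=torch.bfloat16, device="cuda")
out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, 32, head_dim)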
Installation: PyTorch + Hugging Face {#install-pytorch}
# Install flash-attn (auto-builds for your CUDA)
pip install flash-attn --no-build-isolation
# Or pre-built wheel from releases page
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
For HuggingFace Transformers:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
    device_map="auto",
)
For FA-3 on Hopper:
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper
pip install -e .
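After either install, a quick import check catches most wheel mismatches early. Note that the Hopper FA-3 build ships as a separate Python module (flash_attn_interface, per the repo's hopper README), not as a new version of flash_attn:
import flash_attn
print(flash_attn.__version__)   # FA-2 package

# FA-3 from the hopper/ build exposes a different module;
# this raises ImportError if the Hopper build isn't installed
import flash_attn_interface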
Installation: vLLM / SGLang / TRT-LLM {#install-engines}
vLLM ships flash-attn as a dependency:
pip install vllm
# Auto-detects FA version based on GPU; uses FA-3 on Hopper if available
Force FA-2 (debugging Hopper issues):
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve ...
Force FA-3:
VLLM_FLASH_ATTN_VERSION=3 vllm serve ...
SGLang and TensorRT-LLM auto-select FA-3 on Hopper builds.
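The same backend pin works from Python; set the variable before the engine starts (model name and context length below are illustrative):
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # pin before vLLM spins up

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=8192)
out = llm.generate(["Explain FlashAttention in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)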
llama.cpp Flash Attention {#llamacpp}
llama.cpp has its own optimized attention kernels (not the Dao Lab flash-attn library) that implement FlashAttention-style tiling for CPU and various GPU backends:
./llama-cli \
-m model.gguf \
-ngl 999 \
-c 32768 \
-fa # Enable Flash Attention
llama.cpp's -fa provides:
- ~30-50% speedup at long context vs standard attention
- 50% reduced KV memory
- Works on CUDA, ROCm, Metal, Vulkan, SYCL
Always enable for long-context use cases.
AMD ROCm and Triton Variants {#amd-triton}
Official flash-attn is CUDA-only. For AMD MI250 / MI300X:
- AOTriton-based fork: ROCm port maintained by AMD. pip install flash-attn-rocm (or build from source).
- Composable Kernel (CK) port: lower-level CK implementation, used by SGLang and vLLM ROCm builds.
- Triton-based fallback: portable but ~2x slower than custom kernels.
Performance: ROCm flash-attn on MI300X is ~70-90% of H100 FA-3 throughput. For AMD inference deployments in 2026, vLLM and SGLang have mature ROCm builds — they bundle the right kernels. See AMD ROCm Local LLM Setup.
PyTorch SDPA vs flash-attn {#sdpa}
PyTorch 2.0+ ships torch.nn.functional.scaled_dot_product_attention (SDPA), which is a backend dispatcher that auto-selects:
- FlashAttention (if installed and shape supported)
- xFormers memory-efficient attention
- Math (vanilla fallback)
For most use cases, SDPA + flash-attn installed = same performance as direct flash-attn calls. Benefits of SDPA:
- Single API works across backends
- Auto-fallback when FA doesn't support a shape
- Handled by HuggingFace internally
When to call flash-attn directly: custom kernels in research, or when SDPA's dispatch picks the wrong backend. For 99% of users: use attn_implementation="sdpa" or "flash_attention_2" in HuggingFace and let it work.
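To see which backend SDPA actually picks, PyTorch 2.3+ lets you restrict dispatch so you get a hard error instead of a silent fallback. A sketch:
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# SDPA layout is (batch, heads, seq, head_dim); shapes are illustrative
q = torch.randn(2, 8, 4096, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(2, 8, 4096, 128, dtype=torch.bfloat16, device="cuda")
v = torch.randn(2, 8, 4096, 128, dtype=torch.bfloat16, device="cuda")

# Restrict dispatch to the FlashAttention backend; raises if the
# shape/dtype isn't supported rather than falling back to math
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)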
xFormers vs FlashAttention {#xformers}
xFormers (Meta) is a broader memory-efficient attention library that pre-dates FA-2. It ships its own attention implementations, including a "flash"-style backend that's similar but not identical to Dao's flash-attn.
Trade-offs:
- xFormers is older, broader (sparse, low-rank, etc.)
- flash-attn is faster on standard dense attention
- Most production has migrated from xFormers to flash-attn
For diffusion models (Stable Diffusion, Flux), xFormers is still widely used — the diffusion ecosystem hasn't fully migrated to flash-attn. For LLM serving: flash-attn dominates.
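For reference, the xFormers call looks like this. Note the (batch, seq, heads, dim) layout, which differs from SDPA's (batch, heads, seq, dim); shapes are illustrative:
import torch
from xformers.ops import memory_efficient_attention

q = torch.randn(2, 4096, 8, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(2, 4096, 8, 128, dtype=torch.bfloat16, device="cuda")
v = torch.randn(2, 4096, 8, 128, dtype=torch.bfloat16, device="cuda")
out = memory_efficient_attention(q, k, v)  # backend auto-selected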
Performance Benchmarks {#benchmarks}
Llama 3.1 70B forward pass, 32K context, batch=1, H100 80GB:
| Attention | Time per forward | Memory |
|---|---|---|
| Vanilla (PyTorch) | 1820 ms | OOM at 32K |
| xFormers memory-efficient | 480 ms | 18 GB |
| FA-2 BF16 | 220 ms | 14 GB |
| FA-3 BF16 (Hopper) | 130 ms | 14 GB |
| FA-3 FP8 (Hopper) | 65 ms | 12 GB |
Llama 3.1 8B training, 8K context, A100 80GB, BF16:
| Attention | Tokens/sec/GPU |
|---|---|
| Vanilla | 8200 |
| FA-2 | 23500 |
| FA-3 (would be Hopper-only) | n/a |
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| ImportError: flash_attn not found | Wheel mismatch | Match torch + cuda version; reinstall |
| RuntimeError: head_dim not supported | FA supports head_dim up to 256 (multiple of 8) | Use SDPA fallback for unusual shapes |
| Slow on H100 with FA-2 | Not using FA-3 | Install FA-3 from Hopper branch |
| OOM despite FA enabled | KV cache, not attention | Enable FP8 KV; see KV Cache Guide |
| ROCm build fails | Wrong fork | Use flash-attn-rocm or AOTriton port |
| Sliding window ignored | FA <2.4 | Upgrade flash-attn to 2.4+ |
| FP8 quality regression | E5M2 too coarse | Try E4M3 (vLLM auto-selects) |
| Backward (training) much slower than forward | Expected | FA backward is 2-3x slower than forward; normal |
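Before blaming the install, confirm what the GPU can actually run. A small diagnostic:
import torch

major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: sm_{major}{minor}")
print("FA-2 supported (needs Ampere+, sm_80):", (major, minor) >= (8, 0))
print("FA-3 Hopper paths (needs sm_90):", (major, minor) >= (9, 0))
try:
    import flash_attn
    print("flash-attn importable:", flash_attn.__version__)
except ImportError as e:
    print("flash-attn not importable:", e)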
FAQ {#faq}
See answers to common FlashAttention questions below.
Sources: Dao et al. (2022) FlashAttention | Dao (2023) FlashAttention-2 | Shah et al. (2024) FlashAttention-3 | flash-attention GitHub | Press et al. (2022) ALiBi | PyTorch SDPA docs | Internal benchmarks H100 + A100.