FlashAttention Complete Guide (2026): FA-2, FA-3, Sliding Window, Hopper Optimizations
FlashAttention is the algorithm that made long-context training and inference practical. By tiling attention computation and keeping intermediate values in SRAM (GPU on-chip memory) instead of HBM, it produces mathematically identical attention output with O(N) memory and a 2-4x speedup. Without FlashAttention, training Llama 3 at 8K context would need terabytes of activation memory; with it, the same training fits on standard 8x H100 nodes. Every serving engine in 2026 — vLLM, SGLang, TensorRT-LLM, llama.cpp, ExLlamaV2, MLC-LLM — uses FlashAttention or a direct derivative.
This guide covers the full FlashAttention story: how the IO-aware tiling works, FA-1 vs FA-2 vs FA-3 differences (with the FA-3 Hopper-specific TMA + WGMMA + FP8 paths), feature support for sliding window / ALiBi / GQA / MLA, installation in PyTorch / Hugging Face / vLLM / TensorRT-LLM, AMD ROCm support, and decision trees for which attention path to use on which hardware.
Table of Contents
- The Problem with Vanilla Attention
- How FlashAttention Works
- FA-1 vs FA-2 vs FA-3
- FA-3 Hopper-Specific Optimizations
- Feature Support Matrix
- Sliding Window Attention
- ALiBi and Other Position Biases
- GQA / MQA / MLA / CLA Compatibility
- Installation: PyTorch + Hugging Face
- Installation: vLLM / SGLang / TRT-LLM
- llama.cpp Flash Attention
- AMD ROCm and Triton Variants
- PyTorch SDPA vs flash-attn
- xFormers vs FlashAttention
- Performance Benchmarks
- Troubleshooting
- FAQ
The Problem with Vanilla Attention {#problem}
Vanilla attention computes:
S = Q · K^T # N × N matrix (the attention scores)
P = softmax(S) # N × N
O = P · V # N × d
The N×N matrix S must materialize in HBM. For seq_len=128K: 128K × 128K × 2 bytes = 32 GB just for S. Plus another 32 GB for P. The HBM read/write traffic is O(N²) — even on H100 (3 TB/s), reading 32 GB twice takes ~21 ms per layer. At 80 layers: 1.7 seconds per forward pass for attention alone. Unworkable for long context.
The actual matmul FLOPs are only ~3% of that wall time. The bottleneck is memory bandwidth, not compute.
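To make the arithmetic concrete, here are the same back-of-envelope numbers as a few lines of Python (H100 HBM bandwidth assumed at 3 TB/s; results land within rounding of the figures above):
seq_len = 128 * 1024            # 128K tokens
elem_bytes = 2                  # fp16 / bf16
s_bytes = seq_len ** 2 * elem_bytes      # the N x N score matrix S
io_time = 2 * s_bytes / 3e12             # write S, then read it back for softmax
print(f"S alone: {s_bytes / 2**30:.0f} GiB")            # 32 GiB
print(f"IO per layer: {io_time * 1e3:.0f} ms")          # ~23 ms
print(f"80 layers: {80 * io_time:.2f} s per forward")   # ~1.8 s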
How FlashAttention Works {#how-it-works}
Tile Q into blocks of shape (B_r, d) and K, V into blocks of shape (B_c, d), sized to fit in SRAM (~100 KB per SM on H100). For each Q block, iterate over the K, V blocks and fold each one into a running softmax using the online-softmax algorithm:
For each Q block i:
    O_i = 0; ell_i = 0; m_i = -inf            # running output, softmax denominator, row max
    For each K, V block j:
        S_ij = Q_i · K_j^T
        m_new = max(m_i, rowmax(S_ij))        # per-row maximum
        P_ij = exp(S_ij - m_new)
        ell_new = exp(m_i - m_new) · ell_i + rowsum(P_ij)
        O_i = (exp(m_i - m_new) · ell_i / ell_new) · O_i + (P_ij / ell_new) · V_j
        m_i = m_new; ell_i = ell_new
    Output O_i
The full N×N matrix never exists in HBM; SRAM holds only small tiles at a time. Per the FlashAttention paper's analysis, HBM accesses drop from Θ(N² + Nd) for vanilla attention to Θ(N²d²/M), where M is the SRAM size. Since M (~100 KB per SM on H100) is much larger than d² for typical head dimensions (64-128), this is a severalfold reduction in HBM IO. Empirically: 2-4x speedup on long context.
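Here is a minimal single-head NumPy sketch of that loop, checked against naive attention. The 1/sqrt(d) scaling is omitted on both sides for brevity, and Q is processed in one piece since Q blocks are independent (the real kernel also tiles Q across thread blocks):
import numpy as np

def naive_attention(q, k, v):
    s = q @ k.T
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def flash_attention_ref(q, k, v, block=64):
    # Tiled online softmax over K/V blocks, mirroring the pseudocode above
    n, d = q.shape
    o = np.zeros((n, d))
    ell = np.zeros((n, 1))          # running softmax denominator
    m = np.full((n, 1), -np.inf)    # running row max
    for j in range(0, n, block):
        s = q @ k[j:j + block].T                 # partial scores, shape (n, block)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        scale = np.exp(m - m_new)                # rescale old stats to the new max
        ell_new = scale * ell + p.sum(axis=-1, keepdims=True)
        o = (scale * ell / ell_new) * o + (p @ v[j:j + block]) / ell_new
        m, ell = m_new, ell_new
    return o

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(flash_attention_ref(q, k, v), naive_attention(q, k, v))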
FA-1 vs FA-2 vs FA-3 {#versions}
| Version | Released | Key Innovation | Speedup vs Prev |
|---|---|---|---|
| FA-1 | 2022 (Dao et al.) | IO-aware tiling, online softmax | 2-4x vs vanilla |
| FA-2 | 2023 | Better parallelism, fewer non-matmul ops | 2x vs FA-1 |
| FA-3 | 2024 | Hopper-specific TMA + WGMMA + FP8 | 1.5-2x vs FA-2 (Hopper only) |
For A100, RTX 30/40 series: FA-2 is the standard.
For H100, H200, GH200, B200 (Hopper/Blackwell): FA-3.
Backward compatibility: most engines auto-select. pip install flash-attn installs FA-2 by default; FA-3 is built from the hopper subdirectory of the repo (see installation below).
FA-3 Hopper-Specific Optimizations {#fa3-hopper}
FA-3 takes advantage of three Hopper-specific hardware features:
TMA (Tensor Memory Accelerator)
Hardware async copy engine that moves tensors between HBM and SRAM without occupying SM compute. FA-3 overlaps memory transfer with computation.
WGMMA (Warp Group MMA)
Hopper's bigger matmul instruction (vs Ampere's MMA). Bigger blocks per instruction = less instruction overhead.
FP8 Native
H100 has dedicated FP8 tensor cores. FA-3 in FP8 mode runs the score matmuls in FP8 with higher-precision (FP32) accumulation. Quality loss: <0.1 perplexity. Throughput: 2-3x FA-3 BF16, 4-6x FA-2 BF16.
For H100 production: FA-3 FP8 for attention is the fastest path. Combine with FP8 weights for end-to-end FP8 inference.
Feature Support Matrix {#features}
| Feature | FA-1 | FA-2 | FA-3 |
|---|---|---|---|
| Causal mask | ✓ | ✓ | ✓ |
| Non-causal | ✓ | ✓ | ✓ |
| GQA / MQA | ✓ | ✓ | ✓ |
| Sliding window | ✗ | ✓ (2.4+) | ✓ |
| ALiBi | ✗ | ✓ (2.4+) | ✓ |
| Soft cap (Gemma) | ✗ | ✓ (2.6+) | ✓ |
| INT8 attention | ✗ | ✓ (experimental) | ✓ |
| FP8 attention | ✗ | ✗ | ✓ (Hopper only) |
| Variable-length sequences | ✓ (varlen) | ✓ | ✓ |
| Backward (training) | ✓ | ✓ | ✓ (slower than fwd) |
| MLA (DeepSeek) | ✗ | partial | ✓ (with patches) |
Sliding Window Attention {#sliding-window}
Restrict each token to attend only to the previous W tokens (the window). Used by Mistral 7B (window 4K), Phi-3 medium, and some Gemma variants. With FA-2:
from flash_attn import flash_attn_func
# Causal sliding window with window of 4K
out = flash_attn_func(
    q, k, v,
    causal=True,
    window_size=(4096, 0),  # (left, right) — right=0 for causal
)
Memory: O(N × window) instead of O(N²). For Mistral 7B at 32K context: ~4 GB attention memory vs ~32 GB.
Most engines auto-detect window size from model config — no manual config needed.
ALiBi and Other Position Biases {#alibi}
ALiBi (Press et al., 2022) adds a linear bias to attention scores based on token distance, instead of using positional embeddings. Used by Falcon, BLOOM, MPT.
out = flash_attn_func(
    q, k, v,
    causal=True,
    alibi_slopes=alibi_slopes,  # tensor of shape (num_heads,)
)
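flash-attn does not compute the slopes for you. A minimal helper using the geometric schedule from Press et al. (exact for power-of-2 head counts; the paper interpolates for other counts). The function name here is ours:
import torch

def get_alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric sequence from Press et al. (2022): 2^(-8/n), 2^(-16/n), ...
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)],
                        dtype=torch.float32)

alibi_slopes = get_alibi_slopes(32).cuda()  # flash-attn expects fp32, on-device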
For YaRN / NTK-aware RoPE / LongRoPE: these scale RoPE frequencies before attention; FA itself runs normally. See RoPE / YaRN Long Context Guide.
For Gemma's logit soft-capping (clip attention scores to ±50): FA-2 2.6+ supports the softcap parameter.
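Assuming flash-attn 2.6+, soft-capping is one extra argument to the same call (q, k, v as before):
# Gemma-style soft-capping: scores are smoothly clipped to ±50 via tanh
out = flash_attn_func(q, k, v, causal=True, softcap=50.0)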
GQA / MQA / MLA / CLA Compatibility {#gqa-mla}
| Attention Variant | FA-2 Support | FA-3 Support | Notes |
|---|---|---|---|
| MHA | Full | Full | Standard |
| GQA | Full | Full | Pass num_kv_heads |
| MQA | Full (num_kv_heads=1) | Full | Standard |
| MLA (DeepSeek) | Partial (custom kernels needed) | Full (with FA-3 MLA patches) | vLLM/SGLang ship MLA-aware kernels |
| CLA (Hunyuan) | Full per-layer | Full per-layer | KV sharing handled in model code, not FA |
| Sparse attention | Limited | Limited | Custom kernels typically needed |
For frontier MoE models in 2026 (DeepSeek V3, Hunyuan-Large, Llama 4 MoE): use vLLM/SGLang/TRT-LLM rather than raw flash-attn — they ship the necessary MLA / MoE kernels.
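For plain GQA through raw flash-attn, no head replication is needed: pass K and V with fewer heads and the kernel maps query heads to KV heads internally. A sketch with illustrative shapes:
import torch
from flash_attn import flash_attn_func

# GQA: 32 query heads share 8 KV heads (4:1 grouping); works whenever
# the query head count is a multiple of the KV head count
batch, seqlen, head_dim = 2, 4096, 128
q = torch.randn(batch, seqlen, 32, head_dim, dtype=torch.bfloat16, device="cuda")
k = torch.randn(batch, seqlen, 8, head_dim, dtype=torch.bfloat16, device="cuda")
v = torch.randn(batch, seqlen, 8, head_dim, dtype=torch.bfloat16, device="cuda")
out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, 32, head_dim)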
Installation: PyTorch + Hugging Face {#install-pytorch}
# Install flash-attn (auto-builds for your CUDA)
pip install flash-attn --no-build-isolation
# Or pre-built wheel from releases page
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
For HuggingFace Transformers:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
    device_map="auto",
)
For FA-3 on Hopper:
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper
pip install -e .
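After either install, a quick import check catches most wheel mismatches early. Note that the Hopper FA-3 build ships as a separate Python module (flash_attn_interface, per the repo's hopper README), not as a new version of flash_attn:
import flash_attn
print(flash_attn.__version__)   # FA-2 package

# FA-3 from the hopper/ build exposes a different module;
# this raises ImportError if the Hopper build isn't installed
import flash_attn_interface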
Installation: vLLM / SGLang / TRT-LLM {#install-engines}
vLLM ships flash-attn as a dependency:
pip install vllm
# Auto-detects FA version based on GPU; uses FA-3 on Hopper if available
Force FA-2 (debugging Hopper issues):
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve ...
Force FA-3:
VLLM_FLASH_ATTN_VERSION=3 vllm serve ...
SGLang and TensorRT-LLM auto-select FA-3 on Hopper builds.
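The same backend pin works from Python; set the variable before the engine starts (model name and context length below are illustrative):
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # pin before vLLM spins up

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=8192)
out = llm.generate(["Explain FlashAttention in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)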
llama.cpp Flash Attention {#llamacpp}
llama.cpp has its own optimized attention kernels (not the Dao Lab flash-attn library) that implement FlashAttention-style tiling for CPU and various GPU backends:
./llama-cli \
-m model.gguf \
-ngl 999 \
-c 32768 \
-fa # Enable Flash Attention
llama.cpp's -fa provides:
- ~30-50% speedup at long context vs standard attention
- 50% reduced KV memory
- Works on CUDA, ROCm, Metal, Vulkan, SYCL
Always enable for long-context use cases.
AMD ROCm and Triton Variants {#amd-triton}
Official flash-attn is CUDA-only. For AMD MI250 / MI300X:
- AOTriton-based fork: ROCm port maintained by AMD. pip install flash-attn-rocm (or build from source).
- Composable Kernel (CK) port: lower-level CK implementation, used by SGLang and vLLM ROCm builds.
- Triton-based fallback: portable but ~2x slower than custom kernels.
Performance: ROCm flash-attn on MI300X is ~70-90% of H100 FA-3 throughput. For AMD inference deployments in 2026, vLLM and SGLang have mature ROCm builds — they bundle the right kernels. See AMD ROCm Local LLM Setup.
PyTorch SDPA vs flash-attn {#sdpa}
PyTorch 2.0+ ships torch.nn.functional.scaled_dot_product_attention (SDPA), which is a backend dispatcher that auto-selects:
- FlashAttention (if installed and shape supported)
- xFormers memory-efficient attention
- Math (vanilla fallback)
For most use cases, SDPA + flash-attn installed = same performance as direct flash-attn calls. Benefits of SDPA:
- Single API works across backends
- Auto-fallback when FA doesn't support a shape
- Handled by HuggingFace internally
When to call flash-attn directly: custom kernels in research, or when SDPA's dispatch picks the wrong backend. For 99% of users: use attn_implementation="sdpa" or "flash_attention_2" in HuggingFace and let it work.
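To see which backend SDPA actually picks, PyTorch 2.3+ lets you restrict dispatch so you get a hard error instead of a silent fallback. A sketch:
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# SDPA layout is (batch, heads, seq, head_dim); shapes are illustrative
q = torch.randn(2, 8, 4096, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(2, 8, 4096, 128, dtype=torch.bfloat16, device="cuda")
v = torch.randn(2, 8, 4096, 128, dtype=torch.bfloat16, device="cuda")

# Restrict dispatch to the FlashAttention backend; raises if the
# shape/dtype isn't supported rather than falling back to math
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)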
xFormers vs FlashAttention {#xformers}
xFormers (Meta) is a broader memory-efficient attention library that pre-dates FA-2. It ships its own attention implementations, including a "flash"-style backend that's similar but not identical to Dao's flash-attn.
Trade-offs:
- xFormers is older, broader (sparse, low-rank, etc.)
- flash-attn is faster on standard dense attention
- Most production has migrated from xFormers to flash-attn
For diffusion models (Stable Diffusion, Flux), xFormers is still widely used — the diffusion ecosystem hasn't fully migrated to flash-attn. For LLM serving: flash-attn dominates.
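For reference, the xFormers call looks like this. Note the (batch, seq, heads, dim) layout, which differs from SDPA's (batch, heads, seq, dim); shapes are illustrative:
import torch
from xformers.ops import memory_efficient_attention

q = torch.randn(2, 4096, 8, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(2, 4096, 8, 128, dtype=torch.bfloat16, device="cuda")
v = torch.randn(2, 4096, 8, 128, dtype=torch.bfloat16, device="cuda")
out = memory_efficient_attention(q, k, v)  # backend auto-selected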
Performance Benchmarks {#benchmarks}
Llama 3.1 70B forward pass, 32K context, batch=1, H100 80GB:
| Attention | Time per forward | Memory |
|---|---|---|
| Vanilla (PyTorch) | 1820 ms | OOM at 32K |
| xFormers memory-efficient | 480 ms | 18 GB |
| FA-2 BF16 | 220 ms | 14 GB |
| FA-3 BF16 (Hopper) | 130 ms | 14 GB |
| FA-3 FP8 (Hopper) | 65 ms | 12 GB |
Llama 3.1 8B training, 8K context, A100 80GB, BF16:
| Attention | Tokens/sec/GPU |
|---|---|
| Vanilla | 8200 |
| FA-2 | 23500 |
| FA-3 (would be Hopper-only) | n/a |
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| ImportError: flash_attn not found | Wheel mismatch | Match torch + cuda version; reinstall |
| RuntimeError: head_dim not supported | FA supports head_dim up to 256 (multiple of 8) | Use SDPA fallback for unusual shapes |
| Slow on H100 with FA-2 | Not using FA-3 | Install FA-3 from Hopper branch |
| OOM despite FA enabled | KV cache, not attention | Enable FP8 KV; see KV Cache Guide |
| ROCm build fails | Wrong fork | Use flash-attn-rocm or AOTriton port |
| Sliding window ignored | FA <2.4 | Upgrade flash-attn to 2.4+ |
| FP8 quality regression | E5M2 too coarse | Try E4M3 (vLLM auto-selects) |
| Backward (training) much slower than forward | Expected | FA backward is 2-3x slower than forward; normal |
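Before blaming the install, confirm what the GPU can actually run. A small diagnostic:
import torch

major, minor = torch.cuda.get_device_capability()
print(f"GPU compute capability: sm_{major}{minor}")
print("FA-2 supported (needs Ampere+, sm_80):", (major, minor) >= (8, 0))
print("FA-3 Hopper paths (needs sm_90):", (major, minor) >= (9, 0))
try:
    import flash_attn
    print("flash-attn importable:", flash_attn.__version__)
except ImportError as e:
    print("flash-attn not importable:", e)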
FAQ {#faq}
See answers to common FlashAttention questions below.
Sources: Dao et al. (2022) FlashAttention | Dao (2023) FlashAttention-2 | Shah et al. (2024) FlashAttention-3 | flash-attention GitHub | Press et al. (2022) ALiBi | PyTorch SDPA docs | Internal benchmarks H100 + A100.