Performance

FlashAttention Complete Guide (2026): FA-2, FA-3, Sliding Window, Hopper Optimizations

May 2, 2026
22 min read
LocalAimaster Research Team

FlashAttention is the algorithm that made long-context training and inference practical. By tiling attention computation and keeping intermediate values in SRAM (GPU on-chip memory) instead of HBM, it produces mathematically identical attention output with O(N) memory and a 2-4x speedup. Without FlashAttention, training Llama 3 at 8K context would need terabytes of activation memory; with it, the same training fits on standard 8x H100 nodes. Every serving engine in 2026 — vLLM, SGLang, TensorRT-LLM, llama.cpp, ExLlamaV2, MLC-LLM — uses FlashAttention or a direct derivative.

This guide covers the full FlashAttention story: how the IO-aware tiling works, FA-1 vs FA-2 vs FA-3 differences (with the FA-3 Hopper-specific TMA + WGMMA + FP8 paths), feature support for sliding window / ALiBi / GQA / MLA, installation in PyTorch / Hugging Face / vLLM / TensorRT-LLM, AMD ROCm support, and decision trees for which attention path to use on which hardware.

Table of Contents

  1. The Problem with Vanilla Attention
  2. How FlashAttention Works
  3. FA-1 vs FA-2 vs FA-3
  4. FA-3 Hopper-Specific Optimizations
  5. Feature Support Matrix
  6. Sliding Window Attention
  7. ALiBi and Other Position Biases
  8. GQA / MQA / MLA / CLA Compatibility
  9. Installation: PyTorch + Hugging Face
  10. Installation: vLLM / SGLang / TRT-LLM
  11. llama.cpp Flash Attention
  12. AMD ROCm and Triton Variants
  13. PyTorch SDPA vs flash-attn
  14. xFormers vs FlashAttention
  15. Performance Benchmarks
  16. Troubleshooting
  17. FAQ


The Problem with Vanilla Attention {#problem}

Vanilla attention computes:

S = Q · K^T          # N × N matrix (the attention scores)
P = softmax(S)       # N × N
O = P · V            # N × d

The N×N matrix S must be materialized in HBM. For seq_len=128K: 128K × 128K × 2 bytes ≈ 32 GB just for S, plus another ~32 GB for P. The HBM read/write traffic is O(N²) — even on an H100 (~3 TB/s), reading 32 GB twice takes ~21 ms per layer. At 80 layers, that is ~1.7 seconds per forward pass for attention alone. Unworkable for long context.

The actual matmul FLOPs are only ~3% of that wall time. The bottleneck is memory bandwidth, not compute.
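Those figures can be reproduced with a few lines of arithmetic (a sketch; 3 TB/s is the nominal H100 HBM3 bandwidth, and the article rounds the S matrix down to 32 GB before computing times):

```python
# Back-of-envelope check of the numbers above.
N = 128 * 1024                      # sequence length
bytes_per_elem = 2                  # fp16 / bf16
s_bytes = N * N * bytes_per_elem    # the N x N score matrix S
hbm_bandwidth = 3e12                # H100 HBM3, ~3 TB/s

t_per_layer = 2 * s_bytes / hbm_bandwidth   # read S-sized traffic twice
t_forward = 80 * t_per_layer                # 80 layers

# ~34 GB, ~23 ms, ~1.8 s (the article's 32 GB / 21 ms round S down first)
print(f"S: {s_bytes / 1e9:.1f} GB, per layer: {t_per_layer * 1e3:.0f} ms, "
      f"forward: {t_forward:.2f} s")
```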


How FlashAttention Works {#how-it-works}

Tile Q, K, V into blocks of size (B_r, B_c, d) that fit in SRAM (~100 KB per SM on H100). For each Q-block, iterate over K and V blocks in chunks; compute partial softmax incrementally using the online-softmax algorithm:

For each Q block i:
    O_i = 0; ell_i = 0; m_i = -inf
    For each K, V block j:
        S_ij = Q_i · K_j^T
        m_new = max(m_i, max(S_ij))
        P_ij = exp(S_ij - m_new)
        ell_new = exp(m_i - m_new) · ell_i + sum(P_ij)
        O_i = (exp(m_i - m_new) · ell_i / ell_new) · O_i + (P_ij / ell_new) · V_j
        m_i = m_new; ell_i = ell_new
    Output O_i

The full N×N matrix never exists in HBM — only small tiles live in SRAM. Total HBM traffic drops from Θ(N² + N·d) to Θ(N²d²/M), where M is the SRAM size: still quadratic in N asymptotically, but because M (~100 KB per SM on H100) is far larger than d², the constant-factor reduction in IO is large. Empirically: 2-4x speedup on long context.
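The recurrence above can be checked numerically. A NumPy sketch (illustrative only — the real kernel is a fused CUDA implementation) that tiles over K/V blocks and matches naive softmax attention:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference: materializes the full N x N score matrix.
    S = Q @ K.T
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V

def flash_attention(Q, K, V, block=32):
    # Tiled attention with online softmax: per-row running max m and sum ell.
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)
    ell = np.zeros(N)
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T                 # (N, block) score tile
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])
        rescale = np.exp(m - m_new) * ell        # old contribution, rescaled
        ell_new = rescale + P.sum(axis=1)
        O = (rescale / ell_new)[:, None] * O + (P / ell_new[:, None]) @ V[j:j + block]
        m, ell = m_new, ell_new
    return O

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
assert np.allclose(flash_attention(Q, K, V), naive_attention(Q, K, V))
```

The result is bit-for-bit the same softmax, computed without ever holding the N×N matrix.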


FA-1 vs FA-2 vs FA-3 {#versions}

| Version | Released | Key Innovation | Speedup vs Previous |
|---|---|---|---|
| FA-1 | 2022 (Dao et al.) | IO-aware tiling, online softmax | 2-4x vs vanilla |
| FA-2 | 2023 | Better parallelism, fewer non-matmul ops | ~2x vs FA-1 |
| FA-3 | 2024 | Hopper-specific TMA + WGMMA + FP8 | 1.5-2x vs FA-2 (Hopper only) |

For A100, RTX 30/40 series: FA-2 is the standard.

For H100, H200, GH200, B200 (Hopper/Blackwell): FA-3.

Backward compatibility: most engines auto-select the right version for the detected GPU. pip install flash-attn installs FA-2 by default; FA-3 is built from the hopper directory of the flash-attention repo.



FA-3 Hopper-Specific Optimizations {#fa3-hopper}

FA-3 takes advantage of three Hopper-specific hardware features:

TMA (Tensor Memory Accelerator)

Hardware async copy engine that moves tensors between HBM and SRAM without occupying SM compute. FA-3 overlaps memory transfer with computation.

WGMMA (Warp Group MMA)

Hopper's bigger matmul instruction (vs Ampere's MMA). Bigger blocks per instruction = less instruction overhead.

FP8 Native

H100 has dedicated FP8 tensor cores. FA-3's FP8 mode computes the attention matmuls in FP8 with higher-precision accumulation. Quality loss: <0.1 perplexity. Throughput: 2-3x FA-3 BF16, 4-6x FA-2 BF16.

For H100 production: FA-3 FP8 for attention is the fastest path. Combine with FP8 weights for end-to-end FP8 inference.


Feature Support Matrix {#features}

| Feature | FA-1 | FA-2 | FA-3 |
|---|---|---|---|
| Causal mask | ✓ | ✓ | ✓ |
| Non-causal | ✓ | ✓ | ✓ |
| GQA / MQA | — | ✓ | ✓ |
| Sliding window | — | ✓ (2.4+) | ✓ |
| ALiBi | — | ✓ (2.4+) | ✓ |
| Soft cap (Gemma) | — | ✓ (2.6+) | ✓ |
| INT8 attention | — | — | ✓ (experimental) |
| FP8 attention | — | — | ✓ (Hopper only) |
| Variable-length sequences | ✓ | ✓ (varlen API) | ✓ (varlen API) |
| Backward (training) | ✓ | ✓ | ✓ (slower than fwd) |
| MLA (DeepSeek) | — | partial | ✓ (with patches) |

Sliding Window Attention {#sliding-window}

Restrict each token to attend only to the previous W tokens (plus itself). Used by Mistral 7B (window 4096), Phi-3-medium, and some Gemma variants. With FA-2:

from flash_attn import flash_attn_func

# Causal sliding window with window of 4K
out = flash_attn_func(
    q, k, v,
    causal=True,
    window_size=(4096, 0),  # (left, right) — right=0 for causal
)

Memory: O(N × window) instead of O(N²). For Mistral 7B at 32K context: ~4 GB attention memory vs ~32 GB.

Most engines auto-detect window size from model config — no manual config needed.
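For intuition, here is a naive NumPy reference for what window_size=(W, 0) computes — token i attends to keys j with i − W ≤ j ≤ i. Illustrative only: the real kernel never materializes the N×N mask.

```python
import numpy as np

def windowed_attention(Q, K, V, window):
    # Naive reference: causal attention restricted to the last `window` keys.
    N = Q.shape[0]
    S = Q @ K.T
    i, j = np.indices((N, N))
    S = np.where((j <= i) & (j >= i - window), S, -np.inf)  # window_size=(window, 0)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 16)) for _ in range(3))
out = windowed_attention(Q, K, V, window=8)
```

With window ≥ N this reduces to plain causal attention — a handy sanity check.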


ALiBi and Other Position Biases {#alibi}

ALiBi (Press et al., 2022) adds a linear bias to attention scores based on token distance, instead of using positional embeddings. Used by Falcon, BLOOM, MPT.

out = flash_attn_func(
    q, k, v,
    causal=True,
    alibi_slopes=alibi_slopes,  # tensor of shape (num_heads,)
)
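The alibi_slopes vector is the per-head geometric sequence from the ALiBi paper. A sketch of the standard computation (convert the result to a CUDA tensor of shape (num_heads,) before passing it in):

```python
import math

def get_alibi_slopes(num_heads):
    # Head h gets slope 2^(-8(h+1)/n) when n is a power of two;
    # the usual interpolation from the ALiBi repo handles other head counts.
    def pow2_slopes(n):
        start = 2 ** (-8 / n)
        return [start ** (i + 1) for i in range(n)]

    if math.log2(num_heads).is_integer():
        return pow2_slopes(num_heads)
    closest = 2 ** math.floor(math.log2(num_heads))
    return (pow2_slopes(closest)
            + pow2_slopes(2 * closest)[0::2][: num_heads - closest])

print(get_alibi_slopes(4))  # → [0.25, 0.0625, 0.015625, 0.00390625]
```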

For YaRN / NTK-aware RoPE / LongRoPE: these scale RoPE frequencies before attention; FA itself runs normally. See RoPE / YaRN Long Context Guide.

For Gemma's logit soft-capping (clip attention scores to ±50): FA-2 2.6+ supports the softcap parameter.
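Soft-capping replaces each raw score s with cap · tanh(s / cap), squashing scores smoothly into (−cap, cap) while leaving small scores almost unchanged. A NumPy sketch of the transform (as used by Gemma 2):

```python
import numpy as np

def softcap(scores, cap=50.0):
    # Gemma-style logit soft-capping: smooth clip of attention scores to (-cap, cap).
    return cap * np.tanh(scores / cap)

print(softcap(np.array([0.5, 100.0, -1000.0])))
```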


GQA / MQA / MLA / CLA Compatibility {#gqa-mla}

| Attention Variant | FA-2 Support | FA-3 Support | Notes |
|---|---|---|---|
| MHA | Full | Full | Standard |
| GQA | Full | Full | Pass num_kv_heads |
| MQA | Full (num_kv_heads=1) | Full | Standard |
| MLA (DeepSeek) | Partial (custom kernels needed) | Full (with FA-3 MLA patches) | vLLM/SGLang ship MLA-aware kernels |
| CLA (Hunyuan) | Full per-layer | Full per-layer | KV sharing handled in model code, not FA |
| Sparse attention | Limited | Limited | Custom kernels typically needed |

For frontier MoE models in 2026 (DeepSeek V3, Hunyuan-Large, Llama 4 MoE): use vLLM/SGLang/TRT-LLM rather than raw flash-attn — they ship the necessary MLA / MoE kernels.


Installation: PyTorch + Hugging Face {#install-pytorch}

# Install flash-attn (auto-builds for your CUDA)
pip install flash-attn --no-build-isolation

# Or pre-built wheel from releases page
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

For HuggingFace Transformers:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
    device_map="auto",
)

For FA-3 on Hopper:

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper
pip install -e .

Installation: vLLM / SGLang / TRT-LLM {#install-engines}

vLLM ships flash-attn as a dependency:

pip install vllm
# Auto-detects FA version based on GPU; uses FA-3 on Hopper if available

Force FA-2 (debugging Hopper issues):

VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve ...

Force FA-3:

VLLM_FLASH_ATTN_VERSION=3 vllm serve ...

SGLang and TensorRT-LLM auto-select FA-3 on Hopper builds.


llama.cpp Flash Attention {#llamacpp}

llama.cpp has its own optimized attention kernels (not the Dao Lab flash-attn library) that implement FlashAttention-style tiling for CPU and various GPU backends:

./llama-cli \
    -m model.gguf \
    -ngl 999 \
    -c 32768 \
    -fa            # Enable Flash Attention

llama.cpp's -fa provides:

  • ~30-50% speedup at long context vs standard attention
  • 50% reduced KV memory
  • Works on CUDA, ROCm, Metal, Vulkan, SYCL

Always enable for long-context use cases.


AMD ROCm and Triton Variants {#amd-triton}

Official flash-attn is CUDA-only. For AMD MI250 / MI300X:

  • AOTriton-based fork: ROCm port maintained by AMD. pip install flash-attn-rocm (or build from source).
  • Composable Kernel (CK) port: lower-level CK implementation, used by SGLang and vLLM ROCm builds.
  • Triton-based fallback: portable but ~2x slower than custom kernels.

Performance: ROCm flash-attn on MI300X is ~70-90% of H100 FA-3 throughput. For AMD inference deployments in 2026, vLLM and SGLang have mature ROCm builds — they bundle the right kernels. See AMD ROCm Local LLM Setup.


PyTorch SDPA vs flash-attn {#sdpa}

PyTorch 2.0+ ships torch.nn.functional.scaled_dot_product_attention (SDPA), which is a backend dispatcher that auto-selects:

  • FlashAttention (if installed and shape supported)
  • xFormers memory-efficient attention
  • Math (vanilla fallback)

For most use cases, SDPA + flash-attn installed = same performance as direct flash-attn calls. Benefits of SDPA:

  • Single API works across backends
  • Auto-fallback when FA doesn't support a shape
  • Handled by HuggingFace internally

When to call flash-attn directly: custom kernels in research, or when SDPA's dispatch picks the wrong backend. For 99% of users: use attn_implementation="sdpa" or "flash_attention_2" in HuggingFace and let it work.


xFormers vs FlashAttention {#xformers}

xFormers (Meta) is a broader memory-efficient attention library that pre-dates FA-2. It ships its own attention implementations, including a "flash"-style backend that is similar but not identical to Dao's flash-attn.

Trade-offs:

  • xFormers is older, broader (sparse, low-rank, etc.)
  • flash-attn is faster on standard dense attention
  • Most production has migrated from xFormers to flash-attn

For diffusion models (Stable Diffusion, Flux), xFormers is still widely used — the diffusion ecosystem hasn't fully migrated to flash-attn. For LLM serving: flash-attn dominates.


Performance Benchmarks {#benchmarks}

Llama 3.1 70B forward pass, 32K context, batch=1, H100 80GB:

| Attention | Time per forward | Memory |
|---|---|---|
| Vanilla (PyTorch) | 1820 ms | OOM at 32K |
| xFormers memory-efficient | 480 ms | 18 GB |
| FA-2 BF16 | 220 ms | 14 GB |
| FA-3 BF16 (Hopper) | 130 ms | 14 GB |
| FA-3 FP8 (Hopper) | 65 ms | 12 GB |

Llama 3.1 8B training, 8K context, A100 80GB, BF16:

| Attention | Tokens/sec/GPU |
|---|---|
| Vanilla | 8,200 |
| FA-2 | 23,500 |
| FA-3 | n/a (Hopper-only; this benchmark is on A100) |

Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| ImportError: flash_attn not found | Wheel/build mismatch | Match torch + CUDA versions; reinstall |
| RuntimeError: head_dim not supported | head_dim exceeds 256 (or is not a multiple of 8) | Use SDPA fallback for unusual shapes |
| Slow on H100 with FA-2 | Not using FA-3 | Install FA-3 from the hopper directory |
| OOM despite FA enabled | KV cache, not attention scores | Enable FP8 KV; see KV Cache Guide |
| ROCm build fails | Wrong fork | Use flash-attn-rocm or the AOTriton port |
| Sliding window ignored | flash-attn < 2.4 | Upgrade flash-attn to 2.4+ |
| FP8 quality regression | E5M2 too coarse | Try E4M3 (vLLM auto-selects) |
| Backward (training) much slower than forward | Expected behavior | FA backward is 2-3x slower than forward; normal |
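Most rows in the table above come down to version or hardware mismatches. A small diagnostic sketch (the report keys are illustrative, not a standard API) that gathers the relevant facts:

```python
import importlib.util
import torch

def flash_attn_env_report():
    # Collects the environment facts the troubleshooting table asks about.
    report = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,   # None on CPU-only builds
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability()
        report["compute_capability"] = f"{major}.{minor}"
        report["fa3_capable"] = major >= 9  # FA-3 targets sm_90 (Hopper) and newer
    return report

print(flash_attn_env_report())
```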

FAQ {#faq}

See answers to common FlashAttention questions below.


Sources: Dao et al. (2022) FlashAttention | Dao (2023) FlashAttention-2 | Shah et al. (2024) FlashAttention-3 | flash-attention GitHub | Press et al. (2022) ALiBi | PyTorch SDPA docs | Internal benchmarks H100 + A100.


Written by Pattanaik Ramswarup, creator of Local AI Master.