Performance

CUDA Optimization Techniques for Local LLMs: The Complete 2026 Guide

May 1, 2026
28 min read
LocalAimaster Research Team


NVIDIA CUDA gives local LLMs their speed — but most users leave 30-70% of their GPU's performance on the table. This guide is the complete, no-fluff reference to every CUDA optimization that meaningfully accelerates local LLM inference: from the obvious (--n-gpu-layers) to the subtle (CUDA graphs, FP8 calibration, KV-cache quantization, NCCL tuning).

Every technique below is ranked by typical impact, with concrete commands, real benchmarks, and the trade-offs nobody mentions in the README.

Table of Contents

  1. Impact Ranking: What Actually Matters
  2. Foundation: Drivers, CUDA Toolkit, cuDNN, NCCL
  3. Quantization & Precision (FP16, BF16, FP8, INT8, INT4)
  4. GPU Layer Offload: --n-gpu-layers Done Right
  5. FlashAttention 2 & 3
  6. KV-Cache Quantization & PagedAttention
  7. Tensor Cores & Mixed Precision
  8. cuBLAS, cuDNN, and Kernel Selection
  9. CUDA Graphs
  10. Tensor Parallelism, Pipeline Parallelism, NVLink
  11. Speculative Decoding & Medusa
  12. Continuous Batching (vLLM, TGI, TensorRT-LLM)
  13. Power, Clock, and Thermal Tuning
  14. MIG, MPS, and Multi-Tenant Isolation
  15. Framework-Specific Tuning: Ollama, llama.cpp, vLLM, TensorRT-LLM, ExLlamaV2
  16. Profiling: Nsight Systems, Nsight Compute, nvidia-smi dmon
  17. Common Mistakes That Silently Kill Performance
  18. Reference Configs by GPU
  19. FAQ


Impact Ranking: What Actually Matters {#impact-ranking}

Before tuning anything, know where the time goes. Here is how the major optimizations rank by typical impact, using a 70B Q4 inference workload on an RTX 4090 (24GB) as the reference:

| Rank | Optimization | Typical Speedup | Effort | Risk |
|---|---|---|---|---|
| 1 | Fit model entirely in VRAM (right quant + ngl) | 5-10x | Low | None |
| 2 | FlashAttention 2/3 | 1.4-4.0x at long context | Low | None |
| 3 | KV-cache quantization (Q8_0) | 1.1-1.4x + memory savings | Low | Negligible |
| 4 | Switch from FP16 to FP8 / INT8 / INT4 (where supported) | 1.3-2.5x | Medium | Calibration |
| 5 | Continuous batching (vLLM/TGI) — multi-user only | 5-20x aggregate | High | Framework swap |
| 6 | Speculative decoding (Medusa, EAGLE, n-gram) | 1.5-3x | Medium | Quality drift |
| 7 | CUDA graphs | 1.05-1.2x | Low | Framework support |
| 8 | Tensor parallelism with NVLink | 1.4-1.8x for 2 GPUs | Medium | Hardware |
| 9 | Power-limit / undervolt for thermal headroom | 1.0-1.05x sustained | Low | None |
| 10 | cuBLAS / cuDNN version + kernel autotune | 1.02-1.10x | Low | None |

If you do only the top three, you will outrun 90% of casual users. The rest is fine-tuning.


Foundation: Drivers, CUDA Toolkit, cuDNN, NCCL {#foundation}

The fastest kernels in the world cannot save you from a stale driver. Versions matter.

| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| NVIDIA driver | 555.x | 570.x or newer | Required for FP8 on Ada/Blackwell |
| CUDA Toolkit | 12.4 | 12.6+ | Build-time only; runtime uses driver |
| cuDNN | 9.0 | 9.5+ | Fused attention kernels improved |
| NCCL | 2.20 | 2.23+ | Multi-GPU all-reduce performance |
| TensorRT | 10.0 | 10.4+ | For TensorRT-LLM users |
# Verify your stack
nvidia-smi --query-gpu=driver_version,name,vbios_version --format=csv
nvcc --version
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"

Driver Persistence Mode (Linux)

By default the driver tears down state between processes, adding 1-3 seconds of CUDA context creation per inference run. Enable persistence mode for long-running services:

sudo nvidia-smi -pm 1   # Enable persistence (deprecated on newer drivers)
# Modern replacement (driver 470+):
sudo systemctl enable --now nvidia-persistenced

Compute Mode

For dedicated inference servers, set exclusive mode so a single CUDA context owns the GPU and avoids context-switch overhead:

sudo nvidia-smi -c EXCLUSIVE_PROCESS

Set back to default (-c DEFAULT) on workstations where you also game.


Quantization & Precision {#quantization}

Quantization is the single biggest lever — but the right format depends on your GPU generation and framework.

Format compatibility matrix

| Format | RTX 30xx (Ampere) | RTX 40xx (Ada) | RTX 50xx (Blackwell) | H100 (Hopper) | Frameworks |
|---|---|---|---|---|---|
| FP32 | ✅ slow | ✅ slow | ✅ slow | ✅ slow | All |
| FP16 | ✅ | ✅ | ✅ | ✅ | All |
| BF16 | ✅ | ✅ | ✅ | ✅ | All except very old |
| FP8 (E4M3 / E5M2) | ❌ | ✅ | ✅ | ✅ | TensorRT-LLM, vLLM, transformer-engine |
| INT8 (W8A8) | ✅ | ✅ | ✅ | ✅ | TensorRT-LLM, vLLM, llama.cpp (partial) |
| INT4 (W4A16, AWQ, GPTQ) | ✅ | ✅ | ✅ | ✅ | All major |
| GGUF Q4_K_M / Q5_K_M / Q6_K | ✅ | ✅ | ✅ | ✅ | llama.cpp, Ollama, koboldcpp |
| GGUF IQ-quants (IQ2_XS, IQ3_XXS) | ✅ | ✅ | ✅ | ✅ | llama.cpp |

Practical recommendations

  • 8B models on 12-24GB VRAM: FP16 / BF16 is fine; quality is highest, speed is plenty.
  • 14-32B models on 24GB VRAM: Q5_K_M (GGUF) or AWQ-INT4. Sweet spot for quality.
  • 70B models on 24GB VRAM: Q4_K_M (GGUF) at ~42GB total — partial offload required.
  • 70B models on 48GB VRAM (2x 3090, A6000): Q5_K_M or Q4_K_M fully on GPU.
  • 70B models on 80 GB (H100) or dual-GPU rigs (2x 4090 / 2x 5090): FP8 or AWQ-INT4 for max speed.

Why BF16 beats FP16 in 2026

BF16 has the same exponent range as FP32 (8 bits) but fewer mantissa bits (7 vs 23). For LLM inference this is almost always a net win: no overflow at long context, minimal quality difference, same throughput as FP16 on Ampere and newer. PyTorch / vLLM / TensorRT-LLM all default to BF16 for new models.

# vLLM example — explicitly request BF16
from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
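The range difference is easy to verify from the bit layouts alone. A quick pure-Python check (the max-finite formula is standard IEEE-754-style arithmetic, not framework code):

```python
# Largest finite value of a float with E exponent bits and M mantissa bits:
# (2 - 2**-M) * 2**bias, where bias = 2**(E-1) - 1.
def max_finite(exp_bits: int, man_bits: int) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -man_bits) * 2.0 ** bias

fp16_max = max_finite(5, 10)   # 65504.0 — easy to overflow with large activations
bf16_max = max_finite(8, 7)    # ~3.4e38 — same range as FP32

print(f"FP16 max: {fp16_max}, BF16 max: {bf16_max:.2e}")
```

Any intermediate value above 65504 becomes infinity in FP16; BF16 trades three mantissa bits for FP32's full exponent range, which is why long-context activations survive it.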

FP8 — the 2025-2026 breakthrough

Hopper introduced FP8, Ada brought it to consumer GPUs (RTX 40-series), and Blackwell doubled FP8 throughput again. Two formats:

  • E4M3 — 4 exponent bits, 3 mantissa bits — used for weights and activations.
  • E5M2 — 5 exponent bits, 2 mantissa bits — used for gradients (training only).

For inference you almost always want E4M3 with per-tensor or per-channel scaling. Quality after calibration is typically within 0.5% of BF16 on MMLU/HumanEval, with ~2x throughput.

# vLLM with FP8 KV cache and FP8 weights (Ada+ required)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 32768
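Per-tensor scaling is conceptually simple: find the absolute max, map it onto E4M3's finite range (±448), and keep the scale for dequantization. A toy sketch of just the scaling step — real kernels also round values to actual E4M3 bit patterns, and the function name here is illustrative:

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_per_tensor_scale(xs: list) -> tuple:
    """Toy per-tensor scaling for E4M3: scaled values fit in [-448, 448].
    Calibration quality lives almost entirely in how amax is chosen."""
    amax = max(abs(x) for x in xs)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    scaled = [x / scale for x in xs]   # these get stored in FP8
    return scale, scaled               # dequantize later as q * scale

scale, q = fp8_per_tensor_scale([0.02, -1.5, 3.2, -0.7])
assert all(abs(v) <= E4M3_MAX + 1e-9 for v in q)
```

Per-channel scaling applies the same idea per output channel, which is why it tolerates outlier channels better than a single tensor-wide scale.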

INT4 via AWQ vs GPTQ vs GGUF

  • AWQ (Activation-aware Weight Quantization) — preserves salient weights based on activation magnitude. Best quality at 4-bit. Strongly recommended for vLLM and TensorRT-LLM.
  • GPTQ — older but widely available; group-size 128 is standard.
  • GGUF Q4_K_M / IQ4_XS — llama.cpp's k-quants and i-quants. Q4_K_M is roughly equivalent to GPTQ-128g; IQ-quants squeeze 2-3% more quality at the same bit budget at the cost of slower inference.

In our testing on RTX 4090 with Llama 3.1 70B:

| Format | Size on disk | tok/s | MMLU |
|---|---|---|---|
| FP16 | 140 GB | OOM (offloaded) | 79.1 |
| FP8 (E4M3) | 70 GB | ~22 | 78.9 |
| AWQ-INT4 | 36 GB | ~38 | 77.8 |
| GPTQ-128g | 36 GB | ~34 | 77.5 |
| GGUF Q4_K_M | 42 GB | ~8 (partial offload) | 77.6 |
| GGUF IQ4_XS | 38 GB | ~9 (partial offload) | 77.9 |

For multi-GPU setups (2x 4090, 2x 5090), AWQ-INT4 + vLLM is the highest-throughput option for 70B models.



GPU Layer Offload: --n-gpu-layers Done Right {#gpu-layers}

In llama.cpp / Ollama, this single flag controls how many transformer layers run on the GPU. Get it wrong and you lose 5-10x performance.

Layer counts by model

| Model | Layers | Approximate VRAM at Q4_K_M |
|---|---|---|
| Llama 3.1 8B | 32 | ~5 GB |
| Llama 3.1 / 3.3 70B | 80 | ~42 GB |
| Qwen 2.5 7B | 28 | ~4.5 GB |
| Qwen 2.5 32B | 64 | ~20 GB |
| Qwen 2.5 72B | 80 | ~43 GB |
| Mixtral 8x7B | 32 | ~26 GB |
| Gemma 2 27B | 46 | ~17 GB |

Tuning procedure

# 1. Start with all layers on GPU
./llama-cli -m model.gguf -ngl 999 -c 4096

# 2. If OOM, drop in steps of 4
./llama-cli -m model.gguf -ngl 76 -c 4096   # 70B with 4 layers on CPU
./llama-cli -m model.gguf -ngl 72 -c 4096

# 3. Always keep the output (lm_head) on GPU
./llama-cli -m model.gguf -ngl 72 --override-tensor "output\.weight=CUDA0"

In Ollama

# Modelfile
FROM llama3.1:70b-instruct-q4_K_M
PARAMETER num_gpu 80          # 80 = all 70B layers
PARAMETER num_ctx 8192
PARAMETER num_batch 512

Or interactively at runtime:

ollama run llama3.1:70b
>>> /set parameter num_gpu 80

Why "all layers" beats "almost all layers"

With even one layer on CPU, every generated token requires a full PCIe round-trip. On PCIe 4.0 x16 (~32 GB/s practical) this adds 5-30ms per token depending on layer size — easily 50% of total latency. The KV cache also has to ping-pong between host and device. Always size your quantization to fit fully on GPU if possible.
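The arithmetic behind that claim, with assumed but realistic bandwidth figures (DDR5 at ~60 GB/s sustained versus RTX 4090 GDDR6X at ~1008 GB/s):

```python
# Decode streams each layer's weights once per generated token, so a layer
# living in system RAM is bounded by DDR bandwidth, not GPU bandwidth.
layer_gb = 42 / 80                # 70B Q4_K_M: ~42 GB spread over 80 layers
cpu_ms = layer_gb / 60 * 1000     # assumed ~60 GB/s sustained dual-channel DDR5
gpu_ms = layer_gb / 1008 * 1000   # RTX 4090 GDDR6X: ~1008 GB/s

print(f"CPU-resident layer: {cpu_ms:.1f} ms/token")   # ~8.8 ms
print(f"GPU-resident layer: {gpu_ms:.2f} ms/token")   # ~0.5 ms
```

One CPU-resident layer costs roughly as much per token as seventeen GPU layers, which is why dropping the quantization one notch to fit fully on GPU almost always wins.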

Multi-GPU tensor split

For multi-GPU setups, control distribution explicitly:

# llama.cpp — proportionally split across two 24GB GPUs
./llama-cli -m model.gguf -ngl 999 --tensor-split 24,24

# Asymmetric: 4090 (24GB) + 3090 (24GB) — keep more on the faster card
./llama-cli -m model.gguf -ngl 999 --tensor-split 28,22

FlashAttention 2 & 3 {#flash-attention}

Standard attention has O(N²) memory. FlashAttention restructures the computation to be O(N) memory and 2-4x faster by keeping the softmax in SRAM and tiling the QKV matmul.
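To see why O(N²) memory hurts, count the bytes of the materialized score matrix — the head count and FP16 scores below are illustrative assumptions:

```python
def naive_attn_score_gb(n_ctx: int, n_heads: int = 32, bytes_per: int = 2) -> float:
    """Memory for one layer's N x N attention score matrix across all heads."""
    return n_ctx * n_ctx * n_heads * bytes_per / 1e9

for n in (2048, 8192, 32768):
    print(f"{n:>6} tokens: {naive_attn_score_gb(n):8.2f} GB per layer")
```

At 32K context the naive score matrix alone (~69 GB per layer) dwarfs a 24 GB card; FlashAttention never materializes it, computing the softmax tile-by-tile in SRAM.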

Versions and hardware support

| Version | Best For | Hardware |
|---|---|---|
| FlashAttention 1 | Reference / older Ampere | All CUDA |
| FlashAttention 2 | Most local users | Ampere, Ada, Hopper, Blackwell |
| FlashAttention 3 | Maximum throughput | Hopper (H100), Blackwell (B100, RTX 50-series) |

FA3 adds FP8 support, async warp-specialized kernels, and better tail handling — typically 1.5-2.0x faster than FA2 on Hopper.

Enabling FlashAttention

llama.cpp / Ollama:

# llama.cpp
./llama-cli -m model.gguf -ngl 999 -fa

# Ollama Modelfile
PARAMETER flash_attn true

vLLM:

# Auto-selected; force a specific backend if needed:
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve <model>
# On Hopper/Blackwell, use FlashInfer or FA3:
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve <model>

TensorRT-LLM: built into the engine; no flag needed.

Real benchmarks (RTX 4090, Llama 3.1 8B BF16)

| Context | No FA | FA2 | Speedup |
|---|---|---|---|
| 2K | 138 tok/s | 142 tok/s | 1.03x |
| 8K | 78 tok/s | 121 tok/s | 1.55x |
| 16K | 31 tok/s | 88 tok/s | 2.84x |
| 32K | OOM | 51 tok/s | — |

The longer your context, the bigger the win. For RAG and agent workflows where context routinely exceeds 8K, FlashAttention is mandatory.


KV-Cache Quantization & PagedAttention {#kv-cache}

For autoregressive generation, the KV cache is often the dominant memory cost — a Llama 3.1 70B at 32K context with an FP16 KV cache eats roughly 10 GB on its own, and models without grouped-query attention need several times more. Two complementary techniques:

KV-cache quantization

Quantize the K and V tensors in place. Q8_0 is essentially free quality-wise; Q4 is risky.

# llama.cpp — requires FlashAttention
./llama-cli -m model.gguf -ngl 999 -fa \
    --cache-type-k q8_0 --cache-type-v q8_0
# vLLM — FP8 KV cache (Ada+ required for hardware acceleration)
vllm serve <model> --kv-cache-dtype fp8_e4m3

Memory savings: ~50% from FP16 → 8-bit, ~75% from FP16 → 4-bit. On long-context workloads this directly translates to bigger usable contexts or smaller GPU requirements.
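The KV-cache size formula makes those savings concrete. The dimensions below are assumptions for a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dim 128); exact figures vary by architecture:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per) -> float:
    # 2x for K and V; one vector per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per / 1e9

cfg = (80, 8, 128, 32768)          # hypothetical 70B-class GQA config at 32K
print(f"FP16 KV:  {kv_cache_gb(*cfg, 2):.1f} GB")   # ~10.7 GB
print(f"8-bit KV: {kv_cache_gb(*cfg, 1):.1f} GB")   # ~5.4 GB
```

A model without grouped-query attention (64 KV heads instead of 8) multiplies all of this by eight, which is where the really painful KV footprints come from.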

PagedAttention (vLLM)

vLLM stores the KV cache in fixed-size blocks (default 16 tokens) so that fragmentation drops from ~60-80% to <4%. This is the primary reason vLLM out-throughputs llama.cpp 5-20x on multi-user workloads — you can fit more concurrent requests in the same VRAM.

vllm serve <model> --block-size 16 --gpu-memory-utilization 0.92

Tune --gpu-memory-utilization upward (0.95-0.97) on dedicated inference boxes; leave it at 0.85-0.90 on workstations where you also run other apps.
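The fragmentation numbers fall straight out of the allocation math. A sketch comparing preallocating a full max_model_len per request against 16-token blocks (the request lengths are made up):

```python
import math

def wasted_prealloc(actual_len: int, max_len: int) -> float:
    """Fraction of reserved KV memory never used when reserving max_len upfront."""
    return 1 - actual_len / max_len

def wasted_paged(actual_len: int, block: int = 16) -> float:
    """Waste with block allocation: only the unused tail of the last block."""
    allocated = math.ceil(actual_len / block) * block
    return 1 - actual_len / allocated

print(f"prealloc: {wasted_prealloc(900, 4096):.1%} wasted")   # ~78%
print(f"paged:    {wasted_paged(900):.1%} wasted")            # ~1.3%
```

Every gigabyte reclaimed from fragmentation becomes room for additional concurrent sequences, which is the whole throughput story.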

Prefix caching

Both vLLM and TensorRT-LLM support prefix caching — system prompts and few-shot exemplars are computed once and reused across requests. For agent workloads this is a 10-100x latency win on first-token-time.

vllm serve <model> --enable-prefix-caching

Tensor Cores & Mixed Precision {#tensor-cores}

Tensor Cores are specialized matrix-multiply units. Every CUDA generation since Volta (V100) has them, but the supported types changed:

| Generation | GPUs | Tensor Core Types |
|---|---|---|
| Volta | V100 | FP16 |
| Turing | RTX 20-series | FP16, INT8, INT4 |
| Ampere | RTX 30-series, A100 | FP16, BF16, TF32, INT8, INT4, sparse |
| Hopper | H100, H200 | + FP8 (E4M3, E5M2), Transformer Engine |
| Ada | RTX 40-series, L40S | + FP8 |
| Blackwell | RTX 50-series, B100, B200 | + FP4, microscaling, FA3 native |

Making sure you actually use Tensor Cores

PyTorch:

import torch
torch.backends.cuda.matmul.allow_tf32 = True       # Ampere+
torch.backends.cudnn.allow_tf32 = True
torch.set_float32_matmul_precision("high")          # alias for TF32 on

For pure inference, dtypes BF16 / FP16 / FP8 / INT8 automatically dispatch to Tensor Cores. FP32 does not.

Mixed precision in custom code

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(input_ids)

llama.cpp, Ollama, vLLM, and TensorRT-LLM all already use Tensor Cores correctly when given a compatible dtype.


cuBLAS, cuDNN, and Kernel Selection {#cublas}

These are the math libraries underneath every framework. You usually do not touch them directly, but a few flags matter.

cuBLAS LT and heuristics caching

cuBLAS chooses kernels at runtime via heuristics. Stable workloads (same shapes repeated) benefit from caching:

# Enable cuBLAS LT heuristic cache
export CUBLASLT_LOG_LEVEL=0
export CUBLASLT_HEURISTICS_CACHE_PATH=/tmp/cublaslt-cache

cuDNN benchmark mode (PyTorch)

torch.backends.cudnn.benchmark = True   # autotune for fixed-shape workloads

Use only when input shapes are stable (which is true for inference once context length stabilizes). Setting this on dynamically-shaped training can hurt.

llama.cpp build flags

If you build llama.cpp yourself, build with:

cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_F16=ON \
    -DGGML_CUDA_FORCE_MMQ=ON \
    -DGGML_CUDA_FORCE_CUBLAS=OFF \
    -DCMAKE_CUDA_ARCHITECTURES="89;90;120"
cmake --build build -j

CMAKE_CUDA_ARCHITECTURES matters — 89 = Ada (RTX 40), 90 = Hopper, 120 = Blackwell. Building only for your card avoids fat binaries and gives slightly faster startup. GGML_CUDA_FORCE_MMQ=ON enables custom mat-mul kernels for quantized types that often beat cuBLAS at small batch sizes.


CUDA Graphs {#cuda-graphs}

A CUDA graph captures an entire sequence of kernel launches and replays it with a single CPU-side operation. For inference (where the kernel sequence is mostly the same per token), this eliminates a few microseconds of CPU launch overhead per kernel — small per launch, but a decode step issues hundreds of launches, so the savings add up.
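The aggregate matters more than the per-launch cost. Back-of-envelope arithmetic with assumed figures (a few microseconds per launch, a few hundred launches per decode step):

```python
launch_us = 5             # assumed CPU-side cost per kernel launch
kernels_per_token = 300   # assumed launches per decode step, mid-size model
token_budget_ms = 10      # 100 tok/s decode rate

overhead_ms = launch_us * kernels_per_token / 1000
share = overhead_ms / token_budget_ms
print(f"launch overhead: {overhead_ms} ms ({share:.0%} of the token budget)")
```

Capturing the whole sequence into a graph collapses those hundreds of launches into one replay call, which is where the single-digit-percent latency wins come from.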

Frameworks that use CUDA Graphs

  • TensorRT-LLM — yes, automatic
  • vLLM — yes, on by default; passing --enforce-eager disables them
  • llama.cpp — yes since b3000+, automatic when supported
  • PyTorch — manual via torch.cuda.graph()

vLLM explicit setting

CUDA graphs are on by default in vLLM; --enforce-eager is a switch (it takes no value) that turns them off:

vllm serve <model> --enforce-eager   # disables CUDA graphs — debugging only

Eager mode is useful when debugging kernels, painful for production.

Typical impact: 5-15% lower per-token latency at batch size 1, 2-5% at large batch sizes. Free win.


Tensor Parallelism, Pipeline Parallelism, NVLink {#parallelism}

For multi-GPU inference, the choice of parallelism strategy is bigger than any kernel-level tuning.

The three strategies

  • Tensor Parallelism (TP) — split each matmul across GPUs. Communication: AllReduce per layer. Latency-friendly. Used by vLLM, TensorRT-LLM, DeepSpeed-Inference.
  • Pipeline Parallelism (PP) — split the model by layer ranges. Communication: activations between stages. Throughput-friendly for batched workloads, terrible for batch-size-1 latency. Used by llama.cpp, Ollama by default.
  • Expert Parallelism (EP) — only for MoE models like Mixtral. Different experts on different GPUs.

When to pick which

| Setup | Best Strategy |
|---|---|
| Single user, latency matters | TP=N (with NVLink if available) |
| Batched server, throughput matters | TP=2 + PP if needed |
| MoE model | EP across experts |
| Mixed VRAM (24GB + 16GB) | PP with manual layer split |

vLLM tensor parallel

# 2x RTX 4090
vllm serve meta-llama/Llama-3.1-70B-Instruct-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 32768

Interconnect bandwidth

| Bus | Bandwidth (bidir) | Available On |
|---|---|---|
| PCIe 4.0 x16 | ~32 GB/s | All modern PCs |
| PCIe 5.0 x16 | ~64 GB/s | Z790/X670, RTX 50-series |
| NVLink 3 (consumer) | ~112 GB/s | RTX 3090, RTX 3090 Ti |
| NVLink 4 | ~900 GB/s | H100 SXM, B100 |

RTX 4090 and 5090 do not support NVLink. For consumer 70B inference on 2 GPUs, your only options are 2x 3090 with NVLink, or 2x 4090/5090 over PCIe.

NCCL tuning for multi-GPU

export NCCL_P2P_LEVEL=NVL          # require NVLink path if available
export NCCL_DEBUG=WARN
export NCCL_IB_DISABLE=1            # disable InfiniBand on workstations
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_NET_GDR_LEVEL=PHB       # GPU Direct RDMA when applicable

For 2 GPUs in one box, defaults are usually fine; the above matters more on 4+ GPU rigs.


Speculative Decoding & Medusa {#speculative}

Speculative decoding uses a small "draft" model to guess several tokens, then has the big model verify them in a single forward pass. Net effect: fewer big-model forward passes per generated token.

Methods

| Method | Speedup | Quality | Notes |
|---|---|---|---|
| Vanilla speculative decoding | 1.5-2.5x | Identical to target | Needs a small fast model with the same vocab |
| n-gram (prompt lookup) | 1.2-1.6x | Identical | No draft model required |
| Medusa heads | 2.0-3.0x | Near-identical | Train extra heads on the target model |
| EAGLE / EAGLE-2 | 2.5-3.5x | Identical | More complex training |
| Lookahead decoding | 1.4-2.2x | Identical | No draft model |

llama.cpp speculative decoding

./llama-cli -m large.gguf \
    --model-draft small.gguf \
    -ngl 999 --draft-max 8 -p "..."

Pair models with the same tokenizer (e.g., Llama 3.1 70B target + Llama 3.2 1B draft).

vLLM speculative decoding

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5

Acceptance rate matters

Speculative decoding is only fast if the draft model agrees with the target. Acceptance rates below 60% can be slower than no speculation. Measure with vLLM's --collect-detailed-traces.
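An idealized model shows why the acceptance rate dominates. With draft length k and per-token acceptance probability p, one verification pass yields an expected geometric run of accepted tokens plus one corrected token. The toy formula below ignores verification and batching overhead, so real numbers are lower — especially at low acceptance:

```python
def spec_speedup(p: float, k: int, draft_cost: float = 0.05) -> float:
    """Idealized speculative-decoding speedup.
    p: chance the target accepts each draft token; k: draft length;
    draft_cost: draft forward-pass cost relative to one target pass (assumed)."""
    expected_tokens = (1 - p ** (k + 1)) / (1 - p)  # geometric series sum
    return expected_tokens / (1 + k * draft_cost)

for p in (0.8, 0.6, 0.4):
    print(f"acceptance {p:.0%}: {spec_speedup(p, k=5):.2f}x")
```

The curve flattens hard below ~60% acceptance, and once real verification overhead is added those low-acceptance configurations drop under 1.0x — which is exactly the failure mode the traces reveal.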


Continuous Batching {#batching}

If you serve more than one user, this single feature is the biggest single throughput win available.

Static batching waits for a batch to fill, then runs all sequences to completion together. Continuous batching swaps in new requests at every decoding step, so the GPU is never idle.

vLLM, TGI (HuggingFace), TensorRT-LLM, and SGLang all implement continuous batching under various names ("iteration-level scheduling," "in-flight batching"). llama.cpp's server offers a simpler slot-based variant (--cont-batching); Ollama remains oriented toward single-user desktop use.

Aggregate throughput on RTX 4090 with Llama 3.1 8B BF16:

| Concurrency | Ollama (tok/s, sum) | vLLM (tok/s, sum) |
|---|---|---|
| 1 | 138 | 132 |
| 4 | 142 | 480 |
| 16 | 145 | 1,150 |
| 32 | 145 | 1,720 |
| 64 | 145 | 2,200 |

Single-user, llama.cpp wins by a hair. Multi-user, vLLM is 15x faster. Pick the right tool.
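A toy scheduler model reproduces the shape of that gap. Static batching runs a batch until its longest sequence finishes, so short sequences leave their slots idle; the request lengths below are made up:

```python
# 16 requests with varying output lengths sharing one batch of 16 slots
lengths = [40, 60, 80, 120, 200, 50, 70, 90, 110, 30, 45, 65, 85, 100, 150, 180]

# Static batching: every slot stays occupied for max(lengths) decode steps,
# but only does useful work for its own sequence's length.
static_util = sum(lengths) / (len(lengths) * max(lengths))
print(f"static batching slot utilization: {static_util:.0%}")   # ~46%

# Continuous batching: a finished slot is refilled at the next decode step,
# so utilization approaches 100% whenever a request queue exists.
```

Roughly half the GPU-seconds are wasted in this static example; with real length distributions (and long-tail outliers) the waste is often worse, which is where the order-of-magnitude aggregate gains come from.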


Power, Clock, and Thermal Tuning {#power-tuning}

NVIDIA consumer GPUs ship with aggressive boost behavior that thermal-throttles in long inference runs and inflates noise. Capping power gives you most of the speed at much lower noise and heat.

Power limit (Linux)

sudo nvidia-smi -pl 350                # cap RTX 4090 at 350W (stock 450W)
sudo nvidia-smi -pl 280                # cap RTX 3090 at 280W (stock 350W)

Set persistent at boot:

# /etc/systemd/system/nvidia-power-limit.service
[Unit]
Description=NVIDIA GPU power limit
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 350
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Lock clocks for predictable latency

sudo nvidia-smi -lgc 1500,2520        # lock graphics clock between 1500 and 2520 MHz
sudo nvidia-smi -lmc 10501            # lock memory clock at 10501 MHz (4090 stock)

Locked clocks eliminate the 30-50ms jitter from boost transitions, which matters for low-latency agent loops.

Undervolting

On Linux: nvidia-smi can lock voltage indirectly via -lgc upper bound. On Windows: MSI Afterburner / NVIDIA App curve editor — drop the curve by 50-80mV and lower the clock ceiling 100-150 MHz; test with sustained llama-bench for an hour.

Typical undervolt: RTX 4090 at 0.9V / 2520 MHz delivers ~99% of stock performance at ~340W instead of ~450W.

Fan curve

Default fan curves are conservative. For long inference runs, set fans to ramp earlier:

# nvidia-settings (Linux, X required)
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUTargetFanSpeed=70"

MIG, MPS, and Multi-Tenant Isolation {#mig-mps}

If multiple processes need to share a GPU without one starving the other:

  • MPS (Multi-Process Service) — multiplexes CUDA contexts on a single GPU. Works on all NVIDIA GPUs since Volta. Latency-friendly, no isolation.
  • MIG (Multi-Instance GPU) — partitions a GPU into hardware-isolated slices. A100, H100, H200 only — not consumer.

Enabling MPS (Linux)

# Per user, before launching CUDA processes
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d   # start MPS daemon

MPS is useful if you run, e.g., Ollama for chat and an embedding service simultaneously on one GPU — without MPS they serialize CUDA contexts and steal latency from each other.

MIG on H100

sudo nvidia-smi -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C   # seven 1g.10gb instances

For local-LLM hobbyists this is rarely relevant. For shared lab GPUs it is essential.


Framework-Specific Tuning {#framework-tuning}

Ollama

# Modelfile
FROM llama3.1:70b-instruct-q4_K_M
PARAMETER num_gpu 80
PARAMETER num_ctx 8192
PARAMETER num_batch 512
PARAMETER num_thread 8
PARAMETER flash_attn true
PARAMETER use_mmap true

Useful environment variables:

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_KEEP_ALIVE=24h

OLLAMA_KEEP_ALIVE matters — by default Ollama unloads the model after 5 minutes, and reloading costs 5-30 seconds.

llama.cpp

./llama-server \
    -m model.gguf \
    -ngl 999 \
    -c 8192 \
    -b 2048 -ub 512 \
    -fa \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --no-mmap \
    --threads 8 \
    --tensor-split 24,24

-b 2048 -ub 512 controls logical and physical batch sizes for prompt processing — higher -b is faster on prompt eval, higher -ub uses more VRAM.

vLLM

vllm serve meta-llama/Llama-3.1-70B-Instruct-AWQ \
    --quantization awq \
    --kv-cache-dtype fp8_e4m3 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192

--enable-chunked-prefill interleaves long prompt prefill with decode steps so a 32K-token prompt does not block other requests.

TensorRT-LLM

Build the engine ahead of time, then serve. Engines are GPU-architecture-specific.

# Build (Llama 3.1 70B with AWQ-INT4, TP=2)
trtllm-build \
    --checkpoint_dir ./Llama3.1-70B-awq \
    --output_dir ./engines/llama3.1-70b-awq-tp2 \
    --gemm_plugin auto \
    --gpt_attention_plugin auto \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable \
    --max_input_len 32768 \
    --max_seq_len 33792 \
    --max_batch_size 16 \
    --tp_size 2

# Serve via Triton or trtllm-serve
trtllm-serve ./engines/llama3.1-70b-awq-tp2 --port 8000

ExLlamaV2

Best in class for single-GPU INT4 inference on Ampere/Ada with 24GB-class cards. Use exllamav2_HF loader in text-generation-webui, or the standalone server. EXL2 quantization (variable bit allocation) frequently beats AWQ on quality at the same size.


Profiling: Nsight Systems, Nsight Compute, nvidia-smi dmon {#profiling}

You cannot optimize what you do not measure.

Quick health check

nvidia-smi dmon -s pucvmet -d 1

Watch for:

  • sm (SM utilization) — should be 80-99% during decode. If it sits below 50%, you are CPU-bound or PCIe-bound.
  • mem (memory bandwidth) — for LLM decode this is usually the bottleneck; expect 70-90% of theoretical.
  • pwr — should sit just under your power limit.
  • tmp — under 80°C ideally; 85°C+ means thermal throttling is imminent.
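The memory-bandwidth observation gives you a quick sanity check for any decode number: every generated token streams the active weights once, so bandwidth divided by model size is the ceiling (ignoring KV-cache reads and compute):

```python
def decode_ceiling_toks(model_gb: float, vram_gbps: float) -> float:
    """Upper bound on batch-1 decode speed for a memory-bound model."""
    return vram_gbps / model_gb

# RTX 4090 GDDR6X: ~1008 GB/s theoretical
print(f"8B Q4 (~5 GB):   {decode_ceiling_toks(5, 1008):.0f} tok/s max")
print(f"70B Q4 (~42 GB): {decode_ceiling_toks(42, 1008):.0f} tok/s max")
```

If your measured tok/s sits far below this ceiling while sm utilization is low, suspect the CPU, PCIe, or launch-overhead items rather than the kernels.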

Nsight Systems (timeline)

nsys profile -o llm-trace --stats=true \
    python -c "..."

Open in Nsight Systems UI. Look for gaps between kernels (CPU bottleneck) and unusually long kernels (memory-bound).

Nsight Compute (kernel-level)

ncu --set full -o kernel-report ./llama-cli ...

Heavy hammer; use only when you suspect a specific kernel is slow.

Framework-native profilers

  • vLLM: --collect-detailed-traces all writes per-request trace JSON.
  • PyTorch: torch.profiler with the tensorboard_trace_handler.
  • llama.cpp: llama-bench -m model.gguf -ngl 999 -p 512 -n 128 for repeatable throughput numbers.

Common Mistakes That Silently Kill Performance {#mistakes}

  1. Wrong PCIe slot — second NVMe or chipset PCIe slots often run x4 or x8. Verify with nvidia-smi --query-gpu=pcie.link.width.current,pcie.link.gen.current --format=csv.
  2. Resizable BAR off — enable in BIOS. Improves PCIe transfers significantly.
  3. Background processes on the GPU — browser hardware acceleration, Discord overlay, OBS. Each steals 100-500 MB VRAM and a few percent throughput.
  4. PCIe ASPM (power saving) — disable in BIOS for inference servers. Wakeup latency adds jitter.
  5. malloc / pageable host memory — for tools that copy weights via host, pinned memory is 2-3x faster. Most frameworks handle this; custom code often does not.
  6. Wrong GGUF quant — IQ-quants are higher quality at the same size but slower; on Ada/Blackwell prefer K-quants for raw speed.
  7. OLLAMA_KEEP_ALIVE default of 5 min — model unload + reload kills latency-sensitive workflows. Set to 24h.
  8. Persistent mode off — context creation latency at every CUDA process start.
  9. Mixing CUDA toolkit versions — match PyTorch's expected CUDA runtime to your driver.
  10. Running BF16 model in FP32 — happens silently when frameworks fall back. Always verify with a profile or memory footprint check.

Reference Configs by GPU {#reference-configs}

RTX 3090 / 3090 Ti (24 GB, Ampere)

  • Best target models: 8B FP16, 14B AWQ, 32B Q5_K_M, 70B partial offload.
  • Power limit: 280-300W.
  • Use llama.cpp with FA2 + Q8 KV cache for 32B models.
  • Two cards with NVLink: best price-to-performance for 70B at home.

RTX 4070 Ti Super / 4080 Super (16 GB, Ada)

  • Best target models: 8B BF16, 14B AWQ, 32B Q4_K_M with offload.
  • Power limit: 250W.
  • Enable FP8 KV cache in vLLM for long context.

RTX 4090 (24 GB, Ada)

  • Best target models: 8B BF16, 32B AWQ, 70B partial offload (Q4_K_M).
  • Power limit: 350-380W is the sweet spot.
  • vLLM AWQ + FP8 KV cache for max throughput.

RTX 5090 (32 GB, Blackwell)

  • Best target models: 8B BF16, 32B BF16, 70B in IQ3-class GGUF quants fully on GPU (AWQ-INT4 at ~36 GB is just over 32 GB).
  • FA3 + FP8 deliver the largest gen-over-gen jump in years.
  • Power limit: 500-550W.
  • Lock memory clocks; GDDR7 boost can be jittery on early drivers.

Dual-GPU rigs

  • 2x 3090 NVLink: still the best $/perf for 70B in 2026.
  • 2x 4090 PCIe: faster than 2x 3090 in TP=2 despite no NVLink, but ~2x cost.
  • 2x 5090 PCIe 5.0: fastest consumer 70B inference; FP8 + FA3 is transformative.

FAQ {#faq}

See answers to common CUDA optimization questions below.


Sources & further reading: llama.cpp performance docs | vLLM documentation | TensorRT-LLM repo | FlashAttention paper | PagedAttention / vLLM paper | NVIDIA CUDA Best Practices Guide | Internal benchmarks on RTX 3090, 4090, 5090, and H100.




Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
