CUDA Optimization Techniques for Local LLMs: The Complete 2026 Guide
NVIDIA CUDA gives local LLMs their speed — but most users leave 30-70% of their GPU's performance on the table. This guide is the complete, no-fluff reference to every CUDA optimization that meaningfully accelerates local LLM inference: from the obvious (--n-gpu-layers) to the subtle (CUDA graphs, FP8 calibration, KV-cache quantization, NCCL tuning).
Every technique below is ranked by typical impact, with concrete commands, real benchmarks, and the trade-offs nobody mentions in the README.
Table of Contents
- Impact Ranking: What Actually Matters
- Foundation: Drivers, CUDA Toolkit, cuDNN, NCCL
- Quantization & Precision (FP16, BF16, FP8, INT8, INT4)
- GPU Layer Offload: --n-gpu-layers Done Right
- FlashAttention 2 & 3
- KV-Cache Quantization & PagedAttention
- Tensor Cores & Mixed Precision
- cuBLAS, cuDNN, and Kernel Selection
- CUDA Graphs
- Tensor Parallelism, Pipeline Parallelism, NVLink
- Speculative Decoding & Medusa
- Continuous Batching (vLLM, TGI, TensorRT-LLM)
- Power, Clock, and Thermal Tuning
- MIG, MPS, and Multi-Tenant Isolation
- Framework-Specific Tuning: Ollama, llama.cpp, vLLM, TensorRT-LLM, ExLlamaV2
- Profiling: Nsight Systems, Nsight Compute, nvidia-smi dmon
- Common Mistakes That Silently Kill Performance
- Reference Configs by GPU
Impact Ranking: What Actually Matters {#impact-ranking}
Before tuning anything, know where the time goes. Here is a typical 70B Q4 inference workload on an RTX 4090 (24GB) and the order of optimizations by impact:
| Rank | Optimization | Typical Speedup | Effort | Risk |
|---|---|---|---|---|
| 1 | Fit model entirely in VRAM (right quant + ngl) | 5-10x | Low | None |
| 2 | FlashAttention 2/3 | 1.4-4.0x at long context | Low | None |
| 3 | KV-cache quantization (Q8_0) | 1.1-1.4x + memory savings | Low | Negligible |
| 4 | Switch from FP16 to FP8 / INT8 / INT4 (where supported) | 1.3-2.5x | Medium | Calibration |
| 5 | Continuous batching (vLLM/TGI) — multi-user only | 5-20x aggregate | High | Framework swap |
| 6 | Speculative decoding (Medusa, EAGLE, n-gram) | 1.5-3x | Medium | Quality drift |
| 7 | CUDA graphs | 1.05-1.2x | Low | Framework support |
| 8 | Tensor parallelism with NVLink | 1.4-1.8x for 2 GPUs | Medium | Hardware |
| 9 | Power-limit / undervolt for thermal headroom | 1.0-1.05x sustained | Low | None |
| 10 | cuBLAS / cuDNN version + kernel autotune | 1.02-1.10x | Low | None |
If you do only the top three, you will outrun 90% of casual users. The rest is fine-tuning.
Foundation: Drivers, CUDA Toolkit, cuDNN, NCCL {#foundation}
The fastest kernels in the world cannot save you from a stale driver. Versions matter.
Recommended baseline (May 2026)
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| NVIDIA driver | 555.x | 570.x or newer | Required for FP8 on Ada/Blackwell |
| CUDA Toolkit | 12.4 | 12.6+ | Build-time only; runtime uses driver |
| cuDNN | 9.0 | 9.5+ | Fused attention kernels improved |
| NCCL | 2.20 | 2.23+ | Multi-GPU all-reduce performance |
| TensorRT | 10.0 | 10.4+ | For TensorRT-LLM users |
# Verify your stack
nvidia-smi --query-gpu=driver_version,name,vbios_version --format=csv
nvcc --version
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"
Driver Persistence Mode (Linux)
By default the driver tears down state between processes, adding 1-3 seconds of CUDA context creation per inference run. Enable persistence mode for long-running services:
sudo nvidia-smi -pm 1 # Enable persistence (deprecated on newer drivers)
# Modern replacement (driver 470+):
sudo systemctl enable --now nvidia-persistenced
Compute Mode
For dedicated inference servers, set exclusive mode so a single CUDA context owns the GPU and avoids context-switch overhead:
sudo nvidia-smi -c EXCLUSIVE_PROCESS
Set back to default (-c DEFAULT) on workstations where you also game.
Quantization & Precision {#quantization}
Quantization is the single biggest lever — but the right format depends on your GPU generation and framework.
Format compatibility matrix
| Format | RTX 30xx (Ampere) | RTX 40xx (Ada) | RTX 50xx (Blackwell) | H100 (Hopper) | Frameworks |
|---|---|---|---|---|---|
| FP32 | ✅ slow | ✅ slow | ✅ slow | ✅ slow | All |
| FP16 | ✅ | ✅ | ✅ | ✅ | All |
| BF16 | ✅ | ✅ | ✅ | ✅ | All except very old |
| FP8 (E4M3 / E5M2) | ❌ | ✅ | ✅ | ✅ | TensorRT-LLM, vLLM, transformer-engine |
| INT8 (W8A8) | ✅ | ✅ | ✅ | ✅ | TensorRT-LLM, vLLM, llama.cpp (partial) |
| INT4 (W4A16, AWQ, GPTQ) | ✅ | ✅ | ✅ | ✅ | All major |
| GGUF Q4_K_M / Q5_K_M / Q6_K | ✅ | ✅ | ✅ | ✅ | llama.cpp, Ollama, koboldcpp |
| GGUF IQ-quants (IQ2_XS, IQ3_XXS) | ✅ | ✅ | ✅ | ✅ | llama.cpp |
Practical recommendations
- 8B models on 12-24GB VRAM: FP16 / BF16 is fine; quality is highest, speed is plenty.
- 14-32B models on 24GB VRAM: Q5_K_M (GGUF) or AWQ-INT4. Sweet spot for quality.
- 70B models on 24GB VRAM: Q4_K_M (GGUF) at ~42GB total — partial offload required.
- 70B models on 48GB VRAM (2x 3090, A6000): Q5_K_M or Q4_K_M fully on GPU.
- 70B models on 64GB+ (H100, 2x 5090): FP8 or AWQ-INT4 for max speed.
Why BF16 beats FP16 in 2026
BF16 has the same exponent range as FP32 (8 bits) but fewer mantissa bits (7 vs 23). For LLM inference this is almost always a net win: no overflow at long context, minimal quality difference, same throughput as FP16 on Ampere and newer. PyTorch / vLLM / TensorRT-LLM all default to BF16 for new models.
# vLLM example — explicitly request BF16
from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
FP8 — the 2025-2026 breakthrough
Hopper introduced FP8, Ada brought it to consumer GPUs (RTX 40-series), and Blackwell doubled FP8 throughput again. Two formats:
- E4M3 — 4 exponent bits, 3 mantissa bits — used for weights and activations.
- E5M2 — 5 exponent bits, 2 mantissa bits — used for gradients (training only).
For inference you almost always want E4M3 with per-tensor or per-channel scaling. Quality after calibration is typically within 0.5% of BF16 on MMLU/HumanEval, with ~2x throughput.
# vLLM with FP8 KV cache and FP8 weights (Ada+ required)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 32768
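Under the hood, FP8 weight quantization is just a cast plus a scale. A minimal per-tensor E4M3 sketch in PyTorch — assuming a recent PyTorch build with float8 dtypes; production engines use per-channel scales and calibration data rather than this toy:
import torch
# Per-tensor E4M3: scale so max |w| maps to E4M3's max magnitude (448.0)
def quantize_fp8_e4m3(w: torch.Tensor):
    scale = w.abs().max().clamp(min=1e-12) / 448.0
    return (w / scale).to(torch.float8_e4m3fn), scale
def dequantize(w_fp8: torch.Tensor, scale: torch.Tensor):
    return w_fp8.to(torch.bfloat16) * scale
w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_q, s = quantize_fp8_e4m3(w)
print((dequantize(w_q, s) - w).abs().mean())  # small mean abs error vs BF16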
INT4 via AWQ vs GPTQ vs GGUF
- AWQ (Activation-aware Weight Quantization) — preserves salient weights based on activation magnitude. Best quality at 4-bit. Strongly recommended for vLLM and TensorRT-LLM.
- GPTQ — older but widely available; group-size 128 is standard.
- GGUF Q4_K_M / IQ4_XS — llama.cpp's k-quants and i-quants. Q4_K_M is roughly equivalent to GPTQ-128g; IQ-quants squeeze 2-3% more quality at the same bit budget at the cost of slower inference.
In our testing on RTX 4090 with Llama 3.1 70B:
| Format | Size on disk | tok/s (fully on GPU) | MMLU |
|---|---|---|---|
| FP16 | 140 GB | OOM (offloaded) | 79.1 |
| FP8 (E4M3) | 70 GB | ~22 | 78.9 |
| AWQ-INT4 | 36 GB | ~38 | 77.8 |
| GPTQ-128g | 36 GB | ~34 | 77.5 |
| GGUF Q4_K_M | 42 GB | ~8 (partial offload) | 77.6 |
| GGUF IQ4_XS | 38 GB | ~9 (partial offload) | 77.9 |
For multi-GPU setups (2x 4090, 2x 5090), AWQ-INT4 + vLLM is the highest-throughput option for 70B models.
GPU Layer Offload: --n-gpu-layers Done Right {#gpu-layers}
In llama.cpp / Ollama, this single flag controls how many transformer layers run on the GPU. Get it wrong and you lose 5-10x performance.
Layer counts by model
| Model | Layers | Approximate VRAM at Q4_K_M |
|---|---|---|
| Llama 3.1 8B | 32 | ~5 GB |
| Llama 3.1 / 3.3 70B | 80 | ~42 GB |
| Qwen 2.5 7B | 28 | ~4.5 GB |
| Qwen 2.5 32B | 64 | ~20 GB |
| Qwen 2.5 72B | 80 | ~43 GB |
| Mixtral 8x7B | 32 | ~26 GB (Q4_K_M) |
| Gemma 2 27B | 46 | ~17 GB |
Tuning procedure
# 1. Start with all layers on GPU
./llama-cli -m model.gguf -ngl 999 -c 4096
# 2. If OOM, drop in steps of 4
./llama-cli -m model.gguf -ngl 76 -c 4096 # 70B with 4 layers on CPU
./llama-cli -m model.gguf -ngl 72 -c 4096
# 3. Always keep the output (lm_head) on GPU
./llama-cli -m model.gguf -ngl 72 --override-tensor "output.weight=CUDA0"
In Ollama
# Modelfile
FROM llama3.1:70b-instruct-q4_K_M
PARAMETER num_gpu 80 # 80 = all 70B layers
PARAMETER num_ctx 8192
PARAMETER num_batch 512
Or at runtime, inside an interactive session:
ollama run llama3.1:70b
/set parameter num_gpu 80
Why "all layers" beats "almost all layers"
With even one layer on CPU, every generated token stalls on that layer: its weights are re-read from host DRAM, and activations make a PCIe round-trip. On PCIe 4.0 x16 (~32 GB/s practical) this adds 5-30ms per token depending on layer size — easily 50% of total latency. The KV cache for CPU layers also has to ping-pong between host and device. Always size your quantization to fit fully on GPU if possible.
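A back-of-envelope check on that 5-30 ms figure — the bandwidth numbers below are rough assumptions, not measurements:
# One CPU-resident layer: weights re-read from host DRAM every token,
# plus a PCIe round-trip for the activations.
model_bytes = 42e9                    # 70B Q4_K_M
layer_bytes = model_bytes / 80        # ~0.53 GB per layer
dram_bw = 50e9                        # dual-channel DDR5, practical
pcie_bw = 32e9                        # PCIe 4.0 x16, practical
act_bytes = 2 * 8192 * 2              # hidden=8192, fp16, both directions
cpu_ms = layer_bytes / dram_bw * 1e3  # ~10.5 ms per token
pcie_ms = act_bytes / pcie_bw * 1e3   # ~0.001 ms — latency, not bandwidth, is the PCIe cost
print(f"~{cpu_ms + pcie_ms:.1f} ms per token for a single CPU layer")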
Multi-GPU tensor split
For multi-GPU setups, control distribution explicitly:
# llama.cpp — proportionally split across two 24GB GPUs
./llama-cli -m model.gguf -ngl 999 --tensor-split 24,24
# Asymmetric: 4090 (24GB) + 3090 (24GB) — keep more on the faster card
./llama-cli -m model.gguf -ngl 999 --tensor-split 28,22
FlashAttention 2 & 3 {#flash-attention}
Standard attention materializes an O(N²) score matrix. FlashAttention restructures the computation to use O(N) memory and run 2-4x faster by computing the softmax online in on-chip SRAM and tiling the QKV matmuls.
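The heart of the trick is the online softmax: a running max and running denominator let you process scores chunk by chunk without ever holding the full row. A pure-Python illustration for one query row — the real kernels do this per tile, fused with the matmuls:
import math
def online_softmax_weighted_sum(scores, values):
    m, d, acc = -math.inf, 0.0, 0.0   # running max, denominator, weighted sum
    for s, v in zip(scores, values):
        m_new = max(m, s)
        alpha = math.exp(m - m_new)   # rescale old accumulators to the new max
        d = d * alpha + math.exp(s - m_new)
        acc = acc * alpha + math.exp(s - m_new) * v
        m = m_new
    return acc / d
print(online_softmax_weighted_sum([0.1, 2.0, -1.0, 3.0], [1.0, 2.0, 3.0, 4.0]))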
Versions and hardware support
| Version | Best For | Hardware |
|---|---|---|
| FlashAttention 1 | Reference / older Ampere | All CUDA |
| FlashAttention 2 | Most local users | Ampere, Ada, Hopper, Blackwell |
| FlashAttention 3 | Maximum throughput | Hopper (H100), Blackwell (B100, RTX 50-series) |
FA3 adds FP8 support, async warp-specialized kernels, and better tail handling — typically 1.5-2.0x faster than FA2 on Hopper.
Enabling FlashAttention
llama.cpp / Ollama:
# llama.cpp
./llama-cli -m model.gguf -ngl 999 -fa
# Ollama Modelfile
PARAMETER flash_attn true
vLLM:
# Auto-selected; force a specific backend if needed:
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve <model>
# On Hopper/Blackwell, use FlashInfer or FA3:
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve <model>
TensorRT-LLM: built into the engine; no flag needed.
Real benchmarks (RTX 4090, Llama 3.1 8B BF16)
| Context | No FA | FA2 | Speedup |
|---|---|---|---|
| 2K | 138 tok/s | 142 tok/s | 1.03x |
| 8K | 78 tok/s | 121 tok/s | 1.55x |
| 16K | 31 tok/s | 88 tok/s | 2.84x |
| 32K | OOM | 51 tok/s | ∞ |
The longer your context, the bigger the win. For RAG and agent workflows where context routinely exceeds 8K, FlashAttention is mandatory.
KV-Cache Quantization & PagedAttention {#kv-cache}
For autoregressive generation, the KV cache is often the dominant memory cost — a Llama 3.1 70B at 32K context with an FP16 KV cache eats ~10 GB on its own. Two complementary techniques:
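The arithmetic behind that number, using Llama 3.1 70B's GQA shape:
# KV bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elt * tokens
layers, kv_heads, head_dim = 80, 8, 128
tokens, fp16_bytes = 32_768, 2
gb = 2 * layers * kv_heads * head_dim * fp16_bytes * tokens / 1e9
print(f"{gb:.1f} GB")  # ~10.7 GB at FP16; half that at Q8_0 or FP8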
KV-cache quantization
Quantize the K and V tensors in place. Q8_0 is essentially free quality-wise; Q4 is risky.
# llama.cpp — requires FlashAttention
./llama-cli -m model.gguf -ngl 999 -fa \
--cache-type-k q8_0 --cache-type-v q8_0
# vLLM — FP8 KV cache (Ada+ required for hardware acceleration)
vllm serve <model> --kv-cache-dtype fp8_e4m3
Memory savings: ~50% from FP16 → 8-bit, ~75% from FP16 → 4-bit. On long-context workloads this directly translates to bigger usable contexts or smaller GPU requirements.
PagedAttention (vLLM)
vLLM stores the KV cache in fixed-size blocks (default 16 tokens) so that fragmentation drops from ~60-80% to <4%. This is the primary reason vLLM out-throughputs llama.cpp 5-20x on multi-user workloads — you can fit more concurrent requests in the same VRAM.
vllm serve <model> --block-size 16 --gpu-memory-utilization 0.92
Tune --gpu-memory-utilization upward (0.95-0.97) on dedicated inference boxes; leave it at 0.85-0.90 on workstations where you also run other apps.
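The core idea in miniature — a toy block allocator in the spirit of PagedAttention, not vLLM's actual implementation:
BLOCK = 16  # tokens per KV block

class PagedKV:
    def __init__(self, n_blocks: int):
        self.free = list(range(n_blocks))
        self.seqs = {}                       # seq_id -> (block_table, n_tokens)

    def append(self, seq_id: int):
        table, n = self.seqs.get(seq_id, ([], 0))
        if n % BLOCK == 0:                   # current block full: grab one on demand
            table.append(self.free.pop())
        self.seqs[seq_id] = (table, n + 1)

    def release(self, seq_id: int):
        table, _ = self.seqs.pop(seq_id)
        self.free.extend(table)              # freed blocks serve the next request

kv = PagedKV(n_blocks=1024)
for _ in range(40):
    kv.append(seq_id=0)
print(kv.seqs[0][0])  # 3 blocks for 40 tokens — no max-length reservation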
Prefix caching
Both vLLM and TensorRT-LLM support prefix caching — system prompts and few-shot exemplars are computed once and reused across requests. For agent workloads this is a 10-100x win on time to first token.
vllm serve <model> --enable-prefix-caching
Tensor Cores & Mixed Precision {#tensor-cores}
Tensor Cores are specialized matrix-multiply units. Every CUDA generation since Volta (V100) has them, but the supported types changed:
| Generation | GPUs | Tensor Core Types |
|---|---|---|
| Volta | V100 | FP16 |
| Turing | RTX 20-series | FP16, INT8, INT4 |
| Ampere | RTX 30-series, A100 | FP16, BF16, TF32, INT8, INT4, sparse |
| Hopper | H100, H200 | + FP8 (E4M3, E5M2), Transformer Engine |
| Ada | RTX 40-series, L40S | + FP8 |
| Blackwell | RTX 50-series, B100, B200 | + FP4, microscaling, FA3 native |
Making sure you actually use Tensor Cores
PyTorch:
import torch
torch.backends.cuda.matmul.allow_tf32 = True # Ampere+
torch.backends.cudnn.allow_tf32 = True
torch.set_float32_matmul_precision("high") # alias for TF32 on
For pure inference, dtypes BF16 / FP16 / FP8 / INT8 automatically dispatch to Tensor Cores. FP32 does not.
Mixed precision in custom code
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
out = model(input_ids)
llama.cpp, Ollama, vLLM, and TensorRT-LLM all already use Tensor Cores correctly when given a compatible dtype.
cuBLAS, cuDNN, and Kernel Selection {#cublas}
These are the math libraries underneath every framework. You usually do not touch them directly, but a few flags matter.
cuBLAS LT and heuristics caching
cuBLAS chooses kernels at runtime via heuristics. Stable workloads (same shapes repeated) benefit from caching:
# Enable cuBLAS LT heuristic cache
export CUBLASLT_LOG_LEVEL=0
export CUBLASLT_HEURISTICS_CACHE_PATH=/tmp/cublaslt-cache
cuDNN benchmark mode (PyTorch)
torch.backends.cudnn.benchmark = True # autotune for fixed-shape workloads
Use only when input shapes are stable (which is true for inference once context length stabilizes). Setting this on dynamically-shaped training can hurt.
llama.cpp build flags
If you build llama.cpp yourself, build with:
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_FORCE_MMQ=ON \
-DGGML_CUDA_FORCE_CUBLAS=OFF \
-DCMAKE_CUDA_ARCHITECTURES="89;90;120"
cmake --build build -j
CMAKE_CUDA_ARCHITECTURES matters — 89 = Ada (RTX 40), 90 = Hopper, 120 = Blackwell. Building only for your card avoids fat binaries and gives slightly faster startup. GGML_CUDA_FORCE_MMQ=ON enables custom mat-mul kernels for quantized types that often beat cuBLAS at small batch sizes.
CUDA Graphs {#cuda-graphs}
A CUDA graph captures an entire sequence of kernel launches and replays them with a single CPU-side operation. For inference (where the kernel sequence is mostly the same per token), this removes 10-50µs of launch overhead per token — small per token, large in aggregate.
Frameworks that use CUDA Graphs
- TensorRT-LLM — yes, automatic
- vLLM — yes; CUDA graphs are used for decode steps by default (eager mode disables them — see below)
- llama.cpp — yes since b3000+, automatic when supported
- PyTorch — manual via torch.cuda.graph(); see the sketch after this list
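For custom PyTorch code, the capture/replay pattern looks like this — a minimal sketch following the PyTorch documentation's recipe (warm up on a side stream, then capture into static buffers):
import torch

model = torch.nn.Linear(4096, 4096, dtype=torch.float16, device="cuda")
static_in = torch.randn(1, 4096, dtype=torch.float16, device="cuda")

s = torch.cuda.Stream()                       # warm-up on a side stream
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_out = model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):                     # capture one forward pass
    static_out = model(static_in)

static_in.copy_(torch.randn_like(static_in))  # update inputs in place...
g.replay()                                    # ...and relaunch everything with one call
torch.cuda.synchronize()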
vLLM explicit setting
CUDA graphs are on by default. The --enforce-eager switch disables them:
vllm serve <model> --enforce-eager # runs eager — useful for debugging, painful for production
Typical impact: 5-15% lower per-token latency at batch size 1, 2-5% at large batch sizes. Free win.
Tensor Parallelism, Pipeline Parallelism, NVLink {#parallelism}
For multi-GPU inference, the choice of parallelism strategy is bigger than any kernel-level tuning.
The three strategies
- Tensor Parallelism (TP) — split each matmul across GPUs. Communication: AllReduce per layer. Latency-friendly. Used by vLLM, TensorRT-LLM, DeepSpeed-Inference.
- Pipeline Parallelism (PP) — split the model by layer ranges. Communication: activations between stages. Throughput-friendly for batched workloads, terrible for batch-size-1 latency. Used by llama.cpp, Ollama by default.
- Expert Parallelism (EP) — only for MoE models like Mixtral. Different experts on different GPUs.
When to pick which
| Setup | Best Strategy |
|---|---|
| Single user, latency matters | TP=N (with NVLink if available) |
| Batched server, throughput matters | TP=2 + PP if needed |
| MoE model | EP across experts |
| Mixed VRAM (24GB + 16GB) | PP with manual layer split |
vLLM tensor parallel
# 2x RTX 4090
vllm serve meta-llama/Llama-3.1-70B-Instruct-AWQ \
--tensor-parallel-size 2 \
--quantization awq \
--max-model-len 32768
NVLink vs PCIe
| Bus | Bandwidth (bidir) | Available On |
|---|---|---|
| PCIe 4.0 x16 | ~32 GB/s | All modern PCs |
| PCIe 5.0 x16 | ~64 GB/s | Z790/X670, RTX 50-series |
| NVLink 3 (consumer) | ~112 GB/s | RTX 3090, RTX 3090 Ti |
| NVLink 4 | ~900 GB/s | H100 SXM, B100 |
RTX 4090 and 5090 do not support NVLink. For consumer 70B inference on 2 GPUs, your only options are 2x 3090 with NVLink, or 2x 4090/5090 over PCIe.
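To see why the interconnect matters for TP, a rough per-token communication estimate for TP=2 on a 70B model — the per-all-reduce latencies here are illustrative assumptions, not measurements:
# Two all-reduces per layer (after attention and after the MLP)
hidden, layers = 8192, 80
payload = hidden * 2                         # fp16 activation bytes per all-reduce
n_ar = layers * 2
for name, bw, lat_us in [("PCIe 4.0 x16", 32e9, 15), ("NVLink 3", 112e9, 5)]:
    ms = n_ar * (payload / bw * 1e3 + lat_us / 1e3)
    print(f"{name}: ~{ms:.1f} ms of communication per token")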
NCCL tuning for multi-GPU
export NCCL_P2P_LEVEL=NVL # require NVLink path if available
export NCCL_DEBUG=WARN
export NCCL_IB_DISABLE=1 # disable InfiniBand on workstations
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_NET_GDR_LEVEL=PHB # GPU Direct RDMA when applicable
For 2 GPUs in one box, defaults are usually fine; the above matters more on 4+ GPU rigs.
Speculative Decoding & Medusa {#speculative}
Speculative decoding uses a small "draft" model to guess several tokens, then has the big model verify them in a single forward pass. Net effect: fewer big-model forward passes per generated token.
Methods
| Method | Speedup | Quality | Notes |
|---|---|---|---|
| Vanilla speculative decoding | 1.5-2.5x | Identical to target | Need a small fast model with same vocab |
| n-gram (prompt lookup) | 1.2-1.6x | Identical | No draft model required |
| Medusa heads | 2.0-3.0x | Near-identical | Train extra heads on the target model |
| EAGLE / EAGLE-2 | 2.5-3.5x | Identical | More complex training |
| Lookahead decoding | 1.4-2.2x | Identical | No draft model |
llama.cpp speculative decoding
./llama-speculative -m large.gguf \
--model-draft small.gguf \
-ngl 999 --draft-max 8 -p "..."
Pair models with the same tokenizer (e.g., Llama 3.1 70B target + Llama 3.2 1B draft).
vLLM speculative decoding
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5
Acceptance rate matters
Speculative decoding is only fast if the draft model agrees with the target. Acceptance rates below 60% can be slower than no speculation. Measure with vLLM's --collect-detailed-traces.
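To see why, a quick expected-value calculation under a toy model where each drafted token is accepted independently with probability a (the standard geometric argument, not vLLM's estimator):
# Expected tokens emitted per target-model pass, with draft length k
def expected_tokens(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.5, 0.6, 0.8, 0.9):
    print(f"acceptance {a:.0%}: {expected_tokens(a, 5):.2f} tokens/pass")
# Below ~60% acceptance the gain rarely covers the draft model's own cost.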
Continuous Batching {#batching}
If you serve more than one user, this one feature is the biggest throughput win available.
Static batching waits for a batch to fill, then runs all sequences to completion together. Continuous batching swaps in new requests at every decoding step, so the GPU is never idle.
vLLM, TGI (HuggingFace), TensorRT-LLM, and SGLang all implement continuous batching with various names ("iteration-level scheduling," "in-flight batching"). llama.cpp / Ollama do not — they are intended for single-user desktop use.
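A toy simulation makes the difference concrete — this is an illustration of iteration-level scheduling, not any framework's actual scheduler:
import random
random.seed(0)
jobs = [random.randint(8, 64) for _ in range(32)]    # decode lengths in tokens

def static_batching(jobs, slots=4):
    # each batch runs until its longest member finishes
    return sum(max(jobs[i:i + slots]) for i in range(0, len(jobs), slots))

def continuous_batching(jobs, slots=4):
    pending, active, steps = list(jobs), [], 0
    while pending or active:
        while pending and len(active) < slots:       # refill free slots every step
            active.append(pending.pop())
        active = [t - 1 for t in active if t > 1]    # finished jobs leave immediately
        steps += 1
    return steps

print("static:", static_batching(jobs), "steps; continuous:", continuous_batching(jobs), "steps")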
Aggregate throughput on RTX 4090 with Llama 3.1 8B BF16:
| Concurrency | Ollama (tok/s, sum) | vLLM (tok/s, sum) |
|---|---|---|
| 1 | 138 | 132 |
| 4 | 142 | 480 |
| 16 | 145 | 1,150 |
| 32 | 145 | 1,720 |
| 64 | 145 | 2,200 |
Single-user, llama.cpp wins by a hair. Multi-user, vLLM is 15x faster. Pick the right tool.
Power, Clock, and Thermal Tuning {#power-tuning}
NVIDIA consumer GPUs ship with aggressive boost behavior that thermal-throttles in long inference runs and inflates noise. Capping power gives you most of the speed at much lower noise and heat.
Power limit (Linux)
sudo nvidia-smi -pl 350 # cap RTX 4090 at 350W (stock 450W)
sudo nvidia-smi -pl 280 # cap RTX 3090 at 280W (stock 350W)
Set persistent at boot:
# /etc/systemd/system/nvidia-power-limit.service
[Unit]
Description=NVIDIA GPU power limit
After=nvidia-persistenced.service
[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 350
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
Lock clocks for predictable latency
sudo nvidia-smi -lgc 1500,2520 # lock graphics clock between 1500 and 2520 MHz
sudo nvidia-smi -lmc 10501 # lock memory clock at 10501 MHz (4090 stock)
Locked clocks eliminate the 30-50ms jitter from boost transitions, which matters for low-latency agent loops.
Undervolting
On Linux: nvidia-smi can lock voltage indirectly via -lgc upper bound. On Windows: MSI Afterburner / NVIDIA App curve editor — drop the curve by 50-80mV and lower the clock ceiling 100-150 MHz; test with sustained llama-bench for an hour.
Typical undervolt: RTX 4090 at 0.9V / 2520 MHz delivers ~99% of stock performance at ~340W instead of ~450W.
Fan curve
Default fan curves are conservative. For long inference runs, set fans to ramp earlier:
# nvidia-settings (Linux, X required)
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
-a "[fan:0]/GPUTargetFanSpeed=70"
MIG, MPS, and Multi-Tenant Isolation {#mig-mps}
If multiple processes need to share a GPU without one starving the other:
- MPS (Multi-Process Service) — multiplexes CUDA contexts on a single GPU. Works on all NVIDIA GPUs since Volta. Latency-friendly, no isolation.
- MIG (Multi-Instance GPU) — partitions a GPU into hardware-isolated slices. A100, H100, H200 only — not consumer.
Enabling MPS (Linux)
# Per user, before launching CUDA processes
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d # start MPS daemon
MPS is useful if you run, e.g., Ollama for chat and an embedding service simultaneously on one GPU — without MPS they serialize CUDA contexts and steal latency from each other.
MIG on H100
sudo nvidia-smi -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C # seven 1g.10gb instances
For local-LLM hobbyists this is rarely relevant. For shared lab GPUs it is essential.
Framework-Specific Tuning {#framework-tuning}
Ollama
# Modelfile
FROM llama3.1:70b-instruct-q4_K_M
PARAMETER num_gpu 80
PARAMETER num_ctx 8192
PARAMETER num_batch 512
PARAMETER num_thread 8
PARAMETER flash_attn true
PARAMETER use_mmap true
Useful environment variables:
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_KEEP_ALIVE=24h
OLLAMA_KEEP_ALIVE matters — by default Ollama unloads the model after 5 minutes, and reloading costs 5-30 seconds.
llama.cpp
./llama-server \
-m model.gguf \
-ngl 999 \
-c 8192 \
-b 2048 -ub 512 \
-fa \
--cache-type-k q8_0 --cache-type-v q8_0 \
--no-mmap \
--threads 8 \
--tensor-split 24,24
-b 2048 -ub 512 controls logical and physical batch sizes for prompt processing — higher -b is faster on prompt eval, higher -ub uses more VRAM.
vLLM
vllm serve meta-llama/Llama-3.1-70B-Instruct-AWQ \
--quantization awq \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 8192
--enable-chunked-prefill interleaves long prompt prefill with decode steps so a 32K-token prompt does not block other requests.
TensorRT-LLM
Build the engine ahead of time, then serve. Engines are GPU-architecture-specific.
# Build (Llama 3.1 70B with AWQ-INT4, TP=2)
trtllm-build \
--checkpoint_dir ./Llama3.1-70B-awq \
--output_dir ./engines/llama3.1-70b-awq-tp2 \
--gemm_plugin auto \
--gpt_attention_plugin auto \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable \
--max_input_len 32768 \
--max_seq_len 33792 \
--max_batch_size 16 \
--tp_size 2
# Serve via Triton or trtllm-serve
trtllm-serve ./engines/llama3.1-70b-awq-tp2 --port 8000
ExLlamaV2
Best in class for single-GPU INT4 inference on Ampere/Ada with 24GB-class cards. Use exllamav2_HF loader in text-generation-webui, or the standalone server. EXL2 quantization (variable bit allocation) frequently beats AWQ on quality at the same size.
Profiling: Nsight Systems, Nsight Compute, nvidia-smi dmon {#profiling}
You cannot optimize what you do not measure.
Quick health check
nvidia-smi dmon -s pucvmet -d 1
Watch for:
- sm (SM utilization) — should be 80-99% during decode. If it sits below 50%, you are CPU-bound or PCIe-bound.
- mem (memory-bandwidth utilization) — for LLM decode this is usually the bottleneck; expect 70-90% of theoretical.
- pwr — should sit just under your power limit.
- tmp — under 80°C ideally; 85°C+ means thermal throttling is imminent.
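For scripted monitoring, the same counters are available from Python via NVML — this assumes the nvidia-ml-py package (pip install nvidia-ml-py):
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(h)    # .gpu ~ sm, .memory ~ mem
temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
watts = pynvml.nvmlDeviceGetPowerUsage(h) / 1000  # reported in milliwatts
print(f"SM {util.gpu}%  mem {util.memory}%  {temp}°C  {watts:.0f} W")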
Nsight Systems (timeline)
nsys profile -o llm-trace --stats=true \
python -c "..."
Open in Nsight Systems UI. Look for gaps between kernels (CPU bottleneck) and unusually long kernels (memory-bound).
Nsight Compute (kernel-level)
ncu --set full -o kernel-report ./llama-cli ...
Heavy hammer; use only when you suspect a specific kernel is slow.
Framework-native profilers
- vLLM: --collect-detailed-traces all writes per-request trace JSON.
- PyTorch: torch.profiler with the tensorboard_trace_handler.
- llama.cpp: llama-bench -m model.gguf -ngl 999 -p 512 -n 128 for repeatable throughput numbers.
Common Mistakes That Silently Kill Performance {#mistakes}
- Wrong PCIe slot — second NVMe or chipset PCIe slots often run x4 or x8. Verify with nvidia-smi --query-gpu=pcie.link.width.current,pcie.link.gen.current --format=csv.
- Resizable BAR off — enable in BIOS. Improves PCIe transfers significantly.
- Background processes on the GPU — browser hardware acceleration, Discord overlay, OBS. Each steals 100-500 MB VRAM and a few percent throughput.
- PCIe ASPM (power saving) — disable in BIOS for inference servers. Wakeup latency adds jitter.
- malloc / pageable host memory — for tools that copy weights via host, pinned memory is 2-3x faster (see the sketch after this list). Most frameworks handle this; custom code often does not.
- Wrong GGUF quant — IQ-quants are higher quality at the same size but slower; on Ada/Blackwell prefer K-quants for raw speed.
- OLLAMA_KEEP_ALIVE default of 5 min — model unload + reload kills latency-sensitive workflows. Set to 24h.
- Persistence mode off — context creation latency at every CUDA process start.
- Mixing CUDA toolkit versions — match PyTorch's expected CUDA runtime to your driver.
- Running a BF16 model in FP32 — happens silently when frameworks fall back. Always verify with a profile or memory footprint check.
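A quick way to measure the pinned-memory gap on your own box — sizes are arbitrary, and this needs ~1 GB of free host RAM plus 512 MB of VRAM:
import time
import torch

n = 1 << 28                                   # 256M fp16 elements = 512 MB
pageable = torch.empty(n, dtype=torch.float16)
pinned = torch.empty(n, dtype=torch.float16, pin_memory=True)
dst = torch.empty(n, dtype=torch.float16, device="cuda")

def bench(src):
    torch.cuda.synchronize()
    t = time.perf_counter()
    dst.copy_(src, non_blocking=True)         # host-to-device copy
    torch.cuda.synchronize()
    return (time.perf_counter() - t) * 1e3

print(f"pageable: {bench(pageable):.1f} ms   pinned: {bench(pinned):.1f} ms")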
Reference Configs by GPU {#reference-configs}
RTX 3090 / 3090 Ti (24 GB, Ampere)
- Best target models: 8B FP16, 14B AWQ, 32B Q5_K_M, 70B partial offload.
- Power limit: 280-300W.
- Use llama.cpp with FA2 + Q8 KV cache for 32B models.
- Two cards with NVLink: best price-to-performance for 70B at home.
RTX 4070 Ti Super / 4080 Super (16 GB, Ada)
- Best target models: 8B BF16, 14B AWQ, 32B Q4_K_M with offload.
- Power limit: 250W.
- Enable FP8 KV cache in vLLM for long context.
RTX 4090 (24 GB, Ada)
- Best target models: 8B BF16, 32B AWQ, 70B partial offload (Q4_K_M).
- Power limit: 350-380W is the sweet spot.
- vLLM AWQ + FP8 KV cache for max throughput.
RTX 5090 (32 GB, Blackwell)
- Best target models: 8B BF16, 32B BF16; for 70B, AWQ-INT4 (~36 GB) still exceeds 32 GB — use a ~3-bit GGUF or partial offload.
- FA3 + FP8 deliver the largest gen-over-gen jump in years.
- Power limit: 500-550W.
- Lock memory clocks; GDDR7 boost can be jittery on early drivers.
Dual-GPU rigs
- 2x 3090 NVLink: still the best $/perf for 70B in 2026.
- 2x 4090 PCIe: faster than 2x 3090 in TP=2 despite no NVLink, but ~2x cost.
- 2x 5090 PCIe 5.0: fastest consumer 70B inference; FP8 + FA3 is transformative.
Sources & further reading: llama.cpp performance docs | vLLM documentation | TensorRT-LLM repo | FlashAttention paper | PagedAttention / vLLM paper | NVIDIA CUDA Best Practices Guide | Internal benchmarks on RTX 3090, 4090, 5090, and H100.