Performance

CUDA Optimization Techniques for Local LLMs: The Complete 2026 Guide

May 1, 2026
28 min read
LocalAimaster Research Team


NVIDIA CUDA gives local LLMs their speed — but most users leave 30-70% of their GPU's performance on the table. This guide is the complete, no-fluff reference to every CUDA optimization that meaningfully accelerates local LLM inference: from the obvious (--n-gpu-layers) to the subtle (CUDA graphs, FP8 calibration, KV-cache quantization, NCCL tuning).

Every technique below is ranked by typical impact, with concrete commands, real benchmarks, and the trade-offs nobody mentions in the README.

Table of Contents

  1. Impact Ranking: What Actually Matters
  2. Foundation: Drivers, CUDA Toolkit, cuDNN, NCCL
  3. Quantization & Precision (FP16, BF16, FP8, INT8, INT4)
  4. GPU Layer Offload: --n-gpu-layers Done Right
  5. FlashAttention 2 & 3
  6. KV-Cache Quantization & PagedAttention
  7. Tensor Cores & Mixed Precision
  8. cuBLAS, cuDNN, and Kernel Selection
  9. CUDA Graphs
  10. Tensor Parallelism, Pipeline Parallelism, NVLink
  11. Speculative Decoding & Medusa
  12. Continuous Batching (vLLM, TGI, TensorRT-LLM)
  13. Power, Clock, and Thermal Tuning
  14. MIG, MPS, and Multi-Tenant Isolation
  15. Framework-Specific Tuning: Ollama, llama.cpp, vLLM, TensorRT-LLM, ExLlamaV2
  16. Profiling: Nsight Systems, Nsight Compute, nvidia-smi dmon
  17. Common Mistakes That Silently Kill Performance
  18. Reference Configs by GPU
  19. FAQ


Impact Ranking: What Actually Matters {#impact-ranking}

Before tuning anything, know where the time goes. Here is how the major optimizations rank by typical impact, using a 70B Q4 inference workload on an RTX 4090 (24GB) as the reference:

| Rank | Optimization | Typical Speedup | Effort | Risk |
|---|---|---|---|---|
| 1 | Fit model entirely in VRAM (right quant + ngl) | 5-10x | Low | None |
| 2 | FlashAttention 2/3 | 1.4-4.0x at long context | Low | None |
| 3 | KV-cache quantization (Q8_0) | 1.1-1.4x + memory savings | Low | Negligible |
| 4 | Switch from FP16 to FP8 / INT8 / INT4 (where supported) | 1.3-2.5x | Medium | Calibration |
| 5 | Continuous batching (vLLM/TGI) — multi-user only | 5-20x aggregate | High | Framework swap |
| 6 | Speculative decoding (Medusa, EAGLE, n-gram) | 1.5-3x | Medium | Quality drift |
| 7 | CUDA graphs | 1.05-1.2x | Low | Framework support |
| 8 | Tensor parallelism with NVLink | 1.4-1.8x for 2 GPUs | Medium | Hardware |
| 9 | Power-limit / undervolt for thermal headroom | 1.0-1.05x sustained | Low | None |
| 10 | cuBLAS / cuDNN version + kernel autotune | 1.02-1.10x | Low | None |

If you do only the top three, you will outrun 90% of casual users. The rest is fine-tuning.


Foundation: Drivers, CUDA Toolkit, cuDNN, NCCL {#foundation}

The fastest kernels in the world cannot save you from a stale driver. Versions matter.

| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| NVIDIA driver | 555.x | 570.x or newer | Required for FP8 on Ada/Blackwell |
| CUDA Toolkit | 12.4 | 12.6+ | Build-time only; runtime uses driver |
| cuDNN | 9.0 | 9.5+ | Fused attention kernels improved |
| NCCL | 2.20 | 2.23+ | Multi-GPU all-reduce performance |
| TensorRT | 10.0 | 10.4+ | For TensorRT-LLM users |
# Verify your stack
nvidia-smi --query-gpu=driver_version,name,vbios_version --format=csv
nvcc --version
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"

Driver Persistence Mode (Linux)

By default the driver tears down state between processes, adding 1-3 seconds of CUDA context creation per inference run. Enable persistence mode for long-running services:

sudo nvidia-smi -pm 1   # Enable persistence (deprecated on newer drivers)
# Modern replacement (driver 470+):
sudo systemctl enable --now nvidia-persistenced

Compute Mode

For dedicated inference servers, set exclusive mode so a single CUDA context owns the GPU and avoids context-switch overhead:

sudo nvidia-smi -c EXCLUSIVE_PROCESS

Set back to default (-c DEFAULT) on workstations where you also game.


Quantization & Precision {#quantization}

Quantization is the single biggest lever — but the right format depends on your GPU generation and framework.

Format compatibility matrix

| Format | RTX 30xx (Ampere) | RTX 40xx (Ada) | RTX 50xx (Blackwell) | H100 (Hopper) | Frameworks |
|---|---|---|---|---|---|
| FP32 | ✅ slow | ✅ slow | ✅ slow | ✅ slow | All |
| FP16 | ✅ | ✅ | ✅ | ✅ | All |
| BF16 | ✅ | ✅ | ✅ | ✅ | All except very old |
| FP8 (E4M3 / E5M2) | ❌ | ✅ | ✅ | ✅ | TensorRT-LLM, vLLM, transformer-engine |
| INT8 (W8A8) | ✅ | ✅ | ✅ | ✅ | TensorRT-LLM, vLLM, llama.cpp (partial) |
| INT4 (W4A16, AWQ, GPTQ) | ✅ | ✅ | ✅ | ✅ | All major |
| GGUF Q4_K_M / Q5_K_M / Q6_K | ✅ | ✅ | ✅ | ✅ | llama.cpp, Ollama, koboldcpp |
| GGUF IQ-quants (IQ2_XS, IQ3_XXS) | ✅ | ✅ | ✅ | ✅ | llama.cpp |

Practical recommendations

  • 8B models on 12-24GB VRAM: FP16 / BF16 is fine; quality is highest, speed is plenty.
  • 14-32B models on 24GB VRAM: Q5_K_M (GGUF) or AWQ-INT4. Sweet spot for quality.
  • 70B models on 24GB VRAM: Q4_K_M (GGUF) at ~42GB total — partial offload required.
  • 70B models on 48GB VRAM (2x 3090, A6000): Q5_K_M or Q4_K_M fully on GPU.
  • 70B models on 80 GB (H100) or dual-GPU rigs (2x 4090 / 2x 5090): FP8 or AWQ-INT4 for max speed.

Why BF16 beats FP16 in 2026

BF16 has the same exponent range as FP32 (8 bits) but fewer mantissa bits (7 vs 23). For LLM inference this is almost always a net win: no overflow at long context, minimal quality difference, same throughput as FP16 on Ampere and newer. PyTorch / vLLM / TensorRT-LLM all default to BF16 for new models.

# vLLM example — explicitly request BF16
from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")
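The range difference is easy to verify from the bit layouts alone. A quick pure-Python check (the max-finite formula is standard IEEE-754-style arithmetic, not framework code):

```python
# Largest finite value of a float with E exponent bits and M mantissa bits:
# (2 - 2**-M) * 2**bias, where bias = 2**(E-1) - 1.
def max_finite(exp_bits: int, man_bits: int) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -man_bits) * 2.0 ** bias

fp16_max = max_finite(5, 10)   # 65504.0 — easy to overflow with large activations
bf16_max = max_finite(8, 7)    # ~3.4e38 — same range as FP32

print(f"FP16 max: {fp16_max}, BF16 max: {bf16_max:.2e}")
```

Any intermediate value above 65504 becomes infinity in FP16; BF16 trades three mantissa bits for FP32's full exponent range, which is why long-context activations survive it.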

FP8 — the 2025-2026 breakthrough

Hopper introduced FP8, Ada brought it to consumer GPUs (RTX 40-series), and Blackwell doubled FP8 throughput again. Two formats:

  • E4M3 — 4 exponent bits, 3 mantissa bits — used for weights and activations.
  • E5M2 — 5 exponent bits, 2 mantissa bits — used for gradients (training only).

For inference you almost always want E4M3 with per-tensor or per-channel scaling. Quality after calibration is typically within 0.5% of BF16 on MMLU/HumanEval, with ~2x throughput.

# vLLM with FP8 KV cache and FP8 weights (Ada+ required)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 32768
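Per-tensor scaling is conceptually simple: find the absolute max, map it onto E4M3's finite range (±448), and keep the scale for dequantization. A toy sketch of just the scaling step — real kernels also round values to actual E4M3 bit patterns, and the function name here is illustrative:

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_per_tensor_scale(xs: list) -> tuple:
    """Toy per-tensor scaling for E4M3: scaled values fit in [-448, 448].
    Calibration quality lives almost entirely in how amax is chosen."""
    amax = max(abs(x) for x in xs)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    scaled = [x / scale for x in xs]   # these get stored in FP8
    return scale, scaled               # dequantize later as q * scale

scale, q = fp8_per_tensor_scale([0.02, -1.5, 3.2, -0.7])
assert all(abs(v) <= E4M3_MAX + 1e-9 for v in q)
```

Per-channel scaling applies the same idea per output channel, which is why it tolerates outlier channels better than a single tensor-wide scale.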

INT4 via AWQ vs GPTQ vs GGUF

  • AWQ (Activation-aware Weight Quantization) — preserves salient weights based on activation magnitude. Best quality at 4-bit. Strongly recommended for vLLM and TensorRT-LLM.
  • GPTQ — older but widely available; group-size 128 is standard.
  • GGUF Q4_K_M / IQ4_XS — llama.cpp's k-quants and i-quants. Q4_K_M is roughly equivalent to GPTQ-128g; IQ-quants squeeze 2-3% more quality at the same bit budget at the cost of slower inference.

In our testing on RTX 4090 with Llama 3.1 70B:

| Format | Size on disk | tok/s | MMLU |
|---|---|---|---|
| FP16 | 140 GB | OOM (offloaded) | 79.1 |
| FP8 (E4M3) | 70 GB | ~22 | 78.9 |
| AWQ-INT4 | 36 GB | ~38 | 77.8 |
| GPTQ-128g | 36 GB | ~34 | 77.5 |
| GGUF Q4_K_M | 42 GB | ~8 (partial offload) | 77.6 |
| GGUF IQ4_XS | 38 GB | ~9 (partial offload) | 77.9 |

For multi-GPU setups (2x 4090, 2x 5090), AWQ-INT4 + vLLM is the highest-throughput option for 70B models.



GPU Layer Offload: --n-gpu-layers Done Right {#gpu-layers}

In llama.cpp / Ollama, this single flag controls how many transformer layers run on the GPU. Get it wrong and you lose 5-10x performance.

Layer counts by model

| Model | Layers | Approximate VRAM at Q4_K_M |
|---|---|---|
| Llama 3.1 8B | 32 | ~5 GB |
| Llama 3.1 / 3.3 70B | 80 | ~42 GB |
| Qwen 2.5 7B | 28 | ~4.5 GB |
| Qwen 2.5 32B | 64 | ~20 GB |
| Qwen 2.5 72B | 80 | ~43 GB |
| Mixtral 8x7B | 32 | ~26 GB |
| Gemma 2 27B | 46 | ~17 GB |

Tuning procedure

# 1. Start with all layers on GPU
./llama-cli -m model.gguf -ngl 999 -c 4096

# 2. If OOM, drop in steps of 4
./llama-cli -m model.gguf -ngl 76 -c 4096   # 70B with 4 layers on CPU
./llama-cli -m model.gguf -ngl 72 -c 4096

# 3. Always keep the output (lm_head) on GPU
./llama-cli -m model.gguf -ngl 72 --override-tensor "output\.weight=CUDA0"

In Ollama

# Modelfile
FROM llama3.1:70b-instruct-q4_K_M
PARAMETER num_gpu 80          # 80 = all 70B layers
PARAMETER num_ctx 8192
PARAMETER num_batch 512

Or interactively at runtime:

ollama run llama3.1:70b
>>> /set parameter num_gpu 80

Why "all layers" beats "almost all layers"

With even one layer on CPU, every generated token requires a full PCIe round-trip. On PCIe 4.0 x16 (~32 GB/s practical) this adds 5-30ms per token depending on layer size — easily 50% of total latency. The KV cache also has to ping-pong between host and device. Always size your quantization to fit fully on GPU if possible.
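The arithmetic behind that claim, with assumed but realistic bandwidth figures (DDR5 at ~60 GB/s sustained versus RTX 4090 GDDR6X at ~1008 GB/s):

```python
# Decode streams each layer's weights once per generated token, so a layer
# living in system RAM is bounded by DDR bandwidth, not GPU bandwidth.
layer_gb = 42 / 80                # 70B Q4_K_M: ~42 GB spread over 80 layers
cpu_ms = layer_gb / 60 * 1000     # assumed ~60 GB/s sustained dual-channel DDR5
gpu_ms = layer_gb / 1008 * 1000   # RTX 4090 GDDR6X: ~1008 GB/s

print(f"CPU-resident layer: {cpu_ms:.1f} ms/token")   # ~8.8 ms
print(f"GPU-resident layer: {gpu_ms:.2f} ms/token")   # ~0.5 ms
```

One CPU-resident layer costs roughly as much per token as seventeen GPU layers, which is why dropping the quantization one notch to fit fully on GPU almost always wins.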

Multi-GPU tensor split

For multi-GPU setups, control distribution explicitly:

# llama.cpp — proportionally split across two 24GB GPUs
./llama-cli -m model.gguf -ngl 999 --tensor-split 24,24

# Asymmetric: 4090 (24GB) + 3090 (24GB) — keep more on the faster card
./llama-cli -m model.gguf -ngl 999 --tensor-split 28,22

FlashAttention 2 & 3 {#flash-attention}

Standard attention has O(N²) memory. FlashAttention restructures the computation to be O(N) memory and 2-4x faster by keeping the softmax in SRAM and tiling the QKV matmul.
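To see why O(N²) memory hurts, count the bytes of the materialized score matrix — the head count and FP16 scores below are illustrative assumptions:

```python
def naive_attn_score_gb(n_ctx: int, n_heads: int = 32, bytes_per: int = 2) -> float:
    """Memory for one layer's N x N attention score matrix across all heads."""
    return n_ctx * n_ctx * n_heads * bytes_per / 1e9

for n in (2048, 8192, 32768):
    print(f"{n:>6} tokens: {naive_attn_score_gb(n):8.2f} GB per layer")
```

At 32K context the naive score matrix alone (~69 GB per layer) dwarfs a 24 GB card; FlashAttention never materializes it, computing the softmax tile-by-tile in SRAM.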

Versions and hardware support

| Version | Best For | Hardware |
|---|---|---|
| FlashAttention 1 | Reference / older Ampere | All CUDA |
| FlashAttention 2 | Most local users | Ampere, Ada, Hopper, Blackwell |
| FlashAttention 3 | Maximum throughput | Hopper (H100), Blackwell (B100, RTX 50-series) |

FA3 adds FP8 support, async warp-specialized kernels, and better tail handling — typically 1.5-2.0x faster than FA2 on Hopper.

Enabling FlashAttention

llama.cpp / Ollama:

# llama.cpp
./llama-cli -m model.gguf -ngl 999 -fa

# Ollama Modelfile
PARAMETER flash_attn true

vLLM:

# Auto-selected; force a specific backend if needed:
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve <model>
# On Hopper/Blackwell, use FlashInfer or FA3:
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve <model>

TensorRT-LLM: built into the engine; no flag needed.

Real benchmarks (RTX 4090, Llama 3.1 8B BF16)

| Context | No FA | FA2 | Speedup |
|---|---|---|---|
| 2K | 138 tok/s | 142 tok/s | 1.03x |
| 8K | 78 tok/s | 121 tok/s | 1.55x |
| 16K | 31 tok/s | 88 tok/s | 2.84x |
| 32K | OOM | 51 tok/s | — |

The longer your context, the bigger the win. For RAG and agent workflows where context routinely exceeds 8K, FlashAttention is mandatory.


KV-Cache Quantization & PagedAttention {#kv-cache}

For autoregressive generation, the KV cache is often the dominant memory cost — a Llama 3.1 70B at 32K context with an FP16 KV cache eats roughly 10 GB on its own, and models without grouped-query attention need several times more. Two complementary techniques:

KV-cache quantization

Quantize the K and V tensors in place. Q8_0 is essentially free quality-wise; Q4 is risky.

# llama.cpp — requires FlashAttention
./llama-cli -m model.gguf -ngl 999 -fa \
    --cache-type-k q8_0 --cache-type-v q8_0
# vLLM — FP8 KV cache (Ada+ required for hardware acceleration)
vllm serve <model> --kv-cache-dtype fp8_e4m3

Memory savings: ~50% from FP16 → 8-bit, ~75% from FP16 → 4-bit. On long-context workloads this directly translates to bigger usable contexts or smaller GPU requirements.
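The KV-cache size formula makes those savings concrete. The dimensions below are assumptions for a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dim 128); exact figures vary by architecture:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per) -> float:
    # 2x for K and V; one vector per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per / 1e9

cfg = (80, 8, 128, 32768)          # hypothetical 70B-class GQA config at 32K
print(f"FP16 KV:  {kv_cache_gb(*cfg, 2):.1f} GB")   # ~10.7 GB
print(f"8-bit KV: {kv_cache_gb(*cfg, 1):.1f} GB")   # ~5.4 GB
```

A model without grouped-query attention (64 KV heads instead of 8) multiplies all of this by eight, which is where the really painful KV footprints come from.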

PagedAttention (vLLM)

vLLM stores the KV cache in fixed-size blocks (default 16 tokens) so that fragmentation drops from ~60-80% to <4%. This is the primary reason vLLM out-throughputs llama.cpp 5-20x on multi-user workloads — you can fit more concurrent requests in the same VRAM.

vllm serve <model> --block-size 16 --gpu-memory-utilization 0.92

Tune --gpu-memory-utilization upward (0.95-0.97) on dedicated inference boxes; leave it at 0.85-0.90 on workstations where you also run other apps.
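The fragmentation numbers fall straight out of the allocation math. A sketch comparing preallocating a full max_model_len per request against 16-token blocks (the request lengths are made up):

```python
import math

def wasted_prealloc(actual_len: int, max_len: int) -> float:
    """Fraction of reserved KV memory never used when reserving max_len upfront."""
    return 1 - actual_len / max_len

def wasted_paged(actual_len: int, block: int = 16) -> float:
    """Waste with block allocation: only the unused tail of the last block."""
    allocated = math.ceil(actual_len / block) * block
    return 1 - actual_len / allocated

print(f"prealloc: {wasted_prealloc(900, 4096):.1%} wasted")   # ~78%
print(f"paged:    {wasted_paged(900):.1%} wasted")            # ~1.3%
```

Every gigabyte reclaimed from fragmentation becomes room for additional concurrent sequences, which is the whole throughput story.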

Prefix caching

Both vLLM and TensorRT-LLM support prefix caching — system prompts and few-shot exemplars are computed once and reused across requests. For agent workloads this is a 10-100x latency win on first-token-time.

vllm serve <model> --enable-prefix-caching

Tensor Cores & Mixed Precision {#tensor-cores}

Tensor Cores are specialized matrix-multiply units. Every CUDA generation since Volta (V100) has them, but the supported types changed:

| Generation | GPUs | Tensor Core Types |
|---|---|---|
| Volta | V100 | FP16 |
| Turing | RTX 20-series | FP16, INT8, INT4 |
| Ampere | RTX 30-series, A100 | FP16, BF16, TF32, INT8, INT4, sparse |
| Hopper | H100, H200 | + FP8 (E4M3, E5M2), Transformer Engine |
| Ada | RTX 40-series, L40S | + FP8 |
| Blackwell | RTX 50-series, B100, B200 | + FP4, microscaling, FA3 native |

Making sure you actually use Tensor Cores

PyTorch:

import torch
torch.backends.cuda.matmul.allow_tf32 = True       # Ampere+
torch.backends.cudnn.allow_tf32 = True
torch.set_float32_matmul_precision("high")          # alias for TF32 on

For pure inference, dtypes BF16 / FP16 / FP8 / INT8 automatically dispatch to Tensor Cores. FP32 does not.

Mixed precision in custom code

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(input_ids)

llama.cpp, Ollama, vLLM, and TensorRT-LLM all already use Tensor Cores correctly when given a compatible dtype.


cuBLAS, cuDNN, and Kernel Selection {#cublas}

These are the math libraries underneath every framework. You usually do not touch them directly, but a few flags matter.

cuBLAS LT and heuristics caching

cuBLAS chooses kernels at runtime via heuristics. Stable workloads (same shapes repeated) benefit from caching:

# Enable cuBLAS LT heuristic cache
export CUBLASLT_LOG_LEVEL=0
export CUBLASLT_HEURISTICS_CACHE_PATH=/tmp/cublaslt-cache

cuDNN benchmark mode (PyTorch)

torch.backends.cudnn.benchmark = True   # autotune for fixed-shape workloads

Use only when input shapes are stable (which is true for inference once context length stabilizes). Setting this on dynamically-shaped training can hurt.

llama.cpp build flags

If you build llama.cpp yourself, build with:

cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_F16=ON \
    -DGGML_CUDA_FORCE_MMQ=ON \
    -DGGML_CUDA_FORCE_CUBLAS=OFF \
    -DCMAKE_CUDA_ARCHITECTURES="89;90;120"
cmake --build build -j

CMAKE_CUDA_ARCHITECTURES matters — 89 = Ada (RTX 40), 90 = Hopper, 120 = Blackwell. Building only for your card avoids fat binaries and gives slightly faster startup. GGML_CUDA_FORCE_MMQ=ON enables custom mat-mul kernels for quantized types that often beat cuBLAS at small batch sizes.


CUDA Graphs {#cuda-graphs}

A CUDA graph captures an entire sequence of kernel launches and replays it with a single CPU-side operation. For inference (where the kernel sequence is mostly the same per token), this eliminates a few microseconds of CPU launch overhead per kernel — small per launch, but a decode step issues hundreds of launches, so the savings add up.
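The aggregate matters more than the per-launch cost. Back-of-envelope arithmetic with assumed figures (a few microseconds per launch, a few hundred launches per decode step):

```python
launch_us = 5             # assumed CPU-side cost per kernel launch
kernels_per_token = 300   # assumed launches per decode step, mid-size model
token_budget_ms = 10      # 100 tok/s decode rate

overhead_ms = launch_us * kernels_per_token / 1000
share = overhead_ms / token_budget_ms
print(f"launch overhead: {overhead_ms} ms ({share:.0%} of the token budget)")
```

Capturing the whole sequence into a graph collapses those hundreds of launches into one replay call, which is where the single-digit-percent latency wins come from.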

Frameworks that use CUDA Graphs

  • TensorRT-LLM — yes, automatic
  • vLLM — yes, on by default; passing --enforce-eager disables them
  • llama.cpp — yes since b3000+, automatic when supported
  • PyTorch — manual via torch.cuda.graph()

vLLM explicit setting

CUDA graphs are on by default in vLLM; --enforce-eager is a switch (it takes no value) that turns them off:

vllm serve <model> --enforce-eager   # disables CUDA graphs — debugging only

Eager mode is useful when debugging kernels, painful for production.

Typical impact: 5-15% lower per-token latency at batch size 1, 2-5% at large batch sizes. Free win.


Tensor Parallelism, Pipeline Parallelism, NVLink {#parallelism}

For multi-GPU inference, the choice of parallelism strategy is bigger than any kernel-level tuning.

The three strategies

  • Tensor Parallelism (TP) — split each matmul across GPUs. Communication: AllReduce per layer. Latency-friendly. Used by vLLM, TensorRT-LLM, DeepSpeed-Inference.
  • Pipeline Parallelism (PP) — split the model by layer ranges. Communication: activations between stages. Throughput-friendly for batched workloads, terrible for batch-size-1 latency. Used by llama.cpp, Ollama by default.
  • Expert Parallelism (EP) — only for MoE models like Mixtral. Different experts on different GPUs.

When to pick which

| Setup | Best Strategy |
|---|---|
| Single user, latency matters | TP=N (with NVLink if available) |
| Batched server, throughput matters | TP=2 + PP if needed |
| MoE model | EP across experts |
| Mixed VRAM (24GB + 16GB) | PP with manual layer split |

vLLM tensor parallel

# 2x RTX 4090
vllm serve meta-llama/Llama-3.1-70B-Instruct-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 32768

Interconnect bandwidth

| Bus | Bandwidth (bidir) | Available On |
|---|---|---|
| PCIe 4.0 x16 | ~32 GB/s | All modern PCs |
| PCIe 5.0 x16 | ~64 GB/s | Z790/X670, RTX 50-series |
| NVLink 3 (consumer) | ~112 GB/s | RTX 3090, RTX 3090 Ti |
| NVLink 4 | ~900 GB/s | H100 SXM, B100 |

RTX 4090 and 5090 do not support NVLink. For consumer 70B inference on 2 GPUs, your only options are 2x 3090 with NVLink, or 2x 4090/5090 over PCIe.

NCCL tuning for multi-GPU

export NCCL_P2P_LEVEL=NVL          # require NVLink path if available
export NCCL_DEBUG=WARN
export NCCL_IB_DISABLE=1            # disable InfiniBand on workstations
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_NET_GDR_LEVEL=PHB       # GPU Direct RDMA when applicable

For 2 GPUs in one box, defaults are usually fine; the above matters more on 4+ GPU rigs.


Speculative Decoding & Medusa {#speculative}

Speculative decoding uses a small "draft" model to guess several tokens, then has the big model verify them in a single forward pass. Net effect: fewer big-model forward passes per generated token.

Methods

| Method | Speedup | Quality | Notes |
|---|---|---|---|
| Vanilla speculative decoding | 1.5-2.5x | Identical to target | Needs a small fast model with the same vocab |
| n-gram (prompt lookup) | 1.2-1.6x | Identical | No draft model required |
| Medusa heads | 2.0-3.0x | Near-identical | Train extra heads on the target model |
| EAGLE / EAGLE-2 | 2.5-3.5x | Identical | More complex training |
| Lookahead decoding | 1.4-2.2x | Identical | No draft model |

llama.cpp speculative decoding

./llama-cli -m large.gguf \
    --model-draft small.gguf \
    -ngl 999 --draft-max 8 -p "..."

Pair models with the same tokenizer (e.g., Llama 3.1 70B target + Llama 3.2 1B draft).

vLLM speculative decoding

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5

Acceptance rate matters

Speculative decoding is only fast if the draft model agrees with the target. Acceptance rates below 60% can be slower than no speculation. Measure with vLLM's --collect-detailed-traces.
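An idealized model shows why the acceptance rate dominates. With draft length k and per-token acceptance probability p, one verification pass yields an expected geometric run of accepted tokens plus one corrected token. The toy formula below ignores verification and batching overhead, so real numbers are lower — especially at low acceptance:

```python
def spec_speedup(p: float, k: int, draft_cost: float = 0.05) -> float:
    """Idealized speculative-decoding speedup.
    p: chance the target accepts each draft token; k: draft length;
    draft_cost: draft forward-pass cost relative to one target pass (assumed)."""
    expected_tokens = (1 - p ** (k + 1)) / (1 - p)  # geometric series sum
    return expected_tokens / (1 + k * draft_cost)

for p in (0.8, 0.6, 0.4):
    print(f"acceptance {p:.0%}: {spec_speedup(p, k=5):.2f}x")
```

The curve flattens hard below ~60% acceptance, and once real verification overhead is added those low-acceptance configurations drop under 1.0x — which is exactly the failure mode the traces reveal.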


Continuous Batching {#batching}

If you serve more than one user, this single feature is the biggest single throughput win available.

Static batching waits for a batch to fill, then runs all sequences to completion together. Continuous batching swaps in new requests at every decoding step, so the GPU is never idle.

vLLM, TGI (HuggingFace), TensorRT-LLM, and SGLang all implement continuous batching under various names ("iteration-level scheduling," "in-flight batching"). llama.cpp's server offers a simpler slot-based variant (--cont-batching); Ollama remains oriented toward single-user desktop use.

Aggregate throughput on RTX 4090 with Llama 3.1 8B BF16:

| Concurrency | Ollama (tok/s, sum) | vLLM (tok/s, sum) |
|---|---|---|
| 1 | 138 | 132 |
| 4 | 142 | 480 |
| 16 | 145 | 1,150 |
| 32 | 145 | 1,720 |
| 64 | 145 | 2,200 |

Single-user, llama.cpp wins by a hair. Multi-user, vLLM is 15x faster. Pick the right tool.
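A toy scheduler model reproduces the shape of that gap. Static batching runs a batch until its longest sequence finishes, so short sequences leave their slots idle; the request lengths below are made up:

```python
# 16 requests with varying output lengths sharing one batch of 16 slots
lengths = [40, 60, 80, 120, 200, 50, 70, 90, 110, 30, 45, 65, 85, 100, 150, 180]

# Static batching: every slot stays occupied for max(lengths) decode steps,
# but only does useful work for its own sequence's length.
static_util = sum(lengths) / (len(lengths) * max(lengths))
print(f"static batching slot utilization: {static_util:.0%}")   # ~46%

# Continuous batching: a finished slot is refilled at the next decode step,
# so utilization approaches 100% whenever a request queue exists.
```

Roughly half the GPU-seconds are wasted in this static example; with real length distributions (and long-tail outliers) the waste is often worse, which is where the order-of-magnitude aggregate gains come from.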


Power, Clock, and Thermal Tuning {#power-tuning}

NVIDIA consumer GPUs ship with aggressive boost behavior that thermal-throttles in long inference runs and inflates noise. Capping power gives you most of the speed at much lower noise and heat.

Power limit (Linux)

sudo nvidia-smi -pl 350                # cap RTX 4090 at 350W (stock 450W)
sudo nvidia-smi -pl 280                # cap RTX 3090 at 280W (stock 350W)

Set persistent at boot:

# /etc/systemd/system/nvidia-power-limit.service
[Unit]
Description=NVIDIA GPU power limit
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 350
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Lock clocks for predictable latency

sudo nvidia-smi -lgc 1500,2520        # lock graphics clock between 1500 and 2520 MHz
sudo nvidia-smi -lmc 10501            # lock memory clock at 10501 MHz (4090 stock)

Locked clocks eliminate the 30-50ms jitter from boost transitions, which matters for low-latency agent loops.

Undervolting

On Linux: nvidia-smi can lock voltage indirectly via -lgc upper bound. On Windows: MSI Afterburner / NVIDIA App curve editor — drop the curve by 50-80mV and lower the clock ceiling 100-150 MHz; test with sustained llama-bench for an hour.

Typical undervolt: RTX 4090 at 0.9V / 2520 MHz delivers ~99% of stock performance at ~340W instead of ~450W.

Fan curve

Default fan curves are conservative. For long inference runs, set fans to ramp earlier:

# nvidia-settings (Linux, X required)
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" \
                -a "[fan:0]/GPUTargetFanSpeed=70"

MIG, MPS, and Multi-Tenant Isolation {#mig-mps}

If multiple processes need to share a GPU without one starving the other:

  • MPS (Multi-Process Service) — multiplexes CUDA contexts on a single GPU. Works on all NVIDIA GPUs since Volta. Latency-friendly, no isolation.
  • MIG (Multi-Instance GPU) — partitions a GPU into hardware-isolated slices. A100, H100, H200 only — not consumer.

Enabling MPS (Linux)

# Per user, before launching CUDA processes
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d   # start MPS daemon

MPS is useful if you run, e.g., Ollama for chat and an embedding service simultaneously on one GPU — without MPS they serialize CUDA contexts and steal latency from each other.

MIG on H100

sudo nvidia-smi -mig 1
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C   # seven 1g.10gb instances

For local-LLM hobbyists this is rarely relevant. For shared lab GPUs it is essential.


Framework-Specific Tuning {#framework-tuning}

Ollama

# Modelfile
FROM llama3.1:70b-instruct-q4_K_M
PARAMETER num_gpu 80
PARAMETER num_ctx 8192
PARAMETER num_batch 512
PARAMETER num_thread 8
PARAMETER flash_attn true
PARAMETER use_mmap true

Useful environment variables:

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_KEEP_ALIVE=24h

OLLAMA_KEEP_ALIVE matters — by default Ollama unloads the model after 5 minutes, and reloading costs 5-30 seconds.

llama.cpp

./llama-server \
    -m model.gguf \
    -ngl 999 \
    -c 8192 \
    -b 2048 -ub 512 \
    -fa \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --no-mmap \
    --threads 8 \
    --tensor-split 24,24

-b 2048 -ub 512 controls logical and physical batch sizes for prompt processing — higher -b is faster on prompt eval, higher -ub uses more VRAM.

vLLM

vllm serve meta-llama/Llama-3.1-70B-Instruct-AWQ \
    --quantization awq \
    --kv-cache-dtype fp8_e4m3 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192

--enable-chunked-prefill interleaves long prompt prefill with decode steps so a 32K-token prompt does not block other requests.

TensorRT-LLM

Build the engine ahead of time, then serve. Engines are GPU-architecture-specific.

# Build (Llama 3.1 70B with AWQ-INT4, TP=2)
trtllm-build \
    --checkpoint_dir ./Llama3.1-70B-awq \
    --output_dir ./engines/llama3.1-70b-awq-tp2 \
    --gemm_plugin auto \
    --gpt_attention_plugin auto \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable \
    --max_input_len 32768 \
    --max_seq_len 33792 \
    --max_batch_size 16 \
    --tp_size 2

# Serve via Triton or trtllm-serve
trtllm-serve ./engines/llama3.1-70b-awq-tp2 --port 8000

ExLlamaV2

Best in class for single-GPU INT4 inference on Ampere/Ada with 24GB-class cards. Use exllamav2_HF loader in text-generation-webui, or the standalone server. EXL2 quantization (variable bit allocation) frequently beats AWQ on quality at the same size.


Profiling: Nsight Systems, Nsight Compute, nvidia-smi dmon {#profiling}

You cannot optimize what you do not measure.

Quick health check

nvidia-smi dmon -s pucvmet -d 1

Watch for:

  • sm (SM utilization) — should be 80-99% during decode. If it sits below 50%, you are CPU-bound or PCIe-bound.
  • mem (memory bandwidth) — for LLM decode this is usually the bottleneck; expect 70-90% of theoretical.
  • pwr — should sit just under your power limit.
  • tmp — under 80°C ideally; 85°C+ means thermal throttling is imminent.
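The memory-bandwidth observation gives you a quick sanity check for any decode number: every generated token streams the active weights once, so bandwidth divided by model size is the ceiling (ignoring KV-cache reads and compute):

```python
def decode_ceiling_toks(model_gb: float, vram_gbps: float) -> float:
    """Upper bound on batch-1 decode speed for a memory-bound model."""
    return vram_gbps / model_gb

# RTX 4090 GDDR6X: ~1008 GB/s theoretical
print(f"8B Q4 (~5 GB):   {decode_ceiling_toks(5, 1008):.0f} tok/s max")
print(f"70B Q4 (~42 GB): {decode_ceiling_toks(42, 1008):.0f} tok/s max")
```

If your measured tok/s sits far below this ceiling while sm utilization is low, suspect the CPU, PCIe, or launch-overhead items rather than the kernels.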

Nsight Systems (timeline)

nsys profile -o llm-trace --stats=true \
    python -c "..."

Open in Nsight Systems UI. Look for gaps between kernels (CPU bottleneck) and unusually long kernels (memory-bound).

Nsight Compute (kernel-level)

ncu --set full -o kernel-report ./llama-cli ...

Heavy hammer; use only when you suspect a specific kernel is slow.

Framework-native profilers

  • vLLM: --collect-detailed-traces all writes per-request trace JSON.
  • PyTorch: torch.profiler with the tensorboard_trace_handler.
  • llama.cpp: llama-bench -m model.gguf -ngl 999 -p 512 -n 128 for repeatable throughput numbers.

Common Mistakes That Silently Kill Performance {#mistakes}

  1. Wrong PCIe slot — second NVMe or chipset PCIe slots often run x4 or x8. Verify with nvidia-smi --query-gpu=pcie.link.width.current,pcie.link.gen.current --format=csv.
  2. Resizable BAR off — enable in BIOS. Improves PCIe transfers significantly.
  3. Background processes on the GPU — browser hardware acceleration, Discord overlay, OBS. Each steals 100-500 MB VRAM and a few percent throughput.
  4. PCIe ASPM (power saving) — disable in BIOS for inference servers. Wakeup latency adds jitter.
  5. malloc / pageable host memory — for tools that copy weights via host, pinned memory is 2-3x faster. Most frameworks handle this; custom code often does not.
  6. Wrong GGUF quant — IQ-quants are higher quality at the same size but slower; on Ada/Blackwell prefer K-quants for raw speed.
  7. OLLAMA_KEEP_ALIVE default of 5 min — model unload + reload kills latency-sensitive workflows. Set to 24h.
  8. Persistent mode off — context creation latency at every CUDA process start.
  9. Mixing CUDA toolkit versions — match PyTorch's expected CUDA runtime to your driver.
  10. Running BF16 model in FP32 — happens silently when frameworks fall back. Always verify with a profile or memory footprint check.

Reference Configs by GPU {#reference-configs}

RTX 3090 / 3090 Ti (24 GB, Ampere)

  • Best target models: 8B FP16, 14B AWQ, 32B Q5_K_M, 70B partial offload.
  • Power limit: 280-300W.
  • Use llama.cpp with FA2 + Q8 KV cache for 32B models.
  • Two cards with NVLink: best price-to-performance for 70B at home.

RTX 4070 Ti Super / 4080 Super (16 GB, Ada)

  • Best target models: 8B BF16, 14B AWQ, 32B Q4_K_M with offload.
  • Power limit: 250W.
  • Enable FP8 KV cache in vLLM for long context.

RTX 4090 (24 GB, Ada)

  • Best target models: 8B BF16, 32B AWQ, 70B partial offload (Q4_K_M).
  • Power limit: 350-380W is the sweet spot.
  • vLLM AWQ + FP8 KV cache for max throughput.

RTX 5090 (32 GB, Blackwell)

  • Best target models: 8B BF16, 32B BF16, 70B in IQ3-class GGUF quants fully on GPU (AWQ-INT4 at ~36 GB is just over 32 GB).
  • FA3 + FP8 deliver the largest gen-over-gen jump in years.
  • Power limit: 500-550W.
  • Lock memory clocks; GDDR7 boost can be jittery on early drivers.

Dual-GPU rigs

  • 2x 3090 NVLink: still the best $/perf for 70B in 2026.
  • 2x 4090 PCIe: faster than 2x 3090 in TP=2 despite no NVLink, but ~2x cost.
  • 2x 5090 PCIe 5.0: fastest consumer 70B inference; FP8 + FA3 is transformative.

FAQ {#faq}

See answers to common CUDA optimization questions below.


Sources & further reading: llama.cpp performance docs | vLLM documentation | TensorRT-LLM repo | FlashAttention paper | PagedAttention / vLLM paper | NVIDIA CUDA Best Practices Guide | Internal benchmarks on RTX 3090, 4090, 5090, and H100.




Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
