TensorRT-LLM Setup Guide (2026): Engine Build, FP8, INT4-AWQ, Triton
TensorRT-LLM is NVIDIA's purpose-built LLM inference compiler. It takes a Hugging Face checkpoint, fuses kernels, applies FP8 / INT4 quantization with CUDA graphs and in-flight batching, and emits a hardware-specific engine that delivers the lowest single-stream latency you can get on an NVIDIA GPU. Used right, it produces 20-40% lower per-token latency than vLLM at batch size 1 — at the cost of a 10-90 minute engine build per (model, GPU, dtype) combination.
This guide is the complete practitioner reference: NGC container install, checkpoint conversion, engine build flags, FP8 vs INT4-AWQ recipes, in-flight batching configuration, Triton integration, LoRA adapters, long-context tuning, multi-GPU deploys, and benchmarks vs vLLM and Ollama on the same hardware.
Table of Contents
- What TensorRT-LLM Is
- TensorRT-LLM vs vLLM vs SGLang vs Ollama
- Hardware & Software Requirements
- Installation: NGC Container, pip, From Source
- The Build Pipeline (Convert → Build → Serve)
- Your First Engine: Llama 3.1 8B FP8
- Quantization Recipes: FP8, INT4-AWQ, INT8
- Tensor Parallel & Pipeline Parallel
- In-Flight Batching
- Paged KV Cache & FP8 KV Cache
- Long Context (32K-128K)
- LoRA Adapters at Runtime
- Speculative Decoding (Medusa, EAGLE, Lookahead)
- Serving with trtllm-serve
- Triton Inference Server Backend
- Kubernetes Deployment
- Observability & Metrics
- Benchmarks vs vLLM and Ollama
- Tuning Recipes by GPU
- Common Errors & Fixes
- FAQ
What TensorRT-LLM Is {#what-it-is}
TensorRT-LLM is a Python library + CUDA runtime built on TensorRT. It does three things vLLM does not:
- Ahead-of-time compilation — your model is compiled into a binary engine specific to one GPU and one dtype combination. Kernels are autotuned at build time.
- Hand-tuned fused kernels — NVIDIA engineers maintain attention, MoE, and quant-dequant kernels written in CUDA C++ for each architecture.
- First-class FP8 + Transformer Engine — Hopper / Ada / Blackwell FP8 support is the most mature here.
Trade-off: less flexibility. You cannot swap models or change quant at runtime. Engines are not portable across GPUs.
The library exposes Python and C++ APIs, the trtllm-build CLI, the trtllm-serve server (since 0.9.0), and a Triton backend for production.
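For orientation, here is a minimal sketch of the Python LLM API against an already-built engine. The engine path is a placeholder, and the exact output object layout can vary slightly between releases; the build steps that produce such an engine are covered below.
# Minimal sketch: load a pre-built engine and generate (engine path is a placeholder)
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="./engines/llama-3.1-8b-fp8")             # engine dir produced by trtllm-build
params = SamplingParams(max_tokens=64, temperature=0.7)   # per-request sampling settings
outputs = llm.generate(["Explain FP8 quantization in one sentence."], sampling_params=params)
print(outputs[0].outputs[0].text)                         # output layout follows recent releases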
TensorRT-LLM vs vLLM vs SGLang vs Ollama {#comparison}
| Property | Ollama | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|---|
| Setup time | 60s | 5min | 30-90 min (engine build) | 5min |
| Single-stream latency | OK | Good | Best | Excellent |
| Aggregate throughput | Low | Excellent | Excellent | Excellent |
| FP8 (Ada/Hopper) | ❌ | ✅ | ✅ best | ✅ |
| INT4-AWQ | ❌ | ✅ | ✅ | ✅ |
| Pipeline parallel | basic | ✅ | ✅ | ✅ |
| Tensor parallel | ❌ | ✅ | ✅ | ✅ |
| MoE optimization | basic | good | excellent | excellent |
| Custom kernels | n/a | Python | C++ / CUDA | Python + Triton |
| Engine portability | weights only | weights only | GPU+dtype-specific | weights only |
| Best for | Desktop | Production servers | Lowest-latency prod | Agent frameworks |
Decision: if a 30-minute build per model is acceptable and latency is the KPI, TensorRT-LLM. If you iterate models often or need broad coverage, vLLM. For agent workflows with structured generation, SGLang.
Hardware & Software Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| GPU | CC 7.0+ | Ada (RTX 40), Hopper (H100), Blackwell |
| VRAM | 12 GB (8B BF16) | 24 GB+ (partial 70B FP8); 48 GB+ (full 70B AWQ) |
| Driver | 535+ | 555+ (FP8) |
| CUDA | 12.4 | 12.5+ |
| Python | 3.10 | 3.10-3.12 |
| OS | Linux only (Ubuntu 22.04+) | Ubuntu 22.04 LTS |
| RAM | 32 GB | 64 GB+ for 70B builds |
| Disk | 200 GB | NVMe; engines + caches grow fast |
Windows native is not supported. WSL2 works for inference; engine builds under WSL2 are slow due to filesystem overhead.
ROCm / AMD: not supported. Use vLLM-ROCm instead.
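A quick sanity check before installing anything (the compute_cap query field requires a reasonably recent driver):
# Verify GPU model, driver, and compute capability before choosing a quantization recipe
nvidia-smi --query-gpu=name,driver_version,compute_cap,memory.total --format=csv
# FP8 needs compute capability 8.9 (Ada) or 9.0 (Hopper); Ampere (8.0/8.6) should use INT4-AWQ or INT8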
Installation: NGC Container, pip, From Source {#installation}
NGC Container (recommended)
docker pull nvcr.io/nvidia/tensorrt-llm/release:0.16.0
docker run --rm -it --gpus all --ipc=host \
-v $(pwd):/workspace \
-v ~/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/tensorrt-llm/release:0.16.0
The NGC image bundles CUDA, TensorRT, TRT-LLM, and dependencies pinned to known-good versions.
pip (advanced)
python3.10 -m venv ~/venvs/trtllm
source ~/venvs/trtllm/bin/activate
pip install --upgrade pip
pip install tensorrt-llm==0.16.0 --extra-index-url https://pypi.nvidia.com
This works on Ubuntu 22.04 with CUDA 12.5. Mismatched CUDA versions are the #1 source of installation pain — prefer the container.
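After the pip install, a one-line import check catches most CUDA/TensorRT mismatches early:
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"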
From source (custom kernels, latest features)
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
docker build -t trtllm:custom -f docker/Dockerfile.dev .
Source builds take 30-60 minutes and require ~50 GB disk for build artifacts.
The Build Pipeline (Convert → Build → Serve) {#build-pipeline}
Hugging Face checkpoint
│
▼
[convert_checkpoint.py] # model-specific
│
▼
TRT-LLM checkpoint format (separated weights + config)
│
▼
[trtllm-build] # the compiler
│
▼
Engine files (.engine + config.json)
│
▼
[trtllm-serve | Triton]
│
▼
HTTP/gRPC API
Every model family has its own convert_checkpoint.py under examples/<family>/ in the TRT-LLM repo (llama, qwen, gemma, deepseek, mixtral, etc.).
Your First Engine: Llama 3.1 8B FP8 {#first-engine}
# Inside the NGC container
cd /workspace
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama
# Convert HF checkpoint to TRT-LLM format
python convert_checkpoint.py \
--model_dir meta-llama/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_8b_fp8 \
--dtype bfloat16 \
--use_fp8 \
--calib_dataset cnn_dailymail \
--calib_size 512
# Build engine
trtllm-build \
--checkpoint_dir ./tllm_checkpoint_8b_fp8 \
--output_dir ./engines/llama-3.1-8b-fp8 \
--gpt_attention_plugin auto \
--gemm_plugin auto \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable \
--max_input_len 16384 \
--max_seq_len 16384 \
--max_batch_size 64 \
--max_num_tokens 16384
# Serve (OpenAI-compatible)
trtllm-serve ./engines/llama-3.1-8b-fp8 \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 --port 8000
Build time on RTX 4090: ~10 minutes. Test with the same OpenAI-compatible client used for vLLM.
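For example, with the official openai Python package; the model field is largely informational for a single-engine trtllm-serve instance, so treat the name below as a placeholder:
# Query the trtllm-serve OpenAI-compatible endpoint; model name is a placeholder
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="llama-3.1-8b-fp8",
    messages=[{"role": "user", "content": "Summarize FP8 quantization in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)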
Quantization Recipes: FP8, INT4-AWQ, INT8 {#quantization}
FP8 (E4M3) — Ada / Hopper / Blackwell
python convert_checkpoint.py \
--model_dir meta-llama/Llama-3.1-70B-Instruct \
--output_dir ./tllm_70b_fp8 \
--dtype bfloat16 \
--use_fp8 \
--tp_size 2 \
--calib_dataset cnn_dailymail --calib_size 512
trtllm-build \
--checkpoint_dir ./tllm_70b_fp8 \
--output_dir ./engines/llama-3.1-70b-fp8 \
--gpt_attention_plugin auto \
--gemm_plugin auto \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable \
--max_input_len 32768 --max_seq_len 32768 \
--max_batch_size 16 \
--tp_size 2
Calibration uses 512 samples from CNN/DailyMail to compute scaling factors. Larger calibration sets (1024-2048) marginally improve quality on long-context tasks.
INT4-AWQ
python ../quantization/quantize.py \
--model_dir meta-llama/Llama-3.1-70B-Instruct \
--output_dir ./tllm_70b_awq \
--dtype bfloat16 \
--qformat int4_awq \
--awq_block_size 128 \
--calib_size 512 \
--tp_size 2
trtllm-build \
--checkpoint_dir ./tllm_70b_awq \
--output_dir ./engines/llama-3.1-70b-awq \
--gpt_attention_plugin auto \
--gemm_plugin auto \
--use_paged_context_fmha enable \
--max_input_len 16384 --max_seq_len 16384 \
--tp_size 2
INT4-AWQ + FP8 KV cache is the highest-throughput recipe on Ada / Hopper for 70B class models.
INT8 (W8A8 SmoothQuant)
python ../quantization/quantize.py \
--model_dir <model> \
--output_dir ./tllm_w8a8 \
--qformat int8_sq \
--calib_size 512
INT8 is rarely the right choice anymore — INT4-AWQ is smaller and FP8 is faster. Keep for legacy compatibility on Ampere where FP8 is unavailable.
Tensor Parallel & Pipeline Parallel {#parallelism}
# 8x H100 — Llama 3.1 405B FP8 with TP=8
trtllm-build \
--checkpoint_dir ./tllm_405b_fp8 \
--output_dir ./engines/llama-405b-fp8-tp8 \
--tp_size 8 \
--max_seq_len 32768 \
--max_batch_size 8
# 16x H100 across 2 nodes — TP=8 within node, PP=2 across nodes
trtllm-build \
--checkpoint_dir ./tllm_405b_fp8 \
--output_dir ./engines/llama-405b-fp8-tp8-pp2 \
--tp_size 8 --pp_size 2 \
--max_seq_len 32768
NCCL must see all GPUs. Within one node, NVLink (H100 SXM, A100 SXM) is critical. Across nodes, InfiniBand is recommended for TP > 8.
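Before a multi-GPU build, it is worth confirming the interconnect topology NCCL will see:
# Show the GPU-to-GPU link matrix: NV# entries indicate NVLink, PHB/PIX/SYS indicate PCIe paths
nvidia-smi topo -m
# Confirm all GPUs are visible to the process (optional, requires PyTorch)
python -c "import torch; print(torch.cuda.device_count(), 'GPUs visible')"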
In-Flight Batching {#in-flight-batching}
In-Flight Batching (IFB) is TensorRT-LLM's term for continuous batching — request scheduling at iteration granularity, not request granularity. Same goal as vLLM's continuous batching, different implementation.
Enabled by default for any engine built with paged KV cache. Configure via runtime:
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.executor import ExecutorConfig
executor_cfg = ExecutorConfig(
max_batch_size=32,
max_num_tokens=8192,
enable_chunked_context=True,
kv_cache_config={"free_gpu_memory_fraction": 0.9},
)
llm = LLM(model="./engines/llama-3.1-8b-fp8", executor_config=executor_cfg)
enable_chunked_context interleaves long-prompt prefill with decode steps from other requests — same idea as vLLM's chunked prefill.
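A short usage sketch with the llm object configured above; prompts and sampling values are arbitrary, and the output object layout may differ slightly across TRT-LLM versions:
# Submit several prompts at once; the executor schedules them with in-flight batching
params = SamplingParams(max_tokens=256, temperature=0.8, top_p=0.95)
prompts = ["Write a haiku about GPUs.", "Explain paged KV cache in one paragraph."]
for out in llm.generate(prompts, sampling_params=params):
    print(out.outputs[0].text)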
Paged KV Cache & FP8 KV Cache {#kv-cache}
trtllm-build ... \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable
FP8 KV cache halves memory vs BF16 with negligible quality impact. Required for long context on consumer GPUs.
Runtime tuning:
kv_cache_config = {
"free_gpu_memory_fraction": 0.92, # fraction of free VRAM for KV
"enable_block_reuse": True, # prefix caching
"host_cache_size": 8 * 1024**3, # 8 GB CPU offload
}
enable_block_reuse is TensorRT-LLM's prefix caching — same benefit as vLLM's, automatically applied to repeated prompt prefixes.
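To see block reuse pay off, send requests that share a long common prefix such as a system prompt or a document. A hedged sketch using the same LLM API as above, with the kv_cache_config passed through the executor configuration shown earlier:
# Requests sharing the same long prefix reuse cached KV blocks after the first request
system = "You are a support assistant for ACME. Policies: ..."  # imagine several thousand tokens here
questions = ["How do I reset my password?", "What is the refund window?"]
prompts = [f"{system}\n\nUser: {q}\nAssistant:" for q in questions]
outputs = llm.generate(prompts, sampling_params=SamplingParams(max_tokens=128))
# With enable_block_reuse=True, the second prompt's prefill only processes the unshared suffix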
Long Context (32K-128K) {#long-context}
For 128K context on Llama 3.1:
trtllm-build ... \
--max_input_len 131072 \
--max_seq_len 131080 \
--max_batch_size 4 \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable
Long-context KV cache memory is the bottleneck — Llama 3.1 8B at 128K with FP8 KV uses ~24 GB. Lower max_batch_size to fit. RoPE scaling parameters are read from the checkpoint config; for non-standard scaling pass --rope_scaling_factor and --rope_theta.
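A back-of-the-envelope estimator for per-sequence KV cache size helps pick max_batch_size before committing to a long build. The model shape values below are illustrative; read the real ones from the checkpoint's config.json.
# Rough per-sequence KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_element
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=1):  # 1 byte for FP8 KV, 2 for BF16
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
# Example shape (illustrative): 32 layers, 8 KV heads, head dim 128, 128K tokens, FP8 KV
print(f"{kv_cache_bytes(32, 8, 128, 131072, 1) / 1e9:.1f} GB per sequence")
# Multiply by max_batch_size and add engine weights plus activation workspace to estimate total VRAM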
For background on long-context techniques (RoPE scaling, YaRN, NTK-aware), see our Sampling Parameters guide and forthcoming long-context deep dive.
LoRA Adapters at Runtime {#lora}
Build the base engine with LoRA support:
trtllm-build ... \
--lora_plugin auto \
--lora_target_modules attn_q attn_k attn_v attn_dense \
--max_lora_rank 64
Convert and use a LoRA adapter:
python convert_checkpoint.py --lora_path ./my-lora --output_dir ./tllm_lora
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.executor import LoRARequest  # import path may vary by TRT-LLM version
llm = LLM(model="./engines/base", lora_dir="./tllm_lora")
out = llm.generate(["Once upon a time"], sampling_params=SamplingParams(max_tokens=128),
                   lora_request=LoRARequest("my-lora", 1, "./tllm_lora"))
Per-request LoRA swapping takes microseconds: adapter weights stream into a fixed-size LoRA cache. This makes it practical to serve 10-100 fine-tunes from a single base engine.
Speculative Decoding (Medusa, EAGLE, Lookahead) {#speculative}
# Medusa heads
trtllm-build ... \
--speculative_decoding_mode medusa \
--max_draft_len 5 \
--num_medusa_heads 4
# EAGLE
trtllm-build ... \
--speculative_decoding_mode eagle \
--max_draft_len 7
# Lookahead (n-gram, no draft model)
trtllm-build ... \
--speculative_decoding_mode lookahead \
--max_draft_len 5
Expected speedups at batch size 1 (single-stream): Medusa 2.0-2.5x, EAGLE 2.5-3.0x, Lookahead 1.4-1.8x. Background on the methods: see the CUDA Optimization guide.
Serving with trtllm-serve {#trtllm-serve}
trtllm-serve ./engines/llama-3.1-8b-fp8 \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 32 \
--max_num_tokens 8192 \
--kv_cache_free_gpu_memory_fraction 0.9
OpenAI-compatible endpoints: /v1/chat/completions, /v1/completions, /health, /metrics.
Test:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.1-8b", "messages": [{"role":"user","content":"hi"}]}'
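Streaming works through the same endpoint. A hedged sketch with the openai package; the model name is a placeholder, as before:
# Stream tokens as they are generated from trtllm-serve
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="llama-3.1-8b-fp8",
    messages=[{"role": "user", "content": "Write a limerick about KV caches."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)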
Triton Inference Server Backend {#triton}
For production: tritonserver + tensorrtllm_backend.
docker pull nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
# Triton model repo layout
model_repo/
├── ensemble/ # ensemble pipeline (preprocess + tllm + postprocess)
├── preprocessing/
├── postprocessing/
└── tensorrt_llm/
├── 1/
│ └── (engine + config)
└── config.pbtxt
config.pbtxt highlights:
backend: "tensorrtllm"
max_batch_size: 32
parameters: {
key: "gpt_model_path"
value: { string_value: "/models/tensorrt_llm/1" }
}
parameters: {
key: "kv_cache_free_gpu_mem_fraction"
value: { string_value: "0.9" }
}
parameters: {
key: "enable_chunked_context"
value: { string_value: "true" }
}
parameters: {
key: "enable_kv_cache_reuse"
value: { string_value: "true" }
}
Launch:
docker run --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 \
-v $(pwd)/model_repo:/models \
nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 \
tritonserver --model-repository=/models
Triton exposes HTTP (8000), gRPC (8001), and Prometheus metrics (8002).
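A quick smoke test via Triton's generate extension. The field names below follow the reference ensemble from tensorrtllm_backend; adjust them if your preprocessing config.pbtxt uses different input names:
# Send one request to the ensemble model through the HTTP generate endpoint
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is in-flight batching?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'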
Kubernetes Deployment {#kubernetes}
apiVersion: apps/v1
kind: Deployment
metadata: { name: trtllm-llama-8b }
spec:
replicas: 2
selector: { matchLabels: { app: trtllm-llama-8b } }
template:
metadata: { labels: { app: trtllm-llama-8b } }
spec:
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
args: ["tritonserver", "--model-repository=/models"]
ports:
- { name: http, containerPort: 8000 }
- { name: grpc, containerPort: 8001 }
- { name: metrics, containerPort: 8002 }
resources: { limits: { nvidia.com/gpu: "1" } }
readinessProbe: { httpGet: { path: /v2/health/ready, port: http }, initialDelaySeconds: 60 }
volumeMounts:
- { name: models, mountPath: /models }
- { name: shm, mountPath: /dev/shm }
volumes:
- { name: models, persistentVolumeClaim: { claimName: trtllm-models-pvc } }
- { name: shm, emptyDir: { medium: Memory, sizeLimit: 16Gi } }
Use the Prometheus metrics on :8002 for HPA scaling. Engines are GPU-specific — pin pods to matching nodes via nodeSelector: { nvidia.com/gpu.product: NVIDIA-H100-PCIe }.
Observability & Metrics {#observability}
Triton exposes Prometheus metrics on :8002/metrics:
| Metric | Meaning |
|---|---|
| nv_inference_request_success | Successful requests |
| nv_inference_request_failure | Failures |
| nv_inference_queue_duration_us | Time queued |
| nv_inference_compute_input_duration_us | Prefill |
| nv_inference_compute_output_duration_us | Decode |
| nv_gpu_utilization | Per-GPU utilization |
| nv_gpu_memory_used_bytes | VRAM used |
| nv_trt_llm_kv_cache_block_used | KV cache blocks in use |
| nv_trt_llm_active_request_count | Active in-flight requests |
Pair with Grafana dashboard 19656 (community Triton + TRT-LLM dashboard).
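Two example Prometheus queries built from the metrics above; label names may differ slightly by Triton version:
# Requests per second per model, and mean queue time in milliseconds per request
sum(rate(nv_inference_request_success[1m])) by (model)
rate(nv_inference_queue_duration_us[5m]) / rate(nv_inference_request_success[5m]) / 1000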
Benchmarks vs vLLM and Ollama {#benchmarks}
RTX 4090, Llama 3.1 8B FP8, 8K context, 128 output tokens:
| Concurrency | Ollama tok/s | vLLM tok/s | TensorRT-LLM tok/s |
|---|---|---|---|
| 1 | 132 | 142 | 178 |
| 4 | 145 | 480 | 555 |
| 16 | 148 | 1,150 | 1,310 |
| 32 | 148 | 1,720 | 1,940 |
| 64 | 148 | 2,200 | 2,510 |
TensorRT-LLM wins at every concurrency level. The gap is largest at batch size 1 (single-stream latency), typically 25-30% lower than vLLM.
p99 TTFT at 32-concurrency: vLLM 510 ms, TRT-LLM 380 ms. p99 ITL: vLLM 18 ms, TRT-LLM 14 ms.
Tuning Recipes by GPU {#tuning}
RTX 4090 (24 GB Ada)
trtllm-build \
--checkpoint_dir ./tllm_8b_fp8 \
--output_dir ./engines/4090-8b-fp8 \
--gpt_attention_plugin auto --gemm_plugin auto \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable \
--max_input_len 32768 --max_seq_len 32768 \
--max_batch_size 64 --max_num_tokens 16384
RTX 5090 (32 GB Blackwell)
Same as the 4090 with --max_seq_len 65536. FA3 + native FP8 deliver larger gains than on the RTX 4090.
2x RTX 4090 (48 GB total, PCIe)
trtllm-build \
--checkpoint_dir ./tllm_70b_awq_tp2 \
--output_dir ./engines/2x4090-70b-awq \
--tp_size 2 \
--max_input_len 16384 --max_seq_len 16384 \
--max_batch_size 16
H100 SXM 80 GB
# Llama 3.1 70B FP8 single GPU
trtllm-build \
--checkpoint_dir ./tllm_70b_fp8 \
--output_dir ./engines/h100-70b-fp8 \
--max_input_len 32768 --max_seq_len 32768 \
--max_batch_size 64
8x H100 SXM (640 GB)
trtllm-build \
--checkpoint_dir ./tllm_405b_fp8_tp8 \
--output_dir ./engines/8xh100-405b-fp8 \
--tp_size 8 \
--max_input_len 65536 --max_seq_len 65536 \
--max_batch_size 32
Common Errors & Fixes {#troubleshooting}
| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory during build | Build needs 2x model size in RAM | Reduce TP, build on a larger node |
| Engine OOMs at runtime | KV cache too large | Lower --max_seq_len or --max_batch_size |
| unsupported plugin version | Engine built on different TRT-LLM version | Rebuild with current version |
| FP8 build fails | GPU lacks FP8 (Ampere) | Use INT4-AWQ instead |
| Triton container fails to start | shm too small | Add --shm-size=16g |
| Slow first request | CUDA graph warmup | Send 5-10 warmup requests after start |
| tritonserver: KV cache reuse OOM | Prefix cache too large | Lower kv_cache_free_gpu_mem_fraction |
| Engine works on dev box, fails in prod | Different GPU SKU | Engines are GPU-specific; rebuild on prod GPU |
FAQ {#faq}
See answers to common TensorRT-LLM questions below.
Sources: TensorRT-LLM GitHub | TensorRT-LLM docs | Triton Inference Server | tensorrtllm_backend | Internal benchmarks RTX 4090, RTX 5090, H100.