
TensorRT-LLM Setup Guide (2026): Engine Build, FP8, INT4-AWQ, Triton

May 1, 2026
30 min read
LocalAimaster Research Team


TensorRT-LLM is NVIDIA's purpose-built LLM inference compiler. It takes a Hugging Face checkpoint, fuses kernels, applies FP8 / INT4 quantization with CUDA graphs and in-flight batching, and emits a hardware-specific engine that delivers the lowest single-stream latency you can get on an NVIDIA GPU. Used right, it produces 20-40% lower per-token latency than vLLM at batch size 1 — at the cost of a 10-90 minute engine build per (model, GPU, dtype) combination.

This guide is the complete practitioner reference: NGC container install, checkpoint conversion, engine build flags, FP8 vs INT4-AWQ recipes, in-flight batching configuration, Triton integration, LoRA adapters, long-context tuning, multi-GPU deploys, and benchmarks vs vLLM and Ollama on the same hardware.

Table of Contents

  1. What TensorRT-LLM Is
  2. TensorRT-LLM vs vLLM vs SGLang vs Ollama
  3. Hardware & Software Requirements
  4. Installation: NGC Container, pip, From Source
  5. The Build Pipeline (Convert → Build → Serve)
  6. Your First Engine: Llama 3.1 8B FP8
  7. Quantization Recipes: FP8, INT4-AWQ, INT8
  8. Tensor Parallel & Pipeline Parallel
  9. In-Flight Batching
  10. Paged KV Cache & FP8 KV Cache
  11. Long Context (32K-128K)
  12. LoRA Adapters at Runtime
  13. Speculative Decoding (Medusa, EAGLE, Lookahead)
  14. Serving with trtllm-serve
  15. Triton Inference Server Backend
  16. Kubernetes Deployment
  17. Observability & Metrics
  18. Benchmarks vs vLLM and Ollama
  19. Tuning Recipes by GPU
  20. Common Errors & Fixes
  21. FAQ


What TensorRT-LLM Is {#what-it-is}

TensorRT-LLM is a Python library + CUDA runtime built on TensorRT. It does three things vLLM does not:

  1. Ahead-of-time compilation — your model is compiled into a binary engine specific to one GPU and one dtype combination. Kernels are autotuned at build time.
  2. Hand-tuned fused kernels — NVIDIA engineers maintain attention, MoE, and quant-dequant kernels written in CUDA C++ for each architecture.
  3. First-class FP8 + Transformer Engine — Hopper / Ada / Blackwell FP8 support is the most mature here.

Trade-off: less flexibility. You cannot swap models or change quant at runtime. Engines are not portable across GPUs.
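Because an engine is pinned to a single (model, GPU, dtype, TRT-LLM version) tuple, it pays to encode all four in the artifact path. A small naming helper (this convention is our own suggestion, not anything TRT-LLM mandates):

```python
def engine_dir(model: str, gpu: str, dtype: str, trtllm_version: str) -> str:
    """Build an engine path that encodes everything the engine is pinned to.

    Engines silently misbehave or fail to load when any of these change,
    so baking them into the directory name prevents stale-build mixups.
    """
    return f"engines/{model}--{gpu}--{dtype}--trtllm{trtllm_version}"

print(engine_dir("llama-3.1-8b", "rtx4090", "fp8", "0.16.0"))
# → engines/llama-3.1-8b--rtx4090--fp8--trtllm0.16.0
```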

The library exposes Python and C++ APIs, the trtllm-build CLI, the trtllm-serve server (since 0.9.0), and a Triton backend for production.


TensorRT-LLM vs vLLM vs SGLang vs Ollama {#comparison}

| Property | Ollama | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|---|
| Setup time | 60s | 5 min | 30-90 min (engine build) | 5 min |
| Single-stream latency | OK | Good | Best | Excellent |
| Aggregate throughput | Low | Excellent | Excellent | Excellent |
| FP8 (Ada/Hopper) | – | ✅ | ✅ best | ✅ |
| INT4-AWQ | – | ✅ | ✅ | ✅ |
| Pipeline parallel | – | basic | ✅ | – |
| Tensor parallel | – | ✅ | ✅ | ✅ |
| MoE optimization | basic | good | excellent | excellent |
| Custom kernels | n/a | Python | C++ / CUDA | Python + Triton |
| Engine portability | weights only | weights only | GPU+dtype-specific | weights only |
| Best for | Desktop | Production servers | Lowest-latency prod | Agent frameworks |

Decision: if a 30-minute build per model is acceptable and latency is the KPI, TensorRT-LLM. If you iterate models often or need broad coverage, vLLM. For agent workflows with structured generation, SGLang.


Hardware & Software Requirements {#requirements}

| Component | Minimum | Recommended |
|---|---|---|
| GPU | CC 7.0+ | Ada (RTX 40), Hopper (H100), Blackwell |
| VRAM | 12 GB (8B BF16) | 24 GB+ for FP8 70B partial; 48 GB+ for 70B AWQ full |
| Driver | 535+ | 555+ (FP8) |
| CUDA | 12.4 | 12.5+ |
| Python | 3.10 | 3.10-3.12 |
| OS | Linux only (Ubuntu 22.04+) | Ubuntu 22.04 LTS |
| RAM | 32 GB | 64 GB+ for 70B builds |
| Disk | 200 GB | NVMe; engines + caches grow fast |

Windows native is not supported. WSL2 works for inference; engine builds under WSL2 are slow due to filesystem overhead.

ROCm / AMD: not supported. Use vLLM-ROCm instead.



Installation: NGC Container, pip, From Source {#installation}

NGC container (recommended)

docker pull nvcr.io/nvidia/tensorrt-llm/release:0.16.0
docker run --rm -it --gpus all --ipc=host \
    -v $(pwd):/workspace \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    nvcr.io/nvidia/tensorrt-llm/release:0.16.0

The NGC image bundles CUDA, TensorRT, TRT-LLM, and dependencies pinned to known-good versions.

pip (advanced)

python3.10 -m venv ~/venvs/trtllm
source ~/venvs/trtllm/bin/activate
pip install --upgrade pip
pip install tensorrt-llm==0.16.0 --extra-index-url https://pypi.nvidia.com

This works on Ubuntu 22.04 with CUDA 12.5. Mismatched CUDA versions are the #1 source of installation pain — prefer the container.

From source (custom kernels, latest features)

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
docker build -t trtllm:custom -f docker/Dockerfile.dev .

Source builds take 30-60 minutes and require ~50 GB disk for build artifacts.


The Build Pipeline (Convert → Build → Serve) {#build-pipeline}

Hugging Face checkpoint
        │
        ▼
[convert_checkpoint.py]    # model-specific
        │
        ▼
TRT-LLM checkpoint format (separated weights + config)
        │
        ▼
[trtllm-build]             # the compiler
        │
        ▼
Engine files (.engine + config.json)
        │
        ▼
[trtllm-serve | Triton]
        │
        ▼
HTTP/gRPC API

Every model family has its own convert_checkpoint.py under examples/<family>/ in the TRT-LLM repo (llama, qwen, gemma, deepseek, mixtral, etc.).


Your First Engine: Llama 3.1 8B FP8 {#first-engine}

# Inside the NGC container
cd /workspace
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama

# Convert HF checkpoint to TRT-LLM format
python convert_checkpoint.py \
    --model_dir meta-llama/Llama-3.1-8B-Instruct \
    --output_dir ./tllm_checkpoint_8b_fp8 \
    --dtype bfloat16 \
    --use_fp8 \
    --calib_dataset cnn_dailymail \
    --calib_size 512

# Build engine
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint_8b_fp8 \
    --output_dir ./engines/llama-3.1-8b-fp8 \
    --gpt_attention_plugin auto \
    --gemm_plugin auto \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable \
    --max_input_len 16384 \
    --max_seq_len 16384 \
    --max_batch_size 64 \
    --max_num_tokens 16384

# Serve (OpenAI-compatible)
trtllm-serve ./engines/llama-3.1-8b-fp8 \
    --tokenizer meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 --port 8000

Build time on RTX 4090: ~10 minutes. Test with the same OpenAI-compatible client used for vLLM.
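The same smoke test from Python, using only the standard library (URL and model name match the serve command above; adjust them for your deployment):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(url: str, payload: dict) -> str:
    """POST the payload and return the first choice's message text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running server:
# print(chat("http://localhost:8000/v1/chat/completions",
#            chat_payload("llama-3.1-8b", "Say hi in five words.")))
```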


Quantization Recipes: FP8, INT4-AWQ, INT8 {#quantization}

FP8 (E4M3) — Ada / Hopper / Blackwell

python convert_checkpoint.py \
    --model_dir meta-llama/Llama-3.1-70B-Instruct \
    --output_dir ./tllm_70b_fp8 \
    --dtype bfloat16 \
    --use_fp8 \
    --tp_size 2 \
    --calib_dataset cnn_dailymail --calib_size 512

trtllm-build \
    --checkpoint_dir ./tllm_70b_fp8 \
    --output_dir ./engines/llama-3.1-70b-fp8 \
    --gpt_attention_plugin auto \
    --gemm_plugin auto \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable \
    --max_input_len 32768 --max_seq_len 32768 \
    --max_batch_size 16 \
    --tp_size 2

Calibration uses 512 samples from CNN/DailyMail to compute scaling factors. Larger calibration sets (1024-2048) marginally improve quality on long-context tasks.

INT4-AWQ

python ../quantization/quantize.py \
    --model_dir meta-llama/Llama-3.1-70B-Instruct \
    --output_dir ./tllm_70b_awq \
    --dtype bfloat16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --calib_size 512 \
    --tp_size 2

trtllm-build \
    --checkpoint_dir ./tllm_70b_awq \
    --output_dir ./engines/llama-3.1-70b-awq \
    --gpt_attention_plugin auto \
    --gemm_plugin auto \
    --use_paged_context_fmha enable \
    --max_input_len 16384 --max_seq_len 16384 \
    --tp_size 2

INT4-AWQ + FP8 KV cache is the highest-throughput recipe on Ada / Hopper for 70B class models.

INT8 (W8A8 SmoothQuant)

python ../quantization/quantize.py \
    --model_dir <model> \
    --output_dir ./tllm_w8a8 \
    --qformat int8_sq \
    --calib_size 512

INT8 is rarely the right choice anymore — INT4-AWQ is smaller and FP8 is faster. Keep for legacy compatibility on Ampere where FP8 is unavailable.


Tensor Parallel & Pipeline Parallel {#parallelism}

# 8x H100 — Llama 3.1 405B FP8 with TP=8
trtllm-build \
    --checkpoint_dir ./tllm_405b_fp8 \
    --output_dir ./engines/llama-405b-fp8-tp8 \
    --tp_size 8 \
    --max_seq_len 32768 \
    --max_batch_size 8

# 16x H100 across 2 nodes — TP=8 within node, PP=2 across nodes
trtllm-build \
    --checkpoint_dir ./tllm_405b_fp8 \
    --output_dir ./engines/llama-405b-fp8-tp8-pp2 \
    --tp_size 8 --pp_size 2 \
    --max_seq_len 32768

NCCL must see all GPUs. Within one node, NVLink (H100 SXM, A100 SXM) is critical. Across nodes, InfiniBand is recommended for TP > 8.
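Before committing to a long multi-GPU build, a rough weights-per-GPU estimate catches infeasible TP plans early (back-of-envelope math that ignores activations, KV cache, and runtime overhead):

```python
def weights_per_gpu_gb(params_billion: float, bytes_per_param: float, tp: int) -> float:
    """Approximate per-GPU weight footprint under tensor parallelism.

    Tensor parallelism shards the weight matrices across GPUs, so the
    weight footprint divides roughly evenly by tp_size.
    """
    return params_billion * bytes_per_param / tp

# Llama 3.1 405B in FP8 (1 byte/param) across 8 GPUs:
print(round(weights_per_gpu_gb(405, 1.0, 8), 1))  # → 50.6 GB, fits an 80 GB H100 with room for KV cache
```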


In-Flight Batching {#in-flight-batching}

In-Flight Batching (IFB) is TensorRT-LLM's term for continuous batching — request scheduling at iteration granularity, not request granularity. Same goal as vLLM's continuous batching, different implementation.

Enabled by default for any engine built with paged KV cache. Configure via runtime:

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.executor import ExecutorConfig

executor_cfg = ExecutorConfig(
    max_batch_size=32,
    max_num_tokens=8192,
    enable_chunked_context=True,
    kv_cache_config={"free_gpu_memory_fraction": 0.9},
)

llm = LLM(model="./engines/llama-3.1-8b-fp8", executor_config=executor_cfg)

enable_chunked_context interleaves long-prompt prefill with decode steps from other requests — same idea as vLLM's chunked prefill.
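The scheduling effect is easy to picture: chunked context caps how many prompt tokens a single iteration may consume, so one long prefill is spread over several iterations instead of stalling everyone else's decode. A sketch of the chunk count (our own illustration of the idea, not TRT-LLM internals):

```python
def prefill_iterations(prompt_len: int, max_num_tokens: int) -> int:
    """Scheduler iterations needed to prefill one prompt when each
    iteration can spend at most max_num_tokens on context processing."""
    return -(-prompt_len // max_num_tokens)  # ceiling division

# A 16K prompt under the max_num_tokens=8192 budget configured above:
print(prefill_iterations(16384, 8192))  # → 2
```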


Paged KV Cache & FP8 KV Cache {#kv-cache}

trtllm-build ... \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable

FP8 KV cache halves memory vs BF16 with negligible quality impact. Required for long context on consumer GPUs.

Runtime tuning:

kv_cache_config = {
    "free_gpu_memory_fraction": 0.92,   # fraction of free VRAM for KV
    "enable_block_reuse": True,          # prefix caching
    "host_cache_size": 8 * 1024**3,      # 8 GB CPU offload
}

enable_block_reuse is TensorRT-LLM's prefix caching — same benefit as vLLM's, automatically applied to repeated prompt prefixes.


Long Context (32K-128K) {#long-context}

For 128K context on Llama 3.1:

trtllm-build ... \
    --max_input_len 131072 \
    --max_seq_len 131080 \
    --max_batch_size 4 \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable

Long-context KV cache memory is the bottleneck — Llama 3.1 8B at 128K with FP8 KV uses ~24 GB. Lower max_batch_size to fit. RoPE scaling parameters are read from the checkpoint config; for non-standard scaling pass --rope_scaling_factor and --rope_theta.
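The KV budget follows directly from model shape. For Llama 3.1 8B (32 layers, 8 KV heads, head dim 128, figures from the public model config), FP8 KV works out to about 8 GiB per full 128K sequence, so a few concurrent sequences reach the ~24 GB figure above:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int, batch: int = 1) -> int:
    """KV cache size: K and V tensors, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Llama 3.1 8B, 128K context, FP8 KV (1 byte/element), single sequence:
b = kv_cache_bytes(32, 8, 128, 131072, 1)
print(f"{b / 2**30:.1f} GiB per sequence")  # → 8.0 GiB per sequence
```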

For background on long-context techniques (RoPE scaling, YaRN, NTK-aware), see our Sampling Parameters guide and forthcoming long-context deep dive.


LoRA Adapters at Runtime {#lora}

Build the base engine with LoRA support:

trtllm-build ... \
    --lora_plugin auto \
    --lora_target_modules attn_q attn_k attn_v attn_dense \
    --max_lora_rank 64

Convert and use a LoRA adapter:

python convert_checkpoint.py --lora_path ./my-lora --output_dir ./tllm_lora

Then, in Python:

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.executor import LoRARequest  # import path may differ across versions

llm = LLM(model="./engines/base", lora_dir="./tllm_lora")
out = llm.generate(
    ["Once upon a time"],
    sampling_params=SamplingParams(max_tokens=128),
    lora_request=LoRARequest("my-lora", 1, "./tllm_lora"),
)

Per-request LoRA swapping is microseconds — adapter weights stream into a fixed-size LoRA cache. Practical for serving 10-100 fine-tunes from a single base engine.


Speculative Decoding (Medusa, EAGLE, Lookahead) {#speculative}

# Medusa heads
trtllm-build ... \
    --speculative_decoding_mode medusa \
    --max_draft_len 5 \
    --num_medusa_heads 4

# EAGLE
trtllm-build ... \
    --speculative_decoding_mode eagle \
    --max_draft_len 7

# Lookahead (n-gram, no draft model)
trtllm-build ... \
    --speculative_decoding_mode lookahead \
    --max_draft_len 5

Expected speedups at batch size 1 (single stream): Medusa 2.0-2.5x, EAGLE 2.5-3.0x, Lookahead 1.4-1.8x. Background on the methods: see the CUDA Optimization guide.
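Where those multiples come from: if the target model accepts each draft token independently with probability a, a draft of length k commits on average (1 - a^(k+1)) / (1 - a) tokens per target-model forward pass. This is the textbook simplification, not TRT-LLM's exact acceptance model:

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens committed per target forward pass, assuming each
    draft token is accepted i.i.d. (a geometric-series simplification)."""
    a, k = accept_rate, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

# 70% per-token acceptance with a 5-token draft:
print(round(expected_tokens_per_step(0.70, 5), 2))  # → 2.94
```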


Serving with trtllm-serve {#trtllm-serve}

trtllm-serve ./engines/llama-3.1-8b-fp8 \
    --tokenizer meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 32 \
    --max_num_tokens 8192 \
    --kv_cache_free_gpu_memory_fraction 0.9

OpenAI-compatible endpoints: /v1/chat/completions, /v1/completions, /health, /metrics.

Test:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama-3.1-8b", "messages": [{"role":"user","content":"hi"}]}'

Triton Inference Server Backend {#triton}

For production: tritonserver + tensorrtllm_backend.

docker pull nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3

# Triton model repo layout
model_repo/
├── ensemble/                    # ensemble pipeline (preprocess + tllm + postprocess)
├── preprocessing/
├── postprocessing/
└── tensorrt_llm/
    ├── 1/
    │   └── (engine + config)
    └── config.pbtxt

config.pbtxt highlights:

backend: "tensorrtllm"
max_batch_size: 32

parameters: {
  key: "gpt_model_path"
  value: { string_value: "/models/tensorrt_llm/1" }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.9" }
}
parameters: {
  key: "enable_chunked_context"
  value: { string_value: "true" }
}
parameters: {
  key: "enable_kv_cache_reuse"
  value: { string_value: "true" }
}

Launch:

docker run --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 \
    -v $(pwd)/model_repo:/models \
    nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 \
    tritonserver --model-repository=/models

Triton exposes HTTP (8000), gRPC (8001), and Prometheus metrics (8002).
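Triton's generate extension gives a quick HTTP smoke test against the ensemble. A stdlib-only sketch (the text_input / max_tokens / text_output field names follow the tensorrtllm_backend ensemble's conventions; verify them against your model repo):

```python
import json
import urllib.request

def triton_generate_payload(prompt: str, max_tokens: int = 64) -> dict:
    """Request body for POST /v2/models/ensemble/generate."""
    return {"text_input": prompt, "max_tokens": max_tokens}

def generate(base_url: str, prompt: str) -> str:
    """Call the generate endpoint and return the model's text output."""
    req = urllib.request.Request(
        f"{base_url}/v2/models/ensemble/generate",
        data=json.dumps(triton_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text_output"]

# Requires a running Triton:
# print(generate("http://localhost:8000", "hi"))
```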


Kubernetes Deployment {#kubernetes}

apiVersion: apps/v1
kind: Deployment
metadata: { name: trtllm-llama-8b }
spec:
  replicas: 2
  selector: { matchLabels: { app: trtllm-llama-8b } }
  template:
    metadata: { labels: { app: trtllm-llama-8b } }
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
          args: ["tritonserver", "--model-repository=/models"]
          ports:
            - { name: http, containerPort: 8000 }
            - { name: grpc, containerPort: 8001 }
            - { name: metrics, containerPort: 8002 }
          resources: { limits: { nvidia.com/gpu: "1" } }
          readinessProbe: { httpGet: { path: /v2/health/ready, port: http }, initialDelaySeconds: 60 }
          volumeMounts:
            - { name: models, mountPath: /models }
            - { name: shm, mountPath: /dev/shm }
      volumes:
        - { name: models, persistentVolumeClaim: { claimName: trtllm-models-pvc } }
        - { name: shm, emptyDir: { medium: Memory, sizeLimit: 16Gi } }

Use the Prometheus metrics on :8002 for HPA scaling. Engines are GPU-specific — pin pods to matching nodes via nodeSelector: { nvidia.com/gpu.product: NVIDIA-H100-PCIe }.


Observability & Metrics {#observability}

Triton exposes Prometheus metrics on :8002/metrics:

| Metric | Meaning |
|---|---|
| nv_inference_request_success | Successful requests |
| nv_inference_request_failure | Failures |
| nv_inference_queue_duration_us | Time queued |
| nv_inference_compute_input_duration_us | Prefill |
| nv_inference_compute_output_duration_us | Decode |
| nv_gpu_utilization | Per-GPU utilization |
| nv_gpu_memory_used_bytes | VRAM used |
| nv_trt_llm_kv_cache_block_used | KV cache blocks in use |
| nv_trt_llm_active_request_count | Active in-flight requests |

Pair with Grafana dashboard 19656 (community Triton + TRT-LLM dashboard).


Benchmarks vs vLLM and Ollama {#benchmarks}

RTX 4090, Llama 3.1 8B FP8, 8K context, 128 output tokens:

| Concurrency | Ollama tok/s | vLLM tok/s | TensorRT-LLM tok/s |
|---|---|---|---|
| 1 | 132 | 142 | 178 |
| 4 | 145 | 480 | 555 |
| 16 | 148 | 1,150 | 1,310 |
| 32 | 148 | 1,720 | 1,940 |
| 64 | 148 | 2,200 | 2,510 |

TensorRT-LLM wins at every concurrency level. The gap is most meaningful at batch size 1 (single-stream): 178 vs 142 tok/s, roughly 20% lower per-token latency than vLLM.

p99 TTFT at 32-concurrency: vLLM 510 ms, TRT-LLM 380 ms. p99 ITL: vLLM 18 ms, TRT-LLM 14 ms.
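Converting the single-stream row into latency terms (per-token latency is the reciprocal of tok/s):

```python
def per_token_latency_ms(tok_per_s: float) -> float:
    """Per-token decode latency implied by a throughput measurement."""
    return 1000.0 / tok_per_s

vllm, trt = 142, 178  # batch-size-1 tok/s from the table above
gain = 1 - per_token_latency_ms(trt) / per_token_latency_ms(vllm)
print(f"{per_token_latency_ms(vllm):.2f} ms vs {per_token_latency_ms(trt):.2f} ms, "
      f"{gain:.0%} lower per-token latency")  # → 7.04 ms vs 5.62 ms, 20% lower per-token latency
```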


Tuning Recipes by GPU {#tuning}

RTX 4090 (24 GB Ada)

trtllm-build \
    --checkpoint_dir ./tllm_8b_fp8 \
    --output_dir ./engines/4090-8b-fp8 \
    --gpt_attention_plugin auto --gemm_plugin auto \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable \
    --max_input_len 32768 --max_seq_len 32768 \
    --max_batch_size 64 --max_num_tokens 16384

RTX 5090 (32 GB Blackwell)

Same as the 4090 with --max_seq_len 65536. FA3 and native FP8 deliver larger gains than on the RTX 4090.

2x RTX 4090 (48 GB total, PCIe)

trtllm-build \
    --checkpoint_dir ./tllm_70b_awq_tp2 \
    --output_dir ./engines/2x4090-70b-awq \
    --tp_size 2 \
    --max_input_len 16384 --max_seq_len 16384 \
    --max_batch_size 16

H100 SXM 80 GB

# Llama 3.1 70B FP8 single GPU
trtllm-build \
    --checkpoint_dir ./tllm_70b_fp8 \
    --output_dir ./engines/h100-70b-fp8 \
    --max_input_len 32768 --max_seq_len 32768 \
    --max_batch_size 64

8x H100 SXM (640 GB)

trtllm-build \
    --checkpoint_dir ./tllm_405b_fp8_tp8 \
    --output_dir ./engines/8xh100-405b-fp8 \
    --tp_size 8 \
    --max_input_len 65536 --max_seq_len 65536 \
    --max_batch_size 32

Common Errors & Fixes {#troubleshooting}

| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory during build | Build needs 2x model size in RAM | Reduce TP, build on a larger node |
| Engine OOMs at runtime | KV cache too large | Lower --max_seq_len or --max_batch_size |
| unsupported plugin version | Engine built on different TRT-LLM version | Rebuild with current version |
| FP8 build fails | GPU lacks FP8 (Ampere) | Use INT4-AWQ instead |
| Triton container fails to start | shm too small | Add --shm-size=16g |
| Slow first request | CUDA graph warmup | Send 5-10 warmup requests after start |
| tritonserver: KV cache reuse OOM | Prefix cache too large | Lower kv_cache_free_gpu_mem_fraction |
| Engine works on dev box, fails in prod | Different GPU SKU | Engines are GPU-specific; rebuild on prod GPU |

FAQ {#faq}

See answers to common TensorRT-LLM questions below.


Sources: TensorRT-LLM GitHub | TensorRT-LLM docs | Triton Inference Server | tensorrtllm_backend | Internal benchmarks RTX 4090, RTX 5090, H100.

📅 Published: May 1, 2026 · 🔄 Last Updated: May 1, 2026 · ✓ Manually Reviewed

Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
