TensorRT-LLM Setup Guide (2026): Engine Build, FP8, INT4-AWQ, Triton
TensorRT-LLM is NVIDIA's purpose-built LLM inference compiler. It takes a Hugging Face checkpoint, fuses kernels, applies FP8 / INT4 quantization with CUDA graphs and in-flight batching, and emits a hardware-specific engine that delivers the lowest single-stream latency you can get on an NVIDIA GPU. Used right, it produces 20-40% lower per-token latency than vLLM at batch size 1 — at the cost of a 10-90 minute engine build per (model, GPU, dtype) combination.
This guide is the complete practitioner reference: NGC container install, checkpoint conversion, engine build flags, FP8 vs INT4-AWQ recipes, in-flight batching configuration, Triton integration, LoRA adapters, long-context tuning, multi-GPU deploys, and benchmarks vs vLLM and Ollama on the same hardware.
Table of Contents
- What TensorRT-LLM Is
- TensorRT-LLM vs vLLM vs SGLang vs Ollama
- Hardware & Software Requirements
- Installation: NGC Container, pip, From Source
- The Build Pipeline (Convert → Build → Serve)
- Your First Engine: Llama 3.1 8B FP8
- Quantization Recipes: FP8, INT4-AWQ, INT8
- Tensor Parallel & Pipeline Parallel
- In-Flight Batching
- Paged KV Cache & FP8 KV Cache
- Long Context (32K-128K)
- LoRA Adapters at Runtime
- Speculative Decoding (Medusa, EAGLE, Lookahead)
- Serving with trtllm-serve
- Triton Inference Server Backend
- Kubernetes Deployment
- Observability & Metrics
- Benchmarks vs vLLM and Ollama
- Tuning Recipes by GPU
- Common Errors & Fixes
- FAQ
What TensorRT-LLM Is {#what-it-is}
TensorRT-LLM is a Python library + CUDA runtime built on TensorRT. It does three things vLLM does not:
- Ahead-of-time compilation — your model is compiled into a binary engine specific to one GPU and one dtype combination. Kernels are autotuned at build time.
- Hand-tuned fused kernels — NVIDIA engineers maintain attention, MoE, and quant-dequant kernels written in CUDA C++ for each architecture.
- First-class FP8 + Transformer Engine — Hopper / Ada / Blackwell FP8 support is the most mature here.
Trade-off: less flexibility. You cannot swap models or change quant at runtime. Engines are not portable across GPUs.
The library exposes Python and C++ APIs, the trtllm-build CLI, the trtllm-serve server (since 0.9.0), and a Triton backend for production.
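For orientation, here is a minimal sketch of the Python LLM API against an already-built engine. The engine path is a placeholder, and the exact output object layout can vary slightly between releases; the build steps that produce such an engine are covered below.
# Minimal sketch: load a pre-built engine and generate (engine path is a placeholder)
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="./engines/llama-3.1-8b-fp8")             # engine dir produced by trtllm-build
params = SamplingParams(max_tokens=64, temperature=0.7)   # per-request sampling settings
outputs = llm.generate(["Explain FP8 quantization in one sentence."], sampling_params=params)
print(outputs[0].outputs[0].text)                         # output layout follows recent releases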
TensorRT-LLM vs vLLM vs SGLang vs Ollama {#comparison}
| Property | Ollama | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|---|
| Setup time | 60s | 5min | 30-90 min (engine build) | 5min |
| Single-stream latency | OK | Good | Best | Excellent |
| Aggregate throughput | Low | Excellent | Excellent | Excellent |
| FP8 (Ada/Hopper) | ❌ | ✅ | ✅ best | ✅ |
| INT4-AWQ | ❌ | ✅ | ✅ | ✅ |
| Pipeline parallel | basic | ✅ | ✅ | ✅ |
| Tensor parallel | ❌ | ✅ | ✅ | ✅ |
| MoE optimization | basic | good | excellent | excellent |
| Custom kernels | n/a | Python | C++ / CUDA | Python + Triton |
| Engine portability | weights only | weights only | GPU+dtype-specific | weights only |
| Best for | Desktop | Production servers | Lowest-latency prod | Agent frameworks |
Decision: if a 30-minute build per model is acceptable and latency is the KPI, TensorRT-LLM. If you iterate models often or need broad coverage, vLLM. For agent workflows with structured generation, SGLang.
Hardware & Software Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| GPU | CC 7.0+ | Ada (RTX 40), Hopper (H100), Blackwell |
| VRAM | 12 GB (8B BF16) | 24 GB+ (partial 70B FP8); 48 GB+ (full 70B AWQ) |
| Driver | 535+ | 555+ (FP8) |
| CUDA | 12.4 | 12.5+ |
| Python | 3.10 | 3.10-3.12 |
| OS | Linux only (Ubuntu 22.04+) | Ubuntu 22.04 LTS |
| RAM | 32 GB | 64 GB+ for 70B builds |
| Disk | 200 GB | NVMe; engines + caches grow fast |
Windows native is not supported. WSL2 works for inference; engine builds under WSL2 are slow due to filesystem overhead.
ROCm / AMD: not supported. Use vLLM-ROCm instead.
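A quick sanity check before installing anything (the compute_cap query field requires a reasonably recent driver):
# Verify GPU model, driver, and compute capability before choosing a quantization recipe
nvidia-smi --query-gpu=name,driver_version,compute_cap,memory.total --format=csv
# FP8 needs compute capability 8.9 (Ada) or 9.0 (Hopper); Ampere (8.0/8.6) should use INT4-AWQ or INT8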
Installation: NGC Container, pip, From Source {#installation}
NGC Container (recommended)
docker pull nvcr.io/nvidia/tensorrt-llm/release:0.16.0
docker run --rm -it --gpus all --ipc=host \
-v $(pwd):/workspace \
-v ~/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/tensorrt-llm/release:0.16.0
The NGC image bundles CUDA, TensorRT, TRT-LLM, and dependencies pinned to known-good versions.
pip (advanced)
python3.10 -m venv ~/venvs/trtllm
source ~/venvs/trtllm/bin/activate
pip install --upgrade pip
pip install tensorrt-llm==0.16.0 --extra-index-url https://pypi.nvidia.com
This works on Ubuntu 22.04 with CUDA 12.5. Mismatched CUDA versions are the #1 source of installation pain — prefer the container.
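After the pip install, a one-line import check catches most CUDA/TensorRT mismatches early:
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"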
From source (custom kernels, latest features)
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
docker build -t trtllm:custom -f docker/Dockerfile.dev .
Source builds take 30-60 minutes and require ~50 GB disk for build artifacts.
The Build Pipeline (Convert → Build → Serve) {#build-pipeline}
Hugging Face checkpoint
│
▼
[convert_checkpoint.py] # model-specific
│
▼
TRT-LLM checkpoint format (separated weights + config)
│
▼
[trtllm-build] # the compiler
│
▼
Engine files (.engine + config.json)
│
▼
[trtllm-serve | Triton]
│
▼
HTTP/gRPC API
Every model family has its own convert_checkpoint.py under examples/<family>/ in the TRT-LLM repo (llama, qwen, gemma, deepseek, mixtral, etc.).
Your First Engine: Llama 3.1 8B FP8 {#first-engine}
# Inside the NGC container
cd /workspace
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama
# Convert HF checkpoint to TRT-LLM format
python convert_checkpoint.py \
--model_dir meta-llama/Llama-3.1-8B-Instruct \
--output_dir ./tllm_checkpoint_8b_fp8 \
--dtype bfloat16 \
--use_fp8 \
--calib_dataset cnn_dailymail \
--calib_size 512
# Build engine
trtllm-build \
--checkpoint_dir ./tllm_checkpoint_8b_fp8 \
--output_dir ./engines/llama-3.1-8b-fp8 \
--gpt_attention_plugin auto \
--gemm_plugin auto \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable \
--max_input_len 16384 \
--max_seq_len 16384 \
--max_batch_size 64 \
--max_num_tokens 16384
# Serve (OpenAI-compatible)
trtllm-serve ./engines/llama-3.1-8b-fp8 \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 --port 8000
Build time on RTX 4090: ~10 minutes. Test with the same OpenAI-compatible client used for vLLM.
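For example, with the official openai Python package; the model field is largely informational for a single-engine trtllm-serve instance, so treat the name below as a placeholder:
# Query the trtllm-serve OpenAI-compatible endpoint; model name is a placeholder
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="llama-3.1-8b-fp8",
    messages=[{"role": "user", "content": "Summarize FP8 quantization in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)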
Quantization Recipes: FP8, INT4-AWQ, INT8 {#quantization}
FP8 (E4M3) — Ada / Hopper / Blackwell
python convert_checkpoint.py \
--model_dir meta-llama/Llama-3.1-70B-Instruct \
--output_dir ./tllm_70b_fp8 \
--dtype bfloat16 \
--use_fp8 \
--tp_size 2 \
--calib_dataset cnn_dailymail --calib_size 512
trtllm-build \
--checkpoint_dir ./tllm_70b_fp8 \
--output_dir ./engines/llama-3.1-70b-fp8 \
--gpt_attention_plugin auto \
--gemm_plugin auto \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable \
--max_input_len 32768 --max_seq_len 32768 \
--max_batch_size 16 \
--tp_size 2
Calibration uses 512 samples from CNN/DailyMail to compute scaling factors. Larger calibration sets (1024-2048) marginally improve quality on long-context tasks.
INT4-AWQ
python ../quantization/quantize.py \
--model_dir meta-llama/Llama-3.1-70B-Instruct \
--output_dir ./tllm_70b_awq \
--dtype bfloat16 \
--qformat int4_awq \
--awq_block_size 128 \
--calib_size 512 \
--tp_size 2
trtllm-build \
--checkpoint_dir ./tllm_70b_awq \
--output_dir ./engines/llama-3.1-70b-awq \
--gpt_attention_plugin auto \
--gemm_plugin auto \
--use_paged_context_fmha enable \
--max_input_len 16384 --max_seq_len 16384 \
--tp_size 2
INT4-AWQ + FP8 KV cache is the highest-throughput recipe on Ada / Hopper for 70B class models.
INT8 (W8A8 SmoothQuant)
python ../quantization/quantize.py \
--model_dir <model> \
--output_dir ./tllm_w8a8 \
--qformat int8_sq \
--calib_size 512
INT8 is rarely the right choice anymore — INT4-AWQ is smaller and FP8 is faster. Keep for legacy compatibility on Ampere where FP8 is unavailable.
Tensor Parallel & Pipeline Parallel {#parallelism}
# 8x H100 — Llama 3.1 405B FP8 with TP=8
trtllm-build \
--checkpoint_dir ./tllm_405b_fp8 \
--output_dir ./engines/llama-405b-fp8-tp8 \
--tp_size 8 \
--max_seq_len 32768 \
--max_batch_size 8
# 16x H100 across 2 nodes — TP=8 within node, PP=2 across nodes
trtllm-build \
--checkpoint_dir ./tllm_405b_fp8 \
--output_dir ./engines/llama-405b-fp8-tp8-pp2 \
--tp_size 8 --pp_size 2 \
--max_seq_len 32768
NCCL must see all GPUs. Within one node, NVLink (H100 SXM, A100 SXM) is critical. Across nodes, InfiniBand is recommended for TP > 8.
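Before a multi-GPU build, it is worth confirming the interconnect topology NCCL will see:
# Show the GPU-to-GPU link matrix: NV# entries indicate NVLink, PHB/PIX/SYS indicate PCIe paths
nvidia-smi topo -m
# Confirm all GPUs are visible to the process (optional, requires PyTorch)
python -c "import torch; print(torch.cuda.device_count(), 'GPUs visible')"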
In-Flight Batching {#in-flight-batching}
In-Flight Batching (IFB) is TensorRT-LLM's term for continuous batching — request scheduling at iteration granularity, not request granularity. Same goal as vLLM's continuous batching, different implementation.
Enabled by default for any engine built with paged KV cache. Configure via runtime:
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.executor import ExecutorConfig
executor_cfg = ExecutorConfig(
max_batch_size=32,
max_num_tokens=8192,
enable_chunked_context=True,
kv_cache_config={"free_gpu_memory_fraction": 0.9},
)
llm = LLM(model="./engines/llama-3.1-8b-fp8", executor_config=executor_cfg)
enable_chunked_context interleaves long-prompt prefill with decode steps from other requests — same idea as vLLM's chunked prefill.
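A short usage sketch with the llm object configured above; prompts and sampling values are arbitrary, and the output object layout may differ slightly across TRT-LLM versions:
# Submit several prompts at once; the executor schedules them with in-flight batching
params = SamplingParams(max_tokens=256, temperature=0.8, top_p=0.95)
prompts = ["Write a haiku about GPUs.", "Explain paged KV cache in one paragraph."]
for out in llm.generate(prompts, sampling_params=params):
    print(out.outputs[0].text)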
Paged KV Cache & FP8 KV Cache {#kv-cache}
trtllm-build ... \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable
FP8 KV cache halves memory vs BF16 with negligible quality impact. Required for long context on consumer GPUs.
Runtime tuning:
kv_cache_config = {
"free_gpu_memory_fraction": 0.92, # fraction of free VRAM for KV
"enable_block_reuse": True, # prefix caching
"host_cache_size": 8 * 1024**3, # 8 GB CPU offload
}
enable_block_reuse is TensorRT-LLM's prefix caching — same benefit as vLLM's, automatically applied to repeated prompt prefixes.
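To see block reuse pay off, send requests that share a long common prefix such as a system prompt or a document. A hedged sketch using the same LLM API as above, with the kv_cache_config passed through the executor configuration shown earlier:
# Requests sharing the same long prefix reuse cached KV blocks after the first request
system = "You are a support assistant for ACME. Policies: ..."  # imagine several thousand tokens here
questions = ["How do I reset my password?", "What is the refund window?"]
prompts = [f"{system}\n\nUser: {q}\nAssistant:" for q in questions]
outputs = llm.generate(prompts, sampling_params=SamplingParams(max_tokens=128))
# With enable_block_reuse=True, the second prompt's prefill only processes the unshared suffix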
Long Context (32K-128K) {#long-context}
For 128K context on Llama 3.1:
trtllm-build ... \
--max_input_len 131072 \
--max_seq_len 131080 \
--max_batch_size 4 \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable
Long-context KV cache memory is the bottleneck — Llama 3.1 8B at 128K with FP8 KV uses ~24 GB. Lower max_batch_size to fit. RoPE scaling parameters are read from the checkpoint config; for non-standard scaling pass --rope_scaling_factor and --rope_theta.
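A back-of-the-envelope estimator for per-sequence KV cache size helps pick max_batch_size before committing to a long build. The model shape values below are illustrative; read the real ones from the checkpoint's config.json.
# Rough per-sequence KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_element
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=1):  # 1 byte for FP8 KV, 2 for BF16
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
# Example shape (illustrative): 32 layers, 8 KV heads, head dim 128, 128K tokens, FP8 KV
print(f"{kv_cache_bytes(32, 8, 128, 131072, 1) / 1e9:.1f} GB per sequence")
# Multiply by max_batch_size and add engine weights plus activation workspace to estimate total VRAM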
For background on long-context techniques (RoPE scaling, YaRN, NTK-aware), see our Sampling Parameters guide and forthcoming long-context deep dive.
LoRA Adapters at Runtime {#lora}
Build the base engine with LoRA support:
trtllm-build ... \
--lora_plugin auto \
--lora_target_modules attn_q attn_k attn_v attn_dense \
--max_lora_rank 64
Convert and use a LoRA adapter:
python convert_checkpoint.py --lora_path ./my-lora --output_dir ./tllm_lora
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.executor import LoRARequest  # import path may vary by TRT-LLM version
llm = LLM(model="./engines/base", lora_dir="./tllm_lora")
out = llm.generate(["Once upon a time"], sampling_params=SamplingParams(max_tokens=128),
                   lora_request=LoRARequest("my-lora", 1, "./tllm_lora"))
Per-request LoRA swapping takes microseconds: adapter weights stream into a fixed-size LoRA cache. This makes it practical to serve 10-100 fine-tunes from a single base engine.
Speculative Decoding (Medusa, EAGLE, Lookahead) {#speculative}
# Medusa heads
trtllm-build ... \
--speculative_decoding_mode medusa \
--max_draft_len 5 \
--num_medusa_heads 4
# EAGLE
trtllm-build ... \
--speculative_decoding_mode eagle \
--max_draft_len 7
# Lookahead (n-gram, no draft model)
trtllm-build ... \
--speculative_decoding_mode lookahead \
--max_draft_len 5
Expected speedups at batch size 1 (single-stream): Medusa 2.0-2.5x, EAGLE 2.5-3.0x, Lookahead 1.4-1.8x. Background on the methods: see the CUDA Optimization guide.
Serving with trtllm-serve {#trtllm-serve}
trtllm-serve ./engines/llama-3.1-8b-fp8 \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 32 \
--max_num_tokens 8192 \
--kv_cache_free_gpu_memory_fraction 0.9
OpenAI-compatible endpoints: /v1/chat/completions, /v1/completions, /health, /metrics.
Test:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.1-8b", "messages": [{"role":"user","content":"hi"}]}'
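Streaming works through the same endpoint. A hedged sketch with the openai package; the model name is a placeholder, as before:
# Stream tokens as they are generated from trtllm-serve
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="llama-3.1-8b-fp8",
    messages=[{"role": "user", "content": "Write a limerick about KV caches."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)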
Triton Inference Server Backend {#triton}
For production: tritonserver + tensorrtllm_backend.
docker pull nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
# Triton model repo layout
model_repo/
├── ensemble/ # ensemble pipeline (preprocess + tllm + postprocess)
├── preprocessing/
├── postprocessing/
└── tensorrt_llm/
├── 1/
│ └── (engine + config)
└── config.pbtxt
config.pbtxt highlights:
backend: "tensorrtllm"
max_batch_size: 32
parameters: {
key: "gpt_model_path"
value: { string_value: "/models/tensorrt_llm/1" }
}
parameters: {
key: "kv_cache_free_gpu_mem_fraction"
value: { string_value: "0.9" }
}
parameters: {
key: "enable_chunked_context"
value: { string_value: "true" }
}
parameters: {
key: "enable_kv_cache_reuse"
value: { string_value: "true" }
}
Launch:
docker run --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 \
-v $(pwd)/model_repo:/models \
nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3 \
tritonserver --model-repository=/models
Triton exposes HTTP (8000), gRPC (8001), and Prometheus metrics (8002).
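A quick smoke test via Triton's generate extension. The field names below follow the reference ensemble from tensorrtllm_backend; adjust them if your preprocessing config.pbtxt uses different input names:
# Send one request to the ensemble model through the HTTP generate endpoint
curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "What is in-flight batching?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'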
Kubernetes Deployment {#kubernetes}
apiVersion: apps/v1
kind: Deployment
metadata: { name: trtllm-llama-8b }
spec:
replicas: 2
selector: { matchLabels: { app: trtllm-llama-8b } }
template:
metadata: { labels: { app: trtllm-llama-8b } }
spec:
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
args: ["tritonserver", "--model-repository=/models"]
ports:
- { name: http, containerPort: 8000 }
- { name: grpc, containerPort: 8001 }
- { name: metrics, containerPort: 8002 }
resources: { limits: { nvidia.com/gpu: "1" } }
readinessProbe: { httpGet: { path: /v2/health/ready, port: http }, initialDelaySeconds: 60 }
volumeMounts:
- { name: models, mountPath: /models }
- { name: shm, mountPath: /dev/shm }
volumes:
- { name: models, persistentVolumeClaim: { claimName: trtllm-models-pvc } }
- { name: shm, emptyDir: { medium: Memory, sizeLimit: 16Gi } }
Use the Prometheus metrics on :8002 for HPA scaling. Engines are GPU-specific — pin pods to matching nodes via nodeSelector: { nvidia.com/gpu.product: NVIDIA-H100-PCIe }.
Observability & Metrics {#observability}
Triton exposes Prometheus metrics on :8002/metrics:
| Metric | Meaning |
|---|---|
| nv_inference_request_success | Successful requests |
| nv_inference_request_failure | Failures |
| nv_inference_queue_duration_us | Time queued |
| nv_inference_compute_input_duration_us | Prefill |
| nv_inference_compute_output_duration_us | Decode |
| nv_gpu_utilization | Per-GPU utilization |
| nv_gpu_memory_used_bytes | VRAM used |
| nv_trt_llm_kv_cache_block_used | KV cache blocks in use |
| nv_trt_llm_active_request_count | Active in-flight requests |
Pair with Grafana dashboard 19656 (community Triton + TRT-LLM dashboard).
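Two example Prometheus queries built from the metrics above; label names may differ slightly by Triton version:
# Requests per second per model, and mean queue time in milliseconds per request
sum(rate(nv_inference_request_success[1m])) by (model)
rate(nv_inference_queue_duration_us[5m]) / rate(nv_inference_request_success[5m]) / 1000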
Benchmarks vs vLLM and Ollama {#benchmarks}
RTX 4090, Llama 3.1 8B FP8, 8K context, 128 output tokens:
| Concurrency | Ollama tok/s | vLLM tok/s | TensorRT-LLM tok/s |
|---|---|---|---|
| 1 | 132 | 142 | 178 |
| 4 | 145 | 480 | 555 |
| 16 | 148 | 1,150 | 1,310 |
| 32 | 148 | 1,720 | 1,940 |
| 64 | 148 | 2,200 | 2,510 |
TensorRT-LLM wins at every concurrency level. The gap is largest at batch size 1 (single-stream latency), typically 25-30% lower than vLLM.
p99 TTFT at 32-concurrency: vLLM 510 ms, TRT-LLM 380 ms. p99 ITL: vLLM 18 ms, TRT-LLM 14 ms.
Tuning Recipes by GPU {#tuning}
RTX 4090 (24 GB Ada)
trtllm-build \
--checkpoint_dir ./tllm_8b_fp8 \
--output_dir ./engines/4090-8b-fp8 \
--gpt_attention_plugin auto --gemm_plugin auto \
--use_paged_context_fmha enable \
--use_fp8_context_fmha enable \
--max_input_len 32768 --max_seq_len 32768 \
--max_batch_size 64 --max_num_tokens 16384
RTX 5090 (32 GB Blackwell)
Same as the 4090 with --max_seq_len 65536. FA3 + native FP8 deliver larger gains than on the RTX 4090.
2x RTX 4090 (48 GB total, PCIe)
trtllm-build \
--checkpoint_dir ./tllm_70b_awq_tp2 \
--output_dir ./engines/2x4090-70b-awq \
--tp_size 2 \
--max_input_len 16384 --max_seq_len 16384 \
--max_batch_size 16
H100 SXM 80 GB
# Llama 3.1 70B FP8 single GPU
trtllm-build \
--checkpoint_dir ./tllm_70b_fp8 \
--output_dir ./engines/h100-70b-fp8 \
--max_input_len 32768 --max_seq_len 32768 \
--max_batch_size 64
8x H100 SXM (640 GB)
trtllm-build \
--checkpoint_dir ./tllm_405b_fp8_tp8 \
--output_dir ./engines/8xh100-405b-fp8 \
--tp_size 8 \
--max_input_len 65536 --max_seq_len 65536 \
--max_batch_size 32
Common Errors & Fixes {#troubleshooting}
| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory during build | Build needs 2x model size in RAM | Reduce TP, build on a larger node |
| Engine OOMs at runtime | KV cache too large | Lower --max_seq_len or --max_batch_size |
| unsupported plugin version | Engine built on different TRT-LLM version | Rebuild with current version |
| FP8 build fails | GPU lacks FP8 (Ampere) | Use INT4-AWQ instead |
| Triton container fails to start | shm too small | Add --shm-size=16g |
| Slow first request | CUDA graph warmup | Send 5-10 warmup requests after start |
| tritonserver: KV cache reuse OOM | Prefix cache too large | Lower kv_cache_free_gpu_mem_fraction |
| Engine works on dev box, fails in prod | Different GPU SKU | Engines are GPU-specific; rebuild on prod GPU |
FAQ {#faq}
See answers to common TensorRT-LLM questions below.
Sources: TensorRT-LLM GitHub | TensorRT-LLM docs | Triton Inference Server | tensorrtllm_backend | Internal benchmarks RTX 4090, RTX 5090, H100.