ExLlamaV2 + TabbyAPI Guide (2026): Best INT4 Inference on a Single GPU
ExLlamaV2 is the unsung hero of consumer-GPU local LLM inference. On a single RTX 3090 / 4090 / 5090, no other framework matches its speed for INT4-class quantized models — not vLLM, not llama.cpp, not Ollama. The catch: it is NVIDIA-only, built around single-user (batch-size-1) serving, limited to simple layer splits when you do add GPUs, and barely documented for newcomers.
TabbyAPI puts an OpenAI-compatible HTTP server on top, making ExLlamaV2 production-usable. Together they are the right answer when you have one good NVIDIA GPU, run one user at a time, and want maximum tokens per second.
This guide covers everything: how EXL2 quantization works and why it beats AWQ / GGUF on quality-per-bit, installing TabbyAPI on Linux / Windows / Docker, picking the right bits-per-weight for your VRAM, long-context tuning with cache quantization, sampling presets, multi-GPU layer splits, and benchmarks against vLLM, llama.cpp, and TensorRT-LLM on the same hardware.
Table of Contents
- What ExLlamaV2 and TabbyAPI Are
- Why EXL2 Beats Other 4-Bit Formats on Quality
- Hardware & Software Requirements
- Installation: TabbyAPI
- Picking the Right EXL2 bpw for Your VRAM
- Downloading Pre-Quantized Models
- Quantizing Your Own Model to EXL2
- TabbyAPI Configuration
- Cache Quantization (Q4 / Q6 / Q8 / FP16)
- Long Context (32K-131K)
- Multi-GPU Layer Split
- Sampling & Prompt Templates
- LoRA Adapters
- Speculative Decoding
- Tool Calling / Function Calling
- Performance Benchmarks
- Tuning Recipes by GPU
- Common Mistakes & Fixes
- FAQ
What ExLlamaV2 and TabbyAPI Are {#what-it-is}
ExLlamaV2 (turboderp's repo) is a CUDA-first inference library for INT4-quantized transformer models. It pairs hand-tuned dequant-fused matmul kernels with an aggressive batch-size-1 scheduler. The format it loads, EXL2, is its own quantization scheme — measurement-based, mixed-bit-width, optimized for inference on Ampere+ NVIDIA cards.
TabbyAPI (theroyallab/tabbyAPI) is the OpenAI-compatible server wrapping ExLlamaV2. It exposes /v1/chat/completions, /v1/completions, streaming, sampling presets, prompt templates, basic auth, and a YAML config.
Together, they are the right local inference stack when:
- You have one NVIDIA GPU (Ampere or newer).
- You serve one or two concurrent users.
- You want maximum tok/s.
For multi-user concurrent serving, see vLLM; for minimum single-stream latency on H100-class hardware, TensorRT-LLM; for AMD / Apple, llama.cpp / Ollama.
Why EXL2 Beats Other 4-Bit Formats on Quality {#exl2-quality}
Most 4-bit formats (GPTQ, AWQ, GGUF Q4_K_M) use a uniform bit width — every weight matrix gets the same number of bits. EXL2 instead allocates bits per layer based on measurement: it runs a calibration pass, measures how much each layer's weights affect output perplexity, and gives more bits to sensitive layers (attention QKV, output projection) and fewer to less-sensitive ones (later FFN layers).
Result: at the same average bits-per-weight, EXL2 produces lower perplexity than uniform-width quants.
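To make the allocation idea concrete, here is a toy sketch — not ExLlamaV2's actual optimizer, just the core intuition: spend a fixed extra bit budget in proportion to measured layer sensitivity so the average lands on the target bpw.

def allocate_bits(sensitivity, target_bpw, floor=2.0, ceil=8.0):
    # Hand out the extra budget proportionally to sensitivity, so the
    # average stays at target_bpw (ignoring clipping at the ceiling).
    total = sum(sensitivity)
    extra = (target_bpw - floor) * len(sensitivity)
    return [min(ceil, floor + extra * s / total) for s in sensitivity]

# e.g. attention projections measured as more sensitive than late FFN blocks
sens = {"q_proj": 0.9, "k_proj": 0.8, "v_proj": 0.85, "ffn_30": 0.2, "ffn_31": 0.15}
for name, b in zip(sens, allocate_bits(list(sens.values()), target_bpw=4.0)):
    print(f"{name}: {b:.2f} bpw")

Run it and the attention projections land above 4.5 bpw while the late FFN layers drop below 3 bpw, with the average still 4.0 — the same trade EXL2 makes, just with real calibration data instead of made-up scores.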
Llama 3.1 70B perplexity on WikiText (lower is better):
| Format | Avg bits | Size | PPL |
|---|---|---|---|
| FP16 | 16.00 | 140 GB | 4.96 |
| EXL2 6.0bpw | 6.00 | 53 GB | 4.97 |
| EXL2 5.0bpw | 5.00 | 44 GB | 4.99 |
| AWQ-INT4 g128 | 4.25 | 36 GB | 5.01 |
| GGUF Q5_K_M | 5.66 | 50 GB | 5.00 |
| EXL2 4.5bpw | 4.50 | 40 GB | 5.00 |
| EXL2 4.0bpw | 4.00 | 35 GB | 5.05 |
| GGUF Q4_K_M | 4.83 | 42 GB | 5.06 |
| GGUF IQ4_XS | 4.25 | 38 GB | 5.04 |
| GPTQ-128g | 4.25 | 36 GB | 5.10 |
EXL2 4.0bpw is slightly smaller than AWQ-INT4 g128 at comparable perplexity, and unlike the uniform formats it can go lower still: a ~2.4-2.5bpw quant shrinks a 70B model to roughly 21-22 GB of weights, which is what makes a 70B model with 8K context and a Q4-quantized KV cache practical on a single 24 GB RTX 3090 / 4090.
Hardware & Software Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| GPU | Compute capability 8.0+ (Ampere) | RTX 3090 / 4090 / 5090 / A6000 / L40S |
| VRAM (8B class) | 8 GB | 12 GB+ |
| VRAM (32B class) | 16 GB | 20 GB+ |
| VRAM (70B class) | 24 GB (low-bpw, ~2.4bpw) | 48 GB |
| Driver | 535+ | 555+ |
| CUDA | 12.1+ (build only; runtime via PyTorch) | 12.4+ |
| Python | 3.10 | 3.11-3.12 |
| OS | Linux, Windows | Ubuntu 22.04 LTS |
| RAM | 16 GB | 32 GB+ |
| Disk | 100 GB | NVMe |
Turing (RTX 20-series, GTX 16) is not supported — kernels require Ampere SM 8.0 features.
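A quick sanity check you can run from the same Python environment (requires PyTorch with CUDA) to confirm your card meets the SM 8.0 requirement:

import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible — check the driver install.")

major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"{name}: compute capability {major}.{minor}, {vram_gb:.1f} GB VRAM")
if (major, minor) < (8, 0):
    print("Pre-Ampere GPU — ExLlamaV2 kernels will not run here.")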
Installation: TabbyAPI {#installation}
Linux / WSL2
git clone https://github.com/theroyallab/tabbyAPI.git
cd tabbyAPI
python3.11 -m venv venv
source venv/bin/activate
pip install --upgrade pip
./start.sh # auto-detects your CUDA version and installs matching exllamav2 / flash-attn wheels
The start.sh script handles installing exllamav2 + flash-attn + dependencies with the right CUDA pins.
Windows
git clone https://github.com/theroyallab/tabbyAPI.git
cd tabbyAPI
.\start.bat
The start.bat script does the equivalent venv + install.
Docker
docker run -d --name tabbyapi \
--gpus all \
-p 5000:5000 \
-v $(pwd)/models:/app/models \
-v $(pwd)/config.yml:/app/config.yml \
ghcr.io/theroyallab/tabbyapi:latest
Manual install (pip)
pip install exllamav2
pip install flash-attn --no-build-isolation
pip install -r requirements.txt
python main.py --config config.yml
Default port 5000. Browse http://localhost:5000/docs for the FastAPI Swagger UI.
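Once the server is up, a quick sanity check from Python — this assumes the default port 5000 and whatever API key TabbyAPI generated at first start (any string works if you set disable_auth: true). The model field is mostly informational unless inline model loading is enabled.

from openai import OpenAI

# TabbyAPI speaks the OpenAI protocol; just point the client at it.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

resp = client.chat.completions.create(
    model="local",  # the currently loaded model is served regardless of this name
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)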
Picking the Right EXL2 bpw for Your VRAM {#picking-bpw}
Approximate VRAM budget (GB) ≈ parameters (in billions) × bpw / 8 + KV cache + overhead (~1-2 GB). A small estimator sketch follows the table.
| GPU VRAM | Recommended Model + bpw |
|---|---|
| 8 GB | Llama 3.1 8B EXL2 4.0bpw, 4K ctx, Q4 cache |
| 12 GB | Llama 3.1 8B EXL2 6.0bpw, 16K ctx |
| 16 GB | Qwen 2.5 14B EXL2 5.0bpw, 16K ctx |
| 24 GB | Qwen 2.5 32B EXL2 5.0bpw / Llama 3.1 70B EXL2 2.4bpw with Q4 cache |
| 32 GB (5090) | Llama 3.1 70B EXL2 3.0bpw, 16K ctx, Q4 cache |
| 48 GB (A6000 / 2x 24GB) | Llama 3.1 70B EXL2 4.5bpw, 32K ctx |
| 80 GB (H100 / A100) | Llama 3.1 70B EXL2 8.0bpw (405B needs multiple 80 GB GPUs even at ~2.5bpw) |
For 70B on exactly 24 GB: ~2.4bpw + Q4 cache works at 8K context; 2.5bpw squeezes in at ~4K. 3.0bpw and above will OOM at load.
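A back-of-envelope helper for the weights term of the formula above (add the KV cache figures from the cache quantization section plus ~1-2 GB of overhead); the numbers line up with the size column in the perplexity table:

def weights_gb(params_billions: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * bpw / 8

for bpw in (2.4, 3.0, 4.0, 4.5, 5.0, 6.0):
    print(f"70B @ {bpw}bpw ≈ {weights_gb(70.6, bpw):.0f} GB")
# 2.4bpw ≈ 21 GB (24 GB card), 3.0 ≈ 26 GB (32 GB), 4.5 ≈ 40 GB (48 GB), 6.0 ≈ 53 GB (80 GB)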
Downloading Pre-Quantized Models {#downloading}
The community publishes EXL2 quants on Hugging Face. Top providers:
- turboderp — the author; smallest selection but highest-quality reference quants.
- LoneStriker — extensive coverage of popular models.
- bartowski — broad coverage, multiple bpw per model.
# Use huggingface-cli
huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-exl2 \
--revision 4_0 \
--local-dir ./models/Llama-3.1-70B-4.0bpw
EXL2 repos use branch revisions for different bpw — --revision 4_0 selects the 4.0bpw branch. Always check the README for available branches.
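The same download from Python if you prefer scripting it — huggingface_hub's snapshot_download accepts the branch name via revision:

from huggingface_hub import snapshot_download

# Pull the 4.0bpw branch of an EXL2 repo into a local folder
snapshot_download(
    repo_id="bartowski/Meta-Llama-3.1-70B-Instruct-exl2",
    revision="4_0",
    local_dir="./models/Llama-3.1-70B-4.0bpw",
)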
Quantizing Your Own Model to EXL2 {#quantizing}
git clone https://github.com/turboderp-org/exllamav2
cd exllamav2
# Quantize Llama 3.1 8B to 4.65bpw average
python convert.py \
-i ./Llama-3.1-8B-Instruct \
-o ./Llama-3.1-8B-4.65bpw \
-nr \
-b 4.65 \
-hb 6 \
-c ./calibration.parquet
Flags:
- -b — target average bpw (e.g., 4.65, 5.0, 6.5)
- -hb — head bits for the output projection layer (keep at 6+ for quality)
- -c — calibration dataset (wikitext-2.parquet is the default; provide domain-specific data if your use case is narrow)
- -nr — don't resume from an existing measurement file; start fresh
Time: Llama 3.1 8B = 30-60 min on RTX 4090; 70B = 2-4 hours; 405B = 12-24 hours.
For a domain-specific calibration set, format your data as a single text column in a Parquet file, ~1024 rows of ~2K tokens each. Custom calibration typically improves perplexity on in-domain data by 1-3%.
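A minimal way to produce such a file with pandas (pyarrow installed for Parquet support). The corpus directory here is hypothetical — point it at your own text; chunk boundaries don't need to be exact token counts.

import glob
import pandas as pd

# Cut domain text files into ~8K-character chunks (~2K tokens each).
chunks = []
for path in glob.glob("my_domain_corpus/*.txt"):  # hypothetical corpus directory
    text = open(path, encoding="utf-8").read()
    chunks += [text[i:i + 8000] for i in range(0, len(text), 8000)]

# One column named "text", one chunk per row — the layout convert.py's -c flag expects.
pd.DataFrame({"text": chunks[:1024]}).to_parquet("calibration.parquet", index=False)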
TabbyAPI Configuration {#config}
config.yml:
network:
host: 0.0.0.0
port: 5000
disable_auth: false # set true for local dev only
logging:
log_prompt: false
log_generation_params: false
log_requests: false
model:
model_dir: models
model_name: Meta-Llama-3.1-70B-Instruct-exl2-4.0bpw
use_dummy_models: false
inline_model_loading: true
use_as_default: ['max_seq_len', 'cache_mode']
max_seq_len: 16384
override_base_seq_len:
cache_mode: Q4 # Q4, Q6, Q8, FP16
cache_size: 16384
chunk_size: 2048
prompt_template: # auto-detected from tokenizer_config; override here
# Multi-GPU
gpu_split_auto: true
autosplit_reserve: [96] # MB reserved on first GPU
gpu_split: # explicit override e.g. [22, 22]
# Optimizations
fasttensors: true
# Generation defaults
rope_scale: 1.0
rope_alpha: 1.0
draft_model:
draft_model_dir: # speculative decoding draft
draft_rope_scale: 1.0
embeddings:
embedding_model_dir:
embeddings_device: cpu
developer:
unsafe_launch: false
disable_request_streaming: false
cuda_malloc_backend: true
uvloop: true
realtime_process_priority: false
Run:
python main.py --config config.yml
Endpoints (OpenAI-compatible): /v1/chat/completions, /v1/completions. Plus /v1/model/load, /v1/model/unload for hot-swapping models.
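Hot-swapping from a script looks roughly like this. The admin key comes from the token file TabbyAPI generates at first start, and the exact payload fields are easiest to confirm against the Swagger UI at /docs — treat the field names below as a sketch, not gospel.

import requests

BASE = "http://localhost:5000"
HEADERS = {"x-admin-key": "YOUR_ADMIN_KEY"}  # generated by TabbyAPI on first start

# Unload the current model, then load a different folder from your models directory
requests.post(f"{BASE}/v1/model/unload", headers=HEADERS, timeout=60)
requests.post(
    f"{BASE}/v1/model/load",
    headers=HEADERS,
    json={"name": "Qwen2.5-32B-Instruct-exl2-5.0bpw", "max_seq_len": 16384},
    timeout=600,  # large models take minutes to load
)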
Cache Quantization (Q4 / Q6 / Q8 / FP16) {#cache-quantization}
| cache_mode | Bytes per cached K/V element pair | Perplexity loss | Use case |
|---|---|---|---|
| FP16 | 4 | 0% | Maximum quality, plenty of VRAM |
| Q8 | 2 | ~0.1% | Default for medium contexts |
| Q6 | 1.5 | ~0.3% | Good balance |
| Q4 | 1 | ~1.0% | Long context on tight VRAM |
For Llama 3.1 70B at 16K context: FP16 cache = ~5 GB, Q8 = ~2.5 GB, Q4 = ~1.25 GB. The savings from Q4 cache often allow you to fit a higher-bpw model on the same GPU.
Worked example on an RTX 4090 (24 GB) with a 70B EXL2 quant, using the sizes above:
- 2.4bpw + Q4 cache at 8K context → fits with roughly 1 GB free
- 2.4bpw + Q4 cache at 16K context → just barely fits (a desktop session on the same GPU can push it to OOM)
- 2.4bpw + Q8 cache at 16K context → OOM
- 3.0bpw and above → OOM at load
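The cache figures above can be reproduced from the model's attention geometry — a rough calculator assuming Llama 3.1 70B's published shape (80 layers, 8 GQA KV heads, head dim 128):

# Bytes per cached element for each cache_mode (applied to both K and V)
BYTES_PER_ELEM = {"FP16": 2.0, "Q8": 1.0, "Q6": 0.75, "Q4": 0.5}

def kv_cache_gb(ctx_len, n_layers=80, n_kv_heads=8, head_dim=128, mode="Q4"):
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[mode]  # K + V
    return per_token * ctx_len / 1024**3

for mode in BYTES_PER_ELEM:
    print(f"{mode}: {kv_cache_gb(16384, mode=mode):.2f} GB at 16K context")
# FP16 ≈ 5.0 GB, Q8 ≈ 2.5 GB, Q6 ≈ 1.9 GB, Q4 ≈ 1.25 GB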
Long Context (32K-131K) {#long-context}
model:
max_seq_len: 131072
cache_mode: Q4
cache_size: 131072
chunk_size: 4096 # prompt processing chunk
For RoPE scaling on models that need it (e.g., scaling Llama 3 8K to 32K):
model:
rope_scale: 4.0 # linear interpolation
rope_alpha: 1.0 # NTK-aware
Llama 3.1 (128K native) and Qwen 2.5 (32K native, 131K with YaRN) handle long context in the model itself — no additional flags. For the older Llama 3 (8K native), extending past 8K requires setting rope_scale and rope_alpha based on YaRN or NTK-aware formulas.
Multi-GPU Layer Split {#multi-gpu}
ExLlamaV2 splits layers across GPUs (pipeline-parallel-like, not tensor-parallel).
model:
gpu_split_auto: true
autosplit_reserve: [96, 96] # reserve 96 MB on each GPU
Or explicit:
model:
gpu_split_auto: false
gpu_split: [22, 22] # 22 GB on GPU0, 22 GB on GPU1
For 2x RTX 3090 with NVLink, expect 1.4-1.6x of single-3090 throughput on 70B 4.5bpw. Without NVLink (RTX 4090 / 5090), ~1.3-1.5x. Inferior to vLLM tensor parallel, but simpler.
Sampling & Prompt Templates {#sampling}
TabbyAPI exposes the full sampler stack via the OpenAI-compatible API plus extension fields:
{
"model": "Llama-3.1-70B",
"messages": [{"role": "user", "content": "Hello"}],
"temperature": 0.7,
"top_p": 0.9,
"min_p": 0.05,
"top_k": 0,
"repetition_penalty": 1.05,
"frequency_penalty": 0.0,
"presence_penalty": 0.0,
"smoothing_factor": 0.0,
"skew": 0.0,
"min_tokens": 0,
"max_tokens": 1024,
"stop": ["<|eot_id|>"],
"stream": true
}
DRY and XTC are not yet in TabbyAPI as of mid-2026 — for those, use oobabooga's text-generation-webui or KoboldCpp. See LLM Sampling Parameters for what each parameter does.
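From the openai Python client, the extension fields (min_p, smoothing_factor, and friends) travel through extra_body. A streaming sketch against a local TabbyAPI, same assumptions as earlier (port 5000, generated API key):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Explain KV cache quantization in two sentences."}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
    stream=True,
    extra_body={"min_p": 0.05, "repetition_penalty": 1.05},  # TabbyAPI extension fields
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)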
Prompt templates
TabbyAPI auto-detects prompt templates from tokenizer_config.json (Jinja). Override with prompt_template in the model section if needed. Supplied templates: alpaca, chatml, llama3, mistral, vicuna, phi3, etc., in templates/.
LoRA Adapters {#lora}
loras:
- name: my-style
scaling: 1.0
Place the LoRA folder under loras/my-style/. ExLlamaV2 supports peft-format LoRAs converted with the exllamav2.lora utilities.
Per-request LoRA via the API:
{
"model": "...",
"messages": [...],
"lora_request": {"name": "my-style", "scaling": 1.0}
}
Multiple LoRAs can be loaded simultaneously and selected per request.
Speculative Decoding {#speculative}
draft_model:
draft_model_dir: models/Llama-3.2-1B-Instruct-exl2-4.5bpw
draft_rope_scale: 1.0
draft_cache_mode: Q4
Pair Llama 3.1 70B target with Llama 3.2 1B draft (same vocab). Expected speedup: 1.5-2.0x at batch size 1 with high acceptance rate. See CUDA Optimization for theory.
Tool Calling / Function Calling {#tool-calling}
TabbyAPI supports OpenAI-style tools for any chat-tuned model with a tool template (Llama 3.1+, Qwen 2.5+, Mistral Large 2). Pass tools and tool_choice as in the OpenAI spec; TabbyAPI handles parsing the model output and returning structured tool_calls in the response.
For consistent JSON output without grammars, use temperature ≤ 0.3 and pair with a guidance/outlines wrapper at the application layer.
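A minimal tool-calling round trip, OpenAI-style. The schema format is the standard spec; whether the model actually emits a call depends on the loaded model's tool template.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
    tool_choice="auto",
)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)  # e.g. get_weather {"city": "Oslo"}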
Performance Benchmarks {#benchmarks}
Single user, batch size 1, 128 output tokens, 4K context, RTX 4090.
Llama 3.1 8B Q4-class
| Framework | tok/s |
|---|---|
| Ollama (Q4_K_M) | 127 |
| llama.cpp (Q4_K_M, FA) | 130 |
| vLLM (AWQ-INT4) | 155 |
| TabbyAPI / ExLlamaV2 (EXL2 4.0bpw) | 165 |
| TensorRT-LLM (AWQ) | 178 |
Llama 3.1 70B Q4-class
| Framework | tok/s | VRAM (GB) |
|---|---|---|
| Ollama (Q4_K_M) | 8 (offload) | 24+15 |
| llama.cpp (Q4_K_M, FA) | 9 (offload) | 24+15 |
| vLLM (AWQ-INT4) | OOM | 24 |
| TabbyAPI / ExLlamaV2 (EXL2 2.4bpw + Q4 cache) | 22 | 24 |
| 2x RTX 4090 vLLM (AWQ TP=2) | 38 | 48 |
| 2x RTX 4090 TabbyAPI (EXL2 5.0bpw) | 31 | 48 |
For single-GPU 70B inference, TabbyAPI is the only practical option at usable speeds.
Tuning Recipes by GPU {#tuning}
RTX 3090 (24 GB)
model:
model_dir: models
model_name: Llama-3.1-70B-EXL2-2.4bpw
max_seq_len: 8192
cache_mode: Q4
cache_size: 8192
chunk_size: 2048
fasttensors: true
RTX 4090 (24 GB)
Same as the 3090, but max_seq_len: 16384 with Q4 cache just fits if nothing else is using the GPU.
RTX 5090 (32 GB)
model:
model_dir: models
model_name: Llama-3.1-70B-EXL2-3.0bpw
max_seq_len: 16384
cache_mode: Q6
cache_size: 16384
A6000 / RTX 6000 Ada (48 GB)
model:
model_dir: models
model_name: Llama-3.1-70B-EXL2-4.5bpw
max_seq_len: 32768
cache_mode: Q8
2x RTX 3090 NVLink
model:
gpu_split_auto: false
gpu_split: [22, 22]
model_dir: models
model_name: Llama-3.1-70B-EXL2-4.5bpw
max_seq_len: 16384
cache_mode: Q6
Common Mistakes & Fixes {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM at load | bpw + cache too large | Lower bpw or cache_mode |
| OOM mid-generation | Long context grew KV | Lower max_seq_len or use Q4 cache |
| Slow first token | Prompt prefill | Set chunk_size higher (4096) |
| Garbled output | Wrong prompt template | Set explicit prompt_template: llama3 |
| Multi-GPU slower than single | NVLink not used | Verify NVLink with nvidia-smi nvlink -s |
| flash-attn install fails | Wrong CUDA / wheel | Use ./start.sh which pins versions |
| Model loads but errors on first request | Tokenizer mismatch | Re-download repo, ensure tokenizer files present |
| Repetition / loops | Sampling too narrow | Increase min_p to 0.05, set repetition_penalty to 1.05 |
FAQ {#faq}
See answers to common ExLlamaV2 / TabbyAPI questions below.
Sources: ExLlamaV2 GitHub | TabbyAPI GitHub | bartowski's EXL2 quants on HF | turboderp's quants | Internal benchmarks RTX 3090, 4090, 5090.