AWQ vs GPTQ vs GGUF: Which Quantization Is Best?
GGUF, GPTQ, and AWQ are the three main quantization formats for local AI models. GGUF is the universal format used by Ollama and llama.cpp that works on CPU, GPU, and Apple Silicon. GPTQ and AWQ are GPU-only formats requiring NVIDIA CUDA. AWQ preserves the highest quality (96-98% of FP16), while GGUF Q4_K_M offers the best compatibility at 95-98% quality retention. Use GGUF for Ollama, AWQ for vLLM production servers, and GPTQ for text-generation-webui.
Quick Answer: Which Quantization Format?
| Your Setup | Use This | Why |
|---|---|---|
| Ollama / llama.cpp | GGUF (Q4_K_M) | Only format supported; works on CPU+GPU |
| vLLM / TGI server | AWQ | Best throughput + quality on GPU |
| text-generation-webui | GPTQ or AWQ | Both supported; AWQ slightly better |
| No GPU (CPU only) | GGUF | Only option for CPU inference |
What Is Model Quantization?
Quantization reduces model size and memory usage by representing weights with fewer bits, trading a small amount of quality for dramatically lower resource requirements.
A full-precision (FP16) Llama 3.1 8B model needs ~16 GB of memory. With 4-bit quantization, the same model needs ~5 GB — a 3x reduction that makes it runnable on a single consumer GPU or even CPU.
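As a back-of-envelope check, weight memory is just parameter count times bits per weight. Here is a rough rule of thumb in Python; it ignores KV cache and runtime overhead, which typically add another 10-20%:
# Rough weight-memory estimate: 1B params at 1 byte each ≈ 1 GB
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8
print(weight_memory_gb(8, 16))   # FP16: ~16 GB
print(weight_memory_gb(8, 4.8))  # Q4_K_M averages roughly 4.8 bits/weight: ~4.8 GB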
Three quantization methods dominate the local AI ecosystem: GGUF, GPTQ, and AWQ. Each has different strengths, hardware requirements, and ecosystem support. This guide helps you pick the right one.
The Three Formats Compared
Head-to-Head Comparison
| Feature | GGUF | GPTQ | AWQ |
|---|---|---|---|
| Full Name | GPT-Generated Unified Format | GPT Quantization | Activation-Aware Weight Quantization |
| Created By | Georgi Gerganov (llama.cpp) | IST-DASLab (2022 paper, ICLR 2023) | MIT Han Lab (2023 paper, MLSys 2024) |
| Bit Options | 2, 3, 4, 5, 6, 8-bit | 2, 3, 4, 8-bit | 4-bit (primarily) |
| CPU Support | Yes (primary use case) | No (CUDA required) | No (CUDA required) |
| GPU Support | Yes (full or partial offload) | Yes (full GPU only) | Yes (full GPU only) |
| Apple Silicon | Yes (Metal acceleration) | No | No |
| AMD GPU | Yes (ROCm via llama.cpp) | Limited | Limited |
| Mixed Precision | Yes (K-quants: important layers get more bits) | No (uniform bit-width) | Yes (protects salient weights) |
| Ecosystem | Ollama, llama.cpp, LM Studio, Jan | vLLM, TGI, text-gen-webui, Transformers | vLLM, TGI, text-gen-webui, Transformers |
| File Extension | .gguf | safetensors/bin (with config) | safetensors (with config) |
| Quantization Speed | Minutes (on CPU) | Hours (requires calibration data + GPU) | Hours (requires calibration data + GPU) |
GGUF: The Universal Format
What Is GGUF?
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp, the C/C++ inference engine that powers Ollama, LM Studio, and Jan. It was designed for maximum compatibility — running on CPUs, GPUs, Apple Silicon, and even edge devices.
GGUF Quantization Levels
| Quant | Bits | Size (7B model) | VRAM | Quality vs FP16 | Use Case |
|---|---|---|---|---|---|
| Q2_K | 2-3 | ~2.8 GB | ~3.5 GB | 85-90% | Extreme compression, noticeable quality loss |
| Q3_K_M | 3-4 | ~3.3 GB | ~4 GB | 90-93% | Low-memory systems |
| Q4_K_S | 4 | ~3.9 GB | ~4.5 GB | 93-95% | Slightly smaller than K_M |
| Q4_K_M | 4 | ~4.1 GB | ~5 GB | 95-98% | Standard choice (Ollama default) |
| Q5_K_M | 5 | ~4.8 GB | ~5.5 GB | 97-99% | When you want a bit more quality |
| Q6_K | 6 | ~5.5 GB | ~6.5 GB | 99%+ | Near-lossless |
| Q8_0 | 8 | ~7.2 GB | ~8 GB | 99.5%+ | Essentially lossless |
| FP16 | 16 | ~14 GB | ~16 GB | 100% | Full precision (reference) |
Sizes are approximate for a 7B parameter model.
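To automate the table above, a tiny helper can pick the highest-quality quant that fits a VRAM budget. The numbers are hardcoded from the 7B table, so treat them as approximations:
# Pick the highest-quality quant that fits a VRAM budget (7B sizes from the table above)
QUANT_VRAM_GB = {"Q2_K": 3.5, "Q3_K_M": 4.0, "Q4_K_S": 4.5, "Q4_K_M": 5.0,
                 "Q5_K_M": 5.5, "Q6_K": 6.5, "Q8_0": 8.0}
def pick_quant(vram_gb: float) -> str | None:
    fitting = [q for q, need in QUANT_VRAM_GB.items() if need <= vram_gb]
    return fitting[-1] if fitting else None  # dict order runs lowest to highest quality
print(pick_quant(6))  # Q5_K_M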
When to Use GGUF
- You use Ollama, LM Studio, or Jan — these only support GGUF
- You have no NVIDIA GPU — GGUF works on CPU, Apple Silicon (Metal), AMD (ROCm)
- You want CPU+GPU hybrid — offload some layers to GPU, rest on CPU
- You want the simplest experience — one file, download and run
GGUF in Practice
# Ollama uses GGUF by default
ollama pull llama3.1:8b # Downloads Q4_K_M GGUF
# Pull a specific quantization
ollama pull llama3.1:8b-instruct-q5_K_M
# With llama.cpp directly
./llama-cli -m llama-3.1-8b-Q4_K_M.gguf -p "Hello" -n 128
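For the CPU+GPU hybrid mode from Python, the llama-cpp-python bindings expose the same layer offloading (assuming you have installed them with pip install llama-cpp-python):
from llama_cpp import Llama
# n_gpu_layers controls hybrid offload: 0 = pure CPU, -1 = all layers on GPU
llm = Llama(model_path="llama-3.1-8b-Q4_K_M.gguf", n_gpu_layers=20)
out = llm("Hello", max_tokens=128)
print(out["choices"][0]["text"])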
GPTQ: The Original GPU Quantizer
What Is GPTQ?
GPTQ (GPT Quantization) was one of the first practical post-training quantization methods for large language models, published by IST-DASLab in late 2022 (ICLR 2023). It uses a calibration dataset to minimize quantization error layer by layer, producing well-optimized GPU-only models.
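To make "minimize quantization error layer by layer" concrete, here is a toy version of the objective, not the actual GPTQ algorithm (which compensates rounding error weight by weight using second-order statistics): choose quantized weights that keep the layer's outputs close on calibration inputs.
import torch
W = torch.randn(256, 512)   # layer weights (out_features x in_features)
X = torch.randn(512, 128)   # calibration activations
scale = W.abs().max() / 7   # naive round-to-nearest 4-bit baseline
W_q = (W / scale).round().clamp(-8, 7) * scale
# GPTQ's weight updates drive this layer-output error below plain rounding
err = (W @ X - W_q @ X).pow(2).mean()
print(f"Layer output MSE after naive rounding: {err:.4f}")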
GPTQ Characteristics
| Aspect | Detail |
|---|---|
| Precision | Typically 4-bit (also supports 2, 3, 8-bit) |
| GPU Requirement | NVIDIA CUDA (mandatory) |
| Calibration | Requires 128-256 samples of representative data |
| Quantization Time | 1-4 hours for a 7B model on A100 |
| Quality (4-bit) | ~94-96% of FP16 |
| Inference Speed | Fast on GPU with optimized kernels (ExLlama, Marlin) |
When to Use GPTQ
- You have an NVIDIA GPU and want fast inference
- You use text-generation-webui — GPTQ was the first well-supported format
- You need wider bit-width options — GPTQ has mature support for 2, 3, 4, and 8-bit
- You want pre-quantized models — an extensive library exists on Hugging Face (TheBloke, turboderp)
GPTQ in Practice
# Using Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GPTQ",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
# Using vLLM
# vllm serve TheBloke/Llama-2-7B-GPTQ --quantization gptq
AWQ: The Accuracy Champion
What Is AWQ?
AWQ (Activation-Aware Weight Quantization) was developed by MIT's Han Lab in 2023 (published at MLSys 2024). Its key insight: not all weights are equally important. AWQ identifies the weights that matter most (by analyzing activation patterns) and protects them during quantization, preserving more model quality per bit than GPTQ.
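Here is a conceptual sketch of that idea, an illustration of the principle rather than the AutoAWQ implementation (which grid-searches the scaling strength per layer): scale up the channels with the largest average activations before quantizing, then fold the inverse scale back out, so rounding error hits the salient channels proportionally less.
import torch
def awq_style_scales(acts: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    importance = acts.abs().mean(dim=0)  # per-channel activation magnitude
    s = importance.pow(alpha)            # alpha controls protection strength
    return s / s.mean()                  # keep scales centered around 1
def quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max() / 7
    return (w / scale).round().clamp(-8, 7) * scale
acts = torch.randn(512, 1024)   # calibration activations (tokens x channels)
W = torch.randn(4096, 1024)     # weights (out_features x in_features)
s = awq_style_scales(acts)
W_q = quantize_4bit(W * s) / s  # salient channels see less relative rounding error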
AWQ Characteristics
| Aspect | Detail |
|---|---|
| Precision | Primarily 4-bit (W4A16) |
| GPU Requirement | NVIDIA CUDA (mandatory) |
| Calibration | Requires calibration data (similar to GPTQ) |
| Key Innovation | Protects salient weights based on activation magnitude |
| Quality (4-bit) | ~96-98% of FP16 (1-3% better than GPTQ) |
| Inference Speed | Fast, especially with vLLM's Marlin kernels |
| Format | Safetensors with quantization config |
AWQ vs GPTQ Quality Comparison
The AWQ paper (arXiv:2306.00978) demonstrates that AWQ consistently preserves more model quality than GPTQ at the same bit-width. The general pattern across published evaluations:
- AWQ-4bit retains ~96-98% of FP16 quality (measured by perplexity)
- GPTQ-4bit retains ~94-96% of FP16 quality
- The gap is consistent across model sizes (7B through 70B)
- On downstream tasks like MMLU, AWQ typically scores 0.5-1.5% higher than GPTQ
The exact numbers vary by model, calibration data, and evaluation setup. Check the AWQ GitHub repo for the latest benchmarks.
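If you want to reproduce a perplexity comparison yourself, a minimal sketch with Transformers looks like this. Real evaluations run a sliding window over a full corpus such as WikiText-2, and the model ID here is just an example:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "TheBloke/Llama-2-7B-GPTQ"  # swap in an AWQ or FP16 checkpoint to compare
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
text = "Held-out evaluation text goes here."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"Perplexity: {loss.exp().item():.2f}")  # lower is better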
When to Use AWQ
- You want the best quality at 4-bit quantization on GPU
- You run vLLM or TGI in production — AWQ kernels are highly optimized
- You serve models at scale — AWQ's throughput advantages compound
- Quality matters more than flexibility — AWQ is GPU-only but highest quality
AWQ in Practice
# Using vLLM (recommended for production)
# vllm serve casperhansen/llama-3-8b-instruct-awq --quantization awq
# Using Hugging Face Transformers
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(
"casperhansen/llama-3-8b-instruct-awq",
fuse_layers=True
)
Performance Benchmarks
Inference Speed (Llama 3.1 8B, RTX 4090 24GB)
| Format | Quantization | Tokens/sec | VRAM Used |
|---|---|---|---|
| FP16 (baseline) | None | 95 tok/s | 16.2 GB |
| GGUF | Q4_K_M | 85 tok/s | 5.1 GB |
| GGUF | Q5_K_M | 78 tok/s | 5.8 GB |
| GPTQ | 4-bit 128g | 110 tok/s | 5.3 GB |
| AWQ | 4-bit | 115 tok/s | 5.0 GB |
Note: Speed figures are approximate, based on community benchmarks. GPTQ/AWQ are generally faster on GPU thanks to optimized CUDA kernels (Marlin). GGUF uses llama.cpp's backend, which prioritizes broad hardware compatibility. Actual results vary by hardware, driver version, and batch size.
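To get numbers for your own machine rather than trusting anyone's table, a crude single-request measurement is enough for format-to-format comparison. There is no warmup or batching here, so treat the result as a ballpark; the model ID is an example:
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "TheBloke/Llama-2-7B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s")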
Quality Retention (General Pattern)
The typical quality retention at 4-bit quantization across published evaluations:
| Format | Quantization | Quality Retained (vs FP16) | Key Advantage |
|---|---|---|---|
| GGUF Q4_K_M | 4-bit mixed precision | ~95-98% | K-quants give important layers more bits |
| GGUF Q5_K_M | 5-bit mixed precision | ~97-99% | Higher bit-width, near-lossless |
| GPTQ 4-bit | 4-bit uniform | ~94-96% | Optimized CUDA kernels |
| AWQ 4-bit | 4-bit adaptive | ~96-98% | Protects salient weights |
Key insight: GGUF Q4_K_M and AWQ 4-bit are very close in quality. GGUF achieves this through mixed-precision K-quants (giving important layers more bits), while AWQ does it through activation-aware weight protection. Both outperform uniform GPTQ. The exact retention percentage varies by model and task.
When to Choose Each Format
Choose GGUF When:
- You use Ollama — it only supports GGUF
- You have Apple Silicon — Metal acceleration, no CUDA needed
- You have no GPU — CPU inference works well
- You want partial GPU offload — some layers on GPU, rest on CPU
- You want the simplest setup — one file, no dependencies
- You use LM Studio or Jan — both use GGUF
Choose AWQ When:
- You run a production GPU server — vLLM + AWQ is the gold standard
- Quality is your top priority — best quality-per-bit at 4-bit
- You serve many concurrent users — AWQ kernels optimize batched inference
- You have NVIDIA GPUs — A100, H100, RTX 4090+
Choose GPTQ When:
- You use text-generation-webui — GPTQ has the longest support history
- You need 2-bit or 3-bit quantization — GPTQ supports more bit widths
- You want the widest model selection — most Hugging Face quants are GPTQ
- You need ExLlamaV2 — EXL2 format (GPTQ-based) offers fine-grained bit control
Quantization Decision Matrix
| Criteria | GGUF | GPTQ | AWQ |
|---|---|---|---|
| CPU inference | Yes | No | No |
| Apple Silicon | Yes (Metal) | No | No |
| AMD GPU | Yes (ROCm) | Limited | Limited |
| NVIDIA GPU | Yes | Yes | Yes |
| Ollama support | Yes | No | No |
| vLLM support | No | Yes | Yes |
| Quality at 4-bit | Excellent (K-quants) | Good | Best |
| Inference speed (GPU) | Good | Fast | Fastest |
| Ecosystem breadth | Widest | Wide | Growing |
| Ease of use | Easiest | Moderate | Moderate |
| Mixed precision | Yes (K-quants) | No | Yes (per-channel) |
How to Convert Between Formats
FP16 → GGUF
# Step 1: convert to FP16 GGUF (convert_hf_to_gguf.py does not emit K-quants directly)
python convert_hf_to_gguf.py ./model-directory --outtype f16 --outfile model-F16.gguf
# Step 2: quantize to Q4_K_M with llama.cpp's llama-quantize tool
./llama-quantize model-F16.gguf model-Q4_K_M.gguf Q4_K_M
FP16 → GPTQ
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
tokenizer = AutoTokenizer.from_pretrained("./model-directory")
config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained("./model-directory", config)
# Calibration data: tokenized samples representative of your workload
calibration_data = [tokenizer("Example calibration text.", return_tensors="pt")]
model.quantize(calibration_data)
model.save_quantized("./model-gptq")
FP16 → AWQ
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_pretrained("./model-directory")
tokenizer = AutoTokenizer.from_pretrained("./model-directory")
model.quantize(tokenizer, quant_config={"zero_point": True, "w_bit": 4, "q_group_size": 128, "version": "GEMM"})
model.save_quantized("./model-awq")
Most users never need to convert models themselves — pre-quantized versions of popular models are available on Hugging Face and the Ollama library.
Real-World Recommendations
For Home Users (Ollama + Open WebUI)
Stick with GGUF via Ollama. It's the simplest path:
- ollama pull llama3.1:8b gives you Q4_K_M GGUF automatically
- Works on any hardware (Mac, Windows, Linux, with or without GPU)
- No configuration needed
- See our Open WebUI setup guide for the best interface
- See our best Ollama models guide for model recommendations
For Developers (AI Coding Assistants)
Use GGUF with Ollama as the backend for Continue.dev or Cursor:
- Fast model for autocomplete: qwen2.5-coder:1.5b (GGUF, ~2 GB)
- Quality model for chat: qwen2.5-coder:7b (GGUF, ~5 GB)
For Production API Servers
Use AWQ with vLLM:
vllm serve model-name-AWQ \
--quantization awq \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
This gives you the best throughput and quality for serving users at scale.
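vLLM serves an OpenAI-compatible API (port 8000 by default), so any OpenAI client works against it. The model name below is the placeholder from the serve command above:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="model-name-AWQ",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)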
For Research and Experimentation
Use GPTQ with text-generation-webui — it supports the widest range of quantization options and lets you compare formats easily.
Key Takeaways
- GGUF is the default for local use — Ollama, LM Studio, and Jan all use it exclusively
- AWQ produces the best quality at 4-bit on GPU, ideal for production servers
- GPTQ has the widest model library on Hugging Face and supports more bit widths
- Q4_K_M is the sweet spot for GGUF — 95-98% of FP16 quality at roughly 30% of the memory
- Only GGUF works without a GPU — AWQ and GPTQ require NVIDIA CUDA
- Quality differences are small — all three formats at 4-bit retain 95%+ of original quality
- Don't overthink it — if you use Ollama, you're already using the right format (GGUF)
Next Steps
- Pull the best Ollama models — all use GGUF automatically
- Set up Open WebUI for a ChatGPT-like interface
- Check VRAM requirements to size your hardware
- Compare local AI tools — all use GGUF
- Run models on 8GB RAM with optimized quantization
Quantization technology continues to improve rapidly. New methods like FP8 (for datacenter GPUs) and BitNet (1-bit models trained from scratch) are emerging. This guide covers the three formats relevant to most local AI users as of March 2026.