AWQ vs GPTQ vs GGUF: Which Quantization Is Best?
GGUF, GPTQ, and AWQ are the three main quantization formats for local AI models. GGUF is the universal format used by Ollama and llama.cpp that works on CPU, GPU, and Apple Silicon. GPTQ and AWQ are GPU-only formats requiring NVIDIA CUDA. AWQ preserves the highest quality (96-98% of FP16), while GGUF Q4_K_M offers the best compatibility at 95-98% quality retention. Use GGUF for Ollama, AWQ for vLLM production servers, and GPTQ for text-generation-webui.
Quick Answer: Which Quantization Format?
| Your Setup | Use This | Why |
|---|---|---|
| Ollama / llama.cpp | GGUF (Q4_K_M) | Only format supported; works on CPU+GPU |
| vLLM / TGI server | AWQ | Best throughput + quality on GPU |
| text-generation-webui | GPTQ or AWQ | Both supported; AWQ slightly better |
| No GPU (CPU only) | GGUF | Only option for CPU inference |
What Is Model Quantization?
Quantization reduces model size and memory usage by representing weights with fewer bits, trading a small amount of quality for dramatically lower resource requirements.
A full-precision (FP16) Llama 3.1 8B model needs ~16 GB of memory. With 4-bit quantization, the same model needs ~5 GB — a 3x reduction that makes it runnable on a single consumer GPU or even CPU.
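As a back-of-envelope check, weight memory is just parameter count times bits per weight. Here is a rough rule of thumb in Python; it ignores KV cache and runtime overhead, which typically add another 10-20%:
# Rough weight-memory estimate: 1B params at 1 byte each ≈ 1 GB
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8
print(weight_memory_gb(8, 16))   # FP16: ~16 GB
print(weight_memory_gb(8, 4.8))  # Q4_K_M averages roughly 4.8 bits/weight: ~4.8 GB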
Three quantization methods dominate the local AI ecosystem: GGUF, GPTQ, and AWQ. Each has different strengths, hardware requirements, and ecosystem support. This guide helps you pick the right one.
The Three Formats Compared
Head-to-Head Comparison
| Feature | GGUF | GPTQ | AWQ |
|---|---|---|---|
| Full Name | GPT-Generated Unified Format | GPT Quantization | Activation-Aware Weight Quantization |
| Created By | Georgi Gerganov (llama.cpp) | IST-DASLab (2022 paper, ICLR 2023) | MIT Han Lab (2023 paper, MLSys 2024) |
| Bit Options | 2, 3, 4, 5, 6, 8-bit | 2, 3, 4, 8-bit | 4-bit (primarily) |
| CPU Support | Yes (primary use case) | No (CUDA required) | No (CUDA required) |
| GPU Support | Yes (full or partial offload) | Yes (full GPU only) | Yes (full GPU only) |
| Apple Silicon | Yes (Metal acceleration) | No | No |
| AMD GPU | Yes (ROCm via llama.cpp) | Limited | Limited |
| Mixed Precision | Yes (K-quants: important layers get more bits) | No (uniform bit-width) | Yes (protects salient weights) |
| Ecosystem | Ollama, llama.cpp, LM Studio, Jan | vLLM, TGI, text-gen-webui, Transformers | vLLM, TGI, text-gen-webui, Transformers |
| File Extension | .gguf | safetensors/bin (with config) | safetensors (with config) |
| Quantization Speed | Minutes (on CPU) | Hours (requires calibration data + GPU) | Hours (requires calibration data + GPU) |
GGUF: The Universal Format
What Is GGUF?
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp, the C/C++ inference engine that powers Ollama, LM Studio, and Jan. It was designed for maximum compatibility — running on CPUs, GPUs, Apple Silicon, and even edge devices.
GGUF Quantization Levels
| Quant | Bits | Size (7B model) | VRAM | Quality vs FP16 | Use Case |
|---|---|---|---|---|---|
| Q2_K | 2-3 | ~2.8 GB | ~3.5 GB | 85-90% | Extreme compression, noticeable quality loss |
| Q3_K_M | 3-4 | ~3.3 GB | ~4 GB | 90-93% | Low-memory systems |
| Q4_K_S | 4 | ~3.9 GB | ~4.5 GB | 93-95% | Slightly smaller than K_M |
| Q4_K_M | 4 | ~4.1 GB | ~5 GB | 95-98% | Standard choice (Ollama default) |
| Q5_K_M | 5 | ~4.8 GB | ~5.5 GB | 97-99% | When you want a bit more quality |
| Q6_K | 6 | ~5.5 GB | ~6.5 GB | 99%+ | Near-lossless |
| Q8_0 | 8 | ~7.2 GB | ~8 GB | 99.5%+ | Essentially lossless |
| FP16 | 16 | ~14 GB | ~16 GB | 100% | Full precision (reference) |
Sizes are approximate for a 7B parameter model.
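To automate the table above, a tiny helper can pick the highest-quality quant that fits a VRAM budget. The numbers are hardcoded from the 7B table, so treat them as approximations:
# Pick the highest-quality quant that fits a VRAM budget (7B sizes from the table above)
QUANT_VRAM_GB = {"Q2_K": 3.5, "Q3_K_M": 4.0, "Q4_K_S": 4.5, "Q4_K_M": 5.0,
                 "Q5_K_M": 5.5, "Q6_K": 6.5, "Q8_0": 8.0}
def pick_quant(vram_gb: float) -> str | None:
    fitting = [q for q, need in QUANT_VRAM_GB.items() if need <= vram_gb]
    return fitting[-1] if fitting else None  # dict order runs lowest to highest quality
print(pick_quant(6))  # Q5_K_M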
When to Use GGUF
- You use Ollama, LM Studio, or Jan — these only support GGUF
- You have no NVIDIA GPU — GGUF works on CPU, Apple Silicon (Metal), AMD (ROCm)
- You want CPU+GPU hybrid — offload some layers to GPU, rest on CPU
- You want the simplest experience — one file, download and run
GGUF in Practice
# Ollama uses GGUF by default
ollama pull llama3.1:8b # Downloads Q4_K_M GGUF
# Pull a specific quantization
ollama pull llama3.1:8b-instruct-q5_K_M
# With llama.cpp directly
./llama-cli -m llama-3.1-8b-Q4_K_M.gguf -p "Hello" -n 128
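For the CPU+GPU hybrid mode from Python, the llama-cpp-python bindings expose the same layer offloading (assuming you have installed them with pip install llama-cpp-python):
from llama_cpp import Llama
# n_gpu_layers controls hybrid offload: 0 = pure CPU, -1 = all layers on GPU
llm = Llama(model_path="llama-3.1-8b-Q4_K_M.gguf", n_gpu_layers=20)
out = llm("Hello", max_tokens=128)
print(out["choices"][0]["text"])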
GPTQ: The Original GPU Quantizer
What Is GPTQ?
GPTQ (GPT Quantization) was one of the first practical post-training quantization methods for large language models, published by IST-DASLab in late 2022 (ICLR 2023). It uses a calibration dataset to minimize quantization error layer by layer, producing well-optimized GPU-only models.
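To make "minimize quantization error layer by layer" concrete, here is a toy version of the objective, not the actual GPTQ algorithm (which compensates rounding error weight by weight using second-order statistics): choose quantized weights that keep the layer's outputs close on calibration inputs.
import torch
W = torch.randn(256, 512)   # layer weights (out_features x in_features)
X = torch.randn(512, 128)   # calibration activations
scale = W.abs().max() / 7   # naive round-to-nearest 4-bit baseline
W_q = (W / scale).round().clamp(-8, 7) * scale
# GPTQ's weight updates drive this layer-output error below plain rounding
err = (W @ X - W_q @ X).pow(2).mean()
print(f"Layer output MSE after naive rounding: {err:.4f}")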
GPTQ Characteristics
| Aspect | Detail |
|---|---|
| Precision | Typically 4-bit (also supports 2, 3, 8-bit) |
| GPU Requirement | NVIDIA CUDA (mandatory) |
| Calibration | Requires 128-256 samples of representative data |
| Quantization Time | 1-4 hours for a 7B model on A100 |
| Quality (4-bit) | ~94-96% of FP16 |
| Inference Speed | Fast on GPU with optimized kernels (ExLlama, Marlin) |
When to Use GPTQ
- You have an NVIDIA GPU and want fast inference
- You use text-generation-webui — GPTQ was the first well-supported format
- You need wider bit-width options — GPTQ has mature support for 2, 3, 4, and 8-bit
- You want pre-quantized models — an extensive library exists on Hugging Face (TheBloke, turboderp)
GPTQ in Practice
# Using Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GPTQ",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
# Using vLLM
# vllm serve TheBloke/Llama-2-7B-GPTQ --quantization gptq
AWQ: The Accuracy Champion
What Is AWQ?
AWQ (Activation-Aware Weight Quantization) was developed by MIT's Han Lab in 2023 (published at MLSys 2024). Its key insight: not all weights are equally important. AWQ identifies the weights that matter most (by analyzing activation patterns) and protects them during quantization, preserving more model quality per bit than GPTQ.
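Here is a conceptual sketch of that idea, an illustration of the principle rather than the AutoAWQ implementation (which grid-searches the scaling strength per layer): scale up the channels with the largest average activations before quantizing, then fold the inverse scale back out, so rounding error hits the salient channels proportionally less.
import torch
def awq_style_scales(acts: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    importance = acts.abs().mean(dim=0)  # per-channel activation magnitude
    s = importance.pow(alpha)            # alpha controls protection strength
    return s / s.mean()                  # keep scales centered around 1
def quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max() / 7
    return (w / scale).round().clamp(-8, 7) * scale
acts = torch.randn(512, 1024)   # calibration activations (tokens x channels)
W = torch.randn(4096, 1024)     # weights (out_features x in_features)
s = awq_style_scales(acts)
W_q = quantize_4bit(W * s) / s  # salient channels see less relative rounding error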
AWQ Characteristics
| Aspect | Detail |
|---|---|
| Precision | Primarily 4-bit (W4A16) |
| GPU Requirement | NVIDIA CUDA (mandatory) |
| Calibration | Requires calibration data (similar to GPTQ) |
| Key Innovation | Protects salient weights based on activation magnitude |
| Quality (4-bit) | ~96-98% of FP16 (1-3% better than GPTQ) |
| Inference Speed | Fast, especially with vLLM's Marlin kernels |
| Format | Safetensors with quantization config |
AWQ vs GPTQ Quality Comparison
The AWQ paper (arXiv:2306.00978) demonstrates that AWQ consistently preserves more model quality than GPTQ at the same bit-width. The general pattern across published evaluations:
- AWQ-4bit retains ~96-98% of FP16 quality (measured by perplexity)
- GPTQ-4bit retains ~94-96% of FP16 quality
- The gap is consistent across model sizes (7B through 70B)
- On downstream tasks like MMLU, AWQ typically scores 0.5-1.5% higher than GPTQ
The exact numbers vary by model, calibration data, and evaluation setup. Check the AWQ GitHub repo for the latest benchmarks.
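If you want to reproduce a perplexity comparison yourself, a minimal sketch with Transformers looks like this. Real evaluations run a sliding window over a full corpus such as WikiText-2, and the model ID here is just an example:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "TheBloke/Llama-2-7B-GPTQ"  # swap in an AWQ or FP16 checkpoint to compare
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
text = "Held-out evaluation text goes here."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"Perplexity: {loss.exp().item():.2f}")  # lower is better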
When to Use AWQ
- You want the best quality at 4-bit quantization on GPU
- You run vLLM or TGI in production — AWQ kernels are highly optimized
- You serve models at scale — AWQ's throughput advantages compound
- Quality matters more than flexibility — AWQ is GPU-only but highest quality
AWQ in Practice
# Using vLLM (recommended for production)
# vllm serve casperhansen/llama-3-8b-instruct-awq --quantization awq
# Using Hugging Face Transformers
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(
"casperhansen/llama-3-8b-instruct-awq",
fuse_layers=True
)
Performance Benchmarks
Inference Speed (Llama 3.1 8B, RTX 4090 24GB)
| Format | Quantization | Tokens/sec | VRAM Used |
|---|---|---|---|
| FP16 (baseline) | None | 95 tok/s | 16.2 GB |
| GGUF | Q4_K_M | 85 tok/s | 5.1 GB |
| GGUF | Q5_K_M | 78 tok/s | 5.8 GB |
| GPTQ | 4-bit 128g | 110 tok/s | 5.3 GB |
| AWQ | 4-bit | 115 tok/s | 5.0 GB |
Note: Speed figures are approximate, based on community benchmarks. GPTQ/AWQ are generally faster on GPU thanks to optimized CUDA kernels (Marlin). GGUF uses llama.cpp's backend, which prioritizes broad hardware compatibility. Actual results vary by hardware, driver version, and batch size.
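To get numbers for your own machine rather than trusting anyone's table, a crude single-request measurement is enough for format-to-format comparison. There is no warmup or batching here, so treat the result as a ballpark; the model ID is an example:
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "TheBloke/Llama-2-7B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s")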
Quality Retention (General Pattern)
The typical quality retention at 4-bit quantization across published evaluations:
| Format | Quantization | Quality Retained (vs FP16) | Key Advantage |
|---|---|---|---|
| GGUF Q4_K_M | 4-bit mixed precision | ~95-98% | K-quants give important layers more bits |
| GGUF Q5_K_M | 5-bit mixed precision | ~97-99% | Higher bit-width, near-lossless |
| GPTQ 4-bit | 4-bit uniform | ~94-96% | Optimized CUDA kernels |
| AWQ 4-bit | 4-bit adaptive | ~96-98% | Protects salient weights |
Key insight: GGUF Q4_K_M and AWQ 4-bit are very close in quality. GGUF achieves this through mixed-precision K-quants (giving important layers more bits), while AWQ does it through activation-aware weight protection. Both outperform uniform GPTQ. The exact retention percentage varies by model and task.
When to Choose Each Format
Choose GGUF When:
- You use Ollama — it only supports GGUF
- You have Apple Silicon — Metal acceleration, no CUDA needed
- You have no GPU — CPU inference works well
- You want partial GPU offload — some layers on GPU, rest on CPU
- You want the simplest setup — one file, no dependencies
- You use LM Studio or Jan — both use GGUF
Choose AWQ When:
- You run a production GPU server — vLLM + AWQ is the gold standard
- Quality is your top priority — best quality-per-bit at 4-bit
- You serve many concurrent users — AWQ kernels optimize batched inference
- You have NVIDIA GPUs — A100, H100, RTX 4090+
Choose GPTQ When:
- You use text-generation-webui — GPTQ has the longest support history
- You need 2-bit or 3-bit quantization — GPTQ supports more bit widths
- You want the widest model selection — most Hugging Face quants are GPTQ
- You need ExLlamaV2 — EXL2 format (GPTQ-based) offers fine-grained bit control
Quantization Decision Matrix
| Criteria | GGUF | GPTQ | AWQ |
|---|---|---|---|
| CPU inference | Yes | No | No |
| Apple Silicon | Yes (Metal) | No | No |
| AMD GPU | Yes (ROCm) | Limited | Limited |
| NVIDIA GPU | Yes | Yes | Yes |
| Ollama support | Yes | No | No |
| vLLM support | No | Yes | Yes |
| Quality at 4-bit | Excellent (K-quants) | Good | Best |
| Inference speed (GPU) | Good | Fast | Fastest |
| Ecosystem breadth | Widest | Wide | Growing |
| Ease of use | Easiest | Moderate | Moderate |
| Mixed precision | Yes (K-quants) | No | Yes (per-channel) |
How to Convert Between Formats
FP16 → GGUF
# Step 1: convert to FP16 GGUF (convert_hf_to_gguf.py does not emit K-quants directly)
python convert_hf_to_gguf.py ./model-directory --outtype f16 --outfile model-F16.gguf
# Step 2: quantize to Q4_K_M with llama.cpp's llama-quantize tool
./llama-quantize model-F16.gguf model-Q4_K_M.gguf Q4_K_M
FP16 → GPTQ
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
tokenizer = AutoTokenizer.from_pretrained("./model-directory")
config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained("./model-directory", config)
# Calibration data: tokenized samples representative of your workload
calibration_data = [tokenizer("Example calibration text.", return_tensors="pt")]
model.quantize(calibration_data)
model.save_quantized("./model-gptq")
FP16 → AWQ
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_pretrained("./model-directory")
tokenizer = AutoTokenizer.from_pretrained("./model-directory")
model.quantize(tokenizer, quant_config={"zero_point": True, "w_bit": 4, "q_group_size": 128, "version": "GEMM"})
model.save_quantized("./model-awq")
Most users never need to convert models themselves — pre-quantized versions of popular models are available on Hugging Face and the Ollama library.
Real-World Recommendations
For Home Users (Ollama + Open WebUI)
Stick with GGUF via Ollama. It's the simplest path:
- ollama pull llama3.1:8b gives you Q4_K_M GGUF automatically
- Works on any hardware (Mac, Windows, Linux, with or without GPU)
- No configuration needed
- See our Open WebUI setup guide for the best interface
- See our best Ollama models guide for model recommendations
For Developers (AI Coding Assistants)
Use GGUF with Ollama as the backend for Continue.dev or Cursor:
- Fast model for autocomplete: qwen2.5-coder:1.5b (GGUF, ~2 GB)
- Quality model for chat: qwen2.5-coder:7b (GGUF, ~5 GB)
For Production API Servers
Use AWQ with vLLM:
vllm serve model-name-AWQ \
--quantization awq \
--max-model-len 8192 \
--gpu-memory-utilization 0.9
This gives you the best throughput and quality for serving users at scale.
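vLLM serves an OpenAI-compatible API (port 8000 by default), so any OpenAI client works against it. The model name below is the placeholder from the serve command above:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="model-name-AWQ",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)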
For Research and Experimentation
Use GPTQ with text-generation-webui — it supports the widest range of quantization options and lets you compare formats easily.
Key Takeaways
- GGUF is the default for local use — Ollama, LM Studio, and Jan all use it exclusively
- AWQ produces the best quality at 4-bit on GPU, ideal for production servers
- GPTQ has the widest model library on Hugging Face and supports more bit widths
- Q4_K_M is the sweet spot for GGUF — 95-98% of FP16 quality at roughly 30% of the memory
- Only GGUF works without a GPU — AWQ and GPTQ require NVIDIA CUDA
- Quality differences are small — all three formats at 4-bit retain 95%+ of original quality
- Don't overthink it — if you use Ollama, you're already using the right format (GGUF)
Next Steps
- Pull the best Ollama models — all use GGUF automatically
- Set up Open WebUI for a ChatGPT-like interface
- Check VRAM requirements to size your hardware
- Compare local AI tools — all use GGUF
- Run models on 8GB RAM with optimized quantization
Quantization technology continues to improve rapidly. New methods like FP8 (for datacenter GPUs) and BitNet (1-bit models trained from scratch) are emerging. This guide covers the three formats relevant to most local AI users as of March 2026.