Llama 3.3 70B
Meta's most capable open-weight model. 86.0% MMLU, 81.7% HumanEval, 128K context. Runs locally with Ollama on 48GB+ hardware.
One-Command Install
$ ollama pull llama3.3:70b
# Downloads ~40GB, takes 10-30 min
$ ollama run llama3.3:70b
# Start chatting immediately
What's New in Llama 3.3 vs 3.1
Llama 3.3 70B, released December 2024, is Meta's latest instruction-tuned model. It replaces Llama 3.1 70B as the default "big" open-weight model for local and production use.
| Metric | Llama 3.1 70B | Llama 3.3 70B | Improvement |
|---|---|---|---|
| MMLU | 79.3% | 86.0% | +6.7 pts |
| HumanEval (coding) | 72.6% | 81.7% | +9.1 pts |
| GSM8K (math) | 83.7% | 91.1% | +7.4 pts |
| MATH | 51.9% | 68.0% | +16.1 pts |
| Context Window | 128K | 128K | Same |
| VRAM (Q4_K_M) | ~40GB | ~40GB | Same |
| Instruction Following | Good | Significantly better | Improved |
Key improvements: stronger reasoning, better coding, reduced hallucination, more consistent instruction following. Same hardware requirements as Llama 3.1 70B — a direct upgrade. Scores from official Meta model cards; verify latest figures on Hugging Face.
Technical Specifications
Architecture
| Spec | Value |
|---|---|
| Parameters | 70.6 billion |
| Architecture | Dense Transformer (decoder-only) |
| Layers | 80 |
| Hidden Size | 8,192 |
| Attention Heads | 64 (8 KV heads, GQA) |
| Vocabulary | 128,256 tokens (tiktoken) |
| Context Window | 128,000 tokens |
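These specs are enough to sanity-check the headline parameter count. A back-of-the-envelope sketch in Python, assuming the standard Llama 70B layout (SwiGLU MLP with intermediate size 28,672 and untied input/output embeddings; neither figure is listed in the table above, and small norm weights are ignored):

```python
# Rough parameter count for Llama 3.3 70B from the spec table.
layers, hidden, vocab = 80, 8192, 128256
heads, kv_heads = 64, 8
head_dim = hidden // heads          # 128
kv_dim = kv_heads * head_dim        # 1024 (grouped-query attention)
intermediate = 28672                # Llama-70B MLP width (assumed, not in the table)

attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q, O + K, V projections
mlp = 3 * hidden * intermediate                    # gate, up, down
embeddings = 2 * vocab * hidden                    # input + output (untied)

total = layers * (attn + mlp) + embeddings
print(f"{total / 1e9:.1f}B parameters")            # ≈ 70.6B
```

The grouped-query attention (8 KV heads instead of 64) is what keeps the K/V projections — and the KV cache at long contexts — small relative to a fully multi-head design.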
Training
| Spec | Value |
|---|---|
| Developer | Meta AI |
| Release | December 2024 |
| Training Data | 15+ trillion tokens |
| Data Cutoff | December 2023 |
| Alignment | RLHF + DPO |
| License | Llama 3.3 Community License |
| Commercial Use | Yes (special license required above 700M monthly active users) |
Benchmark Performance
Benchmark scores from Meta's official model card. MMLU = general knowledge, HumanEval = Python coding, GSM8K = grade-school math, MATH = competition math, ARC = science reasoning. Speed estimates are community-reported and vary by system configuration.
Quick Start: Run Llama 3.3 70B with Ollama
Step-by-Step
Install Ollama
- macOS: brew install ollama
- Linux: curl -fsSL https://ollama.com/install.sh | sh
- Windows: download the installer from ollama.com
Pull the model (~40GB download)
ollama pull llama3.3:70b
Start chatting
ollama run llama3.3:70b
Optional: Add a web interface
See our Open WebUI setup guide for a ChatGPT-like interface
Hardware Requirements
VRAM by Quantization
| Hardware | Quantization | Speed | Experience |
|---|---|---|---|
| RTX 5090 (32GB) | Q3_K_M | ~22 tok/s | Good (slight quality loss; tight fit, may need reduced context) |
| 2x RTX 4090 (48GB) | Q4_K_M | ~18 tok/s | Excellent |
| Mac M4 Max 64GB | Q4_K_M | ~18 tok/s | Excellent |
| Mac M2 Ultra 128GB | Q5_K_M | ~15 tok/s | Excellent (higher quality) |
| A100 80GB | Q4_K_M | ~35 tok/s | Production-ready |
| RTX 4090 (24GB) + CPU offload | Q4_K_M (split) | ~8 tok/s | Usable but slow |
Llama 3.3 70B vs Competitors
vs Cloud APIs (GPT-4o, Claude)
- + Free — no per-token costs
- + Full privacy — data stays local
- + No rate limits or downtime
- − Requires expensive hardware
- − Slower than cloud for complex tasks
- − Weaker on creative writing, nuance
vs Smaller Local Models (8B-32B)
- + Dramatically better reasoning
- + Fewer hallucinations
- + Better at following complex instructions
- − 5-8x more VRAM required
- − 3-5x slower inference
- − Cannot run on most consumer hardware
Best Use Cases for Llama 3.3 70B
Excels At
- Code generation — 81.7% HumanEval, strong across Python/JS/Rust
- Math & reasoning — 91.1% GSM8K, competitive with o1-mini
- Document analysis — 128K context handles long documents
- Summarization — Reliable, factual summaries
- Enterprise Q&A — Pair with RAG for internal knowledge bases
- Multi-turn conversation — Good memory across long chats
Consider Alternatives For
- Real-time chat — Use 8B models for faster responses
- Multilingual — Qwen 2.5 72B is better for CJK languages
- Coding autocomplete — Qwen 2.5 Coder 1.5B is faster for tab completion
- Chain-of-thought reasoning — DeepSeek R1 32B shows thinking process
- 8GB hardware — Use 8B models instead
Quantization Guide
Ollama defaults to Q4_K_M, which is the best balance of quality and memory. Here are your options:
| Quantization | Download Size | VRAM | Quality | Ollama Command |
|---|---|---|---|---|
| Q2_K | ~25 GB | ~27 GB | Poor (not recommended) | ollama pull llama3.3:70b-instruct-q2_K |
| Q3_K_M | ~31 GB | ~33 GB | Acceptable | ollama pull llama3.3:70b-instruct-q3_K_M |
| Q4_K_M (default) | ~40 GB | ~42 GB | Excellent (~95-98% of full-precision quality) | ollama pull llama3.3:70b |
| Q5_K_M | ~46 GB | ~48 GB | Near-lossless | ollama pull llama3.3:70b-instruct-q5_K_M |
| Q8_0 | ~70 GB | ~72 GB | Essentially lossless | ollama pull llama3.3:70b-instruct-q8_0 |
For a detailed explanation of quantization formats, see our AWQ vs GPTQ vs GGUF comparison.
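The download sizes above follow almost directly from each format's effective bits per weight. A rough estimator — the bits-per-weight values here are approximate community figures for GGUF k-quants, not official numbers, so expect results within a gigabyte or two of the table:

```python
# Estimate GGUF file size from effective bits per weight.
PARAMS = 70.6e9  # Llama 3.3 70B parameter count

def size_gib(bits_per_weight: float) -> float:
    """Approximate weight-file size in GiB."""
    return PARAMS * bits_per_weight / 8 / 2**30

# bpw values are approximate, not official
for quant, bpw in [("Q2_K", 2.96), ("Q3_K_M", 3.89), ("Q4_K_M", 4.85),
                   ("Q5_K_M", 5.69), ("Q8_0", 8.5)]:
    print(f"{quant}: ~{size_gib(bpw):.0f} GiB")
```

This is why Q4_K_M lands near 40 GB: roughly 4.85 effective bits per weight across 70.6B parameters, plus a little metadata.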
Advanced Setup
Custom Modelfile
Create a tuned version with custom parameters:
# Save as Modelfile
FROM llama3.3:70b
PARAMETER temperature 0.7
PARAMETER num_ctx 16384
PARAMETER top_p 0.9
SYSTEM "You are a helpful technical assistant. Be concise and accurate."
# Build: ollama create my-llama -f Modelfile
# Run: ollama run my-llama
API Access
Use Ollama's OpenAI-compatible API:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3.3:70b", "messages": [{"role": "user", "content": "Hello"}]}'
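The same request from Python, using only the standard library. A minimal sketch assuming Ollama is serving on its default port 11434 (the commented-out call at the bottom requires a running instance):

```python
# Call Ollama's OpenAI-compatible chat endpoint from Python (stdlib only).
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "llama3.3:70b",
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    """Send a single-turn chat message and return the model's reply."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# print(chat("Hello"))   # requires a running Ollama instance
```

Because the endpoint is OpenAI-compatible, the official openai Python client also works: point its base_url at http://localhost:11434/v1 and pass any non-empty api_key.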
Pair with a Web Interface
For the best experience, pair Llama 3.3 70B with Open WebUI for a full ChatGPT-like interface, or use it as the backend for Continue.dev for AI-assisted coding.
Written by Pattanaik Ramswarup
Creator of Local AI Master
I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.