Falcon 40B – Technical Guide
Updated: March 13, 2026 | Originally released: May 2023
An honest technical guide to Falcon 40B from TII (Technology Innovation Institute, Abu Dhabi) -- the model that helped pioneer permissive open-source LLM licensing and quality-focused web data curation with the RefinedWeb dataset.
40B parameter decoder-only transformer. Apache 2.0 license. 2,048 token context. Real MMLU: ~56% (not the 83% sometimes claimed).
Historical Context
Falcon 40B was released in May 2023 and was briefly the #1 model on the HuggingFace Open LLM Leaderboard. It was groundbreaking for its time: one of the first high-quality open-source LLMs with a permissive license. However, it has since been significantly surpassed by newer models like Llama 3, Qwen 2.5, and Mistral. This guide provides real benchmarks and honest context for anyone still evaluating Falcon 40B.
Model Specifications
- 40B parameters: decoder-only transformer with multi-query attention
- 2K context: 2,048 tokens -- a major limitation vs modern models
- MMLU ~56%: real benchmark from the HuggingFace Open LLM Leaderboard
- Apache 2.0: truly open license -- commercial use allowed
Technical Architecture
Decoder-Only Transformer with Multi-Query Attention: Falcon 40B uses a decoder-only transformer architecture with 60 layers and 128 attention heads (head dimension 64). A key architectural innovation is multi-query attention (MQA), in which the query heads share a small number of key-value projections -- Falcon 40B uses 8 shared KV head groups rather than 128 independent ones. This significantly reduces KV-cache memory during inference and improves throughput compared to standard multi-head attention (a worked KV-cache example follows the list below).
Developed by the Technology Innovation Institute (TII) in Abu Dhabi, Falcon 40B was trained on 1 trillion tokens from the RefinedWeb dataset plus a small curated corpus (~150B tokens of books, conversations, technical content). The model uses rotary positional embeddings (RoPE) and was trained using 384 A100 80GB GPUs on AWS.
Key Architectural Details:
- Layers: 60 transformer blocks
- Attention heads: 128 query heads (head dimension 64) sharing 8 key-value head groups via multi-query attention
- Hidden dimension: 8,192
- Vocabulary: 65,024 tokens (custom BPE tokenizer)
- Context window: 2,048 tokens (no RoPE scaling variants released)
- Positional encoding: Rotary Position Embeddings (RoPE)
- Parallelized attention + MLP: Attention and FFN computed in parallel for faster training
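To make the MQA benefit concrete, here is a back-of-the-envelope KV-cache calculation (a minimal sketch: layer, head, and dimension counts follow the published config; the full multi-head comparison point is a hypothetical variant, not a released model):

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/element
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

seq = 2048  # Falcon 40B's full context window
mqa = kv_cache_bytes(n_layers=60, n_kv_heads=8,   head_dim=64, seq_len=seq)
mha = kv_cache_bytes(n_layers=60, n_kv_heads=128, head_dim=64, seq_len=seq)  # hypothetical full MHA

print(f"MQA KV cache: {mqa / 2**20:.0f} MiB")  # ~240 MiB
print(f"MHA KV cache: {mha / 2**20:.0f} MiB")  # ~3840 MiB, 16x larger
```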
RefinedWeb: The Key Innovation
Falcon 40B's most important contribution to the field was not the model itself, but the RefinedWeb dataset and the research showing that properly filtered web data alone could match or exceed curated datasets. This finding influenced every subsequent LLM training run.
The RefinedWeb Pipeline
1. CommonCrawl extraction: Starting from raw web crawl data
2. URL filtering: Remove adult, spam, and low-quality domains
3. Text extraction: trafilatura-based content extraction
4. Language identification: fastText classifier for English filtering
5. Quality filtering: Perplexity-based and heuristic filters
6. Deduplication: MinHash fuzzy deduplication at massive scale (see the sketch below)
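To illustrate step 6, here is a minimal fuzzy-deduplication sketch using the `datasketch` library. This is not TII's production pipeline (which ran MinHash at multi-billion-document scale); the shingle size and similarity threshold are illustrative choices:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i+5] for i in range(len(text) - 4)}:  # character 5-gram shingles
        m.update(shingle.encode("utf-8"))
    return m

docs = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on the mat.",   # near-duplicate of "a"
    "c": "completely different text",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for doc_id, text in docs.items():
    m = minhash(text)
    if not lsh.query(m):       # no near-duplicate indexed yet -> keep this document
        lsh.insert(doc_id, m)
        kept.append(doc_id)

print(kept)  # ['a', 'c'] -- 'b' dropped as a near-duplicate of 'a'
```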
Why It Mattered
- ~5 trillion tokens of filtered English web data produced, with a 600B-token public extract released
- Proved web-only data can match curated training mixes
- Paper: "The RefinedWeb Dataset for Falcon LLM" (Penedo et al., 2023)
- Influenced Llama 3, Qwen, and other subsequent training runs
- Public extract released on HuggingFace (see the loading sketch below)
- One of the first large-scale open training datasets
Source: Penedo et al., "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only", NeurIPS 2023 Datasets and Benchmarks Track. arXiv:2306.01116
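You can inspect the public extract directly. A minimal sketch with the HuggingFace `datasets` library, using streaming so the very large corpus is not downloaded in full (the dataset ID and `content` field name follow the dataset card -- verify before relying on them):

```python
from datasets import load_dataset

# Stream the public RefinedWeb extract rather than downloading it outright
ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample["content"][:200])  # raw document text
    if i >= 2:
        break
```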
Real Benchmarks
HuggingFace Open LLM Leaderboard Results
These are real scores from the HuggingFace Open LLM Leaderboard (v1), not marketing claims. Falcon 40B was tested with standard evaluation protocols.
| Benchmark | Falcon 40B | Llama 2 13B | Yi-34B | Mistral 7B |
|---|---|---|---|---|
| MMLU (5-shot) | ~55.4% | ~55.8% | ~76.1% | ~62.5% |
| HellaSwag (10-shot) | ~83.6% | ~80.7% | ~85.7% | ~83.3% |
| ARC-Challenge (25-shot) | ~57.7% | ~59.4% | ~65.4% | ~61.7% |
| TruthfulQA (0-shot) | ~43.5% | ~36.4% | ~56.2% | ~42.2% |
| Winogrande (5-shot) | ~77.7% | ~74.8% | ~81.4% | ~78.4% |
Source: HuggingFace Open LLM Leaderboard (v1). Falcon 40B (base, not instruct). Note: Falcon 40B-Instruct scored slightly differently on some benchmarks. Yi-34B and Mistral 7B were released months later with improved training techniques.
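To reproduce numbers in this range yourself, EleutherAI's lm-evaluation-harness is the standard tool. A sketch of the invocation (harness versions and task definitions have changed over time, so scores may not exactly match the leaderboard's pinned v1 setup):

```bash
pip install lm-eval

# 5-shot MMLU on the base model; needs enough VRAM to hold the weights in BF16
lm_eval --model hf \
  --model_args pretrained=tiiuae/falcon-40b,dtype=bfloat16 \
  --tasks mmlu --num_fewshot 5 --batch_size 1
```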
VRAM by Quantization
Falcon 40B at full precision requires ~80GB VRAM. Quantization makes it runnable on consumer hardware. GGUF quantized versions are available for llama.cpp and Ollama.
| Quantization | VRAM (approx) | File Size | Quality Loss | Best For |
|---|---|---|---|---|
| FP16 | ~80GB | ~80GB | None | A100 80GB / multi-GPU |
| Q8_0 | ~42GB | ~42GB | Minimal | A6000 48GB / dual GPU |
| Q4_K_M | ~24GB | ~23GB | Moderate | RTX 4090 / 3090 (recommended) |
| Q4_K_S | ~22GB | ~22GB | Moderate | RTX 4090 / 3090 |
| Q3_K_M | ~19GB | ~18GB | Noticeable | RTX 3090 (tight fit) |
| Q2_K | ~16GB | ~15GB | Significant | RTX 4080 16GB (degraded) |
VRAM estimates include KV cache overhead for short contexts. Actual usage varies by prompt length and batch size. Q4_K_M is the recommended balance of quality and VRAM.
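The file sizes in the table follow from simple arithmetic: parameter count times effective bits per weight. A quick sketch (the bits-per-weight figures are approximations for each GGUF scheme, so outputs differ slightly from actual files):

```python
PARAMS = 40e9  # Falcon 40B parameter count

# Approximate effective bits per weight for common GGUF schemes
schemes = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8, "Q3_K_M": 3.7, "Q2_K": 3.0}

for name, bits in schemes.items():
    print(f"{name}: ~{PARAMS * bits / 8 / 1e9:.0f} GB on disk")
```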
Hardware Requirements
Minimum (Q4_K_M Quantized): a 24GB VRAM GPU (RTX 3090, RTX 4090, or A6000), at least 32GB of system RAM, and roughly 25GB of disk space for the model file.
Full Precision (FP16): 80GB+ of VRAM -- a single A100 80GB or a dual-GPU setup such as two A6000 48GB cards -- to hold the ~80GB of weights.
CPU-Only / Apple Silicon Note
Falcon 40B can run on CPU-only or Apple Silicon (M1 Pro/Max/Ultra, M2/M3/M4 with 32GB+ unified memory) via llama.cpp/Ollama, but expect very slow inference (1-5 tok/s). For Apple Silicon, Q4_K_M quantization fits in 32GB unified memory with adequate performance for non-interactive use. For interactive use on Apple Silicon, consider smaller models like Mistral 7B or Llama 3.2 3B instead.
Installation Guide
Ollama (Recommended)
Step 1: Install Ollama
On Linux and macOS, use the install script below. On Windows, download the installer from ollama.com. On macOS, Ollama is also available via Homebrew: brew install ollama
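The documented one-line installer (check ollama.com in case the script location has changed):

```bash
curl -fsSL https://ollama.com/install.sh | sh
```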
Step 2: Pull Falcon 40B
Download size: ~22GB (Q4_K_M). Verify available tags at ollama.com/library/falcon. Note: Ollama may not carry all Falcon variants -- check availability first.
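Assuming the 40B tag is published under the `falcon` library entry (verify as noted above):

```bash
ollama pull falcon:40b
```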
Step 3: Run the Model
This starts an interactive chat session. First run will be slower as the model loads into VRAM.
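Start the interactive session with the same tag:

```bash
ollama run falcon:40b
```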
Step 4: API Access (Optional)
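Ollama serves a local REST API on port 11434 by default. A minimal generation request (the prompt is illustrative):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "falcon:40b",
  "prompt": "Explain multi-query attention in one paragraph.",
  "stream": false
}'
```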
Alternative: llama.cpp (GGUF)
For more control over quantization and inference parameters, use llama.cpp directly with GGUF files from HuggingFace (e.g., TheBloke/Falcon-40B-GGUF).
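A minimal sketch, assuming the GGUF filename follows TheBloke's usual naming convention -- verify the exact filename in the HuggingFace repo's file list before downloading:

```bash
# Fetch a Q4_K_M build (filename is an assumption; check the repo)
huggingface-cli download TheBloke/Falcon-40B-GGUF falcon-40b.Q4_K_M.gguf --local-dir .

# -ngl 99 offloads all layers to the GPU; -c 2048 matches Falcon's context limit
./llama-cli -m falcon-40b.Q4_K_M.gguf -ngl 99 -c 2048 -p "The RefinedWeb dataset is"
```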
Local Alternatives (2026)
Falcon 40B was groundbreaking in 2023, but newer models deliver significantly better performance at equal or lower VRAM requirements. If you are choosing a model today, consider these alternatives.
| Model | Parameters | MMLU | Context | VRAM (Q4) | License |
|---|---|---|---|---|---|
| Falcon 40B | 40B | ~56% | 2K | ~24GB | Apache 2.0 |
| Qwen 2.5 32B | 32B | ~83% | 128K | ~20GB | Apache 2.0 |
| Llama 3.1 70B | 70B | ~79% | 128K | ~40GB | Llama 3.1 Community |
| Mistral 7B | 7B | ~63% | 32K | ~5GB | Apache 2.0 |
| Yi-34B | 34B | ~76% | 200K | ~20GB | Yi License |
| Mixtral 8x7B | 46.7B (MoE) | ~70% | 32K | ~26GB | Apache 2.0 |
Recommendation: For the same 24GB VRAM budget, Qwen 2.5 32B delivers ~83% MMLU with 128K context vs Falcon 40B's ~56% MMLU with 2K context. Mistral 7B at just 5GB VRAM already exceeds Falcon 40B on MMLU.
Honest Assessment: Falcon 40B in 2026
Why Falcon 40B Mattered
- Among the first truly open LLMs at this scale: Apache 2.0 license before Meta, Mistral, or others followed
- RefinedWeb dataset: Proved web data curation could match curated corpora
- #1 on HF Leaderboard: Briefly topped the Open LLM Leaderboard in June 2023
- Multi-query attention: Efficient inference architecture adopted by later models
- Non-US origin: Showed world-class AI could come from Abu Dhabi/UAE
- Open dataset: RefinedWeb was one of the first large open training datasets
Limitations in 2026
- 2,048 token context: Severely limiting vs 128K+ in modern models
- MMLU ~56%: Now matched or exceeded by 7B models (Mistral 7B: ~63%)
- No instruction tuning ecosystem: Limited chat/instruct variants compared to Llama
- 24GB VRAM for Q4: Same VRAM budget runs much better models now
- No code specialization: Poor at code generation vs dedicated code models
- Falcon 2 never gained traction: TII's follow-up (Falcon 2, 2024) saw limited adoption
Should You Use Falcon 40B Today?
For new deployments: No. Falcon 40B has been comprehensively surpassed. A 7B Mistral model uses 5x less VRAM and scores higher on MMLU. Qwen 2.5 32B uses similar VRAM but with 64x more context and ~27 points higher MMLU.
For existing deployments: If you have Falcon 40B in production and it works for your use case, migration is not urgent but recommended when convenient. Test Qwen 2.5 32B or Llama 3.1 8B as drop-in replacements.
For research/education: Falcon 40B remains historically interesting as the model that kickstarted the open-source LLM licensing revolution and demonstrated the power of web data curation. The RefinedWeb paper is still widely cited.
Frequently Asked Questions
What hardware do I need to run Falcon 40B?
At Q4_K_M quantization (recommended), you need 24GB VRAM -- an RTX 4090, RTX 3090, or A6000. System RAM should be 32GB minimum. For full FP16, you need 80GB+ VRAM (A100 or dual A6000). Apple Silicon Macs with 32GB+ unified memory can run Q4_K_M but at reduced speed (1-5 tok/s).
What is Falcon 40B's real MMLU score?
Falcon 40B scores approximately 55-57% on MMLU (5-shot) according to the HuggingFace Open LLM Leaderboard. Claims of 83% MMLU are incorrect and likely confused with HellaSwag (~83.6%) or inflated marketing numbers. For context, Mistral 7B (a much smaller model) scores ~63% on MMLU.
What makes the RefinedWeb dataset special?
RefinedWeb was one of the first demonstrations that carefully filtered web data alone could produce competitive LLMs. The pipeline uses URL filtering, quality scoring, and MinHash deduplication on CommonCrawl data. TII released 600B tokens publicly and used 1T+ tokens for training. The paper (Penedo et al., NeurIPS 2023) influenced subsequent training runs for Llama 3, Qwen, and others.
Is Falcon 40B still worth using in 2026?
For new deployments, no. Falcon 40B has been surpassed by newer models that deliver better performance at equal or lower VRAM requirements. Qwen 2.5 32B uses similar VRAM but scores ~83% MMLU with 128K context. Mistral 7B uses 5x less VRAM and still scores higher on MMLU. Falcon 40B remains historically important as a pioneering open-source LLM.
What is the license for Falcon 40B?
Falcon 40B uses the Apache 2.0 license. It was originally released under a custom "Falcon License" (May 2023) that required royalties for high-revenue commercial use. After community pushback, TII re-licensed it to Apache 2.0 in June 2023. This was one of the first major open-source LLMs with a fully permissive commercial license.
Resources & Further Reading
Official Sources
- Falcon 40B model weights: huggingface.co/tiiuae/falcon-40b
- RefinedWeb dataset extract: huggingface.co/datasets/tiiuae/falcon-refinedweb
Research Papers
- Penedo et al., "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only", NeurIPS 2023 Datasets and Benchmarks Track. arXiv:2306.01116
Deployment Tools
- Ollama: ollama.com/library/falcon
- llama.cpp GGUF builds: huggingface.co/TheBloke/Falcon-40B-GGUF
Written by Pattanaik Ramswarup
Creator of Local AI Master
I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.