RecurrentGemma-9B: Griffin Architecture Review
Google DeepMind's hybrid recurrent + local attention model with O(1) inference memory
RecurrentGemma-9B at a Glance
Architecture: Griffin (linear recurrences + local attention)
Parameters: 9 billion
Context Window: 8,192 tokens
MMLU: ~56% (from Google's technical report)
Key Innovation: O(1) memory during inference (constant KV state)
License: Gemma Terms of Use (not Apache 2.0)
Released: April 2024 by Google DeepMind
Ollama: ollama run recurrentgemma:9b
Honest take: RecurrentGemma-9B is architecturally interesting but scores lower than standard Gemma-7B on most benchmarks (~56% vs ~64% MMLU). Its value is in the Griffin architecture innovation (constant inference memory), not raw quality. If you need the best 7-9B model for general tasks, Llama 3 8B or Gemma 2 9B are stronger. RecurrentGemma is worth trying if you care about recurrent architectures or want to experiment with alternatives to full attention.
Griffin Architecture Deep Dive
RecurrentGemma-9B uses the Griffin architecture, introduced in the paper "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models" (De et al., 2024). Griffin is a hybrid that replaces most of the standard transformer's global self-attention layers with gated linear recurrence layers, while keeping a few local sliding-window attention layers.
How Griffin Works (Layer by Layer)
Griffin Block Pattern (repeating):
+------------------------------------+
| RG-LRU (Recurrent Block) | -- O(1) memory, processes sequentially
+------------------------------------+
| MLP (Feed-Forward) | -- Standard transformation
+------------------------------------+
| RG-LRU (Recurrent Block) | -- Another recurrent layer
+------------------------------------+
| Local Attention (sliding window) | -- 2048-token local window, NOT global
+------------------------------------+
Key: RG-LRU = Real-Gated Linear Recurrent Unit
Local Attention = Multi-Query Attention over 2048-token window only
RG-LRU (Recurrent Layers)
- Maintains a fixed-size hidden state
- Processes tokens one at a time, updating state
- O(1) memory per layer regardless of sequence length
- Uses input and forget gates (similar to LSTM concept)
- Can be parallelized during training via scan operations
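The RG-LRU update can be sketched in scalar form. This is an illustrative sketch of the recurrence described in the Griffin paper, with gate values passed in directly rather than computed from learned projections; the constants are placeholders, not the released model's parameters:

```python
import math

def rg_lru_step(h, x, recurrence_gate, input_gate, a=0.95, c=8.0):
    """One RG-LRU update for a single channel (scalar, illustrative).

    h: previous hidden state value
    x: current input value
    recurrence_gate, input_gate: gate activations in (0, 1), normally
        produced by sigmoid projections of the input
    a: base decay (sigmoid of a learned parameter Lambda in the paper)
    c: scalar constant (the Griffin paper uses c = 8)
    """
    # Effective decay a_t = a^(c * r_t): the gate controls how quickly
    # old state fades (r_t -> 0 pushes a_t -> 1, i.e. retain state).
    a_t = a ** (c * recurrence_gate)
    # sqrt(1 - a_t^2) normalizes the input so the state stays bounded.
    return a_t * h + math.sqrt(1.0 - a_t ** 2) * (input_gate * x)

# The state is one float per channel: O(1) in sequence length.
h = 0.0
for x in [1.0, -0.5, 2.0, 0.25]:
    h = rg_lru_step(h, x, recurrence_gate=0.5, input_gate=0.9)
```

With the input zeroed, the state simply decays by a_t per step, which is the "forgetting" behavior that makes long-ago details fade.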
Local Attention Layers
- Sliding window of 2048 tokens
- Uses Multi-Query Attention (MQA) for efficiency
- Provides sharp, precise attention on recent tokens
- Complements the recurrent layers' long-range state
- Only a few attention layers (most are recurrent)
The key innovation is the combination: recurrent layers capture long-range dependencies in a compressed state, while local attention layers provide precise, fine-grained processing of the most recent tokens. This is fundamentally different from full transformers (which attend to all tokens globally) and from pure RNNs (which have no attention at all).
Why O(1) inference memory matters: In standard transformers, the KV cache grows linearly with sequence length. A 9B transformer processing 8K tokens might use 2-4GB just for the KV cache. Griffin's recurrent state is fixed-size -- it doesn't grow as you generate more tokens. This means inference memory stays constant no matter how many tokens you've generated. The practical benefit: more predictable memory usage and no KV cache OOM issues.
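To make that scaling concrete, here is a back-of-envelope comparison. The layer counts, head counts, and state dimensions below are hypothetical (chosen to roughly match the sizes quoted above), not the released model configs:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_value=2):
    """Bytes of KV cache for a standard transformer (fp16 = 2 bytes).

    Dimensions are hypothetical, roughly a 7-9B model without
    grouped-query attention; real models vary.
    """
    # 2x for keys AND values, one entry per layer, head, and position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def griffin_state_bytes(n_rec_layers=26, state_dim=4096, window=2048,
                        n_attn_layers=6, n_kv_heads=1, head_dim=128,
                        bytes_per_value=2):
    """Griffin's inference state: a fixed recurrent state plus a bounded
    local-attention window (MQA = 1 KV head). Note there is no seq_len
    argument -- nothing here grows with total sequence length.
    (Layer counts and dims are illustrative, not the released config.)"""
    recurrent = n_rec_layers * state_dim * bytes_per_value
    local_kv = 2 * n_attn_layers * n_kv_heads * head_dim * window * bytes_per_value
    return recurrent + local_kv

for seq in (1024, 8192, 65536):
    print(f"{seq:6d} tokens: transformer KV {kv_cache_bytes(seq)/2**30:.2f} GiB, "
          f"griffin state {griffin_state_bytes()/2**20:.1f} MiB")
```

Under these assumptions the transformer KV cache hits 4 GiB at 8K tokens and keeps growing linearly, while the Griffin-style state stays a few MiB regardless of length.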
However, "constant memory" does not mean "infinite perfect recall." The fixed-size state compresses information lossily. Details from thousands of tokens ago may be forgotten or blurred, similar to how RNNs lose older information. The 8,192-token context window is the designed operating range, not a theoretical limit on memory.
Real Benchmarks and Performance
RecurrentGemma-9B scores lower than standard Gemma models on most benchmarks. This is the honest tradeoff: the Griffin architecture gains memory efficiency but loses some quality compared to full-attention transformers of similar size.
Benchmark Results (Source: Google DeepMind Technical Report)
| Benchmark | RecurrentGemma-9B | Gemma-7B | Mistral-7B |
|---|---|---|---|
| MMLU (5-shot) | ~56% | ~64% | ~60% |
| HellaSwag | ~73% | ~81% | ~81% |
| PIQA | ~79% | ~81% | ~82% |
| Winogrande | ~68% | ~74% | ~74% |
| ARC-Challenge | ~48% | ~53% | ~54% |
Source: Google DeepMind RecurrentGemma technical report (April 2024), Gemma technical report, Mistral AI. Scores are approximate.
Key takeaway: RecurrentGemma-9B consistently scores 5-10 percentage points below standard Gemma-7B across benchmarks. This is the quality cost of replacing global attention with linear recurrences. The model is better understood as an architecture research release than as a production-ready competitor to Llama 3 or Mistral.
VRAM Requirements by Quantization
RecurrentGemma-9B is comparable to other 7-9B models in size. The recurrent architecture doesn't significantly change the model weight size -- the O(1) memory advantage applies to the inference KV cache, not the model weights themselves.
VRAM by Quantization Level
| Quantization | Model Size | VRAM (GPU) | RAM (CPU) | Quality Impact |
|---|---|---|---|---|
| FP16 (full) | ~18GB | ~20GB | ~20GB | None (baseline) |
| Q8_0 | ~9.5GB | ~11GB | ~12GB | Minimal |
| Q5_K_M | ~6.5GB | ~8GB | ~9GB | Small |
| Q4_K_M (recommended) | ~5.5GB | ~7GB | ~8GB | Moderate |
| Q2_K | ~3.8GB | ~5GB | ~6GB | Significant |
Estimated based on 9B parameter model weights. Actual VRAM includes model + activations + overhead.
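The file sizes in the table follow directly from parameter count times effective bits per weight. A quick sketch, using approximate bits-per-weight figures for common llama.cpp quantization formats (exact figures vary by format version, and metadata and higher-precision embedding layers add a little on top):

```python
def model_file_gb(n_params, bits_per_weight):
    """Approximate on-disk size: parameters x effective bits per weight,
    ignoring quantization metadata and tensors kept at higher precision."""
    return n_params * bits_per_weight / 8 / 1e9

# Effective bits-per-weight are approximations for llama.cpp formats.
for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q5_K_M", 5.7),
                   ("Q4_K_M", 4.8), ("Q2_K", 3.4)]:
    print(f"{name:7s} ~{model_file_gb(9e9, bits):.1f} GB")
```

For a 9B model this reproduces the table above: 18 GB at FP16 down to roughly 4-6 GB at the 4- and 5-bit levels.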
The real memory advantage: While model weight VRAM is similar to other 9B models, RecurrentGemma's inference KV cache stays constant. For standard transformers, the KV cache for an 8K context 9B model adds ~1-2GB. This matters more at longer sequences and larger batch sizes, but for single-user local inference at 8K context, the practical difference is modest.
Griffin vs RWKV vs Mamba vs Transformers
RecurrentGemma's Griffin architecture belongs to a family of "efficient attention alternatives" that emerged in 2023-2024. Here's how they compare.
Architecture Comparison
| Feature | Griffin (RecurrentGemma) | RWKV-v5/v6 | Mamba (SSM) | Standard Transformer |
|---|---|---|---|---|
| Attention | Local sliding window | None (pure recurrent) | None (SSM) | Full global |
| Recurrence | RG-LRU (gated) | WKV (custom) | Structured SSM | None |
| Inference Memory | O(1) + local window | O(1) | O(1) | O(n) KV cache |
| Training Parallelism | Good (scan + attention) | Good (parallel scan) | Good (parallel scan) | Excellent (full parallel) |
| Quality at 7-9B scale | ~56% MMLU | ~47% MMLU (v5-7B) | ~55% MMLU (est.) | ~60-66% MMLU |
| Community/Ecosystem | Limited (Google only) | Active community | Growing (research) | Massive (Llama, etc.) |
What Makes Griffin Different from RWKV?
Both Griffin and RWKV achieve O(1) inference memory through recurrence, but they differ in a key way: Griffin keeps a few local attention layers (sliding window over the most recent ~2048 tokens). This hybrid approach means Griffin can precisely reference recent tokens via attention while using its recurrent state for older context. RWKV is purely recurrent with no attention at all.
The result: Griffin tends to perform better on tasks requiring precise reference to recent context (like multi-turn conversation), while RWKV has a simpler, more elegant architecture. Neither matches the quality of full-attention transformers at equivalent scale -- that's the fundamental tradeoff all these models make.
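The sliding-window constraint is easy to visualize as an attention mask. A toy example with a 3-token window over 6 positions (the real model uses a 2048-token window):

```python
def local_causal_mask(seq_len, window):
    """True where query position q may attend to key position k:
    causal (k <= q) and within the sliding window (q - k < window)."""
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]

mask = local_causal_mask(seq_len=6, window=3)
for row in mask:
    # 'x' = attended, '.' = masked out; each row is one query position.
    print("".join("x" if m else "." for m in row))
```

Each query sees at most `window` recent keys, so the attention KV cache is bounded at the window size; everything older is visible only through the recurrent state.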
Ollama Setup Guide
RecurrentGemma is available on Ollama as recurrentgemma. Setup is straightforward -- no special flags or environment variables needed.
Note: The Ollama tag is recurrentgemma:9b. Check ollama.com/library/recurrentgemma for the latest available tags.
Alternative: Using HuggingFace + Transformers
```python
# pip install transformers torch accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/recurrentgemma-9b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # requires accelerate; spreads weights across available devices
    torch_dtype="auto",  # load in the checkpoint's native dtype
)

inputs = tokenizer("Explain the Griffin architecture:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Requires ~20GB VRAM for FP16, or use bitsandbytes for 4-bit quantization.
System Requirements
1. Install Ollama -- download Ollama for your platform from ollama.com.
2. Pull RecurrentGemma -- ollama pull recurrentgemma:9b downloads the 9B model (~5.5GB Q4 quantized).
3. Run the model -- ollama run recurrentgemma:9b starts an interactive chat session.
Honest Strengths and Limitations
Strengths
- O(1) inference memory: KV cache doesn't grow with sequence length. Predictable memory usage.
- Architecture innovation: Demonstrates that hybrid recurrent+attention models are viable at 9B scale.
- Efficient long generation: Generating long outputs doesn't increase memory pressure like transformers.
- Research value: Important for understanding alternatives to full attention.
- Runs on consumer hardware: Q4 fits in 8GB VRAM, similar to other 7-9B models.
Limitations
- Lower quality: ~56% MMLU vs ~64% for Gemma-7B. Measurably weaker on most benchmarks.
- Small context window: 8K tokens -- smaller than Llama 3.1's 128K or Mistral's 32K.
- Lossy compression: The recurrent state compresses information. Old details may be lost or blurred.
- Limited ecosystem: Few fine-tuned variants, community tools, or GGUF options compared to Llama/Mistral.
- Limited instruction tuning: An instruct version (recurrentgemma-9b-it) exists, but it is far less battle-tested than mainstream instruct models.
- Gemma license: Not Apache 2.0. Has usage restrictions for large-scale commercial deployment.
Common misconception: "RecurrentGemma can process infinite sequences with perfect memory." This is misleading. While the inference memory is technically constant (O(1)), the model has an 8K context window and the recurrent state compresses information lossily. It does NOT maintain "perfect recall" of everything it has processed. Think of it like a human summarizing a book in their head -- the gist is there but fine details fade.
Local Model Comparison (Q4 quantized, approximate)
| Model | Size | RAM Required | Speed | MMLU | Cost/Month |
|---|---|---|---|---|---|
| RecurrentGemma-9B (Q4) | ~5.5GB | 8GB | ~20 tok/s | 56% | Free |
| Gemma-7B (Q4) | ~4.3GB | 8GB | ~25 tok/s | 64% | Free |
| Llama-3-8B (Q4) | ~4.7GB | 8GB | ~30 tok/s | 66% | Free |
| Mistral-7B (Q4) | ~4.1GB | 8GB | ~28 tok/s | 60% | Free |
Local AI Alternatives
Unless you specifically want to experiment with recurrent architectures, these models offer better quality per parameter for local use.
Recommended Alternatives (7-9B range)
| Model | MMLU | Context | Ollama Command | Best For |
|---|---|---|---|---|
| Llama 3.1 8B | ~66% | 128K | ollama run llama3.1:8b | General use, best quality |
| Gemma 2 9B | ~71% | 8K | ollama run gemma2:9b | Highest quality in class |
| Mistral 7B v0.3 | ~60% | 32K | ollama run mistral:7b | Good balance, long context |
| RWKV-v6 7B | ~50% | Unlimited* | Via rwkv.cpp | Pure recurrent, no attention |
| Qwen 2.5 7B | ~68% | 128K | ollama run qwen2.5:7b | Strong all-rounder |
*RWKV has theoretical unlimited context but quality degrades significantly past training window.
When to Choose RecurrentGemma
- You're researching recurrent vs attention architectures
- You need predictable/constant inference memory (e.g., embedded or edge deployment)
- You want to compare Griffin, RWKV, and Mamba approaches firsthand
- Quality is less important than architectural experimentation
Frequently Asked Questions
How does Griffin's O(1) inference memory work?
Standard transformers store key-value pairs for every previous token (the "KV cache"), which grows linearly with sequence length. Griffin's recurrent layers (RG-LRU) compress all previous context into a fixed-size hidden state. This state gets updated with each new token but never grows. The local attention layers do have a small, fixed-size window (2048 tokens), but this is bounded and doesn't grow with total sequence length.
Can RecurrentGemma process "infinite" sequences?
Technically, the memory stays constant regardless of input length, so there's no memory-based limit. However, the model was trained with 8,192-token context, so quality degrades on longer inputs. The recurrent state also compresses information lossily -- details from early in a long sequence will be lost or blurred. "Infinite context" is a theoretical property, not a practical one.
Why is RecurrentGemma's MMLU lower than Gemma-7B?
Replacing global self-attention with linear recurrences reduces the model's ability to attend to all tokens simultaneously. Global attention allows direct comparison between any two tokens in the context, which is especially useful for factual recall and complex reasoning. The recurrent state must compress this into a fixed vector, losing information. This is the fundamental quality-efficiency tradeoff of recurrent architectures.
What hardware do I need to run it locally?
For Q4 quantized: 8GB VRAM (GPU) or 8GB RAM (CPU-only). An RTX 3060, RTX 4060, or Apple M1 with 8GB+ works fine. For FP16: you'll need ~20GB VRAM (RTX 3090, RTX 4090, or M2 Max with 32GB). CPU-only inference is possible but slow (~2-5 tok/s depending on CPU).
Is RecurrentGemma better than RWKV?
At similar parameter counts, Griffin (RecurrentGemma) tends to score higher on benchmarks than RWKV, likely because the local attention layers help with tasks requiring precise recent-context reference. RWKV, however, has a more active community, more model sizes, and continuing development (v6, v7). For practical local use, RWKV may offer better tooling and support, while RecurrentGemma has Google's engineering behind it.
What license is RecurrentGemma under?
RecurrentGemma uses the Gemma Terms of Use, which is NOT Apache 2.0 or MIT. It allows personal and commercial use but has restrictions: you cannot use it to train competing models, and large-scale commercial use (>$10M revenue) requires contacting Google. Read the full terms at ai.google.dev/gemma/terms.
Written by Pattanaik Ramswarup
AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset
Sources
Technical Papers
- RecurrentGemma: Moving Past Transformers for Efficient Open Language Models (arXiv 2404.07839)
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models (arXiv 2402.19427)
- RWKV: Reinventing RNNs for the Transformer Era (arXiv 2305.13048)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (arXiv 2312.00752)
Figure: RecurrentGemma-9B's Griffin architecture, combining linear recurrence (RG-LRU) and local sliding-window attention for O(1) inference memory