RecurrentGemma-9B: Griffin Architecture Review

Google DeepMind's hybrid recurrent + local attention model with O(1) inference memory

RecurrentGemma-9B at a Glance

Architecture: Griffin (linear recurrences + local attention)

Parameters: 9 billion

Context Window: 8,192 tokens

MMLU: ~56% (from Google's technical report)

Key Innovation: O(1) memory during inference (constant KV state)

License: Gemma Terms of Use (not Apache 2.0)

Released: April 2024 by Google DeepMind

Ollama: ollama run recurrentgemma:9b

Honest take: RecurrentGemma-9B is architecturally interesting but scores lower than standard Gemma-7B on most benchmarks (~56% vs ~64% MMLU). Its value is in the Griffin architecture innovation (constant inference memory), not raw quality. If you need the best 7-9B model for general tasks, Llama 3 8B or Gemma 2 9B are stronger. RecurrentGemma is worth trying if you care about recurrent architectures or want to experiment with alternatives to full attention.

Griffin Architecture Deep Dive

RecurrentGemma-9B uses the Griffin architecture, introduced in the paper "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models" (De et al., 2024). Griffin is a hybrid that replaces most of the standard transformer's global self-attention layers with gated linear recurrence layers, while keeping a few local sliding-window attention layers.

How Griffin Works (Layer by Layer)

Griffin Block Pattern (repeating):
+------------------------------------+
|  RG-LRU (Recurrent Block)          | -- O(1) memory, processes sequentially
+------------------------------------+
|  MLP (Feed-Forward)                | -- Standard transformation
+------------------------------------+
|  RG-LRU (Recurrent Block)          | -- Another recurrent layer
+------------------------------------+
|  Local Attention (sliding window)  | -- 2048-token local window, NOT global
+------------------------------------+

Key: RG-LRU = Real-Gated Linear Recurrent Unit
     Local Attention = Multi-Query Attention over 2048-token window only
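
The repeating layout can be sketched as a simple schedule. This is an illustrative toy assuming the two-recurrent-blocks-per-attention-block ratio from the Griffin paper; the function name and string labels are made up for this sketch, not real API:

```python
# Toy sketch of Griffin's layer schedule. The 2:1 ratio (two recurrent
# blocks per local-attention block) follows the Griffin paper; each
# block is also followed by an MLP, omitted here for brevity.
def griffin_blocks(n_blocks):
    return ["local_attention" if i % 3 == 2 else "rg_lru"
            for i in range(n_blocks)]

print(griffin_blocks(6))
# ['rg_lru', 'rg_lru', 'local_attention', 'rg_lru', 'rg_lru', 'local_attention']
```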

RG-LRU (Recurrent Layers)

  • Maintains a fixed-size hidden state
  • Processes tokens one at a time, updating the state
  • O(1) memory per layer regardless of sequence length
  • Uses input and forget gates (similar to the LSTM concept)
  • Can be parallelized during training via scan operations
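
To make the bullets above concrete, here is a minimal single-channel sketch of a gated linear recurrent update. It is simplified: in the real RG-LRU the gate pre-activations are linear functions of the input, whereas here they are passed in directly, and the decay parameterization is reduced to one constant:

```python
import math

def rg_lru_step(h_prev, x, w_rec, w_in, c=8.0):
    """One simplified RG-LRU update for a single channel.

    In the real layer the gate pre-activations (w_rec, w_in) depend on
    the input x; here they are fixed arguments to keep the sketch small.
    c plays the role of the paper's fixed scaling constant.
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    r = sigmoid(w_rec)                   # recurrence (forget-like) gate
    i = sigmoid(w_in)                    # input gate
    a = math.exp(-c * r)                 # decay factor in (0, 1)
    # Normalized update: the old state decays by a, and the gated input
    # is rescaled so the state stays bounded at any sequence length.
    return a * h_prev + math.sqrt(1.0 - a * a) * (i * x)

# The entire "memory" is one number per channel -- it never grows,
# no matter how many tokens have been processed.
h = 0.0
for x in [1.0, -0.5, 2.0, 0.25]:
    h = rg_lru_step(h, x, w_rec=0.5, w_in=1.0)
```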

Local Attention Layers

  • - Sliding window of 2048 tokens
  • - Uses Multi-Query Attention (MQA) for efficiency
  • - Provides sharp, precise attention on recent tokens
  • - Complements the recurrent layers' long-range state
  • - Only a few attention layers (most are recurrent)

The key innovation is the combination: recurrent layers capture long-range dependencies in a compressed state, while local attention layers provide precise, fine-grained processing of the most recent tokens. This is fundamentally different from full transformers (which attend to all tokens globally) and from pure RNNs (which have no attention at all).

Why O(1) inference memory matters: In standard transformers, the KV cache grows linearly with sequence length. A 9B transformer processing 8K tokens might use 2-4GB just for the KV cache. Griffin's recurrent state is fixed-size -- it doesn't grow as you generate more tokens. This means inference memory stays constant no matter how many tokens you've generated. The practical benefit: more predictable memory usage and no KV cache OOM issues.
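
The KV-cache arithmetic behind that claim can be sketched directly. The config values below are illustrative for a generic 9B-class transformer, not RecurrentGemma's actual hyperparameters:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Keys and values each store one (n_kv_heads * head_dim) vector
    # per token, per layer, at dtype_bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 9B-class full-attention config (illustrative dims):
full_attn = kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"Full attention, 8K tokens: {full_attn / 1e9:.1f} GB")  # grows with seq_len

# Griffin's few local-attention layers cache at most a 2048-token
# window, and its recurrent layers keep a fixed-size state instead.
local_only = kv_cache_bytes(seq_len=2048, n_layers=4, n_kv_heads=1, head_dim=128)
print(f"Griffin local window (bounded): {local_only / 1e6:.1f} MB")
```

The first number scales linearly with `seq_len`; the second is a hard ceiling regardless of how long the generation runs.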

However, "constant memory" does not mean "infinite perfect recall." The fixed-size state compresses information lossily. Details from thousands of tokens ago may be forgotten or blurred, similar to how RNNs lose older information. The 8,192-token context window is the designed operating range, not a theoretical limit on memory.

Real Benchmarks and Performance

RecurrentGemma-9B scores lower than standard Gemma models on most benchmarks. This is the honest tradeoff: the Griffin architecture gains memory efficiency but loses some quality compared to full-attention transformers of similar size.

Benchmark Results (Source: Google DeepMind Technical Report)

| Benchmark     | RecurrentGemma-9B | Gemma-7B | Mistral-7B |
|---------------|-------------------|----------|------------|
| MMLU (5-shot) | ~56%              | ~64%     | ~60%       |
| HellaSwag     | ~73%              | ~81%     | ~81%       |
| PIQA          | ~79%              | ~81%     | ~82%       |
| Winogrande    | ~68%              | ~74%     | ~74%       |
| ARC-Challenge | ~48%              | ~53%     | ~54%       |

Source: Google DeepMind RecurrentGemma technical report (April 2024), Gemma technical report, Mistral AI. Scores are approximate.

Key takeaway: RecurrentGemma-9B consistently scores 5-10 percentage points below standard Gemma-7B across benchmarks. This is the quality cost of replacing global attention with linear recurrences. The model is better understood as an architecture research release than as a production-ready competitor to Llama 3 or Mistral.

MMLU Score Comparison (5-shot, approximate)

  Gemma-7B           64%
  Mistral-7B         60%
  RecurrentGemma-9B  56%
  RWKV-v5-7B         47%

VRAM Requirements by Quantization

RecurrentGemma-9B is comparable to other 7-9B models in size. The recurrent architecture doesn't significantly change the model weight size -- the O(1) memory advantage applies to the inference KV cache, not the model weights themselves.

VRAM by Quantization Level

| Quantization         | Model Size | VRAM (GPU) | RAM (CPU) | Quality Impact  |
|----------------------|------------|------------|-----------|-----------------|
| FP16 (full)          | ~18GB      | ~20GB      | ~20GB     | None (baseline) |
| Q8_0                 | ~9.5GB     | ~11GB      | ~12GB     | Minimal         |
| Q5_K_M               | ~6.5GB     | ~8GB       | ~9GB      | Small           |
| Q4_K_M (recommended) | ~5.5GB     | ~7GB       | ~8GB      | Moderate        |
| Q2_K                 | ~3.8GB     | ~5GB       | ~6GB      | Significant     |

Estimated based on 9B parameter model weights. Actual VRAM includes model + activations + overhead.

The real memory advantage: While model weight VRAM is similar to other 9B models, RecurrentGemma's inference KV cache stays constant. For standard transformers, the KV cache for an 8K context 9B model adds ~1-2GB. This matters more at longer sequences and larger batch sizes, but for single-user local inference at 8K context, the practical difference is modest.
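
The model sizes in the table above follow from simple bits-per-weight arithmetic. A quick sketch, where the effective bits per quantization level are approximate assumptions (real GGUF files mix precisions and add metadata):

```python
def model_weight_gb(n_params, bits_per_weight):
    # Weight footprint only; runtime VRAM adds activations and overhead,
    # so treat these as rough floors rather than exact figures.
    return n_params * bits_per_weight / 8 / 1e9

params = 9e9  # RecurrentGemma-9B
# Effective bits per weight are approximate assumptions per scheme.
for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q5_K_M", 5.7),
                   ("Q4_K_M", 4.85), ("Q2_K", 3.4)]:
    print(f"{name:8s} ~{model_weight_gb(params, bits):.1f} GB")
```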

Performance Metrics

  MMLU Score            56
  Inference Memory      92
  Long Sequence Speed   80
  Short Text Quality    52
  Architecture Novelty  85
  Community Support     40

Griffin vs RWKV vs Mamba vs Transformers

RecurrentGemma's Griffin architecture belongs to a family of "efficient attention alternatives" that emerged in 2023-2024. Here's how they compare.

Architecture Comparison

| Feature              | Griffin (RecurrentGemma) | RWKV-v5/v6            | Mamba (SSM)          | Standard Transformer  |
|----------------------|--------------------------|-----------------------|----------------------|-----------------------|
| Attention            | Local sliding window     | None (pure recurrent) | None (SSM)           | Full global           |
| Recurrence           | RG-LRU (gated)           | WKV (custom)          | Selective SSM        | None                  |
| Inference memory     | O(1) + local window      | O(1)                  | O(1)                 | O(n) KV cache         |
| Training parallelism | Good (scan + attention)  | Good (parallel scan)  | Good (parallel scan) | Excellent (full)      |
| Quality at 7-9B      | ~56% MMLU                | ~47% MMLU (v5-7B)     | ~55% MMLU (est.)     | ~60-66% MMLU          |
| Community/Ecosystem  | Limited (Google only)    | Active community      | Growing (research)   | Massive (Llama, etc.) |

What Makes Griffin Different from RWKV?

Both Griffin and RWKV achieve O(1) inference memory through recurrence, but they differ in a key way: Griffin keeps a few local attention layers (sliding window over the most recent ~2048 tokens). This hybrid approach means Griffin can precisely reference recent tokens via attention while using its recurrent state for older context. RWKV is purely recurrent with no attention at all.
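
The causal sliding-window mask that Griffin's local attention layers apply can be sketched in a few lines (toy sizes here; Griffin's actual window is 2048 tokens):

```python
def local_attention_mask(seq_len, window):
    """True where query q may attend to key k: causal, and limited to
    the last `window` positions (including the token itself)."""
    return [[k <= q and q - k < window for k in range(seq_len)]
            for q in range(seq_len)]

mask = local_attention_mask(seq_len=6, window=3)
# Token 5 sees only positions 3, 4, 5 -- anything earlier must come
# from the recurrent state, not from attention.
print([k for k, ok in enumerate(mask[5]) if ok])  # [3, 4, 5]
```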

The result: Griffin tends to perform better on tasks requiring precise reference to recent context (like multi-turn conversation), while RWKV has a simpler, more elegant architecture. Neither matches the quality of full-attention transformers at equivalent scale -- that's the fundamental tradeoff all these models make.

Memory Usage Over Time

[Chart: inference memory (0-5GB axis) across 1K, 2K, 4K, and 8K token sequences.]

Ollama Setup Guide

RecurrentGemma is available on Ollama as recurrentgemma. Setup is straightforward -- no special flags or environment variables needed.

Note: The Ollama tag for the 9B model is recurrentgemma:9b. Check ollama.com/library/recurrentgemma for the latest available tags.

Alternative: Using HuggingFace + Transformers

# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/recurrentgemma-9b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

inputs = tokenizer("Explain the Griffin architecture:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Requires ~20GB VRAM for FP16, or use bitsandbytes for 4-bit quantization.

System Requirements

Operating System
Windows 10/11, macOS 12+, Ubuntu 20.04+
RAM
8GB minimum (Q4), 16GB recommended (Q8), 32GB for FP16
Storage
6GB for Q4 model, 18GB for FP16
GPU
Optional: any GPU with 6GB+ VRAM for Q4 acceleration
CPU
4+ cores recommended
Step 1: Install Ollama

Download Ollama for your platform:

$ curl -fsSL https://ollama.com/install.sh | sh

Step 2: Pull RecurrentGemma

Download the RecurrentGemma 9B model (~5.5GB Q4 quantized):

$ ollama pull recurrentgemma:9b

Step 3: Run the model

Start an interactive chat session:

$ ollama run recurrentgemma:9b
pulling manifest
pulling 9b model... 100%
pulling tokenizer... 100%
verifying sha256 digest
writing manifest
success
>>> Hello, what architecture are you based on?
I'm based on the Griffin architecture, which combines linear recurrences with
local attention mechanisms. This is different from standard transformer models
that use full self-attention. My architecture uses a Real-Gated Linear
Recurrent Unit (RG-LRU) for processing sequences with constant memory, plus
sliding window local attention for recent context.

Honest Strengths and Limitations

Strengths

  • O(1) inference memory: the cache doesn't grow with sequence length, so memory usage is predictable.
  • Architecture innovation: demonstrates that hybrid recurrent+attention models are viable at 9B scale.
  • Efficient long generation: generating long outputs doesn't increase memory pressure the way it does in transformers.
  • Research value: important for understanding alternatives to full attention.
  • Runs on consumer hardware: Q4 fits in 8GB VRAM, similar to other 7-9B models.

Limitations

  • Lower quality: ~56% MMLU vs ~64% for Gemma-7B. Measurably weaker on most benchmarks.
  • Small context window: 8K tokens -- smaller than Llama 3.1's 128K or Mistral's 32K.
  • Lossy compression: the recurrent state compresses information, so old details may be lost or blurred.
  • Limited ecosystem: few fine-tuned variants, community tools, or GGUF options compared to Llama/Mistral.
  • Limited instruction tuning: an instruct version (recurrentgemma-9b-it) exists, but it is less tested than mainstream instruct models.
  • Gemma license: not Apache 2.0; it carries usage restrictions for large-scale commercial deployment.

Common misconception: "RecurrentGemma can process infinite sequences with perfect memory." This is misleading. While the inference memory is technically constant (O(1)), the model has an 8K context window and the recurrent state compresses information lossily. It does NOT maintain "perfect recall" of everything it has processed. Think of it like a human summarizing a book in their head -- the gist is there but fine details fade.

| Model                  | Size   | RAM Required | Speed     | Quality (MMLU) | Cost/Month |
|------------------------|--------|--------------|-----------|----------------|------------|
| RecurrentGemma-9B (Q4) | ~5.5GB | 8GB          | ~20 tok/s | 56%            | Free       |
| Gemma-7B (Q4)          | ~4.3GB | 8GB          | ~25 tok/s | 64%            | Free       |
| Llama-3-8B (Q4)        | ~4.7GB | 8GB          | ~30 tok/s | 66%            | Free       |
| Mistral-7B (Q4)        | ~4.1GB | 8GB          | ~28 tok/s | 60%            | Free       |

Local AI Alternatives

Unless you specifically want to experiment with recurrent architectures, these models offer better quality per parameter for local use.

Recommended Alternatives (7-9B range)

| Model           | MMLU | Context    | Ollama Command         | Best For                     |
|-----------------|------|------------|------------------------|------------------------------|
| Llama 3.1 8B    | ~66% | 128K       | ollama run llama3.1:8b | General use, best quality    |
| Gemma 2 9B      | ~71% | 8K         | ollama run gemma2:9b   | Highest quality in class     |
| Mistral 7B v0.3 | ~60% | 32K        | ollama run mistral:7b  | Good balance, long context   |
| RWKV-v6 7B      | ~50% | Unlimited* | Via rwkv.cpp           | Pure recurrent, no attention |
| Qwen 2.5 7B     | ~68% | 128K       | ollama run qwen2.5:7b  | Strong all-rounder           |

*RWKV has theoretical unlimited context but quality degrades significantly past training window.

When to Choose RecurrentGemma

  • You're researching recurrent vs attention architectures
  • You need predictable, constant inference memory (e.g., embedded or edge deployment)
  • You want to compare the Griffin, RWKV, and Mamba approaches firsthand
  • Quality matters less than architectural experimentation

Frequently Asked Questions

How does Griffin's O(1) inference memory work?

Standard transformers store key-value pairs for every previous token (the "KV cache"), which grows linearly with sequence length. Griffin's recurrent layers (RG-LRU) compress all previous context into a fixed-size hidden state. This state gets updated with each new token but never grows. The local attention layers do have a small, fixed-size window (2048 tokens), but this is bounded and doesn't grow with total sequence length.

Can RecurrentGemma process "infinite" sequences?

Technically, the memory stays constant regardless of input length, so there's no memory-based limit. However, the model was trained with 8,192-token context, so quality degrades on longer inputs. The recurrent state also compresses information lossily -- details from early in a long sequence will be lost or blurred. "Infinite context" is a theoretical property, not a practical one.

Why is RecurrentGemma's MMLU lower than Gemma-7B?

Replacing global self-attention with linear recurrences reduces the model's ability to attend to all tokens simultaneously. Global attention allows direct comparison between any two tokens in the context, which is especially useful for factual recall and complex reasoning. The recurrent state must compress this into a fixed vector, losing information. This is the fundamental quality-efficiency tradeoff of recurrent architectures.

What hardware do I need to run it locally?

For Q4 quantized: 8GB VRAM (GPU) or 8GB RAM (CPU-only). An RTX 3060, RTX 4060, or Apple M1 with 8GB+ works fine. For FP16: you'll need ~20GB VRAM (RTX 3090, RTX 4090, or M2 Max with 32GB). CPU-only inference is possible but slow (~2-5 tok/s depending on CPU).

Is RecurrentGemma better than RWKV?

At similar parameter counts, Griffin (RecurrentGemma) tends to score higher on benchmarks than RWKV, likely because the local attention layers help with tasks requiring precise recent-context reference. However, RWKV has a more active community, more model sizes, and ongoing active development (RWKV-v6, v7). For practical local use, RWKV may have more community support while RecurrentGemma has Google's engineering behind it.

What license is RecurrentGemma under?

RecurrentGemma uses the Gemma Terms of Use, which is NOT Apache 2.0 or MIT. It allows personal and commercial use but has restrictions: you cannot use it to train competing models, and large-scale commercial use (>$10M revenue) requires contacting Google. Read the full terms at ai.google.dev/gemma/terms.

Written by Pattanaik Ramswarup

Published: 2024-04-01 | Last Updated: 2026-03-16
