Yi-34B: 01.AI Base Model
34B parameter model trained from scratch by Kai-Fu Lee's 01.AI -- not a LLaMA derivative
What Is Yi-34B?
Yi-34B is a 34 billion parameter base language model developed by 01.AI, the AI company founded by Dr. Kai-Fu Lee (former president of Google China). Released in November 2023, Yi-34B was notable for being trained entirely from scratch on a custom dataset -- it is not a fine-tune or derivative of LLaMA, Mistral, or any other existing model.
At launch, Yi-34B achieved remarkably strong results for its size, scoring ~76% on MMLU and ~81% on C-Eval (a Chinese language benchmark). This made it one of the strongest open-weight models available in late 2023, particularly for bilingual Chinese-English tasks. As one of many open LLMs you can run locally, it remains a solid option for users who need strong Chinese language capabilities.
Important: This page covers the base (pretrained) model. The base model is designed for text completion, not instruction-following. For conversational use, see the Yi-34B-Chat page, which covers the instruction-tuned version.
Architecture: Trained from Scratch
Key Architecture Details
When Yi-34B first appeared, some community members speculated it was a LLaMA derivative due to similar transformer architecture choices. 01.AI clarified that Yi was trained from scratch on their own data pipeline. The architectural similarities (grouped-query attention, RMSNorm, SwiGLU) are standard transformer design choices used by many modern LLMs independently.
| Component | Yi-34B Specification |
|---|---|
| Parameters | 34 billion |
| Hidden Size | 7168 |
| Layers | 60 |
| Attention Heads | 56 (with 8 KV heads via GQA) |
| Attention Type | Grouped-Query Attention (GQA) |
| Vocabulary Size | 64,000 tokens (bilingual Chinese-English tokenizer) |
| Context Length | 4,096 tokens (base), 200K (extended via NTK-aware RoPE) |
| Normalization | RMSNorm |
| Activation | SwiGLU |
| Position Encoding | Rotary Position Embedding (RoPE) |
| Training Data | 3.1 trillion tokens (bilingual English + Chinese, cleaned) |
Note on data quality: 01.AI emphasized that Yi's training data went through extensive deduplication and quality filtering. The bilingual tokenizer with 64K vocabulary was designed specifically for efficient Chinese-English processing, unlike models that bolt on Chinese support as an afterthought.
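The GQA numbers in the table translate directly into KV-cache savings. A back-of-the-envelope sketch (Python, using only the specs above, with an fp16 cache assumed):

```python
# KV-cache arithmetic for Yi-34B, from the architecture table above.
hidden_size = 7168
n_heads = 56
n_kv_heads = 8          # grouped-query attention
n_layers = 60
bytes_fp16 = 2

head_dim = hidden_size // n_heads  # 128

# K and V caches per token, across all layers (fp16)
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16
# What a full multi-head cache (56 KV heads) would have cost instead
mha_bytes_per_token = 2 * n_layers * n_heads * head_dim * bytes_fp16

print(kv_bytes_per_token)                          # 245760 bytes ≈ 0.23 MB/token
print(mha_bytes_per_token // kv_bytes_per_token)   # 7x smaller than full MHA
```

The 7x factor (56 query heads sharing 8 KV heads) is what makes long-context inference on a single consumer GPU plausible at all.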
Real Benchmark Results
MMLU Scores (5-shot)
Source: 01.AI technical report and Open LLM Leaderboard. Yi-34B outperformed Llama 2 70B on MMLU with fewer than half the parameters.
Full Benchmark Comparison
| Benchmark | Yi-34B | Llama 2 70B | Falcon 40B | Mixtral 8x7B |
|---|---|---|---|---|
| MMLU (5-shot) | ~76% | ~69% | ~55% | ~70% |
| C-Eval (Chinese) | ~81% | ~50% | ~38% | ~55% |
| HellaSwag | ~85% | ~87% | ~83% | ~87% |
| ARC-Challenge | ~65% | ~64% | ~54% | ~66% |
| Parameters | 34B | 70B | 40B | 46.7B (MoE) |
Sources: 01.AI technical report, Open LLM Leaderboard, Hugging Face model cards. C-Eval scores for non-Chinese models are approximate as they were not optimized for Chinese evaluation.
VRAM Requirements by Quantization
| Quantization | File Size | VRAM Required | Quality Loss | Compatible GPUs |
|---|---|---|---|---|
| Q4_K_M | ~19GB | ~20-22GB | Minimal | RTX 4090 (24GB), RTX 3090 (24GB) |
| Q5_K_M | ~23GB | ~25-27GB | Very small | A5000 (24GB partial offload), 2x RTX 3090 |
| Q8_0 | ~36GB | ~38-40GB | Negligible | A6000 (48GB), 2x RTX 4090 |
| FP16 | ~68GB | ~70-72GB | None (full precision) | A100 80GB, 3x RTX 4090 |
VRAM estimates include model weights plus KV cache overhead at default context length. Longer context windows will require additional VRAM. CPU offloading is possible but significantly reduces speed.
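The table's figures roughly follow weight-size arithmetic: parameter count times average bits per weight. A sketch, where the bits-per-weight averages for each GGUF scheme are approximate assumptions and real file sizes vary by a gigabyte or two:

```python
# Rough weight-size estimate: params * bits-per-weight / 8.
# The bpw values are approximate averages for each GGUF scheme.
params = 34e9
schemes = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5, "FP16": 16.0}

for name, bits in schemes.items():
    weight_gb = params * bits / 8 / 1e9
    print(f"{name}: ~{weight_gb:.0f} GB weights (+ KV cache and runtime overhead)")
```

Add a few GB for KV cache and runtime buffers to get the "VRAM Required" column.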
Memory Usage Over Time
Memory usage during inference with Q4_K_M quantization on RTX 4090 (GB VRAM). Stabilizes around 20-21GB after initial loading.
200K Context Extension
Yi-34B's base context window is 4,096 tokens, but 01.AI developed a method to extend this to 200K tokens using NTK-aware interpolation of Rotary Position Embeddings (RoPE). This approach modifies the frequency base of the positional encoding to handle longer sequences without fine-tuning on long documents.
How NTK-aware RoPE Works
- Standard RoPE: Uses fixed frequency bases for position encoding, limited to training context length
- NTK-aware interpolation: Dynamically adjusts the frequency base to encode positions beyond the original training window
- Result: Extends effective context from 4K to 200K with minimal perplexity increase
Practical Considerations
- Quality at 200K: Works for retrieval/needle-in-haystack tasks, but generation quality degrades at very long contexts
- VRAM impact: 200K context requires significantly more VRAM for KV cache (potentially 40GB+ additional)
- Best range: Most reliable at 4K-32K tokens; 200K is a theoretical maximum
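The VRAM bullet can be checked against the architecture numbers from earlier on this page (fp16 KV cache assumed; quantized caches cut this further):

```python
# KV-cache growth with context length for Yi-34B (fp16 cache).
n_layers, n_kv_heads, head_dim, bytes_fp16 = 60, 8, 128, 2
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16  # K and V

for ctx in (4_096, 32_768, 200_000):
    gb = per_token * ctx / 1e9
    print(f"{ctx:>7} tokens: ~{gb:.1f} GB KV cache")
```

At 200K tokens the fp16 cache alone is on the order of 50 GB, which is why the full window is impractical on consumer hardware.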
Base Model vs Chat Model
Yi-34B comes in two variants. This page covers the base model. Understanding the difference is important for choosing the right one.
| Feature | Yi-34B (Base) | Yi-34B-Chat |
|---|---|---|
| Purpose | Text completion, fine-tuning foundation | Conversational, instruction-following |
| Training | Pretraining only (3.1T tokens) | Pretraining + SFT + RLHF |
| Use Case | Custom fine-tuning, text generation, embeddings | Chatbots, Q&A, general assistant tasks |
| Ollama Tag | yi:34b | yi:34b-chat |
| Best For | Developers building custom applications | End users wanting a general assistant |
If you want to chat with the model directly, use Yi-34B-Chat instead. The base model outputs raw completions and may not follow instructions without proper prompting.
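If you do prompt the base model directly, a few-shot pattern works better than instructions, because the model simply continues text. A hypothetical sketch of such a prompt:

```python
# Base models complete text rather than follow instructions, so steer
# them with a pattern the completion can naturally continue.
prompt = """Translate English to Chinese.

English: Hello
Chinese: 你好

English: Thank you
Chinese: 谢谢

English: Good morning
Chinese:"""

# Sent to the base model, the likeliest continuation is the next
# translation -- no chat template or system prompt involved.
print(prompt.endswith("Chinese:"))   # True
```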
Local Deployment with Ollama
1. Install Ollama -- download and install the Ollama runtime
2. Pull Yi-34B -- download the Yi-34B base model (~19GB at Q4_K_M)
3. Run Yi-34B -- start an interactive session with the base model
4. Test with a prompt -- verify the model is working correctly
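The steps above map to these commands (assuming a Linux install; the model tag and download size match the table earlier on this page):

```shell
# Install the Ollama runtime (Linux; macOS/Windows use the installer)
curl -fsSL https://ollama.com/install.sh | sh

# Download the Yi-34B base model (~19GB 4-bit default)
ollama pull yi:34b

# Start an interactive session
ollama run yi:34b

# One-shot test -- remember this is a base model, so give it
# text to continue rather than an instruction
ollama run yi:34b "The capital of France is"
```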
Ollama Environment Variables
# Limit to 1 loaded model (saves VRAM)
export OLLAMA_MAX_LOADED_MODELS=1
# Set concurrent request limit
export OLLAMA_NUM_PARALLEL=2
# Flash attention (if supported by your GPU)
export OLLAMA_FLASH_ATTENTION=true
These are documented Ollama environment variables. Yi-34B at Q4_K_M fits in a single RTX 4090 (24GB), but runs tight on VRAM -- limit parallel requests to avoid OOM errors.
Alternative Runtimes
llama.cpp
Direct GGUF inference with fine-grained control over quantization and context length.
./llama-cli -m yi-34b-Q4_K_M.gguf -p "Your prompt" -n 512
vLLM
High-throughput serving with PagedAttention for production deployments.
python -m vllm.entrypoints.openai.api_server --model 01-ai/Yi-34B
Local AI Alternatives
Yi-34B was released in November 2023. Since then, newer models have surpassed it on most benchmarks. Here is how it compares to current alternatives for local deployment:
| Model | Size | RAM Required | Speed | Quality (MMLU) | Cost/Month |
|---|---|---|---|---|---|
| Yi-34B (Q4_K_M) | 20GB | 24GB | ~15 tok/s | 76% | $0.00 |
| Llama 2 70B (Q4) | 40GB | 48GB | ~8 tok/s | 69% | $0.00 |
| Mixtral 8x7B (Q4) | 26GB | 32GB | ~20 tok/s | 70% | $0.00 |
| Qwen 2.5 32B (Q4) | 20GB | 24GB | ~18 tok/s | 83% | $0.00 |
| Model | MMLU | Chinese | VRAM (Q4) | License | Released |
|---|---|---|---|---|---|
| Yi-34B | ~76% | Excellent | ~20GB | Apache 2.0 | Nov 2023 |
| Qwen 2.5 32B | ~83% | Excellent | ~20GB | Apache 2.0 | Sep 2024 |
| Llama 3 70B | ~82% | Good | ~40GB | Meta License | Apr 2024 |
| Mixtral 8x7B | ~70% | Fair | ~26GB | Apache 2.0 | Dec 2023 |
| Gemma 2 27B | ~75% | Fair | ~17GB | Gemma ToU | Jun 2024 |
For bilingual Chinese-English workloads, Qwen 2.5 32B is the strongest current alternative with similar VRAM requirements. For English-only tasks, Llama 3 70B offers better benchmarks but needs 2x the VRAM.
Honest Assessment
Strengths
- Strong for its era: When released in Nov 2023, Yi-34B was arguably the best open-weight model at its size class, beating Llama 2 70B on MMLU with half the parameters
- Bilingual Chinese-English: Native bilingual training makes it genuinely strong at Chinese tasks, unlike models that treat Chinese as secondary
- Apache 2.0 license: Fully permissive for commercial use
- Clean architecture: Trained from scratch with its own tokenizer and data pipeline, rather than being a derivative of another model
- Efficient for 34B: Fits comfortably on a single RTX 4090 with Q4 quantization
Limitations
- Superseded by newer models: Qwen 2.5 32B (Sep 2024) achieves ~83% MMLU at similar VRAM requirements. For most new projects, newer models are better choices
- Base model limitations: Without fine-tuning, the base model does raw text completion -- it will not follow instructions or chat naturally
- 4K base context: The 200K extension works via interpolation but quality degrades at very long contexts compared to models trained natively on long sequences
- No code specialization: Not optimized for coding tasks; dedicated code models like DeepSeek Coder or CodeLlama are better for programming
- Community size: Smaller community than Llama/Mistral ecosystem, meaning fewer fine-tunes and adapters available
When to Still Choose Yi-34B in 2026
- Bilingual Chinese-English work: If you need strong Chinese + English in a single model and want Apache 2.0 licensing
- Fine-tuning base: The base model is a solid foundation for domain-specific fine-tuning, especially for Chinese-language tasks
- Existing deployments: If you already have Yi-34B in production and it works for your use case, there is no urgent reason to migrate
- Otherwise, for new projects: Qwen 2.5 32B is a direct upgrade, with better benchmarks at similar VRAM
Frequently Asked Questions
Is Yi-34B based on LLaMA?
No. Despite early community speculation, 01.AI confirmed that Yi-34B was trained from scratch on their own data pipeline. The architectural similarities (GQA, RMSNorm, SwiGLU) are standard transformer design choices used independently by many models. The bilingual tokenizer and training data are entirely custom.
What GPU do I need to run Yi-34B locally?
With Q4_K_M quantization, Yi-34B fits on a single RTX 4090 or RTX 3090 (24GB VRAM). For higher quantizations (Q8), you will need 48GB+ VRAM (A6000 or dual GPUs). CPU-only inference is possible with 64GB+ system RAM but is very slow (~2 tokens/second).
Should I use the base model or the Chat model?
For most users, Yi-34B-Chat is the better choice -- it follows instructions and has a conversational format. Use the base model only if you are fine-tuning for a specific domain, building embeddings, or need raw text completion without instruction bias.
Does the 200K context window really work?
The 200K extension via NTK-aware RoPE interpolation works for retrieval-style tasks (finding specific information in long documents), but generation quality degrades at very long contexts. For reliable results, stay within 4K-32K tokens. The 200K figure is a theoretical maximum, not a practical everyday limit. It also requires significantly more VRAM.
Is Yi-34B still worth using in 2026?
For new projects, newer models like Qwen 2.5 32B generally offer better performance at similar VRAM requirements. However, Yi-34B remains a solid choice for bilingual Chinese-English fine-tuning and existing deployments. Its Apache 2.0 license and clean architecture make it a good foundation model for specialized applications.
Written by Pattanaik Ramswarup
AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset