Yi-34B: 01.AI Base Model

34B parameter model trained from scratch by Kai-Fu Lee's 01.AI -- not a LLaMA derivative

MMLU ~76% | C-Eval ~81% | Apache 2.0 | 200K Context | Released Nov 2023

Parameters: 34B
MMLU (5-shot): ~76%
Context Window: 4K base / 200K extended
License: Apache 2.0

What Is Yi-34B?

Yi-34B is a 34 billion parameter base language model developed by 01.AI, the AI company founded by Dr. Kai-Fu Lee (former president of Google China). Released in November 2023, Yi-34B was notable for being trained entirely from scratch on a custom dataset -- it is not a fine-tune or derivative of LLaMA, Mistral, or any other existing model.

At launch, Yi-34B achieved remarkably strong results for its size, scoring ~76% on MMLU and ~81% on C-Eval (a Chinese language benchmark). This made it one of the strongest open-weight models available in late 2023, particularly for bilingual Chinese-English tasks. As one of many open LLMs you can run locally, it remains a solid option for users who need strong Chinese language capabilities.

Important: This page covers the base (pretrained) model. The base model is designed for text completion, not instruction-following. For conversational use, see the Yi-34B-Chat page, which covers the instruction-tuned version.

Architecture: Trained from Scratch

Key Architecture Details

When Yi-34B first appeared, some community members speculated it was a LLaMA derivative due to similar transformer architecture choices. 01.AI clarified that Yi was trained from scratch on their own data pipeline. The architectural similarities (grouped-query attention, RMSNorm, SwiGLU) are standard transformer design choices used by many modern LLMs independently.

| Component | Yi-34B Specification |
|---|---|
| Parameters | 34 billion |
| Hidden Size | 7168 |
| Layers | 60 |
| Attention Heads | 56 (with 8 KV heads via GQA) |
| Attention Type | Grouped-Query Attention (GQA) |
| Vocabulary Size | 64,000 tokens (bilingual Chinese-English tokenizer) |
| Context Length | 4,096 tokens (base), 200K (extended via NTK-aware RoPE) |
| Normalization | RMSNorm |
| Activation | SwiGLU |
| Position Encoding | Rotary Position Embedding (RoPE) |
| Training Data | 3.1 trillion tokens (bilingual English + Chinese, cleaned) |

Note on data quality: 01.AI emphasized that Yi's training data went through extensive deduplication and quality filtering. The bilingual tokenizer with 64K vocabulary was designed specifically for efficient Chinese-English processing, unlike models that bolt on Chinese support as an afterthought.

Real Benchmark Results

MMLU Scores (5-shot)

| Model | MMLU (5-shot) |
|---|---|
| Yi-34B | ~76% |
| Mixtral 8x7B | ~70% |
| Llama 2 70B | ~69% |
| Falcon 40B | ~55% |

Source: 01.AI technical report and Open LLM Leaderboard. Yi-34B outperformed Llama 2 70B on MMLU with roughly half the parameters.

Full Benchmark Comparison

| Benchmark | Yi-34B | Llama 2 70B | Falcon 40B | Mixtral 8x7B |
|---|---|---|---|---|
| MMLU (5-shot) | ~76% | ~69% | ~55% | ~70% |
| C-Eval (Chinese) | ~81% | ~50% | ~38% | ~55% |
| HellaSwag | ~85% | ~87% | ~83% | ~87% |
| ARC-Challenge | ~65% | ~64% | ~54% | ~66% |
| Parameters | 34B | 70B | 40B | 46.7B (MoE) |

Sources: 01.AI technical report, Open LLM Leaderboard, Hugging Face model cards. C-Eval scores for non-Chinese models are approximate as they were not optimized for Chinese evaluation.

VRAM Requirements by Quantization

| Quantization | File Size | VRAM Required | Quality Loss | Compatible GPUs |
|---|---|---|---|---|
| Q4_K_M | ~19GB | ~20-22GB | Minimal | RTX 4090 (24GB), RTX 3090 (24GB) |
| Q5_K_M | ~23GB | ~25-27GB | Very small | A5000 (24GB, partial offload), 2x RTX 3090 |
| Q8_0 | ~36GB | ~38-40GB | Negligible | A6000 (48GB), 2x RTX 4090 |
| FP16 | ~68GB | ~70-72GB | None (full precision) | A100 80GB, 3x RTX 4090 |

VRAM estimates include model weights plus KV cache overhead at default context length. Longer context windows will require additional VRAM. CPU offloading is possible but significantly reduces speed.
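These figures can be sanity-checked with a back-of-the-envelope estimate: quantized weight size is roughly parameters times bits-per-weight, and the FP16 KV cache for a GQA model scales with layers, KV heads, head dimension, and context length. The sketch below uses Yi-34B's published dimensions; the ~4.8 effective bits/weight for Q4_K_M is an approximation, not an official figure.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate VRAM footprint of the quantized weights."""
    return params * bits_per_weight / 8

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: int = 2) -> int:
    """FP16 KV cache: K and V tensors per layer, per KV head, per position."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem

# Yi-34B: 60 layers, 8 KV heads (GQA), head_dim = 7168 / 56 = 128
weights = weight_bytes(34e9, 4.8)             # Q4_K_M, ~4.8 bits/weight (approx.)
kv_4k = kv_cache_bytes(60, 8, 128, 4096)      # default 4K context
kv_200k = kv_cache_bytes(60, 8, 128, 200_000) # fully extended context

print(f"weights  ~{weights / 2**30:.1f} GiB")   # ~19.0 GiB
print(f"KV @4K   ~{kv_4k / 2**30:.1f} GiB")     # ~0.9 GiB
print(f"KV @200K ~{kv_200k / 2**30:.1f} GiB")   # ~45.8 GiB
```

The 200K figure is why the long-context mode needs so much extra memory: the KV cache alone grows to roughly 46 GiB at full context.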

Memory Usage Over Time

[Chart: VRAM usage over the first two minutes of inference with Q4_K_M quantization on an RTX 4090. Usage climbs during model loading, then stabilizes around 20-21GB.]

200K Context Extension

Yi-34B's base context window is 4,096 tokens, but 01.AI developed a method to extend this to 200K tokens using NTK-aware interpolation of Rotary Position Embeddings (RoPE). This approach modifies the frequency base of the positional encoding to handle longer sequences without fine-tuning on long documents.

How NTK-aware RoPE Works

  • Standard RoPE: Uses fixed frequency bases for position encoding, limited to training context length
  • NTK-aware interpolation: Dynamically adjusts the frequency base to encode positions beyond the original training window
  • Result: Extends effective context from 4K to 200K with minimal perplexity increase
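The bullets above can be sketched numerically, assuming the commonly used NTK-aware scaling rule new_base = base * scale^(d/(d-2)); 01.AI's exact implementation may differ in details.

```python
import numpy as np

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies, one per even-indexed dim pair."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def ntk_inv_freq(head_dim: int, scale: float,
                 base: float = 10000.0) -> np.ndarray:
    """NTK-aware variant: raise the frequency base so the lowest
    frequencies stretch by ~`scale` while the highest stays unchanged."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / (new_base ** (np.arange(0, head_dim, 2) / head_dim))

orig = rope_inv_freq(128)             # Yi-34B head_dim is 128
ext = ntk_inv_freq(128, scale=50.0)   # 4K -> 200K is a ~50x extension
print(orig[0], ext[0])                # highest frequency: 1.0 in both
print(orig[-1] / ext[-1])             # lowest frequency: stretched ~50x
```

Stretching only the low frequencies is what preserves local token relationships while letting distant positions remain distinguishable.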

Practical Considerations

  • Quality at 200K: Works for retrieval/needle-in-haystack tasks, but generation quality degrades at very long contexts
  • VRAM impact: 200K context requires significantly more VRAM for KV cache (potentially 40GB+ additional)
  • Best range: Most reliable at 4K-32K tokens; 200K is a theoretical maximum

Base Model vs Chat Model

Yi-34B comes in two variants. This page covers the base model. Understanding the difference is important for choosing the right one.

| Feature | Yi-34B (Base) | Yi-34B-Chat |
|---|---|---|
| Purpose | Text completion, fine-tuning foundation | Conversational, instruction-following |
| Training | Pretraining only (3.1T tokens) | Pretraining + SFT + RLHF |
| Use Case | Custom fine-tuning, text generation, embeddings | Chatbots, Q&A, general assistant tasks |
| Ollama Tag | yi:34b | yi:34b-chat |
| Best For | Developers building custom applications | End users wanting a general assistant |

If you want to chat with the model directly, use Yi-34B-Chat instead. The base model outputs raw completions and may not follow instructions without proper prompting.
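Because the base model continues text rather than following instructions, the most reliable way to prompt it is as pattern completion. A small illustrative helper (the `few_shot_prompt` function and its examples are my own, not part of any Yi tooling):

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Format Q/A pairs so a base model continues the pattern."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {query}\nA:")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("What is the capital of Japan?", "Tokyo"),
     ("What is the capital of Germany?", "Berlin")],
    "What is the capital of France?",
)
print(prompt)
# Passed to the base model, a prompt shaped like this typically elicits
# a short continuation such as " Paris" rather than a chatty reply.
```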

Local Deployment with Ollama

1. Install Ollama -- download and install the Ollama runtime:

   $ curl -fsSL https://ollama.com/install.sh | sh

2. Pull Yi-34B -- download the Yi-34B base model (~19GB at Q4_K_M):

   $ ollama pull yi:34b

3. Run Yi-34B -- start an interactive session with the base model:

   $ ollama run yi:34b

4. Test with a prompt -- verify the model is working correctly:

   $ ollama run yi:34b "What is the capital of France?"

Terminal Demo

Terminal
$ ollama pull yi:34b
pulling manifest
pulling abc123... 100% |████████████████████| 19 GB
verifying sha256 digest
writing manifest
success
$ ollama run yi:34b "Explain the difference between TCP and UDP"
TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are both transport layer protocols but differ in key ways:

TCP is connection-oriented, establishing a reliable connection via a three-way handshake before data transfer. It guarantees ordered delivery and retransmits lost packets. This makes it ideal for web browsing, email, and file transfers where data integrity matters.

UDP is connectionless and sends datagrams without establishing a connection first. It provides no delivery guarantees or ordering, but has much lower overhead. This makes it suitable for real-time applications like video streaming, gaming, and DNS lookups where speed matters more than perfect reliability.
$ _
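Beyond the interactive CLI, a pulled model can also be queried programmatically through Ollama's local REST API (the `/api/generate` endpoint on the default port 11434). A minimal stdlib sketch; the helper name is illustrative:

```python
import json
import urllib.request

def ollama_generate(prompt: str, model: str = "yi:34b",
                    host: str = "http://localhost:11434") -> str:
    """Send a non-streaming completion request to a local Ollama server."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# ollama_generate("Explain the difference between TCP and UDP")
# (requires a running `ollama serve` with yi:34b pulled)
```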

System Requirements

Operating System: Windows 10/11, macOS 13+, Ubuntu 20.04+
RAM: 32GB minimum (Q4_K_M), 48GB+ for larger quantizations
Storage: 25GB for Q4_K_M, 40GB for Q8, 70GB for FP16
GPU: 24GB VRAM for Q4_K_M (RTX 4090, RTX 3090, A5000)
CPU: CPU-only inference possible with 64GB+ RAM but very slow (~2 tok/s)

Ollama Environment Variables

# Limit to 1 loaded model (saves VRAM)
export OLLAMA_MAX_LOADED_MODELS=1

# Set concurrent request limit
export OLLAMA_NUM_PARALLEL=2

# Flash attention (if supported by your GPU)
export OLLAMA_FLASH_ATTENTION=true

These are real Ollama environment variables. Yi-34B at Q4_K_M fits in a single RTX 4090 (24GB), but runs tight on VRAM -- limit parallel requests to avoid OOM errors.

Alternative Runtimes

llama.cpp

Direct GGUF inference with fine-grained control over quantization and context length.

./llama-cli -m yi-34b-Q4_K_M.gguf -p "Your prompt" -n 512

vLLM

High-throughput serving with PagedAttention for production deployments.

python -m vllm.entrypoints.openai.api_server --model 01-ai/Yi-34B
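Once launched, the vLLM server exposes an OpenAI-compatible API. A minimal stdlib client sketch, assuming vLLM's default port 8000 and the `/v1/completions` route; the helper name is illustrative:

```python
import json
import urllib.request

def vllm_complete(prompt: str, model: str = "01-ai/Yi-34B",
                  host: str = "http://localhost:8000") -> str:
    """Query a vLLM server's OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 128}
    req = urllib.request.Request(
        f"{host}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# vllm_complete("The three primary colors are")  # requires the server above
```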

Local AI Alternatives

Yi-34B was released in November 2023. Since then, newer models have surpassed it on most benchmarks. Here is how it compares to current alternatives for local deployment:

| Model | Size | RAM Required | Speed | MMLU | Cost/Month |
|---|---|---|---|---|---|
| Yi-34B (Q4_K_M) | 20GB | 24GB | ~15 tok/s | ~76% | $0.00 |
| Llama 2 70B (Q4) | 40GB | 48GB | ~8 tok/s | ~69% | $0.00 |
| Mixtral 8x7B (Q4) | 26GB | 32GB | ~20 tok/s | ~70% | $0.00 |
| Qwen 2.5 32B (Q4) | 20GB | 24GB | ~18 tok/s | ~83% | $0.00 |

| Model | MMLU | Chinese | VRAM (Q4) | License | Released |
|---|---|---|---|---|---|
| Yi-34B | ~76% | Excellent | ~20GB | Apache 2.0 | Nov 2023 |
| Qwen 2.5 32B | ~83% | Excellent | ~20GB | Apache 2.0 | Sep 2024 |
| Llama 3 70B | ~82% | Good | ~40GB | Meta License | Apr 2024 |
| Mixtral 8x7B | ~70% | Fair | ~26GB | Apache 2.0 | Dec 2023 |
| Gemma 2 27B | ~75% | Fair | ~17GB | Gemma ToU | Jun 2024 |

For bilingual Chinese-English workloads, Qwen 2.5 32B is the strongest current alternative with similar VRAM requirements. For English-only tasks, Llama 3 70B offers better benchmarks but needs 2x the VRAM.

Honest Assessment

Strengths

  • Strong for its era: When released in Nov 2023, Yi-34B was arguably the best open-weight model at its size class, beating Llama 2 70B on MMLU with half the parameters
  • Bilingual Chinese-English: Native bilingual training makes it genuinely strong at Chinese tasks, unlike models that treat Chinese as secondary
  • Apache 2.0 license: Fully permissive for commercial use with no restrictions
  • Clean architecture: Trained from scratch rather than being a derivative, which gives it a unique character
  • Efficient for 34B: Fits comfortably on a single RTX 4090 with Q4 quantization

Limitations

  • Superseded by newer models: Qwen 2.5 32B (Sep 2024) achieves ~83% MMLU at similar VRAM requirements. For most new projects, newer models are better choices
  • Base model limitations: Without fine-tuning, the base model does raw text completion -- it will not follow instructions or chat naturally
  • 4K base context: The 200K extension works via interpolation but quality degrades at very long contexts compared to models trained natively on long sequences
  • No code specialization: Not optimized for coding tasks; dedicated code models like DeepSeek Coder or CodeLlama are better for programming
  • Community size: Smaller community than Llama/Mistral ecosystem, meaning fewer fine-tunes and adapters available

When to Still Choose Yi-34B in 2026

  • Bilingual Chinese-English work: If you need strong Chinese + English in a single model and want Apache 2.0 licensing
  • Fine-tuning base: The base model is a solid foundation for domain-specific fine-tuning, especially for Chinese-language tasks
  • Existing deployments: If you already have Yi-34B in production and it works for your use case, there is no urgent reason to migrate
  • For new projects: Consider Qwen 2.5 32B as a direct upgrade with better benchmarks at similar VRAM

Frequently Asked Questions

Is Yi-34B based on LLaMA?

No. Despite early community speculation, 01.AI confirmed that Yi-34B was trained from scratch on their own data pipeline. The architectural similarities (GQA, RMSNorm, SwiGLU) are standard transformer design choices used independently by many models. The bilingual tokenizer and training data are entirely custom.

What GPU do I need to run Yi-34B locally?

With Q4_K_M quantization, Yi-34B fits on a single RTX 4090 or RTX 3090 (24GB VRAM). For higher quantizations (Q8), you will need 48GB+ VRAM (A6000 or dual GPUs). CPU-only inference is possible with 64GB+ system RAM but is very slow (~2 tokens/second).

Should I use the base model or the Chat model?

For most users, Yi-34B-Chat is the better choice -- it follows instructions and has a conversational format. Use the base model only if you are fine-tuning for a specific domain, building embeddings, or need raw text completion without instruction bias.

Does the 200K context window really work?

The 200K extension via NTK-aware RoPE interpolation works for retrieval-style tasks (finding specific information in long documents), but generation quality degrades at very long contexts. For reliable results, stay within 4K-32K tokens. The 200K figure is a theoretical maximum, not a practical everyday limit. It also requires significantly more VRAM.

Is Yi-34B still worth using in 2026?

For new projects, newer models like Qwen 2.5 32B generally offer better performance at similar VRAM requirements. However, Yi-34B remains a solid choice for bilingual Chinese-English fine-tuning and existing deployments. Its Apache 2.0 license and clean architecture make it a good foundation model for specialized applications.



Written by Pattanaik Ramswarup

AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset

Published: November 1, 2023 | Last Updated: March 13, 2026
