DeepSeek R1 Local Setup: Complete Guide to Running 671B Locally
DeepSeek R1 Quick Start
Quick Install (3 commands), using the 32B distilled model (see the VRAM guide below to pick a different size):
curl -fsSL https://ollama.com/install.sh | sh
ollama pull deepseek-r1:32b
ollama run deepseek-r1:32b
What is DeepSeek R1?
DeepSeek R1 is a 671 billion parameter reasoning model that revolutionized open-source AI when released on January 20, 2025. Built by Chinese AI lab DeepSeek for approximately $5.6 million in training costs, it matches or beats OpenAI's o1 and GPT-4 on complex reasoning tasks, and it's completely open source under the MIT license.
What makes R1 groundbreaking isn't just performance. It's the first open model to demonstrate transparent chain-of-thought reasoning through visible <think> tokens. Unlike closed models that hide their reasoning, R1 shows you exactly how it solves problems step by step. You can watch it explore solutions, catch its own mistakes, and course-correct in real-time.
The model uses a Mixture-of-Experts (MoE) architecture with 671B total parameters but only 37B active per token. This makes it computationally efficient while maintaining massive capability. DeepSeek achieved this through pure reinforcement learning: the model learned to reason without pre-programmed chains of thought.
Andrej Karpathy, founding member of OpenAI, commented: "DeepSeek making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M)."
DeepSeek Model Family: Complete Version History
Understanding the DeepSeek ecosystem helps you choose the right model:
DeepSeek V3 (December 26, 2024)
The foundation model. General-purpose 671B MoE optimized for broad capabilities across coding, writing, and conversation.
- Parameters: 671B total, 37B active
- Context: 128K tokens
- Best for: General tasks, coding, writing
DeepSeek R1 (January 20, 2025)
The reasoning specialist. Same architecture as V3 but trained specifically for complex reasoning using reinforcement learning.
- Parameters: 671B total, 37B active
- Context: 128K-160K tokens
- Best for: Math, logic, multi-step problems, debugging
DeepSeek R1-0528 (May 28, 2025)
Major upgrade to R1 with reduced hallucinations, JSON output support, and function calling.
- Improvements: Better JSON, tool calling, fewer hallucinations
- Key update: 8B distilled version significantly improved
- Best for: Production use, API integration
DeepSeek V3.1 (August 21, 2025)
Hybrid model combining V3's versatility with R1's reasoning. Can switch between thinking and non-thinking modes.
- Unique feature: Dynamic mode switching
- Best for: Users wanting both capabilities in one model
DeepSeek V3.2 (December 2025)
Latest general-purpose release with performance improvements across all benchmarks.
DeepSeek V4 (Expected February 2026)
Next-generation model expected mid-February 2026, focusing on enhanced code generation. Will likely use Apache 2.0 license.
DeepSeek R1 Benchmark Performance
R1's benchmark scores explain why it disrupted the AI industry:
Mathematics Benchmarks
| Benchmark | DeepSeek R1 | OpenAI o1-1217 | GPT-4o | Claude 3.5 |
|---|---|---|---|---|
| AIME 2024 (Math Olympiad) | 79.8% | 79.2% | 9.3% | 16.0% |
| MATH-500 | 97.3% | 96.4% | 74.6% | 78.3% |
| GSM8K | 95.8% | 94.8% | 92.0% | 91.6% |
Coding Benchmarks
| Benchmark | DeepSeek R1 | OpenAI o1 | GPT-4o | Claude 3.5 |
|---|---|---|---|---|
| Codeforces Elo | 2,029 | 1,891 | 1,891 | 1,886 |
| LiveCodeBench | 65.9% | 63.4% | 33.4% | 38.9% |
| SWE-Bench Verified | 49.2% | 48.9% | 33.2% | 40.6% |
Knowledge Benchmarks
| Benchmark | DeepSeek R1 | GPT-4o | Claude 3.5 |
|---|---|---|---|
| MMLU | 90.8% | 88.7% | 88.3% |
| MMLU-Pro | 84.0% | 80.3% | 78.0% |
| GPQA Diamond (PhD Science) | 71.5% | 49.9% | 59.4% |
The R1 scores are particularly impressive on hard benchmarks: AIME (math olympiad), Codeforces (competitive programming), and GPQA Diamond (PhD-level science). Beating both GPT-4o and Claude 3.5 on these benchmarks by significant margins was unprecedented for an open model.
Distilled Model Performance
The distilled versions retain most reasoning capability:
| Model | AIME 2024 | MATH-500 | Codeforces Elo |
|---|---|---|---|
| R1-Distill-Qwen-32B | 72.6% | 94.3% | 1,691 |
| R1-Distill-Llama-70B | 70.0% | 94.5% | 1,633 |
| R1-Distill-Qwen-14B | 69.7% | 93.9% | 1,481 |
| R1-Distill-Llama-8B | 50.4% | 89.1% | 1,205 |
The 32B distilled model achieves 72.6% on AIME; that's math olympiad performance from a model that runs on a single RTX 4090.
How DeepSeek R1's Thinking Mode Works
Understanding R1's reasoning mechanism helps you use it effectively.
Chain-of-Thought Architecture
When you send a prompt to R1, the model generates two distinct phases:
- Thinking Phase (<think>...</think>)
  - Internal reasoning tokens visible in raw output
  - Problem breakdown and exploration
  - Self-correction and verification
  - Multiple solution paths evaluated
- Response Phase
  - Final answer based on thinking
  - Clean, user-facing output
  - Conclusions from reasoning process
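To work with these two phases programmatically, here is a minimal Python sketch (assuming the ollama package and a locally pulled tag such as deepseek-r1:8b) that splits raw model output into its thinking and response parts:
import re
import ollama
def split_reasoning(raw):
    # Separate the <think>...</think> block from the final, user-facing answer
    match = re.search(r'<think>(.*?)</think>', raw, flags=re.DOTALL)
    if not match:
        return '', raw.strip()  # some frontends strip the tags before you see them
    return match.group(1).strip(), raw[match.end():].strip()
response = ollama.chat(
    model='deepseek-r1:8b',  # any pulled R1 tag works here
    messages=[{'role': 'user', 'content': 'Is 9.11 larger than 9.9? Explain.'}]
)
thinking, answer = split_reasoning(response['message']['content'])
print('THINKING:\n', thinking)
print('\nANSWER:\n', answer)
The same split works on streamed output if you buffer chunks until the closing </think> tag appears.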
The Four-Stage Reasoning Taxonomy
Research into R1's behavior reveals four distinct stages:
- Problem Definition: Initial understanding and constraint identification
- Blooming Cycle: Exploration of multiple solution approaches
- Reconstruction Cycles: Self-correction, rumination, "aha moments"
- Final Decision: Commitment to answer after verification
"Aha Moments" - Emergent Self-Correction
One of R1's most remarkable behaviors is spontaneous error correction. During reasoning, the model will:
- Recognize when an approach is failing
- Explicitly state "Wait, this doesn't seem right..."
- Backtrack and try alternative methods
- Verify final answers against original constraints
This emerged purely from reinforcement learning; DeepSeek did not pre-program these behaviors.
Example Thinking Output
<think>
Let me break down this math problem...
First, I need to identify the variables: x represents...
Wait, I should check my assumption about...
Actually, that approach won't work because...
Let me try a different method using...
Now I can verify: if x = 5, then...
Yes, this satisfies all constraints.
</think>
The answer is x = 5, which I verified by substituting back into the original equation.
Controlling Thinking Mode
- Temperature 0.6 (recommended): Balanced reasoning
- Temperature 0.3-0.5: More focused, less exploration
- Temperature 0.7-0.8: More creative, more exploration
- Prompt engineering: "Think step by step" encourages extended reasoning
- V3.1 hybrid: Automatically decides when to think based on complexity
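These sampling settings can be passed per request through the Ollama API rather than baked into a Modelfile. A minimal sketch (assuming the ollama Python package and the deepseek-r1:32b tag):
import ollama
# Pass the recommended reasoning settings for this request only
response = ollama.chat(
    model='deepseek-r1:32b',
    messages=[{'role': 'user', 'content': 'Think step by step: how many primes are below 30?'}],
    options={'temperature': 0.6, 'top_p': 0.95}
)
print(response['message']['content'])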
Complete Model Specifications
Full 671B Model
| Specification | Value |
|---|---|
| Total Parameters | 671B (originally 685B) |
| Active Parameters | 37B per token |
| Architecture | Mixture of Experts (MoE) |
| Expert Count | 256 experts |
| Active Experts | 8 per token |
| Context Window | 128K-160K tokens |
| Training Tokens | 14.8 trillion |
| Training Cost | ~$5.6 million |
| License | MIT |
Distilled Model Specifications
| Model | Parameters | Download Size | Base Architecture | Context |
|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | 1.5B | 1.1GB | Qwen-2.5 | 128K |
| R1-Distill-Qwen-7B | 7B | 4.7GB | Qwen-2.5 | 128K |
| R1-Distill-Llama-8B | 8B | 5.2GB | Llama 3.1-8B | 128K |
| R1-Distill-Qwen-14B | 14B | 9.0GB | Qwen-2.5 | 128K |
| R1-Distill-Qwen-32B | 32B | 20GB | Qwen-2.5 | 128K |
| R1-Distill-Llama-70B | 70B | 43GB | Llama 3.3-70B | 128K |
Step-by-Step Local Setup with Ollama
Step 1: Install Ollama
macOS/Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com/download and run it.
Verify installation:
ollama --version
# Should show: ollama version 0.5.x or higher
Step 2: Choose and Pull Your Model
Select based on your VRAM:
# 4GB VRAM - Basic reasoning
ollama pull deepseek-r1:1.5b
# 8GB VRAM - Good balance (recommended starting point)
ollama pull deepseek-r1:8b
# 12-16GB VRAM - Better reasoning
ollama pull deepseek-r1:14b
# 24GB VRAM - Best local experience
ollama pull deepseek-r1:32b
# 48GB+ VRAM - Near-full capability
ollama pull deepseek-r1:70b
# 400GB+ - Full model (enterprise hardware)
ollama pull deepseek-r1:671b
Step 3: Run the Model
ollama run deepseek-r1:32b
Test with a reasoning problem:
A farmer has 17 sheep. All but 9 run away. How many sheep does the farmer have left?
Watch R1 reason through the problem; it will work through its thinking process before answering correctly: 9 sheep.
Step 4: Advanced Configuration
Create a custom Modelfile for optimized settings:
cat > Modelfile << 'EOF'
FROM deepseek-r1:32b
# Optimal settings for reasoning
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
# System prompt for reasoning tasks
SYSTEM """You are DeepSeek R1, an advanced reasoning assistant.
For complex problems:
1. Break down the problem systematically
2. Show your reasoning process clearly
3. Verify your answer before finalizing
4. If you notice an error, correct it explicitly
Think carefully and explain your logic step by step."""
EOF
# Create optimized model
ollama create deepseek-r1-reasoning -f Modelfile
# Run optimized version
ollama run deepseek-r1-reasoning
Step 5: Pull Latest R1-0528 Version
For the improved May 2025 version with better JSON and function calling:
# Unsloth optimized versions
ollama pull hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q4_K_M
# Or the Qwen3 8B distilled with improvements
ollama pull hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL
VRAM Requirements: Complete Guide
Distilled Models by Quantization
| Model | FP16 | Q8_0 | Q5_K_M | Q4_K_M | Minimum GPU |
|---|---|---|---|---|---|
| R1 1.5B | 3GB | 2GB | 1.5GB | 1.2GB | GTX 1060 6GB |
| R1 7B | 14GB | 8GB | 6GB | 5GB | RTX 3060 8GB |
| R1 8B | 16GB | 9GB | 7GB | 6GB | RTX 3060 12GB |
| R1 14B | 28GB | 15GB | 11GB | 9GB | RTX 4060 Ti 16GB |
| R1 32B | 64GB | 34GB | 24GB | 20GB | RTX 4090 24GB |
| R1 70B | 140GB | 75GB | 52GB | 42GB | 2x RTX 4090 |
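As a rough rule of thumb, weight memory is parameters × bits-per-weight ÷ 8, plus headroom for the KV cache and activations. A back-of-the-envelope sketch (the ~4.5 effective bits for Q4_K_M and the 20% overhead factor are assumptions, not measured values):
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    # Weights: params * bits / 8, then ~20% headroom for KV cache and activations
    return round(params_billion * bits_per_weight / 8 * overhead, 1)
print(estimate_vram_gb(32, 4.5))  # ~21.6GB, close to the 20GB listed for R1 32B Q4_K_M
print(estimate_vram_gb(8, 4.5))   # ~5.4GB, in line with the 6GB listed for R1 8B Q4_K_M
Real usage grows with context length, so budget extra VRAM for long prompts.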
Full 671B Model Quantization Options
| Quantization | Size | VRAM Required | Setup | Tokens/sec |
|---|---|---|---|---|
| FP16/BF16 | ~1,400GB | 1,500-1,800GB | 20x H100 80GB | 200+ |
| FP8 | ~700GB | ~700GB | 9x H100 80GB | 180+ |
| Q4_K_M (4-bit) | ~400GB | ~400GB | 8x H100 80GB | 150+ |
| 2.51-bit Dynamic | ~212GB | ~212GB | 3x H100 80GB | 80+ |
| IQ1_M (1.78-bit) | ~183GB | 183GB + RAM | 2x H100 + offload | 40+ |
| TQ1_0 (1.66-bit) | ~162GB | 162GB | 192GB Mac Ultra | 2-3 |
| 1.58-bit Dynamic | ~131GB | 131GB | 2x RTX 4090 + 128GB RAM | 1-5 |
Consumer Hardware Configurations
| Budget | Hardware | Best R1 Version | Performance |
|---|---|---|---|
| $300 | RTX 3060 12GB | R1 8B Q4_K_M | 30 tok/s |
| $500 | RTX 4060 Ti 16GB | R1 14B Q4_K_M | 32 tok/s |
| $800 | RTX 4070 Ti Super 16GB | R1 14B Q5_K_M | 38 tok/s |
| $1,200 | RTX 4080 Super 16GB | R1 14B Q8_0 | 40 tok/s |
| $1,600 | RTX 4090 24GB | R1 32B Q4_K_M | 28 tok/s |
| $3,000 | 2x RTX 4090 + 128GB RAM | R1 671B 1.58-bit | 1-5 tok/s |
Apple Silicon Performance
| Mac | Memory | Best R1 Version | Performance |
|---|---|---|---|
| M1/M2 8GB | 8GB | R1 1.5B Q4 | 45 tok/s |
| M1/M2 16GB | 16GB | R1 8B Q4 | 18 tok/s |
| M2/M3 Pro 32GB | 32GB | R1 14B Q5 | 22 tok/s |
| M3 Max 64GB | 64GB | R1 32B Q4 | 20 tok/s |
| M3 Max 128GB | 128GB | R1 70B Q4 | 12 tok/s |
| M3 Ultra 192GB | 192GB | R1 671B TQ1_0 | 2-3 tok/s |
Integration Options
Open WebUI (ChatGPT-like Interface)
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Access at http://localhost:3000 and select deepseek-r1:32b from the model dropdown.
VS Code Integration (Continue Extension)
- Install the Continue extension
- Open Continue settings (Ctrl+Shift+P > "Continue: Open Config")
- Add configuration:
{
"models": [
{
"title": "DeepSeek R1 32B",
"provider": "ollama",
"model": "deepseek-r1:32b",
"contextLength": 8192
}
]
}
Python API Integration
import ollama
# Basic chat
response = ollama.chat(
model='deepseek-r1:32b',
messages=[{
'role': 'user',
'content': 'Solve step by step: What is the derivative of x^3 * sin(x)?'
}]
)
print(response['message']['content'])
# Streaming for long reasoning
for chunk in ollama.chat(
model='deepseek-r1:32b',
messages=[{'role': 'user', 'content': 'Your complex question here'}],
stream=True
):
print(chunk['message']['content'], end='', flush=True)
OpenAI-Compatible API
Ollama exposes an OpenAI-compatible endpoint:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # any string works
)
response = client.chat.completions.create(
model="deepseek-r1:32b",
messages=[{"role": "user", "content": "Explain quantum entanglement"}],
temperature=0.6
)
print(response.choices[0].message.content)
llama.cpp Direct (Advanced)
For maximum control with the full 671B model:
./llama.cpp/llama-cli \
--model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 16 \
--temp 0.6 \
--ctx-size 8192 \
--n-gpu-layers 7 \
-no-cnv \
--prompt "<|User|>Your prompt here<|Assistant|>"
Real-World Use Cases
1. Mathematical Problem Solving
R1 excels at competition math. Feed it IMO problems, calculus derivations, or statistical analysis tasks. Use prompts like: "Please reason step by step, and put your final answer within \boxed{}."
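A minimal sketch of that workflow (assuming the ollama Python package; the regex only handles un-nested braces):
import re
import ollama
prompt = r'If 3x + 7 = 22, what is x? Please reason step by step, and put your final answer within \boxed{}.'
raw = ollama.chat(
    model='deepseek-r1:32b',
    messages=[{'role': 'user', 'content': prompt}]
)['message']['content']
# Grab the last \boxed{...} in the response (the final answer after any self-correction)
answers = re.findall(r'\\boxed\{([^}]*)\}', raw)
print(answers[-1] if answers else raw)  # expected: 5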
2. Code Architecture and Debugging
Ask R1 to design system architectures or debug complex logic. Its reasoning shows the "why" behind decisions, which is invaluable for learning and code review.
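For example, a minimal sketch (assumed prompt and model tag) that hands R1 a subtly buggy function and asks for the reasoning behind the fix:
import ollama
buggy = '''
def moving_average(xs, window):
    return [sum(xs[i:i + window]) / window for i in range(len(xs))]
'''
msg = f'Find the bug in this function, explain why it is wrong, and propose a fix:\n{buggy}'
reply = ollama.chat(model='deepseek-r1:32b', messages=[{'role': 'user', 'content': msg}])
print(reply['message']['content'])
The interesting part is the visible reasoning, where R1 will typically notice that the tail slices contain fewer than window elements yet are still divided by window.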
3. Legal and Contract Analysis
R1 can parse complex documents, identify issues, and explain implications step-by-step. The visible reasoning creates an audit trail.
4. Scientific Research Assistance
Use for hypothesis evaluation, experimental design, and paper analysis. R1's 71.5% on GPQA Diamond (PhD-level science) demonstrates deep technical understanding.
5. Educational Tutoring
R1's visible reasoning makes it ideal for teaching. Students see the problem-solving process, not just answers. Perfect for math, physics, and programming education.
6. Complex Decision Analysis
Multi-criteria decisions benefit from R1's systematic reasoning. It naturally explores trade-offs and edge cases.
DeepSeek R1 vs Other Models
Reasoning Model Comparison
| Feature | DeepSeek R1 | OpenAI o1 | Claude 3.5 | GPT-4 |
|---|---|---|---|---|
| Visible Reasoning | Yes (<think> tokens) | No | No | No |
| AIME 2024 | 79.8% | 79.2% | 16.0% | 9.3% |
| MATH-500 | 97.3% | 96.4% | 78.3% | 74.6% |
| Codeforces Elo | 2,029 | - | 1,886 | 1,891 |
| License | MIT (Open) | Proprietary | Proprietary | Proprietary |
| Local Running | Yes | No | No | No |
| API Cost (1M tokens) | $1.10-$2.19 | ~$15-60 | ~$15-18 | ~$30-60 |
Local Model Comparison
| Model | VRAM (Q4) | Reasoning | Coding | Speed (4090) |
|---|---|---|---|---|
| DeepSeek R1 32B | 20GB | Excellent | Excellent | 28 tok/s |
| Llama 3.1 70B | 42GB | Good | Excellent | 15 tok/s |
| Qwen 2.5 72B | 44GB | Good | Excellent | 14 tok/s |
| Mistral Large | 18GB | Good | Good | 32 tok/s |
| Phi-4 14B | 10GB | Good | Good | 45 tok/s |
Verdict: For reasoning-heavy tasks, R1 32B is unmatched in the "runs on a single RTX 4090" category.
Troubleshooting Common Issues
Model Loads Slowly
# Pre-load into memory and keep warm
ollama run deepseek-r1:32b "warmup" --keepalive 1h
Out of Memory Errors
# Use smaller quantization
ollama pull deepseek-r1:32b-q4_0
# Reduce context window (from inside the ollama run session)
ollama run deepseek-r1:32b
>>> /set parameter num_ctx 4096
Slow Generation Speed
# Verify GPU is being used
ollama ps
# Force all layers onto the GPU (from inside the ollama run session)
ollama run deepseek-r1:32b
>>> /set parameter num_gpu 999
# Check CUDA installation
nvidia-smi
Thinking Tokens Not Visible
Some interfaces hide <think> tags. Check settings for:
- "Show reasoning" or "Show thinking"
- "Raw output mode"
- Or use the CLI directly to see full output
Model Gives Truncated Responses
# Increase max output tokens (from inside the ollama run session)
ollama run deepseek-r1:32b
>>> /set parameter num_predict 4096
JSON Output Issues (Pre-R1-0528)
Upgrade to R1-0528 which has proper JSON support:
ollama pull hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q4_K_M
Key Takeaways
- DeepSeek R1 is the best open-source reasoning model, matching OpenAI o1 on complex math and coding benchmarks
- The 32B distilled version is ideal for single-GPU setups (RTX 4090 or 64GB Mac)
- Visible <think> tokens make R1 uniquely transparent; you can watch it reason
- MIT license means completely free commercial use with no API costs
- R1-0528 update fixed JSON, function calling, and reduced hallucinations
- Full 671B is runnable on consumer hardware with extreme quantization (1.58-bit on 2x 4090 + RAM)
- V4 expected February 2026 with enhanced coding capabilities
Next Steps
- Set up RAG for document analysis with DeepSeek R1
- Build AI agents using R1's reasoning capabilities
- Compare with Llama 4 for your specific use case
- Understand MoE architecture behind DeepSeek's efficiency
- Check VRAM requirements for different model configurations
- Learn about MCP servers to extend R1's capabilities
DeepSeek R1 represents a paradigm shift in open-source AI. For the first time, anyone can run a model that rivals the best closed-source systems for complex reasoning: completely free, completely private, completely yours. The visible reasoning process makes it uniquely suited for education, debugging, and trust-critical applications. Whether you're running the 8B distilled on an 8GB GPU or the full 671B on enterprise hardware, R1 delivers reasoning capabilities that were impossible to access locally just a year ago.