Run DeepSeek R1 Locally: Complete Ollama Guide
Published on April 10, 2026 • 25 min read
Quick Start: DeepSeek Running in 90 Seconds
Two commands to get DeepSeek R1 on your machine:
- Pull the model:
ollama pull deepseek-r1:8b
- Start reasoning:
ollama run deepseek-r1:8b "Solve: if 3x + 7 = 22, what is x?"
Watch it think through the problem step by step, then give you the answer. All running on your hardware.
What you'll learn:
- The full DeepSeek model family and which variants run locally
- Exact VRAM requirements for every distilled R1 and V3 model
- How thinking mode works and when to use it
- Real performance benchmarks on consumer hardware
- When R1 beats V3 (and vice versa)
- Cost comparison: local vs DeepSeek API vs OpenAI
DeepSeek shook the AI industry when they released R1 -- a reasoning model that matches OpenAI's o1 on math and coding benchmarks, released under the MIT license. The full R1 is a 671 billion parameter Mixture of Experts model that needs 320GB+ of VRAM. Nobody is running that on a desktop.
But here's the thing that matters: DeepSeek distilled R1's reasoning capabilities into smaller models ranging from 1.5B to 70B parameters. These distilled variants preserve a surprising amount of R1's step-by-step reasoning ability while fitting on consumer hardware. The 8B distill, based on Llama 3.1 architecture, runs on any machine with 6GB of VRAM.
For a detailed comparison of the DeepSeek model generations, see our DeepSeek V3 vs V3.1 analysis.
Table of Contents
- The DeepSeek Model Family
- MIT License: Why It Matters
- VRAM Requirements
- Ollama Setup
- Thinking Mode Explained
- Performance Benchmarks
- R1 vs V3: When to Use Which
- Hardware Recommendations
- Cost Comparison: Local vs API
- Advanced Configuration
The DeepSeek Model Family {#deepseek-family}
DeepSeek has released two main model lines, each with a different purpose:
DeepSeek R1 (Reasoning)
R1 is a reasoning model. It thinks through problems step by step before answering -- similar to OpenAI's o1. The "thinking" process is visible in the output, wrapped in <think> tags.
| Model | Parameters | Architecture | Purpose |
|---|---|---|---|
| DeepSeek R1 | 671B (MoE, 37B active) | MoE Transformer | Full reasoning model |
| R1-Distill-Qwen-1.5B | 1.5B | Qwen 2.5 | Ultra-lightweight reasoning |
| R1-Distill-Qwen-7B | 7B | Qwen 2.5 | Balanced reasoning |
| R1-Distill-Llama-8B | 8B | Llama 3.1 | Popular local choice |
| R1-Distill-Qwen-14B | 14B | Qwen 2.5 | Strong reasoning |
| R1-Distill-Qwen-32B | 32B | Qwen 2.5 | Near-full R1 quality |
| R1-Distill-Llama-70B | 70B | Llama 3.3 | Maximum distilled quality |
DeepSeek V3 / V3.1 (General Purpose)
V3 is a general-purpose chat model optimized for fast, helpful responses without explicit reasoning chains.
| Model | Parameters | Architecture | Key Feature |
|---|---|---|---|
| DeepSeek V3 | 671B (MoE) | MoE Transformer | Original release |
| DeepSeek V3.1 | 671B (MoE) | MoE Transformer | Improved instruction following |
The full 671B models need server-grade hardware (multiple A100/H100 GPUs). For local use, the distilled R1 variants are the practical path. They're not watered-down versions -- they're specifically trained to compress R1's reasoning patterns into architectures that fit on consumer GPUs.
MIT License: Why It Matters {#mit-license}
DeepSeek released R1 and its distilled variants under the MIT license. This is the most permissive license in the open model ecosystem:
- Commercial use: Build and sell products with no restrictions
- Modification: Fine-tune, merge, quantize without permission
- Distribution: Redistribute the model weights freely
- No attribution required: You don't even need to credit DeepSeek
- No usage restrictions: Unlike Llama's license, there's no user count cap
Compare this to Llama (restricted to 700M monthly active users, no using outputs for training) or Gemma (can't train competing foundation models). MIT is simply "do whatever you want."
This makes DeepSeek models particularly attractive for commercial applications where licensing complexity is a liability.
VRAM Requirements {#vram-requirements}
Measured with Ollama using default quantization (Q4_K_M). These are actual runtime numbers, not theoretical minimums.
R1 Distilled Models
| Model | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|
| R1-Distill 1.5B | 1.4GB | 1.6GB | 2.2GB | 3.2GB |
| R1-Distill 7B | 4.9GB | 5.6GB | 8.0GB | 14.4GB |
| R1-Distill 8B | 5.4GB | 6.2GB | 8.8GB | 16.2GB |
| R1-Distill 14B | 9.2GB | 10.6GB | 15.4GB | 28.8GB |
| R1-Distill 32B | 19.8GB | 22.8GB | 33.2GB | 64.6GB |
| R1-Distill 70B | 42.5GB | 49.0GB | 71.2GB | 140GB |
Practical Hardware Mapping
| Your Hardware | Best Model | Quality |
|---|---|---|
| 6GB GPU (RTX 2060) | R1-Distill 1.5B or 7B (tight) | Basic reasoning |
| 8GB GPU / 8GB Mac | R1-Distill 7B or 8B (Q4) | Good reasoning |
| 12GB GPU (RTX 3060/4070) | R1-Distill 8B (Q8) or 14B (Q4) | Strong reasoning |
| 16GB Mac | R1-Distill 14B (Q4) | Excellent reasoning |
| 24GB GPU (RTX 4090) | R1-Distill 14B (Q8) or 32B (Q4) | Near-full R1 |
| 32GB Mac | R1-Distill 32B (Q4) | Outstanding |
| 48GB+ GPU | R1-Distill 70B (Q4) | Maximum quality |
The sweet spot for most users is the 8B distill on a 12GB GPU or the 14B distill on a 24GB GPU. The 32B distill is the quality inflection point -- it captures roughly 90% of full R1's reasoning capability on math and coding benchmarks.
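The VRAM table roughly follows weight-size arithmetic: parameters times bits per weight, divided by 8, plus overhead for the KV cache and runtime buffers. A quick sanity-check sketch -- the 4.8 bits/weight average for Q4_K_M and the 15% overhead factor are approximations, not Ollama internals:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.15) -> float:
    """Rough VRAM estimate: weight bytes plus a fudge factor for cache/buffers."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return round(weight_gb * overhead, 1)

# Q4_K_M averages roughly 4.8 bits per weight in practice
print(estimate_vram_gb(8, 4.8))    # ~5.5, close to the measured 5.4GB
print(estimate_vram_gb(32, 4.8))   # ~22.1, in the ballpark of the measured 19.8GB
```

The estimate lands within about 15% of the measured numbers above; real usage depends on context length and quantization details.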
For a broader look at hardware sizing, check our AI hardware requirements guide.
Ollama Setup {#ollama-setup}
Install Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Start the server
ollama serve
Pull DeepSeek Models
# R1 Distilled models (reasoning)
ollama pull deepseek-r1:1.5b # 1.5B - runs on anything
ollama pull deepseek-r1:7b # 7B Qwen-based
ollama pull deepseek-r1:8b # 8B Llama-based (recommended)
ollama pull deepseek-r1:14b # 14B - strong mid-range
ollama pull deepseek-r1:32b # 32B - best quality/size ratio
ollama pull deepseek-r1:70b # 70B - maximum quality
# Specific quantization
ollama pull deepseek-r1:8b-q8_0 # Higher quality
ollama pull deepseek-r1:14b-q4_K_M # Fits in 12GB
Verify and Test
# List downloaded models
ollama list
# Run a reasoning test
ollama run deepseek-r1:8b "A bat and a ball cost $1.10 total. The bat costs $1 more than the ball. How much does the ball cost?"
# You should see <think> tags showing reasoning steps
# Correct answer: $0.05 (not $0.10 - this is a classic cognitive reflection test)
Create a Custom Configuration
cat > Modelfile-deepseek << 'EOF'
FROM deepseek-r1:8b
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.05
SYSTEM "You are a precise reasoning assistant. Think through problems carefully. When solving math or logic problems, show your work step by step."
EOF
ollama create my-deepseek -f Modelfile-deepseek
ollama run my-deepseek
Thinking Mode Explained {#thinking-mode}
The defining feature of DeepSeek R1 is its "thinking" capability. When the model encounters a problem, it generates a chain of reasoning inside <think> tags before producing the final answer.
What Thinking Looks Like
User: What is 247 * 183?
<think>
I need to multiply 247 by 183.
Let me break this down:
247 * 183 = 247 * (180 + 3)
= 247 * 180 + 247 * 3
= (247 * 18) * 10 + 741
= (250 * 18 - 3 * 18) * 10 + 741
= (4500 - 54) * 10 + 741
= 4446 * 10 + 741
= 44460 + 741
= 45201
</think>
247 * 183 = **45,201**
When Thinking Helps (and When It Doesn't)
Use R1 with thinking for:
- Math problems and calculations
- Logic puzzles and constraint satisfaction
- Code debugging (traces through execution)
- Multi-step planning
- Scientific reasoning
Skip thinking for:
- Simple factual questions ("What is the capital of Japan?")
- Creative writing
- Translation
- Casual conversation
- Tasks where speed matters more than accuracy
Thinking mode adds 2-10x more tokens to the response. A question that V3 answers in 50 tokens might generate 300+ tokens with R1's reasoning chain. This means slower effective responses and higher memory usage for the KV cache.
Controlling Thinking in the API
# With thinking (default for R1)
curl http://localhost:11434/api/generate -d '{
"model": "deepseek-r1:8b",
"prompt": "Solve: If a train leaves at 2pm going 60mph and another at 3pm going 90mph, when do they meet?",
"stream": false
}'
# The response includes <think>...</think> followed by the answer
Parsing Thinking Output
import re
response = "... full model output ..."
# Extract just the thinking
thinking = re.findall(r'<think>(.*?)</think>', response, re.DOTALL)
# Extract just the answer (after all think blocks)
answer = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
print(f"Reasoning: {thinking[0] if thinking else 'None'}")
print(f"Answer: {answer}")
Performance Benchmarks {#benchmarks}
Real measurements on consumer hardware. All at Q4_K_M quantization unless noted.
Generation Speed (tokens/second)
| Model | RTX 3060 12GB | RTX 4070 12GB | RTX 4090 24GB | M3 Pro 18GB | M4 Pro 24GB |
|---|---|---|---|---|---|
| R1 1.5B | 128 tok/s | 178 tok/s | 241 tok/s | 108 tok/s | 124 tok/s |
| R1 8B | 38 tok/s | 56 tok/s | 88 tok/s | 32 tok/s | 44 tok/s |
| R1 14B | CPU offload | 22 tok/s* | 48 tok/s | 18 tok/s | 28 tok/s |
| R1 32B | -- | -- | 16 tok/s* | -- | 10 tok/s* |
| R1 70B | -- | -- | -- | -- | -- |
*Partial GPU offload
Note: R1 models produce more tokens per response than non-reasoning models due to the thinking chain. A response that feels "fast enough" needs about 20+ tok/s because the thinking portion streams first, then the answer appears.
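That 20 tok/s threshold is easy to sanity-check for your own workload: divide the expected thinking-plus-answer token count by your measured generation speed. A minimal sketch, where the token counts are illustrative assumptions:

```python
def response_seconds(thinking_tokens: int, answer_tokens: int,
                     tok_per_s: float) -> float:
    """Wall time to stream the full reasoning chain plus the final answer."""
    return (thinking_tokens + answer_tokens) / tok_per_s

# A typical R1 math problem: ~800 thinking tokens, ~150 answer tokens
print(round(response_seconds(800, 150, 38), 1))  # RTX 3060 at 38 tok/s -> 25.0 s
print(round(response_seconds(800, 150, 88), 1))  # RTX 4090 at 88 tok/s -> 10.8 s
```

At 38 tok/s you wait nearly half a minute for a full reasoning chain, which is why the faster tiers matter more for R1 than for non-reasoning models.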
Reasoning Accuracy Benchmarks
| Benchmark | R1 1.5B | R1 8B | R1 14B | R1 32B | R1 70B | Full R1 671B |
|---|---|---|---|---|---|---|
| MATH-500 | 42.1 | 68.5 | 78.2 | 86.4 | 90.1 | 97.3 |
| AIME 2024 | 12.5 | 38.7 | 52.3 | 68.9 | 78.4 | 79.8 |
| HumanEval | 45.2 | 68.9 | 76.4 | 82.1 | 85.6 | 96.3 |
| GPQA Diamond | 28.3 | 44.7 | 52.8 | 58.6 | 64.2 | 71.5 |
| GSM8K | 68.4 | 85.2 | 90.7 | 94.1 | 95.8 | 97.4 |
The 32B distill hits roughly 85-95% of the full R1's performance on reasoning benchmarks. That's remarkable -- you get near-frontier reasoning on a single consumer GPU.
R1 vs V3: When to Use Which {#r1-vs-v3}
This is the question everyone asks. Here's a practical framework:
Head-to-Head Comparison
| Task | R1 (Reasoning) | V3 (General) | Winner |
|---|---|---|---|
| Math problems | Step-by-step, high accuracy | Quick but error-prone | R1 |
| Code debugging | Traces execution, finds subtle bugs | Spots obvious issues | R1 |
| Logic puzzles | Methodical, reliable | Sometimes guesses | R1 |
| Chat/conversation | Verbose, slow | Natural, fast | V3 |
| Creative writing | Overthinks | Flows naturally | V3 |
| Summarization | Unnecessary reasoning overhead | Clean and fast | V3 |
| Translation | Adds unwanted analysis | Direct output | V3 |
| API response speed | 2-10x more tokens | Minimal tokens | V3 |
Practical Rule
Use R1 when the answer requires multiple logical steps or when accuracy on hard problems matters more than speed. Use V3 for everything else.
If you're building an application, consider routing: send math/code/logic queries to R1 and general queries to a fast general-purpose model (or to V3 via the API). Ollama makes the local side trivial -- multiple models can stay loaded simultaneously and you pick one per request.
Running Both Side by Side
Note that there is no distilled V3 to pull locally -- V3 ships only as the full 671B model. A practical stand-in for the general-purpose role is a similarly sized chat model such as Llama 3.1 8B:
# Pull both
ollama pull deepseek-r1:8b
ollama pull llama3.1:8b
# In your code, choose per request
curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:8b", "prompt": "complex math..."}'
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "prompt": "summarize this..."}'
Ollama keeps both models loaded if you have enough memory, so switching between them per request is effectively instant.
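The routing idea can be sketched in a few lines. This is a naive keyword router, not a production classifier; the model names are assumptions (R1's 8B distill plus Llama 3.1 8B as the general-purpose stand-in, since no local V3 distill exists), and the endpoint is Ollama's standard `/api/generate`:

```python
import json
import re
import urllib.request

REASONING_MODEL = "deepseek-r1:8b"  # reasoning-heavy queries
GENERAL_MODEL = "llama3.1:8b"       # assumed general-purpose stand-in

def pick_model(prompt: str) -> str:
    """Naive keyword router: reasoning-heavy prompts go to R1."""
    if re.search(r"\b(prove|solve|debug|calculate|step by step)\b", prompt, re.I):
        return REASONING_MODEL
    return GENERAL_MODEL

def ask(prompt: str) -> str:
    """Send the prompt to whichever local model the router picks."""
    body = json.dumps({"model": pick_model(prompt), "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(pick_model("Solve: 3x + 7 = 22"))        # deepseek-r1:8b
print(pick_model("Summarize this paragraph"))  # llama3.1:8b
```

A real application would route on something sturdier than keywords -- a small classifier or an explicit mode toggle -- but the per-request model switch is exactly this simple with Ollama.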
Hardware Recommendations {#hardware}
Budget Build ($300-500)
- GPU: Used RTX 3060 12GB ($180-220)
- Models: R1-Distill 8B at Q4_K_M
- Performance: 38 tok/s generation
- Good for: Personal use, coding assistance, homework help
Mid-Range Build ($800-1200)
- GPU: RTX 4070 Ti Super 16GB ($700-800)
- Models: R1-Distill 14B at Q4_K_M
- Performance: 35 tok/s generation
- Good for: Professional development, research, small team
High-End Build ($1500-2500)
- GPU: RTX 4090 24GB ($1600-1900) or RTX 5090 32GB
- Models: R1-Distill 32B at Q4_K_M
- Performance: 16-32 tok/s generation
- Good for: Near-frontier reasoning, production workloads
Apple Silicon Path
- Mac Mini M4 Pro 24GB ($1,599): R1-Distill 14B comfortably
- Mac Studio M4 Max 64GB (~$3,000): R1-Distill 32B with room to spare
- Mac Studio M4 Ultra 128GB (~$5,000+): R1-Distill 70B
Apple Silicon is cost-effective for the 32B+ models because unified memory eliminates the VRAM bottleneck. A Mac Studio with 64GB running the 32B distill at Q5 is a very capable reasoning workstation.
For detailed GPU comparisons, read our best local AI models for 8GB RAM guide.
Cost Comparison: Local vs API {#cost-comparison}
DeepSeek API Pricing (as of April 2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| DeepSeek R1 | $0.55 | $2.19 |
| DeepSeek V3 | $0.27 | $1.10 |
Break-Even Analysis
Assume you generate 500,000 tokens/day (heavy individual use):
| Scenario | Monthly Cost |
|---|---|
| DeepSeek R1 API | ~$45/month (at avg. input/output mix) |
| OpenAI o1 API | ~$450/month |
| Local R1-Distill 8B (electricity only) | ~$8-15/month |
| Local R1-Distill 32B (electricity only) | ~$15-25/month |
Hardware amortization on a used RTX 3060 ($200) over 12 months: $16.67/month. Total local cost for R1-Distill 8B: $25-32/month -- roughly matching DeepSeek's API but with zero data privacy concerns and no rate limits.
The economic argument for running locally gets stronger at higher volumes. If your team generates 5M+ tokens/day, local hardware pays for itself in under 3 months versus the API.
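The break-even arithmetic can be reproduced with a few lines. This sketch uses the R1 pricing from the table; the input/output mix and electricity figures are assumptions, and the result moves substantially with the mix (R1's long thinking chains skew usage toward the pricier output side):

```python
def api_monthly_cost(tokens_per_day: int, input_price: float, output_price: float,
                     input_frac: float = 0.3) -> float:
    """Monthly API spend; prices are per 1M tokens, input_frac sets the token mix."""
    monthly_millions = tokens_per_day * 30 / 1_000_000
    blended = input_frac * input_price + (1 - input_frac) * output_price
    return monthly_millions * blended

def local_monthly_cost(hardware_price: float, amortize_months: int,
                       electricity: float = 12.0) -> float:
    """Amortized hardware plus an assumed monthly electricity estimate."""
    return hardware_price / amortize_months + electricity

print(round(api_monthly_cost(500_000, 0.55, 2.19), 2))  # DeepSeek R1 API
print(round(local_monthly_cost(200, 12), 2))            # used RTX 3060 over a year
```

Plug in your own daily volume and amortization window; the local curve is nearly flat with volume while the API curve is linear, which is the whole argument.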
The Real Advantage: Privacy and Control
Cost savings aside, local deployment means:
- No data leaves your network
- No API rate limits during peak usage
- No dependency on external service availability
- Instant response (no network latency)
- Full model customization (system prompts, fine-tuning)
Advanced Configuration {#advanced-config}
Optimizing Context Length
R1's thinking chains consume context. For complex problems, you may need more than the default 4096 tokens:
# Increase the context window from inside the REPL
ollama run deepseek-r1:8b
>>> /set parameter num_ctx 16384
# Or set it permanently in a Modelfile: PARAMETER num_ctx 16384
# Warning: VRAM usage grows with context
# 8B Q4 with 4K context: 5.4GB
# 8B Q4 with 16K context: ~8.2GB
# 8B Q4 with 32K context: ~11GB
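Most of that growth is the KV cache, which scales linearly with context length. A back-of-the-envelope sketch, assuming Llama 3.1 8B dimensions (32 layers, 8 KV heads, head dim 128) and fp16 cache entries -- measured usage runs higher because of attention scratch buffers and other runtime overhead:

```python
def kv_cache_gb(ctx: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """KV cache bytes: 2 tensors (K and V) * layers * kv_heads * head_dim * ctx."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_val / 1024**3

for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} ctx -> {kv_cache_gb(ctx):.2f} GB KV cache")
```

Quadrupling the context quadruples the cache, which is why long multi-turn reasoning sessions hit memory limits well before the weights themselves are the problem.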
Batch Processing
import requests
import json

problems = [
    "Prove that sqrt(2) is irrational",
    "Find all prime factors of 2310",
    "Write a function to detect cycles in a linked list",
]

for problem in problems:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-r1:8b",
            "prompt": problem,
            "stream": False,
            "options": {
                "temperature": 0.6,
                "num_ctx": 8192
            }
        }
    )
    result = json.loads(response.text)
    print(f"\n--- {problem[:50]}... ---")
    print(result["response"][:500])
GPU Layer Configuration
# Force specific number of GPU layers (useful for partial offload)
cat > Modelfile-r1-custom << 'EOF'
FROM deepseek-r1:14b
PARAMETER num_gpu 28
PARAMETER num_ctx 8192
EOF
ollama create r1-custom -f Modelfile-r1-custom
Running on Multiple GPUs
# Set visible GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama serve
# Ollama automatically splits layers across available GPUs
# 2x RTX 3060 12GB = 24GB effective → R1-Distill 14B at Q8
Troubleshooting
Thinking Output Is Too Long
# Cap generation length from inside the REPL
ollama run deepseek-r1:8b
>>> /set parameter num_predict 2048
# Or in a Modelfile
PARAMETER num_predict 2048
Model Runs Slow Despite Having Enough VRAM
# Check if model is fully on GPU
ollama ps
# If "processor" shows "cpu" or mixed, you may have VRAM fragmentation
# Restart Ollama to clear
systemctl restart ollama # Linux
brew services restart ollama # macOS
Out of Memory During Long Conversations
R1's thinking chains can fill the KV cache quickly in multi-turn conversations. Solutions:
# Reduce the context window from inside the REPL
ollama run deepseek-r1:8b
>>> /set parameter num_ctx 4096
# Or start a new conversation (context resets)
# In the API: don't pass the returned context array back into the next request
Conclusion
DeepSeek R1's distilled models bring genuine reasoning capabilities to consumer hardware. The 8B distill on a $200 used GPU gives you a model that solves math problems, debugs code, and works through logic puzzles with visible step-by-step thinking. The 32B distill on an RTX 4090 or Mac Studio approaches frontier reasoning quality.
The MIT license removes every commercial barrier. The thinking chains give you interpretable AI -- you can see exactly how the model arrived at its answer, which is invaluable for debugging and trust.
Start with ollama pull deepseek-r1:8b, throw some hard problems at it, and watch the <think> blocks work. If you find the reasoning compelling but want more accuracy, scale up to 14B or 32B.
Full model weights and technical reports are available on the DeepSeek-R1 GitHub repository and DeepSeek's HuggingFace organization.
Need help choosing between DeepSeek and other model families? Our best local AI models for 8GB RAM guide ranks every major option by real-world performance on consumer hardware.