Model Guide

Run DeepSeek R1 Locally: Complete Ollama Guide

April 10, 2026
25 min read
Local AI Master Research Team



Quick Start: DeepSeek Running in 90 Seconds

Two commands to get DeepSeek R1 on your machine:

  1. Pull the model: ollama pull deepseek-r1:8b
  2. Start reasoning: ollama run deepseek-r1:8b "Solve: if 3x + 7 = 22, what is x?"

Watch it think through the problem step by step, then give you the answer. All running on your hardware.


What you'll learn:

  • The full DeepSeek model family and which variants run locally
  • Exact VRAM requirements for every distilled R1 and V3 model
  • How thinking mode works and when to use it
  • Real performance benchmarks on consumer hardware
  • When R1 beats V3 (and vice versa)
  • Cost comparison: local vs DeepSeek API vs OpenAI

DeepSeek shook the AI industry with the release of R1 -- a reasoning model that matches OpenAI's o1 on math and coding benchmarks, published under the MIT license. The full R1 is a 671-billion-parameter Mixture of Experts model that needs 320GB+ of VRAM. Nobody is running that on a desktop.

But here's the thing that matters: DeepSeek distilled R1's reasoning capabilities into smaller models ranging from 1.5B to 70B parameters. These distilled variants preserve a surprising amount of R1's step-by-step reasoning ability while fitting on consumer hardware. The 8B distill, based on Llama 3.1 architecture, runs on any machine with 6GB of VRAM.

For a detailed comparison of the DeepSeek model generations, see our DeepSeek V3 vs V3.1 analysis.

Table of Contents

  1. The DeepSeek Model Family
  2. MIT License: Why It Matters
  3. VRAM Requirements
  4. Ollama Setup
  5. Thinking Mode Explained
  6. Performance Benchmarks
  7. R1 vs V3: When to Use Which
  8. Hardware Recommendations
  9. Cost Comparison: Local vs API
  10. Advanced Configuration

The DeepSeek Model Family {#deepseek-family}

DeepSeek has released two main model lines, each with a different purpose:

DeepSeek R1 (Reasoning)

R1 is a reasoning model. It thinks through problems step by step before answering -- similar to OpenAI's o1. The "thinking" process is visible in the output, wrapped in <think> tags.

| Model | Parameters | Architecture | Purpose |
|---|---|---|---|
| DeepSeek R1 | 671B (MoE, 37B active) | MoE Transformer | Full reasoning model |
| R1-Distill-Qwen-1.5B | 1.5B | Qwen 2.5 | Ultra-lightweight reasoning |
| R1-Distill-Qwen-7B | 7B | Qwen 2.5 | Balanced reasoning |
| R1-Distill-Llama-8B | 8B | Llama 3.1 | Popular local choice |
| R1-Distill-Qwen-14B | 14B | Qwen 2.5 | Strong reasoning |
| R1-Distill-Qwen-32B | 32B | Qwen 2.5 | Near-full R1 quality |
| R1-Distill-Llama-70B | 70B | Llama 3.3 | Maximum distilled quality |

DeepSeek V3 / V3.1 (General Purpose)

V3 is a general-purpose chat model optimized for fast, helpful responses without explicit reasoning chains.

| Model | Parameters | Architecture | Key Feature |
|---|---|---|---|
| DeepSeek V3 | 671B (MoE) | MoE Transformer | Original release |
| DeepSeek V3.1 | 671B (MoE) | MoE Transformer | Improved instruction following |

The full 671B models need server-grade hardware (multiple A100/H100 GPUs). For local use, the distilled R1 variants are the practical path. They're not watered-down versions -- they're specifically trained to compress R1's reasoning patterns into architectures that fit on consumer GPUs.


MIT License: Why It Matters {#mit-license}

DeepSeek released R1 and its distilled variants under the MIT license. This is the most permissive license in the open model ecosystem:

  • Commercial use: Build and sell products with no restrictions
  • Modification: Fine-tune, merge, quantize without permission
  • Distribution: Redistribute the model weights freely
  • No attribution required: You don't even need to credit DeepSeek
  • No usage restrictions: Unlike Llama's license, there's no user count cap

Compare this to Llama (restricted to 700M monthly active users, no using outputs for training) or Gemma (can't train competing foundation models). MIT is simply "do whatever you want."

This makes DeepSeek models particularly attractive for commercial applications where licensing complexity is a liability.


VRAM Requirements {#vram-requirements}

Measured with Ollama using default quantization (Q4_K_M). These are actual runtime numbers, not theoretical minimums.

R1 Distilled Models

| Model | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|
| R1-Distill 1.5B | 1.4GB | 1.6GB | 2.2GB | 3.2GB |
| R1-Distill 7B | 4.9GB | 5.6GB | 8.0GB | 14.4GB |
| R1-Distill 8B | 5.4GB | 6.2GB | 8.8GB | 16.2GB |
| R1-Distill 14B | 9.2GB | 10.6GB | 15.4GB | 28.8GB |
| R1-Distill 32B | 19.8GB | 22.8GB | 33.2GB | 64.6GB |
| R1-Distill 70B | 42.5GB | 49.0GB | 71.2GB | 140GB |

Practical Hardware Mapping

| Your Hardware | Best Model | Quality |
|---|---|---|
| 6GB GPU (RTX 2060) | R1-Distill 1.5B or 7B (tight) | Basic reasoning |
| 8GB GPU / 8GB Mac | R1-Distill 7B or 8B (Q4) | Good reasoning |
| 12GB GPU (RTX 3060/4070) | R1-Distill 8B (Q8) or 14B (Q4) | Strong reasoning |
| 16GB Mac | R1-Distill 14B (Q4) | Excellent reasoning |
| 24GB GPU (RTX 4090) | R1-Distill 14B (Q8) or 32B (Q4) | Near-full R1 |
| 32GB Mac | R1-Distill 32B (Q4) | Outstanding |
| 48GB+ GPU | R1-Distill 70B (Q4) | Maximum quality |

The sweet spot for most users is the 8B distill on a 12GB GPU or the 14B distill on a 24GB GPU. The 32B distill is the quality inflection point -- it captures roughly 90% of full R1's reasoning capability on math and coding benchmarks.
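As a rough cross-check of the tables above, runtime memory can be estimated from parameter count and quantization bits per weight, plus a fixed overhead for compute buffers and a modest KV cache. The bits-per-weight averages below are approximations for common llama.cpp quant formats, not official Ollama figures -- a sketch:

```python
# Rough VRAM estimate: weights ~ params_billions * bits_per_weight / 8 (GB),
# plus a flat overhead for compute buffers and a small KV cache.
# Bits-per-weight values are approximate averages, not exact format specs.
BITS_PER_WEIGHT = {"q4_k_m": 4.8, "q5_k_m": 5.5, "q8_0": 8.5, "fp16": 16.0}

def estimate_vram_gb(params_b: float, quant: str, overhead_gb: float = 1.2) -> float:
    """Approximate runtime VRAM in GB for a dense model at a given quantization."""
    weights_gb = params_b * BITS_PER_WEIGHT[quant] / 8
    return round(weights_gb + overhead_gb, 1)

if __name__ == "__main__":
    for params in (1.5, 8, 14, 32, 70):
        print(f"{params}B @ Q4_K_M ~ {estimate_vram_gb(params, 'q4_k_m')} GB")
```

The estimates land within about 1GB of the measured table values, which is close enough for deciding whether a model fits your card.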

For a broader look at hardware sizing, check our AI hardware requirements guide.


Ollama Setup {#ollama-setup}

Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start the server
ollama serve

Pull DeepSeek Models

# R1 Distilled models (reasoning)
ollama pull deepseek-r1:1.5b    # 1.5B - runs on anything
ollama pull deepseek-r1:7b      # 7B Qwen-based
ollama pull deepseek-r1:8b      # 8B Llama-based (recommended)
ollama pull deepseek-r1:14b     # 14B - strong mid-range
ollama pull deepseek-r1:32b     # 32B - best quality/size ratio
ollama pull deepseek-r1:70b     # 70B - maximum quality

# Specific quantization
ollama pull deepseek-r1:8b-q8_0     # Higher quality
ollama pull deepseek-r1:14b-q4_K_M  # Fits in 12GB

Verify and Test

# List downloaded models
ollama list

# Run a reasoning test
ollama run deepseek-r1:8b 'A bat and a ball cost $1.10 total. The bat costs $1 more than the ball. How much does the ball cost?'

# You should see <think> tags showing reasoning steps
# Correct answer: $0.05 (not $0.10 - this is a classic cognitive reflection test)

Create a Custom Configuration

cat > Modelfile-deepseek << 'EOF'
FROM deepseek-r1:8b
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.05
SYSTEM "You are a precise reasoning assistant. Think through problems carefully. When solving math or logic problems, show your work step by step."
EOF

ollama create my-deepseek -f Modelfile-deepseek
ollama run my-deepseek

Thinking Mode Explained {#thinking-mode}

The defining feature of DeepSeek R1 is its "thinking" capability. When the model encounters a problem, it generates a chain of reasoning inside <think> tags before producing the final answer.

What Thinking Looks Like

User: What is 247 * 183?

<think>
I need to multiply 247 by 183.
Let me break this down:
247 * 183 = 247 * (180 + 3)
= 247 * 180 + 247 * 3
= (247 * 18) * 10 + 741
= (250 * 18 - 3 * 18) * 10 + 741
= (4500 - 54) * 10 + 741
= 4446 * 10 + 741
= 44460 + 741
= 45201
</think>

247 * 183 = **45,201**

When Thinking Helps (and When It Doesn't)

Use R1 with thinking for:

  • Math problems and calculations
  • Logic puzzles and constraint satisfaction
  • Code debugging (traces through execution)
  • Multi-step planning
  • Scientific reasoning

Skip thinking for:

  • Simple factual questions ("What is the capital of Japan?")
  • Creative writing
  • Translation
  • Casual conversation
  • Tasks where speed matters more than accuracy

Thinking mode adds 2-10x more tokens to the response. A question that V3 answers in 50 tokens might generate 300+ tokens with R1's reasoning chain. This means slower effective responses and higher memory usage for the KV cache.

Controlling Thinking in the API

# With thinking (default for R1)
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:8b",
  "prompt": "Solve: If a train leaves at 2pm going 60mph and another at 3pm going 90mph, when do they meet?",
  "stream": false
}'

# The response includes <think>...</think> followed by the answer

Parsing Thinking Output

import re

response = "... full model output ..."

# Extract just the thinking
thinking = re.findall(r'<think>(.*?)</think>', response, re.DOTALL)

# Extract just the answer (after all think blocks)
answer = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()

print(f"Reasoning: {thinking[0] if thinking else 'None'}")
print(f"Answer: {answer}")
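When streaming instead (`"stream": true`), the `<think>` block arrives first, token by token, so a client can suppress it on the fly and show only the final answer. A minimal sketch of such a chunk filter -- the simulated chunks below stand in for the `response` fields of Ollama's streamed output:

```python
def stream_answer(chunks):
    """Yield only the text that follows the </think> tag, given streamed chunks.

    Note: this sketch buffers forever if the model never emits </think>;
    production code should add a fallback for non-thinking responses.
    """
    buffer, in_think = "", True
    for chunk in chunks:
        if not in_think:
            yield chunk
            continue
        buffer += chunk
        end = buffer.find("</think>")
        if end != -1:
            in_think = False
            tail = buffer[end + len("</think>"):]
            if tail:
                yield tail

# Simulated streamed chunks (in practice: the "response" field of each line)
chunks = ["<think>2+2", " = 4</think>", "\nThe answer", " is 4."]
print("".join(stream_answer(chunks)).strip())  # -> The answer is 4.
```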

Performance Benchmarks {#benchmarks}

Real measurements on consumer hardware. All at Q4_K_M quantization unless noted.

Generation Speed (tokens/second)

| Model | RTX 3060 12GB | RTX 4070 12GB | RTX 4090 24GB | M3 Pro 18GB | M4 Pro 24GB |
|---|---|---|---|---|---|
| R1 1.5B | 128 tok/s | 178 tok/s | 241 tok/s | 108 tok/s | 124 tok/s |
| R1 8B | 38 tok/s | 56 tok/s | 88 tok/s | 32 tok/s | 44 tok/s |
| R1 14B | CPU offload | 22 tok/s* | 48 tok/s | 18 tok/s | 28 tok/s |
| R1 32B | -- | -- | 16 tok/s* | -- | 10 tok/s* |
| R1 70B | -- | -- | -- | -- | -- |
*Partial GPU offload

Note: R1 models produce more tokens per response than non-reasoning models because of the thinking chain. Since the thinking portion streams first and the answer appears only after it, a response feels responsive at roughly 20+ tok/s.

Reasoning Accuracy Benchmarks

| Benchmark | R1 1.5B | R1 8B | R1 14B | R1 32B | R1 70B | Full R1 671B |
|---|---|---|---|---|---|---|
| MATH-500 | 42.1 | 68.5 | 78.2 | 86.4 | 90.1 | 97.3 |
| AIME 2024 | 12.5 | 38.7 | 52.3 | 68.9 | 78.4 | 79.8 |
| HumanEval | 45.2 | 68.9 | 76.4 | 82.1 | 85.6 | 96.3 |
| GPQA Diamond | 28.3 | 44.7 | 52.8 | 58.6 | 64.2 | 71.5 |
| GSM8K | 68.4 | 85.2 | 90.7 | 94.1 | 95.8 | 97.4 |

The 32B distill hits roughly 85-95% of the full R1's performance on reasoning benchmarks. That's remarkable -- you get near-frontier reasoning on a single consumer GPU.


R1 vs V3: When to Use Which {#r1-vs-v3}

This is the question everyone asks. Here's a practical framework:

Head-to-Head Comparison

| Task | R1 (Reasoning) | V3 (General) | Winner |
|---|---|---|---|
| Math problems | Step-by-step, high accuracy | Quick but error-prone | R1 |
| Code debugging | Traces execution, finds subtle bugs | Spots obvious issues | R1 |
| Logic puzzles | Methodical, reliable | Sometimes guesses | R1 |
| Chat/conversation | Verbose, slow | Natural, fast | V3 |
| Creative writing | Overthinks | Flows naturally | V3 |
| Summarization | Unnecessary reasoning overhead | Clean and fast | V3 |
| Translation | Adds unwanted analysis | Direct output | V3 |
| API response speed | 2-10x more tokens | Minimal tokens | V3 |

Practical Rule

Use R1 when the answer requires multiple logical steps or when accuracy on hard problems matters more than speed. Use V3 for everything else.

If you're building an application, consider routing: send math/code/logic queries to R1 and general queries to a fast non-reasoning model (V3 via the API, or a compact local chat model). Ollama makes the local side trivial -- it keeps multiple models loaded and you pick the model per request.

Running Both Side by Side

# Pull the reasoning model plus a general-purpose stand-in.
# V3 itself is 671B with no small distill, so a compact chat model
# (for example qwen2.5:7b) fills the "general" role locally.
ollama pull deepseek-r1:8b
ollama pull qwen2.5:7b

# In your code, choose per request
curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:8b", "prompt": "complex math..."}'
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5:7b", "prompt": "summarize this..."}'

Ollama keeps both models loaded if you have enough RAM, with instant switching between them.
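The per-request routing idea can be sketched with a simple keyword classifier. The keyword list and the general-purpose model name (`qwen2.5:7b`) are illustrative assumptions -- substitute whatever models you actually run:

```python
import re

REASONING_MODEL = "deepseek-r1:8b"   # math/code/logic queries
GENERAL_MODEL = "qwen2.5:7b"         # assumed general-purpose local model

# Crude heuristic: prompts that look like multi-step problems go to R1.
REASONING_HINTS = re.compile(
    r"\b(solve|prove|calculate|debug|derive|how many|step[- ]by[- ]step)\b", re.I
)

def pick_model(prompt: str) -> str:
    """Route reasoning-flavored prompts to R1, everything else to the general model."""
    return REASONING_MODEL if REASONING_HINTS.search(prompt) else GENERAL_MODEL

print(pick_model("Solve: if 3x + 7 = 22, what is x?"))  # -> deepseek-r1:8b
print(pick_model("Summarize this meeting transcript"))   # -> qwen2.5:7b
```

A production router would use a small classifier or let the user choose, but even this heuristic avoids paying R1's thinking-token overhead on simple requests.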


Hardware Recommendations {#hardware}

Budget Build ($300-500)

  • GPU: Used RTX 3060 12GB ($180-220)
  • Models: R1-Distill 8B at Q4_K_M
  • Performance: 38 tok/s generation
  • Good for: Personal use, coding assistance, homework help

Mid-Range Build ($800-1200)

  • GPU: RTX 4070 Ti Super 16GB ($700-800)
  • Models: R1-Distill 14B at Q4_K_M
  • Performance: 35 tok/s generation
  • Good for: Professional development, research, small team

High-End Build ($1500-2500)

  • GPU: RTX 4090 24GB ($1600-1900) or RTX 5090 32GB
  • Models: R1-Distill 32B at Q4_K_M
  • Performance: 16-32 tok/s generation
  • Good for: Near-frontier reasoning, production workloads

Apple Silicon Path

  • Mac Mini M4 Pro 24GB ($1,599): R1-Distill 14B comfortably
  • Mac Studio M4 Max 64GB (~$3,000): R1-Distill 32B with room to spare
  • Mac Studio M4 Ultra 128GB (~$5,000+): R1-Distill 70B

Apple Silicon is cost-effective for the 32B+ models because unified memory eliminates the VRAM bottleneck. A Mac Studio with 64GB running the 32B distill at Q5 is a very capable reasoning workstation.

For detailed GPU comparisons, read our best local AI models for 8GB RAM guide.


Cost Comparison: Local vs API {#cost-comparison}

DeepSeek API Pricing (as of April 2026)

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| DeepSeek R1 | $0.55 | $2.19 |
| DeepSeek V3 | $0.27 | $1.10 |

Break-Even Analysis

Assume you generate 500,000 tokens/day (heavy individual use):

| Scenario | Monthly Cost |
|---|---|
| DeepSeek R1 API | ~$45/month (at avg. input/output mix) |
| OpenAI o1 API | ~$450/month |
| Local R1-Distill 8B (electricity only) | ~$8-15/month |
| Local R1-Distill 32B (electricity only) | ~$15-25/month |
Hardware amortization on a used RTX 3060 ($200) over 12 months: $16.67/month. Total local cost for R1-Distill 8B: $25-32/month -- roughly matching DeepSeek's API but with zero data privacy concerns and no rate limits.

The economic argument for running locally gets stronger at higher volumes. If your team generates 5M+ tokens/day, local hardware pays for itself in under 3 months versus the API.
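The break-even arithmetic above is easy to reproduce for your own workload. The output-token share and flat electricity figure below are assumptions -- plug in your own numbers:

```python
def api_monthly_cost(tokens_per_day, in_price, out_price, output_share=0.7):
    """Monthly API cost in USD; prices are per 1M tokens.

    output_share is an assumed fraction of total tokens that are output
    (reasoning models skew heavily toward output).
    """
    monthly = tokens_per_day * 30
    return (monthly * (1 - output_share) * in_price
            + monthly * output_share * out_price) / 1_000_000

def local_monthly_cost(gpu_price, months_amortized, electricity_usd=12.0):
    """Amortized hardware cost plus a flat electricity estimate."""
    return gpu_price / months_amortized + electricity_usd

api = api_monthly_cost(500_000, 0.55, 2.19)  # DeepSeek R1 API pricing
local = local_monthly_cost(200, 12)          # used RTX 3060 over a year
print(f"API: ${api:.2f}/mo  Local: ${local:.2f}/mo")
```

Actual API bills vary with the real input/output mix and any cached-input discounts, so treat the function as a first-order estimate.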

The Real Advantage: Privacy and Control

Cost savings aside, local deployment means:

  • No data leaves your network
  • No API rate limits during peak usage
  • No dependency on external service availability
  • Instant response (no network latency)
  • Full model customization (system prompts, fine-tuning)

Advanced Configuration {#advanced-config}

Optimizing Context Length

R1's thinking chains consume context. For complex problems, you may need more than the default 4096 tokens:

# Increase the context window from inside an `ollama run` session
ollama run deepseek-r1:8b
>>> /set parameter num_ctx 16384

# Warning: larger contexts raise VRAM usage substantially
# 8B Q4 with 4K context: 5.4GB
# 8B Q4 with 16K context: ~8.2GB
# 8B Q4 with 32K context: ~11GB
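The context-to-memory relationship can be approximated from first principles: the KV cache stores a key and a value vector per layer per token. The dimensions below assume the Llama 3.1 8B architecture the 8B distill is based on (32 layers, 8 KV heads, head dim 128, FP16 cache); treat the result as a lower bound, since Ollama also allocates compute buffers:

```python
def kv_cache_gb(ctx_tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate FP16 KV-cache size in GB for a given context length."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return ctx_tokens * per_token / 1024**3

for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} ctx ~ {kv_cache_gb(ctx):.2f} GB KV cache")
```

At 128KB per token, the cache alone grows from 0.5GB at 4K context to 4GB at 32K, which accounts for most of the VRAM increase listed above.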

Batch Processing

import requests
import json

problems = [
    "Prove that sqrt(2) is irrational",
    "Find all prime factors of 2310",
    "Write a function to detect cycles in a linked list",
]

for problem in problems:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-r1:8b",
            "prompt": problem,
            "stream": False,
            "options": {
                "temperature": 0.6,
                "num_ctx": 8192
            }
        }
    )
    result = json.loads(response.text)
    print(f"\n--- {problem[:50]}... ---")
    print(result["response"][:500])

GPU Layer Configuration

# Force specific number of GPU layers (useful for partial offload)
cat > Modelfile-r1-custom << 'EOF'
FROM deepseek-r1:14b
PARAMETER num_gpu 28
PARAMETER num_ctx 8192
EOF

ollama create r1-custom -f Modelfile-r1-custom

Running on Multiple GPUs

# Set visible GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama serve

# Ollama automatically splits layers across available GPUs
# 2x RTX 3060 12GB = 24GB effective → R1-Distill 14B at Q8

Troubleshooting

Thinking Output Is Too Long

# Limit max generated tokens from inside an `ollama run` session
ollama run deepseek-r1:8b
>>> /set parameter num_predict 2048

# Or in Modelfile
PARAMETER num_predict 2048

Model Runs Slow Despite Having Enough VRAM

# Check if model is fully on GPU
ollama ps

# If "processor" shows "cpu" or mixed, you may have VRAM fragmentation
# Restart Ollama to clear
systemctl restart ollama    # Linux
brew services restart ollama # macOS

Out of Memory During Long Conversations

R1's thinking chains can fill the KV cache quickly in multi-turn conversations. Solutions:

# Reduce the context window inside an `ollama run` session
>>> /set parameter num_ctx 4096

# Or start a new conversation (context resets)
# In the API: stop resending the accumulated context / message history

Conclusion

DeepSeek R1's distilled models bring genuine reasoning capabilities to consumer hardware. The 8B distill on a $200 used GPU gives you a model that solves math problems, debugs code, and works through logic puzzles with visible step-by-step thinking. The 32B distill on an RTX 4090 or Mac Studio approaches frontier reasoning quality.

The MIT license removes every commercial barrier. The thinking chains give you interpretable AI -- you can see exactly how the model arrived at its answer, which is invaluable for debugging and trust.

Start with ollama pull deepseek-r1:8b, throw some hard problems at it, and watch the <think> blocks work. If you find the reasoning compelling but want more accuracy, scale up to 14B or 32B.

Full model weights and technical reports are available on the DeepSeek-R1 GitHub repository and DeepSeek's HuggingFace organization.


Need help choosing between DeepSeek and other model families? Our best local AI models for 8GB RAM guide ranks every major option by real-world performance on consumer hardware.


Written by Pattanaik Ramswarup

AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset

I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.
