Run DeepSeek R1 Locally: Complete Ollama Guide
Published on April 10, 2026 • 25 min read
Quick Start: DeepSeek Running in 90 Seconds
Two commands to get DeepSeek R1 on your machine:
- Pull the model:
ollama pull deepseek-r1:8b
- Start reasoning:
ollama run deepseek-r1:8b "Solve: if 3x + 7 = 22, what is x?"
Watch it think through the problem step by step, then give you the answer. All running on your hardware.
What you'll learn:
- The full DeepSeek model family and which variants run locally
- Exact VRAM requirements for every distilled R1 and V3 model
- How thinking mode works and when to use it
- Real performance benchmarks on consumer hardware
- When R1 beats V3 (and vice versa)
- Cost comparison: local vs DeepSeek API vs OpenAI
DeepSeek shook the AI industry when they released R1 -- a reasoning model that matches OpenAI's o1 on math and coding benchmarks, released under the MIT license. The full R1 is a 671 billion parameter Mixture of Experts model that needs 320GB+ of VRAM. Nobody is running that on a desktop.
But here's the thing that matters: DeepSeek distilled R1's reasoning capabilities into smaller models ranging from 1.5B to 70B parameters. These distilled variants preserve a surprising amount of R1's step-by-step reasoning ability while fitting on consumer hardware. The 8B distill, based on Llama 3.1 architecture, runs on any machine with 6GB of VRAM.
For a detailed comparison of the DeepSeek model generations, see our DeepSeek V3 vs V3.1 analysis.
Table of Contents
- The DeepSeek Model Family
- MIT License: Why It Matters
- VRAM Requirements
- Ollama Setup
- Thinking Mode Explained
- Performance Benchmarks
- R1 vs V3: When to Use Which
- Hardware Recommendations
- Cost Comparison: Local vs API
- Advanced Configuration
The DeepSeek Model Family {#deepseek-family}
DeepSeek has released two main model lines, each with a different purpose:
DeepSeek R1 (Reasoning)
R1 is a reasoning model. It thinks through problems step by step before answering -- similar to OpenAI's o1. The "thinking" process is visible in the output, wrapped in <think> tags.
| Model | Parameters | Architecture | Purpose |
|---|---|---|---|
| DeepSeek R1 | 671B (MoE, 37B active) | MoE Transformer | Full reasoning model |
| R1-Distill-Qwen-1.5B | 1.5B | Qwen 2.5 | Ultra-lightweight reasoning |
| R1-Distill-Qwen-7B | 7B | Qwen 2.5 | Balanced reasoning |
| R1-Distill-Llama-8B | 8B | Llama 3.1 | Popular local choice |
| R1-Distill-Qwen-14B | 14B | Qwen 2.5 | Strong reasoning |
| R1-Distill-Qwen-32B | 32B | Qwen 2.5 | Near-full R1 quality |
| R1-Distill-Llama-70B | 70B | Llama 3.3 | Maximum distilled quality |
DeepSeek V3 / V3.1 (General Purpose)
V3 is a general-purpose chat model optimized for fast, helpful responses without explicit reasoning chains.
| Model | Parameters | Architecture | Key Feature |
|---|---|---|---|
| DeepSeek V3 | 671B (MoE) | MoE Transformer | Original release |
| DeepSeek V3.1 | 671B (MoE) | MoE Transformer | Improved instruction following |
The full 671B models need server-grade hardware (multiple A100/H100 GPUs). For local use, the distilled R1 variants are the practical path. They're not watered-down versions -- they're specifically trained to compress R1's reasoning patterns into architectures that fit on consumer GPUs.
MIT License: Why It Matters {#mit-license}
DeepSeek released R1 and its distilled variants under the MIT license. This is the most permissive license in the open model ecosystem:
- Commercial use: Build and sell products with no restrictions
- Modification: Fine-tune, merge, quantize without permission
- Distribution: Redistribute the model weights freely
- No attribution required: You don't even need to credit DeepSeek
- No usage restrictions: Unlike Llama's license, there's no user count cap
Compare this to Llama (restricted to 700M monthly active users, no using outputs for training) or Gemma (can't train competing foundation models). MIT is simply "do whatever you want."
This makes DeepSeek models particularly attractive for commercial applications where licensing complexity is a liability.
VRAM Requirements {#vram-requirements}
Measured with Ollama using default quantization (Q4_K_M). These are actual runtime numbers, not theoretical minimums.
R1 Distilled Models
| Model | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|
| R1-Distill 1.5B | 1.4GB | 1.6GB | 2.2GB | 3.2GB |
| R1-Distill 7B | 4.9GB | 5.6GB | 8.0GB | 14.4GB |
| R1-Distill 8B | 5.4GB | 6.2GB | 8.8GB | 16.2GB |
| R1-Distill 14B | 9.2GB | 10.6GB | 15.4GB | 28.8GB |
| R1-Distill 32B | 19.8GB | 22.8GB | 33.2GB | 64.6GB |
| R1-Distill 70B | 42.5GB | 49.0GB | 71.2GB | 140GB |
Practical Hardware Mapping
| Your Hardware | Best Model | Quality |
|---|---|---|
| 6GB GPU (RTX 2060) | R1-Distill 1.5B or 7B (tight) | Basic reasoning |
| 8GB GPU / 8GB Mac | R1-Distill 7B or 8B (Q4) | Good reasoning |
| 12GB GPU (RTX 3060/4070) | R1-Distill 8B (Q8) or 14B (Q4) | Strong reasoning |
| 16GB Mac | R1-Distill 14B (Q4) | Excellent reasoning |
| 24GB GPU (RTX 4090) | R1-Distill 14B (Q8) or 32B (Q4) | Near-full R1 |
| 32GB Mac | R1-Distill 32B (Q4) | Outstanding |
| 48GB+ GPU | R1-Distill 70B (Q4) | Maximum quality |
The sweet spot for most users is the 8B distill on a 12GB GPU or the 14B distill on a 24GB GPU. The 32B distill is the quality inflection point -- it captures roughly 90% of full R1's reasoning capability on math and coding benchmarks.
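The VRAM table roughly follows weight-size arithmetic: parameters times bits per weight, divided by 8, plus overhead for the KV cache and runtime buffers. A quick sanity-check sketch -- the 4.8 bits/weight average for Q4_K_M and the 15% overhead factor are approximations, not Ollama internals:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.15) -> float:
    """Rough VRAM estimate: weight bytes plus a fudge factor for cache/buffers."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return round(weight_gb * overhead, 1)

# Q4_K_M averages roughly 4.8 bits per weight in practice
print(estimate_vram_gb(8, 4.8))    # ~5.5, close to the measured 5.4GB
print(estimate_vram_gb(32, 4.8))   # ~22.1, in the ballpark of the measured 19.8GB
```

The estimate lands within about 15% of the measured numbers above; real usage depends on context length and quantization details.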
For a broader look at hardware sizing, check our AI hardware requirements guide.
Ollama Setup {#ollama-setup}
Install Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Start the server
ollama serve
Pull DeepSeek Models
# R1 Distilled models (reasoning)
ollama pull deepseek-r1:1.5b # 1.5B - runs on anything
ollama pull deepseek-r1:7b # 7B Qwen-based
ollama pull deepseek-r1:8b # 8B Llama-based (recommended)
ollama pull deepseek-r1:14b # 14B - strong mid-range
ollama pull deepseek-r1:32b # 32B - best quality/size ratio
ollama pull deepseek-r1:70b # 70B - maximum quality
# Specific quantization
ollama pull deepseek-r1:8b-q8_0 # Higher quality
ollama pull deepseek-r1:14b-q4_K_M # Fits in 12GB
Verify and Test
# List downloaded models
ollama list
# Run a reasoning test
ollama run deepseek-r1:8b "A bat and a ball cost $1.10 total. The bat costs $1 more than the ball. How much does the ball cost?"
# You should see <think> tags showing reasoning steps
# Correct answer: $0.05 (not $0.10 - this is a classic cognitive reflection test)
Create a Custom Configuration
cat > Modelfile-deepseek << 'EOF'
FROM deepseek-r1:8b
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.05
SYSTEM "You are a precise reasoning assistant. Think through problems carefully. When solving math or logic problems, show your work step by step."
EOF
ollama create my-deepseek -f Modelfile-deepseek
ollama run my-deepseek
Thinking Mode Explained {#thinking-mode}
The defining feature of DeepSeek R1 is its "thinking" capability. When the model encounters a problem, it generates a chain of reasoning inside <think> tags before producing the final answer.
What Thinking Looks Like
User: What is 247 * 183?
<think>
I need to multiply 247 by 183.
Let me break this down:
247 * 183 = 247 * (180 + 3)
= 247 * 180 + 247 * 3
= (247 * 18) * 10 + 741
= (250 * 18 - 3 * 18) * 10 + 741
= (4500 - 54) * 10 + 741
= 4446 * 10 + 741
= 44460 + 741
= 45201
</think>
247 * 183 = **45,201**
When Thinking Helps (and When It Doesn't)
Use R1 with thinking for:
- Math problems and calculations
- Logic puzzles and constraint satisfaction
- Code debugging (traces through execution)
- Multi-step planning
- Scientific reasoning
Skip thinking for:
- Simple factual questions ("What is the capital of Japan?")
- Creative writing
- Translation
- Casual conversation
- Tasks where speed matters more than accuracy
Thinking mode adds 2-10x more tokens to the response. A question that V3 answers in 50 tokens might generate 300+ tokens with R1's reasoning chain. This means slower effective responses and higher memory usage for the KV cache.
Controlling Thinking in the API
# With thinking (default for R1)
curl http://localhost:11434/api/generate -d '{
"model": "deepseek-r1:8b",
"prompt": "Solve: If a train leaves at 2pm going 60mph and another at 3pm going 90mph, when do they meet?",
"stream": false
}'
# The response includes <think>...</think> followed by the answer
Parsing Thinking Output
import re
response = "... full model output ..."
# Extract just the thinking
thinking = re.findall(r'<think>(.*?)</think>', response, re.DOTALL)
# Extract just the answer (after all think blocks)
answer = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()
print(f"Reasoning: {thinking[0] if thinking else 'None'}")
print(f"Answer: {answer}")
Performance Benchmarks {#benchmarks}
Real measurements on consumer hardware. All at Q4_K_M quantization unless noted.
Generation Speed (tokens/second)
| Model | RTX 3060 12GB | RTX 4070 12GB | RTX 4090 24GB | M3 Pro 18GB | M4 Pro 24GB |
|---|---|---|---|---|---|
| R1 1.5B | 128 tok/s | 178 tok/s | 241 tok/s | 108 tok/s | 124 tok/s |
| R1 8B | 38 tok/s | 56 tok/s | 88 tok/s | 32 tok/s | 44 tok/s |
| R1 14B | CPU offload | 22 tok/s* | 48 tok/s | 18 tok/s | 28 tok/s |
| R1 32B | -- | -- | 16 tok/s* | -- | 10 tok/s* |
| R1 70B | -- | -- | -- | -- | -- |
*Partial GPU offload
Note: R1 models produce more tokens per response than non-reasoning models due to the thinking chain. A response that feels "fast enough" needs about 20+ tok/s because the thinking portion streams first, then the answer appears.
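That 20 tok/s threshold is easy to sanity-check for your own workload: divide the expected thinking-plus-answer token count by your measured generation speed. A minimal sketch, where the token counts are illustrative assumptions:

```python
def response_seconds(thinking_tokens: int, answer_tokens: int,
                     tok_per_s: float) -> float:
    """Wall time to stream the full reasoning chain plus the final answer."""
    return (thinking_tokens + answer_tokens) / tok_per_s

# A typical R1 math problem: ~800 thinking tokens, ~150 answer tokens
print(round(response_seconds(800, 150, 38), 1))  # RTX 3060 at 38 tok/s -> 25.0 s
print(round(response_seconds(800, 150, 88), 1))  # RTX 4090 at 88 tok/s -> 10.8 s
```

At 38 tok/s you wait nearly half a minute for a full reasoning chain, which is why the faster tiers matter more for R1 than for non-reasoning models.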
Reasoning Accuracy Benchmarks
| Benchmark | R1 1.5B | R1 8B | R1 14B | R1 32B | R1 70B | Full R1 671B |
|---|---|---|---|---|---|---|
| MATH-500 | 42.1 | 68.5 | 78.2 | 86.4 | 90.1 | 97.3 |
| AIME 2024 | 12.5 | 38.7 | 52.3 | 68.9 | 78.4 | 79.8 |
| HumanEval | 45.2 | 68.9 | 76.4 | 82.1 | 85.6 | 96.3 |
| GPQA Diamond | 28.3 | 44.7 | 52.8 | 58.6 | 64.2 | 71.5 |
| GSM8K | 68.4 | 85.2 | 90.7 | 94.1 | 95.8 | 97.4 |
The 32B distill hits roughly 85-95% of the full R1's performance on reasoning benchmarks. That's remarkable -- you get near-frontier reasoning on a single consumer GPU.
R1 vs V3: When to Use Which {#r1-vs-v3}
This is the question everyone asks. Here's a practical framework:
Head-to-Head Comparison
| Task | R1 (Reasoning) | V3 (General) | Winner |
|---|---|---|---|
| Math problems | Step-by-step, high accuracy | Quick but error-prone | R1 |
| Code debugging | Traces execution, finds subtle bugs | Spots obvious issues | R1 |
| Logic puzzles | Methodical, reliable | Sometimes guesses | R1 |
| Chat/conversation | Verbose, slow | Natural, fast | V3 |
| Creative writing | Overthinks | Flows naturally | V3 |
| Summarization | Unnecessary reasoning overhead | Clean and fast | V3 |
| Translation | Adds unwanted analysis | Direct output | V3 |
| API response speed | 2-10x more tokens | Minimal tokens | V3 |
Practical Rule
Use R1 when the answer requires multiple logical steps or when accuracy on hard problems matters more than speed. Use V3 for everything else.
If you're building an application, consider routing: send math/code/logic queries to R1 and general queries to a fast general-purpose model (or to V3 via the API). Ollama makes the local side trivial -- multiple models can stay loaded simultaneously and you pick one per request.
Running Both Side by Side
Note that there is no distilled V3 to pull locally -- V3 ships only as the full 671B model. A practical stand-in for the general-purpose role is a similarly sized chat model such as Llama 3.1 8B:
# Pull both
ollama pull deepseek-r1:8b
ollama pull llama3.1:8b
# In your code, choose per request
curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:8b", "prompt": "complex math..."}'
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "prompt": "summarize this..."}'
Ollama keeps both models loaded if you have enough memory, so switching between them per request is effectively instant.
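The routing idea can be sketched in a few lines. This is a naive keyword router, not a production classifier; the model names are assumptions (R1's 8B distill plus Llama 3.1 8B as the general-purpose stand-in, since no local V3 distill exists), and the endpoint is Ollama's standard `/api/generate`:

```python
import json
import re
import urllib.request

REASONING_MODEL = "deepseek-r1:8b"  # reasoning-heavy queries
GENERAL_MODEL = "llama3.1:8b"       # assumed general-purpose stand-in

def pick_model(prompt: str) -> str:
    """Naive keyword router: reasoning-heavy prompts go to R1."""
    if re.search(r"\b(prove|solve|debug|calculate|step by step)\b", prompt, re.I):
        return REASONING_MODEL
    return GENERAL_MODEL

def ask(prompt: str) -> str:
    """Send the prompt to whichever local model the router picks."""
    body = json.dumps({"model": pick_model(prompt), "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(pick_model("Solve: 3x + 7 = 22"))        # deepseek-r1:8b
print(pick_model("Summarize this paragraph"))  # llama3.1:8b
```

A real application would route on something sturdier than keywords -- a small classifier or an explicit mode toggle -- but the per-request model switch is exactly this simple with Ollama.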
Hardware Recommendations {#hardware}
Budget Build ($300-500)
- GPU: Used RTX 3060 12GB ($180-220)
- Models: R1-Distill 8B at Q4_K_M
- Performance: 38 tok/s generation
- Good for: Personal use, coding assistance, homework help
Mid-Range Build ($800-1200)
- GPU: RTX 4070 Ti Super 16GB ($700-800)
- Models: R1-Distill 14B at Q4_K_M
- Performance: 35 tok/s generation
- Good for: Professional development, research, small team
High-End Build ($1500-2500)
- GPU: RTX 4090 24GB ($1600-1900) or RTX 5090 32GB
- Models: R1-Distill 32B at Q4_K_M
- Performance: 16-32 tok/s generation
- Good for: Near-frontier reasoning, production workloads
Apple Silicon Path
- Mac Mini M4 Pro 24GB ($1,599): R1-Distill 14B comfortably
- Mac Studio M4 Max 64GB (~$3,000): R1-Distill 32B with room to spare
- Mac Studio M4 Ultra 128GB (~$5,000+): R1-Distill 70B
Apple Silicon is cost-effective for the 32B+ models because unified memory eliminates the VRAM bottleneck. A Mac Studio with 64GB running the 32B distill at Q5 is a very capable reasoning workstation.
For detailed GPU comparisons, read our best local AI models for 8GB RAM guide.
Cost Comparison: Local vs API {#cost-comparison}
DeepSeek API Pricing (as of April 2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| DeepSeek R1 | $0.55 | $2.19 |
| DeepSeek V3 | $0.27 | $1.10 |
Break-Even Analysis
Assume you generate 500,000 tokens/day (heavy individual use):
| Scenario | Monthly Cost |
|---|---|
| DeepSeek R1 API | ~$45/month (at avg. input/output mix) |
| OpenAI o1 API | ~$450/month |
| Local R1-Distill 8B (electricity only) | ~$8-15/month |
| Local R1-Distill 32B (electricity only) | ~$15-25/month |
Hardware amortization on a used RTX 3060 ($200) over 12 months: $16.67/month. Total local cost for R1-Distill 8B: $25-32/month -- roughly matching DeepSeek's API but with zero data privacy concerns and no rate limits.
The economic argument for running locally gets stronger at higher volumes. If your team generates 5M+ tokens/day, local hardware pays for itself in under 3 months versus the API.
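The break-even arithmetic can be reproduced with a few lines. This sketch uses the R1 pricing from the table; the input/output mix and electricity figures are assumptions, and the result moves substantially with the mix (R1's long thinking chains skew usage toward the pricier output side):

```python
def api_monthly_cost(tokens_per_day: int, input_price: float, output_price: float,
                     input_frac: float = 0.3) -> float:
    """Monthly API spend; prices are per 1M tokens, input_frac sets the token mix."""
    monthly_millions = tokens_per_day * 30 / 1_000_000
    blended = input_frac * input_price + (1 - input_frac) * output_price
    return monthly_millions * blended

def local_monthly_cost(hardware_price: float, amortize_months: int,
                       electricity: float = 12.0) -> float:
    """Amortized hardware plus an assumed monthly electricity estimate."""
    return hardware_price / amortize_months + electricity

print(round(api_monthly_cost(500_000, 0.55, 2.19), 2))  # DeepSeek R1 API
print(round(local_monthly_cost(200, 12), 2))            # used RTX 3060 over a year
```

Plug in your own daily volume and amortization window; the local curve is nearly flat with volume while the API curve is linear, which is the whole argument.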
The Real Advantage: Privacy and Control
Cost savings aside, local deployment means:
- No data leaves your network
- No API rate limits during peak usage
- No dependency on external service availability
- Instant response (no network latency)
- Full model customization (system prompts, fine-tuning)
Advanced Configuration {#advanced-config}
Optimizing Context Length
R1's thinking chains consume context. For complex problems, you may need more than the default 4096 tokens:
# Increase the context window from inside the REPL
ollama run deepseek-r1:8b
>>> /set parameter num_ctx 16384
# Or set it permanently in a Modelfile: PARAMETER num_ctx 16384
# Warning: VRAM usage grows with context
# 8B Q4 with 4K context: 5.4GB
# 8B Q4 with 16K context: ~8.2GB
# 8B Q4 with 32K context: ~11GB
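Most of that growth is the KV cache, which scales linearly with context length. A back-of-the-envelope sketch, assuming Llama 3.1 8B dimensions (32 layers, 8 KV heads, head dim 128) and fp16 cache entries -- measured usage runs higher because of attention scratch buffers and other runtime overhead:

```python
def kv_cache_gb(ctx: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """KV cache bytes: 2 tensors (K and V) * layers * kv_heads * head_dim * ctx."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_val / 1024**3

for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} ctx -> {kv_cache_gb(ctx):.2f} GB KV cache")
```

Quadrupling the context quadruples the cache, which is why long multi-turn reasoning sessions hit memory limits well before the weights themselves are the problem.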
Batch Processing
import requests
import json

problems = [
    "Prove that sqrt(2) is irrational",
    "Find all prime factors of 2310",
    "Write a function to detect cycles in a linked list",
]

for problem in problems:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-r1:8b",
            "prompt": problem,
            "stream": False,
            "options": {
                "temperature": 0.6,
                "num_ctx": 8192
            }
        }
    )
    result = json.loads(response.text)
    print(f"\n--- {problem[:50]}... ---")
    print(result["response"][:500])
GPU Layer Configuration
# Force specific number of GPU layers (useful for partial offload)
cat > Modelfile-r1-custom << 'EOF'
FROM deepseek-r1:14b
PARAMETER num_gpu 28
PARAMETER num_ctx 8192
EOF
ollama create r1-custom -f Modelfile-r1-custom
Running on Multiple GPUs
# Set visible GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama serve
# Ollama automatically splits layers across available GPUs
# 2x RTX 3060 12GB = 24GB effective → R1-Distill 14B at Q8
Troubleshooting
Thinking Output Is Too Long
# Cap generation length from inside the REPL
ollama run deepseek-r1:8b
>>> /set parameter num_predict 2048
# Or in a Modelfile
PARAMETER num_predict 2048
Model Runs Slow Despite Having Enough VRAM
# Check if model is fully on GPU
ollama ps
# If "processor" shows "cpu" or mixed, you may have VRAM fragmentation
# Restart Ollama to clear
systemctl restart ollama # Linux
brew services restart ollama # macOS
Out of Memory During Long Conversations
R1's thinking chains can fill the KV cache quickly in multi-turn conversations. Solutions:
# Reduce the context window from inside the REPL
ollama run deepseek-r1:8b
>>> /set parameter num_ctx 4096
# Or start a new conversation (context resets)
# In the API: don't pass the returned context array back into the next request
Conclusion
DeepSeek R1's distilled models bring genuine reasoning capabilities to consumer hardware. The 8B distill on a $200 used GPU gives you a model that solves math problems, debugs code, and works through logic puzzles with visible step-by-step thinking. The 32B distill on an RTX 4090 or Mac Studio approaches frontier reasoning quality.
The MIT license removes every commercial barrier. The thinking chains give you interpretable AI -- you can see exactly how the model arrived at its answer, which is invaluable for debugging and trust.
Start with ollama pull deepseek-r1:8b, throw some hard problems at it, and watch the <think> blocks work. If you find the reasoning compelling but want more accuracy, scale up to 14B or 32B.
Full model weights and technical reports are available on the DeepSeek-R1 GitHub repository and DeepSeek's HuggingFace organization.
Need help choosing between DeepSeek and other model families? Our best local AI models for 8GB RAM guide ranks every major option by real-world performance on consumer hardware.