
DeepSeek R1 Local Setup: Complete Guide to Running 671B Locally

February 4, 2026
18 min read
Local AI Master Research Team
šŸŽ 4 PDFs included
Newsletter

Before we dive deeper...

Get your free AI Starter Kit

Join 12,000+ developers. Instant download: Career Roadmap + Fundamentals Cheat Sheets.

No spam, everUnsubscribe anytime
12,000+ downloads

DeepSeek R1 Quick Start

Choose Your Version:

| Version | VRAM | Notes |
|---|---|---|
| R1 1.5B | 4GB | Basic reasoning, fast |
| R1 8B | 8GB | Good for most tasks |
| R1 32B | 24GB | Best local option |
| R1 671B | 131GB+ (1.58-bit) | Full capability |

Quick Install (3 commands):
curl -fsSL https://ollama.com/install.sh | sh
ollama pull deepseek-r1:32b
ollama run deepseek-r1:32b

What is DeepSeek R1?

DeepSeek R1 is a 671 billion parameter reasoning model that revolutionized open-source AI when released on January 20, 2025. Built by Chinese AI lab DeepSeek for approximately $5.6 million in training costs, it matches or beats OpenAI's o1 and GPT-4 on complex reasoning tasks—and it's completely open source under the MIT license.

What makes R1 groundbreaking isn't just performance. It's the first open model to demonstrate transparent chain-of-thought reasoning through visible <think> tokens. Unlike closed models that hide their reasoning, R1 shows you exactly how it solves problems step by step. You can watch it explore solutions, catch its own mistakes, and course-correct in real time.

The model uses a Mixture-of-Experts (MoE) architecture with 671B total parameters but only 37B active per token, which keeps inference computationally efficient while preserving massive capacity. The reasoning ability itself came from pure reinforcement learning: the model learned to reason without pre-programmed chains of thought.
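
To make the routing idea concrete, here is a minimal, illustrative top-k routing sketch in Python. It is not DeepSeek's actual implementation (production MoE layers add shared experts, load balancing, and fused GPU kernels), but it shows why only a fraction of the parameters run for each token:

import numpy as np

def moe_layer(x, experts, router_weights, k=8):
    """Illustrative top-k MoE routing: only k of len(experts) experts run per token."""
    logits = x @ router_weights                  # score every expert for this token
    top_k = np.argsort(logits)[-k:]              # keep the k best-scoring experts
    gates = np.exp(logits[top_k])
    gates /= gates.sum()                         # softmax over the selected experts only
    # Only the selected experts execute; the other 248 stay idle for this token
    return sum(g * experts[i](x) for g, i in zip(gates, top_k))

# Toy setup mirroring R1's 256-expert / 8-active split
rng = np.random.default_rng(0)
d = 16
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(256)]
router = rng.normal(size=(d, 256))
print(moe_layer(rng.normal(size=d), experts, router, k=8).shape)  # (16,)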

Andrej Karpathy, founding member of OpenAI, commented: "DeepSeek making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M)."


DeepSeek Model Family: Complete Version History

Understanding the DeepSeek ecosystem helps you choose the right model:

DeepSeek V3 (December 26, 2024)

The foundation model. General-purpose 671B MoE optimized for broad capabilities across coding, writing, and conversation.

  • Parameters: 671B total, 37B active
  • Context: 128K tokens
  • Best for: General tasks, coding, writing

DeepSeek R1 (January 20, 2025)

The reasoning specialist. Same architecture as V3 but trained specifically for complex reasoning using reinforcement learning.

  • Parameters: 671B total, 37B active
  • Context: 128K-160K tokens
  • Best for: Math, logic, multi-step problems, debugging

DeepSeek R1-0528 (May 28, 2025)

Major upgrade to R1 with reduced hallucinations, JSON output support, and function calling.

  • Improvements: Better JSON, tool calling, fewer hallucinations
  • Key update: 8B distilled version significantly improved
  • Best for: Production use, API integration

DeepSeek V3.1 (August 21, 2025)

Hybrid model combining V3's versatility with R1's reasoning. Can switch between thinking and non-thinking modes.

  • Unique feature: Dynamic mode switching
  • Best for: Users wanting both capabilities in one model

DeepSeek V3.2 (December 2025)

Latest general-purpose release with performance improvements across all benchmarks.

DeepSeek V4 (Expected February 2026)

Next-generation model expected mid-February 2026, focusing on enhanced code generation; it will likely ship under the Apache 2.0 license.


DeepSeek R1 Benchmark Performance

R1's benchmark scores explain why it disrupted the AI industry:

Mathematics Benchmarks

| Benchmark | DeepSeek R1 | OpenAI o1-1217 | GPT-4o | Claude 3.5 |
|---|---|---|---|---|
| AIME 2024 (Math Olympiad) | 79.8% | 79.2% | 9.3% | 16.0% |
| MATH-500 | 97.3% | 96.4% | 74.6% | 78.3% |
| GSM8K | 95.8% | 94.8% | 92.0% | 91.6% |

Coding Benchmarks

| Benchmark | DeepSeek R1 | OpenAI o1 | GPT-4o | Claude 3.5 |
|---|---|---|---|---|
| Codeforces Elo | 2,029 | 1,891 | 1,891 | 1,886 |
| LiveCodeBench | 65.9% | 63.4% | 33.4% | 38.9% |
| SWE-Bench Verified | 49.2% | 48.9% | 33.2% | 40.6% |

Knowledge Benchmarks

| Benchmark | DeepSeek R1 | GPT-4o | Claude 3.5 |
|---|---|---|---|
| MMLU | 90.8% | 88.7% | 88.3% |
| MMLU-Pro | 84.0% | 80.3% | 78.0% |
| GPQA Diamond (PhD Science) | 71.5% | 49.9% | 59.4% |

The R1 scores are particularly impressive on the hardest benchmarks: AIME (math olympiad), Codeforces (competitive programming), and GPQA Diamond (PhD-level science). Beating both GPT-4o and Claude 3.5 on these by significant margins was unprecedented for an open model.

Distilled Model Performance

The distilled versions retain most reasoning capability:

| Model | AIME 2024 | MATH-500 | Codeforces Elo |
|---|---|---|---|
| R1-Distill-Qwen-32B | 72.6% | 94.3% | 1,691 |
| R1-Distill-Llama-70B | 70.0% | 94.5% | 1,633 |
| R1-Distill-Qwen-14B | 69.7% | 93.9% | 1,481 |
| R1-Distill-Llama-8B | 50.4% | 89.1% | 1,205 |

The 32B distilled model achieves 72.6% on AIME—that's math olympiad performance from a model that runs on a single RTX 4090.


How DeepSeek R1's Thinking Mode Works

Understanding R1's reasoning mechanism helps you use it effectively.

Chain-of-Thought Architecture

When you send a prompt to R1, the model generates two distinct phases:

  1. Thinking Phase (<think>...</think>)

    • Internal reasoning tokens visible in raw output
    • Problem breakdown and exploration
    • Self-correction and verification
    • Multiple solution paths evaluated
  2. Response Phase

    • Final answer based on thinking
    • Clean, user-facing output
    • Conclusions from reasoning process
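
If you consume R1's raw output programmatically, you can separate the two phases yourself. A minimal sketch, assuming the model emits literal <think>...</think> tags as described above:

import re

def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Separate R1's thinking phase from its final answer."""
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()
    return thinking, answer

raw = "<think>17 sheep, all but 9 run away, so 9 remain.</think>\nThe answer is 9."
thinking, answer = split_reasoning(raw)
print(answer)  # The answer is 9.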

The Four-Stage Reasoning Taxonomy

Research into R1's behavior reveals four distinct stages:

  1. Problem Definition: Initial understanding and constraint identification
  2. Blooming Cycle: Exploration of multiple solution approaches
  3. Reconstruction Cycles: Self-correction, rumination, "aha moments"
  4. Final Decision: Commitment to answer after verification

"Aha Moments" - Emergent Self-Correction

One of R1's most remarkable behaviors is spontaneous error correction. During reasoning, the model will:

  • Recognize when an approach is failing
  • Explicitly state "Wait, this doesn't seem right..."
  • Backtrack and try alternative methods
  • Verify final answers against original constraints

This emerged purely from reinforcement learning—DeepSeek did not pre-program these behaviors.

Example Thinking Output

<think>
Let me break down this math problem...
First, I need to identify the variables: x represents...
Wait, I should check my assumption about...
Actually, that approach won't work because...
Let me try a different method using...
Now I can verify: if x = 5, then...
Yes, this satisfies all constraints.
</think>

The answer is x = 5, which I verified by substituting back into the original equation.

Controlling Thinking Mode

  • Temperature 0.6 (recommended): Balanced reasoning
  • Temperature 0.3-0.5: More focused, less exploration
  • Temperature 0.7-0.8: More creative, more exploration
  • Prompt engineering: "Think step by step" encourages extended reasoning
  • V3.1 hybrid: Automatically decides when to think based on complexity
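
With the Ollama Python client, these settings map onto the options dict. A minimal sketch (the temperature and top_p values are just the recommendations above):

import ollama

# Lower temperature = more focused reasoning; higher = more exploration
response = ollama.chat(
    model="deepseek-r1:32b",
    messages=[{
        "role": "user",
        "content": "Think step by step: is 1,003 prime?",
    }],
    options={"temperature": 0.6, "top_p": 0.95},
)
print(response["message"]["content"])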

Complete Model Specifications

Full 671B Model

| Specification | Value |
|---|---|
| Total Parameters | 671B (originally 685B) |
| Active Parameters | 37B per token |
| Architecture | Mixture of Experts (MoE) |
| Expert Count | 256 experts |
| Active Experts | 8 per token |
| Context Window | 128K-160K tokens |
| Training Tokens | 14.8 trillion |
| Training Cost | ~$5.6 million |
| License | MIT |

Distilled Model Specifications

| Model | Parameters | Download Size | Base Architecture | Context |
|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | 1.5B | 1.1GB | Qwen-2.5 | 128K |
| R1-Distill-Qwen-7B | 7B | 4.7GB | Qwen-2.5 | 128K |
| R1-Distill-Llama-8B | 8B | 5.2GB | Llama 3.1-8B | 128K |
| R1-Distill-Qwen-14B | 14B | 9.0GB | Qwen-2.5 | 128K |
| R1-Distill-Qwen-32B | 32B | 20GB | Qwen-2.5 | 128K |
| R1-Distill-Llama-70B | 70B | 43GB | Llama 3.3-70B | 128K |

Step-by-Step Local Setup with Ollama

Step 1: Install Ollama

macOS/Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com/download and run it.

Verify installation:

ollama --version
# Should show: ollama version 0.5.x or higher

Step 2: Choose and Pull Your Model

Select based on your VRAM:

# 4GB VRAM - Basic reasoning
ollama pull deepseek-r1:1.5b

# 8GB VRAM - Good balance (recommended starting point)
ollama pull deepseek-r1:8b

# 12-16GB VRAM - Better reasoning
ollama pull deepseek-r1:14b

# 24GB VRAM - Best local experience
ollama pull deepseek-r1:32b

# 48GB+ VRAM - Near-full capability
ollama pull deepseek-r1:70b

# 400GB+ - Full model (enterprise hardware)
ollama pull deepseek-r1:671b
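
Not sure which tag your card can handle? A small helper like this can map reported VRAM to a suggested pull command. It is illustrative only: it assumes an NVIDIA GPU with nvidia-smi on the PATH, and the thresholds are rough rules of thumb taken from the comments above:

import subprocess

def suggest_model() -> str:
    """Suggest a deepseek-r1 tag from total VRAM reported by nvidia-smi (in MiB)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        text=True,
    )
    vram_gb = sum(int(line) for line in out.splitlines()) / 1024
    for threshold, tag in [(48, "70b"), (24, "32b"), (12, "14b"), (8, "8b")]:
        if vram_gb >= threshold:
            return f"ollama pull deepseek-r1:{tag}"
    return "ollama pull deepseek-r1:1.5b"

print(suggest_model())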

Step 3: Run the Model

ollama run deepseek-r1:32b

Test with a reasoning problem:

A farmer has 17 sheep. All but 9 run away. How many sheep does the farmer have left?

Watch R1 reason through the problem: in the raw CLI output you'll see its <think> phase stream before it answers correctly with 9 sheep.

Step 4: Advanced Configuration

Create a custom Modelfile for optimized settings:

cat > Modelfile << 'EOF'
FROM deepseek-r1:32b

# Optimal settings for reasoning
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192

# System prompt for reasoning tasks
SYSTEM """You are DeepSeek R1, an advanced reasoning assistant.
For complex problems:
1. Break down the problem systematically
2. Show your reasoning process clearly
3. Verify your answer before finalizing
4. If you notice an error, correct it explicitly
Think carefully and explain your logic step by step."""
EOF

# Create optimized model
ollama create deepseek-r1-reasoning -f Modelfile

# Run optimized version
ollama run deepseek-r1-reasoning

Step 5: Pull Latest R1-0528 Version

For the improved May 2025 version with better JSON and function calling:

# Unsloth optimized versions
ollama pull hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q4_K_M

# Or the Qwen3 8B distilled with improvements
ollama pull hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL

VRAM Requirements: Complete Guide

Distilled Models by Quantization

| Model | FP16 | Q8_0 | Q5_K_M | Q4_K_M | Minimum GPU |
|---|---|---|---|---|---|
| R1 1.5B | 3GB | 2GB | 1.5GB | 1.2GB | GTX 1060 6GB |
| R1 7B | 14GB | 8GB | 6GB | 5GB | RTX 3060 8GB |
| R1 8B | 16GB | 9GB | 7GB | 6GB | RTX 3060 12GB |
| R1 14B | 28GB | 15GB | 11GB | 9GB | RTX 4060 Ti 16GB |
| R1 32B | 64GB | 34GB | 24GB | 20GB | RTX 4090 24GB |
| R1 70B | 140GB | 75GB | 52GB | 42GB | 2x RTX 4090 |

Full 671B Model Quantization Options

| Quantization | Size | VRAM Required | Setup | Tokens/sec |
|---|---|---|---|---|
| FP16/BF16 | ~1,400GB | 1,500-1,800GB | 20x H100 80GB | 200+ |
| FP8 | ~700GB | ~700GB | 9x H100 80GB | 180+ |
| Q4_K_M (4-bit) | ~400GB | ~400GB | 8x H100 80GB | 150+ |
| 2.51-bit Dynamic | ~212GB | ~212GB | 3x H100 80GB | 80+ |
| IQ1_M (1.78-bit) | ~183GB | 183GB + RAM | 2x H100 + offload | 40+ |
| TQ1_0 (1.66-bit) | ~162GB | 162GB | 192GB Mac Ultra | 2-3 |
| 1.58-bit Dynamic | ~131GB | 131GB | 2x RTX 4090 + 128GB RAM | 1-5 |
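
As a rough sanity check on these sizes: 671 billion parameters at an average of 1.58 bits each works out to about 671 Ɨ 1.58 / 8 ā‰ˆ 132GB, which lines up with the ~131GB figure above (dynamic quantization keeps a few sensitive layers at higher precision, so real averages vary slightly).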

Consumer Hardware Configurations

| Budget | Hardware | Best R1 Version | Performance |
|---|---|---|---|
| $300 | RTX 3060 12GB | R1 8B Q4_K_M | 30 tok/s |
| $500 | RTX 4060 Ti 16GB | R1 14B Q4_K_M | 32 tok/s |
| $800 | RTX 4070 Ti Super 16GB | R1 14B Q5_K_M | 38 tok/s |
| $1,200 | RTX 4080 Super 16GB | R1 14B Q8_0 | 40 tok/s |
| $1,600 | RTX 4090 24GB | R1 32B Q4_K_M | 28 tok/s |
| $3,000 | 2x RTX 4090 + 128GB RAM | R1 671B 1.58-bit | 1-5 tok/s |

Apple Silicon Performance

| Mac | Memory | Best R1 Version | Performance |
|---|---|---|---|
| M1/M2 8GB | 8GB | R1 1.5B Q4 | 45 tok/s |
| M1/M2 16GB | 16GB | R1 8B Q4 | 18 tok/s |
| M2/M3 Pro 32GB | 32GB | R1 14B Q5 | 22 tok/s |
| M3 Max 64GB | 64GB | R1 32B Q4 | 20 tok/s |
| M3 Max 128GB | 128GB | R1 70B Q4 | 12 tok/s |
| M3 Ultra 192GB | 192GB | R1 671B TQ1_0 | 2-3 tok/s |

Integration Options

Open WebUI (ChatGPT-like Interface)

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Access at http://localhost:3000 and select deepseek-r1:32b from the model dropdown.

VS Code Integration (Continue Extension)

  1. Install the Continue extension
  2. Open Continue settings (Ctrl+Shift+P > "Continue: Open Config")
  3. Add configuration:
{
  "models": [
    {
      "title": "DeepSeek R1 32B",
      "provider": "ollama",
      "model": "deepseek-r1:32b",
      "contextLength": 8192
    }
  ]
}

Python API Integration

import ollama

# Basic chat
response = ollama.chat(
    model='deepseek-r1:32b',
    messages=[{
        'role': 'user',
        'content': 'Solve step by step: What is the derivative of x^3 * sin(x)?'
    }]
)
print(response['message']['content'])

# Streaming for long reasoning
for chunk in ollama.chat(
    model='deepseek-r1:32b',
    messages=[{'role': 'user', 'content': 'Your complex question here'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)

OpenAI-Compatible API

Ollama exposes an OpenAI-compatible endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works
)

response = client.chat.completions.create(
    model="deepseek-r1:32b",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
    temperature=0.6
)
print(response.choices[0].message.content)

llama.cpp Direct (Advanced)

For maximum control with the full 671B model:

./llama.cpp/llama-cli \
  --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --cache-type-k q4_0 \
  --threads 16 \
  --temp 0.6 \
  --ctx-size 8192 \
  --n-gpu-layers 7 \
  -no-cnv \
  --prompt "<|User|>Your prompt here<|Assistant|>"

Real-World Use Cases

1. Mathematical Problem Solving

R1 excels at competition math. Feed it IMO problems, calculus derivations, or statistical analysis. Use prompts like: "Please reason step by step, and put your final answer within \boxed{}"
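
A quick sketch of that workflow with the Ollama Python client, pulling the final answer out of \boxed{} (the regex is illustrative and assumes a simple, non-nested boxed expression):

import re
import ollama

prompt = (
    "Please reason step by step, and put your final answer within \\boxed{}.\n"
    "What is the sum of the first 100 positive integers?"
)
response = ollama.chat(
    model="deepseek-r1:32b",
    messages=[{"role": "user", "content": prompt}],
)
text = response["message"]["content"]

# Extract the contents of \boxed{...} from the model's answer
match = re.search(r"\\boxed\{([^{}]*)\}", text)
print(match.group(1) if match else text)  # expected: 5050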

2. Code Architecture and Debugging

Ask R1 to design system architectures or debug complex logic. Its reasoning shows the "why" behind decisions—invaluable for learning and code review.

3. Document Analysis

R1 can parse complex documents, identify issues, and explain implications step by step. The visible reasoning creates an audit trail.

4. Scientific Research Assistance

Use for hypothesis evaluation, experimental design, and paper analysis. R1's 71.5% on GPQA Diamond (PhD-level science) demonstrates deep technical understanding.

5. Educational Tutoring

R1's visible reasoning makes it ideal for teaching. Students see the problem-solving process, not just answers. Perfect for math, physics, and programming education.

6. Complex Decision Analysis

Multi-criteria decisions benefit from R1's systematic reasoning. It naturally explores trade-offs and edge cases.


DeepSeek R1 vs Other Models

Reasoning Model Comparison

| Feature | DeepSeek R1 | OpenAI o1 | Claude 3.5 | GPT-4 |
|---|---|---|---|---|
| Visible Reasoning | Yes (<think> tags) | No | No | No |
| AIME 2024 | 79.8% | 79.2% | 16.0% | 9.3% |
| MATH-500 | 97.3% | 96.4% | 78.3% | 74.6% |
| Codeforces Elo | 2,029 | - | 1,886 | 1,891 |
| License | MIT (Open) | Proprietary | Proprietary | Proprietary |
| Local Running | Yes | No | No | No |
| API Cost (1M tokens) | $1.10-$2.19 | ~$15-60 | ~$15-18 | ~$30-60 |

Local Model Comparison

| Model | VRAM (Q4) | Reasoning | Coding | Speed (4090) |
|---|---|---|---|---|
| DeepSeek R1 32B | 20GB | Excellent | Excellent | 28 tok/s |
| Llama 3.1 70B | 42GB | Good | Excellent | 15 tok/s |
| Qwen 2.5 72B | 44GB | Good | Excellent | 14 tok/s |
| Mistral Large | 18GB | Good | Good | 32 tok/s |
| Phi-4 14B | 10GB | Good | Good | 45 tok/s |

Verdict: For reasoning-heavy tasks, R1 32B is unmatched in the "runs on single 4090" category.


Troubleshooting Common Issues

Model Loads Slowly

# Pre-load into memory and keep warm
ollama run deepseek-r1:32b "warmup" --keepalive 1h

Out of Memory Errors

# Use smaller quantization
ollama pull deepseek-r1:32b-q4_0

# Reduce context window: inside an interactive ollama run session, type
/set parameter num_ctx 4096

Slow Generation Speed

# Verify GPU is being used
ollama ps

# Force more layers onto the GPU: inside an interactive ollama run session, type
/set parameter num_gpu 999

# Check CUDA installation
nvidia-smi

Thinking Tokens Not Visible

Some interfaces hide <think> tags. Check settings for:

  • "Show reasoning" or "Show thinking"
  • "Raw output mode"
  • Or use the CLI directly to see full output

Model Gives Truncated Responses

# Increase max output tokens: inside an interactive ollama run session, type
/set parameter num_predict 4096

JSON Output Issues (Pre-R1-0528)

Upgrade to R1-0528, which has proper JSON support:

ollama pull hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q4_K_M
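
Once you are on R1-0528, you can ask for structured output. With the Ollama Python client, format="json" constrains generation to valid JSON; a minimal sketch (the model tag assumes the Unsloth build pulled above):

import json
import ollama

response = ollama.chat(
    model="hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q4_K_M",
    messages=[{
        "role": "user",
        "content": 'List three prime numbers as JSON: {"primes": [...]}',
    }],
    format="json",  # constrain output to valid JSON
)
data = json.loads(response["message"]["content"])
print(data["primes"])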

Key Takeaways

  1. DeepSeek R1 is the best open-source reasoning model, matching OpenAI o1 on complex math and coding benchmarks
  2. The 32B distilled version is ideal for single-GPU setups (RTX 4090 or 64GB Mac)
  3. Visible <think> tokens make R1 uniquely transparent—you can watch it reason
  4. MIT license means completely free commercial use with no API costs
  5. R1-0528 update fixed JSON, function calling, and reduced hallucinations
  6. Full 671B is runnable on consumer hardware with extreme quantization (1.58-bit on 2x 4090 + RAM)
  7. V4 expected February 2026 with enhanced coding capabilities

Next Steps

  1. Set up RAG for document analysis with DeepSeek R1
  2. Build AI agents using R1's reasoning capabilities
  3. Compare with Llama 4 for your specific use case
  4. Understand MoE architecture behind DeepSeek's efficiency
  5. Check VRAM requirements for different model configurations
  6. Learn about MCP servers to extend R1's capabilities

DeepSeek R1 represents a paradigm shift in open-source AI. For the first time, anyone can run a model that rivals the best closed-source systems for complex reasoning—completely free, completely private, completely yours. The visible reasoning process makes it uniquely suited for education, debugging, and trust-critical applications. Whether you're running the 8B distilled on an 8GB GPU or the full 671B on enterprise hardware, R1 delivers reasoning capabilities that were impossible to access locally just a year ago.
