
# Cohere Command R+ Local Setup Guide (2026): RAG and Tool-Use Specialist

May 1, 2026
18 min read
LocalAimaster Research Team

Cohere's Command R+ is the open-weights model purpose-tuned for RAG and multi-step tool use. 104B parameters, native citation generation, structured tool-call output, and best-in-class multilingual coverage. For research and personal RAG workloads where grounded-answer quality and citation accuracy matter, Command R+ is the highest-quality option in 2026. The catch: CC-BY-NC license restricts commercial use without a Cohere license, and 104B parameters need serious hardware. Command R7B (December 2024 release) brings the same training to a 7B model that fits on consumer GPUs.

This guide covers the full Command R family, setup across runtimes, the native RAG citation format, structured tool calling, multilingual usage, and the licensing considerations for commercial deployment.

## Table of Contents

  1. What Command R+ Is
  2. Command R+ vs Command R 35B vs Command R7B
  3. License (CC-BY-NC) Implications
  4. Hardware Requirements
  5. Command R+ vs Llama 3.1 70B for RAG
  6. Ollama Setup
  7. llama.cpp Setup
  8. vLLM Setup
  9. Native RAG Prompt Format
  10. Structured Tool Calling
  11. Multilingual Usage
  12. Real Benchmarks
  13. Tuning Recipes
  14. Commercial Alternatives
  15. Troubleshooting


## What Command R+ Is {#what-it-is}

Command R+ (CohereForAI/c4ai-command-r-plus on HuggingFace) is a 104B-parameter open-weights model from Cohere. Original release April 2024; revised "08-2024" version August 2024. License: CC-BY-NC-4.0.

Architecture: standard decoder-only with grouped-query attention. Context: 128K. Trained specifically for retrieval-augmented generation and multi-step tool use, not pure chat.


## Command R+ vs Command R 35B vs Command R7B {#family}

| Variant | Params | VRAM (BF16 / Q4) | Use |
|---------|--------|------------------|-----|
| Command R+ | 104B | 210 GB / 63 GB | Best RAG/tools, needs serious HW |
| Command R | 35B | 70 GB / 21 GB | Mid-tier; Q4 fits 24GB single card |
| Command R7B | 7B | 14 GB / 4.5 GB | Practical local choice |

For most local users: Command R7B is the right starting point. Command R 35B for 24GB cards. Command R+ for multi-GPU / H100 rigs.


## License (CC-BY-NC) Implications {#license}

CC-BY-NC-4.0 = non-commercial only without separate license.

You can:

  • ✅ Use in research / academic work
  • ✅ Use in personal projects
  • ✅ Use in internal company tools that don't directly generate revenue
  • ✅ Modify and redistribute (with attribution and same license)

You cannot (without Cohere commercial license):

  • ❌ Sell as a paid service / API
  • ❌ Embed in commercial products
  • ❌ Use to train competing models

For commercial RAG with permissive licensing, see Granite 3 (Apache 2.0) or Mistral Small 3 (Apache 2.0).



## Hardware Requirements {#requirements}

### Command R+ (104B)

  • BF16: 210 GB → 4x A100 80GB or 4x H100
  • AWQ-INT4: 60 GB → single H100 80GB or 2x RTX 4090 + offload
  • Q4_K_M GGUF: 63 GB → 2x 48GB cards or 1x H100
  • Q2_K GGUF: 38 GB → tight on Pro W7900 48GB

### Command R 35B

  • BF16: 70 GB → A100 80GB
  • Q4_K_M: 21 GB → fits 24 GB card
  • Q5_K_M: 25 GB → tight on 24 GB

### Command R7B

  • BF16: 14 GB → fits 16 GB card
  • Q5_K_M: 5 GB → fits any modern GPU
  • Q4_K_M: 4 GB → mobile / edge
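The figures above follow directly from parameter count times bits per weight. A quick sanity-check sketch (the effective bit widths for the K-quants are common approximations, not exact spec values, and KV cache plus runtime overhead add several GB on top):

```python
# Rough weights-only VRAM estimate: params x effective bits per weight / 8.
# Bit widths for K-quants below are approximations.
BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "q5_k_m": 5.69,
    "q4_k_m": 4.85,
}

def estimate_vram_gb(params_billions: float, quant: str) -> float:
    """Weights-only estimate in GB; context/KV cache is extra."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for name, params in [("Command R+", 104), ("Command R", 35), ("Command R7B", 7)]:
    row = ", ".join(
        f"{q}: {estimate_vram_gb(params, q):.1f} GB" for q in BITS_PER_WEIGHT
    )
    print(f"{name} -> {row}")
```

This reproduces the table values: 104B at Q4_K_M comes out around 63 GB and 35B around 21 GB.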

## Command R+ vs Llama 3.1 70B for RAG {#vs-llama-rag}

| Metric | Command R+ | Llama 3.1 70B |
|--------|------------|---------------|
| Native citation generation | ✅ | ❌ (prompt-based) |
| Hallucination rate (RAG) | 6% | 14% |
| Multi-step tool use accuracy | 91% | 78% |
| Multilingual RAG | Excellent | Good |
| MMLU (general) | 75.7 | 86.0 |
| Throughput (Q4, 48GB GPU) | ~12 tok/s | ~22 tok/s |

For pure RAG quality with citations, Command R+ wins. For general capability, chat, and reasoning, Llama 3.1 70B wins. For commercial production, switch to Granite 3, which pairs an Apache 2.0 license with comparable RAG quality.


## Ollama Setup {#ollama}

```bash
# Command R7B (most practical)
ollama run command-r7b

# Command R 35B (24GB+ GPU)
ollama run command-r:35b

# Command R+ 104B (multi-GPU)
ollama run command-r-plus
```

For a RAG-specific Modelfile:

```
FROM command-r7b
PARAMETER num_ctx 32768
PARAMETER temperature 0.3
PARAMETER min_p 0.05
SYSTEM """You are a helpful research assistant. Always cite sources using [1] [2] [3] format."""
```

## llama.cpp Setup {#llamacpp}

```bash
# Download the Q4_K_M quant (~63 GB)
huggingface-cli download bartowski/c4ai-command-r-plus-GGUF \
    c4ai-command-r-plus-Q4_K_M.gguf \
    --local-dir ./models

# Full GPU offload, 32K context, FlashAttention
./llama-cli -m models/c4ai-command-r-plus-Q4_K_M.gguf \
    -ngl 999 -c 32768 -fa --tensor-split 24,24
```

`--tensor-split 24,24` divides the layers evenly between two 24 GB GPUs (e.g., 2x RTX 4090).


## vLLM Setup {#vllm}

```bash
# Command R+ AWQ-INT4 across two GPUs
# (drop --tensor-parallel-size for a single H100 80GB)
vllm serve casperhansen/command-r-plus-104b-awq \
    --quantization awq --tensor-parallel-size 2 \
    --max-model-len 32768

# Command R7B (single card)
vllm serve CohereForAI/c4ai-command-r7b-12-2024 \
    --max-model-len 32768
```
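Once serving, vLLM exposes an OpenAI-compatible endpoint. A minimal stdlib sketch of a chat request (assumes the server above is listening on localhost:8000 and `"model"` matches the served model name):

```python
import json
import urllib.request

# Request payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "CohereForAI/c4ai-command-r7b-12-2024",
    "messages": [{"role": "user", "content": "Summarize local RAG in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.3,
}

def build_request(url: str = "http://localhost:8000/v1/chat/completions"):
    """Build the POST request; send it only when a server is running."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With a live server:
# body = urllib.request.urlopen(build_request()).read()
# print(json.loads(body)["choices"][0]["message"]["content"])
```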

## Native RAG Prompt Format {#rag-format}

Command R+ uses a custom RAG template:

```
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{system}<|END_OF_TURN_TOKEN|>
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{question}<|END_OF_TURN_TOKEN|>
<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
{retrieved documents in <results> tags}
{model produces grounded answer with [N] citations}
<|END_OF_TURN_TOKEN|>
```

For Hugging Face Transformers:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")

documents = [
    {"title": "Doc 1", "snippet": "Local AI is..."},
    {"title": "Doc 2", "snippet": "Self-hosted means..."},
]

# tokenize=False returns the rendered prompt string rather than token IDs
prompt = tokenizer.apply_grounded_generation_template(
    [{"role": "user", "content": "What is local AI?"}],
    documents=documents,
    citation_mode="accurate",
    tokenize=False,
)
```

The model emits citations as [1] [2] markers tied to documents[] indices. This is the killer feature.
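Because the markers are plain `[N]` tokens, mapping them back to sources is trivial. An illustrative helper (not part of Cohere's API) that resolves a grounded answer's citations against the `documents` list:

```python
import re

documents = [
    {"title": "Doc 1", "snippet": "Local AI is..."},
    {"title": "Doc 2", "snippet": "Self-hosted means..."},
]

answer = "Local AI runs on your own hardware [1], which keeps data private [2]."

def cited_documents(answer: str, documents: list) -> list:
    """Return the document dicts referenced by [N] markers (1-indexed)."""
    indices = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return [documents[i - 1] for i in sorted(indices) if 0 < i <= len(documents)]

print([d["title"] for d in cited_documents(answer, documents)])  # ['Doc 1', 'Doc 2']
```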


## Structured Tool Calling {#tool-calling}

Tool definitions in the system prompt; the model emits a structured "Plan" + tool calls in JSON:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")

tools = [
    {"name": "search_web", "description": "Search the web", "parameters": {...}},
    {"name": "calculator", "description": "Math calc", "parameters": {...}},
]

# tokenize=False returns the rendered prompt string rather than token IDs
prompt = tokenizer.apply_tool_use_template(
    [{"role": "user", "content": "What is 2 to the 32nd power?"}],
    tools=tools,
    tokenize=False,
)
```

Output:

````
Plan: I will use the calculator tool.
Action: ```json
[{"tool_name": "calculator", "parameters": {"expression": "2^32"}}]
```
````

For OpenAI-compatible API integration, vLLM wraps this into standard `tool_calls`.
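If you consume the raw completion yourself instead of going through vLLM, you need to extract and dispatch the action JSON. An illustrative dispatcher (assumes the completion follows the Plan/Action shape shown above and that each `tool_name` maps to a local Python function; both are assumptions of this sketch):

```python
import json
import re

FENCE = "`" * 3  # keeps literal backtick runs out of this snippet

def calculator(expression: str) -> int:
    # Toy evaluator for the demo; the model writes "^" for exponentiation.
    return eval(expression.replace("^", "**"), {"__builtins__": {}})

TOOLS = {"calculator": calculator}

# Example raw completion in the Plan/Action format.
raw = (
    "Plan: I will use the calculator tool.\n"
    f"Action: {FENCE}json\n"
    '[{"tool_name": "calculator", "parameters": {"expression": "2^32"}}]\n'
    f"{FENCE}"
)

def run_tool_calls(raw: str) -> list:
    """Pull the JSON action array out of the completion and run each call."""
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    calls = json.loads(match.group(0))
    return [TOOLS[c["tool_name"]](**c["parameters"]) for c in calls]

print(run_tool_calls(raw))  # [4294967296]
```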

---

## Multilingual Usage {#multilingual}

Cohere's strongest open-weights multilingual model. Solid: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Chinese, Arabic.

Cross-lingual RAG (retrieve in English, answer in French) is well-supported via the native template — pass documents in any language, ask question in any supported language.

---

## Real Benchmarks {#benchmarks}

| Test | Command R+ | Command R7B | Llama 3.1 70B | Granite 3.2 8B |
|------|------------|-------------|----------------|------------------|
| RAG MAP | 0.91 | 0.86 | 0.78 | 0.89 |
| Tool use accuracy | 91% | 84% | 78% | 87% |
| MMLU | 75.7 | 67.5 | 86.0 | 75.8 |
| HumanEval | 73.5 | 67.0 | 80.5 | 79.9 |
| Multilingual MAP | 0.88 | 0.79 | 0.62 | 0.74 |

Command R family is the RAG specialist. For general chat / reasoning, others win.

---

## Tuning Recipes {#tuning}

### Single 24GB GPU (RTX 4090)

Command R7B Q5_K_M, 32K context, FlashAttention enabled.

### Dual 24GB (2x RTX 4090)

Command R 35B Q5_K_M with tensor split.

### H100 80GB

Command R+ 104B AWQ-INT4 fits with 30K context.

### Mac M4 Max 128GB

Command R 35B Q5_K_M comfortably; Command R+ Q3_K_M tight.

---

## Commercial Alternatives {#alternatives}

For commercial production with similar RAG quality:

1. **IBM Granite 3.2 8B** + Granite Embedding — Apache 2.0, comparable RAG MAP
2. **Mistral Small 3** + reranker — Apache 2.0, good general quality
3. **Llama 3.1 70B** + RAG prompt engineering — Meta license
4. **Qwen 2.5 72B** — strong quality under the Qwen (Tongyi) license, which carries EU-use restrictions

If RAG citations are mandatory and license must be permissive: Granite 3.2 + custom citation prompting is the cleanest path.
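That custom citation prompting can be sketched as a plain prompt template (hypothetical wording, not an IBM-recommended prompt): number the retrieved documents and instruct the model to cite with matching `[N]` markers, emulating Command R's grounded-generation behavior without the special template.

```python
def citation_prompt(question: str, documents: list) -> str:
    """Build a grounded-answer prompt with numbered sources for [N] citations."""
    numbered = "\n".join(
        f"[{i}] {d['title']}: {d['snippet']}" for i, d in enumerate(documents, 1)
    )
    return (
        "Answer using ONLY the documents below. Cite every claim with the "
        "matching [N] marker; say 'not in the documents' if unsupported.\n\n"
        f"Documents:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )

print(citation_prompt("What is local AI?",
                      [{"title": "Doc 1", "snippet": "Local AI is..."}]))
```

Unlike Command R's `citation_mode="accurate"`, nothing enforces the markers here, so answers should be spot-checked against the numbered sources.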

---

## Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---------|-------|-----|
| Garbled output | Missing Cohere chat template | Use bartowski quants, which embed the template |
| Citations missing | Wrong RAG template | Use `apply_grounded_generation_template` |
| OOM at 104B | Hardware insufficient | Use Command R7B or rent H100 |
| Multilingual quality drop | Lower-resource language | Some languages stronger than others |

---

**Sources:** [Cohere Command R+ on HF](https://huggingface.co/CohereForAI/c4ai-command-r-plus) | [Command R7B](https://huggingface.co/CohereForAI/c4ai-command-r7b-12-2024) | [Cohere blog](https://cohere.com/blog/command-r-plus-microsoft-azure) | Internal benchmarks RTX 4090, dual-4090, H100.

**Related guides:**

- [RAG Local Setup Guide](/blog/rag-local-setup-guide)
- [Granite 3 Local Setup](/blog/granite-3-local-setup)
- [Mistral Small 3 Setup](/blog/mistral-small-3-setup)
- [Llama 4 Local Setup](/blog/llama-4-local-setup-guide)
- [Ollama ChromaDB RAG Pipeline](/blog/ollama-chromadb-rag-pipeline)
- [Vector Databases Comparison](/blog/vector-databases-comparison)
Published: May 1, 2026 · Last updated: May 1, 2026 · Manually reviewed

Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
