Cohere Command R+ Local Setup Guide (2026): RAG and Tool-Use Specialist
Want to go deeper than this article?
Free account unlocks the first chapter of all 17 courses — RAG, agents, MCP, voice AI, MLOps, real GitHub repos.
Cohere's Command R+ is the open-weights model purpose-tuned for RAG and multi-step tool use. 104B parameters, native citation generation, structured tool-call output, and best-in-class multilingual coverage. For research and personal RAG workloads where grounded-answer quality and citation accuracy matter, Command R+ is the highest-quality option in 2026. The catch: CC-BY-NC license restricts commercial use without a Cohere license, and 104B parameters need serious hardware. Command R7B (December 2024 release) brings the same training to a 7B model that fits on consumer GPUs.
This guide covers the full Command R family, setup across runtimes, the native RAG citation format, structured tool calling, multilingual usage, and the licensing considerations for commercial deployment.
## Table of Contents
- What Command R+ Is
- Command R+ vs Command R 35B vs Command R7B
- License (CC-BY-NC) Implications
- Hardware Requirements
- Command R+ vs Llama 3.1 70B for RAG
- Ollama Setup
- llama.cpp Setup
- vLLM Setup
- Native RAG Prompt Format
- Structured Tool Calling
- Multilingual Usage
- Real Benchmarks
- Tuning Recipes
- Commercial Alternatives
- Troubleshooting
- FAQ
Reading articles is good. Building is better.
Free account = 17+ structured chapters across 17 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.
## What Command R+ Is {#what-it-is}
Command R+ (CohereForAI/c4ai-command-r-plus on HuggingFace) is a 104B-parameter open-weights model from Cohere. Original release April 2024; revised "08-2024" version August 2024. License: CC-BY-NC-4.0.
Architecture: standard decoder-only with grouped-query attention. Context: 128K. Trained specifically for retrieval-augmented generation and multi-step tool use, not pure chat.
## Command R+ vs Command R 35B vs Command R7B {#family}
| Variant | Params | VRAM (BF16/Q4) | Use |
|---|---|---|---|
| Command R+ | 104B | 210 GB / 63 GB | Best RAG/tools, needs serious HW |
| Command R | 35B | 70 GB / 21 GB | Mid-tier; Q4 fits 24GB single card |
| Command R7B | 7B | 14 GB / 4.5 GB | Practical local choice |
For most local users: Command R7B is the right starting point. Command R 35B for 24GB cards. Command R+ for multi-GPU / H100 rigs.
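That guidance can be encoded as a quick rule of thumb. A minimal sketch; the VRAM thresholds are approximations derived from the table above, not official Cohere figures:

```python
def pick_command_r(vram_gb: float) -> str:
    """Rule-of-thumb variant picker based on the Q4/Q5 figures in the table above."""
    if vram_gb >= 60:
        return "command-r-plus (104B, Q4_K_M)"
    if vram_gb >= 24:
        return "command-r (35B, Q4_K_M)"
    return "command-r7b (Q5_K_M)"

print(pick_command_r(16))  # command-r7b (Q5_K_M)
print(pick_command_r(24))  # command-r (35B, Q4_K_M)
print(pick_command_r(96))  # command-r-plus (104B, Q4_K_M)
```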
## License (CC-BY-NC) Implications {#license}
CC-BY-NC-4.0 = non-commercial only without separate license.
You can:
- ✅ Use in research / academic work
- ✅ Use in personal projects
- ✅ Use in internal company tools that don't directly generate revenue
- ✅ Modify and redistribute (with attribution and same license)
You cannot (without Cohere commercial license):
- ❌ Sell as a paid service / API
- ❌ Embed in commercial products
- ❌ Use to train competing models
For commercial RAG with permissive licensing, see Granite 3 (Apache 2.0) or Mistral Small 3 (Apache 2.0).
## Hardware Requirements {#requirements}
### Command R+ (104B)
- BF16: 210 GB → 4x A100 80GB or 4x H100
- AWQ-INT4: 60 GB → single H100 80GB or 2x RTX 4090 + offload
- Q4_K_M GGUF: 63 GB → 2x 48GB cards or 1x H100
- Q2_K GGUF: 38 GB → tight on Pro W7900 48GB
### Command R 35B
- BF16: 70 GB → A100 80GB
- Q4_K_M: 21 GB → fits 24 GB card
- Q5_K_M: 25 GB → tight on 24 GB
### Command R7B
- BF16: 14 GB → fits 16 GB card
- Q5_K_M: 5 GB → fits any modern GPU
- Q4_K_M: 4 GB → mobile / edge
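The per-quant figures above follow directly from parameter count times bits per weight. A back-of-envelope helper; the ~4.85 bits/weight average for Q4_K_M is an approximation, and KV cache plus activations add several GB on top of the weight footprint:

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """GB needed just for the weights; KV cache and activations add more on top."""
    return round(params_billions * bits_per_weight / 8, 1)

print(weight_vram_gb(104, 16))    # 208.0 -> the ~210 GB BF16 figure above
print(weight_vram_gb(104, 4.85))  # ~63 GB, Q4_K_M (assumed ~4.85 bits/weight avg)
print(weight_vram_gb(7, 16))      # 14.0 GB, Command R7B BF16
```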
## Command R+ vs Llama 3.1 70B for RAG {#vs-llama-rag}
| Metric | Command R+ | Llama 3.1 70B |
|---|---|---|
| Native citation generation | ✅ | ❌ (prompt-based) |
| Hallucination rate (RAG) | 6% | 14% |
| Multi-step tool use accuracy | 91% | 78% |
| Multilingual RAG | Excellent | Good |
| MMLU (general) | 75.7 | 86.0 |
| Throughput (Q4, 48GB GPU) | ~12 tok/s | ~22 tok/s |
For pure RAG quality with citations, Command R+ wins. For general capability, chat, and reasoning, Llama 3.1 70B wins. For commercial production, switch to Granite 3, which pairs an Apache 2.0 license with comparable RAG quality.
## Ollama Setup {#ollama}
```bash
# Command R7B (most practical)
ollama run command-r7b

# Command R 35B (24GB+ GPU)
ollama run command-r:35b

# Command R+ 104B (multi-GPU)
ollama run command-r-plus
```
For a RAG-specific setup, use a Modelfile:
```
FROM command-r7b
PARAMETER num_ctx 32768
PARAMETER temperature 0.3
PARAMETER min_p 0.05
SYSTEM """You are a helpful research assistant. Always cite sources using [1] [2] [3] format."""
```
## llama.cpp Setup {#llamacpp}
```bash
huggingface-cli download bartowski/c4ai-command-r-plus-GGUF \
  c4ai-command-r-plus-Q4_K_M.gguf \
  --local-dir ./models

./llama-cli -m models/c4ai-command-r-plus-Q4_K_M.gguf \
  -ngl 999 -c 32768 -fa --tensor-split 24,24
```
`--tensor-split 24,24` splits the tensors evenly across two GPUs. Note the hardware table above: Q4_K_M (~63 GB) needs two 48 GB cards; on two 24 GB cards (e.g., 2x RTX 4090), lower `-ngl` so the remaining layers run on CPU.
## vLLM Setup {#vllm}
```bash
# AWQ-INT4 on a single H100 80GB
vllm serve casperhansen/command-r-plus-104b-awq \
  --quantization awq --max-model-len 32768

# Command R7B (single card)
vllm serve CohereForAI/c4ai-command-r7b-12-2024 \
  --max-model-len 32768
```
On two 48 GB cards, add `--tensor-parallel-size 2` to the Command R+ command.
## Native RAG Prompt Format {#rag-format}
Command R+ uses a custom RAG template:
```
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{system}<|END_OF_TURN_TOKEN|>
<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{question}<|END_OF_TURN_TOKEN|>
<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
{retrieved documents in <results> tags}
{model produces grounded answer with [N] citations}
<|END_OF_TURN_TOKEN|>
```
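As an illustration only, the turn structure above can be hand-assembled like this. In practice Cohere's chat template handles the exact `<results>` formatting; the document layout sketched here is a simplification, not Cohere's canonical serialization:

```python
def build_grounded_prompt(system: str, question: str, documents: list[dict]) -> str:
    """Hand-rolled sketch of the turn structure shown above (simplified)."""
    # Simplified document serialization; the real chat template formats this itself
    results = "\n".join(
        f"Document: {i}\ntitle: {d['title']}\nsnippet: {d['snippet']}"
        for i, d in enumerate(documents)
    )
    return (
        f"<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{system}<|END_OF_TURN_TOKEN|>"
        f"<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{question}<|END_OF_TURN_TOKEN|>"
        f"<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
        f"<results>\n{results}\n</results>"
    )

prompt = build_grounded_prompt(
    "Cite sources.",
    "What is local AI?",
    [{"title": "Doc 1", "snippet": "Local AI is..."}],
)
print("<results>" in prompt)  # True
```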
For Hugging Face Transformers:
```python
from transformers import AutoTokenizer

# Gated repo: may require accepting the model license on Hugging Face first
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")

documents = [
    {"title": "Doc 1", "snippet": "Local AI is..."},
    {"title": "Doc 2", "snippet": "Self-hosted means..."},
]

# tokenize=False returns the fully formatted prompt string
prompt = tokenizer.apply_grounded_generation_template(
    [{"role": "user", "content": "What is local AI?"}],
    documents=documents,
    citation_mode="accurate",
    tokenize=False,
)
```
The model emits citations as [1] [2] markers tied to documents[] indices. This is the killer feature.
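Because citations arrive as plain `[N]` markers, mapping them back to sources is a small post-processing step. A minimal sketch, assuming 1-based markers as in the example output:

```python
import re

def resolve_citations(answer: str, documents: list[dict]) -> list[tuple[int, str]]:
    """Map [N] citation markers in a grounded answer back to document titles."""
    cited = []
    for m in re.finditer(r"\[(\d+)\]", answer):
        idx = int(m.group(1)) - 1  # markers are 1-based, documents[] is 0-based
        if 0 <= idx < len(documents):
            cited.append((idx + 1, documents[idx]["title"]))
    return cited

docs = [
    {"title": "Doc 1", "snippet": "Local AI is..."},
    {"title": "Doc 2", "snippet": "Self-hosted means..."},
]
print(resolve_citations("Local AI runs on your hardware [1][2].", docs))
# [(1, 'Doc 1'), (2, 'Doc 2')]
```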
## Structured Tool Calling {#tool-calling}
Tool definitions in the system prompt; the model emits a structured "Plan" + tool calls in JSON:
```python
tools = [
    {"name": "search_web", "description": "Search the web", "parameters": {...}},
    {"name": "calculator", "description": "Math calc", "parameters": {...}},
]

# Same tokenizer as in the grounded-generation example above
prompt = tokenizer.apply_tool_use_template(
    [{"role": "user", "content": "What is 2 to the 32nd power?"}],
    tools=tools,
    tokenize=False,
)
```
Output:
````
Plan: I will use the calculator tool.
Action: ```json
[{"tool_name": "calculator", "parameters": {"expression": "2^32"}}]
```
````
For OpenAI-compatible API integration, vLLM wraps this into standard `tool_calls`.
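If you are consuming the raw completion instead of vLLM's wrapped `tool_calls`, the `Action:` block can be parsed with a few lines. A sketch that assumes the output uses the ```json fence shown above:

```python
import json
import re

def parse_tool_calls(model_output: str) -> list[dict]:
    """Extract the JSON tool-call list from a Command R 'Action:' block."""
    match = re.search(r"Action: ```json\s*(.*?)\s*```", model_output, re.DOTALL)
    if not match:
        return []
    return json.loads(match.group(1))

output = (
    "Plan: I will use the calculator tool.\n"
    "Action: ```json\n"
    '[{"tool_name": "calculator", "parameters": {"expression": "2^32"}}]\n'
    "```"
)
calls = parse_tool_calls(output)
print(calls[0]["tool_name"])  # calculator
```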
---
## Multilingual Usage {#multilingual}
Command R+ is Cohere's strongest open-weights multilingual model, with solid coverage of English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Chinese, and Arabic.
Cross-lingual RAG (retrieve in English, answer in French) is well-supported via the native template — pass documents in any language, ask question in any supported language.
---
## Real Benchmarks {#benchmarks}
| Test | Command R+ | Command R7B | Llama 3.1 70B | Granite 3.2 8B |
|------|------------|-------------|----------------|------------------|
| RAG MAP | 0.91 | 0.86 | 0.78 | 0.89 |
| Tool use accuracy | 91% | 84% | 78% | 87% |
| MMLU | 75.7 | 67.5 | 86.0 | 75.8 |
| HumanEval | 73.5 | 67.0 | 80.5 | 79.9 |
| Multilingual MAP | 0.88 | 0.79 | 0.62 | 0.74 |
Command R family is the RAG specialist. For general chat / reasoning, others win.
---
## Tuning Recipes {#tuning}
### Single 24GB GPU (RTX 4090)
Command R7B Q5_K_M, 32K context, FlashAttention enabled.
### Dual 24GB (2x RTX 4090)
Command R 35B Q5_K_M with tensor split.
### H100 80GB
Command R+ 104B AWQ-INT4 fits with 30K context.
### Mac M4 Max 128GB
Command R 35B Q5_K_M comfortably; Command R+ Q3_K_M tight.
---
## Commercial Alternatives {#alternatives}
For commercial production with similar RAG quality:
1. **IBM Granite 3.2 8B** + Granite Embedding — Apache 2.0, comparable RAG MAP
2. **Mistral Small 3** + reranker — Apache 2.0, good general quality
3. **Llama 3.1 70B** + RAG prompt engineering — Meta license
4. **Qwen 2.5 72B** + Tongyi license — strong, EU restrictions
If RAG citations are mandatory and license must be permissive: Granite 3.2 + custom citation prompting is the cleanest path.
---
## Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---------|-------|-----|
| Wrong chat template | Missing Cohere format | bartowski quants include template |
| Citations missing | Wrong RAG template | Use `apply_grounded_generation_template` |
| OOM at 104B | Hardware insufficient | Use Command R7B or rent H100 |
| Multilingual quality drop | Lower-resource language | Some languages stronger than others |
---
## FAQ {#faq}
See answers to common Command R+ questions below.
---
**Sources:** [Cohere Command R+ on HF](https://huggingface.co/CohereForAI/c4ai-command-r-plus) | [Command R7B](https://huggingface.co/CohereForAI/c4ai-command-r7b-12-2024) | [Cohere blog](https://cohere.com/blog/command-r-plus-microsoft-azure) | Internal benchmarks RTX 4090, dual-4090, H100.
**Related guides:**
- [RAG Local Setup Guide](/blog/rag-local-setup-guide)
- [Granite 3 Local Setup](/blog/granite-3-local-setup)
- [Mistral Small 3 Setup](/blog/mistral-small-3-setup)
- [Llama 4 Local Setup](/blog/llama-4-local-setup-guide)
- [Ollama ChromaDB RAG Pipeline](/blog/ollama-chromadb-rag-pipeline)
- [Vector Databases Comparison](/blog/vector-databases-comparison)