Dragon 7B by LLMWare
Updated: March 16, 2026
Correction Notice (March 2026)
This page previously contained fabricated general benchmarks (MMLU 78.4%, HumanEval 71.2%) and mischaracterized Dragon as a general-purpose chat model. Dragon is a RAG-specialized model family by LLMWare designed for document Q&A and fact extraction — not general reasoning or coding. This rewrite reflects the model's actual purpose and capabilities.
LLMWare's Dragon model family: small language models fine-tuned specifically for retrieval-augmented generation (RAG), document Q&A, and fact extraction from context.
Not a general-purpose chatbot — a specialized tool for building enterprise document pipelines.
What Is Dragon 7B?
- RAG-Optimized: fine-tuned for document Q&A and context-based answers
- By LLMWare: enterprise AI company focused on RAG and document intelligence
- Multiple Bases: available on Mistral 7B, Llama 2 7B, and Yi 6B bases
- Apache 2.0: open source, commercial-friendly license
Dragon is not a general-purpose chatbot. It's a family of small language models (SLMs) by LLMWare that are fine-tuned on proprietary instruction datasets focused on reading documents, extracting facts, and answering questions based on provided context. Think of it as a local alternative to using GPT-4 or Claude for RAG pipelines — much smaller, much cheaper, but specifically trained for the retrieval-augmented generation workflow.
Dragon Model Family
LLMWare released Dragon as fine-tunes on multiple base models. The “7B” variants are the most popular for local deployment. All are available on Hugging Face under the llmware organization.
| Model | HuggingFace ID | Base Model | Context | Best For |
|---|---|---|---|---|
| Dragon Mistral 7B | llmware/dragon-mistral-7b-v0 | Mistral 7B | 8K tokens | Best overall RAG quality |
| Dragon Llama 7B | llmware/dragon-llama-7b-v0 | Llama 2 7B | 4K tokens | Broadest compatibility |
| Dragon Yi 6B | llmware/dragon-yi-6b-v0 | Yi 6B | 4K tokens | Smallest variant |
| Dragon Deci 7B | llmware/dragon-deci-7b-v0 | Deci 7B | 8K tokens | Fast inference |
Source: LLMWare HuggingFace Collection. All Dragon models use the same RAG fine-tuning dataset; the base model determines general capability.
Also from LLMWare: SLIM Models
LLMWare also released SLIM (Structured Language Instruction Models) — even smaller models (1-3B parameters) designed for specific tasks like NER, classification, sentiment analysis, and SQL generation. If you need structured output extraction rather than free-form Q&A, SLIM models may be more appropriate than Dragon. See llmware on HuggingFace.
How Dragon Works: RAG Architecture
RAG Fine-Tuning: Unlike general chat models trained on broad instruction-following, Dragon was fine-tuned specifically on a pattern: given a context passage and a question, produce a concise, factual answer grounded in the context. This makes it resist hallucination better than general models when used in RAG pipelines.
Input Format: Dragon expects a specific prompt structure with context and question fields. The LLMWare library handles this formatting automatically, but if using raw inference you need to follow the template.
Dragon Prompt Template:
<human>: Based on the following context:
{retrieved_document_text}
Please answer the question: {user_question}
<bot>:

The key advantage: Dragon is trained to say “not enough information” when the context doesn't contain the answer, rather than hallucinating. This is critical for enterprise RAG, where accuracy matters more than creativity.
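If you are calling the model outside the LLMWare library, a small helper can assemble this template. A minimal sketch — the function name is my own, and the LLMWare library does this formatting for you automatically:

```python
def build_dragon_prompt(context: str, question: str) -> str:
    """Assemble Dragon's expected <human>/<bot> RAG prompt.

    Hypothetical helper: only needed for raw inference
    (e.g. Transformers or llama.cpp), since the llmware
    library applies the template itself.
    """
    return (
        "<human>: Based on the following context:\n"
        f"{context}\n"
        f"Please answer the question: {question}\n"
        "<bot>:"
    )

prompt = build_dragon_prompt(
    "The contract expires on December 31, 2025.",
    "When does the contract expire?",
)
```

The answer text the model generates follows directly after the `<bot>:` marker.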
What Dragon Is NOT
Dragon is not designed for and should not be evaluated on:
- General chat/conversation — use Mistral 7B, Llama 3, or Qwen 2.5 instead
- Code generation — use CodeLlama, DeepSeek Coder, or Qwen 2.5 Coder
- Math/reasoning — use Mathstral or Qwen 2.5 Math
- Creative writing — use any general chat model
- MMLU/HumanEval benchmarks — these measure general knowledge, not RAG ability
Evaluating Dragon on MMLU is like evaluating a screwdriver by how well it hammers nails. Its value is in context-grounded Q&A accuracy, not general knowledge.
VRAM & Hardware Requirements
Dragon 7B variants have the same VRAM requirements as their base models. The Mistral-based variant is recommended for best quality.
| Quantization | File Size | VRAM (GPU) | RAM (CPU-only) | Quality Impact |
|---|---|---|---|---|
| Q4_K_M (recommended) | ~4.1 GB | ~5 GB | ~6 GB | Minimal loss — best balance |
| Q5_K_M | ~4.8 GB | ~6 GB | ~7 GB | Near-lossless |
| Q8_0 | ~7.2 GB | ~8 GB | ~9 GB | Lossless for most tasks |
| FP16 (full) | ~14.5 GB | ~15 GB | ~16 GB | Full precision |
VRAM estimates for dragon-mistral-7b variant. CPU-only inference is viable for RAG tasks since responses are typically short (1-3 sentences). Speed: ~15-25 tok/s GPU, ~3-8 tok/s CPU.
System Requirements
CPU-Only Is Fine for RAG
Unlike chat models where you need fast token generation for interactive conversations, RAG pipelines typically need short, factual answers (1-3 sentences). At 3-8 tokens/second on CPU, a typical Dragon response completes in 2-5 seconds — perfectly acceptable for document processing pipelines. You do not need a GPU to use Dragon effectively.
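The latency claim above is simple arithmetic, and easy to sanity-check for your own answer lengths. An illustrative helper (not part of any library; the 20-token answer length is an assumption for a one-sentence reply):

```python
def response_latency_s(answer_tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate an answer at a given decode speed."""
    return answer_tokens / tokens_per_second

# A short factual answer (~one sentence) is roughly 20 tokens.
fast_cpu = response_latency_s(20, 8.0)  # upper end of CPU speed
slow_cpu = response_latency_s(20, 3.0)  # lower end of CPU speed
```

At the fast end this is 2.5 seconds per answer; even at 3 tok/s a short answer lands in under 7 seconds, which is why batch document pipelines tolerate CPU-only inference.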
Installation & Setup
Method 1: LLMWare Python Library (Recommended)
The official LLMWare library handles model downloading, prompt formatting, and RAG pipeline setup automatically.
# Install LLMWare
pip install llmware
# Basic Dragon RAG usage
from llmware.models import ModelCatalog
# Load Dragon model (auto-downloads on first use)
model = ModelCatalog().load_model("llmware/dragon-mistral-7b-v0")
# RAG-style query with context
context = """Q3 2024 revenue was $847M, up 12% YoY.
Operating margin improved to 23.1% from 21.4%.
Free cash flow was $198M."""
response = model.inference(
    "What was the operating margin?",
    add_context=context
)
print(response["llm_response"])
# Output: "The operating margin was 23.1%, improved from 21.4%."Method 2: HuggingFace Transformers
Direct usage with the Transformers library for more control.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "llmware/dragon-mistral-7b-v0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
prompt = """<human>: Based on the following context:
The contract expires on December 31, 2025.
Early termination requires 90 days written notice.
Renewal is automatic unless cancelled.
Please answer: What is the early termination requirement?
<bot>:"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Method 3: Ollama (GGUF Import)
Dragon models are not in Ollama's default library. You can import a GGUF quantization manually.
Note: The LLMWare Python library is the recommended approach as it handles prompt formatting correctly for Dragon's RAG-specific template.
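For completeness, a sketch of the manual Ollama import. The GGUF filename below is an assumption (use whatever community quantization you downloaded), and the TEMPLATE is adapted from Dragon's prompt format shown earlier — verify it against the model card before relying on it:

```shell
# Assumes a community GGUF has already been downloaded; filename is illustrative.
cat > Modelfile <<'EOF'
FROM ./dragon-mistral-7b-v0.Q4_K_M.gguf
TEMPLATE """<human>: Based on the following context:
{{ .Prompt }}
<bot>:"""
PARAMETER temperature 0
EOF

ollama create dragon-mistral -f Modelfile
ollama run dragon-mistral
```

Setting `temperature 0` suits factual extraction, where you want deterministic, context-grounded answers rather than varied phrasing.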
Real-World RAG Pipeline Example
Document Q&A Pipeline with LLMWare
A minimal but complete pipeline: load PDFs, chunk text, embed, retrieve, and answer with Dragon.
from llmware.library import Library
from llmware.retrieval import Query
from llmware.models import ModelCatalog
# 1. Create a document library
lib = Library().create_new_library("contracts_q3")
# 2. Ingest documents (PDF, DOCX, TXT supported)
lib.add_files("/path/to/contracts/")
# 3. Run a query with Dragon as the RAG model
query = Query(lib)
results = query.text_query("What are the payment terms?")
# 4. Use Dragon to answer based on retrieved chunks
model = ModelCatalog().load_model("llmware/dragon-mistral-7b-v0")
for result in results[:3]:  # Top 3 matches
    response = model.inference(
        "What are the payment terms?",
        add_context=result["text"]
    )
    print(f"Source: {result['file_source']}")
    print(f"Answer: {response['llm_response']}")
    print("---")
Common use cases:
- Legal documents: contract analysis, clause extraction, compliance checking
- Financial reports: earnings Q&A, metric extraction, filing analysis
- Technical docs: API documentation Q&A, troubleshooting, knowledge bases
Strengths & Limitations
Strengths
- Context grounding: Trained to answer from provided context, reducing hallucination compared to general chat models in RAG setups
- Small footprint: 7B parameters means it runs on consumer hardware, even CPU-only for batch processing
- LLMWare ecosystem: First-class support in the llmware Python library with built-in document parsing, chunking, and retrieval
- Enterprise licensing: Apache 2.0 allows commercial use without restrictions
- Multiple base options: Choose Mistral-based for quality or Yi-based for smaller footprint
Limitations
- Not a general chatbot: Poor at open-ended conversation, creative writing, coding, math — it's a specialist
- Short context (4K-8K): Cannot process full documents at once; requires chunking and retrieval pipeline
- 2023 base models: Built on Llama 2 / Mistral 7B v0.1 era architectures — newer bases have surpassed these
- Limited community GGUF: Not in Ollama's default library; requires manual GGUF import or using LLMWare's library
- No standard benchmarks: LLMWare doesn't publish MMLU/HellaSwag scores because the model isn't designed for those tasks
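The chunking requirement noted in the limitations above is straightforward in principle. A minimal fixed-size chunker with overlap — a sketch of the idea only, since real pipelines (including LLMWare's own parser) chunk on token or sentence boundaries rather than raw characters:

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into overlapping character-based chunks.

    Overlap keeps facts that straddle a boundary retrievable
    from at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "".join(str(i % 10) for i in range(4000))  # stand-in for parsed text
chunks = chunk_text(document, chunk_size=1500, overlap=200)
```

Each chunk stays well inside Dragon's 4K-8K context window, leaving room for the question and the prompt template.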
RAG Model Alternatives in 2026
Honest Assessment (March 2026)
Dragon was released in late 2023. Since then, the RAG landscape has evolved significantly. Newer models with longer context windows (32K-128K) and better instruction following have reduced the need for specialized RAG fine-tunes. Dragon remains a good choice if you're already in the LLMWare ecosystem, but for new projects, consider the alternatives below.
| Model | Size | RAM Required | Speed | Quality | Cost/Month |
|---|---|---|---|---|---|
| Dragon Mistral 7B | 7B | 5-8GB | 15-25 tok/s | 65% | Free |
| Qwen 2.5 7B | 7B | 5-8GB | 20-35 tok/s | 74% | Free |
| Llama 3.2 3B | 3B | 3-4GB | 30-50 tok/s | 63% | Free |
| Mistral 7B v0.3 | 7B | 5-8GB | 20-35 tok/s | 72% | Free |
Quality scores are editorial estimates for RAG/document Q&A use cases specifically, not general benchmarks. Qwen 2.5 and Mistral v0.3 have better base capabilities and longer context, which often matters more than RAG-specific fine-tuning in practice.
When to Still Choose Dragon
Use Dragon if:
- Already using LLMWare's pipeline tools
- Need proven RAG-specific fine-tuning
- Processing short document chunks (under 2K tokens)
- Want minimal hallucination in factual extraction
Consider alternatives if:
- Starting a new project in 2026
- Need longer context (32K+ tokens)
- Want one model for RAG + other tasks
- Need Ollama-native model support
Frequently Asked Questions
Is Dragon 7B a general-purpose AI model?
No. Dragon is specifically fine-tuned for RAG (retrieval-augmented generation) tasks — answering questions based on provided document context. It is not designed for general conversation, code generation, creative writing, or math. For general chat, use models like Mistral 7B, Llama 3, or Qwen 2.5.
Can I run Dragon without a GPU?
Yes. Dragon works well on CPU-only systems because RAG responses are typically short (1-3 sentences). At Q4 quantization, you need about 6GB of system RAM. Inference speed on CPU is around 3-8 tokens/second, which means a typical answer generates in 2-5 seconds — perfectly fine for document processing pipelines.
Which Dragon variant should I choose?
dragon-mistral-7b-v0 is the best overall choice — it inherits Mistral 7B's stronger base capabilities and 8K context window. If you need the smallest possible model, try dragon-yi-6b-v0. All variants use the same RAG fine-tuning dataset.
Is Dragon available on Ollama?
Dragon is not in Ollama's default model library. You can use it via the LLMWare Python library (recommended), HuggingFace Transformers, or by importing a community-made GGUF quantization into Ollama manually using a Modelfile. See the installation section above for all three methods.
How does Dragon compare to just using Mistral 7B for RAG?
Dragon's advantage is that it's specifically trained to answer from context and resist hallucination. A general Mistral 7B will sometimes “fill in” answers from its training data even when the context doesn't support it. However, Mistral v0.3 (2024) has improved instruction following significantly, narrowing this gap. For new projects in 2026, the base model improvements may matter more than Dragon's RAG fine-tuning.
Written by Pattanaik Ramswarup
AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset
I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.