Dragon 7B by LLMWare
Updated: March 16, 2026
Correction Notice (March 2026)
This page previously contained fabricated general benchmarks (MMLU 78.4%, HumanEval 71.2%) and mischaracterized Dragon as a general-purpose chat model. Dragon is a RAG-specialized model family by LLMWare designed for document Q&A and fact extraction — not general reasoning or coding. This rewrite reflects the model's actual purpose and capabilities.
LLMWare's Dragon model family: small language models fine-tuned specifically for retrieval-augmented generation (RAG), document Q&A, and fact extraction from context.
Not a general-purpose chatbot — a specialized tool for building enterprise document pipelines.
What Is Dragon 7B?
- RAG-Optimized: fine-tuned for document Q&A and context-based answers
- By LLMWare: enterprise AI company focused on RAG and document intelligence
- Multiple Bases: available on Mistral 7B, Llama 2 7B, and Yi 6B bases
- Apache 2.0: open source, commercial-friendly license
Dragon is not a general-purpose chatbot. It's a family of small language models (SLMs) by LLMWare that are fine-tuned on proprietary instruction datasets focused on reading documents, extracting facts, and answering questions based on provided context. Think of it as a local alternative to using GPT-4 or Claude for RAG pipelines — much smaller, much cheaper, but specifically trained for the retrieval-augmented generation workflow.
Dragon Model Family
LLMWare released Dragon as fine-tunes on multiple base models. The “7B” variants are the most popular for local deployment. All are available on Hugging Face under the llmware organization.
| Model | HuggingFace ID | Base Model | Context | Best For |
|---|---|---|---|---|
| Dragon Mistral 7B | llmware/dragon-mistral-7b-v0 | Mistral 7B | 8K tokens | Best overall RAG quality |
| Dragon Llama 7B | llmware/dragon-llama-7b-v0 | Llama 2 7B | 4K tokens | Broadest compatibility |
| Dragon Yi 6B | llmware/dragon-yi-6b-v0 | Yi 6B | 4K tokens | Smallest variant |
| Dragon Deci 7B | llmware/dragon-deci-7b-v0 | Deci 7B | 8K tokens | Fast inference |
Source: LLMWare HuggingFace Collection. All Dragon models use the same RAG fine-tuning dataset; the base model determines general capability.
Also from LLMWare: SLIM Models
LLMWare also released SLIM (Structured Language Instruction Models) — even smaller models (1-3B parameters) designed for specific tasks like NER, classification, sentiment analysis, and SQL generation. If you need structured output extraction rather than free-form Q&A, SLIM models may be more appropriate than Dragon. See llmware on HuggingFace.
How Dragon Works: RAG Architecture
RAG Fine-Tuning: Unlike general chat models trained on broad instruction-following, Dragon was fine-tuned specifically on a pattern: given a context passage and a question, produce a concise, factual answer grounded in the context. This makes it resist hallucination better than general models when used in RAG pipelines.
Input Format: Dragon expects a specific prompt structure with context and question fields. The LLMWare library handles this formatting automatically, but if using raw inference you need to follow the template.
Dragon Prompt Template:
<human>: Based on the following context:
{retrieved_document_text}
Please answer the question: {user_question}
<bot>:

The key advantage: Dragon is trained to say “not enough information” when the context doesn't contain the answer, rather than hallucinating. This is critical for enterprise RAG, where accuracy matters more than creativity.
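If you are calling the model outside the LLMWare library, a small helper can assemble this template. A minimal sketch — the function name is my own, and the LLMWare library does this formatting for you automatically:

```python
def build_dragon_prompt(context: str, question: str) -> str:
    """Assemble Dragon's expected <human>/<bot> RAG prompt.

    Hypothetical helper: only needed for raw inference
    (e.g. Transformers or llama.cpp), since the llmware
    library applies the template itself.
    """
    return (
        "<human>: Based on the following context:\n"
        f"{context}\n"
        f"Please answer the question: {question}\n"
        "<bot>:"
    )

prompt = build_dragon_prompt(
    "The contract expires on December 31, 2025.",
    "When does the contract expire?",
)
```

The answer text the model generates follows directly after the `<bot>:` marker.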
What Dragon Is NOT
Dragon is not designed for and should not be evaluated on:
- General chat/conversation — use Mistral 7B, Llama 3, or Qwen 2.5 instead
- Code generation — use CodeLlama, DeepSeek Coder, or Qwen 2.5 Coder
- Math/reasoning — use Mathstral or Qwen 2.5 Math
- Creative writing — use any general chat model
- MMLU/HumanEval benchmarks — these measure general knowledge, not RAG ability
Evaluating Dragon on MMLU is like evaluating a screwdriver by how well it hammers nails. Its value is in context-grounded Q&A accuracy, not general knowledge.
VRAM & Hardware Requirements
Dragon 7B variants have the same VRAM requirements as their base models. The Mistral-based variant is recommended for best quality.
| Quantization | File Size | VRAM (GPU) | RAM (CPU-only) | Quality Impact |
|---|---|---|---|---|
| Q4_K_M (recommended) | ~4.1 GB | ~5 GB | ~6 GB | Minimal loss — best balance |
| Q5_K_M | ~4.8 GB | ~6 GB | ~7 GB | Near-lossless |
| Q8_0 | ~7.2 GB | ~8 GB | ~9 GB | Lossless for most tasks |
| FP16 (full) | ~14.5 GB | ~15 GB | ~16 GB | Full precision |
VRAM estimates for dragon-mistral-7b variant. CPU-only inference is viable for RAG tasks since responses are typically short (1-3 sentences). Speed: ~15-25 tok/s GPU, ~3-8 tok/s CPU.
System Requirements
CPU-Only Is Fine for RAG
Unlike chat models where you need fast token generation for interactive conversations, RAG pipelines typically need short, factual answers (1-3 sentences). At 3-8 tokens/second on CPU, a typical Dragon response completes in 2-5 seconds — perfectly acceptable for document processing pipelines. You do not need a GPU to use Dragon effectively.
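The latency claim above is simple arithmetic, and easy to sanity-check for your own answer lengths. An illustrative helper (not part of any library; the 20-token answer length is an assumption for a one-sentence reply):

```python
def response_latency_s(answer_tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate an answer at a given decode speed."""
    return answer_tokens / tokens_per_second

# A short factual answer (~one sentence) is roughly 20 tokens.
fast_cpu = response_latency_s(20, 8.0)  # upper end of CPU speed
slow_cpu = response_latency_s(20, 3.0)  # lower end of CPU speed
```

At the fast end this is 2.5 seconds per answer; even at 3 tok/s a short answer lands in under 7 seconds, which is why batch document pipelines tolerate CPU-only inference.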
Installation & Setup
Method 1: LLMWare Python Library (Recommended)
The official LLMWare library handles model downloading, prompt formatting, and RAG pipeline setup automatically.
# Install LLMWare
pip install llmware
# Basic Dragon RAG usage
from llmware.models import ModelCatalog
# Load Dragon model (auto-downloads on first use)
model = ModelCatalog().load_model("llmware/dragon-mistral-7b-v0")
# RAG-style query with context
context = """Q3 2024 revenue was $847M, up 12% YoY.
Operating margin improved to 23.1% from 21.4%.
Free cash flow was $198M."""
response = model.inference(
    "What was the operating margin?",
    add_context=context
)
print(response["llm_response"])
# Output: "The operating margin was 23.1%, improved from 21.4%."Method 2: HuggingFace Transformers
Direct usage with the Transformers library for more control.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "llmware/dragon-mistral-7b-v0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
prompt = """<human>: Based on the following context:
The contract expires on December 31, 2025.
Early termination requires 90 days written notice.
Renewal is automatic unless cancelled.
Please answer: What is the early termination requirement?
<bot>:"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Method 3: Ollama (GGUF Import)
Dragon models are not in Ollama's default library. You can import a GGUF quantization manually.
Note: The LLMWare Python library is the recommended approach as it handles prompt formatting correctly for Dragon's RAG-specific template.
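For completeness, a sketch of the manual Ollama import. The GGUF filename below is an assumption (use whatever community quantization you downloaded), and the TEMPLATE is adapted from Dragon's prompt format shown earlier — verify it against the model card before relying on it:

```shell
# Assumes a community GGUF has already been downloaded; filename is illustrative.
cat > Modelfile <<'EOF'
FROM ./dragon-mistral-7b-v0.Q4_K_M.gguf
TEMPLATE """<human>: Based on the following context:
{{ .Prompt }}
<bot>:"""
PARAMETER temperature 0
EOF

ollama create dragon-mistral -f Modelfile
ollama run dragon-mistral
```

Setting `temperature 0` suits factual extraction, where you want deterministic, context-grounded answers rather than varied phrasing.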
Real-World RAG Pipeline Example
Document Q&A Pipeline with LLMWare
A minimal but complete pipeline: load PDFs, chunk text, embed, retrieve, and answer with Dragon.
from llmware.library import Library
from llmware.retrieval import Query
from llmware.models import ModelCatalog
# 1. Create a document library
lib = Library().create_new_library("contracts_q3")
# 2. Ingest documents (PDF, DOCX, TXT supported)
lib.add_files("/path/to/contracts/")
# 3. Run a query with Dragon as the RAG model
query = Query(lib)
results = query.text_query("What are the payment terms?")
# 4. Use Dragon to answer based on retrieved chunks
model = ModelCatalog().load_model("llmware/dragon-mistral-7b-v0")
for result in results[:3]:  # Top 3 matches
    response = model.inference(
        "What are the payment terms?",
        add_context=result["text"]
    )
    print(f"Source: {result['file_source']}")
    print(f"Answer: {response['llm_response']}")
    print("---")
Common use cases:
- Legal documents: contract analysis, clause extraction, compliance checking
- Financial reports: earnings Q&A, metric extraction, filing analysis
- Technical docs: API documentation Q&A, troubleshooting, knowledge bases
Strengths & Limitations
Strengths
- Context grounding: Trained to answer from provided context, reducing hallucination compared to general chat models in RAG setups
- Small footprint: 7B parameters means it runs on consumer hardware, even CPU-only for batch processing
- LLMWare ecosystem: First-class support in the llmware Python library with built-in document parsing, chunking, and retrieval
- Enterprise licensing: Apache 2.0 allows commercial use without restrictions
- Multiple base options: Choose Mistral-based for quality or Yi-based for smaller footprint
Limitations
- Not a general chatbot: Poor at open-ended conversation, creative writing, coding, math — it's a specialist
- Short context (4K-8K): Cannot process full documents at once; requires chunking and retrieval pipeline
- 2023 base models: Built on Llama 2 / Mistral 7B v0.1 era architectures — newer bases have surpassed these
- Limited community GGUF: Not in Ollama's default library; requires manual GGUF import or using LLMWare's library
- No standard benchmarks: LLMWare doesn't publish MMLU/HellaSwag scores because the model isn't designed for those tasks
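The chunking requirement noted in the limitations above is straightforward in principle. A minimal fixed-size chunker with overlap — a sketch of the idea only, since real pipelines (including LLMWare's own parser) chunk on token or sentence boundaries rather than raw characters:

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into overlapping character-based chunks.

    Overlap keeps facts that straddle a boundary retrievable
    from at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "".join(str(i % 10) for i in range(4000))  # stand-in for parsed text
chunks = chunk_text(document, chunk_size=1500, overlap=200)
```

Each chunk stays well inside Dragon's 4K-8K context window, leaving room for the question and the prompt template.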
RAG Model Alternatives in 2026
Honest Assessment (March 2026)
Dragon was released in late 2023. Since then, the RAG landscape has evolved significantly. Newer models with longer context windows (32K-128K) and better instruction following have reduced the need for specialized RAG fine-tunes. Dragon remains a good choice if you're already in the LLMWare ecosystem, but for new projects, consider the alternatives below.
| Model | Size | RAM Required | Speed | Quality | Cost/Month |
|---|---|---|---|---|---|
| Dragon Mistral 7B | 7B | 5-8GB | 15-25 tok/s | 65% | Free |
| Qwen 2.5 7B | 7B | 5-8GB | 20-35 tok/s | 74% | Free |
| Llama 3.2 3B | 3B | 3-4GB | 30-50 tok/s | 63% | Free |
| Mistral 7B v0.3 | 7B | 5-8GB | 20-35 tok/s | 72% | Free |
Quality scores are editorial estimates for RAG/document Q&A use cases specifically, not general benchmarks. Qwen 2.5 and Mistral v0.3 have better base capabilities and longer context, which often matters more than RAG-specific fine-tuning in practice.
When to Still Choose Dragon
Use Dragon if:
- Already using LLMWare's pipeline tools
- Need proven RAG-specific fine-tuning
- Processing short document chunks (under 2K tokens)
- Want minimal hallucination in factual extraction
Consider alternatives if:
- Starting a new project in 2026
- Need longer context (32K+ tokens)
- Want one model for RAG + other tasks
- Need Ollama-native model support
Frequently Asked Questions
Is Dragon 7B a general-purpose AI model?
No. Dragon is specifically fine-tuned for RAG (retrieval-augmented generation) tasks — answering questions based on provided document context. It is not designed for general conversation, code generation, creative writing, or math. For general chat, use models like Mistral 7B, Llama 3, or Qwen 2.5.
Can I run Dragon without a GPU?
Yes. Dragon works well on CPU-only systems because RAG responses are typically short (1-3 sentences). At Q4 quantization, you need about 6GB of system RAM. Inference speed on CPU is around 3-8 tokens/second, which means a typical answer generates in 2-5 seconds — perfectly fine for document processing pipelines.
Which Dragon variant should I choose?
dragon-mistral-7b-v0 is the best overall choice — it inherits Mistral 7B's stronger base capabilities and 8K context window. If you need the smallest possible model, try dragon-yi-6b-v0. All variants use the same RAG fine-tuning dataset.
Is Dragon available on Ollama?
Dragon is not in Ollama's default model library. You can use it via the LLMWare Python library (recommended), HuggingFace Transformers, or by importing a community-made GGUF quantization into Ollama manually using a Modelfile. See the installation section above for all three methods.
How does Dragon compare to just using Mistral 7B for RAG?
Dragon's advantage is that it's specifically trained to answer from context and resist hallucination. A general Mistral 7B will sometimes “fill in” answers from its training data even when the context doesn't support it. However, Mistral v0.3 (2024) has improved instruction following significantly, narrowing this gap. For new projects in 2026, the base model improvements may matter more than Dragon's RAG fine-tuning.
Written by Pattanaik Ramswarup
AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset
I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.