🤖 AI MODEL GUIDE

Dragon 7B by LLMWare

Updated: March 16, 2026

Correction Notice (March 2026)

This page previously contained fabricated general benchmarks (MMLU 78.4%, HumanEval 71.2%) and mischaracterized Dragon as a general-purpose chat model. Dragon is a RAG-specialized model family by LLMWare designed for document Q&A and fact extraction — not general reasoning or coding. This rewrite reflects the model's actual purpose and capabilities.

LLMWare's Dragon model family: small language models fine-tuned specifically for retrieval-augmented generation (RAG), document Q&A, and fact extraction from context.

Not a general-purpose chatbot — a specialized tool for building enterprise document pipelines.

What Is Dragon 7B?

  • 📄 RAG-Optimized: Fine-tuned for document Q&A and context-based answers
  • 🏢 By LLMWare: Enterprise AI company focused on RAG and document intelligence
  • 🔧 Multiple Bases: Available on Mistral 7B, Llama 2 7B, and Yi 6B bases
  • 🔓 Apache 2.0: Open source, commercial-friendly license

Dragon is not a general-purpose chatbot. It's a family of small language models (SLMs) by LLMWare that are fine-tuned on proprietary instruction datasets focused on reading documents, extracting facts, and answering questions based on provided context. Think of it as a local alternative to using GPT-4 or Claude for RAG pipelines — much smaller, much cheaper, but specifically trained for the retrieval-augmented generation workflow.

Dragon Model Family

LLMWare released Dragon as fine-tunes on multiple base models. The “7B” variants are most popular for local deployment. All are on HuggingFace (llmware).

| Model | HuggingFace ID | Base Model | Context | Best For |
|---|---|---|---|---|
| Dragon Mistral 7B | llmware/dragon-mistral-7b-v0 | Mistral 7B | 8K tokens | Best overall RAG quality |
| Dragon Llama 7B | llmware/dragon-llama-7b-v0 | Llama 2 7B | 4K tokens | Broadest compatibility |
| Dragon Yi 6B | llmware/dragon-yi-6b-v0 | Yi 6B | 4K tokens | Smallest variant |
| Dragon Deci 7B | llmware/dragon-deci-7b-v0 | Deci 7B | 8K tokens | Fast inference |

Source: LLMWare HuggingFace Collection. All Dragon models use the same RAG fine-tuning dataset; the base model determines general capability.

Also from LLMWare: SLIM Models

LLMWare also released SLIM (Structured Language Instruction Models) — even smaller models (1-3B parameters) designed for specific tasks like NER, classification, sentiment analysis, and SQL generation. If you need structured output extraction rather than free-form Q&A, SLIM models may be more appropriate than Dragon. See llmware on HuggingFace.

How Dragon Works: RAG Architecture

RAG Fine-Tuning: Unlike general chat models trained on broad instruction-following, Dragon was fine-tuned on one specific pattern: given a context passage and a question, produce a concise, factual answer grounded in that context. This makes it more resistant to hallucination than general models when used in RAG pipelines.

Input Format: Dragon expects a specific prompt structure with context and question fields. The LLMWare library handles this formatting automatically, but if using raw inference you need to follow the template.

Dragon Prompt Template:

<human>: Based on the following context:

{retrieved_document_text}

Please answer the question: {user_question}

<bot>:

The key advantage: Dragon is trained to say “not enough information” when the context doesn't contain the answer, rather than hallucinating. This is critical for enterprise RAG where accuracy matters more than creativity.
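If you format prompts by hand rather than through the llmware library, a small helper can assemble the template shown above. The function name and example strings here are illustrative, not part of any official LLMWare API:

```python
def build_dragon_prompt(context: str, question: str) -> str:
    """Assemble Dragon's RAG prompt template (hypothetical helper,
    mirroring the <human>/<bot> template shown above)."""
    return (
        f"<human>: Based on the following context:\n\n"
        f"{context}\n\n"
        f"Please answer the question: {question}\n\n"
        f"<bot>:"
    )

prompt = build_dragon_prompt(
    "The invoice is due within 30 days of receipt.",
    "When is the invoice due?",
)
```

The trailing `<bot>:` with no newline after it matters: the model continues generation from that marker, so anything appended after it becomes part of the answer.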

What Dragon Is NOT

Dragon is not designed for and should not be evaluated on:

  • General chat/conversation — use Mistral 7B, Llama 3, or Qwen 2.5 instead
  • Code generation — use CodeLlama, DeepSeek Coder, or Qwen 2.5 Coder
  • Math/reasoning — use Mathstral or Qwen 2.5 Math
  • Creative writing — use any general chat model
  • MMLU/HumanEval benchmarks — these measure general knowledge, not RAG ability

Evaluating Dragon on MMLU is like evaluating a screwdriver by how well it hammers nails. Its value is in context-grounded Q&A accuracy, not general knowledge.

VRAM & Hardware Requirements

Dragon 7B variants have the same VRAM requirements as their base models. The Mistral-based variant is recommended for best quality.

| Quantization | File Size | VRAM (GPU) | RAM (CPU-only) | Quality Impact |
|---|---|---|---|---|
| Q4_K_M (recommended) | ~4.1 GB | ~5 GB | ~6 GB | Minimal loss; best balance |
| Q5_K_M | ~4.8 GB | ~6 GB | ~7 GB | Near-lossless |
| Q8_0 | ~7.2 GB | ~8 GB | ~9 GB | Lossless for most tasks |
| FP16 (full) | ~14.5 GB | ~15 GB | ~16 GB | Full precision |

VRAM estimates for dragon-mistral-7b variant. CPU-only inference is viable for RAG tasks since responses are typically short (1-3 sentences). Speed: ~15-25 tok/s GPU, ~3-8 tok/s CPU.

System Requirements

  • Operating System: Ubuntu 20.04+, macOS Monterey+, Windows 11
  • RAM: 8GB minimum (16GB recommended)
  • Storage: 8GB SSD
  • GPU: Optional; runs well on CPU for RAG tasks
  • CPU: 4+ cores recommended

CPU-Only Is Fine for RAG

Unlike chat models where you need fast token generation for interactive conversations, RAG pipelines typically need short, factual answers (1-3 sentences). At 3-8 tokens/second on CPU, a typical Dragon response completes in 2-5 seconds — perfectly acceptable for document processing pipelines. You do not need a GPU to use Dragon effectively.
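The latency claim above is simple arithmetic: answer length divided by throughput. A quick sanity check, assuming a short extraction answer of roughly 15 tokens (the token count is an editorial estimate, not a measured figure):

```python
def response_seconds(answer_tokens: int, tokens_per_second: float) -> float:
    """Estimated wall-clock time to generate a completion of the given length."""
    return answer_tokens / tokens_per_second

# ~15-token answer at the CPU speeds quoted in this guide (3-8 tok/s)
worst_case = response_seconds(answer_tokens=15, tokens_per_second=3)  # 5.0 s
best_case = response_seconds(answer_tokens=15, tokens_per_second=8)   # 1.875 s
```

That lands in the 2-5 second range quoted above; longer answers or slower CPUs scale linearly.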

Installation & Setup

Method 1: LLMWare Python Library (Recommended)

The official LLMWare library handles model downloading, prompt formatting, and RAG pipeline setup automatically.

# Install LLMWare

pip install llmware

# Basic Dragon RAG usage

from llmware.models import ModelCatalog

# Load Dragon model (auto-downloads on first use)
model = ModelCatalog().load_model("llmware/dragon-mistral-7b-v0")

# RAG-style query with context
context = """Q3 2024 revenue was $847M, up 12% YoY.
Operating margin improved to 23.1% from 21.4%.
Free cash flow was $198M."""

response = model.inference(
    "What was the operating margin?",
    add_context=context
)
print(response["llm_response"])
# Output: "The operating margin was 23.1%, improved from 21.4%."

Method 2: HuggingFace Transformers

Direct usage with the Transformers library for more control.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "llmware/dragon-mistral-7b-v0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

prompt = """<human>: Based on the following context:

The contract expires on December 31, 2025.
Early termination requires 90 days written notice.
Renewal is automatic unless cancelled.

Please answer: What is the early termination requirement?

<bot>:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
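Note that `tokenizer.decode` returns the prompt plus the completion, so the printed string includes the full `<human>:` context. A small post-processing step, splitting on the final `<bot>:` marker, isolates just the answer. This is an illustrative sketch of one way to post-process, not an official LLMWare utility:

```python
def extract_answer(decoded: str) -> str:
    """Return only the text generated after the final '<bot>:' marker
    (post-processing sketch for Dragon's prompt template)."""
    return decoded.rsplit("<bot>:", 1)[-1].strip()

answer = extract_answer(
    "<human>: Based on the following context: ... <bot>: 90 days written notice."
)
# answer == "90 days written notice."
```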

Method 3: Ollama (GGUF Import)

Dragon models are not in Ollama's default library. You can import a GGUF quantization manually.

Terminal
$ curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama...
>>> ollama installed successfully
$ # Download a GGUF from HuggingFace (community quantization), then create a Modelfile:
$ echo 'FROM ./dragon-mistral-7b-v0.Q4_K_M.gguf' > Modelfile
$ ollama create dragon-mistral -f Modelfile
transferring model data
creating model layer
writing manifest
success
$ ollama run dragon-mistral
>>> Send a message (/? for help)

Note: The LLMWare Python library is the recommended approach as it handles prompt formatting correctly for Dragon's RAG-specific template.

Real-World RAG Pipeline Example

Document Q&A Pipeline with LLMWare

A minimal but complete pipeline: load PDFs, chunk text, embed, retrieve, and answer with Dragon.

from llmware.library import Library
from llmware.retrieval import Query
from llmware.models import ModelCatalog

# 1. Create a document library
lib = Library().create_new_library("contracts_q3")

# 2. Ingest documents (PDF, DOCX, TXT supported)
lib.add_files("/path/to/contracts/")

# 3. Run a query with Dragon as the RAG model
query = Query(lib)
results = query.text_query("What are the payment terms?")

# 4. Use Dragon to answer based on retrieved chunks
model = ModelCatalog().load_model("llmware/dragon-mistral-7b-v0")

for result in results[:3]:  # Top 3 matches
    response = model.inference(
        "What are the payment terms?",
        add_context=result["text"]
    )
    print(f"Source: {result['file_source']}")
    print(f"Answer: {response['llm_response']}")
    print("---")

Legal Documents

Contract analysis, clause extraction, compliance checking

Financial Reports

Earnings Q&A, metric extraction, filing analysis

Technical Docs

API documentation Q&A, troubleshooting, knowledge bases

Strengths & Limitations

Strengths

  • Context grounding: Trained to answer from provided context, reducing hallucination compared to general chat models in RAG setups
  • Small footprint: 7B parameters means it runs on consumer hardware, even CPU-only for batch processing
  • LLMWare ecosystem: First-class support in the llmware Python library with built-in document parsing, chunking, and retrieval
  • Enterprise licensing: Apache 2.0 allows commercial use without restrictions
  • Multiple base options: Choose Mistral-based for quality or Yi-based for smaller footprint

Limitations

  • Not a general chatbot: Poor at open-ended conversation, creative writing, coding, math — it's a specialist
  • Short context (4K-8K): Cannot process full documents at once; requires chunking and retrieval pipeline
  • 2023 base models: Built on Llama 2 / Mistral 7B v0.1 era architectures — newer bases have surpassed these
  • Limited community GGUF: Not in Ollama's default library; requires manual GGUF import or using LLMWare's library
  • No standard benchmarks: LLMWare doesn't publish MMLU/HellaSwag scores because the model isn't designed for those tasks
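Because of the 4K-8K context limit, documents must be split into chunks before retrieval. The llmware library handles this automatically; purely as an illustration of what chunking means, a naive overlapping word-window splitter might look like this (the window size and overlap are arbitrary choices for the sketch, not LLMWare defaults):

```python
def chunk_words(text: str, max_words: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-window chunks (illustrative only).
    Overlap keeps facts that straddle a boundary visible in both chunks."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Production pipelines usually chunk on structural boundaries (paragraphs, sections, pages) rather than raw word counts, which is what llmware's parsers do during `add_files`.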

RAG Model Alternatives in 2026

Honest Assessment (March 2026)

Dragon was released in late 2023. Since then, the RAG landscape has evolved significantly. Newer models with longer context windows (32K-128K) and better instruction following have reduced the need for specialized RAG fine-tunes. Dragon remains a good choice if you're already in the LLMWare ecosystem, but for new projects, consider the alternatives below.

| Model | Size | RAM Required | Speed | Quality | Cost/Month |
|---|---|---|---|---|---|
| Dragon Mistral 7B | 7B | 5-8GB | 15-25 tok/s | 65% | Free |
| Qwen 2.5 7B | 7B | 5-8GB | 20-35 tok/s | 74% | Free |
| Llama 3.2 3B | 3B | 3-4GB | 30-50 tok/s | 63% | Free |
| Mistral 7B v0.3 | 7B | 5-8GB | 20-35 tok/s | 72% | Free |

Quality scores are editorial estimates for RAG/document Q&A use cases specifically, not general benchmarks. Qwen 2.5 and Mistral v0.3 have better base capabilities and longer context, which often matters more than RAG-specific fine-tuning in practice.

When to Still Choose Dragon

Use Dragon if:

  • Already using LLMWare's pipeline tools
  • Need proven RAG-specific fine-tuning
  • Processing short document chunks (under 2K tokens)
  • Want minimal hallucination in factual extraction

Consider alternatives if:

  • Starting a new project in 2026
  • Need longer context (32K+ tokens)
  • Want one model for RAG + other tasks
  • Need Ollama-native model support

Frequently Asked Questions

Is Dragon 7B a general-purpose AI model?

No. Dragon is specifically fine-tuned for RAG (retrieval-augmented generation) tasks — answering questions based on provided document context. It is not designed for general conversation, code generation, creative writing, or math. For general chat, use models like Mistral 7B, Llama 3, or Qwen 2.5.

Can I run Dragon without a GPU?

Yes. Dragon works well on CPU-only systems because RAG responses are typically short (1-3 sentences). At Q4 quantization, you need about 6GB of system RAM. Inference speed on CPU is around 3-8 tokens/second, which means a typical answer generates in 2-5 seconds — perfectly fine for document processing pipelines.

Which Dragon variant should I choose?

dragon-mistral-7b-v0 is the best overall choice — it inherits Mistral 7B's stronger base capabilities and 8K context window. If you need the smallest possible model, try dragon-yi-6b-v0. All variants use the same RAG fine-tuning dataset.

Is Dragon available on Ollama?

Dragon is not in Ollama's default model library. You can use it via the LLMWare Python library (recommended), HuggingFace Transformers, or by importing a community-made GGUF quantization into Ollama manually using a Modelfile. See the installation section above for all three methods.

How does Dragon compare to just using Mistral 7B for RAG?

Dragon's advantage is that it's specifically trained to answer from context and resist hallucination. A general Mistral 7B will sometimes “fill in” answers from its training data even when the context doesn't support it. However, Mistral v0.3 (2024) has improved instruction following significantly, narrowing this gap. For new projects in 2026, the base model improvements may matter more than Dragon's RAG fine-tuning.



Written by Pattanaik Ramswarup

📅 Published: October 28, 2025 · 🔄 Last Updated: March 16, 2026