Developer Reference

Ollama Modelfile Mastery: Custom Prompts, Parameters, and Templates

April 23, 2026
18 min read
LocalAimaster Research Team


The Modelfile is the most under-documented part of Ollama. Most users learn ollama run llama3.1 and never look back — which is fine, until you want to ship a coding assistant tuned to your house style, or import a fine-tuned GGUF, or pin a 16k context window for RAG without setting it on every API call. The Modelfile is the answer to all of that. Twenty lines of plain text and you have a custom model variant that behaves exactly the way your team needs.

This is the reference I wish existed when I first wrote one. Every directive, every parameter that actually matters, real recipes for real workflows, and the gotchas that ate hours of my time. Tested on Ollama 0.5.7, April 2026.

Quick Start: Your First Modelfile in 90 Seconds

Save as Modelfile:

FROM llama3.1:8b
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM """
You are a senior Python engineer. Always include working code examples.
Prefer standard library over dependencies. Be concise.
"""

Build and run:

ollama create py-coach -f Modelfile
ollama run py-coach "How do I parse CSV files with type-checked rows?"

You now have a custom model named py-coach that ships with the system prompt baked in, low temperature for focused output, and an 8k context. Anyone on your team can ollama pull it (if you push it) and get the exact same behavior. That is the entire value proposition.
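
The baked-in behavior also applies over the HTTP API, not just the CLI. A minimal check, assuming Ollama is serving on the default port 11434:

curl http://localhost:11434/api/generate -d '{
  "model": "py-coach",
  "prompt": "How do I parse CSV files with type-checked rows?",
  "stream": false
}'

The system prompt, temperature, and num_ctx from the Modelfile apply without the caller passing anything.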

Table of Contents

  1. What a Modelfile Actually Does
  2. Modelfile Syntax Reference
  3. Every PARAMETER Explained
  4. SYSTEM Prompt Patterns
  5. TEMPLATE: Chat Format Mastery
  6. Importing a GGUF File
  7. Adding LoRA Adapters
  8. Real Recipes for Real Use Cases
  9. Sharing and Publishing Modelfiles
  10. Common Pitfalls
  11. FAQs

What a Modelfile Actually Does {#what-it-does}

A Modelfile is to Ollama what a Dockerfile is to Docker. It defines a layered, reproducible model artifact:

  • Base model (FROM) — what to start with.
  • Parameter overrides (PARAMETER) — context size, temperature, GPU offload, etc.
  • System prompt (SYSTEM) — instructions baked into every chat.
  • Chat template (TEMPLATE) — how messages are formatted before tokenization.
  • Adapters (ADAPTER) — fine-tuning weights layered on top of the base.
  • License and metadata (LICENSE, MESSAGE) — provenance and example interactions.

When you run ollama create, Ollama builds a new model layer that combines all of this. The base weights are not duplicated — only the diff (your overrides) is stored — so a custom variant of a 7B model adds maybe 5 KB of disk.

The reason this matters: every team has a "house version" of common models. The marketing team wants a writing assistant with brand voice in the system prompt. Engineering wants a coding model with their style guide. Support wants a chatbot with hard guardrails. Without Modelfiles, every team wires its own prompt layer into application code. With Modelfiles, you ship one artifact that works the same way from cURL, the Ollama CLI, LangChain, Continue.dev, and any other client.

For the broader ecosystem context, our complete Ollama guide is the right starting point if you are new, and our best Ollama models roundup helps you pick the right FROM.


Modelfile Syntax Reference {#syntax}

Modelfile is a simple line-based DSL. Each instruction is a single line (or a triple-quoted block for multi-line strings). Comments start with #.

# This is a comment

FROM llama3.1:8b                      # base model (required)

PARAMETER num_ctx 8192                # parameter override
PARAMETER temperature 0.7
PARAMETER stop "<|eot_id|>"

SYSTEM """                            # multi-line system prompt
You are a helpful coding assistant.
Be concise. Always include working examples.
"""

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

"""

ADAPTER ./my-lora.gguf                # optional LoRA adapter
LICENSE """MIT License..."""          # optional license text
MESSAGE user "Show me a hello world"  # optional example messages
MESSAGE assistant "print(\"hello\")"

The instructions can appear in any order, but convention is FROM → PARAMETER → SYSTEM → TEMPLATE → ADAPTER → LICENSE → MESSAGE.

Build, inspect, push

# Build
ollama create my-model -f Modelfile

# See what was generated
ollama show my-model --modelfile

# Inspect parameters
ollama show my-model --parameters

# Inspect template
ollama show my-model --template

# Push to ollama.com (requires login)
ollama push username/my-model

# Pull from registry
ollama pull username/my-model

ollama show --modelfile is the secret weapon. It dumps the effective Modelfile of any installed model — even base models — so you can see exactly what TEMPLATE and PARAMETER values it ships with. Steal them as starting points for your own variants.
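
A workflow that follows from this: dump a base model's effective Modelfile to a file and edit it, rather than writing a TEMPLATE from scratch (the base model here is just an example):

ollama show llama3.1:8b --modelfile > Modelfile
# edit FROM / SYSTEM / PARAMETER lines, keep the TEMPLATE as-is
ollama create my-variant -f Modelfile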


Every PARAMETER Explained {#parameters}

The full list of PARAMETER directives, what they actually do, and where they matter.

Context and output size

Parameter | Default | Range | What it controls
num_ctx | 2048 | 512–131072 | Context window in tokens. Critical for RAG and long docs.
num_predict | -1 (∞) | 1–∞ | Max tokens generated per response. -1 means until the model emits a stop or hits num_ctx.
num_keep | 4 | 0–num_ctx | Tokens kept from the beginning when the context overflows.

Practical note: most base models in Ollama ship with num_ctx=2048. That is far below what the model actually supports. Llama 3.1 supports 128k. Qwen 2.5 supports 32k natively. If you are doing RAG or long-doc summarization, override with PARAMETER num_ctx 8192 (or higher) — otherwise your retrieved chunks get silently truncated.
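
You can also override num_ctx per request via the API's options field, but that pushes the responsibility onto every caller, which is exactly what baking it into the Modelfile avoids. For comparison, the per-call version (assuming the default port):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "<your prompt plus retrieved chunks>",
  "options": { "num_ctx": 16384 },
  "stream": false
}'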

Sampling and creativity

Parameter | Default | Range | What it controls
temperature | 0.8 | 0.0–2.0 | Randomness. 0 = deterministic, 1 = balanced, 2 = chaos.
top_k | 40 | 0–100 | Sample from the top-k tokens. 0 disables. Lower = more focused.
top_p | 0.9 | 0.0–1.0 | Nucleus sampling: sample from the smallest set of tokens whose cumulative probability reaches top_p.
min_p | 0.05 | 0.0–1.0 | Newer alternative to top_p. Drop tokens with prob < min_p × max_prob.
tfs_z | 1.0 | 1.0–∞ | Tail-free sampling. Higher = more aggressive tail filtering; 1.0 disables.
typical_p | 1.0 | 0.0–1.0 | Locally typical sampling.
repeat_penalty | 1.1 | 0.0–2.0 | Penalize repeating tokens. 1.0 = off, >1.0 reduces repetition.
repeat_last_n | 64 | 0–num_ctx | Window over which repeat_penalty applies.
presence_penalty | 0.0 | -2.0–2.0 | Penalty for any token already present.
frequency_penalty | 0.0 | -2.0–2.0 | Penalty proportional to token frequency.
mirostat | 0 | 0/1/2 | Mirostat sampling. 0 = off, 1 = v1, 2 = v2. Trades diversity for stability.
mirostat_eta | 0.1 | 0.0–1.0 | Mirostat learning rate.
mirostat_tau | 5.0 | 0.0–10.0 | Mirostat target entropy.
seed | 0 | int | Random seed. Set non-zero for reproducible output.

My defaults by use case:

  • Coding / structured output: temperature 0.1, top_p 0.9, repeat_penalty 1.05
  • Conversational chat: temperature 0.7, top_p 0.9, repeat_penalty 1.1
  • Creative writing: temperature 0.95, top_p 0.95, repeat_penalty 1.15
  • RAG (factual): temperature 0.2, top_p 0.85, repeat_penalty 1.05

Stopping

Parameter | Default | What it does
stop | model-dependent | Stop generation when this string is emitted. Repeat the directive for multiple stop strings.
PARAMETER stop "<|eot_id|>"
PARAMETER stop "User:"
PARAMETER stop "</s>"

Hardware and runtime

Parameter | Default | Range | What it controls
num_gpu | -1 (auto) | 0–999 | Layers offloaded to GPU. 999 = max. 0 = CPU only.
num_thread | auto | 1–CPU cores | CPU threads for prompt processing.
num_batch | 512 | 1–num_ctx | Batch size for prompt processing.
f16_kv | true | bool | Use FP16 for the KV cache. Halves VRAM vs FP32.
use_mmap | true | bool | Memory-map the model file. Disable for read-once workloads.
use_mlock | false | bool | Lock the model in RAM so it is never swapped out; helps when memory pressure would otherwise page the model to disk.
numa | false | bool | NUMA-aware memory allocation. For multi-socket servers.
vocab_only | false | bool | Load only the tokenizer. Mostly a diagnostic.

The big knob most people miss: num_gpu. On a Mac with unified memory or a Linux box where Ollama auto-detects correctly, leave it. On systems where Ollama is offloading too few layers (you see "X/Y layers offloaded to GPU" in logs and X is suspiciously low), force-set num_gpu 999 — Ollama silently respects available VRAM and offloads as many as fit.
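
A quick way to sanity-check offload without digging through logs is ollama ps while a model is loaded; the PROCESSOR column reports how much of the model sits on GPU vs CPU (exact column names may vary by version):

# in one terminal: load the model
ollama run my-model
# in a second terminal, while it is loaded
ollama ps    # PROCESSOR shows the split, e.g. "100% GPU" or "22%/78% CPU/GPU"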


SYSTEM Prompt Patterns {#system-prompts}

The SYSTEM directive bakes a system prompt into the model. Patterns that work, ranked by usefulness:

Pattern 1: Persona + format constraints

SYSTEM """
You are a senior backend engineer specializing in Python and PostgreSQL.
- Always include working code examples.
- Prefer standard library; only suggest dependencies when justified.
- Use type hints in all function signatures.
- Be concise. No unnecessary preamble.
"""

This is the workhorse pattern. The persona narrows the model's response style, and the bullets enforce concrete output rules.

Pattern 2: Hard guardrails for support bots

SYSTEM """
You are a customer support agent for ACME Corp.
NEVER discuss competitor products by name.
NEVER promise refunds — direct refund requests to support@acme.com.
NEVER answer questions outside ACME products and account help.
If asked something outside scope, respond: "I can only help with ACME product and account questions. For [topic], please contact [appropriate channel]."
"""

The "NEVER" lines work surprisingly well. Pair with a stop parameter that catches escape attempts.
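
A minimal sketch of that pairing; the stop strings here are illustrative, so pick ones that match the escape patterns you actually see in transcripts:

FROM llama3.1:8b
PARAMETER temperature 0.3
PARAMETER stop "User:"      # cut off attempts to continue the turn as the user
PARAMETER stop "System:"    # cut off attempts to fake a new system message
SYSTEM """
You are a customer support agent for ACME Corp.
NEVER promise refunds — direct refund requests to support@acme.com.
"""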

Pattern 3: RAG-aware system prompt

SYSTEM """
You are a documentation assistant. Answer ONLY using the provided context.
If the context does not contain the answer, say: "I don't have that information in the available documentation."
Cite sources by filename when possible.
Do not invent facts. Do not generalize beyond the context.
"""
PARAMETER temperature 0.1
PARAMETER num_ctx 16384

The combination of low temperature + restrictive system prompt + long context is the standard recipe for a RAG-grounded model variant. Pair with our local RAG setup guide for the retrieval side.

Pattern 4: Style-mimicry for content

SYSTEM """
You write in the voice of Local AI Master:
- Direct, no fluff, no "let's dive in" or "in today's fast-paced world".
- Specific numbers, real benchmarks, concrete examples.
- First-person where appropriate ("I tested...", "we measured...").
- Short paragraphs. Hard returns instead of long sentences.
- Honest about tradeoffs. Acknowledge what doesn't work.
"""

Bake the brand voice into the variant. Every chat starts with the right tone without per-call prompt engineering.

Pattern 5: Multi-shot examples

Sometimes a system prompt is not enough. Use MESSAGE pairs:

SYSTEM "You are a SQL expert. Always show the full query and an explanation."

MESSAGE user "Find users who signed up in the last 7 days."
MESSAGE assistant """
```sql
SELECT id, email, created_at
FROM users
WHERE created_at >= NOW() - INTERVAL '7 days'
ORDER BY created_at DESC;
```

This selects all columns we typically care about (id, email, signup time), filters to the last 7 days, and sorts newest-first.
"""

MESSAGE user "Same but only paying customers."
MESSAGE assistant """
```sql
SELECT u.id, u.email, u.created_at
FROM users u
JOIN subscriptions s ON s.user_id = u.id
WHERE u.created_at >= NOW() - INTERVAL '7 days'
  AND s.status = 'active'
ORDER BY u.created_at DESC;
```

Joined to subscriptions and filtered to active subs to capture only paying customers.
"""

The messages are seeded into the conversation as if they had happened. Few-shot examples baked into the model.


TEMPLATE: Chat Format Mastery {#template}

TEMPLATE is the trickiest directive. It defines the exact tokens that wrap system, user, and assistant messages before the model sees them. Get it wrong and the model produces garbage even with correct weights.

Inheriting the base template

If you are extending an existing model, inherit:

FROM llama3.1:8b
SYSTEM "You are a Python tutor."

The TEMPLATE from llama3.1:8b is preserved automatically. You only need to write your own TEMPLATE when you import a raw GGUF.
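
A quick way to confirm the inheritance (assuming you build the snippet above as py-tutor):

ollama create py-tutor -f Modelfile
ollama show py-tutor --template     # should match `ollama show llama3.1:8b --template`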

Writing a TEMPLATE for an imported GGUF

The TEMPLATE uses Go templating with these variables:

Variable | Meaning
.System | The system prompt
.Prompt | The user's current message
.Response | The assistant's response (used during streaming)
.Messages | Array of {Role, Content} for multi-turn conversations
.Tools | Array of tools (for tool-calling models)

The Llama 3.1 TEMPLATE looks like this:

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>

{{ .Content }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

"""

For Qwen 2.5:

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""

For Mistral / Mixtral:

TEMPLATE """[INST] {{ if .System }}{{ .System }}

{{ end }}{{ .Prompt }} [/INST]"""

Where to find the right template: look up the model on HuggingFace, find tokenizer_config.json, copy the chat_template field, and translate Jinja2 to Go templates (the syntax is similar but not identical).
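
As a rough sketch of that translation for a ChatML-style model (real chat_template values are usually longer, with extra branches for system prompts, tools, and generation flags):

Jinja2 in tokenizer_config.json:

{% for message in messages %}<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}<|im_start|>assistant

The same thing as a Modelfile TEMPLATE:

TEMPLATE """{{ range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""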

Tool-call template

For a tool-calling model:

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}{{ if .Tools }}

You have access to the following tools:
{{ range .Tools }}{{ .Function.Name }}: {{ .Function.Description }}
{{ end }}{{ end }}<|eot_id|>{{ end }}{{ range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>

{{ .Content }}{{ if .ToolCalls }}{{ range .ToolCalls }}
{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}{{ end }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

"""

For deeper tool-calling work, our Ollama tool calling guide covers the runtime side.


Importing a GGUF File {#gguf-import}

This is one of the most useful Modelfile features. Got a GGUF you downloaded from HuggingFace, exported from llama.cpp, or quantized yourself? Import it.

# Modelfile
FROM ./qwen2.5-7b-instruct-q4_k_m.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"
PARAMETER num_ctx 32768
PARAMETER temperature 0.7

SYSTEM "You are a helpful assistant."

Then build and run:

ollama create my-qwen -f Modelfile
ollama run my-qwen

What just happened: Ollama copied the GGUF into its blob store under ~/.ollama/models/blobs, registered the new model, applied the template and parameters. From now on my-qwen works like any other Ollama model — including pulling it from a remote registry if you push.

Critical: the TEMPLATE has to match the chat format the GGUF was trained with. If you skip TEMPLATE on a non-Llama-format model, the output is garbled. Always check the HuggingFace model card for the chat template.

Multimodal GGUFs (LLaVA, MiniCPM-V, BakLLaVA) are supported natively in Ollama 0.5+. The Modelfile is the same shape — just FROM the multimodal GGUF.


Adding LoRA Adapters {#adapters}

If you fine-tuned a model with axolotl, unsloth, or LLaMA-Factory, you can layer the adapter onto the base in a Modelfile:

FROM llama3.1:8b
ADAPTER ./company-style-lora.gguf

PARAMETER temperature 0.4
SYSTEM "You write in the company style."

The adapter must be in GGUF format. Most fine-tuning frameworks export to safetensors or HuggingFace format — convert with llama.cpp/convert_lora_to_gguf.py:

python llama.cpp/convert_lora_to_gguf.py \
  --base meta-llama/Meta-Llama-3.1-8B-Instruct \
  --outfile company-style-lora.gguf \
  ./my-lora-output/

After ollama create, the adapter is baked into the model variant as its own layer: you never reference the adapter file at runtime, and inference cost is essentially the same as the base model.
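
To confirm the adapter layer made it into the artifact (assuming you named the variant acme-style):

ollama create acme-style -f Modelfile
ollama show acme-style --modelfile   # the ADAPTER layer should appear alongside FROM and SYSTEM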

For the full fine-tuning path that ends with this Modelfile step, see our fine-tune local AI for business guide.


Real Recipes for Real Use Cases {#recipes}

Working Modelfiles for common workflows. Copy, modify, ship.

Recipe 1: Coding assistant with house style

FROM qwen2.5-coder:14b

PARAMETER temperature 0.15
PARAMETER top_p 0.9
PARAMETER num_ctx 16384
PARAMETER repeat_penalty 1.05

SYSTEM """
You are a senior engineer at ACME, writing TypeScript and Python for backend services.
- Use type hints (Python) and strict types (TypeScript).
- Follow our style guide: snake_case in Python, camelCase in TS.
- Always include error handling, never bare except.
- Prefer composition over inheritance.
- Write tests when implementing new functionality (pytest / vitest).
- Be concise. No unnecessary preamble or explanations after the code.
"""

Recipe 2: RAG-grounded answerer

FROM llama3.1:8b

PARAMETER temperature 0.1
PARAMETER top_p 0.85
PARAMETER num_ctx 16384
PARAMETER repeat_penalty 1.05

SYSTEM """
You answer questions using ONLY the provided context.

Rules:
1. If the answer is in the context, give it directly with the source filename.
2. If the answer is not in the context, respond exactly: "I don't have that information in the available documentation."
3. Never invent details. Never extrapolate beyond the context.
4. Quote exact phrases from the context when accuracy matters.
"""

Recipe 3: SQL co-pilot for read-only analytics

FROM qwen2.5-coder:7b

PARAMETER temperature 0.05
PARAMETER num_ctx 8192

SYSTEM """
You are a PostgreSQL expert helping analysts query a read-only analytics database.

Schema:
- users(id, email, created_at, plan)
- subscriptions(id, user_id, status, started_at, canceled_at)
- events(id, user_id, name, properties JSONB, created_at)

Rules:
- Output only the SQL query in a fenced code block, then a one-sentence explanation.
- Always use explicit JOINs. Never SELECT *.
- Use CTEs for queries with more than 2 joins.
- Add comments for non-obvious logic.
- NEVER write INSERT, UPDATE, DELETE, or DDL — this is read-only.
"""

Recipe 4: Email triage assistant

FROM llama3.1:8b

PARAMETER temperature 0.3
PARAMETER num_ctx 4096

SYSTEM """
Classify each email into exactly one category: URGENT, ACTION_REQUIRED, FYI, NEWSLETTER, SPAM.
Return JSON: {"category": "...", "summary": "...", "suggested_action": "..."}.
- URGENT: direct request from a customer or boss requiring response within 4 hours.
- ACTION_REQUIRED: needs a response but not urgent.
- FYI: informational, no action needed.
- NEWSLETTER: bulk content from a list.
- SPAM: cold outreach, suspicious, irrelevant.
Never include explanations outside the JSON.
"""

Once it ships, pair this with our local AI email triage guide for the orchestration side.
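
A minimal client sketch for this recipe. The model name email-triage and the example email are placeholders; the format "json" option asks Ollama to constrain output to valid JSON, and the system prompt is already baked in by the Modelfile, so only the email goes in the user message:

import json
import requests  # any HTTP client works; requests is just the shortest to show

EMAIL = "Subject: Invoice overdue\nHi, invoice #4821 is 3 weeks overdue. Can you confirm payment status today?"

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "email-triage",  # whatever name you used with `ollama create`
        "messages": [{"role": "user", "content": EMAIL}],
        "format": "json",         # constrain output to valid JSON
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
triage = json.loads(resp.json()["message"]["content"])
print(triage["category"], "-", triage["summary"])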

Recipe 5: Long-context document summarizer

FROM llama3.1:8b

PARAMETER temperature 0.2
PARAMETER num_ctx 32768
PARAMETER num_predict 1024

SYSTEM """
Summarize long documents with this structure:
1. TL;DR (2 sentences max).
2. Key Points (5-7 bullets, each <= 15 words).
3. Action Items (only if any are present in the source; otherwise omit).
4. Open Questions (only if applicable).

Be specific. Use numbers and proper nouns from the source. Avoid filler phrases.
"""

Recipe 6: Quantized model with explicit GPU offload

For a system where Ollama is misdetecting VRAM:

FROM ./mistral-nemo-instruct-q5_k_m.gguf

TEMPLATE """[INST] {{ if .System }}{{ .System }}

{{ end }}{{ .Prompt }} [/INST]"""

PARAMETER stop "[INST]"
PARAMETER stop "</s>"
PARAMETER num_ctx 8192
PARAMETER num_gpu 999      # offload all layers to GPU
PARAMETER num_thread 8     # CPU threads for prompt processing
PARAMETER f16_kv true      # halve KV cache VRAM

SYSTEM "You are a helpful assistant."

Sharing and Publishing Modelfiles {#publishing}

Push to the public registry

ollama login
ollama create yourname/py-coach -f Modelfile
ollama push yourname/py-coach

Anyone can now ollama pull yourname/py-coach. The base model is reused (not re-uploaded) — only your overrides ship.

Private sharing within a team

Three options:

  1. Commit Modelfile to a git repo. Teammates clone, run ollama create. Simplest, no infra.
  2. Self-hosted registry. Ollama supports OCI-compatible registries. Push to a private Harbor or AWS ECR with proper auth.
  3. Internal HTTP server with the GGUF + Modelfile. Teammates run a script that downloads + creates.

For 90% of teams, option 1 is the answer. The Modelfile is small enough to PR-review like any other code change.
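
A sketch of what option 1 looks like in practice; the script and model names are illustrative:

#!/usr/bin/env bash
# setup-model.sh — teammates run this once after cloning the repo
set -euo pipefail
ollama create acme/coding-assistant -f ./Modelfile
ollama show acme/coding-assistant --parameters   # sanity-check that the overrides landed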

Versioning

Tag your variants:

ollama create acme/coding-assistant:v1.2 -f Modelfile.v1.2
ollama push acme/coding-assistant:v1.2

Treat Modelfiles like Dockerfiles. Pin versions in app code. Never use :latest in production — a base model update can change behavior subtly.


Common Pitfalls {#pitfalls}

1. Forgetting num_ctx. Default is 2048. Most modern models support 8k-128k. Override to match your use case.

2. Wrong TEMPLATE for an imported GGUF. Output looks like noise. Check HuggingFace tokenizer_config.json for the right format.

3. SYSTEM with a single pair of quotes for multi-line prompts. A plain quoted string only captures one line; use triple double-quotes """...""" for anything multi-line.

4. ADAPTER format mismatch. Adapter must be GGUF, not safetensors. Convert with llama.cpp's converter.

5. Stacking conflicting PARAMETER stop sequences. Each stop is OR'd. Too many fragments cause early termination on incidental matches.

6. Triple-quoted string escaping. Inside SYSTEM """...""" you do not need to escape quotes, but you do need to escape backslashes if you want them literal. Backticks are fine.

7. num_gpu 999 on CPU-only systems. Causes a warning and falls back to CPU, but logs are noisy. Set explicitly to 0 if no GPU.

8. Pushing without licensing. If you fine-tuned on data with restrictions, add a LICENSE block. Be honest about base model licenses (Llama community license, etc.).

9. Modelfile checked in without the GGUF. If your FROM is a local file, teammates cannot ollama create without the GGUF too. Either push the result to a registry, or document where to download the GGUF.

10. Re-creating instead of updating. ollama create my-model overwrites. Use a different name during testing (my-model-test) and only overwrite the production name when validated.

For deeper context on parameters and runtime tuning, the official Ollama Modelfile reference is the authoritative source — bookmark it.


Conclusion

Modelfiles are how Ollama becomes more than ollama run llama3.1. Twenty lines of plain text and you have a model variant tuned for your workflow, sharable across your team, version-controllable like any other artifact. The pattern composes: a base model + a system prompt + a few parameter overrides + maybe an adapter, and you have shipped a custom AI behavior.

The pieces I would internalize first: num_ctx (most users undertune it), SYSTEM with concrete persona + format rules, and temperature per use case. Those three knobs cover 80% of the value. After that, importing GGUFs and layering adapters opens up the long tail — fine-tuned coding assistants, brand-voice writers, RAG-grounded answerers — all running locally, all on hardware you control.

Once you have a Modelfile you trust, push it. Make it the team's default. Wire it into your Ollama production deployment so every API caller gets the same baked-in behavior. That is the moment Ollama stops being a tool and starts being part of your platform.


Subscribe to the Local AI Master newsletter for more Modelfile recipes, parameter tuning experiments, and shareable templates from real production stacks.
