Docker Model Runner Guide: Run LLMs with Docker
Docker Model Runner Quick Start
3 Commands to Start:
docker model pull ai/llama3.2
docker model run ai/llama3.2 "Hello"
curl localhost:12434/engines/v1/models
Key Features:
• OpenAI-compatible API
• Native Docker Compose support
• GPU: Metal, CUDA, Vulkan
• Models as Docker primitives
What is Docker Model Runner?
Docker Model Runner (DMR) is Docker's official solution for running AI/LLM models locally, launched April 2025. It treats AI models as first-class citizens within the Docker ecosystem—similar to how Docker treats containers, images, and volumes.
Key Characteristics
- Built on llama.cpp: High-performance inference engine (with vLLM and Diffusers support)
- OpenAI-compatible API: Drop-in replacement for OpenAI SDK on localhost:12434
- OCI Artifacts: Models stored on Docker Hub or any OCI-compliant registry
- On-demand Loading: Models load into memory at runtime, unload when idle
- Native Execution: Inference server runs directly on host (not containerized) for GPU access
Why Docker Model Runner?
Docker Model Runner fills a gap in the AI development workflow. Instead of managing separate tools for local LLM inference, you can now use Docker's familiar commands and workflows:
- Run models with docker model run, just like docker run
- Pull models from Docker Hub with docker model pull
- Integrate AI into Docker Compose with native support
- Use existing Docker infrastructure and CI/CD pipelines
Installation and Setup
Docker Desktop (macOS and Windows)
Docker Model Runner comes included with Docker Desktop 4.42 or later.
Step 1: Install/Update Docker Desktop
Download the latest version from docker.com.
Step 2: Enable Docker Model Runner
- Open Docker Desktop
- Go to Settings (gear icon)
- Navigate to AI
- Enable Docker Model Runner
- (Optional) Enable GPU-backed inference if you have an NVIDIA GPU
- (Optional) Enable Host-side TCP support for localhost:12434 access
Step 3: Verify Installation
# Check version
docker model version
# Check status
docker model status
# View help
docker model --help
Linux (Docker Engine)
On Linux, Docker Model Runner is available as a plugin for Docker Engine.
Ubuntu/Debian:
# Update package list
sudo apt-get update
# Install plugin
sudo apt-get install docker-model-plugin
# Verify installation
docker model version
Fedora/RHEL/CentOS:
# Update packages
sudo dnf update
# Install plugin
sudo dnf install docker-model-plugin
# Verify
docker model version
Runner Management (Linux):
# Install runner component
docker model install-runner
# Start the runner
docker model start-runner
# Check status
docker model status
# Stop/restart
docker model stop-runner
docker model restart-runner
# Uninstall
docker model uninstall-runner
Enable TCP Support
To access the API from host applications (outside Docker containers):
- Docker Desktop: Settings > AI > Enable "host-side TCP support"
- This enables connections on port 12434
- API endpoint: http://localhost:12434/engines/v1
Available Models
Docker Model Runner supports GGUF models from Docker Hub and Hugging Face.
Official Docker Hub Models
| Model | Tag | Size | Use Case |
|---|---|---|---|
| SmolLM2 | ai/smollm2:360M-Q4_K_M | 360M | Prototyping, testing |
| Llama 3.2 | ai/llama3.2:3B-Q4_K_M | 3B | General text generation |
| Llama 3.3 | ai/llama3.3 | 70B | Advanced text generation |
| Gemma 3 | ai/gemma3 | Various | Reasoning tasks |
| Phi 4 | ai/phi4 | 14B | Reasoning tasks |
| Qwen 2.5 | ai/qwen2.5 | Various | Code generation |
| Qwen 3 | ai/qwen3 | Various | Code, multilingual |
| Mistral | ai/mistral | 7B | General, code |
| Mistral Nemo | ai/mistral-nemo | 12B | Advanced tasks |
| DeepSeek R1 | ai/deepseek-r1-distill-llama | Various | Advanced reasoning |
| All-MiniLM | ai/all-minilm | 33M | Embeddings |
Model Selection by Use Case
By Hardware:
- Low-end (8GB RAM): ai/smollm2, ai/llama3.2:1B
- Mid-range (16GB RAM): ai/llama3.2:3B, ai/gemma3:4b, ai/phi4
- High-end (32GB+ RAM): ai/llama3.3, ai/deepseek-r1-distill-llama
By Application:
- General chat: ai/llama3.2 or ai/llama3.3
- Code generation: ai/qwen2.5 or ai/mistral
- Reasoning tasks: ai/gemma3 or ai/phi4
- Embeddings/RAG: ai/all-minilm
- Quick prototyping: ai/smollm2
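The hardware guidance above can be condensed into a small helper that suggests a default tag from available RAM. A minimal sketch, using model names from the tables above; the thresholds are illustrative, not official recommendations:

```python
def pick_model(ram_gb: float) -> str:
    """Suggest a Docker Model Runner tag for the available system RAM.

    Thresholds mirror the rough hardware guidance above and are
    illustrative, not official recommendations.
    """
    if ram_gb >= 32:
        return "ai/llama3.3"            # 70B-class, high-end machines
    if ram_gb >= 16:
        return "ai/llama3.2:3B-Q4_K_M"  # mid-range default
    return "ai/smollm2"                 # low-end / prototyping

print(pick_model(8))   # ai/smollm2
print(pick_model(16))  # ai/llama3.2:3B-Q4_K_M
```

A helper like this is handy in CI scripts that need to run the largest model the runner host can actually fit.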
Pulling Models
# From Docker Hub
docker model pull ai/llama3.2
docker model pull ai/llama3.2:3B-Q4_K_M # Specific quantization
# From Hugging Face
docker model pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
# From custom registry
docker model pull myregistry.com/models/mistral:latest
# Check downloaded models
docker model list
Model Format Requirements
| Engine | Format | Platform |
|---|---|---|
| llama.cpp (default) | GGUF | All |
| vLLM | Safetensors | Linux + NVIDIA GPU |
| Diffusers | Various | Linux + NVIDIA GPU |
CLI Commands Reference
Model Management
# Pull/download a model
docker model pull ai/llama3.2:3B-Q4_K_M
docker model pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
# List downloaded models
docker model list
# Remove a model
docker model rm ai/llama3.2
# Inspect model details
docker model inspect ai/llama3.2
# View model metadata
docker model inspect ai/llama3.2 --format '{{.Config.Size}}'
Running Models
# Interactive query (load, run, display)
docker model run ai/llama3.2 "Explain Docker in one sentence"
# Interactive chat mode
docker model run ai/llama3.2
# Background/detached mode (pre-load model)
docker model run -d ai/llama3.2
# With options
docker model run --debug ai/llama3.2 "Test prompt"
docker model run --color yes ai/llama3.2 "Hello"
docker model run --ignore-runtime-memory-check ai/llama3.2 "Test"
Run Command Options
| Option | Description |
|---|---|
| --color (auto, yes, no) | Control colorized output |
| --debug | Enable debug logging |
| -d, --detach | Load model in background |
| --ignore-runtime-memory-check | Skip memory validation |
System Commands
# Check runner status
docker model status
# Show version
docker model version
# System information
docker model system info
# Disk usage
docker model system df
# Clean unused models
docker model system prune
# Remove ALL models
docker model system prune -a
Using the OpenAI-Compatible API
Docker Model Runner exposes an OpenAI-compatible API, making it a drop-in replacement for OpenAI's API.
API Endpoint
| Deployment | Endpoint |
|---|---|
| Host (TCP enabled) | http://localhost:12434/engines/v1 |
| Docker containers (Desktop) | http://model-runner.docker.internal/engines/v1 |
| Docker containers (Linux) | http://localhost:12434/engines/v1 |
Using cURL
# Chat completion
curl http://localhost:12434/engines/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ai/llama3.2",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Docker?"}
],
"temperature": 0.7,
"max_tokens": 500
}'
# List available models
curl http://localhost:12434/engines/v1/models
# Create embeddings
curl http://localhost:12434/engines/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "ai/all-minilm",
"input": "Docker makes it easy to run applications"
}'
Python with OpenAI SDK
from openai import OpenAI
# Connect to Docker Model Runner
client = OpenAI(
base_url="http://localhost:12434/engines/v1",
api_key="not-needed" # No API key required
)
# Chat completion
response = client.chat.completions.create(
model="ai/llama3.2",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Docker Model Runner?"}
],
temperature=0.7,
max_tokens=500
)
print(response.choices[0].message.content)
# Streaming response
stream = client.chat.completions.create(
model="ai/llama3.2",
messages=[{"role": "user", "content": "Write a poem about Docker"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
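If you call the endpoint without the SDK (for example with urllib or a non-Python client), you get the standard OpenAI chat-completions JSON shape back. A sketch of parsing such a response with only the standard library; the payload below is a trimmed, illustrative sample, not captured output:

```python
import json

# A trimmed example payload in the OpenAI chat-completions shape
# (field values are illustrative).
raw = """{
  "model": "ai/llama3.2",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Docker packages apps into containers."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21}
}"""

data = json.loads(raw)
answer = data["choices"][0]["message"]["content"]   # assistant reply text
tokens = data["usage"]["total_tokens"]              # token accounting
print(answer)
print(f"tokens used: {tokens}")
```

Because the shape matches OpenAI's, the same parsing code works unchanged against the hosted OpenAI API.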
Node.js/TypeScript
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:12434/engines/v1',
apiKey: 'not-needed'
});
async function chat() {
const response = await client.chat.completions.create({
model: 'ai/llama3.2',
messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
}
chat();
LangChain Integration
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="http://localhost:12434/engines/v1",
api_key="not-needed",
model="ai/llama3.2"
)
response = llm.invoke("What is Docker?")
print(response.content)
Docker Compose Integration
Docker Compose natively supports AI models as a top-level primitive.
Short Syntax (Simple)
services:
app:
image: my-app
models:
- llm
- embedding-model
models:
llm:
model: ai/llama3.2
embedding-model:
model: ai/all-minilm
Long Syntax (With Configuration)
services:
app:
image: my-app
environment:
- LLM_URL=${LLM_URL}
models:
llm:
endpoint_var: LLM_URL
models:
llm:
model: ai/llama3.2:3B-Q4_K_M
context_size: 4096
runtime_flags:
- "--no-prefill-assistant"
Provider Syntax
services:
chat:
image: my-chat-app
depends_on:
- ai_runner
ai_runner:
provider:
type: model
options:
model: ai/smollm2
context-size: 1024
Auto-Injected Environment Variables
Docker Compose automatically injects environment variables:
| Variable | Description |
|---|---|
| {MODEL}_MODEL | The model name (e.g., ai/llama3.2) |
| {MODEL}_URL | The endpoint URL |
Connecting from Containers
# Docker Desktop
ENDPOINT="http://model-runner.docker.internal/engines/v1"
# Docker Engine (Linux)
ENDPOINT="http://localhost:12434/engines/v1"
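Inside a container, application code can read the Compose-injected variables directly. A minimal sketch for a model named `llm` (so Compose injects `LLM_URL`/`LLM_MODEL`, per the table above), falling back to the Docker Desktop internal endpoint when the variables are absent:

```python
import os

# Compose injects LLM_URL / LLM_MODEL for a model named "llm";
# fall back to the Docker Desktop internal endpoint otherwise.
base_url = os.environ.get(
    "LLM_URL", "http://model-runner.docker.internal/engines/v1"
)
model = os.environ.get("LLM_MODEL", "ai/llama3.2")
print(f"connecting to {base_url} with model {model}")
```

The same two values are exactly what the OpenAI SDK examples above need for `base_url` and `model`, so the app runs unmodified inside or outside Compose.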
Complete Example: AI Chat Application
services:
backend:
image: python:3.11
command: python app.py
environment:
- MODEL_ENDPOINT=http://model-runner.docker.internal/engines/v1
- MODEL_NAME=ai/llama3.2
ports:
- "8000:8000"
models:
- chat-model
- embeddings
frontend:
image: node:20
command: npm start
ports:
- "3000:3000"
depends_on:
- backend
models:
chat-model:
model: ai/llama3.2:3B-Q4_K_M
context_size: 4096
embeddings:
model: ai/all-minilm
GPU Configuration
GPU acceleration dramatically improves inference performance—from 10+ seconds to under 3 seconds for typical responses.
Apple Silicon (Automatic)
GPU acceleration via Metal API is automatically configured on M1, M2, M3, and M4 Macs. No additional setup required.
Docker Model Runner runs natively on the host (not in a VM) for direct GPU access, providing excellent performance on Apple Silicon.
NVIDIA GPU Setup
Docker Desktop (Windows with WSL2):
- Ensure NVIDIA drivers are installed
- Open Docker Desktop > Settings > AI
- Enable GPU-backed inference
- Save and restart
Linux:
# 1. Install NVIDIA Container Runtime
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# 2. Configure Docker daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# 3. Verify configuration
docker info | grep -i runtime
# 4. Run with GPU
docker model run --gpu cuda ai/llama3.2 "Hello"
Best Practices for NVIDIA:
- Always use --gpu cuda explicitly (not --gpu auto)
- Monitor GPU usage with nvidia-smi
- Use quantized models (Q4, Q5, Q8) for better GPU memory efficiency
- Ensure sufficient GPU VRAM for your model size
Vulkan Support (AMD/Intel/Integrated)
Vulkan support enables GPU acceleration on a wider range of hardware:
- AMD GPUs
- Intel GPUs (including integrated)
- Other Vulkan-compatible GPUs
Vulkan detection is automatic—no configuration needed.
GPU Support Matrix
| Platform | GPU | API | Configuration |
|---|---|---|---|
| macOS | Apple Silicon (M1-M4) | Metal | Automatic |
| Windows | NVIDIA | CUDA | Enable in Settings |
| Windows | AMD/Intel | Vulkan | Automatic |
| Linux | NVIDIA | CUDA | --gpu cuda flag |
| Linux | AMD/Intel | Vulkan | Automatic |
Inference Engines
Docker Model Runner supports multiple inference engines for different use cases.
llama.cpp (Default)
- Platform: All (macOS, Windows, Linux)
- Format: GGUF models only
- GPU: Metal (Apple), CUDA (NVIDIA), Vulkan (AMD/Intel)
- Best for: General use, Apple Silicon, development
vLLM
- Platform: Linux x86_64, Windows WSL2 (with NVIDIA GPU)
- Format: Safetensors
- GPU: NVIDIA CUDA required
- Best for: High-throughput production serving
Diffusers
- Platform: Linux x86_64, Linux ARM64 (with NVIDIA GPU)
- Format: Various (Stable Diffusion models)
- GPU: NVIDIA CUDA required
- Best for: Image generation
Performance: Docker Model Runner vs Ollama
Benchmark Comparison
| Metric | Docker Model Runner | Ollama |
|---|---|---|
| Throughput | Baseline to +12% | Baseline |
| Startup Time | 3-6 seconds | 2-5 seconds |
| Memory (7B Q4) | 4-6GB | 4-6GB |
| Apple Silicon | Excellent (native Metal) | Excellent |
| NVIDIA | Good (CUDA) | Good |
Feature Comparison
| Feature | Docker Model Runner | Ollama |
|---|---|---|
| Docker Compose | Native integration | Requires setup |
| API Compatibility | OpenAI-compatible | OpenAI-compatible |
| Model Format | GGUF (+Safetensors vLLM) | GGUF |
| Model Library | Growing | Extensive |
| Ecosystem | New (Docker-focused) | Mature (many integrations) |
| Multimodal | Coming | Supported |
| Custom Models | OCI registries, HF | Modelfile |
When to Choose Each
Choose Docker Model Runner when:
- Your workflow is Docker-centric
- You use Docker Compose for orchestration
- You want models as Docker primitives
- You're on Apple Silicon and want native performance
- You need CI/CD integration with Docker
Choose Ollama when:
- You need the largest model library
- You want maximum ecosystem integrations
- You need multimodal models now
- You prefer Modelfile customization
- You want the most mature solution
Resource Management
System Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| RAM | 8GB | 16GB+ |
| Storage | Varies | 50GB+ (for multiple models) |
| GPU | Optional | NVIDIA/Apple Silicon |
Model Sizes (Approximate)
| Model | Q4_K_M | Q8_0 | FP16 |
|---|---|---|---|
| SmolLM2 360M | ~250MB | ~400MB | ~720MB |
| Llama 3.2 1B | ~800MB | ~1.2GB | ~2GB |
| Llama 3.2 3B | ~2GB | ~3.5GB | ~6GB |
| Llama 3.3 70B | ~40GB | ~70GB | ~140GB |
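The sizes in the table follow roughly from parameter count times bytes per weight. A hedged back-of-the-envelope estimator; the bits-per-weight figures are approximate effective averages for each quantization scheme, and real GGUF files add some overhead:

```python
def approx_size_gb(params_billion: float, quant: str) -> float:
    """Rough on-disk size of a model: params x bits-per-weight / 8.

    Bits-per-weight values are approximate effective averages for
    each quantization scheme, not exact format constants.
    """
    bits = {"Q4_K_M": 4.8, "Q8_0": 8.5, "FP16": 16.0}[quant]
    # 1e9 params at 1 byte/param is ~1 GB, so this yields GB directly.
    return params_billion * bits / 8

for q in ("Q4_K_M", "Q8_0", "FP16"):
    print(f"Llama 3.2 3B @ {q}: ~{approx_size_gb(3, q):.1f} GB")
```

For example, 70B at FP16 gives 70 x 16 / 8 = 140 GB, matching the table; use this only as a sizing sanity check before pulling a model.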
Memory Management
Models in Docker Model Runner:
- Load on-demand: First request loads the model
- Unload when idle: Memory released after inactivity
- Pre-loading: Use docker model run -d to keep the model loaded
Monitoring and Cleanup
# Check disk usage by models
docker model system df
# View detailed model info
docker model inspect ai/llama3.2
# Remove unused models
docker model system prune
# Remove ALL models (use with caution)
docker model system prune -a
# Monitor GPU (NVIDIA)
nvidia-smi
# Check runner status
docker model status
Troubleshooting
Model Not Responding
# Check runner status
docker model status
# Restart runner (Linux)
docker model restart-runner
# Check if model is loaded
docker model list
# Try with debug output
docker model run --debug ai/llama3.2 "Test"
GPU Not Detected
NVIDIA:
# Verify NVIDIA runtime
docker info | grep -i runtime
# Check GPU visibility
nvidia-smi
# Ensure --gpu cuda flag is used
docker model run --gpu cuda ai/llama3.2 "Test"
Apple Silicon:
- GPU via Metal should work automatically
- Ensure Docker Desktop is up to date
Port 12434 Not Accessible
- Enable Host-side TCP support in Docker Desktop settings
- Check for firewall blocking the port
- Verify with:
curl http://localhost:12434/engines/v1/models
Out of Memory
# Use smaller quantization
docker model pull ai/llama3.2:3B-Q4_K_M
# Clean up unused models
docker model system prune
# Reduce context size in Compose
models:
llm:
model: ai/llama3.2
context_size: 2048 # Smaller context
Best Practices
1. Use Appropriate Quantization
# Q4_K_M: Best balance of quality and memory
docker model pull ai/llama3.2:3B-Q4_K_M
# Q5_K_M: Slightly better quality
docker model pull ai/llama3.2:3B-Q5_K_M
# Q8_0: Near-full quality, more memory
docker model pull ai/llama3.2:3B-Q8_0
2. Pre-load Models for Production
# Load model in background for faster first request
docker model run -d ai/llama3.2
3. Set Appropriate Context Size
Larger context = more memory. Use smallest context that works:
models:
llm:
model: ai/llama3.2
context_size: 2048 # Start small, increase if needed
4. Monitor Resources
# Regular cleanup
docker model system prune
# Check disk usage
docker model system df
Key Takeaways
- Docker Model Runner is Docker's native AI inference solution
- OpenAI-compatible API enables drop-in replacement for existing code
- Docker Compose integration makes AI orchestration familiar and simple
- GPU acceleration works across Apple Silicon, NVIDIA, and Vulkan
- Performance matches Ollama, with Docker-native advantages
- Models load on-demand and unload when idle for efficient memory use
- Best for Docker-centric workflows, Compose deployments, and CI/CD
Next Steps
- Compare with Ollama and LM Studio
- Set up RAG with Docker Model Runner
- Build AI agents using the OpenAI API
- Choose the best models for your use case
Docker Model Runner brings AI inference into the Docker ecosystem, making it easy to run LLMs alongside your containerized applications with familiar tools, commands, and workflows. Whether you're developing locally or deploying to production, Docker Model Runner provides a seamless AI experience for Docker users.