Docker Model Runner Guide: Run LLMs with Docker

February 4, 2026
18 min read
Local AI Master Research Team
Docker Model Runner Quick Start

3 Commands to Start:
docker model pull ai/llama3.2
docker model run ai/llama3.2 "Hello"
curl localhost:12434/engines/v1/models

Key Features:
• OpenAI-compatible API
• Native Docker Compose support
• GPU: Metal, CUDA, Vulkan
• Models as Docker primitives

What is Docker Model Runner?

Docker Model Runner (DMR) is Docker's official solution for running AI/LLM models locally, launched April 2025. It treats AI models as first-class citizens within the Docker ecosystem—similar to how Docker treats containers, images, and volumes.

Key Characteristics

  • Built on llama.cpp: High-performance inference engine (with vLLM and Diffusers support)
  • OpenAI-compatible API: Drop-in replacement for OpenAI SDK on localhost:12434
  • OCI Artifacts: Models stored on Docker Hub or any OCI-compliant registry
  • On-demand Loading: Models load into memory at runtime, unload when idle
  • Native Execution: Inference server runs directly on host (not containerized) for GPU access

Why Docker Model Runner?

Docker Model Runner fills a gap in the AI development workflow. Instead of managing separate tools for local LLM inference, you can now use Docker's familiar commands and workflows:

  • Run models with docker model run just like docker run
  • Pull models from Docker Hub with docker model pull
  • Integrate AI into Docker Compose with native support
  • Use existing Docker infrastructure and CI/CD pipelines

Installation and Setup

Docker Desktop (macOS and Windows)

Docker Model Runner is included with Docker Desktop 4.42 and later.

Step 1: Install/Update Docker Desktop

Download the latest version from docker.com.

Step 2: Enable Docker Model Runner

  1. Open Docker Desktop
  2. Go to Settings (gear icon)
  3. Navigate to AI
  4. Enable Docker Model Runner
  5. (Optional) Enable GPU-backed inference if you have NVIDIA GPU
  6. (Optional) Enable Host-side TCP support for localhost:12434 access

Step 3: Verify Installation

# Check version
docker model version

# Check status
docker model status

# View help
docker model --help

Linux (Docker Engine)

On Linux, Docker Model Runner is available as a plugin for Docker Engine.

Ubuntu/Debian:

# Update package list
sudo apt-get update

# Install plugin
sudo apt-get install docker-model-plugin

# Verify installation
docker model version

Fedora/RHEL/CentOS:

# Update packages
sudo dnf update

# Install plugin
sudo dnf install docker-model-plugin

# Verify
docker model version

Runner Management (Linux):

# Install runner component
docker model install-runner

# Start the runner
docker model start-runner

# Check status
docker model status

# Stop/restart
docker model stop-runner
docker model restart-runner

# Uninstall
docker model uninstall-runner

Enable TCP Support

To access the API from host applications (outside Docker containers):

  1. Docker Desktop: Settings > AI > Enable "host-side TCP support"
  2. This enables connections on port 12434
  3. API endpoint: http://localhost:12434/engines/v1

Available Models

Docker Model Runner supports GGUF models from Docker Hub and Hugging Face.

Official Docker Hub Models

| Model | Tag | Size | Use Case |
|---|---|---|---|
| SmolLM2 | ai/smollm2:360M-Q4_K_M | 360M | Prototyping, testing |
| Llama 3.2 | ai/llama3.2:3B-Q4_K_M | 3B | General text generation |
| Llama 3.3 | ai/llama3.3 | 70B | Advanced text generation |
| Gemma 3 | ai/gemma3 | Various | Reasoning tasks |
| Phi 4 | ai/phi4 | 14B | Reasoning tasks |
| Qwen 2.5 | ai/qwen2.5 | Various | Code generation |
| Qwen 3 | ai/qwen3 | Various | Code, multilingual |
| Mistral | ai/mistral | 7B | General, code |
| Mistral Nemo | ai/mistral-nemo | 12B | Advanced tasks |
| DeepSeek R1 | ai/deepseek-r1-distill-llama | Various | Advanced reasoning |
| All-MiniLM | ai/all-minilm | 33M | Embeddings |

Model Selection by Use Case

By Hardware:

  • Low-end (8GB RAM): ai/smollm2, ai/llama3.2:1B
  • Mid-range (16GB RAM): ai/llama3.2:3B, ai/gemma3:4b, ai/phi4
  • High-end (32GB+ RAM): ai/llama3.3, ai/deepseek-r1-distill-llama

By Application:

  • General chat: ai/llama3.2 or ai/llama3.3
  • Code generation: ai/qwen2.5 or ai/mistral
  • Reasoning tasks: ai/gemma3 or ai/phi4
  • Embeddings/RAG: ai/all-minilm
  • Quick prototyping: ai/smollm2
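The hardware and application guidance above can be sketched as a small helper. The thresholds and the task-to-model mapping below are illustrative assumptions drawn from the tiers listed in this guide, not an official selection API:

```python
def pick_model(ram_gb: int, task: str = "chat") -> str:
    """Suggest a model tag from the lists above based on available RAM and task.

    Thresholds are rough assumptions taken from the hardware tiers in this guide.
    """
    if task == "embeddings":
        return "ai/all-minilm"  # tiny, embeddings-only
    if ram_gb < 16:
        # Low-end tier: smallest usable chat models
        return "ai/smollm2" if ram_gb < 8 else "ai/llama3.2"
    if ram_gb < 32:
        # Mid-range tier: task-specific picks, general chat as default
        return {"code": "ai/qwen2.5", "reasoning": "ai/phi4"}.get(task, "ai/llama3.2")
    return "ai/llama3.3"  # high-end tier

print(pick_model(16, "code"))  # → ai/qwen2.5
```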

Pulling Models

# From Docker Hub
docker model pull ai/llama3.2
docker model pull ai/llama3.2:3B-Q4_K_M  # Specific quantization

# From Hugging Face
docker model pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF

# From custom registry
docker model pull myregistry.com/models/mistral:latest

# Check downloaded models
docker model list

Model Format Requirements

| Engine | Format | Platform |
|---|---|---|
| llama.cpp (default) | GGUF | All |
| vLLM | Safetensors | Linux + NVIDIA GPU |
| Diffusers | Various | Linux + NVIDIA GPU |

CLI Commands Reference

Model Management

# Pull/download a model
docker model pull ai/llama3.2:3B-Q4_K_M
docker model pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF

# List downloaded models
docker model list

# Remove a model
docker model rm ai/llama3.2

# Inspect model details
docker model inspect ai/llama3.2

# View model metadata
docker model inspect ai/llama3.2 --format '{{.Config.Size}}'

Running Models

# Interactive query (load, run, display)
docker model run ai/llama3.2 "Explain Docker in one sentence"

# Interactive chat mode
docker model run ai/llama3.2

# Background/detached mode (pre-load model)
docker model run -d ai/llama3.2

# With options
docker model run --debug ai/llama3.2 "Test prompt"
docker model run --color yes ai/llama3.2 "Hello"
docker model run --ignore-runtime-memory-check ai/llama3.2 "Test"

Run Command Options

| Option | Description |
|---|---|
| `--color` | Colorize output (`auto`, `yes`, `no`) |
| `--debug` | Enable debug logging |
| `-d`, `--detach` | Load model in background |
| `--ignore-runtime-memory-check` | Skip memory validation |

System Commands

# Check runner status
docker model status

# Show version
docker model version

# System information
docker model system info

# Disk usage
docker model system df

# Clean unused models
docker model system prune

# Remove ALL models
docker model system prune -a

Using the OpenAI-Compatible API

Docker Model Runner exposes an OpenAI-compatible API, making it a drop-in replacement for OpenAI's API.

API Endpoint

| Deployment | Endpoint |
|---|---|
| Host (TCP enabled) | http://localhost:12434/engines/v1 |
| Docker containers (Docker Desktop) | http://model-runner.docker.internal/engines/v1 |
| Docker containers (Linux) | http://localhost:12434/engines/v1 |

Using cURL

# Chat completion
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/llama3.2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Docker?"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

# List available models
curl http://localhost:12434/engines/v1/models

# Create embeddings
curl http://localhost:12434/engines/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/all-minilm",
    "input": "Docker makes it easy to run applications"
  }'
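Embedding vectors returned by the endpoint above are typically compared with cosine similarity, which is the basis of RAG retrieval. A minimal, dependency-free sketch of that comparison:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```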

Python with OpenAI SDK

from openai import OpenAI

# Connect to Docker Model Runner
client = OpenAI(
    base_url="http://localhost:12434/engines/v1",
    api_key="not-needed"  # No API key required
)

# Chat completion
response = client.chat.completions.create(
    model="ai/llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Docker Model Runner?"}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

# Streaming response
stream = client.chat.completions.create(
    model="ai/llama3.2",
    messages=[{"role": "user", "content": "Write a poem about Docker"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Node.js/TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:12434/engines/v1',
  apiKey: 'not-needed'
});

async function chat() {
  const response = await client.chat.completions.create({
    model: 'ai/llama3.2',
    messages: [{ role: 'user', content: 'Hello!' }]
  });

  console.log(response.choices[0].message.content);
}

chat();

LangChain Integration

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:12434/engines/v1",
    api_key="not-needed",
    model="ai/llama3.2"
)

response = llm.invoke("What is Docker?")
print(response.content)

Docker Compose Integration

Docker Compose natively supports AI models as a top-level primitive.

Short Syntax (Simple)

services:
  app:
    image: my-app
    models:
      - llm
      - embedding-model

models:
  llm:
    model: ai/llama3.2
  embedding-model:
    model: ai/all-minilm

Long Syntax (With Configuration)

services:
  app:
    image: my-app
    environment:
      - LLM_URL=${LLM_URL}
    models:
      llm:
        endpoint_var: LLM_URL

models:
  llm:
    model: ai/llama3.2:3B-Q4_K_M
    context_size: 4096
    runtime_flags:
      - "--no-prefill-assistant"

Provider Syntax

services:
  chat:
    image: my-chat-app
    depends_on:
      - ai_runner

  ai_runner:
    provider:
      type: model
      options:
        model: ai/smollm2
        context-size: 1024

Auto-Injected Environment Variables

Docker Compose automatically injects environment variables:

| Variable | Description |
|---|---|
| {MODEL}_MODEL | The model name (e.g., ai/llama3.2) |
| {MODEL}_URL | The endpoint URL |
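Application code can read those injected variables with a small helper. `resolve_model_config` and its fallback URL are illustrative names, not part of Docker's API; the upper-cased prefix follows the {MODEL}_MODEL / {MODEL}_URL convention above, and the dash-to-underscore mapping is an assumption:

```python
import os

def resolve_model_config(model_key: str,
                         default_url: str = "http://localhost:12434/engines/v1"):
    """Read the Compose-injected URL and model name for a given model key.

    The variable prefix is derived from the model key (e.g. the model 'llm'
    yields LLM_URL and LLM_MODEL). Falls back to the host endpoint when the
    variables are absent, e.g. when running outside Compose.
    """
    prefix = model_key.upper().replace("-", "_")
    url = os.environ.get(f"{prefix}_URL", default_url)
    model = os.environ.get(f"{prefix}_MODEL")
    return url, model

# Simulate what Compose would inject for the 'llm' model
os.environ["LLM_URL"] = "http://model-runner.docker.internal/engines/v1"
os.environ["LLM_MODEL"] = "ai/llama3.2"
print(resolve_model_config("llm"))
```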

Connecting from Containers

# Docker Desktop
ENDPOINT="http://model-runner.docker.internal/engines/v1"

# Docker Engine (Linux)
ENDPOINT="http://localhost:12434/engines/v1"

Complete Example: AI Chat Application

services:
  backend:
    image: python:3.11
    command: python app.py
    environment:
      - MODEL_ENDPOINT=http://model-runner.docker.internal/engines/v1
      - MODEL_NAME=ai/llama3.2
    ports:
      - "8000:8000"
    models:
      - chat-model
      - embeddings

  frontend:
    image: node:20
    command: npm start
    ports:
      - "3000:3000"
    depends_on:
      - backend

models:
  chat-model:
    model: ai/llama3.2:3B-Q4_K_M
    context_size: 4096

  embeddings:
    model: ai/all-minilm

GPU Configuration

GPU acceleration dramatically improves inference performance—from 10+ seconds to under 3 seconds for typical responses.

Apple Silicon (Automatic)

GPU acceleration via Metal API is automatically configured on M1, M2, M3, and M4 Macs. No additional setup required.

Docker Model Runner runs natively on the host (not in a VM) for direct GPU access, providing excellent performance on Apple Silicon.

NVIDIA GPU Setup

Docker Desktop (Windows with WSL2):

  1. Ensure NVIDIA drivers are installed
  2. Open Docker Desktop > Settings > AI
  3. Enable GPU-backed inference
  4. Save and restart

Linux:

# 1. Install the NVIDIA Container Toolkit (keyring method; apt-key is deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# 2. Configure Docker daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# 3. Verify configuration
docker info | grep -i runtime

# 4. Run with GPU
docker model run --gpu cuda ai/llama3.2 "Hello"

Best Practices for NVIDIA:

  • Always use --gpu cuda explicitly (not --gpu auto)
  • Monitor GPU usage with nvidia-smi
  • Use quantized models (Q4, Q5, Q8) for better GPU memory efficiency
  • Ensure sufficient GPU VRAM for your model size

Vulkan Support (AMD/Intel/Integrated)

Vulkan support enables GPU acceleration on a wider range of hardware:

  • AMD GPUs
  • Intel GPUs (including integrated)
  • Other Vulkan-compatible GPUs

Vulkan detection is automatic—no configuration needed.

GPU Support Matrix

| Platform | GPU | API | Configuration |
|---|---|---|---|
| macOS | Apple Silicon (M1-M4) | Metal | Automatic |
| Windows | NVIDIA | CUDA | Enable in Settings |
| Windows | AMD/Intel | Vulkan | Automatic |
| Linux | NVIDIA | CUDA | `--gpu cuda` flag |
| Linux | AMD/Intel | Vulkan | Automatic |

Inference Engines

Docker Model Runner supports multiple inference engines for different use cases.

llama.cpp (Default)

  • Platform: All (macOS, Windows, Linux)
  • Format: GGUF models only
  • GPU: Metal (Apple), CUDA (NVIDIA), Vulkan (AMD/Intel)
  • Best for: General use, Apple Silicon, development

vLLM

  • Platform: Linux x86_64, Windows WSL2 (with NVIDIA GPU)
  • Format: Safetensors
  • GPU: NVIDIA CUDA required
  • Best for: High-throughput production serving

Diffusers

  • Platform: Linux x86_64, Linux ARM64 (with NVIDIA GPU)
  • Format: Various (Stable Diffusion models)
  • GPU: NVIDIA CUDA required
  • Best for: Image generation

Performance: Docker Model Runner vs Ollama

Benchmark Comparison

| Metric | Docker Model Runner | Ollama |
|---|---|---|
| Throughput | Baseline to +12% | Baseline |
| Startup time | 3-6 seconds | 2-5 seconds |
| Memory (7B Q4) | 4-6GB | 4-6GB |
| Apple Silicon | Excellent (native Metal) | Excellent |
| NVIDIA | Good (CUDA) | Good |

Feature Comparison

| Feature | Docker Model Runner | Ollama |
|---|---|---|
| Docker Compose | Native integration | Requires setup |
| API compatibility | OpenAI-compatible | OpenAI-compatible |
| Model format | GGUF (plus Safetensors via vLLM) | GGUF |
| Model library | Growing | Extensive |
| Ecosystem | New (Docker-focused) | Mature (many integrations) |
| Multimodal | Coming | Supported |
| Custom models | OCI registries, Hugging Face | Modelfile |

When to Choose Each

Choose Docker Model Runner when:

  • Your workflow is Docker-centric
  • You use Docker Compose for orchestration
  • You want models as Docker primitives
  • You're on Apple Silicon and want native performance
  • You need CI/CD integration with Docker

Choose Ollama when:

  • You need the largest model library
  • You want maximum ecosystem integrations
  • You need multimodal models now
  • You prefer Modelfile customization
  • You want the most mature solution

Resource Management

System Requirements

| Resource | Minimum | Recommended |
|---|---|---|
| RAM | 8GB | 16GB+ |
| Storage | Varies | 50GB+ (for multiple models) |
| GPU | Optional | NVIDIA/Apple Silicon |

Model Sizes (Approximate)

| Model | Q4_K_M | Q8_0 | FP16 |
|---|---|---|---|
| SmolLM2 360M | ~250MB | ~400MB | ~720MB |
| Llama 3.2 1B | ~800MB | ~1.2GB | ~2GB |
| Llama 3.2 3B | ~2GB | ~3.5GB | ~6GB |
| Llama 3.3 70B | ~40GB | ~70GB | ~140GB |
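The sizes above follow directly from parameter count and bits per weight. A rough rule of thumb as a sketch; the ~20% overhead factor is an assumption covering runtime buffers, not a measured constant:

```python
def estimate_model_memory_gb(params_billion: float,
                             bits_per_weight: float,
                             overhead: float = 1.2) -> float:
    """Rough memory footprint: parameters * bits-per-weight / 8, plus overhead.

    Q4_K_M averages roughly 4.5 bits/weight, Q8_0 roughly 8.5, FP16 exactly 16.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# Llama 3.2 3B at Q4_K_M: roughly 2 GB, in line with the table above
print(estimate_model_memory_gb(3, 4.5))  # → 2.0
```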

Memory Management

Models in Docker Model Runner:

  • Load on-demand: First request loads the model
  • Unload when idle: Memory released after inactivity
  • Pre-loading: Use docker model run -d to keep model loaded

Monitoring and Cleanup

# Check disk usage by models
docker model system df

# View detailed model info
docker model inspect ai/llama3.2

# Remove unused models
docker model system prune

# Remove ALL models (use with caution)
docker model system prune -a

# Monitor GPU (NVIDIA)
nvidia-smi

# Check runner status
docker model status

Troubleshooting

Model Not Responding

# Check runner status
docker model status

# Restart runner (Linux)
docker model restart-runner

# Check if model is loaded
docker model list

# Try with debug output
docker model run --debug ai/llama3.2 "Test"

GPU Not Detected

NVIDIA:

# Verify NVIDIA runtime
docker info | grep -i runtime

# Check GPU visibility
nvidia-smi

# Ensure --gpu cuda flag is used
docker model run --gpu cuda ai/llama3.2 "Test"

Apple Silicon:

  • GPU via Metal should work automatically
  • Ensure Docker Desktop is up to date

Port 12434 Not Accessible

  1. Enable Host-side TCP support in Docker Desktop settings
  2. Check for firewall blocking the port
  3. Verify with: curl http://localhost:12434/engines/v1/models

Out of Memory

# Use smaller quantization
docker model pull ai/llama3.2:3B-Q4_K_M

# Clean up unused models
docker model system prune

# Reduce context size in Compose
models:
  llm:
    model: ai/llama3.2
    context_size: 2048  # Smaller context

Best Practices

1. Use Appropriate Quantization

# Q4_K_M: Best balance of quality and memory
docker model pull ai/llama3.2:3B-Q4_K_M

# Q5_K_M: Slightly better quality
docker model pull ai/llama3.2:3B-Q5_K_M

# Q8_0: Near-full quality, more memory
docker model pull ai/llama3.2:3B-Q8_0

2. Pre-load Models for Production

# Load model in background for faster first request
docker model run -d ai/llama3.2

3. Set Appropriate Context Size

A larger context window consumes more memory, since the KV cache grows linearly with context length. Use the smallest context that works:

models:
  llm:
    model: ai/llama3.2
    context_size: 2048  # Start small, increase if needed
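The memory cost of context comes mostly from the KV cache. A sketch of the arithmetic; the layer, head, and dimension figures used in the example are illustrative for a 3B-class model, not exact Llama 3.2 values:

```python
def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * context * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Doubling the context doubles the cache (illustrative 3B-class shape, FP16 cache)
small = kv_cache_gb(2048, 28, 8, 128)
large = kv_cache_gb(4096, 28, 8, 128)
print(round(small, 2), round(large, 2))  # → 0.23 0.47
```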

4. Monitor Resources

# Regular cleanup
docker model system prune

# Check disk usage
docker model system df

Key Takeaways

  1. Docker Model Runner is Docker's native AI inference solution
  2. OpenAI-compatible API enables drop-in replacement for existing code
  3. Docker Compose integration makes AI orchestration familiar and simple
  4. GPU acceleration works across Apple Silicon, NVIDIA, and Vulkan
  5. Performance matches Ollama, with Docker-native advantages
  6. Models load on-demand and unload when idle for efficient memory use
  7. Best for Docker-centric workflows, Compose deployments, and CI/CD

Next Steps

  1. Compare with Ollama and LM Studio
  2. Set up RAG with Docker Model Runner
  3. Build AI agents using the OpenAI API
  4. Choose the best models for your use case

Docker Model Runner brings AI inference into the Docker ecosystem, making it easy to run LLMs alongside your containerized applications with familiar tools, commands, and workflows. Whether you're developing locally or deploying to production, Docker Model Runner provides a seamless AI experience for Docker users.

Written by Pattanaik Ramswarup

AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset

I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.

✓ 10+ Years in ML/AI✓ 77K Dataset Creator✓ Open Source Contributor
Free Tools & Calculators