
Ramalama Setup Guide (2026): Container-Native Local LLMs from Red Hat

May 1, 2026
22 min read
LocalAimaster Research Team


Ramalama is Red Hat's answer to "what does Ollama look like through a container-engine lens?" The CLI is familiar — ramalama run llama3.1 — but every model and runtime lives inside an OCI container with proper image signing, SBOM, and Kubernetes-native deployment paths. For organizations that already standardize on Podman / OpenShift / Sigstore, Ramalama is the local-LLM tool that fits the existing operational model rather than introducing a new daemon.

This guide covers everything: installation across Linux / macOS / Windows, OCI model artifacts, GPU auto-detection, multi-modality (LLM / image / audio / embeddings), Podman quadlets and systemd integration, Kubernetes deployment, signing and supply chain, and how Ramalama compares to Ollama and LocalAI for enterprise use.

Table of Contents

  1. What Ramalama Is
  2. Why Containers for Local LLMs?
  3. Hardware & Software Requirements
  4. Installation: Linux, macOS, Windows
  5. Your First Model
  6. Model Sources: HF, Ollama, OCI
  7. GPU Auto-Detection
  8. Serving with OpenAI-Compatible API
  9. Multimodal: Whisper, SD, Embeddings
  10. Podman Quadlets and systemd
  11. Kubernetes Deployment
  12. Image Signing and Supply Chain
  13. Air-Gapped Registries
  14. Ramalama vs Ollama vs LocalAI
  15. Troubleshooting

What Ramalama Is {#what-it-is}

Ramalama is a Python CLI + container orchestration layer that:

  • Runs LLMs inside OCI containers via Podman (default) or Docker
  • Auto-detects GPU runtimes and pulls matching base images (CUDA, ROCm, Vulkan, Metal)
  • Pulls models from Hugging Face, Ollama registry, or OCI registries (Quay, GHCR, Docker Hub)
  • Generates Podman quadlets / Kubernetes YAML for deployment
  • Signs and verifies images via Sigstore

Project: github.com/containers/ramalama. Maintained by the containers project (Podman / Buildah / Skopeo team) at Red Hat.


Why Containers for Local LLMs? {#why-containers}

Several reasons that matter at scale:

  1. Reproducibility — same image, same hash, same behavior across dev / staging / prod.
  2. Sandboxing — model + runtime confined; no Python venvs polluting the host.
  3. Supply chain — signed images with provenance, SBOMs, and CVE scanning.
  4. Kubernetes-native — same artifacts deploy locally and in production.
  5. Multi-runtime — different models can use different inference engines without conflicts.
  6. GPU isolation — NVIDIA / AMD / Intel runtimes coexist on the same host.
  7. Air-gapped friendly — pull once into a private registry, deploy anywhere.

Tradeoff: container cold start adds 1-3 seconds vs Ollama. For desktop chat use, that is unnoticeable; for hot-path serverless inference, it is a downside.


Hardware & Software Requirements {#requirements}

| Component | Minimum | Recommended |
| --- | --- | --- |
| OS | Linux, macOS, Windows (WSL2) | RHEL / Fedora / Ubuntu |
| Container engine | Podman 4.x or Docker 24+ | Podman 5.x |
| RAM | 8 GB | 32 GB+ |
| GPU | None (CPU works) | 12 GB+ VRAM |
| Disk | 20 GB | NVMe |
| GPU runtime | nvidia-container-toolkit / rocm container runtime | Latest |

Installation: Linux, macOS, Windows {#installation}

Linux (Fedora / RHEL / CentOS Stream)

sudo dnf install ramalama

Linux (Debian / Ubuntu)

pip install ramalama

macOS

brew install ramalama

Windows (via WSL2)

# Inside WSL2 Ubuntu
pip install ramalama

Verify

ramalama --version
ramalama info        # shows detected GPU runtime

Your First Model {#first-model}

# Pulls and runs Llama 3.1 8B
ramalama run llama3.1:8b

# Behind the scenes:
# 1. Pulls quay.io/ramalama/cuda:latest (or rocm / vulkan / cpu)
# 2. Pulls model from Ollama registry
# 3. Starts container with model mounted
# 4. Drops you into chat REPL

For non-interactive single-shot:

ramalama run llama3.1:8b "What is local AI?"

Model Sources: HF, Ollama, OCI {#model-sources}

Hugging Face

ramalama run huggingface://bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# or for llamafile
ramalama run hf://Mozilla/Llama-3.1-8B-Instruct-llamafile

Ollama registry

ramalama run ollama://llama3.1:8b

OCI registry (Ramalama-native artifacts)

ramalama run oci://quay.io/ramalama/llama-3.1-8b-instruct:latest

Push your own model as OCI

ramalama convert huggingface://bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    oci://quay.io/your-org/llama3.1:8b-q4-k-m

ramalama push oci://quay.io/your-org/llama3.1:8b-q4-k-m

This is the killer feature for enterprise: your fine-tuned model becomes a signed, scannable, versioned artifact in your existing registry alongside your other container images.
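
Because the result is an ordinary OCI artifact, your existing registry tooling can see it. For example, a quick sanity check with skopeo (a sketch, assuming skopeo is installed and you are logged in to the registry):

# Inspect the pushed model artifact like any other container image
skopeo inspect docker://quay.io/your-org/llama3.1:8b-q4-k-m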


GPU Auto-Detection {#gpu}

ramalama info
# {"GPU": "nvidia", "Driver": "555.42", "VRAM": "24576"}

Force a runtime:

ramalama --gpu nvidia run llama3.1:8b
ramalama --gpu rocm run llama3.1:8b
ramalama --gpu vulkan run llama3.1:8b
ramalama --gpu metal run llama3.1:8b   # Mac
ramalama --gpu none run llama3.1:8b    # CPU only

The container images are tagged accordingly: quay.io/ramalama/cuda, quay.io/ramalama/rocm, quay.io/ramalama/vulkan, and so on.


Serving with OpenAI-Compatible API {#serve}

ramalama serve --port 8080 --host 0.0.0.0 llama3.1:8b

This exposes OpenAI-compatible endpoints under http://localhost:8080/v1/..., the same API surface as Ollama, LocalAI, and vLLM.
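
For example, a chat completion over that endpoint with curl (a sketch; the model field is an assumption, so check the name the server actually reports, e.g. via GET /v1/models):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "What is local AI?"}]
      }'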

For detached mode (long-running):

ramalama serve --detach --name llama-svc llama3.1:8b
ramalama list                     # shows running services
ramalama logs llama-svc           # logs from the container
ramalama stop llama-svc

Multimodal: Whisper, SD, Embeddings {#multimodal}

# Whisper STT
ramalama run whisper://medium audio.wav

# Stable Diffusion
ramalama serve stablediffusion-xl

# Embeddings
ramalama serve nomic-embed-text-v1.5

Each spins up a different container image with the appropriate backend (whisper.cpp, diffusers, llama.cpp embeddings).
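
The served endpoints keep the same OpenAI-style shape across modalities. For example, querying the embeddings service (a sketch, assuming it is serving on port 8080 and the model name matches what was started above):

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text-v1.5", "input": "container-native local AI"}'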


Podman Quadlets and systemd {#quadlet}

Generate a quadlet:

ramalama generate quadlet llama3.1:8b > ~/.config/containers/systemd/llama-svc.container
systemctl --user daemon-reload
systemctl --user start llama-svc.service

The quadlet file describes the container in systemd-native syntax. systemd handles startup, restart, logging via journalctl, and resource limits.
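
For orientation, a generated quadlet looks roughly like the sketch below. The field values are illustrative assumptions, not the exact output; always inspect the file Ramalama actually emits.

# Illustrative quadlet sketch; values are assumptions
[Unit]
Description=Ramalama llama3.1:8b service

[Container]
Image=quay.io/ramalama/cuda:latest
Exec=llama-server --model /mnt/models/model.file --port 8080
PublishPort=8080:8080
AddDevice=nvidia.com/gpu=all

[Service]
Restart=always

[Install]
WantedBy=default.target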


Kubernetes Deployment {#kubernetes}

ramalama generate kube llama3.1:8b > llama-svc.yaml
kubectl apply -f llama-svc.yaml

The output is a Deployment plus a Service and, optionally, an Ingress / Route. Edit it before applying in production (a sketch of these edits follows the list):

  • Resource requests / limits (nvidia.com/gpu: 1)
  • PVC for model cache
  • Readiness probes
  • HPA on localai_requests_running or custom metric
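
A hedged sketch of those edits as an excerpt to merge into the generated Deployment (the probe path, mount path, and claim name are assumptions; adapt to what ramalama generate kube actually emits):

# Excerpt to merge into the generated Deployment (illustrative values)
spec:
  template:
    spec:
      containers:
        - name: llama-svc
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 24Gi
          readinessProbe:
            httpGet:
              path: /v1/models          # assumes the OpenAI-compatible server answers here
              port: 8080
          volumeMounts:
            - name: model-cache
              mountPath: /var/lib/ramalama   # assumption: wherever the image caches models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: ramalama-model-cache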

For OpenShift AI / RHEL AI:

ramalama generate kube --openshift llama3.1:8b > llama-svc.yaml

Adds OpenShift-specific routes, security context constraints, and Operator integration.


Image Signing and Supply Chain {#signing}

Sign with Sigstore cosign:

cosign sign --yes quay.io/your-org/llama3.1:8b

Verify before run:

ramalama run --verify-signature oci://quay.io/your-org/llama3.1:8b
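
If you sign with your own key pair rather than keyless signing, you can also verify directly with cosign (assuming cosign.pub is the public key you distribute to consumers):

cosign verify --key cosign.pub quay.io/your-org/llama3.1:8b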

Generate SBOM:

ramalama sbom oci://quay.io/your-org/llama3.1:8b > sbom.spdx.json

For regulated environments (HIPAA, SOC2, EU AI Act), supply-chain transparency on AI artifacts is increasingly required. See HIPAA Compliant Local AI, SOC2 Self-Hosted AI, EU AI Act Local Compliance.


Air-Gapped Registries {#air-gapped}

For air-gapped deployments:

  1. On internet-connected jump host: ramalama pull oci://quay.io/ramalama/llama3.1:8b
  2. Save image: podman save quay.io/ramalama/llama3.1:8b > llama.tar
  3. Transfer to air-gapped network
  4. Load: podman load < llama.tar
  5. Push to internal registry (the loaded image still carries its quay.io name, so give push an explicit destination): podman push quay.io/ramalama/llama3.1:8b localhost:5000/llama3.1:8b
  6. ramalama pull oci://localhost:5000/llama3.1:8b on each node

Or use skopeo copy directly between registries — no intermediate file. See Air-Gapped AI Deployment.
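
A sketch of that direct copy (the internal registry hostname is a placeholder; add --src-creds / --dest-creds or --dest-tls-verify=false as your registries require):

skopeo copy \
  docker://quay.io/ramalama/llama3.1:8b \
  docker://registry.internal.example:5000/llama3.1:8b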


Ramalama vs Ollama vs LocalAI {#comparison}

| Property | Ollama | LocalAI | Ramalama |
| --- | --- | --- | --- |
| Daemon model | Long-running | Long-running | Per-call container |
| Model registry | Custom (Ollama) | Gallery YAML | OCI registries |
| Container-native | Optional (Docker) | Optional | Yes (default) |
| Multimodal | LLMs + vision + embeddings | Full multimodal | Multimodal via images |
| Image signing | No | No | Yes (Sigstore) |
| OpenShift / K8s integration | Manual | Helm | Native quadlet/kube |
| GPU auto-detect | Yes | Per-image | Yes |
| Best fit | Desktop / dev | Drop-in OpenAI | Enterprise / regulated |

For solo developer use, Ollama is simpler. For OpenAI-API parity, LocalAI. For OpenShift / Podman / regulated environments, Ramalama.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
| --- | --- | --- |
| GPU not detected | Container toolkit missing | Install nvidia-container-toolkit / rocm-container-runtime |
| Pull fails | Registry auth | podman login quay.io |
| OOM | Container memory limit | --memory 24g or remove limit |
| Slow first run | Image pull (~3-8 GB) | Subsequent runs are fast |
| Ollama-format model errors | Format conversion needed | ramalama convert ollama://... oci://... |
| WSL2: GPU not visible | NVIDIA WSL driver | Install on Windows host, not WSL |
| systemd service won't start | Quadlet syntax | systemctl --user --failed for hints |
| Sigstore signature mismatch | Untrusted publisher | Add public key to allowlist |

Sources: Ramalama GitHub | Podman docs | OpenShift AI | Sigstore cosign | internal benchmarks (NVIDIA, AMD).
