Ramalama Setup Guide (2026): Container-Native Local LLMs from Red Hat
Ramalama is Red Hat's answer to "what does Ollama look like through a container-engine lens?" The CLI is familiar — ramalama run llama3.1 — but every model and runtime lives inside an OCI container with proper image signing, SBOM, and Kubernetes-native deployment paths. For organizations that already standardize on Podman / OpenShift / Sigstore, Ramalama is the local-LLM tool that fits the existing operational model rather than introducing a new daemon.
This guide covers everything: installation across Linux / macOS / Windows, OCI model artifacts, GPU auto-detection, multi-modality (LLM / image / audio / embeddings), Podman quadlets and systemd integration, Kubernetes deployment, signing and supply chain, and how Ramalama compares to Ollama and LocalAI for enterprise use.
Table of Contents
- What Ramalama Is
- Why Containers for Local LLMs?
- Hardware & Software Requirements
- Installation: Linux, macOS, Windows
- Your First Model
- Model Sources: HF, Ollama, OCI
- GPU Auto-Detection
- Serving with OpenAI-Compatible API
- Multimodal: Whisper, SD, Embeddings
- Podman Quadlets and systemd
- Kubernetes Deployment
- Image Signing and Supply Chain
- Air-Gapped Registries
- Ramalama vs Ollama vs LocalAI
- Troubleshooting
- FAQ
What Ramalama Is {#what-it-is}
Ramalama is a Python CLI + container orchestration layer that:
- Runs LLMs inside OCI containers via Podman (default) or Docker
- Auto-detects GPU runtimes and pulls matching base images (CUDA, ROCm, Vulkan, Metal)
- Pulls models from Hugging Face, Ollama registry, or OCI registries (Quay, GHCR, Docker Hub)
- Generates Podman quadlets / Kubernetes YAML for deployment
- Signs and verifies images via Sigstore
Project: github.com/containers/ramalama. Maintained by the containers project (Podman / Buildah / Skopeo team) at Red Hat.
Why Containers for Local LLMs? {#why-containers}
Several reasons that matter at scale:
- Reproducibility — same image, same hash, same behavior across dev / staging / prod.
- Sandboxing — model + runtime confined; no Python venvs polluting the host.
- Supply chain — signed images with provenance, SBOMs, and CVE scanning.
- Kubernetes-native — same artifacts deploy locally and in production.
- Multi-runtime — different models can use different inference engines without conflicts.
- GPU isolation — NVIDIA / AMD / Intel runtimes coexist on the same host.
- Air-gapped friendly — pull once into a private registry, deploy anywhere.
Tradeoff: container cold start adds 1-3 seconds vs Ollama. For desktop chat use, that is unnoticeable; for hot-path serverless inference, it is a downside.
Hardware & Software Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| OS | Linux, macOS, Windows (WSL2) | RHEL / Fedora / Ubuntu |
| Container engine | Podman 4.x or Docker 24+ | Podman 5.x |
| RAM | 8 GB | 32 GB+ |
| GPU | None (CPU works) | 12 GB+ VRAM |
| Disk | 20 GB | NVMe |
| GPU runtime | nvidia-container-toolkit / rocm container runtime | Latest |
Installation: Linux, macOS, Windows {#installation}
Linux (Fedora / RHEL / CentOS Stream)
sudo dnf install ramalama
Linux (Debian / Ubuntu)
pip install ramalama
macOS
brew install ramalama
Windows (via WSL2)
# Inside WSL2 Ubuntu
pip install ramalama
Verify
ramalama --version
ramalama info # shows detected GPU runtime
Your First Model {#first-model}
# Pulls and runs Llama 3.1 8B
ramalama run llama3.1:8b
# Behind the scenes:
# 1. Pulls quay.io/ramalama/cuda:latest (or rocm / vulkan / cpu)
# 2. Pulls model from Ollama registry
# 3. Starts container with model mounted
# 4. Drops you into chat REPL
For non-interactive single-shot:
ramalama run llama3.1:8b "What is local AI?"
Model Sources: HF, Ollama, OCI {#model-sources}
Hugging Face
ramalama run huggingface://bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# or for llamafile
ramalama run hf://Mozilla/Llama-3.1-8B-Instruct-llamafile
Ollama registry
ramalama run ollama://llama3.1:8b
OCI registry (Ramalama-native artifacts)
ramalama run oci://quay.io/ramalama/llama-3.1-8b-instruct:latest
Push your own model as OCI
ramalama convert huggingface://bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
oci://quay.io/your-org/llama3.1:8b-q4-k-m
ramalama push oci://quay.io/your-org/llama3.1:8b-q4-k-m
This is the killer feature for enterprise: your fine-tuned model becomes a signed, scannable, versioned artifact in your existing registry alongside your other container images.
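Once the model is in your registry, any OCI tool can inspect or mirror it. For example, assuming the push above succeeded, skopeo can show the manifest, digest, and labels without pulling the artifact:
# Inspect the pushed model artifact straight from the registry
skopeo inspect docker://quay.io/your-org/llama3.1:8b-q4-k-m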
GPU Auto-Detection {#gpu}
ramalama info
# {"GPU": "nvidia", "Driver": "555.42", "VRAM": "24576"}
Force a runtime:
ramalama --gpu nvidia run llama3.1:8b
ramalama --gpu rocm run llama3.1:8b
ramalama --gpu vulkan run llama3.1:8b
ramalama --gpu metal run llama3.1:8b # Mac
ramalama --gpu none run llama3.1:8b # CPU only
The runtime container images are tagged accordingly: quay.io/ramalama/cuda, quay.io/ramalama/rocm, quay.io/ramalama/vulkan, etc.
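Because the runtime ships as a regular image, you can confirm which one was actually pulled with Podman itself (the exact tags and sizes on your host will differ):
# List the Ramalama runtime images present locally
podman images | grep ramalama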
Serving with OpenAI-Compatible API {#serve}
ramalama serve --port 8080 --host 0.0.0.0 llama3.1:8b
OpenAI-compatible endpoints at http://localhost:8080/v1/.... Identical surface to Ollama, LocalAI, and vLLM.
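Anything that already speaks the OpenAI API can point at it. A quick smoke test with curl against the chat completions route; the model field is matched loosely by most backends, so passing the served name is a safe default:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "What is local AI?"}]
      }'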
For detached mode (long-running):
ramalama serve --detach --name llama-svc llama3.1:8b
ramalama list # shows running services
ramalama logs llama-svc # logs from the container
ramalama stop llama-svc
Multimodal: Whisper, SD, Embeddings {#multimodal}
# Whisper STT
ramalama run whisper://medium audio.wav
# Stable Diffusion
ramalama serve stablediffusion-xl
# Embeddings
ramalama serve nomic-embed-text-v1.5
Each spins up a different container image with the appropriate backend (whisper.cpp, diffusers, llama.cpp embeddings).
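For the embeddings case, assuming the container exposes the usual OpenAI-style route (llama.cpp-based backends do), a request looks like this; adjust the port to whatever you passed to serve:
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text-v1.5", "input": "local AI runs on your own hardware"}'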
Podman Quadlets and systemd {#quadlet}
Generate a quadlet:
ramalama generate quadlet llama3.1:8b > ~/.config/containers/systemd/llama-svc.container
systemctl --user daemon-reload
systemctl --user start llama-svc.service
The quadlet file describes the container in systemd-native syntax. systemd handles startup, restart, logging via journalctl, and resource limits.
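From there the service is managed like any other user unit with standard systemd commands; enable-linger keeps user services alive when you are not logged in:
systemctl --user enable llama-svc.service      # start automatically at login
systemctl --user status llama-svc.service      # health and restart counter
journalctl --user -u llama-svc.service -f      # follow the container logs
loginctl enable-linger "$USER"                 # keep user services running after logout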
Kubernetes Deployment {#kubernetes}
ramalama generate kube llama3.1:8b > llama-svc.yaml
kubectl apply -f llama-svc.yaml
Output is a Deployment + Service + (optionally) Ingress / Route. Edit before apply for production:
- Resource requests / limits (nvidia.com/gpu: 1)
- PVC for model cache
- Readiness probes
- HPA on a requests-in-flight or other custom metric
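After kubectl apply, a quick smoke test. This sketch assumes the generated Deployment and Service are named llama-svc to match the filename; adjust the names to whatever the YAML actually contains:
kubectl rollout status deployment/llama-svc        # wait for the pod to become Ready
kubectl port-forward svc/llama-svc 8080:8080 &     # expose the service locally
curl http://localhost:8080/v1/models               # OpenAI-compatible model listing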
For OpenShift AI / RHEL AI:
ramalama generate kube --openshift llama3.1:8b > llama-svc.yaml
Adds OpenShift-specific routes, security context constraints, and Operator integration.
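On OpenShift the apply step is the same with oc; the exposed route host comes from whatever the generated YAML defines:
oc apply -f llama-svc.yaml
oc get routes          # note the host serving the inference endpoint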
Image Signing and Supply Chain {#signing}
Sign with Sigstore cosign:
cosign sign --yes quay.io/your-org/llama3.1:8b
Verify before run:
ramalama run --verify-signature oci://quay.io/your-org/llama3.1:8b
Generate SBOM:
ramalama sbom oci://quay.io/your-org/llama3.1:8b > sbom.spdx.json
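The SBOM plugs into standard scanners. For example, assuming Grype is installed, you can check the recorded components against known CVEs:
# Scan the generated SBOM for known vulnerabilities
grype sbom:sbom.spdx.json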
For regulated environments (HIPAA, SOC2, EU AI Act), supply-chain transparency on AI artifacts is increasingly required. See HIPAA Compliant Local AI, SOC2 Self-Hosted AI, EU AI Act Local Compliance.
Air-Gapped Registries {#air-gapped}
For air-gapped deployments:
- On an internet-connected jump host: ramalama pull oci://quay.io/ramalama/llama3.1:8b
- Save the image: podman save quay.io/ramalama/llama3.1:8b > llama.tar
- Transfer the archive to the air-gapped network
- Load it: podman load < llama.tar
- Retag and push to the internal registry: podman tag quay.io/ramalama/llama3.1:8b localhost:5000/llama3.1:8b && podman push localhost:5000/llama3.1:8b
- ramalama pull oci://localhost:5000/llama3.1:8b on each node
Or use skopeo copy directly between registries — no intermediate file. See Air-Gapped AI Deployment.
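A sketch of the registry-to-registry path, assuming a hypothetical internal registry at registry.internal.example:5000 reachable from the jump host:
# Add --dest-creds USER:PASS or --dest-tls-verify=false as your registry requires
skopeo copy \
  docker://quay.io/ramalama/llama3.1:8b \
  docker://registry.internal.example:5000/ramalama/llama3.1:8b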
Ramalama vs Ollama vs LocalAI {#comparison}
| Property | Ollama | LocalAI | Ramalama |
|---|---|---|---|
| Daemon model | Long-running | Long-running | Per-call container |
| Model registry | Custom (Ollama) | Gallery YAML | OCI registries |
| Container-native | Optional (Docker) | Optional | Yes (default) |
| Multimodal | LLMs + vision + embeddings | Full multimodal | Multimodal via images |
| Image signing | No | No | Yes (Sigstore) |
| OpenShift / K8s integration | Manual | Helm | Native quadlet/kube |
| GPU auto-detect | Yes | Per-image | Yes |
| Best fit | Desktop / dev | Drop-in OpenAI | Enterprise / regulated |
For solo developer use, Ollama is simpler. For OpenAI-API parity, LocalAI. For OpenShift / Podman / regulated environments, Ramalama.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| GPU not detected | Container toolkit missing | Install nvidia-container-toolkit / rocm-container-runtime |
| Pull fails | Registry auth | podman login quay.io |
| OOM | Container memory limit | --memory 24g or remove limit |
| Slow first run | Image pull (~3-8 GB) | Subsequent runs are fast |
| Ollama-format model errors | Format conversion needed | ramalama convert ollama://... oci://... |
| WSL2: GPU not visible | NVIDIA WSL driver | Install on Windows host, not WSL |
| systemd service won't start | quadlet syntax | systemctl --user --failed for hints |
| Sigstore signature mismatch | Untrusted publisher | Add public key to allowlist |
FAQ {#faq}
See answers to common Ramalama questions below.
Sources: Ramalama GitHub | Podman docs | OpenShift AI | Sigstore cosign | internal benchmarks on NVIDIA and AMD hardware.