Llamafile Setup Guide (2026): Run LLMs from a Single Cross-Platform Executable
Llamafile is the simplest LLM distribution ever made. One file. No install. No dependencies. No daemon. Runs natively on Linux, macOS, Windows, and BSD from the same binary thanks to Justine Tunney's Cosmopolitan Libc. Mozilla maintains it as an experiment in delivering AI as a single self-contained artifact.
This guide covers everything: how to download and run pre-built Llamafiles, GPU acceleration on each platform, building your own .llamafile from a GGUF model, the OpenAI-compatible API, sampling, web UI, and the niche where Llamafile beats every other option.
Table of Contents
- What Llamafile Is
- How Cosmopolitan Libc Works
- Hardware Requirements
- Running a Pre-Built Llamafile
- GPU Acceleration: NVIDIA, AMD, Apple
- Web UI
- OpenAI-Compatible API
- The Windows 4 GB Executable Limit
- Building a Custom Llamafile
- Sampling and Configuration Flags
- Embedding Llamafile in a Distribution
- Performance vs llama.cpp / Ollama
- Use Cases Where Llamafile Wins
- Troubleshooting
What Llamafile Is {#what-it-is}
A Llamafile is a single executable that bundles:
- A llama.cpp binary built with Cosmopolitan Libc
- Optional embedded model weights (GGUF concatenated)
- A built-in web UI
- An OpenAI-compatible REST server
Run it. Browse to localhost. Have a local LLM. That is the entire experience.
Project: github.com/Mozilla-Ocho/llamafile. Maintained by Mozilla and Justine Tunney.
How Cosmopolitan Libc Works {#cosmopolitan}
The Cosmopolitan Libc compiler (cosmocc) produces "Actually Portable Executable" (APE) files. Each APE is simultaneously:
- A valid Windows PE+ executable
- A valid Linux ELF executable
- A valid macOS Mach-O executable (via shim)
- A valid FreeBSD / NetBSD / OpenBSD executable
- A valid POSIX shell script
When you double-click on Windows it runs as PE. On Linux, the kernel reads ELF headers. On Mac, a small bootstrap converts it to Mach-O at first run. The same machine code services all of them through portable syscall wrappers.
This is the foundation that makes Llamafile a single file across OSes.
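The polyglot trick is visible in the file's first bytes: every APE opens with the magic string `MZqFpD`, whose leading `MZ` satisfies the Windows PE loader while the full header also parses as shell source. A quick sketch for spotting an APE (the file path is illustrative):

```python
# Identify an Actually Portable Executable (APE) by its polyglot magic.
# "MZ" is the DOS/PE signature; the full "MZqFpD" string is also valid
# shell text, which is what lets one file serve every platform.
APE_MAGIC = b"MZqFpD"

def is_ape(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(len(APE_MAGIC)) == APE_MAGIC
```

For example, `is_ape("model.llamafile")` should return True for any Llamafile you download.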
Hardware Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| OS | Linux, macOS 10.14+, Windows 10+, FreeBSD, OpenBSD, NetBSD | Same |
| CPU | x86_64 with AVX2, ARM64 | Modern 8-core+ |
| RAM | 8 GB | 16-32 GB |
| GPU | None (CPU works) | 8 GB+ VRAM for medium models |
| Disk | 1-50 GB depending on model | NVMe |
Llamafile gracefully falls back to CPU when GPU is unavailable. It runs on hardware most other LLM tools won't touch.
Running a Pre-Built Llamafile {#running}
Linux / macOS
# Download (e.g., Llama 3.2 3B)
wget https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile/resolve/main/Llama-3.2-3B-Instruct.Q6_K.llamafile
chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile
# Run with web UI + OpenAI API on port 8080
./Llama-3.2-3B-Instruct.Q6_K.llamafile -ngl 999
Windows
Rename to .exe (Windows requires the extension):
ren Llama-3.2-3B-Instruct.Q6_K.llamafile Llama-3.2-3B-Instruct.Q6_K.exe
.\Llama-3.2-3B-Instruct.Q6_K.exe -ngl 999
Or run via WSL2 / Git Bash without the rename.
Mozilla's published Llamafiles (2026)
| Model | Size | Use |
|---|---|---|
| TinyLlama-1.1B | 0.7 GB | Embedded, edge |
| Llama-3.2-1B-Instruct | 0.8 GB | Mobile-class |
| Llama-3.2-3B-Instruct | 2.2 GB | Sweet-spot small |
| Qwen-2.5-3B-Instruct | 2.0 GB | Strong tiny model |
| Phi-3.5-Mini-Instruct | 2.3 GB | Microsoft's small reasoning model |
| Gemma-2-2B-Instruct | 1.6 GB | Google's small chat model |
| Mistral-7B-Instruct-v0.3 | 4.4 GB | Classic 7B |
| Llama-3.1-8B-Instruct | 4.9 GB | Standard 8B |
| LLaVA-1.5-7B (multimodal) | 4.0 GB | Vision + text |
Bigger models (70B class) ship as ZIP archives separating binary from weights to dodge the Windows 4 GB limit.
GPU Acceleration: NVIDIA, AMD, Apple {#gpu}
NVIDIA CUDA
./model.llamafile -ngl 999
Llamafile auto-detects NVIDIA GPUs. With only the driver installed it uses its bundled tinyBLAS kernels; if the CUDA toolkit is present, it can compile faster cuBLAS-backed kernels on first run, and it falls back to CPU when no GPU is found. Set LLAMAFILE_GPU=NVIDIA to force GPU use.
Apple Metal
./model.llamafile -ngl 999
Metal is auto-enabled on Mac. No additional setup.
AMD ROCm
./model.llamafile -ngl 999
# If issues:
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./model.llamafile -ngl 999
See AMD ROCm Setup for AMD specifics.
CPU-only
Drop -ngl or set to 0. Llamafile uses optimized CPU kernels (tinyBLAS) — performance on a modern Ryzen / Apple Silicon CPU is surprisingly competitive for small models.
Web UI {#ui}
Browse to http://localhost:8080 after launching. Features:
- Chat / completion playground
- Sampler controls (temperature, top-p, top-k, mirostat, penalties)
- Prompt template editor
- System prompt
- Image upload (for vision Llamafiles)
- Save / load conversations
The UI is the standard llama.cpp web UI plus minor branding.
OpenAI-Compatible API {#api}
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b",
"messages": [{"role":"user","content":"Hello"}]
}'
Endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings (with embedding models), and /health. Streaming via stream: true works.
Drop your existing OpenAI client onto http://localhost:8080/v1 and it works unchanged.
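Because the endpoint mimics OpenAI's schema, even a plain stdlib client works. A minimal sketch, assuming the server is running on localhost:8080 as launched above (the model name is just a label here; no OpenAI SDK required):

```python
# Minimal stdlib client for the local Llamafile server.
import json
import urllib.request

def build_chat_request(messages, base_url="http://localhost:8080/v1"):
    """Build an OpenAI-style chat completion request (no network I/O)."""
    body = json.dumps({"model": "llama-3.2-3b", "messages": messages}).encode()
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def chat(messages):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Usage: `chat([{"role": "user", "content": "Hello"}])` returns the model's reply as a string.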
The Windows 4 GB Executable Limit {#windows-limit}
Windows PE32+ has a 4 GB image-size limit. Workaround for larger models:
# Run separated: small launcher + external GGUF
./llamafile -m big-model.Q5_K_M.gguf -ngl 999 -c 8192
Mozilla provides minimal llamafile and llamafile-server launchers (~25 MB) that work this way. Useful when:
- You need 7B+ models on Windows
- You want to share one binary across many models
- The model is updated frequently
Building a Custom Llamafile {#building}
Quick concatenation
# Download the launcher binary
wget https://github.com/Mozilla-Ocho/llamafile/releases/latest/download/llamafile-server-0.9.x
# Concatenate with your GGUF model
cat llamafile-server-0.9.x my-model-Q5_K_M.gguf > my-model.llamafile
chmod +x my-model.llamafile
With proper alignment (recommended)
# Use zipalign for memory-mapped loading
./zipalign -j0 my-model.llamafile llamafile-server-0.9.x my-model-Q5_K_M.gguf
The zipalign step matters: without proper alignment, the weights are copied into RAM instead of memory-mapped, which slows startup and can roughly double peak memory use.
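You can also bake default flags into the archive: a file literally named `.args`, with one flag per line, becomes the default command line when appended with zipalign. A sketch under the same placeholder filenames as above:

```shell
# Write default flags, one per line, to a file named ".args".
cat > .args << 'EOF'
-m
my-model-Q5_K_M.gguf
--host
0.0.0.0
EOF

# Append it to the archive (guarded so the sketch is safe to paste
# even where the zipalign binary isn't present).
if [ -x ./zipalign ]; then
  ./zipalign -j0 my-model.llamafile .args
fi
```

After this, running `./my-model.llamafile` with no arguments picks up those defaults; flags passed on the command line still override them.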
From source
git clone https://github.com/Mozilla-Ocho/llamafile
cd llamafile
make -j8
Produces o//llamafile (CLI) and o//llamafile-server (REST server). Build requires cosmocc toolchain — the build system handles fetching it.
Sampling and Configuration Flags {#config}
./model.llamafile \
-m my-model.gguf \
-ngl 999 \
-c 16384 \
-fa \
--temp 0.7 \
--top-p 0.9 \
--min-p 0.05 \
--repeat-penalty 1.05 \
--port 8080 \
--host 0.0.0.0 \
--api-key sk-yourkey
Standard llama.cpp flags. See LLM Sampling Parameters for what each does.
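Most sampling settings can also be overridden per request through the API; the server flags above just set the defaults. A hedged example request body (these parameter names follow the OpenAI / llama.cpp server conventions):

```json
{
  "model": "llama-3.2-3b",
  "messages": [{"role": "user", "content": "Hello"}],
  "temperature": 0.7,
  "top_p": 0.9,
  "stream": true
}
```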
Embedding Llamafile in a Distribution {#distribution}
Llamafile is ideal for shipping LLMs as part of a larger distribution:
- Software products: ship an offline AI feature as one bundled .llamafile.
- Conference USB sticks: a 5 GB stick with a Llamafile + a folder of demos.
- Air-gapped sites: download once on internet-connected machine, copy to air-gapped LAN.
- CI / build servers: drop a Llamafile into the workflow for code reviews / documentation generation.
- Embedded devices: Llamafile runs on Raspberry Pi, NVIDIA Jetson, and even some routers.
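In a CI pipeline or bundled product, the one operational step is waiting for the server to come up before using it. A sketch that polls the /health endpoint (the URL, port, and timeout are assumptions; adjust for your setup):

```python
# Poll the Llamafile server's /health endpoint until it responds,
# e.g. after launching it in the background in a CI job.
import time
import urllib.error
import urllib.request

def wait_for_server(url: str = "http://localhost:8080/health",
                    timeout: float = 60.0) -> bool:
    """Return True once /health answers 200, False if timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(0.5)  # server not up yet; retry
    return False
```

A typical CI step: start the Llamafile in the background, call `wait_for_server()`, then run the review or doc-generation job against the API.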
Performance vs llama.cpp / Ollama {#performance}
RTX 4090 + Llama 3.1 8B Q5_K_M:
| Tool | tok/s |
|---|---|
| Plain llama.cpp (latest) | 132 |
| Ollama | 130 |
| Llamafile | 131 |
| KoboldCpp | 130 |
CPU-only on Ryzen 7 7700X with same model:
| Tool | tok/s |
|---|---|
| llama.cpp | 8.4 |
| Llamafile | 8.5 |
| Ollama | 8.3 |
Within noise across all platforms — they share the same kernel implementations now. Choose based on operational fit, not speed.
Use Cases Where Llamafile Wins {#use-cases}
- Distributing AI features as software products — one file, no install.
- Air-gapped / offline / classified environments — download once, copy in.
- USB stick / portable demos — runs on any laptop you plug it into.
- CI/CD AI tasks — single binary in the pipeline, no Docker required.
- Edge devices / kiosks — runs on Raspberry Pi 5, Jetson Orin, mini-PCs.
- First-launch onboarding — "click this file" is the simplest possible UX.
- Workshops and education — students download one file and run.
For multi-user servers, model registry management, or advanced features (PagedAttention, FP8, multi-GPU TP), other tools are better — see vLLM or TensorRT-LLM.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| "Permission denied" | Not executable | chmod +x model.llamafile |
| Windows: "executable too large" | 4 GB PE limit | Use separated launcher + .gguf |
| GPU not detected | CUDA not installed | Install CUDA toolkit; or fall back to CPU |
| AMD: hipErrorNoBinaryForGpu | Old gfx | Set HSA_OVERRIDE_GFX_VERSION |
| macOS: "cannot be opened" | Gatekeeper | xattr -d com.apple.quarantine model.llamafile |
| Web UI 404 | Running CLI variant | Use llamafile-server not llamafile |
| Slow CPU performance | No AVX2 | Older CPU; expected |
| Out of memory at load | Model too big | Smaller quant or model |
Sources: Llamafile GitHub | Cosmopolitan Libc | Mozilla AI | Justine Tunney's blog posts on Llamafile performance.