
Llamafile Setup Guide (2026): Run LLMs from a Single Cross-Platform Executable

May 1, 2026
22 min read
LocalAimaster Research Team

Llamafile is the simplest LLM distribution ever made. One file. No install. No dependencies. No daemon. Runs natively on Linux, macOS, Windows, and BSD from the same binary thanks to Justine Tunney's Cosmopolitan Libc. Mozilla maintains it as an experiment in delivering AI as a single self-contained artifact.

This guide covers everything: how to download and run pre-built Llamafiles, GPU acceleration on each platform, building your own .llamafile from a GGUF model, the OpenAI-compatible API, sampling, web UI, and the niche where Llamafile beats every other option.

Table of Contents

  1. What Llamafile Is
  2. How Cosmopolitan Libc Works
  3. Hardware Requirements
  4. Running a Pre-Built Llamafile
  5. GPU Acceleration: NVIDIA, AMD, Apple
  6. Web UI
  7. OpenAI-Compatible API
  8. The Windows 4 GB Executable Limit
  9. Building a Custom Llamafile
  10. Sampling and Configuration Flags
  11. Embedding Llamafile in a Distribution
  12. Performance vs llama.cpp / Ollama
  13. Use Cases Where Llamafile Wins
  14. Troubleshooting

What Llamafile Is {#what-it-is}

A Llamafile is a single executable that bundles:

  • A llama.cpp binary built with Cosmopolitan Libc
  • Optional embedded model weights (GGUF concatenated)
  • A built-in web UI
  • An OpenAI-compatible REST server

Run it. Browse to localhost. Have a local LLM. That is the entire experience.

Project: github.com/Mozilla-Ocho/llamafile. Maintained by Mozilla and Justine Tunney.


How Cosmopolitan Libc Works {#cosmopolitan}

The Cosmopolitan Libc compiler (cosmocc) produces "Actually Portable Executable" (APE) files. Each APE is simultaneously:

  • A valid Windows PE+ executable
  • A valid Linux ELF executable
  • A valid macOS Mach-O executable (via shim)
  • A valid FreeBSD / NetBSD / OpenBSD executable
  • A valid POSIX shell script

When you double-click it on Windows, it runs as a PE. On Linux and the BSDs, the same leading bytes parse as a shell script that bootstraps the embedded program (or the system's registered APE loader runs it directly). On macOS, a small bootstrap converts it to Mach-O on first run. The same machine code serves all of them through portable syscall wrappers.

This is the foundation that makes Llamafile a single file across OSes.
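
You can see the polyglot trick for yourself: the first bytes of a Llamafile are simultaneously the DOS/Windows "MZ" magic and a valid shell-variable assignment.

# The APE magic reads MZqFpD=' : MZ satisfies PE loaders, and the
# quoted variable assignment satisfies POSIX shells
head -c 8 Llama-3.2-3B-Instruct.Q6_K.llamafile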


Hardware Requirements {#requirements}

| Component | Minimum | Recommended |
| --- | --- | --- |
| OS | Linux, macOS 10.14+, Windows 10+, FreeBSD, OpenBSD, NetBSD | Same |
| CPU | x86_64 with AVX2, or ARM64 | Modern 8-core+ |
| RAM | 8 GB | 16-32 GB |
| GPU | None (CPU works) | 8 GB+ VRAM for medium models |
| Disk | 1-50 GB depending on model | NVMe |

Llamafile gracefully falls back to CPU when GPU is unavailable. It runs on hardware most other LLM tools won't touch.
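
On Linux you can check for AVX2 support before downloading anything:

# Prints a reassurance or a warning depending on CPU flags
grep -q avx2 /proc/cpuinfo && echo "AVX2 available" || echo "No AVX2: expect slow CPU inference"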


Running a Pre-Built Llamafile {#running}

Linux / macOS

# Download (e.g., Llama 3.2 3B)
wget https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile/resolve/main/Llama-3.2-3B-Instruct.Q6_K.llamafile

chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile

# Run with web UI + OpenAI API on port 8080
./Llama-3.2-3B-Instruct.Q6_K.llamafile -ngl 999
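
Once it's running, confirm the server responds before opening the browser:

curl http://localhost:8080/health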

Windows

Rename to .exe (Windows requires the extension):

ren Llama-3.2-3B-Instruct.Q6_K.llamafile Llama-3.2-3B-Instruct.Q6_K.exe
.\Llama-3.2-3B-Instruct.Q6_K.exe -ngl 999

Or run via WSL2 / Git Bash without the rename.

Mozilla's published Llamafiles (2026)

| Model | Size | Use |
| --- | --- | --- |
| TinyLlama-1.1B | 0.7 GB | Embedded, edge |
| Llama-3.2-1B-Instruct | 0.8 GB | Mobile-class |
| Llama-3.2-3B-Instruct | 2.2 GB | Sweet-spot small |
| Qwen-2.5-3B-Instruct | 2.0 GB | Strong tiny model |
| Phi-3.5-Mini-Instruct | 2.3 GB | Microsoft's small reasoning model |
| Gemma-2-2B-Instruct | 1.6 GB | Google's small chat model |
| Mistral-7B-Instruct-v0.3 | 4.4 GB | Classic 7B |
| Llama-3.1-8B-Instruct | 4.9 GB | Standard 8B |
| LLaVA-1.5-7B (multimodal) | 4.0 GB | Vision + text |

Bigger models (70B class) ship as ZIP archives separating binary from weights to dodge the Windows 4 GB limit.


GPU Acceleration: NVIDIA, AMD, Apple {#gpu}

NVIDIA CUDA

./model.llamafile -ngl 999

Llamafile auto-detects CUDA. Requires CUDA toolkit installed (will fall back to tinyBLAS / CPU if not). Set LLAMAFILE_GPU=NVIDIA to force.
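
To force the NVIDIA path instead of relying on auto-detection:

LLAMAFILE_GPU=NVIDIA ./model.llamafile -ngl 999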

Apple Metal

./model.llamafile -ngl 999

Metal is auto-enabled on Mac. No additional setup.

AMD ROCm

./model.llamafile -ngl 999
# If issues:
HSA_OVERRIDE_GFX_VERSION=11.0.0 ./model.llamafile -ngl 999

See AMD ROCm Setup for AMD specifics.

CPU-only

Drop -ngl or set to 0. Llamafile uses optimized CPU kernels (tinyBLAS) — performance on a modern Ryzen / Apple Silicon CPU is surprisingly competitive for small models.
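
Explicitly pinning to CPU looks like this:

# No layers offloaded: pure CPU inference on the tinyBLAS kernels
./model.llamafile -ngl 0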


Web UI {#ui}

Browse to http://localhost:8080 after launching. Features:

  • Chat / completion playground
  • Sampler controls (temperature, top-p, top-k, mirostat, penalties)
  • Prompt template editor
  • System prompt
  • Image upload (for vision Llamafiles)
  • Save / load conversations

The UI is the standard llama.cpp web UI plus minor branding.


OpenAI-Compatible API {#api}

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.2-3b",
        "messages": [{"role":"user","content":"Hello"}]
    }'

Endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings (with embedding models), and /health. Streaming via stream: true works.

Drop your existing OpenAI client onto http://localhost:8080/v1 and it works unchanged.
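
Streaming behaves like OpenAI's: set stream to true and read the server-sent events (curl's -N flag disables buffering so tokens appear as they arrive):

curl -N http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.2-3b",
        "stream": true,
        "messages": [{"role":"user","content":"Write a haiku about local AI"}]
    }'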


The Windows 4 GB Executable Limit {#windows-limit}

Windows PE32+ has a 4 GB image-size limit. Workaround for larger models:

# Run separated: small launcher + external GGUF
./llamafile -m big-model.Q5_K_M.gguf -ngl 999 -c 8192

Mozilla provides minimal llamafile and llamafile-server launchers (~25 MB) that work this way. Useful when:

  • You need 7B+ models on Windows
  • You want to share one binary across many models
  • The model is updated frequently

Building a Custom Llamafile {#building}

Quick build with zipalign

# Download the launcher binary
wget https://github.com/Mozilla-Ocho/llamafile/releases/latest/download/llamafile-server-0.9.x

# Start from a copy of the launcher
cp llamafile-server-0.9.x my-model.llamafile
chmod +x my-model.llamafile

# Embed the GGUF into the executable's ZIP structure
# (zipalign ships with llamafile releases and source builds)
./zipalign -j0 my-model.llamafile my-model-Q5_K_M.gguf

The zipalign step is what makes the weights memory-mappable: it stores the GGUF page-aligned inside the executable's ZIP structure so it can be mmap'd at load time. Naively appending the weights instead forces them to be copied into RAM, roughly doubling memory use.
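
You can also embed default flags so the finished file runs with no arguments: the llamafile docs describe adding a .args file (one argument per line) with zipalign. The flag values below are only an example:

# .args supplies default CLI arguments; -m points at the embedded GGUF
cat > .args <<'EOF'
-m
my-model-Q5_K_M.gguf
--host
0.0.0.0
EOF
./zipalign -j0 my-model.llamafile .args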

From source

git clone https://github.com/Mozilla-Ocho/llamafile
cd llamafile
make -j8

Produces o//llamafile (CLI) and o//llamafile-server (REST server). Build requires cosmocc toolchain — the build system handles fetching it.


Sampling and Configuration Flags {#config}

./model.llamafile \
    -m my-model.gguf \
    -ngl 999 \
    -c 16384 \
    -fa \
    --temp 0.7 \
    --top-p 0.9 \
    --min-p 0.05 \
    --repeat-penalty 1.05 \
    --port 8080 \
    --host 0.0.0.0 \
    --api-key sk-yourkey

Standard llama.cpp flags. See LLM Sampling Parameters for what each does.
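
If you start the server with --api-key as above, clients authenticate with a standard OpenAI-style bearer token:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-yourkey" \
    -d '{"model":"local","messages":[{"role":"user","content":"ping"}]}'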


Embedding Llamafile in a Distribution {#distribution}

Llamafile is ideal for shipping LLMs as part of a larger distribution:

  • Software products: ship an offline AI feature as one bundled .llamafile.
  • Conference USB sticks: a 5 GB stick with a Llamafile + a folder of demos.
  • Air-gapped sites: download once on an internet-connected machine, then copy to the air-gapped LAN.
  • CI / build servers: drop a Llamafile into the workflow for code reviews / documentation generation (see the sketch after this list).
  • Embedded devices: Llamafile runs on Raspberry Pi, NVIDIA Jetson, and even some routers.
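
A sketch of the CI idea (model name, flags, and prompt are placeholders; a real pipeline should poll /health with a timeout):

# Hypothetical CI step: boot the model, request a review, shut down
./my-model.llamafile --port 8080 --nobrowser &
LLAMAFILE_PID=$!
until curl -sf http://localhost:8080/health > /dev/null; do sleep 1; done
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"local","messages":[{"role":"user","content":"Review this diff for bugs: ..."}]}'
kill $LLAMAFILE_PID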

Performance vs llama.cpp / Ollama {#performance}

RTX 4090 + Llama 3.1 8B Q5_K_M:

| Tool | tok/s |
| --- | --- |
| Plain llama.cpp (latest) | 132 |
| Ollama | 130 |
| Llamafile | 131 |
| KoboldCpp | 130 |

CPU-only on Ryzen 7 7700X with same model:

| Tool | tok/s |
| --- | --- |
| llama.cpp | 8.4 |
| Llamafile | 8.5 |
| Ollama | 8.3 |

The differences are within noise on both platforms: these tools all share the same llama.cpp kernels now. Choose based on operational fit, not speed.


Use Cases Where Llamafile Wins {#use-cases}

  1. Distributing AI features as software products — one file, no install.
  2. Air-gapped / offline / classified environments — download once, copy in.
  3. USB stick / portable demos — runs on any laptop you plug it into.
  4. CI/CD AI tasks — single binary in the pipeline, no Docker required.
  5. Edge devices / kiosks — runs on Raspberry Pi 5, Jetson Orin, mini-PCs.
  6. First-launch onboarding — "click this file" is the simplest possible UX.
  7. Workshops and education — students download one file and run.

For multi-user servers, model registry management, or advanced features (PagedAttention, FP8, multi-GPU TP), other tools are better — see vLLM or TensorRT-LLM.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
| --- | --- | --- |
| "Permission denied" | Not executable | chmod +x model.llamafile |
| Windows: "executable too large" | 4 GB PE limit | Use separated launcher + .gguf |
| GPU not detected | CUDA not installed | Install CUDA toolkit; or fall back to CPU |
| AMD: hipErrorNoBinaryForGpu | Old gfx target | Set HSA_OVERRIDE_GFX_VERSION |
| macOS: "cannot be opened" | Gatekeeper | xattr -d com.apple.quarantine model.llamafile |
| Web UI 404 | Running CLI variant | Use llamafile-server, not llamafile |
| Slow CPU performance | No AVX2 | Older CPU; expected |
| Out of memory at load | Model too big | Smaller quant or model |



Sources: Llamafile GitHub | Cosmopolitan Libc | Mozilla AI | Justine Tunney's blog posts on Llamafile performance.
