LocalAI Setup Guide (2026): The OpenAI-Compatible Drop-In Replacement
LocalAI is the closest thing to a one-stop self-hosted replacement for the OpenAI API. One binary speaks every major OpenAI endpoint (chat, completions, embeddings, image generation, transcription, text-to-speech, vision) and orchestrates the right backend (llama.cpp, vLLM, Stable Diffusion, Whisper, Bark, Coqui) for each. Point your existing OpenAI client at a LocalAI base URL and most of the API surface just works.
This guide covers everything: installation across Docker / Kubernetes / bare metal, the model gallery, configuring backends, running multimodal endpoints (text + image + audio + vision) from one server, function calling, JSON mode, P2P federated mode, and tuning for production.
Table of Contents
- What LocalAI Is
- Capabilities and Endpoints
- Hardware Requirements
- Installation: Docker, Bare Metal, Kubernetes
- The Model Gallery
- Backends: llama.cpp, vLLM, transformers, exllama, diffusers, etc.
- Your First Chat Completion
- Embeddings
- Image Generation (Stable Diffusion, Flux)
- Audio: Whisper, TTS, Bark, MusicGen
- Vision: LLaVA and Llama 3.2 Vision
- Reranker
- Function Calling and JSON Mode
- Custom Models and YAML Configs
- P2P Federated Mode
- Authentication, Rate Limiting, Multi-Tenancy
- Observability
- Tuning Recipes
- Troubleshooting
What LocalAI Is {#what-it-is}
LocalAI (mudler/LocalAI on GitHub) is a Go-based REST server that:
- Implements the OpenAI API surface (chat, embeddings, images, audio, vision, etc.)
- Loads pluggable backends (llama.cpp, vLLM, transformers, exllama, diffusers, whisper, piper, bark, coqui, etc.) per model
- Ships a curated model gallery for one-click model installation
- Provides a built-in WebUI for chat, image gen, transcription, voice cloning
- Supports CPU, NVIDIA CUDA, AMD ROCm, Intel oneAPI/SYCL, Apple Metal
- Open-source MIT-licensed, ~25K GitHub stars
Project: github.com/mudler/LocalAI. Maintainer: Ettore Di Giacinto (mudler).
Capabilities and Endpoints {#endpoints}
| OpenAI Endpoint | LocalAI Backend |
|---|---|
| POST /v1/chat/completions | llama.cpp / vLLM / transformers |
| POST /v1/completions | llama.cpp / vLLM / transformers |
| POST /v1/embeddings | llama.cpp / sentence-transformers / bert |
| POST /v1/images/generations | diffusers (SD 1.5 / SDXL / SD 3.5 / Flux) |
| POST /v1/audio/transcriptions | whisper.cpp / faster-whisper |
| POST /v1/audio/translations | whisper |
| POST /v1/audio/speech | piper / bark / coqui-xtts / outetts |
| POST /v1/rerank | bge-reranker / jina-reranker |
| Vision (multimodal in chat) | LLaVA / Llama 3.2 Vision / Qwen 2-VL |
| POST /v1/fine_tuning/jobs | LoRA training (limited) |
| POST /v1/files, /assistants | Limited compatibility |
The full OpenAI feature parity is documented in LocalAI's compatibility matrix.
Hardware Requirements {#requirements}
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | x86_64 / ARM64 | 8-core+ |
| RAM | 8 GB (CPU-only small models) | 32 GB+ |
| GPU | None (CPU works) | 12 GB+ VRAM |
| Disk | 20 GB | NVMe; model files add up fast |
| OS | Linux, macOS, Windows (WSL2) | Ubuntu 22.04 LTS |
GPU support: NVIDIA (CUDA 11.8 / 12.1 / 12.4 / 12.6), AMD (ROCm 6.2 / Vulkan), Intel (oneAPI / SYCL), Apple (Metal). The "AIO" (all-in-one) Docker images bundle the right backends per accelerator.
Installation: Docker, Bare Metal, Kubernetes {#installation}
Docker (recommended)
# CPU-only
docker run -p 8080:8080 --name localai \
-v ./models:/build/models \
localai/localai:latest-aio-cpu
# NVIDIA CUDA 12
docker run -p 8080:8080 --gpus all --name localai \
-v ./models:/build/models \
localai/localai:latest-aio-gpu-nvidia-cuda-12
# AMD ROCm
docker run -p 8080:8080 --device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
-v ./models:/build/models \
localai/localai:latest-aio-gpu-amd-rocm
# Intel oneAPI / SYCL
docker run -p 8080:8080 --device=/dev/dri \
-v ./models:/build/models \
localai/localai:latest-aio-gpu-intel-f16
The aio (all-in-one) tags bundle preset model configurations for the chat / embeddings / image / audio default lineup so the server is ready to use without any setup beyond pulling.
docker-compose.yml
services:
  localai:
    image: localai/localai:latest-aio-gpu-nvidia-cuda-12
    ports: ["8080:8080"]
    runtime: nvidia
    environment:
      - DEBUG=true
      - MODELS_PATH=/build/models
      - THREADS=8
    volumes:
      - ./models:/build/models
      - ./config:/build/config
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, count: all, capabilities: [gpu] }
Bare metal (Linux)
curl https://localai.io/install.sh | sh
sudo systemctl enable --now local-ai
The install script picks the right binary for your platform and installs to /usr/bin/local-ai plus a systemd unit.
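Before moving on, it is worth confirming the server answers on port 8080. A minimal Python check, assuming the default port and the standard OpenAI-compatible /v1/models route:
import requests

# List the models the server currently exposes (OpenAI-compatible route)
resp = requests.get("http://localhost:8080/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])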
Kubernetes (Helm)
helm repo add localai https://go-skynet.github.io/helm-charts/
helm install localai localai/local-ai \
--set deployment.image.tag=latest-aio-gpu-nvidia-cuda-12 \
--set resources.limits."nvidia\.com/gpu"=1 \
--set persistence.models.size=200Gi
For declarative model loading via the LocalAI Operator: install the operator, then kubectl apply -f model-llama-3.1-8b.yaml with a LocalAIModel CRD.
The Model Gallery {#gallery}
LocalAI's gallery YAML lists pre-configured models. Browse via the WebUI (http://localhost:8080 → "Models" tab) or CLI:
# List available
curl http://localhost:8080/models/available | jq .
# Install a model from the gallery
curl http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "llama-3.1-8b-instruct"}'
# Status
curl http://localhost:8080/models/jobs/<job-id>
Gallery YAML reference (excerpt):
- name: llama-3.1-8b-instruct
  url: github:mudler/LocalAI/gallery/llama3.1.yaml@master
  files:
    - filename: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
      sha256: ...
      uri: huggingface://bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
You can host your own gallery by serving a YAML file at any HTTP URL and pointing LocalAI at it.
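If you would rather script the gallery flow than poll by hand, here is a rough Python sketch of the same apply-then-poll loop. The job-status field names (uuid, processed, error) are assumptions that may differ between LocalAI versions, so inspect the raw JSON on your install:
import requests, time

BASE = "http://localhost:8080"

# Kick off a gallery install (same request as the curl example above)
job = requests.post(f"{BASE}/models/apply", json={"id": "llama-3.1-8b-instruct"}).json()
job_id = job.get("uuid") or job.get("id")  # field name may vary by version

# Poll the job endpoint until the download and setup finish
while True:
    status = requests.get(f"{BASE}/models/jobs/{job_id}").json()
    print(status)
    if status.get("processed") or status.get("error"):  # assumed terminal fields
        break
    time.sleep(5)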
Backends: llama.cpp, vLLM, transformers, exllama, diffusers, etc. {#backends}
| Backend | Purpose | Format |
|---|---|---|
| llama.cpp | Default LLM (GGUF) | GGUF |
| vLLM | High-throughput LLM | HF / AWQ / FP8 |
| transformers | HF Transformers | HF FP16/BF16 |
| exllama / exllamav2 | Fast INT4 NVIDIA | EXL2 / GPTQ |
| diffusers | Image generation | HF diffusers |
| whisper | Speech-to-text | GGUF |
| piper | Lightweight TTS | ONNX |
| coqui | Voice cloning TTS | XTTS v2 |
| bark | Text-to-audio | PT |
| outetts | Long-form TTS | GGUF |
| sentencetransformers / bert | Embeddings | HF / GGUF |
| rerankers | Reranker | HF / GGUF |
You set the backend per model in the YAML config:
name: llama-3.1-70b-vllm
backend: vllm
quantization: awq
gpu_memory_utilization: 0.92
parameters:
  model: meta-llama/Llama-3.1-70B-Instruct-AWQ-INT4
This is a major LocalAI strength — different models can use different backends on the same server, all behind one OpenAI-compatible endpoint.
Your First Chat Completion {#first-chat}
# Install Llama 3.1 8B from the gallery
curl http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "llama-3.1-8b-instruct"}'
# Wait for download to finish (poll /models/jobs/<id>)
# Use as OpenAI
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b-instruct",
"messages": [{"role":"user","content":"Hello"}],
"temperature": 0.7
}'
Or via Python OpenAI SDK (no code change needed beyond base_url):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")
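From there the standard SDK calls work unchanged. Continuing with the client above and the gallery model installed earlier:
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
)
print(response.choices[0].message.content)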
Embeddings {#embeddings}
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text-v1.5",
"input": "The quick brown fox"
}'
LocalAI ships gallery entries for Nomic Embed, BGE-M3, BGE-Small, GTE, E5, and Sentence-Transformers MiniLM. Compare with Local AI Embeddings Guide and Local vs OpenAI Embeddings.
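The same OpenAI client covers embeddings too. A small sketch, assuming the nomic-embed-text-v1.5 gallery entry is installed:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# Embed a batch of strings; the response shape matches the OpenAI API
resp = client.embeddings.create(
    model="nomic-embed-text-v1.5",
    input=["The quick brown fox", "jumps over the lazy dog"],
)
print(len(resp.data), "vectors of dimension", len(resp.data[0].embedding))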
Image Generation (Stable Diffusion, Flux) {#image-gen}
curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "sdxl",
"prompt": "a cinematic photo of a cyberpunk samurai",
"size": "1024x1024",
"n": 1,
"response_format": "b64_json"
}'
Backends: diffusers (default for SD 1.5 / SDXL / SD 3.5 / Flux), stablediffusion (legacy ggml-based backend). For maximum flexibility (ControlNet, IPAdapter, LoRA stacks), use ComfyUI directly; LocalAI is the right answer for "drop-in OpenAI image API."
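With response_format set to b64_json, the image comes back base64-encoded in the response body. A minimal Python sketch that issues the same request via the OpenAI SDK and writes the decoded bytes to disk (the output filename is arbitrary):
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

resp = client.images.generate(
    model="sdxl",
    prompt="a cinematic photo of a cyberpunk samurai",
    size="1024x1024",
    n=1,
    response_format="b64_json",
)

# Decode the base64 payload into a PNG file
with open("samurai.png", "wb") as f:
    f.write(base64.b64decode(resp.data[0].b64_json))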
Audio: Whisper, TTS, Bark, MusicGen {#audio}
Speech-to-text
curl http://localhost:8080/v1/audio/transcriptions \
-F file="@audio.wav" \
-F model="whisper-1"
Text-to-speech
curl http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"voice": "alloy",
"input": "Hello world"
}' --output speech.wav
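Both audio endpoints are reachable from the OpenAI SDK as well. A sketch assuming a local audio.wav and the default whisper-1 / tts-1 preset names used above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# Speech-to-text: upload a local file to the transcription endpoint
with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
print(transcript.text)

# Text-to-speech: stream the generated audio straight to disk
with client.audio.speech.with_streaming_response.create(
    model="tts-1", voice="alloy", input="Hello world"
) as speech:
    speech.stream_to_file("speech.wav")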
For voice cloning (Coqui XTTS v2):
# config/xtts.yaml
name: my-voice
backend: coqui
parameters:
  model: tts_models/multilingual/multi-dataset/xtts_v2
  voice_wav: /build/voices/reference.wav
curl http://localhost:8080/v1/audio/speech \
-d '{"model": "my-voice", "input": "Cloned voice output"}' \
--output cloned.wav
Vision: LLaVA and Llama 3.2 Vision {#vision}
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-1.6-mistral",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}]
}'
Compatible: LLaVA-1.5/1.6, Llama 3.2 11B / 90B Vision, Qwen 2-VL, MiniCPM-V, Pixtral.
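To send a local image rather than a pre-built data URL, encode it to base64 first. A sketch with the OpenAI SDK, assuming a photo.jpg next to the script:
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# Build the data URL the chat vision format expects
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="llava-1.6-mistral",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)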
Reranker {#reranker}
curl http://localhost:8080/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "bge-reranker-v2-m3",
"query": "What is local AI?",
"documents": ["Doc 1...", "Doc 2..."]
}'
Useful in RAG pipelines for re-ranking after vector search. See Vector Databases Comparison.
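/v1/rerank is not part of the OpenAI SDK, so call it directly. A sketch with requests; the result field names (results, index, relevance_score) follow the common Jina-style rerank schema and may differ on your version, so check the raw response:
import requests

resp = requests.post(
    "http://localhost:8080/v1/rerank",
    json={
        "model": "bge-reranker-v2-m3",
        "query": "What is local AI?",
        "documents": ["Doc 1...", "Doc 2..."],
    },
)
resp.raise_for_status()

# Print documents in relevance order
for item in resp.json().get("results", []):
    print(item.get("index"), item.get("relevance_score"))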
Function Calling and JSON Mode {#function-calling}
{
"model": "llama-3.1-8b-instruct",
"messages": [...],
"tools": [{"type": "function", "function": {...}}],
"tool_choice": "auto"
}
Set response_format to {"type": "json_object"} for JSON mode, or to {"type": "json_schema", "json_schema": {...}} for schema-validated output. Support is backend-dependent.
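End to end, the flow looks like any other OpenAI tools call. A sketch with a hypothetical get_weather tool; whether the model actually emits tool_calls depends on the backend and model, as noted above:
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# Hypothetical tool definition in the standard OpenAI schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model chose to call the tool, arguments arrive as a JSON string
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))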
Custom Models and YAML Configs {#custom-models}
Drop a YAML into models/:
name: my-llama
backend: llama-cpp
context_size: 16384
threads: 8
gpu_layers: 999
f16: true
parameters:
  model: my-llama-Q5_K_M.gguf
template:
  chat: |
    {{- range .Messages }}
    {{ if eq .Role "user" }}<|im_start|>user
    {{ .Content }}<|im_end|>
    {{ else if eq .Role "assistant" }}<|im_start|>assistant
    {{ .Content }}<|im_end|>
    {{ end }}
    {{ end }}<|im_start|>assistant
Restart and the model is available at /v1/chat/completions with "model": "my-llama".
P2P Federated Mode {#p2p}
LocalAI 2.0+ supports peer-to-peer federation: multiple LocalAI nodes form a cluster and route requests to the node with the requested model loaded.
# Node 1 — token-broadcasting bootstrap
local-ai run --p2p --p2p-network-id mynet
# Node 2 — joins the network
local-ai run --p2p --p2p-token <token-from-node-1>
Useful for distributed home-lab setups where you want to spread models across multiple machines without setting up an explicit reverse proxy.
Authentication, Rate Limiting, Multi-Tenancy {#auth}
LocalAI has basic API-key auth (API_KEY=sk-...), but for multi-tenant production put it behind LiteLLM or another gateway. See AI Gateway with LiteLLM.
Observability {#observability}
LocalAI exposes /metrics (Prometheus). Key metrics:
| Metric | Meaning |
|---|---|
| localai_requests_total | Total requests by endpoint |
| localai_request_duration_seconds | Latency histogram |
| localai_tokens_total | Tokens generated |
| localai_models_loaded | Models currently in memory |
Pair with Grafana for dashboards.
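A quick way to eyeball the metrics without Grafana is to scrape the endpoint and filter for LocalAI's own series. The exact metric names and prefix can vary by version; the ones in the table above are what this sketch assumes:
import requests

# Fetch the Prometheus exposition text and keep LocalAI's own series
metrics = requests.get("http://localhost:8080/metrics").text
for line in metrics.splitlines():
    if line.startswith("localai"):
        print(line)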
Tuning Recipes {#tuning}
Single-GPU NVIDIA workstation
# models/llama-3.1-8b.yaml
name: llama-3.1-8b
backend: llama-cpp
context_size: 16384
threads: 8
gpu_layers: 999
f16: true
flash_attention: true
parameters:
  model: llama-3.1-8b-instruct-Q5_K_M.gguf
High-throughput server (vLLM backend)
name: llama-3.1-8b-vllm
backend: vllm
dtype: bfloat16
gpu_memory_utilization: 0.92
max_model_len: 16384
enable_prefix_caching: true
enable_chunked_prefill: true
parameters:
  model: meta-llama/Llama-3.1-8B-Instruct
Multimodal AIO server
Pull localai/localai:latest-aio-gpu-nvidia-cuda-12 and the gallery auto-populates LLM + embedding + image + audio + vision presets out of the box. Start the container and /v1/chat/completions, /v1/embeddings, /v1/images/generations, and /v1/audio/speech all work immediately.
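A short smoke test confirms the presets are wired up. This sketch assumes only the OpenAI-compatible /v1/models route and picks the first listed model for a one-line chat, which may not be the chat preset on every AIO image, so swap in the right name if needed:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# See which presets the AIO image registered
models = [m.id for m in client.models.list().data]
print("loaded:", models)

# Fire a minimal chat request against the first registered model
reply = client.chat.completions.create(
    model=models[0],
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)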
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Model not found | Not in gallery / typo | List with /models/available |
| OOM at load | gpu_layers too high | Lower or use smaller quant |
| Slow on first request | Cold start | Send warmup request after install |
| Backend not found | Wrong AIO image | Pull image matching your accelerator |
| AMD: hipErrorNoBinaryForGpu | Old gfx | Set HSA_OVERRIDE_GFX_VERSION; see AMD ROCm Guide |
| TTS sounds robotic | Using piper not coqui | Configure coqui-xtts model |
| Image gen blank | Wrong backend | Set backend: diffusers for SDXL/SD3.5/Flux |
| /v1/embeddings returns wrong dims | Wrong embedding backend | Use sentencetransformers for HF models |
Sources: LocalAI GitHub | LocalAI docs | Mudler's blog | internal benchmarks on NVIDIA, AMD, and Apple hardware.