
LocalAI Setup Guide (2026): The OpenAI-Compatible Drop-In Replacement

May 1, 2026
26 min read
LocalAimaster Research Team

LocalAI is the closest thing to a one-stop self-hosted replacement for the OpenAI API. One binary speaks every major OpenAI endpoint — chat, completions, embeddings, image generation, transcription, text-to-speech, vision — and orchestrates the right backend (llama.cpp, vLLM, Stable Diffusion, Whisper, Bark, Coqui) for each. Drop your existing OpenAI client onto a LocalAI base URL and 95% of the surface just works.

This guide covers everything: installation across Docker / Kubernetes / bare metal, the model gallery, configuring backends, running multimodal endpoints (text + image + audio + vision) from one server, function calling, JSON mode, P2P federated mode, and tuning for production.

Table of Contents

  1. What LocalAI Is
  2. Capabilities and Endpoints
  3. Hardware Requirements
  4. Installation: Docker, Bare Metal, Kubernetes
  5. The Model Gallery
  6. Backends: llama.cpp, vLLM, transformers, exllama, diffusers, etc.
  7. Your First Chat Completion
  8. Embeddings
  9. Image Generation (Stable Diffusion, Flux)
  10. Audio: Whisper, TTS, Bark, MusicGen
  11. Vision: LLaVA and Llama 3.2 Vision
  12. Reranker
  13. Function Calling and JSON Mode
  14. Custom Models and YAML Configs
  15. P2P Federated Mode
  16. Authentication, Rate Limiting, Multi-Tenancy
  17. Observability
  18. Tuning Recipes
  19. Troubleshooting


What LocalAI Is {#what-it-is}

LocalAI (mudler/LocalAI on GitHub) is a Go-based REST server that:

  • Implements the OpenAI API surface (chat, embeddings, images, audio, vision, etc.)
  • Loads pluggable backends (llama.cpp, vLLM, transformers, exllama, diffusers, whisper, piper, bark, coqui, etc.) per model
  • Ships a curated model gallery for one-click model installation
  • Provides a built-in WebUI for chat, image gen, transcription, voice cloning
  • Supports CPU, NVIDIA CUDA, AMD ROCm, Intel oneAPI/SYCL, Apple Metal
  • Open-source MIT-licensed, ~25K GitHub stars

Project: github.com/mudler/LocalAI. Maintainer: Ettore Di Giacinto (mudler).


Capabilities and Endpoints {#endpoints}

| OpenAI Endpoint | LocalAI Backend |
|---|---|
| POST /v1/chat/completions | llama.cpp / vLLM / transformers |
| POST /v1/completions | llama.cpp / vLLM / transformers |
| POST /v1/embeddings | llama.cpp / sentence-transformers / bert |
| POST /v1/images/generations | diffusers (SD 1.5 / SDXL / SD 3.5 / Flux) |
| POST /v1/audio/transcriptions | whisper.cpp / faster-whisper |
| POST /v1/audio/translations | whisper |
| POST /v1/audio/speech | piper / bark / coqui-xtts / outetts |
| POST /v1/rerank | bge-reranker / jina-reranker |
| Vision (multimodal in chat) | LLaVA / Llama 3.2 Vision / Qwen 2-VL |
| POST /v1/fine_tuning/jobs | LoRA training (limited) |
| POST /v1/files, /assistants | Limited compatibility |

The full extent of OpenAI feature parity is documented in LocalAI's compatibility matrix.


Hardware Requirements {#requirements}

| Resource | Minimum | Recommended |
|---|---|---|
| CPU | x86_64 / ARM64 | 8-core+ |
| RAM | 8 GB (CPU-only small models) | 32 GB+ |
| GPU | None (CPU works) | 12 GB+ VRAM |
| Disk | 20 GB | NVMe; model files add up fast |
| OS | Linux, macOS, Windows (WSL2) | Ubuntu 22.04 LTS |

GPU support: NVIDIA (CUDA 11.8 / 12.1 / 12.4 / 12.6), AMD (ROCm 6.2 / Vulkan), Intel (oneAPI / SYCL), Apple (Metal). The "AIO" (all-in-one) Docker images bundle the right backends per accelerator.



Installation: Docker, Bare Metal, Kubernetes {#installation}

# CPU-only
docker run -p 8080:8080 --name localai \
    -v ./models:/build/models \
    localai/localai:latest-aio-cpu

# NVIDIA CUDA 12
docker run -p 8080:8080 --gpus all --name localai \
    -v ./models:/build/models \
    localai/localai:latest-aio-gpu-nvidia-cuda-12

# AMD ROCm
docker run -p 8080:8080 --device=/dev/kfd --device=/dev/dri \
    --group-add video --group-add render \
    -v ./models:/build/models \
    localai/localai:latest-aio-gpu-amd-rocm

# Intel oneAPI / SYCL
docker run -p 8080:8080 --device=/dev/dri \
    -v ./models:/build/models \
    localai/localai:latest-aio-gpu-intel-f16

The aio (all-in-one) tags bundle preset model configurations for a default lineup of chat, embeddings, image, and audio models, so the server is ready to use as soon as the pull finishes.

docker-compose.yml

services:
  localai:
    image: localai/localai:latest-aio-gpu-nvidia-cuda-12
    ports: ["8080:8080"]
    runtime: nvidia
    environment:
      - DEBUG=true
      - MODELS_PATH=/build/models
      - THREADS=8
    volumes:
      - ./models:/build/models
      - ./config:/build/config
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, count: all, capabilities: [gpu] }

Bare metal (Linux)

curl https://localai.io/install.sh | sh
sudo systemctl enable --now local-ai

The install script picks the right binary for your platform and installs to /usr/bin/local-ai plus a systemd unit.
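To verify the service is up (assuming the default port 8080; /v1/models lists the configured models):

systemctl status local-ai
curl http://localhost:8080/v1/models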

Kubernetes (Helm)

helm repo add localai https://go-skynet.github.io/helm-charts/
helm install localai localai/local-ai \
    --set deployment.image.tag=latest-aio-gpu-nvidia-cuda-12 \
    --set resources.limits."nvidia\.com/gpu"=1 \
    --set persistence.models.size=200Gi

For declarative model loading via the LocalAI Operator: install the operator, then kubectl apply -f model-llama-3.1-8b.yaml with a LocalAIModel CRD.


The Model Gallery {#gallery}

LocalAI's gallery YAML lists pre-configured models. Browse it via the WebUI (http://localhost:8080 → "Models" tab) or the CLI:

# List available
curl http://localhost:8080/models/available | jq .

# Install a model from the gallery
curl http://localhost:8080/models/apply \
    -H "Content-Type: application/json" \
    -d '{"id": "llama-3.1-8b-instruct"}'

# Status
curl http://localhost:8080/models/jobs/<job-id>

Gallery YAML reference (excerpt):

- name: llama-3.1-8b-instruct
  url: github:mudler/LocalAI/gallery/llama3.1.yaml@master
  files:
    - filename: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
      sha256: ...
      uri: huggingface://bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

You can host your own gallery by serving a YAML file at any HTTP URL and pointing LocalAI at it.
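A sketch of wiring in a custom gallery via the GALLERIES environment variable, which takes a JSON list of name/URL pairs (the URL below is a placeholder):

docker run -p 8080:8080 \
    -e GALLERIES='[{"name":"my-gallery","url":"https://example.com/index.yaml"}]' \
    -v ./models:/build/models \
    localai/localai:latest-aio-cpu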


Backends: llama.cpp, vLLM, transformers, exllama, diffusers, etc. {#backends}

| Backend | Purpose | Format |
|---|---|---|
| llama.cpp | Default LLM (GGUF) | GGUF |
| vLLM | High-throughput LLM | HF / AWQ / FP8 |
| transformers | HF Transformers | HF FP16/BF16 |
| exllama / exllamav2 | Fast INT4 on NVIDIA | EXL2 / GPTQ |
| diffusers | Image generation | HF diffusers |
| whisper | Speech-to-text | GGUF |
| piper | Lightweight TTS | ONNX |
| coqui | Voice-cloning TTS | XTTS v2 |
| bark | Text-to-audio | PT |
| outetts | Long-form TTS | GGUF |
| sentencetransformers / bert | Embeddings | HF / GGUF |
| rerankers | Reranker | HF / GGUF |

You set the backend per model in the YAML config:

name: llama-3.1-70b-vllm
backend: vllm
parameters:
  model: meta-llama/Llama-3.1-70B-Instruct-AWQ-INT4
  quantization: awq
  gpu_memory_utilization: 0.92

This is a major LocalAI strength — different models can use different backends on the same server, all behind one OpenAI-compatible endpoint.


Your First Chat Completion {#first-chat}

# Install Llama 3.1 8B from the gallery
curl http://localhost:8080/models/apply \
    -H "Content-Type: application/json" \
    -d '{"id": "llama-3.1-8b-instruct"}'

# Wait for download to finish (poll /models/jobs/<id>)

# Use as OpenAI
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role":"user","content":"Hello"}],
        "temperature": 0.7
    }'

Or via the Python OpenAI SDK (no code change needed beyond base_url and a placeholder API key):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")
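
From there, every call looks like the hosted API. Continuing the snippet above, against the gallery model installed earlier:

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)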

Embeddings {#embeddings}

curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "model": "nomic-embed-text-v1.5",
        "input": "The quick brown fox"
    }'

LocalAI ships gallery entries for Nomic Embed, BGE-M3, BGE-Small, GTE, E5, and Sentence-Transformers MiniLM. Compare with Local AI Embeddings Guide and Local vs OpenAI Embeddings.
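The same SDK client works for embeddings. A minimal sketch, assuming the nomic-embed-text-v1.5 gallery model from the curl example above is installed:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")
resp = client.embeddings.create(
    model="nomic-embed-text-v1.5",
    input=["The quick brown fox", "A fast auburn canine"],
)
vectors = [d.embedding for d in resp.data]
print(len(vectors), len(vectors[0]))  # number of vectors, dimensionality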


Image Generation (Stable Diffusion, Flux) {#image-gen}

curl http://localhost:8080/v1/images/generations \
    -H "Content-Type: application/json" \
    -d '{
        "model": "sdxl",
        "prompt": "a cinematic photo of a cyberpunk samurai",
        "size": "1024x1024",
        "n": 1,
        "response_format": "b64_json"
    }'

Backends: diffusers (default for SD 1.5 / SDXL / SD 3.5 / Flux), stablediffusion (legacy llama.cpp). For maximum flexibility (ControlNet, IPAdapter, LoRA stacks), use ComfyUI directly; LocalAI is the right answer for "drop-in OpenAI image API."
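Because the response is b64_json, clients must decode it before writing to disk. A minimal sketch with the OpenAI SDK, assuming the sdxl gallery model is installed (output filename is arbitrary):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")
resp = client.images.generate(
    model="sdxl",
    prompt="a cinematic photo of a cyberpunk samurai",
    size="1024x1024",
    response_format="b64_json",
)
with open("samurai.png", "wb") as f:
    f.write(base64.b64decode(resp.data[0].b64_json))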


Audio: Whisper, TTS, Bark, MusicGen {#audio}

Speech-to-text

curl http://localhost:8080/v1/audio/transcriptions \
    -F file="@audio.wav" \
    -F model="whisper-1"

Text-to-speech

curl http://localhost:8080/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tts-1",
        "voice": "alloy",
        "input": "Hello world"
    }' --output speech.wav

For voice cloning (Coqui XTTS v2):

# config/xtts.yaml
name: my-voice
backend: coqui
parameters:
  model: tts_models/multilingual/multi-dataset/xtts_v2
  voice_wav: /build/voices/reference.wav

curl http://localhost:8080/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model": "my-voice", "input": "Cloned voice output"}' \
    --output cloned.wav

Vision: LLaVA and Llama 3.2 Vision {#vision}

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llava-1.6-mistral",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
            ]
        }]
    }'

Compatible: LLaVA-1.5/1.6, Llama 3.2 11B / 90B Vision, Qwen 2-VL, MiniCPM-V, Pixtral.
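From Python, the only extra step is base64-encoding the image into a data URL. A minimal sketch, assuming a llava-1.6-mistral gallery model is installed and a local cat.jpg exists (hypothetical filename):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")
with open("cat.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="llava-1.6-mistral",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)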


Reranker {#reranker}

curl http://localhost:8080/v1/rerank \
    -H "Content-Type: application/json" \
    -d '{
        "model": "bge-reranker-v2-m3",
        "query": "What is local AI?",
        "documents": ["Doc 1...", "Doc 2..."]
    }'

Useful in RAG pipelines for re-ranking after vector search. See Vector Databases Comparison.


Function Calling and JSON Mode {#function-calling}

{
  "model": "llama-3.1-8b-instruct",
  "messages": [...],
  "tools": [{"type": "function", "function": {...}}],
  "tool_choice": "auto"
}

Set response_format to {"type": "json_object"} for JSON mode, or to {"type": "json_schema", "json_schema": {...}} for schema-validated output. Support for both is backend-dependent.
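
End to end, a tool-calling round trip follows the standard OpenAI flow. A minimal sketch with a hypothetical get_weather tool (whether the model reliably emits tool calls depends on the backend and model):

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))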


Custom Models and YAML Configs {#custom-models}

Drop a YAML into models/:

name: my-llama
backend: llama-cpp
parameters:
  model: my-llama-Q5_K_M.gguf
  threads: 8
context_size: 16384
gpu_layers: 999
f16: true

template:
  chat: |
    {{- range .Messages }}
    {{ if eq .Role "user" }}<|im_start|>user
    {{ .Content }}<|im_end|>
    {{ else if eq .Role "assistant" }}<|im_start|>assistant
    {{ .Content }}<|im_end|>
    {{ end }}
    {{ end }}<|im_start|>assistant

Restart and the model is available at /v1/chat/completions with "model": "my-llama".
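
A quick smoke test, assuming the YAML above was saved as models/my-llama.yaml:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "my-llama", "messages": [{"role":"user","content":"ping"}]}'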


P2P Federated Mode {#p2p}

LocalAI 2.0+ supports peer-to-peer federation: multiple LocalAI nodes form a cluster and route requests to the node with the requested model loaded.

# Node 1 — bootstrap node; generates the join token
local-ai run --p2p --p2p-network-id mynet

# Node 2 — joins the network with that token
local-ai run --p2p --p2p-token <token-from-node-1>

Useful for distributed home-lab setups where you want to spread models across multiple machines without setting up an explicit reverse proxy.


Authentication, Rate Limiting, Multi-Tenancy {#auth}

LocalAI has basic API-key auth (set API_KEY=sk-... on the server), but for multi-tenant production deployments put it behind LiteLLM or another gateway. See AI Gateway with LiteLLM.
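
A minimal sketch of key-gated access; the key value below is a placeholder:

docker run -p 8080:8080 -e API_KEY=sk-local-123 \
    -v ./models:/build/models \
    localai/localai:latest-aio-cpu

curl http://localhost:8080/v1/models \
    -H "Authorization: Bearer sk-local-123"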


Observability {#observability}

LocalAI exposes /metrics (Prometheus). Key metrics:

| Metric | Meaning |
|---|---|
| localai_requests_total | Total requests by endpoint |
| localai_request_duration_seconds | Latency histogram |
| localai_tokens_total | Tokens generated |
| localai_models_loaded | Models currently in memory |

Pair with Grafana for dashboards.
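
A minimal Prometheus scrape config for that endpoint (job name and target are placeholders for your deployment):

# prometheus.yml (excerpt)
scrape_configs:
  - job_name: localai
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8080"]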


Tuning Recipes {#tuning}

Single-GPU NVIDIA workstation

# models/llama-3.1-8b.yaml
name: llama-3.1-8b
backend: llama-cpp
parameters:
  model: llama-3.1-8b-instruct-Q5_K_M.gguf
  threads: 8
context_size: 16384
gpu_layers: 999
f16: true
flash_attention: true

High-throughput server (vLLM backend)

name: llama-3.1-8b-vllm
backend: vllm
parameters:
  model: meta-llama/Llama-3.1-8B-Instruct
  dtype: bfloat16
  gpu_memory_utilization: 0.92
  max_model_len: 16384
  enable_prefix_caching: true
  enable_chunked_prefill: true

Multimodal AIO server

Pull localai/localai:latest-aio-gpu-nvidia-cuda-12 and the image auto-populates LLM, embedding, image, audio, and vision presets out of the box. Once the container is up, /v1/chat/completions, /v1/embeddings, /v1/images/generations, and /v1/audio/speech all work immediately.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| Model not found | Not in gallery / typo | List with /models/available |
| OOM at load | gpu_layers too high | Lower it or use a smaller quant |
| Slow on first request | Cold start | Send a warmup request after install |
| Backend not found | Wrong AIO image | Pull the image matching your accelerator |
| AMD: hipErrorNoBinaryForGpu | Old gfx architecture | Set HSA_OVERRIDE_GFX_VERSION; see AMD ROCm Guide |
| TTS sounds robotic | Using piper, not coqui | Configure a coqui-xtts model |
| Image gen blank | Wrong backend | Set backend: diffusers for SDXL/SD3.5/Flux |
| /v1/embeddings returns wrong dims | Wrong embedding backend | Use sentencetransformers for HF models |



Sources: LocalAI GitHub | LocalAI docs | mudler's blog | internal benchmarks (NVIDIA, AMD, Apple).
