LocalAI Setup Guide (2026): The OpenAI-Compatible Drop-In Replacement
LocalAI is the closest thing to a one-stop self-hosted replacement for the OpenAI API. One binary speaks every major OpenAI endpoint (chat, completions, embeddings, image generation, transcription, text-to-speech, vision) and orchestrates the right backend (llama.cpp, vLLM, Stable Diffusion, Whisper, Bark, Coqui) for each. Point your existing OpenAI client at a LocalAI base URL and most of the API surface just works.
This guide covers everything: installation across Docker / Kubernetes / bare metal, the model gallery, configuring backends, running multimodal endpoints (text + image + audio + vision) from one server, function calling, JSON mode, P2P federated mode, and tuning for production.
Table of Contents
- What LocalAI Is
- Capabilities and Endpoints
- Hardware Requirements
- Installation: Docker, Bare Metal, Kubernetes
- The Model Gallery
- Backends: llama.cpp, vLLM, transformers, exllama, diffusers, etc.
- Your First Chat Completion
- Embeddings
- Image Generation (Stable Diffusion, Flux)
- Audio: Whisper, TTS, Bark, MusicGen
- Vision: LLaVA and Llama 3.2 Vision
- Reranker
- Function Calling and JSON Mode
- Custom Models and YAML Configs
- P2P Federated Mode
- Authentication, Rate Limiting, Multi-Tenancy
- Observability
- Tuning Recipes
- Troubleshooting
What LocalAI Is {#what-it-is}
LocalAI (mudler/LocalAI on GitHub) is a Go-based REST server that:
- Implements the OpenAI API surface (chat, embeddings, images, audio, vision, etc.)
- Loads pluggable backends (llama.cpp, vLLM, transformers, exllama, diffusers, whisper, piper, bark, coqui, etc.) per model
- Ships a curated model gallery for one-click model installation
- Provides a built-in WebUI for chat, image gen, transcription, voice cloning
- Supports CPU, NVIDIA CUDA, AMD ROCm, Intel oneAPI/SYCL, Apple Metal
- Open-source MIT-licensed, ~25K GitHub stars
Project: github.com/mudler/LocalAI. Maintainer: Ettore Di Giacinto (mudler).
Capabilities and Endpoints {#endpoints}
| OpenAI Endpoint | LocalAI Backend |
|---|---|
| POST /v1/chat/completions | llama.cpp / vLLM / transformers |
| POST /v1/completions | llama.cpp / vLLM / transformers |
| POST /v1/embeddings | llama.cpp / sentence-transformers / bert |
| POST /v1/images/generations | diffusers (SD 1.5 / SDXL / SD 3.5 / Flux) |
| POST /v1/audio/transcriptions | whisper.cpp / faster-whisper |
| POST /v1/audio/translations | whisper |
| POST /v1/audio/speech | piper / bark / coqui-xtts / outetts |
| POST /v1/rerank | bge-reranker / jina-reranker |
| Vision (multimodal in chat) | LLaVA / Llama 3.2 Vision / Qwen 2-VL |
| POST /v1/fine_tuning/jobs | LoRA training (limited) |
| POST /v1/files, /assistants | Limited compatibility |
The full OpenAI feature parity is documented in LocalAI's compatibility matrix.
Hardware Requirements {#requirements}
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | x86_64 / ARM64 | 8-core+ |
| RAM | 8 GB (CPU-only small models) | 32 GB+ |
| GPU | None (CPU works) | 12 GB+ VRAM |
| Disk | 20 GB | NVMe; model files add up fast |
| OS | Linux, macOS, Windows (WSL2) | Ubuntu 22.04 LTS |
GPU support: NVIDIA (CUDA 11.8 / 12.1 / 12.4 / 12.6), AMD (ROCm 6.2 / Vulkan), Intel (oneAPI / SYCL), Apple (Metal). The "AIO" (all-in-one) Docker images bundle the right backends per accelerator.
Installation: Docker, Bare Metal, Kubernetes {#installation}
Docker (recommended)
# CPU-only
docker run -p 8080:8080 --name localai \
-v ./models:/build/models \
localai/localai:latest-aio-cpu
# NVIDIA CUDA 12
docker run -p 8080:8080 --gpus all --name localai \
-v ./models:/build/models \
localai/localai:latest-aio-gpu-nvidia-cuda-12
# AMD ROCm
docker run -p 8080:8080 --device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
-v ./models:/build/models \
localai/localai:latest-aio-gpu-amd-rocm
# Intel oneAPI / SYCL
docker run -p 8080:8080 --device=/dev/dri \
-v ./models:/build/models \
localai/localai:latest-aio-gpu-intel-f16
The aio (all-in-one) tags bundle preset model configurations for the chat / embeddings / image / audio default lineup so the server is ready to use without any setup beyond pulling.
docker-compose.yml
services:
  localai:
    image: localai/localai:latest-aio-gpu-nvidia-cuda-12
    ports: ["8080:8080"]
    runtime: nvidia
    environment:
      - DEBUG=true
      - MODELS_PATH=/build/models
      - THREADS=8
    volumes:
      - ./models:/build/models
      - ./config:/build/config
    deploy:
      resources:
        reservations:
          devices:
            - { driver: nvidia, count: all, capabilities: [gpu] }
Bare metal (Linux)
curl https://localai.io/install.sh | sh
sudo systemctl enable --now local-ai
The install script picks the right binary for your platform and installs to /usr/bin/local-ai plus a systemd unit.
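Before moving on, it is worth confirming the server answers on port 8080. A minimal Python check, assuming the default port and the standard OpenAI-compatible /v1/models route:
import requests

# List the models the server currently exposes (OpenAI-compatible route)
resp = requests.get("http://localhost:8080/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])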
Kubernetes (Helm)
helm repo add localai https://go-skynet.github.io/helm-charts/
helm install localai localai/local-ai \
--set deployment.image.tag=latest-aio-gpu-nvidia-cuda-12 \
--set resources.limits."nvidia\.com/gpu"=1 \
--set persistence.models.size=200Gi
For declarative model loading via the LocalAI Operator: install the operator, then kubectl apply -f model-llama-3.1-8b.yaml with a LocalAIModel CRD.
The Model Gallery {#gallery}
LocalAI's gallery YAML lists pre-configured models. Browse via the WebUI (http://localhost:8080 → "Models" tab) or CLI:
# List available
curl http://localhost:8080/models/available | jq .
# Install a model from the gallery
curl http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "llama-3.1-8b-instruct"}'
# Status
curl http://localhost:8080/models/jobs/<job-id>
Gallery YAML reference (excerpt):
- name: llama-3.1-8b-instruct
  url: github:mudler/LocalAI/gallery/llama3.1.yaml@master
  files:
    - filename: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
      sha256: ...
      uri: huggingface://bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
You can host your own gallery by serving a YAML file at any HTTP URL and pointing LocalAI at it.
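If you would rather script the gallery flow than poll by hand, here is a rough Python sketch of the same apply-then-poll loop. The job-status field names (uuid, processed, error) are assumptions that may differ between LocalAI versions, so inspect the raw JSON on your install:
import requests, time

BASE = "http://localhost:8080"

# Kick off a gallery install (same request as the curl example above)
job = requests.post(f"{BASE}/models/apply", json={"id": "llama-3.1-8b-instruct"}).json()
job_id = job.get("uuid") or job.get("id")  # field name may vary by version

# Poll the job endpoint until the download and setup finish
while True:
    status = requests.get(f"{BASE}/models/jobs/{job_id}").json()
    print(status)
    if status.get("processed") or status.get("error"):  # assumed terminal fields
        break
    time.sleep(5)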
Backends: llama.cpp, vLLM, transformers, exllama, diffusers, etc. {#backends}
| Backend | Purpose | Format |
|---|---|---|
| llama.cpp | Default LLM (GGUF) | GGUF |
| vLLM | High-throughput LLM | HF / AWQ / FP8 |
| transformers | HF Transformers | HF FP16/BF16 |
| exllama / exllamav2 | Fast INT4 NVIDIA | EXL2 / GPTQ |
| diffusers | Image generation | HF diffusers |
| whisper | Speech-to-text | GGUF |
| piper | Lightweight TTS | ONNX |
| coqui | Voice cloning TTS | XTTS v2 |
| bark | Text-to-audio | PT |
| outetts | Long-form TTS | GGUF |
| sentencetransformers / bert | Embeddings | HF / GGUF |
| rerankers | Reranker | HF / GGUF |
You set the backend per model in the YAML config:
name: llama-3.1-70b-vllm
backend: vllm
quantization: awq
gpu_memory_utilization: 0.92
parameters:
  model: meta-llama/Llama-3.1-70B-Instruct-AWQ-INT4
This is a major LocalAI strength — different models can use different backends on the same server, all behind one OpenAI-compatible endpoint.
Your First Chat Completion {#first-chat}
# Install Llama 3.1 8B from the gallery
curl http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "llama-3.1-8b-instruct"}'
# Wait for download to finish (poll /models/jobs/<id>)
# Use as OpenAI
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b-instruct",
"messages": [{"role":"user","content":"Hello"}],
"temperature": 0.7
}'
Or via Python OpenAI SDK (no code change needed beyond base_url):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")
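From there the standard SDK calls work unchanged. Continuing with the client above and the gallery model installed earlier:
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
)
print(response.choices[0].message.content)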
Embeddings {#embeddings}
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text-v1.5",
"input": "The quick brown fox"
}'
LocalAI ships gallery entries for Nomic Embed, BGE-M3, BGE-Small, GTE, E5, and Sentence-Transformers MiniLM. Compare with Local AI Embeddings Guide and Local vs OpenAI Embeddings.
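The same OpenAI client covers embeddings too. A small sketch, assuming the nomic-embed-text-v1.5 gallery entry is installed:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# Embed a batch of strings; the response shape matches the OpenAI API
resp = client.embeddings.create(
    model="nomic-embed-text-v1.5",
    input=["The quick brown fox", "jumps over the lazy dog"],
)
print(len(resp.data), "vectors of dimension", len(resp.data[0].embedding))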
Image Generation (Stable Diffusion, Flux) {#image-gen}
curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "sdxl",
"prompt": "a cinematic photo of a cyberpunk samurai",
"size": "1024x1024",
"n": 1,
"response_format": "b64_json"
}'
Backends: diffusers (default for SD 1.5 / SDXL / SD 3.5 / Flux), stablediffusion (legacy ggml-based backend). For maximum flexibility (ControlNet, IPAdapter, LoRA stacks), use ComfyUI directly; LocalAI is the right answer for "drop-in OpenAI image API."
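With response_format set to b64_json, the image comes back base64-encoded in the response body. A minimal Python sketch that issues the same request via the OpenAI SDK and writes the decoded bytes to disk (the output filename is arbitrary):
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

resp = client.images.generate(
    model="sdxl",
    prompt="a cinematic photo of a cyberpunk samurai",
    size="1024x1024",
    n=1,
    response_format="b64_json",
)

# Decode the base64 payload into a PNG file
with open("samurai.png", "wb") as f:
    f.write(base64.b64decode(resp.data[0].b64_json))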
Audio: Whisper, TTS, Bark, MusicGen {#audio}
Speech-to-text
curl http://localhost:8080/v1/audio/transcriptions \
-F file="@audio.wav" \
-F model="whisper-1"
Text-to-speech
curl http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"voice": "alloy",
"input": "Hello world"
}' --output speech.wav
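Both audio endpoints are reachable from the OpenAI SDK as well. A sketch assuming a local audio.wav and the default whisper-1 / tts-1 preset names used above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# Speech-to-text: upload a local file to the transcription endpoint
with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
print(transcript.text)

# Text-to-speech: stream the generated audio straight to disk
with client.audio.speech.with_streaming_response.create(
    model="tts-1", voice="alloy", input="Hello world"
) as speech:
    speech.stream_to_file("speech.wav")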
For voice cloning (Coqui XTTS v2):
# config/xtts.yaml
name: my-voice
backend: coqui
parameters:
  model: tts_models/multilingual/multi-dataset/xtts_v2
  voice_wav: /build/voices/reference.wav
curl http://localhost:8080/v1/audio/speech \
-d '{"model": "my-voice", "input": "Cloned voice output"}' \
--output cloned.wav
Vision: LLaVA and Llama 3.2 Vision {#vision}
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-1.6-mistral",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
]
}]
}'
Compatible: LLaVA-1.5/1.6, Llama 3.2 11B / 90B Vision, Qwen 2-VL, MiniCPM-V, Pixtral.
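To send a local image rather than a pre-built data URL, encode it to base64 first. A sketch with the OpenAI SDK, assuming a photo.jpg next to the script:
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# Build the data URL the chat vision format expects
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="llava-1.6-mistral",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)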
Reranker {#reranker}
curl http://localhost:8080/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "bge-reranker-v2-m3",
"query": "What is local AI?",
"documents": ["Doc 1...", "Doc 2..."]
}'
Useful in RAG pipelines for re-ranking after vector search. See Vector Databases Comparison.
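/v1/rerank is not part of the OpenAI SDK, so call it directly. A sketch with requests; the result field names (results, index, relevance_score) follow the common Jina-style rerank schema and may differ on your version, so check the raw response:
import requests

resp = requests.post(
    "http://localhost:8080/v1/rerank",
    json={
        "model": "bge-reranker-v2-m3",
        "query": "What is local AI?",
        "documents": ["Doc 1...", "Doc 2..."],
    },
)
resp.raise_for_status()

# Print documents in relevance order
for item in resp.json().get("results", []):
    print(item.get("index"), item.get("relevance_score"))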
Function Calling and JSON Mode {#function-calling}
{
"model": "llama-3.1-8b-instruct",
"messages": [...],
"tools": [{"type": "function", "function": {...}}],
"tool_choice": "auto"
}
Set response_format to {"type": "json_object"} for JSON mode, or to {"type": "json_schema", "json_schema": {...}} for schema-validated output. Support is backend-dependent.
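End to end, the flow looks like any other OpenAI tools call. A sketch with a hypothetical get_weather tool; whether the model actually emits tool_calls depends on the backend and model, as noted above:
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# Hypothetical tool definition in the standard OpenAI schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)

# If the model chose to call the tool, arguments arrive as a JSON string
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))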
Custom Models and YAML Configs {#custom-models}
Drop a YAML into models/:
name: my-llama
backend: llama-cpp
context_size: 16384
threads: 8
gpu_layers: 999
f16: true
parameters:
  model: my-llama-Q5_K_M.gguf
template:
  chat: |
    {{- range .Messages }}
    {{ if eq .Role "user" }}<|im_start|>user
    {{ .Content }}<|im_end|>
    {{ else if eq .Role "assistant" }}<|im_start|>assistant
    {{ .Content }}<|im_end|>
    {{ end }}
    {{ end }}<|im_start|>assistant
Restart and the model is available at /v1/chat/completions with "model": "my-llama".
P2P Federated Mode {#p2p}
LocalAI 2.0+ supports peer-to-peer federation: multiple LocalAI nodes form a cluster and route requests to the node with the requested model loaded.
# Node 1 — token-broadcasting bootstrap
local-ai run --p2p --p2p-network-id mynet
# Node 2 — joins the network
local-ai run --p2p --p2p-token <token-from-node-1>
Useful for distributed home-lab setups where you want to spread models across multiple machines without setting up an explicit reverse proxy.
Authentication, Rate Limiting, Multi-Tenancy {#auth}
LocalAI has basic API-key auth (API_KEY=sk-...), but for multi-tenant production put it behind LiteLLM or another gateway. See AI Gateway with LiteLLM.
Observability {#observability}
LocalAI exposes /metrics (Prometheus). Key metrics:
| Metric | Meaning |
|---|---|
| localai_requests_total | Total requests by endpoint |
| localai_request_duration_seconds | Latency histogram |
| localai_tokens_total | Tokens generated |
| localai_models_loaded | Models currently in memory |
Pair with Grafana for dashboards.
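A quick way to eyeball the metrics without Grafana is to scrape the endpoint and filter for LocalAI's own series. The exact metric names and prefix can vary by version; the ones in the table above are what this sketch assumes:
import requests

# Fetch the Prometheus exposition text and keep LocalAI's own series
metrics = requests.get("http://localhost:8080/metrics").text
for line in metrics.splitlines():
    if line.startswith("localai"):
        print(line)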
Tuning Recipes {#tuning}
Single-GPU NVIDIA workstation
# models/llama-3.1-8b.yaml
name: llama-3.1-8b
backend: llama-cpp
context_size: 16384
threads: 8
gpu_layers: 999
f16: true
flash_attention: true
parameters:
  model: llama-3.1-8b-instruct-Q5_K_M.gguf
High-throughput server (vLLM backend)
name: llama-3.1-8b-vllm
backend: vllm
dtype: bfloat16
gpu_memory_utilization: 0.92
max_model_len: 16384
enable_prefix_caching: true
enable_chunked_prefill: true
parameters:
  model: meta-llama/Llama-3.1-8B-Instruct
Multimodal AIO server
Pull localai/localai:latest-aio-gpu-nvidia-cuda-12 and the gallery auto-populates LLM + embedding + image + audio + vision presets out of the box. Start the container and /v1/chat/completions, /v1/embeddings, /v1/images/generations, and /v1/audio/speech all work immediately.
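A short smoke test confirms the presets are wired up. This sketch assumes only the OpenAI-compatible /v1/models route and picks the first listed model for a one-line chat, which may not be the chat preset on every AIO image, so swap in the right name if needed:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="any")

# See which presets the AIO image registered
models = [m.id for m in client.models.list().data]
print("loaded:", models)

# Fire a minimal chat request against the first registered model
reply = client.chat.completions.create(
    model=models[0],
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)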
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Model not found | Not in gallery / typo | List with /models/available |
| OOM at load | gpu_layers too high | Lower or use smaller quant |
| Slow on first request | Cold start | Send warmup request after install |
| Backend not found | Wrong AIO image | Pull image matching your accelerator |
| AMD: hipErrorNoBinaryForGpu | Old gfx | Set HSA_OVERRIDE_GFX_VERSION; see AMD ROCm Guide |
| TTS sounds robotic | Using piper not coqui | Configure coqui-xtts model |
| Image gen blank | Wrong backend | Set backend: diffusers for SDXL/SD3.5/Flux |
| /v1/embeddings returns wrong dims | Wrong embedding backend | Use sentencetransformers for HF models |
Sources: LocalAI GitHub | LocalAI docs | Mudler's blog | internal benchmarks on NVIDIA, AMD, and Apple hardware.