MLC-LLM Setup Guide (2026): Cross-Platform LLM Inference on Any Device
MLC-LLM is a universal compiler for LLM inference. Built on Apache TVM, it produces device-specific binary libraries that run on every major backend: CUDA, ROCm, Metal, Vulkan, OpenCL, WebGPU, and CPU. No other mainstream engine reaches the browser, Android, iOS, and Intel Arc natively from a single codebase.
This guide covers everything: how the compilation pipeline works, installing MLCEngine, compiling models for each backend, the WebLLM browser runtime, the iOS and Android apps, MLCEngine OpenAI-compatible serving, quantization choices, and benchmarks across NVIDIA / AMD / Apple / mobile.
Table of Contents
- What MLC-LLM Is and Why TVM Matters
- Hardware Coverage
- Installation: pip, Docker, Source
- The Pipeline: Convert → Compile → Run
- Your First Model: Llama 3.2 3B on CUDA
- Quantization Formats (q4f16_1, q4f32_1, q3f16_1)
- MLCEngine OpenAI-Compatible Server
- Apple Silicon (Metal)
- AMD GPUs (ROCm + Vulkan)
- Intel Arc (Vulkan)
- Android Deployment
- iOS Deployment
- WebLLM: Running in the Browser
- Long Context, FlashAttention, KV Cache
- Tool Calling and Structured Output
- Benchmarks Across Platforms
- Tuning Recipes
- Common Errors
- FAQ
What MLC-LLM Is and Why TVM Matters {#what-it-is}
MLC-LLM stands for "Machine Learning Compilation for Large Language Models." It is a project from CMU and the OctoML / Apache TVM community.
The compiler approach:
Hugging Face model
│
▼
[mlc_llm convert_weight] # weight conversion
│
▼
MLC checkpoint format
│
▼
[mlc_llm compile] # TVM compiles to device-specific code
│
▼
.so (Linux / Android) / .dylib (macOS) / .dll (Windows) / .wasm (WebGPU) / .a (iOS)
│
▼
[mlc_llm serve / chat / library APIs]
TVM autotunes kernels for the target device. The output is a binary that performs comparably to hand-written CUDA on NVIDIA, hand-written Metal on Apple, hand-written Vulkan on Intel — without you needing to write any of those.
The trade-off, similar to TensorRT-LLM: compilation takes time (10-30 min per target). The win: one Python source produces optimized code for every backend.
Hardware Coverage {#hardware-coverage}
| Backend | Hardware | Status |
|---|---|---|
| CUDA | NVIDIA RTX 20-series and newer | Production |
| ROCm | AMD Radeon RX 7000+, MI series | Production |
| Vulkan | Any Vulkan 1.2 GPU (Intel Arc, AMD, NVIDIA, Adreno) | Production |
| Metal | Apple M1/M2/M3/M4 | Production |
| OpenCL | Older GPUs, mobile SoCs | Working |
| WebGPU | Modern browsers (Chrome 113+, Safari 18+, Edge) | WebLLM |
| Android | Adreno (Qualcomm), Mali (Arm) | Production |
| iOS | A14+ (iPhone 12+, iPad Pro M-series) | Production |
| CPU | x86_64 + AVX2, ARM64 | Working |
The diversity is unique. No other major LLM inference engine covers WebGPU + iOS + Android + Intel Arc + AMD ROCm + NVIDIA from one codebase.
Installation: pip, Docker, Source {#installation}
pip (recommended)
python3.11 -m venv ~/venvs/mlc
source ~/venvs/mlc/bin/activate
# CUDA 12.4
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-cu124 mlc-ai-cu124
# Metal (Mac)
pip install --pre -U -f https://mlc.ai/wheels mlc-llm mlc-ai
# ROCm 6.2
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-rocm62 mlc-ai-rocm62
# Vulkan (Intel Arc, multi-vendor)
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-vulkan mlc-ai-vulkan
# Verify
python -c "import mlc_llm; print(mlc_llm.__version__)"
Docker
docker run --gpus all -it --rm \
-p 8000:8000 \
-v $(pwd)/models:/models \
mlcaidev/mlc-llm:latest \
mlc_llm serve /models/Llama-3.2-3B-q4f16_1
From source
git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm
mkdir build && cd build
python ../cmake/gen_cmake_config.py # answer y/n for each backend
cmake .. && cmake --build . --parallel
Source builds take 30-60 minutes; only needed for custom kernels or unsupported targets.
The Pipeline: Convert → Compile → Run {#pipeline}
# 1. Convert HF weights to MLC format
mlc_llm convert_weight ./Llama-3.2-3B-Instruct \
--quantization q4f16_1 \
--output ./mlc/Llama-3.2-3B-Instruct-q4f16_1
# 2. Generate model config
mlc_llm gen_config ./Llama-3.2-3B-Instruct \
--quantization q4f16_1 \
--conv-template llama-3 \
--output ./mlc/Llama-3.2-3B-Instruct-q4f16_1
# 3. Compile to target device
mlc_llm compile ./mlc/Llama-3.2-3B-Instruct-q4f16_1/mlc-chat-config.json \
--device cuda \
--output ./mlc/Llama-3.2-3B-Instruct-q4f16_1-cuda.so
# 4. Run via Python or serve
mlc_llm chat ./mlc/Llama-3.2-3B-Instruct-q4f16_1 \
--model-lib ./mlc/Llama-3.2-3B-Instruct-q4f16_1-cuda.so
The mlc-chat-config.json generated in step 2 lives in the model directory and carries the settings the runtime needs (conversation template, quantization, context window); the compiled library itself is passed separately via --model-lib, as in step 4.
Your First Model: Llama 3.2 3B on CUDA {#first-model}
# Download pre-quantized weights from the MLC Hugging Face hub
git lfs install
git clone https://huggingface.co/mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC
# Run
mlc_llm serve HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC \
--host 0.0.0.0 --port 8000
# Test (OpenAI compatible)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b",
"messages": [{"role":"user","content":"Hello"}]
}'
The HF:// prefix tells MLCEngine to download from Hugging Face directly. For local models pass a filesystem path.
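The same model can also be driven from Python without the HTTP server, via the MLCEngine API from the MLC-LLM docs. A minimal streaming sketch; the local-path variant in the comment assumes the keyword is model_lib, which may differ in older releases:
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)
# Local variant (assumption: the keyword is model_lib in current releases):
# engine = MLCEngine("./mlc/Llama-3.2-3B-Instruct-q4f16_1",
#                    model_lib="./mlc/Llama-3.2-3B-Instruct-q4f16_1-cuda.so")

# Stream tokens through the OpenAI-style chat API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What does MLC-LLM compile?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()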
Quantization Formats {#quantization}
| Quantization | Bits | Group Size | Quality | Recommended Use |
|---|---|---|---|---|
| q4f16_1 | 4 | 32 | Best 4-bit default | General use |
| q4f16_0 | 4 | 32 | Slightly worse than q4f16_1 | Legacy |
| q4f32_1 | 4 | 32 | Highest 4-bit quality (FP32 scales) | Quality-critical |
| q3f16_1 | 3 | 32 | Smaller, lower quality | Very tight VRAM |
| q0f16 | 16 | n/a | FP16, no quantization | Plenty of VRAM |
| q0f32 | 32 | n/a | FP32 | Reference |
In practice: q4f16_1 for 95% of users; q4f32_1 if you have VRAM headroom and quality matters; q3f16_1 only when you have to.
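To decide between formats, a back-of-the-envelope memory estimate is usually enough: weight memory is roughly parameters × bits / 8, plus a few percent for group scales, plus KV cache and runtime overhead. A rough sketch (the 5% scale overhead is an assumption, not a measured constant):
def approx_weight_gb(params_billion: float, bits: float, scale_overhead: float = 1.05) -> float:
    """Very rough weight footprint in GB for a grouped-quantized checkpoint."""
    return params_billion * bits / 8 * scale_overhead

for name, bits in [("q4f16_1", 4), ("q3f16_1", 3), ("q0f16", 16)]:
    print(f"Llama 3.1 8B {name}: ~{approx_weight_gb(8.0, bits):.1f} GB of weights")
# q4f16_1 ~4.2 GB, q3f16_1 ~3.2 GB, q0f16 ~16.8 GB
# Add KV cache (see the long-context section) and 1-2 GB of runtime overhead.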
MLCEngine OpenAI-Compatible Server {#engine}
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC \
--host 0.0.0.0 --port 8000 \
--device cuda \
--mode local \
--max-batch-size 8
--mode options:
- local: single user, lowest latency
- interactive: small batch, balanced
- server: high throughput, multi-user (similar to vLLM continuous batching)
Endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /health.
Tool calling and JSON mode work via the OpenAI tools and response_format fields. xgrammar is the constrained-generation backend.
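Because the endpoints are OpenAI-compatible, the standard openai Python client works unchanged; only the base_url and the model id (which must match what the server reports under /v1/models) are specific to your deployment. A minimal streaming sketch, assuming the serve command above is running:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC",  # check /v1/models for the exact id
    messages=[{"role": "user", "content": "One sentence on why TVM matters."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)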
Apple Silicon (Metal) {#metal}
Mac-native build is the simplest:
pip install --pre -U -f https://mlc.ai/wheels mlc-llm mlc-ai
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device metal
On M4 Max with Llama 3.1 8B q4f16_1:
| Engine | tok/s |
|---|---|
| llama.cpp Metal | ~58 |
| Ollama (= llama.cpp) | ~58 |
| MLC-LLM Metal | ~62 |
| MLX (separate) | ~55 |
MLC-LLM is competitive with llama.cpp on Metal and slightly faster on some models. For deep coverage of Mac local AI, see MLX vs CUDA and Apple M4 for AI Guide.
AMD GPUs (ROCm + Vulkan) {#amd}
ROCm (RX 7000+, MI series)
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-rocm62 mlc-ai-rocm62
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device rocm
Vulkan (RX 6000-series, RX 5000-series, integrated)
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-vulkan mlc-ai-vulkan
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device vulkan
Vulkan is 60-80% the speed of native ROCm. For RX 6000-series and older where ROCm is unofficial, Vulkan is often the better choice. See AMD ROCm Setup for the broader AMD picture.
Intel Arc (Vulkan) {#intel}
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-vulkan mlc-ai-vulkan
# Verify Vulkan sees the Arc
vulkaninfo --summary | grep "Intel"
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device vulkan
On Arc A770 16GB with Llama 3.1 8B q4f16_1: ~38 tok/s. See Intel Arc A770 Local AI for the dedicated Arc guide.
Android Deployment {#android}
The MLC-LLM Android app source is in android/MLCChat in the repo.
cd mlc-llm/android/MLCChat
# Build with Android Studio (Arctic Fox or newer)
Bundled models in the demo app: Llama 3.2 1B / 3B q4f16_1, Qwen 2.5 1.5B q4f16_1, Phi-3.5 Mini q4f16_1, Gemma 2 2B q4f16_1.
For your own app, use mlc4j (Java/Kotlin bindings) and ship a .so library + tokenizer + config. Per-platform compile:
mlc_llm compile model-config.json \
--device android \
--opt O3 \
--output libmlc-llm-llama-3.2-3b-android.so
Performance on Snapdragon 8 Gen 3 (Galaxy S24): Llama 3.2 3B q4f16_1 at 18-25 tok/s.
iOS Deployment {#ios}
cd mlc-llm/ios/MLCChat
# Build with Xcode 15+
iOS uses Metal as the backend (same as Mac). Bundled models: Llama 3.2 1B / 3B q4f16_1, Qwen 2.5 1.5B / 3B q4f16_1, Phi-3.5 Mini.
For App Store distribution, the limit is your bundle size (typically <500 MB) — ship a 1B q4f16_1 model (~600 MB) via on-demand download instead.
Performance on iPhone 15 Pro: Llama 3.2 3B q4f16_1 at 22-30 tok/s, sustained until thermal throttling around 5-10 minutes of continuous load.
WebLLM: Running in the Browser {#webllm}
WebLLM is the JavaScript / TypeScript wrapper around MLC-LLM compiled to WebGPU + WebAssembly.
npm install @mlc-ai/web-llm
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine(
  "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  { initProgressCallback: p => console.log(p) }  // reports download/compile progress
);

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
});

// Browser context: accumulate the streamed deltas (process.stdout is Node-only).
let text = "";
for await (const chunk of reply) {
  text += chunk.choices[0]?.delta?.content || "";
}
console.log(text);
The model downloads on first run (1-5 GB depending on size) and caches in IndexedDB. Subsequent loads are instant.
Browser support:
- Chrome 113+ ✅
- Safari 18+ ✅ (macOS 15, iOS 18)
- Edge 113+ ✅
- Firefox: experimental (enable dom.webgpu.enabled in about:config)
Performance (Llama 3.1 8B q4f16_1):
| Hardware | tok/s |
|---|---|
| MacBook M3 Pro (Safari 18) | 38 |
| RTX 4090 (Chrome) | 72 |
| Intel UHD 630 (Chrome) | 6 |
| iPhone 15 Pro (Safari 18) | 18 |
For deeper coverage see WebLLM Browser AI Guide.
Long Context, FlashAttention, KV Cache {#long-context}
MLC-LLM supports up to the model's native context length plus RoPE scaling for extension. FlashAttention is implemented via TVM (not the official Tri Dao kernel) and enabled automatically.
KV cache quantization is supported via --quantization q4f16_kv etc. (limited to specific backends as of mid-2026).
For 32K-context Llama 3.1 8B on a 12 GB RTX 3060:
mlc_llm compile model-config.json \
--device cuda \
--max-seq-len 32768 \
--output llama-3.1-8b-32k-cuda.so
max-seq-len controls the engine's reserved KV-cache memory. Lower = more concurrent requests fit; higher = longer single requests fit.
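To see why 32K fits on 12 GB, estimate the KV cache from the model architecture: each token stores keys and values for every layer and KV head. A sketch using Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128, FP16 cache):
# Approximate FP16 KV-cache footprint for Llama 3.1 8B.
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2

def kv_cache_gb(seq_len: int) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # keys + values
    return per_token * seq_len / 1e9

print(f"~{kv_cache_gb(32768):.1f} GB at 32K tokens")
# ~4.3 GB of KV cache + ~4.2 GB of q4f16_1 weights + runtime overhead fits within 12 GB.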
Tool Calling and Structured Output {#tool-calling}
MLCEngine supports OpenAI-compatible tool calling for chat-tuned models:
{
"model": "llama-3.1-8b",
"messages": [...],
"tools": [{"type": "function", "function": {"name": "get_weather", "parameters": {...}}}],
"tool_choice": "auto"
}
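From Python, the same request goes through the openai client; whether the model actually emits a tool call depends on the model and prompt, so tool_calls can be empty. A sketch against the local server, using the same hypothetical get_weather function as the payload above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, same as the JSON payload above
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.1-8b",  # must match the id the server reports under /v1/models
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
    tool_choice="auto",
)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)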
For JSON-schema constrained generation, pass response_format:
{
"response_format": {
"type": "json_schema",
"json_schema": {"name": "person", "schema": {...}}
}
}
xgrammar is the backend; throughput overhead is 5-15%, similar to vLLM.
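The same pattern works for structured output: pass the response_format payload shown above through the openai client and the server constrains decoding with xgrammar. A minimal sketch (the person schema is illustrative):
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Invent a person and return JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": schema},
    },
)
print(json.loads(resp.choices[0].message.content))  # parses because decoding is schema-constrained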
Benchmarks Across Platforms {#benchmarks}
Llama 3.1 8B q4f16_1, batch size 1, 4K context, 128 output tokens.
| Hardware / Backend | tok/s |
|---|---|
| RTX 4090 (CUDA, MLC) | 138 |
| RTX 4090 (CUDA, vLLM AWQ) | 155 |
| RTX 4090 (CUDA, ExLlamaV2) | 165 |
| MacBook M4 Max (Metal, MLC) | 62 |
| MacBook M4 Max (Metal, llama.cpp) | 58 |
| RX 7900 XTX (ROCm, MLC) | 75 |
| Intel Arc A770 (Vulkan, MLC) | 38 |
| RTX 4090 (WebGPU, browser, WebLLM) | 72 |
| Snapdragon 8 Gen 3 (Adreno, MLC Android) | 11 (3B model) |
| iPhone 15 Pro (Metal, MLC iOS) | 12 (3B model) |
MLC trails specialized engines on individual platforms (vLLM on CUDA, llama.cpp on Metal) but is competitive everywhere and the only option for some.
Tuning Recipes {#tuning}
NVIDIA RTX 4090
mlc_llm compile model-config.json \
--device cuda --opt O3 \
--max-seq-len 16384 --max-batch-size 8 \
--output llama-3.1-8b-cuda.so
Apple M4 Max
mlc_llm compile model-config.json \
--device metal --max-seq-len 16384 \
--output llama-3.1-8b-metal.dylib
Android
mlc_llm compile model-config.json \
--device android \
--max-seq-len 4096 \
--quantization q4f16_1 \
--output libmlc-llm-llama-3.2-3b-android.so
WebGPU (browser)
mlc_llm compile model-config.json \
--device webgpu \
--max-seq-len 4096 \
--output llama-3.1-8b-webgpu.wasm
Common Errors {#troubleshooting}
| Error | Cause | Fix |
|---|---|---|
| Cannot find a suitable device | Wrong wheel | Install the CUDA / ROCm / Vulkan / Metal-specific wheel |
| Compile takes hours | First-time TVM autotune | Subsequent compiles use cache; OK to wait once |
| WebGPU "out of memory" in browser | Browser tab limit | Use smaller model or smaller max-seq-len |
| Model loads but errors per request | Wrong conv-template | Pass correct --conv-template (llama-3, mistral, etc.) |
| Vulkan: validation layer warnings | Driver version | Update GPU drivers |
| Android app crashes on launch | Wrong .so for ABI | Build for arm64-v8a explicitly |
| iOS build fails on Metal shaders | Xcode version | Xcode 15+ required |
FAQ {#faq}
Sources: MLC-LLM GitHub | MLC-LLM docs | WebLLM | Apache TVM | Internal benchmarks across NVIDIA, AMD, Apple, Android, iOS, browser.