MLC-LLM Setup Guide (2026): Cross-Platform LLM Inference on Any Device

May 1, 2026
26 min read
LocalAimaster Research Team

MLC-LLM is the universal compiler for LLM inference. Built on Apache TVM, it produces device-specific binary libraries that run on every major backend — CUDA, ROCm, Metal, Vulkan, OpenCL, WebGPU, and CPU. It is the only mainstream path to run LLMs natively in the browser, on Android, on iOS, and on Intel Arc, all from one source.

This guide covers everything: how the compilation pipeline works, installing MLCEngine, compiling models for each backend, the WebLLM browser runtime, the iOS and Android apps, MLCEngine OpenAI-compatible serving, quantization choices, and benchmarks across NVIDIA / AMD / Apple / mobile.

Table of Contents

  1. What MLC-LLM Is and Why TVM Matters
  2. Hardware Coverage
  3. Installation: pip, Docker, Source
  4. The Pipeline: Convert → Compile → Run
  5. Your First Model: Llama 3.2 3B on CUDA
  6. Quantization Formats (q4f16_1, q4f32_1, q3f16_1)
  7. MLCEngine OpenAI-Compatible Server
  8. Apple Silicon (Metal)
  9. AMD GPUs (ROCm + Vulkan)
  10. Intel Arc (Vulkan)
  11. Android Deployment
  12. iOS Deployment
  13. WebLLM: Running in the Browser
  14. Long Context, FlashAttention, KV Cache
  15. Tool Calling and Structured Output
  16. Benchmarks Across Platforms
  17. Tuning Recipes
  18. Common Errors
  19. FAQ

What MLC-LLM Is and Why TVM Matters {#what-it-is}

MLC-LLM stands for "Machine Learning Compilation for Large Language Models." It is a project from CMU and the OctoML / Apache TVM community.

The compiler approach:

Hugging Face model
       │
       ▼
[mlc_llm convert_weight]   # weight conversion
       │
       ▼
MLC checkpoint format
       │
       ▼
[mlc_llm compile]          # TVM compiles to device-specific code
       │
       ▼
.so / .dylib / .dll / .wasm / .so (Android) / .a (iOS)
       │
       ▼
[mlc_llm serve / chat / library APIs]

TVM autotunes kernels for the target device. The output is a binary that performs comparably to hand-written CUDA on NVIDIA, hand-written Metal on Apple, hand-written Vulkan on Intel — without you needing to write any of those.

The trade-off, similar to TensorRT-LLM: compilation takes time (10-30 min per target). The win: one Python source produces optimized code for every backend.


Hardware Coverage {#hardware-coverage}

Backend | Hardware | Status
CUDA | NVIDIA RTX 20-series and newer | Production
ROCm | AMD Radeon RX 7000+, MI series | Production
Vulkan | Any Vulkan 1.2 GPU (Intel Arc, AMD, NVIDIA, Adreno) | Production
Metal | Apple M1/M2/M3/M4 | Production
OpenCL | Older GPUs, mobile SoCs | Working
WebGPU | Modern browsers (Chrome 113+, Safari 18+, Edge) | Via WebLLM
Android | Adreno (Qualcomm), Mali (Arm) | Production
iOS | A14+ (iPhone 12+, iPad Pro M-series) | Production
CPU | x86_64 + AVX2, ARM64 | Working

The diversity is unique. No other major LLM inference engine covers WebGPU + iOS + Android + Intel Arc + AMD ROCm + NVIDIA from one codebase.


Installation: pip, Docker, Source {#installation}

python3.11 -m venv ~/venvs/mlc
source ~/venvs/mlc/bin/activate

# CUDA 12.4
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-cu124 mlc-ai-cu124

# Metal (Mac)
pip install --pre -U -f https://mlc.ai/wheels mlc-llm mlc-ai

# ROCm 6.2
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-rocm62 mlc-ai-rocm62

# Vulkan (Intel Arc, multi-vendor)
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-vulkan mlc-ai-vulkan

# Verify
python -c "import mlc_llm; print(mlc_llm.__version__)"

Docker

docker run --gpus all -it --rm \
    -p 8000:8000 \
    -v $(pwd)/models:/models \
    mlcaidev/mlc-llm:latest \
    mlc_llm serve /models/Llama-3.2-3B-q4f16_1

From source

git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm
mkdir build && cd build
python ../cmake/gen_cmake_config.py    # answer y/n for each backend
cmake .. && cmake --build . --parallel

Source builds take 30-60 minutes; only needed for custom kernels or unsupported targets.


The Pipeline: Convert → Compile → Run {#pipeline}

# 1. Convert HF weights to MLC format
mlc_llm convert_weight ./Llama-3.2-3B-Instruct \
    --quantization q4f16_1 \
    --output ./mlc/Llama-3.2-3B-Instruct-q4f16_1

# 2. Generate model config
mlc_llm gen_config ./Llama-3.2-3B-Instruct \
    --quantization q4f16_1 \
    --conv-template llama-3 \
    --output ./mlc/Llama-3.2-3B-Instruct-q4f16_1

# 3. Compile to target device
mlc_llm compile ./mlc/Llama-3.2-3B-Instruct-q4f16_1/mlc-chat-config.json \
    --device cuda \
    --output ./mlc/Llama-3.2-3B-Instruct-q4f16_1-cuda.so

# 4. Run via Python or serve
mlc_llm chat ./mlc/Llama-3.2-3B-Instruct-q4f16_1 \
    --model-lib ./mlc/Llama-3.2-3B-Instruct-q4f16_1-cuda.so

The mlc-chat-config.json generated in step 2 lives in the model directory and tells the runtime how to use the weights: quantization, conversation template, and context window. The compiled library itself is passed separately via --model-lib, as in step 4.
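Step 4 can also go through the Python library API instead of the CLI chat command. A minimal sketch, assuming the weights and compiled library from steps 1-3 sit at the paths shown; the MLCEngine class and model_lib argument follow the documented Python API, but verify against your installed version:

from mlc_llm import MLCEngine

model = "./mlc/Llama-3.2-3B-Instruct-q4f16_1"

# Point the engine at the converted weights and the compiled CUDA library.
engine = MLCEngine(model, model_lib="./mlc/Llama-3.2-3B-Instruct-q4f16_1-cuda.so")

# OpenAI-style chat call; stream tokens as they are generated.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()  # release GPU memory and background workers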


Your First Model: Llama 3.2 3B on CUDA {#first-model}

# Download from MLC Hugging Face hub (pre-quantized + pre-compiled for cuda)
git lfs install
git clone https://huggingface.co/mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC

# Run
mlc_llm serve HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC \
    --host 0.0.0.0 --port 8000

# Test (OpenAI compatible)
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.2-3b",
        "messages": [{"role":"user","content":"Hello"}]
    }'

The HF:// prefix tells MLCEngine to download the model from Hugging Face directly. For local models, pass a filesystem path instead.
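The same test works from Python through the official openai client, since the endpoint speaks the OpenAI protocol. A small sketch, assuming the server above is listening on localhost:8000 (the api_key is a dummy value; the server does not check it):

from openai import OpenAI

# Point the standard OpenAI client at the local MLC server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="llama-3.2-3b",  # placeholder name, as in the curl example
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)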


Quantization Formats {#quantization}

Quantization | Bits | Group Size | Quality | Recommended Use
q4f16_1 | 4 | 32 | Best 4-bit default | General use
q4f16_0 | 4 | 32 | Slightly worse than q4f16_1 | Legacy
q4f32_1 | 4 | 32 | Highest 4-bit quality (FP32 scales) | Quality-critical
q3f16_1 | 3 | 32 | Smaller, lower quality | Very tight VRAM
q0f16 | 16 | n/a | FP16, no quantization | Plenty of VRAM
q0f32 | 32 | n/a | FP32 | Reference

In practice: q4f16_1 for 95% of users; q4f32_1 if you have VRAM headroom and quality matters; q3f16_1 only when you have to.
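A rough way to compare footprints before downloading anything: weight size scales with parameter count times bits per weight, plus a small allowance for quantization scales. A back-of-the-envelope sketch (the 8% overhead factor is an assumption, not an MLC figure; KV cache and activations are excluded):

def approx_weight_gb(params_billion: float, bits: float, scale_overhead: float = 0.08) -> float:
    """Very rough weight-only size estimate in GB."""
    raw_bytes = params_billion * 1e9 * bits / 8
    return raw_bytes * (1 + scale_overhead) / 1024**3

for name, bits in [("q3f16_1", 3), ("q4f16_1", 4), ("q0f16", 16)]:
    print(f"Llama 3.1 8B {name}: ~{approx_weight_gb(8.0, bits):.1f} GB")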


MLCEngine OpenAI-Compatible Server {#engine}

mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC \
    --host 0.0.0.0 --port 8000 \
    --device cuda \
    --mode local \
    --max-batch-size 8

--mode options:

  • local — single user, lowest latency
  • interactive — small batch, balanced
  • server — high throughput, multi-user (similar to vLLM continuous batching)

Endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /health.

Tool calling and JSON mode work via the OpenAI tools and response_format fields. xgrammar is the constrained-generation backend.
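To exercise the server from Python with streaming (useful for eyeballing time-to-first-token under each --mode), point the openai client at it; a sketch assuming the server above is on localhost:8000 and the model name matches what you served:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# Stream tokens as they arrive from the local MLC server.
stream = client.chat.completions.create(
    model="llama-3.1-8b",  # placeholder, match your served model
    messages=[{"role": "user", "content": "Summarize what MLC-LLM compiles."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()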


Apple Silicon (Metal) {#metal}

Mac-native build is the simplest:

pip install --pre -U -f https://mlc.ai/wheels mlc-llm mlc-ai

mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device metal

On M4 Max with Llama 3.1 8B q4f16_1:

Engine | tok/s
llama.cpp Metal | ~58
Ollama (= llama.cpp) | ~58
MLC-LLM Metal | ~62
MLX (separate framework) | ~55

MLC-LLM is competitive with llama.cpp on Metal and slightly faster on some models. For deep coverage of Mac local AI, see MLX vs CUDA and Apple M4 for AI Guide.


AMD GPUs (ROCm + Vulkan) {#amd}

ROCm (RX 7000+, MI series)

pip install --pre -U -f https://mlc.ai/wheels mlc-llm-rocm62 mlc-ai-rocm62

mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device rocm

Vulkan (RX 6000-series, RX 5000-series, integrated)

pip install --pre -U -f https://mlc.ai/wheels mlc-llm-vulkan mlc-ai-vulkan

mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device vulkan

Vulkan is 60-80% the speed of native ROCm. For RX 6000-series and older where ROCm is unofficial, Vulkan is often the better choice. See AMD ROCm Setup for the broader AMD picture.
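If you script against mixed AMD hardware, one pattern is to try ROCm first and fall back to Vulkan. A sketch under the assumption that MLCEngine accepts the same device strings as the serve CLI (rocm, vulkan):

from mlc_llm import MLCEngine

MODEL = "HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC"

def open_engine(model: str) -> MLCEngine:
    # Prefer native ROCm; fall back to Vulkan on RX 6000-series and older.
    for device in ("rocm", "vulkan"):
        try:
            return MLCEngine(model, device=device)
        except Exception as err:  # backend not present or wrong wheel installed
            print(f"{device} unavailable: {err}")
    raise RuntimeError("no usable AMD backend found")

engine = open_engine(MODEL)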


Intel Arc (Vulkan) {#intel}

pip install --pre -U -f https://mlc.ai/wheels mlc-llm-vulkan mlc-ai-vulkan

# Verify Vulkan sees the Arc
vulkaninfo --summary | grep "Intel"

mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device vulkan

On Arc A770 16GB with Llama 3.1 8B q4f16_1: ~38 tok/s. See Intel Arc A770 Local AI for the dedicated Arc guide.


Android Deployment {#android}

The MLC-LLM Android app source is in android/MLCChat in the repo.

cd mlc-llm/android/MLCChat
# Build with Android Studio (Arctic Fox or newer)

Bundled models in the demo app: Llama 3.2 1B / 3B q4f16_1, Qwen 2.5 1.5B q4f16_1, Phi-3.5 Mini q4f16_1, Gemma 2 2B q4f16_1.

For your own app, use mlc4j (Java/Kotlin bindings) and ship a .so library + tokenizer + config. Per-platform compile:

mlc_llm compile model-config.json \
    --device android \
    --output libmlc-llm-llama-3.2-3b-android.so

Performance on Snapdragon 8 Gen 3 (Galaxy S24): Llama 3.2 3B q4f16_1 at 18-25 tok/s.


iOS Deployment {#ios}

cd mlc-llm/ios/MLCChat
# Build with Xcode 15+

iOS uses Metal as the backend (same as Mac). Bundled models: Llama 3.2 1B / 3B q4f16_1, Qwen 2.5 1.5B / 3B q4f16_1, Phi-3.5 Mini.

For App Store distribution the practical limit is bundle size (keep the app bundle well under roughly 500 MB); even a 1B q4f16_1 model weighs ~600 MB, so ship model weights via on-demand download rather than bundling them.

Performance on iPhone 15 Pro: Llama 3.2 3B q4f16_1 at 22-30 tok/s, sustained until thermal throttling sets in after roughly 5-10 minutes of continuous load.


WebLLM: Running in the Browser {#webllm}

WebLLM is the JavaScript / TypeScript wrapper around MLC-LLM compiled to WebGPU + WebAssembly.

npm install @mlc-ai/web-llm
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine(
  "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  { initProgressCallback: p => console.log(p) }
);

const chunks = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
});

// Accumulate streamed deltas; process.stdout does not exist in the browser.
let reply = "";
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta?.content ?? "";
}
console.log(reply);

The model downloads on first run (1-5 GB depending on size) and is cached in IndexedDB, so subsequent loads skip the download.

Browser support:

  • Chrome 113+ ✅
  • Safari 18+ ✅ (macOS 15, iOS 18)
  • Edge 113+ ✅
  • Firefox: experimental (about:config: dom.webgpu.enabled=true)

Performance (Llama 3.1 8B q4f16_1):

Hardware | tok/s
MacBook M3 Pro (Safari 18) | 38
RTX 4090 (Chrome) | 72
Intel UHD 630 (Chrome) | 6
iPhone 15 Pro (Safari 18) | 18

For deeper coverage see WebLLM Browser AI Guide.


Long Context, FlashAttention, KV Cache {#long-context}

MLC-LLM supports the model's native context length, with RoPE scaling available to extend beyond it. FlashAttention is implemented via TVM (not the official Tri Dao kernel) and enabled automatically.

KV cache quantization is supported via --quantization q4f16_kv etc. (limited to specific backends as of mid-2026).

For 32K-context Llama 3.1 8B on a 12 GB RTX 3060:

mlc_llm compile model-config.json \
    --device cuda \
    --max-seq-len 32768 \
    --output llama-3.1-8b-32k-cuda.so

max-seq-len controls the engine's reserved KV-cache memory. Lower = more concurrent requests fit; higher = longer single requests fit.
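To see why this matters on a 12 GB card, estimate the FP16 KV-cache footprint from the model's attention geometry. A back-of-the-envelope sketch using Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dim 128); quantized KV cache would shrink these numbers accordingly:

def kv_cache_gb(seq_len: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """FP16 KV cache for one sequence: 2 (K and V) x layers x heads x head_dim x length."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len / 1024**3

for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")

At 32K tokens that is roughly 4 GB of cache on top of the q4f16_1 weights, which is why the example targets a 12 GB card rather than an 8 GB one.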


Tool Calling and Structured Output {#tool-calling}

MLCEngine supports OpenAI-compatible tool calling for chat-tuned models:

{
  "model": "llama-3.1-8b",
  "messages": [...],
  "tools": [{"type": "function", "function": {"name": "get_weather", "parameters": {...}}}],
  "tool_choice": "auto"
}

For JSON-schema constrained generation, pass response_format:

{
  "response_format": {
    "type": "json_schema",
    "json_schema": {"name": "person", "schema": {...}}
  }
}

xgrammar is the backend; throughput overhead is 5-15%, similar to vLLM.
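Both features are reachable from the openai client against the local server. A sketch with a hypothetical get_weather tool and a minimal person schema (the tool name, schema fields, and model name are illustrative, not MLC defaults):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# Tool calling: the model answers with a tool_calls entry instead of plain text.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)

# JSON-schema constrained output via response_format (enforced by xgrammar).
resp = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Describe Ada Lovelace as a person object."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {"name": {"type": "string"}, "born": {"type": "integer"}},
                "required": ["name"],
            },
        },
    },
)
print(resp.choices[0].message.content)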


Benchmarks Across Platforms {#benchmarks}

Llama 3.1 8B q4f16_1, batch size 1, 4K context, 128 output tokens.

Hardware / Backend | tok/s
RTX 4090 (CUDA, MLC) | 138
RTX 4090 (CUDA, vLLM AWQ) | 155
RTX 4090 (CUDA, ExLlamaV2) | 165
MacBook M4 Max (Metal, MLC) | 62
MacBook M4 Max (Metal, llama.cpp) | 58
RX 7900 XTX (ROCm, MLC) | 75
Intel Arc A770 (Vulkan, MLC) | 38
RTX 4090 (WebGPU, browser, WebLLM) | 72
Snapdragon 8 Gen 3 (Adreno, MLC Android) | 11 (3B model)
iPhone 15 Pro (Metal, MLC iOS) | 12 (3B model)

MLC trails specialized engines on individual platforms (vLLM on CUDA, llama.cpp on Metal) but is competitive everywhere and the only option for some.


Tuning Recipes {#tuning}

NVIDIA RTX 4090

mlc_llm compile model-config.json \
    --device cuda --opt 'cublas_workspace_size=16' \
    --max-seq-len 16384 --max-batch-size 8 \
    --output llama-3.1-8b-cuda.so

Apple M4 Max

mlc_llm compile model-config.json \
    --device metal --max-seq-len 16384 \
    --output llama-3.1-8b-metal.dylib

Android

mlc_llm compile model-config.json \
    --device android \
    --max-seq-len 4096 \
    --quantization q4f16_1 \
    --output libmlc-llm-llama-3.2-3b-android.so

WebGPU (browser)

mlc_llm compile model-config.json \
    --device webgpu \
    --max-seq-len 4096 \
    --output llama-3.1-8b-webgpu.wasm
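The four recipes differ only in --device and output name, so they script cleanly. A sketch that loops the same mlc_llm compile invocation from this section over each target; each device still needs its toolchain available on the build machine (CUDA toolkit, Xcode for Metal, Android NDK, Emscripten for wasm), and paths are illustrative:

import subprocess

TARGETS = {
    "cuda": "llama-3.1-8b-cuda.so",
    "metal": "llama-3.1-8b-metal.dylib",
    "android": "libmlc-llm-llama-3.1-8b-android.so",
    "webgpu": "llama-3.1-8b-webgpu.wasm",
}

for device, output in TARGETS.items():
    # Same CLI as the recipes above; fails fast if a backend toolchain is missing.
    subprocess.run(
        ["mlc_llm", "compile", "model-config.json",
         "--device", device, "--max-seq-len", "4096", "--output", output],
        check=True,
    )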

Common Errors {#troubleshooting}

Error | Cause | Fix
Cannot find a suitable device | Wrong wheel | Install CUDA / ROCm / Vulkan / Metal-specific wheel
Compile takes hours | First-time TVM autotune | Subsequent compiles use cache; OK to wait once
WebGPU "out of memory" in browser | Browser tab limit | Use smaller model or smaller max-seq-len
Model loads but errors per request | Wrong conv-template | Pass correct --conv-template (llama-3, mistral, etc.)
Vulkan: validation layer warnings | Driver version | Update GPU drivers
Android app crashes on launch | Wrong .so for ABI | Build for arm64-v8a explicitly
iOS build fails on Metal shaders | Xcode version | Xcode 15+ required

FAQ {#faq}

See answers to common MLC-LLM questions below.


Sources: MLC-LLM GitHub | MLC-LLM docs | WebLLM | Apache TVM | Internal benchmarks across NVIDIA, AMD, Apple, Android, iOS, browser.
