MLC-LLM Setup Guide (2026): Cross-Platform LLM Inference on Any Device
MLC-LLM is a universal compiler for LLM inference. Built on Apache TVM, it produces device-specific binary libraries that run on every major backend: CUDA, ROCm, Metal, Vulkan, OpenCL, WebGPU, and CPU. No other mainstream engine reaches the browser, Android, iOS, and Intel Arc natively from a single codebase.
This guide covers everything: how the compilation pipeline works, installing MLCEngine, compiling models for each backend, the WebLLM browser runtime, the iOS and Android apps, MLCEngine OpenAI-compatible serving, quantization choices, and benchmarks across NVIDIA / AMD / Apple / mobile.
Table of Contents
- What MLC-LLM Is and Why TVM Matters
- Hardware Coverage
- Installation: pip, Docker, Source
- The Pipeline: Convert → Compile → Run
- Your First Model: Llama 3.2 3B on CUDA
- Quantization Formats (q4f16_1, q4f32_1, q3f16_1)
- MLCEngine OpenAI-Compatible Server
- Apple Silicon (Metal)
- AMD GPUs (ROCm + Vulkan)
- Intel Arc (Vulkan)
- Android Deployment
- iOS Deployment
- WebLLM: Running in the Browser
- Long Context, FlashAttention, KV Cache
- Tool Calling and Structured Output
- Benchmarks Across Platforms
- Tuning Recipes
- Common Errors
- FAQ
What MLC-LLM Is and Why TVM Matters {#what-it-is}
MLC-LLM stands for "Machine Learning Compilation for Large Language Models." It is a project from CMU and the OctoML / Apache TVM community.
The compiler approach:
Hugging Face model
│
▼
[mlc_llm convert_weight] # weight conversion
│
▼
MLC checkpoint format
│
▼
[mlc_llm compile] # TVM compiles to device-specific code
│
▼
.so (Linux / Android) / .dylib (macOS) / .dll (Windows) / .wasm (WebGPU) / .a (iOS)
│
▼
[mlc_llm serve / chat / library APIs]
TVM autotunes kernels for the target device. The output is a binary that performs comparably to hand-written CUDA on NVIDIA, hand-written Metal on Apple, hand-written Vulkan on Intel — without you needing to write any of those.
The trade-off, similar to TensorRT-LLM: compilation takes time (10-30 min per target). The win: one Python source produces optimized code for every backend.
Hardware Coverage {#hardware-coverage}
| Backend | Hardware | Status |
|---|---|---|
| CUDA | NVIDIA RTX 20-series and newer | Production |
| ROCm | AMD Radeon RX 7000+, MI series | Production |
| Vulkan | Any Vulkan 1.2 GPU (Intel Arc, AMD, NVIDIA, Adreno) | Production |
| Metal | Apple M1/M2/M3/M4 | Production |
| OpenCL | Older GPUs, mobile SoCs | Working |
| WebGPU | Modern browsers (Chrome 113+, Safari 18+, Edge) | WebLLM |
| Android | Adreno (Qualcomm), Mali (Arm) | Production |
| iOS | A14+ (iPhone 12+, iPad Pro M-series) | Production |
| CPU | x86_64 + AVX2, ARM64 | Working |
The diversity is unique. No other major LLM inference engine covers WebGPU + iOS + Android + Intel Arc + AMD ROCm + NVIDIA from one codebase.
Installation: pip, Docker, Source {#installation}
pip (recommended)
python3.11 -m venv ~/venvs/mlc
source ~/venvs/mlc/bin/activate
# CUDA 12.4
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-cu124 mlc-ai-cu124
# Metal (Mac)
pip install --pre -U -f https://mlc.ai/wheels mlc-llm mlc-ai
# ROCm 6.2
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-rocm62 mlc-ai-rocm62
# Vulkan (Intel Arc, multi-vendor)
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-vulkan mlc-ai-vulkan
# Verify
python -c "import mlc_llm; print(mlc_llm.__version__)"
Docker
docker run --gpus all -it --rm \
-p 8000:8000 \
-v $(pwd)/models:/models \
mlcaidev/mlc-llm:latest \
mlc_llm serve /models/Llama-3.2-3B-q4f16_1
From source
git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm
mkdir build && cd build
python ../cmake/gen_cmake_config.py # answer y/n for each backend
cmake .. && cmake --build . --parallel
Source builds take 30-60 minutes; only needed for custom kernels or unsupported targets.
The Pipeline: Convert → Compile → Run {#pipeline}
# 1. Convert HF weights to MLC format
mlc_llm convert_weight ./Llama-3.2-3B-Instruct \
--quantization q4f16_1 \
--output ./mlc/Llama-3.2-3B-Instruct-q4f16_1
# 2. Generate model config
mlc_llm gen_config ./Llama-3.2-3B-Instruct \
--quantization q4f16_1 \
--conv-template llama-3 \
--output ./mlc/Llama-3.2-3B-Instruct-q4f16_1
# 3. Compile to target device
mlc_llm compile ./mlc/Llama-3.2-3B-Instruct-q4f16_1/mlc-chat-config.json \
--device cuda \
--output ./mlc/Llama-3.2-3B-Instruct-q4f16_1-cuda.so
# 4. Run via Python or serve
mlc_llm chat ./mlc/Llama-3.2-3B-Instruct-q4f16_1 \
--model-lib ./mlc/Llama-3.2-3B-Instruct-q4f16_1-cuda.so
The mlc-chat-config.json generated in step 2 lives in the model directory and carries the settings the runtime needs (conversation template, quantization, context window); the compiled library itself is passed separately via --model-lib, as in step 4.
Your First Model: Llama 3.2 3B on CUDA {#first-model}
# Download pre-quantized weights from the MLC Hugging Face hub
git lfs install
git clone https://huggingface.co/mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC
# Run
mlc_llm serve HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC \
--host 0.0.0.0 --port 8000
# Test (OpenAI compatible)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b",
"messages": [{"role":"user","content":"Hello"}]
}'
The HF:// prefix tells MLCEngine to download from Hugging Face directly. For local models pass a filesystem path.
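The same model can also be driven from Python without the HTTP server, via the MLCEngine API from the MLC-LLM docs. A minimal streaming sketch; the local-path variant in the comment assumes the keyword is model_lib, which may differ in older releases:
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)
# Local variant (assumption: the keyword is model_lib in current releases):
# engine = MLCEngine("./mlc/Llama-3.2-3B-Instruct-q4f16_1",
#                    model_lib="./mlc/Llama-3.2-3B-Instruct-q4f16_1-cuda.so")

# Stream tokens through the OpenAI-style chat API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What does MLC-LLM compile?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()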
Quantization Formats {#quantization}
| Quantization | Bits | Group Size | Quality | Recommended Use |
|---|---|---|---|---|
| q4f16_1 | 4 | 32 | Best 4-bit default | General use |
| q4f16_0 | 4 | 32 | Slightly worse than q4f16_1 | Legacy |
| q4f32_1 | 4 | 32 | Highest 4-bit quality (FP32 scales) | Quality-critical |
| q3f16_1 | 3 | 32 | Smaller, lower quality | Very tight VRAM |
| q0f16 | 16 | n/a | FP16, no quantization | Plenty of VRAM |
| q0f32 | 32 | n/a | FP32 | Reference |
In practice: q4f16_1 for 95% of users; q4f32_1 if you have VRAM headroom and quality matters; q3f16_1 only when you have to.
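To decide between formats, a back-of-the-envelope memory estimate is usually enough: weight memory is roughly parameters × bits / 8, plus a few percent for group scales, plus KV cache and runtime overhead. A rough sketch (the 5% scale overhead is an assumption, not a measured constant):
def approx_weight_gb(params_billion: float, bits: float, scale_overhead: float = 1.05) -> float:
    """Very rough weight footprint in GB for a grouped-quantized checkpoint."""
    return params_billion * bits / 8 * scale_overhead

for name, bits in [("q4f16_1", 4), ("q3f16_1", 3), ("q0f16", 16)]:
    print(f"Llama 3.1 8B {name}: ~{approx_weight_gb(8.0, bits):.1f} GB of weights")
# q4f16_1 ~4.2 GB, q3f16_1 ~3.2 GB, q0f16 ~16.8 GB
# Add KV cache (see the long-context section) and 1-2 GB of runtime overhead.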
MLCEngine OpenAI-Compatible Server {#engine}
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC \
--host 0.0.0.0 --port 8000 \
--device cuda \
--mode local \
--max-batch-size 8
--mode options:
- local: single user, lowest latency
- interactive: small batch, balanced
- server: high throughput, multi-user (similar to vLLM continuous batching)
Endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /health.
Tool calling and JSON mode work via the OpenAI tools and response_format fields. xgrammar is the constrained-generation backend.
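Because the endpoints are OpenAI-compatible, the standard openai Python client works unchanged; only the base_url and the model id (which must match what the server reports under /v1/models) are specific to your deployment. A minimal streaming sketch, assuming the serve command above is running:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC",  # check /v1/models for the exact id
    messages=[{"role": "user", "content": "One sentence on why TVM matters."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)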
Apple Silicon (Metal) {#metal}
Mac-native build is the simplest:
pip install --pre -U -f https://mlc.ai/wheels mlc-llm mlc-ai
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device metal
On M4 Max with Llama 3.1 8B q4f16_1:
| Engine | tok/s |
|---|---|
| llama.cpp Metal | ~58 |
| Ollama (= llama.cpp) | ~58 |
| MLC-LLM Metal | ~62 |
| MLX (separate) | ~55 |
MLC-LLM is competitive with llama.cpp on Metal and slightly faster on some models. For deep coverage of Mac local AI, see MLX vs CUDA and Apple M4 for AI Guide.
AMD GPUs (ROCm + Vulkan) {#amd}
ROCm (RX 7000+, MI series)
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-rocm62 mlc-ai-rocm62
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device rocm
Vulkan (RX 6000-series, RX 5000-series, integrated)
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-vulkan mlc-ai-vulkan
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device vulkan
Vulkan is 60-80% the speed of native ROCm. For RX 6000-series and older where ROCm is unofficial, Vulkan is often the better choice. See AMD ROCm Setup for the broader AMD picture.
Intel Arc (Vulkan) {#intel}
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-vulkan mlc-ai-vulkan
# Verify Vulkan sees the Arc
vulkaninfo --summary | grep "Intel"
mlc_llm serve HF://mlc-ai/Llama-3.1-8B-Instruct-q4f16_1-MLC --device vulkan
On Arc A770 16GB with Llama 3.1 8B q4f16_1: ~38 tok/s. See Intel Arc A770 Local AI for the dedicated Arc guide.
Android Deployment {#android}
The MLC-LLM Android app source is in android/MLCChat in the repo.
cd mlc-llm/android/MLCChat
# Build with Android Studio (Arctic Fox or newer)
Bundled models in the demo app: Llama 3.2 1B / 3B q4f16_1, Qwen 2.5 1.5B q4f16_1, Phi-3.5 Mini q4f16_1, Gemma 2 2B q4f16_1.
For your own app, use mlc4j (Java/Kotlin bindings) and ship a .so library + tokenizer + config. Per-platform compile:
mlc_llm compile model-config.json \
--device android \
--opt O3 \
--output libmlc-llm-llama-3.2-3b-android.so
Performance on Snapdragon 8 Gen 3 (Galaxy S24): Llama 3.2 3B q4f16_1 at 18-25 tok/s.
iOS Deployment {#ios}
cd mlc-llm/ios/MLCChat
# Build with Xcode 15+
iOS uses Metal as the backend (same as Mac). Bundled models: Llama 3.2 1B / 3B q4f16_1, Qwen 2.5 1.5B / 3B q4f16_1, Phi-3.5 Mini.
For App Store distribution, the limit is your bundle size (typically <500 MB) — ship a 1B q4f16_1 model (~600 MB) via on-demand download instead.
Performance on iPhone 15 Pro: Llama 3.2 3B q4f16_1 at 22-30 tok/s, sustained until thermal throttling around 5-10 minutes of continuous load.
WebLLM: Running in the Browser {#webllm}
WebLLM is the JavaScript / TypeScript wrapper around MLC-LLM compiled to WebGPU + WebAssembly.
npm install @mlc-ai/web-llm
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine(
  "Llama-3.1-8B-Instruct-q4f16_1-MLC",
  { initProgressCallback: p => console.log(p) }  // reports download/compile progress
);

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
});

// Browser context: accumulate the streamed deltas (process.stdout is Node-only).
let text = "";
for await (const chunk of reply) {
  text += chunk.choices[0]?.delta?.content || "";
}
console.log(text);
The model downloads on first run (1-5 GB depending on size) and caches in IndexedDB. Subsequent loads are instant.
Browser support:
- Chrome 113+ ✅
- Safari 18+ ✅ (macOS 15, iOS 18)
- Edge 113+ ✅
- Firefox: experimental (enable dom.webgpu.enabled in about:config)
Performance (Llama 3.1 8B q4f16_1):
| Hardware | tok/s |
|---|---|
| MacBook M3 Pro (Safari 18) | 38 |
| RTX 4090 (Chrome) | 72 |
| Intel UHD 630 (Chrome) | 6 |
| iPhone 15 Pro (Safari 18) | 18 |
For deeper coverage see WebLLM Browser AI Guide.
Long Context, FlashAttention, KV Cache {#long-context}
MLC-LLM supports up to the model's native context length plus RoPE scaling for extension. FlashAttention is implemented via TVM (not the official Tri Dao kernel) and enabled automatically.
KV cache quantization is supported via --quantization q4f16_kv etc. (limited to specific backends as of mid-2026).
For 32K-context Llama 3.1 8B on a 12 GB RTX 3060:
mlc_llm compile model-config.json \
--device cuda \
--max-seq-len 32768 \
--output llama-3.1-8b-32k-cuda.so
max-seq-len controls the engine's reserved KV-cache memory. Lower = more concurrent requests fit; higher = longer single requests fit.
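To see why 32K fits on 12 GB, estimate the KV cache from the model architecture: each token stores keys and values for every layer and KV head. A sketch using Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128, FP16 cache):
# Approximate FP16 KV-cache footprint for Llama 3.1 8B.
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2

def kv_cache_gb(seq_len: int) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # keys + values
    return per_token * seq_len / 1e9

print(f"~{kv_cache_gb(32768):.1f} GB at 32K tokens")
# ~4.3 GB of KV cache + ~4.2 GB of q4f16_1 weights + runtime overhead fits within 12 GB.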
Tool Calling and Structured Output {#tool-calling}
MLCEngine supports OpenAI-compatible tool calling for chat-tuned models:
{
"model": "llama-3.1-8b",
"messages": [...],
"tools": [{"type": "function", "function": {"name": "get_weather", "parameters": {...}}}],
"tool_choice": "auto"
}
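From Python, the same request goes through the openai client; whether the model actually emits a tool call depends on the model and prompt, so tool_calls can be empty. A sketch against the local server, using the same hypothetical get_weather function as the payload above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, same as the JSON payload above
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.1-8b",  # must match the id the server reports under /v1/models
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
    tool_choice="auto",
)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)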
For JSON-schema constrained generation, pass response_format:
{
"response_format": {
"type": "json_schema",
"json_schema": {"name": "person", "schema": {...}}
}
}
xgrammar is the backend; throughput overhead is 5-15%, similar to vLLM.
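The same pattern works for structured output: pass the response_format payload shown above through the openai client and the server constrains decoding with xgrammar. A minimal sketch (the person schema is illustrative):
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Invent a person and return JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": schema},
    },
)
print(json.loads(resp.choices[0].message.content))  # parses because decoding is schema-constrained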
Benchmarks Across Platforms {#benchmarks}
Llama 3.1 8B q4f16_1, batch size 1, 4K context, 128 output tokens.
| Hardware / Backend | tok/s |
|---|---|
| RTX 4090 (CUDA, MLC) | 138 |
| RTX 4090 (CUDA, vLLM AWQ) | 155 |
| RTX 4090 (CUDA, ExLlamaV2) | 165 |
| MacBook M4 Max (Metal, MLC) | 62 |
| MacBook M4 Max (Metal, llama.cpp) | 58 |
| RX 7900 XTX (ROCm, MLC) | 75 |
| Intel Arc A770 (Vulkan, MLC) | 38 |
| RTX 4090 (WebGPU, browser, WebLLM) | 72 |
| Snapdragon 8 Gen 3 (Adreno, MLC Android) | 11 (3B model) |
| iPhone 15 Pro (Metal, MLC iOS) | 12 (3B model) |
MLC trails specialized engines on individual platforms (vLLM on CUDA, llama.cpp on Metal) but is competitive everywhere and the only option for some.
Tuning Recipes {#tuning}
NVIDIA RTX 4090
mlc_llm compile model-config.json \
--device cuda --opt O3 \
--max-seq-len 16384 --max-batch-size 8 \
--output llama-3.1-8b-cuda.so
Apple M4 Max
mlc_llm compile model-config.json \
--device metal --max-seq-len 16384 \
--output llama-3.1-8b-metal.dylib
Android
mlc_llm compile model-config.json \
--device android \
--max-seq-len 4096 \
--quantization q4f16_1 \
--output libmlc-llm-llama-3.2-3b-android.so
WebGPU (browser)
mlc_llm compile model-config.json \
--device webgpu \
--max-seq-len 4096 \
--output llama-3.1-8b-webgpu.wasm
Common Errors {#troubleshooting}
| Error | Cause | Fix |
|---|---|---|
| Cannot find a suitable device | Wrong wheel | Install the CUDA / ROCm / Vulkan / Metal-specific wheel |
| Compile takes hours | First-time TVM autotune | Subsequent compiles use cache; OK to wait once |
| WebGPU "out of memory" in browser | Browser tab limit | Use smaller model or smaller max-seq-len |
| Model loads but errors per request | Wrong conv-template | Pass correct --conv-template (llama-3, mistral, etc.) |
| Vulkan: validation layer warnings | Driver version | Update GPU drivers |
| Android app crashes on launch | Wrong .so for ABI | Build for arm64-v8a explicitly |
| iOS build fails on Metal shaders | Xcode version | Xcode 15+ required |
FAQ {#faq}
Sources: MLC-LLM GitHub | MLC-LLM docs | WebLLM | Apache TVM | Internal benchmarks across NVIDIA, AMD, Apple, Android, iOS, browser.