WebLLM Guide: Run AI Models in Your Browser (2026)
Want to go deeper than this article?
The AI Learning Path covers this topic and more — hands-on chapters across 10 courses.
WebLLM Quick Start
```javascript
// Run AI in 3 lines of JavaScript
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC");
const reply = await engine.chat.completions.create({ messages: [...] });
```
What is WebLLM?
WebLLM is an open-source library that runs Large Language Models directly in web browsers. Created by MLC AI (Machine Learning Compilation), it eliminates server infrastructure entirely.
Key Statistics
| Metric | Value |
|---|---|
| GitHub Stars | 17,200+ |
| License | Apache 2.0 |
| Browser Support | Chrome, Edge, Firefox (Windows) |
| Performance | Up to ~80% of native speed |
How It Works
WebLLM combines three browser technologies:
- WebGPU: GPU acceleration across vendors (NVIDIA, AMD, Apple Metal)
- WebAssembly: CPU workloads via Emscripten compilation
- Web Workers: Background processing without UI blocking
```
User Request → Web Worker → WebGPU Kernels → GPU Inference → Response
                                                  ↓
                                IndexedDB Cache (offline support)
```
Why Use WebLLM?
| Traditional AI | WebLLM |
|---|---|
| Server infrastructure | Zero servers |
| Per-token API costs | No per-token costs |
| Network latency | Local inference, no round-trips |
| Data sent to cloud | 100% private |
| Requires internet | Works offline |
Supported Models
Model Catalog
| Model | Size | VRAM | Best For |
|---|---|---|---|
| SmolLM2-360M | 360M | 130MB | Tiny, fast |
| Llama-3.2-1B | 1B | 900MB | Mobile/low-end |
| Gemma-2-2B | 2B | 2GB | Balanced |
| Llama-3.2-3B | 3B | 2.2GB | Quality/speed |
| Phi-3.5-mini | 3.8B | 3.7GB | Reasoning |
| Mistral-7B | 7B | 5GB | Best quality |
| Llama-3.1-8B | 8B | 5GB | Maximum |
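A catalog like this lends itself to picking the largest model that fits the user's GPU memory. The sketch below hard-codes entries mirroring the table above; in a real app you could read the list from WebLLM's prebuilt app config instead. The exact model ID strings are assumptions following this guide's naming pattern, not a verified list.

```javascript
// Catalog mirroring the table above, sorted by VRAM footprint (MB).
// IDs follow the "<model>-<quant>-MLC" pattern used in this guide.
const catalog = [
  { id: "SmolLM2-360M-Instruct-q4f16_1-MLC", vramMB: 130 },
  { id: "Llama-3.2-1B-Instruct-q4f16_1-MLC", vramMB: 900 },
  { id: "gemma-2-2b-it-q4f16_1-MLC", vramMB: 2000 },
  { id: "Llama-3.2-3B-Instruct-q4f16_1-MLC", vramMB: 2200 },
  { id: "Phi-3.5-mini-instruct-q4f16_1-MLC", vramMB: 3700 },
  { id: "Mistral-7B-Instruct-v0.3-q4f16_1-MLC", vramMB: 5000 },
];

// Because the catalog is sorted, the last entry that fits is the largest.
function pickModel(budgetMB) {
  const fits = catalog.filter((m) => m.vramMB <= budgetMB);
  return fits.length ? fits[fits.length - 1].id : null;
}
```

Pass the chosen ID straight to `CreateMLCEngine`, falling back to a "device too small" message when `pickModel` returns `null`.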
Quantization Options
| Format | Description | Use Case |
|---|---|---|
| q4f16_1 | 4-bit weights, 16-bit activations | Best balance |
| q4f32_1 | 4-bit weights, 32-bit activations | Higher precision |
| q3f16_1 | 3-bit weights | Smallest size |
Recommendation: Use q4f16_1 for most applications.
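As the model IDs throughout this guide show, the quantization format is encoded directly in the ID: base name, then quantization suffix, then `-MLC`. A tiny helper (a sketch of that naming convention, not a WebLLM API) makes switching formats a one-argument change:

```javascript
// Build a model ID from a base name and quantization suffix,
// e.g. "Llama-3.2-1B-Instruct" + "q4f16_1" → "Llama-3.2-1B-Instruct-q4f16_1-MLC".
function modelId(base, quant = "q4f16_1") {
  return `${base}-${quant}-MLC`;
}
```

Swapping the second argument to `"q4f32_1"` trades download size for activation precision, per the table above.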
Installation and Setup
NPM Installation
```bash
npm install @mlc-ai/web-llm
```
CDN Usage
```html
<script type="module">
  import * as webllm from "https://esm.run/@mlc-ai/web-llm";
</script>
```
Basic Usage
```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Initialize with progress tracking
const engine = await CreateMLCEngine(
  "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  {
    initProgressCallback: (progress) => {
      console.log(`Loading: ${(progress.progress * 100).toFixed(1)}%`);
    }
  }
);

// Chat completion (OpenAI-compatible)
const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain quantum computing briefly." }
  ],
  temperature: 0.7,
  max_tokens: 200
});

console.log(response.choices[0].message.content);
```
Core Features
Streaming Responses
```javascript
const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Write a haiku about coding." }],
  stream: true,
  stream_options: { include_usage: true }
});

let output = "";
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || "";
  output += content;
  // Update UI in real-time
  document.getElementById("response").textContent = output;
}
```
JSON Mode
```javascript
const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "Respond in JSON format." },
    { role: "user", content: "List 3 colors with hex codes." }
  ],
  response_format: { type: "json_object" }
});

const data = JSON.parse(response.choices[0].message.content);
// { colors: [{ name: "Red", hex: "#FF0000" }, ...] }
```
Function Calling
```javascript
const tools = [{
  type: "function",
  function: {
    name: "get_weather",
    description: "Get weather for a location",
    parameters: {
      type: "object",
      properties: {
        location: { type: "string" }
      },
      required: ["location"]
    }
  }
}];

const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Weather in Tokyo?" }],
  tools,
  tool_choice: "auto"
});

// Handle tool calls
if (response.choices[0].message.tool_calls) {
  // Execute function and continue conversation
}
```
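The "execute and continue" step can be sketched as a small dispatcher: map tool names to local implementations, run each call, and build the `tool`-role messages to append before asking the model to finish its answer. The `get_weather` body here is a hypothetical stand-in, not a real weather lookup.

```javascript
// Hypothetical local implementations keyed by tool name.
const toolImpls = {
  get_weather: ({ location }) => ({ location, tempC: 21, condition: "clear" }),
};

// Turn the model's tool_calls into "tool" messages carrying each result.
function runToolCalls(toolCalls) {
  return toolCalls.map((call) => ({
    role: "tool",
    tool_call_id: call.id,
    content: JSON.stringify(
      toolImpls[call.function.name](JSON.parse(call.function.arguments))
    ),
  }));
}
```

Append the assistant's tool-call message plus these `tool` messages to the conversation, then call `engine.chat.completions.create` again so the model can phrase the final reply.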
Web Worker (Non-Blocking)
main.js:
```javascript
import { WebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = new WebWorkerMLCEngine(
  new Worker(new URL("./worker.js", import.meta.url), { type: "module" })
);
await engine.reload("Llama-3.2-1B-Instruct-q4f16_1-MLC");
```
worker.js:
```javascript
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg) => handler.onmessage(msg);
```
Service Worker (Persistent)
```javascript
// service-worker.js
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

let handler;
self.addEventListener("activate", (event) => {
  handler = new ServiceWorkerMLCEngineHandler();
});
self.addEventListener("message", (event) => handler.onmessage(event));
```
```javascript
// main.js
import { ServiceWorkerMLCEngine } from "@mlc-ai/web-llm";

await navigator.serviceWorker.register("/service-worker.js");
const engine = new ServiceWorkerMLCEngine();
// Model persists across page refreshes!
```
React Integration
```jsx
import { useState, useEffect, useCallback } from 'react';
import { CreateMLCEngine } from '@mlc-ai/web-llm';

function useWebLLM(modelId) {
  const [engine, setEngine] = useState(null);
  const [loading, setLoading] = useState(true);
  const [progress, setProgress] = useState(0);

  useEffect(() => {
    let mounted = true;
    CreateMLCEngine(modelId, {
      initProgressCallback: (p) => {
        if (mounted) setProgress(p.progress * 100);
      }
    }).then((eng) => {
      if (mounted) {
        setEngine(eng);
        setLoading(false);
      }
    });
    return () => { mounted = false; };
  }, [modelId]);

  const chat = useCallback(async (messages) => {
    if (!engine) return null;
    const response = await engine.chat.completions.create({
      messages,
      stream: true
    });
    let result = "";
    for await (const chunk of response) {
      result += chunk.choices[0]?.delta?.content || "";
    }
    return result;
  }, [engine]);

  return { engine, loading, progress, chat };
}

// Usage
function ChatApp() {
  const { chat, loading, progress } = useWebLLM(
    "Llama-3.2-1B-Instruct-q4f16_1-MLC"
  );
  const [messages, setMessages] = useState([]);
  const [input, setInput] = useState("");

  if (loading) {
    return <div>Loading model... {progress.toFixed(1)}%</div>;
  }

  const handleSubmit = async () => {
    const newMessages = [...messages, { role: "user", content: input }];
    setMessages(newMessages);
    setInput("");
    const reply = await chat(newMessages);
    setMessages([...newMessages, { role: "assistant", content: reply }]);
  };

  return (
    <div>
      {messages.map((m, i) => (
        <div key={i} className={m.role}>{m.content}</div>
      ))}
      <input value={input} onChange={(e) => setInput(e.target.value)} />
      <button onClick={handleSubmit}>Send</button>
    </div>
  );
}
```
Performance
Benchmarks
| Model | WebLLM | Native | Retained |
|---|---|---|---|
| Phi-3.5-mini (3.8B) | 71 tok/s | 89 tok/s | 80% |
| Llama-3.1-8B | 41 tok/s | 58 tok/s | 71% |
| Llama-3.2-1B | ~10 tok/s | - | - |
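Throughput figures like those above are easy to reproduce on your own hardware: time a streamed completion and divide completion tokens by elapsed seconds. With `stream_options: { include_usage: true }`, usage stats arrive on the final chunk; the arithmetic itself is just:

```javascript
// Tokens per second from a completion's token count and wall-clock time.
// Measure elapsedMs with performance.now() (or Date.now()) around the stream.
function tokensPerSecond(completionTokens, elapsedMs) {
  return elapsedMs > 0 ? (completionTokens / elapsedMs) * 1000 : 0;
}
```

Note that the first generation after load is typically slower (shader compilation and cache warm-up), so benchmark a second run.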
Browser Support
| Browser | WebGPU | Status |
|---|---|---|
| Chrome 113+ | Full | ✅ Stable |
| Edge 113+ | Full | ✅ Stable |
| Chrome Android 121+ | Full | ✅ Stable |
| Firefox (Windows) | Partial | ✅ Shipping |
| Safari 26+ | Planned | ⏳ Coming |
Current coverage: ~65% of users
Cold Start Times
| Model Size | First Load | Cached Load |
|---|---|---|
| 360M | 5-15s | 1-3s |
| 1-3B | 15-45s | 3-10s |
| 7-8B | 1-3 min | 10-30s |
Use Cases
Privacy-First Apps
```javascript
// Medical symptom checker - HIPAA-friendly
const engine = await CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC");

const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "Medical information assistant." },
    { role: "user", content: userSymptoms }
  ]
});
// All processing happens locally
```
Offline-Capable Apps
```javascript
// Check cache status
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (p) => {
    if (p.progress === 1) console.log("Model cached - works offline!");
  }
});

if (!navigator.onLine) {
  console.log("Offline mode - AI still functional!");
}
```
Static Site Deployment
```html
<!DOCTYPE html>
<html>
<head><title>AI Chat - No Server</title></head>
<body>
  <div id="chat"></div>
  <script type="module">
    import { CreateMLCEngine } from "https://esm.run/@mlc-ai/web-llm";
    const engine = await CreateMLCEngine("SmolLM2-360M-Instruct-q4f16_1-MLC");
    // Deploy to GitHub Pages, Netlify, etc.
  </script>
</body>
</html>
```
Chrome Extensions
```javascript
// background.js
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new ServiceWorkerMLCEngineHandler();

chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.type === "chat") {
    handler.engine.chat.completions.create({
      messages: request.messages
    }).then(sendResponse);
    return true; // Keep the message channel open for the async response
  }
});
```
Error Handling
Check WebGPU Support
```javascript
async function checkWebGPU() {
  if (!navigator.gpu) {
    return { supported: false, reason: "WebGPU not available" };
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, reason: "No GPU adapter" };
  }
  return {
    supported: true,
    limits: adapter.limits,
    features: [...adapter.features]
  };
}

// Usage
const gpu = await checkWebGPU();
if (!gpu.supported) {
  showFallbackUI("Please use Chrome 113+ or Edge 113+");
}
```
Handle Loading Errors
```javascript
// Declare outside the try block so the engine stays in scope afterwards
let engine;
try {
  engine = await CreateMLCEngine("Llama-3.2-3B-Instruct-q4f16_1-MLC");
} catch (error) {
  if (error.message.includes("out of memory")) {
    // Fall back to a smaller model
    engine = await CreateMLCEngine("SmolLM2-360M-Instruct-q4f16_1-MLC");
  } else {
    console.error("Failed to load model:", error);
  }
}
```
Limitations
| Limitation | Details | Mitigation |
|---|---|---|
| Browser support | ~65% of users | Show upgrade message |
| Model size | 8B max | Use smaller models |
| Cold start | 5s - 3min | Show progress bar |
| No cross-origin cache | Each site downloads | Host own models |
| Service worker lifecycle | Browser can kill | Retry logic |
| GPU memory | 4-6GB practical | Quantize aggressively |
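The "retry logic" mitigation for the service worker lifecycle can be sketched as a small exponential-backoff wrapper. This is a generic pattern under assumed defaults (3 attempts, 500 ms base delay), not a WebLLM API:

```javascript
// Retry an async call with exponential backoff, for transient failures
// such as the browser recycling a service worker mid-request.
async function withRetry(fn, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait baseDelayMs, 2x, 4x, ... between attempts.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```

Usage: `const reply = await withRetry(() => engine.chat.completions.create({ messages }));`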
Key Takeaways
- Zero server infrastructure - Deploy on static hosts
- 100% private - Data never leaves the browser
- Works offline - After first model download
- 80% native performance - WebGPU acceleration
- OpenAI-compatible API - Easy migration
- Streaming support - Real-time responses
- ~65% browser coverage - Chrome/Edge mainly
Next Steps
- Compare local AI tools for desktop
- Explore small models optimized for browsers
- Learn about quantization for model optimization
- Set up Continue.dev for coding assistance
- Check VRAM requirements for model selection
WebLLM transforms any website into an AI-powered application without servers, API costs, or privacy concerns. Whether you're building a privacy-first medical app, an offline-capable PWA, or a Chrome extension, WebLLM provides the infrastructure-free foundation for the next generation of web AI.