WebLLM Guide: Run AI Models in Your Browser (2026)
WebLLM Quick Start
// Run AI in 3 lines of JavaScript
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC");
const reply = await engine.chat.completions.create({ messages: [...] });
What is WebLLM?
WebLLM is an open-source library that runs Large Language Models directly in web browsers. Created by MLC AI (Machine Learning Compilation), it eliminates server infrastructure entirely.
Key Statistics
| Metric | Value |
|---|---|
| GitHub Stars | 17,200+ |
| License | Apache 2.0 |
| Browser Support | Chrome, Edge, Firefox (Windows) |
| Performance | 80% of native speed |
How It Works
WebLLM combines three browser technologies:
- WebGPU: GPU acceleration across vendors (NVIDIA, AMD, Apple Metal)
- WebAssembly: CPU workloads via Emscripten compilation
- Web Workers: Background processing without UI blocking
User Request → Web Worker → WebGPU Kernels → GPU Inference → Response
                    ↓
        IndexedDB Cache (offline support)
Why Use WebLLM?
| Traditional AI | WebLLM |
|---|---|
| Server infrastructure | Zero servers |
| Per-token API costs | Free after model download |
| Network latency | Instant inference |
| Data sent to cloud | 100% private |
| Requires internet | Works offline |
Supported Models
Model Catalog
| Model | Size | VRAM | Best For |
|---|---|---|---|
| SmolLM2-360M | 360M | 130MB | Tiny, fast |
| Llama-3.2-1B | 1B | 900MB | Mobile/low-end |
| Gemma-2-2B | 2B | 2GB | Balanced |
| Llama-3.2-3B | 3B | 2.2GB | Quality/speed |
| Phi-3.5-mini | 3.8B | 3.7GB | Reasoning |
| Mistral-7B | 7B | 5GB | Best quality |
| Llama-3.1-8B | 8B | 5GB | Maximum |
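If you want to pick a model automatically, a rough heuristic can map available device memory to a catalog entry. The thresholds below are assumptions derived from the VRAM column above, and `navigator.deviceMemory` (Chrome/Edge only, capped at 8) is only an approximation — tune both for your audience:

```javascript
// Rough model picker keyed to the VRAM column above.
// Thresholds are assumptions; adjust for your target devices.
function pickModel(deviceMemoryGB) {
  if (deviceMemoryGB >= 8) return "Llama-3.2-3B-Instruct-q4f16_1-MLC";
  if (deviceMemoryGB >= 4) return "Llama-3.2-1B-Instruct-q4f16_1-MLC";
  return "SmolLM2-360M-Instruct-q4f16_1-MLC";
}

// In the browser: pickModel(navigator.deviceMemory ?? 4)
```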
Quantization Options
| Format | Description | Use Case |
|---|---|---|
| q4f16_1 | 4-bit weights, 16-bit activations | Best balance |
| q4f32_1 | 4-bit weights, 32-bit activations | Higher precision |
| q3f16_1 | 3-bit weights | Smallest size |
Recommendation: Use q4f16_1 for most applications.
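Model IDs encode the quantization as a suffix (e.g. `-q4f16_1-MLC`), so you can filter a catalog of IDs by format. In a real app the IDs would come from WebLLM's `prebuiltAppConfig.model_list`; here a literal list stands in so the helper is self-contained:

```javascript
// Filter model IDs by quantization suffix. In an app the list would come
// from prebuiltAppConfig.model_list.map((m) => m.model_id).
function filterByQuantization(modelIds, quant) {
  return modelIds.filter((id) => id.includes(`-${quant}-`));
}

const catalog = [
  "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  "Llama-3.2-1B-Instruct-q4f32_1-MLC",
  "SmolLM2-360M-Instruct-q4f16_1-MLC",
];
console.log(filterByQuantization(catalog, "q4f16_1"));
// -> ["Llama-3.2-1B-Instruct-q4f16_1-MLC", "SmolLM2-360M-Instruct-q4f16_1-MLC"]
```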
Installation and Setup
NPM Installation
npm install @mlc-ai/web-llm
CDN Usage
<script type="module">
import * as webllm from "https://esm.run/@mlc-ai/web-llm";
</script>
Basic Usage
import { CreateMLCEngine } from "@mlc-ai/web-llm";
// Initialize with progress tracking
const engine = await CreateMLCEngine(
"Llama-3.2-1B-Instruct-q4f16_1-MLC",
{
initProgressCallback: (progress) => {
console.log(`Loading: ${(progress.progress * 100).toFixed(1)}%`);
}
}
);
// Chat completion (OpenAI-compatible)
const response = await engine.chat.completions.create({
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Explain quantum computing briefly." }
],
temperature: 0.7,
max_tokens: 200
});
console.log(response.choices[0].message.content);
Core Features
Streaming Responses
const stream = await engine.chat.completions.create({
messages: [{ role: "user", content: "Write a haiku about coding." }],
stream: true,
stream_options: { include_usage: true }
});
let output = "";
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || "";
output += content;
// Update UI in real-time
document.getElementById("response").textContent = output;
}
JSON Mode
const response = await engine.chat.completions.create({
messages: [
{ role: "system", content: "Respond in JSON format." },
{ role: "user", content: "List 3 colors with hex codes." }
],
response_format: { type: "json_object" }
});
const data = JSON.parse(response.choices[0].message.content);
// { colors: [{ name: "Red", hex: "#FF0000" }, ...] }
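Small local models can emit JSON that parses but misses expected fields, so it is worth validating the shape before using it. A minimal check, assuming the `{ colors: [...] }` shape from the example above:

```javascript
// Validate the { colors: [{ name, hex }, ...] } shape before trusting it.
function isValidColorList(data) {
  return (
    data !== null &&
    typeof data === "object" &&
    Array.isArray(data.colors) &&
    data.colors.every(
      (c) => typeof c.name === "string" && /^#[0-9A-Fa-f]{6}$/.test(c.hex)
    )
  );
}
```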
Function Calling
const tools = [{
type: "function",
function: {
name: "get_weather",
description: "Get weather for a location",
parameters: {
type: "object",
properties: {
location: { type: "string" }
},
required: ["location"]
}
}
}];
const response = await engine.chat.completions.create({
messages: [{ role: "user", content: "Weather in Tokyo?" }],
tools,
tool_choice: "auto"
});
// Handle tool calls
if (response.choices[0].message.tool_calls) {
// Execute function and continue conversation
}
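To complete the round trip, execute the requested function locally and send the result back as a `role: "tool"` message so the model can phrase the final answer. The sketch below follows the OpenAI-style message convention WebLLM mirrors; the `get_weather` implementation is a hypothetical stand-in:

```javascript
// Map tool names to local implementations (get_weather is hypothetical).
const toolImpls = {
  get_weather: ({ location }) => ({ location, tempC: 21, sky: "clear" }),
};

// Turn the model's tool_calls into role:"tool" messages for the next turn.
function runToolCalls(toolCalls) {
  return toolCalls.map((call) => ({
    role: "tool",
    tool_call_id: call.id,
    content: JSON.stringify(
      toolImpls[call.function.name](JSON.parse(call.function.arguments))
    ),
  }));
}
```

Append these messages (after the assistant message that contained the `tool_calls`) to the conversation history and call `engine.chat.completions.create` again to get the final reply.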
Web Worker (Non-Blocking)
main.js:
import { WebWorkerMLCEngine } from "@mlc-ai/web-llm";
const engine = new WebWorkerMLCEngine(
new Worker(new URL("./worker.js", import.meta.url), { type: "module" })
);
await engine.reload("Llama-3.2-1B-Instruct-q4f16_1-MLC");
worker.js:
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg) => handler.onmessage(msg);
Service Worker (Persistent)
// service-worker.js
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
let handler;
self.addEventListener("activate", (event) => {
handler = new ServiceWorkerMLCEngineHandler();
});
// Optional chaining guards against a message arriving before "activate"
self.addEventListener("message", (event) => handler?.onmessage(event));
// main.js
import { CreateServiceWorkerMLCEngine } from "@mlc-ai/web-llm";
await navigator.serviceWorker.register("/service-worker.js");
const engine = await CreateServiceWorkerMLCEngine(
  "Llama-3.2-1B-Instruct-q4f16_1-MLC"
);
// Model persists across page refreshes!
React Integration
import { useState, useEffect, useCallback } from 'react';
import { CreateMLCEngine } from '@mlc-ai/web-llm';
function useWebLLM(modelId) {
const [engine, setEngine] = useState(null);
const [loading, setLoading] = useState(true);
const [progress, setProgress] = useState(0);
useEffect(() => {
let mounted = true;
CreateMLCEngine(modelId, {
initProgressCallback: (p) => {
if (mounted) setProgress(p.progress * 100);
}
}).then((eng) => {
if (mounted) {
setEngine(eng);
setLoading(false);
}
});
return () => { mounted = false; };
}, [modelId]);
const chat = useCallback(async (messages) => {
if (!engine) return null;
const response = await engine.chat.completions.create({
messages,
stream: true
});
let result = "";
for await (const chunk of response) {
result += chunk.choices[0]?.delta?.content || "";
}
return result;
}, [engine]);
return { engine, loading, progress, chat };
}
// Usage
function ChatApp() {
const { chat, loading, progress } = useWebLLM(
"Llama-3.2-1B-Instruct-q4f16_1-MLC"
);
const [messages, setMessages] = useState([]);
const [input, setInput] = useState("");
if (loading) {
return <div>Loading model... {progress.toFixed(1)}%</div>;
}
const handleSubmit = async () => {
const newMessages = [...messages, { role: "user", content: input }];
setMessages(newMessages);
setInput("");
const reply = await chat(newMessages);
setMessages([...newMessages, { role: "assistant", content: reply }]);
};
return (
<div>
{messages.map((m, i) => (
<div key={i} className={m.role}>{m.content}</div>
))}
<input value={input} onChange={(e) => setInput(e.target.value)} />
<button onClick={handleSubmit}>Send</button>
</div>
);
}
Performance
Benchmarks
| Model | WebLLM | Native | Retained |
|---|---|---|---|
| Phi-3.5-mini (3.8B) | 71 tok/s | 89 tok/s | 80% |
| Llama-3.1-8B | 41 tok/s | 58 tok/s | 71% |
| Llama-3.2-1B | ~10 tok/s | - | - |
Browser Support
| Browser | WebGPU | Status |
|---|---|---|
| Chrome 113+ | Full | ✅ Stable |
| Edge 113+ | Full | ✅ Stable |
| Chrome Android 121+ | Full | ✅ Stable |
| Firefox (Windows) | Partial | 🚧 Shipping |
| Safari 26+ | Planned | ⏳ Coming |
Current coverage: ~65% of users
Cold Start Times
| Model Size | First Load | Cached Load |
|---|---|---|
| 360M | 5-15s | 1-3s |
| 1-3B | 15-45s | 3-10s |
| 7-8B | 1-3 min | 10-30s |
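During a first load you can turn `initProgressCallback`'s fractional progress into a rough ETA for the progress bar. A small sketch — the linear extrapolation is a simplifying assumption, since download and compile speeds differ:

```javascript
// Estimate seconds remaining from elapsed time and fractional progress.
// Assumes roughly constant throughput, which is only approximate.
function etaSeconds(startedAtMs, nowMs, progress) {
  if (progress <= 0) return Infinity;
  const elapsed = (nowMs - startedAtMs) / 1000;
  return (elapsed / progress) * (1 - progress);
}

// Wire-up sketch with CreateMLCEngine:
// const t0 = Date.now();
// initProgressCallback: (p) =>
//   showEta(etaSeconds(t0, Date.now(), p.progress)); // showEta is yours
```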
Use Cases
Privacy-First Apps
// Medical symptom checker - HIPAA-friendly
const engine = await CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC");
const response = await engine.chat.completions.create({
messages: [
{ role: "system", content: "Medical information assistant." },
{ role: "user", content: userSymptoms }
]
});
// All processing happens locally
Offline-Capable Apps
// Check cache status
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
initProgressCallback: (p) => {
if (p.progress === 1) console.log("Model cached - works offline!");
}
});
if (!navigator.onLine) {
console.log("Offline mode - AI still functional!");
}
Static Site Deployment
<!DOCTYPE html>
<html>
<head><title>AI Chat - No Server</title></head>
<body>
<div id="chat"></div>
<script type="module">
import { CreateMLCEngine } from "https://esm.run/@mlc-ai/web-llm";
const engine = await CreateMLCEngine("SmolLM2-360M-Instruct-q4f16_1-MLC");
// Deploy to GitHub Pages, Netlify, etc.
</script>
</body>
</html>
Chrome Extensions
// background.js
import { ExtensionServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
// Extension builds use the Extension* handler variant; see the library's
// chrome-extension example for the full port wiring.
const handler = new ExtensionServiceWorkerMLCEngineHandler();
chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
if (request.type === "chat") {
handler.engine.chat.completions.create({
messages: request.messages
}).then(sendResponse);
return true;
}
});
Error Handling
Check WebGPU Support
async function checkWebGPU() {
if (!navigator.gpu) {
return { supported: false, reason: "WebGPU not available" };
}
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
return { supported: false, reason: "No GPU adapter" };
}
return {
supported: true,
limits: adapter.limits,
features: [...adapter.features]
};
}
// Usage
const gpu = await checkWebGPU();
if (!gpu.supported) {
showFallbackUI("Please use Chrome 113+ or Edge 113+");
}
Handle Loading Errors
let engine;
try {
  engine = await CreateMLCEngine("Llama-3.2-3B-Instruct-q4f16_1-MLC");
} catch (error) {
  if (error.message.includes("out of memory")) {
    // Fall back to a smaller model
    engine = await CreateMLCEngine("SmolLM2-360M-Instruct-q4f16_1-MLC");
  } else {
    console.error("Failed to load model:", error);
  }
}
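The try/catch above generalizes into a fallback chain that walks a list of model IDs from largest to smallest and returns the first that loads. The loader function is injected here so the helper stays testable; in the app it would simply be `CreateMLCEngine`:

```javascript
// Try each model in order; return the first engine that loads.
// `load` is injected (CreateMLCEngine in the app) to keep this testable.
async function loadWithFallback(modelIds, load) {
  let lastError;
  for (const id of modelIds) {
    try {
      return { modelId: id, engine: await load(id) };
    } catch (err) {
      lastError = err; // remember the failure and try the next model
    }
  }
  throw lastError;
}

// Usage sketch:
// const { engine } = await loadWithFallback(
//   ["Llama-3.2-3B-Instruct-q4f16_1-MLC", "SmolLM2-360M-Instruct-q4f16_1-MLC"],
//   CreateMLCEngine
// );
```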
Limitations
| Limitation | Details | Mitigation |
|---|---|---|
| Browser support | ~65% of users | Show upgrade message |
| Model size | 8B max | Use smaller models |
| Cold start | 5s - 3min | Show progress bar |
| No cross-origin cache | Each site downloads | Host own models |
| Service worker lifecycle | Browser can kill | Retry logic |
| GPU memory | 4-6GB practical | Quantize aggressively |
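For the service-worker lifecycle row, a generic retry-with-backoff wrapper covers the "browser killed the worker" case: wrap each chat call and retry on failure. The attempt count and delays below are arbitrary defaults:

```javascript
// Retry an async operation with exponential backoff.
// attempts and baseDelayMs are arbitrary defaults - tune for your app.
async function withRetry(fn, attempts = 3, baseDelayMs = 500) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err; // out of retries
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
}

// Usage sketch:
// const reply = await withRetry(() =>
//   engine.chat.completions.create({ messages })
// );
```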
Key Takeaways
- Zero server infrastructure - Deploy on static hosts
- 100% private - Data never leaves the browser
- Works offline - After first model download
- 80% native performance - WebGPU acceleration
- OpenAI-compatible API - Easy migration
- Streaming support - Real-time responses
- ~65% browser coverage - Chrome/Edge mainly
Next Steps
- Compare local AI tools for desktop
- Explore small models optimized for browsers
- Learn about quantization for model optimization
- Set up Continue.dev for coding assistance
- Check VRAM requirements for model selection
WebLLM transforms any website into an AI-powered application without servers, API costs, or privacy concerns. Whether you're building a privacy-first medical app, an offline-capable PWA, or a Chrome extension, WebLLM provides the infrastructure-free foundation for the next generation of web AI.