WebLLM Guide: Run AI Models in Your Browser (2026)
Want to go deeper than this article?
The AI Learning Path covers this topic and more — hands-on chapters across 10 courses.
WebLLM Quick Start
```javascript
// Run AI in 3 lines of JavaScript
import { CreateMLCEngine } from "@mlc-ai/web-llm";
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC");
const reply = await engine.chat.completions.create({ messages: [...] });
```
What is WebLLM?
WebLLM is an open-source library that runs Large Language Models directly in web browsers. Created by MLC AI (Machine Learning Compilation), it eliminates server infrastructure entirely.
Key Statistics
| Metric | Value |
|---|---|
| GitHub Stars | 17,200+ |
| License | Apache 2.0 |
| Browser Support | Chrome, Edge, Firefox (Windows) |
| Performance | Up to ~80% of native speed |
How It Works
WebLLM combines three browser technologies:
- WebGPU: GPU acceleration across vendors (NVIDIA, AMD, Apple Metal)
- WebAssembly: CPU workloads via Emscripten compilation
- Web Workers: Background processing without UI blocking
```
User Request → Web Worker → WebGPU Kernels → GPU Inference → Response
                                                  ↓
                                IndexedDB Cache (offline support)
```
Why Use WebLLM?
| Traditional AI | WebLLM |
|---|---|
| Server infrastructure | Zero servers |
| Per-token API costs | No per-token costs |
| Network latency | Local inference, no round-trips |
| Data sent to cloud | 100% private |
| Requires internet | Works offline |
Supported Models
Model Catalog
| Model | Size | VRAM | Best For |
|---|---|---|---|
| SmolLM2-360M | 360M | 130MB | Tiny, fast |
| Llama-3.2-1B | 1B | 900MB | Mobile/low-end |
| Gemma-2-2B | 2B | 2GB | Balanced |
| Llama-3.2-3B | 3B | 2.2GB | Quality/speed |
| Phi-3.5-mini | 3.8B | 3.7GB | Reasoning |
| Mistral-7B | 7B | 5GB | Best quality |
| Llama-3.1-8B | 8B | 5GB | Maximum |
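A catalog like this lends itself to picking the largest model that fits the user's GPU memory. The sketch below hard-codes entries mirroring the table above; in a real app you could read the list from WebLLM's prebuilt app config instead. The exact model ID strings are assumptions following this guide's naming pattern, not a verified list.

```javascript
// Catalog mirroring the table above, sorted by VRAM footprint (MB).
// IDs follow the "<model>-<quant>-MLC" pattern used in this guide.
const catalog = [
  { id: "SmolLM2-360M-Instruct-q4f16_1-MLC", vramMB: 130 },
  { id: "Llama-3.2-1B-Instruct-q4f16_1-MLC", vramMB: 900 },
  { id: "gemma-2-2b-it-q4f16_1-MLC", vramMB: 2000 },
  { id: "Llama-3.2-3B-Instruct-q4f16_1-MLC", vramMB: 2200 },
  { id: "Phi-3.5-mini-instruct-q4f16_1-MLC", vramMB: 3700 },
  { id: "Mistral-7B-Instruct-v0.3-q4f16_1-MLC", vramMB: 5000 },
];

// Because the catalog is sorted, the last entry that fits is the largest.
function pickModel(budgetMB) {
  const fits = catalog.filter((m) => m.vramMB <= budgetMB);
  return fits.length ? fits[fits.length - 1].id : null;
}
```

Pass the chosen ID straight to `CreateMLCEngine`, falling back to a "device too small" message when `pickModel` returns `null`.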
Quantization Options
| Format | Description | Use Case |
|---|---|---|
| q4f16_1 | 4-bit weights, 16-bit activations | Best balance |
| q4f32_1 | 4-bit weights, 32-bit activations | Higher precision |
| q3f16_1 | 3-bit weights | Smallest size |
Recommendation: Use q4f16_1 for most applications.
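As the model IDs throughout this guide show, the quantization format is encoded directly in the ID: base name, then quantization suffix, then `-MLC`. A tiny helper (a sketch of that naming convention, not a WebLLM API) makes switching formats a one-argument change:

```javascript
// Build a model ID from a base name and quantization suffix,
// e.g. "Llama-3.2-1B-Instruct" + "q4f16_1" → "Llama-3.2-1B-Instruct-q4f16_1-MLC".
function modelId(base, quant = "q4f16_1") {
  return `${base}-${quant}-MLC`;
}
```

Swapping the second argument to `"q4f32_1"` trades download size for activation precision, per the table above.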
Installation and Setup
NPM Installation
```bash
npm install @mlc-ai/web-llm
```
CDN Usage
```html
<script type="module">
  import * as webllm from "https://esm.run/@mlc-ai/web-llm";
</script>
```
Basic Usage
```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Initialize with progress tracking
const engine = await CreateMLCEngine(
  "Llama-3.2-1B-Instruct-q4f16_1-MLC",
  {
    initProgressCallback: (progress) => {
      console.log(`Loading: ${(progress.progress * 100).toFixed(1)}%`);
    }
  }
);

// Chat completion (OpenAI-compatible)
const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain quantum computing briefly." }
  ],
  temperature: 0.7,
  max_tokens: 200
});

console.log(response.choices[0].message.content);
```
Core Features
Streaming Responses
```javascript
const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Write a haiku about coding." }],
  stream: true,
  stream_options: { include_usage: true }
});

let output = "";
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || "";
  output += content;
  // Update UI in real-time
  document.getElementById("response").textContent = output;
}
```
JSON Mode
```javascript
const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "Respond in JSON format." },
    { role: "user", content: "List 3 colors with hex codes." }
  ],
  response_format: { type: "json_object" }
});

const data = JSON.parse(response.choices[0].message.content);
// { colors: [{ name: "Red", hex: "#FF0000" }, ...] }
```
Function Calling
```javascript
const tools = [{
  type: "function",
  function: {
    name: "get_weather",
    description: "Get weather for a location",
    parameters: {
      type: "object",
      properties: {
        location: { type: "string" }
      },
      required: ["location"]
    }
  }
}];

const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Weather in Tokyo?" }],
  tools,
  tool_choice: "auto"
});

// Handle tool calls
if (response.choices[0].message.tool_calls) {
  // Execute function and continue conversation
}
```
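The "execute and continue" step can be sketched as a small dispatcher: map tool names to local implementations, run each call, and build the `tool`-role messages to append before asking the model to finish its answer. The `get_weather` body here is a hypothetical stand-in, not a real weather lookup.

```javascript
// Hypothetical local implementations keyed by tool name.
const toolImpls = {
  get_weather: ({ location }) => ({ location, tempC: 21, condition: "clear" }),
};

// Turn the model's tool_calls into "tool" messages carrying each result.
function runToolCalls(toolCalls) {
  return toolCalls.map((call) => ({
    role: "tool",
    tool_call_id: call.id,
    content: JSON.stringify(
      toolImpls[call.function.name](JSON.parse(call.function.arguments))
    ),
  }));
}
```

Append the assistant's tool-call message plus these `tool` messages to the conversation, then call `engine.chat.completions.create` again so the model can phrase the final reply.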
Web Worker (Non-Blocking)
main.js:
```javascript
import { WebWorkerMLCEngine } from "@mlc-ai/web-llm";

const engine = new WebWorkerMLCEngine(
  new Worker(new URL("./worker.js", import.meta.url), { type: "module" })
);
await engine.reload("Llama-3.2-1B-Instruct-q4f16_1-MLC");
```
worker.js:
```javascript
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg) => handler.onmessage(msg);
```
Service Worker (Persistent)
```javascript
// service-worker.js
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

let handler;
self.addEventListener("activate", (event) => {
  handler = new ServiceWorkerMLCEngineHandler();
});
self.addEventListener("message", (event) => handler.onmessage(event));
```
```javascript
// main.js
import { ServiceWorkerMLCEngine } from "@mlc-ai/web-llm";

await navigator.serviceWorker.register("/service-worker.js");
const engine = new ServiceWorkerMLCEngine();
// Model persists across page refreshes!
```
React Integration
```jsx
import { useState, useEffect, useCallback } from 'react';
import { CreateMLCEngine } from '@mlc-ai/web-llm';

function useWebLLM(modelId) {
  const [engine, setEngine] = useState(null);
  const [loading, setLoading] = useState(true);
  const [progress, setProgress] = useState(0);

  useEffect(() => {
    let mounted = true;
    CreateMLCEngine(modelId, {
      initProgressCallback: (p) => {
        if (mounted) setProgress(p.progress * 100);
      }
    }).then((eng) => {
      if (mounted) {
        setEngine(eng);
        setLoading(false);
      }
    });
    return () => { mounted = false; };
  }, [modelId]);

  const chat = useCallback(async (messages) => {
    if (!engine) return null;
    const response = await engine.chat.completions.create({
      messages,
      stream: true
    });
    let result = "";
    for await (const chunk of response) {
      result += chunk.choices[0]?.delta?.content || "";
    }
    return result;
  }, [engine]);

  return { engine, loading, progress, chat };
}

// Usage
function ChatApp() {
  const { chat, loading, progress } = useWebLLM(
    "Llama-3.2-1B-Instruct-q4f16_1-MLC"
  );
  const [messages, setMessages] = useState([]);
  const [input, setInput] = useState("");

  if (loading) {
    return <div>Loading model... {progress.toFixed(1)}%</div>;
  }

  const handleSubmit = async () => {
    const newMessages = [...messages, { role: "user", content: input }];
    setMessages(newMessages);
    setInput("");
    const reply = await chat(newMessages);
    setMessages([...newMessages, { role: "assistant", content: reply }]);
  };

  return (
    <div>
      {messages.map((m, i) => (
        <div key={i} className={m.role}>{m.content}</div>
      ))}
      <input value={input} onChange={(e) => setInput(e.target.value)} />
      <button onClick={handleSubmit}>Send</button>
    </div>
  );
}
```
Performance
Benchmarks
| Model | WebLLM | Native | Retained |
|---|---|---|---|
| Phi-3.5-mini (3.8B) | 71 tok/s | 89 tok/s | 80% |
| Llama-3.1-8B | 41 tok/s | 58 tok/s | 71% |
| Llama-3.2-1B | ~10 tok/s | - | - |
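Throughput figures like those above are easy to reproduce on your own hardware: time a streamed completion and divide completion tokens by elapsed seconds. With `stream_options: { include_usage: true }`, usage stats arrive on the final chunk; the arithmetic itself is just:

```javascript
// Tokens per second from a completion's token count and wall-clock time.
// Measure elapsedMs with performance.now() (or Date.now()) around the stream.
function tokensPerSecond(completionTokens, elapsedMs) {
  return elapsedMs > 0 ? (completionTokens / elapsedMs) * 1000 : 0;
}
```

Note that the first generation after load is typically slower (shader compilation and cache warm-up), so benchmark a second run.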
Browser Support
| Browser | WebGPU | Status |
|---|---|---|
| Chrome 113+ | Full | ✅ Stable |
| Edge 113+ | Full | ✅ Stable |
| Chrome Android 121+ | Full | ✅ Stable |
| Firefox (Windows) | Partial | ✅ Shipping |
| Safari 26+ | Planned | ⏳ Coming |
Current coverage: ~65% of users
Cold Start Times
| Model Size | First Load | Cached Load |
|---|---|---|
| 360M | 5-15s | 1-3s |
| 1-3B | 15-45s | 3-10s |
| 7-8B | 1-3 min | 10-30s |
Use Cases
Privacy-First Apps
```javascript
// Medical symptom checker - HIPAA-friendly
const engine = await CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC");

const response = await engine.chat.completions.create({
  messages: [
    { role: "system", content: "Medical information assistant." },
    { role: "user", content: userSymptoms }
  ]
});
// All processing happens locally
```
Offline-Capable Apps
```javascript
// Check cache status
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
  initProgressCallback: (p) => {
    if (p.progress === 1) console.log("Model cached - works offline!");
  }
});

if (!navigator.onLine) {
  console.log("Offline mode - AI still functional!");
}
```
Static Site Deployment
```html
<!DOCTYPE html>
<html>
<head><title>AI Chat - No Server</title></head>
<body>
  <div id="chat"></div>
  <script type="module">
    import { CreateMLCEngine } from "https://esm.run/@mlc-ai/web-llm";
    const engine = await CreateMLCEngine("SmolLM2-360M-Instruct-q4f16_1-MLC");
    // Deploy to GitHub Pages, Netlify, etc.
  </script>
</body>
</html>
```
Chrome Extensions
```javascript
// background.js
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new ServiceWorkerMLCEngineHandler();

chrome.runtime.onMessage.addListener((request, sender, sendResponse) => {
  if (request.type === "chat") {
    handler.engine.chat.completions.create({
      messages: request.messages
    }).then(sendResponse);
    return true; // Keep the message channel open for the async response
  }
});
```
Error Handling
Check WebGPU Support
```javascript
async function checkWebGPU() {
  if (!navigator.gpu) {
    return { supported: false, reason: "WebGPU not available" };
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, reason: "No GPU adapter" };
  }
  return {
    supported: true,
    limits: adapter.limits,
    features: [...adapter.features]
  };
}

// Usage
const gpu = await checkWebGPU();
if (!gpu.supported) {
  showFallbackUI("Please use Chrome 113+ or Edge 113+");
}
```
Handle Loading Errors
```javascript
// Declare outside the try block so the engine stays in scope afterwards
let engine;
try {
  engine = await CreateMLCEngine("Llama-3.2-3B-Instruct-q4f16_1-MLC");
} catch (error) {
  if (error.message.includes("out of memory")) {
    // Fall back to a smaller model
    engine = await CreateMLCEngine("SmolLM2-360M-Instruct-q4f16_1-MLC");
  } else {
    console.error("Failed to load model:", error);
  }
}
```
Limitations
| Limitation | Details | Mitigation |
|---|---|---|
| Browser support | ~65% of users | Show upgrade message |
| Model size | 8B max | Use smaller models |
| Cold start | 5s - 3min | Show progress bar |
| No cross-origin cache | Each site downloads | Host own models |
| Service worker lifecycle | Browser can kill | Retry logic |
| GPU memory | 4-6GB practical | Quantize aggressively |
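The "retry logic" mitigation for the service worker lifecycle can be sketched as a small exponential-backoff wrapper. This is a generic pattern under assumed defaults (3 attempts, 500 ms base delay), not a WebLLM API:

```javascript
// Retry an async call with exponential backoff, for transient failures
// such as the browser recycling a service worker mid-request.
async function withRetry(fn, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait baseDelayMs, 2x, 4x, ... between attempts.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```

Usage: `const reply = await withRetry(() => engine.chat.completions.create({ messages }));`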
Key Takeaways
- Zero server infrastructure - Deploy on static hosts
- 100% private - Data never leaves the browser
- Works offline - After first model download
- 80% native performance - WebGPU acceleration
- OpenAI-compatible API - Easy migration
- Streaming support - Real-time responses
- ~65% browser coverage - Chrome/Edge mainly
Next Steps
- Compare local AI tools for desktop
- Explore small models optimized for browsers
- Learn about quantization for model optimization
- Set up Continue.dev for coding assistance
- Check VRAM requirements for model selection
WebLLM transforms any website into an AI-powered application without servers, API costs, or privacy concerns. Whether you're building a privacy-first medical app, an offline-capable PWA, or a Chrome extension, WebLLM provides the infrastructure-free foundation for the next generation of web AI.