Published on June 20, 2026 • 14 min read

Yes — you can run a real LLM fully offline on a modern phone today. The simplest path on both Android and iPhone is the free open-source app PocketPal AI (or MLC Chat); download a 1B-4B GGUF model like Gemma 3 1B, Llama 3.2 1B/3B, or Qwen2.5 1.5B, and you get private, no-internet chat at roughly 15-40 tokens/second on a recent flagship depending on your phone and the model. The catch is RAM, not the app: 1B models run comfortably on 6-8GB phones, while 3-4B models really want a flagship with 8-12GB of memory. This guide covers every app worth installing, which models actually fit, the speeds you should expect, and exact steps for both platforms.

Running a model on your phone is the most private setup possible — the weights and your prompts never leave the device, so it works on a plane, in a tunnel, or with the network fully off. The trade-off is that these are small models: great for summarizing, drafting, quick Q&A, and offline reference, but not a replacement for a 70B model on a desktop. If privacy is your main reason, also read our local AI privacy guide.

Can a phone actually run an LLM, or is this a gimmick?

It is genuinely real now, for two reasons that both matured over 2024-2026. First, the runtimes got good: llama.cpp gained ARM-optimized CPU kernels (including Arm's KleidiAI) and Vulkan GPU acceleration for Snapdragon-class chips, and Apple shipped on-device inference as a first-class API. Second, model makers started releasing tiny, capable models — Google's Gemma 3 1B, Meta's Llama 3.2 1B/3B, and Alibaba's Qwen2.5 family in 0.5B-3B sizes — that are explicitly designed for "resource-constrained devices such as mobile phones."

The honest limitation is memory and heat. A phone shares one pool of RAM between the OS, your apps, and the model, and it has no active cooling. So a 1B model is a smooth daily-driver experience, a 3B model is usable on a flagship but will throttle during long generations, and anything 7B+ is possible to load on a 12-16GB phone but slow enough that most people won't enjoy it.

Reading articles is good. Building is better.

Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.

Start free in 30 seconds See pricing

Which apps can run a local LLM on a phone in 2026?

There are five options worth knowing, covering both platforms. Three are general "download any GGUF and chat" apps, one is Google's showcase app, and one is Apple's built-in developer framework.

App	Platforms	Engine / models	Cost	Best for
PocketPal AI	iOS + Android	llama.cpp · GGUF from Hugging Face	Free, open-source	The easiest all-round starting point
MLC Chat	iOS + Android	MLC-LLM · Llama 3.2, Gemma, Phi, Qwen	Free, open-source	Broadest model list, same UI on both OSes
LM Playground	Android only	llama.cpp · GGUF · ARM/KleidiAI tuned	Free, open-source	Fast one-tap model loading on Android
Google AI Edge Gallery	iOS + Android	LiteRT · Gemma 3n / Gemma family	Free, open-source	Trying Google's mobile-first Gemma models
Apple Foundation Models	iOS 26+ (Apple silicon)	Apple's built-in ~3B model	Free (system)	Apps that want on-device AI with zero download

A few clarifications that matter. PocketPal AI (an open-source project by Asghar Ghorbani) and MLC Chat (from the MLC-AI team) are the two most popular general-purpose pick-a-model apps; both run everything locally with no servers after the model is downloaded. LM Playground (by Andriy Druk) is Android-only and built directly on llama.cpp with GGUF models, tuned with ARM KleidiAI kernels for faster generation. Google AI Edge Gallery is Google's own open-source demo app that runs the mobile-first Gemma 3n model (and other LiteRT models) fully offline. Apple's Foundation Models framework, introduced with iOS 26, is different in kind: it exposes Apple's built-in on-device ~3B model (the one behind Apple Intelligence) to apps via Swift — you don't download a model file, but you also can't swap in your own.

Which small models actually fit in phone RAM?

This is the question that decides your experience. Phone RAM is shared, so subtract roughly 2-4GB for the OS before you budget for a model. As a rule of thumb a 4-bit quantized model needs a bit more than half its parameter count in gigabytes of RAM, plus headroom for context.

Model	Params	4-bit size (≈)	Min phone RAM	Notes
Gemma 3 1B	1B	~0.7-1 GB	4-6 GB	Google states it runs on devices with as little as 4GB RAM
Qwen2.5 1.5B	1.5B	~1 GB	6 GB	Strong multilingual + reasoning for its size
Llama 3.2 1B	1B	~0.8 GB	6-8 GB	Meta-tuned for mobile; runs on iPhone 15 Pro-class 8GB devices
Llama 3.2 3B	3B	~2 GB	8-12 GB	Noticeably smarter; wants a flagship, throttles on long replies
Gemma 3n (E2B/E4B)	2-4B effective	varies (MatFormer)	8 GB+	Multimodal, adjusts effective size to your hardware
Phi-3.5 mini	3.8B	~2.3 GB	8-12 GB	Good reasoning; heavier on mid-range phones

The practical sweet spot for most phones in 2026 is a 1B-class model: Gemma 3 1B, Llama 3.2 1B, or Qwen2.5 1.5B. They load fast, stay responsive, and leave RAM for the rest of your phone. Step up to a 3B model only if you have a recent flagship with 8GB+ and you're willing to accept slower, warmer generations. If you also run models on a laptop, our companion guide on the best local AI models for 8GB RAM lines up neatly with this list.

How fast is it really? (measured speeds and honest limits)

Speed depends far more on your phone's chip and thermals than on the app. The figures below are drawn from published 2026 on-device benchmarks (independent app testing reports roughly 30-40 tok/s for sub-1B models, 20-30 for ~1.5B, 10-20 for 3B, and 8-15 for ~4B on current iPhones) cross-checked with our own hands-on use in PocketPal and MLC Chat on a current flagship (Snapdragon 8 Gen 3 / Apple A18-class). Treat all figures as approximate — they vary with the app, prompt length, quantization, and how hot the phone already is.

Model	Phone class	Generation speed (approx.)	Experience
Gemma 3 1B (Q4)	Recent flagship	~25-40 tok/s	Feels instant for short answers
Llama 3.2 1B (Q4)	iPhone 16 Pro-class	~25-40 tok/s (short prompts)	Smooth daily driver
Qwen2.5 1.5B (Q4)	iPhone 16 Pro-class	~20-30 tok/s	Responsive, a touch slower than 1B
Llama 3.2 3B (Q4)	iPhone 16 Pro	~15-23 tok/s, then throttles	Usable, slows on long replies
7B model (Q4)	12GB+ flagship	~5-10 tok/s	Loads, but feels sluggish

In first-hand use, the pattern is consistent: a 1B model on a 2023-2025 flagship reads back faster than you can comfortably follow, while a 3B model starts in the low-20s tok/s on an iPhone 16 Pro and then thermal-throttles once the chip heats up over a long generation. The other limit is battery: sustained on-device inference is one of the heaviest things you can ask a phone to do, so it drains fast and the back of the device gets warm. None of this is a dealbreaker — it just means phone LLMs are best for short, frequent, private interactions rather than churning out long documents.

For comparison, a laptop with a discrete GPU runs the same small models several times faster and without thermal limits — if you're weighing a phone setup against an AI-capable laptop, see Copilot+ PC vs RTX local AI.

Reading articles is good. Building is better.

Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.

Start free in 30 seconds See pricing

How do I run an LLM on Android? (step by step)

The fastest route on Android is a llama.cpp-based app. Using PocketPal AI (works the same idea in LM Playground):

Install the app. Get PocketPal AI from Google Play (or LM Playground for an Android-native option). Both are free and open-source.
Pick a model that fits. In the app's model list, choose a 1B-class GGUF — Gemma 3 1B, Llama 3.2 1B, or Qwen2.5 1.5B. On an 8GB+ phone you can try Llama 3.2 3B.
Download the weights. Tap download; the GGUF (typically a few hundred MB to ~2GB) is fetched once and cached on-device. After this you can turn airplane mode on.
Load and chat. Tap to load the model into memory, then type. The first reply takes a moment to "warm up"; subsequent replies are faster.
Tune if needed. If the app lets you, keep context length modest (e.g. 2048-4096 tokens) to avoid memory spikes, and prefer a Q4_K_M quant for the best size/quality balance.

For power users, you can also run the official llama.cpp directly inside Termux (no root required): install git, cmake, and clang, build llama.cpp, download a GGUF from Hugging Face, and run llama-server — on Snapdragon/Adreno devices you can also try the OpenCL or Vulkan GPU backends, though GPU offload support on phones is still uneven. That's more involved than an app, but it gives you a local OpenAI-style API on the phone.

How do I run an LLM on iPhone? (step by step)

iPhone has two distinct paths.

Path A — download-and-chat apps (PocketPal AI / MLC Chat):

Install PocketPal AI or MLC Chat from the App Store (both free). MLC Chat tends to expose the broadest model list — Llama 3.2, Gemma, Phi-3.5, Qwen2.5.
In the app, choose a model. On an 8GB iPhone (15 Pro and newer), a Llama 3.2 1B or Gemma 3 1B runs smoothly; 3B is possible but warmer.
Download the weights once. They cache locally, so afterward everything runs with the network off.
Tap to load the model, then chat. Expect a brief load, then fast short-answer generation.

Path B — Apple's built-in model (no download): On iOS 26 and later, Apple-Intelligence-capable iPhones include Apple's on-device ~3B Foundation Model. You don't install or manage it — instead, apps built with Apple's Foundation Models framework tap it directly, giving you private, offline, free-of-cost AI features inside those apps. It's the most seamless option, but you can't choose or swap the model the way you can with PocketPal or MLC Chat.

Key Takeaways

Phone LLMs are real and private — PocketPal AI and MLC Chat (both free, open-source, iOS + Android) are the easiest way to run a model fully offline; LM Playground is a strong Android-only alternative.
RAM is the gatekeeper. Budget ~2-4GB for the OS first. 1B models (Gemma 3 1B, Llama 3.2 1B, Qwen2.5 1.5B) run on 6-8GB phones; 3-4B models want an 8-12GB flagship.
Speed is good but thermal-limited. Expect roughly ~25-40 tok/s on 1B models and ~15-23 tok/s falling off on a 3B model as the phone heats up; battery drains fast during inference.
Apple iPhones (iOS 26+) have a built-in ~3B model via the Foundation Models framework — zero download, but no model choice; download-apps give you flexibility instead.
Match the task to the model. Phone LLMs shine at short, frequent, private tasks (summaries, drafts, offline Q&A), not long-form generation.

Next Steps

See which models to download once you know your phone's memory in our best local AI models for 8GB RAM breakdown — most of those picks are exactly what you'll load in PocketPal.
Want the same models on a desktop with Ollama for heavier work? Start with the best Ollama models guide.
Understand exactly why on-device beats the cloud for sensitive data in our local AI privacy guide.
Deciding between a phone setup and a proper AI laptop? Compare options in Copilot+ PC vs RTX local AI.
Official runtime docs and model cards: llama.cpp on GitHub and Apple's Foundation Models research.

Run an LLM on Your Phone (2026): Offline AI on Android & iPhone

Want to go deeper than this article?

Can a phone actually run an LLM, or is this a gimmick?

Reading articles is good. Building is better.

Which apps can run a local LLM on a phone in 2026?

Which small models actually fit in phone RAM?

How fast is it really? (measured speeds and honest limits)

Reading articles is good. Building is better.

How do I run an LLM on Android? (step by step)

How do I run an LLM on iPhone? (step by step)

Key Takeaways

Next Steps

Sold on local AI? Learn to run it for real.

Liked this? 20 full AI courses are waiting.

Local AI Master Research Team

Build Real AI on Your Machine

Want structured AI education?

Continue Your Local AI Journey

How to Install Your First Local AI Model

How to Choose the Right AI Model for Your Computer

Comments (0)

Go from reading about AI to building with AI

Ready to Go Beyond Tutorials?

Related Guides

Best Local AI Models for 8GB RAM

Best Ollama Models

Local AI Privacy Guide

Copilot+ PC vs RTX Local AI

Written by the Local AI Master Team

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

Go from reading about AI to building with AI