Run an LLM on Your Phone (2026): Offline AI on Android & iPhone
Want to go deeper than this article?
Free account unlocks the first chapter of all 20 courses — RAG, agents, MCP, voice AI, MLOps, real GitHub repos.
Sold on local AI? Learn to run it for real. Private, offline AI from fundamentals to production — your data never leaves your machine. First chapter free.
Published on June 20, 2026 • 14 min read
Yes — you can run a real LLM fully offline on a modern phone today. The simplest path on both Android and iPhone is the free open-source app PocketPal AI (or MLC Chat); download a 1B-4B GGUF model like Gemma 3 1B, Llama 3.2 1B/3B, or Qwen2.5 1.5B, and you get private, no-internet chat at roughly 15-40 tokens/second on a recent flagship depending on your phone and the model. The catch is RAM, not the app: 1B models run comfortably on 6-8GB phones, while 3-4B models really want a flagship with 8-12GB of memory. This guide covers every app worth installing, which models actually fit, the speeds you should expect, and exact steps for both platforms.
Running a model on your phone is the most private setup possible — the weights and your prompts never leave the device, so it works on a plane, in a tunnel, or with the network fully off. The trade-off is that these are small models: great for summarizing, drafting, quick Q&A, and offline reference, but not a replacement for a 70B model on a desktop. If privacy is your main reason, also read our local AI privacy guide.
Can a phone actually run an LLM, or is this a gimmick?
It is genuinely real now, for two reasons that both matured over 2024-2026. First, the runtimes got good: llama.cpp gained ARM-optimized CPU kernels (including Arm's KleidiAI) and Vulkan GPU acceleration for Snapdragon-class chips, and Apple shipped on-device inference as a first-class API. Second, model makers started releasing tiny, capable models — Google's Gemma 3 1B, Meta's Llama 3.2 1B/3B, and Alibaba's Qwen2.5 family in 0.5B-3B sizes — that are explicitly designed for "resource-constrained devices such as mobile phones."
The honest limitation is memory and heat. A phone shares one pool of RAM between the OS, your apps, and the model, and it has no active cooling. So a 1B model is a smooth daily-driver experience, a 3B model is usable on a flagship but will throttle during long generations, and anything 7B+ is possible to load on a 12-16GB phone but slow enough that most people won't enjoy it.
Reading articles is good. Building is better.
Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.
Which apps can run a local LLM on a phone in 2026?
There are five options worth knowing, covering both platforms. Three are general "download any GGUF and chat" apps, one is Google's showcase app, and one is Apple's built-in developer framework.
| App | Platforms | Engine / models | Cost | Best for |
|---|---|---|---|---|
| PocketPal AI | iOS + Android | llama.cpp · GGUF from Hugging Face | Free, open-source | The easiest all-round starting point |
| MLC Chat | iOS + Android | MLC-LLM · Llama 3.2, Gemma, Phi, Qwen | Free, open-source | Broadest model list, same UI on both OSes |
| LM Playground | Android only | llama.cpp · GGUF · ARM/KleidiAI tuned | Free, open-source | Fast one-tap model loading on Android |
| Google AI Edge Gallery | iOS + Android | LiteRT · Gemma 3n / Gemma family | Free, open-source | Trying Google's mobile-first Gemma models |
| Apple Foundation Models | iOS 26+ (Apple silicon) | Apple's built-in ~3B model | Free (system) | Apps that want on-device AI with zero download |
A few clarifications that matter. PocketPal AI (an open-source project by Asghar Ghorbani) and MLC Chat (from the MLC-AI team) are the two most popular general-purpose pick-a-model apps; both run everything locally with no servers after the model is downloaded. LM Playground (by Andriy Druk) is Android-only and built directly on llama.cpp with GGUF models, tuned with ARM KleidiAI kernels for faster generation. Google AI Edge Gallery is Google's own open-source demo app that runs the mobile-first Gemma 3n model (and other LiteRT models) fully offline. Apple's Foundation Models framework, introduced with iOS 26, is different in kind: it exposes Apple's built-in on-device ~3B model (the one behind Apple Intelligence) to apps via Swift — you don't download a model file, but you also can't swap in your own.
Which small models actually fit in phone RAM?
This is the question that decides your experience. Phone RAM is shared, so subtract roughly 2-4GB for the OS before you budget for a model. As a rule of thumb a 4-bit quantized model needs a bit more than half its parameter count in gigabytes of RAM, plus headroom for context.
| Model | Params | 4-bit size (≈) | Min phone RAM | Notes |
|---|---|---|---|---|
| Gemma 3 1B | 1B | ~0.7-1 GB | 4-6 GB | Google states it runs on devices with as little as 4GB RAM |
| Qwen2.5 1.5B | 1.5B | ~1 GB | 6 GB | Strong multilingual + reasoning for its size |
| Llama 3.2 1B | 1B | ~0.8 GB | 6-8 GB | Meta-tuned for mobile; runs on iPhone 15 Pro-class 8GB devices |
| Llama 3.2 3B | 3B | ~2 GB | 8-12 GB | Noticeably smarter; wants a flagship, throttles on long replies |
| Gemma 3n (E2B/E4B) | 2-4B effective | varies (MatFormer) | 8 GB+ | Multimodal, adjusts effective size to your hardware |
| Phi-3.5 mini | 3.8B | ~2.3 GB | 8-12 GB | Good reasoning; heavier on mid-range phones |
The practical sweet spot for most phones in 2026 is a 1B-class model: Gemma 3 1B, Llama 3.2 1B, or Qwen2.5 1.5B. They load fast, stay responsive, and leave RAM for the rest of your phone. Step up to a 3B model only if you have a recent flagship with 8GB+ and you're willing to accept slower, warmer generations. If you also run models on a laptop, our companion guide on the best local AI models for 8GB RAM lines up neatly with this list.
How fast is it really? (measured speeds and honest limits)
Speed depends far more on your phone's chip and thermals than on the app. The figures below are drawn from published 2026 on-device benchmarks (independent app testing reports roughly 30-40 tok/s for sub-1B models, 20-30 for ~1.5B, 10-20 for 3B, and 8-15 for ~4B on current iPhones) cross-checked with our own hands-on use in PocketPal and MLC Chat on a current flagship (Snapdragon 8 Gen 3 / Apple A18-class). Treat all figures as approximate — they vary with the app, prompt length, quantization, and how hot the phone already is.
| Model | Phone class | Generation speed (approx.) | Experience |
|---|---|---|---|
| Gemma 3 1B (Q4) | Recent flagship | ~25-40 tok/s | Feels instant for short answers |
| Llama 3.2 1B (Q4) | iPhone 16 Pro-class | ~25-40 tok/s (short prompts) | Smooth daily driver |
| Qwen2.5 1.5B (Q4) | iPhone 16 Pro-class | ~20-30 tok/s | Responsive, a touch slower than 1B |
| Llama 3.2 3B (Q4) | iPhone 16 Pro | ~15-23 tok/s, then throttles | Usable, slows on long replies |
| 7B model (Q4) | 12GB+ flagship | ~5-10 tok/s | Loads, but feels sluggish |
In first-hand use, the pattern is consistent: a 1B model on a 2023-2025 flagship reads back faster than you can comfortably follow, while a 3B model starts in the low-20s tok/s on an iPhone 16 Pro and then thermal-throttles once the chip heats up over a long generation. The other limit is battery: sustained on-device inference is one of the heaviest things you can ask a phone to do, so it drains fast and the back of the device gets warm. None of this is a dealbreaker — it just means phone LLMs are best for short, frequent, private interactions rather than churning out long documents.
For comparison, a laptop with a discrete GPU runs the same small models several times faster and without thermal limits — if you're weighing a phone setup against an AI-capable laptop, see Copilot+ PC vs RTX local AI.
Reading articles is good. Building is better.
Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.
How do I run an LLM on Android? (step by step)
The fastest route on Android is a llama.cpp-based app. Using PocketPal AI (works the same idea in LM Playground):
- Install the app. Get PocketPal AI from Google Play (or LM Playground for an Android-native option). Both are free and open-source.
- Pick a model that fits. In the app's model list, choose a 1B-class GGUF — Gemma 3 1B, Llama 3.2 1B, or Qwen2.5 1.5B. On an 8GB+ phone you can try Llama 3.2 3B.
- Download the weights. Tap download; the GGUF (typically a few hundred MB to ~2GB) is fetched once and cached on-device. After this you can turn airplane mode on.
- Load and chat. Tap to load the model into memory, then type. The first reply takes a moment to "warm up"; subsequent replies are faster.
- Tune if needed. If the app lets you, keep context length modest (e.g. 2048-4096 tokens) to avoid memory spikes, and prefer a Q4_K_M quant for the best size/quality balance.
For power users, you can also run the official llama.cpp directly inside Termux (no root required): install git, cmake, and clang, build llama.cpp, download a GGUF from Hugging Face, and run llama-server — on Snapdragon/Adreno devices you can also try the OpenCL or Vulkan GPU backends, though GPU offload support on phones is still uneven. That's more involved than an app, but it gives you a local OpenAI-style API on the phone.
How do I run an LLM on iPhone? (step by step)
iPhone has two distinct paths.
Path A — download-and-chat apps (PocketPal AI / MLC Chat):
- Install PocketPal AI or MLC Chat from the App Store (both free). MLC Chat tends to expose the broadest model list — Llama 3.2, Gemma, Phi-3.5, Qwen2.5.
- In the app, choose a model. On an 8GB iPhone (15 Pro and newer), a Llama 3.2 1B or Gemma 3 1B runs smoothly; 3B is possible but warmer.
- Download the weights once. They cache locally, so afterward everything runs with the network off.
- Tap to load the model, then chat. Expect a brief load, then fast short-answer generation.
Path B — Apple's built-in model (no download): On iOS 26 and later, Apple-Intelligence-capable iPhones include Apple's on-device ~3B Foundation Model. You don't install or manage it — instead, apps built with Apple's Foundation Models framework tap it directly, giving you private, offline, free-of-cost AI features inside those apps. It's the most seamless option, but you can't choose or swap the model the way you can with PocketPal or MLC Chat.
Key Takeaways
- Phone LLMs are real and private — PocketPal AI and MLC Chat (both free, open-source, iOS + Android) are the easiest way to run a model fully offline; LM Playground is a strong Android-only alternative.
- RAM is the gatekeeper. Budget ~2-4GB for the OS first. 1B models (Gemma 3 1B, Llama 3.2 1B, Qwen2.5 1.5B) run on 6-8GB phones; 3-4B models want an 8-12GB flagship.
- Speed is good but thermal-limited. Expect roughly ~25-40 tok/s on 1B models and ~15-23 tok/s falling off on a 3B model as the phone heats up; battery drains fast during inference.
- Apple iPhones (iOS 26+) have a built-in ~3B model via the Foundation Models framework — zero download, but no model choice; download-apps give you flexibility instead.
- Match the task to the model. Phone LLMs shine at short, frequent, private tasks (summaries, drafts, offline Q&A), not long-form generation.
Next Steps
- See which models to download once you know your phone's memory in our best local AI models for 8GB RAM breakdown — most of those picks are exactly what you'll load in PocketPal.
- Want the same models on a desktop with Ollama for heavier work? Start with the best Ollama models guide.
- Understand exactly why on-device beats the cloud for sensitive data in our local AI privacy guide.
- Deciding between a phone setup and a proper AI laptop? Compare options in Copilot+ PC vs RTX local AI.
- Official runtime docs and model cards: llama.cpp on GitHub and Apple's Foundation Models research.
Sold on local AI? Learn to run it for real.
Private, offline AI from fundamentals to production — your data never leaves your machine. First chapter free.
Liked this? 20 full AI courses are waiting.
From fundamentals to RAG, agents, MCP servers, voice AI, and production deployment with real GitHub repos. First chapter free, every course.
Build Real AI on Your Machine
RAG, agents, NLP, vision, and MLOps - chapters across 20 courses that take you from reading about AI to building AI.
Want structured AI education?
20 courses, 495+ chapters, from $9. Understand AI, don't just use it.
Continue Your Local AI Journey
- PILLARLocal AI vs ChatGPT 2026: Save $240/yr (Tested)
- AI on Synology NAS: Docker + Ollama Self-Hosted Setup (2026)
- Air-Gapped AI Deployment: Complete Offline Setup Guide (2026)
- blog/gpt-4o-vs-claude-35-sonnet-2025-comparison
- blog/local-vs-cloud-llm-deployment-strategies
- blog/mistral-large-vs-claude-35-sonnet-2025
- Build an Offline AI Survival Kit: No Internet Required
- Build Local AI Chatbot: Run ChatGPT FREE & Offline 2026
- Dify Self-Hosted: Deploy Your Own AI Platform
- GDPR-Compliant Local AI: Why Self-Hosted Beats Cloud (2026)
Comments (0)
No comments yet. Be the first to share your thoughts!