Run Claude Code Offline with Ollama (2026): Local Model, No Cloud Bill
Yes — you can run Claude Code fully offline against a local model. As of Ollama v0.14.0 (Anthropic Messages API compatibility shipped January 16, 2026), Ollama exposes a native Anthropic-compatible endpoint at /v1/messages, so Claude Code talks to it directly — no LiteLLM proxy, no API translation layer. You set two environment variables (ANTHROPIC_BASE_URL=http://localhost:11434 and ANTHROPIC_AUTH_TOKEN=ollama), pull a coding model like qwen3-coder, and run claude. Your proprietary code never leaves the machine, and there's no per-token Anthropic bill — the only cost is the GPU you already own. The honest catch: a local 30B-class model is not Sonnet- or Opus-class on the hardest, longest agentic tasks, and Claude Code wants a large context window (Ollama's docs recommend at least 64K), which costs VRAM.
This guide covers the native endpoint (why no proxy is needed in 2026), the exact env vars, the context-window fix that makes or breaks it, which local model to pick for your hardware, when the older LiteLLM-proxy route still matters, and a straight comparison against cloud Claude.
Can Claude Code run offline on a local model?
Claude Code is Anthropic's terminal coding agent. By default it calls Anthropic's hosted API, which means a network connection and a metered bill. But Claude Code reads two environment variables — ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN — that let you redirect every request to any server that speaks the Anthropic Messages API format. Once Ollama is that server, Claude Code runs entirely on your machine.
That buys you three things that matter for real work:
- $0/mo — no Anthropic token billing. The model runs on hardware you already paid for.
- 100% private — your proprietary code, secrets, and repo structure stay on-device. Nothing is uploaded to a third party. This is the reason regulated and IP-sensitive teams do this at all.
- Offline & unlimited — no rate limits, no usage caps, works on a plane.
The trade-off is real and we cover it in the limits section: the local model doing the thinking is a 14–30B open-weight model, not Anthropic's frontier Claude. For scoped edits, refactors, test generation, and boilerplate on a private codebase, a well-configured local setup is a genuine daily driver. For the gnarliest multi-file, long-horizon tasks, cloud Claude still leads.
How does Claude Code talk to Ollama? (native, no proxy)
This is the part that changed in 2026 and the reason most older tutorials are out of date.
Ollama v0.14.0 (compatibility announced January 16, 2026) added a native Anthropic-compatible endpoint at /v1/messages. Ollama listens on its usual port (11434), accepts requests in Anthropic's exact Messages format, translates internally to whatever the underlying model expects, runs inference, and returns a response in the same Anthropic format. Streaming and tool (function) calling are both supported — which is what makes an agent like Claude Code actually work, not just chat.
The practical consequence: you no longer need a proxy. Before this, the working route was to run LiteLLM as an Anthropic-compatible shim that translated Claude Code's calls into requests for Ollama's own /api/chat endpoint. That still works and is still useful in a few cases, but for a single local machine the native endpoint is simpler and faster, with one fewer moving part.
You can sanity-check the endpoint with a raw request before involving Claude Code at all:
curl -X POST http://localhost:11434/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: ollama" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "qwen3-coder",
"max_tokens": 1024,
"messages": [{ "role": "user", "content": "Say hello in one line." }]
}'
If that returns a JSON response in Anthropic's shape, Claude Code will work against the same endpoint.
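Since tool calling is what makes the agent loop work, it can be worth repeating the check with a tool attached. The sketch below is a minimal example, assuming the endpoint accepts Anthropic's standard `tools` array. `list_files` is a hypothetical tool invented for illustration, and `tool_request_body` and `send_tool_request` are helper names of our own, not part of Ollama or Claude Code.

```shell
# Print an Anthropic-format Messages request that offers the model one tool.
# "list_files" is a hypothetical tool used only for illustration.
tool_request_body() {
  model=${1:-qwen3-coder}
  cat <<EOF
{
  "model": "$model",
  "max_tokens": 1024,
  "tools": [{
    "name": "list_files",
    "description": "List the files in a directory of the current project.",
    "input_schema": {
      "type": "object",
      "properties": {
        "path": { "type": "string", "description": "Directory to list" }
      },
      "required": ["path"]
    }
  }],
  "messages": [{ "role": "user", "content": "Which files are in the src directory?" }]
}
EOF
}

# POST the body to the local endpoint, with the same headers as the plain check.
# Args: [base_url] [model]
send_tool_request() {
  base_url=${1:-http://localhost:11434}
  tool_request_body "${2:-qwen3-coder}" |
    curl -s -X POST "$base_url/v1/messages" \
      -H "Content-Type: application/json" \
      -H "x-api-key: ollama" \
      -H "anthropic-version: 2023-06-01" \
      -d @-
}
```

Run `send_tool_request` with Ollama up. In the Anthropic format, a model that decides to call the tool returns a content block with "type": "tool_use" and a "stop_reason" of "tool_use". If you only ever get plain text back, that model is likely to struggle inside Claude Code.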
Step-by-step: point Claude Code at Ollama
You need three pieces: Ollama (v0.14.0 or later — that's what ships the native endpoint), a coding model, and the Claude Code CLI.
1. Install or update Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com
# Confirm you're on 0.14.0+ — the native Anthropic endpoint depends on it:
ollama --version
2. Pull a coding model (pick based on your VRAM — see the model section):
# Strong local agentic pick (~24GB VRAM):
ollama pull qwen3-coder:30b
# Lighter fallback for smaller cards:
ollama pull qwen2.5-coder:14b
3. Point Claude Code at Ollama with two env vars
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama # required but ignored by Ollama
export ANTHROPIC_API_KEY="" # clear any real key so it doesn't override
Note the base URL is the bare host — http://localhost:11434 with no /v1 appended. Claude Code adds the /v1/messages path itself. Put these exports in your ~/.zshrc / ~/.bashrc (or a project .envrc) so a new shell picks them up automatically.
4. Run Claude Code against the local model
claude --model qwen3-coder:30b
That's the whole setup. Newer Ollama builds also ship a shortcut, ollama launch claude, that wires the environment for you — handy, but knowing the two variables above is what lets you debug it when something's off.
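If you also use cloud Claude Code, exporting these variables globally redirects every session to Ollama. A minimal sketch of an alternative is a wrapper function that checks the Ollama version and sets the variables for a single invocation only. `claude_local`, `version_ge`, and `installed_ollama_version` are hypothetical helper names, and the version parsing assumes `ollama --version` prints a plain dotted number such as 0.14.0 somewhere in its output.

```shell
# Return success if dotted version $1 is >= dotted version $2.
# Assumes purely numeric fields (e.g. 0.14.0); missing fields count as 0.
version_ge() {
  IFS=. read -r h1 h2 h3 <<EOF
$1
EOF
  IFS=. read -r w1 w2 w3 <<EOF
$2
EOF
  h1=${h1:-0}; h2=${h2:-0}; h3=${h3:-0}
  w1=${w1:-0}; w2=${w2:-0}; w3=${w3:-0}
  if [ "$h1" -ne "$w1" ]; then [ "$h1" -gt "$w1" ]; return; fi
  if [ "$h2" -ne "$w2" ]; then [ "$h2" -gt "$w2" ]; return; fi
  [ "$h3" -ge "$w3" ]
}

# Print the first x.y.z number found in `ollama --version` output (empty if none).
installed_ollama_version() {
  ollama --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Run Claude Code against local Ollama for this invocation only.
claude_local() {
  installed=$(installed_ollama_version) || installed=""
  if ! version_ge "${installed:-0.0.0}" 0.14.0; then
    echo "Ollama 0.14.0+ is required for /v1/messages (found: ${installed:-none})" >&2
    return 1
  fi
  # The assignments below apply to this one command, not to the whole shell.
  ANTHROPIC_BASE_URL="http://localhost:11434" \
  ANTHROPIC_AUTH_TOKEN="ollama" \
  ANTHROPIC_API_KEY="" \
  claude "$@"
}
```

Add the functions to ~/.zshrc or ~/.bashrc and call `claude_local --model qwen3-coder:30b`. A plain `claude` keeps whatever configuration it had before.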
Set a big enough context window
This is the single most common reason a local Claude Code setup feels "broken." Ollama defaults a model's context window (num_ctx) to a small value regardless of what the model itself supports, and an agent like Claude Code — system prompt + file contents + tool-call history — blows past that within a few turns, after which it truncates, loops, or "forgets" the task.
Ollama's own Claude Code guidance is explicit: Claude Code needs a large context window of at least 64K tokens. (32K is the usual floor for coding work in general, but agentic Claude Code wants more headroom.) The most durable fix is to set the context size in a custom Modelfile, which overrides the model's default:
# Save as Modelfile (no extension)
FROM qwen3-coder:30b
PARAMETER num_ctx 65536
# Build a tag you can pass to claude --model
ollama create qwen3-coder-cc-64k -f ./Modelfile
claude --model qwen3-coder-cc-64k
Why you can't just max it out: the KV cache grows linearly with context length, so doubling num_ctx roughly doubles the memory the context needs, on top of the model weights. On a 24GB card, 64K is comfortable for a 30B MoE at Q4; pushing toward the model's full native context (Qwen3-Coder is 256K, extendable to ~1M with YaRN) needs offloading or a bigger card. Size the context to your task and your VRAM, not to the model's advertised maximum. For a quick experiment instead of a Modelfile, you can launch the server with OLLAMA_CONTEXT_LENGTH=65536 ollama serve.
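To make the scaling concrete, here is a back-of-envelope sketch of the KV cache size. It assumes a standard attention layout with an unquantized 16-bit cache, where each token stores one key and one value vector per layer per KV head. The architecture numbers in the loop (48 layers, 4 KV heads, head dimension 128) are illustrative placeholders, so check the model card for real values. `kv_cache_mib` is a hypothetical helper name.

```shell
# Estimate KV-cache size in MiB.
# Args: context_tokens layers kv_heads head_dim [bytes_per_element, default 2 = 16-bit]
kv_cache_mib() {
  kv_ctx=$1; kv_layers=$2; kv_heads=$3; kv_head_dim=$4; kv_bytes=${5:-2}
  # 2x because both a key and a value vector are cached per token.
  bytes_per_token=$(( 2 * kv_layers * kv_heads * kv_head_dim * kv_bytes ))
  echo $(( bytes_per_token * kv_ctx / 1048576 ))
}

# Compare three context sizes using the placeholder architecture numbers.
for n in 32768 65536 131072; do
  echo "num_ctx=$n -> ~$(kv_cache_mib "$n" 48 4 128) MiB of KV cache"
done
```

With these placeholder numbers the estimate is 3072, 6144, and 12288 MiB. Each doubling of num_ctx doubles the cache, and all of it is memory needed in addition to the model weights and runtime overhead.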
Which local model should you use?
Agentic coding is harder than autocomplete: the model has to follow tool-calling instructions reliably across many turns. These are the open-weight models worth running with offline Claude Code in 2026.
| Model | Params (active) | Q4 download | Native context | Notes | Best for |
|---|---|---|---|---|---|
| Qwen3-Coder 30B A3B | 30.5B (3.3B active, MoE) | ~19 GB | 256K (→~1M w/ YaRN) | ~51.6% SWE-bench Verified (OpenHands harness); fast for its size; Ollama's headline local coding pick | 24GB cards, big-repo context |
| gpt-oss 20B | 21B (3.6B active, MoE) | ~13 GB | 128K | OpenAI's open-weight model; solid general + tool use; Ollama-recommended local option | 16–24GB cards |
| Qwen2.5-Coder 14B | 14B (dense) | ~9 GB | 128K | Older but reliable for tool-calling at this size | 12–16GB cards |
| Qwen2.5-Coder 7B | 7B (dense) | ~4.7 GB | 128K | Lightweight fallback; scoped edits only | 8GB cards |
Download sizes are the Q4-class figures Ollama lists for each tag; actual VRAM at load is higher once the KV cache and runtime overhead are added, and grows with the context window you set. SWE-bench figures depend heavily on the agent harness and turn budget — treat them as directional, not absolute.
Qwen3-Coder 30B A3B is the one to reach for first on a 24GB card: it's a Mixture-of-Experts model (30.5B total, only ~3.3B active per token) so it's noticeably faster than a dense 30B, it has a giant native context, and Ollama specifically calls it out as a recommended local coding model. gpt-oss 20B is a strong second if you want a slightly lighter footprint. Below that, Qwen2.5-Coder 14B / 7B keep things working on smaller GPUs at the cost of agentic reliability — they'll handle scoped edits but stumble on long, multi-file tasks.
If you want the broader field, see our tested best local AI models for programming ranking, the focused best 14B coding models breakdown for mid-tier hardware, and the dedicated best local AI coding models page.
What hardware do you need?
The model weights are only half the VRAM story — context (KV cache) is the other half, and Claude Code pushes context hard. Rough guidance:
| GPU / Unified RAM | Realistic model | Context you can run |
|---|---|---|
| 8 GB | Qwen2.5-Coder 7B (Q4) | ~16–32K |
| 12–16 GB | Qwen2.5-Coder 14B (Q4) | ~32K |
| 24 GB (RTX 3090/4090) | Qwen3-Coder 30B / gpt-oss 20B (Q4) | ~64K comfortably |
| 32 GB+ unified (Apple Silicon) | Qwen3-Coder 30B | 64–128K |
CPU-only inference works but is slow enough that agent loops become tedious. An Apple Silicon Mac with 32GB+ unified memory or a 24GB NVIDIA card is the practical sweet spot. For a full memory map of every model and quant, see our Ollama RAM/VRAM table.
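The table can be folded into a small lookup for setup scripts. This is a sketch that simply mirrors the rows above: `suggest_local_setup` is a hypothetical helper, the thresholds are the table's rough guidance rather than measured limits, and the `qwen2.5-coder:7b` tag is assumed to follow the same naming as the tags used earlier.

```shell
# Print "<model tag> <num_ctx>" for a given amount of VRAM or unified memory in GB.
# Context values take the conservative end of each range in the table.
suggest_local_setup() {
  mem_gb=$1
  if [ "$mem_gb" -ge 24 ]; then
    echo "qwen3-coder:30b 65536"
  elif [ "$mem_gb" -ge 12 ]; then
    echo "qwen2.5-coder:14b 32768"
  elif [ "$mem_gb" -ge 8 ]; then
    echo "qwen2.5-coder:7b 16384"
  else
    echo "Under 8 GB: no row in the table covers this" >&2
    return 1
  fi
}
```

For example, `suggest_local_setup 24` prints `qwen3-coder:30b 65536`; the second field is the value to try for num_ctx in the Modelfile, adjusted up or down once you see real memory use.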
Is the LiteLLM proxy still needed?
Before Ollama's native endpoint shipped, the standard way to run Claude Code on a local model was a LiteLLM proxy: LiteLLM presented an Anthropic-compatible /v1/messages endpoint and translated each call into a request for Ollama's own /api/chat endpoint. You'd point ANTHROPIC_BASE_URL at LiteLLM (commonly http://localhost:4000) instead of at Ollama directly.
With Ollama v0.14.0+, you don't need LiteLLM for a basic local setup — the native endpoint does the translation in-process. But the proxy route is still worth knowing for a few situations:
- Older Ollama you can't upgrade — LiteLLM bridges the gap.
- Non-Ollama backends — LiteLLM fronts vLLM, llama.cpp, LM Studio, or a mix behind one Anthropic-compatible URL.
- Routing, fallbacks, logging, and rate-limit policy across several models or machines — LiteLLM is built for that; Ollama's endpoint is deliberately minimal.
For one machine running Ollama, prefer the native endpoint. Reach for LiteLLM when you outgrow a single backend or need its routing features.
How does offline Claude Code compare to cloud Claude?
Being honest here matters more than cheerleading. The tool (Claude Code) is the same either way; what changes is the model behind it.
Where offline Claude Code wins
- Cost: $0/mo vs. Anthropic token billing that adds up fast on big agentic sessions.
- Privacy: proprietary code never leaves the machine — the whole reason to do this.
- No limits / offline: unlimited runs, no usage caps, works without a connection.
Where cloud Claude still wins
- Raw capability: Anthropic's frontier Claude (Sonnet/Opus-class) leads on the hardest multi-file, long-horizon agent tasks. A local 14–30B model is strong but not in that weight class on the toughest problems.
- Context with no VRAM math: cloud gives you 200K+ context without any memory planning on your side; locally, every extra token of context costs you GPU memory.
- Zero setup: no Ollama version checks, no Modelfiles, no num_ctx tuning, no quant tradeoffs.
The pragmatic pattern most developers land on: local Claude Code for the bulk of day-to-day, private, scoped work; cloud Claude for the occasional hard task where the extra capability is worth paying for. If you'd rather run an agent inside VS Code than in the terminal, Cline + Ollama is the closest equivalent. To give any local agent real tools — files, web, databases — wire in Ollama MCP integration, and if you're new to that protocol start with MCP servers explained. For everything Ollama itself can do, the complete Ollama guide is the hub.
Key Takeaways
- Claude Code runs fully offline on Ollama (free, private, no rate limits) using Ollama v0.14.0+'s native Anthropic Messages endpoint at /v1/messages (compatibility shipped January 16, 2026). No proxy required.
- Two env vars do it: ANTHROPIC_BASE_URL=http://localhost:11434 (no /v1) and ANTHROPIC_AUTH_TOKEN=ollama, then claude --model qwen3-coder:30b. Clear any real ANTHROPIC_API_KEY so it doesn't override.
- Context window is the trap. Claude Code wants ≥64K tokens; Ollama defaults low. Bake num_ctx into a custom Modelfile; it is the single highest-impact fix.
- Model pick: Qwen3-Coder 30B A3B (MoE, fast, 256K context) is the best 24GB choice; gpt-oss 20B and Qwen2.5-Coder 14B/7B step down for smaller cards. Context costs VRAM, so size it to the task.
- LiteLLM is no longer required for a single Ollama machine, but it's still the right tool for non-Ollama backends, multi-model routing, and fallbacks.
- Local for private, scoped, unlimited work; cloud Claude still leads on the hardest tasks. Use both deliberately.
Next Steps
- Prefer an agent inside the editor over the terminal? Set up Cline + Ollama — same local models, in VS Code.
- Choosing a model? Read our tested best local AI models for programming ranking and the focused best 14B coding models guide.
- Want your agent to touch files, browsers, and databases safely? Add Ollama MCP integration (and read MCP servers explained first if it's new to you).
- New to Ollama itself? The complete Ollama guide covers install, models, and tuning end to end.
External references: Ollama Anthropic API compatibility docs · Ollama + Claude Code blog post.