Run Claude Code Offline with Ollama (2026): Local Model, No Cloud Bill

June 21, 2026
12 min read
Local AI Master Research Team

Yes — you can run Claude Code fully offline against a local model. As of Ollama v0.14.0 (Anthropic Messages API compatibility shipped January 16, 2026), Ollama exposes a native Anthropic-compatible endpoint at /v1/messages, so Claude Code talks to it directly — no LiteLLM proxy, no API translation layer. You set two environment variables (ANTHROPIC_BASE_URL=http://localhost:11434 and ANTHROPIC_AUTH_TOKEN=ollama), pull a coding model like qwen3-coder, and run claude. Your proprietary code never leaves the machine, and there's no per-token Anthropic bill — the only cost is the GPU you already own. The honest catch: a local 30B-class model is not Sonnet- or Opus-class on the hardest, longest agentic tasks, and Claude Code wants a large context window (Ollama's docs recommend at least 64K), which costs VRAM.

This guide covers the native endpoint (why no proxy is needed in 2026), the exact env vars, the context-window fix that makes or breaks it, which local model to pick for your hardware, when the older LiteLLM-proxy route still matters, and a straight comparison against cloud Claude.

Can Claude Code run offline on a local model?

Claude Code is Anthropic's terminal coding agent. By default it calls Anthropic's hosted API, which means a network connection and a metered bill. But Claude Code reads two environment variables — ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN — that let you redirect every request to any server that speaks the Anthropic Messages API format. Once Ollama is that server, Claude Code runs entirely on your machine.

That buys you three things that matter for real work:

  • $0/mo — no Anthropic token billing. The model runs on hardware you already paid for.
  • 100% private — your proprietary code, secrets, and repo structure stay on-device. Nothing is uploaded to a third party. This is the reason regulated and IP-sensitive teams do this at all.
  • Offline & unlimited — no rate limits, no usage caps, works on a plane.

The trade-off is real and we cover it in the limits section: the local model doing the thinking is a 14–30B open-weight model, not Anthropic's frontier Claude. For scoped edits, refactors, test generation, and boilerplate on a private codebase, a well-configured local setup is a genuine daily driver. For the gnarliest multi-file, long-horizon tasks, cloud Claude still leads.

How does Claude Code talk to Ollama? (native, no proxy)

This is the part that changed in 2026 and the reason most older tutorials are out of date.

Ollama v0.14.0 (compatibility announced January 16, 2026) added a native Anthropic-compatible endpoint at /v1/messages. Ollama listens on its usual port (11434), accepts requests in Anthropic's exact Messages format, translates internally to whatever the underlying model expects, runs inference, and returns a response in the same Anthropic format. Streaming and tool (function) calling are both supported — which is what makes an agent like Claude Code actually work, not just chat.

The practical consequence: you no longer need a proxy. Before this, the working route was to run LiteLLM as an Anthropic-compatible shim that translated Claude Code's calls into requests against Ollama's own /api/chat endpoint. That still works and is still useful in a few cases, but for a single local machine the native endpoint is simpler and skips an extra hop, so there's one fewer moving part.

You can sanity-check the endpoint with a raw request before involving Claude Code at all:

curl -X POST http://localhost:11434/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: ollama" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen3-coder",
    "max_tokens": 1024,
    "messages": [{ "role": "user", "content": "Say hello in one line." }]
  }'

If that returns a JSON response in Anthropic's shape, Claude Code will work against the same endpoint.
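
To make "Anthropic's shape" concrete, here is a minimal sketch of a check you could pipe the curl output into. It assumes the reply carries the top-level fields documented for Anthropic's Messages API ("type": "message", "role": "assistant", and a "content" array); whether a given Ollama build returns every one of them is an assumption. The helper name looks_like_anthropic_message and the sample body are hypothetical, and the grep patterns are a heuristic, not a JSON parser.

```shell
# Heuristic shape check for an Anthropic-style Messages reply (reads JSON on stdin).
# looks_like_anthropic_message is a hypothetical helper name.
looks_like_anthropic_message() {
  local body
  body=$(cat)
  # All three markers must be present for the check to pass.
  printf '%s' "$body" | grep -Eq '"type"[[:space:]]*:[[:space:]]*"message"' &&
    printf '%s' "$body" | grep -Eq '"role"[[:space:]]*:[[:space:]]*"assistant"' &&
    printf '%s' "$body" | grep -Eq '"content"[[:space:]]*:[[:space:]]*\['
}

# Illustrative sample only; a real reply will differ in id, text, and usage fields.
sample='{"id":"msg_example","type":"message","role":"assistant","content":[{"type":"text","text":"Hello!"}]}'

if printf '%s' "$sample" | looks_like_anthropic_message; then
  echo "shape ok"
else
  echo "unexpected shape"
fi
```

In practice you would replace the sample with the live request, for example by appending `| looks_like_anthropic_message && echo "shape ok"` to the curl command above.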

Step-by-step: point Claude Code at Ollama

You need three pieces: Ollama (v0.14.0 or later — that's what ships the native endpoint), a coding model, and the Claude Code CLI.

1. Install or update Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com
# Confirm you're on 0.14.0+ — the native Anthropic endpoint depends on it:
ollama --version

2. Pull a coding model (pick based on your VRAM — see the model section):

# Strong local agentic pick (~24GB VRAM):
ollama pull qwen3-coder:30b

# Lighter fallback for smaller cards:
ollama pull qwen2.5-coder:14b

3. Point Claude Code at Ollama with two env vars

export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama   # required but ignored by Ollama
export ANTHROPIC_API_KEY=""          # clear any real key so it doesn't override

Note the base URL is the bare host — http://localhost:11434 with no /v1 appended. Claude Code adds the /v1/messages path itself. Put these exports in your ~/.zshrc / ~/.bashrc (or a project .envrc) so a new shell picks them up automatically.
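
Since most setup failures come down to these three variables, a small preflight function can catch them before you launch claude. The sketch below only inspects the environment; check_claude_env is a hypothetical helper name, and the rules it enforces are the ones stated above (base URL set and without /v1, token set, real API key cleared).

```shell
# check_claude_env is a hypothetical helper, not part of Claude Code or Ollama.
check_claude_env() {
  local problems=0
  case "${ANTHROPIC_BASE_URL:-}" in
    "")         echo "ANTHROPIC_BASE_URL is not set"; problems=1 ;;
    */v1|*/v1/) echo "ANTHROPIC_BASE_URL should not end in /v1"; problems=1 ;;
  esac
  if [ -z "${ANTHROPIC_AUTH_TOKEN:-}" ]; then
    echo "ANTHROPIC_AUTH_TOKEN is not set"; problems=1
  fi
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is set and may override the local token"; problems=1
  fi
  if [ "$problems" -eq 0 ]; then echo "environment looks ready"; fi
  return "$problems"
}

# Values are passed inline here only so the example runs standalone.
ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_API_KEY="" check_claude_env
```

With the exports already in your shell profile, calling check_claude_env on its own is enough; a non-zero exit status means at least one variable needs attention.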

4. Run Claude Code against the local model

claude --model qwen3-coder:30b

That's the whole setup. Newer Ollama builds also ship a shortcut, ollama launch claude, that wires the environment for you — handy, but knowing the two variables above is what lets you debug it when something's off.

Set a big enough context window

This is the single most common reason a local Claude Code setup feels "broken." Ollama defaults a model's context window (num_ctx) to a small value regardless of what the model itself supports, and an agent like Claude Code — system prompt + file contents + tool-call history — blows past that within a few turns, after which it truncates, loops, or "forgets" the task.

Ollama's own Claude Code guidance is explicit: Claude Code needs a large context window — at least 64K tokens (the broader coding recommendation floor is 32K, but agentic Claude Code wants more headroom). The most durable fix is to bake the context size into a custom Modelfile, which takes precedence over the baked-in default:

# Contents of a file named Modelfile (no extension)
FROM qwen3-coder:30b
PARAMETER num_ctx 65536

# Then, in your shell: build a tag you can pass to claude --model
ollama create qwen3-coder-cc-64k -f ./Modelfile
claude --model qwen3-coder-cc-64k
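
If you switch between machines or models, it can help to generate that Modelfile instead of editing it by hand. The function below is a small sketch that prints the same two directives for any base tag and context size; write_ctx_modelfile is a hypothetical name, and the 14B/32K pairing in the example is only an illustration.

```shell
# write_ctx_modelfile is a hypothetical helper: it prints a two-line Modelfile
# for the given base tag ($1) and context size in tokens ($2).
write_ctx_modelfile() {
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$1" "$2"
}

# Print a variant for a smaller card; redirect to a file to use it with ollama create.
write_ctx_modelfile qwen2.5-coder:14b 32768
```

Redirecting the output (for example `> Modelfile.32k`) gives you a file you can pass to `ollama create` with `-f`, exactly as in the step above.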

Why you can't just max it out: the KV cache grows linearly with context length, so doubling num_ctx roughly doubles the KV-cache memory that sits on top of the model weights. On a 24GB card, 64K is comfortable for a 30B MoE at Q4; pushing toward the model's full native context (Qwen3-Coder is 256K, extendable to ~1M with YaRN) needs offloading or a bigger card. Size the context to your task and your VRAM, not to the model's advertised maximum. For a quick experiment instead of a Modelfile, you can launch the server with OLLAMA_CONTEXT_LENGTH=65536 ollama serve.
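
A back-of-the-envelope calculation shows how context length turns into memory. The sketch below uses the standard estimate of 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The default architecture numbers (48 layers, 4 KV heads, head dimension 128, 2-byte fp16 cache) are assumptions chosen for illustration, not figures from this article; read the real values from your model's config before relying on the result. kv_cache_bytes is a hypothetical helper name.

```shell
# Rough KV-cache size in bytes for a given context length.
# Defaults are illustrative assumptions, not measured values for any model.
kv_cache_bytes() {
  local tokens=$1
  local layers=${2:-48} kv_heads=${3:-4} head_dim=${4:-128} bytes_per_elem=${5:-2}
  # 2x covers keys and values, stored per layer and per KV head.
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens ))
}

for ctx in 32768 65536 131072; do
  bytes=$(kv_cache_bytes "$ctx")
  echo "num_ctx=$ctx -> $(( bytes / 1048576 )) MiB of KV cache"
done
```

Under those assumptions the loop prints 3072, 6144, and 12288 MiB, which is the linear growth described above: each doubling of num_ctx doubles the cache, before counting weights and runtime overhead.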

Which local model should you use?

Agentic coding is harder than autocomplete: the model has to follow tool-calling instructions reliably across many turns. These are the open-weight models worth running with offline Claude Code in 2026.

Model | Params (active) | Q4 download | Native context | Notes | Best for
Qwen3-Coder 30B A3B | 30.5B (3.3B active, MoE) | ~19 GB | 256K (→~1M w/ YaRN) | ~51.6% SWE-bench Verified (OpenHands harness); fast for its size; Ollama's headline local coding pick | 24GB cards, big-repo context
gpt-oss 20B | 20B | ~13 GB | 128K | OpenAI's open-weight model; solid general + tool use; Ollama-recommended local option | 16–24GB cards
Qwen2.5-Coder 14B | 14B (dense) | ~9 GB | 128K | Older but reliable for tool-calling at this size | 12–16GB cards
Qwen2.5-Coder 7B | 7B (dense) | ~4.7 GB | 128K | Lightweight fallback; scoped edits only | 8GB cards

Download sizes are the Q4-class figures Ollama lists for each tag; actual VRAM at load is higher once the KV cache and runtime overhead are added, and grows with the context window you set. SWE-bench figures depend heavily on the agent harness and turn budget — treat them as directional, not absolute.

Qwen3-Coder 30B A3B is the one to reach for first on a 24GB card: it's a Mixture-of-Experts model (30.5B total, only ~3.3B active per token) so it's noticeably faster than a dense 30B, it has a giant native context, and Ollama specifically calls it out as a recommended local coding model. gpt-oss 20B is a strong second if you want a slightly lighter footprint. Below that, Qwen2.5-Coder 14B / 7B keep things working on smaller GPUs at the cost of agentic reliability — they'll handle scoped edits but stumble on long, multi-file tasks.

If you want the broader field, see our tested best local AI models for programming ranking, the focused best 14B coding models breakdown for mid-tier hardware, and the dedicated best local AI coding models page.

What hardware do you need?

The model weights are only half the VRAM story — context (KV cache) is the other half, and Claude Code pushes context hard. Rough guidance:

GPU / Unified RAM | Realistic model | Context you can run
8 GB | Qwen2.5-Coder 7B (Q4) | ~16–32K
12–16 GB | Qwen2.5-Coder 14B (Q4) | ~32K
24 GB (RTX 3090/4090) | Qwen3-Coder 30B / gpt-oss 20B (Q4) | ~64K comfortably
32 GB+ unified (Apple Silicon) | Qwen3-Coder 30B | 64–128K

CPU-only inference works but is slow enough that agent loops become tedious. An Apple Silicon Mac with 32GB+ unified memory or a 24GB NVIDIA card is the practical sweet spot. For a full memory map of every model and quant, see our Ollama RAM/VRAM table.
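
The rough guidance above can be encoded as a lookup, which is handy in a setup script shared across machines. This is a sketch only: suggest_local_setup is a hypothetical helper, the thresholds mirror the table rather than any measurement, the conservative end of each context range is used, and the qwen2.5-coder:7b tag is assumed to follow the same naming as the 14b tag used earlier.

```shell
# suggest_local_setup is a hypothetical helper: given memory in GB,
# print "<ollama tag> <num_ctx>" following the rough table above.
suggest_local_setup() {
  local mem_gb=$1
  if   [ "$mem_gb" -ge 24 ]; then echo "qwen3-coder:30b 65536"
  elif [ "$mem_gb" -ge 12 ]; then echo "qwen2.5-coder:14b 32768"
  elif [ "$mem_gb" -ge 8 ];  then echo "qwen2.5-coder:7b 16384"
  else echo "none 0"; return 1   # below the smallest tier in the table
  fi
}

for gb in 8 16 24; do
  echo "$gb GB -> $(suggest_local_setup "$gb")"
done
```

The two fields it prints are the base tag and num_ctx value you would put in a Modelfile; treat them as a starting point and adjust once you see actual memory use at load.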

Is the LiteLLM proxy still needed?

Before Ollama's native endpoint shipped, the standard way to run Claude Code on a local model was a LiteLLM proxy: LiteLLM presented an Anthropic-compatible /v1/messages endpoint and translated each call into a request against Ollama's own /api/chat API. You'd point ANTHROPIC_BASE_URL at LiteLLM (commonly http://localhost:4000) instead of at Ollama directly.

With Ollama v0.14.0+, you don't need LiteLLM for a basic local setup — the native endpoint does the translation in-process. But the proxy route is still worth knowing for a few situations:

  • Older Ollama you can't upgrade — LiteLLM bridges the gap.
  • Non-Ollama backends — LiteLLM fronts vLLM, llama.cpp, LM Studio, or a mix behind one Anthropic-compatible URL.
  • Routing, fallbacks, logging, and rate-limit policy across several models or machines — LiteLLM is built for that; Ollama's endpoint is deliberately minimal.

For one machine running Ollama, prefer the native endpoint. Reach for LiteLLM when you outgrow a single backend or need its routing features.

How does offline Claude Code compare to cloud Claude?

Being honest here matters more than cheerleading. The tool (Claude Code) is the same either way; what changes is the model behind it.

Where offline Claude Code wins

  • Cost: $0/mo vs. Anthropic token billing that adds up fast on big agentic sessions.
  • Privacy: proprietary code never leaves the machine — the whole reason to do this.
  • No limits / offline: unlimited runs, no usage caps, works without a connection.

Where cloud Claude still wins

  • Raw capability: Anthropic's frontier Claude (Sonnet/Opus-class) leads on the hardest multi-file, long-horizon agent tasks. A local 14–30B model is strong but not in that weight class on the toughest problems.
  • Context with no VRAM math: cloud gives you 200K+ context with no local memory budgeting; locally, every extra token of context costs you GPU memory.
  • Zero setup: no Ollama version checks, no Modelfiles, no num_ctx tuning, no quant tradeoffs.

The pragmatic pattern most developers land on: local Claude Code for the bulk of day-to-day, private, scoped work; cloud Claude for the occasional hard task where the extra capability is worth paying for. If you'd rather run an agent inside VS Code than in the terminal, Cline + Ollama is the closest equivalent. To give any local agent real tools — files, web, databases — wire in Ollama MCP integration, and if you're new to that protocol start with MCP servers explained. For everything Ollama itself can do, the complete Ollama guide is the hub.

Key Takeaways

  1. Claude Code runs fully offline on Ollama — free, private, no rate limits — using Ollama v0.14.0+'s native Anthropic Messages endpoint at /v1/messages (compatibility shipped January 16, 2026). No proxy required.
  2. Two env vars do it: ANTHROPIC_BASE_URL=http://localhost:11434 (no /v1) and ANTHROPIC_AUTH_TOKEN=ollama, then claude --model qwen3-coder:30b. Clear any real ANTHROPIC_API_KEY so it doesn't override.
  3. Context window is the trap. Claude Code wants ≥64K tokens; Ollama defaults low. Bake num_ctx into a custom Modelfile — the single highest-impact fix.
  4. Model pick: Qwen3-Coder 30B A3B (MoE, fast, 256K context) is the best 24GB choice; gpt-oss 20B and Qwen2.5-Coder 14B/7B step down for smaller cards. Context costs VRAM, so size it to the task.
  5. LiteLLM is no longer required for a single Ollama machine, but it's still the right tool for non-Ollama backends, multi-model routing, and fallbacks.
  6. Local for private, scoped, unlimited work; cloud Claude still leads on the hardest tasks. Use both deliberately.

Next Steps

External references: Ollama Anthropic API compatibility docs · Ollama + Claude Code blog post.

Published: June 21, 2026 · Last Updated: June 21, 2026 · Manually Reviewed
