Cline + Ollama Setup (2026): Free Local AI Coding Agent in VS Code
Want to go deeper than this article?
Free account unlocks the first chapter of all 20 courses — RAG, agents, MCP, voice AI, MLOps, real GitHub repos.
Ollama’s running. Here’s what to build with it. Go from “ollama run” to RAG apps, agents, and fine-tuned models — structured and hands-on. First chapter free.
Yes — Cline, the open-source autonomous coding agent for VS Code (63k+ GitHub stars as of mid-2026), runs fully offline on local Ollama models. Two strong local picks in 2026 are Qwen3-Coder 30B A3B (released July 31, 2025; ~19GB download at Q4_K_M on Ollama, 256K native context) and Devstral Small 2 24B (released Dec 9, 2025; ~15GB download at Q4_K_M, which Mistral reports at 68.0% on SWE-bench Verified). The single most important step almost everyone misses: Ollama defaults a model's context window (num_ctx) to roughly 2K–4K tokens, and an autonomous agent like Cline blows past that within a few tool calls — after which it silently loops or fails. Set the context to at least 32K (ideally 64K) and Cline goes from "broken" to genuinely useful.
This guide walks through installing Cline, pointing it at your local Ollama server, fixing the context trap with a custom Modelfile, choosing a model that fits your VRAM, and an honest look at where local agents still lose to cloud frontier models.
What is Cline and does it work with Ollama?
Cline is a free, open-source VS Code extension that turns the editor into an agentic coding assistant: it reads your files, plans multi-step changes, runs terminal commands, and edits code across your repo with your approval on each step. It is one of the most-starred coding agents on GitHub (63k+ stars as of mid-2026) and, unlike many agents, it is genuinely provider-agnostic — Anthropic, OpenAI, OpenRouter, and local models via Ollama or LM Studio.
Running it on Ollama means three things that matter:
- $0 in subscriptions — no per-request token billing, no monthly seat.
- 100% private — your proprietary code never leaves the machine.
- No rate limits — hammer it during a refactor session; the only ceiling is your GPU.
The trade-off is real and we cover it in the limits section: a 24–30B local model is not Claude or GPT-class on the hardest agentic tasks. But for scoped edits, boilerplate, test generation, and refactors on a private codebase, a well-configured local Cline is a legitimate daily driver.
Reading articles is good. Building is better.
Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.
How do I install Cline in VS Code?
You need two pieces: Ollama (the local model server) and the Cline extension.
1. Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com
2. Pull a coding model (pick based on your VRAM — see the model section):
# Best agentic pick if you have ~24GB VRAM:
ollama pull devstral-small-2:24b
# Or the MoE option (fast, big context):
ollama pull qwen3-coder:30b
3. Install the Cline extension
Open VS Code → Extensions (⇧⌘X / Ctrl+Shift+X) → search "Cline" → Install. The Cline icon appears in the Activity Bar on the left.
That's the whole install. The part that determines whether it works well is the configuration below.
How do I configure the Ollama provider in Cline?
- Click the Cline icon in the Activity Bar to open the panel.
- Click the settings gear (top-right of the Cline panel).
- Set API Provider to Ollama.
- Set Base URL to
http://localhost:11434(Cline usually detects a running Ollama automatically). - Select your model from the dropdown (e.g.
devstral-small-2:24b). If it doesn't appear, confirm the model is pulled withollama list.
That connects Cline to your local server. Now test it: open a project, type a small task like "add a docstring to the top function in this file" in the Cline chat, and approve the steps. If it stalls after one or two actions, you've hit the context trap below — that is the #1 reason "Cline + Ollama doesn't work" reports happen.
Why does Cline keep failing? (the num_ctx trap)
This is the section that fixes most broken local-Cline setups. Ollama ships models with a small default context window — historically 2,048 tokens, and 4,096 on more recent builds — regardless of what the model itself supports. Cline's system prompt, file contents, and tool-call history fill that window almost immediately, after which the agent silently truncates, loops, or "forgets" what it was doing. The official Ollama + Cline integration docs recommend at least 32K tokens for coding work; in practice many users running heavier agentic sessions push that to 64K for more reliable tool-calling.
The most reliable fix is to bake the context size into a custom Modelfile — that value takes precedence over environment variables and the model's baked-in default:
# Save as Modelfile (no extension)
FROM devstral-small-2:24b
PARAMETER num_ctx 65536
# Build a new tag Cline can select
ollama create devstral-cline-64k -f ./Modelfile
Then pick devstral-cline-64k in the Cline model dropdown. (There's even a community tag built exactly for this, sammcj/devstral-small-24b-2505-ud:cline-128k-q6_k_xl, which ships with a 128K context preset.)
The catch — and why you can't just crank it to 256K: the KV cache grows with context length, so roughly doubling num_ctx roughly doubles the KV-cache VRAM on top of the model weights. On a 24GB card, 64K context is comfortable for a 24B Q4 model; 128K starts to bite; full 256K needs offloading or a bigger card. Set the context as large as your task needs and your VRAM allows — not the maximum the model advertises.
Alternatively, for a quick test you can launch the server with OLLAMA_CONTEXT_LENGTH=65536 ollama serve, but the Modelfile approach is more durable.
Reading articles is good. Building is better.
Free account = 20+ free chapters across 20 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.
Which local model should I run with Cline?
Agentic coding is harder than autocomplete — the model has to follow tool-calling instructions reliably across many turns. In 2026, two open-weight models stand out for local Cline, both Apache 2.0 licensed:
| Model | Params (active) | Q4_K_M download | Native context | Notable | Best for |
|---|---|---|---|---|---|
| Devstral Small 2 24B | 24B (dense) | ~15 GB | 256K | Mistral reports 68.0% SWE-bench Verified; purpose-built for code agents | Best agentic reliability on ~24GB |
| Qwen3-Coder 30B A3B | 30.5B (3.3B active, MoE) | ~19 GB | 256K (→1M w/ YaRN) | Fast for its size (MoE); huge context | Big-repo context, faster tokens/s |
| Qwen2.5-Coder 14B | 14B (dense) | ~9 GB | 32K | Older but solid | 12–16GB cards |
| Qwen2.5-Coder 7B | 7B (dense) | ~4.7 GB | 32K | Lightweight fallback | 8GB cards, scoped edits |
Download sizes are the Q4_K_M figures Ollama lists for each tag; actual VRAM use at load is higher once the KV cache and runtime overhead are added, and grows with the context window you set.
Devstral Small 2 24B is the one I reach for first: Mistral and All Hands AI built the Devstral line specifically for agentic software engineering. Mistral reports the 24B Small 2 at 68.0% on SWE-bench Verified (the larger 123B Devstral 2 hits 72.2%). The first-generation Devstral Small (released May 2025) scored 46.8% on the same benchmark, so the year-over-year jump in the small model is large.
Qwen3-Coder 30B A3B is a Mixture-of-Experts model — 30.5B total but only ~3.3B parameters active per token — which makes it noticeably faster than a dense 30B and gives it a giant native context (256K). Pick it when you're feeding large multi-file context to the agent.
First-hand note: On an RTX 3090 (24GB), I measured roughly 18–22 tokens/sec running
devstral-small-2:24bat Q4_K_M withnum_ctxset to 64K — usable for interactive agent loops, if not instant. The Qwen3-Coder 30B MoE felt snappier in short bursts (the active-param count helps), but its KV cache at large context filled the card faster. Treat these as approximate, single-machine figures — your CPU, RAM speed, and quant level will shift them.
If you want the broader field, see our best local AI models for programming ranking and the dedicated best 14B coding models breakdown for mid-tier hardware.
What hardware do I actually need?
The model weights are only half the VRAM story — context (KV cache) is the other half, and Cline pushes context hard. Rough guidance:
| GPU / Unified RAM | Realistic Cline model | Context you can run |
|---|---|---|
| 8 GB | Qwen2.5-Coder 7B (Q4) | ~16–32K |
| 12–16 GB | Qwen2.5-Coder 14B (Q4) | ~32K |
| 24 GB (RTX 3090/4090) | Devstral Small 2 24B / Qwen3-Coder 30B (Q4) | ~64K comfortably |
| 32 GB+ unified (Apple Silicon) | Either 24–30B model | 64–128K |
CPU-only inference works but is slow enough that agent loops become tedious; an Apple Silicon Mac with 32GB+ unified memory or a 24GB NVIDIA card is the practical sweet spot. For a full memory map of every model and quant, see our Ollama RAM/VRAM table.
How does local Cline compare to cloud?
Being honest here matters more than cheerleading:
Where local Cline wins
- Cost: $0 ongoing vs. cloud agent token bills that can run dollars per task on a big refactor.
- Privacy: code never leaves your machine — the reason regulated and proprietary teams use it at all.
- No limits / offline: unlimited runs, works on a plane.
Where cloud still wins
- Raw capability: frontier cloud models lead on the hardest multi-file, long-horizon agent tasks. A 24B local model is strong but not Claude/GPT-class on the toughest SWE-bench problems.
- Context ceiling: cloud models hand you 200K+ context with no VRAM math; locally, every extra token of context costs you GPU memory.
- Zero setup: no Modelfiles, no
num_ctxtuning, no quant tradeoffs.
The pragmatic pattern many developers land on: local Cline for the bulk of day-to-day, private, scoped work; cloud for the occasional gnarly task where you'll pay for the extra capability. If you primarily want inline autocomplete rather than a full agent, Continue.dev + Ollama is the lighter-weight companion. And to give any local agent superpowers — file system, web, database tools — wire in Ollama MCP integration.
Key Takeaways
- Cline runs fully local on Ollama — free, private, no rate limits — and it's one of the most popular VS Code coding agents (63k+ stars, mid-2026).
- The
num_ctxdefault is the trap. Ollama defaults to a small context (2K on older builds, 4K on newer ones); agents need 32K+ (ideally 64K). Bake it into a custom Modelfile — that's the single highest-impact fix. - Devstral Small 2 24B (Mistral-reported 68.0% SWE-bench Verified, ~15GB Q4 download) is the best agentic reliability pick on a 24GB card; Qwen3-Coder 30B A3B (MoE, 256K context, ~19GB Q4 download) is faster and better for big-context work.
- Context costs VRAM. Larger
num_ctxroughly scales KV-cache memory linearly — size it to the task, not the model's max. - Local is for private, scoped, unlimited work; cloud still leads on the hardest tasks. Use both deliberately.
Next Steps
- Prefer lightweight inline completion over a full agent? Set up Continue.dev with Ollama — it pairs well with the same local models.
- Choosing a model? Read our tested best local AI models for programming ranking, and the focused best 14B coding models guide for 12–16GB GPUs.
- Want your agent to touch files, browsers, and databases safely? Add Ollama MCP integration to extend Cline's tool reach.
External references: Cline on GitHub · Qwen3-Coder model card.
Ollama’s running. Here’s what to build with it.
Go from “ollama run” to RAG apps, agents, and fine-tuned models — structured and hands-on. First chapter free.
Liked this? 20 full AI courses are waiting.
From fundamentals to RAG, agents, MCP servers, voice AI, and production deployment with real GitHub repos. First chapter free, every course.
Build Real AI on Your Machine
RAG, agents, NLP, vision, and MLOps - chapters across 20 courses that take you from reading about AI to building AI.
Want structured AI education?
20 courses, 495+ chapters, from $9. Understand AI, don't just use it.
Continue Your Local AI Journey
- PILLARBest Local AI for Coding 2026: 10 Models Tested & Ranked
- 7B vs 14B vs 32B vs 70B for Coding (2026): What Size?
- AI Context Windows: 4K vs 128K vs 1M vs 10M Tokens (2026)
- AI vs Coding for Kids: Which Should Children Learn First?
- Best 14B Coding Models (2026): Ranked by HumanEval + VRAM
- Best AI Coding Models 2026: Top 12 Ranked on SWE-Bench
- Best AI for JavaScript & TypeScript 2026: 10 Models Ranked
- Best AI Models for Python Development 2026: Top 10 Ranked
- Best Claude Model for Coding (2026): Opus 4.8 vs Sonnet 4.6 vs Haiku
- Best Local AI Coding Models 2026: Qwen3-Coder, DeepSeek & Llama, Ranked
Comments (0)
No comments yet. Be the first to share your thoughts!