Hardware for Local AI Agents (2026): RAM, GPU & VRAM
The best hardware for running local AI agents in 2026 is a 16GB GPU such as the RTX 5060 Ti 16GB (about $429), paired with 32GB of system RAM; a used RTX 3060 12GB is the budget fallback. That setup comfortably runs an 8B-to-14B agent model with room for a long tool-call chain and a local vector database. Step up to a 24GB card (the RTX 3090, ~936 GB/s of bandwidth) when you want a 27-32B model handling RAG, and to 64GB of system RAM once you run multi-agent "crews" with several models or processes in parallel. On Mac, unified memory changes the math: an Apple Silicon machine with 64GB or more of unified memory can hold agent models that will not fit a single consumer NVIDIA card.
Agents are harder on hardware than a single chat turn, and the reason is structural: an agent does not answer once and stop. It plans, calls a tool, reads the result, plans again, and loops — often a dozen times to finish one task. Every loop re-feeds the growing transcript (the system prompt, the tool definitions, every prior step) back through the model, so your context keeps growing and your KV cache keeps eating VRAM. Size for the loop, not the first reply.
Why agents need more hardware than chat
A normal chat exchange is one prompt in, one answer out. An agentic workload is a loop: the model reads its instructions and tool list, decides on an action, the runtime executes a tool (search, code run, API call), the result is appended to the conversation, and the model is invoked again on the now-longer transcript. Frameworks like LangChain, CrewAI and the Qwen team's own Qwen-Agent all work this way (the Qwen3 GitHub repo documents the model family and its agentic tool-calling design).
That has three hardware consequences:
- Context grows every step. Tool definitions alone can be 1-3K tokens; each tool result adds more. A 10-step task can balloon a 2K starting prompt to 20K+ tokens of live context — and context lives in the KV cache, which is pure VRAM.
- RAG adds a second memory load. Most useful agents retrieve from documents. That means a vector database (Chroma, Qdrant, FAISS) and an embedding model running alongside your main model, plus chunks of retrieved text injected into every prompt.
- Multi-agent means concurrency. A "crew" of a researcher, a writer and a reviewer may run several model calls in parallel, or load more than one model. That multiplies both VRAM (for weights and caches) and system RAM (for the orchestration, the vector store and the document cache).
So the right question is not "what model fits?" but "what model, plus its full context window, plus retrieval, plus however many of these run at once, fits?"
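To make the growth concrete, here is a minimal sketch of the loop's token accounting. The function name and the per-step token counts (`action_tokens`, `tool_result_tokens`) are illustrative assumptions, not measurements; real tool results vary widely in size.

```python
def context_per_step(system_tokens, tool_def_tokens, steps,
                     action_tokens=200, tool_result_tokens=1800):
    """Return the live context size (in tokens) the model sees at each step.

    Every step re-sends the fixed prefix (system prompt + tool definitions)
    plus everything appended by earlier steps.
    """
    fixed_prefix = system_tokens + tool_def_tokens
    transcript = 0  # tokens appended by earlier steps
    sizes = []
    for _ in range(steps):
        sizes.append(fixed_prefix + transcript)
        # The model's action and the tool's result both join the transcript.
        transcript += action_tokens + tool_result_tokens
    return sizes


# ASSUMED workload: 500-token system prompt, 1.5K tokens of tool definitions,
# and about 2K tokens added per step.
sizes = context_per_step(system_tokens=500, tool_def_tokens=1500, steps=10)
for step, tokens in enumerate(sizes, start=1):
    print(f"step {step:2d}: {tokens:6d} tokens in context")
print(f"tokens fed to the model across the whole task: {sum(sizes)}")
```

Under these assumptions the context at step 10 is ten times the starting prompt. Inference servers that reuse a cached prefix can avoid recomputing the earlier tokens, but the cache for them still has to stay in memory.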
How much VRAM and RAM per tier?
Here is the practical sizing table. VRAM figures assume the Q4_K_M GGUF quant most people run, with enough KV cache headroom for an agent's growing context (not just a one-line prompt). System RAM is what you want on the motherboard for the OS, the orchestration framework, a vector DB and document caching — separate from GPU VRAM.
| Tier | GPU / VRAM | System RAM | Agent workload it handles | Example model |
|---|---|---|---|---|
| Entry | 8GB (RTX 3060 Ti / 4060) | 16GB | Single agent, short tool chains, light RAG | Qwen3 8B (Q4, ~5 GB) |
| Serious | 16GB (RTX 5060 Ti 16GB / 4060 Ti 16GB) | 32GB | Single agent, long chains, real RAG, 32K+ context | Qwen3 14B (Q4) |
| Heavy / RAG | 24GB (RTX 3090 / 4090) | 32-64GB | 27-32B model + RAG, big context, light concurrency | Qwen3 32B (Q4, ~20 GB) |
| Multi-agent crew | 24GB+ (or 2× GPU) | 64GB+ | Parallel agents, multiple models, vector DB in RAM | Qwen3 30B-A3B (MoE) |
| Apple Silicon | Unified (use RAM column) | 32-128GB unified | Same workloads as the discrete-GPU tier with comparable memory; generation is bandwidth-limited | Qwen3 14B-32B |
A few notes that the round numbers hide:
- 8GB is the floor, not a comfort zone. Qwen3 8B fits in ~5 GB at Q4_K_M, but once an agent's context climbs past ~8-16K tokens the KV cache eats the rest of an 8GB card. It works for short, simple agents — keep RAG light.
- 16GB is the real sweet spot for serious local agents. It holds a 14B model and leaves several GB for a long, growing context plus an embedding model. The RTX 5060 Ti 16GB launched April 16, 2025 at $429 with GDDR7 and 448 GB/s of bandwidth; a used RTX 3060 12GB is the budget alternative at ~360 GB/s, with less headroom for context.
- 24GB unlocks 32B + RAG together. A 32B model at Q4 is roughly 20 GB, leaving only a little headroom — fine for one agent with RAG, tight for concurrency. The RTX 3090 (24GB GDDR6X, ~936 GB/s) remains the value king here; see our deep dive on the RTX 3090 for local AI.
- System RAM is the multi-agent lever. Your vector database, the orchestration framework, and cached documents all live in system RAM, not VRAM. Multi-agent crews are where 32GB starts to feel tight and 64GB pays off — our full RAM requirements for local AI guide breaks down why.
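For the system RAM side, a rough sketch of how an in-memory vector store scales may help. The function and its defaults are hypothetical: `index_overhead=1.5` is a placeholder multiplier for index structures, not a measured property of Chroma, Qdrant or FAISS, and the chunk size is a guess.

```python
def vector_store_ram_gb(num_chunks, embedding_dim, bytes_per_value=4,
                        index_overhead=1.5, avg_chunk_chars=1000):
    """Rough RAM estimate (GB) for an in-memory vector store.

    vectors: one float32 embedding per chunk (4 bytes per dimension)
    index_overhead: ASSUMED multiplier for index structures on top of raw vectors
    text: the chunk text kept alongside, at about 1 byte per character
    """
    vectors = num_chunks * embedding_dim * bytes_per_value
    text = num_chunks * avg_chunk_chars
    return (vectors * index_overhead + text) / 1e9


# ASSUMED corpora: 100K and 1M chunks with 768-dimensional embeddings.
small_corpus_gb = vector_store_ram_gb(100_000, 768)
large_corpus_gb = vector_store_ram_gb(1_000_000, 768)
print(f"100K chunks: ~{small_corpus_gb:.2f} GB")
print(f"1M chunks:   ~{large_corpus_gb:.2f} GB")
```

A small document set barely registers, but a large corpus held in memory next to the OS, the framework and several agent processes is where the jump from 32GB to 64GB starts to matter.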
For the underlying GPU-by-GPU memory math behind every figure above, see our VRAM requirements guide, and for the broader build picture our complete AI hardware requirements guide.
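As a quick sanity check on the weight sizes in the table, the sketch below estimates a quantized model's footprint from its parameter count. The 4.85 bits-per-weight average for Q4_K_M and the parameter counts are assumptions for illustration; actual GGUF files differ by a few percent depending on the architecture.

```python
def quantized_weights_gb(params_billion, bits_per_weight=4.85):
    """Approximate size in GB of a model's weights after quantization.

    bits_per_weight=4.85 is an ASSUMED average for Q4_K_M, which mixes
    4-bit and higher-precision tensors.
    """
    return params_billion * bits_per_weight / 8


# ASSUMED parameter counts (in billions) for the example models.
models = {"Qwen3 8B": 8.2, "Qwen3 14B": 14.8, "Qwen3 32B": 32.8}
estimates = {name: quantized_weights_gb(p) for name, p in models.items()}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB at Q4_K_M")
```

The results land close to the ~5 GB and ~20 GB figures quoted above. Whatever is left on the card after the weights is the budget for KV cache and runtime overhead.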
The context-length VRAM math agents actually hit
This is the part most agent guides skip, and it is the one that bites. Your model weights are fixed, but the KV cache grows with context — and agents generate a lot of context. The rough rule of thumb:
VRAM ≈ (model weights at your quant) + (KV cache for your live context).
The KV cache scales with the number of tokens currently in the window. As a practical guide, budget roughly 1-1.5 GB of extra VRAM for every 16K tokens of context on an 8-14B model (it varies with the model's attention config and the cache quant). That is why an agent that idles at 6 GB can spike past 10 GB on step 12 of a long task: the weights never moved, but the transcript tripled.
| Live context | Approx KV-cache add (8-14B, Q4 model) | What it means for an agent |
|---|---|---|
| 4K tokens | ~0.3-0.6 GB | A single short tool call |
| 16K tokens | ~1-1.5 GB | A few RAG chunks + a handful of steps |
| 32K tokens | ~2-3 GB | A long multi-step task with retrieval |
| 64K+ tokens | ~4 GB+ | Deep agent loops, big document context |
Two ways to keep this under control: cap your context window to what the task actually needs (you rarely need the model's full 128K just because it supports it), and use KV-cache quantization (for example a q8_0 or q4_0 cache type in llama.cpp or Ollama). A q8_0 cache roughly halves the footprint relative to the default f16 cache, and q4_0 shrinks it further at some cost in output quality. Plug your exact model, quant and context length into our VRAM calculator before you buy a card, because the back-of-envelope numbers above gloss over per-model differences.
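For readers who want to see where per-model differences come from, here is a sketch of the standard KV-cache size calculation for a transformer with grouped-query attention. The layer count, KV-head count and head dimension below are assumed values for an 8B-class model; read the real ones from your model's config before trusting the output.

```python
# Bytes per cached value. f16 is 2 bytes; the ggml q8_0 and q4_0 block formats
# store 32 values in 34 and 18 bytes respectively.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(context_tokens, n_layers, n_kv_heads, head_dim,
                 bytes_per_value=2.0):
    """KV-cache size in GiB for one sequence of `context_tokens` tokens."""
    # Factor of 2: one key vector and one value vector per layer per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 2**30


# ASSUMED shape for an 8B-class grouped-query-attention model.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 36, 8, 128

for ctx in (4096, 16384, 32768, 65536):
    row = ", ".join(
        f"{name}: {kv_cache_gib(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, b):.2f} GiB"
        for name, b in BYTES_PER_VALUE.items()
    )
    print(f"{ctx:6d} tokens -> {row}")
```

With these assumed values, an f16 cache at 16K tokens comes to 2.25 GiB and a q8_0 cache to about 1.2 GiB, so the table's figures sit closer to the quantized case for a model of this shape. That gap is a good reason to run the calculation for your own model rather than rely on round numbers.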
Concurrency math for multi-agent crews
Multi-agent frameworks (CrewAI, AutoGen-style setups, LangGraph) run several agents that may execute at once. The naive assumption is that three agents need three times the hardware. The reality depends on whether they share a model:
- Shared model, sequential turns: the cheapest case. One model loaded once; agents take turns. VRAM cost is one model + the largest single context. This runs on a 16GB card for 8-14B models.
- Shared model, parallel requests: the model serves multiple in-flight requests (via an inference server like Ollama or vLLM). Weights load once, but each concurrent request needs its own KV cache. Three parallel 32K contexts can add 6-9 GB of cache on top of the weights — this is where 24GB starts to matter.
- Different models per agent: the most expensive case. A 14B planner + an 8B tool-runner + an embedding model means loading all of them. That can exceed a single 24GB card and pushes you toward a second GPU or an Apple Silicon machine with large unified memory.
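The three cases above reduce to simple sums, sketched below. The function names are hypothetical, and the inputs are assumptions: ~9 GB for a 14B model at Q4, ~5 GB for an 8B, ~0.5 GB for an embedding model, and 2.5 GB per 32K context (the middle of the 2-3 GB range in the table). Runtime overhead is ignored.

```python
def shared_sequential_gb(weights_gb, kv_per_agent_gb):
    """One model, agents take turns: only the largest context is live at once."""
    return weights_gb + max(kv_per_agent_gb)


def shared_parallel_gb(weights_gb, kv_per_agent_gb):
    """One model, concurrent requests: every in-flight context needs its own cache."""
    return weights_gb + sum(kv_per_agent_gb)


def separate_models_gb(agents):
    """Different model per agent: `agents` is a list of (weights_gb, kv_gb) pairs."""
    return sum(weights + kv for weights, kv in agents)


# ASSUMED figures, see the paragraph above.
caches = [2.5, 2.5, 2.5]  # three agents, each at roughly 32K context
sequential = shared_sequential_gb(9.0, caches)
parallel = shared_parallel_gb(9.0, caches)
mixed = separate_models_gb([(9.0, 2.5), (5.0, 2.5), (0.5, 0.0)])

print(f"shared model, sequential turns: {sequential:.1f} GB")
print(f"shared model, parallel requests: {parallel:.1f} GB")
print(f"different models per agent:      {mixed:.1f} GB")
```

The sequential figure assumes the server does not keep idle agents' caches resident; if it does, the cost moves toward the parallel case. Swapping the 14B planner for a 32B model in the last case pushes the total well past a single 24GB card.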
Mixture-of-Experts is the concurrency cheat code. A model like Qwen3-30B-A3B has 30.5B total parameters but activates only ~3.3B per token (128 experts, 8 active). You pay the full ~18-20 GB of VRAM to hold all the experts, but the per-token compute is that of a ~3B model — so it feels fast even under several concurrent agent calls. For a crew on a single 24GB card, an MoE model is often the best balance of capability and throughput. For which specific models hold up under tool-calling pressure, see our guide to the best Ollama models for agents.
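The memory-versus-compute split can be shown with a few lines of arithmetic. This is a simplified sketch: it reuses an assumed 4.85 bits-per-weight average for Q4_K_M and treats the active parameters as the only weights touched per token.

```python
def moe_profile(total_params_b, active_params_b, bits_per_weight=4.85):
    """Return (resident_gb, active_gb, active_fraction) for an MoE model.

    resident_gb: all experts must sit in memory, so it scales with total params
    active_gb:   weights actually read per token, scaling with active params
    bits_per_weight=4.85 is an ASSUMED Q4_K_M average.
    """
    resident_gb = total_params_b * bits_per_weight / 8
    active_gb = active_params_b * bits_per_weight / 8
    return resident_gb, active_gb, active_params_b / total_params_b


# Parameter counts from the paragraph above: 30.5B total, ~3.3B active.
resident_gb, active_gb, fraction = moe_profile(30.5, 3.3)
print(f"resident in VRAM:    ~{resident_gb:.1f} GB")
print(f"read per token:      ~{active_gb:.1f} GB")
print(f"active share:        {fraction:.1%}")
```

The model occupies memory like a 30B model but reads only about a ninth of its weights per token, which is why it stays responsive when several agent calls queue up.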
Apple Silicon: unified memory rewrites the rules
On a Mac, there is no separate VRAM: CPU and GPU share one pool of "unified memory." That removes the 24GB ceiling of most consumer NVIDIA cards (32GB on the RTX 5090): a Mac configured with 64GB or 128GB of unified memory can hold agent models, a vector DB and multiple processes that simply will not fit a single consumer GPU.
The trade-off is bandwidth. Apple's M4 Max tops out at 546 GB/s of memory bandwidth (and up to 128GB of unified memory), while the M4 Pro is about 273 GB/s, per Apple's M4 Pro and M4 Max announcement. That is well below a dedicated GPU like the RTX 3090's ~936 GB/s, so token generation is slower for the same model — but for an agent that spends most of its wall-clock time waiting on tool calls (web requests, code execution, file I/O) rather than raw generation, the bandwidth penalty often matters less than the sheer capacity to keep everything resident. For multi-agent crews where capacity is the bottleneck, a high-memory Mac is a genuinely strong, quiet, low-power option.
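A common back-of-envelope way to compare these bandwidth figures is to treat token generation as memory-bound: each generated token requires reading the active weights once, so bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below applies that simplification to a ~5 GB model (an 8B at Q4); it is an upper bound under that assumption, not a benchmark, and real throughput is lower.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_read_gb):
    """Upper bound on tokens/sec if decoding is purely memory-bandwidth-bound.

    Ignores KV-cache reads, compute limits and software overhead.
    """
    return bandwidth_gb_s / weights_read_gb


# Bandwidth figures quoted in the text above, in GB/s.
BANDWIDTH_GB_S = {"RTX 3090": 936, "M4 Max": 546, "M4 Pro": 273}
WEIGHTS_GB = 5.0  # ASSUMED: roughly an 8B model at Q4

ceilings = {name: decode_ceiling_tok_s(bw, WEIGHTS_GB)
            for name, bw in BANDWIDTH_GB_S.items()}
for name, tok_s in ceilings.items():
    print(f"{name}: at most ~{tok_s:.0f} tokens/sec")
```

The ratios are the useful part: under this model the RTX 3090 has about 1.7 times the generation ceiling of an M4 Max and about 3.4 times that of an M4 Pro for the same weights.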
My own rough numbers running agents locally
For calibration, not as a controlled benchmark: on my RTX 3090 (24GB), an 8B agent model at Q4 idles around 5-6 GB, and I have watched it climb past 9-10 GB on a long, RAG-heavy task as the KV cache filled. The weights never moved; the context did. Generation ran at roughly 40+ tokens/sec for the 8B at Q4, dropping as context grew. A 32B model at Q4 left so little headroom that running it alongside a vector DB and an embedding model felt right at the edge of 24GB: workable for one agent, not for parallel ones. Treat these as approximate, single-machine figures; your model, quant, cache settings and framework will shift them. The pattern is what matters: plan for the peak context, not the idle footprint.
Key Takeaways
- Size for the loop, not the first reply. Agents re-feed a growing transcript every step, so context — and the KV cache — keeps climbing. Plan VRAM for the peak, not the idle.
- 16GB VRAM + 32GB RAM is the serious-agent sweet spot. An RTX 5060 Ti 16GB ($429, GDDR7, 448 GB/s) or a budget RTX 3060 12GB runs an 8-14B agent with room for long chains and real RAG.
- 24GB unlocks 27-32B + RAG together. A 32B Q4 model is ~20 GB; the RTX 3090 (24GB, ~936 GB/s) is still the value pick. Concurrency is tight at this tier.
- System RAM (32-64GB) is the multi-agent lever. Your vector DB, orchestration framework and document cache live in system RAM — crews are where 64GB pays off.
- MoE and Apple Silicon are the scaling shortcuts. Qwen3-30B-A3B (3.3B active of 30.5B) gives big-model quality at small-model speed; a 64-128GB Mac trades bandwidth for capacity that no single consumer GPU can match.
Next Steps
- Pick the model first: our roundup of the best Ollama models for agents ranks the tool-calling models worth loading.
- Get the memory math right: read the VRAM requirements guide and the RAM requirements for local AI before you spend.
- Building the whole rig? The complete AI hardware requirements guide covers CPU, PSU and cooling too.
- Considering the value GPU? Our RTX 3090 for local AI breakdown explains why 24GB at ~936 GB/s is still the price-to-capability king.
- Not sure a model fits your card? Run your exact model, quant and context length through the VRAM calculator first.