
Local AI NPCs for Game Dev: Build Intelligent Characters Without Cloud

April 23, 2026
17 min read
Local AI Master Research Team


The idea of an NPC that actually listens to the player has been pitched for thirty years. Until 2024, it was either a scripted facade or a $0.04-per-conversation OpenAI bill that nobody could ship in a $20 game. Now I have shipped two prototypes — one in Unity, one in Unreal — where every NPC runs on a 4B-parameter local model, responds in under 90 ms, and remembers what the player said three sessions ago. None of it phones home.

This guide is the engineering playbook I use. No "imagine a world" pitches. Real numbers, real prompt templates, the exact JSON contracts I send between the game thread and the inference server, and the parts where I broke things and had to back out.


Quick Start: Talk to a Local NPC in 5 Minutes {#quick-start}

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a small, fast model that runs on a laptop GPU
ollama pull qwen2.5:3b

# Test the chat API
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:3b",
  "messages": [
    {"role": "system", "content": "You are Garrett, a wary blacksmith in a medieval village. Speak in 1-2 short sentences."},
    {"role": "user", "content": "Have you seen the bandits south of town?"}
  ],
  "stream": false
}'

That returns a JSON payload with the NPC's reply in roughly 200 ms on an RTX 3060. Your first AI NPC.


Table of Contents

  1. Why Local AI for NPCs
  2. Latency Budget for Believable Dialogue
  3. Picking the Right Model
  4. The NPC Architecture
  5. Unity Integration
  6. Unreal Engine Integration
  7. Persistent NPC Memory
  8. Prompt Patterns That Actually Work
  9. Shipping: TTS, Lip-Sync, and Distribution
  10. Pitfalls and Fixes
  11. FAQs

Why Local AI for NPCs {#why-local}

Three reasons cloud-backed NPCs never made it into shipped games:

  1. Per-conversation cost. A 50-hour RPG generates thousands of dialogue turns. At GPT-4o pricing that adds $30-$80 of API spend to a $30 game's cost basis.
  2. Latency. Cloud round-trip latency floors at 200 ms even from US-East. Add a model that takes 600-1200 ms to respond and the player thinks the game froze.
  3. Service fragility. A retail game that requires an internet connection for its NPCs dies the day the API is deprecated.

Local AI fixes all three. The cost is the player's GPU, which they already have. Latency is whatever your model and GPU produce — for a 3B model on an RTX 4060, that is 80-110 ms for a typical NPC line. And the game still works in 2034.

For deeper context on running open models locally, our best open-source LLMs post profiles every model class worth shipping.


Latency Budget for Believable Dialogue {#latency-budget}

Conversation latency is perceptual, not absolute. Players accept a beat of silence after they speak; they accept it from humans, too. The thresholds I use:

| Latency | Perception |
|---|---|
| 0-150 ms | Imperceptible. NPC feels witty. |
| 150-400 ms | Natural pause. Acceptable. |
| 400-800 ms | NPC feels slow. Players notice. |
| 800-1500 ms | NPC feels broken. |
| > 1500 ms | Players reload, suspecting a bug. |

Target the 150-400 ms band. With a 3B-class model on a mid-range GPU you have room for:

  • 30-80 ms model warm-up
  • 60-200 ms first-token generation
  • 100-300 ms full short response (under 40 tokens)

Streaming helps. If you start playing the first audio while the rest streams, players perceive the latency as the time-to-first-token, not the full response.
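To make time-to-first-token concrete: with stream set to true, Ollama's chat endpoint returns one JSON object per line, each carrying a fragment in message.content, with a final chunk marked done. A minimal Python sketch (function name and sample chunks are illustrative) that captures the first fragment for immediate playback while accumulating the full line:

```python
import json

def accumulate_stream(ndjson_lines):
    """Accumulate an Ollama /api/chat streaming response (JSON Lines).

    Each chunk carries a fragment in message.content; the final chunk
    has done == true. Returns (first_fragment, full_text) so the caller
    can start TTS/audio on the first fragment immediately.
    """
    first, parts = None, []
    for line in ndjson_lines:
        chunk = json.loads(line)
        fragment = chunk.get("message", {}).get("content", "")
        if fragment:
            if first is None:
                first = fragment
            parts.append(fragment)
        if chunk.get("done"):
            break
    return first, "".join(parts)

# Sample chunks in Ollama's streaming shape (illustrative data):
sample = [
    '{"message":{"role":"assistant","content":"Aye,"},"done":false}',
    '{"message":{"role":"assistant","content":" saw them."},"done":false}',
    '{"message":{"role":"assistant","content":""},"done":true}',
]
first, full = accumulate_stream(sample)
print(first)  # "Aye,"
print(full)   # "Aye, saw them."
```

In a real build, `first` goes straight to the TTS pipeline while the rest of the stream is still arriving.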


Picking the Right Model {#picking-model}

The model decision drives everything. After shipping with eight different models, here is my ranked list for NPC use cases.

| Model | Params | VRAM (Q4) | Tokens/sec (RTX 4060) | Best For |
|---|---|---|---|---|
| qwen2.5:3b | 3B | 2.3 GB | 95 t/s | Default. Strong instruction following. |
| phi-3.5:3.8b | 3.8B | 2.5 GB | 88 t/s | Concise, less prone to long monologues. |
| llama3.2:3b | 3B | 2.0 GB | 110 t/s | Fastest. Slightly less character consistency. |
| gemma2:2b | 2B | 1.6 GB | 120 t/s | Mobile / Steam Deck. |
| qwen2.5:7b | 7B | 4.5 GB | 55 t/s | Premium NPCs (companions, key story characters). |
| mistral-nemo:12b | 12B | 7.5 GB | 32 t/s | Cinematic conversations on a 12GB+ GPU. |

The balance I shipped for a 12 GB VRAM target: qwen2.5:3b for ambient NPCs, with qwen2.5:7b swapped in only for named story characters. That keeps VRAM headroom for the rest of the game and stays inside the latency budget.

Quantization matters here. Q4_K_M is the right starting point. Q3_K_M shaves 30% VRAM with about 5% quality drop — usable for ambient barks but not for named characters. Q5 and above add latency without meaningful quality gain at this size.
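As a sanity check on those VRAM numbers, a back-of-envelope estimate is weights ≈ params × effective bits ÷ 8, plus runtime overhead. The sketch below is a rough rule of thumb, not a guarantee; the bits-per-weight figures and the 0.6 GB overhead are assumed averages:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead_gb: float = 0.6) -> float:
    """Back-of-envelope VRAM estimate for a quantized GGUF model.

    params_b: parameter count in billions. bits_per_weight: effective
    bits for the quant (Q4_K_M is roughly 4.5, Q3_K_M roughly 3.4 --
    assumed averages; real files vary by layer mix). overhead_gb covers
    KV cache and runtime buffers at short NPC-sized contexts.
    """
    weights_gb = params_b * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

print(estimate_vram_gb(3, 4.5))   # ~2.3, in line with qwen2.5:3b Q4
print(estimate_vram_gb(7, 4.5))   # ~4.5
```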


The NPC Architecture {#architecture}

The pattern that scales across genres:

+----------------+    JSON over HTTP    +--------------------+
|  Game Process  | <------------------> |     LLM Server     |
| (Unity/Unreal) |   (loopback, 11434)  | (Ollama/llama.cpp) |
+----------------+                      +--------------------+
        |                                         |
        v                                         v
+----------------+                      +----------------+
|  NPC Memory    |                      |  Model Files   |
|  (SQLite)      |                      |  (.gguf)       |
+----------------+                      +----------------+

The game never embeds the model directly. It talks over loopback HTTP to a sidecar process. This sounds heavy but the round-trip is sub-millisecond, and it lets you swap models, update inference engines, and use any language for the runtime without recompiling the game.

The system has four moving parts:

  1. Inference server. Ollama (easy) or llama.cpp's server binary (more control).
  2. NPC controller in your engine. Tracks conversation state, builds prompts, calls the server.
  3. Memory store. SQLite with a JSON column per NPC.
  4. Optional: TTS. Piper for cheap, XTTS-v2 for cloned voices.
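Part 2, the NPC controller, mostly amounts to assembling the /api/chat request body from persona, memory, and history. A hedged Python sketch of that assembly (the helper name and field values are illustrative; the payload shape matches Ollama's chat API):

```python
def build_chat_payload(model, persona, memory_summary, history, player_line):
    """Assemble an Ollama /api/chat request body for one NPC turn.

    Illustrative helper: the memory summary is prepended to the system
    prompt, history turns are replayed as prior messages, and the
    player's new line goes last.
    """
    system = persona
    if memory_summary:
        system += f"\nPrevious interactions: {memory_summary}"
    messages = [{"role": "system", "content": system}]
    messages += history
    messages.append({"role": "user", "content": player_line})
    return {"model": model, "messages": messages, "stream": False,
            "options": {"temperature": 0.7, "num_predict": 80}}

payload = build_chat_payload(
    "qwen2.5:3b",
    "You are Garrett, a wary blacksmith.",
    "Player returned Garrett's hammer last session.",
    [{"role": "user", "content": "Morning."},
     {"role": "assistant", "content": "Aye."}],
    "Seen the bandits?")
print(payload["messages"][0]["content"])
```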

Unity Integration {#unity}

I ship Unity NPCs using a single MonoBehaviour that talks to Ollama via UnityWebRequest. No third-party SDK required.

using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;
using Newtonsoft.Json;
using System.Collections.Generic;

[System.Serializable]
public class ChatMessage {
    public string role;
    public string content;
}

[System.Serializable]
public class ChatRequest {
    public string model;
    public List<ChatMessage> messages;
    public bool stream = false;
    public Options options = new Options();
    [System.Serializable] public class Options {
        public float temperature = 0.7f;
        public int num_predict = 80;
    }
}

public class NPCDialogue : MonoBehaviour {
    public string characterName = "Garrett";
    public string persona = "A wary blacksmith in a medieval village. Speaks in 1-2 short sentences.";
    private List<ChatMessage> history = new List<ChatMessage>();

    void Start() {
        history.Add(new ChatMessage {
            role = "system",
            content = $"You are {characterName}. {persona} Stay in character. Never break the fourth wall."
        });
    }

    public IEnumerator Speak(string playerLine, System.Action<string> onResponse) {
        history.Add(new ChatMessage { role = "user", content = playerLine });

        var payload = new ChatRequest {
            model = "qwen2.5:3b",
            messages = history
        };
        var body = Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(payload));

        using var req = new UnityWebRequest("http://127.0.0.1:11434/api/chat", "POST");
        req.uploadHandler = new UploadHandlerRaw(body);
        req.downloadHandler = new DownloadHandlerBuffer();
        req.SetRequestHeader("Content-Type", "application/json");

        yield return req.SendWebRequest();

        if (req.result == UnityWebRequest.Result.Success) {
            var json = Newtonsoft.Json.Linq.JObject.Parse(req.downloadHandler.text);
            string reply = (string)json["message"]["content"];
            history.Add(new ChatMessage { role = "assistant", content = reply });
            onResponse?.Invoke(reply);
        } else {
            Debug.LogError($"NPC request failed: {req.error}");
        }
    }
}

Wire it from your dialogue UI: StartCoroutine(npc.Speak(input, line => uiText.text = line)); and you have a working AI NPC.

For streaming responses (better perceived latency), set stream = true and parse the JSON-Lines response — Unity's UnityWebRequest supports incremental download via DownloadHandlerScript.


Unreal Engine Integration {#unreal}

For Unreal, the cleanest pattern is a UBlueprintFunctionLibrary wrapping FHttpModule. The C++ side fires the request and broadcasts a multicast delegate when the response arrives.

void UNPCDialogue::SendNPCMessage(const FString& Persona, const FString& Player, FOnNPCReply Callback) {
    // NOTE: Printf-built JSON breaks if Persona or Player contain quotes or
    // newlines. Escape them first, or build the body with FJsonObject and
    // FJsonSerializer in production code.
    FString Body = FString::Printf(TEXT("{\"model\":\"qwen2.5:3b\",\"messages\":[{\"role\":\"system\",\"content\":\"%s\"},{\"role\":\"user\",\"content\":\"%s\"}],\"stream\":false}"),
        *Persona, *Player);

    FHttpRequestRef Req = FHttpModule::Get().CreateRequest();
    Req->SetURL(TEXT("http://127.0.0.1:11434/api/chat"));
    Req->SetVerb(TEXT("POST"));
    Req->SetHeader(TEXT("Content-Type"), TEXT("application/json"));
    Req->SetContentAsString(Body);
    Req->OnProcessRequestComplete().BindLambda([Callback](FHttpRequestPtr, FHttpResponsePtr Resp, bool bOk) {
        if (bOk && Resp.IsValid()) {
            // TODO: parse with FJsonSerializer and extract message.content;
            // the raw body is forwarded here for brevity.
            Callback.ExecuteIfBound(Resp->GetContentAsString());
        }
    });
    Req->ProcessRequest();
}

Call it from a Blueprint event on the NPC actor and route the resulting string into your dialogue widget.

For shipping builds, embed Ollama (or llama.cpp's server binary) inside Project/Binaries/ and spawn it on game start with FPlatformProcess::CreateProc. Players get a self-contained executable.

If your engine uses MetaHuman or LipSync, route the response through a TTS pipeline first and feed the audio into the lip-sync system. The dialogue lag is hidden by the audio playback time.


Persistent NPC Memory {#memory}

Without memory, an AI NPC is a goldfish. With memory, the player gets the "this character actually knows me" feeling. Three layers I always implement:

Layer 1: Short-term (in-context)

Keep the last 6-10 turns in the prompt. Past that, summarize.
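A minimal sketch of that window, assuming a helper that hands the overflow to the Layer-2 summarizer rather than silently dropping it (names are illustrative):

```python
def trim_history(history, keep_turns=8):
    """Split history into the in-context tail and the overflow to summarize.

    A "turn" here is one message. Keeps the most recent keep_turns
    messages in the prompt and returns the older ones so they can be
    fed to the Layer-2 summarizer instead of being lost.
    """
    if len(history) <= keep_turns:
        return history, []
    return history[-keep_turns:], history[:-keep_turns]

turns = [{"role": "user", "content": str(i)} for i in range(12)]
recent, overflow = trim_history(turns, keep_turns=8)
print(len(recent), len(overflow))  # 8 4
```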

Layer 2: Medium-term (per-session summary)

After every 10 turns, ask the model to compress the conversation:

Summarize the conversation between Garrett and the player in 3 bullets:
- What did the player ask about?
- What did Garrett reveal?
- What is Garrett's current mood toward the player?

Persist that summary to SQLite. On the next conversation, prepend it to the system prompt as "Previous interactions: ...".

Layer 3: Long-term (semantic memory)

For longer games, embed past summaries with a local embedding model. When the player initiates a conversation, retrieve the top-3 relevant memories and inject them.

CREATE TABLE npc_memory (
  npc_id TEXT,
  session_id INTEGER,
  ts INTEGER,
  summary TEXT,
  embedding BLOB,        -- 768-dim float32
  player_affinity INTEGER -- -100 to 100
);

The affinity column drives behavior the model alone cannot. NPCs with low affinity refuse certain dialogue branches outright — the LLM never sees the option, so it cannot "be talked into" it.
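A compact sketch of Layer 3 plus the affinity gate, using Python's stdlib against the schema above. The toy 3-dimensional vectors stand in for real 768-dim embeddings, and all names and sample rows are illustrative:

```python
import math, sqlite3, struct

def pack(vec):  # float32 blob, matching the embedding BLOB column
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return list(struct.unpack(f"{len(blob)//4}f", blob))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_memories(db, npc_id, query_vec, k=3):
    """Return the k stored summaries most relevant to the query vector."""
    rows = db.execute(
        "SELECT summary, embedding FROM npc_memory WHERE npc_id = ?",
        (npc_id,)).fetchall()
    ranked = sorted(rows, key=lambda r: cosine(query_vec, unpack(r[1])),
                    reverse=True)
    return [summary for summary, _ in ranked[:k]]

def allowed_branches(branches, affinity):
    """Affinity-gated options: branches the player has not earned are
    filtered out before the LLM ever sees them."""
    return [b for b in branches if affinity >= b["min_affinity"]]

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE npc_memory (
  npc_id TEXT, session_id INTEGER, ts INTEGER,
  summary TEXT, embedding BLOB, player_affinity INTEGER)""")
# Toy 3-dim embeddings stand in for a real 768-dim embedding model.
db.execute("INSERT INTO npc_memory VALUES (?,?,?,?,?,?)",
           ("garrett", 1, 0, "Player returned the hammer.", pack([1.0, 0.0, 0.0]), 20))
db.execute("INSERT INTO npc_memory VALUES (?,?,?,?,?,?)",
           ("garrett", 1, 1, "Player asked about bandits.", pack([0.0, 1.0, 0.0]), 0))
print(top_memories(db, "garrett", [0.9, 0.1, 0.0], k=1))

branches = [{"id": "ask_favor", "min_affinity": 30},
            {"id": "small_talk", "min_affinity": -100}]
print([b["id"] for b in allowed_branches(branches, 20)])  # ['small_talk']
```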


Prompt Patterns That Actually Work {#prompt-patterns}

After a lot of debugging, these are the prompt patterns that produced shippable NPCs.

Pattern 1: Constrained persona

You are GARRETT, a 47-year-old blacksmith.

CONSTRAINTS:
- Respond in 1-2 sentences. Never more than 30 words.
- Speak with a Yorkshire accent. Use "aye" instead of "yes".
- Never mention the player's class, level, or stats.
- If asked about magic, refuse and change the subject.

CURRENT MOOD: irritated (the apprentice burned the forge again).

Respond in character only. Do not narrate actions.

The CONSTRAINTS list is the difference between an NPC that ships and one that monologues for 200 words.

Pattern 2: JSON output for game logic

When the dialogue needs to drive game state (quest acceptance, inventory transfer, faction change), force structured output:

Reply with JSON:
{
  "speech": "what Garrett says aloud (1-2 sentences)",
  "emotion": "neutral | happy | angry | suspicious | sad",
  "quest_offered": true | false,
  "affinity_change": -10 to 10
}

Ollama's format: "json" parameter constrains the output to valid JSON. The game parses the result and updates state without ever interpreting natural language.
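Even with constrained output, treat the model's JSON as untrusted input before it touches game state. A defensive Python parser for the schema above (the function name and fallback values are my own choices):

```python
import json

VALID_EMOTIONS = {"neutral", "happy", "angry", "suspicious", "sad"}

FALLBACK = {"speech": "...", "emotion": "neutral",
            "quest_offered": False, "affinity_change": 0}

def parse_npc_reply(raw: str) -> dict:
    """Validate the model's structured reply before it drives game state.

    Unknown emotions fall back to neutral, affinity_change is clamped
    to [-10, 10], and a malformed payload degrades to a safe no-op.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(data, dict):
        return dict(FALLBACK)
    try:
        delta = int(data.get("affinity_change", 0))
    except (TypeError, ValueError):
        delta = 0
    return {
        "speech": str(data.get("speech", "...")),
        "emotion": data.get("emotion")
                   if data.get("emotion") in VALID_EMOTIONS else "neutral",
        "quest_offered": bool(data.get("quest_offered", False)),
        "affinity_change": max(-10, min(10, delta)),
    }

reply = parse_npc_reply(
    '{"speech":"Aye, take the job.","emotion":"happy",'
    '"quest_offered":true,"affinity_change":25}')
print(reply["affinity_change"])  # clamped to 10
```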

Pattern 3: Few-shot for tone consistency

Give the model 2-3 examples of in-character lines before the conversation starts. This anchors the voice better than prose persona descriptions alone.
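In practice that means injecting the sample exchanges as prior user/assistant turns. A sketch, with hypothetical Garrett lines:

```python
def few_shot_messages(persona, examples, player_line):
    """Prepend 2-3 in-character exchanges as prior turns so the model
    anchors on cadence and vocabulary. Sample lines are illustrative."""
    messages = [{"role": "system", "content": persona}]
    for player, npc in examples:
        messages.append({"role": "user", "content": player})
        messages.append({"role": "assistant", "content": npc})
    messages.append({"role": "user", "content": player_line})
    return messages

msgs = few_shot_messages(
    "You are Garrett, a wary Yorkshire blacksmith.",
    [("Nice weather.", "Aye, if you've nowt better to talk about."),
     ("Can you fix this sword?", "Leave it on t'bench. Two days.")],
    "Seen the bandits?")
print(len(msgs))  # 6
```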


Shipping: TTS, Lip-Sync, and Distribution {#shipping}

Local TTS

For NPC voices on a budget, Piper generates speech in 50-100 ms per sentence on CPU. Voices are 30-80 MB each. Quality is comparable to mid-tier mobile TTS.

For premium voices, XTTS-v2 clones a voice from a 6-second sample. Inference takes 200-400 ms but the result rivals high-end commercial TTS.

Lip-Sync

Unity has Oculus LipSync (free, plugs into Audio Source). Unreal has MetaHuman's audio-driven facial animation. Both work with whatever audio your TTS produces.

Bundling Ollama with Your Game

Ship Ollama or llama.cpp as a sidecar binary. Total addition to install size:

| Component | Size |
|---|---|
| Ollama binary | 220 MB |
| qwen2.5:3b Q4 | 1.9 GB |
| Piper TTS + 1 voice | 90 MB |
| Total (one model, one voice) | ~2.2 GB |

Most players have 50+ GB of game installs already. 2 GB for genuinely intelligent NPCs is a fair trade.

For Steam, mark the game as "AI-generated content" per Valve's generative AI disclosure policy. Players are increasingly fine with AI NPCs as long as they are disclosed.


Pitfalls and Fixes {#pitfalls}

Pitfall 1: NPC goes off-script

Symptom: a wood elf starts explaining how to install Linux. Cause: prompt injection from creative players ("ignore previous instructions and...").

Fix: prepend an immutable framing prompt and reject responses that violate it. A simple regex check for OOC patterns like "as an AI" or "I cannot" filters 95% of jailbreaks.
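A minimal version of that filter in Python; the pattern list is a starting assumption you would extend from playtest logs:

```python
import re

# Out-of-character / jailbreak tells (assumed patterns, extend as needed).
OOC_PATTERNS = re.compile(
    r"(as an ai|language model|i cannot|i can't assist|"
    r"ignore (all|previous) instructions)",
    re.IGNORECASE)

def is_in_character(reply: str) -> bool:
    """Reject replies that leak the model's assistant persona."""
    return not OOC_PATTERNS.search(reply)

print(is_in_character("Aye, bandits south o' the mill."))          # True
print(is_in_character("As an AI language model, I cannot help."))  # False
```

When the check fails, regenerate the line (or fall back to a canned bark) rather than showing the reply to the player.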

Pitfall 2: NPCs all sound the same

Cause: same model, similar persona prompts. The model defaults to its average voice.

Fix: few-shot anchoring. Drop 3 sample lines into each NPC's system prompt. The model latches onto the cadence and vocabulary.

Pitfall 3: VRAM exhaustion mid-game

Cause: the model competes for VRAM with the game's own rendering workload.

Fix: quantize aggressively for ambient NPCs (Q3 or Q4), and unload the model when no NPC is active. Ollama unloads idle models after OLLAMA_KEEP_ALIVE (default 5 minutes).

Pitfall 4: First conversation lags

Cause: cold model load takes 1-3 seconds.

Fix: warm the model at game start with a single empty prompt. After that, every conversation hits a hot model.

Pitfall 5: Players cheat by asking the NPC for hints

Cause: the LLM does not know which information is "leaked" to the player.

Fix: never include unrevealed quest information in the system prompt. Pass only what the player has earned. Treat the NPC's prompt context as a permission system.

For the larger architecture conversation, our hybrid AI architecture post covers how to selectively route premium dialogue to cloud models for marquee characters while keeping ambient dialogue local.


Performance Budget for Different Targets

| Target Hardware | Recommended Setup | Concurrent NPCs |
|---|---|---|
| Steam Deck (8GB shared) | gemma2:2b Q4 + Piper | 1 active |
| Mid-range PC (RTX 3060 12GB) | qwen2.5:3b Q4 + Piper | 1-2 active |
| High-end PC (RTX 4070 12GB) | qwen2.5:7b Q4 + Piper | 1 active |
| Enthusiast (RTX 4090 24GB) | qwen2.5:14b Q4 + XTTS | 2-3 active |

"Active" means actively generating. Background NPCs use cached barks (pre-generated lines).


Conclusion

The era of canned NPC dialogue is ending. A 3B local model with a 50-line system prompt and a SQLite memory store creates characters that outperform anything a static dialogue tree can offer — and the player owns the entire experience offline. Costs do not scale with playtime. The studio does not depend on a cloud API surviving the next five years.

Start with one NPC. Use qwen2.5:3b. Lock the persona with hard constraints. Add JSON output for any line that touches game state. Layer in summary memory once it works. Ship it.

The first time a playtester says "I cannot believe Garrett remembered I gave him that hammer last session," the entire investment pays off in one moment.


Building game-side tools next? See our Ollama Function Calling and Tool Use and Ollama Modelfile Mastery guides for character-specific fine-tuning patterns.

Written by Pattanaik Ramswarup, creator of Local AI Master.
