Build a Local Voice Assistant: Whisper + Ollama + Piper
To build a fully local voice assistant, you chain three open-source pieces: faster-whisper turns your speech into text, a small Ollama model (Llama 3.1 8B or Qwen3 4B) generates the reply, and Piper speaks it back — all offline, no cloud. On a single RTX 3060 12GB the whole loop, from end of speech to first spoken word, runs about 1-2 seconds; on a Raspberry Pi 5 it is more like 5-8 seconds. If you want something that feels like talking to a person with near-instant interruptible speech, Kyutai's Moshi (~200ms) is the specialist tool — but the Whisper to Ollama to Piper stack is far more flexible because you can swap the brain for any model you already run.
This guide covers the full pipeline: how the three stages connect, how to stream so the user is not staring at silence, realistic latency on real hardware, and where this approach beats Moshi and Home Assistant's Assist — and where it does not.
What is the Whisper to Ollama to Piper voice assistant pipeline?
It is a cascaded (pipelined) voice assistant: three independent models run in sequence, each doing one job well. Cascaded means each stage is swappable, which is the whole point — you are not locked into one vendor's monolithic model.
| Stage | Job | Tool | What runs |
|---|---|---|---|
| 1. Speech-to-text (STT) | Turn microphone audio into text | faster-whisper | Whisper small/medium via CTranslate2 |
| 2. Reasoning (LLM) | Generate the spoken reply | Ollama | Llama 3.1 8B or Qwen3 4B |
| 3. Text-to-speech (TTS) | Turn the reply into audio | Piper | VITS/ONNX voice model |
faster-whisper is a reimplementation of OpenAI's Whisper on top of CTranslate2. It is up to roughly 4x faster on GPU and 2x faster on CPU than the original PyTorch code at the same accuracy, and it supports INT8/FP16 quantization to cut memory use further. Ollama is the local model runner most people already use. Piper is a fast, local neural TTS from the Rhasspy team (active development has moved to the Open Home Foundation's OHF-Voice/piper1-gpl fork), built on VITS and exported to ONNX, and light enough to run in real time even on a Raspberry Pi.
The reason this stack matters: every stage is fully offline. No audio leaves your machine, there is no per-request API cost, and you can run it on a network with no internet at all. The trade-off versus a single speech-to-speech model is latency, which we get to below.
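The swappable-stage idea can be sketched as three plain callables wired together. Everything here is illustrative: `VoicePipeline`, the stage signatures, and the stub stages are assumptions made for this sketch, not part of faster-whisper, Ollama, or Piper.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures: each stage is just a function, so any
# implementation with the same shape can be dropped in.
SpeechToText = Callable[[bytes], str]   # raw audio -> transcript
Reasoner = Callable[[str], str]         # transcript -> reply text
TextToSpeech = Callable[[str], bytes]   # reply text -> audio


@dataclass
class VoicePipeline:
    stt: SpeechToText
    llm: Reasoner
    tts: TextToSpeech

    def respond(self, audio: bytes) -> bytes:
        """Run one turn: audio in, audio out, strictly in sequence."""
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stub stages standing in for faster-whisper, Ollama, and Piper.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Swapping the brain is then a one-argument change, for example `VoicePipeline(stt=fake_stt, llm=another_model, tts=fake_tts)`, which is the flexibility the cascade buys you.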
Which Ollama model should be the brain?
For a voice assistant the brain should be small and fast: in a cascade the LLM is usually the slowest stage, and its reply is spoken aloud as it streams. Two solid picks:
| Model | Params | Released | Ollama Q4_K_M size | Context | License | Best for |
|---|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | 8B dense | Jul 23, 2024 | ~4.9 GB | 128K | Llama 3.1 Community | Best general quality that still fits 8-12 GB VRAM |
| Qwen3 4B | 4.02B dense | Apr 29, 2025 | ~2.5 GB | 32K (128K via YaRN) | Apache 2.0 | Fastest first token; runs on a Pi 5 or 8 GB GPU |
Both are genuinely small. Llama 3.1 8B is the quality pick if you have a 12 GB GPU to spare. Qwen3 4B is the latency pick: at ~2.5 GB it leaves room for Whisper and Piper to share the same card, and the smaller model returns its first token sooner, which is what the user actually hears as "responsiveness."
Pull either through Ollama:
```shell
# Quality pick (needs ~8-12 GB free for model + context)
ollama pull llama3.1:8b

# Latency / low-VRAM pick
ollama pull qwen3:4b
```
Keep the system prompt short and instruct the model to answer in one or two sentences. A voice assistant that monologues for 200 tokens feels broken — long replies are the single biggest cause of "this thing is slow," not the model speed.
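As a rough illustration of that advice, here is one way to build the arguments for a single spoken turn. The prompt wording, the `build_chat_request` helper, and the 80-token cap are assumptions for this sketch; `num_predict` is Ollama's option for limiting generated tokens.

```python
# Hypothetical prompt text: short, and explicit about reply length.
SYSTEM_PROMPT = (
    "You are a voice assistant. Answer in one or two short sentences. "
    "Plain spoken text only: no lists, no markdown."
)


def build_chat_request(transcript: str, model: str = "qwen3:4b",
                       max_tokens: int = 80) -> dict:
    """Build keyword arguments for ollama.chat() for one spoken turn."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        "stream": True,
        # num_predict caps the number of generated tokens; 80 is an
        # arbitrary starting point, not a recommended value.
        "options": {"num_predict": max_tokens},
    }


request = build_chat_request("What time is it in Tokyo?")

# With the ollama package installed and the server running, usage
# would look like:
#   import ollama
#   for chunk in ollama.chat(**request):
#       print(chunk["message"]["content"], end="")
```

The cap is a safety net rather than the main control: the system prompt does most of the work. Models that spend tokens on a reasoning phase before answering may need a higher cap, so check how your chosen model behaves.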
How do you make it stream so it does not feel slow?
The naive version waits for each stage to fully finish before starting the next: record all the audio, transcribe all of it, generate the entire LLM reply, then synthesize the whole thing, then play it. That stacks every stage's latency end to end and feels sluggish.
The fix is to overlap the stages:
- Stream STT. Run faster-whisper on rolling chunks with a voice-activity-detection (VAD) endpoint so transcription is essentially done the instant the user stops talking, not started then.
- Stream the LLM. Use Ollama's streaming response (stream: true) so you get tokens as they are generated instead of waiting for the full reply.
- Stream TTS at sentence boundaries. This is the key trick: as the LLM emits text, buffer until you hit a sentence boundary (a period, question mark, or newline), send that sentence to Piper, and start playing audio while the LLM is still writing the next sentence. Piper synthesizes a sentence in a fraction of real time, so the user hears the first sentence within a second of the model starting to think.
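To make the VAD endpoint idea concrete, here is a deliberately simple energy-threshold sketch. Real setups usually rely on a trained VAD model. The `find_endpoint` function, the threshold, and the frame counts below are all assumptions chosen for illustration.

```python
from typing import Iterable, Optional


def find_endpoint(frame_energies: Iterable[float],
                  threshold: float = 0.01,
                  silence_frames: int = 10) -> Optional[int]:
    """Return the index of the frame where the utterance is judged over.

    Speech must have started first; the endpoint is then the frame that
    completes `silence_frames` consecutive frames below `threshold`.
    Returns None if no endpoint is found.
    """
    speech_started = False
    quiet_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            # Loud frame: speech is in progress, reset the silence count.
            speech_started = True
            quiet_run = 0
        elif speech_started:
            quiet_run += 1
            if quiet_run >= silence_frames:
                return i
    return None


# 5 quiet frames, 20 frames of speech, then 10 quiet frames.
energies = [0.0] * 5 + [0.5] * 20 + [0.0] * 10
endpoint = find_endpoint(energies)  # frame index 34
```

The `silence_frames` value is the trade-off knob: a shorter run ends the turn sooner but risks cutting the user off at a natural pause.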
With sentence-level TTS streaming, the latency the user perceives is "time until the first sentence is spoken," not "time until the whole answer is ready." That single change is the difference between a 1-2 second assistant and a 6-8 second one on the same hardware.
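The arithmetic behind that claim can be sketched with a toy latency budget. The stage timings below are made-up round numbers chosen to show the shape of the calculation, not measurements from any hardware.

```python
def naive_latency_ms(stt_ms: int, llm_full_reply_ms: int,
                     tts_full_reply_ms: int) -> int:
    """Every stage finishes completely before the next one starts."""
    return stt_ms + llm_full_reply_ms + tts_full_reply_ms


def streamed_latency_ms(stt_tail_ms: int, llm_first_sentence_ms: int,
                        tts_first_sentence_ms: int) -> int:
    """Only the work needed to speak the first sentence is on the clock."""
    return stt_tail_ms + llm_first_sentence_ms + tts_first_sentence_ms


# Hypothetical timings in milliseconds.
naive = naive_latency_ms(stt_ms=600, llm_full_reply_ms=3000,
                         tts_full_reply_ms=1000)
streamed = streamed_latency_ms(stt_tail_ms=200, llm_first_sentence_ms=600,
                               tts_first_sentence_ms=200)

print(f"naive: {naive} ms, streamed: {streamed} ms")
```

The total work is the same in both cases. Streaming only changes how much of it the user has to wait through before hearing something.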
A minimal streaming loop looks like this:
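```python
import re

import ollama

# Placeholders so the loop runs on its own: in the real pipeline,
# `transcript` comes from the STT stage and `speak_with_piper` is your
# own non-blocking synthesize-and-play helper.
transcript = "What is the capital of France?"


def speak_with_piper(text: str) -> None:
    print(text.strip())


buffer = ""
for chunk in ollama.chat(
    model="qwen3:4b",
    messages=[{"role": "user", "content": transcript}],
    stream=True,
):
    buffer += chunk["message"]["content"]
    # Flush each complete sentence straight to Piper as it arrives.
    # A boundary is ., ! or ? followed by whitespace, or a newline.
    while (m := re.search(r"[.!?]\s|\n", buffer)):
        sentence, buffer = buffer[: m.end()], buffer[m.end():]
        if sentence.strip():
            speak_with_piper(sentence)  # synthesize + play, non-blocking

# Speak whatever is left once the stream ends.
if buffer.strip():
    speak_with_piper(buffer)
```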
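The loop above assumes the speak call returns immediately. One possible way to get that behavior is a small queue drained by a background thread, sketched below. `SpeechQueue` is a hypothetical helper, and the function passed into it stands in for whatever actually calls Piper and plays the audio.

```python
import queue
import threading
from typing import Callable


class SpeechQueue:
    """Speak sentences in order on a background thread."""

    def __init__(self, synthesize_and_play: Callable[[str], None]):
        self._speak = synthesize_and_play
        self._sentences = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def say(self, sentence: str) -> None:
        """Enqueue a sentence and return immediately."""
        self._sentences.put(sentence)

    def close(self) -> None:
        """Block until everything queued has been spoken, then stop."""
        self._sentences.put(None)  # sentinel telling the worker to exit
        self._worker.join()

    def _run(self) -> None:
        while True:
            sentence = self._sentences.get()
            if sentence is None:
                break
            self._speak(sentence)


# Stand-in for the real Piper call: just record what would be spoken.
spoken = []
speech = SpeechQueue(spoken.append)
speech.say("First sentence.")
speech.say("Second sentence.")
speech.close()
```

A single worker keeps sentences in order, so the second sentence never starts playing over the first while the LLM loop keeps reading tokens.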
What latency should you actually expect?
Latency is dominated by hardware and by whether you stream. Below are realistic, approximate figures for the perceived loop — from the moment you stop speaking to the first spoken word back — assuming the sentence-streaming approach above. Treat these as ballpark, not lab numbers; exact values move with model size, quant, and prompt length.
| Hardware | STT model | LLM | Approx. perceived latency | Notes |
|---|---|---|---|---|
| RTX 3060 12GB | Whisper small (faster-whisper) | Qwen3 4B | ~1-1.5 s | All three on the GPU, comfortable |
| RTX 3060 12GB | Whisper medium | Llama 3.1 8B | ~1.5-2 s | Higher STT accuracy, slightly slower |
| RTX 3090/4090 24GB | Whisper medium | Llama 3.1 8B | ~0.7-1.2 s | Headroom for bigger models |
| Raspberry Pi 5 (8GB, CPU) | Whisper base/small | Qwen3 4B | ~5-8 s | CPU-only; usable but not snappy |
| Mac (Apple Silicon, Metal) | Whisper small | Qwen3 4B | ~1-2 s | Unified memory helps a lot |
The single biggest lever after streaming is the Whisper model size. faster-whisper on an RTX 3060 runs Whisper large-v3 well above real time (roughly 3-4x), but for an assistant you rarely need large — Whisper small (244M params) is fast and accurate enough for command-style speech, and dropping from large to small can shave most of a second off the STT stage. The Whisper family sizes are tiny (39M), base (74M), small (244M), medium (769M) and large-v3 (1.55B); for a responsive assistant, small or medium is the sweet spot. For tuning the STT stage specifically — model choice, VAD, and quantization — see our faster-whisper setup guide.
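A hedged sketch of what that model choice might look like in code follows. `WhisperModel` and its `transcribe` method are faster-whisper's API. The `whisper_settings` helper and the particular device and compute-type pairings are assumptions for illustration, so adjust them for your hardware.

```python
def whisper_settings(has_gpu: bool, want_accuracy: bool = False) -> dict:
    """Pick constructor arguments for faster_whisper.WhisperModel."""
    return {
        # small for responsiveness, medium when accuracy matters more
        "model_size_or_path": "medium" if want_accuracy else "small",
        "device": "cuda" if has_gpu else "cpu",
        # Assumed pairings: INT8 weights with FP16 compute on GPU,
        # plain INT8 on CPU.
        "compute_type": "int8_float16" if has_gpu else "int8",
    }


def transcribe_file(path: str, has_gpu: bool) -> str:
    """Transcribe one audio file; needs `pip install faster-whisper`."""
    # Imported lazily so the settings helper works without the package.
    from faster_whisper import WhisperModel

    model = WhisperModel(**whisper_settings(has_gpu))
    # vad_filter=True skips non-speech audio before decoding.
    segments, _info = model.transcribe(path, vad_filter=True)
    return " ".join(segment.text.strip() for segment in segments)


gpu_settings = whisper_settings(has_gpu=True)
cpu_settings = whisper_settings(has_gpu=False)
```

In a long-running assistant the model would be loaded once at startup and reused across turns, since reloading it per request would add far more latency than the model size itself.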
A note from our own testing
In our own rough testing on an RTX 3060 12GB, with faster-whisper running Whisper small (INT8) plus Qwen3 4B at Q4_K_M and Piper synthesizing per sentence, the perceived "stop talking to first spoken word" loop landed around 1-1.5 seconds for short questions, which is fast enough to feel conversational. Pushing the LLM up to Llama 3.1 8B added a few hundred milliseconds. These are single-machine, eyeballed figures rather than a controlled benchmark, but the takeaway held: the LLM and the TTS-flush strategy, not Whisper, decided how "instant" it felt.
How does this compare to Moshi?
Moshi, from Kyutai, is a different animal: a single speech-to-speech foundation model (a 7B "Temporal Transformer" paired with the Mimi neural audio codec) that listens and speaks in full duplex — it can talk and listen at the same time, and you can interrupt it mid-sentence. Its theoretical latency is about 160ms (200ms in practice on an L4 GPU), an order of magnitude lower than any cascade.
| | Whisper + Ollama + Piper | Moshi (Kyutai) |
|---|---|---|
| Architecture | Cascaded (3 swappable models) | Single end-to-end speech-to-speech |
| Latency | ~1-2 s (RTX 3060) | ~200 ms in practice |
| Full-duplex / interruptible | No (turn-based) | Yes |
| Swap the "brain" LLM | Yes — any Ollama model | No — fixed model |
| Tool use / function calling | Easy (it is just an LLM) | Limited |
| VRAM footprint | Flexible (4 GB and up) | ~7B model, heavier |
| Maturity for assistants | Production-ready, very common | Research-forward, conversational |
Pick Moshi when natural, interruptible, low-latency conversation is the product. Pick the Whisper + Ollama + Piper cascade when you want a controllable assistant — pick your own LLM, add retrieval or tool calls, choose any Piper voice, run on a Pi — and a one-to-two-second turn-based response is acceptable. For most "answer my question / control my house" assistants, the cascade wins on flexibility; for a companion you actually converse with, Moshi wins on feel.
How does this compare to Home Assistant Assist?
If your goal is a smart-home voice assistant, you probably should not build the pipeline from scratch, because Home Assistant already ships this exact cascade. Home Assistant's Assist uses the Wyoming protocol (socket interfaces that let services plug together): faster-whisper and Piper run as Wyoming services, and Home Assistant speaks Wyoming natively. As of 2026 this runs locally with sub-second responses on modest hardware and can route the "brain" to a local Ollama model for open-ended questions.
So the two are not really competitors:
- Build the raw Whisper + Ollama + Piper pipeline yourself when you want a standalone assistant app, a kiosk, a custom product, or you need control over every stage and prompt.
- Use Home Assistant Assist when the assistant's main job is controlling your home — it gives you wake words, satellite hardware, exposed-entity grounding, and the Whisper/Piper plumbing already wired through Wyoming, so you write almost no code.
For the smart-home route end to end — installing the Wyoming Whisper and Piper add-ons and pointing Assist at Ollama — see our guide to a local AI + Home Assistant setup.
Key Takeaways
- A local voice assistant is three swappable models: faster-whisper (STT) to Ollama (LLM) to Piper (TTS), and it runs fully offline with no cloud or per-request cost.
- Stream at sentence boundaries. Flush each finished sentence from the LLM straight to Piper and start playing it — this is what turns a 6-8 second assistant into a 1-2 second one on the same hardware.
- Pick the brain for speed: Qwen3 4B (~2.5 GB, Apr 2025) for lowest latency and low VRAM; Llama 3.1 8B (~4.9 GB Q4, Jul 2024) for best quality on a 12 GB GPU.
- Expect ~1-2 s on an RTX 3060 and ~5-8 s on a Raspberry Pi 5. Whisper small (244M) is the sweet spot — you rarely need large-v3 for an assistant.
- Moshi (~200 ms, full-duplex) beats the cascade on conversational feel; the cascade beats Moshi on flexibility (any LLM, tool use, any voice, runs on a Pi). For smart homes, Home Assistant Assist already ships this stack via Wyoming.
Next Steps
- Dial in the speech-to-text stage — model size, VAD, and quantization — with our faster-whisper setup guide.
- Want the low-latency, interruptible alternative? Read the Moshi real-time speech-to-speech guide.
- Building this for the home instead of from scratch? Follow the local AI + Home Assistant guide to wire Whisper and Piper through Wyoming.
- Check the official repos for the latest models and voices: faster-whisper and Piper (piper1-gpl).