To build a fully local voice assistant, you chain three open-source pieces: faster-whisper turns your speech into text, a small Ollama model (Llama 3.1 8B or Qwen3 4B) generates the reply, and Piper speaks it back — all offline, no cloud. On a single RTX 3060 12GB the whole loop, from end of speech to first spoken word, runs about 1-2 seconds; on a Raspberry Pi 5 it is more like 5-8 seconds. If you want something that feels like talking to a person with near-instant interruptible speech, Kyutai's Moshi (~200ms) is the specialist tool — but the Whisper to Ollama to Piper stack is far more flexible because you can swap the brain for any model you already run.

This guide covers the full pipeline: how the three stages connect, how to stream so the user is not staring at silence, realistic latency on real hardware, and where this approach beats Moshi and Home Assistant's Assist — and where it does not.

What is the Whisper to Ollama to Piper voice assistant pipeline?

It is a cascaded (pipelined) voice assistant: three independent models run in sequence, each doing one job well. Cascaded means each stage is swappable, which is the whole point — you are not locked into one vendor's monolithic model.

Stage	Job	Tool	What runs
1. Speech-to-text (STT)	Turn microphone audio into text	faster-whisper	Whisper small/medium via CTranslate2
2. Reasoning (LLM)	Generate the spoken reply	Ollama	Llama 3.1 8B or Qwen3 4B
3. Text-to-speech (TTS)	Turn the reply into audio	Piper	VITS/ONNX voice model

faster-whisper is a reimplementation of OpenAI's Whisper on top of CTranslate2; its maintainers report up to 4x faster transcription than the original PyTorch code at the same accuracy, with lower memory use and optional INT8 or FP16 inference. Ollama is the local model runner most people already use. Piper is a fast, local neural TTS from the Rhasspy team (active development has moved to the Open Home Foundation's OHF-Voice/piper1-gpl fork), built on VITS and exported to ONNX, and light enough to run in real time even on a Raspberry Pi.

The reason this stack matters: every stage is fully offline. No audio leaves your machine, there is no per-request API cost, and you can run it on a network with no internet at all. The trade-off versus a single speech-to-speech model is latency, which we get to below.
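
To make the "three swappable stages" idea concrete, here is a minimal sketch of the cascade as plain Python. The type aliases, the `Cascade` class, and the `fake_*` stand-in stages are all illustrative names, not part of faster-whisper, Ollama, or Piper; in a real build each stand-in would be replaced by a call into the corresponding tool.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is just a callable, so any one of them can be replaced
# without touching the other two. All names here are illustrative.
SpeechToText = Callable[[bytes], str]   # raw audio -> transcript
Reasoner = Callable[[str], str]         # transcript -> reply text
TextToSpeech = Callable[[str], bytes]   # reply text -> audio


@dataclass
class Cascade:
    stt: SpeechToText
    llm: Reasoner
    tts: TextToSpeech

    def respond(self, audio: bytes) -> bytes:
        """Run one turn: audio in, audio out, one stage after another."""
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Stand-in stages so the sketch runs without any models installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


assistant = Cascade(stt=fake_stt, llm=fake_llm, tts=fake_tts)
```

Swapping the brain then means passing a different `llm` callable and leaving the other two stages alone. This version runs the stages strictly one after another, which is the simplest shape but also the slowest.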

Which Ollama model should be the brain?

For a voice assistant the brain should be small and fast, because in a cascade the LLM is usually the slowest stage and its output is spoken aloud as it is generated. Two solid picks:

Model	Params	Released	Ollama Q4_K_M size	Context	License	Best for
Llama 3.1 8B Instruct	8B dense	Jul 23, 2024	~4.9 GB	128K	Llama 3.1 Community	Best general quality that still fits 8-12 GB VRAM
Qwen3 4B	4.02B dense	Apr 29, 2025	~2.5 GB	32K (128K via YaRN)	Apache 2.0	Fastest first token; runs on a Pi 5 or 8 GB GPU

Both are genuinely small. Llama 3.1 8B is the quality pick if you have a 12 GB GPU to spare; Qwen3 4B is the latency pick — at ~2.5 GB it leaves room for Whisper and Piper to share the same card, and its smaller body returns the first token sooner, which is what the user actually hears as "responsiveness."

Pull either through Ollama:

# Quality pick (needs ~8-12 GB free for model + context)
ollama pull llama3.1:8b

# Latency / low-VRAM pick
ollama pull qwen3:4b

Keep the system prompt short and instruct the model to answer in one or two sentences. A voice assistant that monologues for 200 tokens feels broken — long replies are the single biggest cause of "this thing is slow," not the model speed.
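
One way to apply that advice is to build the chat request in one place, with a short system prompt and a hard cap on reply length. The sketch below only builds the keyword arguments for `ollama.chat`; the prompt wording, the `build_chat_request` helper, and the 80-token cap are assumptions to tune, not recommended values.

```python
SYSTEM_PROMPT = (
    "You are a voice assistant. Answer in one or two short sentences. "
    "Do not use lists, markdown, or emoji, because the reply is read aloud."
)


def build_chat_request(transcript: str, model: str = "qwen3:4b",
                       max_tokens: int = 80) -> dict:
    """Return keyword arguments for ollama.chat(**request)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        "stream": True,
        # num_predict caps the number of generated tokens in Ollama.
        "options": {"num_predict": max_tokens},
    }


request = build_chat_request("What's the weather like on Mars?")
```

A hard cap can cut a reply off mid-sentence, so it is probably best treated as a safety net behind the prompt instruction rather than as the main control.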

How do you make it stream so it does not feel slow?

The naive version waits for each stage to fully finish before starting the next: record all the audio, transcribe all of it, generate the entire LLM reply, then synthesize the whole thing, then play it. That stacks every stage's latency end to end and feels sluggish.

The fix is to overlap the stages:

Stream STT. Run faster-whisper on rolling chunks and use voice-activity detection (VAD) to find the end of speech, so transcription is essentially finished the instant the user stops talking rather than only starting at that point.
Stream the LLM. Use Ollama's streaming response (stream: true) so you get tokens as they are generated instead of waiting for the full reply.
Stream TTS at sentence boundaries. This is the key trick: as the LLM emits text, buffer until you hit a sentence boundary (a period, question mark, or newline), send that sentence to Piper, and start playing audio while the LLM is still writing the next sentence. Piper synthesizes a sentence in a fraction of real time, so the user hears the first sentence within a second of the model starting to think.
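
The endpointing step in the first item can be sketched without any model at all. The example below is a deliberately crude energy-based stand-in for a real VAD (it is not what faster-whisper uses internally): it flags end of speech once the speaker has been heard and then stays quiet for a set number of frames. The `Endpointer` class, the threshold of 500, and the 15-frame window are all assumptions that would need tuning per microphone.

```python
import array
import math


def frame_rms(frame: bytes) -> float:
    """Root-mean-square level of a frame of 16-bit signed PCM samples."""
    samples = array.array("h", frame)  # frame length must be a multiple of 2
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class Endpointer:
    """Signal end of speech after enough consecutive quiet frames."""

    def __init__(self, threshold: float = 500.0, quiet_frames_needed: int = 15):
        self.threshold = threshold                      # assumed level, tune it
        self.quiet_frames_needed = quiet_frames_needed  # e.g. 15 x 30 ms frames
        self.heard_speech = False
        self.quiet_run = 0

    def feed(self, frame: bytes) -> bool:
        """Return True once the speaker has started and then gone quiet."""
        if frame_rms(frame) >= self.threshold:
            self.heard_speech = True
            self.quiet_run = 0
        elif self.heard_speech:
            self.quiet_run += 1
        return self.heard_speech and self.quiet_run >= self.quiet_frames_needed
```

Feeding microphone frames into `feed` and kicking off the final transcription pass as soon as it returns `True` is the basic pattern; a trained VAD slots into the same place and copes far better with background noise.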

With sentence-level TTS streaming, the latency the user perceives is "time until the first sentence is spoken," not "time until the whole answer is ready." That single change is the difference between a 1-2 second assistant and a 6-8 second one on the same hardware.

A minimal streaming loop looks like this:

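import re

import ollama

# Placeholder: in the real pipeline this string comes from faster-whisper.
transcript = "What is the capital of France?"


def speak_with_piper(sentence):
    """Hypothetical helper: synthesize with Piper and queue for playback."""
    print(f"[speak] {sentence.strip()}")  # replace with real TTS + audio


# A sentence ends at ., ! or ? followed by whitespace, or at a newline.
SENTENCE_END = re.compile(r"[.!?]\s+|\n")

buffer = ""
for chunk in ollama.chat(
    model="qwen3:4b",
    messages=[{"role": "user", "content": transcript}],
    stream=True,
):
    buffer += chunk["message"]["content"]
    # Flush each complete sentence straight to Piper as it arrives.
    while (m := SENTENCE_END.search(buffer)):
        sentence, buffer = buffer[: m.end()], buffer[m.end():]
        if sentence.strip():
            speak_with_piper(sentence)
# Speak whatever is left once the stream ends.
if buffer.strip():
    speak_with_piper(buffer)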
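
The loop above assumes `speak_with_piper` returns immediately, which raises a practical question: how do you keep sentences from overlapping or playing out of order? One possible answer, sketched below, is a single background worker that speaks queued sentences in sequence. `SpeechQueue` and `piper_to_wav` are hypothetical names; `piper_to_wav` assumes the classic `piper` command-line binary is on the PATH and that a voice file named `en_US-lessac-medium.onnx` sits in the working directory (the piper1-gpl fork may be invoked differently). Audio playback is left out.

```python
import queue
import subprocess
import threading
from typing import Callable


def piper_to_wav(text: str, wav_path: str,
                 voice: str = "en_US-lessac-medium.onnx") -> None:
    """Synthesize one sentence with the classic piper CLI (assumed on PATH)."""
    subprocess.run(
        ["piper", "--model", voice, "--output_file", wav_path],
        input=text.encode("utf-8"),  # piper reads the text from stdin
        check=True,
    )


class SpeechQueue:
    """Speak sentences one at a time, in order, on a background thread."""

    def __init__(self, speak: Callable[[str], None]):
        self._speak = speak
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def say(self, sentence: str) -> None:
        """Return immediately; the worker speaks the sentence later."""
        self._pending.put(sentence)

    def close(self) -> None:
        """Speak everything still queued, then stop the worker."""
        self._pending.put(None)  # None is the shutdown signal
        self._worker.join()

    def _run(self) -> None:
        while (sentence := self._pending.get()) is not None:
            self._speak(sentence)
```

With this in place, `speak_with_piper` in the loop could simply be the `say` method of a `SpeechQueue` whose `speak` callable synthesizes and plays one sentence. The LLM loop never blocks on audio, and the user still hears the sentences in the order they were written.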

What latency should you actually expect?

Latency is dominated by hardware and by whether you stream. Below are realistic, approximate figures for the perceived loop — from the moment you stop speaking to the first spoken word back — assuming the sentence-streaming approach above. Treat these as ballpark, not lab numbers; exact values move with model size, quant, and prompt length.

Hardware	STT model	LLM	Approx. perceived latency	Notes
RTX 3060 12GB	Whisper small (faster-whisper)	Qwen3 4B	~1-1.5 s	All three on the GPU, comfortable
RTX 3060 12GB	Whisper medium	Llama 3.1 8B	~1.5-2 s	Higher STT accuracy, slightly slower
RTX 3090/4090 24GB	Whisper medium	Llama 3.1 8B	~0.7-1.2 s	Headroom for bigger models
Raspberry Pi 5 (8GB, CPU)	Whisper base/small	Qwen3 4B	~5-8 s	CPU-only; usable but not snappy
Mac (Apple Silicon, Metal)	Whisper small	Qwen3 4B	~1-2 s	Unified memory helps a lot

After streaming, the next lever to check is the Whisper model size. faster-whisper on an RTX 3060 runs Whisper large-v3 well above real time (roughly 3-4x), but an assistant rarely needs large: Whisper small (244M params) is fast and accurate enough for command-style speech, and dropping from large to small can shave most of a second off the STT stage. The Whisper family sizes are tiny (39M), base (74M), small (244M), medium (769M) and large-v3 (1.55B); for a responsive assistant, small or medium is the sweet spot. For tuning the STT stage specifically (model choice, VAD, and quantization), see our faster-whisper setup guide.
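
As a rough illustration of that model-size choice, the sketch below picks faster-whisper settings based on whether a GPU is available and then transcribes one file. The helper names and the specific size and precision choices are assumptions for illustration; `transcribe_file` needs the `faster-whisper` package installed, so the import is deferred until it is actually called.

```python
from typing import Iterable


def pick_whisper_settings(has_gpu: bool) -> dict:
    """Choose a model size and precision; the choices are illustrative."""
    if has_gpu:
        return {"model_size_or_path": "small", "device": "cuda",
                "compute_type": "float16"}
    return {"model_size_or_path": "base", "device": "cpu",
            "compute_type": "int8"}


def join_segments(texts: Iterable[str]) -> str:
    """Merge per-segment text into one clean transcript string."""
    return " ".join(t.strip() for t in texts if t.strip())


def transcribe_file(path: str, has_gpu: bool) -> str:
    """Transcribe one audio file (needs the faster-whisper package)."""
    from faster_whisper import WhisperModel  # imported lazily on purpose

    model = WhisperModel(**pick_whisper_settings(has_gpu))
    # vad_filter=True asks faster-whisper to skip non-speech audio.
    segments, _info = model.transcribe(path, beam_size=1, vad_filter=True)
    return join_segments(segment.text for segment in segments)
```

In a long-running assistant the model would be loaded once at startup and reused for every utterance, since loading it on each call would likely cost more than the transcription itself.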

A note from our own testing

In our own rough testing on an RTX 3060 12GB, with faster-whisper running Whisper small (INT8) plus Qwen3 4B at Q4_K_M and Piper synthesizing per sentence, the perceived "stop talking to first spoken word" loop landed around 1-1.5 seconds for short questions, fast enough to feel conversational. Pushing the LLM up to Llama 3.1 8B added a few hundred milliseconds. These are single-machine, eyeballed figures rather than a controlled benchmark, but the takeaway held: once STT was already on Whisper small, the LLM and the TTS-flush strategy, not further Whisper tuning, decided how "instant" it felt.
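
If you would rather measure than eyeball, a small timing helper is enough to get started. The sketch below times how long a token stream takes to produce its first complete sentence; it covers only the LLM portion of the loop, and the `time_to_first_sentence` name and the injectable `clock` parameter are assumptions made so the helper can be checked without a model.

```python
import re
import time
from typing import Callable, Iterable, Optional, Tuple

# Same boundary rule as the streaming loop: ., ! or ? plus whitespace, or newline.
SENTENCE_END = re.compile(r"[.!?]\s|\n")


def time_to_first_sentence(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[str], float]:
    """Return the first complete sentence and the seconds it took to appear."""
    start = clock()
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        if match:
            return buffer[: match.end()].strip(), clock() - start
    # The stream ended without a boundary; report whatever arrived.
    return (buffer.strip() or None), clock() - start
```

Passing it a generator that yields the `content` of each streamed Ollama chunk gives a per-question figure you can log across a handful of prompts, which is usually more informative than a single run.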

How does this compare to Moshi?

Moshi, from Kyutai, is a different animal: a single speech-to-speech foundation model (a 7B "Temporal Transformer" paired with the Mimi neural audio codec) that listens and speaks in full duplex. It can talk and listen at the same time, and you can interrupt it mid-sentence. Its theoretical latency is about 160 ms (200 ms in practice on an L4 GPU), roughly an order of magnitude lower than the 1-2 second cascade described here.

Aspect	Whisper + Ollama + Piper	Moshi (Kyutai)
Architecture	Cascaded (3 swappable models)	Single end-to-end speech-to-speech
Latency	~1-2 s (RTX 3060)	~200 ms in practice
Full-duplex / interruptible	No (turn-based)	Yes
Swap the "brain" LLM	Yes — any Ollama model	No — fixed model
Tool use / function calling	Easy (it is just an LLM)	Limited
VRAM footprint	Flexible (4 GB and up)	~7B model, heavier
Maturity for assistants	Production-ready, very common	Research-forward, conversational

Pick Moshi when natural, interruptible, low-latency conversation is the product. Pick the Whisper + Ollama + Piper cascade when you want a controllable assistant — pick your own LLM, add retrieval or tool calls, choose any Piper voice, run on a Pi — and a one-to-two-second turn-based response is acceptable. For most "answer my question / control my house" assistants, the cascade wins on flexibility; for a companion you actually converse with, Moshi wins on feel.

How does this compare to Home Assistant Assist?

If your goal is a smart-home voice assistant, you probably should not build the pipeline from scratch, because Home Assistant already ships this exact cascade. Home Assistant's Assist uses the Wyoming protocol (a socket-based interface that lets voice services plug together): faster-whisper and Piper run as Wyoming services, and Home Assistant speaks Wyoming natively. As of 2026 this runs locally with sub-second responses on modest hardware and can route the "brain" to a local Ollama model for open-ended questions.

So the two are not really competitors:

Build the raw Whisper + Ollama + Piper pipeline yourself when you want a standalone assistant app, a kiosk, a custom product, or you need control over every stage and prompt.
Use Home Assistant Assist when the assistant's main job is controlling your home — it gives you wake words, satellite hardware, exposed-entity grounding, and the Whisper/Piper plumbing already wired through Wyoming, so you write almost no code.

For the smart-home route end to end — installing the Wyoming Whisper and Piper add-ons and pointing Assist at Ollama — see our guide to a local AI + Home Assistant setup.

Key Takeaways

A local voice assistant is three swappable models: faster-whisper (STT) to Ollama (LLM) to Piper (TTS), and it runs fully offline with no cloud or per-request cost.
Stream at sentence boundaries. Flush each finished sentence from the LLM straight to Piper and start playing it — this is what turns a 6-8 second assistant into a 1-2 second one on the same hardware.
Pick the brain for speed: Qwen3 4B (~2.5 GB, Apr 2025) for lowest latency and low VRAM; Llama 3.1 8B (~4.9 GB Q4, Jul 2024) for best quality on a 12 GB GPU.
Expect ~1-2 s on an RTX 3060 and ~5-8 s on a Raspberry Pi 5. Whisper small (244M) is the sweet spot — you rarely need large-v3 for an assistant.
Moshi (~200 ms, full-duplex) beats the cascade on conversational feel; the cascade beats Moshi on flexibility (any LLM, tool use, any voice, runs on a Pi). For smart homes, Home Assistant Assist already ships this stack via Wyoming.

Next Steps

Dial in the speech-to-text stage — model size, VAD, and quantization — with our faster-whisper setup guide.
Want the low-latency, interruptible alternative? Read the Moshi real-time speech-to-speech guide.
Building this for the home instead of from scratch? Follow the local AI + Home Assistant guide to wire Whisper and Piper through Wyoming.
Check the official repos for the latest models and voices: faster-whisper and Piper (piper1-gpl).

Build a Local Voice Assistant: Whisper + Ollama + Piper

Written by the Local AI Master Team
