Build a Local Voice Assistant: Whisper + Ollama + Piper

June 20, 2026
Local AI Master Research Team

To build a fully local voice assistant, you chain three open-source pieces: faster-whisper turns your speech into text, a small Ollama model (Llama 3.1 8B or Qwen3 4B) generates the reply, and Piper speaks it back, all offline with no cloud. On a single RTX 3060 12GB the whole loop, from end of speech to first spoken word, takes about 1-2 seconds; on a Raspberry Pi 5 it is more like 5-8 seconds. If you want something that feels like talking to a person, with near-instant, interruptible speech, Kyutai's Moshi (~200 ms) is the specialist tool. The Whisper to Ollama to Piper stack is far more flexible, though, because you can swap the brain for any model you already run.

This guide covers the full pipeline: how the three stages connect, how to stream so the user is not staring at silence, realistic latency on real hardware, and where this approach beats Moshi and Home Assistant's Assist — and where it does not.

What is the Whisper to Ollama to Piper voice assistant pipeline?

It is a cascaded (pipelined) voice assistant: three independent models run in sequence, each doing one job well. Cascaded means each stage is swappable, which is the whole point — you are not locked into one vendor's monolithic model.

| Stage | Job | Tool | What runs |
|---|---|---|---|
| 1. Speech-to-text (STT) | Turn microphone audio into text | faster-whisper | Whisper small/medium via CTranslate2 |
| 2. Reasoning (LLM) | Generate the spoken reply | Ollama | Llama 3.1 8B or Qwen3 4B |
| 3. Text-to-speech (TTS) | Turn the reply into audio | Piper | VITS/ONNX voice model |

faster-whisper is a reimplementation of OpenAI's Whisper on top of CTranslate2. Its maintainers report it as up to 4x faster than the original PyTorch code at the same accuracy, with lower memory use and optional INT8/FP16 quantization. Ollama is the local model runner most people already use. Piper is a fast, local neural TTS from the Rhasspy team (active development has moved to the Open Home Foundation's OHF-Voice/piper1-gpl fork) built on VITS and exported to ONNX, light enough to run in real time even on a Raspberry Pi.

The reason this stack matters: every stage is fully offline. No audio leaves your machine, there is no per-request API cost, and you can run it on a network with no internet at all. The trade-off versus a single speech-to-speech model is latency, which we get to below.
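To make "swappable" concrete, here is a minimal sketch of the cascade as three plain callables. The `VoicePipeline` class, the type aliases, and the `fake_*` stages are hypothetical names made up for illustration; in a real build each stage would wrap faster-whisper, Ollama, and Piper respectively.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical stage signatures, chosen for illustration only.
SpeechToText = Callable[[bytes], str]        # audio in, transcript out
Reasoner = Callable[[str], Iterable[str]]    # transcript in, reply text chunks out
TextToSpeech = Callable[[str], bytes]        # text in, audio out


@dataclass
class VoicePipeline:
    stt: SpeechToText
    llm: Reasoner
    tts: TextToSpeech

    def respond(self, audio: bytes) -> bytes:
        """Run one turn: transcribe, generate, synthesize (no streaming yet)."""
        transcript = self.stt(audio)
        reply = "".join(self.llm(transcript))
        return self.tts(reply)


# Stand-in stages so the sketch runs without any models installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> Iterable[str]:
    yield "You said: "
    yield transcript


def fake_tts(text: str) -> bytes:
    return text.upper().encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = pipeline.respond(b"turn on the lights")
print(audio_out)
```

Because each stage is only a function with an agreed signature, replacing the LLM or the voice means passing a different callable; nothing else in the pipeline changes.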

Which Ollama model should be the brain?

For a voice assistant the brain should be small and fast, because in a cascade the LLM is usually the slowest stage and its reply is spoken as it is generated. Two solid picks:

| Model | Params | Released | Ollama Q4_K_M size | Context | License | Best for |
|---|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | 8B dense | Jul 23, 2024 | ~4.9 GB | 128K | Llama 3.1 Community | Best general quality that still fits 8-12 GB VRAM |
| Qwen3 4B | 4.02B dense | Apr 29, 2025 | ~2.5 GB | 32K (128K via YaRN) | Apache 2.0 | Fastest first token; runs on a Pi 5 or 8 GB GPU |

Both are genuinely small. Llama 3.1 8B is the quality pick if you have a 12 GB GPU to spare. Qwen3 4B is the latency pick: at ~2.5 GB it leaves room for Whisper and Piper to share the same card, and the smaller model returns its first token sooner, which is what the user actually hears as "responsiveness."

Pull either through Ollama:

# Quality pick (needs ~8-12 GB free for model + context)
ollama pull llama3.1:8b

# Latency / low-VRAM pick
ollama pull qwen3:4b

Keep the system prompt short and instruct the model to answer in one or two sentences. A voice assistant that monologues for 200 tokens feels broken — long replies are the single biggest cause of "this thing is slow," not the model speed.
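One way to enforce short replies is in the request itself. The sketch below builds a JSON body for Ollama's `/api/chat` endpoint with a brief system prompt and a token cap through the `num_predict` option. The prompt wording, the 80-token cap, and the `build_chat_request` helper are assumptions for illustration, not values from any benchmark.

```python
import json

# Hypothetical prompt wording; tune it for your own assistant.
SYSTEM_PROMPT = "You are a voice assistant. Answer in one or two short sentences."


def build_chat_request(transcript: str, model: str = "llama3.1:8b", max_tokens: int = 80) -> dict:
    """Build a JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "stream": True,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        # num_predict caps the number of generated tokens; 80 is an assumed value.
        "options": {"num_predict": max_tokens},
    }


request = build_chat_request("What's the weather like on Mars?")
print(json.dumps(request, indent=2))
```

The cap is a safety net rather than a substitute for the prompt: a reply cut off mid-sentence sounds worse than one the model was asked to keep short.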

How do you make it stream so it does not feel slow?

The naive version waits for each stage to fully finish before starting the next: record all the audio, transcribe all of it, generate the entire LLM reply, then synthesize the whole thing, then play it. That stacks every stage's latency end to end and feels sluggish.

The fix is to overlap the stages:

  1. Stream STT. Run faster-whisper on rolling chunks with a voice-activity-detection (VAD) endpoint, so transcription is essentially finished the instant the user stops talking rather than only starting at that point.
  2. Stream the LLM. Use Ollama's streaming response (stream: true) so you get tokens as they are generated instead of waiting for the full reply.
  3. Stream TTS at sentence boundaries. This is the key trick: as the LLM emits text, buffer until you hit a sentence boundary (a period, question mark, or newline), send that sentence to Piper, and start playing audio while the LLM is still writing the next sentence. Piper synthesizes a sentence in a fraction of real time, so the user hears the first sentence within a second of the model starting to think.

With sentence-level TTS streaming, the latency the user perceives is "time until the first sentence is spoken," not "time until the whole answer is ready." That single change is the difference between a 1-2 second assistant and a 6-8 second one on the same hardware.
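A back-of-the-envelope calculation shows where the difference comes from. The stage timings below are made-up illustrative values, not measurements, and the `StageTimes` fields are hypothetical names; substitute numbers from your own hardware.

```python
from dataclasses import dataclass


@dataclass
class StageTimes:
    """Per-stage durations in seconds. All values used below are illustrative."""
    stt_after_speech: float     # transcription work left once the user stops talking
    llm_first_sentence: float   # time for the LLM to finish its first sentence
    llm_full_reply: float       # time for the LLM to finish the whole reply
    tts_first_sentence: float   # time for TTS to synthesize the first sentence
    tts_full_reply: float       # time for TTS to synthesize the whole reply


def naive_latency(t: StageTimes) -> float:
    """Every stage waits for the previous one to finish completely."""
    return t.stt_after_speech + t.llm_full_reply + t.tts_full_reply


def streamed_latency(t: StageTimes) -> float:
    """Audio starts as soon as the first sentence has been generated and synthesized."""
    return t.stt_after_speech + t.llm_first_sentence + t.tts_first_sentence


times = StageTimes(
    stt_after_speech=0.3,
    llm_first_sentence=0.6,
    llm_full_reply=4.5,
    tts_first_sentence=0.2,
    tts_full_reply=1.5,
)
print(f"naive:    {naive_latency(times):.1f} s")
print(f"streamed: {streamed_latency(times):.1f} s")
```

The total work is identical in both cases. Streaming only changes how much of it has to finish before the user hears anything.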

A minimal streaming loop looks like this:

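import re

import ollama

transcript = "What is the capital of France?"  # placeholder; use the faster-whisper output here


def speak_with_piper(text):
    """Placeholder: synthesize `text` with Piper and play it without blocking."""
    print(text)


buffer = ""
for chunk in ollama.chat(
    model="qwen3:4b",
    messages=[{"role": "user", "content": transcript}],
    stream=True,
):
    buffer += chunk["message"]["content"]
    # flush each complete sentence straight to Piper as it arrives;
    # a sentence ends at ., ! or ? followed by whitespace, or at a newline
    while (m := re.search(r"[.!?]\s|\n", buffer)):
        sentence, buffer = buffer[: m.end()], buffer[m.end():]
        if sentence.strip():
            speak_with_piper(sentence)   # synthesize + play, non-blocking
# speak whatever is left once the stream ends
if buffer.strip():
    speak_with_piper(buffer)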
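The loop only overlaps generation and speech if `speak_with_piper` returns immediately. One possible way to get that, sketched here under assumptions, is a queue drained by a background thread. `SentenceSpeaker` is a hypothetical helper, and the playback function is passed in, so a list's `append` stands in for a real Piper call in this example.

```python
import queue
import threading
from typing import Callable, Optional


class SentenceSpeaker:
    """Plays sentences in order on a background thread so the caller never blocks."""

    def __init__(self, synthesize_and_play: Callable[[str], None]) -> None:
        self._play = synthesize_and_play
        self._sentences: "queue.Queue[Optional[str]]" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def speak(self, sentence: str) -> None:
        """Queue a sentence and return immediately; blank input is ignored."""
        if sentence.strip():
            self._sentences.put(sentence.strip())

    def close(self) -> None:
        """Wait for queued sentences to finish, then stop the worker."""
        self._sentences.put(None)  # sentinel that ends the worker loop
        self._worker.join()

    def _run(self) -> None:
        while (sentence := self._sentences.get()) is not None:
            self._play(sentence)


spoken = []
speaker = SentenceSpeaker(spoken.append)  # stand-in for a real Piper call
speaker.speak("Hello there. ")
speaker.speak("   ")
speaker.speak("How can I help?")
speaker.close()
print(spoken)
```

In the streaming loop you would pass `speaker.speak` wherever `speak_with_piper` is called, and call `speaker.close()` after the final flush. A single worker keeps sentences in order, which matters more than parallel synthesis for speech.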

What latency should you actually expect?

Latency is dominated by hardware and by whether you stream. Below are realistic, approximate figures for the perceived loop — from the moment you stop speaking to the first spoken word back — assuming the sentence-streaming approach above. Treat these as ballpark, not lab numbers; exact values move with model size, quant, and prompt length.

| Hardware | STT model | LLM | Approx. perceived latency | Notes |
|---|---|---|---|---|
| RTX 3060 12GB | Whisper small (faster-whisper) | Qwen3 4B | ~1-1.5 s | All three on the GPU, comfortable |
| RTX 3060 12GB | Whisper medium | Llama 3.1 8B | ~1.5-2 s | Higher STT accuracy, slightly slower |
| RTX 3090/4090 24GB | Whisper medium | Llama 3.1 8B | ~0.7-1.2 s | Headroom for bigger models |
| Raspberry Pi 5 (8GB, CPU) | Whisper base/small | Qwen3 4B | ~5-8 s | CPU-only; usable but not snappy |
| Mac (Apple Silicon, Metal) | Whisper small | Qwen3 4B | ~1-2 s | Unified memory helps a lot |

The single biggest lever after streaming is the Whisper model size. faster-whisper on an RTX 3060 runs Whisper large-v3 well above real time (roughly 3-4x), but for an assistant you rarely need large — Whisper small (244M params) is fast and accurate enough for command-style speech, and dropping from large to small can shave most of a second off the STT stage. The Whisper family sizes are tiny (39M), base (74M), small (244M), medium (769M) and large-v3 (1.55B); for a responsive assistant, small or medium is the sweet spot. For tuning the STT stage specifically — model choice, VAD, and quantization — see our faster-whisper setup guide.
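In faster-whisper, switching model size is a one-argument change. The sketch below assumes the `faster-whisper` package is installed and an NVIDIA GPU is available; `transcribe` and `join_segments` are hypothetical helper names, and the `device` and `compute_type` values are assumptions you should adjust for your machine.

```python
from types import SimpleNamespace
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate the .text of each transcribed segment into one transcript."""
    return " ".join(seg.text.strip() for seg in segments if seg.text.strip())


def transcribe(path: str, size: str = "small") -> str:
    """Transcribe an audio file with faster-whisper (imported lazily)."""
    from faster_whisper import WhisperModel  # pip install faster-whisper

    # Assumed settings for an NVIDIA GPU; on CPU try device="cpu", compute_type="int8".
    model = WhisperModel(size, device="cuda", compute_type="int8_float16")
    # vad_filter skips silence; beam_size=1 trades a little accuracy for speed.
    segments, _info = model.transcribe(path, beam_size=1, vad_filter=True)
    return join_segments(segments)


# Exercise the helper with stand-in segments; no model is loaded here.
demo = [SimpleNamespace(text=" Turn on"), SimpleNamespace(text=" the lights. ")]
transcript = join_segments(demo)
print(transcript)
```

A real assistant would load the model once at startup and reuse it for every turn, since loading is far slower than transcribing a short utterance.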

A note from our own testing

In our own rough testing on an RTX 3060 12GB, with faster-whisper running Whisper small (INT8) plus Qwen3 4B at Q4_K_M and Piper synthesizing per sentence, the perceived "stop talking to first spoken word" loop landed around 1-1.5 seconds for short questions, which is fast enough to feel conversational. Pushing the LLM up to Llama 3.1 8B added a few hundred milliseconds. These are single-machine, eyeballed figures rather than a controlled benchmark, but the takeaway held: the LLM and the TTS-flush strategy, not Whisper, decided how "instant" it felt.
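If you would rather measure than eyeball, a small timer around the token stream is enough to start. This is a sketch under assumptions: `time_to_first_sentence` is a hypothetical helper, and `fake_stream` stands in for a streaming LLM response. It times only the LLM portion, so for the full loop start the clock when the VAD reports end of speech.

```python
import re
import time
from typing import Iterable, Iterator, Tuple

SENTENCE_END = re.compile(r"[.!?]\s")


def time_to_first_sentence(chunks: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds elapsed, text) at the moment the first sentence completes."""
    start = time.perf_counter()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        match = SENTENCE_END.search(buffer)
        if match:
            return time.perf_counter() - start, buffer[: match.end()].strip()
    # Stream ended without a sentence boundary: report the whole reply.
    return time.perf_counter() - start, buffer.strip()


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response, with an artificial per-chunk delay."""
    for piece in ["It is ", "sunny. ", "Expect ", "20 degrees."]:
        time.sleep(0.01)
        yield piece


elapsed, first = time_to_first_sentence(fake_stream())
print(f"first sentence {first!r} after {elapsed:.3f} s")
```

Running it a handful of times with the same question and taking the median gives a steadier number than a single run.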

How does this compare to Moshi?

Moshi, from Kyutai, is a different animal: a single speech-to-speech foundation model (a 7B "Temporal Transformer" paired with the Mimi neural audio codec) that listens and speaks in full duplex. It can talk and listen at the same time, and you can interrupt it mid-sentence. Its theoretical latency is about 160 ms (200 ms in practice on an L4 GPU), roughly an order of magnitude lower than the 1-2 second cascade described here.

| Aspect | Whisper + Ollama + Piper | Moshi (Kyutai) |
|---|---|---|
| Architecture | Cascaded (3 swappable models) | Single end-to-end speech-to-speech |
| Latency | ~1-2 s (RTX 3060) | ~200 ms in practice |
| Full-duplex / interruptible | No (turn-based) | Yes |
| Swap the "brain" LLM | Yes (any Ollama model) | No (fixed model) |
| Tool use / function calling | Easy (it is just an LLM) | Limited |
| VRAM footprint | Flexible (4 GB and up) | ~7B model, heavier |
| Maturity for assistants | Production-ready, very common | Research-forward, conversational |

Pick Moshi when natural, interruptible, low-latency conversation is the product. Pick the Whisper + Ollama + Piper cascade when you want a controllable assistant — pick your own LLM, add retrieval or tool calls, choose any Piper voice, run on a Pi — and a one-to-two-second turn-based response is acceptable. For most "answer my question / control my house" assistants, the cascade wins on flexibility; for a companion you actually converse with, Moshi wins on feel.

How does this compare to Home Assistant Assist?

If your goal is a smart-home voice assistant, you probably should not build the pipeline from scratch, because Home Assistant already ships this exact cascade. Home Assistant's Assist speaks the Wyoming protocol natively: a socket-based interface that lets voice services plug together, with faster-whisper and Piper each exposed as a Wyoming service. As of 2026 this runs locally with sub-second responses on modest hardware and can route the "brain" to a local Ollama model for open-ended questions.

So the two are not really competitors:

  • Build the raw Whisper + Ollama + Piper pipeline yourself when you want a standalone assistant app, a kiosk, a custom product, or you need control over every stage and prompt.
  • Use Home Assistant Assist when the assistant's main job is controlling your home — it gives you wake words, satellite hardware, exposed-entity grounding, and the Whisper/Piper plumbing already wired through Wyoming, so you write almost no code.

For the smart-home route end to end — installing the Wyoming Whisper and Piper add-ons and pointing Assist at Ollama — see our guide to a local AI + Home Assistant setup.

Key Takeaways

  1. A local voice assistant is three swappable models: faster-whisper (STT) to Ollama (LLM) to Piper (TTS), and it runs fully offline with no cloud or per-request cost.
  2. Stream at sentence boundaries. Flush each finished sentence from the LLM straight to Piper and start playing it — this is what turns a 6-8 second assistant into a 1-2 second one on the same hardware.
  3. Pick the brain for speed: Qwen3 4B (~2.5 GB, Apr 2025) for lowest latency and low VRAM; Llama 3.1 8B (~4.9 GB Q4, Jul 2024) for best quality on a 12 GB GPU.
  4. Expect ~1-2 s on an RTX 3060 and ~5-8 s on a Raspberry Pi 5. Whisper small (244M) is the sweet spot — you rarely need large-v3 for an assistant.
  5. Moshi (~200 ms, full-duplex) beats the cascade on conversational feel; the cascade beats Moshi on flexibility (any LLM, tool use, any voice, runs on a Pi). For smart homes, Home Assistant Assist already ships this stack via Wyoming.

Published: June 20, 2026 · Last Updated: June 20, 2026
