
Build a Voice AI Agent in 60 Minutes

A realtime voice agent that listens, thinks, and talks back — sub-second latency, runs entirely on your own machine.

Time: 60 minutes · Tools: Python 3.10+, a microphone, and a GPU with 6 GB+ VRAM (a CPU also works, just slower)
1. Pick the three pieces

    A realtime voice agent is just three models stitched together: speech-to-text (Whisper or faster-whisper), an LLM (any chat model via Ollama), and text-to-speech (Piper, Coqui, or any OpenAI-compatible TTS). Pick the smallest viable model for each: `faster-whisper-base`, `llama3.1:8b`, and `piper:en_US-amy-medium` make a solid starting point.

2. Capture and stream audio

    `pip install pyaudio webrtcvad`. Capture mic input at 16kHz with `pyaudio` and use `webrtcvad` to detect when the user starts and stops talking — this is voice activity detection and it's the difference between an agent that feels real and one that interrupts you mid-sentence. Buffer audio between VAD events.
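The VAD-gated buffering can be sketched as follows. This is a minimal sketch, assuming frames arrive as raw 16 kHz 16-bit mono PCM (10/20/30 ms each, which is what `webrtcvad` accepts) and that `vad` is anything with an `is_speech(frame, rate)` method, such as `webrtcvad.Vad(2)`:

```python
# Collect frames while the VAD reports speech; emit the whole utterance
# once enough consecutive silent frames have been heard.
SAMPLE_RATE = 16000

def segment_utterances(frames, vad, silence_frames=10):
    buf, in_speech, silent = [], False, 0
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            buf.append(frame)
            in_speech, silent = True, 0
        elif in_speech:
            # keep a little trailing audio so words are not clipped
            buf.append(frame)
            silent += 1
            if silent >= silence_frames:
                yield b"".join(buf)
                buf, in_speech, silent = [], False, 0
    if in_speech and buf:  # stream ended mid-utterance
        yield b"".join(buf)
```

In the live loop, `frames` would come from repeated `stream.read(...)` calls on a `pyaudio` input stream opened with `rate=16000`, `channels=1`, `format=pyaudio.paInt16`.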

3. Transcribe with Whisper

    Pass the buffered audio to `faster_whisper.WhisperModel("base").transcribe(audio_array)`, then join the text of the returned segments. Base is fast enough for realtime on a modest GPU; for lower latency, use `whisper.cpp` with a quantized model. The output is plain text: your input to the LLM.
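One detail worth spelling out: the utterance buffer is raw 16-bit PCM bytes, while `transcribe()` wants a float32 array in [-1, 1]. A small conversion helper, with the faster-whisper call sketched in comments (model name and GPU arguments are example choices):

```python
import numpy as np

def pcm16_to_float32(pcm_bytes: bytes) -> np.ndarray:
    """Turn a raw 16-bit PCM buffer into the float32 array in [-1.0, 1.0]
    that faster-whisper's transcribe() accepts directly."""
    return np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0

# Usage sketch (downloads the model on first run; GPU args optional):
# from faster_whisper import WhisperModel
# model = WhisperModel("base", device="cuda", compute_type="int8_float16")
# segments, _info = model.transcribe(pcm16_to_float32(utterance))
# text = " ".join(seg.text.strip() for seg in segments)
```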

4. LLM with conversation memory

    Maintain a `messages` list with the system prompt + last N turns. Send to Ollama via `ollama.chat(model="llama3.1:8b", messages=messages, stream=True)`. Stream the tokens — don't wait for the full response — because TTS can start speaking the first sentence while the LLM is still generating the second.
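A sketch of the memory trimming, with the streaming call shown in comments. The helper name and `max_turns` value are illustrative, and the comments assume the `ollama` Python package with a pulled `llama3.1:8b` model:

```python
def trim_history(messages, max_turns=8):
    """Keep the system prompt plus only the last `max_turns` user/assistant
    messages, so prompt size (and therefore latency) stays bounded."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

# Streaming call sketch:
# import ollama
# for chunk in ollama.chat(model="llama3.1:8b",
#                          messages=trim_history(messages), stream=True):
#     on_token(chunk["message"]["content"])  # hand tokens to the TTS stage
```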

5. TTS in real time

    Pipe the streaming LLM output into Piper sentence by sentence; each sentence becomes a wav blob you push to the audio output. Total perceived latency lands around 700 ms from the moment the user stops talking to the moment the agent starts, which is in the same ballpark as ChatGPT voice mode while running on your own hardware.
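The sentence-by-sentence handoff can be sketched as a small generator that sits between the token stream and the TTS engine. A minimal sketch, using a simple punctuation regex (good enough for a first build; abbreviations like "Dr." will trip it):

```python
import re

# A sentence ends at ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_stream(token_stream):
    """Accumulate streamed LLM tokens and yield each sentence the moment
    it is complete, so TTS playback starts before the reply is finished."""
    buf = ""
    for token in token_stream:
        buf += token
        parts = _SENTENCE_END.split(buf)
        for sentence in parts[:-1]:  # all but the last part are complete
            yield sentence.strip()
        buf = parts[-1]
    if buf.strip():  # flush whatever the stream ended on
        yield buf.strip()
```

Each yielded sentence can then be synthesized, e.g. by piping it to a Piper process and pushing the resulting wav bytes to the audio output.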

6. What you do not get from a 60-min build

    Production voice agents need turn-taking interrupts (the agent shutting up when the user starts speaking), barge-in detection, eval sets for transcription accuracy, telephony integration (Twilio / SIP), and observability for tracking a sub-second latency budget. The Voice AI course on Local AI Master walks through all of that, with capstones for telephony deployment.

Want the full path?

Continue with the full Voice AI and Realtime Agents course on Local AI Master.

This page is one chapter of a structured course covering everything from foundations to production. Try Pro free for 7 days — full access to all 264 chapters across 10 courses, no charge until day 8, cancel anytime.

No charge for 7 days · $9.99/mo after · cancel anytime