
Moshi Real-Time Speech-to-Speech Guide (2026): Sub-200ms Local Voice AI

May 1, 2026
20 min read
LocalAimaster Research Team

Moshi is the first open-source speech-to-speech foundation model that delivers genuinely real-time voice AI. Sub-200 ms latency, full-duplex (listens and speaks simultaneously), processes speech tokens directly without pipelined STT/LLM/TTS — the open-source counterpart to OpenAI's GPT-4o voice mode and Realtime API. Built by Kyutai, the Paris-based open-source AI lab.

For voice agents, conversational interfaces, language practice apps, and any voice product where pipelined latency makes interactions feel "off" — Moshi is the new architectural option. This guide covers setup, the Mimi audio codec foundation, real-time WebSocket streaming, the hybrid Moshi+LLM+TTS pattern for production, latency tuning, and where Moshi belongs vs. traditional pipelines.

Table of Contents

  1. What Moshi Is
  2. Why Real-Time Speech-to-Speech Matters
  3. Architecture: Mimi Codec + Helium LLM Backbone
  4. Moshi vs OpenAI Realtime / GPT-4o Voice
  5. Hardware Requirements
  6. Installation
  7. Standalone Moshi Server
  8. WebSocket Streaming Client
  9. Hybrid Pattern: Moshi + LLM + TTS
  10. Voice Agent Frameworks (Pipecat, LiveKit)
  11. Latency Tuning
  12. The Mimi Audio Codec
  13. Multilingual Status
  14. Performance Benchmarks
  15. Use Cases Where Moshi Wins
  16. Production Considerations
  17. Troubleshooting
  18. FAQ


What Moshi Is {#what-it-is}

Moshi (kyutai-labs/moshi) is a 7B-parameter speech-to-speech foundation model released by Kyutai in late 2024. Key architectural decisions:

  • Full-duplex — listens and speaks simultaneously, with separate audio streams for input and output
  • Speech-token native — input audio → Mimi codec tokens → Helium LLM backbone → Mimi tokens → output audio
  • Real-time — sub-200 ms end-to-end latency
  • Streaming — both input and output stream continuously; no batching delays
  • Open weights — Apache 2.0 license

Project: github.com/kyutai-labs/moshi. Reference paper: "Moshi: a speech-text foundation model for real-time dialogue" (Kyutai, 2024).


Why Real-Time Speech-to-Speech Matters {#why-rt}

Traditional voice agent pipeline:

User audio → [Whisper STT, 300-500ms] → text
                                          ↓
Text response ← [LLM, 500-1500ms] ← text
                                          ↓
Output text → [TTS, 200-500ms] → audio → user

Total: 1000-2500 ms per turn. Feels laggy. Cannot interrupt. Cannot backchannel. Cannot speak while thinking.

Moshi:

User audio (continuous) ──┐
                          ▼
                  [Moshi, ~200ms latency]
                          │
                          ▼
                  Output audio (continuous, can overlap with input)

Total: 200 ms. Feels conversational. Can be interrupted. Can backchannel. Naturally turn-takes.

The latency improvement is qualitative, not just quantitative — voice interaction goes from "talking to a robot" to "talking to a person."


Architecture: Mimi Codec + Helium LLM Backbone {#architecture}

Moshi has three components:

  1. Mimi — neural audio codec, 1.1 kbps, 12.5 Hz frame rate. Encodes audio → discrete tokens; decodes tokens → audio.
  2. Helium 7B LLM — Kyutai's 7B text language model backbone, further trained jointly on text and audio tokens to become Moshi.
  3. Inner Monologue — a parallel text stream the model uses internally as scratchpad reasoning before producing audio output.

Inner Monologue is the key innovation: the model "thinks in text" while speaking, getting the reasoning quality of a text LLM into a real-time speech model.
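
A conceptual sketch of the per-frame loop may make this concrete. This is not the real Moshi API (actual Mimi calls are shown in the codec section below); lm_generate_frame is a hypothetical stand-in for the Helium step that emits the inner-monologue text token and Moshi's own audio tokens for each 80 ms frame:

# Conceptual sketch only; lm_generate_frame() is a hypothetical placeholder
# for the Helium backbone step, not a function in the moshi package.
def stream_turn(mimi, lm_generate_frame, mic_frames):
    for frame in mic_frames:                        # one 80 ms chunk of user audio
        user_codes = mimi.encode(frame)             # audio -> discrete Mimi tokens
        # The backbone predicts the inner-monologue text token first,
        # then the audio tokens Moshi will speak for the same time step.
        text_token, moshi_codes = lm_generate_frame(user_codes)
        yield mimi.decode(moshi_codes)              # Moshi's reply audio for this step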



Moshi vs OpenAI Realtime / GPT-4o Voice {#vs-openai}

| Property             | Moshi             | OpenAI Realtime / GPT-4o Voice |
|----------------------|-------------------|--------------------------------|
| Latency              | <200 ms           | 300-500 ms                     |
| Voice quality (MOS)  | 3.8               | 4.5                            |
| Reasoning quality    | 7B-level          | Frontier-model-level           |
| Languages            | English           | 50+                            |
| Privacy              | Local             | Cloud                          |
| Cost                 | GPU + electricity | $0.06/minute                   |
| Voice variety        | Limited           | Many                           |
| Tool calling         | Limited           | Full OpenAI tools              |
| License              | Apache 2.0        | Closed                         |

For privacy-sensitive / cost-sensitive applications: Moshi. For maximum voice quality + reasoning + multilingual: OpenAI. Moshi is the right open-source primitive.


Hardware Requirements {#requirements}

| Hardware           | Latency       |
|--------------------|---------------|
| RTX 4090 (FP16)    | <200 ms       |
| RTX 4070 (FP16)    | ~250 ms       |
| RTX 3090 (FP16)    | ~250 ms       |
| Apple M4 Max (MLX) | ~250 ms       |
| RTX 3060 (Q8)      | ~350 ms       |
| CPU only           | Not real-time |

VRAM 16 GB+ for FP16, 10 GB+ with quantization. The model is 7B parameters; this is comparable to running Llama 3.1 8B with KV cache for streaming.
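
The rough VRAM arithmetic behind those numbers (a back-of-the-envelope estimate, not a measurement):

# Back-of-the-envelope VRAM estimate for the 7B backbone.
params = 7e9
weights_fp16_gb = params * 2 / 1024**3      # ~13 GB in FP16/BF16
weights_q8_gb   = params * 1 / 1024**3      # ~6.5 GB with 8-bit weights
print(f"FP16 weights ~{weights_fp16_gb:.1f} GB, Q8 ~{weights_q8_gb:.1f} GB")
# Add headroom for the Mimi codec, the streaming KV cache, and activations,
# hence the 16 GB (FP16) / 10 GB (Q8) guidance above.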


Installation {#installation}

python3.11 -m venv ~/venvs/moshi
source ~/venvs/moshi/bin/activate

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install moshi

For MLX (Apple Silicon):

pip install moshi-mlx

For Rust standalone (lowest latency, no Python):

cargo install --features cuda moshi-server

Model weights download automatically on first run from Hugging Face (kyutai/moshiko-pytorch-bf16 and kyutai/moshika-pytorch-bf16; Moshiko and Moshika are the male and female voice variants).
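
A quick sanity check after installation (assumes the CUDA build of PyTorch installed above; on Apple Silicon you would check the MLX package instead):

# Verify the package imports and a GPU is visible before starting the server.
import torch, moshi

print("moshi:", getattr(moshi, "__version__", "installed"))
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))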


Standalone Moshi Server {#standalone}

# Python server
python -m moshi.server

# Rust server (lower latency)
moshi-server worker --config config.toml

Default port 8998 with WebSocket endpoint at /api/chat. Browse to the included web demo at http://localhost:8998 to test.

For Docker:

docker run --gpus all -p 8998:8998 \
    ghcr.io/kyutai-labs/moshi-server:latest
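
Before wiring up audio, a minimal check that the server accepts WebSocket connections (assumes the default port and the /api/chat path above):

import asyncio
import websockets

async def check():
    # Open and immediately close a connection; the full-duplex audio
    # client in the next section does the real streaming.
    async with websockets.connect("ws://localhost:8998/api/chat"):
        print("Moshi server is up and accepting WebSocket connections")

asyncio.run(check())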

WebSocket Streaming Client {#websocket}

import asyncio, websockets, sounddevice as sd, numpy as np

async def voice_client():
    loop = asyncio.get_running_loop()  # needed to hand mic frames to the event loop
    async with websockets.connect("ws://localhost:8998/api/chat") as ws:
        # Audio in: capture mic, send 24 kHz mono float32 frames
        def mic_callback(indata, frames, time, status):
            asyncio.run_coroutine_threadsafe(ws.send(indata.tobytes()), loop)

        with sd.InputStream(samplerate=24000, channels=1, dtype="float32",
                            callback=mic_callback), \
             sd.OutputStream(samplerate=24000, channels=1, dtype="float32") as speaker:
            # Audio out: write received frames to a persistent output stream
            # (calling sd.play() per frame would restart playback and glitch)
            async for message in ws:
                audio = np.frombuffer(message, dtype="float32")
                speaker.write(audio.reshape(-1, 1))

asyncio.run(voice_client())

Both directions stream simultaneously — Moshi listens and speaks at the same time.


Hybrid Pattern: Moshi + LLM + TTS {#hybrid}

For production voice agents that need richer reasoning, tool calling, multilingual support, or voice cloning, combine Moshi with separate components:

User audio
    │
    ▼
[Moshi for real-time backchannel + ASR]
    │
    ▼
Text transcript
    │
    ▼
[Llama 3.1 / Qwen 2.5 with tool calling]
    │
    ▼
Response text
    │
    ▼
[F5-TTS / OpenVoice for cloned-voice output]
    │
    ▼
User audio (delayed by ~500-800 ms vs Moshi pure)

Trade-off: lose the sub-200 ms latency, gain richer reasoning, multilingual support, and voice cloning. For most production agents in 2026, hybrid is the right pattern.
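
As glue code, the hybrid loop can be as simple as the sketch below; transcribe_stream, llm_respond, and tts_speak are hypothetical placeholders for whichever ASR, LLM, and TTS components you choose:

# Hypothetical glue code for the hybrid pattern. The three helpers stand in
# for your chosen components (e.g. Moshi/Whisper for ASR, a local LLM with
# tool calling, F5-TTS or OpenVoice for cloned-voice output).
async def hybrid_turn(audio_chunks, transcribe_stream, llm_respond, tts_speak):
    transcript = await transcribe_stream(audio_chunks)   # user audio -> text
    reply_text = await llm_respond(transcript)           # text -> text (tools, multilingual)
    async for audio_frame in tts_speak(reply_text):      # text -> audio, streamed
        yield audio_frame                                 # play back to the user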


Voice Agent Frameworks (Pipecat, LiveKit) {#frameworks}

Two popular orchestration frameworks for voice agents:

  • Pipecat (daily-co/pipecat) — Python framework for building real-time voice/video agents. Supports Moshi as a service. Handles audio I/O, VAD, turn detection, tool calling.
  • LiveKit Agents — TypeScript / Python framework with WebRTC infrastructure. Used in production by many voice products.

Both let you swap STT / LLM / TTS components, including Moshi as the speech layer + a separate LLM for reasoning.


Latency Tuning {#latency}

Real-time voice latency budget targets:

  • Network round-trip: 50 ms
  • Microphone capture buffer: 20 ms
  • Model inference: 100-150 ms
  • Speaker playback buffer: 30 ms
  • Total target: <200 ms

Tuning levers:

  • Use Rust server instead of Python (~30 ms savings)
  • Lock GPU clocks with nvidia-smi -lgc for deterministic latency
  • Disable speaker auto-mute in OS audio settings
  • Wired headset vs Bluetooth (BT adds 100-200 ms)
  • Moshi Q8 quant for tighter VRAM and faster decode (slight quality loss)
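
One way to sanity-check the budget is to time the server's output frames directly. The sketch below sends silent frames at real-time pace and measures the gap between received audio messages (it assumes the default endpoint and the raw float32 frame format used by the client above; with a healthy setup you should see a steady cadence near the 80 ms frame size, and large spikes point at network jitter or buffer problems):

# Rough cadence check: pace silent input frames and time the server's output.
import asyncio, time
import numpy as np
import websockets

async def send_silence(ws, frame, n):
    for _ in range(n):
        await ws.send(frame)
        await asyncio.sleep(0.08)                          # pace input at real time

async def measure(n_frames=50):
    silence = np.zeros(1920, dtype="float32").tobytes()    # 80 ms at 24 kHz
    async with websockets.connect("ws://localhost:8998/api/chat") as ws:
        sender = asyncio.create_task(send_silence(ws, silence, n_frames))
        last = time.monotonic()
        for _ in range(n_frames):
            await ws.recv()
            now = time.monotonic()
            print(f"frame gap: {(now - last) * 1000:.1f} ms")   # expect ~80 ms
            last = now
        sender.cancel()

asyncio.run(measure())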

The Mimi Audio Codec {#mimi}

Mimi is Moshi's neural audio codec. Specs:

  • 1.1 kbps bitrate
  • 12.5 Hz frame rate (80 ms per frame)
  • 24 kHz mono audio
  • VBR available

Use Mimi standalone:

from huggingface_hub import hf_hub_download
from moshi.models import loaders

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cuda")
mimi.set_num_codebooks(8)                     # Moshi uses 8 of Mimi's codebooks

codes = mimi.encode(audio_tensor)             # [B, 1, T] 24 kHz audio -> discrete tokens
audio = mimi.decode(codes)                    # tokens -> audio

Useful for: audio language models, real-time codecs for VoIP, audio compression research.


Multilingual Status {#multilingual}

English-only as of mid-2026. Kyutai has announced multilingual variants are in development. For non-English real-time voice today: hybrid pipeline (Faster-Whisper streaming + LLM + TTS in target language) or commercial APIs (OpenAI Realtime supports 50+ languages).


Performance Benchmarks {#benchmarks}

End-to-end voice agent latency, RTX 4090:

| Stack                                                | Latency      | Notes                        |
|------------------------------------------------------|--------------|------------------------------|
| Moshi (pure)                                         | 180 ms       | English only, limited voices |
| Moshi + Llama hybrid                                 | 600-800 ms   | Multilingual, tool-capable   |
| Faster-Whisper streaming + Llama + F5-TTS            | 800-1200 ms  | Most flexible                |
| OpenAI Realtime API                                  | 350-500 ms   | Cloud, paid                  |
| OpenAI GPT-4o voice                                  | 350-500 ms   | Cloud, paid                  |
| Standard pipeline (offline Whisper + Llama + Coqui)  | 1500-3000 ms | Avoid for real-time          |

For latency-critical applications (gaming, accessibility), Moshi pure is the only sub-300ms option that's open-source.


Use Cases Where Moshi Wins {#use-cases}

  1. Privacy-sensitive voice agents — healthcare, legal, financial advisors
  2. Language practice partners — sub-200 ms feels natural for conversation flow
  3. Voice diary / journaling — local-only conversational journaling
  4. Voice assistants for accessibility — sub-200 ms reduces cognitive load for users with disabilities
  5. In-game NPCs — runtime conversational characters
  6. Voice control for desktop apps — fast push-to-talk
  7. Telephony bots — when the latency budget for SIP integration is tight

For 7B reasoning depth + voice quality + multilingual: hybrid pipeline or OpenAI Realtime.


Production Considerations {#production}

For shipping Moshi in a product:

  • Graceful degradation — fall back to text if audio fails
  • VAD edge cases — handle long silences, music, multi-speaker scenarios
  • Network jitter — buffer 50-100 ms; reconnect on disconnect (see the reconnect sketch after this list)
  • Hot reload — keep the model loaded; restart connections without unloading it
  • Monitoring — track p50/p95/p99 latency per session
  • Load — single H100 or RTX 4090 handles ~5-10 concurrent sessions
  • Privacy / compliance — check whether synthetic-audio disclosure laws in your jurisdiction apply
  • Logging — log transcripts (text), not raw audio, for compliance and debugging
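
For the network-jitter point above, a minimal reconnect-with-backoff wrapper (a sketch; run_session stands in for the streaming client shown earlier, and any per-session state should live outside it so a reconnect can resume cleanly):

# Reconnect with exponential backoff so a dropped WebSocket doesn't end the call.
import asyncio
import websockets

async def resilient_session(url, run_session, max_backoff=10.0):
    backoff = 0.5
    while True:
        try:
            async with websockets.connect(url) as ws:
                backoff = 0.5                 # reset after a good connection
                await run_session(ws)         # placeholder for the audio loop
                return                        # session ended normally
        except (websockets.ConnectionClosed, OSError):
            await asyncio.sleep(backoff)      # transient failure: wait and retry
            backoff = min(backoff * 2, max_backoff)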

Troubleshooting {#troubleshooting}

| Symptom               | Cause                         | Fix                                              |
|-----------------------|-------------------------------|--------------------------------------------------|
| Latency above 300 ms  | Wrong server / Bluetooth      | Use the Rust server and a wired headset          |
| Audio glitches        | Buffer underrun               | Increase the audio buffer from 20 to 40 ms       |
| Model OOM             | FP16 too tight for VRAM       | Use the Q8 quantized variant                     |
| WebSocket disconnects | Network jitter                | Add reconnect logic that preserves session state |
| No AMD support        | ROCm not supported yet        | Use CUDA / MPS / CPU; track the project for ROCm |
| Limited voice variety | Only trained voices available | Use the hybrid pattern with a voice-cloning TTS  |

FAQ {#faq}

See answers to common Moshi questions below.


Sources: Moshi GitHub | Moshi paper | Kyutai | Pipecat | Internal benchmarks RTX 4090, M4 Max.
