Moshi Real-Time Speech-to-Speech Guide (2026): Sub-200ms Local Voice AI
Moshi is the first open-source speech-to-speech foundation model that delivers genuinely real-time voice AI. Sub-200 ms latency, full-duplex (listens and speaks simultaneously), processes speech tokens directly without pipelined STT/LLM/TTS — the open-source counterpart to OpenAI's GPT-4o voice mode and Realtime API. Built by Kyutai, the Paris-based open-source AI lab.
For voice agents, conversational interfaces, language practice apps, and any voice product where pipelined latency makes interactions feel "off" — Moshi is the new architectural option. This guide covers setup, the Mimi audio codec foundation, real-time WebSocket streaming, the hybrid Moshi+LLM+TTS pattern for production, latency tuning, and where Moshi belongs vs. traditional pipelines.
Table of Contents
- What Moshi Is
- Why Real-Time Speech-to-Speech Matters
- Architecture: Mimi Codec + Helium LLM Backbone
- Moshi vs OpenAI Realtime / GPT-4o Voice
- Hardware Requirements
- Installation
- Standalone Moshi Server
- WebSocket Streaming Client
- Hybrid Pattern: Moshi + LLM + TTS
- Voice Agent Frameworks (Pipecat, LiveKit)
- Latency Tuning
- The Mimi Audio Codec
- Multilingual Status
- Performance Benchmarks
- Use Cases Where Moshi Wins
- Production Considerations
- Troubleshooting
What Moshi Is {#what-it-is}
Moshi (kyutai-labs/moshi) is a 7B-parameter speech-to-speech foundation model released by Kyutai in late 2024. Key architectural decisions:
- Full-duplex — listens and speaks simultaneously, with separate audio streams for input and output
- Speech-token native — input audio → Mimi codec tokens → Helium LLM backbone → Mimi tokens → output audio
- Real-time — sub-200 ms end-to-end latency
- Streaming — both input and output stream continuously; no batching delays
- Open weights — Apache 2.0 license
Project: github.com/kyutai-labs/moshi. Reference paper: "Moshi: a speech-text foundation model for real-time dialogue" (Kyutai, 2024).
Why Real-Time Speech-to-Speech Matters {#why-rt}
Traditional voice agent pipeline:
User audio → [Whisper STT, 300-500 ms] → text
           → [LLM, 500-1500 ms] → response text
           → [TTS, 200-500 ms] → audio → user
Total: 1000-2500 ms per turn. Feels laggy. Cannot interrupt. Cannot backchannel. Cannot speak while thinking.
Moshi:
User audio (continuous) → [Moshi, ~200 ms latency] → output audio (continuous, can overlap with input)
Total: 200 ms. Feels conversational. Can be interrupted. Can backchannel. Naturally turn-takes.
The latency improvement is qualitative, not just quantitative — voice interaction goes from "talking to a robot" to "talking to a person."
Architecture: Mimi Codec + Helium LLM Backbone {#architecture}
Moshi has three components:
- Mimi — neural audio codec, 1.1 kbps, 12.5 Hz frame rate. Encodes audio → discrete tokens; decodes tokens → audio.
- Helium 7B LLM — Kyutai's 7B language model trained on text + audio tokens jointly.
- Inner Monologue — a parallel text stream the model uses internally as scratchpad reasoning before producing audio output.
Inner Monologue is the key innovation: the model "thinks in text" while speaking, getting the reasoning quality of a text LLM into a real-time speech model.
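As a rough sanity check on those numbers, the stated bitrate and frame rate imply about 100 audio tokens per second per stream. This is a back-of-the-envelope sketch; the 2048-entry (11-bit) codebook size is an assumption, not something stated above.

# Back-of-the-envelope token rates from the Mimi specs above.
frame_rate_hz = 12.5      # Mimi frames per second
bitrate_bps = 1100        # 1.1 kbps
bits_per_token = 11       # assumes 2048-entry codebooks (log2(2048) = 11)
tokens_per_second = bitrate_bps / bits_per_token           # 100.0
codebooks_per_frame = tokens_per_second / frame_rate_hz    # 8.0
print(tokens_per_second, codebooks_per_frame)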
Moshi vs OpenAI Realtime / GPT-4o Voice {#vs-openai}
| Property | Moshi | OpenAI Realtime / GPT-4o Voice |
|---|---|---|
| Latency | <200 ms | 300-500 ms |
| Voice quality (MOS) | 3.8 | 4.5 |
| Reasoning quality | 7B-level | Frontier-level (GPT-4o) |
| Languages | English | 50+ |
| Privacy | Local | Cloud |
| Cost | GPU + electricity | $0.06/minute |
| Voice variety | Limited | Many |
| Tool calling | Limited | Full OpenAI tools |
| License | Apache 2.0 | Closed |
For privacy-sensitive / cost-sensitive applications: Moshi. For maximum voice quality + reasoning + multilingual: OpenAI. Moshi is the right open-source primitive.
Hardware Requirements {#requirements}
| Hardware | Latency |
|---|---|
| RTX 4090 (FP16) | <200 ms |
| RTX 4070 (FP16) | ~250 ms |
| RTX 3090 (FP16) | ~250 ms |
| Apple M4 Max (MLX) | ~250 ms |
| RTX 3060 (Q8) | ~350 ms |
| CPU only | Not real-time |
VRAM 16 GB+ for FP16, 10 GB+ with quantization. The model is 7B parameters; this is comparable to running Llama 3.1 8B with KV cache for streaming.
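For a quick sense of where the 16 GB figure comes from, here is a rough estimate; the ~2 GB overhead term for KV cache, activations, and the Mimi codec is an assumption and varies with context length.

# Rough FP16 VRAM estimate for a 7B model.
params = 7e9
weights_gb = params * 2 / 1e9      # 2 bytes per parameter in FP16 -> ~14 GB
overhead_gb = 2                    # assumed KV cache + activations + codec
print(f"~{weights_gb + overhead_gb:.0f} GB")   # ~16 GB, matching the table above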
Installation {#installation}
python3.11 -m venv ~/venvs/moshi
source ~/venvs/moshi/bin/activate
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install moshi
For MLX (Apple Silicon):
pip install moshi-mlx
For Rust standalone (lowest latency, no Python):
cargo install --features cuda moshi-server
Model weights download automatically on first run from Hugging Face (kyutai/moshiko-pytorch-bf16 or kyutai/moshika-pytorch-bf16 for the two voices, Moshiko and Moshika).
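A quick sanity check that the install worked before launching the server (the version lookup uses standard package metadata):

import torch
from importlib.metadata import version

print("CUDA available:", torch.cuda.is_available())
print("moshi package:", version("moshi"))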
Standalone Moshi Server {#standalone}
# Python server
python -m moshi.server
# Rust server (lower latency)
moshi-server worker --config config.toml
Default port 8998 with WebSocket endpoint at /api/chat. Browse to the included web demo at http://localhost:8998 to test.
For Docker:
docker run --gpus all -p 8998:8998 \
ghcr.io/kyutai-labs/moshi-server:latest
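Once the server is running, a minimal handshake check against the endpoint above confirms it is reachable (this sketch only opens the WebSocket; it does not send audio):

import asyncio
import websockets

async def check():
    # Default port 8998 and /api/chat endpoint, as noted above.
    async with websockets.connect("ws://localhost:8998/api/chat"):
        print("WebSocket handshake OK")

asyncio.run(check())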
WebSocket Streaming Client {#websocket}
import asyncio
import numpy as np
import sounddevice as sd
import websockets

async def voice_client():
    loop = asyncio.get_running_loop()
    async with websockets.connect("ws://localhost:8998/api/chat") as ws:
        # Audio in: capture mic, send 24 kHz mono frames (callback runs on the audio thread)
        def mic_callback(indata, frames, time, status):
            asyncio.run_coroutine_threadsafe(ws.send(indata.tobytes()), loop)

        with sd.InputStream(samplerate=24000, channels=1, dtype="float32",
                            callback=mic_callback):
            # Audio out: receive frames, play (simplified; production code would
            # feed a continuous OutputStream instead of calling play per frame)
            async for message in ws:
                audio = np.frombuffer(message, dtype="float32")
                sd.play(audio, samplerate=24000)

asyncio.run(voice_client())
Both directions stream simultaneously — Moshi listens and speaks at the same time.
Hybrid Pattern: Moshi + LLM + TTS {#hybrid}
For production voice agents that need richer reasoning + tool calling + multilingual + voice cloning, combine Moshi with separate components:
User audio
│
▼
[Moshi for real-time backchannel + ASR]
│
▼
Text transcript
│
▼
[Llama 3.1 / Qwen 2.5 with tool calling]
│
▼
Response text
│
▼
[F5-TTS / OpenVoice for cloned-voice output]
│
▼
User audio (delayed by ~500-800 ms vs Moshi pure)
Trade-off: lose the sub-200 ms latency, gain richer reasoning, multilingual support, and voice cloning. For most production agents in 2026, hybrid is the right pattern.
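A minimal single-turn sketch of that hybrid loop, assuming faster-whisper for transcription, a local OpenAI-compatible LLM endpoint on port 8000, and a hypothetical synthesize() wrapper around your TTS of choice (F5-TTS, OpenVoice, etc.); the endpoint URL and model name are placeholders:

import requests
from faster_whisper import WhisperModel

asr = WhisperModel("small", device="cuda", compute_type="float16")

def run_turn(wav_path: str) -> bytes:
    # 1. ASR: transcribe the user's utterance.
    segments, _ = asr.transcribe(wav_path)
    transcript = " ".join(seg.text for seg in segments)

    # 2. LLM: local OpenAI-compatible server (URL and model name are assumptions).
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={"model": "llama-3.1-8b-instruct",
              "messages": [{"role": "user", "content": transcript}]},
        timeout=30,
    )
    reply = resp.json()["choices"][0]["message"]["content"]

    # 3. TTS: synthesize() is a hypothetical wrapper around F5-TTS / OpenVoice.
    return synthesize(reply)

In a real agent, Moshi's streaming transcript replaces the file-based ASR step, and the synthesized audio streams back over the same channel.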
Voice Agent Frameworks (Pipecat, LiveKit) {#frameworks}
Two popular orchestration frameworks for voice agents:
- Pipecat (daily-co/pipecat) — Python framework for building real-time voice/video agents. Supports Moshi as a service. Handles audio I/O, VAD, turn detection, tool calling.
- LiveKit Agents — TypeScript / Python framework with WebRTC infrastructure. Used in production by many voice products.
Both let you swap STT / LLM / TTS components, including Moshi as the speech layer + a separate LLM for reasoning.
Latency Tuning {#latency}
Real-time voice latency budget targets:
- Network round-trip: 50 ms
- Microphone capture buffer: 20 ms
- Model inference: 100-150 ms
- Speaker playback buffer: 30 ms
- Total target: ~200 ms
Tuning levers (a round-trip measurement sketch follows this list):
- Use Rust server instead of Python (~30 ms savings)
- GPU power state lock with nvidia-smi -lgc for deterministic latency
- Disable speaker auto-mute in OS audio settings
- Wired headset vs Bluetooth (BT adds 100-200 ms)
- Moshi Q8 quant for tighter VRAM and faster decode (slight quality loss)
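To measure the network component of the budget above on your own setup, a minimal WebSocket round-trip test (this measures transport RTT only, not model inference):

import asyncio
import time
import websockets

async def measure_rtt(url="ws://localhost:8998/api/chat", n=10):
    async with websockets.connect(url) as ws:
        samples = []
        for _ in range(n):
            t0 = time.perf_counter()
            pong_waiter = await ws.ping()       # send a ping frame
            await pong_waiter                    # resolves when the pong arrives
            samples.append((time.perf_counter() - t0) * 1000)
        samples.sort()
        print(f"p50 {samples[n // 2]:.1f} ms, max {samples[-1]:.1f} ms")

asyncio.run(measure_rtt())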
The Mimi Audio Codec {#mimi}
Mimi is Moshi's neural audio codec. Specs:
- 1.1 kbps bitrate
- 12.5 Hz frame rate (80 ms per frame)
- 24 kHz mono audio
- VBR available
Use Mimi standalone:
from huggingface_hub import hf_hub_download
from moshi.models import loaders

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)  # download checkpoint
mimi = loaders.get_mimi(mimi_weight, device="cuda")
codes = mimi.encode(audio_tensor)   # [B, 1, T] float audio at 24 kHz -> discrete tokens
audio = mimi.decode(codes)          # tokens -> back to audio
Useful for: audio language models, real-time codecs for VoIP, audio compression research.
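As a worked calculation of the compression involved: 24 kHz 16-bit mono PCM is 384 kbps, so Mimi's 1.1 kbps is roughly a 350x reduction.

# Compression ratio implied by the specs above.
pcm_bps = 24_000 * 16       # 24 kHz, 16-bit, mono = 384,000 bps
mimi_bps = 1_100            # 1.1 kbps
print(pcm_bps / mimi_bps)   # ~349x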
Multilingual Status {#multilingual}
English-only as of mid-2026. Kyutai has announced multilingual variants are in development. For non-English real-time voice today: hybrid pipeline (Faster-Whisper streaming + LLM + TTS in target language) or commercial APIs (OpenAI Realtime supports 50+ languages).
Performance Benchmarks {#benchmarks}
End-to-end voice agent latency, RTX 4090:
| Stack | Latency | Notes |
|---|---|---|
| Moshi (pure) | 180 ms | English only, limited voices |
| Moshi + Llama hybrid | 600-800 ms | Multilingual, tool-capable |
| Faster-Whisper streaming + Llama + F5-TTS | 800-1200 ms | Most flexible |
| OpenAI Realtime API | 350-500 ms | Cloud, paid |
| OpenAI GPT-4o voice | 350-500 ms | Cloud, paid |
| Standard pipelined (offline Whisper + Llama + Coqui) | 1500-3000 ms | Avoid for real-time |
For latency-critical applications (gaming, accessibility), Moshi pure is the only sub-300 ms option that's open-source.
Use Cases Where Moshi Wins {#use-cases}
- Privacy-sensitive voice agents — healthcare, legal, financial advisors
- Language practice partners — sub-200 ms feels natural for conversation flow
- Voice diary / journaling — local-only conversational journaling
- Voice assistants for accessibility — sub-200 ms reduces cognitive load for users with disabilities
- In-game NPCs — runtime conversational characters
- Voice control for desktop apps — fast push-to-talk
- Telephony bots — when SIP integration latency budget is tight
For reasoning depth beyond 7B, higher voice quality, or multilingual support: hybrid pipeline or OpenAI Realtime.
Production Considerations {#production}
For shipping Moshi in a product:
- Graceful degradation — fall back to text if audio fails
- VAD edge cases — handle long silences, music, multi-speaker scenarios
- Network jitter — buffer 50-100 ms; reconnect on disconnect (see the reconnect sketch after this list)
- Hot reload — keep model loaded; restart connections without unload
- Monitoring — track p50/p95/p99 latency per session
- Load — single H100 or RTX 4090 handles ~5-10 concurrent sessions
- Privacy / compliance — check whether state-level synthetic-audio disclosure laws apply to your use case
- Logging — log transcripts (text), not raw audio, for compliance and debugging
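A minimal reconnect-with-backoff pattern for the network-jitter point above (a sketch; handle_audio_frame is a hypothetical placeholder for your playback logic, and any conversational state you need must be restored by your own code after reconnecting):

import asyncio
import websockets

async def run_session(url="ws://localhost:8998/api/chat"):
    backoff = 1.0
    while True:
        try:
            async with websockets.connect(url) as ws:
                backoff = 1.0                    # reset after a successful connect
                async for message in ws:
                    handle_audio_frame(message)  # hypothetical playback callback
        except (websockets.ConnectionClosed, OSError):
            await asyncio.sleep(backoff)         # wait, then retry
            backoff = min(backoff * 2, 30.0)     # exponential backoff, capped

asyncio.run(run_session())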
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Latency above 300 ms | Wrong server / Bluetooth | Use Rust server, wired headset |
| Audio glitches | Buffer underrun | Increase audio buffer 20→40 ms |
| Model OOM | FP16 too tight | Use Q8 quant variant |
| WebSocket disconnects | Network jitter | Add reconnect logic that preserves session state |
| No AMD GPU support | ROCm not yet supported | Use CUDA / MPS / CPU; track the project for ROCm support |
| Voice variety limited | Trained voices only | Use hybrid + voice cloning TTS for arbitrary voices |
Sources: Moshi GitHub | Moshi paper | Kyutai | Pipecat | Internal benchmarks RTX 4090, M4 Max.