Kokoro vs XTTS vs Chatterbox: Best Local TTS in 2026?
For local TTS in 2026, pick by job. Kokoro-82M (Apache 2.0) is the best choice for fast, clean narration, and its license allows commercial use. Coqui XTTS v2 still produces the most convincing voice clone from a ~6-second sample, but its CPML license is non-commercial only. Resemble AI's Chatterbox (MIT) is the pick when you need emotional, expressive speech and a permissive license with cloning. There is no single winner: Kokoro can't clone voices at all, XTTS v2 can't be used commercially, and Chatterbox is heavier to run. The right answer depends on whether you need narration, an exact voice clone, or expressive emotion, and on whether the output is commercial.
If you only remember one thing: license and use case decide this, not raw audio quality. All three sound good. The difference that will actually bite you is whether you're allowed to ship the output and whether you need to clone a specific voice.
TL;DR — which local TTS should I use?
- Want clean narration / audiobook / app voice, and you might sell it? Use Kokoro-82M. Apache 2.0, tiny (82M params), runs faster than real time on a CPU, no GPU required.
- Need to clone a specific person's voice and it's a personal / non-commercial project? Use XTTS v2. Best zero-shot clone from ~6 seconds of audio, 17 languages — but the license forbids commercial use.
- Need emotional, expressive speech and a clean commercial license? Use Chatterbox (MIT). Zero-shot cloning from ~5 seconds plus a unique emotion-exaggeration dial.
What are Kokoro, XTTS v2, and Chatterbox?
These are the three most-recommended open-weight, run-it-yourself TTS models in mid-2026. They solve different problems, so comparing them on a single "best" axis is misleading. Here's the one-line identity of each:
- Kokoro-82M — an open-weight model from the developer "hexgrad," built on the StyleTTS 2 + iSTFTNet architecture. At just 82 million parameters it's tiny by TTS standards. Its v1.0 shipped on January 27, 2025 with 54 voices across 8 languages. It is trained on long-form narration and reading, has no voice-cloning ability, and is released under the permissive Apache 2.0 license.
- Coqui XTTS v2 — the multilingual voice-cloning model from Coqui. It clones a voice from roughly a 6-second reference clip across 17 languages. The catch: Coqui Inc. shut down in January 2024, and XTTS v2 ships under the Coqui Public Model License (CPML), which is non-commercial. With the company gone, there is effectively no one to sell you a commercial license.
- Chatterbox — Resemble AI's open-source TTS, released under the MIT license. It's built on a 0.5B-parameter Llama backbone, does zero-shot voice cloning from ~5 seconds of audio, and is notable for being one of the first open TTS models with an explicit emotion-exaggeration control. Resemble has reported it beating ElevenLabs in side-by-side listener preference tests.
If you want the wider field beyond these three, our roundup of the best local TTS models covers Piper, F5-TTS and others alongside these.
Kokoro vs XTTS vs Chatterbox: full comparison table
Here is the head-to-head. Figures are taken from each model's official model card, repository, or vendor documentation. Licenses and the Kokoro and Chatterbox parameter counts are checked against those sources; the XTTS v2 parameter count is approximate.
| Feature | Kokoro-82M | Coqui XTTS v2 | Chatterbox |
|---|---|---|---|
| License | Apache 2.0 (commercial OK) | CPML (non-commercial only) | MIT (commercial OK) |
| Params | 82M | ~hundreds of M (GPT-style) | 0.5B (Llama backbone) |
| Voice cloning | ❌ None | ✅ ~6-sec zero-shot | ✅ ~5-sec zero-shot |
| Built-in voices | 54 voices | Uses your reference clip | Uses your reference clip |
| Languages | 8 | 17 | English base / 23 multilingual |
| Emotion control | ❌ No | Limited | ✅ Exaggeration dial (0.0–1.0+) |
| Architecture | StyleTTS 2 + iSTFTNet | GPT-style autoregressive | Flow-matching, Llama backbone |
| Runs on CPU? | ✅ Faster than real time | ⚠️ Slow without GPU | ⚠️ GPU recommended |
| Best at | Fast clean narration | Most convincing clone | Expressive / emotional speech |
| Vendor status | Active (community) | Coqui shut down Jan 2024 | Active (Resemble AI) |
A few honest caveats on that table. XTTS v2's exact public parameter count isn't headlined the way Kokoro's "82M" or Chatterbox's "0.5B" are, so we've left it as an approximate GPT-style scale rather than invent a precise number. Chatterbox's language count depends on which release you pull: the original English-first model versus the later Chatterbox Multilingual, which covers 23 languages. And "emotion control" for XTTS means you can nudge tone through reference-clip choice, not a dedicated dial like Chatterbox's.
Which has the best license? (This is the real decision)
If your output will ever be sold, monetized, or used in a product, license is the first filter — and it eliminates one model outright.
- Kokoro — Apache 2.0. The most permissive of the three. You can use it commercially, modify it, and embed it in closed-source products without paying or asking.
- Chatterbox — MIT. Also fully commercial-friendly. MIT and Apache 2.0 are both safe for shipping products; the practical difference is negligible for most users.
- XTTS v2 — CPML, non-commercial. The Coqui Public Model License restricts the model to non-commercial use. Because Coqui Inc. closed in January 2024, there's no remaining entity to grant you a paid commercial license. Treat XTTS v2 as personal / research / non-commercial only.
This single fact reshapes the whole comparison: XTTS v2 may produce the best clone, but you can't legally ship it in a paid product. For the full breakdown of what is and isn't allowed, see our deep dive on the XTTS / Coqui commercial license.
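The license filter above can be written down as a small lookup. This is only a sketch: the `TtsModel` record and the `shippable` helper are hypothetical names made up for illustration, and the license flags simply restate the three bullets above rather than offering legal advice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsModel:
    name: str
    license: str
    commercial_ok: bool  # restates the license bullets above
    can_clone: bool


MODELS = [
    TtsModel("Kokoro-82M", "Apache 2.0", commercial_ok=True, can_clone=False),
    TtsModel("Coqui XTTS v2", "CPML", commercial_ok=False, can_clone=True),
    TtsModel("Chatterbox", "MIT", commercial_ok=True, can_clone=True),
]


def shippable(models, commercial):
    """Return the names of models whose license permits the intended use."""
    return [m.name for m in models if m.commercial_ok or not commercial]


print(shippable(MODELS, commercial=True))
print(shippable(MODELS, commercial=False))
```

With `commercial=True` the list drops XTTS v2; with `commercial=False` all three remain, and the choice falls to cloning and expressiveness.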
Which has the best voice cloning?
If your goal is "make it sound like this specific person," Kokoro is out — it has no cloning at all, only its 54 fixed voices. That leaves XTTS v2 vs Chatterbox.
- XTTS v2 remains, in our testing and broad community consensus, the most faithful zero-shot cloner from a short (~6-second) sample, especially for matching timbre and accent across its 17 languages. The blocker is purely legal, not technical.
- Chatterbox clones from ~5 seconds and is very close on fidelity, while adding expressiveness XTTS lacks. For any commercial cloning work, Chatterbox is the answer because XTTS can't be used commercially.
So the practical rule: non-commercial clone → XTTS v2; commercial clone → Chatterbox. Our step-by-step Chatterbox TTS setup guide walks through installing it and dialing in a cloned voice.
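For the non-commercial path, a minimal sketch of zero-shot cloning with the Coqui `TTS` Python package might look like the following. The function name `clone_with_xtts`, the file paths, and the defaults are assumptions for illustration; the language codes are the 17 listed on the XTTS v2 model card as we understand it. The heavy imports are deferred into the function so the module loads without the package installed.

```python
XTTS_MODEL_ID = "tts_models/multilingual/multi-dataset/xtts_v2"

# The 17 language codes listed on the XTTS v2 model card (our reading of it).
XTTS_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
}


def clone_with_xtts(text, reference_wav, out_path="clone.wav", language="en"):
    """Speak `text` in the voice of `reference_wav` (a ~6-second clip).

    Non-commercial use only under the CPML license.
    """
    if language not in XTTS_LANGUAGES:
        raise ValueError(f"unsupported XTTS v2 language: {language!r}")

    # Deferred imports: these pull in PyTorch and download weights on first use.
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS(XTTS_MODEL_ID).to(device)
    tts.tts_to_file(
        text=text,
        speaker_wav=reference_wav,
        language=language,
        file_path=out_path,
    )
    return out_path
```

A call such as `clone_with_xtts("Hello there.", "my_voice_6s.wav")` would write `clone.wav`; the reference file name here is a placeholder, not a file that ships with anything.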
Which is best for emotion and expressiveness?
This is Chatterbox's standout. It exposes an exaggeration parameter that scales emotional intensity:
- 0.0 — flat, monotone
- 0.5 — natural, conversational (the default)
- 1.0+ — dramatic, theatrical
Paired with a CFG/pacing control, this lets you tune delivery without re-recording, which is genuinely useful for character voices, ads, or expressive narration. Resemble AI has reported Chatterbox being preferred over ElevenLabs in listener A/B tests — take vendor benchmarks with a grain of salt, but it signals the model is competitive at the top end.
Kokoro and XTTS, by contrast, give you the emotional tone baked into their voices or reference clips; there's no equivalent intensity dial.
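To make the dial concrete, here is a hedged sketch based on the usage pattern in the Chatterbox README as we understand it (`ChatterboxTTS.from_pretrained`, then `generate` with `exaggeration` and `cfg_weight`). The preset names, the `resolve_exaggeration` and `speak` helpers, and the output path are illustrative assumptions, not part of the library.

```python
# Preset names are hypothetical; the values mirror the three levels listed above.
EXAGGERATION_PRESETS = {"flat": 0.0, "natural": 0.5, "dramatic": 1.0}


def resolve_exaggeration(level):
    """Accept a preset name or a raw non-negative number."""
    if isinstance(level, str):
        return EXAGGERATION_PRESETS[level]
    value = float(level)
    if value < 0:
        raise ValueError("exaggeration must be non-negative")
    return value


def speak(text, level="natural", cfg_weight=0.5, reference_wav=None,
          out_path="chatterbox.wav"):
    """Generate speech; pass `reference_wav` (~5 s) to clone a voice."""
    exaggeration = resolve_exaggeration(level)

    # Deferred imports: loading the model needs PyTorch and downloads weights.
    import torch
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(
        text,
        audio_prompt_path=reference_wav,
        exaggeration=exaggeration,
        cfg_weight=cfg_weight,
    )
    torchaudio.save(out_path, wav, model.sr)
    return out_path
```

Generating the same line at `"flat"`, `"natural"`, and `"dramatic"` is a quick way to hear what the dial does before tuning it by number.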
How fast are they, and what hardware do you need?
Speed and footprint matter as much as quality for local use. Here's how they compare in practice; treat the figures as approximate.
| Model | Footprint | Speed | GPU needed? |
|---|---|---|---|
| Kokoro-82M | Weights under ~1 GB (FP16); ~2–3 GB GPU memory in use | Faster than real time | No — runs well on CPU |
| XTTS v2 | Larger, GPT-style | Real-time-ish on GPU; sluggish on CPU | Recommended |
| Chatterbox | ~0.5B params | Real-time-ish on a modern GPU | Recommended |
From our own informal testing, Kokoro-82M generated short narration clips comfortably faster than real time on an Apple Silicon laptop CPU with no GPU at all — roughly a few times real-time for short sentences, though exact throughput varies with text length and machine. That CPU-friendliness is Kokoro's quiet superpower: XTTS v2 and Chatterbox both really want a GPU to feel responsive, while Kokoro will happily run on a cheap mini PC or a laptop. Treat these as ballpark figures from a single machine, not a controlled benchmark.
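If you want to check "faster than real time" on your own machine, one rough approach is to divide the seconds of audio produced by the wall-clock seconds spent generating it. The sketch below assumes the `kokoro` package's `KPipeline` interface and its 24 kHz output as described in the Kokoro README; the helper names, the `af_heart` voice choice, and the decision to exclude model loading from the timing are our assumptions.

```python
import time


def realtime_speedup(audio_seconds, wall_seconds):
    """Seconds of audio per second of compute; above 1.0 is faster than real time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def time_kokoro(text, voice="af_heart", sample_rate=24000):
    """Time one Kokoro synthesis run (model load excluded) and return the speedup."""
    # Deferred import so this module loads without kokoro installed.
    from kokoro import KPipeline

    pipeline = KPipeline(lang_code="a")  # "a" selects American English
    start = time.perf_counter()
    # The pipeline yields (graphemes, phonemes, audio) per text chunk.
    n_samples = sum(len(audio) for _, _, audio in pipeline(text, voice=voice))
    elapsed = time.perf_counter() - start
    return realtime_speedup(n_samples / sample_rate, elapsed)
```

For example, 10 seconds of audio generated in 2 seconds is a speedup of 5.0. The first call usually includes warm-up cost, so timing a second run gives a fairer number.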
Decision tree: pick your TTS by use case
Walk down this list and stop at the first match:
1. Are you selling or monetizing the output (commercial use)?
   - Yes → XTTS v2 is out (non-commercial). Continue to step 2.
   - No → all three are on the table; use the cloning/narration questions below.
2. Do you need to clone a specific voice?
   - Yes, commercial → Chatterbox (MIT, clones from ~5s, plus emotion).
   - Yes, non-commercial → XTTS v2 (best clone fidelity, 17 languages).
   - No, any prebuilt voice is fine → go to step 3.
3. Do you need expressive / emotional delivery?
   - Yes → Chatterbox (exaggeration dial).
   - No, just clean narration → Kokoro-82M (fast, light, Apache 2.0, runs on CPU).
4. Tight on hardware (no GPU)?
   - Pick Kokoro regardless: it's the only one that's genuinely happy on CPU.
In short: Kokoro for narration, XTTS for non-commercial clones, Chatterbox for commercial clones and emotion.
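The same decision list can be sketched as a function. `pick_tts` and its parameter names are hypothetical, and treating the no-GPU question as an override that is checked first is one reading of "pick Kokoro regardless"; keep in mind that Kokoro cannot clone, so that branch trades cloning away for speed.

```python
def pick_tts(commercial, need_clone, need_emotion, has_gpu=True):
    """Return the suggested model name for a use case, per the list above."""
    # Hardware question, read here as an override: no GPU means Kokoro,
    # even though Kokoro cannot clone voices.
    if not has_gpu:
        return "Kokoro-82M"
    # Cloning: the license splits the answer.
    if need_clone:
        return "Chatterbox" if commercial else "XTTS v2"
    # No cloning needed: expressiveness decides.
    if need_emotion:
        return "Chatterbox"
    return "Kokoro-82M"


print(pick_tts(commercial=True, need_clone=True, need_emotion=False))
print(pick_tts(commercial=False, need_clone=True, need_emotion=False))
print(pick_tts(commercial=True, need_clone=False, need_emotion=False))
```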
Key Takeaways
- There is no universal winner. Kokoro, XTTS v2, and Chatterbox each win a different job — narration, best clone, and emotion/commercial clone respectively.
- License is the decisive filter. Kokoro (Apache 2.0) and Chatterbox (MIT) are commercial-safe; XTTS v2 (CPML) is non-commercial only, and Coqui's shutdown means no commercial license is available.
- Kokoro can't clone voices. It ships 54 fixed voices across 8 languages and is built for narration, not impersonation — but it's tiny (82M) and runs faster than real time on CPU.
- XTTS v2 still clones best from a ~6-second sample across 17 languages, which is why it survives despite the dead company and restrictive license — for personal projects only.
- Chatterbox is the expressive, commercial-friendly clone. MIT-licensed, ~5-second cloning, and a unique emotion-exaggeration dial (0.0–1.0+).
Next Steps
- Want the full field, not just these three? Read Best Local TTS Models for Piper, F5-TTS and more, with install notes.
- Ready to set up the most flexible commercial option? Follow the Chatterbox TTS setup guide step by step.
- Need to be sure you're allowed to ship XTTS output? Read the XTTS / Coqui commercial license breakdown before you build anything on it.
- Verify the specs yourself on the official model pages: the Kokoro-82M model card and the Chatterbox GitHub repo.