For local TTS in 2026, pick by job. Kokoro-82M (Apache 2.0) is the best choice for fast, clean narration and the lightest of the three to run. Coqui XTTS v2 still produces the most convincing voice clone from a ~6-second sample, but its CPML license is non-commercial only. Resemble AI's Chatterbox (MIT) is the pick when you need emotional, expressive speech and a permissive license with cloning. There is no single winner: Kokoro can't clone voices at all, XTTS v2 can't be used commercially, and Chatterbox is heavier to run. The right answer depends on whether you need narration, an exact voice clone, or expressive emotion, and on whether the output is commercial.

If you only remember one thing: license and use case decide this, not raw audio quality. All three sound good. The difference that will actually bite you is whether you're allowed to ship the output and whether you need to clone a specific voice.

TL;DR — which local TTS should I use?

Want clean narration / audiobook / app voice, and you might sell it? Use Kokoro-82M. Apache 2.0, tiny (82M params), runs faster than real time on a CPU, no GPU required.
Need to clone a specific person's voice and it's a personal / non-commercial project? Use XTTS v2. Best zero-shot clone from ~6 seconds of audio, 17 languages — but the license forbids commercial use.
Need emotional, expressive speech and a clean commercial license? Use Chatterbox (MIT). Zero-shot cloning from ~5 seconds plus a unique emotion-exaggeration dial.

What are Kokoro, XTTS v2, and Chatterbox?

These are the three most-recommended open-weight, run-it-yourself TTS models in mid-2026. They solve different problems, so comparing them on a single "best" axis is misleading. Here's the one-line identity of each:

Kokoro-82M — an open-weight model from the developer "hexgrad," built on the StyleTTS 2 + iSTFTNet architecture. At just 82 million parameters it's tiny by TTS standards. Its v1.0 shipped on January 27, 2025 with 54 voices across 8 languages. It is trained on long-form narration and reading, has no voice-cloning ability, and is released under the permissive Apache 2.0 license.
Coqui XTTS v2 — the multilingual voice-cloning model from Coqui. It clones a voice from roughly a 6-second reference clip across 17 languages. The catch: Coqui Inc. shut down in January 2024, and XTTS v2 ships under the Coqui Public Model License (CPML), which is non-commercial. With the company gone, there is effectively no one to sell you a commercial license.
Chatterbox — Resemble AI's open-source TTS, released under the MIT license. It's built on a 0.5B-parameter Llama backbone, does zero-shot voice cloning from ~5 seconds of audio, and is notable for being one of the first open TTS models with an explicit emotion-exaggeration control. Resemble has reported it beating ElevenLabs in side-by-side listener preference tests.

If you want the wider field beyond these three, our roundup of the best local TTS models covers Piper, F5-TTS and others alongside these.

Kokoro vs XTTS vs Chatterbox: full comparison table

Here is the head-to-head. Figures are taken from each model's official model card, repository, or vendor documentation; licenses and published parameter counts are confirmed against those sources.

Feature	Kokoro-82M	Coqui XTTS v2	Chatterbox
License	Apache 2.0 (commercial OK)	CPML (non-commercial only)	MIT (commercial OK)
Params	82M	~hundreds of M (GPT-style)	0.5B (Llama backbone)
Voice cloning	❌ None	✅ ~6-sec zero-shot	✅ ~5-sec zero-shot
Built-in voices	54 voices	Uses your reference clip	Uses your reference clip
Languages	8	17	English base / 23 multilingual
Emotion control	❌ No	Limited	✅ Exaggeration dial (0.0–1.0+)
Architecture	StyleTTS 2 + iSTFTNet	GPT-style autoregressive	Flow-matching, Llama backbone
Runs on CPU?	✅ Faster than real time	⚠️ Slow without GPU	⚠️ GPU recommended
Best at	Fast clean narration	Most convincing clone	Expressive / emotional speech
Vendor status	Active (community)	Coqui shut down Jan 2024	Active (Resemble AI)

A few honest caveats on that table. XTTS v2's exact public parameter count isn't headlined the way Kokoro's "82M" or Chatterbox's "0.5B" are, so we've left it as an approximate GPT-style scale rather than invent a precise number. Chatterbox's language count depends on which release you pull: the original English-first model versus the later Chatterbox Multilingual, which covers 23 languages. And "emotion control" for XTTS means you can nudge tone through reference-clip choice, not a dedicated dial like Chatterbox's.

Which has the best license? (This is the real decision)

If your output will ever be sold, monetized, or used in a product, license is the first filter — and it eliminates one model outright.

Kokoro: Apache 2.0. Permissive. You can use it commercially, modify it, and embed it in closed-source products without paying or asking, as long as you keep the license and attribution notices.
Chatterbox: MIT. Also fully commercial-friendly. MIT is the shorter license; Apache 2.0 adds an explicit patent grant and a few notice requirements. Both are safe for shipping products, and the practical difference is negligible for most users.
XTTS v2: CPML, non-commercial. The Coqui Public Model License restricts the model to non-commercial use. Because Coqui Inc. closed in January 2024, there's no remaining entity to grant you a paid commercial license. Treat XTTS v2 as personal / research / non-commercial only.

This single fact reshapes the whole comparison: XTTS v2 may produce the best clone, but you can't legally ship it in a paid product. For the full breakdown of what is and isn't allowed, see our deep dive on the XTTS / Coqui commercial license.

Which has the best voice cloning?

If your goal is "make it sound like this specific person," Kokoro is out — it has no cloning at all, only its 54 fixed voices. That leaves XTTS v2 vs Chatterbox.

XTTS v2 remains, in our testing and broad community consensus, the most faithful zero-shot cloner from a short (~6-second) sample, especially for matching timbre and accent across its 17 languages. The blocker is purely legal, not technical.
Chatterbox clones from ~5 seconds and is very close on fidelity, while adding expressiveness XTTS lacks. For any commercial cloning work, Chatterbox is the answer because XTTS can't be used commercially.

So the practical rule: non-commercial clone → XTTS v2; commercial clone → Chatterbox. Our step-by-step Chatterbox TTS setup guide walks through installing it and dialing in a cloned voice.
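
Both cloners work from a short reference clip, so it can help to confirm the clip's length before feeding it in. The sketch below is a minimal check using only Python's standard library. The function names are illustrative, and it only reads uncompressed PCM WAV files (other formats would need an audio library).

```python
import wave


def clip_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def long_enough(path: str, minimum_seconds: float) -> bool:
    """True if the reference clip is at least `minimum_seconds` long."""
    return clip_seconds(path) >= minimum_seconds
```

For example, `long_enough("reference.wav", 6.0)` would check a clip against the ~6-second figure quoted for XTTS v2, and passing `5.0` would match Chatterbox's ~5 seconds. Duration is only a sanity check; it says nothing about noise or recording quality.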

Which is best for emotion and expressiveness?

This is Chatterbox's standout. It exposes an exaggeration parameter that scales emotional intensity:

0.0 — flat, monotone
0.5 — natural, conversational (the default)
1.0+ — dramatic, theatrical

Paired with a CFG/pacing control, this lets you tune delivery without re-recording, which is genuinely useful for character voices, ads, or expressive narration. Resemble AI has reported Chatterbox being preferred over ElevenLabs in listener A/B tests — take vendor benchmarks with a grain of salt, but it signals the model is competitive at the top end.

Kokoro and XTTS, by contrast, give you the emotional tone baked into their voices or reference clips; there's no equivalent intensity dial.
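
As a rough illustration of the dial, the sketch below maps the three settings listed above to keyword arguments for a Chatterbox call. The preset names and both helper functions are hypothetical. The `ChatterboxTTS.from_pretrained` and `generate` calls, and the `exaggeration`, `cfg_weight`, and `audio_prompt_path` parameter names, follow the usage shown in the Chatterbox README as we understand it, so check the repo before relying on them.

```python
from typing import Optional

# Hypothetical preset names for the three points on the dial described above.
EXAGGERATION_PRESETS = {"flat": 0.0, "natural": 0.5, "dramatic": 1.0}


def generation_kwargs(
    style: str = "natural",
    cfg_weight: float = 0.5,  # assumed default for the CFG/pacing control
    reference_clip: Optional[str] = None,
) -> dict:
    """Build keyword arguments for a Chatterbox generate() call."""
    if style not in EXAGGERATION_PRESETS:
        raise ValueError(f"unknown style {style!r}; expected one of {sorted(EXAGGERATION_PRESETS)}")
    kwargs = {"exaggeration": EXAGGERATION_PRESETS[style], "cfg_weight": cfg_weight}
    if reference_clip is not None:
        # Path to the short reference recording used for zero-shot cloning.
        kwargs["audio_prompt_path"] = reference_clip
    return kwargs


def synthesize(text: str, style: str = "natural",
               reference_clip: Optional[str] = None, device: str = "cuda"):
    """Generate speech with Chatterbox; returns (waveform, sample_rate)."""
    # Imported lazily so the helper above works without the package installed.
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(text, **generation_kwargs(style, reference_clip=reference_clip))
    return wav, model.sr
```

Keeping the presets in a plain dictionary makes it easy to try intermediate values, such as 0.7, without touching the synthesis code.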

How fast are they, and what hardware do you need?

Speed and footprint matter as much as quality for local use. Here's roughly how they compare in practice.

Model	Footprint	Speed	GPU needed?
Kokoro-82M	Weights under ~1 GB (FP16); ~2–3 GB GPU memory in use	Faster than real time	No — runs well on CPU
XTTS v2	Larger, GPT-style	Real-time-ish on GPU; sluggish on CPU	Recommended
Chatterbox	~0.5B params	Real-time-ish on a modern GPU	Recommended
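
To sanity-check footprint figures like these, a back-of-the-envelope estimate is parameters times bytes per parameter. The helper below is a minimal sketch under that assumption. It counts weights only, so it ignores activations, runtime overhead, and any extra components a release ships with, and real checkpoint sizes may differ.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(n_params: int, dtype: str = "fp16") -> float:
    """Approximate size of the raw weights in megabytes (1 MB = 1e6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


print(weight_size_mb(82_000_000))    # Kokoro-82M at FP16 -> 164.0
print(weight_size_mb(500_000_000))   # a 0.5B model at FP16 -> 1000.0
```

By this estimate Kokoro's weights sit well under 1 GB at FP16, while a 0.5B-parameter model needs about 1 GB for weights alone before any working memory.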

From our own informal testing, Kokoro-82M generated short narration clips comfortably faster than real time on an Apple Silicon laptop CPU with no GPU at all — roughly a few times real-time for short sentences, though exact throughput varies with text length and machine. That CPU-friendliness is Kokoro's quiet superpower: XTTS v2 and Chatterbox both really want a GPU to feel responsive, while Kokoro will happily run on a cheap mini PC or a laptop. Treat these as ballpark figures from a single machine, not a controlled benchmark.
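
If you want to reproduce a "faster than real time" claim on your own machine, one common measure is the real-time factor: synthesis time divided by the duration of the audio produced. The sketch below assumes a hypothetical `synthesize` callable that takes text and returns a sequence of audio samples; wrap whichever model you are testing so it fits that shape.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> float:
    """Return synthesis time / audio duration. Below 1.0 is faster than real time."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(samples) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_seconds
```

A real-time factor of 0.25 would mean one second of audio takes a quarter of a second to generate. Run it several times on text of different lengths, since the first call often includes model warm-up.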

Decision tree: pick your TTS by use case

Walk down this list and stop at the first match:

1. Are you selling or monetizing the output (commercial use)?
- Yes → XTTS v2 is out (non-commercial). Continue to step 2.
- No → all three are on the table; use the cloning/narration questions below.
2. Do you need to clone a specific voice?
- Yes, commercial → Chatterbox (MIT, clones from ~5s, plus emotion).
- Yes, non-commercial → XTTS v2 (best clone fidelity, 17 languages).
- No, any prebuilt voice is fine → go to step 3.
3. Do you need expressive / emotional delivery?
- Yes → Chatterbox (exaggeration dial).
- No, just clean narration → Kokoro-82M (fast, light, Apache 2.0, runs on CPU).
4. Tight on hardware (no GPU)?
- Pick Kokoro if a prebuilt voice will do: it's the only one that's genuinely happy on CPU.

In short: Kokoro for narration, XTTS for non-commercial clones, Chatterbox for commercial clones and emotion.
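
The first three questions of the decision tree can be sketched as a small function. The function name and arguments are illustrative, and the hardware question is left out because it acts as a tie-breaker rather than a branch.

```python
def pick_tts(commercial: bool, needs_clone: bool, needs_emotion: bool) -> str:
    """Return the model the decision tree above would suggest."""
    if needs_clone:
        # XTTS v2 is non-commercial only, so commercial cloning goes to Chatterbox.
        return "Chatterbox" if commercial else "XTTS v2"
    if needs_emotion:
        # Only Chatterbox exposes an explicit exaggeration dial.
        return "Chatterbox"
    # Plain narration with a prebuilt voice.
    return "Kokoro-82M"


print(pick_tts(commercial=True, needs_clone=True, needs_emotion=False))    # Chatterbox
print(pick_tts(commercial=False, needs_clone=True, needs_emotion=False))   # XTTS v2
print(pick_tts(commercial=True, needs_clone=False, needs_emotion=False))   # Kokoro-82M
```

Note that the commercial flag only changes the answer when cloning is required, which matches the article's point that the license matters most for cloned voices.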

Key Takeaways

There is no universal winner. Kokoro, XTTS v2, and Chatterbox each win a different job — narration, best clone, and emotion/commercial clone respectively.
License is the decisive filter. Kokoro (Apache 2.0) and Chatterbox (MIT) are commercial-safe; XTTS v2 (CPML) is non-commercial only, and Coqui's shutdown means no commercial license is available.
Kokoro can't clone voices. It ships 54 fixed voices across 8 languages and is built for narration, not impersonation — but it's tiny (82M) and runs faster than real time on CPU.
XTTS v2 still clones best from a ~6-second sample across 17 languages, which is why it survives despite the dead company and restrictive license — for personal projects only.
Chatterbox is the expressive, commercial-friendly clone. MIT-licensed, ~5-second cloning, and a unique emotion-exaggeration dial (0.0–1.0+).

Next Steps

Want the full field, not just these three? Read Best Local TTS Models for Piper, F5-TTS and more, with install notes.
Ready to set up the most flexible commercial option? Follow the Chatterbox TTS setup guide step by step.
Need to be sure you're allowed to ship XTTS output? Read the XTTS / Coqui commercial license breakdown before you build anything on it.
Verify the specs yourself on the official model pages: the Kokoro-82M model card and the Chatterbox GitHub repo.

Kokoro vs XTTS vs Chatterbox: Best Local TTS in 2026?

Written by the Local AI Master Team