Kokoro vs XTTS vs Chatterbox: Best Local TTS in 2026?

June 20, 2026
10 min read
Local AI Master Research Team

For local TTS in 2026, pick by job. Kokoro-82M (Apache 2.0) is the best choice for fast, clean narration and is safe to use commercially. Coqui XTTS v2 still produces the most convincing voice clone from a ~6-second sample, but its CPML license is non-commercial only. Resemble AI's Chatterbox (MIT) is the pick when you need emotional, expressive speech plus cloning under a permissive license. There is no single winner: Kokoro can't clone voices at all, XTTS v2 can't be used commercially, and Chatterbox is heavier to run. The right answer depends on whether you need narration, an exact voice clone, or expressive emotion, and on whether the output is commercial.

If you only remember one thing: license and use case decide this, not raw audio quality. All three sound good. The difference that will actually bite you is whether you're allowed to ship the output and whether you need to clone a specific voice.

TL;DR — which local TTS should I use?

  • Want clean narration / audiobook / app voice, and you might sell it? Use Kokoro-82M. Apache 2.0, tiny (82M params), runs faster than real time on a CPU, no GPU required.
  • Need to clone a specific person's voice and it's a personal / non-commercial project? Use XTTS v2. Best zero-shot clone from ~6 seconds of audio, 17 languages — but the license forbids commercial use.
  • Need emotional, expressive speech and a clean commercial license? Use Chatterbox (MIT). Zero-shot cloning from ~5 seconds plus a unique emotion-exaggeration dial.

What are Kokoro, XTTS v2, and Chatterbox?

These are the three most-recommended open-weight, run-it-yourself TTS models in mid-2026. They solve different problems, so comparing them on a single "best" axis is misleading. Here's the one-line identity of each:

  • Kokoro-82M — an open-weight model from the developer "hexgrad," built on the StyleTTS 2 + iSTFTNet architecture. At just 82 million parameters it's tiny by TTS standards. Its v1.0 shipped on January 27, 2025 with 54 voices across 8 languages. It is trained on long-form narration and reading, has no voice-cloning ability, and is released under the permissive Apache 2.0 license.
  • Coqui XTTS v2 — the multilingual voice-cloning model from Coqui. It clones a voice from roughly a 6-second reference clip across 17 languages. The catch: Coqui Inc. shut down in January 2024, and XTTS v2 ships under the Coqui Public Model License (CPML), which is non-commercial. With the company gone, there is effectively no one to sell you a commercial license.
  • Chatterbox — Resemble AI's open-source TTS, released under the MIT license. It's built on a 0.5B-parameter Llama backbone, does zero-shot voice cloning from ~5 seconds of audio, and is notable for being one of the first open TTS models with an explicit emotion-exaggeration control. Resemble has reported it beating ElevenLabs in side-by-side listener preference tests.

If you want the wider field, our roundup of the best local TTS models also covers Piper, F5-TTS, and others.

Kokoro vs XTTS vs Chatterbox: full comparison table

Here is the head-to-head. Figures are taken from each model's official model card, repository, or vendor documentation; parameter counts and licenses are confirmed against the official sources.

| Feature | Kokoro-82M | Coqui XTTS v2 | Chatterbox |
| --- | --- | --- | --- |
| License | Apache 2.0 (commercial OK) | CPML (non-commercial only) | MIT (commercial OK) |
| Params | 82M | ~hundreds of M (GPT-style) | 0.5B (Llama backbone) |
| Voice cloning | ❌ None | ✅ ~6-sec zero-shot | ✅ ~5-sec zero-shot |
| Built-in voices | 54 voices | Uses your reference clip | Uses your reference clip |
| Languages | 8 | 17 | English base / 23 multilingual |
| Emotion control | ❌ No | Limited | ✅ Exaggeration dial (0.0–1.0+) |
| Architecture | StyleTTS 2 + iSTFTNet | GPT-style autoregressive | Flow-matching, Llama backbone |
| Runs on CPU? | ✅ Faster than real time | ⚠️ Slow without GPU | ⚠️ GPU recommended |
| Best at | Fast clean narration | Most convincing clone | Expressive / emotional speech |
| Vendor status | Active (community) | Coqui shut down Jan 2024 | Active (Resemble AI) |

A few honest caveats on that table. XTTS v2's exact public parameter count isn't headlined the way Kokoro's "82M" or Chatterbox's "0.5B" are, so we've left it as an approximate GPT-style scale rather than invent a precise number. Chatterbox's language count depends on which release you pull: the original English-first model versus the later Chatterbox Multilingual, which covers 23 languages. And "emotion control" for XTTS means you can nudge tone through reference-clip choice, not a dedicated dial like Chatterbox's.

Which has the best license? (This is the real decision)

If your output will ever be sold, monetized, or used in a product, license is the first filter — and it eliminates one model outright.

  • Kokoro: Apache 2.0. A permissive license with an explicit patent grant. You can use it commercially, modify it, and embed it in closed-source products without paying or asking, as long as you keep the license and attribution notices.
  • Chatterbox: MIT. Also fully commercial-friendly. MIT and Apache 2.0 are both safe for shipping products; the practical difference is negligible for most users.
  • XTTS v2: CPML, non-commercial. The Coqui Public Model License restricts the model to non-commercial use. Because Coqui Inc. closed in January 2024, there's no remaining entity to grant you a paid commercial license. Treat XTTS v2 as personal / research / non-commercial only.

This single fact reshapes the whole comparison: XTTS v2 may produce the best clone, but you can't legally ship it in a paid product. For the full breakdown of what is and isn't allowed, see our deep dive on the XTTS / Coqui commercial license.
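
If you screen models programmatically, the license filter described above can be encoded as data. This is a minimal sketch that uses only the license facts stated in this article; the `MODELS` dictionary and the `commercially_usable` helper are illustrative names, not part of any library.

```python
# License facts as stated in this article; verify against each
# model card before shipping anything.
MODELS = {
    "Kokoro-82M": {"license": "Apache 2.0", "commercial": True, "cloning": False},
    "XTTS v2": {"license": "CPML", "commercial": False, "cloning": True},
    "Chatterbox": {"license": "MIT", "commercial": True, "cloning": True},
}


def commercially_usable(need_cloning: bool = False) -> list[str]:
    """Return the models that pass the commercial-license filter.

    If need_cloning is True, models without voice cloning are dropped too.
    """
    return [
        name
        for name, info in MODELS.items()
        if info["commercial"] and (info["cloning"] or not need_cloning)
    ]


print(commercially_usable())                   # ['Kokoro-82M', 'Chatterbox']
print(commercially_usable(need_cloning=True))  # ['Chatterbox']
```

Keeping the license as an explicit field makes it harder to pick a model on audio quality alone and discover the restriction later.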

Which has the best voice cloning?

If your goal is "make it sound like this specific person," Kokoro is out — it has no cloning at all, only its 54 fixed voices. That leaves XTTS v2 vs Chatterbox.

  • XTTS v2 remains, in our testing and broad community consensus, the most faithful zero-shot cloner from a short (~6-second) sample, especially for matching timbre and accent across its 17 languages. The blocker is purely legal, not technical.
  • Chatterbox clones from ~5 seconds and is very close on fidelity, while adding expressiveness XTTS lacks. For any commercial cloning work, Chatterbox is the answer because XTTS can't be used commercially.

So the practical rule: non-commercial clone → XTTS v2; commercial clone → Chatterbox. Our step-by-step Chatterbox TTS setup guide walks through installing it and dialing in a cloned voice.

Which is best for emotion and expressiveness?

This is Chatterbox's standout. It exposes an exaggeration parameter that scales emotional intensity:

  • 0.0 — flat, monotone
  • 0.5 — natural, conversational (the default)
  • 1.0+ — dramatic, theatrical

Paired with a CFG/pacing control, this lets you tune delivery without re-recording, which is genuinely useful for character voices, ads, or expressive narration. Resemble AI has reported Chatterbox being preferred over ElevenLabs in listener A/B tests — take vendor benchmarks with a grain of salt, but it signals the model is competitive at the top end.

Kokoro and XTTS, by contrast, give you the emotional tone baked into their voices or reference clips; there's no equivalent intensity dial.
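
To make the exaggeration dial concrete, here is a hedged sketch of how it might be passed to Chatterbox from Python. It assumes the `chatterbox-tts` package and `torchaudio` are installed, and that `ChatterboxTTS.from_pretrained` and `generate(..., audio_prompt_path=..., exaggeration=..., cfg_weight=...)` behave as shown in the project's README at the time of writing; check the current README before relying on it. The `generation_settings` and `speak` helpers are hypothetical names introduced for this example, and the non-negative check is our own guard, not a documented limit.

```python
from typing import Optional


def generation_settings(
    exaggeration: float = 0.5,
    cfg_weight: float = 0.5,
    voice_sample: Optional[str] = None,
) -> dict:
    """Build keyword arguments for a Chatterbox generate() call.

    0.5 is the conversational default described above; values near 0.0
    flatten the delivery and values of 1.0 or more make it theatrical.
    """
    if exaggeration < 0 or cfg_weight < 0:
        raise ValueError("exaggeration and cfg_weight must be non-negative")
    settings = {"exaggeration": exaggeration, "cfg_weight": cfg_weight}
    if voice_sample is not None:
        # Path to a short reference clip (about 5 seconds) of the voice to clone.
        settings["audio_prompt_path"] = voice_sample
    return settings


def speak(text: str, out_path: str, device: str = "cuda", **kwargs) -> None:
    """Synthesize text with Chatterbox and write it to a WAV file."""
    # Imported lazily so generation_settings() works without the heavy
    # dependencies installed.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(text, **generation_settings(**kwargs))
    torchaudio.save(out_path, wav, model.sr)


# Example call (downloads the model on first use):
# speak("We actually won!", "dramatic.wav", exaggeration=1.0, cfg_weight=0.3)
```

Separating the settings from the synthesis call lets you sweep the dial over a few values and compare the resulting files without touching the model-loading code.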

How fast are they, and what hardware do you need?

Speed and footprint matter as much as quality for local use. Here's roughly how they compare in practice.

| Model | Footprint | Speed | GPU needed? |
| --- | --- | --- | --- |
| Kokoro-82M | Weights under ~1 GB (FP16); ~2–3 GB GPU memory in use | Faster than real time | No; runs well on CPU |
| XTTS v2 | Larger, GPT-style | Real-time-ish on GPU; sluggish on CPU | Recommended |
| Chatterbox | ~0.5B params | Real-time-ish on a modern GPU | Recommended |

From our own informal testing, Kokoro-82M generated short narration clips comfortably faster than real time on an Apple Silicon laptop CPU with no GPU at all — roughly a few times real-time for short sentences, though exact throughput varies with text length and machine. That CPU-friendliness is Kokoro's quiet superpower: XTTS v2 and Chatterbox both really want a GPU to feel responsive, while Kokoro will happily run on a cheap mini PC or a laptop. Treat these as ballpark figures from a single machine, not a controlled benchmark.
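
"Faster than real time" means the model produces more seconds of audio than the seconds it spends computing. The sketch below shows one way to measure that on your own machine. It is engine-agnostic: `synthesize` is a placeholder for whatever function you wrap around your TTS model (it is assumed to return a sequence of audio samples), and `speed_vs_real_time` and `benchmark` are hypothetical helper names.

```python
import time
from typing import Callable, Sequence


def speed_vs_real_time(audio_seconds: float, synth_seconds: float) -> float:
    """Seconds of audio produced per second of compute.

    A value above 1.0 means faster than real time.
    """
    if synth_seconds <= 0:
        raise ValueError("synth_seconds must be positive")
    return audio_seconds / synth_seconds


def benchmark(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> float:
    """Time one synthesis call and return its speed relative to real time."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    # Audio duration is the number of samples divided by the sample rate.
    return speed_vs_real_time(len(samples) / sample_rate, elapsed)


# 6 seconds of audio generated in 2 seconds of compute:
print(speed_vs_real_time(6.0, 2.0))  # 3.0
```

Run it on a few clips of different lengths and skip the first call, since model loading and warm-up can dominate a single measurement.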

Decision tree: pick your TTS by use case

Walk down this list and stop at the first match:

  1. Are you selling or monetizing the output (commercial use)?
    • Yes → XTTS v2 is out (non-commercial). Continue to step 2.
    • No → all three are on the table; use the cloning/narration questions below.
  2. Do you need to clone a specific voice?
    • Yes, commercial → Chatterbox (MIT, clones from ~5s, plus emotion).
    • Yes, non-commercial → XTTS v2 (best clone fidelity, 17 languages).
    • No, any prebuilt voice is fine → go to step 3.
  3. Do you need expressive / emotional delivery?
    • Yes → Chatterbox (exaggeration dial).
    • No, just clean narration → Kokoro-82M (fast, light, Apache 2.0, runs on CPU).
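  4. Tight on hardware (no GPU)?
    • If a prebuilt voice is acceptable, pick Kokoro regardless of the answers above: it's the only one that's genuinely happy on CPU. If you need cloning, expect XTTS v2 or Chatterbox to be slow without a GPU.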

In short: Kokoro for narration, XTTS for non-commercial clones, Chatterbox for commercial clones and emotion.
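
The decision tree above can be sketched as a small function. This is an illustration of the article's rules, not an official tool; `pick_tts` and its parameter names are made up for the example. The hardware question (step 4) is left out because it acts as a tiebreaker rather than a branch.

```python
def pick_tts(commercial: bool, need_clone: bool, need_emotion: bool) -> str:
    """Return the recommended model for a use case, following the steps above."""
    if need_clone:
        # XTTS v2 is non-commercial only, so commercial clones go to Chatterbox.
        return "Chatterbox" if commercial else "XTTS v2"
    if need_emotion:
        return "Chatterbox"
    # Clean narration with a prebuilt voice, commercial or not.
    return "Kokoro-82M"


print(pick_tts(commercial=True, need_clone=True, need_emotion=False))    # Chatterbox
print(pick_tts(commercial=False, need_clone=True, need_emotion=False))   # XTTS v2
print(pick_tts(commercial=True, need_clone=False, need_emotion=False))   # Kokoro-82M
```

Note that the `commercial` flag only changes the answer when cloning is required, which matches the point that license matters most for voice-clone projects.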

Key Takeaways

  1. There is no universal winner. Kokoro, XTTS v2, and Chatterbox each win a different job — narration, best clone, and emotion/commercial clone respectively.
  2. License is the decisive filter. Kokoro (Apache 2.0) and Chatterbox (MIT) are commercial-safe; XTTS v2 (CPML) is non-commercial only, and Coqui's shutdown means no commercial license is available.
  3. Kokoro can't clone voices. It ships 54 fixed voices across 8 languages and is built for narration, not impersonation — but it's tiny (82M) and runs faster than real time on CPU.
  4. XTTS v2 still clones best from a ~6-second sample across 17 languages, which is why it survives despite the dead company and restrictive license — for personal projects only.
  5. Chatterbox is the expressive, commercial-friendly clone. MIT-licensed, ~5-second cloning, and a unique emotion-exaggeration dial (0.0–1.0+).

Local AI Master Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.

Published: June 20, 2026 | Last Updated: June 20, 2026 | Manually Reviewed
