Chatterbox TTS Setup: Free ElevenLabs Killer (MIT, 2026)

June 20, 2026
10 min read
Local AI Master Research Team

Chatterbox TTS is Resemble AI's open-source, MIT-licensed text-to-speech model. You install it with pip install chatterbox-tts, it clones a voice from roughly 5 seconds of reference audio, and it was preferred over ElevenLabs 63.75% of the time in blind listening tests (run on Podonos). It ships in three flavors: the original 0.5B English model, a 23-language Multilingual version, and a leaner 350M "Turbo" build. Resemble bills it as the first open-source TTS with an emotion exaggeration knob you can dial from calm to dramatic. You can run it as a Python library or stand it up behind a self-hosted, OpenAI-compatible API on your own GPU.

If you have been paying ElevenLabs by the character and want a local model that sounds close (and sometimes better) for free, this is the one to try first. Below is the honest setup: what to install, how the variants differ, what the emotion control actually does, and how to self-host it as a drop-in API.

What is Chatterbox TTS?

Chatterbox is a production-grade open-source TTS model from Resemble AI. The original model is built on a 0.5B-parameter Llama backbone trained on roughly 0.5M hours of cleaned speech data, and Resemble released it under a permissive MIT license — so you can use it in commercial products, modify it, and redistribute it without paying per character.

Two things make it stand out from the older open-source crowd (Coqui XTTS, Piper, Bark):

  1. Emotion exaggeration control. Resemble bills it as the first open-source TTS to expose an explicit emotion-exaggeration parameter. You pass an exaggeration value (0.5 is neutral) to push the delivery from flat-and-clean toward expressive-and-dramatic.
  2. It actually competes with the paid leader. In blind A/B tests where listeners compared identical text and reference clips, Chatterbox was preferred over ElevenLabs 63.75% of the time. That is the headline claim, and it comes from Resemble's own evaluation (run on Podonos), so treat it as "very competitive," not gospel. It does match what most reviewers report.

Every Chatterbox output also carries Resemble's PerTh (Perceptual Threshold) watermark — an inaudible neural signal baked into the audio so synthetic speech stays traceable. That is a responsible-AI feature, not a limiter; the audio quality is unaffected.

How do you install Chatterbox TTS? (the 2-minute version)

The fastest path is the pip package. You need Python 3.10+ and, ideally, an NVIDIA GPU with CUDA (it runs on CPU and Apple Silicon too, just slower).

# 1. Create a clean environment (recommended)
python -m venv chatterbox-env
source chatterbox-env/bin/activate    # Windows: chatterbox-env\Scripts\activate

# 2. Install the package (pulls in PyTorch + model loader)
pip install chatterbox-tts

Then generate speech in a few lines of Python. Weights download automatically from Hugging Face on first run:

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")   # "cpu" or "mps" also work

text = "Chatterbox runs entirely on my own machine — no API key, no per-character bill."
wav = model.generate(text)
ta.save("output.wav", wav, model.sr)

That is the whole "hello world." To clone a voice, point the same call at a short reference clip (more on that next).
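
One detail worth handling before moving on is the device argument. The sketch below picks "cuda", "mps", or "cpu" depending on what PyTorch reports as available. The helper name pick_device is hypothetical (it is not part of chatterbox-tts), and the fallback to "cpu" when PyTorch is missing exists only so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", preferring a GPU backend when one is available."""
    try:
        import torch  # normally installed alongside chatterbox-tts
    except ImportError:
        return "cpu"  # keeps this sketch runnable on a machine without PyTorch

    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Selected device: {device}")

# With the package installed, pass the result straight to the loader:
#   model = ChatterboxTTS.from_pretrained(device=device)
```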

How does 5-second voice cloning work?

Chatterbox does zero-shot voice cloning: you give it a short sample of a target voice and it speaks new text in that voice without any fine-tuning. Resemble's guidance is that around 5 seconds of clean reference audio is enough — a clear, single-speaker clip with no music or background noise works best.

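import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# Pass a reference clip to speak the text in that voice (zero-shot, no fine-tuning)
wav = model.generate(
    "This sentence is read in the cloned voice.",
    audio_prompt_path="reference_voice.wav",   # ~5 seconds, clean, one speaker
)
ta.save("cloned_output.wav", wav, model.sr)    # write the cloned speech to disk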

A practical note from testing this kind of model: the quality of the clone tracks the quality of the reference far more than its length. A pristine 5-second clip beats a noisy 30-second one. If a clone sounds off, re-record the reference before you touch any parameters. (For a deeper, dedicated walkthrough of cloning workflows, see our local AI voice clone guide.)
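
Before re-recording, it can help to rule out the mechanical problems first. The sketch below uses only Python's standard wave module to report a clip's length, channel count, and sample rate. The function name inspect_reference and the 5-second threshold are illustrative assumptions based on the guidance above, and the check only works on uncompressed PCM WAV files. It cannot judge noise or music, which still needs your ears.

```python
import wave


def inspect_reference(path: str, min_seconds: float = 5.0) -> dict:
    """Read basic facts about a PCM WAV file and flag obvious problems."""
    with wave.open(path, "rb") as clip:
        channels = clip.getnchannels()
        sample_rate = clip.getframerate()
        seconds = clip.getnframes() / sample_rate

    warnings = []
    if seconds < min_seconds:
        warnings.append(f"clip is {seconds:.1f}s, shorter than {min_seconds:.0f}s")
    if channels != 1:
        warnings.append(f"clip has {channels} channels; a mono recording is easier to vet")
    return {
        "seconds": seconds,
        "channels": channels,
        "sample_rate": sample_rate,
        "warnings": warnings,
    }
```

Calling inspect_reference("reference_voice.wav") returns a dictionary whose warnings list is empty when the clip is mono and at least five seconds long.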

What does the emotion (exaggeration) parameter do?

This is Chatterbox's signature feature. Two knobs shape the delivery:

  • exaggeration controls expressiveness. The neutral default is 0.5; raising it adds emphasis and emotion, lowering it flattens the read. Values around 0.7+ push toward dramatic, performance-style delivery.
  • cfg_weight controls pacing and adherence; the default is 0.5. Lowering it (toward ~0.3) tends to slow the read into a more deliberate pace, which is why it pairs well with a higher exaggeration for emotional speech.

# Calm, steady narration
wav = model.generate(text, exaggeration=0.4, cfg_weight=0.5)

# Lively, expressive read (good for ads or characters)
wav = model.generate(text, exaggeration=0.8, cfg_weight=0.3)

In practice these two interact: very high exaggeration with a high cfg_weight can rush the cadence, so Resemble suggests dropping cfg_weight when you crank exaggeration. Start at the defaults, change one knob at a time, and you will dial in a voice quickly.
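
If you end up reusing the same few settings, one option is to keep them as named presets. This is a minimal sketch: the preset names and the generation_kwargs helper are hypothetical, and the numbers are simply the ones used in this article, not values recommended by Resemble.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliveryPreset:
    exaggeration: float
    cfg_weight: float


# Hypothetical preset names; the values mirror the examples in this article.
PRESETS = {
    "neutral": DeliveryPreset(exaggeration=0.5, cfg_weight=0.5),
    "calm": DeliveryPreset(exaggeration=0.4, cfg_weight=0.5),
    "lively": DeliveryPreset(exaggeration=0.8, cfg_weight=0.3),
}


def generation_kwargs(style: str) -> dict:
    """Translate a preset name into keyword arguments for model.generate()."""
    try:
        preset = PRESETS[style]
    except KeyError:
        raise ValueError(f"unknown style {style!r}; choose from {sorted(PRESETS)}") from None
    return {"exaggeration": preset.exaggeration, "cfg_weight": preset.cfg_weight}


print(generation_kwargs("lively"))

# With a loaded model:
#   wav = model.generate(text, **generation_kwargs("lively"))
```

Keeping the pairs together makes it harder to raise exaggeration and forget the matching cfg_weight change.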

The three Chatterbox variants compared

Chatterbox is a small family, not a single model. Pick by language need and hardware budget. All three are MIT-licensed and carry the PerTh watermark.

| Variant | Params | Languages | Best for | Cloning |
|---|---|---|---|---|
| Chatterbox (English) | 0.5B (Llama backbone) | English | The default; best English quality | ~5s zero-shot |
| Chatterbox Multilingual | 0.5B class | 23 languages | Non-English / mixed-language work | ~5s zero-shot |
| Chatterbox Turbo | 350M | English (lighter build) | Low-VRAM / real-time / streaming | ~5s zero-shot |

The Multilingual model supports 23 languages out of the box: Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, and Chinese.
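
For scripting, it is handy to have that list as a lookup table. The sketch below maps each language name to its ISO 639-1 code and rejects anything outside the 23. It assumes the multilingual model selects its language by a short code of this kind; check the project README for the exact argument name before wiring it in. The helper language_code is hypothetical.

```python
# ISO 639-1 codes for the 23 languages listed above.
LANGUAGE_CODES = {
    "Arabic": "ar", "Danish": "da", "German": "de", "Greek": "el",
    "English": "en", "Spanish": "es", "Finnish": "fi", "French": "fr",
    "Hebrew": "he", "Hindi": "hi", "Italian": "it", "Japanese": "ja",
    "Korean": "ko", "Malay": "ms", "Dutch": "nl", "Norwegian": "no",
    "Polish": "pl", "Portuguese": "pt", "Russian": "ru", "Swedish": "sv",
    "Swahili": "sw", "Turkish": "tr", "Chinese": "zh",
}


def language_code(name: str) -> str:
    """Return the two-letter code for a supported language name (case-insensitive)."""
    key = name.strip().title()
    if key not in LANGUAGE_CODES:
        raise ValueError(f"{name!r} is not one of the {len(LANGUAGE_CODES)} listed languages")
    return LANGUAGE_CODES[key]


print(language_code("french"))
```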

Turbo is the speed-and-efficiency pick. At 350M parameters it is meant to "run anywhere," and Resemble quotes roughly 75ms latency and about 6x-faster-than-real-time inference on a single GPU — i.e. it can generate audio well ahead of playback, which is what you want for streaming or interactive apps. The original 0.5B model is still the quality benchmark for English; Turbo trades a little fidelity for a much lighter footprint.

How fast and heavy is it really? (first-hand notes)

Numbers from Resemble and the community line up with what you would expect for sub-1B models. Treat the figures below as approximate and hardware-dependent.

| Variant | Params | Quoted latency | Throughput | Notes |
|---|---|---|---|---|
| Chatterbox (English) | 0.5B | sub-200ms range | real-time on a modern GPU | best English quality |
| Chatterbox Turbo | 350M | ~75ms | ~6x faster than real-time (1 GPU) | streaming / low-VRAM |

In my own informal test running the original 0.5B model on a single RTX 3090 (24GB), short sentences generated comfortably faster than real-time with the model fully on the GPU. VRAM use sat well under what a 14B language model would need: a 350M-500M speech model is tiny by today's standards, so an 8-12GB card is plenty. This is a single-machine impression, not a controlled benchmark, but it matches Resemble's real-time claims. With Chatterbox the bottleneck is almost never VRAM; what matters is keeping the model on the GPU rather than the CPU. If you only have a CPU, expect generation to be slower than real-time but still usable for batch jobs.
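
"Faster than real-time" is easy to check on your own hardware: divide the seconds of audio produced by the seconds spent generating it. The helper below is a minimal sketch with a hypothetical name, and the worked numbers are made up for illustration, not measurements.

```python
def speed_vs_real_time(num_samples: int, sample_rate: int, elapsed_seconds: float) -> float:
    """Seconds of audio produced per second of compute (above 1.0 beats real-time)."""
    if sample_rate <= 0 or elapsed_seconds <= 0:
        raise ValueError("sample_rate and elapsed_seconds must be positive")
    audio_seconds = num_samples / sample_rate
    return audio_seconds / elapsed_seconds


# Made-up example: 10 s of 24 kHz audio generated in 2.5 s of wall-clock time.
print(speed_vs_real_time(num_samples=240_000, sample_rate=24_000, elapsed_seconds=2.5))  # 4.0

# Usage with a loaded model (not run here):
#   import time
#   start = time.perf_counter()
#   wav = model.generate(text)
#   factor = speed_vs_real_time(wav.shape[-1], model.sr, time.perf_counter() - start)
```

Run it on a few sentences of different lengths, and discard the first call, which typically includes one-time warm-up cost.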

How do you self-host Chatterbox as an OpenAI-compatible API?

If you want to swap Chatterbox in wherever your app already calls a TTS API, the community Chatterbox-TTS-Server project wraps the model in a server with a web UI and OpenAI-compatible endpoints. It exposes /v1/audio/speech and /v1/audio/voices (drop-in for OpenAI's TTS API) plus a richer native /tts endpoint, and it can hot-swap between the Original, Multilingual (23 languages), and Turbo models.

# Clone and run the self-hosted server
git clone https://github.com/devnen/Chatterbox-TTS-Server.git
cd Chatterbox-TTS-Server
pip install -r requirements.txt
python server.py
# Web UI + API default to http://localhost:8004

It runs accelerated on NVIDIA (CUDA), AMD (ROCm), Apple Silicon (MPS), or CPU fallback, handles audiobook-scale text by splitting and concatenating chunks, and supports voice cloning from uploaded reference clips plus a folder of predefined voices. Once it is up, point any OpenAI-TTS client at http://localhost:8004/v1/audio/speech and you have replaced a paid API with a local one.
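
As a rough illustration of what "point a client at it" means, here is a standard-library sketch that posts an OpenAI-style speech request to that address. The JSON field names (model, input, voice, response_format) follow OpenAI's speech endpoint. The values "chatterbox" and "default.wav" are placeholders and an assumption on my part, so check the server's documentation or its voices endpoint for names it actually accepts.

```python
import json
import urllib.request

SPEECH_URL = "http://localhost:8004/v1/audio/speech"  # default address quoted above


def build_speech_request(text: str, voice: str = "default.wav",
                         url: str = SPEECH_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {
        "model": "chatterbox",      # placeholder model name (assumption)
        "input": text,
        "voice": voice,             # placeholder voice name (assumption)
        "response_format": "wav",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "speech.wav", **kwargs) -> str:
    """Send the request to a running server and write the returned audio to disk."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs), timeout=120) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

With the server running, synthesize("Hello from my own GPU.") should leave a speech.wav next to the script. Splitting request construction from sending keeps the payload easy to inspect without a live server.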

Chatterbox vs the other open-source TTS options

Chatterbox is excellent, but it is not the only good local TTS in 2026, and the right pick depends on the job:

  • Want the best English clone quality and an emotion knob? Chatterbox (original 0.5B) is the pick.
  • Need a specific non-English language? Use Chatterbox Multilingual, or compare against XTTS v2, which has long been the go-to multilingual cloner.
  • Need the lowest latency / smallest footprint? Chatterbox Turbo (350M), or a fixed-voice model like Kokoro if you do not need cloning at all.
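
The same rule of thumb can be written as a small function, which is useful if you pick a model per request. This is only a sketch of the list above: the function name and its arguments are hypothetical, and it does not check whether a language is among the 23 that Multilingual supports.

```python
def suggest_tts(language: str = "en", needs_cloning: bool = True,
                low_latency: bool = False) -> str:
    """Encode the bullet list above as a rule of thumb, not an official recommendation."""
    if language.lower() != "en":
        return "Chatterbox Multilingual"      # or benchmark it against XTTS v2
    if low_latency:
        return "Chatterbox Turbo" if needs_cloning else "Kokoro"
    return "Chatterbox (original 0.5B)"


print(suggest_tts())                                        # Chatterbox (original 0.5B)
print(suggest_tts(language="de"))                           # Chatterbox Multilingual
print(suggest_tts(low_latency=True))                        # Chatterbox Turbo
print(suggest_tts(needs_cloning=False, low_latency=True))   # Kokoro
```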

For a side-by-side look at how Chatterbox stacks up against XTTS v2 and other cloners on a real machine, our local AI voice clone guide walks through the trade-offs with audio in mind.

Key Takeaways

  1. Chatterbox TTS is free, MIT-licensed, and competitive with ElevenLabs — preferred 63.75% of the time in blind tests, with no per-character billing.
  2. Setup is one command: pip install chatterbox-tts, then a few lines of Python. Weights download on first run.
  3. Voice cloning needs only ~5 seconds of clean, single-speaker reference audio — clip quality matters more than length.
  4. The emotion exaggeration knob is the differentiator. Start at exaggeration=0.5 / cfg_weight=0.5 and adjust one at a time.
  5. Three variants: original 0.5B (best English), Multilingual (23 languages), and Turbo (350M, ~75ms latency, ~6x real-time) for low-VRAM/streaming.
  6. You can self-host it as an OpenAI-compatible API via Chatterbox-TTS-Server (/v1/audio/speech) on NVIDIA, AMD, Apple Silicon, or CPU.

Local AI Master Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.

📅 Published: June 20, 2026 · 🔄 Last Updated: June 20, 2026 · ✓ Manually Reviewed
