Chatterbox TTS Setup: Free ElevenLabs Killer (MIT, 2026)
Chatterbox TTS is Resemble AI's open-source, MIT-licensed text-to-speech model. You install it with pip install chatterbox-tts, it clones a voice from roughly 5 seconds of reference audio, and listeners preferred it over ElevenLabs 63.75% of the time in blind listening tests (run on Podonos). It ships in three flavors: the original 0.5B English model, a 23-language Multilingual version, and a leaner 350M "Turbo" build. Resemble bills it as the first open-source TTS with an emotion exaggeration knob you can dial from calm to dramatic. You can run it as a Python library or stand it up behind a self-hosted, OpenAI-compatible API on your own GPU.
If you have been paying ElevenLabs by the character and want a local model that sounds close (and sometimes better) for free, this is the one to try first. Below is the honest setup: what to install, how the variants differ, what the emotion control actually does, and how to self-host it as a drop-in API.
What is Chatterbox TTS?
Chatterbox is a production-grade open-source TTS model from Resemble AI. The original model is built on a 0.5B-parameter Llama backbone trained on roughly 0.5M hours of cleaned speech data, and Resemble released it under a permissive MIT license — so you can use it in commercial products, modify it, and redistribute it without paying per character.
Two things make it stand out from the older open-source crowd (Coqui XTTS, Piper, Bark):
- Emotion exaggeration control. Resemble bills it as the first open-source TTS to expose an explicit emotion-exaggeration parameter. You pass an exaggeration value (0.5 is neutral) to push the delivery from flat-and-clean toward expressive-and-dramatic.
- It actually competes with the paid leader. In blind A/B tests where listeners compared identical text and reference clips, Chatterbox was preferred over ElevenLabs 63.75% of the time. That is the headline claim, and it comes from an evaluation Resemble itself published (run on Podonos), so treat it as "very competitive," not gospel. It does match what most reviewers report.
Every Chatterbox output also carries Resemble's PerTh (Perceptual Threshold) watermark — an inaudible neural signal baked into the audio so synthetic speech stays traceable. That is a responsible-AI feature, not a limiter; the audio quality is unaffected.
How do you install Chatterbox TTS? (the 2-minute version)
The fastest path is the pip package. You need Python 3.10+ and, ideally, an NVIDIA GPU with CUDA (it runs on CPU and Apple Silicon too, just slower).
# 1. Create a clean environment (recommended)
python -m venv chatterbox-env
source chatterbox-env/bin/activate # Windows: chatterbox-env\Scripts\activate
# 2. Install the package (pulls in PyTorch + model loader)
pip install chatterbox-tts
Then generate speech in a few lines of Python. Weights download automatically from Hugging Face on first run:
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device="cuda") # "cpu" or "mps" also work
text = "Chatterbox runs entirely on my own machine — no API key, no per-character bill."
wav = model.generate(text)
ta.save("output.wav", wav, model.sr)
That is the whole "hello world." To clone a voice, point the same call at a short reference clip (more on that next).
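If the same script has to run on machines with different hardware, the device string can be chosen at runtime instead of hard-coded. The helper below is a minimal sketch, not part of Chatterbox: `pick_device` is a hypothetical name, and it relies only on PyTorch's standard availability checks, falling back to "cpu" when nothing faster is found.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", whichever is usable on this machine."""
    try:
        import torch  # pulled in as a dependency of chatterbox-tts
    except ImportError:
        return "cpu"  # no PyTorch installed, so nothing to accelerate with

    if torch.cuda.is_available():  # NVIDIA GPU with a working CUDA build
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    if mps is not None and mps.is_available():  # Apple Silicon
        return "mps"
    return "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Using device: {device}")
    # Then pass it to the loader from the example above:
    # model = ChatterboxTTS.from_pretrained(device=device)
```

On a CPU-only machine this simply returns "cpu", so the rest of the script does not need a separate code path.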
How does 5-second voice cloning work?
Chatterbox does zero-shot voice cloning: you give it a short sample of a target voice and it speaks new text in that voice without any fine-tuning. Resemble's guidance is that around 5 seconds of clean reference audio is enough — a clear, single-speaker clip with no music or background noise works best.
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate(
    "This sentence is read in the cloned voice.",
    audio_prompt_path="reference_voice.wav",  # ~5 seconds, clean, one speaker
)
ta.save("cloned.wav", wav, model.sr)  # write the result, as in the first example
A practical note from testing this kind of model: the quality of the clone tracks the quality of the reference far more than its length. A pristine 5-second clip beats a noisy 30-second one. If a clone sounds off, re-record the reference before you touch any parameters. (For a deeper, dedicated walkthrough of cloning workflows, see our local AI voice clone guide.)
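Before blaming the model for a bad clone, it is worth a quick sanity check on the reference file itself. The sketch below uses only Python's standard wave module and works on uncompressed PCM WAV files. The function name `check_reference` and the thresholds (5 seconds, mono, 16 kHz) are my own assumptions for illustration, not Resemble's requirements, and a format check like this cannot detect noise or music, so you still need to listen to the clip.

```python
import wave


def check_reference(path: str, min_seconds: float = 5.0) -> list[str]:
    """Return warnings for a PCM WAV reference clip; an empty list means it looks OK."""
    with wave.open(path, "rb") as clip:
        channels = clip.getnchannels()
        rate = clip.getframerate()
        seconds = clip.getnframes() / float(rate)

    warnings = []
    if seconds < min_seconds:
        warnings.append(f"clip is {seconds:.1f}s; aim for about {min_seconds:.0f}s or more")
    if channels != 1:
        # Assumption: a mono recording is the safer input for a single speaker.
        warnings.append(f"clip has {channels} channels; consider a mono recording")
    if rate < 16000:
        # Assumption: 16 kHz is used here as a rough floor for usable speech.
        warnings.append(f"sample rate is {rate} Hz; low rates lose vocal detail")
    return warnings
```

Calling `check_reference("reference_voice.wav")` returns a list of human-readable warnings, so it slots easily into a batch script that screens many clips.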
What does the emotion (exaggeration) parameter do?
This is Chatterbox's signature feature. Two knobs shape the delivery:
- exaggeration controls expressiveness. The neutral default is 0.5; raising it adds emphasis and emotion, lowering it flattens the read. Values around 0.7+ push toward dramatic, performance-style delivery.
- cfg_weight controls pacing and adherence; the default is 0.5. Lowering it (toward ~0.3) tends to slow the delivery and make it more deliberate, which is why it pairs well with a higher exaggeration: higher exaggeration tends to speed speech up, and the lower cfg_weight compensates.
# Calm, steady narration
wav = model.generate(text, exaggeration=0.4, cfg_weight=0.5)
# Lively, expressive read (good for ads or characters)
wav = model.generate(text, exaggeration=0.8, cfg_weight=0.3)
In practice these two interact: very high exaggeration with a high cfg_weight can rush the cadence, so Resemble suggests dropping cfg_weight when you crank exaggeration. Start at the defaults, change one knob at a time, and you will dial in a voice quickly.
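The pairing rule above is easy to encode so you do not have to remember it. This is a sketch under stated assumptions: `delivery_settings` is a hypothetical helper, and the 0.7 cut-off and the 0.3 / 0.5 values simply mirror the numbers quoted in this section rather than any official preset.

```python
def delivery_settings(exaggeration: float) -> dict[str, float]:
    """Pair an exaggeration value with a cfg_weight, following the rule of thumb above."""
    # Expressive reads (about 0.7 and up) get the lower cfg_weight of 0.3;
    # everything else keeps the 0.5 default.
    cfg_weight = 0.3 if exaggeration >= 0.7 else 0.5
    return {"exaggeration": exaggeration, "cfg_weight": cfg_weight}


calm = delivery_settings(0.4)    # {'exaggeration': 0.4, 'cfg_weight': 0.5}
lively = delivery_settings(0.8)  # {'exaggeration': 0.8, 'cfg_weight': 0.3}

# The dict unpacks straight into the call used earlier:
# wav = model.generate(text, **lively)
```

Treat the output as a starting point; the advice to change one knob at a time still applies once you start tuning by ear.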
The three Chatterbox variants compared
Chatterbox is a small family, not a single model. Pick by language need and hardware budget. All three are MIT-licensed and carry the PerTh watermark.
| Variant | Params | Languages | Best for | Cloning |
|---|---|---|---|---|
| Chatterbox (English) | 0.5B (Llama backbone) | English | The default — best English quality | ~5s zero-shot |
| Chatterbox Multilingual | 0.5B class | 23 languages | Non-English / mixed-language work | ~5s zero-shot |
| Chatterbox Turbo | 350M | English (lighter build) | Low-VRAM / real-time / streaming | ~5s zero-shot |
The Multilingual model supports 23 languages out of the box: Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, and Chinese.
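If your app accepts a language name from a user, a small lookup keeps unsupported requests from reaching the model. The table below maps the 23 languages listed above to their ISO 639-1 codes. That the Multilingual model expects these two-letter codes is an assumption on my part, so check the project README for the exact argument before relying on it; `language_code` is a hypothetical helper.

```python
# The 23 languages from the list above, keyed by name, with ISO 639-1 codes.
SUPPORTED_LANGUAGES = {
    "Arabic": "ar", "Danish": "da", "German": "de", "Greek": "el",
    "English": "en", "Spanish": "es", "Finnish": "fi", "French": "fr",
    "Hebrew": "he", "Hindi": "hi", "Italian": "it", "Japanese": "ja",
    "Korean": "ko", "Malay": "ms", "Dutch": "nl", "Norwegian": "no",
    "Polish": "pl", "Portuguese": "pt", "Russian": "ru", "Swedish": "sv",
    "Swahili": "sw", "Turkish": "tr", "Chinese": "zh",
}


def language_code(name: str) -> str:
    """Look up the two-letter code for a supported language name (case-insensitive)."""
    key = name.strip().title()
    try:
        return SUPPORTED_LANGUAGES[key]
    except KeyError:
        raise ValueError(f"{name!r} is not in the 23-language list") from None
```

Failing loudly on an unknown language is a deliberate choice here: a silent fallback to English would produce audio that sounds fine but reads the wrong text rules.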
Turbo is the speed-and-efficiency pick. At 350M parameters it is meant to "run anywhere," and Resemble quotes roughly 75ms latency and about 6x-faster-than-real-time inference on a single GPU — i.e. it can generate audio well ahead of playback, which is what you want for streaming or interactive apps. The original 0.5B model is still the quality benchmark for English; Turbo trades a little fidelity for a much lighter footprint.
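To make "6x faster than real time" concrete, here is the arithmetic, assuming the figure means audio duration divided by generation time (the usual reading of such claims). Both function names are hypothetical, and real numbers will vary with hardware and text length.

```python
def speed_multiple(audio_seconds: float, generation_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return audio_seconds / generation_seconds


def generation_time(audio_seconds: float, multiple: float) -> float:
    """Compute time needed for a clip at a given speed multiple."""
    return audio_seconds / multiple


# At the quoted ~6x, one minute of speech takes about ten seconds to generate.
one_minute = generation_time(60.0, 6.0)  # 10.0
```

Anything above 1.0 from `speed_multiple` means generation outpaces playback, which is the condition streaming needs; the latency figure is a separate question of how soon the first audio arrives.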
How fast and heavy is it really? (first-hand notes)
Numbers from Resemble and the community line up with what you would expect for sub-1B models. Treat the figures below as approximate and hardware-dependent.
| Variant | Params | Quoted latency | Throughput | Notes |
|---|---|---|---|---|
| Chatterbox (English) | 0.5B | sub-200ms range | real-time on a modern GPU | best English quality |
| Chatterbox Turbo | 350M | ~75ms | ~6x faster than real-time (1 GPU) | streaming / low-VRAM |
In my own informal test running the original 0.5B model on a single RTX 3090 (24GB), short sentences generated comfortably faster than real-time with the model fully on the GPU, and VRAM use sat well under what a 14B language model would need. A 350M-500M speech model is tiny by today's standards, so an 8-12GB card is plenty. This is a single-machine impression, not a controlled benchmark, but it matches Resemble's real-time claims: with Chatterbox the bottleneck is almost never VRAM; what matters is keeping the model on the GPU rather than the CPU. If you only have a CPU, expect generation to be slower than real-time but still usable for batch jobs.
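A back-of-the-envelope estimate shows why VRAM is rarely the constraint. The sketch below counts only the memory for the weights (parameters times bytes per parameter). It ignores activations, caches, and any other components loaded alongside the main model, and the precisions are assumptions, so read the results as a floor rather than a measurement.

```python
def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


FP32, FP16 = 4, 2  # bytes per parameter at each precision

original_fp32 = weight_memory_gb(0.5e9, FP32)   # 0.5B model, full precision
turbo_fp16 = weight_memory_gb(0.35e9, FP16)     # 350M Turbo, half precision
llm_14b_fp16 = weight_memory_gb(14e9, FP16)     # a 14B language model, for scale

print(f"Chatterbox 0.5B (fp32): ~{original_fp32:.1f} GB")
print(f"Chatterbox Turbo (fp16): ~{turbo_fp16:.1f} GB")
print(f"14B LLM (fp16): ~{llm_14b_fp16:.1f} GB")
```

Even at full precision the 0.5B weights come to roughly 2 GB, against roughly 28 GB for a 14B model at half precision, which is consistent with an 8-12GB card having plenty of headroom.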
How do you self-host Chatterbox as an OpenAI-compatible API?
If you want to swap Chatterbox in wherever your app already calls a TTS API, the community Chatterbox-TTS-Server project wraps the model in a server with a web UI and OpenAI-compatible endpoints. It exposes /v1/audio/speech and /v1/audio/voices (drop-in for OpenAI's TTS API) plus a richer native /tts endpoint, and it can hot-swap between the Original, Multilingual (23 languages), and Turbo models.
# Clone and run the self-hosted server
git clone https://github.com/devnen/Chatterbox-TTS-Server.git
cd Chatterbox-TTS-Server
pip install -r requirements.txt
python server.py
# Web UI + API default to http://localhost:8004
It runs accelerated on NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (MPS), with a CPU fallback. It handles audiobook-scale text by splitting and concatenating chunks, and it supports voice cloning from uploaded reference clips plus a folder of predefined voices. Once it is up, point any OpenAI-TTS client at http://localhost:8004/v1/audio/speech and you have replaced a paid API with a local one.
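For a client that needs no extra dependencies, a request to the speech endpoint can be sketched with the standard library alone. The JSON field names (model, input, voice, response_format) follow OpenAI's speech API. The model value, the voice name "default.wav", and the "wav" format are placeholders I have assumed, so check the server's own docs or web UI for the values it actually accepts.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8004"  # default address quoted above


def build_speech_request(text: str, voice: str = "default.wav") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style POST to /v1/audio/speech."""
    payload = {
        "model": "chatterbox",     # placeholder; OpenAI-style clients send a model field
        "input": text,
        "voice": voice,            # hypothetical voice name
        "response_format": "wav",  # assumed to be supported by the server
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "speech.wav") -> None:
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(build_speech_request(text), timeout=120) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)


# With the server running:
# synthesize("Hello from a local, self-hosted voice.")
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any audio is generated, which helps when you are matching field values to what the server expects.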
Chatterbox vs the other open-source TTS options
Chatterbox is excellent, but it is not the only good local TTS in 2026, and the right pick depends on the job:
- Want the best English clone quality and an emotion knob? Chatterbox (original 0.5B) is the pick.
- Need a specific non-English language? Use Chatterbox Multilingual, or compare against XTTS v2, which has long been the go-to multilingual cloner.
- Need the lowest latency / smallest footprint? Chatterbox Turbo (350M), or a fixed-voice model like Kokoro if you do not need cloning at all.
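The same decision list can be written down as a few lines of code, which is handy if you route requests to different models automatically. This is only a restatement of the bullets above: `pick_model` and its parameters are hypothetical, and it does not cover the XTTS v2 comparison, which still needs a listening test.

```python
def pick_model(language: str = "en", low_latency: bool = False,
               needs_cloning: bool = True) -> str:
    """Map the three questions above to a model name."""
    if language.lower() != "en":
        return "Chatterbox Multilingual"  # non-English or mixed-language work
    if low_latency:
        # Turbo keeps cloning; a fixed-voice model is enough if you do not need it.
        return "Chatterbox Turbo" if needs_cloning else "Kokoro"
    return "Chatterbox (English)"  # best English quality plus the emotion knob


default_choice = pick_model()
streaming_choice = pick_model(low_latency=True)
french_choice = pick_model(language="fr")
```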
For a side-by-side look at how Chatterbox stacks up against XTTS v2 and other cloners on a real machine, our local AI voice clone guide walks through the trade-offs with audio in mind.
Key Takeaways
- Chatterbox TTS is free, MIT-licensed, and competitive with ElevenLabs — preferred 63.75% of the time in blind tests, with no per-character billing.
- Setup is one command: pip install chatterbox-tts, then a few lines of Python. Weights download on first run.
- Voice cloning needs only ~5 seconds of clean, single-speaker reference audio; clip quality matters more than length.
- The emotion exaggeration knob is the differentiator. Start at exaggeration=0.5 / cfg_weight=0.5 and adjust one at a time.
- Three variants: original 0.5B (best English), Multilingual (23 languages), and Turbo (350M, ~75ms latency, ~6x real-time) for low-VRAM/streaming.
- You can self-host it as an OpenAI-compatible API via Chatterbox-TTS-Server (/v1/audio/speech) on NVIDIA, AMD, Apple Silicon, or CPU.
Next Steps
- New to local TTS? Compare cloners head to head in our local AI voice clone guide before you settle on one.
- Need strong multilingual cloning? Read the XTTS v2 voice cloning guide to see where it still beats (and loses to) Chatterbox.
- Confirm the official details and license on the Resemble AI Chatterbox GitHub repo before deploying to production.