Chatterbox TTS is Resemble AI's open-source, MIT-licensed text-to-speech model. You install it with pip install chatterbox-tts, it clones a voice from roughly 5 seconds of reference audio, and listeners preferred it over ElevenLabs 63.75% of the time in blind listening tests (run on Podonos). It ships in three flavors: the original 0.5B English model, a 23-language Multilingual version, and a leaner 350M "Turbo" build. Resemble bills it as the first open-source TTS with an emotion exaggeration knob you can dial from calm to dramatic. You can run it as a Python library or stand it up behind a self-hosted, OpenAI-compatible API on your own GPU.

If you have been paying ElevenLabs by the character and want a local model that sounds close (and sometimes better) for free, this is the one to try first. Below is the honest setup: what to install, how the variants differ, what the emotion control actually does, and how to self-host it as a drop-in API.

What is Chatterbox TTS?

Chatterbox is a production-grade open-source TTS model from Resemble AI. The original model is built on a 0.5B-parameter Llama backbone trained on roughly 0.5M hours of cleaned speech data, and Resemble released it under a permissive MIT license — so you can use it in commercial products, modify it, and redistribute it without paying per character.

Two things make it stand out from the older open-source crowd (Coqui XTTS, Piper, Bark):

Emotion exaggeration control. Resemble bills it as the first open-source TTS to expose an explicit emotion-exaggeration parameter. You pass an exaggeration value (0.5 is neutral) to push the delivery from flat-and-clean toward expressive-and-dramatic.
It actually competes with the paid leader. In blind A/B tests where listeners compared identical text and reference clips, Chatterbox was preferred over ElevenLabs 63.75% of the time. That is the headline claim, and it comes from an evaluation Resemble itself reported (run on Podonos), so treat it as "very competitive," not gospel. It does match what most reviewers report.

Every Chatterbox output also carries Resemble's PerTh (Perceptual Threshold) watermark — an inaudible neural signal baked into the audio so synthetic speech stays traceable. That is a responsible-AI feature, not a limiter; the audio quality is unaffected.

How do you install Chatterbox TTS? (the 2-minute version)

The fastest path is the pip package. You need Python 3.10+ and, ideally, an NVIDIA GPU with CUDA (it runs on CPU and Apple Silicon too, just slower).

# 1. Create a clean environment (recommended)
python -m venv chatterbox-env
source chatterbox-env/bin/activate    # Windows: chatterbox-env\Scripts\activate

# 2. Install the package (pulls in PyTorch + model loader)
pip install chatterbox-tts

Then generate speech in a few lines of Python. Weights download automatically from Hugging Face on first run:

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")   # "cpu" or "mps" also work

text = "Chatterbox runs entirely on my own machine — no API key, no per-character bill."
wav = model.generate(text)
ta.save("output.wav", wav, model.sr)

That is the whole "hello world." To clone a voice, point the same call at a short reference clip (more on that next).
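Since the install notes mention CUDA, Apple Silicon, and CPU, a small helper can pick the device string for you instead of hard-coding "cuda". This is a minimal sketch: pick_device, detect_device, and load_model are hypothetical helper names, not part of the chatterbox package, and the only library calls assumed are the standard PyTorch availability checks and the from_pretrained call shown above.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device string, preferring CUDA, then Apple's MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are usable on this machine."""
    import torch  # imported lazily; pulled in by chatterbox-tts

    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


def load_model():
    """Load Chatterbox on the best available device (hypothetical convenience wrapper)."""
    from chatterbox.tts import ChatterboxTTS

    return ChatterboxTTS.from_pretrained(device=detect_device())
```

Calling load_model() in place of the hard-coded from_pretrained(device="cuda") line keeps the same script usable on a laptop and a GPU box.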

How does 5-second voice cloning work?

Chatterbox does zero-shot voice cloning: you give it a short sample of a target voice and it speaks new text in that voice without any fine-tuning. Resemble's guidance is that around 5 seconds of clean reference audio is enough — a clear, single-speaker clip with no music or background noise works best.

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

wav = model.generate(
    "This sentence is read in the cloned voice.",
    audio_prompt_path="reference_voice.wav",   # ~5 seconds, clean, one speaker
)
ta.save("cloned.wav", wav, model.sr)   # write the result, as in the hello-world example

A practical note from testing this kind of model: the quality of the clone tracks the quality of the reference far more than its length. A pristine 5-second clip beats a noisy 30-second one. If a clone sounds off, re-record the reference before you touch any parameters. (For a deeper, dedicated walkthrough of cloning workflows, see our local AI voice clone guide.)
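Before re-recording, it can help to rule out the mechanical problems first. The sketch below uses only Python's standard wave module to flag clips that are too short, not mono, or recorded at a low sample rate. The function name check_reference_clip is hypothetical, the 5-second floor comes from the guidance above, and the 16 kHz threshold is an assumption of this example rather than a documented Chatterbox requirement. It reads uncompressed PCM WAV only and cannot detect noise or music; that still needs your ears.

```python
import wave


def check_reference_clip(path: str, min_seconds: float = 5.0) -> list[str]:
    """Return warnings about a PCM WAV reference clip (an empty list means none found)."""
    warnings = []
    with wave.open(path, "rb") as clip:
        channels = clip.getnchannels()
        rate = clip.getframerate()
        seconds = clip.getnframes() / float(rate)

    if seconds < min_seconds:
        warnings.append(f"clip is {seconds:.1f}s; aim for about {min_seconds:.0f}s or more")
    if channels != 1:
        warnings.append(f"clip has {channels} channels; a mono, single-speaker take is safer")
    if rate < 16000:  # assumed threshold, not an official requirement
        warnings.append(f"sample rate is {rate} Hz; low-rate audio may lose vocal detail")
    return warnings
```

Running check_reference_clip("reference_voice.wav") before model.generate and printing any warnings is a cheap first step when a clone sounds off.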

What does the emotion (exaggeration) parameter do?

This is Chatterbox's signature feature. Two knobs shape the delivery:

exaggeration controls expressiveness. The neutral default is 0.5; raising it adds emphasis and emotion, lowering it flattens the read. Values around 0.7+ push toward dramatic, performance-style delivery.
cfg_weight controls pacing and adherence; the default is 0.5. Lowering it (toward ~0.3) tends to slow the delivery into a more deliberate pace, which is why it pairs well with a higher exaggeration for emotional speech.

# Calm, steady narration
wav = model.generate(text, exaggeration=0.4, cfg_weight=0.5)

# Lively, expressive read (good for ads or characters)
wav = model.generate(text, exaggeration=0.8, cfg_weight=0.3)

In practice these two interact: very high exaggeration with a high cfg_weight can rush the cadence, so Resemble suggests dropping cfg_weight when you crank exaggeration. Start at the defaults, change one knob at a time, and you will dial in a voice quickly.
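The "one knob at a time" advice can be turned into a small sweep. The sketch below builds keyword sets that vary a single parameter while holding the other at its default, then renders one clip per setting. sweep_one_knob and render_sweep are hypothetical helpers; the only library call assumed is model.generate(text, exaggeration=..., cfg_weight=...) as used in the examples above.

```python
from typing import Optional

# Neutral defaults quoted in the text above.
DEFAULTS = {"exaggeration": 0.5, "cfg_weight": 0.5}


def sweep_one_knob(knob: str, values: list, base: Optional[dict] = None) -> list:
    """Build generate() keyword sets that vary one knob and keep the other fixed."""
    settings = dict(DEFAULTS if base is None else base)
    if knob not in settings:
        raise ValueError(f"unknown knob: {knob!r}")
    return [{**settings, knob: value} for value in values]


def render_sweep(model, text: str, knob: str, values: list) -> list:
    """Call model.generate once per setting; model is a loaded ChatterboxTTS instance."""
    return [model.generate(text, **kwargs) for kwargs in sweep_one_knob(knob, values)]
```

For example, render_sweep(model, text, "exaggeration", [0.4, 0.5, 0.7]) gives three takes to compare by ear before you move on to cfg_weight.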

The three Chatterbox variants compared

Chatterbox is a small family, not a single model. Pick by language need and hardware budget. All three are MIT-licensed and carry the PerTh watermark.

Variant	Params	Languages	Best for	Cloning
Chatterbox (English)	0.5B (Llama backbone)	English	The default — best English quality	~5s zero-shot
Chatterbox Multilingual	0.5B class	23 languages	Non-English / mixed-language work	~5s zero-shot
Chatterbox Turbo	350M	English (lighter build)	Low-VRAM / real-time / streaming	~5s zero-shot

The Multilingual model supports 23 languages out of the box: Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, and Chinese.

Turbo is the speed-and-efficiency pick. At 350M parameters it is meant to "run anywhere," and Resemble quotes roughly 75ms latency and about 6x-faster-than-real-time inference on a single GPU — i.e. it can generate audio well ahead of playback, which is what you want for streaming or interactive apps. The original 0.5B model is still the quality benchmark for English; Turbo trades a little fidelity for a much lighter footprint.
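The selection logic in the table can be written down as a few lines of code. This is only a sketch of the decision rule described above: choose_variant is a hypothetical helper, the returned labels are plain strings rather than library class names, and the two-letter codes are ISO 639-1 codes for the 23 listed languages, which may or may not match the identifiers a given Chatterbox release expects.

```python
# ISO 639-1 codes for the 23 languages listed above (assumed mapping).
MULTILINGUAL_CODES = {
    "ar", "da", "de", "el", "en", "es", "fi", "fr", "he", "hi", "it", "ja",
    "ko", "ms", "nl", "no", "pl", "pt", "ru", "sv", "sw", "tr", "zh",
}


def choose_variant(language: str, low_latency: bool = False) -> str:
    """Pick a variant label from the language and a latency or VRAM constraint."""
    code = language.lower()
    if code not in MULTILINGUAL_CODES:
        raise ValueError(f"language {language!r} is not in the supported list")
    if code != "en":
        return "multilingual"  # the only variant covering non-English text
    return "turbo" if low_latency else "original"
```

The rule is deliberately blunt: anything non-English goes to Multilingual, and English splits between the original model for quality and Turbo for speed.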

How fast and heavy is it really? (first-hand notes)

Numbers from Resemble and the community line up with what you would expect for sub-1B models. Treat the figures below as approximate and hardware-dependent.

Variant	Params	Quoted latency	Throughput	Notes
Chatterbox (English)	0.5B	sub-200ms range	real-time on a modern GPU	best English quality
Chatterbox Turbo	350M	~75ms	~6x faster than real-time (1 GPU)	streaming / low-VRAM

In my own informal test running the original 0.5B model on a single RTX 3090 (24GB), short sentences generated comfortably faster than real-time with the model fully on the GPU. VRAM use sat well under what a 14B language model would need: a 350M-500M speech model is tiny by today's standards, so an 8-12GB card is plenty. This is a single-machine impression, not a controlled benchmark, but it matches Resemble's real-time claims. The bottleneck with Chatterbox is almost never VRAM; what matters is keeping the model on the GPU rather than the CPU. If you only have a CPU, expect generation to be slower than real-time but still usable for batch jobs.
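"Faster than real-time" is easy to check on your own hardware: divide the seconds of audio produced by the seconds spent generating it. The sketch below does that arithmetic and wraps a timing call around any generate function. real_time_factor and time_generation are hypothetical helpers, and the code assumes the returned waveform exposes its sample count as the last dimension of .shape (as a tensor does) or through len().

```python
import time


def real_time_factor(num_samples: int, sample_rate: int, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute; above 1.0 is faster than real time."""
    if sample_rate <= 0 or generation_seconds <= 0:
        raise ValueError("sample_rate and generation_seconds must be positive")
    return (num_samples / sample_rate) / generation_seconds


def time_generation(generate, text: str, sample_rate: int) -> float:
    """Time one generate(text) call and return its real-time factor."""
    start = time.perf_counter()
    wav = generate(text)
    elapsed = time.perf_counter() - start
    # Tensors report samples in the last dimension; plain sequences use len().
    num_samples = wav.shape[-1] if hasattr(wav, "shape") else len(wav)
    return real_time_factor(num_samples, sample_rate, elapsed)
```

With a loaded model, time_generation(model.generate, text, model.sr) gives a rough figure. Run it a few times and ignore the first call, which typically includes warm-up cost, and treat the result as an impression rather than a benchmark.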

How do you self-host Chatterbox as an OpenAI-compatible API?

If you want to swap Chatterbox in wherever your app already calls a TTS API, the community Chatterbox-TTS-Server project wraps the model in a server with a web UI and OpenAI-compatible endpoints. It exposes /v1/audio/speech and /v1/audio/voices (drop-in for OpenAI's TTS API) plus a richer native /tts endpoint, and it can hot-swap between the Original, Multilingual (23 languages), and Turbo models.

# Clone and run the self-hosted server
git clone https://github.com/devnen/Chatterbox-TTS-Server.git
cd Chatterbox-TTS-Server
pip install -r requirements.txt
python server.py
# Web UI + API default to http://localhost:8004

It runs accelerated on NVIDIA (CUDA), AMD (ROCm), Apple Silicon (MPS), or CPU fallback, handles audiobook-scale text by splitting and concatenating chunks, and supports voice cloning from uploaded reference clips plus a folder of predefined voices. Once it is up, point any OpenAI-TTS client at http://localhost:8004/v1/audio/speech and you have replaced a paid API with a local one.
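As a rough illustration of the drop-in idea, the sketch below posts to the /v1/audio/speech endpoint using only the Python standard library. The JSON field names (model, input, voice, response_format) follow OpenAI's speech API; which model and voice values the server actually accepts is an assumption here, so check the project's own documentation. build_speech_request and synthesize_to_file are hypothetical helper names.

```python
import json
import urllib.request


def build_speech_request(text: str, voice: str,
                         base_url: str = "http://localhost:8004",
                         model: str = "chatterbox",      # placeholder value (assumption)
                         response_format: str = "wav") -> urllib.request.Request:
    """Build an OpenAI-style POST request for the local speech endpoint."""
    payload = {"model": model, "input": text, "voice": voice,
               "response_format": response_format}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, voice: str, out_path: str, **kwargs) -> int:
    """Send the request to a running server and save the audio; returns bytes written."""
    request = build_speech_request(text, voice, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

Keeping request construction separate from the network call makes it easy to inspect the payload before sending, and the same shape should work with any OpenAI-compatible client once its base URL points at the local server.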

Chatterbox vs the other open-source TTS options

Chatterbox is excellent, but it is not the only good local TTS in 2026, and the right pick depends on the job:

Want the best English clone quality and an emotion knob? Chatterbox (original 0.5B) is the pick.
Need a specific non-English language? Use Chatterbox Multilingual, or compare against XTTS v2, which has long been the go-to multilingual cloner.
Need the lowest latency / smallest footprint? Chatterbox Turbo (350M), or a fixed-voice model like Kokoro if you do not need cloning at all.

For a side-by-side look at how Chatterbox stacks up against XTTS v2 and other cloners on a real machine, our local AI voice clone guide walks through the trade-offs with audio in mind.

Key Takeaways

Chatterbox TTS is free, MIT-licensed, and competitive with ElevenLabs — preferred 63.75% of the time in blind tests, with no per-character billing.
Setup is one command: pip install chatterbox-tts, then a few lines of Python. Weights download on first run.
Voice cloning needs only ~5 seconds of clean, single-speaker reference audio — clip quality matters more than length.
The emotion exaggeration knob is the differentiator. Start at exaggeration=0.5 / cfg_weight=0.5 and adjust one at a time.
Three variants: original 0.5B (best English), Multilingual (23 languages), and Turbo (350M, ~75ms latency, ~6x real-time) for low-VRAM/streaming.
You can self-host it as an OpenAI-compatible API via Chatterbox-TTS-Server (/v1/audio/speech) on NVIDIA, AMD, Apple Silicon, or CPU.

Next Steps

New to local TTS? Compare cloners head to head in our local AI voice clone guide before you settle on one.
Need strong multilingual cloning? Read the XTTS v2 voice cloning guide to see where it still beats (and loses to) Chatterbox.
Confirm the official details and license on the Resemble AI Chatterbox GitHub repo before deploying to production.

Chatterbox TTS Setup: Free ElevenLabs Killer (MIT, 2026)

Written by the Local AI Master Team
