Coqui TTS Python Guide: pip install + XTTS API Examples
To use Coqui TTS in Python in 2026, install the maintained fork with pip install coqui-tts (the original pip install TTS from coqui-ai is abandoned), then run from TTS.api import TTS and call tts.tts_to_file(text=..., file_path="out.wav", speaker_wav="voice.wav", language="en") with the XTTS v2 model. The package you actually want is coqui-tts v0.27.5 (released January 26, 2026), the community-maintained idiap/coqui-ai-TTS fork that works on Python 3.10 through 3.14. XTTS v2 clones a voice from a few seconds of reference audio, speaks 17 languages, and outputs 24 kHz audio.
If you searched "coqui tts python" and landed on a wall of import errors, you are almost certainly fighting the dead original package. This guide gives you the install command that actually works in 2026, the full tts_to_file() API with every argument that matters, voice cloning, streaming, and the four errors that trip up nearly everyone.
Which Coqui TTS package do I install in 2026?
This is the single most important thing on the page, so let's settle it first. Coqui.ai, the startup behind the original library, shut down in early 2024, and the original coqui-ai/TTS repository is no longer maintained. The PyPI package literally named TTS still installs, but it pins old dependencies and breaks on modern Python and PyTorch.
The community moved to the idiap/coqui-ai-TTS fork, published on PyPI as coqui-tts. It is the same codebase and the same TTS.api import path, just patched to run on current Python and dependencies.
| Package | PyPI name | Status (mid-2026) | Python support |
|---|---|---|---|
| Coqui TTS (idiap fork) | coqui-tts | ✅ Maintained — v0.27.5, Jan 26 2026 | 3.10 – 3.14 |
| Original Coqui | TTS | ❌ Unmaintained since early 2024 | breaks on new Python |
So install the fork:
pip install coqui-tts
The import path stays from TTS.api import TTS — the fork deliberately kept it identical so old tutorials still work once you swap the install line. If you only ever copy one line from this page, copy that pip install coqui-tts.
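If you are not sure which of the two packages ended up in your environment, a quick check along these lines can help. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and it assumes the two distributions are registered under the PyPI names given above (`coqui-tts` and `TTS`).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The fork should report a version; the legacy package should report None.
print("coqui-tts (fork):", installed_version("coqui-tts"))
print("TTS (legacy):    ", installed_version("TTS"))
```

Both packages provide the same top-level TTS module, so if the legacy line prints a version, uninstalling it before installing the fork is the safer order.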
How do I generate speech with tts_to_file()?
Here is the minimal "text to a WAV file" example using the multilingual XTTS v2 model and one of its built-in studio speakers (no reference clip needed):
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# First run downloads the XTTS v2 model (~1.8 GB) and prompts you to accept the license
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Hello from a fully local text to speech model.",
    file_path="output.wav",
    speaker="Craig Gutsy",  # a built-in XTTS speaker
    language="en",
)
That writes output.wav at a 24 kHz sample rate. The first call downloads the model and shows a license prompt — XTTS v2 ships under the Coqui Public Model License, which has real restrictions; if you plan to ship a product, read our breakdown of the XTTS commercial license before you build on it.
The key tts_to_file() arguments:
| Argument | Type | What it does |
|---|---|---|
| text | str | The text to synthesize |
| file_path | str | Output WAV path |
| language | str | Language code such as "en", "es", "fr" (required for XTTS) |
| speaker | str | Name of a built-in XTTS studio speaker |
| speaker_wav | str / list | Path(s) to reference audio for voice cloning (instead of speaker) |
| split_sentences | bool | Auto-split long text into sentences (default True) |
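One way to keep these arguments straight is to assemble them in a small helper before calling tts_to_file(). The sketch below is illustrative only: `build_xtts_kwargs` is a hypothetical name, and the rule that exactly one of speaker or speaker_wav must be given is this helper's own policy, derived from the table above rather than enforced by the library in this form.

```python
from typing import Any, Dict, List, Optional, Union


def build_xtts_kwargs(
    text: str,
    file_path: str,
    language: str,
    speaker: Optional[str] = None,
    speaker_wav: Optional[Union[str, List[str]]] = None,
    split_sentences: bool = True,
) -> Dict[str, Any]:
    """Validate and assemble keyword arguments for tts.tts_to_file()."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if not language:
        raise ValueError("language is required for XTTS")
    # Helper policy: a built-in speaker OR a reference clip, never both or neither.
    if (speaker is None) == (speaker_wav is None):
        raise ValueError("pass exactly one of speaker or speaker_wav")

    kwargs: Dict[str, Any] = {
        "text": text,
        "file_path": file_path,
        "language": language,
        "split_sentences": split_sentences,
    }
    if speaker is not None:
        kwargs["speaker"] = speaker
    else:
        kwargs["speaker_wav"] = speaker_wav
    return kwargs


# Usage with a loaded model (not executed here):
# tts.tts_to_file(**build_xtts_kwargs("Hello.", "output.wav", "en", speaker="Craig Gutsy"))
```

Failing early on a missing language or a doubled speaker argument is cheaper than discovering the problem after the model has loaded.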
How do I clone a voice with speaker_wav?
This is what XTTS v2 is famous for: zero-shot voice cloning from a short reference clip. Swap speaker for speaker_wav and point it at 6–20 seconds of clean audio of the target voice:
tts.tts_to_file(
    text="This sentence is spoken in a cloned voice.",
    file_path="cloned.wav",
    speaker_wav="my_voice_sample.wav",  # 6-20s of clean reference audio
    language="en",
)
A few hard-won notes: the reference clip should be clean (no music or background noise), mono, and ideally 16 kHz or higher. You can pass a list of WAV paths to speaker_wav to average several samples for a more stable clone. And language is mandatory for XTTS: it selects the language-specific text normalization and the language token fed to the model, not just an accent, so the wrong code produces garbled output. For a deeper walkthrough with audio tips and longer-form generation, see our full XTTS v2 voice cloning guide.
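The mechanical parts of that checklist (mono, sample rate, 6–20 seconds) can be checked before cloning. The following is a rough sketch using the standard-library wave module, which only reads uncompressed PCM WAV files; `reference_clip_problems` is a hypothetical helper, and the thresholds are simply the guideline values from the paragraph above. It cannot detect music or background noise.

```python
import wave
from typing import List


def reference_clip_problems(
    path: str,
    min_seconds: float = 6.0,
    max_seconds: float = 20.0,
    min_rate: int = 16000,
) -> List[str]:
    """Return human-readable problems with a reference WAV; an empty list means it looks fine."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / float(rate)

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if not min_seconds <= seconds <= max_seconds:
        problems.append(
            f"duration {seconds:.1f}s is outside {min_seconds:.0f}-{max_seconds:.0f}s"
        )
    return problems


# Usage (not executed here):
# for problem in reference_clip_problems("my_voice_sample.wav"):
#     print("warning:", problem)
```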
If you want the in-memory array instead of a file, use tts.tts(...) with the same arguments — it returns the waveform as a list of floats you can post-process before saving.
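As an illustration of that post-processing step, the sketch below peak-normalizes a list of float samples and writes it as 16-bit mono PCM with the standard library. `save_waveform` is a hypothetical helper; it assumes samples in roughly the -1.0 to 1.0 range and the 24 kHz output rate quoted in this guide, so adjust `sample_rate` if your model differs.

```python
import struct
import wave
from typing import Iterable


def save_waveform(
    samples: Iterable[float],
    path: str,
    sample_rate: int = 24000,  # assumption: XTTS v2 output rate per this guide
    peak: float = 0.95,
) -> float:
    """Peak-normalize float samples, write 16-bit mono PCM, return duration in seconds."""
    floats = [float(s) for s in samples]
    loudest = max((abs(s) for s in floats), default=0.0)
    scale = peak / loudest if loudest > 0 else 1.0

    # Clamp to [-1, 1] after scaling, then convert to signed 16-bit integers.
    pcm = [int(round(max(-1.0, min(1.0, s * scale)) * 32767)) for s in floats]

    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
    return len(pcm) / sample_rate


# Usage with a loaded model (not executed here):
# wav = tts.tts(text="Hello.", speaker="Craig Gutsy", language="en")
# print(save_waveform(wav, "normalized.wav"), "seconds written")
```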
Which languages does XTTS v2 support?
XTTS v2 supports 17 languages out of the box. Pass the code in the language argument:
English=en     Spanish=es   French=fr      German=de     Italian=it
Portuguese=pt  Polish=pl    Turkish=tr     Russian=ru    Dutch=nl
Czech=cs       Arabic=ar    Chinese=zh-cn  Japanese=ja   Hungarian=hu
Korean=ko      Hindi=hi
In full: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), and Hindi (hi). The same cloned voice can speak any of them — clone once in English, then generate Spanish from the same speaker_wav by changing only the language code.
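Because a wrong code fails quietly with garbled audio, it can be worth validating the code up front. This is a small sketch built from the list above; `XTTS_V2_LANGUAGES` and `require_supported_language` are hypothetical names, not part of the library.

```python
XTTS_V2_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian", "ko": "Korean",
    "hi": "Hindi",
}


def require_supported_language(code: str) -> str:
    """Return the normalized language code, or raise if XTTS v2 does not list it."""
    normalized = code.strip().lower()
    if normalized not in XTTS_V2_LANGUAGES:
        supported = ", ".join(sorted(XTTS_V2_LANGUAGES))
        raise ValueError(f"unsupported language {code!r}; expected one of: {supported}")
    return normalized


# Usage with a loaded model: one cloned voice, several languages (not executed here).
# for code, text in [("en", "Good morning."), ("es", "Buenos días.")]:
#     tts.tts_to_file(text=text, file_path=f"hello_{code}.wav",
#                     speaker_wav="my_voice_sample.wav",
#                     language=require_supported_language(code))
```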
How do I stream XTTS audio in real time?
For interactive apps you do not want to wait for the whole clip. XTTS exposes a lower-level streaming API, inference_stream(), that yields audio chunks as they are generated — Coqui reports time-to-first-chunk under ~200 ms on a capable GPU. You drop down from the high-level TTS.api wrapper to the model classes:
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model from a local copy of the XTTS v2 checkpoint directory
config = XttsConfig()
config.load_json("XTTS-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="XTTS-v2/")
model.cuda()

# Compute the voice conditioning once; reuse it for every utterance
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["my_voice_sample.wav"]
)

chunks = model.inference_stream(
    "Streaming text to speech, one chunk at a time.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)

for i, chunk in enumerate(chunks):
    # each chunk is a torch tensor of audio you can pipe to a speaker/socket
    print(f"chunk {i}: {chunk.shape}")
Streaming needs a GPU to feel real-time; on CPU it works but the sub-200 ms latency claim does not hold.
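To see what latency you actually get on your own hardware, you can wrap the chunk generator in a timer. The sketch below is generic Python and does not depend on the library; `timed_chunks` is a hypothetical helper that works with any iterable, and the clock starts when the first chunk is requested.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def timed_chunks(chunks: Iterable[T]) -> Iterator[Tuple[float, T]]:
    """Yield (seconds_since_first_request, chunk) for each chunk of a stream."""
    start = time.perf_counter()  # runs lazily, on the first next() call
    for chunk in chunks:
        yield time.perf_counter() - start, chunk


# Usage with the streaming generator above (not executed here):
# for i, (elapsed, chunk) in enumerate(timed_chunks(chunks)):
#     if i == 0:
#         print(f"time to first chunk: {elapsed * 1000:.0f} ms")
```

The first elapsed value is the time-to-first-chunk figure; later values show whether chunks keep arriving faster than they play back.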
Common Coqui TTS errors and fixes
These four cover the vast majority of "coqui tts python" support threads:
| Error / symptom | Cause | Fix |
|---|---|---|
| ModuleNotFoundError: No module named 'TTS' after install | Installed nothing, or the wrong wheel failed | Run pip install coqui-tts (the fork), not TTS |
| Dependency / build conflicts on Python 3.12+ | Old TTS package pins ancient deps | Uninstall TTS, install coqui-tts on Python 3.10–3.14 |
| weights_only / unpickling error on load | New PyTorch defaults block the checkpoint | Use coqui-tts 0.27.x, which patches the loader for modern PyTorch |
| Garbled / wrong-accent output | Missing or wrong language code | Always pass language="en" (etc.); it is required for XTTS |
The single most common mistake is installing the dead TTS package and then fighting dependency hell. Uninstall it, install coqui-tts, and most "it won't import" problems disappear.
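The install-related rows of that table can be turned into a quick self-check. This is a sketch only: `install_hints` is a hypothetical helper, the supported Python range is the one quoted in this guide, and detection relies on the PyPI distribution names `coqui-tts` and `TTS`.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import List, Tuple


def is_installed(dist_name: str) -> bool:
    """True if a distribution with this name is installed in the current environment."""
    try:
        version(dist_name)
        return True
    except PackageNotFoundError:
        return False


def install_hints(python_version: Tuple[int, int], has_fork: bool, has_legacy: bool) -> List[str]:
    """Map the environment state to the fixes suggested in the table above."""
    hints = []
    if not (3, 10) <= python_version <= (3, 14):
        hints.append("use Python 3.10 to 3.14 (range quoted in this guide)")
    if has_legacy:
        hints.append("run: pip uninstall TTS")
    if not has_fork:
        hints.append("run: pip install coqui-tts")
    return hints


for hint in install_hints(
    sys.version_info[:2], is_installed("coqui-tts"), is_installed("TTS")
):
    print("hint:", hint)
```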
First-hand performance note
On my own RTX 3090 (24 GB), XTTS v2 loaded in roughly 5–8 seconds and generated a single English sentence with a cloned speaker_wav in well under a second of wall-clock time — comfortably faster than real-time playback, so a paragraph renders in a couple of seconds. Treat these as approximate single-machine figures, not a controlled benchmark: model load time, clip length and disk speed all move the numbers. On CPU the same generation ran several times slower than real-time, which is why streaming is GPU-only in practice. VRAM use sat around 4–5 GB during inference, so the model fits comfortably on an 8 GB card.
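"Faster than real time" can be made concrete as a real-time factor: synthesis time divided by the duration of the audio produced, where values below 1.0 mean faster than playback. The sketch below shows the arithmetic; `real_time_factor` is a hypothetical helper, and the 24 kHz default is the output rate quoted in this guide.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int = 24000) -> float:
    """Return synthesis time divided by audio duration (below 1.0 is faster than real time)."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("no audio was produced")
    return synthesis_seconds / audio_seconds


# Example: 0.5 s of compute for 48,000 samples (2 s of audio at 24 kHz).
print(real_time_factor(0.5, 48000))  # 0.25
```

To measure your own figure, time a tts.tts(...) call with time.perf_counter() and pass the length of the returned waveform as `num_samples`.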
Key Takeaways
- Install coqui-tts, not TTS. The original Coqui package has been unmaintained since early 2024; the idiap fork (PyPI coqui-tts v0.27.5) is the live one and keeps the same TTS.api import.
- tts_to_file() is the one-liner you want: pass text, file_path, language, and either a built-in speaker or a speaker_wav reference clip.
- speaker_wav does zero-shot voice cloning from ~6–20 seconds of clean audio; pass a list of clips for a steadier clone.
- XTTS v2 speaks 17 languages at 24 kHz and can stream with inference_stream() at sub-200 ms latency on a GPU.
- Always set language. It selects the language-specific text processing and language token; omitting or mismatching it is the top cause of garbled output.
Next Steps
- Want the full model overview, hardware needs and license? Read our Coqui XTTS v2 model page.
- Building a product on cloned voices? Check the XTTS commercial license rules first — they are stricter than most open models.
- Need a step-by-step cloning walkthrough with audio prep tips? See the XTTS v2 voice cloning guide.
- Confirm versions and the latest API on the idiap/coqui-ai-TTS GitHub repo and the coqui-tts PyPI page.