
Coqui TTS Python Guide: pip install + XTTS API Examples

June 20, 2026
9 min read
Local AI Master Research Team


To use Coqui TTS in Python in 2026, install the maintained fork with pip install coqui-tts (the original pip install TTS from coqui-ai is abandoned), then run from TTS.api import TTS and call tts.tts_to_file(text=..., file_path="out.wav", speaker_wav="voice.wav", language="en") with the XTTS v2 model. The package you actually want is coqui-tts v0.27.5 (released January 26, 2026), the community-maintained idiap/coqui-ai-TTS fork that works on Python 3.10 through 3.14. XTTS v2 clones a voice from a few seconds of reference audio, speaks 17 languages, and outputs 24 kHz audio.

If you searched "coqui tts python" and landed on a wall of import errors, you are almost certainly fighting the dead original package. This guide gives you the install command that actually works in 2026, the full tts_to_file() API with every argument that matters, voice cloning, streaming, and the four errors that trip up nearly everyone.

Which Coqui TTS package do I install in 2026?

This is the single most important thing on the page, so let's settle it first. Coqui.ai, the startup behind the original library, shut down in early 2024, and the original coqui-ai/TTS repository is no longer maintained. The PyPI package literally named TTS still installs, but it pins old dependencies and breaks on modern Python and PyTorch.

The community moved to the idiap/coqui-ai-TTS fork, published on PyPI as coqui-tts. It is the same codebase and the same TTS.api import path, just patched to run on current Python and dependencies.

| Package | PyPI name | Status (mid-2026) | Python support |
|---|---|---|---|
| Coqui TTS (idiap fork) | coqui-tts | ✅ Maintained, v0.27.5 (Jan 26, 2026) | 3.10 – 3.14 |
| Original Coqui | TTS | ❌ Unmaintained since early 2024 | breaks on new Python |

So install the fork:

pip install coqui-tts

The import path stays from TTS.api import TTS — the fork deliberately kept it identical so old tutorials still work once you swap the install line. If you only ever copy one line from this page, copy that pip install coqui-tts.
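
Before installing, it can help to confirm that the interpreter you are about to use falls inside the supported range. The sketch below is only an illustration: the helper name python_supported is hypothetical, and the 3.10 to 3.14 bounds are copied from the table in this article rather than read from the package metadata.

```python
import sys

# Bounds copied from this article (assumption: not read from the package metadata).
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 14)


def python_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair lies inside the supported range."""
    major_minor = (version_info[0], version_info[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    status = "supported" if python_supported() else "outside the supported range"
    print(f"Python {sys.version_info[0]}.{sys.version_info[1]} is {status}")
```

Run it with the same interpreter that your pip command belongs to; a mismatch between the two is a common reason an install seems to succeed but the import fails.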


How do I generate speech with tts_to_file()?

Here is the minimal "text to a WAV file" example using the multilingual XTTS v2 model and one of its built-in studio speakers (no reference clip needed):

import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# First run downloads the XTTS v2 model (~1.8 GB) and prompts you to accept the license
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Hello from a fully local text to speech model.",
    file_path="output.wav",
    speaker="Craig Gutsy",   # a built-in XTTS speaker
    language="en",
)

That writes output.wav at a 24 kHz sample rate. The first run downloads the model when TTS(...) is constructed and shows a license prompt. XTTS v2 ships under the Coqui Public Model License, which limits use to non-commercial purposes; if you plan to ship a product, read our breakdown of the XTTS commercial license before you build on it.

The key tts_to_file() arguments:

| Argument | Type | What it does |
|---|---|---|
| text | str | The text to synthesize |
| file_path | str | Output WAV path |
| language | str | Language code: "en", "es", "fr", etc. (required for XTTS) |
| speaker | str | Name of a built-in XTTS studio speaker |
| speaker_wav | str / list | Path(s) to reference audio for voice cloning (instead of speaker) |
| split_sentences | bool | Auto-split long text into sentences (default True) |
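
The table implies a simple rule: give the model exactly one voice source, either speaker or speaker_wav, and always a language. A minimal sketch of that rule follows. The helper build_tts_kwargs is hypothetical and not part of the library, and the "exactly one" check reflects this article's usage convention rather than a constraint the library is known to enforce.

```python
def build_tts_kwargs(text, file_path, language, speaker=None, speaker_wav=None):
    """Assemble keyword arguments for tts_to_file(), requiring exactly one voice source.

    Hypothetical helper: it only builds a dict and never touches the model.
    """
    if (speaker is None) == (speaker_wav is None):
        raise ValueError("Pass exactly one of speaker or speaker_wav")
    if not language:
        raise ValueError("language is required for XTTS")

    kwargs = {"text": text, "file_path": file_path, "language": language}
    if speaker is not None:
        kwargs["speaker"] = speaker          # built-in studio speaker
    else:
        kwargs["speaker_wav"] = speaker_wav  # reference clip(s) for cloning
    return kwargs


kwargs = build_tts_kwargs(
    text="Hello from a fully local text to speech model.",
    file_path="output.wav",
    language="en",
    speaker="Craig Gutsy",
)
# With the tts object created earlier: tts.tts_to_file(**kwargs)
```

Validating before the call means a mistake surfaces as a clear ValueError instead of a failure deep inside model inference.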

How do I clone a voice with speaker_wav?

This is what XTTS v2 is famous for: zero-shot voice cloning from a short reference clip. Swap speaker for speaker_wav and point it at 6–20 seconds of clean audio of the target voice:

tts.tts_to_file(
    text="This sentence is spoken in a cloned voice.",
    file_path="cloned.wav",
    speaker_wav="my_voice_sample.wav",   # 6-20s of clean reference audio
    language="en",
)

A few hard-won notes: the reference clip should be clean (no music or background noise), mono, and ideally 16 kHz or higher. You can pass a list of WAV paths to speaker_wav to average several samples for a more stable clone. And language is mandatory for XTTS: it selects the language token and the language-specific text cleaning the model relies on, not just an accent, so the wrong code produces garbled output. For a deeper walkthrough with audio tips and longer-form generation, see our full XTTS v2 voice cloning guide.
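
The measurable parts of that advice (mono, 16 kHz or higher, 6 to 20 seconds) can be checked before you spend time on generation. The sketch below uses only the standard-library wave module. The function name check_reference_clip and its thresholds are assumptions taken from the notes above. It reads plain PCM WAV files only and cannot detect music or background noise.

```python
import wave


def check_reference_clip(path, min_seconds=6.0, max_seconds=20.0, min_rate=16000):
    """Return a list of human-readable problems found in a PCM WAV reference clip.

    Hypothetical helper; an empty list means the clip passed every check.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if not min_seconds <= seconds <= max_seconds:
        problems.append(
            f"duration {seconds:.1f}s is outside {min_seconds}-{max_seconds}s"
        )
    return problems


# Usage (path is the example clip name used in this article):
# for problem in check_reference_clip("my_voice_sample.wav"):
#     print("reference clip:", problem)
```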

If you want the audio in memory instead of a file, use tts.tts(...) with the same arguments minus file_path. It returns the waveform as a list of floats you can post-process before saving.
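
As one example of post-processing, the sketch below peak-normalizes a list of floats and writes it as a 16-bit mono WAV at 24 kHz using only the standard library. Both helper names are hypothetical, and the code assumes the samples are floats roughly in the range -1 to 1.

```python
import struct
import wave


def peak_normalize(samples, target_peak=0.95):
    """Scale float samples so the loudest one has magnitude target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence (or empty input): nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def save_wav_16bit(samples, path, sample_rate=24000):
    """Write float samples as a mono 16-bit PCM WAV file, clipping to [-1, 1]."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)           # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(ints)}h", *ints))


# Usage with the tts object created earlier:
# wav = tts.tts(text="Hello.", speaker_wav="my_voice_sample.wav", language="en")
# save_wav_16bit(peak_normalize(wav), "normalized.wav")
```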

Which languages does XTTS v2 support?

XTTS v2 supports 17 languages out of the box. Pass the code in the language argument:

English=en     Spanish=es   French=fr      German=de    Italian=it
Portuguese=pt  Polish=pl    Turkish=tr     Russian=ru   Dutch=nl
Czech=cs       Arabic=ar    Chinese=zh-cn  Japanese=ja  Hungarian=hu
Korean=ko      Hindi=hi

In full: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko), and Hindi (hi). The same cloned voice can speak any of them — clone once in English, then generate Spanish from the same speaker_wav by changing only the language code.
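
The "clone once, speak many languages" idea can be sketched as a list of jobs that reuse one reference clip and change only the language code. The language table is the one listed in this article. The helper names and the cloned_<code>.wav file naming are hypothetical.

```python
# The 17 language codes listed in this article.
XTTS_V2_LANGUAGES = {
    "en": "English", "es": "Spanish", "fr": "French", "de": "German",
    "it": "Italian", "pt": "Portuguese", "pl": "Polish", "tr": "Turkish",
    "ru": "Russian", "nl": "Dutch", "cs": "Czech", "ar": "Arabic",
    "zh-cn": "Chinese", "ja": "Japanese", "hu": "Hungarian", "ko": "Korean",
    "hi": "Hindi",
}


def require_supported_language(code):
    """Return the code unchanged, or raise ValueError if it is not in the table."""
    if code not in XTTS_V2_LANGUAGES:
        raise ValueError(f"Unsupported XTTS v2 language code: {code!r}")
    return code


def multilingual_jobs(text_by_language, speaker_wav):
    """Build one tts_to_file() kwargs dict per language, reusing one reference clip."""
    return [
        {
            "text": text,
            "file_path": f"cloned_{require_supported_language(code)}.wav",
            "speaker_wav": speaker_wav,
            "language": code,
        }
        for code, text in text_by_language.items()
    ]


jobs = multilingual_jobs(
    {"en": "Good morning.", "es": "Buenos días."},
    speaker_wav="my_voice_sample.wav",
)
# With the tts object created earlier:
# for job in jobs:
#     tts.tts_to_file(**job)
```

Checking the code up front catches a common slip: passing "zh" instead of "zh-cn".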


How do I stream XTTS audio in real time?

For interactive apps you do not want to wait for the whole clip. XTTS exposes a lower-level streaming API, inference_stream(), that yields audio chunks as they are generated; Coqui reports time-to-first-chunk under 200 ms on a capable GPU. To use it, you drop down from the high-level TTS.api wrapper to the model classes:

import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

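# Assumes XTTS-v2/ is a local directory holding the downloaded checkpoint and config.json
config = XttsConfig()
config.load_json("XTTS-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="XTTS-v2/")
model.cuda()  # requires a CUDA-capable GPU

# Compute the speaker conditioning once from the reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["my_voice_sample.wav"]
)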

chunks = model.inference_stream(
    "Streaming text to speech, one chunk at a time.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)

for i, chunk in enumerate(chunks):
    # each chunk is a torch tensor of audio you can pipe to a speaker/socket
    print(f"chunk {i}: {chunk.shape}")

Streaming needs a GPU to feel real-time; on CPU it works but the sub-200 ms latency claim does not hold.
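
To see whether the latency figure holds on your own hardware, you can time the gap between starting to consume the stream and receiving the first chunk. The sketch below works on any iterator. The helper name time_to_first_chunk is hypothetical, and it assumes inference_stream() is lazy, meaning generation starts when iteration begins, as the loop above suggests.

```python
import time


def time_to_first_chunk(chunks):
    """Consume an iterator of audio chunks.

    Returns (seconds until the first chunk arrived, list of all chunks).
    The latency is None if the iterator produced nothing.
    """
    start = time.perf_counter()
    first_latency = None
    collected = []
    for chunk in chunks:
        if first_latency is None:
            first_latency = time.perf_counter() - start
        collected.append(chunk)
    return first_latency, collected


# Usage with the model objects from the streaming example:
# latency, audio_chunks = time_to_first_chunk(
#     model.inference_stream("Hello there.", "en", gpt_cond_latent, speaker_embedding)
# )
# print(f"first chunk after {latency * 1000:.0f} ms")
```

The first measurement after loading a model is often slower than later ones, so it is worth timing a few sentences rather than one.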

Common Coqui TTS errors and fixes

These four cover the vast majority of "coqui tts python" support threads:

| Error / symptom | Cause | Fix |
|---|---|---|
| ModuleNotFoundError: No module named 'TTS' after install | Nothing was installed into the active environment, or the old TTS wheel failed to build | Run pip install coqui-tts (the fork), not pip install TTS |
| Dependency / build conflicts on Python 3.12+ | Old TTS package pins outdated dependencies | Uninstall TTS, install coqui-tts on Python 3.10–3.14 |
| weights_only / unpickling error on load | PyTorch 2.6 and later default torch.load() to weights_only=True, which rejects the checkpoint | Use coqui-tts 0.27.x, which patches the loader for modern PyTorch |
| Garbled / wrong-accent output | Missing or wrong language code | Always pass language="en" (etc.); it is required for XTTS |

The single most common mistake is installing the dead TTS package and then fighting dependency hell. Uninstall it, install coqui-tts, and most "it won't import" problems disappear.
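
Because both packages expose the same TTS import, it is not always obvious which one is actually installed. A minimal sketch using the standard-library importlib.metadata can report that. The function names are hypothetical; the distribution names coqui-tts and TTS are the PyPI names used throughout this article.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def diagnose_install():
    """Describe which of the two Coqui distributions is present in this environment."""
    fork = installed_version("coqui-tts")
    original = installed_version("TTS")
    if fork and original:
        return f"Both coqui-tts {fork} and TTS {original} are installed; uninstall TTS."
    if fork:
        return f"coqui-tts {fork} is installed (the maintained fork)."
    if original:
        return f"Only the original TTS {original} is installed; switch to coqui-tts."
    return "Neither package is installed in this environment."


if __name__ == "__main__":
    print(diagnose_install())
```

Run it inside the same virtual environment your script uses, since a package installed elsewhere will not be found.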

First-hand performance note

On my own RTX 3090 (24 GB), XTTS v2 loaded in roughly 5–8 seconds and generated a single English sentence with a cloned speaker_wav in well under a second of wall-clock time — comfortably faster than real-time playback, so a paragraph renders in a couple of seconds. Treat these as approximate single-machine figures, not a controlled benchmark: model load time, clip length and disk speed all move the numbers. On CPU the same generation ran several times slower than real-time, which is why streaming is GPU-only in practice. VRAM use sat around 4–5 GB during inference, so the model fits comfortably on an 8 GB card.
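
"Faster than real time" can be made concrete as a real-time factor: synthesis time divided by the duration of the audio produced, where duration is the sample count divided by the 24 kHz output rate. The sketch below is one way to compute it. The function name is hypothetical, and it produces no benchmark numbers of its own.

```python
def real_time_factor(num_samples, elapsed_seconds, sample_rate=24000):
    """Return synthesis time divided by audio duration.

    Values below 1.0 mean generation was faster than playback.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


# Usage with the tts object created earlier:
# import time
# start = time.perf_counter()
# wav = tts.tts(text="Hello.", speaker_wav="my_voice_sample.wav", language="en")
# print(real_time_factor(len(wav), time.perf_counter() - start))
```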

Key Takeaways

  1. Install coqui-tts, not TTS. The original Coqui package is unmaintained since early 2024; the idiap fork (PyPI coqui-tts v0.27.5) is the live one and keeps the same TTS.api import.
  2. tts_to_file() is the one-liner you want — pass text, file_path, language, and either a built-in speaker or a speaker_wav reference clip.
  3. speaker_wav does zero-shot voice cloning from ~6–20 seconds of clean audio; pass a list of clips for a steadier clone.
  4. XTTS v2 speaks 17 languages at 24 kHz and can stream with inference_stream(), with a Coqui-reported time-to-first-chunk under 200 ms on a capable GPU.
  5. Always set language. It selects the language token and text cleaning XTTS relies on; omitting or mismatching it is the top cause of garbled output.

Local AI Master Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.

Published: June 20, 2026 · Last updated: June 20, 2026 · Manually reviewed
