XTTS v2 Voice Cloning Guide (2026): Coqui TTS for 17 Languages
XTTS v2 is the open-source multilingual TTS that powered most local-AI voice clones from 2023-2025 and remains the right default for any multilingual deployment in 2026. Coqui AI built it; Coqui AI shut down; the codebase and weights remain freely available and the community keeps the project alive. 17 languages, zero-shot voice cloning from 6 seconds of reference, real-time streaming on consumer GPUs.
This guide covers everything: installation across platforms, voice cloning best practices, multilingual generation, the Coqui TTS server vs LocalAI integration, AllTalk wrapper for SillyTavern users, fine-tuning for specific speakers, performance tuning, and the licensing situation post-Coqui-shutdown.
Table of Contents
- What XTTS v2 Is
- The Coqui Shutdown and What It Means
- Hardware Requirements
- Installation: pip, Docker, AllTalk
- Your First Voice Clone
- The 17 Supported Languages
- Reference Audio Best Practices
- Streaming and Real-Time
- Coqui TTS Server (Built-in HTTP API)
- LocalAI Integration (OpenAI-Compatible)
- AllTalk: The Community Wrapper
- SillyTavern / Open WebUI / Home Assistant
- Fine-Tuning for a Specific Speaker
- Performance Benchmarks
- License Status (CPML)
- Ethical Considerations
- Tuning Recipes
- Troubleshooting
What XTTS v2 Is {#what-it-is}
XTTS v2 ("Cross-lingual Text-to-Speech") is a 750M parameter model from Coqui AI released late 2023. Architecture: GPT-2-style autoregressive transformer over discrete audio tokens, paired with an HiFi-GAN vocoder.
Capabilities:
- 17-language synthesis from a single model
- Zero-shot voice cloning from 6+ seconds of reference
- Cross-lingual cloning (English speaker → Spanish output)
- Streaming output for real-time use
- Fine-tuning for tighter speaker matching
Project: github.com/coqui-ai/TTS (archived); github.com/idiap/coqui-ai-TTS (community fork, active).
The Coqui Shutdown and What It Means {#shutdown}
Coqui AI announced wind-down in January 2024. The repository was archived. As of mid-2026:
- Code: community fork at github.com/idiap/coqui-ai-TTS keeps it current.
- Weights: XTTS v2 weights remain on Hugging Face under the Coqui Public Model License (CPML).
- Commercial license: the paid commercial license tier no longer exists. CPML allows research and personal use; commercial use always required the paid license, which is now defunct.
- Practical reality: many people and small businesses use XTTS v2 commercially anyway. There is no rights holder actively enforcing CPML in 2026. For confirmed commercial deployments at scale, switch to F5-TTS commercial license, Spark-TTS (Apache 2.0), or an enterprise TTS provider.
Always consult an attorney for your specific situation. This guide does not constitute legal advice.
Hardware Requirements {#requirements}
| Hardware | RTF (real-time factor) |
|---|---|
| RTX 4090 | ~4x real-time |
| RTX 4070 | ~3x |
| RTX 3060 | ~2x |
| M4 Max (MPS) | ~1.2x |
| RX 7900 XTX (ROCm) | ~2x |
| Ryzen 7 7700X (CPU) | ~0.4x |
VRAM usage is 2-3 GB. CPU-only is slow but workable for batch jobs.
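A quick way to pick the best available backend at runtime (standard PyTorch checks; note that ROCm builds of PyTorch report as CUDA):
import torch

if torch.cuda.is_available():            # NVIDIA, or AMD via ROCm builds
    device = "cuda"
elif torch.backends.mps.is_available():  # Apple Silicon
    device = "mps"
else:
    device = "cpu"
print(f"Running XTTS on {device}")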
Installation: pip, Docker, AllTalk {#installation}
pip (community fork)
python3.11 -m venv ~/venvs/xtts
source ~/venvs/xtts/bin/activate
# CUDA 12.4
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
# Community fork of Coqui TTS (active maintenance)
pip install coqui-tts
# Verify
tts --help
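To sanity-check the install from Python instead, and confirm the GPU build of PyTorch is the one in use:
import torch
from TTS.api import TTS  # fails here if coqui-tts did not install cleanly

print(torch.__version__, torch.cuda.is_available())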
Docker
docker run --gpus all --rm -it \
-v $(pwd):/workspace \
ghcr.io/coqui-ai/tts:latest \
bash
AllTalk (community wrapper, recommended for non-developers)
git clone https://github.com/erew123/alltalk_tts
cd alltalk_tts
./atsetup.sh # Linux/Mac
# or .\atsetup.bat on Windows
AllTalk auto-installs XTTS v2, downloads default voices, sets up Gradio UI, and provides SillyTavern integration. Easiest path for most users.
Your First Voice Clone {#first-clone}
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
text="Hello, this is a cloned voice speaking new words.",
speaker_wav="reference.wav",
language="en",
file_path="output.wav",
)
That's it. Reference WAV (6+ seconds of clean speech) + new text + language code → cloned audio.
CLI equivalent:
tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
--speaker_wav reference.wav \
--language_idx en \
--text "Hello world" \
--out_path output.wav
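Cross-lingual cloning is the same call with a different language code, e.g. an English reference clip speaking Spanish:
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="Hola, esta es la misma voz hablando español.",
    speaker_wav="reference.wav",  # English reference clip
    language="es",
    file_path="output_es.wav",
)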
The 17 Supported Languages {#languages}
| Code | Language |
|---|---|
| en | English |
| es | Spanish |
| fr | French |
| de | German |
| it | Italian |
| pt | Portuguese |
| pl | Polish |
| tr | Turkish |
| ru | Russian |
| nl | Dutch |
| cs | Czech |
| ar | Arabic |
| zh-cn | Chinese |
| ja | Japanese |
| hu | Hungarian |
| ko | Korean |
| hi | Hindi |
Cross-lingual cloning works for any language pair. European-language pairs have the highest fidelity; cross-script cloning (Latin reference → CJK / Arabic / Hindi) shows noticeable voice drift.
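To hear the same reference voice across several languages, loop over the codes from the table above (a minimal sketch; same model object as in the first-clone example):
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
samples = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "de": "Der schnelle braune Fuchs springt über den faulen Hund.",
    "ja": "素早い茶色の狐がのんびりした犬を飛び越える。",
}
for code, line in samples.items():
    tts.tts_to_file(text=line, speaker_wav="reference.wav",
                    language=code, file_path=f"demo_{code}.wav")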
Reference Audio Best Practices {#reference}
- Length: 6-15 seconds (going past 20 seconds gives diminishing returns)
- Format: mono WAV, 16-44.1 kHz sample rate, 16-bit
- Content: clean speech, varied intonation, no music or background noise
- Speaker: a single speaker, full sentences with natural prosody
- Microphone: match the speaker's typical recording setup
For voices with strong emotion (excited, sad, etc.), include that emotion in the reference — XTTS partially carries it through.
For audiobook narration: record 1-2 minutes of varied content from the narrator, slice the cleanest 15 seconds for the reference clip.
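A small helper for vetting a candidate clip against these guidelines (assumes torchaudio from the install step; the duration thresholds mirror the list above, not anything XTTS itself enforces):
import torchaudio

def prepare_reference(in_path: str, out_path: str = "reference.wav") -> None:
    wav, sr = torchaudio.load(in_path)
    wav = wav.mean(dim=0, keepdim=True)  # downmix to mono
    seconds = wav.shape[1] / sr
    if not 6.0 <= seconds <= 20.0:
        print(f"warning: clip is {seconds:.1f}s; aim for 6-15 seconds")
    torchaudio.save(out_path, wav, sr)

prepare_reference("raw_recording.wav")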
Streaming and Real-Time {#streaming}
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
config = XttsConfig()
config.load_json("config.json")  # from the downloaded model directory
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="model/", eval=True)
model.cuda()
# Compute conditioning from the reference clip once; reuse it per request
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)
# Stream chunked output
for chunk in model.inference_stream(
    text="Long text...",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
):
    play(chunk)  # play each chunk as it arrives (see sketch below)
Latency to first audio on RTX 4090: ~280 ms. Acceptable for interactive voice agents.
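The play helper above is left undefined; one minimal sketch, assuming the third-party sounddevice package and XTTS's 24 kHz mono output:
import numpy as np
import sounddevice as sd

out = sd.OutputStream(samplerate=24000, channels=1, dtype="float32")
out.start()

def play(chunk):
    # chunks arrive as torch tensors on the GPU; move to host float32
    out.write(chunk.squeeze().cpu().numpy().astype(np.float32))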
Coqui TTS Server (Built-in HTTP API) {#tts-server}
tts-server \
--model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--port 5002
Endpoints:
GET /api/tts?text=...&speaker_wav=...&language=en
POST /api/tts with a JSON body
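Calling it from Python with the parameters shown above (requests is assumed; speaker_wav must be a path the server can read):
import requests

resp = requests.get(
    "http://localhost:5002/api/tts",
    params={
        "text": "Hello from the Coqui server.",
        "speaker_wav": "reference.wav",  # path on the server's filesystem
        "language": "en",
    },
)
resp.raise_for_status()
with open("out.wav", "wb") as f:
    f.write(resp.content)  # WAV bytes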
Simple but functional. For OpenAI compatibility, wrap with LocalAI.
LocalAI Integration (OpenAI-Compatible) {#localai}
# models/xtts.yaml in LocalAI
name: xtts-v2
backend: coqui
parameters:
model: tts_models/multilingual/multi-dataset/xtts_v2
voice_wav: /build/voices/default-en.wav
language: en
Then OpenAI clients work:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")
resp = client.audio.speech.create(
model="xtts-v2",
voice="default-en",
input="Hello world",
)
resp.stream_to_file("out.wav")
For multiple voices, configure multiple YAML files, one per voice, and pick via the voice parameter.
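Generating the same line in two configured voices (the voice names here are hypothetical; they must match the YAML configs you created):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")
for voice in ["default-en", "narrator-es"]:  # hypothetical config names
    resp = client.audio.speech.create(
        model="xtts-v2", voice=voice, input="Same line, two voices."
    )
    resp.stream_to_file(f"out-{voice}.wav")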
AllTalk: The Community Wrapper {#alltalk}
AllTalk TTS is the community wrapper that makes XTTS v2 production-ready for non-developers. Features:
- Gradio web UI with voice picker
- Built-in voice library (drop WAVs into voices/)
- SillyTavern integration out of the box
- Streaming output
- Narrator mode (a separate voice for non-dialogue text)
- Fine-tuning UI (no code required)
- DeepSpeed acceleration support
- Real-time-factor 4x+ on RTX 4090
For most users, AllTalk is the right entry point — install it once, drop your reference voices in voices/, and use the Gradio UI or hook it up to SillyTavern.
SillyTavern / Open WebUI / Home Assistant {#integrations}
SillyTavern
Extensions → TTS → provider AllTalk or Coqui TTS Server → URL http://localhost:5002 (Coqui server) or http://localhost:7851 (AllTalk). Pick a voice per character card. Bot replies are read aloud in the cloned voice.
Open WebUI
Settings → Audio → TTS Engine → OpenAI-compatible URL → LocalAI with XTTS configured.
Home Assistant
assist_pipeline integration with the Coqui TTS server backend. Smart-home voice responses use cloned family-member voices for a fun touch (with consent).
Fine-Tuning for a Specific Speaker {#fine-tuning}
For tighter voice match than zero-shot:
# Prepare 1+ hour of clean speech with transcripts in LJSpeech format
# (flags illustrative; see the community fork's XTTS fine-tuning recipe
#  under recipes/ for the exact invocation)
python TTS/bin/train_tts.py \
--config_path TTS/tts/configs/xtts_v2_train.json \
--restore_path /path/to/xtts_v2.pth \
--output_path /path/to/output \
--formatter ljspeech \
--dataset_path /path/to/your_dataset
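The LJSpeech layout the formatter expects is a wavs/ directory plus a metadata.csv of pipe-separated id|transcript|normalized-transcript rows. A minimal sketch for assembling one from (clip id, transcript) pairs:
import csv
from pathlib import Path

def write_metadata(pairs, dataset_dir="your_dataset"):
    # pairs: iterable of (wav filename without extension, transcript)
    root = Path(dataset_dir)
    (root / "wavs").mkdir(parents=True, exist_ok=True)  # put the clips here
    with open(root / "metadata.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        for wav_id, text in pairs:
            writer.writerow([wav_id, text, text])  # raw and normalized text

write_metadata([("clip_0001", "Hello there."), ("clip_0002", "Second sample.")])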
Time: 12-24 hours on RTX 4090 for 1 hour of training data. Result: noticeably tighter voice match than zero-shot, useful for commercial-grade audiobook or branded-voice deployments.
AllTalk has a built-in fine-tuning UI that wraps this for non-developers.
Performance Benchmarks {#benchmarks}
Generating 10 seconds of audio on an RTX 4090 (RTF = seconds of audio produced per second of wall-clock time):
| Setting | Time | RTF |
|---|---|---|
| Default | 2.5 sec | 4x |
| With DeepSpeed | 1.6 sec | 6.3x |
| Streaming first-chunk latency | 280 ms | n/a |
Vs F5-TTS at default settings: F5-TTS runs at ~5x RTF to XTTS's 4x, so F5 is slightly faster. Vs Bark: XTTS at 4x against Bark's 0.5x makes XTTS roughly 8x faster.
License Status (CPML) {#license}
XTTS v2 weights ship under the Coqui Public Model License (CPML):
- Free for research, personal, and academic use
- Commercial use originally required Coqui paid license
- Coqui AI shut down in 2024; paid license tier no longer exists
- Codebase community-fork lives under MPL 2.0 (commercial-friendly)
- Weights remain in legal limbo for commercial use
For confirmed commercial deployments, alternatives:
- F5-TTS (CC-BY-NC-4.0) with separate commercial license
- Spark-TTS (Apache 2.0)
- OpenVoice v2 (MIT)
- Commercial APIs (ElevenLabs, OpenAI TTS, Azure)
For research / personal / small-scale internal use: XTTS v2 is fine in practice.
Ethical Considerations {#ethics}
Same considerations as F5-TTS:
- Consent before cloning
- Disclosure of synthetic audio
- Watermarking where possible
- Logging
- Refusal of public-figure / fraud cloning
- Compliance with state-level synthetic media laws
See F5-TTS ethics section for detailed discussion.
Tuning Recipes {#tuning}
RTX 4090 production server
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.synthesizer.tts_model.config.gpt_max_audio_tokens = 605 # default
Enable DeepSpeed for ~50% speedup (pip install deepspeed). It must be requested when the checkpoint is loaded, via the low-level API from the streaming section:
model.load_checkpoint(config, checkpoint_dir="model/", eval=True, use_deepspeed=True)
model.cuda()
Tight VRAM (4-6 GB)
tts = TTS("...").to("cuda")
# XTTS itself needs only ~2-3 GB, so the default config fits; the main
# OOM risk is very long inputs. Synthesize sentence-by-sentence instead.
CPU-only
tts = TTS("...").to("cpu")
Slower but functional. For batch use cases (e.g., overnight audiobook generation), CPU works; see the sketch below.
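A minimal overnight-batch sketch along those lines (the chapters/ layout and narrator.wav are hypothetical; short sentence-level chunks also keep memory flat):
from pathlib import Path
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")
Path("out").mkdir(exist_ok=True)

for chapter in sorted(Path("chapters").glob("*.txt")):
    text = chapter.read_text(encoding="utf-8")
    # naive sentence split; swap in a proper splitter (e.g. pysbd) for production
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for i, sentence in enumerate(sentences):
        tts.tts_to_file(
            text=sentence + ".",
            speaker_wav="narrator.wav",
            language="en",
            file_path=f"out/{chapter.stem}_{i:04d}.wav",
        )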
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Robotic output | Reference noisy | Clean reference (Demucs / Whisper VAD) |
| Wrong language pronunciation | Wrong language code | Pass correct language parameter |
| OOM | Long input + GPU tight | Chunk text, use CPU offload |
| Slow on AMD | PyTorch CUDA build | Reinstall with rocm6.2 index |
| Mac slow | MPS not enabled | Verify torch.backends.mps.is_available() |
| AllTalk install fails | Python version | Use Python 3.10 / 3.11 only |
| Cross-lingual voice drift | Script mismatch (Latin → CJK) | Use same-script reference where possible |
Sources: Coqui TTS GitHub (archived) | idiap community fork | AllTalk TTS | Internal benchmarks (RTX 4090, M4 Max).