XTTS v2 Voice Cloning Guide (2026): Coqui TTS for 17 Languages
XTTS v2 is the open-source multilingual TTS that powered most local-AI voice clones from 2023-2025 and remains the right default for any multilingual deployment in 2026. Coqui AI built it; Coqui AI shut down; the codebase and weights remain freely available and the community keeps the project alive. 17 languages, zero-shot voice cloning from 6 seconds of reference, real-time streaming on consumer GPUs.
This guide covers everything: installation across platforms, voice cloning best practices, multilingual generation, the Coqui TTS server vs LocalAI integration, AllTalk wrapper for SillyTavern users, fine-tuning for specific speakers, performance tuning, and the licensing situation post-Coqui-shutdown.
Table of Contents
- What XTTS v2 Is
- The Coqui Shutdown and What It Means
- Hardware Requirements
- Installation: pip, Docker, AllTalk
- Your First Voice Clone
- The 17 Supported Languages
- Reference Audio Best Practices
- Streaming and Real-Time
- Coqui TTS Server (Built-in HTTP API)
- LocalAI Integration (OpenAI-Compatible)
- AllTalk: The Community Wrapper
- SillyTavern / Open WebUI / Home Assistant
- Fine-Tuning for a Specific Speaker
- Performance Benchmarks
- License Status (CPML)
- Ethical Considerations
- Tuning Recipes
- Troubleshooting
What XTTS v2 Is {#what-it-is}
XTTS v2 ("Cross-lingual Text-to-Speech") is a 750M parameter model from Coqui AI released late 2023. Architecture: GPT-2-style autoregressive transformer over discrete audio tokens, paired with an HiFi-GAN vocoder.
Capabilities:
- 17-language synthesis from a single model
- Zero-shot voice cloning from 6+ seconds of reference
- Cross-lingual cloning (English speaker → Spanish output)
- Streaming output for real-time use
- Fine-tuning for tighter speaker matching
Project: github.com/coqui-ai/TTS (archived); github.com/idiap/coqui-ai-TTS (community fork, active).
The Coqui Shutdown and What It Means {#shutdown}
Coqui AI announced wind-down in January 2024. The repository was archived. As of mid-2026:
- Code: community fork at github.com/idiap/coqui-ai-TTS keeps it current.
- Weights: XTTS v2 weights remain on Hugging Face under the Coqui Public Model License (CPML).
- Commercial license: the paid commercial license tier no longer exists. CPML allows research and personal use; commercial use always required the paid license, which is now defunct.
- Practical reality: many people and small businesses use XTTS v2 commercially anyway. There is no rights holder actively enforcing CPML in 2026. For confirmed commercial deployments at scale, switch to F5-TTS commercial license, Spark-TTS (Apache 2.0), or an enterprise TTS provider.
Always consult an attorney for your specific situation. This guide does not constitute legal advice.
Hardware Requirements {#requirements}
| Hardware | RTF (real-time factor) |
|---|---|
| RTX 4090 | ~4x real-time |
| RTX 4070 | ~3x |
| RTX 3060 | ~2x |
| M4 Max (MPS) | ~1.2x |
| RX 7900 XTX (ROCm) | ~2x |
| Ryzen 7 7700X (CPU) | ~0.4x |
VRAM usage is 2-3 GB. CPU-only is slow but workable for batch jobs.
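A quick way to pick the best available backend at runtime (standard PyTorch checks; note that ROCm builds of PyTorch report as CUDA):
import torch

if torch.cuda.is_available():            # NVIDIA, or AMD via ROCm builds
    device = "cuda"
elif torch.backends.mps.is_available():  # Apple Silicon
    device = "mps"
else:
    device = "cpu"
print(f"Running XTTS on {device}")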
Installation: pip, Docker, AllTalk {#installation}
pip (community fork)
python3.11 -m venv ~/venvs/xtts
source ~/venvs/xtts/bin/activate
# CUDA 12.4
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
# Community fork of Coqui TTS (active maintenance)
pip install coqui-tts
# Verify
tts --help
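To sanity-check the install from Python instead, and confirm the GPU build of PyTorch is the one in use:
import torch
from TTS.api import TTS  # fails here if coqui-tts did not install cleanly

print(torch.__version__, torch.cuda.is_available())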
Docker
docker run --gpus all --rm -it \
-v $(pwd):/workspace \
ghcr.io/coqui-ai/tts:latest \
bash
AllTalk (community wrapper, recommended for non-developers)
git clone https://github.com/erew123/alltalk_tts
cd alltalk_tts
./atsetup.sh # Linux/Mac
# or .\atsetup.bat on Windows
AllTalk auto-installs XTTS v2, downloads default voices, sets up Gradio UI, and provides SillyTavern integration. Easiest path for most users.
Your First Voice Clone {#first-clone}
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
text="Hello, this is a cloned voice speaking new words.",
speaker_wav="reference.wav",
language="en",
file_path="output.wav",
)
That's it. Reference WAV (6+ seconds of clean speech) + new text + language code → cloned audio.
CLI equivalent:
tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
--speaker_wav reference.wav \
--language_idx en \
--text "Hello world" \
--out_path output.wav
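Cross-lingual cloning is the same call with a different language code, e.g. an English reference clip speaking Spanish:
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="Hola, esta es la misma voz hablando español.",
    speaker_wav="reference.wav",  # English reference clip
    language="es",
    file_path="output_es.wav",
)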
The 17 Supported Languages {#languages}
| Code | Language |
|---|---|
| en | English |
| es | Spanish |
| fr | French |
| de | German |
| it | Italian |
| pt | Portuguese |
| pl | Polish |
| tr | Turkish |
| ru | Russian |
| nl | Dutch |
| cs | Czech |
| ar | Arabic |
| zh-cn | Chinese |
| ja | Japanese |
| hu | Hungarian |
| ko | Korean |
| hi | Hindi |
Cross-lingual cloning works for any language pair. European-language pairs have the highest fidelity; cross-script cloning (Latin reference → CJK / Arabic / Hindi) shows noticeable voice drift.
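To hear the same reference voice across several languages, loop over the codes from the table above (a minimal sketch; same model object as in the first-clone example):
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
samples = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "de": "Der schnelle braune Fuchs springt über den faulen Hund.",
    "ja": "素早い茶色の狐がのんびりした犬を飛び越える。",
}
for code, line in samples.items():
    tts.tts_to_file(text=line, speaker_wav="reference.wav",
                    language=code, file_path=f"demo_{code}.wav")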
Reference Audio Best Practices {#reference}
- Length: 6-15 seconds (going past 20 seconds gives diminishing returns)
- Format: mono WAV, 16-44.1 kHz sample rate, 16-bit
- Content: clean speech, varied intonation, no music or background noise
- Speaker: a single speaker, full sentences with natural prosody
- Microphone: match the speaker's typical recording setup
For voices with strong emotion (excited, sad, etc.), include that emotion in the reference — XTTS partially carries it through.
For audiobook narration: record 1-2 minutes of varied content from the narrator, slice the cleanest 15 seconds for the reference clip.
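A small helper for vetting a candidate clip against these guidelines (assumes torchaudio from the install step; the duration thresholds mirror the list above, not anything XTTS itself enforces):
import torchaudio

def prepare_reference(in_path: str, out_path: str = "reference.wav") -> None:
    wav, sr = torchaudio.load(in_path)
    wav = wav.mean(dim=0, keepdim=True)  # downmix to mono
    seconds = wav.shape[1] / sr
    if not 6.0 <= seconds <= 20.0:
        print(f"warning: clip is {seconds:.1f}s; aim for 6-15 seconds")
    torchaudio.save(out_path, wav, sr)

prepare_reference("raw_recording.wav")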
Streaming and Real-Time {#streaming}
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
config = XttsConfig()
config.load_json("config.json")  # from the downloaded model directory
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="model/", eval=True)
model.cuda()
# Compute conditioning from the reference clip once; reuse it per request
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)
# Stream chunked output
for chunk in model.inference_stream(
    text="Long text...",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
):
    play(chunk)  # play each chunk as it arrives (see sketch below)
Latency to first audio on RTX 4090: ~280 ms. Acceptable for interactive voice agents.
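The play helper above is left undefined; one minimal sketch, assuming the third-party sounddevice package and XTTS's 24 kHz mono output:
import numpy as np
import sounddevice as sd

out = sd.OutputStream(samplerate=24000, channels=1, dtype="float32")
out.start()

def play(chunk):
    # chunks arrive as torch tensors on the GPU; move to host float32
    out.write(chunk.squeeze().cpu().numpy().astype(np.float32))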
Coqui TTS Server (Built-in HTTP API) {#tts-server}
tts-server \
--model_name tts_models/multilingual/multi-dataset/xtts_v2 \
--port 5002
Endpoints:
GET /api/tts?text=...&speaker_wav=...&language=en
POST /api/tts with a JSON body
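Calling it from Python with the parameters shown above (requests is assumed; speaker_wav must be a path the server can read):
import requests

resp = requests.get(
    "http://localhost:5002/api/tts",
    params={
        "text": "Hello from the Coqui server.",
        "speaker_wav": "reference.wav",  # path on the server's filesystem
        "language": "en",
    },
)
resp.raise_for_status()
with open("out.wav", "wb") as f:
    f.write(resp.content)  # WAV bytes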
Simple but functional. For OpenAI compatibility, wrap with LocalAI.
LocalAI Integration (OpenAI-Compatible) {#localai}
# models/xtts.yaml in LocalAI
name: xtts-v2
backend: coqui
parameters:
model: tts_models/multilingual/multi-dataset/xtts_v2
voice_wav: /build/voices/default-en.wav
language: en
Then OpenAI clients work:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")
resp = client.audio.speech.create(
model="xtts-v2",
voice="default-en",
input="Hello world",
)
resp.stream_to_file("out.wav")
For multiple voices, configure multiple YAML files, one per voice, and pick via the voice parameter.
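Generating the same line in two configured voices (the voice names here are hypothetical; they must match the YAML configs you created):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")
for voice in ["default-en", "narrator-es"]:  # hypothetical config names
    resp = client.audio.speech.create(
        model="xtts-v2", voice=voice, input="Same line, two voices."
    )
    resp.stream_to_file(f"out-{voice}.wav")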
AllTalk: The Community Wrapper {#alltalk}
AllTalk TTS is the community wrapper that makes XTTS v2 production-ready for non-developers. Features:
- Gradio web UI with voice picker
- Built-in voice library (drop WAVs into voices/)
- SillyTavern integration out of the box
- Streaming output
- Narrator mode (a separate voice for non-dialogue text)
- Fine-tuning UI (no code required)
- DeepSpeed acceleration support
- Real-time-factor 4x+ on RTX 4090
For most users, AllTalk is the right entry point — install it once, drop your reference voices in voices/, and use the Gradio UI or hook it up to SillyTavern.
SillyTavern / Open WebUI / Home Assistant {#integrations}
SillyTavern
Extensions → TTS → provider AllTalk or Coqui TTS Server → URL http://localhost:5002 (Coqui server) or http://localhost:7851 (AllTalk). Pick a voice per character card. Bot replies are read aloud in the cloned voice.
Open WebUI
Settings → Audio → TTS Engine → OpenAI-compatible URL → LocalAI with XTTS configured.
Home Assistant
assist_pipeline integration with the Coqui TTS server backend. Smart-home voice responses use cloned family-member voices for a fun touch (with consent).
Fine-Tuning for a Specific Speaker {#fine-tuning}
For tighter voice match than zero-shot:
# Prepare 1+ hour of clean speech with transcripts in LJSpeech format
# (flags illustrative; see the community fork's XTTS fine-tuning recipe
#  under recipes/ for the exact invocation)
python TTS/bin/train_tts.py \
--config_path TTS/tts/configs/xtts_v2_train.json \
--restore_path /path/to/xtts_v2.pth \
--output_path /path/to/output \
--formatter ljspeech \
--dataset_path /path/to/your_dataset
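The LJSpeech layout the formatter expects is a wavs/ directory plus a metadata.csv of pipe-separated id|transcript|normalized-transcript rows. A minimal sketch for assembling one from (clip id, transcript) pairs:
import csv
from pathlib import Path

def write_metadata(pairs, dataset_dir="your_dataset"):
    # pairs: iterable of (wav filename without extension, transcript)
    root = Path(dataset_dir)
    (root / "wavs").mkdir(parents=True, exist_ok=True)  # put the clips here
    with open(root / "metadata.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        for wav_id, text in pairs:
            writer.writerow([wav_id, text, text])  # raw and normalized text

write_metadata([("clip_0001", "Hello there."), ("clip_0002", "Second sample.")])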
Time: 12-24 hours on RTX 4090 for 1 hour of training data. Result: noticeably tighter voice match than zero-shot, useful for commercial-grade audiobook or branded-voice deployments.
AllTalk has a built-in fine-tuning UI that wraps this for non-developers.
Performance Benchmarks {#benchmarks}
Generating 10 seconds of audio on an RTX 4090 (RTF = seconds of audio produced per second of wall-clock time):
| Setting | Time | RTF |
|---|---|---|
| Default | 2.5 sec | 4x |
| With DeepSpeed | 1.6 sec | 6.3x |
| Streaming first-chunk latency | 280 ms | n/a |
Vs F5-TTS at default settings: F5-TTS runs at ~5x RTF to XTTS's 4x, so F5 is slightly faster. Vs Bark: XTTS at 4x against Bark's 0.5x makes XTTS roughly 8x faster.
License Status (CPML) {#license}
XTTS v2 weights ship under the Coqui Public Model License (CPML):
- Free for research, personal, and academic use
- Commercial use originally required Coqui paid license
- Coqui AI shut down in 2024; paid license tier no longer exists
- Codebase community-fork lives under MPL 2.0 (commercial-friendly)
- Weights remain in legal limbo for commercial use
For confirmed commercial deployments, alternatives:
- F5-TTS (CC-BY-NC-4.0) with separate commercial license
- Spark-TTS (Apache 2.0)
- OpenVoice v2 (MIT)
- Commercial APIs (ElevenLabs, OpenAI TTS, Azure)
For research / personal / small-scale internal use: XTTS v2 is fine in practice.
Ethical Considerations {#ethics}
Same considerations as F5-TTS:
- Consent before cloning
- Disclosure of synthetic audio
- Watermarking where possible
- Logging
- Refusal of public-figure / fraud cloning
- Compliance with state-level synthetic media laws
See F5-TTS ethics section for detailed discussion.
Tuning Recipes {#tuning}
RTX 4090 production server
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.synthesizer.tts_model.config.gpt_max_audio_tokens = 605 # default
Enable DeepSpeed for ~50% speedup (pip install deepspeed). It must be requested when the checkpoint is loaded, via the low-level API from the streaming section:
model.load_checkpoint(config, checkpoint_dir="model/", eval=True, use_deepspeed=True)
model.cuda()
Tight VRAM (4-6 GB)
tts = TTS("...").to("cuda")
# XTTS itself needs only ~2-3 GB, so the default config fits; the main
# OOM risk is very long inputs. Synthesize sentence-by-sentence instead.
CPU-only
tts = TTS("...").to("cpu")
Slower but functional. For batch use cases (e.g., overnight audiobook generation), CPU works; see the sketch below.
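A minimal overnight-batch sketch along those lines (the chapters/ layout and narrator.wav are hypothetical; short sentence-level chunks also keep memory flat):
from pathlib import Path
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")
Path("out").mkdir(exist_ok=True)

for chapter in sorted(Path("chapters").glob("*.txt")):
    text = chapter.read_text(encoding="utf-8")
    # naive sentence split; swap in a proper splitter (e.g. pysbd) for production
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for i, sentence in enumerate(sentences):
        tts.tts_to_file(
            text=sentence + ".",
            speaker_wav="narrator.wav",
            language="en",
            file_path=f"out/{chapter.stem}_{i:04d}.wav",
        )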
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Robotic output | Reference noisy | Clean reference (Demucs / Whisper VAD) |
| Wrong language pronunciation | Wrong language code | Pass correct language parameter |
| OOM | Long input + GPU tight | Chunk text, use CPU offload |
| Slow on AMD | PyTorch CUDA build | Reinstall with rocm6.2 index |
| Mac slow | MPS not enabled | Verify torch.backends.mps.is_available() |
| AllTalk install fails | Python version | Use Python 3.10 / 3.11 only |
| Cross-lingual voice drift | Script mismatch (Latin → CJK) | Use same-script reference where possible |
Sources: Coqui TTS GitHub (archived) | idiap community fork | AllTalk TTS | Internal benchmarks (RTX 4090, M4 Max).