
XTTS v2 Voice Cloning Guide (2026): Coqui TTS for 17 Languages

May 1, 2026
22 min read
LocalAimaster Research Team

XTTS v2 is the open-source multilingual TTS that powered most local-AI voice clones from 2023-2025 and remains the right default for any multilingual deployment in 2026. Coqui AI built it; Coqui AI shut down; the codebase and weights remain freely available and the community keeps the project alive. 17 languages, zero-shot voice cloning from 6 seconds of reference, real-time streaming on consumer GPUs.

This guide covers everything: installation across platforms, voice cloning best practices, multilingual generation, the Coqui TTS server vs LocalAI integration, AllTalk wrapper for SillyTavern users, fine-tuning for specific speakers, performance tuning, and the licensing situation post-Coqui-shutdown.

Table of Contents

  1. What XTTS v2 Is
  2. The Coqui Shutdown and What It Means
  3. Hardware Requirements
  4. Installation: pip, Docker, AllTalk
  5. Your First Voice Clone
  6. The 17 Supported Languages
  7. Reference Audio Best Practices
  8. Streaming and Real-Time
  9. Coqui TTS Server (Built-in HTTP API)
  10. LocalAI Integration (OpenAI-Compatible)
  11. AllTalk: The Community Wrapper
  12. SillyTavern / Open WebUI / Home Assistant
  13. Fine-Tuning for a Specific Speaker
  14. Performance Benchmarks
  15. License Status (CPML)
  16. Ethical Considerations
  17. Tuning Recipes
  18. Troubleshooting
  19. FAQ

What XTTS v2 Is {#what-it-is}

XTTS v2 ("Cross-lingual Text-to-Speech") is a 750M-parameter model from Coqui AI, released in late 2023. Architecture: a GPT-2-style autoregressive transformer over discrete audio tokens, paired with a HiFi-GAN vocoder.

Capabilities:

  • 17-language synthesis from a single model
  • Zero-shot voice cloning from 6+ seconds of reference
  • Cross-lingual cloning (English speaker → Spanish output)
  • Streaming output for real-time use
  • Fine-tuning for tighter speaker matching

Project: github.com/coqui-ai/TTS (archived); github.com/idiap/coqui-ai-TTS (community fork, active).


The Coqui Shutdown and What It Means {#shutdown}

Coqui AI announced wind-down in January 2024. The repository was archived. As of mid-2026:

  • Code: community fork at github.com/idiap/coqui-ai-TTS keeps it current.
  • Weights: XTTS v2 weights remain on Hugging Face under the Coqui Public Model License (CPML).
  • Commercial license: the paid commercial tier no longer exists. CPML permits research and personal use; commercial use always required the paid license, which is now defunct.
  • Practical reality: many individuals and small businesses use XTTS v2 commercially anyway, and no rights holder is actively enforcing CPML in 2026. For confirmed commercial deployments at scale, switch to F5-TTS under its separate commercial license, Spark-TTS (Apache 2.0), or an enterprise TTS provider.

Always consult an attorney for your specific situation. This guide does not constitute legal advice.


Hardware Requirements {#requirements}

| Hardware | RTF |
|----------|-----|
| RTX 4090 | ~4x real-time |
| RTX 4070 | ~3x |
| RTX 3060 | ~2x |
| M4 Max (MPS) | ~1.2x |
| RX 7900 XTX (ROCm) | ~2x |
| Ryzen 7 7700X (CPU) | ~0.4x |

VRAM usage is 2-3 GB. CPU-only inference is slow but workable for batch jobs.
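RTF here is the real-time factor: seconds of audio produced per second of wall-clock time, so higher is faster. A trivial helper makes the benchmark numbers concrete:

```python
def real_time_factor(audio_seconds, wall_seconds):
    """Seconds of audio generated per second of wall-clock time.
    RTF > 1 means faster than real time."""
    return audio_seconds / wall_seconds

# The RTX 4090 figure from the benchmarks later in this guide:
# 10 s of audio generated in 2.5 s
print(real_time_factor(10.0, 2.5))  # → 4.0
```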



Installation: pip, Docker, AllTalk {#installation}

pip (community fork)

python3.11 -m venv ~/venvs/xtts
source ~/venvs/xtts/bin/activate

# CUDA 12.4
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124

# Community fork of Coqui TTS (active maintenance)
pip install coqui-tts

# Verify
tts --help

Docker

docker run --gpus all --rm -it \
    -v $(pwd):/workspace \
    ghcr.io/coqui-ai/tts:latest \
    bash
AllTalk

git clone https://github.com/erew123/alltalk_tts
cd alltalk_tts
./atsetup.sh        # Linux/Mac
# or .\atsetup.bat on Windows

AllTalk auto-installs XTTS v2, downloads default voices, sets up Gradio UI, and provides SillyTavern integration. Easiest path for most users.


Your First Voice Clone {#first-clone}

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

tts.tts_to_file(
    text="Hello, this is a cloned voice speaking new words.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav",
)

That's it. A reference WAV (6+ seconds of clean speech) + new text + a language code → cloned audio.

CLI equivalent:

tts --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
    --speaker_wav reference.wav \
    --language_idx en \
    --text "Hello world" \
    --out_path output.wav

The 17 Supported Languages {#languages}

| Code | Language |
|------|----------|
| en | English |
| es | Spanish |
| fr | French |
| de | German |
| it | Italian |
| pt | Portuguese |
| pl | Polish |
| tr | Turkish |
| ru | Russian |
| nl | Dutch |
| cs | Czech |
| ar | Arabic |
| zh | Chinese |
| ja | Japanese |
| hu | Hungarian |
| ko | Korean |
| hi | Hindi |

Cross-lingual cloning works for any pair. European-language pairs give the highest fidelity; cross-script pairs (Latin-script reference → CJK / Arabic / Hindi) show noticeable voice drift.
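If you batch-generate across languages, it is worth validating codes before loading the model. A small guard using the codes from the table above (note: some builds expect a regional variant such as `zh-cn` for Chinese, so confirm against your installed model's config):

```python
# Language codes accepted by XTTS v2's `language` parameter, per the table above
XTTS_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh", "ja", "hu", "ko", "hi",
}

def check_language(code: str) -> str:
    """Fail fast with a readable error instead of a mid-generation crash."""
    if code not in XTTS_LANGUAGES:
        raise ValueError(
            f"unsupported language {code!r}; choose one of {sorted(XTTS_LANGUAGES)}"
        )
    return code
```

Call `check_language` on user input before the expensive model load, not after.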


Reference Audio Best Practices {#reference}

  • Length: 6-15 seconds (longer past 20 sec gives diminishing returns)
  • Format: mono WAV, 16-44 kHz, 16-bit
  • Content: clean speech, varied intonation, no music or noise
  • Speaker: single, full sentence with natural prosody
  • Microphone: match the speaker's typical recording setup

For voices with strong emotion (excited, sad, etc.), include that emotion in the reference — XTTS partially carries it through.

For audiobook narration: record 1-2 minutes of varied content from the narrator, slice the cleanest 15 seconds for the reference clip.
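These constraints are easy to check programmatically before a batch run. A minimal validator using only the standard library (thresholds mirror the guidelines above; `check_reference` is a name made up for this sketch):

```python
import wave

def check_reference(path):
    """Validate a reference clip against the guidelines above:
    mono, 16-bit, 16-44 kHz sample rate, 6-15 seconds long."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        sampwidth = w.getsampwidth()          # bytes per sample; 2 = 16-bit
        rate = w.getframerate()
        seconds = w.getnframes() / rate
    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if sampwidth != 2:
        problems.append(f"expected 16-bit, got {8 * sampwidth}-bit")
    if not 16000 <= rate <= 44100:
        problems.append(f"sample rate {rate} Hz outside 16-44 kHz")
    if not 6 <= seconds <= 15:
        problems.append(f"length {seconds:.1f}s outside the 6-15 s sweet spot")
    return problems
```

Run it on each candidate clip and fix anything it flags before cloning.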


Streaming and Real-Time {#streaming}

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="model/", eval=True)
model.cuda()

# Compute conditioning latents from the reference clip once, then reuse them
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# Stream chunked output
for chunk in model.inference_stream(
    text="Long text...",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
):
    play(chunk)  # play or buffer each chunk as it arrives (play() is your audio sink)

Latency to first audio on RTX 4090: ~280 ms. Acceptable for interactive voice agents.


Coqui TTS Server (Built-in HTTP API) {#tts-server}

tts-server \
    --model_name tts_models/multilingual/multi-dataset/xtts_v2 \
    --port 5002

Endpoints:

  • GET /api/tts?text=...&speaker_wav=...&language=en
  • POST /api/tts with JSON body

Simple but functional. For OpenAI compatibility, wrap with LocalAI.
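As a quick smoke test of the GET endpoint, you can assemble the query string with the standard library (parameter names follow the endpoint list above; adjust if your server build differs):

```python
from urllib.parse import urlencode

# Build the GET /api/tts request shown above; the server responds with WAV bytes
base = "http://localhost:5002/api/tts"
params = {
    "text": "Hello from the Coqui TTS server.",
    "speaker_wav": "reference.wav",   # path as seen by the server
    "language": "en",
}
url = f"{base}?{urlencode(params)}"
print(url)
# Fetch with urllib.request.urlopen(url).read() and write the bytes to a .wav file
```

Open the URL in a browser, or fetch it from any HTTP client, to confirm the server is up before wiring in a frontend.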


LocalAI Integration (OpenAI-Compatible) {#localai}

# models/xtts.yaml in LocalAI
name: xtts-v2
backend: coqui
parameters:
  model: tts_models/multilingual/multi-dataset/xtts_v2
  voice_wav: /build/voices/default-en.wav
  language: en

Then OpenAI clients work:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")

resp = client.audio.speech.create(
    model="xtts-v2",
    voice="default-en",
    input="Hello world",
)
resp.stream_to_file("out.wav")

For multiple voices, configure multiple YAML files, one per voice, and pick via the voice parameter.
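For example, a second voice entry could be a hypothetical models/xtts-narrator.yaml (the name and voice path are illustrative, not from the LocalAI docs):

```yaml
name: xtts-narrator
backend: coqui
parameters:
  model: tts_models/multilingual/multi-dataset/xtts_v2
  voice_wav: /build/voices/narrator-en.wav
  language: en
```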


AllTalk: The Community Wrapper {#alltalk}

AllTalk TTS is the community wrapper that makes XTTS v2 production-ready for non-developers. Features:

  • Gradio web UI with voice picker
  • Built-in voice library (drop WAVs into voices/)
  • SillyTavern integration out of the box
  • Streaming output
  • Narrator-style voice gallery
  • Fine-tuning UI (no code required)
  • DeepSpeed acceleration support
  • Real-time-factor 4x+ on RTX 4090

For most users, AllTalk is the right entry point — install it once, drop your reference voices in voices/, and use the Gradio UI or hook it up to SillyTavern.


SillyTavern / Open WebUI / Home Assistant {#integrations}

SillyTavern

API → Connections → "TTS" provider → AllTalk or Coqui TTS Server → URL http://localhost:5002 (Coqui server) or http://localhost:7851 (AllTalk). Pick voice per character card. Bot replies are read aloud in the cloned voice.

Open WebUI

Settings → Audio → TTS Engine → OpenAI-compatible URL → LocalAI with XTTS configured.

Home Assistant

assist_pipeline integration with the Coqui TTS server backend. Smart-home voice responses use cloned family-member voices for a fun touch (with consent).


Fine-Tuning for a Specific Speaker {#fine-tuning}

For tighter voice match than zero-shot:

# Prepare 1+ hour of clean speech with transcripts in LJSpeech format
python TTS/bin/train_tts.py \
    --config_path TTS/tts/configs/xtts_v2_train.json \
    --restore_path /path/to/xtts_v2.pth \
    --output_path /path/to/output \
    --formatter ljspeech \
    --dataset_path /path/to/your_dataset

Time: 12-24 hours on RTX 4090 for 1 hour of training data. Result: noticeably tighter voice match than zero-shot, useful for commercial-grade audiobook or branded-voice deployments.

AllTalk has a built-in fine-tuning UI that wraps this for non-developers.
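The LJSpeech format expected above is a metadata.csv of pipe-separated file_id|transcript lines alongside a wavs/ folder. A small pre-flight check (`validate_ljspeech` is a helper written for this sketch):

```python
import csv

def validate_ljspeech(metadata_path):
    """Check metadata.csv against the LJSpeech convention:
    pipe-separated lines of  file_id|transcript  (a third, normalized
    transcript column is also common), with no empty fields."""
    errors = []
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for lineno, row in enumerate(csv.reader(f, delimiter="|"), start=1):
            if len(row) not in (2, 3):
                errors.append(f"line {lineno}: expected 2-3 fields, got {len(row)}")
            elif not row[0].strip() or not row[1].strip():
                errors.append(f"line {lineno}: empty file id or transcript")
    return errors
```

Run it before launching a 12-24 hour training job; a malformed line caught early saves a restart.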


Performance Benchmarks {#benchmarks}

10-second generation, RTX 4090:

| Setting | Time | RTF |
|---------|------|-----|
| Default | 2.5 sec | 4x |
| With DeepSpeed | 1.6 sec | 6.3x |
| Streaming first-chunk latency | 280 ms | n/a |

Versus F5-TTS at default settings (5x RTF to XTTS's 4x), F5 is slightly faster. Versus Bark (0.5x RTF), XTTS is roughly 8x faster.


License Status (CPML) {#license}

XTTS v2 weights ship under the Coqui Public Model License (CPML):

  • Free for research, personal, and academic use
  • Commercial use originally required Coqui paid license
  • Coqui AI shut down in 2024; paid license tier no longer exists
  • Codebase: the community fork is licensed MPL 2.0 (commercial-friendly)
  • Weights remain in legal limbo for commercial use

For confirmed commercial deployments, alternatives:

  • F5-TTS (CC-BY-NC-4.0) with separate commercial license
  • Spark-TTS (Apache 2.0)
  • OpenVoice v2 (MIT)
  • Commercial APIs (ElevenLabs, OpenAI TTS, Azure)

For research / personal / small-scale internal use: XTTS v2 is fine in practice.


Ethical Considerations {#ethics}

Same considerations as F5-TTS:

  • Consent before cloning
  • Disclosure of synthetic audio
  • Watermarking where possible
  • Logging
  • Refusal of public-figure / fraud cloning
  • Compliance with state-level synthetic media laws

See F5-TTS ethics section for detailed discussion.


Tuning Recipes {#tuning}

RTX 4090 production server

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.synthesizer.tts_model.config.gpt_max_audio_tokens = 605      # default

Enable DeepSpeed for ~50% speedup. It must be requested when the checkpoint is loaded, so use the low-level loading pattern from the streaming section and pass the flag there:

model.load_checkpoint(config, checkpoint_dir="model/", use_deepspeed=True, eval=True)

Tight VRAM (4-6 GB)

tts = TTS("...").to("cuda")
# Drop sample resolution
tts.synthesizer.tts_model.config.audio.output_sample_rate = 16000

CPU-only

tts = TTS("...").to("cpu")

Slower but functional. For batch use cases (audiobook generation overnight) CPU works.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---------|-------|-----|
| Robotic output | Reference noisy | Clean reference (Demucs / Whisper VAD) |
| Wrong language pronunciation | Wrong language code | Pass correct language parameter |
| OOM | Long input + GPU tight | Chunk text, use CPU offload |
| Slow on AMD | PyTorch CUDA build | Reinstall with rocm6.2 index |
| Mac slow | MPS not enabled | Verify torch.backends.mps.is_available() |
| AllTalk install fails | Python version | Use Python 3.10 / 3.11 only |
| Cross-lingual voice drift | Script mismatch (Latin → CJK) | Use same-script reference where possible |
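The "chunk text" fix in the OOM row can be as simple as splitting input at sentence boundaries and calling tts_to_file once per chunk. A minimal splitter (assumes plain prose input; `chunk_text` is a name made up for this sketch):

```python
import re

def chunk_text(text, max_chars=250):
    """Split long input at sentence boundaries into chunks under
    max_chars, so each generation call stays within GPU memory."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Generate each chunk separately, then concatenate the WAVs.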

FAQ {#faq}



Sources: Coqui TTS GitHub (archived) | idiap community fork | AllTalk TTS | Internal benchmarks RTX 4090, M4 Max.
