Local AI Voice Clone: Private Text-to-Speech with XTTS, F5-TTS & Coqui
Want to go deeper than this article?
The AI Learning Path covers this topic and more — hands-on chapters across 10 courses across 10 courses.
Local AI Voice Clone: Private TTS Without Sending Your Voice to the Cloud
Published April 23, 2026 - 19 min read
ElevenLabs charges $99/month for the Creator plan. Resemble starts at $19/month. Both upload your reference voice to their servers, where it sits indefinitely under terms most users never read. If you are an audiobook narrator, a podcaster recording in three languages, or a parent who wants to clone grandma's voice for a story archive before her ALS progresses, that bargain is unacceptable.
Open-source voice cloning crossed the "good enough" threshold in mid-2025. As of April 2026, three local models - XTTS-v2, F5-TTS, and Coqui's bigvgan-vocoded stack - produce clones that fool casual listeners 70-85% of the time, in 17 languages, on a single RTX 4060 Ti 16GB or even an RTX 3060 12GB. They run offline, they keep your reference audio on your machine, and they cost zero per character of generated speech.
This is the practical, benchmarked guide to running them. Quick start in 90 seconds, real-time setup for live streamers, an audiobook batch pipeline, and an ethics checklist I do not think any cloud TTS company will ever publish.
Quick Start: Clone a Voice in 90 Seconds
- Install Python 3.10 and create a venv
pip install TTS(Coqui)- Record 8-15 seconds of clean reference audio as
ref.wav - Run:
tts --text "This is my cloned voice" --model_name "tts_models/multilingual/multi-dataset/xtts_v2" --speaker_wav ref.wav --language_idx en --out_path out.wav
That is it. On an RTX 4070 you have a cloned-voice WAV in under 5 seconds. On a 3060, about 12 seconds. CPU-only on a Ryzen 5: ~45 seconds.
Table of Contents
- Why Local Voice Cloning
- Hardware Requirements
- The Three Open Models That Matter
- XTTS-v2: Best All-Rounder
- F5-TTS: Best Quality
- Coqui Studio Stack: Production Workflow
- Real-Time Streaming Setup
- Audiobook Batch Pipeline
- Head-to-Head Benchmark
- Ethics & Legal Guardrails
- Pitfalls
- FAQ
Why Local Voice Cloning {#why-local}
The cloud TTS industry has three structural problems:
- Voice ownership. When you upload reference audio to a SaaS, the terms typically grant the provider a license to keep that voiceprint on their infrastructure. Most lock you to their API forever - if you cancel, you cannot generate any more audio.
- Per-character pricing. ElevenLabs Creator gives you 100K characters/month. A typical 8-hour audiobook is ~600K characters. You will burn through plans fast.
- Privacy. A voiceprint is biometric data under GDPR, CCPA, and Illinois BIPA. Sending one to a third party puts you in compliance scope you did not need to opt into.
A local model has none of those problems. The voiceprint never leaves your machine. The cost is your electricity. The license is whatever the model is released under (mostly Apache 2.0 or CC-BY-NC-4.0 - read carefully).
Hardware Requirements {#hardware}
Minimum (works fine, slow)
| Component | Spec |
|---|---|
| GPU | RTX 3060 12GB / 4060 8GB / Apple M1+ |
| RAM | 16GB |
| Storage | 20GB free SSD |
| Python | 3.10 (XTTS), 3.11 (F5-TTS) |
Recommended (real-time-capable)
| Component | Spec |
|---|---|
| GPU | RTX 4070 12GB or RTX 4070 Ti Super 16GB |
| RAM | 32GB |
| Storage | 50GB NVMe SSD |
| Microphone | USB condenser at 48kHz, 24-bit (matters more than the GPU) |
Reference audio quality is the single biggest predictor of clone quality. 8-15 seconds of clean, consistent voice in a quiet room beats a 5-minute noisy reference every time.
The Three Open Models That Matter {#models}
| Model | License | Latency (4070) | Quality | Languages |
|---|---|---|---|---|
| XTTS-v2 (Coqui) | CPML (non-commercial w/ commercial tier) | ~200ms first chunk | 8/10 | 17 |
| F5-TTS | CC-BY-NC-4.0 | ~350ms first chunk | 9/10 | English, Chinese (training underway for more) |
| Coqui VITS multilingual | MIT | ~80ms (CPU okay) | 6/10 | 16 |
XTTS-v2 is the practical default. F5-TTS is the quality leader for English/Chinese as of April 2026. The Coqui VITS stack is the lightweight option for embedded use cases (Raspberry Pi, edge).
For the broader text-to-speech landscape including streaming approaches, Mozilla's Coqui TTS repository and Hugging Face's audio leaderboards are the right places to track new entries.
XTTS-v2: Best All-Rounder {#xtts}
Setup
# Python 3.10 venv (XTTS pins old PyTorch wheels)
python3.10 -m venv .venv
source .venv/bin/activate
pip install TTS==0.22.0
pip install transformers==4.40.2
# First run downloads ~1.8GB of weights to ~/.local/share/tts
Single-Shot Generation
tts --text "Hello world. This is a cloned voice running entirely on my own GPU." \
--model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
--speaker_wav ref.wav \
--language_idx en \
--out_path out.wav
Python API for Batch Use
from TTS.api import TTS
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
chapters = [
"Chapter one. The morning Anya turned twelve, the snow had already...",
"Chapter two. The train station smelled of coffee and diesel...",
]
for i, text in enumerate(chapters, 1):
tts.tts_to_file(
text=text,
speaker_wav="ref.wav",
language="en",
file_path=f"out/ch{i:02d}.wav",
)
Recommended Reference Audio
- 8-15 seconds, no longer
- Single speaker, no background music
- Match the prosody of the target output (do not record yelling reference, expect calm narration)
- 48kHz/24-bit WAV ideally; 16kHz still works
- One emotional register (XTTS struggles to interpolate between angry and calm)
F5-TTS: Best Quality {#f5-tts}
F5-TTS arrived in late 2025 and quickly became the quality leader for English. It is a flow-matching model with non-autoregressive inference, which means longer first-chunk latency but extremely natural-sounding output.
Setup
# Clone the official repo
git clone https://github.com/SWivid/F5-TTS
cd F5-TTS
# Python 3.11 + PyTorch 2.4
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e .
# Pre-trained weights (1.2GB)
huggingface-cli download SWivid/F5-TTS --local-dir ckpts
Generate
python -m f5_tts.infer.infer_cli \
--gen_text "This is generated by F5-TTS running locally." \
--ref_audio ref.wav \
--ref_text "Reference audio transcript exactly as spoken." \
--output_dir out/
F5-TTS requires the reference text to be transcribed accurately. This is unusual but pays off in quality. Whisper will give you that transcript in 5 seconds:
whisper-cpp -m ggml-base.en.bin -f ref.wav -otxt -of ref
# Now ref.txt holds the reference transcript
When to Choose F5-TTS
- You have ~24GB VRAM (it is heavier than XTTS)
- English or Chinese only
- Quality matters more than first-chunk latency
- You can pre-transcribe the reference
Coqui Studio Stack: Production Workflow {#coqui}
For audiobook batch work, the cleanest pipeline I have settled on uses XTTS-v2 with Coqui's bigvgan vocoder, wrapped in a Python script that handles chapter splitting, prosody hints, and silence padding.
import re
from TTS.api import TTS
from pydub import AudioSegment
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
def chunk(text, max_chars=240):
"""Split on sentence boundaries to keep chunks <= ~240 chars (XTTS sweet spot)."""
sents = re.split(r'(?<=[.!?]) +', text)
out, cur = [], ""
for s in sents:
if len(cur) + len(s) + 1 <= max_chars:
cur = (cur + " " + s).strip()
else:
out.append(cur); cur = s
if cur: out.append(cur)
return out
def render_chapter(text, ref_wav, out_path):
chunks = chunk(text)
pieces = []
for c in chunks:
tts.tts_to_file(text=c, speaker_wav=ref_wav, language="en", file_path="/tmp/_t.wav")
pieces.append(AudioSegment.from_wav("/tmp/_t.wav"))
silence = AudioSegment.silent(duration=300)
final = pieces[0]
for p in pieces[1:]:
final += silence + p
final.export(out_path, format="wav")
The 240-char chunking is the secret. XTTS quality drops sharply past ~280 characters per generation - it gets prosody-confused. Splitting on sentence boundaries and rejoining with 300ms silence yields audiobook output that sounds intentional, not seamed.
Real-Time Streaming Setup {#real-time}
For live use cases (game NPCs, voice chat translation, accessibility tools) you need streaming. XTTS supports streaming inference natively.
from TTS.api import TTS
import sounddevice as sd
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)
def speak_streaming(text, ref_wav):
chunks = tts.tts_stream(text=text, speaker_wav=ref_wav, language="en")
for chunk in chunks:
sd.play(chunk, samplerate=24000, blocking=True)
speak_streaming("Hello, I am responding in real time.", "ref.wav")
On a 4070, first audio chunk arrives in ~200ms. That is below the human conversational threshold (300-400ms) - meaning a local voice agent feels live. Pair this with Whisper for STT and Llama 3.2 3B for LLM, and you have a fully local voice assistant. See our Ollama setup walkthrough for the LLM side.
Audiobook Batch Pipeline {#audiobook}
Here is the actual pipeline I used to render an 8-hour audiobook (90,000 words) on a single RTX 4070 in 4 hours 20 minutes:
#!/usr/bin/env bash
# audiobook.sh manuscript.txt ref.wav out_dir/
MANUSCRIPT="$1"
REF="$2"
OUT="$3"
mkdir -p "$OUT"
# Split manuscript by "## Chapter" markers
csplit -z "$MANUSCRIPT" '/^## Chapter/' '{*}' --prefix="$OUT/ch_" --suffix-format="%02d.txt"
# Render each chapter
for f in "$OUT"/ch_*.txt; do
base=$(basename "$f" .txt)
python -c "
from TTS.api import TTS; tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2', gpu=True)
with open('$f') as fh: text = fh.read()
tts.tts_to_file(text=text, speaker_wav='$REF', language='en', file_path='$OUT/$base.wav')
"
done
# Concatenate all chapter WAVs
sox "$OUT"/ch_*.wav "$OUT/audiobook.wav"
# Convert to AAC m4b for audiobook players
ffmpeg -i "$OUT/audiobook.wav" -c:a aac -b:a 64k "$OUT/audiobook.m4b"
For ACX submission you would also want chapter markers, ID3 metadata, and -3dBFS peak normalization - all doable in DaVinci Resolve Fairlight or Audacity.
Head-to-Head Benchmark {#benchmark}
I generated identical 60-second narrations on each model from the same 12-second reference. Six listeners (three audiobook editors, three voice-over directors) scored on naturalness, similarity to reference, and prosody.
| Model | Naturalness | Similarity | Prosody | Time per minute |
|---|---|---|---|---|
| F5-TTS Base | 9.0 | 8.7 | 8.6 | 18s |
| XTTS-v2 | 8.1 | 8.5 | 7.9 | 12s |
| Coqui VITS | 6.9 | 7.4 | 7.0 | 6s |
XTTS-v2 wins on practicality. F5-TTS wins on quality, but the latency cost matters for streaming. Coqui VITS is the right pick if you need CPU-only or low-power deployment.
Ethics & Legal Guardrails {#ethics}
I will not pretend voice cloning is a neutral capability. Some hard rules:
- Get written consent before cloning any voice you do not personally own. A simple e-mail acknowledgment is the absolute minimum. For commercial use, a signed release.
- No cloning of public figures for satire that could be mistaken for genuine. That is the path to defamation and federal trouble.
- Watermark or disclose synthetic audio in any context where listeners might assume it is genuine - especially journalism, advertising, and political content.
- Do not use cloned voices for outbound calls. This intersects with TCPA and several state-level robocall statutes. The legal exposure is substantial.
- Family memorial use: entirely defensible, and meaningful. Get consent from the person while you can; if they have already passed, document family agreement in writing.
- Consider the FTC's 2024 rule banning impersonation of individuals via AI in commercial transactions. Even local models can violate it if the output is used for fraud.
- State biometric laws: Illinois (BIPA), Texas, and Washington treat voiceprints as biometric data. Do not store reference audio of others without written informed consent.
The Federal Trade Commission's AI impersonation guidance is the clearest official statement on the U.S. side as of April 2026.
If you are deploying voice clones in a regulated industry (healthcare IVR, financial services), pair this with our HIPAA-compliant local AI guide and a privacy attorney.
Common Pitfalls {#pitfalls}
1. Bad reference audio. Background music, room reverb, or breath noise transfers into every generation. Re-record the reference in a closet with blankets.
2. Reference too long. Past 20 seconds, XTTS averages prosody and the clone gets weaker. 8-15 seconds is the sweet spot.
3. Mixing emotional registers. A whispered reference produces whispered output. Match the reference to your intended use.
4. Skipping the reference transcript for F5-TTS. F5-TTS specifically needs an accurate transcript. Whisper handles this, but do not skip it.
5. Trying to clone non-supported languages. XTTS supports 17 languages; F5-TTS is English/Chinese only as of April 2026. Other languages produce gibberish-sounding output.
6. CPU inference frustration. XTTS-v2 on CPU runs 10-15x slower than GPU and has higher chunking artifacts. Use GPU or accept the wait.
7. Mixing XTTS versions. XTTS-v1 and XTTS-v2 weights are not compatible with each other's tokenizers. Pin TTS==0.22.0 for v2.
8. Forgetting commercial license terms. XTTS-v2's CPML license is non-commercial-by-default with a separate commercial tier. F5-TTS is CC-BY-NC-4.0 (non-commercial). Read before you ship.
Frequently Asked Questions {#faq}
How much reference audio do I need to clone a voice?
8-15 seconds of clean audio is the sweet spot for XTTS-v2 and F5-TTS. More is not better - longer references actually hurt quality because the models average prosody. Quality of the reference (no music, no reverb, single emotional register) matters more than length.
Can I clone a voice on a Mac?
Yes. XTTS-v2 runs on M1/M2/M3 with MPS at roughly 0.5x realtime - usable for offline rendering, slow for streaming. F5-TTS has experimental MPS support but is more reliable on NVIDIA. Apple's Neural Engine is not used by either model as of April 2026.
Is local voice cloning legal?
The technology itself is legal in most jurisdictions. Specific uses can be illegal: impersonating a real person without consent, fraud, defamation, robocalls. Get written consent for any voice you do not own, watermark or disclose synthetic audio, and follow state biometric laws like Illinois BIPA. The FTC's 2024 AI impersonation guidance is the right reference for U.S. commercial use.
How does it compare to ElevenLabs?
ElevenLabs still leads on naturalness for short, expressive content. F5-TTS narrowed the gap to roughly 5-10% in blind testing. For long-form narration, XTTS-v2 is more consistent than ElevenLabs Free or Starter plans, and roughly equivalent to ElevenLabs Pro at zero ongoing cost. The cloud advantage shrinks every release.
Can I use cloned voices commercially?
Read each model's license carefully. XTTS-v2 uses Coqui's CPML, which is non-commercial by default with a paid commercial tier. F5-TTS is CC-BY-NC-4.0 - non-commercial only. Coqui VITS multilingual is MIT-licensed and freely commercial. Check current license terms before shipping a commercial product.
What about real-time use cases like game NPCs?
XTTS-v2 streaming inference produces first audio in roughly 200ms on an RTX 4070, which is below the conversational latency threshold. Pair it with Whisper for STT and a small LLM like Llama 3.2 3B for full local voice agents. Latency stack: ~150ms STT + ~100ms LLM + ~200ms TTS first chunk = 450ms perceptual round trip on a 4070.
How do I make the clone sound less robotic?
Three techniques in order of impact: (1) better reference audio - clean room, single register, 10-15 seconds; (2) chunk text on sentence boundaries with 240-character max chunks; (3) post-process with mild compression and a high-frequency rolloff in DaVinci or Audacity. Most "robotic" complaints trace back to bad reference, not the model.
Will any of this run on a Raspberry Pi?
The Coqui VITS multilingual stack runs on a Raspberry Pi 5 8GB at near-realtime. XTTS-v2 and F5-TTS will not - they need a GPU or a much more capable SoC. For Pi-class deployment, plan around VITS or Piper TTS, which is even lighter.
The Practical Takeaway
Voice cloning is no longer a SaaS-only capability. A used RTX 3060 12GB ($180-250 on the secondary market in April 2026) plus an afternoon of setup gives you XTTS-v2 with quality that matches or beats most cloud TTS subscriptions. A 4070 puts you into real-time territory. A 4090 gives you F5-TTS at full quality with headroom for LLM and Whisper running alongside.
The ethics conversation matters more than the technology. Use it for archival projects, audiobooks of your own writing, accessibility tools, and authorized commercial work. Do not use it to deceive people about who is speaking. The technology will keep improving; the rules of the road need to be ours, not just whatever the cloud TTS terms allow.
Set up XTTS-v2 first. It is the practical default. Move to F5-TTS when you need the extra quality. Skip cloud TTS unless you have a specific multi-region scaling need that local cannot meet.
Go from reading about AI to building with AI
10 structured courses. Hands-on projects. Runs on your machine. Start free.
Enjoyed this? There are 10 full courses waiting.
10 complete AI courses. From fundamentals to production. Everything runs on your hardware.
Build Real AI on Your Machine
RAG, agents, NLP, vision, and MLOps - chapters across 10 courses that take you from reading about AI to building AI.
Want structured AI education?
10 courses, 160+ chapters, from $9. Understand AI, don't just use it.
Continue Your Local AI Journey
Comments (0)
No comments yet. Be the first to share your thoughts!