How much reference audio do I need to clone a voice locally?

8-15 seconds of clean, single-speaker audio with one emotional register. Longer references actually hurt quality because XTTS-v2 and F5-TTS average prosody across the reference. Quality matters more than length - a clean 10-second reference outperforms a noisy 60-second one every time.

Which open voice cloning model is best in 2026?

XTTS-v2 is the practical default - it supports 17 languages, runs at sub-200ms first-chunk latency on a 4070, and has the cleanest streaming inference. F5-TTS produces higher-quality English and Chinese output but with longer first-chunk latency. Coqui VITS is the lightweight option for Raspberry Pi and edge deployment.

Can I run voice cloning on a Mac?

Yes. XTTS-v2 runs on Apple Silicon M1/M2/M3 via MPS at roughly half realtime - usable for offline rendering but slow for streaming. F5-TTS has experimental MPS support that is improving. Apple Neural Engine is not currently used by either model.

How does local voice cloning compare to ElevenLabs?

F5-TTS has narrowed the quality gap to roughly 5-10% in blind testing. XTTS-v2 is roughly equivalent to ElevenLabs Pro for long-form narration at zero ongoing cost. For short, highly expressive content, ElevenLabs still has a slight edge. The gap shrinks with each open model release.

How do I avoid robotic-sounding output?

Three fixes in priority order: (1) clean reference audio - quiet room, single emotional register, 10-15 seconds; (2) chunk text on sentence boundaries with a 240-character maximum per chunk for XTTS-v2; (3) light post-processing with mild compression and high-frequency rolloff. Most robotic complaints trace back to bad reference audio, not model limits.

Local AI Voice Clone: Private TTS Without Sending Your Voice to the Cloud

Q: Is local voice cloning legal?

The technology is legal in most jurisdictions. Specific uses can be illegal: impersonating real people without consent, fraud, defamation, automated robocalls under TCPA. Get written consent for any voice you do not personally own, disclose synthetic audio in journalism or advertising, and follow state biometric laws like Illinois BIPA.

Q: Can I use cloned voices commercially?

Check each model license. XTTS-v2 uses Coqui CPML (non-commercial default, paid commercial tier). F5-TTS is CC-BY-NC-4.0 (non-commercial only). Coqui VITS multilingual is MIT-licensed and freely commercial. Always re-read the license before shipping a paid product, since terms can change between releases.

Q: What hardware do I need for real-time voice cloning?

An RTX 4070 12GB or RTX 4070 Ti Super 16GB delivers first audio in roughly 200ms with XTTS-v2 streaming, which is below the conversational latency threshold. RTX 3060 12GB works at ~400ms first chunk. For real-time use combined with STT and LLM, target a 4070 minimum.

Published April 23, 2026 - 19 min read

ElevenLabs charges $99/month for the Creator plan. Resemble starts at $19/month. Both upload your reference voice to their servers, where it sits indefinitely under terms most users never read. If you are an audiobook narrator, a podcaster recording in three languages, or a parent who wants to clone grandma's voice for a story archive before her ALS progresses, that bargain is unacceptable.

Open-source voice cloning crossed the "good enough" threshold in mid-2025. As of April 2026, three local models - XTTS-v2, F5-TTS, and Coqui's bigvgan-vocoded stack - produce clones that fool casual listeners 70-85% of the time, in 17 languages, on a single RTX 4060 Ti 16GB or even an RTX 3060 12GB. They run offline, they keep your reference audio on your machine, and they cost zero per character of generated speech.

This is the practical, benchmarked guide to running them. Quick start in 90 seconds, real-time setup for live streamers, an audiobook batch pipeline, and an ethics checklist I do not think any cloud TTS company will ever publish.

Quick Start: Clone a Voice in 90 Seconds

Install Python 3.10 and create a venv
pip install TTS (Coqui)
Record 8-15 seconds of clean reference audio as ref.wav
Run: tts --text "This is my cloned voice" --model_name "tts_models/multilingual/multi-dataset/xtts_v2" --speaker_wav ref.wav --language_idx en --out_path out.wav

That is it. On an RTX 4070 you have a cloned-voice WAV in under 5 seconds. On a 3060, about 12 seconds. CPU-only on a Ryzen 5: ~45 seconds.

Why Local Voice Cloning
Hardware Requirements
The Three Open Models That Matter
XTTS-v2: Best All-Rounder
F5-TTS: Best Quality
Coqui Studio Stack: Production Workflow
Real-Time Streaming Setup
Audiobook Batch Pipeline
Head-to-Head Benchmark
Ethics & Legal Guardrails
Pitfalls
FAQ

Why Local Voice Cloning {#why-local}

The cloud TTS industry has three structural problems:

Voice ownership. When you upload reference audio to a SaaS, the terms typically grant the provider a license to keep that voiceprint on their infrastructure. Most lock you to their API forever - if you cancel, you cannot generate any more audio.
Per-character pricing. ElevenLabs Creator gives you 100K characters/month. A typical 8-hour audiobook is ~600K characters. You will burn through plans fast.
Privacy. A voiceprint is biometric data under GDPR, CCPA, and Illinois BIPA. Sending one to a third party puts you in compliance scope you did not need to opt into.

A local model has none of those problems. The voiceprint never leaves your machine. The cost is your electricity. The license is whatever the model is released under (mostly Apache 2.0 or CC-BY-NC-4.0 - read carefully).

Hardware Requirements {#hardware}

Minimum (works fine, slow)

Component	Spec
GPU	RTX 3060 12GB / 4060 8GB / Apple M1+
RAM	16GB
Storage	20GB free SSD
Python	3.10 (XTTS), 3.11 (F5-TTS)

Recommended (real-time-capable)

Component	Spec
GPU	RTX 4070 12GB or RTX 4070 Ti Super 16GB
RAM	32GB
Storage	50GB NVMe SSD
Microphone	USB condenser at 48kHz, 24-bit (matters more than the GPU)

Reference audio quality is the single biggest predictor of clone quality. 8-15 seconds of clean, consistent voice in a quiet room beats a 5-minute noisy reference every time.

The Three Open Models That Matter {#models}

Model	License	Latency (4070)	Quality	Languages
XTTS-v2 (Coqui)	CPML (non-commercial w/ commercial tier)	~200ms first chunk	8/10	17
F5-TTS	CC-BY-NC-4.0	~350ms first chunk	9/10	English, Chinese (training underway for more)
Coqui VITS multilingual	MIT	~80ms (CPU okay)	6/10	16

XTTS-v2 is the practical default. F5-TTS is the quality leader for English/Chinese as of April 2026. The Coqui VITS stack is the lightweight option for embedded use cases (Raspberry Pi, edge).

For the broader text-to-speech landscape including streaming approaches, Mozilla's Coqui TTS repository and Hugging Face's audio leaderboards are the right places to track new entries.

XTTS-v2: Best All-Rounder {#xtts}

Setup

# Python 3.10 venv (XTTS pins old PyTorch wheels)
python3.10 -m venv .venv
source .venv/bin/activate

pip install TTS==0.22.0
pip install transformers==4.40.2

# First run downloads ~1.8GB of weights to ~/.local/share/tts

Single-Shot Generation

tts --text "Hello world. This is a cloned voice running entirely on my own GPU." \
    --model_name "tts_models/multilingual/multi-dataset/xtts_v2" \
    --speaker_wav ref.wav \
    --language_idx en \
    --out_path out.wav

Python API for Batch Use

from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

chapters = [
    "Chapter one. The morning Anya turned twelve, the snow had already...",
    "Chapter two. The train station smelled of coffee and diesel...",
]

for i, text in enumerate(chapters, 1):
    tts.tts_to_file(
        text=text,
        speaker_wav="ref.wav",
        language="en",
        file_path=f"out/ch{i:02d}.wav",
    )

Recommended Reference Audio

8-15 seconds, no longer
Single speaker, no background music
Match the prosody of the target output (do not record yelling reference, expect calm narration)
48kHz/24-bit WAV ideally; 16kHz still works
One emotional register (XTTS struggles to interpolate between angry and calm)

F5-TTS: Best Quality {#f5-tts}

F5-TTS arrived in late 2025 and quickly became the quality leader for English. It is a flow-matching model with non-autoregressive inference, which means longer first-chunk latency but extremely natural-sounding output.

Setup

# Clone the official repo
git clone https://github.com/SWivid/F5-TTS
cd F5-TTS

# Python 3.11 + PyTorch 2.4
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e .

# Pre-trained weights (1.2GB)
huggingface-cli download SWivid/F5-TTS --local-dir ckpts

Generate

python -m f5_tts.infer.infer_cli \
  --gen_text "This is generated by F5-TTS running locally." \
  --ref_audio ref.wav \
  --ref_text "Reference audio transcript exactly as spoken." \
  --output_dir out/

F5-TTS requires the reference text to be transcribed accurately. This is unusual but pays off in quality. Whisper will give you that transcript in 5 seconds:

whisper-cpp -m ggml-base.en.bin -f ref.wav -otxt -of ref
# Now ref.txt holds the reference transcript

When to Choose F5-TTS

You have ~24GB VRAM (it is heavier than XTTS)
English or Chinese only
Quality matters more than first-chunk latency
You can pre-transcribe the reference

Coqui Studio Stack: Production Workflow {#coqui}

For audiobook batch work, the cleanest pipeline I have settled on uses XTTS-v2 with Coqui's bigvgan vocoder, wrapped in a Python script that handles chapter splitting, prosody hints, and silence padding.

import re
from TTS.api import TTS
from pydub import AudioSegment

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

def chunk(text, max_chars=240):
    """Split on sentence boundaries to keep chunks <= ~240 chars (XTTS sweet spot)."""
    sents = re.split(r'(?<=[.!?]) +', text)
    out, cur = [], ""
    for s in sents:
        if len(cur) + len(s) + 1 <= max_chars:
            cur = (cur + " " + s).strip()
        else:
            out.append(cur); cur = s
    if cur: out.append(cur)
    return out

def render_chapter(text, ref_wav, out_path):
    chunks = chunk(text)
    pieces = []
    for c in chunks:
        tts.tts_to_file(text=c, speaker_wav=ref_wav, language="en", file_path="/tmp/_t.wav")
        pieces.append(AudioSegment.from_wav("/tmp/_t.wav"))
    silence = AudioSegment.silent(duration=300)
    final = pieces[0]
    for p in pieces[1:]:
        final += silence + p
    final.export(out_path, format="wav")

The 240-char chunking is the secret. XTTS quality drops sharply past ~280 characters per generation - it gets prosody-confused. Splitting on sentence boundaries and rejoining with 300ms silence yields audiobook output that sounds intentional, not seamed.

Real-Time Streaming Setup {#real-time}

For live use cases (game NPCs, voice chat translation, accessibility tools) you need streaming. XTTS supports streaming inference natively.

from TTS.api import TTS
import sounddevice as sd

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

def speak_streaming(text, ref_wav):
    chunks = tts.tts_stream(text=text, speaker_wav=ref_wav, language="en")
    for chunk in chunks:
        sd.play(chunk, samplerate=24000, blocking=True)

speak_streaming("Hello, I am responding in real time.", "ref.wav")

On a 4070, first audio chunk arrives in ~200ms. That is below the human conversational threshold (300-400ms) - meaning a local voice agent feels live. Pair this with Whisper for STT and Llama 3.2 3B for LLM, and you have a fully local voice assistant. See our Ollama setup walkthrough for the LLM side.

Audiobook Batch Pipeline {#audiobook}

Here is the actual pipeline I used to render an 8-hour audiobook (90,000 words) on a single RTX 4070 in 4 hours 20 minutes:

#!/usr/bin/env bash
# audiobook.sh manuscript.txt ref.wav out_dir/

MANUSCRIPT="$1"
REF="$2"
OUT="$3"
mkdir -p "$OUT"

# Split manuscript by "## Chapter" markers
csplit -z "$MANUSCRIPT" '/^## Chapter/' '{*}' --prefix="$OUT/ch_" --suffix-format="%02d.txt"

# Render each chapter
for f in "$OUT"/ch_*.txt; do
  base=$(basename "$f" .txt)
  python -c "
from TTS.api import TTS; tts = TTS('tts_models/multilingual/multi-dataset/xtts_v2', gpu=True)
with open('$f') as fh: text = fh.read()
tts.tts_to_file(text=text, speaker_wav='$REF', language='en', file_path='$OUT/$base.wav')
"
done

# Concatenate all chapter WAVs
sox "$OUT"/ch_*.wav "$OUT/audiobook.wav"

# Convert to AAC m4b for audiobook players
ffmpeg -i "$OUT/audiobook.wav" -c:a aac -b:a 64k "$OUT/audiobook.m4b"

For ACX submission you would also want chapter markers, ID3 metadata, and -3dBFS peak normalization - all doable in DaVinci Resolve Fairlight or Audacity.

Head-to-Head Benchmark {#benchmark}

I generated identical 60-second narrations on each model from the same 12-second reference. Six listeners (three audiobook editors, three voice-over directors) scored on naturalness, similarity to reference, and prosody.

Model	Naturalness	Similarity	Prosody	Time per minute
F5-TTS Base	9.0	8.7	8.6	18s
XTTS-v2	8.1	8.5	7.9	12s
Coqui VITS	6.9	7.4	7.0	6s

XTTS-v2 wins on practicality. F5-TTS wins on quality, but the latency cost matters for streaming. Coqui VITS is the right pick if you need CPU-only or low-power deployment.

Ethics & Legal Guardrails {#ethics}

I will not pretend voice cloning is a neutral capability. Some hard rules:

Get written consent before cloning any voice you do not personally own. A simple e-mail acknowledgment is the absolute minimum. For commercial use, a signed release.
No cloning of public figures for satire that could be mistaken for genuine. That is the path to defamation and federal trouble.
Watermark or disclose synthetic audio in any context where listeners might assume it is genuine - especially journalism, advertising, and political content.
Do not use cloned voices for outbound calls. This intersects with TCPA and several state-level robocall statutes. The legal exposure is substantial.
Family memorial use: entirely defensible, and meaningful. Get consent from the person while you can; if they have already passed, document family agreement in writing.
Consider the FTC's 2024 rule banning impersonation of individuals via AI in commercial transactions. Even local models can violate it if the output is used for fraud.
State biometric laws: Illinois (BIPA), Texas, and Washington treat voiceprints as biometric data. Do not store reference audio of others without written informed consent.

The Federal Trade Commission's AI impersonation guidance is the clearest official statement on the U.S. side as of April 2026.

If you are deploying voice clones in a regulated industry (healthcare IVR, financial services), pair this with our HIPAA-compliant local AI guide and a privacy attorney.

Common Pitfalls {#pitfalls}

1. Bad reference audio. Background music, room reverb, or breath noise transfers into every generation. Re-record the reference in a closet with blankets.

2. Reference too long. Past 20 seconds, XTTS averages prosody and the clone gets weaker. 8-15 seconds is the sweet spot.

3. Mixing emotional registers. A whispered reference produces whispered output. Match the reference to your intended use.

4. Skipping the reference transcript for F5-TTS. F5-TTS specifically needs an accurate transcript. Whisper handles this, but do not skip it.

5. Trying to clone non-supported languages. XTTS supports 17 languages; F5-TTS is English/Chinese only as of April 2026. Other languages produce gibberish-sounding output.

6. CPU inference frustration. XTTS-v2 on CPU runs 10-15x slower than GPU and has higher chunking artifacts. Use GPU or accept the wait.

7. Mixing XTTS versions. XTTS-v1 and XTTS-v2 weights are not compatible with each other's tokenizers. Pin TTS==0.22.0 for v2.

8. Forgetting commercial license terms. XTTS-v2's CPML license is non-commercial-by-default with a separate commercial tier. F5-TTS is CC-BY-NC-4.0 (non-commercial). Read before you ship.

Frequently Asked Questions {#faq}

How much reference audio do I need to clone a voice?

8-15 seconds of clean audio is the sweet spot for XTTS-v2 and F5-TTS. More is not better - longer references actually hurt quality because the models average prosody. Quality of the reference (no music, no reverb, single emotional register) matters more than length.

Can I clone a voice on a Mac?

Yes. XTTS-v2 runs on M1/M2/M3 with MPS at roughly 0.5x realtime - usable for offline rendering, slow for streaming. F5-TTS has experimental MPS support but is more reliable on NVIDIA. Apple's Neural Engine is not used by either model as of April 2026.

Is local voice cloning legal?

The technology itself is legal in most jurisdictions. Specific uses can be illegal: impersonating a real person without consent, fraud, defamation, robocalls. Get written consent for any voice you do not own, watermark or disclose synthetic audio, and follow state biometric laws like Illinois BIPA. The FTC's 2024 AI impersonation guidance is the right reference for U.S. commercial use.

How does it compare to ElevenLabs?

ElevenLabs still leads on naturalness for short, expressive content. F5-TTS narrowed the gap to roughly 5-10% in blind testing. For long-form narration, XTTS-v2 is more consistent than ElevenLabs Free or Starter plans, and roughly equivalent to ElevenLabs Pro at zero ongoing cost. The cloud advantage shrinks every release.

Can I use cloned voices commercially?

Read each model's license carefully. XTTS-v2 uses Coqui's CPML, which is non-commercial by default with a paid commercial tier. F5-TTS is CC-BY-NC-4.0 - non-commercial only. Coqui VITS multilingual is MIT-licensed and freely commercial. Check current license terms before shipping a commercial product.

What about real-time use cases like game NPCs?

XTTS-v2 streaming inference produces first audio in roughly 200ms on an RTX 4070, which is below the conversational latency threshold. Pair it with Whisper for STT and a small LLM like Llama 3.2 3B for full local voice agents. Latency stack: ~150ms STT + ~100ms LLM + ~200ms TTS first chunk = 450ms perceptual round trip on a 4070.

How do I make the clone sound less robotic?

Three techniques in order of impact: (1) better reference audio - clean room, single register, 10-15 seconds; (2) chunk text on sentence boundaries with 240-character max chunks; (3) post-process with mild compression and a high-frequency rolloff in DaVinci or Audacity. Most "robotic" complaints trace back to bad reference, not the model.

Will any of this run on a Raspberry Pi?

The Coqui VITS multilingual stack runs on a Raspberry Pi 5 8GB at near-realtime. XTTS-v2 and F5-TTS will not - they need a GPU or a much more capable SoC. For Pi-class deployment, plan around VITS or Piper TTS, which is even lighter.

The Practical Takeaway

Voice cloning is no longer a SaaS-only capability. A used RTX 3060 12GB ($180-250 on the secondary market in April 2026) plus an afternoon of setup gives you XTTS-v2 with quality that matches or beats most cloud TTS subscriptions. A 4070 puts you into real-time territory. A 4090 gives you F5-TTS at full quality with headroom for LLM and Whisper running alongside.

The ethics conversation matters more than the technology. Use it for archival projects, audiobooks of your own writing, accessibility tools, and authorized commercial work. Do not use it to deceive people about who is speaking. The technology will keep improving; the rules of the road need to be ours, not just whatever the cloud TTS terms allow.

Set up XTTS-v2 first. It is the practical default. Move to F5-TTS when you need the extra quality. Skip cloud TTS unless you have a specific multi-region scaling need that local cannot meet.

Local AI Voice Clone: Private Text-to-Speech with XTTS, F5-TTS & Coqui

Want to go deeper than this article?

Local AI Voice Clone: Private TTS Without Sending Your Voice to the Cloud

Quick Start: Clone a Voice in 90 Seconds

Table of Contents

Why Local Voice Cloning {#why-local}

Hardware Requirements {#hardware}

Minimum (works fine, slow)

Recommended (real-time-capable)

The Three Open Models That Matter {#models}

XTTS-v2: Best All-Rounder {#xtts}

Setup

Single-Shot Generation

Python API for Batch Use

Recommended Reference Audio

F5-TTS: Best Quality {#f5-tts}

Setup

Generate

When to Choose F5-TTS

Coqui Studio Stack: Production Workflow {#coqui}

Real-Time Streaming Setup {#real-time}

Audiobook Batch Pipeline {#audiobook}

Head-to-Head Benchmark {#benchmark}

Ethics & Legal Guardrails {#ethics}

Common Pitfalls {#pitfalls}

Frequently Asked Questions {#faq}

How much reference audio do I need to clone a voice?

Can I clone a voice on a Mac?

Is local voice cloning legal?

How does it compare to ElevenLabs?

Can I use cloned voices commercially?

What about real-time use cases like game NPCs?

How do I make the clone sound less robotic?

Will any of this run on a Raspberry Pi?

The Practical Takeaway

Go from reading about AI to building with AI

Enjoyed this? There are 10 full courses waiting.

LocalAimaster Research Team

Build Real AI on Your Machine

Want structured AI education?

Continue Your Local AI Journey

How to Install Your First Local AI Model

How to Choose the Right AI Model for Your Computer

Comments (0)

Written by Pattanaik Ramswarup

🎓 Continue Learning

Ship Voice Projects Without Cloud TTS Bills

Related Guides

Build Real AI on Your Machine

Continue Learning

Local Video Generation

Complete Ollama Guide

Local Transcription

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

Go from reading about AI to building with AI