
F5-TTS Setup Guide (2026): The Best Open-Source Voice Cloning Model

May 1, 2026
22 min read
LocalAimaster Research Team

F5-TTS is the open-source text-to-speech model that finally rivals commercial APIs on zero-shot voice cloning. Built on flow matching with a DiT-style transformer by the SWivid team, it produces natural-sounding speech and clones voices from just 5-15 seconds of reference audio. Quality is competitive with ElevenLabs for general voiceover work, and it surpasses OpenAI TTS, Azure, and Google on cloning fidelity.

This guide covers everything: installation across NVIDIA / AMD / Apple, voice cloning from reference audio, multilingual generation, real-time streaming output, integration with LocalAI / ComfyUI / Open WebUI, fine-tuning for new languages, performance benchmarks, and the ethical / legal considerations of voice cloning at scale.

Table of Contents

  1. What F5-TTS Is
  2. How Flow Matching TTS Works
  3. F5-TTS vs XTTS v2 vs OpenVoice vs Spark-TTS
  4. Hardware Requirements
  5. Installation: pip, Docker, Source
  6. Your First Voice Clone
  7. Reference Audio Best Practices
  8. Multilingual Generation
  9. Streaming and Real-Time Output
  10. CLI Usage
  11. Python API
  12. Gradio Web UI
  13. LocalAI Integration (OpenAI-Compatible)
  14. ComfyUI / Open WebUI Integration
  15. Fine-Tuning for New Languages
  16. Performance Benchmarks
  17. Ethical and Legal Considerations
  18. Troubleshooting

What F5-TTS Is {#what-it-is}

F5-TTS ("Flow Matching with Diffusion-Transformer for Speech") is a 2024 open-source TTS model from SWivid. Architecture:

  • Flow matching — a diffusion-style training objective that produces sharper, more natural speech than older auto-regressive TTS
  • Diffusion Transformer (DiT) backbone — the same architecture family that powers Flux for image generation
  • Conditional generation — text + reference audio → cloned speech
  • Mel-spectrogram output — the model predicts mel-spectrograms, which a separate neural vocoder (Vocos by default) decodes to a waveform

Project: github.com/SWivid/F5-TTS. License: the code is MIT, but the released checkpoints are CC-BY-NC-4.0 (non-commercial unless a commercial license is obtained).


How Flow Matching TTS Works {#flow-matching}

Traditional autoregressive TTS (e.g., Tacotron 2) generates speech token-by-token; quality is bounded by exposure bias and slow inference. Flow matching reframes generation as learning a smooth path between Gaussian noise and the target mel-spectrogram, then solving the ODE at inference time to denoise.
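
To make the ODE view concrete, here is a minimal sketch of Euler-method sampling as flow-matching models use it in general. The velocity_model here is a hypothetical stand-in for the trained DiT, not the actual F5-TTS internals; the number of steps corresponds to the nfe_step parameter you will meet in the API sections below.

import torch

def sample_mel(velocity_model, text_cond, ref_cond, steps=32, shape=(1, 500, 100)):
    """Euler ODE solver: walk from Gaussian noise (t=0) to a mel-spectrogram (t=1)."""
    x = torch.randn(shape)                  # (batch, frames, mel bins) -- illustrative shape
    ts = torch.linspace(0, 1, steps + 1)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v = velocity_model(x, t, text_cond, ref_cond)  # model predicts velocity dx/dt
        x = x + (ts[i + 1] - ts[i]) * v     # one Euler step along the learned flow
    return x                                # final x approximates the target mel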

Practical impact:

  • Higher-quality prosody than autoregressive TTS
  • Faster inference (parallel ODE steps vs sequential tokens)
  • Better cross-lingual generalization
  • Robust to noisy training data

If you're familiar with diffusion image models, flow matching is the audio equivalent — and F5-TTS specifically uses the same DiT architecture family.


F5-TTS vs XTTS v2 vs OpenVoice vs Spark-TTS {#comparison}

| Property | F5-TTS | XTTS v2 (Coqui) | OpenVoice v2 | Spark-TTS |
|---|---|---|---|---|
| Voice cloning | 5-15 sec ref | 6-10 sec ref | 1-5 sec ref | 5-15 sec ref |
| Languages | EN, ZH base; community ports | 17 official | 6 native | EN, ZH base |
| Quality (MOS) | 4.3 | 4.0 | 3.9 | 4.2 |
| Cross-lingual | Yes (some loss) | Best | Yes (excellent) | Yes |
| Emotion control | Limited | Limited | Style transfer | Limited |
| Inference speed (RTX 4090, RTF) | 5x real-time | 3x | 4x | 5x |
| License | CC-BY-NC-4.0 | MPL 2.0 | MIT | Apache 2.0 |
| Best for | General cloning | Multilingual | Style transfer | Open commercial |

For pure voice cloning quality on a non-commercial / research basis: F5-TTS. For commercial-friendly licensing with high quality: Spark-TTS or XTTS v2. For style/emotion transfer: OpenVoice v2.


Hardware Requirements {#requirements}

| Hardware | RTF (real-time factor) |
|---|---|
| RTX 4090 (24 GB) | ~5x |
| RTX 4070 (12 GB) | ~3x |
| RTX 3060 (12 GB) | ~2x |
| Apple M4 Max (MPS) | ~1.5x |
| Apple M2 (MPS) | ~0.8x |
| RX 7900 XTX (ROCm) | ~3x |
| Ryzen 7 7700X (CPU) | ~0.3x |

VRAM: ~3 GB for model + activations. RTF above 1x means faster than real time: at 5x, generating 10 seconds of audio takes about 2 seconds. CPU-only works but is slower than real time. For real-time / interactive use cases, an NVIDIA GPU is recommended.


Installation: pip, Docker, Source {#installation}

pip

python3.10 -m venv ~/venvs/f5tts
source ~/venvs/f5tts/bin/activate
pip install --upgrade pip

# CUDA 12.4
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124

# F5-TTS
pip install f5-tts

# Verify
f5-tts_infer-cli --help
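
Before generating anything, it is worth confirming that PyTorch actually sees your accelerator. These are standard PyTorch calls, nothing F5-TTS-specific:

import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())         # NVIDIA (ROCm builds report here too)
print("MPS available:", torch.backends.mps.is_available())  # Apple Silicon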

Docker

docker run --gpus all --rm -it \
    -v $(pwd):/workspace \
    ghcr.io/swivid/f5-tts:latest \
    bash

From source

git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .

The model checkpoint (~1.5 GB) is downloaded from Hugging Face on first use.
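
To pre-fetch the checkpoint instead (useful for air-gapped machines or Docker image builds), huggingface_hub can download the official SWivid model repo into the local cache; a minimal sketch:

from huggingface_hub import snapshot_download

# Downloads the F5-TTS checkpoints into the local Hugging Face cache
snapshot_download(repo_id="SWivid/F5-TTS")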


Your First Voice Clone {#first-clone}

f5-tts_infer-cli \
    --model F5TTS_Base \
    --ref_audio reference.wav \
    --ref_text "This is what the speaker is saying in the reference audio." \
    --gen_text "Hello, this is the cloned voice speaking new text." \
    --output_file output.wav

Required:

  • --ref_audio — 5-15 second WAV / MP3 / FLAC of the speaker
  • --ref_text — exact transcription of what they say
  • --gen_text — new text to generate in their voice

The transcription matters — F5-TTS uses it for alignment, and a wrong transcription degrades quality.
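
If you don't have a transcription handy, an ASR model such as Whisper can produce one. A minimal sketch using the openai-whisper package (a separate dependency, pip install openai-whisper):

import whisper

model = whisper.load_model("small")         # "small" balances speed and accuracy
result = model.transcribe("reference.wav")
print(result["text"])                       # use this as --ref_text (proofread it first)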


Reference Audio Best Practices {#reference-audio}

For best clone quality:

  • Length: 10-15 seconds (5 sec works but is tighter)
  • Format: mono WAV, 16-44 kHz, 16-bit PCM
  • Content: ideally varied speech (different sentences, intonations)
  • Quality: clean, with no background noise, music, or echo
  • Single speaker: F5-TTS clones whatever voice it hears, so keep other speakers out of the clip
  • Microphone match: record reference with the same mic the speaker would use

If the reference audio is noisy, clean it first with a source-separation / denoising tool such as Demucs or Ultimate Vocal Remover (UVR), and trim leading/trailing silence with a VAD.

For voice with emotion (excited, calm, sad), include that emotion in the reference — F5-TTS will partially carry it through to generation.
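
To get an arbitrary clip into the recommended shape, a short torchaudio sketch that downmixes to mono, resamples to 24 kHz (an assumption within the 16-44 kHz range above), and writes 16-bit PCM:

import torchaudio

wav, sr = torchaudio.load("raw_reference.mp3")
wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 24000)  # resample to 24 kHz
torchaudio.save("reference.wav", wav, 24000, encoding="PCM_S", bits_per_sample=16)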


Multilingual Generation {#multilingual}

The base F5-TTS model is trained on English and Chinese. For other languages:

  • Japanese: F5TTS_v1_japanese community fine-tune
  • Korean: several community models
  • Spanish / French / German / Italian: community variants
  • Multilingual: E2-TTS sister model has broader coverage

For example, with the community Japanese model:

f5-tts_infer-cli --model F5TTS_v1_japanese \
    --ref_audio japanese_speaker.wav \
    --ref_text "リファレンステキスト" \
    --gen_text "新しい日本語のテキスト"

Cross-lingual cloning (English reference → Japanese output) works with some loss in voice fidelity. For best results, use a same-language reference.


Streaming and Real-Time Output {#streaming}

F5-TTS supports chunked streaming output — split long input text into sentence-level chunks, generate sequentially, stream each chunk's audio as it completes.

from f5_tts.api import F5TTS
import sounddevice as sd

f5 = F5TTS(model="F5TTS_Base")

text = "This is a long passage with multiple sentences. " * 10

for audio_chunk in f5.stream(
    ref_audio="reference.wav",
    ref_text="...",
    gen_text=text,
    chunk_size=120,  # tokens per chunk
):
    sd.play(audio_chunk, samplerate=24000, blocking=True)

Latency-to-first-audio on RTX 4090: ~300-500 ms. Suitable for interactive voice agents, real-time TTS in games, etc.
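
To measure time-to-first-audio on your own hardware, wrap the same stream() call shown above with a timer; this assumes the streaming API exactly as presented in this guide:

import time
from f5_tts.api import F5TTS

f5 = F5TTS(model="F5TTS_Base")
start = time.perf_counter()
for audio_chunk in f5.stream(
    ref_audio="reference.wav",
    ref_text="This is what the speaker is saying in the reference audio.",
    gen_text="Hello there, this is a latency test.",
):
    print(f"first audio after {time.perf_counter() - start:.2f} s")
    break  # we only care about the first chunk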


CLI Usage {#cli}

# Single generation
f5-tts_infer-cli \
    --model F5TTS_Base \
    --ref_audio ref.wav \
    --ref_text "Reference transcription" \
    --gen_text "New text" \
    --output_file out.wav

# Batch from file
f5-tts_infer-cli \
    --model F5TTS_Base \
    --ref_audio ref.wav \
    --ref_text "..." \
    --gen_file lines.txt \
    --output_dir outputs/

# Cross-fade between segments for long form
f5-tts_infer-cli ... --cross_fade_duration 0.15

For audiobook generation: split chapters into ~1-2 minute chunks, generate each, and concatenate with cross-fades (a sketch follows).
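
A sketch of the concatenation step, assuming the CLI produced mono chunk files named outputs/chunk_000.wav, chunk_001.wav, ... (hypothetical names) at a common sample rate; the linear cross-fade mirrors --cross_fade_duration:

import glob
import numpy as np
import soundfile as sf

FADE_S = 0.15                                    # match --cross_fade_duration
files = sorted(glob.glob("outputs/chunk_*.wav"))
audio, sr = sf.read(files[0])                    # assumes mono chunks
fade = int(FADE_S * sr)
ramp = np.linspace(0.0, 1.0, fade)
for path in files[1:]:
    nxt, _ = sf.read(path)
    # blend the tail of the running audio with the head of the next chunk
    audio[-fade:] = audio[-fade:] * (1 - ramp) + nxt[:fade] * ramp
    audio = np.concatenate([audio, nxt[fade:]])
sf.write("chapter.wav", audio, sr)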


Python API {#python}

from f5_tts.api import F5TTS

f5 = F5TTS(
    model="F5TTS_Base",
    device="cuda",       # cuda / mps / cpu / rocm
)

# Single generation
audio, sample_rate = f5.infer(
    ref_audio="reference.wav",
    ref_text="Transcription of reference",
    gen_text="New text to generate",
    nfe_step=32,         # ODE steps; higher = better quality, slower
    cfg_strength=2.0,    # CFG-style guidance
)

# Save
import soundfile as sf
sf.write("output.wav", audio, sample_rate)

nfe_step controls quality / speed: 16 for fast, 32 for default, 64 for highest quality.
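
To find your own sweet spot, sweep nfe_step with the same infer() signature shown above and time each run:

import time
from f5_tts.api import F5TTS

f5 = F5TTS(model="F5TTS_Base", device="cuda")
for steps in (16, 32, 64):
    t0 = time.perf_counter()
    audio, sample_rate = f5.infer(
        ref_audio="reference.wav",
        ref_text="Transcription of reference",
        gen_text="A fixed test sentence for benchmarking generation speed.",
        nfe_step=steps,
    )
    print(f"nfe_step={steps}: {time.perf_counter() - t0:.2f} s")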


Gradio Web UI {#web-ui}

f5-tts_infer-gradio --port 7860 --host 0.0.0.0

Browse to http://localhost:7860. UI includes:

  • Reference audio upload + text
  • Generation text input
  • Model picker
  • Speed / quality sliders
  • Multi-speaker conversation mode (alternate voices)

Useful for quick experimentation and voice testing without writing code.


LocalAI Integration (OpenAI-Compatible) {#localai}

To expose F5-TTS as an OpenAI-compatible /v1/audio/speech endpoint, wrap it with LocalAI:

# models/f5tts.yaml
name: f5-tts
backend: f5tts
parameters:
  model: F5TTS_Base
  ref_audio: /build/voices/default.wav
  ref_text: "Default reference text"

Then OpenAI clients work unchanged:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")

resp = client.audio.speech.create(
    model="f5-tts",
    voice="default",
    input="Hello world",
)
resp.stream_to_file("out.wav")

For per-call voice override, pass voice as a known LocalAI voice config name.


ComfyUI / Open WebUI Integration {#integrations}

ComfyUI

Community node packs (search "F5-TTS ComfyUI" in ComfyUI Manager) add F5-TTS nodes. Workflow: text input → F5-TTS Generate node → save WAV. Useful for AI-narration over generated images.

Open WebUI

Settings → Audio → TTS Engine → Custom OpenAI URL → point to LocalAI with F5-TTS configured. Open WebUI then reads bot replies aloud in the cloned voice.

SillyTavern / KoboldAI

Both support OpenAI-compatible TTS via custom endpoint. Configure the TTS provider URL to LocalAI; pick the F5-TTS voice.


Fine-Tuning for New Languages {#fine-tuning}

For a new language without an existing community model:

# Prepare clean speech corpus (1+ hour minimum, 10+ hours recommended)
# WAVs at 24 kHz mono, paired with transcriptions

git clone https://github.com/SWivid/F5-TTS
cd F5-TTS
python -m f5_tts.train --config configs/F5TTS_Base.yaml \
    --dataset_path /path/to/your/corpus

Fine-tuning a new language from scratch on RTX 4090: ~24-72 hours depending on corpus. For most users, waiting for / using a community fine-tune is simpler than training your own.
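
Whatever exact dataset layout your config expects (check the repo's training docs; the metadata.csv name and wav|text layout below are assumptions), you will need to pair each WAV with its transcription. A sketch of a manifest builder:

import csv
from pathlib import Path

corpus = Path("/path/to/your/corpus")
with open(corpus / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for wav in sorted(corpus.glob("wavs/*.wav")):
        txt = wav.with_suffix(".txt")  # assumes a transcript file next to each WAV
        writer.writerow([wav.name, txt.read_text(encoding="utf-8").strip()])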


Performance Benchmarks {#benchmarks}

10-second generation, single-stream, RTX 4090 BF16:

| Setting | Time | RTF |
|---|---|---|
| nfe_step=16 (fast) | 1.2 sec | 8.3x |
| nfe_step=32 (default) | 2.0 sec | 5.0x |
| nfe_step=64 (high quality) | 3.8 sec | 2.6x |

Quality difference: nfe_step=16 has noticeable artifacts on long outputs; 32 is the sweet spot; 64 is for audiobook / commercial-grade output.


Ethical and Legal Considerations {#ethics}

Voice cloning is dual-use. Best practices for responsible deployment:

  1. Consent — only clone voices with explicit permission
  2. Disclosure — label synthetic audio (EU AI Act requires this in some contexts)
  3. Watermarking — embed inaudible signatures (Audio-WM, AudioSeal; see the sketch after this list)
  4. Logging — audit trail of who generated what
  5. Refusals — block public-figure cloning, deepfake-of-living-people prompts
  6. Geofencing — comply with state-level synthetic media laws (Texas, Tennessee, California in the US)
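
A minimal watermarking sketch using Meta's audioseal package (pip install audioseal); the checkpoint name and the 16 kHz mono input are AudioSeal-specific assumptions, so verify against the library's docs before deploying:

import torch
from audioseal import AudioSeal

model = AudioSeal.load_generator("audioseal_wm_16bits")  # assumed checkpoint name
wav = torch.randn(1, 1, 16000)       # stand-in for 1 s of 16 kHz audio: (batch, channels, time)
watermark = model.get_watermark(wav, 16000)
watermarked = wav + watermark        # additive, designed to be inaudible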

For commercial deployments, treat voice cloning the same as biometric data — privacy, consent, and audit requirements apply. The released F5-TTS checkpoints additionally carry a CC-BY-NC license, which restricts commercial use without a separate arrangement with SWivid.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| Robotic / metallic output | Reference too noisy | Use cleaner reference audio |
| Wrong pace / pitch | Reference too short | Use 10-15 sec, varied content |
| Repeated phrases | nfe_step too low | Bump to 32 or 64 |
| Wrong language pronunciation | Wrong model | Use language-specific fine-tune |
| OOM | nfe_step + long input | Chunk text, lower nfe_step |
| Slow on AMD | PyTorch CUDA build | Reinstall torch with rocm6.2 index |
| Mac slow | MPS not enabled | Verify torch.backends.mps.is_available() |



Sources: F5-TTS GitHub | F5-TTS paper | Hugging Face F5-TTS models | LocalAI F5-TTS backend | Internal benchmarks RTX 4090, M4 Max.
