Best Local TTS Models 2026: 8 Open-Source Voices Tested

June 20, 2026
Local AI Master Research Team

The best open-source text-to-speech model in 2026 depends on what you need. For fast, lightweight narration on almost any machine, Kokoro-82M (82M params, Apache 2.0) is the winner: it runs in roughly 2-3 GB of VRAM and even on CPU. For the most natural voice with cloning, Resemble AI's Chatterbox (MIT-licensed, 0.5B params) is the pick: in the company's own blind listening study, 65.3% of listeners preferred its Turbo voice, versus 24.5% who preferred ElevenLabs. Below those two, XTTS v2 still has the broadest cloning coverage at 17 languages (but is non-commercial), Piper is the king of tiny CPU/Raspberry Pi devices, and F5-TTS and Orpheus 3B are the strongest research-grade voice-cloning options. The honest caveat: license matters more than the demo reel here. Several of the best-sounding models are research/non-commercial only, so we label each one truthfully below.

If you want one sentence: install Kokoro for speed, Chatterbox for quality + a clean MIT license, and Piper if you are on a Raspberry Pi. Everything else is a trade-off you only make for a specific reason.

What makes a "best" local TTS model in 2026?

Ranking text-to-speech models is not like ranking coding LLMs, where one HumanEval number settles it. A TTS model can be excellent at one thing and useless at another. Four axes actually decide which model you should run locally:

  1. Footprint (VRAM / CPU). Kokoro fits in ~2-3 GB and runs on CPU; the larger autoregressive models want a GPU.
  2. Speed (real-time factor, RTF). RTF below 1.0 means faster than real time. Kokoro is famously fast; the bigger generative models trade speed for naturalness.
  3. License. This is the one people skip and regret. MIT/Apache models are safe for commercial products; CPML and CC-BY-NC models are research/personal only.
  4. Voice cloning. Some models clone a voice from a few seconds of reference audio; others only speak in their built-in voices. If you do not need cloning, you can pick a lighter model.
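To make the speed axis concrete, here is a minimal sketch of how real-time factor can be computed from your own timings. The function names are hypothetical and the numbers in the usage note are illustrative, not measurements.

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def is_faster_than_real_time(rtf: float) -> bool:
    """An RTF below 1.0 means audio is produced faster than it plays back."""
    return rtf < 1.0


def audio_seconds_per_compute_second(rtf: float) -> float:
    """Inverse of RTF: how many seconds of audio one second of compute yields."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf
```

For example, a run that takes 3 seconds to generate a 100-second clip has an RTF of 0.03, which works out to about 33 seconds of audio per second of compute.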

The rest of this guide ranks eight real, currently-maintained open-source models against those four axes, with a use-case table at the end so you can jump straight to the right one.

The 8 best open-source TTS models in 2026 (spec table)

Here is the full comparison. VRAM figures are approximate and depend on quantization, batch size and how long a clip you generate — treat them as "what you should plan for," not exact. License is the most load-bearing column: read it before you ship anything commercial.

| Model | Params | Approx VRAM | License | Voice cloning | Best for |
|---|---|---|---|---|---|
| Kokoro-82M | 82M | ~2-3 GB (runs on CPU) | Apache 2.0 | ❌ No (54 built-in voices) | Fast narration, anywhere |
| Chatterbox | 0.5B | ~4-6 GB | MIT | ✅ Zero-shot (~5s) | Best quality + commercial use |
| XTTS v2 | ~0.5B class | ~4-6 GB | CPML (non-commercial) | ✅ Zero-shot (~6s), 17 langs | Multilingual cloning (personal) |
| Piper | tiny (VITS) | <1 GB (CPU-first) | GPL-3.0 (current fork) | ❌ No | Raspberry Pi / edge / CPU |
| F5-TTS | ~336M (base) | ~4-8 GB | Code MIT / weights CC-BY-NC | ✅ Zero-shot (few seconds) | Research-grade cloning |
| Orpheus 3B | 3B (Llama backbone) | ~8-12 GB | Apache 2.0 | ✅ Zero-shot + emotion tags | Expressive, real-time, commercial |
| Bark | ~1B class (GPT-style) | ~6-12 GB | MIT | ⚠️ Limited / non-deterministic | Sound effects, music, expressive |
| Fish Speech (OpenAudio S1-mini) | 0.5B (open variant) | ~4-6 GB | CC-BY-NC-SA-4.0 (non-commercial) | ✅ Zero-shot (10-30s) | Multilingual, research |

A few things jump out. Kokoro, Chatterbox, Bark and Orpheus 3B carry permissive (Apache/MIT) licenses that are safe for commercial products. Piper is also commercially usable, but with a catch: the original rhasspy/piper (MIT) was archived in October 2025 and active development moved to the OHF-Voice/piper1-gpl fork, which is GPL-3.0. That is still fine to use commercially, but its copyleft terms are a real consideration if you embed it in closed-source software (the old MIT weights/voices remain usable). XTTS v2 (CPML), F5-TTS weights (CC-BY-NC) and the open Fish Speech variant (CC-BY-NC-SA-4.0) are non-commercial: fine for personal projects and demos, not for a paid product. And bigger is not better: Kokoro at 82M parameters beats models 10x its size on efficiency, which is exactly why it went viral.
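If you track candidate models in a script or a CI check, the license grouping above can be encoded directly. This is a small sketch that only mirrors the License column of the spec table; the dictionary and function names are hypothetical, and it is no substitute for reading the actual license text.

```python
# Licenses as listed in the spec table (weights license where code and weights differ).
MODEL_LICENSES = {
    "Kokoro-82M": "Apache 2.0",
    "Chatterbox": "MIT",
    "XTTS v2": "CPML",
    "Piper": "GPL-3.0",  # active OHF-Voice/piper1-gpl fork
    "F5-TTS": "CC-BY-NC",  # released weights; the code is MIT
    "Orpheus 3B": "Apache 2.0",
    "Bark": "MIT",
    "Fish Speech (OpenAudio S1-mini)": "CC-BY-NC-SA-4.0",
}

# The three groups used in this article.
LICENSE_CLASS = {
    "Apache 2.0": "permissive",
    "MIT": "permissive",
    "GPL-3.0": "copyleft",
    "CPML": "non-commercial",
    "CC-BY-NC": "non-commercial",
    "CC-BY-NC-SA-4.0": "non-commercial",
}


def license_class(model: str) -> str:
    """Return the license group for a model, or 'unknown' if the license is unmapped."""
    return LICENSE_CLASS.get(MODEL_LICENSES[model], "unknown")


def models_in_class(cls: str) -> list[str]:
    """List models in a license group, in table order."""
    return [m for m in MODEL_LICENSES if license_class(m) == cls]
```

Calling `models_in_class("permissive")` returns Kokoro-82M, Chatterbox, Orpheus 3B and Bark, while Piper is the only entry under `"copyleft"`.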

#1 Kokoro-82M — the efficiency king

Kokoro-82M (released v1.0 on January 27, 2025) is the model most people should start with. It is genuinely tiny (82 million parameters, weights under 1 GB at FP16), yet it produces clean, natural narration that rivals much larger models. The v1.0 release ships 54 voices across 8 languages. Because it is so small, it runs comfortably in about 2-3 GB of GPU memory and is usable on CPU, and it is fast: on a high-end GPU its real-time factor sits around 0.03 (i.e. it generates roughly 33 seconds of audio per second of compute).

The trade-off: Kokoro does not clone voices. You get its 54 built-in voices, not your own. For audiobooks, narration, IVR systems and any app where you just need a good neutral voice, that is fine — and the Apache 2.0 license means you can ship it commercially. If you need to clone a specific person's voice, skip to Chatterbox or XTTS v2.

We have a full walkthrough in our Kokoro TTS local setup guide, including the OpenAI-compatible FastAPI server most people run it behind. You can also read the official details on the Kokoro-82M model card on Hugging Face.

#2 Chatterbox — the model that beat ElevenLabs

Chatterbox, open-sourced by Resemble AI under the MIT license, is the most interesting release in this list. It is built on a 0.5B-parameter Llama backbone, trained on roughly 500,000 hours of audio, and it does zero-shot voice cloning from about 5 seconds of reference audio.

The headline result: in Resemble AI's own blind listening study, 65.3% of listeners preferred the Chatterbox-Turbo voice over ElevenLabs, versus 24.5% who preferred ElevenLabs (10.2% neutral). An earlier round of the test put the figure at 63.75% preferring Chatterbox. The honest framing — and we will say it plainly — is that this is a vendor-run study, so apply the usual grain of salt. But it is still the most striking open-vs-closed TTS result we have seen, and an MIT license on a model this good is rare. Chatterbox also embeds Resemble's "PerTh" (Perceptual Threshold) neural watermark in every clip, which matters if you care about traceability of generated audio.

If you want the best-sounding local voice and a license you can build a product on, Chatterbox is the pick. You can read the model details on the Chatterbox model card.

#3 XTTS v2 — broadest multilingual cloning (but non-commercial)

Coqui's XTTS v2 was the default open voice-cloning model for years, and it is still excellent: it clones a voice from a ~6-second reference clip and speaks 17 languages at 24 kHz. The catch is the license. XTTS v2 ships under the Coqui Public Model License (CPML), which is non-commercial — and because Coqui Inc. shut down in January 2024, there is no longer anyone to sell you a commercial license. So treat XTTS v2 as personal/research only.

That said, for sheer language coverage in cloning it is still hard to beat, and the tooling is mature. We have a complete tutorial in our XTTS v2 voice cloning guide, and a deeper model breakdown on the Coqui TTS model page. If your project is commercial and you wanted XTTS, use Chatterbox instead — same job, clean license.

#4 Piper — the Raspberry Pi / edge champion

Piper, from the Rhasspy team (now maintained at the OHF-Voice org), is the model to run when you have almost no compute. It uses VITS models exported to ONNX, runs comfortably on a Raspberry Pi 4, needs well under 1 GB, and is fast and offline-first. One licensing note: the original rhasspy/piper repo (MIT) was archived in October 2025, and active development moved to the OHF-Voice/piper1-gpl fork, which is GPL-3.0. You can still use it commercially, but GPL's copyleft terms are worth checking if you plan to ship it inside closed-source software. It does not clone voices, but it ships dozens of pre-trained voices across many languages.

Pick Piper for embedded devices, smart-home announcements, accessibility tools on low-power hardware, or any CPU-only deployment where Kokoro is overkill. The active repo is on GitHub at OHF-Voice/piper1-gpl.

#5-#8 — F5-TTS, Orpheus 3B, Bark, Fish Speech

The remaining four are strong but more specialized:

  • F5-TTS (~336M-param base, DiT flow-matching architecture, trained on ~100k hours) does excellent zero-shot cloning from a few seconds of audio with a fast non-autoregressive pipeline (RTF around 0.15). Its code is MIT but the released weights are CC-BY-NC (because of the Emilia training set), so it is non-commercial unless you retrain on permissive data.
  • Orpheus 3B (Canopy Labs, March 2025) is a 3B-parameter Llama-backbone speech LLM under Apache 2.0. It supports zero-shot cloning and guided emotion tags, with low latency for real-time use. It is the heaviest model here (plan for ~8-12 GB), but it is the best permissively-licensed option if you need expressive, emotional speech.
  • Bark (Suno, MIT) is a GPT-style fully generative model that produces speech plus non-verbal sounds, laughter, and even simple music. The trade-off is that it is non-deterministic — the same prompt can wander — so it is great for creative, expressive audio and weak for predictable narration.
  • Fish Speech / OpenAudio S1-mini is the 0.5B open-source sibling of Fish Audio's larger proprietary models. It does zero-shot cloning from 10-30 seconds of audio across 13 languages, but the open weights ship under CC-BY-NC-SA-4.0 (non-commercial; the code is Apache-2.0), and the flagship S1 models are served via paid API.

First-hand notes on running these locally

A few practical observations from running these on a single consumer GPU (an RTX 3090, 24 GB) — treat all numbers as approximate and hardware-dependent:

  • Kokoro is shockingly light. It barely registered on the VRAM meter (~2-3 GB) and generated long passages far faster than real time. On a CPU-only laptop it was still usable for short clips. This is the one you reach for when you just need a voice and do not want to think about hardware.
  • Chatterbox and XTTS v2 sat comfortably in the 4-6 GB range for normal-length clips, with first-audio latency of a couple of seconds. Quality on Chatterbox was the standout — it is the first local model where cloned output stopped sounding obviously synthetic to us.
  • Orpheus 3B is the hungry one. As a 3B speech-LLM it wants real GPU headroom (plan ~8-12 GB), and like any autoregressive model, the moment it spills out of VRAM the speed collapses. Keep it fully on the GPU.

If you want to estimate whether a specific model fits your card before downloading gigabytes of weights, plug the parameter count and your GPU into our VRAM calculator — it accounts for context and overhead the rough figures above gloss over.
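For a quick sanity check before reaching for a full calculator, parameter count times bytes per parameter gives a weights-only lower bound. The sketch below is not the calculator mentioned above; the names are hypothetical, and it deliberately ignores activations, caches and framework overhead, which is why the planning figures in this article are higher than what it returns.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_gb(params: float, dtype: str = "fp16") -> float:
    """Weights-only memory in decimal GB (a lower bound, not a full VRAM estimate)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def headroom_gb(params: float, vram_gb: float, dtype: str = "fp16") -> float:
    """VRAM left over after loading the weights; negative means they do not fit."""
    return vram_gb - weight_gb(params, dtype)
```

An 82M-parameter model at FP16 comes to about 0.16 GB of weights, consistent with Kokoro's "under 1 GB at FP16" figure, while a 3B-parameter model at FP16 needs 6 GB for weights alone before any runtime overhead.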

Which local TTS model should you use? (use-case table)

| Your goal | Best model | Why |
|---|---|---|
| Fast narration on any machine | Kokoro-82M | Tiny, fast, runs on CPU, Apache 2.0 |
| Best quality + commercial product | Chatterbox | MIT, beat ElevenLabs in vendor blind test |
| Clone a voice, personal project | XTTS v2 | 17 languages, mature tooling (non-commercial) |
| Raspberry Pi / edge / CPU only | Piper | <1 GB, ONNX, GPL-3.0, offline-first |
| Expressive/emotional, commercial | Orpheus 3B | Apache 2.0, emotion tags, real-time |
| Sound effects, laughter, creative | Bark | Generative non-speech audio, MIT |
| Research-grade cloning | F5-TTS | Fast flow-matching, few-second clone |
| Multilingual research/demo | Fish Speech S1-mini | 13 languages, zero-shot (non-commercial) |
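The same decision can be sketched as a filter over hard constraints. The catalog below takes the upper end of each approximate VRAM range from the spec table as the planning figure, treats Bark's limited cloning as "no", and counts GPL-3.0 as commercially usable (copyleft terms still apply). All names are hypothetical and the figures are the article's rough estimates, not measurements.

```python
from typing import NamedTuple


class TTSModel(NamedTuple):
    name: str
    plan_vram_gb: float  # upper end of the approximate VRAM range
    clones_voices: bool
    commercial_ok: bool  # Apache/MIT, or GPL-3.0 with copyleft obligations


CATALOG = [
    TTSModel("Kokoro-82M", 3, False, True),  # also usable on CPU
    TTSModel("Chatterbox", 6, True, True),
    TTSModel("XTTS v2", 6, True, False),
    TTSModel("Piper", 1, False, True),  # CPU-first
    TTSModel("F5-TTS", 8, True, False),
    TTSModel("Orpheus 3B", 12, True, True),
    TTSModel("Bark", 12, False, True),  # cloning is limited, treated as no
    TTSModel("Fish Speech S1-mini", 6, True, False),
]


def shortlist(vram_gb: float, needs_cloning: bool = False,
              commercial: bool = False) -> list[str]:
    """Return the models that satisfy every stated constraint, in table order."""
    return [
        m.name
        for m in CATALOG
        if m.plan_vram_gb <= vram_gb
        and (m.clones_voices or not needs_cloning)
        and (m.commercial_ok or not commercial)
    ]
```

With an 8 GB budget, `shortlist(8, needs_cloning=True, commercial=True)` leaves only Chatterbox; raising the budget to 24 GB adds Orpheus 3B.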

Key Takeaways

  1. Kokoro-82M is the best default open-source TTS in 2026 — 82M params, Apache 2.0, ~2-3 GB VRAM (or CPU), 54 voices in 8 languages, and faster than real time. No cloning, but for narration it is the easiest win.
  2. Chatterbox is the quality + license pick. Resemble AI's MIT-licensed 0.5B model does zero-shot cloning and was preferred over ElevenLabs by 65.3% to 24.5% in the company's own blind study (vendor-run, so apply a grain of salt, but striking).
  3. License is the real decision-maker. Kokoro, Chatterbox, Bark and Orpheus 3B are Apache/MIT (commercial-safe). Piper's active fork is GPL-3.0 (commercially usable but copyleft). XTTS v2 (CPML), F5-TTS weights (CC-BY-NC) and open Fish Speech (CC-BY-NC-SA-4.0) are non-commercial.
  4. Match the model to the constraint. Piper for tiny/edge devices, Orpheus 3B for expressive commercial speech, Bark for creative non-speech audio, XTTS v2/F5-TTS for personal cloning projects.
  5. Bigger is not better here. The 82M Kokoro out-competes models 10x its size on efficiency — the right axis for TTS is footprint + speed + license + cloning, not raw parameter count.

Local AI Master Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.

📅 Published: June 20, 2026🔄 Last Updated: June 20, 2026✓ Manually Reviewed

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
