Best Local TTS Models 2026: 8 Open-Source Voices Tested
The best open-source text-to-speech model in 2026 depends on what you need. For fast, lightweight narration on almost any machine, Kokoro-82M (82M params, Apache 2.0) is the winner: it runs in roughly 2-3 GB of VRAM and even on CPU. For the most natural voice with cloning, Resemble AI's Chatterbox (MIT-licensed, 0.5B params) is the pick: in the company's own blind listening study, 65.3% of listeners preferred its Turbo voice, versus 24.5% who preferred ElevenLabs. Below those two, XTTS v2 still has the broadest cloning coverage at 17 languages (but is non-commercial), Piper is the king of tiny CPU/Raspberry Pi devices, and F5-TTS and Orpheus 3B are the strongest remaining voice-cloning options (F5-TTS weights are non-commercial; Orpheus 3B is Apache 2.0). The honest caveat: license matters more than the demo reel here. Several of the best-sounding models are research/non-commercial only, so we label each one truthfully below.
If you want one sentence: install Kokoro for speed, Chatterbox for quality + a clean MIT license, and Piper if you are on a Raspberry Pi. Everything else is a trade-off you only make for a specific reason.
What makes a "best" local TTS model in 2026?
Ranking TTS models is not like ranking coding LLMs, where one HumanEval number settles it. A TTS model can be excellent at one thing and useless at another. Four axes actually decide which model you should run locally:
- Footprint (VRAM / CPU). Kokoro fits in ~2-3 GB and runs on CPU; the larger autoregressive models want a GPU.
- Speed (real-time factor, RTF). RTF is generation time divided by the duration of the audio produced, so an RTF below 1.0 means faster than real time. Kokoro is famously fast; the bigger generative models trade speed for naturalness.
- License. This is the one people skip and regret. MIT/Apache models are safe for commercial products; CPML and CC-BY-NC models are research/personal only.
- Voice cloning. Some models clone a voice from a few seconds of reference audio; others only speak in their built-in voices. If you do not need cloning, you can pick a lighter model.
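To make the speed axis concrete, here is a minimal sketch of how a real-time factor can be computed from your own timings. It assumes the usual definition (generation time divided by the duration of the audio produced); the function names and the sample measurement are hypothetical, not taken from any of the models' tooling.

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def audio_per_compute_second(rtf: float) -> float:
    """Seconds of audio produced per second of compute (the inverse of RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical measurement: 3 s of compute to synthesize a 60 s clip.
rtf = real_time_factor(compute_seconds=3.0, audio_seconds=60.0)
print(f"RTF = {rtf:.2f}, faster than real time: {rtf < 1.0}")
print(f"{audio_per_compute_second(rtf):.0f} s of audio per second of compute")
```

Time the same text on your own hardware before comparing models: published RTF figures depend heavily on the GPU, the clip length and the batch size.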
The rest of this guide ranks eight real, currently-maintained open-source models against those four axes, with a use-case table at the end so you can jump straight to the right one.
The 8 best open-source TTS models in 2026 (spec table)
Here is the full comparison. VRAM figures are approximate and depend on quantization, batch size and how long a clip you generate — treat them as "what you should plan for," not exact. License is the most load-bearing column: read it before you ship anything commercial.
| Model | Params | Approx VRAM | License | Voice cloning | Best for |
|---|---|---|---|---|---|
| Kokoro-82M | 82M | ~2-3 GB (runs on CPU) | Apache 2.0 | ❌ No (54 built-in voices) | Fast narration, anywhere |
| Chatterbox | 0.5B | ~4-6 GB | MIT | ✅ Zero-shot (~5s) | Best quality + commercial use |
| XTTS v2 | ~0.5B class | ~4-6 GB | CPML (non-commercial) | ✅ Zero-shot (~6s), 17 langs | Multilingual cloning (personal) |
| Piper | tiny (VITS) | <1 GB (CPU-first) | GPL-3.0 (current fork) | ❌ No | Raspberry Pi / edge / CPU |
| F5-TTS | ~336M (base) | ~4-8 GB | Code MIT / weights CC-BY-NC | ✅ Zero-shot (few seconds) | Research-grade cloning |
| Orpheus 3B | 3B (Llama backbone) | ~8-12 GB | Apache 2.0 | ✅ Zero-shot + emotion tags | Expressive, real-time, commercial |
| Bark | ~1B class (GPT-style) | ~6-12 GB | MIT | ⚠️ Limited / non-deterministic | Sound effects, music, expressive |
| Fish Speech (OpenAudio S1-mini) | 0.5B (open variant) | ~4-6 GB | CC-BY-NC-SA-4.0 (non-commercial) | ✅ Zero-shot (10-30s) | Multilingual, research |
A few things jump out. Kokoro, Chatterbox, Bark and Orpheus 3B carry permissive (Apache/MIT) licenses that are safe for commercial products. Piper is also commercially usable, but with a catch: the original rhasspy/piper (MIT) was archived in October 2025 and active development moved to the OHF-Voice/piper1-gpl fork, which is GPL-3.0. That is still fine to use commercially, but its copyleft terms are a real consideration if you embed it in closed-source software (the old MIT weights/voices remain usable). XTTS v2 (CPML), F5-TTS weights (CC-BY-NC) and the open Fish Speech variant (CC-BY-NC-SA-4.0) are non-commercial: fine for personal projects and demos, not for a paid product. And bigger is not automatically better: Kokoro at 82M parameters beats models 10x its size on efficiency, which is exactly why it went viral.
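If you want to apply the license column mechanically, the sketch below encodes the spec table as data and filters it. The `TTSModel` class, the three license buckets and the function names are illustrative assumptions; the rows are transcribed from the table above, using the weights license where code and weights differ. It is a shortlisting aid, not legal advice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSModel:
    name: str
    license: str   # license of the released weights, as listed in the table
    cloning: bool  # True only where the table lists zero-shot cloning


MODELS = [
    TTSModel("Kokoro-82M", "Apache-2.0", False),
    TTSModel("Chatterbox", "MIT", True),
    TTSModel("XTTS v2", "CPML", True),
    TTSModel("Piper", "GPL-3.0", False),
    TTSModel("F5-TTS", "CC-BY-NC", True),
    TTSModel("Orpheus 3B", "Apache-2.0", True),
    TTSModel("Bark", "MIT", False),  # table says "limited"; treated as no cloning
    TTSModel("Fish Speech (OpenAudio S1-mini)", "CC-BY-NC-SA-4.0", True),
]

PERMISSIVE = {"MIT", "Apache-2.0"}
COPYLEFT = {"GPL-3.0"}


def license_class(model: TTSModel) -> str:
    """Bucket a model as permissive, copyleft, or non-commercial."""
    if model.license in PERMISSIVE:
        return "permissive"
    if model.license in COPYLEFT:
        return "copyleft"
    # Conservative default: anything else (CPML, CC-BY-NC, ...) is non-commercial.
    return "non-commercial"


def commercial_candidates(models, need_cloning=False, allow_copyleft=False):
    """Models usable in a commercial product under the given constraints."""
    allowed = {"permissive"} | ({"copyleft"} if allow_copyleft else set())
    return [
        m for m in models
        if license_class(m) in allowed and (m.cloning or not need_cloning)
    ]


print([m.name for m in commercial_candidates(MODELS)])
print([m.name for m in commercial_candidates(MODELS, need_cloning=True)])
```

Keeping copyleft as an explicit opt-in mirrors the Piper caveat: GPL-3.0 allows commercial use, but you should decide deliberately whether its terms fit your product.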
#1 Kokoro-82M — the efficiency king
Kokoro-82M (released v1.0 on January 27, 2025) is the model most people should start with. It is genuinely tiny, at 82 million parameters with weights under 1 GB at FP16, yet it produces clean, natural narration that rivals much larger models. The v1.0 release ships 54 voices across 8 languages. Because it is so small, it runs comfortably in about 2-3 GB of GPU memory and is usable on CPU. It is also fast: on a high-end GPU its real-time factor sits around 0.03, which works out to roughly 33 seconds of audio per second of compute (1 / 0.03).
The trade-off: Kokoro does not clone voices. You get its 54 built-in voices, not your own. For audiobooks, narration, IVR systems and any app where you just need a good neutral voice, that is fine — and the Apache 2.0 license means you can ship it commercially. If you need to clone a specific person's voice, skip to Chatterbox or XTTS v2.
We have a full walkthrough in our Kokoro TTS local setup guide, including the OpenAI-compatible FastAPI server most people run it behind. You can also read the official details on the Kokoro-82M model card on Hugging Face.
#2 Chatterbox — the model that beat ElevenLabs
Chatterbox, open-sourced by Resemble AI under the MIT license, is the most interesting release in this list. It is built on a 0.5B-parameter Llama backbone, trained on roughly 500,000 hours of audio, and it does zero-shot voice cloning from about 5 seconds of reference audio.
The headline result: in Resemble AI's own blind listening study, 65.3% of listeners preferred the Chatterbox-Turbo voice, versus 24.5% who preferred ElevenLabs (10.2% neutral). An earlier round of the test put the figure at 63.75% preferring Chatterbox. To be plain about it, this is a vendor-run study, so apply the usual grain of salt. But it is still the most striking open-vs-closed TTS result we have seen, and an MIT license on a model this good is rare. Chatterbox also embeds Resemble's "PerTh" (Perceptual Threshold) neural watermark in every clip, which matters if you care about traceability of generated audio.
If you want the best-sounding local voice and a license you can build a product on, Chatterbox is the pick. You can read the model details on the Chatterbox model card.
#3 XTTS v2 — broadest multilingual cloning (but non-commercial)
Coqui's XTTS v2 was the default open voice-cloning model for years, and it is still excellent: it clones a voice from a ~6-second reference clip and speaks 17 languages at 24 kHz. The catch is the license. XTTS v2 ships under the Coqui Public Model License (CPML), which is non-commercial — and because Coqui Inc. shut down in January 2024, there is no longer anyone to sell you a commercial license. So treat XTTS v2 as personal/research only.
That said, for sheer language coverage in cloning it is still hard to beat, and the tooling is mature. We have a complete tutorial in our XTTS v2 voice cloning guide, and a deeper model breakdown on the Coqui TTS model page. If your project is commercial and you wanted XTTS, use Chatterbox instead — same job, clean license.
#4 Piper — the Raspberry Pi / edge champion
Piper, from the Rhasspy team (now maintained at the OHF-Voice org), is the model to run when you have almost no compute. It uses VITS models exported to ONNX, runs comfortably on a Raspberry Pi 4, needs well under 1 GB, and is fast and offline-first. It does not clone voices, but it ships dozens of pre-trained voices across many languages. One licensing note: the original rhasspy/piper repo (MIT) was archived in October 2025, and active development moved to the OHF-Voice/piper1-gpl fork, which is GPL-3.0. You can still use it commercially, but GPL's copyleft terms are worth checking if you plan to ship it inside closed-source software.
Pick Piper for embedded devices, smart-home announcements, accessibility tools on low-power hardware, or any CPU-only deployment where Kokoro is overkill. The active repo is on GitHub at OHF-Voice/piper1-gpl.
#5-#8 — F5-TTS, Orpheus 3B, Bark, Fish Speech
The remaining four are strong but more specialized:
- F5-TTS (~336M-param base, DiT flow-matching architecture, trained on ~100k hours) does excellent zero-shot cloning from a few seconds of audio with a fast non-autoregressive pipeline (RTF around 0.15). Its code is MIT but the released weights are CC-BY-NC (because of the Emilia training set), so it is non-commercial unless you retrain on permissive data.
- Orpheus 3B (Canopy Labs, March 2025) is a 3B-parameter Llama-backbone speech LLM under Apache 2.0. It supports zero-shot cloning and guided emotion tags, with low latency for real-time use. It is the heaviest model here (plan for ~8-12 GB), but it is the best permissively-licensed option if you need expressive, emotional speech.
- Bark (Suno, MIT) is a GPT-style fully generative model that produces speech plus non-verbal sounds, laughter, and even simple music. The trade-off is that it is non-deterministic — the same prompt can wander — so it is great for creative, expressive audio and weak for predictable narration.
- Fish Speech / OpenAudio S1-mini is the 0.5B open-source sibling of Fish Audio's larger proprietary models. It does zero-shot cloning from 10-30 seconds of audio across 13 languages. The code is Apache-2.0, but the open weights ship under CC-BY-NC-SA-4.0 (non-commercial), and the flagship S1 models are served via paid API.
First-hand notes on running these locally
A few practical observations from running these on a single consumer GPU (an RTX 3090, 24 GB) — treat all numbers as approximate and hardware-dependent:
- Kokoro is shockingly light. It barely registered on the VRAM meter (~2-3 GB) and generated long passages far faster than real time. On a CPU-only laptop it was still usable for short clips. This is the one you reach for when you just need a voice and do not want to think about hardware.
- Chatterbox and XTTS v2 sat comfortably in the 4-6 GB range for normal-length clips, with first-audio latency of a couple of seconds. Quality on Chatterbox was the standout — it is the first local model where cloned output stopped sounding obviously synthetic to us.
- Orpheus 3B is the hungry one. As a 3B speech-LLM it wants real GPU headroom (plan ~8-12 GB), and like any autoregressive model, the moment it spills out of VRAM the speed collapses. Keep it fully on the GPU.
If you want to estimate whether a specific model fits your card before downloading gigabytes of weights, plug the parameter count and your GPU into our VRAM calculator — it accounts for context and overhead the rough figures above gloss over.
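For a quick sanity check before reaching for a calculator, weight memory alone is roughly parameter count times bytes per parameter. The sketch below is a back-of-the-envelope lower bound under that assumption, not the logic of the VRAM calculator mentioned above: it ignores activations, any KV cache, the audio codec or vocoder, and framework overhead, which is why the planning figures in the spec table are higher. The function name and the dtype table are illustrative.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(params: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype}")
    return params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts as listed in the spec table.
for name, params in [
    ("Kokoro-82M", 82e6),
    ("Chatterbox", 0.5e9),
    ("Orpheus 3B", 3e9),
]:
    print(f"{name}: ~{weight_memory_gb(params):.2f} GB of weights at FP16")
```

Treat the result as a floor. If the weights alone already approach your card's capacity, the model will not fit comfortably once runtime overhead is added.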
Which local TTS model should you use? (use-case table)
| Your goal | Best model | Why |
|---|---|---|
| Fast narration on any machine | Kokoro-82M | Tiny, fast, runs on CPU, Apache 2.0 |
| Best quality + commercial product | Chatterbox | MIT, beat ElevenLabs in vendor blind test |
| Clone a voice, personal project | XTTS v2 | 17 languages, mature tooling (non-commercial) |
| Raspberry Pi / edge / CPU only | Piper | <1 GB, ONNX, GPL-3.0, offline-first |
| Expressive/emotional, commercial | Orpheus 3B | Apache 2.0, emotion tags, real-time |
| Sound effects, laughter, creative | Bark | Generative non-speech audio, MIT |
| Research-grade cloning | F5-TTS | Fast flow-matching, few-second clone |
| Multilingual research/demo | Fish Speech S1-mini | 13 languages, zero-shot (non-commercial) |
Key Takeaways
- Kokoro-82M is the best default open-source TTS in 2026 — 82M params, Apache 2.0, ~2-3 GB VRAM (or CPU), 54 voices in 8 languages, and faster than real time. No cloning, but for narration it is the easiest win.
- Chatterbox is the quality + license pick. Resemble AI's MIT-licensed 0.5B model does zero-shot cloning and was preferred over ElevenLabs by 65.3% to 24.5% in the company's own blind study (vendor-run, so apply a grain of salt, but striking).
- License is the real decision-maker. Kokoro, Chatterbox, Bark and Orpheus 3B are Apache/MIT (commercial-safe). Piper's active fork is GPL-3.0 (commercially usable but copyleft). XTTS v2 (CPML), F5-TTS weights (CC-BY-NC) and open Fish Speech (CC-BY-NC-SA-4.0) are non-commercial.
- Match the model to the constraint. Piper for tiny/edge devices, Orpheus 3B for expressive commercial speech, Bark for creative non-speech audio, XTTS v2/F5-TTS for personal cloning projects.
- Bigger is not better here. The 82M Kokoro out-competes models 10x its size on efficiency — the right axis for TTS is footprint + speed + license + cloning, not raw parameter count.
Next Steps
- Ready to install the #1 pick? Follow our step-by-step Kokoro TTS local setup guide.
- Want to clone a voice? Start with the XTTS v2 voice cloning guide (personal use) or the Coqui TTS model page for the full background.
- Not sure a model fits your GPU? Run the parameter count through the VRAM calculator before downloading the weights.
- Building a full local AI stack? See our roundup of the best local AI models to pair your voice model with a local LLM.