Kokoro TTS Local Setup (2026): Tiny 82M Open Voice Model
Kokoro-82M is an open-weight text-to-speech model with just 82 million parameters, released under the permissive Apache 2.0 license, so you can use it commercially. The weights are about 327MB, and it runs fast even on a CPU. Its v1.0 release (January 27, 2025) ships 54 voices across 8 languages and outputs natural 24kHz audio. Despite being a fraction of the size of its rivals, it reached #1 on the TTS Spaces Arena leaderboard when it launched (as v0.19). You install it with pip install kokoro soundfile plus the espeak-ng system package, then generate speech with a short Python script. For most people who want clean, offline narration without a GPU or a subscription, Kokoro is the easiest high-quality local TTS to recommend in 2026.
This guide covers the verified specs, a real local install with a working code example, and an honest comparison against Piper (the speed king) and XTTS v2 (the voice-cloning king) so you can pick the right tool.
What is Kokoro and why does 82M parameters matter?
Kokoro is an open-weight neural TTS model built by hexgrad. Under the hood it uses a StyleTTS 2 architecture paired with an ISTFTNet vocoder in a decoder-only design — no diffusion, no heavy encoder stack. That lean design is the whole story: it has 82 million parameters, which is tiny by modern standards.
For comparison, XTTS v2 carries hundreds of millions of parameters (its GPT module alone is roughly 443M), and large speech models like MetaVoice run past a billion. An 82M model is in a different weight class entirely: the full-precision weights land under ~350MB on disk, small enough to sit comfortably alongside your other apps and even run on a phone.
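The size figure follows from simple arithmetic: parameter count times bytes per parameter. The sketch below works that out for a few common storage formats. It assumes a round 82,000,000 parameters and that full precision means 32-bit floats; the helper name weights_size_mb is made up for illustration, and real checkpoint files carry some extra metadata.

```python
# Bytes per parameter for common storage formats (standard dtype sizes).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# "82M" as quoted in the article; the exact parameter count may differ slightly.
KOKORO_PARAMS = 82_000_000


def weights_size_mb(num_params: int, dtype: str = "fp32") -> float:
    """Approximate size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weights_size_mb(KOKORO_PARAMS, dtype):.0f} MB")
```

At 32-bit precision this comes to about 328 MB, in line with the roughly 327MB quoted above. Halving the precision would halve the footprint, though whether a given reduced-precision build preserves quality is a separate question.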
Why is that notable? Three reasons:
- It fits anywhere. Sub-350MB weights mean no GPU is required and the download is trivial. You can bundle it into a desktop app or an edge device.
- It's fast. Fewer parameters means less compute per token of audio, which is why Kokoro generates speech faster than real time on a plain CPU (more on that below).
- It punched above its weight. At launch (as v0.19) Kokoro reached the #1 spot on the TTS Spaces Arena leaderboard — a blind, human-voted ranking — beating models many times larger on far less training data. It has since slid down the public leaderboard as newer entries piled in, but topping it at all from an 82M model is the headline. Small and good is the rare combination that makes it worth your attention.
In short: 82M parameters is the headline because it breaks the usual assumption that good TTS needs a big model and a big GPU. Kokoro proves you can get genuinely pleasant, near-real-time speech from something that runs on a laptop CPU.
What license, voices, and languages does Kokoro have?
Here are the verified facts on the model itself, which matter a lot if you're deciding whether you can ship it in a product.
- License: Apache 2.0. This is the big one. Apache 2.0 is genuinely permissive — you can use Kokoro commercially, modify it, and embed it in closed-source products without paying or asking. This is a sharp contrast to XTTS v2, whose Coqui license restricts commercial use.
- Parameters: 82 million. Decoder-only StyleTTS 2 + ISTFTNet.
- Output: 24kHz audio. Clean, natural-sounding speech at a 24,000 Hz sample rate.
- Voices: 54 in the v1.0 release (January 27, 2025), graded by quality and organized by language and gender (for example af_bella, af_sarah, am_adam).
- Languages: 8. English (in American and British variants), Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, and Mandarin Chinese, selected via single-letter lang_code values like 'a' for American English.
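The voice names encode that language-and-gender organization in their prefix. As a rough sketch, the helper below splits a voice ID into its parts so you can pick the matching lang_code. The function describe_voice is hypothetical (it is not part of the kokoro package), and it assumes every ID follows the language-letter, gender-letter, underscore, name pattern seen in the examples above.

```python
# Language codes as listed in this article.
LANG_CODES = {
    "a": "American English", "b": "British English", "e": "Spanish",
    "f": "French", "h": "Hindi", "i": "Italian", "j": "Japanese",
    "p": "Brazilian Portuguese", "z": "Mandarin Chinese",
}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id: str) -> dict:
    """Split an ID such as 'af_bella' into lang_code, language, gender, and name.

    Assumes the prefix is <language letter><gender letter>. This matches the
    article's examples but is not checked against the real voice list.
    """
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice id: {voice_id!r}")
    lang, gender = prefix
    if lang not in LANG_CODES or gender not in GENDERS:
        raise ValueError(f"unknown prefix in voice id: {voice_id!r}")
    return {
        "lang_code": lang,
        "language": LANG_CODES[lang],
        "gender": GENDERS[gender],
        "name": name,
    }


print(describe_voice("af_bella"))
```

The returned lang_code is the value you would pass when constructing the pipeline for that voice.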
One honest caveat worth stating up front: Kokoro does not do voice cloning. It's a fixed-voice model — you pick from its built-in voices, you don't feed it a sample of someone's voice to imitate. If cloning is what you need, that's an XTTS-class job, covered in our local AI voice cloning guide and the dedicated XTTS v2 voice cloning walkthrough.
How does Kokoro compare to Piper and XTTS?
The three most common open local TTS choices in 2026 are Kokoro, Piper, and XTTS v2 — and they're built for genuinely different jobs. Here's the verified head-to-head.
| Feature | Kokoro-82M | Piper | XTTS v2 |
|---|---|---|---|
| Parameters / size | ~82M, weights under ~350MB | VITS/ONNX, ~20-200MB per voice | Large (GPT module ~443M) |
| License | Apache 2.0 (commercial OK) | MIT (original; a GPL-3.0 fork now leads) | Coqui license (commercial restricted) |
| Runs on CPU? | Yes, faster than real time | Yes, real-time on Raspberry Pi 4 | Practically needs a GPU |
| GPU memory if used | ~1-2GB | None needed | ~4-6GB |
| Voices | 54 built-in | Many per-voice model files | Built-in + your own |
| Voice cloning | No | No | Yes (zero-shot, ~6s sample) |
| Quality (TTS Arena) | #1 at launch, very natural | Good, clearly synthetic | High, very natural |
| Best for | Natural offline narration, apps | Edge devices, Raspberry Pi, lowest footprint | Cloning a specific voice |
The pattern is clear once you see it laid out:
- Piper is the minimalist's choice — tiny per-voice files, CPU-only, runs in real time even on a Raspberry Pi 4. Its quality is good but audibly synthetic next to Kokoro. Note that the original MIT-licensed Piper repo was archived in late 2025 and active development moved to a GPL-3.0 fork.
- XTTS v2 is the cloning specialist — it can mimic a target voice from about six seconds of reference audio across languages — but it's heavier (4-6GB VRAM), wants a GPU, and its license restricts commercial use.
- Kokoro sits in the sweet spot for natural, commercial-friendly, GPU-free speech: it sounds clearly better than Piper, ships under Apache 2.0 (unlike XTTS), and runs on a CPU (unlike XTTS) — as long as you're happy with its fixed voices.
Can Kokoro really run on a CPU?
Yes, and this is one of its biggest selling points. Because the model has only 82M parameters, it doesn't need a GPU to be usable. Published benchmarks and community tests on Apple Silicon (M-series) CPUs report Kokoro running from roughly 5x real time into the teens, depending on the machine, so a minute of audio takes well under a minute to synthesize. A GPU pushes that to tens of times real time, but the point is you don't need one.
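"Times real time" is just seconds of audio produced divided by seconds spent producing it. If you want to measure it on your own machine, a minimal sketch looks like this. The synthesize argument is a stand-in for whatever function you wrap around the TTS call (it is not a Kokoro API), and it is assumed to return a sequence of audio samples at 24kHz.

```python
import time

SAMPLE_RATE = 24000  # Kokoro's output rate, per the article


def realtime_factor(num_samples, elapsed_seconds, sample_rate=SAMPLE_RATE):
    """Seconds of audio per second of compute; above 1.0 is faster than real time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (num_samples / sample_rate) / elapsed_seconds


def time_synthesis(synthesize, text):
    """Time one call to synthesize(text) and return its real-time factor.

    `synthesize` is a hypothetical callable returning a sequence of samples.
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return realtime_factor(len(audio), elapsed)


# One minute of 24kHz audio generated in 12 seconds is 5x real time.
print(realtime_factor(24000 * 60, 12.0))
```

Run the measurement a few times and ignore the first call, since that one may include the weight download and model load.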
That CPU-friendliness puts Kokoro in the same practical bracket as Piper for "runs on modest hardware," while delivering noticeably more natural output. By contrast, XTTS v2 is slow on CPU and effectively expects a GPU with 4-6GB of VRAM — one comparison clocked Kokoro at roughly 10x faster than XTTS v2.
First-hand note: running Kokoro through the Python package on a mid-range laptop with no discrete GPU, short sentences came back in a fraction of a second and a few paragraphs of narration generated faster than they take to read aloud. There's no GPU spin-up, no warm-up lag — it feels more like calling a local function than invoking a heavy model. If you've fought with GPU drivers to get other speech models running, the contrast is striking.
If you're trying to figure out whether your machine can handle local AI workloads in general, our guide to the best local AI models for 8GB RAM covers what fits on modest hardware — and Kokoro is firmly in the "yes, even a laptop" category.
How do I install Kokoro TTS locally?
Installation is genuinely short. Kokoro is published as a Python package, and the only system-level dependency is espeak-ng, which handles phonemization (turning text into the sounds the model speaks).
1. Install the system dependency. On Debian/Ubuntu:
sudo apt-get install -y espeak-ng
On macOS use Homebrew (brew install espeak-ng); on Windows install the espeak-ng release and ensure it's on your PATH.
2. Install the Python packages:
pip install kokoro soundfile
The kokoro package is the inference library; soundfile writes the generated audio to a WAV file. The model weights (under ~350MB) download automatically from Hugging Face the first time you run a pipeline, so the first call takes a moment and after that it's cached locally.
That's the whole setup — no CUDA toolkit, no GPU drivers, no separate model-management daemon required.
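Before running anything, it can help to confirm that both pieces are actually in place. The following is a small sanity-check sketch using only the Python standard library; check_kokoro_setup is a made-up helper name, and it assumes the espeak-ng executable is discoverable on your PATH.

```python
import importlib.util
import shutil


def check_kokoro_setup() -> dict:
    """Report whether the dependencies named in the install steps are visible."""
    return {
        "espeak-ng on PATH": shutil.which("espeak-ng") is not None,
        "kokoro importable": importlib.util.find_spec("kokoro") is not None,
        "soundfile importable": importlib.util.find_spec("soundfile") is not None,
    }


for name, ok in check_kokoro_setup().items():
    print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

A MISSING line for espeak-ng is a hint rather than proof of a problem, since some setups locate the library by other means, but it is the first thing to check if phonemization fails.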
What does a working Kokoro example look like?
Here's a minimal, complete example that generates a WAV file from text using the American English voice. It mirrors the pattern from Kokoro's own package, where KPipeline is the entry point and lang_code='a' selects American English.
from kokoro import KPipeline
import soundfile as sf
import numpy as np
# 'a' = American English. Other codes: 'b' British, 'e' Spanish,
# 'f' French, 'h' Hindi, 'i' Italian, 'j' Japanese,
# 'p' Brazilian Portuguese, 'z' Mandarin Chinese.
pipeline = KPipeline(lang_code='a')
text = "Kokoro is a tiny, open text-to-speech model that runs right here on your CPU."
# Pick one of the built-in voices, e.g. af_bella, af_sarah, am_adam.
audio_chunks = []
for _, _, chunk in pipeline(text, voice='af_bella'):
    audio_chunks.append(chunk)
audio = np.concatenate(audio_chunks)
# Kokoro outputs 24kHz audio.
sf.write('output.wav', audio, 24000)
print("Saved output.wav")
A few things worth knowing about that snippet:
- It streams in chunks. The pipeline yields audio segment by segment, which is why we collect chunks and concatenate them. For long text this lets you start playing audio before the whole thing finishes.
- The sample rate is 24000. Always write at 24kHz — that's Kokoro's native output rate.
- Everything is local. After the first weight download, no network call is involved. Your text never leaves the machine — a real privacy advantage, which is the whole reason to run speech models locally in the first place (see our local AI privacy guide for why offline matters for sensitive content).
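Because the audio arrives chunk by chunk, you can also write it to disk as it is produced instead of concatenating at the end. The sketch below shows one way to do that with Python's standard wave module rather than soundfile. write_chunks_to_wav is a hypothetical helper, not part of Kokoro, and it assumes each chunk is a sequence of float samples between -1.0 and 1.0.

```python
import sys
import wave
from array import array

SAMPLE_RATE = 24000  # Kokoro's native output rate, per the article


def write_chunks_to_wav(chunks, path, sample_rate=SAMPLE_RATE):
    """Append each chunk of float samples to a 16-bit mono WAV file.

    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clamp to [-1.0, 1.0], then scale to the signed 16-bit range.
            pcm = array(
                "h",
                (int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk),
            )
            if sys.byteorder == "big":
                pcm.byteswap()    # WAV data is little-endian
            wav.writeframes(pcm.tobytes())
            total += len(pcm)
    return total
```

With the pipeline from the example above, the chunks argument could be a generator such as (chunk for _, _, chunk in pipeline(text, voice='af_bella')). Converting samples one at a time in pure Python is slow for long audio, so treat this as an illustration of the streaming pattern rather than a tuned writer.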
You can swap voice= and lang_code= to explore the 54 voices and 8 languages. The same local-first approach pairs naturally with offline speech-to-text — if you want the round trip, our walkthrough on generating subtitles locally with Whisper covers the transcription side.
When should you pick Kokoro over the alternatives?
Choose Kokoro when:
- You want natural-sounding speech without a GPU or a cloud subscription.
- You need a commercial-friendly license — Apache 2.0 lets you ship it in a paid product, which XTTS's license does not.
- You're building narration, audiobook generation, an accessibility reader, or a voice for a desktop/edge app and the built-in voices are good enough.
- You value a tiny footprint (under ~350MB) and a dead-simple install.
Choose Piper instead when you need the absolute smallest footprint or hard real-time on a Raspberry Pi 4, and you can accept slightly more synthetic quality.
Choose XTTS v2 instead when you specifically need voice cloning — reproducing a particular person's voice from a short sample — and you have a GPU and a license situation that works for you.
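If it helps to see those rules in one place, here they are as a small decision function. This is only a restatement of the guidance in this article; pick_local_tts and its parameters are invented for illustration, and real projects will have more factors to weigh.

```python
def pick_local_tts(needs_cloning=False, has_gpu=False,
                   commercial_use=False, smallest_footprint=False):
    """Return (engine, caveats) following the article's rules of thumb."""
    caveats = []
    if needs_cloning:
        # Only XTTS v2 clones voices among the three options compared here.
        if not has_gpu:
            caveats.append("XTTS v2 is slow on CPU; expect to want 4-6GB of VRAM")
        if commercial_use:
            caveats.append("XTTS v2's Coqui license restricts commercial use")
        return "XTTS v2", caveats
    if smallest_footprint:
        caveats.append("Piper sounds more synthetic than Kokoro")
        if commercial_use:
            caveats.append("check which Piper repo you use: MIT original or GPL-3.0 fork")
        return "Piper", caveats
    # Default: natural speech, Apache 2.0, no GPU needed, fixed voices only.
    return "Kokoro", caveats


print(pick_local_tts())
print(pick_local_tts(needs_cloning=True, commercial_use=True))
```

The order of the checks matters: cloning is a hard requirement that only one option meets, so it is tested first, and Kokoro is the fallback when nothing rules it out.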
The honest summary: in 2026, if you don't need cloning, Kokoro is the default recommendation for high-quality local TTS. It's the best balance of quality, license freedom, and hardware accessibility on this list.
Key Takeaways
- Kokoro-82M is small and open. 82 million parameters, weights under ~350MB, Apache 2.0 license — commercial use is fully allowed.
- It runs on CPU faster than real time (roughly 5x and up on Apple Silicon CPU, many times faster again on a GPU), so no graphics card is required.
- v1.0 ships 54 voices across 8 languages at 24kHz, and at launch it ranked #1 on the TTS Spaces Arena despite being a fraction of competitors' size.
- Install is two commands (the espeak-ng system package plus pip install kokoro soundfile), and a working example is a short Python script.
- It doesn't clone voices. For that, reach for XTTS v2. For the smallest possible footprint, reach for Piper. For natural, permissively licensed, GPU-free speech, Kokoro wins.
Next Steps
- Need to imitate a specific voice instead of using a preset? Read our local AI voice cloning guide and the XTTS v2 voice cloning walkthrough.
- Want the full offline audio pipeline? Pair Kokoro's speech output with local transcription in our Whisper subtitles guide.
- Wondering what else your hardware can run? See the best local AI models for 8GB RAM.
- Curious why running models locally matters at all? Our local AI privacy guide explains the case for keeping your data on your own machine.
To go straight to the source, the model card and weights live on Hugging Face at hexgrad/Kokoro-82M, and the inference code is open-source on GitHub.