Kokoro TTS Local Setup (2026): Tiny 82M Open Voice Model

June 20, 2026
10 min
Local AI Master Research Team

Kokoro-82M is an open-weight text-to-speech model with just 82 million parameters, released under the permissive Apache 2.0 license, so commercial use is allowed. The weights are about 327MB, and it runs fast even on a CPU. Its v1.0 release (January 27, 2025) ships 54 voices across 8 languages and outputs natural 24kHz audio. Despite being a fraction of the size of its rivals, it reached #1 on the TTS Spaces Arena leaderboard when it launched (as v0.19). You install it with pip install kokoro soundfile plus the espeak-ng system package, then generate speech in about five lines of Python. For most people who want clean, offline narration without a GPU or a subscription, Kokoro is the easiest high-quality local TTS to recommend in 2026.

This guide covers the verified specs, a real local install with a working code example, and an honest comparison against Piper (the speed king) and XTTS v2 (the voice-cloning king) so you can pick the right tool.

What is Kokoro and why does 82M parameters matter?

Kokoro is an open-weight neural TTS model built by hexgrad. Under the hood it uses a StyleTTS 2 architecture paired with an ISTFTNet vocoder in a decoder-only design — no diffusion, no heavy encoder stack. That lean design is the whole story: it has 82 million parameters, which is tiny by modern standards.

For comparison, XTTS v2 carries hundreds of millions of parameters (its GPT module alone is roughly 443M), and large speech models like MetaVoice run past a billion. An 82M model is in a different weight class entirely: the full-precision (FP32) weights land under ~350MB on disk, small enough to sit comfortably alongside your other apps and even run on a phone.
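
The size claim is easy to sanity-check with arithmetic. The sketch below assumes the weights are stored as 32-bit floats (4 bytes per parameter) and ignores file-format overhead. The 443M figure is only the XTTS v2 GPT module quoted above, not that model's full size, and weights_size_mb is a hypothetical helper written for this illustration.

```python
def weights_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate on-disk size in MB (10**6 bytes), ignoring file overhead."""
    return num_params * bytes_per_param / 1e6


# Assumption: FP32 storage, i.e. 4 bytes per parameter.
kokoro_mb = weights_size_mb(82_000_000)
xtts_gpt_mb = weights_size_mb(443_000_000)  # GPT module only

print(f"Kokoro-82M:         ~{kokoro_mb:.0f} MB")
print(f"XTTS v2 GPT module: ~{xtts_gpt_mb:.0f} MB")
```

Under those assumptions the Kokoro estimate comes out at roughly 328MB, which lines up with the ~327MB download.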

Why is that notable? Three reasons:

  1. It fits anywhere. Sub-350MB weights mean no GPU is required and the download is trivial. You can bundle it into a desktop app or an edge device.
  2. It's fast. Fewer parameters means less compute per token of audio, which is why Kokoro generates speech faster than real time on a plain CPU (more on that below).
  3. It punched above its weight. At launch (as v0.19) Kokoro reached the #1 spot on the TTS Spaces Arena leaderboard — a blind, human-voted ranking — beating models many times larger on far less training data. It has since slid down the public leaderboard as newer entries piled in, but topping it at all from an 82M model is the headline. Small and good is the rare combination that makes it worth your attention.

In short: 82M parameters is the headline because it breaks the usual assumption that good TTS needs a big model and a big GPU. Kokoro proves you can get genuinely pleasant, near-real-time speech from something that runs on a laptop CPU.

What license, voices, and languages does Kokoro have?

Here are the verified facts on the model itself, which matter a lot if you're deciding whether you can ship it in a product.

  • License: Apache 2.0. This is the big one. Apache 2.0 is genuinely permissive: you can use Kokoro commercially, modify it, and embed it in closed-source products without paying or asking, provided you keep the license and attribution notices. This is a sharp contrast to XTTS v2, whose Coqui license restricts commercial use.
  • Parameters: 82 million. Decoder-only StyleTTS 2 + ISTFTNet.
  • Output: 24kHz audio. Clean, natural-sounding speech at a 24,000 Hz sample rate.
  • Voices: 54 in the v1.0 release (January 27, 2025), graded by quality and organized by language and gender (for example af_bella, af_sarah, am_adam).
  • Languages: 8. English (American and British), Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, and Mandarin Chinese. American and British English have separate codes, so there are nine single-letter lang_code values (for example 'a' for American English).
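
The voice names appear to follow a two-letter prefix convention: the first letter matches the lang_code and the second marks gender. The sketch below decodes a name on that assumption. The convention is inferred from the example names above rather than taken from Kokoro's documentation, and describe_voice is a hypothetical helper, not part of the kokoro package.

```python
# lang_code values as listed in this article.
LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}

# Assumption: second prefix letter is 'f' (female) or 'm' (male).
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice: str) -> dict:
    """Split a name like 'af_bella' into language, gender, and speaker name."""
    prefix, _, name = voice.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice name: {voice!r}")
    lang_code, gender_code = prefix
    return {
        "lang_code": lang_code,
        "language": LANG_CODES.get(lang_code, "unknown"),
        "gender": GENDERS.get(gender_code, "unknown"),
        "name": name,
    }


for voice in ("af_bella", "af_sarah", "am_adam"):
    print(voice, "->", describe_voice(voice))
```

A helper like this is mainly useful for picking the lang_code that matches a chosen voice, so the two settings stay in sync.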

One honest caveat worth stating up front: Kokoro does not do voice cloning. It's a fixed-voice model — you pick from its built-in voices, you don't feed it a sample of someone's voice to imitate. If cloning is what you need, that's an XTTS-class job, covered in our local AI voice cloning guide and the dedicated XTTS v2 voice cloning walkthrough.

How does Kokoro compare to Piper and XTTS?

The three most common open local TTS choices in 2026 are Kokoro, Piper, and XTTS v2 — and they're built for genuinely different jobs. Here's the verified head-to-head.

Feature | Kokoro-82M | Piper | XTTS v2
Parameters / size | ~82M, weights under ~350MB | VITS/ONNX, ~20-200MB per voice | Large (GPT module ~443M)
License | Apache 2.0 (commercial OK) | MIT (original; a GPL-3.0 fork now leads) | Coqui license (commercial restricted)
Runs on CPU? | Yes, faster than real time | Yes, real-time on Raspberry Pi 4 | Practically needs a GPU
GPU memory if used | ~1-2GB | None needed | ~4-6GB
Voices | 54 built-in | Many per-voice model files | Built-in + your own
Voice cloning | No | No | Yes (zero-shot, ~6s sample)
Quality (TTS Arena) | #1 at launch, very natural | Good, clearly synthetic | High, very natural
Best for | Natural offline narration, apps | Edge devices, Raspberry Pi, lowest footprint | Cloning a specific voice

The pattern is clear once you see it laid out:

  • Piper is the minimalist's choice — tiny per-voice files, CPU-only, runs in real time even on a Raspberry Pi 4. Its quality is good but audibly synthetic next to Kokoro. Note that the original MIT-licensed Piper repo was archived in late 2025 and active development moved to a GPL-3.0 fork.
  • XTTS v2 is the cloning specialist — it can mimic a target voice from about six seconds of reference audio across languages — but it's heavier (4-6GB VRAM), wants a GPU, and its license restricts commercial use.
  • Kokoro sits in the sweet spot for natural, commercial-friendly, GPU-free speech: it sounds clearly better than Piper, ships under Apache 2.0 (unlike XTTS), and runs on a CPU (unlike XTTS) — as long as you're happy with its fixed voices.

Can Kokoro really run on a CPU?

Yes, and this is one of its biggest selling points. Because the model has only 82M parameters, it doesn't need a GPU to be usable. Community tests on Apple Silicon M-series CPUs report speeds from roughly 5x real time up to the teens, depending on the machine, so a minute of audio takes about 12 seconds or less to synthesize. A GPU pushes that to tens of times real time, but the point is that you don't need one.
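
"Times real time" is simply the duration of the generated audio divided by the wall-clock time taken to generate it. The sketch below shows one way to measure it on your own machine. Both function names are hypothetical, and synthesize stands in for any callable that returns a one-dimensional sequence of samples; how you wrap Kokoro into such a callable is left as an assumption.

```python
import time

SAMPLE_RATE = 24_000  # Kokoro's output rate, per the specs above


def realtime_factor(num_samples: int, elapsed_seconds: float,
                    sample_rate: int = SAMPLE_RATE) -> float:
    """Audio duration divided by the time it took to generate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (num_samples / sample_rate) / elapsed_seconds


def time_synthesis(synthesize, text: str) -> float:
    """Time one call to `synthesize` and return its real-time factor."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return realtime_factor(len(audio), elapsed)


# One minute of 24kHz audio generated in 6 seconds is 10x real time.
print(realtime_factor(60 * SAMPLE_RATE, 6.0))
```

Exclude the first run when benchmarking, since it can include the one-time weight download and model load.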

That CPU-friendliness puts Kokoro in the same practical bracket as Piper for "runs on modest hardware," while delivering noticeably more natural output. By contrast, XTTS v2 is slow on CPU and effectively expects a GPU with 4-6GB of VRAM — one comparison clocked Kokoro at roughly 10x faster than XTTS v2.

First-hand note: running Kokoro through the Python package on a mid-range laptop with no discrete GPU, short sentences came back in a fraction of a second and a few paragraphs of narration generated faster than they take to read aloud. There's no GPU spin-up, no warm-up lag — it feels more like calling a local function than invoking a heavy model. If you've fought with GPU drivers to get other speech models running, the contrast is striking.

If you're trying to figure out whether your machine can handle local AI workloads in general, our guide to the best local AI models for 8GB RAM covers what fits on modest hardware — and Kokoro is firmly in the "yes, even a laptop" category.

How do I install Kokoro TTS locally?

Installation is genuinely short. Kokoro is published as a Python package, and the only system-level dependency is espeak-ng, which handles phonemization (turning text into the sounds the model speaks).

1. Install the system dependency. On Debian/Ubuntu:

sudo apt-get install -y espeak-ng

On macOS use Homebrew (brew install espeak-ng); on Windows install the espeak-ng release and ensure it's on your PATH.

2. Install the Python packages:

pip install kokoro soundfile

The kokoro package is the inference library; soundfile writes the generated audio to a WAV file. The model weights (under ~350MB) download automatically from Hugging Face the first time you run a pipeline, so the first call takes a moment and after that it's cached locally.

That's the whole setup — no CUDA toolkit, no GPU drivers, no separate model-management daemon required.
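
Before running the first synthesis, a quick pre-flight check can confirm that the three pieces are in place. This is a minimal sketch using only the standard library; check_kokoro_setup is a hypothetical helper, and it assumes espeak-ng is expected on PATH. Treat a missing binary as a hint rather than proof of a broken install, since some setups locate espeak-ng another way.

```python
import importlib.util
import shutil


def check_kokoro_setup() -> dict:
    """Report whether each prerequisite from the install steps is visible."""
    return {
        "espeak-ng on PATH": shutil.which("espeak-ng") is not None,
        "kokoro importable": importlib.util.find_spec("kokoro") is not None,
        "soundfile importable": importlib.util.find_spec("soundfile") is not None,
    }


for item, ok in check_kokoro_setup().items():
    print(f"{'OK     ' if ok else 'MISSING'} {item}")
```

The check only looks for the packages; it does not import them, so it runs quickly and does not trigger the weight download.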

What does a working Kokoro example look like?

Here's a minimal, complete example that generates a WAV file from text using the American English voice. It mirrors the pattern from Kokoro's own package, where KPipeline is the entry point and lang_code='a' selects American English.

from kokoro import KPipeline
import soundfile as sf
import numpy as np

# 'a' = American English. Other codes: 'b' British, 'e' Spanish,
# 'f' French, 'h' Hindi, 'i' Italian, 'j' Japanese,
# 'p' Brazilian Portuguese, 'z' Mandarin Chinese.
pipeline = KPipeline(lang_code='a')

text = "Kokoro is a tiny, open text-to-speech model that runs right here on your CPU."

# Pick one of the built-in voices, e.g. af_bella, af_sarah, am_adam.
audio_chunks = []
for _, _, chunk in pipeline(text, voice='af_bella'):
    audio_chunks.append(chunk)

audio = np.concatenate(audio_chunks)

# Kokoro outputs 24kHz audio.
sf.write('output.wav', audio, 24000)
print("Saved output.wav")

A few things worth knowing about that snippet:

  • It streams in chunks. The pipeline yields audio segment by segment, which is why we collect chunks and concatenate them. For long text this lets you start playing audio before the whole thing finishes.
  • The sample rate is 24000. Always write at 24kHz — that's Kokoro's native output rate.
  • Everything is local. After the first weight download, no network call is involved. Your text never leaves the machine — a real privacy advantage, which is the whole reason to run speech models locally in the first place (see our local AI privacy guide for why offline matters for sensitive content).
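
Because the pipeline yields chunks one at a time, you can also write each chunk to disk as it arrives instead of holding the whole recording in memory. The sketch below is a swapped-in alternative to the soundfile approach above: it uses Python's standard wave module and writes 16-bit PCM. write_wav_stream is a hypothetical helper, and it assumes each chunk is a one-dimensional sequence of floats between -1.0 and 1.0.

```python
import array
import sys
import wave


def write_wav_stream(chunks, path, sample_rate=24_000):
    """Append mono float chunks to a 16-bit WAV file as they arrive.

    Returns the total duration written, in seconds.
    """
    total_samples = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # NumPy arrays and torch tensors both expose .tolist().
            samples = chunk.tolist() if hasattr(chunk, "tolist") else list(chunk)
            # Clamp to [-1, 1], then scale to the signed 16-bit range.
            pcm = array.array(
                "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
            )
            if sys.byteorder == "big":
                pcm.byteswap()  # WAV sample data is little-endian
            wav.writeframes(pcm.tobytes())
            total_samples += len(pcm)
    return total_samples / sample_rate
```

With Kokoro, the chunks argument could be a generator such as (chunk for _, _, chunk in pipeline(text, voice='af_bella')). The per-sample conversion here is plain Python and will be slow on long recordings, so for anything beyond a sketch the soundfile route is the more practical choice.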

You can swap voice= and lang_code= to explore the 54 voices and 8 languages. The same local-first approach pairs naturally with offline speech-to-text — if you want the round trip, our walkthrough on generating subtitles locally with Whisper covers the transcription side.

When should you pick Kokoro over the alternatives?

Choose Kokoro when:

  • You want natural-sounding speech without a GPU or a cloud subscription.
  • You need a commercial-friendly license — Apache 2.0 lets you ship it in a paid product, which XTTS's license does not.
  • You're building narration, audiobook generation, an accessibility reader, or a voice for a desktop/edge app and the built-in voices are good enough.
  • You value a tiny footprint (under ~350MB) and a dead-simple install.

Choose Piper instead when you need the absolute smallest footprint or hard real-time on a Raspberry Pi 4, and you can accept slightly more synthetic quality.

Choose XTTS v2 instead when you specifically need voice cloning — reproducing a particular person's voice from a short sample — and you have a GPU and a license situation that works for you.

The honest summary: in 2026, if you don't need cloning, Kokoro is the default recommendation for high-quality local TTS. It's the best balance of quality, license freedom, and hardware accessibility on this list.
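
The decision rules above can be written down as a small function. This is only a restatement of this article's recommendations, not a benchmark-driven selector, and recommend_tts and its parameters are hypothetical names.

```python
def recommend_tts(needs_cloning: bool = False, has_gpu: bool = False,
                  commercial_use: bool = False,
                  tiny_footprint: bool = False) -> tuple:
    """Return (model, caveats) following the rules of thumb in this guide."""
    caveats = []
    if needs_cloning:
        # Only XTTS v2 clones voices among the three options.
        if not has_gpu:
            caveats.append("XTTS v2 practically needs a GPU (about 4-6GB VRAM)")
        if commercial_use:
            caveats.append("the Coqui license restricts commercial use")
        return "XTTS v2", caveats
    if tiny_footprint:
        caveats.append("expect more synthetic-sounding output than Kokoro")
        return "Piper", caveats
    caveats.append("fixed built-in voices only, no cloning")
    return "Kokoro", caveats


print(recommend_tts())
print(recommend_tts(needs_cloning=True, commercial_use=True))
print(recommend_tts(tiny_footprint=True))
```

Cloning is checked first because it is the one requirement that rules out both Kokoro and Piper outright.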

Key Takeaways

  1. Kokoro-82M is small and open. 82 million parameters, weights under ~350MB, Apache 2.0 license — commercial use is fully allowed.
  2. It runs on CPU faster than real time (roughly 5x and up on Apple Silicon CPU, many times faster again on a GPU), so no graphics card is required.
  3. v1.0 ships 54 voices across 8 languages at 24kHz, and at launch it ranked #1 on the TTS Spaces Arena despite being a fraction of competitors' size.
  4. Install is two commands: espeak-ng plus pip install kokoro soundfile. A working example is about five lines of Python.
  5. It doesn't clone voices. For that, reach for XTTS v2. For the smallest possible footprint, reach for Piper. For natural, permissively licensed, GPU-free speech, Kokoro wins.

Next Steps

To go straight to the source, the model card and weights live on Hugging Face at hexgrad/Kokoro-82M, and the inference code is open-source on GitHub.

Published: June 20, 2026 | Last Updated: June 20, 2026 | Manually Reviewed
