Orpheus TTS is Canopy Labs' open-source (Apache 2.0) text-to-speech model built on a Llama-3B backbone, released in March 2025 as canopylabs/orpheus-3b-0.1-ft. To run it locally, pull a Q4_K_M or Q8_0 GGUF (the Q8_0 weights are about 4 GB) and serve it through a llama.cpp backend; at runtime it fits in roughly 8 GB of VRAM. The model supports inline emotion tags like <laugh> and <sigh>, streams at about 200 ms latency, and does zero-shot voice cloning. The fastest path for most people is to load the GGUF in LM Studio or llama.cpp, then put the Orpheus-FastAPI server in front of it to get an OpenAI-compatible /v1/audio/speech endpoint with eight English voices. Unlike the tiny Kokoro (82M) or Resemble AI's Chatterbox (0.5B), Orpheus is a full 3B speech-LLM, so it trades VRAM for the most human-sounding, emotionally expressive output of the three.

If you have ever wanted a local voice that can actually laugh mid-sentence, sigh, or trail off naturally — instead of the flat, robotic delivery most open TTS gives you — Orpheus is the model to set up. This guide covers the real install, the honest VRAM numbers, how the emotion tags and voice cloning work, and exactly when Kokoro or Chatterbox is the smarter pick.

What is Orpheus TTS and why does it sound human?

Orpheus is not a classic TTS pipeline. It is a speech-LLM: an autoregressive model with a Llama-3B backbone that predicts audio tokens (decoded by the SNAC neural codec) the same way a chat model predicts text tokens. Canopy Labs' thesis was that if you give a capable LLM the right speech tokenizer and a large curated speech corpus, it learns prosody, rhythm and emotion the way it learns grammar. That architecture is why Orpheus can place a laugh or a sigh in a sentence convincingly — the expressiveness is learned, not bolted on with post-processing.

The released lineup as of mid-2026:

Finetuned production model — canopylabs/orpheus-3b-0.1-ft (3B), the one almost everyone runs.
Pretrained base — a 3B base checkpoint for your own fine-tuning.
Multilingual family — additional language pairs published after the English launch.
Smaller sizes (1B / 400M / 150M) appear on Canopy Labs' roadmap but, as of this writing, the 3B is the practical local model — treat the smaller sizes as "announced, not yet your daily driver."

It ships under Apache 2.0, which is genuinely permissive for commercial use. For a wider survey of the open field, see our roundup of the best local TTS models.

How much VRAM does Orpheus TTS need? (~8GB)

Here is the honest part, because VRAM is what decides whether this fits your machine. The original orpheus-speech pip package leans on vLLM and is happiest on a 16GB+ card. But the community quickly produced GGUF quants, and those are what bring Orpheus down to consumer hardware. The Q8_0 GGUF weights are about 4 GB on disk; add the SNAC audio decoder and a working KV cache and you land in the ~8 GB neighborhood at runtime.

Path	Quant / format	Approx VRAM	Best for
GGUF via llama.cpp	Q4_K_M	~6-7 GB	8GB cards (RTX 3060/4060), tight setups
GGUF via llama.cpp	Q8_0 (~4 GB weights)	~8 GB	Best quality/size balance, 8-12GB cards
Native vLLM	FP16	~16 GB+	16-24GB cards, lowest latency streaming
Both models co-resident	quantized	~24 GB	SNAC + LLM together with headroom

Practical fit guide: an 8GB GPU runs the Q4_K_M or Q8_0 GGUF fine through llama.cpp or LM Studio. A 12GB card gives you comfortable headroom for longer context and faster generation. If you want the official vLLM streaming path at FP16, plan on 16GB+. To size a specific quant against your exact card, run the numbers through our VRAM calculator.
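The ~8 GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement. Several inputs are assumptions that do not come from this article: the parameter count (about 3.8 billion, counting the extended audio vocabulary), the bits-per-weight values for each quant, and the attention shape of the underlying Llama 3B model (28 layers, 8 KV heads, head dimension 128). Check them against the model card before relying on the output.

```python
GB = 1e9  # decimal gigabytes, to match on-disk file sizes

# Approximate bits per weight for two llama.cpp quant formats (assumed values).
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.85}


def weight_bytes(n_params: float, quant: str) -> float:
    """Bytes needed to hold the quantized weights."""
    return n_params * BITS_PER_WEIGHT[quant] / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Bytes for keys plus values across all layers at a given context length."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


if __name__ == "__main__":
    n_params = 3.8e9  # assumed parameter count, not taken from the article
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{weight_bytes(n_params, quant) / GB:.1f} GB of weights")
    kv = kv_cache_bytes(n_layers=28, n_kv_heads=8, head_dim=128, n_ctx=8192)
    print(f"FP16 KV cache at 8192 tokens: ~{kv / GB:.2f} GB")
```

Under these assumptions the Q8_0 weights come out near 4 GB, in line with the figure above. The rest of the runtime footprint is KV cache, the SNAC decoder, and backend overhead, which this sketch does not try to model.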

How do I install Orpheus TTS locally?

There are two routes. Pick based on your GPU.

Route A - native (vLLM, 16GB+ GPU). This is the official package and gives the lowest-latency streaming:

pip install orpheus-speech   # pulls vLLM under the hood

from orpheus_tts import OrpheusModel

model = OrpheusModel(model_name="canopylabs/orpheus-3b-0.1-ft")
# generate_speech returns a generator that yields audio chunks as they are produced
syn = model.generate_speech(prompt="Hey, this is Orpheus running locally.", voice="tara")
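
The generate_speech call hands back audio in chunks rather than a finished file. A minimal sketch of saving those chunks with the standard library follows. It assumes the chunks are raw 16-bit mono PCM at 24 kHz, which is what the reference example in the Canopy Labs repo writes; `write_pcm_chunks` is a hypothetical helper name, not part of orpheus_tts.

```python
import wave
from typing import Iterable


def write_pcm_chunks(chunks: Iterable[bytes], path: str,
                     sample_rate: int = 24000) -> float:
    """Write raw 16-bit mono PCM chunks to a WAV file; return seconds of audio."""
    total_frames = 0
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono (assumed)
        wf.setsampwidth(2)      # 2 bytes per sample = 16-bit (assumed)
        wf.setframerate(sample_rate)
        for chunk in chunks:    # chunks arrive as the model streams
            wf.writeframes(chunk)
            total_frames += len(chunk) // 2
    return total_frames / sample_rate
```

With the snippet above, `seconds = write_pcm_chunks(syn, "output.wav")` would consume the generator and report how much audio was written.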

Route B - GGUF (8GB GPU, recommended for most local users). Download a quantized GGUF (for example, a Q8_0 build published by a community quantizer on Hugging Face), load it in LM Studio or run it with llama.cpp's server, then point the Orpheus-FastAPI frontend at that server. This is the path that gets Orpheus onto an 8GB card. If you have set up GGUF models with Ollama or LM Studio before, you already know the shape of this; if not, our Chatterbox setup guide walks through the same local-server pattern step by step.
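
In Route B the inference server is a separate process. As a rough sketch, the helper below assembles a llama-server command line for a local GGUF file. The file name, the port (1234, picked here only as an example), the context size, and the GPU layer count are all placeholder assumptions. Check `llama-server --help` for your build and the Orpheus-FastAPI README for the settings it expects.

```python
import shlex
from typing import List


def llama_server_cmd(gguf_path: str, port: int = 1234,
                     ctx_size: int = 8192, gpu_layers: int = 99) -> List[str]:
    """Build an argv list for llama.cpp's llama-server (values are placeholders)."""
    return [
        "llama-server",
        "-m", gguf_path,           # path to the downloaded GGUF file
        "--host", "127.0.0.1",     # listen on localhost only
        "--port", str(port),
        "-c", str(ctx_size),       # context window in tokens
        "-ngl", str(gpu_layers),   # number of layers to offload to the GPU
    ]


if __name__ == "__main__":
    # Hypothetical file name; substitute the GGUF you actually downloaded.
    cmd = llama_server_cmd("models/orpheus-3b-0.1-ft-q8_0.gguf")
    print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```

Building the command as a list keeps paths with spaces intact if you later launch it from Python instead of a shell.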

Either way, the model card and reference code live on the official Canopy Labs Orpheus-TTS repo.

How do the emotion tags work?

This is Orpheus's signature feature. You embed tags inline in your text and the model performs them as paralinguistic events — not as spoken words. The trained tags are:

<laugh>  <chuckle>  <sigh>  <cough>  <sniffle>  <groan>  <yawn>  <gasp>

So a prompt like:

I can't believe that actually worked <laugh> ... okay, give me a second <sigh> let me try it again.

...produces a genuine laugh and a weary sigh in the right places, with the surrounding speech adjusting its tone around them. In my own testing on an 8GB card with the Q8_0 GGUF, the laugh and sigh tags landed reliably and sounded natural; the rarer tags (groan, yawn) were a little more hit-or-miss and sometimes needed a re-roll. Treat that as an informal single-machine impression, not a benchmark. The practical tip: emotion tags are most convincing when the text around them justifies the emotion. Drop a <laugh> after something actually funny and it sells; bolt one onto neutral text and it can sound forced.
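
A typo in a tag is easy to miss in a long script. The sketch below is a small pre-flight check that flags any angle-bracket tag not in the list of eight above. `unknown_tags` is a hypothetical helper, not part of any Orpheus package, and the strict lowercase match is an assumption about how the tags should be written.

```python
import re
from typing import List

# The eight trained tags listed above.
KNOWN_TAGS = {"laugh", "chuckle", "sigh", "cough", "sniffle", "groan", "yawn", "gasp"}

# Matches <word> style tags and captures the word between the brackets.
TAG_PATTERN = re.compile(r"<([A-Za-z_]+)>")


def unknown_tags(text: str) -> List[str]:
    """Return tag names in `text` that are not in KNOWN_TAGS, in order of appearance."""
    return [name for name in TAG_PATTERN.findall(text) if name not in KNOWN_TAGS]


if __name__ == "__main__":
    script = "I can't believe that actually worked <laugh> ... give me a second <sihg>"
    for name in unknown_tags(script):
        print(f"unrecognized tag: <{name}>")
```

Running the check before synthesis is cheaper than listening to a full clip to find out that a tag was misspelled.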

The eight voices and zero-shot cloning

The finetuned English model ships with eight named voices — tara, leah, jess, leo, dan, mia, zac, zoe — and Canopy Labs recommends tara as the strongest default. You select one per request (voice="tara").

Beyond the built-ins, Orpheus does zero-shot voice cloning: give it a short reference clip and it mimics that speaker without any fine-tuning. Quality scales with clean reference audio — a few seconds of clear, single-speaker speech works far better than a noisy minute. If your main goal is cloning rather than expressive narration, also weigh Chatterbox and XTTS, which are purpose-built around cloning; our XTTS v2 voice cloning guide covers that workflow in depth.

Streaming and OpenAI-compatible FastAPI serving

Orpheus is built for realtime: Canopy Labs reports about 200 ms streaming latency, reducible to roughly 100 ms with input streaming, which is low enough for voice agents and live narration.
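
To see what latency looks like on your own hardware, one option is to time how long the first audio chunk takes to arrive. The sketch below does that for any iterable of byte chunks, such as a streaming generator. It measures wall-clock time from the caller's side, which is not necessarily the same metric Canopy Labs reports, and `time_to_first_chunk` is a hypothetical helper name.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, Iterator[bytes]]:
    """Return seconds until the first chunk arrives, plus an iterator that replays it."""
    start = time.monotonic()
    it = iter(chunks)
    first = next(it)  # blocks until the producer yields once
    latency = time.monotonic() - start

    def replay() -> Iterator[bytes]:
        # Hand back the first chunk, then the rest of the stream untouched.
        yield first
        yield from it

    return latency, replay()
```

The returned iterator still contains every chunk, so it can be passed on to whatever plays or saves the audio after the measurement.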

For serving, the cleanest local setup is the community Orpheus-FastAPI project. It exposes an OpenAI-compatible endpoint at /v1/audio/speech, so anything that already talks to OpenAI's TTS API can point at your local box instead by swapping the base URL. It runs as a frontend that connects to an external inference server — llama.cpp, LM Studio, or any OpenAI-compatible server hosting the Orpheus GGUF — configured via an ORPHEUS_API_URL environment variable:

curl http://localhost:5005/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"orpheus","input":"Streaming locally with a real laugh <laugh>","voice":"tara"}' \
  --output speech.wav

That OpenAI-compatible shape is the big unlock: you get cloud-API ergonomics with a fully local, no-per-character-cost model behind it.
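
The same request can be made from Python with only the standard library. This is a sketch under the assumptions of the curl example above: the server listens on localhost:5005, accepts the same JSON fields, and returns audio bytes in the response body. The function names are hypothetical, and the voice check simply mirrors the eight voices named earlier.

```python
import json
import urllib.request

VOICES = ("tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe")


def build_speech_request(text: str, voice: str = "tara",
                         base_url: str = "http://localhost:5005") -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; expected one of {VOICES}")
    payload = {"model": "orpheus", "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str, voice: str = "tara") -> None:
    """Send the request and save the response body, assumed to be audio bytes."""
    with urllib.request.urlopen(build_speech_request(text, voice)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

Calling `synthesize("Streaming locally with a real laugh <laugh>", "speech.wav")` would mirror the curl command, provided the server is running.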

Orpheus vs Kokoro vs Chatterbox

These three cover the spectrum of local TTS in 2026, and the right one depends entirely on your priority — expressiveness, footprint, or cloning. The numbers below come from each project's official model card and repo.

Feature	Orpheus 3B	Kokoro	Chatterbox
Maker	Canopy Labs	hexgrad	Resemble AI
Parameters	3B (Llama backbone)	82M	0.5B
License	Apache 2.0	Apache 2.0	MIT
Approx VRAM	~8 GB (Q4/Q8 GGUF)	under 2 GB	~6 GB class
Emotion control	Inline tags (laugh/sigh/etc.)	None (neutral)	Exaggeration dial (0.5 neutral, ~0.7+ dramatic)
Voice cloning	Yes (zero-shot)	No (fixed voices)	Yes (few-second ref)
Best at	Most human/expressive speech	Tiny, fast, low-resource	Cloning + tunable emotion

The trade-off is clean to state. Orpheus is the pick when you want the most human, emotionally alive voice and you can spare ~8 GB — it is the only one of the three with explicit inline paralinguistic tags. Kokoro wins on pure efficiency: at 82M it runs on almost anything (under 2 GB) and produces clean neutral narration, but it can't clone and stays emotionally flat. Chatterbox sits in the middle — a half-billion-parameter model whose standout is an emotion-exaggeration dial (Resemble AI puts 0.5 at neutral and ~0.7+ for dramatic) plus fast few-second cloning. For tiny-footprint narration start with our Kokoro local setup guide; for tunable cloning, the Chatterbox setup guide.

Key Takeaways

Orpheus TTS is a 3B Llama-backbone speech-LLM from Canopy Labs (Apache 2.0, canopylabs/orpheus-3b-0.1-ft) that learns emotion and prosody like a language model learns grammar.
It runs in roughly 8 GB of VRAM via GGUF (Q8_0 weights ~4 GB) through llama.cpp or LM Studio; the native vLLM path wants 16GB+.
Inline emotion tags (<laugh>, <sigh>, <chuckle>, <gasp> and more) are trained features, not prompt hacks — they perform paralinguistic events inside speech.
Eight English voices (tara is the default) plus zero-shot cloning, with ~200 ms streaming (≈100 ms with input streaming).
Serve it OpenAI-style with Orpheus-FastAPI's /v1/audio/speech endpoint — drop-in for the OpenAI TTS API, fully local.
Choose by priority: Orpheus for expressiveness, Kokoro (82M) for the lightest footprint, Chatterbox (0.5B) for tunable emotion + cloning.

Next Steps

New to local voice models? Start with the field overview in Best Local TTS Models, which ranks the open options by quality and footprint.
Want tunable emotion plus fast cloning instead? Follow the Chatterbox TTS setup guide.
On a tiny GPU or CPU-only box? Kokoro's 82M model runs in under 2 GB.
Focused on cloning a specific voice? The XTTS v2 voice cloning guide covers the reference-audio workflow end to end.
Not sure a model fits your card? Plug your GPU and target quant into the VRAM calculator before you download the weights.

Orpheus TTS Setup 2026: Human-Like Emotional Local Voice

Written by the Local AI Master Team
