OpenVoice v2 Guide (2026): Voice Cloning with Style and Emotion Control
OpenVoice v2 is the voice cloning model that gives you the most control. Where F5-TTS reproduces a voice exactly as it sounds in the reference and XTTS handles multilingual generation cleanly, OpenVoice lets you decouple voice timbre from speaking style — so the same speaker can deliver lines cheerfully, angrily, whispering, or shouting on demand. It also clones from just 1-5 seconds of reference audio (vs F5-TTS's 5-15 sec), and ships under a clean MIT license that's safe for commercial use.
For game NPCs that need varied emotional delivery, character voice acting, accessibility tools that adapt tone to context, or any application where the reference audio doesn't carry the right emotional palette — OpenVoice v2 is the right tool. This guide covers everything: setup, the two-stage architecture, style/emotion controls, accent transfer, multilingual generation, integration with LocalAI / SillyTavern / game engines, and benchmarks vs F5-TTS / XTTS / commercial services.
Table of Contents
- What OpenVoice v2 Is
- The Two-Stage Architecture
- Style, Emotion, and Accent Control
- OpenVoice v2 vs F5-TTS vs XTTS v2
- Hardware Requirements
- Installation
- Your First Voice Clone
- Style and Emotion Examples
- Cross-Lingual Cloning
- Streaming Real-Time Output
- Python API
- LocalAI Integration
- SillyTavern / Game Engine Integration
- MIT License Implications
- Performance Benchmarks
- Tuning Recipes
- Ethical Considerations
- Troubleshooting
What OpenVoice v2 Is {#what-it-is}
OpenVoice v2 (myshell-ai/OpenVoice) is a voice cloning model from MyShell.ai released in mid-2024. It introduces explicit decoupling of:
- Voice timbre (the physical "sound" of a voice — pitch, resonance, vocal tract characteristics)
- Speaking style (cheerful, sad, angry, whispering, shouting, terrified, friendly, default)
- Accent (US English, UK English, Indian English, Australian, Spanish, French, Chinese, Japanese, Korean)
Most TTS systems bake style into voice — clone a sad voice and you get sad output. OpenVoice v2 lets you clone Speaker A and have them deliver lines in 8 different emotional registers from the same reference.
License: MIT. Project: github.com/myshell-ai/OpenVoice. Active maintenance.
The Two-Stage Architecture {#architecture}
OpenVoice v2 splits TTS into two models:
Text + Style + Accent ──> [Base Speaker TTS] ──> Generic-voice audio
                                                        │
                                                        ▼
Reference audio ──────────────────────────> [Tone Color Converter]
                                                        │
                                                        ▼
                                                Cloned-timbre audio
The Base Speaker TTS produces speech in the chosen style/accent with a generic timbre. The Tone Color Converter then swaps in the target speaker's vocal characteristics while preserving style. Two models, ~350 MB each, ~700 MB total.
Style, Emotion, and Accent Control {#style-control}
Available styles (English):
- default — neutral
- cheerful — upbeat, positive
- sad — slow, low energy
- angry — emphatic, sharp
- whispering — soft, breathy
- shouting — loud, projected
- terrified — fearful, trembling
- friendly — warm, conversational
Accents (English):
en-default, en-us, en-uk, en-india, en-australia
Other languages have a smaller style set (typically just default + one or two emotions), since training data is more limited for non-English emotional speech.
Combine: style="cheerful", language="en-australia" produces upbeat Australian English; then tone-convert to your target speaker's timbre.
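As a sketch of that combination, using the models and embeddings set up in Your First Voice Clone below (the accent string follows the en-australia convention above; exact accepted values depend on your checkpoints):
base_speaker_tts.tts(
    text="No worries, I'll sort it out.",
    output_path="base_au.wav",
    speaker="cheerful",        # style
    language="en-australia",   # accent
)
tone_color_converter.convert(
    audio_src_path="base_au.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="output_au.wav",
)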
OpenVoice v2 vs F5-TTS vs XTTS v2 {#comparison}
| Property | OpenVoice v2 | F5-TTS | XTTS v2 |
|---|---|---|---|
| Reference audio length | 1-5 sec | 5-15 sec | 6-15 sec |
| Languages | 6 (well-tuned) | 2 base + community | 17 |
| Style / emotion control | Explicit | Implicit | Implicit |
| Accent control | Explicit | None | None |
| Voice quality (MOS) | 4.0 | 4.3 | 4.0 |
| Inference speed (RTX 4090) | 8x RTF | 5x | 4x |
| First-audio latency | 200 ms | 280 ms | 280 ms |
| License | MIT | CC-BY-NC-4.0 | CPML (ambiguous) |
Pick OpenVoice v2 for: style-controlled cloning, short references, MIT licensing, real-time apps. Pick F5-TTS for: highest voice fidelity. Pick XTTS for: broadest multilingual.
Hardware Requirements {#requirements}
| Hardware | RTF |
|---|---|
| RTX 4090 | 8x |
| RTX 4070 | 5x |
| RTX 3060 | 4x |
| Apple M4 Max (MPS) | 3x |
| RX 7900 XTX (ROCm) | 3x |
| Ryzen 7 7700X (CPU) | 0.5x |
VRAM: 2-3 GB. It's the smallest of the major voice-cloning models and runs comfortably on inexpensive GPUs.
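A small device-selection sketch for the table above, using standard PyTorch checks; the resulting string is what the examples below pass as the device argument:
import torch

# Prefer CUDA (NVIDIA, and ROCm builds), then Apple MPS, else CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"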
Installation {#installation}
python3.10 -m venv ~/venvs/openvoice
source ~/venvs/openvoice/bin/activate
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
git clone https://github.com/myshell-ai/OpenVoice
cd OpenVoice
pip install -e .
# Download checkpoints (if this helper is absent in your version, download
# the checkpoints_v2 archive from the project's releases and extract it here)
python -c "from openvoice import api; api.download_checkpoints()"
For Docker:
docker run --gpus all --rm -it \
-v $(pwd):/workspace \
ghcr.io/myshell-ai/openvoice:v2 \
bash
Your First Voice Clone {#first-clone}
import torch

from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter

# Initialize models
base_speaker_tts = BaseSpeakerTTS(
    "checkpoints_v2/base_speakers/EN/config.json",
    device="cuda",
)
base_speaker_tts.load_ckpt("checkpoints_v2/base_speakers/EN/checkpoint.pth")
tone_color_converter = ToneColorConverter(
    "checkpoints_v2/converter/config.json",
    device="cuda",
)
tone_color_converter.load_ckpt("checkpoints_v2/converter/checkpoint.pth")
# Source speaker embedding for the generic "default" base speaker
# (spk2id maps names to integer IDs, not embeddings — load the .pth instead;
# the exact filename follows your checkpoint layout)
source_se = torch.load(
    "checkpoints_v2/base_speakers/EN/en_default_se.pth",
    map_location="cuda",
)
# Stage 1: extract target speaker tone embedding from reference
target_se, _ = se_extractor.get_se(
    "reference.wav",
    tone_color_converter,
)
# Stage 2: generate base audio in chosen style
base_speaker_tts.tts(
    text="Hello, this is a test.",
    output_path="base.wav",
    speaker="default",  # or cheerful, sad, angry, etc.
    language="English",
    speed=1.0,
)
# Stage 3: convert base audio's tone to target speaker
tone_color_converter.convert(
    audio_src_path="base.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="output.wav",
)
That's it: short reference + style + accent → cloned audio in any combination.
Style and Emotion Examples {#style-examples}
styles = ["default", "cheerful", "sad", "angry", "whispering", "shouting", "terrified", "friendly"]
for style in styles:
    base_speaker_tts.tts(
        text="The deal is closing in five minutes.",
        output_path=f"base_{style}.wav",
        speaker=style,
        language="English",
    )
    tone_color_converter.convert(
        audio_src_path=f"base_{style}.wav",
        src_se=source_se,  # generic base-speaker embedding loaded earlier
        tgt_se=target_se,
        output_path=f"output_{style}.wav",
    )
Same speaker, same line, eight emotional deliveries. Useful for game NPCs, character voiceovers, accessibility tools that adjust tone to context.
Cross-Lingual Cloning {#cross-lingual}
# Reference: English speaker
# Generate: Japanese audio with their voice
# (assumes the loaded base speaker checkpoint supports Japanese)
base_speaker_tts.tts(
    text="こんにちは、これはテストです。",
    output_path="base_ja.wav",
    speaker="default",
    language="Japanese",
)
tone_color_converter.convert(
    audio_src_path="base_ja.wav",
    src_se=source_se,
    tgt_se=target_se,  # English reference's tone embedding
    output_path="output_ja.wav",
)
The English speaker's voice timbre delivers Japanese with native-sounding Japanese pronunciation and prosody. This works because the tone color converter is language-agnostic.
Streaming Real-Time Output {#streaming}
For real-time voice agents, a chunked pipeline looks like this (tts_stream and convert_chunk are streaming helpers exposed by some builds and community wrappers; the core API generates complete clips, so confirm your version has them):
for chunk in base_speaker_tts.tts_stream(
    text=long_text,
    speaker="friendly",
    language="English",
    chunk_size=200,  # tokens per chunk
):
    converted = tone_color_converter.convert_chunk(chunk, source_se, target_se)
    play(converted)  # play immediately, don't wait for full text
First-audio latency: ~200 ms on RTX 4090. Suitable for voice chatbots, customer service avatars, game dialogue systems.
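The play() call above is application-specific; here is a minimal sketch with the sounddevice package, assuming each converted chunk is a float32 numpy array (the 24 kHz default is an assumption — the actual rate comes from the converter's config):
import numpy as np
import sounddevice as sd

def play(chunk: np.ndarray, sample_rate: int = 24000) -> None:
    # Blocking playback of one chunk; use an OutputStream with a
    # ring buffer for gapless real-time playback.
    sd.play(chunk, samplerate=sample_rate)
    sd.wait()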
Python API {#python-api}
The full API is documented at github.com/myshell-ai/OpenVoice/blob/main/openvoice/api.py. Key classes:
- BaseSpeakerTTS — text → generic voice in chosen style
- ToneColorConverter — generic voice → cloned voice
- se_extractor — extract speaker embedding from reference audio
For convenience wrappers, the community provides higher-level classes (e.g., OpenVoicePipeline from third-party packages).
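A minimal version of such a wrapper is easy to sketch yourself (a hypothetical class, not an upstream API):
class OpenVoicePipeline:
    """Hypothetical convenience wrapper over the three components above."""

    def __init__(self, base_tts, converter, source_se):
        self.base_tts = base_tts      # BaseSpeakerTTS
        self.converter = converter    # ToneColorConverter
        self.source_se = source_se    # generic base-speaker embedding

    def clone(self, text, target_se, style="default", out_path="out.wav"):
        # Stage 1: styled speech with the generic timbre
        self.base_tts.tts(text=text, output_path="_base.wav",
                          speaker=style, language="English")
        # Stage 2: swap in the target speaker's timbre
        self.converter.convert(audio_src_path="_base.wav",
                               src_se=self.source_se, tgt_se=target_se,
                               output_path=out_path)
        return out_path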
LocalAI Integration {#localai}
# models/openvoice.yaml
name: openvoice
backend: openvoice # community-contributed backend
parameters:
base_model: checkpoints_v2/base_speakers/EN/checkpoint.pth
converter_model: checkpoints_v2/converter/checkpoint.pth
voice_wav: /build/voices/default-en.wav
Then call it from any OpenAI-compatible client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.audio.speech.create(
    model="openvoice",
    voice="default-en",
    input="Hello world",
    extra_body={"style": "cheerful"},
)
with open("hello.wav", "wb") as f:
    f.write(response.content)
See LocalAI Setup Guide.
SillyTavern / Game Engine Integration {#integrations}
SillyTavern
Use the AllTalk wrapper with an OpenVoice backend (community plugin), or connect directly via an OpenAI-compatible LocalAI URL. Style can be passed per character: an angry character carries style=angry in its configuration, which gets forwarded with the API call, as sketched below.
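A minimal sketch of that routing, reusing the OpenAI client from the LocalAI section (character names and styles are illustrative):
# Map characters to delivery styles; forward the style with each line
CHARACTER_STYLES = {"Guard": "angry", "Merchant": "friendly"}

def speak_line(character: str, line: str):
    return client.audio.speech.create(
        model="openvoice",
        voice="default-en",
        input=line,
        extra_body={"style": CHARACTER_STYLES.get(character, "default")},
    )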
Unity / Unreal game engines
Community-built ONNX exports of OpenVoice v2 can be called from C++/C# game runtimes. For runtime voice generation in shipped games, the latency and CPU cost are workable for non-real-time NPC dialogue. For dynamic voice (player-driven), pre-bake on a server.
Real-time avatars (Vtubing, agents)
OpenVoice v2 + Pipecat or LiveKit + lip-sync animation (Wav2Lip) creates real-time animated avatars with cloned voices. A latency budget of 500-1000 ms end-to-end is achievable.
MIT License Implications {#license}
MIT is the most permissive open-source license. You can:
- Use commercially without attribution (attribution still good practice)
- Modify and redistribute
- Bundle into closed-source products
- Sell as paid service
Among major voice-cloning models, only OpenVoice v2 has MIT. F5-TTS is CC-BY-NC-4.0 (non-commercial). XTTS v2 is CPML (commercial license tier defunct after Coqui shutdown). For commercial deployment without licensing risk, OpenVoice v2 is the cleanest choice.
Performance Benchmarks {#benchmarks}
10-second generation, RTX 4090:
| Model | Time | RTF |
|---|---|---|
| OpenVoice v2 (base + converter) | 1.25 sec | 8x |
| F5-TTS | 2.0 sec | 5x |
| XTTS v2 | 2.5 sec | 4x |
| OpenVoice v2 streaming first-chunk | 200 ms | n/a |
OpenVoice v2 is fastest among the open-source voice-cloning options.
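RTF here is audio duration divided by generation wall-clock time; a sketch for reproducing these numbers on your own hardware, using the pipeline from Your First Voice Clone (soundfile is an assumed dependency, used only to read the output duration):
import time

import soundfile as sf

start = time.perf_counter()
base_speaker_tts.tts(text=long_text, output_path="bench_base.wav",
                     speaker="default", language="English")
tone_color_converter.convert(audio_src_path="bench_base.wav",
                             src_se=source_se, tgt_se=target_se,
                             output_path="bench_out.wav")
elapsed = time.perf_counter() - start

audio, sr = sf.read("bench_out.wav")
duration = len(audio) / sr
print(f"{duration:.1f}s audio in {elapsed:.2f}s -> RTF {duration / elapsed:.1f}x")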
Tuning Recipes {#tuning}
Game NPC voices (style-driven)
# Cache target_se per character; vary style per dialogue line
guard_se, _ = se_extractor.get_se("guard_voice.wav", tone_color_converter)
merchant_se, _ = se_extractor.get_se("merchant_voice.wav", tone_color_converter)

def speak(text, character_se, style):
    base_speaker_tts.tts(text=text, output_path="tmp.wav", speaker=style, language="English")
    tone_color_converter.convert("tmp.wav", source_se, character_se, "out.wav")

speak("Halt!", guard_se, "angry")
speak("Welcome, traveler.", merchant_se, "friendly")
Audiobook with character voices
Pre-extract target_se for each character (narrator, protagonist, antagonist, supporting cast). Loop through dialogue, set speaker by character, set style by emotional context, generate.
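A sketch of that loop, assuming a pre-parsed script of (character, style, line) tuples and per-character embeddings cached as shown earlier (names are illustrative):
voices = {"narrator": narrator_se, "hero": hero_se, "villain": villain_se}

for i, (character, style, line) in enumerate(script):
    base_speaker_tts.tts(text=line, output_path=f"base_{i:05d}.wav",
                         speaker=style, language="English")
    tone_color_converter.convert(audio_src_path=f"base_{i:05d}.wav",
                                 src_se=source_se,
                                 tgt_se=voices[character],
                                 output_path=f"line_{i:05d}.wav")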
Real-time voice agent
# Stream incremental tokens from the LLM; flush a sentence at a time to TTS
# (llm, tts, play_async are your application's objects)
buffer = ""
for token in llm.stream(prompt):
    buffer += token
    if buffer.rstrip().endswith((".", "!", "?")):
        play_async(tts(buffer))  # tts = the two-stage pipeline above
        buffer = ""
Latency target: 500-800 ms from LLM token to audio playback.
Ethical Considerations {#ethics}
Same considerations as F5-TTS / XTTS v2:
- Consent before cloning
- Disclosure of synthetic audio
- Watermarking where possible
- Logging
- Refusal of public-figure / fraud cloning
- Compliance with state-level synthetic media laws
OpenVoice v2's style/emotion controls add a layer of risk: a cloned voice expressing emotions the original speaker never approved (e.g., a clone of a public figure "shouting") is more harmful than a neutral clone. Apply extra scrutiny to style-varied outputs in public-facing applications.
See F5-TTS ethics for the broader voice-cloning legal landscape.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Robotic output | Reference too noisy | Use cleaner reference audio |
| Style not applying | Style not in that language's set | Use English, or a style the language supports |
| Voice match poor | Reference too short | Use 3-5 sec instead of 1-2 |
| OOM | Long text input | Chunk text into sentences |
| Slow on AMD | PyTorch CUDA build installed | Reinstall PyTorch from the ROCm (rocm6.2) wheel index |
| Cross-lingual voice drift | Script mismatch | Same-script reference where possible |
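For the OOM row above, a minimal sentence-chunking sketch using only the standard library:
import re

def chunk_sentences(text: str, max_chars: int = 300) -> list[str]:
    # Split on sentence-ending punctuation, then pack into bounded chunks
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks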
Sources: OpenVoice GitHub | MyShell.ai | OpenVoice paper | Internal benchmarks (RTX 4090, M4 Max).