
OpenVoice v2 Guide (2026): Voice Cloning with Style and Emotion Control

May 1, 2026
20 min read
LocalAimaster Research Team

OpenVoice v2 is the voice cloning model that gives you the most control. Where F5-TTS clones a voice exactly as it sounds in the reference and XTTS v2 handles multilingual output cleanly, OpenVoice v2 decouples voice timbre from speaking style, so the same speaker can deliver lines cheerfully, angrily, whispering, or shouting on demand. It also clones from just 1-5 seconds of reference audio (vs. F5-TTS's 5-15 seconds), and ships under a clean MIT license that's safe for commercial use.

For game NPCs that need varied emotional delivery, character voice acting, accessibility tools that adapt tone to context, or any application where the reference audio doesn't carry the right emotional palette — OpenVoice v2 is the right tool. This guide covers everything: setup, the two-stage architecture, style/emotion controls, accent transfer, multilingual generation, integration with LocalAI / SillyTavern / game engines, and benchmarks vs F5-TTS / XTTS / commercial services.

Table of Contents

  1. What OpenVoice v2 Is
  2. The Two-Stage Architecture
  3. Style, Emotion, and Accent Control
  4. OpenVoice v2 vs F5-TTS vs XTTS v2
  5. Hardware Requirements
  6. Installation
  7. Your First Voice Clone
  8. Style and Emotion Examples
  9. Cross-Lingual Cloning
  10. Streaming Real-Time Output
  11. Python API
  12. LocalAI Integration
  13. SillyTavern / Game Engine Integration
  14. MIT License Implications
  15. Performance Benchmarks
  16. Tuning Recipes
  17. Ethical Considerations
  18. Troubleshooting
  19. FAQ


What OpenVoice v2 Is {#what-it-is}

OpenVoice v2 (myshell-ai/OpenVoice) is a voice cloning model from MyShell.ai released in mid-2024. It introduces explicit decoupling of:

  • Voice timbre (the physical "sound" of a voice — pitch, resonance, vocal tract characteristics)
  • Speaking style (cheerful, sad, angry, whispering, shouting, terrified, friendly, default)
  • Accent (US English, UK English, Indian English, Australian, Spanish, French, Chinese, Japanese, Korean)

Most TTS systems bake style into voice — clone a sad voice and you get sad output. OpenVoice v2 lets you clone Speaker A and have them deliver lines in 8 different emotional registers from the same reference.

License: MIT. Project: github.com/myshell-ai/OpenVoice. Actively maintained.


The Two-Stage Architecture {#architecture}

OpenVoice v2 splits TTS into two models:

Text + Style + Accent ──> [Base Speaker TTS] ──> Generic-voice audio
                                                    │
                                                    ▼
                                          [Tone Color Converter]
                                                    │
   Reference audio ────────────────────────────────┘
                                                    ▼
                                          Cloned-timbre audio

The Base Speaker TTS produces speech in the chosen style/accent with a generic timbre. The Tone Color Converter then swaps in the target speaker's vocal characteristics while preserving style. Two models, ~350 MB each, ~700 MB total.


Style, Emotion, and Accent Control {#style-control}

Available styles (English):

  • default — neutral
  • cheerful — upbeat, positive
  • sad — slow, low energy
  • angry — emphatic, sharp
  • whispering — soft, breathy
  • shouting — loud, projected
  • terrified — fearful, trembling
  • friendly — warm, conversational

Accents (English):

  • en-default, en-us, en-uk, en-india, en-australia

Other languages have a smaller style set (typically just default + one or two emotions), since training data is more limited for non-English emotional speech.

Combine: style="cheerful", language="en-australia" produces upbeat Australian English; then tone-convert to your target speaker's timbre, as sketched below.
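
A minimal sketch of that combination (the model objects and embeddings are the ones set up in Your First Voice Clone below; the accent identifier follows this guide's naming):

base_speaker_tts.tts(
    text="No worries, the update ships on Friday.",
    output_path="base_au.wav",
    speaker="cheerful",          # style from the list above
    language="en-australia",     # accent identifier from the list above
)
tone_color_converter.convert(
    audio_src_path="base_au.wav",
    src_se=src_se,               # base speaker source embedding
    tgt_se=target_se,            # cloned speaker embedding
    output_path="output_au.wav",
)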


OpenVoice v2 vs F5-TTS vs XTTS v2 {#comparison}

| Property | OpenVoice v2 | F5-TTS | XTTS v2 |
|---|---|---|---|
| Reference audio length | 1-5 sec | 5-15 sec | 6-15 sec |
| Languages | 6 (well-tuned) | 2 base + community | 17 |
| Style / emotion control | Explicit | Implicit | Implicit |
| Accent control | Explicit | None | None |
| Voice quality (MOS) | 4.0 | 4.3 | 4.0 |
| Inference speed (RTX 4090, RTF) | 8x | 5x | 4x |
| First-audio latency | 200 ms | 280 ms | 280 ms |
| License | MIT | CC-BY-NC-4.0 | CPML (ambiguous) |

Pick OpenVoice v2 for style-controlled cloning, short references, MIT licensing, and real-time apps. Pick F5-TTS for the highest voice fidelity. Pick XTTS v2 for the broadest multilingual coverage.


Hardware Requirements {#requirements}

| Hardware | RTF |
|---|---|
| RTX 4090 | 8x |
| RTX 4070 | 5x |
| RTX 3060 | 4x |
| Apple M4 Max (MPS) | 3x |
| RX 7900 XTX (ROCm) | 3x |
| Ryzen 7 7700X (CPU) | 0.5x |

VRAM: 2-3 GB. It's the smallest of the major voice-cloning models and runs comfortably on inexpensive GPUs.
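
If you target mixed hardware, a small device-selection helper covers the CUDA/MPS/CPU rows above (a sketch using standard PyTorch calls; the function name is illustrative):

import torch

def pick_device() -> str:
    # Prefer CUDA, then Apple MPS, then fall back to CPU
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

Pass the result as the device argument when constructing BaseSpeakerTTS and ToneColorConverter.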


Installation {#installation}

python3.10 -m venv ~/venvs/openvoice
source ~/venvs/openvoice/bin/activate

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124

git clone https://github.com/myshell-ai/OpenVoice
cd OpenVoice
pip install -e .

# Download checkpoints: grab the checkpoints_v2 archive linked in the
# project README and extract it to ./checkpoints_v2

For Docker (the image tag below is illustrative; check the repository for a published image):

docker run --gpus all --rm -it \
    -v $(pwd):/workspace \
    ghcr.io/myshell-ai/openvoice:v2 \
    bash

Your First Voice Clone {#first-clone}

import torch

from openvoice import api, se_extractor

# Initialize models
base_speaker_tts = api.BaseSpeakerTTS(
    "checkpoints_v2/base_speakers/EN/config.json",
    device="cuda"
)
base_speaker_tts.load_ckpt("checkpoints_v2/base_speakers/EN/checkpoint.pth")

tone_color_converter = api.ToneColorConverter(
    "checkpoints_v2/converter/config.json",
    device="cuda"
)
tone_color_converter.load_ckpt("checkpoints_v2/converter/checkpoint.pth")

# Step 1: extract the target speaker's tone embedding from the reference
target_se, _ = se_extractor.get_se(
    "reference.wav",
    tone_color_converter,
)

# Step 2: generate base audio in the chosen style
base_speaker_tts.tts(
    text="Hello, this is a test.",
    output_path="base.wav",
    speaker="default",          # or cheerful, sad, angry, etc.
    language="English",
    speed=1.0,
)

# Step 3: convert the base audio's tone to the target speaker.
# The base speaker's source embedding ships with the checkpoints
# (the exact path may vary with your checkpoint layout).
src_se = torch.load(
    "checkpoints_v2/base_speakers/EN/en_default_se.pth"
).to("cuda")

tone_color_converter.convert(
    audio_src_path="base.wav",
    src_se=src_se,
    tgt_se=target_se,
    output_path="output.wav",
)

That's it: short reference + style + accent → cloned audio in any combination.


Style and Emotion Examples {#style-examples}

styles = ["default", "cheerful", "sad", "angry", "whispering", "shouting", "terrified", "friendly"]

for style in styles:
    base_speaker_tts.tts(
        text="The deal is closing in five minutes.",
        output_path=f"base_{style}.wav",
        speaker=style,
        language="English",
    )
    tone_color_converter.convert(
        audio_src_path=f"base_{style}.wav",
        src_se=src_se,               # base speaker source embedding (loaded earlier)
        tgt_se=target_se,
        output_path=f"output_{style}.wav",
    )

Same speaker, same line, eight emotional deliveries. Useful for game NPCs, character voiceovers, and accessibility tools that adjust tone to context.


Cross-Lingual Cloning {#cross-lingual}

# Reference: English speaker
# Generate: Japanese audio with their voice

base_speaker_tts.tts(
    text="こんにちは、これはテストです。",
    output_path="base_ja.wav",
    speaker="default",
    language="Japanese",
)

tone_color_converter.convert(
    audio_src_path="base_ja.wav",
    src_se=src_se,               # base speaker source embedding (loaded earlier)
    tgt_se=target_se,            # English reference's tone
    output_path="output_ja.wav",
)

The English speaker's voice timbre delivers Japanese with native-sounding Japanese pronunciation and prosody. This works because the tone color converter is language-agnostic.


Streaming Real-Time Output {#streaming}

For real-time voice agents, generate and convert chunk by chunk. Note that tts_stream and convert_chunk below are illustrative wrappers rather than stock API methods; the stock API exposes whole-utterance tts and convert, so a streaming setup typically splits text into sentences and loops those calls (see Tuning Recipes):

for chunk in base_speaker_tts.tts_stream(   # illustrative wrapper method
    text=long_text,
    speaker="friendly",
    language="English",
    chunk_size=200,    # tokens per chunk
):
    converted = tone_color_converter.convert_chunk(chunk, src_se, target_se)
    play(converted)    # hand off to your audio sink; don't wait for full text

First-audio latency: ~200 ms on RTX 4090. Suitable for voice chatbots, customer service avatars, game dialogue systems.
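
To check these numbers on your own hardware, here is a rough timing harness over the batch API (a sketch; it reuses the objects and embeddings from Your First Voice Clone):

import time

start = time.perf_counter()
base_speaker_tts.tts(text="Latency check.", output_path="lat.wav",
                     speaker="default", language="English")
tone_color_converter.convert("lat.wav", src_se, target_se, "lat_out.wav")
print(f"end-to-end: {(time.perf_counter() - start) * 1000:.0f} ms")

This measures whole-utterance time rather than first-audio latency; divide the generated audio duration by wall time to estimate RTF.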


Python API {#python-api}

The full API is documented at github.com/myshell-ai/OpenVoice/blob/main/openvoice/api.py. Key classes:

  • BaseSpeakerTTS — text → generic voice in chosen style
  • ToneColorConverter — generic voice → cloned voice
  • se_extractor — extract speaker embedding from reference audio

For convenience, community packages provide higher-level wrapper classes (e.g., an OpenVoicePipeline from third-party packages).
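
A sketch of what such a wrapper might look like (the class below is hypothetical, not part of the upstream package; it just bundles the steps from Your First Voice Clone):

import torch
from openvoice import api, se_extractor

class OpenVoicePipeline:
    """Hypothetical wrapper bundling extraction, synthesis, and conversion."""

    def __init__(self, ckpt_root="checkpoints_v2", device="cuda"):
        base = f"{ckpt_root}/base_speakers/EN"
        self.tts_model = api.BaseSpeakerTTS(f"{base}/config.json", device=device)
        self.tts_model.load_ckpt(f"{base}/checkpoint.pth")
        self.converter = api.ToneColorConverter(
            f"{ckpt_root}/converter/config.json", device=device)
        self.converter.load_ckpt(f"{ckpt_root}/converter/checkpoint.pth")
        self.src_se = torch.load(f"{base}/en_default_se.pth").to(device)

    def clone(self, text, reference_wav, output_path,
              style="default", language="English"):
        # Extract target timbre, synthesize in the chosen style, swap timbre
        tgt_se, _ = se_extractor.get_se(reference_wav, self.converter)
        self.tts_model.tts(text=text, output_path="_base.wav",
                           speaker=style, language=language)
        self.converter.convert(audio_src_path="_base.wav", src_se=self.src_se,
                               tgt_se=tgt_se, output_path=output_path)

pipe = OpenVoicePipeline()
pipe.clone("Good morning!", "reference.wav", "morning.wav", style="cheerful")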


LocalAI Integration {#localai}

# models/openvoice.yaml
name: openvoice
backend: openvoice            # community-contributed backend
parameters:
  base_model: checkpoints_v2/base_speakers/EN/checkpoint.pth
  converter_model: checkpoints_v2/converter/checkpoint.pth
  voice_wav: /build/voices/default-en.wav

Then any OpenAI-compatible client can call it (the base URL below assumes LocalAI on its default port):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

resp = client.audio.speech.create(
    model="openvoice",
    voice="default-en",
    input="Hello world",
    extra_body={"style": "cheerful"},   # style passthrough depends on the backend
)
resp.write_to_file("hello.wav")

See LocalAI Setup Guide.


SillyTavern / Game Engine Integration {#integrations}

SillyTavern

Use the AllTalk wrapper with an OpenVoice backend (community plugin), or connect directly via an OpenAI-compatible LocalAI URL. Style can be set per character: an angry character carries style=angry in its configuration, and the integration maps that onto the API call, as sketched below.
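
A rough sketch of that mapping, assuming the OpenAI-compatible LocalAI route from the previous section (the character names, voice naming scheme, and style field are illustrative):

CHARACTER_STYLES = {"Guard Captain": "angry", "Innkeeper": "friendly"}

def speak_for(character: str, line: str):
    # Route each character's line through the shared endpoint with its style
    return client.audio.speech.create(
        model="openvoice",
        voice=character.lower().replace(" ", "-"),
        input=line,
        extra_body={"style": CHARACTER_STYLES.get(character, "default")},
    )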

Unity / Unreal game engines

Community-built ONNX exports of OpenVoice v2 can be used from C++/C# game runtimes. For runtime voice generation in shipped games, the latency and CPU cost are acceptable for non-real-time NPC dialogue. For dynamic, player-driven voice, generate on a server rather than on the player's machine.

Real-time avatars (Vtubing, agents)

OpenVoice v2 + Pipecat or LiveKit + lip-sync animation (Wav2Lip) creates real-time animated avatars with cloned voices. Latency budget: 500-1000 ms end-to-end is achievable.


MIT License Implications {#license}

MIT is the most permissive open-source license. You can:

  • Use commercially without attribution (attribution still good practice)
  • Modify and redistribute
  • Bundle into closed-source products
  • Sell as paid service

Among major voice-cloning models, only OpenVoice v2 has MIT. F5-TTS is CC-BY-NC-4.0 (non-commercial). XTTS v2 is CPML (commercial license tier defunct after Coqui shutdown). For commercial deployment without licensing risk, OpenVoice v2 is the cleanest choice.


Performance Benchmarks {#benchmarks}

10-second generation, RTX 4090:

ModelTimeRTF
OpenVoice v2 (base + converter)1.25 sec8x
F5-TTS2.0 sec5x
XTTS v22.5 sec4x
OpenVoice v2 streaming first-chunk200 msn/a

OpenVoice v2 is fastest among the open-source voice-cloning options.


Tuning Recipes {#tuning}

Game NPC voices (style-driven)

# Cache target_se per character; vary style per dialogue line.
# se_extractor, src_se, and the model objects come from earlier sections.
guard_se, _ = se_extractor.get_se("guard_voice.wav", tone_color_converter)
merchant_se, _ = se_extractor.get_se("merchant_voice.wav", tone_color_converter)

def speak(text, character_se, style):
    base_speaker_tts.tts(text=text, output_path="tmp.wav", speaker=style, language="English")
    tone_color_converter.convert("tmp.wav", src_se, character_se, "out.wav")

speak("Halt!", guard_se, "angry")
speak("Welcome, traveler.", merchant_se, "friendly")

Audiobook with character voices

Pre-extract target_se for each character (narrator, protagonist, antagonist, supporting cast). Loop through the dialogue, pick the voice by character and the style by emotional context, and generate, as sketched below.
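
A minimal sketch under those assumptions (the character names and dialogue structure are illustrative; se_extractor, src_se, and the model objects come from earlier sections):

# Pre-extract one embedding per character, then render line by line
cast = {
    name: se_extractor.get_se(f"voices/{name}.wav", tone_color_converter)[0]
    for name in ["narrator", "protagonist", "antagonist"]
}

dialogue = [
    ("narrator", "default", "The storm broke just after midnight."),
    ("protagonist", "terrified", "Did you hear that?"),
    ("antagonist", "angry", "Open the door. Now."),
]

for i, (character, style, line) in enumerate(dialogue):
    base_speaker_tts.tts(text=line, output_path="tmp.wav",
                         speaker=style, language="English")
    tone_color_converter.convert("tmp.wav", src_se, cast[character],
                                 f"line_{i:04d}.wav")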

Real-time voice agent

# Stream incremental tokens from the LLM; flush complete sentences to TTS
buffer = ""
for token in llm.stream(prompt):    # llm / play_async are your own glue code
    buffer += token
    if buffer.rstrip().endswith((".", "!", "?")):   # sentence boundary
        base_speaker_tts.tts(text=buffer, output_path="chunk.wav",
                             speaker="friendly", language="English")
        tone_color_converter.convert("chunk.wav", src_se, target_se, "chunk_out.wav")
        play_async("chunk_out.wav")
        buffer = ""

Latency target: 500-800 ms from LLM token to audio playback.


Ethical Considerations {#ethics}

Same considerations as F5-TTS / XTTS v2:

  • Consent before cloning
  • Disclosure of synthetic audio
  • Watermarking where possible
  • Logging
  • Refusal of public-figure / fraud cloning
  • Compliance with state-level synthetic media laws

OpenVoice v2's style/emotion controls add a layer of risk: a cloned voice expressing emotions the original speaker never approved (e.g., a clone of a public figure "shouting") is more harmful than a neutral clone. Apply extra scrutiny to style-varied outputs in public-facing applications.

See F5-TTS ethics for the broader voice-cloning legal landscape.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| Robotic output | Reference too noisy | Use cleaner reference audio |
| Style not applying | Style missing from that language's set | Most styles are English-only; use an English base or the default style |
| Poor voice match | Reference too short | Use 3-5 sec instead of 1-2 |
| OOM | Long text input | Chunk text into sentences |
| Slow on AMD | CUDA build of PyTorch | Reinstall PyTorch with the ROCm 6.2 wheels |
| Cross-lingual voice drift | Script mismatch | Use a same-script reference where possible |

FAQ {#faq}

See answers to common OpenVoice v2 questions below.


Sources: OpenVoice GitHub | MyShell.ai | OpenVoice paper | Internal benchmarks (RTX 4090, M4 Max).
