OpenVoice v2 Guide (2026): Voice Cloning with Style and Emotion Control
OpenVoice v2 is the voice cloning model that gives you the most control. Where F5-TTS reproduces a voice exactly as it sounds in the reference and XTTS handles multilingual generation cleanly, OpenVoice lets you decouple voice timbre from speaking style — so the same speaker can deliver lines cheerfully, angrily, whispering, or shouting on demand. It also clones from just 1-5 seconds of reference audio (vs F5-TTS's 5-15 sec), and ships under a clean MIT license that's safe for commercial use.
For game NPCs that need varied emotional delivery, character voice acting, accessibility tools that adapt tone to context, or any application where the reference audio doesn't carry the right emotional palette — OpenVoice v2 is the right tool. This guide covers everything: setup, the two-stage architecture, style/emotion controls, accent transfer, multilingual generation, integration with LocalAI / SillyTavern / game engines, and benchmarks vs F5-TTS / XTTS / commercial services.
Table of Contents
- What OpenVoice v2 Is
- The Two-Stage Architecture
- Style, Emotion, and Accent Control
- OpenVoice v2 vs F5-TTS vs XTTS v2
- Hardware Requirements
- Installation
- Your First Voice Clone
- Style and Emotion Examples
- Cross-Lingual Cloning
- Streaming Real-Time Output
- Python API
- LocalAI Integration
- SillyTavern / Game Engine Integration
- MIT License Implications
- Performance Benchmarks
- Tuning Recipes
- Ethical Considerations
- Troubleshooting
What OpenVoice v2 Is {#what-it-is}
OpenVoice v2 (myshell-ai/OpenVoice) is a voice cloning model from MyShell.ai released in mid-2024. It introduces explicit decoupling of:
- Voice timbre (the physical "sound" of a voice — pitch, resonance, vocal tract characteristics)
- Speaking style (cheerful, sad, angry, whispering, shouting, terrified, friendly, default)
- Accent (US English, UK English, Indian English, Australian, Spanish, French, Chinese, Japanese, Korean)
Most TTS systems bake style into voice — clone a sad voice and you get sad output. OpenVoice v2 lets you clone Speaker A and have them deliver lines in 8 different emotional registers from the same reference.
License: MIT. Project: github.com/myshell-ai/OpenVoice. Active maintenance.
The Two-Stage Architecture {#architecture}
OpenVoice v2 splits TTS into two models:
Text + Style + Accent ──> [Base Speaker TTS] ──> Generic-voice audio
                                                        │
                                                        ▼
Reference audio ──────────────────────────> [Tone Color Converter]
                                                        │
                                                        ▼
                                                Cloned-timbre audio
The Base Speaker TTS produces speech in the chosen style/accent with a generic timbre. The Tone Color Converter then swaps in the target speaker's vocal characteristics while preserving style. Two models, ~350 MB each, ~700 MB total.
Style, Emotion, and Accent Control {#style-control}
Available styles (English):
- default — neutral
- cheerful — upbeat, positive
- sad — slow, low energy
- angry — emphatic, sharp
- whispering — soft, breathy
- shouting — loud, projected
- terrified — fearful, trembling
- friendly — warm, conversational
Accents (English):
en-default, en-us, en-uk, en-india, en-australia
Other languages have a smaller style set (typically just default + one or two emotions), since training data is more limited for non-English emotional speech.
Combine: style="cheerful", language="en-australia" produces upbeat Australian English; then tone-convert to your target speaker's timbre.
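As a sketch of that combination, using the models and embeddings set up in Your First Voice Clone below (the accent string follows the en-australia convention above; exact accepted values depend on your checkpoints):
base_speaker_tts.tts(
    text="No worries, I'll sort it out.",
    output_path="base_au.wav",
    speaker="cheerful",        # style
    language="en-australia",   # accent
)
tone_color_converter.convert(
    audio_src_path="base_au.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="output_au.wav",
)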
OpenVoice v2 vs F5-TTS vs XTTS v2 {#comparison}
| Property | OpenVoice v2 | F5-TTS | XTTS v2 |
|---|---|---|---|
| Reference audio length | 1-5 sec | 5-15 sec | 6-15 sec |
| Languages | 6 (well-tuned) | 2 base + community | 17 |
| Style / emotion control | Explicit | Implicit | Implicit |
| Accent control | Explicit | None | None |
| Voice quality (MOS) | 4.0 | 4.3 | 4.0 |
| Inference speed (RTX 4090) | 8x RTF | 5x | 4x |
| First-audio latency | 200 ms | 280 ms | 280 ms |
| License | MIT | CC-BY-NC-4.0 | CPML (ambiguous) |
Pick OpenVoice v2 for: style-controlled cloning, short references, MIT licensing, real-time apps. Pick F5-TTS for: highest voice fidelity. Pick XTTS for: broadest multilingual.
Hardware Requirements {#requirements}
| Hardware | RTF |
|---|---|
| RTX 4090 | 8x |
| RTX 4070 | 5x |
| RTX 3060 | 4x |
| Apple M4 Max (MPS) | 3x |
| RX 7900 XTX (ROCm) | 3x |
| Ryzen 7 7700X (CPU) | 0.5x |
VRAM: 2-3 GB. It's the smallest of the major voice-cloning models and runs comfortably on inexpensive GPUs.
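A small device-selection sketch for the table above, using standard PyTorch checks; the resulting string is what the examples below pass as the device argument:
import torch

# Prefer CUDA (NVIDIA, and ROCm builds), then Apple MPS, else CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"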
Installation {#installation}
python3.10 -m venv ~/venvs/openvoice
source ~/venvs/openvoice/bin/activate
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
git clone https://github.com/myshell-ai/OpenVoice
cd OpenVoice
pip install -e .
# Download checkpoints (if this helper is absent in your version, download
# the checkpoints_v2 archive from the project's releases and extract it here)
python -c "from openvoice import api; api.download_checkpoints()"
For Docker:
docker run --gpus all --rm -it \
-v $(pwd):/workspace \
ghcr.io/myshell-ai/openvoice:v2 \
bash
Your First Voice Clone {#first-clone}
import torch

from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter

# Initialize models
base_speaker_tts = BaseSpeakerTTS(
    "checkpoints_v2/base_speakers/EN/config.json",
    device="cuda",
)
base_speaker_tts.load_ckpt("checkpoints_v2/base_speakers/EN/checkpoint.pth")
tone_color_converter = ToneColorConverter(
    "checkpoints_v2/converter/config.json",
    device="cuda",
)
tone_color_converter.load_ckpt("checkpoints_v2/converter/checkpoint.pth")
# Source speaker embedding for the generic "default" base speaker
# (spk2id maps names to integer IDs, not embeddings — load the .pth instead;
# the exact filename follows your checkpoint layout)
source_se = torch.load(
    "checkpoints_v2/base_speakers/EN/en_default_se.pth",
    map_location="cuda",
)
# Stage 1: extract target speaker tone embedding from reference
target_se, _ = se_extractor.get_se(
    "reference.wav",
    tone_color_converter,
)
# Stage 2: generate base audio in chosen style
base_speaker_tts.tts(
    text="Hello, this is a test.",
    output_path="base.wav",
    speaker="default",  # or cheerful, sad, angry, etc.
    language="English",
    speed=1.0,
)
# Stage 3: convert base audio's tone to target speaker
tone_color_converter.convert(
    audio_src_path="base.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="output.wav",
)
That's it: short reference + style + accent → cloned audio in any combination.
Style and Emotion Examples {#style-examples}
styles = ["default", "cheerful", "sad", "angry", "whispering", "shouting", "terrified", "friendly"]
for style in styles:
    base_speaker_tts.tts(
        text="The deal is closing in five minutes.",
        output_path=f"base_{style}.wav",
        speaker=style,
        language="English",
    )
    tone_color_converter.convert(
        audio_src_path=f"base_{style}.wav",
        src_se=source_se,  # generic base-speaker embedding loaded earlier
        tgt_se=target_se,
        output_path=f"output_{style}.wav",
    )
Same speaker, same line, eight emotional deliveries. Useful for game NPCs, character voiceovers, accessibility tools that adjust tone to context.
Cross-Lingual Cloning {#cross-lingual}
# Reference: English speaker
# Generate: Japanese audio with their voice
# (assumes the loaded base speaker checkpoint supports Japanese)
base_speaker_tts.tts(
    text="こんにちは、これはテストです。",
    output_path="base_ja.wav",
    speaker="default",
    language="Japanese",
)
tone_color_converter.convert(
    audio_src_path="base_ja.wav",
    src_se=source_se,
    tgt_se=target_se,  # English reference's tone embedding
    output_path="output_ja.wav",
)
The English speaker's voice timbre delivers Japanese with native-sounding Japanese pronunciation and prosody. This works because the tone color converter is language-agnostic.
Streaming Real-Time Output {#streaming}
For real-time voice agents, a chunked pipeline looks like this (tts_stream and convert_chunk are streaming helpers exposed by some builds and community wrappers; the core API generates complete clips, so confirm your version has them):
for chunk in base_speaker_tts.tts_stream(
    text=long_text,
    speaker="friendly",
    language="English",
    chunk_size=200,  # tokens per chunk
):
    converted = tone_color_converter.convert_chunk(chunk, source_se, target_se)
    play(converted)  # play immediately, don't wait for full text
First-audio latency: ~200 ms on RTX 4090. Suitable for voice chatbots, customer service avatars, game dialogue systems.
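The play() call above is application-specific; here is a minimal sketch with the sounddevice package, assuming each converted chunk is a float32 numpy array (the 24 kHz default is an assumption — the actual rate comes from the converter's config):
import numpy as np
import sounddevice as sd

def play(chunk: np.ndarray, sample_rate: int = 24000) -> None:
    # Blocking playback of one chunk; use an OutputStream with a
    # ring buffer for gapless real-time playback.
    sd.play(chunk, samplerate=sample_rate)
    sd.wait()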
Python API {#python-api}
The full API is documented at github.com/myshell-ai/OpenVoice/blob/main/openvoice/api.py. Key classes:
- BaseSpeakerTTS — text → generic voice in chosen style
- ToneColorConverter — generic voice → cloned voice
- se_extractor — extract speaker embedding from reference audio
For convenience wrappers, the community provides higher-level classes (e.g., OpenVoicePipeline from third-party packages).
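A minimal version of such a wrapper is easy to sketch yourself (a hypothetical class, not an upstream API):
class OpenVoicePipeline:
    """Hypothetical convenience wrapper over the three components above."""

    def __init__(self, base_tts, converter, source_se):
        self.base_tts = base_tts      # BaseSpeakerTTS
        self.converter = converter    # ToneColorConverter
        self.source_se = source_se    # generic base-speaker embedding

    def clone(self, text, target_se, style="default", out_path="out.wav"):
        # Stage 1: styled speech with the generic timbre
        self.base_tts.tts(text=text, output_path="_base.wav",
                          speaker=style, language="English")
        # Stage 2: swap in the target speaker's timbre
        self.converter.convert(audio_src_path="_base.wav",
                               src_se=self.source_se, tgt_se=target_se,
                               output_path=out_path)
        return out_path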
LocalAI Integration {#localai}
# models/openvoice.yaml
name: openvoice
backend: openvoice # community-contributed backend
parameters:
base_model: checkpoints_v2/base_speakers/EN/checkpoint.pth
converter_model: checkpoints_v2/converter/checkpoint.pth
voice_wav: /build/voices/default-en.wav
Then call it from any OpenAI-compatible client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.audio.speech.create(
    model="openvoice",
    voice="default-en",
    input="Hello world",
    extra_body={"style": "cheerful"},
)
with open("hello.wav", "wb") as f:
    f.write(response.content)
See LocalAI Setup Guide.
SillyTavern / Game Engine Integration {#integrations}
SillyTavern
Use the AllTalk wrapper with an OpenVoice backend (community plugin), or connect directly via an OpenAI-compatible LocalAI URL. Style can be passed per character: an angry character carries style=angry in its configuration, which gets forwarded with the API call, as sketched below.
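A minimal sketch of that routing, reusing the OpenAI client from the LocalAI section (character names and styles are illustrative):
# Map characters to delivery styles; forward the style with each line
CHARACTER_STYLES = {"Guard": "angry", "Merchant": "friendly"}

def speak_line(character: str, line: str):
    return client.audio.speech.create(
        model="openvoice",
        voice="default-en",
        input=line,
        extra_body={"style": CHARACTER_STYLES.get(character, "default")},
    )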
Unity / Unreal game engines
Community-built ONNX exports of OpenVoice v2 can be called from C++/C# game runtimes. For runtime voice generation in shipped games, the latency and CPU cost are workable for non-real-time NPC dialogue. For dynamic voice (player-driven), pre-bake on a server.
Real-time avatars (Vtubing, agents)
OpenVoice v2 + Pipecat or LiveKit + lip-sync animation (Wav2Lip) creates real-time animated avatars with cloned voices. A latency budget of 500-1000 ms end-to-end is achievable.
MIT License Implications {#license}
MIT is the most permissive open-source license. You can:
- Use commercially without attribution (attribution still good practice)
- Modify and redistribute
- Bundle into closed-source products
- Sell as paid service
Among major voice-cloning models, only OpenVoice v2 has MIT. F5-TTS is CC-BY-NC-4.0 (non-commercial). XTTS v2 is CPML (commercial license tier defunct after Coqui shutdown). For commercial deployment without licensing risk, OpenVoice v2 is the cleanest choice.
Performance Benchmarks {#benchmarks}
10-second generation, RTX 4090:
| Model | Time | RTF |
|---|---|---|
| OpenVoice v2 (base + converter) | 1.25 sec | 8x |
| F5-TTS | 2.0 sec | 5x |
| XTTS v2 | 2.5 sec | 4x |
| OpenVoice v2 streaming first-chunk | 200 ms | n/a |
OpenVoice v2 is fastest among the open-source voice-cloning options.
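RTF here is audio duration divided by generation wall-clock time; a sketch for reproducing these numbers on your own hardware, using the pipeline from Your First Voice Clone (soundfile is an assumed dependency, used only to read the output duration):
import time

import soundfile as sf

start = time.perf_counter()
base_speaker_tts.tts(text=long_text, output_path="bench_base.wav",
                     speaker="default", language="English")
tone_color_converter.convert(audio_src_path="bench_base.wav",
                             src_se=source_se, tgt_se=target_se,
                             output_path="bench_out.wav")
elapsed = time.perf_counter() - start

audio, sr = sf.read("bench_out.wav")
duration = len(audio) / sr
print(f"{duration:.1f}s audio in {elapsed:.2f}s -> RTF {duration / elapsed:.1f}x")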
Tuning Recipes {#tuning}
Game NPC voices (style-driven)
# Cache target_se per character; vary style per dialogue line
guard_se, _ = se_extractor.get_se("guard_voice.wav", tone_color_converter)
merchant_se, _ = se_extractor.get_se("merchant_voice.wav", tone_color_converter)

def speak(text, character_se, style):
    base_speaker_tts.tts(text=text, output_path="tmp.wav", speaker=style, language="English")
    tone_color_converter.convert("tmp.wav", source_se, character_se, "out.wav")

speak("Halt!", guard_se, "angry")
speak("Welcome, traveler.", merchant_se, "friendly")
Audiobook with character voices
Pre-extract target_se for each character (narrator, protagonist, antagonist, supporting cast). Loop through dialogue, set speaker by character, set style by emotional context, generate.
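A sketch of that loop, assuming a pre-parsed script of (character, style, line) tuples and per-character embeddings cached as shown earlier (names are illustrative):
voices = {"narrator": narrator_se, "hero": hero_se, "villain": villain_se}

for i, (character, style, line) in enumerate(script):
    base_speaker_tts.tts(text=line, output_path=f"base_{i:05d}.wav",
                         speaker=style, language="English")
    tone_color_converter.convert(audio_src_path=f"base_{i:05d}.wav",
                                 src_se=source_se,
                                 tgt_se=voices[character],
                                 output_path=f"line_{i:05d}.wav")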
Real-time voice agent
# Stream incremental tokens from the LLM; flush a sentence at a time to TTS
# (llm, tts, play_async are your application's objects)
buffer = ""
for token in llm.stream(prompt):
    buffer += token
    if buffer.rstrip().endswith((".", "!", "?")):
        play_async(tts(buffer))  # tts = the two-stage pipeline above
        buffer = ""
Latency target: 500-800 ms from LLM token to audio playback.
Ethical Considerations {#ethics}
Same considerations as F5-TTS / XTTS v2:
- Consent before cloning
- Disclosure of synthetic audio
- Watermarking where possible
- Logging
- Refusal of public-figure / fraud cloning
- Compliance with state-level synthetic media laws
OpenVoice v2's style/emotion controls add a layer of risk: a cloned voice expressing emotions the original speaker never approved (e.g., a clone of a public figure "shouting") is more harmful than a neutral clone. Apply extra scrutiny to style-varied outputs in public-facing applications.
See F5-TTS ethics for the broader voice-cloning legal landscape.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Robotic output | Reference too noisy | Use cleaner reference audio |
| Style not applying | Style not in that language's set | Use English, or a style the language supports |
| Voice match poor | Reference too short | Use 3-5 sec instead of 1-2 |
| OOM | Long text input | Chunk text into sentences |
| Slow on AMD | PyTorch CUDA build installed | Reinstall PyTorch from the ROCm (rocm6.2) wheel index |
| Cross-lingual voice drift | Script mismatch | Same-script reference where possible |
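For the OOM row above, a minimal sentence-chunking sketch using only the standard library:
import re

def chunk_sentences(text: str, max_chars: int = 300) -> list[str]:
    # Split on sentence-ending punctuation, then pack into bounded chunks
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks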
Sources: OpenVoice GitHub | MyShell.ai | OpenVoice paper | Internal benchmarks (RTX 4090, M4 Max).