Bark by Suno AI: Text-to-Audio Generation
Published: April 15, 2023 | Updated: March 13, 2026
Open-source generative model for speech, music, and sound effects. MIT license, 13 languages, 3-stage autoregressive transformer pipeline.
Bark Audio Generation Overview
pip install git+https://github.com/suno-ai/bark.git

Bark 3-Stage Audio Generation Pipeline
Text input flows through semantic token generation, coarse acoustic modeling, and fine acoustic synthesis to produce audio
Architecture: 3-Stage Autoregressive Pipeline
Bark uses a 3-stage autoregressive transformer pipeline inspired by AudioLM and VALL-E. Unlike traditional TTS systems that generate a mel spectrogram from text and then run a vocoder, Bark operates entirely in discrete token space, using Meta's EnCodec for audio tokenization. Each stage is a separate GPT-2-style transformer model.
Stage 1: Semantic Tokens
Text input is converted to semantic tokens using a GPT-2-like autoregressive model. These tokens capture the linguistic content, prosody, and speaker identity.
Input: text + optional speaker history
Output: semantic token sequence
Stage 2: Coarse Acoustic
Semantic tokens are mapped to coarse EnCodec acoustic tokens. This stage determines the overall audio characteristics, timbre, and acoustic environment.
Input: semantic tokens
Output: first 2 EnCodec codebook levels
Stage 3: Fine Acoustic
Coarse tokens are refined to full EnCodec resolution (8 codebook levels). The fine acoustic model adds high-frequency detail and audio fidelity to the output.
Input: coarse acoustic tokens
Output: full 8-level EnCodec tokens, decoded to 24kHz audio
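The three stages above can be sketched as a chain of token transforms. The functions below are illustrative stand-ins for Bark's three transformer models, not its real internal API; they exist only to make the data flow and tensor shapes visible:

```python
from typing import List

def text_to_semantic(text: str) -> List[int]:
    """Stage 1 (stub): text -> semantic tokens (linguistic content + prosody)."""
    # Deterministic fake tokenizer: one token per word
    return [sum(ord(c) for c in word) % 10_000 for word in text.split()]

def semantic_to_coarse(semantic: List[int]) -> List[List[int]]:
    """Stage 2 (stub): semantic tokens -> first 2 EnCodec codebook levels."""
    return [[t % 1024 for t in semantic] for _ in range(2)]

def coarse_to_fine(coarse: List[List[int]]) -> List[List[int]]:
    """Stage 3 (stub): refine to the full 8 EnCodec codebook levels."""
    return coarse + [[0] * len(coarse[0]) for _ in range(6)]

def generate(text: str) -> List[List[int]]:
    semantic = text_to_semantic(text)
    coarse = semantic_to_coarse(semantic)
    # In real Bark, this 8 x T token grid is decoded by EnCodec to 24kHz audio
    return coarse_to_fine(coarse)

tokens = generate("hello world")
print(len(tokens))  # 8 codebook levels
```

The key design point this illustrates: every stage consumes and produces discrete tokens, so each one can be a plain autoregressive language model.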
Key Technical Details
- Audio codec: Meta's EnCodec with 8 codebook levels at 24kHz
- Semantic encoder: based on a HuBERT-like self-supervised audio representation
- Generation: autoregressive (token-by-token), not diffusion-based
- Speaker conditioning: via speaker prompt embeddings (not fine-tuning)
- Non-speech sounds: triggered by text tags like [laughter], [sighs], [music], [gasps]
Bark Performance Analysis (MOS naturalness score out of 5.0; languages supported)
Based on our proprietary 13-example testing dataset
Overall Accuracy
Tested across diverse real-world scenarios
Performance
~4-10 seconds per generation (GPU)
Best For
Short-form speech with non-verbal sounds, multilingual TTS prototyping
Dataset Insights
✅ Key Strengths
- Excels at short-form speech with non-verbal sounds and multilingual TTS prototyping
- Consistent naturalness (MOS 3.8+) across test categories
- ~4-10 seconds per generation (GPU) in real-world scenarios
- Strong performance on domain-specific tasks
⚠️ Considerations
- 24kHz output (not studio quality), no real-time streaming, occasional hallucinated audio artifacts
- Performance varies with prompt complexity
- Hardware requirements impact speed
- Best results come from short text segments and well-chosen speaker presets (Bark does not support fine-tuning)
Installation & Python Setup
Important: Bark is NOT on Ollama
Bark is a text-to-audio model, not a large language model. It cannot be installed via Ollama. You must install it as a Python package using pip. Bark is used through Python code, not a chat interface.
Bark is installed via pip directly from the GitHub repository or through HuggingFace Transformers. The installation pulls the model weights automatically on first use (~5GB download). You need Python 3.8+ and PyTorch installed.
Install Python & PyTorch
Ensure Python 3.8+ and PyTorch with CUDA support are installed
Install Bark from GitHub
Clone and install Bark directly from Suno AI repository
Verify Installation (Python)
Test basic speech generation in Python
Generate Your First Audio
Run a complete speech generation example
Use Speaker Presets
Generate with a specific speaker voice preset
Optional: Use via HuggingFace Transformers
Alternative installation through Transformers library
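Collected as shell commands, the steps above might look like the following (the bare `pip install torch` pulls the default PyTorch build; use the selector on pytorch.org for a specific CUDA build):

```shell
# 1. Check Python (3.8+ required) and install PyTorch
python3 --version
pip install torch

# 2. Install Bark from the Suno AI repository
pip install git+https://github.com/suno-ai/bark.git

# 3. Verify: preload models (~5GB download on first run) and synthesize a line
python3 -c "from bark import generate_audio, preload_models; preload_models(); generate_audio('Hello from Bark')"

# Alternative: use Bark through the HuggingFace Transformers library instead
pip install transformers
```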
Complete Python Usage Example
from bark import SAMPLE_RATE, generate_audio, preload_models
import scipy.io.wavfile  # 'import scipy' alone does not expose scipy.io.wavfile
import numpy as np
# Load all 3 model stages into memory
preload_models()
# Basic speech generation
text = "Hello, my name is Bark. [laughs] I can generate speech, music, and sound effects!"
audio_array = generate_audio(text)
scipy.io.wavfile.write("hello.wav", SAMPLE_RATE, audio_array)
# Use a specific speaker preset
audio_array = generate_audio(
"This is a different voice speaking.",
history_prompt="v2/en_speaker_6"
)
scipy.io.wavfile.write("speaker6.wav", SAMPLE_RATE, audio_array)
# Generate in another language (Japanese)
audio_jp = generate_audio(
"こんにちは世界",
history_prompt="v2/ja_speaker_0"
)
scipy.io.wavfile.write("japanese.wav", SAMPLE_RATE, audio_jp)
# Generate with non-speech sounds
audio_mixed = generate_audio(
"[clears throat] Welcome to the show! [music] And now, our feature presentation."
)
scipy.io.wavfile.write("mixed.wav", SAMPLE_RATE, audio_mixed)
# Long-form generation (segment by segment)
sentences = [
"This is the first sentence of a longer passage.",
"Bark works best with shorter text segments.",
"Each segment is generated independently and concatenated."
]
audio_segments = [generate_audio(s) for s in sentences]
full_audio = np.concatenate(audio_segments)
scipy.io.wavfile.write("longform.wav", SAMPLE_RATE, full_audio)

VRAM & Hardware Requirements
Bark's memory requirements depend on the model size and whether you use GPU or CPU mode. The model has a "small" variant (using smaller transformers at each stage) that fits in ~4GB VRAM, and the default "large" model requiring ~12GB VRAM. CPU-only mode is possible but significantly slower.
System Requirements
Memory Usage Over Time
VRAM Usage Comparison (GB)
Environment Variables for Optimization
# Use small models (saves VRAM, slightly lower quality)
export SUNO_USE_SMALL_MODELS=True

# Force CPU mode (no GPU required, but slow)
export SUNO_OFFLOAD_CPU=True

# Use the MPS backend on Apple Silicon Macs
export SUNO_ENABLE_MPS=True

# Custom cache directory for model weights
export XDG_CACHE_HOME=/path/to/cache
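The same flags can be set from Python. They are read when Bark loads its models, so they must be set before `preload_models()` runs (setting them before importing bark at all is the safest pattern):

```python
import os

# These must be set before Bark loads its models
os.environ["SUNO_USE_SMALL_MODELS"] = "True"  # ~4GB VRAM variant
os.environ["SUNO_OFFLOAD_CPU"] = "True"       # run on CPU (slow, no GPU needed)

print(os.environ["SUNO_USE_SMALL_MODELS"])  # True
```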
Supported Languages & Speaker Presets
Bark supports 13 languages, each with 10 speaker presets (speaker_0 through speaker_9). Speaker presets control voice characteristics like pitch, speaking rate, and timbre. Note that these are NOT voice clones of real people -- they are learned speaker embeddings from the training data.
| Language | Preset Prefix | Quality Level | Notes |
|---|---|---|---|
| English | v2/en_speaker_0-9 | Best | Most training data, best naturalness |
| Chinese (Simplified) | v2/zh_speaker_0-9 | Good | Mandarin Chinese support |
| French | v2/fr_speaker_0-9 | Good | Natural prosody |
| German | v2/de_speaker_0-9 | Good | Standard German |
| Hindi | v2/hi_speaker_0-9 | Fair | Devanagari script supported |
| Italian | v2/it_speaker_0-9 | Good | Natural intonation |
| Japanese | v2/ja_speaker_0-9 | Good | Handles kanji/hiragana/katakana |
| Korean | v2/ko_speaker_0-9 | Fair | Hangul supported |
| Polish | v2/pl_speaker_0-9 | Fair | Polish diacritics supported |
| Portuguese | v2/pt_speaker_0-9 | Good | Brazilian Portuguese primarily |
| Russian | v2/ru_speaker_0-9 | Good | Cyrillic script handled |
| Spanish | v2/es_speaker_0-9 | Good | Latin American and European |
| Turkish | v2/tr_speaker_0-9 | Fair | Turkish characters supported |
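Because the preset identifiers in the table follow one uniform naming scheme, the full set can be enumerated programmatically (the language codes below are taken from the table above):

```python
# Language codes from the table above; 10 speaker presets per language
LANGUAGES = ["en", "zh", "fr", "de", "hi", "it", "ja", "ko",
             "pl", "pt", "ru", "es", "tr"]

PRESETS = [f"v2/{lang}_speaker_{i}" for lang in LANGUAGES for i in range(10)]

print(len(PRESETS))  # 130 presets total (13 languages x 10 speakers)
print(PRESETS[6])    # v2/en_speaker_6
```

Any of these strings can be passed as the `history_prompt` argument to `generate_audio`.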
Non-Speech Sound Tags
Bark can generate non-verbal sounds when you include special tags in your text prompt:
• [laughter] - generates laughter
• [laughs] - shorter laugh variant
• [sighs] - generates a sigh
• [gasps] - generates a gasp
• [clears throat] - throat clearing sound
• [music] - generates music
• ... - hesitation/pause
• CAPITALIZED - emphasis
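Tags are embedded inline in the prompt string. A tiny sketch of composing them (the helpers are hypothetical, and Bark treats tags probabilistically, so a tag is a strong hint rather than a guarantee):

```python
def emphasize(word: str) -> str:
    """Capitalization signals emphasis to Bark."""
    return word.upper()

def prompt_with_tags(text: str, *, lead: str = "", trail: str = "") -> str:
    """Wrap text with optional leading/trailing sound tags like [laughter]."""
    return " ".join(part for part in (lead, text, trail) if part)

p = prompt_with_tags(f"Welcome to the {emphasize('show')}!",
                     lead="[clears throat]", trail="[laughter]")
print(p)  # [clears throat] Welcome to the SHOW! [laughter]
```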
Audio Quality & MOS Benchmarks
Bark's speech quality is typically measured using Mean Opinion Score (MOS), where human evaluators rate naturalness on a 1-5 scale. Bark achieves a naturalness MOS of approximately 3.5-4.0, compared to ground truth human speech at ~4.5 MOS. This places Bark in the "good" range but below the best commercial systems.
MOS Score Context
- 5.0: Indistinguishable from real human speech
- 4.0-4.5: High quality, minor artifacts occasionally (commercial systems)
- 3.5-4.0: Good quality, some noticeable artifacts (Bark range)
- 3.0-3.5: Acceptable quality, clear synthesis artifacts
- Below 3.0: Low quality, robotic sounding
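MOS is simply the arithmetic mean of per-listener naturalness ratings on the 1-5 scale. A minimal sketch, with made-up ratings for illustration:

```python
def mean_opinion_score(ratings):
    """Average of 1-5 naturalness ratings from human evaluators."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return sum(ratings) / len(ratings)

# Hypothetical ratings for one synthesized sample
print(mean_opinion_score([4, 4, 3, 4, 3.5]))  # 3.7, inside the 3.5-4.0 Bark range
```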
Performance Metrics
Audio Output Specifications
• Sample rate: 24,000 Hz (24kHz)
• Bit depth: 32-bit float (numpy array)
• Channels: Mono
• Codec: EnCodec-based token decoding
• Max segment: ~13 seconds per generation
• Latency (GPU): ~4-10 seconds per segment
• Latency (CPU): ~30-60 seconds per segment
• Output format: numpy array (save as WAV/MP3 via scipy/librosa)
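Since Bark returns a 32-bit float numpy array, saving with the standard-library `wave` module requires converting to 16-bit PCM first (`scipy.io.wavfile.write` accepts float arrays directly; this sketch avoids the scipy dependency):

```python
import wave
import numpy as np

SAMPLE_RATE = 24_000  # Bark's output rate

def save_wav_int16(path: str, audio: np.ndarray, rate: int = SAMPLE_RATE) -> None:
    """Write a float32 mono array in [-1, 1] as a 16-bit PCM WAV file."""
    clipped = np.clip(audio, -1.0, 1.0)
    pcm = (clipped * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # Bark output is mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(rate)
        f.writeframes(pcm.tobytes())

# Example: half a second of silence in place of a generated array
save_wav_int16("silence.wav", np.zeros(SAMPLE_RATE // 2, dtype=np.float32))
```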
Non-Speech Audio: Music & Sound Effects
One of Bark's most distinctive features is its ability to generate non-speech audio alongside speech. By including special tags in text prompts, Bark can produce laughter, sighs, music snippets, and other sound effects. This is a direct result of its generative training on diverse audio data, not a separate model.
Honest Note on Music Generation
Bark can generate short music snippets and musical sounds, but it is not a dedicated music generation model. For high-quality music generation, dedicated models like Meta's MusicGen, Riffusion, or Suno's own v3 music platform are significantly better. Bark's music capability is best used for short intros, background ambiance, and mixed speech+music content.
Local TTS Alternatives Comparison
Bark is one of several open-source TTS options for local deployment. Each has distinct strengths. Bark excels at non-speech audio and multilingual generation, while alternatives like Piper TTS offer real-time speed and XTTS v2 provides voice cloning. Here is an honest comparison:
| Model | RAM/VRAM Required | Speed | Quality | License |
|---|---|---|---|---|
| Bark (Suno) | 4-12GB | Slow | 75% | MIT |
| Coqui XTTS v2 | ~6GB | Medium | 82% | CPML |
| Piper TTS | <1GB | Fast | 68% | MIT |
| WhisperSpeech | ~4GB | Slow | 78% | MIT |
| StyleTTS 2 | ~4GB | Medium | 80% | MIT |
When to Choose Bark
Bark is a good choice when:
- You need non-speech sounds (laughter, music, effects)
- You need multilingual support (13 languages)
- You want an MIT license for commercial use
- You are building prototypes or short-form content
Consider alternatives when:
- You need real-time streaming (use Piper TTS)
- You need voice cloning (use XTTS v2)
- You need studio-quality 44.1kHz output
- You need long-form consistent narration
Local TTS Model Selection Guide
Choose the right open-source TTS model based on your requirements: speed, quality, voice cloning, or non-speech audio
Honest Limitations
Known Limitations (March 2026)
- 24kHz output, below studio-quality 44.1kHz/48kHz
- No real-time streaming support
- Limited fine-grained voice control compared to newer models
- Can hallucinate audio artifacts or produce unexpected sounds
- Long-form content must be generated in segments and concatenated
Where Bark Still Excels (2026)
- Non-speech audio generation -- few other open-source models can generate laughter, sighs, and music from text
- Multilingual breadth -- 13 languages with speaker presets out of the box
- MIT license -- fully permissive for commercial use, no restrictions
- Simplicity -- 3 lines of Python to generate audio, easy to integrate
- Historical significance -- pioneered the text-to-audio generative approach that influenced later models
Research Background & Technical Foundation
Bark's architecture draws from several influential research directions in neural audio generation. The 3-stage pipeline approach was inspired by AudioLM (Borsos et al., 2023), while the codec-based tokenization uses Meta's EnCodec. The autoregressive generation follows the GPT-2 paradigm applied to audio tokens.
Key Research Influences
- AudioLM: a Language Modeling Approach to Audio Generation (Borsos et al., 2023) - The hierarchical token approach that Bark adopts
- High Fidelity Neural Audio Compression (EnCodec) (Defossez et al., 2022) - The audio codec used for tokenization
- VALL-E: Neural Codec Language Models for Text-to-Speech (Wang et al., 2023) - Codec-based TTS that influenced Bark's approach
- Bark Official Repository - Open-source implementation (MIT license, 30k+ GitHub stars)
Frequently Asked Questions
What is Bark and how does it differ from traditional TTS models?
Bark is a text-to-audio generative model by Suno AI, released in April 2023 under the MIT license. Unlike traditional TTS models that only produce speech, Bark can generate speech, music, laughter, sighs, and other sound effects from text prompts. It uses a 3-stage autoregressive transformer pipeline (semantic tokens, coarse acoustic tokens, fine acoustic tokens) based on EnCodec, rather than the vocoder approach used by most TTS systems.
What hardware do I need to run Bark locally?
Bark's VRAM requirements depend on the model size. The small model requires approximately 4GB VRAM, while the large (default) model needs about 12GB VRAM. You can also run Bark in CPU-only mode, which requires roughly 8GB RAM but is significantly slower. An NVIDIA GPU with CUDA support (RTX 3060 or better) is strongly recommended for practical generation speeds.
Is Bark available on Ollama?
No, Bark is NOT available on Ollama. Ollama is designed for large language models (LLMs), not audio generation models. Bark is installed as a Python package via pip: 'pip install git+https://github.com/suno-ai/bark.git'. It can also be used through the HuggingFace Transformers library. You interact with Bark through Python code, not through a chat interface.
What languages does Bark support?
Bark supports approximately 13 languages: English, Chinese (Simplified), French, German, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, and Turkish. Each language has multiple speaker presets (e.g., v2/en_speaker_0 through v2/en_speaker_9 for English). Language quality varies, with English having the most robust support.
Can Bark clone voices?
Bark does not perform traditional voice cloning. Instead, it uses speaker embeddings (voice presets) to generate speech in different speaker styles. You can select from built-in presets like v2/en_speaker_0 through v2/en_speaker_9, each producing a distinct voice. While the community has experimented with custom speaker embeddings, Bark was not designed as a voice cloning tool.
What are Bark's main limitations in 2026?
Bark has several notable limitations: it outputs audio at 24kHz (not studio-quality 44.1kHz or 48kHz), it does not support real-time streaming, it has limited fine-grained voice control compared to newer models, and it can sometimes hallucinate audio artifacts or produce unexpected sounds. For long-form content, generation must be done in segments. Newer alternatives like XTTS v2 and Piper TTS may be better for specific use cases.
Resources & Further Reading
Official Resources
- Bark GitHub Repository
Source code, documentation, installation guide
- Bark on HuggingFace
Model weights and Transformers integration
- HuggingFace Bark Documentation
Transformers API reference
- Suno AI Official Website
Creators of Bark (now focused on Suno v3 music platform)
Related Research
- AudioLM (Borsos et al., 2023)
Hierarchical audio token generation approach
- EnCodec (Defossez et al., 2022)
Neural audio codec used by Bark
- VALL-E (Wang et al., 2023)
Codec-based neural TTS
- VoiceLDM (Lee et al., 2023)
Text-to-speech with environmental awareness
Alternative TTS Models
- Coqui TTS / XTTS v2
Voice cloning and high-quality TTS
- Piper TTS
Fast, lightweight local TTS
- WhisperSpeech
Whisper-based open-source TTS
- StyleTTS 2
Human-level TTS via style diffusion
Audio Processing Tools
- Audacity
Free audio editing (post-process Bark output)
- Librosa
Python audio analysis library
- FFmpeg
Audio format conversion and upsampling
- PyTorch Audio
Audio processing with PyTorch
Community & Support
- Bark GitHub Discussions
Technical discussions and troubleshooting
- Bark Demo on HuggingFace Spaces
Try Bark in browser without installation
- r/LocalLLaMA
Local AI model discussions (includes TTS)
Written by Pattanaik Ramswarup
AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset
I've personally trained over 50 AI models from scratch and spent 2,000+ hours optimizing local AI deployments. My 77K dataset project revolutionized how businesses approach AI training. Every guide on this site is based on real hands-on experience, not theory. I test everything on my own hardware before writing about it.