WhisperX Guide (2026): Word-Level Timestamps + Speaker Diarization for Local STT
WhisperX is the right tool when plain transcription is not enough. Built on Faster-Whisper, it adds forced phoneme alignment for sub-100ms word-level timestamps and pyannote-based speaker diarization. For podcasts, interviews, meeting transcripts, video subtitles, and any workflow where you need to know exactly who said what when, WhisperX is the production-grade open-source answer.
This guide covers everything: installation, getting the Hugging Face token for pyannote, basic transcription with diarization, output formats (SRT/VTT/JSON), batching for long-form audio, multi-language support, integration with video pipelines, accuracy tuning, and the cases where WhisperX is overkill vs. where it's essential.
Table of Contents
- What WhisperX Is
- Word-Level Timestamps via Forced Alignment
- Speaker Diarization via Pyannote
- Hardware Requirements
- Installation
- Hugging Face Token Setup
- Your First Transcription with Speakers
- CLI Usage
- Python API
- Output Formats: SRT, VTT, JSON
- Multi-Language Support
- Setting Speaker Counts
- Batched Long-Form Inference
- Integration with Video Pipelines
- Accuracy Tuning
- Performance Benchmarks
- WhisperX vs Commercial Services
- Tuning Recipes
- Troubleshooting
- FAQ
What WhisperX Is {#what-it-is}
WhisperX (m-bain/whisperX) is a Python library + CLI that wraps Faster-Whisper and adds:
- Forced phoneme alignment via wav2vec2 → accurate word-level timestamps (sub-100 ms vs Whisper's ~1 sec drift)
- Speaker diarization via pyannote.audio 3.1 → "who said what when"
- Batched inference with VAD-based segmentation → higher throughput on long audio
- Output formatters for SRT, VTT, TSV, JSON
Project: github.com/m-bain/whisperX. License: Apache 2.0. Active maintenance.
Word-Level Timestamps via Forced Alignment {#alignment}
Whisper natively emits timestamps but only at the segment level (typically 5-30 seconds per segment). For word-level timestamps, Whisper interpolates — and the result drifts by hundreds of milliseconds.
WhisperX runs the transcript through a wav2vec2 phoneme alignment model that maps each word to its precise audio position. Result: timestamps accurate to <100 ms, sufficient for karaoke-style word highlighting in subtitles.
The alignment step adds ~10-20% to total processing time but is essential for video subtitles, dubbed content alignment, and any time-sensitive transcript use.
Speaker Diarization via Pyannote {#diarization}
Pyannote 3.1 is the open-source state-of-the-art for speaker diarization. WhisperX:
- Runs pyannote on the audio → speaker turn boundaries
- Aligns Whisper transcript words with diarization output
- Assigns each word to a speaker label (SPEAKER_00, SPEAKER_01, ...)
For 2-3 clean speakers: 90-95% accuracy. For 4-6 speakers: 80-88%. For crowded meetings with crosstalk: 70-80%.
You can map labels to real names post-hoc: SPEAKER_00 → Alice, SPEAKER_01 → Bob based on the first known voice sample.
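A minimal sketch of that renaming, applied to the result dict produced by the Python API shown later (the name mapping itself is an assumption you supply):
# Illustrative post-hoc renaming of diarization labels.
# "result" is the dict returned by whisperx.assign_word_speakers (see the Python API section).
name_map = {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"}
for segment in result["segments"]:
    if "speaker" in segment:
        segment["speaker"] = name_map.get(segment["speaker"], segment["speaker"])
    for word in segment.get("words", []):
        if "speaker" in word:
            word["speaker"] = name_map.get(word["speaker"], word["speaker"])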
Hardware Requirements {#requirements}
| Hardware | Transcription (RTF) | + Alignment | + Diarization |
|---|---|---|---|
| RTX 4090 (FP16) | 72x | 60x | 30x |
| RTX 4070 (FP16) | 50x | 40x | 22x |
| RTX 3060 (INT8) | 35x | 28x | 12x |
| Apple M4 Max (MPS) | 25x | 20x | 8x |
| Ryzen 7 7700X (CPU) | 10x | 8x | 0.5x (slow) |
VRAM: 6 GB minimum, 8 GB+ comfortable. CPU diarization is impractical except for very small workloads.
Installation {#installation}
python3.11 -m venv ~/venvs/whisperx
source ~/venvs/whisperx/bin/activate
# CUDA 12.4 + cuDNN
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install whisperx
pip install nvidia-cudnn-cu12==9.0.0 # if not already installed
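A quick sanity check after installing (a minimal sketch, assuming the venv above is active):
# Both imports should succeed; True means the CUDA build of torch sees your GPU.
import torch
import whisperx  # noqa: F401
print(torch.__version__, torch.cuda.is_available())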
For CPU only:
pip install whisperx
# torch CPU is installed by default
For Docker:
docker run --gpus all --rm -v $(pwd):/workspace ghcr.io/m-bain/whisperx:latest \
whisperx /workspace/audio.mp3 --model large-v3 --hf_token $HF_TOKEN
Hugging Face Token Setup {#hf-token}
Diarization needs a Hugging Face token because pyannote models are gated.
- Visit huggingface.co/pyannote/speaker-diarization-3.1 → click "Agree and access repository"
- Visit huggingface.co/pyannote/segmentation-3.0 → same
- Go to huggingface.co/settings/tokens → create a Read token
- Export:
export HF_TOKEN=hf_xxxxxxxxxxxxxxx
Pass to WhisperX via --hf_token (CLI) or use_auth_token (Python).
For transcription-only (no diarization), no token needed.
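If you want to confirm access before kicking off a long job, a small check with huggingface_hub (pulled in as a dependency of pyannote) works; this is a sketch, not a WhisperX feature:
# Raises an error if the license was not accepted or the token is invalid.
import os
from huggingface_hub import model_info
for repo in ("pyannote/speaker-diarization-3.1", "pyannote/segmentation-3.0"):
    model_info(repo, token=os.environ["HF_TOKEN"])
    print("access OK:", repo)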
Your First Transcription with Speakers {#first-transcription}
whisperx audio.mp3 \
--model large-v3 \
--diarize \
--hf_token $HF_TOKEN \
--min_speakers 2 \
--max_speakers 4 \
--output_format all
Outputs in current directory:
- audio.srt — subtitles with speaker labels
- audio.vtt — WebVTT for HTML5 video
- audio.json — full structured output (words, timestamps, speakers)
- audio.tsv — tab-separated for spreadsheet import
- audio.txt — plain text
CLI Usage {#cli}
# Basic transcription only (fast)
whisperx audio.mp3 --model large-v3
# Word timestamps (alignment)
whisperx audio.mp3 --model large-v3 --align
# Full pipeline: transcription + alignment + diarization
whisperx audio.mp3 \
--model large-v3 \
--diarize \
--hf_token $HF_TOKEN \
--min_speakers 2 --max_speakers 4 \
--language en \
--batch_size 16 \
--compute_type float16 \
--output_format srt \
--highlight_words True
Useful flags:
- --language en — skip language detection (faster, more accurate)
- --batch_size 16 — VAD-batched inference (RTX 4090: bump to 32)
- --compute_type float16 — or int8_float16 for tighter VRAM
- --highlight_words True — per-word highlights in SRT output
- --print_progress True — progress bar
Python API {#python-api}
import whisperx
import os
HF_TOKEN = os.environ["HF_TOKEN"]  # token for the gated pyannote models
device = "cuda"
audio = whisperx.load_audio("audio.mp3")
# 1. Transcribe with Whisper
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)
# 2. Align (word-level timestamps)
align_model, metadata = whisperx.load_align_model(
language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
# 3. Diarize
diarize_model = whisperx.DiarizationPipeline(
use_auth_token=HF_TOKEN, device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=4)
result = whisperx.assign_word_speakers(diarize_segments, result)
# Result has: result["segments"] with .words[] containing .word .start .end .speaker
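To consume that structure, a short follow-up sketch that prints a speaker-attributed, word-timed transcript:
# Print "SPEAKER [start-end]: word" for every aligned word.
for segment in result["segments"]:
    for word in segment.get("words", []):
        if "start" not in word:
            continue  # words that could not be aligned have no timestamps
        speaker = word.get("speaker", "UNKNOWN")
        print(f'{speaker} [{word["start"]:.2f}-{word["end"]:.2f}]: {word["word"]}')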
Output Formats: SRT, VTT, JSON {#output-formats}
SRT (subtitles)
1
00:00:00,520 --> 00:00:03,840
[SPEAKER_00] Welcome to the show, today we are talking about local AI.
2
00:00:04,120 --> 00:00:07,200
[SPEAKER_01] Thanks for having me. It is great to be here.
VTT (HTML5 video)
WEBVTT
00:00.520 --> 00:03.840
<v SPEAKER_00>Welcome to the show, today we are talking about local AI.</v>
00:04.120 --> 00:07.200
<v SPEAKER_01>Thanks for having me. It is great to be here.</v>
JSON (full structured)
{
"segments": [{
"start": 0.52, "end": 3.84,
"text": "Welcome to the show...",
"speaker": "SPEAKER_00",
"words": [
{ "word": "Welcome", "start": 0.52, "end": 0.91, "speaker": "SPEAKER_00" },
{ "word": "to", "start": 0.94, "end": 1.06, "speaker": "SPEAKER_00" },
{ "word": "the", "start": 1.08, "end": 1.18, "speaker": "SPEAKER_00" }
]
}],
"language": "en"
}
The JSON drives downstream applications: TikTok-style word highlights, video editor sync, searchable archives.
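As an example, a minimal sketch that flattens audio.json into per-word cues for caption rendering (the cue shape here is an assumption, not a WhisperX format):
# Flatten the JSON output into (start, end, speaker, word) tuples.
import json
with open("audio.json", encoding="utf-8") as f:
    data = json.load(f)
cues = [
    (w["start"], w["end"], w.get("speaker", "UNKNOWN"), w["word"])
    for seg in data["segments"]
    for w in seg.get("words", [])
    if "start" in w
]
print(cues[:5])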
Multi-Language Support {#multilingual}
WhisperX inherits Whisper's 99-language support. Alignment models exist for the major languages:
| Language | Alignment model |
|---|---|
| English | wav2vec2-base-960h |
| German | wav2vec2-large-xlsr-53-german |
| Spanish | wav2vec2-large-xlsr-53-spanish |
| French | wav2vec2-large-xlsr-53-french |
| Italian, Portuguese, Russian, ... | language-specific wav2vec2 |
| Chinese / Japanese / Korean | smaller alignment models |
Pass --language es etc. WhisperX auto-loads the matching alignment model.
For languages without an alignment model, transcription works but word timestamps fall back to interpolated.
Diarization is language-agnostic.
Setting Speaker Counts {#speaker-counts}
If you know the speaker count, tell pyannote:
whisperx audio.mp3 --diarize --min_speakers 2 --max_speakers 2 --hf_token $HF_TOKEN
Hard limits dramatically improve accuracy. For two-person podcasts, set min=max=2. For meetings of unknown size, leave defaults (auto-detect, typically 1-10).
If you provide just one bound (--min_speakers 3), pyannote uses it as a hint but may produce more or fewer.
Batched Long-Form Inference {#batched}
For long audio (1+ hours):
whisperx audio.mp3 \
--model large-v3 \
--batch_size 32 \
--diarize --hf_token $HF_TOKEN
VAD pre-segmentation splits audio at silences, then transcribes 32 segments in parallel on the GPU. On an RTX 4090 with large-v3 this keeps transcription around 70x real-time before diarization (see the benchmark table below).
For massive archives (1,000+ hours), use the Python API with explicit chunking and async pyannote calls.
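A minimal sequential sketch of that pattern (directory name, speaker bounds, and batch size are placeholders; the async pyannote variant is left out for brevity):
# Archive processing: load models once, loop over files, write one JSON per input.
import json
import os
from pathlib import Path
import whisperx
device = "cuda"
model = whisperx.load_model("large-v3", device, compute_type="float16")
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_TOKEN"], device=device)
for path in sorted(Path("archive").glob("*.wav")):
    audio = whisperx.load_audio(str(path))
    result = model.transcribe(audio, batch_size=32)
    align_model, metadata = whisperx.load_align_model(result["language"], device)
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)
    diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=6)
    result = whisperx.assign_word_speakers(diarize_segments, result)
    path.with_suffix(".json").write_text(json.dumps(result), encoding="utf-8")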
Integration with Video Pipelines {#video}
For video subtitle generation:
# Extract audio from video
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 audio.wav
# Transcribe with WhisperX
whisperx audio.wav --model large-v3 --align --output_format srt
# Burn subtitles back into video
ffmpeg -i video.mp4 -vf "subtitles=audio.srt" video_subtitled.mp4
For a fully automated pipeline:
# auto-subtitle.sh
#!/bin/bash
ffmpeg -i "$1" -vn -ar 16000 -ac 1 audio.wav
whisperx audio.wav --model large-v3 --align --language en --output_format srt
ffmpeg -i "$1" -vf "subtitles=audio.srt:force_style='FontSize=22'" "${1%.*}_sub.mp4"
rm audio.wav audio.srt
Accuracy Tuning {#accuracy}
To improve diarization quality:
- Clean audio first — run Demucs or Audacity noise reduction
- Set speaker counts — min_speakers/max_speakers
- Use a higher-quality model — large-v3, not turbo (turbo trades accuracy for speed and handles difficult multi-speaker audio less well)
- Adjust VAD sensitivity — pass a custom vad_options dict
- Post-process — manually correct SPEAKER_X labels with sed/awk; cluster similar voices across recordings with voice fingerprinting
For mission-critical accuracy (court records, depositions): consider commercial Pyannote Premium or human-in-the-loop verification.
Performance Benchmarks {#benchmarks}
1-hour podcast, large-v3 model, RTX 4090:
| Pipeline | Time | RTF |
|---|---|---|
| Transcription only | 50 sec | 72x |
| + Alignment | 70 sec | 51x |
| + Diarization | 120 sec | 30x |
For long-form (10+ hours): batch processing scales nearly linearly, so 10 hours of audio takes ~20 minutes.
WhisperX vs Commercial Services {#vs-commercial}
| Property | WhisperX | Otter.ai | Rev | AssemblyAI |
|---|---|---|---|---|
| Cost | Free + GPU | $20/mo | $0.25/min | $0.65/hour |
| Privacy | Local | Cloud | Cloud | Cloud |
| Diarization accuracy (2-speaker) | 90-95% | 92-96% | 95% | 94% |
| Real-time | No | Yes | No | Yes |
| Custom vocabulary | Yes (post-edit) | Yes | Yes | Yes |
| Language coverage | 99 | 6 | 30+ | 12 |
For most local podcasters, journalists, researchers: WhisperX matches commercial services on quality with full privacy and zero per-minute cost.
Tuning Recipes {#tuning}
Podcast / interview transcription
whisperx interview.mp3 \
--model large-v3 --align --diarize \
--min_speakers 2 --max_speakers 3 \
--language en \
--hf_token $HF_TOKEN \
--output_format srt
YouTube auto-captioning
yt-dlp -x --audio-format wav "https://youtu.be/..."
whisperx *.wav --model large-v3 --align --output_format vtt
Meeting transcription with up to 8 speakers
whisperx meeting.wav \
--model large-v3 --align --diarize \
--max_speakers 8 \
--hf_token $HF_TOKEN \
--output_format json
Multi-language interview
whisperx multilingual.mp3 \
--model large-v3 --align --diarize \
--hf_token $HF_TOKEN
# Auto-detects primary language; alignment uses matching wav2vec2
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| "Could not download pyannote model" | HF token / license not accepted | Visit pyannote pages, accept license |
| Alignment model not found | Language without an aligner | Accept interpolated word timestamps, or pass a compatible wav2vec2 model via --align_model |
| Diarization wrong speakers | Voices too similar | Set min/max speakers; clean audio |
| OOM with batch_size 32 | VRAM too tight | Drop to 8 or 16 |
| Slow on CPU | Diarization compute-bound | Use GPU for diarization at minimum |
| WhisperX version mismatch with Faster-Whisper | Incompatible library versions | Pin whisperx and faster-whisper versions in the venv |
FAQ {#faq}
See answers to common WhisperX questions below.
Sources: WhisperX GitHub | pyannote.audio | WhisperX paper | Internal benchmarks RTX 4090, M4 Max.