
WhisperX Guide (2026): Word-Level Timestamps + Speaker Diarization for Local STT

May 1, 2026
22 min read
LocalAimaster Research Team

WhisperX is the right tool when plain transcription is not enough. Built on Faster-Whisper, it adds forced phoneme alignment for sub-100ms word-level timestamps and pyannote-based speaker diarization. For podcasts, interviews, meeting transcripts, video subtitles, and any workflow where you need to know exactly who said what when, WhisperX is the production-grade open-source answer.

This guide covers everything: installation, getting the Hugging Face token for pyannote, basic transcription with diarization, output formats (SRT/VTT/JSON), batching for long-form audio, multi-language support, integration with video pipelines, accuracy tuning, and the cases where WhisperX is overkill vs. where it's essential.

Table of Contents

  1. What WhisperX Is
  2. Word-Level Timestamps via Forced Alignment
  3. Speaker Diarization via Pyannote
  4. Hardware Requirements
  5. Installation
  6. Hugging Face Token Setup
  7. Your First Transcription with Speakers
  8. CLI Usage
  9. Python API
  10. Output Formats: SRT, VTT, JSON
  11. Multi-Language Support
  12. Setting Speaker Counts
  13. Batched Long-Form Inference
  14. Integration with Video Pipelines
  15. Accuracy Tuning
  16. Performance Benchmarks
  17. WhisperX vs Commercial Services
  18. Tuning Recipes
  19. Troubleshooting
  20. FAQ


What WhisperX Is {#what-it-is}

WhisperX (m-bain/whisperX) is a Python library + CLI that wraps Faster-Whisper and adds:

  • Forced phoneme alignment via wav2vec2 → accurate word-level timestamps (sub-100 ms vs Whisper's ~1 sec drift)
  • Speaker diarization via pyannote.audio 3.1 → "who said what when"
  • Batched inference with VAD-based segmentation → higher throughput on long audio
  • Output formatters for SRT, VTT, TSV, JSON

Project: github.com/m-bain/whisperX. License: Apache 2.0. Active maintenance.


Word-Level Timestamps via Forced Alignment {#alignment}

Whisper natively emits timestamps but only at the segment level (typically 5-30 seconds per segment). For word-level timestamps, Whisper interpolates — and the result drifts by hundreds of milliseconds.

WhisperX runs the transcript through a wav2vec2 phoneme alignment model that maps each word to its precise audio position. Result: timestamps accurate to <100 ms, sufficient for karaoke-style word highlighting in subtitles.

The alignment step adds ~10-20% to total processing time but is essential for video subtitles, dubbed content alignment, and any time-sensitive transcript use.


Speaker Diarization via Pyannote {#diarization}

Pyannote 3.1 is the open-source state-of-the-art for speaker diarization. WhisperX:

  1. Runs pyannote on the audio → speaker turn boundaries
  2. Aligns Whisper transcript words with diarization output
  3. Assigns each word to a speaker label (SPEAKER_00, SPEAKER_01, ...)

For 2-3 clean speakers: 90-95% accuracy. For 4-6 speakers: 80-88%. For crowded meetings with crosstalk: 70-80%.

You can map labels to real names post-hoc: SPEAKER_00 → Alice, SPEAKER_01 → Bob based on the first known voice sample.
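
A minimal post-processing sketch of that mapping, assuming the result dict returned by the Python API shown later in this guide; the name mapping itself is hypothetical and something you decide by listening to the recording:

# Hypothetical label-to-name mapping, decided after reviewing the audio.
speaker_names = {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"}

for segment in result["segments"]:
    if "speaker" in segment:
        segment["speaker"] = speaker_names.get(segment["speaker"], segment["speaker"])
    for word in segment.get("words", []):
        if "speaker" in word:
            word["speaker"] = speaker_names.get(word["speaker"], word["speaker"])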



Hardware Requirements {#requirements}

| Hardware | Transcription (RTF) | + Alignment | + Diarization |
|---|---|---|---|
| RTX 4090 (FP16) | 72x | 60x | 30x |
| RTX 4070 (FP16) | 50x | 40x | 22x |
| RTX 3060 (INT8) | 35x | 28x | 12x |
| Apple M4 Max (MPS) | 25x | 20x | 8x |
| Ryzen 7 7700X (CPU) | 10x | 8x | 0.5x (slow) |

VRAM: 6 GB minimum, 8 GB+ comfortable. CPU diarization is impractical except for very small workloads.


Installation {#installation}

python3.11 -m venv ~/venvs/whisperx
source ~/venvs/whisperx/bin/activate

# CUDA 12.4 + cuDNN
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install whisperx
pip install nvidia-cudnn-cu12==9.0.0   # if not already installed

For CPU only:

pip install whisperx
# torch CPU is installed by default

For Docker:

docker run --gpus all --rm -v $(pwd):/workspace ghcr.io/m-bain/whisperx:latest \
    whisperx /workspace/audio.mp3 --model large-v3 --hf_token $HF_TOKEN

Hugging Face Token Setup {#hf-token}

Diarization needs a Hugging Face token because pyannote models are gated.

  1. Visit huggingface.co/pyannote/speaker-diarization-3.1 → click "Agree and access repository"
  2. Visit huggingface.co/pyannote/segmentation-3.0 → same
  3. Go to huggingface.co/settings/tokens → create a Read token
  4. Export: export HF_TOKEN=hf_xxxxxxxxxxxxxxx

Pass to WhisperX via --hf_token (CLI) or use_auth_token (Python).

For transcription-only (no diarization), no token needed.


Your First Transcription with Speakers {#first-transcription}

whisperx audio.mp3 \
    --model large-v3 \
    --diarize \
    --hf_token $HF_TOKEN \
    --min_speakers 2 \
    --max_speakers 4 \
    --output_format all

Outputs in current directory:

  • audio.srt — subtitles with speaker labels
  • audio.vtt — WebVTT for HTML5 video
  • audio.json — full structured output (words, timestamps, speakers)
  • audio.tsv — tab-separated for spreadsheet import
  • audio.txt — plain text

CLI Usage {#cli}

# Basic transcription only (fast)
whisperx audio.mp3 --model large-v3

# Word timestamps (alignment)
whisperx audio.mp3 --model large-v3 --align

# Full pipeline: transcription + alignment + diarization
whisperx audio.mp3 \
    --model large-v3 \
    --diarize \
    --hf_token $HF_TOKEN \
    --min_speakers 2 --max_speakers 4 \
    --language en \
    --batch_size 16 \
    --compute_type float16 \
    --output_format srt \
    --highlight_words True

Useful flags:

  • --language en — skip language detection (faster, more accurate)
  • --batch_size 16 — VAD-batched inference (RTX 4090: bump to 32)
  • --compute_type float16 — switch to int8_float16 for tighter VRAM
  • --highlight_words True — per-word highlights in SRT output
  • --print_progress True — progress bar

Python API {#python-api}

import os
import whisperx

HF_TOKEN = os.environ["HF_TOKEN"]  # token from the Hugging Face setup section above

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# 1. Transcribe with Whisper
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align (word-level timestamps)
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token=HF_TOKEN, device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=4)
result = whisperx.assign_word_speakers(diarize_segments, result)

# result["segments"] is a list of dicts; each carries "words" entries with "word", "start", "end", "speaker"
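
The result is a plain dict, so a quick readability check takes only a few lines. A minimal sketch, assuming the result produced by the pipeline above:

from datetime import timedelta

# Print "[SPEAKER_00 0:00:03] ..." lines for a quick sanity check.
for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")  # key is absent if diarization was skipped
    start = timedelta(seconds=int(seg["start"]))
    print(f"[{speaker} {start}] {seg['text'].strip()}")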

Output Formats: SRT, VTT, JSON {#output-formats}

SRT (subtitles)

1
00:00:00,520 --> 00:00:03,840
[SPEAKER_00] Welcome to the show, today we are talking about local AI.

2
00:00:04,120 --> 00:00:07,200
[SPEAKER_01] Thanks for having me. It is great to be here.

VTT (HTML5 video)

WEBVTT

00:00.520 --> 00:03.840
<v SPEAKER_00>Welcome to the show, today we are talking about local AI.</v>

00:04.120 --> 00:07.200
<v SPEAKER_01>Thanks for having me. It is great to be here.</v>

JSON (full structured)

{
  "segments": [{
    "start": 0.52, "end": 3.84,
    "text": "Welcome to the show...",
    "speaker": "SPEAKER_00",
    "words": [
      { "word": "Welcome", "start": 0.52, "end": 0.91, "speaker": "SPEAKER_00" },
      { "word": "to", "start": 0.94, "end": 1.06, "speaker": "SPEAKER_00" },
      { "word": "the", "start": 1.08, "end": 1.18, "speaker": "SPEAKER_00" }
    ]
  }],
  "language": "en"
}

The JSON drives downstream applications: TikTok-style word highlights, video editor sync, searchable archives.
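
As one example of consuming it, here is a minimal sketch that reads the audio.json written above and flattens it into per-word cues, the raw material for word-highlight subtitles. The occasional word can lack timestamps when alignment could not place it, so those are skipped:

import json

with open("audio.json", encoding="utf-8") as f:
    data = json.load(f)

# Flatten to (start, end, speaker, word) cues, skipping words without timestamps.
cues = [
    (w["start"], w["end"], w.get("speaker", "UNKNOWN"), w["word"])
    for seg in data["segments"]
    for w in seg.get("words", [])
    if "start" in w and "end" in w
]

for start, end, speaker, word in cues[:10]:
    print(f"{start:7.2f} {end:7.2f}  {speaker}  {word}")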


Multi-Language Support {#multilingual}

WhisperX inherits Whisper's 99-language support. Alignment models exist for the major languages:

| Language | Alignment model |
|---|---|
| English | wav2vec2-base-960h |
| German | wav2vec2-large-xlsr-53-german |
| Spanish | wav2vec2-large-xlsr-53-spanish |
| French | wav2vec2-large-xlsr-53-french |
| Italian, Portuguese, Russian, ... | language-specific wav2vec2 |
| Chinese / Japanese / Korean | smaller alignment models |

Pass --language es etc. WhisperX auto-loads the matching alignment model.

For languages without an alignment model, transcription still works, but word timestamps fall back to interpolation.

Diarization is language-agnostic.


Setting Speaker Counts {#speaker-counts}

If you know the speaker count, tell pyannote:

whisperx audio.mp3 --diarize --min_speakers 2 --max_speakers 2 --hf_token $HF_TOKEN

Hard limits dramatically improve accuracy. For two-person podcasts, set min=max=2. For meetings of unknown size, leave defaults (auto-detect, typically 1-10).

If you provide just one bound (--min_speakers 3), pyannote uses it as a hint but may produce more or fewer.


Batched Long-Form Inference {#batched}

For long audio (1+ hours):

whisperx audio.mp3 \
    --model large-v3 \
    --batch_size 32 \
    --diarize --hf_token $HF_TOKEN

VAD pre-segmentation splits audio at silences, then transcribes 32 segments in parallel on GPU. Throughput on RTX 4090 large-v3: ~250x real-time before diarization.

For massive archives (1,000+ hours), use the Python API with explicit chunking and async pyannote calls.
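
A simplified, synchronous sketch of that pattern, which reuses the loaded models across files rather than reloading them per file (the directory layout and batch size are placeholder assumptions; the whisperx calls mirror the Python API section above):

import json
import os
from pathlib import Path

import whisperx

device = "cuda"
HF_TOKEN = os.environ["HF_TOKEN"]

# Load the expensive models once, then iterate over the archive.
model = whisperx.load_model("large-v3", device, compute_type="float16")
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)

for path in sorted(Path("archive").glob("*.wav")):  # placeholder directory
    audio = whisperx.load_audio(str(path))
    result = model.transcribe(audio, batch_size=32)

    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    # Persist each transcript next to its source file (default=str guards non-JSON types).
    path.with_suffix(".json").write_text(json.dumps(result, default=str), encoding="utf-8")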


Integration with Video Pipelines {#video}

For video subtitle generation:

# Extract audio from video
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 audio.wav

# Transcribe with WhisperX
whisperx audio.wav --model large-v3 --align --output_format srt

# Burn subtitles back into video
ffmpeg -i video.mp4 -vf "subtitles=audio.srt" video_subtitled.mp4

For a fully automated pipeline:

#!/bin/bash
# auto-subtitle.sh: extract audio, transcribe, burn subtitles back in
ffmpeg -i "$1" -vn -ar 16000 -ac 1 audio.wav
whisperx audio.wav --model large-v3 --align --language en --output_format srt
ffmpeg -i "$1" -vf "subtitles=audio.srt:force_style='FontSize=22'" "${1%.*}_sub.mp4"
rm audio.wav audio.srt

Accuracy Tuning {#accuracy}

To improve diarization quality:

  1. Clean audio first — run Demucs or Audacity noise reduction
  2. Set speaker counts — min_speakers / max_speakers
  3. Use higher-quality model — large-v3, not turbo (turbo is English-tuned, less accurate for diarization edge cases)
  4. Adjust VAD sensitivity — pass a custom vad_options dict (see the sketch after this list)
  5. Post-process — manually correct SPEAKER_X labels with sed/awk; cluster similar voices across recordings with voice fingerprinting
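
For item 4, a minimal sketch of passing VAD thresholds through the Python API. The option names (vad_onset, vad_offset) reflect WhisperX's internal defaults at the time of writing and may differ between releases, so treat them as an assumption to verify against your installed version:

import whisperx

# Lower thresholds make the VAD more permissive (fewer quiet words dropped);
# higher thresholds cut more aggressively at silences. Defaults are roughly 0.500 / 0.363.
model = whisperx.load_model(
    "large-v3",
    "cuda",
    compute_type="float16",
    vad_options={"vad_onset": 0.35, "vad_offset": 0.25},
)
result = model.transcribe(whisperx.load_audio("audio.mp3"), batch_size=16)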

For mission-critical accuracy (court records, depositions): consider commercial Pyannote Premium or human-in-the-loop verification.


Performance Benchmarks {#benchmarks}

1-hour podcast, large-v3 model, RTX 4090:

| Pipeline | Time | RTF |
|---|---|---|
| Transcription only | 50 sec | 72x |
| + Alignment | 70 sec | 51x |
| + Diarization | 120 sec | 30x |

For long-form (10+ hours): batch processing scales nearly linearly, so 10 hours of audio takes ~20 minutes.


WhisperX vs Commercial Services {#vs-commercial}

| Property | WhisperX | Otter.ai | Rev | AssemblyAI |
|---|---|---|---|---|
| Cost | Free + GPU | $20/mo | $0.25/min | $0.65/hour |
| Privacy | Local | Cloud | Cloud | Cloud |
| Diarization accuracy (2-speaker) | 90-95% | 92-96% | 95% | 94% |
| Real-time | No | Yes | No | Yes |
| Custom vocabulary | Yes (post-edit) | Yes | Yes | Yes |
| Language coverage | 99 | 6 | 30+ | 12 |

For most local podcasters, journalists, researchers: WhisperX matches commercial services on quality with full privacy and zero per-minute cost.


Tuning Recipes {#tuning}

Podcast / interview transcription

whisperx interview.mp3 \
    --model large-v3 --align --diarize \
    --min_speakers 2 --max_speakers 3 \
    --language en \
    --hf_token $HF_TOKEN \
    --output_format srt

YouTube auto-captioning

yt-dlp -x --audio-format wav "https://youtu.be/..."
whisperx *.wav --model large-v3 --align --output_format vtt

Meeting transcription with up to 8 speakers

whisperx meeting.wav \
    --model large-v3 --align --diarize \
    --max_speakers 8 \
    --hf_token $HF_TOKEN \
    --output_format json

Multi-language interview

whisperx multilingual.mp3 \
    --model large-v3 --align --diarize \
    --hf_token $HF_TOKEN
# Auto-detects primary language; alignment uses matching wav2vec2

Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| "Could not download pyannote model" | HF token missing / license not accepted | Visit the pyannote model pages and accept the license |
| Alignment model not found | Language without an aligner | Falls back to interpolated word timestamps |
| Diarization assigns wrong speakers | Voices too similar | Set min/max speakers; clean the audio |
| OOM with batch_size 32 | VRAM too tight | Drop to 8 or 16 |
| Slow on CPU | Diarization is compute-bound | Use a GPU at least for diarization |
| WhisperX / Faster-Whisper version mismatch | Unpinned dependencies | Pin both versions in the venv |

FAQ {#faq}

See answers to common WhisperX questions below.


Sources: WhisperX GitHub | pyannote.audio | WhisperX paper | Internal benchmarks (RTX 4090, M4 Max).


