WhisperX Guide (2026): Word-Level Timestamps + Speaker Diarization for Local STT
WhisperX is the right tool when plain transcription is not enough. Built on Faster-Whisper, it adds forced phoneme alignment for sub-100ms word-level timestamps and pyannote-based speaker diarization. For podcasts, interviews, meeting transcripts, video subtitles, and any workflow where you need to know exactly who said what when, WhisperX is the production-grade open-source answer.
This guide covers everything: installation, getting the Hugging Face token for pyannote, basic transcription with diarization, output formats (SRT/VTT/JSON), batching for long-form audio, multi-language support, integration with video pipelines, accuracy tuning, and the cases where WhisperX is overkill vs. where it's essential.
Table of Contents
- What WhisperX Is
- Word-Level Timestamps via Forced Alignment
- Speaker Diarization via Pyannote
- Hardware Requirements
- Installation
- Hugging Face Token Setup
- Your First Transcription with Speakers
- CLI Usage
- Python API
- Output Formats: SRT, VTT, JSON
- Multi-Language Support
- Setting Speaker Counts
- Batched Long-Form Inference
- Integration with Video Pipelines
- Accuracy Tuning
- Performance Benchmarks
- WhisperX vs Commercial Services
- Tuning Recipes
- Troubleshooting
- FAQ
What WhisperX Is {#what-it-is}
WhisperX (m-bain/whisperX) is a Python library + CLI that wraps Faster-Whisper and adds:
- Forced phoneme alignment via wav2vec2 → accurate word-level timestamps (sub-100 ms vs Whisper's ~1 sec drift)
- Speaker diarization via pyannote.audio 3.1 → "who said what when"
- Batched inference with VAD-based segmentation → higher throughput on long audio
- Output formatters for SRT, VTT, TSV, JSON
Project: github.com/m-bain/whisperX. License: Apache 2.0. Active maintenance.
Word-Level Timestamps via Forced Alignment {#alignment}
Whisper natively emits timestamps but only at the segment level (typically 5-30 seconds per segment). For word-level timestamps, Whisper interpolates — and the result drifts by hundreds of milliseconds.
WhisperX runs the transcript through a wav2vec2 phoneme alignment model that maps each word to its precise audio position. Result: timestamps accurate to <100 ms, sufficient for karaoke-style word highlighting in subtitles.
The alignment step adds ~10-20% to total processing time but is essential for video subtitles, dubbed content alignment, and any time-sensitive transcript use.
Speaker Diarization via Pyannote {#diarization}
Pyannote 3.1 is the open-source state-of-the-art for speaker diarization. WhisperX:
- Runs pyannote on the audio → speaker turn boundaries
- Aligns Whisper transcript words with diarization output
- Assigns each word to a speaker label (SPEAKER_00, SPEAKER_01, ...)
For 2-3 clean speakers: 90-95% accuracy. For 4-6 speakers: 80-88%. For crowded meetings with crosstalk: 70-80%.
You can map labels to real names post-hoc: SPEAKER_00 → Alice, SPEAKER_01 → Bob based on the first known voice sample.
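A minimal sketch of that renaming, applied to the result dict produced by the Python API shown later (the name mapping itself is an assumption you supply):
# Illustrative post-hoc renaming of diarization labels.
# "result" is the dict returned by whisperx.assign_word_speakers (see the Python API section).
name_map = {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"}
for segment in result["segments"]:
    if "speaker" in segment:
        segment["speaker"] = name_map.get(segment["speaker"], segment["speaker"])
    for word in segment.get("words", []):
        if "speaker" in word:
            word["speaker"] = name_map.get(word["speaker"], word["speaker"])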
Hardware Requirements {#requirements}
| Hardware | Transcription (RTF) | + Alignment | + Diarization |
|---|---|---|---|
| RTX 4090 (FP16) | 72x | 60x | 30x |
| RTX 4070 (FP16) | 50x | 40x | 22x |
| RTX 3060 (INT8) | 35x | 28x | 12x |
| Apple M4 Max (MPS) | 25x | 20x | 8x |
| Ryzen 7 7700X (CPU) | 10x | 8x | 0.5x (slow) |
VRAM: 6 GB minimum, 8 GB+ comfortable. CPU diarization is impractical except for very small workloads.
Installation {#installation}
python3.11 -m venv ~/venvs/whisperx
source ~/venvs/whisperx/bin/activate
# CUDA 12.4 + cuDNN
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install whisperx
pip install nvidia-cudnn-cu12==9.0.0 # if not already installed
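A quick sanity check after installing (a minimal sketch, assuming the venv above is active):
# Both imports should succeed; True means the CUDA build of torch sees your GPU.
import torch
import whisperx  # noqa: F401
print(torch.__version__, torch.cuda.is_available())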
For CPU only:
pip install whisperx
# torch CPU is installed by default
For Docker:
docker run --gpus all --rm -v $(pwd):/workspace ghcr.io/m-bain/whisperx:latest \
whisperx /workspace/audio.mp3 --model large-v3 --hf_token $HF_TOKEN
Hugging Face Token Setup {#hf-token}
Diarization needs a Hugging Face token because pyannote models are gated.
- Visit huggingface.co/pyannote/speaker-diarization-3.1 → click "Agree and access repository"
- Visit huggingface.co/pyannote/segmentation-3.0 → same
- Go to huggingface.co/settings/tokens → create a Read token
- Export:
export HF_TOKEN=hf_xxxxxxxxxxxxxxx
Pass to WhisperX via --hf_token (CLI) or use_auth_token (Python).
For transcription-only (no diarization), no token needed.
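If you want to confirm access before kicking off a long job, a small check with huggingface_hub (pulled in as a dependency of pyannote) works; this is a sketch, not a WhisperX feature:
# Raises an error if the license was not accepted or the token is invalid.
import os
from huggingface_hub import model_info
for repo in ("pyannote/speaker-diarization-3.1", "pyannote/segmentation-3.0"):
    model_info(repo, token=os.environ["HF_TOKEN"])
    print("access OK:", repo)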
Your First Transcription with Speakers {#first-transcription}
whisperx audio.mp3 \
--model large-v3 \
--diarize \
--hf_token $HF_TOKEN \
--min_speakers 2 \
--max_speakers 4 \
--output_format all
Outputs in current directory:
- audio.srt — subtitles with speaker labels
- audio.vtt — WebVTT for HTML5 video
- audio.json — full structured output (words, timestamps, speakers)
- audio.tsv — tab-separated for spreadsheet import
- audio.txt — plain text
CLI Usage {#cli}
# Basic transcription only (fast)
whisperx audio.mp3 --model large-v3
# Word timestamps (alignment)
whisperx audio.mp3 --model large-v3 --align
# Full pipeline: transcription + alignment + diarization
whisperx audio.mp3 \
--model large-v3 \
--diarize \
--hf_token $HF_TOKEN \
--min_speakers 2 --max_speakers 4 \
--language en \
--batch_size 16 \
--compute_type float16 \
--output_format srt \
--highlight_words True
Useful flags:
- --language en — skip language detection (faster, more accurate)
- --batch_size 16 — VAD-batched inference (RTX 4090: bump to 32)
- --compute_type float16 — or int8_float16 for tighter VRAM
- --highlight_words True — per-word highlights in SRT output
- --print_progress True — progress bar
Python API {#python-api}
import whisperx
import os
HF_TOKEN = os.environ["HF_TOKEN"]  # token for the gated pyannote models
device = "cuda"
audio = whisperx.load_audio("audio.mp3")
# 1. Transcribe with Whisper
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)
# 2. Align (word-level timestamps)
align_model, metadata = whisperx.load_align_model(
language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
# 3. Diarize
diarize_model = whisperx.DiarizationPipeline(
use_auth_token=HF_TOKEN, device=device
)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=4)
result = whisperx.assign_word_speakers(diarize_segments, result)
# Result has: result["segments"] with .words[] containing .word .start .end .speaker
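To consume that structure, a short follow-up sketch that prints a speaker-attributed, word-timed transcript:
# Print "SPEAKER [start-end]: word" for every aligned word.
for segment in result["segments"]:
    for word in segment.get("words", []):
        if "start" not in word:
            continue  # words that could not be aligned have no timestamps
        speaker = word.get("speaker", "UNKNOWN")
        print(f'{speaker} [{word["start"]:.2f}-{word["end"]:.2f}]: {word["word"]}')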
Output Formats: SRT, VTT, JSON {#output-formats}
SRT (subtitles)
1
00:00:00,520 --> 00:00:03,840
[SPEAKER_00] Welcome to the show, today we are talking about local AI.
2
00:00:04,120 --> 00:00:07,200
[SPEAKER_01] Thanks for having me. It is great to be here.
VTT (HTML5 video)
WEBVTT
00:00.520 --> 00:03.840
<v SPEAKER_00>Welcome to the show, today we are talking about local AI.</v>
00:04.120 --> 00:07.200
<v SPEAKER_01>Thanks for having me. It is great to be here.</v>
JSON (full structured)
{
"segments": [{
"start": 0.52, "end": 3.84,
"text": "Welcome to the show...",
"speaker": "SPEAKER_00",
"words": [
{ "word": "Welcome", "start": 0.52, "end": 0.91, "speaker": "SPEAKER_00" },
{ "word": "to", "start": 0.94, "end": 1.06, "speaker": "SPEAKER_00" },
{ "word": "the", "start": 1.08, "end": 1.18, "speaker": "SPEAKER_00" }
]
}],
"language": "en"
}
The JSON drives downstream applications: TikTok-style word highlights, video editor sync, searchable archives.
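As an example, a minimal sketch that flattens audio.json into per-word cues for caption rendering (the cue shape here is an assumption, not a WhisperX format):
# Flatten the JSON output into (start, end, speaker, word) tuples.
import json
with open("audio.json", encoding="utf-8") as f:
    data = json.load(f)
cues = [
    (w["start"], w["end"], w.get("speaker", "UNKNOWN"), w["word"])
    for seg in data["segments"]
    for w in seg.get("words", [])
    if "start" in w
]
print(cues[:5])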
Multi-Language Support {#multilingual}
WhisperX inherits Whisper's 99-language support. Alignment models exist for the major languages:
| Language | Alignment model |
|---|---|
| English | wav2vec2-base-960h |
| German | wav2vec2-large-xlsr-53-german |
| Spanish | wav2vec2-large-xlsr-53-spanish |
| French | wav2vec2-large-xlsr-53-french |
| Italian, Portuguese, Russian, ... | language-specific wav2vec2 |
| Chinese / Japanese / Korean | smaller alignment models |
Pass --language es etc. WhisperX auto-loads the matching alignment model.
For languages without an alignment model, transcription works but word timestamps fall back to interpolated.
Diarization is language-agnostic.
Setting Speaker Counts {#speaker-counts}
If you know the speaker count, tell pyannote:
whisperx audio.mp3 --diarize --min_speakers 2 --max_speakers 2 --hf_token $HF_TOKEN
Hard limits dramatically improve accuracy. For two-person podcasts, set min=max=2. For meetings of unknown size, leave defaults (auto-detect, typically 1-10).
If you provide just one bound (--min_speakers 3), pyannote uses it as a hint but may produce more or fewer.
Batched Long-Form Inference {#batched}
For long audio (1+ hours):
whisperx audio.mp3 \
--model large-v3 \
--batch_size 32 \
--diarize --hf_token $HF_TOKEN
VAD pre-segmentation splits audio at silences, then transcribes 32 segments in parallel on the GPU. On an RTX 4090 with large-v3 this keeps transcription around 70x real-time before diarization (see the benchmark table below).
For massive archives (1,000+ hours), use the Python API with explicit chunking and async pyannote calls.
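A minimal sequential sketch of that pattern (directory name, speaker bounds, and batch size are placeholders; the async pyannote variant is left out for brevity):
# Archive processing: load models once, loop over files, write one JSON per input.
import json
import os
from pathlib import Path
import whisperx
device = "cuda"
model = whisperx.load_model("large-v3", device, compute_type="float16")
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_TOKEN"], device=device)
for path in sorted(Path("archive").glob("*.wav")):
    audio = whisperx.load_audio(str(path))
    result = model.transcribe(audio, batch_size=32)
    align_model, metadata = whisperx.load_align_model(result["language"], device)
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)
    diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=6)
    result = whisperx.assign_word_speakers(diarize_segments, result)
    path.with_suffix(".json").write_text(json.dumps(result), encoding="utf-8")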
Integration with Video Pipelines {#video}
For video subtitle generation:
# Extract audio from video
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 audio.wav
# Transcribe with WhisperX
whisperx audio.wav --model large-v3 --align --output_format srt
# Burn subtitles back into video
ffmpeg -i video.mp4 -vf "subtitles=audio.srt" video_subtitled.mp4
For a fully automated pipeline:
# auto-subtitle.sh
#!/bin/bash
ffmpeg -i "$1" -vn -ar 16000 -ac 1 audio.wav
whisperx audio.wav --model large-v3 --align --language en --output_format srt
ffmpeg -i "$1" -vf "subtitles=audio.srt:force_style='FontSize=22'" "${1%.*}_sub.mp4"
rm audio.wav audio.srt
Accuracy Tuning {#accuracy}
To improve diarization quality:
- Clean audio first — run Demucs or Audacity noise reduction
- Set speaker counts — min_speakers/max_speakers
- Use a higher-quality model — large-v3, not turbo (turbo trades accuracy for speed and handles difficult multi-speaker audio less well)
- Adjust VAD sensitivity — pass a custom vad_options dict
- Post-process — manually correct SPEAKER_X labels with sed/awk; cluster similar voices across recordings with voice fingerprinting
For mission-critical accuracy (court records, depositions): consider commercial Pyannote Premium or human-in-the-loop verification.
Performance Benchmarks {#benchmarks}
1-hour podcast, large-v3 model, RTX 4090:
| Pipeline | Time | RTF |
|---|---|---|
| Transcription only | 50 sec | 72x |
| + Alignment | 70 sec | 51x |
| + Diarization | 120 sec | 30x |
For long-form (10+ hours): batch processing scales nearly linearly, so 10 hours of audio takes ~20 minutes.
WhisperX vs Commercial Services {#vs-commercial}
| Property | WhisperX | Otter.ai | Rev | AssemblyAI |
|---|---|---|---|---|
| Cost | Free + GPU | $20/mo | $0.25/min | $0.65/hour |
| Privacy | Local | Cloud | Cloud | Cloud |
| Diarization accuracy (2-speaker) | 90-95% | 92-96% | 95% | 94% |
| Real-time | No | Yes | No | Yes |
| Custom vocabulary | Yes (post-edit) | Yes | Yes | Yes |
| Language coverage | 99 | 6 | 30+ | 12 |
For most local podcasters, journalists, researchers: WhisperX matches commercial services on quality with full privacy and zero per-minute cost.
Tuning Recipes {#tuning}
Podcast / interview transcription
whisperx interview.mp3 \
--model large-v3 --align --diarize \
--min_speakers 2 --max_speakers 3 \
--language en \
--hf_token $HF_TOKEN \
--output_format srt
YouTube auto-captioning
yt-dlp -x --audio-format wav "https://youtu.be/..."
whisperx *.wav --model large-v3 --align --output_format vtt
Meeting transcription with up to 8 speakers
whisperx meeting.wav \
--model large-v3 --align --diarize \
--max_speakers 8 \
--hf_token $HF_TOKEN \
--output_format json
Multi-language interview
whisperx multilingual.mp3 \
--model large-v3 --align --diarize \
--hf_token $HF_TOKEN
# Auto-detects primary language; alignment uses matching wav2vec2
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| "Could not download pyannote model" | HF token / license not accepted | Visit pyannote pages, accept license |
| Alignment model not found | Language without an aligner | Accept interpolated word timestamps, or pass a compatible wav2vec2 model via --align_model |
| Diarization wrong speakers | Voices too similar | Set min/max speakers; clean audio |
| OOM with batch_size 32 | VRAM too tight | Drop to 8 or 16 |
| Slow on CPU | Diarization compute-bound | Use GPU for diarization at minimum |
| WhisperX version mismatch with Faster-Whisper | Incompatible library versions | Pin whisperx and faster-whisper versions in the venv |
FAQ {#faq}
See answers to common WhisperX questions below.
Sources: WhisperX GitHub | pyannote.audio | WhisperX paper | Internal benchmarks RTX 4090, M4 Max.