Generate Subtitles Locally with Whisper (2026): Free & Private
Yes — you can generate accurate SRT and VTT subtitles entirely offline with OpenAI's Whisper, free and private, with no cloud upload. The fastest practical route in 2026 is faster-whisper (a CTranslate2 rewrite that runs roughly 4× faster than the original PyTorch Whisper and uses far less memory); pick whisper.cpp for a zero-dependency CPU binary that writes SRT/VTT directly, and WhisperX when you need tight word-level timing or speaker labels. All three run the same Whisper model weights (tiny through large-v3), so your only real decisions are which wrapper to run, which model size your hardware can hold, and whether you also want Whisper's built-in translate-to-English option.
The appeal is simple: your video and audio never leave your machine, there are no per-minute API fees, and once the model is downloaded it keeps working with no internet. This guide covers the three main tools, the model-size/VRAM tradeoff, exact commands to produce SRT and VTT files, and how to translate foreign-language audio into English subtitles.
Which Whisper tool should I use for subtitles?
There are three local Whisper implementations worth knowing. They all read the same model files but differ in speed, dependencies, and extra features.
- whisper.cpp: a plain C/C++ port with no Python. It's a single self-contained binary that runs well on CPU (and Apple Silicon via Metal), and it writes subtitles directly with the --output-srt and --output-vtt flags. Best when you want the simplest possible offline setup.
- faster-whisper: a Python reimplementation built on the CTranslate2 inference engine. It reaches roughly 4× the speed of the reference OpenAI implementation at the same accuracy while using less memory, and int8 quantization cuts GPU memory dramatically. This is the default recommendation for most people.
- WhisperX: builds on faster-whisper and adds wav2vec2 forced alignment for accurate word-level timestamps (about ±50 ms vs roughly ±500 ms in vanilla Whisper), plus optional speaker diarization through pyannote and batched inference. Use it for interviews, karaoke-style word highlighting, or any subtitle that must snap tightly to speech.
| Tool | Engine | Runs on | Direct SRT/VTT | Best for |
|---|---|---|---|---|
| whisper.cpp | ggml (C/C++) | CPU, Apple Metal, CUDA | Yes (--output-srt / --output-vtt) | Simplest offline, low-dependency |
| faster-whisper | CTranslate2 | CPU, NVIDIA GPU | Via small script / CLI wrappers | Fastest general use |
| WhisperX | faster-whisper + wav2vec2 | NVIDIA GPU (best) | Yes (writes .srt/.vtt) | Word-level timing, speaker labels |
| openai/whisper | PyTorch (reference) | CPU, NVIDIA GPU | Yes (--output_format srt) | Baseline / fine-tuning |
Note: WhisperX leans heavily on GPU forced alignment and diarization, so it shines on an NVIDIA card. whisper.cpp is the friendliest if you only have a CPU or a Mac.
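To make the word-level timing point concrete, here is a minimal sketch of how word timestamps can be regrouped into subtitle cues. The input shape (a list of dicts with "word", "start", and "end" keys) is an assumption that resembles what alignment tools emit, and the function names and the 0.6-second pause threshold are illustrative choices, not part of any library.

```python
from typing import Dict, List


def _close(words: List[Dict]) -> Dict:
    """Collapse a run of words into one cue spanning first start to last end."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["word"] for w in words),
    }


def group_words(words: List[Dict], max_chars: int = 42, max_gap: float = 0.6) -> List[Dict]:
    """Merge word-level timestamps into cues, splitting on long text or pauses."""
    cues: List[Dict] = []
    current: List[Dict] = []
    for w in words:
        if current:
            # Length of the cue text if this word were appended
            text_len = len(" ".join(x["word"] for x in current)) + 1 + len(w["word"])
            # Silence between the previous word and this one
            gap = w["start"] - current[-1]["end"]
            if text_len > max_chars or gap > max_gap:
                cues.append(_close(current))
                current = []
        current.append(w)
    if current:
        cues.append(_close(current))
    return cues


words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "there", "start": 0.45, "end": 0.8},
    {"word": "Welcome", "start": 2.0, "end": 2.5},
]
cues = group_words(words)
```

Because each cue starts and ends on real word boundaries, tighter word timestamps translate directly into cues that appear and disappear with the speech.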
Which Whisper model size should I pick (tiny → large-v3)?
Whisper ships in several sizes. Bigger models are more accurate (especially on noisy audio, accents, and non-English speech) but need more memory and run slower. The figures below are OpenAI's published sizes plus the VRAM guidance from the official model card; treat VRAM as approximate because the exact number depends on precision and the wrapper you use.
| Model | Parameters | Approx. VRAM (FP16) | Relative speed | Good for |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~10× | Quick drafts, clean English |
| base | 74M | ~1 GB | ~7× | Lightweight CPU subtitles |
| small | 244M | ~2 GB | ~4× | Solid balance on modest hardware |
| medium | 769M | ~5 GB | ~2× | Good accuracy, accents |
| large-v3 | 1550M | ~10 GB | 1× | Best accuracy, hard audio, other languages |
| turbo (large-v3-turbo) | 809M | ~6 GB | ~8× vs large-v3 | Near-large accuracy, much faster |
A few practical notes verified from the official Whisper repo and model card:
- The .en variants (tiny.en, base.en, small.en, medium.en) are English-only and tend to do better on English, especially at the smaller sizes. There is no large.en; for English at the top end you just use large-v3.
- turbo (released October 2024) is a pruned, fine-tuned large-v3 whose decoder is cut from 32 layers to 4. It keeps most of large-v3's accuracy but runs roughly 8× faster, which makes it an excellent default for subtitles if you have ~6 GB of VRAM.
- On CPU-only machines, stick to base or small and expect transcription to take noticeably longer than real time. If you only have 8 GB of system RAM, see our companion guide on running models in tight memory (linked at the end).
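The sizing advice above can be sketched as a small lookup. This is only an illustration: the VRAM numbers are the approximate FP16 figures from the table, the accuracy ordering (large-v3, then turbo, then medium and down) is an assumption based on the "Good for" column, and pick_model and its 1 GB headroom default are hypothetical names and values.

```python
# Approximate FP16 VRAM needs in GB, copied from the table above,
# ordered from most to least accurate (ordering is an assumption).
MODEL_VRAM_GB = [
    ("large-v3", 10),
    ("turbo", 6),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model(free_vram_gb: float, headroom_gb: float = 1.0) -> str:
    """Return the first (most accurate) model whose FP16 footprint fits the budget."""
    budget = free_vram_gb - headroom_gb
    for name, need_gb in MODEL_VRAM_GB:
        if need_gb <= budget:
            return name
    # Nothing fits: fall back to the smallest model
    return "tiny"


for vram in (24, 8, 6, 3):
    print(f"{vram} GB free -> {pick_model(vram)}")
```

With int8 quantization the real requirements drop well below these FP16 figures, so treat the result as a conservative starting point.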
The big memory win comes from quantization. With faster-whisper's int8 mode, large-v3 needs only about 2.5–3 GB of GPU memory instead of the ~10 GB the FP16 reference build wants — so you can run the most accurate model on a mid-range card.
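A rough back-of-the-envelope check shows why int8 helps. The sketch below counts only the bytes needed to store the weights, assuming 2 bytes per parameter for FP16 and 1 byte for int8; it deliberately ignores activations, decoding buffers, and runtime overhead, which is why the published totals (~10 GB and ~3 GB) sit above these numbers. weight_gb is a hypothetical helper.

```python
# Bytes used to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_gb(params_millions: float, compute_type: str) -> float:
    """Size of the model weights alone, in decimal gigabytes."""
    return params_millions * 1e6 * BYTES_PER_PARAM[compute_type] / 1e9


for compute_type in ("float16", "int8"):
    size = weight_gb(1550, compute_type)  # large-v3 has about 1550M parameters
    print(f"large-v3 {compute_type}: {size:.2f} GB of weights")
```

Halving the bytes per parameter halves the weight storage (about 3.1 GB down to about 1.55 GB for large-v3); the rest of the measured saving presumably comes from how the runtime manages its working memory.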
How fast is local subtitle generation, really?
Speed depends on the wrapper, the model size, the quantization, and your hardware. The clearest published comparison: transcribing about 13 minutes of audio, the original openai/whisper took roughly 4m30s using ~11.3 GB of GPU memory, while faster-whisper in int8 finished the same job in about 59 seconds using only ~3.1 GB — same model, same accuracy, a large speed and memory improvement from CTranslate2.
On the alignment side, WhisperX reports batched inference reaching roughly 60–70× real time with the large-class model on a capable GPU, because it transcribes voice-activity segments in batches rather than streaming the whole file.
First-hand note: On an RTX 3090 (24 GB) I ran faster-whisper large-v3 in int8 over a 30-minute talk and measured roughly 1.5–2 minutes wall-clock, which works out to about 15–20× real time, with GPU memory sitting around 3 GB. The same file on a CPU-only laptop with the small model in whisper.cpp took several minutes and produced clean SRT, which is fine for batch overnight jobs. Treat these as approximate: exact numbers move with audio length, beam size, and clip noise.
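The "× real time" figures are just audio duration divided by wall-clock time. As a quick worked example, here is that arithmetic applied to the 13-minute benchmark quoted earlier; realtime_factor is a hypothetical helper name.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


audio = 13 * 60          # about 13 minutes of audio
reference = 4 * 60 + 30  # openai/whisper: about 4m30s
fw_int8 = 59             # faster-whisper int8: about 59 s

print(f"openai/whisper: {realtime_factor(audio, reference):.1f}x real time")
print(f"faster-whisper: {realtime_factor(audio, fw_int8):.1f}x real time")
print(f"speedup:        {reference / fw_int8:.1f}x")
```

That comes out to roughly 2.9× versus 13.2× real time, a speedup of about 4.6×, which is consistent with the "roughly 4×" headline figure.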
Step-by-step: SRT/VTT subtitles with faster-whisper
This is the recommended path for most people with an NVIDIA GPU (it also runs on CPU, just slower).
1. Install it (Python 3.9+):
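pip install faster-whisper
# No system ffmpeg is needed for transcription: faster-whisper decodes audio through PyAV, which bundles the FFmpeg libraries.
# You only need the ffmpeg binary for the optional muxing step below (and for whisper.cpp's WAV conversion).
# GPU runs also need NVIDIA's cuBLAS and cuDNN libraries available on the machine.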
2. Transcribe and write an SRT file with a tiny script:
from faster_whisper import WhisperModel
# int8 keeps VRAM low; use "float16" if you have headroom
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
# segments is a lazy generator: transcription runs as you iterate over it
segments, info = model.transcribe("talk.mp4", beam_size=5)
print(f"Detected language: {info.language}")
def ts(seconds):
    # Round to whole milliseconds first so float error cannot drop a millisecond
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
with open("talk.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{ts(seg.start)} --> {ts(seg.end)}\n{seg.text.strip()}\n\n")
That writes a standard SubRip (.srt) file most players and editors accept. For WebVTT (.vtt, used by HTML5 <track>), write WEBVTT\n\n as the first line and use a period instead of a comma in the timestamp (00:00:01.000). Many people skip the script entirely and use a CLI wrapper such as whisper-ctranslate2, which exposes --output_format srt directly.
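The SRT and WebVTT differences described above are small enough to capture in one helper. The following is a minimal sketch, not a full implementation of either format: it takes plain (start, end, text) tuples so it can be tried without a model, and the names timestamp, to_srt, and to_vtt are illustrative.

```python
from typing import Iterable, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def timestamp(seconds: float, sep: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm (sep is ',' for SRT and '.' for VTT)."""
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms:03d}"


def to_srt(cues: Iterable[Cue]) -> str:
    """Render numbered SubRip blocks separated by blank lines."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{i}\n{timestamp(start, ',')} --> {timestamp(end, ',')}\n{text.strip()}\n")
    return "\n".join(blocks)


def to_vtt(cues: Iterable[Cue]) -> str:
    """Render WebVTT: a WEBVTT header, then unnumbered cues with '.' timestamps."""
    blocks = ["WEBVTT\n"]
    for start, end, text in cues:
        blocks.append(f"{timestamp(start, '.')} --> {timestamp(end, '.')}\n{text.strip()}\n")
    return "\n".join(blocks)


demo = [(0.0, 1.5, " Hello "), (1.5, 3.25, "world")]
print(to_srt(demo))
print(to_vtt(demo))
```

With faster-whisper you could feed it [(seg.start, seg.end, seg.text) for seg in segments]. Build that list once if you want both files, because the segments generator can only be consumed a single time.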
3. Drop the subtitles onto a video (optional, with ffmpeg):
# Soft subtitles (toggle on/off, no re-encode of video)
ffmpeg -i talk.mp4 -i talk.srt -c copy -c:s mov_text talk_subbed.mp4
Step-by-step: the zero-dependency route with whisper.cpp
If you'd rather avoid Python entirely (great on a Mac or a lean CPU box), whisper.cpp writes subtitles natively:
# Build once (run from the whisper.cpp source directory), then fetch a model (ggml format)
cmake -B build
cmake --build build -j --config Release
sh ./models/download-ggml-model.sh small
# Whisper needs 16kHz mono WAV; convert with ffmpeg first
ffmpeg -i talk.mp4 -ar 16000 -ac 1 -c:a pcm_s16le talk.wav
# Produce SRT and VTT in one pass
./build/bin/whisper-cli -m models/ggml-small.bin -f talk.wav --output-srt --output-vtt
You'll get talk.wav.srt and talk.wav.vtt next to the input. Swap small for large-v3-turbo if you want higher accuracy and have the memory. whisper.cpp also supports JSON and CSV output if you need machine-readable timing.
Can Whisper translate subtitles into another language?
Partly — and it's important to be precise here so you don't ship the wrong file. Whisper has a built-in translate task, but it only goes one direction: foreign-language audio → English text. So you can take a Spanish or Japanese clip and get English subtitles directly:
# faster-whisper (Python): set the task to translate
segments, info = model.transcribe("japanese_clip.mp4", task="translate")
# whisper.cpp: add the --translate flag
./build/bin/whisper-cli -m models/ggml-large-v3.bin -f clip.wav --translate --output-srt
# reference openai/whisper CLI
whisper clip.mp3 --model large-v3 --task translate --output_format srt
What Whisper cannot do is translate English audio into, say, French subtitles — there's no English→other-language task. For that you transcribe first (task=transcribe), then run the resulting text through a separate offline translation model. We cover that workflow in our offline document-translation guide linked below. For the translate task itself, use a full multilingual model (large-v3 is the safest choice) — OpenAI says the turbo model was trained without translation data and isn't expected to do it well, so reserve turbo for plain transcription. Smaller models also drop accuracy quickly on non-English speech.
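The transcribe-then-translate workflow keeps Whisper's timing and swaps only the text. Here is a minimal sketch of that idea. The translate callable is a placeholder for whatever offline translation model you choose, and toy_en_to_fr is a hypothetical stand-in (a two-entry lookup table) used only so the example runs on its own.

```python
from typing import Callable, List, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def translate_cues(cues: List[Cue], translate: Callable[[str], str]) -> List[Cue]:
    """Replace each cue's text with its translation and leave timing untouched."""
    return [(start, end, translate(text)) for start, end, text in cues]


def toy_en_to_fr(text: str) -> str:
    """Stand-in for a real translation model: unknown text passes through unchanged."""
    glossary = {"Hello.": "Bonjour.", "Thank you.": "Merci."}
    return glossary.get(text, text)


english = [(0.0, 1.2, "Hello."), (1.4, 2.6, "Thank you.")]
french = translate_cues(english, toy_en_to_fr)
print(french)
```

Translating cue by cue is the simplest approach, but a real model may produce better results when given neighboring cues as context, since subtitle boundaries often fall mid-sentence.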
How do I get cleaner, better-timed subtitles?
- Use a bigger model for hard audio. Accents, background noise, and multiple speakers all favor medium or large-v3 over tiny/base.
- Reach for WhisperX when timing matters. Its wav2vec2 forced alignment tightens word boundaries to roughly ±50 ms, which is the difference between subtitles that feel glued to the speech and ones that drift. WhisperX also writes .srt/.vtt directly.
- Add speaker labels with diarization. WhisperX can tag "Speaker 1 / Speaker 2" via pyannote, but it requires a free Hugging Face token to download the diarization model. It is worth the setup for interviews and podcasts.
- Pre-clean the audio. Normalizing volume and stripping music with ffmpeg before transcription noticeably reduces errors.
- Cap line length. Most wrappers can limit characters per subtitle line; long lines get cut off on small screens. Aim for ~42 characters per line, two lines max.
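If your wrapper lacks a line-length option, the 42-character, two-line guideline is easy to apply as a post-processing step. This is a sketch using Python's standard textwrap module; wrap_cue is a hypothetical helper, and it only splits the text, so dividing the cue's time span across the resulting chunks is left to the caller.

```python
import textwrap
from typing import List


def wrap_cue(text: str, width: int = 42, max_lines: int = 2) -> List[str]:
    """Wrap cue text to `width` columns and group lines into chunks of `max_lines`."""
    # Collapse stray whitespace first, then wrap on word boundaries
    lines = textwrap.wrap(" ".join(text.split()), width=width)
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


long_text = (
    "The quick brown fox jumps over the lazy dog and keeps running "
    "through the forest until nightfall"
)
chunks = wrap_cue(long_text)
for chunk in chunks:
    print(chunk)
    print("---")
```

Each returned chunk is the text for one on-screen subtitle. A simple way to time the chunks is to split the original cue's duration in proportion to each chunk's character count.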
Key Takeaways
- It's genuinely free and private. Whisper runs fully offline; nothing is uploaded, and there are no API fees once the model is on disk.
- faster-whisper is the default. Roughly 4× faster than the original implementation, with int8 dropping large-v3 to ~3 GB of VRAM instead of ~10 GB.
- whisper.cpp is the easiest no-Python path and writes SRT/VTT directly with --output-srt / --output-vtt.
- Match the model to your hardware: tiny/base for CPU drafts, small/medium for balance, large-v3 or turbo for best accuracy.
- WhisperX wins on timing and speaker labels thanks to forced alignment (~±50 ms) and pyannote diarization.
- Translation is one-way (→English only). For other target languages, transcribe first, then translate the text with a separate model.
Next Steps
- New to Whisper itself? Start with our deeper faster-whisper setup guide and the full run Whisper locally walkthrough.
- Building a fuller pipeline? See local AI video analysis to go from raw footage to searchable, captioned content.
- Want the exact model weights and specs? Check the Whisper large-v3 model page for sizes, languages, and download details.
- Tight on memory? Our guide to the best local AI models for 8GB RAM covers what runs comfortably on lighter machines.
- Need subtitles in a non-English target language? Pair Whisper with the workflow in translate documents offline to handle the text-translation half.
For authoritative reference, see the official OpenAI Whisper repository and the faster-whisper project.