Yes: you can generate accurate SRT and VTT subtitles entirely offline with OpenAI's Whisper, free and private, with no cloud upload. The fastest practical route in 2026 is faster-whisper, a CTranslate2 reimplementation that its maintainers report as up to 4× faster than the original PyTorch Whisper at the same accuracy while using less memory. Pick whisper.cpp for a Python-free binary that runs well on CPU and writes SRT/VTT directly, and WhisperX when you need tight word-level timing or speaker labels. All three run the same Whisper models (tiny through large-v3), each in its own converted weight format, so your only real decisions are which wrapper to run, which model size your hardware can hold, and whether you also want Whisper's built-in translate-to-English option.

The appeal is simple: your video and audio never leave your machine, there are no per-minute API fees, and once the model is downloaded it keeps working with no internet. This guide covers the three main tools, the model-size/VRAM tradeoff, exact commands to produce SRT and VTT files, and how to translate foreign-language audio into English subtitles.

Which Whisper tool should I use for subtitles?

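There are three local Whisper implementations worth knowing. They all run the same Whisper models, each converted to its own weight format (ggml for whisper.cpp, CTranslate2 for faster-whisper and WhisperX), but they differ in speed, dependencies, and extra features.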

whisper.cpp: a plain C/C++ port with no Python. It's a single self-contained binary that runs well on CPU (and Apple Silicon via Metal), and it writes subtitles directly with the --output-srt and --output-vtt flags. Best when you want the simplest possible offline setup.
faster-whisper: a Python reimplementation built on the CTranslate2 inference engine. Its maintainers report up to 4× the speed of the reference OpenAI implementation at the same accuracy while using less memory, and int8 quantization cuts GPU memory dramatically. This is the default recommendation for most people.
WhisperX: builds on faster-whisper and adds wav2vec2 forced alignment for accurate word-level timestamps (about ±50 ms vs roughly ±500 ms in vanilla Whisper), plus optional speaker diarization through pyannote and batched inference. Use it for interviews, karaoke-style word highlighting, or any subtitle that must snap tightly to speech.

Tool	Engine	Runs on	Direct SRT/VTT	Best for
whisper.cpp	ggml (C/C++)	CPU, Apple Metal, CUDA	Yes (`--output-srt` / `--output-vtt`)	Simplest offline, low-dependency
faster-whisper	CTranslate2	CPU, NVIDIA GPU	Via small script / CLI wrappers	Fastest general use
WhisperX	faster-whisper + wav2vec2	NVIDIA GPU (best)	Yes (writes `.srt`/`.vtt`)	Word-level timing, speaker labels
openai/whisper	PyTorch (reference)	CPU, NVIDIA GPU	Yes (`--output_format srt`)	Baseline / fine-tuning

Note: WhisperX leans heavily on GPU forced alignment and diarization, so it shines on an NVIDIA card. whisper.cpp is the friendliest if you only have a CPU or a Mac.

Which Whisper model size should I pick (tiny → large-v3)?

Whisper ships in several sizes. Bigger models are more accurate (especially on noisy audio, accents, and non-English speech) but need more memory and run slower. The figures below are OpenAI's published sizes plus the VRAM guidance from the official model card; treat VRAM as approximate because the exact number depends on precision and the wrapper you use.

Model	Parameters	Approx. VRAM (FP16)	Relative speed	Good for
tiny	39M	~1 GB	~10×	Quick drafts, clean English
base	74M	~1 GB	~7×	Lightweight CPU subtitles
small	244M	~2 GB	~4×	Solid balance on modest hardware
medium	769M	~5 GB	~2×	Good accuracy, accents
large-v3	1550M	~10 GB	1×	Best accuracy, hard audio, other languages
turbo (large-v3-turbo)	809M	~6 GB	~8× vs large-v3	Near-large accuracy, much faster

A few practical notes verified from the official Whisper repo and model card:

The .en variants (tiny.en, base.en, small.en, medium.en) are English-only and tend to do better on English, especially at the smaller sizes. There is no large.en — for English at the top end you just use large-v3.
turbo (released October 2024) is a pruned, fine-tuned large-v3 whose decoder is cut from 32 layers to 4. It keeps most of large-v3's accuracy but runs roughly 8× faster, which makes it an excellent default for subtitles if you have ~6 GB of VRAM.
On CPU-only machines, stick to base or small and expect transcription to take noticeably longer than real time. If you only have 8 GB of system RAM, see our companion guide on running models in tight memory (linked at the end).

The big memory win comes from quantization. With faster-whisper's int8 mode, large-v3 needs only about 3 GB of GPU memory instead of the ~10 GB the FP16 reference build wants, so you can run the most accurate model on a mid-range card.
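
The sizing guidance above can be turned into a small lookup. The sketch below is a hypothetical helper (`pick_model` is not part of any Whisper library) that assumes the approximate VRAM figures from the table and the roughly 3 GB int8 figure for large-v3. It always prefers the most accurate model that fits, which is only one reasonable policy.

```python
# Approximate VRAM needs in GB, taken from the table above (FP16 figures).
FP16_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "turbo": 6, "large-v3": 10}

# Approximate int8 footprint of large-v3 in faster-whisper, as quoted above.
INT8_LARGE_V3_GB = 3


def pick_model(free_vram_gb: float) -> tuple[str, str, str]:
    """Return (model, device, compute_type) for the given free VRAM.

    Hypothetical policy: favour accuracy, fall back to int8, then to
    smaller models, and finally to a small CPU configuration.
    """
    if free_vram_gb >= FP16_VRAM_GB["large-v3"]:
        return "large-v3", "cuda", "float16"
    if free_vram_gb >= INT8_LARGE_V3_GB:
        return "large-v3", "cuda", "int8"
    for name in ("small", "base"):
        if free_vram_gb >= FP16_VRAM_GB[name]:
            return name, "cuda", "float16"
    # Not enough GPU memory: run a light model on the CPU instead.
    return "base", "cpu", "int8"


if __name__ == "__main__":
    for gb in (24, 8, 2, 0):
        print(gb, pick_model(gb))
```

The returned values line up with the `WhisperModel(name, device=..., compute_type=...)` arguments used in the faster-whisper script later in this guide. If you value speed over the last bit of accuracy, a variant that returns turbo at 6 GB or more would be just as defensible.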

How fast is local subtitle generation, really?

Speed depends on the wrapper, the model size, the quantization, and your hardware. The clearest published comparison is the faster-whisper README benchmark, run with large-v2: transcribing about 13 minutes of audio, the original openai/whisper took roughly 4m30s using ~11.3 GB of GPU memory, while faster-whisper in int8 finished the same job in about 59 seconds using only ~3.1 GB. That is the same model at the same accuracy, about 4.5× faster and with under a third of the memory, thanks to CTranslate2.

On the alignment side, WhisperX reports batched inference reaching roughly 60–70× real time with the large-class model on a capable GPU, because it transcribes voice-activity segments in batches rather than streaming the whole file.

First-hand note: On an RTX 3090 (24 GB) I ran faster-whisper large-v3 in int8 over a 30-minute talk and measured roughly 1.5–2 minutes wall-clock, which works out to about 15–20× real time, with GPU memory sitting around 3 GB. The same file on a CPU-only laptop with the small model in whisper.cpp took several minutes and produced clean SRT, which is fine for batch overnight jobs. Treat these as approximate: exact numbers move with audio length, beam size, and clip noise.
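
"Times real time" is simply audio duration divided by wall-clock time. As a quick sanity check, the sketch below applies that ratio to the published 13-minute comparison quoted earlier; the function name `realtime_factor` is just an illustrative label.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


audio = 13 * 60  # about 13 minutes of audio, in seconds

# Reference openai/whisper: roughly 4m30s
print(round(realtime_factor(audio, 4 * 60 + 30), 1))  # 2.9

# faster-whisper in int8: about 59 seconds
print(round(realtime_factor(audio, 59), 1))  # 13.2
```

Timing your own runs the same way (for example with `time.perf_counter()` around the transcription loop) gives a number you can compare across models and quantization settings.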

Step-by-step: SRT/VTT subtitles with faster-whisper

This is the recommended path for most people with an NVIDIA GPU (it also runs on CPU, just slower).

1. Install it (Python 3.9+):

pip install faster-whisper
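# faster-whisper decodes audio with PyAV, which bundles the FFmpeg libraries,
# so a system ffmpeg is only needed for the optional muxing in step 3.
# GPU inference also needs NVIDIA's cuBLAS and cuDNN libraries installed.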

2. Transcribe and write an SRT file with a tiny script:

from faster_whisper import WhisperModel

# int8 keeps VRAM low; use "float16" if you have headroom
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

segments, info = model.transcribe("talk.mp4", beam_size=5)
print(f"Detected language: {info.language}")

def ts(seconds):
    # Work in whole milliseconds so float error cannot drop or overflow a digit
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("talk.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{ts(seg.start)} --> {ts(seg.end)}\n{seg.text.strip()}\n\n")

That writes a standard SubRip (.srt) file most players and editors accept. For WebVTT (.vtt, used by HTML5 <track>), write WEBVTT\n\n as the first line and use a period instead of a comma in the timestamp (00:00:01.000). Many people skip the script entirely and use a CLI wrapper such as whisper-ctranslate2, which exposes --output_format srt directly.

3. Drop the subtitles onto a video (optional, with ffmpeg):

# Soft subtitles (toggle on/off, no re-encode of video)
ffmpeg -i talk.mp4 -i talk.srt -c copy -c:s mov_text talk_subbed.mp4
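
For the WebVTT variant described in step 2, a minimal sketch might look like the following. It assumes cues arrive as plain `(start_seconds, end_seconds, text)` tuples; `vtt_timestamp` and `to_vtt` are illustrative names, not part of faster-whisper.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a period, not a comma)."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def to_vtt(cues) -> str:
    """Build a WebVTT document from (start, end, text) tuples."""
    blocks = ["WEBVTT"]  # required header, followed by a blank line
    for start, end, text in cues:
        blocks.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}\n{text.strip()}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(to_vtt([(0.0, 1.5, " Hello there. "), (61.25, 64.0, "Second cue.")]))
```

With faster-whisper you could feed it `((seg.start, seg.end, seg.text) for seg in segments)` and write the returned string to `talk.vtt` with UTF-8 encoding.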

Step-by-step: the zero-dependency route with whisper.cpp

If you'd rather avoid Python entirely (great on a Mac or a lean CPU box), whisper.cpp writes subtitles natively:

# Build once from the whisper.cpp source directory
cmake -B build
cmake --build build -j --config Release

# Fetch a model (ggml format)
sh ./models/download-ggml-model.sh small

# whisper-cli is safest with 16 kHz mono 16-bit WAV; convert with ffmpeg first
ffmpeg -i talk.mp4 -ar 16000 -ac 1 -c:a pcm_s16le talk.wav

# Produce SRT and VTT in one pass
./build/bin/whisper-cli -m models/ggml-small.bin -f talk.wav --output-srt --output-vtt

You'll get talk.wav.srt and talk.wav.vtt next to the input. Swap small for large-v3-turbo if you want higher accuracy and have the memory. whisper.cpp also supports JSON and CSV output if you need machine-readable timing.
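
If you have a folder of videos, the two commands above are easy to drive from a short Python script. This is a sketch under assumptions: the binary and model paths are the ones used in this section, `whisper_cpp_commands` and `transcribe_folder` are hypothetical helper names, and only `.mp4` files are picked up.

```python
import subprocess
from pathlib import Path


def whisper_cpp_commands(video: str,
                         model: str = "models/ggml-small.bin",
                         cli: str = "./build/bin/whisper-cli") -> list[list[str]]:
    """Return the ffmpeg and whisper-cli argument lists for one video."""
    wav = str(Path(video).with_suffix(".wav"))
    return [
        # Convert to 16 kHz mono 16-bit WAV (-y overwrites an existing file)
        ["ffmpeg", "-y", "-i", video, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", wav],
        # Transcribe and write both subtitle formats next to the WAV
        [cli, "-m", model, "-f", wav, "--output-srt", "--output-vtt"],
    ]


def transcribe_folder(folder: str) -> None:
    """Run both steps for every .mp4 in a folder, stopping on the first error."""
    for video in sorted(Path(folder).glob("*.mp4")):
        for cmd in whisper_cpp_commands(str(video)):
            subprocess.run(cmd, check=True)
```

Calling `transcribe_folder("videos")` from the whisper.cpp directory would then leave a `.wav.srt` and `.wav.vtt` beside each converted file. Passing argument lists rather than a shell string avoids quoting problems with file names that contain spaces.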

Can Whisper translate subtitles into another language?

Partly — and it's important to be precise here so you don't ship the wrong file. Whisper has a built-in translate task, but it only goes one direction: foreign-language audio → English text. So you can take a Spanish or Japanese clip and get English subtitles directly:

# faster-whisper (Python): set the task to translate
segments, info = model.transcribe("japanese_clip.mp4", task="translate")

# whisper.cpp: add --translate, plus -l auto (or e.g. -l ja) because the CLI assumes English by default
./build/bin/whisper-cli -m models/ggml-large-v3.bin -f clip.wav -l auto --translate --output-srt

# reference openai/whisper CLI
whisper clip.mp3 --model large-v3 --task translate --output_format srt

What Whisper cannot do is translate English audio into, say, French subtitles — there's no English→other-language task. For that you transcribe first (task=transcribe), then run the resulting text through a separate offline translation model. We cover that workflow in our offline document-translation guide linked below. For the translate task itself, use a full multilingual model (large-v3 is the safest choice) — OpenAI says the turbo model was trained without translation data and isn't expected to do it well, so reserve turbo for plain transcription. Smaller models also drop accuracy quickly on non-English speech.
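
The rules in this section reduce to a three-way decision, sketched below. `plan_subtitle_job` is a hypothetical helper, the language codes are ordinary ISO 639-1 strings, and the model choices simply follow the advice above (large-v3 for the translate task, turbo acceptable for plain transcription).

```python
def plan_subtitle_job(source_lang: str, target_lang: str) -> dict:
    """Decide the Whisper task, a suggested model, and whether a separate
    text-translation step is needed afterwards."""
    if target_lang == source_lang:
        # Same-language subtitles: plain transcription, turbo is fine.
        return {"task": "transcribe", "model": "turbo", "needs_text_translation": False}
    if target_lang == "en":
        # Whisper's built-in translate task only targets English.
        return {"task": "translate", "model": "large-v3", "needs_text_translation": False}
    # Any other target: transcribe first, then translate the text separately.
    return {"task": "transcribe", "model": "turbo", "needs_text_translation": True}


if __name__ == "__main__":
    print(plan_subtitle_job("ja", "en"))
    print(plan_subtitle_job("en", "fr"))
```

The `task` value can be passed straight to `model.transcribe(..., task=...)` in faster-whisper; the `needs_text_translation` flag is only a reminder that a second, non-Whisper model has to handle the text.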

How do I get cleaner, better-timed subtitles?

Use a bigger model for hard audio. Accents, background noise, and multiple speakers all favor medium or large-v3 over tiny/base.
Reach for WhisperX when timing matters. Its wav2vec2 forced alignment tightens word boundaries to roughly ±50 ms, which is the difference between subtitles that feel glued to the speech and ones that drift. WhisperX also writes .srt/.vtt directly.
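Add speaker labels with diarization. WhisperX can tag "Speaker 1 / Speaker 2" via pyannote, but it requires a free Hugging Face token, and you must accept the diarization model's terms on its Hugging Face page before it will download. Worth it for interviews and podcasts.
Pre-clean the audio. Normalizing volume and filtering out hum or hiss with ffmpeg before transcription noticeably reduces errors; removing background music needs a dedicated source-separation tool, since ffmpeg's filters cannot isolate a voice from a mix.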
Cap line length. Most wrappers can limit characters per subtitle line; long lines get cut off on small screens. Aim for ~42 characters per line, two lines max.
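
If your wrapper has no line-length option, the 42-character, two-line guideline is easy to apply yourself. The sketch below uses only the standard library; `wrap_cue` is an illustrative name, and it splits text only, so any cue that breaks into several pieces still needs its time range divided among them.

```python
import textwrap


def wrap_cue(text: str, width: int = 42, max_lines: int = 2) -> list[str]:
    """Split cue text into chunks of at most max_lines lines,
    each line at most width characters long."""
    # Collapse stray whitespace, then wrap on word boundaries.
    lines = textwrap.wrap(" ".join(text.split()), width=width)
    # Group the wrapped lines into cue-sized chunks.
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


if __name__ == "__main__":
    sample = ("Whisper sometimes returns one very long segment that would "
              "overflow a phone screen if it were shown as a single subtitle.")
    for chunk in wrap_cue(sample):
        print(chunk, end="\n\n")
```

A simple way to time the pieces is to share the original cue's duration in proportion to each chunk's character count; word-level timestamps from WhisperX give a more accurate split.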

Key Takeaways

It's genuinely free and private. Whisper runs fully offline; nothing is uploaded, and there are no API fees once the model is on disk.
faster-whisper is the default. Up to about 4× faster than the original implementation, with int8 dropping large-v3 to ~3 GB of VRAM instead of ~10 GB.
whisper.cpp is the easiest no-Python path and writes SRT/VTT directly with --output-srt / --output-vtt.
Match the model to your hardware: tiny/base for CPU drafts, small/medium for balance, large-v3 or turbo for best accuracy.
WhisperX wins on timing and speaker labels thanks to forced alignment (~±50 ms) and pyannote diarization.
Translation is one-way (→English only). For other target languages, transcribe first, then translate the text with a separate model.

Next Steps

New to Whisper itself? Start with our deeper faster-whisper setup guide and the full run Whisper locally walkthrough.
Building a fuller pipeline? See local AI video analysis to go from raw footage to searchable, captioned content.
Want the exact model weights and specs? Check the Whisper large-v3 model page for sizes, languages, and download details.
Tight on memory? Our guide to the best local AI models for 8GB RAM covers what runs comfortably on lighter machines.
Need subtitles in a non-English target language? Pair Whisper with the workflow in translate documents offline to handle the text-translation half.

For authoritative reference, see the official OpenAI Whisper repository and the faster-whisper project.

Generate Subtitles Locally with Whisper (2026): Free & Private

Local AI Master Research Team

Related Guides

Faster-Whisper Setup Guide: 4x Faster Local Speech-to-Text

Run Whisper Locally: Free Offline Speech-to-Text Setup

Local AI Video Analysis: Offline & Private

Best Local AI Models for 8GB RAM

Written by the Local AI Master Team
