Generate Subtitles Locally with Whisper (2026): Free & Private

June 20, 2026
12 min read
Local AI Master Research Team

Yes — you can generate accurate SRT and VTT subtitles entirely offline with OpenAI's Whisper, free and private, with no cloud upload. The fastest practical route in 2026 is faster-whisper (a CTranslate2 rewrite that runs roughly 4× faster than the original PyTorch Whisper and uses far less memory); pick whisper.cpp for a zero-dependency CPU binary that writes SRT/VTT directly, and WhisperX when you need tight word-level timing or speaker labels. All three run the same Whisper model weights (tiny through large-v3), so your only real decisions are which wrapper to run, which model size your hardware can hold, and whether you also want Whisper's built-in translate-to-English option.

The appeal is simple: your video and audio never leave your machine, there are no per-minute API fees, and once the model is downloaded it keeps working with no internet. This guide covers the three main tools, the model-size/VRAM tradeoff, exact commands to produce SRT and VTT files, and how to translate foreign-language audio into English subtitles.

Which Whisper tool should I use for subtitles?

There are three local Whisper implementations worth knowing. They all read the same model files but differ in speed, dependencies, and extra features.

  • whisper.cpp — a plain C/C++ port with no Python. It's a single self-contained binary that runs well on CPU (and Apple Silicon via Metal), and it writes subtitles directly with the --output-srt and --output-vtt flags. Best when you want the simplest possible offline setup.
  • faster-whisper — a Python reimplementation built on the CTranslate2 inference engine. It reaches roughly 4× the speed of the reference OpenAI implementation at the same accuracy while using less memory, and int8 quantization cuts GPU memory dramatically. This is the default recommendation for most people.
  • WhisperX — builds on faster-whisper and adds wav2vec2 forced alignment for accurate word-level timestamps (about ±50 ms vs roughly ±500 ms in vanilla Whisper), plus optional speaker diarization through pyannote and batched inference. Use it for interviews, karaoke-style word highlighting, or any subtitle that must snap tightly to speech.

Tool | Engine | Runs on | Direct SRT/VTT | Best for
whisper.cpp | ggml (C/C++) | CPU, Apple Metal, CUDA | Yes (--output-srt / --output-vtt) | Simplest offline, low-dependency
faster-whisper | CTranslate2 | CPU, NVIDIA GPU | Via small script / CLI wrappers | Fastest general use
WhisperX | faster-whisper + wav2vec2 | NVIDIA GPU (best) | Yes (writes .srt/.vtt) | Word-level timing, speaker labels
openai/whisper | PyTorch (reference) | CPU, NVIDIA GPU | Yes (--output_format srt) | Baseline / fine-tuning

Note: WhisperX leans heavily on GPU forced alignment and diarization, so it shines on an NVIDIA card. whisper.cpp is the friendliest if you only have a CPU or a Mac.

Which Whisper model size should I pick (tiny → large-v3)?

Whisper ships in several sizes. Bigger models are more accurate (especially on noisy audio, accents, and non-English speech) but need more memory and run slower. The figures below are OpenAI's published sizes plus the VRAM guidance from the official model card; treat VRAM as approximate because the exact number depends on precision and the wrapper you use.

Model | Parameters | Approx. VRAM (FP16) | Relative speed | Good for
tiny | 39M | ~1 GB | ~10× | Quick drafts, clean English
base | 74M | ~1 GB | ~7× | Lightweight CPU subtitles
small | 244M | ~2 GB | ~4× | Solid balance on modest hardware
medium | 769M | ~5 GB | ~2× | Good accuracy, accents
large-v3 | 1550M | ~10 GB | 1× (baseline) | Best accuracy, hard audio, other languages
turbo (large-v3-turbo) | 809M | ~6 GB | ~8× vs large-v3 | Near-large accuracy, much faster
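
As a rough illustration of how the table can drive a choice, the sketch below picks the most accurate model whose approximate FP16 footprint fits a given amount of free VRAM. The function name pick_model and the accuracy ordering (turbo ranked just under large-v3) are assumptions made for this example, and the VRAM figures are the approximate ones from the table, not measurements.

```python
# Approximate FP16 VRAM needs in GB, taken from the model table,
# listed from most to least accurate (ordering is an assumption).
MODEL_VRAM_GB = [
    ("large-v3", 10),
    ("turbo", 6),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model(free_vram_gb):
    """Return the most accurate model whose approximate footprint fits."""
    for name, needed_gb in MODEL_VRAM_GB:
        if needed_gb <= free_vram_gb:
            return name
    return None  # nothing fits: fall back to CPU or a quantized build


if __name__ == "__main__":
    for vram in (24, 8, 5, 1):
        print(f"{vram} GB free -> {pick_model(vram)}")
```

Quantized builds change the picture considerably, so treat the result as a starting point and not a hard limit.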

A few practical notes verified from the official Whisper repo and model card:

  • The .en variants (tiny.en, base.en, small.en, medium.en) are English-only and tend to do better on English, especially at the smaller sizes. There is no large.en — for English at the top end you just use large-v3.
  • turbo (released October 2024) is a pruned, fine-tuned large-v3 with a much smaller decoder (4 decoder layers instead of 32). It keeps most of large-v3's accuracy but runs roughly 8× faster, which makes it an excellent default for subtitles if you have ~6 GB of VRAM.
  • On CPU-only machines, stick to base or small and expect transcription to take noticeably longer than real time. If you only have 8 GB of system RAM, see our companion guide on running models in tight memory (linked at the end).

The big memory win comes from quantization. With faster-whisper's int8 mode, large-v3 needs only about 2.5–3 GB of GPU memory instead of the ~10 GB the FP16 reference build wants — so you can run the most accurate model on a mid-range card.
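
A back-of-the-envelope calculation helps explain where part of that saving comes from. The sketch below estimates the memory taken by the weights alone at two precisions; it deliberately ignores activations, decoding buffers, and runtime overhead, which is why measured totals are higher than these figures. The helper name weight_memory_gb is made up for this example.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


LARGE_V3_PARAMS = 1_550_000_000  # parameter count from the model table

fp16_gb = weight_memory_gb(LARGE_V3_PARAMS, 2)  # 2 bytes per FP16 weight
int8_gb = weight_memory_gb(LARGE_V3_PARAMS, 1)  # 1 byte per int8 weight

if __name__ == "__main__":
    print(f"FP16 weights: {fp16_gb:.2f} GB")
    print(f"int8 weights: {int8_gb:.2f} GB")
```

Halving the bytes per weight halves the weight storage; the rest of the gap between the reference build and faster-whisper comes from the runtime itself, not from quantization.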

How fast is local subtitle generation, really?

Speed depends on the wrapper, the model size, the quantization, and your hardware. The clearest published comparison: transcribing about 13 minutes of audio, the original openai/whisper took roughly 4m30s using ~11.3 GB of GPU memory, while faster-whisper in int8 finished the same job in about 59 seconds using only ~3.1 GB — same model, same accuracy, a large speed and memory improvement from CTranslate2.

On the alignment side, WhisperX reports batched inference reaching roughly 60–70× real time with the large-class model on a capable GPU, because it transcribes voice-activity segments in batches rather than streaming the whole file.

First-hand note: On an RTX 3090 (24 GB) I ran faster-whisper large-v3 in int8 over a 30-minute talk and measured roughly 1.5–2 minutes wall-clock, which works out to about 15–20× real time, with GPU memory sitting around 3 GB. The same file on a CPU-only laptop with the small model in whisper.cpp took several minutes and produced clean SRT, which is fine for overnight batch jobs. Treat these as approximate: exact numbers move with audio length, beam size, and clip noise.
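
To compare timings like these on a common scale, it can help to convert them to a real-time factor (audio duration divided by wall-clock time). The sketch below applies that to the published 13-minute comparison quoted earlier in this section; the function name realtime_factor is just an illustrative choice.

```python
def realtime_factor(audio_seconds, wall_seconds):
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


AUDIO_SECONDS = 13 * 60  # about 13 minutes of audio

reference_rtf = realtime_factor(AUDIO_SECONDS, 4 * 60 + 30)  # openai/whisper: ~4m30s
int8_rtf = realtime_factor(AUDIO_SECONDS, 59)                # faster-whisper int8: ~59s

if __name__ == "__main__":
    print(f"openai/whisper:      {reference_rtf:.1f}x real time")
    print(f"faster-whisper int8: {int8_rtf:.1f}x real time")
```

With those inputs the reference build lands just under 3× real time and the int8 build a little above 13×, which matches the roughly 4× speedup claimed for faster-whisper.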

Step-by-step: SRT/VTT subtitles with faster-whisper

This is the recommended path for most people with an NVIDIA GPU (it also runs on CPU, just slower).

1. Install it (Python 3.9+):

pip install faster-whisper
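# No system ffmpeg is needed for transcription: faster-whisper decodes audio with PyAV,
# which bundles the FFmpeg libraries. You only need the ffmpeg binary for the optional step 3.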

2. Transcribe and write an SRT file with a tiny script:

from faster_whisper import WhisperModel

# int8 keeps VRAM low; use "float16" if you have headroom
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

segments, info = model.transcribe("talk.mp4", beam_size=5)
print(f"Detected language: {info.language}")

def ts(seconds):
    # Work in whole milliseconds: subtracting floats can truncate 1.001 s to ",000"
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("talk.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{ts(seg.start)} --> {ts(seg.end)}\n{seg.text.strip()}\n\n")

That writes a standard SubRip (.srt) file most players and editors accept. For WebVTT (.vtt, used by HTML5 <track>), write WEBVTT\n\n as the first line and use a period instead of a comma in the timestamp (00:00:01.000). Many people skip the script entirely and use a CLI wrapper such as whisper-ctranslate2, which exposes --output_format srt directly.
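
The WebVTT variant described above can be sketched as follows. To keep the example runnable without a model, it takes plain (start, end, text) tuples; with faster-whisper you would pass (seg.start, seg.end, seg.text) for each segment. The helper names vtt_timestamp and to_vtt are invented for this sketch.

```python
def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a period before the milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_vtt(segments):
    """Build WebVTT text from an iterable of (start_seconds, end_seconds, text)."""
    lines = ["WEBVTT", ""]  # header line followed by a blank line
    for start, end, text in segments:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text.strip())
        lines.append("")  # blank line ends each cue
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [(0.0, 1.5, " Hello there."), (1.5, 3.25, " Welcome back.")]
    print(to_vtt(demo))
```

Writing the returned string to talk.vtt with UTF-8 encoding gives a file that an HTML5 <track> element can load.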

3. Drop the subtitles onto a video (optional, with ffmpeg):

# Soft subtitles (toggle on/off, no re-encode of video)
ffmpeg -i talk.mp4 -i talk.srt -c copy -c:s mov_text talk_subbed.mp4

Step-by-step: the zero-dependency route with whisper.cpp

If you'd rather avoid Python entirely (great on a Mac or a lean CPU box), whisper.cpp writes subtitles natively:

# Build once (run from the whisper.cpp source directory), then fetch a model (ggml format)
cmake -B build
cmake --build build --config Release
sh ./models/download-ggml-model.sh small

# whisper.cpp expects 16 kHz mono 16-bit WAV; convert with ffmpeg first
ffmpeg -i talk.mp4 -ar 16000 -ac 1 -c:a pcm_s16le talk.wav

# Produce SRT and VTT in one pass
./build/bin/whisper-cli -m models/ggml-small.bin -f talk.wav --output-srt --output-vtt

You'll get talk.wav.srt and talk.wav.vtt next to the input. Swap small for large-v3-turbo if you want higher accuracy and have the memory. whisper.cpp also supports JSON and CSV output if you need machine-readable timing.
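
If you want to run the same two steps over many files, one option is to build the command lines in Python and hand them to subprocess. The sketch below only constructs the argument lists, using the flags and paths shown above; the function name whisper_cpp_commands and its defaults are assumptions for illustration.

```python
from pathlib import Path


def whisper_cpp_commands(video, model="models/ggml-small.bin",
                         cli="./build/bin/whisper-cli"):
    """Return (convert, transcribe) argument lists for one input file."""
    video = Path(video)
    wav = video.with_suffix(".wav")
    # Step 1: convert to 16 kHz mono 16-bit PCM WAV
    convert = ["ffmpeg", "-i", str(video), "-ar", "16000", "-ac", "1",
               "-c:a", "pcm_s16le", str(wav)]
    # Step 2: transcribe and write both SRT and VTT
    transcribe = [cli, "-m", model, "-f", str(wav),
                  "--output-srt", "--output-vtt"]
    return convert, transcribe


if __name__ == "__main__":
    for cmd in whisper_cpp_commands("talk.mp4"):
        print(" ".join(cmd))
```

To execute them instead of printing, pass each list to subprocess.run(cmd, check=True) so a failed conversion stops the batch before transcription starts.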

Can Whisper translate subtitles into another language?

Partly — and it's important to be precise here so you don't ship the wrong file. Whisper has a built-in translate task, but it only goes one direction: foreign-language audio → English text. So you can take a Spanish or Japanese clip and get English subtitles directly:

# faster-whisper (Python): set the task to translate
segments, info = model.transcribe("japanese_clip.mp4", task="translate")

# whisper.cpp: add the --translate flag, plus -l auto because the default source language is English
./build/bin/whisper-cli -m models/ggml-large-v3.bin -f clip.wav -l auto --translate --output-srt

# reference openai/whisper CLI
whisper clip.mp3 --model large-v3 --task translate --output_format srt

What Whisper cannot do is translate English audio into, say, French subtitles — there's no English→other-language task. For that you transcribe first (task=transcribe), then run the resulting text through a separate offline translation model. We cover that workflow in our offline document-translation guide linked below. For the translate task itself, use a full multilingual model (large-v3 is the safest choice) — OpenAI says the turbo model was trained without translation data and isn't expected to do it well, so reserve turbo for plain transcription. Smaller models also drop accuracy quickly on non-English speech.
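
The transcribe-then-translate workflow can be sketched as a small function that keeps each segment's timing and swaps only the text. The translator here is a placeholder: fake_translate is a hypothetical stand-in, not a real model, and you would replace it with whatever offline translation tool you use.

```python
def translate_segments(segments, translate_text):
    """Translate the text of (start, end, text) segments, keeping timestamps."""
    return [(start, end, translate_text(text.strip()))
            for start, end, text in segments]


def fake_translate(text):
    # Hypothetical stand-in for a real offline translation model
    return f"[fr] {text}"


if __name__ == "__main__":
    english = [(0.0, 2.0, " Good morning."), (2.0, 4.5, " Let's begin.")]
    for start, end, text in translate_segments(english, fake_translate):
        print(start, end, text)
```

Because timestamps are untouched, the translated segments can go straight into the same SRT or VTT writer used for the original transcript.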

How do I get cleaner, better-timed subtitles?

  • Use a bigger model for hard audio. Accents, background noise, and multiple speakers all favor medium or large-v3 over tiny/base.
  • Reach for WhisperX when timing matters. Its wav2vec2 forced alignment tightens word boundaries to roughly ±50 ms, which is the difference between subtitles that feel glued to the speech and ones that drift. WhisperX also writes .srt/.vtt directly.
  • Add speaker labels with diarization. WhisperX can tag "Speaker 1 / Speaker 2" via pyannote, but it requires a free Hugging Face token to download the diarization model — worth it for interviews and podcasts.
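  • Pre-clean the audio. Normalizing volume with ffmpeg before transcription noticeably reduces errors, and so does removing background music where you can (that usually needs a separate source-separation tool, since ffmpeg filters alone won't isolate speech).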
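  • Cap line length. Long lines get cut off on small screens, so aim for ~42 characters per line, two lines max. whisper.cpp has a --max-len (-ml) option for the maximum segment length in characters, and the reference openai/whisper CLI has --max_line_width and --max_line_count (both require --word_timestamps True).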
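
If your wrapper has no line-length option, the cap can also be applied afterwards. The sketch below is one possible approach using Python's standard textwrap module: it wraps text to 42 characters and groups the lines into blocks of at most two. The function name wrap_subtitle is an assumption, and the sketch does not split the timing; if a segment yields several blocks you would still need to divide its time range between them.

```python
import textwrap


def wrap_subtitle(text, max_chars=42, max_lines=2):
    """Wrap text to max_chars per line and group lines into subtitle blocks."""
    lines = textwrap.wrap(text.strip(), width=max_chars)
    blocks = []
    for i in range(0, len(lines), max_lines):
        blocks.append("\n".join(lines[i:i + max_lines]))
    return blocks


if __name__ == "__main__":
    long_text = ("Whisper often returns one long segment for a full sentence, "
                 "which is too wide for a phone screen without wrapping.")
    for block in wrap_subtitle(long_text):
        print(block)
        print("---")
```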

Key Takeaways

  1. It's genuinely free and private. Whisper runs fully offline; nothing is uploaded, and there are no API fees once the model is on disk.
  2. faster-whisper is the default. Roughly 4× faster than the original implementation, with int8 dropping large-v3 to ~3 GB of VRAM instead of ~10 GB.
  3. whisper.cpp is the easiest no-Python path and writes SRT/VTT directly with --output-srt / --output-vtt.
  4. Match the model to your hardware: tiny/base for CPU drafts, small/medium for balance, large-v3 or turbo for best accuracy.
  5. WhisperX wins on timing and speaker labels thanks to forced alignment (~±50 ms) and pyannote diarization.
  6. Translation is one-way (→English only). For other target languages, transcribe first, then translate the text with a separate model.

Next Steps

For authoritative reference, see the official OpenAI Whisper repository and the faster-whisper project.

Published: June 20, 2026 · Last Updated: June 20, 2026 · Manually Reviewed
