Faster-Whisper Setup Guide (2026): 4x Faster Local Speech-to-Text
Faster-Whisper is the production-grade reimplementation of OpenAI's Whisper. Same models, same accuracy, 4x faster on GPU and 2x faster on CPU. Built on CTranslate2 with INT8 / FP16 quantization, fused attention kernels, and SIMD-optimized CPU paths. For any local speech-to-text deployment in 2026 — meeting transcription, podcast workflow, real-time captions, call-center audio, voice agents, accessibility tools — Faster-Whisper is the right default.
This guide covers everything: installation across platforms, model selection (tiny through large-v3-turbo), quantization choices (INT8 vs FP16), batched and streaming inference, WhisperX / Distil-Whisper integration, OpenAI-compatible API exposure via LocalAI or whisper-asr-webservice, real-time captioning patterns, and benchmarks across hardware.
Table of Contents
- What Faster-Whisper Is
- Faster-Whisper vs OpenAI Whisper vs WhisperX vs Distil-Whisper
- Whisper Model Variants
- Hardware Requirements
- Installation: pip, Docker, Source
- Your First Transcription
- Quantization: INT8 vs FP16 vs FP32
- Batched Transcription
- Streaming Transcription
- Word-Level Timestamps
- VAD Pre-Segmentation
- Distil-Whisper Integration
- WhisperX for Speaker Diarization
- OpenAI-Compatible API via LocalAI
- whisper-asr-webservice (Docker)
- Performance Benchmarks
- Multilingual Accuracy
- Tuning Recipes
- Troubleshooting
- FAQ
What Faster-Whisper Is {#what-it-is}
Faster-Whisper (SYSTRAN/faster-whisper on GitHub) is a Python wrapper around CTranslate2-converted Whisper models. CTranslate2 is OpenNMT's high-performance inference engine for transformer models — it provides custom CUDA kernels, INT8/FP16 quantization, fused operations, and SIMD-optimized CPU code.
Result: same Whisper model accuracy with significantly better throughput and lower memory.
Project: github.com/SYSTRAN/faster-whisper. License: MIT. Active maintenance.
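You normally never convert anything yourself, since faster-whisper downloads pre-converted models automatically, but the conversion step is worth seeing once. A minimal sketch using CTranslate2's Python converter API (it mirrors the ct2-transformers-converter CLI, requires transformers, and the argument names are worth double-checking against your installed ctranslate2 version):
from ctranslate2.converters import TransformersConverter

# Convert a Hugging Face Whisper checkpoint to CTranslate2 format with FP16 weights
converter = TransformersConverter(
    "openai/whisper-large-v3",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("whisper-large-v3-ct2", quantization="float16")
# The output directory can be passed to WhisperModel in place of a model name.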
Faster-Whisper vs OpenAI Whisper vs WhisperX vs Distil-Whisper {#comparison}
| Project | Speed (RTF, RTX 4090) | Accuracy | Features |
|---|---|---|---|
| openai-whisper | 18x | Reference | Word ts, no diarization |
| Faster-Whisper | 72x | Same as Whisper | + INT8 / batched / VAD |
| WhisperX | 70x | Same | + diarization + better word ts |
| Distil-Whisper | 100x+ | -1% WER | Distilled, smaller |
| whisper.cpp | 50x (CPU) / 60x (GPU) | Same | C++, GGUF |
For most users: Faster-Whisper. For subtitles / interviews with speaker labels: WhisperX. For latency-critical real-time: Distil-Whisper. For embedded / no-Python: whisper.cpp.
Whisper Model Variants {#models}
| Model | Params | Size (FP16) | Speed | Accuracy | Best For |
|---|---|---|---|---|---|
| tiny | 39M | 75 MB | Fastest | Lowest | Embedded, real-time |
| base | 74M | 142 MB | Very fast | Low | Real-time UIs |
| small | 244M | 466 MB | Fast | Decent | Fast batch |
| medium | 769M | 1.5 GB | Medium | Good | Most use cases |
| large-v2 | 1550M | 2.9 GB | Slower | Excellent | Legacy |
| large-v3 | 1550M | 2.9 GB | Slower | Best | Accuracy-critical |
| large-v3-turbo | 809M | 1.5 GB | Fast | Near-large-v3 (English) | Default for English |
Released October 2024: large-v3-turbo. 8x faster decoding than large-v3 with minimal accuracy loss for English. For non-English, stick with large-v3.
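In practice the model variant is just the string you pass to WhisperModel. A small sketch showing how to pin the downloaded weights to a local cache directory (download_root and local_files_only are standard WhisperModel arguments in current releases; verify against your version):
from faster_whisper import WhisperModel

# English-heavy workload: large-v3-turbo is the usual sweet spot
model = WhisperModel(
    "large-v3-turbo",
    device="cuda",
    compute_type="float16",
    download_root="/models/faster-whisper",  # cache converted weights here
    local_files_only=False,                  # set True for air-gapped machines
)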
Hardware Requirements {#requirements}
| Hardware | large-v3 (RTF) | large-v3-turbo (RTF) |
|---|---|---|
| RTX 4090 (FP16) | 72x | 250x+ |
| RTX 4070 (FP16) | 50x | 180x |
| RTX 3060 (INT8) | 35x | 120x |
| Apple M4 Max (Core ML / MPS) | 25x | 90x |
| Ryzen 7 7700X (INT8 CPU) | 10x | 35x |
| Raspberry Pi 5 (CPU) | 0.5x | 1.5x |
CPU-only works for batch use cases. GPU is for real-time / high-throughput.
Installation: pip, Docker, Source {#installation}
pip
python3.11 -m venv ~/venvs/fwhisper
source ~/venvs/fwhisper/bin/activate
# Works for both CPU and GPU inference
pip install faster-whisper
For CUDA, ensure cuBLAS and cuDNN are installed (system or via NVIDIA pip wheels):
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.0.0
Docker
docker run --gpus all --rm -it \
-v $(pwd):/workspace \
ghcr.io/systran/faster-whisper:latest \
bash
whisper-asr-webservice (one-liner OpenAI-compatible)
docker run -d -p 9000:9000 \
-e ASR_MODEL=large-v3 \
-e ASR_ENGINE=faster_whisper \
--gpus all \
onerahmet/openai-whisper-asr-webservice:latest-gpu
Now http://localhost:9000/asr and /v1/audio/transcriptions work.
Your First Transcription {#first-transcription}
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} (prob {info.language_probability:.2f})")
for segment in segments:
print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
That's it. Note that segments is a generator: transcription runs lazily as you iterate. To collect the full text:
result_text = " ".join(seg.text for seg in segments)
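Because every segment carries start, end, and text, turning the output into subtitles takes only a few lines. A minimal sketch that writes segment-level SRT (the timestamp helper is just for illustration):
from faster_whisper import WhisperModel

def srt_timestamp(seconds: float) -> str:
    # SRT timestamps use HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)

with open("audio.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n")
        f.write(f"{seg.text.strip()}\n\n")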
Quantization: INT8 vs FP16 vs FP32 {#quantization}
compute_type parameter:
- float16 — default for GPU, best speed/quality balance
- int8_float16 — INT8 weights, FP16 activations (GPU)
- int8 — INT8 weights, INT8 activations (best for CPU)
- float32 — reference, slowest
| Compute type | Speed | Accuracy | Memory |
|---|---|---|---|
| float32 | 1.0x | reference | 6 GB |
| float16 | 2.0x | reference | 3 GB |
| int8_float16 | 2.5x | -0.1% WER | 1.8 GB |
| int8 | 2.0x (CPU best) | -0.3% WER | 1.5 GB |
For GPU: float16 default; bump to int8_float16 if you need lower VRAM. For CPU: int8.
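If the same script has to run on both GPU and CPU machines, compute_type can be chosen at startup. A minimal sketch (assumes ctranslate2's get_cuda_device_count helper, which ships with faster-whisper's CTranslate2 dependency):
import ctranslate2
from faster_whisper import WhisperModel

has_gpu = ctranslate2.get_cuda_device_count() > 0
model = WhisperModel(
    "large-v3",
    device="cuda" if has_gpu else "cpu",
    compute_type="float16" if has_gpu else "int8",  # per the table above
)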
Batched Transcription {#batched}
For long recordings, the batched pipeline splits the audio on VAD boundaries and transcribes the chunks in parallel:
from faster_whisper import WhisperModel, BatchedInferencePipeline
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)
segments, info = batched.transcribe("long_audio.mp3", batch_size=16)
Throughput on RTX 4090 large-v3 batch_size=16: ~250x real-time. Use for transcribing archives, podcast back-catalogs, call-center recordings.
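The same pipeline object can be reused to churn through a whole back-catalog. A minimal sketch (directory and file naming are illustrative):
from pathlib import Path
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

for path in sorted(Path("podcasts").glob("*.mp3")):
    segments, info = batched.transcribe(str(path), batch_size=16)
    text = " ".join(seg.text.strip() for seg in segments)
    path.with_suffix(".txt").write_text(text, encoding="utf-8")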
Streaming Transcription {#streaming}
Faster-Whisper itself is not designed for streaming — it expects complete audio. For streaming use:
- whisper_streaming (community): polls Faster-Whisper on a sliding window with VAD-aware re-segmentation. Latency ~500-800 ms.
- WhisperLive: Faster-Whisper-based real-time server with web UI.
- VOSK + Whisper hybrid: VOSK for low-latency partial, Faster-Whisper for confirmed segments.
For sub-200 ms real-time captions, consider Moshi or Voxtral. For 500-800 ms latency, Faster-Whisper streaming is fine.
# Conceptual streaming pattern: transcribe the mic buffer every ~5 seconds.
# Note: in production, hand the buffer to a worker thread/queue instead of
# transcribing inside the audio callback, which would otherwise block capture.
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
buffer = np.zeros(0, dtype=np.float32)

def callback(indata, frames, time, status):
    global buffer
    buffer = np.concatenate([buffer, indata[:, 0]])
    if len(buffer) >= 16000 * 5:  # 5 seconds of audio at 16 kHz
        segments, _ = model.transcribe(buffer, beam_size=1)
        for seg in segments:
            print(seg.text)
        buffer = np.zeros(0, dtype=np.float32)

with sd.InputStream(samplerate=16000, channels=1, callback=callback):
    sd.sleep(60_000)  # run for 60 seconds
Word-Level Timestamps {#word-timestamps}
segments, info = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
for word in segment.words:
print(f"{word.start:.2f} - {word.end:.2f}: {word.word}")
Word-level timestamps add ~10% overhead. Useful for subtitling, video editing alignment, karaoke-style captioning.
For higher-accuracy word timestamps, use WhisperX which does forced alignment via wav2vec2 phoneme models.
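Each word object also exposes a probability score in current faster-whisper releases, which is useful for flagging uncertain passages before they reach a human reviewer. A minimal sketch (the 0.5 threshold is illustrative):
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", word_timestamps=True)

LOW_CONFIDENCE = 0.5
for segment in segments:
    for word in segment.words:
        if word.probability < LOW_CONFIDENCE:
            print(f"review: {word.word!r} at {word.start:.2f}s (p={word.probability:.2f})")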
VAD Pre-Segmentation {#vad}
segments, info = model.transcribe("audio.mp3", vad_filter=True, vad_parameters=dict(min_silence_duration_ms=500))
VAD (Voice Activity Detection) splits audio at silence boundaries before transcription. Benefits:
- Better accuracy (Whisper isn't confused by long silences)
- Faster (skips silent regions)
- Better word timestamps
Faster-Whisper bundles a Silero VAD model. Default settings work well; tune min_silence_duration_ms for your audio type.
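A slightly fuller example of tuning the VAD. The keys follow faster-whisper's VadOptions in recent releases (threshold, min_silence_duration_ms, speech_pad_ms); the values are starting points, not recommendations:
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "call_recording.mp3",              # illustrative file
    vad_filter=True,
    vad_parameters=dict(
        threshold=0.5,                 # speech probability cutoff
        min_silence_duration_ms=500,   # split on pauses longer than this
        speech_pad_ms=200,             # keep some context around each chunk
    ),
)
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")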
Distil-Whisper Integration {#distil}
from faster_whisper import WhisperModel
# Distil-large-v3 — 6x faster than large-v3 at -1% WER (English)
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
Distil-Whisper is from Hugging Face (distil-whisper/distil-large-v3). Smaller decoder + same encoder. Best for:
- English-only deployments
- Latency-critical real-time
- Limited compute budgets
For non-English, stick with large-v3 or large-v3-turbo.
WhisperX for Speaker Diarization {#whisperx}
pip install whisperx
import whisperx
device = "cuda"
audio = whisperx.load_audio("interview.mp3")
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)
# Word-level alignment
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
# Speaker diarization (needs HF token for pyannote)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
# Each segment now has "speaker": "SPEAKER_00", "SPEAKER_01", etc.
For interviews, podcasts with multiple speakers, meetings — WhisperX is the right tool.
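To turn the diarized result into a readable speaker-labeled transcript, something like this works (continuing from the result object above; after assign_word_speakers each segment dictionary carries start, end, text, and speaker keys):
for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")
    print(f"[{seg['start']:7.2f}s - {seg['end']:7.2f}s] {speaker}: {seg['text'].strip()}")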
OpenAI-Compatible API via LocalAI {#localai}
# models/whisper.yaml
name: whisper-1
backend: faster-whisper
parameters:
model: large-v3
compute_type: float16
Then OpenAI clients work:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")
with open("audio.mp3", "rb") as f:
transcript = client.audio.transcriptions.create(
model="whisper-1",
file=f,
)
print(transcript.text)
See LocalAI Setup Guide.
whisper-asr-webservice (Docker) {#asr-webservice}
The simplest production deployment:
docker run -d --name whisper \
-p 9000:9000 \
-e ASR_MODEL=large-v3 \
-e ASR_ENGINE=faster_whisper \
-e ASR_DEVICE=cuda \
--gpus all \
--restart unless-stopped \
onerahmet/openai-whisper-asr-webservice:latest-gpu
Endpoints:
- POST /asr — multipart upload, returns JSON / VTT / SRT / text
- POST /v1/audio/transcriptions — OpenAI-compatible
- POST /detect-language — language detection only
For multi-tenant production, put behind LiteLLM or similar gateway.
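Calling the plain /asr endpoint from Python looks like this. The audio_file field name and the output/task query parameters follow the webservice's documented API; verify them against the version you deploy:
import requests

with open("meeting.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:9000/asr",
        params={"output": "json", "task": "transcribe"},
        files={"audio_file": f},
    )
resp.raise_for_status()
print(resp.json())  # JSON output includes the transcript text and segments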
Performance Benchmarks {#benchmarks}
Transcribing a 1-hour podcast, large-v3, FP16:
| Hardware | Faster-Whisper | openai-whisper |
|---|---|---|
| RTX 4090 | 50 sec | 200 sec |
| RTX 3090 | 90 sec | 360 sec |
| RTX 3060 | 165 sec | 660 sec |
| Apple M4 Max | 145 sec | 540 sec |
| Ryzen 7 7700X (CPU) | 360 sec | 1,800 sec |
| Raspberry Pi 5 (CPU) | 7,200 sec | 28,000 sec |
Faster-Whisper is roughly 4x faster than openai-whisper across all of this hardware.
For batched transcription at large scale (e.g., transcribing 1,000 hours): aggregate throughput on 8x RTX 4090 with batch_size=16 is ~2,000x real-time — 1,000 hours in ~30 minutes wall time.
Multilingual Accuracy {#multilingual}
Whisper large-v3 WER on FLEURS benchmark:
| Tier | Languages | WER |
|---|---|---|
| Excellent | EN, ES, FR, DE, IT, PT, NL | 4-9% |
| Good | RU, JA, KO, ZH, AR, PL, TR | 9-18% |
| Fair | HI, VI, TH, FA, HE | 15-25% |
| Limited | many low-resource languages | 25-50%+ |
Use language="en" to skip language detection (faster, more accurate when language is known). For non-English, large-v3 outperforms large-v3-turbo (which is English-tuned).
Tuning Recipes {#tuning}
Real-time English caption server
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
# Pair with VAD streaming pattern; expect 200-300 ms latency
Audiobook / podcast batch
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)
# Use batched.transcribe() with batch_size=16
CPU-only office workstation
model = WhisperModel("large-v3", device="cpu", compute_type="int8", cpu_threads=8)
# 1 hour of audio in ~6 minutes
Multilingual customer call analysis
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(audio, language=None) # auto-detect
# WhisperX for speaker diarization on top
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| cuDNN not found | Missing system lib | pip install nvidia-cudnn-cu12==9.0.0 |
| Hallucinated transcripts | Long silence, no VAD | Enable vad_filter=True |
| Wrong language detected | Auto-detect failed | Pass language="en" explicitly |
| OOM with large-v3 | VRAM tight | Use int8_float16 or smaller model |
| Slow on AMD GPU | CTranslate2 has no ROCm backend | Run CPU INT8, or use whisper.cpp (ROCm/hipBLAS) |
| Core ML / Metal not used on Mac | CTranslate2 runs CPU-only on Apple Silicon | Use whisper.cpp for Core ML acceleration |
| Streaming high latency | Wrong model | Use turbo or distil for real-time |
| Word timestamps inaccurate | Forced alignment needed | Use WhisperX |
FAQ {#faq}
See answers to common Faster-Whisper questions below.
Sources: Faster-Whisper GitHub | WhisperX | Distil-Whisper | whisper-asr-webservice | Internal benchmarks RTX 4090, M4 Max, Ryzen 7 7700X.