
Faster-Whisper Setup Guide (2026): 4x Faster Local Speech-to-Text

May 1, 2026
22 min read
LocalAimaster Research Team

Faster-Whisper is the production-grade reimplementation of OpenAI's Whisper. Same models, same accuracy, 4x faster on GPU and 2x faster on CPU. Built on CTranslate2 with INT8 / FP16 quantization, fused attention kernels, and SIMD-optimized CPU paths. For any local speech-to-text deployment in 2026 — meeting transcription, podcast workflow, real-time captions, call-center audio, voice agents, accessibility tools — Faster-Whisper is the right default.

This guide covers everything: installation across platforms, model selection (tiny through large-v3-turbo), quantization choices (INT8 vs FP16), batched and streaming inference, WhisperX / Distil-Whisper integration, OpenAI-compatible API exposure via LocalAI or whisper-asr-webservice, real-time captioning patterns, and benchmarks across hardware.

Table of Contents

  1. What Faster-Whisper Is
  2. Faster-Whisper vs OpenAI Whisper vs WhisperX vs Distil-Whisper
  3. Whisper Model Variants
  4. Hardware Requirements
  5. Installation: pip, Docker, Source
  6. Your First Transcription
  7. Quantization: INT8 vs FP16 vs FP32
  8. Batched Transcription
  9. Streaming Transcription
  10. Word-Level Timestamps
  11. VAD Pre-Segmentation
  12. Distil-Whisper Integration
  13. WhisperX for Speaker Diarization
  14. OpenAI-Compatible API via LocalAI
  15. whisper-asr-webservice (Docker)
  16. Performance Benchmarks
  17. Multilingual Accuracy
  18. Tuning Recipes
  19. Troubleshooting


What Faster-Whisper Is {#what-it-is}

Faster-Whisper (SYSTRAN/faster-whisper on GitHub) is a Python wrapper around CTranslate2-converted Whisper models. CTranslate2 is OpenNMT's high-performance inference engine for transformer models — it provides custom CUDA kernels, INT8/FP16 quantization, fused operations, and SIMD-optimized CPU code.

Result: same Whisper model accuracy with significantly better throughput and lower memory.

Project: github.com/SYSTRAN/faster-whisper. License: MIT. Active maintenance.
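
You normally never run the conversion yourself, because faster-whisper downloads pre-converted models from Hugging Face on first load. For reference, converting a Whisper checkpoint (for example a custom fine-tune) with the CTranslate2 converter CLI looks roughly like this; the model ID and output directory are placeholders:

pip install ctranslate2 transformers

ct2-transformers-converter --model openai/whisper-large-v3 \
    --output_dir whisper-large-v3-ct2 \
    --copy_files tokenizer.json preprocessor_config.json \
    --quantization float16

WhisperModel also accepts a local directory path, so the converted folder can be passed in place of a model name.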


Faster-Whisper vs OpenAI Whisper vs WhisperX vs Distil-Whisper {#comparison}

| Project | Speed (RTF, RTX 4090) | Accuracy | Features |
|---|---|---|---|
| openai-whisper | 18x | Reference | Word timestamps, no diarization |
| Faster-Whisper | 72x | Same as Whisper | + INT8 / batched / VAD |
| WhisperX | 70x | Same | + diarization + better word timestamps |
| Distil-Whisper | 100x+ | -1% WER | Distilled, smaller |
| whisper.cpp | 50x (CPU) / 60x (GPU) | Same | C++, GGUF |

For most users: Faster-Whisper. For subtitles / interviews with speaker labels: WhisperX. For latency-critical real-time: Distil-Whisper. For embedded / no-Python: whisper.cpp.


Whisper Model Variants {#models}

| Model | Params | Size (FP16) | Speed | Accuracy | Best For |
|---|---|---|---|---|---|
| tiny | 39M | 75 MB | Fastest | Lowest | Embedded, real-time |
| base | 74M | 142 MB | Very fast | Low | Real-time UIs |
| small | 244M | 466 MB | Fast | Decent | Fast batch |
| medium | 769M | 1.5 GB | Medium | Good | Most use cases |
| large-v2 | 1550M | 2.9 GB | Slower | Excellent | Legacy |
| large-v3 | 1550M | 2.9 GB | Slower | Best | Accuracy-critical |
| large-v3-turbo | 809M | 1.5 GB | Fast | Near large-v3 (English) | Default for English |

Released October 2024: large-v3-turbo. 8x faster decoding than large-v3 with minimal accuracy loss for English. For non-English, stick with large-v3.
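
In code, the choice is just the model name string. A minimal sketch, using the same names as the rest of this guide:

from faster_whisper import WhisperModel

# English-heavy workload: much faster decoding, near-large-v3 accuracy
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# Multilingual workload: plain large-v3 is the safer default
# model = WhisperModel("large-v3", device="cuda", compute_type="float16")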



Hardware Requirements {#requirements}

| Hardware | large-v3 (RTF) | large-v3-turbo (RTF) |
|---|---|---|
| RTX 4090 (FP16) | 72x | 250x+ |
| RTX 4070 (FP16) | 50x | 180x |
| RTX 3060 (INT8) | 35x | 120x |
| Apple M4 Max (Core ML / MPS) | 25x | 90x |
| Ryzen 7 7700X (INT8 CPU) | 10x | 35x |
| Raspberry Pi 5 (CPU) | 0.5x | 1.5x |

CPU-only works for batch use cases. GPU is for real-time / high-throughput.


Installation: pip, Docker, Source {#installation}

pip

python3.11 -m venv ~/venvs/fwhisper
source ~/venvs/fwhisper/bin/activate

# One install covers CPU and GPU; GPU inference additionally needs
# cuBLAS and cuDNN available on the system (see below)
pip install faster-whisper

For CUDA, ensure cuBLAS and cuDNN are installed, either system-wide or via the NVIDIA pip wheels:

pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.0.0
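
A quick sanity check that the GPU stack is wired up; this sketch downloads the tiny model on first run and transcribes one second of silence just to force the CUDA kernels to load:

import numpy as np
from faster_whisper import WhisperModel

# If cuBLAS or cuDNN is missing, the error surfaces here rather than mid-job later
model = WhisperModel("tiny", device="cuda", compute_type="float16")
segments, _ = model.transcribe(np.zeros(16000, dtype=np.float32))
list(segments)  # the generator is lazy; consume it to actually run inference
print("CUDA inference OK")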

Docker

docker run --gpus all --rm -it \
    -v $(pwd):/workspace \
    ghcr.io/systran/faster-whisper:latest \
    bash

whisper-asr-webservice (one-liner OpenAI-compatible)

docker run -d -p 9000:9000 \
    -e ASR_MODEL=large-v3 \
    -e ASR_ENGINE=faster_whisper \
    --gpus all \
    onerahmet/openai-whisper-asr-webservice:latest-gpu

Now http://localhost:9000/asr and /v1/audio/transcriptions work.
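
A quick test from the shell; the audio_file field and output parameter follow the whisper-asr-webservice docs, so adjust if your image version differs:

curl -s -F "audio_file=@audio.mp3" "http://localhost:9000/asr?output=json"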


Your First Transcription {#first-transcription}

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} (prob {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

That's it. segments is a generator: the model transcribes lazily as you iterate, and the generator can only be consumed once. For the full text in one string:

result_text = " ".join(seg.text for seg in segments)

Quantization: INT8 vs FP16 vs FP32 {#quantization}

compute_type parameter:

  • float16 — default for GPU, best speed/quality balance
  • int8_float16 — INT8 weights, FP16 activations (GPU)
  • int8 — INT8 weights, INT8 activations (best for CPU)
  • float32 — reference, slowest

| Compute type | Speed | Accuracy | Memory (large-v3) |
|---|---|---|---|
| float32 | 1.0x | reference | 6 GB |
| float16 | 2.0x | reference | 3 GB |
| int8_float16 | 2.5x | -0.1% WER | 1.8 GB |
| int8 | 2.0x (best on CPU) | -0.3% WER | 1.5 GB |

For GPU: float16 default; bump to int8_float16 if you need lower VRAM. For CPU: int8.
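
The same rules of thumb in code, as a sketch using the options listed above:

from faster_whisper import WhisperModel

# GPU with headroom: default precision
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# GPU with tight VRAM (e.g. 8 GB cards)
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

# CPU-only box
model = WhisperModel("large-v3", device="cpu", compute_type="int8", cpu_threads=8)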


Batched Transcription {#batched}

For long audio (or a queue of files handled one after another), BatchedInferencePipeline uses VAD to cut each file into chunks and transcribes the chunks in parallel batches:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

segments, info = batched.transcribe("long_audio.mp3", batch_size=16)

Throughput on an RTX 4090 with large-v3 and batch_size=16: ~250x real-time. Use it for transcribing archives, podcast back-catalogs, and call-center recordings.
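
For a back-catalog, reuse one pipeline across files. A minimal sketch; the podcasts/ directory and .txt output are illustrative:

from pathlib import Path
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

for path in sorted(Path("podcasts").glob("*.mp3")):
    segments, info = batched.transcribe(str(path), batch_size=16)
    text = " ".join(seg.text.strip() for seg in segments)
    path.with_suffix(".txt").write_text(text, encoding="utf-8")
    print(f"{path.name}: {info.duration:.0f}s of audio transcribed")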


Streaming Transcription {#streaming}

Faster-Whisper itself is not designed for streaming — it expects complete audio. For streaming use:

  • whisper_streaming (community): polls Faster-Whisper on a sliding window with VAD-aware re-segmentation. Latency ~500-800 ms.
  • WhisperLive: Faster-Whisper-based real-time server with web UI.
  • VOSK + Whisper hybrid: VOSK for low-latency partial, Faster-Whisper for confirmed segments.

For sub-200 ms real-time captions, consider Moshi or Voxtral. For 500-800 ms latency, Faster-Whisper streaming is fine.

# Conceptual streaming pattern: accumulate mic audio, transcribe fixed-size chunks.
# Note: running transcription inside the audio callback blocks it; production code
# should push chunks onto a queue and run the model in a worker thread.
import sounddevice as sd
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
buffer = np.zeros(0, dtype=np.float32)

def callback(indata, frames, time, status):
    global buffer
    buffer = np.concatenate([buffer, indata[:, 0]])
    if len(buffer) >= 16000 * 5:  # 5 seconds of 16 kHz audio
        segments, _ = model.transcribe(buffer, beam_size=1)  # accepts float32 numpy audio
        for seg in segments:
            print(seg.text)
        buffer = np.zeros(0, dtype=np.float32)

with sd.InputStream(samplerate=16000, channels=1, callback=callback):
    sd.sleep(60_000)  # capture for 60 seconds

Word-Level Timestamps {#word-timestamps}

segments, info = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"{word.start:.2f} - {word.end:.2f}: {word.word}")

Word-level timestamps add ~10% overhead. Useful for subtitling, video editing alignment, karaoke-style captioning.
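
Subtitles don't need an external library; a minimal SRT writer over the segment timestamps (the output filename is illustrative):

segments, info = model.transcribe("audio.mp3", word_timestamps=True)

def srt_time(t: float) -> str:
    # SRT uses HH:MM:SS,mmm
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{int((t % 1) * 1000):03d}"

with open("audio.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n{seg.text.strip()}\n\n")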

For higher-accuracy word timestamps, use WhisperX which does forced alignment via wav2vec2 phoneme models.


VAD Pre-Segmentation {#vad}

segments, info = model.transcribe("audio.mp3", vad_filter=True, vad_parameters=dict(min_silence_duration_ms=500))

VAD (Voice Activity Detection) splits audio at silence boundaries before transcription. Benefits:

  • Better accuracy (Whisper isn't confused by long silences)
  • Faster (skips silent regions)
  • Better word timestamps

Faster-Whisper bundles a Silero VAD model. Default settings work well; tune min_silence_duration_ms for your audio type.
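
A more explicit configuration; these parameter names come from faster-whisper's VadOptions in recent releases, so check the version you have installed:

segments, info = model.transcribe(
    "meeting.mp3",
    vad_filter=True,
    vad_parameters=dict(
        threshold=0.5,                # speech-probability cutoff
        min_silence_duration_ms=500,  # split after 0.5 s of silence
        speech_pad_ms=200,            # keep a little audio around each speech chunk
    ),
)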


Distil-Whisper Integration {#distil}

from faster_whisper import WhisperModel

# Distil-large-v3 — 6x faster than large-v3 at -1% WER (English)
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")

Distil-Whisper is from Hugging Face (distil-whisper/distil-large-v3). Smaller decoder + same encoder. Best for:

  • English-only deployments
  • Latency-critical real-time
  • Limited compute budgets

For non-English, stick with large-v3 or large-v3-turbo.


WhisperX for Speaker Diarization {#whisperx}

pip install whisperx

import whisperx

device = "cuda"
audio = whisperx.load_audio("interview.mp3")
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Word-level alignment
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Speaker diarization (needs HF token for pyannote)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Each segment now has "speaker": "SPEAKER_00", "SPEAKER_01", etc.

For interviews, podcasts with multiple speakers, meetings — WhisperX is the right tool.
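
To turn the result into a readable transcript, group the labelled segments by speaker; a minimal sketch using the keys produced by the pipeline above:

for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")
    print(f"{speaker} [{seg['start']:.1f}-{seg['end']:.1f}s] {seg['text'].strip()}")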


OpenAI-Compatible API via LocalAI {#localai}

# models/whisper.yaml
name: whisper-1
backend: faster-whisper
parameters:
  model: large-v3
  compute_type: float16

Then OpenAI clients work:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )
print(transcript.text)

See LocalAI Setup Guide.


whisper-asr-webservice (Docker) {#asr-webservice}

The simplest production deployment:

docker run -d --name whisper \
    -p 9000:9000 \
    -e ASR_MODEL=large-v3 \
    -e ASR_ENGINE=faster_whisper \
    -e ASR_DEVICE=cuda \
    --gpus all \
    --restart unless-stopped \
    onerahmet/openai-whisper-asr-webservice:latest-gpu

Endpoints:

  • POST /asr — multipart upload, returns JSON / VTT / SRT / text
  • POST /v1/audio/transcriptions — OpenAI-compatible
  • POST /detect-language — language detection only

For multi-tenant production, put it behind LiteLLM or a similar gateway.
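
Calling the OpenAI-compatible route from the shell, assuming your image version exposes it as listed above; the multipart fields follow the OpenAI audio API:

curl -s http://localhost:9000/v1/audio/transcriptions \
    -F file=@call.mp3 \
    -F model=whisper-1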


Performance Benchmarks {#benchmarks}

Transcribing a 1-hour podcast, large-v3, FP16:

| Hardware | Faster-Whisper | openai-whisper |
|---|---|---|
| RTX 4090 | 50 sec | 200 sec |
| RTX 3090 | 90 sec | 360 sec |
| RTX 3060 | 165 sec | 660 sec |
| Apple M4 Max | 145 sec | 540 sec |
| Ryzen 7 7700X (CPU) | 360 sec | 1,800 sec |
| Raspberry Pi 5 (CPU) | 7,200 sec | 28,000 sec |

Faster-Whisper is consistently around 4x faster than openai-whisper across this hardware.

For batched transcription at large scale (e.g., transcribing 1,000 hours): aggregate throughput on 8x RTX 4090 with batch_size=16 is ~2,000x real-time — 1,000 hours in ~30 minutes wall time.


Multilingual Accuracy {#multilingual}

Whisper large-v3 WER on FLEURS benchmark:

| Tier | Languages | WER |
|---|---|---|
| Excellent | EN, ES, FR, DE, IT, PT, NL | 4-9% |
| Good | RU, JA, KO, ZH, AR, PL, TR | 9-18% |
| Fair | HI, VI, TH, FA, HE | 15-25% |
| Limited | Many low-resource languages | 25-50%+ |

Use language="en" to skip language detection (faster, more accurate when language is known). For non-English, large-v3 outperforms large-v3-turbo (which is English-tuned).
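
In code this is a one-argument change to the calls shown earlier:

# Known language: skip detection entirely
segments, info = model.transcribe("audio.mp3", language="en")

# Unknown language: let Whisper detect it, then inspect info.language
segments, info = model.transcribe("audio.mp3")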


Tuning Recipes {#tuning}

Real-time English caption server

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
# Pair with the VAD streaming pattern above; expect a few hundred milliseconds of latency

Audiobook / podcast batch

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)
# Use batched.transcribe() with batch_size=16

CPU-only office workstation

model = WhisperModel("large-v3", device="cpu", compute_type="int8", cpu_threads=8)
# 1 hour of audio in ~6 minutes

Multilingual customer call analysis

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(audio, language=None)  # auto-detect
# WhisperX for speaker diarization on top

Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| cuDNN not found | Missing system lib | pip install nvidia-cudnn-cu12==9.0.0 |
| Hallucinated transcripts | Long silence, no VAD | Enable vad_filter=True |
| Wrong language detected | Auto-detect failed | Pass language="en" explicitly |
| OOM with large-v3 | VRAM tight | Use int8_float16 or a smaller model |
| Slow on AMD | PyTorch CUDA build | Reinstall ctranslate2 with ROCm support |
| Mac Core ML not used | MPS path | Use whisper.cpp on Mac for Core ML |
| Streaming high latency | Wrong model | Use turbo or distil for real-time |
| Word timestamps inaccurate | Forced alignment needed | Use WhisperX |



Sources: Faster-Whisper GitHub | WhisperX | Distil-Whisper | whisper-asr-webservice | Internal benchmarks RTX 4090, M4 Max, Ryzen 7 7700X.
