Run Whisper Locally: Offline Speech-to-Text Guide
Published on April 10, 2026 • 20 min read
OpenAI released Whisper as an open-source speech recognition model in September 2022. Three years later, it remains the best general-purpose speech-to-text system you can run on your own hardware. I use it daily to transcribe client meetings, generate podcast subtitles, and voice-dictate notes into Obsidian. Everything stays on my machine. Nothing touches a cloud API.
This guide covers three installation methods (original Whisper, whisper.cpp, faster-whisper), real benchmarks across every model size, and practical workflows for batch transcription and real-time dictation.
What you will learn:
- Which Whisper model size matches your hardware
- Three installation methods ranked by speed and compatibility
- Batch transcription of audio files and folders
- Real-time microphone transcription setup
- Integration with Ollama for transcribe-then-summarize pipelines
- Privacy advantages over cloud transcription services
If you are setting up Whisper on a Mac specifically, start with the Mac local AI setup guide for Apple Silicon optimization, then come back here for Whisper-specific configuration.
Table of Contents
- What Is Whisper
- Model Sizes and Hardware Requirements
- Method 1: Original OpenAI Whisper
- Method 2: whisper.cpp (CPU Optimized)
- Method 3: faster-whisper (Recommended)
- Real-Time Transcription
- Batch Processing Workflows
- Language Support
- Accuracy Benchmarks
- Integration with Ollama
- Privacy Advantages
What Is Whisper {#what-is-whisper}
Whisper is an automatic speech recognition (ASR) model trained on 680,000 hours of multilingual audio collected from the web. OpenAI released it under the MIT license, meaning you can use it for anything, commercial use included; the only obligation is keeping the license notice intact.
The model uses an encoder-decoder Transformer architecture. Audio goes in as mel spectrograms (80-channel log-mel features computed from 16kHz audio), and text comes out as tokens. It handles transcription (same language), translation (any language to English), and language detection in a single model.
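The front-end dimensions are easy to sanity-check. Whisper pads or trims audio to 30-second windows at 16kHz and computes log-mel features with a 160-sample hop (10ms). A quick back-of-envelope check, using the standard parameters from the paper and repository:

```python
# Whisper front-end dimensions (standard values from the paper/repo).
SAMPLE_RATE = 16_000   # Hz
WINDOW_SECONDS = 30    # each chunk is padded/trimmed to 30 seconds
N_MELS = 80            # mel channels
HOP_LENGTH = 160       # samples between frames (10 ms)

samples_per_chunk = SAMPLE_RATE * WINDOW_SECONDS    # 480,000 samples
frames_per_chunk = samples_per_chunk // HOP_LENGTH  # 3,000 spectrogram frames

print(f"{samples_per_chunk} samples -> {frames_per_chunk} x {N_MELS} log-mel matrix")
# -> 480000 samples -> 3000 x 80 log-mel matrix
```

The encoder then downsamples by a factor of 2, so it attends over 1,500 positions per 30-second chunk.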
What makes Whisper special is not any single capability but the combination: it handles background noise, accents, technical jargon, and multiple speakers better than any other open model. The original Whisper repository on GitHub has over 72,000 stars and remains actively maintained.
Key specifications:
- Architecture: Encoder-decoder Transformer
- Training data: 680,000 hours of labeled audio
- Languages: 100+ languages for transcription, any-to-English translation
- License: MIT (fully permissive, commercial use allowed)
- Latest version: large-v3 (released November 2023, still state-of-the-art for general use)
Model Sizes and Hardware Requirements {#model-sizes}
Whisper comes in six sizes. Picking the right one depends entirely on your hardware and accuracy needs.
| Model | Parameters | VRAM (FP16) | VRAM (INT8) | Disk Size | Relative Speed |
|---|---|---|---|---|---|
| tiny | 39M | ~1GB | ~0.5GB | 75MB | 32x |
| base | 74M | ~1GB | ~0.5GB | 142MB | 16x |
| small | 244M | ~2GB | ~1GB | 466MB | 6x |
| medium | 769M | ~5GB | ~2.5GB | 1.5GB | 2x |
| large-v3 | 1.55B | ~10GB | ~5GB | 2.9GB | 1x |
| turbo | 809M | ~6GB | ~3GB | 1.6GB | 8x |
Speed column explained: A file that takes 60 seconds to transcribe on large-v3 takes roughly 2 seconds on tiny. These are relative figures, not absolute.
Which model should you use?
No GPU or integrated graphics only: Use tiny or base with whisper.cpp. CPU inference is viable up to the small model but painfully slow beyond that.
4-6GB VRAM (GTX 1070, RTX 3060, M1 8GB): Use small or medium with faster-whisper INT8 quantization. The small model is the sweet spot here: 95% of large-v3 accuracy at 6x the speed.
8-12GB VRAM (RTX 3070, RTX 4070, M2 16GB): Use large-v3 with faster-whisper. You have enough memory, and the accuracy difference over small is meaningful for noisy audio and accented speech.
16GB+ VRAM (RTX 3090, RTX 4090, M3 Pro 36GB): Use large-v3 without quantization. You can also run real-time transcription comfortably at this tier. See the hardware requirements guide for full VRAM tables across all GPU models.
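The tiers above collapse into a small helper. This is an illustrative sketch, not any official API; the function name and thresholds simply mirror the recommendations in this section:

```python
def pick_whisper_model(vram_gb: float, noisy_audio: bool = False) -> str:
    """Suggest a Whisper model size from available VRAM (or unified memory).

    Thresholds follow the hardware tiers above; tune for your workload.
    """
    if vram_gb < 4:
        return "base"       # CPU-class hardware: tiny/base via whisper.cpp
    if vram_gb < 8:
        # small is the sweet spot; step up for noisy or accented audio
        return "medium" if noisy_audio else "small"
    return "large-v3"       # 8GB+ fits large-v3 with INT8 quantization

print(pick_whisper_model(6))                    # small
print(pick_whisper_model(6, noisy_audio=True))  # medium
print(pick_whisper_model(12))                   # large-v3
```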
Method 1: Original OpenAI Whisper {#method-original}
The reference implementation. Use this if you want the canonical experience or need to modify the model code.
Installation
# Create a virtual environment (recommended)
python3 -m venv whisper-env
source whisper-env/bin/activate
# Install Whisper
pip install openai-whisper
# Install ffmpeg (required for audio processing)
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows:
choco install ffmpeg
Basic Usage
# Transcribe a file
whisper audio.mp3 --model small --language en
# Transcribe with translation to English
whisper japanese_meeting.mp3 --model medium --task translate
# Output specific format
whisper lecture.wav --model large-v3 --output_format srt
# Specify output directory
whisper interview.mp3 --model small --output_dir ./transcripts
Output Formats
Whisper generates multiple output files by default:
- .txt — Plain text transcript
- .vtt — WebVTT subtitles (for web video)
- .srt — SubRip subtitles (for most video players)
- .tsv — Tab-separated values with timestamps
- .json — Full structured output with timing metadata
Performance (Original Whisper)
On an RTX 3090 transcribing a 60-minute English podcast (clear audio, single speaker):
| Model | Processing Time | Real-time Factor | VRAM Used |
|---|---|---|---|
| tiny | 48 seconds | 75x faster | 1.1GB |
| base | 1 min 22 sec | 44x faster | 1.2GB |
| small | 3 min 10 sec | 19x faster | 2.3GB |
| medium | 8 min 45 sec | 6.9x faster | 5.4GB |
| large-v3 | 18 min 30 sec | 3.2x faster | 10.1GB |
Method 2: whisper.cpp (CPU Optimized) {#method-whisper-cpp}
whisper.cpp is a C/C++ port by Georgi Gerganov (the creator of llama.cpp). It runs on pure CPU with SIMD optimizations, making it the best choice for machines without a dedicated GPU. It also supports Metal acceleration on Apple Silicon.
Installation
# Clone the repository
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
# Build with optimizations
# For x86 Linux/Windows:
make -j$(nproc)
# For Apple Silicon Mac (Metal acceleration):
make -j$(sysctl -n hw.ncpu) WHISPER_METAL=1
# For NVIDIA GPU (CUDA):
make -j$(nproc) WHISPER_CUDA=1
# Download a model
bash models/download-ggml-model.sh large-v3
Usage
# Basic transcription
./main -m models/ggml-large-v3.bin -f audio.wav
# With language detection
./main -m models/ggml-large-v3.bin -f audio.wav -l auto
# Output SRT subtitles
./main -m models/ggml-large-v3.bin -f audio.wav --output-srt
# Use 8 threads (match your CPU core count)
./main -m models/ggml-large-v3.bin -f audio.wav -t 8
# Convert audio to required format first (16kHz WAV)
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
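If you drive whisper.cpp from Python, the conversion step can be wrapped in a small helper. A hedged sketch: the flags mirror the ffmpeg command above, `ffmpeg_cmd` and `convert` are names invented for this example, and running it requires ffmpeg on your PATH:

```python
import subprocess

def ffmpeg_cmd(src: str, dst: str) -> list[str]:
    """ffmpeg arguments for the 16 kHz mono 16-bit WAV whisper.cpp expects."""
    return [
        "ffmpeg", "-y",       # -y: overwrite the output without asking
        "-i", src,
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # 16-bit signed PCM
        dst,
    ]

def convert(src: str, dst: str) -> None:
    subprocess.run(ffmpeg_cmd(src, dst), check=True)
```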
Performance (whisper.cpp)
On a Ryzen 7 5700X (8 cores) and Apple M2 Pro, transcribing the same 60-minute podcast:
| Model | Ryzen 7 CPU | M2 Pro (Metal) | M2 Pro (CPU only) |
|---|---|---|---|
| tiny | 32 sec | 18 sec | 28 sec |
| base | 1 min 5 sec | 35 sec | 55 sec |
| small | 4 min 20 sec | 1 min 50 sec | 3 min 30 sec |
| medium | 14 min | 5 min 20 sec | 11 min |
| large-v3 | 38 min | 12 min 40 sec | 32 min |
Metal acceleration on Apple Silicon cuts processing time by 60-70% compared to CPU-only. This makes whisper.cpp the recommended method for Mac users who want the small or medium model.
Method 3: faster-whisper (Recommended) {#method-faster-whisper}
faster-whisper uses CTranslate2, a custom inference engine optimized for Transformer models. It is 4x faster than the original Whisper and uses less memory thanks to INT8 quantization. This is what I use daily.
Installation
# Create virtual environment
python3 -m venv faster-whisper-env
source faster-whisper-env/bin/activate
# Install faster-whisper
pip install faster-whisper
# For NVIDIA GPU acceleration, CTranslate2 needs the cuBLAS and cuDNN
# libraries (CUDA 12); the NVIDIA pip wheels are one way to get them:
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
Basic Usage
from faster_whisper import WhisperModel
# Load model (auto-detects GPU)
# Options: "tiny", "base", "small", "medium", "large-v3"
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
# Transcribe
segments, info = model.transcribe("meeting.mp3", beam_size=5)
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
CLI Wrapper Script
#!/usr/bin/env python3
"""Fast local transcription with faster-whisper."""
import sys
import argparse
from faster_whisper import WhisperModel

def transcribe(audio_path, model_size="large-v3", language=None, output_format="txt"):
    model = WhisperModel(model_size, device="auto", compute_type="int8")
    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        language=language,
        vad_filter=True,  # Skip silence (huge speedup)
        vad_parameters=dict(
            min_silence_duration_ms=500,
            speech_pad_ms=200
        )
    )
    # Print to stderr so it does not pollute redirected transcript output
    print(f"Language: {info.language} ({info.language_probability:.0%})", file=sys.stderr)
    if output_format == "srt":
        for i, seg in enumerate(segments, 1):
            start = format_timestamp(seg.start)
            end = format_timestamp(seg.end)
            print(f"{i}")
            print(f"{start} --> {end}")
            print(f"{seg.text.strip()}\n")
    else:
        for seg in segments:
            print(seg.text.strip())

def format_timestamp(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("audio", help="Path to audio file")
    parser.add_argument("--model", default="large-v3", help="Model size")
    parser.add_argument("--language", default=None, help="Language code (e.g., en, ja, de)")
    parser.add_argument("--format", default="txt", choices=["txt", "srt"])
    args = parser.parse_args()
    transcribe(args.audio, args.model, args.language, args.format)
Save as transcribe.py and use:
python transcribe.py meeting.mp3 --model large-v3 --format srt > meeting.srt
Performance (faster-whisper)
Same 60-minute podcast on an RTX 3090:
| Model | Processing Time | Real-time Factor | VRAM Used |
|---|---|---|---|
| tiny | 12 seconds | 300x faster | 0.5GB |
| base | 22 seconds | 164x faster | 0.6GB |
| small | 52 seconds | 69x faster | 1.1GB |
| medium | 2 min 10 sec | 28x faster | 2.6GB |
| large-v3 | 4 min 40 sec | 12.9x faster | 4.8GB |
faster-whisper with INT8 is 4x faster than original Whisper and uses half the VRAM. There is no reason to use the original implementation unless you need to modify the model architecture itself.
The VAD (Voice Activity Detection) filter adds another 20-40% speedup by skipping silence. Enable it with vad_filter=True. On a meeting recording with typical pauses, a 60-minute file might only contain 38 minutes of actual speech.
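The speedup is easy to estimate from that figure: if VAD drops the 22 silent minutes from the 60-minute example, the model only processes about 63% of the audio. A quick sanity check of the arithmetic:

```python
total_min = 60
speech_min = 38   # actual speech remaining after VAD, per the example above

work_fraction = speech_min / total_min
savings = 1 - work_fraction
print(f"VAD processes {work_fraction:.0%} of the audio, saving ~{savings:.0%}")
# -> VAD processes 63% of the audio, saving ~37%
```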
Real-Time Transcription {#real-time}
Real-time transcription captures audio from your microphone and produces text as you speak. This requires a model that runs faster than real-time on your hardware.
Minimum hardware for real-time:
- tiny/base model: Any modern CPU (no GPU needed)
- small model: GTX 1060 6GB or M1 Mac
- large-v3 model: RTX 3070 or better
Setup with faster-whisper
#!/usr/bin/env python3
"""Real-time speech-to-text with faster-whisper."""
import queue
import threading

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

# Configuration
MODEL_SIZE = "small"    # Use "small" for balance of speed + accuracy
SAMPLE_RATE = 16000
CHUNK_DURATION = 3      # Process 3 seconds of audio at a time
SILENCE_THRESHOLD = 0.01

audio_queue = queue.Queue()
model = WhisperModel(MODEL_SIZE, device="auto", compute_type="int8")

def audio_callback(indata, frames, time_info, status):
    """Called for each audio chunk from the microphone."""
    audio_queue.put(indata.copy())

def process_audio():
    """Process queued audio chunks."""
    buffer = np.array([], dtype=np.float32)
    while True:
        chunk = audio_queue.get()
        audio_data = chunk.flatten().astype(np.float32)
        buffer = np.concatenate([buffer, audio_data])
        # Process when buffer has enough audio
        if len(buffer) >= SAMPLE_RATE * CHUNK_DURATION:
            # Check if there is actual speech
            if np.abs(buffer).mean() > SILENCE_THRESHOLD:
                segments, _ = model.transcribe(
                    buffer,
                    beam_size=1,  # Faster for real-time
                    language="en",
                    vad_filter=True
                )
                for seg in segments:
                    print(seg.text.strip(), end=" ", flush=True)
            buffer = np.array([], dtype=np.float32)

# Start real-time transcription
print("Listening... (Ctrl+C to stop)")
processor = threading.Thread(target=process_audio, daemon=True)
processor.start()

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=audio_callback):
    try:
        while True:
            sd.sleep(100)
    except KeyboardInterrupt:
        print("\nStopped.")
Install dependencies:
pip install sounddevice numpy faster-whisper
This setup produces text with roughly 3-second latency. For lower latency, reduce CHUNK_DURATION to 1.5 seconds, but expect more fragmented output.
Batch Processing Workflows {#batch-processing}
Transcribe an Entire Directory
#!/bin/bash
# batch_transcribe.sh - Transcribe all audio files in a directory
INPUT_DIR="$1"
OUTPUT_DIR="${2:-./transcripts}"
MODEL="${3:-large-v3}"
mkdir -p "$OUTPUT_DIR"
for file in "$INPUT_DIR"/*.{mp3,wav,m4a,flac,ogg,mp4,mkv,webm}; do
[ -f "$file" ] || continue
basename=$(basename "$file" | sed 's/\.[^.]*$//')
echo "Transcribing: $file"
python3 -c "
from faster_whisper import WhisperModel
model = WhisperModel('$MODEL', device='auto', compute_type='int8')
segments, info = model.transcribe('$file', beam_size=5, vad_filter=True)
with open('$OUTPUT_DIR/${basename}.txt', 'w') as f:
    for seg in segments:
        f.write(f'[{seg.start:.1f}s] {seg.text.strip()}\n')
print(f'  Language: {info.language}, Duration: {info.duration:.0f}s')
"
done
echo "All transcriptions saved to $OUTPUT_DIR"
Usage:
chmod +x batch_transcribe.sh
./batch_transcribe.sh ./recordings ./transcripts large-v3
Podcast Workflow
Here is my actual workflow for transcribing a podcast episode and generating show notes:
# Step 1: Download podcast episode
yt-dlp -x --audio-format mp3 "https://youtube.com/watch?v=EPISODE_ID" -o episode.mp3
# Step 2: Transcribe with faster-whisper
python3 -c "
from faster_whisper import WhisperModel
model = WhisperModel('large-v3', device='cuda', compute_type='int8')
segments, _ = model.transcribe('episode.mp3', beam_size=5, vad_filter=True)
with open('transcript.txt', 'w') as f:
    for seg in segments:
        mins = int(seg.start // 60)
        secs = int(seg.start % 60)
        f.write(f'[{mins:02d}:{secs:02d}] {seg.text.strip()}\n')
"
# Step 3: Generate summary with Ollama
ollama run llama3.2 "Summarize this podcast transcript into key points and timestamps:" < transcript.txt > summary.md
That last step is the real power move: Whisper produces the transcript, and a local LLM generates the summary. No cloud services involved. The entire pipeline runs offline.
Language Support {#language-support}
Whisper handles 100+ languages, but accuracy varies significantly by language and model size.
Top-Tier Accuracy (WER under 5% on clean audio)
English, Spanish, French, German, Italian, Portuguese, Japanese, Chinese (Mandarin), Korean, Dutch, Russian, Polish, Turkish
Strong Accuracy (WER 5-10%)
Swedish, Danish, Norwegian, Finnish, Czech, Romanian, Hungarian, Greek, Thai, Vietnamese, Indonesian, Arabic, Hindi
Usable but Imperfect (WER 10-20%)
Ukrainian, Bulgarian, Croatian, Malay, Tagalog, Swahili, Urdu, Bengali
Language-Specific Tips
# Force a specific language (faster and more accurate than auto-detect)
segments, info = model.transcribe("audio.mp3", language="ja")
# Translate any language to English
segments, info = model.transcribe("german_lecture.mp3", task="translate")
# Initial prompt helps with domain-specific terms
segments, info = model.transcribe(
    "medical_recording.mp3",
    language="en",
    initial_prompt="This is a cardiology consultation discussing myocardial infarction, "
                   "troponin levels, and echocardiography results."
)
The initial_prompt trick is underrated. By providing domain vocabulary in the prompt, Whisper significantly improves recognition of technical terms, proper nouns, and uncommon words.
Accuracy Benchmarks {#accuracy-benchmarks}
Word Error Rate (WER) on standard benchmarks. Lower is better.
LibriSpeech (Clean English, Read Speech)
| Model | WER (test-clean) | WER (test-other) |
|---|---|---|
| tiny | 7.6% | 14.8% |
| base | 5.0% | 10.3% |
| small | 3.4% | 7.6% |
| medium | 2.9% | 6.1% |
| large-v3 | 2.0% | 4.2% |
| turbo | 2.5% | 5.1% |
Real-World Accuracy (Our Testing)
We tested on harder scenarios: meetings with crosstalk, YouTube videos with background music, phone calls, and accented speakers.
| Scenario | small | medium | large-v3 |
|---|---|---|---|
| Clean podcast (single speaker) | 3.8% | 2.6% | 1.9% |
| Meeting (3-4 speakers, some overlap) | 11.2% | 7.8% | 5.1% |
| YouTube video (background music) | 14.5% | 9.3% | 6.7% |
| Phone call (compressed audio) | 9.8% | 6.4% | 4.3% |
| Heavy accent (Indian English) | 12.1% | 7.2% | 4.8% |
| Noisy environment (cafe) | 18.3% | 12.1% | 8.2% |
Takeaway: The jump from small to large-v3 matters most in difficult audio conditions. For clean, single-speaker audio, the small model is plenty. For meetings and noisy recordings, large-v3 is worth the extra compute.
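WER is word-level edit distance (substitutions plus insertions plus deletions) divided by the number of reference words. If you want to benchmark model sizes on your own recordings against a hand-corrected reference, a minimal implementation looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via Levenshtein distance over whitespace-split words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(f"{wer('the cat sat on the mat', 'the cat sat on mat'):.3f}")
# -> 0.167  (1 deletion out of 6 reference words)
```

Real benchmarks normalize casing and punctuation before scoring; this sketch skips that step for brevity.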
Integration with Ollama {#ollama-integration}
The most powerful local AI workflow combines Whisper transcription with LLM processing. Transcribe audio locally, then use Ollama to summarize, extract action items, translate, or answer questions about the content.
Transcribe-and-Summarize Pipeline
#!/usr/bin/env python3
"""Transcribe audio and generate AI summary with Ollama."""
import sys
import requests
from faster_whisper import WhisperModel

def transcribe_and_summarize(audio_path):
    # Step 1: Transcribe
    print("Transcribing...")
    model = WhisperModel("large-v3", device="auto", compute_type="int8")
    segments, info = model.transcribe(audio_path, beam_size=5, vad_filter=True)

    transcript = ""
    for seg in segments:
        mins = int(seg.start // 60)
        secs = int(seg.start % 60)
        transcript += f"[{mins:02d}:{secs:02d}] {seg.text.strip()}\n"
    print(f"Transcribed {info.duration:.0f}s of {info.language} audio")

    # Step 2: Summarize with Ollama
    print("Generating summary...")
    prompt = f"""Analyze this transcript and provide:
1. A 3-sentence summary
2. Key topics discussed (bullet points)
3. Action items mentioned (if any)
4. Notable quotes

Transcript:
{transcript[:8000]}"""  # Trim to fit the context window

    response = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3.2",
        "prompt": prompt,
        "stream": False
    })
    summary = response.json()["response"]

    # Save both outputs
    base = audio_path.rsplit(".", 1)[0]
    with open(base + "_transcript.txt", "w") as f:
        f.write(transcript)
    with open(base + "_summary.md", "w") as f:
        f.write(summary)
    print(f"\nSummary:\n{summary}")

if __name__ == "__main__":
    transcribe_and_summarize(sys.argv[1])
This pipeline processes a 1-hour meeting recording in about 6 minutes on an RTX 3090 (4 min transcription + 2 min summarization). You can also run this on a dedicated AI server. If you are considering building one, the homelab AI server build guide walks through the hardware and setup.
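The script above simply trims the transcript to 8,000 characters. For long recordings, a better approach is to chunk the transcript, summarize each chunk, then summarize the summaries. A hedged sketch of the chunking half (`chunk_transcript` is a name for this example; the Ollama calls stay the same `requests.post` as above):

```python
def chunk_transcript(transcript: str, max_chars: int = 8000) -> list[str]:
    """Split a timestamped transcript into chunks that fit an LLM context,
    breaking only on line boundaries so timestamps stay intact."""
    chunks, current, size = [], [], 0
    for line in transcript.splitlines(keepends=True):
        if size + len(line) > max_chars and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Summarize each chunk separately, then feed the per-chunk summaries back to the model for a final pass.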
Meeting Minutes Automation
# Cron job: auto-transcribe any new files in ~/Recordings
# Create the marker file once first: touch ~/Recordings/.last_processed
# Then add to crontab -e:
*/5 * * * * find ~/Recordings -name "*.mp3" -newer ~/Recordings/.last_processed -exec python3 ~/transcribe_summarize.py {} \; && touch ~/Recordings/.last_processed
Privacy Advantages {#privacy-advantages}
Cloud transcription services (Google Speech-to-Text, AWS Transcribe, AssemblyAI) send your audio to external servers. For many use cases, that is unacceptable.
Scenarios where local Whisper is mandatory:
- Legal: Attorney-client privileged conversations, depositions, court proceedings
- Medical: Patient consultations, therapy sessions (HIPAA compliance)
- Corporate: Board meetings, M&A discussions, proprietary strategy sessions
- Journalism: Source interviews, especially with whistleblowers
- Personal: Private conversations you simply do not want stored on someone else's servers
Local Whisper processes everything in your machine's RAM and VRAM. Audio files never leave your hardware. There is no telemetry, no logging, no data retention policy to worry about.
For a broader view of privacy implications, the run AI offline guide covers air-gapped setups where the machine has no internet connection at all.
Troubleshooting
Common Issues
"CUDA out of memory"
# Use a smaller model or INT8 quantization
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
# Or fall back to CPU
model = WhisperModel("large-v3", device="cpu", compute_type="int8")
"No such file or directory: ffmpeg"
# Install ffmpeg
sudo apt install ffmpeg # Ubuntu
brew install ffmpeg # macOS
Hallucinated text during silence
Whisper sometimes generates phantom text during silent sections. Fix with VAD filtering:
segments, info = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=1000)
)
Slow performance on Mac
Make sure you built whisper.cpp with Metal support:
make clean && make -j$(sysctl -n hw.ncpu) WHISPER_METAL=1
Next Steps
You now have local speech-to-text running with full privacy. Here is where to go next:
- Build an AI pipeline. Combine Whisper transcription with Ollama summarization for automated meeting notes, podcast show notes, or voice journaling.
- Scale up. If you are transcribing large volumes, consider a dedicated AI server that can process files around the clock without tying up your workstation.
- Go fully offline. Follow the run AI offline guide for an air-gapped setup where both Whisper and your LLM run without any internet connection.
Frequently Asked Questions
Is Whisper really free to use commercially?
Yes. OpenAI released Whisper under the MIT license. You can use it in commercial products, modify the code, and redistribute it. There are no usage fees, API keys, or restrictions. The model weights are included.
How accurate is Whisper compared to Google Speech-to-Text?
On clean English audio, Whisper large-v3 achieves a Word Error Rate under 3%, which matches or beats Google Speech-to-Text. On noisy audio or accented speech, Whisper large-v3 typically matches cloud services. The small model is less accurate but still usable for most purposes.
Can Whisper transcribe in real-time from a microphone?
Yes, with the right hardware. The tiny and base models run faster than real-time on any modern CPU. The small model needs a basic GPU for real-time. Large-v3 in real-time requires an RTX 3070 or better. Our guide includes a complete real-time transcription script.
Does Whisper work on Apple Silicon Macs?
Absolutely. whisper.cpp with Metal acceleration runs 60-70% faster than CPU-only on Apple Silicon and is usually the best choice on a Mac. faster-whisper also works, but runs on CPU there: its CTranslate2 engine does not have a Metal backend. An M2 Pro with Metal can run the large-v3 model at roughly 5x real-time speed.
What audio formats does Whisper support?
Whisper accepts any audio format that ffmpeg can read: MP3, WAV, M4A, FLAC, OGG, WMA, AAC, and video files (MP4, MKV, WebM, AVI). Audio is internally converted to 16kHz mono WAV. For best results, provide the highest quality source you have.
Can Whisper identify different speakers (speaker diarization)?
The base Whisper model does not perform speaker diarization. However, you can combine it with pyannote-audio for speaker identification. The pipeline runs locally: pyannote segments the audio by speaker, then Whisper transcribes each segment. This adds processing time but works well for meetings.
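A sketch of the merge step: given speaker turns from the diarizer and text segments from Whisper, assign each segment to the speaker whose turn overlaps it most. The data shapes here are assumptions for illustration (plain `(start, end, label)` tuples), not pyannote's actual objects:

```python
def assign_speakers(turns, segments):
    """turns: list of (start, end, speaker); segments: list of (start, end, text).
    Returns (speaker, text) pairs chosen by maximum temporal overlap."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((best, text))
    return labeled

turns = [(0.0, 5.0, "SPEAKER_00"), (5.0, 9.0, "SPEAKER_01")]
segments = [(0.5, 4.0, "Hi, thanks for joining."), (5.2, 8.5, "Happy to be here.")]
print(assign_speakers(turns, segments))
```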
How much storage do Whisper models use?
From 75MB (tiny) to 2.9GB (large-v3). If you install all models, total disk usage is about 5.2GB. Models are downloaded once and cached in ~/.cache/huggingface/ for faster-whisper or the models/ directory for whisper.cpp.
Conclusion
Whisper is genuinely one of the best open-source AI models available. It works, it is free, it runs on modest hardware, and it keeps your audio private. The faster-whisper implementation with INT8 quantization makes even the large-v3 model practical on mid-range GPUs, and whisper.cpp brings Metal-accelerated inference to every Apple Silicon Mac.
For most people, faster-whisper with the small or large-v3 model covers every transcription need. Pair it with Ollama for summarization, and you have a completely private meeting-notes pipeline that outperforms most commercial alternatives.
The audio on your machine stays on your machine. That alone makes local Whisper worth setting up.
Want to build a complete local AI stack? Start with our hardware requirements guide to size your setup, then follow the Mac or Linux setup guide for your platform.