NVIDIA · Open-Weight Speech-to-Text
NVIDIA Parakeet TDT: Fastest Local Speech-to-Text
NVIDIA Parakeet TDT 0.6B v3 is, as of mid-2026, the fastest open speech-to-text model you can run locally: a 600-million-parameter FastConformer-TDT model that hits a 6.34% average Word Error Rate on the Hugging Face Open ASR Leaderboard while transcribing at ~3,333× realtime (RTFx 3,332.74). It handles 25 European languages with automatic language detection, ships open weights under CC-BY-4.0, and runs on a single NVIDIA GPU. The English-only v2 is even more accurate at 6.05% WER (RTFx 3,386). The trade-off versus Whisper is breadth: Whisper covers 99 languages, Parakeet trades that for an order-of-magnitude speed advantage.
Self-hostable. Parakeet ships open weights (CC-BY-4.0) — you download the model and run it offline on your own GPU, no API and no per-minute fee. If you also want broad language coverage, compare it against Whisper large-v3.
Key takeaways
- →Speed is the headline: ~3,333× realtime (RTFx 3,332.74) for v3 — roughly 40–50× faster throughput than Whisper large-v3 (RTFx ~68.56).
- →Accuracy holds up: 6.34% WER (v3, 25 languages) and 6.05% WER (v2, English) on the Open ASR Leaderboard.
- →Tiny footprint: only 600M parameters — NVIDIA states it needs at least 2GB of GPU RAM to load.
- →Open weights, CC-BY-4.0: free to use commercially with attribution. Runs via NVIDIA NeMo or Hugging Face Transformers.
- →The catch: it is NVIDIA-GPU oriented and far narrower on languages than Whisper's 99 — pick by use case.
Why this model matters
Most people's mental model of local speech-to-text is Whisper, because for years it was the only open option that was genuinely good. Parakeet TDT changes the calculus on one axis that matters enormously in production: throughput. A FastConformer encoder paired with a TDT (Token-and-Duration Transducer) decoder lets the model skip the long runs of blank frames that a normal transducer would emit, which is where the dramatic speed comes from. The practical result is that batch-transcribing a large archive of audio, or keeping up with many concurrent streams, becomes cheap.
The other reason it belongs on a local-AI site: the weights are genuinely open under CC-BY-4.0, the model is small (600M parameters), and it loads on modest GPUs. You are not renting a transcription API — you own the pipeline.
Specs at a glance
| Attribute | Parakeet TDT 0.6B v3 | Parakeet TDT 0.6B v2 |
|---|---|---|
| Vendor | NVIDIA | NVIDIA |
| Release date | August 14, 2025 | May 1, 2025 |
| Parameters | 600 million | 600 million |
| Architecture | FastConformer-TDT | FastConformer-TDT |
| Languages | 25 European (auto-detect) | English |
| Avg WER (Open ASR) | 6.34% | 6.05% |
| RTFx (throughput) | 3,332.74 | 3,386 (batch 128) |
| Single-pass audio | ~24 min full attention; ~3 hr local attention | Up to ~24 min |
| Timestamps | Word-level | Word-level |
| License | CC-BY-4.0 | CC-BY-4.0 |
| Min GPU RAM to load | ~2GB | ~2GB |
| Runtime | NVIDIA NeMo · HF Transformers | NVIDIA NeMo · HF Transformers |
Sources: parakeet-tdt-0.6b-v3 model card and parakeet-tdt-0.6b-v2 model card (NVIDIA, Hugging Face). WER and RTFx are the vendor-reported Open ASR Leaderboard figures; RTFx scales with batch size and audio length, so your numbers will differ.
The Parakeet TDT line
"Parakeet" is NVIDIA's family of FastConformer-based ASR models; the TDT variants pair that encoder with a Token-and-Duration Transducer decoder. The two that matter for local use today are both the 0.6B size:
Parakeet TDT 0.6B v3 — the multilingual one
Released August 14, 2025. Extends v2 from English to 25 European languages and adds automatic language detection — you do not have to tell it the language up front. Average WER of 6.34% on the Open ASR Leaderboard at RTFx 3,332.74. This is the default pick unless you only ever transcribe English.
Parakeet TDT 0.6B v2 — the English specialist
Released May 1, 2025. English-only, and slightly more accurate on English at 6.05% WER (RTFx 3,386 at batch size 128). If your workload is English audio only, v2 is the sharper tool; v3 is the one to reach for the moment any other European language appears.
There is also an older, separate parakeet-tdt-1.1b (~1.1B parameters, English-only, ~7.02% mean WER, jointly developed with Suno) — it predates the v2/v3 line and is not a "v3" of the 0.6B model. For new local deployments the 0.6B v3 is both faster and more capable, so we focus on it here.
Benchmarks & speed
The headline metrics are accuracy (Word Error Rate, lower is better) and throughput (RTFx — how many seconds of audio it transcribes per second of compute, higher is better). Parakeet's differentiator is that it posts excellent WER while sitting near the top of the leaderboard for throughput.
| Metric | Parakeet TDT 0.6B v3 | Parakeet TDT 0.6B v2 | Notes |
|---|---|---|---|
| Open ASR Leaderboard WER | 6.34% | 6.05% | Average across the leaderboard's English test sets. |
| RTFx (throughput) | 3,332.74 | 3,386.02 | v2 measured at batch size 128; ~3,333× realtime. |
| FLEURS WER (v3) | 11.97% | — | Multilingual set; harder than clean English. |
| MLS WER (v3) | 7.83% | — | Multilingual LibriSpeech. |
| WER at SNR -5 (noise) | 19.88% | — | Heavy MUSAN noise vs 6.34% clean — accuracy degrades in noise, as expected. |
Source: NVIDIA Parakeet TDT 0.6B v3 / v2 model cards on Hugging Face. RTFx depends on batch size, GPU, and audio length — treat the figures as best-case leaderboard conditions, not a guarantee for your hardware.
Parakeet vs Whisper
This is the comparison everyone actually wants. Whisper large-v3 (~1.55B parameters) is the incumbent open model and covers 99 languages; Parakeet TDT trades that breadth for a huge throughput advantage and a much smaller footprint.
| Attribute | Parakeet TDT 0.6B v3 | Whisper large-v3 |
|---|---|---|
| Parameters | 600M | ~1.55B |
| Languages | 25 European | 99 |
| Throughput (RTFx) | ~3,333 | ~68.56 |
| Vendor | NVIDIA | OpenAI |
| License | CC-BY-4.0 | Apache 2.0 |
| Best for | High-volume / many-stream English & European transcription | Maximum language coverage |
On English accuracy the two are competitive; the decisive difference is speed. The RTFx gap (~3,333 vs ~68.56) means Parakeet processes audio roughly 40–50× faster per unit of compute, which is why it dominates batch-transcription and real-time scenarios. A worthwhile aside: because Parakeet is a transducer model rather than an autoregressive sequence-to-sequence decoder, it is far less prone to the "hallucinated text on silence" failure mode that Whisper users sometimes hit — though NVIDIA does not publish a formal silence-robustness number, so treat that as an architectural observation, not a benchmarked claim. We dig into the full side-by-side in Parakeet vs Whisper.
Running Parakeet on your own hardware
Parakeet is small. NVIDIA lists a minimum of ~2GB of GPU RAM just to load the 600M-parameter model, so it fits comfortably on essentially any modern NVIDIA card — you do not need a workstation GPU. The longest single-pass audio depends on attention mode: roughly 24 minutes with full attention (NVIDIA cites an A100 80GB for that figure) or up to ~3 hours with local attention enabled.
First-hand framing. The vendor RTFx of ~3,333 is a batched, leaderboard-optimal number on datacenter hardware — do not expect it on a laptop GPU. As an approximate real-world anchor: on a single consumer NVIDIA card (e.g. an RTX 4090-class GPU) you should plan around tens to low-hundreds of times realtime for typical single-file transcription, and only approach the headline figure with large batches of short clips on a high-end datacenter GPU. The point that survives the caveats is that even the conservative end is dramatically faster than a Whisper-large pipeline. Always benchmark on your own audio and GPU before sizing capacity.
To run it, install NVIDIA NeMo (the reference path, which also gives you fine-tuning) or use the Hugging Face Transformers integration. If you are coming from a Whisper setup and want a faster drop-in, our faster-whisper guide covers the optimized-Whisper route, and the Whisper large-v3 page details the model Parakeet is most often compared against.
Who should pick Parakeet TDT
| If you are… | Best pick | Why |
|---|---|---|
| Batch-transcribing a large English/European archive | Parakeet TDT 0.6B v3 | Top-tier throughput (~3,333× realtime) at 6.34% WER. |
| Doing English-only work and want the lowest WER | Parakeet TDT 0.6B v2 | 6.05% WER on English, slightly sharper than v3. |
| Transcribing languages outside Europe | Whisper large-v3 | 99-language coverage; Parakeet v3 is limited to 25 European languages. |
| Upgrading a Whisper pipeline for speed | faster-whisper or Parakeet | Both cut latency; Parakeet wins on raw throughput if your languages fit. |
| Deciding between the two head-to-head | Parakeet vs Whisper | Full accuracy, speed, and language breakdown. |
Build a private transcription pipeline
Parakeet TDT plus a local runtime gives you fast, offline speech-to-text with zero per-minute cost. The Local AI Master deployment course walks through pulling open-weight models, setting up the GPU runtime, and wiring them into a real pipeline — the same workflow whether you run Parakeet, Whisper, or both.
See the deployment course →Related models & guides
- → Whisper large-v3 — the 99-language incumbent Parakeet is measured against
- → Parakeet vs Whisper — the full accuracy and speed head-to-head
- → faster-whisper guide — the optimized-Whisper route for faster local transcription
Go from reading about AI to building with AI
20 structured courses. Hands-on projects. Runs on your machine. Start free.
Written by the Local AI Master Team
The team behind Local AI Master
We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.