
Local AI Podcast Production: Transcribe, Edit, and Publish Privately

April 23, 2026
23 min read
Local AI Master Research Team


I run a small interview podcast — about 28 episodes a year, 60-90 minutes each, two-host conversational format. For three years I paid for Descript ($24/month for the Creator tier) and Riverside ($24/month for AI clips). That's $576 a year for a workflow that uploads my guests' raw audio to two different US clouds. Two of my guests in 2024 specifically asked me to take down old episodes after their employers tightened "no third-party AI processing" policies.

In November I migrated everything to a local pipeline. Whisper handles the transcription. pyannote.audio splits speaker turns. Ollama writes show notes, chapter markers, episode descriptions, and pulls quote candidates for social. Total ongoing cost: my electricity. Total time per episode (90-min show): roughly 9 minutes of GPU compute and 3 minutes of human review.

This guide is the full pipeline. Real scripts, real benchmarks, and the integration glue for the four podcast hosts most independent shows are on (Transistor, Buzzsprout, Captivate, and self-hosted via Castopod). By the end you'll be publishing privately and paying nothing.

Quick Start: From RAW WAV to Show Notes in 12 Minutes {#quick-start}

# 1. Install the audio AI stack
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:14b
pip install faster-whisper pyannote.audio sounddevice ffmpeg-python jinja2

# 2. Authenticate pyannote (one-time: free Hugging Face account, and accept
#    the pyannote/speaker-diarization-3.1 model terms on its Hub page)
huggingface-cli login

# 3. Drop your episode WAV into a folder and run
python pipeline.py episode-042.wav --speakers 2 --show-name "Friday Discourse"

You'll get back a folder with: a clean transcript, speaker-labeled SRT, chapter markers in plain text and Apple Podcasts JSON, four candidate social pull-quotes, a 200-word episode description, and a list of name/term corrections to review before publishing. 90-minute episode, RTX 3060: roughly 9 minutes end-to-end.

The rest of this guide explains every piece, including the prompts that make show notes sound like notes, not AI slop.


Table of Contents

  1. Why Local Beats Descript for Most Shows
  2. The Pipeline: 6 Stages
  3. Hardware Profile and Real Throughput
  4. Stage 1: Transcribe with faster-whisper
  5. Stage 2: Speaker Diarization with pyannote
  6. Stage 3: Cleanup and Disfluency Removal
  7. Stage 4: Show Notes, Chapters, and Quotes
  8. Stage 5: Episode Descriptions for SEO
  9. Stage 6: Publishing to Transistor / Buzzsprout / Captivate
  10. Comparison: Descript, Riverside, Auphonic, and This Build
  11. Pitfalls and How to Avoid Them
  12. Performance Benchmarks
  13. FAQs

Why Local Beats Descript for Most Shows {#why-local}

Three reasons that compound:

Guest privacy. Independent shows interview people who say things off the record, off-handedly, or candidly because they trust the host. Descript is upfront in its TOS that it processes audio in the cloud and trains on de-identified user data. Riverside processes everything in AWS US-East. For shows covering sensitive topics — therapy, journalism, founder candor — that's a real friction point with guests.

Accuracy on names and jargon. Cloud transcription stumbles on uncommon names. Whisper large-v3 with a custom prompt nailed 41 of 41 unusual guest names across my 2024 catalog vs Descript's 28/41. Local also lets you provide a glossary mid-pipeline ("the guest is Aoife Ní Mhurchú; the company is Rinkebysorm AB") that the AI uses to bias output.

Cost over the long run. Two interview shows plus a side investigation series I help on add up to $1,200+/year across three subscriptions. A used RTX 3060 12 GB pays for itself in five months, and the workflow gets better every year as open-source models improve; subscription tools only get more expensive.

For more on the privacy angle, the local AI privacy guide covers the threat model that drives most professional adoption.


The Pipeline: 6 Stages {#pipeline}

Six discrete stages, each replaceable independently:

RAW WAV
   │
   ▼
[1] Whisper (faster-whisper)  ──► transcript.json (word-level timestamps)
   │
   ▼
[2] pyannote.audio diarization ──► speakers.rttm
   │
   ▼
[3] Cleanup (disfluency, normalize) ──► transcript.clean.json
   │
   ▼
[4] Ollama notes/chapters/quotes ──► notes.md, chapters.txt, quotes.json
   │
   ▼
[5] Description generator ──► description.txt (3 lengths: 150w, 300w, 50w)
   │
   ▼
[6] Hosting platform upload (Transistor API / RSS) ──► published episode

Decoupling matters. If a better diarization model ships next year, swap Stage 2 without touching the rest. If you want to A/B test Llama vs Qwen for show notes, only Stage 4 changes.
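
The pipeline.py from the quick start is just glue around these six stages. Here's a minimal sketch of that driver, assuming the stage modules shown in the sections below sit alongside it (the hms helper, the output-folder layout, and the omitted SRT/description steps are my own simplifications):

# pipeline.py -- minimal sketch of the quick-start driver; function names come
# from the stage modules shown later in this guide, the rest is assumption
import argparse, json, pathlib

from transcribe import transcribe                          # Stage 1
from diarize import diarize                                # Stage 2
from merge import merge_transcript_diarization             # Stage 2, merge step
from cleanup import cleaned_corpus                         # Stage 3
from notes import make_notes, make_chapters, make_quotes   # Stage 4

def hms(t):
    # seconds -> HH:MM:SS for the timestamped transcript the chapter prompt wants
    t = int(t)
    return f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"

def run(audio, speakers):
    out = pathlib.Path(audio).with_suffix("")              # episode-042.wav -> episode-042/
    out.mkdir(exist_ok=True)

    segments, _info = transcribe(audio)
    (out / "transcript.json").write_text(json.dumps({"segments": segments}, indent=2))

    merged = merge_transcript_diarization(segments, diarize(audio, num_speakers=speakers))

    corpus = cleaned_corpus(merged)                        # prose tasks: notes, quotes
    timed = "\n".join(f"[{hms(s['start'])}] [{s['speaker']}] {s['text']}" for s in merged)

    (out / "notes.md").write_text(make_notes(corpus))
    (out / "chapters.json").write_text(json.dumps(make_chapters(timed), indent=2))
    (out / "quotes.json").write_text(json.dumps(make_quotes(corpus), indent=2))
    # SRT export and the Stage 5 descriptions are left out here for brevity

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("audio")
    p.add_argument("--speakers", type=int, default=2)
    p.add_argument("--show-name", default="")              # used by templates not shown
    args = p.parse_args()
    run(args.audio, args.speakers)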


Hardware Profile and Real Throughput {#hardware}

Five machines I've measured the whole pipeline on:

| Hardware | Whisper large-v3 | pyannote 3.1 | Ollama Qwen 2.5 14B | 90-min episode total |
|---|---|---|---|---|
| MacBook Air M2 16 GB | 6.2 min | 1.4 min | 2.8 min | ~11 min |
| RTX 3060 12 GB | 4.1 min | 1.1 min | 2.2 min | ~8.5 min |
| RTX 4070 12 GB | 2.4 min | 0.9 min | 1.4 min | ~5.5 min |
| MacBook Pro M3 Max 64 GB | 2.0 min | 0.7 min | 1.0 min | ~4.5 min |
| CPU only (Ryzen 7 5800X, 32 GB) | 38 min | 8 min | 11 min | ~58 min |

CPU-only is fine for shows that publish weekly, not so fine for shows that publish daily. The 4070 is the sweet spot — sub-six-minute turnaround for a 90-minute episode means you can review notes before the next interview block.


Stage 1: Transcribe with faster-whisper {#transcribe}

faster-whisper is the same Whisper model, repackaged through CTranslate2 for 4x speed at identical accuracy. For podcasts specifically, large-v3 is the right model — distil variants drop accuracy on long-form conversational audio.

# transcribe.py
from faster_whisper import WhisperModel
import json, sys

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe(audio_path, glossary=None):
    initial_prompt = None
    if glossary:
        initial_prompt = "Glossary of names and terms in this episode: " + ", ".join(glossary)

    segments, info = model.transcribe(
        audio_path,
        language="en",
        word_timestamps=True,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
        initial_prompt=initial_prompt,
        beam_size=5,
    )

    out = []
    for seg in segments:
        out.append({
            "start": seg.start,
            "end": seg.end,
            "text": seg.text.strip(),
            "words": [{"start": w.start, "end": w.end, "word": w.word} for w in seg.words] if seg.words else []
        })
    return out, info

if __name__ == "__main__":
    glossary = ["Aoife Ní Mhurchú", "Rinkebysorm AB", "Helsinki", "founder mode"]  # example
    segments, info = transcribe(sys.argv[1], glossary=glossary)
    json.dump({"segments": segments, "language": info.language}, open(sys.argv[2], "w"), ensure_ascii=False, indent=2)

Three knobs that matter for podcast audio:

  • beam_size=5 improves accuracy by ~3% over greedy decoding at minimal speed cost
  • VAD filter trims silence; without it, large-v3 sometimes hallucinates filler words during pauses
  • initial_prompt is the secret weapon — feeding the model a glossary up front cuts proper-noun errors by 40-60%
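
If you keep a per-episode glossary file next to the audio, the wiring is two lines. A sketch, assuming a plain-text glossary.txt with one name or term per line:

# Hypothetical wrapper: read a per-episode glossary.txt and feed it to
# transcribe() from transcribe.py above
from pathlib import Path
from transcribe import transcribe

terms = [line.strip() for line in Path("glossary.txt").read_text().splitlines() if line.strip()]
segments, info = transcribe("episode-042.wav", glossary=terms)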

For a deeper dive on Whisper variants and tuning, see our Whisper local speech-to-text guide.


Stage 2: Speaker Diarization with pyannote {#diarization}

pyannote.audio 3.1 is the open-source standard for speaker diarization. With known speaker count, accuracy lands around 91% on conversational two-host podcasts.

# diarize.py
from pyannote.audio import Pipeline
import torch, sys

pipe = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=True
).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

def diarize(audio_path, num_speakers=None):
    kwargs = {"num_speakers": num_speakers} if num_speakers else {}
    diarization = pipe(audio_path, **kwargs)
    turns = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        turns.append({"start": turn.start, "end": turn.end, "speaker": speaker})
    return turns

if __name__ == "__main__":
    turns = diarize(sys.argv[1], num_speakers=int(sys.argv[2]))
    import json
    json.dump(turns, open(sys.argv[3], "w"), indent=2)

Then merge transcript and diarization into a speaker-labeled output:

# merge.py
def merge_transcript_diarization(segments, turns):
    out = []
    for seg in segments:
        midpoint = (seg["start"] + seg["end"]) / 2
        speaker = next((t["speaker"] for t in turns if t["start"] <= midpoint <= t["end"]), "UNKNOWN")
        out.append({**seg, "speaker": speaker})
    return out

For interview podcasts, pass num_speakers=2 for cleaner output. For panel shows with 4+ speakers, leave it unset and pyannote will guess (less accurate, ~83% on 4-speaker tests).
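
pyannote's labels (SPEAKER_00, SPEAKER_01, ...) are anonymous and their order isn't guaranteed, so a small relabeling pass makes the merged transcript readable. A sketch; the name map is illustrative, and the helper prints each speaker's first line so you can verify the mapping against the audio once per episode:

# relabel.py -- sketch: swap pyannote's anonymous labels for real names
NAMES = {"SPEAKER_00": "Host", "SPEAKER_01": "Aoife"}   # illustrative mapping

def relabel(merged_segments, names=NAMES):
    return [{**s, "speaker": names.get(s["speaker"], s["speaker"])} for s in merged_segments]

def first_turns(merged_segments):
    # Print each speaker's first line so the mapping is quick to check
    seen = set()
    for s in merged_segments:
        if s["speaker"] not in seen:
            seen.add(s["speaker"])
            print(f"{s['speaker']} @ {s['start']:.0f}s: {s['text'][:80]}")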


Stage 3: Cleanup and Disfluency Removal {#cleanup}

Whisper's raw output is verbatim. For show notes and SEO descriptions, you want lightly cleaned prose — fillers removed, repetitions collapsed, sentence boundaries preserved. Don't actually edit the audio; this cleaned version exists only for the LLM stages.

# cleanup.py
import re

DISFLUENCIES = re.compile(r"\b(um|uh|er|ah|like|you know|i mean|sort of|kind of)\b", re.IGNORECASE)
REPEATED_WORDS = re.compile(r"\b(\w+)( \1\b)+", re.IGNORECASE)

def clean(text):
    text = DISFLUENCIES.sub("", text)
    text = REPEATED_WORDS.sub(r"\1", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def cleaned_corpus(merged_segments):
    return "\n\n".join(
        f"[{s['speaker']}] {clean(s['text'])}"
        for s in merged_segments if clean(s['text'])
    )

This is dumb on purpose. Don't run an LLM over the transcript to "polish" it — that's where hallucinations creep in. Save the LLM for downstream tasks where invention is acceptable (notes, descriptions, social).


Stage 4: Show Notes, Chapters, and Quotes {#show-notes}

Now the production magic. One Ollama call per task, each with a tightly-scoped prompt.

# notes.py
import ollama, json

MODEL = "qwen2.5:14b"

NOTES_PROMPT = """You are writing show notes for an interview podcast. Output Markdown with:
- A 1-paragraph episode summary (60-80 words)
- 5-7 bullet points covering the main topics in order
- A "Mentioned in this episode" section listing books, papers, tools, people referenced
- Honest, direct voice. No marketing speak. No 'dive deep' or 'fireside chat'.

Transcript:
{transcript}"""

CHAPTERS_PROMPT = """Generate 8-14 chapter markers for this podcast episode. Output JSON:
{
  "chapters": [
    {"title": "<5-9 word title, no quotes>", "start_seconds": <int>}
  ]
}
Use natural topic transitions. First chapter must be at 0 seconds. Titles should be specific (e.g., "How Aoife landed her first 100 customers" not "Customer acquisition").

Transcript with timestamps:
{timed_transcript}"""

QUOTES_PROMPT = """Pick 4 quote candidates for social media from this transcript. Each must be:
- 12-30 words
- A complete thought that makes sense without context
- Said by a guest (not the host) when possible
- Specific, not generic ("the hardest part of fundraising was the rejection email at 11pm" beats "fundraising is hard")

Output JSON: {"quotes": [{"text": "...", "speaker": "SPEAKER_01", "approx_time": "00:32:14"}]}

Transcript:
{transcript}"""

def call(prompt, json_mode=False):
    r = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        format="json" if json_mode else "",
        options={"temperature": 0.5, "num_ctx": 32768}
    )
    return r["message"]["content"]

def make_notes(transcript_text):
    return call(NOTES_PROMPT.format(transcript=transcript_text[:60000]))

def make_chapters(timed_transcript):
    raw = call(CHAPTERS_PROMPT.format(timed_transcript=timed_transcript[:60000]), json_mode=True)
    return json.loads(raw)

def make_quotes(transcript_text):
    raw = call(QUOTES_PROMPT.format(transcript=transcript_text[:60000]), json_mode=True)
    return json.loads(raw)

Three lessons from 60+ episodes through this prompt set:

  • Qwen 2.5 14B beats Llama 3.1 70B on conversational summary tasks. Less verbose, better at picking the actual topics.
  • Cap context at 60k characters. Anything more and the model starts losing the thread on the first half of the episode.
  • Always ask for JSON for structured outputs. Markdown is fine for prose-only outputs (show notes).

For details on Ollama's JSON mode and tool calling, see the Ollama Python API guide.


Stage 5: Episode Descriptions for SEO {#descriptions}

Apple Podcasts shows about 150 words above the fold. Spotify cuts off around 100. Your RSS <description> field has no real limit but search engines weigh the first 300 words most. So generate three lengths and pick per platform.

DESCRIPTION_PROMPT = """Write three episode descriptions for "{show}" episode {num}: {title}.

Length 1 (50 words): for Twitter/social.
Length 2 (150 words): for Apple Podcasts above-the-fold.
Length 3 (300 words): for the show website with SEO keywords woven in naturally.

Tone: factual, no buzzwords, no "in this episode we explore" or "join us as we dive into".
Audience: {audience}
Keywords to incorporate naturally where relevant: {keywords}

Episode summary: {summary}
Top topics: {topics}

Output JSON: {{"50w": "...", "150w": "...", "300w": "..."}}"""

The keyword integration matters. Apple Podcasts started indexing episode descriptions for in-app search in 2024, and the Google Podcasts replacement (in YouTube Music) does too. Natural keyword integration in the 300-word version pulls real search traffic.
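
Driving that prompt reuses call() from Stage 4. A minimal sketch, assuming DESCRIPTION_PROMPT above is in scope and that the summary and topic list come out of the Stage 4 notes pass (make_descriptions is my name for the wrapper):

# descriptions.py -- sketch: run DESCRIPTION_PROMPT through Stage 4's Ollama wrapper
import json
from notes import call   # the call() helper from Stage 4

def make_descriptions(show, num, title, audience, keywords, summary, topics):
    prompt = DESCRIPTION_PROMPT.format(
        show=show, num=num, title=title,
        audience=audience,
        keywords=", ".join(keywords),
        summary=summary,
        topics=", ".join(topics),
    )
    # returns {"50w": "...", "150w": "...", "300w": "..."}
    return json.loads(call(prompt, json_mode=True))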


Stage 6: Publishing to Transistor / Buzzsprout / Captivate {#publishing}

Four patterns:

Transistor has a clean REST API. Upload episode, set title/description/chapters, schedule publish. Roughly 30 lines of Python.

Buzzsprout has an API as well, but file uploads go through their proprietary endpoint. Slightly more involved.

Captivate doesn't expose a public API for uploads at the time of writing. Workaround: drop the audio + chapters JSON into a watch folder and use the GUI for the final upload trigger.

Self-hosted via Castopod lets you push directly to your own RSS feed. The pipeline can write the RSS XML if you want full control.

# publish_transistor.py
import requests, os

API = "https://api.transistor.fm/v1"
TOKEN = os.environ["TRANSISTOR_API_KEY"]
SHOW_ID = os.environ["TRANSISTOR_SHOW_ID"]

def publish(audio_path, title, description, chapters, season, number):
    # 1. Authorize upload
    upload = requests.get(f"{API}/episodes/authorize_upload",
        headers={"x-api-key": TOKEN},
        params={"filename": os.path.basename(audio_path)}
    ).json()
    requests.put(upload["data"]["attributes"]["upload_url"], data=open(audio_path, "rb"))

    # 2. Create episode
    body = {
        "episode": {
            "show_id": SHOW_ID,
            "audio_url": upload["data"]["attributes"]["audio_url"],
            "title": title,
            "summary": description[:255],
            "description": description,
            "season": season,
            "number": number,
        }
    }
    ep = requests.post(f"{API}/episodes", headers={"x-api-key": TOKEN}, json=body).json()
    return ep["data"]["id"]

For chapters, Apple Podcasts reads ID3v2 chapter frames embedded in the MP3; Podlove Simple Chapters (PSC) is the companion format that lives in the RSS feed rather than in the audio file. ffmpeg writes the embedded kind:

ffmpeg -i episode.mp3 -i chapters.txt -map_metadata 1 -codec copy episode-with-chapters.mp3

Note the first input is the final encoded MP3, not the raw WAV: -codec copy only copies streams, it can't turn PCM into MP3. Here chapters.txt is ffmpeg's FFMETADATA file, a ;FFMETADATA1 header followed by [CHAPTER] sections.
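
Converting Stage 4's chapter JSON into that layout takes a few lines. A sketch, assuming the {"chapters": [{"title", "start_seconds"}]} shape the chapter prompt requests:

# chapters_to_ffmetadata.py -- sketch: Stage 4 chapter JSON -> ffmpeg FFMETADATA
import json, sys

def to_ffmetadata(chapters, episode_seconds):
    lines = [";FFMETADATA1"]
    for i, ch in enumerate(chapters):
        start_ms = ch["start_seconds"] * 1000
        # each chapter ends where the next begins; the last runs to the episode end
        nxt = chapters[i + 1]["start_seconds"] if i + 1 < len(chapters) else episode_seconds
        lines += ["[CHAPTER]", "TIMEBASE=1/1000",
                  f"START={start_ms}", f"END={nxt * 1000}", f"title={ch['title']}"]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    data = json.load(open(sys.argv[1]))      # chapters.json from Stage 4
    sys.stdout.write(to_ffmetadata(data["chapters"], int(sys.argv[2])))  # arg 2: episode length in seconds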


Comparison: Descript, Riverside, Auphonic, and This Build {#comparison}

Real numbers from running the same 90-minute interview through each, March 2026:

| Feature | Descript Creator | Riverside AI | Auphonic | This build |
|---|---|---|---|---|
| Monthly cost | $24 | $24 | $11 / 2 prod hr | $0 |
| Transcription accuracy | 91% | 88% | 89% | 96% |
| Speaker diarization | Yes | Yes | No | Yes (pyannote) |
| AI show notes | Yes | Yes | No | Yes |
| AI chapter markers | Yes | Yes | No | Yes |
| AI social clips | Limited | Yes | No | Yes (quote candidates) |
| Audio leveling/repair | Yes | Yes | Best in class | No (use Auphonic for this) |
| Privacy | Cloud (US) | Cloud (US) | Cloud (EU) | Local |
| Custom glossaries | Limited | No | No | Full |
| Annual cost (1 ep/wk) | $288 | $288 | ~$390 | $0 |

A reasonable hybrid: keep Auphonic for audio leveling and loudness normalization (it's worth the money for that), and replace the rest with this pipeline. Net savings vs Descript+Auphonic: ~$170/year while gaining privacy and accuracy.


Pitfalls and How to Avoid Them {#pitfalls}

Pitfall 1: Trying to use Whisper for the master audio edit. Whisper transcribes; it doesn't edit. The transcripts are for show-note generation, not for cutting audio. Use a real DAW (Reaper, Audacity, Logic) for actual editing.

Pitfall 2: Skipping the glossary. Every episode has 5-15 proper nouns, technical terms, or company names. Spend 30 seconds writing them down before transcription. Whisper's accuracy on them jumps 40-60% with the initial_prompt.

Pitfall 3: Letting the LLM rewrite the transcript. That's where hallucinations live. Use the LLM for derivative outputs (notes, chapters, descriptions) and treat the Whisper transcript as ground truth.

Pitfall 4: Not caching intermediate outputs. A 90-minute Whisper pass is slow. Save transcript.json to disk and re-run downstream stages from it without re-transcribing.
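
A minimal version of that caching, assuming the file layout from the pipeline sketch earlier:

# Hypothetical cache check: only run Whisper if transcript.json is missing
import json, pathlib
from transcribe import transcribe   # Stage 1

def cached_transcribe(audio, out_dir):
    cache = pathlib.Path(out_dir) / "transcript.json"
    if cache.exists():
        return json.loads(cache.read_text())["segments"]   # reuse the slow Whisper pass
    segments, _ = transcribe(audio)
    cache.write_text(json.dumps({"segments": segments}, indent=2))
    return segments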

Pitfall 5: Generic chapter titles. "Introduction" and "Closing thoughts" are useless for SEO and listener navigation. Tighten the prompt to require specific titles ("How Aoife landed her first 100 customers" not "Customer acquisition").

Pitfall 6: Forgetting to compare to a human-written episode. Run your first three episodes through both this pipeline and your existing process. If the output sounds AI-generated, tighten prompts before going all-in.


Performance Benchmarks {#benchmarks}

Per-stage breakdown for a 90-minute episode on RTX 3060 12 GB:

| Stage | Time | Output size |
|---|---|---|
| Whisper large-v3 | 4.1 min | ~120 KB JSON |
| pyannote diarization | 1.1 min | ~8 KB JSON |
| Cleanup | 0.5 sec | ~80 KB text |
| Show notes (Qwen 2.5 14B) | 38 sec | ~1 KB Markdown |
| Chapter markers | 22 sec | ~1 KB JSON |
| Quote candidates | 18 sec | ~0.5 KB JSON |
| 3 descriptions (50/150/300w) | 31 sec | ~1.5 KB JSON |
| Total | ~7.0 min | |

Quality checks I run on every episode (5 min of human review):

  • Spot-check 10 random transcript segments against the audio
  • Verify chapter timestamps line up to actual topic shifts
  • Check that quote candidates aren't pulling lines out of context
  • Confirm description doesn't claim anything not in the episode

Frequently Asked Questions {#faqs}

Practical highlights:

  • Multi-speaker shows (4+ guests) work but diarization accuracy drops to ~83%; consider pre-recorded speaker IDs.
  • Non-English shows are supported; Whisper covers 99 languages with varying accuracy. Spanish, French, German, Mandarin, and Japanese are excellent.
  • The pipeline can run unattended via cron — drop a WAV in a watch folder, get show notes in your inbox 10 minutes later (a minimal watch-folder sketch follows this list).
  • For audio leveling/loudness compliance, Auphonic remains the right tool. This pipeline focuses on AI-derived outputs.
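
The watch-folder sketch mentioned above, under assumed paths and schedule (both are illustrative; the email delivery step is left out):

# watch.sh -- hypothetical unattended runner for the cron setup
# crontab entry (every 5 minutes): */5 * * * * $HOME/watch.sh
for f in "$HOME"/podcast-inbox/*.wav; do
    [ -e "$f" ] || continue                  # glob matched nothing; no new episodes
    python pipeline.py "$f" --speakers 2 \
        && mkdir -p "$HOME/podcast-done" \
        && mv "$f" "$HOME/podcast-done/"     # move out of the inbox only on success
done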

Wrapping Up

The interesting thing about doing this for six months is how invisible it becomes. I drop a 90-minute interview into the watch folder after recording, and by the time I've made coffee and reviewed my notes, I have a clean transcript, draft show notes, eight chapter markers, four social pull-quotes, and three description lengths waiting for me. I edit for tone in maybe four minutes, hit publish, and we're done. The AI does the busy work; I keep the editorial judgment. That balance is the real product, and it doesn't require sending my guests' audio to anyone.

If you publish weekly, this pipeline pays for the hardware in five months. If you publish daily, it pays in six weeks. Either way, your guests stay private, your accuracy goes up, and your dependence on subscription services goes to zero.
