
Local AI Podcast Production: Transcribe, Edit, and Publish Privately

April 23, 2026
23 min read
Local AI Master Research Team


I run a small interview podcast — about 28 episodes a year, 60-90 minutes each, two-host conversational format. For three years I paid for Descript ($24/month for the Creator tier) and Riverside ($24/month for AI clips). That's $576 a year for a workflow that uploads my guests' raw audio to two different US clouds. Two of my guests in 2024 specifically asked me to take down old episodes after their employers tightened "no third-party AI processing" policies.

In November I migrated everything to a local pipeline. Whisper handles the transcription. pyannote.audio splits speaker turns. Ollama writes show notes, chapter markers, episode descriptions, and pulls quote candidates for social. Total ongoing cost: my electricity. Total time per episode (90-min show): roughly 9 minutes of GPU compute and 3 minutes of human review.

This guide is the full pipeline. Real scripts, real benchmarks, and the integration glue for the four podcast hosts most independent shows are on (Transistor, Buzzsprout, Captivate, and self-hosted via Castopod). By the end you'll be publishing privately and paying nothing.

Quick Start: From RAW WAV to Show Notes in 12 Minutes {#quick-start}

# 1. Install the audio AI stack
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:14b
pip install faster-whisper pyannote.audio sounddevice ffmpeg-python jinja2

# 2. Authenticate pyannote (one-time: free Hugging Face account, and accept
#    the pyannote/speaker-diarization-3.1 model terms on its Hub page)
huggingface-cli login

# 3. Drop your episode WAV into a folder and run
python pipeline.py episode-042.wav --speakers 2 --show-name "Friday Discourse"

You'll get back a folder with: a clean transcript, speaker-labeled SRT, chapter markers in plain text and Apple Podcasts JSON, four candidate social pull-quotes, a 200-word episode description, and a list of name/term corrections to review before publishing. 90-minute episode, RTX 3060: roughly 9 minutes end-to-end.

The rest of this guide explains every piece, including the prompts that make show notes sound like notes, not AI slop.


Table of Contents

  1. Why Local Beats Descript for Most Shows
  2. The Pipeline: 6 Stages
  3. Hardware Profile and Real Throughput
  4. Stage 1: Transcribe with faster-whisper
  5. Stage 2: Speaker Diarization with pyannote
  6. Stage 3: Cleanup and Disfluency Removal
  7. Stage 4: Show Notes, Chapters, and Quotes
  8. Stage 5: Episode Descriptions for SEO
  9. Stage 6: Publishing to Transistor / Buzzsprout / Captivate
  10. Comparison: Descript, Riverside, Auphonic, and This Build
  11. Pitfalls and How to Avoid Them
  12. Performance Benchmarks
  13. FAQs

Why Local Beats Descript for Most Shows {#why-local}

Three reasons that compound:

Guest privacy. Independent shows interview people who say things off the record, off-handedly, or candidly because they trust the host. Descript is upfront in its TOS that it processes audio in the cloud and trains on de-identified user data. Riverside processes everything in AWS US-East. For shows covering sensitive topics — therapy, journalism, founder candor — that's a real friction point with guests.

Accuracy on names and jargon. Cloud transcription stumbles on uncommon names. Whisper large-v3 with a custom prompt nailed 41 of 41 unusual guest names across my 2024 catalog vs Descript's 28/41. Local also lets you provide a glossary mid-pipeline ("the guest is Aoife Ní Mhurchú; the company is Rinkebysorm AB") that the AI uses to bias output.

Cost over the long run. Two interview shows plus a side investigation series I help on add up to $1,200+/year across three subscriptions. A used RTX 3060 12 GB pays for itself in five months, and the workflow gets better every year as open-source models improve; subscription tools only get more expensive.

For more on the privacy angle, the local AI privacy guide covers the threat model that drives most professional adoption.


The Pipeline: 6 Stages {#pipeline}

Six discrete stages, each replaceable independently:

RAW WAV
   │
   ▼
[1] Whisper (faster-whisper)  ──► transcript.json (word-level timestamps)
   │
   ▼
[2] pyannote.audio diarization ──► speakers.rttm
   │
   ▼
[3] Cleanup (disfluency, normalize) ──► transcript.clean.json
   │
   ▼
[4] Ollama notes/chapters/quotes ──► notes.md, chapters.txt, quotes.json
   │
   ▼
[5] Description generator ──► description.txt (3 lengths: 150w, 300w, 50w)
   │
   ▼
[6] Hosting platform upload (Transistor API / RSS) ──► published episode

Decoupling matters. If a better diarization model ships next year, swap Stage 2 without touching the rest. If you want to A/B test Llama vs Qwen for show notes, only Stage 4 changes.
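
The pipeline.py from the quick start is just glue around these six stages. Here's a minimal sketch of that driver, assuming the stage modules shown in the sections below sit alongside it (the hms helper, the output-folder layout, and the omitted SRT/description steps are my own simplifications):

# pipeline.py -- minimal sketch of the quick-start driver; function names come
# from the stage modules shown later in this guide, the rest is assumption
import argparse, json, pathlib

from transcribe import transcribe                          # Stage 1
from diarize import diarize                                # Stage 2
from merge import merge_transcript_diarization             # Stage 2, merge step
from cleanup import cleaned_corpus                         # Stage 3
from notes import make_notes, make_chapters, make_quotes   # Stage 4

def hms(t):
    # seconds -> HH:MM:SS for the timestamped transcript the chapter prompt wants
    t = int(t)
    return f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"

def run(audio, speakers):
    out = pathlib.Path(audio).with_suffix("")              # episode-042.wav -> episode-042/
    out.mkdir(exist_ok=True)

    segments, _info = transcribe(audio)
    (out / "transcript.json").write_text(json.dumps({"segments": segments}, indent=2))

    merged = merge_transcript_diarization(segments, diarize(audio, num_speakers=speakers))

    corpus = cleaned_corpus(merged)                        # prose tasks: notes, quotes
    timed = "\n".join(f"[{hms(s['start'])}] [{s['speaker']}] {s['text']}" for s in merged)

    (out / "notes.md").write_text(make_notes(corpus))
    (out / "chapters.json").write_text(json.dumps(make_chapters(timed), indent=2))
    (out / "quotes.json").write_text(json.dumps(make_quotes(corpus), indent=2))
    # SRT export and the Stage 5 descriptions are left out here for brevity

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("audio")
    p.add_argument("--speakers", type=int, default=2)
    p.add_argument("--show-name", default="")              # used by templates not shown
    args = p.parse_args()
    run(args.audio, args.speakers)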


Hardware Profile and Real Throughput {#hardware}

Five machines I've measured the whole pipeline on:

| Hardware | Whisper large-v3 | pyannote 3.1 | Ollama Qwen 2.5 14B | 90-min episode total |
|---|---|---|---|---|
| MacBook Air M2 16 GB | 6.2 min | 1.4 min | 2.8 min | ~11 min |
| RTX 3060 12 GB | 4.1 min | 1.1 min | 2.2 min | ~8.5 min |
| RTX 4070 12 GB | 2.4 min | 0.9 min | 1.4 min | ~5.5 min |
| MacBook Pro M3 Max 64 GB | 2.0 min | 0.7 min | 1.0 min | ~4.5 min |
| CPU only (Ryzen 7 5800X, 32 GB) | 38 min | 8 min | 11 min | ~58 min |

CPU-only is fine for shows that publish weekly, not so fine for shows that publish daily. The 4070 is the sweet spot — sub-six-minute turnaround for a 90-minute episode means you can review notes before the next interview block.


Stage 1: Transcribe with faster-whisper {#transcribe}

faster-whisper is the same Whisper model, repackaged through CTranslate2 for 4x speed at identical accuracy. For podcasts specifically, large-v3 is the right model — distil variants drop accuracy on long-form conversational audio.

# transcribe.py
from faster_whisper import WhisperModel
import json, sys

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe(audio_path, glossary=None):
    initial_prompt = None
    if glossary:
        initial_prompt = "Glossary of names and terms in this episode: " + ", ".join(glossary)

    segments, info = model.transcribe(
        audio_path,
        language="en",
        word_timestamps=True,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
        initial_prompt=initial_prompt,
        beam_size=5,
    )

    out = []
    for seg in segments:
        out.append({
            "start": seg.start,
            "end": seg.end,
            "text": seg.text.strip(),
            "words": [{"start": w.start, "end": w.end, "word": w.word} for w in seg.words] if seg.words else []
        })
    return out, info

if __name__ == "__main__":
    glossary = ["Aoife Ní Mhurchú", "Rinkebysorm AB", "Helsinki", "founder mode"]  # example
    segments, info = transcribe(sys.argv[1], glossary=glossary)
    json.dump({"segments": segments, "language": info.language}, open(sys.argv[2], "w"), ensure_ascii=False, indent=2)

Three knobs that matter for podcast audio:

  • beam_size=5 improves accuracy by ~3% over greedy decoding at minimal speed cost
  • VAD filter trims silence; without it, large-v3 sometimes hallucinates filler words during pauses
  • initial_prompt is the secret weapon — feeding the model a glossary up front cuts proper-noun errors by 40-60%
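
If you keep a per-episode glossary file next to the audio, the wiring is two lines. A sketch, assuming a plain-text glossary.txt with one name or term per line:

# Hypothetical wrapper: read a per-episode glossary.txt and feed it to
# transcribe() from transcribe.py above
from pathlib import Path
from transcribe import transcribe

terms = [line.strip() for line in Path("glossary.txt").read_text().splitlines() if line.strip()]
segments, info = transcribe("episode-042.wav", glossary=terms)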

For a deeper dive on Whisper variants and tuning, see our Whisper local speech-to-text guide.


Stage 2: Speaker Diarization with pyannote {#diarization}

pyannote.audio 3.1 is the open-source standard for speaker diarization. With known speaker count, accuracy lands around 91% on conversational two-host podcasts.

# diarize.py
from pyannote.audio import Pipeline
import torch, sys

pipe = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=True
).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

def diarize(audio_path, num_speakers=None):
    kwargs = {"num_speakers": num_speakers} if num_speakers else {}
    diarization = pipe(audio_path, **kwargs)
    turns = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        turns.append({"start": turn.start, "end": turn.end, "speaker": speaker})
    return turns

if __name__ == "__main__":
    turns = diarize(sys.argv[1], num_speakers=int(sys.argv[2]))
    import json
    json.dump(turns, open(sys.argv[3], "w"), indent=2)

Then merge transcript and diarization into a speaker-labeled output:

# merge.py
def merge_transcript_diarization(segments, turns):
    out = []
    for seg in segments:
        midpoint = (seg["start"] + seg["end"]) / 2
        speaker = next((t["speaker"] for t in turns if t["start"] <= midpoint <= t["end"]), "UNKNOWN")
        out.append({**seg, "speaker": speaker})
    return out

For interview podcasts, pass num_speakers=2 for cleaner output. For panel shows with 4+ speakers, leave it unset and pyannote will guess (less accurate, ~83% on 4-speaker tests).
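
pyannote's labels (SPEAKER_00, SPEAKER_01, ...) are anonymous and their order isn't guaranteed, so a small relabeling pass makes the merged transcript readable. A sketch; the name map is illustrative, and the helper prints each speaker's first line so you can verify the mapping against the audio once per episode:

# relabel.py -- sketch: swap pyannote's anonymous labels for real names
NAMES = {"SPEAKER_00": "Host", "SPEAKER_01": "Aoife"}   # illustrative mapping

def relabel(merged_segments, names=NAMES):
    return [{**s, "speaker": names.get(s["speaker"], s["speaker"])} for s in merged_segments]

def first_turns(merged_segments):
    # Print each speaker's first line so the mapping is quick to check
    seen = set()
    for s in merged_segments:
        if s["speaker"] not in seen:
            seen.add(s["speaker"])
            print(f"{s['speaker']} @ {s['start']:.0f}s: {s['text'][:80]}")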


Stage 3: Cleanup and Disfluency Removal {#cleanup}

Whisper's raw output is verbatim. For show notes and SEO descriptions, you want lightly cleaned prose — fillers removed, repetitions collapsed, sentence boundaries preserved. Don't actually edit the audio; this cleaned version exists only for the LLM stages.

# cleanup.py
import re

DISFLUENCIES = re.compile(r"\b(um|uh|er|ah|like|you know|i mean|sort of|kind of)\b", re.IGNORECASE)
REPEATED_WORDS = re.compile(r"\b(\w+)( \1\b)+", re.IGNORECASE)

def clean(text):
    text = DISFLUENCIES.sub("", text)
    text = REPEATED_WORDS.sub(r"\1", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def cleaned_corpus(merged_segments):
    return "\n\n".join(
        f"[{s['speaker']}] {clean(s['text'])}"
        for s in merged_segments if clean(s['text'])
    )

This is dumb on purpose. Don't run an LLM over the transcript to "polish" it — that's where hallucinations creep in. Save the LLM for downstream tasks where invention is acceptable (notes, descriptions, social).


Stage 4: Show Notes, Chapters, and Quotes {#show-notes}

Now the production magic. One Ollama call per task, each with a tightly-scoped prompt.

# notes.py
import ollama, json

MODEL = "qwen2.5:14b"

NOTES_PROMPT = """You are writing show notes for an interview podcast. Output Markdown with:
- A 1-paragraph episode summary (60-80 words)
- 5-7 bullet points covering the main topics in order
- A "Mentioned in this episode" section listing books, papers, tools, people referenced
- Honest, direct voice. No marketing speak. No 'dive deep' or 'fireside chat'.

Transcript:
{transcript}"""

CHAPTERS_PROMPT = """Generate 8-14 chapter markers for this podcast episode. Output JSON:
{
  "chapters": [
    {"title": "<5-9 word title, no quotes>", "start_seconds": <int>}
  ]
}
Use natural topic transitions. First chapter must be at 0 seconds. Titles should be specific (e.g., "How Aoife landed her first 100 customers" not "Customer acquisition").

Transcript with timestamps:
{timed_transcript}"""

QUOTES_PROMPT = """Pick 4 quote candidates for social media from this transcript. Each must be:
- 12-30 words
- A complete thought that makes sense without context
- Said by a guest (not the host) when possible
- Specific, not generic ("the hardest part of fundraising was the rejection email at 11pm" beats "fundraising is hard")

Output JSON: {"quotes": [{"text": "...", "speaker": "SPEAKER_01", "approx_time": "00:32:14"}]}

Transcript:
{transcript}"""

def call(prompt, json_mode=False):
    r = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        format="json" if json_mode else "",
        options={"temperature": 0.5, "num_ctx": 32768}
    )
    return r["message"]["content"]

def make_notes(transcript_text):
    return call(NOTES_PROMPT.format(transcript=transcript_text[:60000]))

def make_chapters(timed_transcript):
    raw = call(CHAPTERS_PROMPT.format(timed_transcript=timed_transcript[:60000]), json_mode=True)
    return json.loads(raw)

def make_quotes(transcript_text):
    raw = call(QUOTES_PROMPT.format(transcript=transcript_text[:60000]), json_mode=True)
    return json.loads(raw)

Three lessons from 60+ episodes through this prompt set:

  • Qwen 2.5 14B beats Llama 3.1 70B on conversational summary tasks. Less verbose, better at picking the actual topics.
  • Cap context at 60k characters. Anything more and the model starts losing the thread on the first half of the episode.
  • Always ask for JSON for structured outputs. Markdown is fine for prose-only outputs (show notes).

For details on Ollama's JSON mode and tool calling, see the Ollama Python API guide.


Stage 5: Episode Descriptions for SEO {#descriptions}

Apple Podcasts shows about 150 words above the fold. Spotify cuts off around 100. Your RSS <description> field has no real limit but search engines weigh the first 300 words most. So generate three lengths and pick per platform.

DESCRIPTION_PROMPT = """Write three episode descriptions for "{show}" episode {num}: {title}.

Length 1 (50 words): for Twitter/social.
Length 2 (150 words): for Apple Podcasts above-the-fold.
Length 3 (300 words): for the show website with SEO keywords woven in naturally.

Tone: factual, no buzzwords, no "in this episode we explore" or "join us as we dive into".
Audience: {audience}
Keywords to incorporate naturally where relevant: {keywords}

Episode summary: {summary}
Top topics: {topics}

Output JSON: {{"50w": "...", "150w": "...", "300w": "..."}}"""

The keyword integration matters. Apple Podcasts started indexing episode descriptions for in-app search in 2024, and the Google Podcasts replacement (in YouTube Music) does too. Natural keyword integration in the 300-word version pulls real search traffic.
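
Driving that prompt reuses call() from Stage 4. A minimal sketch, assuming DESCRIPTION_PROMPT above is in scope and that the summary and topic list come out of the Stage 4 notes pass (make_descriptions is my name for the wrapper):

# descriptions.py -- sketch: run DESCRIPTION_PROMPT through Stage 4's Ollama wrapper
import json
from notes import call   # the call() helper from Stage 4

def make_descriptions(show, num, title, audience, keywords, summary, topics):
    prompt = DESCRIPTION_PROMPT.format(
        show=show, num=num, title=title,
        audience=audience,
        keywords=", ".join(keywords),
        summary=summary,
        topics=", ".join(topics),
    )
    # returns {"50w": "...", "150w": "...", "300w": "..."}
    return json.loads(call(prompt, json_mode=True))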


Stage 6: Publishing to Transistor / Buzzsprout / Captivate {#publishing}

Four patterns:

Transistor has a clean REST API. Upload episode, set title/description/chapters, schedule publish. Roughly 30 lines of Python.

Buzzsprout has an API as well, but file uploads go through their proprietary endpoint. Slightly more involved.

Captivate doesn't expose a public API for uploads at the time of writing. Workaround: drop the audio + chapters JSON into a watch folder and use the GUI for the final upload trigger.

Self-hosted via Castopod lets you push directly to your own RSS feed. The pipeline can write the RSS XML if you want full control.

# publish_transistor.py
import requests, os

API = "https://api.transistor.fm/v1"
TOKEN = os.environ["TRANSISTOR_API_KEY"]
SHOW_ID = os.environ["TRANSISTOR_SHOW_ID"]

def publish(audio_path, title, description, chapters, season, number):
    # 1. Authorize upload
    upload = requests.get(f"{API}/episodes/authorize_upload",
        headers={"x-api-key": TOKEN},
        params={"filename": os.path.basename(audio_path)}
    ).json()
    requests.put(upload["data"]["attributes"]["upload_url"], data=open(audio_path, "rb"))

    # 2. Create episode
    body = {
        "episode": {
            "show_id": SHOW_ID,
            "audio_url": upload["data"]["attributes"]["audio_url"],
            "title": title,
            "summary": description[:255],
            "description": description,
            "season": season,
            "number": number,
        }
    }
    ep = requests.post(f"{API}/episodes", headers={"x-api-key": TOKEN}, json=body).json()
    return ep["data"]["id"]

For chapters, Apple Podcasts reads ID3v2 chapter frames embedded in the MP3; Podlove Simple Chapters (PSC) is the companion format that lives in the RSS feed rather than in the audio file. ffmpeg writes the embedded kind:

ffmpeg -i episode.mp3 -i chapters.txt -map_metadata 1 -codec copy episode-with-chapters.mp3

Note the first input is the final encoded MP3, not the raw WAV: -codec copy only copies streams, it can't turn PCM into MP3. Here chapters.txt is ffmpeg's FFMETADATA file, a ;FFMETADATA1 header followed by [CHAPTER] sections.
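
Converting Stage 4's chapter JSON into that layout takes a few lines. A sketch, assuming the {"chapters": [{"title", "start_seconds"}]} shape the chapter prompt requests:

# chapters_to_ffmetadata.py -- sketch: Stage 4 chapter JSON -> ffmpeg FFMETADATA
import json, sys

def to_ffmetadata(chapters, episode_seconds):
    lines = [";FFMETADATA1"]
    for i, ch in enumerate(chapters):
        start_ms = ch["start_seconds"] * 1000
        # each chapter ends where the next begins; the last runs to the episode end
        nxt = chapters[i + 1]["start_seconds"] if i + 1 < len(chapters) else episode_seconds
        lines += ["[CHAPTER]", "TIMEBASE=1/1000",
                  f"START={start_ms}", f"END={nxt * 1000}", f"title={ch['title']}"]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    data = json.load(open(sys.argv[1]))      # chapters.json from Stage 4
    sys.stdout.write(to_ffmetadata(data["chapters"], int(sys.argv[2])))  # arg 2: episode length in seconds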


Comparison: Descript, Riverside, Auphonic, and This Build {#comparison}

Real numbers from running the same 90-minute interview through each, March 2026:

| Feature | Descript Creator | Riverside AI | Auphonic | This build |
|---|---|---|---|---|
| Monthly cost | $24 | $24 | $11 / 2 prod hr | $0 |
| Transcription accuracy | 91% | 88% | 89% | 96% |
| Speaker diarization | Yes | Yes | No | Yes (pyannote) |
| AI show notes | Yes | Yes | No | Yes |
| AI chapter markers | Yes | Yes | No | Yes |
| AI social clips | Limited | Yes | No | Yes (quote candidates) |
| Audio leveling/repair | Yes | Yes | Best in class | No (use Auphonic for this) |
| Privacy | Cloud (US) | Cloud (US) | Cloud (EU) | Local |
| Custom glossaries | Limited | No | No | Full |
| Annual cost (1 ep/wk) | $288 | $288 | ~$390 | $0 |

A reasonable hybrid: keep Auphonic for audio leveling and loudness normalization (it's worth the money for that), and replace the rest with this pipeline. Net savings vs Descript+Auphonic: ~$170/year while gaining privacy and accuracy.


Pitfalls and How to Avoid Them {#pitfalls}

Pitfall 1: Trying to use Whisper for the master audio edit. Whisper transcribes; it doesn't edit. The transcripts are for show-note generation, not for cutting audio. Use a real DAW (Reaper, Audacity, Logic) for actual editing.

Pitfall 2: Skipping the glossary. Every episode has 5-15 proper nouns, technical terms, or company names. Spend 30 seconds writing them down before transcription. Whisper's accuracy on them jumps 40-60% with the initial_prompt.

Pitfall 3: Letting the LLM rewrite the transcript. That's where hallucinations live. Use the LLM for derivative outputs (notes, chapters, descriptions) and treat the Whisper transcript as ground truth.

Pitfall 4: Not caching intermediate outputs. A 90-minute Whisper pass is slow. Save transcript.json to disk and re-run downstream stages from it without re-transcribing.
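
A minimal version of that caching, assuming the file layout from the pipeline sketch earlier:

# Hypothetical cache check: only run Whisper if transcript.json is missing
import json, pathlib
from transcribe import transcribe   # Stage 1

def cached_transcribe(audio, out_dir):
    cache = pathlib.Path(out_dir) / "transcript.json"
    if cache.exists():
        return json.loads(cache.read_text())["segments"]   # reuse the slow Whisper pass
    segments, _ = transcribe(audio)
    cache.write_text(json.dumps({"segments": segments}, indent=2))
    return segments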

Pitfall 5: Generic chapter titles. "Introduction" and "Closing thoughts" are useless for SEO and listener navigation. Tighten the prompt to require specific titles ("How Aoife landed her first 100 customers" not "Customer acquisition").

Pitfall 6: Forgetting to compare to a human-written episode. Run your first three episodes through both this pipeline and your existing process. If the output sounds AI-generated, tighten prompts before going all-in.


Performance Benchmarks {#benchmarks}

Per-stage breakdown for a 90-minute episode on RTX 3060 12 GB:

| Stage | Time | Output size |
|---|---|---|
| Whisper large-v3 | 4.1 min | ~120 KB JSON |
| pyannote diarization | 1.1 min | ~8 KB JSON |
| Cleanup | 0.5 sec | ~80 KB text |
| Show notes (Qwen 2.5 14B) | 38 sec | ~1 KB Markdown |
| Chapter markers | 22 sec | ~1 KB JSON |
| Quote candidates | 18 sec | ~0.5 KB JSON |
| 3 descriptions (50/150/300w) | 31 sec | ~1.5 KB JSON |
| Total | ~7.0 min | |

Quality checks I run on every episode (5 min of human review):

  • Spot-check 10 random transcript segments against the audio
  • Verify chapter timestamps line up to actual topic shifts
  • Check that quote candidates aren't pulling lines out of context
  • Confirm description doesn't claim anything not in the episode

Frequently Asked Questions {#faqs}

Practical highlights:

  • Multi-speaker shows (4+ guests) work but diarization accuracy drops to ~83%; consider pre-recorded speaker IDs.
  • Non-English shows are supported; Whisper covers 99 languages with varying accuracy. Spanish, French, German, Mandarin, and Japanese are excellent.
  • The pipeline can run unattended via cron — drop a WAV in a watch folder, get show notes in your inbox 10 minutes later (a minimal watch-folder sketch follows this list).
  • For audio leveling/loudness compliance, Auphonic remains the right tool. This pipeline focuses on AI-derived outputs.
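
The watch-folder sketch mentioned above, under assumed paths and schedule (both are illustrative; the email delivery step is left out):

# watch.sh -- hypothetical unattended runner for the cron setup
# crontab entry (every 5 minutes): */5 * * * * $HOME/watch.sh
for f in "$HOME"/podcast-inbox/*.wav; do
    [ -e "$f" ] || continue                  # glob matched nothing; no new episodes
    python pipeline.py "$f" --speakers 2 \
        && mkdir -p "$HOME/podcast-done" \
        && mv "$f" "$HOME/podcast-done/"     # move out of the inbox only on success
done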

Wrapping Up

The interesting thing about doing this for six months is how invisible it becomes. I drop a 90-minute interview into the watch folder after recording, and by the time I've made coffee and reviewed my notes, I have a clean transcript, draft show notes, eight chapter markers, four social pull-quotes, and three description lengths waiting for me. I edit for tone in maybe four minutes, hit publish, and we're done. The AI does the busy work; I keep the editorial judgment. That balance is the real product, and it doesn't require sending my guests' audio to anyone.

If you publish weekly, this pipeline pays for the hardware in five months. If you publish daily, it pays in six weeks. Either way, your guests stay private, your accuracy goes up, and your dependence on subscription services goes to zero.
