Scrape YouTube Captions into Clear-Text on Debian/Ubuntu

By thePR0M3TH3AN ✝️ BIP110 June 13, 2025 · Edited June 13, 2025

Ever wished you could grab the spoken words from a YouTube video without opening a browser? On any Debian-based distro (Ubuntu, Pop!_OS, Linux Mint, Elementary, etc.) you can do it with nothing more than `yt-dlp` (the modern fork of `youtube-dl`) and a couple of classic GNU utilities. In a minute or two you’ll have a tidy, searchable plain-text transcript (`<ID>.txt`) sitting in your working directory.

Prerequisites
1 Install yt-dlp system-wide
2 Download the auto-generated captions as SRT
3 Strip index numbers and timecodes
4 (Optional) Extra polish
5 Enjoy your transcript
- TL;DR (copy-paste cheat sheet)

Prerequisites

# Make sure your package list is fresh
sudo apt update

# (Optional but recommended) ffmpeg lets yt-dlp merge streams and convert subtitle formats;
# jq is not used below but is handy for inspecting JSON output
sudo apt install ffmpeg jq

1 Install yt-dlp system-wide

sudo curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp \
  -o /usr/local/bin/yt-dlp && sudo chmod a+rx /usr/local/bin/yt-dlp

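A single executable is dropped into /usr/local/bin/. This release file is a Python zipapp, so it needs python3 on the system, which Debian and Ubuntu desktop installs normally include. To upgrade later, rerun the same command or run sudo yt-dlp -U.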
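Before moving on, it can help to confirm that everything this guide relies on is actually on your PATH. The sketch below is one way to do that; `check_tools` is a hypothetical helper name made up for this guide, not part of yt-dlp.

```shell
#!/bin/sh
# check_tools: print the name of every listed command that is NOT on PATH.
# (Hypothetical helper for this guide; not part of yt-dlp.)
check_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report which of this guide's tools are still missing, if any.
missing=$(check_tools yt-dlp ffmpeg grep sed awk)
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing"
else
  echo "All tools found."
fi
```

If yt-dlp shows up as missing even though the download succeeded, check that /usr/local/bin is on your PATH.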

2 Download the auto-generated captions as SRT

Pick any video ID—here we’ll use knAGgxzYqw8 (Chase Hughes on What is Money?).

VIDEO="knAGgxzYqw8"  # change this to your target ID

yt-dlp --skip-download \
       --write-auto-sub \
       --sub-lang en \
       --sub-format srt \
       -o "${VIDEO}.%(ext)s" \
       "https://youtu.be/${VIDEO}"
# Result: knAGgxzYqw8.en.srt

Flags explained

Flag	Purpose
`--skip-download`	ignore the actual video, we only want captions
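`--write-auto-sub`	download YouTube’s auto-generated captions (add `--write-subs` to also fetch uploader-provided ones)
`--sub-lang en`	grab English only (adjust if you need another language)
`--sub-format srt`	prefer SRT, the simplest format to strip; if the file arrives as `.vtt` instead, add `--convert-subs srt` (needs ffmpeg)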
`-o`	sets a predictable filename: `<id>.en.srt`
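Step 2 expects a bare video ID in `VIDEO`. If you only have a full link copied from the address bar, a small helper can pull the ID out. This is a minimal sketch that assumes the two most common URL shapes (`youtu.be/<id>` and `youtube.com/watch?v=<id>`) and an 11-character ID; `video_id` is a hypothetical name, and other URL forms are passed through unchanged.

```shell
#!/bin/sh
# video_id: print the 11-character ID from common YouTube URL shapes,
# or the input unchanged if it matches neither shape (e.g. a bare ID).
# Uses sed -E (extended regex), available in GNU sed on Debian/Ubuntu.
video_id() {
  printf '%s\n' "$1" | sed -E \
    -e 's#^https?://youtu\.be/([A-Za-z0-9_-]{11}).*#\1#' \
    -e 's#^https?://(www\.|m\.)?youtube\.com/watch\?(.*&)?v=([A-Za-z0-9_-]{11}).*#\3#'
}

VIDEO=$(video_id "https://www.youtube.com/watch?v=knAGgxzYqw8&t=42s")
echo "$VIDEO"   # prints knAGgxzYqw8
```

With `VIDEO` set this way, the yt-dlp command from step 2 can be used as written.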

3 Strip index numbers and timecodes

grep -vE '^[0-9]+$|^[0-9]{2}:' "${VIDEO}.en.srt" \
  | sed '/^[[:space:]]*$/d' \
  > "${VIDEO}.txt"

Breakdown

grep -vE '^[0-9]+$|^[0-9]{2}:'
- Removes the cue index numbers (e.g. 3911) and any line that begins with a timestamp (02:33:40,800 --> …).
sed '/^[[:space:]]*$/d'
- Deletes leftover blank lines.
Output is redirected to <ID>.txt—in our example: knAGgxzYqw8.txt.
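To see what the filter does without downloading anything, you can run it on a tiny hand-made SRT sample. The sketch below wraps the same grep and sed pair in a function that reads standard input; `strip_srt` is a hypothetical name and the caption text is made up.

```shell
#!/bin/sh
# strip_srt: read SRT on stdin, write only the caption text to stdout.
# Same grep/sed pair as step 3, wrapped in a (hypothetical) helper.
strip_srt() {
  grep -vE '^[0-9]+$|^[0-9]{2}:' | sed '/^[[:space:]]*$/d'
}

# Two made-up cues: index line, timecode line, text line, blank separator.
printf '%s\n' \
  '1' \
  '00:00:00,000 --> 00:00:02,500' \
  'hello and welcome' \
  '' \
  '2' \
  '00:00:02,500 --> 00:00:05,000' \
  'to the show' \
  | strip_srt
# prints:
#   hello and welcome
#   to the show
```

One caveat of this pattern: a caption line that consists only of digits (say, a spoken year on its own line) looks like an index number and is removed as well.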

4 (Optional) Extra polish

Remove bracketed stage cues such as [Music] or [Applause] and collapse back-to-back duplicates:

grep -v '^\[.*\]$' "${VIDEO}.txt" \
  | awk 'prev != $0 {print} {prev=$0}' \
  > "${VIDEO}_clean.txt"

mv "${VIDEO}_clean.txt" "${VIDEO}.txt"
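The effect of the polish step is easiest to see on a few sample lines. This sketch reuses the same grep and awk commands inside a function reading standard input; `polish` is a hypothetical name and the input lines are made up.

```shell
#!/bin/sh
# polish: drop lines that are entirely a [bracketed cue], then collapse
# consecutive duplicate lines. (Hypothetical helper; same commands as step 4.)
polish() {
  grep -v '^\[.*\]$' | awk 'prev != $0 {print} {prev=$0}'
}

printf '%s\n' \
  '[Music]' \
  'hello and welcome' \
  'hello and welcome' \
  'to the show' \
  '[Applause]' \
  | polish
# prints:
#   hello and welcome
#   to the show
```

Only lines that are nothing but a bracketed cue are dropped; a cue embedded in a sentence, such as "thanks [Applause] everyone", is left alone. Likewise, only back-to-back repeats are collapsed, so a phrase that recurs later in the video is kept.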

5 Enjoy your transcript

less "${VIDEO}.txt"          # page through
grep -i "keyword" "${VIDEO}.txt"   # quick search

You now have a plain-text file ready for note-taking, quoting, or feeding into your favorite AI summarizer—no browser or third-party web services required.
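Auto-generated captions come out as many short line fragments. If you would rather have flowing text for reading or pasting into another tool, one possible approach is to join the lines and re-wrap them at a fixed width. The sketch below assumes an 80-column default; `reflow` is a hypothetical helper name.

```shell
#!/bin/sh
# reflow: join all input lines into one stream of words, squeeze repeated
# spaces, then wrap at word boundaries. Width defaults to 80 columns.
# (Hypothetical helper; tr, sed and fold are standard utilities.)
reflow() {
  tr '\n' ' ' | sed 's/  */ /g; s/ $//' | fold -s -w "${1:-80}"
  echo   # finish the last line with a newline
}

printf '%s\n' 'hello and welcome' 'to the show' | reflow
# prints: hello and welcome to the show
```

On a real transcript this would be used as `reflow < "${VIDEO}.txt" > "${VIDEO}_wrapped.txt"`, with the output filename being just an example.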

TL;DR (copy-paste cheat sheet)

sudo apt update && sudo apt install ffmpeg jq -y
sudo curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp \
  -o /usr/local/bin/yt-dlp && sudo chmod a+rx /usr/local/bin/yt-dlp

VIDEO="knAGgxzYqw8"                                      # video ID
yt-dlp --skip-download --write-auto-sub --sub-lang en \
       --sub-format srt -o "${VIDEO}.%(ext)s" \
       "https://youtu.be/${VIDEO}"

grep -vE '^[0-9]+$|^[0-9]{2}:' "${VIDEO}.en.srt" \
  | sed '/^[[:space:]]*$/d' \
  | grep -v '^\[.*\]$' \
  | awk 'prev != $0 {print} {prev=$0}' \
  > "${VIDEO}.txt"

less "${VIDEO}.txt"

Happy transcribing!

Adam Malin

adammalin.com

npub15jnttpymeytm80hatjqcvhhqhzrhx6gxp8pq0wn93rhnu8s9h9dsha32lx

#youtube #transcript #llm #ai #text #learning
