Transcription vs Captions vs Subtitles: What's the Difference?
Transcript, captions, and subtitles explained: how they differ in timing, sound cues, language, and file format — and how to produce each from one recording.
The words "transcript," "captions," and "subtitles" get used interchangeably, but they are three distinct things. The fastest way to keep them straight:
A transcript is the full text of spoken audio with no timing — a document you read. Captions are that same text split into short, time-synced segments shown over video, including sound cues like [music], in the same language, for accessibility. Subtitles are time-synced dialogue translated for viewers who can hear but don't speak the language.
Get those three sentences right and everything else falls into place.
The core distinctions
The differences come down to four questions: Is it timed? Does it include sound cues? What language is it in? And who is it for?
| | Transcript | Closed captions | Subtitles |
|---|---|---|---|
| Timing | None — continuous text | Time-synced to audio | Time-synced to audio |
| Sound cues | No | Yes — [music], [applause] | No — dialogue only |
| Language | Same as audio | Same as audio | Usually translated |
| File format | TXT, DOCX | SRT, VTT | SRT, VTT |
| Primary use | Reading, search, records | Accessibility (deaf/HoH) | Translation for viewers |
The single biggest divide is timing. A transcript is a standalone text. Captions and subtitles are timed text — each line is paired with a start and end timestamp so it appears in sync with the video.
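To make "timed text" concrete, here is a minimal sketch of what a caption track looks like as data. The `Cue` class and `active_text` function are hypothetical names used only for illustration; they are not part of any particular player or tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float    # seconds; the cue leaves the screen at this time
    text: str


def active_text(cues: List[Cue], t: float) -> Optional[str]:
    """Return the text that should be on screen at time t, if any."""
    for cue in cues:
        if cue.start <= t < cue.end:
            return cue.text
    return None


cues = [
    Cue(0.0, 2.5, "Welcome back to the show."),
    Cue(2.5, 5.0, "Today we're talking about captions."),
]

# A transcript ignores the timing and is just the words in order.
transcript = " ".join(cue.text for cue in cues)
```

A player does roughly what `active_text` does on every frame, while a transcript only ever needs the joined string.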
The second divide is purpose. Captions exist for people who cannot hear the audio, so they describe non-speech sounds. Subtitles assume the viewer can hear perfectly well but doesn't understand the language, so they translate only the dialogue.
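As a rough illustration of that difference, the sketch below turns caption text into dialogue-only text by dropping sound cues. It assumes cues are written in square brackets, as in the examples in this article; captions that use parentheses or speaker labels would need extra handling. The function name `dialogue_only` is made up for this example.

```python
import re

# Matches a bracketed cue such as [music] or [door slams], plus trailing spaces.
SOUND_CUE = re.compile(r"\[[^\]]*\]\s*")


def dialogue_only(caption_text: str) -> str:
    """Remove bracketed sound cues, keeping only the spoken words."""
    return SOUND_CUE.sub("", caption_text).strip()
```

For example, `dialogue_only("[phone rings] Hello?")` leaves just `Hello?`, which is the text a subtitle translator would start from.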
Captions vs subtitles, more precisely
This is the pair people confuse most, and the distinction is a US/UK terminology split as much as a functional one.
In US usage, captions are same-language and include sound effects, speaker labels, and tone — built for deaf and hard-of-hearing audiences. Subtitles are translations of dialogue for hearing viewers. In British and European usage, "subtitles" often covers both, and accessibility versions are called "subtitles for the deaf and hard of hearing" (SDH).
Functionally, remember it this way: if the text tells you a phone is ringing offscreen, it's captioning. If it's translating French speech into English, it's subtitling.
Closed vs open captions
Captions come in two delivery styles:
Closed captions live in a separate file (SRT or VTT) that rides alongside the video. Viewers toggle them on or off, and platforms like YouTube can restyle or hide them. This is the flexible, accessibility-standard default.
Open captions are burned into the video pixels themselves. They're always visible and cannot be turned off — useful for social feeds that autoplay muted, where you can't rely on a viewer enabling captions. The trade-off: you can't restyle them, translate them, or turn them off later without re-rendering the video.
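One common way to produce open captions is ffmpeg's `subtitles` video filter, which draws a caption file onto every frame. The sketch below only builds the command; it assumes ffmpeg is installed with subtitle rendering support (libass) and that the file names are simple, since paths with special characters need extra escaping inside the filter. The file names and the `burn_in_command` helper are placeholders.

```python
import shlex
from typing import List


def burn_in_command(video: str, captions: str, output: str) -> List[str]:
    """Build an ffmpeg command that renders captions into the video frames."""
    return [
        "ffmpeg",
        "-i", video,
        "-vf", f"subtitles={captions}",  # draw the caption file onto each frame
        "-c:a", "copy",                  # leave the audio stream untouched
        output,
    ]


cmd = burn_in_command("talk.mp4", "talk.srt", "talk_captioned.mp4")
print(shlex.join(cmd))
```

Because the video stream is re-encoded, this is the step you would have to repeat to change or remove the captions later.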
File formats: TXT vs SRT vs VTT
Format follows function:
- TXT / DOCX — plain transcript text, no timestamps. Read it, search it, paste it into a report.
- SRT (SubRip) — the universal timed-text format. Numbered segments, each with a start/end timestamp and a line or two of text. Works almost everywhere.
- VTT (WebVTT) — the web-native equivalent, used by HTML5 video and platforms like Microsoft Teams. Adds styling and positioning options on top of SRT's basics.
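The two timed-text formats are close enough that a basic conversion is mostly mechanical: WebVTT needs a `WEBVTT` header line and uses a dot instead of a comma before the milliseconds. The sketch below covers only that simple case and assumes a well-formed SRT input; it does not attempt styling or positioning. `srt_to_vtt` is an illustrative name, not a library function.

```python
import re

# SRT timestamps look like 00:00:01,000; WebVTT writes 00:00:01.000
SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to a basic WebVTT document."""
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:  # only touch timing lines, never the caption text
            line = SRT_TIMESTAMP.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"
```

The SRT segment numbers are left in place, since WebVTT accepts them as optional cue identifiers.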
A transcript file and a caption file can contain the exact same words. The only difference is whether timestamps are attached.
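Going the other way shows the same point: strip the numbering and timestamps from a caption file and what remains is transcript text. This is a minimal sketch that assumes a well-formed SRT file in which each segment is a number line, a timing line, and then one or more text lines. `srt_to_transcript` is a hypothetical helper.

```python
import re


def srt_to_transcript(srt_text: str) -> str:
    """Drop segment numbers and timestamps, keeping only the caption text."""
    text = srt_text.replace("\r\n", "\n").strip()
    parts = []
    for block in re.split(r"\n\s*\n", text):  # blank lines separate segments
        lines = block.split("\n")
        # lines[0] is the segment number, lines[1] is the timing line
        parts.extend(line.strip() for line in lines[2:] if line.strip())
    return " ".join(parts)
```

Any sound cues in the captions would survive this step, so a clean transcript may also need them removed.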
Produce all three from one recording
Here's the practical part: you don't create these separately. You transcribe once and derive the rest.
- Transcribe the recording. Upload your audio or video and let the AI produce the text. With TranscribTxt running ElevenLabs Scribe, this is the single accurate pass everything else builds on. Export it as TXT and you have your transcript.
- Export timed text for captions. Export the same result as SRT (or VTT for web and Teams) and you have closed captions in the original language. If you'd rather have them baked into the file, see the video captions generator and how to add subtitles automatically for the burn-in workflow. Going from a raw video file is covered in MP4 to SRT subtitles.
- Translate for subtitles. Take that timed text and translate the dialogue into your target language. Because TranscribTxt supports 99 languages, you can translate and transcribe to generate subtitle tracks for an international audience from the same source recording.
One transcription, three deliverables: a readable document, an accessibility caption track, and translated subtitles.
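The "transcribe once, derive the rest" idea can be sketched in a few functions. Everything here is an assumption made for illustration: the `(start, end, text)` segment shape is not TranscribTxt's actual export format, and the `translate` argument stands in for whatever translation step you use.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_transcript(segments: List[Segment]) -> str:
    """Deliverable 1: untimed text for reading and search."""
    return " ".join(text for _, _, text in segments)


def to_srt(segments: List[Segment]) -> str:
    """Deliverable 2: same-language timed text for captions."""
    cues = [
        f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}"
        for i, (start, end, text) in enumerate(segments, start=1)
    ]
    return "\n\n".join(cues) + "\n"


def to_subtitles(segments: List[Segment], translate: Callable[[str], str]) -> str:
    """Deliverable 3: the same timing with translated text."""
    return to_srt([(start, end, translate(text)) for start, end, text in segments])


segments = [(0.0, 2.5, "Hello."), (2.5, 5.0, "Thank you for watching.")]

# Stand-in translator for the example; a real one would call a translation service.
lookup = {"Hello.": "Bonjour.", "Thank you for watching.": "Merci d'avoir regardé."}
subtitles = to_subtitles(segments, lambda text: lookup[text])
```

The timing is computed once and reused, which is why the captions and the translated subtitles stay in sync with each other.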
Why it matters: accessibility and SEO
Getting this right isn't pedantry. Captions are an accessibility requirement: they make video usable for the roughly 1 in 5 people who have some degree of hearing loss, and they're what lets social video land when it plays on mute, which by an often-cited estimate is around 85% of the time. Subtitles expand reach to audiences who don't speak your language. And the transcript is what search engines and AI tools actually read: a plain-text version of your video they can index, quote, and surface. Publishing a transcript alongside a video is one of the simplest SEO wins available.
So pick deliberately: transcript to be read and found, captions to be accessible, subtitles to be understood across languages.
The good news is you only have to do the hard part once. TranscribTxt's free plan covers 5 files a month with no card required, and Pro is $12/mo for 1,200 minutes — audio is deleted after transcription. Start with a transcript, then export the captions and subtitles you need from it.
Frequently Asked Questions
What is the difference between transcription and captions?
A transcript is the full text of spoken audio with no timing — a continuous document you read on its own. Captions are that same text broken into short, time-synced segments displayed over video, including sound cues like [music] or [applause]. Transcripts are for reading; captions are for watching with the audio off.
What is the difference between captions and subtitles?
Captions are in the same language as the audio and include non-speech sound cues such as [door slams], designed for deaf and hard-of-hearing viewers. Subtitles assume the viewer can hear and only render dialogue, usually translated into another language. Both are time-synced; captions serve accessibility, subtitles serve translation.
What is the difference between open and closed captions?
Closed captions are stored in a separate file (SRT or VTT) that viewers can turn on or off, and platforms can style or hide them. Open captions are burned directly into the video pixels and are always visible — they cannot be switched off. Closed captions are more flexible; open captions guarantee display everywhere.
What file format do captions and subtitles use?
Captions and subtitles use timed-text formats — most commonly SRT (SubRip) and VTT (WebVTT). Both contain text segments paired with start and end timestamps. Transcripts, by contrast, are saved as plain TXT or DOCX with no timing. TranscribTxt exports TXT for transcripts and SRT for captions and subtitles.
Can I create all three from one recording?
Yes. Transcribe the recording once to get a transcript (TXT), then export the same result as SRT or VTT to use as captions, and translate that timed text into another language to produce subtitles. One transcription pass supplies the source text for all three deliverables.