Guide · 9 min read · 2026-05-24

How to transcribe an interview: complete guide (2026)

Step-by-step guide to transcribing interviews accurately, manual vs AI methods, speaker labeling, formatting tips, and best tools for journalists, researchers, and podcasters.

Transcribing an interview turns a conversation into a searchable, quotable, shareable document. Whether you're a journalist filing a story, a researcher coding qualitative data, or a podcaster repurposing episodes into blog posts, a clean transcript is one of the most useful assets you can produce from an interview.

This guide covers manual vs AI transcription, accuracy tips, how to format a transcript with speaker labels, and which tools work best for different use cases.

Manual vs AI Transcription: Which Should You Use?

Manual transcription

Manual transcription means listening to the audio and typing everything you hear, either yourself or through a professional transcription service like Rev's human tier.

When manual is worth it:

Legal depositions, court proceedings, or medical records where 100% accuracy is required
Audio with extremely heavy accents, thick jargon, or very poor recording quality
Interviews conducted in rare languages not covered by AI tools

The cost of manual transcription: A professional typist needs roughly 4–6 hours to transcribe 1 hour of audio. At $1.50/minute through a service like Rev, a 45-minute interview costs $67.50. Doing it yourself costs zero dollars but takes a full afternoon.
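
The arithmetic above is easy to rerun for your own interview length. The following is a minimal sketch; the function names are made up for illustration, and the defaults simply reuse the $1.50/minute rate and the 4–6 hours of typing per hour of audio quoted in this guide.

```python
def manual_cost_usd(audio_minutes: float, rate_per_minute: float = 1.50) -> float:
    """Price of a paid human transcript at a flat per-minute rate."""
    return round(audio_minutes * rate_per_minute, 2)


def diy_hours(audio_minutes: float, low: float = 4.0, high: float = 6.0) -> tuple[float, float]:
    """Range of typing hours, assuming 4-6 hours of work per hour of audio."""
    audio_hours = audio_minutes / 60
    return (audio_hours * low, audio_hours * high)


print(manual_cost_usd(45))  # 67.5
print(diy_hours(45))        # (3.0, 4.5)
```

For a 45-minute interview this gives $67.50 for a paid service, or roughly 3 to 4.5 hours of your own typing.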

AI transcription

AI transcription uses speech recognition models to convert audio to text automatically. The turnaround for a 45-minute interview is typically under 4 minutes.

When AI transcription is the right choice:

Journalism, content creation, podcasting, or research where 95–98% accuracy is sufficient
High volume, multiple interviews per week
Budget constraints or tight deadlines

The practical workflow for most people: run AI transcription first, then spend 15–20 minutes reviewing and correcting. This is faster and cheaper than either pure manual transcription or trusting AI output without review.

How to Prepare Your Audio for Better Accuracy

AI accuracy is largely determined by audio quality before the model even processes a word.

Recording tips

Use a dedicated microphone. Laptop built-ins pick up keyboard noise, room echo, and HVAC hum. A USB condenser mic or a lavalier mic costs $30–$80 and makes a measurable difference.
Record in a quiet space. Close doors, turn off fans and air conditioning, record away from windows facing busy streets.
Keep levels consistent. If you're interviewing via phone or video call, record both sides at similar volumes. A huge imbalance between tracks is the most common accuracy killer.
Avoid crosstalk. Brief interruptions are fine; sustained overlapping speech is where AI models consistently fail. Train yourself to pause before responding.

File format

Most AI transcription tools accept MP3, MP4, WAV, M4A, and OGG. WAV and FLAC preserve the most audio data, but MP3 at 128 kbps or higher is perfectly acceptable. Avoid heavily compressed voice memos if you can.
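
If you handle many recordings, a quick check of file extensions before uploading can save a failed upload. This is only a sketch: the extension sets mirror the formats listed in this guide, not any specific tool's documentation, and `describe_format` is a hypothetical helper name.

```python
from pathlib import Path

# Extensions taken from this guide's list; your tool's accepted formats may differ.
COMMON = {".mp3", ".mp4", ".wav", ".m4a", ".ogg"}
FULL_QUALITY = {".wav", ".flac"}


def describe_format(filename: str) -> str:
    """Classify a recording by its file extension (case-insensitive)."""
    ext = Path(filename).suffix.lower()
    if ext in FULL_QUALITY and ext in COMMON:
        return "commonly accepted, full quality"
    if ext in FULL_QUALITY:
        return "full quality, confirm your tool accepts it"
    if ext in COMMON:
        return "commonly accepted"
    return "not in the common list, convert or check tool docs"


print(describe_format("Interview.WAV"))  # commonly accepted, full quality
print(describe_format("call.mp3"))       # commonly accepted
```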

Step-by-Step: Transcribing an Interview with AI

Here is the workflow that produces the best results for a typical 30–60 minute interview.

Step 1: Upload the file

Go to TranscribTxt and drag your audio or video file into the upload zone. No account is required to start. Supported formats include MP4, MP3, WAV, M4A, and WEBM.

Step 2: Select language and enable speaker labels

Choose your interview language. If participants speak two languages, pick the dominant one. Enable speaker diarization if the option is available; it automatically identifies who spoke each line.

Step 3: Download the transcript

Processing typically completes in 2–5 minutes for a one-hour file. Download as TXT for editing or as SRT if you need subtitles for a video interview.
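
If you have never opened an SRT file, it helps to see how one is put together: each subtitle is a numbered block with a start and end time in `HH:MM:SS,mmm` form, followed by the text and a blank line. The sketch below builds that layout from a list of `(start_seconds, end_seconds, text)` tuples. That input shape is an assumption for illustration, not the export format of any particular tool.

```python
def srt_time(seconds: float) -> str:
    """Format a time in seconds as the HH:MM:SS,mmm stamp used in SRT files."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    # Joining with a newline leaves one blank line between subtitle blocks.
    return "\n".join(blocks)


print(srt_time(3661.5))  # 01:01:01,500
print(to_srt([(0.0, 2.5, "Can you start with your background?")]))
```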

Step 4: Review and correct

Open the transcript alongside the audio. Use a media player that lets you control speed (VLC at 0.75x is ideal for review). Focus on:

Proper nouns, names, and technical terms (these are where AI makes the most errors)
Speaker label assignments
Sentence boundaries where the speaker paused mid-thought

Plan for roughly 15 minutes of review per 30 minutes of audio.
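
Since proper nouns are where most corrections happen, it can help to build a checklist of them before you start listening. The following is a rough heuristic sketch, not a real named-entity recognizer: it counts capitalized words that do not begin a sentence, so it will miss names at the start of a sentence and may flag ordinary capitalized words.

```python
import re
from collections import Counter


def proper_noun_candidates(text: str) -> Counter:
    """Count capitalized words that do not start a sentence."""
    counts = Counter()
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.?!])\s+", text):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        for word in words[1:]:  # skip the sentence-initial word
            # Ignore the pronoun "I" and contractions such as "I've".
            if word[0].isupper() and word.split("'")[0] != "I":
                counts[word] += 1
    return counts


sample = "We met Maria Chen in Lisbon. She said I should call Maria again."
print(proper_noun_candidates(sample).most_common())
# [('Maria', 2), ('Chen', 1), ('Lisbon', 1)]
```

Work through the resulting list against the audio first, then use search and replace to fix each misspelled name everywhere at once.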

Formatting an Interview Transcript

A well-formatted transcript serves multiple purposes: it's easier to read, easier to quote from, and easier to archive.

Standard interview format

INTERVIEW: [Project name or article title]
DATE: [Recording date]
PARTICIPANTS: [Names and roles]
INTERVIEWER: [Your name]

[00:00:00]

INTERVIEWER: Can you start by telling me a bit about your background?

RESPONDENT: Sure. I've been working in urban planning for about fifteen years now, mostly focused on public transit infrastructure...

[00:05:00]
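
If your transcript exists as structured data rather than plain text, the body of the layout above can be generated rather than typed. This is a minimal sketch under stated assumptions: the input is a list of `(start_seconds, speaker, text)` tuples, the function names are hypothetical, and a timestamp is written at the first turn that starts in each new 5-minute window (using that turn's actual start time rather than a round number).

```python
def hms(seconds: float) -> str:
    """Format seconds as a bracketed [HH:MM:SS] marker."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"[{h:02d}:{m:02d}:{s:02d}]"


def format_transcript(turns, stamp_every: int = 300) -> str:
    """Render (start_seconds, speaker, text) turns with capitalized labels."""
    lines = []
    next_stamp = 0
    for start, speaker, text in turns:
        if start >= next_stamp:
            lines += [hms(start), ""]
            # Next marker is due at the following interval boundary.
            next_stamp = (start // stamp_every + 1) * stamp_every
        lines += [f"{speaker.upper()}: {text}", ""]
    return "\n".join(lines).rstrip() + "\n"


turns = [
    (0, "Interviewer", "Can you start by telling me a bit about your background?"),
    (40, "Respondent", "Sure. I've been working in urban planning for about fifteen years."),
    (305, "Interviewer", "What changed after the funding cut?"),
]
print(format_transcript(turns))
```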

Key formatting rules

Speaker labels in capitals. INTERVIEWER: and RESPONDENT: (or the person's name, e.g. MARIA CHEN:) on their own line, followed by the spoken text.

Timestamps every 2–5 minutes. Use [HH:MM:SS] format. This lets you jump back to the original audio for any quote.

Verbatim vs. clean verbatim. Verbatim includes every "um," "uh," and false start, necessary for linguistic research or legal use. Clean verbatim removes filler words and false starts while keeping the meaning intact. For journalism and podcasting, clean verbatim is standard.

Mark inaudible sections. Use [inaudible 00:23:45] when audio is too unclear to transcribe. Never guess.

Note non-verbal cues if relevant. [laughs], [pause], [papers shuffling] add context for qualitative researchers.
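
A first pass at clean verbatim can be automated. The sketch below is deliberately narrow: it only strips "um" and "uh" (plus a trailing comma), and it does not attempt to detect false starts, which still need a human read. The pattern and the function name are illustrative assumptions, not a standard.

```python
import re

# Matches "um", "umm", "uh", "uhh" as whole words, plus an optional trailing comma.
FILLERS = re.compile(r"\b(?:um+|uh+)\b,?\s*", re.IGNORECASE)


def clean_verbatim(text: str) -> str:
    """Remove simple filler words and tidy spacing and initial capitalization."""
    cleaned = FILLERS.sub("", text)
    cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
    return cleaned[:1].upper() + cleaned[1:]


print(clean_verbatim("Um, I think, uh, the budget was cut."))
# I think, the budget was cut.
```

Keep the verbatim original on file so you can confirm that no meaning was lost in the cleanup.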

Use Cases and What Each Needs

Journalism

Journalists need verbatim quotes they can attribute with confidence. The key requirement is accuracy on names, titles, and technical terms, exactly where AI makes mistakes. After AI transcription, search and correct all proper nouns before filing.

Timestamps are valuable for fact-checking: if an editor questions a quote, you can jump to the exact second in the recording.
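
To find where to start listening, you can look up the last timestamp that appears before a quote. This is a minimal sketch assuming the transcript uses bracketed `[HH:MM:SS]` markers and that the quote appears word for word; `seconds_before_quote` is a hypothetical helper name.

```python
import re

STAMP = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]")


def seconds_before_quote(transcript: str, quote: str):
    """Return the last [HH:MM:SS] marker before the quote, in seconds, or None."""
    position = transcript.find(quote)
    if position == -1:
        return None
    last = None
    # Only scan the part of the transcript that precedes the quote.
    for match in STAMP.finditer(transcript, 0, position):
        last = match
    if last is None:
        return None
    hours, minutes, seconds = (int(group) for group in last.groups())
    return hours * 3600 + minutes * 60 + seconds


text = "[00:00:00]\n\nINTERVIEWER: Hi.\n\n[00:05:00]\n\nRESPONDENT: Transit is underfunded.\n"
print(seconds_before_quote(text, "Transit is underfunded"))  # 300
```

The returned value is the number of seconds into the recording at which to start playback; the quote itself falls somewhere after that point.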

Academic research

Qualitative researchers often use transcripts for thematic coding, identifying patterns across dozens of interviews. For this use case, speaker labels are critical, and verbatim transcription (including false starts and filler words) may be required by your methodology.

Some IRB protocols require you to describe your transcription method. AI transcription is generally accepted when paired with human review.

Podcasting

Podcast transcripts serve three purposes simultaneously: SEO (search engines can index the text), accessibility (deaf and hard-of-hearing audiences can read rather than listen), and content repurposing (a 45-minute transcript can become 3–4 blog posts).

For podcasts, clean verbatim is standard. Timestamps are less critical unless you're using the transcript as show notes with chapter links.

Accuracy Tips for Two-Speaker and Multi-Speaker Interviews

Multi-speaker audio is the hardest case for any transcription tool.

Record separate tracks when possible. If you're recording a video call, tools like Craig (for Discord) or Riverside.fm record each participant on a separate audio track. Feed individual tracks to your transcription tool for dramatically better accuracy.
Start each track with a speaker identification. Have each participant say their name at the beginning: "This is Maria Chen." AI diarization uses vocal fingerprinting, so a brief, clear identification at the start improves assignment accuracy throughout.
Post-process diarization errors in blocks. AI tools sometimes flip speakers for a stretch and then correct. When reviewing, look for blocks where the voice clearly doesn't match the label, and reassign the entire block at once.
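
The block-reassignment tip can be sketched in a few lines if your transcript is stored as a list of `(speaker, text)` turns. That structure, the function name, and the index-based range are assumptions for illustration; in a plain text file you would do the same thing by hand or with a scoped search and replace.

```python
def swap_speakers(turns, first: int, last: int, a: str, b: str):
    """Return a copy of turns with labels a and b exchanged for indexes first..last."""
    fixed = []
    for index, (speaker, text) in enumerate(turns):
        if first <= index <= last:
            if speaker == a:
                speaker = b
            elif speaker == b:
                speaker = a
        fixed.append((speaker, text))
    return fixed


turns = [
    ("INTERVIEWER", "How did the project start?"),
    ("INTERVIEWER", "It began as a pilot in 2019."),    # mislabeled
    ("RESPONDENT", "And who funded the pilot?"),         # mislabeled
    ("RESPONDENT", "The city transit authority."),
]
for speaker, text in swap_speakers(turns, 1, 2, "INTERVIEWER", "RESPONDENT"):
    print(f"{speaker}: {text}")
```

Swapping within a range rather than across the whole file leaves the correctly labeled turns before and after the flipped block untouched.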

When to Use Human Transcription Instead

Despite how far AI has come, there are situations where human transcription is still the better choice:

Audio quality is poor. Background noise, phone audio, or overlapping speakers in a crowded room can push AI accuracy below 85%. A human transcriptionist can use context and judgment to fill in gaps.
High-stakes legal or medical use. Any document that will be used in court, submitted to a regulatory body, or included in medical records should be reviewed by a human professional.
Rare languages or heavy regional dialects. Most AI tools are optimized for major language variants. Regional dialects and minority languages can produce very poor results.

For most everyday interview transcription, AI with human review is the sweet spot: the speed and cost of automation, with the accuracy of a final human pass.

Frequently Asked Questions

How long does it take to transcribe a 1-hour interview?

Manual transcription takes 4–6 hours per hour of audio for an experienced typist. AI transcription tools like TranscribTxt process a 1-hour interview in 3–5 minutes. Even after editing the AI output for accuracy, the total time is usually around 35 minutes, based on roughly 15 minutes of review per 30 minutes of audio.

What is the best format for an interview transcript?

The standard format uses speaker labels (e.g. INTERVIEWER: / RESPONDENT:) on their own line, followed by the spoken text. Include timestamps every 2–5 minutes so readers can cross-reference the original recording. Add a header with the date, participants, and context.

How accurate is AI transcription for interviews?

Modern AI tools achieve 95–98% word accuracy on clean audio with a single speaker. Accuracy drops to 85–92% with two or more overlapping speakers, strong accents, or background noise. Always review the output before publishing or using in research.
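
If you want to check a tool's accuracy on your own audio, one common approach is to correct a short passage by hand and compare it with the raw AI output using word error rate (WER): the number of word substitutions, insertions, and deletions divided by the number of words in the corrected reference. Word accuracy is then roughly 1 minus WER. The sketch below is a plain-Python version under simple assumptions (lowercasing, whitespace splitting, punctuation left as is); it is not how any particular vendor computes its published figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


corrected = "the city council approved the transit plan"
ai_output = "the city counsel approved transit plan"
wer = word_error_rate(corrected, ai_output)
print(f"word accuracy: {1 - wer:.1%}")  # word accuracy: 71.4%
```

A one- or two-minute sample from a typical stretch of the interview is usually enough to tell whether a recording falls in the high or low end of the ranges above.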

Do I need speaker labels (diarization) for an interview transcript?

For any interview with two or more participants, speaker labels are essential. Without them, the transcript is difficult to follow and quotes are hard to attribute correctly. Most AI tools support diarization on paid plans; look for 'speaker identification' or 'speaker diarization' in the feature list.
