Guide | 8 min read | 2026-06-07

How to transcribe a focus group recording

Transcribe focus group recordings with AI speaker labels, set realistic accuracy expectations for crosstalk, and export clean transcripts for coding.

The quickest way to transcribe a focus group: upload the recording to an AI transcription tool that supports speaker labels (diarization), and export the result as TXT or JSON. Expect lower accuracy than a one-on-one interview because of crosstalk and overlapping speech. Good recording setup and a short review pass make the biggest difference.

Focus groups are the hardest case in transcription. You have six to ten people, side conversations, people talking over each other, and voices at different distances from the mic. This guide walks through how to set up the recording, run it through AI, and clean up the output so it is ready for coding.

Why focus groups are hard to transcribe

A one-on-one interview is the easy case: two distinct voices, usually taking turns, often close to a microphone. Focus groups break every one of those assumptions.

More speakers. Six to ten participants plus a moderator means more voices for the model to keep separate.
Overlapping speech and crosstalk. People interrupt, agree out loud, and hold side conversations. When two voices overlap, the model has to guess, and it often merges or drops words.
Distance from the mic. Participants sit around a table, so some are close and some are three meters away. Quieter, distant voices transcribe worse.
Similar voices. Diarization separates speakers by voice characteristics. Two participants with similar pitch and accent are easy to confuse.

Set realistic expectations. A clean, disciplined focus group can reach word accuracy in the low-to-mid 90s (percent). A noisy room with people talking over each other can fall into the 70s or 80s, with frequent speaker mix-ups. No AI tool is reliable on heavy crosstalk, so the audio you capture sets the ceiling on what any tool can do. For the full picture of what drives these numbers, see the AI transcription accuracy guide.
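To see where one of your own recordings lands in those ranges, one option is to hand-correct a short excerpt and score the AI draft against it. The sketch below assumes "word accuracy" means 1 minus the word error rate (word-level edit distance divided by the number of words in the corrected reference). That definition, the function name, and the sample sentences are illustrative assumptions, not something the tool reports.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, assuming WER = word-level edit distance / reference length.

    reference  : the hand-corrected excerpt (treated as ground truth)
    hypothesis : the AI draft for the same stretch of audio
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the draft
                curr[j - 1] + 1,     # extra word in the draft
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 1.0 - prev[-1] / len(ref)


# Illustrative excerpt: one wrong word ("to") and one dropped word ("you").
corrected = "I think the price is too high for what you get"
ai_draft = "I think the price is to high for what get"
print(f"word accuracy: {word_accuracy(corrected, ai_draft):.1%}")
```

The comparison is deliberately crude: it lowercases the text but keeps punctuation attached to words, so strip punctuation first if you want it ignored. A one-to-two minute excerpt from a lively part of the session gives a more honest number than the calm opening.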

Before you record: setup that massively improves the transcript

What you do before the session matters more than which tool you pick. A few habits dramatically raise accuracy.

One mic per person, if you can. Lavalier mics or a multi-channel recorder give each voice a clean, separate signal. This is the single biggest improvement for both accuracy and speaker separation.
If you only have one mic, place it well. Put a quality omnidirectional table mic at the center of the table, equidistant from participants. Avoid laptop built-in mics for groups larger than four.
The moderator names speakers out loud. When you say "Go ahead, Maria," the name appears in the transcript, which makes it far easier to map Speaker 2 to a real person later. Do this especially early in the session.
Set a one-at-a-time ground rule. Ask participants to avoid talking over each other and to pause before responding. Even partial discipline cuts overlap significantly.
Control the room. Close windows, turn off HVAC if loud, and choose a space with soft surfaces to reduce echo. Reverberation confuses both transcription and diarization.

These are the same fundamentals that apply to any multi-speaker recording. Our interview transcription guide and our guide on how to transcribe an interview cover the recording side in more depth.

Step by step: transcribing the recording

Once you have the file, the workflow is straightforward.

Step 1: Upload the recording.

Go to TranscribTxt and drag in your file. It accepts MP4, MOV, WebM, MP3, M4A, and WAV, plus YouTube or other URLs if your session was recorded to a video platform. If you recorded to multiple channels, upload the mixed-down file or, if your recording software can export one, the cleanest single combined track.

Step 2: Enable speaker labels (diarization).

Diarization is the feature that automatically detects when the voice changes and tags each line as Speaker 1, Speaker 2, and so on. On TranscribTxt, speaker labels are available on the Pro and Business plans. The free plan transcribes the audio but produces a continuous transcript without speaker separation, so for a focus group you will want Pro or Business. If you are new to the concept, our guide "Speaker diarization explained" covers how it works and where it struggles.

Language is detected automatically across 99 languages, so you do not need to set it manually.

Step 3: Let it process, then export.

TranscribTxt runs on ElevenLabs Scribe. A typical session transcribes in a few minutes. Export plain TXT for reading and for import into analysis software, JSON when you want word-level timestamps for coding and precise quote-pulling, or SRT if you need time-coded captions. Word-level timestamps are useful in qualitative work because you can jump straight back to the audio for any quote you want to verify.
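As a rough illustration of the quote-verification use, the sketch below looks up a phrase in word-level data and returns the time span to jump to in the audio. The JSON layout here (a "words" list whose items carry "text", "start", "end", and "speaker") is a hypothetical schema made up for the example. Check the field names in your actual export and adjust before relying on it.

```python
import json

# Hypothetical export shape, inlined so the example is self-contained.
SAMPLE = json.loads("""
{"words": [
  {"text": "Honestly,",  "start": 12.0, "end": 12.4, "speaker": "Speaker 2"},
  {"text": "it's",       "start": 12.4, "end": 12.6, "speaker": "Speaker 2"},
  {"text": "too",        "start": 12.6, "end": 12.8, "speaker": "Speaker 2"},
  {"text": "expensive.", "start": 12.8, "end": 13.3, "speaker": "Speaker 2"}
]}
""")


def _norm(token: str) -> str:
    """Lowercase and trim surrounding punctuation so 'Expensive.' matches 'expensive'."""
    return token.lower().strip(".,!?;:\"'")


def find_quote(words, quote):
    """Return (start_seconds, end_seconds) of the first match of quote, or None."""
    target = [_norm(t) for t in quote.split()]
    tokens = [_norm(w["text"]) for w in words]
    n = len(target)
    if n == 0:
        return None
    # Slide a window of n words across the transcript and compare.
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            return words[i]["start"], words[i + n - 1]["end"]
    return None


span = find_quote(SAMPLE["words"], "it's too expensive")
print(span)
```

With a real export you would replace SAMPLE with json.load() on the downloaded file. The match is exact after normalization, so a quote that was edited during cleanup will not be found until the JSON and the cleaned text agree.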

Cleaning up the transcript and assigning real names

AI output is a draft, not a finished transcript. Plan a review pass, and budget more time than you would for an interview because of the speaker complexity.

Map speaker labels to real names. The tool cannot know who is who, so it gives you Speaker 1, Speaker 2, and so on. Use the moderator's spoken cues ("Thanks, James") and the opening introductions to figure out which label is which participant, then find-and-replace each label with the real name. Work out the mapping once, and a replace-all carries it through the whole transcript.
Fix crosstalk sections. Where two people spoke at once, the model may have merged lines or assigned them to the wrong speaker. Spot-check these against the audio, especially around lively moments.
Correct proper nouns and jargon. Product names, brand terms, and participant names are common error spots. A quick find-and-replace handles repeated terms.
Watch for speaker drift. Over a long session, diarization can occasionally split one person into two labels or merge two into one. Skim for sudden label changes mid-thought.
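The label mapping and the drift check above can be scripted. This is a minimal sketch that assumes the TXT export puts a label such as "Speaker 3:" at the start of each turn; that layout, the function names, and the sample names are assumptions for illustration. The regex matches the whole number, so replacing "Speaker 1" does not also rewrite "Speaker 10", which a plain find-and-replace would do in a group of ten or more.

```python
import re
from collections import Counter

LABEL = re.compile(r"\bSpeaker (\d+)\b")


def rename_speakers(transcript: str, names: dict[str, str]) -> str:
    """Swap labels like 'Speaker 2' for real names; unmapped labels are left alone."""
    def swap(match: re.Match) -> str:
        label = match.group(0)          # e.g. "Speaker 10", never a partial "Speaker 1"
        return names.get(label, label)
    return LABEL.sub(swap, transcript)


def count_turns(transcript: str) -> Counter:
    """Count turns per label, using labels that open a line and end with a colon."""
    return Counter(re.findall(r"^(Speaker \d+):", transcript, flags=re.MULTILINE))


raw = (
    "Speaker 1: Go ahead, Maria.\n"
    "Speaker 10: I liked it.\n"
    "Speaker 3: Same here.\n"
    "Speaker 10: Mostly.\n"
)

# More labels than people in the room, or a label with only a turn or two,
# is a hint that diarization split one person into two.
print(count_turns(raw))

print(rename_speakers(raw, {"Speaker 1": "Moderator", "Speaker 10": "Maria"}))
```

Run the turn count before renaming: once labels become names the drift is harder to spot. The count is only a hint, so confirm against the audio before merging two labels.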

A focused 20-to-40-minute cleanup pass on a 60-minute session is normal, and it is still a fraction of the four to six hours that manual transcription would take.

Handing off to analysis software

Once names are assigned and crosstalk is cleaned, the transcript is ready for coding.

NVivo and Atlas.ti: Import the plain TXT export. The speaker names carry through as text, so you can code by participant. Most qualitative analysis tools read clean TXT without any conversion.
Timestamped workflows: Use the JSON export when you want to link codes back to exact moments in the audio, or to build a quote bank with precise time references.
Spreadsheets and manual coding: TXT pastes cleanly into a sheet if you prefer to code line by line.
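For the spreadsheet route, a small script can turn the cleaned transcript into one row per turn with an empty column for codes. The sketch assumes each turn is a single line in the form "Name: what they said"; that layout and the column names are assumptions, so adapt them to what your export actually looks like.

```python
import csv
import io


def transcript_to_csv(transcript: str) -> str:
    """Convert 'Name: text' lines into CSV rows: turn, speaker, text, code."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["turn", "speaker", "text", "code"])
    turn = 0
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between turns
        speaker, sep, text = line.partition(":")
        if not sep:
            # No colon at all: keep the line, leave the speaker blank.
            speaker, text = "", line
        turn += 1
        writer.writerow([turn, speaker.strip(), text.strip(), ""])
    return out.getvalue()


cleaned = "Maria: I liked it, mostly.\n\nJames: Too pricey for me.\n"
print(transcript_to_csv(cleaned))
```

The csv module handles quoting, so commas and quotation marks inside what someone said do not break the columns. One limitation: a line with no speaker label but a colon in the text (a time such as 3:30, for example) would be split in the wrong place, so skim the speaker column after import.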

For a broader look at tools built around this kind of workflow, see the best transcription software for researchers in 2026.

Try it on your next session

The free plan gives you 5 files per month with no credit card, which is enough to test the transcription quality on a sample of your audio. For multi-speaker focus groups you will want speaker labels, which start on Pro ($12/month, 1,200 minutes) and Business ($29/month, 6,000 minutes). Files are deleted after transcription, which matters when your recordings contain participant data.

Capture the cleanest audio you can, enable speaker labels, and budget a short cleanup pass. Do that, and a focus group that would have taken a day to transcribe by hand is ready for analysis the same afternoon. Try it free.

Frequently Asked Questions

What's the best way to transcribe a focus group?

Record with the best audio you can capture, ideally one mic per participant or a quality table mic, then upload the file to an AI transcription tool with speaker labels (diarization) enabled. Export to TXT or JSON with timestamps, then do a review pass to fix speaker assignments and crosstalk before analysis.

How do you transcribe multiple speakers?

Use a transcription tool with speaker diarization, which automatically detects voice changes and tags each line as Speaker 1, Speaker 2, and so on. The tool cannot know real names, so you map labels to participants afterward. Diarization works best when speakers talk one at a time and each voice is distinct.

How accurate is AI on focus group audio?

Lower than on one-on-one interviews. Clean focus group audio with disciplined turn-taking can reach word accuracy in the low-to-mid 90s (percent), but heavy crosstalk, distant speakers, and overlapping speech can drop it into the 70s or 80s. No AI tool is reliable on heavy overlap, so recording quality matters most.

Does TranscribTxt label who said what in a focus group?

Yes. Speaker labels (diarization) are available on the Pro and Business plans. The tool tags each line by speaker automatically. The free plan transcribes the audio but does not separate speakers, so for multi-person focus groups you will want Pro or Business.

Can I export a focus group transcript for NVivo or Atlas.ti?

Yes. Export plain TXT for import into most qualitative analysis software, including NVivo and Atlas.ti. For timestamped or word-level work, export JSON. SRT is also available if you need time-coded captions. The speaker labels carry into the exported text for coding.
