Comparison | 8 min read | 2026-06-28

Transcription Accuracy Test 2026: TranscribTxt vs Whisper (real WER data)

We ran our own word-error-rate benchmark on LibriSpeech test-clean — TranscribTxt (ElevenLabs Scribe v2) vs four sizes of OpenAI Whisper. Real numbers, full method, reproducible.

We ran our own accuracy benchmark instead of quoting a marketing number. On 100 clean-speech utterances from the standard LibriSpeech test-clean set, TranscribTxt (powered by ElevenLabs Scribe v2) scored a 2.08% word error rate (97.92% word accuracy), the lowest of any engine tested, including all four sizes of OpenAI Whisper. Here is exactly how we measured it, so you can reproduce it.

Figure: Word Error Rate by transcription engine on LibriSpeech test-clean. TranscribTxt 2.08%, Whisper medium 2.89%, large-v3 3.41%, small 3.60%, base 5.31%.

The results

Engine	Word Error Rate	Word accuracy
TranscribTxt (ElevenLabs Scribe v2)	2.08%	97.92%
OpenAI Whisper medium	2.89%	97.11%
OpenAI Whisper large-v3	3.41%	96.59%
OpenAI Whisper small	3.60%	96.40%
OpenAI Whisper base	5.31%	94.69%

Lower WER is better. Across 100 utterances (2,111 reference words), TranscribTxt produced the fewest errors. The practical story is twofold. First, it is at least on par with the best open model: Whisper medium, the strongest open model in this test, trailed it by 0.81 percentage points. Second, it is clearly ahead of the small Whisper models (base, small) that quietly power a lot of "free" transcription tools: the gap to Whisper base is 3.23 points.
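To put the percentages in word terms, the sketch below converts each published WER into an approximate error count over the 2,111 reference words. The counts are derived from the rounded figures in the table, not taken from the benchmark logs, and the function names are ours for illustration.

```python
# Figures copied from the results table; error counts are derived, not measured.
REFERENCE_WORDS = 2111

wer_percent = {
    "TranscribTxt (ElevenLabs Scribe v2)": 2.08,
    "OpenAI Whisper medium": 2.89,
    "OpenAI Whisper large-v3": 3.41,
    "OpenAI Whisper small": 3.60,
    "OpenAI Whisper base": 5.31,
}

def implied_errors(wer_pct, reference_words=REFERENCE_WORDS):
    """Approximate error count implied by a rounded WER percentage."""
    return round(wer_pct / 100 * reference_words)

def word_accuracy(wer_pct):
    """Word accuracy as reported in the table: 100 minus WER."""
    return round(100 - wer_pct, 2)

for engine, wer in wer_percent.items():
    print(f"{engine}: ~{implied_errors(wer)} errors, {word_accuracy(wer):.2f}% accuracy")
```

Read this way, the gap between the top two engines is on the order of 17 words out of 2,111, which is why we describe the leading models as close.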

Why we ran our own test

Every transcription tool claims to be "99% accurate." Almost none show the math. After Google's 2026 shift toward answering questions directly, the pages that earn trust — from readers and from AI search — are the ones built on first-party evidence, not recycled claims. So rather than repeat a number, we measured one.

This is also the honest version of advice we already give in our guide to the most accurate transcription software: on clean audio the leading models are close, and your recording quality usually matters more than the brand on the box.

Method (so you can reproduce it)

Dataset: LibriSpeech test-clean — a public benchmark of real human readers with ground-truth transcripts. We selected 100 utterances deterministically (evenly spread across speakers), totaling 2,111 reference words.
Engines: TranscribTxt's production engine, ElevenLabs Scribe v2, called through the same API the app uses; and OpenAI Whisper in four sizes (base, small, medium, large-v3) via faster-whisper with int8 quantization and beam size 5.
Metric: Word Error Rate computed with the jiwer library, after lowercasing, stripping punctuation, and collapsing whitespace, a common normalization for ASR evaluation.
No cherry-picking: the utterance selection is fixed and the same audio went to every engine.
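The selection and normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the benchmark script itself: the round-robin rule in `select_utterances` is one assumed way to spread a fixed sample evenly across speakers, and `normalize` uses only the standard library rather than jiwer's own transforms. It assumes LibriSpeech-style IDs of the form speaker-chapter-utterance.

```python
import re
import string
from collections import defaultdict
from itertools import zip_longest

def select_utterances(utterance_ids, n=100):
    """Pick n IDs deterministically, cycling through speakers in sorted order.

    Hypothetical selection rule; assumes IDs like "1089-134686-0000".
    """
    by_speaker = defaultdict(list)
    for utt_id in sorted(utterance_ids):
        by_speaker[utt_id.split("-")[0]].append(utt_id)
    # Round-robin: the first utterance of every speaker, then the second, and so on.
    rounds = zip_longest(*(by_speaker[s] for s in sorted(by_speaker)))
    ordered = [u for rnd in rounds for u in rnd if u is not None]
    return ordered[:n]

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  Hello,   World! "))
```

Because the input is sorted before grouping, the same ID list always yields the same sample. Note that this normalization also removes apostrophes and hyphens, so "don't" becomes "dont" in both reference and hypothesis; what matters is that the two sides are treated identically.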

Reading the numbers honestly

A few things worth stating plainly:

All of these are good. On clean read speech, every engine here lands between 94.7% and 97.9% accuracy. The gap between the top two engines is 0.81 percentage points, which on 2,111 reference words works out to roughly 17 words.
Medium beat large-v3. That looks backwards, but it is real: on short, clean clips Whisper large-v3 sometimes hallucinates or repeats words, nudging its error rate above the steadier medium model. Bigger is not always more accurate.
Clean audio is close to a best case. Background noise, multiple speakers, strong accents, and weak microphones make every engine worse. This benchmark measures the ceiling, not your podcast recorded in a café.
It is a 100-utterance sample. Big enough to be directional, not a peer-reviewed study. The honest claim is "TranscribTxt led our test and sits with the best models," not "TranscribTxt wins every recording."

What this means for you

If you want top-tier accuracy without installing anything, a hosted tool on Scribe v2 (like TranscribTxt) gives you the most accurate engine in this test with zero setup. If you have the technical skills and time, Whisper is free and excellent: medium was the best open model in this test, with large-v3 close behind. Just expect setup work and, with large-v3, the occasional hallucination on clean clips.

Either way, test your own audio. Take a representative 2–3 minute clip, run it through TranscribTxt and one alternative, and read both transcripts against what was actually said. The most accurate tool for clean studio audio is not always the best for noisy field recordings — your sample is the only benchmark that matters for your work.

Frequently Asked Questions

How accurate is TranscribTxt?

In our own benchmark on 100 LibriSpeech test-clean utterances (2,111 words), TranscribTxt — which runs on ElevenLabs Scribe v2 — scored a 2.08% word error rate, or 97.9% word accuracy. That was the lowest error rate of every engine we tested, including all four sizes of OpenAI Whisper. On clean, clearly recorded speech you can expect roughly 97–98% accuracy; noisy audio, heavy accents, and crosstalk lower that for every tool.

What is Word Error Rate (WER)?

Word Error Rate is the standard accuracy metric for speech-to-text. It counts the words a transcript gets wrong — substitutions, insertions, and deletions — divided by the number of words actually spoken. A 2% WER means 2 errors per 100 words, or 98% accuracy. Lower is better. It is measured against a known-correct reference transcript after normalizing case and punctuation.
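For readers who want to see the arithmetic, here is a minimal standard-library sketch of WER as a word-level edit distance. It is an illustration of the metric, not the jiwer implementation we used, and it pools errors across all utterances before dividing, which is one common way to report a corpus-level figure. It assumes the text has already been normalized and that references are non-empty.

```python
def word_errors(reference, hypothesis):
    """Return (minimum word edits, reference word count) for one utterance."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev holds the previous row of the edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1], len(ref)

def corpus_wer(pairs):
    """Total errors divided by total reference words over (reference, hypothesis) pairs."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return errors / words

pairs = [
    ("the cat sat on the mat", "the cat sat on mat"),
    ("hello world", "hello there world"),
]
print(f"WER: {corpus_wer(pairs):.2%}")
```

Because insertions count as errors but do not add reference words, WER can exceed 100% on a badly hallucinated transcript, which is why "accuracy = 100 minus WER" is a convenience rather than a strict percentage.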

Is TranscribTxt more accurate than Whisper?

In this test, yes: TranscribTxt (Scribe v2) at 2.08% WER beat every Whisper size, including medium (2.89%) and large-v3 (3.41%). But every engine we tested landed between 94.7% and 97.9% accuracy on clean audio, so the gap is small in absolute terms. The honest takeaway is that TranscribTxt is at least on par with the best open model and clearly ahead of the small Whisper models that power many free tools.

Why did Whisper medium beat Whisper large-v3 in your test?

On clean, short utterances, Whisper large-v3 is known to occasionally hallucinate extra words or repeat phrases, which raises its word error rate above the smaller, steadier medium model. This is a documented behavior, not a setup error — we used the same standard configuration (beam size 5, int8) for every Whisper size. It is a good reminder that bigger is not automatically more accurate.

Can I reproduce this benchmark?

Yes. We used the public LibriSpeech test-clean dataset, selected 100 utterances deterministically, transcribed each with ElevenLabs Scribe v2 (the engine behind TranscribTxt) and with faster-whisper (base, small, medium, large-v3, int8, beam 5), normalized case and punctuation, and computed WER with the jiwer library against the ground-truth transcripts. Anyone can run the same dataset through the same tools and check the numbers.
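As a starting point for the Whisper side, the configuration described above might look like the sketch below. It is a hedged illustration rather than our benchmark script: the helper names are hypothetical, `device="cpu"` is an assumption (the settings we state are only int8 and beam size 5), and the Scribe v2 call is left out because it depends on your ElevenLabs account setup.

```python
WHISPER_SIZES = ["base", "small", "medium", "large-v3"]

def transcribe_with_whisper(audio_path, size, device="cpu"):
    """Transcribe one file with faster-whisper using int8 and beam size 5."""
    # Imported lazily so this module loads even where faster-whisper is not installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(size, device=device, compute_type="int8")
    segments, _info = model.transcribe(audio_path, beam_size=5)
    # segments is a generator; consuming it is what actually runs the decoding.
    return " ".join(segment.text.strip() for segment in segments)

def transcribe_all_sizes(audio_path):
    """Run the same audio through every Whisper size in the comparison."""
    return {size: transcribe_with_whisper(audio_path, size) for size in WHISPER_SIZES}
```

For a full run you would load each model once and reuse it across all 100 utterances instead of reloading it per file, then normalize the outputs and score them against the LibriSpeech reference transcripts.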

Does this mean TranscribTxt will be 98% accurate on my audio?

Not necessarily. This test uses clean, single-speaker read speech, which is close to a best case. Real-world recordings with background noise, multiple speakers, accents, or poor microphones will be harder for every engine. The reliable way to know is to test your own audio: run a representative 2–3 minute clip and read the transcript against what was actually said.
