Guide · 6 min read · 2026-06-07

What Is Word Error Rate (WER)? The Standard Transcription Accuracy Metric

Word Error Rate (WER) measures transcription accuracy as the number of word errors divided by the number of words in the reference transcript. Learn the formula, a worked example, and what counts as a good WER.

Word Error Rate (WER) is the standard metric for measuring transcription accuracy. It is the total number of word errors—substitutions, deletions, and insertions—divided by the number of words in the correct reference transcript. Lower is better: a WER of 5% means roughly 95% of words were transcribed correctly.

If you have ever compared transcription tools, you have seen "99% accurate" claims. WER is the rigorous number behind those claims. Understanding it tells you what to actually expect from any AI transcription service.

The WER formula

WER is defined with a simple equation:

WER = (S + D + I) / N

Where:

S = substitutions (a word transcribed incorrectly)
D = deletions (a word missing from the transcript)
I = insertions (an extra word the model added)
N = total number of words in the reference transcript

Add up the three error types, divide by the number of reference words, and multiply by 100 for a percentage. Note that the denominator is the reference word count, not the transcript word count—which is why insertions can, in theory, push WER above 100%.
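As a minimal sketch of the formula, the Python function below computes WER from the three error counts. The function name `wer_from_counts` and its parameter names are illustrative choices, not part of any standard library.

```python
def wer_from_counts(substitutions: int, deletions: int,
                    insertions: int, reference_words: int) -> float:
    """Return WER as a fraction: (S + D + I) / N."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (substitutions + deletions + insertions) / reference_words


# Three errors against a 10-word reference.
print(f"{wer_from_counts(1, 1, 1, 10):.0%}")  # 30%

# Insertions are not bounded by N, so WER can exceed 100%:
# a 2-word reference with 3 extra words inserted.
print(f"{wer_from_counts(0, 0, 3, 2):.0%}")   # 150%
```

The second call shows why the denominator matters: the error count can grow past the reference word count.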

The three error types

Every word-level mistake falls into one of three buckets:

Substitution — the model heard the wrong word. Reference says "their," transcript says "there."
Deletion — the model skipped a word entirely. Reference says "the quick brown fox," transcript says "the brown fox."
Insertion — the model added a word that was not spoken. Reference says "let us go," transcript says "let us all go."

To count these consistently, the transcript is first aligned against the reference using minimum edit distance (Levenshtein distance), computed over whole words rather than characters. The alignment finds the smallest number of substitutions, deletions, and insertions needed to turn the reference word sequence into the transcript.
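The alignment step can be sketched in Python as a word-level edit-distance table followed by a walk back through it to label each edit. This is an illustrative implementation under simple assumptions (whitespace tokenization, exact string matching, every edit costing 1). The name `align_counts` is hypothetical, and production toolkits may break ties between equally cheap alignments differently.

```python
def align_counts(reference: str, hypothesis: str) -> dict:
    """Count S, D, and I in one minimum-cost word alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    rows, cols = len(ref) + 1, len(hyp) + 1

    # dist[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i reference words
    for j in range(cols):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner to label each edit.
    counts = {"S": 0, "D": 0, "I": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1  # words match, no error
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            counts["S"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["D"] += 1
            i -= 1
        else:
            counts["I"] += 1
            j -= 1
    return counts


print(align_counts("their", "there"))                       # {'S': 1, 'D': 0, 'I': 0}
print(align_counts("the quick brown fox", "the brown fox")) # {'S': 0, 'D': 1, 'I': 0}
print(align_counts("let us go", "let us all go"))           # {'S': 0, 'D': 0, 'I': 1}
```

The three calls reuse the substitution, deletion, and insertion examples from the list above.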

A worked example

Suppose the reference (correct) sentence has 10 words:

"The meeting is scheduled for nine on Tuesday next week"

And the AI produces:

"The meeting is scheduled at nine Tuesday the next week"

Here is the error breakdown:

Error type	Detail	Count
Substitution	"for" → "at"	1
Deletion	"on" dropped	1
Insertion	"the" added before "next"	1
Total errors (S + D + I)		3

With N = 10 reference words:

WER = (1 + 1 + 1) / 10 = 0.30 = 30%

That is a 30% WER—a deliberately rough example. Real transcripts of clean speech score far lower.
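To check the arithmetic, here is a short Python sketch that computes the total edit distance for this sentence pair and divides by the reference length. The function name `word_error_rate` is an illustrative choice, and the code assumes plain whitespace tokenization with no text normalization. It reports only the total: more than one minimum-cost alignment exists for this pair (three substitutions also cost 3), but the total, and therefore the WER, is the same.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the reference words processed
    # so far into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j - 1] + cost,  # match or substitution
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "The meeting is scheduled for nine on Tuesday next week"
hypothesis = "The meeting is scheduled at nine Tuesday the next week"
print(f"{word_error_rate(reference, hypothesis):.0%}")  # 30%
```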

WER vs. accuracy

The two terms describe the same thing from opposite directions:

Accuracy ≈ 100% − WER

A 4% WER is about 96% accuracy. Marketers prefer "accuracy" because a high percentage sounds reassuring, but WER is the more honest, measurable term. When a vendor quotes accuracy without naming the dataset or audio conditions, treat it as a soft estimate. For a deeper look at how these numbers are produced and gamed, see our AI transcription accuracy guide.

What counts as a good WER?

There is no single threshold, because WER depends heavily on the audio. As a rough guide:

2-5% WER — Excellent. Achievable by top models on clean, single-speaker, native-accent audio. This is near human-level.
5-10% WER — Good and usable for most professional needs with light editing.
10-20% WER — Noticeable errors. Common with background noise, accents, or overlapping speakers.
20%+ WER — Heavy editing required. Typical of poor recordings, heavy crosstalk, or specialized jargon.
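If you want to apply these bands in a script, a rough Python sketch might look like the following. The function name `describe_wer` is hypothetical, and two details are assumptions rather than part of the guide: values below 2% are treated as excellent, and a value that sits exactly on a boundary is assigned to the better band.

```python
def describe_wer(wer_percent: float) -> str:
    """Map a WER percentage to the rough quality bands listed above."""
    if wer_percent < 0:
        raise ValueError("WER cannot be negative")
    if wer_percent <= 5:
        return "Excellent"            # includes anything under 2% (assumption)
    if wer_percent <= 10:
        return "Good"
    if wer_percent <= 20:
        return "Noticeable errors"
    return "Heavy editing required"


for value in (3.0, 7.5, 14.0, 30.0):
    print(value, describe_wer(value))
```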

Even the best models degrade on hard audio. The low-end figures above assume favorable conditions, and all of the ranges are approximate. Model choice matters too: head-to-head comparisons such as Whisper vs. ElevenLabs Scribe, and reviews of how accurate Otter.ai is, show meaningful WER gaps on the same files.

The limits of WER

WER is the industry standard, but it has real blind spots:

It treats every word equally. Dropping a filler "um" counts exactly the same as botching a client's name or a dollar figure. In practice, those errors are not equally costly.
It ignores meaning. A substitution that preserves meaning ("can't" → "cannot") is penalized identically to one that reverses it ("can" → "can't").
It is sensitive to formatting. Differences in punctuation, capitalization, and number formatting (e.g., "twenty" vs. "20") can inflate WER unless the text is normalized first.
It says nothing about speaker labels or timestamps. A transcript can have a perfect WER and still mislabel who said what.
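The formatting sensitivity is easy to see in code. The Python sketch below shows one simple normalization pass (lowercasing and stripping punctuation while keeping apostrophes) applied before comparison. The rules and the name `normalize` are illustrative assumptions, not a standard; evaluation toolkits apply their own normalizers, and this version does not reconcile number formats such as "twenty" vs. "20".

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation (except apostrophes), split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


reference = "Okay, let's meet at nine."
hypothesis = "okay let's meet at nine"

# Without normalization, "Okay," vs "okay" and "nine." vs "nine"
# both count as word errors.
raw_mismatches = sum(
    r != h for r, h in zip(reference.split(), hypothesis.split())
)
print(raw_mismatches)                                 # 2

# After normalization the two word sequences are identical.
print(normalize(reference) == normalize(hypothesis))  # True
```

Two of five words differing would be a 40% WER on a transcript that a human reader would call correct, which is why published WER figures are only comparable when the same normalization is applied to both sides.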

This is why some teams supplement WER with meaning-aware metrics, but no replacement has displaced it as the default.

How to lower your WER

You control more of the outcome than you might think:

Improve the audio. Clean input is the single biggest lever. Use a decent microphone, reduce background noise, and avoid people talking over each other.
Pick an accuracy-first model. Not all engines are equal. Our best transcription software roundup for 2026 ranks tools by real-world WER.
Provide context where supported. Custom vocabulary or prompts help models get names, acronyms, and jargon right.
Match the model to the language and accent. A model tuned for your language and dialect will usually produce a lower WER than a generalist one.

Where TranscribTxt fits

TranscribTxt is built accuracy-first. It runs on ElevenLabs Scribe, one of the lowest-WER engines available, so you start from the best baseline instead of fighting a weak model. You can test it on your own audio for free—5 files per month, no credit card required—and judge the WER yourself. The Pro plan is $12/month for 1,200 minutes. Your audio is deleted after transcription, so accuracy never comes at the cost of privacy.

The bottom line: WER is the number that actually matters when you compare transcription tools. Clean audio plus a strong model gets you into the low-single-digit range—the closest thing to a transcript you can trust without re-listening.

Frequently Asked Questions

What is word error rate (WER)?

Word Error Rate (WER) is the standard metric for transcription accuracy. It equals the total number of word errors—substitutions, deletions, and insertions—divided by the number of words in the reference (correct) transcript. A lower WER means a more accurate transcript. A 5% WER corresponds to roughly 95% word-level accuracy.

What is a good WER?

For clean, clear speech, top transcription models reach roughly 2-5% WER, which is near human-level. Anything under 10% is generally usable for most purposes. Noisy audio, heavy accents, crosstalk, or technical jargon can push WER to 15-30% or higher, even for the best models.

How is WER calculated?

Align the machine transcript against a correct reference transcript, then count three error types: substitutions (wrong word), deletions (missing word), and insertions (extra word). Add those three counts together and divide by the total number of words in the reference. Multiply by 100 to express WER as a percentage.

What is the difference between WER and accuracy?

Accuracy is usually defined as 100% minus WER, so a 4% WER equals 96% accuracy. They describe the same thing from opposite directions. WER is the more precise, industry-standard term because it ties directly to a countable error formula, whereas marketing 'accuracy' figures are often loosely defined.

Does a low WER guarantee a good transcript?

Not always. WER treats every word equally, so misspelling a key name counts the same as dropping a filler word like 'um.' Two transcripts with identical WER can differ wildly in readability and meaning. WER is a strong baseline metric, but it does not weight errors by importance.
