Translate and transcribe audio in one step: what actually works
How to get an English transcript from audio recorded in Spanish, French, Russian, or 96 other languages — and when one-step translation-transcription works versus when you need two separate steps.
Most transcription tools were built for English. The good news: the underlying models have improved to the point where 99-language support is standard for Whisper-based tools. The bad news: accuracy in non-English languages varies more than marketing materials suggest.
Here's what actually works for multilingual transcription and translation workflows.
What "99 languages" actually means
When a tool claims 99-language support, it usually means the Whisper model underneath it was trained on audio in those languages, not that all 99 transcribe equally well. In practice, accuracy falls into rough tiers:
High accuracy (90%+ on clean audio): Spanish, French, German, Portuguese, Italian, Dutch, Russian, Japanese, Korean, Mandarin Chinese, Polish, Turkish, Arabic
Moderate accuracy (80-90%): Hindi, Thai, Vietnamese, Indonesian, Swedish, Norwegian, Ukrainian, Romanian
Lower accuracy (below 80%): Less documented languages, regional dialects, heavily accented speech in any language
The model works better on languages that appear more frequently in its training data. English, Spanish, French, and German are abundant online — other languages less so.
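The tiers above are easiest to verify on your own recordings by hand-correcting a short sample and scoring the tool's output against it. A minimal sketch, assuming "accuracy" means one minus word error rate (WER), which this article does not state explicitly; the function names are illustrative and the text normalization is deliberately simple:

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, split on whitespace (simplistic on purpose)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at zero."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))
```

Word-based scoring like this does not suit languages written without spaces, such as Japanese or Chinese; a character-level comparison is the usual substitute there.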
Two approaches to multilingual transcription
Approach 1: Transcribe in original language, then translate
This is the recommended approach for accuracy.
- Upload your audio to TranscribTxt
- Language is detected automatically — no configuration needed
- Download the transcript in the original language
- Paste into DeepL (better for European languages) or Google Translate
- Review the translation, especially for technical vocabulary
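The steps above can also be sketched in code. This is a rough illustration, not TranscribTxt's interface (which is a web upload): it assumes the open-source `openai-whisper` package as a stand-in for the transcription step and the official `deepl` Python client for translation. The function names and the `base` model size are hypothetical choices.

```python
from typing import Callable


def transcribe_then_translate(
    audio_path: str,
    transcribe: Callable[[str], dict],
    translate: Callable[[str], str],
) -> dict:
    """Keep the original-language transcript next to its translation."""
    result = transcribe(audio_path)
    original = result["text"].strip()
    return {
        "language": result.get("language"),
        "original": original,
        "english": translate(original),
    }


def whisper_transcribe(audio_path: str) -> dict:
    """Assumes `pip install openai-whisper`; the language is auto-detected."""
    import whisper  # imported lazily so the helper above has no hard dependency

    model = whisper.load_model("base")
    return model.transcribe(audio_path)  # dict with "text", "language", "segments"


def make_deepl_translate(auth_key: str) -> Callable[[str], str]:
    """Assumes `pip install deepl` and a valid DeepL API key."""
    import deepl

    translator = deepl.Translator(auth_key)
    return lambda text: translator.translate_text(text, target_lang="EN-US").text
```

A call such as `transcribe_then_translate("episode.mp3", whisper_transcribe, make_deepl_translate(key))` returns both texts, so a reviewer can compare the original and the English side by side.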
Why this is better: Translation errors are easier to spot and correct when you have the original-language transcript to compare against. If you transcribe and translate in one pass and something is wrong, you can't tell whether the error came from the transcription or the translation.
Best translation tools:
- DeepL: More natural phrasing in European languages
- Google Translate: Broader language coverage than DeepL, useful when the source language is outside DeepL's list
- DeepL API: For programmatic workflows where you need to translate in bulk
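For bulk work, long transcripts usually need to be split before they are sent to a translation API. A minimal sketch, assuming the official `deepl` Python client and a per-request character budget; `max_chars=4000` is an arbitrary placeholder rather than a documented DeepL limit, and the helper names are hypothetical:

```python
import re


def split_into_chunks(text: str, max_chars: int = 4000) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_transcript(text: str, auth_key: str, target_lang: str = "EN-US") -> str:
    """Translate chunk by chunk; assumes `pip install deepl` and a valid key."""
    import deepl  # lazy import: the splitter above works without it

    chunks = split_into_chunks(text)
    if not chunks:
        return ""
    translator = deepl.Translator(auth_key)
    results = translator.translate_text(chunks, target_lang=target_lang)
    return " ".join(result.text for result in results)
```

Splitting on sentence boundaries keeps each request self-contained, which tends to matter more for translation quality than the exact chunk size.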
Approach 2: Direct translation output from Whisper
Whisper has a translation mode that outputs English directly from any supported language, without producing a transcript in the original language first.
This is faster but less auditable. If the output is wrong, you can't check whether the error is in the transcription or translation step.
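With the open-source `openai-whisper` package, the two modes differ only in the `task` option passed to `transcribe`. A minimal sketch under that assumption; the helper names and the `base` model size are illustrative, and hosted Whisper-based tools may expose this differently or not at all:

```python
from typing import Optional


def whisper_options(translate_to_english: bool, language: Optional[str] = None) -> dict:
    """Build keyword arguments for whisper's model.transcribe()."""
    options = {"task": "translate" if translate_to_english else "transcribe"}
    if language is not None:
        options["language"] = language  # e.g. "es"; omit to auto-detect
    return options


def run_whisper(audio_path: str, translate_to_english: bool,
                language: Optional[str] = None) -> str:
    """Assumes `pip install openai-whisper` and ffmpeg available on PATH."""
    import whisper  # lazy import so whisper_options() can be used on its own

    model = whisper.load_model("base")
    result = model.transcribe(audio_path, **whisper_options(translate_to_english, language))
    return result["text"].strip()
```

Running `run_whisper(path, False)` and `run_whisper(path, True)` on the same file gives you both outputs to compare, at the cost of processing the audio twice.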
Use this when:
- You need a quick English summary of foreign-language audio
- The content isn't critical (internal notes, rough drafts)
- You don't need the original-language transcript
Avoid this when:
- Accuracy is important (legal, medical, research)
- You'll need to verify the transcript later
- The source language is not well-supported by Whisper
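The two checklists above amount to a small decision rule. A sketch of one way to encode it; treating the high-accuracy tier from earlier in this article as "well-supported" is an assumption, and the function name and return labels are made up for illustration:

```python
# Assumption: "well-supported" means the high-accuracy tier listed earlier.
WELL_SUPPORTED = {
    "spanish", "french", "german", "portuguese", "italian", "dutch", "russian",
    "japanese", "korean", "mandarin chinese", "polish", "turkish", "arabic",
}


def choose_approach(language: str, accuracy_critical: bool, need_original: bool) -> str:
    """Return "two-step" or "direct" following the checklists above."""
    if accuracy_critical or need_original:
        return "two-step"  # legal, medical, research, or later verification
    if language.strip().lower() not in WELL_SUPPORTED:
        return "two-step"  # weaker source languages need the auditable path
    return "direct"        # quick, non-critical English output
```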
Practical examples
Spanish podcast episode → English transcript for blog post: Upload the MP3 to TranscribTxt → Spanish transcript → DeepL → English → edit for natural phrasing → publish. Total time: 20-30 minutes for a 30-minute episode.
French interview for qualitative research: Record → upload to TranscribTxt → French transcript → human review with native French speaker → DeepL translation → second review → use as research data. Time: 45-60 minutes per hour of audio.
Russian business meeting → English notes: Zoom recording → upload to TranscribTxt → Russian transcript → Google Translate → edit obvious errors → send as meeting notes. Time: 15-20 minutes.
What doesn't work well
Heavily accented English: Some users try to transcribe non-native English speakers and find that accuracy drops significantly. If a speaker has a strong accent and the choice is yours, recording them in their native language and translating the transcript to English often gives better results than transcribing their accented English directly.
Mixed-language audio: Code-switching (switching between languages mid-sentence) is difficult for all current AI tools. If your recording has speakers alternating between English and Spanish, expect more errors than monolingual audio.
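One way to find out whether a recording mixes languages before transcribing it is to run language detection on successive windows and look for changes. A rough sketch, assuming the open-source `openai-whisper` package; the function names and the 30-second window are illustrative choices, and this only catches switches between windows, not mid-sentence ones:

```python
def language_switches(window_langs: list[str]) -> list[int]:
    """Indices of windows whose detected language differs from the previous window."""
    return [
        i for i in range(1, len(window_langs))
        if window_langs[i] != window_langs[i - 1]
    ]


def detect_window_languages(audio_path: str, model_name: str = "base") -> list[str]:
    """Detect the most likely language of each 30-second window."""
    import whisper  # lazy import; assumes `pip install openai-whisper` and ffmpeg
    from whisper.audio import N_SAMPLES  # number of samples in one 30-second window

    model = whisper.load_model(model_name)  # must be a multilingual model
    audio = whisper.load_audio(audio_path)
    langs = []
    for start in range(0, len(audio), N_SAMPLES):
        window = whisper.pad_or_trim(audio[start:start + N_SAMPLES])
        mel = whisper.log_mel_spectrogram(window, n_mels=model.dims.n_mels).to(model.device)
        _, probs = model.detect_language(mel)
        langs.append(max(probs, key=probs.get))
    return langs
```

If `language_switches(detect_window_languages(path))` comes back non-empty, plan for extra review time on that recording.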
Dialects: Standard dialects of major languages work well. Regional dialects — Argentine Spanish, Swiss German, Sicilian Italian — may have noticeably lower accuracy than standard versions.
Frequently Asked Questions
Can I transcribe audio in one language and translate it to English automatically?
Yes. Whisper-based transcription tools can output in the original language or translate directly to English in one step. TranscribTxt transcribes in the original language — translation requires a second step using DeepL or Google Translate. For direct Spanish-to-English or French-to-English output, Whisper's translation mode handles this in one pass.
Which transcription tools support multiple languages?
TranscribTxt supports 99 languages with automatic detection. Otter.ai supports English, French, German, Spanish, and Japanese. Whisper (local) supports 99 languages. Rev AI supports 38 languages. For broad language support, Whisper-based tools are the most comprehensive.
How accurate is AI transcription in non-English languages?
Accuracy varies significantly by language. Spanish, French, German, Portuguese, and Italian perform well at 90-95% on clean audio. Russian, Japanese, and Chinese usually land a little lower, in the 85-92% range, depending on audio quality. Less commonly spoken languages may drop below 80%. English remains the most accurate language for all AI transcription tools.
What is the best way to translate a transcription to English?
Generate the transcript in the original language first, then paste it into DeepL (most accurate for European languages) or Google Translate. For professional use, machine translation should be reviewed by a native speaker. For informal use, such as internal notes or rough drafts, the translation quality from DeepL is typically sufficient.
Can I transcribe audio from a non-English interview for research?
Yes. Upload the audio to TranscribTxt — language is detected automatically. Download the transcript in the original language, then translate with DeepL. For academic research where accuracy matters, review the transcript with a native speaker before using it as data.