BRIDGE Report

14 ASR Models Tested
22 Languages (17 Indic + 5 International)
7 Cohort Dimensions

AI Listens to Everyone. Except 5.5 Billion People.
That’s the gap we want to bridge.

BRIDGE is the first independent Global South ASR benchmark, evaluating 14 global models across 22 languages on a first-of-its-kind 7-metric stack.

Context

Why this benchmark exists

Three design choices separate BRIDGE from read-speech benchmarks: genuine two-person conversations, a geographically distributed corpus, and a metric stack that goes beyond WER.

Real conversations, not read speech

Every file in BRIDGE is a genuine two-person conversation with natural overlap, spontaneous disfluencies, and unscripted code-switching — not a speaker reading from a prompt.

Geographically distributed corpus

Indic speakers from 22+ Indian states; Spanish across three Latin American dialects (Argentinian, Peruvian, Venezuelan); Brazilian Portuguese; and Vietnamese — capturing accent, dialect, and tonal variation across three language families.

7 metrics, not just WER

WER alone is blind to meaning and code-switching correctness. Our stack adds CER, Semantic Similarity, CS F1, PIER, toWER and OIWER — so a model can't hide poor Hindi-English switching or tone-mark drift behind a decent word count.

Methodology

How the benchmark was built

From speaker recruitment to evaluation pipeline — the decisions that make BRIDGE reproducible, auditable, and resistant to benchmark gaming.

7-Metric Evaluation Stack
  • WER — Word Error Rate
  • CER — Character Error Rate
  • Semantic Similarity
  • CS F1 — Code-Switch F1
  • PIER — Phoneme-Informed Error Rate
  • toWER — Transliteration-Optimized WER
  • OIWER — Orthography-Informed WER
(Data collection)

Dual-speaker audio collected from real conversations: 22+ Indian states across the Indic corpus; three Latin-American Spanish dialects (Argentinian, Peruvian, Venezuelan), Brazilian Portuguese, and Vietnamese on the international side. Contributors were sourced to reflect diverse demographics — age, gender, region — with no scripting or prompting. Every file is a genuine, naturalistic conversation.

(Text normalisation)

Before any metric runs, both reference and hypothesis pass through three normalisation layers: base cleaning (lowercasing, punctuation stripping, Unicode marks preserved), loanword normalisation (script-variant English words unified), and OIWER normalisation (British/American spelling + mixed-script token expansion).
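
The exact rules live in the released evaluation scripts; the sketch below only illustrates the three-layer idea. The loanword map, spelling table, and regexes are hypothetical placeholders, not the rules BRIDGE actually applies.

import re
import unicodedata

def base_clean(text: str) -> str:
    """Layer 1: lowercase and replace punctuation with spaces; combining marks
    (Indic matras, Vietnamese tone marks) are not punctuation, so they survive."""
    out = []
    for ch in unicodedata.normalize("NFC", text.lower()):
        out.append(" " if unicodedata.category(ch).startswith("P") else ch)
    return re.sub(r"\s+", " ", "".join(out)).strip()

# Layer 2: unify script variants of common English loanwords (illustrative map only).
LOANWORDS = {"फ़ोन": "phone", "मीटिंग": "meeting"}

def normalise_loanwords(text: str) -> str:
    return " ".join(LOANWORDS.get(tok, tok) for tok in text.split())

# Layer 3: OIWER normalisation, e.g. British/American spelling unification.
SPELLING = {"colour": "color", "organisation": "organization"}

def oiwer_normalise(text: str) -> str:
    return " ".join(SPELLING.get(tok, tok) for tok in text.split())

def normalise(text: str) -> str:
    """Apply all three layers to a reference or hypothesis before scoring."""
    return oiwer_normalise(normalise_loanwords(base_clean(text)))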

(7-metric evaluation stack)

Each (audio, model) pair is scored on: WER & CER (word/character accuracy), Semantic Similarity (meaning preservation via multilingual embeddings), CS F1 (code-switching quality), PIER (English token recall), toWER (phonetic WER via ITRANS), and OIWER (orthography-informed WER). WER alone is insufficient: the same file can score 0.97 on WER yet 0.99 on Semantic Similarity, for example when the output script differs from the reference while the meaning stays intact.
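
As a rough illustration of how the per-pair scores could be produced (this is not the BRIDGE implementation; the jiwer and sentence-transformers dependencies are assumptions), here are three of the seven metrics for one (audio, model) pair:

import jiwer
from sentence_transformers import SentenceTransformer, util

# Any multilingual sentence encoder works for the semantic-similarity metric.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def score_pair(reference: str, hypothesis: str) -> dict:
    """Score one (audio, model) pair; both strings are assumed already normalised."""
    semantic = util.cos_sim(embedder.encode(reference),
                            embedder.encode(hypothesis)).item()
    return {
        "wer": jiwer.wer(reference, hypothesis),  # word-level edit distance ratio
        "cer": jiwer.cer(reference, hypothesis),  # character-level edit distance ratio
        "semantic_similarity": semantic,          # cosine similarity of embeddings
    }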

(Cohort attribution)

Every conversation is tagged across 7 dimensions: language, gender mix, region (same/cross-state on Indic; dialect on Spanish), age group, speaker overlap, conversational density, and gap pattern. The same scheme runs over both corpora so cross-language comparisons stay apples-to-apples.
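
A plausible shape for the per-conversation record (field names and example values below are hypothetical, but the seven dimensions are the ones listed above):

from dataclasses import dataclass

@dataclass
class CohortTags:
    """One record per conversation; the identical schema runs over both corpora."""
    language: str                 # e.g. "hi", "es-AR", "pt-BR", "vi"
    gender_mix: str               # e.g. "male-male", "mixed"
    region: str                   # Indic: same_state / cross_state; Spanish: dialect
    age_group: str                # e.g. "18-25", "26-40"
    speaker_overlap: str          # how much the two speakers talk over each other
    conversational_density: str   # speech vs. silence within the file
    gap_pattern: str              # distribution of pauses between turns

example = CohortTags("hi", "mixed", "cross_state", "26-40",
                     "high", "dense", "short_gaps")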

Results

Model Leaderboard

All 14 models on WER (and more). Filter by language to see how each model performs per-language. Filter by metric to change what the bars represent.

All languages · Word Error Rate (WER) · Lower is better · 14 models
  • ElevenLabs Scribe v2: 8.5% (n=331) · Best
  • Deepgram Nova-3: 13.7% (n=63)
  • Sarvam saaras v3: 18.9% (n=85)
  • Gemini 2.5 Pro: 19.1% (n=326)
  • Azure (Conv. Transcriber): 19.8% (n=72)
  • Soniox stt-async-v4: 21.8% (n=246)
  • Gemini 2.5 Flash: 22.4% (n=324)
  • Sarvam saarika v2.5: 23.3% (n=84)
  • Speechmatics: 24.0% (n=121)
  • AWS Transcribe: 30.9% (n=355)
  • Gnani Vachana v3: 35.6% (n=4)
  • AssemblyAI Universal: 45.4% (n=164)
  • OpenAI GPT-4o: 47.4% (n=162)
  • OpenAI GPT-4o mini: 48.4% (n=162)

Bars show mean Word Error Rate per model on the selected language slice. Models with fewer than 2 audio samples per slice are excluded. Source: 2,139 Indic + 577 international evaluation rows.
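
The aggregation behind the chart is simple; below is a pandas sketch under assumed column names (model, language, wer), not the released evaluation scripts:

from typing import Optional
import pandas as pd

def leaderboard(rows: pd.DataFrame, language: Optional[str] = None,
                metric: str = "wer", min_samples: int = 2) -> pd.DataFrame:
    """Mean metric per model on a language slice, excluding thin slices."""
    if language is not None:
        rows = rows[rows["language"] == language]
    agg = rows.groupby("model")[metric].agg(["mean", "count"])
    agg = agg[agg["count"] >= min_samples]   # drop models with < 2 samples on the slice
    return agg.sort_values("mean")           # lower is better for WER

# rows would hold the 2,139 Indic + 577 international evaluation rows.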

Findings

ElevenLabs Scribe v2 leads overall at 8.53% WER

It is the only model that is simultaneously accurate and code-switch aware. Deepgram Nova-3 has the best CS F1 but limited language coverage. Several models sit above 30% WER — not production-ready for Indic conversational audio.

What we offer

Key Findings

Six evidence-based conclusions drawn from the merged BRIDGE corpus — Indic conversational audio plus Spanish (3 dialects), Brazilian Portuguese, and Vietnamese — relevant for enterprise AI teams deploying voice products across the Global South.

  • ElevenLabs Scribe v2 posts 8.53% mean WER across the merged corpus, with 0.65–1.45% on the three Latin-American Spanish dialects and 1.45% on Brazilian Portuguese. The 7.67pp gap to the second-ranked model on the international slice is wider than the entire spread from rank 2 to rank 11 — Scribe v2 is the unambiguous default wherever it has language coverage.

Findings

ElevenLabs Scribe v2 leads broadly — except where it doesn’t

Scribe v2 dominates Indic, Spanish, and Portuguese; AssemblyAI Universal owns Vietnamese. Overlap, cross-state pairs, and Caribbean Spanish are the three biggest performance killers across the merged corpus. CS F1 exposes Indic code-switch failures invisible to WER. Don’t pick an ASR provider on aggregate WER alone — evaluate on the cohort that matches your deployment.

Multi-Dimensional Analysis

Cohort Performance Analysis

Choose a cohort dimension, a metric, and a model to see how performance shifts across conditions. All three filters work together — any combination is valid.

Region · Word Error Rate (WER) · All models · Lower is better
Categories: Unknown · Same State · Cross State
  • ElevenLabs Scribe v2 · overall 9.2% · Δ ±6.4pp · Unknown 7.5% · Same State 10.0% · Cross State 3.5%
  • Deepgram Nova-3 · overall 16.4% · Δ ±23.9pp · Same State 31.1% · Cross State 7.2%
  • Sarvam saaras v3 · overall 18.9% · Δ ±12.7pp · Unknown 8.8% · Same State 21.5%
  • Gemini 2.5 Pro · overall 18.9% · Δ ±4.7pp · Unknown 20.9% · Same State 18.2% · Cross State 22.9%
  • Azure (Conv. Transcriber) · overall 22.6% · Δ ±14.4pp · Unknown 15.0% · Same State 29.4%

Bars show mean Word Error Rate per cohort category for each model. Δ = spread across the displayed categories. Categories with fewer than 2 audio samples for that model are excluded.
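
The Δ figure is just the spread (max minus min) across the displayed per-category means; continuing the same assumed schema as the leaderboard sketch:

import pandas as pd

def cohort_spread(rows: pd.DataFrame, model: str, dimension: str = "region",
                  metric: str = "wer", min_samples: int = 2) -> float:
    """Spread of per-category mean WER for one model, in percentage points."""
    sub = rows[rows["model"] == model]
    per_cat = sub.groupby(dimension)[metric].agg(["mean", "count"])
    per_cat = per_cat[per_cat["count"] >= min_samples]
    # Assumes the metric is stored as a fraction; drop the 100x if already in percent.
    return 100 * (per_cat["mean"].max() - per_cat["mean"].min())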

Findings

Speaker overlap is the biggest acoustic stressor

Cross-state pairs are harder than same-state. Gender and age have modest effects. Duration and gap patterns show minimal impact — language and accent dominate.

Multi-Dimensional Analysis

The Hidden Quality Gap

CS F1 measures whether English vocabulary embedded in Indic speech is preserved — not dropped, not transliterated. A model that turns "data backup" into "डेटा बैकअप" scores 0 on CS F1. Invisible to WER. Fatal for downstream applications.
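
The exact CS F1 formula is not spelled out on this page; one plausible reading, sketched below, treats the Latin-script (English) tokens of the reference as the positive class and scores precision and recall of those tokens in the hypothesis, so a transliterated or dropped token counts as a miss.

import re
from collections import Counter

LATIN = re.compile(r"^[a-z][a-z0-9'-]*$")

def english_tokens(text: str) -> Counter:
    """Latin-script tokens, taken here as embedded English in an Indic transcript."""
    return Counter(t for t in text.lower().split() if LATIN.match(t))

def cs_f1(reference: str, hypothesis: str) -> float:
    ref, hyp = english_tokens(reference), english_tokens(hypothesis)
    true_pos = sum((ref & hyp).values())        # English tokens preserved verbatim
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(hyp.values())
    recall = true_pos / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# "data backup" transliterated to Devanagari yields no true positives,
# so the file scores 0 under this reading.
cs_f1("मुझे data backup चाहिए", "मुझे डेटा बैकअप चाहिए")   # -> 0.0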

CS F1 Score — All Models (higher = better)
  • Deepgram Nova-3: 0.906
  • ElevenLabs Scribe v2: 0.736
  • Gemini 2.5 Pro: 0.407
  • Azure (Conv. Transcriber): 0.375
  • Speechmatics: 0.340
  • OpenAI GPT-4o: 0.298
  • OpenAI GPT-4o mini: 0.296
  • Gemini 2.5 Flash: 0.243
  • AWS Transcribe: 0.199
  • Gladia v2: 0.186
  • AssemblyAI Universal: 0.161
  • Sarvam saaras v3: 0.161
  • Soniox stt-async-v4: 0.138
  • Sarvam saarika v2.5: 0.085

Findings

The CS F1 spread is wider than the WER spread for top models

The leader on CS F1 may not be #1 on WER. Models clustered near zero CS F1 hide a fundamental unsuitability for code-mixed Indic enterprise speech, even when their headline WER looks acceptable.

Access the dataset

Dataset access & citation

The full BRIDGE corpus — Indic + Latin American Spanish + Brazilian Portuguese + Vietnamese — is available on Hugging Face. If you use this benchmark in your research, please cite the following.

Access the dataset

Audio files, golden transcripts, speaker metadata, cohort labels, and evaluation scripts are available under the BRIDGE dataset card on Hugging Face — covering Indic (17 languages), Latin American Spanish (3 dialects), Brazilian Portuguese, and Vietnamese. Additional languages and an overlap-focused corpus are in preparation.

Hugging Face
Citation
@misc{humynlabs_bridge_2026,
  title  = {BRIDGE: State of Conversational ASR Across the Global South},
  author = {HumynLabs Research Team},
  year   = {2026},
  month  = {April},
  note   = {Independent benchmark evaluating 15 commercial ASR APIs on dual-speaker conversational audio across 20 languages — Indic (17 languages, 22+ Indian states), Latin American Spanish (Argentinian, Peruvian, Venezuelan), Brazilian Portuguese, and Vietnamese — scored on a 7-metric stack across 7 cohort dimensions},
  url    = {https://bridge-report-hazel.vercel.app},
  howpublished = {HumynLabs}
}
Collaborate or partner

If you work on conversational ASR and want to submit your model for evaluation, or partner on expanding the corpus into new languages, contact the BRIDGE team at humynlabs.ai.