BRIDGE Report

14 ASR Models Tested
22 Languages (17 Indic + 5 International)
7 Cohort Dimensions

AI Listens to Everyone. Except 5.5 Billion People.
That’s the gap we want to bridge.

BRIDGE is the first independent Global South ASR benchmark, evaluating 14 global models across 22 languages on a first-of-its-kind 7-metric stack.

Context

Why this benchmark exists

Three design choices separate BRIDGE from read-speech benchmarks: genuine two-person conversations, a geographically distributed corpus, and a metric stack that goes beyond WER.

Real conversations, not read speech

Every file in BRIDGE is a genuine two-person conversation with natural overlap, spontaneous disfluencies, and unscripted code-switching — not a speaker reading from a prompt.

Geographically distributed corpus

Indic speakers from 22+ Indian states; Spanish across three Latin American dialects (Argentinian, Peruvian, Venezuelan); Brazilian Portuguese; and Vietnamese — capturing accent, dialect, and tonal variation across three language families.

7 metrics, not just WER

WER alone is blind to meaning and code-switching correctness. Our stack adds CER, Semantic Similarity, CS F1, PIER, toWER and OIWER — so a model can't hide poor Hindi-English switching or tone-mark drift behind a decent word count.

Methodology

How the benchmark was built

From speaker recruitment to evaluation pipeline — the decisions that make BRIDGE reproducible, auditable, and resistant to benchmark gaming.

7-Metric Evaluation Stack
  • WER — Word Error Rate
  • CER — Character Error Rate
  • Semantic Similarity
  • CS F1 — Code-Switch F1
  • PIER — Phoneme-Informed Error Rate
  • toWER — Transliteration-Optimized WER
  • OIWER — Orthography-Informed WER
(Data collection)

Dual-speaker audio collected from real conversations: 22+ Indian states across the Indic corpus; three Latin-American Spanish dialects (Argentinian, Peruvian, Venezuelan), Brazilian Portuguese, and Vietnamese on the international side. Contributors were sourced to reflect diverse demographics — age, gender, region — with no scripting or prompting. Every file is a genuine, naturalistic conversation.

(Text normalisation)

Before any metric runs, both reference and hypothesis pass through three normalisation layers: base cleaning (lowercasing, punctuation stripping, Unicode marks preserved), loanword normalisation (script-variant English words unified), and OIWER normalisation (British/American spelling + mixed-script token expansion).
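
The exact rules live in the released evaluation scripts; the sketch below only illustrates the three-layer idea. The loanword map, spelling table, and regexes are hypothetical placeholders, not the rules BRIDGE actually applies.

import re
import unicodedata

def base_clean(text: str) -> str:
    """Layer 1: lowercase and replace punctuation with spaces; combining marks
    (Indic matras, Vietnamese tone marks) are not punctuation, so they survive."""
    out = []
    for ch in unicodedata.normalize("NFC", text.lower()):
        out.append(" " if unicodedata.category(ch).startswith("P") else ch)
    return re.sub(r"\s+", " ", "".join(out)).strip()

# Layer 2: unify script variants of common English loanwords (illustrative map only).
LOANWORDS = {"फ़ोन": "phone", "मीटिंग": "meeting"}

def normalise_loanwords(text: str) -> str:
    return " ".join(LOANWORDS.get(tok, tok) for tok in text.split())

# Layer 3: OIWER normalisation, e.g. British/American spelling unification.
SPELLING = {"colour": "color", "organisation": "organization"}

def oiwer_normalise(text: str) -> str:
    return " ".join(SPELLING.get(tok, tok) for tok in text.split())

def normalise(text: str) -> str:
    """Apply all three layers to a reference or hypothesis before scoring."""
    return oiwer_normalise(normalise_loanwords(base_clean(text)))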

(7-metric evaluation stack)

Each (audio, model) pair is scored on: WER & CER (word/character accuracy), Semantic Similarity (meaning preservation via multilingual embeddings), CS F1 (code-switching quality), PIER (English token recall), toWER (phonetic WER via ITRANS), and OIWER (orthography-informed WER). WER alone is insufficient: the same file can score 0.97 on WER yet 0.99 on Semantic Similarity, for example when the output script differs from the reference while the meaning stays intact.
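
As a rough illustration of how the per-pair scores could be produced (this is not the BRIDGE implementation; the jiwer and sentence-transformers dependencies are assumptions), here are three of the seven metrics for one (audio, model) pair:

import jiwer
from sentence_transformers import SentenceTransformer, util

# Any multilingual sentence encoder works for the semantic-similarity metric.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def score_pair(reference: str, hypothesis: str) -> dict:
    """Score one (audio, model) pair; both strings are assumed already normalised."""
    semantic = util.cos_sim(embedder.encode(reference),
                            embedder.encode(hypothesis)).item()
    return {
        "wer": jiwer.wer(reference, hypothesis),  # word-level edit distance ratio
        "cer": jiwer.cer(reference, hypothesis),  # character-level edit distance ratio
        "semantic_similarity": semantic,          # cosine similarity of embeddings
    }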

(Cohort attribution)

Every conversation is tagged across 7 dimensions: language, gender mix, region (same/cross-state on Indic; dialect on Spanish), age group, speaker overlap, conversational density, and gap pattern. The same scheme runs over both corpora so cross-language comparisons stay apples-to-apples.
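
A plausible shape for the per-conversation record (field names and example values below are hypothetical, but the seven dimensions are the ones listed above):

from dataclasses import dataclass

@dataclass
class CohortTags:
    """One record per conversation; the identical schema runs over both corpora."""
    language: str                 # e.g. "hi", "es-AR", "pt-BR", "vi"
    gender_mix: str               # e.g. "male-male", "mixed"
    region: str                   # Indic: same_state / cross_state; Spanish: dialect
    age_group: str                # e.g. "18-25", "26-40"
    speaker_overlap: str          # how much the two speakers talk over each other
    conversational_density: str   # speech vs. silence within the file
    gap_pattern: str              # distribution of pauses between turns

example = CohortTags("hi", "mixed", "cross_state", "26-40",
                     "high", "dense", "short_gaps")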

Results

Model Leaderboard

All 14 models on WER (and more). Filter by language to see how each model performs per-language. Filter by metric to change what the bars represent.

All languages · Word Error Rate (WER) · Lower is better · 14 models
  • ElevenLabs Scribe v2: 8.5% (n=331) · Best
  • Deepgram Nova-3: 13.7% (n=63)
  • Sarvam saaras v3: 18.9% (n=85)
  • Gemini 2.5 Pro: 19.1% (n=326)
  • Azure (Conv. Transcriber): 19.8% (n=72)
  • Soniox stt-async-v4: 21.8% (n=246)
  • Gemini 2.5 Flash: 22.4% (n=324)
  • Sarvam saarika v2.5: 23.3% (n=84)
  • Speechmatics: 24.0% (n=121)
  • AWS Transcribe: 30.9% (n=355)
  • Gnani Vachana v3: 35.6% (n=4)
  • AssemblyAI Universal: 45.4% (n=164)
  • OpenAI GPT-4o: 47.4% (n=162)
  • OpenAI GPT-4o mini: 48.4% (n=162)

Bars show mean Word Error Rate per model on the selected language slice. Models with fewer than 2 audio samples per slice are excluded. Source: 2,139 Indic + 577 international evaluation rows.
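
The aggregation behind the chart is simple; below is a pandas sketch under assumed column names (model, language, wer), not the released evaluation scripts:

from typing import Optional
import pandas as pd

def leaderboard(rows: pd.DataFrame, language: Optional[str] = None,
                metric: str = "wer", min_samples: int = 2) -> pd.DataFrame:
    """Mean metric per model on a language slice, excluding thin slices."""
    if language is not None:
        rows = rows[rows["language"] == language]
    agg = rows.groupby("model")[metric].agg(["mean", "count"])
    agg = agg[agg["count"] >= min_samples]   # drop models with < 2 samples on the slice
    return agg.sort_values("mean")           # lower is better for WER

# rows would hold the 2,139 Indic + 577 international evaluation rows.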

Findings

ElevenLabs Scribe v2 leads overall at 8.53% WER

It is the only model that is simultaneously accurate and code-switch aware. Deepgram Nova-3 has the best CS F1 but limited language coverage. Several models sit above 30% WER — not production-ready for Indic conversational audio.

What we offer

Key Findings

Six evidence-based conclusions drawn from the merged BRIDGE corpus — Indic conversational audio plus Spanish (3 dialects), Brazilian Portuguese, and Vietnamese — relevant for enterprise AI teams deploying voice products across the Global South.

  • ElevenLabs Scribe v2 posts 8.53% mean WER across the merged corpus, with 0.65–1.45% on the three Latin-American Spanish dialects and 1.45% on Brazilian Portuguese. The 7.67pp gap to the second-ranked model on the international slice is wider than the entire spread from rank 2 to rank 11 — Scribe v2 is the unambiguous default wherever it has language coverage.

Findings

ElevenLabs Scribe v2 leads broadly — except where it doesn’t

Scribe v2 dominates Indic, Spanish, and Portuguese; AssemblyAI Universal owns Vietnamese. Overlap, cross-state pairs, and Caribbean Spanish are the three biggest performance killers across the merged corpus. CS F1 exposes Indic code-switch failures invisible to WER. Don’t pick an ASR provider on aggregate WER alone — evaluate on the cohort that matches your deployment.

Multi-Dimensional Analysis

Cohort Performance Analysis

Choose a cohort dimension, a metric, and a model to see how performance shifts across conditions. All three filters work together — any combination is valid.

Region · Word Error Rate (WER) · All models · Lower is better
Categories: Unknown · Same State · Cross State
  • ElevenLabs Scribe v2 · overall 9.2% · Δ ±6.4pp · Unknown 7.5% · Same State 10.0% · Cross State 3.5%
  • Deepgram Nova-3 · overall 16.4% · Δ ±23.9pp · Same State 31.1% · Cross State 7.2%
  • Sarvam saaras v3 · overall 18.9% · Δ ±12.7pp · Unknown 8.8% · Same State 21.5%
  • Gemini 2.5 Pro · overall 18.9% · Δ ±4.7pp · Unknown 20.9% · Same State 18.2% · Cross State 22.9%
  • Azure (Conv. Transcriber) · overall 22.6% · Δ ±14.4pp · Unknown 15.0% · Same State 29.4%

Bars show mean Word Error Rate per cohort category for each model. Δ = spread across the displayed categories. Categories with fewer than 2 audio samples for that model are excluded.
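
The Δ figure is just the spread (max minus min) across the displayed per-category means; continuing the same assumed schema as the leaderboard sketch:

import pandas as pd

def cohort_spread(rows: pd.DataFrame, model: str, dimension: str = "region",
                  metric: str = "wer", min_samples: int = 2) -> float:
    """Spread of per-category mean WER for one model, in percentage points."""
    sub = rows[rows["model"] == model]
    per_cat = sub.groupby(dimension)[metric].agg(["mean", "count"])
    per_cat = per_cat[per_cat["count"] >= min_samples]
    # Assumes the metric is stored as a fraction; drop the 100x if already in percent.
    return 100 * (per_cat["mean"].max() - per_cat["mean"].min())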

Findings

Speaker overlap is the biggest acoustic stressor

Cross-state pairs are harder than same-state. Gender and age have modest effects. Duration and gap patterns show minimal impact — language and accent dominate.

Multi-Dimensional Analysis

The Hidden Quality Gap

CS F1 measures whether English vocabulary embedded in Indic speech is preserved — not dropped, not transliterated. A model that turns "data backup" into "डेटा बैकअप" scores 0 on CS F1. Invisible to WER. Fatal for downstream applications.
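
The exact CS F1 formula is not spelled out on this page; one plausible reading, sketched below, treats the Latin-script (English) tokens of the reference as the positive class and scores precision and recall of those tokens in the hypothesis, so a transliterated or dropped token counts as a miss.

import re
from collections import Counter

LATIN = re.compile(r"^[a-z][a-z0-9'-]*$")

def english_tokens(text: str) -> Counter:
    """Latin-script tokens, taken here as embedded English in an Indic transcript."""
    return Counter(t for t in text.lower().split() if LATIN.match(t))

def cs_f1(reference: str, hypothesis: str) -> float:
    ref, hyp = english_tokens(reference), english_tokens(hypothesis)
    true_pos = sum((ref & hyp).values())        # English tokens preserved verbatim
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(hyp.values())
    recall = true_pos / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# "data backup" transliterated to Devanagari yields no true positives,
# so the file scores 0 under this reading.
cs_f1("मुझे data backup चाहिए", "मुझे डेटा बैकअप चाहिए")   # -> 0.0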

CS F1 Score — All Models (higher = better)
  • Deepgram Nova-3: 0.906
  • ElevenLabs Scribe v2: 0.736
  • Gemini 2.5 Pro: 0.407
  • Azure (Conv. Transcriber): 0.375
  • Speechmatics: 0.340
  • OpenAI GPT-4o: 0.298
  • OpenAI GPT-4o mini: 0.296
  • Gemini 2.5 Flash: 0.243
  • AWS Transcribe: 0.199
  • Gladia v2: 0.186
  • AssemblyAI Universal: 0.161
  • Sarvam saaras v3: 0.161
  • Soniox stt-async-v4: 0.138
  • Sarvam saarika v2.5: 0.085

Findings

The CS F1 spread is wider than the WER spread for top models

The leader on CS F1 may not be #1 on WER. Models clustered near zero CS F1 hide a fundamental unsuitability for code-mixed Indic enterprise speech, even when their headline WER looks acceptable.

Access the dataset

Dataset access & citation

The full BRIDGE corpus — Indic + Latin American Spanish + Brazilian Portuguese + Vietnamese — is available on Hugging Face. If you use this benchmark in your research, please cite the following.

Access the dataset

Audio files, golden transcripts, speaker metadata, cohort labels, and evaluation scripts are available under the BRIDGE dataset card on Hugging Face — covering Indic (17 languages), Latin American Spanish (3 dialects), Brazilian Portuguese, and Vietnamese. Additional languages and an overlap-focused corpus are in preparation.

Hugging Face
Citation
@misc{humynlabs_bridge_2026,
  title  = {BRIDGE: State of Conversational ASR Across the Global South},
  author = {HumynLabs Research Team},
  year   = {2026},
  month  = {April},
  note   = {Independent benchmark evaluating 15 commercial ASR APIs on dual-speaker conversational audio across 20 languages — Indic (17 languages, 22+ Indian states), Latin American Spanish (Argentinian, Peruvian, Venezuelan), Brazilian Portuguese, and Vietnamese — scored on a 7-metric stack across 7 cohort dimensions},
  url    = {https://bridge-report-hazel.vercel.app},
  howpublished = {HumynLabs}
}
Collaborate or partner

If you work on conversational ASR and want to submit your model for evaluation, or partner on expanding the corpus into new languages, contact the BRIDGE team at humynlabs.ai.