From 435d0bebd021e87e3ca0eeab76474430b09b157e Mon Sep 17 00:00:00 2001
From: Jong Wook Kim
Date: Mon, 6 Nov 2023 09:49:48 -0800
Subject: [PATCH] add mandarin alias

---
 README.md            | 4 ++--
 whisper/tokenizer.py | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 3dc26c6..afca9c9 100644
--- a/README.md
+++ b/README.md
@@ -69,9 +69,9 @@ There are five model sizes, four with English-only versions, offering speed and
 
 The `.en` models for English-only applications tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
 
-Whisper's performance varies widely depending on the language. The figure below shows a WER (Word Error Rate) breakdown by languages of the Fleurs dataset using the `large-v2` model (The smaller the numbers, the better the performance). Additional WER scores corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4. Meanwhile, more BLEU (Bilingual Evaluation Understudy) scores can be found in Appendix D.3. Both are found in [the paper](https://arxiv.org/abs/2212.04356).
+Whisper's performance varies widely depending on the language. The figure below shows a performance breakdown of `large-v3` and `large-v2` models by language, using WERs (word error rates) or CER (character error rates, shown in *Italic*) evaluated on the Common Voice 15 and Fleurs datasets. Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of [the paper](https://arxiv.org/abs/2212.04356), as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.
 
-![WER breakdown by language](https://raw.githubusercontent.com/openai/whisper/main/language-breakdown.svg)
+![WER breakdown by language](https://github.com/openai/whisper/assets/266841/f4619d66-1058-4005-8f67-a9d811b77c62)
 
diff --git a/whisper/tokenizer.py b/whisper/tokenizer.py
index ae8d1ad..49eda0f 100644
--- a/whisper/tokenizer.py
+++ b/whisper/tokenizer.py
@@ -130,6 +130,7 @@ TO_LANGUAGE_CODE = {
     "moldovan": "ro",
     "sinhalese": "si",
     "castilian": "es",
+    "mandarin": "zh",
 }
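
Not part of the patch itself, but a minimal Python sketch of what the tokenizer change enables: once the alias is present in `TO_LANGUAGE_CODE`, the name `"mandarin"` resolves to the `"zh"` language code that the rest of the codebase already recognizes. The snippet assumes a checkout with this patch applied is importable.

```python
# Minimal sketch (assumes the patched whisper package is on the import path).
from whisper.tokenizer import LANGUAGES, TO_LANGUAGE_CODE

# The alias added by this patch maps the common language name to its code.
code = TO_LANGUAGE_CODE["mandarin"]
print(code)             # -> "zh"
print(LANGUAGES[code])  # -> "chinese"
```

Since language names supplied by callers are lower-cased and looked up against this table before being matched to a code, passing "Mandarin" as the language should now resolve the same way "Chinese" does, rather than raising an unknown-language error.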