mirror of
https://github.com/openai/whisper.git
synced 2025-03-30 14:28:27 +00:00
large-v3-turbo model (#2361)
This commit is contained in:
parent
b66b46f32d
commit
25e5c364e0
20
README.md
20
README.md
@ -57,17 +57,21 @@ pip install setuptools-rust
|
||||
|
||||
## Available models and languages
|
||||
|
||||
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model; actual speed may vary depending on many factors including the available hardware.
|
||||
There are six model sizes, four with English-only versions, offering speed and accuracy tradeoffs.
|
||||
Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model.
|
||||
The relative speeds below are measured by transcribing English speech on a A100, and the real-world speed may vary significantly depending on many factors including the language, the speaking speed, and the available hardware.
|
||||
|
||||
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|
||||
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
|
||||
| tiny | 39 M | `tiny.en` | `tiny` | ~1 GB | ~32x |
|
||||
| base | 74 M | `base.en` | `base` | ~1 GB | ~16x |
|
||||
| small | 244 M | `small.en` | `small` | ~2 GB | ~6x |
|
||||
| tiny | 39 M | `tiny.en` | `tiny` | ~1 GB | ~10x |
|
||||
| base | 74 M | `base.en` | `base` | ~1 GB | ~7x |
|
||||
| small | 244 M | `small.en` | `small` | ~2 GB | ~4x |
|
||||
| medium | 769 M | `medium.en` | `medium` | ~5 GB | ~2x |
|
||||
| large | 1550 M | N/A | `large` | ~10 GB | 1x |
|
||||
| turbo | 809 M | N/A | `turbo` | ~6 GB | ~8x |
|
||||
|
||||
The `.en` models for English-only applications tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
|
||||
Additionally, the `turbo` model is an optimized version of `large-v3` that offers faster transcription speed with a minimal degradation in accuracy.
|
||||
|
||||
Whisper's performance varies widely depending on the language. The figure below shows a performance breakdown of `large-v3` and `large-v2` models by language, using WERs (word error rates) or CER (character error rates, shown in *Italic*) evaluated on the Common Voice 15 and Fleurs datasets. Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of [the paper](https://arxiv.org/abs/2212.04356), as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.
|
||||
|
||||
@ -77,9 +81,9 @@ Whisper's performance varies widely depending on the language. The figure below
|
||||
|
||||
## Command-line usage
|
||||
|
||||
The following command will transcribe speech in audio files, using the `medium` model:
|
||||
The following command will transcribe speech in audio files, using the `turbo` model:
|
||||
|
||||
whisper audio.flac audio.mp3 audio.wav --model medium
|
||||
whisper audio.flac audio.mp3 audio.wav --model turbo
|
||||
|
||||
The default setting (which selects the `small` model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the `--language` option:
|
||||
|
||||
@ -103,7 +107,7 @@ Transcription can also be performed within Python:
|
||||
```python
|
||||
import whisper
|
||||
|
||||
model = whisper.load_model("base")
|
||||
model = whisper.load_model("turbo")
|
||||
result = model.transcribe("audio.mp3")
|
||||
print(result["text"])
|
||||
```
|
||||
@ -115,7 +119,7 @@ Below is an example usage of `whisper.detect_language()` and `whisper.decode()`
|
||||
```python
|
||||
import whisper
|
||||
|
||||
model = whisper.load_model("base")
|
||||
model = whisper.load_model("turbo")
|
||||
|
||||
# load audio and pad/trim it to fit 30 seconds
|
||||
audio = whisper.load_audio("audio.mp3")
|
||||
|
@ -16,13 +16,15 @@ The Whisper models are trained for speech recognition and translation tasks, cap
|
||||
| small | 244 M | ✓ | ✓ |
|
||||
| medium | 769 M | ✓ | ✓ |
|
||||
| large | 1550 M | | ✓ |
|
||||
| turbo | 798 M | | ✓ |
|
||||
|
||||
In December 2022, we [released an improved large model named `large-v2`](https://github.com/openai/whisper/discussions/661), and `large-v3` in November 2023.
|
||||
Additionally, we've added a `turbo` model in September 2024 which is optimized for inference speed.
|
||||
|
||||
|
||||
### Release date
|
||||
|
||||
September 2022 (original series), December 2022 (`large-v2`), and November 2023 (`large-v3`)
|
||||
September 2022 (original series), December 2022 (`large-v2`), November 2023 (`large-v3`), September 2024 (`large-v3-turbo`)
|
||||
|
||||
### Model type
|
||||
|
||||
|
@ -27,6 +27,8 @@ _MODELS = {
|
||||
"large-v2": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
|
||||
"large-v3": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
|
||||
"large": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
|
||||
"large-v3-turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
|
||||
"turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
|
||||
}
|
||||
|
||||
# base85-encoded (n_layers, n_heads) boolean arrays indicating the cross-attention heads that are
|
||||
@ -44,6 +46,8 @@ _ALIGNMENT_HEADS = {
|
||||
"large-v2": b"ABzY8zd+h!0{>%R7=D0pU<_bnWW*tkYAhobTNnu$jnkEkXqp)j;w1Tzk)UH3X%SZd&fFZ2fC2yj",
|
||||
"large-v3": b"ABzY8gWO1E0{>%R7(9S+Kn!D~%ngiGaR?*L!iJG9p-nab0JQ=-{D1-g00",
|
||||
"large": b"ABzY8gWO1E0{>%R7(9S+Kn!D~%ngiGaR?*L!iJG9p-nab0JQ=-{D1-g00",
|
||||
"large-v3-turbo": b"ABzY8j^C+e0{>%RARaKHP%t(lGR*)0g!tONPyhe`",
|
||||
"turbo": b"ABzY8j^C+e0{>%RARaKHP%t(lGR*)0g!tONPyhe`",
|
||||
}
|
||||
|
||||
|
||||
|
@ -511,7 +511,7 @@ def cli():
|
||||
# fmt: off
|
||||
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
||||
parser.add_argument("audio", nargs="+", type=str, help="audio file(s) to transcribe")
|
||||
parser.add_argument("--model", default="small", type=valid_model_name, help="name of the Whisper model to use")
|
||||
parser.add_argument("--model", default="turbo", type=valid_model_name, help="name of the Whisper model to use")
|
||||
parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default")
|
||||
parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="device to use for PyTorch inference")
|
||||
parser.add_argument("--output_dir", "-o", type=str, default=".", help="directory to save the outputs")
|
||||
|
Loading…
x
Reference in New Issue
Block a user