Merge branch 'main' into main

commit 2c914999bd
Author: Jong Wook Kim
Date: 2023-01-18 14:03:54 -08:00 (committed via GitHub)
19 changed files with 4633 additions and 2690 deletions

.github/workflows/python-publish.yml (new file)

@@ -0,0 +1,37 @@
+name: Release
+on:
+  push:
+    branches:
+      - main
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - uses: actions-ecosystem/action-regex-match@v2
+        id: regex-match
+        with:
+          text: ${{ github.event.head_commit.message }}
+          regex: '^Release ([^ ]+)'
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: '3.8'
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install setuptools wheel twine
+      - name: Release
+        if: ${{ steps.regex-match.outputs.match != '' }}
+        uses: softprops/action-gh-release@v1
+        with:
+          tag_name: v${{ steps.regex-match.outputs.group1 }}
+      - name: Build and publish
+        if: ${{ steps.regex-match.outputs.match != '' }}
+        env:
+          TWINE_USERNAME: __token__
+          TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
+        run: |
+          python setup.py sdist
+          twine upload dist/*
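The release steps fire only when the head commit message matches the regex above. As a local sanity check — a hypothetical snippet, not part of the workflow — the same pattern can be exercised with Python's `re` module:

```python
import re

# The workflow's pattern: group 1 captures everything after "Release "
# up to the first space, and the job tags the release as "v<group 1>".
pattern = re.compile(r"^Release ([^ ]+)")

match = pattern.match("Release 20230117")
if match:
    print(f"tag_name: v{match.group(1)}")  # -> tag_name: v20230117
```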

.github/workflows/test.yml (new file)

@@ -0,0 +1,26 @@
+name: test
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+jobs:
+  whisper-test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ['3.8', '3.9', '3.10']
+        pytorch-version: [1.10.2, 1.13.1]
+        exclude:
+          - python-version: '3.10'
+            pytorch-version: 1.10.2
+    steps:
+      - uses: conda-incubator/setup-miniconda@v2
+      - run: conda install -n test ffmpeg python=${{ matrix.python-version }} pytorch=${{ matrix.pytorch-version }} cpuonly -c pytorch
+      - uses: actions/checkout@v2
+      - run: echo "$CONDA/envs/test/bin" >> $GITHUB_PATH
+      - run: pip install pytest
+      - run: pip install .
+      - run: pytest --durations=0 -vv -k 'not test_transcribe or test_transcribe[tiny] or test_transcribe[tiny.en]'
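The `-k` expression runs the full unit-test suite but limits the parametrized `test_transcribe` cases to the `tiny` and `tiny.en` models, presumably to keep checkpoint downloads and CPU inference time manageable on hosted runners.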

MANIFEST.in

@@ -1,3 +1,6 @@
+include requirements.txt
+include README.md
+include LICENSE
 include whisper/assets/*
 include whisper/assets/gpt2/*
 include whisper/assets/multilingual/*

README.md

@@ -1,8 +1,8 @@
 # Whisper
 
 [[Blog]](https://openai.com/blog/whisper)
-[[Paper]](https://cdn.openai.com/papers/whisper.pdf)
-[[Model card]](model-card.md)
+[[Paper]](https://arxiv.org/abs/2212.04356)
+[[Model card]](https://github.com/openai/whisper/blob/main/model-card.md)
 [[Colab example]](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb)
 
 Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
@@ -10,17 +10,25 @@ Whisper is a general-purpose speech recognition model. It is trained on a large
 ## Approach
 
-![Approach](approach.png)
+![Approach](https://raw.githubusercontent.com/openai/whisper/main/approach.png)
 
 A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
 
 ## Setup
 
-We used Python 3.9.9 and [PyTorch](https://pytorch.org/) 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.7 or later and recent PyTorch versions. The codebase also depends on a few Python packages, most notably [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) for their fast tokenizer implementation and [ffmpeg-python](https://github.com/kkroening/ffmpeg-python) for reading audio files. The following command will pull and install the latest commit from this repository, along with its Python dependencies
+We used Python 3.9.9 and [PyTorch](https://pytorch.org/) 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.7 or later and recent PyTorch versions. The codebase also depends on a few Python packages, most notably [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) for their fast tokenizer implementation and [ffmpeg-python](https://github.com/kkroening/ffmpeg-python) for reading audio files. You can download and install (or update to) the latest release of Whisper with the following command:
+
+    pip install -U openai-whisper
+
+Alternatively, the following command will pull and install the latest commit from this repository, along with its Python dependencies:
 
     pip install git+https://github.com/openai/whisper.git
 
+To update the package to the latest version of this repository, please run:
+
+    pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
+
 It also requires the command-line tool [`ffmpeg`](https://ffmpeg.org/) to be installed on your system, which is available from most package managers:
 
 ```bash
@@ -62,9 +70,9 @@ There are five model sizes, four with English-only versions, offering speed and
 For English-only applications, the `.en` models tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
 
-Whisper's performance varies widely depending on the language. The figure below shows a WER breakdown by languages of Fleurs dataset, using the `large` model. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D in [the paper](https://cdn.openai.com/papers/whisper.pdf).
+Whisper's performance varies widely depending on the language. The figure below shows a WER (Word Error Rate) breakdown by language on the Fleurs dataset, using the `large-v2` model; smaller WER is better. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D of [the paper](https://arxiv.org/abs/2212.04356).
 
-![WER breakdown by language](language-breakdown.svg)
+![WER breakdown by language](https://raw.githubusercontent.com/openai/whisper/main/language-breakdown.svg)
@@ -86,7 +94,7 @@ Run the following to view all available options:
     whisper --help
 
-See [tokenizer.py](whisper/tokenizer.py) for the list of all available languages.
+See [tokenizer.py](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py) for the list of all available languages.
 
 ## Python usage
@@ -136,4 +144,4 @@ Please use the [🙌 Show and tell](https://github.com/openai/whisper/discussion
 ## License
 
-The code and the model weights of Whisper are released under the MIT License. See [LICENSE](LICENSE) for further details.
+The code and the model weights of Whisper are released under the MIT License. See [LICENSE](https://github.com/openai/whisper/blob/main/LICENSE) for further details.

(diff of one large file suppressed)

(binary image replaced: 134 KiB before, 100 KiB after)

model-card.md

@@ -2,7 +2,7 @@
 This is the official codebase for running the automatic speech recognition (ASR) models (Whisper models) trained and released by OpenAI.
 
-Following [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we're providing some information about the automatic speech recognition model. More information on how these models were trained and evaluated can be found [in the paper](https://cdn.openai.com/papers/whisper.pdf).
+Following [Model Cards for Model Reporting (Mitchell et al.)](https://arxiv.org/abs/1810.03993), we're providing some information about the automatic speech recognition model. More information on how these models were trained and evaluated can be found [in the paper](https://arxiv.org/abs/2212.04356).
 
 ## Model Details
@@ -17,10 +17,12 @@ The Whisper models are trained for speech recognition and translation tasks, cap
 | medium | 769 M | ✓ | ✓ |
 | large  | 1550 M |   | ✓ |
 
+In December 2022, we [released an improved large model named `large-v2`](https://github.com/openai/whisper/discussions/661).
+
 ### Release date
 
-September 2022
+September 2022 (original series) and December 2022 (`large-v2`)
 
 ### Model type
@@ -28,7 +30,7 @@ Sequence-to-sequence ASR (automatic speech recognition) and speech translation m
 ### Paper & samples
 
-[Paper](https://cdn.openai.com/papers/whisper.pdf) / [Blog](https://openai.com/blog/whisper)
+[Paper](https://arxiv.org/abs/2212.04356) / [Blog](https://openai.com/blog/whisper)
 
 ## Model Use
@@ -46,7 +48,7 @@ In particular, we caution against using Whisper models to transcribe recordings
 The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages.
 
-As discussed in [the accompanying paper](https://cdn.openai.com/papers/whisper.pdf), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.
+As discussed in [the accompanying paper](https://arxiv.org/abs/2212.04356), we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.
 
 ## Performance and Limitations
@@ -55,9 +57,9 @@ Our studies show that, over many existing ASR systems, the models exhibit improv
 However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.
 
-Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages or languages where we have less training data. The models also exhibit disparate performance on different accents and dialects of particular languages, which may include higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in [the paper accompanying this release](https://cdn.openai.com/papers/whisper.pdf).
+Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages or languages where we have less training data. The models also exhibit disparate performance on different accents and dialects of particular languages, which may include higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in [the paper accompanying this release](https://arxiv.org/abs/2212.04356).
 
-In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. Further analysis on these limitations are provided in [the paper](https://cdn.openai.com/papers/whisper.pdf). It is likely that this behavior and hallucinations may be worse on lower-resource and/or lower-discoverability languages.
+In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. Further analysis of these limitations is provided in [the paper](https://arxiv.org/abs/2212.04356). It is likely that this behavior and hallucinations may be worse on lower-resource and/or lower-discoverability languages.
 
 ## Broader Implications

(diff of one file suppressed because one or more lines are too long)

setup.py

@@ -3,11 +3,19 @@ import os
 import pkg_resources
 from setuptools import setup, find_packages
 
+
+def read_version(fname="whisper/version.py"):
+    exec(compile(open(fname, encoding="utf-8").read(), fname, "exec"))
+    return locals()["__version__"]
+
+
 setup(
-    name="whisper",
+    name="openai-whisper",
     py_modules=["whisper"],
-    version="1.0",
+    version=read_version(),
     description="Robust Speech Recognition via Large-Scale Weak Supervision",
+    long_description=open("README.md", encoding="utf-8").read(),
+    long_description_content_type="text/markdown",
     readme="README.md",
     python_requires=">=3.7",
     author="OpenAI",
@@ -21,8 +29,8 @@ setup(
         )
     ],
     entry_points={
-        'console_scripts': ['whisper=whisper.transcribe:cli'],
+        "console_scripts": ["whisper=whisper.transcribe:cli"],
     },
     include_package_data=True,
-    extras_require={'dev': ['pytest']},
+    extras_require={"dev": ["pytest"]},
 )

tests/test_normalizer.py

@@ -84,6 +84,7 @@ def test_text_normalizer():
assert std("he's like") == "he is like"
assert std("she's been like") == "she has been like"
assert std("10km") == "10 km"
assert std("10mm") == "10 mm"
assert std("RC232") == "rc 232"
assert (

tests/test_transcribe.py

@@ -1,13 +1,15 @@
 import os
 
 import pytest
+import torch
 
 import whisper
 
 
-@pytest.mark.parametrize('model_name', whisper.available_models())
+@pytest.mark.parametrize("model_name", whisper.available_models())
 def test_transcribe(model_name: str):
-    model = whisper.load_model(model_name).cuda()
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    model = whisper.load_model(model_name).to(device)
     audio_path = os.path.join(os.path.dirname(__file__), "jfk.flac")
 
     language = "en" if model_name.endswith(".en") else None

whisper/__init__.py

@@ -12,6 +12,7 @@ from .audio import load_audio, log_mel_spectrogram, pad_or_trim
 from .decoding import DecodingOptions, DecodingResult, decode, detect_language
 from .model import Whisper, ModelDimensions
 from .transcribe import transcribe
+from .version import __version__
 
 _MODELS = {
@@ -23,7 +24,9 @@ _MODELS = {
     "small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt",
     "medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt",
     "medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt",
-    "large": "https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large.pt",
+    "large-v1": "https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt",
+    "large-v2": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
+    "large": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
 }
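Both `large` and `large-v2` now resolve to the same checkpoint, so existing code keeps working; a usage sketch (assuming a working install and network access):

```python
import whisper

# "large" now downloads the large-v2 checkpoint; "large-v1" remains
# available for reproducing results from the original release.
model = whisper.load_model("large")       # fetches large-v2.pt
legacy = whisper.load_model("large-v1")   # fetches the original weights
```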
@@ -37,7 +40,8 @@ def _download(url: str, root: str, in_memory: bool) -> Union[bytes, str]:
         raise RuntimeError(f"{download_target} exists and is not a regular file")
 
     if os.path.isfile(download_target):
-        model_bytes = open(download_target, "rb").read()
+        with open(download_target, "rb") as f:
+            model_bytes = f.read()
         if hashlib.sha256(model_bytes).hexdigest() == expected_sha256:
             return model_bytes if in_memory else download_target
         else:

whisper/audio.py

@@ -113,7 +113,7 @@ def log_mel_spectrogram(audio: Union[str, np.ndarray, torch.Tensor], n_mels: int
     window = torch.hann_window(N_FFT).to(audio.device)
     stft = torch.stft(audio, N_FFT, HOP_LENGTH, window=window, return_complex=True)
-    magnitudes = stft[:, :-1].abs() ** 2
+    magnitudes = stft[..., :-1].abs() ** 2
 
     filters = mel_filters(audio.device, n_mels)
     mel_spec = filters @ magnitudes
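With `return_complex=True`, `torch.stft` keeps the frame axis last for both 1-D and batched input, so indexing with `...` drops the final (incomplete) frame in either case, whereas `[:, :-1]` would truncate the frequency axis of a batched spectrogram. A small standalone sketch (constants mirror the ones used above):

```python
import torch

N_FFT, HOP_LENGTH = 400, 160
window = torch.hann_window(N_FFT)

mono = torch.randn(16000)        # (n_samples,)
batched = torch.randn(2, 16000)  # (n_audio, n_samples)

for audio in (mono, batched):
    stft = torch.stft(audio, N_FFT, HOP_LENGTH, window=window, return_complex=True)
    magnitudes = stft[..., :-1].abs() ** 2  # frames are always the last axis
    print(magnitudes.shape)  # (201, 100), then (2, 201, 100)
```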

whisper/decoding.py

@@ -423,8 +423,12 @@ class ApplyTimestampRules(LogitFilter):
             else:  # cannot be normal text tokens
                 logits[k, : self.tokenizer.eot] = -np.inf
 
-        # apply the `max_initial_timestamp` option
-        if tokens.shape[1] == self.sample_begin and self.max_initial_timestamp_index is not None:
-            last_allowed = self.tokenizer.timestamp_begin + self.max_initial_timestamp_index
-            logits[:, last_allowed + 1 :] = -np.inf
+        if tokens.shape[1] == self.sample_begin:
+            # suppress generating non-timestamp tokens at the beginning
+            logits[:, : self.tokenizer.timestamp_begin] = -np.inf
+
+            # apply the `max_initial_timestamp` option
+            if self.max_initial_timestamp_index is not None:
+                last_allowed = self.tokenizer.timestamp_begin + self.max_initial_timestamp_index
+                logits[:, last_allowed + 1 :] = -np.inf
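For reference, assuming the defaults elsewhere in this file are unchanged (`max_initial_timestamp` of 1.0 second, timestamp tokens advancing in 0.02-second steps), `max_initial_timestamp_index` is 50, so the first sampled token is constrained to the timestamp tokens `<|0.00|>` through `<|1.00|>`.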

whisper/model.py

@@ -82,8 +82,8 @@ class MultiHeadAttention(nn.Module):
             k = kv_cache[self.key]
             v = kv_cache[self.value]
 
-        wv = self.qkv_attention(q, k, v, mask)
-        return self.out(wv)
+        wv, qk = self.qkv_attention(q, k, v, mask)
+        return self.out(wv), qk
 
     def qkv_attention(self, q: Tensor, k: Tensor, v: Tensor, mask: Optional[Tensor] = None):
         n_batch, n_ctx, n_state = q.shape
@@ -95,9 +95,10 @@ class MultiHeadAttention(nn.Module):
         qk = q @ k
         if mask is not None:
             qk = qk + mask[:n_ctx, :n_ctx]
+        qk = qk.float()
 
-        w = F.softmax(qk.float(), dim=-1).to(q.dtype)
-        return (w @ v).permute(0, 2, 1, 3).flatten(start_dim=2)
+        w = F.softmax(qk, dim=-1).to(q.dtype)
+        return (w @ v).permute(0, 2, 1, 3).flatten(start_dim=2), qk.detach()
 
 
 class ResidualAttentionBlock(nn.Module):
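The new return convention threads the raw (pre-softmax, float32) attention scores out alongside the attended values so that callers can inspect attention patterns; pure forward passes simply index `[0]`, as the `ResidualAttentionBlock` hunk below shows. A simplified single-head sketch of the contract (not Whisper's exact multi-head implementation):

```python
import torch
import torch.nn.functional as F

def qkv_attention(q, k, v, mask=None):
    n_batch, n_ctx, n_state = q.shape
    scale = n_state ** -0.25  # single-head stand-in for (n_state // n_head) ** -0.25
    qk = (q * scale) @ (k * scale).transpose(-1, -2)
    if mask is not None:
        qk = qk + mask[:n_ctx, :n_ctx]
    qk = qk.float()  # softmax in float32 for numerical stability
    w = F.softmax(qk, dim=-1).to(q.dtype)
    return w @ v, qk.detach()  # attended values plus detached scores

q = k = v = torch.randn(1, 6, 64)
values, scores = qkv_attention(q, k, v)
print(values.shape, scores.shape)  # torch.Size([1, 6, 64]) torch.Size([1, 6, 6])
```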
@@ -121,9 +122,9 @@ class ResidualAttentionBlock(nn.Module):
         mask: Optional[Tensor] = None,
         kv_cache: Optional[dict] = None,
     ):
-        x = x + self.attn(self.attn_ln(x), mask=mask, kv_cache=kv_cache)
+        x = x + self.attn(self.attn_ln(x), mask=mask, kv_cache=kv_cache)[0]
         if self.cross_attn:
-            x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)
+            x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0]
         x = x + self.mlp(self.mlp_ln(x))
         return x
@@ -214,10 +215,10 @@
         )
 
     def embed_audio(self, mel: torch.Tensor):
-        return self.encoder.forward(mel)
+        return self.encoder(mel)
 
     def logits(self, tokens: torch.Tensor, audio_features: torch.Tensor):
-        return self.decoder.forward(tokens, audio_features)
+        return self.decoder(tokens, audio_features)
 
     def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> Dict[str, torch.Tensor]:
         return self.decoder(tokens, self.encoder(mel))

whisper/normalizers/english.json

@@ -1737,6 +1737,5 @@
"yoghurt": "yogurt",
"yoghurts": "yogurts",
"mhm": "hmm",
"mm": "hmm",
"mmm": "hmm"
}
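Dropping the `"mm": "hmm"` rule is what the new `std("10mm") == "10 mm"` assertion in `tests/test_normalizer.py` exercises: with the mapping in place, the `mm` in `10mm` would presumably have been rewritten to `hmm` instead of being treated as a unit.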

whisper/tokenizer.py

@@ -28,7 +28,7 @@ LANGUAGES = {
"hi": "hindi",
"fi": "finnish",
"vi": "vietnamese",
"iw": "hebrew",
"he": "hebrew",
"uk": "ukrainian",
"el": "greek",
"ms": "malay",

whisper/transcribe.py

@@ -1,5 +1,6 @@
 import argparse
 import os
+import sys
 import warnings
 from typing import List, Optional, Tuple, Union, TYPE_CHECKING
@@ -44,7 +45,7 @@ def transcribe(
         If False, displays minimal details. If None, does not display anything
 
     temperature: Union[float, Tuple[float, ...]]
-        Temperature for sampling. It can be a tuple of temperatures, which will be successfully used
+        Temperature for sampling. It can be a tuple of temperatures, which will be successively used
         upon failures according to either `compression_ratio_threshold` or `logprob_threshold`.
 
     compression_ratio_threshold: float
@@ -159,7 +160,7 @@ def transcribe(
                     "start": start,
                     "end": end,
                     "text": text,
-                    "tokens": result.tokens,
+                    "tokens": text_tokens.tolist(),
                     "temperature": result.temperature,
                     "avg_logprob": result.avg_logprob,
                     "compression_ratio": result.compression_ratio,
@@ -167,7 +168,10 @@ def transcribe(
                 }
             )
             if verbose:
-                print(f"[{format_timestamp(start)} --> {format_timestamp(end)}] {text}")
+                line = f"[{format_timestamp(start)} --> {format_timestamp(end)}] {text}\n"
+                # compared to just `print(line)`, this replaces any character not representable using
+                # the system default encoding with an '?', avoiding UnicodeEncodeError.
+                sys.stderr.buffer.write(line.encode(sys.getdefaultencoding(), errors="replace"))
 
     # show the progress bar when verbose is False (otherwise the transcribed text will be printed)
     num_frames = mel.shape[-1]

whisper/utils.py

@@ -24,7 +24,8 @@ def optional_float(string):
 def compression_ratio(text) -> float:
-    return len(text) / len(zlib.compress(text.encode("utf-8")))
+    text_bytes = text.encode("utf-8")
+    return len(text_bytes) / len(zlib.compress(text_bytes))
 
 
 def format_timestamp(seconds: float, always_include_hours: bool = False, decimal_marker: str = '.'):
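The old expression divided a character count by a byte count, understating the ratio for multi-byte UTF-8 text. A quick standalone illustration (a hypothetical string, not from the test suite):

```python
import zlib

text = "こんにちは" * 20  # highly repetitive, 3 bytes per character in UTF-8
text_bytes = text.encode("utf-8")

old_ratio = len(text) / len(zlib.compress(text_bytes))        # characters / bytes
new_ratio = len(text_bytes) / len(zlib.compress(text_bytes))  # bytes / bytes
print(f"{old_ratio:.2f} -> {new_ratio:.2f}")  # the byte-based ratio is ~3x larger
```

Since `transcribe()` treats a high compression ratio as a sign of degenerate, repetitive output, the character-based version could fail to flag repetitive transcriptions in scripts where each character encodes to several bytes.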

whisper/version.py (new file)

@@ -0,0 +1 @@
+__version__ = "20230117"
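Tying the pieces together: with the `python-publish.yml` workflow above, a commit titled `Release 20230117` pushed to `main` would create the tag `v20230117` and publish this date-stamped version to PyPI as `openai-whisper`, with `read_version()` in setup.py reading the same string from this file.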