Merge branch 'main' into main

commit ad84a5f266
Kent Slaney authored 2024-10-02 08:27:39 -07:00, committed by GitHub
11 changed files with 100 additions and 30 deletions

.github/workflows/test.yml

@@ -41,16 +41,29 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ['3.8', '3.9', '3.10', '3.11']
-        pytorch-version: [1.13.1, 2.0.0]
-        exclude:
-        - python-version: '3.11'
+        include:
+        - python-version: '3.8'
           pytorch-version: 1.13.1
+          numpy-requirement: "'numpy<2'"
+        - python-version: '3.8'
+          pytorch-version: 2.0.1
+          numpy-requirement: "'numpy<2'"
+        - python-version: '3.9'
+          pytorch-version: 2.1.2
+          numpy-requirement: "'numpy<2'"
+        - python-version: '3.10'
+          pytorch-version: 2.2.2
+          numpy-requirement: "'numpy<2'"
+        - python-version: '3.11'
+          pytorch-version: 2.3.1
+          numpy-requirement: "'numpy'"
+        - python-version: '3.12'
+          pytorch-version: 2.4.1
+          numpy-requirement: "'numpy'"
     steps:
     - uses: conda-incubator/setup-miniconda@v2
     - run: conda install -n test ffmpeg python=${{ matrix.python-version }}
-    - run: pip3 install torch==${{ matrix.pytorch-version }}+cpu --index-url https://download.pytorch.org/whl/cpu
    - uses: actions/checkout@v3
     - run: echo "$CONDA/envs/test/bin" >> $GITHUB_PATH
-    - run: pip install .["dev"]
+    - run: pip3 install .["dev"] ${{ matrix.numpy-requirement }} torch==${{ matrix.pytorch-version }}+cpu --index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pypi.org/simple
     - run: pytest --durations=0 -vv -k 'not test_transcribe or test_transcribe[tiny] or test_transcribe[tiny.en]' -m 'not requires_cuda'
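The matrix now enumerates one tested combination per Python version instead of a cross-product with excludes, pairing each interpreter with a known-good PyTorch release and a numpy constraint. As a rough sketch of the install command each job ends up running (illustrative only, not repository code):

```python
# Illustrative sketch: the pip command each matrix job expands to,
# given the include entries above.
matrix = [
    ("3.8", "1.13.1", "'numpy<2'"),
    ("3.8", "2.0.1", "'numpy<2'"),
    ("3.9", "2.1.2", "'numpy<2'"),
    ("3.10", "2.2.2", "'numpy<2'"),
    ("3.11", "2.3.1", "'numpy'"),
    ("3.12", "2.4.1", "'numpy'"),
]

for python, pytorch, numpy_req in matrix:
    print(
        f'Python {python}: pip3 install .["dev"] {numpy_req} '
        f"torch=={pytorch}+cpu "
        f"--index-url https://download.pytorch.org/whl/cpu "
        f"--extra-index-url https://pypi.org/simple"
    )
```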

CHANGELOG.md

@@ -1,5 +1,19 @@
 # CHANGELOG
 
+## [v20240930](https://github.com/openai/whisper/releases/tag/v20240930)
+
+* allowing numpy 2 in tests ([#2362](https://github.com/openai/whisper/pull/2362))
+* large-v3-turbo model ([#2361](https://github.com/openai/whisper/pull/2361))
+* test on python/pytorch versions up to 3.12 and 2.4.1 ([#2360](https://github.com/openai/whisper/pull/2360))
+* using sdpa if available ([#2359](https://github.com/openai/whisper/pull/2359))
+
+## [v20240927](https://github.com/openai/whisper/releases/tag/v20240927)
+
+* pinning numpy<2 in tests ([#2332](https://github.com/openai/whisper/pull/2332))
+* Relax triton requirements for compatibility with pytorch 2.4 and newer ([#2307](https://github.com/openai/whisper/pull/2307))
+* Skip silence around hallucinations ([#1838](https://github.com/openai/whisper/pull/1838))
+* Fix triton env marker ([#1887](https://github.com/openai/whisper/pull/1887))
+
 ## [v20231117](https://github.com/openai/whisper/releases/tag/v20231117)
 
 * Relax triton requirements for compatibility with pytorch 2.1 and newer ([#1802](https://github.com/openai/whisper/pull/1802))

README.md

@@ -57,17 +57,21 @@ pip install setuptools-rust
 
 ## Available models and languages
 
-There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model; actual speed may vary depending on many factors including the available hardware.
+There are six model sizes, four with English-only versions, offering speed and accuracy tradeoffs.
+Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model.
+The relative speeds below are measured by transcribing English speech on an A100, and the real-world speed may vary significantly depending on many factors including the language, the speaking speed, and the available hardware.
 
 |  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
 |:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
-|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
-|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
-| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
+|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~10x      |
+|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~7x       |
+| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~4x       |
 | medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
 | large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |
+| turbo  |   809 M    |        N/A         |      `turbo`       |     ~6 GB     |      ~8x       |
 
 The `.en` models for English-only applications tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.
+Additionally, the `turbo` model is an optimized version of `large-v3` that offers faster transcription speed with a minimal degradation in accuracy.
 
 Whisper's performance varies widely depending on the language. The figure below shows a performance breakdown of `large-v3` and `large-v2` models by language, using WERs (word error rates) or CER (character error rates, shown in *Italic*) evaluated on the Common Voice 15 and Fleurs datasets. Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of [the paper](https://arxiv.org/abs/2212.04356), as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.
@@ -77,9 +81,9 @@ Whisper's performance varies widely depending on the language. The figure below
 
 ## Command-line usage
 
-The following command will transcribe speech in audio files, using the `medium` model:
+The following command will transcribe speech in audio files, using the `turbo` model:
 
-    whisper audio.flac audio.mp3 audio.wav --model medium
+    whisper audio.flac audio.mp3 audio.wav --model turbo
 
 The default setting (which selects the `small` model) works well for transcribing English. To transcribe an audio file containing non-English speech, you can specify the language using the `--language` option:
@@ -103,7 +107,7 @@ Transcription can also be performed within Python:
 ```python
 import whisper
 
-model = whisper.load_model("base")
+model = whisper.load_model("turbo")
 result = model.transcribe("audio.mp3")
 print(result["text"])
 ```
@@ -115,7 +119,7 @@ Below is an example usage of `whisper.detect_language()` and `whisper.decode()`
 ```python
 import whisper
 
-model = whisper.load_model("base")
+model = whisper.load_model("turbo")
 
 # load audio and pad/trim it to fit 30 seconds
 audio = whisper.load_audio("audio.mp3")

model-card.md

@@ -16,13 +16,15 @@ The Whisper models are trained for speech recognition and translation tasks, cap
 |  small | 244 M  |     ✓      |     ✓     |
 | medium | 769 M  |     ✓      |     ✓     |
 |  large | 1550 M |            |     ✓     |
+|  turbo | 798 M  |            |     ✓     |
 
 In December 2022, we [released an improved large model named `large-v2`](https://github.com/openai/whisper/discussions/661), and `large-v3` in November 2023.
+Additionally, we've added a `turbo` model in September 2024 which is optimized for inference speed.
 
 ### Release date
 
-September 2022 (original series), December 2022 (`large-v2`), and November 2023 (`large-v3`)
+September 2022 (original series), December 2022 (`large-v2`), November 2023 (`large-v3`), September 2024 (`large-v3-turbo`)
 
 ### Model type

requirements.txt

@@ -4,4 +4,4 @@ torch
 tqdm
 more-itertools
 tiktoken
-triton>=2.0.0,<3;platform_machine=="x86_64" and sys_platform=="linux" or sys_platform=="linux2"
+triton>=2.0.0;platform_machine=="x86_64" and sys_platform=="linux" or sys_platform=="linux2"
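The change drops the `<3` upper bound so newer triton wheels satisfy the requirement; the environment marker is unchanged. Note that in PEP 508 markers `and` binds tighter than `or`, so the expression reads as `(platform_machine == "x86_64" and sys_platform == "linux") or sys_platform == "linux2"`. A quick way to check how the marker evaluates locally (assumes the `packaging` library is installed):

```python
# Evaluate the triton environment marker on the current interpreter.
from packaging.markers import Marker

marker = Marker(
    'platform_machine == "x86_64" and sys_platform == "linux"'
    ' or sys_platform == "linux2"'
)

# `and` binds tighter than `or`: (x86_64 and linux) or linux2.
print(marker.evaluate())  # True on x86_64 Linux, False elsewhere
```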

setup.py

@@ -13,7 +13,7 @@ def read_version(fname="whisper/version.py"):
 
 requirements = []
 if sys.platform.startswith("linux") and platform.machine() == "x86_64":
-    requirements.append("triton>=2.0.0,<3")
+    requirements.append("triton>=2.0.0")
 
 setup(
     name="openai-whisper",

whisper/__init__.py

@@ -27,6 +27,8 @@ _MODELS = {
     "large-v2": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
     "large-v3": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
     "large": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
+    "large-v3-turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
+    "turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
 }
 
 # base85-encoded (n_layers, n_heads) boolean arrays indicating the cross-attention heads that are
@@ -44,6 +46,8 @@ _ALIGNMENT_HEADS = {
     "large-v2": b"ABzY8zd+h!0{>%R7=D0pU<_bnWW*tkYAhobTNnu$jnkEkXqp)j;w1Tzk)UH3X%SZd&fFZ2fC2yj",
     "large-v3": b"ABzY8gWO1E0{>%R7(9S+Kn!D~%ngiGaR?*L!iJG9p-nab0JQ=-{D1-g00",
     "large": b"ABzY8gWO1E0{>%R7(9S+Kn!D~%ngiGaR?*L!iJG9p-nab0JQ=-{D1-g00",
+    "large-v3-turbo": b"ABzY8j^C+e0{>%RARaKHP%t(lGR*)0g!tONPyhe`",
+    "turbo": b"ABzY8j^C+e0{>%RARaKHP%t(lGR*)0g!tONPyhe`",
 }
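Each `_ALIGNMENT_HEADS` entry is a gzip-compressed, base85-encoded boolean array marking which cross-attention heads track word-level timing, as the comment in the file describes. A minimal sketch of the decoding, assuming the turbo decoder's 4 text layers and 20 heads (the library reshapes with the model's actual dimensions):

```python
# Sketch: decoding an _ALIGNMENT_HEADS blob into a boolean head mask.
# The (4, 20) shape is an assumption for illustration (turbo's decoder
# layer/head count); other models have other shapes.
import base64
import gzip

import numpy as np

dump = b"ABzY8j^C+e0{>%RARaKHP%t(lGR*)0g!tONPyhe`"
array = np.frombuffer(gzip.decompress(base64.b85decode(dump)), dtype=bool)
mask = array.reshape(4, 20)  # (n_text_layer, n_text_head)
print(f"{mask.sum()} of {mask.size} cross-attention heads used for alignment")
```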

whisper/model.py

@@ -1,7 +1,8 @@
 import base64
 import gzip
+from contextlib import contextmanager
 from dataclasses import dataclass
-from typing import Dict, Iterable, Optional
+from typing import Dict, Iterable, Optional, Tuple
 
 import numpy as np
 import torch
@@ -12,6 +13,14 @@ from .decoding import decode as decode_function
 from .decoding import detect_language as detect_language_function
 from .transcribe import transcribe as transcribe_function
 
+try:
+    from torch.nn.functional import scaled_dot_product_attention
+
+    SDPA_AVAILABLE = True
+except (ImportError, RuntimeError, OSError):
+    scaled_dot_product_attention = None
+    SDPA_AVAILABLE = False
+
 
 @dataclass
 class ModelDimensions:
@@ -59,7 +68,19 @@ def sinusoids(length, channels, max_timescale=10000):
     return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1)
 
 
+@contextmanager
+def disable_sdpa():
+    prev_state = MultiHeadAttention.use_sdpa
+    try:
+        MultiHeadAttention.use_sdpa = False
+        yield
+    finally:
+        MultiHeadAttention.use_sdpa = prev_state
+
+
 class MultiHeadAttention(nn.Module):
+    use_sdpa = True
+
     def __init__(self, n_state: int, n_head: int):
         super().__init__()
         self.n_head = n_head
@@ -92,20 +113,30 @@ class MultiHeadAttention(nn.Module):
     def qkv_attention(
         self, q: Tensor, k: Tensor, v: Tensor, mask: Optional[Tensor] = None
-    ):
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
         n_batch, n_ctx, n_state = q.shape
         scale = (n_state // self.n_head) ** -0.25
-        q = q.view(*q.shape[:2], self.n_head, -1).permute(0, 2, 1, 3) * scale
-        k = k.view(*k.shape[:2], self.n_head, -1).permute(0, 2, 3, 1) * scale
+        q = q.view(*q.shape[:2], self.n_head, -1).permute(0, 2, 1, 3)
+        k = k.view(*k.shape[:2], self.n_head, -1).permute(0, 2, 1, 3)
         v = v.view(*v.shape[:2], self.n_head, -1).permute(0, 2, 1, 3)
 
-        qk = q @ k
-        if mask is not None:
-            qk = qk + mask[:n_ctx, :n_ctx]
-        qk = qk.float()
+        if SDPA_AVAILABLE and MultiHeadAttention.use_sdpa:
+            a = scaled_dot_product_attention(
+                q, k, v, is_causal=mask is not None and n_ctx > 1
+            )
+            out = a.permute(0, 2, 1, 3).flatten(start_dim=2)
+            qk = None
+        else:
+            qk = (q * scale) @ (k * scale).transpose(-1, -2)
+            if mask is not None:
+                qk = qk + mask[:n_ctx, :n_ctx]
+            qk = qk.float()
 
-        w = F.softmax(qk, dim=-1).to(q.dtype)
-        return (w @ v).permute(0, 2, 1, 3).flatten(start_dim=2), qk.detach()
+            w = F.softmax(qk, dim=-1).to(q.dtype)
+            out = (w @ v).permute(0, 2, 1, 3).flatten(start_dim=2)
+            qk = qk.detach()
+
+        return out, qk
 
 
 class ResidualAttentionBlock(nn.Module):
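The two branches are numerically equivalent: the manual path applies `scale = d ** -0.25` to both `q` and `k`, which multiplies out to the `1/sqrt(d)` softmax temperature that `scaled_dot_product_attention` applies internally. A standalone check of that equivalence (shapes here are arbitrary assumptions):

```python
# Standalone sketch verifying that the manual path and the SDPA path
# above compute the same attention output.
import torch
from torch.nn.functional import scaled_dot_product_attention, softmax

n_batch, n_head, n_ctx, d_head = 2, 4, 8, 16
q, k, v = (torch.randn(n_batch, n_head, n_ctx, d_head) for _ in range(3))

# Manual path: d_head ** -0.25 on both q and k multiplies out to the
# usual 1/sqrt(d_head) scaling.
scale = d_head**-0.25
qk = (q * scale) @ (k * scale).transpose(-1, -2)
manual = softmax(qk.float(), dim=-1).to(q.dtype) @ v

# SDPA applies the 1/sqrt(d_head) scaling internally.
fused = scaled_dot_product_attention(q, k, v)

print(torch.allclose(manual, fused, atol=1e-5))  # expect True
```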

whisper/timing.py

@@ -191,7 +191,9 @@ def find_alignment(
         for i, block in enumerate(model.decoder.blocks)
     ]
 
-    with torch.no_grad():
+    from .model import disable_sdpa
+
+    with torch.no_grad(), disable_sdpa():
         logits = model(mel.unsqueeze(0), tokens.unsqueeze(0))[0]
         sampled_logits = logits[len(tokenizer.sot_sequence) :, : tokenizer.eot]
         token_probs = sampled_logits.softmax(dim=-1)
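`find_alignment` hooks the decoder's cross-attention blocks to collect the `qk` tensors they return, and the SDPA fast path returns `qk = None`, so the forward pass must run with SDPA disabled for word-level timestamps to work. A small sketch of the behavior being relied on (shapes are arbitrary; `qk_fast` comes back as `None` only when SDPA is actually available in the local torch):

```python
# Sketch: under disable_sdpa() the attention modules take the manual
# path and return real qk tensors instead of None.
import torch

from whisper.model import MultiHeadAttention, disable_sdpa

attn = MultiHeadAttention(n_state=64, n_head=4)
x = torch.randn(1, 10, 64)

_, qk_fast = attn(x)  # SDPA path: qk comes back as None
with disable_sdpa():
    _, qk_manual = attn(x)  # manual path: qk has shape (1, 4, 10, 10)

print(qk_fast, qk_manual.shape)
```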

whisper/transcribe.py

@@ -874,7 +874,7 @@ def cli():
     # fmt: off
     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
     parser.add_argument("audio", nargs="+", type=str, help="audio file(s) to transcribe")
-    parser.add_argument("--model", default="small", type=valid_model_name, help="name of the Whisper model to use")
+    parser.add_argument("--model", default="turbo", type=valid_model_name, help="name of the Whisper model to use")
     parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default")
     parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="device to use for PyTorch inference")
     parser.add_argument("--output_dir", "-o", type=str, default=".", help="directory to save the outputs")

whisper/version.py

@@ -1 +1 @@
-__version__ = "20231117"
+__version__ = "20240930"