CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
OpenAI Whisper is a robust automatic speech recognition (ASR) system built on a Transformer sequence-to-sequence model. It performs multilingual speech recognition, speech translation, spoken language identification, and voice activity detection as a unified multitask model.
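The high-level Python API reflects this multitask design. A minimal usage sketch (assuming an audio.mp3 file on disk):

```python
import whisper

# Load a model checkpoint by name (downloads on first use)
model = whisper.load_model("turbo")

# Transcribe an audio file; the spoken language is detected automatically
result = model.transcribe("audio.mp3")
print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # full transcription
```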
Development Commands
Installation
# Install package in development mode with dependencies
pip install -e ".[dev]"
# Or install from requirements
pip install -r requirements.txt
Code Quality & Linting
# Format code with black
black .
# Sort imports with isort
isort .
# Lint with flake8
flake8
# Run all pre-commit hooks
pre-commit run --all-files
Testing
# Run all tests
pytest
# Run tests with verbose output
pytest -v
# Run specific test file
pytest tests/test_transcribe.py
# Run tests requiring CUDA
pytest -m requires_cuda
Package Building
# Build package
python -m build
# Install built package
pip install dist/openai_whisper-*.whl
Architecture Overview
Core Components
whisper/__init__.py: Main entry point with model loading (load_model()) and model registry (_MODELS dict mapping model names to download URLs)
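For example, the registry keys are exposed through whisper.available_models(), and load_model() accepts a device and an alternative cache directory (a short sketch; the ./models path is only illustrative):

```python
import whisper

# Names registered in _MODELS, e.g. "tiny", "base.en", "turbo", ...
print(whisper.available_models())

# load_model resolves a name against _MODELS, downloads the checkpoint if
# needed, and places the model on the requested device
model = whisper.load_model("base", device="cpu", download_root="./models")
```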
whisper/model.py:
- ModelDimensions: Configuration dataclass for model architecture
- Whisper: Main model class implementing the Transformer architecture
- Audio encoder and text decoder components with multi-head attention
- Optimized layers (LayerNorm, Linear) for mixed-precision training
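ModelDimensions is attached to the loaded model as model.dims, so the architecture can be inspected directly (illustrative sketch):

```python
import whisper

model = whisper.load_model("base")

dims = model.dims  # ModelDimensions dataclass
print(dims.n_mels, dims.n_audio_layer, dims.n_text_layer, dims.n_vocab)
print(model.is_multilingual)  # False for the *.en checkpoints
```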
whisper/transcribe.py:
- transcribe(): High-level transcription function with sliding window processing
- cli(): Command-line interface implementation
- Handles batch processing, temperature sampling, and output formatting
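The temperature fallback and quality thresholds are exposed as keyword arguments of transcribe(); a sketch of a typical call (the values shown are just the defaults, not a recommendation):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "audio.mp3",
    language="en",                    # skip automatic language detection
    temperature=(0.0, 0.2, 0.4),      # fallback temperatures when decoding fails
    compression_ratio_threshold=2.4,  # treat highly repetitive output as a failure
    logprob_threshold=-1.0,           # treat low-confidence output as a failure
    no_speech_threshold=0.6,          # skip segments classified as silence
    condition_on_previous_text=True,
)
for segment in result["segments"]:
    print(f'[{segment["start"]:.2f} -> {segment["end"]:.2f}] {segment["text"]}')
```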
whisper/decoding.py:
- DecodingOptions/DecodingResult: Configuration and result classes
- decode(): Core decoding logic with beam search and sampling strategies
- detect_language(): Language identification functionality
whisper/audio.py: Audio preprocessing utilities including mel-spectrogram computation, padding/trimming to 30-second windows
whisper/tokenizer.py: BPE tokenization with special tokens for task specification (transcription vs translation) and language identification
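A small sketch of the tokenizer API; get_tokenizer() builds the special-token prefix for a given language and task:

```python
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

# Special-token prefix: <|startoftranscript|>, language token, task token
print(tokenizer.sot_sequence)

tokens = tokenizer.encode(" Hello world")
print(tokens, tokenizer.decode(tokens))
```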
whisper/timing.py: Word-level timestamp alignment using cross-attention weights from specific attention heads
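Word-level timing is enabled through transcribe(); for example:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=True)

# Each segment then carries a "words" list with per-word start/end times
for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:6.2f} {word["end"]:6.2f} {word["word"]}')
```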
whisper/normalizers/: Text normalization for different languages to improve transcription accuracy
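The normalizers are mainly used when comparing hypotheses against references; a brief sketch:

```python
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
# Lowercases, strips punctuation, and standardizes spellings/abbreviations
print(normalizer("Mr. O'Neill couldn't make it at 10:30 a.m."))
```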
Model Pipeline Flow
- Audio → Mel-spectrogram (whisper/audio.py)
- Spectrogram → Audio encoder features (whisper/model.py)
- Language detection via decoder (whisper/decoding.py)
- Text generation with task-specific tokens (whisper/transcribe.py)
- Optional word-level timestamp alignment (whisper/timing.py)
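The same flow can be driven manually with the lower-level API on a single 30-second window (a sketch, assuming an audio.mp3 input):

```python
import whisper

model = whisper.load_model("base")

# 1. Audio -> mel-spectrogram, padded/trimmed to one 30-second window
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# 2-3. Encoder features are computed inside detect_language()/decode()
_, probs = model.detect_language(mel)
print("detected language:", max(probs, key=probs.get))

# 4. Text generation for this window
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)
```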
Available Models
Six model sizes with different accuracy/speed tradeoffs:
- tiny, base, small, medium, large, turbo
- English-only variants: *.en (better for English)
- Models auto-download to ~/.cache/whisper/
Testing Structure
- tests/conftest.py: pytest configuration with CUDA markers and random seeds
- tests/jfk.flac: Reference audio file for integration tests
- Tests cover audio processing, tokenization, normalization, timing, and transcription functionality
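A hedged sketch of what a test in this layout can look like (the function names and assertion below are illustrative, not copied from the suite):

```python
import os

import pytest
import whisper


def test_transcribe_jfk():
    model = whisper.load_model("tiny")
    audio_path = os.path.join(os.path.dirname(__file__), "jfk.flac")
    result = model.transcribe(audio_path, fp16=False)
    assert "your country" in result["text"].lower()


@pytest.mark.requires_cuda
def test_model_loads_on_gpu():
    model = whisper.load_model("tiny", device="cuda")
    assert next(model.parameters()).is_cuda
```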
Code Style
- Black formatter (88 char line length)
- isort for import sorting (black profile)
- flake8 linting with specific ignores for E203, E501, W503, W504
- pre-commit hooks enforce consistency
Key Dependencies
- PyTorch: Core ML framework
- tiktoken: Fast BPE tokenization
- numba: JIT compilation for audio processing
- tqdm: Progress bars for model downloads and processing
- triton: GPU kernel optimization (Linux x86_64)