CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

OpenAI Whisper is a robust automatic speech recognition (ASR) system built on a Transformer sequence-to-sequence model. It performs multilingual speech recognition, speech translation, spoken language identification, and voice activity detection as a unified multitask model.

Development Commands

Installation

# Install package in development mode with dependencies
pip install -e ".[dev]"

# Or install from requirements
pip install -r requirements.txt

Code Quality & Linting

# Format code with black
black .

# Sort imports with isort
isort .

# Lint with flake8
flake8

# Run all pre-commit hooks
pre-commit run --all-files

Testing

# Run all tests
pytest

# Run tests with verbose output
pytest -v

# Run specific test file
pytest tests/test_transcribe.py

# Run tests requiring CUDA
pytest -m requires_cuda

Package Building

# Build package
python -m build

# Install built package
pip install dist/openai_whisper-*.whl

Architecture Overview

Core Components

whisper/__init__.py: Main entry point with model loading (load_model()) and model registry (_MODELS dict mapping model names to download URLs)
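
A minimal usage sketch of this entry point (the checkpoint name "base" is just an example; any name from the registry works):

import whisper

# Downloads the checkpoint to ~/.cache/whisper/ on first use, then loads it
model = whisper.load_model("base")

# The returned object is the Whisper model class from whisper/model.py
print(type(model).__name__, model.is_multilingual)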

whisper/model.py:

  • ModelDimensions: Configuration dataclass for the model architecture (inspected in the sketch after this list)
  • Whisper: Main model class implementing the Transformer architecture
  • Audio encoder and text decoder components with multi-head attention
  • Custom LayerNorm and Linear layers that cast to the input dtype for mixed-precision (fp16) inference
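
As a quick orientation to these pieces, a loaded model exposes its ModelDimensions instance as model.dims; the field names below (n_mels, n_audio_layer, n_text_layer, n_vocab) reflect the dataclass in whisper/model.py and are worth verifying against the current source:

import whisper

model = whisper.load_model("tiny")
dims = model.dims  # ModelDimensions for this checkpoint

# Encoder/decoder depth, mel-bin count, and vocabulary size vary by model size
print(dims.n_mels, dims.n_audio_layer, dims.n_text_layer, dims.n_vocab)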

whisper/transcribe.py:

  • transcribe(): High-level transcription function with sliding window processing (see the sketch after this list)
  • cli(): Command-line interface implementation
  • Handles temperature fallback, quality thresholds (compression ratio, average log probability, no-speech probability), and output formatting
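
A minimal sketch of the high-level call described above (the audio path is a placeholder; fp16=False just avoids the FP16 warning on CPU):

import whisper

model = whisper.load_model("base")

# transcribe() loads the audio, slides 30-second windows over it, applies
# temperature fallback, and assembles timestamped segments
result = model.transcribe("audio.mp3", fp16=False)

print(result["text"])
for segment in result["segments"]:
    print(f'[{segment["start"]:.2f} -> {segment["end"]:.2f}]{segment["text"]}')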

whisper/decoding.py:

  • DecodingOptions/DecodingResult: Configuration and result classes
  • decode(): Core decoding logic with beam search and sampling strategies (see the sketch after this list)
  • detect_language(): Language identification functionality
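
The lower-level API can also be driven directly, mirroring the README example; this sketch assumes a clip of at most 30 seconds at audio.mp3:

import whisper

model = whisper.load_model("base")

# Prepare exactly one 30-second window (helpers from whisper/audio.py, below)
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Identify the spoken language
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Decode a single window with explicit options
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)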

whisper/audio.py: Audio preprocessing utilities, including mel-spectrogram computation and padding/trimming audio to 30-second windows
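
A short sketch of the preprocessing step on its own (most checkpoints use 80 mel bins, so the spectrogram shape below is (80, 3000); large-v3-based checkpoints use 128 bins):

import whisper
from whisper.audio import SAMPLE_RATE, N_SAMPLES

audio = whisper.load_audio("audio.mp3")   # mono float32 resampled to 16 kHz
audio = whisper.pad_or_trim(audio)        # exactly 30 s, i.e. N_SAMPLES samples
mel = whisper.log_mel_spectrogram(audio)  # (n_mels, 3000) tensor

print(SAMPLE_RATE, len(audio) == N_SAMPLES, mel.shape)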

whisper/tokenizer.py: BPE tokenization with special tokens for task specification (transcription vs translation) and language identification
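
A hedged sketch of the tokenizer helper (get_tokenizer lives in whisper/tokenizer.py; the language= and task= keyword names reflect recent releases and are worth double-checking there):

from whisper.tokenizer import get_tokenizer

# Multilingual BPE tokenizer configured for English transcription
tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

tokens = tokenizer.encode("hello world")
print(tokens, tokenizer.decode(tokens))

# The start-of-transcript sequence encodes the language and task tokens
print(tokenizer.sot_sequence)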

whisper/timing.py: Word-level timestamp alignment using cross-attention weights from specific attention heads
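
Word-level timestamps are surfaced through transcribe(); a minimal sketch (word_timestamps=True enables the cross-attention alignment path):

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3", word_timestamps=True, fp16=False)

# With word_timestamps=True, each segment carries a "words" list
for word in result["segments"][0]["words"]:
    print(f'{word["start"]:.2f}-{word["end"]:.2f}:{word["word"]}')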

whisper/normalizers/: Text normalizers (English-specific and basic multilingual) used to standardize transcripts and reference text, e.g. when computing word error rates during evaluation
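
A minimal sketch of the normalizers, whose main role is standardizing hypothesis and reference text before comparison:

from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer

english = EnglishTextNormalizer()
basic = BasicTextNormalizer()

# Lowercases, expands abbreviations/contractions, and strips punctuation
print(english("Mr. O'Neill couldn't attend, could he?"))
# Language-agnostic variant: lowercasing and symbol/punctuation stripping
print(basic("¡Hola, señor!"))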

Model Pipeline Flow

  1. Audio → Mel-spectrogram (whisper/audio.py)
  2. Spectrogram → Audio encoder features (whisper/model.py)
  3. Language detection via decoder (whisper/decoding.py)
  4. Text generation with task-specific tokens (whisper/transcribe.py)
  5. Optional word-level timestamp alignment (whisper/timing.py)

Available Models

Six model sizes with different accuracy/speed tradeoffs:

  • tiny, base, small, medium, large, turbo (listed at runtime by whisper.available_models(); see the sketch after this list)
  • English-only variants: tiny.en, base.en, small.en, medium.en (tend to perform better for English-only audio, especially at the smaller sizes)
  • Models auto-download to ~/.cache/whisper/
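
The registry can also be listed programmatically; a minimal sketch:

import whisper

# All names accepted by load_model(), including the .en variants
print(whisper.available_models())

model = whisper.load_model("tiny.en")  # auto-downloads to ~/.cache/whisper/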

Testing Structure

  • tests/conftest.py: pytest configuration with CUDA markers and random seeds
  • tests/jfk.flac: Reference audio file for integration tests
  • Tests cover audio processing, tokenization, normalization, timing, and transcription functionality

Code Style

  • Black formatter (88 char line length)
  • isort for import sorting (black profile)
  • flake8 linting with specific ignores for E203, E501, W503, W504
  • pre-commit hooks enforce consistency

Key Dependencies

  • PyTorch: Core ML framework
  • tiktoken: Fast BPE tokenization
  • numba: JIT compilation for the DTW word-timestamp alignment in whisper/timing.py
  • tqdm: Progress bars for model downloads and processing
  • triton: GPU kernel optimization (Linux x86_64)