# OpenAI Whisper Repository Issues Analysis
This document analyzes the top 5 most critical issues identified from the OpenAI Whisper repository discussions, commit history, and community reports. The analysis is based on actual GitHub discussions, bug fix commits, and user-reported problems.
## Issue #1: Hallucinations and Repetition Loops
### **Severity**: CRITICAL
### **Discussion References**: #679 (184 comments), commit 919a713, ba3f3cd, 38f2f4d
### **Impact**: High - Creates "ghost transcripts" and repetitive text
### Problem Description
Whisper creates false transcripts, especially at the end of audio files or after long silent gaps. The model gets stuck in repetition loops, particularly affecting Norwegian and German audio on medium/large models.
### Root Cause Analysis
- **Context Contamination**: The `condition_on_previous_text=True` parameter causes problems when the last chunk is short compared to previous context
- **Silent Gaps**: Long periods without speech (50+ minutes) cause the model to loop on the last spoken segment
- **Chunk Boundary Issues**: Problems arise at chunk transitions, especially in the final segments
### Solution Process
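Before the Lucid Whisper fix below, the lowest-effort mitigation is to turn off the `condition_on_previous_text` option named above and lean on Whisper's built-in silence and log-probability filters. A minimal sketch using only standard `transcribe()` parameters (the values shown are library defaults or illustrative, not tuned, and the input filename is hypothetical):

```python
import whisper

model = whisper.load_model("medium")

# Hedge against context contamination and silence-driven loops:
# disable conditioning on previous text and filter low-confidence segments.
result = model.transcribe(
    "meeting.wav",                      # hypothetical input file
    condition_on_previous_text=False,   # avoid carrying hallucinated context forward
    no_speech_threshold=0.6,            # default; raise it to drop more silent chunks
    logprob_threshold=-1.0,             # default; low-probability segments are re-decoded
    temperature=0.0,
)
print(result["text"])
```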
#### Immediate Fix - Lucid Whisper Approach
```python
# Implementation from Discussion #679
# whisper/transcribe.py - Replace line 178
def apply_lucid_whisper_fix(decode_options, all_tokens, prompt_reset_since,
                            seek, num_frames, N_FRAMES):
    """
    Prevents hallucinations by controlling context based on chunk position
    """
    lucid_threshold = 0.3  # Threshold for permissible chunk length
    if ((seek + N_FRAMES) / num_frames < 1.0) or (seek == 0):
        # First chunk, or next chunk fully within frames - safe to use context
        decode_options["prompt"] = all_tokens[prompt_reset_since:]
    else:
        # Last chunk - calculate lucid score to decide context usage
        lucid_score = (num_frames - seek) / N_FRAMES
        if lucid_score < lucid_threshold and "prompt" in decode_options:
            # Lucid score below threshold - erase context to prevent hallucination
            decode_options["prompt"] = []
        else:
            # Lucid score above threshold - keep context
            decode_options["prompt"] = all_tokens[prompt_reset_since:]
    return decode_options
```
#### VAD-based Solution
```python
# Voice Activity Detection approach from Discussion #679
import torch
import torchaudio
def preprocess_with_vad(audio_path):
    """
    Remove silent segments before transcription to prevent hallucinations
    """
    waveform, sample_rate = torchaudio.load(audio_path)
    # Silero VAD expects a mono waveform; it supports 8 kHz / 16 kHz input,
    # so resample first if the source uses another rate
    waveform = waveform.mean(dim=0)
    # Load Silero VAD (Voice Activity Detection) via torch.hub
    model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                  model='silero_vad',
                                  force_reload=True)
    (get_speech_timestamps,
     save_audio,
     read_audio,
     VADIterator,
     collect_chunks) = utils
    # Get speech timestamps
    speech_timestamps = get_speech_timestamps(waveform, model,
                                              sampling_rate=sample_rate)
    # Extract only speech segments
    if speech_timestamps:
        speech_audio = collect_chunks(speech_timestamps, waveform)
        return speech_audio
    else:
        return waveform

# Usage in transcription
def transcribe_with_vad(model, audio_path):
    clean_audio = preprocess_with_vad(audio_path)
    result = model.transcribe(clean_audio, condition_on_previous_text=False)
    return result
```
---
## Issue #2: Real-time Streaming and Performance Limitations
### **Severity**: HIGH
### **Discussion References**: #2 (92 comments), #937 (131 comments)
### **Impact**: Medium-High - Prevents real-time applications
### Problem Description
Whisper's architecture isn't designed for real-time streaming tasks. Users need websocket integration for streaming PCM data, but the 30-second window requirement makes this challenging.
### Root Cause Analysis
- **Fixed Window Size**: Whisper processes fixed 30-second windows, which is unsuitable for streaming (see the sketch after this list)
- **Model Architecture**: Encoder-decoder architecture requires complete audio segments
- **Memory Requirements**: Large models need significant GPU memory for real-time processing
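
The fixed window is easy to verify against the library's own constants: every input is padded or trimmed to `N_SAMPLES` (30 s at 16 kHz) before the mel spectrogram is computed, so short streaming buffers are mostly padding. A small sketch illustrating this (constants come from `whisper.audio`):

```python
import numpy as np
import whisper
from whisper.audio import N_SAMPLES, SAMPLE_RATE  # 480000 samples, 16000 Hz

# A 2-second buffer is still padded to a full 30-second window
two_seconds = np.zeros(2 * SAMPLE_RATE, dtype=np.float32)
padded = whisper.pad_or_trim(two_seconds)
print(padded.shape[-1] == N_SAMPLES)  # True: the encoder always sees 30 s of input
print(f"{2 * SAMPLE_RATE / N_SAMPLES:.0%} of the window carries real audio")
```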
### Solution Process
#### CTranslate2 Acceleration (from Discussion #937)
```python
# Accelerated Whisper with CTranslate2
import ctranslate2
import faster_whisper
import numpy as np

def setup_fast_whisper():
    """
    Setup accelerated Whisper for better real-time performance
    """
    # Use faster-whisper with the CTranslate2 backend
    model = faster_whisper.WhisperModel("large-v2", device="cuda", compute_type="float16")
    return model

def streaming_transcribe(model, audio_stream, chunk_duration=5):
    """
    Pseudo-streaming by processing shorter chunks
    """
    buffer = []
    buffered_samples = 0
    results = []
    for audio_chunk in audio_stream:
        buffer.append(audio_chunk)
        buffered_samples += len(audio_chunk)
        # Process once enough audio has accumulated (16 kHz sample rate)
        if buffered_samples >= chunk_duration * 16000:
            audio_data = np.concatenate(buffer)
            segments, info = model.transcribe(audio_data, beam_size=1)
            for segment in segments:
                results.append(segment.text)
                yield segment.text  # Stream results
            # Keep overlap for context
            overlap_samples = int(1 * 16000)  # 1 second overlap
            buffer = [audio_data[-overlap_samples:]]
            buffered_samples = overlap_samples
    return results
```
#### WebSocket Integration
```python
# Real-time WebSocket handler
import asyncio
import websockets
import json
import numpy as np
class WhisperWebSocketServer:
    def __init__(self, model):
        self.model = model
        self.audio_buffer = np.array([], dtype=np.float32)

    async def handle_audio_stream(self, websocket, path):
        """
        Handle streaming audio from WebSocket
        """
        try:
            async for message in websocket:
                data = json.loads(message)
                if data['type'] == 'audio':
                    # Decode PCM data
                    audio_data = np.array(data['audio'], dtype=np.float32)
                    self.audio_buffer = np.concatenate([self.audio_buffer, audio_data])
                    # Process if the buffer is large enough (5 seconds at 16 kHz)
                    if len(self.audio_buffer) >= 5 * 16000:
                        result = await self.process_chunk(self.audio_buffer)
                        await websocket.send(json.dumps({
                            'type': 'transcription',
                            'text': result
                        }))
                        # Keep 1 second overlap
                        self.audio_buffer = self.audio_buffer[-16000:]
        except websockets.exceptions.ConnectionClosed:
            pass

    async def process_chunk(self, audio_data):
        """
        Process audio chunk asynchronously
        """
        loop = asyncio.get_event_loop()
        # faster-whisper returns (segments, info) rather than a dict; join the segment texts
        segments, _info = await loop.run_in_executor(
            None, self.model.transcribe, audio_data
        )
        return " ".join(segment.text for segment in segments)

# Start WebSocket server
def start_streaming_server():
    model = setup_fast_whisper()
    server = WhisperWebSocketServer(model)
    start_server = websockets.serve(
        server.handle_audio_stream, "localhost", 8765
    )
    asyncio.get_event_loop().run_until_complete(start_server)
    asyncio.get_event_loop().run_forever()
```
---
## Issue #3: Fine-tuning and Training Code Unavailability
### **Severity**: MEDIUM-HIGH
### **Discussion References**: #64 (113 comments), #759 (79 comments)
### **Impact**: High - Limits model customization
### Problem Description
OpenAI hasn't released the training code for Whisper models, preventing users from fine-tuning for specific domains, languages, or use cases.
### Root Cause Analysis
- **Proprietary Training Pipeline**: OpenAI maintains training code internally
- **Dataset Dependencies**: Training requires massive multilingual datasets
- **Computational Requirements**: Training requires significant computational resources
### Solution Process
#### Community Fine-tuning Framework
```python
# Fine-tuning setup using Hugging Face transformers
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    TrainingArguments,
    Trainer
)
import torch
from torch.utils.data import Dataset
import whisper  # needed for load_audio / pad_or_trim below

class WhisperDataset(Dataset):
    def __init__(self, audio_files, transcriptions, processor):
        self.audio_files = audio_files
        self.transcriptions = transcriptions
        self.processor = processor

    def __len__(self):
        return len(self.audio_files)

    def __getitem__(self, idx):
        audio = whisper.load_audio(self.audio_files[idx])
        audio = whisper.pad_or_trim(audio)
        # Process audio
        input_features = self.processor(
            audio, sampling_rate=16000, return_tensors="pt"
        ).input_features[0]
        # Process transcription
        labels = self.processor.tokenizer(
            self.transcriptions[idx],
            return_tensors="pt"
        ).input_ids[0]
        return {
            "input_features": input_features,
            "labels": labels
        }

def setup_fine_tuning():
    """
    Setup fine-tuning environment for domain-specific adaptation
    """
    # Load pre-trained model
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    # Training arguments
    training_args = TrainingArguments(
        output_dir="./whisper-finetuned",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        warmup_steps=500,
        max_steps=5000,
        learning_rate=1e-5,
        fp16=True,
        evaluation_strategy="steps",  # requires passing an eval_dataset to the Trainer
        eval_steps=500,
        save_steps=1000,
        logging_steps=25,
    )
    return processor, model, training_args

def fine_tune_whisper(audio_files, transcriptions):
    """
    Fine-tune Whisper on custom dataset
    """
    processor, model, training_args = setup_fine_tuning()
    # Create dataset
    dataset = WhisperDataset(audio_files, transcriptions, processor)
    # Initialize trainer (a padding data collator is advisable for variable-length labels)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        tokenizer=processor.feature_extractor,
    )
    # Start fine-tuning
    trainer.train()
    # Save fine-tuned model
    trainer.save_model()
    return model
```
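After fine-tuning, it is worth measuring word error rate on a held-out set before deploying the adapted model. A minimal sketch using the third-party `jiwer` package (an assumption here, not something the fine-tuning code above requires):

```python
import jiwer

def word_error_rate(references, hypotheses):
    """Compute corpus-level WER between reference and predicted transcripts."""
    return jiwer.wer(references, hypotheses)

# Illustrative held-out examples
refs = ["the quick brown fox", "hello world"]
hyps = ["the quick brown fox", "hello word"]
print(f"WER: {word_error_rate(refs, hyps):.2%}")
```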
#### Domain Adaptation Strategy
```python
# Domain-specific adaptation without full retraining
def create_domain_adapter():
    """
    Create adapter layers for domain-specific fine-tuning
    """
    import torch.nn as nn

    class WhisperAdapter(nn.Module):
        def __init__(self, original_model, adapter_dim=64):
            super().__init__()
            self.original_model = original_model
            self.adapter_dim = adapter_dim
            # Add adapter layers (bottleneck projections, one per linear layer)
            self.adapters = nn.ModuleDict()
            for name, module in original_model.named_modules():
                if isinstance(module, nn.Linear):
                    # ModuleDict keys may not contain '.', so sanitize the module name
                    self.adapters[name.replace('.', '_')] = nn.Sequential(
                        nn.Linear(module.in_features, adapter_dim),
                        nn.ReLU(),
                        nn.Linear(adapter_dim, module.out_features)
                    )

        def forward(self, *args, **kwargs):
            # Sketch only: the adapters still need to be wired into the original
            # model's forward pass (e.g. via forward hooks) before they have any effect
            return self.original_model(*args, **kwargs)

    return WhisperAdapter
```
---
## Issue #4: Memory Issues and Model Performance
### **Severity**: MEDIUM
### **Discussion References**: #5 (25 comments), commit analysis
### **Impact**: Medium - Affects scalability
### Problem Description
Large Whisper models consume significant GPU memory, and processing long audio files can cause memory overflow or slow performance.
### Root Cause Analysis
- **Model Size**: The large models require 10 GB+ of VRAM (see the sizing sketch after this list)
- **Batch Processing**: Memory accumulates with long audio files
- **Inefficient Caching**: Attention caches grow with sequence length
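
A practical consequence of the first point is that the model size should be chosen against the VRAM that is actually available. The sketch below picks a size using rough thresholds (heuristic cut-offs, not official requirements):

```python
import torch

def pick_whisper_model_size():
    """Pick a Whisper model size from available VRAM (rough heuristic, not an official table)."""
    if not torch.cuda.is_available():
        return "base"  # CPU fallback: favor a small model
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 10:
        return "large"
    if total_gb >= 5:
        return "medium"
    if total_gb >= 2:
        return "small"
    return "base"

print(pick_whisper_model_size())
```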
### Solution Process
#### Memory-Efficient Processing
```python
import whisper
import torch

def memory_efficient_transcribe(model, audio_path, max_memory_mb=4000):
    """
    Process large audio files with memory constraints
    """
    import psutil
    import gc

    audio = whisper.load_audio(audio_path)
    duration = len(audio) / 16000  # seconds
    # Calculate optimal chunk size based on available memory
    available_memory = psutil.virtual_memory().available / (1024 * 1024)  # MB
    safe_memory = min(max_memory_mb, available_memory * 0.7)  # Use 70% of available
    # Estimate chunk duration based on memory
    chunk_duration = min(30, max(10, safe_memory / 200))  # Heuristic
    chunk_samples = int(chunk_duration * 16000)
    results = []
    for i in range(0, len(audio), chunk_samples):
        chunk = audio[i:i + chunk_samples]
        # Clear memory before processing
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()
        # Process chunk
        result = model.transcribe(chunk, fp16=False)  # Use fp32 for stability
        results.append(result['text'])
        print(f"Processed {i//chunk_samples + 1}/{(len(audio)-1)//chunk_samples + 1}")
    return ' '.join(results)

# Memory monitoring
def monitor_memory_usage():
    """
    Monitor memory usage during transcription
    """
    import psutil
    process = psutil.Process()
    memory_info = process.memory_info()
    print(f"RSS Memory: {memory_info.rss / 1024 / 1024:.1f} MB")
    print(f"VMS Memory: {memory_info.vms / 1024 / 1024:.1f} MB")
    if torch.cuda.is_available():
        gpu_memory = torch.cuda.memory_allocated()
        gpu_cached = torch.cuda.memory_reserved()
        print(f"GPU Memory: {gpu_memory / 1024 / 1024:.1f} MB")
        print(f"GPU Cached: {gpu_cached / 1024 / 1024:.1f} MB")
```
#### Model Optimization
```python
import torch

def optimize_model_for_memory(model):
    """
    Optimize model for lower memory usage
    (assumes a Hugging Face WhisperForConditionalGeneration-style model
    exposing model.model.encoder / model.model.decoder)
    """
    # Use gradient checkpointing
    model.model.encoder.gradient_checkpointing = True
    model.model.decoder.gradient_checkpointing = True
    # Enable mixed precision
    if torch.cuda.is_available():
        model = model.half()
    # Optimize attention
    try:
        from torch.nn.functional import scaled_dot_product_attention
        # Enable flash attention if available
        torch.backends.cuda.enable_flash_sdp(True)
    except ImportError:
        pass
    return model
```
---
## Issue #5: Language-Specific and Pronunciation Issues
### **Severity**: MEDIUM
### **Discussion References**: #25 (6 comments), #16 (13 comments)
### **Impact**: Medium - Affects non-English users
### Problem Description
Whisper struggles with specific languages (Chinese variants, Serbo-Croatian), pronunciation variations, and code-switching scenarios.
### Root Cause Analysis
- **Training Data Imbalance**: Less representation for some languages
- **Dialect Variations**: Similar languages treated as single categories
- **Phonetic Similarities**: Confusion between related languages
### Solution Process
#### Language-Specific Processing
```python
import whisper

def language_aware_transcribe(model, audio_path, target_language=None):
    """
    Enhanced transcription with language-specific optimizations
    """
    audio = whisper.load_audio(audio_path)
    # Language detection with confidence (detection uses a padded 30-second window)
    mel = whisper.log_mel_spectrogram(
        whisper.pad_or_trim(audio), n_mels=model.dims.n_mels
    ).to(model.device)
    _, probs = model.detect_language(mel)
    if target_language is None:
        # Use detected language
        detected_lang = max(probs, key=probs.get)
        confidence = probs[detected_lang]
        if confidence < 0.7:
            # Low confidence - try multiple languages
            return multi_language_transcribe(model, audio, probs)
        target_language = detected_lang
    # Language-specific parameters
    lang_config = get_language_config(target_language)
    result = model.transcribe(
        audio,
        language=target_language,
        **lang_config
    )
    # Post-process for language-specific corrections
    result['text'] = apply_language_corrections(result['text'], target_language)
    return result

def get_language_config(language):
    """
    Get language-specific transcription parameters
    """
    configs = {
        'zh': {  # Chinese
            'temperature': 0.0,  # More deterministic
            'compression_ratio_threshold': 2.8,  # Higher threshold
            'condition_on_previous_text': False  # Reduce context confusion
        },
        'sr': {  # Serbian
            'temperature': 0.2,
            'initial_prompt': "Говори јасно.",  # "Speak clearly" in Serbian
        },
        'hr': {  # Croatian
            'temperature': 0.2,
            'initial_prompt': "Govorite jasno.",  # "Speak clearly" in Croatian
        },
        'de': {  # German
            'temperature': 0.1,
            'condition_on_previous_text': False,  # Reduce hallucinations
        }
    }
    return configs.get(language, {})

def apply_language_corrections(text, language):
    """
    Apply language-specific post-processing corrections
    """
    corrections = {
        'zh': [
            # Normalize full-width Chinese punctuation to ASCII equivalents
            ('，', ', '),
            ('。', '. '),
            ('？', '? '),
            ('！', '! ')
        ],
        'de': [
            # German-specific corrections
            (' ß ', 'ß'),
            (' ä ', 'ä'),
            (' ö ', 'ö'),
            (' ü ', 'ü')
        ]
    }
    if language in corrections:
        for wrong, correct in corrections[language]:
            text = text.replace(wrong, correct)
    return text
```
#### Multi-language Detection
```python
def multi_language_transcribe(model, audio, language_probs, threshold=0.1):
    """
    Handle audio with multiple languages or uncertain detection
    """
    # Get top languages above threshold
    candidate_languages = {
        lang: prob for lang, prob in language_probs.items()
        if prob > threshold
    }
    results = {}
    for language, prob in candidate_languages.items():
        try:
            result = model.transcribe(audio, language=language, temperature=0.0)
            # Calculate quality score
            quality_score = calculate_transcription_quality(result)
            results[language] = {
                'text': result['text'],
                'language_prob': prob,
                'quality_score': quality_score,
                'combined_score': prob * quality_score
            }
        except Exception as e:
            print(f"Failed to transcribe in {language}: {e}")
    # Return best result
    if results:
        best_language = max(results.keys(), key=lambda x: results[x]['combined_score'])
        return results[best_language]
    else:
        # Fallback to auto-detection
        return model.transcribe(audio)

def calculate_transcription_quality(result):
    """
    Calculate transcription quality heuristics
    """
    text = result['text']
    # Basic quality indicators
    word_count = len(text.split())
    char_diversity = len(set(text.lower())) / max(len(text), 1)
    # Penalize very short or very long outputs
    length_score = 1.0
    if word_count < 3:
        length_score *= 0.5
    elif word_count > 200:
        length_score *= 0.8
    # Reward character diversity
    diversity_score = min(char_diversity * 2, 1.0)
    return length_score * diversity_score
```
---
## Summary and Implementation Priorities
### Critical Actions (Week 1)
1. **Implement hallucination fixes** - Apply Lucid Whisper approach and VAD preprocessing
2. **Setup memory monitoring** - Implement memory-efficient processing for production use
### High Priority (Week 2-3)
3. **Real-time optimization** - Integrate CTranslate2 acceleration and streaming capabilities
4. **Language-specific processing** - Add language detection confidence and post-processing
### Medium Priority (Month 1)
5. **Fine-tuning framework** - Setup domain adaptation infrastructure
### Repository-Specific Recommendations
Based on the actual issues from the OpenAI Whisper repository:
1. **Monitor Discussion #679** - Stay updated on hallucination solutions from the community
2. **Implement commits ba3f3cd and 919a713** - These contain official fixes for repetition issues
3. **Consider CTranslate2 integration** - As suggested in Discussion #937 for better performance
4. **Use VAD preprocessing** - Multiple discussions recommend this for better accuracy
5. **Test with problematic languages** - Focus on German, Norwegian, and Chinese variants
This analysis provides actionable solutions based on real user problems and community-developed fixes from the OpenAI Whisper repository.