# OpenAI Whisper Repository Issues Analysis

This document analyzes the five most critical issues identified from the OpenAI Whisper repository discussions, commit history, and community reports. The analysis is based on actual GitHub discussions, bug-fix commits, and user-reported problems.

## Issue #1: Hallucinations and Repetition Loops

### **Severity**: CRITICAL
### **Discussion References**: #679 (184 comments), commits 919a713, ba3f3cd, 38f2f4d
### **Impact**: High - Creates "ghost transcripts" and repetitive text

### Problem Description

Whisper generates false "ghost" transcripts, especially at the end of audio files or after long silent gaps. The model can get stuck in repetition loops, particularly on Norwegian and German audio with the medium and large models.

### Root Cause Analysis

- **Context Contamination**: The `condition_on_previous_text=True` parameter causes problems when the last chunk is short compared to the previous context (a parameter-level mitigation using the public API is sketched below)
- **Silent Gaps**: Long periods without speech (50+ minutes) cause the model to loop on the last spoken segment
- **Chunk Boundary Issues**: Problems arise at chunk transitions, especially in the final segments
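
As noted above, the decoding safeguards that `transcribe()` already exposes can be tightened before patching any source code. A minimal sketch, assuming a hypothetical `meeting.wav` input; the threshold values are illustrative, not tuned:

```python
import whisper

model = whisper.load_model("medium")

# Tighten the safeguards Whisper already exposes instead of patching the code
result = model.transcribe(
    "meeting.wav",
    condition_on_previous_text=False,  # avoid context contamination across chunks
    temperature=(0.0, 0.2, 0.4),       # fewer high-temperature fallback retries
    compression_ratio_threshold=2.4,   # flag highly repetitive (compressible) output
    logprob_threshold=-1.0,            # reject segments with low average log probability
    no_speech_threshold=0.6,           # trust the no-speech detector over silent gaps
)
print(result["text"])
```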

### Solution Process

#### Immediate Fix - Lucid Whisper Approach
```python
# Implementation from Discussion #679
# whisper/transcribe.py - Replace line 178

def apply_lucid_whisper_fix(decode_options, all_tokens, prompt_reset_since,
                            seek, num_frames, N_FRAMES):
    """
    Prevents hallucinations by controlling context based on chunk position
    """
    lucid_threshold = 0.3  # Threshold for permissible chunk length

    if ((seek + N_FRAMES) / num_frames < 1.0) or (seek == 0):
        # First chunk, or next chunk fully within frames - safe to use context
        decode_options["prompt"] = all_tokens[prompt_reset_since:]
    else:
        # Last chunk - calculate lucid score to decide context usage
        lucid_score = (num_frames - seek) / N_FRAMES
        if lucid_score < lucid_threshold and "prompt" in decode_options:
            # Lucid score below threshold - erase context to prevent hallucination
            decode_options["prompt"] = []
        else:
            # Lucid score above threshold - keep context
            decode_options["prompt"] = all_tokens[prompt_reset_since:]

    return decode_options
```

#### VAD-based Solution
```python
# Voice Activity Detection approach from Discussion #679
import torch
import torchaudio


def preprocess_with_vad(audio_path):
    """
    Remove silent segments before transcription to prevent hallucinations
    """
    waveform, sample_rate = torchaudio.load(audio_path)
    waveform = waveform.mean(dim=0)  # Silero VAD expects mono audio (8 kHz or 16 kHz)

    # Load the Silero VAD model via torch.hub
    model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                  model='silero_vad',
                                  force_reload=True)

    (get_speech_timestamps,
     save_audio,
     read_audio,
     VADIterator,
     collect_chunks) = utils

    # Get speech timestamps
    speech_timestamps = get_speech_timestamps(waveform, model,
                                              sampling_rate=sample_rate)

    # Extract only speech segments
    if speech_timestamps:
        speech_audio = collect_chunks(speech_timestamps, waveform)
        return speech_audio
    else:
        return waveform


# Usage in transcription
def transcribe_with_vad(model, audio_path):
    clean_audio = preprocess_with_vad(audio_path)
    result = model.transcribe(clean_audio, condition_on_previous_text=False)
    return result
```

---

## Issue #2: Real-time Streaming and Performance Limitations

### **Severity**: HIGH
### **Discussion References**: #2 (92 comments), #937 (131 comments)
### **Impact**: Medium-High - Prevents real-time applications

### Problem Description

Whisper's architecture is not designed for real-time streaming. Users want WebSocket integration for streaming PCM data, but the fixed 30-second processing window makes this challenging.

### Root Cause Analysis

- **Fixed Window Size**: Whisper processes fixed 30-second windows, which is not suitable for streaming (see the sketch after this list)
- **Model Architecture**: The encoder-decoder architecture requires complete audio segments
- **Memory Requirements**: Large models need significant GPU memory for real-time processing
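
The fixed window is visible in the low-level API: `whisper.decode` consumes a log-mel spectrogram that has been padded or trimmed to exactly 30 seconds, so arbitrary-length streaming input must be re-chunked before it reaches the model. A minimal sketch, assuming a hypothetical `short_clip.wav`:

```python
import whisper

model = whisper.load_model("base")

audio = whisper.load_audio("short_clip.wav")   # e.g. a 3-second clip
audio = whisper.pad_or_trim(audio)             # always padded/trimmed to a 30 s window
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)   # one full 30 s window per decode call
print(result.text)
```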

### Solution Process

#### CTranslate2 Acceleration (from Discussion #937)
```python
# Accelerated Whisper with CTranslate2
import ctranslate2
import faster_whisper
import numpy as np


def setup_fast_whisper():
    """
    Setup accelerated Whisper for better real-time performance
    """
    # Use faster-whisper with CTranslate2 backend
    model = faster_whisper.WhisperModel("large-v2", device="cuda", compute_type="float16")
    return model


def streaming_transcribe(model, audio_stream, chunk_duration=5):
    """
    Pseudo-streaming by processing shorter chunks
    """
    buffer = []
    results = []

    for audio_chunk in audio_stream:
        buffer.append(audio_chunk)

        # Process once the buffered samples cover the chunk duration
        if sum(len(chunk) for chunk in buffer) >= chunk_duration * 16000:  # 16 kHz sample rate
            audio_data = np.concatenate(buffer)
            segments, info = model.transcribe(audio_data, beam_size=1)

            for segment in segments:
                results.append(segment.text)
                yield segment.text  # Stream results

            # Keep overlap for context
            overlap_samples = int(1 * 16000)  # 1 second overlap
            buffer = [audio_data[-overlap_samples:]]

    return results
```

#### WebSocket Integration
```python
# Real-time WebSocket handler
import asyncio
import websockets
import json
import numpy as np


class WhisperWebSocketServer:
    def __init__(self, model):
        self.model = model
        self.audio_buffer = np.array([], dtype=np.float32)

    async def handle_audio_stream(self, websocket, path):
        """
        Handle streaming audio from WebSocket
        """
        try:
            async for message in websocket:
                data = json.loads(message)

                if data['type'] == 'audio':
                    # Decode PCM data
                    audio_data = np.array(data['audio'], dtype=np.float32)
                    self.audio_buffer = np.concatenate([self.audio_buffer, audio_data])

                    # Process if buffer is large enough (5 seconds)
                    if len(self.audio_buffer) >= 5 * 16000:
                        result = await self.process_chunk(self.audio_buffer)
                        await websocket.send(json.dumps({
                            'type': 'transcription',
                            'text': result
                        }))

                        # Keep 1 second overlap
                        self.audio_buffer = self.audio_buffer[-16000:]

        except websockets.exceptions.ConnectionClosed:
            pass

    async def process_chunk(self, audio_data):
        """
        Process audio chunk asynchronously in a thread pool
        """
        loop = asyncio.get_event_loop()

        def _transcribe():
            # faster-whisper returns a generator of segments plus metadata
            segments, _info = self.model.transcribe(audio_data, beam_size=1)
            return " ".join(segment.text for segment in segments)

        return await loop.run_in_executor(None, _transcribe)


# Start WebSocket server
def start_streaming_server():
    model = setup_fast_whisper()
    server = WhisperWebSocketServer(model)

    start_server = websockets.serve(
        server.handle_audio_stream, "localhost", 8765
    )

    asyncio.get_event_loop().run_until_complete(start_server)
    asyncio.get_event_loop().run_forever()
```
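
On the client side, PCM frames can be pushed to the server above with the same `websockets` package. A minimal sketch under the assumptions of the server code (16 kHz float32 mono, the same JSON message shape); the frame source is hypothetical and left abstract:

```python
import asyncio
import json
import numpy as np
import websockets


async def stream_audio(frames):
    """Send float32 PCM frames (16 kHz mono) and print the transcription reply."""
    async with websockets.connect("ws://localhost:8765") as ws:
        for frame in frames:  # e.g. short numpy chunks from a capture loop
            await ws.send(json.dumps({
                "type": "audio",
                "audio": np.asarray(frame, dtype=np.float32).tolist(),
            }))
        # The server only replies once it has buffered about five seconds of audio
        reply = json.loads(await ws.recv())
        print(reply["text"])

# asyncio.run(stream_audio(my_frames))  # my_frames: hypothetical iterable of PCM chunks
```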

---

## Issue #3: Fine-tuning and Training Code Unavailability

### **Severity**: MEDIUM-HIGH
### **Discussion References**: #64 (113 comments), #759 (79 comments)
### **Impact**: High - Limits model customization

### Problem Description

OpenAI has not released the training code for the Whisper models, preventing users from fine-tuning them for specific domains, languages, or use cases.

### Root Cause Analysis

- **Proprietary Training Pipeline**: OpenAI maintains training code internally
- **Dataset Dependencies**: Training requires massive multilingual datasets
- **Computational Requirements**: Training requires significant computational resources

### Solution Process

#### Community Fine-tuning Framework
```python
# Fine-tuning setup using Hugging Face transformers
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    TrainingArguments,
    Trainer
)
import torch
from torch.utils.data import Dataset
import whisper


class WhisperDataset(Dataset):
    def __init__(self, audio_files, transcriptions, processor):
        self.audio_files = audio_files
        self.transcriptions = transcriptions
        self.processor = processor

    def __len__(self):
        return len(self.audio_files)

    def __getitem__(self, idx):
        audio = whisper.load_audio(self.audio_files[idx])
        audio = whisper.pad_or_trim(audio)

        # Process audio
        input_features = self.processor(
            audio, sampling_rate=16000, return_tensors="pt"
        ).input_features[0]

        # Process transcription
        labels = self.processor.tokenizer(
            self.transcriptions[idx],
            return_tensors="pt"
        ).input_ids[0]

        return {
            "input_features": input_features,
            "labels": labels
        }


def setup_fine_tuning():
    """
    Setup fine-tuning environment for domain-specific adaptation
    """
    # Load pre-trained model
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    # Training arguments
    training_args = TrainingArguments(
        output_dir="./whisper-finetuned",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        warmup_steps=500,
        max_steps=5000,
        learning_rate=1e-5,
        fp16=True,
        evaluation_strategy="steps",
        eval_steps=500,
        save_steps=1000,
        logging_steps=25,
    )

    return processor, model, training_args


def fine_tune_whisper(audio_files, transcriptions):
    """
    Fine-tune Whisper on custom dataset
    """
    processor, model, training_args = setup_fine_tuning()

    # Create dataset
    dataset = WhisperDataset(audio_files, transcriptions, processor)

    # Initialize trainer (a padding data collator is needed for variable-length labels)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        tokenizer=processor.feature_extractor,
    )

    # Start fine-tuning
    trainer.train()

    # Save fine-tuned model
    trainer.save_model()
    return model
```
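
One gap in the sketch above: `Trainer` needs a data collator that pads variable-length features and labels within a batch, otherwise batching will fail. A minimal sketch of such a collator following the common Hugging Face pattern (the class name is illustrative):

```python
from dataclasses import dataclass
from transformers import WhisperProcessor


@dataclass
class DataCollatorWhisperPadding:
    processor: WhisperProcessor

    def __call__(self, features):
        # Pad the log-mel input features to a uniform shape
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the token labels and mask padded positions out of the loss with -100
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )

        batch["labels"] = labels
        return batch
```

It would be passed as `Trainer(..., data_collator=DataCollatorWhisperPadding(processor))`.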

#### Domain Adaptation Strategy
```python
# Domain-specific adaptation without full retraining
def create_domain_adapter():
    """
    Create adapter layers for domain-specific fine-tuning
    """
    import torch.nn as nn

    class WhisperAdapter(nn.Module):
        def __init__(self, original_model, adapter_dim=64):
            super().__init__()
            self.original_model = original_model
            self.adapter_dim = adapter_dim

            # Add adapter layers for every linear projection in the model
            self.adapters = nn.ModuleDict()
            for name, module in original_model.named_modules():
                if isinstance(module, nn.Linear):
                    # ModuleDict keys may not contain dots
                    self.adapters[name.replace(".", "_")] = nn.Sequential(
                        nn.Linear(module.in_features, adapter_dim),
                        nn.ReLU(),
                        nn.Linear(adapter_dim, module.out_features)
                    )

        def forward(self, *args, **kwargs):
            # NOTE: the adapters above are only registered here; wiring them into
            # the forward pass (e.g. via forward hooks) is left to the caller
            return self.original_model(*args, **kwargs)

    return WhisperAdapter
```
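
Since the wrapper above only registers the adapters without applying them, here is a minimal sketch of a bottleneck adapter that actually modifies activations, wrapping a single `nn.Linear` (generic PyTorch, not Whisper-specific):

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: y = layer(x) + up(relu(down(layer(x))))."""

    def __init__(self, layer: nn.Linear, adapter_dim: int = 64):
        super().__init__()
        self.layer = layer                      # frozen pre-trained projection
        self.down = nn.Linear(layer.out_features, adapter_dim)
        self.up = nn.Linear(adapter_dim, layer.out_features)
        nn.init.zeros_(self.up.weight)          # start as an identity mapping
        nn.init.zeros_(self.up.bias)
        for p in self.layer.parameters():
            p.requires_grad = False             # only the adapter weights train

    def forward(self, x):
        h = self.layer(x)
        return h + self.up(torch.relu(self.down(h)))

# Usage: replace a projection in the model with BottleneckAdapter(projection)
```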

---

## Issue #4: Memory Issues and Model Performance

### **Severity**: MEDIUM
### **Discussion References**: #5 (25 comments), commit analysis
### **Impact**: Medium - Affects scalability

### Problem Description

Large Whisper models consume significant GPU memory, and processing long audio files can cause out-of-memory errors or slow performance.

### Root Cause Analysis

- **Model Size**: Large models require 10GB+ of VRAM (a checkpoint-selection sketch follows this list)
- **Batch Processing**: Memory accumulates with long audio files
- **Inefficient Caching**: Attention caches grow with sequence length
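
As noted above, a practical first step is picking the checkpoint that fits the memory actually available. A minimal sketch; the VRAM thresholds are rough rules of thumb based on the approximate figures in the Whisper README, not exact requirements:

```python
import torch
import whisper


def pick_model_for_gpu():
    """Choose a Whisper checkpoint based on free GPU memory (rough heuristics)."""
    if not torch.cuda.is_available():
        return whisper.load_model("base")          # CPU fallback

    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024 ** 3

    if free_gb >= 10:
        name = "large"
    elif free_gb >= 5:
        name = "medium"
    elif free_gb >= 2:
        name = "small"
    else:
        name = "base"
    return whisper.load_model(name, device="cuda")
```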

### Solution Process

#### Memory-Efficient Processing
```python
import gc

import psutil
import torch
import whisper


def memory_efficient_transcribe(model, audio_path, max_memory_mb=4000):
    """
    Process large audio files with memory constraints
    """
    audio = whisper.load_audio(audio_path)
    duration = len(audio) / 16000  # seconds

    # Calculate optimal chunk size based on available memory
    available_memory = psutil.virtual_memory().available / (1024 * 1024)  # MB
    safe_memory = min(max_memory_mb, available_memory * 0.7)  # Use 70% of available

    # Estimate chunk duration based on memory
    chunk_duration = min(30, max(10, safe_memory / 200))  # Heuristic
    chunk_samples = int(chunk_duration * 16000)

    results = []
    for i in range(0, len(audio), chunk_samples):
        chunk = audio[i:i + chunk_samples]

        # Clear memory before processing
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

        # Process chunk
        result = model.transcribe(chunk, fp16=False)  # Use fp32 for stability
        results.append(result['text'])

        print(f"Processed {i // chunk_samples + 1}/{(len(audio) - 1) // chunk_samples + 1}")

    return ' '.join(results)


# Memory monitoring
def monitor_memory_usage():
    """
    Monitor memory usage during transcription
    """
    process = psutil.Process()
    memory_info = process.memory_info()

    print(f"RSS Memory: {memory_info.rss / 1024 / 1024:.1f} MB")
    print(f"VMS Memory: {memory_info.vms / 1024 / 1024:.1f} MB")

    if torch.cuda.is_available():
        gpu_memory = torch.cuda.memory_allocated()
        gpu_cached = torch.cuda.memory_reserved()
        print(f"GPU Memory: {gpu_memory / 1024 / 1024:.1f} MB")
        print(f"GPU Cached: {gpu_cached / 1024 / 1024:.1f} MB")
```

#### Model Optimization
```python
import torch


def optimize_model_for_memory(model):
    """
    Optimize model for lower memory usage
    """
    # Use gradient checkpointing (Hugging Face-style encoder/decoder attributes)
    model.model.encoder.gradient_checkpointing = True
    model.model.decoder.gradient_checkpointing = True

    # Enable mixed precision
    if torch.cuda.is_available():
        model = model.half()

    # Optimize attention
    try:
        from torch.nn.functional import scaled_dot_product_attention  # noqa: F401
        # Enable flash attention if available
        torch.backends.cuda.enable_flash_sdp(True)
    except ImportError:
        pass

    return model
```

---

## Issue #5: Language-Specific and Pronunciation Issues

### **Severity**: MEDIUM
### **Discussion References**: #25 (6 comments), #16 (13 comments)
### **Impact**: Medium - Affects non-English users

### Problem Description

Whisper struggles with certain languages (Chinese variants, Serbo-Croatian), pronunciation variation, and code-switching scenarios.

### Root Cause Analysis

- **Training Data Imbalance**: Some languages are under-represented in the training data
- **Dialect Variations**: Closely related languages and dialects are treated as single categories
- **Phonetic Similarities**: Confusion between related languages (the detection sketch below makes this visible)
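
As noted in the list above, confusion between closely related languages can be inspected directly with the public `detect_language` API before committing to a transcription language. A minimal sketch, assuming a hypothetical `clip.wav`:

```python
import whisper

model = whisper.load_model("small")

audio = whisper.load_audio("clip.wav")
audio = whisper.pad_or_trim(audio)               # detect_language expects a 30 s window
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

_, probs = model.detect_language(mel)
top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:5]
for lang, p in top:
    print(f"{lang}: {p:.2%}")   # related languages (e.g. sr/hr) often share probability mass
```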

### Solution Process

#### Language-Specific Processing
```python
import whisper


def language_aware_transcribe(model, audio_path, target_language=None):
    """
    Enhanced transcription with language-specific optimizations
    """
    audio = whisper.load_audio(audio_path)

    # Language detection with confidence (detect_language expects a 30-second window)
    mel = whisper.log_mel_spectrogram(
        whisper.pad_or_trim(audio), n_mels=model.dims.n_mels
    ).to(model.device)
    _, probs = model.detect_language(mel)

    if target_language is None:
        # Use detected language
        detected_lang = max(probs, key=probs.get)
        confidence = probs[detected_lang]

        if confidence < 0.7:
            # Low confidence - try multiple languages
            return multi_language_transcribe(model, audio, probs)

        target_language = detected_lang

    # Language-specific parameters
    lang_config = get_language_config(target_language)

    result = model.transcribe(
        audio,
        language=target_language,
        **lang_config
    )

    # Post-process for language-specific corrections
    result['text'] = apply_language_corrections(result['text'], target_language)

    return result


def get_language_config(language):
    """
    Get language-specific transcription parameters
    """
    configs = {
        'zh': {  # Chinese
            'temperature': 0.0,  # More deterministic
            'compression_ratio_threshold': 2.8,  # Higher threshold
            'condition_on_previous_text': False  # Reduce context confusion
        },
        'sr': {  # Serbian
            'temperature': 0.2,
            'initial_prompt': "Говори јасно.",  # "Speak clearly" in Serbian
        },
        'hr': {  # Croatian
            'temperature': 0.2,
            'initial_prompt': "Govorite jasno.",  # "Speak clearly" in Croatian
        },
        'de': {  # German
            'temperature': 0.1,
            'condition_on_previous_text': False,  # Reduce hallucinations
        }
    }

    return configs.get(language, {})


def apply_language_corrections(text, language):
    """
    Apply language-specific post-processing corrections
    """
    corrections = {
        'zh': [
            # Chinese-specific punctuation normalization
            ('，', ', '),
            ('。', '. '),
            ('？', '? '),
            ('！', '! ')
        ],
        'de': [
            # German-specific corrections for stray spaces around umlauts/eszett
            (' ß ', 'ß'),
            (' ä ', 'ä'),
            (' ö ', 'ö'),
            (' ü ', 'ü')
        ]
    }

    if language in corrections:
        for wrong, correct in corrections[language]:
            text = text.replace(wrong, correct)

    return text
```

#### Multi-language Detection
```python
def multi_language_transcribe(model, audio, language_probs, threshold=0.1):
    """
    Handle audio with multiple languages or uncertain detection
    """
    # Get top languages above threshold
    candidate_languages = {
        lang: prob for lang, prob in language_probs.items()
        if prob > threshold
    }

    results = {}

    for language, prob in candidate_languages.items():
        try:
            result = model.transcribe(audio, language=language, temperature=0.0)

            # Calculate quality score
            quality_score = calculate_transcription_quality(result)

            results[language] = {
                'text': result['text'],
                'language_prob': prob,
                'quality_score': quality_score,
                'combined_score': prob * quality_score
            }
        except Exception as e:
            print(f"Failed to transcribe in {language}: {e}")

    # Return best result
    if results:
        best_language = max(results.keys(), key=lambda x: results[x]['combined_score'])
        return results[best_language]
    else:
        # Fallback to auto-detection
        return model.transcribe(audio)


def calculate_transcription_quality(result):
    """
    Calculate transcription quality heuristics
    """
    text = result['text']

    # Basic quality indicators
    word_count = len(text.split())
    char_diversity = len(set(text.lower())) / max(len(text), 1)

    # Penalize very short or very long outputs
    length_score = 1.0
    if word_count < 3:
        length_score *= 0.5
    elif word_count > 200:
        length_score *= 0.8

    # Reward character diversity
    diversity_score = min(char_diversity * 2, 1.0)

    return length_score * diversity_score
```

---

## Summary and Implementation Priorities

### Critical Actions (Week 1)
1. **Implement hallucination fixes** - Apply the Lucid Whisper approach and VAD preprocessing
2. **Set up memory monitoring** - Implement memory-efficient processing for production use

### High Priority (Week 2-3)
3. **Real-time optimization** - Integrate CTranslate2 acceleration and streaming capabilities
4. **Language-specific processing** - Add language detection confidence checks and post-processing

### Medium Priority (Month 1)
5. **Fine-tuning framework** - Set up domain adaptation infrastructure

### Repository-Specific Recommendations

Based on the actual issues from the OpenAI Whisper repository:

1. **Monitor Discussion #679** - Stay updated on hallucination solutions from the community
2. **Implement commits ba3f3cd and 919a713** - These contain official fixes for repetition issues
3. **Consider CTranslate2 integration** - As suggested in Discussion #937 for better performance
4. **Use VAD preprocessing** - Multiple discussions recommend this for better accuracy
5. **Test with problematic languages** - Focus on German, Norwegian, and Chinese variants

This analysis provides actionable solutions based on real user problems and community-developed fixes from the OpenAI Whisper repository.