Mirror of https://github.com/openai/whisper.git (synced 2025-11-28 16:14:00 +00:00)

Commit efdcf42ffd (parent 72ab2e3fa9) — feat: Add comprehensive configuration and documentation

- Create config.py with model, device, and format settings
- Add model descriptions and performance information
- Expand README with detailed installation instructions
- Add troubleshooting section for common issues
- Include advanced usage examples
- Document all export formats and features
- Add performance tips and recommendations
- Phase 6 complete: Full configuration and documentation ready
# Farsi Transcriber

A professional desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.
## Features

✨ **Core Features**

- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, AAC, WMA)
- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, FLV, WMV)
- 🇮🇷 High-accuracy Farsi/Persian language transcription
- ⏱️ Word-level timestamps for precise timing
- 📤 Export to multiple formats (TXT, SRT, VTT, JSON, TSV)
- 💻 Clean, intuitive PyQt6-based GUI
- 🚀 GPU acceleration support (CUDA) with automatic fallback to CPU
- 🔄 Progress indicators and real-time status updates
## System Requirements

**Minimum:**

- Python 3.8 or higher
- 4GB RAM
- ffmpeg installed

**Recommended:**

- Python 3.10+
- 8GB+ RAM
- NVIDIA GPU with CUDA support (optional but faster)
- SSD for better performance
## Installation

### Step 1: Install ffmpeg

Choose your operating system:

**Ubuntu/Debian:**

```bash
sudo apt update && sudo apt install ffmpeg
```

**Fedora/CentOS:**

```bash
sudo dnf install ffmpeg
```

**macOS (Homebrew):**

```bash
brew install ffmpeg
```
**Windows (Chocolatey):**

```bash
choco install ffmpeg
```

**Windows (Scoop):**

```bash
scoop install ffmpeg
```
### Step 2: Set up Python environment

```bash
# Navigate to the repository
cd whisper/farsi_transcriber

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
### Step 3: Install dependencies

```bash
pip install -r requirements.txt
```

This will install:

- PyQt6 (GUI framework)
- openai-whisper (transcription engine)
- PyTorch (deep learning framework)
- NumPy, tiktoken, tqdm (supporting libraries)
## Usage

### Running the Application

```bash
python main.py
```
### Step-by-Step Guide

1. **Launch the app** - Run `python main.py`
2. **Select a file** - Click the "Select File" button to choose an audio or video file
3. **Transcribe** - Click "Transcribe" and wait for completion
4. **View results** - See the transcription with timestamps
5. **Export** - Click "Export Results" to save in your preferred format
### Supported Export Formats

- **TXT** - Plain text (content only)
- **SRT** - SubRip subtitle format (with timestamps)
- **VTT** - WebVTT subtitle format (with timestamps)
- **JSON** - Structured format with segments and metadata
- **TSV** - Tab-separated values (spreadsheet compatible)
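The timestamped formats differ mainly in time notation; SRT, for instance, uses `HH:MM:SS,mmm`. A minimal formatter sketch (`srt_timestamp` is a hypothetical helper for illustration, not the app's actual export code):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

print(srt_timestamp(83.5))  # → 00:01:23,500
```

WebVTT timestamps are nearly identical but use a dot instead of a comma before the milliseconds.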
## Configuration

Edit `config.py` to customize:

```python
# Model size (tiny, base, small, medium, large)
DEFAULT_MODEL = "medium"

# Language code
LANGUAGE_CODE = "fa"  # Farsi

# Supported formats
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ...}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ...}
```
## Model Information

### Available Models

| Model  | Size  | Speed | Accuracy  | VRAM  |
|--------|-------|-------|-----------|-------|
| tiny   | 39M   | ~10x  | Good      | ~1GB  |
| base   | 74M   | ~7x   | Very Good | ~1GB  |
| small  | 244M  | ~4x   | Excellent | ~2GB  |
| medium | 769M  | ~2x   | Excellent | ~5GB  |
| large  | 1550M | 1x    | Best      | ~10GB |

**Default**: `medium` (recommended for Farsi)
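As a rough illustration of how the table translates into a model choice, the sketch below picks the largest model that fits a given VRAM budget (the `pick_model` helper and the GB figures are assumptions drawn from the table, not part of the application):

```python
# Approximate VRAM requirements in GB, per the model table
VRAM_REQUIRED = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_model(vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits the budget."""
    for name in ("large", "medium", "small", "base", "tiny"):
        if VRAM_REQUIRED[name] <= vram_gb:
            return name
    return "tiny"  # too little VRAM for any GPU model: use the smallest

print(pick_model(6))  # → medium
```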
### Performance Notes

- Larger models provide better accuracy but require more VRAM
- GPU (CUDA) dramatically speeds up transcription (8-10x faster)
- First run downloads the model (~500MB-3GB depending on model size)
- Subsequent runs use cached model files
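The automatic CPU fallback works by probing for a usable GPU at startup; a standalone sketch of the device-detection logic used in `config.py`:

```python
def pick_device() -> str:
    """Prefer CUDA when PyTorch is installed and a GPU is visible; otherwise CPU."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # PyTorch not installed at all: only CPU processing is possible
        return "cpu"

print(pick_device())
```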
## Project Structure

```
farsi_transcriber/
├── ui/                         # User interface components
│   ├── __init__.py
│   ├── main_window.py          # Main application window
│   └── styles.py               # Styling and theming
├── models/                     # Model management
│   ├── __init__.py
│   └── whisper_transcriber.py  # Whisper wrapper
├── utils/                      # Utility functions
│   ├── __init__.py
│   └── export.py               # Export functionality
├── config.py                   # Configuration settings
├── main.py                     # Application entry point
├── __init__.py                 # Package init
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```
## Troubleshooting

### Issue: "ffmpeg not found"

**Solution**: Install ffmpeg using your package manager (see the Installation section).

### Issue: "CUDA out of memory"

**Solution**: Switch to a smaller model, or process long recordings in shorter chunks.

### Issue: "Model download fails"

**Solution**: Check your internet connection and try again. Models are cached in `~/.cache/whisper/`.

### Issue: Slow transcription

**Solution**: Ensure CUDA is detected (`nvidia-smi`), or switch to a smaller, faster model.
## Advanced Usage

### Custom Model Selection

Update `config.py`:

```python
DEFAULT_MODEL = "large"  # For maximum accuracy
# or
DEFAULT_MODEL = "tiny"   # For fastest processing
```
### Batch Processing (Future)

A script to process multiple files (`audio_files` stands in for your own list of input paths):

```python
from farsi_transcriber.models.whisper_transcriber import FarsiTranscriber

transcriber = FarsiTranscriber(model_name="medium")

audio_files = ["interview.mp3", "lecture.mp4"]  # example inputs
for audio_file in audio_files:
    result = transcriber.transcribe(audio_file)
    # Process results (e.g., export or display them)
```
## Performance Tips

1. **Use a GPU** - Ensure NVIDIA CUDA is properly installed
2. **Choose an appropriate model** - Balance speed vs. accuracy
3. **Close other applications** - Free up RAM/VRAM
4. **Use an SSD** - Faster model loading and temporary file I/O
5. **Local processing** - All processing happens locally; nothing is uploaded to the cloud
## Development

### Code Style

```bash
# Format code
black farsi_transcriber/

# Check style
flake8 farsi_transcriber/

# Sort imports
isort farsi_transcriber/
```
### Future Features

- [ ] Batch processing
- [ ] Real-time transcription preview
- [ ] Speaker diarization
- [ ] Multi-language support UI
- [ ] Settings dialog
- [ ] Keyboard shortcuts
- [ ] Drag-and-drop support
- [ ] Recent files history
## License

MIT License - Personal use and modifications allowed
## Acknowledgments

Built with:

- [OpenAI Whisper](https://github.com/openai/whisper) - Speech recognition
- [PyQt6](https://www.riverbankcomputing.com/software/pyqt/) - GUI framework
- [PyTorch](https://pytorch.org/) - Deep learning
## Support

For issues or suggestions:

1. Check the Troubleshooting section
2. Verify ffmpeg is installed
3. Ensure Python 3.8+ is used
4. Check available disk space
5. Verify CUDA setup (for GPU users)
## New File: `farsi_transcriber/config.py` (72 lines)
```python
"""
Configuration settings for Farsi Transcriber application

Manages model selection, device settings, and other configuration options.
"""

import os
from pathlib import Path

# Application metadata
APP_NAME = "Farsi Transcriber"
APP_VERSION = "0.1.0"
APP_DESCRIPTION = "A desktop application for transcribing Farsi audio and video files"

# Model settings
DEFAULT_MODEL = "medium"  # Options: tiny, base, small, medium, large
AVAILABLE_MODELS = ["tiny", "base", "small", "medium", "large"]
MODEL_DESCRIPTIONS = {
    "tiny": "Smallest model (39M params) - Fastest, ~1GB VRAM required",
    "base": "Small model (74M params) - Fast, ~1GB VRAM required",
    "small": "Medium model (244M params) - Balanced, ~2GB VRAM required",
    "medium": "Large model (769M params) - Good accuracy, ~5GB VRAM required",
    "large": "Largest model (1550M params) - Best accuracy, ~10GB VRAM required",
}

# Language settings
LANGUAGE_CODE = "fa"  # Farsi/Persian
LANGUAGE_NAME = "Farsi"

# Audio/Video settings
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac", ".wma"}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ".flv", ".wmv"}

# UI settings
WINDOW_WIDTH = 900
WINDOW_HEIGHT = 700
WINDOW_MIN_WIDTH = 800
WINDOW_MIN_HEIGHT = 600

# Output settings
OUTPUT_DIR = Path.home() / "FarsiTranscriber" / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

EXPORT_FORMATS = {
    "txt": "Plain Text",
    "srt": "SRT Subtitles",
    "vtt": "WebVTT Subtitles",
    "json": "JSON Format",
    "tsv": "Tab-Separated Values",
}

# Device settings (auto-detect CUDA if available)
try:
    import torch

    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    DEVICE = "cpu"

# Logging settings
LOG_LEVEL = "INFO"
LOG_FILE = OUTPUT_DIR / "transcriber.log"


def get_model_info(model_name: str) -> str:
    """Get the description for a model"""
    return MODEL_DESCRIPTIONS.get(model_name, "Unknown model")


def get_supported_formats() -> set:
    """Get all supported audio and video formats"""
    return SUPPORTED_AUDIO_FORMATS | SUPPORTED_VIDEO_FORMATS
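The two format sets in `config.py`, combined by `get_supported_formats`, reduce file validation to a set-membership test. A self-contained sketch (`is_supported` is a hypothetical helper shown for illustration, not part of the module):

```python
from pathlib import Path

# Format sets as defined in config.py
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac", ".wma"}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ".flv", ".wmv"}

def is_supported(path: str) -> bool:
    """Check a file's extension, case-insensitively, against the combined sets."""
    return Path(path).suffix.lower() in (SUPPORTED_AUDIO_FORMATS | SUPPORTED_VIDEO_FORMATS)

print(is_supported("lecture.MP4"))  # → True
print(is_supported("notes.txt"))    # → False
```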