mirror of https://github.com/openai/whisper.git (synced 2025-11-28 00:03:40 +00:00)

feat: Add comprehensive configuration and documentation

- Create config.py with model, device, and format settings
- Add model descriptions and performance information
- Expand README with detailed installation instructions
- Add troubleshooting section for common issues
- Include advanced usage examples
- Document all export formats and features
- Add performance tips and recommendations
- Phase 6 complete: full configuration and documentation ready

Parent: 72ab2e3fa9, commit: efdcf42ffd
# Farsi Transcriber

A professional desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.

## Features
✨ **Core Features**

- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, AAC, WMA)
- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, FLV, WMV)
- 🇮🇷 High-accuracy Farsi/Persian language transcription
- ⏱️ Word-level timestamps for precise timing
- 📤 Export to multiple formats (TXT, SRT, VTT, JSON, TSV)
- 💻 Clean, intuitive PyQt6-based GUI
- 🚀 GPU acceleration support (CUDA) with automatic fallback to CPU
- 🔄 Progress indicators and real-time status updates
## System Requirements

**Minimum:**

- Python 3.8 or higher
- 4GB RAM
- ffmpeg installed

**Recommended:**

- Python 3.10+
- 8GB+ RAM
- NVIDIA GPU with CUDA support (optional, but much faster)
- SSD for better performance
## Installation

### Step 1: Install ffmpeg

Choose your operating system:

**Ubuntu/Debian:**
```bash
sudo apt update && sudo apt install ffmpeg
```

**Fedora/CentOS:**
```bash
sudo dnf install ffmpeg
```

**macOS (Homebrew):**
```bash
brew install ffmpeg
```
**Windows (Chocolatey):**
```bash
choco install ffmpeg
```

**Windows (Scoop):**
```bash
scoop install ffmpeg
```
### Step 2: Set up the Python environment

```bash
# Navigate to the application directory
cd whisper/farsi_transcriber

# Create a virtual environment
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
### Step 3: Install dependencies

```bash
pip install -r requirements.txt
```

This will install:

- PyQt6 (GUI framework)
- openai-whisper (transcription engine)
- PyTorch (deep learning framework)
- NumPy, tiktoken, tqdm (supporting libraries)
## Usage

### Running the Application

```bash
python main.py
```
### Step-by-Step Guide

1. **Launch the app** - Run `python main.py`
2. **Select a file** - Click the "Select File" button to choose an audio or video file
3. **Transcribe** - Click "Transcribe" and wait for completion
4. **View results** - Review the transcription with timestamps
5. **Export** - Click "Export Results" to save in your preferred format
### Supported Export Formats

- **TXT** - Plain text (content only)
- **SRT** - SubRip subtitle format (with timestamps)
- **VTT** - WebVTT subtitle format (with timestamps)
- **JSON** - Structured format with segments and metadata
- **TSV** - Tab-separated values (spreadsheet compatible)
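To make the timestamped formats concrete, here is a minimal sketch of how SRT output can be produced from segment dicts carrying `start`, `end`, and `text` keys (the shape Whisper returns). The helper names are illustrative, not the app's actual `utils/export.py` API:

```python
def format_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render a list of {start, end, text} segments as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_srt_timestamp(seg['start'])} --> "
            f"{format_srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

The same segment loop, with a different timestamp format (`HH:MM:SS.mmm`) and a `WEBVTT` header, covers VTT as well.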
## Configuration

Edit `config.py` to customize:

```python
# Model size (tiny, base, small, medium, large)
DEFAULT_MODEL = "medium"

# Language code
LANGUAGE_CODE = "fa"  # Farsi

# Supported formats
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ...}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ...}
```
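The two format sets can also drive a simple validity check before transcription starts. A sketch, assuming the set values shown in `config.py` (`classify_media_file` is a hypothetical helper, not part of the app):

```python
from pathlib import Path

# Extension sets as defined in config.py
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac", ".wma"}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ".flv", ".wmv"}


def classify_media_file(path: str) -> str:
    """Return 'audio', 'video', or 'unsupported' based on the file extension."""
    ext = Path(path).suffix.lower()  # lowercase so 'talk.MP3' still matches
    if ext in SUPPORTED_AUDIO_FORMATS:
        return "audio"
    if ext in SUPPORTED_VIDEO_FORMATS:
        return "video"
    return "unsupported"
```

A GUI can use this to decide whether to transcribe directly or first extract the audio track with ffmpeg.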
## Model Information

This application runs OpenAI's Whisper model locally; no cloud service is required.

### Available Models

| Model  | Size  | Speed | Accuracy  | VRAM  |
|--------|-------|-------|-----------|-------|
| tiny   | 39M   | ~10x  | Good      | ~1GB  |
| base   | 74M   | ~7x   | Very good | ~1GB  |
| small  | 244M  | ~4x   | Excellent | ~2GB  |
| medium | 769M  | ~2x   | Excellent | ~5GB  |
| large  | 1550M | 1x    | Best      | ~10GB |

**Default**: `medium` (recommended for Farsi)

### Performance Notes

- Larger models provide better accuracy but require more VRAM
- A CUDA GPU dramatically speeds up transcription (roughly 8-10x)
- The first run downloads the model (~500MB-3GB depending on model size)
- Subsequent runs use the cached model files
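The VRAM column above can be turned into a helper that suggests the largest model fitting a given memory budget. A sketch using the approximate figures from the table (`pick_model` is hypothetical, not part of the app):

```python
# Approximate VRAM requirements in GB, from the table above (small -> large)
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_model(vram_gb: float) -> str:
    """Suggest the largest Whisper model that fits the given VRAM budget."""
    best = "tiny"  # always fall back to the smallest model
    for name, need in MODEL_VRAM_GB.items():
        if need <= vram_gb:
            best = name  # dict order runs small -> large, so keep the last fit
    return best
```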
## Project Structure
|
||||
|
||||
```
|
||||
farsi_transcriber/
|
||||
├── ui/ # PyQt6 UI components
|
||||
├── models/ # Whisper model management
|
||||
├── utils/ # Utility functions
|
||||
├── main.py # Application entry point
|
||||
├── requirements.txt # Python dependencies
|
||||
└── README.md # This file
|
||||
├── ui/ # User interface components
|
||||
│ ├── __init__.py
|
||||
│ ├── main_window.py # Main application window
|
||||
│ └── styles.py # Styling and theming
|
||||
├── models/ # Model management
|
||||
│ ├── __init__.py
|
||||
│ └── whisper_transcriber.py # Whisper wrapper
|
||||
├── utils/ # Utility functions
|
||||
│ ├── __init__.py
|
||||
│ └── export.py # Export functionality
|
||||
├── config.py # Configuration settings
|
||||
├── main.py # Application entry point
|
||||
├── __init__.py # Package init
|
||||
├── requirements.txt # Python dependencies
|
||||
└── README.md # This file
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: "ffmpeg not found"
|
||||
**Solution**: Install ffmpeg using your package manager (see Installation section)
|
||||
|
||||
### Issue: "CUDA out of memory"
|
||||
**Solution**: Use a smaller model or reduce audio processing in chunks
|
||||
|
||||
### Issue: "Model download fails"
|
||||
**Solution**: Check internet connection, try again. Models are cached in `~/.cache/whisper/`
|
||||
|
||||
### Issue: Slow transcription
|
||||
**Solution**: Ensure CUDA is detected (`nvidia-smi`), or upgrade to a smaller/faster model
|
||||
|
||||
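For the failed-download case, it can help to see which model checkpoints are already cached. A small sketch, assuming the default `~/.cache/whisper/` location mentioned above (the helper name is illustrative):

```python
from pathlib import Path


def cached_whisper_models(cache_dir=None):
    """List model checkpoint files (*.pt) found in the Whisper cache."""
    cache_dir = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "whisper"
    if not cache_dir.is_dir():
        return []  # nothing downloaded yet
    return sorted(p.name for p in cache_dir.glob("*.pt"))
```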
## Advanced Usage
|
||||
|
||||
### Custom Model Selection
|
||||
|
||||
Update `config.py`:
|
||||
```python
|
||||
DEFAULT_MODEL = "large" # For maximum accuracy
|
||||
# or
|
||||
DEFAULT_MODEL = "tiny" # For fastest processing
|
||||
```
|
||||
|
||||
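Since a typo in `DEFAULT_MODEL` would otherwise only surface when the model fails to load, a small guard against `AVAILABLE_MODELS` can fail fast. A sketch (`validate_model_name` is a hypothetical helper, not part of the app):

```python
AVAILABLE_MODELS = ["tiny", "base", "small", "medium", "large"]


def validate_model_name(name: str) -> str:
    """Return the model name if valid, otherwise raise with a helpful message."""
    if name not in AVAILABLE_MODELS:
        raise ValueError(
            f"Unknown model {name!r}; choose one of {', '.join(AVAILABLE_MODELS)}"
        )
    return name
```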
### Batch Processing (Future)
|
||||
|
||||
Script to process multiple files:
|
||||
```python
|
||||
from farsi_transcriber.models.whisper_transcriber import FarsiTranscriber
|
||||
|
||||
transcriber = FarsiTranscriber(model_name="medium")
|
||||
for audio_file in audio_files:
|
||||
result = transcriber.transcribe(audio_file)
|
||||
# Process results
|
||||
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Use GPU** - Ensure NVIDIA CUDA is properly installed
|
||||
2. **Choose appropriate model** - Balance speed vs accuracy
|
||||
3. **Close other applications** - Free up RAM/VRAM
|
||||
4. **Use SSD** - Faster model loading and temporary file I/O
|
||||
5. **Local processing** - All processing happens locally, no cloud uploads
|
||||
|
||||
## Development
|
||||
|
||||
### Running Tests
|
||||
### Code Style
|
||||
|
||||
```bash
|
||||
pytest tests/
|
||||
# Format code
|
||||
black farsi_transcriber/
|
||||
|
||||
# Check style
|
||||
flake8 farsi_transcriber/
|
||||
|
||||
# Sort imports
|
||||
isort farsi_transcriber/
|
||||
```
|
||||
|
||||
### Code Style
|
||||
```bash
|
||||
black .
|
||||
flake8 .
|
||||
isort .
|
||||
```
|
||||
### Future Features

- [ ] Batch processing
- [ ] Real-time transcription preview
- [ ] Speaker diarization
- [ ] Multi-language UI support
- [ ] Settings dialog
- [ ] Keyboard shortcuts
- [ ] Drag-and-drop support
- [ ] Recent files history
## License
|
||||
|
||||
MIT License - See LICENSE file for details
|
||||
MIT License - Personal use and modifications allowed
|
||||
|
||||
## Contributing
|
||||
## Acknowledgments
|
||||
|
||||
This is a personal project, but feel free to fork and modify for your needs!
|
||||
Built with:
|
||||
- [OpenAI Whisper](https://github.com/openai/whisper) - Speech recognition
|
||||
- [PyQt6](https://www.riverbankcomputing.com/software/pyqt/) - GUI framework
|
||||
- [PyTorch](https://pytorch.org/) - Deep learning
|
||||
|
||||
## Support
|
||||
|
||||
For issues or suggestions:
|
||||
1. Check the troubleshooting section
|
||||
2. Verify ffmpeg is installed
|
||||
3. Ensure Python 3.8+ is used
|
||||
4. Check available disk space
|
||||
5. Verify CUDA setup (for GPU users)
|
||||
|
||||
**New file**: `farsi_transcriber/config.py` (72 lines)
```python
"""
Configuration settings for the Farsi Transcriber application.

Manages model selection, device settings, and other configuration options.
"""

from pathlib import Path

# Application metadata
APP_NAME = "Farsi Transcriber"
APP_VERSION = "0.1.0"
APP_DESCRIPTION = "A desktop application for transcribing Farsi audio and video files"

# Model settings
DEFAULT_MODEL = "medium"  # Options: tiny, base, small, medium, large
AVAILABLE_MODELS = ["tiny", "base", "small", "medium", "large"]
MODEL_DESCRIPTIONS = {
    "tiny": "Tiny model (39M params) - Fastest, ~1GB VRAM required",
    "base": "Base model (74M params) - Fast, ~1GB VRAM required",
    "small": "Small model (244M params) - Balanced, ~2GB VRAM required",
    "medium": "Medium model (769M params) - Good accuracy, ~5GB VRAM required",
    "large": "Large model (1550M params) - Best accuracy, ~10GB VRAM required",
}

# Language settings
LANGUAGE_CODE = "fa"  # Farsi/Persian
LANGUAGE_NAME = "Farsi"

# Audio/Video settings
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac", ".wma"}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ".flv", ".wmv"}

# UI settings
WINDOW_WIDTH = 900
WINDOW_HEIGHT = 700
WINDOW_MIN_WIDTH = 800
WINDOW_MIN_HEIGHT = 600

# Output settings
OUTPUT_DIR = Path.home() / "FarsiTranscriber" / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

EXPORT_FORMATS = {
    "txt": "Plain Text",
    "srt": "SRT Subtitles",
    "vtt": "WebVTT Subtitles",
    "json": "JSON Format",
    "tsv": "Tab-Separated Values",
}

# Device settings (auto-detect CUDA if available)
try:
    import torch

    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    DEVICE = "cpu"

# Logging settings
LOG_LEVEL = "INFO"
LOG_FILE = OUTPUT_DIR / "transcriber.log"


def get_model_info(model_name: str) -> str:
    """Return the description for a model."""
    return MODEL_DESCRIPTIONS.get(model_name, "Unknown model")


def get_supported_formats() -> set:
    """Return all supported audio and video file extensions."""
    return SUPPORTED_AUDIO_FORMATS | SUPPORTED_VIDEO_FORMATS
```
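As a usage sketch, the `EXPORT_FORMATS` mapping can drive the name-filter string that PyQt's `QFileDialog.getSaveFileName` accepts (the helper is illustrative, not part of the app):

```python
# The mapping as defined in config.py
EXPORT_FORMATS = {
    "txt": "Plain Text",
    "srt": "SRT Subtitles",
    "vtt": "WebVTT Subtitles",
    "json": "JSON Format",
    "tsv": "Tab-Separated Values",
}


def export_filter_string(formats):
    """Build a Qt-style name filter, e.g. 'Plain Text (*.txt);;SRT Subtitles (*.srt)'."""
    return ";;".join(f"{label} (*.{ext})" for ext, label in formats.items())
```

Keeping the dialog filter derived from the config dict means adding a new export format is a one-line change.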