feat: Add comprehensive configuration and documentation

- Create config.py with model, device, and format settings
- Add model descriptions and performance information
- Expand README with detailed installation instructions
- Add troubleshooting section for common issues
- Include advanced usage examples
- Document all export formats and features
- Add performance tips and recommendations
- Phase 6 complete: Full configuration and documentation ready
Claude 2025-11-12 05:13:35 +00:00
parent 72ab2e3fa9
commit efdcf42ffd
2 changed files with 266 additions and 50 deletions

README.md

@@ -1,29 +1,48 @@
# Farsi Transcriber
A desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.
A professional desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.
## Features
- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, etc.)
- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, etc.)
- 🇮🇷 High-accuracy Farsi transcription
- ⏱️ Word-level timestamps
- 📤 Export to multiple formats (TXT, SRT, JSON)
- 💻 Clean PyQt6-based GUI
✨ **Core Features**
- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, AAC, WMA)
- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, FLV, WMV)
- 🇮🇷 High-accuracy Farsi/Persian language transcription
- ⏱️ Word-level timestamps for precise timing
- 📤 Export to multiple formats (TXT, SRT, VTT, JSON, TSV)
- 💻 Clean, intuitive PyQt6-based GUI
- 🚀 GPU acceleration support (CUDA) with automatic fallback to CPU
- 🔄 Progress indicators and real-time status updates
## System Requirements
- Python 3.8+
- ffmpeg (for audio/video processing)
- 8GB+ RAM recommended (for high-accuracy model)
**Minimum:**
- Python 3.8 or higher
- 4GB RAM
- ffmpeg installed
### Install ffmpeg
**Recommended:**
- Python 3.10+
- 8GB+ RAM
- NVIDIA GPU with CUDA support (optional but faster)
- SSD for better performance
## Installation
### Step 1: Install ffmpeg
Choose your operating system:
**Ubuntu/Debian:**
```bash
sudo apt update && sudo apt install ffmpeg
```
**Fedora/CentOS:**
```bash
sudo dnf install ffmpeg
```
**macOS (Homebrew):**
```bash
brew install ffmpeg
@@ -34,80 +53,205 @@ brew install ffmpeg
choco install ffmpeg
```
## Installation
1. Clone the repository
2. Create a virtual environment:
**Windows (Scoop):**
```bash
scoop install ffmpeg
```
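Before moving on, it may be worth confirming that ffmpeg is actually on the PATH, since Whisper shells out to it for audio decoding. A minimal check (illustrative; running `ffmpeg -version` from the shell works just as well):

```python
# Illustrative check that ffmpeg is reachable; Whisper invokes it for decoding
import shutil

ffmpeg_path = shutil.which("ffmpeg")
print("ffmpeg found at:", ffmpeg_path if ffmpeg_path else "NOT FOUND - revisit Step 1")
```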
### Step 2: Set up Python environment
```bash
# Navigate to the repository
cd whisper/farsi_transcriber
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. Install dependencies:
### Step 3: Install dependencies
```bash
pip install -r requirements.txt
```
4. Run the application:
```bash
python main.py
```
This will install:
- PyQt6 (GUI framework)
- openai-whisper (transcription engine)
- PyTorch (deep learning framework)
- NumPy, tiktoken, tqdm (supporting libraries)
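A quick sanity check after installation, run inside the activated environment (illustrative; an import error here means a dependency is missing):

```python
# Illustrative post-install check: each import fails loudly if its package is missing
import torch
import whisper
from PyQt6.QtCore import QT_VERSION_STR

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("Whisper models:", ", ".join(whisper.available_models()))
print("Qt version:", QT_VERSION_STR)
```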
## Usage
### GUI Application
### Running the Application
```bash
python main.py
```
Then:
1. Click "Select File" to choose an audio or video file
2. Click "Transcribe" and wait for processing
3. View results with timestamps
4. Export to your preferred format
### Step-by-Step Guide
### Command Line (Coming Soon)
```bash
python -m farsi_transcriber --input audio.mp3 --output transcription.srt
```
1. **Launch the app** - Run `python main.py`
2. **Select a file** - Click "Select File" button to choose audio/video
3. **Transcribe** - Click "Transcribe" and wait for completion
4. **View results** - See transcription with timestamps
5. **Export** - Click "Export Results" to save in your preferred format
### Supported Export Formats
- **TXT** - Plain text (content only)
- **SRT** - SubRip subtitle format (with timestamps)
- **VTT** - WebVTT subtitle format (with timestamps)
- **JSON** - Structured format with segments and metadata
- **TSV** - Tab-separated values (spreadsheet compatible)
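The subtitle formats differ mainly in timestamp syntax (SRT puts a comma before the milliseconds, VTT uses a dot). A minimal sketch of turning one segment into an SRT entry; the field names here are illustrative, not the app's internal API:

```python
# Minimal sketch: render one transcription segment as an SRT entry (illustrative fields)
def to_srt_entry(index: int, start: float, end: float, text: str) -> str:
    def fmt(seconds: float) -> str:
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = int((seconds - int(seconds)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"  # SRT comma; VTT would use a dot

    return f"{index}\n{fmt(start)} --> {fmt(end)}\n{text}\n"

print(to_srt_entry(1, 0.0, 2.5, "سلام دنیا"))
```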
## Configuration
Edit `config.py` to customize:
```python
# Model size (tiny, base, small, medium, large)
DEFAULT_MODEL = "medium"
# Language code
LANGUAGE_CODE = "fa" # Farsi
# Supported formats
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ...}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ...}
```
## Model Information
This application uses OpenAI's Whisper model, configured for Farsi:
- **Model**: medium or large (configurable)
- **Accuracy**: Whisper's multilingual models handle Persian well; the app pins the language to `fa`
- **Processing**: Runs entirely locally (no cloud required)
### Available Models
| Model | Size | Speed | Accuracy | VRAM |
|-------|------|-------|----------|------|
| tiny | 39M | ~10x | Good | ~1GB |
| base | 74M | ~7x | Very Good | ~1GB |
| small | 244M | ~4x | Excellent | ~2GB |
| medium | 769M | ~2x | Excellent | ~5GB |
| large | 1550M | 1x | Best | ~10GB |
**Default**: `medium` (recommended for Farsi)
### Performance Notes
- Larger models provide better accuracy but require more VRAM
- GPU (CUDA) dramatically speeds up transcription (8-10x faster)
- First run downloads the model (~500MB-3GB depending on model size)
- Subsequent runs use cached model files
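To confirm in advance which device transcription will run on, a short check like this can help (a sketch; `nvidia-smi` gives the driver-side view):

```python
# Illustrative check of the device transcription will run on
import torch

if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device detected - transcription will fall back to CPU")
```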
## Project Structure
```
farsi_transcriber/
├── ui/ # PyQt6 UI components
├── models/ # Whisper model management
├── ui/ # User interface components
│ ├── __init__.py
│ ├── main_window.py # Main application window
│ └── styles.py # Styling and theming
├── models/ # Model management
│ ├── __init__.py
│ └── whisper_transcriber.py # Whisper wrapper
├── utils/ # Utility functions
│ ├── __init__.py
│ └── export.py # Export functionality
├── config.py # Configuration settings
├── main.py # Application entry point
├── __init__.py # Package init
├── requirements.txt # Python dependencies
└── README.md # This file
```
## Troubleshooting
### Issue: "ffmpeg not found"
**Solution**: Install ffmpeg using your package manager (see Installation section)
### Issue: "CUDA out of memory"
**Solution**: Use a smaller model, or process long files in shorter chunks
### Issue: "Model download fails"
**Solution**: Check internet connection, try again. Models are cached in `~/.cache/whisper/`
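If downloads keep failing mid-run, the model can be pre-fetched once from a plain Python session and the app will then read it from the cache (a sketch, assuming the default cache location):

```python
# Illustrative pre-download: the first call fetches the model into ~/.cache/whisper/
import whisper

model = whisper.load_model("medium")  # later runs reuse the cached files
print("Model loaded on device:", next(model.parameters()).device)
```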
### Issue: Slow transcription
**Solution**: Ensure CUDA is detected (`nvidia-smi`), or switch to a smaller, faster model
## Advanced Usage
### Custom Model Selection
Update `config.py`:
```python
DEFAULT_MODEL = "large" # For maximum accuracy
# or
DEFAULT_MODEL = "tiny" # For fastest processing
```
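The configured name is ultimately just passed to Whisper's loader, so the effect of a change can be previewed directly. A sketch using the standard openai-whisper API; importing `config` this way assumes the package is on the Python path, and the sample file name is illustrative:

```python
# Illustrative preview of the configured model outside the GUI (not the app's exact code path)
import whisper

from farsi_transcriber import config

model = whisper.load_model(config.DEFAULT_MODEL, device=config.DEVICE)
result = model.transcribe("sample.mp3", language=config.LANGUAGE_CODE)
print(result["text"])
```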
### Batch Processing (Future)
Example script to process multiple files (the input folder shown is illustrative):
```python
from pathlib import Path

from farsi_transcriber.models.whisper_transcriber import FarsiTranscriber

transcriber = FarsiTranscriber(model_name="medium")
audio_files = sorted(Path("recordings").glob("*.mp3"))  # example input folder
for audio_file in audio_files:
    result = transcriber.transcribe(str(audio_file))
    # Process or export each result here
```
## Performance Tips
1. **Use GPU** - Ensure NVIDIA CUDA is properly installed
2. **Choose appropriate model** - Balance speed vs accuracy
3. **Close other applications** - Free up RAM/VRAM
4. **Use SSD** - Faster model loading and temporary file I/O
5. **Local processing** - All processing happens locally, no cloud uploads
## Development
### Running Tests
```bash
pytest tests/
```
### Code Style
```bash
# Format code
black farsi_transcriber/
# Check style
flake8 farsi_transcriber/
# Sort imports
isort farsi_transcriber/
```
### Code Style
```bash
black .
flake8 .
isort .
```
### Future Features
- [ ] Batch processing
- [ ] Real-time transcription preview
- [ ] Speaker diarization
- [ ] Multi-language support UI
- [ ] Settings dialog
- [ ] Keyboard shortcuts
- [ ] Drag-and-drop support
- [ ] Recent files history
## License
MIT License - See LICENSE file for details
MIT License - Personal use and modifications allowed
## Contributing
## Acknowledgments
This is a personal project, but feel free to fork and modify for your needs!
Built with:
- [OpenAI Whisper](https://github.com/openai/whisper) - Speech recognition
- [PyQt6](https://www.riverbankcomputing.com/software/pyqt/) - GUI framework
- [PyTorch](https://pytorch.org/) - Deep learning
## Support
For issues or suggestions:
1. Check the troubleshooting section
2. Verify ffmpeg is installed
3. Ensure Python 3.8+ is used
4. Check available disk space
5. Verify CUDA setup (for GPU users)

config.py

@@ -0,0 +1,72 @@
"""
Configuration settings for Farsi Transcriber application
Manages model selection, device settings, and other configuration options.
"""
from pathlib import Path
# Application metadata
APP_NAME = "Farsi Transcriber"
APP_VERSION = "0.1.0"
APP_DESCRIPTION = "A desktop application for transcribing Farsi audio and video files"
# Model settings
DEFAULT_MODEL = "medium" # Options: tiny, base, small, medium, large
AVAILABLE_MODELS = ["tiny", "base", "small", "medium", "large"]
MODEL_DESCRIPTIONS = {
    "tiny": "Tiny model (39M params) - Fastest, ~1GB VRAM required",
    "base": "Base model (74M params) - Fast, ~1GB VRAM required",
    "small": "Small model (244M params) - Balanced, ~2GB VRAM required",
    "medium": "Medium model (769M params) - Good accuracy, ~5GB VRAM required",
    "large": "Large model (1550M params) - Best accuracy, ~10GB VRAM required",
}
# Language settings
LANGUAGE_CODE = "fa" # Farsi/Persian
LANGUAGE_NAME = "Farsi"
# Audio/Video settings
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac", ".wma"}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ".flv", ".wmv"}
# UI settings
WINDOW_WIDTH = 900
WINDOW_HEIGHT = 700
WINDOW_MIN_WIDTH = 800
WINDOW_MIN_HEIGHT = 600
# Output settings
OUTPUT_DIR = Path.home() / "FarsiTranscriber" / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
EXPORT_FORMATS = {
    "txt": "Plain Text",
    "srt": "SRT Subtitles",
    "vtt": "WebVTT Subtitles",
    "json": "JSON Format",
    "tsv": "Tab-Separated Values",
}
# Device settings (auto-detect CUDA if available)
try:
    import torch
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    DEVICE = "cpu"
# Logging settings
LOG_LEVEL = "INFO"
LOG_FILE = OUTPUT_DIR / "transcriber.log"
def get_model_info(model_name: str) -> str:
    """Get description for a model"""
    return MODEL_DESCRIPTIONS.get(model_name, "Unknown model")


def get_supported_formats() -> set:
    """Get all supported audio and video formats"""
    return SUPPORTED_AUDIO_FORMATS | SUPPORTED_VIDEO_FORMATS
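A short illustration of how these helpers might be consumed elsewhere in the app (not part of config.py; the file name and flow below are hypothetical):

```python
# Hypothetical consumer of the config helpers (illustration only)
from pathlib import Path

from farsi_transcriber import config

chosen = Path("interview.mp4")
if chosen.suffix.lower() in config.get_supported_formats():
    print(config.get_model_info(config.DEFAULT_MODEL))
    print("Transcribing on", config.DEVICE)
else:
    print("Unsupported file type:", chosen.suffix)
```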