feat: Add comprehensive configuration and documentation

- Create config.py with model, device, and format settings
- Add model descriptions and performance information
- Expand README with detailed installation instructions
- Add troubleshooting section for common issues
- Include advanced usage examples
- Document all export formats and features
- Add performance tips and recommendations
- Phase 6 complete: Full configuration and documentation ready
Claude 2025-11-12 05:13:35 +00:00
parent 72ab2e3fa9
commit efdcf42ffd
2 changed files with 266 additions and 50 deletions

README.md

@@ -1,29 +1,48 @@
# Farsi Transcriber
A desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.
A professional desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.
## Features
- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, etc.)
- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, etc.)
- 🇮🇷 High-accuracy Farsi transcription
- ⏱️ Word-level timestamps
- 📤 Export to multiple formats (TXT, SRT, JSON)
- 💻 Clean PyQt6-based GUI
✨ **Core Features**
- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, AAC, WMA)
- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, FLV, WMV)
- 🇮🇷 High-accuracy Farsi/Persian language transcription
- ⏱️ Word-level timestamps for precise timing
- 📤 Export to multiple formats (TXT, SRT, VTT, JSON, TSV)
- 💻 Clean, intuitive PyQt6-based GUI
- 🚀 GPU acceleration support (CUDA) with automatic fallback to CPU
- 🔄 Progress indicators and real-time status updates
## System Requirements
- Python 3.8+
- ffmpeg (for audio/video processing)
- 8GB+ RAM recommended (for high-accuracy model)
**Minimum:**
- Python 3.8 or higher
- 4GB RAM
- ffmpeg installed
### Install ffmpeg
**Recommended:**
- Python 3.10+
- 8GB+ RAM
- NVIDIA GPU with CUDA support (optional but faster)
- SSD for better performance
## Installation
### Step 1: Install ffmpeg
Choose your operating system:
**Ubuntu/Debian:**
```bash
sudo apt update && sudo apt install ffmpeg
```
**Fedora/CentOS:**
```bash
sudo dnf install ffmpeg
```
**macOS (Homebrew):**
```bash
brew install ffmpeg
@@ -34,80 +53,205 @@ brew install ffmpeg
choco install ffmpeg
```
## Installation
1. Clone the repository
2. Create a virtual environment:
**Windows (Scoop):**
```bash
scoop install ffmpeg
```
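Before moving on, it may be worth confirming that ffmpeg is actually on the PATH, since Whisper shells out to it for audio decoding. A minimal check (illustrative; running `ffmpeg -version` from the shell works just as well):

```python
# Illustrative check that ffmpeg is reachable; Whisper invokes it for decoding
import shutil

ffmpeg_path = shutil.which("ffmpeg")
print("ffmpeg found at:", ffmpeg_path if ffmpeg_path else "NOT FOUND - revisit Step 1")
```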
### Step 2: Set up Python environment
```bash
# Navigate to the repository
cd whisper/farsi_transcriber
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. Install dependencies:
### Step 3: Install dependencies
```bash
pip install -r requirements.txt
```
4. Run the application:
```bash
python main.py
```
This will install:
- PyQt6 (GUI framework)
- openai-whisper (transcription engine)
- PyTorch (deep learning framework)
- NumPy, tiktoken, tqdm (supporting libraries)
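A quick sanity check after installation, run inside the activated environment (illustrative; an import error here means a dependency is missing):

```python
# Illustrative post-install check: each import fails loudly if its package is missing
import torch
import whisper
from PyQt6.QtCore import QT_VERSION_STR

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("Whisper models:", ", ".join(whisper.available_models()))
print("Qt version:", QT_VERSION_STR)
```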
## Usage
### GUI Application
### Running the Application
```bash
python main.py
```
Then:
1. Click "Select File" to choose an audio or video file
2. Click "Transcribe" and wait for processing
3. View results with timestamps
4. Export to your preferred format
### Step-by-Step Guide
### Command Line (Coming Soon)
```bash
python -m farsi_transcriber --input audio.mp3 --output transcription.srt
```
1. **Launch the app** - Run `python main.py`
2. **Select a file** - Click "Select File" button to choose audio/video
3. **Transcribe** - Click "Transcribe" and wait for completion
4. **View results** - See transcription with timestamps
5. **Export** - Click "Export Results" to save in your preferred format
### Supported Export Formats
- **TXT** - Plain text (content only)
- **SRT** - SubRip subtitle format (with timestamps)
- **VTT** - WebVTT subtitle format (with timestamps)
- **JSON** - Structured format with segments and metadata
- **TSV** - Tab-separated values (spreadsheet compatible)
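The subtitle formats differ mainly in timestamp syntax (SRT puts a comma before the milliseconds, VTT uses a dot). A minimal sketch of turning one segment into an SRT entry; the field names here are illustrative, not the app's internal API:

```python
# Minimal sketch: render one transcription segment as an SRT entry (illustrative fields)
def to_srt_entry(index: int, start: float, end: float, text: str) -> str:
    def fmt(seconds: float) -> str:
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = int((seconds - int(seconds)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"  # SRT comma; VTT would use a dot

    return f"{index}\n{fmt(start)} --> {fmt(end)}\n{text}\n"

print(to_srt_entry(1, 0.0, 2.5, "سلام دنیا"))
```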
## Configuration
Edit `config.py` to customize:
```python
# Model size (tiny, base, small, medium, large)
DEFAULT_MODEL = "medium"
# Language code
LANGUAGE_CODE = "fa" # Farsi
# Supported formats
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ...}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ...}
```
## Model Information
This application uses OpenAI's Whisper model, configured for Farsi:
- **Model**: medium or large (configurable)
- **Accuracy**: Whisper's multilingual models handle Persian well; the app pins the language to `fa`
- **Processing**: Runs entirely locally (no cloud required)
### Available Models
| Model | Size | Speed | Accuracy | VRAM |
|-------|------|-------|----------|------|
| tiny | 39M | ~10x | Good | ~1GB |
| base | 74M | ~7x | Very Good | ~1GB |
| small | 244M | ~4x | Excellent | ~2GB |
| medium | 769M | ~2x | Excellent | ~5GB |
| large | 1550M | 1x | Best | ~10GB |
**Default**: `medium` (recommended for Farsi)
### Performance Notes
- Larger models provide better accuracy but require more VRAM
- GPU (CUDA) dramatically speeds up transcription (8-10x faster)
- First run downloads the model (~500MB-3GB depending on model size)
- Subsequent runs use cached model files
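To confirm in advance which device transcription will run on, a short check like this can help (a sketch; `nvidia-smi` gives the driver-side view):

```python
# Illustrative check of the device transcription will run on
import torch

if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device detected - transcription will fall back to CPU")
```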
## Project Structure
```
farsi_transcriber/
├── ui/ # PyQt6 UI components
├── models/ # Whisper model management
├── ui/ # User interface components
│ ├── __init__.py
│ ├── main_window.py # Main application window
│ └── styles.py # Styling and theming
├── models/ # Model management
│ ├── __init__.py
│ └── whisper_transcriber.py # Whisper wrapper
├── utils/ # Utility functions
│ ├── __init__.py
│ └── export.py # Export functionality
├── config.py # Configuration settings
├── main.py # Application entry point
├── __init__.py # Package init
├── requirements.txt # Python dependencies
└── README.md # This file
```
## Troubleshooting
### Issue: "ffmpeg not found"
**Solution**: Install ffmpeg using your package manager (see Installation section)
### Issue: "CUDA out of memory"
**Solution**: Use a smaller model, or process long files in shorter chunks
### Issue: "Model download fails"
**Solution**: Check internet connection, try again. Models are cached in `~/.cache/whisper/`
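If downloads keep failing mid-run, the model can be pre-fetched once from a plain Python session and the app will then read it from the cache (a sketch, assuming the default cache location):

```python
# Illustrative pre-download: the first call fetches the model into ~/.cache/whisper/
import whisper

model = whisper.load_model("medium")  # later runs reuse the cached files
print("Model loaded on device:", next(model.parameters()).device)
```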
### Issue: Slow transcription
**Solution**: Ensure CUDA is detected (`nvidia-smi`), or switch to a smaller, faster model
## Advanced Usage
### Custom Model Selection
Update `config.py`:
```python
DEFAULT_MODEL = "large" # For maximum accuracy
# or
DEFAULT_MODEL = "tiny" # For fastest processing
```
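The configured name is ultimately just passed to Whisper's loader, so the effect of a change can be previewed directly. A sketch using the standard openai-whisper API; importing `config` this way assumes the package is on the Python path, and the sample file name is illustrative:

```python
# Illustrative preview of the configured model outside the GUI (not the app's exact code path)
import whisper

from farsi_transcriber import config

model = whisper.load_model(config.DEFAULT_MODEL, device=config.DEVICE)
result = model.transcribe("sample.mp3", language=config.LANGUAGE_CODE)
print(result["text"])
```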
### Batch Processing (Future)
Example script to process multiple files (the input folder shown is illustrative):
```python
from pathlib import Path

from farsi_transcriber.models.whisper_transcriber import FarsiTranscriber

transcriber = FarsiTranscriber(model_name="medium")
audio_files = sorted(Path("recordings").glob("*.mp3"))  # example input folder
for audio_file in audio_files:
    result = transcriber.transcribe(str(audio_file))
    # Process or export each result here
```
## Performance Tips
1. **Use GPU** - Ensure NVIDIA CUDA is properly installed
2. **Choose appropriate model** - Balance speed vs accuracy
3. **Close other applications** - Free up RAM/VRAM
4. **Use SSD** - Faster model loading and temporary file I/O
5. **Local processing** - All processing happens locally, no cloud uploads
## Development
### Running Tests
```bash
pytest tests/
```
### Code Style
```bash
# Format code
black farsi_transcriber/
# Check style
flake8 farsi_transcriber/
# Sort imports
isort farsi_transcriber/
```
### Code Style
```bash
black .
flake8 .
isort .
```
### Future Features
- [ ] Batch processing
- [ ] Real-time transcription preview
- [ ] Speaker diarization
- [ ] Multi-language support UI
- [ ] Settings dialog
- [ ] Keyboard shortcuts
- [ ] Drag-and-drop support
- [ ] Recent files history
## License
MIT License - See LICENSE file for details
MIT License - Personal use and modifications allowed
## Contributing
## Acknowledgments
This is a personal project, but feel free to fork and modify for your needs!
Built with:
- [OpenAI Whisper](https://github.com/openai/whisper) - Speech recognition
- [PyQt6](https://www.riverbankcomputing.com/software/pyqt/) - GUI framework
- [PyTorch](https://pytorch.org/) - Deep learning
## Support
For issues or suggestions:
1. Check the troubleshooting section
2. Verify ffmpeg is installed
3. Ensure Python 3.8+ is used
4. Check available disk space
5. Verify CUDA setup (for GPU users)

config.py

@@ -0,0 +1,72 @@
"""
Configuration settings for Farsi Transcriber application
Manages model selection, device settings, and other configuration options.
"""
from pathlib import Path
# Application metadata
APP_NAME = "Farsi Transcriber"
APP_VERSION = "0.1.0"
APP_DESCRIPTION = "A desktop application for transcribing Farsi audio and video files"
# Model settings
DEFAULT_MODEL = "medium" # Options: tiny, base, small, medium, large
AVAILABLE_MODELS = ["tiny", "base", "small", "medium", "large"]
MODEL_DESCRIPTIONS = {
    "tiny": "Tiny model (39M params) - Fastest, ~1GB VRAM required",
    "base": "Base model (74M params) - Fast, ~1GB VRAM required",
    "small": "Small model (244M params) - Balanced, ~2GB VRAM required",
    "medium": "Medium model (769M params) - Good accuracy, ~5GB VRAM required",
    "large": "Large model (1550M params) - Best accuracy, ~10GB VRAM required",
}
# Language settings
LANGUAGE_CODE = "fa" # Farsi/Persian
LANGUAGE_NAME = "Farsi"
# Audio/Video settings
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac", ".wma"}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ".flv", ".wmv"}
# UI settings
WINDOW_WIDTH = 900
WINDOW_HEIGHT = 700
WINDOW_MIN_WIDTH = 800
WINDOW_MIN_HEIGHT = 600
# Output settings
OUTPUT_DIR = Path.home() / "FarsiTranscriber" / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
EXPORT_FORMATS = {
    "txt": "Plain Text",
    "srt": "SRT Subtitles",
    "vtt": "WebVTT Subtitles",
    "json": "JSON Format",
    "tsv": "Tab-Separated Values",
}
# Device settings (auto-detect CUDA if available)
try:
    import torch
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    DEVICE = "cpu"
# Logging settings
LOG_LEVEL = "INFO"
LOG_FILE = OUTPUT_DIR / "transcriber.log"
def get_model_info(model_name: str) -> str:
    """Get description for a model"""
    return MODEL_DESCRIPTIONS.get(model_name, "Unknown model")


def get_supported_formats() -> set:
    """Get all supported audio and video formats"""
    return SUPPORTED_AUDIO_FORMATS | SUPPORTED_VIDEO_FORMATS
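A short illustration of how these helpers might be consumed elsewhere in the app (not part of config.py; the file name and flow below are hypothetical):

```python
# Hypothetical consumer of the config helpers (illustration only)
from pathlib import Path

from farsi_transcriber import config

chosen = Path("interview.mp4")
if chosen.suffix.lower() in config.get_supported_formats():
    print(config.get_model_info(config.DEFAULT_MODEL))
    print("Transcribing on", config.DEVICE)
else:
    print("Unsupported file type:", chosen.suffix)
```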