# Farsi Transcriber

A professional desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.

## Features

✨ **Core Features**

- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, AAC, WMA)
- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, FLV, WMV)
- 🇮🇷 High-accuracy Farsi/Persian language transcription
- ⏱️ Word-level timestamps for precise timing
- 📤 Export to multiple formats (TXT, SRT, VTT, JSON, TSV)
- 💻 Clean, intuitive PyQt6-based GUI
- 🚀 GPU acceleration support (CUDA) with automatic fallback to CPU
- 🔄 Progress indicators and real-time status updates

## System Requirements

**Minimum:**

- Python 3.8 or higher
- 4GB RAM
- ffmpeg installed

**Recommended:**

- Python 3.10+
- 8GB+ RAM
- NVIDIA GPU with CUDA support (optional but faster)
- SSD for better performance

## Installation

### Step 1: Install ffmpeg

Choose your operating system:

**Ubuntu/Debian:**
```bash
sudo apt update && sudo apt install ffmpeg
```

**Fedora/CentOS:**
```bash
sudo dnf install ffmpeg
```

**macOS (Homebrew):**
```bash
brew install ffmpeg
```

**Windows (Chocolatey):**
```bash
choco install ffmpeg
```

**Windows (Scoop):**
```bash
scoop install ffmpeg
```

### Step 2: Set up Python environment

```bash
# Navigate to the repository
cd whisper/farsi_transcriber

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate   # On Windows: venv\Scripts\activate
```

### Step 3: Install dependencies

```bash
pip install -r requirements.txt
```

This will install:

- PyQt6 (GUI framework)
- openai-whisper (transcription engine)
- PyTorch (deep learning framework)
- NumPy, tiktoken, tqdm (supporting libraries)

## Usage

### Running the Application

```bash
python main.py
```

### Step-by-Step Guide

1. **Launch the app** - Run `python main.py`
2. **Select a file** - Click the "Select File" button to choose an audio/video file
3. **Transcribe** - Click "Transcribe" and wait for completion (see the sketch below for what this does under the hood)
4. **View results** - See the transcription with timestamps
5. **Export** - Click "Export Results" to save in your preferred format
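The Transcribe step drives the bundled openai-whisper engine (wrapped in `models/whisper_transcriber.py`). The snippet below is a minimal sketch of that kind of call using the public `whisper` API with a hypothetical input file; it illustrates the idea rather than the app's exact internals.

```python
# Minimal sketch: transcribing a Farsi file with the openai-whisper API.
# "interview.mp3" is a hypothetical example path, not a file in this repo.
import whisper

model = whisper.load_model("medium")        # downloaded and cached on first use
result = model.transcribe(
    "interview.mp3",
    language="fa",                          # Farsi
    word_timestamps=True,                   # word-level timing used by the exports
)

print(result["text"])                       # full transcription
for segment in result["segments"]:          # per-segment timestamps
    print(f'[{segment["start"]:7.2f} -> {segment["end"]:7.2f}] {segment["text"]}')
```

The same result structure (full text plus timed segments) is what feeds the export formats described below.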
### Supported Export Formats

- **TXT** - Plain text (content only)
- **SRT** - SubRip subtitle format (with timestamps)
- **VTT** - WebVTT subtitle format (with timestamps)
- **JSON** - Structured format with segments and metadata
- **TSV** - Tab-separated values (spreadsheet compatible)

## Configuration

Edit `config.py` to customize:

```python
# Model size (tiny, base, small, medium, large)
DEFAULT_MODEL = "medium"

# Language code
LANGUAGE_CODE = "fa"  # Farsi

# Supported formats
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ...}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ...}
```

## Model Information

### Available Models

| Model  | Parameters | Relative speed | Accuracy  | VRAM  |
|--------|------------|----------------|-----------|-------|
| tiny   | 39M        | ~10x           | Good      | ~1GB  |
| base   | 74M        | ~7x            | Very Good | ~1GB  |
| small  | 244M       | ~4x            | Excellent | ~2GB  |
| medium | 769M       | ~2x            | Excellent | ~5GB  |
| large  | 1550M      | 1x             | Best      | ~10GB |

**Default**: `medium` (recommended for Farsi)

### Performance Notes

- Larger models provide better accuracy but require more VRAM
- GPU (CUDA) dramatically speeds up transcription (roughly 8-10x faster than CPU)
- The first run downloads the model (~500MB-3GB depending on model size)
- Subsequent runs use cached model files

## Project Structure

```
farsi_transcriber/
├── ui/                         # User interface components
│   ├── __init__.py
│   ├── main_window.py          # Main application window
│   └── styles.py               # Styling and theming
├── models/                     # Model management
│   ├── __init__.py
│   └── whisper_transcriber.py  # Whisper wrapper
├── utils/                      # Utility functions
│   ├── __init__.py
│   └── export.py               # Export functionality
├── config.py                   # Configuration settings
├── main.py                     # Application entry point
├── __init__.py                 # Package init
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```

## Troubleshooting

### Issue: "ffmpeg not found"

**Solution**: Install ffmpeg using your package manager (see the Installation section).

### Issue: "CUDA out of memory"

**Solution**: Switch to a smaller model, or process long recordings in shorter chunks.

### Issue: Model download fails

**Solution**: Check your internet connection and try again. Models are cached in `~/.cache/whisper/`.

### Issue: Slow transcription

**Solution**: Ensure CUDA is detected (`nvidia-smi`), or switch to a smaller, faster model.

## Advanced Usage

### Custom Model Selection

Update `config.py`:

```python
DEFAULT_MODEL = "large"  # For maximum accuracy
# or
DEFAULT_MODEL = "tiny"   # For fastest processing
```

### Batch Processing (Future)

Script to process multiple files:

```python
from farsi_transcriber.models.whisper_transcriber import FarsiTranscriber

transcriber = FarsiTranscriber(model_name="medium")
for audio_file in audio_files:
    result = transcriber.transcribe(audio_file)
    # Process results
```

## Performance Tips

1. **Use GPU** - Ensure NVIDIA CUDA is properly installed (see the sketch below)
2. **Choose an appropriate model** - Balance speed vs. accuracy
3. **Close other applications** - Free up RAM/VRAM
4. **Use SSD** - Faster model loading and temporary file I/O
5. **Local processing** - All processing happens locally; no cloud uploads
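To confirm tips 1 and 2 in practice, it helps to check which device PyTorch actually detects before picking a model size. This is a hedged sketch, not code from the app; the model-size heuristic is just one reasonable choice based on the table above.

```python
# Check whether PyTorch sees a CUDA GPU, then load a Whisper model sized accordingly.
import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# With a CUDA GPU (~5GB+ VRAM) the default "medium" model is a good fit;
# on CPU-only machines a smaller model keeps run times reasonable.
model_name = "medium" if device == "cuda" else "small"
model = whisper.load_model(model_name, device=device)
```

If `device` comes back as `cpu` on a machine with an NVIDIA GPU, check the CUDA-enabled PyTorch build and the `nvidia-smi` output before assuming slow transcription is a model-size problem.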
## Development

### Code Style

```bash
# Format code
black farsi_transcriber/

# Check style
flake8 farsi_transcriber/

# Sort imports
isort farsi_transcriber/
```

### Future Features

- [ ] Batch processing
- [ ] Real-time transcription preview
- [ ] Speaker diarization
- [ ] Multi-language UI support
- [ ] Settings dialog
- [ ] Keyboard shortcuts
- [ ] Drag-and-drop support
- [ ] Recent files history

## License

MIT License - Personal use and modifications allowed

## Acknowledgments

Built with:

- [OpenAI Whisper](https://github.com/openai/whisper) - Speech recognition
- [PyQt6](https://www.riverbankcomputing.com/software/pyqt/) - GUI framework
- [PyTorch](https://pytorch.org/) - Deep learning

## Support

For issues or suggestions:

1. Check the Troubleshooting section
2. Verify ffmpeg is installed
3. Ensure Python 3.8+ is used
4. Check available disk space
5. Verify CUDA setup (for GPU users)