diff --git a/farsi_transcriber/README.md b/farsi_transcriber/README.md
index 548301f..61e95fe 100644
--- a/farsi_transcriber/README.md
+++ b/farsi_transcriber/README.md
@@ -1,29 +1,48 @@
# Farsi Transcriber

-A desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.
+A professional desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.

## Features

-- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, etc.)
-- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, etc.)
-- 🇮🇷 High-accuracy Farsi transcription
-- ⏱️ Word-level timestamps
-- 📤 Export to multiple formats (TXT, SRT, JSON)
-- 💻 Clean PyQt6-based GUI
+✨ **Core Features**
+- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, AAC, WMA)
+- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, FLV, WMV)
+- 🇮🇷 High-accuracy Farsi/Persian language transcription
+- ⏱️ Word-level timestamps for precise timing
+- 📤 Export to multiple formats (TXT, SRT, VTT, JSON, TSV)
+- 💻 Clean, intuitive PyQt6-based GUI
+- 🚀 GPU acceleration support (CUDA) with automatic fallback to CPU
+- 🔄 Progress indicators and real-time status updates

## System Requirements

-- Python 3.8+
-- ffmpeg (for audio/video processing)
-- 8GB+ RAM recommended (for high-accuracy model)
+**Minimum:**
+- Python 3.8 or higher
+- 4GB RAM
+- ffmpeg installed

-### Install ffmpeg
+**Recommended:**
+- Python 3.10+
+- 8GB+ RAM
+- NVIDIA GPU with CUDA support (optional, but speeds up transcription considerably)
+- SSD for better performance
+
+## Installation
+
+### Step 1: Install ffmpeg
+
+Choose your operating system:

**Ubuntu/Debian:**
```bash
sudo apt update && sudo apt install ffmpeg
```

+**Fedora/CentOS:**
+```bash
+sudo dnf install ffmpeg
+```
+
**macOS (Homebrew):**
```bash
brew install ffmpeg
@@ -34,80 +53,205 @@ brew install ffmpeg
choco install ffmpeg
```

-## Installation
-
-1. Clone the repository
-2. Create a virtual environment:
+**Windows (Scoop):**
```bash
+scoop install ffmpeg
+```
+
+### Step 2: Set up Python environment
+
+```bash
+# Navigate to the repository
+cd whisper/farsi_transcriber
+
+# Create virtual environment
python3 -m venv venv
+
+# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

-3. Install dependencies:
+### Step 3: Install dependencies
+
```bash
pip install -r requirements.txt
```

-4. Run the application:
-```bash
-python main.py
-```
+This will install:
+- PyQt6 (GUI framework)
+- openai-whisper (transcription engine)
+- PyTorch (deep learning framework)
+- NumPy, tiktoken, tqdm (supporting libraries)

## Usage

-### GUI Application
+### Running the Application
+
```bash
python main.py
```

-Then:
-1. Click "Select File" to choose an audio or video file
-2. Click "Transcribe" and wait for processing
-3. View results with timestamps
-4. Export to your preferred format
+### Step-by-Step Guide

-### Command Line (Coming Soon)
-```bash
-python -m farsi_transcriber --input audio.mp3 --output transcription.srt
+1. **Launch the app** - Run `python main.py`
+2. **Select a file** - Click the "Select File" button to choose an audio or video file
+3. **Transcribe** - Click "Transcribe" and wait for completion
+4. **View results** - Review the transcription with timestamps
+5. **Export** - Click "Export Results" to save in your preferred format
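+
+If you prefer to script a transcription rather than use the GUI, the sketch below calls the underlying openai-whisper API directly (illustrative only; the file name is a placeholder, and `word_timestamps` needs a reasonably recent openai-whisper release):
+
+```python
+import whisper
+
+# Load the model once; it is downloaded on first use and cached afterwards
+model = whisper.load_model("medium")
+
+# Transcribe a Farsi recording with word-level timing
+result = model.transcribe("recording.mp3", language="fa", word_timestamps=True)
+
+print(result["text"])                # full transcript
+for segment in result["segments"]:   # timestamped segments
+    print(f"[{segment['start']:.2f}-{segment['end']:.2f}] {segment['text']}")
+```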
+
+### Supported Export Formats
+
+- **TXT** - Plain text (content only)
+- **SRT** - SubRip subtitle format (with timestamps)
+- **VTT** - WebVTT subtitle format (with timestamps)
+- **JSON** - Structured format with segments and metadata
+- **TSV** - Tab-separated values (spreadsheet compatible)
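+
+As a rough illustration of how the timestamped formats are assembled, a minimal SRT writer for Whisper-style segments might look like this (a sketch only, not the app's actual export code; `to_srt` is a made-up name):
+
+```python
+def to_srt(segments) -> str:
+    """Render Whisper-style segments (dicts with start/end/text) as SubRip text."""
+    def fmt(seconds: float) -> str:
+        ms = int(seconds * 1000)
+        h, ms = divmod(ms, 3_600_000)
+        m, ms = divmod(ms, 60_000)
+        s, ms = divmod(ms, 1_000)
+        return f"{h:02}:{m:02}:{s:02},{ms:03}"
+
+    blocks = []
+    for i, seg in enumerate(segments, start=1):
+        blocks.append(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text'].strip()}\n")
+    return "\n".join(blocks)
+```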
+
+## Configuration
+
+Edit `config.py` to customize:
+
+```python
+# Model size (tiny, base, small, medium, large)
+DEFAULT_MODEL = "medium"
+
+# Language code
+LANGUAGE_CODE = "fa"  # Farsi
+
+# Supported formats
+SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ...}
+SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ...}
```

## Model Information

-This application uses OpenAI's Whisper model optimized for Farsi:
-- **Model**: medium or large (configurable)
-- **Accuracy**: Optimized for Persian language
-- **Processing**: Local processing (no cloud required)
+### Available Models
+
+| Model  | Parameters | Relative Speed | Accuracy  | VRAM  |
+|--------|------------|----------------|-----------|-------|
+| tiny   | 39M        | ~10x           | Good      | ~1GB  |
+| base   | 74M        | ~7x            | Very Good | ~1GB  |
+| small  | 244M       | ~4x            | Excellent | ~2GB  |
+| medium | 769M       | ~2x            | Excellent | ~5GB  |
+| large  | 1550M      | 1x             | Best      | ~10GB |
+
+**Default**: `medium` (recommended for Farsi)
+
+### Performance Notes
+
+- Larger models provide better accuracy but require more VRAM
+- GPU (CUDA) dramatically speeds up transcription (8-10x faster)
+- First run downloads the model (~500MB-3GB depending on model size)
+- Subsequent runs use cached model files

## Project Structure

```
farsi_transcriber/
-├── ui/ # PyQt6 UI components
-├── models/ # Whisper model management
-├── utils/ # Utility functions
-├── main.py # Application entry point
-├── requirements.txt # Python dependencies
-└── README.md # This file
+├── ui/                        # User interface components
+│   ├── __init__.py
+│   ├── main_window.py         # Main application window
+│   └── styles.py              # Styling and theming
+├── models/                    # Model management
+│   ├── __init__.py
+│   └── whisper_transcriber.py # Whisper wrapper
+├── utils/                     # Utility functions
+│   ├── __init__.py
+│   └── export.py              # Export functionality
+├── config.py                  # Configuration settings
+├── main.py                    # Application entry point
+├── __init__.py                # Package init
+├── requirements.txt           # Python dependencies
+└── README.md                  # This file
```

+## Troubleshooting
+
+### Issue: "ffmpeg not found"
+**Solution**: Install ffmpeg using your package manager (see the Installation section).
+
+### Issue: "CUDA out of memory"
+**Solution**: Switch to a smaller model, or transcribe long recordings in shorter chunks.
+
+### Issue: "Model download fails"
+**Solution**: Check your internet connection and try again. Models are cached in `~/.cache/whisper/`.
+
+### Issue: Slow transcription
+**Solution**: Make sure CUDA is detected (`nvidia-smi`), or switch to a smaller, faster model.
+
+## Advanced Usage
+
+### Custom Model Selection
+
+Update `config.py`:
+```python
+DEFAULT_MODEL = "large"  # For maximum accuracy
+# or
+DEFAULT_MODEL = "tiny"   # For fastest processing
+```
+
+### Batch Processing (Future)
+
+Script to process multiple files:
+```python
+from farsi_transcriber.models.whisper_transcriber import FarsiTranscriber
+
+transcriber = FarsiTranscriber(model_name="medium")
+audio_files = ["interview.mp3", "lecture.mp4"]  # your own recordings
+for audio_file in audio_files:
+    result = transcriber.transcribe(audio_file)
+    # Process results
+```
+
+## Performance Tips
+
+1. **Use GPU** - Ensure NVIDIA CUDA is properly installed
+2. **Choose an appropriate model** - Balance speed vs. accuracy
+3. **Close other applications** - Free up RAM/VRAM
+4. **Use an SSD** - Faster model loading and temporary file I/O
+5. **Local processing** - All processing happens locally; nothing is uploaded to the cloud
+
## Development

-### Running Tests
+### Code Style
+
```bash
-pytest tests/
+# Format code
+black farsi_transcriber/
+
+# Check style
+flake8 farsi_transcriber/
+
+# Sort imports
+isort farsi_transcriber/
```

-### Code Style
-```bash
-black .
-flake8 .
-isort .
-```
+### Future Features
+
+- [ ] Batch processing
+- [ ] Real-time transcription preview
+- [ ] Speaker diarization
+- [ ] Multi-language UI support
+- [ ] Settings dialog
+- [ ] Keyboard shortcuts
+- [ ] Drag-and-drop support
+- [ ] Recent files history

## License

-MIT License - See LICENSE file for details
+MIT License - Personal use and modifications allowed

-## Contributing
+## Acknowledgments

-This is a personal project, but feel free to fork and modify for your needs!
+Built with:
+- [OpenAI Whisper](https://github.com/openai/whisper) - Speech recognition
+- [PyQt6](https://www.riverbankcomputing.com/software/pyqt/) - GUI framework
+- [PyTorch](https://pytorch.org/) - Deep learning
+
+## Support
+
+For issues or suggestions, work through this checklist first (command-line equivalents are shown below):
+1. Check the troubleshooting section
+2. Verify ffmpeg is installed
+3. Ensure Python 3.8+ is used
+4. Check available disk space
+5. Verify CUDA setup (for GPU users)
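+
+A quick way to run through checks 2-5 from a terminal (standard commands; the last one assumes PyTorch is already installed from `requirements.txt`):
+
+```bash
+ffmpeg -version        # ffmpeg present?
+python --version       # Python 3.8 or newer?
+df -h .                # free disk space (Linux/macOS)
+python -c "import torch; print(torch.cuda.is_available())"   # is CUDA visible to PyTorch?
+```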
diff --git a/farsi_transcriber/config.py b/farsi_transcriber/config.py
new file mode 100644
index 0000000..d5bb631
--- /dev/null
+++ b/farsi_transcriber/config.py
@@ -0,0 +1,72 @@
+"""
+Configuration settings for Farsi Transcriber application
+
+Manages model selection, device settings, and other configuration options.
+"""
+
+import os
+from pathlib import Path
+
+# Application metadata
+APP_NAME = "Farsi Transcriber"
+APP_VERSION = "0.1.0"
+APP_DESCRIPTION = "A desktop application for transcribing Farsi audio and video files"
+
+# Model settings
+DEFAULT_MODEL = "medium"  # Options: tiny, base, small, medium, large
+AVAILABLE_MODELS = ["tiny", "base", "small", "medium", "large"]
+MODEL_DESCRIPTIONS = {
+    "tiny": "Smallest model (39M params) - Fastest, ~1GB VRAM required",
+    "base": "Small model (74M params) - Fast, ~1GB VRAM required",
+    "small": "Medium model (244M params) - Balanced, ~2GB VRAM required",
+    "medium": "Large model (769M params) - Good accuracy, ~5GB VRAM required",
+    "large": "Largest model (1550M params) - Best accuracy, ~10GB VRAM required",
+}
+
+# Language settings
+LANGUAGE_CODE = "fa"  # Farsi/Persian
+LANGUAGE_NAME = "Farsi"
+
+# Audio/Video settings
+SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac", ".wma"}
+SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ".flv", ".wmv"}
+
+# UI settings
+WINDOW_WIDTH = 900
+WINDOW_HEIGHT = 700
+WINDOW_MIN_WIDTH = 800
+WINDOW_MIN_HEIGHT = 600
+
+# Output settings
+OUTPUT_DIR = Path.home() / "FarsiTranscriber" / "outputs"
+OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+EXPORT_FORMATS = {
+    "txt": "Plain Text",
+    "srt": "SRT Subtitles",
+    "vtt": "WebVTT Subtitles",
+    "json": "JSON Format",
+    "tsv": "Tab-Separated Values",
+}
+
+# Device settings (auto-detect CUDA if available)
+try:
+    import torch
+
+    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+except ImportError:
+    DEVICE = "cpu"
+
+# Logging settings
+LOG_LEVEL = "INFO"
+LOG_FILE = OUTPUT_DIR / "transcriber.log"
+
+
+def get_model_info(model_name: str) -> str:
+    """Get description for a model"""
+    return MODEL_DESCRIPTIONS.get(model_name, "Unknown model")
+
+
+def get_supported_formats() -> set:
+    """Get all supported audio and video formats"""
+    return SUPPORTED_AUDIO_FORMATS | SUPPORTED_VIDEO_FORMATS
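+
+
+# Illustrative usage only (not required by the application): callers such as the
+# file-selection dialog can filter a chosen path against the supported formats:
+#
+#   from pathlib import Path
+#   if Path(chosen_file).suffix.lower() in get_supported_formats():
+#       ...  # hand the file to the transcriber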