feat: Add comprehensive configuration and documentation

- Create config.py with model, device, and format settings
- Add model descriptions and performance information
- Expand README with detailed installation instructions
- Add troubleshooting section for common issues
- Include advanced usage examples
- Document all export formats and features
- Add performance tips and recommendations
- Phase 6 complete: Full configuration and documentation ready
Claude 2025-11-12 05:13:35 +00:00
parent 72ab2e3fa9
commit efdcf42ffd
2 changed files with 266 additions and 50 deletions

README.md
# Farsi Transcriber

A professional desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.

## Features

✨ **Core Features**

- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, AAC, WMA)
- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, FLV, WMV)
- 🇮🇷 High-accuracy Farsi/Persian language transcription
- ⏱️ Word-level timestamps for precise timing
- 📤 Export to multiple formats (TXT, SRT, VTT, JSON, TSV)
- 💻 Clean, intuitive PyQt6-based GUI
- 🚀 GPU acceleration support (CUDA) with automatic fallback to CPU
- 🔄 Progress indicators and real-time status updates
## System Requirements

**Minimum:**
- Python 3.8 or higher
- 4GB RAM
- ffmpeg installed

**Recommended:**
- Python 3.10+
- 8GB+ RAM
- NVIDIA GPU with CUDA support (optional but faster)
- SSD for better performance

## Installation

### Step 1: Install ffmpeg

Choose your operating system:
**Ubuntu/Debian:**
```bash
sudo apt update && sudo apt install ffmpeg
```

**Fedora/CentOS:**
```bash
sudo dnf install ffmpeg
```

**macOS (Homebrew):**
```bash
brew install ffmpeg
```

**Windows (Chocolatey):**
```bash
choco install ffmpeg
```
**Windows (Scoop):**
```bash
scoop install ffmpeg
```

### Step 2: Set up Python environment

```bash
# Navigate to the repository
cd whisper/farsi_transcriber

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
### Step 3: Install dependencies

```bash
pip install -r requirements.txt
```

This will install:
- PyQt6 (GUI framework)
- openai-whisper (transcription engine)
- PyTorch (deep learning framework)
- NumPy, tiktoken, tqdm (supporting libraries)
## Usage

### Running the Application

```bash
python main.py
```

### Step-by-Step Guide

1. **Launch the app** - Run `python main.py`
2. **Select a file** - Click the "Select File" button to choose an audio/video file
3. **Transcribe** - Click "Transcribe" and wait for completion
4. **View results** - See the transcription with timestamps
5. **Export** - Click "Export Results" to save in your preferred format
### Supported Export Formats
- **TXT** - Plain text (content only)
- **SRT** - SubRip subtitle format (with timestamps)
- **VTT** - WebVTT subtitle format (with timestamps)
- **JSON** - Structured format with segments and metadata
- **TSV** - Tab-separated values (spreadsheet compatible)
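To illustrate how segment timestamps map onto the subtitle formats above, here is a minimal, self-contained sketch; the `result` dict is a stand-in for what Whisper's `transcribe()` returns, and `segments_to_srt` is a hypothetical helper, not the app's actual export code:

```python
# Sketch: converting Whisper-style segments to SRT.
# `result` mimics the dict returned by model.transcribe();
# in the app it would come from the actual Whisper call.

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render a list of {start, end, text} segments as SRT blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

result = {"segments": [{"start": 0.0, "end": 2.5, "text": " سلام"}]}
print(segments_to_srt(result["segments"]))
```

VTT differs mainly in using `.` instead of `,` in timestamps and a `WEBVTT` header, which is why the two formats share most of their export logic.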
## Configuration

Edit `config.py` to customize:

```python
# Model size (tiny, base, small, medium, large)
DEFAULT_MODEL = "medium"

# Language code
LANGUAGE_CODE = "fa"  # Farsi

# Supported formats
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ...}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ...}
```
## Model Information

### Available Models

| Model | Size | Speed | Accuracy | VRAM |
|--------|-------|-------|-----------|-------|
| tiny | 39M | ~10x | Good | ~1GB |
| base | 74M | ~7x | Very Good | ~1GB |
| small | 244M | ~4x | Excellent | ~2GB |
| medium | 769M | ~2x | Excellent | ~5GB |
| large | 1550M | 1x | Best | ~10GB |

**Default**: `medium` (recommended for Farsi)
### Performance Notes
- Larger models provide better accuracy but require more VRAM
- GPU (CUDA) dramatically speeds up transcription (8-10x faster)
- First run downloads the model (~500MB-3GB depending on model size)
- Subsequent runs use cached model files
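The CUDA detection described above can be verified from Python before transcribing anything; a minimal sketch that mirrors the fallback logic in `config.py`:

```python
# Quick device check, mirroring the auto-detection in config.py:
# prefer CUDA when PyTorch reports it, otherwise fall back to the CPU.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:  # PyTorch not installed yet
    device = "cpu"

print(f"Transcription will run on: {device}")
```

If this prints `cpu` on a machine with an NVIDIA GPU, the usual culprit is a CPU-only PyTorch build or a missing CUDA driver (`nvidia-smi` should list the GPU).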
## Project Structure

```
farsi_transcriber/
├── ui/                       # User interface components
│   ├── __init__.py
│   ├── main_window.py        # Main application window
│   └── styles.py             # Styling and theming
├── models/                   # Model management
│   ├── __init__.py
│   └── whisper_transcriber.py  # Whisper wrapper
├── utils/                    # Utility functions
│   ├── __init__.py
│   └── export.py             # Export functionality
├── config.py                 # Configuration settings
├── main.py                   # Application entry point
├── __init__.py               # Package init
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
## Troubleshooting

### Issue: "ffmpeg not found"
**Solution**: Install ffmpeg using your package manager (see the Installation section)

### Issue: "CUDA out of memory"
**Solution**: Use a smaller model, or process the audio in smaller chunks

### Issue: "Model download fails"
**Solution**: Check your internet connection and try again. Models are cached in `~/.cache/whisper/`

### Issue: Slow transcription
**Solution**: Ensure CUDA is detected (`nvidia-smi`), or switch to a smaller, faster model
## Advanced Usage
### Custom Model Selection
Update `config.py`:
```python
DEFAULT_MODEL = "large" # For maximum accuracy
# or
DEFAULT_MODEL = "tiny" # For fastest processing
```
### Batch Processing (Future)

Script to process multiple files:

```python
from farsi_transcriber.models.whisper_transcriber import FarsiTranscriber

transcriber = FarsiTranscriber(model_name="medium")

for audio_file in audio_files:
    result = transcriber.transcribe(audio_file)
    # Process results
```
## Performance Tips
1. **Use GPU** - Ensure NVIDIA CUDA is properly installed
2. **Choose appropriate model** - Balance speed vs accuracy
3. **Close other applications** - Free up RAM/VRAM
4. **Use SSD** - Faster model loading and temporary file I/O
5. **Local processing** - All processing happens locally, no cloud uploads
## Development

### Code Style

```bash
# Format code
black farsi_transcriber/

# Check style
flake8 farsi_transcriber/

# Sort imports
isort farsi_transcriber/
```

### Future Features

- [ ] Batch processing
- [ ] Real-time transcription preview
- [ ] Speaker diarization
- [ ] Multi-language UI support
- [ ] Settings dialog
- [ ] Keyboard shortcuts
- [ ] Drag-and-drop support
- [ ] Recent files history
## License

MIT License - Personal use and modifications allowed

## Acknowledgments

Built with:
- [OpenAI Whisper](https://github.com/openai/whisper) - Speech recognition
- [PyQt6](https://www.riverbankcomputing.com/software/pyqt/) - GUI framework
- [PyTorch](https://pytorch.org/) - Deep learning
## Support
For issues or suggestions:
1. Check the troubleshooting section
2. Verify ffmpeg is installed
3. Ensure Python 3.8+ is used
4. Check available disk space
5. Verify CUDA setup (for GPU users)

config.py
"""
Configuration settings for Farsi Transcriber application
Manages model selection, device settings, and other configuration options.
"""
import os
from pathlib import Path
# Application metadata
APP_NAME = "Farsi Transcriber"
APP_VERSION = "0.1.0"
APP_DESCRIPTION = "A desktop application for transcribing Farsi audio and video files"
# Model settings
DEFAULT_MODEL = "medium" # Options: tiny, base, small, medium, large
AVAILABLE_MODELS = ["tiny", "base", "small", "medium", "large"]
MODEL_DESCRIPTIONS = {
"tiny": "Smallest model (39M params) - Fastest, ~1GB VRAM required",
"base": "Small model (74M params) - Fast, ~1GB VRAM required",
"small": "Medium model (244M params) - Balanced, ~2GB VRAM required",
"medium": "Large model (769M params) - Good accuracy, ~5GB VRAM required",
"large": "Largest model (1550M params) - Best accuracy, ~10GB VRAM required",
}
# Language settings
LANGUAGE_CODE = "fa" # Farsi/Persian
LANGUAGE_NAME = "Farsi"
# Audio/Video settings
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".aac", ".wma"}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ".flv", ".wmv"}
# UI settings
WINDOW_WIDTH = 900
WINDOW_HEIGHT = 700
WINDOW_MIN_WIDTH = 800
WINDOW_MIN_HEIGHT = 600
# Output settings
OUTPUT_DIR = Path.home() / "FarsiTranscriber" / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
EXPORT_FORMATS = {
"txt": "Plain Text",
"srt": "SRT Subtitles",
"vtt": "WebVTT Subtitles",
"json": "JSON Format",
"tsv": "Tab-Separated Values",
}
# Device settings (auto-detect CUDA if available)
try:
import torch
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
DEVICE = "cpu"
# Logging settings
LOG_LEVEL = "INFO"
LOG_FILE = OUTPUT_DIR / "transcriber.log"
def get_model_info(model_name: str) -> str:
"""Get description for a model"""
return MODEL_DESCRIPTIONS.get(model_name, "Unknown model")
def get_supported_formats() -> set:
"""Get all supported audio and video formats"""
return SUPPORTED_AUDIO_FORMATS | SUPPORTED_VIDEO_FORMATS