# Farsi Transcriber
A professional desktop application for transcribing Farsi audio and video files using OpenAI's Whisper model.
## Features

### ✨ Core Features
- 🎙️ Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, AAC, WMA)
- 🎬 Extract audio from video files (MP4, MKV, MOV, WebM, AVI, FLV, WMV)
- 🇮🇷 High-accuracy Farsi/Persian language transcription
- ⏱️ Word-level timestamps for precise timing
- 📤 Export to multiple formats (TXT, SRT, VTT, JSON, TSV)
- 💻 Clean, intuitive PyQt6-based GUI
- 🚀 GPU acceleration support (CUDA) with automatic fallback to CPU
- 🔄 Progress indicators and real-time status updates
## System Requirements

**Minimum:**
- Python 3.8 or higher
- 4GB RAM
- ffmpeg installed
**Recommended:**
- Python 3.10+
- 8GB+ RAM
- NVIDIA GPU with CUDA support (optional but faster)
- SSD for better performance
## Installation

### Step 1: Install ffmpeg

Choose your operating system:
**Ubuntu/Debian:**
```bash
sudo apt update && sudo apt install ffmpeg
```

**Fedora/CentOS:**
```bash
sudo dnf install ffmpeg
```

**macOS (Homebrew):**
```bash
brew install ffmpeg
```

**Windows (Chocolatey):**
```bash
choco install ffmpeg
```

**Windows (Scoop):**
```bash
scoop install ffmpeg
```
### Step 2: Set up Python environment

```bash
# Navigate to the repository
cd whisper/farsi_transcriber

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
### Step 3: Install dependencies

```bash
pip install -r requirements.txt
```
This will install:
- PyQt6 (GUI framework)
- openai-whisper (transcription engine)
- PyTorch (deep learning framework)
- NumPy, tiktoken, tqdm (supporting libraries)
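To confirm the environment is ready, you can run a quick check from Python. This is an optional sketch (not part of the application) that only verifies the dependencies import, that ffmpeg is on your PATH, and whether CUDA is available:

```python
# Optional environment check (run inside the activated virtual environment)
import shutil

import torch
import whisper  # provided by the openai-whisper package

print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)
print("CUDA available:", torch.cuda.is_available())  # the app falls back to CPU if False
print("Whisper package imported successfully")
```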
## Usage

### Running the Application

```bash
python main.py
```
### Step-by-Step Guide

1. **Launch the app** - Run `python main.py`
2. **Select a file** - Click the "Select File" button to choose an audio/video file
3. **Transcribe** - Click "Transcribe" and wait for completion
4. **View results** - See the transcription with timestamps
5. **Export** - Click "Export Results" to save in your preferred format
### Supported Export Formats
- TXT - Plain text (content only)
- SRT - SubRip subtitle format (with timestamps)
- VTT - WebVTT subtitle format (with timestamps)
- JSON - Structured format with segments and metadata
- TSV - Tab-separated values (spreadsheet compatible)
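For reference, the timestamped formats consist of numbered cues with start/end times. A single SRT cue looks like this (timing and text are illustrative):

```
1
00:00:01,000 --> 00:00:04,500
(transcribed Farsi text for this segment)
```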
## Configuration

Edit `config.py` to customize:

```python
# Model size (tiny, base, small, medium, large)
DEFAULT_MODEL = "medium"

# Language code
LANGUAGE_CODE = "fa"  # Farsi

# Supported formats
SUPPORTED_AUDIO_FORMATS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ...}
SUPPORTED_VIDEO_FORMATS = {".mp4", ".mkv", ".mov", ".webm", ".avi", ...}
```
## Model Information

### Available Models
| Model | Parameters | Relative Speed | Accuracy | VRAM |
|---|---|---|---|---|
| tiny | 39M | ~10x | Good | ~1GB |
| base | 74M | ~7x | Very Good | ~1GB |
| small | 244M | ~4x | Excellent | ~2GB |
| medium | 769M | ~2x | Excellent | ~5GB |
| large | 1550M | 1x | Best | ~10GB |
Default: medium (recommended for Farsi)
### Performance Notes
- Larger models provide better accuracy but require more VRAM
- GPU (CUDA) dramatically speeds up transcription (8-10x faster)
- First run downloads the model (~500MB-3GB depending on model size)
- Subsequent runs use cached model files
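Because models are cached after the first download, you can also warm the cache ahead of time. A minimal sketch using the standard `whisper` API (the application may load the model through its own wrapper):

```python
# Download and cache the model once (stored under ~/.cache/whisper/)
import whisper

model = whisper.load_model("medium")  # first call downloads, later calls reuse the cache
print("Loaded model on device:", next(model.parameters()).device)
```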
## Project Structure

```
farsi_transcriber/
├── ui/                        # User interface components
│   ├── __init__.py
│   ├── main_window.py         # Main application window
│   └── styles.py              # Styling and theming
├── models/                    # Model management
│   ├── __init__.py
│   └── whisper_transcriber.py # Whisper wrapper
├── utils/                     # Utility functions
│   ├── __init__.py
│   └── export.py              # Export functionality
├── config.py                  # Configuration settings
├── main.py                    # Application entry point
├── __init__.py                # Package init
├── requirements.txt           # Python dependencies
└── README.md                  # This file
```
## Troubleshooting

**Issue: "ffmpeg not found"**
Solution: Install ffmpeg with your package manager (see the Installation section).

**Issue: "CUDA out of memory"**
Solution: Switch to a smaller model, or split long recordings and transcribe them in chunks.

**Issue: Model download fails**
Solution: Check your internet connection and retry. Models are cached in `~/.cache/whisper/`.

**Issue: Slow transcription**
Solution: Confirm the GPU is detected (`nvidia-smi`), or switch to a smaller, faster model.
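To confirm from Python whether PyTorch actually detects the GPU, a minimal check:

```python
# Quick GPU check; the application falls back to CPU when no CUDA device is found
import torch

if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device detected; transcription will run on the CPU")
```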
## Advanced Usage

### Custom Model Selection

Update `config.py`:

```python
DEFAULT_MODEL = "large"  # For maximum accuracy
# or
DEFAULT_MODEL = "tiny"   # For fastest processing
```
### Batch Processing (Future)

Script to process multiple files:

```python
import glob
from farsi_transcriber.models.whisper_transcriber import FarsiTranscriber

transcriber = FarsiTranscriber(model_name="medium")

# audio_files can be any collection of paths; the glob pattern is just an example
audio_files = sorted(glob.glob("recordings/*.mp3"))
for audio_file in audio_files:
    result = transcriber.transcribe(audio_file)
    # Process results
```
## Performance Tips

- **Use a GPU** - Ensure NVIDIA CUDA is properly installed
- **Choose an appropriate model** - Balance speed against accuracy
- **Close other applications** - Free up RAM/VRAM
- **Use an SSD** - Faster model loading and temporary file I/O
- **Local processing** - All processing happens locally; nothing is uploaded to the cloud
## Development

### Code Style

```bash
# Format code
black farsi_transcriber/

# Check style
flake8 farsi_transcriber/

# Sort imports
isort farsi_transcriber/
```
## Future Features
- Batch processing
- Real-time transcription preview
- Speaker diarization
- Multi-language support UI
- Settings dialog
- Keyboard shortcuts
- Drag-and-drop support
- Recent files history
## License
MIT License - Personal use and modifications allowed
## Acknowledgments
Built with:
- OpenAI Whisper - Speech recognition
- PyQt6 - GUI framework
- PyTorch - Deep learning
## Support
For issues or suggestions:
- Check the troubleshooting section
- Verify ffmpeg is installed
- Ensure Python 3.8+ is used
- Check available disk space
- Verify CUDA setup (for GPU users)