Frontend:
- Initialize React 18 + TypeScript project with Vite
- Implement complete App.tsx matching Figma design
- Add dark/light theme toggle support
- Create file queue management UI
- Implement search with text highlighting
- Add segment copy functionality
- Create reusable UI components (Button, Progress, Input, Select)
- Configure Tailwind CSS v4.0 for styling
- Setup window resizing functionality
- Implement RTL support for Farsi text
Backend:
- Create Flask API server with CORS support
- Implement /transcribe endpoint for audio/video processing
- Add /models endpoint for available models info
- Implement /export endpoint for multiple formats (TXT, SRT, VTT, JSON)
- Setup Whisper model integration
- Handle file uploads with validation
- Format transcription results with timestamps
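The timestamp formatting and SRT export listed above could look roughly like the sketch below. It assumes each segment is a dict with `start`, `end` (seconds) and `text` keys; the helper names `format_timestamp` and `to_srt` are hypothetical and not taken from the backend code.

```python
def format_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Render seconds as HH:MM:SS,mmm (SRT style) or HH:MM:SS.mmm (VTT style)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Build SRT text from segments (assumed keys: 'start', 'end', 'text')."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # Each cue: running index, time range, text, then a blank separator line.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

Passing `decimal_marker="."` to `format_timestamp` gives the VTT-style timestamp, so the same helper can serve both subtitle formats.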
Configuration:
- Setup Vite dev server with API proxy
- Configure Tailwind CSS with custom colors
- Setup TypeScript strict mode
- Add PostCSS with autoprefixer
- Configure Flask for development
Documentation:
- Write comprehensive README with setup instructions
- Include API endpoint documentation
- Add troubleshooting guide
- Include performance tips
Everything is ready to run: `npm install && npm run dev` starts the frontend and `python backend/app.py` starts the backend.
- Create config.py with model, device, and format settings
- Add model descriptions and performance information
- Expand README with detailed installation instructions
- Add troubleshooting section for common issues
- Include advanced usage examples
- Document all export formats and features
- Add performance tips and recommendations
- Phase 6 complete: Full configuration and documentation ready
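A settings module along the lines described above might be sketched as follows. All names, defaults and descriptions here are illustrative assumptions; the actual config.py may be organized differently.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; not copied from the real config.py.
MODEL_DESCRIPTIONS = {
    "tiny": "fastest, lowest accuracy",
    "base": "fast, suitable for quick drafts",
    "small": "balanced speed and accuracy",
    "medium": "slower, higher accuracy",
    "large": "slowest, highest accuracy",
}
DEVICES = ("auto", "cuda", "cpu")
EXPORT_FORMATS = ("txt", "srt", "vtt", "json")


@dataclass(frozen=True)
class TranscriberConfig:
    """Model, device and export format settings, validated on construction."""

    model: str = "medium"
    device: str = "auto"
    export_format: str = "txt"

    def __post_init__(self) -> None:
        if self.model not in MODEL_DESCRIPTIONS:
            raise ValueError(f"unknown model: {self.model!r}")
        if self.device not in DEVICES:
            raise ValueError(f"unknown device: {self.device!r}")
        if self.export_format not in EXPORT_FORMATS:
            raise ValueError(f"unknown export format: {self.export_format!r}")
```

Validating in `__post_init__` means a typo in a setting fails when the config is created instead of partway through a transcription.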
- Create styles.py module with comprehensive stylesheet
- Implement color palette and typography configuration
- Apply consistent styling across all UI elements
- Improve button, text input, and progress bar appearance
- Use monospace font for transcription results display
- Add hover and active states for interactive elements
- Phase 5 complete: Professional UI styling applied
- Create FarsiTranscriber class wrapping OpenAI's Whisper model
- Support both audio and video file formats
- Implement word-level timestamp extraction
- Add device detection (CUDA/CPU) for optimal performance
- Format results for display with timestamps
- Integrate transcriber with PyQt6 worker thread
- Add error handling and progress updates
- Phase 3 complete: Core transcription engine ready
- Implement MainWindow class with professional layout
- Add file picker for audio and video formats
- Create transcription button with threading support
- Add progress bar and status indicators
- Implement TranscriptionWorker thread to prevent UI freezing
- Add results display with timestamps support
- Create export button (placeholder for Phase 4)
- Add error handling and user feedback
- Phase 2 complete: Full GUI scaffolding ready
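The worker-thread idea behind TranscriptionWorker can be illustrated without PyQt6. The sketch below swaps in the standard library's `threading` and `queue` modules in place of a Qt thread with signals, so it shows the pattern only and not the project's actual class; `run_in_worker` is a hypothetical helper name.

```python
import queue
import threading
from typing import Callable


def run_in_worker(task: Callable[[], str], events: queue.Queue) -> threading.Thread:
    """Run `task` on a background thread and report the outcome through `events`."""

    def _target() -> None:
        try:
            events.put(("finished", task()))
        except Exception as exc:
            # Report failures as an event so the UI side can show an error message.
            events.put(("error", str(exc)))

    thread = threading.Thread(target=_target, daemon=True)
    thread.start()
    return thread
```

The caller keeps handling UI events and polls `events` periodically; in a Qt application the two event kinds would correspond to signals emitted by the worker.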
* Fix: Update torch.load to use weights_only=True to prevent security warning
* Update __init__.py
---------
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* Update triton kernel using _unsafe_update_src
* support old triton versions
* refactored changes to update triton kernel only once
* Update triton_ops.py
---------
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
Co-authored-by: Jong Wook Kim <ilikekjw@gmail.com>
* Bugfix: Illogical "Avoid computing higher temperatures on no_speech"
Bugfix for https://github.com/openai/whisper/pull/1279
The check also treats a segment as "silence" when decoding has failed due to `compression_ratio_threshold`, even though further down the code that case is no longer treated as "silence".
A segment should count as "silence" only when decoding has failed due to `logprob_threshold`.
As described here:
8bc8860694/whisper/transcribe.py (L421)
And in the code here:
8bc8860694/whisper/transcribe.py (L243-L251)
* Fix if "logprob_threshold=None"
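The intended decision logic can be sketched as a standalone function. This is a simplified reading of the fix described above, not the code from transcribe.py; the function name `needs_fallback` and its flat argument list are assumptions for illustration.

```python
from typing import Optional


def needs_fallback(
    compression_ratio: float,
    avg_logprob: float,
    no_speech_prob: float,
    compression_ratio_threshold: Optional[float] = 2.4,
    logprob_threshold: Optional[float] = -1.0,
    no_speech_threshold: Optional[float] = 0.6,
) -> bool:
    """Return True if decoding should be retried at a higher temperature."""
    fallback = False
    if (
        compression_ratio_threshold is not None
        and compression_ratio > compression_ratio_threshold
    ):
        fallback = True  # output is too repetitive
    # With logprob_threshold=None this check is disabled entirely.
    low_logprob = logprob_threshold is not None and avg_logprob < logprob_threshold
    if low_logprob:
        fallback = True  # average log probability is too low
    if (
        no_speech_threshold is not None
        and no_speech_prob > no_speech_threshold
        and low_logprob
    ):
        fallback = False  # silence: only when the logprob check failed as well
    return fallback
```

With this ordering, a high `no_speech_prob` alone no longer cancels a fallback that was triggered by the compression ratio check.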
---------
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* Add option to carry initial_prompt with the sliding window
Add an option `carry_initial_prompt = False` to `whisper.transcribe()`.
When set to `True`, `initial_prompt` is prepended to each internal `decode()` call's `prompt`.
If there is not enough context space at the start of the prompt, the prompt is left-sliced to make space.
* Prevent redundant initial_prompt_tokens
* Revert unnecessary .gitignore change
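The prompt assembly described for `carry_initial_prompt` might look like the sketch below. `build_prompt` and `max_prompt_len` are hypothetical names, and the handling of an initial prompt that alone exceeds the budget is an assumption, not something stated in the commit.

```python
def build_prompt(
    initial_prompt_tokens: list[int],
    history_tokens: list[int],
    max_prompt_len: int,
) -> list[int]:
    """Prepend the initial prompt and keep only the most recent history that fits."""
    if max_prompt_len <= 0:
        raise ValueError("max_prompt_len must be positive")
    room = max_prompt_len - len(initial_prompt_tokens)
    if room <= 0:
        # Assumption: when the initial prompt fills the budget, keep its tail only.
        return initial_prompt_tokens[-max_prompt_len:]
    # Left-slice the history: drop the oldest tokens, keep the newest `room` tokens.
    return initial_prompt_tokens + history_tokens[-room:]
```

For example, with a budget of 4 tokens, an initial prompt `[1, 2]` and history `[3, 4, 5, 6]`, the result keeps the initial prompt and the two most recent history tokens.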
---------
Co-authored-by: Kittsil <kittsil@gmail.com>
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* Relax triton requirements for compatibility with pytorch 2.4 and newer
Similar to https://github.com/openai/whisper/pull/1802, but now that PyTorch has upgraded to 2.4, it requires triton==3.0.0. I am not sure whether it makes sense to remove the upper-bound version constraints.
* Update requirements.txt
* Update audio.py
The `mel_filters` function uses `np.load` to load a pre-computed mel filterbank matrix. `np.load` is not thread-safe, so calling it from multiple threads at the same time may corrupt the data.
To fix this, use `torch.load` instead. It is thread-safe, so it will not corrupt the data when called from multiple threads at the same time.
* Update audio.py
updated the docstring
* allow_pickle=False
* newline
---------
Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
Co-authored-by: Jong Wook Kim <jongwook@openai.com>
* ADD parser for new argument --max_words_count
* ADD max_words_count in words_options
ADD warning for max_line_width compatibility
* ADD logic for max_words_count
* rename to max_words_per_line
* make them kwargs
* allow specifying file path by --model
* black formatting
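The effect of `max_words_per_line` can be illustrated with a small grouping helper. This is a sketch of the idea only, working on plain strings; the subtitle writer itself works with word timestamps, and `split_into_lines` is a hypothetical name.

```python
def split_into_lines(words: list[str], max_words_per_line: int) -> list[str]:
    """Group words into lines that hold at most `max_words_per_line` words each."""
    if max_words_per_line <= 0:
        raise ValueError("max_words_per_line must be positive")
    return [
        " ".join(words[i : i + max_words_per_line])
        for i in range(0, len(words), max_words_per_line)
    ]
```

Calling `split_into_lines("a b c d e".split(), 2)` yields three lines, the last holding the single leftover word.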
---------
Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>