Unedited videos are full of verbal disfluencies ("huh", "uh", "erm", "um") and long pauses when the speaker is thinking of what to say. Editing such videos manually is tedious and time-consuming.
- Removes glaring disfluencies and hesitations.
- Reduces the duration of pauses without cuts.
- Returns a final cut that is cleaner and shorter than the original .mp4 upload.
- Some of your computer's CPU processing resources
- Time
You can find the deployed app here.
| Video size/MB | Video duration/min | Final video duration/min | Audio analysis time/min | Video editing time/min |
|---|---|---|---|---|
| 21 | 1.04 | 0.53 | 0.4 | 11 |
| 130 | 6.44 | 5.04 | 3.53 | 73 |
- Sign in with Google.
- Click Browse to upload a .mp4 video (max size: 100 MB).
- Click Analyze Video to start video analysis.
- Click Clean Video when the progress bar reaches 50% to start video processing. Colored bars will appear below the video to indicate the type of speech (speech, hesitation, pause) that occurred in each part of the video's timeframe. The bar on the right visualises the cleaned state without hesitations and long pauses (see the sketch after this list).
- Click Download when the progress bar reaches 100%.
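The cleaning step itself boils down to a single pass over the labelled segments. Below is a minimal TypeScript sketch of that idea, not the app's actual code: the `Segment` shape and the `MAX_PAUSE_SECONDS` threshold are hypothetical.

```ts
// Sketch: given labelled segments from the audio analysis, decide which time
// ranges to keep. Hesitations are dropped entirely; pauses are trimmed to a
// maximum length; speech is kept as-is.
type SegmentType = "speech" | "hesitation" | "pause";

interface Segment {
  type: SegmentType;
  start: number; // seconds
  end: number;   // seconds
}

const MAX_PAUSE_SECONDS = 0.5; // hypothetical threshold

function keepRanges(segments: Segment[]): Segment[] {
  const kept: Segment[] = [];
  for (const seg of segments) {
    if (seg.type === "hesitation") continue; // cut "um"/"uh" entirely
    if (seg.type === "pause") {
      // shorten long pauses instead of removing them outright
      kept.push({ ...seg, end: Math.min(seg.end, seg.start + MAX_PAUSE_SECONDS) });
    } else {
      kept.push(seg);
    }
  }
  return kept;
}
```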
- [ ] Fix login bugs
- [ ] Nicer login UI
- [ ] Port from Next.js + Firebase to Heroku
See the open issues for a full list of proposed features (and known issues).
Next.js was chosen on a whim, as we wanted to explore frameworks other than Create React App. Unexpectedly, Vercel serverless functions have a timeout of 15 s, which cuts the request off before a transcription result is received from IBM Watson Speech to Text.
The timeout was extended to 540 s when the IBM Speech to Text call was moved to Firebase Cloud Functions. This is sufficient for proof-of-concept tests of short videos under 9 minutes (the current approach).
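For reference, here is a minimal sketch of requesting the longer timeout with the 1st-gen `firebase-functions` API; the function name `transcribe` and the handler body are placeholders, not this project's actual code.

```ts
// Sketch: raise the Cloud Functions timeout to its 540 s maximum.
import * as functions from "firebase-functions";

export const transcribe = functions
  .runWith({ timeoutSeconds: 540, memory: "1GB" })
  .https.onRequest(async (req, res) => {
    // ...call IBM Watson Speech to Text here (placeholder)...
    res.status(200).send("ok");
  });
```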
To overcome timeouts entirely, a Create React App frontend, an Express backend, and the IBM asynchronous API would be more suitable. The resulting pipeline would be much simpler than the one shown above.
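A rough sketch of what that alternative could look like with the `ibm-watson` Node SDK behind Express; the routes, file path, and environment variable names are assumptions for illustration only.

```ts
// Sketch: submit an asynchronous recognition job and poll it, so no single
// HTTP request has to outlive the transcription.
import express from "express";
import SpeechToTextV1 from "ibm-watson/speech-to-text/v1";
import { IamAuthenticator } from "ibm-watson/auth";
import fs from "fs";

const stt = new SpeechToTextV1({
  authenticator: new IamAuthenticator({ apikey: process.env.STT_APIKEY ?? "" }),
  serviceUrl: process.env.STT_URL ?? "",
});

const app = express();

// Kick off an asynchronous recognition job and return its id immediately.
app.post("/jobs", async (_req, res) => {
  const { result } = await stt.createJob({
    audio: fs.createReadStream("upload.mp3"), // placeholder path
    contentType: "audio/mp3",
    timestamps: true,
  });
  res.json({ jobId: result.id });
});

// Poll the job; results are attached once its status is "completed".
app.get("/jobs/:id", async (req, res) => {
  const { result } = await stt.checkJob({ id: req.params.id });
  res.json(result);
});

app.listen(3001);
```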
We started using this API because a very early idea for succinct cut was to edit videos by deleting words from a transcript. The idea was simplified, as that was hard to accomplish in two weeks. We nevertheless proceeded with the API because it gives us timestamps of hesitations, which, depending on the video, can be an insignificant feature.
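For illustration, hesitation timestamps can be extracted from a Watson Speech to Text response that was requested with `timestamps: true` (Watson labels filler words as `%HESITATION`); the `SttResponse` type and `hesitationRanges` helper below are hypothetical, not code from this repo.

```ts
// Sketch: pull hesitation time ranges out of a Watson Speech to Text response.
interface SttResponse {
  results: {
    alternatives: { timestamps?: [string, number, number][] }[];
  }[];
}

function hesitationRanges(response: SttResponse): [number, number][] {
  const ranges: [number, number][] = [];
  for (const result of response.results) {
    // Each timestamp entry is [word, startSeconds, endSeconds].
    for (const [word, start, end] of result.alternatives[0]?.timestamps ?? []) {
      if (word === "%HESITATION") ranges.push([start, end]);
    }
  }
  return ranges;
}
```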
Had we decided from the start to process the video based only on pauses/long silences, IBM Watson Speech to Text would not be needed. AudioContext could be used to detect silences, and the lengthy audio-analysis stage could be shortened.
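A sketch of that alternative using the Web Audio API in the browser; the `findSilences` helper and its thresholds are illustrative, not code from this repo.

```ts
// Sketch: detect long silent stretches client-side by scanning decoded
// audio samples for low amplitude, avoiding any transcription service.
async function findSilences(
  file: File,
  threshold = 0.01,    // amplitude below this counts as silence
  minSilenceSec = 1.0  // only report pauses at least this long
): Promise<[number, number][]> {
  const ctx = new AudioContext();
  const buffer = await ctx.decodeAudioData(await file.arrayBuffer());
  const data = buffer.getChannelData(0);
  const silences: [number, number][] = [];
  let start: number | null = null;

  for (let i = 0; i < data.length; i++) {
    const quiet = Math.abs(data[i]) < threshold;
    if (quiet && start === null) start = i;
    if (!quiet && start !== null) {
      if ((i - start) / buffer.sampleRate >= minSilenceSec) {
        silences.push([start / buffer.sampleRate, i / buffer.sampleRate]);
      }
      start = null;
    }
  }
  return silences;
}
```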
Although the chosen methods were less than ideal, there was a lot to learn from deploying on Vercel, Firebase, and IBM Watson Speech to Text.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
Jia En - @ennnm_ - jiaen.1sc4@gmail.com

Shen Nan - @wongsn - wongshennan@gmail.com
