This is a brilliant and useful application of LLM technology; I'm impressed.
One question: on the backend, is it downloading each video's CC (closed-caption) transcript and feeding that into a tuned prompt? And what happens for videos where the transcript is missing? I ask because I've noticed CC is occasionally unavailable for some YouTube videos.
If you cared to have a fallback, a potentially interesting experiment / solution for such cases is to download the video, extract the audio to a WAV file, then run the audio through Whisper [1] to generate the transcript. On a CPU it will still be incredibly intensive and slow, generally not much faster than real time (e.g. a 5-minute clip will take on the order of ~5 minutes to transcribe). However, with Whisper running on a fancy GPU it is insanely faster, on the order of 100-200x, meaning even for long videos the transcript will be ready in only a few seconds.
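For concreteness, a minimal sketch of that pipeline might look like the following (assuming yt-dlp for the download and ffmpeg on PATH for the WAV conversion; the function name and model size are just illustrative):

    import yt_dlp
    import whisper

    def transcribe_fallback(url: str) -> str:
        # Download only the audio track and convert it to WAV
        # (yt-dlp delegates the conversion to ffmpeg, which must be on PATH)
        ydl_opts = {
            "format": "bestaudio/best",
            "outtmpl": "audio.%(ext)s",
            "postprocessors": [{
                "key": "FFmpegExtractAudio",
                "preferredcodec": "wav",
            }],
        }
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            ydl.download([url])

        # "base" is a small, fast checkpoint; larger models are more
        # accurate but dramatically slower on CPU
        model = whisper.load_model("base")
        result = model.transcribe("audio.wav")
        return result["text"]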
Great job @aka_sh!
[1] https://github.com/openai/whisper
p.s. Is there any chance you'd open source your code? Or do you plan to turn this into a business? The code itself is exactly a huge moat, and it'd be cool to see how you did this. Cheers.
p.p.s. The stepify.tech app is currently crashing out to a Heroku error page when I try to submit a YT link.
Thank you!
I'm getting the transcript through an API and feeding it to GPT. For now, the fallback for videos with no captions is just to make something out of the video's description.
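Roughly, the flow looks like this (sketched with the youtube-transcript-api package as a stand-in, since the actual API isn't named here, and the helper name is made up):

    from youtube_transcript_api import (
        YouTubeTranscriptApi,
        NoTranscriptFound,
        TranscriptsDisabled,
    )

    def get_source_text(video_id: str, description: str) -> str:
        try:
            # Each caption segment is a dict with "text", "start", "duration"
            segments = YouTubeTranscriptApi.get_transcript(video_id)
            return " ".join(seg["text"] for seg in segments)
        except (TranscriptsDisabled, NoTranscriptFound):
            # No captions available: fall back to the video description
            return description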
I really appreciate the suggestion; I'll experiment with Whisper. As for open source vs. business, I don't really know yet. I'll probably lean towards the business side to cover the costs and see where this goes.
And sorry for the downtime! API credits ran out; it should be fixed by now.
Eek, so many typos in my comment - but the most egregious was where I meant to convey that the code itself is not a huge moat. Still, no worries if you don't want to give it away; I totally understand.
Definitely try Whisper on the extracted audio as a fallback, and don't forget there are alternative implementations like faster-whisper that might be slightly less accurate but are far less resource intensive; since you're not publishing the captions themselves, you don't need every word to be literally perfect.
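For example, a sketch with faster-whisper (the model size and compute type are placeholder values you'd want to tune):

    from faster_whisper import WhisperModel

    # int8 quantization keeps memory and CPU use low, at a small accuracy
    # cost that's acceptable when the captions aren't published verbatim
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, info = model.transcribe("audio.wav")
    transcript = " ".join(segment.text for segment in segments)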
Here's an example implementation you may find interesting (it also includes snapshots and links back to the original video): https://github.com/Yannael/video2blogpost