That is what they are working on, but that needs high-quality training data: htt...

posguy · on Aug 9, 2020

Mozilla DeepSpeech is trained on about 2,000 hours of audio, most of it spoken by American males. It has little ability to handle noise or accents and has a 5.97% Word Error Rate on LibriVox (which is noiseless, plainly spoken English).
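For anyone unfamiliar with the metric: Word Error Rate is the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the model output, divided by the number of reference words. A minimal sketch in Python follows; the function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, not DeepSpeech's own evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and does no text
    normalization (casing, punctuation), which real evaluations usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

On this definition, a 5.97% WER means roughly six word-level errors per hundred reference words.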

Meanwhile, Google, Microsoft & IBM have tons of fresh audio coming in constantly to use in augmenting their models.

Baidu was able to build a competitive English speech-to-text model with 5,000 hours of quality audio to train against.

Mozilla did create Common Voice to address this serious data gap, but it has only collected 1,492 hours of validated English audio: https://commonvoice.mozilla.org/en/datasets

scrollaway · on Aug 9, 2020

There are tens of millions of hours of video content out there that have been subtitled pretty well, and I'd wager a lot of it is under licenses Mozilla could use. Has that been considered?

est31 · on Aug 9, 2020

If you know such sources, file an issue, and better yet, download the video content yourself and publish a dataset.

But note that raw video content is not training data. It has to be segmented into parts short enough for training (a few seconds each), the subtitles have to be aligned to match precisely what is said, and the data has to be balanced: if, for example, 90% of speakers are men and 10% are women, you have a problem.
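To make two of those steps concrete, here is a rough Python sketch that keeps only subtitle cues short enough to serve as training clips and then reports how the remaining audio is split between speaker groups. Everything in it is an assumption for illustration: the `Cue` record, the `speaker_gender` label (subtitle files normally carry no such metadata), and the 1 to 10 second bounds. It does not do the alignment step; it simply trusts the subtitle timestamps, which in practice would need forced alignment against the audio.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle cue. Hypothetical record, not any real subtitle format."""
    start: float          # seconds from the start of the video
    end: float            # seconds from the start of the video
    text: str
    speaker_gender: str   # assumed to come from separate metadata


def usable_clips(cues, min_seconds=1.0, max_seconds=10.0):
    """Keep cues that have text and whose duration fits the assumed bounds."""
    return [
        cue for cue in cues
        if cue.text.strip() and min_seconds <= cue.end - cue.start <= max_seconds
    ]


def audio_share_by_gender(clips):
    """Fraction of total clip duration contributed by each speaker group."""
    seconds = Counter()
    for clip in clips:
        seconds[clip.speaker_gender] += clip.end - clip.start
    total = sum(seconds.values())
    if total == 0:
        return {}
    return {gender: duration / total for gender, duration in seconds.items()}


if __name__ == "__main__":
    cues = [
        Cue(0.0, 3.0, "first line", "male"),
        Cue(3.0, 6.0, "second line", "male"),
        Cue(6.0, 9.0, "third line", "male"),
        Cue(9.0, 10.0, "fourth line", "female"),
        Cue(10.0, 45.0, "a long monologue", "male"),  # too long, dropped
        Cue(45.0, 47.0, "   ", "female"),             # no text, dropped
    ]
    clips = usable_clips(cues)
    print(len(clips), audio_share_by_gender(clips))
```

A share as skewed as the 90/10 split in the sample data is the kind of imbalance that would need fixing, by collecting more audio or resampling, before training.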

scrollaway · on Aug 10, 2020

Gotcha, all excellent points.