Mozilla DeepSpeech is trained on about 2000 hours of audio, mostly spoken by American males. It has little ability to handle noise or accents and has a 5.97% Word Error Rate on LibriVox (which is noiseless, plainly spoken English).
Meanwhile, Google, Microsoft & IBM have tons of fresh audio coming in constantly to use in augmenting their models.
Baidu was able to build a competitive English speech-to-text model with 5000 hours of quality audio to train against.
There are tens of millions of hours of video content out there that have been subtitled pretty well, and I'd wager a lot of it is under licenses Mozilla could use. Has that been considered?
If you know such sources, file an issue, and better yet, download the video content yourself and publish a dataset.
But note that raw video content is not training data. It has to be segmented into parts short enough for training (a few seconds each), the subtitles have to be aligned to match precisely what's said, and the data has to be balanced: if, say, 90% of speakers are men and 10% are women, you have a problem.
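To make the segmentation and balancing steps concrete, here is a rough Python sketch. It is not how DeepSpeech prepares its data. The `Cue` type, the `speaker_group` label, and the 1 to 10 second limits are all made up for illustration. It also assumes the subtitle timings are already accurate, which skips the hard part (forced alignment of text to audio).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle cue. `speaker_group` is a hypothetical label you would
    have to obtain elsewhere; subtitle files do not normally carry it."""
    start: float  # seconds
    end: float    # seconds
    text: str
    speaker_group: str


def parse_timestamp(ts: str) -> float:
    """Convert an SRT-style timestamp like '00:01:02,500' to seconds."""
    hms, millis = ts.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000


def usable_clips(cues, min_len=1.0, max_len=10.0):
    """Keep cues that have text and whose duration is within the
    (assumed) training limits."""
    return [
        c for c in cues
        if c.text.strip() and min_len <= c.end - c.start <= max_len
    ]


def group_shares(cues):
    """Fraction of total clip duration contributed by each speaker group."""
    totals = Counter()
    for c in cues:
        totals[c.speaker_group] += c.end - c.start
    whole = sum(totals.values())
    return {group: dur / whole for group, dur in totals.items()} if whole else {}
```

Running `group_shares(usable_clips(cues))` over a candidate dataset gives a quick read on how lopsided it is before anyone spends time training on it.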
This is just another way of gathering that data. (If consented to.)