Of course this is a continuation of things people have been trying for decades at this point, rather than something fundamentally new, but it brings up a point a colleague and I discussed a decade ago about training something like this on large data sets: you will tend to find common idioms rather than nominally best ones. In many scenarios that may make little to no difference, but clearly not in all of them. It's likely to gravitate toward lowest-common-denominator solutions.
One example of where this can be a problem is numerics: most software developers don't understand it and routinely do questionable things. I'm curious what effort the authors have put into mitigating this problem.
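To illustrate the kind of thing I mean (my own example, not from the paper): the textbook one-liner for variance, E[x²] − E[x]², shows up constantly in real codebases and is exactly the sort of common idiom a model trained on bulk code would learn, yet it suffers catastrophic cancellation when the mean is large relative to the spread. Welford's one-pass update is the better-known stable alternative.

```python
def naive_variance(xs):
    # The common idiom: E[x^2] - E[x]^2. Subtracts two huge,
    # nearly equal numbers, so most significant digits cancel.
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

def welford_variance(xs):
    # Welford's one-pass algorithm: updates a running mean and a
    # running sum of squared deviations, avoiding the cancellation.
    mean, m2 = 0.0, 0.0
    for i, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / i
        m2 += delta * (x - mean)
    return m2 / len(xs)

# Small spread around a large mean; the true variance is 2/3.
data = [1e8, 1e8 + 1.0, 1e8 + 2.0]
print(naive_variance(data))    # badly wrong here due to cancellation
print(welford_variance(data))  # close to 2/3
```

With a larger offset the naive version can even go negative, which then blows up anything downstream that takes a square root of it.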