You're wrong. The people building the models don't write CUDA kernels. The people optimizing the models write CUDA kernels. And you don't need to know a bunch of ML bs to optimize kernels. Source: I optimize GPU kernels. I don't make 500k, but I'm not that far from it.
How much of a performance difference is there between writing a kernel in a high-level language/framework like PyTorch (torch.compile) or Triton and hand-optimizing it? Are you writing kernels in PTX?
What's your opinion on the future of writing optimized GPU code/kernels - how long before compilers are as good as or better than (most) humans writing hand-optimized PTX?
Heh, I'm in the wrong business then. Interesting. It used to be that game programmers spent lots of time optimizing non-ML CUDA code. They didn't make anything like 500k at the time. I wonder what the ML industry has done to game development, or for that matter to scientific programming. Wow.