Hm. So that helps with high-frequency noise. Any progress on what to do when the dimensions are of vastly different scales? I have an old physics engine which had to solve about 20-value nonlinear differential equations. During a collision, the equations go stiff, and some dimensions may be 10 orders of magnitude steeper than others. Gradient descent then faces very steep knife edges. This is called a "stiff system" numerically.
Author here - I believe the problem of a "stiff system" you're referring to is exactly the problem of pathological curvature!
Some points not touched on in the article: if the individual dimensions are of different scales, this problem can be easily fixed with a diagonal preconditioner. Even something like Adam or Adagrad (unconventional, I know, in this domain) can be used.
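To make the diagonal preconditioner idea concrete, here is a minimal sketch on a toy problem. The quadratic, the curvature values in `h`, and the helper `descend` are all made up for illustration; they are not from the article or from the physics engine discussed above.

```python
# Badly scaled quadratic f(x) = 0.5 * sum(h[i] * x[i]**2).
# The curvatures in h are illustrative: one direction is 10^6 times steeper.
h = [1.0, 1.0e6]

def grad(x):
    """Gradient of f: each coordinate is scaled by its own curvature."""
    return [hi * xi for hi, xi in zip(h, x)]

def descend(x0, step, precond, iters):
    """Gradient descent with per-coordinate scaling: x -= step * precond * grad."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * pi * gi for xi, pi, gi in zip(x, precond, g)]
    return x

x0 = [1.0, 1.0]

# Plain gradient descent: the step is limited by the steepest direction
# (it must stay below 2 / max(h)), so the shallow coordinate barely moves.
plain = descend(x0, step=1.0 / max(h), precond=[1.0, 1.0], iters=100)

# Diagonal (Jacobi) preconditioner: divide each coordinate by its own curvature.
# For this separable quadratic, a single unit step lands on the minimum.
scaled = descend(x0, step=1.0, precond=[1.0 / hi for hi in h], iters=1)

print("plain :", plain)
print("scaled:", scaled)
```

This only works so cleanly because the toy Hessian is exactly diagonal. When the steep directions are not aligned with the coordinate axes, a diagonal scaling helps less.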
There's also a small industry around more sophisticated preconditioners for the linear systems in PDEs; see multigrid, for example, or preconditioned conjugate gradient.
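For reference, a rough sketch of preconditioned conjugate gradient with the simplest (Jacobi, i.e. diagonal) preconditioner. The 2x2 matrix is a made-up example with badly mismatched scales, and the function names are hypothetical. A real PDE solver would use sparse matrices and a stronger preconditioner such as multigrid.

```python
def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A using
    conjugate gradient with a Jacobi (diagonal) preconditioner."""
    n = len(b)
    m_inv = [1.0 / A[i][i] for i in range(n)]   # inverse of diag(A)
    x = [0.0] * n
    r = list(b)                                  # residual b - A x, with x = 0
    z = [mi * ri for mi, ri in zip(m_inv, r)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:     # relative residual test
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [mi * ri for mi, ri in zip(m_inv, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

# Illustrative system whose exact solution is [1, 1].
A = [[1.0e6, 1.0],
     [1.0, 2.0]]
b = [1.0e6 + 1.0, 3.0]
x, iterations = pcg(A, b)
print(x, iterations)
```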
The stiffness may be local. It definitely is in a physical simulation for hard collisions. Machine learning data is usually normalized into [0..1], so if you get a really steep slope, something is pathological.
I'm not an expert on anything covered in the article, but we have a similar physics-based model at my work (complex nonlinear equations), and we use a technique called Sequential Quadratic Programming (SQP) to find an optimal solution. My understanding is that this gives better results than gradient descent but will only work if the functions are continuous.
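For anyone unfamiliar with SQP, here is a bare-bones sketch of the core iteration on a toy equality-constrained problem: find the point on the unit circle closest to (2, 1). The problem, the starting point, and the helper names are invented for illustration and have nothing to do with the model described above. Production SQP codes add line searches, inequality handling, and Hessian approximations on top of this.

```python
def solve(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [v] for row, v in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y

def sqp(x, y, lam=0.0, iters=20, tol=1e-12):
    """Minimize f = (x-2)^2 + (y-1)^2 subject to c = x^2 + y^2 - 1 = 0.
    Each iteration solves the quadratic subproblem's KKT system exactly."""
    for _ in range(iters):
        gx, gy = 2.0 * (x - 2.0), 2.0 * (y - 1.0)   # gradient of f
        c = x * x + y * y - 1.0                      # constraint value
        ax, ay = 2.0 * x, 2.0 * y                    # gradient of c
        h = 2.0 + 2.0 * lam                          # Hessian of f + lam*c is h * I
        kkt = [[h, 0.0, ax],
               [0.0, h, ay],
               [ax, ay, 0.0]]
        dx, dy, lam = solve(kkt, [-gx, -gy, -c])     # step and new multiplier
        x, y = x + dx, y + dy
        if abs(dx) + abs(dy) < tol:
            break
    return x, y, lam

x_opt, y_opt, lam_opt = sqp(1.0, 0.5)
print(x_opt, y_opt, lam_opt)
```

Note that each step uses first and second derivatives of both the objective and the constraint, which is where the smoothness requirement comes from.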