This is something I am interested in and considering as a possible future for my career, coming from a math background and starting to dip my toes into ML.
One thing I worry about: how useful is it actually going to be to create a rigorous mathematical framework?
I am afraid it might end up like mathematical physics, where the field is almost a century behind theoretical and experimental physics and playing catch-up.
What do others think?
One thing to consider is the distinction between deep neural networks as mathematical objects and machine learning as currently practiced.
Lately, there have been quite a few theories of neural networks as ideal nonlinear approximators [1]. Similarly, people have shown many ways in which gradient descent tends to reach a global optimum of a regularized curve-fitting objective [2]. Which is to say, if your development cycle is gather data, train, test, deploy, we know this approximates the data almost ideally; you can't really do much better than a deep network.
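(To make the convex toy version of that concrete -- this is my own numpy sketch, not anything from [2] -- here is gradient descent on a regularized least-squares objective, where the global optimum really is reachable; the deep, non-convex case is exactly where the cited results do the heavy lifting.)

```python
import numpy as np

# Toy, convex stand-in for "regularized curve fitting": ridge regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

lam = 0.1        # regularization strength
lr = 0.01        # learning rate
w = np.zeros(5)

def loss(w):
    # Mean squared error plus an L2 penalty.
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w
    w -= lr * grad

print(loss(w))   # converges to the unique global optimum, since the objective is convex
```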
But we know that in practice, when deployed, deep neural networks actually have many limitations (compared to our intuitions, or to human performance, etc.). There are some obvious explanations. Of course, they're limited by our ability to gather data and by the biases of that data. But even more, they're limited to situations where you have large chunks of unchanging data that you can extrapolate from.
Given that deep networks are more or less perfect for the train-test-deploy cycle, it seems like the problem is with the cycle itself. And it's easy to see that human beings somehow act "intelligently" without using this cycle. So figuring out an alternative to it might be something to look at.
imo, instead of developing new theory entirely from the ground up, it's more useful to address & work in the context of understanding longstanding open problems/phenomena. Theoretical insight and the framework should follow.
> This is something I am interested in and considering as a possible future for my career, coming from a math background and starting to dip my toes into ML.
This is a great time and place to be! Neural networks are the twenty-first-century Fourier series; it's just that we don't yet understand them. We can easily run them (synthesis), but we are missing the analysis. There's a lot of math to do here.
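To make the analogy concrete, here is a toy numpy sketch (my own illustration, nothing more): for Fourier series we have both halves -- synthesis builds the signal, and analysis (here, the FFT) hands the coefficients back. For deep nets we currently only have the synthesis half.

```python
import numpy as np

# Synthesis: build a signal from known Fourier coefficients.
t = np.linspace(0, 1, 256, endpoint=False)
signal = 2.0 * np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)

# Analysis: recover the amplitudes again via the FFT.
amps = 2 * np.abs(np.fft.rfft(signal)) / len(t)
print(np.round(amps[[3, 7]], 3))   # [2.0, 0.5] -- the amplitudes at frequencies 3 and 7
```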
> This is something I am interested in and considering as a possible future for my career, coming from a math background and starting to dip my toes into ML.
I'm in the exact same position.
I think mathematical explanations and (substantiated) intuition for the things that practitioners discover would be very useful. Maybe an all-explaining grand theory of deep learning is possible.
Relate it to free-energy minimization; that's very hot right now. Read Smolensky's 1986 article in PDP on the harmonium; it was the first restricted Boltzmann machine.
Yes, that's arguably the dominant paradigm. It's interesting because of the relationship to thermodynamics -- again, check out Smolensky's paper in PDP; he was a postdoc (along with Geoff Hinton) at UCSD's cognitive science department, run by Don Norman.
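For anyone who wants to see the object in question, here's a minimal sketch (my own toy code, not taken from Smolensky's paper) of the free energy of a binary restricted Boltzmann machine; training pushes this quantity down on the data.

```python
import numpy as np

def rbm_free_energy(v, a, b, W):
    # F(v) = -a.v - sum_j log(1 + exp(b_j + (W v)_j)),
    # the standard free energy of a binary RBM with visible bias a,
    # hidden bias b, and weight matrix W (hidden x visible).
    return -(a @ v) - np.logaddexp(0.0, b + W @ v).sum()

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
a = rng.normal(size=n_visible)
b = rng.normal(size=n_hidden)
W = rng.normal(size=(n_hidden, n_visible))

v = rng.integers(0, 2, size=n_visible).astype(float)  # one binary visible vector
print(rbm_free_energy(v, a, b, W))
```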
The whole pipeline from PyTorch or TensorFlow and Python to LLVM, to GPUs or TPUs, is absolutely rigorous. Much more rigorous, in fact, than normal, hand-written mathematics as you find it in, e.g., a typical Annals of Mathematics publication or a mathematical textbook!
I think what you really have in mind is a simple model of modern deep learning that is not fully accurate, but still useful. Let me argue by analogy: you are looking for something that is to deep learning what the lambda-calculus is to the Haskell compiler.
One of the main simplifications in programming language theory is replacing finite-precision arithmetic (which is painfully complex) with mathematical integers and real numbers (which are much simpler). Would a theory of deep learning based on mathematical reals be similarly valuable? The stunning success of floating-point formats like bfloat16 [1] suggests otherwise, since arithmetic precision in deep learning is closely connected to important learning phenomena such as overfitting and regularisation.
I am tempted to be provocative and say that you are really looking for less rigour!
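To illustrate just how coarse that arithmetic is, here's a small sketch (my own example, assuming PyTorch and its bfloat16 support) comparing bfloat16 with float32:

```python
import torch

# bfloat16 keeps float32's exponent range but only 7 explicit mantissa bits
# (roughly 2-3 decimal digits), so small increments get rounded away.
x32 = torch.tensor(1.0 + 1e-3, dtype=torch.float32)
print(x32.item())                       # ~1.001
print(x32.to(torch.bfloat16).item())    # rounds back to 1.0

# Accumulating many small "gradient-like" updates makes the effect stark.
acc32 = torch.zeros((), dtype=torch.float32)
acc16 = torch.zeros((), dtype=torch.bfloat16)
for _ in range(10_000):
    acc32 += 1e-4
    acc16 += 1e-4                       # additions below the precision are lost

print(acc32.item())   # close to 1.0
print(acc16.item())   # stalls far below 1.0
```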
We don't know which model architecture will work for which problem, or why it would or wouldn't. We experiment until we find something that works, and can sometimes try to guess why it did. But none of this knowledge is formalized in a way that can reliably predict performance on future problems. We engineer solutions to individual problems, but don't build a rigorous body of knowledge that helps us with the next ones.
The problem you describe (knowing which model architecture will work for which problem) is not due to a lack of rigour, but to the Turing-complete expressive power of neural networks.
What you can do is use the existing rigour to derive stronger properties for restricted classes of NNs (just as you can prove stronger properties for simple subsystems of the lambda-calculus), and that is an interesting field of study.
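As a toy example of the kind of property you can get for a restricted class (my own sketch, with random placeholder weights rather than a trained model): for a feed-forward ReLU network, the product of the layers' spectral norms provably upper-bounds the network's Lipschitz constant.

```python
import numpy as np

# Placeholder weights for a 3-layer feed-forward ReLU network.
rng = np.random.default_rng(0)
weights = [
    rng.normal(size=(16, 8)),   # layer 1: 8 -> 16
    rng.normal(size=(8, 16)),   # layer 2: 16 -> 8
    rng.normal(size=(1, 8)),    # layer 3: 8 -> 1
]

bound = 1.0
for W in weights:
    bound *= np.linalg.norm(W, 2)   # spectral norm = largest singular value

print(bound)   # any inputs x, y satisfy |f(x) - f(y)| <= bound * ||x - y||
```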
> The problem you describe (knowing which model architecture will work for which problem) is not due to a lack of rigour, but to the Turing-complete expressive power of neural networks.
Non-recurrent neural networks are not Turing-complete in any sense.
In the sense that for non-trivial applications, we struggle to define what approach is successful and why it's successful without leaning on empirical metrics.
That's not due to a lack of rigour, but to complexity!
You cannot, in general, predict arbitrary program properties without running the programs; that's the essence of Rice's theorem [1]. And neural networks are extremely expressive: recurrent NNs are Turing complete, and by the famous Universal Approximation Theorems even feed-forward NNs can approximate essentially any function. That expressive power precludes general and simple mathematical "silver bullets" that would help you overcome those challenges.
I think you're confusing rigorous processes with rigorous theory. The analogy of steam engines seems appropriate: they were initially built with (sort of) rigorous processes for metal forming, joining, and so on. It wasn't just people tinkering in their backyards. Yet that was done without the theory of thermodynamics; those rigorous engineers didn't know a Helmholtz free energy from a free-energy machine. Once we had thermodynamics, it didn't directly influence how to form metal parts or lubricate pistons, but it did put hard constraints on the possible performance of engines, and it made it easier to predict how an engine would perform before building it, even outside the range of what had already been tested -- e.g. jet and rocket engines, which required high confidence in the theory's predictions before anyone would invest the money needed to develop working models in the middle of wartime scarcity.
Not the OP, but I've built several ML pipelines in the last few years. Even if every step in the process is very rigorous and uses solid software, there are still challenges. ML models are very difficult (or impossible) to test; the usual testing approaches don't work at all for ML features.
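For a flavour of what an ML "test" typically degrades into, here's a hypothetical sketch (synthetic labels and a stand-in model, purely illustrative): you assert that an aggregate metric on held-out data clears a threshold, rather than asserting exact outputs.

```python
import numpy as np

# Purely illustrative: synthetic labels and a stand-in "model" so this runs on
# its own. A real pipeline would load a trained model and a held-out set.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=1000)       # held-out labels (synthetic)
preds = np.zeros_like(y_val)                # stand-in model: always predicts 0

ACCURACY_FLOOR = 0.45                       # threshold picked for this toy setup

def test_model_accuracy():
    accuracy = np.mean(preds == y_val)
    # The assertion is statistical, not exact: retraining or re-splitting the
    # data can legitimately move this number, which is exactly why conventional
    # exact-output unit tests don't fit ML features.
    assert accuracy >= ACCURACY_FLOOR

test_model_accuracy()
```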