Of course there is. All the building blocks that people are mixing and matching in networks nowadays were introduced at some point.
The papers that introduced batch norm, adaptive instance norm, attention heads, or any other module used in a network have an extensive discussion of the motivation for their existence, some derivation or proof that they do what you want, and an empirical test to show it helps in practice. The reason some losses allow GANs to converge in certain situations while others don't isn't a complete mystery; there is theory that supports this.
Researchers designing new models are considering weak points in old approaches, identifying why they aren't working correctly, and proposing something new that solves a part of the problem. All of this is done by looking at the math behind all the operations in the network (or at least the parts relevant to a certain question).
That nobody really knows how AI works is one of those myths told by the media. Just because the model weights aren't interpretable doesn't mean we don't know why that model works well. It just takes quite a bit of maths knowledge to really understand state-of-the-art models. All that knowledge is also neatly packaged into modern frameworks that make it easy to use without a deep knowledge of why it works. All of this contributes to the feeling that nobody really knows what's going on, while in reality it's only the majority of people that don't know what's going on ;)
> nobody really knows how AI works is one of those myths told by the media
It's not a myth. No one really understands how neural networks work. We don't know why a particular model works well. Or why any model works well. For example, no one can answer why NNs generalize so well even when they have enough learning capacity to memorize all the training examples. We can guess, but we don't know for sure. Most of the proofs you see in papers are there as filler, so that the papers seem more convincing. We rarely can prove anything mathematically about NNs that has any practical value or leads to any breakthrough in understanding.
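To make the memorization point concrete, here is a minimal sketch (assuming PyTorch; the sizes, labels, and hyperparameters are illustrative, not from the thread) that fits a small MLP to completely random labels:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1000 random "images" with completely random labels -- no structure to learn.
X = torch.randn(1000, 784)
y = torch.randint(0, 10, (1000,))

model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# With enough steps the training accuracy typically approaches 1.0 even though
# the labels are pure noise -- i.e. the network has the capacity to memorize.
acc = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"train accuracy on random labels: {acc:.2f}")
```

Nothing here is learnable in any meaningful sense, yet the training error can usually be driven close to zero, which is exactly the generalization puzzle: the same capacity that memorizes noise somehow also generalizes on real data.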
If we really understood how NNs work, we wouldn't need to do expensive hyperparameter searches - we would have a way to determine the optimal ones given a particular architecture and training data. And we wouldn't need to do expensive architecture searches either, yet the best of the latest convnets have been found through NAS (e.g. EfficientNet), and there's very little math involved in the process - it's pretty much just random search.
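As an aside, the "expensive search" being described really is mostly trial and error. A minimal random-search sketch looks something like this (the search space and the `train_and_evaluate` placeholder are hypothetical, just to show the shape of the loop):

```python
import random

# Hypothetical search space -- in practice this is whatever knobs you have.
search_space = {
    "lr": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [32, 64, 128],
    "depth": [2, 3, 4],
    "width": [256, 512, 1024],
}

def sample_config():
    return {name: random.choice(values) for name, values in search_space.items()}

def train_and_evaluate(config):
    # Placeholder: in reality this would train a model with `config` and
    # return validation accuracy. A dummy score keeps the sketch runnable.
    return random.random()

best_config, best_score = None, float("-inf")
for _ in range(50):                      # budget: 50 full training runs
    config = sample_config()
    score = train_and_evaluate(config)   # each call is an expensive training run
    if score > best_score:
        best_config, best_score = config, score
```

There is no theory in the loop: the only guidance is which configurations happened to score well.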
Funny you mentioned the batchnorm paper - we still don't know why batchnorm is so effective. The paper gave an explanation (internal covariate shift reduction) which was later shown to be wrong (batchnorm does not reduce it), then several other explanations were suggested (smoother loss surface, easier gradient flow, etc.), but we still don't know for sure. Pretty much every good idea in the NN field is the result of lots of experimentation, good intuition developed in the process, looking at how a brain does it, and practical constraints. And yes, sometimes we're looking at the equations, and thinking hard, and sometimes we see a better way to do stuff. But usually it starts with empirical tests, and if they're successful, some math is used in the attempt to explain things. Not the other way around.
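It may help to separate what batchnorm does (which is simple and undisputed) from why it helps (the open question above). A minimal NumPy sketch of the training-time forward pass, with the learned scale gamma and shift beta:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm over a batch of activations x with shape (N, C).

    Normalizes each feature to zero mean / unit variance across the batch,
    then applies a learned scale (gamma) and shift (beta). This is the whole
    mechanism -- the debate is about *why* it helps, not *what* it does.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(64, 128) * 5 + 3          # a batch with shifted, scaled statistics
y = batch_norm_forward(x, gamma=np.ones(128), beta=np.zeros(128))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # per-feature mean ~0, std ~1
```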
NNs are currently at a similar point as where physics was before Newton and before calculus.
> NNs are currently at a similar point as where physics was before Newton and before calculus.
I'm more inclined to compare with the era after Newton and Leibniz, but prior to the development of rigorous analysis. If you look at this time period, the analogy fits a bit better IMO -- you have a proliferation of people using calculus techniques to great advantage for solving practical problems, but no real foundations propping the whole thing up (e.g., no definition of a limit, continuity, notions of how to deal with infinite series, etc.).
Maybe. On the other hand, maybe a rigorous mathematical analysis of NNs is as useful as a rigorous mathematical analysis of computer architectures - not very useful. Maybe all you need is to keep scaling it up, adding some clever optimizations in the process (none of the great CPU ideas like caches, pipelining, out-of-order execution, branch prediction, etc. came from rigorous mathematical analysis).
Or maybe it's as useful as a rigorous mathematical analysis of a brain - again, not very useful, because for us (people who develop AI systems), it would be far more valuable to understand a brain on a circuit level, or an architecture level, rather than on a mathematical theory level. The latter would be interesting, but probably too complex to be useful, while the former would most likely lead to dramatic breakthroughs in terms of performance and capabilities of the AI systems.
So maybe we just need to keep doing what we have been doing in DL field in the last 10 years - trying/revisiting various ideas, scaling them up, and evolving the architectures the same way we've been evolving our computers for the last 100 years, with the hope there will be more clues from neuroscience. I think we just need more ideas like transformers, capsules, or neural Turing machines, and computers that are getting ~20% faster every year.
Actually, I read the batch norm paper and maybe I forgot important details, but it roughly went like this: "here we add a term `b` to make sure the mean values of Ax+b are zero and that will help us with convergence; ah, and here is a covariance matrix!" - but with no quantitative proof of how much that convergence is helped. Now, I intuitively agree that shifting the mean value to zero should help, but math taught me that there is a huge difference between a seemingly correct statement and its proof. The ML papers seem to just state these seemingly correct ideas without a real, proof-backed understanding of why they work. In other words, ML is entirely about empirical results, peppered with math-like terminology. But don't take my blunt writing style personally.
Let's take the simplest example: recognizing grayscale 30x80 pictures of the digits 0-9. IIRC, this is called the MNIST example and can be done by my cat in 1 hour without prior knowledge. Let's choose probably the simplest model: 2400 inputs fully connected to a 1024-unit layer, which is fully connected to a 10-unit output. And let's use relu at both steps. We know that this kinda works and converges quickly. In particular, after T steps we get error E(T), and E(1e6) < 0.03 (a random guess). Can you tell me how T and E will change if we add another layer, 2400->1024->1024->10, using the same relu? Same question, but now we replace relu with tanh: 2400->1024->10.
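To pin down the three models in the question, here is a sketch (assuming PyTorch; the layer sizes are taken from the comment, and placing the nonlinearity after the hidden layers only, with the 10 output logits left linear, is my reading of "relu at both steps"):

```python
import torch.nn as nn

# Baseline from the comment: 2400 -> 1024 -> 10 with relu.
baseline = nn.Sequential(
    nn.Linear(2400, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

# Variant 1: one extra hidden layer, 2400 -> 1024 -> 1024 -> 10, same relu.
deeper = nn.Sequential(
    nn.Linear(2400, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

# Variant 2: same shape as the baseline, but tanh instead of relu.
tanh_version = nn.Sequential(
    nn.Linear(2400, 1024), nn.Tanh(),
    nn.Linear(1024, 10),
)
```

Whether theory can predict, before training anything, how T and E(T) differ between these three is exactly the question being asked; in practice people just train all three and look.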
I think you and the person you're responding to might have slightly different expectations behind what level of rigor counts as "math", just like how physicists and theoretical mathematicians often have somewhat different ideas about rigor.
My impression is that obviously ML is guided by math and people want to have an understanding of why some things converge and others don't. But "in the field" many people just mess around with different set-ups and see what works (especially in deep learning). Maybe theory follows to explain why it worked. I think you're right that a lot of progress in the field is based on intuition and some reasoning (e.g. trying something like an inception network) more than derivations that show that a particular set-up should be successful. I get the impression that most low-level components are pretty well understood, but when they are stacked and combined it gets more complicated.
I would be very curious to see a video of your cat solving MNIST in 1 hour!