umutisik's comments | Hacker News

The number of resume submissions from here is not large. It's not at all like on LinkedIn.


This is very impressive. I once built an educational Haskell programming + math + art web site (mathvas.com). Something like this would have simplified that a lot.


Cannot paste into the editor (Safari on iPhone).


As a former professional mathematician: the benefits mentioned in the article (click-through definitions and statements, analyzing meta trends, version control, ...) do not seem particularly valuable.

The reason to formalize mathematics is to automate mathematical proofs and the production of mathematical theory.


You still need the other 80% of the folks to get the remaining 20% of the work done :)


He should add “Economists confidently making wishful-thinking-based proclamations without a shred of evidence or even a plausible logical path” to the list!


It does make me uneasy that founding companies is becoming another tracked thing that people feel they can apply high-IQ pattern-matching to. Especially when startups are where first-principles thinking is the most valuable.

That being said, I can see this being useful to a lot of kids. Certainly beats going to grad school for someone who wants to start a company. Keep in mind that a 500K SAFE doesn't force a founder to go big or zero-out.


Data doesn't actually live on a manifold. That's an approximation used for thinking about data. The near-total majority, if not 100%, of the useful things done in deep learning have come from not thinking about topology in any way. Deep learning is not applied anything; it's an empirical field advanced mostly by trial and error and, sure, a few intuitions coming from theory (and that theory was not topology).


I disagree with this wholeheartedly. Sure, there is lots of trial and error, but it’s more an amalgamation of theory from many areas of mathematics, including but not limited to: topology, geometry, game theory, calculus, and statistics. The very foundation (i.e. back-propagation) is just the chain rule applied to the weights. The difference is that deep learning has become such an accessible (read: profitable) field that many practitioners have the luxury of learning the subject without having to learn the origins of the formalisms, ultimately allowing them to utilize or “reinvent” theories and techniques, often without knowing they have been around in other fields for much longer.
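To make the chain-rule point concrete, here is a minimal NumPy sketch (purely illustrative, not anyone's production code): the gradient of a squared loss through one linear layer plus a sigmoid is just the chain rule applied factor by factor, and it matches a finite-difference check.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W = rng.normal(size=(1, 3))     # weights of a single linear layer
    x = rng.normal(size=(3,))       # one input example
    y = np.array([1.0])             # target

    z = W @ x                       # pre-activation
    y_hat = sigmoid(z)              # forward pass
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Back-propagation = chain rule, one factor per step
    dL_dyhat = y_hat - y            # d loss / d y_hat
    dyhat_dz = y_hat * (1 - y_hat)  # d sigmoid / d z
    dL_dz = dL_dyhat * dyhat_dz
    dL_dW = np.outer(dL_dz, x)      # d z / d W contributes x

    # Finite-difference sanity check on one weight
    eps = 1e-6
    W_pert = W.copy()
    W_pert[0, 0] += eps
    loss_pert = 0.5 * np.sum((sigmoid(W_pert @ x) - y) ** 2)
    print(dL_dW[0, 0], (loss_pert - loss) / eps)  # the two numbers should agree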


None of the major aspects of deep learning came from manifolds though.

It is primarily linear algebra, calculus, probability theory, and statistics; secondarily, you could add something like information theory for ideas like entropy, loss functions, etc.

But really, if "manifolds" had never been invented/conceptualized, we would still have deep learning now; the concept has made essentially zero impact on the practical technology we are all using every day.
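To ground the information-theory point above (entropy, loss functions): the cross-entropy loss used everywhere in deep learning is exactly Shannon's cross-entropy between the label distribution and the model's predicted distribution. A minimal sketch, assuming a one-hot label:

    import numpy as np

    def softmax(logits):
        logits = logits - logits.max()   # subtract max for numerical stability
        e = np.exp(logits)
        return e / e.sum()

    def cross_entropy(logits, label_index):
        """H(p, q) = -sum_i p_i * log q_i; with p one-hot this is just -log q[label]."""
        q = softmax(logits)
        return -np.log(q[label_index])

    print(cross_entropy(np.array([2.0, 0.5, -1.0]), label_index=0))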


Loss landscapes can be viewed as manifolds. Adagrad/ADAM adjust SGD to better fit the local geometry and are widely used in practice.
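For reference, a bare-bones sketch of the Adam update (the usual formulation with the conventional default hyperparameters; nothing here is specific to any particular library): running first and second moment estimates rescale the raw gradient, which is the "adapting to the local geometry" being referred to.

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Exponential moving averages of the gradient and its elementwise square
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction for the zero-initialized moments (t starts at 1)
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Per-coordinate step size ~ lr / sqrt(v_hat): larger where gradients have been small
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v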


Can you give an example where theories and techniques from other fields were reinvented? I would be genuinely interested in concrete examples. Such "reinventions" happen quite often in science, so to some degree this would be expected.


The Bethe ansatz is one. It took a tour de force by Yedidia to recognize that loopy belief propagation is computing the stationary point of the Bethe approximation to the free energy.

Many statistical thermodynamics ideas were reinvented in ML.

The same is true for mirror descent. It was independently discovered by Warmuth and his students as Bregman-divergence proximal minimization, or, as a special case would have it, exponentiated gradient algorithms.

One can keep going.


The connections of deep learning to stat-mech and thermodynamics are really cool.

It's led me to wonder about the origin of the probability distributions in stat-mech. Physical randomness is mostly a fiction (outside maybe quantum mechanics) so probability theory must be a convenient fiction. But objectively speaking, where then do the probabilities in stat-mech come from? So far, I've noticed that the (generalised) Boltzmann distribution serves as the bridge between probability theory and thermodynamics: It lets us take non-probabilistic physics and invent probabilities in a useful way.
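For concreteness, that bridge is the maximum-entropy construction: among all distributions over micro-states with a prescribed mean energy, the Gibbs/Boltzmann distribution is the one of maximal entropy,

    p_i = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_j e^{-\beta E_j}, \qquad \beta = \frac{1}{k_B T}

i.e. it maximizes S = -k_B \sum_i p_i \ln p_i subject to \sum_i p_i E_i = \langle E \rangle and \sum_i p_i = 1. The probabilities enter only through those constraints, not through any claim of physical randomness.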


In Boltzmann's formulation of stat-mech it comes from the assumption that when a system is in "equilibrium", all the micro-states that are consistent with the macro-state are equally occupied. That's the basis of the theory. A prime mover is thermal agitation.

It can be circular if one defines equilibrium to be that situation when all the micro-states are equally occupied. One way out is to define equilibrium in temporal terms - when the macro-states are not changing with time.
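In symbols, that equal-occupation postulate is the microcanonical entropy (standard notation, added here just for reference),

    S = k_B \ln \Omega(E)

where \Omega(E) counts the micro-states compatible with the observed macro-state.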


The Bayesian reframing of that would be that when all you have measured is the macrostate, and you have no further information by which to assign a higher probability to any compatible microstate than any other, you follow the principle of indifference and assign a uniform distribution.


Yes indeed, thanks for pointing this out. There are strong relationships between max-ent and Bayesian formulations.

For example, one can use a non-uniform prior over the micro-states. If that prior happens to be in the Darmois-Koopman family, that implicitly means there are some constraints, not explicitly stated, that bind the micro-state statistics.
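For reference, the Darmois-Koopman(-Pitman) characterization is of exponential families, which can be written in the standard textbook form (nothing here is specific to the parent's setting)

    p(x \mid \eta) = h(x) \, \exp\left( \eta^{\top} T(x) - A(\eta) \right)

and a maximum-entropy distribution subject to constraints on the expectations of the sufficient statistics T(x) has exactly this form - which is why such a prior implicitly encodes constraints on the micro-state statistics.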


One might add 8-16-bit training and quantization. Also, computing semi-unreliable values with error correction. Such tricks have been used in embedded software development on MCUs for some time.
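A minimal sketch of the int8 round-trip being referred to (symmetric linear quantization; purely illustrative - real low-bit training schemes add per-channel scales, stochastic rounding, etc.):

    import numpy as np

    def quantize_int8(x):
        """Map a float tensor to int8 plus a scale factor (symmetric quantization)."""
        max_abs = np.max(np.abs(x))
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q, scale):
        return q.astype(np.float32) * scale

    x = np.random.randn(5).astype(np.float32)
    q, s = quantize_int8(x)
    print(x)
    print(dequantize_int8(q, s))  # agrees with x up to quantization error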


I mean the entire domain of systems control is being reinvented by deep RL: system identification, stability, robustness, etc.


Good one. Slightly different focus but they really are the same topic. Historically, Control Theory has focused on stability and smooth dynamics while RL has traditionally focused on convergence of learning algorithms in discrete spaces.


> a few intuitions coming from theory (that was not topology).

I think these 'intuitions' are an after-the-fact thing, meaning AFTER deep learning comes up with a method, researchers in other fields of science notice the similarities between the deep learning approach and their (possibly decades-old) methods. Here's an example where the author discovers that GPT is really the same kind of computational problem he has solved in physics before:

https://ondrejcertik.com/blog/2023/03/fastgpt-faster-than-py...


I beg to differ. It's complete hyperbole to suggest that the article said "it's the same problem as something in physics", given this statement:

     It seems that the bottleneck algorithm in GPT-2 inference is matrix-matrix multiplication. For physicists like us, matrix-matrix multiplication is very familiar, *unlike other aspects of AI and ML* [emphasis mine]. Finding this familiar ground inspired us to approach GPT-2 like any other numerical computing problem.
Note: matrix-matrix multiplication is basic mathematics; it's not remotely interesting as physics.


Agreed.

Although, to try to see it from the author’s perspective, it is pulling tools out of the same (extremely well developed and studied in its own right) toolbox as computational physics does. It is a little funny, although not too surprising, that a computational physics guy would look at some linear algebra code and immediately see the similarity.

Edit: actually, thinking a little more, it is basically absurd to believe that somebody has had a career in computational physics without knowing they are relying heavily on the HPC/scientific computing/numerical linear algebra toolbox. So, I think they are just using that to help with the narrative for the blog post.


You are exactly right: after deep learning researchers invented Adam for SGD, numerical analysts finally discovered gradient descent. And after the first neural net was discovered, the matrix was finally invented in the novel field of linear algebra.


I say this as someone who has been in deep learning for over a decade now: this is pretty wrong, both on the merits (data obviously lives on a manifold) and on its applications to deep learning (cf. Chris Olah's blog as an example from 2014, which is linked in my post -- https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/). Embedding spaces are called 'spaces' for a reason. GANs, VAEs, contrastive losses -- all of these are about constructing vector manifolds that you can 'walk' to produce different kinds of data.


If data did live on a manifold, e.g. images contained in R^{n^2}, then it wouldn't have thickness or branching, which it does. It's an imperfect approximation that helps us think about the data. The use of mathematical language is not the same as an application of mathematics (and the use of the word 'space' there is not about topology).


You're citing a guy that never went to college (has no math or physics degree), has never published a paper, etc. I guess that actually tracks pretty well with how strong the whole "it's deep theory" claim is.


Chris Olah has never published a paper? ... https://scholar.google.com/citations?user=6dskOSUAAAAJ&hl=en...


Chris Olah? One of the founders of Anthropic and the head of their interpretability team?


It’s alchemy.

Deep learning in its current form relates to a hypothetical underlying theory as alchemy does to chemistry.

In a few hundred years, the Inuktitut-speaking high schoolers of the civilisation that comes after us will learn that this strange term “deep learning” is a leftover from the lingua franca of yore.


Not really, most of the current approaches are some approximations of the partition function.


The reason deep learning is alchemy is that none of these deep theories have predictive ability.

Essentially all practical models are discovered by trial and error and then "explained" after the fact. In many papers you read a few paragraphs of derivation followed by a simpler formulation that "works better in practice". E.g., diffusion models: here's how to invert the forward diffusion process, but actually we don't use this, because gradient descent on the inverse log likelihood works better. For bonus points the paper might come up with an impressive name for the simple thing.
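For concreteness on the diffusion example (this is the standard DDPM setup, summarized from memory rather than quoted from any particular paper under discussion): the derivation gives a variational bound on the log-likelihood, but training actually uses the simplified noise-prediction objective

    L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[ \big\| \epsilon - \epsilon_\theta\big( \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\ t \big) \big\|^2 \Big]

i.e. an unweighted MSE on the predicted noise, which empirically works better than the exact bound it was derived from.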

In most other fields you would not get away with this. Your reviewers would point this out and you'd have to reformulate the paper as an experience report, perhaps with a section about "preliminary progress towards theoretical understanding". If your theory doesn't match what you do in practice - and indeed many random approaches will kind of work (!) - then it's not a good theory.


It's true that there is no directly predictive model of deep learning, and it's also true that there is some trial and error, but it is wrong to say that therefore there is no operating theory at all. I recommend reading Ilya's 30 papers (here's my review of that set: https://open.substack.com/pub/theahura/p/ilyas-30-papers-to-...) to see how shared intuitions and common threads have clearly developed over the last decade+.


That is a great list, do you know of something similar that is more recent?


It does if you relax your definition to accommodate approximation error, cf. e.g., Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning (https://aclanthology.org/2021.acl-long.568.pdf)


> Data doesn't actually live on a manifold.

Often, they do (and then they are called "sheaves").


Many types of data don’t. Disconnected spaces like integer spaces don’t sit on a manifold (they are lattices). Spiky, noisy, fragmented data don’t sit on a (smooth) manifold.

In fact, not all ML models treat data as manifolds. Nearest neighbors and decision trees don’t require the manifold assumption and actually work better without it.


Any reasonable statistical explanation of deep learning requires there to be some sort of low dimensional latent structure in the data. Otherwise, we would not have enough training data to learn good models, given how high the ambient dimensions are for most problems.


Deep learning specifically, yes. It needs a manifold assumption. But not data in general, which is what I was responding to.


It turns out a lot of disconnected spaces can be approximated by smooth ones that have really sharp boundaries, which more or less seems to be how neural networks will approximate something like discrete tokens


Can be approximated, yes. Approximated well? No, though you can sometimes get away with it using saturating functions like softmax - but only badly. It’s like trying to solve an integer program as a linear program: you end up with a relaxation that is not integral and not the real answer.

An integer lattice can only be a manifold in a trivial sense (dimension 0), not for any positive dimension.
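A tiny worked example of the relaxation point above (numbers made up here for illustration): maximize x + y subject to 2x + 2y <= 3 with x, y in {0, 1}. The integer optimum is 1 (take exactly one of x, y), but the LP relaxation with x, y in [0, 1] gives x = y = 0.75 with objective 1.5 - a perfectly valid answer to the relaxed problem and not a feasible answer to the real one.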


Your comment sits in the nice gradient between not seeing at all the obvious relationships between deep learning and topology and thinking that deep learning is applied topology.

See? Everything lives in the manifold.

Now, for a great visualization of the Manifold Hypothesis, I cannot recommend this video enough: https://www.youtube.com/watch?v=pdNYw6qwuNc

It helps to visualize how the activation functions, biases, and weights (linear transformations) stretch the high-dimensional space so that the data get pushed to the extremes and become easy to place on a low-dimensional object (the manifold) embedded in that high-dimensional space, where it is trivial to classify or separate them.

Gaining an intuition about this process makes some deep learning practices much easier to understand.


I cannot understand this prideful resentment of theory common among self-described practitioners.

Even if existing theory is inadequate, would an operating theory not be beneficial?

Or is the mystique combined with guess&check drudgery job security?


If there were a theory that led to directly useful results (like telling you the right hyperparameters to use for your data in a simple way, or giving you a new kind of regularization you could drop in to dramatically improve learning), then deep learning practitioners would love it. As it currently stands, such theories don't really exist.


That standard is way too strict. You can absolutely have theories that lead to useful results even if they aren’t as predictive as you describe. The theory of evolution is an obvious counterpoint.


Useful theories only come to exist because someone started by saying they must exist and then spent years or lifetimes discovering them.


There are strong incentives to leave theory as technical debt and keep charging forward. I don't think it's resentment of theory; everyone would love a theory if one were available, but very few are willing to forgo the near-term rewards to pursue theory. Also, it's really hard.


There are many reasons to believe a theory may not be forthcoming, or that, if one does become available, it may not be useful.

For instance, we do not have consensus on what a theory should accomplish - should it provide convergence bounds/capability bounds? Should it predict optimal parameter counts/shapes? Should it allow more efficient calculation of optimal weights? Does it need to do these tasks in linear time?

Even materials science in metals is still cycling through theoretical models after thousands of years of making steel and other alloys.


Maybe a little less with the ad hominems? The OP is providing an accurate description of an extremely immature field.


Many mathematicians are (rightly, IMO) allergic to assertions that certain branches are not useful (explicit in the OP), and especially so if those assertions are dismissive of attempts to understand complicated real-world phenomena (implicit in the OP, if you ask me).


Who is proud? What you are seeing in some cases is eye rolling. And it's fair eye rolling.

There is an enormous amount of theory used in the various parts of building models; there just isn't an overarching theory at the most convenient level of abstraction.

It almost has to be this way. If there were some neat theory, people would use it and build even more complex things on top of it in an experimental way, and so on.


Just a side comment on your observation: the principle is called reductionism and has been tried in many fields.

Physics is just applied mathematics

Chemistry is just applied physics

Biology is just applied chemistry

It doesn’t work very well.


>it's an empirical field advanced mostly by trial and error and, sure, a few intuitions coming from theory (that was not topology).

Neural networks consist almost exclusively of two parts: numerical linear algebra and numerical optimization.

Even if you reject the abstract topological description, numerical linear algebra and optimization couldn't be any more directly applicable.
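A toy illustration of that claim (a made-up example, not anyone's actual model): every step below is either a matrix product or a gradient-descent update.

    import numpy as np

    # Tiny one-hidden-layer network fit to y = sin(x).
    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 200).reshape(-1, 1)
    y = np.sin(X)

    W1 = rng.normal(scale=0.5, size=(1, 32)); b1 = np.zeros(32)
    W2 = rng.normal(scale=0.5, size=(32, 1)); b2 = np.zeros(1)
    lr = 0.05

    for _ in range(2000):
        h = np.tanh(X @ W1 + b1)              # linear algebra: matmul + nonlinearity
        pred = h @ W2 + b2                    # linear algebra: matmul
        err = pred - y
        dW2 = h.T @ err / len(X); db2 = err.mean(0)
        dh = err @ W2.T * (1 - h ** 2)        # chain rule, still just matmuls
        dW1 = X.T @ dh / len(X); db1 = dh.mean(0)
        for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            p -= lr * g                       # optimization: plain gradient descent
    print(np.mean((pred - y) ** 2))           # training error has shrunk substantially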


> Near total majority, if not 100%, of the useful things done in deep learning have come from not thinking about topology in any way.

Of course. Now, to deeply understand what is actually happening in these constructs, we will use topology. Topological insights will then, without doubt, inform the next generations of this technology.


May I ask you to give examples of insights from topology that improved existing models, or at least improved our understanding of them? arXiv papers are preferred.


I feel like the fact that ML has no good explanation of why it works this well gives a lot of people room to invent their own head-canon, usually drawn from their field of expertise. I've seen this from exceptionally intelligent individuals too. If you only have a hammer...


I think it would be more unusual, and concerning, if an intelligent individual didn't attempt to apply their expertise to form a head-canon of something unknown.

Coming up with an idea for how something works, by applying your expertise, is the fundamental foundation of intelligence, learning, and was behind every single advancement of human understanding.

People thinking is always a good thing. Thinking about the unknown is better. Thinking with others is best, and sharing those thoughts isn't somehow bad, even if they're not complete.


When you say ML, I assume you really mean LLMs?

Even with LLMs, there's no real mystery about why they work so well - they produce human-like input continuations (aka "answers") because they are trained to predict continuations of human-generated training data. Maybe we should be a bit surprised that the continuation signal is there in the first place, but given that it evidently is, it's no mystery that LLMs are able to use it - just testimony to the power of the Transformer as a predictive architecture, and of course to gradient descent as a cold, unthinking way of finding an error minimum.

Perhaps you meant how LLMs work, rather than why they work, but I'm not sure there's any real mystery there either - the transformer itself is all about key-based attention, and we now know that training a transformer seems to consistently cause it to leverage attention to learn "induction heads" (using pairs of adjacent attention heads) that are the main data finding/copying primitive they use to operate.
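For reference, the key-based attention at the heart of the transformer is, in the standard notation (not specific to any one model),

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V

Each position's query is matched against every key, and the resulting weights decide which values get copied forward - this is the data finding/copying primitive that induction heads are built out of.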

Of course, knowing how an LLM works in broad strokes isn't the same as knowing specifically how it is working in any given case (how it is transforming a specific input, layer by layer, to create the given output), but that seems a bit like saying that because I can't describe - precisely - why you had pancakes for breakfast, we don't know how the brain works.


"All models are wrong, but some are useful" -George Box


I don't agree with your first sentence, but I agree with the rest of this post.


Make the cost of operating data centers reflect the damage to the environment and see how quickly people optimize. I don’t know how damaging storage is, but that’s the only way to make a difference.


The cost to operate a data center is already, to a close approximation, the cost of its electricity. The data center operator already pays for every joule.

Data centers used ~4% of US electricity last year, about .15 quads, while the chemical industry used 10 quads, virtually all in the form of fossil fuel. If you are going to start reflecting the externality of energy consumption in the price of goods, the information sector will probably not need to adjust anything, while the chemical sector will be fundamentally impaired.


So charge more? I'm sure they are already charging as much as the market will bear. Governments can impose a tax, which they already have.


Incept AI | Remote (US) | Full-time | Software Engineers and Deep Learning Scientists | https://www.incept.ai

Incept AI is solving the last mile problem in Voice AI. Our team of PhD applied scientists and software engineers builds neural audio processing systems, audio-processing foundation models, and agentic AI workflows that make Voice AI reliable and truly useful in the real world.

Our first application is the lucrative drive-thru and phone automation market for restaurants. We have built the most accurate order-taking restaurant Voice AI assistant in the world, and are working with major restaurant chains to help elevate their customer service while providing major automation-led savings.

We are well-funded and are looking to grow our team with Software Engineers (generalists or back-end; previous work experience required) and Applied Scientists (PhD in a technical field like CS, Math, Physics, ... or equivalent deep learning experience required).

https://www.incept.ai/careers


When considering a candidate on an F-1 + OPT, what should be our expectation of their need for H1B sponsorship? What are O-1 and National Interest Waiver success rates for recent PhD grads (in computer science)?

