Hacker News

Statistical learning can typically be phrased in terms of k nearest neighbours

In the case of NNs we have a "modal knn" (memorising) going to a "mean knn" ('generalising') under the right sort of training.

I'd call both of these memorising, but the latter is a kind of weighted recall.

Generalisation as a property of statistical models (ie., models of conditional freqs) is not the same property as generalisation in the case of scientific models.

In the latter a scientific model is general because it models causally necessary effects from causes -- so, necessarily if X then Y.

Whereas generalisation in associative stats is just about whether you're drawing data from the empirical freq. distribution or whether you've modelled first. In all automated stats the only diff between the "model" and "the data" is some sort of weighted averaging operation.

So in automated stats (ie., ML,AI) it's really just whether the model uses a mean.
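To make that distinction concrete, here's a toy sketch (made-up data, pure Python): the "modal" version recalls a stored answer outright, the "mean" version is the same recall run through a weighted-averaging operation.

```python
from collections import Counter

def modal_knn(query, data, k=3):
    """'Memorising' kNN: return the most common label among the k
    nearest stored examples -- pure recall of stored answers."""
    neighbours = sorted(data, key=lambda xy: abs(xy[0] - query))[:k]
    return Counter(y for _, y in neighbours).most_common(1)[0][0]

def mean_knn(query, data, k=3):
    """'Generalising' kNN: an average over the k nearest stored
    answers -- recall smoothed by an averaging operation."""
    neighbours = sorted(data, key=lambda xy: abs(xy[0] - query))[:k]
    return sum(y for _, y in neighbours) / len(neighbours)

data = [(0, 0.0), (1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0)]  # y = x^2
print(modal_knn(2.4, data))  # snaps to a stored answer: 4.0
print(mean_knn(2.4, data))   # blends nearby answers: 14/3, about 4.67
```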



I disagree; it feels like you are fussing over words rather than over what's happening in the real world. If you were right, a human wouldn't learn anything either; they would just memorize.

You can look at it by results: I give these models inputs they've never seen before, and they give me outputs that are correct / acceptable.

You can look at it in terms of data: we took petabytes of data, and with an 8 GB model (Stable Diffusion) we can output an image of anything. That's an unheard-of degree of compression, only possible if it's generalizing, not memorizing.


I'd be curious how much of the link you read.

What they demonstrate is a neural network learning an algorithm that approximates modular addition. The exact workings of this algorithm are explained in the footnotes. The learned algorithm is general -- it is just as valid on unseen inputs as on seen inputs.

There's no memorization going on in this case. It's actually approximating the process used to generate the data, which just isn't possible using k nearest neighbors.
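As I read the footnotes, the discovered algorithm is roughly this rotation trick (a Python sketch of the idea, not the network's actual weights): embed each residue as a point on the unit circle, compose by multiplying, decode by nearest embedding. It's correct for any pair, seen or unseen, which a lookup over stored examples can't be.

```python
import cmath
import math

def mod_add_via_rotation(a, b, p=113):
    # Embed each residue as a point on the unit circle.
    za = cmath.exp(2j * math.pi * a / p)
    zb = cmath.exp(2j * math.pi * b / p)
    z = za * zb  # multiplying unit complex numbers adds their angles
    # Decode: which residue's embedding best aligns with z?
    return max(range(p), key=lambda c: (z * cmath.exp(-2j * math.pi * c / p)).real)

print(mod_add_via_rotation(50, 100))  # (50 + 100) mod 113 = 37
```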


> Statistical learning can typically be phrased in terms of k nearest neighbours

Neural nets have long been suspected to be a kind of kNN. Here's a paper:

Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

https://arxiv.org/abs/2012.00152


It's been shown that all models learned by gradient descent are approximately equivalent to kernel machines. Interpolation isn't generalization: if there's a new input sufficiently different from the training data, the behaviour is unknown.
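For intuition, a kernel machine predicts as y(x) = sum_i a_i K(x, x_i) + b, a weighted comparison of x against stored training points. A toy sketch with a Gaussian kernel (Domingos's result actually involves a "path kernel" over the training trajectory, and these centres and weights are made up, so this is only illustrative):

```python
import math

def kernel_machine(x, centres, weights, bias=0.0, gamma=1.0):
    """Prediction of the kernel-machine form
    y(x) = sum_i a_i * K(x, x_i) + b, with a Gaussian kernel."""
    k = lambda u, v: math.exp(-gamma * (u - v) ** 2)
    return bias + sum(a * k(x, xi) for a, xi in zip(weights, centres))

centres = [0.0, 1.0, 2.0]   # hypothetical training inputs
weights = [1.0, -2.0, 1.5]  # hypothetical learned coefficients

print(kernel_machine(1.0, centres, weights))   # near the data: shaped by all terms
print(kernel_machine(50.0, centres, weights))  # far from the data: every K(x, x_i)
                                               # vanishes; output collapses to bias
```

Far from the training points the kernel terms go to zero and the prediction degenerates, which is one way of seeing the "behaviour on sufficiently different inputs is unknown" point.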


Can you say what that says about the behavior described with the modular arithmetic in the article?

And, in particular, how does the "view it as a kernel machine / interpolation" lens interpret the fact that different hyperparameters determined whether runs that obtained equally high accuracy on the training data got good or bad scores on the test data?

My understanding is that the behavior in at least one of those "models learned by gradient descent are equivalent to [some other model]" papers, works by constructing something which is based on the entire training history of the network. Is that the kernel machines one, or some other one?


If you train a model on modular arithmetic, it can only learn what's in the training data. If all of the examples are of the form a + b mod 10, it isn't likely to generalize to solving a + b mod 12. A human can learn the rule and figure it out; a model can't. That's why a diverse training set is so important. It's possible to train a model to approximate any function, but whether the approximation is accurate outside of the datapoints you trained on is not reliable, as far as I understand.

Different hyperparameters can give a model that is over- or underfit, but this helps the model interpolate, not generalize. It can know all the answers similar to the training data, not answers different from it.
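A toy illustration of that interpolate-vs-extrapolate gap, using 1-nearest-neighbour on a made-up task:

```python
def one_nn_predict(x, train):
    """1-nearest-neighbour regression: recall the stored answer
    for the closest training input."""
    return min(train, key=lambda xy: abs(xy[0] - x))[1]

# Hypothetical toy task: learn f(x) = x * x from points in [0, 5].
train = [(x, x * x) for x in range(6)]

print(one_nn_predict(2.3, train))  # interpolation: 4, close to the truth (5.29)
print(one_nn_predict(10, train))   # extrapolation: 25, stuck at the edge
                                   # of the data; the truth is 100
```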


One weird trick ...

There's some fox and hedgehog analogy I've never understood.


But when the model trains on 13T tokens, it is hard for an input to be OOD.




