
Grr, the AI folks are ruining the term 'grok'.

It means roughly 'to understand completely, fully'.

To use the same term to describe generalization... just shows you didn't grok grokking.



"Grok" in AI doesn't quite describe generalization, it's more specific that that. It's more like "delayed and fairly sudden generalization" or something like that. There was some discussion of this in the comments of this post[1], which proposes calling the phenomenon "eventual recovery from overfitting" instead.

[1] https://www.lesswrong.com/posts/GpSzShaaf8po4rcmA/qapr-5-gro...


Part of the issue here is posting a LessWrong post. There is some good in there, but much of that site is like a Flat Earth conspiracy theory for neural networks.

Neural network training [edit: on a fixed point task, as is often the case {such as image->label}] is always (always) biphasic by necessity, so there is no "eventual recovery from overfitting". In my experience, it is just people newer to the field, or people noodling around, fundamentally misunderstanding what is happening as their network goes through a very delayed phase change. Unfortunately, these kinds of posts get significant amplification, as people like chasing the new shiny of some fad-or-another-that-does-not-actually-exist instead of the much more 'boring' (which I find fascinating) math underneath it all.

To me, as someone who specializes in optimizing network training speeds, it just indicates poor engineering applied to the problem on the part of the person running the experiments. It is not a new or strange phenomenon; it is a literal consequence of the information theory underlying neural network training.


> Part of the issue here is posting a LessWrong post

I mean, this whole line of analysis comes from the LessWrong community. You may disagree with them on whether AI is an existential threat, but the fact that people take that threat seriously is what gave us this whole "memorize-or-generalize" analysis, and glitch tokens before that, and RLHF before that.


I think you may be missing the extensive lines of research covering those topics. Memorization vs generalization has been debated since before LW even existed in the public eye, and inputs that networks have unusual sensitivity to have been well studied as well (re: chaotic vs linear regimes in neural networks). Especially the memorization vs generalization bit -- that has been around for...decades. It's considered a fundamental part of the field, and has had a ton of research dedicated to it.

I don't know much either way about RLHF in terms of its direct lineage, but I highly doubt that is actually what happened, since DeepMind is actually responsible for the bulk of the historical research supporting those methods.

It's possible, à la the broken-clock hypothesis (and LessWrong is obviously not the "primate at a typewriter" situation), so there's a chance of some people scoring meaningful contributions, but the signal-to-noise ratio is awful. I want to get something out of some of the posts I've tried to read there, but there are so many bad takes written in bombastic language that it's really quite hard indeed.

Right now, it's an active distraction for the field because it pulls attention away from things that are much more deserving of energy and time. I honestly wish the vibe was back to people even just making variations of Char-RNN repos based on Karpathy's blog posts. That was a much more innocent time.


> I think you may be missing the extensive lines of research covering those topics. Memorization vs Generalization

I meant this specific analysis, that neural networks that are over-parameterized will at first memorize but, if they keep training on the same dataset with weight decay, will eventually generalize.

Then again, maybe there have been analyses done on this subject I wasn't aware of.
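The setup described above -- an over-parameterized network trained full-batch on a small fixed dataset with L2 weight decay -- can be sketched in a few lines of numpy. This is a hedged toy, not any particular paper's experiment: the modulus, width, learning rate, and decay value are all invented for illustration, and whether a given run actually exhibits the delayed memorize-then-generalize transition is notoriously sensitive to these choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 7  # tiny modulus for a toy modular-addition task (invented value)
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p
idx = rng.permutation(len(pairs))
tr, te = idx[:35], idx[35:]  # hold out some pairs as a "val" set
X = np.eye(p)[pairs[:, 0]] + np.eye(p)[pairs[:, 1]]  # bag-of-tokens input
Y = np.eye(p)[labels]  # one-hot targets

H, lr, wd = 64, 0.3, 1e-3  # wide net + weight decay (invented values)
W1 = rng.normal(0, 0.3, (p, H))
W2 = rng.normal(0, 0.3, (H, p))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_on(ix):
    # mean cross-entropy of the current network on the given index set
    probs = softmax(np.tanh(X[ix] @ W1) @ W2)
    return -np.log(probs[np.arange(len(ix)), labels[ix]]).mean()

first = loss_on(tr)
for step in range(3000):
    # full-batch gradient descent with L2 weight decay on both layers
    h = np.tanh(X[tr] @ W1)
    probs = softmax(h @ W2)
    d = (probs - Y[tr]) / len(tr)
    gW2 = h.T @ d + wd * W2
    gW1 = X[tr].T @ (d @ W2.T * (1 - h * h)) + wd * W1
    W2 -= lr * gW2
    W1 -= lr * gW1

print(f"train {loss_on(tr):.3f} (from {first:.3f}), heldout {loss_on(te):.3f}")
```

Tracking loss_on(tr) and loss_on(te) over many more steps is what would reveal (or fail to reveal) the delayed transition in a given configuration.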


Gotcha. I'm happy to do the trace as it likely would be fruitful for me.

Do you have a link to a specific post you're thinking of? It's likely going to be a Tishby-like (the classic paper from 2015 {with much more work going back into the early aughts, just outside of the NN regime IIRC}: https://arxiv.org/abs/1503.02406) lineage, but I'm happy to look to see if it's novel.


The specific post I'm thinking of is A Mechanistic Interpretability Analysis of Grokking - https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mec...

I originally thought the PAIR article was another presentation by the same authors, but upon closer reading, I think they just independently discovered similar results. That said, the PAIR article does cite "Progress measures for grokking via mechanistic interpretability", the arXiv paper by the authors of the Alignment Forum post.

(In researching this I found another paper about grokking finding similar results a few months earlier; again, I suspect these are all parallel discoveries.)

You could say that all of these avenues of research are re-statements of well-known properties, e.g. deep double descent, but I think that's a stretch. Double descent feels related, but I don't think a 2018 AI researcher who knew about double descent would spontaneously predict "if you train your model past the point it starts overfitting, it will start generalizing again if you train it for long enough with weight decay".

But anyway, in retrospect, I agree that saying "the LessWrong community is where this line of analysis comes from" is false; it's more like they were among the people working on it and reaching similar conclusions.


I don't think that is true? As far as I know the grokking phenomenon was first observed (and the name coined) in this paper, not in any blog post:

https://arxiv.org/abs/2201.02177


That's true, and I probably should have done some better backing up, sorting out, and clarification. I remember when that paper came out, it rubbed me the wrong way too then, because it is people rediscovering double descent from a different perspective, and not recognizing it as such.

It would be better defined as "a sudden change in phase state after a long period of metastability". Even then, that ignores that those sharp inflections indicate a poor KL between some of the inductive priors and the data at hand.

You can think about it as the loss signal from the support of two Gaussians extremely far apart with narrow standard deviations. Sure, they technically share support, but in a noisy regime you're going to have nothing... nothing... nothing... and then suddenly something as you hit that point of support.
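The two-distant-Gaussians picture can be made concrete numerically: the shared support between the two modes is numerically zero until the separation shrinks past a threshold, then appears abruptly. The widths and separations below are arbitrary illustration values.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    # standard univariate normal density
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

sigma = 0.5  # narrow modes
overlap = {}
for d in [40.0, 20.0, 10.0, 5.0, 2.0]:
    xs = np.linspace(0.0, d, 2001)
    # pointwise min of the two densities = the "shared support" between modes;
    # in float64 it underflows to exactly zero at large separations
    overlap[d] = np.minimum(gauss_pdf(xs, 0.0, sigma),
                            gauss_pdf(xs, d, sigma)).max()
    print(f"separation {d:5.1f} -> max overlap {overlap[d]:.3e}")
```

Nothing, nothing, nothing, then suddenly something: the overlap ramps from an exact numerical zero to an appreciable value over a narrow range of separations.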

Little of the literature, definitions around the word, or anything like that really takes this into account generally, leading to this mass illusion that this is not a double descent phenomenon, when in fact it is.

Hopefully this is a more appropriate elaboration, I appreciate your comment pointing out my mistake.


Singular learning theory explains the sudden phase changes of generalization in terms of the resolution of singularities. Alas, it's still associated with the LW crowd.

https://www.lesswrong.com/s/mqwA5FcL6SrHEQzox/p/fovfuFdpuEwQ...


If it's any consolation, that post is...hot word-salad garbage. It's like they learned the words on Wikipedia and then proceeded to try to make a post that used as many of them as possible. It's a good litmus test for experience vs armchair observation -- scanning the article without decoding the phrasing (to see how silly the argument is) would certainly seem impressive, because "oooooh, fancy math". It's part of why LW is more popular: it is basically white-collar flat-earthery, and many of the relevant topics discussed there have already been discussed ad infinitum in the academic world and are accepted as general fact. We're generally not dwelling on silly arguments like that.

One of the most common things I see is people assuming something that came from LW is novel and "was discovered through research published there", and that's because it's really incentivized over there to make a lot of noise and sound plausible. Whereas arXiv papers, while there is some battle for popularity, are inherently more "boring" and formal.

For example, the LW post as I understand it completely ignores existing work and just... doesn't cite things which are rigorously reviewed and prepared. How about this paper from five years ago in a long string of research about generalization loss basins, for example? https://papers.nips.cc/paper_files/paper/2018/hash/be3087e74...

If someone earnestly tried to share the post you linked at a workshop at a conference, they would not be laughed out of the room, but instead have to deal with the long, draining, and muffling silence of walking to the back of the room without any applause when it was over. It's not going to fly with academics/professionals who are academia-adjacent.

This whole thing is not too terribly complicated, either, I feel -- a little information theory and the basics, plus time studying and working on it, and someone is 50% of the way there. I feel frustrated that this kind of low-quality content is parasitically supplanting actual research with meaning and a well-documented history. This is flashy nonsense that goes nowhere, and while I hesitate to call it drivel, it is nigh-worthless. It barely passes muster for a college essay on the subject, if even that. If I were their professor, I would pull them aside to see if there is a more productive way for them to channel their interests in the deep learning space, and how we could better accomplish that.


I appreciate the thoughts. In such a fast moving field, it's difficult for the layman to navigate without a heavy math background. There's some more academic research I should have pointed to like https://arxiv.org/abs/2010.11560


> Part of the issue here is posting a LessWrong post. There is some good in there, but much of that site is like a Flat Earth conspiracy theory for neural networks.

Indeed! It’s very frustrating that so many people here are such staunch defenders of LessWrong. Some/much of the behavior there is honestly concerning.


100% agreed. I'm pretty sure today was the first time I learned that the site was founded by Yudkowsky, which honestly explains quite a bit (polite 'lol' added here for lightheartedness)


To further clarify things, the reason there is no mystical 'eventual recovery from overfitting' is because overfitting is a stable bound that is approached. Attaching this false label implies a non-biphasic nature to neural network training, and adds false information that wasn't there before.

Thankfully, things are pretty stable in the over/underfitting regime. I feel sad when I see ML misinformation propagated on a forum that requires little experience but has high leverage, due to the rampant misuse of existing terms and the complete invention of an in-group language that has little contact with the mathematical foundations of what's happening behind the scenes. I've done this for 7-8 years at this point at a pretty deep level and have a strong pocket of expertise, so I'm not swinging at this one blindly.


What are the two phases? What determines when you switch?


Memorization of individual examples -> generalization. I can't speak about the determinant of switching, as that is (partially, to some degree) work I'm working on, and I have a personal rule not to share work in progress until it's completed (and then be very open and explicit about it). My apologies on that front.

However, I can point you to one comment I made earlier in this particular comment section about the MDL and how that relates to the L2 norm. Obviously this is not the only thing that induces a phase change, but it is one of the more blatant ones, and it's been covered a little more publicly by different people.
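The MDL/L2 connection referenced here can be stated as a one-line identity: under an i.i.d. zero-mean Gaussian prior N(0, sigma^2) on the weights, the code length of the weights (negative log prior, in nats) is exactly an L2 penalty lambda * ||w||^2 with lambda = 1/(2 sigma^2), plus a weight-independent constant. A small numerical check (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)  # arbitrary weight vector
sigma = 2.0              # prior scale (arbitrary)
lam = 1.0 / (2.0 * sigma ** 2)

# code length of w under the Gaussian prior, in nats
neg_log_prior = (0.5 * np.sum((w / sigma) ** 2)
                 + w.size * np.log(sigma * np.sqrt(2.0 * np.pi)))
# the same quantity written as weight decay plus a constant
l2_plus_const = (lam * np.sum(w ** 2)
                 + w.size * np.log(sigma * np.sqrt(2.0 * np.pi)))
print(np.isclose(neg_log_prior, l2_plus_const))  # True: the two coincide
```

This is one standard way the description-length view meets weight decay; it is a sketch of the identity, not a claim about which mechanism drives any particular phase change.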


Whoever suggested 'eventual recovery from overfitting' is a kindred spirit.

Why throw away the context and nuance?

That decision only further leans into the 'AI is magic' attitude.


No, actually this is just how language evolves. I'm glad we have the word "car" instead of "carriage powered by internal combustion engine", even if it confused some people 100 years ago when the term became used exclusively to mean something a bit more specific.

Of course, the jargon used in a specific sub-field evolves much more quickly than common usage, because the intended audience of a paper like this is expected to be well-read and current in the field already.


Language devolves just as it evolves. We (the grand we) regularly introduce ambiguity -- words and meanings with no useful purpose, or that are worse than useless.

I'm not really weighing in on the appropriateness of the use of "grok" in this case. It's just a pet peeve of mine that people bring out "language evolves" as an excuse for why any arbitrary change is natural and therefore acceptable, and we should go with the flow. Some changes are strictly bad ones.

A go-to example is when "literally" no longer means "literally", but its opposite, or nothing at all. We don't have a replacement word, so now in some contexts people have to explain that they "literally mean literally".


Language only evolves; "devolving" isn't a thing. All changes are arbitrary. Language is always messy, fluid and ambiguous. You should go with the flow, because being a prescriptivist about the way other people speak is obnoxious and pointless.

And "literally" has been used to mean "figuratively" for as long as the word has existed[0].

[0]https://blogs.illinois.edu/view/25/96439


I'm going to take a rosier view of prescriptivists and say they are a necessary part of the speaking/writing public, doing the valuable work of fighting entropic forces to prevent making our language dumb. They don't always need to win or be right.

That's the first time I've seen literally-as-figuratively defended from a historical perspective. I still think we'd all be better off if people didn't mindlessly use it as a filler word or for emphasis, which is generally what people are doing these days that is the source of controversy, not reviving an archaic usage.

Also, it's kind of ironic you corrected my use of "devolves", where many would accept it. :)


> devolving isn't a thing

Incompetent use is devolution.


Also being overlooked is that the nuances in what we accept are in large part how we define group culture.

If you want to use the word 'irregardless' unironically there are people who will accept that. Then there are the rest of us.


Just as an added data point, some languages (e.g. Hungarian) do use double negatives “natively”, and I have definitely caught myself having to fight some native expression seeping into my English, including ‘irregardless’. For example, a Hungarian would say “I have never done nothing bad” over “anything bad”, but it is used not in a logical sense, but more as an emphasis, perhaps?

(!)Regardless, what I’m trying to say is that due to the unique position of English as the de facto world language, it has to “suffer” some non-idiomatic uses seeping in from non-natives. Actually, I would go even further and say that most smaller languages will slowly stop evolving, and only English will have that property going forward (most new inventions no longer get a native name in most languages; the English one is used).


> No, actually this is just how language evolves

Stop making 'fetch' happen, it's not going to happen.


Sci-Fi Nerd Alert:

“Grok” was Valentine Michael Smith’s rendering, for human ears and vocal cords, of a Martian word with a precise denotational semantics of “to drink”. The connotational semantics range from literally or figuratively “drinking deeply” all the way up to consuming the absented carcass of a cherished one.

I highly recommend Stranger in a Strange Land (and make sure to get the unabridged re-issue, 1990 IIRC).


They're just defining grokking in a different way. It's reasonable to me though - grokking suggests elements of intuitive understanding, and a sudden, large increase in understanding. These mirror what happens to the loss.


I literally do not see the difference between the two uses that you are trying to make.


I've always considered the important part of grokking something to be the intuitiveness of the understanding, rather than the completeness.


What's the difference between understanding and generalizing?

And what is the indicator for a machine understanding something?


I've always taken 'grok' to be in the same sense as 'to be one with'


Yeah, there is definitely irony that I'm trying to push my own definition of an extra-terrestrial word, complaining that someone is ruining it.

If anyone wants to come up with their own definition, read Robert Heinlein's 'Stranger in a Strange Land'. There is no definition in there, but you build an intuition of the meaning by its use.

One of the issues I have w/ the use in AI is that using the word 'grok' suggests that the machine understands (that's a common interpretation of the word grok, that it is an understanding greater than normal understanding).

By using an alien word, we are both suggesting something that probably isn't technically true, while simultaneously giving ourselves a slimy out. If you are going to suggest that AI understands, just have the courage to say it in common English, and be ready for argument.

Redefining a word that already exists to make the argument technical feels dishonest.


Actually, the definition of 'grok' is discussed in the book; you can find some relevant snippets at https://en.m.wikipedia.org/wiki/Grok. My recollection is that the book says the original / literal meaning is "drink", but this isn't supported by the Wikipedia quotes, and perhaps I am misremembering; it has been a long time.


The book also points out that it is much more than just 'drink', and 'drink' would by no means cover 99% of the way it is used in the book.

That said, I've only ever read the full unabridged re-issue from the mid-90s, it's possible the earlier, edited, releases had many of the uses elided.


Same thing. To grok is to fully incorporate the new into your intuitive view of the world, changing your view of both in the process. An AI training on new data is incorporating it into its existing world view in a way that may subtly change every variable it knows. A human is doing the same. We integrate it deeper the more we can connect it to existing metaphor and understanding, and it becomes one less thing we need to "remember", precisely because we can then recreate it from "base principles" once we fully understand it. We've grokked it.


“Grok” is more about in-group signaling, like “LaTeX credibility” or publishing blog posts on arXiv.


In programming circles ‘grok’ has long been used to describe that moment when you finally understand the piece of code you’ve been staring at all day.

So the AI folks are just borrowing something that had already been co-opted 30+ years ago.


I have heard "grok" used tremendously more frequently in the past year or two, and I find it annoying because they're using it as a replacement for the word "understand", for reasons I don't "grok".


grok, implying a mystical union, is not applicable to AI


Why not?



