
"Language models trust their memory quite a bit."

All they have is memory, either in the weights or the input prompt. To the extent that these models appear to reason, it is precisely in the ability to successfully substitute information from the prompt into reasoning patterns from the training data. It shouldn't be any surprise that this fails when patterns in the prompt strongly condition the model to reproduce particular patterns of reasoning (e.g., many words in the riddle indicate a well-known riddle, but the details are different).

I know the impulse to anthropomorphize is almost impossibly seductive, but I find that the best way to understand and use these models is to remember: they are giant conditional probability distributions for the next token.
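The "conditional probability distribution over the next token" framing can be made concrete with a toy sketch. Everything here is illustrative: the vocabulary and logit table are made up to stand in for the giant learned function a real LLM computes.

```python
import numpy as np

# Toy autoregressive "LM": p(next token | context) computed from a
# hand-written logit table. A real model's weights play the role of
# this table; the prompt plays the role of the context key.
vocab = ["the", "cat", "sat", "mat", "."]
logits_given_context = {
    ("the",): [0.0, 3.0, 0.5, 2.0, -2.0],   # after "the", "cat" is most likely
    ("cat",): [0.1, -1.0, 3.0, -1.0, 0.0],  # after "cat", "sat" is most likely
}

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def next_token_distribution(context):
    # All the "model" has to compute on: the table (weights) and the context (prompt).
    return softmax(logits_given_context[context])

p = next_token_distribution(("the",))
print(vocab[int(np.argmax(p))])  # greedy decoding picks "cat"
```

Greedy decoding is shown for simplicity; real deployments typically sample from (a temperature-scaled, truncated version of) this distribution instead of taking the argmax.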



LLMs trained on code reason better: they perform better on reasoning benchmarks even when the benchmarks have nothing to do with code. You're wrong.

https://arxiv.org/abs/2210.07128


Code is often just a sequence of steps (sometimes with comments to indicate goals). As such, it is just another form of patterns of reasoning. Many chains of thought that you would utilize in code are useful skeletons to think about other things.

I don't see how this undermines my point.


If code transfers to reasoning tasks that don't have anything to do with code, then what is being "substituted"? Ideas and concepts?

Code and MMLU don't share similar "reasoning patterns" unless you're being extremely vague, in the "they both require reasoning" sense.


Structure. GPT has seen lots of logical constructions/arguments. These appear either explicitly in code (in documentation) or implicitly in code (code is often a linear sequence of steps building to, for example, a return value). ChatGPT learns patterns like this. A prompt may condition the generator to produce something like one of these patterns with elements from the prompt substituted into the generated text. This works relatively often, but fails exactly when the prompt strongly indicates a pattern that doesn't actually fit the question asked.

I won't say these models can't reason per se, but they can only reason using their memories and the prompt. There is nothing else for them to compute on.

In a hand-wavy kind of way, when ChatGPT fails at a riddle phrased so as to seem similar to a common riddle, you're seeing overfitting. But given the quantity of data these models consume, it's hard to imagine how to test for overfitting, because the training data contains things similar to almost anything you can imagine. Because of that I'm still very suspicious of claims that they "reason" in any strong sense of the word.

But if you try very hard you can find "held out" data and when you test on it, GPT4 stops looking so smart:

https://teddit.net/r/singularity/comments/121tc48/gpt4_fails...

That said, I've been very impressed by GPT4 as a productivity tool.


>but they can only reason using their memories and the prompt.

Eh no.

https://arxiv.org/abs/2212.10559

>But if you try very hard you can find "held out" data and when you test on it, GPT4 stops looking so smart:

This can be done to anybody. This can be done to you. It's not a gotcha. Nobody is saying GPTs don't/can't memorize.


Two things about this.

1. the paper in question demonstrates a formal duality between the transformer architecture and gradient descent. If you take this to indicate that the model reasons in some way, then it would be true of the smallest GPT as well as the largest (it is, after all, a consequence of the architecture rather than anything the model has learned to do per se). In any case, the fact that the model can perform the equivalent of a finite number of gradient-like steps on its way to calculating its final conditioned probabilities doesn't really suggest to me that the model reasons in a general way.

2. You are right that no one disputes the model's ability to memorize (and rephrase). What is in question here is whether the model can reason. If it can do 10 code questions it has seen before but fails on 10 it hasn't (of similar difficulty), then that strongly suggests it isn't reasoning its way through the questions, but regurgitating/rephrasing.


>If it can do 10 code questions it has seen before but fails to do 10 it hasn't (of similar difficulty) then it strongly suggests that it isn't reasoning its way through the questions, but regurgitating/rephrasing.

First of all, coding is one area where expecting a perfect solution on the first pass makes no sense. That GPT-4 didn't one-shot those problems doesn't mean it can't solve them.

Moreover, all this says, if true, is that GPT-4 isn't as good at coding as initially thought. Nothing else. It doesn't mean it doesn't reason. There are many other tasks where GPT-4 performs about as well on out-of-distribution/unseen data.



