They’re not wrong though. The frequency with which these things still just make shit up is astonishingly bad. Very dismissive of a legitimate criticism.
It's getting better, faster than you and I and the GP are. What else matters?
You can't bullshit your way through this particular benchmark. Try it.
And yes, they're wrong. The latest/greatest models "make shit up" perhaps 5-10% as frequently as we were seeing just a couple of years ago. Only someone who has deliberately decided to stop paying attention could possibly argue otherwise.
And yet I still can't trust Claude or o1 to not get the simplest of things wrong, consistently: test cases, for instance (not even full-on test suites, just the test cases). No amount of handholding from me, prompting, or feeding it examples helps in the slightest. It is consistently wrong for anything but the simplest possible examples, and manually verifying its output takes more effort than just writing the thing myself. I'm not even using an obscure stack or language, but with things that aren't Python or JS it shits the bed even worse.
I have noticed it's great in the hands of marketers and scammers, however. Real good at those "jobs", so I see why the cryptobros have now moved on to hailing LLMs as the second coming of Jesus.
I still find that 'trusting' the models is a waste of time; we agree there. But I haven't had much more luck with blindly telling a junior programmer to go write something. The process of creating something new was, and still is, an interactive endeavor.
I do find, however, that the newer the model the fewer elementary mistakes it makes, and the better it is at figuring out what I really want. The process of getting the right answer or the working function continues to become less frustrating over time, although not always monotonically so.
o1-pro is expensive and slow, for instance, but its performance on tasks that require step-by-step reasoning is just astonishing. As long as things keep moving in that direction I'm not going to complain (much).
They are NOT as unaware of things as we are. That’s like someone seeing a software developer googling stuff and saying “see, they don’t know much more than me”.
An expert refreshing their knowledge on Google is not the same as a layman learning it for the first time. At all.
> in no universe is that a legitimate ML interview question
Why not? This seems like the ML equivalent of FizzBuzz. If you don't know how matrix multiplication works well enough to implement it, I would argue that you don't know what you're doing at all.
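For reference, the sort of answer I'd have in mind is roughly the following. It's a minimal sketch in plain Python, assuming matrices are passed as lists of rows; the name `matmul` and the error message are just my own choices for illustration, not anything from an actual interview.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows (naive triple loop)."""
    # The number of columns in a must equal the number of rows in b.
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):          # each row of a
        for j in range(m):      # each column of b
            for p in range(k):  # dot product of row i and column j
                out[i][j] += a[i][p] * b[p][j]
    return out


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Nobody would ship this over a BLAS-backed library, but being able to write it shows you know what the operation actually does.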
I think the time estimates depend heavily on the field. A mathematician recently told me that it would take him 1-2 weeks (with nothing else to do in that time) to digest a paper outside his main area.
Makes me wonder: should it, though? The author of the paper has presumably gone through the trouble of understanding everything described in it, and has built intuition and a mental model along the way. Instead of putting more effort into writing that down, it appears to me that authors are inclined to throw the proverbial baby out with the bathwater: they spend more time writing out the dense proof than an exposition.
I'm not saying proof isn't important or that it should be excluded. But given that more people understand a plain, intuitive explanation (at the expense of accuracy, maybe), their hard work reaches a broader audience that way. Isn't that what authors want, instead of "dog whistling"? Do proofs alone carry intuition? I don't think so.
It takes many dozens of hours, sometimes hundreds, to write a paper (just the writing, not the technical work). People will spend a long time trying to improve the exposition. I have seen papers where co-authors have fought for weeks about the accuracy of a single sentence. I have seen papers that went through over twenty draft-revision cycles.
But there are natural limits. Usually, after working for years on a problem, you become so close to it that describing it to a general technical audience is very difficult. Often, after you publish the work, someone else will do the difficult work of understanding your paper, and then write a more readable exposition as part of a review paper or book.
That makes sense. The proof-heavy part is the third pass, and it's the part the author says takes the longest (4-5 hours for a beginner). With math papers it's essentially all proof!