On the contrary, this is testing the LLMs on inputs they're supposed to be good at.
Fundamentally, this kind of problem is the same as language translation, text comprehension, or coding tasks. It just tests where the boundaries of LLM capabilities lie by pushing them to their limits.
I've noticed LLMs bumping up against those very same limits in ordinary coding tasks. For example, if you have a prefix-suffix naming convention for identifiers, the LLMs can either do very well or get muddled up, depending on how the tokenizer splits them. Similarly, they're not great at spotting small typos in very long identifiers, because in their internal vector representations the correct and typo versions are very "close".
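To get an intuition for that "closeness", here's a toy sketch (the identifier names are made up): any similarity measure over the strings, whether edit distance or a learned embedding, rates the correct and typo'd versions as nearly identical, so a model that compresses its input has little signal to distinguish them.

```python
# Toy illustration: a one-character typo in a long identifier barely
# moves any similarity measure, which is roughly why LLMs miss it.
from difflib import SequenceMatcher

correct = "user_account_repository_factory_impl"
typo = "user_acount_repository_factory_impl"  # one 'c' dropped

# ratio() returns 2*matches/total_chars, in [0, 1]
ratio = SequenceMatcher(None, correct, typo).ratio()
print(f"{ratio:.3f}")  # -> 0.986: almost indistinguishable
```

An actual LLM compares learned token embeddings rather than raw characters, but the effect is the same: the longer the identifier, the smaller the relative contribution of the typo.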
That's basically like trying to embarrass an IQ-180 student on emotional intelligence.
But I guess it's human nature to expect a machine to be 100x better than humanity on the first try.