On the contrary, this is testing the LLMs on inputs they're supposed to be good at.
Fundamentally, this kind of problem is the same as language translation, text comprehension, or coding tasks. It just tests where the boundaries of LLM capabilities lie by pushing them to their limits.
I've noticed LLMs bumping up against those very same limits in ordinary coding tasks. For example, if you have a prefix-suffix naming convention for identifiers, the LLMs can either do very well or get muddled up, depending on how the tokenizer splits them. Similarly, they're not great at spotting small typos in very long identifiers, because in their internal vector representations the correct and typo versions are very "close".
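To get an intuition for that "closeness", here's a toy sketch (the identifier names are made up): any similarity measure over the strings, whether edit distance or a learned embedding, rates the correct and typo'd versions as nearly identical, so a model that compresses its input has little signal to distinguish them.

```python
# Toy illustration: a one-character typo in a long identifier barely
# moves any similarity measure, which is roughly why LLMs miss it.
from difflib import SequenceMatcher

correct = "user_account_repository_factory_impl"
typo = "user_acount_repository_factory_impl"  # one 'c' dropped

# ratio() returns 2*matches/total_chars, in [0, 1]
ratio = SequenceMatcher(None, correct, typo).ratio()
print(f"{ratio:.3f}")  # -> 0.986: almost indistinguishable
```

An actual LLM compares learned token embeddings rather than raw characters, but the effect is the same: the longer the identifier, the smaller the relative contribution of the typo.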
That's basically like trying to embarrass an IQ-180 student on emotional intelligence.
But I guess it's human nature to expect a machine to be 100x better than humanity on the first try.