The reason is that image generators don't iterate on the output in the same way ...

The reason is that image generators don't iterate on the output in the same way the text-based LLMs do. Essentially they produce the image in "one hit" and can't solve a complex sequence in the same way you couldn't one-shot this either. Try taking a random maze, glance at it, then go off to draw a squiggle on a transparency. If you were to place that on top of the maze, there's virtually no chance that you'd have found the solution on the first try.

That's essentially what's going on with AI models, they're struggling because they only get "one step" to solve the problem instead of being able to trace through the maze slowly.

An interesting experiment would be to ask the AI to incrementally solve the maze. Ask it to draw a line starting at the entrance a little ways into the maze, then a little bit further, etc... until it gets to the end.