
This isn't true. I wrote a blog post about it a while back but never finished it. It's complete enough to demonstrate the point, though, so I'll plug it anyway: https://stefanlavelle.substack.com/p/no-language-isnt-enough

TLDR: Internal LLM representations correspond to an understanding of the visual world. We've all seen the Othello example, which is too constrained a world to mean much, but even more interesting is that LLMs can caption tokenized images with no pretraining on visual tasks whatsoever. Specifically: pass an image to an encoder-decoder vision model trained in a completely unsupervised manner on images -> take the encoded representation -> pass that representation to an LLM as tokens -> get accurate captions. The tests were done on gpt-j, which is not multimodal and has only 6bn params. The only caveat is that a linear mapping model needs to be trained to map the vector space of the encoder-decoder model into the embedding space of the language model, but this isn't doing any conceptual labour; it's only needed to align the completely arbitrary coordinate axes of the vision and language models, which were trained separately (akin to an American and a European agreeing to use metric or imperial — neither's conception of the world changes).
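The "no conceptual labour" claim can be illustrated with a toy numpy sketch (my own illustration, not from the blog post): if two models express the same underlying features in different arbitrary coordinate axes, a linear map fitted by least squares bridges them exactly, and it generalises to unseen inputs precisely because it only re-expresses coordinates rather than learning anything about the concepts themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # dimensionality of the shared underlying feature space

# Two models carve up the same underlying features with arbitrary
# invertible axes -- stand-ins for a vision encoder's latent space
# and a language model's embedding space.
A = rng.normal(size=(d, d))  # "vision" coordinate axes
B = rng.normal(size=(d, d))  # "language" coordinate axes

# Paired observations of the same underlying concepts z.
Z_train = rng.normal(size=(1000, d))
vision_train = Z_train @ A.T
language_train = Z_train @ B.T

# Fit the linear bridge W (vision -> language) by least squares.
W, *_ = np.linalg.lstsq(vision_train, language_train, rcond=None)

# Held-out concepts: the map transfers because it only realigns axes.
Z_test = rng.normal(size=(100, d))
pred = (Z_test @ A.T) @ W
err = np.max(np.abs(pred - Z_test @ B.T))
```

Real representation spaces are of course only approximately linearly related, which is why the mapped captions capture gist rather than detail.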

It's not intuitive, but it's hard to argue with these results. Even small LLMs can caption images. Sure, they don't get the low-level details like the texture of grass, but they get the gist.

I keep reading your sort of analysis, but honestly, those priors need updating. I had to update mine when I learned this. If 6bn params can do it, 175bn params with multimodality can certainly do it.

It's true that humans need symbol grounding, but humans also don't see hundreds of billions of token sequences. There are theoretical reasons (cf. category theory) why this could work, albeit probably limited to gist rather than detail.


