The root cause of all this is that C programs are not much more than glorified assembly programs. Any effort to retrofit higher-level reasoning will always be defeated by somebody doing some dirty pointer tricks. This can only be solved by more abstract ways of expressing programs, which necessarily restrict the bare-metal dirty things one can do. But what you gain is a compiler that can easily do lots of things a C compiler can't do at all, or only with a lot of headache. The kind of stuff this article is about is really trying to solve the wrong problem IMO.
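To make that concrete, here is a minimal sketch (function names are mine, purely illustrative) of the kind of pointer trick I mean. C's strict-aliasing rule is exactly such a retrofit: it lets the compiler assume an int* and a float* never point at the same object, so it may fold the return value below to a constant 1. The cast in main() breaks that assumption, and code like it in the wild is why whole projects compile with -fno-strict-aliasing and give the optimization up entirely.

    #include <stdio.h>

    /* Under strict aliasing the compiler may assume i and f never
       alias, so it can optimize this function to "return 1". */
    int set_and_read(int *i, float *f) {
        *i = 1;
        *f = 2.0f;   /* if f actually aliases i, this clobbers *i */
        return *i;   /* optimizer may fold this to 1 regardless */
    }

    int main(void) {
        int x = 0;
        /* the dirty pointer trick: punning an int as a float;
           undefined behavior, yet real programs do it anyway */
        printf("%d\n", set_and_read(&x, (float *)&x));
        return 0;
    }

Depending on optimization level, the same build prints 1 or the bit pattern of 2.0f reinterpreted as an int. The programmer's trick and the compiler's reasoning can't both win.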
> When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.
And where does this intuition come from? It was built by feeling other sensations in addition to vision. You learned how gravity pulls things down when you were a kid. How hot/cold feels, how hard/soft feels, how things smell. Your mental model of the world is substantially informed by non-visual cues.
> It therefore follows that robots should be able to learn with just RGB images too!
That does not follow at all! It's not how you learned either.
Nor did you learn to think by consuming all the text ever produced on the internet. By that logic, LLMs don't think; they are just pretty good at faking the appearance of thinking.