Consistent output and spatial coherence across both eyes? Maybe a couple of years. But meeting head-tracking accuracy and latency requirements? I'd bet decades.
There's no way any of this tech gets end-to-end latency down to acceptable levels without a massive change in hardware. We'll probably see someone use reprojection techniques in a year or so and claim they've done it, but truly generated pixels going straight to the headset based on head tracking are so, so far away.
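For reference, reprojection here means warping the last fully rendered frame to the newest head orientation instead of generating a fresh one. A minimal rotation-only sketch (the classic "timewarp" trick), assuming a pinhole intrinsics matrix K and using numpy/OpenCV rather than any particular headset runtime:

```python
import numpy as np
import cv2

def reproject(frame, K, R_render, R_display):
    """Warp `frame`, rendered at orientation R_render, to R_display.

    K: 3x3 camera intrinsics; R_*: 3x3 world-to-camera rotations.
    Rotation-only, so it hides latency on head turns but cannot
    correct translation -- which is exactly why it can be used to
    *claim* low latency without truly generating new pixels.
    """
    R_delta = R_display @ R_render.T      # render pose -> display pose
    H = K @ R_delta @ np.linalg.inv(K)    # pure rotation => homography
    h, w = frame.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))
```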
You don't have to do it in real time, per se. I imagine a world in which the renderer and the world generation are decoupled. For example, you could descriptively articulate what you wanted, have the model generate a world, quietly do some structure from motion (or just generate the models and textures directly), and use those as assets in a game engine for the actual moment-to-moment rendering.
You'd have some "please wait in this lobby space while we generate the universe" moments, but those are easy to hide with clever design.
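Something like this hypothetical two-phase pipeline; every function and class below is a made-up placeholder for the split, not a real API:

```python
from dataclasses import dataclass

@dataclass
class WorldAssets:
    meshes: list
    textures: list

def generate_world(prompt: str) -> WorldAssets:
    """Slow "lobby" phase: run the generative model, then recover
    geometry via structure from motion (or generate assets directly)."""
    # ... world model + SfM / asset extraction would go here ...
    return WorldAssets(meshes=[], textures=[])

def play(assets: WorldAssets) -> None:
    """Fast phase: a conventional engine renders the baked assets,
    so head-tracked latency is handled by mature, existing tech."""
    # ... hand assets to an engine and enter the render loop ...

if __name__ == "__main__":
    play(generate_world("a mossy forest temple at dawn"))
```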
I think your timeline is off, at least for a tech demo.
This model already runs at 24 fps, and I bet it could be made to run at over 75 fps by scaling hardware and distilling/quantizing the model to only work in certain environments.
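Rough frame-budget arithmetic behind that bet (the speedup multipliers are illustrative assumptions, not measurements):

```python
current = 1000.0 / 24   # ~41.7 ms per generated frame today
target = 1000.0 / 75    # ~13.3 ms per frame at 75 Hz
print(f"need ~{current / target:.1f}x faster")  # 75/24 ~= 3.1x

# If int8 quantization bought ~2x and distilling to a smaller,
# environment-specific model another ~2x (both assumed figures),
# the combined ~4x would clear the ~3.1x bar with headroom.
```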
The two-eye problem seems pretty trivial to me: add another image-decoding head whose sole task is decoding the other eye. Training data for this could be gathered plentifully from simulated 3D scenes, or by running existing 2D data (e.g. YouTube videos) through slow mono-to-stereo models. This should add minimal latency, since it's an extra parallel head rather than extra sequential layers.
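A minimal PyTorch sketch of that idea, with placeholder shapes and layer sizes rather than anything from the actual model:

```python
import torch
import torch.nn as nn

class StereoDecoder(nn.Module):
    def __init__(self, latent_dim: int = 512, out_ch: int = 3):
        super().__init__()
        self.backbone = nn.Sequential(                    # shared trunk
            nn.Conv2d(latent_dim, 256, 3, padding=1), nn.GELU(),
        )
        self.left_eye = nn.Conv2d(256, out_ch, 3, padding=1)   # original head
        self.right_eye = nn.Conv2d(256, out_ch, 3, padding=1)  # new head

    def forward(self, z: torch.Tensor):
        h = self.backbone(z)
        # Both heads read the same features, so the cost is one extra
        # parallelizable head, not extra sequential depth/latency.
        return self.left_eye(h), self.right_eye(h)

# left, right = StereoDecoder()(torch.randn(1, 512, 32, 32))
```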
If you can train the model to handle WASD movement plus mouse look, head tracking is not very different. I think with enough effort we could probably build a VR experience on top of this today. Getting it onto affordable hardware could be a totally different story, but it's certainly not decades away.
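To make "not very different" concrete, a toy sketch: a 6-DoF head-pose delta packs into the same kind of per-frame conditioning vector as WASD + mouse (names and units invented for illustration):

```python
import numpy as np

def head_pose_to_action(dx, dy, dz, dyaw, dpitch, droll):
    """Flatten a per-frame head-pose delta into an action vector,
    analogous to (move_x, move_y, look_x, look_y) for WASD + mouse."""
    return np.array([dx, dy, dz, dyaw, dpitch, droll])

# "W held + mouse nudged right" is just the low-DoF special case:
# head_pose_to_action(0.0, 0.0, 0.1, 0.02, 0.0, 0.0)
```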
I think VR will come at the same time they add multiplayer. There needs to be a separation between the world-state and the viewport; right now, I suspect they're the same thing.
But once you can get N cameras looking at the same world-state, you can make them N players, or a player with 2 eyes.
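A toy sketch of that separation, with entirely hypothetical classes:

```python
from dataclasses import dataclass, field

@dataclass
class WorldState:
    latent: dict = field(default_factory=dict)  # one shared simulation state

    def step(self, actions: list) -> None:
        """Advance the single world once per tick, folding in every
        player's actions (model-specific, elided here)."""

@dataclass
class Camera:
    pose: tuple  # position + orientation of this viewpoint

    def render(self, world: WorldState):
        """Decode one frame of the shared state from this pose.
        Stereo VR is two Cameras ~6.5 cm apart; multiplayer is
        N Cameras, each attached to a different player."""
```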
It's still hard to get acceptable VR output even from today's rendering engines. In the examples provided, the movement is slow and roughly linear, which doesn't translate to head movement in VR. VR needs two consistent video streams at much higher resolutions, and low latency is a must. Reception would still depend heavily on people's tolerance for the imperfections: some would be amazed, others would puke. That's why VR still isn't in the spotlight after all these years (I personally find it great).
That's an insane product right there just waiting to happen. Too bad Google sleeps so hard on the tech they create.