I wish they would share more about how it works. Maybe a research paper for once? We didn't even get a technical report.
My best guess: it's a video generation model like the ones we already have, but conditioned on control inputs (movement direction, view angle). Perhaps those inputs aren't relative but absolute, with a bit of state simulation going on? [Although some demo videos show physics interactions like bumping against objects, so that seems unlikely; or maybe it's 2D and the up axis is generated??]
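To make that guess concrete, here's a toy sketch of what "conditioning a frame predictor on movement/view-angle controls" could look like. This is entirely my own speculation; every name, shape, and layer choice below is invented for illustration, not anything from the demos:

```python
# Purely speculative sketch -- all names and shapes here are made up by me.
import torch
import torch.nn as nn

class ActionConditionedFramePredictor(nn.Module):
    """Predict the next frame latent from past frame latents plus per-step controls."""

    def __init__(self, latent_dim=256, action_dim=4, hidden_dim=512):
        super().__init__()
        # action = (move_x, move_y, yaw, pitch); could be relative deltas or
        # absolute pose -- the demos don't make it obvious which.
        self.action_embed = nn.Linear(action_dim, latent_dim)
        self.temporal = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, frame_latents, actions):
        # frame_latents: (B, T, latent_dim), actions: (B, T, action_dim)
        x = frame_latents + self.action_embed(actions)  # inject controls at every step
        h, _ = self.temporal(x)
        return self.head(h[:, -1])  # latent for the next frame

model = ActionConditionedFramePredictor()
frames = torch.randn(1, 8, 256)       # 8 past frame latents
actions = torch.randn(1, 8, 4)        # per-step movement / view-angle controls
next_latent = model(frames, actions)  # shape (1, 256)
```

If the actions are absolute poses rather than deltas, the model would effectively be handed a bit of "state" for free, which is why the distinction matters for the simulation question above.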
It's clearly trained on game-engine footage, since I can see screen-space reflection artefacts being learned. They also train on photoscans/splats... some non-realistic elements look significantly lower fidelity too.
Some inconsistencies I have noticed in the demo videos:
- wingsuit disocclusions are lower fidelity (maybe initialized from a high-resolution image?)
- the garden demo has different "geometry" for each variation; look at the 2nd hose, which only exists in one version (new "geometry" is made up when first looked at, not beforehand)
- the school demo has half a car outside the window? and a suspiciously repeating pattern (infinite-loop patterns are common in transformer models that lack parameters, so they can scale this even more! it might also be greedy sampling for stability; see the toy sketch after this list)
- the museum scene has an odd reflection in the amethyst box: the rear mammoth has no reflection on the rightmost side of the box before it's shown through the box, and the tusk reflection just pops in. This isn't a Fresnel effect.
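Quick toy illustration of the greedy-sampling point from the school-demo bullet: with a made-up next-token table, greedy decoding locks into a two-step loop, while sampling with some temperature breaks it. Everything below is invented for illustration and has nothing to do with the actual model:

```python
import random

# Fake "model": next-token distribution given only the previous token.
transitions = {
    "window": {"car": 0.6, "tree": 0.4},
    "car":    {"window": 0.7, "road": 0.3},
    "tree":   {"road": 0.9, "car": 0.1},
    "road":   {"tree": 0.5, "window": 0.5},
}

def greedy(token):
    # Always take the argmax -- deterministic, so once a cycle starts it never ends.
    probs = transitions[token]
    return max(probs, key=probs.get)

def sample(token, temperature=1.0):
    # Temperature sampling keeps some randomness, which breaks repetition loops.
    probs = transitions[token]
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights)[0]

def rollout(step_fn, start, n=8):
    out, tok = [], start
    for _ in range(n):
        tok = step_fn(tok)
        out.append(tok)
    return out

print("greedy: ", rollout(greedy, "window"))   # -> car, window, car, window, ... forever
print("sampled:", rollout(sample, "window"))   # -> varied, no fixed loop
```

Not saying that's what they do, just that the failure mode in the school demo looks a lot like this one.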
I feel that after the 2017 transformer paper, and its impact on the current state of AI and on Google's stock, Google is now much more inclined to keep things under its wing. Sadly so.