I expect part of it is that the contemporary recommendations for VR are extremely meaty - something like 2160x2160 per eye at 120Hz, with stereoscopic rendering meaning you're rendering every frame twice.
That's more than 1.1 billion pixels per second. At 24 bits a pixel that's roughly 27Gb/s of raw data. And that's just the bandwidth - you also need to hit the latency target that 120Hz implies, about 8ms per frame, in an environment where hiccups or input lag can cause physical discomfort for the user. And even if you render everything remotely, the headset still needs enough juice to decompress and display all of this at those throughputs.
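The napkin math, spelled out (using the assumed numbers from above - 2160x2160 per eye, 120Hz, 24 bits per pixel):

```python
# Back-of-envelope VR rendering budget.
# Assumed spec: 2160x2160 per eye, 120 Hz, 24-bit color.
width = height = 2160
eyes = 2
fps = 120
bits_per_pixel = 24

pixels_per_second = width * height * eyes * fps
bandwidth_gbps = pixels_per_second * bits_per_pixel / 1e9
frame_budget_ms = 1000 / fps

print(f"{pixels_per_second / 1e9:.2f} billion pixels/s")  # ~1.12
print(f"{bandwidth_gbps:.1f} Gb/s raw")                   # ~26.9
print(f"{frame_budget_ms:.1f} ms per frame")              # ~8.3
```

So every single frame has to be rendered twice, moved, and displayed inside an ~8ms window.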
I'm napkin mathing all of this, and I'm sure there have been lots of breakthroughs along these lines - foveated rendering, better codecs - but it's definitely not a straightforward problem to solve. Of course, it's arguable I'm also just falling victim to the fidelity-over-experience mindset I was just criticizing.