A graphics frame appears on the screen at a specific time. (For VR, it is a definite time, and this is critical. For normal video or games, a little bit of slop, maybe a few ms, is probably okay.)
For audio, humans are sensitive to 10 ms deviations or even less.
Any API that works decently will need to synchronize audio and video, so there needs to be a way for a program to say “this audio sample should play at the same time as this video frame is shown”. But an API should also allow programs to react as quickly as possible to user input. And Bluetooth headphones, in particular, have very, very high latency.
So designing an API that performs well without revealing the latency is hard.
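To make the tension concrete, here is a rough sketch of the arithmetic. The names (`OutputLatency`, `submit_times`, `earliest_synced_response`) and the latency figures are made up for illustration and are not from any real API; the only assumption is that each output path has a fixed delay between submitting data and the user perceiving it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class OutputLatency:
    # Hypothetical figures: time from submitting data to the user perceiving it.
    video_ms: float
    audio_ms: float


def submit_times(target_ms: float, latency: OutputLatency) -> Tuple[float, float]:
    """Return (video_submit_ms, audio_submit_ms) so both land at target_ms."""
    return target_ms - latency.video_ms, target_ms - latency.audio_ms


def earliest_synced_response(input_ms: float, latency: OutputLatency) -> float:
    """Earliest time picture and sound can both react to an input, in sync."""
    return input_ms + max(latency.video_ms, latency.audio_ms)


# Illustrative numbers only, not measurements.
wired = OutputLatency(video_ms=16.0, audio_ms=10.0)
bluetooth = OutputLatency(video_ms=16.0, audio_ms=200.0)

print(submit_times(1000.0, bluetooth))           # (984.0, 800.0)
print(earliest_synced_response(0.0, wired))      # 16.0
print(earliest_synced_response(0.0, bluetooth))  # 200.0
```

A program that wants both tight sync and fast reaction has to choose how to trade them off, and it can only make that choice if it knows (or can infer) the two numbers.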
I do think it would be good to cleanly separate normal web pages and games, though. For pure content, none of this matters except that video needs to maintain synchronization. But normal content does not need clicks to translate quickly to video changes.
I don't know about you, but I find everything about computers hard. That doesn't mean there aren't better or worse solutions.
The browser is the presentation layer; it needs to know the latency of your headphones (or the system does). Why does the content provider need it? What's wrong with "here is frame A, please play audio A at the same time (while taking into account the latency that only you know about)" as a request?
It's simply moving responsibility from one entity to another. There's no technical reason that the content provider would be better at syncing the two, just as with any other network communication.
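As a rough sketch of what that kind of request could look like: `PresentationLayer` and `present_together` below are hypothetical names, not a real browser API, and the latency values are invented. The point is only that the content hands over a frame and its audio as a pair, and the latencies never leave the presentation layer.

```python
class PresentationLayer:
    """Hypothetical browser/OS side: only it knows the device latencies."""

    def __init__(self, video_latency_ms: float, audio_latency_ms: float):
        self._video_latency_ms = video_latency_ms
        self._audio_latency_ms = audio_latency_ms
        # (submit_time_ms, kind, payload); kept only so the sketch can be inspected.
        self.log = []

    def present_together(self, now_ms: float, frame: str, audio: str) -> None:
        """Content asks for frame and audio to land together; nothing is returned."""
        # Both outputs land as soon as the slower path can deliver.
        land_ms = now_ms + max(self._video_latency_ms, self._audio_latency_ms)
        self.log.append((land_ms - self._video_latency_ms, "video", frame))
        self.log.append((land_ms - self._audio_latency_ms, "audio", audio))


# Illustrative numbers only: a 16 ms display path and a 200 ms Bluetooth path.
layer = PresentationLayer(video_latency_ms=16.0, audio_latency_ms=200.0)
layer.present_together(0.0, "frame A", "audio A")
print(layer.log)  # [(184.0, 'video', 'frame A'), (0.0, 'audio', 'audio A')]
```

In this sketch the layer holds the frame back 184 ms so it lines up with the audio. For pre-made content that hold is invisible; for something reacting to a click, it is exactly the delay the parent comment is worried about.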