How would you do this on the os? I would honestly try to sync audio output based...

svnt · on May 27, 2024

My money is that, outside of a couple of dog and pony demos with everyone on one well-administered LAN, you could not make this work with system time and NTP on consumer devices. You will regularly see 100 ms differences in NTP time between devices.
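To put the 100 ms figure in perspective, here is a rough back-of-the-envelope sketch. The 48 kHz sample rate is an assumption (a common rate, not something stated in the thread), and 343 m/s is the usual figure for the speed of sound in dry air at about 20 °C.

```python
SPEED_OF_SOUND_M_S = 343.0  # dry air at roughly 20 °C
SAMPLE_RATE_HZ = 48_000     # assumed playback rate, not from the thread


def offset_in_samples(offset_s, sample_rate_hz=SAMPLE_RATE_HZ):
    """How many audio samples a clock offset corresponds to."""
    return round(offset_s * sample_rate_hz)


def offset_as_distance_m(offset_s):
    """How far sound travels in air during the same offset."""
    return offset_s * SPEED_OF_SOUND_M_S


ntp_error_s = 0.100  # the 100 ms NTP disagreement mentioned above
print(offset_in_samples(ntp_error_s), "samples")
print(f"{offset_as_distance_m(ntp_error_s):.1f} m of acoustic path")
```

In other words, two devices that disagree by 100 ms sound like speakers placed tens of metres apart, which is a distinct echo rather than coherent playback.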

The fact that phased array microphones exist has nothing to do with the point we are discussing, which is audio coherence across heterogenous devices whose only real connection is a web browser.

jpc0 · on May 28, 2024

You don't need the time to actually be synced.

I'm thinking more some sort of system with a sync point registered per device and using that as a time reference.

It's not inconceivable that they could easily detect multiple devices in a room and find a sync point based on microphone input from a speaker.

Once you have found a sync point, you can set a delay on each device to try to match that sync point. Nobody said this is easy, or everyone would be doing it, but it's simple enough.
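A toy sketch of that idea, to make it concrete. This is not how any real product is known to do it; it just assumes each device records a known noise burst (the "sync point"), finds where the burst lands in its recording by brute-force cross-correlation, and then delays itself to line up with the slowest device. All names, offsets, and the burst itself are hypothetical.

```python
import random


def estimate_lag(reference, recorded, max_lag):
    """Return the lag (in samples) at which `recorded` best matches `reference`.

    Brute-force cross-correlation: for each candidate lag, sum the products of
    the reference with the recording shifted by that lag, and keep the best.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        window = recorded[lag:lag + len(reference)]
        score = sum(r * w for r, w in zip(reference, window))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def compensating_delays(lags):
    """Delay every device so that all of them line up with the slowest one."""
    slowest = max(lags.values())
    return {device: slowest - lag for device, lag in lags.items()}


# Hypothetical demo: a known noise burst is the sync point each device hears.
rng = random.Random(0)
burst = [rng.uniform(-1.0, 1.0) for _ in range(256)]


def simulate_recording(offset, total=1024):
    """Embed the burst at `offset` samples inside a quiet, slightly noisy buffer."""
    buf = [rng.uniform(-0.05, 0.05) for _ in range(total)]
    for i, sample in enumerate(burst):
        buf[offset + i] += sample
    return buf


true_offsets = {"laptop": 120, "phone": 310, "tablet": 45}  # made-up values
lags = {
    name: estimate_lag(burst, simulate_recording(offset), max_lag=512)
    for name, offset in true_offsets.items()
}
delays = compensating_delays(lags)
print(lags)    # measured lag per device, in samples
print(delays)  # extra delay each device should add, in samples
```

Note that nothing here needs the device clocks to agree: the only shared reference is the burst itself. A real system would also have to cope with room echoes, clock drift, and unknown speaker-to-mic distances, which this sketch ignores.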

The phased array microphone idea is more of a pipe dream, but you would absolutely be able to do something approaching that with multiple devices in a single room, depending on how accurately you can predict microphone location within the room. I'm reasonably sure you could start by just using the closest mic, and then over time, as you improve sync, you can try to use multiple.

As I said, every single audio stream, in and out, goes through their servers, and they have full control of the audio the tab is playing and the timing of that.

I don't see this being any different from what the likes of Sonos/Google Home/Apple Home etc. are doing with synced appliances for stereo/multichannel devices. It's likely significantly harder because it's heterogeneous devices, as you said.

All that doesn't answer my question of how you would do this at the OS level? You don't have any of the required information per device, only the central server has even the hope of having all the relevant information and control.

svnt · on May 29, 2024

We agree that doing it at the OS level is probably the wrong direction. I think you could get there with PTP and audio hardware support, which is more how Sonos etc. do it afaik, but then again you aren’t solving the heterogeneous device problem.

It is apparently a good example of something that needs performant neural nets in the cloud to solve. At first glance it looks like a low-level hardware-firmware problem. Market conditions prevent solving it at that level though, so we had to wait for the right combination of resources, new signal processing and heavy cloud compute.