I would honestly try to sync audio output based on a shared time reference, something along the lines of what AES67/Ravenna/Dante do, but you can be a little more lax and use NTP or system time since you don't need to be sample accurate.
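To make the "shared time reference" idea concrete, here is a minimal sketch of the standard NTP-style offset calculation from one request/response exchange. The function names and the scheduling helper are hypothetical, and this assumes the network delay is roughly symmetric, which is exactly the assumption that tends to break on consumer networks.

```python
def estimate_offset(t0, t1, t2, t3):
    """Estimate clock offset from one request/response exchange.

    t0: client send time (client clock), t1: server receive time (server clock),
    t2: server send time (server clock), t3: client receive time (client clock).
    All values in seconds. Assumes the path delay is symmetric.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0      # server clock minus client clock
    round_trip = (t3 - t0) - (t2 - t1)          # time spent on the network
    return offset, round_trip


def local_start_time(shared_start, offset):
    """Convert a start time on the shared (server) clock to the local clock."""
    # shared = local + offset, so local = shared - offset
    return shared_start - offset
```

Each device would then schedule playback at `local_start_time(...)` for a start time agreed on the shared clock. Any asymmetry in the path delay shows up directly as offset error.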
For the microphones that would be a little harder, but you should be aware it's not that tough, since a few high-end manufacturers have phased microphone arrays for videoconferencing. You could probably get close, but the fact is you need the audio from all the sources in a single location for processing, then do phase analysis on it and possibly find an optimal delay for each by checking the group delay.
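As a rough illustration of finding a per-source delay once all the streams are in one place, here is a sketch that uses plain time-domain cross-correlation. That is a simpler stand-in for the phase/group-delay analysis mentioned above, not the same technique, and the function name is made up for this example.

```python
def best_lag(reference, delayed, max_lag):
    """Return the lag in samples at which `delayed` best lines up with `reference`.

    A positive result means `delayed` lags behind `reference` by that many samples.
    Brute-force cross-correlation over lags in [-max_lag, max_lag].
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += r * delayed[j]
        if score > best_score:
            best, best_score = lag, score
    return best


# Same short burst, captured 3 samples later by a second microphone.
mic_a = [0, 0, 1, 3, -2, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 3, -2, 0, 0]
lag = best_lag(mic_a, mic_b, max_lag=5)
```

Dividing the lag by the sample rate gives the delay in seconds. Real room audio would need something more robust to echo and noise than this.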
The advantage they have is some latency is acceptable and they don't need to do it on a low power device.
I don't see what this has to do with Gemini but maybe that's just marketing...
My money is that outside of a couple of dog and pony demos with everyone on one well-administered LAN you could not make this work with system time and NTP on consumer devices. You will regularly see 100ms difference in NTP time.
The fact that phased array microphones exist has nothing to do with the point we are discussing, which is audio coherence across heterogeneous devices whose only real connection is a web browser.
I'm thinking more some sort of system with a sync point registered per device and using that as a time reference.
It's not inconceivable that they could easily detect multiple devices in a room and find a sync point based on microphone input from a speaker.
Once you have a sync point found you can then set a delay on all devices to try to match that sync point. Nobody said this is easy or everyone would be doing it but it's simple enough.
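The "set a delay on all devices" step can be sketched like this, assuming you have already measured when each device's copy of the sync sound was heard. The device names and numbers are invented for illustration. Since a late device can't be made to play earlier, everything gets aligned to the latest one.

```python
def alignment_delays(arrival_ms):
    """Given when each device's sync sound was heard (ms), return the extra
    delay each device should add so that they all line up with the latest one."""
    latest = max(arrival_ms.values())
    return {device: latest - t for device, t in arrival_ms.items()}


# Hypothetical measurements relative to an arbitrary common zero.
measured = {"laptop": 12.0, "phone": 40.0, "tablet": 25.0}
delays = alignment_delays(measured)
```

Here the phone is the slowest, so it adds nothing and the other two wait for it. In practice the measurements drift, so this would have to be re-run periodically.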
The phased array microphones are more of a pipe dream, but you would absolutely be able to do something approaching that with multiple devices in a single room, depending on how accurately you can predict microphone location within the room. I'm reasonably sure you could start by just using the closest mic, and then over time, as you improve sync, you can try to use multiple.
As I said they get every single audio stream in and out into their servers and they have full control of the audio the tab is playing and the timing of that.
I don't see this being any different to what the likes of Sonos/Google Home/Apple Home etc. are doing with synced appliances for stereo/multichannel devices, though it's likely significantly harder because it's heterogeneous devices, as you said.
All that doesn't answer my question of how you would do this at the OS level? You don't have any of the required information per device, only the central server has even the hope of having all the relevant information and control.
We agree that doing it at the OS level is probably the wrong direction. I think you could get there with PNTP and audio hardware support, which is more how Sonos etc. do it afaik, but then again you aren't solving the heterogeneous device problem.
It is apparently a good example of something that needs performant neural nets in the cloud to solve. At first glance it looks like a low-level hardware-firmware problem. Market conditions prevent solving it at that level though, so we had to wait for the right combination of resources, new signal processing and heavy cloud compute.