I'm not particularly experienced, but I did have good experiences with picovoice...

I'm not particularly experienced, but I did have good experiences with picovoice's services. It's a business specialised in programmatically available audio, tts, vad services etc.

They have a VAD that is trained on a 10 second clip of -your- voice, and it is then only activated by -your- voice. It works quite well in my experience, although it does add a little bit of additional latency before it starts detecting your voice (which is reasonably easy to overcome by keeping a 1s buffer of voice ready at all times. If the vad is active, just add the past 100-200ms of the buffer to the recorded audio. Works perfectly fine. It's just that the UI showing "voice detected" or "voice not detected" might lag behind 100-200ms)

Source: I worked on a VAD + whisper + LLM demo project this year and ran into some VAD issues myself too.