An encoder in the T5 sense doesn't produce a single fixed vector; it produces one encoded vector for every step of input, and all of them are given to the decoder.
The only difference between encoder/decoder and decoder-only is masking:
In an encoder, none of the tokens are masked at any step, and are all visible in both directions to the encoder. Each output of the encoder can attend to any input of the encoder.
In the decoder, the tokens are masked causally - each N+1 token can only attend to the previous N tokens.
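To make the masking difference concrete, here's a minimal sketch in plain Python (no ML library). The function names are mine, and I'm assuming the usual convention that a decoder position can also see itself, so row i allows columns 0..i.

```python
def encoder_mask(n):
    """Bidirectional mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Causal mask: position i may attend only to positions 0..i.

    Assumption: a position can attend to itself, as in common
    decoder implementations.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (masked)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print(show(encoder_mask(4)))
    print()
    print(show(causal_mask(4)))
```

The encoder mask is all `x`; the causal mask is lower-triangular, which is the whole difference being described.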
It is not streaming in the way people normally use this term. It's a fuzzy notion but typically streaming means something encompassing:
- Processing and emitting results on something closer to word by word level
- Allowing partial results while the user is still speaking and mid-segment
- Not relying on an external segmenter to determine the chunking (and therefore also latency) of the output.
This is fascinating because if your hint in another comment indicates you worked on this at Google, it's entirely possible I have this all wrong because I'm missing the actual ML part - I wrote the client encoder & server decoder for Opus and the client-side UI for the SODA launch, and I'm honestly really surprised to hear Google has different stuff. The client-side code loop AGSA used is 100% replicated, in my experience, by using Whisper.
I don't want to make too strong of claims given NDAs (in reality, my failing memory :P) but I'm 99% sure inference on-device is just as fast as SODA. I don't know what to say because I'm flummoxed. It makes sense to me that Whisper isn't as good as SODA, and I don't want to start banging the table that it's no different from a user or client perspective; I don't think that's fair. There's a difference in model architecture and it matters. I think it's at least a couple of WER points behind.
But then where are the better STT solutions? Are all the obviously much better solutions really all locked up? Picovoice is the only closed solution I know of available for local dev, and per even them, it's only better than the worst Whisper. The smallest Whisper model is 70 MB in ONNX vs. 130 MB for the next step up; both run inference fine with ~600 ms latency from audio bytes at the mic to text on screen, on everything from WASM in a web browser to a 3-year-old Android phone.
Something to keep an eye on is that Whisper is strongly bound to processing a 30-second window at a time. So if you send it 30 seconds of audio and it decodes it, and then you send it another second of audio, the only sensible way it can work is to have it reprocess seconds 2-30 in addition to the new data at second 31. If there were a way to have it process just the update, there's every possibility it could avoid a lot of work.
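Rough arithmetic for why this matters, as a sketch. I'm assuming each re-run costs one full 30-second window no matter how little new audio arrived, and `hop_ms` is a made-up parameter for how often you re-run to get fresh partial results. The "incremental" figure is a hypothetical ideal, not something Whisper offers.

```python
WINDOW_MS = 30_000  # Whisper's fixed input window (30 s)


def rerun_cost_ms(total_ms, hop_ms, window_ms=WINDOW_MS):
    """Audio-milliseconds pushed through the model when a full window
    is re-decoded after every hop_ms of new audio."""
    runs = total_ms // hop_ms
    return runs * window_ms


def incremental_cost_ms(total_ms):
    """Hypothetical ideal: each millisecond of audio is processed once."""
    return total_ms


if __name__ == "__main__":
    total = 60_000  # one minute of speech
    hop = 1_000     # re-run once per second of new audio
    ratio = rerun_cost_ms(total, hop) / incremental_cost_ms(total)
    print(f"re-running does {ratio:.0f}x the work of an incremental model")
```

Under these assumptions a one-minute utterance with one-second updates costs 30 times what a truly incremental model would, which is the "lot of work" that could in principle be avoided.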
I suspect that's what people are getting at by saying it's "not streaming": it's built as a batch process but, under some circumstances, you can run it fast enough to get away with pretending that it isn't.
You are missing the speech decoding part. I can't speak to why the clients you were working on were doing what they were doing. For a different reference point see the cloud streaming api.
Possible confusions from that doc: "RNN-T" is entirely orthogonal to RNNs (and not the only streamable model). Attention is also orthogonal to streaming. A chunked or sliding window attention can stream, a bi-directional RNN cannot. How you think of an encoder and a decoder streaming is also different.
At a practical level, if a model is fast enough, and VAD is doing an adequate job, you can get something that looks like "streaming" with a non-streaming model. If a streaming model has tons of look-ahead or a very large input chunk size, its latency may not feel a lot better.
Where the difference is sharp is where VAD is not adequate: Users speak in continuous streams of audio, they leave unusual gaps within sentences and run sentences together. A non-streaming system either hurts quality because sentences (or even words) get broken up that shouldn't, or has to wait forever and doesn't get a chance to run, when a streaming system would have already been producing output.
And to your points about echo cancellation and interference: There are many text-only operations that benefit from being able to start early in the audio stream, not late.
I just went through the process of helping someone stand up an interactive system with Whisper etc. and the lack of an open-sourced Whisper-quality streaming system is such a bummer because it really is so much laggier than it has to be.
Streaming for TTS doesn't matter but for speech to text it is more meaningful in interactive cases. In that case the user's speech is arriving in real time and streaming can mean a couple levels of things:
- Overlap compute with the user speaking: Not having to wait until all the speech has been acquired can massively reduce latency at the end of speech and allow a larger model to be used. This doesn't have to be the whole system, for instance an encoder can run in this fashion along audio as it comes in even if the final step of the system then runs in a non-streaming fashion.
- Produce partial results while the user is speaking: This can be just a UI nice to have, but it can also be much deeper, eg, a system can be activating on words or phrases in the input before the user is finished speaking which can dramatically change latency.
- Better segmentation: Whisper + Silero is just using VAD to make segments for Whisper; this is not at all the best you can do if you are actually decoding while you go. Looking at the results as you go allows you to make much better and faster segmentation decisions.
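A toy model of the first bullet, to show where the end-of-speech latency goes. Everything here is an assumption for illustration: `rtf` is a hypothetical real-time factor (compute seconds per second of audio), chunks are fixed-size, and none of the numbers come from a real system.

```python
def batch_latency(duration_s, rtf):
    """Wait after end of speech when compute only starts once all audio is in."""
    return duration_s * rtf


def streaming_latency(duration_s, chunk_s, rtf):
    """Wait after end of speech when each chunk is processed as it arrives."""
    finish = 0.0   # time at which the previous chunk finished processing
    arrived = 0.0  # seconds of audio received so far
    while arrived < duration_s:
        size = min(chunk_s, duration_s - arrived)
        arrived += size
        start = max(arrived, finish)  # can't start before the chunk exists
        finish = start + size * rtf
    return finish - duration_s


if __name__ == "__main__":
    print(batch_latency(10.0, 0.5))           # whole utterance still to process
    print(streaming_latency(10.0, 0.5, 0.5))  # only the last chunk is pending
```

With these made-up numbers the batch system still has the entire utterance to chew through when the user stops talking, while the chunked one only has the final half-second chunk left. That gap is also the headroom that lets a larger model fit in the same latency budget.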
The only models that do what you're poking at holistically are 4o (claimed) and that French company with the 7B one. They're also bleeding edge, either unreleased or released and way wilder, e.g. the French one interrupts too much, and screams back in an alien language occasionally.
Until these, you'd use echo cancellation to try and allow interruptible dialogue, and that's unsolved. You need a consistently cooperative chipset vendor for that (read: it wasn't possible even at scale, with carrots, presumably sticks, and much cajoling. So it works on iPhones consistently.)
On every stack I've seen that is described as streaming, the partial results are obtained by running inference on the entire audio so far, and silence is determined by VAD.
I find it hard to believe that Google and Apple specifically, and every other audio stack I've seen, are choosing to do "not the best they can at all".
This is exactly what Google ASR does. Give it a try and watch how the results flow back to you, it certainly is not waiting for VAD segment breaking. I should know.
Streaming used to be something people cared about more. VAD is always part of those systems as well, you want to use it to start segments and to hard cut-off, but it is just the starting off point. It's kind of a big gap (to me) that's missing in available models since Whisper came out, partly I think because it does add to the complexity of using the model, and latency has to be tuned/traded-off with quality.
Thank you for your insight. It confirms some of my suspicions working in this area (you wouldn't happen to know anybody who makes anything more modern than the Respeaker 4-mic array?). My biggest problem is even with AEC, the voice output is triggering the VAD and so it continually thinks it's getting interrupted by a human. My next attempt will be to try to only signal true VAD if there's also sound coming from anywhere but behind, where the speaker is. It's been an interesting challenge so far though.
Re: mic, alas, no, BigCo kinda sucked. I had to go way out of my way to get to work on interesting stuff, it never mattered, and even when you did, you never got over the immediate wall of your own org, except for brief moments. i.e. I never ever had anyone even close to knowing anything about the microphones we'd be using; they were shocked to hear what AEC was, even when what we were working on was a marketing tentpole for Pixel. Funny place.
I'm really glad you saw this. So, so, so much time and hope was wasted there on the Nth team of XX people saying "how hard can it be? given physics and a lil ML, we can do $X", and inevitably reality was far more complicated, and it's important to me to talk about it so other people get a sense it's not them, it's the problem. Even unlimited resources and your Nth fresh try can fail.
FWIW my mind's been grinding on how I'd get my little Silero x Whisper gAssistant on device replica pulling off something akin to the gpt4o demo. I keep coming back to speaker ID: replace Silero with some newer models I'm seeing hit ONNX. Super handwave-y, but I can't help thinking this does an end-around both AEC being shit on presumably most non-Apple devices, and poor interactions from trying to juggle two things operating differently (VAD and AEC). """Just""" detect when there's >= 2 simultaneous speakers with > 20% confidence --- of course, tons of bits missing from there, ideally you'd be resilient to ex. TV in background. Sigh. Tough problems.
I'm not particularly experienced, but I did have good experiences with Picovoice's services. It's a business specialised in programmatically available audio services: TTS, VAD, etc.
They have a VAD that is trained on a 10 second clip of -your- voice, and it is then only activated by -your- voice. It works quite well in my experience, although it does add a little bit of additional latency before it starts detecting your voice. That is reasonably easy to overcome by keeping a 1s buffer of audio ready at all times: when the VAD becomes active, just prepend the past 100-200ms of the buffer to the recorded audio. Works perfectly fine. It's just that the UI showing "voice detected" or "voice not detected" might lag behind by 100-200ms.
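The buffer trick looks roughly like this. It's a sketch, not Picovoice's API: `PreRollBuffer`, the 20 ms frame size, and the `vad_active` flag are all names and numbers I made up, and the VAD itself is assumed to be an external call that returns a boolean per frame.

```python
from collections import deque


class PreRollBuffer:
    """Keep recent audio frames so a late VAD trigger doesn't clip the onset."""

    def __init__(self, frame_ms=20, keep_ms=1000, preroll_ms=200):
        self.preroll_frames = preroll_ms // frame_ms
        # Rolling history of the last keep_ms of audio, voiced or not.
        self.history = deque(maxlen=keep_ms // frame_ms)
        self.recording = []
        self.active = False

    def push(self, frame, vad_active):
        """Feed one frame plus the VAD decision for that frame."""
        if vad_active and not self.active:
            # VAD just fired: seed the recording with the frames that
            # arrived shortly before the trigger.
            past = list(self.history)
            n = min(len(past), self.preroll_frames)
            self.recording = past[len(past) - n:]
        if vad_active:
            self.recording.append(frame)
        self.active = vad_active
        self.history.append(frame)


if __name__ == "__main__":
    buf = PreRollBuffer()
    for i in range(20):                      # frame "payloads" are just indices
        buf.push(i, vad_active=(i >= 15))    # VAD fires late, at frame 15
    print(buf.recording)
```

In the demo the VAD only fires at frame 15, but the recording starts at frame 5, i.e. 200 ms earlier, which is the audio the late trigger would otherwise have dropped.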
Source: I worked on a VAD + whisper + LLM demo project this year and ran into some VAD issues myself too.
You and I agree fully, then. IMHO it's not too much work, at all, 400 LOC and someone else's models. Of course, as in that old saw, the art is knowing exactly those models, knowing what ONNX is, etc. etc., that's what makes it fast.
The non sequitur is because I can't feel out what's going on from their perspective. The hedging left a huge range where they could have been saying "I saw the gpt4o demo and there's another way that lets you have more natural conversation" and "hey think like an LSTM model, like Silero, there are voice recognizers that let you magically get a state and current transcription out", or in between, "yeah in reality the models are f(audio bytes) => transcription", which appears to be closer to your position, given your "it's not a streaming model, though it can be adapted".
The Xenon episode of Hamilton's Pharmacopeia is one of the very best episodes of that series, I don't know the best way to find it on streaming currently, it was available I believe on Netflix (or maybe it was Hulu?): https://www.vicetv.com/en_us/video/xenon-the-perfect-anesthe...
The full episode takes very disturbing twists and turns, worth a watch.
Unfortunately it's not, this is some supplementary video. The full thing is 44 minutes long and covers a lot of different angles. The ongoing story about the Czech Xenon clinic in it that he covers over a longer time period is especially crazy.
From season 3, they're all great, but besides the Xenon one I also remember "Ultra LSD" and "Synthetic Toad Venom Machine" being especially great episodes.
From the perspective of the husband, the woman taking prep implies she is cheating. He either doesn't get tested or doesn't care, it's more important that she not do the thing that implies she could be cheating than that she not get HIV, and that she be sexually available to him.
You're digging for a logical explanation for a fundamentally illogical cultural problem and you aren't going to get it.
No, that doesn’t make sense. And writing it off as an illogical cultural problem is just lazy. I’ll believe these guys are assholes. I don’t believe they’re just consistently idiots. If the men refused to acknowledge that they had HIV and refused to allow their wives to get prep, then the logical consequence is that the wife gets HIV and the men are forced to confront the fact that either the husbands themselves have HIV or the wife is cheating and got it from someone else.
It seems much more likely that the husbands refuse to allow their wives to get prep out of spite. The implication of infidelity angle does not feel plausible.
Otherwise the men are setting themselves up for a lose:lose scenario regardless of what the wife does.
"the men are forced to confront the fact..." No they aren't, they simply never confront it. People go to their graves denying that they have HIV, denying that they ever tested positive, denying that a positive HIV test has anything to do with illness.
"Otherwise the men are setting themselves up for a lose:lose scenario regardless of what the wife does." - Of course.
I mean, the most rational win:win thing to do is to get an HIV test and get treated if positive. They then both don't get sick and die and can't pass along HIV. Many people don't do that either. What's the mindset that explains this behavior? You can't work backwards from the most rational thing to do to what people actually do.
You are the one injecting this narrative about women cheating though. It seems to me you’re just making this up. It doesn’t make sense.
Establishing the narrative that if my wife gets HIV that she must be cheating on me is a losing proposition for the man that only increases the probability that his wife will appear to be cheating. There is no motivation for it.
Simply being a dick and saying women can’t use prep because I don’t want them to is a much simpler narrative.
Your reasoning here is similar to arguing they’re a stupid people ergo they don’t use prep because of aliens. It’s not compelling even if you’re willing to believe they may engage in irrational behavior.
You asked "why would someone hide prep", you got one example. It is by no means an exhaustive list. For example, a big fear is being perceived as having HIV (since prep drugs are also part of HIV treatment).
If you do actually have interest in this topic you could read about it:
"Men were able to initiate PrEP without discussing it with their partners, whereas some women said they needed to get permission. Discussions around starting PrEP could raise questions about trust and infidelity and act as a barrier to PrEP use."
Well, you have an anecdote from someone claiming to be from SA saying that’s culturally the perception. Here’s a summary of research [1] on the topic concluding similar reasons (among others):
> Several participants felt that they could stop taking PrEP when the need, as they saw it, had passed. Often this was to do with the nature of their current relationship, for example with a person regarded as unfaithful: “If I find someone that I will be in a relationship with and if he is not faithful, or I have started being unfaithful, then I will come back and get them.”
And
> On the basis of these findings, the authors suggest that take-up and continued use of PrEP is likely to remain subject to established social norms. These norms often relate to gender and they determine, for example, who decides what HIV prevention methods to use, and the extent to which a woman in a relationship might – or might not – be able to make and implement such choices.
Just because something seems logical to you doesn’t mean that social norms and pressures don’t supersede it. In fact, we even see it in our own culture with people believing vaccines cause autism, the whole belief that ivermectin cures COVID-19, or flat earthers. What’s really impressive though is you having such a problem with this idea despite overwhelming objective evidence to the contrary being available online and people telling you their lived experience on this very website, and that you significantly discount the very real possibility that people can be illogical in their strongly held beliefs even if it seems nonsensical to you. If you know nothing about a subject, you’re likely to believe what all your peers tell you, which is how misinformation gets a foothold. This misinformation can even come from nowhere. The point is that if enough people believe it, they can get others to believe it too. That’s literally how human belief systems work, where beliefs spring out of nothing.
Your first quote is saying the opposite of what you are trying to defend. It’s about women taking prep because they believe their husbands are cheating, which makes total sense.
Not HIV-positive husbands forbidding their wives from taking prep because it would enable the wives to cheat or imply that they are cheating.
The point is that in the scenario being described, where the woman feels she needs "permission", the man's perspective is... if you were taking this, what does that say about ME? What does that say about what you think about ME? The decision would be about him, not her. How could it be about her? Wait, if it's not about ME, who else do you need this for? MY wife would never need such a thing.
It is easy to avoid stigma and shame through denial. The woman would be well aware that he would not approve such a thing and would take it in secret.
I mean you literally have women saying they take it if they are cheating.
> Some of the women were prevented by their male partner from taking or continuing PrEP: “I showed him the pill. He immediately stopped me from saying more before mentioning he had heard about PrEP and that he was strongly against the pill... He ordered me to throw them away or else pack my bags and leave. And that was why I stopped taking them.”
Here are some more explicit quotes [1]:
> Another concern was that partners would interpret PrEP use as evidence of sexual activity outside the relationship.
> “I didn’t tell him about the pills. I was hesitant because he will say, ‘Why are you preventing HIV? Are you cheating now because we don’t have HIV so why are you taking pills?’ So, I decided to keep quiet. I am going to tell him. But for now, I haven’t told him about it.” PrEP User, Lower adherer, Age 21
It even makes sense that your partner taking PrEP would be seen as evidence of having an affair if you are convinced that neither of you had HIV prior. So not only do you continue on doubling down on a losing position, it’s not even an illogical line of reasoning to have.
I’d say the burden of proof is on you at this point that such an interpretation isn’t a social norm or that it’s even an illogical position to have.
The mainstream consensus is that he was wildly wrong about HIV specifically, that HIV causes AIDS, and that his influence in South Africa to not deploy anti-viral medications killed hundreds of thousands of people before the policy was reversed.
Part of his hypothesis was that viruses in general, not just retroviruses, were not connected to cancers; the consensus view is that this is completely wrong. We have a very large body of evidence on many virus-caused cancers now.
So, the two known human retroviruses both cause disease and retroviruses cause diseases in animals. Duesberg held on to and promoted this concept long after it should have been clear to him that there was zero empirical support for his idea.
To me the most convincing bit that weakens his "hypothesis" is the people who got AIDS after receiving transfusions of HIV-contaminated blood. Many of those people showed none of the risk factors.
Influenced the deaths of hundreds of thousands of people?
Yet apparently to this day he draws over 200k/yr in salary from Berkeley. I believe they are not entirely funded by tuition/endowments, which means California taxpayers support him at least in part.
He also, that I know of, still supports this position. To this day, you will find people getting into this particular conspiracy and rejecting treatment. It doesn't go well for them.
I do think that freedom of speech is important, and that many attempts to squash "misinformation" are misguided, but some speech has consequences. Personally I find Duesberg utterly reprehensible and morally culpable.
Perhaps I found the article clearer because of familiarity with the subject.
On the "retroviruses must be harmless" virology: He's a denier of viral involvement in cancers in general, not just that HIV must be harmless. He is way outside mainstream consensus on all kinds of things.
For instance, he argues that Kaposi sarcoma, a very common AIDS-related cancer, was caused by drug use and not opportunistic infection. It is now very well established that all KS, which also affects (typically older) HIV- people, is caused by HHV-8 infection.
The core thing he does on all of these topics is just to ignore or deny anything that doesn't agree with him, eg: Hemophiliacs treated with tainted blood get AIDS, HIV viral load directly corresponds to disease progression which is clearly halted by dropping HIV load with treatment, the HPV vaccine demonstrably prevents cervical cancer, etc. He is far off in quack territory.
I think I understand that retroviruses can cause disease, contrary to what Peter Duesberg seems to be claiming. What I'm wondering about is his claim that they should be harmless in order to survive. Is that something commonly accepted? If so, should it cause surprise that they aren't harmless, and still surviving? Is there an interesting scientific question somewhere in there?
That's the question I couldn't answer by reading the Wikipedia article. But I think thanks to some of the comments here my question is at least partly answered: at least some retroviruses, including HIV, seem to not kill off their host immediately, which I guess gives them time to reproduce and infect more hosts.
This injection isn't a vaccine, it's an anti-viral drug being used as pre-exposure prophylaxis. The first approval of this approach was in 2012, but using an oral pill with a short half-life taken daily.
That drug is still in use and also highly effective; the new improvement is to provide the same approach with a longer-acting injected drug. One reason there has been great interest in this, despite the already effective oral PrEP, is that there are thought to be socio-behavioral advantages for cases like women in Africa as in this study. For example: the woman does not have to keep a supply of daily pills that a partner can find. Also possibly improved adherence with no missed doses.
The drug itself is not thought to be more biologically effective than the oral drugs, which are already close to 100% effective assuming the patient actually takes them as scheduled.
South Africa (study was in SA and Uganda) has an adult HIV prevalence of 18.3% and 210k new infections per year. It is easy to select a high risk group in which you would expect to see new HIV infections during the course of the study without intervention.