nmfisher's comments | Hacker News

The article is talking about doing exactly that. The key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens. A single window of audio is usually somewhere between 10ms and 100ms. It's difficult to squish all that information down to a single "token" that represents the semantic and acoustic content for that window.

That's why residual vector quantization is a useful technique - using multiple codebooks to quantize a single timeslice, where each codebook quantizes the residual left over from the previous stage. You can also quantize a signal at different frequencies.
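Here's a toy sketch of the idea in Python/numpy (nothing to do with Mimi's actual implementation - random codebooks and brute-force nearest-neighbour lookup only), just to show how one timeslice becomes a small stack of discrete tokens:

    import numpy as np

    rng = np.random.default_rng(0)
    # 4 quantization stages, each with 1024 codes over a 128-dim embedding
    codebooks = [rng.standard_normal((1024, 128)) for _ in range(4)]

    def rvq_encode(frame, codebooks):
        residual, tokens = frame, []
        for cb in codebooks:
            idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
            tokens.append(idx)
            residual = residual - cb[idx]   # the next stage only sees the leftover error
        return tokens

    def rvq_decode(tokens, codebooks):
        return sum(cb[t] for t, cb in zip(tokens, codebooks))  # sum of selected codes

    frame = rng.standard_normal(128)        # stand-in for one encoder timeslice
    tokens = rvq_encode(frame, codebooks)
    print(tokens, np.linalg.norm(frame - rvq_decode(tokens, codebooks)))

Each extra codebook shrinks the reconstruction error, which is why you can trade audio quality against token count just by dropping the later levels.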

Towards the end of the post there are samples from their LLM trained on their Mimi audio codec.


> The key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens. A single window of audio is usually somewhere between 10ms and 100ms. It's difficult to squish all that information down to a single "token" that represents the semantic and acoustic content for that window.

I read the article and confess some of the modeling parts were above my comprehension. But I would like to add that as an audio engineer, the "key question" you describe is solved, just not applied to transformer models (?).

An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, specific words, etc quite fluently. And with tools like Melodyne - which already quantize audio semantically - they can identify (and manipulate) pitch and formants as well, turning an O vowel into an E vowel, or changing the inflection of a phrase (up-speak vs down-speak, for example).

I don't know how to apply this to a neural codec, but it seems like it shouldn't be that hard (that's my naivete coming through)
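For what it's worth, the kind of analysis I'm describing is bog-standard DSP. Here's a rough Python sketch of formant estimation via LPC (my own toy example with a made-up filename - this is certainly not what Melodyne actually does under the hood):

    import numpy as np
    import librosa

    sr = 16000
    y, _ = librosa.load("vowel.wav", sr=sr)            # hypothetical recording
    frame = y[2000:2000 + int(0.03 * sr)]              # one ~30 ms voiced window
    frame = frame * np.hamming(len(frame))

    a = librosa.lpc(frame, order=12)                   # LPC coefficients
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    formants = sorted(np.angle(roots) * sr / (2 * np.pi))
    print(formants[:3])   # roughly F1/F2/F3 in Hz - enough to tell an O from an E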


> An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, specific words, etc quite fluently.

As an experienced DAW author, I very, very much doubt this.

What can be done relatively easily is to "see" or rather "follow along" in the waveform when listening to the audio. But I read your claim as being that someone could look at the waveform (which is already decimated from the original) and identify words or phonemes without hearing the associated audio. I am extremely skeptical that there is anyone anywhere in the world who can do this.


I started in music but have since edited thousands of hours of podcasts. I cannot transcribe a track by looking at the waveform, except the word "um" haha. But without playing the audio I can tell you where words start and end, whether a peak is a B or a T or an A or an I sound... And Melodyne can add layers to that and tell me the pitch, formants (vowels), quantize the syllables etc. If I can do all this, a computer ought to be able to do the same and more


Hundreds of hours here, and I can't even always reliably spot my own ums. I edit as many out as I possibly can for myself, my co-host and guest, as well as eliminating continuation signaling phrases like "you know" and "like". I also remove uninteresting asides and bits of dead air. This is boring and tedious work but it makes the end result considerably better I think.

I feel like there should be a model that can do much of this for me but I haven't really looked into it, ironically due to laziness, but also because I edit across multiple tracks at this stage, and I'm afraid to feed the model an already mixed stereo track. I'm curious why you still do it manually, if you still do and if you've looked into alternatives.
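For the dead-air part at least, even a dumb energy threshold gets you most of the way before any model is involved. A minimal sketch with librosa/numpy (my own thresholds and a made-up filename - tune to taste):

    import numpy as np
    import librosa

    y, sr = librosa.load("episode_mix.wav", sr=None)       # hypothetical mixed-down track
    hop = 512
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]       # per-frame loudness
    quiet = rms < 0.01                                      # silence threshold (tune)

    min_frames = int(1.5 * sr / hop)                        # flag gaps longer than ~1.5 s
    start = None
    for i, q in enumerate(np.append(quiet, False)):
        if q and start is None:
            start = i
        elif not q and start is not None:
            if i - start >= min_frames:
                print(f"dead air: {start * hop / sr:.1f}s - {i * hop / sr:.1f}s")
            start = None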


> I edit as many out as I possibly can for myself, my co-host and guest, as well as eliminating continuation signaling phrases like "you know" and "like". I also remove uninteresting asides and bits of dead air.

Hopefully using Ardour's "Ripple - Interview" mode :))


I use Descript to edit videos/podcasts and it works great for this kind of thing! It transcribes your audio and then you can edit it as if you were editing text.


Yeah, that stuff is just freaking amazing. I don't know what the transcription quality is like, but if I was doing this as a job, and it was good at transcription, I'd definitely be using that all the time.


> An experienced engineer can look at a waveform in a DAW and identify specific consonants, vowels, specific words, etc quite fluently.

DAWs' rendered waveforms have so little information that such identification is likely impossible even in theory. Telling apart plosives and vowels maybe, but not much more than that.

I work with phoneticians and they can (sometimes) read even words from suitably scaled spectrograms, but that's a lot more information than in waveforms.


> the key question is how to convert an inherently continuous signal (speech/audio) into a discrete set of tokens

Did Claude Shannon not answer this question in 1948? You need at least 1 bit per 6dB of dynamic range for each symbol and 2B symbols per second where B is the bandwidth of the signal.

Compression techniques are all about getting below that fundamental limit but it's not like this is an unsolved problem. Or is 1kbaud too much for LLMs?
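For concreteness (my own back-of-envelope numbers): take telephone-band speech with ~4 kHz of bandwidth and ~96 dB of dynamic range:

    bandwidth_hz = 4_000
    dynamic_range_db = 96

    bits_per_symbol = dynamic_range_db / 6           # ~1 bit per 6 dB
    symbols_per_sec = 2 * bandwidth_hz               # Nyquist rate
    print(bits_per_symbol * symbols_per_sec / 1000)  # ~128 kbit/s uncompressed

So raw PCM at that quality is ~8,000 16-bit samples per second - the follow-up question is whether a model can handle a sequence that long.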


Yes, quantization isn't anything new, nor are audio codecs. As you point out, though, it's not just about designing a quantization scheme to reconstruct an analog signal. The scheme itself needs to be "easy" for current model architectures to learn and decode autoregressively (ideally, in realtime on standard hardware).

The blog post addresses this directly with samples from their own baseline (an autoregressive mu-law vocoder), and from WaveNet (which used a similar architecture). The sound is mostly recognizable as a human voice, but it's unintelligible. The sequence length is too long and the SNR for the encoding scheme is too low for a generative/autoregressive model to learn.

This is what the neural codec is intended to address. Decoupling semantic from acoustic modelling is an important step ("how our ears interpret a sound" vs. "what we need to reconstruct the exact acoustic signal"). Mimi works at 1.1kbps, and others work at low bitrates too (Descript, SemantiCodec, etc). Encodec runs at a higher bitrate so generally delivers better audio quality.
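To put rough numbers on the sequence-length point (my own arithmetic, figures approximate):

    seconds = 10

    mu_law_tokens = 24_000 * seconds          # one token per sample, assuming 24 kHz mu-law
    mimi_tokens   = int(12.5 * seconds) * 8   # ~12.5 frames/s x ~8 RVQ levels (approx.)

    print(mu_law_tokens, mimi_tokens)         # 240,000 vs ~1,000 tokens to model

That's two-plus orders of magnitude fewer positions for the transformer to attend over, which matters more than the raw bitrate.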

Now - why are neural codecs easier to model than conventional parametric codecs? I don't know. Maybe they're not, maybe it's just an artifact of the transformer architecture (since semantic tokens are generally extracted from self-supervised models like WavLM). It's definitely an interesting question.


One of the popular speech-to-text models is Whisper, which starts with the conventional spectral analysis of the speech signal, and then feeds the data into a Transformer model. It works quite well.

https://openai.com/index/whisper/
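The front end is just a log-mel spectrogram. A minimal sketch using librosa rather than Whisper's own preprocessing code (parameters match Whisper's, as far as I know; the filename is hypothetical):

    import numpy as np
    import librosa

    audio, sr = librosa.load("speech.wav", sr=16000)    # hypothetical input file
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=80
    )
    log_mel = np.log10(np.maximum(mel, 1e-10))
    print(log_mel.shape)   # (80 mel bins, ~100 frames per second) fed to the Transformer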

This approach dates back to the 1940s, when people were trained to read speech from spectrograms. There is a 1947 book "Visible Speech" by Potter, Kopp, and Green describing these experiments. Here is a slightly more recent 1988 review of the subject: "Formalizing Knowledge Used in Spectrogram Reading"

https://apps.dtic.mil/sti/tr/pdf/ADA206826.pdf


Interesting - do you have the source handy?


To be more exact, 70-80% of hospital-acquired UTIs are catheter-linked.

https://pubmed.ncbi.nlm.nih.gov/31532742/

The infection rate is initially 3-10% but increases by 5% for every day it's left in.

Some more information here on biofilm development and sepsis/mortality rates, which is chilling.

https://pmc.ncbi.nlm.nih.gov/articles/PMC2963580/


Overall I'm quite positive towards core Cloudflare products like Tunnels, Workers, R2, KV etc, but a lot of newer products are often either thoroughly broken (e.g. Cloudflare AI) or unusable due to insufficient documentation (e.g. Email Routing).

After being burned a few times, I think I'm going to ignore any new Cloudflare product for 12 months after stable release. If their products worked as advertised, I'd be willing to pay considerably more. I think their commitment to the free tier is hamstringing them a little bit.


I also got burned and yes I also feel this way about it, i.e. AutoRAG has huge issues too, not to mention the whole MCP/Agents suite of SDKs...


Yeah, I just use Workers and Durable Objects. For stuff like Queues that's built on top of DO, it's better to just use DO directly.


Ubank in Australia just told me they’re retiring their website in a few months; the app will be the only way to access your account. It’s digital-only, so no real-world branches either.


This would be enough of a reason for me to immediately move all my savings to another bank. No website, no business.


> the app will be the only way to access your account

Maybe also on the ATMs of other banks?


Feels very much like a knee-jerk response to Facebook releasing their "Vibes" app the week before. It's basically the same thing, OpenAI are probably willing to light a pile of money on fire to take the wind out of their sails.

I also don't think the "Sam Altman" videos were authentic/organic at all, smells much more like a coordinated astroturfing campaign.


Or to distract from the new routing and intent/context detection thing they have going on.


Dart already transpiles to JavaScript - not exactly native support, but practically the next best thing.

That being said, I'm also 100% behind the effort to standardize WASM as the cross-platform compilation target of choice.


If you transpile to JavaScript, the performance will never exceed that of JavaScript. TypeScript is a bit silly in that respect: it strips out all the types the developers put in, so they aren't used to improve time or memory performance at all.


Why do you say they are different models? I've been looking at this today and haven't seen anything explicitly state that.


This is just my assumption given that they listed a lot of different models here: https://modelstudio.console.alibabacloud.com/?spm=a3c0i.2876...

This is an older link, but they listed two different sections here, commercial and open source models: https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

For the realtime multimodal, I'm not seeing the open source models tab: https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...


A dollar is always a dollar, so it's hard to claim that $1 million in revenue is actually worth $10 million. OpenAI shares, on the other hand, aren't publicly traded, so it's much easier to claim they're worth $10 million when no one would actually be willing to pay more than $1 million.

It's not necessarily manipulative but it's also not exactly an arms-length purchase of GPUs on the open market.


> We need to build better native UI libraries that just open up a WebGL context and draw shit to that.

This is what Flutter does. It works well, but you do lose some nice things that the browser/DOM provides (accessibility, zooming, text sizing/selection, etc). There’s also a noticeable overhead when the app + renderer is initially downloaded.


Singapore definitely does have this problem, there are police/government warnings against scams and fraud plastered everywhere.


It's actually one of the hardest forms of crime for a state like Singapore to stop - you can police everyone inside the borders of the country extremely effectively, but there's not much you can do against scammers operating out of mainland China, apart from trying to stop people falling for it.

