> I won’t speculate about why images exhibit this behaviour and sound seemingly doesn’t, but it is certainly interesting (feel free to speculate away in the comments!).
Images have a large near-DC component (solid colors) and useful spatial-domain properties, while human hearing starts at ~20 Hz and the frequencies needed to understand speech range from roughly 300 Hz to 4 kHz (spitballing based on the bandwidth of analog phones).
What would happen if you built a diffusion model using pink noise to corrupt all coefficients simultaneously? Alternatively, what if you used something other than noise (like a direct blur) for the model to reverse?
Thanks for reading! The paper that directly inspired this blog post actually investigates the latter (blurring as the corruption process): https://arxiv.org/abs/2206.13397
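(For the pink-noise half of that question, here is a minimal sketch of my own, not anything from the post: shape white Gaussian noise to a 1/f power spectrum and mix it into the signal the way a variance-preserving forward step would. The shaping, normalization, and schedule are all illustrative assumptions.)

```python
import numpy as np

def pink_noise(n, rng=np.random.default_rng(0)):
    """Approximate 1/f ("pink") noise by shaping white noise in the frequency domain."""
    white = rng.standard_normal(n)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n)
    # 1/sqrt(f) amplitude scaling gives a 1/f power spectrum; keep the DC bin finite.
    scale = np.ones_like(freqs)
    scale[1:] = 1.0 / np.sqrt(freqs[1:])
    pink = np.fft.irfft(spectrum * scale, n)
    return pink / pink.std()

def corrupt(x, t, rng=np.random.default_rng(1)):
    """Variance-preserving forward step with pink instead of white noise, t in [0, 1]."""
    noise = pink_noise(len(x), rng)
    return np.sqrt(1.0 - t) * x + np.sqrt(t) * noise

# Example: corrupt a 1-second 440 Hz tone halfway through the forward process.
sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
x_noisy = corrupt(x, t=0.5)
```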
The lack of semantics associated with DC (and near-DC) components in audio data is important, and a big difference compared to image data, no doubt.
I'm not sure this changes if you look at a cepstral representation (as suggested in the article). In this case, the DC component represents the (white) noise level in the raw audio space (i.e., the spectrum averaged over all frequencies), so it doesn't have strong semantics either (other than... "how noisy is the waveform?").
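(A quick numerical illustration of that point, as a sketch of my own rather than anything from the comment: the zeroth coefficient of the real cepstrum is exactly the mean of the log-magnitude spectrum, i.e. a broadband "how loud / how noisy" summary of the frame.)

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr

# A tone buried in white noise; raise noise_level and mostly c[0] shifts.
noise_level = 0.3
x = np.sin(2 * np.pi * 440 * t) + noise_level * rng.standard_normal(sr)

# Real cepstrum: inverse FFT of the log-magnitude spectrum.
log_spectrum = np.log(np.abs(np.fft.fft(x)) + 1e-12)
cepstrum = np.fft.ifft(log_spectrum).real

# The DC cepstral coefficient equals the mean of the log-magnitude spectrum.
print(cepstrum[0], log_spectrum.mean())
```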
All four audio examples are human-made, so it makes sense they emphasize the frequency range that humans distinguish best. It would be interesting to compare with natural audio to see if there's a distinction like the one found between natural and man-made scenes in images. (Unfortunately, there are increasingly few places on Earth where you can find truly natural audio with no man-made sounds audible…)
You could just generate the audio in frequency space, much like how MP3 style codecs encode the raw signal. This converts the purely 1D audio waveform into a 2D grid of values, which is more amenable to this type of diffusion-based generation.
It is not really 1D - to perform any T/F transform (FFT, (M)DCT, etc.) you need a number of samples in the time domain, so you are essentially transforming one 2D representation (intensity over time) into another (magnitude, or magnitude + phase, over frequency). This is why MP3 style codecs usually have multiple frame (or "window") lengths: typically a longer one for high frequency resolution and a shorter one for high temporal resolution.
That’s exactly what I mean. Break the 1D audio up into a 2D representation over time and frequency, train the AI in this space with diffusion noise added, and have it generate de-noised output in the same space.
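(A minimal sketch of that pipeline, assuming an STFT as the time-frequency transform and an arbitrary noise scale; a codec-style MDCT would give a real-valued grid instead of a complex one. A trained model would denoise the 2D grid at the marked step; this only shows the representation round-trip.)

```python
import numpy as np
from scipy.signal import stft, istft

sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # stand-in for a real waveform

# 1D waveform -> 2D time-frequency grid (complex STFT coefficients).
_, _, Z = stft(x, fs=sr, nperseg=512)

# Diffusion-style forward corruption applied directly to the 2D grid.
t_step = 0.5
rng = np.random.default_rng(0)
noise = rng.standard_normal(Z.shape) + 1j * rng.standard_normal(Z.shape)
Z_noisy = np.sqrt(1 - t_step) * Z + np.sqrt(t_step) * noise * np.abs(Z).mean()

# A trained model would denoise Z_noisy here; we just invert back to a waveform.
_, x_noisy = istft(Z_noisy, fs=sr, nperseg=512)
```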