> I won’t speculate about why images exhibit this behaviour and sound seemingly doesn’t, but it is certainly interesting (feel free to speculate away in the comments!).
Images have a large near-DC component (solid colors) and useful spatial-domain properties, while human hearing starts at ~20 Hz and the frequencies needed to understand speech range from roughly 300 Hz to 4 kHz (spitballing based on the bandwidth of analog phones).
What would happen if you built a diffusion model using pink noise to corrupt all coefficients simultaneously? Alternatively, what if you used something other than noise (like a direct blur) for the model to reverse?
Thanks for reading! The paper that directly inspired this blog post actually investigates the latter (blurring as the corruption process): https://arxiv.org/abs/2206.13397
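(For the pink-noise half of that question, here is a minimal sketch of my own, not anything from the post: shape white Gaussian noise to a 1/f power spectrum and mix it into the signal the way a variance-preserving forward step would. The shaping, normalization, and schedule are all illustrative assumptions.)

```python
import numpy as np

def pink_noise(n, rng=np.random.default_rng(0)):
    """Approximate 1/f ("pink") noise by shaping white noise in the frequency domain."""
    white = rng.standard_normal(n)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n)
    # 1/sqrt(f) amplitude scaling gives a 1/f power spectrum; keep the DC bin finite.
    scale = np.ones_like(freqs)
    scale[1:] = 1.0 / np.sqrt(freqs[1:])
    pink = np.fft.irfft(spectrum * scale, n)
    return pink / pink.std()

def corrupt(x, t, rng=np.random.default_rng(1)):
    """Variance-preserving forward step with pink instead of white noise, t in [0, 1]."""
    noise = pink_noise(len(x), rng)
    return np.sqrt(1.0 - t) * x + np.sqrt(t) * noise

# Example: corrupt a 1-second 440 Hz tone halfway through the forward process.
sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
x_noisy = corrupt(x, t=0.5)
```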
The lack of semantics associated with DC (and near-DC) components in audio data is important, and a big difference compared to image data, no doubt.
I'm not sure this changes if you look at a cepstral representation (as suggested in the article). In this case, the DC component represents the (white) noise level in the raw audio space (i.e., the spectrum averaged over all frequencies), so it doesn't have strong semantics either (other than... "how noisy is the waveform?").
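(A quick numerical illustration of that point, as a sketch of my own rather than anything from the comment: the zeroth coefficient of the real cepstrum is exactly the mean of the log-magnitude spectrum, i.e. a broadband "how loud / how noisy" summary of the frame.)

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr

# A tone buried in white noise; raise noise_level and mostly c[0] shifts.
noise_level = 0.3
x = np.sin(2 * np.pi * 440 * t) + noise_level * rng.standard_normal(sr)

# Real cepstrum: inverse FFT of the log-magnitude spectrum.
log_spectrum = np.log(np.abs(np.fft.fft(x)) + 1e-12)
cepstrum = np.fft.ifft(log_spectrum).real

# The DC cepstral coefficient equals the mean of the log-magnitude spectrum.
print(cepstrum[0], log_spectrum.mean())
```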
All four audio examples are human-made, so it makes sense they emphasize the frequency range that humans distinguish best. It would be interesting to compare with natural audio to see if there's a distinction like the one found between natural and man-made scenes in images. (Unfortunately, there are increasingly few places on Earth where you can find truly natural audio with no man-made sounds audible…)
You could just generate the audio in frequency space, much like how MP3 style codecs encode the raw signal. This converts the purely 1D audio waveform into a 2D grid of values, which is more amenable to this type of diffusion-based generation.
It is not really 1D - to perform any T/F transform (FFT, (M)DCT, etc.) you need a number of samples in the time domain, so you are essentially transforming one 2D representation (intensity over time) into another (magnitude, or magnitude + phase, over frequency). This is why MP3 style codecs usually have multiple frame (or "window") lengths: typically a longer one for high frequency resolution and a shorter one for high temporal resolution.
That’s exactly what I mean. Break the 1D audio up into a 2D representation over time and frequency, train the AI in this space with diffusion noise added, and have it generate de-noised output in the same space.
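(A minimal sketch of that pipeline, assuming an STFT as the time-frequency transform and an arbitrary noise scale; a codec-style MDCT would give a real-valued grid instead of a complex one. A trained model would denoise the 2D grid at the marked step; this only shows the representation round-trip.)

```python
import numpy as np
from scipy.signal import stft, istft

sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # stand-in for a real waveform

# 1D waveform -> 2D time-frequency grid (complex STFT coefficients).
_, _, Z = stft(x, fs=sr, nperseg=512)

# Diffusion-style forward corruption applied directly to the 2D grid.
t_step = 0.5
rng = np.random.default_rng(0)
noise = rng.standard_normal(Z.shape) + 1j * rng.standard_normal(Z.shape)
Z_noisy = np.sqrt(1 - t_step) * Z + np.sqrt(t_step) * noise * np.abs(Z).mean()

# A trained model would denoise Z_noisy here; we just invert back to a waveform.
_, x_noisy = istft(Z_noisy, fs=sr, nperseg=512)
```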