Hacker News | psyq123's comments

It is not really 1D: to perform any time/frequency transform (FFT, (M)DCT, etc.) you need a block of samples from the time domain, so you are essentially transforming one 2D representation (intensity over time) into another 2D representation (magnitude, or magnitude plus phase, over frequency). This is why MP3-style codecs usually offer multiple frame (or "window") lengths: typically a longer one for high frequency resolution and a shorter one for high temporal resolution.
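To make the point concrete, here is a minimal sketch (not from any particular codec) of how a 1D signal becomes a 2D time/frequency representation: slice it into overlapping windowed frames and FFT each frame. The frame and hop sizes are illustrative assumptions.

```python
import numpy as np

def stft_magnitude(x, frame_len=1024, hop=512):
    """Split x into overlapping Hann-windowed frames and FFT each one.

    Returns a 2D array of shape (num_frames, frame_len // 2 + 1):
    one axis is time (frame index), the other is frequency (bin index).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Real FFT per frame: 1D samples in, 2D magnitudes out.
    return np.abs(np.fft.rfft(frames, axis=1))
```

A longer `frame_len` gives finer frequency bins but smears transients in time; a shorter one does the opposite, which is the trade-off the multiple window lengths address.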


That’s exactly what I mean. Break the 1D audio up into a 2D representation over time and frequency, train the model in that space with added diffusion noise, and have it generate denoised output in the same space.

