Hacker News | benanne's comments

I actually wrote down some thoughts about audio phase in a previous blog post: https://sander.ai/2020/03/24/audio-generation.html#motivatio...

I have an example audio clip in there where the phase information has been replaced with random noise, so you can perceive the effect. It certainly does matter perceptually, but it is tricky to model, and small "vocoder" models do a decent job of filling it in post-hoc.


I'm not sure if frequency decomposition makes sense for anything that's not grid-structured, but there is certainly evidence that there is positive "transfer" between generative modelling tasks in vastly different domains, implying that there are some underlying universal statistics which occur in almost all data modalities that we care about.

That said, the gap between perceptual modalities (image, video, sound) and language is quite large in this regard, and probably also partially explains why we currently use different modelling paradigms for them.


Oof, you're not going to like this other blog post I wrote then :D https://sander.ai/2023/07/20/perspectives.html


Well, yeah, I don’t know what you expect me to say, it’s sloppy work.


Sorry to hear that. My blog posts are intended to build intuition. I also write academic papers, which of course involves a different standard of rigour. Perhaps you'd prefer those, though only one of them is about diffusion models.


Thanks for reading! Absolutely, I included a few references that explore that approach at the bottom of section 4 (last two paragraphs).


Excellent, thanks, will check them out.

Had just finished watching the Physics of Language Models[1] talk, where they show how GPT2 models could learn non-trivial context-free grammars, as well as effectively do dynamic programming to an extent, so thought it would be interesting to see how they performed in the spectral fine-graining task.

[1]: https://physics.allen-zhu.com/home


> I included a few references that explore that approach at the bottom of section 4

Man, reading on a mobile phone just ain't the same. Somehow managed to not catch the end of that section. The first reference, "Generating Images with Sparse Representations", is very close to what I had in mind.


Thanks for reading! The paper that directly inspired this blog post actually investigates the latter (blurring as the corruption process): https://arxiv.org/abs/2206.13397


Thanks for reading! Check out subspace diffusion: https://arxiv.org/abs/2205.01490


I've since moved on to work primarily on diffusion models, so I have a series of blog posts about that topic as well!

- https://sander.ai/2022/01/31/diffusion.html is about the link between diffusion models and denoising autoencoders, IMO the easiest to understand out of all interpretations;

- https://sander.ai/2023/07/20/perspectives.html covers a slew of different perspectives on diffusion models (including the "autoencoder" one).

In a nutshell, diffusion models break up the difficult task of generating natural signals (such as images or sound) into many smaller partial denoising tasks. This is done by defining a corruption process that gradually adds noise to an input until all of the signal is drowned out (this is the "diffusion"), and then learning how to invert that process step-by-step.
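A minimal numerical sketch of such a corruption process (a toy variance-preserving noise schedule on a 1D signal; the schedule values here are illustrative, not taken from any particular paper):

```python
import numpy as np

# Toy forward "diffusion": repeatedly mix the signal with fresh Gaussian
# noise so that the signal-to-noise ratio gradually decreases.
rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 64))  # a toy 1D "natural signal"

alphas = np.linspace(0.99, 0.80, 10)  # hypothetical per-step retention factors

x = signal.copy()
trajectory = [x]
for alpha in alphas:
    # variance-preserving step: shrink the current sample, add scaled noise
    x = np.sqrt(alpha) * x + np.sqrt(1.0 - alpha) * rng.standard_normal(x.shape)
    trajectory.append(x)

# correlation with the clean signal decays as the noise accumulates
for t in (0, 5, 10):
    print(t, np.corrcoef(signal, trajectory[t])[0, 1])
```

A diffusion model would then be trained to undo each of these small corruption steps, one at a time.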

This is not dissimilar to how modern language models work: they break up the task of generating text into a series of easier next-word-prediction tasks. In both cases, the model only solves a small part of the problem at a time, and you apply it repeatedly to generate a signal.
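The shared iterative structure of the two approaches can be sketched with stand-ins (`denoise_step` and `predict_next_token` below are hypothetical placeholders for trained networks, not real implementations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for trained networks, just to show the loop shape.
def denoise_step(x, t):
    return 0.9 * x  # pretend partial denoiser: remove a bit of noise

def predict_next_token(tokens):
    return (tokens[-1] + 1) % 100  # pretend next-token predictor

# Diffusion-style sampling: start from pure noise, partially denoise many times.
x = rng.standard_normal(8)
for t in reversed(range(10)):
    x = denoise_step(x, t)

# Autoregressive sampling: start from a prompt, append one token at a time.
tokens = [1, 2, 3]
for _ in range(10):
    tokens.append(predict_next_token(tokens))

print(x)
print(tokens)
```

In both loops, the model only does a small amount of work per call, and the full signal emerges from repeated application.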


In Georgian, mother is "deda" and father is "mama", which lends further evidence to this :)


A nice property of the model is that it is easy to compute exact log-likelihoods for both training data and unseen data, so one can actually measure the degree of overfitting (which is not true for many other types of generative models). Another is that it seems to be extremely resilient to overfitting, based on these measurements.
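As a toy illustration of the general idea (using a fitted Gaussian as a stand-in for any tractable-likelihood model, not the model discussed here): the gap between average train and held-out log-likelihood directly quantifies overfitting.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=500)
heldout = rng.normal(0.0, 1.0, size=500)

# "Model": a Gaussian fitted to the training set by maximum likelihood.
mu, sigma = train.mean(), train.std()

def mean_log_likelihood(x, mu, sigma):
    # exact Gaussian log-density, averaged over data points (in nats)
    return np.mean(-0.5 * np.log(2 * np.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2 * sigma ** 2))

gap = mean_log_likelihood(train, mu, sigma) - mean_log_likelihood(heldout, mu, sigma)
print(f"train/held-out log-likelihood gap: {gap:.4f} nats per point")
```

For models without a tractable density (e.g. GANs), this comparison simply isn't available, which is what makes exact-likelihood models convenient to diagnose.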


Why wouldn't this work in Theano?

    >>> import theano
    >>> import theano.tensor as T
    >>> state = theano.shared(1.0)
    >>> states = []
    >>> for step in range(10):
    ...     state = state + state
    ...     states.append(state)
    ...
    >>> f = theano.function([], states)
    >>> f()
    [array(2.0),
     array(4.0),
     array(8.0),
     array(16.0),
     array(32.0),
     array(64.0),
     array(128.0),
     array(256.0),
     array(512.0),
     array(1024.0)]


Thanks! When I tried this before, I thought compilation was stuck in an infinite loop and gave up after about a minute. But you're right, it works. Though on my machine, this took two and a half minutes to compile (ten times as long as compiling a small convnet). For 10 recurrence steps, that's weird, right? And the TensorFlow thing above runs instantly.


Agreed. Theano has trouble dealing efficiently with very deeply nested graphs.

