I hadn't heard of gradient checkpointing before, thank you for the link! Do you know how it compares to gradient accumulation? The latter basically reduces the batch size but sums the gradients of several small batches before actually performing an update, thereby having the same effect as the original, larger batch size.
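
For concreteness, here's roughly what I mean, as a PyTorch-style sketch (model, optimizer, loss_fn, and loader are all placeholders, not anything from nshepperd's TensorFlow code):

    # Sum gradients over accum_steps small batches, then apply one update,
    # approximating a single batch that is accum_steps times larger.
    def train_accumulated(model, optimizer, loss_fn, loader, accum_steps=4):
        optimizer.zero_grad()
        for step, (x, y) in enumerate(loader, start=1):
            loss = loss_fn(model(x), y) / accum_steps  # scale so the sum averages out
            loss.backward()                            # gradients add up in the .grad buffers
            if step % accum_steps == 0:
                optimizer.step()                       # one update per accum_steps batches
                optimizer.zero_grad()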

The generated titles are great! You can put them into hncynic (https://github.com/leod/hncynic) to get closer to a fully generated HN experience.



Gradient accumulation and gradient checkpointing are orthogonal. You might want to use them simultaneously.

If I had to compare them, I'd say that accumulation is about working on a minibatch datapoint by datapoint and faking being able to run an entire large minibatch in a single shot, while checkpointing is about working on a model layer by layer and faking being able to run an entire model in a single shot.

The problem with GPT-2-345M, and why nshepperd had to mess with gradient checkpointing, is that the GPT-2-345M model will literally not fit on your standard 11GB GPU (and from Twitter comments by people trying it on the new 16GB Google Colab instances, it's unclear whether 16GB is enough either!). You can't even run a minibatch of n=1. It doesn't fit. It OOMs.

The model itself is only a gigabyte or so; the problem is that the self-attention layers, when run, use a huge amount of memory for their intermediate steps, which must all be stored in order to trace everything backwards through each step for the backprop part of training.

(Right now I believe nshepperd's code punts on doing gradient accumulation simultaneously with gradient checkpointing, so we've just been reducing the learning rate, which is sort of similar to faking large minibatches with gradient accumulation.)

Fortunately, because the self-attention layers are so small and cheap to compute, they work well with gradient checkpointing: they're cheap to recompute on the fly, so the extra compute is worth it to save memory and make training possible at all. (This is also how OpenAI is training the Sparse Transformers, which are enormous; they haven't said either way, but I assume this is also how they trained the larger GPT-2s like the 1.5b-parameter version, because I can't imagine what hardware would fit even a single GPT-2 1.5b without tricks.)
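
To make the checkpointing idea concrete, here's a minimal sketch (PyTorch's torch.utils.checkpoint for brevity; nshepperd's actual code is TensorFlow, and none of these names come from it):

    import torch
    from torch.utils.checkpoint import checkpoint

    class CheckpointedBlocks(torch.nn.Module):
        # Wrap a stack of transformer blocks so each block's intermediate
        # activations are dropped during the forward pass and recomputed during
        # backprop: a little extra compute for a large memory saving.
        def __init__(self, blocks):
            super().__init__()
            self.blocks = torch.nn.ModuleList(blocks)

        def forward(self, x):
            for block in self.blocks:
                # only the block's input/output are kept; its internals are
                # recomputed on the backward pass
                x = checkpoint(block, x)
            return x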


Thank you so much for your comprehensive answer; this helps a lot.

If I understand nshepperd's code correctly, it uses a small, constant learning rate. Do you know if this works better than the learning rate schedule that is usually used for Transformer models (https://www.tensorflow.org/alpha/tutorials/text/transformer_...)?
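
For reference, the schedule I mean is linear warmup followed by inverse-square-root decay; a plain-Python sketch (d_model=1024 would match the GPT-2 medium model discussed above, and warmup_steps=4000 is just the paper's default, not anything from nshepperd's code):

    # Warmup-then-decay schedule from 'Attention Is All You Need':
    # linear warmup for warmup_steps, then decay proportional to 1/sqrt(step).
    def transformer_lr(step, d_model=1024, warmup_steps=4000):
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)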


It's a constant, yes. We haven't tried any other learning rate schedules (for my poetry GPT-2s, I simply drop the LR 10x each day or so). I have no idea if this is optimal for transfer learning or not.


Wow! I just had a blast putting titles into that. The results are amazing. Kudos!



