I find the distinction this paper draws between encoder-decoder, encoder-only, and decoder-only Transformers very useful for my informal understanding of the different architectures. Thank you for this clear explanation.
I'm curious about this being the top comment. Is this something people don't know? Everything in this paper was introduced in Attention Is All You Need[0]. That paper introduced Scaled Dot-Product Attention, which is what everyone now just calls Attention, and it lays out the encoder-decoder framework. The encoder is just self-attention, `softmax(<q(x),k(x)>)v(x)`, while the decoder also includes cross-attention over the encoder output, `softmax(<q(x),k(y)>)v(y)`.
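To make that concrete, here is a minimal NumPy sketch of the self- vs. cross-attention distinction. The projection matrices `Wq`, `Wk`, `Wv` and all the shapes are made up for illustration; this is just the scaled dot-product formula above, not anyone's reference implementation.

```python
# Minimal sketch of scaled dot-product attention (shapes/weights are illustrative).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x_q, x_kv, Wq, Wk, Wv):
    """Scaled dot-product attention.

    Self-attention:  x_q and x_kv are the same sequence x.
    Cross-attention: x_q comes from the decoder side, x_kv from the encoder side.
    """
    q = x_q @ Wq                               # queries,      shape (n_q, d_k)
    k = x_kv @ Wk                              # keys,         shape (n_kv, d_k)
    v = x_kv @ Wv                              # values,       shape (n_kv, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # <q, k> scaled by sqrt(d_k)
    return softmax(scores, axis=-1) @ v        # attention-weighted sum of values

rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 4
Wq, Wk, Wv = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
x = rng.normal(size=(5, d_model))   # decoder-side tokens
y = rng.normal(size=(7, d_model))   # encoder-side tokens

self_attn  = attention(x, x, Wq, Wk, Wv)   # softmax(<q(x), k(x)>) v(x)
cross_attn = attention(x, y, Wq, Wk, Wv)   # softmax(<q(x), k(y)>) v(y)
```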
I have a lot of complaints about this paper: it only covers topics already addressed in the original attention paper (Vaswani et al.), and I can't see how it accomplishes anything beyond pulling citations away from grad students whose survey papers on Attention are more precise and cover more of the field. From a quick search, here's a survey from last year with more in-depth discussion and more mathematical precision[1].
Maybe it's because this research area is adjacent to my own, but I'm a little confused by the attention (pun intended) it's getting. It seems like DeepMind's name is the main draw.
I can't tell who this paper is aimed at. It isn't formal, it isn't mathematical, it doesn't give a good description, and it doesn't have good coverage. I can only assume it's for citations.
This field moves very quickly. I don't think anyone can be expected to keep up unless they make it a weekly study subject or are actively employed in it.