I find the distinction this paper draws between encoder-decoder, encoder-only, and decoder-only Transformers very useful for my informal understanding of the different architectures. Thank you for this clear explanation.
I'm curious about this being the top comment. Is this something people don't know? Everything in this paper was introduced in Attention Is All You Need[0]. That paper introduced Scaled Dot-Product Attention, which is what everyone now just calls Attention, and it lays out the encoder-decoder framework. The encoder is just self-attention, `softmax(<q(x),k(x)>)v(x)`, while the decoder also includes cross-attention over the encoder output, `softmax(<q(x),k(y)>)v(y)`.
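To make that concrete, here is a minimal NumPy sketch of the self- vs. cross-attention distinction. The projection matrices `Wq`, `Wk`, `Wv` and all the shapes are made up for illustration; this is just the scaled dot-product formula above, not anyone's reference implementation.

```python
# Minimal sketch of scaled dot-product attention (shapes/weights are illustrative).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x_q, x_kv, Wq, Wk, Wv):
    """Scaled dot-product attention.

    Self-attention:  x_q and x_kv are the same sequence x.
    Cross-attention: x_q comes from the decoder side, x_kv from the encoder side.
    """
    q = x_q @ Wq                               # queries,      shape (n_q, d_k)
    k = x_kv @ Wk                              # keys,         shape (n_kv, d_k)
    v = x_kv @ Wv                              # values,       shape (n_kv, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # <q, k> scaled by sqrt(d_k)
    return softmax(scores, axis=-1) @ v        # attention-weighted sum of values

rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 4
Wq, Wk, Wv = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
x = rng.normal(size=(5, d_model))   # decoder-side tokens
y = rng.normal(size=(7, d_model))   # encoder-side tokens

self_attn  = attention(x, x, Wq, Wk, Wv)   # softmax(<q(x), k(x)>) v(x)
cross_attn = attention(x, y, Wq, Wk, Wv)   # softmax(<q(x), k(y)>) v(y)
```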
I have a lot of complaints about this paper: it only covers topics already addressed in the original attention paper (Vaswani et al.), and I can't see how it accomplishes anything beyond pulling citations away from grad students whose survey papers on Attention are more precise and cover more of the field. From a quick search, here's a survey from last year with more in-depth discussion and more mathematical precision[1].
Maybe it's because this research area is adjacent to my own, but I'm a little confused by the attention (pun intended) it's getting. It seems like DeepMind's name is the main draw.
I can't tell who this paper is aimed at. It isn't formal, it isn't mathematical, it doesn't give a good description, and it doesn't have good coverage. I can only assume it's for citations.
This field moves very quickly. I don't think anyone can be expected to keep up unless they make it a weekly study subject or are actively employed in it.