Formal Algorithms for Transformers (arxiv.org)
106 points by hexhowells on July 21, 2022 | hide | past | favorite | 13 comments


I find the distinction this paper draws between encoder-decoder Transformers, encoder-only Transformers, and decoder-only Transformers very useful for my informal understanding of the different architectures. Thank you for this clarification.


I'm curious about this comment being the top comment. Is this something people don't know? Everything in this paper was introduced in Attention Is All You Need[0]. That paper introduced scaled dot-product attention, which is what everyone now just calls attention, and it lays out the encoder/decoder framework. The encoder is just self-attention, `softmax(<q(x),k(x)>)v(x)`, and the decoder adds cross-attention over the encoder output, `softmax(<q(x),k(y)>)v(y)`.
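
Roughly, as a toy single-head numpy sketch (not the paper's notation; the function and weight names are made up for illustration):

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(x_q, x_kv, Wq, Wk, Wv):
        # queries come from x_q; keys and values come from x_kv
        q, k, v = x_q @ Wq, x_kv @ Wk, x_kv @ Wv
        return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

    # encoder self-attention: everything comes from x
    #   z = attention(x, x, Wq, Wk, Wv)
    # decoder cross-attention: queries from x, keys/values from encoder output y
    #   z = attention(x, y, Wq, Wk, Wv)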

I have a lot of complaints about this paper: it only covers topics already addressed in the original attention paper (Vaswani et al.), and I can't see how it accomplishes anything beyond pulling citations away from the grad students who wrote survey papers on attention, which are more precise and cover more of the field. From a quick search, here's a survey paper from last year with more in-depth discussion and more mathematical precision[1].

Maybe it's because this is adjacent to my active area of research, but I'm a little confused by the attention (pun intended) it's getting. It seems like DeepMind's name is the main draw.

[0] https://arxiv.org/abs/1706.03762

[1] https://arxiv.org/abs/2106.04554


I like how this seems to actually be self-contained. They even have a list of notations at the end.


This is a fantastic resource. It's the missing piece of many machine learning articles.


Zero diagrams, but maybe they wouldn't help clarify the concepts? I guess it depends on the kind of learner; I'm not sure.


The paper explicitly rejects diagrams as unhelpful:

> Some 100+ page papers contain only a few lines of prose informally describing the model [RBC+21]. At best there are some high-level diagrams


> only a few lines of prose informally describing the model

This is ironic, considering they use more words to describe chunking (splitting along a dimension, i.e. `x, y = a[0,:,:,:], a[1,:,:,:]`) than multi-head attention.
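
(For reference, a rough numpy sketch of what that chunking amounts to; the shapes and names here are just illustrative:)

    import numpy as np

    a = np.random.randn(2, 4, 3, 8)      # some 4-D tensor
    x, y = a[0, :, :, :], a[1, :, :, :]  # the "chunk": split along the first dimension

    # multi-head splitting is the same idea along the feature dimension:
    z = np.random.randn(4, 8)                      # (seq_len, d_model)
    heads = z.reshape(4, 2, 4).transpose(1, 0, 2)  # (n_heads, seq_len, d_head)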


I can't tell who this paper is aimed at. It isn't formal. It isn't mathematical. It isn't a good description and doesn't have good coverage. I can only assume it is for citations.


I was assuming electrical transformers.


Or the cartoon/toy franchise


"familiar with basic ML terminology" might be an understatement


If you're starting from scratch scratch, these might be of more use to you. The second focuses on vision transformers, but all the concepts still apply.

https://jalammar.github.io/illustrated-transformer/

https://medium.com/pytorch/training-compact-transformers-fro...


This field moves very quickly. I don't think anyone who doesn't make it a weekly study subject, or isn't actively employed in it, can be expected to keep up.



