
Is this applying lessons learned from vision transformers to language transformers?

If I understand correctly, vision transformers split an image into patches (tiles) and add a positional encoding to each patch embedding so the model can understand the relative position of each patch within the image; there's a rough sketch of that step below.

I admittedly only read the abstract - a lot of this stuff goes over my head - but it seems like this paper proposes a similar idea for 1D token sequences instead of 2D image patches?
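
To make the ViT part above concrete, here's a rough sketch of patch splitting plus positional embeddings (PyTorch; the patch size, embedding dim, and learned per-patch embeddings are illustrative assumptions, not whatever this paper actually does):

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        # Split an image into non-overlapping patches, project each patch to an
        # embedding vector, and add a learned positional embedding per patch.
        def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
            super().__init__()
            self.num_patches = (img_size // patch_size) ** 2
            # a stride=patch_size conv is the usual way to do "split + project" in one go
            self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
            self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

        def forward(self, x):                  # x: (B, 3, H, W)
            x = self.proj(x)                   # (B, dim, H/p, W/p)
            x = x.flatten(2).transpose(1, 2)   # (B, num_patches, dim)
            return x + self.pos_embed          # position info is added to each patch token

    tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
    print(tokens.shape)  # torch.Size([1, 196, 768])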



Positional encoding is standard for transformers of all stripes. They introduce a seemingly novel, redundant positional encoding scheme. It's more difficult to train, but it seems to enable producing multiple tokens at once (i.e. you could get an answer that is N tokens long in N/x steps instead of N steps).
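
For context, the "standard" encoding most transformers use is something like the fixed sinusoidal scheme from the original Transformer paper. Rough illustrative sketch (PyTorch; shapes are arbitrary, and this is the vanilla scheme, not the redundant one the paper proposes):

    import torch

    def sinusoidal_positions(seq_len, dim):
        # Classic fixed encoding from "Attention Is All You Need":
        #   PE[pos, 2i]   = sin(pos / 10000^(2i/dim))
        #   PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))
        # Assumes dim is even.
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
        i = torch.arange(0, dim, 2, dtype=torch.float32)                # (dim/2,)
        angles = pos / (10000 ** (i / dim))                             # (seq_len, dim/2)
        pe = torch.zeros(seq_len, dim)
        pe[:, 0::2] = torch.sin(angles)
        pe[:, 1::2] = torch.cos(angles)
        return pe  # this gets added to the token embeddings

    print(sinusoidal_positions(1024, 512).shape)  # torch.Size([1024, 512])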



