Is this applying the learnings from vision transformers to language transformers?
If I understand correctly, vision models split an image into tiles and append a positional encoding to each so the model can understand the relative position of each tile.
I admittedly only read the abstract - a lot of this stuff goes over my head - but it seems like this paper proposes a similar idea, just for 1D instead of 2D?
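For illustration, here's a rough sketch of that patch-plus-position step (my own toy code, not from the paper under discussion; the sizes are just the usual ViT defaults):

```python
# Minimal sketch of the ViT idea described above: split an image into
# fixed-size tiles (patches), embed each one, and add a positional
# embedding so the model knows where each tile sits.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with stride == kernel size slices the image into tiles
        # and projects each tile to a `dim`-dimensional embedding.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # One learned positional embedding per tile position.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                   # x: (B, 3, H, W)
        x = self.proj(x)                    # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        return x + self.pos_embed           # add position info to each tile

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # shape: (1, 196, 768)
```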
Positional encoding is standard for transformers of all stripes. They introduce a seemingly novel, redundant positional encoding scheme. It's more difficult to train, but it seems to enable producing multiple tokens at once (i.e. you could get an answer that is N tokens long in N/x steps instead of N steps).
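For context, here's a minimal sketch of the standard 1D sinusoidal positional encoding from the original transformer paper (my own illustration, not the scheme this paper proposes):

```python
# Standard fixed sinusoidal positional encoding for a 1D token sequence.
# The table is added to the token embeddings before the first layer so
# the model can tell positions apart.
import math
import torch

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Return a (seq_len, dim) table of fixed positional encodings."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (L, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                        # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)   # even channels get sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd channels get cosine
    return pe

# Usage: x = token_embeddings + sinusoidal_positions(x.size(1), x.size(2))
```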