Is this applying the learnings from vision transformers to language transformers?
If I understand correctly, vision models split an image into tiles and append a positional encoding to each so the model can understand the relative position of each tile.
I admittedly only read the abstract - a lot of this stuff goes over my head - but it seems like this paper proposes a similar idea, just for 1D instead of 2D?
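For illustration, here's a rough sketch of that patch-plus-position step (my own toy code, not from the paper under discussion; the sizes are just the usual ViT defaults):

```python
# Minimal sketch of the ViT idea described above: split an image into
# fixed-size tiles (patches), embed each one, and add a positional
# embedding so the model knows where each tile sits.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with stride == kernel size slices the image into tiles
        # and projects each tile to a `dim`-dimensional embedding.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # One learned positional embedding per tile position.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                   # x: (B, 3, H, W)
        x = self.proj(x)                    # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        return x + self.pos_embed           # add position info to each tile

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # shape: (1, 196, 768)
```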
Positional encoding is standard for transformers of all stripes. They introduce a seemingly novel, redundant positional encoding scheme. It's more difficult to train, but it seems to enable producing multiple tokens at once (i.e. you could get an answer that is N tokens long in N/x steps instead of N steps).
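For context, here's a minimal sketch of the standard 1D sinusoidal positional encoding from the original transformer paper (my own illustration, not the scheme this paper proposes):

```python
# Standard fixed sinusoidal positional encoding for a 1D token sequence.
# The table is added to the token embeddings before the first layer so
# the model can tell positions apart.
import math
import torch

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Return a (seq_len, dim) table of fixed positional encodings."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (L, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                        # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)   # even channels get sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd channels get cosine
    return pe

# Usage: x = token_embeddings + sinusoidal_positions(x.size(1), x.size(2))
```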