I think they mean Transformers as introduced in the Vaswani et al. 'Attention Is All You Need' paper, not Generative Pretrained Transformers specifically. Paper link below:
For papers on attention mechanisms from before the 2017 'Attention Is All You Need' paper, check out that paper's references. The 2015 Luong, Pham, and Manning paper covers attention mechanisms, and so do a few other researchers from that mid-2010s period:
[21] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[22] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016.
https://proceedings.neurips.cc/paper/2017/file/3f5ee243547de...