There was a time when people estimated n-gram probabilities with feed-forward neural networks [1,2]. We just improved on that with the (multilayer) attention mechanism, which allows for better factoring over individual tokens. It also allows for a much larger n.
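For a concrete picture of the older approach, here's a rough sketch of the forward pass of a feed-forward n-gram model in the spirit of [1]: embed the previous n-1 tokens, concatenate, pass through one tanh layer, and softmax over the vocabulary. The names (make_model, next_token_probs), the sizes, and the random untrained weights are my own illustration, and I've left out the biases and direct input-to-output connections, so treat it as a sketch rather than the paper's exact model.

```python
import math
import random

def make_model(vocab_size, context_size, embed_dim, hidden_dim, seed=0):
    """Build randomly initialised (untrained) weights; all sizes are illustrative."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "C": mat(vocab_size, embed_dim),                 # token embedding table
        "H": mat(hidden_dim, context_size * embed_dim),  # input-to-hidden weights
        "U": mat(vocab_size, hidden_dim),                # hidden-to-output weights
    }

def next_token_probs(model, context):
    """Return P(next token | previous n-1 tokens) as a list over the vocabulary."""
    # Concatenate the embeddings of the fixed-size context window.
    x = [v for tok in context for v in model["C"][tok]]
    # Single tanh hidden layer.
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in model["H"]]
    # One logit per vocabulary item, then a numerically stable softmax.
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in model["U"]]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# A 4-gram model: 3 context tokens predict the next one.
model = make_model(vocab_size=10, context_size=3, embed_dim=4, hidden_dim=8)
probs = next_token_probs(model, [1, 5, 2])
```

Note that the context size is baked into the shape of H, so growing n means growing the input layer. That fixed window is the limitation that attention relaxes.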
[1] https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
[2] https://www.sciencedirect.com/science/article/abs/pii/S08852...