There was a time when people estimated n-gram probabilities with feed-forward neural networks [1,2]. We just improved on that with the (multilayer) attention mechanism, which allows for better factoring over individual tokens. It also allows for a much larger n.
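For anyone who hasn't seen the older approach, here is a rough sketch of the forward pass of a feed-forward n-gram model in the spirit of [1]. It is not the paper's exact architecture: biases and the optional direct input-to-output connections are left out, the weights are random rather than trained, and all names and sizes (`make_params`, `next_token_probs`, a vocabulary of 10 tokens) are made up for illustration.

```python
import math
import random


def make_params(vocab_size, context_len, embed_dim, hidden_dim, seed=0):
    """Random, untrained weights; all sizes here are illustrative assumptions."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "C": mat(vocab_size, embed_dim),                # token embedding table
        "H": mat(hidden_dim, context_len * embed_dim),  # hidden layer, sized for a fixed context
        "U": mat(vocab_size, hidden_dim),               # output layer over the vocabulary
    }


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def next_token_probs(params, context):
    """Estimate P(next token | previous n-1 tokens) for a fixed-size context."""
    # Concatenate the embeddings of the n-1 context tokens into one input vector.
    x = [value for tok in context for value in params["C"][tok]]
    h = [math.tanh(a) for a in matvec(params["H"], x)]
    logits = matvec(params["U"], h)
    # Numerically stable softmax over the vocabulary.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# A 4-gram model: three context tokens predict the fourth.
params = make_params(vocab_size=10, context_len=3, embed_dim=4, hidden_dim=8)
probs = next_token_probs(params, context=[1, 5, 2])
```

Note that `H` is sized for exactly `context_len` tokens, so in this sketch changing n means changing the shape of the hidden layer.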

[1] https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

[2] https://www.sciencedirect.com/science/article/abs/pii/S08852...