> Recurrent neural networks (RNNs) do this implicitly, but we don't yet know how to effectively train RNNs on ultra-deep sequences.
What would you call "ultra-deep"? [1] shows how to train an RNN GPT-style, using a parallel scan, with strong performance on Path-X, which has a sequence length of ~16k. It builds on prior papers that do the same thing from a state-space-model perspective.
[1] https://arxiv.org/abs/2303.06349
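For the curious: the parallel-scan trick works because a linear recurrence h_t = a_t * h_{t-1} + b_t composes associatively, so all T steps can be computed in O(log T) parallel passes instead of a sequential loop. Here's a minimal NumPy sketch of the idea (a Hillis-Steele-style scan checked against the sequential loop; this is my illustration, not the paper's actual JAX implementation, and the scalar recurrence stands in for the paper's diagonal complex one):

```python
import numpy as np

def combine(e1, e2):
    # Associative operator: composing h -> a1*h + b1, then h -> a2*h + b2,
    # yields h -> (a2*a1)*h + (a2*b1 + b2).
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

def parallel_scan(a, b):
    """Inclusive scan over elements (a_t, b_t). Each of the log2(T)
    sweeps could run all of its pairwise combines in parallel."""
    A, B = a.copy(), b.copy()
    T, offset = len(A), 1
    while offset < T:
        newA, newB = A.copy(), B.copy()
        for t in range(offset, T):
            newA[t], newB[t] = combine((A[t - offset], B[t - offset]),
                                       (A[t], B[t]))
        A, B = newA, newB
        offset *= 2
    return B  # B[t] == h_t, with h before t=0 taken as 0

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.99, size=16)   # decay coefficients
b = rng.normal(size=16)               # per-step inputs

# Sequential reference: h_t = a_t * h_{t-1} + b_t
h, prev = np.zeros(16), 0.0
for t in range(16):
    prev = a[t] * prev + b[t]
    h[t] = prev

assert np.allclose(parallel_scan(a, b), h)
```

The same associative operator is what frameworks like JAX exploit (e.g. via an associative scan primitive) to parallelize linear RNNs over the sequence dimension on accelerators.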