Thanks for this write up. Your comment clears up a lot of the confusion I've had around these time series transformers.
How do lagged features for an MLP compare to longer sequence lengths for attention in Transformers? Are you able to lag 128 time steps in a feed-forward network and get good results?
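For concreteness, here's a rough sketch of what I mean by "lagging 128 time steps" into an MLP: each training example is just the previous 128 values flattened into one input vector, so the network sees the whole window at once but only through fixed feature positions rather than attention. The window size, layer widths, and toy series here are placeholders, not anything from the article.

```python
import numpy as np
import torch
import torch.nn as nn

LAGS = 128  # assumed window length, matching the 128 steps in the question

def make_lagged_dataset(series: np.ndarray, lags: int = LAGS):
    """Turn a 1-D series into (X, y) where each row of X is `lags` past values."""
    X = np.stack([series[i : i + lags] for i in range(len(series) - lags)])
    y = series[lags:]
    return torch.tensor(X, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

# Hypothetical MLP mapping the 128 lagged values straight to a one-step forecast.
mlp = nn.Sequential(
    nn.Linear(LAGS, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

if __name__ == "__main__":
    series = np.sin(np.linspace(0, 60, 2000)) + 0.1 * np.random.randn(2000)
    X, y = make_lagged_dataset(series)
    pred = mlp(X[:8]).squeeze(-1)  # untrained forward pass, just to check shapes
    print(X.shape, y.shape, pred.shape)  # (1872, 128) (1872,) (8,)
```

The question is whether that flat 128-wide input gets you anywhere near what a Transformer does by attending over a 128-step sequence.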