As someone who's worked in time series forecasting for a while, I haven't yet found a use case for these "time series" focused deep learning models.
On extremely high dimensional data (I worked at a credit card processor company doing fraud modeling), deep learning dominates, but there's simply no advantage in using a designated "time series" model that treats time differently from any other feature. We've tried most time series deep learning models that claim to be SoTA - N-BEATS, N-HiTS, every RNN variant that was popular pre-transformers - and they don't beat an MLP that just uses lagged values as features (a rough sketch of that baseline is at the end of this comment). I've talked to several others in the forecasting space and they've found the same result.
On mid-dimensional data, LightGBM/XGBoost is by far the best and generally performs as well as or better than any deep learning model, while requiring much less fine-tuning and a tiny fraction of the computation time.
And on low-dimensional data, (V)ARIMA/ETS/Factor models are still king, since without adequate data, the model needs to be structured with human intuition.
As a result I'm extremely skeptical of any of these claims about a generally high performing "time series" model. Training on time series gives a model very limited understanding of the fundamental structure of how the world works, unlike a language model, so the amount of generalization ability a model will gain is very limited.
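For the curious, here is a minimal sketch of the lagged-value MLP baseline I mean (synthetic data and scikit-learn, not our actual fraud pipeline):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # toy univariate series; in practice this is whatever history you have
    rng = np.random.default_rng(0)
    y = np.sin(np.arange(1000) / 10) + rng.normal(0, 0.1, 1000)

    n_lags = 28
    # each row is [y_{t-28}, ..., y_{t-1}] and the target is y_t;
    # time is just another set of columns, nothing special
    X = np.stack([y[t - n_lags:t] for t in range(n_lags, len(y))])
    target = y[n_lags:]

    model = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)
    model.fit(X[:-100], target[:-100])   # hold out the last 100 points
    preds = model.predict(X[-100:])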
Great write-up, thank you. Do you have rough measures for what constitutes high-/mid-/low-dimensional data? And how do you use XGBoost et al. for multi-step forecasting, i.e. in scenarios where you want to predict multiple time steps into the future?
The added benefit is that you optimize each regressor towards its own target timestep t+1 ... t+n. A single loss on the aggregate of all timesteps is often problematic
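Something like this, as a minimal sketch of the direct strategy (lightgbm and the synthetic series are just for illustration):

    import numpy as np
    from lightgbm import LGBMRegressor

    rng = np.random.default_rng(0)
    y = np.sin(np.arange(2000) / 20) + rng.normal(0, 0.1, 2000)

    n_lags, horizon = 28, 7
    origins = range(n_lags, len(y) - horizon)
    X = np.stack([y[t - n_lags:t] for t in origins])

    models = []
    for h in range(1, horizon + 1):
        target = np.array([y[t + h - 1] for t in origins])
        m = LGBMRegressor(n_estimators=200)
        m.fit(X, target)                    # each model optimizes its own t+h loss
        models.append(m)

    last_window = y[-n_lags:].reshape(1, -1)
    forecast = [m.predict(last_window)[0] for m in models]   # t+1 ... t+7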
I've found that it works well to add the prediction horizon as a numerical feature (e.g. # of days), and then replicate each row for many such horizons, while ensuring that all such rows go to the same training fold.
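Roughly like this, as a sketch (again lightgbm and made-up data, just to show the shape of it):

    import numpy as np
    from lightgbm import LGBMRegressor
    from sklearn.model_selection import GroupKFold

    rng = np.random.default_rng(0)
    y = np.sin(np.arange(2000) / 20) + rng.normal(0, 0.1, 2000)
    n_lags, max_h = 28, 7

    rows, targets, groups = [], [], []
    for t in range(n_lags, len(y) - max_h):
        window = y[t - n_lags:t]
        for h in range(1, max_h + 1):          # replicate the row per horizon
            rows.append(np.append(window, h))  # horizon as a numeric feature
            targets.append(y[t + h - 1])
            groups.append(t)                   # same origin -> same CV fold

    X, target, groups = np.array(rows), np.array(targets), np.array(groups)
    model = LGBMRegressor(n_estimators=300)
    train_idx, val_idx = next(GroupKFold(n_splits=5).split(X, target, groups))
    model.fit(X[train_idx], target[train_idx])
    val_preds = model.predict(X[val_idx])

Grouping folds by origin timestamp is just one way to honour the "all replicas in the same fold" constraint.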
Thanks for this write up. Your comment clears up a lot of the confusion I've had around these time series transformers.
How does lagged features for an MLP compare to longer sequence lengths for attention in Transformers? Are you able to lag 128 time steps in a feed forward network and get good results?
I agree that conventional (numeric) forecasting can hardly benefit from the newest approaches like transformers and LLMs. I came to that conclusion while working on the intelligent trading bot [0] and experimenting with many ML algorithms. Yet there exist some cases where transformers might provide significant advantages. They could be useful where (numeric) forecasting is augmented with discrete event analysis and where sequences of events are important. Another use case is where certain patterns are important, like those detected in technical analysis. For these cases, though, much more data is needed.
Foundation models could work where "needs human intuition" has so far been the state of things. I can picture a time series model with a large enough training corpus being able to deal quite well with the typical quirks of seasonalities, shocks, outliers, etc.
I fully agree regarding how things have been so far, but I’m excited to see practitioners try out models such as the one presented here — it might just work.
Reminds me a bit how in psychology you have ANOVA, MANOVA, ANCOVA, MANCOVA etc etc but really in the end we are just running regressions—variables are just variables.
My read on this was that you can just dump the lagged values as inputs and let the network figure it out just as well as the other, time series specific models do, not that time doesn't matter.
I assume the time series modelling is used to predict normal non-fraud behaviour. And then simpler algorithms are able to highlight deviations from the norm?
As much as Transformers feel like the state of the art universal function approximators, people need to realize why they work so well for language and vision.
Transformers parallelize incredibly well, and they learn sophisticated intermediate representations. We start seeing neat separation of different semantic concepts in space. We start seeing models do delimiter detection naturally. We start seeing models reason about lines, curves, colors, dog ears etc. The final layers of a Transformer are then putting these sophisticated concepts together to learn high level concepts like dog/cat/blog etc.
Transformers (and deep learning methods in general) do not work for time series data because they have yet to extract any novel intermediate representations from said data.
At face value, how do you even work with a 'token window'? At the simplest level, time series modelling is about identifying repeating patterns over very different lifecycles, conditioned on certain observations about the world. You need a model that can natively reason over years, days and seconds all at the same time to even be able to reason about the problem in the first place. Hilariously, last week's streaming LLM paper from MIT might actually help here.
Secondly, the improvements appear marginal at best. If you're proposing a massive architecture change and removing observability and explainability... then you'd better have some incredible results.
Truth is, if someone identifies a groundbreaking technique for timeseries forecasting, then they'd be an idiot to tell anyone about it before making their first $Billion$ on the market. Hell, I'd say they'd be an idiot for stopping at a billion. Time series forecasting is the most monetarily rewarding problem you could solve. If you publish a paper, then by implication, I expect it to be disappointing.
> Truth is, if someone identifies a groundbreaking technique for timeseries forecasting
It’s really quite simple. Just iterate through all possible programs for a monotone universal Turing machine, where the input tape consists of all the data we can possibly collect concatenated with the time series of interest. Skip the programs that take too long to halt, keep the remaining ones that reproduce the input sequence, then form a probability distribution over the next output bits, weighted by 2^-(program size).
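Spelled out, that's roughly the textbook Solomonoff predictor:

    M(x_{1:n}) = \sum_{p \,:\, U(p) \text{ starts with } x_{1:n}} 2^{-|p|}

    P(x_{n+1} = b \mid x_{1:n}) = \frac{M(x_{1:n} b)}{M(x_{1:n})}

where U is a fixed monotone universal machine and |p| is the program length in bits. The "skip programs that take too long" step is the usual dodge around the fact that none of this is computable.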
Not higher than having to perpetually secure a network by computation, which, taken to the extreme, is essentially a sure-footed path to causing a black hole, by the argument that all available space ends up being used for computational security and incentives.
That is how it works, precisely. You secure the network with compute. A computer requires physical space to run a computation. Thus it maximizes towards using all physical space for incentives driven by network security.
Are you telling me that there is not already physical evidence of this? I assure you there is plenty of evidence of physical space being assimilated by the incentive structures related to Bitcoin and its progeny.
Taken across time, for a civilization that grows into further complexity, there is a limit on how much space can be used to secure the network, and the system most likely even incentivizes maximizing the capture of space for computational security; it therefore accelerates our civilization towards creating a black hole. I couldn't come up with a better way to fast-track our way towards a cosmic environmental disaster. It's a pretty bad incentive structure long-term.
Are you technically competent? Have you read the whitepaper? It’s the fundamental theory of the paper. I am not sure how I can describe it better than the whitepaper itself. The network necessarily depends on its substrate, a substratum that provides compute, which necessarily implies concrete material in physicality. Thus, physical material acts as the mechanism for computation, which Bitcoin depends on for network security, incentivized by the value of the distributed nature of the network, thus requiring an ever greater need for compute.
Even in a world where computation doesn’t become more efficient, it still eventually takes up all the space available, due to the incentives of protecting against network failure.
Thanks for telling me that computation requires physical matter. That sure will help get to the bottom of this.
Now could you answer the question? What is it about the Bitcoin blockchain that requires EVER INCREASING compute?
Network security. What says it doesn’t? The whitepaper specifically points to computation for security. Computation is not evenly distributed and changes with time in allocation. As a generality.
> Computation is not evenly distributed and changes with time in allocation. As a generality.
I will NOT grant you this. Please, give me actual technical details on WHY it requires ever increasing compute. You've said network security; what about it requires ever increasing compute?
Your stubbornness or pedantry is not my concern. You have nothing to grant me. I require no grant of you. You offer me nothing, for you display nothing I lack. You already provide me with what I crave: my own self-amusement; so thank you.
You can read the paper and understand the principles it is based on, which are rooted in balancing computational asymmetry, amongst other concerns, across a network of computers. At the most simple level, if you are aware of hashcash and sybil resistance you should be able to figure it out.
If you're still confused, then ask yourself: why does the Bitcoin algorithm adjust to computational power?
You are unable to explain it, which is a clear sign of a lack of understanding.
But maybe you have links to others who are able to explain it, rather than the Bitcoin paper, which obviously does not lead one to think network security will subsume all available matter for compute.
> Truth is, if someone identifies a groundbreaking technique for timeseries forecasting, then they'd be an idiot to tell anyone about it before making their first $Billion$ on the market.
This is correct.
I work in HFT and the industry has been successfully applying deep learning to market data for a while now. Everything from pcaps/ticks to candles.
Why publish your method when it generates $1B+/year in profit for a team of 50 quants/SWEs/traders?
Are you at liberty to say how high the frequency gets in connection with these models?
I assume the latency is comparatively much higher, but I also wouldn't be surprised if microseconds generally aren't a problem, e.g. because the patterns detected are on a much larger scale.
Re candles - even longer term, hourly/daily? Are there actually strategies out there that deliver great Sharpe over many years with just time series forecasting? Most hedge funds don't beat the index, afaik.
Time series prediction is always about using the particular features of your distribution of time series. In standard time series prediction the features of the distribution are mostly things like "periodic patterns are continued" or "growth patterns are continued". A transformer that is trained on language data essentially learns time series prediction in which a large variety of complex features influence the continuation. Language data is so complex and diverse that continuing a text necessitates in-context learning: being able to find some common features in any kind of string of symbols, and using those to continue the text. Just think that language data could contain huge Excel tables of various data, like stock market prices or weather recordings. It is therefore plausible that in-context learning can be powerful enough to perform zero-shot time series continuation.

Moreover, I believe that due to in-context learning, language data plus the transformer architecture has the potential to obtain genuinely general-intelligence-like behaviour: general pattern recognition. Language data is complex enough that SGD must lead to general pattern recognition and continuation. We are only at the beginning, and right now we are focused on finetuning, which destroys in-context learning. But we will soon train giant transformers on every modality, every string of symbols we can find.
The reality is that the market has inefficiencies like human emotion and bot/algorithmic trading which absolutely can be exploited by AI. You just need to train an AI to recognize the inefficiencies, which is exactly what neural networks excel at.
> people need to realize why they work so well for language and vision.
I agree with your entire post; however, this sentence made me think: well, video is just layered vision. Why couldn't frames over time work similarly to vision? We know the current answer is that it doesn't, but is it a matter of NNs can't, or of us not yet having figured out the correct way to model it?
edit: I'm not sure what jdkwkbdbs (dead-banned) means by "LLMs don't. ML works pretty well." (well, I do); LLMs solve certain tasks -- and really interesting ones at that -- at certain average cross-entropy loss levels, and typically form the Pareto front of models of all sizes in terms of parameter count, at least once you start looking at the biggest that we have, i.e. above a certain parameter count; and they typically increase in accuracy with respect to parameter count in a statistically significant fashion.
In essence, they represent the state of the art with respect to those specific tasks, as measured at the current time. Though you may wish for there to be a better (cross-entropy loss, accuracy percentage) tuple at any given instant (e.g. (optimal expected CE loss, an actual proof of correctness)), for the current time it seems like they not only do the best as a class of models on these sets of tasks, but have also lately been improving the fastest in accuracy as a class of models. They're quite noteworthy in that regard, fundamentally, imo. Just my 2c.
“Some popular models like Prophet [Taylor and Letham, 2018] and ARIMA were excluded from the analysis due to their prohibitive computational requirements and extensive training times.”
Can anyone who works a lot in time series forecasting explain this in some more detail?
I’ve def used ARIMA, but only for simple things. Not sure why this would be more expensive to train and run than a Transformer model, and even if true, ARIMA is so ubiquitous that comparing resources and time would be enlightening. Otherwise it just sounds like a sales pitch, throwing around more obscure acronyms for a bit of "I'm the expert, abc xyz industry letters" marketing.
We love ARIMAs. That is why we put so much effort into creating fast and scalable ARIMAs and AutoARIMA in Python [1].
Regarding your valid concern: there are several reasons for the high computational costs. First, ARIMA and other "statistical" methods are local, so they must train a different model for each time series. (ML and DL models are global, so you have 'one' model for all the series.) Second, the ARIMA model usually performs poorly on a diverse set of time series like the one considered in our experiments. AutoARIMA is a better option, but its training time is considerably longer given the number and length of the series. AutoARIMA also tends to be very slow for long series.
In short: for the 500k series we used for benchmarking, ARIMA would have taken literally weeks and would have been very expensive. (A toy illustration of the per-series fitting is sketched at the end of this comment.)
That is why we included many well-performing local "statistical" models, such as the Theta and CES. We used the implementations on our open-source ecosystem for all the baselines, including StatsForecast, MLForecast, and Neuralforecast. We will release a reproducible set of experiments on smaller subsets soon!
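For a sense of scale, a toy example of why local models get expensive: every unique_id below gets its own AutoARIMA search (made-up data; API details may differ slightly between versions):

    import numpy as np
    import pandas as pd
    from statsforecast import StatsForecast
    from statsforecast.models import AutoARIMA

    # long-format frame: one row per (series, timestamp)
    ds = pd.date_range("2023-01-01", periods=200, freq="D")
    df = pd.concat([
        pd.DataFrame({"unique_id": uid, "ds": ds,
                      "y": np.sin(np.arange(200) / 7) + i})
        for i, uid in enumerate(["series_a", "series_b"])
    ])

    sf = StatsForecast(models=[AutoARIMA(season_length=7)], freq="D", n_jobs=-1)
    fcst = sf.forecast(df=df, h=14)   # one AutoARIMA search per series; now scale to 500k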
I immediately tried to find a comparison with ARIMA as well and was disappointed. It's difficult to take this paper seriously when they dismiss a forecasting technique from the 70's because of "extensive training times".
Even then, 500 years of daily data is less than 200k observations, most of which are meaningless for predicting the future. That's less than 16B seconds of data. Regression might not handle it directly, but linear algebra tricks are still available.
While I could find some excuses to exclude ARIMA, notably that in practice you need to supply some important priors about your time series (periodicity, refinements for turning points, etc.) for it to work decently (small sketch at the end of this comment), "prohibitive compute and extensive training time" are just not applicable.
That part is a bit wanky, but the rest of the paper, notably the zero-shot capability, is very interesting if confirmed. I look forward to it being more accessible than a "contact us" API, so I can compare it to ARIMA and others myself.
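On the priors point, this is the kind of thing I mean; with statsmodels you hand the periodicity to the model yourself (orders picked by hand, synthetic weekly-seasonal data):

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    rng = np.random.default_rng(0)
    y = np.sin(np.arange(400) * 2 * np.pi / 7) + rng.normal(0, 0.2, 400)

    # (p, d, q) plus a weekly (P, D, Q, s=7) seasonal component: the human prior
    res = SARIMAX(y, order=(1, 0, 1), seasonal_order=(1, 0, 1, 7)).fit(disp=False)
    forecast = res.forecast(steps=14)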
I have been doing time series forecasting professionally. ARIMA is computationally one of the cheapest forecasting models out there, both in training and in inference. It suffers from many deficiencies and shortcomings, but computational efficiency is not one of them.
> “Some popular models like Prophet [Taylor and Letham, 2018] and ARIMA were excluded from the analysis due to their prohibitive computational requirements and extensive training times.”
Yes, I've done some work in time series forecasting. The above sentence is the one that tipped me off to this paper being BS, so I stopped reading after that. :) I can't take any paper about time series forecasting seriously when the author isn't familiar with the field.
Eh, it's not as if you could just project the 300k time series down to something lower dimensional for forecasting. TimeGPT would have to do something similar to avoid the same problem.
Though I can't quite figure out how the prediction works exactly: they have a lot of test series, but do they input all of them simultaneously?
If true, then beating it and looking good will be easy.
Having trained ARIMA models in my day, I will say that long training times and training cost -- compared to any deep learning model -- is not something that ever crossed my mind.
High training times could be cost prohibitive. Currently, it's over $100M to train GPT-4 from scratch (which possibly includes other costs related to RLHF and data acquisition). Not sure how this model compares, but it's likely not cheap.
This is an extremely content-light paper. There's basically zero information on anything important. Just hand-waving about the architecture and the data. Instead it spends its space on things like the equation for MAE and a diagram depicting the concept of training and inference. Red flags everywhere.
Max from Nixtla here. We are surprised that this has gained so much attention and are excited about both the positive and critical responses.
Some important clarifications:
The primary goal of this first version of the paper is to present TimeGPT-1 and showcase our preliminary findings from a large-scale experiment, demonstrating that transfer learning at this scale is indeed possible in time series. As mentioned in the paper, we deeply believe that pre-trained models can represent a very cost-effective solution (in terms of computational resources) for many applications. Please also consider that this is a pre-print version. We are working on releasing a reproducible set of experiments on a subset of the data, so stay tuned!
All previous work of Nixtla has been open source, and we believe TimeGPT could be a viable commercial product, offering forecasting and anomaly detection out of the box for practitioners. Some interesting details were omitted because they represent a competitive advantage that we hope to leverage in order to grow the company, keep providing better solutions, and continue building our ecosystem.
As some others have mentioned in the thread, we are working to onboard as many people as possible into a free trial so that more independent practitioners can validate the accuracy for their particular use cases. You can read some initial impressions from the creators of Prophet [1] and GluonTS [2], or listen to an early test by the people from H2O [3]. We hope to see some more independent benchmarks soon.
This is exactly the kind of thing the academics are warning you about when they say things like "peer review is important" and "don't read arxiv preprints if you're not a subject matter expert"
Time series works amazing until a fundamental assumption breaks or the time frame extends too far. It's pretty much just drawing the existing pattern out further with mathematical precision. It only works till the game changes.
Yeah, that’s part of the game though. You can’t get perfection from modelling a complex system with the (relatively) few variables that you can actually measure. Assumptions are always evolving, and are always going to be broken at some point or another.
That’s the opening line, right? Uncertainty is a fact of life. With time series forecasts, the best you can ever hope to do is give probability bounds, and even then you can only really do so by either:
- limiting by the rules of the game (e.g. the laws of physics, or the rules of a stock exchange)
- using past data
The former is only useful if you’re the most risk averse person on the planet, and the latter is only useful if you are willing to assume the past is relevant.
Good response. People seem to think that what I call "single pass" inference is the only thing that matters: a monolithic, single-process system.
When in fact the world, and the intelligent agents inside it, are ensembles of ensembles of systems with varying and changing confidence that flow and adjust as the world does.
A personal note here: they could have done a better job with the tokens, because the announcement was so grandiose and maybe they underestimated people with legit interest.
I’m using all the libs from Nixtla and actively advocating for them, yet I did not get a token; meanwhile lots of people are posting their usage on Twitter.
Aren't LLMs already zero-shot time series predictors? Predicting the next token and forecasting seem like the exact same problem. I will admit some small tweaks in tokenization could help, but it seems like we're just pretraining on a different dataset.
One idea I was interested in, after reading the paper on introducing pause tokens [1], was a multimodal architecture that generalizes everything to parallel time series streams of tokens in different modalities. Pause tokens make even more sense in that setup.
I agree - You could frame LLMs this way. Tokens over "time" where time just happens to be represented by discrete, sequential memory.
Each token could encode a specific amplitude of the signal. You could literally just have tokens [0, 1, ..., MAX_AMPLITUDE] and map your input signal to this range (rough sketch at the end of this comment).
In the most extreme case, you could have 2 tokens - zero and one. This is the scheme used in DSD audio. The only tradeoff is that you need way more samples per unit time to represent the same amount of information, but there are probably some elegant perf hacks for having only 2 states to represent per sample.
There are probably a lot of variations on the theme where you can "resample" the input sequences to different token rate vs bits per token arrangements.
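A rough sketch of that quantization idea (the 8-bit vocabulary and min-max scaling here are arbitrary choices):

    import numpy as np

    MAX_AMPLITUDE = 255                      # 256-token vocabulary

    def to_tokens(signal):
        lo, hi = signal.min(), signal.max()
        scaled = (signal - lo) / (hi - lo + 1e-12)
        return np.round(scaled * MAX_AMPLITUDE).astype(int), (lo, hi)

    def from_tokens(tokens, lo, hi):
        return tokens / MAX_AMPLITUDE * (hi - lo) + lo

    t = np.linspace(0, 1, 1000)
    signal = np.sin(2 * np.pi * 5 * t)
    tokens, (lo, hi) = to_tokens(signal)     # sequence of ints to feed a sequence model
    recon = from_tokens(tokens, lo, hi)      # max error is about half a quantization step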
Training an LLM on time series feels limited, unless I'm missing something fundamental. If LLMs are basically prediction machines, and I have an LLM trained on cross-industry time series data and want to predict orange futures, how much more effective can it be? (Genuine question.) Secondly, isn't context hyper-important? Such as weather, political climate, etc.
A long time ago, when I was a grad student, I got a consulting job with a radiologist who thought that he could use digital processing techniques to predict the options market well enough to make money. He didn't want to shell out for a real quant; he asked my prof if he knew anyone, and I decided I could use the extra money. I came up with some techniques that appeared to produce a small profit; unfortunately it was a hair less than what he'd have to pay in commissions. He wanted to keep pushing, but I decided to hang it up. I'm sure that there are people here who know far more about this than my decades-old experiments taught me.
So in principle it could work, but the problem is that these days the big players are all doing high-frequency trading with algorithms that try to predict market swings. And the big guys have an advantage: they are closer to the stock exchanges. They trade so fast that speed-of-light limitations affect who gets their trades in first. So I think the only people who could win with an LLM technique are those who don't need to pay commissions (a market maker, Goldman Sachs or similar), with access to real-time data, very close to the exchange so they get it fast.
Where's the dataset? Without the dataset it's impossible to back-test. For finance, for example, I'm assuming a large part of the training data is US stock tickers or FRED public data, so it's almost certain the model has seen the data people would want to back-test on.
The authors could probably make more impact if they had open-sourced their models; the way it is presented looks like the ClosedAI sort of pathway, meaning papers are used as a way to advertise their model to developers.
Perhaps a stupid question, but why train it only on time series data and not in conjunction with, e.g., news sources like the Financial Times? LLMs are good at language, so why not use that?
Not sure why this is getting so many upvotes. There is no concrete information in the paper about how the model actually works or what differentiates it from other models.
The M7 forecasting challenge makes this goal explicit. It's not the only use of forecasting, though, and IMO it would be good to have other time series data to present to models.
This is my thought as well. Reading the paper now... who would have thought that the quote "those who don't learn from history are doomed to repeat it" might be useful in predictions.
No, because when every player is using it, they will need something else to give them an edge. Anyways, quants do a lot more than time-series forecasting.
Most quants work on the sell-side where direct forecasting is almost irrelevant (to an academic, a trader perhaps not...), in that they are usually attempting to "interpolate" market prices to be able to price derivatives.
> Most quants work on the sell-side where direct forecasting is almost irrelevant
Those aren’t real quants. Even the sell-side quants know they aren’t real quants. For those unfamiliar, sell-side quants typically work at banks like Goldman, HSBC, JPMorgan, etc.
The real quants are buy-side quants/traders: prop shops, hedge funds, endowment/pensions funds, etc.
Out come the fingers made of foam
They finally open-sourced Rehoboam
Our societies have been thusly blessed
What it predicts could have anyone guessed
Targeted advertising wherever you roam