As much as Transformers feel like state-of-the-art universal function approximators, people need to realize why they work so well for language and vision.
Transformers parallelize incredibly well, and they learn sophisticated intermediate representations. We start seeing neat separation of different semantic concepts in representation space. We start seeing models do delimiter detection naturally. We start seeing models reason about lines, curves, colors, dog ears, etc. The final layers of a Transformer then put these sophisticated concepts together to learn high-level concepts like dog/cat/blog, etc.
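For what it's worth, intermediate representations like these are easy to poke at directly. Below is a minimal sketch (my own illustration, not from the original post) that hooks a few encoder blocks of torchvision's pretrained vit_b_16 and prints the shape of the per-token embeddings each block produces.

```python
# Minimal sketch: inspect intermediate representations in a vision Transformer.
# torchvision's vit_b_16 and the chosen block indices are illustrative assumptions.
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT).eval()
activations = {}

def capture(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # shape: [batch, tokens, hidden_dim]
    return hook

# Hook an early, a middle, and the final encoder block.
for i in (0, 5, 11):
    model.encoder.layers[i].register_forward_hook(capture(f"block_{i}"))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # dummy image stand-in

for name, act in activations.items():
    print(name, tuple(act.shape))
```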
Transformers (and deep learning methods in general) do not work for time series data because they have yet to extract any novel intermediate representations from said data.
At face value, how do you even work with a 'token window'? At the simplest level, time series modelling is about identifying repeating patterns over very different lifecycles, conditioned on certain observations about the world. A model needs to natively reason over years, days, and seconds all at the same time to even be able to reason about the problem in the first place. Hilariously, last week's streaming LLM paper from MIT might actually help here.
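To make the multi-scale point concrete, here is a toy sketch (my own, with made-up data, column name, and horizons) of the kind of multi-resolution context a classical pipeline hand-builds and a sequence model would need to capture natively:

```python
# Toy sketch: rolling features at several horizons for one synthetic series.
# The data, column name, and horizons are illustrative assumptions.
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=3 * 365 * 24, freq="h")
df = pd.DataFrame({"value": np.random.randn(len(idx)).cumsum()}, index=idx)

features = pd.DataFrame({
    "last_hour": df["value"],
    "daily_mean": df["value"].rolling("24h").mean(),    # day-scale context
    "weekly_mean": df["value"].rolling("7D").mean(),    # week-scale context
    "yearly_mean": df["value"].rolling("365D").mean(),  # year-scale context
    "target_next_hour": df["value"].shift(-1),
}).dropna()

print(features.tail())
```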
Secondly, the improvements appear marginal at best. If you're proposing a massive architecture change, and removing observability and explainability... then you'd better have some incredible results.
Truth is, if someone identifies a groundbreaking technique for time series forecasting, then they'd be an idiot to tell anyone about it before making their first billion on the market. Hell, I'd say they'd be an idiot for stopping at a billion. Time series forecasting is the most monetarily rewarding problem you could solve. If you publish a paper, then, by implication, I expect it to be disappointing.
> Truth is, if someone identifies a groundbreaking technique for time series forecasting
It's really quite simple. Just iterate through all programs of a monotone universal Turing machine whose input tape consists of all the data we can possibly collect, concatenated with the time series of interest. Skip the programs that take too long to halt, keep the remaining ones that reproduce the input sequence, then form a probability distribution over the next output bits, weighted by 2^(-program length).
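A toy sketch of that mixture, with the uncomputable parts stripped out: instead of enumerating real programs on a universal machine, it uses a tiny hand-picked hypothesis class with made-up description lengths, keeps the hypotheses that reproduce the observed prefix, and weights them by 2^(-length). Purely illustrative.

```python
# Toy Solomonoff-style mixture over a tiny, hand-picked hypothesis class.
# Real Solomonoff induction enumerates all programs and is uncomputable;
# the "description lengths" below are made up for illustration.
from collections import defaultdict

observed = [0, 1, 0, 1, 0, 1]  # the "input tape" prefix we want to continue

hypotheses = [
    # (name, description length in bits, generator for the first n bits)
    ("alternate_01", 5, lambda n: [i % 2 for i in range(n)]),
    ("all_zeros",    3, lambda n: [0] * n),
    ("all_ones",     3, lambda n: [1] * n),
    ("period_0011",  7, lambda n: [(i // 2) % 2 for i in range(n)]),
]

# Keep only hypotheses that reproduce the observed prefix,
# then weight each survivor's next bit by 2^(-description length).
weights = defaultdict(float)
for name, length, gen in hypotheses:
    bits = gen(len(observed) + 1)
    if bits[:len(observed)] == observed:
        weights[bits[len(observed)]] += 2.0 ** (-length)

total = sum(weights.values())
for bit, w in sorted(weights.items()):
    print(f"P(next bit = {bit}) ~ {w / total:.3f}")
```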
Not higher than having to perpetually secure a network with computation, which, taken to the extreme, is essentially a sure path to creating a black hole, by the argument that all available space ends up used for computational security and incentives.
That is how it works, precisely. You secure the network with compute. A computer requires physical space to run a computation. Thus the incentive, driven by network security, maximizes toward using all physical space.
Are you telling me that there is not already physical evidence for such? I assure you there is plenty of evidence of physical space being assimilated by the incentive structures around Bitcoin and its progeny.
Taken across time, for a civilization that grows into further complexity, there is a limit to how much space can be used to secure the network, and the system most likely even incentivizes maximizing the capture of space for computational security; therefore it accelerates our civilization towards creating a black hole. I couldn't come up with a better way to fast-track our way towards a cosmic environmental disaster. It's a pretty bad incentive structure long-term.
Are you technically competent? Have you read the whitepaper? This is the fundamental theory of the paper. I'm not sure how I can describe it better than the whitepaper itself: the system necessarily depends on a substrate that provides compute, which necessarily implies concrete physical material. Physical material thus acts as the mechanism for computation, which Bitcoin depends on for network security, incentivized by the value of the network's distributed nature, thus creating an ever greater need for compute.
Even in a world where computation doesn't become more efficient, it still eventually takes up all the available space, due to the incentives of protecting against network failure.
Thanks for telling me that computation requires physical matter. That sure will help get to the bottom of this.
Now could you answer the question? What is it about the Bitcoin blockchain that requires EVER INCREASING compute?
Network security. What says it doesn't? The whitepaper specifically points to computation for security. Computation is not evenly distributed, and its allocation changes with time, as a generality.
> Computation is not evenly distributed, and its allocation changes with time, as a generality.
I will NOT grant you this. Please, give me actual technical details on WHY it requires ever increasing compute. You've said network security; what about it requires ever increasing compute?
Your stubbornness or pedantry is not my concern. You have nothing to grant me; I require no grant of you. You offer me nothing, for you display nothing I lack. You already provide me with what I crave: my own self-amusement; so thank you.
You can read the paper and understand the principles it is based on, which are rooted in balancing computational asymmetry, amongst other concerns, across a network of computers. At the simplest level, if you are aware of hashcash and Sybil resistance, you should be able to figure it out.
If you're still confused, then ask yourself: why does the Bitcoin algorithm adjust to computational power?
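For readers following along, the mechanism that question points at is Bitcoin's difficulty retarget: every 2016 blocks, the difficulty is adjusted so blocks keep averaging about ten minutes, with the adjustment clamped to a factor of four. A rough sketch of that rule (not the reference implementation):

```python
# Rough sketch of Bitcoin's difficulty retargeting rule, for illustration only.
RETARGET_INTERVAL = 2016        # blocks between adjustments
TARGET_BLOCK_TIME = 10 * 60     # seconds per block the protocol aims for
EXPECTED_TIMESPAN = RETARGET_INTERVAL * TARGET_BLOCK_TIME  # roughly two weeks

def retarget(old_difficulty: float, actual_timespan_s: float) -> float:
    """Raise difficulty if the last 2016 blocks arrived faster than two weeks,
    lower it if they arrived slower; the swing is clamped to a factor of 4."""
    clamped = min(max(actual_timespan_s, EXPECTED_TIMESPAN / 4), EXPECTED_TIMESPAN * 4)
    return old_difficulty * EXPECTED_TIMESPAN / clamped

print(retarget(1.0, EXPECTED_TIMESPAN / 2))  # hashpower doubled -> 2.0
print(retarget(1.0, EXPECTED_TIMESPAN * 2))  # hashpower halved  -> 0.5
```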
You are unable to explain it, which is a clear sign of a lack of understanding.
But maybe you have links to others who are able to explain it, rather than the Bitcoin paper, which obviously does not lead one to think network security will subsume all available matter for compute.
> Truth is, if someone identifies a groundbreaking technique for time series forecasting, then they'd be an idiot to tell anyone about it before making their first billion on the market.
This is correct.
I work in HFT and the industry has been successfully applying deep learning to market data for a while now. Everything from pcaps/ticks to candles.
Why publish your method when it generates $1B+/year in profit for a team of 50 quants/SWEs/traders?
Are you at liberty to say how high the frequency gets in connection with these models?
I assume the latency is comparatively much higher, but I also wouldn't be surprised if microseconds generally aren't a problem, e.g. because the patterns detected are on a much larger scale.
Re candles: even longer term, hourly/daily? Are there actually strategies out there that deliver great Sharpe over many years with just time series forecasting? Most hedge funds don't beat the index, AFAIK.
Time series prediction is always about using the particular features of your distribution of time series. In standard time series prediction the features of the distribution are mostly things like "periodic patterns are continued" or "growth patterns are continued". A transformer that is trained on language data essentially learns time series prediction where a large variety of complex features appear that influence the continuation. Language data is so complex and diverse that continuing a text necessitates in-context learning: being able to find some common features in any kind of string of symbols, and using those to continue the text. Just think that language data could contain huge Excel tables of various data, like stock market prices or weather recordings. It is therefore plausible that in-context learning can be very powerful, enough to perform zero-shot time series continuation.

Moreover, I believe that due to in-context learning, language data plus the transformer architecture has the potential to produce genuinely general-intelligence-like behaviour: general pattern recognition. Language data is complex enough that SGD must lead to general pattern recognition and continuation. We are only at the beginning, and right now we are focused on finetuning, which destroys in-context learning. But we will soon train giant transformers on every modality, every string of symbols we can find.
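A minimal sketch of what zero-shot continuation via in-context learning looks like in practice: serialize the numbers as text and let a pretrained causal LM continue the string. The model choice ("gpt2") and the comma-separated formatting are arbitrary illustrative assumptions, not the method of the paper under discussion.

```python
# Minimal sketch: zero-shot time series continuation by serializing numbers as text.
# "gpt2" and the formatting are illustrative choices, not the paper's method.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

history = [112, 118, 121, 119, 125, 131, 128, 134]
prompt = ", ".join(str(x) for x in history) + ","

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=12,                    # room for a few more numbers
    do_sample=False,                      # greedy decoding for determinism
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
)

continuation = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])
print("model continues with:", continuation)
```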
The reality is that the market has inefficiencies like human emotion and bot/algorithmic trading which absolutely can be exploited by AI. You just need to train an AI to recognize the inefficiencies, which is exactly what neural networks excel at.
> people need to realize why they work so well for language and vision.
I agree with your entire post, however this sentence made me think: well, video is just layered vision, so why couldn't frames of a time series work similarly to vision? We know the current answer is that it doesn't, but is it a matter of NNs being unable to, or of us not yet having figured out the correct way to model it?
edit: I'm not sure what jdkwkbdbs (dead-banned) means by "LLMs don't. ML works pretty well." (well, I do); LLMs solve certain tasks, and really interesting ones at that, at certain average cross-entropy loss levels, and they typically form the Pareto front across models of all parameter counts, at least once you start looking at the biggest ones we have, i.e. above a certain parameter count; and their accuracy typically increases with parameter count in a statistically significant fashion.
In essence, they represent the state of the art for those specific tasks, as measured at the current time. You may wish for a better (cross-entropy loss, accuracy) pair at any given instant, say optimal expected CE loss with 100% accuracy and an actual proof of correctness, but for now it seems that not only do they do best, as a class of models, on these sets of tasks, they have also lately been improving fastest in accuracy as a class of models. They're quite noteworthy in that regard, fundamentally, imo. Just my 2c.
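To make the Pareto-front claim concrete, here is a tiny sketch with made-up model names and numbers: a model sits on the (parameter count, loss) front if no other model has both fewer parameters and lower loss.

```python
# Tiny sketch: Pareto front over (parameter count, cross-entropy loss).
# All model names and numbers below are made up for illustration.
from typing import List, Tuple

def pareto_front(models: List[Tuple[str, float, float]]) -> List[str]:
    """models: (name, params in billions, cross-entropy loss); lower is better for both."""
    front = []
    for name, params, loss in models:
        dominated = any(
            p <= params and l <= loss and (p < params or l < loss)
            for _, p, l in models
        )
        if not dominated:
            front.append(name)
    return front

models = [
    ("tiny-lm", 0.1, 3.9),
    ("small-lm", 1.0, 3.2),
    ("big-lm", 70.0, 2.4),
    ("bad-lm", 13.0, 3.5),  # dominated: more params and higher loss than small-lm
]
print(pareto_front(models))  # ['tiny-lm', 'small-lm', 'big-lm']
```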