I'd guess that the ability of a very small model to do well on the TinyStories dataset isn't just because of the limited 3-4 year-old vocabulary, but also because it's an LLM-generated dataset.
LLM-generated content (synthetic data) is easier than human-generated text for an LLM to learn, because it was auto-regressively generated and therefore should be possible to auto-regressively predict.
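The intuition can be sketched with a toy example. Suppose the model's next-token distribution is P, while human text comes from a slightly different distribution Q that the model only approximates (both distributions below are made up purely for illustration). Text the model sampled itself costs it about H(P) nats per token, while human text costs H(Q,P) = H(Q) + KL(Q‖P), which carries an extra penalty for the mismatch:

```python
import math
import random

random.seed(0)

# Toy next-token distributions over a 4-token vocabulary.
# P is the "model"; Q stands in for the human text distribution
# the model only imperfectly fits. Both are invented for illustration.
tokens = [0, 1, 2, 3]
P = [0.4, 0.3, 0.2, 0.1]   # model distribution
Q = [0.1, 0.2, 0.3, 0.4]   # "human" source distribution

def sample(dist, n):
    # Draw n tokens i.i.d. from the given distribution.
    return random.choices(tokens, weights=dist, k=n)

def avg_nll(dist, samples):
    # Average negative log-likelihood (nats/token) the model assigns.
    return sum(-math.log(dist[t]) for t in samples) / len(samples)

synthetic = sample(P, 50_000)  # text the model generated itself
human = sample(Q, 50_000)      # text from the mismatched source

nll_synthetic = avg_nll(P, synthetic)  # ~ H(P),   about 1.28 nats here
nll_human = avg_nll(P, human)          # ~ H(Q,P), about 1.74 nats here
print(nll_synthetic < nll_human)       # the model predicts its own output best
```

The gap between the two numbers is exactly the KL penalty the comment is gesturing at: a learner trained on samples from its teacher's own sampling distribution never pays it.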
It's surprising that LLMs do as well as they do at predicting human-generated training samples, where there is no guarantee that the predictive signal is actually contained in the sample (it may exist only in the mind of the human who wrote it).
I've got to wonder what the impact on generation would be for an LLM trained only on synthetic LLM-generated data. I'd guess it wouldn't be as robust as one that had learned to handle more uncertainty.
> I'd guess that the ability of a very small model to do well on the TinyStories dataset isn't just because of the limited 3-4 year-old vocabulary, but also because it's an LLM-generated dataset.
Your guess is correct. The level of vocabulary has little to do with it. There was a paper about this a while back (sorry, can't find the link) which found that the model still learned just as well when the complexity of the text was increased, as long as the texts were LLM-generated.