Diversity and quantity are important for LLM training.
A search engine can index more than just "the best sources", and show results from the tail when no relevant matches are in the best sources.
I would agree with a softer restatement of your thesis, though. I am sure there is strongly diminishing marginal utility in search indexing broadly, especially as the web keeps filling up with spam and nonsense.
For pre-training LLMs, the quality/quantity/diversity story is more nuanced. They do seem to benefit a lot from quantity. For a fixed LLM training budget, the choice between training on the same high quality documents for more epochs and training on lower quality but unseen data is an interesting area of research. Empirically, the research finds that the benefit of additional epochs on the same data starts to diminish after the 4th epoch. All the research I've read tends to have an all-or-nothing flavor to data selection: either a document makes it in and gets processed the same number of times as everything else, or it doesn't get in at all. There is probably some juice in the middle ground, where high quality data gets 4x'ed, bad data is still eliminated, and the lesser but not terrible data gets in once.
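To make the middle ground concrete, here is a rough sketch of what tiered selection could look like. Everything in it is an assumption for illustration: the `Doc` type, the per-document `quality` score in [0, 1], and the 0.8 / 0.3 thresholds are made up, and only the 4x repeat for the top tier comes from the point above. It is not how any particular training pipeline does it.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    quality: float  # hypothetical quality score in [0, 1]


def repeat_count(quality: float, high: float = 0.8, low: float = 0.3,
                 high_repeats: int = 4) -> int:
    """Map a quality score to how many times the document is seen.

    The thresholds are illustrative assumptions, not tuned values.
    """
    if quality >= high:
        return high_repeats  # high quality: repeated up to the 4-epoch mark
    if quality >= low:
        return 1             # lesser but not terrible: seen once
    return 0                 # bad data: eliminated


def build_mixture(docs: list[Doc]) -> list[Doc]:
    """Expand the corpus so each document appears repeat_count times."""
    mixture: list[Doc] = []
    for doc in docs:
        mixture.extend([doc] * repeat_count(doc.quality))
    return mixture


docs = [
    Doc("textbook chapter", 0.95),
    Doc("average forum post", 0.50),
    Doc("spam page", 0.10),
]
mixture = build_mixture(docs)
```

In a real run the expanded list would be shuffled, and the repeats would be traded off against the fixed token budget rather than simply appended.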