I'm actually shocked that a company that has spent 25 years fine-tuning search results for any random question people type into the search box does not have a good, clean dataset to train an LLM on.
Maybe this is the time to get out the old Encyclopedia Britannica CD and use that for training input.
Google’s transformation of conventional methods into means of hypercapitalist surveillance is both pervasive and insidious. The “normal definition of that term” hides this.
You don't need "hypercapitalist surveillance" to show someone ads for a PS5 when they search for "buy PS5".
If they're doing surveillance, they're not doing a good job of it. I make no effort to hide from them, and approximately none of their ads are personalized to me; they're keyed to the search results rather than to anything they know from my history.
It’s a bit weird, since Google is taking on the “burden of proof”-style liability itself. Up until now, once a user clicked on a search result, they mentally judged the website’s credibility, not Google’s. Now every user will judge whether data coming from Google is reliable, which is a big risk to take on, in my opinion.
That latter point might be illuminating for a number of additional ideas. Specifically, should people have questioned Google's credibility from the start? I.e., "these are the search results" vs. "this is what Google chose".
Google did well in the old days for reasons. It beat AltaVista and Yahoo by having better search results and a clean loading page. Since perhaps '08 or so (based on memory, that date might be off), Google has dominated search, to the extent that it's no longer salient that search engines can be really questionable. Which is also to say: because Google dominated, people lost sight of the fact that searching and googling are different things, and that gives a lot of freedom for enshittification without people getting too upset, or even quite realizing that it could be different and better.
But only if you do a lot of filtering when going through responses. It's kind of simple to do as a human: we see a ridiculous joke answer or obvious astroturfing and move on. But Reddit is like >99% noise, with people upvoting obviously wrong answers because they're funny, lots of bot content, and constant astroturfing attempts.
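For the curious, here is roughly what the mechanical version of that human filtering looks like, and why it falls short. This is just a sketch in Python: the field names mirror Reddit's public comment JSON, but the score threshold, the joke-marker list, and the sample comments are made-up illustrations, not anything tuned or real.

    # Crude first-pass filter over Reddit-style comment dicts.
    # Thresholds and marker lists are arbitrary stand-ins for illustration.
    JOKE_MARKERS = {"/s", "lol", "source: trust me bro"}  # hypothetical examples

    def looks_usable(comment: dict) -> bool:
        """Cheap heuristics only; a real pipeline would need far more than this."""
        body = comment.get("body", "").strip().lower()
        if not body or body in ("[deleted]", "[removed]"):
            return False
        if comment.get("score", 0) < 5:                  # drop low-engagement noise
            return False
        if any(marker in body for marker in JOKE_MARKERS):
            return False                                 # likely sarcasm or joke
        if comment.get("author", "").lower().endswith("bot"):
            return False                                 # naive bot heuristic
        return True

    comments = [
        {"author": "helpful_local", "score": 42,
         "body": "The one on Main Street is solid, go early."},
        {"author": "regular_user", "score": 300,
         "body": "Gas station sushi, best in town."},    # upvoted joke answer
    ]
    print([c["body"] for c in comments if looks_usable(c)])

Both comments survive, which is the problem: the heavily upvoted joke answer passes every cheap heuristic, because telling it apart from a genuine recommendation requires context the filter doesn't have.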
The users of r/montreal are so sick of lazy tourists constantly asking the same dumb "what's the best XYZ" questions without doing a basic search first that the meme answer is always "bain colonial", which is a men-only spa for cruising. It's often the topmost-voted comment. I just tried asking Gemini and ChatGPT what that response meant and neither caught on.
No, it isn't. Humans interacting with human-generated text is generally fine. You cannot unleash a machine on the mountains of text stored on reddit and magically expect it to tell fact from fiction or sarcasm from bad intent.
> You cannot unleash a machine on the mountains of text stored on reddit and magically expect it to tell fact from fiction or sarcasm from bad intent
I didn't say you could, but the fact that a machine can't decode the mountains of text doesn't mean the answer isn't (perhaps only) on Reddit. I don't think people would be that interested in a search engine that just serves content from books and academic papers.
The fact is, I think there is not that much written word to actually train a sensible model on. A lot of books don't have OCRed scans or a digital version. Humans can extrapolate knowledge from a relatively succinct book and some guidance, but I don't know how a model can add the common-sense part (that we already have) that books rely on to transmit knowledge and ideas.
> The fact is, I think there is not that much written word to actually train a sensible model on. A lot of books don't have OCRed scans or a digital version.
Coincidentally, I was just watching a video about how South Africa has gone downhill - and that slide was hastened by McKinsey advising the crooked "Gupta brothers" on how to most efficiently rip off the country.
The problem in this case is not that it was trained on bad data. The AI summaries are just that - summaries - and when the underlying results are bad, it faithfully summarizes them.
This is an attempt to reduce hallucinations coming full circle. A simple summarization model was meant to reduce hallucination risk, but now it's not discerning enough to exclude untruthful results from the summary.
Two reasons. The first, even ignoring that truth isn't necessarily widely agreed upon (is Donald Trump a raping fraud?), is that truth changes over time, e.g. is Donald Trump president? And presidents are the easiest case, because we all know the fixed point in time when that gets recalculated.
Second, Google's entire business model is built around spending nothing on content. Building clean, pristinely labeled training sets is an extremely expensive thing to do at scale. Google has been in the business of stealing other people's data. Just one small example: if you produced (very expensive at scale) clean, well-lit photographs of your products for sale from multiple angles, they would take those photos and show them on links to other people's stores; and if you didn't like that, they would kick you out of their shopping search, etc. Paying to produce content upends their business model. See, e.g., the 5-10% profit margin well-run news orgs have vs. the ~25% profit margin Google has even after all the money blown on moonshots.