Hacker News

I am also surprised that training data are not much more curated.

Encyclopedias, textbooks, reputable journals, newspapers and magazines make sense.

But to throw in social media? Reddit? Seems insane.



Even some results from "The Onion" seem to be in it. Looks like Google just took every website they've ever crawled as source.


The problem is that for some searches and answers Reddit or other social media is fine.


But only if you do a lot of filtering when going through responses. It's simple enough for a human: we see a ridiculous joke answer or obvious astroturfing and move on. But Reddit is >99% noise, with people upvoting obviously wrong answers because they're funny, lots of bot content, and constant astroturfing attempts.
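To make this concrete, here is a minimal sketch of the kind of crude filtering you'd have to apply before using comments like these as training data. Everything here is hypothetical: the field names (`body`, `score`, `author_is_deleted`), the threshold, and the bot phrase are invented for illustration, not taken from any real pipeline or API.

```python
# Hypothetical sketch: naive pre-filtering of Reddit-style comments.
# Field names, thresholds, and heuristics are all invented for illustration.

def filter_comments(comments, min_score=5):
    """Keep comments that pass a few crude quality heuristics."""
    kept = []
    for c in comments:
        text = c["body"].lower()
        if c["score"] < min_score:        # drop low-voted noise
            continue
        if "i am a bot" in text:          # drop self-identified bot replies
            continue
        if c.get("author_is_deleted"):    # drop orphaned comments
            continue
        kept.append(c)
    return kept

sample = [
    {"body": "Take the 55 bus, it's direct.", "score": 12},
    {"body": "bain colonial, obviously", "score": 80},  # upvoted joke
    {"body": "I am a bot, this action was performed automatically.", "score": 9},
]
good = filter_comments(sample)
```

Note that the heavily upvoted joke answer sails straight through: score-based heuristics can't distinguish a popular joke from a popular correct answer, which is exactly the problem described above.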


The users of r/montreal are so sick of lazy tourists constantly asking the same dumb "what's the best XYZ" questions without doing a basic search first that the meme answer is always "bain colonial", which is a men-only spa for cruising. It's often the top-voted comment. I just tried asking gemini and chatgpt what that response meant, and neither caught on.


No, it isn't. Humans interacting with human-generated text is generally fine. You cannot unleash a machine on the mountains of text stored on reddit and magically expect it to tell fact from fiction or sarcasm from bad intent.


> You cannot unleash a machine on the mountains of text stored on reddit and magically expect it to tell fact from fiction or sarcasm from bad intent

I didn't say you could, but the fact that a machine can't decode the mountains of text doesn't mean the answer isn't (perhaps only) on Reddit. I don't think people would be that interested in a search engine that just serves content from books and academic papers.


The fact is, I think there is not that much written text to actually train a sensible model on. A lot of books don't have OCRed scans or a digital version. Humans can extrapolate knowledge from a relatively succinct book and some guidance, but I don't know how a model can add the common-sense part (which we already have) that books rely on to transmit knowledge and ideas.


> The fact is, I think there is not that much written text to actually train a sensible model on. A lot of books don't have OCRed scans or a digital version.

https://books.google.com/



