Hacker News

I am also surprised that training data are not much more curated.

Encyclopedias, textbooks, reputable journals, newspapers and magazines make sense.

But to throw in social media? Reddit? Seems insane.



Even some results from "The Onion" seem to be in it. Looks like Google just took every website they've ever crawled as source.


The problem is that for some searches and answers Reddit or other social media is fine.


But only if you do a lot of filtering when going through responses. It's simple enough for a human: we see a ridiculous joke answer or obvious astroturfing and move on. But Reddit is >99% noise, with people upvoting obviously wrong answers because they're funny, lots of bot content, and constant astroturfing attempts.
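To make this concrete, here is a minimal sketch of the kind of crude filtering you'd have to apply before using comments like these as training data. Everything here is hypothetical: the field names (`body`, `score`, `author_is_deleted`), the threshold, and the bot phrase are invented for illustration, not taken from any real pipeline or API.

```python
# Hypothetical sketch: naive pre-filtering of Reddit-style comments.
# Field names, thresholds, and heuristics are all invented for illustration.

def filter_comments(comments, min_score=5):
    """Keep comments that pass a few crude quality heuristics."""
    kept = []
    for c in comments:
        text = c["body"].lower()
        if c["score"] < min_score:        # drop low-voted noise
            continue
        if "i am a bot" in text:          # drop self-identified bot replies
            continue
        if c.get("author_is_deleted"):    # drop orphaned comments
            continue
        kept.append(c)
    return kept

sample = [
    {"body": "Take the 55 bus, it's direct.", "score": 12},
    {"body": "bain colonial, obviously", "score": 80},  # upvoted joke
    {"body": "I am a bot, this action was performed automatically.", "score": 9},
]
good = filter_comments(sample)
```

Note that the heavily upvoted joke answer sails straight through: score-based heuristics can't distinguish a popular joke from a popular correct answer, which is exactly the problem described above.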


The users of r/montreal are so sick of lazy tourists constantly asking the same dumb "what's the best XYZ" questions without doing a basic search first that the meme answer is always "bain colonial", which is a men-only spa for cruising. It's often the top-voted comment. I just tried asking gemini and chatgpt what that response meant, and neither caught on.


No, it isn't. Humans interacting with human-generated text is generally fine. You cannot unleash a machine on the mountains of text stored on reddit and magically expect it to tell fact from fiction or sarcasm from bad intent.


> You cannot unleash a machine on the mountains of text stored on reddit and magically expect it to tell fact from fiction or sarcasm from bad intent

I didn't say you could, but the fact that a machine can't decode the mountains of text doesn't mean the answer isn't (perhaps only) on Reddit. I don't think people would be that interested in a search engine that just serves content from books and academic papers.


The fact is, I think there is not that much written text to actually train a sensible model on. A lot of books don't have OCRed scans or a digital version. Humans can extrapolate knowledge from a relatively succinct book and some guidance, but I don't know how a model can add the common-sense part (which we already have) that books rely on to transmit knowledge and ideas.


> The fact is, I think there is not that much written text to actually train a sensible model on. A lot of books don't have OCRed scans or a digital version.

https://books.google.com/



