
Has anyone built a search engine that uses LLMs to pre-grade every page with metrics such as:

- Commercial bias (content compared to the source, which it learns about)

- Insincere motives

- Bloat (how many words it takes to say how little, to penalize SEO bloat)

I would assume that using LLMs, we can get a pretty good idea of what is SEO bloat and who the bad actors are by this point, and just penalize those results.
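
Roughly what I'm imagining, purely as a sketch (prompt wording, model choice and the 0-to-1 scale are placeholders I haven't validated): run every crawled page through a grading call and keep the scores next to the index entry.

    # Sketch: grade a crawled page on the three metrics with an LLM.
    # Prompt wording, model name and score scale are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()

    PROMPT = (
        "Rate this page from 0 to 1 on each of: commercial_bias, "
        "insincere_motives, bloat (words spent per unit of information). "
        "Reply with a JSON object containing exactly those three keys.\n\n{page}"
    )

    def grade_page(text: str) -> dict:
        resp = client.chat.completions.create(
            model="gpt-4-turbo",                      # placeholder model choice
            messages=[{"role": "user", "content": PROMPT.format(page=text[:20000])}],
            response_format={"type": "json_object"},  # machine-readable scores
        )
        return json.loads(resp.choices[0].message.content)

    # e.g. grade_page(page_text) -> {"commercial_bias": 0.7, "insincere_motives": 0.4, "bloat": 0.9}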



Did exactly that: hacked together a small pipeline in Nushell with simonw/llm. Even with GPT-4 Turbo and direct guidelines on common spam heuristics, it seems to perform worse than a Bayesian BoW. Endless trails of questions with no relation to the title get described as informative, the presence of affiliate links often gets lost in the mass of tokens in their tracking parameters, relevance comes out consistently near 0.8 regardless of whether there's any relationship between title and content, and as for insincerity, our favorite BS generator cannot for the life of it correctly recognize its own creations.
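
For reference, the Bayesian baseline is nothing more exotic than this shape of thing (sketch, not my exact pipeline; the two labeled pages stand in for a hand-sorted sample of the crawl):

    # Naive-Bayes bag-of-words spam classifier, the kind of baseline meant above.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    pages  = ["10 best air fryers of 2024 (we may earn a commission) ...",
              "Notes on debugging a kernel soft lockup ..."]
    labels = [1, 0]   # 1 = SEO spam, 0 = fine

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    clf.fit(pages, labels)

    print(clf.predict_proba(["7 VPNs you NEED right now ..."])[0][1])  # P(spam)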

Your ideas for metrics are good, but LLMs seem to be quite terrible at all of them. A simple set of heuristics and maybe a tiny language model for named entity detection and "vibe checking" would serve you much better.

Also, a lot of the worst offenders seem to use the same Q&A +- conclusion structure; Viktor from marginalia.nu wrote a simple heuristic for it, which I recall he said did wonders for pruning it. Solving SEO spam is easy when you aren't the one being optimized against. What's left is scaling and information retrieval.
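
I don't know what his heuristic actually looks like, but even something as crude as this (thresholds invented) catches a lot of that structure:

    # Crude stand-in for a Q&A + conclusion structure check -- not
    # marginalia.nu's actual heuristic; the threshold is made up.
    import re

    def looks_like_qna_spam(headings: list[str]) -> bool:
        questions = sum(1 for h in headings if h.strip().endswith("?"))
        conclusion = any(re.search(r"\b(conclusion|final thoughts)\b", h, re.I)
                         for h in headings)
        return questions >= 5 and conclusion

    # looks_like_qna_spam(["What is X?", "Is X safe?", "Can X do Y?", ..., "Conclusion"])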


What makes you think that LLMs will be better at combating spam than they are at creating it? There’s no universal rule that innovations in AI will go hand in hand with innovations in detecting AI, yet I feel like I see people talking all the time like that’s the case.

As of right now, LLMs are prolific but unreliable, which makes them extremely well suited for generating spam, but unsuited to detecting it without a large number of false positives and negatives.


I honestly don’t mind so much if it detects “ai-generated” vs. “human-generated”—the key thing to detect is whether the page is full of irrelevant SEO junk. GP suggested several attributes that ought to be detectable. Even if we don't eliminate the AI content, but succeed in promoting "better" content, maybe it's an improvement?


Classification is easier than generation


Sadly, this didn't seem to track for LLMs. Even OpenAI gave up on trying to detect its own outputs.

> As of July 20, 2023, the AI classifier is no longer available due to its low rate of accuracy. We are working to incorporate feedback and are currently researching more effective provenance techniques for text, and have made a commitment to develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated.


You're not classifying just text, you're classifying entire web pages. If it's easy for a human to tell that it's SEO spam, it's easy for a classifier.


Not with standard LLMs. Generating garbage is easier; deciding that it's garbage is much more difficult. This is an inherent problem with "prompt optimizers" like DSPy.


Yeah no I’m gonna need more than that chief. Everything I know about LLMs says the opposite.


Look up Generative Adversarial Networks, that's their basic principle.

> You're not classifying just text, you're classifying entire web pages.


I'd say the whole point of a GAN is that generation is cheaper than classification, therefore an effective brute-force way of making a good classifier is to generate an infinite supply of examples with a priori known classifications and pit the generator against a classifier.
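
Stripped of the adversarial feedback loop, the trick looks roughly like this (sketch only; generate_spam_page stands in for whatever generator you point at it, and a real GAN would keep retraining both sides against each other):

    # Generate examples with a priori known labels, mix with real pages,
    # and train the discriminator on the mix.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def generate_spam_page(i: int) -> str:
        # placeholder for an LLM call; label known to be "spam" by construction
        return f"Top {i} unbelievable gadgets reviewed (affiliate links inside) ..."

    real_pages = ["A hands-on review of one gadget, with measurements ...",
                  "Changelog and migration notes for version 3.2 ..."]
    fake_pages = [generate_spam_page(i) for i in range(200)]

    X = real_pages + fake_pages
    y = [0] * len(real_pages) + [1] * len(fake_pages)

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X, y)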


Yeah, but the theoretical endpoint of training a GAN is that the generator gets so good that the discriminator has to resort to guessing and becomes unable to tell with any sort of accuracy whether the example it is shown is real or generated.


I don't think that ever happens in training, at least in the image domain. The classifier can always find some subtle clue.


Sounds a bit more like you want to do something reranking-ish. Ideally, you would train a retrieval system to retrieve the most relevant pages, which would in turn have been trained on a dataset not very different from MS-Marco. This would get you a small set of documents you want to rerank.

For reranking to be able to detect commercial bias, insincerity or bloat you could use LLMs, but IIRC you'd train a multiclass classifier for each, then combine the probabilities from each head (calibrate too?) into a score and use it as a weight in your ranking.
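
Roughly like this (sketch; the per-head weights are arbitrary and calibration is hand-waved):

    # Fold per-head spam probabilities into the retrieval score.
    # Weights are arbitrary; calibration (Platt / isotonic) is left out.
    def rerank_score(relevance: float, heads: dict[str, float]) -> float:
        weights = {"commercial_bias": 0.5, "insincerity": 0.8, "bloat": 0.3}
        penalty = sum(w * heads.get(k, 0.0) for k, w in weights.items())
        return relevance * (1.0 - min(penalty, 0.95))

    # rerank_score(0.82, {"commercial_bias": 0.9, "insincerity": 0.1, "bloat": 0.6})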


I think Kagi should add a feature where I can subscribe to the domain blocks of someone else. Every time I see a spam blog, I can easily prevent the domain from polluting future results. But it'd be great if I could also use my friends' lists to push their blocked domains to the end of my search results.
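
Even a purely client-side version would do most of the job, something like this sketch (not a real Kagi API; results are (url, score) pairs and each subscribed list is just a set of domains):

    # Demote (rather than drop) results whose domain appears on any
    # subscribed friend's blocklist. Not a real Kagi feature or API.
    from urllib.parse import urlparse

    def apply_blocklists(results, subscribed_lists):
        blocked = set().union(*subscribed_lists)
        def sort_key(item):
            url, score = item
            return (urlparse(url).netloc in blocked, -score)  # blocked domains last
        return sorted(results, key=sort_key)

    # apply_blocklists([("https://spamblog.example/post", 0.9),
    #                   ("https://goodblog.example/post", 0.7)],
    #                  [{"spamblog.example"}])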


This is definitely the best option. Community search blocklists, like uBlock Origin's filter lists.


We don't need to. We have LLMs now


I still would like to search and not get the dog crap that is Google's Internet, in addition to using LLMs myself.


I know it's the other way around from what you're suggesting, but I've been using Kagi for a while for the same reasons, with the results you'd expect.

Their search is much, much cleaner, that's for sure. But what made me stick (and mostly ditch DDG, which btw is also much cleaner than Google) was how well their FastGPT works as a search tool.

Summaries are very good, it includes recent events and news, it goes through PDFs, and it always cites its sources. It does hallucinate for me sometimes, but I can always tell it's incorrect from the response itself. Plus it usually gives me links that easily clear up the confusion. Especially in the IT field, I can tell I'm being fed the source of my trouble (like the initial GitHub issue that introduced the broken functionality, or the source PDF of a study) rather than just the discussion around it.

Their search has some neat features as well, such as being able to choose to see fewer or no results from a given site straight from the list of search results.


Yes - and the other killer feature in Kagi is being able to uprank your own choice of sites, and set contexts for this upranking. That, to me, is what really sets it apart.


I agree. My issue with LLMs is that it isn’t clear when it is hallucinating versus when it isn’t.


It’s always hallucinating… but sometimes the hallucinations are of things that actually happened.


Non-LLM efforts include alternative and even self-hosted search engine indexes. I'm also curious if Brave Search's concept of "goggles" could work out, where you can write your own indexing logic and share it with others.


We don't need self-hosted alternatives (just like the computer market doesn't need Linux tinkerers) as much as we need a real formidable competitor to dethrone Google and create proper website incentives so the Internet stops sucking.

We need SEO companies to realize they will go out of business if they continue to generate crap filler content for their clients.

We know Google won't do it. We need influence. We need effective results.

Self-hosted is cute but ineffective at best, selfish at worst.


Hmm, I reflexively disagree, but I still disagree even after considering it from other positions.

Need is a strong term and it is likely doing a lot of work in that claim. We, technically, do not need much bar food, water and shelter. In that sense, the post is absolutely correct. Realistically, there is zero need for self-hosting, or linux, or anything much really.

But even if we get past the need claim, why is setting up a duopoly a preferred option to people actually running their own preferred setups ( and maybe even learning something in the process )?

More importantly, why on earth would I want yet another giant corporation in charge of my digital life?

<< create proper website incentives so the Internet stops sucking.

Interestingly, Linux tinkerers and self-hosters are likely one of the few reasons the web does not suck AS much as it otherwise could have. In a sense, the incentives are there.

<< We need SEO companies to realize they will go out of business if they continue to generate crap filler content for their clients.

Business is just that.. a business. I don't expect a mosquito not to bite me. If anything, the whole premise is wrong. SEO companies defer to Google's wishes, and Google heard the pleas of the ad industry and declared war on adblockers.

<< We need influence. We need effective results.

Zero disagreement.

<< Self-hosted is cute but ineffective at best, selfish at worst.

I dunno. It might be selfish, but I am ok with that if that is the worst. I would worry about it being ineffective, but.. I like my various instances. They serve a purpose to me.


Switch to Kagi, IMHO it's worlds better than Google. It's worth the price, I get a lot more out of it than my Netflix subscription.


I'm sorry but LLMs notoriously provide inaccurate and otherwise awful information.

Case in point: I asked an LLM what the last non-cellular Windows Mobile Classic PDA was (I knew the answer), and it routinely got it wrong.

This is what LLMs should be useful for. If I can't audit the results or verify how it came to the conclusion, the answer is useless.

LLMs are toys at this juncture.


That's a great example of the kind of prompt that I intuitively know wouldn't return a useful result... but I can't explain WHY I intuitively know that. Which is deeply frustrating.


Here are the results for that exact search in Google: https://www.google.com/search?q=what+was+the+last+non+cellul...

vs 4o: The last non-cellular Windows Mobile Classic PDA was the HP iPAQ 110 Classic. Released in 2008, it ran on Windows Mobile 6.0 Classic and featured a 624-MHz Marvell PXA310 processor, 256MB of Flash ROM, and a 3.5-inch screen with a 240 x 320 resolution. It included Wi-Fi and Bluetooth connectivity but lacked cellular capabilities, making it one of the final models in the declining PDA market as smartphones began to dominate. Sources: List of Windows Mobile devices - Wikipedia (https://en.wikipedia.org/wiki/List_of_Windows_Mobile_devices), The End of the Classic Version of Windows Mobile (AKA the PDA) (http://www.pocketpcfaq.com/commentary/end_of_WM_Classic.htm), HP iPAQ 110 Classic - PDA Like It's 1999 - WiFi Planet (https://wi-fiplanet.com/hp-ipaq-110-classic/).

I know which version I'd prefer.


What about the HP iPAQ 112 then? Wasn't the HP iPAQ 110 Classic released in November 2007?

Also why are you entering questions into the Google search prompt?


Because the commenter implied that LLMs get it wrong, as if search engines get it right. The reality is that both take digging, but the LLM response gets you to the right answer more quickly.


Which LLMs stay current and link their sources? If I have to wait on the LLM to search for me, I'd rather just do the search myself. What if I want to search for something the LLM can't show me? Or something I want to watch or interact with that isn't an LLM?


GPT-4o serves up results that look like Perplexity's, except the sources are actually relevant links.

All of that to say: solved problem? Assuming you're ok with chat as the UI


I'm saying use LLMs as preprocessors to form predetermined rankings by URI which weigh into the search. Let the crawler pipe into an LLM.


I get it - use the LLM score as an additional metric for page rank, not ask the LLM for search results
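
i.e. something along these lines, with the scores computed offline by the crawler pipeline (the blend weight and the default for ungraded pages are guesses):

    # Blend a precomputed per-URI LLM quality score into the ranking.
    # llm_quality would be filled offline by the crawl -> LLM grading pipeline.
    llm_quality = {
        "https://example.com/honest-review": 0.9,
        "https://content-farm.example/10-best-anything": 0.1,
    }

    def blended_score(uri: str, relevance: float, alpha: float = 0.3) -> float:
        # alpha and the 0.5 default for ungraded pages are arbitrary choices
        return (1 - alpha) * relevance + alpha * llm_quality.get(uri, 0.5)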


I always assumed LLMs would result in more bloated content (stupid interns) but I think you’re right that it’ll lead to more efficient prioritization (hooray interns)


I’m working on that problem right now actually, but not in a direct way. We create ad hoc content for people based on product reviews etc and have had to invest a fair amount of time filtering content and removing sponsored/shill and low quality generated content that reduces the utility of our… generated content. Ours is at least dynamically rendered so others won’t have to sift through it some day.


Literally just assess how much ad revenue Google stands to earn from the site. If it's a Google top hit (SEO) AND earns Google a disproportionate amount of potential revenue, then there's your grade.



