Hacker News | CShorten's comments


Here is a video I made diving into the paper, hopefully helpful!

https://www.youtube.com/watch?v=Ek0tZootK00


I like your style, subscribed!


Thank you so much!


Portkey is super cool, congrats Rohit and Ayush on the HN launch!


Thanks! We spent a long time building it out properly. Time to integrate Weaviate generators now?


Thank you so much! Really appreciate that!


Hey all, my apologies for these comments! I will be more mindful of this going forward!


Epic! Love this, one of the best Weaviate demos I've seen!


> Epic! Love this, one of the best Weaviate demos I've seen!

Just out of interest, do you still work for Weaviate? Probably worth mentioning.


Hey robertlagrant, my apologies -- still figuring out best practices on Hacker News. Will be more mindful of this going forward!


Thanks! Working on this and all the exciting features was really fun, from semantic and generative search to using Weaviate as a Semantic Cache and translating natural queries to GraphQL. The live demo was also recently updated with some good stuff! https://healthsearch-frontend.onrender.com/


Loved this paper - so much opportunity to explore Retrieval-Augmented Generation with these longer input LLMs + Vector Databases & Search!


Thanks so much for sharing!


You got it (:


Awesome!


Hey everyone, I also work at Weaviate.

Weaviate has implemented Hybrid Search because it helps search performance in a few ways (Zero-Shot, Out-of-Domain, Continual Learning). Members of the Qdrant team are arguing against implementing Hybrid Search in Vector Databases with 3 main points that I believe are incorrect: 1. There are no comparative benchmarks on Hybrid Search. 2. "Multi-Tool" systems are generally flawed by design in favor of specialization. 3. Cross-Encoder inference does not need additional processing modules. This last one relates more to the particular design differences between Weaviate and Qdrant, where the Qdrant team is again arguing that you don't need to implement the thing you implemented.

TLDR

1. Such benchmarks exist, as shown in trengrj's initial response, and we are working on them as well.

2. There are a few arguments why adding sparse search doesn't require too much extra specialization: inverted indexing is already used in filtered vector search to begin with, and it makes sense to apply the rank fusion in the same database where each scoring method runs.

3. Cross Encoder inference generally doesn't happen in the database itself, so it makes sense to use modules to process the additional ranking logic. There are several other examples of inference in search pipelines that this kind of design enables.
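As a concrete illustration of the rank fusion mentioned in point 2, here is a minimal sketch of reciprocal rank fusion (RRF) over a BM25 result list and a dense result list. This is a generic illustration with made-up document IDs, not Weaviate's exact implementation:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked result lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing high in both lists accumulate the largest scores
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]   # keyword ranking
dense_hits = ["d1", "d9", "d3"]  # vector ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# "d1" and "d3" appear in both lists, so they rise to the top
```

The constant k = 60 is the value commonly used in the RRF literature; it damps the influence of any single list's top ranks.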

More detail:

1. “You don’t publish comparative benchmarks”

Firstly, trengrj has responded with exactly this. I think it's actually better when it comes from a third party as well, since clearly Weaviate is biased in having implemented Hybrid Search and Qdrant is biased in not having implemented it.

However, here is a quick overview of benchmarking efforts at Weaviate so far.

The focus at Weaviate has primarily been on understanding Approximate Nearest Neighbor vector search, for which very thorough benchmarks have been published ablating HNSW hyperparameters such as maxConnections, efConstruction, and ef. This is done to measure recall with respect to the approximation.

ANN Benchmarks - https://weaviate.io/developers/weaviate/benchmarks/ann

Podcast about this :) - https://www.youtube.com/watch?v=kG3ji89AFyQ

With respect to comparative benchmarks that report IR metrics such as nDCG, hits at K, recall, and precision: we are beginning with this using the BEIR benchmarks. BEIR is much more of an industry standard for reporting the performance of BM25, Dense Retrieval, Hybrid Search, Cross Encoders, and so on.
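For readers unfamiliar with nDCG, here is a minimal sketch of how it is computed, using binary relevance labels for simplicity (some BEIR datasets use graded, multi-level relevance):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, k=10):
    """ranked_rels: relevance labels of retrieved docs, in ranked order."""
    ideal = sorted(ranked_rels, reverse=True)  # best possible ordering
    denom = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / denom if denom > 0 else 0.0

# relevance of the top-5 retrieved docs for one toy query (1 = relevant)
print(round(ndcg_at_k([1, 0, 1, 0, 0], k=5), 3))  # -> 0.92
```

BEIR-style results average this per-query score over all queries in a dataset, typically at k = 10.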

The Qdrant team has taken 2 rather random examples of e-commerce datasets. Their results also conclude by advocating for Hybrid Search, although differently: aggregating results to send to a cross encoder rather than a rank fusion of each result list. The key challenge with that is that cross encoder inference is very slow -- more on that in point 3.

I think there is value in benchmarking eCommerce search datasets because of the way they capture multimodal data, but this isn't really a standard yet. Comparatively, many independent companies and researchers have reproduced the BEIR metrics.

Our findings so far support that there is no free lunch with this — BM25, Dense, or Hybrid does not consistently outperform the others.

Here is a quick preview of our BEIR nDCG results so far. Note these are subject to change; some have been tested with WAND scoring and others have not (* denotes BM25 scoring with WAND). Hybrid is tested with alpha = 0.5, and vector embeddings are done with the sentence-transformers model `all-MiniLM-L6-v2`:

NFCorpus: BM25 = 0.224, Hybrid = 0.280, Vector only = 0.265
FiQA: BM25 = 0.284, Hybrid = 0.428, Vector only = 0.434
SciFact: BM25 = 0.678, Hybrid = 0.714, Vector only = 0.683
ArguAna: BM25 = 0.368, Hybrid = 0.408, Vector only = 0.411
Touche2020: BM25 = 0.351, Hybrid = 0.364, Vector only = 0.249
*Quora: BM25 = 0.770, Hybrid = 0.867, Vector only = 0.887
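The alpha = 0.5 setting weights the keyword and vector score lists when fusing. Here is a minimal sketch of one common way to do this, a convex combination over min-max normalized scores; this is an illustration with toy scores, and Weaviate's exact fusion formula may differ:

```python
def minmax(scores):
    """Normalize a dict of doc -> raw score into the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def hybrid(bm25, dense, alpha=0.5):
    """score = alpha * dense + (1 - alpha) * bm25; docs missing from a list score 0."""
    b, v = minmax(bm25), minmax(dense)
    docs = set(b) | set(v)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# toy raw scores: BM25 scores are unbounded, cosine similarities are not,
# which is why normalization happens before mixing
ranking = hybrid({"d1": 12.0, "d2": 7.5}, {"d1": 0.95, "d3": 0.80})
```

alpha = 1 recovers pure vector search and alpha = 0 recovers pure BM25, which is why sweeping alpha is a cheap way to probe the "no free lunch" behavior per dataset.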

These BM25 results are similar to Vespa's BM25 results:

https://blog.vespa.ai/improving-zero-shot-ranking-with-vespa...

The primary reason these scores differ is that I am not accounting for multi-level relevance, due to my own lack of familiarity with those datasets. I will correct this when we officially publish the Weaviate BEIR benchmarks.

NFCorpus: Vespa = 0.313, Weaviate = 0.224
FiQA: Vespa = 0.244, Weaviate = 0.284
SciFact: Vespa = 0.673, Weaviate = 0.678
ArguAna: Vespa = 0.393, Weaviate = 0.368
Touche2020: Vespa = 0.413, Weaviate = 0.351
Quora: Vespa = 0.761, Weaviate = 0.770

This is of course very incomplete; these are only 6 of the 14 BEIR datasets. As an update for those interested in Weaviate's progress with these benchmarks: TREC-COVID and SCIDOCS really need to be updated with the multi-level relevance scores, otherwise the numbers give a bad picture of the performance. FEVER, Climate-FEVER, HotpotQA, DBPedia-Entity, and MS MARCO have been vectorized but still need to be imported and then backed up in Weaviate for the sake of reproducibility.

The key point here is that "there is no free lunch": Hybrid helps catch the cases where BM25 works well AND the cases where Vector Search works well. For Touche2020, SciFact, and NFCorpus we get better results with Hybrid. In alignment with the underlying rank fusion algorithm, there isn't a case where Hybrid Search is dramatically outperformed by either BM25 or Vector Search alone.

This is very important for vector databases, because Zero-Shot performance, the ability to cover 80% of use cases out of the box, is a huge enabler in our ability to evangelize the technology. Collectively, Vector Databases need to illustrate the potential value of searching through all sorts of domains, from code documentation to emails, personal notes, eCommerce (as you mention), etc. The big point here is that Hybrid provides another performance layer that helps avoid the cases where Vector Search fails and people are put off the technology.

Once people are interested in using Vector Search, we then have more of a Deep Learning problem: continual learning of the embeddings. For example, if you are using it for code documentation search and Weaviate introduces a new feature like ref2vec, the Dense model will not have a semantic embedding for this term until it is optimized with the new data. This is another enormous application of Hybrid Search: using the keyword scoring to adapt to new terms faster than the Deep Learning models can be optimized to do so.

2. Multi-tool system

This argument completely lacks substance for our particular conversation here. Vector databases already integrate inverted indexing for filtered vector search. It makes a ton of sense to adopt these same building blocks for the sparse indexing, and further, it makes a ton of sense to apply the rank fusion in the same database rather than networking ranked lists at scale. The scalability patterns overlap.

Plus you failed to acknowledge the key point of “it’s easier to manage”.

Generally this is just ugly communication. It reads like someone who is pissed off rather than someone wanting to have an honest discussion of the technology.

3. Cross Encoders

3A. The most important thing here is that Cross Encoder inference is very slow. I don't agree at all with "In the case of document retrieval, we care more about the search result quality and time is not a huge constraint". Further, Cross Encoders generally need to run on GPUs, which is expensive.

3B. The module system is used to process the logic of Cross Encoder inference, whether self-hosted or via OpenAI, etc.

It is most likely that re-rankers will come from OpenAI, Cohere, HuggingFace Inference Endpoints, and similar model-as-API services, or things like Metarank that host XGBoost APIs. These send predictions over network requests that you process with an external module (i.e. the predictions don't happen in the database directly).
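To make the module pattern concrete, here is a sketch of a re-rank step where the scorer is pluggable. The `score_pairs` callable is a hypothetical stand-in for whatever the module wraps, e.g. a network call to a hosted cross-encoder; the toy scorer below exists only so the example runs without a model:

```python
def rerank(query, candidates, score_pairs, top_k=3):
    """Re-order fused candidates using an external (slow) relevance scorer.

    score_pairs: callable returning one score per (query, document) pair;
    a stand-in for a cross-encoder service behind a module.
    """
    scores = score_pairs([(query, doc) for doc in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]

# toy scorer: counts query-term overlap instead of running a real model
def toy_scorer(pairs):
    return [len(set(q.lower().split()) & set(d.lower().split())) for q, d in pairs]

top = rerank("hybrid search fusion",
             ["dense vector search", "hybrid search with rank fusion", "bm25 scoring"],
             toy_scorer, top_k=2)
```

Because the scorer is just a callable, swapping the toy function for an API client changes nothing in the database-side logic, which is the point of keeping inference in modules.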

Of course there are more kinds of model inference we want to use in search pipelines than just Cross Encoders (Question Answering, Summarization, ...), and thus the module system handles the nuances of each respective model's inference.


> There are a few arguments why adding sparse search doesn't require too much extra specialization

Full-text search != sparse search; that's a naive oversimplification. Btw, sparse search is on the Qdrant roadmap, so we should be able to compare its performance on benchmarks.

> Cross Encoder inference generally doesn't happen in the database itself thus it makes sense to use modules to process the additional ranking logic

That statement makes your argument that `A combined system has better end-to-end latency` invalid.

> Such benchmarks exist as in trengrj's initial response and we are working on them as well.

link or it didn't happen.

In the current benchmarks you advertise everywhere, you're just throwing in disproportionately powerful and expensive hardware. Even a full scan can give good results under those conditions.


1. Please expand on how you are defining full-text search as distinct from sparse search so we can continue the discussion. My understanding is that full-text search is built on inverted indexing, which is the intended meaning behind "sparse" search. Whether that be BM25, SPLADE, or exact keyword / phrase matching, my understanding is that it is the same underlying index structure with distinctions in the scoring for BM25 / SPLADE.
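The claim above is that BM25-style and SPLADE-style retrieval share one index structure and differ mainly in term weighting. A toy sketch of that shared inverted index, with the weighting left as a pluggable function (raw term frequency here; real BM25 adds IDF and length normalization):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map term -> list of (doc_id, term_frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return index

def search(index, query, weight=lambda tf: tf):
    """Score by summing per-term weights; swap `weight` for BM25/SPLADE weighting."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, []):
            scores[doc_id] += weight(tf)
    return sorted(scores, key=scores.get, reverse=True)

idx = build_inverted_index({"a": "hybrid search search", "b": "vector search"})
hits = search(idx, "hybrid search")
```

Whitespace tokenization is of course the oversimplification being debated: production full-text engines add language-aware tokenizers, lemmatizers, stop-words, etc. on top of this structure.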

2. The original argument references "combined system" in the sense of Hybrid Search (BM25 + Dense Vector Search). I don't think this is a fair comparison: model inference services are extremely lightweight relative to the sparse / dense indexing systems we are primarily discussing. I also have not advocated for Cross Encoder inference in the spirit of improving latency, just clarified why a module system is used for it.

3. The hybrid search results from researchers independent of either of us are linked in the original comment; here it is again: https://arxiv.org/abs/2201.10582. Your criticism of the Weaviate ANN benchmarks isn't relevant to our discussion on Hybrid Search. I linked them to show that Weaviate has produced comparative benchmarks, which was your original claim. I also do not agree with your premise that a full-scan search would give similar speed results to HNSW on this setup, however arbitrarily we are defining "good results". I acknowledge that it is not included in the benchmark report and is something that should be added. I also agree that it would be interesting to run ANN recall tests on several hardware configurations.


> Please expand on how you are defining full text search distinctly from sparse search to continue the discussion

In addition to the indexing algorithm, there is the tokenizer, which depends on the language, the lemmatizer, synonyms, stop-words, and so on. In addition, the ranking function itself may be quite different and based on different rules. See how Meilisearch does it. Reducing full-text search to just an inverted index is a misconception.

> Your criticism of the Weaviate ANN benchmarks isn't relevant to our discussion on Hybrid Search.

It is very much relevant. As I mentioned, when the two searches run in parallel:

total_latency = max(BM25_latency, Vector_search_latency) + merge_overhead

and my claim is that in specialized tools, both BM25_latency and Vector_search_latency will be better than what a multi-tool system can provide.

> I have linked this to show that Weaviate has produced comparative benchmarks which was your original claim.

I don't see any comparisons in your benchmarks here - https://weaviate.io/developers/weaviate/benchmarks/ann

You just benchmarked yourself; that is not interesting and not helpful.

> I also agree that it would be interesting to run ANN recall tests on several hardware configurations.

That is not the point. In our benchmark we run all engines on exactly the same machine to make it fair. Sometimes the same configuration in different regions already gives very different performance on some cloud providers.


1. Weaviate does all these things as well, please see https://weaviate.io/blog/pulling-back-the-curtains-on-text2v.... Although there are interesting optimizations around fitting a tokenizer, these are all relatively simple things, whereas the inverted index is where the optimizations of specialized systems really come into play, e.g. WAND scoring.

2. Following on #1, what optimizations does BM25 require that justify an entirely separate tool and the maintenance of two separate search systems? Having both searches in the same system also helps with the merge overhead.

3. Any company's report benchmarking itself against its competitors should be taken with a grain of salt; this is obviously bad practice. The purpose of these benchmarks is to compare, in this case, hyperparameters of HNSW, and in future work around the BEIR numbers I provided earlier, Hybrid Search performance.

4. Again, no company can seriously benchmark itself against its competitors due to obvious conflicts of interest. Maybe this competition could be hosted here - https://big-ann-benchmarks.com/index.html#organizers.


The benchmark is open source - https://github.com/qdrant/vector-db-benchmark. You are welcome to fork it and make your own measurements if you suspect something.

Benchmarks like https://big-ann-benchmarks.com/index.html#organizers are good for comparing algorithms, but not engines. They are focused on a single usage scenario and do not cover the variety of possible applications, for example how filtering affects performance.

