Hey HN! I’ve recently open-sourced Pyversity, a lightweight library for diversifying retrieval results. Most retrieval systems optimize only for relevance, which can lead to top-k results that look almost identical. Pyversity efficiently re-ranks results to balance relevance and diversity, surfacing items that remain relevant but are less redundant. This helps improve retrieval, recommendation, and RAG pipelines without adding latency or complexity.
Main features:
- Unified API: one function (diversify) supporting several well-known strategies: MMR, MSD, DPP, and COVER, with more to come (see the usage sketch after this list)
- Lightweight: the only dependency is NumPy, keeping the package small and easy to install
- Fast: efficient implementations for all supported strategies; diversify results in milliseconds
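Here’s a rough usage sketch. The argument names and the strategy spelling below are illustrative rather than the exact released API, so check the README for the precise signature:

    import numpy as np
    from pyversity import diversify  # NumPy is the only hard dependency

    # Toy inputs: embeddings and relevance scores for the top-100 retrieved items.
    embeddings = np.random.rand(100, 384)
    scores = np.random.rand(100)

    # Keep the 10 items that best trade off relevance against redundancy.
    # NOTE: argument names and the strategy value are illustrative;
    # see the README for the exact signature.
    reranked = diversify(embeddings, scores, k=10, strategy="mmr")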
Re-ranking with cross-encoders is very popular right now, but also very expensive. In my experience, you can usually improve retrieval results with simpler and faster methods, such as the ones implemented in this package. This helps retrieval, recommendation, and RAG systems present richer, more informative results by ensuring each new item adds new information.
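To give a concrete idea of what these methods do, here is a minimal NumPy sketch of MMR-style greedy re-ranking. It is illustrative only (function name and defaults are mine, not the package’s actual implementation):

    import numpy as np

    def mmr_rerank(embeddings, relevance, k, lam=0.5):
        # Greedy Maximal Marginal Relevance: at each step pick the item that
        # maximizes lam * relevance - (1 - lam) * (max cosine similarity to
        # anything already selected), so each new item adds new information.
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        relevance = np.asarray(relevance, dtype=float)
        selected, candidates = [], list(range(len(embeddings)))
        while candidates and len(selected) < k:
            if selected:
                redundancy = (embeddings[candidates] @ embeddings[selected].T).max(axis=1)
            else:
                redundancy = np.zeros(len(candidates))
            scores = lam * relevance[candidates] - (1 - lam) * redundancy
            best = candidates[int(np.argmax(scores))]
            selected.append(best)
            candidates.remove(best)
        return selected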
Code and docs: github.com/pringled/pyversity
Let me know if you have any feedback, or suggestions for other diversification strategies to support!
Consider this simple test I’ve been running:
Anchor: “A background service listens to a task queue and processes incoming data payloads using a custom rules engine before persisting output to a local SQLite database.”
Option A (Lexical Match): “A background service listens to a message queue and processes outgoing authentication tokens using a custom hash function before transmitting output to a local SQLite database.”
Option B (Semantic Match): “An asynchronous worker fetches jobs from a scheduling channel, transforms each record according to a user-defined logic system, and saves the results to an embedded relational data store on disk.”
Any decent LLM (e.g., Gemini 2.5 Pro, GPT-4/5) immediately recognizes that the Anchor and Option B describe the same concept, just with different words. But when I test embedding models like gemini-embedding-001 (currently at the top of MTEB), they consistently rate Option A as more similar, as measured by cosine similarity. They’re getting tricked by surface-level word overlap.
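If you want to try this locally without an API key, here’s a minimal sketch using sentence-transformers with a small open model as a stand-in (the model choice is mine, not what the repo uses):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    anchor = ("A background service listens to a task queue and processes incoming data "
              "payloads using a custom rules engine before persisting output to a local "
              "SQLite database.")
    option_a = ("A background service listens to a message queue and processes outgoing "
                "authentication tokens using a custom hash function before transmitting "
                "output to a local SQLite database.")
    option_b = ("An asynchronous worker fetches jobs from a scheduling channel, transforms "
                "each record according to a user-defined logic system, and saves the results "
                "to an embedded relational data store on disk.")

    # Any embedding model can be dropped in here; this one is just a convenient stand-in.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([anchor, option_a, option_b], normalize_embeddings=True)

    # With unit-normalized vectors, cosine similarity is a plain dot product.
    sim_a = float(np.dot(emb[0], emb[1]))  # lexical match
    sim_b = float(np.dot(emb[0], emb[2]))  # semantic match
    print(f"anchor vs A (lexical):  {sim_a:.3f}")
    print(f"anchor vs B (semantic): {sim_b:.3f}")
    print("pass" if sim_b > sim_a else "fail: lexical distractor wins")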
I put together a small GitHub repo that uses ChatGPT to generate and test these “semantic triplets”:
https://github.com/semvec/embedstresstest
gemini-embedding-001 (currently #1 on the MTEB leaderboard) scored close to 0% on these adversarial examples.
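For context, the score is just the fraction of triplets where the semantic paraphrase ends up closer to the anchor than the lexical distractor. Roughly, as a sketch of that metric (my own code, assuming unit-normalized embeddings):

    import numpy as np

    def pass_rate(anchor_embs, lexical_embs, semantic_embs):
        # Each argument: (n_triplets, dim) array of unit-normalized embeddings.
        sim_lex = np.sum(anchor_embs * lexical_embs, axis=1)
        sim_sem = np.sum(anchor_embs * semantic_embs, axis=1)
        return float(np.mean(sim_sem > sim_lex))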
The repo is unpolished at the moment, but it gets the idea across and everything is reproducible.
Anyway, has anyone else noticed this problem?