
Well, this method is based on the assumption that embeddings accurately represent the texts and that their structural relations are preserved.
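
You can sanity-check that assumption with an off-the-shelf embedder, something like this (the model here is purely illustrative, not necessarily what this project uses):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Near-duplicate questions should land close together in embedding space
# and unrelated ones far apart; that is the assumption the method leans on.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([
    "How do I reverse a list in Python?",
    "What is the way to invert a Python list?",  # paraphrase of the first
    "Best sourdough starter recipe",             # unrelated
])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
print(emb[0] @ emb[1])  # high cosine similarity (near-duplicates)
print(emb[0] @ emb[2])  # much lower (unrelated)
```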

So long as you have all the random seeds fixed, I think reproduction should be straightforward.
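
For reference, "fixing all the seeds" in a PyTorch-based pipeline usually looks something like this (the helper name is made up):

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    """Fix every RNG that commonly affects an ML run (hypothetical helper)."""
    random.seed(seed)                  # Python's built-in RNG
    np.random.seed(seed)               # NumPy's global RNG
    torch.manual_seed(seed)            # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)   # all CUDA devices, if present
    torch.backends.cudnn.deterministic = True  # trade speed for determinism
    torch.backends.cudnn.benchmark = False

seed_everything(42)
```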


Thanks for the feedback! The reason the "code" part is more complete than the "research" part is that I originally planned for this to be just a hobby project, and only much later decided to try to turn it into serious research work.

Not trying to make excuses, though. Your points are very valid and I will take them into account!


Correct.


Hi, OP here. I would have to somewhat disagree. You raised some interesting points, but I don't think something qualifies as a *moat* if it can be overcome just by sharing the use cases. For example, we all know Google's use case is search, but no one has built a search engine as good as theirs. Their moat is in their technology and brand recognition.


Not to disagree with your argument as a whole, but Google's moat hasn't been technological for years; instead it comes from their ability to be the default search engine everywhere they can, even if they need to pay Apple billions for that position.


This would be an inner-loop process. However, the selection is much faster than LLM inference, so it shouldn't be noticeable (hopefully).


Hi, OP here. I would say not really, because the goals are different. Although both use retrieval techniques, RAG retrieves in order to augment your query with factual information, whereas here we retrieve in order to evaluate on as few queries as possible (with performance guaranteed by Bayesian optimization).


I designed two modes in the project: exploration mode and exploitation mode.

Exploration mode uses entropy search to explore the latent space (this is the mode used when evaluating the LLM on the selected corpus), and exploitation mode is used to figure out how well or badly the model is performing in which regions of the selected corpus.

For accurate evaluations, exploration is used. However, I'm also working on a visualization tool so that users can see how well the model performs in each region (courtesy of the Gaussian process models that Bayesian optimization builds), and that is where exploitation mode comes in handy.
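
In rough, generic terms, the two modes differ in the acquisition rule over a GP surrogate. A toy sketch (the real implementation uses entropy search, which is more involved than the max-variance proxy below):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_seen = rng.normal(size=(20, 8))   # embeddings of queries already evaluated
y_seen = rng.uniform(size=20)       # the LLM's observed scores on them
X_pool = rng.normal(size=(500, 8))  # embeddings of candidate queries

# GP surrogate mapping query embeddings -> eval scores
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X_seen, y_seen)
mu, sigma = gp.predict(X_pool, return_std=True)

next_explore = X_pool[np.argmax(sigma)]  # exploration: most uncertain region
next_exploit = X_pool[np.argmin(mu)]     # exploitation: drill into where the model looks worst
```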

Sorry for the slightly messy explanation. Hope it clarifies things!


Thanks for the explanation!

I don't entirely understand what the two modes mean here, because typically the search strategy (or acquisition function) in Bayesian optimization - which in your case seems to be some form of entropy search (ES) - decides the explore-vs-exploit tradeoff by itself (possibly with some additional hyperparameters, of course). For example, ES would do this one way, Expected Improvement (EI) would do it differently, etc. - all of this in the service of the BayesOpt objective you want to maximize (or minimize).

Assuming that you mean this objective when you mention exploitation, which here is based on the model performing well, wouldn't it just pick queries that the model can (or is likely to) answer correctly? This would be a very optimistic evaluation of the LLM.
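
A toy illustration of that concern (made-up numbers): greedily selecting queries the surrogate predicts the model will ace inflates the measured accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)
true_scores = rng.uniform(size=1000)  # per-query probability the LLM answers correctly
predicted = true_scores + rng.normal(scale=0.1, size=1000)  # noisy surrogate estimate

# "Exploiting the model doing well": pick the 100 easiest-looking queries
subset = np.argsort(predicted)[-100:]
print(true_scores.mean())          # honest pool-level accuracy, around 0.5
print(true_scores[subset].mean())  # subset estimate, around 0.9; far too optimistic
```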


Fair question.

Evaluation refers to the phase after training where you check how well the trained model actually performs.

Usually the flow goes training -> evaluation -> deployment (what you called inference). This project is aimed at the evaluation step. Evaluation can be slow (it might even be slower than training if you're fine-tuning on a small domain-specific subset)!

So there are [quite](https://github.com/microsoft/promptbench) [a](https://github.com/confident-ai/deepeval) [few](https://github.com/openai/evals) [frameworks](https://github.com/EleutherAI/lm-evaluation-harness) working on evaluation. However, all of them are quite slow, because LLMs are slow if you don't have infinite money. [This](https://github.com/open-compass/opencompass) one tries to speed things up by parallelizing across multiple machines, but none of them takes advantage of the fact that many evaluation queries may be similar; they all evaluate on every given query. And that's where this project might come in handy.
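
To make the "similar queries are redundant" point concrete, here's a generic sketch of picking a small, mutually dissimilar subset by embedding similarity (a plain heuristic for illustration; the project itself does this with Bayesian optimization, not farthest-point sampling):

```python
import numpy as np

def farthest_point_subset(emb: np.ndarray, k: int) -> list[int]:
    """Greedy farthest-point sampling: pick k mutually dissimilar queries."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # cosine geometry
    chosen = [0]
    d = 1.0 - emb @ emb[0]  # distance of each query to its nearest chosen query
    for _ in range(k - 1):
        nxt = int(np.argmax(d))  # the query most dissimilar to the current subset
        chosen.append(nxt)
        d = np.minimum(d, 1.0 - emb @ emb[nxt])
    return chosen

emb = np.random.default_rng(2).normal(size=(10_000, 384))  # stand-in query embeddings
subset = farthest_point_subset(emb, k=200)  # evaluate on 200 queries instead of 10k
```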


Your explanations are still unclear.

I know what evaluation is, and inference, and training. Deployment means to deploy - to put a model in production. It does not mean inference. Inference means to input a prompt into a model and get the next token, or tokens as the case may be. Training and inference are closely related, since during training, inference is run and the error given by the difference between the prediction and target is backpropagated, etc.

Evaluation is running inference over a suite of tests and comparing the outcomes to some target ideal. An evaluation on the MMLU dataset lets you run inference on zero- and few-shot prompts to test the knowledge and function acquisition of your model, for example.
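
In code-shaped terms, evaluation in that sense is just this (a schematic sketch; `model.generate` stands in for whatever inference API you use):

```python
def evaluate(model, dataset) -> float:
    """Run inference over a test suite and score it against the targets."""
    correct = 0
    for example in dataset:  # e.g. MMLU-style (prompt, gold answer) pairs
        prediction = model.generate(example["prompt"])  # hypothetical inference call
        correct += prediction.strip() == example["answer"]
    return correct / len(dataset)  # accuracy over the whole suite
```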

So is your code using Bayesian Optimization to select a subset of a corpus, like a small chunk of the MMLU dataset, that is representative of the whole, so you can test on that subset instead of the whole thing?


Hi, OP here. It's not 10x faster inference, but faster evaluation. You evaluate on a dataset to check whether your model is performing well. This takes a lot of time (it might take longer than training if you are just fine-tuning a pre-trained model on a small dataset)!

So the pipeline goes training -> evaluation -> deployment (inference).

Hope that explanation helps!


Hi, OP here. So you evaluate LLMs on corpora to measure their performance, right? Bayesian optimization is here to select points (in the latent space) and decide where to evaluate the LLM next. To be precise, entropy search is used here (coupled with some latent-space reduction techniques like N-sphere representation and embedding whitening). Hope that makes sense!
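
For the two preprocessing steps, a rough sketch of what they generically mean (the project's exact formulation may differ): whitening decorrelates the embedding dimensions, and projecting onto the unit N-sphere makes distances behave like cosine distances.

```python
import numpy as np

def whiten(emb: np.ndarray) -> np.ndarray:
    """PCA whitening: rotate onto principal axes, rescale to unit variance."""
    centered = emb - emb.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return centered @ eigvecs / np.sqrt(eigvals + 1e-8)

def to_sphere(emb: np.ndarray) -> np.ndarray:
    """Project each embedding onto the unit N-sphere."""
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

emb = np.random.default_rng(3).normal(size=(1000, 64))  # stand-in embeddings
latent = to_sphere(whiten(emb))  # the reduced space the search operates over
```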


The definition of "evaluate" isn't clear. Do you mean inference?


Perhaps I should clarify this in the project README. Evaluation is the phase where you check how well your model is performing. So the pipeline goes training -> evaluation -> deployment (inference), corresponding to the datasets in supervised learning: training uses the training set, evaluation uses the validation set, and deployment is checked against the test set.
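
Spelled out with a standard split, just to make the correspondence concrete:

```python
from sklearn.model_selection import train_test_split

data = list(range(1000))  # placeholder for (input, label) pairs
train, rest = train_test_split(data, test_size=0.3, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
# train -> training phase, val -> evaluation phase, test -> final check before deployment
```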

