That's why you build extensive tooling to run your change hundreds of times in parallel against the context you're trying to fix, and then re-run hundreds of past scenarios in parallel to verify none of them breaks.
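A minimal sketch of the shape such a harness might take (all names here are hypothetical, and run_scenario stands in for whatever actually invokes the agent and scores the result):

  # Hypothetical parallel eval harness -- a sketch, not anyone's real tooling.
  import random
  from concurrent.futures import ThreadPoolExecutor

  def run_scenario(scenario: dict) -> bool:
      """Placeholder scorer: swap in a real agent call plus assertions."""
      return random.random() > 0.1  # simulate a flaky, non-deterministic agent

  def failing_scenarios(scenarios: list[dict], workers: int = 32) -> list[dict]:
      """Run every scenario in parallel and return the ones that failed."""
      with ThreadPoolExecutor(max_workers=workers) as pool:
          results = list(pool.map(run_scenario, scenarios))
      return [s for s, ok in zip(scenarios, results) if not ok]

  # Hammer the scenario you're trying to fix with the candidate change...
  repro_failures = failing_scenarios([{"id": "target"}] * 200)
  # ...then re-run the historical suite to check nothing regressed.
  regressions = failing_scenarios([{"id": f"past-{i}"} for i in range(300)])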


In the event this comment is slathered in sarcasm:

  Well done!  :-D


Do you use a tool for this? Is there some sort of tool that collects evals from live inferences (especially those that fail)?


There is no way to prove the correctness of non-deterministic (a.k.a. probabilistic) results for any interesting generative algorithm. All one can do is validate against a known set of tests, with the understanding that the set is unbounded over time.
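In practice that means gating on an empirical pass rate rather than a single run - a rough sketch, with illustrative thresholds:

  def pass_rate(test_fn, trials: int = 100) -> float:
      """Run a probabilistic test repeatedly; return the fraction that pass."""
      return sum(1 for _ in range(trials) if test_fn()) / trials

  def assert_mostly_passes(test_fn, trials: int = 100, threshold: float = 0.95):
      rate = pass_rate(test_fn, trials)
      assert rate >= threshold, f"pass rate {rate:.2%} below {threshold:.0%}"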


https://x.com/rerundotio/status/1968806896959402144

This is a use of Rerun that I haven't seen before - pretty fascinating! Typically people use Rerun to visualize robotics data. If I'm following along correctly, what's interesting here is that Adam, for his master's thesis, is using Rerun to visualize agent state - agent as in a software/LLM agent.

https://github.com/gustofied/P2Engine
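For anyone curious what logging agent state to Rerun looks like, roughly this (the agent loop and messages here are made up; rr.set_time_sequence is the pre-0.23 spelling, and newer SDKs use rr.set_time):

  import rerun as rr

  rr.init("agent_trace", spawn=True)  # opens the Rerun viewer

  steps = [
      ("planner", "decompose task into subtasks"),
      ("tool_call", "search(query='agent eval frameworks')"),
      ("critic", "result looks stale, retrying"),
  ]

  for i, (role, message) in enumerate(steps):
      rr.set_time_sequence("agent_step", i)         # one tick per agent step
      rr.log(f"agent/{role}", rr.TextLog(message))  # shows up on the viewer timeline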


For sure - for instance, Google has the ADK Eval framework. You write tests, and you can easily run them against a given input. I'd say it's a bit unpolished, as is the rest of the rapidly developing ADK framework, but it does exist.
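The pytest integration looks roughly like this (the import path and signature have shifted across ADK releases, so treat it as illustrative and check the current docs):

  import pytest
  from google.adk.evaluation.agent_evaluator import AgentEvaluator

  @pytest.mark.asyncio
  async def test_agent_against_eval_set():
      # "my_agent" and the eval set path are hypothetical placeholders.
      await AgentEvaluator.evaluate(
          agent_module="my_agent",
          eval_dataset_file_path_or_dir="my_agent/simple.test.json",
      )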


Heya - I'm building this. It's been used in prod for a month now and has saved my customer's ass while building general workflow automation agents. Happy to chat if you're interested.

darin@mcptesting.com

(gist: evals as a service)



