That's why you build extensive tooling to run your change hundreds of times in parallel against the context you're trying to fix, and then re-run hundreds of past scenarios in parallel to verify none of them breaks.
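A minimal sketch of the shape such a harness might take (all names here are hypothetical, and run_scenario stands in for whatever actually invokes the agent and scores the result):

  # Hypothetical parallel eval harness -- a sketch, not anyone's real tooling.
  import random
  from concurrent.futures import ThreadPoolExecutor

  def run_scenario(scenario: dict) -> bool:
      """Placeholder scorer: swap in a real agent call plus assertions."""
      return random.random() > 0.1  # simulate a flaky, non-deterministic agent

  def failing_scenarios(scenarios: list[dict], workers: int = 32) -> list[dict]:
      """Run every scenario in parallel and return the ones that failed."""
      with ThreadPoolExecutor(max_workers=workers) as pool:
          results = list(pool.map(run_scenario, scenarios))
      return [s for s, ok in zip(scenarios, results) if not ok]

  # Hammer the scenario you're trying to fix with the candidate change...
  repro_failures = failing_scenarios([{"id": "target"}] * 200)
  # ...then re-run the historical suite to check nothing regressed.
  regressions = failing_scenarios([{"id": f"past-{i}"} for i in range(300)])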


In the event this comment is slathered in sarcasm:

  Well done!  :-D


Do you use a tool for this? Is there some sort of tool that collects evals from live inferences (especially those that fail)?


There is no way to prove the correctness of non-deterministic (a.k.a. probabilistic) results for any interesting generative algorithm. All one can do is validate against a known set of tests, with the understanding that the set is unbounded over time.
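In practice that means gating on an empirical pass rate rather than a single run - a rough sketch, with illustrative thresholds:

  def pass_rate(test_fn, trials: int = 100) -> float:
      """Run a probabilistic test repeatedly; return the fraction that pass."""
      return sum(1 for _ in range(trials) if test_fn()) / trials

  def assert_mostly_passes(test_fn, trials: int = 100, threshold: float = 0.95):
      rate = pass_rate(test_fn, trials)
      assert rate >= threshold, f"pass rate {rate:.2%} below {threshold:.0%}"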


https://x.com/rerundotio/status/1968806896959402144

This is a use of Rerun that I haven't seen before - pretty fascinating! Typically people use Rerun to visualize robotics data. If I'm following along correctly, what's interesting here is that Adam, for his master's thesis, is using Rerun to visualize agent state - agent as in a software/LLM agent.

https://github.com/gustofied/P2Engine
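For anyone curious what logging agent state to Rerun looks like, roughly this (the agent loop and messages here are made up; rr.set_time_sequence is the pre-0.23 spelling, and newer SDKs use rr.set_time):

  import rerun as rr

  rr.init("agent_trace", spawn=True)  # opens the Rerun viewer

  steps = [
      ("planner", "decompose task into subtasks"),
      ("tool_call", "search(query='agent eval frameworks')"),
      ("critic", "result looks stale, retrying"),
  ]

  for i, (role, message) in enumerate(steps):
      rr.set_time_sequence("agent_step", i)         # one tick per agent step
      rr.log(f"agent/{role}", rr.TextLog(message))  # shows up on the viewer timeline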


For sure - for instance, Google has the ADK Eval framework. You write tests, and you can easily run them against a given input. I'd say it's a bit unpolished, as is the rest of the rapidly developing ADK framework, but it does exist.
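The pytest integration looks roughly like this (the import path and signature have shifted across ADK releases, so treat it as illustrative and check the current docs):

  import pytest
  from google.adk.evaluation.agent_evaluator import AgentEvaluator

  @pytest.mark.asyncio
  async def test_agent_against_eval_set():
      # "my_agent" and the eval set path are hypothetical placeholders.
      await AgentEvaluator.evaluate(
          agent_module="my_agent",
          eval_dataset_file_path_or_dir="my_agent/simple.test.json",
      )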


Heya - I'm building this. It's been used in prod for a month now and has saved my customer's ass while building general workflow automation agents. Happy to chat if you're interested.

darin@mcptesting.com

(gist: evals as a service)



