Hacker News | paraschopra's comments

(founder of Lossfunk, the lab behind this research.)

Esolang-Bench went viral on X and a lot of discussion ensued. Addressing some of the common questions that came up. Hope it helps.

a) Why do it? Does it measure anything useful?

It was a curiosity-driven project. We're interested in how humans exhibit sample-efficiency in learning and OOD generalization. So we simply asked: if models can zero/few-shot correct answers to simple programming problems in Python, can they do the same in esoteric languages?

The benchmark is what it is. Different people can interpret its usefulness differently, and we encourage that.
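To make the setup concrete: grading in a benchmark like this reduces to running the model's submitted program and comparing its output to the expected answer. Below is a minimal, hypothetical sketch (not our actual harness; `run_bf` and `grade` are illustrative names), using Brainfuck purely as an example of an esoteric language:

```python
def run_bf(code, inp=""):
    """Tiny Brainfuck interpreter: returns the program's output string."""
    tape, ptr, out, i, inp_i = [0] * 30000, 0, [], 0, 0
    jumps, stack = {}, []
    for j, c in enumerate(code):          # pre-match brackets for jumps
        if c == "[":
            stack.append(j)
        elif c == "]":
            k = stack.pop()
            jumps[k], jumps[j] = j, k
    while i < len(code):
        c = code[i]
        if c == ">": ptr += 1
        elif c == "<": ptr -= 1
        elif c == "+": tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".": out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = ord(inp[inp_i]) if inp_i < len(inp) else 0
            inp_i += 1
        elif c == "[" and tape[ptr] == 0: i = jumps[i]
        elif c == "]" and tape[ptr] != 0: i = jumps[i]
        i += 1
    return "".join(out)

def grade(submission, expected, inp=""):
    """Pass/fail check a harness might apply to a model's program."""
    try:
        return run_bf(submission, inp) == expected
    except Exception:
        return False

# "A" is ASCII 65: sixty-five '+' instructions then '.'
print(grade("+" * 65 + ".", "A"))  # True
```

Exact-match grading like this is the simplest possible check; a real harness would also sandbox execution and cap runtime.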

b) But humans can't write esoteric languages well either. It's an unfair comparison.

Primarily, we're interested in measuring LLM capabilities. With all the talk of ASI, their capabilities are supposed to soon be superhuman. So our primary motivation wasn't to compare to humans, but to check what they can do on this by-construction difficult benchmark.

However, we do believe that humans are able to teach themselves a new domain by transferring their old skills. This benchmark sets a starting point for exploring whether AI systems can do the same (which is what we're exploring now).

c) But Claude Code crushes it. You limited models artificially.

Yes, we tested models in zero- and few-shot settings. And in the agentic loop we describe in the paper, we limit the number of iterations. As we wrote above, we wanted to understand their performance from a comparative point of view (say, against highly represented languages like Python), and that's why the benchmark is designed this way.

After the paper was finalized, we experimented with agentic systems where we gave models tools like bash and allowed unlimited iterations (but limited submission attempts). They indeed perform much better.

The relevant question is what makes these models perform so well when you give them tools and iterations vs. when you don't. Are they reasoning/learning like humans, or is it something else?
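For intuition, the tools-and-iterations setup can be sketched as a loop where exploration (tool calls) is unlimited but submissions are capped. This is a hypothetical sketch, not the harness from the paper; all function names are illustrative:

```python
def agentic_eval(model_step, run_tool, check_submission, max_submissions=3):
    """Drive a model: unlimited tool calls, capped submission attempts.

    model_step(feedback) returns ("tool", cmd) or ("submit", program).
    """
    feedback, submissions = None, 0
    while submissions < max_submissions:
        action, payload = model_step(feedback)
        if action == "tool":
            feedback = run_tool(payload)        # e.g. run bash, read docs
        else:
            submissions += 1
            ok, feedback = check_submission(payload)
            if ok:
                return True
    return False

# Toy "model": explores once with a tool, then submits what the tool returned.
def toy_model(feedback):
    return ("submit", feedback) if feedback else ("tool", "look up answer")

result = agentic_eval(
    toy_model,
    run_tool=lambda cmd: "42",
    check_submission=lambda ans: (ans == "42", "wrong answer"),
)
print(result)  # True
```

The asymmetry in the cap is the interesting knob: with cheap exploration, a model can learn the language's semantics by trial before it ever spends a submission.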

d) So, are LLMs hyped? Or is our study clickbait?

The paper, code and benchmark are all open source.

We encourage whoever is interested to read it, and make up their own minds.

(We couldn't help noticing that the same set of results was interpreted wildly differently within the community. A debate between opposing camps on LLMs ensued. Perhaps that's a good thing?)


I’m very happy that Anthropic chose not to cave in to the US Dept of War’s demands, but their statement has an ambiguity.

Does this mean they’d be OK with their models being used for mass surveillance & autonomous weapons against OTHER countries?

A clarification would help.


Do you have more info on the video encoding process?

You write:

>We created a model without this tradeoff by training our video encoder on a masked compression objective

And I understand why this would give you more detail per token, but how are you reducing the total number of tokens?


Curious - how much did this cost to train?


It generated this: https://paraschopra.github.io/explainers/optical-interferome...

I haven't checked it, but I'm curious about your feedback.


What was the source of inspiration for Claude? I skimmed through the text and it does not look too bad, but the devil is in the details and I need more time to go through the fine print. One remark: the Young slit experiment could show what happens with a single slit vs. two.


Current prompt is like this:

I want to build a self-contained html/js/css file explainer page as close as possible to this explainer: https://explainers.blog/posts/why-is-the-sky-blue/

What I want you to do is this:

- Install playwright and chromium headless to take screenshots of https://explainers.blog/posts/why-is-the-sky-blue/ and interact with the page to deeply understand its style, aesthetics, tone, interactivity, visuals, fonts, etc.

- Make comprehensive notes of what you observe so you can implement EXACTLY that when building your explainer.

- Then, on the topic provided below, plan to build an explainer with similar length, quality, interactivity, and writing style, as fun and informative as the article given. Produce animations in svg (or otherwise) and interactions as necessary. Similar colour scheme but fun/vibrant/happy. Be very very creative. Act like an expert UI/UX designer who can build stunning explainers. Target it at an intelligent Hacker News reader.

- Get your plan verified by codex.

- Produce the page one small change at a time. Don't output big chunks in one go. But pay extra attention to the number of sections and the length of the explainer. I want it to be as comprehensive as possible (don't skimp on length).

- Keep testing what you produce via playwright on chromium headless.

After you’re finished with index.html, can you check via chromium that all animations, diagrams, and interactions match their captions and are visually ok (not too small, too large, overlapping, etc.)? Sometimes there are mismatches between what the caption or text says and what the diagram suggests.

Topic: diffusion models from first principles


Thank you very much!


I pointed Claude Code towards https://explainers.blog/posts/why-is-the-sky-blue/ and told it to take screenshots and build something like it on the topic provided.


For the LLM explainer, did you point Claude at this one? https://explainextended.com/2023/12/31/happy-new-year-15/ This Claude-assisted page rhymes with that one. Sorta.


I verified the Fourier one and the LLM one. The scaling law one is likely okay too, as I read the book long ago.


yes, i noticed that occasionally, but i'm curious: which one did you find incorrect?


Oh this was just snark.


Yeah, that specific one doesn't work so well but apart from it, does any other example not work?


The Fourier transform audio examples fooled me. The example sounds and the sliders for them appeared consistent as far as I could tell... but then again, I don't know much about Fourier transforms.

Maybe I'm out of the loop but have to say this is the first time I have seen an LLM generate a webpage with working audio widgets.
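For anyone who wants to sanity-check a Fourier widget like that themselves, a quick stdlib-only test is to take the DFT of a pure tone and confirm the energy lands in the expected frequency bin; a minimal sketch:

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for a sanity check)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

n, freq = 64, 5                     # 64 samples of a 5-cycles-per-window sine
signal = [math.sin(2 * math.pi * freq * t / n) for t in range(n)]
spectrum = [abs(c) for c in dft(signal)]

# Energy should sit in bin 5 and its conjugate mirror, bin 64 - 5 = 59,
# with magnitude n/2 each, and (near) zero everywhere else.
peak = max(range(n), key=lambda k: spectrum[k])
print(peak)  # bin 5 (or its mirror, 59)
```

The same idea applies to the widget's audio: a single slider component should correspond to a single spectral peak in the rendered sound.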


If you liked that explanation of Fourier transforms, you'll probably like this one: https://www.jezzamon.com/fourier/index.html


yep, i was pretty surprised by audio widgets too.

