Thanks for the recs, we will look into adding some of these, maybe OCaml for var...

isityettime · 2026-05-12T10:51:30 1778583090

Racket is a variety of Scheme that grew up as a teaching language, but now has a few other notable niches as well.

Typed Racket is to Racket as TypeScript is to JavaScript: it adds some additional static checks to an otherwise dynamic language via gradual typing. This pair of languages might help begin to answer the question "does gradual typing generally help LLMs, or does TypeScript outperform JavaScript for incidental reasons?".

Among Lisps, I'm most interested in seeing Clojure because it's a language I can see myself using with LLMs at work. But Typed Racket and Racket could make an especially interesting pair because of the gradual typing thing.

I'm not sure whether you want to include them in your project. The kind of selectivity you describe yourself as going for is hard for me, especially since I'm not the one doing the work. :)

PS: Aside from this benchmarking and comparison project: Racket is an interesting language and seems like a good place to start if you want to explore classic Scheme texts (Structure and Interpretation of Computer Programs, The Little Schemer, How to Design Programs) or newer ones that try to teach newer or more specialized ideas (e.g., The Little Typer). You may have to tweak the language a bit to stay faithful to some of those books, but that's something Racket is good at and there are already sources noting relevant differences online.

When a non-programmer in my life expressed curiosity about programming, we ended up starting HtDP together and it's been fun. I think Racket was a good choice for that.

phillc73 · 2026-05-12T06:15:43 1778566543

Thanks for that, I hadn't scrolled down far enough.

Just want to be sure I'm reading the results correctly... When I compare GPT-5.5 with Mistral Medium 3.5, I see in the tables:

a) Mistral beats GPT in Java and C++

b) It's close for Rust

c) GPT-5.5 easily wins for Go, JavaScript, Python and TypeScript

Model choice really does appear to be language dependent (assuming I'm reading the results correctly).

gertlabs · 2026-05-12T06:52:40 1778568760

The deeper you go into the filters (single models, cross-correlated by specific languages), the smaller your sample sizes. That's a known limitation. Tbh, I doubt Mistral is better than GPT 5.5 at programming in any specific language; we probably hit a few lower-quality generations from GPT 5.5 by chance (but I could be wrong! We're always adding more samples, so the data improves over time, and we prioritize the largest sample counts for near-frontier models first).
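To put rough numbers on the sample-size point, here is a minimal Python sketch. It assumes each generation is an independent pass/fail trial, which is a simplification and not necessarily how the leaderboard scores things. The function name `wilson_interval` and the sample counts are made up for illustration; the only technique used is the standard Wilson score interval for a binomial proportion.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence.
    Returns a (low, high) tuple.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical sample counts: the same 25% observed pass rate
    # at three different sample sizes.
    for trials in (20, 200, 2000):
        successes = trials // 4
        low, high = wilson_interval(successes, trials)
        print(f"n={trials:5d}  observed={successes / trials:.2f}  "
              f"interval=({low:.3f}, {high:.3f})")
```

The interval narrows as the number of trials grows, so two models whose intervals overlap heavily at small n can't really be ranked against each other in that filtered view.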

regularfry · 2026-05-12T08:55:37 1778576137

What's going on with Qwen3.6 27b? Filtered to Python it comes out at the top of the list, which seems... well, unlikely.

johndough · 2026-05-12T11:11:55 1778584315

While Qwen3.6 27B and 35B-A3B are very good, I am skeptical about them being that good. I think another factor is at play here.

The Qwen3.6 models have memorized some common games. For example, if you ask one to create an index.html with a snake game, it will generate almost the same high-quality snake game every time. The relatively low success rate of 25% but high average percentile of almost 100% for one-shot coding in Python suggests that the model is extremely good at a few tasks.

gertlabs · 2026-05-12T15:48:45 1778600925

The more filters applied (one-shot coding only, Python only), the more variation you can expect from fewer samples -- that being said, it really is a great model so it's probably not too far above where it would end up with infinite samples.

2ndorderthought · 2026-05-12T10:23:27 1778581407

Qwen3.6 27b is a really strong model.

regularfry · 2026-05-12T11:53:02 1778586782

Yeah but that strong?

2ndorderthought · 2026-05-12T12:11:10 1778587870

Yes, that strong. It's only lacking in context length, but it's not that small there, and it gets caught in circles more often than, say, a 1T-parameter model does.

That's why a lot of people have been freaking out about local LLMs since April. There's finally a decent model that runs locally on a GPU or two and can do agentic programming at a reasonable enough tokens per second.

johndough · 2026-05-12T16:07:15 1778602035

> it gets caught in circles more often than, say, a 1T-parameter model does.

I've found that the Q5+ quants are less loopy than Q4. Still not perfect, but noticeably better.

> reasonable enough tokens per second

The speed has been amazing. I've been running the recent llama.cpp MTP branch with an uncensored variant of Qwen3.6-35B-A3B on my RTX 3090 at over 170 tokens per second, and it was able to turn a buffer overflow into a reliable shell exploit in just a few seconds (with reasoning disabled). Still a bit loopy though. Hopefully, the Qwen team will pay more attention to those looping issues. It feels like their models are especially susceptible.

2ndorderthought · 2026-05-12T17:23:01 1778606581

Is that on a single 3090? Sounds like I need to change my settings.

johndough · 2026-05-12T20:05:53 1778616353

Yes, single RTX 3090 with this model https://huggingface.co/llmfan46/Qwen3.6-35B-A3B-uncensored-h... following these https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF instructions (you should add "-j 8" to the last cmake command for a parallel build), and running llama-server with --reasoning off.

Note that the MTP PR https://github.com/ggml-org/llama.cpp/pull/22673 is still under development, so things might be broken.