https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker. I’m not sure of their exact methodology, but in day-to-day programming with Claude / GPT models I’ve felt qualitatively what they report.
There are many benchmarks, each for a specific use case, but on those the difference between models tends to be a single point at the top of the range (93% vs 92%).
I think that tracks, but still, it was refreshing to see a benchmark that helps me form better opinions.
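To put a number on why a 93% vs 92% gap is hard to read anything into, here is a rough sketch of the sampling noise on a pass rate. The benchmark size of 500 questions is purely an assumption for illustration, and the two scores are treated as independent samples (a paired comparison on the same questions would be tighter).

```python
import math


def accuracy_standard_error(accuracy: float, num_questions: int) -> float:
    """Standard error of a pass rate measured on num_questions independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_questions)


def gap_in_standard_errors(acc_a: float, acc_b: float, num_questions: int) -> float:
    """How many standard errors separate two scores on a benchmark of the same size."""
    se_a = accuracy_standard_error(acc_a, num_questions)
    se_b = accuracy_standard_error(acc_b, num_questions)
    # Standard error of the difference, assuming the two scores are independent.
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) / se_diff


if __name__ == "__main__":
    n = 500  # hypothetical benchmark size, not taken from any real benchmark
    print(f"standard error at 93%: {accuracy_standard_error(0.93, n):.4f}")
    print(f"gap in standard errors: {gap_in_standard_errors(0.93, 0.92, n):.2f}")
```

With these assumed numbers the one-point gap is well under one standard error, so on a benchmark of that size it would be hard to distinguish from noise.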
Surprised about Mimo v2.5. On artificial-analysis and other benchmarks, the difference between Mimo and Deepseek seems very small, yet a lot of the focus (hype?) is on Deepseek.
But Mimo seems like an interesting model, and they're running some crazy discounts too.
Deepseek is valuable for the research community because of how open they are, but it's absolutely crazy to think how Xiaomi basically pulled up with Mimo, given that they didn't have anything until quite recently.
Either way, an interesting benchmark. Also a plus for giving Go decent representation, on par with Python/TypeScript.
I think there's a set of tasks that resemble normal benchmarks, where open source models are absolutely fine, and a very small fraction of more technical tasks where the benchmark you linked becomes the better predictor. So it depends on the scale of complexity, but it's good to see how models compete given enough complexity. Definitely fascinating.
I would be interested to see more models compete on this test. The current range is still a bit limited compared to other benchmarks, but OSS models like Kimi/Mimo seem to be only 3-4 months (6 at most) behind closed source.
Having used both Deepseek v4 Pro and Mimo v2.5 for agentic coding, I'm not surprised Mimo comes out quite far in front. It reflects my experience at least.
The recent hype around Deepseek is a combination of existing name recognition and incredibly low pricing. Their v4 models, both Pro and Flash, are incredible for their price. That's more revolutionary than Mimo, which is multiple times more expensive, just like Kimi 2.6.
Also check mine[0]: basically random private tests/questions and an ok-ish methodology, testing mostly for general intelligence rather than coding-specific tasks.
I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).
Gemini models still have the best intelligence (they're the most likely to get any given question right), but in production they still have many failure modes[1].
Every model release you'll post this, and every time I'll be there to point out how it's completely useless (for reasons you've said are intentional).
It does things like place the old Gemini 3 Flash above the more capable 3.5 Flash, Opus 4.5 through Opus 4.8, and gpt-5.5.
At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.
Also, what about the major flaw/bias linked for Gemini 3.5 Flash? That has major real-life consequences if the model ends up being used in any automated scoring system.
I found it while trying to use 3.5 Flash for scoring the reasoning of some models, and it gets it wrong because of the centering bias, whereas 3 Flash gets scoring right.
I'm happy you do comment. I've added more coding tests since then and made more improvements (price history per model, displaying cost to run at current pricing, improved scoring).
How is it useless to see that Opus 4.8 is 2x more expensive and 2x slower on some questions?
I think their "code" ranking is biased towards visual aesthetics more than raw coding as the voters are just asked which generated website they prefer.
I'm finding it a little hard to believe that GPT 5.5 is in 11th place for webdev, outranked by models like Kimi, Qwen, and Z.ai. I'm not saying it's not true (I have noticed GPT being less smart in recent weeks), but this is very different from my expectation.
On paper it's one of the best because it's meant to be a blind comparison on your own prompts. However, if you are someone who geeks hard on one or a few models, you learn their "personality" and can recognize them in a blind test.
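For anyone wondering how blind preference votes turn into a ranking at all, here is a minimal sketch of an Elo-style update. This is an assumption for illustration only: nothing in this thread confirms which rating method the site actually uses, and the K-factor of 32 and the starting rating of 1000 are arbitrary.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, a_won: bool,
                   k: float = 32.0) -> tuple[float, float]:
    """Apply one blind vote; k (assumed value) controls how far ratings move."""
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating stays constant.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two models start level; one voter prefers model A's output.
    print(update_ratings(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```

Every vote nudges the ratings the same way regardless of why the voter picked a side, which is why voters recognizing a model's "personality" or preferring a prettier page feeds straight into the ranking.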
When I first graduated I read a bunch of tech-focused books. They’re all helpful, but I think practice and learning from more senior engineers is the most effective road to mastery! You can probably get away with not reading any of these books if you have good coworkers :).
That being said, these have been my favorites:
- Designing Data-Intensive Applications (a great way of understanding systems + the basics of SRE)
- The Senior Engineer (I love the prototyping process he lays out)
- The Effective Engineer (lots of good gems for approaching prioritization)
- Debugging (by David Agans) - a great resource for a formalized debugging process if you don’t have one
- On Writing Well (I’m halfway through this, but it has been indispensable for writing tickets + messages at work)
I think these groups are out there, but unfortunately they are informal (not listed on the web or run through a company), and they aren't aimed at beginners.
For me, I’ve been learning how to surf and cook. I found at an absolute beginner level no one really wants to go out surfing with you, or do dinner parties. I did find once I showed enough commitment, and reached a beginner-intermediate level, more people were willing to join me in learning, and form informal learning groups.
Therapists are trained to help and give you custom things to do based on what you need. I found one that gave me a lot of structured “assignments” and questions to ask myself and I’ve never felt better than I do now.
It's funny how hyped up stable diffusion is on HN right now: reminds me of when style transfer first started making its rounds in 2017. https://news.ycombinator.com/item?id=13958366
I think as technologists we want to think that code can "solve" some of the problems in the art world... but I think we still have a really, really long way to go.
I tried to get style transfer adopted at work (worked at a creative technology firm in NY) but frankly I think deep learning methods for art generation tend to be really unpredictable, which makes them pretty hard to use for professional applications. Imagine deploying production code that only worked 85% of the time... would be a nightmare. I felt, and feel, similarly about deep learning approaches to art. They're just so finicky and unpredictable: for example, add a single extra pixel to that example in this article and the output would look completely different.
Either way, cynicism aside, stable diffusion is awesome :).
> Imagine deploying production code that only worked 85% of the time... would be a nightmare. I felt, and feel, similarly about deep learning approaches to art. They're just so finicky and unpredictable: for example, add a single extra pixel to that example in this article and the output would look completely different.
Don't think the metaphor works. Code that only works 85% of the time is obviously broken, but art is subjective, so an 85% solution to a creative problem could be more than enough for most consumers.
What kind of GPU are you running this on? My 3080 seems to take about 30 seconds per image with 50 passes. I'm wondering if I'm missing out on some optimizations. Could just be the quality of the Linux NVIDIA drivers.
I'd recommend trying a different fork. Perhaps you're using the official one. I believe that one still "ramps up the system" on every image generation. Other repos do the ramp up only once.
I'm using 512x768 as the default, but a quick test shows only a marginal difference in speed between the two. I'll have to give Windows a try to see if it's the driver holding me back. Do you have any tips or resources for up-scaling the image after?
As someone who took their first software engineering job as a junior during covid, I have to say I definitely struggled to learn and execute on tasks in a way I know I wouldn't have in an in-person setting.
I found asking for help as a junior is definitely harder when you don't have people around (walking up to someone's desk vs a Slack message with a ~20-60 minute delay, then a Zoom call), and I often found myself blocked on tasks.
I found learning is generally harder remotely for me as well: the sheer amount of information + resources + help you get from serendipitous conversations with other engineers should not be understated. It's the same reason people got so angry over paying so much for remote university: it is objectively a worse learning experience.
This is just my personal stance, but in my perfect world I'd work in an office for the first 5-10 years of my career to optimize for learning + relationship building, and then once I get more senior (or have kids) I'd transition into either hybrid or fully remote.
I'm sorry you had this experience, but I will put the blame on your company's onboarding.
This was your first job so you may lack data points, but if you ended up not getting help or support as a new, fresh-out-of-school engineer in a remote setting, I strongly doubt that it would have been any better in an office.
I've been a manager/director working with distributed teams for the past 6 or 7 years. I've onboarded dozens of folks for whom it was their first or second job, and they all had a really good experience.