
aronowb14 · 2026-06-05T22:27:14 1780698434

why is it a great idea though? My question was basically what are the actual pros of doing it in space? I still haven't heard a good explanation.

aronowb14 · 2026-05-28T17:08:48 1779988128

https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker - not sure of their exact methodology, but during day-to-day programming with Claude / GPT models I’ve felt qualitatively what they report

Bnjoroge · 2026-05-28T17:53:23 1779990803

Have you seen https://deepswe.datacurve.ai/blog? This is the closest to a vibe check I’ve felt, even with the open models.

Imustaskforhelp · 2026-05-28T18:42:26 1779993746

This actually looks like a really good test.

There are many benchmarks all for specific use cases but with them the difference seems to be in extreme points (93% vs 92%)

I think that tracks, but still, it was refreshing to see a benchmark that helps me form better opinions.

Surprised about Mimo v2.5, within artificial-analysis and other benchmarks, the difference between Mimo and deepseek seems very partial and a lot of focus/(hype?) is on Deepseek

But mimo seems like an interesting model and they are having some crazy discounts too.

Deepseek is valuable for the research community because of how open they are but absolutely crazy to think how Xiaomi basically pulled up in creating Mimo given that they didn't have anything till quite recently.

Either way, an interesting benchmark, also a plus point for giving golang some decent representation equal to python/typescript.

I think there is a set of tasks resembling normal benchmarks where open source models can be absolutely fine, and for a very small fraction of more technical things, the benchmark that you linked starts to be a better projection. So it depends on the scale of complexity, but it's good to see how models compete given enough complexity. Definitely fascinating.

I would be interested to see more models compete on this test. The current range is still a bit limited compared to other benchmarks, but OSS models like Kimi/mimo seem to be only 3-4 months (at most 6) behind closed source.

GneojJ · 2026-05-29T13:49:43 1780062583

Having used both Deepseek v4 Pro and Mimo v2.5 for agentic coding, I'm not surprised Mimo comes out quite far in front. It reflects my experience at least.

The recent hype around Deepseek is a combination of existing name recognition and incredibly low pricing. Their v4 models, both pro and flash, are incredible for their price. That's more revolutionary than Mimo, which is multiple times more expensive, just like Kimi 2.6.

Bnjoroge · 2026-05-29T18:11:04 1780078264

Agree on both counts. Mimo seems to have reduced their prices significantly, so if it’s comparable to deepseek v4 pro, it’s a much better value

XCSme · 2026-05-28T18:25:37 1779992737

Also check mine[0], basically random private tests/questions and an ok-ish methodology, testing more for general intelligence than coding-specific tasks.

I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).

Gemini models still have the best intelligence (when asked any questions, most likely to get it right), but in production they still have many failure modes[1].

[0]: https://aibenchy.com

[1]: https://news.ycombinator.com/item?id=48230368

BoorishBears · 2026-05-28T20:32:00 1780000320

Every model release you'll post this, and every time I'll be there to point out how it's completely useless (for reasons you've shared are intentional)

It does things like place the old Gemini 3 Flash above the more capable 3.5 Flash and Opus 4.5 - Opus 4.8 and gpt-5.5

At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.

XCSme · 2026-05-28T20:54:31 1780001671

Also, what about the major flaw/bias linked for Gemini 3.5 flash? That has major real-life consequences if the model ends up being used for any automated scoring systems.

I found it while trying to use 3.5 Flash for scoring the reasoning of some models, and it gets it wrong because of the centering bias, whereas 3 Flash gets scoring right.

XCSme · 2026-05-28T20:49:47 1780001387

I'm happy you do comment. I have added more coding tests since then, plus more improvements (price history per model, displaying cost to run at current pricing, improved scoring).

How is it useless to see that Opus 4.8 is 2x more expensive and 2x slower on some questions?

reckless · 2026-05-28T18:35:31 1779993331

No way is Muse Spark generally better than offerings from Google and OpenAI. I actually find arena to be amongst the most useless indicators

WASDx · 2026-05-28T22:02:16 1780005736

I think their "code" ranking is biased towards visual aesthetics more than raw coding as the voters are just asked which generated website they prefer.

morley · 2026-05-28T18:27:48 1779992868

I'm finding it a little hard to believe that GPT 5.5 is in 11th place for webdev, outranked by models like Kimi, Qwen, and Z.ai. I'm not saying it's not true (I have noticed GPT being less smart in recent weeks), but this is very different from my expectation.

WarmWash · 2026-05-28T18:44:15 1779993855

On paper it's one of the best because it's meant to be a blind comparison on your own prompts. However, if you are someone who geeks out hard on one or a few models, you learn their "personality" and can recognize them in a blind test.

dakolli · 2026-05-28T18:48:13 1779994093

If you don't know their methodology, or anything about it, why do you think it's a good ranker?

aronowb14 · 2026-01-20T05:04:42 1768885482

Yeah curious what would happen if they asked for an additional big feature on top of the original spec

aronowb14 · 2026-01-05T18:21:01 1767637261

A bit ironic that the website complaining about UI has virtual snow on it making reading hard.

aronowb14 · 2025-09-03T00:43:29 1756860209

Agreed. I think this Anthropic article is a realistic take on what’s possible (focus on prototyping)

https://www-cdn.anthropic.com/58284b19e702b49db9302d5b6f135a...

aronowb14 · on Sept 1, 2024

When I first graduated I read a bunch of tech focused books: they’re all helpful but I think practice and learning from more senior engineers is the most effective road to mastery! You can probably get away with not reading any of these books if you have good coworkers :).

That being said, these have been my favorites:

- designing data intensive applications (a great way of understanding systems + the basics of SRE)

- the senior engineer (I love the prototyping process he lays out)

- the effective engineer (lots of good gems for approaching prioritization)

- debugging (by David Agans) - a great resource for a formalized debugging process if you don’t have one

- on writing well (I’m halfway through this, but it has been indispensable for writing tickets + messages at work)

aronowb14 · on July 1, 2023

I think these groups are out there, but unfortunately they are informal (not listed on the web or through a company), and also are not for a beginner level. For me, I’ve been learning how to surf and cook. I found at an absolute beginner level no one really wants to go out surfing with you, or do dinner parties. I did find once I showed enough commitment, and reached a beginner-intermediate level, more people were willing to join me in learning, and form informal learning groups.

aronowb14 · on Oct 22, 2022

Therapy worked wonders for me.

Therapists are trained to help and give you custom things to do based on what you need. I found one that gave me a lot of structured “assignments” and questions to ask myself and I’ve never felt better than I do now.

aronowb14 · on Aug 31, 2022

It's funny how hyped up stable diffusion is on HN right now: reminds me of when style transfer first started making its rounds in 2017. https://news.ycombinator.com/item?id=13958366

I think as technologists we want to think that code can "solve" some of the problems in the art world... but I think we still have a really, really long way to go. I tried to get style transfer adopted at work (worked at a creative technology firm in NY) but frankly I think deep learning methods for art generation tend to be really unpredictable, which makes them pretty hard to use for professional applications. Imagine deploying production code that only worked 85% of the time... would be a nightmare. I felt, and feel, similarly about deep learning approaches to art. They're just so finicky and unpredictable: for example, add a single extra pixel to that example in this article and the output would look completely different.

Either way, cynicism aside, stable diffusion is awesome :).

adamsmith143 · on Aug 31, 2022

> Imagine deploying production code that only worked 85% of the time... would be a nightmare. I felt, and feel, similarly about deep learning approaches to art. They're just so finicky and unpredictable: for example, add a single extra pixel to that example in this article and the output would look completely different.

Don't think the metaphor works. Code that only works 85% of the time is obviously broken, but art is subjective, so an 85% solution to a creative problem could be more than enough for most consumers.

DethNinja · on Aug 31, 2022

It takes 3 seconds to generate 1 image with my GPU.

I can find a good prompt within 30 minutes to 1 hour.

My GPU can generate 100 images in 5 minutes.

Out of those 100 images, 10 are very close to exactly what I meant, at a professional concept artist level.

So, in this case Stable Diffusion only working 10% of the time is fine.

The future is already here; I’m already incorporating stable diffusion generated images into my professional work.
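
The arithmetic in the comment above can be checked with a small sketch. The function name `batch_stats` and its parameters are hypothetical, made up for illustration; the inputs (3 seconds per image, batches of 100, about 10 usable images per batch) are the figures quoted in the comment, and the 30 seconds per image case is the slower figure reported in the reply below.

```python
def batch_stats(seconds_per_image: float, batch_size: int, usable: int) -> dict:
    """Rough throughput numbers for one batch of generated images.

    All names here are illustrative; nothing in this function comes
    from Stable Diffusion itself.
    """
    total_seconds = seconds_per_image * batch_size
    return {
        # Wall-clock time to generate the whole batch, in minutes.
        "batch_minutes": total_seconds / 60,
        # Fraction of the batch that is good enough to use.
        "hit_rate": usable / batch_size,
        # Effective generation time spent per usable image, in seconds.
        "seconds_per_usable_image": total_seconds / usable,
    }


# Figures quoted in the comment: 3 s/image, 100 images, 10 usable.
fast = batch_stats(seconds_per_image=3, batch_size=100, usable=10)

# Same batch at 30 s/image, assuming the same 10% hit rate.
slow = batch_stats(seconds_per_image=30, batch_size=100, usable=10)

print(fast)
print(slow)
```

Under these assumptions, a 10% hit rate costs about 30 seconds of GPU time per usable image at 3 seconds per image, and about 5 minutes per usable image at 30 seconds per image. Time spent finding the prompt is not included.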

tstrimple · on Aug 31, 2022

What kind of GPU are you running this on? My 3080 seems to take about 30 seconds per image with 50 passes. I'm wondering if I'm missing out on some optimizations. Could just be the quality of Linux NVidia drivers.

bitshiftfaced · on Aug 31, 2022

I'd recommend trying a different fork. Perhaps you're using the official one. I believe that one still "ramps up the system" on every image generation. Other repos do the ramp up only once.

tstrimple · on Aug 31, 2022

Yeah, this might be the problem. I was on the main fork, but going to try switching over to this: https://github.com/hlky/stable-diffusion

DethNinja · on Aug 31, 2022

That’s weird, I've got an RTX 3070 on Windows.

Are you using 512x512 images or larger ones?

Best workflow is to keep images close to 512x512, record the seed and then upscale.

tstrimple · on Aug 31, 2022

I'm using 512x768 as the default, but a quick test shows only a marginal difference in speed between the two. I'll have to give Windows a try to see if it's the driver holding me back. Do you have any tips or resources for up-scaling the image after?

DethNinja · on Aug 31, 2022

Currently this library can generate multiple images and upscale them through RealESRGAN: https://github.com/hlky/stable-diffusion

If you are not using this library already, give it a shot.

Also, I'm using Nvidia Studio drivers though I'm not sure if that would make a difference.

tstrimple · on Aug 31, 2022

I've been using the main fork. This even has GFPGAN built in! Looks very useful thanks.

aronowb14 · on Aug 23, 2022

As someone who took their first software engineering job as a junior during covid, I have to say I definitely struggled to learn and execute on tasks in a way I know I wouldn't have in an in-person setting.

I found asking for help as a junior is definitely harder when you don't have people around (walking up to someone's desk vs slack message with ~20-60 minute delay then zoom call): and I often found myself blocked on tasks.

I found learning is generally harder remotely for me as well: the sheer amount of information + resources + help you get from serendipitous conversations with other engineers should not be understated. It's the same reason people got so angry over paying so much for remote university: it is objectively a worse learning experience.

I think this is just my personal stance: but I think in my perfect world I work in office for the first 5-10 years of my career to optimize for learning + relationship building, and then once I get more senior (or have kids) I transition into either hybrid or fully remote.

dudul · on Aug 23, 2022

I'm sorry you had this experience, but I will put the blame on your company's onboarding.

This was your first job so you may lack data points, but if you ended up not getting helped/being supported as a new, out of school, engineer in a remote setting, I strongly doubt that it would have been any better in an office.

I've been a manager/director working with distributed teams for the past 6 or 7 years. I've onboarded dozens of folks for whom it was their first or second job, and they all had a really good experience.