neilellis · 2026-03-25T23:11:31 1774480291

Unless it’s already happened and we missed it

threatripper · 2026-03-25T23:15:06 1774480506

Or nobody is around anymore to notice when it happens.

neilellis · 2026-03-03T18:02:58 1772560978

Caught in a landslide, no escape from reality

neilellis · 2026-02-15T23:47:52 1771199272

When I hear people talking about how insecure OpenClaw is, I remember how insecure the internet was in the early days. Sometimes it's about doing the right thing badly and fixing the bad things after.

Big Tech can't release software this dangerous and then figure out how to make it secure. For them it would be an absolute disaster and could ruin them.

What OpenClaw did was show us the future, give us a taste of what it would be like, and have the balls to do it badly.

Technology is often pushed forwards by ostensibly bad ideas (like telnet) that carve a path through the jungle and let other people create roads after.

I don't get the hate towards OpenClaw. If it were a consumer product I would, but for hackers to play around with and see what is possible, it's an amazing (and ridiculously simple) idea. Much like HTTP was.

If you connected to your bank account via telnet in the 1980s or plain HTTP in the 90s, or stored your secrets in 'crypt', well, you deserved what you got ;-) But that's how many great things get started: badly. We see the flaws, fix them, and get the safe version.

And that I guess is what he'll get to do now.

* OpenClaw is a straw man for AGI *

neilellis · 2026-02-15T00:15:57 1771114557

Yes, also for semantic indexes. I use one for person/role/org matches, so that CEO == chief executive ~= managing director. That's good when you have grey data and multiple lookup data sources that use different terms.
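A minimal sketch of that kind of role matching, in case it helps. Everything here is hypothetical: the 3-d vectors are hand-made stand-ins for what a real embedding model would produce, and `same_role` and the 0.9 threshold are illustrative names and values, not anything from an actual index.

```python
import math

# Hypothetical toy "embeddings". A real system would get these from an
# embedding model and store them in a vector index.
TOY_EMBEDDINGS = {
    "CEO": (0.95, 0.10, 0.05),
    "chief executive": (0.93, 0.12, 0.06),
    "managing director": (0.80, 0.35, 0.10),
    "software engineer": (0.05, 0.20, 0.95),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def same_role(title_a, title_b, threshold=0.9):
    """Treat two titles as the same role if their vectors are close enough.

    The threshold is an assumed value; in practice it would be tuned on
    labelled pairs from the data sources being matched.
    """
    return cosine(TOY_EMBEDDINGS[title_a], TOY_EMBEDDINGS[title_b]) >= threshold


if __name__ == "__main__":
    for other in ("chief executive", "managing director", "software engineer"):
        print("CEO vs", other, "->", same_role("CEO", other))
```

With these toy vectors, "chief executive" scores closer to "CEO" than "managing director" does, which is the == versus ~= distinction, and both clear the threshold while "software engineer" does not.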

neilellis · 2026-02-15T00:13:00 1771114380

That I would like to see too. usearch is amazingly fast: 44M embeddings in < 100ms.

neilellis · 2026-02-12T17:45:06 1770918306

It's ahead in raw power but not in function. Like it's got the world's fastest engine but only one gear! Trouble is some benchmarks only measure horse power.

NitpickLawyer · 2026-02-12T17:52:31 1770918751

> Trouble is some benchmarks only measure horse power.

IMO it's the other way around. Benchmarks only measure applied horse power on a set plane, with no friction and your elephant is a point sphere. Goog's models have always punched over what benchmarks said, in real world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance are workhorses. I don't know any other models where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it.

neilellis · 2026-02-12T17:43:50 1770918230

Less than a year to destroy Arc-AGI-2 - wow.

Davidzheng · 2026-02-12T17:52:22 1770918742

I unironically believe that arc-agi-3 will have an introduction-to-solved time of 1 month

ACCount37 · 2026-02-12T20:40:36 1770928836

Not very likely?

ARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs.

Davidzheng · 2026-02-13T05:37:41 1770961061

We will see at the end of April right? It's more of a guess than a strongly held conviction--but I see models improving rapidly at long horizon tasks so I think it's possible. I think a benchmark which can survive a few months (maybe) would be if it genuinely tested long time-frame continual learning/test-time learning/test-time posttraining (idk honestly the differences b/t these).

But I'm not sure how to design such benchmarks. I'm thinking of tasks like learning a language/becoming a master at chess from scratch/becoming a skilled artist, but where the task is novel enough for the actor to not be anywhere close to proficient at the beginning. An example which could be of interest is: here is a robot you control, you can take actions, see results... become proficient at table tennis. Maybe another would be: here is a new video game, obtain the best possible 0% speedrun.

etyhhgfff · 2026-02-12T18:21:59 1770920519

The AGI bar has to be set even higher, yet again.

red75prime · 2026-02-13T04:03:16 1770955396

And that's the way it should be. We're past the "Look! It can talk! How cute!" stage. AGI should be able to deal with any problem a human can.

dakolli · 2026-02-12T19:09:35 1770923375

wow solving useless puzzles, such a useful metric!

esafak · 2026-02-12T21:37:04 1770932224

How is spatial reasoning useless??

saberience · 2026-02-12T18:48:08 1770922088

It's a useless meaningless benchmark though, it just got a catchy name, as in, if the models solve this it means they have "AGI", which is clearly rubbish.

Arc-AGI score isn't correlated with anything useful.

Legend2440 · 2026-02-12T21:11:27 1770930687

It's correlated with the ability to solve logic puzzles.

It's also interesting because it's very very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem.

HDThoreaun · 2026-02-12T23:22:37 1770938557

ARC-AGI 2 is an IQ test. IQ tests have been shown over and over to have predictive power in humans. People who score well on them tend to be more successful.

fsh · 2026-02-13T00:34:45 1770942885

IQ tests only work if the participants haven't trained for them. If they do similar tests a few times in a row, scores increase a lot. Current LLMs are hyper-optimized for the particular types of puzzles contained in popular "benchmarks".

jabedude · 2026-02-12T18:51:40 1770922300

how would we actually objectively measure a model to see if it is AGI if not with benchmarks like arc-AGI?

WarmWash · 2026-02-12T20:06:57 1770926817

Give it a prompt like

>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home

And get back an automatic coupon code app like the user actually wanted.

modeless · 2026-02-12T19:29:37 1770924577

It's still useful as a benchmark of cost/efficiency.

XCSme · 2026-02-12T18:51:09 1770922269

But why only a +0.5% increase for MMMU-Pro?

kingstnap · 2026-02-12T20:07:37 1770926857

It's possibly label noise. But you can't tell from a single number.

You would need to check whether everyone is making mistakes on the same 20% or a different 20%. If it's the same 20%, either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.

It happens. The old non-Pro MMLU had a lot of wrong answers. Even simple things like MNIST have digits labeled incorrectly or drawn so badly that they're not even digits anymore.
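A rough sketch of the "same 20% or different 20%" check, assuming you have per-question results for each model. The model names and question IDs are made up for illustration, and Jaccard overlap is just one reasonable way to compare error sets, not an established method for this benchmark.

```python
from itertools import combinations


def error_overlap(errors_by_model):
    """Pairwise Jaccard overlap of the question IDs each model got wrong.

    A value near 1.0 means two models fail on the same questions; a value
    near 0.0 means their mistakes are mostly different.
    """
    overlaps = {}
    for (a, wrong_a), (b, wrong_b) in combinations(sorted(errors_by_model.items()), 2):
        union = wrong_a | wrong_b
        overlaps[(a, b)] = len(wrong_a & wrong_b) / len(union) if union else 0.0
    return overlaps


def missed_by_all(errors_by_model):
    """Question IDs that every model got wrong: candidates for bad labels."""
    return set.intersection(*errors_by_model.values())


# Hypothetical data: IDs of the questions each (made-up) model answered wrong.
example_errors = {
    "model_a": {1, 2, 3, 4},
    "model_b": {2, 3, 4, 5},
    "model_c": {2, 3, 4, 9},
}

if __name__ == "__main__":
    print(error_overlap(example_errors))
    print(missed_by_all(example_errors))
```

Questions that land in `missed_by_all` are the ones worth reading by hand: they are either genuinely hard, mis-keyed, or underspecified, and the numbers alone can't tell you which.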

kenjackson · 2026-02-12T19:09:50 1770923390

Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago.

XCSme · 2026-02-13T00:38:19 1770943099

But 80% sounds far from good enough. That's a 20% error rate, unusable in autonomous tasks. Why stop at 80%? If we aim for AGI, it should 100% any benchmark we give.
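A back-of-the-envelope sketch of why a 20% error rate hurts in autonomous use. It assumes a task is a chain of steps that each succeed independently with the same probability, which is a simplification and not something the benchmark itself measures; `chain_success` is a hypothetical helper name.

```python
def chain_success(per_step_accuracy, steps):
    """Probability that every step in a chain succeeds.

    Assumes steps are independent and equally reliable, which real agent
    runs are not; treat the result as a rough intuition only.
    """
    return per_step_accuracy ** steps


if __name__ == "__main__":
    for steps in (1, 3, 5, 10):
        print(steps, "steps:", round(chain_success(0.8, steps), 3))
```

Under that assumption, an 80% per-step success rate leaves roughly an 11% chance of getting through a 10-step task with no errors.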

Davidzheng · 2026-02-13T05:28:31 1770960511

I'm not sure the benchmark is high enough quality that >80% of problems are well-specified & have correct labels tbh. (But I guess this question has been studied for these benchmarks)

kenjackson · 2026-02-13T00:59:37 1770944377

Are humans 100%?

XCSme · 2026-02-13T01:05:57 1770944757

If they are knowledgeable enough and pay attention, yes. Also, if they are given enough time for the task.

But the idea of automation is to make a lot fewer mistakes than a human, not just to do things faster and worse.

kenjackson · 2026-02-13T03:56:55 1770955015

Actually faster and worse is a very common characterization of a LOT of automation.

XCSme · 2026-02-13T08:55:42 1770972942

That's true.

The problem is that if the automation breaks at any point, the entire system fails. And programming automations are extremely sensitive to minor errors (e.g. a missing semicolon).

AI does have an interesting feature though: it tends to self-heal in a way, when given tool access and a feedback loop. The only problem is that self-healing can heal errors incorrectly, and then the final result will be wrong in hard-to-detect ways.

So the more such hidden bugs there are, the more unexpectedly the automations will behave.

I still don't trust current AI for any tasks beyond data parsing/classification/translation and very strict tool usage.

I don't believe in the safety and reliability of full-assistant/clawdbot usage at this time (it might be good enough by the end of the year, but then SWE-bench should be at 100%).

neilellis · 2026-02-10T18:31:58 1770748318

End of the World? Must be Tuesday.

neilellis · 2026-02-10T18:31:16 1770748276

But you're not Everyone - they are a fictional hacker collective from a TV show.

neilellis · 2026-02-10T18:20:16 1770747616

I suppose the question is "Do you feel Steve Jobs made the iPhone?"

Not saying right/wrong but it's a useful Rorschach Test - about what you feel defines 'making this'?

p-t · 2026-02-10T18:32:31 1770748351

it's more just a personal want to be able to see what I can do on my own tbh; i don't generally judge other people on that measure

although i do think Steve Jobs didn't make the iPhone /alone/, and that a lot of other people contributed to that. i'd like to be able to name who helps me and not say "gemini". again, it's more of a personal thing lol

neilellis · 2026-02-10T19:55:22 1770753322

So not disagreeing; as you say, it is a personal thing!

I honestly find coding with AI no easier than coding directly, and it certainly does not feel like AI is doing my work for me. If it were, I wouldn't have anything to do. In reality I spend my time thinking about much higher-level abstractions, but of course this is a very personal thing too.

I myself have never thought of code as being my output, I've always enjoyed solving problems, and solutions have always been my output. It's just that before I had to write the code for the solutions. Now I solve the problems and the AI makes it into code.

I think that this is probably the dividing line: some people enjoy working with tools (code, unix commands, editors), some people enjoy just solving the problems. Both of course are perfectly valid, but they do create a divide when looking at AI.

Of course when AI starts solving all problems, I will have a very different feeling :-)

abustamam · 2026-02-11T06:42:45 1770792165

If you managed an AI (or rather, AI system) that wrote a compiler or web browser like Claude Code or Cursor did, would you feel like you did it?

Just a curious question, not trying to be combative or anything.

I myself will go into planning mode and ask it to implement a feature, and ask it to give me tradeoffs between implementation details. Then I might chat with it a bit to further understand the implementation before it writes the plan.

I find it to be very effective and gives me a sense of agency in my features.