I agree that a "long term fractional g spin test" is one of the most valuable things a LEO station can do. But there are others too.
For example, medical interventions against zero-g decay can be tested in any microgravity, spin or no spin. Development of in-space manufacturing and assembly can happen on any sufficiently capable space station.
All of that, however, requires a good amount of ambition. And I'm not sure if NASA under the current political system can deliver ambition.
"Body" is a pile of elaborate biochemistry. The muscles don't somehow evaporate when you stop exercising - it's the processes of the body itself that trim the "excess" muscle tissue.
And if it's the body doing that, you can, in theory, find a biochemical way to make it stop doing that.
"Physical damage and weakness can’t be stopped by a pill."
If you rephrase that into correct English, it would make sense. We aren't trying to stop physical damage or weakness; we are trying to prevent it from happening. Pills can prevent many of the things that cause it.
Right. Claude models seem to have had very limited prohibitions in this area baked in via RLHF. They seem to rely on the system prompt as the main defense, possibly reinforced by an API-side system prompt too. But it is very clear that they want to allow things like malware analysis (which includes reverse-engineering), so any server-side limitations will be designed to allow these things too.
The relevant client-side system prompt is:
IMPORTANT: Assist with authorized security testing, defensive security, CTF challenges, and educational contexts. Refuse requests for destructive techniques, DoS attacks, mass targeting, supply chain compromise, or detection evasion for malicious purposes. Dual-use security tools (C2 frameworks, credential testing, exploit development) require clear authorization context: pentesting engagements, CTF competitions, security research, or defensive use cases.
----
There is also this system reminder that shows upon using the read tool:
<system-reminder>
Whenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing. But you MUST refuse to improve or augment the code. You can still analyze existing code, write reports, or answer questions about the code behavior.
</system-reminder>
May I ask how the current generation of language models are jailbroken? I'm aware the previous generation had 'do anything now' prompts. Mostly curious from a psychological perspective.
> it's widely believed they are doing something to degrade service quality (quantizing?) in order to stretch resources
God, I wish this inane bullshit would just fucking die already.
Models are not "degrading". They're not being "secretly quantized". And no one is swapping out your 1.2T frontier behemoth for a cheap 120B toy and hoping you wouldn't notice!
It's just that humans are completely full of shit, and can't be trusted to measure LLM performance objectively!
Every time you use an LLM, you learn its capability profile better. You start using it more aggressively at what it's "good" at, until you find the limits and expose the flaws. You start paying attention to the more subtle issues you overlooked at first. Your honeymoon period wears off and you see that "the model got dumber". It didn't. You got better at pushing it to its limits, exposing the ways in which it was always dumb.
Now, will the likes of Anthropic just "API error: overloaded" you on any day of the week that ends in Y? Will they reduce your usage quotas and hope that you don't notice because they never gave you a number anyway? Oh, definitely. But that "they're making the models WORSE" bullshit lives in people's heads way more than in any reality.
It's possible though - there was a bug where a model pool instance wasn't updated properly and served a very old model for several months; whoever hit this instance would receive a response from a previous version of the model.
While it's true that people are naturally predisposed to invent the "secret quantizing" conspiracy regardless of whether the actual conspiracy exists or not, I think there's more to the story.
I've seen Sonnet consistently start hallucinating on the exact same inputs for a couple hours, and then just go back to normal like nothing ever happened. It may just be a combination of hardware malfunction + session pinning. But at the end of the day the effects are indistinguishable from "secret quantizing".
"Raven's progressive matrices" is "infer and generalize rules". Performance there also improves once "you kinda get used to the style", which is why training for IQ tests can improve human performance on IQ tests, including on unseen examples. This is well known and well documented.
Yep. Behavior composition. If you train an LLM to do A and to do B, separately, chances are, it'll be decent at A+B despite not being trained for the combination.
It's kind of the point? To test AI where it's weak instead of where it's strong.
"Sample efficient rule inference where AI gets to control the sampling" seems like a good capability to have. Would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really.
ARC has always had that problem, but for this round the score is just too convoluted to be meaningful. I want to know how well the models can solve the problems. I may want to know how 'efficient' they are, but really, as long as they're solving them in reasonable clock time and/or cost, I don't care. I certainly do not want that jumbled into one messy, convoluted score.
'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it, unlike the above two, but it's just incredibly silly to me to think we should be directly comparing something like that across entities operating on wildly different substrates.
If I can't look at the score and immediately get a good idea of where things stand, then throw it away. 5% here could mean anything from 'solving only a tiny fraction of problems' to "solving everything correctly but with more 'reasoning steps' than the best human scores." Literally wildly different implications. What use is a score like that?
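To make the ambiguity concrete, here is a toy sketch. This is not the actual ARC-AGI-3 formula; the blended_score function, its efficiency discount, and both profiles are assumptions invented purely to illustrate how an efficiency-weighted composite can map very different performance profiles onto the same 5%:

```python
# Toy illustration only: NOT the real ARC-AGI-3 scoring formula. The function,
# its weighting, and both profiles below are made-up assumptions, used to show
# why a single blended number is hard to interpret.

def blended_score(solve_rate: float, agent_steps: float, human_steps: float) -> float:
    """Hypothetical score: fraction of games solved, discounted by step
    efficiency relative to a human baseline (capped at 1.0)."""
    efficiency = min(1.0, human_steps / agent_steps)
    return solve_rate * efficiency

# Profile A: solves only 5% of games, but just as efficiently as a human.
print(blended_score(solve_rate=0.05, agent_steps=100, human_steps=100))  # 0.05

# Profile B: solves 25% of games, but takes 5x as many steps as the human baseline.
print(blended_score(solve_rate=0.25, agent_steps=500, human_steps=100))  # 0.05
```

Both profiles print the same 5%, yet they describe very different capabilities - which is exactly the problem with reading anything off a blended score.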
The measurement metric is in-game steps. Unlimited reasoning between steps is fine.
This makes sense to me. Most actions have some cost associated, and as another poster stated it's not interesting to let models brute-force a solution with millions of steps.
Same thing in this case. No utility, and just as arbitrary. None of the issues with the score change.
Models do not brute force solutions in that manner. If they did, we'd wait the lifetimes of several universes before we could expect a significant result.
Regardless, since there's a 5x step cutoff, 'brute forcing with millions of steps' was never on the table.
Cost has utility in the real world and this doesn't. That's the only reason I would tolerate thinking about cost, and even then, I would never bundle it into the same score as the intelligence, because that's just silly.
It's an interesting point but I too find it questionable. Humans operate differently than machines. We don't design CPU benchmarks around how humans would approach a given computation. It's not entirely obvious why we would do it here (but it might still be a good idea, I am curious).
You control the mirroring by moving the axis; it's what reflects your shapes. So my first move was always to identify the symmetries in the target shape and position the axis accordingly.
This is the correct strategy for this particular game (center the mirrors between the yellow squares, move the black squares). I didn't realize it until about round 6 or 7.
They stacked the deck. If v2 was still rule inference + spatial reasoning, a bit like juiced-up Raven's progressive matrices, then v3 adds a whole new multi-turn explore/exploit agentic dimension to it.
Given how hard even pure v2 was for modern LLMs, I'm not surprised to see v3 crush them. But that won't last.