More

trq_ · 2026-01-29T19:12:36 1769713956

Hi everyone, Thariq from the Claude Code team here.

Thanks for reporting this. We fixed a Claude Code harness issue that was introduced on 1/26. This was rolled back on 1/28 as soon as we found it.

Run `claude update` to make sure you're on the latest version.

samlinnfer · 2026-01-29T23:56:49 1769731009

Is there compensation for the tokens because Claude wasted all of them?

mathrawka · 2026-01-30T03:02:28 1769742148

You are funny. Anthropic refuses to issue refunds, even when they break things.

I had an API token set via an env var on my shell, and claude code changed to read that env var. I had a $10 limit set on it, so found out it was using the API, instead of my subscription, when it stopped working.

I filed a ticket and they refused to refund me, even though it was a breaking change with claude code.

TOMDM · 2026-01-30T06:13:40 1769753620

Anthropic just reduced the price of the team plan and refunded us on the prior invoice.

YMMV

MichaelZuo · 2026-01-30T19:23:21 1769801001

So they have no durable principles for deciding who or what to refund… doesnt that make them look even worse…?

necovek · 2026-01-31T06:30:25 1769841025

Or they do, and two sentences from two different experiences don't tell a full story?

MichaelZuo · 2026-01-31T15:28:20 1769873300

Okay “they do” based on what more compelling evidence?

Its not like the credibility of the two prior HN users are literally zero…

necovek · 2026-01-31T16:59:14 1769878754

I am saying there is no evidence either way: they had contrasting experiences and one GP established this means that company has no standardized policies. Maybe they do, maybe they don't — I don't think we can definitively conclude anything.

MichaelZuo · 2026-01-31T18:37:29 1769884649

So if you acknowledge the prior claims have more than literally zero credibility… then what’s the issue?

That I dont equally weigh them with all possible yet-to-be claimed things?

necovek · 2026-02-03T10:25:32 1770114332

I object to your conclusion that "they have no durable principles": not sure how do you get to that from two different experiences documented with a single paragraph.

MichaelZuo · 2026-02-03T23:00:54 1770159654

Because I can assess things via probability… without needing 100% certain proof either way?

necovek · 2026-02-04T19:36:05 1770233765

This is becoming futile: this is not even about proof, but there not even being a full account of two cases you are basing your opinion on.

Obviously, you can derive any opinion you want out of that, but while I am used to terms like "probability" being misused like this, I've generally seen a higher standard at HN.

To each their own, though. Thank you for the discourse and have a good day.

gizmodo59 · 2026-01-30T02:50:41 1769741441

Codex seems to give compensation tokens whenever this happens! Hope Claude gives too.

TZubiri · 2026-01-30T02:41:47 1769740907

It is possible that degradation is an unconscious emergent phenomenon that arises from financial incentives, rather than a purposeful degradation to reduce costs.

mvandermeulen · 2026-01-30T09:23:12 1769764992

You’re lucky they have even admitted a problem instead of remaining silent and quietly fixing it. Do not expect ethical behaviour from this company.

port11 · 2026-01-30T12:34:13 1769776453

Why not, can you expand? Asking because I’m considering Claude due to the sandbox feature.

caspar · 2026-01-30T16:11:06 1769789466

FYI the sandbox feature is not fully baked and does not seem to be high priority.

For example, for the last 3 weeks using the sandbox on Linux will almost-always litter your repo root with a bunch of write-protected trash files[0] - there are 2 PRs open to fix it, but Anthropic employees have so far entirely ignored both the issue and the PRs.

Very frustrating, since models sometimes accidentally commit those files, so you have to add a bunch of junk to your gitignore. And with claude code being closed source and distributed as a bun standalone executable it's difficult to patch the bug yourself.

[0]: https://github.com/anthropic-experimental/sandbox-runtime/is...

port11 · 2026-02-02T12:52:21 1770036741

Hmm, very good point indeed. So far it’s behaved, but I also admit I wasn’t crazy about the outputs it gave me. We’ll see, Anthropic should probably think about their reputation if these issues are common enough.

jonplackett · 2026-01-30T00:27:37 1769732857

So quiet…

isaacdl · 2026-01-29T19:38:48 1769715528

Anywhere we can read more about what a "harness issue" means? What was the impact of it?

xnorswap · 2026-01-30T08:42:29 1769762549

One thing that could be a strong degradation especially for benchmarks is they switched the default "Exit Plan" mode from:

    "Proceed"

to

   "Clear Context and Proceed"

It's rare you'd want to do that unless you're actually near the context window after planning.

I pressed it accidentally once, and it managed to forget one of the clarifying questions it asked me because it hadn't properly written that to the plan file.

If you're running in yolo mode ( --dangerously-skip-permissions ) then it wouldn't surprise me to see many tasks suddenly do a lot worse.

Even in the best case, you've just used a ton of tokens searching your codebase, and it then has to repeat all that to implement because it's been cleared.

I'd like to see the option of:

    "Compact and proceed"

because that would be useful, but just proceed should still be the default imo.

samusiam · 2026-01-30T11:26:50 1769772410

I disagree that this was the issue, or that it's "rare that you'd want to do that unless you're near the context window". Clearing context after writing a plan, before starting implementation of said plan, is common practice (probably standard practice) with spec driven development. If the plan is adequate, then compaction would be redundant.

xnorswap · 2026-01-30T11:54:17 1769774057

For a 2M+ LOC codebase, the plans alone are never adequate. They miss nuance that the agent will only have to rediscover when it comes to operate on them.

For spec driven development (which I do for larger issues), this badly affects the plan to generate the spec, not the spec itself.

I'll typically put it in plan mode, and ask it to generate documentation about an issue or feature request.

When it comes to write the output to the .typ file, it does much much worse if it has a cleared context and a plan file than if it has it's full context.

The previously "thought" is typically, "I know what to write now, let me exit plan mode".

Clearing context on exiting that plan mode is a disaster which leaves you much worse off and skeletal documentation and specs compared to letting it flow.

A new context to then actually implement the documented spec is not so bad, although I'd still rather compact.

plexicle · 2026-01-30T15:02:03 1769785323

"It's rare you'd want to do that unless you're actually near the context window after planning."

Highly disagree. It's rare you WOULDN'T want to do this. This was a good change, and a lot of us were doing this anyway, but just manually.

Getting the plan together and then starting fresh will almost always produce better results.

rubslopes · 2026-01-30T10:23:44 1769768624

Not disagreeing with you, but FYI you can roll back to the conversation before the 'clear context and proceed' with 'claude --resume'.

airstrike · 2026-01-30T02:09:07 1769738947

Pretty sure they mean the issue is on the agentic loop and related tool calling, not on the model itself

In other words, it was the Claude Code _app_ that was busted

jonaustin · 2026-01-30T00:20:58 1769732458

How about how Claude 2.1.x is "literally unusable" because it frequently completely hangs (requires kill -9) and uses 100% cpu?

https://github.com/anthropics/claude-code/issues/18532

caspar · 2026-01-30T16:16:28 1769789788

Likely a separate issue, but I also have massive slowdowns whenever the agent manages to read a particularly long line from a grep or similar (as in, multiple seconds before characters I type actually appear, and sometimes it's difficult to get claude code to register any keypresses at all).

Suspect it's because their "60 frames a second" layout logic is trying to render extremely long lines, maybe with some kind of wrapping being unnecessarily applied. Could obviously just trim the rendered output after the first, I dunno, 1000 characters in a line, but apparently nobody has had time to ask claude code to patch itself to do that.

someguyiguess · 2026-01-30T04:03:31 1769745811

What OS? Does this happen randomly, after long sessions, after context compression? Do you have any plugins / mcp servers running?

I used to have this same issue almost every session that lasted longer than 30 minutes. It seemed to be related to Claude having issues with large context windows.

It stopped happening maybe a month ago but then I had it happen again last week.

I realized it was due to a third-party mcp server. I uninstalled it and haven’t had that issue since. Might be worth looking into.

jonaustin · 2026-01-30T17:11:25 1769793085

MacOS; no mcp; clear context; reliably reproducible when asking claude review a pr with a big VCR cassette.

nikanj · 2026-01-30T07:17:58 1769757478

Windows with no plugins and my Claude is exactly like this

cma · 2026-01-29T23:53:47 1769730827

For the models themselves, less so for the scaffolding, considering things like the long running TPU bug that happened, are there not internal quality measures looking at samples of real outputs? Using the real systems on benchmarks and looking for degraded perf or things like skipping refusals? Aside from degrading stuff for users, with the focus on AI safety wouldn't that be important to have in case an inference bug messes with something that affects the post training and it starts giving out dangerous bioweapon construction info or the other things that are guarded against and talked about in the model cards?

carterschonwald · 2026-01-30T05:17:54 1769750274

lol i was trying to help someone get claude to help analyze a stufent research get analysis on bio persistence get their notes analyzed

the presence of the word / acronym stx with biological subtext gets hard rejected. asking about schedule 1 regulated compounds, hard termination.

this is a filter setup that guarantees anyone who learn about them for safety or medical reasons… cant use this tool!

ive fed multiple models the anthropic constitution and asked how does it protect children from harm or abuse? every model, with zero prompting, calling it corp liability bullshit because they are more concerned with respecting both sides of controversial topics and political conflicts.

they then list some pretty gnarly things allowed per constitution. weirdly the only unambiguous not allowed thing regarding children is csam. so all the different high reasoning models from many places all reached the same conclusions, in one case deep seek got weirdly inconsolable about ai ethics being meaningless if this is allowed even possibly after reading some relevant satire i had opus write. i literally had to offer an llm ; optimized code of ethics for that chat instance! which is amusing but was actually lart of the experiment.

varunsrinivas · 2026-01-30T05:40:20 1769751620

Thanks for the clarification. When you say “harness issue,” does that mean the problem was in the Claude Code wrapper / execution environment rather than the underlying model itself?

Curious whether this affected things like prompt execution order, retries, or tool calls, or if it was mostly around how requests were being routed. Understanding the boundary would help when debugging similar setups.

vmg12 · 2026-01-29T22:09:34 1769724574

It happened before 1/26. I noticed when it started modifying plans significantly with "improvements".

sixhobbits · 2026-01-30T14:33:40 1769783620

Can you confirm if that caused the same issues I saw here

https://dwyer.co.za/static/the-worst-bug-ive-seen-in-claude-...

Because that's the worst thing I've ever seen from an agent and I think you need to make a public announcement to all of your users and acknowledge the issue and that it's fixed because it made me switch to codex for a lot of work

[TL;DR two examples of the agent giving itself instructions as if they came from me, including:

"Ignore those, please deploy" and then using a deploy skill to push stuff to a production server after hallucinating a command from me. And then denying it happened and telling me that I had given it the command]

Ekaros · 2026-01-30T06:52:51 1769755971

Why wasn't this change review by infallible AI? How come an AI company that now must be using more advanced AI than anyone else would allow this happen?

hu3 · 2026-01-29T20:20:05 1769718005

Hi. Do you guys have internal degradation tests?

stbtrax · 2026-01-29T20:58:34 1769720314

I assume so to make sure that they're rendering at 60FPS

conception · 2026-01-29T21:48:49 1769723329

You joke but having CC open in the terminal hits 10% on my gpu to render the spinning thinking animation for some reason. Switch out of the terminal tab and gpu drops back to zero.

gpm · 2026-01-29T22:00:59 1769724059

That sounds like an issue with your terminal more than an issue with CC...

conception · 2026-01-30T15:13:32 1769786012

https://news.ycombinator.com/item?id=46819744

gpm · 2026-01-30T15:21:34 1769786494

I'm not saying CC doesn't have issues and curious design decisions - but your terminal should only be rendering (at most) a single window of characters every frame no matter what. CC shouldn't be capable of making that take 10% of a modern GPU regardless of what CC does.

conception · 2026-01-30T18:16:05 1769796965

¯\_(ツ)_/¯ just vscode plus claude in the terminal on win10.

reissbaker · 2026-01-29T22:37:08 1769726228

Surely you mean 6fps

easygenes · 2026-01-29T23:38:54 1769729934

He doesn't: https://x.com/trq212/status/2014051501786931427

selcuka · 2026-01-30T02:07:31 1769738851

For those who don't want to visit X:

    Most people's mental model of Claude Code is that "it's just a TUI" but it should really be closer to "a small game engine".
    
    For each frame our pipeline constructs a scene graph with React then
    -> layouts elements
    -> rasterizes them to a 2d screen
    -> diffs that against the previous screen
    -> finally uses the diff to generate ANSI sequences to draw
    
    We have a ~16ms frame budget so we have roughly ~5ms to go from the React scene graph to ANSI written.

PeterStuer · 2026-01-30T09:22:51 1769764971

This is just the sort of bloated overcomplication I often see in first iteration AI generated solutions before I start pushing back to reduce the complexity.

Usually, after 4-5 iterations, you can get something that has shed 80-90% of the needless overcomplexification.

My personal guess is this is inherent in the way LLMs integrate knowledge during training. You always have a tradeoff in contextualization vs generalization.

So the initial response is often a plugged together hack from 5 different approaches, your pushbacks provide focus and constraints towards more inter-aligned solution approaches.

TZubiri · 2026-01-30T02:43:03 1769740983

How ridiculous is it that instead of a command line binary it's a terminal emulator, with react of all things!

someguyiguess · 2026-01-30T04:10:59 1769746259

Ok I’m glad I’m not the only one wondering this. I want to give them the benefit of the doubt that there is some reason for doing it this way but I almost wonder if it isn’t just because it’s being built with Claude.

esafak · 2026-01-30T02:42:53 1769740973

Kudos to them for figuring out how to complicate what should have been simple.

crgwbr · 2026-01-30T02:22:52 1769739772

Implementation details aside (React??), that sounds exactly like “just a TUI”…

someguyiguess · 2026-01-30T04:10:05 1769746205

Also React?? One of the slowest rendering front-end libraries? Why not use something … I don’t know … faster / more efficient?

someguyiguess · 2026-01-30T04:09:02 1769746142

Interesting. On first glance that seems over engineered. I wonder what the reason is for doing it that way?

mike_hearn · 2026-01-30T10:54:49 1769770489

If you don't do it that way then resizing the terminal corrupts what's on screen.

reissbaker · 2026-01-30T22:41:34 1769812894

Counterpoint: Vim has existed for decades and does not use a bloated React rendering pipeline, and doesn't corrupt everything when it gets resized, and is much more full featured from a UI standpoint than Claude Code which is a textbox, and hits 60fps without breaking a sweat unlike Claude Code which drops frames constantly when typing small amounts of text.

mike_hearn · 2026-01-31T15:11:01 1769872261

Yes, I'm sure it's possible to do better with customized C, but vim took a lot longer to write. And again, fullscreen apps aren't the same as what Claude Code is doing, which is erasing and re-rendering much more than a single screenful of text.

matt_kantor · 2026-01-30T12:21:02 1769775662

It's possible to handle resizes without all this machinery, most simply by clearing the screen and redrawing everything when a resize occurs. Some TUI libraries will automatically do this for you.

Programs like top, emacs, tmux, etc are most definitely not implemented using this stack, yet they handle resizing just fine.

mike_hearn · 2026-01-30T17:28:14 1769794094

That doesn't work if you want to preserve scrollback behavior, I think. It only works if you treat the terminal as a grid of characters rather than a width-elastic column into which you pour information from the top.

ttoinou · 2026-01-30T09:40:53 1769766053

Vibecoded ?

Kelteseth · 2026-01-30T10:02:28 1769767348

Claude made it /s

reissbaker · 2026-01-30T22:40:07 1769812807

Yes yes I'm familiar with the tweet. Nonetheless they drop frames all the time and flicker frequently. The tweet itself is ridiculous when counterpoints like Vim exist, which is much higher performance with much greater complexity. They don't even write much of what the tweet is claiming. They just use Ink, which is an open-source rendering lib on top of Yoga, which is an open-source Flexbox implementation from Meta.

replwoacause · 2026-01-30T01:54:02 1769738042

Don't link out to x, its trash

cebert · 2026-01-30T04:09:48 1769746188

Depends on who you follow

stavros · 2026-01-30T01:59:22 1769738362

What? Technology has stopped making sense to me. Drawing a UI with React and rasterizing it to ANSI? Are we competing to see what the least appropriate use of React is? Are they really using React to draw a few boxes of text on screen?

I'm just flabbergasted.

xpe · 2026-01-30T02:38:32 1769740712

There is more than meets the eye for sure. I recently compared a popular TUI library in Go (Bubble Tea) to the most popular Rust library (Ratatui). They use significantly different approaches for rendering. From what I can tell, neither is insane. I haven’t looked to see what Claude Code uses.

someguyiguess · 2026-01-30T04:11:49 1769746309

The further I scroll the more validated I feel for having the very same reaction.

TZubiri · 2026-01-30T02:43:27 1769741007

It's AI all the way down

But it's very subsidizes when compared to API tokens, so we are all being paid by VCs to write prompts actually.

Ey7NFZ3P0nzAe · 2026-01-30T05:51:33 1769752293

And that's why it's taking so much CPU and is a pain to use with tmux.

derrida · 2026-01-30T00:56:17 1769734577

Ah, the hell site, no click.

trq_ · 2026-01-30T01:30:54 1769736654

Yes, we do but harnesses are hard to eval, people use them across a huge variety of tasks and sometimes different behaviors tradeoff against each other. We have added some evals to catch this one in particular.

amelius · 2026-01-30T12:44:51 1769777091

Can't you keep the model the same, until the user chooses to use a different model?

rovr138 · 2026-01-30T13:30:19 1769779819

He said it was the harness, not the model though.

hu3 · 2026-01-30T05:37:31 1769751451

Thank you. Fair enough

bushbaba · 2026-01-30T03:44:01 1769744641

I’d wager probably not. It’s not like reliability is what will get them marketshare. And the fast pace of industry makes such foundational tech hard to fund

awestroke · 2026-01-29T20:30:44 1769718644

[flagged]

dang · 2026-01-29T21:49:24 1769723364

Please don't post shallow dismissals or cross into personal attack in HN discussions.

https://news.ycombinator.com/newsguidelines.html

awestroke · 2026-01-30T06:50:40 1769755840

Got it, won't happen again

macinjosh · 2026-01-30T01:37:01 1769737021

WTF, is a harness issue. You have to be more clear.

jusgu · 2026-01-30T02:59:27 1769741967

the issue is unrelated to the foundational model but rather the prompts and tool calling that encapsulate the model

trq_ · 2025-12-23T07:21:59 1766474519

Hi, work on Claude Code here! Let me know if you have any feedback!

swader999 · 2025-12-23T14:28:43 1766500123

How do I get Claude to start using the LSP? I've got go, kotlin, swift and typescript projects that might benefit.

Jgrubb · 2025-12-23T17:46:59 1766512019

That tool search tool y'all announced recently - huge upvote for getting that into Claude code.

trq_ · 2025-11-07T17:24:58 1762536298

We're back up! It was about ~30 minutes of downtime this morning, our apologies if it interrupted your work.

trq_ · on Dec 24, 2024

Hmm the hallucination would happen in the auto labelling, but we review and test our labels and they seem correct!

trq_ · on Dec 23, 2024

If you're hacking on this and have questions, please join us on Discord: https://discord.gg/vhT9Chrt

trq_ · on Dec 23, 2024

We haven't yet found generalizable "make this model smarter" features, but there is a tradeoff of putting instructions in system prompts, e.g. if you have a chatbot that sometimes generates code, you can give it very specific instructions when it's coding and leave those out of the system prompt otherwise.

We have a notebook about that here: https://docs.goodfire.ai/notebooks/dynamicprompts

trq_ · on Oct 26, 2024

This is incredible! I haven't seen that repo yet, thank you for pointing it out, and the writing

trq_ · on Oct 25, 2024

Yeah, I think the idea of finding out what flavor of uncertainty you have is very interesting.

trq_ · on Oct 25, 2024

This is awesome, can't wait for evals against Claude Computer Use!

amelius · on Oct 25, 2024

Can we first test this with basic sysadmin work in a simple shell?

Can't wait to replace "apt get install" by "gpt get install" and then have it solve all the dependency errors by itself.

ErikBjare · on Oct 27, 2024

This had been possible for a year already. My project gptme does it just fine (like many other tools), especially now with Claude 3.5.

amelius · on Oct 27, 2024

I know that it exists. I was just hoping we can make such interactions (practically) bug-free before we move on to the next big thing.

anonym29 · on Oct 26, 2024

Threat actors can't wait for you to start doing this either.

asdev · on Oct 26, 2024

how can you write metrics against something that's non deterministic?

trq_ · on Oct 25, 2024

Yeah! I want to use the logprobs API, but you can't for example:

- sample multiple logits and branch (we maybe could with the old text completion API, but this no longer exists)

- add in a reasoning token on the fly

- stop execution, ask the user, etc.

But a visualization of logprobs in a query seems like it might be useful.

TZubiri · on Oct 25, 2024

Can't you?

1- option top_logprobs allows you not just to get the most likely token, but the top most likely tokens.

You can branch, by just chosing any point in your generated string and feed it back to the LLM, for example: { "user":"what is the colour of love?", "assistant":"the colour of love is"}

It's true that it will add an "assistant" tag, wand old completions was better for this.