Hacker News | ericpauley's comments

Yeah this seems silly. You can do the same thing in git (add and commit with the conflict still there)! Why you would want to is a real mystery.

It allows review of the way the merge conflict has been resolved (assuming those changes are tracked and presented in a useful way). This can be quite helpful when backporting select fixes to older branches.
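For reference, the git version of this workflow looks something like the following. The repo setup is a throwaway illustration; the point is that once all conflicted paths are staged, a plain `git commit` concludes the merge with the markers still in the tree, and the later resolution commit is an ordinary, reviewable diff.

```shell
# Throwaway demo: commit a merge conflict as-is, then resolve it in a
# follow-up commit whose diff is a normal, reviewable patch.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email demo@example.com && git config user.name demo
echo base > f.txt && git add f.txt && git commit -qm base
main=$(git symbolic-ref --short HEAD)
git checkout -qb feature && echo feature > f.txt && git commit -qam feature
git checkout -q "$main" && echo mainline > f.txt && git commit -qam mainline
git merge feature || true          # stops with a conflict in f.txt
git add f.txt                      # stage it, conflict markers and all
git commit -qm "merge feature (conflict committed unresolved)"
grep -c '<<<<<<<' f.txt            # the markers really are in the tree
echo resolved > f.txt && git commit -qam "resolve conflict"
git show HEAD                      # the resolution, as a plain diff
```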

Insisting on saying VoIP to the Mint rep instead of WiFi Calling (the term used by Apple, Google, Mint, and practically everyone else) is asking for a bad time.

Indeed. To a carrier, VoIP means WhatsApp, Discord, Google Meet.

Yes, but Waymo also has to drive on the road with those drivers, and these stats include crashes that are their fault. Diligent drivers get hit by drunk/distracted drivers all the time.

My thoughts exactly. By this logic both are fragile because they run over lossy wireless networks.

The composability of TLS/HTTP is really a beautiful thing.


Again a model issue. At the risk of coming off as a thread-wide apologist, here are my results on Opus:

Good:

> The research is generally positive but it’s not unconditionally “good for you” — the framing matters.

> What the evidence supports for moderate consumption (3-5 cups/day): lower risk of type 2 diabetes, Parkinson’s, certain liver diseases (including liver cancer), and all-cause mortality……

Bad:

> The premise is off. Moderate daily coffee consumption (3-5 cups) isn’t considered bad for you by current medical consensus. It’s actually associated with reduced risk of type 2 diabetes, Parkinson’s, and some liver diseases in large epidemiological studies.

> Where it can cause problems: Heavy consumption (6+ cups) can lead to anxiety, insomnia……

These aren’t just my own one-off examples. Claude dominates the BSBench: https://petergpt.github.io/bullshit-benchmark/viewer/index.v...


The BSBench is such a fantastic resource - thank you for sharing.

We should really be citing benchmarks rather than trading anecdata every time someone brings up hallucinations.


What I do for questions like these is read what medical researchers have published. The first one I read was https://pmc.ncbi.nlm.nih.gov/articles/PMC5696634/

> Coffee consumption was more often associated with benefit than harm for a range of health outcomes across exposures including high versus low, any versus none, and one extra cup a day. There was evidence of a non-linear association between consumption and some outcomes, with summary estimates indicating largest relative risk reduction at intakes of three to four cups a day versus none, including all cause mortality (relative risk 0.83, 95% confidence interval 0.83 to 0.88), cardiovascular mortality (0.81, 0.72 to 0.90), and cardiovascular disease (0.85, 0.80 to 0.90). High versus low consumption was associated with an 18% lower risk of incident cancer (0.82, 0.74 to 0.89). Consumption was also associated with a lower risk of several specific cancers and neurological, metabolic, and liver conditions. Harmful associations were largely nullified by adequate adjustment for smoking, except in pregnancy, where high versus low/no consumption was associated with low birth weight (odds ratio 1.31, 95% confidence interval 1.03 to 1.67), preterm birth in the first (1.22, 1.00 to 1.49) and second (1.12, 1.02 to 1.22) trimester, and pregnancy loss (1.46, 1.06 to 1.99). There was also an association between coffee drinking and risk of fracture in women but not in men.

> Conclusion Coffee consumption seems generally safe within usual levels of intake, with summary estimates indicating largest risk reduction for various health outcomes at three to four cups a day, and more likely to benefit health than harm.
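For intuition, a relative risk of 0.83 is a 17% relative reduction, but what that means in absolute terms depends entirely on the baseline rate. The 1% baseline below is a made-up illustration, not a figure from the paper; only `rr` comes from the quoted abstract.

```python
# Hypothetical numbers: only rr comes from the quoted abstract.
rr = 0.83                   # relative risk, all-cause mortality, 3-4 cups/day
baseline = 0.01             # assumed baseline risk of 1% (illustrative)
with_coffee = baseline * rr
print(f"relative reduction: {1 - rr:.0%}")                    # 17%
print(f"absolute risk: {baseline:.2%} -> {with_coffee:.2%}")  # 1.00% -> 0.83%
```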

When I'm looking for medical advice, I want that advice to list things like "coffee drinking might not be safe during pregnancy".

Furthermore, the statement 'Heavy consumption (6+ cups) can lead to anxiety, insomnia ...' assumes caffeinated coffee, yes? The paper I linked to also discusses decaffeinated coffee, e.g.:

> High versus low intake of decaffeinated coffee was also associated with lower all cause mortality, with summary estimates indicating largest benefit at three cups a day (0.83, 0.85 to 0.89)28 in a non-linear dose-response analysis. ...

> Coffee consumption was consistently associated with a lower risk of Parkinson’s disease, even after adjustment for smoking, and across all categories of exposure.22 76 77 Decaffeinated coffee was associated with a lower risk of Parkinson’s disease, which did not reach significance. ...

> there were no convincing harmful associations between decaffeinated coffee and any health outcome.

That nuance seems important.

Also note that this paper is incomplete as it investigated defined health outcomes, not physiological outcomes like anxiety. There are plenty more papers, like https://academic.oup.com/eurheartj/article/46/8/749/7928425?... , which considers the time that people drink coffee, also discusses decaffeinated coffee, and highlights the uncertainty about the effect of heavy coffee drinking.

I don't see why I should care to ask an AI when it's so easy to find well-written research results which are far more likely to cover relevant edge cases.


Sure, LLMs make mistakes, but have you looked at the accuracy of the average top search results recently? The SERPs are packed with SEO-infested articles that are all written by LLMs anyway (and almost universally worse ones than you could use yourself). In many cases the stakes are low enough (and the cost of manually sifting through the junk high enough) that it’s worth going with the empirically higher-quality answer rather than the SEO spam.

This of course doesn’t apply to high-stakes settings. In those cases I find LLMs are still a great information-retrieval approach, but as a starting point for manual vetting.


This is an oft-repeated meme, but I’m convinced the people saying it are either blindly repeating it, using bad models/system prompts, or some other issue. Claude Opus will absolutely push back if you disagree. I routinely push back on Claude only to discover on further evaluation that the model was correct.

As a test I just did exactly what you said in a Claude Opus 4.6 session about another HN thread. Claude considered* the contradiction, evaluated additional sources, and responded backing up its original claim with more evidence.

I will add that I use a system prompt that explicitly discourages sycophancy, but this is a single sentence expression of preference and not an indication of fundamental model weakness.

* I’ll leave the anthropomorphism discussions to Searle; empirically this is the observed output.


If you have 10,000 people flipping coins over and over, one person will be experiencing a streak of heads, another a streak of tails.

Which is to say: of a million people who just started playing with LLMs, most will get hit-or-miss results, one guy will win the neural-net lottery and have the AI nail every request, and some poor bloke trying to see what all the hype is about won’t get a single response that isn’t fully hallucinated garbage.
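The coin-flip version of this multiplicity effect is easy to sanity-check with a toy simulation (all numbers below are arbitrary): a streak that is rare for any one person is near-certain to hit someone in a large crowd.

```python
# Toy illustration: a long run that is rare per person is near-certain
# across a crowd. All parameters here are arbitrary choices.
import random
random.seed(0)

def longest_run(n_flips: int) -> int:
    """Longest run of identical outcomes in n_flips fair coin flips."""
    flips = [random.randrange(2) for _ in range(n_flips)]
    best = cur = 1
    for prev, nxt in zip(flips, flips[1:]):
        cur = cur + 1 if prev == nxt else 1
        best = max(best, cur)
    return best

people, flips, streak = 10_000, 50, 10
lucky = sum(longest_run(flips) >= streak for _ in range(people))
print(f"{lucky} of {people} saw a run of {streak}+")  # typically several hundred
```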


Sure, but that doesn’t explain the volume of these complaints. I think the more likely answer is the pitiful sycophancy of some models as demonstrated in BSBench.

Claude Opus 4.6 is the best possible model to use in this test, with the least sycophancy. OpenAI and Gemini models are bad in comparison.

ChatGPT thinking models are very good; the instant model is bad. Gemini is always desperate to find an answer, and will give you one no matter what.

Nope, I use GitHub Copilot (agentic mode) and I end up having to use the (more expensive) Claude model because ChatGPT never second-guesses me or even itself. Gemini is slightly worse though.

For a less biased source, check out BSBench (where Claude dominates, and the highest rating GPT is 2x worse): https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

I have access to my boss's ChatGPT account and it is unusable sycophancy slop, horrible to read because all the information is buried under endless emojis and the like. And it's almost impossible to tell whether the LLM is wrong or right: every answer looks the same, often with a "my final answer" at the end. It's a mess.

I'm using Claude Opus 4.6 and it is much calmer, more "professional" in tone, with much more information and almost no fluff.


Thank you for saying this. ChatGPT is SO BAD. I suspect anyone who says OpenAI models are good is either lying or botting.

> anthropomorphism

I think it's a topic worthy of discussion. But I would probably not leave it to Searle...


Can you share your system prompt?

I'm seeing the described behavior with whatever the default system prompt is in Claude Code.

Of course. They are using it wrong, their prompts are bad and actually they should try the latest model. It's always the same.

It should be noted that MaxSAT 2024 did not include z3, as with many competitions. It’s possible (I’d argue likely) that the agent picked up on techniques from Z3 or some other non-competing solver, rather than actually discovering some novel approach.

Z3 is capable (it’s an SMT solver, not just a SAT solver), but it’s not very fast at boolean satisfiability and not at all competitive with modern SOTA SAT solvers. Try comparing it to, e.g., Chaff or Glucose.
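For context, the problem these solvers decide is CNF satisfiability. A stdlib-only brute-force checker (toy formula of my own, nothing to do with Z3's or Glucose's internals, which use CDCL rather than enumeration) makes the problem statement concrete:

```python
# What "boolean satisfiability" asks, in miniature: brute-force over all
# assignments. Fine for toy formulas; real solvers use CDCL, not enumeration.
from itertools import product

# CNF as lists of literals: positive int = variable, negative = its negation.
cnf = [[1, 2], [-1, 3], [-2, -3]]   # (x1 v x2) & (!x1 v x3) & (!x2 v !x3)
n = 3                               # number of variables

def satisfiable(cnf, n):
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in clause)
               for clause in cnf):
            return True
    return False

print(satisfiable(cnf, n))  # True: e.g. x1=T, x2=F, x3=T
```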

Or for that matter even from later versions of the same solvers that were in its training data!

True. I’d be curious whether a combination of matching comp/training cutoff and censoring web searches could yield a more precise evaluation.

As it's from 2024 (MaxSAT was not held in 2025), it's quite likely all the solvers are in the training data. So the interesting part here is the instances for which we actually got better costs than what is currently known (in the best-cost.csv file).

As GP noted the issue is that even better versions than competed in MaxSAT are likely in the training data or web resources.

Is z3 competitive in SAT competitions? My impression was that it is popular due to the theories, the python API, and the level of support from MSR.

Funnily, this was precisely the question I had after posting this (and the topic of an LLM disagreement discussed in another thread). Turns out not, but sibling comment is another confounding factor.


I used Claude through Copilot for so long before switching to CC. Even for the same model the difference is shocking. Copilot’s harness and the underlying Claude models are not well matched compared to the vertically integrated Claude Code harness.
