This post just gives me more questions than answers and I'm unable to form a decision:
* Why was v3.4.1 the most buggy, right before the Claude commits? Why did "nobody notice"? It's way too strange to just say welp, it must be human error.
* Why does v3.4.2 have 0 bugs, or a bug score of 0? And why was such an outlier (no other commit seemingly has this??) allowed to mix into the aggregate statistics and bring all the "is Claude buggy?" scores down? Tbh idk how that _wasn't_ a red flag in the author's analysis...
This article feels like half of an analysis presented as a highly complex finished product due to all the advanced stats they're running.
> Why was v3.4.1 the most buggy, right before the Claude commits? Why did "nobody notice"? It's way too strange to just say welp, it must be human error.
Why wouldn't it be, other than question-begging priors that assume it couldn't be?
> Why does v3.4.2 have 0 bugs, or a bug score of 0? And why was such an outlier (no other commit seemingly has this??) allowed to mix into the aggregate statistics and bring all the "is Claude buggy?" scores down?
My original metrics, which didn't filter out feature requests and questions, had it at four bugs, and before that it was even higher. It didn't make much of a difference to the overall analysis (it fell well within the IQR, at the lower end of it too). Also, removing one outlier just because it looks kind of funny to you, especially when we only have two Claude releases at all, would be worse in my opinion, and more arbitrary.
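For anyone who wants to check an "is this within the IQR?" claim against their own numbers, here is a rough sketch. The per-release counts below are made-up placeholders, not the real data from the analysis, and `within_iqr` is just a name I picked.

```python
import statistics

def within_iqr(values, x):
    """Return True if x lies between the first and third quartile of values."""
    # "inclusive" treats values as the whole population of releases,
    # which is an assumption; "exclusive" would give slightly different cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1 <= x <= q3

# Hypothetical bug counts per release (placeholders only).
bug_counts = [2, 4, 4, 5, 7, 9, 12]

print(within_iqr(bug_counts, 4))  # a release with four bugs
print(within_iqr(bug_counts, 0))  # a release with zero bugs
```

Whether a count of 0 lands inside or outside the IQR depends entirely on the real distribution, which this sketch does not have.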
> Why wouldn't it be, other than question-begging priors that assume it couldn't be?
A multitude of reasons? A change in maintainer. A change in the mental state of a maintainer. A sudden focus by the community on a given undesirable behavior. Someone else here suggested Claude was in use before it was disclosed. The framing implies it was a human-produced coding error, but my point is that it could be _any other human error_, or even just some odd but benign human behavior (a stampede of bug submitters), affecting the data. None of that leads to the conclusion that AI code > human code. Not looking at these possibilities is so unsatisfying.
> My original metrics which didn't filter out feature requests...
It still feels like a lot of the weight of the phrase "If that doesn't look like a red flag to you, you'd be right." hinges on the fact that one of the versions has 0 bugs. That really killed the weight of the statement for me, because the oddity of there being 0 bugs just wasn't explained.
---
Could you please post the duckdb file that has the raw bug -> severity + version mapping to the GitHub repo? I'd like to dig into this myself.
Would be nice to see something referral-based. If you don't like X, block them. If X invited Y and Z and their invitees behave poorly, you can block the whole tree. Kinda like lobste.rs referrals, but for the wider internet.
I guess the corollary would be how you can block an entire ASN if you find a lot of abuse coming from it, but at the human-network level.
Aside from social dynamics, a chief issue is that if you're relying on this as a mechanism for content filtering, personal relations have low predictive value.
E.g., I may really like a person's content, but not their curation or referrals to other accounts. Conversely, I might not care for a person, but their recommendations may be excellent. More common might be the case that a given account produces little or no content of their own, but makes reliably predictable (either good or bad) recommendations, which would be useful for further filtering. Or the highly verbose individual who emits a constant stream of near-drek, but an occasional diamond.
Content production, content curation, and talent spotting are all distinct skills. Success or lack in any one says little about the others. This is where Bayesian indicators (including relationships and referrer / invite relations) would probably be more robust.
That said, tainting an entire invite tree is likely useful, with a caveat that if a particular invitee has an independent, untainted, relation, they might be worth following.
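A minimal sketch of what "block the whole tree, except people with an independent relation" could look like. The `invites` mapping, the `independent` set, and the function name are all hypothetical; no real platform's API is implied.

```python
from collections import deque

def blocked_accounts(invites, root, independent=frozenset()):
    """Return root plus everyone in root's invite subtree,
    minus accounts known to have an independent, untainted relation."""
    seen = set()
    queue = deque([root])
    while queue:
        account = queue.popleft()
        if account in seen:
            continue  # guard against malformed data with cycles
        seen.add(account)
        queue.extend(invites.get(account, ()))
    # Exempt independently vouched-for accounts, but never the root itself.
    return {a for a in seen if a == root or a not in independent}

# X invited Y and Z; Y invited W; W is also vouched for by someone you trust.
invites = {"X": ["Y", "Z"], "Y": ["W"]}
print(sorted(blocked_accounts(invites, "X", independent={"W"})))
```

One design choice to note: an exempted account's own invitees are still traversed and blocked here. Whether the exemption should extend downward is a judgment call.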
In practice what I've found most useful is to have a pretty tight primary list of follows, ~50 or fewer, and a slightly broader secondary list. Allow recommendations by default (that is, re-shares / boosts), but curtail those too if problematic. Be quite liberal in blocking / muting anything in the least bit annoying or problematic.
Or participate in a selective group with excellent moderation. HN isn't quite there, but it approaches this ideal more closely than any other major forum I'm aware of presently.
Just be careful, if you host your DNS at Cloudflare (maybe others?), they will rewrite your CAA record[0] if you use TLS with them. This is in the name of convenience but it was surprising when I first learned.
Is this the new norm for trying to make software projects in the wild?
The 14000 sends over 3 hours (roughly 1.3 per second, so under a second per send) sounds faster than human speed, i.e. automated.
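Quick arithmetic on the 14000-sends-over-3-hours figure, assuming the sends were spread evenly (which the report doesn't say):

```python
sends = 14_000
hours = 3

rate = sends / (hours * 3600)   # sends per second
seconds_per_send = 1 / rate     # average gap between sends

print(f"{rate:.2f} sends/s, one every {seconds_per_send:.2f} s")
# prints "1.30 sends/s, one every 0.77 s"
```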
Wondering if LLM-assisted vulnerability hunting will lead to the same gains in scale for bad actors wanting to find spammable channels in applications. The barrier to entry becomes so much greater because any small project, once found, can be wrung dry of all its trust signals by third parties.
Abuse such as this wasn't uncommon before; email platforms with lax rate limits have always been abused through their clients' unsecured infrastructure. The only difference in the post-LLM world is the number of platforms, as well as clients, popping up in this space with dubious code quality, which may lead to more attacks because:
a) having an email-sending product used to mean you had a project with a lot of effort and knowledge invested in it
b) the models used, tokens spent, and review done vary widely in the world of vibecoding, and there is a race to the bottom to produce, produce, produce. Quantity > quality
If you have a website somewhere with an unrestricted comment box, it gets spammed. That doesn't take a special AI, because for years there have been script kiddies scanning new domains, IP addresses on AWS, common wp-admin URLs, etc.
A bit culty, if I never hear talk about hatting and going on post again I'll be quite happy. So many practices and stuff that I didn't realize were Scientology-related until I looked them up later
I've never had a reliability issue with Vaultwarden. Hosted it 5+ years now. Even with random off/on of the server and other bumps in the road in life, the Docker container I run has had no issues with hosting. The user interface is friendly but can be just a little slow.
Mine is not exposed to the public internet, though some friends of mine do expose theirs. I use a VPN when I need to access fresh data from the home server; otherwise both the Firefox client and the Android client will generally keep a cache of the last data pull from when they had a connection (so it wasn't an issue for the 4 or so years before I had a VPN).