Hacker News | dakshgupta's comments

The signal-to-noise ratio problem is unexpectedly difficult.

We wrote about our approach to it some time ago here - https://www.greptile.com/blog/make-llms-shut-up

Much has changed in our approach since then, so we'll probably write a new blog post.

The tl;dr of what makes it hard:

- different people have different ideas of what a nitpick is

- it's not a spectrum; the differences are qualitative

- LLMs are reluctant to risk downplaying the severity of an issue and therefore are unable to usefully filter out nits

- theory: they are paid by the token and so they say more stuff


Very interesting! Yes, everything you say aligns with my experience and instincts.



I agree that none perform _super_ well.

I would argue they go far beyond linters now, which was perhaps not true even nine months ago.

To the degree you consider this to be evidence, in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.


I fully agree. Claude's review comments have been 50% useful, which is great. For comparison, I have almost never found a useful TeamScale comment (classic static analyzer). Even more important, half of Claude's good finds are orthogonal to those found by the human reviewers on our team. I.e., it consistently points out things human reviewers miss, and vice versa.


TBH that sounds like TeamScale just has too verbose default settings. On the other hand, people generally find almost all of the lints in Clippy's [1] default set useful, but if you enable "pedantic" lints, the signal-to-noise ratio starts getting worse – those generally require a more fine-grained setup, disabling and enabling individual lints to suit your needs.

[1] https://doc.rust-lang.org/stable/clippy/


> To the degree you consider this to be evidence, in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.

do you have a bot to do this too?


For it to be evidence, you would need to know the number of Greptile comments made and how many of those were considered poor. You need to contrast the false positive rate with the true positive rate just to plot a single point on a classifier curve. To compare against a control group of experts or a static linter, you would then need to vary the "conservativeness" of the classifier to produce multiple points along its ROC curve, at which point you could judge whether the classifier is better or worse than your control by comparing the two curves.

Sample number of true positives says more or less nothing on its own.
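A toy sketch of the point above: a raw count of "good catches" (true positives) fixes only one coordinate of one point on an ROC curve. The denominators below are made up purely for illustration.

```python
# A single ROC point needs both rates, not just a true-positive count.
def roc_point(true_positives, false_positives,
              total_real_issues, total_non_issues):
    tpr = true_positives / total_real_issues   # sensitivity
    fpr = false_positives / total_non_issues   # fall-out
    return fpr, tpr

# Same 9,078 true positives, radically different operating points,
# depending on denominators we don't know:
print(roc_point(9_078, 1_000, 10_000, 100_000))   # low-noise reviewer
print(roc_point(9_078, 90_000, 10_000, 100_000))  # noisy reviewer
```

Both hypothetical reviewers report the same praise count; only the (unknown) comment totals separate a sharp reviewer from a spammy one.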


That sounds more like confirmation that Greptile is being included in a lot of agentic coding loops than anything else.


I like number of "great catches" as a measure of AI code review effectiveness


People more often say that to save face, implying the issue you identified would be reasonable for the author to miss because it's subtle or tricky or whatever. It's often a proxy for embarrassment.


When mature, functional adults say it, the read is "wow, I would have missed that, good job, you did better than me".

Reading embarrassment into that is extremely childish and disrespectful.


What I'm saying is that a corporate or professional environment can make people communicate in weird ways due to various incentives. Reading into people's communication is an important skill in these kinds of environments, and looking superficially at their words can be misleading.


I mean, how far Rust's own Clippy lints went before any LLMs was actually insane.

Clippy + Rust's type system would basically ensure my software was working as close as possible to my spec before the first run. LLMs have greatly lowered the bar for bringing Clippy-quality linting to every language, but at the cost of determinism.


Not trying to sidetrack, but a figure like that is data, not evidence. At the very minimum you need context which allows for interpretation; 9,078 positive author comments would be less impressive if Greptile made 1,000,000 comments in that time period, for example.


Over 7 days does contextualize it some, though.

9,078 comments / 7 (days) / 8 (hours) = 162.1, so if human, that person is making 162 comments an hour, 8 hours a day, 7 days a week?


Bro stop trying to deflate the boosters, they got wares to sell and shares to dump.


2. There is plenty of evidence for this elsewhere on the site, and we do encourage people to try it because like with a lot of AI tools, YMMV.

You're totally right that PR reviews go a lot farther than catching issues and enforcing standards. Knowledge sharing is a very important part of it. However, there are processes you can create to enable better knowledge sharing and let AI handle the issue-catching (maybe not fully yet, but in time). Blocking code from merging because knowledge isn't shared yet seems unnecessary.


> Independence

It is, but when the model/harness/tools/system prompts are the same or similar, the generator and reviewer fail in similar ways. Question: Would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?

> Autonomy

Plenty of tools have invested heavily in AI-assisted review - creating great UIs to help human reviewers understand and check diffs. Our view is that code validation will be completely autonomous in the medium term, so our system is designed to make all human intervention optional. This is possibly an unpopular opinion, and we respect the camp that says people will always review AI-generated code. It's just not the future we want for this profession, nor the one we predict.

> Loops

You can invest in UX and tooling that makes this easier or harder. Our first step towards making this easier is a native Claude Code plugin, available via the `/plugins` command, that lets Claude Code run a plan, write, commit, get review comments, plan, write loop.


Independence is ridiculous - the underlying LLMs are too similar in their training data and methodologies to be anything like independent. Trying different models may somewhat reduce the dependency, but all have read Stack Overflow, Reddit, and GitHub in their training.

It might be an interesting time to double down on automatically building and checking deterministic models of code which were previously too much of a pain to bother with. Eg, adding type checking to lazy python code. These types of checks really are model independent, and using agents to build and manage them might bring a lot of value.
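As a minimal sketch of the kind of deterministic, model-independent check meant here (the function names are hypothetical), adding type annotations to previously untyped code lets a checker like mypy reject unsound calls before anything runs:

```python
from typing import Optional

def parse_port(raw: str) -> Optional[int]:
    """Return the port as an int, or None if raw isn't numeric."""
    return int(raw) if raw.isdigit() else None

def connect(host: str, port: int) -> str:
    return f"{host}:{port}"

port = parse_port("8080")
# A type checker would reject connect("db", port) here: parse_port may
# return None, so the caller is forced to narrow it explicitly.
if port is not None:
    print(connect("db", port))  # prints "db:8080"
```

The check is the same no matter which agent wrote or reviewed the code, which is exactly the independence property the LLM-reviews-LLM setup lacks.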


> Would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?

You're assuming models/prompts insist on a previous iteration of their work being right. They don't. Models try to follow instructions, so if you ask them to find issues, they will. 'Trust' is a human problem, not a model/harness problem.

> Our view is that code validation will be completely autonomous in the medium term.

If reviews are going to be autonomous, they'd be part of the coding agent. Nobody would see it as the independent activity you mentioned above.

> Our first step towards making this easier is a native Claude Code plugin.

Claude can review code based on a specific set of instructions/context in an MD file. An additional plugin is unnecessary.

My view is that to operate in this space, you gotta build a coding agent or get acquired by one. The writing was on the wall a year ago.


> It is, but when the model/harness/tools/system prompts are the same or similar, the generator and reviewer fail in similar ways.

Is there empirical evidence for that? Where is it on an epistemic meter between (1) "it sounds good when I say it" and (10) "someone ran an evaluation and got significant support"?

“Vibes” (2/3 on scale) are ok, just honestly curious.


How would you measure code quality? Would persistence be a good measure?


"It's difficult to come up with a good metric" doesn't imply "we should use a known-bad metric".

I'm kind of baffled that "lines of code" seems to have come back; by the 1980s people were beginning to figure out that it didn't make any sense.


Bad code can persist because nobody wants to touch it.

Unfortunately I’m not sure there are good metrics.


That question has been baffling product managers, scrum masters, and C-suite assholes for decades. Along with how you measure engineering productivity.


The folks at Stanford in this video have a somewhat similar dataset, and they account for "code churn" i.e. reworking AI output: https://www.youtube.com/watch?v=tbDDYKRFjhk -- I think they do so by tracking if the same lines of code are changed in subsequent commits. Maybe something to consider.


I don't know if code is literacy but I think measuring code quality is somehow like measuring the quality of a novel.


The way DORA does. Error rate and mean time to recovery.


This is a great suggestion. I'll note it down for next year's. Curious, do you think this would be a good proxy for code quality?


I would consider feature complete with robust testing to be a great proxy for code quality. Specifically, that if a chunk of code is feature complete and well tested and now changing slowly, it means -- as far as I can tell -- that the abstractions contained are at least ok at modeling the problem domain.

I would expect code that continually changes and deprecates and creates new features is still looking for a good problem domain fit.


Most of our customers are enterprises, so I feel relatively comfortable assuming they have some decent testing and QA in place. Perhaps I am too optimistic?


That sounds like an opportunity for some inspection; coverage, linting (type checking??), and a by-hand spot check to assess the quality of testing. You might also inspect the QA process (ride-along with folks from QA).


It's tricky, but one can assume that code written once and not touched in a while is good code (didn't cause any issues, performance is good enough, etc.).

I guess you can already derive this value if you sum the total lines changed by all PRs and divide by (SLOC end - SLOC start). Ideally it should be a value slightly greater than 1.
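For illustration, the ratio described above could be sketched like this (the function name and numbers are hypothetical):

```python
# Churn ratio: total lines changed across all PRs divided by the net
# growth in SLOC. Values near 1 mean most changes were new code that
# stuck; much larger values mean heavy rework.
def churn_ratio(lines_changed_per_pr, sloc_start, sloc_end):
    total_changed = sum(lines_changed_per_pr)
    net_growth = sloc_end - sloc_start
    if net_growth <= 0:
        return None  # shrinking or flat codebase; ratio is undefined
    return total_changed / net_growth

# e.g. three PRs touching 120, 80, and 50 lines while SLOC grew from
# 10,000 to 10,200 gives 250 / 200 = 1.25
print(churn_ratio([120, 80, 50], 10_000, 10_200))  # prints 1.25
```

Note the divide-by-zero case: a codebase that refactors without growing makes the ratio undefined, which hints at one limitation of the metric.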


It depends on how well you vetted your samples.

fyi: You headline with "cross-industry", lead with fancy engineering-productivity graphics, then caption them with small print saying it's from your internal team data. Unless I'm completely missing something, it comes off as a little misleading and disingenuous. Maybe intro with what your company does and your data collection approach.


Apologies, that is poor wording on our part. It's internal data from engineers that use Greptile, which are tens of thousands of people from a variety of industries. As opposed to external, public data, which is where some of the charts are from.


This is per month, I see now that's not super clear on the chart!


We're careful not to draw any conclusions from LoC. The fact is that LoC is higher, which by itself is interesting. This could be a good or a bad thing depending on code quality, which itself varies wildly person-to-person and agent-to-agent.


When the heading above it says "Developer output increased by x" I think you're very much drawing conclusions


Can you expand on why it is interesting?


Because it's different. Change is important to track


We weren’t able to agree on a good way to measure this. Curious - what’s your opinion on code churn as a metric? If code simply persists over some number of months, is that indication it’s good quality code?


I've seen code persist a long time because it is unmaintainable gloop that takes forever to understand and nobody is brave enough to rebuild it.

So no, I don't think persistence-through-time is a good metric. Probably better to look at cyclomatic complexity, and maybe, for a given code path or module or class hierarchy, how many calls it makes within itself vs. to things outside the hierarchy - some measure of how many files you need to jump between to understand it.
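As a rough illustration of the cyclomatic-complexity idea (a deliberate simplification; real tools like radon count more carefully):

```python
import ast

# Approximate cyclomatic complexity by counting decision points
# (if/for/while/except/boolean operators) in a piece of source.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                  ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    tree = ast.parse(source)
    # Base complexity is 1; each decision point adds one branch.
    return 1 + sum(isinstance(node, DECISION_NODES)
                   for node in ast.walk(tree))

snippet = """
def classify(x):
    if x < 0:
        return "neg"
    for i in range(x):
        if i % 2 == 0 and i > 2:
            return "even-ish"
    return "other"
"""
print(cyclomatic_complexity(snippet))  # prints 5
```

Unlike persistence, this can be computed on day one and doesn't reward code that survives merely because nobody dares touch it.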


I second the point about persistence. Some of the most persistent code we own persists because it's untested and poorly written, but managed to become critical infrastructure early on. Most new tests are best-effort black-box tests and guesswork, since the creators left a long time ago.

Of course, feeding the code to an LLM makes it really go to town. And break every test in the process. Then you start babying it to do smaller and smaller changes, but at that point it’s faster to just do it manually.


You run a company that does AI code review, and you've never devised any metrics to assess the quality of code?


We have ways to approximate our impact on code quality, because we track:

- Change in number of revisions made between open and merge before vs. after greptile

- Percentage of greptile's PR comments that cause the developer to change the flagged lines

Assuming the author will only change their PR for the better, this tells us whether we're impacting quality.

We haven't yet found a way to measure absolute quality, beyond that.


Might be harder to track but what about CFR or some other metric to measure how many bugs are getting through review before versus after the introduction of your product?

You might respond that ultimately, developers need to stay in charge of the review process, but tracking that kind of thing reflects how the product is actually getting used. If you can prove it helps to ship features faster as opposed to just allowing more LOC to get past review (these are not the same thing!) then your product has a much stronger demonstrable value.


I've seen code entropy suggested as the heuristic to measure.

