> The analysis uses a single metric: bugs per 10 commits (bugs/10c).
Bugs per commit as a metric papers over severity, both in terms of security severity as well as the effect on the user. A mislabeled button has the same weight as the entire app crashing in this framework.
IMO "bugs per commit" is even worse than that, because, in addition to what you say, it also hides the extraordinary spike of commit activity of a project that had previously been stable. [0]
It is the exact metric you'd choose if you wanted to make the current situation of rsync look like not a big deal.
Yes, but we know why there was an "extraordinary spike," and it has nothing to do with rsync being "vibe coded." The maintainer has directly addressed this.
I link to it multiple times in TFA and quote the specific thing I'm talking about here in there to explain that possible confounder. I think I've done more than the work I'm obligated to do to make all of the relevant information available to you. You are just refusing to use it.
I am not finding these links in TFA. I see a link to an issue #929 which (as mentioned in TFA) has over 350 replies, and an opinionated summary of what transpired, including some detailed description of specific posts there. However, I did not find the maintainer's response.
EDIT 2 (and advice on design): The page design changes backgrounds after the results sections, which kind of conveys to the user that they have reached the end of what is important and can just skim over the rest (usually pages have a radical change in typography like this when you’ve reached the comment section). However, this is what is analogous to a discussion in a typical paper, and is arguably the most important part. I had simply assumed that you just left it at the result and skipped the discussion as a stylistic choice.
> EDIT: Found it! it is in the (untitled) discussion section (after the results).
I also paraphrase Tridge himself explicitly saying that this is why commits/releases have increased:
> Essentially, this isn't a "Claude" problem, it's a "more security work" problem, something that Tridge himself confirmed in his response, describing how a flood of AI-generated CVE reports forced rapid, extensive changes to rsync's attack surface.
> The page design changes backgrounds after the results sections, which kind of conveys to the user that they have reached the end of what is important and can just skim over the rest (usually pages have a radical change in typography like this when you’ve reached the comment section). However, this is what is analogous to a discussion in a typical paper, and is arguably the most important part. I had simply assumed that you just left it at the result and skipped the discussion as a stylistic choice.
Good point, I assumed everyone would read till the end, that's on me. I'll give it a heading.
I've now resolved this. The new version, which should be live on GH Pages soon, uses — what I think is — a pretty good methodology for assigning severity to each bug, normalizes it to 0.0-1.0, sums that, and treats that as the total severity weighted bugs, then does the analysis based on that. It did not change the analysis in any material way.
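For anyone who wants to see the shape of that calculation, here is a minimal sketch. It is not the code behind the analysis: the `Bug` class, the 0-10 raw scale, and the sample bugs are all assumptions made up for illustration. It only shows the "normalize to 0.0-1.0, sum, divide by commits" step described above.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    title: str
    raw_severity: float  # hypothetical raw score on a 0-10 scale


def severity_weighted_per_10_commits(bugs, commits, max_raw=10.0):
    """Sum of severities normalized to 0.0-1.0, expressed per 10 commits."""
    if commits <= 0:
        raise ValueError("commits must be positive")
    # Clamp each normalized score into [0.0, 1.0] before summing.
    weighted = sum(min(max(b.raw_severity / max_raw, 0.0), 1.0) for b in bugs)
    return 10.0 * weighted / commits


# Made-up example: a cosmetic bug and a crash no longer count the same.
example_bugs = [Bug("mislabeled button", 1.0), Bug("app crashes on start", 9.0)]
example_rate = severity_weighted_per_10_commits(example_bugs, commits=40)
print(example_rate)
```

With plain bug counting, those two bugs would contribute 2 bugs to the numerator; with the weighting they contribute 0.1 and 0.9 respectively.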
A commit is a measure of nothing. Severity weighted bugs per unit of nothing? What does that even mean? In any repo it's trivial to achieve a sev/10c that's arbitrarily close to zero while completely ruining everything.
I suggest you practice some humility and update your conclusion instead of updating the mental gymnastics you used to arrive at the same conclusion.
Whether commits decrease the sev/10c depends on whether there are a lot of small commits increasing the denominator. In reality, we have the opposite: the post-Claude releases have way fewer commits than the pre-Claude ones.
Thus, if anything their sev/10c is inflated. If I changed it to lines of code changed, the relative bug ratios would be much smaller, and the conclusion wouldn't change. In fact, the conclusion would look "better" for Claude; if I was using "mental gymnastics" to come to this conclusion, I would have already used a metric other than adjusting per commits!
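The denominator argument on both sides of this exchange can be made concrete with a small sketch. The numbers below are invented purely for illustration (they are not rsync data): the same hypothetical change, with the same severity-weighted bug total, is split into few or many commits, and the rate is computed per 10 commits and per 1000 lines changed.

```python
def rate(weighted_bugs, denominator, per):
    """Severity-weighted bugs per `per` units of the chosen denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return per * weighted_bugs / denominator


# Hypothetical release: identical code and bugs, different commit granularity.
weighted_bugs = 3.0
lines_changed = 6000

results = {}
for commits in (20, 200):
    results[commits] = (
        rate(weighted_bugs, commits, per=10),          # sev per 10 commits
        rate(weighted_bugs, lines_changed, per=1000),  # sev per 1000 lines
    )
    print(commits, results[commits])
```

Splitting the same work into ten times as many commits cuts sev/10c by a factor of ten, while the per-line rate does not move. Fewer, larger commits push sev/10c the other way. Which direction applies to a given release depends on the actual commit counts, which this sketch does not settle.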
What different metric would you suggest that would change the conclusion?
Showing "humility", as you so moralistically and condescendingly put it, would require being wrong first.
> In reality, we have the opposite: the post-Claude releases have way fewer commits than the pre-Claude ones.
No, they don't, you just made that up.
> What different metric would you suggest that would change the conclusion?
What would be a lot more useful to know is whether or not the original prompt used to generate this post instructed you to do a fair and unbiased review of these bugs, and whether or not that prompt itself was framed in a fair and unbiased way. If you take a piece of paper and write "therefore Claude is not at fault" at the bottom, then nothing you write above that line is admissible, no matter how well-reasoned.
There was no analysis of severity in all of the rage posting that occurred. The single point being pushed was "use of an LLM led/leads to more bugs". The author specifically states that's what they're addressing (blunt accusation -> blunt response).
The specific problems mentioned were all reasonably severe. The original post itself described a show-stopping bug:
So my systems recently updated to rsync 3.4.3, and as soon as that happened my backup system - which does incremental backups using multiple --compare-dest= arguments - started to fail on anything but a full backup.
Incremental backups are perhaps the primary use of rsync, and they were broken for this person. That's pretty severe.
The second reply is similar:
i wondered why my 3d printers were running like sh*t and at 100% cpu; turns out log2ram uses rsync.
This one I took with a grain of salt, since it read more like a dogpile than an actual bug report. However, if it's genuine, it's also reasonably severe.
Later in the comments, someone attempted to provide a list of issues that had been added: https://github.com/RsyncProject/rsync/issues/929#issuecommen.... The list included several failures to build or run rsync that appear to have resulted from broken backward compatibility. That seems reasonably severe. If intentional, I would have expected mention in the release notes about the removal of backwards compatibility, but none was made.
The issue comments had already degraded into a lot of unnecessary vitriol even before the above-mentioned comment, and it only gets worse from there, so I stopped. But the fact remains that the whole issue started with a severe bug.
I applaud the attempt at dispassionately analyzing whether the recent LLM releases of rsync were normal or outliers as far as bugs are concerned, but I don't think you can do so properly without analyzing severity.
To keep such an analysis fair and contextually relevant, it would have to be extended to the previous 928 issues as well (of course filtering for bug reports). I don't see anyone doing such an analysis, I think because they don't expect they'd find it useful (at least not as the rage fuel that many are seeking); what they'd be more likely to find is that there is a similar severity-mix going all the way back to v1.0.0, because these things inevitably happen whether coding is done by human or machine.
"A lot of claims in the wider discussion have treated every recent bug report as if it had the same cause. That is not accurate. Some reports were regressions from recent security hardening, some were missing historical test coverage, some were older bugs found because rsync suddenly had more eyes on it (especially by AI that can find issues quickly) and some were packaging or environment-specific failures. A Co-authored-by line is not enough by itself to establish root cause." - https://github.com/RsyncProject/rsync/issues/929#issuecommen...
Pinyin is widely used, but pinyin’s primacy is oversold. Chinese texts start with teaching Chinese characters - many are recognizable to children from daily exposure to begin with, so they don’t need the pinyin. Pinyin only comes in when the character is genuinely unknown.
I think computer/smartphone usage has been changing that latent space for quite some time. People have been talking about "character amnesia" since 2010.
I've been taking Chinese lessons for a number of years, and my teacher described her son as learning characters via pinyin. But it's quite possible (even likely) that the common ones don't require pinyin, and/or that I misunderstood how it's used. Nevertheless, even if I pushed the analogy a bit, I still think this might happen as a bridge between learning to code and agentic coding.
it's not that huge of a deal if you compare commercial costs in china and cheapest us states, and electricity is only one of the factors.
The real reason: anthropic + openai just cut the reasoning output to prevent distill, and hence you see the rise of chinese models to establish contracts globally .
I've heard on podcasts that AI data centers in the US are powered by natural gas. Apparently there is currently a glut of natural gas. So the energy costs are actually pretty low in the US.
In China the state and corporations can blend so it's difficult to tell the difference between the two. It is known for government sponsored dumping to meet some state goal or another.
This runs counter to the last 50 years of American propaganda espousing the inefficiency of government. If the Chinese government can just throw money at industries and have them flourish, why can't other governments?
I believe it is more complicated than simply “throwing money at industries”. It seems to me that in China, the Government actually runs the country, while in the US, private capital does.
Government central planning and industrial policy is always less efficient than free markets. But government can sometimes be more effective in accomplishing critical strategic goals when those are more important than efficiency.
Doubt it is best. Taiwan, China, and Singapore are easily better than South Korea. Singapore is the more unusual case: resources are tight everywhere, yet they are still able to build that system.
> If the Chinese government can just throw money at industries and have them flourish, why can't other governments?
One possibility that seems likely to me: it takes longer than a single election cycle for an investment like that to bear fruit. And you have to be willing to admit that some bets the state places will lose. This is harder in the kind of democracy and political climate that the US currently has. China's government has more continuity of leadership and a strong emphasis on stability that seem hard to achieve in the US without a lot more political cohesion and more nuanced opposition than the two-party system currently affords.
If we could achieve it, though, it'd be awesome. Some "best of both worlds" stuff.
The Chinese ghost cities you see online are no longer ghost cities. Ever wonder why the ghost-cities topic is no longer trending on YT and Twitter? You are 10 years behind the propaganda. Now it is more about Xinjiang enslavement, even though Chinese factories run without lights using robots. But hey, narrative is important to keep population control. Ever wonder what happened to XiaoHongShu access in America? It scared the shit out of the government, and American media just stopped highlighting it.
The US government could throw tons of money at everything and get some good results, that doesn't mean it would be efficient. And their system is fundamentally different, I think most westerners would appreciate less efficient AI companies in return for democracy and human rights.
Highly recommend everyone check out Breakneck. Felt like that gave me my first real insight into the relationship of the government and business in China.
It is basically Tang dynasty with tech and CCP members or politburo running the country instead of just 1 emperor. It is deeply ingrain into Chinese culture for 5 millennia. The closest thing is like American arguing about 1st and the love of guns. We are at around LiSiMin CCP peak China. Hopefully there won't be a repeat of Anlushan like incident. America system is really oligarchy cowboys with private cowboys taking turn running the government supported by other cowboys.
Chinese students study like it is Battle Royale Squid Game. Just read up what is Gaokao. And boosted by nearly 1/3 of household income for extra classes. And other study resources. You have that in America but the volume is closer to 500 to 1. This is why before Trump you see American colleges saturated with 20%+ Chinese students. India also the same but the second factor comes in. Proximity. They build universities powerplant factories port airports in cluster. So wastage due to "in between logistics" minimized. And finally everything is way way cheaper in China compare to even cheapest American town. So even if corruption is bad, the underlying structure ensure efficient output multitude higher than peak America. Another country have similar design is Singapore but they lack the talents and resources of China. This can't be replicated in America. American parents don't go nuts spending 30-60% HOUSEHOLD income for education. Most American parents already struggling paying gas and electricity bills. At best they give their children access to TikTok and Snapchat and hint them do sports like Tiger or Beyonce twerking to fame and wealth.
You are way behind about China. That was 60s about 60 years ago! Today China is robots. China has way more robots than Japan America and EU all add up together! Their factories run by robots not slaves. You should consider visiting Shenzhen and see for yourself. Or if you are lucky can ask your Chinese friends to register you WeChat and Baidu account and create China version of Rednote XiaoHongShu. That place has uncensored Chinese day to day lifestyle. What you see on America media like YT X FB are heavily censored about China. Things about government is censored. But lifestyle not much. Almost everything in America is censored or fake flooded. You have to be outside of America like in Germany or Indonesia to really see how censored American from what is going on outside of America.
Any government can and does regularly throw money at industries to make them flourish. The American propaganda claims that this is less efficient than letting market forces decide which companies win.
I'm actually vibe coding a game engine right now using a Hexagonal Architecture, and I ran into this exact same issue when trying to synchronize the feedback loop between the viewport and the editor. To be fair, I probably messed up the domain boundaries myself in the first place, but honestly, the AI-generated code wasn't very effective at solving it either
Game engines are one of the worst case scenarios for something like Rust as game heaps are simulations of the real world, so have lots of complex graph structures in them. Affine types work best for request/response oriented stuff where lifetimes are clearly bounded.
The article makes no sense. I can't use OpenRouter as a general purpose computing device. Why are we comparing a whole computer to a single purpose SaaS?
They're responding to the people doing things like buying the most expensive Mac they can find specifically to do local inference for their AI agents.
Some do it to have control over their ability to use AI. Some do it because they think it will be cheaper to not have to pay a SaaS to generate tokens for them.
But for those interested in the latter case, it seems like it's not actually cheaper after all, at least at current prices. But then I don't expect prices to drastically jump because of how much competition there is in model development.
It's worth paying a premium for the privacy (assuming that llama.cpp and ollama aren't sending my sessions back to the cloud regardless...), and for the concerns about not getting a surprise bill.
Correct me if I'm wrong, but I believe this is a feature that only Google has figured out how to implement. All of the other pay-as-you-go token services have a cap you can set, some by monthly spending, some with API key resolution, others by how much you put into the account. I use many, and if configured with auto-purchase disabled, it's not possible to have a "surprise" bill (except for Google!)
You also have control over your costs. It is reasonable to assume that tokens will cost significantly more in the near to medium future as the market consolidates and subsidies decline.
Google, Microsoft, Meta, Anthropic, OpenAI, Oracle and others are going to be looking to recoup all the money that they’ve spent to date. Why would the price go down in the future?
Because price is driven mainly by competition, not by a desire to recoup prior spending.
Investors aren't doing things out of the sheer goodness of their hearts, so if they could just bump the price up they'd have already forced it up. The very existence of workable local models puts a cap on how high the price can realistically go, but the high level of competition still extant makes the price floor ever closer to the actual cost to generate tokens.
The AI numbers are huge, but I remember similar arguments about residential high-speed internet. According to Gemini, the "price for internet" is down 12% in real terms (ugh, capitalism!), while speeds are staggeringly faster.
The providers have spent a fortune on wireless, pulled a lot of fiber/cable, and it's cheaper than it was when it started.
No, that’s not the point. I think this is to help people who are thinking about getting a beefier Mac so they can run their LLMs on it too. Some in particular want a dedicated Mac Mini or Studio for this purpose. The breakdown, even if slightly flawed, offers a good insight into the economics of it.
For most people, they might be better off with OpenRouter models and providers supporting Zero Data Retention. On the cloud, that’s as good as it gets for privacy - your data is never retained beyond the life of the request.
> your data is never retained beyond the life of the request.
Like with OpenAI for a year?
“In June 2025, the court ordered OpenAI to retain its consumer and API customer chat logs indefinitely, including any that had been deleted, so they could be investigated […]”
I think it's because there are a lot of people writing articles about the benefits of running local models. I think it's fair to say that there are daily threads on HN singing the praises of local inference. I also see people buying new hardware where the main trigger is the ability to run local models.
But the people who want to do local inference are putting some amount of value on privacy that’s not captured by the raw monetary value, so just comparing the price is somewhat beside the point. It’s also true that if you have, e.g., a Mac and you use it as your main computing device, then you would have spent money on it anyway, so you can’t really compare its cost to spending on something that’s not general purpose.
That's a lot of assumptions. I think there are also people buying new hardware specifically for this purpose, and their motivation to do it is thinking it will be cheaper in the long run. Privacy is not necessarily the motivation.
My overall opinion is that the smart thing is not to upgrade to the maximum memory for AI purposes. It's worth quantifying how much extra we pay for privacy.
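One way to put a number on that is a simple break-even calculation. This is a rough sketch under stated assumptions: every figure is hypothetical, only the extra cost of the memory upgrade is counted (the base machine would have been bought anyway), and token quality and speed differences between local and hosted models are ignored.

```python
def break_even_months(extra_hardware_cost, monthly_api_cost, monthly_power_cost=0.0):
    """Months until the upgrade pays for itself, or None if it never does."""
    monthly_saving = monthly_api_cost - monthly_power_cost
    if monthly_saving <= 0:
        return None  # local inference never wins on cost alone
    return extra_hardware_cost / monthly_saving


def privacy_premium(extra_hardware_cost, monthly_api_cost, monthly_power_cost, months):
    """Extra money spent on local inference over `months`, versus the API."""
    local_total = extra_hardware_cost + monthly_power_cost * months
    api_total = monthly_api_cost * months
    return local_total - api_total


# Hypothetical figures: 2000 for the upgrade, 50/month API spend, 10/month power.
months_needed = break_even_months(2000, 50, 10)
premium_3y = privacy_premium(2000, 50, 10, months=36)
print(months_needed, premium_3y)
```

A positive premium is what you are effectively paying for privacy and predictable bills over that horizon. A negative one means local was cheaper anyway.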
I think I was using GitHub Copilot when I had the experience that led me to this statement. I guess the experience of using LLMs can be quite different depending on model version and harness.
Anthropic's position is that thinking tokens aren't actually faithful to the internal logic that the LLM is using, which may be one reason why they started to exclude them:
That's interesting research, but I think a more important reason that you don't have access to them (not even via the bare Anthropic api) is to prevent distillation of the model by competitors (using the output of Anthropic's model to help train a new model).
If you can't trust a company, don't use their api or cloud services. No amount of external output will ever validate anything, ever. You never know what's really happening, just because you see some text they sent you.
Do you not see that the next (or previous) logical step would be a "commercial ban" of frontier models, all "distilled" from an enormous amount of copyrighted material?
That probably matters for some scenarios, but I have yet to find one where thinking tokens didn't hint at the root cause of the failure.
All of my unsupervised worker agents have sidecars that inject messages when thinking tokens match some heuristics. For example, any time opus says "pragmatic", its instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).
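A sidecar like that could be sketched roughly as below. This is not the poster's implementation. The `RULES` table and `check_thinking` are hypothetical names, the second injected message is an invented placeholder, and how the interrupt is actually delivered (the "Esc Esc" part) depends on the harness and is left out.

```python
import re

# (pattern, message to inject) pairs. The first message is quoted from the
# comment above; the second is a made-up placeholder for illustration.
RULES = [
    (re.compile(r"\bpragmatic\b", re.IGNORECASE),
     "Pragmatic fix is always wrong, do the Correct fix"),
    (re.compile(r"\bpre-existing issue\b", re.IGNORECASE),
     "Assume you introduced this issue and fix it"),
]


def check_thinking(chunk):
    """Return the message to inject for this chunk of thinking text, or None."""
    for pattern, message in RULES:
        if pattern.search(chunk):
            return message
    return None


# Usage: feed each streamed chunk through the check before letting it continue.
for chunk in ["Looking at the failing test...", "The pragmatic fix is to skip it."]:
    injection = check_thinking(chunk)
    if injection is not None:
        print("interrupt and send:", injection)
```

Plain regexes keep the check cheap enough to run on every chunk, at the cost of false positives whenever the model uses the word innocently.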
> For example, any time opus says "pragmatic", its instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).
It's so weird to see language changes like this: Outside of LLM conversations, a pragmatic fix and a correct fix are orthogonal. IOW, fix $FOO can be both.
From what you say, your experience has been that a pragmatic fix is on the same axis as a correct fix; it's just a negative on that axis.
It's contextual though, and pragmatic seems different to me than correct.
For example, if you have $20 and a leaking roof, a $20 bucket of tar may be the pragmatic fix. Temporary but doable.
Some might say it is not the correct way to fix that roof. At least, I can see some making that argument. The pragmatism comes from "what can be done" vs "should be".
From my perspective, it seems viable usage. And I guess one wonders what the LLM means when using it that way. What makes it determine a compromise is required?
(To be pragmatic, shouldn't one consider that synonyms aren't identical, but instead close to the definition?)
> It's contextual though, and pragmatic seems different to me than correct.
To me too, that's why I say they are measurements on different dimensions.
To my mind, I can draw a X/Y axis with "Pragmatic" on the Y and "Correctness" on the X, and any point on that chart would have an {X,Y} value, which is {Pragmatic, Correctness}.
If I am reading the original comment correctly, poster's experience of CC is that it is not an X/Y plot, it is a single line plot, with "Pragmatic" on the extreme left and "Correctness" on the extreme right.
Basically, any movement towards pragmatism is a movement away from correctness, while in my model it is possible to move towards Pragmatic while keeping Correctness the same.
I don't think it's a single axis even in the original poster's conception, since you could be both incorrect and also not pragmatic.
But if a fix needs to be described as pragmatic relative to the alternatives, that's probably because it couldn't be described as correct. Otherwise you wouldn't be talking about how pragmatic it is.
> also whenever "pre-existing issue" appears (it's never pre-existing)
I dunno... There were some pre-existing issues in my projects. Claude ran into them and correctly classified as pre-existing. It's definitely a problem if Claude breaks tests then claims the issue was pre-existing, but is that really what's happening?
I had some interesting experience to the opposite last night, one of my tests has been failing for a long time, something to do with dbus interacting with Qt segfaulting pytest. Been ignoring it for a long time, finally asked claude code to just remove the problematic test. Come back a few minutes later to find claude burning tokens repeatedly trying and failing to fix it. "Actually on second thought, it would be better to fix this test."
Match my vibes, claude. The application doesn't crash, so just delete that test!
I somewhat understand Anthropic's position. However, thinking tokens are useful even if they don't show the internal logic of the LLM. I often realize I left out some instruction or clarification in my prompt while reading through the chain of reasoning. Overall, this makes the results more effective.
It's certainly getting frustrating having to remind it that I want all tests to pass even if it thinks it's not responsible for having broken some of them.
What's the implication of this? That the model already decided on a solution, upon first seeing the problem, and the reasoning is post hoc rationalization?
But reasoning does improve performance on many tasks, and even weirder, the performance improves if reasoning tokens are replaced with placeholder tokens like "..."
I don't understand how LLMs actually work, I guess there's some internal state getting nudged with each cycle?
So the internal state converges on the right solution, even if the output tokens are meaningless placeholders?
>That the model already decided on a solution, upon first seeing the problem, and the reasoning is post hoc rationalization?
Yes it plans ahead, but with significant uncertainty until it actually outputs these tokens and converges on a definite trajectory, so it's not a useless filler - the closer it is to a given point, the more certain it is about it, kind of similar to what happens explicitly in diffusion models. And it's not all that happens, it's just one of many competing phenomena.
So like many of the promises from AI companies, reported chain of thought is not actually true (see results below). I suppose this is unsurprising given how they function.
Is chain of thought even added to the context or is it extraneous babble providing a plausible post-hoc justification?
People certainly seem to treat it as it is presented, as a series of logical steps leading to an answer.
‘After checking that the models really did use the hints to aid in their answers, we tested how often they mentioned them in their Chain-of-Thought. The overall answer: not often. On average across all the different hint types, Claude 3.7 Sonnet mentioned the hint 25% of the time, and DeepSeek R1 mentioned it 39% of the time. A substantial majority of answers, then, were unfaithful.‘
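The percentages in that quote are a mention rate computed only over cases where the model actually used the hint. A minimal sketch of that arithmetic, with invented trial records (the field names and data are assumptions, not the study's format):

```python
def mention_rate(trials):
    """Fraction of hint-using answers whose chain of thought mentions the hint."""
    used = [t for t in trials if t["used_hint"]]
    if not used:
        return None  # rate is undefined when the hint was never used
    return sum(1 for t in used if t["mentioned_hint"]) / len(used)


# Made-up records: the hint was used four times and acknowledged once.
trials = [
    {"used_hint": True, "mentioned_hint": True},
    {"used_hint": True, "mentioned_hint": False},
    {"used_hint": True, "mentioned_hint": False},
    {"used_hint": True, "mentioned_hint": False},
    {"used_hint": False, "mentioned_hint": False},
]
faithful_fraction = mention_rate(trials)
print(faithful_fraction)
```

Answers that ignored the hint are excluded from the denominator, which is why the quote first checks that the models "really did use the hints".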
I mean, obviously, it's not going to be a faithful representation of the actual thinking. The model isn't aware of how it thinks any more than you are aware how your neurons fire. But it does quantitatively improve performance on complex tasks.
As you can see from posts on this story, most people believe it reflects what the model is thinking and use it as a guide to that so they can ‘correct’ it. If it is not in fact chain of thought or thinking it should not be called that.
It is the same with human chain of thought, though. Both of them are post-hoc rationalisations justifying "gut feelings" that come from thought processes the human/agent doesn't have introspection into. And yet asking humans or machines to "think out loud" this way does increase the quality of their work.
I disagree - humans often reason in a series of steps, and can write these down before they've reached an answer. They don't always wait till they reach a conclusion (with no self-insight into how they did so) and then retrospectively generate a plausible answer as LLMs do.
In mathematical proofs they may guess an answer and then work out a proof, but that is a different process.