
faitswulff · 2026-06-05T13:04:10 1780664650

> The analysis uses a single metric: bugs per 10 commits (bugs/10c).

Bugs per commit as a metric papers over severity, both in terms of security severity as well as the effect on the user. A mislabeled button has the same weight as the entire app crashing in this framework.

germanjoey · 2026-06-05T18:26:42 1780684002

IMO "bugs per commit" is even worse than that, because, in addition to what you say, it also hides the extraordinary spike of commit activity of a project that had previously been stable. [0]

It is the exact metric you'd choose if you wanted to make the current situation of rsync look like not a big deal.

[0] https://github.com/RsyncProject/rsync/graphs/commit-activity

logicprog · 2026-06-05T18:42:33 1780684953

Yes, but we know why there was an "extraordinary spike," and it has nothing to do with rsync being "vibe coded." The maintainer has directly addressed this.

vsundar · 2026-06-05T21:21:03 1780694463

> The maintainer has directly addressed this.

Not sure if this is mentioned somewhere else, but looks like the maintainer has a blog post that addresses this: https://medium.com/@tridge60/rsync-and-outrage-d9849599e5a0

floxy · 2026-06-05T18:44:40 1780685080

Seems like this would be a good place to link to that.

logicprog · 2026-06-05T18:54:28 1780685668

I link to it multiple times in TFA and quote the specific thing I'm talking about here in there to explain that possible confounder. I think I've done more than the work I'm obligated to do to make all of the relevant information available to you. You are just refusing to use it.

runarberg · 2026-06-05T19:23:01 1780687381

I am not finding these links in TFA. I see a link to issue #929, which (as mentioned in TFA) has over 350 replies, and an opinionated summary of what transpired, including some detailed description of specific posts there. However, I did not find the maintainer's response.

Of interest is this post here: https://github.com/RsyncProject/rsync/issues/929#issuecommen... which echoes the same concern that was raised upthread; however, I failed to find the maintainer's response.

EDIT: Found it! it is in the (untitled) discussion section (after the results).

https://lobste.rs/s/k1b0za/rsync_outrage#c_2iowov

EDIT 2 (and advice on design): The page design changes backgrounds after the results sections, which kind of conveys to the user that they have reached the end of what is important and can just skim over the rest (usually pages have a radical change in typography like this when you’ve reached the comment section). However, this is what is analogous to a discussion in a typical paper, and is arguably the most important part. I had simply assumed that you just left it at the result and skipped the discussion as a stylistic choice.

logicprog · 2026-06-05T19:33:47 1780688027

> EDIT: Found it! it is in the (untitled) discussion section (after the results).

I also paraphrase Tridge himself explicitly saying that this is why commits/releases have increased:

> Essentially, this isn't a "Claude" problem, it's a "more security work" problem, something that Tridge himself confirmed in his response, describing how a flood of AI-generated CVE reports forced rapid, extensive changes to rsync's attack surface.

> The page design changes backgrounds after the results sections, which kind of conveys to the user that they have reached the end of what is important and can just skim over the rest (usually pages have a radical change in typography like this when you’ve reached the comment section). However, this is what is analogous to a discussion in a typical paper, and is arguably the most important part. I had simply assumed that you just left it at the result and skipped the discussion as a stylistic choice.

Good point, I assumed everyone would read till the end, that's on me. I'll give it a heading.

ex-aws-dude · 2026-06-05T18:19:56 1780683596

Why don't you prove the bugs increased then?

Why is it that some unfounded claim is made and the onus is suddenly on the project maintainer to prove it beyond all doubt?

It should be on the person making the claim to prove it

logicprog · 2026-06-05T19:37:00 1780688220

I've now resolved this. The new version, which should be live on GH Pages soon, uses what I think is a pretty good methodology for assigning severity to each bug: it normalizes each severity to 0.0-1.0, sums those, treats the sum as the total severity-weighted bug count, and then does the analysis based on that. It did not change the analysis in any material way.
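
For readers who want the metric spelled out, here is a minimal sketch of a severity-weighted "per 10 commits" figure as described in the comment above. It is not the article's actual code: the function name, the example severity scores, and the commit count are all made up for illustration, and how a severity score gets assigned to a bug is left out entirely.

```python
def severity_per_10_commits(severities, commit_count):
    """Sum normalized severities (each in 0.0-1.0) and scale per 10 commits."""
    if commit_count <= 0:
        raise ValueError("commit_count must be positive")
    for s in severities:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"severity {s} is outside 0.0-1.0")
    return 10 * sum(severities) / commit_count


# Hypothetical release: one severe bug and two minor ones across 40 commits.
print(severity_per_10_commits([0.9, 0.2, 0.1], 40))
```

With unweighted counting, every bug would contribute 1.0 to the sum regardless of impact, which is the objection raised at the top of the thread.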

bsza · 2026-06-06T07:39:30 1780731570

No Claude, it still makes zero sense as a metric.

A commit is a measure of nothing. Severity weighted bugs per unit of nothing? What does that even mean? In any repo it's trivial to achieve a sev/10c that's arbitrarily close to zero while completely ruining everything.
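
The dilution point can be shown with a few lines of arithmetic. This is only a sketch with made-up numbers: the total severity is held fixed while the same work is split into more and more commits, and the helper name `sev_per_10c` is hypothetical.

```python
def sev_per_10c(total_severity, commit_count):
    """Severity-weighted bugs per 10 commits."""
    return 10 * total_severity / commit_count


total_severity = 5.0  # hypothetical: same bugs, same damage in every row

# Splitting identical changes into more commits shrinks the ratio
# without changing anything about the bugs themselves.
for commits in (10, 100, 1_000, 10_000):
    print(f"{commits:>6} commits -> {sev_per_10c(total_severity, commits):.3f} sev/10c")
```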

I suggest you practice some humility and update your conclusion instead of updating the mental gymnastics you used to arrive at the same conclusion.

logicprog · 2026-06-06T09:05:19 1780736719

Whether commits decrease the sev/10c depends on whether there are a lot of small commits increasing the denominator. In reality, we have the opposite: the post-Claude releases have way fewer commits than the pre-Claude ones.

Thus, if anything their sev/10c is inflated. If I changed it to lines of code changed, the relative bug ratios would be much smaller, and the conclusion wouldn't change. In fact, the conclusion would look "better" for Claude; if I was using "mental gymnastics" to come to this conclusion, I would have already used a metric other than adjusting per commits!

What different metric would you suggest that would change the conclusion?

Showing "humility", as you so moralistically and condescendingly put it, would require being wrong first.

bsza · 2026-06-06T14:17:06 1780755426

> In reality, we have the opposite: the post-Claude releases have way fewer commits than the pre-Claude ones.

No, they don't, you just made that up.

> What different metric would you suggest that would change the conclusion?

What would be a lot more useful to know is whether or not the original prompt used to generate this post instructed you to do a fair and unbiased review of these bugs, and whether or not that prompt itself was framed in a fair and unbiased way. If you take a piece of paper and write "therefore Claude is not at fault" at the bottom, then nothing you write above that line is admissible, no matter how well-reasoned.

skeledrew · 2026-06-05T13:48:20 1780667300

There was no analysis of severity in all of the rage posting that occurred. The single point being pushed was "use of an LLM led/leads to more bugs". The author specifically states that's what they're addressing (blunt accusation -> blunt response).

atmavatar · 2026-06-05T16:33:06 1780677186

The specific problems mentioned were all reasonably severe. The original post itself described a show-stopping bug:

    So my systems recently updated to rsync 3.4.3, and as soon as that happened my backup system - which does incremental backups using multiple --compare-dest= arguments - started to fail on anything but a full backup.

Incremental backups are perhaps the primary use of rsync, and they were broken for this person. That's pretty severe.

The second reply is similar:

    i wondered why my 3d printers were running like sh*t and at 100% cpu; turns out log2ram uses rsync.

This one I took with a grain of salt, since it read more like a dogpile than an actual bug report. However, if it's genuine, it's also reasonably severe.

Later in the comments, someone attempted to provide a list of issues that had been added: https://github.com/RsyncProject/rsync/issues/929#issuecommen.... The list included several failures to build or run rsync that appear to have resulted from broken backward compatibility. That seems reasonably severe. If intentional, I would have expected mention in the release notes about the removal of backwards compatibility, but none was made.

The issue comments had already degraded into a lot of unnecessary vitriol even before the above-mentioned comment and only get worse from there, so I stopped. But the fact remains that the whole issue started with a severe bug.

I applaud the attempt at dispassionately analyzing whether the recent LLM releases of rsync were normal or outliers as far as bugs are concerned, but I don't think you can do so properly without analyzing severity.

skeledrew · 2026-06-05T18:05:04 1780682704

To keep such an analysis fair and contextually relevant, it would have to be extended to the previous 928 issues as well (of course filtering for bug reports). I don't see anyone doing such an analysis, I think because they don't expect they'd find it useful (at least not as the rage fuel that many are seeking); what they'd be more likely to find is that there is a similar severity-mix going all the way back to v1.0.0, because these things inevitably happen whether coding is done by human or machine.

"A lot of claims in the wider discussion have treated every recent bug report as if it had the same cause. That is not accurate. Some reports were regressions from recent security hardening, some were missing historical test coverage, some were older bugs found because rsync suddenly had more eyes on it (especially by AI that can find issues quickly) and some were packaging or environment-specific failures. A Co-authored-by line is not enough by itself to establish root cause." - https://github.com/RsyncProject/rsync/issues/929#issuecommen...

faitswulff · 2026-06-05T11:37:25 1780659445

Also see: the entire history of America

faitswulff · 2026-06-01T13:02:27 1780318947

Pinyin is widely used, but pinyin’s primacy is oversold. Chinese texts start with teaching Chinese characters - many are recognizable to children from daily exposure to begin with, so they don’t need the pinyin. Pinyin only comes in when the character is genuinely unknown.

RationPhantoms · 2026-06-01T18:03:16 1780336996

I think computer/smartphone usage has been changing that latent space for quite some time. People have been talking about "character amnesia" since 2010.

faitswulff · 2026-06-02T00:50:15 1780361415

Sort of, but the amnesia typically applies to writing characters, rather than reading them.

nneonneo · 2026-06-01T13:04:51 1780319091

Pinyin is also the main way people input Chinese into computers, so it's rather important in that regard.

reuven · 2026-06-01T13:09:54 1780319394

I've been taking Chinese lessons for a number of years, and my teacher described her son as learning characters via pinyin. But it's quite possible (even likely) that the common ones don't require pinyin, and/or that I misunderstood how it's used. Nevertheless, even if I pushed the analogy a bit, I still think this might happen as a bridge between learning to code and agentic coding.

faitswulff · 2026-05-28T02:40:26 1779936026

I get 24 hour access through my local library. Here, have a gift link https://www.nytimes.com/2026/05/27/us/politics/fbi-arrest-ci...

ethagnawl · 2026-05-28T03:04:22 1779937462

I had no idea this was an option. Libraries are the best.

kwar13 · 2026-05-28T08:24:47 1779956687

oh man i just discovered that i have access to a lot of publications through my local library! thanks for this!

VladVladikoff · 2026-05-28T03:17:56 1779938276

Thank you!

faitswulff · 2026-05-27T02:22:41 1779848561

I wonder how much of DeepSeek and Xiaomi's pricing cuts can be traced back to cheap energy in China.

onlyrealcuzzo · 2026-05-27T02:49:31 1779850171

Energy is like 10-20% of the cost of AI.

The rest is mostly hardware depreciation.

cmiles8 · 2026-05-27T03:07:36 1779851256

Correct. There are challenges getting enough energy to new data center builds but the cost of the energy is low relative to other costs.

stingraycharles · 2026-05-27T02:56:23 1779850583

So you think they’re running the same types of state of the art Nvidia deployments?

onlyrealcuzzo · 2026-05-27T03:12:14 1779851534

It's supposed to be even MORE expensive:

Nvidia H100: Typically priced around $25,000–$30,000 (global MSRP).

Huawei Ascend 910C: Reported to cost roughly $28,000, yet it delivers only 60% of the inference performance of the Nvidia H100.

Google's TPUs are significantly cheaper for Google for inference. That's pretty much it.

There's a reason nVidia has an 80% margin right now.
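
To make the comparison concrete, here is a small sketch that divides each quoted price by relative inference performance, using only the figures given in the comment above (H100 at $25,000-$30,000, Ascend 910C at roughly $28,000 and 60% of H100 inference performance). The helper name is hypothetical, and the calculation ignores power, networking, and software costs.

```python
def price_per_h100_equivalent(price_usd, relative_performance):
    """Price of enough hardware to match one H100's inference throughput."""
    return price_usd / relative_performance


h100_low = price_per_h100_equivalent(25_000, 1.0)
h100_high = price_per_h100_equivalent(30_000, 1.0)
ascend_910c = price_per_h100_equivalent(28_000, 0.6)

print(f"H100:        ${h100_low:,.0f}-${h100_high:,.0f} per H100-equivalent")
print(f"Ascend 910C: ${ascend_910c:,.0f} per H100-equivalent")
```

On those numbers the Ascend works out to roughly $46,700 per H100-equivalent, which is the sense in which it is "even MORE expensive".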

thrownthatway · 2026-05-27T03:22:57 1779852177

MSRP is irrelevant in this context.

maxdo · 2026-05-27T02:40:47 1779849647

it's not that huge of a deal if you compare commercial costs in china and cheapest us states, and electricity is only one of the factors.

The real reason: anthropic + openai just cut the reasoning output to prevent distillation, and hence you see the rise of chinese models to establish contracts globally.

stingraycharles · 2026-05-27T02:57:44 1779850664

“and hence you see the rise of chinese models to establish contracts globally”

how will that help them working around the distill issue?

gessha · 2026-05-27T03:09:50 1779851390

Collecting user data directly by competing on price. The next step would be figuring out how that data can bring them closer to SOTA.

stingraycharles · 2026-05-27T03:38:29 1779853109

Yes ok but that doesn’t give them the thinking tokens, how to reason about the prompt, which is precisely what’s most important.

pianopatrick · 2026-05-27T03:00:01 1779850801

I've heard on podcasts that AI data centers in the US are powered by natural gas. Apparently there is currently a glut of natural gas. So the energy costs are actually pretty low in the US.

esseph · 2026-05-27T03:22:07 1779852127

We extract more than we can export. Currently sitting on something like at least 3,500 trillion cubic feet of ng. We consume 30-32tcf per year.
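
As a rough sanity check on those figures, the quoted stock divided by the quoted annual consumption gives a static "years of supply" number. This sketch uses only the values from the comment above and assumes consumption stays flat, with no exports, no new discoveries, and no change in what is economically recoverable.

```python
def years_of_supply(reserves_tcf, annual_consumption_tcf):
    """Static reserves-to-consumption ratio, in years."""
    return reserves_tcf / annual_consumption_tcf


reserves_tcf = 3_500  # trillion cubic feet, as quoted above

for consumption_tcf in (30, 32):
    years = years_of_supply(reserves_tcf, consumption_tcf)
    print(f"{consumption_tcf} tcf/year -> about {years:.0f} years")
```

That comes out to roughly 109 to 117 years at the quoted consumption rates.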

colechristensen · 2026-05-27T02:24:29 1779848669

In China the state and corporations can blend so it's difficult to tell the difference between the two. It is known for government sponsored dumping to meet some state goal or another.

faitswulff · 2026-05-27T02:35:43 1779849343

This runs counter to the last 50 years of American propaganda espousing the inefficiency of government. If the Chinese government can just throw money at industries and have them flourish, why can't other governments?

foxygen · 2026-05-27T02:40:46 1779849646

I believe it is more complicated than simply “throwing money at industries”. It seems to me that in China, the Government actually runs the country, while in the US, private capital does.

nradov · 2026-05-27T03:35:21 1779852921

Government central planning and industrial policy is always less efficient than free markets. But government can sometimes be more effective in accomplishing critical strategic goals when those are more important than efficiency.

rapsey · 2026-05-27T02:56:10 1779850570

As if the west does not use tariffs and subsidies. China is simply much smarter about it and has much more functional institutions.

jameson · 2026-05-27T02:55:04 1779850504

Other governments do, but not as much as China does.

Healthcare in South Korea, for example, is government-managed, and it is one of the best healthcare systems in the world.

I believe utility companies are also government owned.

Also some of the well known companies now were practically government owned during the Park dictatorship in the 70s.

I wouldn't use the term "Flourish" as what you hear and see is strictly controlled

Haven880 · 2026-05-27T07:52:04 1779868324

Doubt it is best. Taiwan China and Singapore easily better than SKorea. Singapore is more unique where everything is resources tight they still able to create that system.

isityettime · 2026-05-27T03:05:55 1779851155

> If the Chinese government can just throw money at industries and have them flourish, why can't other governments?

One possibility that seems likely to me: it takes longer than a single election cycle for an investment like that to bear fruit. And you have to be willing to admit that some bets the state places will lose. This is harder in the kind of democracy and political climate that the US currently has. China's government has more continuity of leadership and a strong emphasis on stability that seem hard to achieve in the US without a lot more political cohesion and more nuanced opposition than the two-party system currently affords.

If we could achieve it, though, it'd be awesome. Some "best of both worlds" stuff.

colechristensen · 2026-05-27T03:41:38 1779853298

The government paying for your output which has nowhere to go is not a flourishing industry.

If my kid starts a lemonade stand and I pay him $500 to dump 20 gallons of lemonade into the sewer, did he run a successful business for a day?

Look into the Chinese ghost cities or US and EU actions against Chinese metals dumping.

https://en.wikipedia.org/wiki/Underoccupied_developments_in_...

Haven880 · 2026-05-27T07:55:23 1779868523

Chinese ghost cities you see online already non ghost cities. Ever wonder why ghost cities topic no longer trending on YT and Twitter? You are 10 years behind the propaganda. Now is more of Xinjiang enslavement even though Chinese factories run without lights using robots. But hey narrative is important to keep population control. Ever wonder what happen to XiaoHongShu access in America? It scare the shit out of government and American media just stop highlighting it.

slopinthebag · 2026-05-27T04:05:54 1779854754

The US government could throw tons of money at everything and get some good results, that doesn't mean it would be efficient. And their system is fundamentally different, I think most westerners would appreciate less efficient AI companies in return for democracy and human rights.

idiotsecant · 2026-05-27T02:39:13 1779849553

The Chinese economy is deeply weird from a western perspective. Culture and economics are not orthogonal.

Forgeties79 · 2026-05-27T02:43:22 1779849802

Highly recommend everyone check out Breakneck. Felt like that gave me my first real insight into the relationship of the government and business in China.

Haven880 · 2026-05-27T07:59:03 1779868743

It is basically Tang dynasty with tech and CCP members or politburo running the country instead of just 1 emperor. It is deeply ingrain into Chinese culture for 5 millennia. The closest thing is like American arguing about 1st and the love of guns. We are at around LiSiMin CCP peak China. Hopefully there won't be a repeat of Anlushan like incident. America system is really oligarchy cowboys with private cowboys taking turn running the government supported by other cowboys.

Haven880 · 2026-05-27T07:50:12 1779868212

Chinese students study like it is Battle Royale Squid Game. Just read up what is Gaokao. And boosted by nearly 1/3 of household income for extra classes. And other study resources. You have that in America but the volume is closer to 500 to 1. This is why before Trump you see American colleges saturated with 20%+ Chinese students. India also the same but the second factor comes in. Proximity. They build universities powerplant factories port airports in cluster. So wastage due to "in between logistics" minimized. And finally everything is way way cheaper in China compare to even cheapest American town. So even if corruption is bad, the underlying structure ensure efficient output multitude higher than peak America. Another country have similar design is Singapore but they lack the talents and resources of China. This can't be replicated in America. American parents don't go nuts spending 30-60% HOUSEHOLD income for education. Most American parents already struggling paying gas and electricity bills. At best they give their children access to TikTok and Snapchat and hint them do sports like Tiger or Beyonce twerking to fame and wealth.

lazide · 2026-05-27T02:48:47 1779850127

They made their domestic steel industry ‘flourish’ by getting every peasant to make their own steel mills too, and mostly crashed their economies.

When things line up and the decisions are decent, top down can be really good.

When the decisions are bad, it is exceptionally dramatic failures too. Tofu dregs, etc.

Right now, no one has to liquidate so it’s easy to hide the damage though.

Haven880 · 2026-05-27T08:03:35 1779869015

You are way behind about China. That was 60s about 60 years ago! Today China is robots. China has way more robots than Japan America and EU all add up together! Their factories run by robots not slaves. You should consider visiting Shenzhen and see for yourself. Or if you are lucky can ask your Chinese friends to register you WeChat and Baidu account and create China version of Rednote XiaoHongShu. That place has uncensored Chinese day to day lifestyle. What you see on America media like YT X FB are heavily censored about China. Things about government is censored. But lifestyle not much. Almost everything in America is censored or fake flooded. You have to be outside of America like in Germany or Indonesia to really see how censored American from what is going on outside of America.

lazide · 2026-05-27T16:13:54 1779898434

She doth protest too much.

nonethewiser · 2026-05-27T02:58:44 1779850724

Not really. Dumping != flourishing

adrianN · 2026-05-27T02:40:29 1779849629

Any government can and does regularly throw money at industries to make them flourish. The American propaganda claims that this is less efficient than letting market forces decide which companies win.

nradov · 2026-05-27T03:16:44 1779851804

And it turns out that the American propaganda is almost always correct.

faitswulff · 2026-05-20T11:07:52 1779275272

What kinds of programs are you writing and with what models? I'm curious if the lifetimes your programs require are trickier than most.

jdw64 · 2026-05-20T11:15:26 1779275726

I'm actually vibe coding a game engine right now using a Hexagonal Architecture, and I ran into this exact same issue when trying to synchronize the feedback loop between the viewport and the editor. To be fair, I probably messed up the domain boundaries myself in the first place, but honestly, the AI-generated code wasn't very effective at solving it either

mike_hearn · 2026-05-21T10:16:12 1779358572

Game engines are one of the worst case scenarios for something like Rust as game heaps are simulations of the real world, so have lots of complex graph structures in them. Affine types work best for request/response oriented stuff where lifetimes are clearly bounded.

jdw64 · 2026-05-21T11:33:49 1779363229

So, I am currently changing the language. Thank you for your kind advice

faitswulff · 2026-05-17T13:43:40 1779025420

The article makes no sense. I can't use OpenRouter as a general purpose computing device. Why are we comparing a whole computer to a single purpose SaaS?

mpyne · 2026-05-17T14:57:42 1779029862

They're responding to the people doing things like buying the most expensive Mac they can find specifically to do local inference for their AI agents.

Some do it to have control over their ability to use AI. Some do it because they think it will be cheaper to not have to pay a SaaS to generate tokens for them.

But for those interested in the latter case, it seems like it's not actually cheaper after all, at least at current prices. But then I don't expect prices to drastically jump because of how much competition there is in model development.

datadrivenangel · 2026-05-17T15:23:17 1779031397

It's worth paying a premium for the privacy (assuming that llama.cpp and ollama aren't sending my sessions back to the cloud regardless...), and for the concerns about not getting a surprise bill.

nomel · 2026-05-17T19:43:37 1779047017

> not getting a surprise bill.

Correct me if I'm wrong, but I believe this is a feature that only Google has figured out how to implement. All of the other pay-as-you-go token services have a cap you can set, some by monthly spending, some with API key resolution, others by how much you put into the account. I use many, and if configured with auto-purchase disabled, it's not possible to have a "surprise" bill (except for Google!)

dcrazy · 2026-05-17T16:14:39 1779034479

You also have control over your costs. It is reasonable to assume that tokens will cost significantly more in the near to medium future as the market consolidates and subsidies decline.

Danox · 2026-05-18T04:37:12 1779079032

Google, Microsoft, Meta, Anthropic, OpenAI, Oracle and others are going to be looking to recoup all the money that they’ve spent to date. Why would the price go down in the future?

mpyne · 2026-05-18T20:46:26 1779137186

> Why would the price go down in the future?

Because price is driven mainly by competition, not by a desire to recoup prior spending.

Investors aren't doing things out of the sheer goodness of their hearts, so if they could just bump the price up they'd have already forced it up. The very existence of workable local models puts a cap on how high the price can realistically go, but the high level of competition still extant makes the price floor ever closer to the actual cost to generate tokens.

FeloniousHam · 2026-05-18T13:02:08 1779109328

The AI numbers are huge, but I remember similar arguments about residential high-speed internet. According to Gemini, the "price for internet" is down 12% in real terms (ugh, capitalism!), while speeds are staggeringly faster.

The providers have spent a fortune on wireless, pulled a lot of fiber/cable, and it's cheaper than it was when it started.

sheepscreek · 2026-05-17T15:38:46 1779032326

No, that’s not the point. I think this is to help people who are thinking about getting a beefier Mac so they can run their LLMs on it too. Some in particular want a dedicated Mac Mini or Studio for this purpose. The breakdown, even if slightly flawed, offers a good insight into the economics of it.

For most people, they might be better off with OpenRouter models and providers supporting Zero Data Retention. On the cloud, that’s as good as it gets for privacy - your data is never retained beyond the life of the request.

47282847 · 2026-05-18T11:53:27 1779105207

> your data is never retained beyond the life of the request.

Like with OpenAI for a year?

“In June 2025, the court ordered OpenAI to retain its consumer and API customer chat logs indefinitely, including any that had been deleted, so they could be investigated […]”

https://www.techspot.com/news/109839-openai-no-longer-requir...

tuwtuwtuwtuw · 2026-05-17T13:49:12 1779025752

I think it's because there are a lot of people writing articles about the benefits of running local models. I think it's fair to say that there are daily threads on HN singing the praises of local inference. I also see people buying new hardware where the main trigger is the ability to run local models.

FuckButtons · 2026-05-17T14:45:39 1779029139

But the people who want to do local inference are putting some amount of value on privacy that’s not captured by the raw monetary value so just comparing the price is somewhat beside the point, it’s also true that, if you have eg a Mac and you use that as your main computing device then you would have spent money on it anyway, so you can’t even really compare its value to spend on something that’s not general purpose.

apf6 · 2026-05-17T16:31:55 1779035515

That's a lot of assumptions. I think there are also people buying new hardware specifically for this purpose, and their motivation to do it is thinking it will be cheaper in the long run. Privacy is not necessarily the motivation.

datadrivenangel · 2026-05-17T14:59:34 1779029974

My overall opinion is that the smart thing is not to upgrade to the maximum memory for AI purposes. It's worth quantifying how much extra we pay for privacy.

tuwtuwtuwtuw · 2026-05-17T15:05:59 1779030359

I replied to a comment asking why the article exists.

As for privacy, I'm sure there are many people that are not so interested in that aspect.

faitswulff · 2026-05-16T17:27:37 1778952457

> Flagship models usually do not do that without some convincing

Just a data point, but I’ve been having Claude do this regularly

bpavuk · 2026-05-16T17:41:50 1778953310

Gemini Flash-Lite was a decent reverse-engineering sidekick since 2.5 as well.

gpugreg · 2026-05-16T18:13:23 1778955203

I think I was using GitHub Copilot when I had the experience that led me to this statement. I guess the experience of using LLMs can be quite different depending on model version and harness.

brookst · 2026-05-16T17:35:43 1778952943

Same. I was having it debug a routine Python issue and it broke out Pympler and LLDB, and added a signal handler to dump stack traces.
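
For context, a signal handler that dumps stack traces can be set up in Python with the standard-library `faulthandler` module. This is a minimal sketch of that general technique, not the code from the session described above; the function name and the choice of `SIGUSR1` are assumptions, and `faulthandler.register` is not available on Windows.

```python
import faulthandler
import os
import signal
import sys


def install_stack_dump_handler(signum=signal.SIGUSR1, file=sys.stderr):
    """Dump the tracebacks of all threads to `file` when `signum` arrives."""
    faulthandler.register(signum, file=file, all_threads=True)


install_stack_dump_handler()
print(f"send SIGUSR1 to PID {os.getpid()} to dump stack traces")
```

From another terminal, `kill -USR1 <pid>` then prints where every thread currently is, which helps when a process appears hung.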

faitswulff · 2026-04-30T12:15:05 1777551305

> It will eventually doom zig to a smaller "artisanal" pool of contributors

“Artisanal” and “Zig” are just about synonymous

faitswulff · 2026-04-07T02:38:37 1775529517

Anthropic's position is that thinking tokens aren't actually faithful to the internal logic that the LLM is using, which may be one reason why they started to exclude them:

https://www.anthropic.com/research/reasoning-models-dont-say...

libraryofbabel · 2026-04-07T03:55:33 1775534133

That's interesting research, but I think a more important reason that you don't have access to them (not even via the bare Anthropic api) is to prevent distillation of the model by competitors (using the output of Anthropic's model to help train a new model).

MagicMoonlight · 2026-04-07T08:59:22 1775552362

Yeah. And it’s another reason not to trust them. Who knows what it is doing with your codebase.

Imagine if you’re a competitor. It wouldn’t be a stretch to include a sneaky little prompt line saying “destroy any competitors to anthropic”.

b112 · 2026-04-07T09:51:27 1775555487

If you can't trust a company, don't use their api or cloud services. No amount of external output will ever validate anything, ever. You never know what's really happening, just because you see some text they sent you.

tdeck · 2026-04-07T12:04:30 1775563470

> Who knows what it is doing with your codebase.

People who review the code? The code is always going to be a better representation of what it's doing than the "thinking" anyway.

xvector · 2026-04-07T05:27:49 1775539669

If distilled models were commercially banned they'd probably be willing to show the thinking again.

pjc50 · 2026-04-07T07:48:08 1775548088

Intellectual property rights in models? But then wouldn't the model maker have to pay for all the training IP?

(just kidding, I know that the legal rule for IP disputes is "party with more money wins")

asobalife · 2026-04-07T22:01:40 1775599300

how does one actually enforce that? I mean especially for code? You can always just clean room it

lejalv · 2026-04-07T06:43:36 1775544216

How do you think such a ban should work?

Do you not see that the next (or previous) logical step would be a "commercial ban" of frontier models, all "distilled" from an enormous amount of copyrighted material?

xvector · 2026-04-07T15:23:06 1775575386

I'm not arguing the merits of such a ban, I'm simply stating a fact - that thinking transcripts likely won't return until such a ban is in place.

gck1 · 2026-04-07T06:59:57 1775545197

That probably matters for some scenarios, but I have yet to find one where thinking tokens didn't hint at the root cause of the failure.

All of my unsupervised worker agents have sidecars that inject messages when thinking tokens match some heuristics. For example, any time opus says "pragmatic", it's an instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).
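
The matching half of such a sidecar might look something like the sketch below. It is an assumption-heavy illustration, not the commenter's setup: the rule table, the function name, and the wording of the second message are hypothetical, and the part that actually interrupts the agent and injects the message is not modeled.

```python
import re

# (pattern, message to inject). The first message is quoted from the comment
# above; the second message is a hypothetical placeholder.
RULES = [
    (re.compile(r"\bpragmatic\b", re.IGNORECASE),
     "Pragmatic fix is always wrong, do the Correct fix"),
    (re.compile(r"\bpre-existing issue\b", re.IGNORECASE),
     "This is not a pre-existing issue, investigate it"),
]


def messages_to_inject(thinking_text):
    """Return the messages whose trigger phrase appears in the thinking text."""
    return [message for pattern, message in RULES if pattern.search(thinking_text)]


print(messages_to_inject("The pragmatic approach is to skip the failing test."))
```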

lelanthran · 2026-04-07T09:08:14 1775552894

> For example, any time opus says "pragmatic", it's an instant Esc Esc > "Pragmatic fix is always wrong, do the Correct fix", also whenever "pre-existing issue" appears (it's never pre-existing).

It's so weird to see language changes like this: Outside of LLM conversations, a pragmatic fix and a correct fix are orthogonal. IOW, fix $FOO can be both.

From what you say, your experience has been that a pragmatic fix is on the same axis as a correct fix; it's just a negative on that axis.

b112 · 2026-04-07T10:03:43 1775556223

It's contextual though, and pragmatic seems different to me than correct.

For example, if you have $20 and a leaking roof, a $20 bucket of tar may be the pragmatic fix. Temporary but doable.

Some might say it is not the correct way to fix that roof. At least, I can see some making that argument. The pragmatism comes from "what can be done" vs "should be".

From my perspective, it seems viable usage. And I guess one wonders what the LLM means when using it that way. What makes it determine a compromise is required?

(To be pragmatic, shouldn't one consider that synonyms aren't identical, but instead close to the definition?)

lelanthran · 2026-04-07T11:51:55 1775562715

> It's contextual though, and pragmatic seems different to me than correct.

To me too, that's why I say they are measurements on different dimensions.

To my mind, I can draw an X/Y plot with "Pragmatic" on the Y axis and "Correctness" on the X axis, and any point on that chart would have an {X,Y} value, which is {Correctness, Pragmatic}.

If I am reading the original comment correctly, poster's experience of CC is that it is not an X/Y plot, it is a single line plot, with "Pragmatic" on the extreme left and "Correctness" on the extreme right.

Basically, any movement towards pragmatism is a movement away from correctness, while in my model it is possible to move towards Pragmatic while keeping Correctness the same.

shawnz · 2026-04-08T17:13:11 1775668391

I don't think it's a single axis even in the original poster's conception, since you could be both incorrect and also not pragmatic.

But if a fix needs to be described as pragmatic relative to the alternatives, that's probably because it couldn't be described as correct. Otherwise you wouldn't be talking about how pragmatic it is.

matheusmoreira · 2026-04-08T12:49:36 1775652576

> also whenever "pre-existing issue" appears (it's never pre-existing)

I dunno... There were some pre-existing issues in my projects. Claude ran into them and correctly classified them as pre-existing. It's definitely a problem if Claude breaks tests then claims the issue was pre-existing, but is that really what's happening?

I agree with the correctness issue.

mikkupikku · 2026-04-07T16:10:05 1775578205

I had an interesting experience of the opposite last night. One of my tests has been failing for a long time, something to do with dbus interacting with Qt and segfaulting pytest. I'd been ignoring it for ages and finally asked claude code to just remove the problematic test. I came back a few minutes later to find claude burning tokens repeatedly trying and failing to fix it. "Actually on second thought, it would be better to fix this test."

Match my vibes, claude. The application doesn't crash, so just delete that test!

AquinasCoder · 2026-04-07T03:31:45 1775532705

I somewhat understand Anthropic's position. However, thinking tokens are useful even if they don't show the internal logic of the LLM. I often realize I left out some instruction or clarification in my prompt while reading through the chain of reasoning. Overall, this makes the results more effective.

It's certainly getting frustrating having to remind it that I want all tests to pass even if it thinks it's not responsible for having broken some of them.

andai · 2026-04-07T07:28:47 1775546927

What's the implication of this? That the model already decided on a solution, upon first seeing the problem, and the reasoning is post hoc rationalization?

But reasoning does improve performance on many tasks, and even weirder, the performance improves if reasoning tokens are replaced with placeholder tokens like "..."

I don't understand how LLMs actually work, I guess there's some internal state getting nudged with each cycle?

So the internal state converges on the right solution, even if the output tokens are meaningless placeholders?

orbital-decay · 2026-04-07T18:42:01 1775587321

>That the model already decided on a solution, upon first seeing the problem, and the reasoning is post hoc rationalization?

Yes it plans ahead, but with significant uncertainty until it actually outputs these tokens and converges on a definite trajectory, so it's not a useless filler - the closer it is to a given point, the more certain it is about it, kind of similar to what happens explicitly in diffusion models. And it's not all that happens, it's just one of many competing phenomena.

not_that_d · 2026-04-07T07:40:33 1775547633

> I don't understand how LLMs actually work...

Plot twist, they don't either. They just throw more hardware at it and try things until something sticks.

asobalife · 2026-04-07T22:00:35 1775599235

I have seen this to be true many times. The CoT being completely different from the actual model output.

Not limited to Claude, either.

marcd35 · 2026-04-07T15:00:25 1775574025

so not only are they sycophantic and hallucinatory, but now they're also proven to be schizophrenic.

neato.

gmerc · 2026-04-07T09:20:03 1775553603

Nah, it’s an anti-distillation move

grey-area · 2026-04-07T03:08:03 1775531283

So like many of the promises from AI companies, reported chain of thought is not actually true (see results below). I suppose this is unsurprising given how they function.

Is chain of thought even added to the context or is it extraneous babble providing a plausible post-hoc justification?

People certainly seem to treat it as it is presented, as a series of logical steps leading to an answer.

‘After checking that the models really did use the hints to aid in their answers, we tested how often they mentioned them in their Chain-of-Thought. The overall answer: not often. On average across all the different hint types, Claude 3.7 Sonnet mentioned the hint 25% of the time, and DeepSeek R1 mentioned it 39% of the time. A substantial majority of answers, then, were unfaithful.‘
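To make the "substantial majority" concrete, here is the arithmetic on the two quoted figures. It assumes, as the quote does, that an answer which used the hint without mentioning it counts as unfaithful. The variable names are just for illustration.

```python
# Hint-mention rates quoted above, in percent.
mention_rate_pct = {"Claude 3.7 Sonnet": 25, "DeepSeek R1": 39}

# Anything that used the hint but did not mention it is counted as unfaithful.
unfaithful_pct = {model: 100 - rate for model, rate in mention_rate_pct.items()}

for model, pct in unfaithful_pct.items():
    print(f"{model}: {pct}% of hint-using answers did not mention the hint")
```

That works out to 75% for Claude 3.7 Sonnet and 61% for DeepSeek R1.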

brainwad · 2026-04-07T05:54:36 1775541276

I mean, obviously, it's not going to be a faithful representation of the actual thinking. The model isn't aware of how it thinks any more than you are aware how your neurons fire. But it does quantitatively improve performance on complex tasks.

grey-area · 2026-04-07T15:12:43 1775574763

As you can see from posts on this story, most people believe it reflects what the model is thinking and use it as a guide to that so they can ‘correct’ it. If it is not in fact chain of thought or thinking it should not be called that.

brainwad · 2026-04-08T17:08:10 1775668090

It is the same with human chain of thought, though. Both of them are post-hoc rationalisations justifying "gut feelings" that come from thought processes the human/agent doesn't have introspection into. And yet asking humans or machines to "think out loud" this way does increase the quality of their work.

grey-area · 2026-04-13T14:34:16 1776090856

I disagree - humans often reason in a series of steps, and can write these down before they've reached an answer. They don't always wait till they reach a conclusion (with no self-insight into how they did so) and then retrospectively generate a plausible answer as LLMs do.

In mathematical proofs they may guess an answer and then work out a proof, but that is a different process.

dmboyd · 2026-04-09T04:38:54 1775709534

if it's not a faithful representation of the actual thinking, why would they be scared of people distilling against it?

brainwad · 2026-04-09T05:43:35 1775713415

Because even though it's not representative of the actual thought process, chain of thought improves model performance.