
Using an LLM on data does not ingest that data into the training corpus. LLMs don’t “learn” from the information they operate on, contrary to what a lot of people assume.

None of the mainstream paid services ingest operating data into their training sets. You will find a lot of conspiracy theories claiming that companies are saying one thing but secretly stealing your data, of course.





Companies have already shifted from not using customer data to giving customers an option to opt out, e.g.:

“How can I control whether my data is used for model training?

If you are logged into Copilot with a Microsoft Account or other third-party authentication, you can control whether your conversations are used for training the generative AI models used in Copilot. Opting out will exclude your past, present, and future conversations from being used for training these AI models, unless you choose to opt back in. If you opt out, that change will be reflected throughout our systems within 30 days.” https://support.microsoft.com/en-us/topic/privacy-faq-for-mi...

At this point, suggesting it has never happened and will never happen is wildly optimistic.


An enterprise Copilot contract will have already decided this for the organization.

That possibility in no way addresses the underlying concern here.

30 days to opt out? That's skeezy as fuck.

> LLMs don’t “learn” from the information they operate on, contrary to what a lot of people assume.

Nothing is really preventing this though. AI companies have already proven they will ignore copyright and any other legal nuisance so they can train models.


They're already using synthetic data generated by LLMs to further train LLMs. Of course they will not hesitate to feed "anonymized" data generated by user interactions. Who's going to stop them? Or even prove that it's happening. These companies have already been allowed to violate copyright and privacy on a historic global scale.

How would they distinguish between real and fake data? It would be far too easy to pollute their models with nonsense.

I have no doubt that Microsoft has already classified the nature of my work and quality of my code. Of course it's probably "anonymized". But there's no doubt in my mind that they are watching everything you give them access to, make no mistake.

I mean, is it really ignoring copyright when copyright doesn't limit them in any way on training?

Tell that to all the people suing them for using their copyrighted work. In some cases the data was even pirated.

> Nothing is really preventing this though

The enterprise user agreement is preventing this.

Suggesting that AI companies will uniquely ignore the law or contracts is conspiracy theory thinking.


It already happened.

"Meta Secretly Trained Its AI on a Notorious Piracy Database, Newly Unredacted Court Docs Reveal"

https://www.wired.com/story/new-documents-unredacted-meta-co...

They even admitted to using copyrighted material.

"‘Impossible’ to create AI tools like ChatGPT without copyrighted material, OpenAI says"

https://www.theguardian.com/technology/2024/jan/08/ai-tools-...


Though the porn they copied was just for personal use, because clearly that's an important perk of being employed there:

https://www.vice.com/en/article/meta-says-the-2400-adult-mov...


Information about the way we interact with the data (RLHF) can be used to refine agent behaviour.

While this isn't used specifically for LLM training, it can involve aggregating insights from customer behaviour.


That’s a training step. It requires explicitly collecting the data and using it in the training process.

Merely using an LLM for inference does not train it on the prompts and data, as many incorrectly assume. There is a surprising lack of understanding of this separation even on technical forums like HN.
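To make that separation concrete, here's a minimal sketch using a small open model via the Hugging Face transformers library (obviously not any vendor's actual serving pipeline): the prompt only drives a forward pass, and the weights are bit-for-bit identical afterwards.

  # Rough sketch: inference is a forward pass only. No loss, no optimizer,
  # no gradient step, so the weights never change.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")
  model.eval()

  snapshot = model.lm_head.weight.detach().clone()  # copy of one weight matrix

  prompt = "Here is some confidential business data: Q3 revenue was"
  inputs = tokenizer(prompt, return_tensors="pt")
  with torch.no_grad():                     # inference: no gradients tracked
      model.generate(**inputs, max_new_tokens=20)

  # The prompt influenced the output, but not the model itself.
  assert torch.equal(snapshot, model.lm_head.weight)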


That's definitely a fair point.

However, let's say I record human interactions with my app; for example, when a user accepts or rejects an AI-synthesised answer.

This data can be used by me to influence the behaviour of an LLM via RAG or by altering application behaviour.

It's not going to change the model's weights, but it would influence its behaviour.
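Something like this, just to illustrate (hypothetical names, not any real product's code): feedback gets logged, and previously accepted answers are reused as few-shot context at prompt time, so behaviour shifts without the weights ever changing.

  # Hypothetical sketch: log accept/reject feedback and feed accepted examples
  # back into the prompt (RAG-style), with no retraining involved.
  from dataclasses import dataclass

  @dataclass
  class Feedback:
      question: str
      answer: str
      accepted: bool

  feedback_log: list[Feedback] = []   # in a real app this would be a database

  def record_feedback(question: str, answer: str, accepted: bool) -> None:
      feedback_log.append(Feedback(question, answer, accepted))

  def build_prompt(question: str) -> str:
      # Naive "retrieval": reuse accepted answers to similar questions as
      # few-shot examples. The model's weights are never modified.
      examples = [
          f"Q: {f.question}\nA: {f.answer}"
          for f in feedback_log
          if f.accepted and any(w in f.question for w in question.split())
      ][:3]
      context = "\n\n".join(examples)
      return f"{context}\n\nQ: {question}\nA:" if context else f"Q: {question}\nA:"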


They are not directly ingesting the data into their training sets, but they are in most cases collecting it and will be using it to train future models.

Do you have any source for this at all?

It's stated in the privacy policy.

If they weren't, then why would enterprise level subscriptions include specific terms stating that they don't train on user provided data? There's no reason to believe that they don't, and if they don't now then there's no reason to believe that they won't later whenever it suits them.

> then why would enterprise level subscriptions include specific terms stating that they don't train on user provided data?

What? That’s literally my point: under enterprise agreements they aren’t training on the data of their enterprise customers, contrary to what the parent commenter claimed.


Just read the ToS of the LLM products please

This is so naive. The ToS permits paraphrasing of user conversations, by not excluding it, and then training on THAT. You’d never be able to definitively connect paraphrased data to yours, especially if they only train on paraphrased data that covers frequent, as opposed to rare, topics.

Do you have a citation for this?

“Hey DoctorPangloss, how can we train on user data without training on user data?”

“You can use an LLM to paraphrase the incoming requests and save that. Never save the verbatim request. If they ask for all the request data we have, we tell them the truth, we don’t have it. If they ask for paraphrased data, we’d have no way of correlating it to their requests.”

“And what would you say, is this a 3 or a 5 or…”

Everything obvious happens. Look closely at the PII management agreements. Btw OpenAI won’t even sign them because they’re not sure if paraphrasing “counts.” Google will.


I have. Have you? Can you quote the sections you’re talking about?

https://www.anthropic.com/news/updates-to-our-consumer-terms

"We will train new models using data from Free, Pro, and Max accounts when this setting is on (including when you use Claude Code from these accounts)."


> You will find a lot of conspiracy theories claiming that companies are saying one thing but secretly stealing your data, of course.

It's not really a conspiracy when we have multiple examples of high-profile companies doing exactly this. And it keeps happening. Granted, I'm unaware of cases of this occurring currently with professional AI services, but it's basic security 101 that you should never let anything even have the remote opportunity to ingest data unless you don't care about the data.


> never let anything even have the remote opportunity to ingest data unless you don't care about the data

This is objectively untrue? Giant swaths of enterprise software are based on establishing trust with approved vendors and systems.


> It's not really a conspiracy when we have multiple examples of high profile companies doing exactly this.

Do you have any citations or sources for this at all?


To be pedantic, it is still a conspiracy, just no longer a theory.

To be pedantic, a theory that has been proven correct is still a theory, right?

Wrong, buddy.

Many of the top AI services use human feedback to continuously apply "reinforcement learning" after the initial deployment of a pre-trained model.

https://en.wikipedia.org/wiki/Reinforcement_learning_from_hu...


RLHF is a training step.

Inference (what happens when you use an LLM as a customer) is a separate process from training. Using an LLM doesn’t train it; that’s not what RLHF means.


I am aware, I've trained my own models. You're being obtuse.

The big companies - take Midjourney, or OpenAI, for example - take the feedback that is generated by users, and then apply it as part of the RLHF pass on the next model release, which happens every few months. That's why they have the terms in their TOS that allow them to do that.
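Roughly the kind of thing that happens between releases (a hypothetical sketch, not any specific vendor's pipeline): logged thumbs-up/down signals get turned into preference pairs that a later reward-model or DPO-style training pass consumes.

  # Hypothetical sketch: converting logged feedback into preference pairs of
  # the kind an RLHF / reward-model training step consumes later.
  from collections import defaultdict

  # (prompt, response, rating) tuples, e.g. from a "was this helpful?" widget
  logged = [
      ("summarise this contract", "Here is a concise summary...", +1),
      ("summarise this contract", "I cannot do that.", -1),
      ("write a haiku", "Autumn leaves falling...", +1),
  ]

  by_prompt = defaultdict(lambda: {"chosen": [], "rejected": []})
  for prompt, response, rating in logged:
      key = "chosen" if rating > 0 else "rejected"
      by_prompt[prompt][key].append(response)

  # Preference pairs: (prompt, preferred response, dispreferred response)
  pairs = [
      (p, c, r)
      for p, d in by_prompt.items()
      for c in d["chosen"]
      for r in d["rejected"]
  ]
  # These pairs would feed a reward model or DPO-style fine-tune later,
  # i.e. an explicit training step, separate from day-to-day inference.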


Maybe prompts are enough to infer the rest?



>Ah yes, blindly trusting the corpo fascists that stole the entire creative output of humanity to stop now.

Stealing implies the thing is gone, no longer accessible to the owner.

People aren't protected from copying in the same way. There are lots of valid exclusions, and building new non competing tools is a very common exclusion.

The big issue with the OpenAI case is that they didn't pay for the books. Scanning them and using them for training is very much likely to be protected. Similar case with the old Nintendo bootloader.

The "Corpo Fascists" are buoyed by your support for the IP laws that have thus far supported them. If anything, to be less "Corpo Fascist" we would want more people to have more access to more data. Mankind collectively owns the creative output of Humanity, and should be able to use it to make derivative works.


> Stealing implies the thing is gone, no longer accessible to the owner.

Isn't this a little simplistic?

If the value of something lies in its scarcity, then making it widely available has robbed the owner of a scarcity value which cannot be retrieved.

A win for consumers, perhaps, but a loss for the owner nonetheless.


No, calling every possible loss due to another person's actions "stealing" is simplistic. We have terms for all these things, like "intellectual property infringement".

Trying to group (thing I don't like) with (thing everyone doesn't like) is an old semantic trick that needs to be abolished. Taxonomy is good; if your arguments are good, you don't need emotively charged, imprecise language.


I literally reused the definition of stealing you gave in the post above.

> Stealing implies the thing is gone, no longer accessible to the owner.

You know a position is indefensible when you equivocation fallacy this hard.

> The "Corpo Fascists" are buoyed by your support for the IP laws

You know a position is indefensible when you strawman this hard.

> If anything, to be less "Corpo Fascist" we would want more people to have more access to more data. Mankind collectively owns the creative output of Humanity, and should be able to use it to make derivative works.

Sounds about right to me, but that you would state that while defending slop slingers is enough to give me whiplash.

> Scanning them and using them for training is very much likely to be protected.

Where can I find these totally legal, free, and open datasets all of these slop slingers are trained on?


>You know a position is indefensible when you equivocation fallacy this hard.

No, it's quite defensible. And if that was equivocation, you can simply outline that you didn't mean to invoke the specific definition of stealing, but were just using it for its emotive value.

>You know a position is indefensible when you strawman this hard.

It's accurate. No one wants these LLM guys stopped more than other big fascistic corporations; there's plenty of oppositional noise out there for you to educate yourself with.

>Sounds about right to me, but why you would state that when defending slop slingers is enough to give me whiplash.

Cool, so if you agree all data should be usable to create derivative works, then I don't see what your complaint is.

>Where can I find these totally legal, free, and open datasets all of these slop slingers are trained on?

You invoked "strawman" and then hit me with this combo strawman/non sequitur? Cool move <1 day old account, really adds to your 0 credibility.

I literally pointed out they should have to pay the same access fee as anyone else for the data, but once obtained, they should be able to use it any way they like. Reading the comment explains the comment.

Unless, charitably, you are suggesting that if a company is legally able to purchase content, and use it as training data, that somehow compels them to release that data for free themselves?

Weird take if true.




