A.I.’s un-learning problem: Researchers say it’s virtually impossible to make an A.I. model ‘forget’ the things it learns from private user data

assassin_aragorn@lemmy.world · 2 years ago

DigitalWebSlinger@lemmy.world · 2 years ago

“AI model unlearning” is the equivalent of saying “removing a specific feature from a compiled binary executable”. So, yeah, basically not feasible.

But the solution is painfully easy: you remove the data from your training set (i.e., the source code) and retrain your model (recompile the executable).

Yes, it may cost you a lot of time and money to accomplish this, but such are the consequences of breaking the law. Maybe be extra careful about obeying laws going forward, eh?

Ajen@sh.itjust.works · 2 years ago

removing a specific feature from a compiled binary executable

That’s actually very feasible. Compiled binaries translate directly to assembly, which is taught to most (all?) comp sci undergrads. When the binary is compiled by a standard compiler the translated assembly is very easy to understand, and for software that has protections/obfuscations like DRM and viruses there are reverse engineering tools like IDA Pro.

londos@lemmy.world · 2 years ago

Far cheaper to just buy politicians and change the law.

Ann Archy@lemmy.world · 2 years ago

Just ask the AI to do it for you. Much better return on investment.

CoderKat@lemm.ee · 2 years ago

Retraining the model is incredibly expensive. That basically means never training the model on any user data, whether it slips in accidentally, gets in through someone sabotaging the training data, or is even included with consent (since consent can be revoked).

Thann@lemmy.ml · 2 years ago

Consent can’t be revoked; they’re not even trying to get consent.

They seemingly all have a “use first then ask for forgiveness” approach which should come around to bite them in the ass

Touching_Grass@lemmy.world · 2 years ago

They shouldn’t need consent unless they’re reselling the works in question

Blackmist@feddit.uk · 2 years ago

Yeah, there’s no point in the model where you can pinpoint that data. It’s like asking a brain surgeon to slice your brain to make you forget something. Sure, he could do it, but don’t be surprised if you can’t speak or remember your wife when you wake up…

The only option is to relearn from the new filtered training data, or filter it on the way out, which is likely easier said than done because it has no real context of what it’s doing.
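As a rough illustration of "filtering on the way out", here is a minimal sketch that redacts email addresses from generated text. The `redact` helper and the regex are assumptions made for this example, not anything a real model vendor is known to use. It also shows the weakness mentioned above: a pattern filter has no context, so a lightly paraphrased address slips through.

```python
import re

# Hypothetical output filter: a simple pattern for email-like strings.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("[REDACTED]", text)


caught = redact("You can reach her at alice@example.com.")
missed = redact("You can reach her at alice at example dot com.")

print(caught)  # the literal address is replaced
print(missed)  # the paraphrased address passes through unchanged
```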

Dkarma@lemmy.world · edit-2 2 years ago

It takes so much money to retrain models tho… like the entire cost all over again… and what if they find something else?

Crazy how murky the legalities are here… just no case law to base anything on really

For people who don’t know how machine learning works, at a very high level:

Basically, every input the AI is trained on or “sees” changes a set of weights (floating-point numbers), and once the weights are changed you can’t remove that input and change the weights back to what they were; you can only keep changing them with new input.
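To make that concrete, here is a toy sketch, assuming a two-weight linear model trained with plain per-sample gradient descent (real models are vastly larger, but the update rule has the same shape). The data and the "private" record are made up. It compares a full retrain without the private record against a naive attempt to undo it by stepping back up that one sample's gradient.

```python
def sgd_step(weights, x, y, lr=0.1, sign=-1.0):
    """Apply one gradient step for a linear model with squared-error loss.

    sign=-1.0 is ordinary descent; sign=+1.0 is a naive "reverse" step.
    """
    err = sum(w * xi for w, xi in zip(weights, x)) - y
    return [w + sign * lr * err * xi for w, xi in zip(weights, x)]


def train(samples, epochs=20):
    """Start from zero weights and update them once per sample per epoch."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in samples:
            weights = sgd_step(weights, x, y)
    return weights


public = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
private = ([1.0, 1.0], 5.0)  # hypothetical record someone wants forgotten

with_private = train(public + [private])
retrained = train(public)  # the expensive but clean option

# Naive "unlearning": one step back up the gradient of the private sample.
naive_undo = sgd_step(with_private, *private, sign=+1.0)

gap = max(abs(a - b) for a, b in zip(naive_undo, retrained))
print("trained with private sample:", with_private)
print("retrained without it:       ", retrained)
print("naive undo:                 ", naive_undo)
print("largest gap between undo and retrain:", gap)
```

The private sample influenced every later update, so a single reverse step lands nowhere near the retrained weights.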

DigitalWebSlinger@lemmy.world · 2 years ago

So we just let them break the law without penalty because it’s hard and costly to redo the work that already broke the law? Nah, they can put time and money towards safeguards to prevent themselves from breaking the law if they want to try to make money off of this stuff.

Dkarma@lemmy.world · 2 years ago

No one has established that they’ve broken the law in any way, though. Authors are upset but it’s unclear if they can prove they were damaged in some way or that the companies in question are even liable for anything.

Remember, the burden of proof is on the plaintiff, not these companies, if a suit is brought.

vrighter@discuss.tchncs.de · 2 years ago

I’m european. I have a right to be forgotten.

frezik@midwest.social · 2 years ago

The “safeguard” would be “no PII in training data, ever”. Which is fine by me, but that’s what it really means. Retraining on a large dataset every time a GDPR request comes in is completely infeasible.

AWittyUsername@lemmy.world · 2 years ago

Much like DLLs exist for compiled binary executables, could we not have modular AI training data? Then only a small chunk would need to be relearned at a time.
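For what it's worth, the modular idea can be sketched with sharding: split the data into isolated shards, train one small model per shard, and average their predictions. Forgetting a record then means retraining only the shard that held it. Everything below (the tiny linear models, the shard layout, the `forget` helper) is a hypothetical toy, and it only works because each sub-model never sees another shard's data. That is a very different setup from one big network trained on everything.

```python
def train(samples, epochs=50, lr=0.1):
    """Fit a tiny linear model y ~ w*x + b with per-sample SGD."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return (w, b)


def predict(models, x):
    """Aggregate by averaging the predictions of all shard models."""
    return sum(w * x + b for w, b in models) / len(models)


def forget(shards, models, shard_idx, sample):
    """Drop one sample and retrain only the shard that contained it."""
    shards[shard_idx] = [s for s in shards[shard_idx] if s != sample]
    models[shard_idx] = train(shards[shard_idx])


shards = [
    [(0.0, 0.1), (1.0, 1.9)],
    [(0.5, 1.0), (1.5, 3.1), (1.0, 9.0)],  # (1.0, 9.0) is the record to forget
    [(0.2, 0.5), (0.8, 1.5)],
]
models = [train(s) for s in shards]
before = list(models)
pred_before = predict(models, 1.0)

forget(shards, models, 1, (1.0, 9.0))
pred_after = predict(models, 1.0)

print("shard 0 untouched:", models[0] == before[0])
print("shard 1 retrained:", models[1] != before[1])
print("prediction at x=1.0 before/after:", pred_before, pred_after)
```

The trade-off is that each sub-model learns from less data, so the combined model may be weaker than one trained on the full set.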

Just throwing this into the void here.

SGforce@lemmy.ca · 2 years ago

Nah, it’s too much like how a lobotomy works. Even taking a small chunk of your brain might have huge impacts.

Aceticon@lemmy.world · 2 years ago

The difference between having or not having something in the training set of a neural network is going to be different values for non-integer factors all over the network and, worse, those differences are just as likely to be tiny as they are to be massive.

Or to give you a decent metaphor for it, “it would be like trying to remove a specific egg from a bowl of scrambled eggs”.

Fushuan [he/him]@lemm.ee · 2 years ago

A trained AI model is a set of weights applied to a given neural network. The difference between two models, one trained without the key data and one trained with it, can be computed, and a tool can be created to generate a transformation from model A to model B, or even a good approximation of model B trained with another AI.

It’s not THAT hard actually.

applebusch@lemmy.world · 2 years ago

I don’t doubt that mathematically, but practically that sounds like it would be functionally equivalent to just retraining the model. Like if it were more efficient to just calculate the model weights based on input data, that’s what we would do, there would be no need to go through the training process. We could just start with a completely untrained model and calculate the difference between that model and one that was trained with all the data. The more I think about it the more I doubt that mathematically. The feasibility of this would depend heavily on the details of the model and how it was trained. Lots of times the order in which the data was presented during training has an impact on the final result, so there’s no guarantee your subtraction would achieve the same or even similar result as retraining without the specified data. Maybe you can reference some papers on the topic.
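Both objections can be seen in a toy setting. The sketch below assumes a two-weight linear model and made-up data. It trains on the same three samples in two different orders and gets different weights, then computes the "model A to model B" delta, which only exists once the model without the data has already been retrained.

```python
def train(samples, epochs=20, lr=0.1):
    """Per-sample SGD on a linear model with squared-error loss."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in samples:
            err = sum(w * xi for w, xi in zip(weights, x)) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights


samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 5.0)]

# Same data, different presentation order, different final weights.
forward = train(samples)
backward = train(list(reversed(samples)))
order_gap = max(abs(a - b) for a, b in zip(forward, backward))

# The delta from "with" to "without" needs the retrained weights as input,
# so computing it costs at least one retrain.
retrained = train(samples[:2])
delta = [b - a for a, b in zip(forward, retrained)]

print("forward order: ", forward)
print("backward order:", backward)
print("gap caused by ordering alone:", order_gap)
print("delta to the retrained model:", delta)
```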

stratoscaster@lemmy.zip · 2 years ago

You are correct. It would be heinously expensive to “remove” training data. Even training a very rudimentary model can take hours on a high-end tensor processor.

SoBoredAtWork@lemmy.world · 2 years ago

You don’t work in AI, do you?

Fushuan [he/him]@lemm.ee · edit-2 2 years ago

I have a bachelor’s in computer science specialised in data engineering and data science, with a master’s in data science, and I have worked for some years in computer vision, training and tweaking models.

Currently specialised in data engineering, but I’d wager I do know what I’m talking about.

People who “work with AI” most of the time don’t know shit about how it internally works, so I don’t know if that’s a label I’d even use to give an informed opinion about the matter.

down daemon@lemmy.ml · 2 years ago

fuck laws

trashgirlfriend@lemmy.world · 2 years ago

Man, fuck these user data protection laws, hate em

hglman@lemmy.ml · 2 years ago

The issue is the ownership of the AI; if it were not ownable or instead owned by everyone, there wouldn’t be an issue.

trashgirlfriend@lemmy.world · edit-2 2 years ago

Ah yes, let’s just quickly switch the mode of production in this industry, I’m sure that’s going to happen.

I don’t want my data to be processed by the fully automated luxury gay space machine learning algorithms either.