submitted 2 weeks ago by Noerknhar@feddit.org to c/fuck_ai@lemmy.world
[-] ParadoxSeahorse@lemmy.world 24 points 2 weeks ago

These are physically bought, scanned books. What this case doesn't cover is what they’re allowed to do with the resulting model, e.g. charge people for access to it.

Maybe controversial, but compared to Meta pirating books, claiming it makes no difference, and arguing that each book is individually worthless to the model (while the model is of course worth billions), is it wrong that I’m like “hmm, at least they’re buying books”?

As others say, there should be specific licensing, so they actually need to pay a per-book cost, set by the publisher, specifically to legally include it in their model, instead of just shopping like a human customer when the buyer is really an LLM in a skin suit.

[-] altkey@lemmy.dbzer0.com 4 points 2 weeks ago

Your comment made me think of the LLM pipeline this way (as if it could have started out legal):

  1. Shit goes in: above some volume, sourcing material should by default be treated as commercial use, not personal use. The two are already clearly differentiated in licenses, pricing, fees, etc.
  2. Shit goes out: the strictest license in the whole dataset applies to how the output can be used. If we can't tell whether X was in the mix, we can't say it wasn't, and therefore assume it's there.

To claim X is not in the dataset, the LLM owner's dataset should be open unless parts of it are specifically closed by contractual obligations with the data miner/broker. Given the same parameters, the open and closed parts should reproduce the same hash sums for the datasets and for the resulting weights as were recorded during the training run itself. If the open parts don't contain said piece of work, the responsibility is on the data providers, so the closed parts get inspected by an unaffiliated party and the owner of the LLM. Brokers are interested in showing it's not on them, and there should be a safeguard against swiftly deleting the evidence, so the initial trade deal is fixed by some hash once again.
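To make the "hash sums of datasets" part concrete, here is a minimal sketch, assuming a dataset is just a mapping from file names to raw bytes. The function names `piece_hash` and `dataset_hash` are made up for illustration; no real pipeline is known to work exactly this way. It covers only the dataset fingerprint, not the weights.

```python
import hashlib


def piece_hash(data: bytes) -> str:
    """SHA-256 digest of one work, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def dataset_hash(pieces: dict[str, bytes]) -> str:
    """Fingerprint of a whole dataset (hypothetical scheme).

    Per-piece digests are sorted before being combined, so the result
    does not depend on the order in which files were collected.
    File names are ignored; only the contents count.
    """
    digests = sorted(piece_hash(content) for content in pieces.values())
    combined = hashlib.sha256()
    for digest in digests:
        combined.update(bytes.fromhex(digest))
    return combined.hexdigest()


# Usage: the fingerprint fixed at the time of the trade deal can later be
# compared with the fingerprint of whatever the broker hands to an inspector.
at_sale = {"a.txt": b"first work", "b.txt": b"second work"}
at_audit = {"b.txt": b"second work", "a.txt": b"first work"}
print(dataset_hash(at_sale) == dataset_hash(at_audit))  # True: same contents
```

Deleting or altering even one piece changes the fingerprint, which is what makes quietly removing evidence detectable.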

A broker holding someone's pirated work can't knowingly sell the same dataset unless the problematic pieces are deleted. The resulting model can continue learning on additional material, but then a complete retraining should be done on the new, updated datasets; otherwise it's a crime.

Failure to provide hashes or other signatures verifying that the datasets are the same shifts the blame onto the LLM's owner. Producing and sharing them in an open and observable manner, and making more of the data pool public, grants one the right to make it a business and shields one from possible lawsuits.

Data brokers may choose not to disclose their datasets to the public, but all direct for-profit piracy charges are on them, not the LLM owner, if the latter didn't obtain said content themselves but purchased it from another party.

It got longer than I thought.

[-] HobbitFoot@thelemmy.club 3 points 2 weeks ago

Except that humans are allowed to make some derivative works under current copyright law. The bar has been lowered to the point where reaction videos have some defense as a derivative work.

If a reaction video is a derivative work, why can't an AI trained on that work also count?

[-] ParadoxSeahorse@lemmy.world 2 points 2 weeks ago

“Derivative” is less questionable than “work”.

E.g. AI-generated imagery is not copyrightable for the most part; is it legally closer to plagiarism than art?

[-] HobbitFoot@thelemmy.club 1 points 2 weeks ago

Derivative describes what happened to the copyrighted work, not what slop was churned out by it.

If the plagiarism is far enough from the original work, it isn't protected by the original copyright.

[-] ParadoxSeahorse@lemmy.world 2 points 2 weeks ago

I really like the idea of signing the model with a dataset hash. Each legally licensable piece of source material could provide a hash, maybe?

In terms of outputs, it’s really difficult to judge how transformative a model is without transparency of the dataset. We’ve obviously seen prompts regurgitate known works verbatim; it could be even more prevalent than it appears, hidden by the obscurity of the sources rather than by any transformation. More than meets the eye.

[-] altkey@lemmy.dbzer0.com 1 points 2 weeks ago

“Each legally licensable piece of source material could provide a hash, maybe?”

We may generate a hash sum for every piece, but I don't see right now how it would help. The only application I can think of is knowing that the database of many works hasn't been modified between stages A and B. But the hash of a single piece can't, by itself, tell us whether that piece was included in the dataset, or help prosecute cases of its misuse, etc. For licensing stuff it wouldn't hurt to obtain it, I guess, but I don't know how it would be applied to prove something. Actually, I think I do now*.
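One known way to connect a single piece's hash to a whole-dataset hash is a Merkle tree, which is a swapped-in technique and not something proposed in this thread. The owner publishes only the root; a short proof then shows that a given piece was part of the set behind that root. A minimal sketch, with hypothetical function names, leaf/node prefix bytes added so a leaf can't be passed off as an inner node, and odd levels padded by duplicating the last node as a simplification:

```python
import hashlib


def leaf_hash(piece: bytes) -> bytes:
    # 0x00 prefix marks a leaf.
    return hashlib.sha256(b"\x00" + piece).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # 0x01 prefix marks an inner node.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root_and_proof(pieces: list[bytes], index: int):
    """Return (root, proof) for pieces[index].

    proof is a list of (sibling_hash, sibling_is_left) pairs,
    ordered from the leaf level up to the root.
    """
    if not pieces:
        raise ValueError("dataset is empty")
    level = [leaf_hash(p) for p in pieces]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad odd levels with a copy of the last node
        sibling = index ^ 1  # neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof


def verify_inclusion(piece: bytes, proof, root: bytes) -> bool:
    """Recompute the path from the piece up to the root and compare."""
    node = leaf_hash(piece)
    for sibling, sibling_is_left in proof:
        node = node_hash(sibling, node) if sibling_is_left else node_hash(node, sibling)
    return node == root


works = [b"book one", b"book two", b"book three"]
root, proof = merkle_root_and_proof(works, 2)
print(verify_inclusion(b"book three", proof, root))  # True
print(verify_inclusion(b"some other book", proof, root))  # False
```

This proves inclusion only when whoever holds the dataset supplies the proof. It can't prove a work is absent, which is why the "assume it's there" default still matters.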

“In terms of outputs, it’s really difficult to judge how transformative a model is without transparency of dataset.”

True. That's why I assume everything in the dataset is involved in every creation.

It is probably a level of fight only accessible to the likes of Disney with their bottomless pockets, but if they do their lawsuit thing frequently enough (correctly assuming the likeness of Mickey is in every graphical dataset), there's hope that LLM owners and dataset brokers would become more transparent about the data they obtain and use, thus helping everyone.

One tool I can see being created is - here's the asterisk * - a standard look-up webpage where you can search a closed commercial dataset (or many of them at once) by hash or by providing a file**. Hashes suck ass for this, since the hash changes whenever the file is slightly modified. But if it's a known copy that has circulated on the web for a while, the hash can serve as a unique identifier for that one thing.

Asterisk two** - I imagine that if something like that ever happens, it'd be a captcha-, ad-, and JS-ridden nightmare. If there were ever a bill about this whole thing, it should cover the look-up site too, with a requirement to provide an API for that resource and limits on how awful it can be.
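A rough sketch of the look-up idea and of the hash-fragility problem, assuming text works and an in-memory index. All names here are hypothetical, and the normalization step (lowercasing, collapsing whitespace) is just one illustrative way to survive trivial edits; it does nothing against real rewording.

```python
import hashlib
import re


def exact_id(data: bytes) -> str:
    """Plain SHA-256 of the file: changes if even one byte changes."""
    return hashlib.sha256(data).hexdigest()


def normalized_id(text: str) -> str:
    """Hash of a canonical form: lowercased, whitespace runs collapsed."""
    canonical = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_index(works: dict[str, str]) -> dict[str, str]:
    """Map normalized hash -> title for every work in a (closed) dataset."""
    return {normalized_id(text): title for title, text in works.items()}


def look_up(index: dict[str, str], text: str):
    """Return the title of the matching work, or None if nothing matches."""
    return index.get(normalized_id(text))


dataset = {"Example Novel": "It was a dark and stormy night."}
index = build_index(dataset)

original = "It was a dark and stormy night."
reflowed = "It was a dark  and stormy\nnight."

print(exact_id(original.encode()) == exact_id(reflowed.encode()))  # False
print(look_up(index, reflowed))  # Example Novel
print(look_up(index, "The night was dark and stormy."))  # None
```

A look-up site could accept either a hash or an uploaded file and run the same function on it, so the dataset itself never has to be exposed.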

[-] herseycokguzelolacak@lemmy.ml 8 points 2 weeks ago

If I can read books and learn, why can't AI?

[-] monogram@feddit.nl 33 points 2 weeks ago

Just because you own a cd doesn’t mean you have a license to play it in a club.

[-] herseycokguzelolacak@lemmy.ml 2 points 2 weeks ago

It's a good thing they are not playing at a club then.

[-] jmill@lemmy.zip 16 points 2 weeks ago

In this analogy, the AI uses books like a remix DJ would use bits and pieces of songs from different tracks to splice together their output. Except in the case of AI, it will be much harder to identify the original source.

[-] HobbitFoot@thelemmy.club 2 points 2 weeks ago

Under this definition, it is illegal to summarize news articles behind a paywall.

[-] jmill@lemmy.zip 7 points 2 weeks ago

If you made money doing that, it probably would be illegal. You would certainly get sued, in any case.

[-] HobbitFoot@thelemmy.club 1 points 2 weeks ago

People make a lot of money summarizing articles behind paywalls and it is generally considered legal as long as it is a summary and not copied text.

[-] njm1314@lemmy.world 4 points 2 weeks ago

Who are you paying for that?

[-] HobbitFoot@thelemmy.club 2 points 2 weeks ago

You don't have to pay for fair use.

[-] njm1314@lemmy.world 3 points 2 weeks ago

So how are they making a lot of money then?

[-] HobbitFoot@thelemmy.club 2 points 2 weeks ago

Advertisement. You don't have to pay for original content. You just need to pay someone/thing to summarize it and get clicks for advertisement.

[-] njm1314@lemmy.world 3 points 2 weeks ago

I can't say I've ever seen this in my life. Paid advertisement on summaries of paywalled articles. Not something I've come across. Certainly they would be sued if they were found by the companies in question I imagine.

[-] HobbitFoot@thelemmy.club 1 points 2 weeks ago

Sure you have. If you ever have read an article that says "As reported in X", that is a summarization of another journalist's work.

[-] njm1314@lemmy.world 2 points 2 weeks ago

Well now I don't think you understand what a summarization is.

[-] HobbitFoot@thelemmy.club 1 points 2 weeks ago

I was thinking the same about you.

[-] amorpheus@lemmy.world 1 points 2 weeks ago* (last edited 2 weeks ago)

Since when can't you use knowledge gained from books for personal profit?

The only difference is scale.

[-] wewbull@feddit.uk 17 points 2 weeks ago

...because you are a person, not a product.

[-] SpaceNoodle@lemmy.world 15 points 2 weeks ago

LLMs (currently colloquially "AI") are literally incapable of "learning."

[-] nickwitha_k@lemmy.sdf.org 13 points 2 weeks ago

LLMs are not sentient and never can be.

[-] nandeEbisu@lemmy.world 1 points 2 weeks ago

With the law as I understand it (not a lawyer) this seems correct.

I think this is unhealthy for society as a whole though, but it is the legislature's job to fix that, not the judiciary.

this post was submitted on 02 Jul 2025
168 points (100.0% liked)
