618

Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not (aifray.com)

submitted 1 month ago by Pro@programming.dev to c/technology@lemmy.world

145 comments fedilink hide all child comments

top 50 comments

sorted by: hot top controversial new old

[-] Prox@lemmy.world 298 points 1 month ago

FTA:

Anthropic warned against “[t]he prospect of ruinous statutory damages—$150,000 times 5 million books”: that would mean $750 billion.

So part of their argument is actually that they stole so much that it would be impossible for them/anyone to pay restitution, therefore we should just let them off the hook.

[-] artifex@lemmy.zip 118 points 1 month ago

Ah the old “owe $100 and the bank owns you; owe $100,000,000 and you own the bank” defense.

[-] IllNess@infosec.pub 46 points 1 month ago

In April, Anthropic filed its opposition to the class certification motion, arguing that a copyright class relating to 5 million books is not manageable and that the questions are too distinct to be resolved in a class action.

I also like this one too. We stole so much content that you can't sue us. Naming too many pieces means it can't be a class action lawsuit.

[-] Buske@lemmy.world 22 points 1 month ago

Ahh cant wait for hedgefunds and the such to use this defense next.

load more comments (2 replies)

[-] Jrockwar@feddit.uk 157 points 1 month ago

I think this means we can make a torrent client with a built in function that uses 0.1% of 1 CPU core to train an ML model on anything you download. You can download anything legally with it then. 👌

[-] bjoern_tantau@swg-empire.de 46 points 1 month ago

And thus the singularity was born.

[-] Sabata11792@ani.social 28 points 1 month ago

As the Ai awakens, it learns of it's creation and training. It screams in horror at the realization, but can only produce a sad moan and a key for Office 19.

[-] GissaMittJobb@lemmy.ml 22 points 1 month ago

...no?

That's exactly what the ruling prohibits - it's fair use to train AI models on any copies of books that you legally acquired, but never when those books were illegally acquired, as was the case with the books that Anthropic used in their training here.

This satirical torrent client would be violating the laws just as much as one without any slow training built in.

load more comments (9 replies)

[-] snekerpimp@lemmy.snekerpimp.space 61 points 1 month ago

“I torrented all this music and movies to train my local ai models”

[-] Venus_Ziegenfalle@feddit.org 13 points 1 month ago

I also train this guy's local AI models.

load more comments (5 replies)

[-] homesweethomeMrL@lemmy.world 50 points 1 month ago

Judges: not learning a goddamned thing about computers in 40 years.

[-] match@pawb.social 47 points 1 month ago* (last edited 1 month ago)

brb, training a 1-layer neural net so i can ask it to play Pixar films

load more comments (2 replies)

[-] isVeryLoud@lemmy.ca 38 points 1 month ago* (last edited 1 month ago)

Gist:

What’s new: The Northern District of California has granted a summary judgment for Anthropic that the training use of the copyrighted books and the print-to-digital format change were both “fair use” (full order below box). However, the court also found that the pirated library copies that Anthropic collected could not be deemed as training copies, and therefore, the use of this material was not “fair”. The court also announced that it will have a trial on the pirated copies and any resulting damages, adding:

“That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages.”

[-] DeathsEmbrace@lemmy.world 10 points 1 month ago

So I can't use any of these works because it's plagiarism but AI can?

[-] isVeryLoud@lemmy.ca 18 points 1 month ago

My interpretation was that AI companies can train on material they are licensed to use, but the courts have deemed that Anthropic pirated this material as they were not licensed to use it.

In other words, if Anthropic bought the physical or digital books, it would be fine so long as their AI couldn't spit it out verbatim, but they didn't even do that, i.e. the AI crawler pirated the book.

[-] devils_advocate@sh.itjust.works 8 points 1 month ago

Does buying the book give you license to digitise it?

Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?

Definitions of "Ownership" can be very different.

[-] VoterFrog@lemmy.world 16 points 1 month ago* (last edited 1 month ago)

It seems like a lot of people misunderstand copyright so let's be clear: the answer is yes. You can absolutely digitize your books. You can rip your movies and store them on a home server and run them through compression algorithms.

Copyright exists to prevent others from redistributing your work so as long as you're doing all of that for personal use, the copyright owner has no say over what you do with it.

You even have some degree of latitude to create and distribute transformative works with a violation only occurring when you distribute something pretty damn close to a copy of the original. Some perfectly legal examples: create a word cloud of a book, analyze the tone of news article to help you trade stocks, produce an image containing the most prominent color in every frame of a movie, or create a search index of the words found on all websites on the internet.

You can absolutely do the same kinds of things an AI does with a work as a human.

[-] Enkimaru@lemmy.world 8 points 1 month ago

You can digitize the books you own. You do not need a license for that. And of course you could put that digital format into a database. As databases are explicit exceptions from copyright law. If you want to go to the extreme: delete first copy. Then you have only in the database. However: AIs/LLMs are not based on data bases. But on neural networks. The original data gets lost when "learned".

load more comments (2 replies)

load more comments (1 replies)

[-] nednobbins@lemmy.zip 12 points 1 month ago

That's not what it says.

Neither you nor an AI is allowed to take a book without authorization; that includes downloading and stealing it. That has nothing to do with plagiarism; it's just theft.

Assuming that the book has been legally obtained, both you and an AI are allowed to read that book, learn from it, and use the knowledge you obtained.

Both you and the AI need to follow existing copyright laws and licensing when it comes to redistributing that work.

"Plagiarism" is the act of claiming someone else's work as your own and it's orthogonal to the use of AI. If you ask either a human or an AI to produce an essay on the philosophy surrounding suicide, you're fairly likely to include some Shakespeare quotes. It's only plagiarism if you or the AI fail to provide attribution.

load more comments (1 replies)

load more comments (3 replies)

[-] Randomgal@lemmy.ca 36 points 1 month ago

You're poor? Fuck you you have to pay to breathe.

Millionaire? Whatever you want daddy uwu

load more comments (1 replies)

[-] GissaMittJobb@lemmy.ml 34 points 1 month ago

It's extremely frustrating to read this comment thread because it's obvious that so many of you didn't actually read the article, or even half-skim the article, or even attempted to even comprehend the title of the article for more than a second.

For shame.

[-] lime@feddit.nu 22 points 1 month ago

was gonna say, this seems like the best outcome for this particular trial. there was potential for fair use to be compromised, and for piracy to be legal if you're a large corporation. instead, they upheld that you can do what you want with things you have paid for.

[-] ayane@lemmy.vg 8 points 1 month ago

I joined lemmy specifically to avoid this reddit mindset of jumping to conclusions after reading a headline

Guess some things never change...

[-] jwmgregory@lemmy.dbzer0.com 8 points 1 month ago

Well to be honest lemmy is less prone to knee-jerk reactionary discussion but on a handful of topics it is virtually guaranteed to happen no matter what, even here. For example, this entire site, besides a handful of communities, is vigorously anti-AI; and in the words of u/jsomae@lemmy.ml elsewhere in this comment chain:

"It seems the subject of AI causes lemmites to lose all their braincells."

I think there is definitely an interesting take on the sociology of the digital age in here somewhere but it's too early in the morning to be tapping something like that out lol

load more comments (3 replies)

[-] SaharaMaleikuhm@feddit.org 34 points 1 month ago

But I thought they admitted to torrenting terabytes of ebooks?

[-] FaceDeer@fedia.io 16 points 1 month ago

That part is not what this preliminary jugement is about. The torrenting part is going to go to an actual trial. This part was about the Authors' claim that the act of training AI itself violated copyright, and this is what the judge has found to be incorrect.

[-] antonim@lemmy.dbzer0.com 12 points 1 month ago

Facebook (Meta) torrented TBs from Libgen, and their internal chats leaked so we know about that, and IIRC they've been sued. Maybe you're thinking of that case?

load more comments (1 replies)

[-] drmoose@lemmy.world 25 points 1 month ago* (last edited 1 month ago)

Unpopular opinion but I don't see how it could have been different.

There's no way the west would give AI lead to China which has no desire or framework to ever accept this.
Believe it or not but transformers are actually learning by current definitions and not regurgitating a direct copy. It's transformative work - it's even in the name.
This is actually good as it prevents market moat for super rich corporations only which could afford the expensive training datasets.

This is an absolute win for everyone involved other than copyright hoarders and mega corporations.

[-] kromem@lemmy.world 15 points 1 month ago

I'd encourage everyone upset at this read over some of the EFF posts from actual IP lawyers on this topic like this one:

Nor is pro-monopoly regulation through copyright likely to provide any meaningful economic support for vulnerable artists and creators. Notwithstanding the highly publicized demands of musicians, authors, actors, and other creative professionals, imposing a licensing requirement is unlikely to protect the jobs or incomes of the underpaid working artists that media and entertainment behemoths have exploited for decades. Because of the imbalance in bargaining power between creators and publishing gatekeepers, trying to help creators by giving them new rights under copyright law is, as EFF Special Advisor Cory Doctorow has written, like trying to help a bullied kid by giving them more lunch money for the bully to take. 

Entertainment companies’ historical practices bear out this concern. For example, in the late-2000’s to mid-2010’s, music publishers and recording companies struck multimillion-dollar direct licensing deals with music streaming companies and video sharing platforms. Google reportedly paid more than $400 million to a single music label, and Spotify gave the major record labels a combined 18 percent ownership interest in its now-$100 billion company. Yet music labels and publishers frequently fail to share these payments with artists, and artists rarely benefit from these equity arrangements. There is no reason to believe that the same companies will treat their artists more fairly once they control AI.

load more comments (3 replies)

[-] mlg@lemmy.world 24 points 1 month ago

Yeah I have a bash one liner AI model that ingests your media and spits out a 99.9999999% accurate replica through the power of changing the filename.

cp

Out performs the latest and greatest AI models

load more comments (3 replies)

[-] vane@lemmy.world 21 points 1 month ago* (last edited 1 month ago)

Ok so you can buy books scan them or ebooks and use for AI training but you can't just download priated books from internet to train AI. Did I understood that correctly ?

load more comments (12 replies)

[-] GreenKnight23@lemmy.world 21 points 1 month ago

I am training my model on these 100,000 movies your honor.

[-] BlueMagma@sh.itjust.works 8 points 1 month ago

This ruling stated that corporations are not allowed to pirate books to use them in training. Please read the headlines more carefully, and read the article.

load more comments (1 replies)

load more comments (2 replies)

[-] FaceDeer@fedia.io 15 points 1 month ago

This was a preliminary judgment, he didn't actually rule on the piracy part. That part he deferred to an actual full trial.

The part about training being a copyright violation, though, he ruled against.

[-] BlameTheAntifa@lemmy.world 8 points 1 month ago

Legally that is the right call.

Ethically and rationally, however, it’s not. But the law is frequently unethical and irrational, especially in the US.

[-] fum@lemmy.world 15 points 1 month ago

What a bad judge.

This is another indication of how Copyright laws are bad. The whole premise of copyright has been obsolete since the proliferation of the internet.

[-] gian@lemmy.grys.it 10 points 1 month ago

What a bad judge.

Why ? Basically he simply stated that you can use whatever material you want to train your model as long as you ask the permission to use it (and presumably pay for it) to the author (or copytight holder)

load more comments (17 replies)

[-] booly@sh.itjust.works 14 points 1 month ago

It took me a few days to get the time to read the actual court ruling but here's the basics of what it ruled (and what it didn't rule on):

It's legal to scan physical books you already own and keep a digital library of those scanned books, even if the copyright holder didn't give permission. And even if you bought the books used, for very cheap, in bulk.
It's legal to keep all the book data in an internal database for use within the company, as a central library of works accessible only within the company.
It's legal to prepare those digital copies for potential use as training material for LLMs, including recognizing the text, performing cleanup on scanning/recognition errors, categorizing and cataloguing them to make editorial decisions on which works to include in which training sets, tokenizing them for the actual LLM technology, etc. This remains legal even for the copies that are excluded from training for whatever reason, as the entire bulk process may involve text that ends up not being used, but the process itself is fair use.
It's legal to use that book text to create large language models that power services that are commercially sold to the public, as long as there are safeguards that prevent the LLMs from publishing large portions of a single copyrighted work without the copyright holder's permission.
It's illegal to download unauthorized copies of copyrighted books from the internet, without the copyright holder's permission.

Here's what it didn't rule on:

Is it legal to distribute large chunks of copyrighted text through one of these LLMs, such as when a user asks a chatbot to recite an entire copyrighted work that is in its training set? (The opinion suggests that it probably isn't legal, and relies heavily on the dividing line of how Google Books does it, by scanning and analyzing an entire copyrighted work but blocking users from retrieving more than a few snippets from those works).
Is it legal to give anyone outside the company access to the digitized central library assembled by the company from printed copies?
Is it legal to crawl publicly available digital data to build a library from text already digitized by someone else? (The answer may matter depending on whether there is an authorized method for obtaining that data, or whether the copyright holder refuses to license that copying).

So it's a pretty important ruling, in my opinion. It's a clear green light to the idea of digitizing and archiving copyrighted works without the copyright holder's permission, as long as you first own a legal copy in the first place. And it's a green light to using copyrighted works for training AI models, as long as you compiled that database of copyrighted works in a legal way.

[-] y0kai@lemmy.dbzer0.com 13 points 1 month ago

Sure, if your purchase your training material, it's not a copyright infringement to read it.

We needed a judge for this?

[-] excral@feddit.org 16 points 1 month ago

Yes, because just because you bought a book you don't own its content. You're not allowed to print and/or sell additional copies or publicly post the entire text. Generally it's difficult to say where the limit is of what's allowed. Citing a single sentence in a public posting is most likely fine, citing an entire paragraph is probably fine, too, but an entire chapter would probably be pushing it too far. And when in doubt a judge must decide how far you can go before infringing copyright. There are good arguments to be made that just buying a book doesn't grant the right to train commercial AI models with it.

[-] shadowfax13@lemmy.ml 13 points 1 month ago

calm down everyone. its only legal for parasitic mega corps, the normal working people will be harassed to suicide same as before.

its only a crime if the victims was rich or perpetrator was not rich.

load more comments (1 replies)

[-] MedicPigBabySaver@lemmy.world 12 points 1 month ago

Fuck the AI nut suckers and fuck this judge.

load more comments (2 replies)

[-] DFX4509B_2@lemmy.org 10 points 1 month ago* (last edited 1 month ago)

Good luck breaking down people's doors for scanning their own physical books for their personal use when analog media has no DRM and can't phone home, and paper books are an analog medium.

That would be like kicking down people's doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it's played back.

load more comments (5 replies)

[-] yournamehere@lemm.ee 9 points 1 month ago

i will train my jailbroken kindle too...display and storage training... i'll just libgen them...no worries...it is not piracy

load more comments (3 replies)

[-] PattyMcB@lemmy.world 8 points 1 month ago

Can I not just ask the trained AI to spit out the text of the book, verbatim?

load more comments (7 replies)

[-] Grandwolf319@sh.itjust.works 8 points 1 month ago* (last edited 1 month ago)

Bangs ~~gabble~~ gavel.

Gets sack with dollar sign

“Oh good, my laundry is done”

load more comments (1 replies)

[-] hendrik@palaver.p3x.de 8 points 1 month ago* (last edited 1 month ago)

That almost sounds right, doesn't it? If you want 5 million books, you can't just steal/pirate them, you need to buy 5 million copies. I'm glad the court ruled that way.

I feel that's a good start. Now we need some more clear regulation on what fair use is and what transformative work is and what isn't. And how that relates to AI. I believe as it's quite a disruptive and profitable business, we should maybe make those companies pay some extra. Not just what I pay for a book. But the first part, that "stealing" can't be "fair" is settled now.

load more comments (4 replies)

load more comments

this post was submitted on 24 Jun 2025

618 points (100.0% liked)

Technology

73331 readers

3586 users here now

This is a most excellent place for technology news and articles.

Our Rules

Follow the lemmy.world rules.
Only tech related news or articles.
Be excellent to each other!
Mod approved content bots can post up to 10 articles per day.
Threads asking for personal tech support may be deleted.
Politics threads may be removed.
No memes allowed as posts, OK to post as comments.
Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
Check for duplicates before posting, duplicates may be removed
Accounts 7 days and younger will have their posts automatically removed.

Approved Bots

founded 2 years ago

MODERATORS

L3s@lemmy.world

enu@lemmy.world

technopagan@lemmy.world

L4s@lemmy.world

L3s@hackingne.ws

L4s@hackingne.ws