submitted 1 year ago by L4s@lemmy.world to c/technology@lemmy.world

OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling's Harry Potter series::A new research paper laid out ways in which AI developers should try and avoid showing LLMs have been trained on copyrighted material.

[-] TropicalDingdong@lemmy.world 173 points 1 year ago

It's a bit pedantic, but I'm not really sure I support this kind of extremist view of copyright and the scale of what's being interpreted as 'possessed' under the idea of copyright. Once an idea is communicated, it becomes a part of the collective consciousness. Different people interpret and build upon that idea in various ways, making it a dynamic entity that evolves beyond the original creator's intention. It's like the issues with sampling beats or records in the early days of hip-hop. The very principle of an idea goes against this vision: once you put something out into the commons, it's irretrievable. It's not really yours any more once it's been communicated. I think if you want to keep an idea truly yours, then you should keep it to yourself. Otherwise you are participating in a shared vision of the idea. You don't control how the idea is interpreted, so it's not really yours any more.

Whether that's ChatGPT or Public Enemy is neither here nor there to me. The idea that a work like Peter Pan is still possessed is a very real but very silly and obvious malady of this weirdly accepted but very extreme view of the ability to possess an idea.

[-] Laticauda@lemmy.ca 52 points 1 year ago* (last edited 1 year ago)

AI isn't interpreting anything. This isn't the sci-fi style of AI that people think of; that's general AI. This is narrow AI, which is really just an advanced algorithm. It can't create new things with intent and design, it can only regurgitate a mix of pre-existing stuff based on narrow guidelines programmed into it to try and keep it coherent, with no actual thought or interpretation involved in the result. The issue isn't that it's derivative, the issue is that it can only ever be inherently derivative without any intentional interpretation or creativity, and nothing else.

Even collage art has to qualify as fair use to avoid copyright infringement if it's being done for profit, and fair use requires it to provide commentary, criticism, or parody of the original work used (which requires intent). Even if it's transformative enough to make the original unrecognizable, if the majority of the work is not your own art, then you need to get permission to use it, otherwise you aren't automatically safe from getting in trouble over copyright. Even using images for photoshop involves creative commons and commercial use licenses. Fanart and fanfic are also considered a grey area, and the only reason more of a stink isn't kicked up over them regarding copyright is because they're generally beneficial to the original creators, and credit is naturally provided by the nature of fan works so long as someone doesn't try to claim the characters or IP as their own. So most creators turn a blind eye to the copyright aspect of the genre, but if any ever did want to kick up a stink, they could, and have in the past, like with Anne Rice. And as a result most fanfiction sites do not allow writers to profit off of fanfics, or advertise fanfic commissions. And those are cases with actual humans being the ones to produce the works based on something that inspired them or that they are interpreting. So even human-made derivative works have rules and laws applied to them as well. AI isn't a creative force with thoughts and ideas and intent, it's just a pattern recognition and replication tool, and it doesn't benefit creators when it's used to replace them entirely, like Hollywood is attempting to do (among other corporate entities). Viewing AI at least as critically as actual human beings is the very least we can do, as well as establishing protection for human creators so that they can't be taken advantage of because of AI.

I'm not inherently against AI as a concept and as a tool for creators to use, but I am against AI works with no human input being used to replace creators entirely, and I am against using works to train it without the permission of the original creators. Even in the artist/writer/etc communities it's considered to be a common courtesy to credit other people/works that you based a work on or took inspiration from, even if what you made would be safe under copyright law regardless. Sure, humans get some leeway in this because we are imperfect meat creatures with imperfect memories and may not be aware of all our influences, but a coded algorithm doesn't have that excuse. If the current AIs in circulation can't function without being fed stolen works without credit or permission, then they're simply not ready for commercial use yet as far as I'm concerned. If it's never going to be possible, which I just simply don't believe, then it should never be used commercially period. And it should be used by creators to assist in their work, not used to replace them entirely. If it takes longer to develop, fine. If it takes more effort and manpower, fine. That's the price I'm willing to pay for it to be ethical. If it can't be done ethically, then imo it shouldn't be done at all.

[-] Bogasse@lemmy.world 17 points 1 year ago

Well, I'd consider agreeing if the LLMs were considered as a generic knowledge database. However, I had the impression that the whole response from OpenAI & co. to this copyright issue is "they build original content", both for LLMs and stable diffusion models. Now that they started this line of defence I think that they are stuck with proving that their "original content" is not derived from copyrighted content 🤷

[-] fubo@lemmy.world 109 points 1 year ago* (last edited 1 year ago)

If I memorize the text of Harry Potter, my brain does not thereby become a copyright infringement.

A copyright infringement only occurs if I then reproduce that text, e.g. by writing it down or reciting it in a public performance.

Training an LLM from a corpus that includes a piece of copyrighted material does not necessarily produce a work that is legally a derivative work of that copyrighted material. The copyright status of that LLM's "brain" has not yet been adjudicated by any court anywhere.

If the developers have taken steps to ensure that the LLM cannot recite copyrighted material, that should count in their favor, not against them. Calling it "hiding" is backwards.

[-] cantstopthesignal@sh.itjust.works 28 points 1 year ago* (last edited 1 year ago)

You are a human, you are allowed to create derivative works under the law. Copyright law as it relates to machines regurgitating what humans have created is fundamentally different. Future legislation will have to address a lot of the nuance of this issue.

[-] GyozaPower@discuss.tchncs.de 18 points 1 year ago

Let's not pretend that LLMs are like people where you'd read a bunch of books and draw inspiration from them. An LLM does not think nor does it have an actual creative process like we do. It should still be a breach of copyright.

[-] efstajas@lemmy.world 19 points 1 year ago

... you're getting into philosophical territory here. The plain fact is that LLMs generate cohesive text that is original and doesn't occur in their training sets, and it's very hard if not impossible to get them to quote back copyrighted source material to you verbatim. Whether you want to call that "creativity" or not is up to you, but it certainly seems to disqualify the notion that LLMs commit copyright infringement.

[-] Blapoo@lemmy.ml 96 points 1 year ago

We have to distinguish between LLMs

  • Trained on copyrighted material and
  • Outputting copyrighted material

They are not one and the same

[-] Even_Adder@lemmy.dbzer0.com 34 points 1 year ago

Yeah, this headline is trying to make it seem like training on copyrighted material is or should be wrong.

[-] scv@discuss.online 27 points 1 year ago

Legally the output of the training could be considered a derived work. We treat brains differently here, that's all.

I think the current intellectual property system makes no sense and AI is revealing that fact.

[-] TropicalDingdong@lemmy.world 14 points 1 year ago

I think this brings up broader questions about the currently quite extreme interpretation of copyright. Personally I don't think it's wrong to sample from or create derivative works from something that is accessible. If it's not behind lock and key, it's free to use. If you have a problem with that, then put it behind lock and key. No one is forcing you to share your art with the world.

[-] Skanky@lemmy.world 69 points 1 year ago

Vanilla Ice had it right all along. Nobody gives a shit about copyright until big money is involved.

[-] Sentau@lemmy.one 44 points 1 year ago* (last edited 1 year ago)

I think a lot of people are not getting it. AI/LLMs can train on whatever they want, but when these LLMs are used for commercial reasons to make money, an argument can be made that the copyrighted material has been used in a money-making endeavour. Similar to how using copyrighted clips in a monetized video can get you a strike against your channel, but if the video is not monetized, the chances of YouTube taking action against you are lower.

Edit - If this was an open source model available for use by the general public at no cost, I would be far less bothered by claims of copyright infringement by the model

[-] Tyler_Zoro@ttrpg.network 29 points 1 year ago

AI/LLMs can train on whatever they want, but when these LLMs are used for commercial reasons to make money, an argument can be made that the copyrighted material has been used in a money-making endeavour.

And does this apply equally to all artists who have seen any of my work? Can I start charging all artists born after 1990, for training their neural networks on my work?

Learning is not and has never been considered a financial transaction.

[-] FMT99@lemmy.world 18 points 1 year ago

But wouldn't this training and the subsequent output be so transformative that being based on the copyrighted work makes no difference? If I read a Harry Potter book and then write a story about a boy wizard who becomes a great hero, anyone trying to copyright strike that would be laughed at.

[-] rosenjcb@lemmy.world 43 points 1 year ago* (last edited 1 year ago)

The powers that be have done a great job convincing the layperson that copyright is about protecting artists and not publishers. It's historically inaccurate and you can discover that copyright law was pushed by publishers who did not want authors keeping second hand manuscripts of works they sold to publishing companies.

Additional reading: https://en.m.wikipedia.org/wiki/Statute_of_Anne

[-] uzay@infosec.pub 42 points 1 year ago

I hope OpenAI and JK Rowling take each other down

[-] paraphrand@lemmy.world 39 points 1 year ago

Why are people defending a massive corporation that admits it is attempting to create something that will give them unparalleled power if they are successful?

[-] bamboo@lemm.ee 28 points 1 year ago

Mostly because fuck corporations trying to milk their copyright. I have no particular love for OpenAI (though I do like their product), but I do have great disdain for already-successful corporations that would hold back the progress of humanity because they didn't get paid (again).

[-] Stinkywinks@lemmy.world 15 points 1 year ago

Because everyone learns from books, it's stupid.

[-] otherbastard@lemm.ee 19 points 1 year ago

An LLM is not a person, it is a product. It doesn't matter that it "learns" like a human - at the end of the day, it is a product created by a corporation that used other people's work, with the capacity to disrupt the market that those folks' work competes in.

[-] Technoguyfication@lemmy.ml 32 points 1 year ago

People are acting like ChatGPT is storing the entire Harry Potter series in its neural net somewhere. It’s not storing or reproducing text in a 1:1 manner from the original material. Certain material, like very popular books, has likely been interpreted tens of thousands of times due to how many times it was reposted online (and therefore how many times it appeared in the training data).

Just because it can recite certain passages almost perfectly doesn’t mean it’s redistributing copyrighted books. How many quotes do you know perfectly from books you’ve read before? I would guess quite a few. LLMs are doing the same thing, but on mega steroids with a nearly limitless capacity for information retention.

[-] abbotsbury@lemmy.world 17 points 1 year ago

but on mega steroids with a nearly limitless capacity for information retention.

That sounds like redistributing copyrighted books

[-] hup@lemmy.world 15 points 1 year ago* (last edited 1 year ago)

Nope, people are just acting like ChatGPT is making commercial use of the content. Knowing a quote from a book isn't copyright infringement. Selling that quote is. Also, it doesn't need to be content stored 1:1 somewhere to be infringement. That misses the point. If you're making money off a synopsis you wrote based on imperfect memory and in your own words, it's still copyright infringement until you sign a licensing agreement with JK. Even transforming what you read into a different medium like a painting or poetry can infringe the original author's copyrights.

Now mull that over and tell us what you think about modern copyright laws.

[-] uriel238 30 points 1 year ago* (last edited 1 year ago)

Training AI on copyrighted material is no more illegal or unethical than training human beings on copyrighted material (from library books or borrowed books, nonetheless!). And trying to challenge the veracity of generative AI systems on the notion that it was trained on copyrighted material only raises the specter that IP law has lost its validity as a public good.

The only valid concern about generative AI is that it could displace human workers (or swap out skilled jobs for menial ones) which is a problem because our society recognizes the value of human beings only in their capacity to provide a compensation-worthy service to people with money.

The problem is this is a shitty, unethical way to determine who gets to survive and who doesn't. All the current controversy about generative AI does is kick this can down the road a bit. But we're going to have to address soon that our monied elites will be glad to dispose of the rest of us as soon as they can.

Also, amateur creators are as good as professionals, given the same resources. Maybe we should look at creating content by other means than for-profit companies.

[-] RadialMonster@lemmy.world 24 points 1 year ago

What if they scraped a whole lot of the internet, and those excerpts were in random blogs and posts and quotes and memes etc. all over the place? They didn't ingest the material directly, or knowingly.

[-] Default_Defect@midwest.social 24 points 1 year ago

They made it read Harry Potter? No wonder it's gonna kill us all one day.

[-] scarabic@lemmy.world 23 points 1 year ago

One of the first things I ever did with ChatGPT was ask it to write some Harry Potter fan fiction. It wrote a short story about Ron and Harry getting into trouble. I never said the word McGonagall and yet she appeared in the story.

So yeah, case closed. They are full of shit.

[-] PraiseTheSoup@lemm.ee 36 points 1 year ago

There is enough non-copyrighted Harry Potter fan fiction out there that it would not need to be trained on the actual books to know all the characters. While I agree they are full of shit, your anecdote proves nothing.

[-] Gnubyte@lemdit.com 23 points 1 year ago* (last edited 1 year ago)

Our ancient legal system trying to lend itself to "protecting authors" is fucking absurd. AI is the future. Are we really going to let everyone take a shot suing these guys over this crap? It's a useful program and infrastructure for everyone.

Holding technology back for antiquated copyright law is downright absurd.

Edit: I want to add that I'm not suggesting copyright should be a free-for-all on your books or hard work, but rather that this is a computer program and a major breakthrough, and in the same way that no one sues my brain for consumption if I read a book, I don't think we should sue an AI: it is not reproducing books, in the same manner that many footnotes websites about books do not reproduce a book by summarizing its content. The caveat is that this holds only as long as OpenAI does not have an event where their reputation has to be re-evaluated (i.e. this is subject to change if they start trying to reproduce books).

[-] scarabic@lemmy.world 18 points 1 year ago

Stop comparing AI to a person. It’s not a person, it doesn’t do the things a person does, and it doesn’t have the rights of a person.

And yes the laws are antiquated. We need new laws that will protect authors.

Finally, no, you can’t just throw out all other considerations because you think AI is useful.

[-] GeneralEmergency@lemmy.world 22 points 1 year ago

So that explains the "problematic" responses.

[-] Tetsuo@jlai.lu 20 points 1 year ago* (last edited 1 year ago)

If I'm not mistaken AI work was just recently considered as NOT copyrightable.

So I find it interesting that an AI learning from copyrighted work is an issue even though what will be generated will NOT be copyrightable.

So even if you generated some copy of Harry Potter you would not be able to copyright it. So in no way could you really compete with the original art.

I'm not saying that it makes it ok to train AIs on copyrighted art but I think it's still an interesting aspect of this topic.

As others probably have stated, the AI may be creating content that is transformative and therefore under fair use. But even if that work is transformative it cannot be copyrighted because it wasn't created by a human.

[-] ClamDrinker@lemmy.world 20 points 1 year ago* (last edited 1 year ago)

This is just OpenAI covering their ass by attempting to block the most egregious and obvious outputs in legal gray areas, something they've been doing for a while, hence why their AI models are known to be massively censored. I wouldn't call that 'hiding'. It's kind of hard to hide it was trained on copyrighted material, since that's common knowledge, really.

[-] LordShrek@lemmy.world 19 points 1 year ago

are we no longer allowed to borrow books from friends?

[-] benni@lemmy.world 16 points 1 year ago

Yeah, but if you wanna act out the contents of the book and sell it as a movie, you need to buy the rights.

[-] Jat620DH27@lemmy.world 19 points 1 year ago

I thought everyone knew that OpenAI has the same access to books and knowledge that human beings have.

[-] Redditiscancer789@lemmy.world 17 points 1 year ago

Yes, but it's what it is doing with it that is the murky grey area. Anyone can read a book, but you can't use those books for your own commercial stuff. Rowling and other writers are making the case their works are being used in an inappropriate way commercially. Whether they have a case, I dunno, IANAL, but I could see the argument at least.

[-] afraid_of_zombies@lemmy.world 16 points 1 year ago

I am sure they have patched it by now, but at one point I was able to get ChatGPT to give me copyrighted text from books by asking for ever larger quotations. It seemed more willing to do this with books out of print.

this post was submitted on 22 Aug 2023
795 points (100.0% liked)
