We Asked A.I. to Create the Joker. It Generated a Copyrighted Image. (www.nytimes.com)

submitted 2 years ago by L4s@lemmy.world to c/technology@lemmy.world

Artists and researchers are exposing copyrighted material hidden within A.I. tools, raising fresh legal questions.

[-] dragontamer@lemmy.world 80 points 2 years ago* (last edited 2 years ago)

Because this proves that the "AI", at some level, is storing the data of the Joker movie screenshot somewhere inside of its training set.

Likely because the "AI" was trained upon this image at some point. This has repercussions with regards to copyright law. It means the training set contains copyrighted data and the use of said training set could be argued as piracy.

Legal discussions on how to talk about generative AI are only happening now that people can experiment with the technology. But it's not like our laws have changed; copyright infringement is copyright infringement. If the training data is obviously copyright infringement, then the data must be retrained in a more appropriate manner.

[-] abhibeckert@lemmy.world 38 points 2 years ago* (last edited 2 years ago)

But where is the infringement?

This NYT article includes the same several copyrighted images and they surely haven't paid any license. It's obviously fair use in both cases and NYT's claim that "it might not be fair use" is just ridiculous.

Worse, the NYT also includes exact copies of the images, while the AI ones are just very close to the original. That's like the difference between uploading a video of yourself playing a Taylor Swift cover and actually uploading one of Taylor Swift's own music videos to YouTube.

Even worse the NYT intentionally distributed the copyrighted images, while Midjourney did so unintentionally and specifically states it's a breach of their terms of service. Your account might be banned if you're caught using these prompts.

[-] jacksilver@lemmy.world 28 points 2 years ago

You do realize that newspapers typically do pay licensing fees for images? It's how things like Getty Images exist.

On the flip side, OpenAI (and other companies) are charging people for access to their models, which then return copyrighted images without paying the original creator.

That's why situations like this keep getting talked about: you have a third party charging people for copyrighted materials. We can argue that it's a tool, so you aren't really "selling" copyrighted data, but that's the issue that is generally being discussed in these kinds of articles and court cases.

[-] ApollosArrow@lemmy.world 3 points 2 years ago

Mostly playing devil’s advocate here (since I don’t think AI should be used commercially), but I’m actually curious about this, since I work in media… You can get away with using images or footage for free if it falls under editorial or educational purposes. I know this can vary from place to place, but with a lot of online news sites now charging people to view their content, they could potentially be seen as making money off of copyrighted material, couldn’t they?

[-] jacksilver@lemmy.world 3 points 2 years ago

It's not a topic that I'm super well versed in, but here is a thread from a photography forum (https://www.dpreview.com/forums/thread/4183940) indicating that news organizations can't take advantage of fair use.

I think these kinds of stringent rules are why so many are up in arms about how AI is being used. It's effectively a way for big players to circumvent paying the people who put all the work into the art/music/voice acting/etc. The models would be nothing without the copyrighted material, yet no one seems to want to pay those people.

It gets more interesting when you realize that long term we still need people creating lots of content if we want these models to be able to create things around concepts that don't yet exist (new characters, genres of music, etc.)

[-] dragontamer@lemmy.world 6 points 2 years ago

But where is the infringement?

Do Training weights have the data? Are the servers copying said data on a mass scale, in a way that the original copyrighters don't want or can't control?

[-] orclev@lemmy.world 12 points 2 years ago

Data is not copyrighted, only the image is. Furthermore, you cannot copyright a number, even though you could use a sufficiently large number to completely represent a specific image. There's also the fact that copyright does not protect possession of works, only distribution of them. If I obtained a copyrighted work, no matter the means chosen to do so, I've committed no crime so long as I don't duplicate that work. This gets into a legal grey area around computers and the fundamental way they work, but it was already kind of fuzzy if you really think about it anyway. Does viewing a copyrighted image violate copyright? The visual data of that image has been copied into your brain. You have the memory of that image. If you have the talent you could even reproduce that copyrighted work, so clearly a copy of it exists in your brain.
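
(Editor's illustration, not part of the original comment.) The remark that a sufficiently large number can completely represent an image can be sketched in a few lines of Python. The four pixel values below are made up for the example; any real image file is simply a much longer byte string handled the same way.

```python
# Hypothetical 2x2 grayscale "image": four made-up pixel values (0-255).
pixels = bytes([12, 200, 7, 255])

# Interpret the raw bytes as a single big-endian integer.
as_number = int.from_bytes(pixels, "big")

# Given only the integer and the byte length, the original bytes come back.
restored = as_number.to_bytes(len(pixels), "big")

print(as_number)
print(restored == pixels)
```

The round trip is lossless, which is the sense in which the image "is" a number. Whether that fact carries any legal weight is exactly what the thread is arguing about.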

[-] dragontamer@lemmy.world 5 points 2 years ago* (last edited 2 years ago)

only distribution of them.

Yeah. And the hard drives and networks that pass Midjourney's network weights around?

That's distribution. Did Midjourney obtain a license from the artists to allow large numbers of "Joker" copyrighted data to be copied on a ton of servers in their data-center so that Midjourney can run? They're clearly letting the public use this data.

[-] orclev@lemmy.world 6 points 2 years ago

Because they're not copying around images of Joker, they're copying around a work derived from many many things including images of Joker. Copying a derived work does not violate the copyright of the work it was derived from. The wrinkle in this case is that you can extract something very similar to the original works back out of the derived work after the fact. It would be like if you could bake a cake, pass it around, and then down the line pull a whole egg back out of it. Maybe not the exact egg you started with, but one very similar to it. This is a situation completely unlike anything that's come before it which is why it's not actually covered by copyright. New laws will need to be drafted (or at a bare minimum legal judgements made) to decide how exactly this situation should be handled.

[-] archomrade@midwest.social 3 points 2 years ago

Someone already downvoted you but this is exactly the topic of debate surrounding this issue.

Other recognized fair-use exemptions have similar interpretations: a computer model analyzes a large corpus of copyrighted work for the purposes of being able to search their contents and retrieve relevant snippets and works based on semantic and abstract similarities. The computer model that is the representation of those works for that purpose is fair use: it contains only factual information about those works. It doesn't matter if the works used for that model were unlicensed: the model is considered fair use.

AI models operate by a very similar method, albeit one with a lot more complexity. But the model doesn't contain copyrighted works, it is only itself a collection of factual information about the copyrighted works. The novel part of this case is that it can be used to re-construct expressions very similar to the original (it should be pointed out that the fidelity is often very low, and the more detailed the output the less like the original it becomes). It isn't settled yet if that fact changes this interpretation, but regardless I think copyright is already not the right avenue to pursue, if the goal is to remediate or prevent harm to creators and encourage novel expressions.

[-] abhibeckert@lemmy.world 3 points 2 years ago

Do Training weights have the data?

The answer to that question is extensively documented by thousands of research papers - it's not up for debate.

[-] Mirodir@discuss.tchncs.de 4 points 2 years ago

If someone wants to read one of those papers, I can recommend Extracting Training Data from Diffusion Models. It shouldn't be too hard for someone with little experience in the field to be able to follow along.

[-] GenderNeutralBro@lemmy.sdf.org 16 points 2 years ago

Because this proves that the “AI”, at some level, is storing the data of the Joker movie

I don't think that's a justified conclusion.

If I watched a movie, and you asked me to reproduce a simple scene from it, then I could do that if I remembered the character design, angle, framing, etc. None of this would require storing the image, only remembering the visual meaning of it and how to represent that with the tools at my disposal.

If I reproduced it that closely (or even not-nearly-that-closely), then yes, my work would be considered a copyright violation. I would not be able to publish and profit off of it. But that's on me, not on whoever made the tools I used. The violation is in the result, not the tools.

The problem with these claims is that they are shifting the responsibility for copyright violation off of the people creating the art, and onto the people making the tools used to create the art. I could make the same image in Photoshop; are they going after Adobe, too? Of course not. You can make copyright-violating work in any medium, with any tools. Midjourney is a tool with enough flexibility to create almost any image you can imagine, just like Photoshop.

Does it really matter if it takes a few minutes instead of hours?

[-] rambaroo@lemmy.world 3 points 2 years ago* (last edited 2 years ago)

AIs are not humans, my dude. I don't know why people keep using this argument. They specifically designed this thing to scrape copyrighted material; it's not like an artist who was just inspired by something.

[-] archomrade@midwest.social 4 points 2 years ago

It isn't human, but that IS how it works.

It's analyzing material and extracting data about it, not compiling the data itself. In much the same way that TDM (text and data mining) analyzes text and extracts information about it for the purposes of search, classification, sentiment analysis, etc., an "AI" model analyzes material and extracts information on how to construct new language or visual media that relates to text prompts.

It's important to understand this because it's core to the fair use defence being claimed. The models are derived from copyrighted works, but they aren't themselves infringing. There is precedent for similar cases being fair use.

[-] GenderNeutralBro@lemmy.sdf.org 3 points 2 years ago

Photoshop is not human. AutoTune is not human. Cameras are not human. Microphones are not human. Paintbrushes are not human. Etc.

AI did not create this. A HUMAN created this with AI. The human is responsible for creating it. The human is responsible for publishing it.

Please stop anthropomorphizing AI!

[-] archomrade@midwest.social 14 points 2 years ago

I've had this discussion before, but that's not how copyright exceptions work.

Right or wrong (it hasn't been litigated yet), AI models are being claimed as fair use exceptions to the use of copyrighted material. Similar to other fair uses, the argument goes something like:

"The AI model is simply a digital representation of facts gleaned from the analysis of copyrighted works, and since factual data cannot be copyrighted (e.g. a description of the Mona Lisa vs the painting itself), the model itself is fair use"

I think it'll boil down to whether the models can be easily used as replacements to the works being claimed, and honestly I think that'll fail. That the models are quite good at reconstructing common expressions of copyrighted work is novel to the case law, though, and worthy of investigation.

But as someone who thinks ownership of expressions is bullshit anyway, I tend to think copyright is not the right way to go about penalizing or preventing the harm caused by the technology.

[-] kromem@lemmy.world 6 points 2 years ago

Training should not and I suspect will not be found to be infringement. If old news articles from the NYT can teach a model language in ways that help it review medical literature to come up with novel approaches to cure cancer, there's a whole host of features from public good to transformational use going on.

What they should be throwing resources at is policing usage not training. Make the case that OpenAI is liable for infringing generation. Ensure that there needs to be copyright checking on outputs. In many ways this feels like a repeat of IP criticisms around the time Google acquired YouTube which were solved with an IP tagging system.

[-] freeman@sh.itjust.works 5 points 2 years ago

Should Photoshop check your image for copyright infringement? Should Adobe be liable for copyright-infringing or offensive images that users of its program create?

[-] ryathal@sh.itjust.works 4 points 2 years ago

There's no money for them in that angle though. It's much easier to sue Xerox for enabling copyright violations than the person who used the machine to violate copyright.

Courts have already handled this with copy machines. AI isn't terribly different, so it's unlikely these suits against model creators will succeed.

[-] Jilanico@lemmy.world 11 points 2 years ago

Because this proves that the “AI”, at some level, is storing the data of the Joker movie screenshot somewhere inside of its training set.

Is it tho? Honest question.

[-] QubaXR@lemmy.world 6 points 2 years ago* (last edited 2 years ago)

Yes it is. Honest answer.

[-] Jilanico@lemmy.world 4 points 2 years ago

So Stable Diffusion, Midjourney, etc., all have massive databases with every picture on the Internet stored in them? I know the AI models are trained on lots of images, but are the images actually stored? I'm skeptical, but I'm no expert.

[-] QubaXR@lemmy.world 6 points 2 years ago

These models were trained on datasets that used authors' work as training material without compensating them. It's not every picture on the net, but a lot of it comes from scraping websites, portfolios and social networks wholesale.

A similar situation happens with large language models. Recently Meta admitted to using illegally pirated books (Books3 database to be precise) to train their LLM without any plans to compensate the authors, or even as much as paying for a single copy of each book used.

[-] Jilanico@lemmy.world 5 points 2 years ago

Most of the stuff that inspires me probably wasn't paid for. I just randomly saw it online or on the street, much like an AI.

AI using straight up pirated content does give me pause tho.

[-] QubaXR@lemmy.world 4 points 2 years ago* (last edited 2 years ago)

I was on the same page as you for the longest time. I cringed at the whole "No AI" movement and artists' protest. I used the very same idea: Generations of artists honed their skills by observing the masters, copying their techniques and only then developing their own unique style. Why should AI be any different? Surely AI will not just copy works wholesale and instead learn color, composition, texture and other aspects of various works to find its own identity.

It was only when my very own prompts started producing results that I recognized as "homages" at best and "rip-offs" at worst that I was given pause.

I suspect that earlier generations of text to image models had better moderation of training data. As the arms race heated up and pace of development picked up, companies running these services started rapidly incorporating whatever training data they could get their hands on, ethics, copyright or artists' rights be damned.

I remember when MidJourney introduced Niji (their anime model) and I could often identify the mangas and characters used to train it. The imagery Niji produced kept certain distinct and unique elements of character designs from that training data - as a result a lot of characters exhibited "Chainsaw Man" pointy teeth and sticking out tongue - without as much as a mention of the source material or even the themes.

[-] topinambour_rex@lemmy.world 3 points 2 years ago

How much profit do you make from this stuff?

[-] dragontamer@lemmy.world 4 points 2 years ago

How did the Joker image get replicated?

[-] Jilanico@lemmy.world 4 points 2 years ago

It's too hard to type up how generative AIs work, but look up a video on "how stable diffusion works" or something like that. I seriously doubt they have a massive database with every image from the Internet inside it, with the AI just spitting those pics out, but I'm no expert.

[-] ryannathans@aussie.zone 3 points 2 years ago

Sure, but so is your memory, you could study the originals and re-draw them a similar way.

[-] Jilanico@lemmy.world 4 points 2 years ago

I agree, but I don't think these generative AIs actually store image files off the Internet in a massive database. I could be wrong.

[-] ryannathans@aussie.zone 5 points 2 years ago* (last edited 2 years ago)

That's correct. The structure of the information isn't remotely similar to a file or database. Information isn't stored pixel by pixel; the model more loosely remembers correlations, similarities and facts about the content, as opposed to storing and copying it.

[-] CyberSeeker@discuss.tchncs.de 9 points 2 years ago

So let’s say I ask a talented human artist the same thing.

Doesn’t this prove that a human, at some level, is storing the data of the Joker movie screenshot somewhere inside of their memory?

[-] dragontamer@lemmy.world 9 points 2 years ago* (last edited 2 years ago)

So let’s say I ask a talented human artist the same thing.

Artists don't have hard drives or solid state drives that accept training weights.

When you have a hard drive (or other object that easily creates copies), then the law that follows is copyright, with regards to the use and regulation of those copies. It doesn't matter if you use a Xerox machine, VHS tape copies, or a Hard Drive. All that matters is that you're easily copying data from one location to another.

And yes. When a human recreates a copy of a scene clearly inspired by copyrighted data, it's copyright infringement btw. Even if you recreate it from memory. It doesn't matter how I draw Pikachu: if everyone knows and recognizes it as Pikachu, I'm infringing upon Nintendo's copyright (and probably their trademark as well).

[-] Auli@lemmy.ca 3 points 2 years ago

Nope, humans don't store data perfectly with perfect recall.

[-] lolcatnip@reddthat.com 7 points 2 years ago

Neither do neural networks.

[-] abhibeckert@lemmy.world 4 points 2 years ago* (last edited 2 years ago)

Humans can get pretty close to perfect recall with enough practice - show a human that exact Joker image hundreds of thousands of times, and they're going to be able to remember every detail.

That's what happened here - the example images weren't just in the training set once, they are in the training set over and over and over again across hundreds of thousands of websites.

If someone wants these images, they're not going to use AI to access them - they'll just do a Google image search. There is no way Warner Brothers is harmed in any way by this, which is a strong fair use defence.

[-] Jilanico@lemmy.world 3 points 2 years ago

Some do. Should we jail all the talented artists with photographic memories?

[-] topinambour_rex@lemmy.world 3 points 2 years ago

If they exactly reproduce others' work, and gain a profit from it, a fine would be the minimum.

[-] orclev@lemmy.world 7 points 2 years ago

If the training data is obviously copyright infringement, then the data must be retrained in a more appropriate manner.

This is the crux of the issue, it isn't obviously copyright infringement. Currently copyright is completely silent on the matter one way or another.

The thing that makes this particularly interesting is that the traditional copyright maximalists, the ones responsible for ballooning copyright durations from its original reasonable limit of 14 years (plus one renewal) to its current absurd duration of 95 years, also stand to benefit greatly from generative works. Instead of the usual full court press we tend to see from the major corporations around anything copyright related we're instead seeing them take a rather hands off approach.

[-] dragontamer@lemmy.world 4 points 2 years ago

This is the crux of the issue, it isn’t obviously copyright infringement. Currently copyright is completely silent on the matter one way or another.

It's clear that the training weights have the data on recreating this Joker scene. It's also clear that if the training data didn't contain this image, then the copy of the image would never have ended up in the weights that have been copy/pasted everywhere.

[-] orclev@lemmy.world 6 points 2 years ago* (last edited 2 years ago)

Except it isn't a perfect copy. It's very similar, but not exact. Additionally, for every example you can find where it spits out a nearly identical image, you can also find one where it produces nothing like it. To complicate things further, you can get images generated that very closely match other copyrighted works which the model was never trained on. Does that mean copying the model violates the copyright of a work that it literally couldn't have included in its data?

You're making a lot of assumptions and arguments that copyright covers things that it very much does not cover or at a minimum that it hasn't (yet) been ruled to cover.

Legally, as things currently stand, an AI model trained on a copyrighted work is not a copy of that work as far as copyright is concerned. That's today's legal reality. That might change in the future, but that's far from certain, and is a far more nuanced and complicated problem than you're making it out to be.

Any legal decision that ruled an AI model is a copy of all the works used to train it would also likely have very far-reaching and complicated ramifications. That's why this needs to be argued out in court, but until then what Midjourney is doing is perfectly legal.

[-] orclev@lemmy.world 7 points 2 years ago

Wasn't that known? Has Midjourney ever claimed they didn't use copyrighted works? There's also an ongoing argument about the legality of that in general. One recent court case ruled that copyright does not protect a work from being used to train an AI. I'm sure that's far from the final word on the topic, but it does mean this is a legal grey area at the moment.

[-] dragontamer@lemmy.world 5 points 2 years ago* (last edited 2 years ago)

If it is known, then it is copyright infringement to download the training sets and therefore a crime to do so. You cannot reproduce a copy of the works without the express permission of the copyright holder.

How many computers did Midjourney copy its training weights to? Has Midjourney (and the IT team behind it) paid royalties for every copyrighted image in its training set to have a proper copyright license to copy all of this data from computer to computer?

I'm guessing no. Which means the Midjourney team (if what you say is true) is committing copyright infringement every time they spin up a new server with these weights.

Pro-AI side will obviously argue that the training weights do not contain the data of these copyrighted works. A claim that is looking more-and-more laughable as these experiments happen.

[-] db0@lemmy.dbzer0.com 7 points 2 years ago

No, it's not illegal to download publicly available content; it's a copyright violation to republish it.

this post was submitted on 26 Jan 2024