Google says AI systems should be able to mine publishers’ work unless companies opt out, turning copyright law on its head (www.theguardian.com)

submitted 1 year ago by 0x815@feddit.de to c/technology@beehaw.org

204 comments

In its submission to the Australian government’s review of the regulatory framework around AI, Google said that copyright law should be altered to allow for generative AI systems to scrape the internet.

[-] FaceDeer@kbin.social 57 points 1 year ago

Copyright law already allows generative AI systems to scrape the internet. You need to change the law to forbid something; it isn't forbidden by default. Currently, if something is published publicly then it can be read and learned from by anyone (or anything) that can see it. Copyright law only prevents making copies of it, which a large language model does not do when trained on it.

[-] maynarkh@feddit.nl 35 points 1 year ago

A lot of licensing prevents or constrains creating derivative works and monetizing them. The question is for example if you train an AI on GPL code, does the output of the model constitute a derivative work?

If yes, GitHub Copilot is illegal as it produces code that should comply with multiple conflicting license requirements. If no, I can write some simple AI that is "trained" to regurgitate its output on a prompt, and run a leaked copy of Windows through it, then go around selling Binbows and MSFT can't do anything about it.

The truth is mostly between the two. This is just piracy, which has always been a gray area because of the difficulty of prosecuting it: previously because the perpetrators were many and hard to find, now because the perpetrators are billion-dollar companies with expensive legal teams.

[-] FaceDeer@kbin.social 21 points 1 year ago

The question is for example if you train an AI on GPL code, does the output of the model constitute a derivative work?

This question is completely independent of whether the code was generated by an AI or a human. You compare code A with code B, and if the judge and jury agree that code A is a derivative work of code B then you win the case. If the two bodies of work don't have sufficient similarities then they aren't derivative.

If no, I can write some simple AI that is “trained” to regurgitate its output on a prompt

You've reinvented copy-and-paste, not an "AI." AIs are deliberately designed to not copy-and-paste. What would be the point of one that did? Nobody wants that.

Filtering the code through something you call an AI isn't going to have any impact on whether you get sued. If the resulting code looks like copyrighted code, then you're in trouble. If it doesn't look like copyrighted code then you're fine.

[-] maynarkh@feddit.nl 11 points 1 year ago

AIs are deliberately designed to not copy-and-paste.

AI is a marketing term, not a technical one. You can call anything "AI", but it's usually predictive models that get called that.

AIs are deliberately designed to not copy-and-paste. What would be the point of one that did? Nobody wants that.

For example, if the powers that be decided to say licenses don't apply once you feed material through an "AI", and failed to define AI, you could say you wrote this awesome OS using an AI that you trained exclusively using Microsoft proprietary code. Their licenses and copyright and stuff don't apply to AI training data, so you could sell that new code your AI just created.

It doesn't even have to be 100% identical to Windows source code. What if it's just 80%? 50%? 20%? 5%? Where is the bar where the author can claim "that's my code!"?

Just to compare, the guys who set out to reimplement Win32 APIs for use in Linux (the thing that made it into MacOS as well now) deliberately would not accept help from anyone who ever saw any Microsoft source code for fear of being sued. The bar was that high when it was a small FOSS organization doing it. It was 0%, proven beyond a doubt.

Now that Microsoft is the author, it's not a problem when GitHub Copilot spits out GPL code word for word, ironically together with its license.

[-] FaceDeer@kbin.social 7 points 1 year ago

AI is a marketing term, not a technical one.

The reverse, actually. Artificial intelligence is a field of research that includes things like machine learning, as well as lots of even more mundane applications. It's pop culture that has hijacked it to mean "a thing exactly as capable as a human brain, but in computer form."

For example if the powers that be decided to say licenses don’t apply once you feed material through an “AI”, and failed to define AI, you could say you wrote this awesome OS using an AI that you trained exclusively using Microsoft proprietary code.

Once again, it doesn't matter what you "feed code through." Copyright applies to the tangible result. If the output from the AI matches closely to something that's already copyrighted then that copyright applies to it. If it doesn't match closely then that copyright doesn't apply to it. The actual process by which the code was produced doesn't matter one whit. If I took a Harry Potter book, put its pages through a shredder, randomly glued the particles of paper back together and it just so happened to closely replicate Lord of the Rings then the Tolkien estate has a case against me but the Rowling estate does not.

[-] nous@programming.dev 9 points 1 year ago

If the resulting code looks like copyrighted code, then you’re in trouble. If it doesn’t look like copyrighted code then you’re fine.

^^ Very much this.

Loads of people are treating the process of AI creating works as either violating copyright or not. But that is not how copyright works. It applies to the output of a process, not the process itself. If someone ends up writing something that happens to be a copy of something they read before - that is a violation of copyright law. If someone uses various works and creates something new and unique then that is not a violation. It does not - at this point in time at least - matter if that someone is a real person or an AI.

AI can both violate copyright on one work and not on another. Each case is independent and would need to be litigated separately. But AI can produce so much content so quickly that it creates a real problem for a case-by-case analysis of copyright infringement. So it is quite likely the laws will need to change to account for this and will likely need to treat AI works differently from human-created works. Which is a very hard thing to actually deal with.

Now, one could also argue the model itself is a violation of copyright. But that IMO is a stretch - a model is nothing like the original work and the copyright law also does not cover this case. It would need to be taken to court to really decide on if this is allowed or not.

Personally I don't think the conversation should be about what the laws currently allow - they were not designed for this - but instead about what the laws should allow, so we can steer the conversation towards a better future. Lots of artists are expressing their distaste for AI models being trained on their works - if enough people do this, laws can be crafted to back up this view.

[-] AbsolutelyNotABot@feddit.it 6 points 1 year ago

then go around selling Binbows and MSFT can't do anything about it

I think this already happens. A very practical example: the Windows GUI has been copied by many Linux distros. And with Windows 11 there's clearly a reference to Apple's macOS GUI with a sprinkling of Google's Material Design.

Should Apple and Google be able to sue Microsoft because it "copied" their work? Should Google be able to sue Apple because they "copied" the notification drop-down in iOS?

As you say, it's really a grey area: the only reason we consider AI code "regurgitated" and human code "inspired" is that we give humans more recognition for their intellectual abilities.

[+] Boinketh@lemm.ee 24 points 1 year ago* (last edited 1 year ago)

[deleted]

[-] Niello@kbin.social 7 points 1 year ago

Exactly this right here.

[-] nous@programming.dev 4 points 1 year ago

Someone getting sued does not mean they are wrong or that they lost the case. Each case needs to look at the works in question and decide if that particular case violates copyright. Lots of things are taken into account here, and even if small elements might have been used or be similar, that does not automatically win the case.

There is also a difference between some implementation and the overall feature in question. For instance, APIs are not copyrightable, nor are chords in music, nor what something does overall. Only specific implementations are copyrightable.

The same can apply to AI - if it generates a work that would violate copyright had a human made it, then it does - if not, then it does not. But AI shows a different problem: that of scale. There is only a limited amount of work that a human can do. But an AI can produce vastly more content - enough that a case-by-case evaluation of infringement might not be viable. And if that becomes the case then AI works might need to be treated differently from human-created works - or maybe the rules need to cover how the models are created and how they can use copyrighted works. The current laws were never designed with the speed at which AI can work in mind.

[+] Boinketh@lemm.ee 3 points 1 year ago* (last edited 1 year ago)

[deleted]

[-] nous@programming.dev 3 points 1 year ago

What do you mean by infringement already? Do you mean it automatically infringes copyright for all its output just because it might create something similar to a copyrighted work? Or do you mean that if it does create a copyrighted work, that work is infringing on a copyright? Your wording is vague here.

can be shown to be capable of reproducing something close enough to said material

I don't think it is a good benchmark for forbidding AI generation of content. If you create a random image generator that has no inputs and is truly random then it is capable of generating something similar to copyrighted work - by pure chance. Even if that chance is very low you could generate enough images and show it can create something similar to copyrighted works.

What happens if you create one that is trained only on public domain images or works properly licensed? Its output is still partially random and could still generate an image similar to some other copyrighted work outside of its training set by pure chance.

I would argue that both of these should be allowed. They are not doing anything obviously wrong even if they could be used to generate copyrighted works. Just like you could use photoshop - or a paint brush to create copyrighted work.

But then, what if you take some other AI that is trained on all sorts of data, copyrighted or not, and feed its output through a checker that compares it to the training set (and maybe more copyrighted content) and rejects/regenerates work until it is known not to infringe on copyrighted work? That would make the chances of it ever producing a copyrighted work far lower than with the above programs. Should that be allowed? It is using copyrighted work much like an artist would, and you could argue that any copyrighted work it does produce was by pure accident, as there are intentional steps to mitigate that.
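A rough sketch of that generate-then-check loop, just to make the idea concrete. Everything here is an assumption for illustration: `generate` stands in for whatever model you have, the works are plain strings, and "too similar" is a simple character-level ratio from Python's `difflib` with an arbitrary 0.8 threshold. A real checker would need a far better similarity measure, and passing this check says nothing about what a court would decide.

```python
from difflib import SequenceMatcher


def too_similar(candidate, protected_works, threshold=0.8):
    """Return True if the candidate closely matches any protected work.

    The 0.8 threshold is an arbitrary illustrative value, not a legal test.
    """
    return any(
        SequenceMatcher(None, candidate, work).ratio() >= threshold
        for work in protected_works
    )


def generate_checked(generate, protected_works, max_attempts=100, threshold=0.8):
    """Call generate() until an output passes the similarity check.

    `generate` is a hypothetical zero-argument callable standing in for a model.
    Returns None if every attempt was rejected.
    """
    for _ in range(max_attempts):
        candidate = generate()
        if not too_similar(candidate, protected_works, threshold):
            return candidate
    return None  # give up rather than return something that failed the check


if __name__ == "__main__":
    protected = ["the quick brown fox jumps over the lazy dog"]
    # A "model" that only ever reproduces the protected text is always rejected.
    print(generate_checked(lambda: protected[0], protected, max_attempts=5))
    # A "model" that produces unrelated text passes on the first attempt.
    print(generate_checked(lambda: "something new entirely", protected))
```

The point of returning `None` instead of the last candidate is that the filter should fail closed: if nothing passes the check, nothing gets published.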

If you use a paid service like Midjourney to generate copyrighted content, the company is essentially selling you access to copyrighted content they lack the rights to.

As far as I understand the laws involved, yeah, I would expect that to infringe on some copyright holder's work, and Midjourney would likely be culpable for damages. Just like hiring an artist to create some work, where they decide to copy some copyrighted work, would also make that artist culpable for damages.

And you also have to consider another side of things - if you can effectively stop AI from training on most works you will effectively stunt its usefulness. That could make all efforts in regulated nations useless, which can result in the technology just moving to places that are much more open with it and where authors of the copyrighted work will have far less control over things. IMO AI-generated content is out of the bag now and we will not get it back in. So the best we can do is ensure the right people get compensated for their works. Push too hard in the wrong direction (either way) and there is a real chance they never will.

I don't really have the solutions to many of these problems - but I do think it is worth talking about, and I don't think that outright bans (or actions leading to an effective ban) on this tech are the correct way to go.

[-] Even_Adder@lemmy.dbzer0.com 5 points 1 year ago

You should read this.

[-] lostmypasswordanew@feddit.de 11 points 1 year ago

An AI model is a derivative work of its training data and thus a copyright violation if the training data is copyrighted.

[-] BlameThePeacock@lemmy.ca 17 points 1 year ago

A human is a derivative work of its training data, thus a copyright violation if the training data is copyrighted.

The difference between a human and AI is getting much smaller all the time. The training process is essentially the same at this point: show them a bunch of examples, then have them practice and provide feedback.

If that human is trained to draw on Disney art, then goes on to create similar-style art for sale, that isn't copyright infringement. Nor should it be.

[-] Phanatik@kbin.social 15 points 1 year ago* (last edited 1 year ago)

This is stupid and I'll tell you why.
As humans, we have a perception filter. This filter is unique to every individual because it's fed by our experiences and emotions. Artists make great use of this by producing art which leverages their view of the world, it's why Van Gogh or Picasso is interesting because they had a unique view of the world that is shown through their work.
These bots do not have perception filters. They're designed to break down whatever they're trained on into numbers and decipher how the style is constructed so it can replicate it. It has no intention or purpose behind any of its decisions beyond straight replication.
You would be correct if a human's only goal was to replicate Van Gogh's style but that's not every artist. With these art bots, that's the only goal that they will ever have.

I have to repeat this every time there's a discussion on LLM or art bots:
The imitation of intelligence does not equate to actual intelligence.

[-] frog@beehaw.org 12 points 1 year ago

Absolutely agreed! I think if the proponents of AI artwork actually had any knowledge of art history, they'd understand that humans don't just iterate the same ideas over and over again. Van Gogh, Picasso, and many others, did work that was genuinely unique and not just a derivative of what had come before, because they brought more to the process than just looking at other artworks.

[-] nickwitha_k@lemmy.sdf.org 7 points 1 year ago

Yup. There seems to be a strong motive in many to not understand this concept as it makes their practices clearly ethically questionable.

[-] frog@beehaw.org 6 points 1 year ago

My feeling is that the vast majority of pro-AI techbros come from a computer science, finance, or business background; undoubtedly intelligent people, but completely and utterly lacking in any appreciation or understanding of what actually goes into creative work. I'm sure they genuinely believe that there's no difference between what a human does and what an AI does, because they think art (or writing, music, etc) are just the product of an algorithm.

[-] Phanatik@kbin.social 2 points 1 year ago

Ironically, my background is in mathematics but I also happen to be a writer so I see both sides of the argument. I just see the utter lack of compassion people have for those who produce creative work and the same people believe that if it can be automated, it should be automated.

[-] davehtaylor@beehaw.org 7 points 1 year ago

I really, really, really wish people would understand this.

AI can only create a synthesis of exactly what it's fed. It has no life experience, no emotional experience, no nurture-related experiences, no cultural experiences that color its thinking, because it isn't thinking.

The "AI are only doing what humans do" is such a brain-dead line of thinking, to the point that it almost feels like it's 100% in bad faith whenever it's brought up.

[-] BlameThePeacock@lemmy.ca 3 points 1 year ago

You're completely wrong, and I'll tell you why.

None of what you said matters: perception filters, intent, intelligence... it's all irrelevant to the discussion.

Copyright only grants certain rights, and at least here in Canada, using works to generate a model isn't one of them. Rights are for things like distribution, reproduction, public performance, communication, and exhibition. US law says you can't "Prepare derivative works based upon the work," but the model isn't a derivative work because it's not really a work at all; you can't even visually look at the model. You can't copyright an algorithm in the US or Canada.

Only the created art should be scrutinized for copyright infringement, and these systems can generate both infringing and non-infringing work (just like a human can).

Any enforcement should then be handled when that protected work is then used to infringe on the actual rights of the copyright holder.

[-] acastcandream@beehaw.org 2 points 1 year ago

this is stupid I’ll tell you why

Not sure why you think anyone would read anything if that’s how you start it.

[-] 50gp@kbin.social 10 points 1 year ago

a human does not copy previous work exactly like these algorithms, what's this shit take?

[-] BlameThePeacock@lemmy.ca 11 points 1 year ago

A human can absolutely copy previous works, and they do it all the time. Disney themselves license books teaching you how to do just that. https://www.barnesandnoble.com/w/learn-to-draw-disney-celebrated-characters-collection-disney-storybook-artists/1124097227

Not to mention the amount of porn online based on characters from copyrighted works. Porn that is often done as a paid commission, expressly violating copyright laws.

[-] Ret2libsanity@infosec.pub 7 points 1 year ago

Neither does AI?

[-] Niello@kbin.social 6 points 1 year ago

But considering that humans do get copyright strikes when they do something too similar, that should also apply to AI; it doesn't matter if it's not exact.

[-] Phanatik@kbin.social 5 points 1 year ago

That should tell you something about how companies act. They're fine with these LLMs plagiarising content but when someone gets marginally close to their own trademarks, they get slammed.

[-] lostmypasswordanew@feddit.de 8 points 1 year ago

Humans and AI are not the same and an equivalence should never be drawn.

[-] BlameThePeacock@lemmy.ca 2 points 1 year ago

Your feelings don't really matter. The fact of the matter is that the goal of AI is literally to replicate the function of a human brain. The way we're building them often mimics the same processes.

[-] nickwitha_k@lemmy.sdf.org 7 points 1 year ago

And LLMs and related technologies, by themselves, are artificial but not intelligent. So, the facts are not in favor of your argument to allow commercial parasitism on creative works.

[-] BlameThePeacock@lemmy.ca 2 points 1 year ago

I think you're missing a point here. If someone uses these models to produce and distribute copyright-infringing works, the original rights holder could go after the infringer.

The model itself isn't infringing though, and the process of creating the model isn't either.

It's a similar kind of argument to the laws that protect gun manufacturers from culpability from someone using their weapon to commit a crime. The user is the one doing the bad thing, they just produce a tool.

Otherwise, could Disney go after a pencil company because someone used one of their pencils to infringe on Disney's copyright? Even if that pencil company had designed the pencil to be extremely good at producing Disney imagery by looking at a whole bunch of Disney images and movies to make sure it matches the size, colour, etc? No, because a pencil isn't a copyright infringement of art, regardless of the process used to design it.

[-] nickwitha_k@lemmy.sdf.org 2 points 1 year ago* (last edited 1 year ago)

Nah. You're missing the forest for the trees. Let's get abstract:

Person A makes a living by making product X and selling it.

Person B makes a living by making product Y and selling it.

Both A and B are in the same industry.

Person C uses a machine to extract the essence of product X and Y and blend them. Person C then claims authorship and sells it as product Z, which they sell in competition to X and Y.

Person C has not created anything. Their machine has no value in the absence of products X and Y, yet Person C received no permission and offers neither credit nor compensation. In addition, they are competing for the same customers and harming the livelihoods of A and B. Person C is acting in a purely parasitic manner that cannot be seen as ethical in any widely accepted definition of the word.

[-] Zapp@beehaw.org 2 points 1 year ago

The goal of AI is fictional, and there's no solid evidence today that it will ever stop being fiction.

What we have today are stupid learning algorithms that are surprisingly good at mimicking intelligent people.

The most apt comparison today is a particularly clever parrot.

I'm all for having the discussion about how to handle AI when we have it, but it's bad faith to apply it to what we have today.

Critically, what we have today will never ever go on strike, or really make any kind of correct moral decision on its own. We must treat it like dumb automation, because it is dumb automation.

[-] conciselyverbose@kbin.social 8 points 1 year ago

Derivative works are only copyright violations when they replicate substantial portions of the original without changes.

The entirety of human civilization is derivative works. Derivative works aren't infringement.

[-] lostmypasswordanew@feddit.de 7 points 1 year ago

That's just not true

[-] conciselyverbose@kbin.social 2 points 1 year ago

It absolutely is. There's nothing out there in the past thousand years that isn't based on prior art. Copyright law only applies to direct copies, and there are explicit carve-outs beyond that which allow you to directly copy some things if your work is transformative.

[-] FaceDeer@kbin.social 5 points 1 year ago

It is not a derivative work; the model does not contain any recognizable part of the original material that it was trained on.

[-] frog@beehaw.org 13 points 1 year ago

Except when it produces exact copies of existing works, or when it includes a recognisable signature or watermark?

[+] NumbersCanBeFun@kbin.social 3 points 1 year ago

[deleted]

[-] frog@beehaw.org 6 points 1 year ago

The point is that if the model doesn't contain any recognisable parts of the original material it was trained on, how can it reproduce recognisable parts of the original material it was trained on?

[-] ricecake@beehaw.org 2 points 1 year ago

That's sorta the point of it.
I can recreate the phrase "apple pie" in any number of styles and fonts using my hands and a writing tool. Would you say that I "contain" the phrase "apple pie"? Where is the letter 'p' in my brain?

Specifically, the AI contains the relationship between sets of words, and sets of relationships between lines, contrasts and colors.
From there, it knows how to take a set of words and make an image that proportionally replicates those line patterns and color relationships.

You can probably replicate the Getty images watermark close enough for it to be recognizable, but you don't contain a copy of it in the sense that people typically mean.
Likewise, because you can recognize the artist who produced a piece, you contain an awareness of that same relationship between color, contrast and line that the AI does. I could show you a Picasso you were unfamiliar with, and you'd likely know it was him based on the style.
You've been "trained" on his works, so you have internalized many of the key markers of his style. That doesn't mean you "contain" his works.

[-] frog@beehaw.org 2 points 1 year ago

Just because you can't point to a specific part of your brain that contains the letter 'p' doesn't mean it isn't in there somewhere. If you didn't contain the letter 'p', or Getty watermark, or Picasso's work, you wouldn't be able to recognise them when you saw them or tried to replicate them. The act of recognising something that is familiar is basically the brain comparing what the eye sees with what is stored in the memory. The brain stores it differently to an exact copy on a hard drive, but it does, nevertheless, contain everything that it remembers.

[-] FaceDeer@kbin.social 2 points 1 year ago

Ah, this old paper again. When it first came out it got raked over the coals pretty thoroughly. The authors used an older, poorly-trained version of Stable Diffusion that had been trained on only 160 million images and identified 350,000 images from the training set that had many duplicates and therefore could potentially be overfitted. They then generated 175 million images using tags commonly associated with those duplicate images.

After all that, they found 109 images in the output that looked like fuzzy versions of the input images. This is hardly a triumph of plagiarism.
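For a sense of scale, here is the arithmetic on the figures quoted above (109 matches out of 175 million generated images). The numbers are the ones given in this comment, not independently checked against the paper, and the variable names are just for illustration.

```python
# Figures as stated in the comment above; not verified against the paper.
generated = 175_000_000   # images generated from the duplicate-heavy prompts
near_copies = 109         # outputs judged to resemble a training image

rate = near_copies / generated
print(f"rate: {rate:.2e}")                                    # rate: 6.23e-07
print(f"about 1 in {round(generated / near_copies):,} images")  # about 1 in 1,605,505 images
```

That works out to roughly 0.00006% of the generated images, and that is after the prompts were selected specifically to target heavily duplicated training images.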

As for the watermark, look closely at it. The AI clearly just replicated the idea of a Getty-like watermark, it's barely legible. What else would you expect when you train an AI on millions of images that contain a common feature, though? It's like any other common object - it thinks photographs often just naturally have a grey rectangle with those white squiggles in it, and so it tries putting them in there when it generates photographs.

These are extreme stretches and they get dredged up every time by AI opponents. Training techniques have been refined over time to reduce overfitting (since what's the point in spending enormous amounts of GPU power to produce a badly-artefacted copy of an image you already have?) so it's little wonder there aren't any newer, better papers showing problems like these.

[-] frog@beehaw.org 5 points 1 year ago

Nevertheless, the Getty watermark is a recognisable element from the images the model was trained on, therefore you cannot state that the models don't spit out images with recognisable elements from the training data.

this post was submitted on 09 Aug 2023

351 points (100.0% liked)

This community's icon was made by Aaron Schneider, under the CC-BY-NC-SA 4.0 license.
