Google says AI systems should be able to mine publishers’ work unless companies opt out, turning copyright law on its head (www.theguardian.com)

submitted 1 year ago by 0x815@feddit.de to c/technology@beehaw.org

204 comments

In its submission to the Australian government’s review of the regulatory framework around AI, Google said that copyright law should be altered to allow for generative AI systems to scrape the internet.

[-] db0@lemmy.dbzer0.com 108 points 1 year ago

I agree with Google, only I go a step further and say any AI model trained on public data should likewise be public for all and have its data sources public as well. Can't have it both ways, Google.

[-] domi@lemmy.secnd.me 47 points 1 year ago

To be fair, Google releases a lot of models as open source: https://huggingface.co/google

Using public content to create public models is also fine in my book.

But since it's Google I'm also sure they are doing a lot of shady stuff behind closed doors.

[-] FaceDeer@kbin.social 57 points 1 year ago

Copyright law already allows generative AI systems to scrape the internet. You need to change the law to forbid something, it isn't forbidden by default. Currently, if something is published publicly then it can be read and learned from by anyone (or anything) that can see it. Copyright law only prevents making copies of it, which a large language model does not do when trained on it.

[-] maynarkh@feddit.nl 35 points 1 year ago

A lot of licensing prevents or constrains creating derivative works and monetizing them. The question is for example if you train an AI on GPL code, does the output of the model constitute a derivative work?

If yes, Github Copilot is illegal as it produces code that should comply with multiple conflicting license requirements. If no, I can write some simple AI that is "trained" to regurgitate its output on a prompt, and run a leaked copy of Windows through it, then go around selling Binbows and MSFT can't do anything about it.

The truth is somewhere between the two. This is just piracy, which has always been a gray area because of the difficulty of prosecuting it: previously because the perpetrators were many and hard to find, now because the perpetrators are billion-dollar companies with expensive legal teams.

[-] FaceDeer@kbin.social 21 points 1 year ago

> The question is for example if you train an AI on GPL code, does the output of the model constitute a derivative work?

This question is completely independent of whether the code was generated by an AI or a human. You compare code A with code B, and if the judge and jury agree that code A is a derivative work of code B then you win the case. If the two bodies of work don't have sufficient similarities then they aren't derivative.

> If no, I can write some simple AI that is “trained” to regurgitate its output on a prompt

You've reinvented copy-and-paste, not an "AI." AIs are deliberately designed to not copy-and-paste. What would be the point of one that did? Nobody wants that.

Filtering the code through something you call an AI isn't going to have any impact on whether you get sued. If the resulting code looks like copyrighted code, then you're in trouble. If it doesn't look like copyrighted code then you're fine.
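
As a rough illustration of the "compare code A with code B" step, here is a minimal Python sketch that scores how textually similar two snippets are. It is only an assumption-laden toy: the function name `similarity` and the sample snippets are made up for this example, and a line-matching ratio is not the legal test for a derivative work, which a court decides case by case.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a 0..1 ratio of how many lines two snippets share, in order."""
    # Strip each line and drop blank ones so that indentation and spacing
    # alone do not count as a difference (an illustrative choice).
    a_lines = [line.strip() for line in a.splitlines() if line.strip()]
    b_lines = [line.strip() for line in b.splitlines() if line.strip()]
    return difflib.SequenceMatcher(None, a_lines, b_lines).ratio()


# Hypothetical snippets, invented purely for the example.
original = """
def add(x, y):
    return x + y
"""

reindented_copy = """
def add(x, y):
        return x + y
"""

unrelated = """
print("hello")
"""

if __name__ == "__main__":
    print(similarity(original, reindented_copy))
    print(similarity(original, unrelated))
```

A score like this can flag near-verbatim output for a closer look, but it says nothing about where the line between "similar" and "derivative" actually sits.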

[-] maynarkh@feddit.nl 11 points 1 year ago

> AIs are deliberately designed to not copy-and-paste.

AI is a marketing term, not a technical one. You can call anything "AI", but it's usually predictive models that get called that.

> AIs are deliberately designed to not copy-and-paste. What would be the point of one that did? Nobody wants that.

For example, if the powers that be decided to say licenses don't apply once you feed material through an "AI", and failed to define AI, you could say you wrote this awesome OS using an AI that you trained exclusively using Microsoft proprietary code. Their licenses and copyright and stuff don't apply to AI training data, so you could sell that new code your AI just created.

It doesn't even have to be 100% identical to Windows source code. What if it's just 80%? 50%? 20%? 5%? Where is the bar where the author can claim "that's my code!"?

Just to compare, the guys who set out to reimplement Win32 APIs for use in Linux (the thing that made it into MacOS as well now) deliberately would not accept help from anyone who ever saw any Microsoft source code for fear of being sued. The bar was that high when it was a small FOSS organization doing it. It was 0%, proven beyond a doubt.

Now that Microsoft is the author, it's not a problem when Github Copilot spits out GPL code word for word, ironically together with its license.

[-] FaceDeer@kbin.social 7 points 1 year ago

> AI is a marketing term, not a technical one.

The reverse, actually. Artificial intelligence is a field of research that includes things like machine learning, as well as lots of even more mundane applications. It's pop culture that has hijacked it to mean "a thing exactly as capable as a human brain, but in computer form."

> For example if the powers that be decided to say licenses don’t apply once you feed material through an “AI”, and failed to define AI, you could say you wrote this awesome OS using an AI that you trained exclusively using Microsoft proprietary code.

Once again, it doesn't matter what you "feed code through." Copyright applies to the tangible result. If the output from the AI matches closely to something that's already copyrighted then that copyright applies to it. If it doesn't match closely then that copyright doesn't apply to it. The actual process by which the code was produced doesn't matter one whit. If I took a Harry Potter book, put its pages through a shredder, randomly glued the particles of paper back together and it just so happened to closely replicate Lord of the Rings then the Tolkien estate has a case against me but the Rowling estate does not.

[-] nous@programming.dev 9 points 1 year ago

> If the resulting code looks like copyrighted code, then you’re in trouble. If it doesn’t look like copyrighted code then you’re fine.

^^ Very much this.

Loads of people are treating the process of AI creating works as either violating copyright or not. But that is not how copyright works. It applies to the output of a process, not the process itself. If someone ends up writing something that happens to be a copy of something they read before - that is a violation of copyright laws. If someone uses various works and creates something new and unique then that is not a violation. It does not - at this point in time at least - matter if that someone is a real person or an AI.

AI can both violate copyright on one work and not on another. Each case is independent and would need to be legislated differently. But AI can produce so much content so quickly that it creates a real problem for a case-by-case analysis of copyright infringement. So it is quite likely the laws will need to change to account for this and will likely need to treat AI works differently from human-created works. Which is a very hard thing to actually deal with.

Now, one could also argue the model itself is a violation of copyright. But that IMO is a stretch - a model is nothing like the original work and the copyright law also does not cover this case. It would need to be taken to court to really decide on if this is allowed or not.

Personally I don't think the conversation should be on what the laws currently allow - they were not designed for this. But instead what the laws should allow. So we can steer the conversation towards a better future. Lots of artists are expressing their distaste for AI models being trained on their works - if enough people do this, laws can be crafted to back up this view.

[-] lostmypasswordanew@feddit.de 11 points 1 year ago

An AI model is a derivative work of its training data and thus a copyright violation if the training data is copyrighted.

[-] BlameThePeacock@lemmy.ca 17 points 1 year ago

A human is a derivative work of its training data, thus a copyright violation if the training data is copyrighted.

The difference between a human and ai is getting much smaller all the time. The training process is essentially the same at this point, show them a bunch of examples and then have them practice and provide feedback.

If that human is trained to draw on Disney art, then goes on to create similar style art for sale that isn't a copyright infringement. Nor should it be.

[-] Phanatik@kbin.social 15 points 1 year ago* (last edited 1 year ago)

This is stupid and I'll tell you why.
As humans, we have a perception filter. This filter is unique to every individual because it's fed by our experiences and emotions. Artists make great use of this by producing art which leverages their view of the world, it's why Van Gogh or Picasso is interesting because they had a unique view of the world that is shown through their work.
These bots do not have perception filters. They're designed to break down whatever they're trained on into numbers and decipher how the style is constructed so it can replicate it. It has no intention or purpose behind any of its decisions beyond straight replication.
You would be correct if a human's only goal was to replicate Van Gogh's style but that's not every artist. With these art bots, that's the only goal that they will ever have.

I have to repeat this every time there's a discussion on LLM or art bots:
The imitation of intelligence does not equate to actual intelligence.

[-] frog@beehaw.org 12 points 1 year ago

Absolutely agreed! I think if the proponents of AI artwork actually had any knowledge of art history, they'd understand that humans don't just iterate the same ideas over and over again. Van Gogh, Picasso, and many others, did work that was genuinely unique and not just a derivative of what had come before, because they brought more to the process than just looking at other artworks.

[-] nickwitha_k@lemmy.sdf.org 7 points 1 year ago

Yup. There seems to be a strong motive in many to not understand this concept as it makes their practices clearly ethically questionable.

[-] davehtaylor@beehaw.org 7 points 1 year ago

I really, really, really wish people would understand this.

AI can only create a synthesis of exactly what it's fed. It has no life experience, no emotional experience, no nurture-related experiences, no cultural experiences that color its thinking, because it isn't thinking.

The "AI are only doing what humans do" is such a brain-dead line of thinking, to the point that it almost feels like it's 100% in bad faith whenever it's brought up.

[-] 50gp@kbin.social 10 points 1 year ago

a human does not copy previous work exactly like these algorithms, what's this shit take?

[-] BlameThePeacock@lemmy.ca 11 points 1 year ago

A human can absolutely copy previous works, and they do it all the time. Disney themselves license books teaching you how to do just that. https://www.barnesandnoble.com/w/learn-to-draw-disney-celebrated-characters-collection-disney-storybook-artists/1124097227

Not to mention the amount of porn online based on characters from copyrighted works. Porn that is often done as a paid commission, expressly violating copyright laws.

[-] lostmypasswordanew@feddit.de 8 points 1 year ago

Humans and AI are not the same and an equivalence should never be drawn.

[-] conciselyverbose@kbin.social 8 points 1 year ago

Derivative works are only copyright violations when they replicate substantial portions of the original without changes.

The entirety of human civilization is derivative works. Derivative works aren't infringement.

[-] lostmypasswordanew@feddit.de 7 points 1 year ago

That's just not true

[-] Gutless2615@ttrpg.network 46 points 1 year ago* (last edited 1 year ago)

It’s not turning copyright law on its head; in fact, asserting that copyright needs to be expanded to cover training on a data set IS turning it on its head. This is not a reproduction of the original work, it’s learning about that work and making a transformative use of it. A generative work using a trained dataset isn’t copying the original, it’s learning about the relationships that original has to the other pieces in the data set.

[-] argv_minus_one@beehaw.org 14 points 1 year ago

This is artificial pseudointelligence, not a person. It doesn't learn about or transform anything.

[-] Gutless2615@ttrpg.network 8 points 1 year ago* (last edited 1 year ago)

I'm not the one anthropomorphising the technology here.

[-] phillaholic@lemm.ee 13 points 1 year ago

The lines between learning and copying are being blurred with AI. Imagine if you could replay a movie any time you like in your head just from watching it once. Current copyright law wasn’t written with that in mind. It’s going to be interesting how this goes.

[-] ricecake@beehaw.org 12 points 1 year ago

Imagine being able to recall the important parts of a movie, its overall feel, and significant themes and attributes after only watching it one time.

That's significantly closer to what current AI models do. It's not copyright infringement that there are significant chunks of some movies that I can play back in my head precisely. First because memory being owned by someone else is a horrifying thought, and second because it's not a distributable copy.

[-] ConsciousCode@beehaw.org 42 points 1 year ago

To be honest I'm fine with it in isolation, copyright is bullshit and the internet is a quasi-socialist utopia where information (an infinitely-copyable resource which thus has infinite supply and 0 value under capitalist economics) is free and humanity can collaborate as a species. The problem becomes that companies like Google are parasites that take and don't give back, or even make life actively worse for everyone else. The demand for compensation isn't so much because people deserve compensation for IP per se, it's an implicit understanding of the inherent unfairness of Google claiming ownership of other people's information while hoarding it and the wealth it generates with no compensation for the people who actually made that wealth. "If you're going to steal from us, at least pay us a fraction of the wealth like a normal capitalist".

If they made the models open source then it'd at least be debatable, though still suss since there's a huge push for companies to replace all cognitive labor with AI whether or not it's even ready for that (which itself is only a problem insofar as people need to work to live, professionally created media is art insofar as humans make it for a purpose but corporations only care about it as media/content so AI fits the bill perfectly). Corporations are artificial metaintelligences with misaligned terminal goals so this is a match made in superhell. There's a nonzero chance corporations might actually replace all human employees and even shareholders and just become their own version of skynet.

Really what I'm saying is we should eat the rich, burn down the googleplex, and take back the means of production.

[-] Ubermeisters@lemmy.zip 10 points 1 year ago

Okay so I took back the means of production but it says it's a subscription basis now

[-] ConsciousCode@beehaw.org 9 points 1 year ago

That's late-stage capitalism for you – even revolution comes with a subscription fee

[-] cambriakilgannon@beehaw.org 8 points 1 year ago

Or, if it was some non-profit doing the work for the good of everyone :')

[-] andresil@lemm.ee 40 points 1 year ago* (last edited 1 year ago)

Copyright law is gaslighting at this point. Piracy being extremely illegal but then this kind of shit being allowed by default is insane.

We really are living under the boot of the ruling classes.

[-] FaceDeer@kbin.social 7 points 1 year ago

If you want "this kind of stuff" (by which I assume you mean the training of AI) to not be allowed by default, then you are basically asking for a world in which the only legal generative AIs belong to giant well-established copyright holders like Adobe and Getty. That path leads deeper underneath the boots of those ruling classes, not out from under them.

[-] nightmaaaare@lemmy.one 30 points 1 year ago

Personally I’d rather stop posting creative endeavours entirely than simply let it be stolen and regurgitated by every single company who’s built a thing on the internet.

[-] GyozaPower@discuss.tchncs.de 26 points 1 year ago

With each day I hate the internet and these fucking companies even more.

[-] yoz@aussie.zone 26 points 1 year ago* (last edited 1 year ago)

Can we get some young politicians elected who have a degree in IT? Boomers don't understand technology, that's why these companies keep screwing the people.

[-] zephyrvs@lemmy.ml 12 points 1 year ago

It's because they're corrupt, and young people are just as susceptible to lobbyists' bribes, unfortunately. The gerontocracy doesn't make things better though, that's for sure.

[-] Pixel@lemmy.sdf.org 24 points 1 year ago

Books will start needing to add a robots.txt page to the back of the book

[-] sirjash@feddit.de 18 points 1 year ago

Which will be ignored by search engines, as is tradition?

[-] LastOneStanding@beehaw.org 20 points 1 year ago

OK, so I shall create a new thread, because I was harassed. Why bother publishing anything if it's original if it's just going to be subsumed by these corporations? Why bother being an original human being with thoughts to share that are significant to the world if, in the end, they're just something to be sucked up and exploited? I'm pretty smart. Keeping my thoughts to myself.

[-] EgoNo4@beehaw.org 19 points 1 year ago

Google can go suck on a lemon!

[-] pfannkuchen_gesicht@lemmy.one 7 points 1 year ago

Lemons are delicious af though. Why reward them for their bs?

[-] modulus@lemmy.ml 18 points 1 year ago

Worth considering that this is already the law in the EU. Specifically, the Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market has exceptions for text and data mining.

Article 3 has a very broad exception for scientific research: "Member States shall provide for an exception to the rights provided for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/29/EC, and Article 15(1) of this Directive for reproductions and extractions made by research organisations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access." There is no opt-out clause to this.

Article 4 has a narrower exception for text and data mining in general: "Member States shall provide for an exception or limitation to the rights provided for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/29/EC, Article 4(1)(a) and (b) of Directive 2009/24/EC and Article 15(1) of this Directive for reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining." This one's narrower because it also provides that, "The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online."

So, effectively, this means scientific research can data mine freely without rights holders being able to opt out, and other uses for data mining such as commercial applications can data mine provided there has not been an opt-out through machine-readable means.
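
To make "machine-readable means" concrete, here is a minimal Python sketch of how a crawler might check a robots.txt file before mining a page. The bot name `ExampleAIBot`, the helper `may_mine`, and the sample rules are all hypothetical, and whether a robots.txt rule counts as an express reservation under Article 4 is a legal question this sketch does not settle.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is told to stay out entirely,
# every other user agent is allowed everywhere.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_mine(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/gallery/artwork-1"
    print(may_mine(ROBOTS_TXT, "ExampleAIBot", page))
    print(may_mine(ROBOTS_TXT, "SomeOtherBot", page))
```

A check like this only works if the crawler chooses to honour it, which is the weak point of any opt-out scheme built on robots.txt.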

[-] frog@beehaw.org 30 points 1 year ago

I think the key problem with a lot of the models right now is that they were developed for "research", without the rights holders having the option to opt out when the models were switched to for-profit. The portfolio and gallery websites, from which the bulk of the artwork came, didn't even have opt-out options until a couple of months ago. Artists were therefore considered to have opted in to their work being used commercially because they were never presented with the option to opt out.

So at the bare minimum, a mechanism needs to be provided for retroactively removing works that would have been opted out of commercial usage if the option had been available and the rights holders had been informed about the commercial intentions of the project. I would favour a complete rebuild of the models that only draws from works that are either in the public domain or whose rights holders have explicitly opted in to their work being used for commercial models.

Basically, you can't deny rights holders an ability to opt out, and then say "hey, it's not our fault that you didn't opt out, now we can use your stuff to profit ourselves".

[-] tochee@aussie.zone 8 points 1 year ago

Common sense would surely say that becoming a for-profit company or whatever they did would mean they've breached that law. I assume they figured out a way around it or I've misunderstood something though.

[-] frog@beehaw.org 9 points 1 year ago

I think they just blatantly ignored the law, to be honest. The UK's copyright law is similar, where "fair dealing" allows use for research purposes (legal when the data scrapes were for research), but fair dealing explicitly does not apply when the purpose is commercial in nature and intended to compete with the rights holder. The common sense interpretation is that as soon as the AI models became commercial and were being promoted as a replacement for human-made work, they were intended to be for-profit competition to the rights holders.

If we get to a point where opt outs have full legal weight, I still expect the AI companies to use the data "for research" and then ship the model as a commercial enterprise without any attempt to strip out the works that were only valid to use for research.

[-] autotldr@lemmings.world 7 points 1 year ago

🤖 I'm a bot that provides automatic summaries for articles:

The company has called for Australian policymakers to promote “copyright systems that enable appropriate and fair use of copyrighted content to enable the training of AI models in Australia on a broad and diverse range of data, while supporting workable opt-outs for entities that prefer their data not to be trained in using AI systems”.

The call for a fair use exception for AI systems is a view the company has expressed to the Australian government in the past, but the notion of an opt-out option for publishers is a new argument from Google.

Dr Kayleen Manwaring, a senior lecturer at UNSW Law and Justice, told Guardian Australia that copyright would be one of the big problems facing generative AI systems in the coming years.

“The general rule is that you need millions of data points to be able to produce useful outcomes … which means that there’s going to be copying, which is prima facie a breach of a whole lot of people’s copyright.”

“If you want to reproduce something that’s held by a copyright owner, you have to get their consent, not an opt out type of arrangement … what they’re suggesting is a wholesale revamp of the way that exceptions work.”

Toby Murray, associate professor at the University of Melbourne’s computing and information systems school, said Google’s proposal would put the onus on content creators to specify whether AI systems could absorb their content or not, but he indicated existing licensing schemes such as Creative Commons already allowed creators to mark how their works can be used.

this post was submitted on 09 Aug 2023

351 points (100.0% liked)
