It's all made from our data, anyway, so it should be ours to use as we want

[-] just_another_person@lemmy.world 144 points 1 week ago* (last edited 1 week ago)

It won't really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.

What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they'll drag that out for years until people go broke fighting, or stop giving a shit.

They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

[-] sugar_in_your_tea@sh.itjust.works 37 points 1 week ago

They pulled a very pubic and out in the open data heist

Oh no, not the pubes! Get those curlies outta here!

[-] just_another_person@lemmy.world 16 points 1 week ago

Best correction ever. Fixed. ♥️

[-] FaceDeer@fedia.io 28 points 1 week ago

Legislation that prohibits publicly-viewable information from being analyzed without permission from the copyright holder would have some pretty dramatic and dire unintended consequences.

[-] Grimy@lemmy.world 8 points 1 week ago* (last edited 1 week ago)

If we can't train on unlicensed data, there is no open-source scene. Even worse, AI stays but it becomes a monopoly in the hands of the few who can pay for the data.

Most of that data is owned and aggregated by entities such as record labels, Hollywood, Instagram, reddit, Getty, etc.

The field would still remain hyper competitive for artists and other trades that are affected by AI. It would only cause all the new AI based tools to be behind expensive censored subscription models owned by either Microsoft or Google.

I think forcing all models trained on unlicensed data to be open source is a great idea but actually rooting for civil lawsuits which essentially entail a huge broadening of copyright laws is simply foolhardy imo.

[-] Avatar_of_Self@lemmy.world 7 points 1 week ago

It's already illegal in some form. Via piracy of the works and regurgitating protected data.

The issue is mega Corp with many rich investors vs everyone else. If this were some university student their life would probably be ruined like with what happened to Aaron Swartz.

The US justice system is different for different people.

[-] NoForwardslashS@sopuli.xyz 6 points 1 week ago

But wouldn't that mean making it open source, then it not functioning properly without the data while open, would prove that it is using a huge amount of unlicensed data?

Probably not "burden of proof in a court of law" prove though.

[-] Bronzebeard@lemm.ee 9 points 1 week ago

Making it open source doesn't change how it works. It doesn't need the data after it's been trained. Most of these AIs are just figuring out patterns to look for in the new data it comes across.
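
A minimal sketch of that point, assuming a toy least-squares model rather than an actual LLM: training reduces the data to a few numbers (the weights), and prediction afterwards uses only those numbers. The function names `train` and `predict` and the sample data are hypothetical, chosen for illustration.

```python
def train(xs, ys):
    """Fit y = w*x + b by ordinary least squares and return only the weights."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


def predict(weights, x):
    """Apply the learned weights to a new input; no training data is consulted."""
    w, b = weights
    return w * x + b


# Hypothetical training data, used once and then discarded.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
weights = train(xs, ys)
del xs, ys  # the training data is no longer available

# Inference on an unseen input relies on the weights alone.
print(predict(weights, 10.0))
```

This only shows that inference does not read the training set. It does not settle whether a much larger model can still reproduce parts of what it was trained on, which is a separate question.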

[-] fmstrat@lemmy.nowsci.com 87 points 1 week ago

So banks will be public domain when they're bailed out with taxpayer funds, too, right?

[-] ArchRecord@lemm.ee 61 points 1 week ago

They should be, but currently it depends on the type of bailout, I suppose.

For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank's assets, and now effectively owns the bank.

[-] booly@sh.itjust.works 11 points 1 week ago

At the same time, if a bank goes under, that means they owe more than they own, so "ownership" of that entity is basically worthless. In those cases, a bailout of the customers does nothing for the owners, because the owners still get wiped out.

The GM bailout in 2009 also involved wiping out all the shareholders, the government taking ownership of the new company, and the government spinning off the newly issued stock.

The AIG bailout basically required the company to issue new stock that diluted existing owners down to 20% of the company, while the government owned the other 80%, and the government made a big profit when it exited that transaction and sold the stock off to the public.

So it's not super unusual. Government can take ownership of companies as a condition of a bailout. What we generally don't want is the government owning a company long term, because there's some conflict of interest between its role as regulator and its interest as a shareholder.

[-] xthexder@l.sw0.com 10 points 1 week ago* (last edited 1 week ago)

Public domain wouldn't be the right term for banks being publicly owned. At least for the normal usage of Public Domain in copyright. You can copy text and data, you can't copy a company with unique customers and physical property.

[-] john89@lemmy.ca 60 points 1 week ago

I don't think it should be a "punishment." It should be done on principle.

[-] circuitfarmer@lemmy.sdf.org 60 points 1 week ago

A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.

I'm not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.

[-] dragonfucker@lemmy.nz 10 points 1 week ago

Yes, mining companies should all be nationalised for digging up the country's ground and putting carbon in the country's air.

[-] mp3@lemmy.ca 41 points 1 week ago

It could also contain non-public domain data, and you can't declare someone else's intellectual property as public domain just like that, otherwise a malicious actor could just train a model with a bunch of misappropriated data, get caught (intentionally or not) and then force all that data into public domain.

Laws are never simple.

[-] grue@lemmy.world 21 points 1 week ago

So what you're saying is that there's no way to make it legal and it simply needs to be deleted entirely.

I agree.

[-] FaceDeer@fedia.io 7 points 1 week ago

There's no need to "make it legal", things are legal by default until a law is passed to make them illegal. Or a court precedent is set that establishes that an existing law applies to the new thing under discussion.

Training an AI doesn't involve copying the training data, the AI model doesn't literally "contain" the stuff it's trained on. So it's not likely that existing copyright law makes it illegal to do without permission.

[-] drkt@scribe.disroot.org 18 points 1 week ago

Forcing a bunch of neural weights into the public domain doesn't make the data they were trained on also public domain, in fact it doesn't even reveal what they were trained on.

[-] deegeese@sopuli.xyz 9 points 1 week ago

LOL no. The weights encode the training data and it’s trivially easy to make AI generators spit out bits of their training data.

[-] drkt@scribe.disroot.org 11 points 1 week ago
[-] merc@sh.itjust.works 8 points 1 week ago

It wouldn't contain any public-domain data though. That's the thing with LLMs, once they're trained on data the data is gone and just added to the series of weights in the model somewhere. If it ingested something private like your tax data, it couldn't re-create your tax data on command, that data is now gone, but if it's seen enough private tax data it could give something that looked a lot like a tax return to someone with an untrained eye. But, a tax accountant would easily see flaws in it.

[-] nutsack@lemmy.world 37 points 1 week ago* (last edited 1 week ago)

intellectual property doesn't really exist in most of the world. they don't give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore...

it's arbitrary law that is designed to protect corporations and it's generally unenforceable.

[-] echodot@feddit.uk 11 points 1 week ago

But they're not developing AI in those countries; they're developing it mostly in the US. In the US, copyright law is enforced.

[-] dsilverz@thelemmy.club 8 points 1 week ago

There is a lot of AI development happening in China. Doubao (from Bytedance, the same company behind TikTok), DeepSeek and Qwen are some examples of Chinese LLMs.

[-] sugar_in_your_tea@sh.itjust.works 11 points 1 week ago

it’s arbitrary law that is designed to protect corporations and it’s generally unenforceable.

It's arbitrary, but it was designed to protect individuals and has since been morphed to protect corporations. If we reset the law back to the original Copyright Act of 1790 w/ a 14-year duration, it would go a long way toward removing power from corporations. I think we should take it a step further and perhaps make it 10 years, with an optional extension for another 10 years if you can show need (e.g. you're an indie dev and your game is finally making a splash after 8 years).

[-] FlyingSquid@lemmy.world 8 points 1 week ago

they don’t give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore…

Unless it's their intellectual property, whereupon it's suddenly a whole different story. I'm sure you knew that.

[-] ClamDrinker@lemmy.world 32 points 1 week ago

Although I'm a firm believer that most AI models should be public domain or open source by default, the premise of "illegally trained LLMs" is flawed, because there really is no assurance that LLMs currently in use are illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world.

The idea of... well, ideas, being copyrightable, should shake the boots of anyone in this discussion. Especially since when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.

The underlying technology has more than enough good uses that banning it would simply cause it to flourish wherever it is not banned, which means, as usual, that everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose against these companies as it currently stands. By requiring the models to be made available to the public, we ensure that the playing field doesn't tip further in their favor to the point that AI technology only exists to benefit them.

If the model is built on the corpus of humanity, then humanity should benefit.

[-] barsoap@lemm.ee 8 points 1 week ago* (last edited 1 week ago)

As per torrentfreak

OpenAI hasn’t disclosed the datasets that ChatGPT is trained on, but in an older paper two databases are referenced; “Books1” and “Books2”. The first one contains roughly 63,000 titles and the latter around 294,000 titles.

These numbers are meaningless in isolation. However, the authors note that OpenAI must have used pirated resources, as legitimate databases with that many books don’t exist.

Should be easy to defend against, right-out trivial: OpenAI, just tell us what those Books1 and Books2 databases are. Where you got them from, the licensing contracts with publishers that you signed to give you access to such a gigantic library. No need to divulge details, just give us information that makes it believable that you licensed them.

...crickets. They pirated the lot of it otherwise they would already have gotten that case thrown out. It's US startup culture, plain and simple, "move fast and break laws", get lots of money, have lots of money enabling you to pay the best lawyers to abuse the shit out of the US court system.

[-] hark@lemmy.world 26 points 1 week ago

Imaginary property has always been a tricky concept, but the law always ends up just protecting the large corporations at the expense of the people who actually create things. I assume the end result here will be large corporations getting royalties from AI model usage or measures put in place to prevent generating content infringing on their imaginary properties and everyone else can get fucked.

[-] merc@sh.itjust.works 12 points 1 week ago

It's like what happened with Spotify. The artists and the labels were unhappy with the copyright infringement of music happening with Napster, Limewire, Kazaa, etc. They wanted the music model to be the same "buy an album from a record store" model that they knew and had worked for decades. But, users liked digital music and not having to buy a whole album for just one song, etc.

Spotify's solution was easy: cut the record labels in. Let them invest and then any profits Spotify generated were shared with them. This made the record labels happy because they got money from their investment, even though their "buy an album" business model was now gone. It was ok for big artists because they had the power to negotiate with the labels and get something out of the deal. But, it absolutely screwed the small artists because now Spotify gives them essentially nothing.

I just hope that the law that nothing created by an LLM is copyrightable proves to be enough of a speed bump to slow things down.

[-] Taleya@aussie.zone 6 points 1 week ago

Bandcamp still runs on this model though, and quite well

[-] xthexder@l.sw0.com 8 points 1 week ago

It's also one of the few places that have lossless audio files available for download. I'm a big fan of Bandcamp. I like having all my music local.

[-] noxypaws@pawb.social 25 points 1 week ago

I'd rather they were destroyed, but practically speaking that's impossible, and this sounds like the next best idea to me.

[-] cypherpunks@lemmy.ml 22 points 1 week ago* (last edited 1 week ago)

"Given they were trained on our data, it makes sense that it should be public commons – that way we all benefit from the processing of our data"

I wonder how many people besides the author of this article are upset solely about the profit-from-copyright-infringement aspect of automated plagiarism and bullshit generation, and thus would be satisfied by the models being made more widely available.

The inherent plagiarism aspect of LLMs seems far more offensive to me than the copyright infringement, but both of those problems pale in comparison to the effects on humanity of masses of people relying on bullshit generators with outputs that are convincingly-plausible-yet-totally-wrong (and/or subtly wrong) far more often than anyone notices.

I liked the author's earlier very-unlikely-to-be-met-demand activism last year better:

I just sent @OpenAI a cease and desist demanding they delete their GPT 3.5 and GPT 4 models in their entirety and remove all of my personal data from their training data sets before re-training in order to prevent #ChatGPT telling people I am dead.

...which at least yielded the amusingly misleading headline OpenAI ordered to delete ChatGPT over false death claims (it's technically true - a court didn't order it, but a guy who goes by the name "That One Privacy Guy" while blogging on linkedin did).

[-] Magnetic_dud@discuss.tchncs.de 19 points 1 week ago

I used whisper to create subs of a video and in a section with instrumental relaxing music it filled on repeat with

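La scuola del Dr. Paret è una tecnologia di ipnosi non verbale che si utilizza per risultati di un'ipnosi non verbale [English: "Dr. Paret's school is a non-verbal hypnosis technology that is used for results of a non-verbal hypnosis"]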

Clearly stolen from this Dr. Paret YouTube channel, where he's selling hypnosis lessons in Italian. Probably in one or more videos he had subs stating this over the same relaxing instrumental music that I used, and the model assumed the sound corresponded to that text.

[-] werefreeatlast@lemmy.world 13 points 1 week ago

I want to have a personal llm that learns all my interests from my files and websites visited. I just want to ask it stuff that I don't have to remember.

[-] chiliedogg@lemmy.world 10 points 1 week ago

Delete them. Wipe their databases. Make the companies start from scratch with new, ethically acquired training data.

[-] Blackmist@feddit.uk 10 points 1 week ago

They don't mean your data, silly. They don't give a fuck about that.

They mean other huge corporations' data.

[-] FaceDeer@fedia.io 10 points 1 week ago

Are you threatening me with a good time?

First of all, whether these LLMs are "illegally trained" is still a matter before the courts. When an LLM is trained it doesn't literally copy the training data, so it's unclear whether copyright is even relevant.

Secondly, I don't think that making these models "public domain" would have the negative effects that people angry about AI think it would. When a company is running a closed model internally, like ChatGPT for example, the model is never available for download in the first place. It doesn't matter if it's public domain or not because you can't get a copy of it. When a company releases an open-weight model for public use, on the other hand, they usually encumber it with some sort of license that makes it harder for competitors to monetize or build on. Making those public-domain would greatly increase their utility. It might make future releases less likely, but in the meantime it'll greatly enhance AI development.

[-] brucethemoose@lemmy.world 9 points 1 week ago* (last edited 1 week ago)

The environmental cost of training is a bit of a meme. The details are spread around, but basically, Alibaba trained a GPT-4 level-ish model on a relatively small number of GPUs... probably on par with a steel mill running for a long time, a drop in the bucket compared to industrial processes. OpenAI is extremely inefficient, probably because they don't have much pressure to optimize GPU usage.

Inference cost is more of a concern with crazy stuff like o3, but this could dramatically change if (hopefully when) bitnet models come to fruition.

Still, I 100% agree with this. Closed LLM weights should be public domain, as many good models already are.

[-] Hackworth@lemmy.world 9 points 1 week ago* (last edited 1 week ago)

Calling something illegal in spite of or in absence of precedent is a time-honored tactic - though not a particularly persuasive one.

[-] HexesofVexes@lemmy.world 7 points 1 week ago* (last edited 1 week ago)

I mean, if we really are following the spirit of copyright, since no-one at OpenAI or other companies developed matrix and vector multiplication (operations existing in the public domain because Platonism is a thing).

Edit: oh my, I guess the consensus is that stealing the work of mathematicians is ok (or more, classifying our constructions as discoveries).

You can't patent math, though you can copyright a specific explanation of math concepts.

If OpenAI (or any AI company) is including copyrighted works in their solution, that's a copyright violation and should be treated as such. But if they're merely using the information from a copyrighted work but not violating the copyright itself, they're fine.

this post was submitted on 22 Dec 2024
1579 points (100.0% liked)

Technology
