175
        you are viewing a single comment's thread
view the rest of the comments
    
  
  
    view the rest of the comments
        this post was submitted on 26 Jun 2025
        
  
      
  
      175 points (100.0% liked)
      Technology
    76180 readers
  
      
      2544 users here now
  
      This is a most excellent place for technology news and articles.
Our Rules
- Follow the lemmy.world rules.
- Only tech related news or articles.
- Be excellent to each other!
- Mod approved content bots can post up to 10 articles per day.
- Threads asking for personal tech support may be deleted.
- Politics threads may be removed.
- No memes allowed as posts, OK to post as comments.
- Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
- Check for duplicates before posting, duplicates may be removed
- Accounts 7 days and younger will have their posts automatically removed.
Approved Bots
        founded 2 years ago
      
  
  
      MODERATORS
      
  
    
American law has become a literal fucking joke (IAAL). I could’ve guessed the could get the outcome of this case without any facts: the huge corporation wins over authors. American law is no longer capable of holding major corporations to account, so we need a new legal system—one that’s actually functional.
Thats the actual argument and the judge is right here. LLMs are transformative in every sense of the word. The technology is even called "transformers".
Fallacious argument.
Something that can't generate wine glass full to the brim without a band-aid fix is far from "transformative." Even if it were:
More like obfuscated plagiarism.
Nope I'm literally a data programmer working in this field. Any sufficiently transformed data even coming from hard copyright is transformative work and currently LLMs meet this criteria and will continue to do so. Wanna bet?
I think there's a blurry line here where you can easily train an LLM to just regurgitate the source material by overfitting, and at what point is it "transformative enough"? I think there's little doubt that current flagship models usually are transformative enough, but that doesn't apply to everything using the same technology - even though this case will be used as precedence for all of that.
There's also another issue in that while safeguards are generally in place, without them llms would be very capable of quoting entire pages at least of popular books. And jailbreaking llms isn't exactly unheard of. They also at least used to really like just verbatim repeating news articles on obscure topics.
What I'm mainly getting at is that LLMs can be transformative, but they also can plagiarize. Much like any human could. The question is then, if training LLMs on copyrighted data is allowed, will the company be held accountable when their LLM does plagiarize, the same way a person would be? Or would the better decision be to prohibit training on copyrighted data because actually transforming it meaningfully can not be guaranteed, and copyright holders actually finding these violations is very hard?
Though idk the case details, if the argument was purely focused on using the material to produce the model, rather than including the ultimate step of outputting text to anyone who asks, it was probably doomed to fail from the start and the decision makes perfect sense. And that doesn't seem too unlikely to have happened because realizing this would require the lawyer making the case to actually understand what training an LLM does.
This case didn't cover the copyright status of outputs. The ruling so far is just about the process of training itself.
IMHO the generative ML companies should be required to build a process tracking the influence of distinct samples on the outputs, and inform users of potential licensing status
Division of liability / licensing responsibility should depend on who contributes what to the prompt / generation. The less it takes for the user to trigger the model to generate an output clearly derived from a protected work, the more liability lies on the model operator. If the user couldn't have known, they shouldn't be liable. If the user deliberately used jailbreaks, etc, the user is clearly liable.
But you get a weird edge case when users unknowingly copy prompts containing jailbreaks, though
https://infosec.pub/comment/16682120
You are agreeing with the post you responded to. This ruling is only about training a model on legally obtained training data. It does not say it is ok to pirate works--if you pirate a work, no matter what you do with the infringing copy you've made, you've committed copyright infringement. It does not talk about model outputs, which is a very nuanced issue and likely to fall along similar analyses as music copyright imo. It only talks about whether training a model is intrinsically an infringement of copyright. And it isn't because anything else is insane and be functionally impossible to differentiate from learning a writing technique by reading a book you bought from an author. Even a model that has overfit training data, it is in no way recognizable to any particular training datum. It's hyperdimensioned matrix of numbers defining relationships between features and relationships between relationships.
It depends on your definition of "is". In reality it depends on the original art and how it's transformed. But legally it's whatever benefits capital (aka your boss). I wouldn't bet against your boss paying off the courts, lawyers, etc.
Ok so if I don't generate capital from it theres no crime? You can see how the original argument that all copying is copyright breach. Then you can infinitely dig into this - is my monitor copying pixels on my screen copying? What about the browser cache? So copyright can only be argued from the pov that breach has to be capital generating or direct damage creating like using that data for libel or something.