submitted 1 year ago by GnuLinuxDude@lemmy.ml to c/meta@lemmy.ml

Some context about this here: https://arstechnica.com/information-technology/2023/08/openai-details-how-to-keep-chatgpt-from-gobbling-up-website-data/

The robots.txt would be updated with this entry:

User-agent: GPTBot
Disallow: /

Obviously this is meaningless against non-openai scrapers or anyone who just doesn't give a shit.
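To see what that entry does (and does not) block, here is a quick sketch using Python's standard-library robots.txt parser; the URL path is made up for illustration:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed the entry from above directly instead of fetching it over the network.
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

print(rp.can_fetch("GPTBot", "/post/12345"))        # → False (blocked everywhere)
print(rp.can_fetch("SomeOtherBot", "/post/12345"))  # → True (other agents unaffected)
```

Note that this only tells a *cooperating* client what it may fetch; nothing forces a scraper to consult the file at all.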


I could be wrong, but wouldn't people be able to file class-action lawsuits against these companies? They are literally copying content without obtaining any prior explicit user consent. I'm also pretty sure Europeans have an upper hand here thanks to the GDPR's data protection rules, since European data is being extracted/harvested and transferred to US servers.

I could be wrong though

[-] totallynotarobot@lemmy.world 3 points 1 year ago

If they'll pay us when they scrape our content, sure.

[-] Geist_@lemmy.world 1 points 1 year ago

... Is that like a non-argument? How do you suppose they would pay sites, let alone site users to scrape their content?

[-] totallynotarobot@lemmy.world 1 points 1 year ago

Yes that's the point

[-] Hubi@feddit.de 3 points 1 year ago

Wouldn't they theoretically be able to set up their own instance, federate with all the larger ones and scrape the data this way? Not sure if blocking them via the robots.txt file is the most effective barrier in case that they really want the data.

[-] dreadedsemi@lemmy.world 12 points 1 year ago* (last edited 1 year ago)

Robots.txt is more of an honor system. If they respect it, they won't pull that trick.

[-] NightAuthor@beehaw.org 5 points 1 year ago

Robots.txt is just a notice anyways. Your scraper could just ignore it, no workaround necessary.

[-] Mechanismatic@lemmy.ml 2 points 1 year ago* (last edited 1 year ago)

I can understand privacy concerns, but I feel like it's inevitable that LLMs will be used to make lots of decisions, some possibly important, so wouldn't you want some content included in its training? For instance, would you want an LLM to be ignorant of FOSS because all the FOSS sites blocked it, and then a child asks an LLM for advice on software and gets recommended Microsoft and Apple products only?

[-] Geist_@lemmy.world 1 points 1 year ago* (last edited 1 year ago)

... It's probably going to recommend paid and non-FOSS apps and programs just on the basis that those companies will probably pay to be the top suggestions, just like Google ads. So no, I don't think that's a good enough reason. They can still scrape wikis if they need info on FOSS, imo. Those shouldn't (?) block AIs and other aggregators.

[-] 7heo@lemmy.ml 2 points 1 year ago

That won't stop OpenAI. We need actual blocking, on the server side. Problem is, with federation and all, it will be really, really difficult to do. And expensive.
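The server-side idea can be sketched at the application layer with nothing but the Python standard library. GPTBot is the only crawler name assumed here; a real deployment would more likely do this User-Agent filtering at the reverse proxy in front of the instance:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Names to refuse; GPTBot comes from the article above, the rest is up to you.
BLOCKED_AGENTS = ("GPTBot",)

def is_blocked(user_agent: str) -> bool:
    """Return True when the User-Agent header names a blocked crawler."""
    return any(bot in user_agent for bot in BLOCKED_AGENTS)

class BlockingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if is_blocked(self.headers.get("User-Agent", "")):
            # Unlike robots.txt, a 403 is enforced, not merely requested.
            self.send_error(403, "Crawler blocked")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello")

# To run locally (serves forever):
# HTTPServer(("127.0.0.1", 8080), BlockingHandler).serve_forever()
```

Of course this only works for the server doing the blocking; as the comment says, once content federates to other instances, each of them would have to block the crawler too.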

[-] maegul@lemmy.ml 1 points 1 year ago

I think this is a general question and problem for the whole fediverse. It easily leads to the question of whether, or even when, the fediverse will embrace closed, private, or invite-only spaces in order to secure some "human interaction only" social media.

this post was submitted on 20 Aug 2023
36 points (100.0% liked)
