rulebots.txt (lemmy.world)
submitted 4 weeks ago by GroupNebula563@lemmy.world to c/196
[-] shikogo@pawb.social 70 points 4 weeks ago

I am confused: does this mean Reddit is not going to be searchable on search engines anymore?

[-] Aeri@lemmy.world 66 points 4 weeks ago

Oh no, Reddit is, like, the only way to have Google still be useful.

[-] germanatlas 54 points 4 weeks ago

Funnily enough, Google is also the only way to have Reddit be useful.

Their own search function has been nothing but garbage.

[-] morgunkorn@discuss.tchncs.de 43 points 4 weeks ago

That's the catch: Google made a deal with Reddit and is now the only search engine allowed to access its data for indexing. Every other search engine is cut off.
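
For anyone who hasn't looked at one: robots.txt is just a plain-text file at the site root that asks crawlers to stay away from certain paths, and it can single out crawlers by user agent. A sketch of how a site could let one crawler through while shutting out the rest (illustrative only, not Reddit's actual file):

```
# Illustrative robots.txt, not Reddit's real one: admit one named crawler,
# ask every other user agent to stay out of the whole site.
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
```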

[-] Vorticity@lemmy.world 27 points 4 weeks ago

Tell me there's an antitrust suit over this.

[-] GroupNebula563@lemmy.world 26 points 4 weeks ago

There's a suit against Google in general, so this may well be part of it.

[-] TriflingToad@lemmy.world 3 points 3 weeks ago

Really? DDG will still show me Reddit links. Did they have to make a web scraper or something?

[-] morgunkorn@discuss.tchncs.de 4 points 3 weeks ago

There's a cutoff date: anything indexed before the robots.txt was changed stays in the index.

[-] riodoro1@lemmy.world 31 points 4 weeks ago

We fucked the internet. It’s proprietary now.

[-] GroupNebula563@lemmy.world 11 points 4 weeks ago* (last edited 4 weeks ago)

we fucked the internet

kinky

[-] pupbiru@aussie.zone 8 points 4 weeks ago
[-] Swedneck@discuss.tchncs.de 2 points 3 weeks ago

cat5-o-nine-tails

[-] princessnorah 9 points 4 weeks ago

Good news! Google paid up and still has access, I'm pretty sure.

[-] GroupNebula563@lemmy.world 1 points 3 weeks ago

That's bad news; it means the internet is dying.

[-] princessnorah 2 points 3 weeks ago

Sorry, the /s was sort of implied.

[-] GroupNebula563@lemmy.world 2 points 3 weeks ago

Ah, sorry. I have trouble with that sometimes :P

[-] GroupNebula563@lemmy.world 9 points 4 weeks ago

Perhaps; it likely depends on the crawler, though.

[-] unexposedhazard@discuss.tchncs.de 12 points 4 weeks ago

Yeah, I don't think ignoring robots.txt is even illegal. They can of course just block your crawler's IP, but that would be a cat-and-mouse game that they would lose in the end.
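
Compliance really is voluntary; a polite crawler just asks first. A minimal sketch of that check using Python's standard urllib.robotparser (the bot name and URL here are made up for illustration):

```python
from urllib import robotparser

# A polite crawler downloads robots.txt and asks it for permission
# before fetching a page. Nothing enforces this; it is honour-system.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.reddit.com/robots.txt")
rp.read()  # fetch and parse the rules

user_agent = "ExampleBot"  # hypothetical user-agent string
url = "https://www.reddit.com/r/AskHistorians/"

if rp.can_fetch(user_agent, url):
    print("robots.txt allows fetching", url)
else:
    print("robots.txt asks us not to fetch", url)
```

A crawler that skips this check still gets the page, which is why the only real enforcement is the user-agent or IP blocking mentioned above.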

[-] JusticeForPorygon@lemmy.world 55 points 4 weeks ago

Not gonna lie, this ultimately seems like a win for the Internet. The years of troubleshooting solutions Reddit provided can (hopefully) be archived, but the fewer people rely on the site itself, the better. At least in my opinion.

[-] TriflingToad@lemmy.world 2 points 3 weeks ago

I disagree, kinda. Stack Overflow, the other option for questions, is a lot less user-friendly, and Lemmy has never shown up in search results for me. If something comes along and makes it simple, great! In the meantime, though, I just see a lot more ad-filled hellhole sites.

[-] Kojichan@lemmy.world 52 points 4 weeks ago

I remember finding Google's robots.txt when they first came out. It was a cute little ASCII-art drawing of a robot with a heart that said, "We love robots!"

[-] jabathekek@sopuli.xyz 50 points 4 weeks ago

An ancient text from the before-fore.

[-] GroupNebula563@lemmy.world 60 points 4 weeks ago

This is actually quite recent. The old one was much funnier and clearly had actual soul put into it.

[-] AsudoxDev@programming.dev 6 points 4 weeks ago

my shiny metal ass

[-] itsnicodegallo@lemm.ee 8 points 4 weeks ago

As annoying as this is, it's to prevent LLMs from training themselves using Reddit content, and that's probably the greater of the two evils.

[-] GroupNebula563@lemmy.world 37 points 4 weeks ago

That's all well and good, but how many LLMs do you think actually respect robots.txt?

[-] colin@lemmy.uninsane.org 14 points 4 weeks ago

From my limited experience, about half? I finally had to set up a robots.txt last month after Anthropic decided it would be OK to crawl my Wikipedia mirror from about a dozen different IP addresses simultaneously, non-stop, without any rate limiting, and bring it to its knees. Fuck them for it, but at least it stopped once I added the robots.txt.

Facebook, Amazon, and a few others, on the other hand, are ignoring that robots.txt. At least they have the decency to do it slowly enough that I'd never notice unless I checked the logs.
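
For reference, the kind of file being described probably looks something like this (a guess using the user-agent tokens those crawlers are generally known by, not colin's actual file):

```
# Illustrative robots.txt asking the big AI crawlers to stay away.
# Not the actual file; user-agent names may differ from what a given
# vendor's crawler currently sends.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: FacebookBot
Disallow: /
```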

[-] jbk@discuss.tchncs.de 32 points 4 weeks ago

I thought major LLMs ignored robots.txt.

[-] anas@lemmy.world 12 points 4 weeks ago

It’s to prevent LLMs from training themselves using Reddit content, unless they pay the party that took no part in creating said content.

FTFY
