Microsoft and Reddit Are Fighting About Why Bing’s Crawler Is Blocked on Reddit (www.404media.co)

submitted 2 months ago by theangriestbird@beehaw.org to c/technology@beehaw.org


[-] theangriestbird@beehaw.org 59 points 2 months ago

The beef between Microsoft and Reddit came to light after I published a story revealing that Reddit is currently blocking every crawler from every search engine except Google, which earlier this year agreed to pay Reddit $60 million a year to scrap the site for its generative AI products.

I know the author meant "scrape", but sometimes it really does feel like AI is just scrapping the old internet for parts.

[-] cybermass@lemmy.ca 15 points 2 months ago

Yeah, aren't like over half of reddit comments/posts by bots these days?

[-] originalucifer@moist.catsweat.com 13 points 2 months ago

yep, and the longer that happens the less value to the dataset. it's becoming aged.

[-] RiikkaTheIcePrincess@pawb.social 13 points 2 months ago* (last edited 2 months ago)

[Joke] See, Reddit's doing a nice thing here! They're making sure nobody ends up toxifying their own dataset by using Reddit's garbage heap of bot posts!

[-] originalucifer@moist.catsweat.com 5 points 2 months ago

google needs a checkbox of 'ignore reddit' im sick of having to manually add -reddit

[-] Cube6392@beehaw.org 13 points 2 months ago

Hey good news. Turns out you can use bing and not get back Reddit results

[-] originalucifer@moist.catsweat.com 3 points 2 months ago

yeah but then i get back bing results. no one needs that

[-] doctortofu@reddthat.com 44 points 2 months ago

I can see why spez is upset about scrappers and search engines - imagine a company profiting from people creating lots of data, just hoarding it and using it for free, and not paying those people a cent, preposterous, right? :)

[-] Ilandar@aussie.zone 28 points 2 months ago

“This was Microsoft's choice, not ours,” Reddit spokesperson Tim Rathschmidt told me in an email. “We are and have been open to agreements with companies who are open about their intentions and commit to treat us and our users fairly. If Bing or others want access within our policies, without training, without summarization, and without selling it to others, we are and have always been open to that. If they want to build a business selling Reddit data or using the data for training, we could be open to that, but it’s a commercial conversation.”

Mojeek, the search engine that initially told me that Reddit was blocking all search engines but Google, and which was unable to get in touch with Reddit at the time, told me Reddit got in touch after that story was published. Mojeek said it was unable to share any details about the deal because of an NDA, but confirmed that Reddit wanted to get paid for letting Mojeek crawl the site, even though Mojeek does not have any AI products.

This doesn't add up and it makes me wonder what else Google and reddit agreed upon. This situation benefits no one except Google, as far as I can tell. If reddit wants to milk search engines, and Microsoft is willing and able to pay (which I assume they are), there is no reason for the deal to not go ahead like it did with Google. Kinda makes my brain start going down the conspiracy path, but then again it's hardly unbelievable that Google would pursue anti-competitive business strategies, particularly when it comes to generative AI.

[-] Moonrise2473@feddit.it 28 points 2 months ago* (last edited 2 months ago)

A search engine can't pay a website for having the honor of bringing them visits and ad views.

Fuck reddit, get delisted, no problem.

Weird that google is ignoring their robots.txt though.

Even if they pay them for being able to say that glue is perfect on pizza, having

User-agent: *
Disallow: /

should block googlebot too. That means google programmed an exception on googlebot to ignore robots.txt on that domain and that shouldn't be done. What's the purpose of that file then?

Because robots.txt works entirely on the honor system (a bot has no need to pretend to be another bot, it could just ignore the file), it should be:

User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
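
To make the difference between the two files concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The page URL and the `parse_rules` helper are assumptions made for illustration, and this only shows how one well-behaved parser reads the rules; it says nothing about what any real crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical page URL, used only for matching; nothing is fetched.
PAGE = "https://example.com/some/page"

# The blanket rule quoted above: every crawler is disallowed everywhere.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

# The variant with an explicit Googlebot group. An empty Disallow
# means "nothing is disallowed" for that group.
GOOGLEBOT_EXCEPTION = """\
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
"""


def parse_rules(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


for name, text in [("blanket", BLANKET_BLOCK), ("exception", GOOGLEBOT_EXCEPTION)]:
    rules = parse_rules(text)
    for agent in ("Googlebot", "bingbot"):
        print(name, agent, rules.can_fetch(agent, PAGE))
```

Under this parser, the blanket file denies both agents, while the second file allows Googlebot and still denies bingbot. Either way the file is only advisory: a crawler that chooses to ignore it is not stopped by anything shown here.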

[-] MrSoup@lemmy.zip 28 points 2 months ago

I doubt Google respects any robots.txt

[-] DaGeek247@fedia.io 26 points 2 months ago

My robots.txt has been respected by every bot that visited it in the past three months. I know this because I wrote a page that IP bans anything that visits it, and I also put it as a not allowed spot in the robots.txt file.

I've only gotten like, 20 visits in the past three months though, so, very small sample size.

[-] mozz@mbin.grits.dev 14 points 2 months ago

I know this because I wrote a page that IP bans anything that visits it, and I also put it as a not allowed spot in the robots.txt file.

This is fuckin GENIUS

[-] Moonrise2473@feddit.it 8 points 2 months ago

only if you don't want any visits except from yourself, because this removes your site from any search engine

should write a "disallow: /juicy-content" and then block anything that tries to access that page (only bad bots would follow that path)
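
The trap idea discussed in this sub-thread could look roughly like the sketch below. It is not the setup anyone here actually runs: the `HoneypotBanList` class, the in-memory ban set, and the status codes are all assumptions, and only the `/juicy-content` path comes from the comment above.

```python
# Hypothetical trap path. It is advertised as off-limits in robots.txt,
# so a crawler that honors the file should never request it.
TRAP_PATH = "/juicy-content"

ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"


class HoneypotBanList:
    """Ban any client IP that requests the disallowed trap path."""

    def __init__(self, trap_path: str = TRAP_PATH) -> None:
        self.trap_path = trap_path
        self.banned: set[str] = set()  # in-memory only; a real site would persist this

    def handle(self, ip: str, path: str) -> int:
        """Return the HTTP status code this request should receive."""
        if ip in self.banned:
            return 403  # already banned: refuse everything
        if path == self.trap_path or path.startswith(self.trap_path + "/"):
            self.banned.add(ip)  # ignored robots.txt, so ban from now on
            return 403
        return 200  # normal request, including /robots.txt itself


bans = HoneypotBanList()
print(bans.handle("203.0.113.7", "/robots.txt"))     # reading robots.txt is fine
print(bans.handle("203.0.113.7", "/juicy-content"))  # hitting the trap triggers a ban
print(bans.handle("203.0.113.7", "/"))               # banned for every later request
```

Note that requesting robots.txt itself is never punished; only the path that robots.txt asks bots to avoid triggers the ban, which is why compliant search engines are unaffected.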

[-] Miaou@jlai.lu 23 points 2 months ago

That's exactly what was described..?

[-] Moonrise2473@feddit.it 3 points 2 months ago

Oops. As a non-native English speaker I misunderstood what he meant. I understood wrongly that he set the server to ban everything that asked for robots.txt

[-] Zoop@beehaw.org 2 points 2 months ago

Just in case it makes you feel any better: I'm a native English speaker who always aced the reading comprehension tests back in school, and I read it the exact same way. Lol! I'm glad I wasn't the only one. :)

[-] mozz@mbin.grits.dev 5 points 2 months ago

You need to read again the thing that was described, more carefully. Imagine for example that by “a page,” the person means a page called /juicy-content or something.

[-] MrSoup@lemmy.zip 2 points 2 months ago

Thank you for sharing

[-] thingsiplay@beehaw.org 2 points 2 months ago* (last edited 2 months ago)

Interesting way of testing this. Another would be to query the search engines with site:your.domain added (Edit: Typo corrected. Of course without the - in -site:, otherwise you will exclude it, not limit to it.) to show results from your site only. Not an exhaustive check, but another tool to test this behavior.

[-] Moonrise2473@feddit.it 10 points 2 months ago

for ordinary sites they do respect it, and they even warn a webmaster who submits a sitemap containing paths that are disallowed in robots.txt
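
A rough local version of that sitemap check can be sketched with Python's standard `urllib.robotparser`. The `blocked_sitemap_urls` helper, the sample rules, and the example.com URLs are all made up for illustration; this is not how any search engine's webmaster tooling is implemented.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemap_urls(robots_txt, sitemap_urls, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt and sitemap contents.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

sitemap_urls = [
    "https://example.com/",
    "https://example.com/private/report.html",
]

conflicts = blocked_sitemap_urls(robots_txt, sitemap_urls)
for url in conflicts:
    print("listed in sitemap but disallowed by robots.txt:", url)
```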

[-] jarfil@beehaw.org 3 points 2 months ago

Google is paying for the use of Reddit's API, not for scraping the site.

That's Reddit's new business model: want "their" (users') content, then pay for API access.

[-] ssm@lemmy.sdf.org 20 points 2 months ago

I hope all big corporate SEO trash follows suit, once they've all filtered themselves out for profit we can hopefully get some semblance of an unshittified search experience.

[-] CanadaPlus@lemmy.sdf.org 2 points 2 months ago

Man, wouldn't that be nice. There's too much money in appearing on searches for me to ever expect that to happen, though.

[-] TehPers@beehaw.org 11 points 2 months ago

Joke's on Reddit. I've been blocking their results in the search engine I use for months!

I wonder if this will end up being pursued as an antitrust case. If anything, it'll reduce traffic to Reddit from non-Google users, so hopefully that kills them off just a little faster.

[-] AVincentInSpace@pawb.social 10 points 2 months ago

Come on. Be realistic. Chrome has 70% browser market share and people are already used to tacking "Reddit" onto the end of their search queries to find useful information. If anything this will have no effect besides steering people towards Google.

[-] TehPers@beehaw.org 5 points 2 months ago

People on Chrome adding Reddit to their Google searches already use Google. People not using Google who don't search "Reddit" are going to see fewer Reddit results.

No, this won't kill Reddit, but it certainly isn't helping them get more traffic.

[-] Cube6392@beehaw.org 2 points 2 months ago

They don't care about traffic. They care about the existing barrel of data for the data models

[-] lemmyvore@feddit.nl 2 points 2 months ago

...I thought that was the whole point of Spez blocking other spiders.

this post was submitted on 01 Aug 2024
