submitted 3 weeks ago* (last edited 3 weeks ago) by thenexusofprivacy to c/fediverse@piefed.social

cross-posted from: https://infosec.exchange/users/thenexusofprivacy/statuses/115012347040350824

As you've probably seen or heard, Dropsitenews has published a list (from a Meta whistleblower) of "the roughly 100,000 top websites and content delivery network addresses scraped to train Meta's proprietary AI models" -- including quite a few fedi sites. Meta denies everything of course, but they routinely lie through their teeth so who knows. In any case, whether or not the specific details in the report are accurate, it's certainly a threat worth thinking about.

So I'm wondering what defenses fedi admins are using today to try to defeat scrapers: robots.txt, user-agent blocking, firewall-level blocking of IP ranges, Cloudflare or Fastly AI scraper blocking, Anubis, stuff you don't want to disclose ... @deadsuperhero@social.wedistribute.org has some good discussion on We Distribute. It would be very interesting to hear what various instances are doing.

And a couple of more open-ended questions:

  • Do you feel like your defenses against scraping are generally holding up pretty well?

  • Are there other approaches that you think might be promising that you just haven't had the time or resources to try?

  • Do you have any language in your terms of service that attempts to prohibit training for AI?

Here's @FediPact's post with a link to the Dropsitenews report and (in the replies) a list of fedi instances and CDNs that show up on the list.

https://cyberpunk.lol/@FediPact/114999480874284493

@fediverse @fediversenews

#MastoAdmin #Meta #FediPact

top 12 comments
[-] jeena@piefed.jeena.net 10 points 3 weeks ago* (last edited 3 weeks ago)

The only thing I've been doing on my blog (not on my piefed instance yet, but probably should) was user agent filtering:

# Return 403 to any client whose User-Agent matches one of these bot names (~* is case-insensitive)
if ($http_user_agent ~* (SemrushBot|AhrefsBot|PetalBot|YisouSpider|Amazonbot|VelenPublicWebCrawler|DataForSeoBot|Expanse,\ a\ Palo\ Alto\ Networks\ company|BacklinksExtendedBot|ClaudeBot|OAI-SearchBot)) {
    return 403;
}
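
For anyone who wants to try a list like this against their own access logs before deploying it, here is a minimal Python sketch of the same check. It is not part of the nginx config: the names `BLOCKED_AGENTS` and `is_blocked` are made up for illustration, and it only approximates the rule above by doing a case-insensitive substring search.

```python
import re

# Same bot names as the nginx rule above, copied by hand.
BLOCKED_AGENTS = [
    "SemrushBot", "AhrefsBot", "PetalBot", "YisouSpider", "Amazonbot",
    "VelenPublicWebCrawler", "DataForSeoBot",
    "Expanse, a Palo Alto Networks company",
    "BacklinksExtendedBot", "ClaudeBot", "OAI-SearchBot",
]

# re.IGNORECASE mirrors nginx's case-insensitive "~*" operator;
# re.escape keeps spaces and hyphens in the names literal.
BLOCKED_RE = re.compile(
    "|".join(re.escape(name) for name in BLOCKED_AGENTS),
    re.IGNORECASE,
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent string contains any listed bot name."""
    return BLOCKED_RE.search(user_agent) is not None
```

Running `is_blocked` over the User-Agent column of a log shows which requests the rule would have rejected. It says nothing about bots that send a browser-like User-Agent, since those never match the list.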

[-] thenexusofprivacy 1 points 3 weeks ago

Thanks! Does it seem like that's effective, or are you getting the feeling that the bots are just changing their user agent to get around it?

[-] jeena@piefed.jeena.net 2 points 3 weeks ago

I got the list from a friend who checks his logs every now and then and adds new bot names there.

[-] rimu@piefed.social 8 points 3 weeks ago* (last edited 3 weeks ago)

There are no PieFed instances in that list. Maybe because Meta is blocked in the default PieFed robots.txt or maybe PieFed is too obscure.

The robots.txt on Mastodon and Lemmy is basically useless.

The Mbin robots.txt is massive but does not block Meta's crawler so presumably it is not being kept up to date.

Any fedi devs reading this: add these

User-agent: meta-externalagent  
User-agent: Meta-ExternalAgent  
User-agent: meta-externalfetcher  
User-agent: Meta-ExternalFetcher  
User-agent: TikTokSpider  
User-agent: DuckAssistBot  
User-agent: anthropic-ai  
Disallow: /  
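
A quick way to sanity-check a group like this is to feed it to a robots.txt parser. The sketch below uses Python's standard-library `urllib.robotparser`; the helper name `allowed` is made up for illustration. It only shows how this one parser reads the rules: robots.txt is advisory, so a crawler that ignores it is unaffected, and other parsers may match agent names differently.

```python
from urllib.robotparser import RobotFileParser

# The group suggested above: several User-agent lines sharing one Disallow rule.
ROBOTS_TXT = """\
User-agent: meta-externalagent
User-agent: Meta-ExternalAgent
User-agent: meta-externalfetcher
User-agent: Meta-ExternalFetcher
User-agent: TikTokSpider
User-agent: DuckAssistBot
User-agent: anthropic-ai
Disallow: /
"""


def allowed(user_agent: str, path: str = "/") -> bool:
    """Return True if this robots.txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

With this parser, `allowed("meta-externalagent")` is false for every path, while an agent that is not named in the group (for example a regular browser) is still allowed.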
[-] rhythmisaprancer@piefed.social 3 points 3 weeks ago

@originalucifer@moist.catsweat.com in case you are interested

[-] CameronDev@programming.dev 6 points 3 weeks ago

Just to clarify your question, are you concerned about Meta's scrapers causing additional server load, or about them stealing the content?

[-] the_abecedarian@piefed.social 5 points 3 weeks ago

Not OP but I'd be concerned about both

[-] CameronDev@programming.dev 7 points 3 weeks ago* (last edited 3 weeks ago)

The nature of federation makes the latter basically impossible to prevent. All data is federated freely, so all Meta has to do is spin up an instance and the data is handed directly to them.

[-] the_abecedarian@piefed.social 4 points 3 weeks ago

Yeah. It's really just making them do that kind of work. We can block those instances; ofc it won't truly stop them, but it'll change the cost-benefit analysis.

That plus Anubis or something, and whatever future tech arises

[-] thenexusofprivacy 3 points 3 weeks ago

Agreed, it's all about changing the cost-benefit analysis, great framing. And also agreed, blocking -- and/or shifting to allow-list federation or something more nuanced (to deal with the point @CameronDev@programming.dev makes about Meta just being able to spin up a new instance) -- is a really important complement to preventing scraping.
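
The difference between blocking and allow-list federation can be shown with a small Python sketch. It is purely illustrative and not how any fediverse server implements it: the `FederationPolicy` class, its fields, and the example domains are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FederationPolicy:
    """Hypothetical policy object: decides whether to federate with a domain."""
    mode: str = "blocklist"  # either "blocklist" or "allowlist"
    blocked: set[str] = field(default_factory=set)
    allowed: set[str] = field(default_factory=set)

    def accepts(self, domain: str) -> bool:
        # Normalise case and a trailing dot so "Example.COM." equals "example.com".
        domain = domain.lower().rstrip(".")
        if self.mode == "allowlist":
            # Unknown domains are rejected until an admin adds them.
            return domain in self.allowed
        # Blocklist mode: unknown domains are accepted by default.
        return domain not in self.blocked


# A brand-new instance is unknown to both lists.
blocklist = FederationPolicy(mode="blocklist", blocked={"scraper.example"})
allowlist = FederationPolicy(mode="allowlist", allowed={"piefed.social"})
```

Here `blocklist.accepts("new-scraper.example")` is true, because nobody has blocked the fresh domain yet, while `allowlist.accepts("new-scraper.example")` is false. That is the cost shift under discussion: with an allow-list, a new instance has to be approved before it receives anything.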

[-] rimu@piefed.social 2 points 3 weeks ago

Only from the moment they start the instance. That doesn't give them historical data.

[-] thenexusofprivacy 2 points 3 weeks ago

Yeah I think most admins are concerned about both. And whether or not it's "stealing" (in the legal sense), a lot of people want to keep their content and personal information out of these AI systems.

this post was submitted on 11 Aug 2025
27 points (100.0% liked)
