
The Nexus of Privacy
@thenexusofprivacy@infosec.exchange

As you've probably seen or heard, Dropsitenews has published a list (from a Meta whistleblower) of "the roughly 100,000 top websites and content delivery network addresses scraped to train Meta's proprietary AI models" -- including quite a few fedi sites. Meta denies everything, of course, but they routinely lie through their teeth, so who knows. In any case, whether or not the specific details in the report are accurate, it's certainly a threat worth thinking about.

So I'm wondering what defenses fedi admins are using today to try to defeat scrapers: robots.txt, user-agent blocking, firewall-level blocking of IP ranges, Cloudflare or Fastly AI scraper blocking, Anubis, stuff you don't want to disclose ...
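For the advisory layer, a minimal robots.txt along these lines covers the crawler user agents Meta documents publicly (meta-externalagent is the one Meta describes as gathering data for AI training; re-check their current docs, since names change) -- keeping in mind robots.txt only deters scrapers that choose to honor it:

```
# robots.txt -- advisory only; a scraper can simply ignore it.
# User agents below are Meta's publicly documented crawlers;
# re-check Meta's docs, since the list may change over time.
User-agent: meta-externalagent
User-agent: meta-externalfetcher
User-agent: FacebookBot
Disallow: /
```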
@deadsuperhero@social.wedistribute.org has some good discussion on We Distribute, and it would be very interesting to hear what various instances are doing.

And a few more open-ended questions:

- Do you feel like your defenses against scraping are generally holding up pretty well?

- Are there other approaches that you think might be promising that you just haven't had the time or resources to try?

- Do you have any language in your terms of service that attempts to prohibit AI training?
Here's @FediPact@cyberpunk.lol's post with a link to the Dropsitenews report and (in the replies) a list of fedi instances and CDNs that show up on the list.

https://cyberpunk.lol/@FediPact/114999480874284493

@fediverse@lemmy.world @fediversenews@venera.social

#MastoAdmin #Meta #FediPact


rancidrabbit
@rancidrabbit@anarchism.space

@thenexusofprivacy@infosec.exchange I'd like to know if they're using the API to scrape or the WebUI. Considering Anubis.

The Nexus of Privacy
@thenexusofprivacy@infosec.exchange

@rancidrabbit@anarchism.space good point, thanks. In practice both are potential vectors (as is RSS), so it's useful to lock them all down.
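To make that concrete, here's a hedged nginx sketch (assuming an nginx front end, and assuming Meta keeps using its documented user agents) that returns 403 for those agents across all three vectors -- HTML pages, the API, and RSS feeds -- since a path-scoped block would leave the others open:

```nginx
# Sketch for an nginx front end. The map block goes in the http {} context.
# The UA strings are Meta's publicly documented crawlers and may change,
# and a UA block only stops scrapers that identify themselves honestly.
map $http_user_agent $is_ai_scraper {
    default                  0;
    ~*meta-externalagent     1;
    ~*meta-externalfetcher   1;
    ~*facebookbot            1;
}

server {
    # ... existing listen / server_name / TLS config ...

    # One server-level check covers the web UI, /api/* endpoints,
    # and RSS feeds alike.
    if ($is_ai_scraper) {
        return 403;
    }
}
```

Since a user-agent block only catches scrapers that identify themselves, it's worth pairing with IP-level blocking rather than relying on it alone.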

rancidrabbit
@rancidrabbit@anarchism.space

@thenexusofprivacy@infosec.exchange Other instances need access to the API, so it's not possible to run checks there the way Anubis does to make sure there's a real browser on the other end. And they're willing to change their UserAgent strings to evade bans, so they'll likely change IPs too -- but I guess I should be looking for that list of netblocks.
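For the netblock hunt, one starting point is that prefixes registered to an ASN can be pulled from public routing registries (AS32934 is the one usually cited for Facebook/Meta -- verify independently before blocking on that basis). A rough Python sketch querying the RADB whois server:

```python
#!/usr/bin/env python3
"""Rough sketch: list the prefixes registered to Meta's ASN (AS32934 is
the one usually cited for Facebook/Meta -- verify independently) by
querying the public RADB whois server. Prints one CIDR per line."""
import socket

ASN = "AS32934"  # double-check before blocking anything on this basis

def radb_routes(asn: str) -> list[str]:
    # RADB speaks the whois protocol on TCP port 43; "-i origin <ASN>"
    # inverse-searches route objects whose origin is that ASN.
    with socket.create_connection(("whois.radb.net", 43), timeout=30) as sock:
        sock.sendall(f"-i origin {asn}\r\n".encode())
        chunks = []
        while data := sock.recv(4096):
            chunks.append(data)
    text = b"".join(chunks).decode(errors="replace")
    prefixes = set()
    for line in text.splitlines():
        # route/route6 objects look like: "route:      203.0.113.0/24"
        if line.startswith(("route:", "route6:")):
            prefixes.add(line.split()[-1])
    return sorted(prefixes)

if __name__ == "__main__":
    for prefix in radb_routes(ASN):
        print(prefix)
```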

The Nexus of Privacy
@thenexusofprivacy@infosec.exchange

@rancidrabbit@anarchism.space It's not clear whether or not Meta is changing UserAgent strings - I've seen at least one admin saying he's not seeing that. It's totally the kind of thing they'd do, but they get so much PR value out of being perceived as "good fedi citizens" that I don't know how they weigh the tradeoffs.

Here's a post from @cuchaz@gladtech.social with some tools to set up firewall-level blocks - https://gladtech.social/@cuchaz/115004304985099620
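And for a flavor of what those firewall-level blocks can look like, a minimal nftables sketch (assuming nftables; the prefix shown is a placeholder documentation range, not a real Meta netblock) that drops traffic from a set of prefixes like the ones a registry query produces:

```
# Minimal nftables sketch: drop inbound traffic from a set of prefixes,
# e.g. the output of a routing-registry query like the script above.
table inet scraper_block {
    set blocked_v4 {
        type ipv4_addr
        flags interval
        # 203.0.113.0/24 is a placeholder documentation range, not a
        # real Meta netblock -- substitute your own list.
        elements = { 203.0.113.0/24 }
    }
    chain input {
        type filter hook input priority 0; policy accept;
        ip saddr @blocked_v4 drop
    }
}
```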

Moved to @bonfire@bonfire.cafe
@bonfire@indieweb.social

@thenexusofprivacy@infosec.exchange

That would be great to know.

This probably requires blocking at every possible level to be sure (e.g. robots.txt, user agents, IP ranges...). And if some bots are using ActivityPub for scraping, could we also block their HTTP signature public keys?
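In principle, yes: every signed ActivityPub request carries a keyId in its Signature header naming the sender's public key, so a server could refuse requests from denylisted keys or key-hosting domains even when IPs and user agents change -- though a determined scraper can just mint fresh keys. A hypothetical sketch (the denylist and function names are invented for illustration):

```python
"""Hypothetical sketch: refuse inbound ActivityPub requests whose HTTP
signature keyId is on a denylist. All names here (BLOCKED_KEY_IDS and
so on) are made up for illustration, not from any real server codebase."""
from urllib.parse import urlparse

BLOCKED_KEY_IDS = {
    "https://scraper.example/actor#main-key",   # exact keys to refuse
}
BLOCKED_KEY_DOMAINS = {
    "scraper.example",                          # refuse any key hosted here
}

def key_id_from_signature(header: str) -> str | None:
    # Signature headers are comma-separated k="v" pairs, e.g.
    #   keyId="https://example.com/actor#main-key",algorithm="...",...
    # A naive split on commas is fine for typical headers.
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "keyId":
            return value.strip('"')
    return None

def should_reject(signature_header: str) -> bool:
    key_id = key_id_from_signature(signature_header)
    if key_id is None:
        return False  # unsigned requests are a separate policy question
    if key_id in BLOCKED_KEY_IDS:
        return True
    return urlparse(key_id).hostname in BLOCKED_KEY_DOMAINS
```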

We've prototyped a system that builds on Bonfire's circles/boundaries to define and enforce blocks at the instance, user, and post levels. Would love feedback and suggestions to make it stronger!

@rancidrabbit@anarchism.space @cuchaz@gladtech.social @FediPact@cyberpunk.lol