Thread | Brutkey

Rather than scraping from sites directly, many of the addresses on Meta’s leaked list belong to Content Delivery Networks (CDNs) that are used by websites to cache and store information to improve site performance.

This is a critical point. An instance or website can defend itself in numerous different ways, including actively adversarial strategies, and still succumb to extraction - if they're using Cloudflare.

cc: @subMedia@kolektiva.social

ophiocephalic 🐍

@ophiocephalic@kolektiva.social

@FediPact@cyberpunk.lol
Another sickening consideration here. If they're scraping Cloudflare and CDNs rather than the sites directly, it's possible or likely they're not just extracting public posts, but all posts, including DMs.

@subMedia@kolektiva.social

Jon
@jdp23@neuromatch.social

Yeah, definitely grounds for concern. And yet another good reminder DO NOT USE FEDI FOR ANYTHING CONFIDENTIAL, that's what Signal is for!

All that being said, it's not clear to me whether Meta is scraping images from DMs or followers-only posts. If they're looking at neuromatch.social (or following the 'show original post' link for any profiles or public/unlisted posts that federate from neuromatch to instances that don't have their public feeds locked down), they'd get files from media.neuromatch.social. But is there a realistic way for them to scrape all images from media.neuromatch.social?

Anyhow, @jonny@neuromatch.social @moderation@neuromatch.social, according to this report Meta's scraping a list of domains that includes media.neuromatch.social. Meta denies it of course, but we all know that doesn't mean anything. It's not clear just what it means for domains to be on the list, and I'm not sure what to do in response -- blocking all known Meta domains and IPs at the network level is a good idea if that's not already happening, although it's easy enough for them to work around it.

@ophiocephalic@kolektiva.social @FediPact@cyberpunk.lol @subMedia@kolektiva.social
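
For anyone wondering what the IP-level blocking mentioned above might look like, here's a rough sketch, assuming you already have a list of CIDR ranges you want to refuse. The ranges below are reserved documentation placeholders, not real Meta addresses, and `is_blocked` and `BLOCKED_RANGES` are hypothetical names made up for illustration.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks.
# These are NOT real Meta addresses; substitute a list you trust.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked range."""
    addr = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so a mixed list of ranges can be checked in one pass.
    return any(addr in net for net in BLOCKED_RANGES)


if __name__ == "__main__":
    for ip in ("192.0.2.55", "198.51.100.7", "2001:db8::1"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

In practice a check like this would sit in a firewall or reverse proxy rather than in application code, and as noted above it's easy enough to work around. It also would not by itself cover copies served from a CDN.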