announcement, anti llm scraper tool i made, boosts encouraged
as some of you may know, i made something called bombai (the name comes from bomba and ai) that's kinda like anubis or iocaine in purpose. those solutions are kind of "fine" already, but the thing about them is that anubis sucks for regular users too and isn't always effective, and iocaine usually relies on lists or similar things.
scrapers unfortunately come in all shapes and sizes, with new or hidden user agents all the time. after using anubis for a while, my forgejo got taken down again, so i went looking. iocaine seems like a good idea, but i want something that is sure to stop my git from going down even if i don't maintain it or the lists are incomplete.
what i made now does the following:
- very configurable detection entirely based on behaviour, without modifying site content
  - request counting
  - one fail = timeout, continuously resetting if attempts continue (this can be excellently combined with trap paths)
  - weighted by path and such
  - blobbing entire subnets together if desired (needed for alibaba's bot for example)
  - allows setting up "trap paths" that instantly flag someone for timeout upon visit
- customizable response
  - redirection to iocaine or other trap
  - zip bombs (small ones usually since most scrapers are smart enough to not decompress them fully otherwise - but it makes it cheaper on bandwidth either way)
  - maze similar to but less sophisticated than iocaine
  - plain http or html response from file
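
to make the detection part more concrete, here is a rough python sketch of the idea: weighted request counting per client, whole subnets counted as one client, trap paths, and a timeout that resets while requests keep coming. this is not bombai's actual code or config format. `Detector`, `budget`, `window`, `timeout`, `weights`, `trap_paths` and the prefix lengths are all made-up names and values for illustration.

```python
import ipaddress
import time
from collections import defaultdict, deque


class Detector:
    """Illustrative only: every name and default here is hypothetical."""

    def __init__(self, budget=10.0, window=60.0, timeout=300.0,
                 weights=None, trap_paths=(), v4_prefix=24, v6_prefix=64):
        self.budget = budget              # max summed weight per window
        self.window = window              # counting window in seconds
        self.timeout = timeout            # block duration in seconds
        self.weights = weights or {}      # path prefix -> cost per request
        self.trap_paths = set(trap_paths)
        self.v4_prefix = v4_prefix
        self.v6_prefix = v6_prefix
        self.hits = defaultdict(deque)    # client key -> (timestamp, weight)
        self.blocked_until = {}           # client key -> end of timeout

    def client_key(self, ip):
        # Group a whole subnet into one client by masking the address.
        addr = ipaddress.ip_address(ip)
        prefix = self.v4_prefix if addr.version == 4 else self.v6_prefix
        return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))

    def weight_for(self, path):
        # Longest matching path prefix wins; unlisted paths cost 1.0.
        best_len, best_weight = -1, 1.0
        for prefix, weight in self.weights.items():
            if path.startswith(prefix) and len(prefix) > best_len:
                best_len, best_weight = len(prefix), weight
        return best_weight

    def allow(self, ip, path, now=None):
        now = time.monotonic() if now is None else now
        key = self.client_key(ip)

        # Already timed out: any further attempt restarts the timeout.
        if self.blocked_until.get(key, float("-inf")) > now:
            self.blocked_until[key] = now + self.timeout
            return False

        # Trap path: one visit is enough to flag the client.
        if path in self.trap_paths:
            self.blocked_until[key] = now + self.timeout
            return False

        # Weighted request counting over a sliding window.
        hits = self.hits[key]
        hits.append((now, self.weight_for(path)))
        while hits and hits[0][0] <= now - self.window:
            hits.popleft()
        if sum(weight for _, weight in hits) > self.budget:
            self.blocked_until[key] = now + self.timeout
            hits.clear()
            return False
        return True
```

in this sketch, `allow("203.0.113.7", "/some/path")` returns `True` to serve the request normally and `False` when the customizable response should be sent instead. giving an expensive path prefix a high weight means only a handful of requests to it fit into the budget, and because the key is the subnet rather than the single address, a bot spreading requests over neighbouring addresses still ends up in one counter.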
it's worth mentioning explicitly that paths that are expensive for the server to provide (in storage or otherwise) can be limited extremely well this way, and of course most scrapers are blocked long before they request such a path for the first time.
if you're interested and need help setting it up, or you just want to chat about it, DO IT, I WOULD LOVE THAT.
if a bot that is capable of doing any harm makes it through, i immediately consider that a bug. open an issue and i'll debug either your config or the program itself.
i've been using this myself for a bit now, and it works excellently for me with my forgejo.
#opensource #forgejo