Cool project: "Nepenthes" is a tarpit to catch (AI) web crawlers.
"It works by generating an endless sequences of pages, each of which with dozens of links, that simply go back into a the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, optional Markov-babble can be added to the pages, to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse."
@tante I have mixed feelings.
Crawlers should respect robots.txt….
At the same time: there is clearly an emotionally driven bias around LLMs.
I feel weird about the idea of active sabotage. Considering it only targets bad actors… and considering that robots.txt files are, in my opinion, often too restrictive… the gray areas overlap a bit.
Why should we want to actively sabotage AI development? Wouldn’t that lead to possibly catastrophic results? Who benefits from dumber AI?
@altruios Hi, author of Nepenthes here.
I respect your discomfort, but honestly I'm angry enough about their behavior I want to see them burn. There's been far too much of this:
https://mastodon.social/@khobochka/113724300122190730
I cannot trust traffic to my site to be harmless, so I don't see any reason why something connecting to every site on the internet should be able to trust the site isn't harmful.
@tante
@thibaultamartin @aaron @altruios @tante: this is awesome and has only one flaw: Nepenthes domains can easily be identified and blacklisted.
I don’t know how to really solve this.
@ploum How do you propose identifying them?
Or suppose you want them to stop training on your data. They blocklist your domain and stop crawling it. Mission accomplished?
@thibaultamartin @altruios @tante