Keeping Disrespectful Robots Out of Metafilter. February 21, 2025 2:41 PM Subscribe
In the recent thread regarding use of LLMs on AskMe et al., Lanark reported on the current state of Metafilter's robots.txt file and its limitations:
Related to this, the metafilter robots.txt currently blocks GPTBot, but allows ChatGPT-User, Anthropic-ai, Applebot-Extended, Google-Extended, ClaudeBot, Cohere-ai, PerplexityBot and probably a bunch more AI scrapers.
To only block one seems a bit inconsistent.
--
It'd be lovely if updating robots.txt were effective, but my understanding is that there are many AI spiders crawling the web these days that utterly ignore robots.txt.
This can even go so far as to become a form of DoS attack.
Do we currently have a policy / measures in place to prevent Metafilter content from being scooped up by random AI webcrawlers? If not, can we put one together and get it in place? Have there been any performance problems imposed on Metafilter by webcrawlers recently?
I'd like to suggest the use of nepenthes, iocaine or something similar.
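For reference, a fuller robots.txt along those lines might look like the sketch below. The user-agent tokens are the ones named above; a real file would want checking against each vendor's documentation, and (per the discussion that follows) it only restrains bots that choose to obey it.

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: anthropic-ai
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: cohere-ai
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /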
all we have to do is seed each comment with a logical paradox. no AI can resist thinking about them!
posted by mittens at 5:32 AM on February 22 [4 favorites]
I have a web site and have given this a fair bit of thought. My conclusion is that trying to block robots is a fool's errand - robots.txt only blocks well-behaved bots, and any sort of tarpit runs the very real risk of blocking the “good” bots that power search engines.
The really sneaky bots try very hard to mimic real browsers anyway.
If you are posting a comment on MeFi (or any site really) then that text will end up in an AI model. It is unavoidable; there really isn’t a technical solution.
The one exception is blocking bots that generate enough traffic to affect the operation of the site itself. These are worth blocking.
posted by AndrewStephens at 12:24 PM on February 22 [2 favorites]
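On that last point, finding the bots that are actually causing load is mostly a matter of counting requests per client in the access log. A minimal sketch in Python, assuming a common-log-format file (the access.log filename is hypothetical):

    # Count requests per client IP to spot crawlers generating
    # enough traffic to affect the site's operation.
    from collections import Counter

    counts = Counter()
    with open("access.log") as log:
        for line in log:
            client = line.split(" ", 1)[0]  # first field is the client IP
            counts[client] += 1

    # Show the ten heaviest clients.
    for ip, n in counts.most_common(10):
        print(f"{n:8d}  {ip}")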
I deal with bots professionally - it's been a significant part of my jobs for well over a decade. I go back to the original LaBrea with tar pitting. I'm in the camp that anything other than updating robots.txt carries more potential harm (see above) than benefits.
posted by Candleman at 1:44 PM on February 22 [1 favorite]
I don't think any of the AI scrapers respect robots.txt.
posted by Pyrogenesis at 2:36 AM on February 23
Westphalia posting was before its time.
posted by lucidium at 3:52 AM on February 23 [1 favorite]
The three big players (Apple, Google and Microsoft) have all stated they will follow robots.txt.
Some of the smaller startups are still in the 'move fast and break things' stage, so probably aren't following the rules.
While there is an expectation that scrapers read robots.txt and then scrape accordingly, I think in many cases they just scrape everything first and then (maybe) apply the robots.txt rules later, when processing the data.
This EFF article points towards the NYT robots.txt as a reasonable starting point you could copy and adapt.
Legally, I think the robots.txt file acts as an implied license, so it is worth having an entry even for scrapers like Brave that ignore it. They may also change in the future and start behaving.
posted by Lanark at 10:59 AM on February 25 [3 favorites]
Most AI crawlers (there are so, so many) will aggressively misrepresent themselves as normal user agents and change their UA string when they get blocked.
For my own website (which gets very little traffic, but it's the principle of the thing), I throttle traffic per IP address and issue 6-hour timeouts when the throttle limit is exceeded.
posted by adamrice at 12:27 PM on February 25 [1 favorite]
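A minimal sketch of that kind of per-IP throttle, in Python. The window and limit values here are made up, and a production version would normally live in the web server or a firewall rather than in application code:

    import time
    from collections import defaultdict, deque

    WINDOW = 60          # seconds over which requests are counted
    LIMIT = 30           # hypothetical per-IP request cap per window
    TIMEOUT = 6 * 3600   # the 6-hour timeout described above

    recent = defaultdict(deque)   # ip -> timestamps of recent requests
    banned = {}                   # ip -> time at which the ban expires

    def allow(ip):
        """Return True if a request from this IP should be served."""
        now = time.time()
        if banned.get(ip, 0) > now:
            return False
        q = recent[ip]
        q.append(now)
        while q and q[0] < now - WINDOW:   # drop requests outside the window
            q.popleft()
        if len(q) > LIMIT:                 # over the limit: start the timeout
            banned[ip] = now + TIMEOUT
            return False
        return True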
Robots.txt is neither legally binding nor a license.
https://www.robotstxt.org/faq/legal.html
posted by Candleman at 4:52 PM on February 25
I am capable of Googling, thank you.
As noted, dealing with this sort of thing has literally been part of my job for a long time. I will take the opinions of my very expensive corporate lawyers over a Wikipedia article or your opinion. I have quite a bit of experience with fair use, within US laws, and adversarial bots. I'm not one of the FAANG people here but I've done this for some of the most used web sites in existence.
Field v. Google is a singular case, specific to caching content as it was presented rather than creating derivative works, and the decision rested much more on fair use than on implied license. You can't just rip off someone's writing and claim that their failure to object via meta tags or robots.txt creates a license.
posted by Candleman at 12:44 PM on February 27 [1 favorite]
I'm sure everyone has better stuff to do, but someone could adapt LLM poisoning like nepenthes to Metafilter, or create real but hidden posts that contain or link to Markov generator outputs, maybe trained on Metafilter plus gamer forums, 4chan, and worse.
posted by jeffburdges at 5:08 PM on February 27
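For anyone curious, the Markov-generator half of that idea is only a few lines. A toy sketch in Python; the two-word order and the training corpus are arbitrary choices:

    import random
    from collections import defaultdict

    def train(text, order=2):
        """Map each run of `order` words to the words that follow it."""
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def babble(chain, order=2, length=50):
        """Walk the chain to emit plausible-looking nonsense."""
        out = list(random.choice(list(chain)))  # random starting key
        for _ in range(length):
            followers = chain.get(tuple(out[-order:]))
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)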
Google and others actively detect whether you feed their obvious bots significantly different data than an anonymous visitor sees, and will delist or derank you if they think you're cheating. As above, Metafilter has more to lose playing with this stuff than it has to gain.
posted by Candleman at 5:30 PM on February 27 [2 favorites]
From the nepenthes warning: ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS.
Removing Metafilter from Google would perhaps not be the best for site health.
posted by deadwax at 1:28 AM on February 22 [5 favorites]