Ask HN: AI bots everywhere – does anyone have a good whitelist for robots.txt?
My niche little site, http://golfcourse.wiki, seems to be very popular with AI bots. They have basically become most of my traffic. Most of them follow robots.txt, which is nice and all, but they are costing me non-trivial amounts of money.
I don't want to block most search engines. I don't want to block legitimate institutions like archive.org. Is there a whitelist that I could crib instead of pretty much having to update my robots file every damn day?
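For reference, the kind of allowlist people usually mean is a robots.txt that names the crawlers you trust and disallows everyone else. A minimal sketch (the tokens below are examples only; verify each crawler's documented user agent before relying on it, and note that bots which ignore robots.txt will ignore this too):

```
# Crawlers explicitly allowed to fetch everything
# (example tokens; confirm the real ones in each crawler's docs,
#  and add archive.org's crawler token here once you've verified it)
User-agent: Googlebot
User-agent: Bingbot
Disallow:

# Everything else is asked to stay out
User-agent: *
Disallow: /
```

Well-behaved crawlers use the group that matches their user agent most specifically, so the named bots see an empty Disallow (full access) while the catch-all group asks everyone else to stay away.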
I operate under the assumption that Google, OpenAI, Anthropic, ByteDance et al. either totally ignore it or only follow it selectively. I haven't touched my robots.txt in a while. Instead, I have nginx return empty and/or bogus responses when it sees those UA substrings.
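A rough sketch of that kind of nginx rule, for anyone who wants to copy the idea (the UA substrings and the 444 response are assumptions, not a complete or vetted list):

```nginx
http {
    # Flag requests whose User-Agent contains one of these substrings
    # (example substrings only; extend as your logs demand)
    map $http_user_agent $block_ai_bot {
        default        0;
        ~*GPTBot       1;
        ~*ClaudeBot    1;
        ~*Bytespider   1;
        ~*CCBot        1;
    }

    server {
        listen 80;

        if ($block_ai_bot) {
            return 444;  # nginx-specific: close the connection without a response
        }

        # ...rest of the site config
    }
}
```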
Right?! robots.txt is basically a map to the inner secret parts of your website, the "real" content. It ends up being treated as "rules for thee, but not for me" access control.
If you don't want your content crawled, you really need to put it behind a login of some sort (and watch it still get crawled regardless) or just not publish it at all. Sad state we're in.
I'm getting close to building a honeypot, because I want responsible robots to be able to crawl my site! It's about sharing information, after all. If it gets large enough to have a positive monetary benefit to me, I'd happily let the AI bots crawl it responsibly too.
It's a wiki, so I don't want it to be my information, ever. I'm just the one footing the bill right now.
Yeah, I used to do that. I'd put bad actors into a randomly generated, never-ending link farm. I'd put a special link in my robots.txt, and any bot that visited it I assumed was a bad actor; the link was not reachable from anywhere else.
The results were striking: a surprising number of bots just stupidly tumbled down the random-link rabbit hole. I'm not sure what they were searching for, but all they got from me was gobbledygook text and more links.
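For anyone curious what that looks like in practice, here's a minimal sketch of such a tarpit (the /trap/ path, port, and wording are made up; in a real deployment you'd proxy only the disallowed path to something like this, and rate-limit it so it doesn't eat your own bandwidth):

```python
# Minimal sketch of a never-ending link farm for bots that ignore robots.txt.
# Assumptions: /trap/ is Disallow'ed in robots.txt, never linked from real pages,
# and your web server proxies only that path to this process.
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

def gibberish(words=50):
    """A paragraph of random lowercase 'words'."""
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
        for _ in range(words)
    )

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every page is gibberish plus links to more random trap pages.
        links = " ".join(
            f'<a href="/trap/{random.randint(0, 10**9)}">more</a>'
            for _ in range(10)
        )
        body = f"<html><body><p>{gibberish()}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Record who fell in, so the IPs can be reviewed or banned later.
        print(f"trap hit from {self.client_address[0]}: {self.path}")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```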
Robots.txt is a joke, use Cloudflare's Block AI Bots feature if you are using Cloudflare.
More precisely described as "block non-sanctioned user agents".
Using that feature will ensure I never visit your site again.
It's a free wiki, so not visiting is a right you have, sure... but OP also shouldn't have to spend loads of money to host a free wiki while at the same time supporting something as bespoke as non-standard user agents.
> is a right
I believe they're saying that cloudflare will block them just for using a blacklisted client, even if they're legit users and not bots
On this point: if you turn on Bot Fight Mode, it also says it blocks verified bots.
But Bot Fight Mode says "there is a newer version of this setting", and it does not link to it.
Anyone have any insight into the blocked verified bots or the supposed newer version?
I’ve been using https://github.com/ai-robots-txt/ai.robots.txt on my Gitea instance.
I found that robots.txt does an OK job. It doesn't block everything, but I wouldn't run without one, because many really busy bots do follow the rules. It's simple, cheap, and does knock out a bunch of traffic. AWS WAF has some challenge rules that I found work great at stopping some of the other crappy and aggressive bots. And then the last line of defense is some additional nginx blocks. Those three layers really got things under control.
robots.txt will just be ignored by bad actors.
And the user agent sent is completely in the control of the bad actor. They can set their user agent to "Googlebot".
You would need something like a WAF from https://www.cloudflare.com/ or AWS.
People pointing at AI are right, but also, I've done a lot of scraping for personal sites and small side hustles, and not once have I even bothered to check for robots.txt.
Blocking Amazon, Huawei, China Telecom, and a few other ASNs did it for me. You should be able to cut and sort -u your log files to find the biggest offenders by user agent, if they truly obey robots.txt.
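If cut/sort gets fiddly, here's a small sketch of the same idea in Python (the log path and combined log format are assumptions; adjust to your setup):

```python
# Quick tally of user agents in a combined-format access log.
# The log path is an assumption; adjust to your setup.
from collections import Counter

counts = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        parts = line.split('"')
        if len(parts) >= 6:          # combined format: ... "referrer" "user-agent"
            counts[parts[5]] += 1

for agent, hits in counts.most_common(20):
    print(f"{hits:8d}  {agent}")
```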
Use Cloudflare (free) and just disable bots or add a captcha.
You'll probably also save money if you enable and configure free caching.
Will this still allow Google and Bing and Archive bots?
I quickly checked for you, and my Cloudflare account currently has the following option: "Block AI Bots" under xxx.com > Security > Bots, which I think does what you want.
I am not paying for Cloudflare, and it still lets me enable this option. If you use the (other) Bot Fight Mode or captcha options, then yes, you will block crawlers. AFAIK specific bot user agents can also be blocked (or allowed) using a Page Rule, which is also a free-tier feature.
Recently discussed ... https://marcusb.org/hacks/quixotic.html
It's so cute how people think anyone obeys robots.txt :)
And why should anyone believe you when you didn’t give any evidence of anything?
I personally don’t even bother with it for these reasons.
Robots.txt? Just send a fake user agent.
An old trick is to add a page to the robots disallow list, but the page should also be findable by crawlers. If a bot visits this page, you know it’s a bad actor.
I love this idea! Can anybody else with experience with this chime in? Does it actually work?
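I can't speak to how well it works, but the actionable half is turning trap hits into a block list. A small sketch, assuming a combined-format access log and a made-up trap path:

```python
# List the IPs that requested the robots.txt-disallowed trap page.
# TRAP_PATH and the log path are made-up examples; use whatever you disallowed.
from collections import Counter

TRAP_PATH = "/do-not-crawl/"
offenders = Counter()

with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if TRAP_PATH in line:
            offenders[line.split(" ", 1)[0]] += 1   # client IP is the first field

for ip, hits in offenders.most_common():
    print(f"{ip}  hit the trap {hits} times")
```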
Just ratelimit.
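In nginx terms that's roughly the following sketch (zone size, rate, and burst are assumptions to tune for your own traffic):

```nginx
http {
    # One shared zone keyed by client IP, averaging ~2 requests/second
    limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

    server {
        listen 80;

        location / {
            limit_req zone=perip burst=20 nodelay;  # allow short bursts, then refuse
            limit_req_status 429;
        }
    }
}
```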
Most of the major players seem to be actively ignoring robots.txt.
They claim they don't do this, but those claims amount to lies. The access logs show what they actually do, and it's gotten so egregious that some sites have buckled under resource exhaustion caused by these bad actors (effectively a DDoS attack).
Takes balls to put the URL in an HN post if you're looking to reduce traffic costs. Of course it's legit traffic for once, but I assume curiosity just got the better of them. On the constructive side: shield it with Cloudflare.
I'm fine with a burst of traffic… that's what it's designed for. What it's not designed for is being slowly pinged 24-7-365 on every single page (thousands of them) by every single robot (also thousands).
<snark> Maybe the bots are hallucinating off-by-one errors and then 25-8-367 isn’t all that bad. </snark>
Sorry, can't help - but really cool site!