"I'm not proud of being a congenital pain in the ass. But I will take money for it."

Drowning in spiders

Fri 26 October 2018

There's this web site I'm responsible for maintaining, and it has this problem: it depends on an old piece of third-party software for which no good replacement exists, and that software does not deal at all well with being walked by web spiders. So of course it has a robots.txt file asking polite spiders to keep away. Sadly, there are a lot of badly written web crawlers these days that don't honor robots.txt, apparently more from incompetence than from malice.
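
For anyone who hasn't seen one, the polite request in robots.txt is about as simple as it gets. This is a minimal sketch, assuming the whole site is off-limits; the real file might only disallow the paths that old software serves:

User-agent: *
Disallow: /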

So I needed a cheap way to block this traffic. There are heuristic rate limiting tools like fail2ban which one can adapt for this purpose, but configuring them is complicated, and rate limiting has its own drawbacks.
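
For comparison, the fail2ban route would look something like the sketch below, assuming a Debian-style log location and fail2ban's stock apache-badbots filter; the filter's regexes, the log path, and the ban timing all need tuning before it does anything useful, which is exactly the sort of fiddling I wanted to avoid:

[apache-badbots]
# Jail definition for /etc/fail2ban/jail.local (paths and timings are
# illustrative, not taken from any real configuration)
enabled  = true
port     = http,https
filter   = apache-badbots
logpath  = /var/log/apache2/access.log
maxretry = 1
findtime = 600
bantime  = 86400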

Since this is just accidental traffic rather than a deliberate attack, it turned out we were able to reduce the load significantly with a trivial addition to the Apache configuration:

# Deny any client whose User-Agent admits to being a crawler
<If "%{HTTP_USER_AGENT} =~ /spider|bot|crawl|qwantify/i">
    Require all denied
</If>
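
A quick way to confirm the rule is doing its job (example.com standing in for the real host, and the User-Agent string made up for the test) is to ask for a page while pretending to be a crawler and check that Apache answers 403:

curl -s -o /dev/null -w '%{http_code}\n' -A 'examplebot/1.0' https://example.com/

An ordinary browser User-Agent should still get 200 from the same URL.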

Yes, this just blocks clients which identify themselves as crawlers, which they could stop doing at any time. But we're really just using this technique to patch around a voluntary standard which the crawlers in question don't seem to implement properly, so, whatever. It's cheap and it works well enough for the moment.

Time to sign the horse up for singing lessons again.