IP address of spiders and “official” web bots

Quintin Par asked:

Is there an official API to iplists.com from where I can get the list of spiders?

My intention is to whitelist these IPs for site scraping.

My answer:

I don’t know of any list of IP addresses for “good” search engine bots, and if one existed, it would go out of date very quickly, as you’ve already discovered.

One thing you can do is create a bot trap. This is simple in theory: you create a page that is linked from your web site but hidden from normal users (e.g. via CSS tricks), and then Disallow it in robots.txt. Well-behaved crawlers honor the Disallow rule and human visitors never see the link, so anything that requests the page is almost certainly a bot ignoring robots.txt. Wait a week first, since legitimate search engines may cache robots.txt for that long, then start banning anything that hits the trap page (e.g. with fail2ban).
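
To make the detection step concrete, here is a minimal Python sketch that scans access log lines and collects the client IPs that requested the trap page. It is only an illustration: the trap path /bot-trap/, the function name trap_hits and the sample log lines are all made up for this example, and it assumes your web server writes the usual common/combined log format. In practice a tool like fail2ban does this matching (and the banning) for you.

```python
import re

# Hypothetical trap URL (an assumption for this example); use any path
# that is Disallowed in robots.txt and that real visitors never see.
TRAP_PATH = "/bot-trap/"

# Matches the start of a common/combined log format line:
# client IP, identity, user, [timestamp], "METHOD path ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)')


def trap_hits(lines, trap_path=TRAP_PATH):
    """Return the set of client IPs that requested the trap path."""
    offenders = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ip, path = match.groups()
        # Ignore any query string when comparing paths.
        if path.split("?", 1)[0] == trap_path:
            offenders.add(ip)
    return offenders


# Made-up sample lines using documentation-only IP ranges.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /bot-trap/ HTTP/1.1" 200 512 "-" "BadBot/1.0"',
    '198.51.100.4 - - [10/Oct/2023:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
print(sorted(trap_hits(sample)))
```

Only the first sample address requested the trap page, so it is the only one reported. Remember not to act on any hits until the one-week grace period has passed.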

View the full question and any other answers on Server Fault.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.