Quintin Par asked:
Is there an official API to iplists.com from where I can get the list of spiders?
My intention is to whitelist these IPs for site scraping.
I don’t know of any list of IP addresses for “good” search engine bots, and if one existed it would go out of date quickly, as you’ve already discovered.
One thing you can do is create a bot trap. This is simple in theory: create a page that is linked from your web site but hidden from normal users (e.g. via CSS tricks), then Disallow it in robots.txt. Wait a week, since legitimate search engines may cache robots.txt for that long, then start banning anything that hits the trap page (e.g. with fail2ban).
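As a rough illustration of the trap idea, here is a minimal Python sketch that scans access log lines and collects the client addresses that requested the trap page. The trap path `/bot-trap/`, the function name `trap_hits`, and the assumption that your server writes Common/Combined Log Format are all mine, not part of the original answer; adjust them to your setup.

```python
import re

# Hypothetical trap URL; it must also be listed under Disallow in robots.txt.
TRAP_PATH = "/bot-trap/"

# The robots.txt entry that tells well-behaved crawlers to stay away.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"

# Matches the start of a Common/Combined Log Format line (assumed format):
# client address, identity, user, [timestamp], "METHOD path ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:[A-Z]+) (\S+)')


def trap_hits(lines, trap_path=TRAP_PATH):
    """Return the set of client addresses that requested the trap path."""
    offenders = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        client, path = match.groups()
        # Ignore any query string when comparing the requested path.
        if path.split("?", 1)[0].startswith(trap_path):
            offenders.add(client)
    return offenders


if __name__ == "__main__":
    sample = [
        '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /bot-trap/ HTTP/1.1" 200 512',
        '198.51.100.2 - - [10/Oct/2023:13:55:40 +0000] "GET /index.html HTTP/1.1" 200 1024',
    ]
    print(ROBOTS_TXT)
    print(sorted(trap_hits(sample)))
```

The returned addresses could then be handed to whatever does the banning (a firewall rule, or a tool such as fail2ban watching the same log). Remember to let the Disallow rule sit for about a week before acting on hits, so that crawlers with a cached robots.txt are not caught.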
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.