Use a honey-pot to catch crawlers that don't obey robots.txt, and send yourself an email for every new IP address it traps so you can catch any false positives.
Redirect requests with no user agent specified to your honey-pot URL.
Do the same for known bad or useless user agents: wget, curl, Bing, etc.
Validate user agents that claim to be good crawlers via reverse DNS lookup; cache the IP addresses that fail and redirect them to the honey-pot URL.
Filter the remainder by request rate over a given period (cache request counts in a database). More than 1 request per second sustained for 20 seconds, or more than 200 requests per day, marks a client as a bot: redirect it to the honey-pot URL.
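The reverse-DNS validation step can be sketched as follows. This is a hypothetical helper, not the author's code: the lookup functions are injectable so the check can be exercised without network access, and the allowed domain suffixes depend on which crawler you are verifying (Google publishes `googlebot.com`/`google.com` for its crawlers).

```python
import socket

# Forward-confirmed reverse DNS: resolve the IP to a hostname, check the
# hostname belongs to the crawler's published domain, then resolve the
# hostname back and confirm the original IP is among the results.
def is_genuine_crawler(ip, allowed_suffixes,
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    try:
        host = reverse(ip)[0]              # IP -> hostname
    except (socket.herror, OSError):
        return False
    if not host.endswith(tuple(allowed_suffixes)):
        return False                       # e.g. not *.googlebot.com
    try:
        forward_ips = forward(host)[2]     # hostname -> IPs (must round-trip)
    except (socket.gaierror, OSError):
        return False
    return ip in forward_ips
```

An IP that fails the check goes into the bad-IP cache and gets redirected to the honey-pot URL.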
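The rate rules above can be sketched in-memory like this; a real site would persist the counts in a database as described, and would also reset the daily counters, which this sketch omits. Names and structure are illustrative.

```python
import time
from collections import defaultdict, deque

BURST_WINDOW = 20    # seconds
BURST_LIMIT = 20     # >1 request/second sustained for 20 seconds
DAILY_LIMIT = 200    # requests per day

_hits = defaultdict(deque)   # ip -> timestamps within the burst window
_daily = defaultdict(int)    # ip -> requests so far today (reset not shown)

def looks_like_bot(ip, now=None):
    now = time.time() if now is None else now
    q = _hits[ip]
    q.append(now)
    # Drop timestamps that have aged out of the burst window.
    while q and now - q[0] > BURST_WINDOW:
        q.popleft()
    _daily[ip] += 1
    return len(q) > BURST_LIMIT or _daily[ip] > DAILY_LIMIT
```

A request that trips either rule gets redirected to the honey-pot URL instead of the real page.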
I've been doing this on a site with around 150K URLs for the past 4 years. I have about 600 URLs blocked and around 35 user-agents.
One of my jobs is to scrape websites, and we have permission to scrape the sites we work on. Sadly, many of the companies we work with don't have full-time developers, or they have rules set up to stop others from scraping their site, which makes things hard on us.
Other than Bad Behavior, I have circumvented all of the above suggestions. (I don't know what Bad Behavior is, and as far as I know I have not encountered it.)
We falsify user agents to claim we are browsers, Google, or whatever is needed.
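Faking the user agent is a one-liner with most HTTP clients; here is a sketch with Python's standard library (the UA string is a made-up example of a browser-like value, not anything the author specifies):

```python
from urllib.request import Request

# A browser-looking User-Agent replaces urllib's default
# "Python-urllib/x.y", which many sites block outright.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0 Safari/537.36")

req = Request("http://example.com/", headers={"User-Agent": BROWSER_UA})
```

Passing `req` to `urllib.request.urlopen` would send the request with the spoofed header.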
We script lynx, or write programs that type in a login/password and navigate from page to page.
We write programs that follow the __doPostBack(...) calls (with their arguments) that ASP.NET generates for multi-page tables.
We add in random time intervals, both to ease the load on some companies' sites and to circumvent rate-based blocks on others.
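A random delay between requests is trivial to add; this hypothetical helper both throttles the load and avoids the fixed request rate that rate-limit rules key on (bounds are illustrative):

```python
import random
import time

def polite_pause(min_s=2.0, max_s=10.0):
    # Sleep for a random interval so requests arrive neither too fast
    # nor at a detectable fixed cadence.
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```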
We download and parse PDF, DOC, and XLS files, and even some images.
We have gone as far as scraping Flash apps using screenshots and OCR.
The list goes on. My point is that nothing is completely safe.
Check out bad-behavior: http://bad-behavior.ioerror.us/