I'm about to put a ton of really useful, unique and valuable (in the CPC and usefulness sense) content online. Scrapers are very quickly going to want it all. How do you guys protect your content? [keeping in mind that suing someone offshore is difficult]
My current approach is to limit all non-USA crawlers, and any crawler that doesn't identify itself as Google, Bing or someone else I care about. I'm planning on using nginx.conf and the MaxMind country database to do this.
By limit I mean limiting each IP to viewing a maximum of 50 to 100 unique documents per day.
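The "N unique documents per IP per day" cap can live in application code rather than nginx. A minimal in-memory sketch (all names are illustrative; a real deployment would back this with Redis or a DB so counts survive restarts and are shared across workers):

```python
from collections import defaultdict


class DailyDocLimiter:
    """Cap the number of unique documents one IP may view per day."""

    def __init__(self, max_docs_per_day=50):
        self.max_docs = max_docs_per_day
        self.seen = defaultdict(set)  # (ip, day) -> set of document paths

    def allow(self, ip, doc_path, today):
        """Return True if this IP may view doc_path on the given day."""
        docs = self.seen[(ip, today)]
        if doc_path in docs:
            return True          # re-viewing an already-counted doc is free
        if len(docs) >= self.max_docs:
            return False         # over the daily unique-document cap
        docs.add(doc_path)
        return True
```

You'd call `allow(ip, path, date.today())` per request and serve a 429 (or the honey-pot page) when it returns False.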
Any suggestions would be much appreciated.
Check out bad-behavior: http://bad-behavior.ioerror.us/
Otherwise:
Use a honey-pot to catch crawlers that don't obey robots.txt. Send yourself an email for every new IP address caught so you can spot any false positives.
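The honey-pot idea above can be sketched in a few lines: list a trap path in robots.txt's Disallow, so only crawlers ignoring it ever request it, then record and report any new IP that hits it. The trap path and function names here are hypothetical, and the notifier is injectable so a real deployment can pass an email-sending function:

```python
# Path listed under "Disallow:" in robots.txt; no legitimate link points here.
TRAP_PATH = "/search-index-2/"

blocked_ips = set()


def handle_request(ip, path, notify=print):
    """Block and report any IP that fetches the trap URL.

    Returns False when the request should be refused. notify is called once
    per newly caught IP -- wire it to email so false positives get noticed.
    """
    if path.startswith(TRAP_PATH) and ip not in blocked_ips:
        blocked_ips.add(ip)
        notify(f"honey-pot hit: new IP {ip}")
    return ip not in blocked_ips
```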
Redirect requests with no user agent specified to your honey-pot URL.
Same for known bad/useless user agents: wget, curl, Bing, etc.
Validate supposedly good crawler user agents via reverse DNS lookup (and forward-confirm the result, since reverse records can be spoofed); cache bad IP addresses and redirect them to the honey-pot URL.
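A sketch of that validation step: reverse-resolve the IP, check the hostname suffix, then forward-resolve the hostname and make sure it maps back to the same IP. The suffix list below covers Googlebot and Bingbot; the resolver functions are injectable for testing, with defaults hitting real DNS:

```python
import socket

# Hostname suffixes the big crawlers actually resolve to.
GOOD_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_crawler(ip,
                        reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                        forward=socket.gethostbyname):
    """True only if ip reverse-resolves to a known crawler domain AND that
    hostname forward-resolves back to the same ip."""
    try:
        host = reverse(ip)
        if not host.endswith(GOOD_SUFFIXES):
            return False
        return forward(host) == ip  # forward-confirm; PTR alone is spoofable
    except OSError:
        return False                # no PTR record, DNS failure, etc.
```

Cache the verdict per IP so you only pay the two DNS lookups once per address.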
Filter the remainder based on request rate over a given period (use a DB to cache requests). More than 1 request per second for 20 seconds = bot; more than 200 requests per day = bot. Redirect to the honey-pot URL.
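An in-memory version of those two rate rules (the post uses a DB for this; the thresholds below mirror the numbers in the text, and `now` is injectable so the logic is testable without real clock time):

```python
import time
from collections import defaultdict, deque

BURST_WINDOW, BURST_LIMIT = 20, 20    # >1 req/sec sustained for 20 seconds
DAY_WINDOW, DAY_LIMIT = 86400, 200    # >200 requests in 24 hours

history = defaultdict(deque)          # ip -> timestamps of recent requests


def looks_like_bot(ip, now=None):
    """Record one request from ip; True if either rate rule is tripped."""
    now = time.time() if now is None else now
    ts = history[ip]
    ts.append(now)
    while ts and now - ts[0] > DAY_WINDOW:
        ts.popleft()                  # drop requests older than a day
    recent = sum(1 for t in ts if now - t <= BURST_WINDOW)
    return recent > BURST_LIMIT or len(ts) > DAY_LIMIT
```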
I've been doing this for a site with around 150K URLs for the past 4 years. Have about 600 IPs blocked and around 35 user-agents.