Use a honey-pot to catch crawlers that don't obey robots.txt, and send yourself an email for every new IP address it traps so you can catch any false positives.
Redirect requests with no user agent specified to your honey-pot URL.
Do the same for known bad or useless user agents: wget, curl, Bing, etc.
Validate user agents that claim to be good crawlers via reverse DNS lookup; cache the IP addresses that fail and redirect them to the honey-pot URL.
Filter the remainder by request rate over a given period (cache request counts in a database). More than 1 request per second sustained for 20 seconds, or more than 200 requests per day, marks a client as a bot: redirect it to the honey-pot URL.
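The reverse-DNS validation step can be sketched as follows. This is a hypothetical helper, not the author's code: the lookup functions are injectable so the check can be exercised without network access, and the allowed domain suffixes depend on which crawler you are verifying (Google publishes `googlebot.com`/`google.com` for its crawlers).

```python
import socket

# Forward-confirmed reverse DNS: resolve the IP to a hostname, check the
# hostname belongs to the crawler's published domain, then resolve the
# hostname back and confirm the original IP is among the results.
def is_genuine_crawler(ip, allowed_suffixes,
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    try:
        host = reverse(ip)[0]              # IP -> hostname
    except (socket.herror, OSError):
        return False
    if not host.endswith(tuple(allowed_suffixes)):
        return False                       # e.g. not *.googlebot.com
    try:
        forward_ips = forward(host)[2]     # hostname -> IPs (must round-trip)
    except (socket.gaierror, OSError):
        return False
    return ip in forward_ips
```

An IP that fails the check goes into the bad-IP cache and gets redirected to the honey-pot URL.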
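The rate rules above can be sketched in-memory like this; a real site would persist the counts in a database as described, and would also reset the daily counters, which this sketch omits. Names and structure are illustrative.

```python
import time
from collections import defaultdict, deque

BURST_WINDOW = 20    # seconds
BURST_LIMIT = 20     # >1 request/second sustained for 20 seconds
DAILY_LIMIT = 200    # requests per day

_hits = defaultdict(deque)   # ip -> timestamps within the burst window
_daily = defaultdict(int)    # ip -> requests so far today (reset not shown)

def looks_like_bot(ip, now=None):
    now = time.time() if now is None else now
    q = _hits[ip]
    q.append(now)
    # Drop timestamps that have aged out of the burst window.
    while q and now - q[0] > BURST_WINDOW:
        q.popleft()
    _daily[ip] += 1
    return len(q) > BURST_LIMIT or _daily[ip] > DAILY_LIMIT
```

A request that trips either rule gets redirected to the honey-pot URL instead of the real page.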
I've been doing this on a site with around 150K URLs for the past 4 years. I have about 600 URLs blocked and around 35 user-agents.
One of my jobs is to scrape websites, and we have permission to scrape the sites we work on. Sadly, many of the companies we work with don't have full-time developers, or they have rules set up to stop others from scraping their site, which makes things hard on us.
Other than Bad Behavior, I have circumvented all of the above suggestions. (I don't know what Bad Behavior is, and as far as I know I have not encountered it.)
We falsify user agents to claim we are browsers, Google, or whatever is needed.
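Faking the user agent is a one-liner with most HTTP clients; here is a sketch with Python's standard library (the UA string is a made-up example of a browser-like value, not anything the author specifies):

```python
from urllib.request import Request

# A browser-looking User-Agent replaces urllib's default
# "Python-urllib/x.y", which many sites block outright.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0 Safari/537.36")

req = Request("http://example.com/", headers={"User-Agent": BROWSER_UA})
```

Passing `req` to `urllib.request.urlopen` would send the request with the spoofed header.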
We script lynx, or write programs that type in a login/password and navigate from page to page.
We write programs that follow the __doPostBack(...) calls (with their arguments) that ASP.NET generates for multi-page tables.
We add in random time intervals, both to ease the load on some companies' sites and to circumvent rate-based blocks on others.
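A random delay between requests is trivial to add; this hypothetical helper both throttles the load and avoids the fixed request rate that rate-limit rules key on (bounds are illustrative):

```python
import random
import time

def polite_pause(min_s=2.0, max_s=10.0):
    # Sleep for a random interval so requests arrive neither too fast
    # nor at a detectable fixed cadence.
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```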
We download and parse PDF, DOC, and XLS files, and even some images.
We have gone as far as scraping Flash apps using screenshots and OCR.
The list goes on. My point is that nothing is completely safe.
Check out bad-behavior: http://bad-behavior.ioerror.us/