Ask HN: How do you guys stop scrapers from mirroring your site?
20 points by mmaunder on Jan 16, 2011 | 27 comments
I'm about to put a ton of really useful, unique and valuable (in the CPC and usefulness sense) content online. Scrapers are very quickly going to want it all. How do you guys protect your content? [keeping in mind that suing someone offshore is difficult]

My current approach is going to be to limit all non-USA crawlers and crawlers that don't identify themselves as Google, Bing or someone else I care about. I'm planning on using nginx.conf and the maxmind country database to do this.

By limit I mean limiting each IP to viewing a maximum of 50 to 100 unique documents per day.
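The per-IP unique-document cap could be sketched roughly like this in Python (a hypothetical in-memory version; in practice the counters would live in something shared like Redis or a database, and the class and names here are illustrative, not the poster's actual setup):

```python
from collections import defaultdict
from datetime import date

DAILY_LIMIT = 100  # max unique documents per IP per day (per the post: 50-100)

class UniqueDocLimiter:
    def __init__(self, limit=DAILY_LIMIT):
        self.limit = limit
        self.day = date.today()
        self.seen = defaultdict(set)  # ip -> set of document IDs seen today

    def allow(self, ip, doc_id):
        today = date.today()
        if today != self.day:          # reset all counters at midnight
            self.day = today
            self.seen.clear()
        docs = self.seen[ip]
        if doc_id in docs:             # re-reading the same doc is free
            return True
        if len(docs) >= self.limit:    # over the daily unique-doc budget
            return False
        docs.add(doc_id)
        return True
```

Counting *unique* documents rather than raw requests means normal readers who refresh or revisit pages aren't penalized, while a crawler sweeping the whole site trips the limit quickly.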

Any suggestions would be much appreciated.



Set the noarchive robots meta tag so scrapers can't use the Google cache.

Check out bad-behavior: http://bad-behavior.ioerror.us/

Otherwise:

Use a honey-pot to catch crawlers that don't obey robots.txt. Send yourself an email for every new IP address so you can catch any false positives.

Redirect requests with no user agent specified to your honey-pot URL.

Same for known bad/useless user agents: wget, curl, Bing, etc.

Validate supposedly good crawler agents via reverse DNS lookup; cache bad IP addresses and redirect them to the honey-pot URL.

Filter the remainder based on request rate over a given period (use a DB to cache requests). More than 1 request per second for 20 seconds = bot; more than 200 requests per day = bot. Either way, redirect to the honey-pot URL.
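Those two thresholds could be sketched like this in Python (a hypothetical in-memory version; the commenter's setup backs this with a database, and a real one would also need the daily counters reset by a cron job):

```python
import time
from collections import defaultdict, deque

# Flag an IP as a bot if it sustains more than 1 request/second over a
# 20-second window, or exceeds 200 requests in a day.
class RateFilter:
    def __init__(self, burst_window=20, burst_max=20, daily_max=200):
        self.burst_window = burst_window   # seconds
        self.burst_max = burst_max         # max requests inside that window
        self.daily_max = daily_max         # max requests per day
        self.recent = defaultdict(deque)   # ip -> timestamps in burst window
        self.daily = defaultdict(int)      # ip -> requests today

    def is_bot(self, ip, now=None):
        now = time.time() if now is None else now
        q = self.recent[ip]
        q.append(now)
        while q and now - q[0] > self.burst_window:  # drop stale timestamps
            q.popleft()
        self.daily[ip] += 1
        return len(q) > self.burst_max or self.daily[ip] > self.daily_max
```

The sliding window catches bursty crawlers that would look harmless in a per-day count, while the daily cap catches slow, patient ones.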

I've been doing this for a site with around 150K URLs for the past 4 years. I have about 600 IPs blocked and around 35 user agents.


These are good ways to stop your basic scraper.

One of my jobs is to scrape websites. The websites we scrape, we have permission to scrape. Sadly, many companies we work with don't have full-time developers, or they have rules set up to stop others from scraping their site, which makes it hard on us.

Other than Bad Behavior, I have circumvented all of the above suggestions. I don't know what Bad Behavior is and as far as I know I have not encountered it.

We falsify user agents to say they are browsers/Google/whatever's needed. We script lynx or write programs to type in a login/password and change the page. We write programs to follow the __doPostBack calls (with their variables) that ASP.NET creates for multi-page tables. We add in random time intervals to ease the load on some companies' sites, and to circumvent blocks on others. We download and parse PDF, DOC, XLS, even some images. We have gone as far as scraping Flash apps using screenshots and OCR.
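The first trick in that list, falsifying the user agent, is a one-liner. A minimal illustration with Python's standard library (the URL is a placeholder; the UA string shown is Googlebot's published one, but any browser string works the same way):

```python
import urllib.request

# Example Googlebot user-agent string; any browser UA can be substituted.
GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

def fetch_as(url, user_agent):
    """Fetch a URL while claiming to be the given user agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    return urllib.request.urlopen(req).read()

# html = fetch_as("http://example.com/", GOOGLEBOT_UA)
```

This is why user-agent checks alone are worthless as a defense; only the reverse-DNS validation mentioned elsewhere in this thread actually distinguishes the real Googlebot from an impersonator.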

The list goes on. My point is that nothing is completely safe.


Awesome, thanks. There are a few great ideas in your post I'll implement.


You will quickly find out that the scrapers are coming from USA-based proxies, they pretend to be Google or Bing, they use an intermediary (say, Google Cache)... and so on, until you are tired of fighting them.

It's a losing game, unless you don't mind making it hard for actual users to view your content.


I agree with bobds. It's pointless to worry about scrapers. You're better off focusing on doing your own thing. Large sites get scraped and cloned all the time. At the end of the day, it's better to focus on what you do than on the clones.


Limit all non-USA crawlers? USA-based traffic probably includes the worst offenders for scraping content.

Personally, I wouldn't even bother with this. As long as other sites aren't outranking you with your own content then I don't see a problem. If they are outranking you with your own content, then you need to evaluate your SEO strategy.


Google will apparently remove sites that are copying copyright content: http://www.google.com/support/websearch/bin/answer.py?hl=en&...

So he will definitely outrank them if he owns the content.


You could add some unique words/sentences to the pages which will make googling for "mirrors" easy (i.e. automated) so sending copyright notices to Google can be almost fully automated.


Google will also pay those sites infringing your copyright to display their pay-per-click ads.

They will comply with some removal requests, but sending a letter for each case of infringing content does not scale well.


If you do not want your content downloaded, then don't put your content online.

The web, and computers in general, work by copying data. Trying to prevent that is like trying to "make water not wet" (Bruce Schneier).


But honestly, the web is considered 'public domain' and you should be happy they didn't just lift your whole article and put someone else's name on it.


You could do a reverse lookup on crawler IP addresses to make sure that they match the User Agent. Google PTR records will end in google.com or googlebot.com. Maybe compare with the list at http://chceme.info/ips/

You probably can’t prevent this altogether, but you can probably do a lot to raise the cost of this type of crawling. The exception might be if there are people out to copy your content specifically and will actually tailor their crawlers to your countermeasures.


And regardless how good your defences are going to be, there's always Mechanical Turk.


Can you partition it? Have one portion of the data available to all and sundry, then have another more valuable part hidden behind a captcha'd login?

I know this won't prevent anyone else from being able to match your Google ranking on the free content, but it should mean that you can maintain your position as the "default place to go" for it, which should come with attached PageRank goodness. It also means you can institute per-account rate limits, rather than IP-based ones, which might help.


If you have a good lawyer, sue them for copyright infringement. Otherwise, ignore it; current crawlers can do pretty much anything an average visitor can, and some even use real browser engines.

> By limit I mean limiting each IP to viewing a maximum of 50 to 100 unique documents per day.

That won't do you any good. It's ridiculously simple to obtain unique IP addresses and proxies, not to mention that your database may blow up with that much data. :)


If they want to get it, they'll get it. There are plenty of services that will sell you lists of open proxy servers by the hundreds. If they really want your content, they'll just send each request from a different IP. The only way to really make it tough to get is to take the content out of the HTML, which you probably don't want to do.


Use absolute URLs. Many scrapers don't put much effort into rewriting URLs, so you can lead users back to the real site. Plus, Google will take links from the scraped site to your site as an indication that you're the original source, and as a bonus you'll get PageRank for it as well.


i) JavaScript: very few crawlers can read your JavaScript code. Look at the HTML source of Google; it's pure JavaScript! ii) Accept crawling based on IPs. iii) Use captchas. iv) Use cookies and measure the crawling speed. v) Remember that if someone wants to copy your content, they can just look at the cache of some search engine; they don't need to crawl you.


Javascript isn't going to stop me from crawling a site if I want the content. Browser automation is simply too good.


This is basically cloaking (showing your content one way to regular readers and a different way to search engines), which could get you slapped by Google. Not worth the risk, IMO. However, maybe there is a way to do this without cloaking? Not sure, I have never looked into it.


Just support Google's AJAX crawling scheme and disallow the _escaped_fragment_ requests from crawlers that are not on your whitelist.
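A minimal sketch of that whitelist check in Python (under Google's AJAX crawling scheme, crawlers request `?_escaped_fragment_=...` to get a pre-rendered snapshot; the whitelist names here are examples only, and a real check should use reverse DNS rather than the user agent):

```python
from urllib.parse import urlparse, parse_qs

# Example whitelist; matching on user agent alone is spoofable,
# so pair this with IP/reverse-DNS validation in practice.
CRAWLER_WHITELIST = {"googlebot", "bingbot"}

def allow_snapshot(url, user_agent):
    """Serve _escaped_fragment_ snapshot requests only to whitelisted bots."""
    qs = parse_qs(urlparse(url).query)
    if "_escaped_fragment_" not in qs:
        return True   # ordinary request: the JavaScript-rendered page is fine
    ua = user_agent.lower()
    return any(bot in ua for bot in CRAWLER_WHITELIST)
```

Regular visitors never send `_escaped_fragment_`, so only the pre-rendered snapshots (the easy-to-scrape HTML) are gated.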


... Isn't that exactly what the New York Times does with its required sign-on? You get prompted to sign in unless the referrer URL is google.


I didn't say cloaking; you can show the same content in JavaScript and HTML. Show the HTML to search engines.


Isn't this a problem for a CAPTCHA?


No.

If the content is only accessible after you fill out a CAPTCHA, Google won't be able to index your content, and you won't have any visitors for scrapers to steal from you.

I suppose you could provide the content without a CAPTCHA if the user agent matches the crawlers of Google, Bing, etc. But it's still a bad idea because a lot (most?) of your users will bounce on the CAPTCHA, killing your ad revenue.


... And it's very easy to just impersonate the GoogleBot by using its User-Agent.


Ditto. There is no point trying to prevent certain bots etc. from scraping your content. It's a never-ending battle. As all the others said before me, if it's your content, you should be fine.



