Ask HN: How do you guys stop scrapers from mirroring your site?
20 points by mmaunder on Jan 16, 2011 | 27 comments
I'm about to put a ton of really useful, unique and valuable (in the CPC and usefulness sense) content online. Scrapers are very quickly going to want it all. How do you guys protect your content? [keeping in mind that suing someone offshore is difficult]

My current approach is going to be to limit all non-USA crawlers and crawlers that don't identify themselves as Google, Bing or someone else I care about. I'm planning on using nginx.conf and the maxmind country database to do this.

By limit I mean limiting each IP to viewing a maximum of 50 to 100 unique documents per day.
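The per-IP unique-document cap could be sketched roughly like this in Python (a hypothetical in-memory version; in practice the counters would live in something shared like Redis or a database, and the class and names here are illustrative, not the poster's actual setup):

```python
from collections import defaultdict
from datetime import date

DAILY_LIMIT = 100  # max unique documents per IP per day (per the post: 50-100)

class UniqueDocLimiter:
    def __init__(self, limit=DAILY_LIMIT):
        self.limit = limit
        self.day = date.today()
        self.seen = defaultdict(set)  # ip -> set of document IDs seen today

    def allow(self, ip, doc_id):
        today = date.today()
        if today != self.day:          # reset all counters at midnight
            self.day = today
            self.seen.clear()
        docs = self.seen[ip]
        if doc_id in docs:             # re-reading the same doc is free
            return True
        if len(docs) >= self.limit:    # over the daily unique-doc budget
            return False
        docs.add(doc_id)
        return True
```

Counting *unique* documents rather than raw requests means normal readers who refresh or revisit pages aren't penalized, while a crawler sweeping the whole site trips the limit quickly.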

Any suggestions would be much appreciated.



Set the noarchive robots meta tag so scrapers can't use the Google cache.

Check out bad-behavior: http://bad-behavior.ioerror.us/

Otherwise:

Use a honey-pot to catch crawlers that don't obey robots.txt. Send yourself an email for every new IP address so you can catch any false positives.

Redirect requests with no user agent specified to your honey-pot URL.

Same for known bad/useless user agents: wget, curl, Bing, etc.

Validate supposedly good crawler agents via reverse DNS lookup; cache bad IP addresses and redirect them to the honey-pot URL.

Filter the remainder based on request rate over a given period (use a DB to cache requests). More than 1 request per second for 20 seconds = bot; more than 200 requests per day = bot. Either way, redirect to the honey-pot URL.
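Those two thresholds could be sketched like this in Python (a hypothetical in-memory version; the commenter's setup backs this with a database, and a real one would also need the daily counters reset by a cron job):

```python
import time
from collections import defaultdict, deque

# Flag an IP as a bot if it sustains more than 1 request/second over a
# 20-second window, or exceeds 200 requests in a day.
class RateFilter:
    def __init__(self, burst_window=20, burst_max=20, daily_max=200):
        self.burst_window = burst_window   # seconds
        self.burst_max = burst_max         # max requests inside that window
        self.daily_max = daily_max         # max requests per day
        self.recent = defaultdict(deque)   # ip -> timestamps in burst window
        self.daily = defaultdict(int)      # ip -> requests today

    def is_bot(self, ip, now=None):
        now = time.time() if now is None else now
        q = self.recent[ip]
        q.append(now)
        while q and now - q[0] > self.burst_window:  # drop stale timestamps
            q.popleft()
        self.daily[ip] += 1
        return len(q) > self.burst_max or self.daily[ip] > self.daily_max
```

The sliding window catches bursty crawlers that would look harmless in a per-day count, while the daily cap catches slow, patient ones.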

I've been doing this for a site with around 150K URLs for the past 4 years. I have about 600 IPs blocked and around 35 user agents.


These are good ways to stop your basic scraper.

One of my jobs is to scrape websites. The websites we scrape, we have permission to scrape. Sadly, many companies we work with don't have full-time developers, or they have rules set up to stop others from scraping their site, which makes it hard on us.

Other than Bad Behavior, I have circumvented all of the above suggestions. I don't know what Bad Behavior is and as far as I know I have not encountered it.

We falsify user agents to say they are browsers/Google/whatever's needed. We script lynx or write programs to type in a login/password and change the page. We write programs to follow the __doPostBack calls (with their variables) that ASP.NET creates for multi-page tables. We add in random time intervals to ease the load on some companies' sites, and to circumvent blocks on others. We download and parse PDF, DOC, XLS, even some images. We have gone as far as scraping Flash apps using screenshots and OCR.
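The first trick in that list, falsifying the user agent, is a one-liner. A minimal illustration with Python's standard library (the URL is a placeholder; the UA string shown is Googlebot's published one, but any browser string works the same way):

```python
import urllib.request

# Example Googlebot user-agent string; any browser UA can be substituted.
GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

def fetch_as(url, user_agent):
    """Fetch a URL while claiming to be the given user agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    return urllib.request.urlopen(req).read()

# html = fetch_as("http://example.com/", GOOGLEBOT_UA)
```

This is why user-agent checks alone are worthless as a defense; only the reverse-DNS validation mentioned elsewhere in this thread actually distinguishes the real Googlebot from an impersonator.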

The list goes on. My point is that nothing is completely safe.


Awesome, thanks. There are a few great ideas in your post I'll implement.


You will quickly find out that the scrapers are coming from USA-based proxies, they pretend to be Google or Bing, they use an intermediary (say, Google Cache)... and so on, until you are tired of fighting them.

It's a losing game, unless you don't mind making it hard for actual users to view your content.


I agree with bobds. It's pointless to worry about scrapers. You're better off focusing on doing your own thing. Large sites get scraped and cloned all the time. At the end of the day, it's better to focus on what you do than on the clones.


Limit all non-USA crawlers? USA-based traffic probably includes the worst offenders for scraping content.

Personally, I wouldn't even bother with this. As long as other sites aren't outranking you with your own content then I don't see a problem. If they are outranking you with your own content, then you need to evaluate your SEO strategy.


Google will apparently remove sites that are copying copyright content: http://www.google.com/support/websearch/bin/answer.py?hl=en&...

So he will definitely outrank them if he owns the content.


You could add some unique words/sentences to the pages which will make googling for "mirrors" easy (i.e. automated) so sending copyright notices to Google can be almost fully automated.


Google will also pay those sites infringing your copyright to display their pay-per-click ads.

They will comply with some removal requests, but sending a letter for each case of infringing content does not scale well.


If you do not want your content downloaded, then don't put your content online.

The web, and computers in general, work by copying data. Trying to prevent that is like trying to "make water not wet" (Bruce Schneier).


But honestly, the web is considered 'public domain' and you should be happy they didn't just lift your whole article and put someone else's name on it.


You could do a reverse lookup on crawler IP addresses to make sure that they match the User Agent. Google PTR records will end in google.com or googlebot.com. Maybe compare with the list at http://chceme.info/ips/

You probably can’t prevent this altogether, but you can probably do a lot to raise the cost of this type of crawling. The exception might be if there are people out to copy your content specifically and will actually tailor their crawlers to your countermeasures.


And regardless how good your defences are going to be, there's always Mechanical Turk.


Can you partition it? Have one portion of the data available to all and sundry, then have another more valuable part hidden behind a captcha'd login?

I know this won't prevent anyone else from being able to match your Google ranking on the free content, but it should mean that you can maintain your position as the "default place to go" for it, which should come with attached PageRank goodness. It also means you can institute per-account rate limits, rather than IP-based ones, which might help.


If you have a good lawyer, sue them for copyright infringement. Otherwise, ignore it; current crawlers can do pretty much anything an average visitor can, and some even use real browser engines.

> By limit I mean limiting each IP to viewing a maximum of 50 to 100 unique documents per day.

That won't do you any good. It's ridiculously simple to obtain unique IP addresses and proxies, not to mention that your database may blow up with that much data. :)


If they want to get it, they'll get it. There are plenty of services that will sell you lists of open proxy servers by the hundreds. If they really want your content, they'll just send each request from a different IP. The only way to really make it tough to get is to take the content out of the HTML, which you probably don't want to do.


Use absolute URLs. Many scrapers don't put much effort into rewriting URLs, so you can lead users back to the real site. Plus, Google will take links from the scraped site to your site as an indication that you're the original source, and as a bonus you'll get PageRank for it as well.


i) JavaScript: very few crawlers can read your JavaScript code. Look at the HTML source of Google; it's pure JavaScript! ii) Accept crawling based on IPs. iii) Use captchas. iv) Use cookies and measure the crawling speed. v) Remember that if someone wants to copy your content, they can just look at the cache of some search engine; they don't need to crawl you.


Javascript isn't going to stop me from crawling a site if I want the content. Browser automation is simply too good.


This is basically cloaking (showing your content one way to regular readers and a different way to search engines), which could get you slapped by Google. Not worth the risk, IMO. However, maybe there is a way to do this without cloaking? Not sure, I have never looked into it.


Just support Google's AJAX crawling scheme and disallow the _escaped_fragment_ requests from crawlers that are not on your whitelist.
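A minimal sketch of that whitelist check in Python (under Google's AJAX crawling scheme, crawlers request `?_escaped_fragment_=...` to get a pre-rendered snapshot; the whitelist names here are examples only, and a real check should use reverse DNS rather than the user agent):

```python
from urllib.parse import urlparse, parse_qs

# Example whitelist; matching on user agent alone is spoofable,
# so pair this with IP/reverse-DNS validation in practice.
CRAWLER_WHITELIST = {"googlebot", "bingbot"}

def allow_snapshot(url, user_agent):
    """Serve _escaped_fragment_ snapshot requests only to whitelisted bots."""
    qs = parse_qs(urlparse(url).query)
    if "_escaped_fragment_" not in qs:
        return True   # ordinary request: the JavaScript-rendered page is fine
    ua = user_agent.lower()
    return any(bot in ua for bot in CRAWLER_WHITELIST)
```

Regular visitors never send `_escaped_fragment_`, so only the pre-rendered snapshots (the easy-to-scrape HTML) are gated.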


... Isn't that exactly what the New York Times does with its required sign-on? You get prompted to sign in unless the referrer URL is google.


I didn't say cloaking; you can show the same content in JavaScript and HTML. Show the HTML to search engines.


Isn't this a problem for a CAPTCHA?


No.

If the content is only accessible after you fill out a CAPTCHA, Google won't be able to index your content, and you won't have any visitors for scrapers to steal from you.

I suppose you could provide the content without a CAPTCHA if the user agent matches the crawlers of Google, Bing, etc. But it's still a bad idea because a lot (most?) of your users will bounce on the CAPTCHA, killing your ad revenue.


... And it's very easy to just impersonate the GoogleBot by using its User-Agent.


Ditto. There is no point trying to prevent certain bots etc. from scraping your content. It's a never-ending battle. As all the others said before me, if it's your content, you should be fine.



