Using Spider Traps to Discourage Site Scraping


  • Author Rob Sullivan
  • Published November 29, 2005
  • Word count 841

Sometimes your competitors will do almost anything to compete with you including stealing your content.

To do this they sometimes employ automated software much like a search engine crawler to make the process quicker and easier than manually copying your site. This can cause many problems for you.

In this article we look at ways to stop this from happening.

Stealing on the web is rampant. I don’t mean stealing people’s user IDs and passwords. I mean the stealing of a website’s actual content.

Webmasters and designers steal images they like, or find a cool piece of JavaScript and steal that as well.

But what really causes problems is when your competitors steal your content.

As we all know, content is king on the web. Whoever has the most content wins. So if a competitor of yours needs to grow quickly, one of the easiest ways to do it is through the use of a website harvester.

A website harvester is no different than any other search engine crawler. It goes and requests all the URLs it can find and then proceeds to download all the content associated with those URLs.

So how do you protect yourself from malicious scrapers?

Simple really. You build a spider trap.

As the name implies, you create a section of your site devoted to luring in the spiders that are not friendly, and then you proceed to either trap them or ban them from accessing your site.

What’s involved in making a spider trap?

Usually a bit of PHP code combined with a database and a URL rewriter.

The first thing you need to do is create the space on the site dedicated to capturing those bad bots. You then use robots.txt to exclude that section from crawling.

You do this because you want to ensure Googlebot, Yahoo! Slurp, MSNbot and the others don’t also get trapped. Since most good spiders will follow the robots.txt exclusion protocol you are going to politely deny them access to this location.
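As a minimal sketch, and assuming the trap lives in a directory called /trap/ (the name here is just an illustration), the robots.txt exclusion is two lines:

```
User-agent: *
Disallow: /trap/
```

Good spiders read this and stay out; the harvesters you are hunting typically never request robots.txt at all.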

From here there are various options. One of my favorite involves logging to a database or text file and then dynamically denying access to the bad bot.

How does it work?

Let me give you a practical example.

I once had a client that was getting harvested many times per day by many different bad spiders. It was so bad at one point that the bad bots were doubling his bandwidth usage.

So we devised a plan: we’d create the trap described above, and as soon as we captured a bad bot’s user agent and IP information, we’d immediately ban it from the site.

This is how it worked:

The bad bot would come to the site and find a link placed on an image. The link would point to the trap directory.
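In practice the trap link is usually attached to something a human visitor will never click, such as a one-pixel image. A minimal sketch (the file and directory names are illustrative):

```html
<!-- A human visitor never notices this, but a harvester that follows
     every href walks straight into the excluded /trap/ directory. -->
<a href="/trap/"><img src="/images/pixel.gif" alt="" width="1" height="1"></a>
```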

Normally, a regular spider would first check the robots.txt file to ensure it could in fact index the content in that directory. Since the file excluded this directory, the “good” spiders wouldn’t go in.

However, the bad spiders ignored robots.txt and went straight into the directory.

From there, a PHP script would run and capture the bot’s user agent and IP address.

Another script would take that information and rewrite the .htaccess file with the bad spider information as soon as it was received.

Because Apache re-reads .htaccess on every request, the new rule took effect immediately and that spider was no longer allowed to visit the site.
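A minimal sketch of such a trap script is below. The .htaccess path and the Apache 2.2 “Deny from” syntax are assumptions; adjust both for your own server.

```php
<?php
// trap.php – a minimal sketch of the trap script described above.
// The .htaccess path is an assumption; adjust it for your own server.

// Build an Apache 2.2-style deny rule for a captured bad bot.
function build_deny_rule(string $ip, string $userAgent): string
{
    // Record the user agent as a comment for later review, then deny the IP.
    return '# banned: ' . str_replace("\n", ' ', $userAgent) . "\n"
         . 'Deny from ' . $ip . "\n";
}

// Only run the capture when serving a real web request.
if (PHP_SAPI !== 'cli') {
    $ip        = $_SERVER['REMOTE_ADDR']     ?? 'unknown';
    $userAgent = $_SERVER['HTTP_USER_AGENT'] ?? 'unknown';

    // Append the rule; Apache re-reads .htaccess on the next request,
    // so the ban takes effect immediately.
    file_put_contents('/var/www/html/.htaccess',
        build_deny_rule($ip, $userAgent),
        FILE_APPEND | LOCK_EX);
}
```

Note the exclusive lock on the write: two bots hitting the trap at once would otherwise corrupt the .htaccess file.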

In another incarnation, this system would write to a database or text file that was then referenced by the site pages through a small PHP script, which would allow or deny access based on that list.
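A sketch of that text-file variant: every page includes a small check against the blacklist before rendering anything. The blacklist file name is an assumption.

```php
<?php
// Return true if $ip appears (one per line) in the blacklist file.
function is_banned(string $ip, string $blacklistFile): bool
{
    if (!is_readable($blacklistFile)) {
        return false; // no blacklist yet, so allow everyone
    }
    $banned = file($blacklistFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return in_array($ip, $banned, true);
}

// At the top of every page: turn away anyone the trap has captured.
if (PHP_SAPI !== 'cli' && is_banned($_SERVER['REMOTE_ADDR'] ?? '', 'banned_ips.txt')) {
    header('HTTP/1.1 403 Forbidden');
    exit('Forbidden');
}
```

This variant avoids rewriting .htaccess at all, at the cost of one small file read per page view.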

Keep in mind that this is very advanced stuff. You don’t want to take it on lightly: getting it wrong can (and likely will) get your site removed from the search indexes.

Not because you are doing something you shouldn’t, but because there’s always the chance that a good spider like Googlebot somehow ends up on your blacklist.

Therefore, before you get into such advanced things, I’d make sure you are intimately familiar with what they are and how they work.

A good place to start is to read this page. I wouldn’t advise using this code just yet, but take a look at it to see what it can do.

Also, do some searches on the engines for “bot trap” and “spider trap” to see what other options there are out there. Then, pick the one that works best for you.

In the end, the best bot trap is the one that does what it is supposed to – block harvesters from scraping your site while allowing legitimate search engines to effectively and efficiently index your site.

And if you are at all concerned with this tactic, don’t use it. It’s better to use the manual approach: scour your server logs for high activity from unknown user agents, then manually ban them using .htaccess.
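The log-scouring step can itself be partly automated without any risk to good spiders, since a human still reviews the output before banning anyone. Here is a sketch that tallies requests per IP address from an Apache common/combined-format access log; the log path and the threshold are assumptions to tune for your own traffic.

```php
<?php
// Count hits per IP; the IP is the first space-delimited field of each
// line in Apache common/combined log format.
function count_hits_by_ip(array $logLines): array
{
    $hits = [];
    foreach ($logLines as $line) {
        $ip = strtok($line, ' ');
        if ($ip !== false && $ip !== '') {
            $hits[$ip] = ($hits[$ip] ?? 0) + 1;
        }
    }
    arsort($hits); // busiest addresses first
    return $hits;
}

// Usage from the command line: php scan_log.php /var/log/apache2/access.log
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $lines = file($argv[1], FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach (count_hits_by_ip($lines) as $ip => $count) {
        if ($count > 1000) { // anything this active deserves a closer look
            echo "$ip\t$count\n";
        }
    }
}
```

Cross-check any suspicious IP’s user agent in the log before adding it to .htaccess by hand.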

Rob Sullivan - SEO Specialist and Internet Marketing Consultant. Reproduction of this article is allowed with an html link pointing to http://www.textlinkbrokers.com

Article source: https://articlebiz.com