Using Spider Traps to Discourage Site Scraping


  • Author Rob Sullivan
  • Published November 29, 2005
  • Word count 841

Sometimes your competitors will do almost anything to compete with you including stealing your content.

To do this they sometimes employ automated software much like a search engine crawler to make the process quicker and easier than manually copying your site. This can cause many problems for you.

In this article we look at ways to stop this from happening.

Stealing on the web is rampant. I don’t mean stealing people’s user IDs and passwords. I mean the stealing of a website’s actual content.

Webmasters and designers steal images they like, or find a cool piece of JavaScript and steal that as well.

But what really causes problems is when your competitors steal your content.

As we all know, content is king on the web. Whoever has the most content wins. So if a competitor of yours needs to grow quickly, one of the easiest ways to do it is through the use of a website harvester.

A website harvester is no different than any other search engine crawler. It goes and requests all the URLs it can find and then proceeds to download all the content associated with those URLs.

So how do you protect yourself from malicious scrapers?

Simple really. You build a spider trap.

As the name implies, you create a section of your site devoted to luring in the spiders that are not friendly, and then you proceed to either trap them or ban them from accessing your site.

What’s involved in making a spider trap?

Usually a bit of PHP code combined with a database and a URL rewriter.

The first thing you need to do is create the space on the site dedicated to capturing those bad bots. You then use robots.txt to exclude that section from crawling.

You do this because you want to ensure Googlebot, Yahoo! Slurp, MSNbot and the others don’t also get trapped. Since most good spiders will follow the robots.txt exclusion protocol you are going to politely deny them access to this location.
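As a minimal sketch, and assuming the trap lives in a directory called /trap/ (the name here is just an illustration), the robots.txt exclusion is two lines:

```
User-agent: *
Disallow: /trap/
```

Good spiders read this and stay out; the harvesters you are hunting typically never request robots.txt at all.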

From here there are various options. One of my favorite involves logging to a database or text file and then dynamically denying access to the bad bot.

How does it work?

Let me give you a practical example.

I once had a client that was getting harvested many times per day by many different bad spiders. It was so bad at one point that the bad bots were doubling his bandwidth usage.

So we devised a plan: we’d create the trap described above, and as soon as we captured a bad bot’s user agent and IP information, we’d immediately ban it from the site.

This is how it worked:

The bad bot would come to the site and find a link placed on an image. The link would point to the trap directory.
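In practice the trap link is usually attached to something a human visitor will never click, such as a one-pixel image. A minimal sketch (the file and directory names are illustrative):

```html
<!-- A human visitor never notices this, but a harvester that follows
     every href walks straight into the excluded /trap/ directory. -->
<a href="/trap/"><img src="/images/pixel.gif" alt="" width="1" height="1"></a>
```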

Normally, a regular spider would first check the robots.txt file to ensure it could in fact index the content in that directory. Since the file excluded this directory, the “good” spiders wouldn’t go in.

However, the bad spiders ignored robots.txt and went straight into the directory.

From there, a PHP script would run and capture the bot’s user agent and IP address.

Another script would take that information and rewrite the .htaccess file with the bad spider information as soon as it was received.

Because Apache re-reads .htaccess on every request, the new rule took effect immediately and that spider was no longer allowed to visit the site.
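A minimal sketch of such a trap script is below. The .htaccess path and the Apache 2.2 “Deny from” syntax are assumptions; adjust both for your own server.

```php
<?php
// trap.php – a minimal sketch of the trap script described above.
// The .htaccess path is an assumption; adjust it for your own server.

// Build an Apache 2.2-style deny rule for a captured bad bot.
function build_deny_rule(string $ip, string $userAgent): string
{
    // Record the user agent as a comment for later review, then deny the IP.
    return '# banned: ' . str_replace("\n", ' ', $userAgent) . "\n"
         . 'Deny from ' . $ip . "\n";
}

// Only run the capture when serving a real web request.
if (PHP_SAPI !== 'cli') {
    $ip        = $_SERVER['REMOTE_ADDR']     ?? 'unknown';
    $userAgent = $_SERVER['HTTP_USER_AGENT'] ?? 'unknown';

    // Append the rule; Apache re-reads .htaccess on the next request,
    // so the ban takes effect immediately.
    file_put_contents('/var/www/html/.htaccess',
        build_deny_rule($ip, $userAgent),
        FILE_APPEND | LOCK_EX);
}
```

Note the exclusive lock on the write: two bots hitting the trap at once would otherwise corrupt the .htaccess file.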

In another incarnation, this system would write to a database or text file that was then referenced by the site pages through a small PHP script, which would allow or deny access based on that list.
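A sketch of that text-file variant: every page includes a small check against the blacklist before rendering anything. The blacklist file name is an assumption.

```php
<?php
// Return true if $ip appears (one per line) in the blacklist file.
function is_banned(string $ip, string $blacklistFile): bool
{
    if (!is_readable($blacklistFile)) {
        return false; // no blacklist yet, so allow everyone
    }
    $banned = file($blacklistFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return in_array($ip, $banned, true);
}

// At the top of every page: turn away anyone the trap has captured.
if (PHP_SAPI !== 'cli' && is_banned($_SERVER['REMOTE_ADDR'] ?? '', 'banned_ips.txt')) {
    header('HTTP/1.1 403 Forbidden');
    exit('Forbidden');
}
```

This variant avoids rewriting .htaccess at all, at the cost of one small file read per page view.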

Keep in mind that this is very advanced stuff. You don’t want to take it on lightly: getting it wrong can (and likely will) get your site removed from the search indexes.

Not because you are doing something you shouldn’t, but because there’s always the chance that a good spider like Googlebot somehow ends up on your blacklist.

Therefore, before you get into such advanced things, I’d make sure you are intimately familiar with what they are and how they work.

A good place to start is to read this page. I wouldn’t advise using this code just yet, but take a look at it to see what it can do.

Also, do some searches on the engines for “bot trap” and “spider trap” to see what other options there are out there. Then, pick the one that works best for you.

In the end, the best bot trap is the one that does what it is supposed to – block harvesters from scraping your site while allowing legitimate search engines to effectively and efficiently index your site.

And if you are at all concerned with this tactic, don’t use it. It’s better to use the manual approach: scour your server logs for high activity from unknown user agents, then manually ban them using .htaccess.
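The log-scouring step can itself be partly automated without any risk to good spiders, since a human still reviews the output before banning anyone. Here is a sketch that tallies requests per IP address from an Apache common/combined-format access log; the log path and the threshold are assumptions to tune for your own traffic.

```php
<?php
// Count hits per IP; the IP is the first space-delimited field of each
// line in Apache common/combined log format.
function count_hits_by_ip(array $logLines): array
{
    $hits = [];
    foreach ($logLines as $line) {
        $ip = strtok($line, ' ');
        if ($ip !== false && $ip !== '') {
            $hits[$ip] = ($hits[$ip] ?? 0) + 1;
        }
    }
    arsort($hits); // busiest addresses first
    return $hits;
}

// Usage from the command line: php scan_log.php /var/log/apache2/access.log
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $lines = file($argv[1], FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach (count_hits_by_ip($lines) as $ip => $count) {
        if ($count > 1000) { // anything this active deserves a closer look
            echo "$ip\t$count\n";
        }
    }
}
```

Cross-check any suspicious IP’s user agent in the log before adding it to .htaccess by hand.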

Rob Sullivan - SEO Specialist and Internet Marketing Consultant. Reproduction of this article is allowed with an html link pointing to http://www.textlinkbrokers.com

Article source: https://articlebiz.com