Before Google existed, the great new search engine was AltaVista. To show off its power, Digital's AltaVista team decided to crawl and index the entire web, a new concept at the time. Many site owners did not like the idea of a "robot" program fetching every page of their websites, because the extra requests added load to their web servers and increased their bandwidth costs. To address these growing concerns, the Robot Exclusion Standard was created in 1996.

You can use a simple text file called robots.txt to keep search engines out of a directory. Here is a very simple example that prevents all search engines (user agents) from accessing the /images directory:

User-agent: *
Disallow: /images

When you block the /images directory, you also block all of its subdirectories, because matching is done by path prefix. For example, the /images/logos directory and the /images.html file will also be disallowed.

Oddly enough, the first draft of this standard did not contain an "Allow" directive. It was added later, but support is not guaranteed across all search engines. The implication is that anything not explicitly disallowed should be considered fair game for web crawlers.
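
As a hedged illustration (the /images/public path is hypothetical), a file like the following blocks the /images directory as a whole while letting crawlers that honor Allow back into one subdirectory; the Allow line comes first because some parsers apply the first rule that matches:

User-agent: *
Allow: /images/public
Disallow: /images

Crawlers that do not recognize Allow should simply ignore the unknown line and fall back to blocking everything under /images.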

To disallow access to your entire website, use a robots.txt file like this:

User-agent: *
Disallow: /

When the user-agent is *, the rules apply to all search bots. By specifying a particular crawler's signature as the user-agent, you can give that bot its own instructions:

User-agent: Googlebot
Disallow: /google-secrets

Since the original specification was published, various search engines have extended the protocol. A popular extension is wildcard matching:

User-agent: Slurp
Disallow: /*.gif$

This tells Yahoo!'s crawler (called Slurp) not to index files on your site that end in .gif. Because not all search engines currently support wildcard matching, rules like this should be placed under the user-agent line of a specific crawler that does.

You can combine several of these techniques in a single robots.txt file. Here is an example:

User-agent: *
Disallow: /bar

User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /*.gif$
Disallow: /

Computers are very good at following well-defined instructions; the human brain is less so. The best advice is to keep things simple.

For us mortals, there is a robots.txt testing tool in Google's webmaster tools, and it is highly recommended. Another good resource for more information on the Robot Exclusion Standard is http://www.robotstxt.org.
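
If you want to check rules programmatically rather than by eye, Python's standard library also includes a parser for the original standard. Here is a minimal sketch; the bot name and example.com URLs are placeholders, and the rules are a simplified version of the combined file above because the standard-library parser does not implement engine-specific extensions such as wildcards:

from urllib.robotparser import RobotFileParser

# A simplified rule set: the wildcard line is omitted because
# urllib.robotparser only understands plain path prefixes.
rules = """\
User-agent: *
Disallow: /bar

User-agent: Googlebot
Allow: /foo
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(user_agent, url) answers: may this bot fetch this URL?
print(parser.can_fetch("SomeBot", "http://example.com/bar/page.html"))    # False
print(parser.can_fetch("Googlebot", "http://example.com/foo/page.html"))  # True
print(parser.can_fetch("Googlebot", "http://example.com/other.html"))     # False

In real use you would point set_url() at your live robots.txt and call read() instead of parse(), but parsing a local string makes the rules easy to experiment with.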

Many companies are willing to spend large sums of money to get their pages listed in search engines, so opting out of that mix can seem like a step backwards. However, there are sound security reasons to limit how much of your site a search engine can index.
