January 29, 2009

Robots.txt

The most basic technique is the "robots.txt" file. This file allows you to tell search engine robots which parts of your site they cannot crawl. To start, create a file called robots.txt; it must live in the root directory of your domain. This means that if your site is "www.yourdomain.com", the robots.txt file must be located at "www.yourdomain.com/robots.txt". Do not place it anywhere else, because it will have no effect.

The basic technique is simple. To exclude all bots from your server, structure your robots.txt as follows:

User-agent: *
Disallow: /
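
Conversely, leaving the Disallow value empty tells bots that nothing is off limits. As a minimal sketch of that case:

User-agent: *
# An empty Disallow value means no part of the site is blocked
Disallow: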

You can choose to block only certain bots by specifying the bot's name on the User-agent line instead of using "*" to indicate all bots (an example follows below). You can also specify that only certain directories are protected, with a file similar to this one:

User-agent: *
Disallow: /cgi-bin/
Disallow: /php/
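
As a sketch of the single-bot case mentioned above, you might exclude one named crawler while leaving the site open to everyone else. The bot name shown here is only an example; substitute the User-agent string of the crawler you actually want to exclude:

# "Googlebot" is an example bot name; other bots are unaffected by this record
User-agent: Googlebot
Disallow: /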

The definitive specification for the robots.txt file can be found at this location.
