Bots Are NOT Your Friends
Spiders, crawlers, and bots are NOT your friends. Most search engines are hardly your friends, either. Isn’t it about time that you took control of the situation?
The robots.txt File to the Rescue
Thank goodness that webmasters have a tool to handle the situation. It is called the robots.txt file. ALL serious Web sites, without exception, should have one. The file happens to be one of the most frequently requested files on many Web sites, so not having one will cause a lot of 404 Not Found errors in your logs. Hence, even if all of your online information is open to everyone, you should have at the very least a minimal file that lets every bot crawl everything: a "User-agent: *" line followed by an empty "Disallow:" line.
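The minimal, allow-everything file can be sketched as follows. This is only an illustration: the file contents are held in a Python string and checked with the standard library's urllib.robotparser, and the helper name is_allowed, the bot name "AnyBot", and the path are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt: every bot ("*") may crawl everything,
# because the Disallow line is left empty.
ALLOW_ALL_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "AnyBot" and "/index.html" are hypothetical values for illustration.
    print(is_allowed(ALLOW_ALL_ROBOTS_TXT, "AnyBot", "/index.html"))
```

Saving the two lines of ALLOW_ALL_ROBOTS_TXT as robots.txt is enough to stop the 404 errors without restricting anyone.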
Let’s face it. Google currently is the only search engine that produces the hits that everyone is looking for. Yahoo comes in a distant second, while MSN shows up in a very weak third place. So, why not simply block everybody else? The problem is that most search engines use more than one bot and could easily add new ones as time moves on. And what about new players trying to enter the search engine field?
In some respects, any help from any search engine is better than no help at all. Good site statistics can help you decide whether any bot is crawling your site excessively.
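If you have no statistics package at hand, a rough count of requests per user agent can be pulled straight from the server log. The sketch below assumes the log uses the common "combined" format, where the user agent is the last quoted field on each line; the function name and the sample lines are invented for illustration and are not real traffic.

```python
import re
from collections import Counter

# Assumption: combined log format, user agent is the last quoted field.
USER_AGENT_AT_END = re.compile(r'"([^"]*)"\s*$')


def count_hits_by_agent(log_lines):
    """Count how many requests each user agent string made."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines; the bot names are hypothetical.
    sample = [
        '192.0.2.1 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
        '192.0.2.1 - - [01/Jan/2024:00:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
        '192.0.2.7 - - [01/Jan/2024:00:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "OtherBot/2.0"',
    ]
    for agent, hits in count_hits_by_agent(sample).most_common():
        print(hits, agent)
```

A bot that sits at the top of this list day after day, yet sends you no visitors, is a candidate for blocking.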
One could take a very negative point of view and say that robots.txt files are useless because many bots don’t follow the rules and are there simply to scrape your site. You might try to use a server-side script, written in Perl for example, that would feed your content only to legitimate visitors and search engines. The robots.txt approach is recommended instead because taking that kind of radical approach could easily be suicide for your Web site if Google decides to ban it for cloaking. No workaround to this problem is worth the risk of getting banned by Google.
Relax, there is an easier alternative: simply list all of the bots that you would rather not let crawl your site. Here is a robots.txt file generator that has a handy Disallow Bad Bots option built into it.
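A block list of this kind is easy to build by hand as well. The sketch below is not the generator mentioned above, just a small illustration of the idea; the function name and the bot names "ExampleScraperBot" and "ExampleHarvester" are hypothetical stand-ins for whatever your own statistics turn up. Keep in mind that it only keeps out bots that actually honor robots.txt.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents):
    """Build a robots.txt that shuts out the listed bots and allows the rest."""
    # One section per unwanted bot: "Disallow: /" blocks the whole site.
    sections = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    # Catch-all section: an empty Disallow lets every other bot in.
    sections.append("User-agent: *\nDisallow:\n")
    # Sections are separated by a blank line.
    return "\n".join(sections)


if __name__ == "__main__":
    # Hypothetical bot names, used only for illustration.
    robots_txt = build_robots_txt(["ExampleScraperBot", "ExampleHarvester"])
    print(robots_txt)

    # Check the result the way a well-behaved bot would read it.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    print(parser.can_fetch("ExampleScraperBot", "/"))
    print(parser.can_fetch("Googlebot", "/"))
```

Listing unwanted bots by name, rather than blocking everyone except a favored few, means new search engines are allowed in by default.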
Now, to be in control, all you have to do is upload your robots.txt file to the root directory of your Web site, where bots will look for it at /robots.txt.