07 November 2006
Robots Invading Your Website: Protecting your website using robots.txt.
In today’s search-driven world, nearly every website owner is aware of the major search engines, and many even tune their content to rank higher on a particular one.
The major search engines are able to search through millions of documents on the internet because they run automated programs that browse the web day and night, grabbing your pages, images, and files and storing them at their datacenters. These automated programs are known as robots (also known as crawlers or spiders).
These robots are, for the most part, good and make the internet a better place. By allowing them into your site you may open a door for hundreds or thousands of people who would otherwise have no way to find you. One issue with robots, though, is that they fetch everything they have access to, and unless you take special measures that may be nearly everything on your website. Search engines need the core content of your site, but they usually don’t need every image or other document that has no reason to show up in search results. Another disadvantage of spiders hitting sections of your site they don't need is that it clogs your log files, making them larger than necessary and harder to parse.
Here’s where the robots.txt file comes in. Most robots look for a text file called robots.txt in your site's root directory before they browse your website. This file tells the robot where it is allowed to go and where it is not. With just a few examples you should be able to write a robots.txt file that meets your needs.
The first thing we have to do is say which robot we are talking to. Each robot identifies itself by a name called its user-agent. A list of robots can be found at www.robotstxt.org. After naming the robot we are talking to, we give it some rules.
If we have a website with “images”, “scripts”, and “about” directories, and we don’t want Google to browse the “images” or “scripts” directories but do want it to browse the “about” directory, our robots.txt might look like this:
User-agent: Googlebot
Disallow: /images
Disallow: /scripts
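One way to check rules like these before publishing them is to run them through a parser. The sketch below uses Python's standard urllib.robotparser module; the helper name load_rules and the three sample paths are made up for illustration, and Python's parser is only one interpretation of the rules, so a real crawler may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the text, one directive per line.
RULES = """\
User-agent: Googlebot
Disallow: /images
Disallow: /scripts
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = load_rules(RULES)

# Hypothetical paths on the example site.
for path in ("/about/team.html", "/images/logo.png", "/scripts/menu.js"):
    verdict = "allowed" if parser.can_fetch("Googlebot", path) else "blocked"
    print(f"Googlebot {path}: {verdict}")
```

Rules are matched as path prefixes, so Disallow: /images also covers a page such as /images.html. Writing the rule with a trailing slash (Disallow: /images/) limits it to the directory.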
If we want to block all search engines from these directories, we use an asterisk (*) as a wildcard. We may also want Google alone to be able to browse our images so they show up in Google image search (this keeps the thousands of small, unknown robots from using our bandwidth while still allowing access where it is needed). A robot that finds a group naming it follows that group instead of the wildcard group, so the Googlebot group has to repeat the rule for “scripts”. Our robots.txt might look something like this:
User-agent: *
Disallow: /images
Disallow: /scripts

User-agent: Googlebot
Allow: /images
Disallow: /scripts
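It is worth checking how a robot with its own group treats the wildcard rules. The sketch below, again using Python's urllib.robotparser, compares a Googlebot group that contains only the Allow line with one that also repeats the “scripts” rule. The variable names and sample paths are assumptions for illustration. Python's parser applies only the group that names the robot, which matches the behaviour Google documents for Googlebot, but other robots may interpret the file differently.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Variant 1: the Googlebot group only re-opens /images.
ALLOW_ONLY = """\
User-agent: *
Disallow: /images
Disallow: /scripts

User-agent: Googlebot
Allow: /images
"""

# Variant 2: the Googlebot group also repeats the /scripts rule.
WITH_SCRIPTS_RULE = ALLOW_ONLY + "Disallow: /scripts\n"

results = {}
for label, text in (("allow only", ALLOW_ONLY),
                    ("allow plus disallow", WITH_SCRIPTS_RULE)):
    parser = load_rules(text)
    for agent in ("Googlebot", "SomeOtherBot"):
        for path in ("/images/logo.png", "/scripts/menu.js"):
            results[(label, agent, path)] = parser.can_fetch(agent, path)
            print(label, agent, path, results[(label, agent, path)])
```

In the first variant Googlebot is allowed into /scripts as well, because its own group says nothing about that directory and the wildcard group no longer applies to it. Repeating the Disallow line in the Googlebot group closes that gap.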
Another common scenario is a rogue robot that you don't want browsing your site at all, while you still want to block your “images” and “scripts” directories from all other robots. Your robots.txt would look something like this:
User-agent: e-collector
Disallow: /

User-agent: *
Disallow: /images
Disallow: /scripts
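To see what each robot is left with under these rules, a small filter can list the paths a given user-agent may still fetch. This is a sketch using Python's urllib.robotparser; the function allowed_paths, the robot name SomeOtherBot, and the sample paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: e-collector
Disallow: /

User-agent: *
Disallow: /images
Disallow: /scripts
"""


def allowed_paths(rules_text, user_agent, paths):
    """Return the subset of paths that user_agent may fetch under rules_text."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return [path for path in paths if parser.can_fetch(user_agent, path)]


# Hypothetical paths on the example site.
PATHS = ["/about/team.html", "/images/logo.png", "/scripts/menu.js"]

print("e-collector:", allowed_paths(RULES, "e-collector", PATHS))
print("SomeOtherBot:", allowed_paths(RULES, "SomeOtherBot", PATHS))
```

The rogue robot ends up with an empty list, while every other robot keeps access to the “about” page only.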
Robots.txt really is that simple. If you know what you want to block, what you want to allow, and the user-agents of the robots you want to guide around your site, you are good to go. One of the great things is that, since it is just plain text and every site needs every robot to be able to read it, you can take a look at the robots.txt of any site that has one. Some fun ones to look at are:
http://www.microsoft.com/robots.txt
http://myspace.com/robots.txt
http://www.cnet.com/robots.txt
http://google.com/robots.txt
http://asp.net/robots.txt
Try it for yourself. Go to your favorite website and then go to /robots.txt and see what their robots.txt looks like.
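The same lookup can be scripted. The sketch below builds the robots.txt address from any page on a site and, if you choose to call it, downloads the file with Python's standard library. The function names and the sample page path are assumptions for illustration, and some sites may refuse requests from scripts.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    # A leading slash makes urljoin replace the whole path with /robots.txt.
    return urljoin(page_url, "/robots.txt")


def fetch_robots_txt(page_url, timeout=10):
    """Download a site's robots.txt and return it as text (needs network access)."""
    with urlopen(robots_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# The page path below is a made-up example.
print(robots_url("http://www.cnet.com/news/index.html"))

# To download a live file, call for example:
# print(fetch_robots_txt("http://www.cnet.com/"))
```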
It is important to realize that not all spiders will obey these commands; robots.txt is just an industry-standard recommendation. Do not consider your robots.txt file to be bulletproof. Scott Forsyth says, "It will be obeyed if they want to obey it."
There are some other options for the robots.txt but these are your most common scenarios. More information can be found by going to your favorite search engine and searching for “robots.txt” (or start here: http://www.robotstxt.org/wc/robots.html).