Sunday, January 28, 2007

Controlling the crawlers

robots.txt is a standard file that any webmaster can place at the root of their web directory. It contains instructions telling web crawlers which pages (or sections of the site) should or should not be crawled and indexed.
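To see how crawlers interpret these rules, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content and the `example.com` URLs are made up for illustration; a real crawler would fetch the file from the site's root instead of parsing an inline string.

```python
from urllib import robotparser

# Hypothetical robots.txt rules: block everything under /private/,
# allow the rest of the site for all user agents.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved crawler checks each URL against the rules before fetching.
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))         # True
print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
```

Note that robots.txt is purely advisory: polite crawlers honor it, but nothing technically prevents a rogue bot from ignoring it.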

Google has started an interesting series of posts on how to use robots.txt to control Googlebot itself.

This is a comprehensive list of all the web robots out there. What's even more interesting is that the list contains almost 300 web crawlers, all crawling our sites every day.