Information. Integration. Distribution: January 2007

robots.txt is a standard file that any webmaster can place in the root directory of a website. It contains instructions that tell web crawlers which pages or directories should or should not be crawled.
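As a rough illustration of how such instructions are read, here is a minimal sketch using Python's standard `urllib.robotparser` module. The sample rules, the `example.com` URLs, the `SomeOtherBot` name, and the `build_parser` helper are all made up for this example; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Googlebot has its own group, so only the /drafts/ rule applies to it.
googlebot_drafts = parser.can_fetch("Googlebot", "http://example.com/drafts/post.html")
googlebot_home = parser.can_fetch("Googlebot", "http://example.com/index.html")

# Any crawler without its own group falls back to the "*" group.
other_private = parser.can_fetch("SomeOtherBot", "http://example.com/private/a.html")
other_home = parser.can_fetch("SomeOtherBot", "http://example.com/index.html")

print(googlebot_drafts, googlebot_home, other_private, other_home)
```

Note that these rules are only a request: a well-behaved crawler checks them before fetching a page, but nothing in the file itself enforces them.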

Google has started an interesting series of posts on how to use robots.txt and control the Googlebot itself.

This is a comprehensive list of the web robots out there. What's even more interesting is that the list contains almost 300 web crawlers, which are crawling our sites every day.


Sunday, January 28, 2007

Controlling the crawlers
