Is your robots.txt file on the clock?

Just recently a strange problem came across my desk that I thought was worth sharing with you. A customer notified us that content from a site she was interested in was not showing up in our results. Wanting to understand why we may or may not have indexed the site, I took a look at the problem and stumbled upon an interesting, but potentially very bad, use of the robots.txt file.

The first time I visited the site, it had a very standard robots.txt file that read:

User-agent: *
Disallow: /cgi-bin/
Disallow: /ads/

But within an hour the robots.txt had changed to:

User-agent: *
Disallow: /cgi-bin/
Disallow: /ads/
Disallow: /products/
Disallow: /home/content/archive/
Disallow: /Survey/
Disallow: /info/
Disallow: /staff/

The new robots.txt file was blocking the Live Search crawler, and all others, from accessing the main content of the site. The webmaster was switching between different robots.txt files multiple times throughout the day as a way of controlling the crawl rate, or the impact of search crawlers on the site.

At Live Search, and in fact with most search engines, changing your robots.txt every few hours can cause problems. When you change your robots.txt file, the changes are treated as definitive until the file is downloaded again. Most search engines download robots.txt only a few times per day and then distribute the allow and disallow rules for each URL to their crawl servers.

When the crawler returns, it does not retry URLs that were disallowed a few hours earlier, and it may fetch content outside of your new directives because it is still working from the previously cached robots.txt file. You may perceive a benefit from changing your file throughout the day, since some search engines will put less load on your server, but this behavior also works against you.
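To make that caching behavior concrete, here is a minimal sketch using Python's urllib.robotparser module. The domain, URLs, and rules are made up for illustration, and "msnbot" simply stands in for a crawler's user agent:

from urllib import robotparser

# A crawler typically downloads robots.txt only a few times per day and
# answers every per-URL allow/disallow question from that cached copy.
cached_rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /ads/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(cached_rules)

# Until the cached copy is refreshed, these answers do not change, even if
# the robots.txt on the server has since been swapped for a stricter one.
for url in ("http://www.example.com/products/widget.html",
            "http://www.example.com/ads/banner.gif"):
    print(url, parser.can_fetch("msnbot", url))

The first URL is reported as allowed and the second as disallowed, because the checks run against the copy fetched earlier, no matter what the file on the server says at the moment.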

The downside to changing your robots.txt file is that your content will dance in and out of the search engine's index, depending on whether it was fetched by crawl servers working from a copy of the rules that authorized it. When content is disallowed, search engines assume you don't want it indexed and will hide the blocked URLs for days to weeks, until they pick up the updated file.

Instead of changing your robots.txt throughout the day, we recommend a fixed robots.txt with the same rules at all times of day and, if needed, the crawl-delay directive to prevent aggressive crawls. In a previous post we discussed how we continue to partner with Google and Yahoo to provide better standards and communication around the Robots Exclusion Protocol. While we don't have complete agreement on how to delay the crawl, the crawl-delay directive is supported by Live Search and Yahoo today. For Google, you can make crawl speed adjustments within Google's Webmaster Central.
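For example, a single robots.txt along these lines keeps the same rules all day and asks crawlers that honor the directive to pause between requests (the delay value, in seconds, is only an illustration; pick one that matches your server's capacity):

User-agent: *
Disallow: /cgi-bin/
Disallow: /ads/
Crawl-delay: 10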

If you have additional questions about robots.txt, or feedback on how we handle robots.txt directives, you can discuss them in our forums.

–Jeremiah Andrick, Program Manager, Live Search Webmaster Center