Crawl delay and the Bing crawler, MSNBot

Search engines such as Bing need to crawl websites regularly, not only to index new content but also to check for changed and removed content. Bing offers webmasters the ability to slow down the crawl rate to accommodate web server load issues.

Using such a setting is not always needed, nor is it generally recommended, but it is available should the need arise. Websites that are small (in terms of page count) and whose content is not updated regularly will probably never need a crawl delay. They will likely see no benefit from one, as the bot automatically adjusts its crawl rate to an appropriate level based on the content it finds with each pass.

Larger sites that have a great many pages of content may need to be crawled more deeply and/or more often so that their latest content may be added into the index.

Should you set a crawl delay?

Many factors affect the crawling of a site, including (but not limited to):

  • The total number of pages on a site (is the site small, large, or somewhere in-between?)
  • The size of the content (PDFs and Microsoft Office files are typically much larger than regular HTML files)
  • The freshness of the content (how often is content added/removed/changed?)
  • The number of allowed concurrent connections (a function of the web server infrastructure)
  • The bandwidth of the site (a function of the host’s service provider; the lower the bandwidth, the lower the server’s capacity to serve page requests)
  • How highly the site ranks (content judged as not relevant won’t be crawled as often as highly relevant content)

The rate at which a site is crawled is an amalgam of all of those factors and more. If a site is highly ranked and has a ton of pages, more of those pages will be indexed, which means it needs to be crawled more thoroughly (and that takes time). If the site’s content is regularly updated, it’ll be crawled more often to keep the index fresh, which better serves search customers (as well as the goals of the site’s webmasters).

As so many factors are involved in the crawl rate, there is no clear, generic answer as to whether you should set a crawl delay. And how long it takes to finish a crawl of a site is also based on the above factors. The bottom line is this: if webmasters want their content to be included in the index, it has to be crawled. There are only 86,400 seconds in a day (leap seconds excluded!), so any delay imposed upon the bot will only reduce the amount and the freshness of the content placed into the index on a daily basis.
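To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the common interpretation of a crawl delay as a minimum number of seconds between requests; since Bing treats the value as a relative throttling factor rather than a literal pause (as explained below), treat these numbers as an upper-bound illustration only.

# Back-of-the-envelope crawl budget: if a crawler paused N seconds between
# requests, the most pages it could fetch from one host in a day is 86400 / N.
SECONDS_PER_DAY = 86_400

for delay in (1, 5, 10):
    max_fetches = SECONDS_PER_DAY // delay
    print(f"Crawl-delay {delay:>2}: at most {max_fetches:,} fetches per day")

# Crawl-delay  1: at most 86,400 fetches per day
# Crawl-delay  5: at most 17,280 fetches per day
# Crawl-delay 10: at most 8,640 fetches per day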

That said, some webmasters, for technical reasons on their side, need a crawl delay option. As such, we want to explain how to set it, what your choices are for the settings, and the implications of doing so.

Delay crawling frequency in the robots.txt file

Bing supports the directives of the Robots Exclusion Protocol (REP) as listed in a site’s robots.txt file, which is stored at the root folder of a website. The robots.txt file is the only valid place to set a crawl-delay directive for MSNBot.

The robots.txt file can be configured to employ directives set for specific bots and/or a generic directive for all REP-compliant bots. Bing recommends that any crawl-delay directive be made in the generic directive section for all bots to minimize the chance of code mistakes that can affect how a site is indexed by a particular search engine.

Note that any crawl-delay directives set, like any REP directive, are applicable only on the web server instance hosting the robots.txt file.
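To illustrate that scope, the sketch below builds the robots.txt URL that governs a given page: each hostname (and protocol/port combination) is answered by its own robots.txt at the root of that host. The example.com hostnames are hypothetical and used purely for illustration.

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # REP directives apply per host, so the governing robots.txt always
    # lives at the root of the same scheme + hostname + port as the page.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/products/widgets.html"))
# http://www.example.com/robots.txt
print(robots_txt_url("http://blog.example.com/2009/08/post.html"))
# http://blog.example.com/robots.txt (a separate file with its own directives)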

How to set the crawl delay parameter

In the robots.txt file, within the generic user agent section, add the crawl-delay directive as shown in the example below:

User-agent: *
Crawl-delay: 1

Note: If you want to change the crawl rate of MSNBot only, you can create another section in your robots.txt file specifically for MSNBot and set the directive there. However, specifying directives for individual user agents in addition to the generic set of directives is not recommended: it is a common source of crawling errors, because bot-specific sections are often not kept in sync with the generic section (the sketch after the example below shows which value a REP-compliant parser would actually apply). An example of a section for MSNBot would look like this:

User-agent: msnbot
Crawl-delay: 1
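If you do end up with both a generic section and an MSNBot-specific section, it is worth double-checking which value a parser will actually apply. Here is a minimal sketch using Python's standard urllib.robotparser module (its crawl_delay() helper is available in Python 3.6 and later; the sample values below are made up):

from urllib import robotparser

# A robots.txt with both a generic section and an msnbot-specific section.
rules = """\
User-agent: *
Crawl-delay: 1

User-agent: msnbot
Crawl-delay: 5
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# The most specific matching section wins, so msnbot gets 5, not 1.
print(parser.crawl_delay("msnbot"))        # 5
print(parser.crawl_delay("SomeOtherBot"))  # 1 (falls back to the * section)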

The crawl-delay directive accepts only positive, whole numbers as values. Consider the value listed after the colon as the relative amount of throttling you want to apply to MSNBot, compared with its default crawl rate. The higher the value, the more the crawl rate is throttled down.

Bing recommends using the lowest value possible, if you must use any delay, in order to keep the index as fresh as possible with your latest content. We recommend against using any value higher than 10, as that will severely affect the ability of the bot to effectively crawl your site for index freshness.

Think of the crawl delay settings in these terms:

Crawl-delay setting    Index refresh speed
No crawl delay set     Normal
1                      Slow
5                      Very slow
10                     Extremely slow

Feedback

The Bing team is interested in your feedback on how the bot is working for your site and, if you decide a crawl delay is needed, which setting works best for getting your content indexed without putting an unreasonable load on your web server. We want to hear from you so we can improve how the bot works in future development.

If you have any questions, comments, feedback, or suggestions about the MSNBot, feel free to post them in our Crawling/Indexing Discussion forum. There’s another SEM 101 post coming soon. Until then…

— Rick DeJarnette, Bing Webmaster Center