To crawl or not to crawl, that is BingBot's question

If you are reading this column, there is a good chance you publish quality content to your web site, which you would like to get indexed by Bing. Usually, things go smoothly: BingBot visits your web site and indexes your content, which then appears in our search results and generates traffic to your site. You are happy, Bing is happy and the searcher is happy.

However, things do not always go so smoothly. Sometimes BingBot gets really excited about your quality content and ends up crawling your web site beyond all expectations, digging deeper and harder than you otherwise wanted. Sometimes you did everything you could to promote your quality content but BingBot still does not visit your site.

As much as robots.txt is a reference tool to control BingBot’s behavior, it is also a double-edged sword that may be interpreted in a way that disallows (or allows) much more than you thought initially. In this column, we will go through the most common robots.txt directives supported by Bing, highlighting a few of their pitfalls, as seen in real-life feedback over the past few months.

Where does BingBot look for my robots.txt file?

For a given page, BingBot looks at the root of the host for your robots.txt file. For example, in order to determine if it is allowed to crawl the following page (and at which rate):

http://us.contoso.com/products.htm

BingBot will fetch and analyze your robots.txt file at:

http://us.contoso.com/robots.txt

Note that the host here is the full subdomain (us.contoso.com), not contoso.com or www.contoso.com. This means that if you have multiple subdomains, BingBot must be able to fetch robots.txt at the root of each one of them, even if all these robots.txt files are identical. In particular, if a robots.txt file is missing from a subdomain, BingBot will not fall back to any other file in your domain, and it will consider itself allowed anywhere on that subdomain. BingBot does not “assume” directives from the robots.txt file of any other host in the same domain.
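To make the lookup rule concrete, here is a minimal Python sketch that derives the robots.txt location from a page URL. It is an illustration only, not BingBot's actual code, and the function name robots_txt_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL at the root of the page's host."""
    parts = urlsplit(page_url)
    # Keep the scheme and the full host (subdomain included);
    # replace the path and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://us.contoso.com/products.htm"))
# http://us.contoso.com/robots.txt
```

Because the host is taken as-is, a page on www.contoso.com maps to a different robots.txt file than a page on us.contoso.com.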

When does BingBot look for my robots.txt file?

Because it would cause a lot of unwanted traffic if BingBot tried to fetch your robots.txt file every single time it wanted to crawl a page on your website, it keeps your directives in memory for a few hours. Then, on an ongoing basis, it tries to fetch your robots.txt file again to see if anything changed.

This means that any change you put in your robots.txt file will be honored only after BingBot fetches the new version of the file, which could take a few hours if it was fetched recently.
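The caching behavior described here can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not BingBot's implementation: the three-hour lifetime is an arbitrary stand-in for "a few hours", and RobotsCache and its fetch callback are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple

# Assumption for illustration: the article only says "a few hours".
CACHE_TTL_SECONDS = 3 * 60 * 60


class RobotsCache:
    """Keep each host's robots.txt text in memory and refetch it once stale."""

    def __init__(self, fetch: Callable[[str], str],
                 ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # hypothetical downloader: host -> robots.txt text
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, host: str) -> str:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]   # still fresh: no request is sent to the site
        body = self._fetch(host)   # stale or missing: fetch the current file
        self._entries[host] = (now, body)
        return body
```

In this model, an edit to robots.txt made right after a fetch stays invisible to the crawler until the cached copy expires, which is why a change can take a few hours to be honored.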

Which directives does BingBot honor?

If there is no specific set of directives for the bingbot or msnbot user agent, then BingBot will honor the default set of directives, defined with the wildcard user agent. For example:

User-Agent: *
Disallow: /useless_folder

In most cases, you want to tell all search engines which URL paths they may crawl and which they may not. Maintaining a single default set of directives for all search engines is also less error-prone, and it is what we recommend.

What if I want to allow only BingBot?

In your robots.txt file, you can choose to define individual sections based on user agent. For example, if you want to authorize only BingBot while all other crawlers are disallowed, you can do this by including the following directives in your robots.txt file:

User-Agent: *
Disallow: /

User-Agent: bingbot
Allow: /
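If you want to sanity-check a file like this before publishing it, one option is Python's standard urllib.robotparser module. Keep in mind that this is Python's own parser, not BingBot's, so treat the result as a rough check only. The crawler name examplebot is made up for the test.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /

User-Agent: bingbot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "http://us.contoso.com/products.htm"
bingbot_allowed = parser.can_fetch("bingbot", page)
# "examplebot" is a made-up crawler name standing in for any other bot.
other_allowed = parser.can_fetch("examplebot", page)
print("bingbot:", bingbot_allowed, "examplebot:", other_allowed)
```

With these directives, the check should report that bingbot may fetch the page and that the other crawler may not.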

A key rule to remember is that BingBot honors only one set of directives, in this order of priority:

1. The section for the bingbot user agent, discarding everything else.
2. The section for the msnbot user agent (for backwards compatibility), discarding everything else.
3. The default section (wildcard user agent).

This rule has two main consequences in terms of what BingBot will be allowed to crawl (or not):

If you have a specific set of directives for the bingbot user agent, BingBot will ignore all the other directives in the robots.txt file. Therefore, if there is a default directive that should apply to BingBot as well, you must copy it to the bingbot section in order for BingBot to honor it.
If you have a specific set of directives for the msnbot user agent (but not for the bingbot user agent), BingBot will honor these. In particular, if you have old directives blocking MSNBot, you are also blocking BingBot altogether as a side effect. The most common example is:

User-agent: msnbot
Disallow: /
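The priority rule can be sketched in Python as follows. This is a simplified model of the rule as described in this column, not BingBot's parser: parse_groups and select_section are hypothetical helpers, and the sketch covers only the choice of section, leaving Crawl-delay aside.

```python
from typing import Dict, List, Optional


def parse_groups(robots_txt: str) -> Dict[str, List[str]]:
    """Map each lowercased user agent to the directive lines of its section."""
    groups: Dict[str, List[str]] = {}
    agents: List[str] = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if in_rules:   # a User-agent line after rules starts a new section
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        else:
            in_rules = True
            for agent in agents:
                groups[agent].append(f"{field}: {value}")
    return groups


def select_section(groups: Dict[str, List[str]]) -> Optional[str]:
    """Pick the single section honored: bingbot, then msnbot, then wildcard."""
    for agent in ("bingbot", "msnbot", "*"):
        if agent in groups:
            return agent
    return None


legacy = "User-agent: *\nDisallow: /private\n\nUser-agent: msnbot\nDisallow: /\n"
groups = parse_groups(legacy)
chosen = select_section(groups)
print(chosen, groups[chosen])
```

For this legacy file the msnbot section wins, so the only rule applied is Disallow: / and the default section's /private rule is ignored.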

Does BingBot honor the Crawl-delay directive?

Yes, BingBot honors the Crawl-delay directive, whether it is defined in the most specific set of directives or in the default one – that is an important exception to the rule defined above. This directive allows you to throttle BingBot and, indirectly, to cap the number of pages it will crawl.

A common misconception is that Crawl-delay represents a crawl rate. It does not. Instead, it defines the size of a time window (from 1 to 30 seconds) during which BingBot will crawl your web site only once. For example, if your crawl delay is 5, BingBot will slice the day into five-second windows, crawling only one page (or none) in each of these, for a maximum of 17,280 pages during the day (86,400 seconds divided by 5).
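The arithmetic behind that daily ceiling is simple enough to sketch in Python. The function name max_pages_per_day is ours, and the result is an upper bound only, since BingBot may crawl nothing in a given window.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def max_pages_per_day(crawl_delay: int) -> int:
    """Upper bound on pages crawled in a day: at most one page per window."""
    if not 1 <= crawl_delay <= 30:
        raise ValueError("this column describes windows of 1 to 30 seconds")
    return SECONDS_PER_DAY // crawl_delay


for delay in (1, 5, 10, 30):
    print(delay, max_pages_per_day(delay))
# 1 86400
# 5 17280
# 10 8640
# 30 2880
```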

This means the higher your crawl delay is, the fewer pages BingBot will crawl. As crawling fewer pages may result in getting less content indexed, we usually do not recommend it, although we also understand that different web sites may have different bandwidth constraints.

Importantly, if your web site has several subdomains, each having its own robots.txt file defining a Crawl-delay directive, BingBot will manage each crawl delay separately. For example, if you have the following directive for both robots.txt files on us.contoso.com and www.contoso.com:

User-agent: *
Crawl-delay: 1

Then BingBot will be allowed to crawl one page at us.contoso.com and one page at www.contoso.com during each one-second window. Therefore, this is something you should take into account when setting the crawl delay value if you have several subdomains serving your content.
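As a rough planning aid, the Python sketch below adds up the per-host ceilings and looks for the smallest crawl delay that keeps the combined total under a daily page budget. The budget is a hypothetical figure you would choose yourself, and both function names are made up for this example.

```python
from typing import Dict, Optional

SECONDS_PER_DAY = 24 * 60 * 60


def site_wide_ceiling(delays_by_host: Dict[str, int]) -> int:
    """Sum of per-host daily maximums, since each host's delay is managed separately."""
    return sum(SECONDS_PER_DAY // delay for delay in delays_by_host.values())


def smallest_delay_within_budget(host_count: int, daily_page_budget: int) -> Optional[int]:
    """Smallest shared delay (1 to 30 s) keeping all hosts together under the budget."""
    for delay in range(1, 31):
        if host_count * (SECONDS_PER_DAY // delay) <= daily_page_budget:
            return delay
    return None  # even a 30-second delay would exceed the budget


print(site_wide_ceiling({"us.contoso.com": 1, "www.contoso.com": 1}))
# 172800
print(smallest_delay_within_budget(2, 86400))
# 2
```

In other words, two subdomains that each set Crawl-delay: 1 together allow up to twice the crawl volume of a single host with the same setting.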

My robots.txt file looks good… what else should I know?

There are some other mechanisms available for you to control BingBot’s behavior. One of them is to define hourly crawl rates through Bing Webmaster Tools (see the Crawl Settings section). This is particularly useful when your traffic is very cyclical during the day and you would like BingBot to visit your web site more outside of peak hours. By adjusting the graph up or down, you can apply a positive or negative factor to the crawl rate automatically determined by BingBot. This lets you fine-tune how much crawl activity takes place at a given time of day. It is important to note that a crawl delay set in your robots.txt file will override the settings you choose in Bing Webmaster Tools, so plan carefully to ensure you are not sending BingBot contradictory messages.