In a blog post last month, we identified the preferred method for providing feedback on any Live Search crawler issues you may have experienced. We would like to thank everyone for taking the time to provide us with this feedback and let you know what we have done on our end. But first, a little background.
Our crawler, MSNBot, performs many different functions. We have blogged about our cloaking detection in the past, so I’m not giving away any secrets by saying that we are actively looking to identify and weed out spammers who use cloaking in malicious ways. We’ve also introduced a new feed crawling function that helps us provide fresher results. In addition, we are introducing version 2.0 of MSNBot, which adds new functions and upgrades several of the old ones.
So, what I’m getting at is, we are busy adding upgraded and updated technologies to provide better search results. That’s good news for you!
Unfortunately, things can and do go wrong from time to time. The initial complaints, that we were over-crawling some servers with our cloaking detector, were compounded by, and sometimes confused with, the release of our new feed crawler, which was also overzealous in its attempt to crawl and provide up-to-the-minute results. However, we have taken all of the feedback you provided and made some improvements.
We have modified the cloaking detector. Using the valuable feedback we received regarding the feed crawling issues, we proactively released a patch late last week that should significantly reduce the number of requests to a more acceptable rate.
What you can do
Help us discover the content changes on your site. You can do this via sitemaps, via per-link meta properties, or via an RSS link that notifies us about your most important content. To spare us from having to monitor many feeds frequently, we recommend aggregating content changes into a few feeds and including the word “Aggregate” somewhere in each feed name. We also suggest referencing those feeds in robots.txt and in your sitemap; both will help us detect the feeds and understand their purpose.
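As an illustration of the suggestion above, here is what pointing crawlers at a sitemap that lists an aggregate feed might look like. All domain names, file paths, and dates below are hypothetical, not part of any official recommendation:

```text
# robots.txt (hypothetical example)
User-agent: *
Sitemap: http://www.example.com/sitemap.xml
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap.xml: lists the aggregate feed so crawlers can find it -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/feeds/aggregate-updates.xml</loc>
    <changefreq>hourly</changefreq>
  </url>
</urlset>
```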
Although we have released a patch for our feed crawler, not all sites are the same, so it is a challenge to gauge a feed crawl rate that is reasonable for every site. If you believe we are still crawling more than necessary, an alternative option is setting a crawl-delay. We urge caution when setting crawl-delay times, as large values can severely hamper our ability to crawl your site. As an example:
A crawl-delay of 5 means we wait 5 seconds between page requests. That means if you have a site with 100 pages, it will take us 8.3 minutes to crawl your entire site; a site with 1,000 pages will take 83 minutes. As you can see, long delays can severely hamper our ability to index fresh content from your site.
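The arithmetic above can be sketched in a few lines. This is just an illustration of the trade-off, not any official tooling; the function name is made up for this example:

```python
def crawl_time_minutes(pages: int, crawl_delay_seconds: float) -> float:
    """Total time to fetch every page when the crawler waits
    crawl_delay_seconds between successive requests."""
    return pages * crawl_delay_seconds / 60

# With a crawl-delay of 5 seconds:
print(round(crawl_time_minutes(100, 5), 1))   # 100 pages  -> 8.3 minutes
print(round(crawl_time_minutes(1000, 5), 1))  # 1000 pages -> 83.3 minutes
```

Doubling the delay doubles the time to refresh your whole site, which is why a small value usually suffices.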
If you are using crawl-delays and we don’t seem to be honoring those directives, please do not block us. Instead, let us know about these issues and we will work together to find out why we aren’t honoring the crawl-delay directive.
For all of these issues, please send us feedback by posting a message in our Indexing and Ranking Discussion forum. Thank you once again. We look forward to working with you in the future!
–Brett Yount, Live Search Webmaster Center