Find out how Live Search is crawling your site

My favorite feature of our recent launch is the Crawl Issues tool, which gives you details about issues Live Search may encounter while crawling and indexing your website. This information can help you better understand what Live Search sees when crawling your site and should ultimately help you improve your results from Live Search.

[Screenshot: the Crawl Issues tool]

We report four types of issues:

  • File Not Found (404) Errors – reported when Live Search encountered a “404 File Not Found” HTTP status code the last time it attempted to load a URL.
  • Pages Blocked by Robots Exclusion Protocol (REP) – reported when Live Search has been prevented from indexing a page, or from displaying a cached copy of it, by a rule in your robots.txt file or robots meta tags (collectively, the Robots Exclusion Protocol, or REP).
  • Long Dynamic URLs – reported when Live Search encounters a URL with an exceptionally long query string. Because their parameters can combine in a nearly unlimited number of ways, these URLs can trap a search engine in an effectively infinite crawl space, so they are often not crawled.
  • Unsupported Content-Types – reported when a page either specifies a content type that Live Search does not support, or doesn’t specify any content type at all. Examples of supported content types are text/html, text/xml, and application/PowerPoint. (A quick way to spot-check a URL for 404 and content-type issues is sketched after this list.)
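
If you want to spot-check a single URL for a couple of these issues yourself, a few lines of scripting will do it. Here’s a minimal sketch in Python using only the standard library; the URL and the list of supported content types below are illustrative assumptions on my part, not the crawler’s actual configuration:

    import http.client
    from urllib.parse import urlparse

    # Illustrative list based on the examples above; MIME types are
    # compared case-insensitively.
    SUPPORTED_TYPES = {"text/html", "text/xml", "application/powerpoint"}

    def spot_check(url):
        parts = urlparse(url)
        conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc, timeout=10)
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()

        if resp.status == 404:
            print(f"{url}: 404 File Not Found")

        # The header may carry a charset suffix, e.g. "text/html; charset=utf-8"
        ctype = (resp.getheader("Content-Type") or "").split(";")[0].strip()
        if not ctype:
            print(f"{url}: no content type specified")
        elif ctype.lower() not in SUPPORTED_TYPES:
            print(f"{url}: unsupported content type {ctype!r}")

    spot_check("http://www.example.com/some-page")  # hypothetical URL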

For each of these types of issues, we show you the first 20 results on the Crawl Issues page, and allow you to download up to the first 1,000 results in a CSV file that opens easily in Excel. For large websites with potentially thousands of issues, we’ve supplied a filter option that allows you to scope the results by subdomain or by subfolder. For example, if you were the webmaster for microsoft.com and there were 250,000 file-not-found results, you could filter them by “support.microsoft.com” or “support.microsoft.com/kb” to see just the issues from a particular section of your website. Generally, we support up to 2 levels of subdomains and 2 levels of subfolders per URL, but a website may have fewer available.

[Screenshot: the crawl issue filter box]
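
To make the filter behavior concrete, here is a rough Python sketch of how those scopes could be derived from a URL. This is my own illustration of the “two levels of subdomains, two levels of subfolders” rule described above, not the tool’s actual logic, and it naively treats the last two host labels as the registered domain:

    from urllib.parse import urlparse

    def filter_scopes(url, max_subdomains=2, max_subfolders=2):
        parts = urlparse(url)
        labels = parts.netloc.split(".")
        # Build host scopes from least to most specific, e.g.
        # microsoft.com, then support.microsoft.com
        depth = min(max_subdomains, len(labels) - 2)
        hosts = [".".join(labels[-(2 + i):]) for i in range(depth + 1)]
        scopes = list(hosts)
        # Then extend the most specific host with up to two folder levels
        host, path = hosts[-1], ""
        segments = [s for s in parts.path.split("/") if s]
        for segment in segments[:max_subfolders]:
            path += "/" + segment
            scopes.append(host + path)
        return scopes

    print(filter_scopes("http://support.microsoft.com/kb/q123456/default.aspx"))
    # ['microsoft.com', 'support.microsoft.com',
    #  'support.microsoft.com/kb', 'support.microsoft.com/kb/q123456']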

Once you’ve created the filter that gives you just the URLs you need, you can download the results in CSV format and email them to the webmaster who owns that part of the website. This gives them a clear idea of the issues that need to be fixed.
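
If you’d rather script that handoff than click through the filter UI, the downloaded CSV is easy to slice. A hedged sketch, assuming the export has a column named "Url" (the actual column headings in the export may differ):

    import csv

    def rows_for_scope(in_path, out_path, prefix):
        with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                # Strip the scheme so the prefix matches http and https alike
                bare = row["Url"].split("://", 1)[-1]
                if bare.startswith(prefix):
                    writer.writerow(row)

    # e.g. pull just the KB team's 404s out of the full export
    rows_for_scope("file-not-found.csv", "kb-404s.csv", "support.microsoft.com/kb")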

Let’s take an example site and see how you might use this tool; fortunately, microsoft.com is always willing to help us out here. Microsoft.com is a gigantic website, with more than 300 full-time people working on it across developers, IT personnel, marketers, and content authors. And they have almost every type of legacy system you can think of, so it is no wonder that they experience almost every type of issue there is. For example, the site shows about 218,000 File Not Found errors. That alone is far too many to work through at once, so I usually scan the first couple hundred results to see which parts of the website are affected, or I start with the most important subdomains.

One of the most highly trafficked portions of the site is the popular support knowledge base (KB articles). These articles document security fixes and other support issues, so let’s drill in there. Looking through the 404 pages from that section, one of the first issues I notice is a series of URLs that look like this: https://mvp.support.microsoft.com/default.aspx/profile/hongfeng.liu. Adding to the mystery, when I pull the page up in a browser, it loads perfectly. Hmmm, is this the first bug in our tools? With a little more research using Live HTTP Headers, I discover this page is the result of some funky redirecting and status codes. Here’s what’s going on:

[Diagram: the http-to-https redirect chain that ends in a 404]

The “http:” version of the page is 302 redirecting to the “https:” version, which returns a 404 File Not Found status code while still displaying a valid page. Because the page renders correctly in a browser, this type of issue can be difficult to detect manually. But now that I’ve figured out what’s happening, I can use the filter functionality to generate a list of all 160 URLs that appear to be affected, download them as a CSV file, and email them to the site manager who owns that part of microsoft.com.
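
You don’t need a browser extension to see a chain like this. Here’s a small Python tracer in the same spirit as Live HTTP Headers: it fetches each hop without auto-following redirects and prints the status code, so a 302-to-404 chain stands out immediately. The expected output below is based on the behavior described above, not a live test:

    import http.client
    from urllib.parse import urlparse, urljoin

    def trace(url, max_hops=5):
        for _ in range(max_hops):
            parts = urlparse(url)
            conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                        else http.client.HTTPConnection)
            conn = conn_cls(parts.netloc, timeout=10)
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("GET", path)
            resp = conn.getresponse()
            print(resp.status, url)
            location = resp.getheader("Location")
            if resp.status in (301, 302, 303, 307, 308) and location:
                url = urljoin(url, location)  # follow the redirect by hand
            else:
                return

    trace("http://mvp.support.microsoft.com/default.aspx/profile/hongfeng.liu")
    # Expected, given the chain described above:
    # 302 http://mvp.support.microsoft.com/default.aspx/profile/hongfeng.liu
    # 404 https://mvp.support.microsoft.com/default.aspx/profile/hongfeng.liu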

Hopefully folks will find this tool useful in diagnosing issues within their own sites as well. Please let us know if you have any questions or comments.

–Nathan Buggia, Lead Program Manager, Webmaster Center