Wondering why your site isn’t as “indexed” as you’d like? Check your crawler control signals before getting too upset. We see a lot of websites making mistakes that block indexing. While we’d like to ignore those mistakes and simply ingest the content anyway, that’s not always in your best interest, so it’s far safer for us to follow your instructions.
Be sure to read “How To Verify Bingbot is Bingbot” and “To Crawl or Not to Crawl, That is Bingbot’s Question” when you’re done with this article, as those articles can help fill in some details as well.
On a recent trip, I met with 22 startups and half of them complained of indexing issues. That same half had a real problem in their robots.txt files. The rest were mostly safe, not having the file in the first place.
The problem the first half had? Well, I’ll let you see if you can spot the problem. This was their robots.txt file:
User-agent: *
Disallow: /
(The “User-agent: *” means the command applies to all robots and the “Disallow: /” tells the robot that it should not visit any pages on the site.)
Obviously, this would pose a problem for them as their pages and content simply wouldn’t be indexed by Bing. Bingbot honors the robots.txt directives. Other search engines may not honor this command (as was evidenced by the content indexed from the sites). While not getting indexed is one form of problem, having content indexed you didn’t want indexed can be even thornier.
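If you’d like to test a robots.txt file before a crawler does, here is a minimal sketch using Python’s standard urllib.robotparser module. The helper name is_allowed and the example.com URL are ours, for illustration only, and Python’s parser is a stand-in: it is not Bingbot’s own implementation and may differ from it in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The two directives described above: every robot, every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if this robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# example.com is a placeholder; substitute one of your own URLs.
print(is_allowed(ROBOTS_TXT, "bingbot", "https://example.com/products/widget"))
```

With the file above, the check comes back False for any URL on the site, which is exactly the problem those startups had.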
While blocking via robots.txt can be obvious, other settings can be just as damaging to our ability to index your content.
Within Bing Webmaster Tools is the Crawl Control feature, which enables you to control the pace of crawling and what time of day Bingbot crawls your site. Pretty handy tool, but you have to understand how to use it. Bingbot checks this feature for commands on interacting with your website, so setting the crawl rate to its lowest setting tells Bingbot to crawl you very lightly and slowly. For large websites, this will make it hard to crawl all your content.
Changing the setting to “full volume” will increase the crawl rate, but be ready for the load this places on your servers, or indexing might not be the only issue you face. For the vast majority of sites online today, this won’t be an issue. Maxing out the crawl rate won’t shut you down, but be sure you know what the added load is doing to visitors trying to access your site at the same time. The controls allow you to set the crawl rate at various times across your day, so you can easily manage things. Maximize your crawl rate when your visitors are not on the site, and slow crawling down when they are.
You also need to ensure you’re not selectively blocking Bingbot. If you are, your expectations for indexing should be low, as no crawling means no indexing.
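One place a selective block can hide is a robots.txt group aimed at a single crawler. As a rough sketch, the Python below lists which user agents a given robots.txt shuts out of a URL. The function name blocked_agents, the “examplebot” agent, and the example.com URL are made up for illustration, and the check covers robots.txt only, not blocks applied at the server or firewall.

```python
from urllib.robotparser import RobotFileParser

def blocked_agents(robots_txt, url, agents):
    """Return the agents from `agents` that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]

# Hypothetical file: Bingbot is shut out, everyone else is allowed.
SELECTIVE_ROBOTS_TXT = """\
User-agent: bingbot
Disallow: /

User-agent: *
Disallow:
"""

print(blocked_agents(
    SELECTIVE_ROBOTS_TXT,
    "https://example.com/products/",
    ["bingbot", "examplebot"],
))
```

If “bingbot” shows up in the result while other agents do not, the file is treating Bingbot differently from everyone else.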
Queries Per Second limits
Another common issue we encounter when trying to crawl sites is a limit on the number of queries per second (QPS) we can make to the server. These limits are often put in place to protect the systems, which is completely reasonable. Unfortunately, many system administrators set very low limits, unaware of the issues this can cause for your SEO work. This is especially problematic for large websites and sites that produce fresh content in large volumes. Similar to the point made above, if we can’t fetch the content from the server, we can’t index it.
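To get a feel for what a low QPS limit means for a large site, here is a back-of-the-envelope calculation in Python. It is a simplification under stated assumptions: the crawler uses the full allowance around the clock and requests each URL exactly once. Real crawling does neither, so treat the result as a best case. The function name and the numbers are hypothetical.

```python
SECONDS_PER_DAY = 86_400

def days_to_crawl(url_count, qps_limit):
    """Best-case days needed to request url_count URLs once at qps_limit requests per second."""
    if qps_limit <= 0:
        raise ValueError("qps_limit must be positive")
    return url_count / qps_limit / SECONDS_PER_DAY

# Hypothetical site: one million URLs, limited to one request every two seconds.
print(round(days_to_crawl(1_000_000, 0.5), 1))  # 23.1
```

At that rate a single pass over the site takes more than three weeks, before any fresh content or recrawls are counted.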
Hints and help
Inside Bing Webmaster Tools, you’ll find the Fetch as Bingbot tool. Drop a URL in there and see what comes back. Because that feature adheres to the robots.txt protocol, if you’re blocking us in the robots.txt file, you’ll see a message from the tool noting that crawling is blocked.
For that matter, dig deeper and make sure there is nothing else at the server level that’s blocking crawlers. Talk to your IT folks. If you’re working from a hosted solution, check with your host to make sure they aren’t blocking crawlers. We see this happen frequently as hosting companies seek to protect their servers and limit bandwidth consumption. In most cases, the website owners are completely unaware this blocking is even happening.
Check your crawl error reports. Many times we’ll ping websites and get 500 errors returned. When this happens, we back off and try again later. There’s no sense continually hounding a server that’s obviously having problems.
Watch your 404 pages as well. While you can’t really have “too many” 404 returns, a large number could indicate a bigger issue. If someone moved content to a new folder and didn’t implement 301 redirects, it’ll take us time to recrawl the site and index the new pages, and you’ve lost the value the 301s could have transferred to those pages. Possibly more problematic is that 404 responses start the “release from index” process for pages that don’t return the expected content when called. If you intend to bring the content back to its original location, you could be setting yourself up to lose traffic while the URLs that 404’d are removed and then must be recrawled and indexed again.
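Your own server logs can back up what the crawl error reports show. The sketch below tallies the status codes returned to requests whose user-agent string mentions bingbot and collects the paths that returned 404. It assumes access logs in the common “combined” format; the sample lines, IP addresses, and paths are invented. A user-agent string can be spoofed, so this only counts requests that claim to be Bingbot.

```python
import re
from collections import Counter

# Matches the request and status fields of a combined-format log line,
# e.g. "GET /some/path HTTP/1.1" 404
REQUEST_AND_STATUS = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

def bingbot_status_summary(log_lines):
    """Count status codes served to self-declared Bingbot requests and list 404 paths."""
    counts = Counter()
    not_found_paths = []
    for line in log_lines:
        if "bingbot" not in line.lower():
            continue  # not a request claiming to be Bingbot
        match = REQUEST_AND_STATUS.search(line)
        if match is None:
            continue  # line is not in the assumed format
        status = int(match.group("status"))
        counts[status] += 1
        if status == 404:
            not_found_paths.append(match.group("path"))
    return counts, not_found_paths

BINGBOT_UA = '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'

# Invented sample data for illustration.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /new/page HTTP/1.1" 200 5120 "-" ' + BINGBOT_UA,
    '203.0.113.5 - - [10/Oct/2023:13:55:40 +0000] "GET /old/page HTTP/1.1" 404 512 "-" ' + BINGBOT_UA,
    '203.0.113.5 - - [10/Oct/2023:13:55:44 +0000] "GET /search HTTP/1.1" 500 0 "-" ' + BINGBOT_UA,
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET /missing HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
]

counts, not_found_paths = bingbot_status_summary(SAMPLE_LOG)
print(dict(counts))
print(not_found_paths)
```

Paths that show up in the 404 list after a site move are good candidates for 301 redirects, and a cluster of 500s suggests the server was struggling when the crawler visited.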
So be sure to investigate all the options if you think you’re suffering indexing issues. And watch those robots.txt files. They’re powerful and we listen to them.