Better than canonical: URL Normalization

At Bing, we have to crawl and index The Internet, and The Internet is pretty big. One question that Fabrice Canel, a veteran of search here, commonly asks people is: how big is the Internet? How many URLs are out there?

The answers that come back range from “a few thousand” (OK, you clearly don’t use The Internet), to “a few million” (OK, good start, plenty of people on Facebook, but please continue thinking), to “a few billion” (more than 1 billion people use the Internet and have some kind of profile page). This is all good thinking, but keep going. A few trillion, maybe? (Well, now the numbers are becoming so huge that they have different meanings per country.) After a while, some talented people get it right: the real number of URLs on the Internet is “infinity”.

The Internet is so huge that we can crawl forever and keep discovering new links. We find not only regular, good content but also plenty of unexpected content that is not necessarily relevant for search engines, such as a “Next Day” link in a calendar hosted in web pages, keyword tag cloud links, etc. The list is almost endless.
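To make the calendar case concrete, here is a minimal sketch. The `example.com` host, the `date` parameter, and the `next_day_urls` name are all made up for illustration; the point is only that a single “Next Day” link produces a brand-new URL at every step, so a crawler that keeps following it never runs out of pages.

```python
from datetime import date, timedelta
from itertools import islice


def next_day_urls(start, base="https://example.com/calendar"):
    """Yield the URL behind each successive "Next Day" link, without end."""
    day = start
    while True:
        yield f"{base}?date={day.isoformat()}"
        day += timedelta(days=1)


# Take only the first three URLs; the generator itself never stops.
first_three = list(islice(next_day_urls(date(2024, 1, 30)), 3))
for url in first_three:
    print(url)
```

Every one of these URLs is technically valid and distinct, which is why counting URLs is a losing game.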

Webmasters can help the search engines and themselves by removing useless and duplicate content from the crawler’s path: make sure we don’t see it. Guiding search engines to your most relevant pages and helping them discard less relevant pages is an SEO recipe for success. We have a couple of blog posts related to this idea that “less is more”, such as our post on optimization guidelines for large sites and our post Building Websites Optimized for All Platforms.

Today, we want to remind folks that among the solutions available for fixing duplicate content on your site, the canonical tag is not necessarily the perfect fix for every problem. Let me share an example of a site outputting hundreds of millions of URLs, including plenty of dupes, and I’ll explain what happened when the search engines started implementing canonical tags. Names will not be named, but this is a real example.

When a page (the canonical source) carries a canonical tag pointing to another URL (the canonical destination), search engines will follow the link to the destination. They will also continue to visit the canonical source URLs, because the canonical destination may change and because they check the similarity between the content of the source and the content of the destination. By visiting canonical source pages, search engines waste the limited crawl bandwidth they generally have assigned to a website and, worse, may continue discovering inside the canonical source page plenty of URLs with extra, useless URL parameters. This extra crawling and processing impacts the overall quality of the site in our index and can create unneeded crawl load on websites. The testing of differences between sources and destinations may also affect the transfer of signals from source to destination.
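The reason the source keeps getting crawled is easy to see in code: the canonical tag lives inside the source page, so the page has to be fetched and parsed before the tag can be read. The sketch below is not how our crawler is built; it is a small illustration using Python’s standard `html.parser`, and the `CanonicalFinder` class, the `find_canonical` helper, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical destination named in the page, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# A made-up canonical source page, as it would look after being downloaded.
SOURCE_PAGE = """
<html><head>
<link rel="canonical" href="https://example.com/product/42">
</head><body>...</body></html>
"""

destination = find_canonical(SOURCE_PAGE)
print(destination)
```

Note that `find_canonical` needs the full HTML as input. Nothing about the source URL alone reveals its destination, which is exactly why the crawl cost is paid before the duplicate is recognized.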

So we want to tell you that at Bing, we have a better solution to your duplicate problems than relying only on the canonical tag.

When you have duplicate problems due to extra URL parameters, using the URL Normalization feature in Bing Webmaster Tools is the preferred method, as you are telling us which parameters can go away; we call this URL normalization. By normalizing, we mean that our crawler will not visit the URLs with extra parameters, except for an occasional test of the quality of the normalization rules. You benefit through less pointless crawling, which reduces resource load on your server, a fresher copy of the canonical destination URLs in our index, and fewer out-links being discovered with extra parameters. You can still use the canonical tag on these pages as complementary information to the URL Normalization rules provided. Another advantage of using URL Normalization in Bing Webmaster Tools is that you don’t need to wait for your engineering team to implement the canonical tag to fix your problem. You can implement these normalization rules right away in the tool without any code change. You can also use an HTTP 301 redirect instead of the canonical tag when possible.
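To picture what a normalization rule does, here is a rough sketch of the idea: drop the parameters that were declared ignorable and see which URLs collapse into one. This is only an illustration, not the tool’s actual implementation; the `normalize` function and the parameter names in `IGNORED_PARAMS` are assumptions chosen for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change the page content on this site.
IGNORED_PARAMS = {"sessionid", "utm_source", "ref"}


def normalize(url, ignored=IGNORED_PARAMS):
    """Return the URL with every ignorable query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


duplicates = [
    "https://example.com/item?id=42&sessionid=abc",
    "https://example.com/item?id=42&utm_source=mail",
    "https://example.com/item?sessionid=xyz&id=42&ref=home",
]

# All three variants collapse to a single URL once the extras are dropped.
distinct = {normalize(url) for url in duplicates}
print(distinct)
```

Unlike the canonical tag, this decision can be made from the URL alone, before any page is downloaded.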

The solution to some major duplicate content problems you may have can be as easy as a 5-minute task: connect to Bing Webmaster Tools and suggest URL Normalization rules. To help you, the tool shows the most common duplicate URL parameters detected on your site. You can also review what we index via the Index Explorer feature to see any other parameters you’d like us to skip when crawling.
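If you want to look over your own URLs before suggesting rules, a quick tally of parameter names is a reasonable starting point. The sketch below is a do-it-yourself approximation, not what the tool runs; the `parameter_counts` helper and the sample URLs are hypothetical, and in practice the list would come from your server logs or sitemap.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def parameter_counts(urls):
    """Count how many URLs use each query parameter name."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # A set, so a name repeated inside one URL is counted only once.
        names = {name.lower() for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


SAMPLE_URLS = [
    "https://example.com/item?id=1&sessionid=a1",
    "https://example.com/item?id=1&sessionid=b2&ref=home",
    "https://example.com/item?id=2&sessionid=c3",
    "https://example.com/list?page=2",
]

counts = parameter_counts(SAMPLE_URLS)
for name, total in counts.most_common():
    print(name, total)
```

Frequency alone does not say whether a parameter is safe to drop. In this sample `id` is as common as `sessionid`, but it selects the content, so only parameters that leave the page unchanged belong in a normalization rule.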

Thanks for your help in removing duplicate content from your sites and in keeping a lid on infinity.