Optimizing your very large site for search — Part 1

At Live Search, one of the most common questions we receive from our peers at microsoft.com and msn.com is how to optimize their sites for search. But microsoft.com is unlike most other sites on the Internet. It is huge, containing millions of URLs, and is growing all the time. Large content sites like microsoft.com and msn.com are not the only sites that can produce a practically unlimited number of URLs, however. Large ecommerce sites and government agency sites do so as well. As with any site, our original recommendations on how to rank in Live Search are still important. But we have given the problem some thought and want to provide some recommendations oriented toward very large sites. Over the next several posts, we will discuss topics that may help your site, especially if it is very large.

Less is more

By producing lots of content, a site exposes a huge surface area for the search engines to crawl. This can lead to sub-optimal results or underperforming pages. As a site grows, the number of URLs produced will also grow, requiring the search engine bot to dig deeper and work harder. One way to control the growth of URLs is to ensure that you expose only one URL per piece of content. This is the process of canonicalization, although it is sometimes referred to as normalization. Canonicalization encompasses a number of issues related to the selection of the best URL for a page. For most site owners, canonicalization is a familiar topic because they have selected (or should have selected) the preferred form of their home page URL, for example with or without the www prefix. The following are all valid ways to get to microsoft.com:
  • microsoft.com
  • www.microsoft.com
  • www.microsoft.com/en/us/default.aspx
  • www.microsoft.com/en/us/

For a large site that uses complex URLs for tracking, co-branding, or ecommerce activities, managing canonicalization is a much larger and more important issue. To reduce the risk of creating duplicate URLs and diluting your link equity, the following are some additional conversions that should be made:

  • Remove the index or default document name
    http://www.mysite.com/default.aspx to http://www.mysite.com/
    http://www.mysite.com/en/us/default.aspx to http://www.mysite.com/en/us
  • Avoid CamelCase; convert your text to lowercase
    http://www.mysite.com/FooBar/ to http://www.mysite.com/foobar
  • Remove query string variables or rewrite them into readable URLs
    http://www.mysite.com/downloads/details.aspx?FamilyID=ab99&displaylang=en to http://www.mysite.com/downloads/en/family/ab99
  • Remove port numbers
    http://www.mysite.com:8080/ to http://www.mysite.com
  • Avoid exposing the secure HTTPS version
    https://www.mysite.com/en/us/ to http://www.mysite.com/en/us
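Most of the conversions above can be sketched in a few lines of code. The following is a minimal illustration, not a definitive implementation. The function name canonicalize and the list of default document names are assumptions made for the example. It keeps whatever trailing slash the input had, it leaves the query string untouched because rewriting query variables into readable paths is specific to each site, and it assumes your server treats paths as case-insensitive.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default-document names; adjust this set for your own server.
DEFAULT_DOCUMENTS = {"default.aspx", "default.htm", "index.html", "index.htm"}


def canonicalize(url: str) -> str:
    """Return one possible canonical form of url (hypothetical helper)."""
    parts = urlsplit(url)
    # .hostname drops any port number and is already lowercased.
    host = parts.hostname or ""
    # Lowercase the path (assumes the server is case-insensitive).
    path = parts.path.lower()
    # Drop a trailing default document: /en/us/default.aspx -> /en/us/
    head, _, last = path.rpartition("/")
    if last in DEFAULT_DOCUMENTS:
        path = head + "/"
    if not path:
        path = "/"
    # Force plain HTTP as suggested in the list above; drop any fragment.
    return urlunsplit(("http", host, path, parts.query, ""))


print(canonicalize("https://www.mysite.com:8080/FooBar/"))
print(canonicalize("http://www.mysite.com/en/us/default.aspx"))
```

A helper like this could be used when generating links or sitemaps so that every URL your site emits goes through the same rules.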

My URLs are a mess. Now what?

The best option for webmasters is to consider canonicalization when they are planning their site. Developing an architecture and site plan that considers canonical forms from the start will prevent a number of headaches later. Unfortunately, many large sites were developed before optimization became an issue. Because many site owners already have challenges with canonicalization, here are a few solutions for correcting the situation on existing sites:

301 redirect to the canonical form

For a small site, this can be done fairly easily. However, the more canonicalization issues your site has, the more complicated the solution will be. The first step is to determine which canonicalization issues your site suffers from, and then to build a solution that takes into account both the need to scale and your performance requirements. For some examples of how to do this, I recommend reading Tony Spencer's article on .htaccess, 301 Redirects & SEO.
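To make the idea concrete, here is a minimal sketch of a 301 redirect to the canonical form, written as a small Python WSGI application. It is not taken from the article mentioned above. The canonical host www.mysite.com, the default document names, and the function names are all assumptions chosen for illustration.

```python
from typing import Optional

CANONICAL_HOST = "www.mysite.com"  # assumed canonical host name
DEFAULT_DOCUMENTS = {"default.aspx", "default.htm", "index.html"}  # assumed


def redirect_target(host: str, path: str, query: str = "") -> Optional[str]:
    """Return the URL to 301 to, or None if the request is already canonical."""
    clean_path = path.lower()
    head, _, last = clean_path.rpartition("/")
    if last in DEFAULT_DOCUMENTS:
        clean_path = head + "/"  # strip the default document name
    # A host that carries a port (www.mysite.com:8080) also fails this check.
    if host.lower() == CANONICAL_HOST and clean_path == path:
        return None
    suffix = "?" + query if query else ""
    return "http://" + CANONICAL_HOST + clean_path + suffix


def app(environ, start_response):
    """WSGI entry point: redirect non-canonical requests, serve the rest."""
    target = redirect_target(
        environ.get("HTTP_HOST", ""),
        environ.get("PATH_INFO", "/"),
        environ.get("QUERY_STRING", ""),
    )
    if target is not None:
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"canonical page"]
```

On a large production site the same rule would more likely live in the web server's own rewrite configuration, where it can be applied before the request reaches application code.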

 

Use a consistent linking convention

Whatever you choose as your canonical form, never deviate from that form in your sitemap or your internal link structure. For example, always link to mysite.com, rather than sometimes linking to mysite.com/default.htm. You can't enforce how people link to your site externally, but you should ensure your convention is enforced internally. This can be a long process to enact on a large site, so it may help to start with your most trafficked pages where there is the most impact and work your way through the site to your least important pages. 
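One way to check that a convention is being followed is to scan your pages for internal links that deviate from it. The sketch below is only an illustration under stated assumptions: the canonical host www.mysite.com, the site domain mysite.com, the default document names, and the helper names are all hypothetical. It looks only at absolute links in anchor tags and ignores relative and external links.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

CANONICAL_HOST = "www.mysite.com"  # assumed canonical host name
SITE_DOMAIN = "mysite.com"         # assumed domain that counts as internal
DEFAULT_DOCUMENTS = {"default.aspx", "default.htm", "index.html"}  # assumed


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def non_canonical_links(html: str) -> list:
    """Return internal absolute links that break the linking convention."""
    collector = LinkCollector()
    collector.feed(html)
    flagged = []
    for href in collector.hrefs:
        parts = urlsplit(href)
        host = parts.hostname  # None for relative links
        if host is None:
            continue
        if host != SITE_DOMAIN and not host.endswith("." + SITE_DOMAIN):
            continue  # external link, not ours to enforce
        last_segment = parts.path.rpartition("/")[2].lower()
        if (host != CANONICAL_HOST
                or last_segment in DEFAULT_DOCUMENTS
                or parts.path != parts.path.lower()):
            flagged.append(href)
    return flagged


page = '<a href="http://www.mysite.com/">home</a> <a href="http://mysite.com/default.htm">home</a>'
print(non_canonical_links(page))
```

Running a check like this over your most trafficked pages first matches the order of work suggested above.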

Don't link to multiple versions of the page

I recently received a question from a publisher on the issue of co-branding. This is where a site may have multiple versions of a single page that it serves with different brands applied, each with a different URL. For example, the Microsoft download page can be reached directly or from any number of branded pages. The URLs for such a download could read:

  • http://downloads.microsoft.com/downloads/EP00688359?id=BA Brand A skin to the page
  • http://downloads.microsoft.com/downloads/EP00688359?id=BB Brand B skin
  • http://downloads.microsoft.com/downloads/EP00688359?id=BC Brand C skin

The content is the same across all the pages other than the brand skin on the page. For a large site like microsoft.com, this means that the same page could be generated over and over, creating additional URLs for the search engine to crawl. The ideal situation for co-branded pages is to have a single page that the search engine bot sees and to block access to any other, branded versions. Nathan Buggia wrote a very detailed post at Jane and Robot on how to deal with URL referrer tracking. In it, he offers several patterns that will help in this situation and others.
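As a small illustration of collapsing branded URLs to a single one, the sketch below strips the brand parameter from the example URLs above. It is not one of the patterns from the post mentioned above. It assumes that the id query variable carries only the brand and never changes the content, and the name unbranded_url is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query variables select only the brand skin.
BRAND_PARAMS = {"id"}


def unbranded_url(url: str) -> str:
    """Return url with brand-only query variables removed (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in BRAND_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


for brand in ("BA", "BB", "BC"):
    print(unbranded_url("http://downloads.microsoft.com/downloads/EP00688359?id=" + brand))
```

All three branded URLs map to the same unbranded URL, which is the one you would want the search engine bot to see.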

Use absolute links

Once you have done the items above, or if you are starting from scratch, you may want to consider moving from relative to absolute links. A relative link is resolved against the URL of the page that contains it, while an absolute link spells out the full URL. While there are performance and architectural reasons that webmasters use relative links, the goal of this exercise is to bring consistency, and absolute links tend to make doing so easier. A relative link will look like:

"../downloads/EP00688359"

The absolute form will look like:

"http://www.mysite.com/downloads/ep00688359"

While there are other issues that can increase the number of URLs, solving the issue of canonicalization will help both your customers and us at Live Search get more of the URLs that matter.
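For the move from relative to absolute links described in the section on absolute links, a conversion can be sketched with the standard library. The page URL http://www.mysite.com/products/page.aspx and the name to_absolute are assumptions for illustration, and the path is lowercased on the assumption that your server is case-insensitive.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def to_absolute(page_url: str, href: str) -> str:
    """Resolve href against the page that contains it and lowercase the path."""
    parts = urlsplit(urljoin(page_url, href))
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path.lower(),
         parts.query, parts.fragment)
    )


# Hypothetical page that contains the relative link from the example above.
print(to_absolute("http://www.mysite.com/products/page.aspx",
                  "../downloads/EP00688359"))
```

Note that the result depends on where the containing page lives, which is one reason relative links are harder to keep consistent across a large site.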

Coming up next

In our next post on large sites, we will discuss the benefits and implementation of HTTP compression and conditional GET requests. As always, if you have additional questions, feel free to ask in our forums.

Fabrice Canel and Jeremiah Andrick – Program Managers, Live Search Webmaster