Optimizing your very large site for search — Part 1

At Live Search, one of the most common questions we receive from our peers at microsoft.com and msn.com is how to optimize their sites for search. But microsoft.com is unlike most other sites on the Internet: it is huge, containing millions of URLs, and it is growing all the time. Large content sites like microsoft.com and msn.com are not the only sites that can produce a practically unlimited number of URLs, however. Large ecommerce sites and government agency sites also generate very large numbers of URLs.

As with any site, our original recommendations on how to rank in Live Search are still important. But we’ve given it some thought and wanted to provide some recommendations oriented toward very large sites. Over the next several posts, we will be discussing topics that may help your site, especially if it is very large.

Less is more

By producing lots of content, a site exposes a huge surface area for the search engines to crawl. This can lead to sub-optimal results or underperforming pages. As a site grows, the number of URLs produced will also grow, requiring the search engine bot to dig deeper and work harder. One way to control the growth of URLs is to ensure that you are only exposing one URL per piece of content. This is the process of canonicalization, although it is sometimes referred to as normalization.

Canonicalization encompasses a number of issues related to the selection of the best URL for a page. For most site owners, the issue of canonicalization is a familiar topic because they have selected (or should have selected) the preferred version of their domain that they want links to point to. For example, the following are all valid ways to get to microsoft.com:

  • microsoft.com
  • www.microsoft.com
  • www.microsoft.com/en/us/default.aspx
  • www.microsoft.com/en/us/

For a large site that uses complex URLs for tracking, co-branding, or ecommerce activities, managing canonicalization is a much larger and more important issue. To reduce the risk of creating duplicate URLs and diluting your link equity, the following are some additional conversions that should be made:

  • Add or remove the trailing /

http://www.mysite.com/ to http://www.mysite.com

  • Remove the index or default

http://www.mysite.com/default.aspx to http://www.mysite.com
http://www.mysite.com/en/us/default.aspx to http://www.mysite.com/en/us

  • Avoid CamelCase — convert your text to lower case

http://www.mysite.com/FooBar/ to http://www.mysite.com/foobar

  • Remove query string variables or rewrite to readable URLs

http://www.mysite.com/downloads/details.aspx?FamilyID=ab99&displaylang=en to http://www.mysite.com/downloads/en/family/ab99

  • Remove port numbers

http://www.mysite.com:8080/ to http://www.mysite.com

  • Avoid exposing the secure HTTPS version

https://www.mysite.com/en/us/ to http://www.mysite.com/en/us
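
As a rough illustration, most of the conversions above can be sketched in a few lines of Python using only the standard library. The function name canonicalize and the DEFAULT_DOCUMENTS set are assumptions made for this example, not part of any Live Search tooling. Lowercasing paths is only safe when your server treats paths case-insensitively, and rewriting query strings into readable URLs depends on your own URL scheme, so the query string is left untouched here.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of default-document names; adjust it to your own server.
DEFAULT_DOCUMENTS = {"default.aspx", "default.htm", "index.html", "index.htm"}


def canonicalize(url: str) -> str:
    """Return one possible canonical form of url (illustrative only)."""
    parts = urlsplit(url)

    # .hostname is lowercased and excludes the port, which drops ":8080".
    host = parts.hostname or ""

    # Avoid CamelCase: convert the path to lower case.
    path = parts.path.lower()

    # Remove a trailing index or default document.
    head, _, last_segment = path.rpartition("/")
    if last_segment in DEFAULT_DOCUMENTS:
        path = head

    # Remove the trailing slash.
    path = path.rstrip("/")

    # Expose only the HTTP version; the query string passes through unchanged.
    return urlunsplit(("http", host, path, parts.query, ""))
```

With these assumptions, canonicalize("http://www.mysite.com:8080/FooBar/") and canonicalize("https://www.mysite.com/foobar") both produce http://www.mysite.com/foobar, which is the point of the exercise: many variants, one URL.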

My URLs are a mess — now what?

The best option for a webmaster is to consider canonicalization when they are planning their site. Unfortunately, many large sites were developed before optimization became an issue. Developing an architecture and site plan that considers canonical forms from the start will prevent a number of headaches later. But because many site owners already have challenges with canonicalization, here are a few solutions for correcting the situation with existing sites:

301 redirect to the canonical form

For a small site, this can be done fairly easily. However, the more canonical issues your site has, the more complicated the solution will be. The first step is to determine what canonical issues your site suffers from and then build a solution that takes into account both the need to scale and performance requirements.

For some examples of how to do this, I recommend reading Tony Spencer’s article on .htaccess, 301 Redirects & SEO.
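
If your site runs on a Python stack rather than Apache, the same idea can be sketched as a small piece of WSGI middleware. This is a minimal sketch under stated assumptions: CANONICAL_HOST and the function name canonical_redirect are made up for the example, and the only rules applied are the canonical host, lower-case paths, and no trailing slash (the root path stays as "/"). A production version would need to cover whichever canonical issues your own site suffers from.

```python
CANONICAL_HOST = "www.mysite.com"  # assumption: the host you chose as canonical


def canonical_redirect(app):
    """Wrap a WSGI app so that non-canonical requests receive a 301."""

    def middleware(environ, start_response):
        # Ignore any port in the Host header and compare case-insensitively.
        host = environ.get("HTTP_HOST", CANONICAL_HOST).split(":")[0].lower()
        path = environ.get("PATH_INFO", "/")

        # Canonical path: lower case, no trailing slash, root stays "/".
        wanted_path = path.lower().rstrip("/") or "/"

        if host != CANONICAL_HOST or path != wanted_path:
            location = "http://" + CANONICAL_HOST + wanted_path
            query = environ.get("QUERY_STRING", "")
            if query:
                location += "?" + query
            start_response(
                "301 Moved Permanently",
                [("Location", location), ("Content-Length", "0")],
            )
            return [b""]

        # Already canonical: hand the request to the wrapped application.
        return app(environ, start_response)

    return middleware
```

Wrapping an existing application is then a one-line change, for example application = canonical_redirect(application). Doing the check once in middleware, rather than page by page, is one way to keep the rules consistent as the site grows.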

Use a consistent linking convention

Whatever you choose as your canonical form, never deviate from that form in your sitemap or your internal link structure. For example, always link to mysite.com, rather than sometimes linking to mysite.com/default.htm. You can’t enforce how people link to your site externally, but you should ensure your convention is enforced internally. This can be a long process to enact on a large site, so it may help to start with your most trafficked pages where there is the most impact and work your way through the site to your least important pages.
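
One way to find where your internal links deviate from the convention is to scan your pages for them. The sketch below is only an illustration: CANONICAL_HOST, SITE_HOSTS, DEFAULT_DOCUMENTS, and the function name non_canonical_links are all assumptions for this example, and the checks cover just three of the issues discussed earlier (wrong host, default document, mixed case).

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

CANONICAL_HOST = "www.mysite.com"              # assumed canonical host
SITE_HOSTS = {"mysite.com", "www.mysite.com"}  # hosts that count as internal
DEFAULT_DOCUMENTS = {"default.htm", "default.aspx", "index.html"}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def non_canonical_links(html: str) -> list:
    """Return internal links that break the linking convention."""
    collector = LinkCollector()
    collector.feed(html)
    flagged = []
    for href in collector.links:
        parts = urlsplit(href)
        if parts.scheme not in ("", "http", "https"):
            continue  # mailto:, javascript:, and similar
        host = parts.hostname
        if host is not None and host not in SITE_HOSTS:
            continue  # external link; we cannot control those
        last_segment = parts.path.rsplit("/", 1)[-1].lower()
        if (
            (host is not None and host != CANONICAL_HOST)
            or last_segment in DEFAULT_DOCUMENTS
            or parts.path != parts.path.lower()
        ):
            flagged.append(href)
    return flagged
```

Running this over your most trafficked pages first matches the order suggested above: fix the links with the most impact, then work down to the least important pages.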

Don’t link to multiple versions of the page

I recently received a question from a publisher on the issue of co-branding. This is where a site may have multiple versions of a single page that it serves with different brands applied, each with a different URL. For example, the Microsoft download page can be reached directly or from any number of branded pages. The URLs for such a download could read:

  • http://downloads.microsoft.com/downloads/EP00688359?id=BA – Brand A skin to the page
  • http://downloads.microsoft.com/downloads/EP00688359?id=BB – Brand B skin
  • http://downloads.microsoft.com/downloads/EP00688359?id=BC – Brand C skin

The content is the same across all the pages other than the brand skin on the page. For a large site like microsoft.com, this means that the same page could be generated over and over, creating additional URLs for the search engine to crawl. The ideal situation in co-branded pages is to have a single page that the search engine bot sees and to block access to any other, branded versions. For a good solution to this problem, Nathan Buggia wrote a very detailed post on how to deal with URL referrer tracking at Jane and Robot. In it, he offers several patterns that will help in this situation and others.
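
To make the idea of a single page concrete, here is a small sketch that maps every branded URL back to one unbranded URL, which could then serve as the link target or redirect target. It is not one of the patterns from the post mentioned above. The names BRAND_PARAMS and strip_brand are assumptions for this example, as is the guess that the id parameter in the sample URLs does nothing except select the brand skin.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters only select the brand skin.
BRAND_PARAMS = {"id"}


def strip_brand(url: str) -> str:
    """Return url with the brand-selecting parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in BRAND_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

Under that assumption, the three branded URLs in the list above all reduce to http://downloads.microsoft.com/downloads/EP00688359, while any parameters that really do change the content are preserved.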

Use absolute links

Once you have done the items above, or if you are starting from scratch, you may want to consider moving from relative to absolute links. A relative link points to its target relative to the location of the current page on your server. While there are performance and architectural reasons that webmasters use relative links, the goal of this exercise is to bring consistency. Absolute links tend to make doing so easier.

A relative link will look like:

"../downloads/EP00688359"

The absolute form will look like:

"http://www.mysite.com/downloads/ep00688359"

While there are other issues that can increase the number of URLs, solving the issue of canonicalization will help both your customers and us at Live Search get more of the URLs that matter.
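
If you do decide to move from relative to absolute links, the conversion itself can be scripted. The sketch below uses urljoin from the Python standard library. The function name to_absolute and the page URL in the usage note are assumptions for illustration, and the lowercasing step only makes sense if you have adopted lower-case paths as your canonical form.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def to_absolute(page_url: str, href: str) -> str:
    """Resolve href against the page it appears on and lowercase the path."""
    absolute = urljoin(page_url, href)
    parts = urlsplit(absolute)
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path.lower(), parts.query, parts.fragment)
    )
```

For example, if the relative link "../downloads/EP00688359" appears on a hypothetical page at http://www.mysite.com/products/index.html, to_absolute returns "http://www.mysite.com/downloads/ep00688359", the absolute form shown earlier.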

Coming up next

In our next post on large sites, we will discuss the benefits and implementation of HTTP compression and conditional GET requests. As always, if you have additional questions, feel free to ask in our forums.

Jeremiah Andrick — Program Manager, Live Search Webmaster

27 comments
  1. Anonymous

    I think such optimization is useful not only for large sites.

    Most of what is described here is common SEO practice for any website.

  2. Anonymous

    Thank you for posting this. I can’t tell you how many .NET developers use bad URL practices such as CamelCase for files and extensions (.aspx). The extensions were largely due to the lack of easy rewrite support in IIS, although plenty of options were available via the ISAPI route. Still, having this on a Microsoft site to point .NET developers to is a good thing, if only to get rid of CamelCase files and paths in favor of lowercase, ‘-’ separated names.

  3. Anonymous

    Excellent advice!  But curious indeed from Live Search, which has a notoriously bad time handling 301 redirects.

    In fact, after reading this post, I yet again looked at my site in Live’s index.  Again, I looked at the headers…

    mysite.com

    -> 301 ->

    http://www.mysite.com

    No problemo.  Except that Live stubbornly keeps "mysite.com" as the URL displayed in and linked by its index.  And this 301 has been in place like … forever.

    I’ll keep monitoring and update this comment in what I’m regarding as the increasingly unlikely event that Live gets this right.

  4. Anonymous

    Error in your example for "Add or remove the trailing /"

    The URLs http://www.mysite.com/ and http://www.mysite.com are exactly the same. This is because the very first / character is not actually part of the URL path, but more of a separator between the URL hostname and path.

    This is only true for the first "/" – all other ones are part of the path. So a better example would be:

    http://www.mysite.com/widgets/ and http://www.mysite.com/widgets

  5. Anonymous

    It is very refreshing to see that I have been recommending 100% of these suggestions to my clients. I usually have the hardest problems with clients who use Microsoft servers, because the default settings allow upper- and lower-case URLs to resolve to the same content even though the URLs are not the same.

    The CamelBack default settings really need to be addressed for IIS. I keep hearing how wonderful MS is lately and how far the dev tools have come with .Net and so forth.

    There are also some horrid shopping carts out there for the Apache/PHP servers. OS Commerce is the worst! They have a fix now, but it takes some work to get it working.

    Also, don’t forget to talk about never-ending URLs. These usually happen in shopping carts that use a unique ID in the URL every time you add or remove a product. You could actually have a limitless number of URLs. This happens with calendars too, typically on small hotel and event sites.

    I usually just block them from being indexed altogether because the value of those pages for searchers is very small, if anything.

  6. Anonymous

    Good to know all the info you guys shared. Thanks a lot!

  7. Anonymous

    Thanks for very doable SEO tips we should all be taking note of – not just for larger sites. When you are new, it can be tough to get reliable info. This sounds like common sense stuff worth implementing.

  8. Anonymous

    " As with any site, our original recommendations on how to rank in Live Search are still important. But we’ve given it some thought and wanted to provide some recommendations  "

    Link doesn’t work :-(

  9. rickdej

    Thanks for all the feedback.  I fixed the broken link in this post as well.  

    Jeremiah Andrick

  10. Anonymous

    Some good points, maybe talk about site hierarchy issues.

  11. Anonymous

    You could also segment your site or break it up into smaller quadrants.

  12. Anonymous

    Thank you for this post :-)

    I have a big gallery where I didn’t fix the URLs to lowercase!

    So I know it’s very easy to change, but I wonder if the website will get bad ranking on other search engines.

    Is it necessary to lowercase all URLS ?

    Thank you

  13. Anonymous

    I didn’t know anything about the issue of canonicalization until I read Matt Cutts post at his blog. Your post here has brought more light into ways to solve the issue.

  14. Quality Directory

    All canonicalization issues have been fixed. And I use MOD-rewrite to rewrite the dynamically generated URLs to shorter search engine-friendly URLs.

  15. Anonymous

    Hope to be better. Better means more features.

  16. Anonymous

    This is great news. Best of luck for the future and keep up the good work.

  17. Webmaster.Mas

    Very useful and helpful information for webmasters managing very large sites.

  18. Anonymous

    Thanks for this, nice blog. Take a look at ours for more SEO tips.

    SEO Junkies

  19. Anonymous

    It’s really nice to know about the canonical and how it works. Thanks

  20. Anonymous

    Thanks for the pro tips. I just used some of these tips on my site to optimize for bing.

  21. miles2go

    The vital information I found is the use of absolute links. New WordPress installs don't use them by default, and that is not changed. So my advice is, after installation, go to the settings and change to absolute links for better results in search engines.

  22. alex-band

    great advice!

    thanks!

  23. Desi Games

    I feel this is also true for any small or medium site. It is like owning a house: whether you own a small, medium, or big house, you would like to keep it as clean as possible. These are basic hygiene matters which all webmasters should follow.

  24. abercrombieandfitch

    looking forward to part 2

  25. JohnyC

    Is using rel="nofollow" helpful instead of redirection?

  26. novintabligh

    Woow, thanks a lot. I need this information for my large site.

  27. boris_webmaster

    I am glad I found this information. I have a large site that is not ranking, maybe now I can work out why
