Going international: considerations for your global website

One of the best parts of publishing online is that, on the Web, anyone can have a world-wide reach. But while being global is made easy on the Internet, ensuring that the content you produce will be found by the right audience can be a real challenge. Search engines can have trouble understanding geotargeting because of a few technical limitations. These include:

  • Search engines may not be crawling your site from the location of your customers.
  • Search engines may not execute JavaScript, which can break some targeting methods that rely on self-selection or JavaScript-based redirection.
  • There’s no standardized way to tell a search engine which region or language your content’s targeted for.
  • Content language usage can be misleading. For example, a page’s code comments and URLs could be written in English and the site hosted in a data center in Washington State, USA, while the main content is written in Spanish and the user generated content (UGC) comments at the bottom of the page are in French, German, Portuguese, and Pig Latin.
  • Top-level domains may not indicate the intended audience. Examples include http://ma.tt/, an English-language personal site, and Orange.com, a French telecom site hosted in France.
  • Some sites use redirection techniques that are unfriendly to search.

At Live Search, we attempt to overcome these and other challenges by examining the contents of a site, looking for indications that help us determine its intended audience. Sometimes it’s easy, for instance, when a site has a country code top-level domain that matches the body text language of its intended audience. Other times, it’s more difficult. The following are a few of the indicators we look at:

  1. Country code top-level domain (ccTLD). For example, aw.ca specifically targets users in Canada.
  2. Host server location, particularly for .com, .net, and .org.
  3. Language of the body text on the page.
  4. Locale of pages that link to your site.

While we do our best to read the indicators you give us, none of them alone is a crystal-clear gauge of a site’s intended geographic interest, so we take them all into consideration. So, how do you ensure that we’re able to determine the intended audience for your content?

Some best practices

There is no single, fully effective approach to architecting and localizing websites. Books could be, and have been, written on this subject, but there are ways to do it that are friendlier to both your customers and to us. Consider the following recommendations:

Target the location

When writing the content for your page, are you using keywords that tell the end user what location your content is relevant to? If you’re a local business, be sure to include your telephone number with its area and country codes, your physical address (if applicable), and the city, state, and country name where you’re located. This’ll help both search engines and customers find you.

If you’re already doing this, one thing you may not have considered is to target additional location keywords that may also represent the location. For example, if your business is located in the Capital Hill area of Seattle, Washington, you’ll want to list both “Seattle” and “Capital Hill” because your customers will likely be searching for both. However, in this case, if you list “Capitol Hill” (note the spelling difference, because your customers might spell it incorrectly) and “Washington,” you might be mistakenly considered to be back east in the District of Columbia!

Be consistent in language

One problem we see from time to time is inconsistent language usage, especially where user generated content is concerned. These pages can trip up our detection, especially when other best practices aren’t being followed.

For example, from time to time, content in MSDN is required for external markets like France, but the page content hasn’t been converted to French or perhaps the content was user-generated, such as in the following image. image

Ensure, wherever possible, that you’re speaking the same language in the title tags, description tags, and rest of your page. Consistency is key.

Create a hierarchy of language

When you’re architecting your site, we recommend grouping your localized content by TLD, subdomain, or subfolder. Keep all of the content for a region or language grouped together in a single structure. The following are all examples of good organizational structures for languages, using the sample URL www.domain.com:

  • www.domain.ca
  • ca.domain.com
  • www.domain.com/ca-fr/

Don’t mix content intended for one market with the content of another. This is bad for search and can be a bad experience for the customer.
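
To make the three structures above concrete, here is a minimal sketch of a helper that builds a localized URL for one market. The function name, the style labels, and the market codes are illustrative assumptions, not something Live Search requires; the point is only that every page for a market lives under one predictable prefix.

```python
def localized_url(name: str, market: str, style: str, path: str = "") -> str:
    """Build a URL for one market using one of three grouping styles.

    name:   the site name without "www" or a TLD, e.g. "domain"
    market: a market code, e.g. "ca" or "ca-fr"
    style:  "tld", "subdomain", or "subfolder" (hypothetical labels)
    path:   the page path inside that market, e.g. "products/"
    """
    path = path.lstrip("/")
    if style == "tld":
        # Country code top-level domain: www.domain.ca
        base = f"www.{name}.{market}"
    elif style == "subdomain":
        # Market as a subdomain: ca.domain.com
        base = f"{market}.{name}.com"
    elif style == "subfolder":
        # Market as the first folder: www.domain.com/ca-fr/
        base = f"www.{name}.com/{market}"
    else:
        raise ValueError(f"unknown style: {style!r}")
    return f"{base}/{path}"


if __name__ == "__main__":
    for style, market in (("tld", "ca"), ("subdomain", "ca"), ("subfolder", "ca-fr")):
        print(localized_url("domain", market, style))
```

Whichever style you pick, using it for every market keeps each region’s content grouped in a single structure.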

Note: One thing to consider in planning the hierarchy of a global site is how many URLs you produce. Having too many URLs for each market may dilute the overall relevance of your pages and site. Furthermore, you may not get customers to link to the most important pages on your site. See our part 1 post on large site optimization for more thoughts on this topic.

Common mistakes

Sometimes search engines struggle to deliver your content to the right audience due to problems within your (the publisher’s) control. Some of these common mistakes include:

Using a cookie to store the language setting

Some sites store the language setting as a preference in a cookie, but provide no navigational method for seeing the content for other markets. The problem with this approach is that the search engines don’t support cookies when crawling, so we never see anything but the default language. This can also create a less-than-optimal user experience. For instance, my friend Mishka may be reading a site here in the US but then switches to the German version of the same site. The site drops a cookie on her computer to note the change. If Mishka then emails a link to the site to her mother, who doesn’t know English, the site won’t find the German language cookie on her mother’s machine, so she’ll see an English page that she can’t read.
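
One way around this, sketched below under some assumptions, is to treat the URL as the source of truth and use the cookie only as a hint when the URL names no market. The cookie name "market", the supported market list, and the function name are all hypothetical; the sketch assumes the subfolder structure described earlier.

```python
from typing import Mapping, Optional, Tuple

SUPPORTED_MARKETS = {"en-us", "de-de"}  # hypothetical list of markets
DEFAULT_MARKET = "en-us"


def resolve_market(path: str, cookies: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Return (market, redirect_target) for a requested path.

    redirect_target is None when the URL already names a supported market,
    so a crawler or a forwarded link gets that market with no cookie at all.
    """
    first_segment = path.strip("/").split("/", 1)[0].lower()
    if first_segment in SUPPORTED_MARKETS:
        # The URL wins: /de-de/... is German for everyone who opens it.
        return first_segment, None

    # No market in the URL: fall back to the cookie, then to the default,
    # and send the visitor to a URL that does carry the market.
    preferred = cookies.get("market", "").lower()
    market = preferred if preferred in SUPPORTED_MARKETS else DEFAULT_MARKET
    return market, f"/{market}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(resolve_market("/de-de/products", {}))
    print(resolve_market("/products", {"market": "de-de"}))
    print(resolve_market("/products", {}))
```

With this arrangement, the link Mishka emails already contains the German market, so her mother sees the German page even though her machine has no cookie.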

JavaScript

Some website owners spread localized content around a site without a clear path to get to the content. For example, let’s say your site enables customers to use a JavaScript control, like the one shown in the image below, to load localized content within the page, but it doesn’t change the URL for the different language content. In this case, because the crawler can’t execute JavaScript, it’ll never be able to reach anything except the default language content. This scenario also prevents a user from linking to the localized version of the content.

image
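
A simple alternative to a script-only control is a set of ordinary links, one per market, each pointing at its own URL. The sketch below renders such links on the server; the function name, the market codes, and the labels are illustrative assumptions, and it presumes each market lives in its own subfolder.

```python
from html import escape
from typing import Mapping


def language_links(markets: Mapping[str, str], current_path: str) -> str:
    """Render one plain <a> link per market for the current page.

    markets maps a market code (used as the URL folder) to a display label.
    Plain links need no JavaScript, so a crawler can follow them and a
    customer can copy and share them.
    """
    page = current_path.lstrip("/")
    links = []
    for code, label in markets.items():
        href = f"/{code}/{page}"
        links.append(f'<a href="{escape(href, quote=True)}">{escape(label)}</a>')
    return "\n".join(links)


if __name__ == "__main__":
    print(language_links({"en-us": "English", "ca-fr": "Français"}, "/products"))
```

A JavaScript control can still sit on top of links like these, as long as choosing a language takes the customer to a distinct URL.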

Scripting against HTTP_ACCEPT_LANGUAGE

The browser’s HTTP_ACCEPT_LANGUAGE header is passed to a website when the browser requests a page, and it informs the site about which language you prefer to receive the content in. If you’re using a script to detect this setting and change the content based on that setting, it’s easy to load all the languages in the website under the same URL. However, if that’s the only way the localized content is accessible, the crawler, which doesn’t send a value for this header, will only see the default language. Having a distinct URL per language is necessary for ensuring the content is crawled.
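
If you do want to honor this header, one option is to use it only to choose which language URL to send a visitor to, rather than to swap content under a single URL. The following is a minimal sketch of that idea; the function name, the routes table, and the URL prefixes are assumptions made for illustration, and the parsing handles only the common "tag;q=value" form.

```python
from typing import Mapping, Optional


def redirect_for(accept_language: Optional[str],
                 routes: Mapping[str, str],
                 default: str) -> str:
    """Pick a per-language URL prefix from an Accept-Language value.

    routes maps lowercase language tags (e.g. "fr-ca") to URL prefixes.
    A missing or empty header, as with a crawler, yields the default prefix,
    which is still a real URL of its own.
    """
    if not accept_language:
        return default

    ranked = []
    for position, part in enumerate(accept_language.split(",")):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # a tag without a q-value counts as most preferred
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            # Sort by quality (highest first), then by original order.
            ranked.append((-quality, position, tag))

    for _, _, tag in sorted(ranked):
        if tag in routes:
            return routes[tag]
    return default


if __name__ == "__main__":
    routes = {"fr-ca": "/ca-fr/", "en-us": "/en-us/"}
    print(redirect_for("fr-CA,fr;q=0.9,en;q=0.8", routes, "/en-us/"))
    print(redirect_for(None, routes, "/en-us/"))
```

Because every language still has its own URL, the crawler can reach each version through ordinary links even though it never triggers the redirect.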

Market as a parameter

Search engines may sometimes be able to recognize a market setting in a query parameter if it’s using a common nomenclature (such as EN-US), but this is still neither optimal nor friendly for search. It’s better to follow the pattern for URL hierarchy we described above. The following are examples of both standard and non-standard market nomenclature that are used in URLs:

Standard nomenclature:

  • www.domain.com/default.aspx?mkt=ca-fr

Non-standard nomenclature:

  • www.domain.com/default.aspx?market=fran
  • www.domain.com/default.aspx?market=89372
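
If your site already exposes the market as a query parameter, a rewrite along the following lines could map those URLs onto the subfolder hierarchy recommended above. This is only a sketch: the function name is hypothetical, the parameter name defaults to the "mkt" used in the example, and it assumes the URL includes a scheme such as http://.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def market_param_to_path(url: str, param: str = "mkt") -> str:
    """Move a market query parameter into the first folder of the path.

    http://www.domain.com/default.aspx?mkt=ca-fr
      becomes
    http://www.domain.com/ca-fr/default.aspx
    URLs without the parameter are returned unchanged.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    market = next((value for key, value in query if key == param), None)
    if not market:
        return url

    # Keep every other parameter exactly as it was.
    remaining = [(key, value) for key, value in query if key != param]
    new_path = f"/{market.lower()}{parts.path}"
    return urlunsplit(
        (parts.scheme, parts.netloc, new_path, urlencode(remaining), parts.fragment)
    )


if __name__ == "__main__":
    print(market_param_to_path("http://www.domain.com/default.aspx?mkt=ca-fr"))
```

In practice you would pair a mapping like this with redirects from the old parameter URLs to the new ones, so existing links keep working.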

Wrapping up

Search engines are always looking to improve how we detect the location and language of the intended audience for your site. At Live Search, we’re considering several possibilities, from meta tag standards to tools in Webmaster Center. But until those tools become available, by following the recommended practices and avoiding the common mistakes listed above, you can help us make those determinations today. As we make improvements in this space, we’ll continue to make announcements as well as seek your feedback and questions in our forums. Some additional great reads on the subject include:

Jeremiah Andrick, Program Manager, Live Search Webmaster Center