Optimizing your very large site for Search — Part 3

Working with large sites often means being part of a large organization, which brings its own set of challenges. Many stakeholders with different agendas or needs influence how sites are structured. Within larger organizations, there are long to-do lists and often a limited understanding of the impact certain design or architecture choices can have on a search engine's ability to index the site.

In our past two articles on large site optimization, we talked about how to help us crawl your site more efficiently and about the need to reduce the number of URLs you expose. However, given the diverse needs of some organizations, the issue isn't always that you share too many URLs. Some websites instead hide valuable content from the search crawlers. To ensure that your site gets plenty of coverage, watch for the following patterns and avoid them.

Just a little too flashy

Using Silverlight or Flash on pages within your site can help to create rich and interactive user experiences that can complement or build on the content in the rest of your site. On the other hand, Rich Internet Applications (RIAs) can also hinder search engine discoverability of your content. Such content is typically not fully exposed to a search engine crawler or a browser that doesn’t execute JavaScript.

While there have been some recent advances in crawling RIAs, these advances are limited to text extraction and apply only in certain contexts. You should still treat content built with these tools as inaccessible to crawlers.

There are some techniques you can use to help make sure that content in Silverlight or Flash isn't lost to crawlers or to visitors to your site.

Fundamentals, fundamentals, fundamentals

As with sports skills, no matter how good you get, you need to ensure that you have your fundamentals down. The world of RIAs is no different. Your site needs to provide the following (a minimal markup sketch follows the list):

  • Unique page titles
  • Detailed meta page descriptions
  • Quality body copy (e.g. timely, relevant, well written)
  • Descriptive H1 tags (one per page)
  • Discoverable navigation
  • Informative ALT tags
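
As a quick illustration, here is a minimal sketch of those fundamentals in markup; the product name, URLs, and body copy are hypothetical placeholders for your own content:

<html>
<head>
   <!-- Unique page title -->
   <title>Widget Pro 3000 Specifications - Contoso</title>
   <!-- Detailed meta description -->
   <meta name="description" content="Full specifications, pricing, and availability for the Widget Pro 3000." />
</head>
<body>
   <!-- One descriptive H1 per page -->
   <h1>Widget Pro 3000 Specifications</h1>
   <!-- Discoverable navigation: plain, crawlable links -->
   <a href="/products/">All products</a> | <a href="/support/">Support</a>
   <!-- Informative ALT text -->
   <img src="/images/widget-pro-3000.jpg" alt="Widget Pro 3000, front view" />
   <p>Quality body copy: timely, relevant, well-written text about the product.</p>
</body>
</html>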

Practice progressive enhancement

Progressive enhancement is a strategy for ensuring that your content is accessible to all viewers by designing for the least capable devices first and then enhancing those documents with separate presentation logic. The basic concept of progressive enhancement can be summed up in the following principles:

  • Basic content should be accessible to all users
  • Basic functionality should be accessible to all browsers
  • All content is in hierarchical semantic markup (<title>, <h1>, <h2>, <h3>, etc.)
  • Enhanced layout (Silverlight or Flash files) is provided separately
  • Enhanced behavior is provided by externally-linked JavaScript
  • End user browser preferences are respected

One way to implement progressive enhancement is to ensure your rich content is embedded in <div> or <span> tags. These tags should contain alternate content including the links, text, and images found in your RIA.

The alternate content may contain links, headings, styled text, and images: anything you can add to an ordinary HTML page. You can use JavaScript to detect whether the browser supports Flash or Silverlight. If so, the JavaScript manipulates the page's document object model (DOM) to replace the alternate content with the Flash or Silverlight application. The key is to ensure that the alternate content accurately reflects the contents of the RIA file, or you may be penalized.
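
Here is a minimal sketch of that approach, assuming a Silverlight application; the hasSilverlightSupport() function, the element ID, and the file names are hypothetical placeholders for your own detection logic and content:

<div id="productTour">
   <!-- Alternate content: the same links, text, and images found in the RIA -->
   <h2>Product tour</h2>
   <p>See the <a href="/products/">full product line</a>.</p>
   <img src="/images/tour-still.jpg" alt="Still image from the product tour" />
</div>
<script type="text/javascript">
   // Hypothetical capability check; substitute your own Silverlight or Flash detection.
   if (hasSilverlightSupport()) {
      // Swap the alternate content for the rich application via the DOM.
      document.getElementById('productTour').innerHTML =
         '<object data="data:application/x-silverlight-2," type="application/x-silverlight-2" width="640" height="480">' +
         '<param name="source" value="ProductTour.xap" /></object>';
   }
</script>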

For a complete picture of progressive enhancement, you can read A List Apart's article Understanding Progressive Enhancement.

Writing your AJAX application for search engines

In the past few years, AJAX (shorthand for Asynchronous JavaScript and XML) has become popular with web developers looking to create more dynamic and responsive experiences by exchanging small amounts of data with the server behind the scenes. AJAX is often used as a technique for creating interactive web applications, but more than ever people are also using it simply to spice up their pages.

These exchanges of data require the execution of scripts when particular page events occur. Search engines don't interpret or execute this kind of code; therefore, any content contained within an AJAX-enabled page or control isn't accessible and won't be indexed.
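
As a simple illustration (the URL and element ID here are hypothetical), a crawler that doesn't execute script sees only the empty container below and never the article text that the script retrieves:

<div id="articleBody">
   <!-- Empty in the HTML source, so a crawler has nothing to index here -->
</div>
<script type="text/javascript">
   // The content only appears after the browser runs this request.
   var xhr = new XMLHttpRequest();
   xhr.open('GET', '/content/article.html', true);
   xhr.onreadystatechange = function () {
      if (xhr.readyState === 4 && xhr.status === 200) {
         document.getElementById('articleBody').innerHTML = xhr.responseText;
      }
   };
   xhr.send();
</script>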

If some pages or your entire site will leverage AJAX, you should consider the following strategies for generating as much static content as possible:

  • Use a descriptive title tag, meta keywords, meta description, and H1 tag on every page to identify its intended content, even if the content on the page itself isn’t immediately accessible.
  • Don’t put your site navigation within an AJAX/JavaScript container.
  • If your RIA requires extensive AJAX use, plan for development of automated, static page builds that are accessible via static links from dynamic pages. Some people refer to these as “lo-fi” content links.
  • Where possible, use server-side technologies instead of client-side browser requests for populating individual, dynamic content containers (see the sketch after this list).
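
A minimal sketch of those last two points, with hypothetical URLs: the content is rendered into the page on the server, and a plain "lo-fi" link points to a static page carrying the same content as the AJAX-enhanced view.

<!-- Rendered on the server, so the markup already contains the content -->
<div id="productDetails">
   <h2>Widget Pro 3000</h2>
   <p>Full specifications, pricing, and availability.</p>
</div>

<!-- Plain, crawlable "lo-fi" link to a static version of the same content -->
<a href="/products/widget-pro-3000.html">Widget Pro 3000 (full details)</a>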

Possible pattern

If your site or a section of your site requires Silverlight, Flash, or AJAX, you may want to review Nikhil Kothari’s pattern for making RIAs more indexable at SEO for Ajax and Silverlight Applications. No matter the technology, you can build a simple container that keeps the page search engine friendly without trying to detect a search engine crawler on the server. The container Nikhil proposes is a simple bit of markup and JavaScript that loads the enhanced content but provides a separate container with content to display if the enhanced content doesn’t load. The markup will look something like:

<div style="width:0px;height:0px;overflow:hidden;">
   <!-- Object for the rich content -->
   <object>
    ...
   </object>
   <!-- or JavaScript-generated content -->
</div>
<!-- With JavaScript on, this opens a hidden wrapper around the alternate content -->
<script type="text/javascript">
   document.write('<div style="display: none">');
</script>
<div>
   Text for browsers with JavaScript turned off and for search engines.
</div>
<!-- With JavaScript on, this closes the hidden wrapper -->
<script type="text/javascript">
   document.write('</div>');
</script>
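
With JavaScript enabled, the two document.write calls wrap the alternate text in a display:none container, so visitors see only the rich content. A crawler or a browser with JavaScript turned off never executes those writes, so the alternate text remains visible in the markup and can be indexed.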

Playing hide and seek

For large sites, there are probably other reasons why valuable content may not be accessible to crawlers. As the owner of a large site, it’s important to audit your robots.txt file from time to time to ensure that you aren’t blocking important sections of your site, and to ensure that your Sitemaps are up to date and include everything that may be of value to searchers.
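
For example, an audit might catch a rule like the hypothetical one below, where a Disallow added for one purpose is broad enough to block a section you want indexed; the paths and Sitemap URL are placeholders:

User-agent: *
# Intended to block internal search result pages
Disallow: /search/
# Too broad: this prefix also blocks /products/, which should be indexed
Disallow: /prod
# Keep the Sitemap reference current
Sitemap: http://www.example.com/sitemap.xml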

Coming up

In our final installment in this series, we will discuss some of the finer points of content for big sites. If you have additional questions, feel free to ask in our forums.

Jeremiah Andrick - Program Manager, Live Search Webmaster