Optimizing your very large site for search — Part 2

For a large website, there are many critical issues to consider when optimizing for search. In Part 1 of this series, we discussed the importance of reducing the number of URLs you expose through canonicalization. But there are other ways to reduce your site's surface area to search engines and focus their attention on the pages that matter.

Even after you have reduced the number of URLs you expose to Live Search, a large site can still present a large surface area to crawl. In crawling your site, search engines may miss some of your best content, or may consume bandwidth that you pay for unnecessarily. This is where HTTP compression and conditional GET can help.

Enabling HTTP compression

Whether or not you are concerned with bandwidth, setting up HTTP compression is a best practice for every site owner. What is HTTP compression? It is a mechanism defined in the HTTP 1.1 specification, known as "content encoding," that lets a web server check, when it receives a request for a file, whether the client browser (or crawler) supports compression before serving the file.

Most people are familiar with the ZIP file format of data compression, where files are added to a ZIP archive and then extracted as needed. That is not how HTTP compression works. With HTTP compression, the web server compresses document files on the fly as they are transferred to the browser, and the browser decompresses and displays each file as intended.
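To make the negotiation concrete, here is a minimal sketch in Python (standard library only; the URL is a placeholder for a page on your own site). The client advertises gzip support in the Accept-Encoding request header; a compression-enabled server answers with a Content-Encoding: gzip header and a compressed body:

import gzip
import urllib.request

# The URL is a placeholder; point this at a page on your own site.
req = urllib.request.Request(
    "http://www.example.com/",
    headers={"Accept-Encoding": "gzip"},  # tell the server we accept gzip
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        html = gzip.decompress(body)
        print(f"{len(body)} bytes on the wire, {len(html)} bytes decompressed")
    else:
        print("Server sent an uncompressed response")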

Not all files are created equal

Certain file types are not suitable for HTTP compression. Files that are already compressed, such as JPEGs, GIFs, movies, or archive formats (e.g. ZIP, gzip, and RAR), will not compress further with HTTP compression turned on. However, sites with a lot of plain-text content, including the main HTML files, XML, CSS, and RSS, will benefit from HTTP compression. For example, most standard HTML files will shrink by about half, sometimes more.
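If you want a rough estimate of what compression would save for your own content, you can gzip local copies of representative files. A short Python sketch (the file names are illustrative; substitute your own):

import gzip
from pathlib import Path

# Illustrative file names; substitute local copies of your own content.
for name in ["index.html", "styles.css", "photo.jpg"]:
    data = Path(name).read_bytes()
    compressed = gzip.compress(data)
    print(f"{name}: {len(data)} -> {len(compressed)} bytes "
          f"({100 * len(compressed) // len(data)}% of original)")

The text files should land at roughly half their original size or less, while the JPEG will barely change.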

Setting up Apache

If your site is running Apache, you can use the mod_deflate module, which adds an output filter that compresses content with gzip before it is sent to the client. You can apply the filter site-wide or selectively, compressing only specific MIME types as determined by the Content-Type header, whether that header is generated automatically by Apache, by a CGI script, or by other dynamic code you create.

To enable compression for all MIME types, apply the SetOutputFilter directive to a website or directory:

<Directory "/web/mysite/php/">
    SetOutputFilter DEFLATE
</Directory>

To enable compression for a specific MIME type (in this example, text/html), use the AddOutputFilterByType directive:

AddOutputFilterByType DEFLATE text/html

Every site is different. If you need to support older browsers, you may need a more advanced configuration. You can read more in the mod_deflate documentation.
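For example, the mod_deflate documentation suggests configurations along these lines, which compress several text MIME types while working around known problems in Netscape 4.x (verify the exact BrowserMatch rules against the documentation for your Apache version):

AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css

# Netscape 4.x can only handle gzip-compressed HTML
BrowserMatch ^Mozilla/4 gzip-only-text/html

# Netscape 4.06-4.08 have problems with compression altogether
BrowserMatch ^Mozilla/4\.0[678] no-gzip

# Internet Explorer identifies itself as Mozilla/4 but handles compression fine
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html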

Setting up IIS 7

Fortunately for most site owners, IIS 7 enables HTTP compression for static files by default. However, if you want to compress dynamically generated files as well, you have to turn on dynamic compression manually.

You can do this by opening IIS Manager and double-clicking Compression.

You'll notice that Enable static content compression is selected by default. To enable dynamic compression, simply select the Enable dynamic content compression option as well.
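If you manage configuration in files rather than through the UI, the equivalent IIS 7 setting lives in the urlCompression section of web.config. A minimal sketch (note that dynamic compression also requires the Dynamic Content Compression module to be installed):

<configuration>
  <system.webServer>
    <urlCompression doStaticCompression="true" doDynamicCompression="true" />
  </system.webServer>
</configuration>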

Setting up IIS 6

IIS 6 also includes a native compression system and can be configured to compress both static and dynamic content. To enable HTTP compression in IIS 6, open the website's Properties page and edit the global properties for the site. Under the Service tab, you can configure the options within the HTTP compression section.
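If you administer many servers, you can also script the change against the IIS 6 metabase. One commonly documented approach uses adsutil.vbs (shown with its default location; the HcDoStaticCompression and HcDoDynamicCompression metabase properties control static and dynamic compression respectively):

cd C:\Inetpub\AdminScripts
cscript adsutil.vbs set W3SVC/Filters/Compression/Parameters/HcDoStaticCompression true
cscript adsutil.vbs set W3SVC/Filters/Compression/Parameters/HcDoDynamicCompression true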

Both versions of IIS also cache the compressed files in a directory, which improves performance by eliminating the need to re-compress files on the fly. As with Apache, IIS lets you select which MIME types to compress. TechNet has more information about selective compression in IIS.

Has your content changed since the last time we visited?

Like HTTP compression, this mechanism is part of the official HTTP 1.1 specification: it lets you declare when a document was last updated. When Live Search crawls your site, we ask whether each document has changed since we last looked at it. If it has, give us the latest version. If it is unchanged, just tell us so and send us nothing. This mechanism is referred to as conditional GET, and by implementing it you can save yourself bandwidth and save us the cost of comparing files we already have in the index.

Additionally, it lets us spend our crawl time on files we may not have indexed before, which can improve your coverage over time.
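You can watch this exchange yourself with a short script. This Python sketch (standard library only; the URL and date are placeholders) sends an If-Modified-Since request header and reports whether the server answered 304 Not Modified or sent the full document:

import urllib.error
import urllib.request

# Placeholder URL and date; use a real page and the time you last fetched it.
req = urllib.request.Request(
    "http://www.example.com/page.html",
    headers={"If-Modified-Since": "Sat, 01 Mar 2008 00:00:00 GMT"},
)
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status, "OK - server sent the full document")
except urllib.error.HTTPError as err:
    if err.code == 304:
        print("304 Not Modified - no body was transferred")
    else:
        raise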

Implementing the conditional statements

There are many factors to consider when implementing conditional GET, depending on the web server, programming language, or content management system you use. Fortunately, both IIS and Apache natively support the Last-Modified / If-Modified-Since / 304 Not Modified exchange for static files. For dynamic files, you may need to implement a code-based solution. A good pattern for this, and an equally good description of how the crawler responds to conditional GET, can be found in the article “Save bandwidth costs: Dynamic pages can support If-Modified-Since too.”
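To illustrate the pattern for dynamic pages in a general-purpose way (this is not the specific technique from that article, just a minimal standard-library Python sketch; the page body and the last-modified timestamp, which would normally come from your database or CMS, are stand-ins):

from email.utils import formatdate, parsedate_to_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in: in a real application this would come from your database or CMS.
PAGE_LAST_MODIFIED = parsedate_to_datetime("Sat, 01 Mar 2008 00:00:00 GMT")

class ConditionalGetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ims = self.headers.get("If-Modified-Since")
        if ims:
            try:
                if parsedate_to_datetime(ims) >= PAGE_LAST_MODIFIED:
                    # Unchanged since the client's copy: send 304 and no body.
                    self.send_response(304)
                    self.end_headers()
                    return
            except (TypeError, ValueError):
                pass  # Unparseable date: fall through and send the page.
        body = b"<html><body>Generated page</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Last-Modified",
                         formatdate(PAGE_LAST_MODIFIED.timestamp(), usegmt=True))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ConditionalGetHandler).serve_forever()

The key design point is that the 304 path returns before the page body is built, so an unchanged page costs neither rendering time nor transfer bandwidth.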

Testing your setup

Once you have determined the best course for implementing HTTP compression and conditional GET, you can verify that your implementation works with our HTTP Compression and HTTP Conditional Get test tool. Using your robots.txt file, you can confirm that your configuration is correct and will work with Live Search.

Coming up next

Now that your pages are compressed and you are telling us when your content is new, we will move on to discussing how to avoid hiding the content you want us to find. As always, if you have additional questions, feel free to ask in our forums.

Fabrice Canel and Jeremiah Andrick – Program Managers, Live Search Webmaster