Sitemaps Best Practices Including Large Web Sites

One of the key Search Engine Optimization (SEO) strategies for web sites is to have high-quality sitemaps that help search engines discover and access all relevant content posted on the site. Sitemaps offer a simple way for site owners to share information about their content with every search engine, instead of relying solely on crawling algorithms (i.e., crawlers, robots) to find it.

The Sitemaps protocol, defined at www.sitemaps.org, is now widely supported. Web sites and some Content Management Systems (CMSs) often offer sitemaps by default or as an option. Bing even offers an open source server-side technology, Bing XML Sitemap Plugin, for websites running on Internet Information Services (IIS) for Windows® Server, as well as Apache HTTP Server.

Best practices if you want to enable sitemaps

If you don’t have a sitemap yet, we recommend that you first check whether your web site or your CMS can generate one, or install a sitemap plugin.

If you have to, or want to, develop your own sitemaps, we suggest the following best practices:

  1. First, follow the sitemaps reference at www.sitemaps.org. Common mistakes we see include treating HTML sitemaps as XML Sitemaps, malformed XML Sitemaps, XML Sitemaps that are too large (the maximum is 50,000 links and 10 megabytes uncompressed), and links in sitemaps that are not correctly encoded.
  2. Have relevant sitemaps linking to the most relevant content on your sites. Avoid duplicate links and dead links: a best practice is to generate sitemaps at least once a day, to minimize the number of broken links in sitemaps.
  3. Select the right format:
    1. Use an RSS feed to list, in real time, all new and updated content posted on your site during the last 24 hours. Avoid listing only the 10 newest links on your site: search engines may not visit the RSS feed as often as you want and may miss new URLs. (The feed can also be submitted in Bing Webmaster Tools as a Sitemap option.)
    2. Use XML Sitemap files and a sitemap index file to generate a complete snapshot of all relevant URLs on your site daily.
  4. Consolidate sitemaps: avoid having too many XML Sitemaps and too many RSS feeds per site. Ideally, have only one sitemap index file listing all relevant sitemap files and sitemap index files, and only one RSS feed listing the latest content on your site.
  5. Use sitemap properties and RSS properties as appropriate.
  6. Tell search engines where your sitemap XML URLs and RSS URLs are located by referencing them in your robots.txt file or by publishing the location of your sitemaps in the search engines’ Webmaster Tools.
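
As an illustration of practices 1 to 3 above, here is a minimal Python sketch, not a reference implementation. It removes duplicate links, entity-escapes each URL, splits the list into files of at most 50,000 links, and builds one sitemap index file. The function name `build_sitemaps`, the file names `sitemap-N.xml` and `sitemap-index.xml`, and the `example.com` URLs are assumptions made for this example. The sketch does not check the 10 megabyte size limit or remove dead links.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file link limit mentioned in practice 1


def build_sitemaps(urls, base_url, max_urls=MAX_URLS_PER_FILE):
    """Return {filename: xml_text} for the sitemap files plus one index file.

    File names are hypothetical; pick whatever naming fits your site.
    """
    # Drop duplicate links while keeping the original order.
    unique_urls = list(dict.fromkeys(urls))

    files = {}
    for start in range(0, len(unique_urls), max_urls):
        chunk = unique_urls[start:start + max_urls]
        name = f"sitemap-{start // max_urls + 1}.xml"
        # escape() entity-encodes &, < and > so the XML stays well formed.
        entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in chunk)
        files[name] = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
        )

    # One index file that links to every sitemap file generated above.
    index_entries = "".join(
        f"  <sitemap><loc>{escape(base_url + name)}</loc></sitemap>\n"
        for name in files
    )
    files["sitemap-index.xml"] = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{index_entries}</sitemapindex>\n'
    )
    return files
```

For practice 6, the index file produced this way could then be referenced from robots.txt with a line such as `Sitemap: https://example.com/sitemap-index.xml` (again, a placeholder URL).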

Scaling up sitemaps to very large sites

Interestingly, some sites these days are large… really large… with millions to billions of URLs. A sitemap index file or a sitemap file can list up to 50,000 links, so with one sitemap index file you can list 50,000 x 50,000 = 2,500,000,000 links. If you have more than 2.5 billion links… think first about whether you really need so many links on your site. In general, search engines will not crawl and index all of them. It’s highly preferable that you link only to the most relevant web pages to make sure that at least these relevant web pages are discovered, crawled and indexed. Just in case, if you have more than 2.5 billion links, you can use two sitemap index files, or you can use a sitemap index file linking to sitemap index files, which offers up to 125 trillion links (50,000 x 50,000 x 50,000): so far that’s still definitely more than the number of fake profiles on some social sites, so you’ll be covered. ;)
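
The capacity arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: the helper names `max_links` and `sitemap_files_needed` are made up for this example, and the only input taken from the text is the 50,000-link limit per file.

```python
MAX_LINKS_PER_FILE = 50_000  # limit for a sitemap file or a sitemap index file


def max_links(index_levels):
    """Links reachable with `index_levels` layers of index files on top of
    the sitemap files (0 = a single sitemap file, 1 = one index file, ...)."""
    return MAX_LINKS_PER_FILE ** (index_levels + 1)


def sitemap_files_needed(total_urls):
    """Smallest number of sitemap files that can hold `total_urls` links."""
    return (total_urls + MAX_LINKS_PER_FILE - 1) // MAX_LINKS_PER_FILE


print(max_links(1))                       # one index file over sitemap files
print(max_links(2))                       # an index of index files
print(sitemap_files_needed(120_000_000))  # files for a 120-million-URL site
```

With one index level the result is 2.5 billion links, and with two levels it is 125 trillion, matching the figures in the text.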

The main problem with extra-large sitemaps is that search engines are often not able to discover all the links in them, as it takes time to download all these sitemaps each day. To avoid over-crawling web sites, search engines cannot download thousands of sitemaps in a few seconds or minutes, and the total size of the sitemap XML files can exceed 100 gigabytes. Between the time we download the sitemap index file to discover the sitemap file URLs and the time we download those sitemap files, the sitemaps may have expired or been overwritten. Additionally, search engines don’t download sitemaps at a specific time of day, so they are often out of sync with a web site’s sitemap generation process. Having fixed names for sitemap files often does not solve the issue, as the files, and so the URLs listed in them, can be overwritten during the download process.

To mitigate these issues, a best practice to help ensure that search engines discover all the links of your very large web site is to manage two sets of sitemap files: update sitemap set A on day one, update sitemap set B on day two, and continue alternating between A and B. Use one sitemap index file to link to both set A and set B, or have two sitemap index files, one for A and one for B. This method gives search engines enough time (24 hours) to download a set of sitemaps that is not being modified, and so helps ensure that search engines have discovered all your site’s URLs in the past 24 to 48 hours.
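
One possible way to schedule the A/B alternation is sketched below in Python. Choosing the set from the parity of the calendar day is an assumption made for this example, as are the function names; any rule that strictly alternates between the two sets from one day to the next would serve the same purpose.

```python
from datetime import date, timedelta


def set_to_regenerate(day):
    """Return "A" or "B": the sitemap set to rewrite on `day`.

    Uses the parity of the day's ordinal number, so consecutive days
    always alternate between the two sets.
    """
    return "A" if day.toordinal() % 2 == 0 else "B"


def stable_set(day):
    """Return the set that is left untouched on `day`.

    This is the set that was regenerated the day before, so crawlers
    have roughly 24 hours to download it without it being overwritten.
    """
    return set_to_regenerate(day - timedelta(days=1))


today = date.today()
print("Regenerate today:", set_to_regenerate(today))
print("Leave untouched: ", stable_set(today))
```

A daily generation job could call `set_to_regenerate` to decide which set of files to overwrite, while the sitemap index file (or files) keeps linking to both sets.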

Regards,

Fabrice Canel
Principal Program Manager
Bing Index Generation