Uncovering web-based treasure with Sitemaps (SEM 101)

Have you ever noticed how pirate treasure maps are like Sitemaps? While your website may not contain a treasure of gold and silver (unless it’s a metals commodities trading site!), if you have good content, that is certainly treasure to someone who is looking for it. Unfortunately, it’s buried on your website and no one knows what’s there except you! But since you want to share your site’s treasure with others, you need to let them know what you have buried and where to find it. You can wait for search engine crawlers (aka bots) and random traffic to come by to browse, but that will take time and even then, they might not discover everything that you have to offer. Instead, you can help the search bots to dig up your treasured content with a Sitemap.

Now I should pause for a moment to mention that you shouldn’t confuse sitemaps with Sitemaps. You’ve got that, right? Well, just in case that’s as clear as mud, keep this in mind: when referring to sitemap files in text (such as this!), use the lower case word “sitemap” to mean HTML-based files intended for users to browse. They typically contain a list of all the pages on your site.

On the other hand, use the capitalized word “Sitemap” to mean XML-based files designed for use by search engine bots to collect data from webmasters identifying the most important pages and directories within their sites for crawling and indexing. Both types of sitemap files can (and probably should) use all lower case letters in their file name (such as sitemap.xml and sitemap.htm), but capitalize the references to the XML-based one in text to help readers distinguish which type of sitemap you are discussing. This article, coming from the perspective of a search engine, is focusing on Sitemaps, not sitemaps. You’re with me now, right? :-)

A good Sitemap will tell search engine bots about the content stored on a site. That helps the content be seen by the bot and, with any luck (assuming the content is well formed and has value), get into the index. Users who are on a content treasure quest will query search engines with keywords to locate the content they are seeking. If the search engine indexed the content found by the bot, which can be more likely when a good Sitemap is present, then that site’s content has a better chance of appearing in the search engine results pages (SERPs). After all, you can’t get onto a SERP if your pages aren’t indexed!

Structure

A Sitemap file, saved to the root directory of your site, contains references to specific URL locations for pages (or to other Sitemap files on very large sites), often describing the last modified date, the typical change frequency for a page, and the priority the specified page has compared to the other content on your site.

A brief example of the contents of a Sitemap file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
     <loc>http://www.mysite.com/default.htm</loc>
     <lastmod>2009-03-01</lastmod>
     <changefreq>monthly</changefreq>
     <priority>0.8</priority>
   </url>
   <url>
     <loc>http://www.mysite.com/contacts.htm</loc>
     <changefreq>yearly</changefreq>
     <priority>0.4</priority>
   </url>
</urlset>

The <urlset> tag is standard and points to the current protocol to reference. The <url> and <loc> tags are the minimum required data needed for each page entry. The other tags, <lastmod>, <changefreq>, and <priority>, are optional, additional data. To see the data entry formatting and attributes used for these optional tags, sail on over to Sitemaps XML format for reference information. Note that not every page on your site need be listed in the Sitemap—only the ones containing valuable content for the user.
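If you would rather generate the file than type it by hand, here is a minimal Python sketch of the structure described above. The function name build_sitemap and the dictionary layout for each page are assumptions made for illustration; only the tag names and the namespace come from the example in this article.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <urlset> tag in the example above.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return Sitemap XML (as bytes) for a list of page dictionaries.

    Hypothetical input format: each dictionary must have a "loc" key;
    "lastmod", "changefreq", and "priority" are optional.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        # Emit only the tags that were supplied for this page.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    # xml_declaration=True requires Python 3.8 or later.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    example_pages = [
        {"loc": "http://www.mysite.com/default.htm", "lastmod": "2009-03-01",
         "changefreq": "monthly", "priority": 0.8},
        {"loc": "http://www.mysite.com/contacts.htm",
         "changefreq": "yearly", "priority": 0.4},
    ]
    print(build_sitemap(example_pages).decode("utf-8"))
```

The output is not pretty-printed, which makes no difference to a bot reading it.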

File formats

Bing supports Sitemap files submitted as XML and gzip files, but not as HTM or HTML files (those would be sitemaps as opposed to Sitemaps, right? Besides, the XML content of a well-formed Sitemap file wouldn’t render correctly in browsers as an HTM file, anyway). If you’ve created a browsable HTML-based sitemap for your end users, they will thank you for the effort, but you can’t recycle it as a Sitemap. You’ll still need to create a separate, XML-based Sitemap file using the tag structure as noted above for submission to Bing.
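For the gzip option, a short sketch using Python's standard gzip module might look like the following. The function names and the idea of passing the XML around as bytes are assumptions for illustration; use whatever fits your own publishing process.

```python
import gzip


def write_gzipped_sitemap(xml_bytes, path):
    """Write Sitemap XML (bytes) to `path` as a gzip-compressed file."""
    with gzip.open(path, "wb") as fh:
        fh.write(xml_bytes)


def read_gzipped_sitemap(path):
    """Read a gzip-compressed Sitemap back into bytes, e.g. to double-check it."""
    with gzip.open(path, "rb") as fh:
        return fh.read()
```

The compressed file still contains ordinary Sitemap XML, so the tag structure described earlier applies unchanged.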

Size matters not so much anymore

A typical treasure map only has one X to mark the spot. However, your Sitemap can list multiple locations identifying the treasures on your site. It used to be common wisdom in the search engine optimization (SEO) community that there can be too much of a good thing: from a search engine perspective, the most effective size for a Sitemap was thought to be approximately 150 or fewer URLs. Anything more bountiful and the crawler might not take it all in. Well, not so fast, matey!

Per the post “Bing enhances support for large Sitemaps,” published on this blog just a few weeks ago, Bing now supports Sitemap files that contain up to 50,000 references (either URLs or links to child Sitemap files). This development is a boon for webmasters of very large sites. They can now create multiple child Sitemap files, each dedicated to mapping a specific area of their content organization, and store those child Sitemap files in the base directories of those content areas. Then they can link to the child Sitemaps from their primary (aka index) Sitemap file (the one stored in the root directory of the site). One index Sitemap linking to 50,000 child Sitemaps, each of those referencing up to 50,000 URLs, means they can reference up to 2.5 billion URLs through the Sitemap technology, and the Bing crawler, MSNBot, will read it all. Now that’s a lot of treasure!
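To make the index-and-child arrangement concrete, here is a rough Python sketch that splits a long URL list into child-sized batches and builds an index file pointing at the children. The sitemapindex, sitemap, and loc tags come from the sitemaps.org protocol; the function names and the child file locations in the comments are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_ENTRIES = 50_000  # per-file limit cited in this article


def chunk_urls(urls, size=MAX_ENTRIES):
    """Split a list of page URLs into batches small enough for one child Sitemap each."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(child_sitemap_urls):
    """Return index Sitemap XML (bytes) that links to the given child Sitemap URLs."""
    if len(child_sitemap_urls) > MAX_ENTRIES:
        raise ValueError("an index Sitemap can reference at most 50,000 child Sitemaps")
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child_url in child_sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = child_url
    # xml_declaration=True requires Python 3.8 or later.
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    # Hypothetical child Sitemap locations, one per content area.
    children = [
        "http://www.mysite.com/products/sitemap.xml",
        "http://www.mysite.com/articles/sitemap.xml",
    ]
    print(build_sitemap_index(children).decode("utf-8"))
    # The arithmetic from the paragraph above: 50,000 x 50,000 URLs.
    print(MAX_ENTRIES * MAX_ENTRIES)
```

Each batch returned by chunk_urls would become its own child Sitemap file, and the URLs of those files are what you pass to build_sitemap_index.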

Sitemap submission

There are multiple ways for webmasters to submit their Sitemaps to Bing:

  • Bing Webmaster tools. You can sign in to Bing’s Webmaster tools and use the Sitemaps tool (if you are not already registered to use these free tools, this is a good reason to sign up and see all the other tools available to help you analyze and optimize your site). Simply copy the URL of your Sitemap into the Direct sitemap submission text box, and then click Submit.
     
  • Robots.txt file reference. If you are using a robots.txt file to tell search engine bots which files and directories not to crawl (and thus keep out of their indexes), you can add a line to that file, most typically at the end, that reads as follows:

    Sitemap: http://www.YourURL.com/sitemap.xml
     

    substituting the full URL to your Sitemap file in place of the YourURL.com example.
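
If your robots.txt file is produced by a script, the robots.txt option above can be automated with a few lines of Python. This is only a sketch: the function name add_sitemap_line is hypothetical, and it assumes you already have the file's contents as a string.

```python
def add_sitemap_line(robots_txt, sitemap_url):
    """Return robots.txt content with a Sitemap line appended at the end.

    If the exact same line is already present, the content is returned unchanged.
    """
    directive = f"Sitemap: {sitemap_url}"
    lines = robots_txt.splitlines()
    if directive in (line.strip() for line in lines):
        return robots_txt
    lines.append(directive)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    existing = "User-agent: *\nDisallow: /private/\n"
    print(add_sitemap_line(existing, "http://www.YourURL.com/sitemap.xml"))
```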

Validation

Before submitting your Sitemap to Bing, we recommend that you run the XML code you’ve written through a Sitemap validation tool. After all, what good is a treasure map if it has errors in it? Do a search for your Sitemap validator of choice and follow the instructions on the page. If errors are found, correct them before you submit the Sitemap file to Bing.
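As a quick first pass before you reach for a full validator, a few basic checks can be scripted. The Python sketch below is not a replacement for a real Sitemap validation tool and does not check the file against the official schema; it only confirms that the XML is well formed, that the root element is the expected urlset, that every url entry has a loc, and that the file stays within the 50,000-entry limit mentioned earlier. The function name check_sitemap is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_ENTRIES = 50_000  # per-file limit cited in this article


def check_sitemap(xml_data):
    """Return a list of problems found; an empty list means these basic checks passed."""
    try:
        root = ET.fromstring(xml_data)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if root.tag != NS + "urlset":
        return [f"unexpected root element: {root.tag}"]
    problems = []
    entries = root.findall(NS + "url")
    if len(entries) > MAX_ENTRIES:
        problems.append(f"too many <url> entries: {len(entries)}")
    for number, url in enumerate(entries, start=1):
        loc = url.find(NS + "loc")
        # <loc> is required for every entry and must not be empty.
        if loc is None or not (loc.text or "").strip():
            problems.append(f"<url> entry {number} is missing <loc>")
    return problems
```

Calling check_sitemap on the contents of your file and getting back an empty list means it cleared these particular hurdles, nothing more.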

Once you submit your Sitemap to Bing, we will read its contents, which will help us uncover more of the content treasures buried on your site and evaluate them as potential new additions to our index. And with more of your site’s content in the index, instead of users sailing on past your site to other ports of call in their quest for content treasure, they may stop at yours and exclaim, “Shiver me timbers, matey, look at what we have here!”

If you have any questions, comments, or suggestions, feel free to post them in our SEM forum. Until next time…

Rick DeJarnette

Fabrice Canel

Edited May 22, 2023: Removed reference to ping.