Prevent a bot from getting “lost in space” (SEM 101)

We recently published a non-SEM 101 blog post on controlling the crawl rate of MSNBot, the Bing web crawler (aka robot, or simply just bot). That got me thinking about robots. Naturally, that led to The Robot on Lost in Space. Will Robinson, the show’s precocious youngster who was a whiz at 1960s-style, clunky electronics (even though the show was supposedly set in 1997!), was best friends with The Robot. They looked out for each other and helped each other in times of need.

In a way, search engine bots and webmasters have a similar relationship. They need one another. Webmasters need search engine bots to crawl the pages of their sites so that those pages can be added to the search index. Bots need webmasters to provide them with compelling content, well-formed code, and authoritative backlinks to serve to their search customers in search engine results pages (SERPs). This mutualistic, symbiotic relationship is beneficial to both parties. Webmasters benefit from getting high-quality content into the search engine index. Search engines benefit by being able to provide searchers with useful, relevant results to their queries, no matter how arcane.

The Robot character was actually quite powerful in its analytical capabilities, and would alert the Robinson family when danger lurked (although I never understood why it did not sense the cloak-and-dagger danger in stowaway Dr. Smith). But most important was young Will Robinson’s ability to communicate with The Robot. He could direct The Robot’s powerful intelligence toward things that needed its attention (such as your average, run-of-the-mill, hostile space aliens, who coincidentally always looked like expressionless, rubber-masked humans!). The Robot’s assistance helped Will survive his rough and tumble alien environment.

As a webmaster, you, too, can communicate with the robot (the search engine variety). You can block it from indexing specified directories and files, override the generic index block for a subset of those files, block the following of links on a page, and much more. Let’s take a look at how you can converse with your friendly search engine robot and help it navigate your website with an eye toward directing its behavior for indexing your content and following your links. This effort might help your site better survive in the rough and tumble environment of the Web.

While Will Robinson conversed with The Robot in spoken English, you’ll need to communicate with the search engine bots using the Robots Exclusion Protocol (REP). You can do this in three ways (depending on what you want to do):

  • In a separate file named robots.txt (which you save to the root directory of your website)
  • Within the HTML code of each page, in a <meta> tag or as attributes in the <a> tag
  • Within the HTTP Header sent by the web server

While not a perfect analogy, think of these options in terms of the scope of the message you want to send to the robot. For site-wide directives that apply to all content, you can use HTTP Header directives. For the flexibility of managing directives for the entire site, for individual directories, or even for individual files, you’ll typically use the robots.txt file. Page-specific and even link-specific directives are more often handled in a page’s HTML code.

1. The robots.txt file

The robots.txt file is commonly used to block bots (identified as user-agents within the context of the file) from indexing directories and files that contain data the webmaster doesn’t want added to the search index, such as scripts, databases, and other information that is not intended for public consumption or has no value for searchers. For example, a basic robots.txt file might include directives such as the following:

User-agent: *
Disallow: /private/

The above sample robots.txt file applies to all user-agents (bots to you and me) and blocks bots from indexing all files in the directory named /private/ on the web server.
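You can also address a specific bot by name with its own user-agent group; a bot that finds a group addressed to it by name follows that group and ignores the generic “*” group. A minimal sketch (the /scripts/ path is purely illustrative, not from any particular site):

User-agent: msnbot
Disallow: /scripts/

User-agent: *
Disallow: /private/

In this sketch, MSNBot is asked to stay out of /scripts/, while all other bots are asked to stay out of /private/.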

But you can do more than that with the robots.txt file. What if you had a ton of existing files in /private/ but wanted some of them made available for the crawler to index? Instead of re-architecting your site to move certain content to a new directory (and potentially breaking internal links along the way), use the Allow directive. Allow is a non-standard REP directive, but it’s supported by Bing and other major search engines. Note that to be compatible with the largest number of search engines, you should list all Allow directives before the generic Disallow directives for the same directory. Such a pair of directives might look like this:

Allow: /private/public.doc
Disallow: /private/

Note If both an Allow and a Disallow directive apply to the same URL within robots.txt, the Allow directive takes precedence.

Wildcards

The use of wildcards is supported in robots.txt. The “*” character can be used to match any sequence of characters within a URL, which is handy for dealing with elements such as session IDs and extraneous parameters. Examples would look like this:

Disallow: */tags/
Disallow: *private.aspx
Disallow: /*?sessionid

The first line in the above example blocks bots from indexing any URL that contains a directory named “tags,” such as “/best_Sellers/tags/computer/”, “/newYearSpecial/tags/gift/shoes/”, and “/archive/2008/sales/tags/knife/spoon/”. The second line blocks indexing of all URLs that contain the string “private.aspx”, regardless of the directory name (note that a preceding forward slash would be redundant and thus is not included). The last line blocks indexing of any URL with “?sessionid” anywhere in its URL string, such as “/cart.aspx?sessionid=342bca31”.

Notes

  • The last directive in the sample is not intended to block the indexing of file and directory names that use that string, only URL parameters, so we added the “?” to the sample string to ensure it works as expected. However, if the parameter “sessionid” might not always be the first one in a URL, you can change the string to “*?*sessionid” so you’re sure you block the URLs you intend. If you only want to block on the parameter name and not its values, use the string “*?*sessionid=”. If you delete the “?” from the example string, the directive will block URLs containing file and directory names that match the string. As you can see, this can be tricky, but also quite powerful. (The variants are gathered into a quick sketch after these notes.)
  • A trailing “*” is always redundant, since that simply replicates the default matching behavior of MSNBot. Disallowing “/private*” is the same as disallowing “/private”, so don’t bother adding wildcard directives for those cases.
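To put those “sessionid” variants side by side, here is a quick sketch (the lines beginning with “#” are comments, which robots.txt permits):

# blocks "?sessionid" only when it immediately follows the path
Disallow: /*?sessionid
# blocks "sessionid" appearing anywhere in the query string
Disallow: *?*sessionid
# blocks only the parameter name "sessionid=", not values containing "sessionid"
Disallow: *?*sessionid=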

You can use the “$” wildcard character to filter by file name extension.

Disallow: /*.docx$

Note The directive above will disallow any URL that ends with the file name extension string “.docx” from being crawled, such as the URL “/sample/hello.docx”. In comparison, the directive Disallow: /*.docx (without the “$”) blocks more URLs, as it matches “.docx” anywhere in the URL string, not just at the end.

Sitemaps

If you have created an XML-based Sitemap file for your site (as discussed in the recent blog post Uncovering web-based treasure with Sitemaps), you can add a reference to the location of your Sitemap file at the end of your robots.txt file. The syntax for a Sitemap reference is as follows:

Sitemap: http://www.your-url.com/sitemap.xml

Other issues

You can also add a crawl-delay directive to your robots.txt file to change the default pace at which Bing crawls your site. I’ll do my part in conserving electrons and avoiding redundancy by instead referring you to that post: Crawl delay and the Bing crawler, MSNBot.
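The syntax itself is simple; a minimal sketch might look like the following (the value shown is just an illustrative setting, and how much it slows the crawl is covered in the post referenced above):

User-agent: msnbot
Crawl-delay: 5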

Whatever you choose to do with robots.txt, don’t play games with constant changes to the file. Note this story from our blog: a webmaster was having problems getting a site properly crawled, and we discovered they were swapping out differently configured robots.txt files in an automated fashion in a misguided effort to control crawling. That was not a helpful strategy!

File format

The robots.txt file must be saved in a standard text file format, such as ASCII or UTF-8, so it can be read by the bots. One easy way to verify that the proper file format is used is to edit the file in Microsoft Notepad. Save the file using the Notepad default file format type, Text Documents (*.txt) with ANSI encoding.

Validation

Once your robots.txt file is built, I suggest that you validate it before you consider it done. If you are a member of Bing Webmaster Center, log in to get access to our online robots.txt validation tool. Otherwise, there are a number of other online robots.txt validators available for you to use. We all want to avoid having The Robot say, “Warning! Warning! That does not compute!”
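If you’d rather sanity-check your rules locally, here’s a minimal sketch using Python’s standard-library robots.txt parser. Keep in mind it implements the original REP, so it does not expand “*” or “$” wildcard patterns the way the major search engines do; the site name below reuses the sample from earlier, and the second file name is purely hypothetical:

# A quick local check of robots.txt rules using Python's standard library.
# Note: urllib.robotparser understands User-agent, Disallow, and Allow lines,
# but it does not apply "*" or "$" wildcard matching to paths.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.your-url.com/robots.txt")  # location of your robots.txt
rp.read()  # fetch and parse the file

# Ask whether a given bot may fetch a given URL under the current rules.
print(rp.can_fetch("msnbot", "http://www.your-url.com/private/public.doc"))
print(rp.can_fetch("*", "http://www.your-url.com/private/secret.doc"))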

Maintenance

Once you’ve got a good robots.txt file built and validated, don’t just set it and forget it. Periodically audit the settings in the file, especially after you’ve gone through a site redesign. You need to be sure your directives are still valid, so that you’re not unintentionally blocking the bot from legitimate content or leaving sensitive material exposed.

2. Directives in HTML code

You can also put REP directives directly in your HTML code. Recall a few weeks back we discussed the way to optimize the <head> tag and got into how to use <meta> tags? Well, that is where most of these REP directives go as well. Let’s take a look at how these work. Here’s a sample <meta> tag that addresses bots:

<meta name="robots" content="noindex, nofollow">

There are more options than the two shown above. REP values for the content attribute can be combined in a comma-separated list so that the page behaves just as you want.

Meta tags

The name=”robots” attribute of the <meta> tag is read by the bot when it accesses an HTML page. The directives listed in the content attribute tell it what (or more specifically, what not) to do. Let’s define the function of each of the content attribute values.

  • noindex: Prevents the bot from indexing the contents of the page, but links on the page can still be followed. This is useful when the page’s content is not intended for searchers to see.
  • nofollow: Prevents the bot from following the links on the page, but the page itself can be indexed. This is useful if the links created on a page are not in the control of the webmaster, such as with a blog or user forum. Links to spam or malware are not what careful webmasters want to serve to their customers!
  • nosnippet: Instructs the bot to not display a snippet for that page in the SERPs. Snippets are the text description of the page shown beneath the page title link in the listing.
  • noarchive: Instructs the bot to not display a cached page link for that page in the SERP.
  • nocache: Same as noarchive.
  • noodp: Instructs the bot to not use the title and snippet from the Open Directory Project (ODP) for that page in the SERP.
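For example, to let a page be indexed and its links followed while keeping the cached copy and any ODP description out of your listing, you might combine values like this:

<meta name="robots" content="noarchive, noodp">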

Note that HTML-based, bot-blocking directives within a page will override any Allow directive that applies to the same file in robots.txt.

Links

What if you want almost all of the links on a page followed, but for some reason, you want to block the bot from following one or a few? Well, there is a solution for that as well: the rel=”nofollow” attribute. To see it in action, look at the following sample anchor tag:

<a rel="nofollow" href="http://www.untrustedsite.com/forum/stuff.aspx?var=1">Read these forum comments</a>

Caveats

Note that with the rel="nofollow" attribute, a REP-compliant bot will not follow that specific link on that page. However, if any other page on your site (or a page on an external site) links to the blocked page without a REP directive blocking it, the page may still be crawled and could make it into the index. This caveat applies to any REP link-blocking directive that is not consistently applied to every link pointing to a given page. And since you can’t control external sites that link to a page you intend to block, the <meta> tag’s noindex directive on that page itself is the best option in that case.

I discuss this attribute in some detail at the end of a previous blog article. I’ll again conserve electrons by referring you to Making links work for you. Robots appreciate that kind of thing.

3. Directives in HTTP Headers

While you can use REP directives in <meta> tags on each HTML page, some content types, such as PDFs or Microsoft Office documents, do not allow you to add these tags.

Instead, you can configure your web server to add the X-Robots-Tag to the HTTP headers it sends, implementing REP directives for those file types or even across your entire website. Below is an example of the X-Robots-Tag code used to prevent both the indexing of any content and the following of any links:

x-robots-tag: noindex, nofollow

The directives available with the X-Robots-Tag are the same as defined earlier for use in the <meta> tag. You’ll need to consult your web server’s documentation on how to customize its HTTP header.
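For example, on an Apache server with the mod_headers module enabled, a configuration sketch along these lines would attach the header to PDF and Word documents (other web servers, such as IIS, handle this differently, so treat this as a sketch rather than a recipe and check your own server’s documentation):

<FilesMatch "\.(pdf|docx?)$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>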

REP directive precedence

Search engines regularly re-read the robots.txt file to learn about the latest disallow rules. Search engines do not attempt to download URLs that are disallowed in robots.txt. Without the ability to fetch a URL and its associated content, no content is captured and no links are followed. However, search engines are still aware of such links, and they may display link-only information in search results, with caption text generated from anchor text or ODP information.

However, web pages not blocked by robots.txt but blocked instead by REP directives in <meta> tags or by the HTTP Header X-Robots-Tag are fetched, so that the bot can see the blocking REP directive, at which point nothing is indexed and no links are followed.

This means that Disallow directives in robots.txt take precedence over <meta> tag and X-Robots-Tag directives, because pages blocked by robots.txt will not be accessed by the bot, and thus the in-page or HTTP Header directives will never be read. This is the way all major search engines work.

For more information on REP, see our past blog article on the subject, Robots Exclusion Protocol: joining together to provide better documentation.

Judicious use of REP directives will help shape your website’s presence on the SERPs. It’ll help direct the search engine bots to content you want them to see and block them from content you do not want indexed, be it because the content is not useful to searchers (such as shopping cart or logon pages) or because it contains potentially business confidential data (such as references to internal IT infrastructure in scripts). Your efforts to direct the traffic here will help prevent bots from getting “lost in space.” It’s only too bad that the search engine bots don’t wave their virtual arms and trumpet “Danger, Will Robinson!” whenever they can access content not intended for the index! But that’s your call to make, not theirs.

If you have any questions, comments, or suggestions, feel free to post them in either our SEM forum or our Crawling/Indexing Discussion forum. Later…

— Rick DeJarnette, Bing Webmaster Center