We’ve already covered in past blog articles some of the basics about how webmasters can use a file called robots.txt to control how search engine crawlers (aka bots) crawl their websites. But there is so much more to talk about with bots. So let’s take a bit of a deeper dive into the subject.
Topic 1: Using the proper text file encoding
The robots.txt file is used by webmasters to define which files and directories compliant search engine bots may or may not crawl. Robots.txt files are basically text files. However, even something as seemingly straightforward as a text file is not as simple as it might seem: the encoding scheme used to save the file makes a big difference. For example, the quintessential text file editor, the Notepad utility in Windows, lets you save your text files in your choice of the following encoding types:
- ANSI (aka Windows-1252)
- Unicode
- Unicode big endian
- UTF-8 (usually expanded as “Unicode Transformation Format,” though I’ve also seen alternatives starting with either “Universal” or even “UCS,” which itself stands for “universal character set”)
If you choose to save your robots.txt file as either Unicode or Unicode big endian, the resulting file will not be compatible with most search engine bots.
Robots.txt file requirements
To ensure that search engine bots (not just Bing’s, but all of them) can read the directives for blocking or allowing content in your robots.txt file, save the file using one of the following compatible encoding formats:
- American Standard Code for Information Interchange (ASCII) (a 7-bit, 128 character set)
- ISO-8859-1 (an 8-bit, 256 character set backward compatible with US ASCII)
- UTF-8 (a variable-length character encoding version of Unicode that is backwards compatible with US ASCII)
- Windows-1252 (aka ANSI, as used in Microsoft Windows, it is an 8-bit, 256 character set backward compatible with US ASCII)
Sticking with one of these compatible encoding formats will ensure that the bots you wish to control can read, and thus act upon, your robots.txt file. For more information, check out this article covering the history of character sets from the Microsoft Typography team.
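If you want to verify a file you already have, the check is easy to script. The sketch below (an illustration, not an official tool; the filename is just an example) flags UTF-16 files by their byte-order mark, since Notepad’s “Unicode” options write one, and otherwise reports a compatible encoding:

```python
# A quick sanity check: does a robots.txt file on disk use one of the
# bot-compatible encodings listed above?

def check_robots_encoding(path="robots.txt"):
    with open(path, "rb") as f:
        raw = f.read()
    # Notepad's "Unicode" / "Unicode big endian" options write a UTF-16
    # byte-order mark at the start of the file, which gives them away.
    if raw.startswith(b"\xff\xfe") or raw.startswith(b"\xfe\xff"):
        return "UTF-16 (incompatible: re-save as ASCII or UTF-8)"
    try:
        raw.decode("utf-8")  # ASCII is a subset of UTF-8
        return "ASCII/UTF-8 (compatible)"
    except UnicodeDecodeError:
        return "Windows-1252/ISO-8859-1 (compatible, but UTF-8 is safer)"
```

Note that a UTF-16 file saved without a byte-order mark would slip past this check; for a robots.txt file, though, sticking to plain ASCII or UTF-8 sidesteps the question entirely.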
Topic 2: Writing non-ASCII alphabetic characters in robots.txt
The limited number of compatible file encoding formats for robots.txt exposes a potential problem for some users.
The Internet Engineering Task Force (IETF) specifies that Uniform Resource Identifiers (URIs), comprising both Uniform Resource Locators (URLs) and Uniform Resource Names (URNs), must be written using the US-ASCII character set. However, ASCII’s 128 characters cover only the English alphabet, numbers, and punctuation marks. Some of the alphabetic characters from other Latin-based languages, such as ñ in Spanish and ç in French, are left out of ASCII. More significantly, most characters in non-Latin-based alphabets, such as pi (π) in Greek, ya (я) in Cyrillic, and entire alphabets from many other world languages, can’t be accurately written in the limited, English-oriented ASCII.
This limitation with regard to robots.txt can come into play for webmasters when bots visit web servers using languages whose characters fall outside of the ASCII character set. If a robots.txt file is present on that server and it includes directives to block bots from indexing content in files and directories whose names include non-ASCII characters, the bot may not interpret the directive as the webmaster intended.
Percent encoding to the rescue
There is a way to make sure bots can properly read file and directory path names, regardless of whether those names adhere to the ASCII standard. When writing directives that include characters unavailable in ASCII, you can “escape” (aka percent-encode) them, which enables the bot to read them.
Percent-encoded characters, described in the IETF’s RFC 3986, substitute for characters that cannot appear in a URI directly. A percent-encoded character is a sequence of one or more three-character codes, each consisting of a “%” sign followed by two hexadecimal digits, and each representing one octet (one byte). Percent encoding converts the character’s UTF-8 byte sequence into ASCII text that a URI-compliant bot can read.
To demonstrate what percent-encoded text looks like, type www.%62%69%6e%67.com into your browser’s address bar. It will be automatically decoded into www.bing.com. The octets %62, %69, %6e, and %67 are decoded by the browser into the letters b, i, n, and g, respectively. Note, though, that the recommended use of percent encoding is for non-ASCII characters in a URL path, to minimize the potential for decoding errors.
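You can reproduce that decoding in code as well; this one-liner uses Python’s standard urllib.parse module:

```python
from urllib.parse import unquote

# Decode the percent-encoded hostname from the example above.
print(unquote("www.%62%69%6e%67.com"))  # prints "www.bing.com"
```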
Real world example
Let’s look at a real-world example. Suppose you were the webmaster for a website that contained the URL http://www.domain.com/папка/ (the folder name in the sample URL is written in Cyrillic and literally means “folder”). To block a bot from indexing that folder on your website using percent encoding in your robots.txt file, you would need to write the directive as follows:

Disallow: /%D0%BF%D0%B0%D0%BF%D0%BA%D0%B0/
If instead you simply wrote

Disallow: /папка/

the bot may not be able to read the directive and thus fail to perform as desired.
Performing percent encoding
So how do you translate your non-ASCII characters into escape-encoded octets? Well, it’s a bit of a chore, frankly. If you search for them, there are a few websites and/or tools that offer to perform percent encoding for you, but rather than endorse a site I know nothing about, I’ll instead tell you how to manually calculate the conversion. If you want to use an automated tool, go for it. But knowing how the process works will allow you to verify that a tool encoded your characters correctly.
Warning! I’m going to get pretty tech geeky here. If working with hexadecimal and binary numbers is not your thing, I apologize up front!
OK, thus warned, let’s get to it. You first need to know the Unicode code point for each character you want to encode. Code points are usually presented as U+HHHH; the four “H” hex digits are what you need.
As defined in IETF RFC 3629 (which specifies UTF-8), an encoded character can be between one and four octets in length. The first octet of the sequence indicates how many octets are needed to represent the character: the higher the code point’s hex value, the more octets are required. Remember these rules:
- Characters with hex values between 0000 and 007F require only one octet. The high-order (leftmost) bit of the binary octet will always be 0, and the remaining seven bits define the character.
- Characters with hex values between 0080 and 07FF require two octets. The rightmost octet (the last of the sequence) always has its two highest-order bits set to 10; its remaining six bit positions hold the six lowest-order bits of the hex number’s converted binary value (I set the Calculator utility in Windows to Scientific view to do that conversion). The next octet (the first in the sequence, positioned to the left of the last octet) always starts with its three highest-order bits set to 110 (the number of leading 1 bits indicates the number of octets needed to represent the character, in this case two). The remaining higher-order bits of the binary-converted hex number fill in the last five lower-order bit positions (add one or more 0s at the high end if there aren’t enough remaining bits to complete the 8-bit octet).
- Characters with hex values between 0800 and FFFF require three octets. Use the same right-to-left octet encoding process as the two-octet character, but start the first (highest) octet with 1110.
- Characters with hex values higher than FFFF require four octets. Use the same right-to-left octet encoding process as the two-octet character, but start the first (highest) octet with 11110.
Below is a table to help illustrate these concepts. The letter n in the table represents the open bit positions in each octet for encoding the character’s binary number.
| Character hex value range | Octet sequence (in binary) |
| --- | --- |
| 0000 0000–0000 007F | 0nnnnnnn |
| 0000 0080–0000 07FF | 110nnnnn 10nnnnnn |
| 0000 0800–0000 FFFF | 1110nnnn 10nnnnnn 10nnnnnn |
| 0001 0000–0010 FFFF | 11110nnn 10nnnnnn 10nnnnnn 10nnnnnn |
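The two-octet rule from the table can be sketched in Python with a little bit arithmetic (a minimal illustration, not production code):

```python
# Two-octet UTF-8 rule: split the code point's bits into a 110xxxxx
# lead octet and a 10xxxxxx continuation octet.

def encode_two_octet(code_point):
    """Encode a code point in the 0080-07FF range as two UTF-8 octets."""
    assert 0x0080 <= code_point <= 0x07FF
    low_six = code_point & 0b111111   # six lowest bits -> continuation octet
    high_five = code_point >> 6       # remaining five bits -> lead octet
    return bytes([0b11000000 | high_five, 0b10000000 | low_six])

# Cyrillic lowercase "pe" (U+043F), the worked example below:
print(encode_two_octet(0x043F).hex(" ").upper())  # prints "D0 BF"
```

You can cross-check the result against Python’s built-in encoder: `"п".encode("utf-8")` produces the same two bytes.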
Let’s demo this using the first letter of the Cyrillic example given above, п. To manually percent encode this UTF-8 character, do the following:
- Look up the character’s hex value. The hex value for the lower case version of this character is 043F.
- Use the table above to determine the number of octets needed. 043F requires two.
- Convert the hex value to binary. Windows Calculator converted it to 10000111111.
- Build the lowest order octet based on the rules stated earlier. We get 10111111.
- Build the next, higher order octet. We get 11010000.
- This results in a binary octet sequence of 11010000 10111111.
- Reconvert each octet in the sequence into hex. We get a converted sequence of D0 BF.
- Write each octet with a preceding percent symbol (and no spaces in-between, please!) to finish the encoding: %D0%BF
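If you’d rather not do the bit math by hand, Python’s standard library performs the same conversion; urllib.parse.quote percent-encodes the UTF-8 octets for you:

```python
from urllib.parse import quote

# Percent-encode the Cyrillic folder name from the example above.
# quote() leaves "/" unescaped by default, so paths stay intact.
print(quote("/папка/"))  # prints "/%D0%BF%D0%B0%D0%BF%D0%BA%D0%B0/"
```

Note that the first and third encoded characters are both %D0%BF, matching the п we encoded manually.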
You can confirm your percent encoded path works as expected by typing it into your browser as part of a URL. If it resolves correctly, you’re golden.
There’s always more to talk about with robots (and so many other webmaster-related topics). If you have any questions, comments, or suggestions, feel free to post them in our SEM forum. Until next time…
— Rick DeJarnette, Bing Webmaster Center