As a member of the Live Search Webmaster Team, I'm often asked by web publishers how they can control the way search engines access and display their content. The de facto standard for managing this is the Robots Exclusion Protocol (REP), introduced back in the early 1990s. Over the years, the REP has evolved to support more than "exclusion" directives; it now supports directives controlling what content gets included, how the content is displayed, and how frequently the content is crawled. The REP offers an easy and efficient way to communicate with search engines, and is currently used by millions of publishers worldwide. Its strength lies in its flexibility to evolve in parallel with the web, its universal implementation across major search engines and all major robots, and the way it works for any publisher, no matter how large or small.
In the spirit of making the lives of webmasters simpler, Microsoft, Yahoo and Google are coming forward with detailed documentation about how we implement the Robots Exclusion Protocol (REP). This will provide a common reference for webmasters and make it easier for any publisher to know how their REP directives will be handled by the three major search providers, making REP more intuitive and friendly to even more publishers on the web.
Common REP Directives and Use Cases
The following list includes all the major REP features currently implemented by Google, Microsoft, and Yahoo. We are documenting these features and the use cases they enable for site owners. For each feature, you'll see what it does and how to communicate it to crawlers.
Each of these directives can be made applicable either to all crawlers or to specific crawlers by targeting them at specific user-agents, the name with which a crawler identifies itself. Beyond identification by user-agent, each of our crawlers also supports reverse DNS-based authentication, which lets you verify the identity of a crawler contacting your site.
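For example, a robots.txt file can contain one record that applies to all crawlers and additional records that apply only to a particular user-agent; a crawler obeys the record that most specifically matches its name. A minimal sketch, with hypothetical paths:

```
# Applies to all crawlers
User-agent: *
Disallow: /private/

# Applies only to Live Search's crawler, msnbot (path is hypothetical)
User-agent: msnbot
Disallow: /no-msnbot/
```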
1. Robots.txt Directives
Directive | Impact | Use Cases |
---|---|---|
Disallow | Tells a crawler not to crawl your site or parts of your site -- your site's robots.txt file still needs to be crawled to find this directive, but the disallowed pages will not be crawled | 'No crawl' pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled |
Allow | Tells a crawler the specific pages on your site you want indexed, so you can use this in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule (the longest rule) applies | Particularly useful in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it |
$ Wildcard Support | Tells a crawler to match everything from the end of a URL -- a large number of directories without specifying specific pages (available by end of June) | 'No crawl' files with specific patterns, e.g., files of certain types that always have a certain extension, say '.pdf' |
* Wildcard Support | Tells a crawler to match a sequence of characters (available by end of June) | 'No crawl' URLs with certain patterns, e.g., disallow URLs with session IDs or other extraneous parameters |
Sitemaps Location | Tells a crawler where it can find your Sitemaps | Point crawlers to the feeds or other locations that list the site's content |
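As a sketch of how these directives combine in a single robots.txt file (all paths and the Sitemap URL are hypothetical; the wildcard rules assume the '*' and '$' support described above):

```
User-agent: *
# Block a whole section of the site, except one subdirectory within it
Disallow: /archive/
Allow: /archive/public/

# '*' matches any sequence of characters; '$' anchors the match to the end of the URL
Disallow: /*?sessionid=
Disallow: /*.pdf$

# Absolute URL of the site's Sitemap
Sitemap: http://www.example.com/sitemap.xml
```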
2. HTML META Directives
The directives below can be specified as META tags in the page HTML or as X-Robots-Tag tags in the HTTP header, which allows non-HTML resources to implement identical functionality. If both forms are present for a page, the most restrictive version applies.
Directive | Impact | Use Case(s) |
---|---|---|
NOINDEX META Tag | Tells a crawler not to index a given page | Don't index the page. This allows pages that are crawled to be kept out of the index. |
NOFOLLOW META Tag | Tells a crawler not to follow links from a given page to other content | Prevent publicly writable areas from being abused by spammers looking for link credit. By using NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page.
NOSNIPPET META Tag | Tells a crawler not to display snippets in the search results for a given page | Present no abstract for the page on Search Results. |
NOARCHIVE / NOCACHE META Tag | Tells a search engine not to show a "cached" link for a given page | Do not make a copy of the page available to users from the Search Engine cache. |
NOODP META Tag | Tells a crawler not to use a title and snippet from the Open Directory Project for a given page | Do not use the ODP (Open Directory Project) title and abstract for this page in Search. |
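As a sketch, a page that should stay out of the index and never show a cached copy might carry the following META tag (the combination of values is illustrative):

```
<!-- Place inside the page's <head>; multiple values are comma-separated -->
<meta name="robots" content="noindex, noarchive">
```

For non-HTML resources such as PDF files, the same directives can be delivered in the HTTP response header instead, e.g. `X-Robots-Tag: noindex, noarchive`.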
Other REP Directives
The directives listed above are used by Microsoft, Google and Yahoo, but may not be implemented by all other search engines. Additionally, Live Search and Yahoo support the Crawl-Delay directive, which is not supported by Google at this time.
- Crawl-Delay - Allows a site to reduce the frequency with which a crawler checks it for new content (supported by Live Search and Yahoo).
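A minimal robots.txt sketch using Crawl-Delay, assuming the value is interpreted as the number of seconds a crawler should wait between successive requests (the value shown is illustrative):

```
# Ask Live Search's crawler to pause between fetches
User-agent: msnbot
Crawl-delay: 10
```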
Learn more
Going forward, we plan to continue this coordination and ensure that as new uses of REP arise, we're able to make it as easy as possible for webmasters to use them. Until then, you can find more information about robots.txt at http://www.robotstxt.org and within Live Search's Webmaster Center, which contains lots of helpful information, including:
- Authenticating MSNBot Via Reverse DNS Lookup
- Crawling Performance and Crawl-Delay
- Requesting inclusion of content into our index
- Technical Support for Webmasters
There is also a useful list of the bots used by the major search engines here: http://www.robotstxt.org/wc/active/html/index.html
-- Fabrice Canel & Nathan Buggia, Live Search Webmaster Team