Robots Exclusion Protocol: joining together to provide better documentation

As a member of the Live Search Webmaster Team, I’m often asked by web publishers how they can control the way search engines access and display their content. The de facto standard for managing this is the Robots Exclusion Protocol (REP), introduced back in the early 1990s. Over the years, the REP has evolved to support more than “exclusion” directives; it now supports directives controlling what content gets included, how the content is displayed, and how frequently the content is crawled. The REP offers an easy and efficient way to communicate with search engines, and is currently used by millions of publishers worldwide. Its strength lies in its flexibility to evolve in parallel with the web, its universal implementation across major search engines and all major robots, and the way it works for any publisher, no matter how large or small.

In the spirit of making the lives of webmasters simpler, Microsoft, Yahoo and Google are coming forward with detailed documentation about how we implement the Robots Exclusion Protocol (REP). This provides a common reference for webmasters and makes it easier for any publisher to know how their REP directives will be handled by the three major search providers, making REP more intuitive and friendly to even more publishers on the web.

Common REP Directives and Use Cases

The following list includes all the major REP features currently implemented by Google, Microsoft, and Yahoo. We are documenting the features and the use cases they enable for site owners. With each feature, you’ll see what it does and how you should communicate it.

Each of these directives can be made applicable to all crawlers or to specific crawlers by targeting it to specific user-agents, which is how each crawler identifies itself. Beyond identification by user-agent, each of our crawlers also supports reverse DNS-based verification, allowing you to confirm the identity of the crawler.
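
As an illustration of that verification, here is a minimal Python sketch. It assumes the crawler host names resolve under the domain suffixes shown; those suffixes are examples rather than an authoritative list, so check each engine's documentation for the exact values.

    import socket

    def verify_crawler(ip, allowed_suffixes=(".search.msn.com", ".googlebot.com", ".crawl.yahoo.net")):
        """Reverse-then-forward DNS check of a crawler's claimed identity."""
        try:
            # Reverse lookup: resolve the connecting IP address to a host name.
            host, _, _ = socket.gethostbyaddr(ip)
            # The host name should fall under one of the engines' crawler domains
            # (the suffixes above are assumptions for this sketch).
            if not host.endswith(allowed_suffixes):
                return False
            # Forward lookup: the host name must resolve back to the same IP address.
            return socket.gethostbyname(host) == ip
        except OSError:
            return False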

1. Robots.txt Directives

Disallow
Impact: Tells a crawler not to crawl your site or parts of your site. Your site’s robots.txt still needs to be fetched to find this directive, but the disallowed pages will not be crawled.
Use cases: ‘No crawl’ pages from a site. This directive in the default syntax prevents specific path(s) of a site from being crawled.

Allow
Impact: Tells a crawler the specific pages on your site you want indexed, so you can use this in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule – the longest rule – applies.
Use cases: Useful in particular in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it.

$ Wildcard Support
Impact: Tells a crawler to match everything from the end of a URL, covering a large number of directories without specifying specific pages (available by end of June).
Use cases: ‘No crawl’ files with specific patterns, e.g., files of a certain type that always have a certain extension, say ‘.pdf’.

* Wildcard Support
Impact: Tells a crawler to match a sequence of characters (available by end of June).
Use cases: ‘No crawl’ URLs with certain patterns, e.g., disallow URLs with session IDs or other extraneous parameters.

Sitemaps Location
Impact: Tells a crawler where it can find your Sitemaps.
Use cases: Point to other locations where feeds exist to direct the crawlers to the site’s content.

A sketch of a robots.txt that combines these directives follows.
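
The paths and the sitemap URL below are placeholders chosen for illustration, not recommendations:

    User-agent: *
    Disallow: /private/
    Allow: /private/annual-report.html
    Disallow: /*sessionid=
    Disallow: /*.pdf$
    Sitemap: http://www.example.com/sitemap.xml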

2. HTML META Directives

The tags below can be present as META tags in the page HTML or as X-Robots-Tag directives in the HTTP headers, which allows non-HTML resources to implement identical functionality. If both forms are present for a page, the most restrictive version applies. (A sketch of both forms follows the table.)

NOINDEX META Tag
Impact: Tells a crawler not to index a given page.
Use case: Don’t index the page. This allows pages that are crawled to be kept out of the index.

NOFOLLOW META Tag
Impact: Tells a crawler not to follow links on a given page to other content.
Use case: Prevent publicly writeable areas from being abused by spammers looking for link credit. By using NOFOLLOW, you let the robot know that you are discounting all outgoing links from this page.

NOSNIPPET META Tag
Impact: Tells a crawler not to display snippets in the search results for a given page.
Use case: Present no abstract for the page in search results.

NOARCHIVE / NOCACHE META Tag
Impact: Tells a search engine not to show a “cached” link for a given page.
Use case: Do not make a copy of the page available to users from the search engine cache.

NOODP META Tag
Impact: Tells a crawler not to use a title and snippet from the Open Directory Project for a given page.
Use case: Do not use the ODP (Open Directory Project) title and abstract for this page in search results.
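
For illustration, here is a sketch of the same kind of directive expressed both ways; the specific values and the PDF example are assumptions, not recommendations.

    In the page HTML:

        <meta name="robots" content="noindex, nofollow">

    In the HTTP response headers of a non-HTML resource, for example a PDF:

        HTTP/1.1 200 OK
        Content-Type: application/pdf
        X-Robots-Tag: noindex, noarchive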

Other REP Directives

The directives listed above are used by Microsoft, Google and Yahoo, but may not be implemented by all other search engines.  Additionally, Live Search and Yahoo support the Crawl-Delay directive, which is not supported by Google at this time.

  • Crawl-Delay – Allows a site to reduce the frequency with which a crawler checks for new content (supported by Live Search and Yahoo); a sketch appears below.
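
For example, a robots.txt group applying Crawl-Delay to msnbot might look like the following; the ten-second value is an arbitrary choice for the sketch, and each engine interprets the delay slightly differently.

    User-agent: msnbot
    Crawl-Delay: 10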

Learn more

Going forward, we plan to continue this coordination and ensure that as new uses of REP arise, we’re able to make it as easy as possible for webmasters to use them. Until then, you can find more information about robots.txt at http://www.robotstxt.org and within Live Search’s Webmaster Center, which contains lots of helpful information.

There is also a useful list of the bots used by the major search engines here: http://www.robotstxt.org/wc/active/html/index.html

– Fabrice Canel & Nathan Buggia, Live Search Webmaster Team

24 comments
  1. Anonymous

    Good work: the three major search engines are teaming up to create standards that ease our work. I hope this is just a first glimpse of further cooperation in the future.

  2. Anonymous

    Hi Live Search!

    Google Webmaster Tools lets me set the crawl rate: Slower, Normal (default), or Faster. I do not know what rates Slower, Normal, and Faster actually correspond to.

  3. Anonymous

    Can I see an example robots.txt for msnbot?

  4. Anonymous

    Are robots directives recommended for FAQ pages?

  5. Anonymous

    Live Search Team, can you please tell us 1) when your spiders will accept wildcard directives, and 2) whether the Live Search Webmaster "Validate robots.txt" tool will reflect the acceptance of wildcard values at the same time as the spider?

  6. Anonymous

    It's always easier for people when different search engines work together.

  7. Anonymous

    About wildcards:

    Does the path ‘*.doc$’ mean the same as ‘.doc$’?

  8. Anonymous

    Great blog, but I was wondering: how can I exclude parts of my page from being indexed but still allow the search engine to follow the links inside the excluded parts?

    Do the Microsoft products Live, MSN, SharePoint, and Search Server follow the same standards for this exclusion?

  9. Anonymous

    If I wanted to include /videos but disallow /videos/[here], I would include $, so /videos/$, as the $ part is a search query.

  10. Quality Directory

    Yes, REP provides the easiest way to communicate with major search engines. But some lesser-known search engines don't seem to respect the rules it sets out.

  11. Anonymous

    What is the name of the Bing spider? I want to write it in robots.txt.

  12. himanshu_swaraj

    Thanks a lot for sharing this. It will help webmasters a lot.

  13. Anonymous

    Thanks, I needed the noindex for my website; clearly I don't want the search engines to index my privacy policy pages.

    Matt

  14. Anonymous

    Do the Microsoft products Live, MSN, SharePoint, and Search Server follow the same standards for this exclusion? Good question.

  15. zhangmin_iichiba

    Does Bing respect nofollow? Really?

  16. zhangmin_iichiba

    NOFOLLOW META Tag

    "Tells a crawler not to follow links on a given page to other content"

    Is that really the same as in Google?

  17. Anonymous

    thanks a lot for sharing this

  18. Anonymous

    I know that your article makes sense, but because of my livelihood I had to come here to spam; I hope you can forgive me. Thank you. I expect you to write better articles. Maybe you will like this if you are into fashion.

  19. nirohi

    Thanks a lot for sharing this. Does Bing respect nofollow?

  20. Anonymous

    It is a shame that Bing is doing it too! We want Bing to be successful and give Google a run for its money, making the web a bi-polar world rather than the uni-polar one it is at the moment. For this, Bing has to offer webmasters and searchers something unique so that they switch from Google.

  21. Anonymous

    Whoo, good for us. I am looking forward to reading more wonderful posts here.

  22. nechaev_ra

    Certainly. "You let the robot know that you are discounting all outgoing links from this page." These are actions against spammers.

  23. m0rad

    thanks a lot for sharing this

Comments are closed.