Reducing Junk

Everyone has experienced trying to follow a link that didn’t work. The link may have led to a page that is no longer available, or to a parked domain page. It can also be frustrating to find a search result with a missing or garbled description of where the link may take you. Considering the enormous scale of the web and the complexity of the links between web documents, a key challenge for web search quality is the search engine’s ability to process, select, and serve only good links with helpful descriptions. In this blog my colleague Dr. Richard Qian, Partner Development Manager for Bing, provides an overview of our efforts to reduce junk links and junky or empty snippets.

– Dr. Harry Shum, Corporate Vice President, Bing R&D

To clarify a few terms used in this blog, the following screenshot shows a typical search result with its three common components: 1) the title, which carries a hyperlink; this is what we mean when we say links; 2) the snippet, which is the description of the search result; and 3) the URL, which shows the address of the result page.

[Screenshot: a typical search result, showing the title, snippet, and URL]

Removing Junk Links

In search there’s perhaps nothing worse than clicking a search result only to get an error message in return, or being taken to a page that tells you the content you were trying to open could no longer be found. Equally annoying is ending up on a page for a domain that was just registered and is plastered with ads without any useful content. These are different types of junk links that we refer to as dead links, soft 404s, and parked domains. Next, we explain how Bing detects and removes them from the search results.

Dead Links

A dead link is one for which an HTTP request for the page returns a 4xx or 5xx error code, e.g., 404 for a page that could not be found on the site hosting it, or 500 indicating an internal server error. A user following such a link will see an error message instead of useful content. Until we crawl these pages again and discover they are missing, we may still serve them in our search results. In general we want to remove all dead links from our search results as quickly as possible. However, we have observed that many such issues are only transient, and some pages come back alive after a short while. The classifiers that solve this problem help us decide whether a page is just temporarily experiencing an issue or has truly been deleted. If there is some suspicion about the page in question, we may boost its re-crawl priority and frequency to help us make a determination as quickly as possible. There is an important tradeoff here between aggressively removing dead links and preserving the relevance of our search results. We aim to minimize the number of dead links in our search results without removing content that may experience temporary issues but is useful to our users.
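
To make this concrete, here is a minimal sketch, in Python, of how a crawler might tell transient failures from truly dead pages. This is not Bing’s implementation; the probe function, thresholds, and decision labels are all illustrative assumptions.

```python
import urllib.request
import urllib.error

def check_status(url: str, timeout: float = 10.0) -> int:
    """Probe a URL with a HEAD request and return its HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code          # 4xx / 5xx responses still carry a status code
    except urllib.error.URLError:
        return 0                 # network-level failure: status unknown

def classify_link(history: list[int], min_probes: int = 3) -> str:
    """Classify a URL from its recent probe history, oldest probe first.

    The min_probes threshold is an invented example value, not a real one.
    """
    if not history:
        return "unknown"
    failures = [code for code in history if code >= 400 or code == 0]
    if history[-1] < 400 and history[-1] != 0:
        return "alive"           # latest probe succeeded: page came back
    if len(failures) == len(history) and len(failures) >= min_probes:
        return "dead"            # consistently failing: safe to remove
    return "suspect"             # mixed signals: boost re-crawl priority
```

In practice a "suspect" verdict would feed back into the crawl scheduler with a raised priority, and a production classifier would weigh many more signals (error type, host health, historical availability) than raw status codes.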

Soft 404

A soft 404 is like a hard 404 in that the page has been deleted from the site hosting it, but the server still returns a normal HTTP 200 code along with a web page reporting that the original page you were trying to reach no longer exists. Here we must rely on our ability to understand the page content to determine whether it is a soft 404. Our high-precision classifiers in this area use page content such as key phrases in the page’s title, body, and URL to determine whether the page is a soft 404 and whether to remove it from the search results. For example, for the query {Five Guys Burgers and Fries history}, here are Bing’s top results before and after we applied our soft 404 classifiers:

Before

[Screenshot: top results before applying the soft 404 classifiers]

After

[Screenshot: top results after applying the soft 404 classifiers]
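
To illustrate the key-phrase approach, here is a minimal rule-based sketch of a soft 404 detector. Bing’s real classifiers are machine-learned and far richer; the phrase list, weights, and threshold below are invented for the example.

```python
# Illustrative "not found" key phrases; a real system learns these features.
NOT_FOUND_PHRASES = (
    "page not found",
    "no longer exists",
    "404",
    "cannot be found",
    "has been removed",
)

def soft_404_score(title: str, body: str, url: str) -> float:
    """Score how strongly a 200-OK page reads like a 'not found' page."""
    title, body, url = title.lower(), body.lower(), url.lower()
    score = 0.0
    for phrase in NOT_FOUND_PHRASES:
        if phrase in title:
            score += 2.0      # title matches are the strongest signal
        if phrase in body:
            score += 1.0
        if phrase in url:     # e.g., a redirect landing on /404.aspx
            score += 1.5
    return score

def is_soft_404(title: str, body: str, url: str,
                threshold: float = 2.0) -> bool:
    return soft_404_score(title, body, url) >= threshold
```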

Parked Domains

Parked domains refer to websites that carry placeholder content after a new domain registration. Usually these pages show ads to monetize traffic to the domain before it has been properly set up by the new site owner. Here is an example:

[Screenshot: an example of a parked domain page]

Parked domains usually share similar page content, page layout, or other identifiable patterns. As with the techniques used to identify soft 404s, we look at patterns in page content to determine whether a page is a parked domain. By mining many different types of pages against our large index of web data, we are able to create signatures that allow us to identify parked domains when we see them and remove them from our search results.
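
As a rough sketch of signature-based detection, the code below reduces a page to a small set of shingle hashes (a MinHash-style sketch) and compares it against signatures mined from known parked pages. The shingle size, sketch size, and similarity threshold are illustrative assumptions, not Bing’s actual pipeline.

```python
import hashlib
import re

def page_signature(html_text: str, shingle_size: int = 4,
                   keep: int = 8) -> frozenset:
    """Reduce a page to a small set of word-shingle hashes.

    Pages generated from the same parking template collapse to heavily
    overlapping signatures even when domain names or ad slots differ.
    """
    words = re.findall(r"[a-z]+", html_text.lower())
    shingles = {
        " ".join(words[i : i + shingle_size])
        for i in range(len(words) - shingle_size + 1)
    }
    hashes = sorted(
        int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")
        for s in shingles
    )
    return frozenset(hashes[:keep])   # keep only the smallest hashes

def matches_known_template(signature: frozenset,
                           known: list[frozenset],
                           threshold: float = 0.75) -> bool:
    """Compare a signature against signatures from known parked pages."""
    for template in known:
        union = signature | template
        if union and len(signature & template) / len(union) >= threshold:
            return True
    return False
```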

Reducing Junky and Empty Snippets

The user experience is poor when the snippet of a search result contains a garbled description, or when the description is missing altogether. There are various causes of junky snippets, including an incorrectly encoded meta description, incorrect encoding handling, HTML parsing errors, and failed document conversion. There are multiple causes of empty snippets as well: the web page contains only dynamic content and the search engine’s crawler failed to get it; the web page is just a picture; the target document is in a non-HTML format and the document converter failed to extract text from it; or the page carries a ‘noindex’ meta tag, just to name a few.

Junky Snippets

Bing uses the standard UTF-8 encoding in our internal document processing stack and recommends that site owners use UTF-8 to minimize any potential conversion issues that result in junky snippets. We also support other encodings by employing a learned classifier to detect the encoding of an input page and then converting it to UTF-8. We have additionally trained a classifier to catch unreadable or garbage text. In addition, Bing identifies HTML, XML, JavaScript, and other markup using a comprehensive parser, and we continue to improve its robustness in handling various types of pages and corner cases.
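
As a simplified illustration of the encoding step, the sketch below uses the open-source chardet library as a stand-in for Bing’s learned encoding classifier; the function name and fallback behavior are assumptions for the example.

```python
import chardet  # third-party detector standing in for a learned classifier

def to_utf8(raw_bytes: bytes) -> str:
    """Detect the encoding of a fetched page and normalize it to UTF-8 text."""
    guess = chardet.detect(raw_bytes)
    encoding = guess["encoding"] or "utf-8"   # fall back to UTF-8 if unsure
    # errors="replace" keeps the pipeline moving on bad bytes; the
    # replacement characters left behind become a signal that the
    # garbage-text detector can pick up downstream.
    return raw_bytes.decode(encoding, errors="replace")
```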

The improved coverage and precision of our encoding classifier, document converter, garbage detector, and HTML parser have reduced the occurrence of junky snippets in Bing’s search results. Here is an example showing the improvement on a junky snippet for the query {Yimin Xiao}:

Before

[Screenshot: junky snippet for the query {Yimin Xiao}]

After

[Screenshot: corrected snippet for the query {Yimin Xiao}]
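
To give a flavor of what the garbage detector looks for, here is a simple heuristic stand-in for the trained classifier mentioned above; the threshold and character classes are illustrative assumptions.

```python
def looks_garbled(text: str, max_bad_ratio: float = 0.1) -> bool:
    """Heuristic stand-in for a trained garbage-text classifier.

    Flags text dominated by Unicode replacement characters or other
    non-printable code points -- typical residue of a wrong encoding guess.
    """
    if not text:
        return False
    bad = sum(
        1
        for ch in text
        if ch == "\ufffd" or (not ch.isprintable() and ch not in "\n\t ")
    )
    return bad / len(text) > max_bad_ratio
```

A trained classifier would also catch printable mojibake (for example, the familiar "â€™" sequences) that this simple character-class heuristic misses.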

Empty Snippets

Many websites today make heavy use of client-side technologies like AJAX and Flash to provide rich and dynamic user experiences. Using traditional crawling and indexing techniques in these situations could leave us without useful content for ranking and snippet generation. Bing embraces the richness of the web with dynamic crawlers and document processors that render and index dynamically generated pages. We also utilize a number of classifiers to determine whether a page is a plain static page or needs to be dynamically rendered. The majority of the web is made up of HTML, but there are also valuable documents in other formats such as PDF, Word, PowerPoint, Excel, etc. Bing developed an in-house document converter to translate these documents into HTML, which can then be used for ranking and snippet generation. We have also trained classifiers to identify titles and other text fragment types from the converted documents.
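
Here is a minimal sketch of the static-versus-dynamic routing decision: send a page to the more expensive rendering crawler when its static HTML carries little visible text relative to its script payload. The class, thresholds, and routing rule are assumptions for illustration; the real classifiers use many more signals.

```python
from html.parser import HTMLParser

class ContentProfiler(HTMLParser):
    """Tally visible text versus script payload while parsing a page."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_chars = 0
        self.script_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chars += len(data)
        else:
            self.text_chars += len(data.strip())

def needs_dynamic_rendering(html_text: str, min_text_chars: int = 200) -> bool:
    """Route a page to the rendering crawler when its static HTML has
    little visible text relative to its scripts (illustrative rule)."""
    profiler = ContentProfiler()
    profiler.feed(html_text)
    return (profiler.text_chars < min_text_chars
            and profiler.script_chars > profiler.text_chars)
```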

Here are a couple of examples showing results before and after we applied our various techniques to reduce empty snippets:

Before

[Screenshot: a search result with an empty snippet]

After

[Screenshot: the same result with a generated snippet]

Before

[Screenshot: a second search result with an empty snippet]

After

[Screenshot: the same result with a generated snippet]

Thank you for reading this blog. We’d love to hear your feedback and about your experience using Bing, especially in the areas we discussed here. At Bing we are committed to continuing to improve our search quality and earning your satisfaction. We will cover other feature areas in future blogs.

Dr. Richard Qian, Partner Development Manager, Core Search, Bing R&D