Bing Search Quality Insights: Minimizing Answer Defects

Back in March, Jan Pedersen explored the topic of Whole Page Relevance. In that post, he outlined how Bing uses machine learning techniques to rank media objects and deliver a rich set of results that go beyond just web pages. These objects include videos, images, maps, news items and other media, or what we call answers. Today, Kieran McDonald will detail how we ensure that answers are not defective and that the results match the intent of your query. If you are interested in the wider topic of minimizing defects, also read William Ramsey's recent blog: To Err is Human.

– Dr. Harry Shum, Corporate Vice President, Bing R&D.

The core philosophy guiding our algorithm centers on presenting the most useful results at the top of the page. In the simplest terms, we do this by ranking results in descending order by the number of clicks each one receives. Answers, in turn, must be competitive with the web results that they displace in order to justify their real estate and position on the page. While this may appear straightforward, we occasionally encounter instances where answers do not match user intent even though they receive a sufficient percentage of clicks to maintain their position on the page. We deem these results defective. To address these instances, we have built dedicated models aimed at delivering the most relevant results possible. In this post, we'll illustrate how defects might slip through the cracks and how the models we employ prevent unhelpful answers from appearing in the results.
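The click-based ordering described above can be sketched in a few lines. The result names, click rates, and the idea of answers competing directly with web results for position are illustrative simplifications, not Bing's actual ranking pipeline:

```python
from dataclasses import dataclass

@dataclass
class Result:
    title: str
    click_rate: float  # historical fraction of impressions that were clicked
    is_answer: bool

def rank_page(results):
    """Order results by descending click rate, the simplified ranking
    rule described in the post."""
    return sorted(results, key=lambda r: r.click_rate, reverse=True)

# Hypothetical results for the query {elephant}: the product answer earns
# enough clicks to stay on the page even though it mismatches the intent.
page = rank_page([
    Result("Wikipedia: Elephant", 0.30, False),
    Result("Images answer", 0.35, True),
    Result("Elephant posters (product answer)", 0.12, True),
])
```

Under this rule alone, the product answer keeps its slot as long as people keep clicking it, which is exactly the gap the defect models are built to close.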

Here are some examples of how this works:

For the query {elephant}, we know that most people will primarily interact with three items on the Search Engine Results Page (SERP):

1) The images answer showcasing evocative elephant images

2) The Wikipedia web result

3) The video answer highlighting popular elephant videos

Below is the current Bing SERP for this query:

But deep within the Bing internal indexes, there is actually another answer trying to vie for your attention. The product index has identified an answer for elephant posters. While you can’t buy elephants on Bing, we can identify elephant posters that you could buy with this answer:

Our defect classifiers mark this answer as a defect because it does not match the user intent, and it is blocked from the page even though a percentage of people may click on it, enough to keep it competitive with the lower-ranked web documents. This example highlights the problem with relying solely on click rate: it is likely that people click on the answer because of the attractive graphic even though they have no interest in purchasing an elephant poster. Our defect classifier uses multiple signals besides click rate, including how people have historically engaged with a specific query category, determines that this answer is not relevant for this query relative to the other results, and blocks it from the page. By marking the answer as a defect, we are able to match the results more closely to the original query intent.
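One way to picture a classifier that weighs click rate against other signals is a linear model with a sigmoid output. The signal names, weights, and threshold below are hypothetical illustrations, not Bing's production features:

```python
import math

def defect_probability(signals, weights, bias):
    """Hypothetical linear model over answer signals, squashed through
    a sigmoid. Nothing here reflects Bing's real features or weights."""
    z = bias + sum(weights[name] * value for name, value in signals.items())
    return 1.0 / (1.0 + math.exp(-z))

# Invented signals for the elephant-posters answer: a decent click rate,
# but weak historical engagement with product answers on animal queries
# and a poor intent match.
signals = {"click_rate": 0.12, "category_engagement": 0.05, "intent_match": 0.10}
weights = {"click_rate": -4.0, "category_engagement": -6.0, "intent_match": -8.0}
p = defect_probability(signals, weights, bias=2.0)
blocked = p > 0.5  # despite its clicks, the answer is classified as a defect
```

The point of the sketch is the shape of the decision: click rate is only one of several inputs, so an answer can attract clicks and still cross the defect threshold.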

Another class of queries that is particularly susceptible to ambiguous intent and defective answers is navigational queries: queries where users typically want to reach a single site or web page. Consider the query {target}, which also matches content in our video index.


In some of the simplest cases, intent mismatches occur when a federated index, such as our video index, has content that matches the query. In isolation the results might seem relevant based on simple ranking features, but the user has no intent for that specific content. For example, {target} and {amazon} are navigational queries, but they also exactly match many video titles, which caused this answer to return results for these queries in our internal systems.
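A minimal sketch of this failure mode, assuming a hypothetical set of known navigational queries and simple exact title matching:

```python
NAVIGATIONAL = {"target", "amazon"}  # hypothetical navigational-query list

def video_answer_defective(query, video_titles):
    """Flag the video answer when a navigational query exactly matches a
    video title: the text match is real, but the intent is to reach the
    site, not to watch a video. A toy heuristic, not Bing's logic."""
    exact_match = any(query.lower() == t.lower() for t in video_titles)
    return exact_match and query.lower() in NAVIGATIONAL

titles = ["Target", "Hitting the Target", "Amazon"]
```

An exact match on its own proves nothing; it is the combination with navigational intent that makes the answer a likely defect.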


In terms of click order, this video answer can actually compete effectively with some of the web results (typically ranking second to fifth on the page), but these answers are clear defects. By blocking them, we keep people from being distracted from their original focus, which is to visit the retail store.

In a federated search engine, multiple indexes and answers run in parallel to deliver results within milliseconds. The finance answer and news answer strongly match the intent of the query {stock market}.

If you had an internal view into the system, you would also see a third answer, a local answer for "Stock Market Foods," vying for attention.

The defect classifier uses information about how the web, news and finance answers interpret this query to decide whether the local answer is defective. The answer ranking models also agree that this answer is not competitive with the other content on the page (web results and answers). In this case, both the ranking model and the defect model agree to block the answer from the page.
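The interplay between the two models could be sketched as follows, with invented scores and a made-up threshold standing in for the real systems:

```python
def page_decision(defect_prob, answer_score, web_scores, threshold=0.5):
    """Two complementary checks, as the post describes: the defect model
    asks whether the answer deserves to be on the page at all, while the
    ranker asks whether it can compete for a position. Scores and the
    threshold are illustrative, not real system values."""
    if defect_prob > threshold:          # defect model votes to block
        return "block"
    if answer_score < min(web_scores):   # ranker finds it uncompetitive
        return "demote"
    return "show"

# Hypothetical scores for the "Stock Market Foods" local answer: both the
# defect model and the ranker point the same way.
decision = page_decision(0.9, answer_score=0.2, web_scores=[0.6, 0.5, 0.4])
```

Keeping the two checks separate reflects the summary lesson later in the post: one model optimizes placement, the other tests whether the answer belongs on the page at all.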

The other major source of defects is poor-quality results, as opposed to misinterpreting the intent of the user.

A typical case in which this can happen internally is when an index is too lax in allowing a partial match between the query and its content. Below is a blocked news answer for the query {centinela hospital medical center}.

Our defect models use signals that capture how well the query matches the web results and the news results to infer that this is a defective answer. In this example, signals capturing the partial match between the news articles and the original query also increase the probability that the answer is classified as a defect.
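A crude stand-in for such a partial-match signal is the fraction of query terms found in each news title. The titles below are invented for illustration:

```python
def token_overlap(query, title):
    """Fraction of query terms appearing in a result title: a simple
    stand-in for the partial-match signals mentioned in the post."""
    query_terms = set(query.lower().split())
    title_terms = set(title.lower().split())
    return len(query_terms & title_terms) / len(query_terms)

query = "centinela hospital medical center"
titles = [  # invented news titles that only partially match the query
    "Area hospital opens new medical wing",
    "City center traffic report",
]
scores = [token_overlap(query, t) for t in titles]
```

When no result covers all of the query terms, the best available match is only partial, which raises the probability that the news answer is a defect.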

Ambiguous queries can be more difficult to correctly handle. The query {john} triggers an image answer that shows a mixture of John Cena and John Deere tractors.


The defect classifier identifies this answer as defective, while correctly leaving the image answer in place for the more specific query {john cena}. Predicting defects in multimedia answers requires signals that characterize the quality of the multimedia results and quantify how well they match the query. Signals that characterize the diversity of the web results can also help the defect classifier. This is a challenging area where we continue to invest more deeply.
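One illustrative diversity signal is the entropy of the entities depicted in the image results; the labels below are hypothetical, not real index data:

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the entities depicted in an image
    answer; high entropy on a short query signals mixed intent."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical entity labels for the two queries.
mixed = ["john cena"] * 4 + ["john deere tractor"] * 4  # {john}
focused = ["john cena"] * 8                             # {john cena}
```

An even split between two unrelated entities yields maximal entropy for the ambiguous query, while the specific query collapses to zero, which is one way a classifier could separate the two cases.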

In summary, if one index can expertly answer a query, it reduces the likelihood that another is relevant. In our final example, {squirrels barking}, a video is highly relevant, but we fail to recognize that the image result is defective. As we all know, relevance and speed are critical for search engines. The cost of a defect is not just the potential of distracting people from relevant results but also the extra time it takes to deliver the answer.

We know that sometimes a defect will slip through the cracks, but we are constantly working to improve the defect classifiers to enhance the relevancy of our search results. While we don’t have the perfect solution yet, we’ve learned a few things over time:

  • Building a specialized classifier to minimize defects is more effective than relying on the answer ranking algorithm alone to filter out defects on the page. The defect classifier tests whether the answer deserves to be on the page, while the answer ranker tries to find the optimal position to place the answer. While these are related approaches, they have different focus areas. If minimizing defects is the goal, then a specialized model for that task helps.
  • In publishing, content is king, but when it comes to machine learning, signals are king. Once a machine learning approach is working, iterating on signals generally provides a continual source of gains.
  • The need for effective measurement cannot be overstated. Answer defect measurement can require judgment on the intent and quality of an answer's content relative to a query. The measurement process itself often needs iteration and monitoring to ensure alignment with what people expect.

Bing is committed to delivering high quality results as quickly as possible. To that end, minimizing answer defects speeds up the page and removes irrelevant content. In the end our goal is to help you spend less time searching and more time doing.

– Dr. Kieran McDonald, Principal Development Manager, Bing