Bing Search Quality Insights: Whole Page Relevance

Dr. Jan Pedersen, Chief Scientist for Core Search at Bing

If you compared the results you see on Bing today with those of our search product five years ago, at first glance they might seem similar: ten blue links presented with text summaries. Look closer, however, and you would see some important differences. The first thing you would notice is that many of the queries you perform today produce excellent results that would simply not have been possible five years ago. This is a tribute to the continued growth and evolution of the Web, coupled with our unceasing focus on core search quality. Second, you would notice that the overall experience is much richer. Bing’s search results today are no longer just web pages; we include videos, images, maps, news items and other media objects, which we call answers.

Moreover, some results take up more screen real estate than others, making it easier for you to see the content that will help you get things done quickly. For example, the query [bing] returns a first result for Bing.com that is larger than usual, with vertically arrayed deep links and other quick-access links beneath it. This is what we call an answer, and it makes sense here because, for a navigational query like this, the vast majority of people are trying to find Bing the search engine and are interested only in the first result. Since we recently made an announcement, there is also a news answer inserted after the first web result (bing.com) but before the second (wikipedia.org).

[Screenshot: Bing results page for the query [bing], showing the enlarged first result with deep links and the news answer]

The richness of the results page leads to a number of interesting and novel search quality questions. Under what conditions are you better served by an answer than by a traditional link to a web page? How do we justify making one result more prominent if that makes other results harder to find? At Bing we refer to these optimization problems as Whole Page Relevance, and we have developed a number of methodologies to address some of the larger issues. To illustrate the point, let me describe a bit of the Bing technology for blending blocks of content, web pages and answers alike, into a single result set, which we call Answer Ranking.

As with any relevance problem, we start with the question of how to measure whether Bing has done a good job. We could do this by simply asking human judges to compare the output of competing blending algorithms and assess which is better, but this turns out to be a difficult judgment task that produces quite noisy and unreliable results. Instead we look at how people behave on Bing in the real world. Based on how they respond to the changes we make, we assume that a better blending algorithm will move people’s clicks toward the top of the page. This is the same as saying that a block of content, or answer, is well placed if it receives at least as many clicks as the equivalently sized block of content below it, or, as we say internally, if its win rate is greater than 0.5. A good blending algorithm will therefore promote an answer up the page as long as its win rate is greater than 0.5. Armed with this metric, we can run online experiments and compare the results of competing blending algorithms, giving us a realistic data set.
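To make the metric concrete, here is a minimal sketch of a win-rate computation in Python. The impression format and the handling of ties are assumptions made for illustration; our internal definition involves details not spelled out here.

```python
# A minimal sketch of the win-rate metric, assuming a simplified impression
# log: each impression records clicks on an answer block and clicks on the
# equivalently sized block directly below it. Tie handling is an assumption.
from typing import Iterable, Tuple

def win_rate(impressions: Iterable[Tuple[int, int]]) -> float:
    """Fraction of clicked impressions where the answer block received at
    least as many clicks as the block directly below it."""
    wins = total = 0
    for answer_clicks, below_clicks in impressions:
        if answer_clicks == 0 and below_clicks == 0:
            continue  # no click signal on either block; skip
        total += 1
        if answer_clicks >= below_clicks:
            wins += 1
    return wins / total if total else 0.0

# An answer is well placed when its win rate exceeds 0.5, so a blending
# algorithm should keep promoting it up the page while that holds.
assert win_rate([(1, 0), (0, 1), (2, 1), (1, 1)]) == 0.75
```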

Next we investigate the inputs available to an online blending function that improves this metric. We can, and do, use historical anonymous click data, but this is not sufficient because it does not generalize to rare queries or to new content with no history. So we add in three kinds of additional inputs: confidence scores from the answer provider, query characterizations, and features extracted from the other answers and web pages that will be shown on the page. For example, to learn where to place an image answer in the search results for a given query, we consider the confidence score returned from the image search provider, the ranking scores of nearby web pages, and whether the query is marked as referring to the sort of entities that are well described by images (people, places, etc.). Currently we use over one thousand signals in our production blending functions.
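The sketch below shows how inputs of these three kinds might be assembled into a feature vector for one candidate answer. Every field and feature name here is hypothetical; the production functions draw on more than a thousand signals.

```python
# Hypothetical candidate representation and feature extraction for blending.
# Field names and features are illustrative, not Bing's actual signals.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class BlendingCandidate:
    answer_type: str                  # e.g. "image", "news", "video", "webpage"
    provider_confidence: float        # confidence score from the answer provider
    historical_ctr: float             # anonymous click history; 0.0 for unseen queries
    query_features: Dict[str, float]  # query characterizations (entity type, intent, ...)
    nearby_page_scores: List[float]   # ranking scores of web pages near the slot

def extract_features(c: BlendingCandidate) -> List[float]:
    """Flatten one candidate into the numeric vector a learned blender consumes."""
    top_page = max(c.nearby_page_scores) if c.nearby_page_scores else 0.0
    return [
        c.provider_confidence,
        c.historical_ctr,
        top_page,
        c.provider_confidence - top_page,               # answer vs. strongest web page
        c.query_features.get("is_person_entity", 0.0),  # entities well served by images
        c.query_features.get("is_place_entity", 0.0),
    ]
```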

Finally, we consider the offline and online infrastructure used to create and run a blending function. We use a robust, high-performance learning method called boosted regression trees to automatically produce a ranking function from training data. This allows us to use many signals with the confidence that each additional signal will incrementally improve our blending function. Our training sets are fairly large, since they are mined from billions of anonymous query session logs, so we use our large-scale data mining infrastructure, called Cosmos, to prepare the data and run offline experiments. Once a new blending function has been generated by our offline learning method, it is deployed to a serving component, internally called APlus, which runs after all candidate content blocks have been generated. There the function can be tested via online experimentation and finally placed into production.
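As a conceptual sketch of the offline step, the snippet below trains a gradient-boosted regression tree ensemble with scikit-learn standing in for our internal learner, and synthetic data standing in for Cosmos-mined session logs.

```python
# Conceptual offline training sketch: scikit-learn's boosted regression trees
# stand in for the internal learner; random data stands in for mined logs.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 6))  # one feature vector per candidate placement
# Synthetic click-derived target; real targets would come from session logs.
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=10_000)

blender = GradientBoostingRegressor(
    n_estimators=300,    # many shallow trees, added stage by stage
    max_depth=4,         # keep each tree a weak learner
    learning_rate=0.05,  # shrinkage: extra signals improve the ensemble incrementally
)
blender.fit(X, y)

# The learned function scores candidate placements; higher means place higher.
print(blender.predict(X[:3]))
```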

The net effect of all this is that, from training data, we can generate a blending function that looks at all the available content in response to a user query and makes a final placement decision that attempts to move clicks as high up on the page as possible. In other words, we can deliver a rich set of results that are statistically more likely to get you what you’re looking for, whether it’s a link, a video, a news item, a map or a snippet of information.
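Reusing the hypothetical extract_features and blender from the sketches above, the final placement step amounts to scoring every candidate block and ordering the page by score; slot constraints and other production details are omitted.

```python
# Simplified final placement: score each candidate block (web pages and
# answers alike) and lay out the page in descending score order.
def blend_page(candidates, blender, extract_features):
    scored = [(float(blender.predict([extract_features(c)])[0]), c)
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```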

This methodology has been very successful at improving overall whole page relevance while remaining adaptable to new sorts of answers. Over the last two years we have significantly improved the aggregate win rate, and currently we deploy and maintain dozens of machine-learned blending functions specialized to different answer types.

Most recently we have focused on new inputs that can improve our ability to place temporally relevant answers. For example, the news answer is designed for speed and timeliness. Our aim is for the answer to update as quickly as possible when a news story breaks, and then to decline in prominence as the news cycle for the story winds down. We achieve this by taking in more inputs from news providers and by focusing our training data to capture temporal relevance events. For example, a few months ago, our production blending function at the time (on the left below) placed news at the bottom of the page for the query [angelina jolie], while our temporally tuned blending function (on the right below) placed that news at the top of the page, which was appropriate at that time.

[Screenshots: results for the query [angelina jolie]; left: the previous production blending function with the news answer at the bottom of the page; right: the temporally tuned blending function with the news answer at the top]
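One simple way to model this behavior, offered purely as a hypothetical illustration rather than a description of our actual signals, is a freshness feature that decays as a story ages, so a news answer surfaces quickly on a breaking story and recedes as the cycle winds down.

```python
# Hypothetical freshness feature: 1.0 at publication, halving every
# half_life_hours. The half-life value is illustrative only.
import math

def freshness_boost(hours_since_publication: float,
                    half_life_hours: float = 6.0) -> float:
    return math.exp(-math.log(2.0) * hours_since_publication / half_life_hours)

assert abs(freshness_boost(6.0) - 0.5) < 1e-9   # one half-life later
assert freshness_boost(48.0) < 0.01             # old stories recede
```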

As the Web continues to evolve with new forms of rich media and data, our approach to whole page relevance will evolve along with it. It’s interesting to imagine what Bing will look like five years from now.

With that, we’ll just ask you to stay tuned and join us in this dialog about search quality at Bing.

– Jan Pedersen