Search Quality Insights: Behind the Bing It On Challenge

As I mentioned in my original post, there has never been a more exciting or challenging time to be in the search space. The core of a great search engine has been and will always remain the same: delivering relevant, comprehensive and unbiased results that people can trust. We use thousands of signals drawn from queries, documents and user feedback to determine the best search results, and in turn make hundreds of improvements to our features every year, from small tweaks to core algorithm updates. In this series, you have heard from our Chief Scientist Dr. Jan Pedersen, who summarized our efforts on whole page relevance, and Dr. Richard Qian, who covered the techniques we employ to reduce junk links. More recently, Dr. William Ramsey described how we reduce defects in links and related searches, and Dr. Kieran McDonald’s post examined answer quality using specialized classifiers.

These are just a few examples of the myriad algorithm changes we’ve developed to enhance Bing over the years. Together, these changes have driven real progress in Bing’s search quality for our customers. You may have noticed that we’ve released a fun, non-scientific tool for customers to see for themselves how far we’ve come. And while we know we still have lots of work to do, we think it’s long past time for our industry to have a conversation about search quality. That conversation is what we hope to start with the Bing It On Challenge, and to continue in the future to make sure we are delivering the quality experiences our customers deserve.

In this post I hope to give you a sense of Bing’s journey over the past couple of years as we’ve worked to rapidly improve our search quality against a tough competitor.

How did we get here?

There isn’t a single answer to that question; there was no one silver bullet. We realized early on that experimentation is the key to improving relevance. We don’t know which feature ideas and techniques work until we actually try them out, and many ideas end up failing. So in order to have enough good ideas that succeed, we needed to build an experimentation funnel that let us try out lots and lots of ideas in the first place. This was made more challenging by the fact that, as a relatively new entrant to the search space, we had less traffic to learn and experiment with.

So we invested in building a sophisticated experimentation system in the cloud that made it very easy for our engineers to try out new relevance techniques and train new machine learning models. This increased experimentation capacity allowed our engineers to submit hundreds of experiments in parallel. It was a key breakthrough in scaling out relevance experimentation and enabling a large number of people on the team to contribute effectively to improving relevance and quality.
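To make that loop a little more concrete: Bing’s actual experimentation pipeline and metrics aren’t public, but a minimal sketch of the kind of offline check such a system automates might look like the following, where a candidate ranker is compared against a baseline on judged queries using NDCG@10. The query runs and relevance grades here are purely hypothetical.

```python
# A minimal sketch (not Bing's actual pipeline) of an offline relevance check:
# score a baseline and a candidate ranker on the same judged queries and
# compare a standard relevance metric such as NDCG@10.
import math

def dcg(relevances, k=10):
    """Discounted cumulative gain over the top-k results."""
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k=10):
    """DCG normalized by the ideal (best possible) ordering of the same results."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical human-judged relevance grades (0-4) for the top results each
# ranker returned on the same two queries.
baseline_runs = {"q1": [3, 2, 0, 1], "q2": [2, 0, 0, 3]}
candidate_runs = {"q1": [3, 3, 2, 0], "q2": [3, 2, 0, 0]}

for query in baseline_runs:
    delta = ndcg(candidate_runs[query]) - ndcg(baseline_runs[query])
    print(f"{query}: candidate NDCG@10 - baseline NDCG@10 = {delta:+.3f}")
```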

As experimentation volumes increased, we also realized that anything we could do to speed up this inner loop of experimentation would have a huge multiplier effect on how many more experiments we could run, and how quickly we could learn and implement new features to improve Bing’s quality.

Relevance experimentation at Bing involves training machine-learned models on large amounts of training data using thousands of features. In the early years, our models were based on neural networks. But as the amount of training data, the number of features and the complexity of our models increased, the inner loop of experimentation slowed down significantly. At one point, it took us several days to finish just one experiment end to end. We knew we needed to do something.
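Neither Bing’s training data nor its feature set is public, but a rough illustration of this kind of setup, using synthetic query-document feature vectors, graded relevance labels and a small neural network from scikit-learn, might look like this:

```python
# A rough illustration (not Bing's models or data): pointwise learning to rank
# with a small neural network. Each row is a query-document pair described by
# a feature vector; the label is a human-judged relevance grade.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_pairs, n_features = 5000, 50                        # real systems use far more of both
X = rng.normal(size=(n_pairs, n_features))            # synthetic query-document features
y = rng.integers(0, 5, size=n_pairs).astype(float)    # synthetic relevance grades 0-4

model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=50, random_state=0)
model.fit(X, y)

# At query time, score each candidate document and sort by predicted relevance.
scores = model.predict(X[:10])
print(np.argsort(-scores))
```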

To overcome this challenge, we turned to our deep partnership with Microsoft Research (MSR) to develop a technology we call FastRank. FastRank is based on boosted decision trees, which are much faster to train and thus attractive for relevance experimentation. But there was skepticism about whether the quality of ranking produced by decision trees could match that of neural networks. Our colleagues at MSR took on this hard problem and developed new optimization algorithms that allowed us not only to match the quality of neural nets, but also to train more than an order of magnitude faster.
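FastRank itself is internal to Microsoft, but a minimal sketch of the general technique, boosted decision trees trained with a LambdaRank-style ranking objective, could be written with the open-source LightGBM library on the same kind of synthetic data as above. Everything here, from the data to the parameters, is illustrative rather than Bing’s actual configuration.

```python
# A sketch of the general technique (not FastRank itself): gradient-boosted
# decision trees trained with a LambdaRank-style objective, grouped by query.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n_queries, docs_per_query, n_features = 200, 25, 50
X = rng.normal(size=(n_queries * docs_per_query, n_features))  # synthetic features
y = rng.integers(0, 5, size=n_queries * docs_per_query)        # relevance grades 0-4
group = [docs_per_query] * n_queries                            # documents per query

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=200, num_leaves=31)
ranker.fit(X, y, group=group)

# Rank the candidate documents for one (synthetic) query by predicted score.
query_docs = X[:docs_per_query]
print(np.argsort(-ranker.predict(query_docs)))
```

Boosted trees train in stages over relatively shallow trees, which is one reason an iteration of the experimentation loop can finish much faster than training a large neural network on the same data.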

This allowed us to once again make our inner experimentation loop fast, and really accelerate our pace of learning and improving ranking. In the end, our relevance improvements have come from trying out tens of thousands of ideas over the years, and shipping hundreds of those that have been successful.

Conventional Wisdom

When we previewed our side-by-side test results with people outside the company, I was often asked how we were able to make these gains with presumably less data than the other guys. While there are too many variables to give a fully scientific explanation, I would say our long-term commitment to and investment in machine learning for relevance have enabled us to steadily scale out relevance experimentation and make rapid progress.

Of course, as we all know, relevance is subjective and queries are always changing. But we feel confident that it’s time for customers to come give us a look, and for a conversation on search quality to take place in our industry. This is why we’re inviting customers to try the Bing It On Challenge, so you have a fun way to compare results side by side and come to your own conclusions. I encourage you to try it now at www.bing.com.

As always, we want to hear from you so please join the conversation.

On behalf of the Bing team

– Dr. Harry Shum, Corporate Vice President, Bing R&D