In the past, we’ve provided an in-depth view of how Bing.com improves relevance for users in web and image search, along with ways we’ve improved experimentation. Leveraging technology similar to search, the team has embarked on the relatively new area we call “Bing Predicts.” This feature analyzes web activity, social sentiment, and other signals to predict the outcomes of events. My colleague, Walter Sun, the technical lead for the project, will share some insights on how Bing Predicts works. In particular, he will shed some light on how we currently make predictions in the categories of entertainment, politics, and sports – including the recently announced Bing predictions for March Madness.
Dr. Jan Pedersen, Chief Scientist, Bing and Information Platform R&D
In a popular game show, contestants are shown three doors and told that behind exactly one of them is a prize, often a car. Picking that door wins the car; picking either of the others yields a consolation prize or, worse yet, a goat. Suppose you selected Door #1. The host, who knows where the car is, would then reveal a goat behind Door #2 and give you the option to switch your selection to Door #3 or stay with Door #1. Probability tells us that the best strategy is to switch to Door #3, which wins two times out of three, whereas staying wins only one time in three. However, many people’s intuition is to stay with Door #1. The correct strategy for this game, known as the Monty Hall problem, has been a subject of debate among scholars.
The difficulty with this problem is that we often fail to take in every piece of information and judge the situation objectively. We may misjudge the intentions of the show’s host and make a decision based on emotion rather than logic. Consider a more extreme version of the problem with one million doors instead of three, still with only one prize. If you again picked Door #1 and the host opened every remaining door except Door #867,531, revealing goats behind all of them, you would effectively have been handed the location of the car. You started with a one-in-a-million chance of picking the right door, and the host has now offered you a far more viable alternative.
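The switch-versus-stay argument is easy to check empirically. Below is a quick Monte Carlo sketch (our own illustration, not from the original analysis) that simulates the game under both strategies; with three doors, switching wins about two-thirds of the time, and with a million doors it wins almost always.

```python
import random

def monty_hall_trial(switch, n_doors=3):
    """One round: the car sits behind a random door, the player picks
    Door 1 (index 0), and the host opens every other door except one,
    always revealing goats."""
    car = random.randrange(n_doors)
    pick = 0
    if switch:
        # The host leaves exactly one door closed besides the player's:
        # the car's door if the player missed, else a random goat door.
        pick = car if car != pick else random.randrange(1, n_doors)
    return pick == car

def win_rate(switch, trials=100_000, n_doors=3):
    wins = sum(monty_hall_trial(switch, n_doors) for _ in range(trials))
    return wins / trials
```

Running `win_rate(True)` comes out near 2/3 and `win_rate(False)` near 1/3; with `n_doors=1_000_000`, switching wins essentially every time.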
This illustration, while extreme, is meant to show you that informed contestants can benefit from a little objective analysis. At Bing, we’ve built our prediction engine to do exactly that — combine algorithmic and information assets to objectively provide informed users additional knowledge and insights tailored to help them make more confident decisions across a wide array of topics and categories.
In building our predictions, we’ve learned that intuition doesn’t always lead to the best decision, and that extracting the correct prediction comes down to finding the most trustworthy information. We must also remember that decision-makers don’t always make perfect decisions. For example, at this year’s NFL championship game, statisticians widely criticized New England for a wrong late-game decision: not calling a timeout with one minute to go (not the second-guessing you were expecting from the end of that game, eh?).
Today, we’re excited to give you a peek behind the curtain into the technology that we use to make Bing predictions. We’ll discuss how we leverage our search algorithm technology for predictions. In particular, we explain how we can learn which components in the sea of web activity and social sentiment can be used to predict the outcome of any given event.
Search engines have become an entry point for almost every user seeking information on the Web. Furthermore, social media platforms have enabled users to create, share, and exchange information from the web, as well as any user-generated content. Mining anonymous users’ web data along with the contents of social media gives us an opportunity to discover users’ sentiment around certain events or entities, estimate popularity trends, and predict future events. What’s interesting is that we can apply some of our web search machine-learned algorithms to the web and social data to continuously fine tune our predictions.
As our work has progressed, we have learned that web activity is less biased than polling, so more insight can be extracted from web trends than from traditional polling methods. Past work has shown that people, when polled or surveyed publicly, may respond with biases, intentional or otherwise, whereas aggregate web activity does not contain such biases. In addition, we have learned that even events the populace cannot affect (e.g. who wins an NFL game) can be inferred from web activity because of the ‘wisdom of the crowd’ phenomenon. Much like with the Monty Hall problem, our team observed that as you dig into the details, you gain additional insights into the events.
As Yogi Berra famously quipped, “It’s tough to make predictions, especially about the future.” Events come in various forms, and we can categorize them for our discussion into three primary algorithmic categories: (1) user voting contests or events, (2) judge-based contests, and (3) live competitions, the last of which is generally in the realm of sports.
User Voting Contests or Events
As previously discussed in a Microsoft blog, aggregate user activity on a search engine and across the web can be used to infer outcomes in seemingly unrelated areas. In that post, I illustrated how searches for school districts in the winter correlate well with potential snow or school-closure events. User activity in retail segments can reveal early fashion trends or interest in products, music, or movies. These facts are all believable; however, when we go further and have our machine-learned algorithms learn the correct set of features and their proper weights to accurately predict the outcome of a yet-to-be-determined event, it becomes an interesting topic for discussion and debate.
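To illustrate the kind of correlation described above, here is a toy example with entirely made-up numbers: daily search spikes for a school district lining up with closure days. The `pearson` helper and the data are our own illustration, not Bing’s actual signal.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical daily query counts for a school district's site,
# and whether school was actually closed (1) that day.
searches = [120, 130, 900, 140, 850, 125, 135]
closed   = [0,   0,   1,   0,   1,   0,   0]

r = pearson(searches, closed)   # close to 1: spikes track closures
```

With these invented numbers, `r` lands near 1, which is the sort of relationship a feature-learning pipeline can pick up automatically.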
Our first predictions started with reality TV shows less than a year ago. The premise of these shows is that each week’s performance should be the sole factor in determining who wins or loses – one of the hosts frequently admonishes viewers to “watch each week and vote for the person who you think does the best, or risk your favorite losing.” The first part of that statement is meant to encourage viewers to watch, but the latter part carries an implicit truth: over the course of a long season, viewers develop favorites, which shapes their aggregate web activity and allows us to predict how they will vote in the future.
As much as we like to be spontaneous, we all form habits. Over time, we develop favorites and become firm supporters of certain contestants as we invest in watching a show regularly and even participate in voting. For predictions around a given voting show, we examine activity on the web and behavior on social sites to learn which aspects are most related to voting patterns. We train our ranker (the technical term for how we weight variables in our forecasting model) on past seasons to determine if there is a trend. These trends generally propagate from year to year, since the fan biases from one season to the next have considerable overlap and new viewers often exhibit similar behavior.

One thing we learned is that early in the elimination rounds of a show, the weekly performances have a meaningful impact on who is safe and who is eliminated (for those unfamiliar, a show starts with around 12 finalists and each week the person who gets the fewest votes is eliminated). Later in the show, performances become less important as viewers have locked onto their favorites. Over time, we aggregate users who watch the show, examine their behavior through search and social sentiment, and infer whom they vote for. However, it’s important to note that a simple search for “Caleb Johnson” does not directly translate to a user voting for him. Our models need to account for biases and outliers in the data, such as an artist or contestant being in the news for an event unrelated to the show. This is just one of many factors our team considers that can impact predictions.
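To make the “ranker” idea concrete, here is a heavily simplified sketch with invented feature values: a tiny logistic model trained on past weeks to weight search share, social sentiment, and weekly performance, then used to rank this week’s contestants by how safe they look. Bing’s production ranker is far more sophisticated; every number and contestant below is hypothetical.

```python
import math

# Hypothetical per-contestant features for past weeks:
# (search_volume_share, positive_sentiment_share, performance_score).
# Label: 1 if the contestant survived that week, 0 if eliminated.
past_weeks = [
    ((0.30, 0.35, 0.8), 1),
    ((0.05, 0.04, 0.6), 0),
    ((0.25, 0.30, 0.5), 1),
    ((0.08, 0.06, 0.9), 0),
    ((0.22, 0.15, 0.7), 1),
    ((0.10, 0.10, 0.4), 0),
]

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.5, epochs=2000):
    """Learn feature weights by stochastic gradient descent on log-loss."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            g = sigmoid(score(w, x)) - y          # gradient of log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

w = train(past_weeks)

# Rank this week's (made-up) contestants: highest score = safest.
this_week = {"A": (0.28, 0.33, 0.6), "B": (0.06, 0.05, 0.9)}
ranking = sorted(this_week, key=lambda c: score(w, this_week[c]), reverse=True)
```

Note how the toy data pushes most of the weight onto the web/social features rather than raw performance, mirroring the observation above that performances matter less once viewers lock onto favorites.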
At Bing, we build a different model for each show, since each one has its own unique characteristics, but the basic methodology is similar. Since last year, we’ve made predictions for each weekly episode of American Idol, The Voice, and Dancing with the Stars with an accuracy in excess of 85%. You may ask, “How good is 85%? Why not 100%?” The accuracy we’re speaking of is measured over entire seasons where, as introduced earlier, shows can start with something like 12 contestants (or teams). A random algorithm would have a weekly accuracy of 1/12 (1 chance in 12 of guessing who will be eliminated), then 1/11, …, 1/3, 1/2 through a full season. Averaging this partial harmonic series yields an accuracy of 19.1%, which serves as the baseline (any worse and you might as well throw darts!). Of course, there is some public information, as some contestants are clearly stronger than others. Even if we assume an informed viewer can accurately group the contestants into ‘best’, ‘middle’, and ‘worst’ thirds – so that the weekly accuracy runs 1/4, 1/3, 1/2, then 100% within each group, repeated three times over the season – that still averages only 52.1%. As to the second question: we still aspire to 100% accuracy, because we believe that most of the time the aggregate data contains enough information to make a confident prediction. As an example, in last year’s American Idol, our models had Caleb Johnson as the winner in late March, when there were still 9 contestants; to demonstrate the ‘magical’ aspect of this, the general public – using Vegas odds as a proxy – had Caleb as an underdog most of the season, including at the finale with two contestants remaining.
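The two baseline figures above can be reproduced with a few lines of arithmetic:

```python
from fractions import Fraction

# Random baseline: with 12 contestants, each week you guess the
# eliminated contestant uniformly at random: 1/12, 1/11, ..., 1/2.
weeks = [Fraction(1, k) for k in range(12, 1, -1)]
random_baseline = sum(weeks) / len(weeks)          # ~0.191

# "Informed viewer" baseline from the post: split contestants into
# best / middle / worst groups of four, so each block of weeks goes
# 1/4, 1/3, 1/2, 1, repeated three times over a 12-slot season.
block = [Fraction(1, 4), Fraction(1, 3), Fraction(1, 2), Fraction(1, 1)]
informed_baseline = sum(block * 3) / 12            # ~0.521
```

Both match the 19.1% and 52.1% quoted in the text.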
Another ‘fan vote’ category is elections, and last year we had the chance to expand Bing Predicts internationally as we tackled the historic Scottish independence referendum. The key to this event was determining which online sentiment came from eligible voters, as there was a known tendency for residents of the United Kingdom outside Scotland to prefer that Scotland remain part of the union. This was a measurable bias that had to be accounted for and adjusted when we looked at the web/social signal. One interesting learning from this event was that people’s opinions in post-debate polls were more dramatic than how they actually voted, and the web/social activity accurately captured this more stable state. In particular, our analyses showed a “No” vote ranging from 51% to 59% throughout the 3 months before the vote (the final tally was the midpoint, 55%). Polls showed ranges as high as 70% “No,” and some even showed a “Yes” outcome. In short, our data helped create a more accurate prediction.
Judge-Based Contests
If all events were popularity-based, we could survive with the above model – but we’d also live in a world with ice cream and cake at every school lunch, flip-flops at work, and remote working from beaches (hmm, none of that sounds all that bad, but I’m not in charge :)). The reality, though, is that many contests or events rely on a select group of experts to make a decision. In these scenarios, the best practice is to find patterns of behavior that help infer how this small group of people will vote or decide. For Bing Predicts, we again use web activity and social sentiment to extract this pattern. For award shows like the Oscars, we found that the pattern is a combination of the general public’s interest in a nominee or movie, measured through web/social activity, and the direction of the voters for recent similar award shows (e.g. the BAFTAs, Golden Globes, etc.). In other words, the general public has good intuition, but the differences between us and the experts can be captured and learned from how the experts for other award shows have voted in the past.
The NBA Draft posed an even more unique challenge. Beyond the fact that a small set of experts – general managers, owners, and coaches – make the decision about whom to select, those experts also have an incentive to throw competitors ‘off the scent’: if you possessed the 3rd pick in the draft and desperately wanted Joel Embiid, you would not want to tip your hand and let the two teams ahead of you leverage that information. We first built player models for all of the draft-eligible amateur candidates. For the sports-analytics crowd, this model is akin to a projected “Wins Above Replacement Player” (WARP): we use amateur statistics to project professional statistics, which in turn can be transformed into a single value for how many additional wins a given player provides. At the zeroth order, teams want to maximize their number of wins, so this statistic, coupled with team needs, suggests the ideal selection order. Team needs can be inferred by looking at existing rosters, but it was also interesting to find that examining the ‘wisdom of the crowd’ – what local beat writers and fans in general on social sources were saying about team needs and best fits – accurately captured this second aspect. From this, we correctly predicted 9 of the top 11 selections to within one position, on par with the professional expert sites, despite using a pure algorithm without any individual expert input.
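At the zeroth order described above, a mock draft reduces to a greedy assignment: each team takes the best available player as measured by projected WARP plus a positional-need bonus. The sketch below uses invented players, projections, and need scores purely for illustration; the real system infers needs from rosters and crowd chatter.

```python
# Hypothetical projected wins-above-replacement and positions.
prospects = {
    "Center A":  {"warp": 6.0, "pos": "C"},
    "Guard B":   {"warp": 5.5, "pos": "G"},
    "Forward C": {"warp": 4.8, "pos": "F"},
    "Guard D":   {"warp": 4.0, "pos": "G"},
}

# Each team's positional-need bonus (made-up numbers).
teams = [
    ("Team 1", {"C": 1.5, "G": 0.5, "F": 1.0}),
    ("Team 2", {"C": 0.5, "G": 2.0, "F": 0.5}),
    ("Team 3", {"C": 1.0, "G": 0.5, "F": 1.8}),
]

def mock_draft(teams, prospects):
    """Greedy zeroth-order model: each team, in draft order, takes the
    available player maximizing projected WARP + positional-need bonus."""
    board = dict(prospects)
    picks = []
    for team, need in teams:
        best = max(board, key=lambda p: board[p]["warp"] + need[board[p]["pos"]])
        picks.append((team, best))
        del board[best]
    return picks

picks = mock_draft(teams, prospects)
```

Here Team 2 passes on the higher-WARP center because its need bonus at guard outweighs the raw projection gap, which is exactly the “team needs” effect described above.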
Live (Sports) Competitions
Live sports prediction is the category that draws the most discussion and skepticism. Sports garner our interest because they are the ultimate reality show, where the outcomes are determined solely by a handful of highly skilled athletes while millions of viewers enact various superstitions in an attempt to alter the results. Certainly, we cannot rely on popularity alone to determine the winners of these contests (no matter how vocal Knicks fans are, they’re not winning this year’s NBA championship). Instead, we start with a statistical model of team strengths. We look at historical statistics to see which factors contribute to strong teams and which match-ups are favorable or unfavorable for a given team. From this, we build a prior model, akin to a subject-matter expert publishing a ‘power index,’ ‘team ranking,’ or RPI rating for teams in a given sport. We then add web activity and social sentiment to tune the strengths, capturing real-time information such as injuries, suspensions, controversies, and line-up changes. Our analyses found that adding this ‘wisdom of the crowd’ yielded an additional 5% of accuracy in NFL games, a non-trivial bump in correctness in a very competitive league. In sports, unlike user-voting events where we seek to get as close to perfection as possible, there is enough uncertainty in individual player performances (e.g. consider the likelihood that a Florida State freshman whose season high was 35 points could score 30 points in the final four and a half minutes of their 2/25 game) that our metric for success is beating experts and prediction markets by leveraging the information available on the web.
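As a rough sketch of the “prior model plus crowd tuning” idea, here is an Elo-style win probability with a real-time adjustment term. The ratings, the shift value, and the logistic form are all illustrative assumptions on our part, not Bing’s actual model.

```python
def win_probability(rating_a, rating_b, crowd_shift_a=0.0, scale=400.0):
    """Logistic win probability for team A from an Elo-style rating gap,
    nudged by a 'wisdom of the crowd' shift capturing real-time signals
    (injuries, suspensions, line-up changes, social buzz)."""
    gap = (rating_a + crowd_shift_a) - rating_b
    return 1.0 / (1.0 + 10 ** (-gap / scale))

p_prior = win_probability(1650, 1600)                     # statistics only
p_tuned = win_probability(1650, 1600, crowd_shift_a=-80)  # e.g. star player out
```

The prior favorite (about 57% here) can flip to an underdog once the crowd signal is folded in, which is the kind of adjustment the 5% accuracy bump came from.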
Our newest area of investment is March Madness, so called because of the high level of uncertainty in its outcomes. The 64-team bracket alone allows 9.2 quintillion possible outcomes (that’s a 9 followed by 18 more digits). Our March Madness model is similar to the ones we built for the NFL and other sports: a prior model trained on the variables that make up a live two-team match-up. What makes this really exciting is how much new data we’ve been able to feed the model. Through a partnership with the NCAA, we gained access to more than 10 seasons of historical outcomes, including entire data sets previously unavailable to the public: offensive and defensive statistics, conference success in previous tournaments, the proximity of tournament sites to each team’s home campus, the style of each team, their individual strengths and weaknesses, and many other factors that might lead them to favor certain match-ups over others. After ingesting these initial data sets, we applied our analysis of web and social sentiment to tune our predictions, producing projected outcomes for each of the 67 games of the tournament, including both the predicted winner and its probability of winning. We then present to you the one bracket we think is the most likely to transpire.
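The bracket count is easy to verify: a 64-team single-elimination bracket has 63 games, and each game has two possible winners.

```python
# 64 teams -> 63 games, each with 2 possible winners.
brackets = 2 ** 63        # 9,223,372,036,854,775,808 distinct brackets
digits = len(str(brackets))

# The full tournament adds four play-in ("First Four") games: 67 in all.
games = 63 + 4
```

`brackets` is roughly 9.2 quintillion, a 19-digit number, matching the figure quoted above.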
As noted earlier, sports are fun because of the uncertainty, especially in a 68-team single-elimination tournament. If every match-up were a best-of-7, statistically speaking we’d have far fewer upsets. The tournament would also then have to be called the NBA Playoffs :). We might not catch an upset from a hot three-point-shooting 14-seed, but our goal is to compare favorably against other experts in the field of March Madness predictions. We’ve put in the hours and the analysis over the past 10 seasons to fine-tune our engine. We’re excited to see how we do, and hope you are too. May your bracket not get busted, and have a blast during March Madness!
Dr. Walter Sun – Principal Applied Science Manager, Bing Predicts Team Lead