Building a State-of-the-Art Speller

Is it Swarzinegar, Swarneger, Scwarznagger or Schwartiznegar? These are just a few of more than 2,000 different ways users on Bing have typed their queries in hope of searching for “Schwarzenegger.” The aim of the Bing Speller is to correct these queries so users receive relevant web results that match their intent even when their query is misspelt. A great speller makes a search engine feel like magic to the users. In this blog my colleague Jim Kleban provides an overview of Bing Speller technology with some examples of recent improvements we just shipped in December.

– Dr. Harry Shum, Corporate Vice President, Bing R&D

Bing’s Speller processes tens of millions of data points mined from searches, web pages, clicks and user actions to help you find the right “Schwarzenegger.” In close collaboration with Microsoft Research we have developed advanced machine learning models to build a great speller. The team working on speller relevance runs thousands of experiments every week – ranging from improving data freshness to improving ranking fundamentals – to deliver a better search experience. The core to building a better speller is marrying speed with relevance. In order to achieve this, Bing’s Speller handles tens of thousands of queries per second and computes corrections within tens of milliseconds. These spell corrections can be so good that they sometimes feel like magic! Would you have guessed the query “terramazu” is so delicious?

Using Query Context to Improve Spell Corrections

Building a great speller is one of the most significant problems in web search. One of the most powerful clues we have to find the right spelling is the context of the query. This means we model how likely a particular spell correction is in the presence of the other terms in someone’s search by employing large scale language models and statistics to match how often phrases appear together. The average query contains around three words, so using this context is essential. For instance, someone typing “nickteen” may have intended “nick teen” or “nineteen”. However, the same term in the query “nickteen withdraw” from our query logs provides the necessary context for Bing to understand that the person is looking for information about nicotine withdrawal:

Sometimes, even when an individual word is spelled correctly, the Bing Speller corrects it to better suit the searchers intent. For example, someone recently searched the phrase: “how can you sea if money is reel,” when searching for information on how to detect the authenticity of money. In this case, the Bing Speller corrects the two words “sea” to “see” and “reel” to “real.” Though the two words were correctly spelled, they don’t make sense in the context of the search. The correction has a big impact on the quality of the web results:

With Speller
Without Speller

Freshness: A Vast and Changing Vocabulary

Another challenge for web spellers is dealing with the constantly evolving vocabulary on the internet. New words need to be recognized as they arise on the web from a variety of sources. These can be news articles, technical terms, people names, internet slang and foreign language terms. Since a dynamic dictionary doesn’t exist, to solve this problem, the Speller needs to index all of the words and phrases available, even if they are rare or obscure. For instance, a comprehensive view of web content enables us to correct a misspelling of the company Pala-Tech and a search for thyroid from “pallitech thyoid” to “pala tech thyroid.” Without access to the constantly changing ebbs of the Web, word processing spellers are not likely to be able to make this correction.

With Speller
Without Speller

The edit distance or difference of individual letters of two queries can be thought of as a measure of how similar they are from a lexicographic perspective. High edit distance spelling errors often occur when people try to “sound out” a world phonetically. These errors typically have several characters that are different between what was typed and what was actually intended. “Swarzinegar” to “Schwarzenegger” is a good example. You can see that many of the characters are missing between the misspelled and the corrected version. The Bing Speller employs statistical mapping schemes and phonetic similarity measures to identify and correct the misspellings.

It takes state-of-the-art machine learning, statistical modeling, information retrieval, and significant engineering muscle to deliver high quality web scale spell correction at high speeds. We hope this post gives you a glimpse into the various aspects that are involved in seamlessly spell correcting queries to maximize result relevance.

We thrive on these challenges and are always experimenting with the next set of algorithms to deliver breakthrough relevance improvements to you, our users.

– Dr. Jim Kleban, Program Manager, Bing R&D