Challenges of Testing Data - Is it Really 73°F in Seattle right now? – by Cyril Bouanna

Cyril Bounanna is a Senior Development Lead for the Bing Data Quality and Measurement Team. Today he presents to us a real-life scenario that software engineers and testers encounter when they are building Bing.

***

Software engineers have developed a lot of expertise and solutions to assess code quality. From static analysis to E2E automation, performance benchmarks, or search relevance, we know how to verify that our code does the right thing. But in Bing, we also realize that without good data, no code can produce good search results. As the saying goes, “garbage in, garbage out”. So how can we verify the quality of our data?

Consider, for example, weather information. Every 10 minutes, we receive the latest temperature reading from thousands of locations around the United States. Should we trust it and show those temperature readings on our website? If those temperatures are wrong, our users will be disappointed in our site, not in our data provider.

We could at least guard against malformed data, and avoid showing 999° or ^$°. But how about 0°, or even a decent looking 73°? How could we know if it is really 73° in Seattle right now? (Isn’t it always raining there?)
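
As a rough illustration of that first line of defense, here is a minimal sketch in Python. The function name `parse_temperature_f`, the accepted input format, and the plausible range bounds are all assumptions made for this example, not a description of how Bing actually does it.

```python
import re
from typing import Optional

# Assumed plausible range for a US surface temperature in °F.
# The exact bounds are illustrative and would need tuning.
MIN_F, MAX_F = -80.0, 135.0

# Accepts strings such as "73", "73°", "-5.5 °F" (an assumed feed format).
_NUMBER = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*°?\s*F?\s*$")


def parse_temperature_f(raw: str) -> Optional[float]:
    """Return the reading as a float, or None if it is malformed or absurd."""
    match = _NUMBER.match(raw)
    if match is None:
        return None  # not a number at all, e.g. "^$°"
    value = float(match.group(1))
    if not (MIN_F <= value <= MAX_F):
        return None  # well-formed but physically implausible, e.g. "999°"
    return value
```

With these assumptions, `parse_temperature_f("999°")` and `parse_temperature_f("^$°")` both return `None`, while `"0°"` and `"73°"` pass. That is exactly the problem: a format check alone cannot tell us whether 73° is true.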

Well, there’s really a lot that can be done with the right heuristics:

We can automatically compare data against historical averages. It might be 0° in Anchorage, but 0° in Miami in July is definitely bad data;
We can automatically compare data across providers, cross-checking multiple sources as any good investigator would;
We can also compare across time. 73° in Seattle might be reasonable, but not if it was 41° there 10 minutes ago, and 42° 10 minutes before that;
Or across space. 73° in Seattle when it’s 41° just a few miles south in Tacoma doesn’t look good either;
We can also apply “tolerance thresholds” to our heuristics: it might be acceptable for a few neighboring cities to have significantly different weather (no rule is always true), but if it’s happening everywhere throughout the US, it’s much more likely that we have a corrupted feed.
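
To make these heuristics a little more concrete, here is a small Python sketch. Everything in it is hypothetical: the function names, the thresholds (30°, 15°, 20°, 5%), and the historical means used in the examples are placeholders chosen for illustration, not values we use in production.

```python
from typing import List


def deviates_from_history(temp_f: float, historical_mean_f: float,
                          max_delta_f: float = 30.0) -> bool:
    """Flag a reading that is far from the historical average for that place and date."""
    return abs(temp_f - historical_mean_f) > max_delta_f


def jumps_in_time(temp_f: float, recent_f: List[float],
                  max_jump_f: float = 15.0) -> bool:
    """Flag a reading that jumps sharply from the latest earlier reading (oldest first)."""
    return bool(recent_f) and abs(temp_f - recent_f[-1]) > max_jump_f


def disagrees_with_peers(temp_f: float, peers_f: List[float],
                         max_gap_f: float = 20.0) -> bool:
    """Flag a reading that differs from every peer value.

    Peers can be nearby cities (across space) or the same city as
    reported by other providers (across providers).
    """
    if not peers_f:
        return False  # nothing to compare against
    return all(abs(temp_f - p) > max_gap_f for p in peers_f)


def feed_looks_corrupted(flags: List[bool], tolerance: float = 0.05) -> bool:
    """Tolerance threshold: a few flagged cities are fine, many suggest a bad feed."""
    if not flags:
        return False
    return sum(flags) / len(flags) > tolerance


# Examples from the text; 84.0 and 15.0 are made-up historical means.
print(deviates_from_history(0.0, historical_mean_f=84.0))   # Miami in July: True
print(deviates_from_history(0.0, historical_mean_f=15.0))   # Anchorage: False
print(jumps_in_time(73.0, recent_f=[42.0, 41.0]))           # Seattle: True
print(disagrees_with_peers(73.0, peers_f=[41.0]))           # Seattle vs. Tacoma: True
```

Each check returns a flag per reading rather than rejecting it outright, so that `feed_looks_corrupted` can apply the tolerance threshold over a whole batch: an isolated flag may just be unusual weather, while a large share of flagged readings points to the feed itself.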

And here’s another interesting aspect: if we really want to guard against bad data, data validation needs to be fully integrated into the data production process, making this ‘testing’ an integral part of product development.

Every day, we are building Machine Learning models that can figure out those rules on their own. Do these sound like interesting development challenges to you? If so, join our Bing Data Quality and Measurements Team and help protect your everyday search engine from bad data! Apply today: http://bit.ly/bingexpjobs