Deep Learning for Image Understanding in Bing

A few months ago we provided a behind-the-scenes look at how Bing is improving image search quality. In this post we wanted to take the opportunity to further that discussion by highlighting some recent outreach with researchers to explore new approaches to improving image search quality. My colleague Eason Wang will give you a closer look at how we are taking advantage of Deep Learning and entity understanding to deliver more relevant, useful, and beautiful image results in Bing.

– Dr. Harry Shum, Corporate Vice President, Bing R&D

Our colleague Meenaz Merchant gave an overview of Bing image search a few months ago. Today we will talk about the technology behind our approach.

Rooted in the fundamental techniques of web search, traditional image search has relied heavily on textual signals such as surrounding text, titles, and URLs to deliver relevant results. However, in the age of social media and self-expression this approach has become less effective: a large percentage of images are shared with little or no accompanying text (as shown in the example below). This presents new challenges, propelling us to examine the properties of the image itself to provide more relevant and useful results. The following post highlights some of the approaches we’re using at Bing to advance the state of the art in image search, from general image understanding to deep learning.

For a traditional web image, the keyword “sunrise” appears six times around the image.


An example of a sunrise from a social network: there is little or no text associated with the image. The only text available is location information: “At National Olympic Park, Port Angeles, WA.”


Traditional Image Understanding

Let’s look at the way we apply image understanding to search. For a query like “Katy Perry” or “Eiffel Tower”, we leverage image content features to make our results more relevant and more beautiful. For example, we know the dominant colors of the results; we know which images contain line drawings; and we know which images are duplicates of each other, even after edits and transformations. These, along with other image content signals, are used in our image search ranker to provide better results. Further, Bing lets users filter images based on these features, as you can see in the examples below.
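To give a rough sense of what a hand-crafted signal like dominant color involves, here is a minimal sketch of our own (not Bing’s actual pipeline): it quantizes each RGB channel into coarse bins so that visually similar colors collapse together, then picks the most frequent bin.

```python
import numpy as np

def dominant_color(pixels: np.ndarray, levels: int = 4) -> tuple:
    """Return the most frequent quantized RGB color in an (N, 3) pixel array.

    Each channel is quantized into `levels` buckets, so visually similar
    colors fall into the same bin.
    """
    step = 256 // levels
    quantized = (pixels // step) * step + step // 2  # map pixels to bin centers
    colors, counts = np.unique(quantized, axis=0, return_counts=True)
    return tuple(colors[np.argmax(counts)])

# A synthetic "image": 90 blue-ish pixels plus 10 red-ish pixels.
blue = np.tile([10, 20, 200], (90, 1))
red = np.tile([200, 10, 10], (10, 1))
image = np.vstack([blue, red]).astype(np.int64)

print(dominant_color(image))  # most pixels land in the blue bin
```

A production system would, of course, weight pixels by saliency and work in a perceptual color space rather than raw RGB; this sketch only shows the shape of the idea.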

{Eiffel tower}


{Eiffel tower} with line drawing filter on


{Eiffel tower} with blue color filter on


{Eiffel tower} duplicate images


{Katy Perry}


{Katy Perry} with blue color filter on


{Katy Perry} with line drawing filter on


Duplicate detection


Deep Learning: The Next Frontier

In machine learning for images, the features and representations of the images largely determine the quality of learning. Traditional approaches rely on hand-crafted features such as colors or lines, as illustrated above. Deep learning is different in that it learns features and representations directly from image pixels. In Bing, GPU-powered machines and data centers make it possible to train deep neural networks to recognize a large number of concepts in images.
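To make “learning features from pixels” concrete, here is a toy sketch of the feature-extraction stage of a convolutional network: convolution, a ReLU nonlinearity, and global average pooling turn raw pixels into a compact descriptor. The filters below are random stand-ins for learned weights, and nothing here reflects Bing’s actual models.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid 2-D convolution of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def deep_features(image: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """One conv + ReLU + global-average-pool stage: pixels in, vector out."""
    return np.array([np.maximum(conv2d(image, f), 0).mean() for f in filters])

rng = np.random.default_rng(42)
image = rng.random((16, 16))                 # a toy grayscale image
filters = rng.standard_normal((8, 3, 3))     # stand-ins for learned weights

vec = deep_features(image, filters)
print(vec.shape)  # an 8-dimensional descriptor of the whole image
```

In a real network these stages are stacked many layers deep and the filters are trained end to end, so the final descriptor captures semantics rather than simple edge responses.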

Deep learning has been used at Microsoft in image processing, speech recognition, and machine translation for over 15 years. It regained popularity in recent years thanks to algorithmic advances and the growing computational power of GPUs. To demonstrate the potential of this approach, Microsoft Chief Research Officer Rick Rashid showcased a speech recognition and machine translation breakthrough that converted his spoken English words into computer-generated Chinese speech. These improvements come, in part, from Microsoft Research’s work on deep neural networks (DNNs).

Below are a few examples of similar-image results based on deep learning features, compared with results based on traditional features.

Similar images of a cat based on deep learning features are all cats, except for two dogs with a similar appearance, as shown in the image below.


Similar images of a cat based on traditional features are much less consistent: there are dogs, cats, a baby, and a face, as shown in the image below.


Similar images of dishes based on deep learning features are food of various kinds, as shown in the image below.


Similar images of dishes based on traditional features are not food at all, but landscapes and even a python.


Similar images of a waterscape based on deep learning features are all about water, as shown in the image below.


Similar images of a waterscape based on traditional features include a human portrait, indoor images, and landscapes.


Two images can be connected if the distance between their respective deep-learned features is small enough. Extending this concept to all the images on the web, trillions of connected images form a gigantic graph in which each image is connected via semantic links to other images. As illustrated in the graphs below, using deep learning features, the image of a motorcycle is connected to other images of motorcycles with different colors and shapes. Using traditional features such as colors and edges, the same image of a motorcycle is connected to images of different entities, such as bicycles or even waterfalls and landscapes. Deep learning, in contrast, keeps the semantics in the image neighborhood even when the visual patterns are not very similar.

Similarity Graph Based on Deep Learning


Similarity Graph Based on Traditional Features
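The connection rule described above, linking two images whenever their feature distance falls below a threshold, can be sketched in a few lines. The cosine distance and the 0.3 threshold are illustrative choices of ours, not values from Bing’s system:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus the cosine similarity of two feature vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def build_similarity_graph(features: list, threshold: float = 0.3) -> list:
    """Connect images i and j when their feature distance is below threshold."""
    edges = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if cosine_distance(features[i], features[j]) < threshold:
                edges.append((i, j))
    return edges

# Toy feature vectors: images 0 and 1 are nearly parallel (semantically
# similar), while image 2 points in an orthogonal direction.
features = [np.array([1.0, 0.1, 0.0]),
            np.array([0.9, 0.2, 0.0]),
            np.array([0.0, 0.0, 1.0])]

print(build_similarity_graph(features))  # only images 0 and 1 are linked
```

At web scale the all-pairs loop is replaced by approximate nearest-neighbor search, since comparing trillions of images pairwise is infeasible; the thresholding idea stays the same.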


Looking Ahead

In only a few years, Bing image search has come a long way, but we are still far from a true semantic understanding of any image taken at any time, in any place. These are the problems that keep us awake at night. We’ve taken the first steps of a long journey, and we will continue our endeavor to deeply understand images across the web in order to bring the best and most relevant images to users when they need them most.

Dr. Eason Wang, Program Manager, Bing Multimedia Search