Safer Web Exploration with Bing

The internet’s massive reach and ever-growing accessibility make it an attractive place for cyber-criminals, who use it to distribute malware to unsuspecting users. To deliver malicious software to users at large scale, savvy hackers increasingly try to game search engines to amplify the effect of their exploits by targeting frequently visited sites. At Bing our job is to not only deliver relevant results, but also provide a safe searching experience. In this blog post my colleague Igor Rondel on our Index team provides an overview of the malware detection and protection technology that we’ve been developing and are constantly improving in order to maximize the safety of your searches.

Dr. Harry Shum, Corporate Vice President, Bing R&D

Overview

Over the past decade both malware and malware detection technologies have evolved by leaps and bounds, pushing each other to innovate and find new ways to circumvent the other. The more advanced malware distributors attempt to evade detection by serving search engines different content than they serve to people visiting the site. But in order to do this, they need to know the request is coming from a search engine, so search engines now employ technologies that mask their identity from the bad guys in a variety of ways. In parallel, advances in large-scale search engine infrastructure as well as traffic pattern understanding have allowed search engines to develop smarter ways of identifying which sites to scan, when and how frequently.

Scanning technology has also evolved over the years. The current state-of-the-art malware scanning relies on a vast array of techniques utilized in parallel to maximize both precision and recall of detection. It involves – but isn’t limited to – scanning the content of the page (both static and dynamic); script (e.g. JS) de-obfuscation; tracing outgoing calls made from the page to known exploit servers; providing a honeypot (unprotected) environment where the document is rendered and allowing the exploit to manifest itself. Another powerful technique in search engines’ toolkit is drawing an indirect malware association for a site based on other sites under the same owner.

At Bing, we work hard not only to stay on top of these cutting-edge techniques, but also to move this technology forward. We work closely with CERTs (computer emergency response teams), webmasters and the entire malware technology community, and maintain close coordination with other Microsoft products, like Forefront and IE SmartScreen, exchanging malware signals and findings to better protect customers.

Background

It goes without saying that the internet is an amazing resource that provides limitless value to people around the world. Unfortunately it is also a place that attracts criminal behavior that can harm unsuspecting people navigating the web. One such category is malware. Malware, short for ‘malicious software’, comes in many forms: computer viruses, worms, Trojan horses, and so on. While there are a number of ways malicious software can be distributed, perhaps the most prevalent (in the context of search engines) is the “drive-by-download attack”, a method by which an attacker can install malware on a victim’s machine simply by that user browsing to an infected web page.

Hackers who develop and distribute malware do so for a number of reasons, but the biggest incentive is, of course, money. By distributing certain types of malware onto a user’s machine, such as key logging software, a hacker could get access to everything a user types including personal and financial information such as bank accounts, passwords and social security numbers. Alternatively it could turn a user’s machine into a zombie member of a “bot-net” to facilitate illegal activities such as distributed denial-of-service attacks or to expand malware circulation.

Cyber-criminals are clever and constantly explore new malware techniques and distribution methods. In particular, they aim to evade detection and increase the scale of distribution. Search engines, whose job it is to connect people with web pages, are a powerful mechanism for bad guys to distribute their malware on a large scale. To maximize impact, malware perpetrators try to increase the likelihood of users visiting infected sites by employing spamming techniques or simply infecting innocent sites which already enjoy high traffic. As more established businesses typically have more advanced security practices, these criminals often go after the softer, more vulnerable targets such as small businesses, personal blogs and adult pages.

To combat this and to maintain user trust, Bing employs various methods of detecting malware. Since malware development is on the cutting edge, so too is malware detection, as it needs to constantly evolve to remain effective. People have come to expect search engines to not only deliver the best results, but also protect them from harm like malware (as well as other types like inappropriate adult content, but that’s a subject of a future post!). This is a perfectly reasonable expectation, one that Bing takes incredibly seriously and has made a significant investment in over the years.

Bing’s Malware Protection – Detection

The key element of protecting users from malicious results is the ability to accurately detect malware. Aside from the standard challenges of achieving high-precision, high-recall detection, search engines also need to contend with the massive volume of data. Since the web contains trillions of documents and each one can be infected (or cleaned up!) at any time, we have to be smart about prioritizing which pages get scanned and at what frequency. For example, detecting malware on a high profile site like www.ebay.com, which gets millions of visits monthly, would prevent many more infections than detecting malware on a site no one visits.
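To make the prioritization idea concrete, here is a minimal sketch of a traffic-weighted scan queue. It is not Bing's actual scheduler: the scoring formula, the function names and the example URLs are all assumptions chosen for illustration. The only idea taken from the text above is that pages with more visitors should be scanned ahead of pages nobody visits.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScanCandidate:
    # heapq is a min-heap, so we store the negated score to pop the
    # highest-priority page first.
    neg_score: float
    url: str = field(compare=False)


def scan_score(monthly_visits: int, days_since_last_scan: float) -> float:
    """Hypothetical priority: more traffic and staler scans rank higher."""
    return monthly_visits * (1.0 + days_since_last_scan)


def build_scan_queue(pages):
    """pages: iterable of (url, monthly_visits, days_since_last_scan)."""
    heap = [ScanCandidate(-scan_score(v, d), url) for url, v, d in pages]
    heapq.heapify(heap)
    return heap


def next_to_scan(heap) -> str:
    """Pop and return the URL with the highest scan priority."""
    return heapq.heappop(heap).url


# Made-up example data; none of these figures come from the post.
pages = [
    ("http://popular.example/", 5_000_000, 1.0),
    ("http://tiny.example/", 12, 30.0),
    ("http://medium.example/", 40_000, 7.0),
]
queue = build_scan_queue(pages)
order = [next_to_scan(queue) for _ in range(len(pages))]
print(order)
```

With these made-up numbers the popular site is scanned first even though the tiny site has gone much longer without a scan, which mirrors the trade-off described above.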

The next step after identifying which pages to scan is to crawl them to get their latest content. Over the years, hackers caught on to the fact that search engines will try to detect malware to prevent its distribution and have developed ways to evade detection. One typical technique, known as “cloaking”, involves showing different content to the crawler than they would to a user, thus evading detection yet still triggering the attack when a real user visits the site. So naturally search engines work on ways to bypass these evasion techniques. One such approach is to mask the crawler’s identity to trick the malicious software into thinking the request is coming from a real user, thus reducing the effectiveness of hackers’ evasion techniques. And so the cycle continues: as one develops new techniques of evasion, the other finds ways around it.
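One simple way to picture a cloaking check is to fetch the same page twice, once identifying as a crawler and once looking like an ordinary browser, and then compare what each response tries to load. The sketch below assumes the two HTML responses have already been fetched and only does the comparison. The regular expression, the function names and the example markup are illustrative assumptions, not a description of how Bing detects cloaking.

```python
import re

# Rough pattern for the src attribute of <script> and <iframe> tags.
# A real scanner would use a proper HTML parser and a rendered DOM.
EXTERNAL_SRC = re.compile(
    r'<(?:script|iframe)[^>]*\ssrc=["\']([^"\']+)["\']', re.IGNORECASE
)


def external_sources(html: str) -> set:
    """Return the set of script/iframe sources referenced by the page."""
    return set(EXTERNAL_SRC.findall(html))


def looks_cloaked(html_as_crawler: str, html_as_user: str) -> bool:
    """Flag a page that loads resources for a user-like request which the
    crawler-identified request never saw."""
    hidden = external_sources(html_as_user) - external_sources(html_as_crawler)
    return bool(hidden)


# Hypothetical responses for the same URL under two identities.
crawler_view = "<html><body><p>Welcome</p></body></html>"
user_view = (
    "<html><body><p>Welcome</p>"
    '<iframe src="http://exploit.example/kit"></iframe>'
    "</body></html>"
)
print(looks_cloaked(crawler_view, user_view))
```

A difference between the two views is only a signal, not proof: legitimate sites also vary content by client, so a check like this would feed into further scanning rather than decide the outcome on its own.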

Once we have the latest page content, both static and dynamic, that’s when the core detection technology kicks in. We won’t go into too much detail on specifics of malware detection (sorry bad guys!), but let’s cover some of the types of detection Bing performs:

Content based detection: Microsoft has a long history of fighting malicious use of its software and has developed a strong culture of security across all its products. This considerable expertise, combining product team security engineers with Microsoft Research (MSR), has led to the development of Forefront, Microsoft’s antivirus (AV) solution. Bing leverages this technology, benefiting from years of research, while at the same time helping improve it by providing a data corpus and the infrastructure to run it at huge scale. In addition, Bing implements several supporting signature detection engines, as well as state-of-the-art JavaScript classifiers and heuristics, co-developed with MSR, to detect signals indicating an attack.
High-interaction client honeypot: This technique involves browsing to and rendering the document in a secured online environment that’s intentionally made vulnerable, to deliberately solicit drive-by-download attacks.
- Step 1: Use older, more vulnerable browser versions; remove security patches and any other hindrance for the exploit to begin its work
- Step 2: Hook into the critical machine resources, registry, core system files, and monitor and log all interactions
- Step 3: Sit back and see what bites: if a page spawns an unsolicited process, drops files in restricted locations or tries to access these resources, you just caught yourself some malware.
Malware network awareness: A page infected with malware almost always relies on an external resource for the actual exploit delivery. In other words, a page (whose author is often an innocent victim themself) contains a call out to an exploit server (managed by the bad guys) that delivers the exploit to the user’s machine. Bing monitors all outgoing calls the page makes during its scanning and matches these chains against the known exploit servers. This is another area where sharing of information between various products across Microsoft, as well as various CERT groups across the world, leads to a more comprehensive and up-to-date inventory, as well as an overall safer web.
Extrapolation: Another technique Bing employs is to make an educated guess on whether a page is infected by analyzing other pages under the same host/domain. The theory behind this is that if a given host had some pages infected, it likely means there is a vulnerability a hacker found to exploit, which means that other pages in the host/domain are infected as well or are at high risk of getting infected in the future.
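The last two techniques lend themselves to a small sketch. The code below checks a page's outgoing calls against a list of known exploit hosts, and estimates per-host risk from the share of already-scanned pages found infected. The host list, the 20% threshold and all function names are hypothetical, and a real system would look at whole request chains rather than single hostnames.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical inventory of exploit servers; the real one is built from
# signals shared across products and CERT groups.
KNOWN_EXPLOIT_HOSTS = {"exploit.example", "dropper.example"}


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def calls_known_exploit_server(outgoing_urls, known_hosts=KNOWN_EXPLOIT_HOSTS):
    """True if any outgoing call made by the page targets a known exploit host."""
    return any(host_of(u) in known_hosts for u in outgoing_urls)


def host_infection_rates(scan_results):
    """scan_results: iterable of (page_url, infected). Returns host -> fraction."""
    totals = defaultdict(int)
    infected = defaultdict(int)
    for page_url, is_infected in scan_results:
        host = host_of(page_url)
        totals[host] += 1
        if is_infected:
            infected[host] += 1
    return {host: infected[host] / totals[host] for host in totals}


def at_risk_by_extrapolation(page_url, rates, threshold=0.2):
    """Guess that an unscanned page is risky if enough pages on its host were infected.
    The threshold is an arbitrary value chosen for this example."""
    return rates.get(host_of(page_url), 0.0) >= threshold


# Made-up scan history.
scan_results = [
    ("http://blog.example/a", True),
    ("http://blog.example/b", False),
    ("http://blog.example/c", False),
    ("http://shop.example/x", False),
]
rates = host_infection_rates(scan_results)
print(calls_known_exploit_server(["http://cdn.example/app.js",
                                  "http://exploit.example/kit.js"]))
print(at_risk_by_extrapolation("http://blog.example/new", rates))
```

Extrapolation trades precision for coverage: it can flag pages that have never been scanned, so a threshold like the one above would need tuning against real data.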

These, along with other detection mechanisms, result in malware being detected on about 0.04% of search results returned to users (the number fluctuates week over week).

Bing’s Malware Protection – Post-Detection

Once we determine that a page is infected, it goes into quarantine mode. That means the page will remain marked as infected until several consecutive scans come back clean. Infected pages are rescanned at certain intervals as we try to balance scanning bandwidth and the need to ‘clear the page’s name’ as soon as malware has been removed. Note that the re-infection rate of websites is very high: after discovering that their site is compromised with malware, many webmasters will clean it up but neglect to fix the security hole that allowed them to be compromised in the first place. In fact, approximately 40% of infected sites will be re-infected within a day of being cleaned up! However, the likelihood of re-infection goes down significantly the longer the site remains clean: after two weeks the likelihood of re-infection drops to 5%, and after 30 days it’s less than 1%, largely due to the webmaster taking the necessary corrective steps to remove the underlying vulnerability, likely forcing malware distributors to move on to easier targets.
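The quarantine rule can be sketched as a small state machine. The post only says “several consecutive scans”, so the value of three below is an assumption, as are the class and attribute names.

```python
from dataclasses import dataclass

# Assumption: the post says "several" consecutive clean scans; 3 is a
# placeholder, not Bing's actual setting.
REQUIRED_CLEAN_SCANS = 3


@dataclass
class PageStatus:
    infected: bool = False
    clean_streak: int = 0

    def record_scan(self, malware_found: bool) -> None:
        """Update quarantine state with the result of one scan."""
        if malware_found:
            # Any positive scan (re)starts quarantine and resets the streak.
            self.infected = True
            self.clean_streak = 0
        elif self.infected:
            # Clean scans only count while the page is in quarantine.
            self.clean_streak += 1
            if self.clean_streak >= REQUIRED_CLEAN_SCANS:
                self.infected = False
                self.clean_streak = 0


status = PageStatus()
history = []
# A page is infected, looks clean twice, gets re-infected, then stays clean.
for malware_found in [True, False, False, True, False, False, False]:
    status.record_scan(malware_found)
    history.append(status.infected)
print(history)
```

Requiring a streak rather than a single clean scan reflects the high re-infection rate described above: a page that is cleaned and quickly re-infected never leaves quarantine in between.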

So what does Bing do with urls where malware was found? The answer to this is guided by our primary goal and further informed by a number of considerations:

Goal: Minimize user rate of infection

Considerations:

Infected pages are often hacked and their authors are innocent victims

Infected pages are often great pages that enjoy significant traffic, which is why they were targeted by the hacker in the first place

Infected pages, especially popular sites, are typically cleaned up quickly

We considered a number of options, including kicking the urls out of the index altogether, de-ranking them off the first results page, and decorating the result with a clear warning. Each of these achieves the goal of reducing infection rate, but only the last effectively addresses the considerations described above.

While de-indexing the url would definitely reduce its infection rate contribution to zero, it’s entirely unnecessary because the amount of malware a search engine has in its index does not directly correlate to the impact those infected documents have on user infection rate. In fact, there is a non-trivial chance of an unintended consequence of this approach resulting in precisely what we are trying to avoid:

1. User queries Bing and does not find a site they know to exist

2. Since the site is suppressed due to malware (and no warning shows up), the user attributes this to Bing not having the document in the index

3. User is dissatisfied and issues their query in a different search engine

4. That search engine does return the expected site, but fails to detect malware

5. User clicks on the result and gets infected

Alternatively, keeping the url in the index but showing a malware warning is very effective, as it forces the user to make a conscious choice (since they actually have to click on the link inside the warning, not the results page, to continue navigation to the infected page). The result is that the warning persuades 94%+ of users to abandon navigation to the url. The remaining users make an informed decision to continue navigation, possibly due to having confidence in their system’s overall security. Note that most of the indexed documents either never see the light of day or have very limited exposure to the user, and about 75% of infected documents shown to the user never get clicked on.
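As a rough back-of-the-envelope illustration, the two figures above can be combined. This rests on simplifying assumptions that are not from the post: that the 75% figure behaves like a per-impression non-click rate, that the 94% abandonment rate applies uniformly to every click, and that the two are independent. The impression count is made up.

```python
def expected_navigations(impressions: int, click_rate: float,
                         warning_abandon_rate: float) -> float:
    """Estimate how many impressions of warned results end in a visit
    to the infected page, under the independence assumptions above."""
    return impressions * click_rate * (1.0 - warning_abandon_rate)


# Hypothetical: 10,000 impressions of infected results carrying a warning.
# click_rate = 1 - 0.75 and warning_abandon_rate = 0.94 use the figures
# quoted in the post; everything else is illustrative.
visits = expected_navigations(10_000, 0.25, 0.94)
print(round(visits))
```

Under these assumptions only a small share of impressions, roughly 1.5%, would end with a user reaching the infected page, which is the intuition behind preferring a warning to de-indexing.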

On the flipside, the benefits of showing the warning over de-indexing are: 1) ability to revert the url’s malware designation immediately once we observe the site is clean, 2) retention of information accumulated for the indexed url and 3) empowering the user to make the final decision instead of making one for them.

These considerations are the reason Bing opted for the warning approach and likely why most other major search engines follow a similar methodology.

Final Thoughts

Scope: We share the findings from malware detection across verticals and partners. For example, Yahoo!, Facebook and many products across Microsoft consume our malware signal and adjust their result presentation accordingly. Internally, we share this information across Bing properties like image search and Ads, to ensure that malware protection is provided across the board.

External Community: At Bing, we strongly believe in listening to and learning from the community of security experts, and that the knowledge of the entire malware ecosystem is far greater than any single person or company may have. Attending conferences, studying published papers and tracking relevant forums are standard elements in our line of work, as is paying careful attention to 3rd party firms monitoring the malware landscape.

Webmasters: We aim to be transparent with and helpful to web site owners. As mentioned above, many of them are innocent of the malware their sites distribute, are often unaware of it, and may not have the expertise to deal with it. To that end, we recently rolled out a set of tools to 1) alert webmasters of the malware on their sites, 2) help them pinpoint the cause by providing various debugging information and 3) provide a way to request rapid re-evaluation. This is detailed in greater depth in an earlier post that can be found here.

We will continue to relentlessly improve Bing’s malware detection algorithms and methods with the latest state of the art and our own research innovations to provide the safest searching experience for our users.

Here’s to safer searching!

– Igor Rondel, Principal Development Manager, Core Search, Bing R&D
