Extrapolating Malware Detection with Rollup

Protecting Bing users from malware is a top priority for the Index Quality team. To that end, we analyze every signal available to us and determine not only whether a page is infected, but also whether it is at high risk of infection at a future date. One of the key elements of this analysis is discovering clues about potential vulnerabilities on the ‘container’ hosting the page that could be exploited by malware distributors to spread their malware to other URLs under the container. In this edition of the Bing Index Quality blog, my colleague David Felstead of our Anti-Malware team provides an overview of the technique we use to address this, and of the improvements we recently rolled out to increase its coverage and precision.

Igor Rondel, Principal Development Manager, Bing Index Quality

 

Unfortunately for webmasters and searchers alike, hacked websites are a very real danger on the web.  For a web searcher, visiting a website that has been compromised risks infecting their computer with malware. For a webmaster, the discovery of the root cause of the hack, the cleanup of the compromised code, and finally the damage to reputation and brand can be nightmarish. Bing is scouring the web twenty-four hours a day, seven days a week to discover hacked websites and malware distributors, to better protect our searchers and to keep webmasters who have had the good sense to sign up with Bing Webmaster Tools informed.

One challenge the Bing anti-malware team faces is striking the balance between detection completeness and accuracy. One major facet of this challenge is understanding when to “rollup” our malware detection, that is, when to consider an entire segment of a site, or the site itself, as malicious. At Bing, the nomenclature we use to describe a collection of URLs at the path, host or domain level is a “container”, and this is the basic unit we use for rollup: if a container is rolled up, then every URL under that container will be considered malware. For example, a rollup on the host “foo.example.com” will cause every URL on that host to be marked as malicious, whereas a rollup under “example.com/malware” will cause all URLs under the path “/malware” and all its sub-paths to be marked as malicious, but not the homepage or other paths.  The concept of rollup is fairly well established when thinking of a site’s reputation, be it for malware detection, adult classification or spam discovery; it is the concept that “if >N% of the URLs in a particular container are of one specific category, then the likelihood of the remaining URLs in that container being the same category is increased.” In the case of malware, we use it as a proxy to determine how deep the level of compromise on the site actually is: is it a few isolated pages, or is the entire website under the control of a malware distributor?
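To make the container idea concrete, here is a minimal Python sketch of how a URL could be tested against a rolled-up container. It is an illustration only, not Bing's implementation: the function name `url_in_container` and the "host" or "host/path" container notation are assumptions taken from the examples above, and domain-level containers that also cover subdomains are left out.

```python
from urllib.parse import urlsplit


def url_in_container(url: str, container: str) -> bool:
    """Return True if `url` falls under `container` (hypothetical helper).

    `container` is written as "host" or "host/path", mirroring the examples
    in the text. A host-only container matches exactly that host; covering
    subdomains of a domain is deliberately not handled in this sketch.
    """
    host, _, path = container.partition("/")
    # Allow scheme-less input such as "example.com/malware/x.html".
    parts = urlsplit(url if "://" in url else "http://" + url)
    if (parts.hostname or "") != host.lower():
        return False
    if not path:
        return True  # host-level rollup: every URL on the host matches
    prefix = "/" + path.strip("/")
    url_path = parts.path or "/"
    # Match the path itself or any sub-path, but not e.g. "/malware-free".
    return url_path == prefix or url_path.startswith(prefix + "/")
```

With this sketch, a rollup on “example.com/malware” matches “http://example.com/malware/x.html” but not the homepage “http://example.com/”, which is the behavior described above.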

Recently our team spent some time re-evaluating and improving rollup for malware detection, specifically the conditions on when and where a rollup judgment will be applied.  The balance we need to strike here is between protecting users from a compromised site and over-triggering the warning when it appears the compromise may be localized or already cleaned up. To determine where (e.g. at the path level, host level or domain level) to roll up, and whether or not a rollup is warranted, we look at many features of a site:

  • The number of malicious URLs found in each container vs. how many were scanned;
  • The overall scan coverage of the container;
  • The frequency at which the malicious URLs were discovered;
  • The types of infections found;
  • The size of the container in URLs vs. the size of the site;
  • The amount of traffic being sent to the container vs. the amount of traffic being sent to the site;
  • The “depth” from the root of the site of the malicious URLs (e.g. malware on the homepage is much more problematic than malware on a single page deep within the site);
  • The popularity of the site;
  • …and many more
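
Several of the features above are simple ratios, which can be sketched as follows. This is only an illustration under stated assumptions: the class `ContainerStats`, its field names and its derived properties are hypothetical, and the sketch says nothing about how the real features are weighted or combined.

```python
from dataclasses import dataclass


@dataclass
class ContainerStats:
    """Hypothetical per-container counts for a few of the listed features."""
    malicious_urls: int       # malicious URLs found in the container
    scanned_urls: int         # URLs scanned in the container
    container_urls: int       # known URLs in the container
    site_urls: int            # known URLs on the whole site
    container_traffic: float  # traffic sent to the container
    site_traffic: float       # traffic sent to the whole site

    @staticmethod
    def _ratio(part: float, whole: float) -> float:
        # Avoid division by zero for containers with no data yet.
        return part / whole if whole else 0.0

    @property
    def infection_ratio(self) -> float:
        """Malicious URLs found vs. URLs scanned."""
        return self._ratio(self.malicious_urls, self.scanned_urls)

    @property
    def scan_coverage(self) -> float:
        """Share of the container's URLs that were scanned."""
        return self._ratio(self.scanned_urls, self.container_urls)

    @property
    def size_share(self) -> float:
        """Size of the container in URLs vs. the size of the site."""
        return self._ratio(self.container_urls, self.site_urls)

    @property
    def traffic_share(self) -> float:
        """Traffic sent to the container vs. traffic sent to the site."""
        return self._ratio(self.container_traffic, self.site_traffic)
```

For example, `ContainerStats(5, 50, 200, 1000, 300.0, 1200.0)` describes a container where 10% of scanned URLs were malicious, a quarter of its URLs were scanned, and it holds a fifth of the site's URLs and a quarter of its traffic.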

Using this set of features, each container on the site is evaluated to determine if rollup should be applied.  By intuition, one might think “well, if you found malware anywhere on this site, shouldn’t the entire site be marked as risky?”, and that is indeed a valid argument.  However, we need to take into account that compromises occur in a variety of ways, and by their nature are often extremely transient.  Even the most secure, trusted sites may occasionally have malware detected on them not as the result of webmaster carelessness or misconfiguration (what we traditionally consider to be “hacked”), but from malicious ads being distributed through third-party ad networks, which is not an uncommon experience.

In cases of ad network compromise, infections tend to be transient and short-lived, often occurring only once, and perhaps never showing up to a real person; in this case, a rollup of a site or container would be unwarranted.  However, if the infection is persistent (i.e. is observed several times), widespread (across many URLs on a site) or recurrent (is cleaned up, then recurs), then rollup is likely the best way to protect the users of the site.
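
The three conditions above can be sketched in Python as follows. This is a simplified illustration, not the production logic: the scan-history representation (a per-URL list of booleans in scan order), the function names, and the thresholds `min_observations` and `min_urls` are all assumptions made for the example.

```python
from typing import Dict, List


def is_recurrent(history: List[bool]) -> bool:
    """True if a URL was infected, later scanned clean, then infected again."""
    seen_infected = cleaned = False
    for infected in history:
        if infected and cleaned:
            return True
        if infected:
            seen_infected = True
        elif seen_infected:
            cleaned = True
    return False


def infection_traits(
    scans: Dict[str, List[bool]],
    min_observations: int = 3,  # hypothetical threshold for "several times"
    min_urls: int = 5,          # hypothetical threshold for "many URLs"
) -> Dict[str, bool]:
    """Classify a container's scan history (URL -> infected flag per scan)."""
    infected_counts = {url: sum(history) for url, history in scans.items()}
    return {
        "persistent": any(c >= min_observations for c in infected_counts.values()),
        "widespread": sum(1 for c in infected_counts.values() if c > 0) >= min_urls,
        "recurrent": any(is_recurrent(h) for h in scans.values()),
    }


def rollup_candidate(scans: Dict[str, List[bool]]) -> bool:
    """A container is a rollup candidate if any of the three traits holds."""
    return any(infection_traits(scans).values())
```

Under these assumptions, a single transient detection such as `{"/a": [False, True, False]}` is not a rollup candidate, while a reinfection such as `{"/a": [True, False, True]}` is flagged as recurrent.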

Since we made the improvements to our rollup algorithm, we have observed the following changes, which we feel indicate a much higher level of protection for our customers:

  • Rollup coverage on URLs in the Bing crawled index increased by 2x
  • 60% more high-risk malware URLs flagged with rollup on Bing SERPs
  • Approximately 0.015% of Bing query traffic affected, that is, roughly 1 in every 7,000 queries

From a webmaster perspective, Bing reports rollup and infection information via Bing Webmaster Tools, so if you’re a webmaster and have not signed up, what are you waiting for?

As always, we are constantly observing and re-evaluating our data, telemetry, techniques and technologies, not to mention the state of the malware ecosystem on the web, to provide the best and most secure search experience to Bing users.  The web is a dynamic and ever-changing place, even more so when it comes to illegal activity such as malware distribution.  As such, we never have the luxury of “resting on our laurels”, so check back regularly for more updates and information about what we do here at Bing. We have plenty to share.

David Felstead, Principal Development Lead, Bing Index Quality Team