Web Spam Filtering

As I mentioned in the July 15 blog introducing Bing Index Quality, one of the key dimensions of our work is web spam detection and filtering. The overview of our approach to this complex problem will be the focus of today’s update.

What is web spam? On the surface, our definition is fairly straightforward and intuitive. We think of a webpage as spam if its owner uses black hat SEO techniques in an effort to game our search algorithms, with the goal of gaining undeserved ranking for this page or other pages in the spammer’s network. Simple, right? In reality, however, it’s often anything but: a) there is typically a fine line between legitimate use of an SEO technique and its abuse; b) even when SEO techniques are severely abused, it’s often unclear whether the abuse is intentional or accidental; and c) even the most egregious spam pages may hold some user value, and it’s important to recognize that when deciding on the proper course of action.

Why is it important? Let me answer that by stating our goals, which tie directly to this question.

  • Delight users with superior search results quality – spam results are typically of inferior quality, and they push down good results by gaining undeservedly high placement
  • Contribute to the reduction of malicious content – there is considerable supporting data (both from our own analysis and from external studies) showing a high overlap between spam and malware
  • Contribute to the betterment of the internet ecosystem – simply put, if Bing, Google and other search engines are able to eliminate spam from results, it would cut off the traffic, i.e. the spammer’s lifeblood, to the spam sites, putting them out of business and making the entire internet ecosystem cleaner and safer for all
  • Improve utilization of our resources – index space is neither free nor unlimited; detecting and removing spam websites from the index prevents wasted space and leaves more room for good pages

How do we detect spam? Clearly, spammers do not advertise themselves as such; obfuscation and cloak-and-dagger techniques are the spammer’s friends. Search engines must therefore continuously develop innovative approaches to detect spam. Communication around spam detection is a sensitive matter, however, because unlike most other facets of search engine algorithms, we are dealing with an adversary who stands to benefit from a) a detailed understanding of search algorithms and b) a detailed understanding of anti-spam efforts. I therefore hope the reader will forgive me for steering clear of specifics and instead focusing on the main themes of our detection and filtering workflow.

First step: understand the spammer’s motivation. Before we delve any deeper into spam detection, it’s important to take a step back and understand a spammer’s motivation. An intuitive grasp of this makes it easier to conclusively identify a page as spam. So what are a spammer’s motivations? Spam is a business. As such, to no one’s surprise, it inevitably comes down to money. Whether directly or indirectly, a spammer’s primary goal is to make money. There are exceptions who are in it for other causes, e.g. politics or general mayhem, but the vast majority of spammers are driven by their ability to monetize their efforts. The most prevalent way to translate spam activity into financial gain is via ads (including affiliate programs). The more ads they are able to show to users, the bigger the financial gain. The math here is very simple: a certain percentage of users will inevitably click on page ads, and a certain percentage of those will follow through (i.e. make a purchase) on the ads (in most cases follow-through is not even required, as most types of ads reward the site owner for the click itself).
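To make that math concrete, here is a back-of-the-envelope sketch. All the numbers are invented for illustration; actual click-through and payout rates vary widely.

```python
# Back-of-the-envelope model of ad-driven spam revenue.
# Every number here is an illustrative assumption, not a measured rate.

impressions = 100_000  # visits the spammer funnels to the page
ctr = 0.02             # fraction of visitors who click an ad
cpc = 0.10             # payout per ad click, in dollars

revenue = impressions * ctr * cpc
print(f"Expected revenue: ${revenue:,.2f}")  # Expected revenue: $200.00
```

Even with modest rates, revenue scales linearly with impressions, which explains why scaling up traffic becomes the spammer’s next priority, as we’ll see below.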

You may be thinking: “That’s all fine and good, but how is this useful?” It helps us because a spammer’s motivation is often reflected in the web page itself, and while there is no absolutely sure way to pinpoint it, there are often clues that one can learn to read. We use such insights to develop algorithms that aim to make this type of determination automatically by looking at things like the following (a toy sketch follows the list):

  • Quality of content – this is itself a huge and important concept that we’ll dive into in a future blog. At a high level, given that a spammer’s overarching goal is to drive ad and affiliate clicks, the content of the page matters only to the extent that it helps facilitate that goal. To put it another way, spammers generate content targeted at search engines and their algorithms, whereas legitimate SEOs generate content for their customers. The result is that, in most cases, spam pages have inadequate content with limited value to the user. We use this fact to facilitate detection. There are literally hundreds, if not thousands, of signals used to make this assessment, ranging from simple things like the number of words on the page to more complex concepts of content uniqueness and utility.
  • Presence of ads – just about every page on the web contains ads. The presence of ads doesn’t make a page bad, let alone spam. What we care about are things like a) how many ads appear on the page, b) what type of ads they are (e.g. banners, overlays, pop-ups), and c) how intrusive or disruptive they are.
  • Positional & layout information – where is the main content located? Where are the ads located? Do the ads take up the prime real estate, or are they neatly separated from the main content (e.g. in the header, footer or side pane)? Is it easy for users to mentally separate the content from the ads?

Spammer’s next goal: scaling up their ability to monetize (typically by maximizing impressions). Once they have the page(s) that perform monetization, a spammer’s next goal is to maximize the payoff (after all, why go through all this trouble for a single page that few will see?). This is where black hat techniques are applied and SEO abuse takes place. Specifically, the spammer’s goal is to maximize traffic to the page(s) that perform monetization. How do they go about doing that? It may help to think of it in two ways: 1) maximizing web presence by mass-producing the pages that perform monetization, and 2) maximizing the rank these pages achieve in search engines.

Maximize web presence – There are a number of approaches spammers use to quickly and cheaply generate large numbers of webpages, including a) copying others’ content (either entirely or with minor tweaks), b) using programs to automatically generate page content, and c) using external APIs to populate their pages with non-unique content. Our technology attempts to detect these and similar mechanisms directly. To amplify this, we also develop creative clustering algorithms (using things like page layout, ads, domain names and WHOIS-type information) that act as force multipliers, helping identify large clusters of these mass-produced pages/sites.
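As one concrete illustration of the clustering idea, near-duplicate detection via word shingling is a classic technique from the literature; the sketch below shows the general mechanism, not Bing’s actual pipeline.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word n-grams)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two templated pages that differ by a single swapped word.
page1 = "cheap ringtones download free ringtones for your phone today"
page2 = "cheap ringtones download free ringtones for your mobile today"

sim = jaccard(shingles(page1), shingles(page2))
print(f"{sim:.2f}")  # 0.56, far above what unrelated pages would share
```

At web scale, techniques like MinHash make this kind of comparison feasible without checking every pair of pages.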

Maximize webpage rankings – There are literally dozens of ways SEO can be abused in an effort to trick search engines and gain unfair ranking, including a) stuffing the page body/URL/anchors with keywords, b) performing link manipulation via link farms, link networks and forum post abuse, and c) including hidden content on the page not meant for human consumption. To combat this, one of the strategies we employ is to develop technologies that look for these specific techniques. For example, understanding the standard distribution of text on the web can help us identify suspicious outliers (i.e. pages with an unusually large presence of certain keywords) that are possibly the result of the keyword stuffing technique. Similar technology can also be applied to analyze URLs and anchors. Other technologies focus on analyzing the web graph (page/site inlinks and outlinks) to identify possible link manipulation.
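The outlier idea can be sketched very simply: compare a term’s on-page frequency against its background rate across the web and flag wild deviations. The background rates and threshold below are invented for illustration.

```python
from collections import Counter

# Hypothetical background rates: the fraction of all words on the web
# that a term accounts for (values invented for illustration).
BACKGROUND_RATE = {"ringtones": 0.0001, "free": 0.002, "the": 0.05}

def stuffing_suspects(text: str, ratio_threshold: float = 50.0):
    """Flag terms whose on-page frequency exceeds their background
    rate by more than ratio_threshold (a toy outlier test)."""
    words = text.lower().split()
    counts = Counter(words)
    suspects = []
    for term, count in counts.items():
        rate = count / len(words)
        baseline = BACKGROUND_RATE.get(term)
        if baseline and rate / baseline > ratio_threshold:
            suspects.append(term)
    return suspects

page = "free ringtones free ringtones download free ringtones now free"
print(stuffing_suspects(page))  # ['free', 'ringtones']
```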

Avoiding detection: Like malware distributors, spammers often put a lot of effort into masking themselves from search engine detection, since detection = loss of profit. The techniques are often similar as well, and include things like a) redirects, b) content cloaking, c) making content appear legitimate and d) use of dynamic content.
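For instance, a crude probe for content cloaking is to fetch the same URL while presenting different User-Agent strings and compare what comes back. This is a simplified sketch of the general idea only; real detection has to tolerate legitimate variation such as ads and personalization.

```python
import urllib.request

def fetch(url: str, user_agent: str) -> bytes:
    """Fetch a URL while presenting the given User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

def looks_cloaked(url: str) -> bool:
    """Naive cloaking probe: does the page serve materially different
    content to a crawler than to a regular browser? A production
    system would diff the rendered content, not just the size."""
    as_crawler = fetch(url, "Mozilla/5.0 (compatible; bingbot/2.0)")
    as_browser = fetch(url, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    larger = max(len(as_crawler), len(as_browser), 1)
    return abs(len(as_crawler) - len(as_browser)) / larger > 0.5

# looks_cloaked("http://example.com")  # probe a suspicious URL
```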

Where can spam be present? Everywhere. While some segments are inherently spammier than others (e.g. software downloads, free ringtones/mp3s) and spammers often target higher-frequency, more monetizable topics, in general spammers don’t ‘discriminate.’ Spam can be found on your typical everyday webpages, in forums, on social networking sites (e.g. LinkedIn) and on shared hosting sites (e.g. Blogspot, WordPress).

What happens once a page is determined to be spam? While spam filtering serves the multiple goals discussed above, the overarching one is to ensure search results quality. To that end, our aim is to minimize the presence of spammy pages on the SERP. The specific mechanism that achieves this is less important and could take the form of demoting the page, neutralizing the effect of specific spam techniques, or removing the page/site from the index altogether. The decision is based on considerations like a) the extent/egregiousness of the spam techniques involved and b) the potential value the page presents to users.
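A stylized way to picture that trade-off (purely hypothetical; the actual decision logic is neither this simple nor public):

```python
from enum import Enum, auto

class Action(Enum):
    NEUTRALIZE = auto()  # cancel the effect of the spam technique
    DEMOTE = auto()      # lower the page's placement on the SERP
    REMOVE = auto()      # drop the page/site from the index

def choose_action(egregiousness: float, user_value: float) -> Action:
    """Toy policy over two inputs in [0, 1]; thresholds are invented."""
    if egregiousness > 0.8 and user_value < 0.2:
        return Action.REMOVE   # blatant spam with nothing to offer
    if egregiousness > 0.4:
        return Action.DEMOTE   # clearly abusive, but has some value
    return Action.NEUTRALIZE   # borderline: just undo the trick

print(choose_action(0.9, 0.1))  # Action.REMOVE
```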

I hope you found today’s edition interesting and informative. It ended up being a little longer than originally planned, but I wanted to make sure the depth of content is just right, whether you are brand new to the concept or an avid search engine news follower. Many webmaster readers of this blog may also be able to extract useful tidbits, particularly about how to avoid ending up on the wrong side of spam filtering. :)  Ultimately, search engines rank pages based on whether or not we think they will provide value to the searcher, and the best way to ensure that your pages rank well is to provide content that users actually want to see, rather than focusing on the specifics of the page structure or its link graph. Aside from the fundamentals of ensuring that your pages are well formed and that the content is easily discoverable by search engines, the best SEO you can do as a webmaster is to provide quality content. Our next blog will likely pick up where this one leaves off and introduce you to a specific update we recently rolled out that focused on URL keyword stuffing, covering how it impacted our users and the SEO community.

Igor Rondel, Principal Development Manager, Bing Index Quality