MGC Spam Filtering

In today’s edition of the Bing Index Quality blog we will delve into one particular spamming technique: MGC (short for ‘machine generated content’). We will discuss what it is and why and how spammers employ it, and introduce a specific update we shipped a few months ago aimed at detecting and filtering out pages that use this technique.

What is MGC and why and how do spammers employ it?

As we mentioned in the Web Spam Filtering overview blog from August 27, an important element of a spammer’s arsenal is the ability to mass-produce pages at little cost. This is an essential step in enabling the spammer to maximize their web presence and exposure to search users. Whatever black hat SEO technique they are planning to leverage, the logic is simple: why not apply it to thousands of pages instead of just one and have all of them vie for a good SERP position? This also enables them to widen their target area, perhaps by targeting different keywords on different pages.

Here is a relevant paragraph from our earlier blog that describes some of the techniques spammers use to achieve this: “There are a number of approaches spammers utilize to quickly and cheaply generate a large number of webpages, including a) copying other’s content (either entirely or with minor tweaks), b) using programs to automatically generate page content, c) using external APIs to populate their pages with non-unique content. Our technology attempts to detect these and similar mechanisms directly. To amplify this, we also develop creative clustering algorithms (using things like page layout, ads, domain names and WhoIS-type information) that in a way act as force-multipliers to help identify large clusters of these mass produced pages/ sites.”

As you probably figured out, the concept described in b) above is in fact what we refer to in this blogpost as MGC. The concept as such is fairly intuitive and easy to grasp. Let’s review some of the key distinguishing characteristics of this technique to reinforce the concept:

  • Generates content automatically
  • Can generate any number of pages
  • Specific algorithms may differ (e.g. auto-text generation vs. Frankenstein-style copying content from multiple other sources and joining them together, i.e. a sentence here, a sentence there…)
  • Complexity can range from basic to very sophisticated (e.g. random character generation vs. using the latest language modeling programs.)
  • Often paired with keyword insertion black-hat SEO
  • Upon close examination, content is gibberish providing zero user value

Now let’s take a look at a few examples of pages that we’d consider MGC:

  • Here is a ‘beautiful’ example of MGC that helps illustrate just about every one of the points mentioned above. It includes tons of keyword stuffing (e.g. ‘michael kors bags’, ‘michael kors outlets’); the content appears to be copied from multiple sources and joined together; zero thought was given to content presentation; and the content is incoherent (if you need convincing, just read through the circled paragraph).

[Image: blog1]

  • Here is another poster-child example. Clearly the spammer is hoping to optimize for ‘nude celebrity’ type queries (which are quite plentiful, as you can imagine). The content is not only gibberish, but is also not pertinent to the topic. Sentence punctuation and word capitalization are broken throughout the page.

[Image: blog2]

  • In this example, the page author doesn’t even try to make the content appear legitimate or intended for human consumption. The content is completely incoherent (the first line tells you all you need to know), just about every sentence is grammatically (and logically) incorrect, and images and text are intertwined, throwing readability right out the window.

[Image: blog3]

Why care?

While the impact of this technique is not particularly huge (we’ll talk more about this below), we care about it because a) it provides absolutely no value to the user and b) it masks itself such that it’s not immediately obvious that the content is garbage. Having landed on an MGC page, the user typically needs to spend (read: waste) some amount of time reading the content before realizing that it’s nonsense (the first example above illustrates this point particularly well).

How do we combat it?

As in previous posts, I will not go into too much detail, since I have no desire to make spammers’ lives any easier. Instead, I will describe the gist of the algorithm and share some of the signals we look at that suggest possible use of the MGC technique. At a high level, we look at various aspects of the content that give away its automated nature. If enough evidence is found, the page becomes an MGC candidate. MGC pages typically have poor grammar, misused punctuation, invalid use or formatting of proper names, improper capitalization, etc. Content incoherence (i.e. one sentence not making sense next to its neighbor) is another strong giveaway. For a human, spotting MGC is often easy because language isn’t used correctly, word sequences seem unnatural, and the grammar is all over the place. In short, our technology aims to duplicate what comes so easily and naturally to a human reader.
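To make the idea of "enough evidence" a bit more concrete, here is a minimal sketch in Python of how a few of the surface signals above (capitalization and punctuation) could be counted. To be clear, this is not our production algorithm: the function names, the particular signals, and every threshold are assumptions made up for illustration, and it makes no attempt at the harder problems of grammar or coherence.

```python
import re

# All thresholds are hypothetical values chosen purely for illustration.
LOWERCASE_START_MAX = 0.3    # max tolerated share of sentences starting lowercase
MISSING_END_PUNCT_MAX = 0.3  # max tolerated share of sentences lacking . ! or ?
REPEATED_PUNCT_MAX = 2       # max tolerated runs like ",," or ";;"
MIN_EVIDENCE = 2             # how many signals must fire to flag a candidate


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def surface_signals(text):
    """Compute a few crude capitalization and punctuation signals."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return {
            "lowercase_start_ratio": 0.0,
            "missing_end_punct_ratio": 0.0,
            "repeated_punct_count": 0,
        }
    lowercase_starts = sum(1 for s in sentences if s[0].islower())
    missing_end = sum(1 for s in sentences if s[-1] not in ".!?")
    repeated_punct = len(re.findall(r"[,;:!?]{2,}", text))
    return {
        "lowercase_start_ratio": lowercase_starts / n,
        "missing_end_punct_ratio": missing_end / n,
        "repeated_punct_count": repeated_punct,
    }


def is_mgc_candidate(signals):
    """Flag the text as a candidate only if several signals fire together."""
    evidence = 0
    if signals["lowercase_start_ratio"] > LOWERCASE_START_MAX:
        evidence += 1
    if signals["missing_end_punct_ratio"] > MISSING_END_PUNCT_MAX:
        evidence += 1
    if signals["repeated_punct_count"] > REPEATED_PUNCT_MAX:
        evidence += 1
    return evidence >= MIN_EVIDENCE
```

Requiring more than one signal to fire is a deliberate choice in this sketch: a single quirk, such as an all-lowercase writing style, should not be enough on its own to make a page a candidate.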

Naturally, as with any other detection technology, we need to be very careful when labeling pages or sites as MGC, to avoid generating false positives. Just because a page has grammatical errors or uses language that wouldn’t necessarily earn it an A+ in Ms. Johnson’s literature class doesn’t make it MGC. Certain types of pages are particularly susceptible to being misclassified as MGC (e.g. content written by non-native speakers or children, or non-standard content that falls outside the normal language models, such as technical manuals or academic papers). To mitigate this, we always look not just for evidence of the spamming technique (MGC in this case), but also for other supporting information that often accompanies MGC pages (e.g. keyword stuffing, poor quality of content, page popularity, or rather the lack thereof, uniqueness of content, etc.), and use it to help make the final determination.
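The two-stage logic described above can be illustrated with a small Python sketch. Again, treat this as a hypothetical illustration rather than a description of our actual system: the `PageEvidence` fields, the `mgc_score` scale, and both default thresholds are assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class PageEvidence:
    """Hypothetical per-page evidence; field names are illustrative only."""
    mgc_score: float          # 0.0 to 1.0, how machine-generated the text looks
    keyword_stuffing: bool    # supporting signal
    low_popularity: bool      # supporting signal
    non_unique_content: bool  # supporting signal


def is_mgc(evidence, mgc_threshold=0.7, min_supporting=2):
    """Label a page MGC only when the MGC evidence AND supporting signals agree.

    Both thresholds are made-up defaults for illustration.
    """
    if evidence.mgc_score < mgc_threshold:
        return False
    supporting = sum([
        evidence.keyword_stuffing,
        evidence.low_popularity,
        evidence.non_unique_content,
    ])
    return supporting >= min_supporting
```

Under these assumptions, an essay by a non-native speaker might score high on `mgc_score` yet show no supporting signals, so `is_mgc(PageEvidence(0.9, False, False, False))` returns `False`, which is exactly the kind of false positive this extra check is meant to avoid.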

What has been the impact on the end user & the SEO community?

 

Igor Rondel, Principal Development Manager, Bing Index Quality