URL Keyword Stuffing Spam Filtering

As we alluded to in last week’s Index Quality blog post, today’s update focuses on one specific spam filtering mechanism we rolled out a few months ago. It targets a common spam technique known as URL keyword stuffing (KWS).

What is URL KWS?

As with any other black hat technique, the goal of URL KWS is, at a high level, to manipulate search engines into giving a page a higher rank than it truly deserves. The underlying idea unique to URL KWS relies on two assumptions about ranking algorithms: a) keyword matching is used and b) matching against the URL is especially valuable. While this is somewhat simplistic, considering that search engines employ thousands of signals to determine page ranking, these signals do indeed play a role (albeit a significantly smaller one than even a few years ago). Having identified these perceived ‘vulnerabilities’, the spammer attempts to take advantage of them by creating keyword-rich domain names. And since the spammers’ strategy includes maximizing impressions, they tend to go after high-value, high-frequency, monetizable keywords (e.g. viagra, loan, payday, outlet, free, etc.).

Those are the basic mechanics behind the overall URL KWS concept. Looking a little closer, spammers implement this technique in a variety of ways, resulting in a number of distinct flavors. These are some of the more common variants (note: some of the URLs mentioned below are fictitious and are used only to demonstrate the point):

Multiple hosts, with keyword-rich hostnames: http://account.free.online.savings.samedaypaydayloansusa.com
Host/domain names with repeating keywords: http://loan.payday.paydayloanspaydayloansusa.com
URL clusters across the same domain, with varied hostnames composed of keyword permutations
- http://contososhoeswomen.shoesonsale.com/
- http://bestwomensrunningsneakers.shoesonsale.com/
- http://discountrunningapparelforwomen.shoesonsale.com/
URL squatting
- This is a little different, as the spammer plays on the human tendency to misspell keywords, in effect siphoning traffic off existing (typically high-profile, high-traffic) sites
- E.g. http://nytime.com (misspelling of http://nytimes.com), http://ebey.com (misspelling of http://ebay.com)
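
To make the URL squatting variant concrete, here is a minimal sketch of how a near-miss of a popular domain could be spotted using edit distance. This is an illustration only, not a description of our production detection; the POPULAR_DOMAINS list, the possible_squat_target helper and the distance threshold of 1 are all assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete a character from a
                            curr[j - 1] + 1,       # insert a character into a
                            prev[j - 1] + cost))   # substitute (or match)
        prev = curr
    return prev[-1]


# Hypothetical list of high-traffic domains worth protecting.
POPULAR_DOMAINS = ["nytimes.com", "ebay.com"]


def possible_squat_target(domain, max_distance=1):
    """Return the popular domain this one closely resembles, or None."""
    for target in POPULAR_DOMAINS:
        if domain != target and edit_distance(domain, target) <= max_distance:
            return target
    return None
```

With these assumptions, possible_squat_target("nytime.com") returns "nytimes.com" and possible_squat_target("ebey.com") returns "ebay.com", while the genuine domains themselves return None. A small edit distance alone is a weak signal, since plenty of legitimate domains differ from a popular one by a single character.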

It’s important to note, however, that not all URLs containing multiple keywords are URL KWS spam. In fact, the majority are perfectly legitimate, non-spam URLs (e.g. http://www.nytimes.com/2011/08/25/opinion/how-to-fix-our-math-education.html). To ensure high precision, this detection technique is typically used in combination with other signals (more on this below).

Addressing this type of spam is important because a) it is a widely used technique (i.e. it has a significant SERP presence) and b) the URLs appear to be good matches for the query, enticing users to click on them.

How do we detect it?

As I mentioned in the previous blog, we will not be giving out specific details on detection algorithms because spammers are likely to use that knowledge to evolve their techniques. I can, however, tell you that we look at a number of signals that suggest possible use of URL keyword stuffing, such as:

Site size
Number of hosts
Number of words in host/domain names and path
Host/domain/path keyword co-occurrence (incl. unigrams and bigrams)
% of the site cluster made up of top-frequency host/domain name keywords
Host/domain names containing certain lexicons/pattern combinations (e.g. [“year”, “event | product name”], http://www.turbotaxonline2014.com)
Site/page content quality & popularity signals
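
To give a feel for what a couple of these signals might look like in practice, here is a rough sketch that counts keywords in a hostname and measures how often unigrams and bigrams repeat. It is not our actual algorithm: the VOCAB set, the greedy longest-match segmentation and the function names are assumptions chosen to keep the example small, and dropping only the last label as the TLD is a simplification.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword vocabulary; a real lexicon would be far larger.
VOCAB = {"payday", "loans", "loan", "usa", "free", "online",
         "savings", "account", "sameday"}


def segment(label):
    """Greedy longest-match split of one hostname label into vocabulary words."""
    words, i = [], 0
    while i < len(label):
        for j in range(len(label), i, -1):   # try the longest candidate first
            if label[i:j] in VOCAB:
                words.append(label[i:j])
                i = j
                break
        else:
            i += 1  # no known word starts here; skip one character
    return words


def host_keyword_signals(url):
    """Count hostname keywords and their most repeated unigram and bigram."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")[:-1]  # naive: treat the last label as the TLD
    words = [w for label in labels for w in segment(label)]
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))  # adjacent pairs, across labels too
    return {
        "word_count": len(words),
        "max_unigram_repeats": max(unigrams.values(), default=0),
        "max_bigram_repeats": max(bigrams.values(), default=0),
    }
```

For http://loan.payday.paydayloanspaydayloansusa.com this sketch finds 7 keywords, with "payday" appearing 3 times and the bigram ("payday", "loans") appearing twice. For the legitimate nytimes.com article URL mentioned earlier it finds none in the hostname, which is one reason signals like these are only useful in combination with others.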

To amplify this, we try to cluster sites (by various pivots such as domain, owner, etc.) and then look for patterns of the signals listed above within the same cluster. This helps improve detection precision because spammers often create dozens or hundreds of similar-looking sites.
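
As an illustration of the clustering idea, the sketch below groups hostnames by domain and then computes two cluster-level numbers: the host count and the share of hosts containing the cluster's most frequent keyword. Everything here is hypothetical: the KEYWORDS list, the function names, and the shortcut of treating the last two labels as the registered domain (which is wrong for suffixes like .co.uk).

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical keyword list used only for this example.
KEYWORDS = ["shoes", "women", "running", "sneakers", "discount", "sale"]


def cluster_by_domain(urls):
    """Group hostnames by a naive registered domain (the last two labels)."""
    clusters = defaultdict(set)
    for url in urls:
        host = urlparse(url).hostname
        if host:
            clusters[".".join(host.split(".")[-2:])].add(host)
    return clusters


def cluster_signals(hosts):
    """Host count plus the share of hosts whose subdomain has the top keyword."""
    counts = Counter()
    for host in hosts:
        subdomain = ".".join(host.split(".")[:-2])
        # Each keyword is counted at most once per host.
        counts.update(k for k in KEYWORDS if k in subdomain)
    top_keyword, top_hosts = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "host_count": len(hosts),
        "top_keyword": top_keyword,
        "top_keyword_share": top_hosts / len(hosts),
    }
```

Run on the three fictitious shoesonsale.com URLs listed earlier, this produces a single cluster of 3 hosts in which "women" appears in every subdomain (a share of 1.0). A cluster with many hosts and a high keyword share is the kind of pattern that would be worth combining with the other signals, not a verdict on its own.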

What has been the impact on the end user & the SEO community?

Users: This update impacted ~3% of Bing queries (on average, ~1 in 10 URLs was filtered out per impacted query).
SEO community: ~5M sites, comprising >130M URLs, have been impacted, resulting in a reduction of upwards of 75% in traffic to these sites from Bing.
Example queries: {hotmail login}, {bestbuy on sale}, {cheap hdtv}
Examples of spam sites impacted:

Igor Rondel, Principal Development Manager, Bing Index Quality