The pernicious perfidy of page-level web spam (SEM 101)

In the exciting world of today’s Internet, where the world’s information is literally at your fingertips, where you can endlessly communicate, shop, research, and be entertained, spam is a big downer. The unwanted email spam that fills our inboxes also consumes huge portions of the available bandwidth of our routers and trunk lines. But email is not the only spam game in town.

Web spam is the bane (well, one of the banes) of the search engine and web searcher communities. Search engines want to provide search users with a great experience, helping them find what they want as quickly and as easily as possible. Search users, in turn, want to get the right information with a minimum of detours. And webmasters want search users to find their websites, and then to have those visitors become conversions instead of bounces.

Web spam, those unwanted garbage pages that use overtly deceptive search engine optimization (SEO) techniques and contain no valuable content, is a frustration to search engines and search users alike, and ultimately works against the best interests of conversion-seeking webmasters (severely annoying a potential customer is rarely a great sales technique!).

In the previous article that defined web spam and discussed how it is different from junk content, we mentioned that there are two types of web spam. In this article, we’re going to delve into the details of the first type: page-level web spam.

Definition of page-level web spam

Page-level web spam uses on-page SEO trickery (not to be confused with link-level web spam, which we'll discuss in an upcoming article). Webmasters and optimizers for these sites do this because they believe they can fool the search engines into giving their webpages a higher-than-deserved ranking for content relevancy, oftentimes for subject areas that are completely unrelated to the site's actual content. The aim is to deceive searchers into visiting their spammy sites for a multitude of reasons, none of which typically benefits the end user.

The use of the following questionable SEO techniques will cause Bing to examine your site more deeply for page-level web spam. If your site is determined to be using web spam techniques, your site could be penalized as a result.

Note that Bing recognizes that the core concepts behind many of these techniques can have valid uses. No one is saying that their use always and automatically denotes web spam. The intent behind their use is the distinguishing factor for determining whether web spam is present and whether any site penalties are needed. Please understand that, from a search engine perspective, the web spam effort consistently provides little to no value to end users. The entire effort is directed at fraudulently influencing search engine rankings. As Martha Stewart might say, that's not a good thing.

Keyword, URL, and link stuffing

Definition: This is the use of heavily repeated keywords and phrases with the goal of attaining a more favorable ranking for those words in a search engine index.

Problem: Keywords can be repeated to excess, so much so that they render any text in which they appear unintelligible from a natural language point of view. Those excessive repetitions can also be added in places that are not seen by the end user (meaning outside of displayed page text). Some web spam pages even use repeated keywords that are unrelated to the theme of the page. If any of these conditions are detected, these techniques will draw Bing's attention to the page as likely web spam.

What we look for: The purveyors of web spam use a variety of methods for keyword stuffing, including:

  • Excessive repetitions of keywords. The number of repetitions relative to the amount of content on the page is a key indicator of web spam. For example, a very long page of text dedicated to a single topic may naturally repeat its primary theme keyword several times, but a page with far less content using the same number of repetitions of the same word may be indicative of keyword stuffing.
  • Stuffing words unrelated to the page or site theme. Stuffing the page with words that are known to be heavily searched on the Web, but that are irrelevant to the theme of the site, can be an indicator of web spam. Relevance is an important factor in that evaluation.
  • Stuffing on-page text. Littering the text of a page with repeated keywords that render the text meaningless and unreadable to humans is a clear problem. Page content that is not useful to people is often suspect as web spam.
  • Stuffing in less visible areas of the page. Placing repeated keywords in less visible areas of a page, such as at the bottom of the page, in links, in Alt text, and in the title tag, can be indicative of web spam.
  • Hiding stuffed keywords in the code of a page. Putting keywords in the code of a page where the search engine crawler (aka a bot) will see them, but configuring the page so that a web browser never shows them to a human reader, is highly suspicious. Techniques such as formatting text in the same color as the background, using extremely small fonts, and hiding stuffed keywords behind tag attributes such as style="display: none" and class="hide" (both of which prevent the tagged content from being shown to the user) will draw the attention of a search engine for closer scrutiny, as in the markup sketch after this list.
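
To make these patterns concrete, here is a minimal, hypothetical markup sketch of the hidden-text techniques described above. The keyword phrase, file name, and class names are invented for illustration; this is not taken from any actual spam page:

    <!-- Hypothetical hidden-keyword stuffing; "cheap widgets" stands in for any phrase -->
    <head>
      <title>cheap widgets cheap widgets cheap widgets cheap widgets</title>
      <style>
        .hide { display: none; }                       /* contents never rendered */
        .camo { color: #fff; background-color: #fff; } /* white text on a white background */
        .tiny { font-size: 1px; }                      /* technically visible, practically not */
      </style>
    </head>
    <body>
      <div style="display: none">cheap widgets discount widgets free widgets</div>
      <div class="hide">cheap widgets discount widgets free widgets</div>
      <p class="camo">cheap widgets cheap widgets cheap widgets</p>
      <p class="tiny">cheap widgets cheap widgets cheap widgets</p>
      <img src="widget.jpg" alt="cheap widgets cheap widgets cheap widgets">
    </body>

A bot reading the raw HTML sees every one of those repetitions; a human visitor sees almost none of them. That asymmetry between what the crawler reads and what the reader sees is precisely what invites closer scrutiny.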

Note that stuffing the keywords <meta> tag alone is not a reason for a site to be judged as web spam. But <meta> tag stuffing can be an indicator that other web spam techniques are employed, and it could prompt a search engine to take a closer look at such a site.
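
For completeness, the <meta> keywords stuffing mentioned above looks like this (a hypothetical example):

    <meta name="keywords" content="cheap widgets, discount widgets, free widgets, widgets, widget deals, widget sale, buy widgets">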

It is important that webmasters not overreact to this information. A small amount of relevant keyword repetition is common and is not considered web spam, as long as it is used naturally within the page content language and the page provides useful, relevant content. The key message is always the same: develop your pages for human readers, not for search engine bots, for the best results. For more information on creating and using keywords wisely, see the blog articles The key to picking the right keywords and Put your keywords where the emphasis is.

Misspelling and computer-generated words

Definition: Pages populated with many variant spellings of targeted keywords, especially spellings unrelated to the theme of the page or the site, can indicate that the keyword lists are computer generated.

Problem: Aggressive inclusion of long lists of misspelled or rare words and phrases can be considered web spam. The relevance of those words to the theme of the page or the site is the key distinguishing factor here.

What we look for: The Bing team commonly sees the following techniques on web spam sites:

  • Excessive use of misspelled keywords. Huge lists containing every conceivable misspelling of a word can be so excessive that the page becomes worthy of closer inspection for web spam.
  • Large numbers of misspelled words unrelated to the theme of the site. Long lists of word spelling variations whose core definitions are unrelated to the theme of the page or the site can indicate the site is web spam.
  • Common misspellings of popular site URLs in domain names. Domain names that are deliberate misspellings of popular sites' URLs, especially when paired with computer-generated content, usually mark a site as web spam (see the sketch of machine-generated variants after this list).
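
To illustrate what "computer generated" means in practice, here is a minimal JavaScript sketch of how a tool might mass-produce one-character misspelling variants of a keyword. It is purely illustrative, not a reconstruction of any actual spam tool:

    // Hypothetical sketch: generate simple one-character misspellings of a word.
    function misspellings(word) {
      const letters = 'abcdefghijklmnopqrstuvwxyz';
      const variants = new Set();
      for (let i = 0; i < word.length; i++) {
        // Deletion: "widget" -> "wdget"
        variants.add(word.slice(0, i) + word.slice(i + 1));
        // Substitution: "widget" -> "wadget", "wodget", ...
        for (const c of letters) {
          variants.add(word.slice(0, i) + c + word.slice(i + 1));
        }
        // Transposition of adjacent characters: "widget" -> "iwdget"
        if (i < word.length - 1) {
          variants.add(word.slice(0, i) + word[i + 1] + word[i] + word.slice(i + 2));
        }
      }
      variants.delete(word); // keep only actual misspellings
      return [...variants];
    }

A single keyword run through a generator like this yields well over a hundred variants. A page stuffed with the output of such a tool, with no regard for the page's actual theme, is exactly the kind of machine-generated word list described above.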

Redirecting and cloaking

Definition: When a web client visits a website, certain traits can be used to identify that client and redirect it to a different page. These include, but are not limited to, redirects based on the referrer, the user agent (bot or human), and the IP address.

Problem: Redirecting can be a legitimate technique in some cases, such as when a web client is limited in what it can display in a mobile device browser, or when a web server uses the client's IP address to determine the language in which to present the content (aka geo-targeting). However, problems arise when sites filter their content based on whether the user agent belongs to an end user's web browser or to a search engine bot. This type of filtering can run the gamut from showing the bot a keyword-stuffed version of the page to showing it an entirely different set of content, all in an attempt to deceive. When used with this intent, this is web spam.

What the webmasters who implement these techniques don't understand is that search engines can detect this attempted deception. We can see when the content presented varies by user agent, and when the differences between the content variations go well beyond the legitimate tailoring done between mobile and desktop browsers.
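
As a deliberately simplified sketch of that pattern, here is what the server-side logic of a cloaking site might boil down to. This Node.js/Express fragment is hypothetical; the route, file names, and bot test are invented for illustration:

    // Hypothetical sketch of user-agent cloaking -- the pattern search
    // engines look for, not working spam code.
    const express = require('express');
    const app = express();

    app.get('/widgets', (req, res) => {
      const ua = (req.get('User-Agent') || '').toLowerCase();
      if (ua.includes('bingbot') || ua.includes('googlebot')) {
        // Crawlers get a keyword-stuffed, "highly optimized" page...
        res.sendFile(__dirname + '/stuffed-for-bots.html');
      } else {
        // ...while human visitors get entirely different content.
        res.sendFile(__dirname + '/what-users-actually-see.html');
      }
    });

    app.listen(3000);

Contrast this with legitimate variation: choosing a mobile layout for mobile browsers, or picking a language based on the Accept-Language header or the client's IP address, changes the presentation of the content, not its substance.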

What we look for: Some webmasters design their websites to use the following deceptive techniques when the detected user agent is a search engine bot:

  • Script-based redirects. The use of JavaScript or <meta> tag refreshes to automatically change which page is displayed is often suspicious in nature and will get more scrutiny from Bing (see patterns 1 and 2 in the sketch after this list). This is because some sites use JavaScript to redirect all visiting user agents to a new page, and that page may contain web spam. Since search engine bots don't execute JavaScript natively, they won't follow the redirect and are left to index the contents of the original page (although search engine bots can still detect this behavior).
  • Referral redirects. Some websites consider the referrer when they show a page (pattern 3 in the sketch after this list). When the referrer is a search engine results page (SERP) and the target website shows a different page than the one shown when the user navigates directly to the URL, this behavior is considered web spam.
  • Redirect search engine bots to a target page. Some sites inspect the user agent string and send search engine bots to alternate, text-based pages modified with other web spam techniques such as keyword stuffing, while serving their normal web content pages to end user browsers. When redirects are filtered on search engine user agents for the purpose of deceiving them, this is a web spam version of cloaking. Bots can detect when they are redirected to special pages, so when this is encountered, it is usually indicative of web spam and will be investigated further.
  • Redirect end users to a target page. Sometimes webmasters use cloaking to work in the opposite direction from that described immediately above. They may serve highly optimized content pages on Topic A to search engine bot user agents, but when a web browser visits the site, the page shown contains content on a completely different subject (typically an illicit one, such as pornography, casino or online gambling, illicit pharmaceuticals, and the like). The effort here is to rank well in the SERP for a commonly searched topic of interest; when searchers click that blue link in the SERP, they are unwittingly redirected to the web spam page.
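
For reference, the script-based and referrer-based redirects described in this list typically boil down to patterns like the following. These fragments are hypothetical sketches, and the URLs are invented:

    <!-- Pattern 1. Meta refresh: the browser is sent elsewhere after 0 seconds. -->
    <meta http-equiv="refresh" content="0; url=http://spam.example.com/landing">

    <!-- Pattern 2. Script-based redirect: a bot that does not execute JavaScript
         indexes this page's visible content, while every human visitor is
         immediately sent away. -->
    <script>
      window.location.replace('http://spam.example.com/landing');
    </script>

    <!-- Pattern 3. Referrer-based redirect: only visitors arriving from a search
         results page are redirected, so a direct visit shows the innocuous page. -->
    <script>
      if (document.referrer.indexOf('/search') !== -1) {
        window.location.replace('http://spam.example.com/landing');
      }
    </script>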

The problem for webmasters practicing these techniques is that their technical deceptions are not very effective. Search engines use a number of techniques to uncover such fraudulent practices as redirect- and cloaking-based web spam. When they are revealed, the websites of the perpetrators are penalized, sometimes severely. Well-meaning webmasters or online business owners who hire unscrupulous consultants or carelessly take black hat SEO advice from indiscriminate sources on the Web are setting themselves up for trouble. Reviewing the issues identified in this article, as well as the official webmaster guidelines for Bing, Yahoo, and Google, will go a long way toward keeping a website on the right track for search.

In the next article on web spam, we'll discuss link-level web spam in detail. We'll also include some information on what to do if your site has been pegged as web spam and, once the problems have been resolved, how to request reinstatement into the Bing index as a normal website. Stay tuned!

If you have any questions, comments, or suggestions, feel free to post them in our Ranking Feedback and Discussion forum. Until next time…

— Rick DeJarnette, Bing Webmaster Center