The internet can often feel like a giant cesspool of low quality, illegal or malicious documents. The mere mention of malware, adware and viruses is enough to send even the most experienced internet surfers running for cover. And yet, these represent just the tip of the iceberg of poor quality documents. Searchers also have to deal with phishing sites, fraudulent sites, sites propagating scams, spammy sites… well, you get the picture. On top of this, users are often inundated with non-malicious but low quality/low content sites like parked domains, 404s and cheap content farms. To deal with this, search engines must ensure not only results relevance (pertinence of content to the user’s query), but also results quality and appropriateness. In Bing, this effort is driven primarily by the Index Quality team.
Q: Why post on Bing Webmaster Blog?
A: The Index Quality team has long had a close and rewarding relationship with the Bing Webmaster team. In the large Bing family, the Webmaster and Index Quality teams are close siblings: 1) we share a common principle and desire to stay close to our customers; 2) the Webmaster Portal hosts our malware analytics page as well as the SEO/content guidelines; and 3) we even share some backend technology to power various data mining and analysis capabilities. In fact, until as recently as 3-4 years ago these two projects were actually developed by the same group of engineers. We continue to build on that relationship and collaborate on many projects such as the Bing Site Safety Page and the SEO/content guidelines. As such, the Webmaster blog is a perfect place to share Index Quality updates with Bing customers and content owners.
Q: Why now?
A: Well, this isn’t actually our first post. We’ve previously posted updates on the Bing blog and partnered closely with the Webmaster team to publish posts such as this in the past. Having said that, we believe we need to do a better job of proactively communicating updates in this important space and of maintaining a regular rhythm instead of infrequent one-offs. As to ‘why now’, there is no specific reason (well, aside from the constant… encouragement we get from Duane Forrester!).
Q: Aha, I knew Duane Forrester was somehow related to this! Doesn’t Duane speak to SEO/spam topics for Bing?
A: Duane is one of Bing’s most prolific external communicators. Although he’s formally the Product Manager for Bing Webmaster, he does indeed often use his diverse Bing knowledge to speak to other topics including SEO, content quality, etc. And while the Index Quality team has been ‘shy’ about blogging on our work in the past, Duane often helped fill the void by posting his insights on related topics and speaking to them at conferences. Moving forward, we hope to maintain a frequent rhythm of external communication as part of a broader communication strategy that expands on what Duane has done for us in the past.
Q: What can I expect moving forward?
A: A couple of things. We will:
- Deliver new updates on a regular basis (perhaps bi-weekly or monthly)
- Aim to actively monitor and respond to your questions and comments
- Cover the latest important updates in this space, spotlight some of the cool innovation we’re working on and give a heads-up about upcoming changes that may impact traffic
To kick off the series, I thought it might be worthwhile to start by introducing our team in a bit more detail to set the context and narrative for future posts. To put it very plainly, our charter is to assess web documents against quality, ‘appropriateness’ and other standards, take action if needed and share this information with downstream components. Let’s look a little deeper into each of these steps.
Q: Are there any specific categories of documents the Bing Index Quality team is concerned with?
A: Yes, very much so:
- Malicious: Sites that may infect the user’s computer or steal personal data when visited, e.g. malware, adware and scareware
- Spam: Sites using black-hat SEO techniques like keyword stuffing, URL squatting, link manipulation, etc…
- Scam: Sites whose purpose is to defraud users, e.g. mug shots scam, phishing sites, etc…
- Junk: Mostly useless pages such as 404s, parked domains, “soft redirection”, etc…
- Illegal: Child pornography (CP), content subject to DMCA takedown notices, etc…
- Adult: Websites hosting pornographic content
- Objectionable (non-porn): Websites hosting content “not suitable for all audiences”, e.g. extreme violence, hate speech, etc…
Q: Does this mean you detect and remove all of these?
A: The short answer is ‘No.’ The list above represents our ambitious scope and vision in terms of what we strive to achieve, but it is a vast and difficult problem that will likely remain open as long as search engines exist. We’ve been investing in some of these categories for a while and already do a fairly decent job on them (malware, spam, junk, adult). For others we have only fairly basic (and in some cases manual/ad hoc) solutions (e.g. scams, illegal), and yet others we are only beginning to consider (e.g. extreme violence). Also, each of these categories potentially demands different approaches and follow-up actions (i.e. it is not as simple as blindly removing all of them from the index).
Q: Does the Index Quality team assess documents outside of the specific categories discussed above?
A: Yes, we process every single document that goes through Bing and aim to assess its quality, from head to toe, even if it doesn’t fit under categories mentioned above. In fact, this bucket represents the vast majority of the documents in Bing. Knowing which documents are high quality vs low allows us to feed a very powerful signal to our ranker to help provide the best results to the user.
Q: What aspects of content do you care about with respect to low/high quality documents?
A: Many, here are some examples:
- Generation cost: How much time/ effort/ experience went into generating the content on the page (e.g. content farms)?
- Freshness: How up-to-date is the content on the page?
- Authority: How authoritative is the content on the page? Would you trust it? Make important decisions based on it?
- Depth: How deep or shallow is the content on the page? Does it fully answer the user’s need or require them to seek additional information elsewhere?
Q: What action does Bing take for the various types of content described above?
A: We take different actions depending on the type of content we discover:
- De-indexing – Certain web documents add so little value (or worse yet – negative value) that we do not feel they deserve space in the index. This is the harshest penalty we can give. It applies to some types of spam (using the worst kinds of abusive SEO like link farming and stolen content), scam sites, pages hosting CP content and most junk pages.
- Results filtering – This comes in two flavors:
  - Unconditional – From the user’s perspective, this is nearly identical to de-indexing, except we show a message to the user that certain results have been removed. This applies to pages taken down due to DMCA notices and transient junk pages (e.g. high profile, temporary soft 404 pages that are likely to be repaired soon).
  - Conditional – Similar to the above, except results are removed only under certain conditions. For example: 1) adult results in strict safesearch mode, 2) adult results in moderate safesearch mode for non-adult intent queries and 3) (coming soon!) pages reported under the new ‘right to be forgotten’ regulation, for the associated queries when they come from EU markets.
- User warning – This action does not remove the document from the SERP. Instead, it triggers a warning that is shown when the user clicks on the result. It applies to documents found to be infected with malware. The reader is welcome to check out a blog post dedicated to this topic. Long story short, most infected webpages are ‘innocent’ victims that don’t deserve punishment, and the vast majority (> 95%) of users heed the warning and refrain from navigating to the infected site.
- Demotion – Certain documents exhibit discouraged behavior or provide limited value. These do not necessarily deserve to be filtered out completely, but should typically be ranked lower than other, higher quality pages. To achieve this, we attempt to identify such documents and feed the signal to the ranker. This strategy applies to certain spammy sites (using dubious, but not overly egregious SEO practices) and overall low quality documents (e.g. content farms, low content, etc…)
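For readers who think in code, here is a rough sketch of how the actions above could be chosen. It is purely illustrative and is not how Bing implements this: the category labels, the `Query` type, the `choose_action` function and the table name are all hypothetical, made up from the examples in this post. Most categories map to an action regardless of the query, while adult content is filtered only under the safesearch conditions described above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DEINDEX = "de-index"
    FILTER = "filter"        # result removed from the SERP
    WARN = "user warning"    # result kept, warning shown on click
    DEMOTE = "demote"        # result kept, signal fed to the ranker
    NONE = "none"


class SafeSearch(Enum):
    OFF = 0
    MODERATE = 1
    STRICT = 2


@dataclass
class Query:
    safesearch: SafeSearch
    adult_intent: bool = False


# Hypothetical labels; the mapping follows the examples in this post only.
QUERY_INDEPENDENT_ACTIONS = {
    "scam": Action.DEINDEX,
    "egregious_spam": Action.DEINDEX,
    "junk": Action.DEINDEX,
    "dmca": Action.FILTER,
    "transient_junk": Action.FILTER,
    "malware": Action.WARN,
    "mild_spam": Action.DEMOTE,
    "content_farm": Action.DEMOTE,
}


def choose_action(category: str, query: Query) -> Action:
    """Pick an action for a document category in the context of a query."""
    if category == "adult":
        # Conditional filtering: the outcome depends on the query context.
        if query.safesearch is SafeSearch.STRICT:
            return Action.FILTER
        if query.safesearch is SafeSearch.MODERATE and not query.adult_intent:
            return Action.FILTER
        return Action.NONE
    # Every other category gets the same action whatever the query is.
    return QUERY_INDEPENDENT_ACTIONS.get(category, Action.NONE)
```

For example, `choose_action("adult", Query(SafeSearch.MODERATE))` yields `Action.FILTER`, while the same document on an adult-intent query in moderate mode is left alone.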
We hope you found this post useful and look forward to going into more detail across these super interesting topics in the coming months. Future posts will be shorter and quickly get straight to the point on specific topics. For example, I can share that our next post will come out sometime in the next 1-2 weeks and will cover the new and improved Bing Site Safety Page we’ve recently rolled out to Webmaster Portal as well as hint at some additional updates coming to it soon. Also we’d love to hear from you and will try to respond as best we can. Please feel free to recommend ideas for future topics or share anything else on your mind about this space.
Igor Rondel, Principal Development Manager, Bing Index Quality