Crawling the Internet…

Now that we have a beta, people are starting to pay attention to whether their sites are in our index. The two most common questions we get are (1) why did you not crawl my site, and (2) you crawled page X, but it’s not in your index; why? Let’s take these one at a time.

Why did MSNBot not crawl my site? The answer to this is not straightforward, so I will mention a couple of things that are worth considering. The first is to determine whether your pages are crawler friendly. An example of a page that might look “unfriendly” to a crawler is one that looks like this: http://www.somesite.com/info/default.aspx?view=22&tab=9&pcid=81-A4-76&section=848&origin=msnsearch&cookie=false. When MSNBot looks at this URL it gets scared (well, not really; it’s a machine, not a human, so it doesn’t have feelings). The algorithm starts to wonder whether it is going to get stuck in a loop, endlessly crawling every single permutation of the query parameters. Thus, URLs with many (definitely more than 5) query parameters have a very low chance of ever being crawled. Another thing to consider is whether we can find your page. If we need to traverse through eight pages on your site before finding leaf pages that nobody but you points to, MSNBot might choose not to go that far. This is why many people recommend creating a site map, and we would as well. Lastly, you can also use this tool to Submit your URL to MSN Search.
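If you want to check your own URLs against the query-parameter rule of thumb above, here is a rough sketch in Python. To be clear, this is not MSNBot’s actual logic: the function names and the `MAX_FRIENDLY_PARAMS` constant are made up for illustration, and the threshold of 5 is simply lifted from the “definitely more than 5” guideline.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumption: borrowed from the "more than 5" rule of thumb in this post.
# The crawler's real threshold and heuristics are not public.
MAX_FRIENDLY_PARAMS = 5


def count_query_params(url: str) -> int:
    """Return the number of key=value pairs in the URL's query string."""
    query = urlsplit(url).query
    # keep_blank_values=True so that parameters like "flag=" are still counted.
    return len(parse_qsl(query, keep_blank_values=True))


def looks_crawler_unfriendly(url: str) -> bool:
    """Hypothetical check: True if the URL has more parameters than the guideline."""
    return count_query_params(url) > MAX_FRIENDLY_PARAMS


example = (
    "http://www.somesite.com/info/default.aspx"
    "?view=22&tab=9&pcid=81-A4-76&section=848&origin=msnsearch&cookie=false"
)

print(count_query_params(example))        # 6
print(looks_crawler_unfriendly(example))  # True
```

The example URL from above has six parameters, so it lands on the wrong side of the guideline. Running something like this over the URLs in your site map is a quick way to spot pages that might be worth giving shorter, cleaner addresses.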


You crawled my site, so why can’t I find it in your search index? This one is a little bit easier. The most likely reason is that we are detecting the page as spam when we analyze it to build our index. How can you make sure that this does not happen? The best thing to do is to not spam us. In our site owners help we talk about some of the things that we consider spam. In case you have not read it, here is a quick refresher: dirty JavaScript redirects, stuffed alt text, white-on-white links, off-topic links, etc. We take this stuff very seriously and we are continuously working to improve our spam detection, where we still have room for improvement. The reason that we take this seriously is that web spam threatens our entire industry. To the extent that spam is successful, people will not be able to turn to search engines to find what they are looking for.

Lastly, a brief moment on peanut butter — why is it that we stop liking peanut butter after like 8th grade? Or is it just me?  I have not had a peanut butter and jelly sandwich for the longest time.  This morning I had one.  Yummy.  Here’s to peanut butter.

Eytan Seidman, Program Manager