Making links work for you (SEM 101)

Links can be the lifeblood of a good website, as we discussed in Part 1 and Part 2 of Links: the good, the bad, and the ugly. But how well you manage them on your site from a site architecture perspective can be the difference between your website being starved for oxygen (aka search engine referral traffic) and being healthy and thriving. That’s why we do search engine optimization (SEO).

This article is part 2 of the recent Site Architecture and SEO series. In the first article, I discussed how site architecture is related to making your site easier to crawl. By knowing what the search engine web crawler (also known as a robot or, more simply, a bot) needs to efficiently and effectively do its job, you can grease the skids to getting more of your content in the index.

Let’s take a look at what you can do to optimize your website’s URL and the links contained within.

Canonicalize your home page URL

We’ve talked about canonicalization and its relevance to large site issues in this blog a few times in the past. This series of articles isn’t the place for another deep dive on the subject, but the importance of the concept does bear repeating.

Search engines apply page rank to URLs. That makes sense, right? But check this out:

  • mysite.com
  • www.mysite.com
  • mysite.com/
  • www.mysite.com/
  • mysite.com/default.htm
  • www.mysite.com/default.htm
  • www.yourhostprovider.com/~mysite
  • www.mysite.com/en/us/

All of these various URLs almost certainly point to the same default.htm page at the domain mysite.com. However, because each URL is structured differently, search engines treat each one as a unique URL. As a result, the “link juice” (the ranking value) that search engines attribute to each URL is split among the variants, even though they all go to the same page, which dilutes what is attributed to the site overall. Canonicalization is the way to express which single URL form you want to be used for your website’s home page. And once done, you should use that URL form religiously in your internal linking.

But what about inbound links? You can’t reliably control how other folks structure their links to you, right? Well, there’s a trick for that. Recall in the first site architecture post where I talked about setting up 301 redirects for moved pages? You can also set up 301 redirects for all possible permutations of your site’s home page URL and redirect them to the canonicalized URL. Once search engines begin hitting the 301s, they will attribute the link juice formerly attributed to the variant URL to your now canonicalized URL. Use the 301s to funnel all of the link juice you have earned to elevate the rank of your canonicalized URL instead of having it spread out over several URL variations.
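To make the redirect idea concrete, here is a minimal Python sketch of the decision a server-side redirect rule would make. The canonical form, the host names, and the list of default document names are assumptions chosen for illustration, and variants such as a host provider URL or a localized folder would need rules of their own.

```python
from urllib.parse import urlsplit

# Assumptions for illustration: the canonical form and the default document names.
CANONICAL_HOME = "http://www.mysite.com/"
HOME_HOSTS = {"mysite.com", "www.mysite.com"}
DEFAULT_DOCS = {"", "/", "/default.htm", "/index.htm", "/index.html"}


def home_redirect_target(url):
    """Return the canonical home URL if `url` is a non-canonical home page variant, else None."""
    # Bare host names such as "mysite.com" have no scheme, so add one before parsing.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    is_home = host in HOME_HOSTS and parts.path.lower() in DEFAULT_DOCS and not parts.query
    if is_home and url != CANONICAL_HOME:
        return CANONICAL_HOME  # the server would answer with a 301 to this URL
    return None
```

A request for mysite.com or www.mysite.com/default.htm would be redirected, while the canonical URL itself and deeper pages are left alone.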

Choose between relative vs. absolute links

To further emphasize your canonicalized URL, always use absolute URLs for internal links. What does this mean? It’s simple. Use the entire URL to point to the linked page rather than a file address that is relative to the home page of your site. For example, if your home page links to a page named contactus.htm stored in a directory named media, format the value for the href attribute to use the page’s full URL, as in 

<a href="http://www.mysite.com/media/contactus.htm">Contact us</a>

instead of the shorter, relative directory reference used for the page, as in

<a href="media/contactus.htm">Contact us</a>

The use of absolute links reinforces the use of your full URL and, like canonicalization, focuses the link juice to that URL. Because of this, be sure to use absolute links in your intra-site navigation scheme, which is the most often used mechanism for accessing your internal pages. External, inbound links have to do this to reach anything other than your home page. Why not contribute a little bit of your own effort in adding to the link juice for your internal webpages?

Another reason to use absolute URLs is as a safeguard against theft. No, not theft of your nice, new 24-inch monitor, but of your more valuable site content. It is a sad fact of life on the Web that there are lazy folks out there who will simply copy and paste someone else’s content into their website. If you use absolute links for your inline links, the links in your stolen content will most often take the reader of the plagiarized content back to the source: your site!

In the end, using relative links is not a bad thing. Not at all. It’s just that absolute links are a better choice for SEO.
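If your pages already contain relative links, converting them is mechanical. The sketch below uses Python's standard urljoin to resolve a relative href against the URL of the page that contains it. The base URL is the article's example domain, and the function name is just an illustration.

```python
from urllib.parse import urljoin

# Assumed canonical base for the site (the article's example domain).
CANONICAL_BASE = "http://www.mysite.com/"


def absolutize(href, page_url=CANONICAL_BASE):
    """Resolve `href` against the URL of the page that contains it."""
    return urljoin(page_url, href)
```

Links that are already absolute pass through unchanged, so the function is safe to run over every href on a page.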

Use proper URL syntax in the anchor tag

When referencing URLs in the href attribute of your anchor tags, following these standard URL formats will optimize the link for SEO. Consistency is especially important for your site’s canonical internal links.

  • For URLs that point to the default or index file of a directory, omit the default file name and instead end the URL with the folder name, always followed by the trailing “/” (as in http://www.mysite.com/). This is true even for default file name URLs that use dynamic parameters (as in http://www.mysite.com/?var=1)
  • For URLs that are not the default file for a folder, it is fine to include the file name in the full URL, even if there is no dynamic parameter (as in http://www.mysite.com/article.htm)
  • For URLs that include ampersands (typically used between sets of dynamic attributes), substitute the equivalent escape code &amp; for the single ampersand character (&) to enable the page to pass HTML validation checks (as in http://www.mysite.com/?var=1&amp;var=2)
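The rules in the list above can be applied automatically when you generate links. Here is a rough Python sketch that drops a default file name and escapes ampersands for use inside an href attribute. The set of default document names is an assumption, so match it to whatever your web server actually serves.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit

# Assumed default document names; adjust to your web server's configuration.
DEFAULT_DOCS = ("default.htm", "index.htm", "index.html")


def format_href(url):
    """Drop a default file name from the path and escape the URL for an HTML attribute."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    folder, _, filename = path.rpartition("/")
    if filename.lower() in DEFAULT_DOCS:
        path = folder + "/"  # keep the trailing slash after the folder name
    elif path == "":
        path = "/"  # a bare host name still gets its trailing slash
    # escape() turns each "&" into "&amp;" so the markup passes validation.
    return escape(urlunsplit((scheme, netloc, path, query, fragment)), quote=True)
```

For example, http://www.mysite.com/default.htm?var=1&var=2 comes out as http://www.mysite.com/?var=1&amp;var=2, while http://www.mysite.com/article.htm is returned as is.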

Use title attribute in anchor tags for internal links

Of course, if you’ve been keeping up with this column, you already know to use relevant keywords and phrases in the anchor tag text you write for your internal links. This helps search engines develop relevance between those keywords and the page the link references. To further develop keyword relevance for those pages, also add the title attribute to your anchor tags. An example might go something like this:

<a href="http://www.mysite.com/newpage.htm" title="keyword or key phrase describing the linked page">Keyword or phrase about the content of the linked page</a>

Think of the anchor text as your primary description of the linked page. But if you do inline linking within the paragraphs of your body text, you need to maintain the natural, logical flow of the language in the paragraph, which can limit your link text description. As such, you can use the title attribute to add keyword information about the linked page without adversely affecting the readability of the text for the end user.

Identify the canonical URL for each page

We’ve already talked a little here about canonicalization and how that works for your home page. But what about other pages that can have multiple URLs? This is commonly a problem with sites that employ dynamic URLs. A large number of varying dynamic parameters can be applied to a URL that all ultimately go to the same page. But if you want to help the search engines determine which should be the canonical URL for a given page, even when that differs from the URL you are using in the links to that page, you can use the <link> tag within the <head> section of the page to identify the canonical URL for that page to the bot. An example looks like this:

<link rel="canonical" href="http://www.mysite.com/products.aspx?item=doodad" />

This will apply even when there are no links on your site that actually use that URL (assuming all of the internal links to that page are employing additional dynamic parameters). Just be sure the URL listed as canonical actually resolves!

The canonical attribute is a relatively new feature for search engines, and while we are ramping up support for it, think of this data as more of a hint than a directive to us.
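If you want to audit which canonical URL a page actually declares, a small script can read it back out of the markup. This is a minimal sketch using Python's built-in HTML parser. The class and function names are made up for illustration, and it does not check that the URL resolves.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonical(html_text):
    """Return the canonical URL if exactly one is declared, else None."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    finder.close()
    return finder.canonicals[0] if len(finder.canonicals) == 1 else None
```

Returning None when a page declares zero or several canonical URLs is a design choice for this sketch: either case is worth a manual look.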

Minimize the number of parameters in dynamic URLs

While we are talking about URLs with dynamic parameters, be aware that these can become problems for bots that want to crawl your site. Dynamic URLs are often used by affiliate sites to brand certain product pages that are otherwise identical content-wise, and the bots will pick up on the duplication. Indexing that data will often result in removing duplicates, and the version you wish to keep may not be the one that is retained in the search engine index. Minimize the use of dynamic URLs as much as possible to reduce the incidence of this potential issue.

Now truth be told, MSNBot, the crawler used by Bing, can read and follow URLs using more than 30 variables. Limitations in the bot’s abilities are not the problem. The problem is that the arbitrary order and number of variables used in a URL can create a nearly infinite number of permutations for the same ultimate content, and that duplication is what hurts. As such, the fewer variables you use, the fewer duplicates there will be for your pages, which is good for getting the right page from the index onto the search engine results page (SERP).
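One way to keep permutations in check on your own side is to emit parameters in a fixed order and leave out the ones that don't change the content. The Python sketch below shows the idea. The ignored parameter names are hypothetical, so substitute whatever your site really uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that do not change the page content.
IGNORED_PARAMS = {"affiliate", "ref", "sessionid"}


def normalize_query(url):
    """Drop ignored parameters and sort the rest so equivalent URLs compare equal."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    pairs = [
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit((scheme, netloc, path, urlencode(sorted(pairs)), fragment))
```

With this in place, products.aspx?item=doodad&color=red&affiliate=42 and products.aspx?affiliate=7&color=red&item=doodad both reduce to products.aspx?color=red&item=doodad.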

Avoid using session IDs or cookies

Bots will fail to crawl your site if your links carry session IDs or require cookies. Bots usually can’t accept such tracking mechanisms. As a result, they will never get access to the parts of your website that require these elements, which means those pages won’t get indexed by the search engines.

Have at least one internal link to every page

No man is an island. No HTML page should be, either! Every page should be linked to at least once by your other internal pages within your site. Otherwise the bot may never find or index it.
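Checking for islands is easy to automate once you have a map of your internal links. The sketch below assumes you have already gathered that map into a dictionary (page path to the list of pages it links to), which is a made-up structure for this example, and then walks it from the home page.

```python
from collections import deque


def find_unreachable(links, start):
    """Return pages in `links` that cannot be reached from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # Anything listed in the map but never visited has no internal path leading to it.
    return sorted(set(links) - seen)
```

A page that links out to the rest of the site but has no links pointing to it will still show up in the result, which is exactly the case a bot would miss.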

Avoid pages with nothing but a long list of non-contextual links

While the point of the Web is to have sites link to others in a web of interconnected pages, too much of a good thing is not always better. In fact, it can be bad! Pages that offer nothing but an endless list of context-less links are of very little value to users, and thus are not given much value by search engines. Again, the old adage of pursuing quality over quantity usually holds true.

Now if you actually have a bushel or three of very high quality and relevant outbound links, which are presented in logical context for the user, bully for you! And of course, an HTML sitemap page can contain numerous internal links, which is not a bad thing, either (but at least try to organize the sitemap list of links so they are easier for the user to consume). Just note that very long lists of links with no context are not helpful for anyone, including you.

Prevent the bot from following a link

Use the rel="nofollow" attribute in your anchor tags to identify links you don’t want followed, such as links to untrusted sources like forum comments, or to a site that you are not certain you want to endorse (remember, your outbound links are otherwise considered to be de facto endorsements of the linked content). An example of the code looks like this:

<a rel="nofollow" href="http://www.untrustedsite.com/forum/stuff.aspx?var=1">Read these forum comments</a>

To block the bot from indexing content on your site (such as authentication or shopping cart pages, to improve crawling efficiency), don’t do it through this attribute in the anchor tag. Instead, either use your robots.txt file or add a <meta> tag with the name robots to block the bot’s access to that content.
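Before you publish a robots.txt file, it can help to test the rules against a few of your own URLs. Here is a small sketch using Python's standard robots.txt parser. The rules, the paths, and the user agent string are placeholders for illustration, not a recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every bot out of the cart and sign-in areas.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /signin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, user_agent="msnbot"):
    """Report whether the sample rules allow `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)
```

With these sample rules, a shopping cart URL under /cart/ is reported as blocked while a product page elsewhere on the site is still open to the bot.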

We’re still ramping up on how to make your site architecture more robust for SEO. Next up in our site architecture series: content issues. If you have any questions, comments, or suggestions, feel free to post them in our SEM forum. Catch you later…

— Rick DeJarnette, Bing Webmaster Center