This is the last of five posts on the topic of conducting your own site reviews. In the previous posts, we discussed why you’d want to perform a site review (Part 1), then took an initial look at page-level issues (Part 2), followed by a discussion of site-wide issues (Part 3 and Part 4) that can affect site performance for users and search engine ranking. In this last post, we’ll look at additional, architectural issues that should also be examined in a site review.
Duplicate content conundrums
Aside from the duplicate indexed content that arises from a lack of canonicalization (discussed in Part 4), sites that use the secure HTTPS protocol can also end up with duplicate content in the index (both HTTP and HTTPS versions of the same page) if the secure content is not deliberately organized on the site to be separate from the main HTTP content.
Because search engines view duplicate content as a waste of valuable space in their indexes, they make a concerted effort to purge duplicate content when it’s identified. The challenge is that the index may not retain the version of your content you want it to keep (that’s why canonicalization can be so helpful). Designing your site to put HTTPS content in a specific directory location (rather than starting at the root with the non-secure content), and then employing techniques that prevent the search bot from indexing the content in that directory (we’ll discuss using a robots.txt file in a moment), can help with this strategy.
Site review task: Review what content is in the search indexes and implement strategies to eliminate any redundant content found. If your site uses HTTPS content, isolate it into its own location on your web server so that it can be blocked from crawler access.
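As a minimal sketch of that isolation strategy, the robots.txt entry below blocks crawler access to a hypothetical /secure/ directory holding the site’s HTTPS content (the directory name is an example, not a requirement):

```
# Keep bots out of the directory that holds the HTTPS versions of pages
User-agent: *
Disallow: /secure/
```

With the secure pages gathered under one path, a single directive covers them all; scattered secure pages would each need their own entry.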
Block indexing with robots.txt
Not every webmaster uses a robots.txt file to instruct search bots as to which files and directories aren’t allowed to be crawled. But if you do, it’s critical that it’s properly formatted and configured. Stories abound about disgruntled former consultants or inept webmasters configuring a site’s robots.txt file to block all search bot crawling activity on a site, effectively killing all references to it in the search engine index.
The robots.txt file uses text-based directives written in the Robots Exclusion Protocol (REP) to block bot access to specified files, directories, or the entire site. You can specify crawl blocks for one, some, all, or none of the various search engine bots on the Web. You can even configure the crawl rate for bots (if the default bot crawl rate is too aggressive for your servers). Blocking bot access to directories or pages that contain private data, such as scripts, user-specific info, test data, or as-yet-unreleased pages, is easily managed with a robots.txt file. Note, however, that the use of robots.txt is not a security measure. The robots.txt file itself is always publicly accessible in the root directory of a website, so anything listed there to be blocked from indexing is still easily seen by interested parties.
Also note that REP directives can be issued in the <meta> tags at the top of a webpage’s code and in the web server’s HTTP headers. Check for unexpected exclusion directives in those locations as well if your webpage content is not being crawled as expected.
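For reference, those page-level directives look like the following sketch. The meta tag goes in the page’s <head> section; the header-based equivalent (the X-Robots-Tag header) is emitted by the web server. A stray “noindex” in either location will keep a page out of the index even when robots.txt allows crawling:

```
<!-- In the page's <head>: keep this page out of the index, but follow its links -->
<meta name="robots" content="noindex, follow">
```

The equivalent server-side directive appears in the HTTP response as a header line such as `X-Robots-Tag: noindex`, which is easy to overlook when auditing only page source.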
When creating or editing your robots.txt file in a text editor, be sure to note the following common behavior for bots that have customized directives created just for them. In addition to the generic section of directives applicable to all bots, if you add a custom section for a specific search bot, that bot will likely ignore the directives found in the generic section. As such, be sure to duplicate all of the generic directives you want that bot to follow in addition to the custom directives.
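To illustrate that behavior, here is a sketch of a robots.txt file with a generic section plus a custom section for a hypothetical crawler named ExampleBot (the bot name and paths are placeholders). Note that the generic directives are repeated in the custom section, because a bot with its own section will likely read only that section:

```
# Generic section: applies to every bot without a custom section below
User-agent: *
Disallow: /test/
Disallow: /scripts/

# Custom section for one bot. The generic directives are duplicated here,
# since this bot will ignore the generic section above.
User-agent: ExampleBot
Disallow: /test/
Disallow: /scripts/
Crawl-delay: 10
```

The Crawl-delay line shows the crawl-rate configuration mentioned above; support for it varies by search engine, so check each bot’s documentation.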
Once you’ve created your site’s custom robots.txt file, validate it. You can use the Bing Robots.txt Validation Tool for this. Once the file has passed validation, upload it to the root directory of your website. Search bots always look for a file named robots.txt in the root directory before they crawl a site. If it’s there, REP-compliant bots will automatically read it and obey any applicable directives it contains.
To learn more on the proper format and syntax of the content allowed in a robots.txt file, see the blog post Prevent a bot from getting “lost in space” (SEM 101). To see if any of your site’s pages are unintentionally blocked by your robots.txt file, check out the crawler details information in the Bing Webmaster Tools.
Site review task: Check to see if your site employs a robots.txt file. If so, confirm that the directives block only what you want blocked. If one or more sets of custom directives exist for specific bots, be sure they include all necessary directives from the generic set of directives. Then be sure the file has been validated. Also, check for and review the validity of any REP directives in webpage <meta> tags and in HTTP headers.
Reveal the heart of your site with a Sitemap
Whereas a robots.txt file tells search bots what not to crawl, a Sitemap identifies the most important content on your site for indexing. A well-done Sitemap can help your site be crawled more effectively, and that can only help in optimizing your search ranking for indexed content.
However, don’t make your Sitemap a mere URL dump of every page on your site. Include just the important content pages you want targeted for indexing. Then, over time, as you update existing pages and add new ones, update your Sitemap as well so that the bots see references to your latest, freshest content.
Unlike with robots.txt files, search bots do not automatically look for and use Sitemap files. To get the most value from your Sitemap development efforts, you need to submit the Sitemap directly to the various search engines. Once you have, bots will know to look for the file on your site whenever they visit.
Sitemaps have a standardized structure based on XML tagging and can accommodate up to 50,000 URL listings. For very large sites that need to identify more content pages, Bing supports the use of Sitemap index files, which let webmasters reference up to 50,000 standard Sitemap files from a single index. For information on Sitemap structure, code syntax, and submission methods, see the blog post Uncovering web-based treasure with Sitemaps (SEM 101).
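For context, here is a sketch of that XML structure, using placeholder URLs. The first document is a minimal Sitemap; the second (a separate file) is a Sitemap index that points to standard Sitemap files:

```
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap-products.xml: a minimal Sitemap listing one content page -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/products.html</loc>
    <lastmod>2011-06-01</lastmod>
  </url>
</urlset>

<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap-index.xml: a separate Sitemap index file referencing the above -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>
```

The filenames and dates here are illustrative; see the linked SEM 101 post for the full tag set and submission details.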
Please check the validity of the URLs used in your Sitemap file before the Sitemap is submitted to Bing. Search engines typically have a very low tolerance for broken link errors in Sitemaps. Bots will discard Sitemap file data if they encounter too many broken URLs.
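As a hedged sketch of that pre-submission check, the short Python script below parses a Sitemap document and lists the URLs it contains; you could then verify each URL (for example, by requesting it and confirming a 200 response) before submitting. The sample XML and URLs are placeholders:

```python
import xml.etree.ElementTree as ET

# Sitemap files use this XML namespace on every element
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(sitemap_xml):
    """Parse a Sitemap document and return the <loc> values it lists."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

# A placeholder Sitemap for illustration
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/products.html</loc></url>
</urlset>"""

for url in extract_urls(sample):
    # In a real check, request each URL here (e.g., with urllib.request)
    # and flag anything that does not return a successful response.
    print(url)
```

This only extracts the URL list; the actual fetch-and-verify step is left out because it depends on your site and network environment.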
Site review task: Review your Sitemap strategy. If you constantly refresh or update your content, plan to regularly refresh your Sitemaps as well. Validate the URLs used in the Sitemaps. Be sure to submit your first Sitemap to the search engines so that they know to look for that file within your website directory structure whenever they visit.
Page load time
An important SEO factor search engines use in measuring the value of a website is page load time. Servers that host huge pages (such as those with too much high-resolution image content), have too little bandwidth for their regular traffic, run at or near maximum capacity, are intermittently unavailable, or are physically located so far from their target market that performance is hampered by too many time-consuming router hops can all suffer ranking consequences due to these problems.
Bots that cannot quickly and easily navigate through your site may crawl fewer pages or, if the problem is severe enough, abandon the crawl effort altogether. This affects not only how many pages are indexed and how often crawls are performed; if the bots determine there is a consistent pattern of excessively slow page load times, your ranking for the pages already in the index can be adversely affected as well. And the key reason this is a problem is that slow page load times affect customers attempting to visit your site, so Bing takes the magnitude of that experience into consideration when assessing rank.
One solution large sites employ to improve page load time is to enable HTTP compression on their site. To learn more about how to use this technology on your website, see the blog post Optimizing your very large site for search – Part 2. To test whether your web server is capable of supporting this service, use the Bing HTTP Compression Test Tool.
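As one example of how compression is enabled, if your site happens to run on Apache with mod_deflate, a configuration along these lines compresses the text-based responses that benefit most. Treat this as a sketch under that assumption, and consult your own server’s documentation:

```
<IfModule mod_deflate.c>
  # Compress the text-based content types that benefit most from gzip
  AddOutputFilterByType DEFLATE text/html text/plain text/css text/xml application/javascript
</IfModule>
```

Already-compressed formats such as JPEG images gain little from this and are typically left out of the filter list.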
Site review task: Work with your IT department or your host provider service to test page load response times at various times of the day over many days. If problems are uncovered, look for patterns of high response times based on time or location that will help indicate the source of the bottlenecks. Employ HTTP compression if possible.
Geo-location is everything
Speaking of server location, are your servers physically located in your target market’s region? Hosting servers on another continent for a targeted US audience, for example, can present more of a ranking problem than just the page load time added by distance latency. Search engines can use the IP addresses of host servers as relevance factors for the targeted audience. If your website’s IP address is located outside its targeted market region, that can affect the perceived relevance of your website to that audience; as a result, all other things being equal, your ranking could suffer compared to competitors located within the market region. Top-level domains (TLDs) can also be a relevance factor for the same reason. While .com is a generic TLD most commonly used in the US market (although exceptions abound), a country-code TLD is associated with its national audience. Hosting a website on a TLD that is external to the target audience can negatively influence its ranking in that target market due to relevance factors.
One way to strengthen the site’s association with its intended market is to add a text-based, physical street address to the content of the home page (assuming there is a local address in the target market’s location).
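For example, a simple, crawlable street address on the home page might look like the following (the business name and address are placeholders):

```
<address>
  Example Widgets, Inc.<br>
  1234 Main Street<br>
  Seattle, WA 98101, USA
</address>
```

Keeping the address as text (rather than embedding it in an image) is what makes it visible to search bots as a regional relevance signal.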
Site review task: Review your web server’s site host location and its TLD as it pertains to the target market. Add references to a physical location in the content to increase regional market relevance if necessary.
Post site review process
Once you’ve completed your site review, finish the job: implement the changes needed to resolve the issues you identified on your site. If yours is like most organizations, available resources will dictate when (and even how much of) your optimization task list gets done. Bottom line: make an implementation plan, prioritize the work items, and stick to it. The best plans are worthless if they are not followed.
As part of the implementation plan, be sure to include efforts to regularly monitor the status and performance of your site. Use the various webmaster toolsets from the search engine companies, as well as web analytics, to measure how the optimization changes you implement affect your page-indexing rate, your site ranking, your search engine results page (SERP) click-through rate, and your conversion rate. Then compare those numbers against the analytics measurements you took before the optimization work began to confirm the success of your efforts.
Do it again
Then, after you have appropriately waited for the search engine bots to find your optimized site architecture and content, plan to rerun the checks you ran at the beginning (identified in Part 1) all over again, both at the search engine command line and in web analytics. Check your progress and continue to implement the changes based on your priorities.
Then plan to do another site review within a prescribed, reasonable amount of time. After all, SEO is not a one-time event. It’s an ongoing process that takes time and effort. Just as the Web is not a static place (and hopefully the same goes for your website!), the world of the Web is constantly changing around you. You need to work to keep your website current with your competition, your customers, and your conversion goals.
It’s a wrap
In this five-part site review series, we’ve covered a lot of items to consider in a site review. By looking at your site through the eyes of a search engine bot (as well as through the eyes of your potential customers), identifying the improvements that could be made, and then implementing them, you’ll be poised to see both users and search engines rally around your cause.
Ultimately, when it comes to earning good ranking, developing a site that’s beneficial to end users has to be the highest goal. Creating and publishing great content for those users will always be very important. But a site that’s technically valid, free of configuration errors and malware, logically structured, and well connected with relevant, valid links will go a very long way toward making the content you provide to your users stand out from the crowd.
If you have any questions, comments, or suggestions, feel free to post them in our SEM forum. Thanks for tuning in! Keep on rolling…
— Rick DeJarnette, Bing Webmaster Center