Stephan Spencer's Scatterings

The Scattered Wisdom of a scientist turned web marketing virtuoso

Ask Jeeves wants your Robots.txt!

David Naylor from Bronco, who was one of the speakers at the Organic Listings Forum session at the Search Engine Strategies conference, advised site owners to have a robots.txt file, even if it's just an empty file, because Ask Jeeves' spider seems to favor web sites that have one.

Has anyone noticed an improvement in their Ask Jeeves presence after creating a robots.txt file?

Of course there's also the side benefit that you'll eliminate all those "File Not Found" error messages for robots.txt in your server error log, which tend to overwhelm the error log, making it harder to spot more concerning error messages. That assumes of course that you actually examine your error log on occasion. ;-)
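For anyone wondering what an empty robots.txt actually tells a well-behaved crawler, here's a quick sketch using Python's standard urllib.robotparser module. The example.com URLs and the "ExampleBot" user-agent name are placeholders of mine, and this says nothing about how Ask Jeeves' spider treats sites; it only illustrates that an empty file disallows nothing.

```python
from urllib.robotparser import RobotFileParser

def allowed_by(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# An empty robots.txt contains no rules, so nothing is disallowed.
empty_ok = allowed_by("", "ExampleBot", "http://www.example.com/catalog/")

# For comparison, a file that keeps every crawler out of /private/.
rules = "User-agent: *\nDisallow: /private/\n"
private_ok = allowed_by(rules, "ExampleBot", "http://www.example.com/private/page.html")
public_ok = allowed_by(rules, "ExampleBot", "http://www.example.com/index.html")

print(empty_ok, private_ok, public_ok)
```

So an empty file costs you nothing in crawl coverage; it simply gives the spider something to fetch instead of a 404.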

Posted by Stephan Spencer on 12/10/2005 | Permalink

Comments (2) | Comments RSS | Filed under: Search Engines

What's wrong with Google Sitemaps

Last Friday it seemed like the whole blogosphere was abuzz with the news that Google had unveiled its new Google Sitemaps service, a free inclusion service where you publish an XML file of your site's pages to Google so its spider can get a better sense of what to crawl on your site. This is good news, especially for dynamic sites that aren't getting fully indexed. I appreciate Google once again showing its thought leadership. Not only is Google giving webmasters a new way to relay information about their site structure to its spiders, but it's sharing this new technology with the other search engines by releasing the protocol and code as open source.
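To give a feel for what such an XML file looks like, here's a minimal sketch that builds one with Python's standard xml.etree.ElementTree. The page URLs and dates are made up, and the namespace shown is the one used by the current sitemaps.org protocol, which is an assumption on my part rather than the schema URL from Google's original release. The lastmod and changefreq fields are the optional per-URL metadata the protocol lets you supply.

```python
import xml.etree.ElementTree as ET

# Namespace of the current sitemaps.org protocol (assumed here; the original
# Google Sitemaps release used its own schema URL).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build sitemap XML from a list of dicts with 'loc' plus optional metadata."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for field in ("loc", "lastmod", "changefreq"):
            if field in page:
                ET.SubElement(url, field).text = str(page[field])
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical pages; a real site would generate this list from its database.
pages = [
    {"loc": "http://www.example.com/", "lastmod": "2005-06-03", "changefreq": "daily"},
    {"loc": "http://www.example.com/product?id=42", "lastmod": "2005-05-28"},
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```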

This all sounds wonderful, but there are two major problems with Google's approach.

  • First, it doesn't solve the duplicate pages problem that a great many dynamic sites have. Even the Google Store suffers from this (which I blogged about previously, but here's a more recent example of a Google Store product page being duplicated multiple times in Google's index). The Google Sitemaps protocol does not provide a way for webmasters to convey which pages are duplicates of other pages. A site that gets crawled incorrectly by Googlebot, due to superfluous or non-essential parameters/flags being included in the URLs of links on the pages, will continue to get crawled incorrectly. An "Official Google Sitemaps Team Member" states that the sitemap XML file will merely augment their crawl; it won't replace existing pages in the index:

    This program is a complement to, not a replacement of, the regular crawl. The benefit of Sitemaps is twofold:
    -- For links we already know about through our regular spidering, we plan to use the metadata you supply (e.g., lastmod date, changefreq, etc.) to improve how we crawl your site.
    -- For the links we don't know about, we plan to use the additional links you supply, to increase our crawl coverage.

    The high-level Google engineer who goes by GoogleGuy in the online forums explains Google Sitemaps in this way:

    Imagine if you have pages A, B, and C on your site. We find pages A and B through our normal web crawl of your links. Then you build a sitemap and list the pages B and C. Now there's a chance (but not a promise) that we'll crawl page C. We won't drop page A just because you didn't list it in your sitemap. And just because you listed a page that we didn't know about doesn't guarantee that we'll crawl it. But if for some reason we didn't see any links to C, or maybe we knew about page C but the url was rejected for having too many parameters or some other reason, now there's a chance that we'll crawl that page C.

    So, the way I read GoogleGuy's explanation, if pages A and C are essentially duplicates of each other, with A containing an additional superfluous parameter in its URL (like sortby=default or lang=english), then BOTH could end up in Google's index. Thus, Google Sitemaps won't reduce the amount of duplication in Google's index; in fact, I believe it will increase it.

    Duplication, on its own, may not sound like as much of a problem for webmasters as it is for Google itself, which has to dedicate additional resources to maintaining all this redundant content in its index. However, it does have serious implications for webmasters, because it results in PageRank dilution, where multiple versions of a page split up the "votes" (links) and PageRank score that a single version of the page would aggregate.

  • This brings me to the second, related problem with Google Sitemaps: it doesn't do anything to alleviate the phenomenon of PageRank dilution. PageRank dilution results in lower PageRank, which in turn results in lower rankings. For example, consider that the above-mentioned Google Store's product page (the "Black is Back T-Shirt") is in Google's index 5 times instead of just once. So each of those 5 variations earns only a fraction of the total potential PageRank score that it could have earned if all the links pointed to a single "Black is Back T-Shirt" page.

    Google Sitemaps needs to provide a way to convey, or to sync up with, the site's hierarchical internal linking structure, so that it's clear which pages should get how much of a share of the PageRank flowing into the site's home page. Since the primary holder of PageRank score is the home page (that is, after all, the page that most everyone links to), it's up to the site's internal hierarchical linking structure to pass the PageRank of the home page to the rest of the site. As such, a page that is 2 clicks away from the home page will get a much larger share of PageRank score passed on to it from the home page, versus a page that is 5 clicks away from the home page.
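The "clicks away from the home page" idea is easy to measure for your own site. Here's a rough sketch that computes click depth with a breadth-first search over an internal link graph. The site structure is entirely hypothetical, and click depth is only a crude proxy for how much PageRank flows to a page; it is not Google's formula.

```python
from collections import deque

def click_depths(links, home):
    """Breadth-first search: fewest clicks needed to reach each page from home."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical site: each page maps to the pages it links to.
links = {
    "/": ["/apparel", "/office"],
    "/apparel": ["/apparel/shirts"],
    "/apparel/shirts": ["/apparel/shirts/black-is-back"],
    "/office": [],
}
depths = click_depths(links, "/")
for page, depth in sorted(depths.items(), key=lambda item: item[1]):
    print(depth, page)
```

Pages that turn up deep in this listing are the ones most likely to be starved of the home page's PageRank.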

Here's how I suggest both of the above issues be rectified: by extending robots.txt with some additional directives that specify:

  • which parameter in a dynamic URL is the "key field"
  • which parameter is the product ID and which is the category ID (specifically for online catalogs)
  • which parameters are superfluous or don't significantly vary the content displayed

Armed with this information, Googlebot will be able to not only eliminate duplicate pages but also intelligently choose the most appropriate version to save in its index and then associate with that page the PageRank of ALL versions of the page. The days of session IDs killing a site's Google visibility would be over! Google admits in its Sitemaps FAQ that session IDs are still a problem even with the advent of Google Sitemaps:

Q: URLs on my site have session IDs in them. Do I need to remove them?

Yes. Including session IDs in URLs may result in incomplete and redundant crawling of your site.
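To make the proposal above more concrete, here's a minimal sketch of what a spider could do once it knows which parameters are superfluous: strip them, settle on one canonical URL, and credit that URL with the links pointing at every variant. The parameter names, the store.example.com URLs, and the link counts are all hypothetical; no such robots.txt directives exist today, and this is not how Googlebot works.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical "directives": parameters a site has declared superfluous.
SUPERFLUOUS = {"sortby", "lang", "sessionid"}

def canonicalize(url, superfluous=SUPERFLUOUS):
    """Drop superfluous query parameters and sort the rest for a stable form."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in superfluous
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def aggregate_links(inbound_links):
    """Sum inbound link counts over all URL variants of the same page."""
    totals = defaultdict(int)
    for url, count in inbound_links.items():
        totals[canonicalize(url)] += count
    return dict(totals)

# Four variants of one product page, each with its own (made-up) link count.
inbound = {
    "http://store.example.com/item?id=7": 4,
    "http://store.example.com/item?id=7&sortby=default": 3,
    "http://store.example.com/item?lang=english&id=7": 2,
    "http://store.example.com/item?id=7&sessionid=abc123": 1,
}
totals = aggregate_links(inbound)
print(totals)
```

Instead of four pages with a handful of links each, the single canonical page gets credit for all ten.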

Remember, getting indexed only gets you to the party; it doesn't mean you're going to be popular at the party. Google Sitemaps may help you get more pages indexed, but if those pages all have a PageRank score of 0, then what was the point? It'll be like sitting along the wall the whole time with no one asking you to dance!

GravityStream, our SEO proxy technology (the concept of SEO proxies is explained in my article in Catalog Age last October), deals with PageRank dilution by distilling URLs in links into their lowest common denominator and replacing them on the proxy. We've found that, even as Googlebot gets more aggressive at spidering dynamic sites with complex URLs and starts indexing one of our clients' sites more fully, our proxy still has a major leg-up on the native site that it's proxying. For example, our GravityStream proxy of PETsMART.com is #1 in Google for "best pet toys", and yet the corresponding page on the PETsMART.com native site is nowhere in the first 10 pages of results even though it is indexed. Until Google extends Google Sitemaps to deal with PageRank dilution, I'd expect that a GravityStream proxy will still trump a native site, even if it's using Google Sitemaps. That means that currently, despite Google Sitemaps, GravityStream still plays an important role for online retailers. Nonetheless, it's my sincere hope that Google takes my feedback on board and reworks its protocol!

Posted by Stephan Spencer on 06/06/2005 | Permalink

Comments (0) | Comments RSS | Filed under: Search Engines

Google's index hits 8 billion pages. Yes folks, size does matter.

On Wednesday, the day before Microsoft unveiled the beta of Microsoft Search, Google announced that their index was now over eight billion pages strong. Impeccable timing from the Googleplex. Just a couple days later, and Microsoft could have proudly touted its bigger web page index over Google's. Still, Microsoft's 5 billion documents is an impressive feat, particularly for a new search engine just out of the blocks. Google continues to show their market dominance, however, with a database of a whopping 8,058,044,651 web pages. Poor Microsoft, trumped by Google at the last minute!

Why the big deal about index size? From the user's perspective, a search engine that is comprehensive of the Web in its entirety is going to be more useful than one whose indexation is patchy. Which is why I think the Overture Site Match paid inclusion program from Yahoo! is a really bad idea. Sites shouldn't pay the search engine to be indexed. Rather, the search engine should strive to index as much of the Web as possible because that makes for a better search engine.

Indeed, I see Google's announcement as a landmark in the evolution of search engines. Search engine spiders have historically had major problems with "spider traps" — dynamic database-driven websites that serve up identical or nearly identical content at varying URLs (e.g. when there is a session ID in the URL). Alas, search engines couldn't find their way through this quagmire without severe duplication clogging up their indices. The solution for the search engines was to avoid dynamic sites, to a large degree — or at least to approach them with caution. Over time, however, the sophistication of the spidering and indexing algorithms has improved to the point that search engines (most notably, Google) have been able to successfully index a plethora of previously un-indexed content and minimize the amount of duplication. And thus, the "Invisible Web" begins to shrink. Keep it up, Google and Microsoft!

Posted by Stephan Spencer on 11/14/2004 | Permalink

Comments (2) | Comments RSS | Filed under: Search Engines

Free pass into password-protected content

Many sites that require registration or payment in order to access their premium content have realized that they can't keep the search engine spiders (such as Googlebot and Yahoo Slurp) out of their password-protected areas without taking a serious hit on their search engine traffic and visibility. Therefore, they let the search engine spiders in, but keep humans out (at least those who don't have an account, of course). Smart humans can take advantage of the back doors the spiders get shown by simply going into Google or Yahoo and doing a search that is site-specific (using the site: query operator). Then, in the search results, click on the Cached link in the search listing of the page that you wish to read. No Cached link present? Then try clicking on the title of the search listing. You may get redirected to a password entry page, but in many cases you will get through to the content! This is because subscription sites oftentimes let search engine users go just one page deep without requiring log-in. So, after reading that page, simply go back to the search results and click through again to read another page. This works on LATimes.com, ChicagoTribune.com, Webmasterworld.com, and many others. Try it out. Enjoy!

Posted by Stephan Spencer on 11/02/2004 | Permalink

Comments (2) | Comments RSS | Filed under: Search Engines

Spiders like Googlebot choke on Session IDs

Many ecommerce sites have session IDs or user IDs in the URLs of their pages. This tends to cause the pages either to not get indexed by search engines like Google, or to get included many times over, clogging up the index with duplicates (this phenomenon is called a "spider trap"). Furthermore, having all these duplicates in the index causes the site's importance score, known as PageRank, to be spread out across all these duplicates (this phenomenon is called "PageRank dilution").

Ironically, Googlebot regularly gets caught in a spider trap while spidering one of its own sites - the Google Store (where they sell branded caps, shirts, umbrellas, etc.). The URLs of the store are not very search engine friendly: they are overly complex and include session IDs. This has resulted in 3,440 duplicate copies of the Accessories page and 3,420 copies of the Office page, for example.

If you have a dynamic, database-driven website and you want to avoid your own site becoming a spider trap, you'll need to keep your URLs simple. Try to avoid having any ?, &, or = characters in the URLs. And try to keep the number of "parameters" to a minimum. With URLs and search engine friendliness, less is more.
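If you'd like to audit your own URLs against this advice, here's a small sketch of a checker. The list of session parameter names and the threshold of two parameters are my own illustrative assumptions, not anything published by the search engines, and the example.com URLs are placeholders.

```python
from urllib.parse import urlsplit, parse_qsl

# Parameter names that commonly carry session state (illustrative, not exhaustive).
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def url_warnings(url, max_params=2):
    """Return a list of reasons a URL might be unfriendly to spiders."""
    warnings = []
    query = urlsplit(url).query
    params = parse_qsl(query, keep_blank_values=True)
    if query:
        warnings.append("contains a query string (?, &, = characters)")
    if len(params) > max_params:
        warnings.append(f"has {len(params)} parameters (more than {max_params})")
    if any(name.lower() in SESSION_PARAMS for name, _ in params):
        warnings.append("carries a session ID")
    return warnings

clean = url_warnings("http://www.example.com/apparel/shirts/")
messy = url_warnings("http://www.example.com/store?cat=3&item=7&sort=asc&sid=9f2c")

print(clean)
for warning in messy:
    print("-", warning)
```

Run something like this over the links in your own pages and the worst offenders become obvious quickly.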

Posted by Stephan Spencer on 06/25/2004 | Permalink

Comments (1) | Comments RSS | Filed under: General