Stephan Spencer's Scatterings

The Scattered Wisdom of a scientist turned web marketing virtuoso

Getting indexed by Google Mini doesn't mean you'll get indexed by Googlebot

Google bills the Google Mini as a search appliance that's able to deliver "the same reliable results you expect from Google web search" to your intranet or public website. I think Google is being misleading here, confusing their customer base into mistakenly believing they will receive the "same results" from the Google Mini as the Google.com web search. Google.com has major points of difference from the Google Mini. For example, Google.com takes numerous spam signals into account when determining indexation and rankings; Google Mini does not scan for the same spam signals. Google.com takes into account the PageRank scores of all inbound links; Google Mini does not.

As discovered by Joel on Software, the Mini never phones home, even to get public PageRank information; the search results it produces are entirely based on whatever documents you told it to crawl.

So, if your site's PageRank is too low once you get into deep pages, then Googlebot may lose interest and may not even index the page, whereas Google Mini would be much more inclusive.

Or if a site contains something that looks dodgy -- like noscript, noframes, hidden divs, tiny text, etc. -- it will be a turn-off for Googlebot but not for Google Mini.

Thus, if you're a Google Mini customer and your site is well-indexed, don't take that to mean that your site will be well-indexed by Googlebot.

Posted by Stephan Spencer on 09/26/2006 | Permalink

Comments (0) | Filed under: Search Engines

Googlebot, parameters and dynamic sites

I previously mentioned that Matt Cutts from Google gave some advice to webmasters of dynamic (database driven) web sites.

For one thing, Matt advised that if you have a dynamic web site, you should minimize the number of parameters in the URL. You're very safe if you have fewer than 2 parameters. Keep the values of those parameters to fewer than 5 digits. And don't name a parameter id, because Google suspects that such a parameter is a session ID or something other than a key field. Even if it's the only parameter in your URLs, try not to use it, particularly if that variable's value is long (like 5 digits or more). sid would be a bad choice too, because it could stand for session ID as much as it could stand for a key field like story ID. Using one of these parameter names doesn't mean that your pages won't be indexed; it just means those pages are at greater risk of not being included. You should be fine, though, if your pages are all already in Google.
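As a rough way to audit your own URLs against this advice, here is a minimal Python sketch. The thresholds come straight from the advice above; the function name, the constant names, and the example.com URLs are made up for illustration, and nothing here reflects how Google actually evaluates URLs.

```python
from urllib.parse import urlsplit, parse_qsl

# Thresholds taken from the advice above; all names are hypothetical.
MAX_SAFE_PARAMS = 1         # "fewer than 2 parameters"
MAX_VALUE_DIGITS = 4        # "fewer than 5 digits"
SUSPECT_NAMES = {"id", "sid"}


def url_warnings(url):
    """Return a list of human-readable warnings for one URL."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    warnings = []
    if len(params) > MAX_SAFE_PARAMS:
        warnings.append(f"{len(params)} parameters (more than {MAX_SAFE_PARAMS})")
    for name, value in params:
        if name.lower() in SUSPECT_NAMES:
            warnings.append(f"parameter named '{name}' may look like a session ID")
        # Only purely numeric values are checked for length here.
        if value.isdigit() and len(value) > MAX_VALUE_DIGITS:
            warnings.append(f"value of '{name}' has {len(value)} digits")
    return warnings


if __name__ == "__main__":
    for warning in url_warnings("http://www.example.com/story.php?id=123456"):
        print(warning)
```

An empty list means the URL passes these particular checks, not that it is guaranteed to be indexed.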

Matt also mentioned something that should be a bit alarming to anyone with a dynamic site: Googlebot sometimes tries variations of URLs by dropping parameters. That is, Googlebot may experiment with removing name-value pairs from the query string portion of your URLs (i.e. the part of the URL that follows the question mark) and seeing if the pages still load. As I understand it, the reasoning is that if these variant pages still show the same content as the page at the original URL, Googlebot has an indication that the omitted parameters are superfluous in the query string. So for example, a URL such as this:

www.bigyellow.com/cgi-bin/php/cities/unitedstates/mtg_detail.php?xsrc=&PID=36575&S=NY&T=&MTG=PR

might be shortened by Googlebot to:

www.bigyellow.com/cgi-bin/php/cities/unitedstates/mtg_detail.php?xsrc=&PID=36575&S=NY&T=

and

www.bigyellow.com/cgi-bin/php/cities/unitedstates/mtg_detail.php?xsrc=&PID=36575

and

www.bigyellow.com/cgi-bin/php/cities/unitedstates/mtg_detail.php?S=NY&T=&MTG=PR

etc.

Then these URL variations would get spidered and compared with each other. I've heard of big websites getting hit by this, and of it causing big problems for them. Don't get too worried about Googlebot doing this to your site if you're not a big and important site: Matt stated that Google only does this deep-level analysis on big, quality sites. Anyone been subjected to this? And if so, what damage or inconvenience did it inflict on you?
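If you want to see which variant URLs your own site might be asked for, the parameter-dropping idea can be sketched in a few lines of Python. This is only an illustration of the concept described above, not Googlebot's actual algorithm; the function name is made up, "http://" has been added in front of the example URL, and the sketch assumes that at least one parameter is always kept.

```python
from itertools import combinations
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def parameter_drop_variants(url):
    """Yield each URL formed by removing one or more (but not all) query parameters."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty pairs such as "xsrc=" and "T=".
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Keep progressively fewer parameters, preserving their original order.
    for keep in range(len(params) - 1, 0, -1):
        for subset in combinations(params, keep):
            yield urlunsplit(parts._replace(query=urlencode(subset)))


if __name__ == "__main__":
    url = ("http://www.bigyellow.com/cgi-bin/php/cities/unitedstates/"
           "mtg_detail.php?xsrc=&PID=36575&S=NY&T=&MTG=PR")
    for variant in parameter_drop_variants(url):
        print(variant)
```

Requesting each variant against a test copy of your site and comparing the responses would show which parameters actually change the content and which ones return errors.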

Posted by Stephan Spencer on 08/27/2005 | Permalink

Comments (0) | Filed under: Search Engines

What's wrong with Google Sitemaps

Last Friday it seemed like the whole blogosphere was abuzz with the news that Google unveiled its new Google Sitemaps service, a free inclusion service where you publish an XML file of your site pages to Google so its spider can get a better sense of what to crawl on your site. This is good news, especially for dynamic sites that aren't getting fully indexed. I appreciate Google once again showing its thought leadership. Not only is Google giving webmasters a new way to relay information about their site structure to its spiders, but it's sharing this new technology with the other search engines by releasing the protocol and code as open source.
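For a sense of what such an XML file contains, here is a minimal Python sketch that builds one. The element names (urlset, url, loc, lastmod, changefreq) are those of the Sitemaps protocol; the namespace shown is the one published at sitemaps.org, which is an assumption here since this post doesn't name one, and the function name and example.com pages are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org protocol (assumed; not named in the post).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from dicts with 'loc' and optional 'lastmod'/'changefreq'."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for field in ("loc", "lastmod", "changefreq"):
            if field in page:
                ET.SubElement(url, field).text = page[field]
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://www.example.com/", "lastmod": "2005-06-06", "changefreq": "daily"},
        {"loc": "http://www.example.com/products"},
    ]))
```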

This all sounds wonderful, but there are 2 quite major problems with Google's approach.

  • First, it doesn't solve the duplicate pages problem that a great many dynamic sites have. Even the Google Store suffers from this (which I blogged about previously, but here's a more recent example of a Google Store product page being duplicated multiple times in Google's index). The Google Sitemaps protocol does not provide a way for webmasters to convey which pages are duplicates of other pages. A site that gets crawled incorrectly by Googlebot, due to superfluous or non-essential parameters/flags being included in the URLs of links on the pages, will continue to get crawled incorrectly. An "Official Google Sitemaps Team Member" states that the sitemap XML file will merely augment their crawl; it won't replace existing pages in the index:

    This program is a complement to, not a replacement of, the regular crawl. The benefit of Sitemaps is two fold:
    -- For links we already know about thro our regular spidering, we plan to use the metadata you supply (e.g., lastmod date, changefreq, etc.) to improve how we crawl your site.
    -- For the links we dont know about, we plan to use the additional links you supply, to increase our crawl coverage.

    The high-level Google engineer who goes by GoogleGuy in the online forums explains Google Sitemaps in this way:

    Imagine if you have pages A, B, and C on your site. We find pages A and B through our normal web crawl of your links. Then you build a sitemap and list the pages B and C. Now there's a chance (but not a promise) that we'll crawl page C. We won't drop page A just because you didn't list it in your sitemap. And just because you listed a page that we didn't know about doesn't guarantee that we'll crawl it. But if for some reason we didn't see any links to C, or maybe we knew about page C but the url was rejected for having too many parameters or some other reason, now there's a chance that we'll crawl that page C.

    So, the way I read GoogleGuy's explanation, if pages A and C are essentially duplicates of each other, with A containing an additional superfluous parameter in its URL (like sortby=default or lang=english), then BOTH could end up in Google's index. Thus, Google Sitemaps won't reduce the amount of duplication in Google's index; in fact, I believe it will increase it.

    Duplicate pages, on their own, may not sound like a problem for webmasters as much as for Google itself, which has to dedicate additional resources to maintain all this redundant content in its index. However, duplication does have serious implications for webmasters, because it results in PageRank dilution, where multiple versions of a page split up the "votes" (links) and PageRank score that a single version of the page would aggregate.

  • This brings me to the second, related problem with Google Sitemaps: it doesn't do anything to alleviate the phenomenon of PageRank dilution. PageRank dilution results in lower PageRank, which in turn results in lower rankings. For example, consider that the above-mentioned Google Store's product page (the "Black is Back T-Shirt") is in Google's index 5 times instead of just once. So each of those 5 variations earns only a fraction of the total potential PageRank score that it could have earned if all the links pointed to a single "Black is Back T-Shirt" page.

    Google Sitemaps needs to provide a way to convey, or to sync up with, the site's hierarchical internal linking structure, so that it's clear which pages should get how much of a share of the PageRank flowing into the site's home page. Since the primary holder of PageRank score is the home page (that is, after all, the page that most everyone links to), it's up to the site's internal hierarchical linking structure to pass the PageRank of the home page to the rest of the site. As such, a page that is 2 clicks away from the home page will get a much larger share of PageRank score passed on to it from the home page than a page that is 5 clicks away from the home page.
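To find out how many clicks each page is from the home page, you can run a breadth-first search over your internal links. The Python sketch below does only that; it does not compute PageRank, and the link map and function name are invented for illustration.

```python
from collections import deque


def click_depths(links, home):
    """Return the minimum number of clicks from `home` to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:       # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


if __name__ == "__main__":
    # Hypothetical internal link map: page -> pages it links to.
    links = {
        "/": ["/shop", "/about"],
        "/shop": ["/shop/shirts"],
        "/shop/shirts": ["/shop/shirts/black-is-back"],
    }
    for page, depth in sorted(click_depths(links, "/").items(), key=lambda kv: kv[1]):
        print(depth, page)
```

Pages that turn up at a depth of 5 or more are candidates for a more prominent internal link.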

Here's how I suggest both of the above issues be rectified: by extending robots.txt with some additional directives that specify:

  • which parameter in a dynamic URL is the "key field"
  • which parameter is the product ID and which is the category ID (specifically for online catalogs)
  • which parameters are superfluous or don't significantly vary the content displayed

Armed with this information, Googlebot will be able to not only eliminate duplicate pages but also intelligently choose the most appropriate version to save in its index and then associate with that page the PageRank of ALL versions of the page. The days of session IDs killing a site's Google visibility would be over! Google admits in its Sitemaps FAQ that session IDs are still a problem even with the advent of Google Sitemaps:

Q: URLs on my site have session IDs in them. Do I need to remove them?

Yes. Including session IDs in URLs may result in incomplete and redundant crawling of your site.
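To make the proposal above concrete, here is a hedged Python sketch of what a crawler could do once it knows which parameters are superfluous: collapse every variant onto one canonical URL and credit that URL with the inbound links of all variants. The parameter list, function names, example.com URLs, and link counts are all hypothetical. No such robots.txt directive exists, and raw link counts are only a stand-in for PageRank.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list a webmaster might declare as superfluous.
SUPERFLUOUS_PARAMS = {"sortby", "lang", "sessionid"}


def canonical_url(url, superfluous=SUPERFLUOUS_PARAMS):
    """Drop superfluous parameters and sort the rest so variants compare equal."""
    parts = urlsplit(url)
    kept = [(name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name.lower() not in superfluous]
    kept.sort()
    return urlunsplit(parts._replace(query=urlencode(kept)))


def merge_inbound_links(link_counts, superfluous=SUPERFLUOUS_PARAMS):
    """Sum inbound link counts across all variants of the same canonical URL."""
    merged = Counter()
    for url, count in link_counts.items():
        merged[canonical_url(url, superfluous)] += count
    return merged


if __name__ == "__main__":
    counts = {
        "http://www.example.com/product?pid=42": 10,
        "http://www.example.com/product?pid=42&sortby=default": 4,
        "http://www.example.com/product?lang=english&pid=42": 6,
    }
    print(merge_inbound_links(counts))
```

In this made-up example, three variants with 10, 4, and 6 links collapse into one page with 20.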

Remember, getting indexed only gets you to the party, it doesn't mean you're going to be popular at the party. Google Sitemaps may help you get more pages indexed, but if those pages all have a PageRank score of 0, then what was the point? It'll be like sitting along the wall the whole time with no one asking you to dance!

GravityStream, our SEO proxy technology (the concept of SEO proxies is explained in my article in Catalog Age last October) deals with PageRank dilution by distilling URLs in links into their lowest common denominator and replacing them on the proxy. We've found that, even as Googlebot gets more aggressive at spidering dynamic sites with complex URLs and starts indexing one of our clients' sites more fully, our proxy still has a major leg-up on the native site that it's proxying. For example, our GravityStream proxy of PETsMART.com is #1 in Google for "best pet toys", and yet the corresponding page on the PETsMART.com native site is nowhere in the first 10 pages of results even though it is indexed. Until Google extends Google Sitemaps to deal with PageRank dilution, I'd expect that a GravityStream proxy will still trump a native site, even if it's using Google Sitemaps. That means that currently, despite Google Sitemaps, GravityStream still plays an important role for online retailers. Nonetheless, it's my sincere hope that Google takes my feedback on board and reworks their protocol!

Posted by Stephan Spencer on 06/06/2005 | Permalink

Comments (0) | Filed under: Search Engines

Spiders like Googlebot choke on Session IDs

Many ecommerce sites have session IDs or user IDs in the URLs of their pages. This tends to cause the pages either to not get indexed by search engines like Google, or to get included over and over, clogging up the index with duplicates (this phenomenon is called a "spider trap"). Furthermore, having all these duplicates in the index causes the site's importance score, known as PageRank, to be spread out across all these duplicates (this phenomenon is called "PageRank dilution").

Ironically, Googlebot regularly gets caught in a spider trap while spidering one of its own sites: the Google Store (where they sell branded caps, shirts, umbrellas, etc.). The URLs of the store are not very search engine friendly: they are overly complex and include session IDs. This has resulted in 3,440 duplicate copies of the Accessories page and 3,420 copies of the Office page, for example.

If you have a dynamic, database-driven website and you want to avoid your own site becoming a spider trap, you'll need to keep your URLs simple. Try to avoid having any ?, &, or = characters in the URLs. And try to keep the number of "parameters" to a minimum. With URLs and search engine friendliness, less is more.
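One way to picture a simpler URL is to drop the session ID and fold the remaining parameters into the path. The Python sketch below shows one possible mapping; the session parameter names, function name, and example.com URL are assumptions, and your web server would still need rewrite rules (not shown) to translate the friendly path back into something your application understands.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, quote

# Assumed names of session parameters; adjust for your own platform.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def friendly_url(url):
    """Rewrite ?name=value pairs as /name/value path segments, minus session IDs."""
    parts = urlsplit(url)
    segments = []
    for name, value in parse_qsl(parts.query):
        if name.lower() in SESSION_PARAMS:
            continue                      # never expose session IDs to spiders
        segments += [quote(name, safe=""), quote(value, safe="")]
    path = parts.path.rstrip("/") + "".join("/" + seg for seg in segments)
    return urlunsplit(parts._replace(path=path, query=""))


if __name__ == "__main__":
    print(friendly_url("http://www.example.com/store.php?cat=office&item=17&sid=ab12cd34"))
```

The result contains no ?, &, or = characters, and every visitor (and spider) sees the same URL for the same page.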

Posted by Stephan Spencer on 06/25/2004 | Permalink

Comments (1) | Filed under: General