Stephan Spencer's Scatterings

The Scattered Wisdom of a scientist turned web marketing virtuoso


Google's index hits 8 billion pages. Yes folks, size does matter.

On Wednesday, the day before Microsoft unveiled the beta of Microsoft Search, Google announced that its index was now over eight billion pages strong. Impeccable timing from the Googleplex. Had Google waited just a couple of days, Microsoft could have proudly touted a bigger web page index than Google's. Still, Microsoft's 5 billion documents is an impressive feat, particularly for a new search engine just out of the blocks. Google continues to show its market dominance, however, with a database of a whopping 8,058,044,651 web pages. Poor Microsoft, trumped by Google at the last minute!

Why the big deal about index size? From the user's perspective, a search engine that covers the Web comprehensively is going to be more useful than one whose index is patchy. That is why I think the Overture Site Match paid inclusion program from Yahoo! is a really bad idea. Sites shouldn't pay the search engine to be indexed. Rather, the search engine should strive to index as much of the Web as possible because that makes for a better search engine.

Indeed, I see Google's announcement as a landmark in the evolution of search engines. Search engine spiders have historically had major problems with "spider traps" — dynamic database-driven websites that serve up identical or nearly identical content at varying URLs (e.g. when there is a session ID in the URL). Alas, search engines couldn't find their way through this quagmire without severe duplication clogging up their indices. The solution for the search engines was to avoid dynamic sites, to a large degree — or at least to approach them with caution. Over time, however, the sophistication of the spidering and indexing algorithms has improved to the point that search engines (most notably, Google) have been able to successfully index a plethora of previously un-indexed content and minimize the amount of duplication. And thus, the "Invisible Web" begins to shrink. Keep it up, Google and Microsoft!
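To make the duplication problem concrete, here is a minimal sketch of how a crawler might collapse URLs that differ only by a session ID. It is not how Google or Microsoft actually do it. The parameter names in SESSION_PARAMS and the helper names canonicalize and dedupe are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session parameter names; real sites use many different ones.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def canonicalize(url):
    """Return the URL with session parameters removed and the rest sorted."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    kept.sort()  # so parameter order does not create new "duplicates"
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def dedupe(urls):
    """Return the distinct canonical URLs found in a list of crawled URLs."""
    return sorted({canonicalize(url) for url in urls})
```

Feeding dedupe a list of crawled URLs that differ only in their session ID should yield a single entry per page, which is roughly the outcome a search engine wants in its index.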

Posted by Stephan Spencer on 11/14/2004 | Permalink

Comments (2) | Comments RSS | Filed under: Search Engines

Spiders like Googlebot choke on Session IDs

Many ecommerce sites have session IDs or user IDs in the URLs of their pages. This tends to cause the pages either not to get indexed by search engines like Google, or to get included over and over, clogging up the index with duplicates (this phenomenon is called a "spider trap"). Furthermore, having all these duplicates in the index causes the site's importance score, known as PageRank, to be spread out across the duplicates (this phenomenon is called "PageRank dilution").
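A toy illustration of the dilution effect, using a plain count of inbound links as a crude stand-in for PageRank (the real score weights links rather than just counting them). The function names and the store.example URLs are made up for this sketch.

```python
from collections import Counter
from urllib.parse import urlsplit


def links_per_url(inbound_targets):
    """Count inbound links for each exact URL, session IDs and all."""
    return Counter(inbound_targets)


def links_per_page(inbound_targets):
    """Count inbound links per page, ignoring the query string entirely."""
    return Counter(urlsplit(url).path for url in inbound_targets)


# Three links to the same page, each carrying a different session ID.
targets = [
    "http://store.example/caps?sid=a1",
    "http://store.example/caps?sid=b2",
    "http://store.example/caps?sid=c3",
]

print(links_per_url(targets))   # each duplicate URL is credited with one link
print(links_per_page(targets))  # the page as a whole has earned three
```

Under this simplified model, every duplicate looks a third as important as the one clean URL would have.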

Ironically, Googlebot regularly gets caught in a spider trap while spidering one of its own sites - the Google Store (where they sell branded caps, shirts, umbrellas, etc.). The URLs of the store are not very search engine friendly: they are overly complex and include session IDs. This has resulted in 3,440 duplicate copies of the Accessories page and 3,420 copies of the Office page, for example.

If you have a dynamic, database-driven website and you want to avoid your own site becoming a spider trap, you'll need to keep your URLs simple. Try to avoid having any ?, &, or = characters in the URLs. And try to keep the number of "parameters" to a minimum. With URLs and search engine friendliness, less is more.
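As a rough way to audit your own URLs against this advice, the sketch below flags any ?, &, or = characters and counts the query parameters. The function name url_report and the shape of its result are assumptions for illustration, not part of any search engine's tooling.

```python
from urllib.parse import urlsplit, parse_qsl


def url_report(url):
    """Summarize how 'simple' a URL is by the rule of thumb above."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    flagged = [ch for ch in "?&=" if ch in url]
    return {
        "param_count": len(params),  # fewer is better
        "flagged_chars": flagged,    # which of ?, &, = appear in the URL
        "simple": not flagged,       # True when none of them appear
    }


for candidate in (
    "http://example.com/shop?cat=3&sid=abc",
    "http://example.com/caps/red",
):
    print(candidate, url_report(candidate))
```

A URL such as /caps/red passes, while one with a category and a session ID in the query string is flagged on all three characters.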

Posted by Stephan Spencer on 06/25/2004 | Permalink

Comments (1) | Comments RSS | Filed under: General