I don’t have much faith in Google’s (or any other engine’s, for that matter) estimated number of results, and I’ve held this view for a long time. I believe it to be a wildly inaccurate number. If you think about it, why would a search engine put a lot of effort or processing power into really nailing that number? Searchers (with the notable exception of SEOs) couldn’t care less whether 100 thousand or 100 million results are returned; they only really care what’s in the top 10.
So it’s fraught with problems to use the estimated number of results as a basis for any SEO metrics. Yet SEOs use the number all the time, for things such as indexation (a site: query), link popularity (a link: query), and keyword competition (e.g. KEI score, or Keyword Effectiveness Indicator). If we can’t trust the estimated numbers for such metrics, then we should probably move on to find other SEO metrics that we can trust.
Yet indexation remains a metric we should care about. So how does one go about checking a site’s Google indexation levels without relying on the demonstrable inaccuracy of Google’s estimated results? Don’t expect it from Google Webmaster Central, although that would be nice. In Webmaster Central’s “Index stats” under the “Statistics” tab, you’ll only find links to Google SERPs for site:, link:, cache:, info:, and related: queries. I’d love it if the Webmaster Central team added reliable stats here that weren’t simply based on estimated results in the SERPs. Even if they did, it still wouldn’t provide me with trustworthy indexation numbers for sites of which I’m not a verified owner/webmaster. I’d need another solution.
At this point the only solution I can think of for getting an exact count of your indexation in Google is to query for each individual URL, then sum all results together.
You’d have to write a script to hammer away at Google, either via the SOAP API if you’re lucky enough to have some old keys (Google discontinued offering websearch API keys), or by scraping the Google SERPs. Remember: to check if a page is indexed in Google, don’t use the bare URL as the query; prefix it with cache: or info:. So, to see if http://www.ifloor.com/gs/cat-8-hardwood-floors-1.html is indexed, you’d query for “cache:http://www.ifloor.com/gs/cat-8-hardwood-floors-1.html”.
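To make the per-URL idea concrete, here is a minimal Python sketch of the counting logic only. It is not a working Google client: the function names `build_index_query` and `count_indexed` are my own, and `is_indexed` is a hypothetical callable that you would have to supply yourself (backed by whatever API access or SERP parsing you have). The second URL in the demo is made up for illustration.

```python
from typing import Callable, Iterable


def build_index_query(url: str, operator: str = "cache") -> str:
    """Prefix a URL with cache: or info: instead of querying the bare URL."""
    if operator not in ("cache", "info"):
        raise ValueError("operator must be 'cache' or 'info'")
    return f"{operator}:{url}"


def count_indexed(
    urls: Iterable[str],
    is_indexed: Callable[[str], bool],
    operator: str = "cache",
) -> int:
    """Check each distinct URL once and sum the ones reported as indexed.

    `is_indexed` is a placeholder (an assumption of this sketch): it takes
    the query string and returns True if the engine returned a result.
    """
    seen = set()
    total = 0
    for url in urls:
        if url in seen:
            continue  # don't count the same URL twice
        seen.add(url)
        if is_indexed(build_index_query(url, operator)):
            total += 1
    return total


if __name__ == "__main__":
    # Stand-in for real lookups: a set of queries we pretend are indexed.
    pretend_index = {
        "cache:http://www.ifloor.com/gs/cat-8-hardwood-floors-1.html",
    }
    urls = [
        "http://www.ifloor.com/gs/cat-8-hardwood-floors-1.html",
        "http://www.example.com/not-indexed.html",  # hypothetical URL
    ]
    print(count_indexed(urls, lambda query: query in pretend_index))
```

Keeping the lookup behind a callable means the counting stays the same whether the answers come from an API or from scraped SERPs, and it lets you test the tally without sending a single query.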
I see your point. Google only lets you view the first 1000 results. So then why bother doing a headcount if it wastes CPU cycles?
Unfortunately, I’ve noticed some weird things about Google, so I’m not sure you can trust a cache:xyz query. When I first got a website indexed in Google, Google indexed it sporadically, and I noticed one peculiarity: the cache held a 5-day-old version of the site, yet the description in the search result showed content that was only 2 days old.
I’m not sure what happened, but Google is weird. I’m not so sure that using cache:xyz is very accurate either.
As a further note:
I had a site that I prohibited Google from visiting via robots.txt. However, Google still had the site in its index, though only as a bare URL. I could search for the site via the site URL (without the .com), and I could see info:mysite.com. However, cache:mysite.com did not work, as expected. As I said, Google is weird.