Infothought: Comment on "A Comparison of the Size of the Yahoo! and Google Indices"

Comments: "A Comparison of the Size of the Yahoo! and Google Indices"

I love statistics. It's often said of quoting from the Bible, "Even the devil can quote scripture for his purposes". I'd wager that El Diablo is pretty good at taking statistics and distorting their significance, too; people far less cunning seem to do it all the time with ease.

Posted by Dave at August 16, 2005 02:26 PM

Also pointed out by Jean Veronis: http://aixtal.blogspot.com/2005/08/yahoo-pages-manquantes-2.html

Posted by Will Fitzgerald at August 16, 2005 03:14 PM

Good--and fast. I didn't read the whole study yet, but right off the bat it had a funny feel to it. Thanks for confirming that.

Posted by Walt Crawford at August 16, 2005 04:06 PM

The word-list problem is interesting, but there is a comment on Slashdot saying that it doesn't have a large impact on the results.

http://slashdot.org/comments.pl?sid=159082&cid=13323888

What do you think?

Posted by anon at August 16, 2005 05:07 PM

I don't think testing a search engine with some strange, obscure query gives a metric of its index size.
Personally, I think the relevance of the results is more important.

Posted by Anthroponym at August 17, 2005 01:33 AM

The NCSA study included the SAME word-list pages 10,012 times. That's right, it included them in each and every search.

How come Yahoo has never shown any of these pages? I have it from reliable sources: those same pages were suppressed in the Yahoo results because, by virtue of matching almost any English query, they are effectively spam.

So what we can learn from this is not who indexes more pages, but surely who suppresses spam pages better, and it is not Google.

What is most amazing in this story is that Google personnel such as chrisd on Slashdot jumped on this study without realizing this fundamental flaw.

Posted by Sean DeBurgh at August 17, 2005 02:57 AM

Thanks for the feedback, all.

anon - The general problem is that the results are skewed by esoteric factors, such as spam, wordlists, differences in the handling of large files in general, and so on. One would need to examine the results of a search in detail to see what most affects that particular search. I looked at some of the examples given in the Slashdot comment, and it would require some extensive inspection. But I think it's clear that the differences reflect repeated samplings of relatively minor quirks, rather than overall index size differences.
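
To illustrate the "repeated sampling" point, here is a toy sketch in Python. Every number in it is made up for illustration (the per-query hit range, the count of word-list pages, the function name `simulate`); it is not data from the study and not a model of either engine. It assumes two engines with identical genuine coverage, where one of them also returns a handful of word-list pages on every query.

```python
import random


def simulate(num_queries=10012, wordlist_pages=5, seed=0):
    """Toy model, hypothetical numbers only.

    Both engines find the same genuine hits for each query.
    Engine A additionally returns `wordlist_pages` word-list pages
    on every query; engine B suppresses them.
    Returns the summed result counts (total_a, total_b).
    """
    rng = random.Random(seed)
    total_a = 0
    total_b = 0
    for _ in range(num_queries):
        # Assumed: an obscure query has only a few genuine matches.
        genuine_hits = rng.randint(0, 20)
        total_a += genuine_hits + wordlist_pages
        total_b += genuine_hits
    return total_a, total_b


if __name__ == "__main__":
    a, b = simulate()
    print("engine A total:", a)
    print("engine B total:", b)
    print("gap caused only by the word-list quirk:", a - b)
```

In this sketch the gap between the two totals is exactly the number of queries times the number of word-list pages, even though the underlying coverage is identical by construction. A small per-query quirk, sampled thousands of times, can look like a difference in index size.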

Posted by Seth Finkelstein at August 17, 2005 08:36 AM