August 16, 2005

"A Comparison of the Size of the Yahoo! and Google Indices"

The study "A Comparison of the Size of the Yahoo! and Google Indices" is being widely reported. On initial examination, I've found a serious problem with it.

The methodology is severely flawed, with a sampling-error bias.

In order to create a large number of queries that returned less than 1,000 results, we took the commonly available English Ispell Wordlist (a total of 135,069 words) [4] and wrote a PERL script to randomly select two words at a time from that list. The script then used those keywords to search both Yahoo! and Google and logged the number of results returned. For the purposes of this study we used a sample of 10,012 different searches of Yahoo! and Google using our randomly selected keywords.

By sampling random words, they biased the samples toward files that are LARGE WORD LISTS!

And this effect applies, to a greater or lesser extent, to EVERY SAMPLE.
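
To make the bias concrete, here is a toy simulation in Python. It is only a sketch under invented assumptions, not the study's PERL script: the vocabulary, the page counts, and the names `random_pair`, `matching_pages`, and `simulate` are all hypothetical, and a "page" is modeled as a bare set of words matched with AND semantics. It draws random two-word queries and counts how often a single page holding the whole vocabulary matches, compared with a couple of hundred ordinary pages.

```python
import random


def random_pair(words, rng):
    """Pick two distinct words uniformly at random from the list."""
    return tuple(rng.sample(words, 2))


def matching_pages(pages, query):
    """Return the names of pages that contain every term in the query."""
    return [name for name, text in pages.items()
            if all(term in text for term in query)]


def simulate(num_queries=1000, seed=0):
    """Count hits on one word-list page versus all ordinary pages combined."""
    rng = random.Random(seed)

    # Hypothetical stand-in for a large dictionary word list.
    vocabulary = [f"word{i}" for i in range(5000)]

    # 200 ordinary pages, each using a small random slice of the vocabulary.
    pages = {f"ordinary{i}": set(rng.sample(vocabulary, 50))
             for i in range(200)}

    # One page that is simply the whole word list.
    pages["wordlist"] = set(vocabulary)

    wordlist_hits = 0
    ordinary_hits = 0
    for _ in range(num_queries):
        hits = matching_pages(pages, random_pair(vocabulary, rng))
        wordlist_hits += "wordlist" in hits
        ordinary_hits += sum(1 for name in hits if name != "wordlist")
    return wordlist_hits, ordinary_hits


if __name__ == "__main__":
    wordlist_hits, ordinary_hits = simulate()
    print("word-list page hits:", wordlist_hits)
    print("ordinary page hits: ", ordinary_hits)
```

In this toy model the word-list page matches every query by construction, while ordinary pages match only rarely, so the totals mostly measure how many word-list-like pages an engine chooses to show.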

One can see this in their log of search results.

First entry:

Terms: carbolization clambers
Google totals:
Duplicates Omitted Estimate: 7
Duplicates Omitted Total: 4
Duplicates Included Estimate: 7
Duplicates Included Total: 7

Yahoo totals:
Duplicates Omitted Estimate: 0
Duplicates Omitted Total: 0
Duplicates Included Estimate: 0
Duplicates Included Total: 0

Do the Google search

Every entry is a large word-list file. Some are presumably (near?) duplicates of the same file.

And every search will have this problem, since every search will pick up files like those.

It's a severe systematic error.

Update [12:30 pm EST] - add search-engine spam to the sampling bias. Consider:

Terms: alkaloid's observance
Google totals:
Duplicates Omitted Estimate: 29
Duplicates Omitted Total: 15
Duplicates Included Estimate: 29
Duplicates Included Total: 29

Yahoo totals:
Duplicates Omitted Estimate: 0
Duplicates Omitted Total: 0
Duplicates Included Estimate: 0
Duplicates Included Total: 0

Look at the results. Every page is either a gibberish spam page or a wordlist.

By Seth Finkelstein | posted in google | on August 16, 2005 11:44 AM (Infothought permalink)

I love statistics. It's often said of quoting from the Bible, "Even the devil can quote scripture for his purposes". I'd wager that El Diablo is pretty good at taking statistics and distorting their significance, too; people far less cunning seem to do it all the time with ease.

Posted by: Dave at August 16, 2005 02:26 PM

Also pointed out by Jean Veronis:

Posted by: Will Fitzgerald at August 16, 2005 03:14 PM

Good--and fast. I didn't read the whole study yet, but right off the bat it had a funny feel to it. Thanks for confirming that.

Posted by: Walt Crawford at August 16, 2005 04:06 PM

The word-list problem is interesting, but there is a comment on Slashdot saying that it doesn't have a large impact on the results.

What do you think?

Posted by: anon at August 16, 2005 05:07 PM

I don't think testing a search engine with some strange, obscure query gives a metric of its index size.
Personally, I think the relevance of the results is more important.

Posted by: Anthroponym at August 17, 2005 01:33 AM

The NCSA study included the SAME word-list pages 10,012 times. That's right, it included them in each and every search.

How come Yahoo never showed any of these pages? I have it from reliable sources: those same pages were suppressed in the Yahoo results because, by virtue of matching almost any English query, they are effectively spam.

So what we can learn from this is not who indexes more pages, but who suppresses spam pages better, and it is not Google.

What is most amazing in this story is that Google personnel such as chrisd on Slashdot jumped on this study without realizing this fundamental flaw.

Posted by: Sean DeBurgh at August 17, 2005 02:57 AM

Thanks for the feedback, all.

anon - The general problem is that the results are skewed by esoteric factors, such as spam, wordlists, differences in how large files are handled in general, and so on. One would need to examine the results of a search in detail to see what most affects that particular search. I looked at some of the examples given in the Slashdot comment, and it would require some extensive inspection. But I think it's clear that the differences reflect repeated samplings of relatively minor quirks, rather than overall index size differences.

Posted by: Seth Finkelstein at August 17, 2005 08:36 AM

More on this:

Posted by: Jean Veronis at August 19, 2005 06:44 AM