August 23, 2005

Yahoo! Google Size Study Still Flawed

The study comparing the sizes of Yahoo! and Google has been revised in an attempt to address some issues, but it is still flawed. Per Jean Véronis:

In the new study, the authors still draw two words at random in the ispell dictionary, but exclude a third, random word from the search (using the exclusion operator - ), in the hope of removing word lists and spam from results. For example, they will search for switchers trophoblast -agnus. They find that Google still returns more results (although less often than before).

Unfortunately, this new strategy doesn't remove the bias. Word lists and spam are still returned, as can be easily checked on any of the queries used, such as switchers trophoblast -agnus. Here are the results from a Google search this morning : all results but one are word lists and junk.

Let me elaborate further. The study's authors explain their assumption:

To deal with this problem we modified our original search parameters of searching for two random words from the commonly available English Ispell Wordlist (a total of 135,069 words) [4]. Instead, we searched for two random words and not a third random word. This method, we feel, helps to exclude the vast number of "dictionaries" and "wordlists" because those results should be filtered out by the "not a third random word" part of our search query.

The intent is clear, but the claim does not hold. In fact, the exclusion may not even filter out format variations of the original wordlist. For example, suppose a wordlist is split into two files, one covering words starting with the letters "a-n" and another covering words starting with the letters "o-z". Searching [alpha beta] will find the first file, and searching [alpha beta -zebra] will still find that exact same file.
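The split-wordlist case can be sketched in a few lines of Python. This is only a toy model: the page names, the word sets, and the search() helper are all hypothetical, and real search engines obviously do far more than set membership.

```python
# Hypothetical pages: one wordlist split by first letter, plus an unsplit copy.
# All names and contents here are made up for illustration.
pages = {
    "wordlist_a-n": {"alpha", "beta", "gamma", "nadir"},
    "wordlist_o-z": {"omega", "quartz", "zebra"},
    "full_wordlist": {"alpha", "beta", "gamma", "nadir", "omega", "quartz", "zebra"},
}

def search(pages, include, exclude=()):
    """Return the names of pages with every included word and no excluded word."""
    return sorted(
        name
        for name, words in pages.items()
        if all(w in words for w in include)
        and not any(w in words for w in exclude)
    )

plain = search(pages, ["alpha", "beta"])
filtered = search(pages, ["alpha", "beta"], exclude=["zebra"])

print(plain)     # ['full_wordlist', 'wordlist_a-n']
print(filtered)  # ['wordlist_a-n']
```

The exclusion removes only the unsplit copy. The "a-n" file never contained "zebra", so it passes the filter untouched.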

More importantly, not all wordlists are identical. A specific example from the "verification" study is the search [guck wheeze -prothrombin].

Terms: guck wheeze -prothrombin
Google totals:
Duplicates Omitted Estimate: 88
Duplicates Omitted Total: 56
Duplicates Included Estimate: 88
Duplicates Included Total: 83

Yahoo totals:
Duplicates Omitted Estimate: 30
Duplicates Omitted Total: 25
Duplicates Included Estimate: 29
Duplicates Included Total: 28

Ratio of Google/Yahoo for this query:
Duplicates Omitted Estimate: 2.933333
Duplicates Omitted Total: 2.240000
Duplicates Included Estimate: 3.034483
Duplicates Included Total: 2.964286

But "guck" and "wheeze" are common words, while "prothrombin" is much more obscure. So there are still many wordlists (as well as spam pages) that contain "guck" and "wheeze" but not "prothrombin", and the search returns all of them.
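To make the point concrete, here is a rough sketch that models wordlists of different sizes, each holding the N most common words. The frequency ranks and list sizes below are invented for illustration, not measured; only the 135,069 figure comes from the study's description of the Ispell wordlist.

```python
# Hypothetical frequency ranks (lower = more common). Not measured values.
rank = {"guck": 20_000, "wheeze": 15_000, "prothrombin": 90_000}

# Hypothetical wordlists, each assumed to contain the N most common words.
wordlist_sizes = [10_000, 25_000, 50_000, 80_000, 100_000, 135_069]

def contains(size, word):
    """A list of the `size` most common words contains `word` if its rank fits."""
    return rank[word] <= size

def matches(size, include, exclude):
    """True if the wordlist would be returned for the given query."""
    return all(contains(size, w) for w in include) and not any(
        contains(size, w) for w in exclude
    )

plain = [s for s in wordlist_sizes if matches(s, ["guck", "wheeze"], [])]
filtered = [s for s in wordlist_sizes if matches(s, ["guck", "wheeze"], ["prothrombin"])]

print(plain)     # [25000, 50000, 80000, 100000, 135069]
print(filtered)  # [25000, 50000, 80000]
```

Under these assumptions, the exclusion drops only the largest lists. Every mid-sized wordlist that is big enough for the two common words but too small for the obscure one is still counted as a result.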

In general, sampling bias must be carefully examined, because extensive repetitions of a flawed procedure will still yield a fundamentally flawed outcome.
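A small simulation illustrates why repetition does not help. It assumes, purely for the sake of example, a true value of 1.0, a systematic bias of +0.5 in every measurement, and Gaussian noise; none of these numbers come from the study.

```python
import random

TRUE_VALUE = 1.0  # hypothetical quantity being estimated
BIAS = 0.5        # hypothetical systematic error added to every measurement

def biased_measurement(rng):
    """One noisy measurement that is always shifted by the same bias."""
    return TRUE_VALUE + BIAS + rng.gauss(0, 1)

def estimate(n_samples, seed=0):
    """Average n_samples biased measurements."""
    rng = random.Random(seed)
    return sum(biased_measurement(rng) for _ in range(n_samples)) / n_samples

for n in (10, 1_000, 100_000):
    print(n, round(estimate(n), 3))
```

As the number of samples grows, the noise averages out and the estimate settles near 1.5, not 1.0. More repetitions make the answer more precise, but they cannot remove the bias built into each measurement.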

By Seth Finkelstein | posted in google | on August 23, 2005 02:01 PM (Infothought permalink)
