October 24, 2002

Improving searching for "Localized Google search result exclusions"

In a fascinating report:
"Localized Google search result exclusions Statement of issues and call for data"
authors Jonathan Zittrain and Benjamin Edelman examine sites excluded by Google from localized country-specific searching. For methodology, they use:

A note on search criteria: The authors' searches use standard Google search syntax to request 1) pages on the specified web site (using the site:stormfront.org restriction), and 2) pages that lack a phrase of gibberish (using the exclusion syntax -asdfasdf), since some search term must be specified. Similar searches for other sites confirm that these search criteria provide a reliable estimate of the number of pages indexed by Google on a given web site.

This methodology has a notable flaw: it cannot find any blacklisted item below the domain level. For example, one item blacklisted from Germany is the home page of the Holocaust denier Arthur R. Butz, at URL:

This can be seen by comparing the German search using "allinurl" syntax http://www.google.de/search?q=allinurl%3Apubweb.acns.nwu.edu%2F%7Eabutz%2F

Versus a similar US search using "allinurl" syntax

The German search will return nothing, while the US search finds the relevant pages.
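As a rough illustration of how the paired queries can be built, here is a minimal Python sketch. The helper name search_url is hypothetical, and the US host www.google.com is an assumption. The sketch only constructs the URLs; it does not fetch anything.

```python
from urllib.parse import urlencode

def search_url(host: str, query: str) -> str:
    """Build a Google search URL for the given host and raw query string."""
    # urlencode percent-encodes ':' and '/' so the operator and path
    # travel as a single q= value. Depending on the Python version, '~'
    # may be left as-is or encoded as %7E; both decode to the same query.
    return f"http://{host}/search?" + urlencode({"q": query})

QUERY = "allinurl:pubweb.acns.nwu.edu/~abutz/"

german_url = search_url("www.google.de", QUERY)
us_url = search_url("www.google.com", QUERY)  # assumed US host

print(german_url)
print(us_url)
```

Keeping the query string identical and varying only the host makes it clear that any difference in results comes from the country-specific front end.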

However, this item cannot be found with the "site:" syntax. A "site:" search argument is treated by Google as a domain name, and "pubweb.acns.nwu.edu/~abutz/" is not a domain. Thus, "site:pubweb.acns.nwu.edu/~abutz/" will never match anything.
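The distinction can be sketched in Python as a simple check on whether a target is a bare domain or includes a path. The function names are hypothetical, and this is only a model of the behavior described above, not Google's actual parsing.

```python
def is_domain_level(target: str) -> bool:
    """Return True if target is a bare host name with no path component."""
    target = target.split("://", 1)[-1]      # drop any scheme such as http://
    host, _, path = target.partition("/")    # split host from the rest
    return bool(host) and path == ""

def choose_operator(target: str) -> str:
    """Pick a search operator suited to the target, per the reasoning above."""
    # "site:" is only meaningful for domains; anything with a path
    # needs a URL-based operator such as "allinurl:".
    prefix = "site:" if is_domain_level(target) else "allinurl:"
    return prefix + target

print(choose_operator("pubweb.acns.nwu.edu"))
print(choose_operator("pubweb.acns.nwu.edu/~abutz/"))
```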

Moreover, comparing site:pubweb.acns.nwu.edu search results between Germany and the US will NOT display any numerical difference in results. This is because, as noted previously, the Google database seems to be identical for all countries. Only the displayed search results are affected.

Around 6,000 pages are indexed for pubweb.acns.nwu.edu. Since at most 100 search results can be displayed at a time, far more than 100 results remain available for display even after the Holocaust-denier pages are removed, so any single page of results looks just as full in Germany as in the US.

Of course, if someone tried to retrieve all 6,000 pages, at some point, a difference due to banned pages would be visible. But that's an impractical, or at least very involved, task.
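To put a rough number on "very involved", here is a back-of-the-envelope calculation in Python. It uses the approximate figures above (about 6,000 indexed pages, 100 results per request) and assumes every indexed page could actually be paged through, which may not hold in practice.

```python
import math

indexed_pages = 6000      # "around 6,000" per the text; approximate
results_per_request = 100 # maximum results displayed at a time

# Result pages that would have to be fetched to list everything, per country.
requests_per_country = math.ceil(indexed_pages / results_per_request)

# Comparing Germany against the US doubles the work.
total_requests = 2 * requests_per_country

print(requests_per_country, total_requests)
```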

Thus, "allinurl" searches, when used with care as to what they mean, are a much better methodology for searching for banned items.

Again, it's important to note that the separate components can appear anywhere in the URL, so "allinurl:stormfront.org" means "stormfront" and "org" both appear in the URL, not just the literal string "stormfront.org" as might be naively thought.
See http://www.google.com/help/operators.html#allinurl
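A small Python model of this behavior may make the caveat concrete. It assumes the query and the URL are both split into terms at punctuation and that every query term must appear as a whole term somewhere in the URL. That is an approximation for illustration, not Google's documented matching algorithm, and the example.org and example.com URLs are made up.

```python
import re

def url_terms(text: str) -> list:
    """Split text into lowercase alphanumeric terms, dropping punctuation."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def allinurl_matches(query: str, url: str) -> bool:
    """Approximate 'allinurl:': every query term appears somewhere in the URL."""
    present = set(url_terms(url))
    return all(term in present for term in url_terms(query))

# The intended match:
print(allinurl_matches("stormfront.org", "http://www.stormfront.org/"))
# A hypothetical unrelated site that also contains both terms:
print(allinurl_matches("stormfront.org", "http://www.example.org/stormfront/index.html"))
# Missing the "org" term, so no match under this model:
print(allinurl_matches("stormfront.org", "http://www.example.com/stormfront/"))
```

Under this model the second URL matches even though it is not on the stormfront.org domain, which is why "allinurl" results need to be read with care.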

Update Oct 26:
The "info:" Google search operator is a good way to ask yes/no questions, which works for domains, directories, and pages.
See http://www.google.com/help/operators.html#info

Compare the German search using "info:" operator http://www.google.de/search?hl=en&q=info%3Apubweb.acns.nwu.edu%2F%7Eabutz%2F

Versus a similar US search using "info:" operator

Again, the German search will return nothing, while the US search finds the specific page. However, keep in mind that both "info:" searches may return nothing, depending on the vagaries of the database. That is, a not-found result in another country's search, combined with a found result in the US search, is definitive evidence. But a not-found result in both the other country's search and the US search may simply indicate that the particular URL is not indexed.
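The reasoning above amounts to a small decision table, sketched here in Python. The function name and the wording of the returned labels are hypothetical. The case where the page is found locally but not in the US is not discussed in the text, so it is labeled as unexpected rather than interpreted.

```python
def interpret_info_results(found_local: bool, found_us: bool) -> str:
    """Interpret a pair of 'info:' lookups for the same URL."""
    if found_us and not found_local:
        # Indexed, but hidden from the country-specific results.
        return "excluded locally"
    if not found_us and not found_local:
        # Nothing to compare against; the URL may just not be indexed.
        return "inconclusive: possibly not indexed"
    if found_us and found_local:
        return "no exclusion observed"
    # Not covered by the text above; flagged instead of guessed at.
    return "unexpected: found locally but not in US"

print(interpret_info_results(found_local=False, found_us=True))
print(interpret_info_results(found_local=False, found_us=False))
```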

Note that German searches for page caches ("cache:") DO work, even for banned sites. This is yet another proof that the database is identical and that the results ban is a post-processing step.


By Seth Finkelstein | posted in censorware | on October 24, 2002 04:07 PM (Infothought permalink) | Followups

