October 24, 2002

How-it-works note for "Localized Google search result exclusions" report

In a fascinating report:
"Localized Google search result exclusions Statement of issues and call for data"
http://cyber.law.harvard.edu/filtering/google/
authors Jonathan Zittrain and Benjamin Edelman examine sites excluded by Google from localized country-specific searching. In discussing results, they conjecture:

The implication of these results -- confirmed in our subsequent searches on google.com versus google.fr and .de for the terms at issue -- is that the French and German versions of Google simply omit search results from the sites excluded from their respective versions of Google.

This implication can be refined and clearly demonstrated with more sophisticated searches. The following example uses Google's "allinurl" syntax, which searches for URLs containing the given components. Note that the separate components can appear anywhere in the URL, so "allinurl:stormfront.org" matches "stormfront" and "org" anywhere in the URL, not just the string "stormfront.org" as might be naively thought.
See http://www.google.com/help/operators.html#allinurl
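
To make the component matching concrete, here is a rough Python sketch of the behavior described above. It is only an illustration: the function name allinurl_matches and the tokenizing rule (split on anything that is not a letter or digit) are my assumptions, not Google's actual implementation.

```python
import re

def _components(text: str) -> list[str]:
    # Assumed tokenizing rule: split on any run of non-alphanumeric characters.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def allinurl_matches(query: str, url: str) -> bool:
    """Return True if every component of the query appears somewhere in the URL."""
    url_components = set(_components(url))
    return all(term in url_components for term in _components(query))

# "stormfront" and "org" both appear, even though a port number follows the host.
print(allinurl_matches("stormfront.org", "www4.stormfront.org:81/guest/RemoteListSummary/NNA"))
# The components need not be adjacent; example.org is a placeholder host.
print(allinurl_matches("stormfront.org", "example.org/stormfront/"))
```

Both calls print True, which is the point of the caveat: the query is a set of components, not a literal substring.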

Consider the following US search:
http://www.google.com/search?q=allinurl:stormfront.org&num=100&hl=en
This returned: Results 1 - 25 of about 1,670.

Now compare with the German counterpart:
http://www.google.de/search?q=allinurl:stormfront.org&num=100&hl=en
This returned: Results 1 - 9 of about 1,670.

Immediate observation: The rightmost (total) number is identical, so the same results are in the Google database for both searches. The German version is simply not displaying most of them. How is it determining which domain results to display?

Note which "stormfront.org" site URLs are visible on the German page:

www4.stormfront.org:81/guest/RemoteListSummary/NNA
irc.stormfront.org:8000/
lists.stormfront.org:81/guest/remoteavailablelists

What do these all have in common?
They all have a port number after the host name.
The exclusion pattern obviously isn't matching the :number part of the URL.
It's matching a pattern of "*.stormfront.org/", as in the following URLs, which are displayed in the US search but not the German search.

kids.stormfront.org/
nna.stormfront.org/
www4.stormfront.org/
www2.stormfront.org/
women.stormfront.org/
www.hessmemorial.stormfront.org/
www3.stormfront.org/
ldf.stormfront.org/

Thus, the restrictions appear to be implemented as a post-processing step using very simple patterns of prohibited results.
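
The inferred behavior can be sketched in a few lines of Python. This is a guess at the mechanism based only on the observed results: the regular expression EXCLUDED and the function visible_results are hypothetical names of mine, and the real filter is not public.

```python
import re

# Hypothetical exclusion pattern, equivalent to "*.stormfront.org/":
# any host ending in stormfront.org that is followed immediately by "/".
EXCLUDED = re.compile(r"^([a-z0-9-]+\.)*stormfront\.org/")

def visible_results(urls):
    """Post-processing step: drop every result whose URL matches the pattern."""
    return [u for u in urls if not EXCLUDED.match(u)]

# URLs taken from the US result page discussed above.
results = [
    "kids.stormfront.org/",
    "www4.stormfront.org:81/guest/RemoteListSummary/NNA",
    "www4.stormfront.org/",
    "irc.stormfront.org:8000/",
    "www.hessmemorial.stormfront.org/",
    "lists.stormfront.org:81/guest/remoteavailablelists",
]

for url in visible_results(results):
    print(url)
# Prints only the three URLs that carry a port number:
# www4.stormfront.org:81/guest/RemoteListSummary/NNA
# irc.stormfront.org:8000/
# lists.stormfront.org:81/guest/remoteavailablelists
```

Because the pattern expects a "/" directly after "stormfront.org", a ":81" or ":8000" in that position makes the match fail, and those URLs slip through just as they do on the German page.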

Update: See also my explanation "Google Censorship - How It Works"

By Seth Finkelstein | posted in censorware, google | on October 24, 2002 10:28 AM
Seth Finkelstein's Infothought blog (Wikipedia, Google, censorware, and an inside view of net-politics)