In a fascinating report:
"Localized Google search result exclusions: Statement of issues and call
for data"
http://cyber.law.harvard.edu/filtering/google/
authors Jonathan Zittrain and Benjamin Edelman examine sites excluded
by Google from localized country-specific searching.
In discussing results, they conjecture:
The implication of these results -- confirmed in our subsequent searches on google.com versus google.fr and .de for the terms at issue -- is that the French and German versions of Google simply omit search results from the sites excluded from their respective versions of Google.
This implication can be refined and clearly demonstrated
by observation of more sophisticated searching. The following example
uses the "allinurl" syntax of Google, which searches for URLs which
have the given components (note the separate components can appear
anywhere in the URL, so "allinurl:stormfront.org" matches any URL containing
both "stormfront" and "org", not just the string "stormfront.org"
as might naively be thought).
See http://www.google.com/help/operators.html#allinurl
Consider the following US search:
http://www.google.com/search?q=allinurl:stormfront.org&num=100&hl=en
This returned: Results 1 - 25 of about 1,670.
Now compare with the German counterpart:
http://www.google.de/search?q=allinurl:stormfront.org&num=100&hl=en
This returned: Results 1 - 9 of about 1,670.
Immediate observation: The rightmost (total) number is identical in both searches. So the same results are in the Google database either way; the German version is simply not displaying most of them. How is it determining which results to display?
Note which "stormfront.org" site URLs are visible on the German page:
www4.stormfront.org:81/guest/RemoteListSummary/NNA
irc.stormfront.org:8000/
lists.stormfront.org:81/guest/remoteavailablelists
What do these all have in common?
They all have a port number after the host name.
The exclusion pattern evidently isn't matching the :number part of the URL.
It's matching a pattern of "*.stormfront.org/", as in the following, which
are displayed in the US search but not in the German search:
kids.stormfront.org/
nna.stormfront.org/
www4.stormfront.org/
www2.stormfront.org/
women.stormfront.org/
www.hessmemorial.stormfront.org/
www3.stormfront.org/
ldf.stormfront.org/
Thus, the restrictions appear to be implemented as a post-processing step using very simple patterns of prohibited results.
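The behavior observed above can be reproduced with a short sketch of such a post-processing step. This is a guess at the mechanism, not Google's actual code: the regular expression EXCLUDED and the function filter_results are hypothetical, chosen only to mimic a "*.stormfront.org/" pattern that requires the slash immediately after the host name.

```python
import re

# Hypothetical exclusion pattern modeled on "*.stormfront.org/": any
# subdomain, then the slash directly after "org". A ":port" in between
# prevents the match.
EXCLUDED = re.compile(r"^[^/:]+\.stormfront\.org/")


def filter_results(urls: list[str]) -> list[str]:
    """Drop every result whose URL matches the exclusion pattern."""
    return [u for u in urls if not EXCLUDED.match(u)]


us_results = [
    "kids.stormfront.org/",
    "nna.stormfront.org/",
    "www.hessmemorial.stormfront.org/",
    "www4.stormfront.org:81/guest/RemoteListSummary/NNA",
    "irc.stormfront.org:8000/",
    "lists.stormfront.org:81/guest/remoteavailablelists",
]

# Only the URLs with a port number survive the filter.
for url in filter_results(us_results):
    print(url)
```

A filter of this kind leaves the total hit count untouched, because it runs after the search itself, which would be consistent with both pages reporting "of about 1,670".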
Update: See also my explanation "Google Censorship - How It Works"
By Seth Finkelstein | posted in censorware, google | on October 24, 2002 10:28 AM