May 24, 2007

Google and "She"/"He" Spelling "Corrections"

A Google algorithmic quirk that spelling-"corrected" searches such as [she invents] to [he invents] recently got some attention, and Google has apparently now rolled out a fix for this problem.

I didn't chase after it at the time, since it seemed obviously a matter of statistical differences, and plenty of informed people were explaining that result to those who saw it as deliberate sexism. So I didn't see the need for me to say it too. There can be a long discussion of structural sexism, and the effects of the default English pronoun being "he", etc., but I had no special expertise to weigh in on the matter.

But the fix that Google has made is interesting for what it reveals about how their algorithm actually functions. As Philipp Lenssen said in the above:

(Note: no matter what Google tells you, algorithms are always influenced by those who design, write & test them)

So Google seems to have changed the way "she" is handled in their spelling suggestions.

But it turns out, from seeing what behavior remains, that Google does not use the obvious sort of simple correction algorithm one might initially expect. That is, a search for ["she inveent"] still gets a suggestion of
Did you mean: "he invent".

Why is this significant?

Because "she" is a common English word, "inveent" is not a common English word, and the naive correction of "inveent" to "invent" should yield a suggestion of "she invent". But Google seems to be doing some sort of statistical best-match for the phrase as a whole.
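
To make the difference concrete, here is a minimal Python sketch of the two approaches. It is not Google's algorithm, which is only inferred here from the outside. The edit-distance candidate generator is a common textbook technique, and every count in WORD_COUNTS and PHRASE_COUNTS is a made-up toy number chosen for illustration, not a measurement of the Web.

```python
from itertools import product

# Toy data only: these counts are invented for illustration,
# not measured from Google or the Web.
WORD_COUNTS = {"she": 900, "he": 1000, "invent": 50, "invents": 40}
PHRASE_COUNTS = {("he", "invent"): 30, ("she", "invent"): 5}

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def naive_correct(query):
    """Word-by-word: keep known words, fix unknown ones in isolation."""
    out = []
    for word in query.split():
        if word in WORD_COUNTS:
            out.append(word)  # "she" is a known word, so it is left alone
        else:
            known = [w for w in edits1(word) if w in WORD_COUNTS]
            out.append(max(known, key=WORD_COUNTS.get) if known else word)
    return " ".join(out)

def candidates(word):
    """The word itself plus every known word within one edit of it."""
    return sorted({w for w in edits1(word) if w in WORD_COUNTS} | {word})

def phrase_correct(query):
    """Whole-phrase: pick the candidate phrase with the highest count."""
    options = [candidates(w) for w in query.split()]
    best = max(product(*options), key=lambda p: PHRASE_COUNTS.get(p, 0))
    return " ".join(best)

print(naive_correct("she inveent"))   # she invent
print(phrase_correct("she inveent"))  # he invent
```

With these toy counts, the word-by-word version never touches "she", while the whole-phrase version treats "he" as a one-edit neighbor of "she" and lets the more frequent phrase win.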

I suppose this is not surprising, even expected, in retrospect. But it shows it's harder than it might appear to remove all aspects of structural bias (which is not to trivialize addressing an obvious case).

Semi-digression: Google seems to special-case swear-words. A search for ["fcck you"] does NOT return the obvious correction! One rule seems to be that if the swear-word doesn't appear in the original search, it won't be suggested.
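
That guessed rule can be sketched as a simple filter on suggestions. This is only an illustration of the behavior described above, not a known Google mechanism. The BLOCKLIST set and the allow_suggestion function are hypothetical names, and a mild placeholder word stands in for real swear-words.

```python
# Hypothetical blocklist; a mild placeholder stands in for real swear-words.
BLOCKLIST = {"damn"}

def allow_suggestion(query, suggestion):
    """Return True unless the suggestion introduces a blocklisted word
    that the user did not type in the original query."""
    typed = set(query.lower().split())
    introduced = set(suggestion.lower().split()) - typed
    return not (introduced & BLOCKLIST)

# The correction would introduce a blocklisted word, so it is suppressed.
print(allow_suggestion("dmn you", "damn you"))   # False
# The user typed the word themselves, so the suggestion is allowed.
print(allow_suggestion("damn yuo", "damn you"))  # True
```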

By Seth Finkelstein | posted in google | on May 24, 2007 11:52 AM (Infothought permalink)
Seth Finkelstein's Infothought blog (Wikipedia, Google, censorware, and an inside view of net-politics)


Interesting post. Didn't know about the swear-words exception. (Clearly, Google engineers are "biased" against swear words, showing how this affects their algorithm!)

Posted by: Philipp Lenssen at May 24, 2007 02:44 PM

Peter Norvig all but admitted that Google's spelling corrector uses n-grams (phrases) and the frequency with which they appear on the Web. So the results reflect the biases of the Web.

Posted by: Wes Felter at May 25, 2007 07:00 PM