Google Spam Filtering Gone Bad

An anticensorware investigation by Seth Finkelstein

Abstract: This report describes a problem which caused Google to return very few, or no, results for particular combinations of search terms. It is almost certain this is a consequence of search results being post-processed by spam-defense which has gone awry.

Google, Spam, and the Whack-NACK

A "GoogleWhack" is a search-game of finding an "elusive query (two words - no quote marks) with a single, solitary result". That is, finding two words which appear together in just one document in the entire Google search index. The words don't have to appear next to each other, just within the document as a whole.

At the start of October 2003, Whack-players noticed that certain search term combinations were returning improbably few results. For example, a search with "motorcycle" and "candle" would yield no returned results. This was not realistic. Something must be wrong with Google. The absence of expected search results was dubbed the "Google NACK", from "NACK" meaning "negative acknowledgement" (i.e. no results).

The answer to this anomaly seems to lie in the much less savory game of Google-spamming. For "Google-spam", sites typically put up many phony pages, with a large number of key terms and links, in order to appear in many search results. To defend against this practice, Google has a mechanism to suppress the display of certain sites. This mechanism can be used for censorship, by governments mandating certain sites be placed on the banned list. See:

But the suppression mechanism is ordinarily and much more frequently used for eliminating spam results in the search index.

How it works

As explained in the "Google Censorship" report:

"A Google search is not simply a raw dump of a database query to the user's screen. The retrieval of the data is just one step. There is much post-processing afterwards, in terms of presentation and customization.

When Google "removes" material, often it is still in the Google index itself. But the post-processing has removed it from any results shown to the user."

The current problem with search results shows every indication of being an attempt to remove spam, but removing all or nearly all results in the process.

Observed Problem

As one example, a search for the terms "keyboard" and "bracelet" yielded:

Results 1 - 1 of about 52,100

Note this is not "keyboard bracelet" as a phrase, the words right next to each other. Rather, it's the word "keyboard" and the word "bracelet" anywhere in the document.

Now, there cannot be only one document in the entire index containing the words "keyboard" and "bracelet". What's wrong? The clue is in understanding the details of the search result. The part "about 52,100" indicates that about 52,100 items in the database fit this search, which is reasonable. But "Results 1 - 1" indicates only one item is being returned. So it follows that the rest must be falling afoul of the suppression mechanism.
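The reasoning above can be made mechanical. What follows is only a minimal sketch, not anything Google provides: the function names and the default page size of 10 are assumptions made for illustration. It parses a result header and flags the case where far fewer items are displayed than the estimate says exist.

```python
import re


def parse_result_header(line):
    """Parse 'Results A - B of about N' into (first, last, estimated_total)."""
    match = re.search(
        r"Results\s+([\d,]+)\s*-\s*([\d,]+)\s+of\s+about\s+([\d,]+)", line
    )
    if match is None:
        raise ValueError("not a recognizable result header: %r" % line)
    first, last, total = (int(g.replace(",", "")) for g in match.groups())
    return first, last, total


def looks_truncated(line, page_size=10):
    """True when less than a full page is shown although more items are estimated."""
    first, last, total = parse_result_header(line)
    shown = last - first + 1
    return shown < page_size and total > last


print(looks_truncated("Results 1 - 1 of about 52,100"))
print(looks_truncated("Results 1 - 10 of about 7,200"))
```

The first header is flagged, since one item is shown out of an estimated 52,100. The second is not, since a full page of ten is displayed.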

A Theory And Evidence

The suppressed sites should be quietly removed from the items returned. However, the following glitch would explain the behavior observed here:

The result display is stopping on the first spam (to be suppressed) result

That is, when a spam site would be removed, and then the next (non-spam) site would be returned, instead the result display is simply crashing.
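To make the conjecture concrete, here is a toy model of the two behaviors. It is a sketch of the theory only, not Google's actual code, and every hostname in it is made up for illustration.

```python
def display_intended(ranked, blocked):
    """Intended behavior: quietly skip suppressed results, keep the rest."""
    return [url for url in ranked if url not in blocked]


def display_glitched(ranked, blocked):
    """Conjectured glitch: stop at the first suppressed result."""
    shown = []
    for url in ranked:
        if url in blocked:
            break  # the display "crashes" here; later results never appear
        shown.append(url)
    return shown


# Hypothetical ranking; none of these hosts come from the real index.
ranked = [
    "crafts.example.net",
    "spam-farm.example.com",
    "music.example.org",
    "jewelry.example.com",
]
blocked = {"spam-farm.example.com"}

print(display_intended(ranked, blocked))
print(display_glitched(ranked, blocked))
```

In this model the intended display returns the three legitimate sites, while the glitched display returns only the one ranked above the suppressed site.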

The evidence for this theory is a "NACK-hack" found by this author. Assume, for a result site, there is a "poison spam-site", one which causes the display to crash. If a search was constructed which eliminated the "poison spam-site", then more results should be seen.

Of course, one would have to know the "poison spam-site", and Google is conjectured to be crashing before it's displayed. But in a crude way, this can be done by eliminating various classes of sites. Google has a "site:" keyword, which can be used in advanced searching to either require or exclude everything from sub-domains to, critically, top-level domains: all of .com, .net, .org.

So, try a search for "keyboard" and "bracelet" and NO .org domains. It yields:

Results 1 - 1 of about 49,300

Note this number is different from the previous 52,100. The search is functioning, and the database is retrieving different results (note it could be a slightly different database too). This also establishes that using the "site:" keyword itself does not change the problem.

Now search "keyboard" and "bracelet" and NO .net domains. Since the one visible result was a .net site, this yields nothing, but it's a consistent nothing.

The proof comes when searching excluding all .com sites, i.e. "keyboard" and "bracelet" and NO .com domains. Then the screen is a normal display, yielding:

Results 1 - 10 of about 7,200

The "poison spam-site" is obviously a .com site. By excluding it, by excluding all .com domains, the search results no longer crash.

Again, this is a theory, but the evidence supports it.
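The three exclusion searches can be replayed against a toy model of the theory. This is a sketch under stated assumptions: the ranking and all hostnames are invented, the exclusion is assumed to happen at retrieval time before any suppression, and the display is assumed to stop at the first suppressed result.

```python
def glitched_display(ranked, blocked):
    """Conjectured glitch: show results only up to the first suppressed one."""
    shown = []
    for host in ranked:
        if host in blocked:
            break
        shown.append(host)
    return shown


def exclude_tld(ranked, tld):
    """Model a search that excludes one top-level domain before display."""
    return [host for host in ranked if not host.endswith("." + tld)]


# Hypothetical ranking: one visible .net result, then a suppressed .com site.
ranked = [
    "crafts.example.net",
    "spam-farm.example.com",  # stands in for the "poison spam-site"
    "music.example.org",
    "school.example.edu",
    "shop.example.com",
    "club.example.org",
]
blocked = {"spam-farm.example.com"}

for tld in (None, "org", "net", "com"):
    candidates = ranked if tld is None else exclude_tld(ranked, tld)
    shown = glitched_display(candidates, blocked)
    print("exclude:", tld, "| matching:", len(candidates), "| shown:", len(shown))
```

In this model, excluding .org changes the number of matching items but still shows one result, excluding .net shows nothing, and excluding .com shows everything that remains. That is the same pattern as the searches above.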

More Theory And Evidence

When Google searches for combinations of terms, pages with the terms close to each other are ranked highly. Such pages are also unfortunately often search spam pages, using a mishmash of keywords. Thus, an unusual combination of words (and a dedicated spammer) will bring spam pages near the top of the results for certain keyword searches. So this is why the problem was noticed by people playing "GoogleWhack".

Some fortuitous searching for the exact phrase "keyboard bracelet" gives even more supporting evidence. Combining a phrase with domains, and searching "keyboard bracelet" and .com domains, turns out to yield a suspicious spam-site at the bottom of the results:

"text.zupavilla.com/animal_fetish.html"

This cannot be the poison spam-site, since then it wouldn't be displayed. But it may be a variant of that poisonous site, which is not yet caught by the suppression mechanism.

It turns out this page has, among other randomness, the keyword spam of:

antenna mouse or keyboard bracelet

Now, doing a new search just excluding "text.zupavilla.com" or "zupavilla.com" will not be good enough, since again the spammer is likely using different domain names with the same spam (in fact, examining the raw HTML on that site reveals many spam domains). However, doing a search prohibiting keywords from this spam will likely remove the site from returned results. Note searching "keyboard bracelet" without thisdoesnotexist is still problematic, showing the act itself of excluding a word isn't the trick.

But a search for "keyboard bracelet" without antenna (remember the phrase "antenna mouse" in the above spam) gives an extensive search display again! QED.
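The keyword-exclusion test follows the same logic, and can be sketched with a toy index. Everything here is assumed for illustration: the pages, their word sets, and the ranking are invented, and the model matches individual words rather than exact phrases.

```python
def search(pages, required, excluded=()):
    """Return URLs, in ranking order, that hold all required and no excluded words."""
    return [
        url
        for url, words in pages
        if set(required) <= words and not (set(excluded) & words)
    ]


def glitched_display(ranked, blocked):
    """Conjectured glitch: show results only up to the first suppressed one."""
    shown = []
    for url in ranked:
        if url in blocked:
            break
        shown.append(url)
    return shown


# Hypothetical index in ranking order; the spam page ranks first.
pages = [
    ("spam-farm.example.com/a.html", {"antenna", "mouse", "keyboard", "bracelet"}),
    ("crafts.example.net/beads.html", {"keyboard", "bracelet", "beads"}),
    ("music.example.org/gear.html", {"keyboard", "bracelet", "piano"}),
]
blocked = {"spam-farm.example.com/a.html"}
terms = ["keyboard", "bracelet"]

for excluded in ([], ["thisdoesnotexist"], ["antenna"]):
    shown = glitched_display(search(pages, terms, excluded), blocked)
    print("without", excluded, "->", shown)
```

Excluding a word that appears nowhere leaves the suppressed page in the candidate list, so the display still stops immediately. Excluding "antenna" drops that page before the display step, and both legitimate pages appear.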

Of course, the results could run into some other poison spam-site later on. For example, searching "keyboard bracelet" without dummy seems, by ranking vagaries, to push the bad site down just far enough so that the display shows six results before crashing. Manipulating the "crash-point" in this way further supports the theory outlined above. For example, a search for the words (not phrase) "antenna" "mouse" "keyboard" gives only four results.

Conclusion

Spam is a plague on search engines as well as email. But technical solutions may have unintended consequences.


Update October 11: Google is working on removing some of the spam pages which cause a crash, so more search results are now being returned. However, the basic problem appears unfixed. That is, the search results screen will still crash when a result would be displayed from a poison spam-site. But currently, some of those search-spam results have been removed, so the crash may occur later on the result list.

Today, a search for the terms "keyboard" and "bracelet" may crash on the SECOND displayed results screen (between result number 11 and result number 20). That is, the first screen of a typical search will return 10 results, as normal. But going to the next screen of 10 results (results 11 to 20) may be where the crash happens at the moment.

Using the Google Advanced Search form to display 100 results per page is an informative way to test the extent of the problem.
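Why the page size matters can be seen in a small calculation. This is a sketch under the theory's assumptions only: the function name is made up, the display is assumed to stop at the first poison result, and the query is assumed to have more matching results than any page examined.

```python
def page_view(poison_rank, page, per_page=10):
    """Return (first, last) result numbers shown on a page, or None if none are.

    poison_rank is the 1-based rank of the first suppressed result.
    """
    first = (page - 1) * per_page + 1
    last = page * per_page
    if poison_rank <= first:
        return None  # the display stops before anything on this page is shown
    return first, min(last, poison_rank - 1)


# Hypothetical case: the first poison result sits at rank 15.
print(page_view(15, page=1))
print(page_view(15, page=2))
print(page_view(15, page=1, per_page=100))
```

With ten results per page, the first screen looks normal and the second stops after result 14. With 100 per page, the single screen ends at result 14, which exposes the crash point directly.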

Amusing note: This report is now the top-ranked Google result for searching the terms "keyboard" and "bracelet".


Version 1.2 October 11 2003


Update November 27 2003: See new report from Seth Finkelstein:

Google Bayesian Spam Filtering Problem?
http://sethf.com/anticensorware/google/bayesian-spam.php

Support

This work was not funded by anyone, and has no connection to any organization. In fact, if anyone is providing financial support for such projects, the author would like to know.



Mail comments to: Seth Finkelstein <sethf@sethf.com>

For future information: subscribe to Seth Finkelstein's Infothought list or read the Infothought blog

(if you subscribed a few months ago, please resubscribe due to a crash)

See more of Seth Finkelstein's Censorware Investigations