There's a proposed solution to Google-bombing in the article "Five-domain Googlebomb explodes in boardroom":
"An easy fix for many bombs," explains Brandt "Google should not use terms in external links to boost the rank of a page on those terms, unless those terms are on the page itself. This is a no-brainer. But it means another CPU cycle per link, which is why Google won't do it."
Unfortunately, I have to disagree here. It's not so simple. In fact, the way it works now is ultimately the Right Thing from a technical point of view, in terms of making relevancy inferences from a simple algorithm.
One nontrivial reason is misspellings. If many people make the same spelling error in linking (such as turning "Dan Gillmor" into "Dan Gilmore"), it's useful to return the linked page for that search, rather than ignoring it because the misspelling likely won't appear on the target page itself.
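To make the difference concrete, here's a minimal sketch of a toy anchor-text index in Python; the URL, page text, and link text are all made up for illustration. Without the proposed filter, the misspelled query still finds the page; with it, the match gets thrown away:

```python
# Toy model: each page has body text plus the anchor text of inbound links.
# All names and data here are invented for illustration.

pages = {
    "example.com/gillmor-blog": {
        "body": "Dan Gillmor on grassroots journalism",
        # Anchor text from other sites linking here, some with the misspelling:
        "anchors": ["Dan Gillmor", "Dan Gilmore", "Dan Gilmore's weblog"],
    },
}

def terms(text):
    return set(text.lower().split())

def matches(query, page, brandt_filter=False):
    """Return True if the page should be returned for the query.

    With brandt_filter=True, anchor-text terms only count if they
    also appear in the page body (the proposed fix)."""
    q = terms(query)
    body = terms(page["body"])
    anchors = set()
    for link_text in page["anchors"]:
        for t in terms(link_text):
            if not brandt_filter or t in body:
                anchors.add(t)
    return q <= (body | anchors)

page = pages["example.com/gillmor-blog"]
print(matches("dan gilmore", page))                      # True  - misspelled query still finds the page
print(matches("dan gilmore", page, brandt_filter=True))  # False - the filter drops it
```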
There are also issues with robots.txt. The robots.txt file isn't for privacy; it's just an advisory to make search-spiders work more efficiently (think of how ill-considered it would be to have a public file listing material which should not be viewed: "Do Not Look Here"). If a site doesn't want spidering, but many people link to it with certain words, it seems reasonable to return that site for those words. The option of not returning the site isn't necessarily right, because sites often use robots.txt just to avoid the load of being spidered, rather than to hide in any way.
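Here's a similarly minimal sketch using Python's standard robotparser module (the site, bot name, and anchor text are again invented). The polite spider never fetches the page body, yet the page can still be indexed under the words other sites use when linking to it:

```python
import urllib.robotparser

# Hypothetical site that disallows all spidering.
robots_txt = [
    "User-agent: *",
    "Disallow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt)

url = "http://example.org/some-page"
print(rp.can_fetch("ExampleBot", url))  # False - a polite spider won't fetch the body

# But links elsewhere on the web still mention the page, so the engine
# can index it from anchor text alone, without ever reading its body.
anchor_index = {}  # term -> set of URLs whose inbound links use that term
for anchor_text in ["widget conference", "widget conference schedule"]:
    for term in anchor_text.lower().split():
        anchor_index.setdefault(term, set()).add(url)

print(anchor_index.get("widget"))  # {'http://example.org/some-page'}
```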
Many issues with Google, or any complex search system, are more subtle than they might appear at first glance.
By Seth Finkelstein | posted in google | on March 26, 2004 11:59 PM