November 13, 2008

Uncommon Google items - "Real Sex", Counts, Distinctive Sentences

A few things that have not been echoed widely, and deserve more notice (not that I can change it much, but here's something ...)

Tony Comstock - Taking the Real Sex out of [Real Sex] Searches. (Is the googlebot erotophobic?)

"But last July I noticed that although the word [real] is #5 in Google’s listing of keywords in inbound links, [real] doesn’t appear anywhere in the Googlebot’s listing for our site content keywords. That’s right, the Googlebot doesn’t see the word [real] at the home of real life, real people, real sex."

This is interesting, though I don't have the tools to investigate it. I lean towards thinking it's some sort of spam or "trusted site" algorithmic issue rather than an anti-sex bias in Googlebot.

David Weinberger - Obama v. Bush: Google counts

Estimated Google hits for [“Barack Obama”] exceed those for [“George W. Bush”] and [“George Bush”] combined. This strikes me as a clear demonstration that those hit numbers do not mean what one intuitively expects. People read them as full database counts, but it's known that they are merely statistical estimates. I suspect, just off the top of my head, that the results are heavily skewed by a recency bias in what's used for the estimate. I'd believe Barack Obama has been mentioned more than George Bush in the very recent past.

Walt Crawford - How Common is Common Language?

An extensive examination where Google is used as a testbed for analyzing the utility of checking phrases in cases of suspected plagiarism. "Even relatively short sentences seem to be unusual most of the time. On the order of 85% in this sample, and I suspect that percentage would be higher in a truly random sample. ... What I believe may be true: If you’re suspicious that a clumsy plagiarist has cut-and-pasted without paraphrasing, almost any medium-length sentence may suggest you should check further. It may be entirely innocent. But it seems surprisingly uncommon for the same, say, 11-word string to show up more than once."
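The kind of check described there can be sketched in a few lines of Python. This is only a rough illustration, not the method used in the study: the function names, the tokenizing rule, and the two sample texts are my own assumptions, and it compares two local texts rather than querying Google.

```python
import re

def word_ngrams(text, n=11):
    """Return every run of n consecutive words in text, lowercased."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # A text shorter than n words yields an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def shared_ngrams(suspect, source, n=11):
    """Return the n-word strings that appear in both texts, sorted."""
    return sorted(set(word_ngrams(suspect, n)) & set(word_ngrams(source, n)))

# Hypothetical sample texts, made up for this sketch.
source = ("it seems surprisingly uncommon for the same eleven word "
          "string to show up more than once")
suspect = ("I found that it seems surprisingly uncommon for the same "
           "eleven word string to show up anywhere")

for phrase in shared_ngrams(suspect, source):
    print(phrase)
```

The two sample texts share a 13-word run, so three overlapping 11-word strings are printed. As the quote says, a match like this only suggests checking further; it may be entirely innocent.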

By Seth Finkelstein | posted in google | on November 13, 2008 09:04 PM (Infothought permalink)
Seth Finkelstein's Infothought blog (Wikipedia, Google, censorware, and an inside view of net-politics)


Hello Seth, and thanks for calling out my post.

Using Google's search suggestions feature and their keyword tool on adwords, we've got a little better picture of just what Google thinks of

The long and the short is that, although I try to stay away from assumptions that "google's out to get the porn people", we've seen some fairly distressing data points about our site and others.

For example, using only the text from our index page, Adwords suggests [anabollic], [cum fiesta] and [wife craves black cock] as possible keywords. I'd like to know by what process those words were arrived at.

Posted by: Tony Comstock at November 13, 2008 10:33 PM

Thanks here too. That "study" was fun, maybe because it was so surprising: For all the simple math that suggests a nearly infinite number of legal eight-word English-language sentences, it's still surprising to see how rarely seemingly ordinary sentences actually occur.

I love "studies" (sorry for the scare quotes, but with a sample size in the low hundreds, it was mostly anecdotal) that are fun to do and fun to report--particularly since that fun is usually the primary payoff.

Posted by: walt crawford at November 14, 2008 12:05 PM

[SF - elevated to guest post]

Posted by: Daniel Brandt at November 15, 2008 02:03 PM