Why you should be concerned about Google Flu Trends
http://www.guardian.co.uk/technology/2008/nov/27/privacy-searchengines
The search engine has unwittingly hung a big sign on itself advertising services for government surveillance
The title's fine, though when I submitted it I proposed "Google Flu And Monitoring Health". I was aiming for a deliberate ambiguity in the phrase "monitoring health", between the literal sense of seeing where sickness is and the more metaphorical sense of good safeguards against misuse of private data. Maybe I was being too clever.
I know many people have written on this topic, but I really tried to capture the double-edged nature here. That is, the conflict between "That's so cool" for technical achievement, and "That's so scary" in terms of potential for abuse. As I think of it: Technology-positive social criticism.
I'm hoping to popularize a phrase I've used here: "surveillance engines"
[For all columns, see the page Seth Finkelstein | guardian.co.uk.]
Echo: The big business of net censorship - "Clamping down on free speech on the internet has been a lucrative enterprise for software manufacturers" - Jo Glanville
We know as much as we do because of the great research of organisations such as the OpenNet Initiative and because of the brave detective work done by researchers such as Seth Finkelstein and Ben Edelman. Under the Digital Millennium Copyright Act in the US, no one can legitimately examine the lists of blocked sites or ask for a review.
[Below is a guest post from Daniel Brandt, who gives his experiences and speculations. His views are of course his own and not necessarily mine, but I do believe them worth hearing.]
There is definitely some sort of filtering going on in Google's rankings for certain keywords. It took 18 months for any of the pages on my wikipedia-watch.org site to rank better than 200 deep or so for any combination of keywords from those pages. During this time, Yahoo and Live.com were ranking the same pages well for the same terms.
When I test terms on Google, I test with a multi-threaded private tool that checks more than 30 Google data centers on different Class Cs, and shows the rank up to 100 on each one. I can see changes kicking in and out as they propagate across these data centers. The transitions can take several days in normal cases, as when a new or modified page is incorporated into the results.
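The tool itself is private, so what follows is only a rough sketch of the general idea: ask several data centers the same query in parallel and record where a target site ranks on each. Every name here is hypothetical. In particular `fetch_results` is a caller-supplied function that returns an ordered list of result URLs for one data center; the sketch does not know how to query any real search engine.

```python
from concurrent.futures import ThreadPoolExecutor


def rank_on_datacenter(fetch_results, datacenter, query, target, depth=100):
    """Return the 1-based rank of target on one data center, or None if absent.

    fetch_results is a hypothetical callable (datacenter, query) -> list of URLs.
    """
    urls = fetch_results(datacenter, query)[:depth]
    for position, url in enumerate(urls, start=1):
        if target in url:
            return position
    return None


def check_all(fetch_results, datacenters, query, target, depth=100, workers=8):
    """Check every data center in parallel and map each one to its rank."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ranks = pool.map(
            lambda dc: rank_on_datacenter(fetch_results, dc, query, target, depth),
            datacenters,
        )
        # Consume the results while the pool is still open.
        return dict(zip(datacenters, ranks))
```

Running `check_all` repeatedly over several days and comparing the returned dictionaries would show a change appearing on some data centers before others.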
Wikipedia-watch.org has been a website now for 36 months. During the first half of that period, no pages ranked higher than 200 deep or so, even if you used two fairly uncommon words from that page to search for it (this is documented at wikipedia-watch.org/goohate.html). During the second half of that period, after it took about four months to settle into the transition, the deeper pages ranked okay, and were on a par with Yahoo and Live. But there was still one glaring exception to this rule: the search for the single word "wikipedia" failed to turn up the home page in the first 100 results almost all of the time during this second period.
When it did show up, it always ranked within the top 15. When it didn't show up, it was always greater than 100. There was never anything in between, and I've been watching this curiosity for the last six months now. For the first five of these months, it might kick in for a few hours on all data centers, and then disappear. This happened several times. Twice it kicked in for a few days, and then disappeared from the top 100 again. During the last 30 days, it has been in about half of the total time, for several days each time, and then disappeared again for days. It's always one or the other -- in the top 15 or not even in the top 100. Meanwhile, the deep pages have ranked okay the last 18 months, and have been stable this entire time.
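The all-or-nothing pattern described above can be checked mechanically once the rank observations are logged. This is a minimal sketch, assuming each observation is either a rank number or `None` when the page was not found within the depth checked; the function name and the bucket labels are made up for illustration.

```python
def summarize_ranks(observations, top=15, depth=100):
    """Bucket rank observations; None means 'not found within depth'."""
    in_top = sum(1 for r in observations if r is not None and r <= top)
    between = sum(1 for r in observations if r is not None and top < r <= depth)
    missing = sum(1 for r in observations if r is None or r > depth)
    total = len(observations)
    return {
        "in_top": in_top,      # ranked within the top 15
        "between": between,    # ranked 16..100, the bucket that stayed empty
        "missing": missing,    # not in the top 100 at all
        "share_in_top": in_top / total if total else 0.0,
    }
```

A "between" count that stays at zero across many observations is what distinguishes this behavior from ordinary ranking drift.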
This behavior is something I'm seeing only for the home page, and only on Google but not on Yahoo or Live. It happens almost exclusively when the word "wikipedia" is the solitary search term, or maybe this one word and another term that's also on that page. If you add a third term, my home page begins ranking reasonably well, presumably because the search is now specific enough to override the filtering. By the way, this home page has a PageRank of 5 and Yahoo counts 3,500 external backlinks to that home page (there's a counting tool at microsoft-watch.org/cgi-bin/ranking.htm). You cannot use Google to count backlinks, because for years now, Google has been deliberately suppressing this information.
I should also add here that for three years running, another site of mine, Scroogle.org, had a tool that compared the top 100 Google results for a search with the top 100 Yahoo results for that same search. This may come as a surprise to some, but the divergence was consistently 80 percent for all searches. In other words, only 20 out of 100 links showed up on both Yahoo and Google for any search, and the other 80 on each engine were unique in their top 100. The overall quality of the results was about even for each engine. To put this another way, there's a lot of wiggle room for a particular engine to vary the top results, and still look like they're providing the most relevant links.
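The divergence figure is simple arithmetic on two result lists. Here is a minimal sketch of that calculation, not the Scroogle tool itself; the function name is hypothetical, and it assumes the URLs have already been normalized so that the same page compares equal on both engines.

```python
def overlap(results_a, results_b, depth=100):
    """Return (shared_count, divergence_percent) for two ranked URL lists."""
    top_a = set(results_a[:depth])
    top_b = set(results_b[:depth])
    shared = top_a & top_b
    # Use the larger list as the base in case one engine returned fewer results.
    size = max(len(top_a), len(top_b))
    divergence_pct = 100.0 * (size - len(shared)) / size if size else 0.0
    return len(shared), divergence_pct
```

With 20 links in common out of 100 on each side, this gives 20 shared and a divergence of 80 percent, matching the figure above.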
To make this long story shorter, I believe that there is some sort of backend filter that affects which top results are shown by Google. This actually makes some sense, since most searchers never go beyond the first page of results (at 10 links per page). This means Google's reputation and ad revenue depend heavily on the utility of that first page. A filter that favors recency is one component of this, because Google jacks up recent forum and blog posts (and increasingly even news posts). Everyone expects this by now. Static sites such as wikipedia-watch.org must compete in this sort of environment.
In addition to the recency factor, I think there is filter weighting based on what I call "newbie searches." A newbie search is grandpa or grandma searching for single words such as "wikipedia" or "email" that normally return millions of results, which of course is useless to the searcher. Such searches are stupid to begin with, but Google must cater to stupidity in order to push ads, since ad revenue is 99 percent of total revenue. There might even be some sort of rotational weighting for newbie searches.
And call me a tin-foil hatter if you must, but I also believe that "hand jobs" are involved in tweaking this filter. In other words, there is a political dimension to it as well. Regrettably, I cannot prove this. We need more transparency from Google, and we need it now, before the situation becomes even more suspicious.
A few things that have not been echoed widely, and deserve more notice (not that I can change it much, but here's something ...)
Tony Comstock - Taking the Real Sex out of [Real Sex] Searches. (Is the googlebot erotophobic?)
"But last July I noticed that although the word [real] is #5 in Google’s listing of keywords in inbound links, [real] doesn’t appear anywhere in the Googlebot’s listing for our site content keywords. That’s right, the Googlebot doesn’t see the word [real] at the home of real life, real people, real sex."
This is actually really interesting, though I don't really have the tools to investigate it. I lean towards thinking it's some sort of spam or "trusted site" algorithmic issue rather than an anti-sex bias of Googlebot.
David Weinberger - Obama v. Bush: Google counts
Estimated Google hits for [“Barack Obama”] are more than [“George W. Bush”] and [“George Bush”] combined. This strikes me as a clear demonstration that the meaning of those hit numbers is not what one intuitively expects it to be. People read the numbers as full database counts, but it's known that they are merely statistical estimates. I suspect, just off the top of my head, that the results are heavily skewed by a recency bias in what's used for the estimate. I'd believe Barack Obama has been mentioned overall more than George Bush in the very recent past.
Walt Crawford - How Common is Common Language?
An extensive examination where Google is used as a testbed for analyzing the utility of checking phrases in cases of suspected plagiarism. "Even relatively short sentences seem to be unusual most of the time. On the order of 85% in this sample, and I suspect that percentage would be higher in a truly random sample. ... What I believe may be true: If you’re suspicious that a clumsy plagiarist has cut-and-pasted without paraphrasing, almost any medium-length sentence may suggest you should check further. It may be entirely innocent. But it seems surprisingly uncommon for the same, say, 11-word string to show up more than once."
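The study used Google phrase searches. As a rough local stand-in for the same idea, one can compare two texts directly and list the word runs they have in common. This is only a sketch under that swapped-in assumption: the function names are hypothetical, and the default of 11 words is taken from the quote above.

```python
import re


def word_runs(text, n=11):
    """Return the set of all n-word runs in text, lowercased, punctuation dropped."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_runs(text_a, text_b, n=11):
    """Return the n-word runs that appear in both texts."""
    return word_runs(text_a, n) & word_runs(text_b, n)
```

A non-empty result is a prompt to check further, not proof of copying, in keeping with the caution in the quote.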
Google's copyright war will have open access advocates up in arms
http://www.guardian.co.uk/technology/2008/nov/06/google-open-access-copyright
.. on the copyright issues surrounding Google's digitising of books
There's some value in enemy-of-my-enemy opposition, where the interests of an advertising near-monopoly are a counterweight to a content cartel. But battles between behemoth businesses should not be mistaken for friendship to libraries, authors or public interest.
[Update: I didn't pick the title, but I don't find it a problem]
As I've said before, normally I don't write about pure politics, since if my influence on Internet freedom is marginal, my influence on the electoral process isn't even a speck on the page. And while this election doesn't look close, you never know. I suspect my readership skews liberal and intellectual, but I probably have a few conservative, older, readers in the mix.
Vote for Barack Obama
I've never been one of the worshippers of "The One", but I actually have come to think more favorably of Barack Obama over the past few months. Politically, I am impressed by how he fought off the inevitable Swiftboating attempts, and the competence of the campaign organization overall.
In contrast, McCain's stunts like "suspending" his campaign during the financial crisis - and then doing nothing but grandstanding - refute any argument for his experience or leadership. Obama clearly demonstrated both intelligence and steadiness there.
By all governing measures, Obama has proved to be a better candidate than McCain - the people he surrounds himself with, the Vice President choice, the strategic decisions he's made, and, I believe, the policies he's advocated (Obama's quip "It's like these guys take pride in being ignorant" sums it up well).
So I endorse Barack Obama.
["Life Trumps Blogging", but some collected notes]
1) The obligatory pontification about the Google Book Search settlement, a topic on which all Google-interested pundits must write, will appear in my next Guardian column, in a few days.
2) Briefly noted: The Economist Innovation Awards and Summit
Business Process: Jimmy Wales, Founder, Wikipedia for public collaboration as a form of product and content development.
I have yet to see a more blatant business jargon way of saying "for electronic plantations full of digital sharecroppers".
3) Amazing story, from an unreliable source, of "Why Jimmy Wales got booted from Wikia's top job". I wouldn't have believed it, and it's been denied, but a reliable source confirmed to me that it's true. There looks to be some very strange backroom politics going on within Wikia (the company aiming to "commercialize the hell out" of Wikipedia concepts and success, though having no significant financial connection to the Wikimedia Foundation).