Infothought: google Archives

April 30, 2012

Google Power - from "Penguin" update to CISPA security bill

Another Google search algorithm update, another set of implications:

Petition: Google: Please kill your Penguin update

With the recent Google Penguin update, it has become nearly impossible for small content based websites to stay competitive with large publishers like eHow, WikiHow, Yahoo Answers and Amazon.
Countless webmasters have seen their livelihoods vanish overnight. ...

Have you noticed any great outcry in law/policy circles at this re-intermediation, as a potential threat to innovation? Any worry over the immense power vested in the whims of a single company, about what it might all mean for Freedom And The Internet? There's not much to be heard over the sounds of backscratching.

While it's definitely possible to go too far into seeing Google everywhere, one also shouldn't go too far in the other direction either, and pretend there's no economy of influence. Though I've basically given up on writing about the search algorithm implications myself. The SEO world knows all about it, they live it, they don't need to have it rehashed. The "connected" law/policy quasi-lobbyists don't want to know about it, ranging from unconcerned to actively hostile. For the remaining groups, well, we see how much notice is garnered by petitions like the above.

And the attention-driving aspect is further shown by articles over the battle about a "cyber-security" bill called CISPA: Why is Silicon Valley silent on CISPA?

In January, America's major tech companies joined everyday internet users to break the back of a reviled law called SOPA. Months later, Washington is brewing a new law that alarms many SOPA opponents — but this time the same companies have been quiet as church mice.
We put in calls about the vote to some of our Silicon Valley sources and the response has been nothing but crickets. Silence from Google. Ditto from Facebook. ....

Just a few months ago, the net was marinated in tales of how the evil SOPA-ians quaked before The Power Of Google, *cough*, I mean, The People. About how laws which threaten The Business Model Of Google, *cough*, I mean Civil Liberties, could no longer stand in the New Era. Fate gives us these little parallels to show how much that was all manipulation and feeding the masses delusions of significance. I have to grant that the end result of the SOPA battle did pass my test of being a positive outcome on civil-liberties (end-vs-means wouldn't be a difficult question if the "ends" view had nothing on its side). But it seems that's almost more accident than design.

Unfortunately, the only powerful faction making any of these points is the big media companies, who are Google's opponents, but not my friends or, perhaps more relevantly, patrons. While I don't want to be an unpaid Google lobbyist, it's even less appealing to be an unpaid media company flack.

Posted by Seth Finkelstein at 11:58 PM

May 13, 2011

The Google-IS-Evil! campaign, funded by Facebook

Kudos to Christopher Soghoian for the events where "Facebook Busted in Clumsy Smear on Google"

The social network secretly hired a PR firm to plant negative stories about the search giant, The Daily Beast's Dan Lyons reveals - a caper that is blowing up in their face, and escalating their war.

But the key lesson here, as the saying goes: It was worse than a crime, it was a blunder

It's informative to read the original pitch, and see the often implicit dealing made explicit:

I wanted to gauge your interest in authoring an op-ed this week for a top-tier media outlet on an important issue that I know you’re following closely. ....
I'm happy to help place the op-ed and assist in the drafting, if needed. For media targets, I was thinking about the Washington Post, Politico, The Hill, Roll Call or the Huffington Post.

That is, a PR firm is contacting someone they hope will act as a "front" for a nominal "opinion" piece, which is really part of a mudslinging campaign by a business rival (let us pause to remember how, e.g. Net Neutrality is a grassroots effort for freedom and liberty, right ...). The "front" will get attention, the rival will get credibility for their attack.

And we get only got this peek into the inner workings because it was a rookie error:

The two Burson operatives who ran the campaign - Jim Goldman, a former CNBC tech reporter, and John Mercurio, a former political reporter - are both former journalists new to Burson and new to PR. Their biggest mistake: reaching out to a blogger they didn't know with an email pitch that contained embarrassing information, including the offer to help write an op-ed bashing Google.

People don't believe me when I talk about how much agenda-setting is done by, well, those with agendas :-(.

Posted by Seth Finkelstein at 09:39 PM | Comments (1)

March 31, 2011

Waiting for the Google-IS-Evil! campaign, funded by Microsoft

Not an April Fools joke: "Microsoft is filing a formal complaint with the European Commission as part of the Commission's ongoing investigation into whether Google has violated European competition law."

I was thinking of writing an April Fools post inspired by this, describing a new fictitious campaign for "Google Neutrality", supposedly backed by Microsoft. Stuff like demanding immediate hearings on the ominous threat of a corporate behemoth which is strategically positioned at a choke-point of Internet operations. It might crush small fragile start-ups (oh, the start-ups, the precious start-ups) thus strangling innovation. Or engage in anti-competitive self-favoring dealing. Pay no attention to the apologism of its shills. Liberty itself demands that the monster in the making be tamed, via the dangerous but necessary action of government oversight.

But it was too difficult to get the satire right. And I suspect many people wouldn't get a multi-layered joke, of the irony of Microsoft targeting Google for what used to be said of Microsoft, while at the same time Google is targeting ISP's (that's Net Neutrality, if it's unclear). And I suppose nobody remembers that the Microsoft antitrust campaign was said to be a creature of Netscape (once upon a time, Netscape was rich).

Anyway, I'm waiting for Microsoft to start spreading some PR money around about "Google - Threat or Menace?". I think there's been a little bit of it, and I imagine I see a few of the money-followers making proposals of a sort. But the spigots apparently haven't opened up. If/when they do, look for conferences and articles and endless blog-posts claiming Google lies and arguing how Google must be stopped to preserve net freedom.

I've gotten very cynical about net politics.

Posted by Seth Finkelstein at 11:59 PM | Comments (1)

February 28, 2011

Google "Farmer" Update Blesses Wikipedia, Curses Mahalo, Centralization

The Google "Farmer" Update results, that is, the winnner and losers from Google's latest algorithm change regarding "content farms", have now been analyzed. So we have outcomes such as:

Let's see in detail what Google did to the affected domains. The first conclusion is quite straightforward: the number of keywords these domains are ranking for dropped dramatically. Looking at mahalo.com as an example, it went from 33,875 keywords before the update to just 9,740 keywords after the update went public – a decrease of more than 70%.

Versus gainers

1. Amazon.com
2. eHow.com
3. NexTag.com
4. Wikipedia.com [sic - should be wikipedia.org]
5. Walmart.com
6. Target.com
7. Etsy.com
8. Answers.Yahoo.com
9. Sears.com
10. bestonlinecoupons.com

Note a pattern above? Another step to centralization, with some aggregator sites anointed as winners, and some as losers. And Wikipedia ends up even more dominant on Google.

I have to remind myself I'm basically completely unable to get the law/policy types to realize the enormous extent to which Wikipedia is de facto subsidized by Google. Here, not only is Wikipedia getting yet another boost, but some of its arguable commercial competitors are being killed! It's not because Wikipedia has some magic itself, in "community" or "civility", or whatever huckerism is being hyped. Rather, it has the algorithm support of Google.

Another gem noted - "Google also said that if its YouTube site gained, that was "happenstance."". If a big ISP did a network management change that just by "happenstance" might have benefited an enormous media property it owned, accusations of bias and favoritism would be rife.

Bonus link - Search Neutrality as Disclosure and Auditing (Frank Pasquale)

Given these parallels, I've compared principles of broadband non-discrimination and search non-discrimination. But virtually every time the term "search neutrality" comes up in conversation, people tend to want to end the argument by saying "there is no one best way to order search results - editorial discretion is built into the process of ranking sites." ... To critics, a neutral search engine would have to perform the (impossible) task of ranking every site according to some Platonic ideal of merit. ... Neutrality is a very broad term, and the obvious differences between the technical operation of physical infrastructure and search engines should not stop us from applying certain broad principles to each entity.

But there's no money behind that.

Posted by Seth Finkelstein at 11:59 PM | Comments (6)

January 29, 2011

Google Algorithm, Spam, "Content Farms", Life and Death for Sites

The most recent Google algorithm change, and reactions, is showing yet again how much structural influence these deliberate choices have over sites. One of the concepts I've tried to get into certain discourse (and, for various reasons, pretty much failed) concerns the effect Google can have by making algorithmic changes which either favor or disfavor certain types of sites. When I attempt to explain this, usually to various people whose education was in law or philosophy or other humanities, the first problem I often find is that they have no idea what I'm talking about. They've heard Google doesn't make specific sites ranking choices. This then seems to displace anything else in terms of concepts. Especially an idea of making parameter choices which then affect specific sites (extra-credit: possible choices which are nominally global but which primarily affect one very prominent site, cough, Wikipedia, cough ...).

In a discussion thread above, one webmaster says their site was very negatively affected, and gave this interesting report:

What has replaced us you may ask? Well that's the fun part.
Result 1 Wikipedia with a general about for the game.
Result 2 A Ehow article from 4 years ago with absolutely no relevant content to the query.
Result 3 A hubpages article again that is totally out of date and useless to the querytype.
Result 4-24 I dont want to even bother typing as it is just about borderline spam.

I haven't verified the claims. But just look at the list. If it's correct, note the implications - independent site replaced by large centralizers again (and spam). One can understand the reasoning. However, it's not exactly something that just happens, or falls from the sky.

Every time one of these events happens, it's an instructive lesson to see all the seething by the small websites at the "bottom", and how little that matters at the "top". It's something to keep in mind the next time there's a punditry outrage-fest that supposedly by coincidence maps onto certain big-business fights.

Posted by Seth Finkelstein at 11:33 AM

December 01, 2010

Repeat - Google Makes CHOICES In Its Algorithm ("DecorMyEyes")

I'm going to try to get in on today's "DecorMyEyes" pile-on, which is basically a story about yet another company discovering that attention, even bad attention, can be Google-leveraged into high search rankings overall. This is very old news in general (see, e.g. my old piece on "Jew Watch"). But since a gatekeeper wrote about it recently, the issue became noticed again by other gatekeepers.

However, this time, Google took action:

Instead, in the last few days we developed an algorithmic solution which detects the merchant from the Times article along with hundreds of other merchants that, in our opinion, provide an extremely poor user experience. The algorithm we incorporated into our search rankings represents an initial solution to this issue, and Google users are now getting a better experience as a result.

That is, instead of just saying "It's an algorithm, the output is a result of an algorithm, no humans here, algorithm, algorithm, algorithm ...", humans changed the algorithm.

When I've tried make people realize Google can do the same things in a positive manner for, e.g. Wikipedia (regarded as a good "user experience"?) - it simply doesn't penetrate the pundit-world.

Now, of course, not all changes are going to be equally easy. One can't simply wish complicated calculations into behaving in ideal ways (no, you can't just downrank stupidity). But the other side of that problem is that Google has gotten an extensive free ride in terms of values by simply saying "algorithm".

Posted by Seth Finkelstein at 11:59 PM

October 21, 2009

Google, Bing - Twitter as "a vehicle for directing ... to large audiences"

Compare:

A recent _Wired_ Twitter article, quoting Twitter's CEO:

Do you understand how money flows to the Internet? When you know that Twitter is a vehicle for directing information and traffic to large audiences, you realize there’s obviously a huge business.

Microsoft Bing search:

Because today at Web 2.0 we announced that working with those clever birds over at Twitter, we now have access to the entire public Twitter feed and have a beta of Bing Twitter search for you to play with (in the US, for now).

Google blog:

In the past few years, an entirely new type of data has emerged — real-time updates like those on Twitter have appeared not only as a way for people to communicate their thoughts and feelings, but also as an interesting source of data about what is happening right now in regard to a particular topic.

Me: (a while back, for which I was much flamed)

People aren't being connected by the 'real-time messaging service', they're being bundled up and sold.

Once more - I refuse to be a sucker again. I will not play the latest rigged game where the house makes a fortune, the touts get their commission, while the players are fodder for it all.

Posted by Seth Finkelstein at 11:59 PM | Comments (1)

August 21, 2009

Google Book Search Settlement - The Enemy Of The Enemy Is Opposition's Friend

[I've been on blog-vacation, but I think I can say something interesting on this item]

One of several mysteries to me has been why the Internet Archive's opposition to the Google Books Settlement has recently been getting so much attention from professional law/policy types. Don't misunderstand my point regarding that attention, I'm all for it. But, in my experience, those kinds of public spirited projects are the type of thing that typically get obscure mentions in journal articles that nobody reads, usually trotted-out to justify the writer's visionary hobbyhorse. Proof - has anyone heard of Project Gutenberg recently?

So, when I read BBC - Tech giants unite against Google:

Three technology heavyweights are joining a coalition to fight Google's attempt to create what could be the world's largest virtual library.
Amazon, Microsoft and Yahoo will sign up to the Open Book Alliance being spearheaded by the Internet Archive.
They oppose a legal settlement that could make Google the main source for many online works.
"Google is trying to monopolise the library system," the Internet Archive's founder Brewster Kahle told BBC News.

A-ha! Mystery solved. Moneybags. Indeed, very large moneybags. If it were just librarians and civil-libertarians and free-culture people, nobody else would care. They'd just be kicked, at best. But big corporations are different matter. Their concerns are taken seriously.

Now, I'm not saying anything about the causality, or pawns, or making any statement about the morality of such a coalition. An enemy-of-my-enemy strategy can be good politics, even necessary against a behemoth like Google.

However, I suspect if before this was announced, I had speculated that all the attention to the project was a sign that something like this was potentially in the works (remember, these sorts of arrangements don't happen overnight, they can require months of negotiation) - I would have been roundly denounced as cynical in the extreme.

Again, I wish the endeavor well. I just find it darkly amusing to note the various forces at work here.

Bonus: Group-groom to/from Doc Searls - Unsettling books

Posted by Seth Finkelstein at 11:59 PM | Comments (2)

July 31, 2009

Uncommon Google items - Barbie, Books, Horizontal Hold

Linkblogging from the Distant Dorsal

1) Tom Slee - "Googling Barbie Again". He said it, not me:

"[Law Intellectual BigHead] made a big deal of the Google search results for Barbie in his book ... where he claimed that, whereas other search engines gave you only sales-related Barbie sites in the top ten, Google's "radically decentralized" algorithm revealed an entirely different picture of Barbie. ...
The one big change in the last 18 months is that the remaining countercultural site from 2008 has now been pushed over the edge to page 2 of the search results, displaced by two Google-owned collections of links (News and Videos). ...
... It should be no surprise that as the web has become mainstream, and as corporations realise the necessity of investing in their web presence, the web begins to look more like other mainstream media. Perhaps more evidence that the Web's counter-cultural moment is over.

2) I should have noted a while back Walt Crawford's long Cites & Insights discussing Perspective: The Google Books Search Settlement.

The agreement could be a lot worse. The outcome could also be a lot better. I'm sure Google would agree with both statements, as it finds itself in businesses where it has neither expertise nor much chance of advertising-level profits. At the same time, the copyright maximalists didn't quite win this round. We'll almost certainly get somewhat better access to several million OP books—and will have to hope (and work to see) that the price (monetary and otherwise) isn't too high.

I was reminded of it today given that the Harvard Berkman Center is running a workshop on "Alternative Approaches to Open Digital Libraries in the Shadow of the Google Book Search Settlement"

3) David Weinberger ~~inadvertently~~[Updated] provides a small lesson in how PageRank isn't everything in terms of Google ranking, in noting Britannica: #1 at Google

Today, for the very first time in my experience, The Encyclopedia Britannica was the #1 result at Google for a query ... It's good to see the EB making progress with its online offering, but I'm actually puzzled in this case. The query was "horizontal hold" (without quotes), and the EB page that's #1 is pretty much worthless. ... So, how did Google’s special sauce float this especially unhelpful page to the surface? ...

(I see it as #2 now, under a wiki.answers.com). I keep trying to tell various people that Google's ranking has multiple variables, but the simplistic model seems very difficult to displace.

[Update: David Weinberger commented: Seth, it was[n't] an "inadvertent" lesson. It was totally advertent. My reference to "secret sauce" intended to imply that Google's algorithm is complex and proprietary. And in the case I mentioned, those algorithms seem to have failed, for the top listing is unlikely to help anyone interested in the search terms ("horizontal hold").]

Posted by Seth Finkelstein at 11:59 PM | Comments (1)

March 26, 2009

My _Guardian_ column on Google's "interest-based advertising"

http://www.guardian.co.uk/technology/2009/mar/26/seth-finkelstein-google-advertising

Google's surveillance is taking us further down the road to hell

Google recently took another step along the path of surveillance as a service, launching what it called "interest-based advertising", and which everyone else calls "behavioural targeting".

I had suggested a title of "Google's interest-based advertising and surveillance as a service", as I was aimed for the keywords "interest-based advertising", and I wanted to emphasize the phrase "surveillance as a service" (that plays off "software as a service"). But the title they used is fine by me. It's definitely more attention-grabbing.

Althought there's certainly a lot of punditry on the topic, I hope I managed to say something that wasn't a rehash of the same points, by concentrating on some of the politics and public-relations issues. I particularly like my line about Google's tech gimmicks meaning that "Too many supposed watchdogs end up distracted by the equivalent of a chew toy."

And I've already seen that "chew toy" argument being made. I look forward to many, many, iterations over this, as Google sends out the flacks and apologists to preach how its massive monitoring network is no trouble at all, compared to the horrible ISP deep-packet-inspection (i.e. "Look over there - a monster!").

[Pre-emptive note: From checking comments elsewhere, please don't "explain" to me how according to your elaborate ideological theory of moral responsibility, Google is a saint while ISPs are devils. I've heard it. In fact, I will hear it from experts who spend their whole professional lives in the service of trying to make people believe corporate agendas are the essence of being human, and they're good at what they do. I'm a geek. I know all about the differences between cookie-based tracking and packet analysis. The whole point of my column is arguing that sort of thinking is the wrong way to approach these issues, because it's very flawed in practice.]

Posted by Seth Finkelstein at 08:09 AM | Comments (8)

January 21, 2009

My _Guardian_ column on Real Sex And The Google Search

"Google should learn the difference between real sex and spam"
http://www.guardian.co.uk/technology/2009/jan/22/google-censorship

"If humans argue so much about distinguishing between erotica and pornography, imagine the difficulty search algorithms have"

I can live with the title, but I suggested per above "Real Sex And The Google Search" - the idea was to make a pun on the search [real sex] discussed in the article, and real sex in reference to the sex-bloggers rather than commercial material aimed at purely prurient interest.

Amusingly (and self-referentially), right now the article is ranking at position #4 for a Google search for [real sex].

Posted by Seth Finkelstein at 09:36 PM | Comments (3)

November 26, 2008

My _Guardian_ column on Google Flu Trends

Why you should be concerned about Google Flu Trends

http://www.guardian.co.uk/technology/2008/nov/27/privacy-searchengines

The search engine has unwittingly hung a big sign on itself advertising services for government surveillance

The title's fine, though when I submitted it I proposed "Google Flu And Monitoring Health". I was aiming for a deliberate ambiguity in the phrase "monitoring health" between the literal sense of seeing where is sickness and more metaphorical sense of good safeguards against misuse of private data. Maybe I was being too clever.

I know many people have written on this topic, but I really tried to capture the double-edged nature here. That is, the conflict between "That's so cool" for technical achievement, and "That's so scary" in terms of potential for abuse. As I think of it: Technology-positive social criticism.

I'm hoping to popularize a phrase I've used here: "surveillance engines"

[For all columns, see the page Seth Finkelstein | guardian.co.uk.]

Posted by Seth Finkelstein at 11:59 PM | Comments (2)

November 17, 2008

Daniel Brandt (Scroogle, Google Watch) on Google ranking anomalies

[Below is a guest post from Daniel Brandt, who gives his experiences and speculations below. His views are of course his own and not necessarily my own, but I do believe them worth hearing]

There is definitely some sort of filtering going on in Google's rankings for certain keywords. It took 18 months for any of the pages on my wikipedia-watch.org site to rank better than 200 deep or so for any combination of keywords from those pages. During this time, Yahoo and Live.com were ranking the same pages well for the same terms.

When I test terms on Google, I test with a multi-threaded private tool that checks more than 30 Google data centers on different Class Cs, and shows the rank up to 100 on each one. I can see changes kicking in and out as they propagate across these data centers. The transitions can take several days in normal cases, as when a new or modified page is appropriated into the results.

Wikipedia-watch.org has been a website now for 36 months. During the first half of that period, no pages ranked higher than 200 deep or so, even if you used two fairly uncommon words from that page to search for it (this is documented at wikipedia-watch.org/goohate.html). During the second half of that period, after it took about four months to settle into the transition, the deeper pages ranked okay, and were on a par with Yahoo and Live. But there was still one glaring exception to this rule: the search for the single word "wikipedia" failed to turn up the home page in the first 100 results almost all of the time during this second period.

When it did show up, it always ranked within the top 15. When it didn't show up, it was always greater than 100. There was never anything in between, and I've been watching this curiosity for the last six months now. For the first five of these months, it might kick in for a few hours on all data centers, and then disappear. This happened several times. Twice it kicked in for a few days, and then disappeared from the top 100 again. During the last 30 days, it has been in about half of the total time, for several days each time, and then disappeared again for days. It's always one or the other -- in the top 15 or not even in the top 100. Meanwhile, the deep pages have ranked okay the last 18 months, and have been stable this entire time.

This behavior is something I'm seeing only for the home page, and only on Google but not on Yahoo or Live. It happens almost exclusively when the word "wikipedia" is the solitary search term, or maybe this one word and another term that's also on that page. If you add a third term you begin ranking reasonably well for my home page, presumably because the search is now specific enough to override the filtering. By the way, this home page has a PageRank of 5 and Yahoo counts 3,500 external backlinks to that home page (there's a counting tool at microsoft-watch.org/cgi-bin/ranking.htm). You cannot use Google to count backlinks, because for years now, Google has been deliberately suppressing this information.

I should also add here that for three years running, another site of mine, Scroogle.org, had a tool that compared the top 100 Google results for a search with the top 100 Yahoo results for that same search. This may come as a surprise to some, but the divergence was consistently 80 percent for all searches. In other words, only 20 out of 100 links showed up on both Yahoo and Google for any search, and the other 80 on each engine were unique in their top 100. The overall quality of the results was about even for each engine. To put this another way, there's a lot of wiggle room for a particular engine to vary the top results, and still look like they're providing the most relevant links.

To make this long story shorter, I believe that there is some sort of backend filter that affects which top results are shown by Google. This actually makes some sense, since most searchers never go beyond the first page of results (at 10 links per page). This means Google's reputation and ad revenue depend heavily on the utility of that first page. A filter that favors recency is one component of this, because Google jacks up recent forum and blog posts (and increasingly even news posts). Everyone expects this by now. Static sites such as wikipedia-watch.org must compete in this sort of environment.

In addition to the recency factor, I think there is filter weighting based on what I call "newbie searches." A newbie search is grandpa or grandma searching for single words such as "wikipedia" or "email" that normally return millions of results, which of course is useless to the searcher. Such searches are stupid to begin with, but Google must cater to stupidity in order to push ads, since ad revenue is 99 percent of total revenue. There might even be some sort of rotational weighting for newbie searches.

And call me a tin-foil hatter if you must, but I also believe that "hand jobs" are involved in tweaking this filter. In other words, there is a political dimension to it as well. Regrettably, I cannot prove this. We need more transparency from Google, and we need it now, before the situation becomes even more suspicious.

Posted by Seth Finkelstein at 04:23 PM | Comments (8)

November 13, 2008

Uncommon Google items - "Real Sex", Counts, Distinctive Sentences

A few things that have not been echoed widely, and deserve more notice (not that I can change it much, but here's something ...)

Tony Comstock - Taking the Real Sex out of [Real Sex] Searches. (Is the googlebot erotophobic?)

"But last July I noticed that although the word [real] is #5 in Google’s listing of keywords in inbound links, [real] doesn’t appear anywhere in the Googlebot’s listing for our site content keywords. That’s right, the Googlebot doesn’t see the word [real] at the home of real life, real people, real sex."

This is actually really interesting, though I don't really have the tools to investigate it. I lean towards thinking it's some sort of spam or "trusted site" algorithmic issue rather than an anti-sex bias of Googlebot.

David Weinberger - Obama v. Bush: Google counts

Estimated Google hits for [“Barack Obama”] are more than [“George W. Bush”] and [“George Bush”] combined. This strikes me as a clear demonstration that the meaning of those hit numbers is not what one intuitively expects them to be. It's known that the numbers are not full database counts - people read them as full database counts, but they are merely a statistical estimate. I suspect, just off the top of my head, that the results are heavily skewed by a recency bias in what's used for the estimate. I'd believe Barack Obama has been mentioned overall more than George Bush in the very recent past.

Walt Crawford - How Common is Common Language?

An extensive examination where Google is used as a testbed for analyzing the utility of checking phrases in cases of suspected plagiarism. "Even relatively short sentences seem to be unusual most of the time. On the order of 85% in this sample, and I suspect that percentage would be higher in a truly random sample. ... What I believe may be true: If you’re suspicious that a clumsy plagiarist has cut-and-pasted without paraphrasing, almost any medium-length sentence may suggest you should check further. It may be entirely innocent. But it seems surprisingly uncommon for the same, say, 11-word string to show up more than once."

Posted by Seth Finkelstein at 09:04 PM | Comments (3)

November 06, 2008

My _Guardian_ column on Google Book Search Settlement

Google's copyright war will have open access advocates up in arms

http://www.guardian.co.uk/technology/2008/nov/06/google-open-access-copyright

.. on the copyright issues surrounding Google's digitising of books
There's some value in enemy-of-my-enemy opposition, where the interests of an advertising near-monopoly are a counterweight to a content cartel. But battles between behemoth businesses should not be mistaken for friendship to libraries, authors or public interest.

[Update: I didn't pick the title, but I don't find it a problem]

Posted by Seth Finkelstein at 07:23 AM | Comments (3)

October 20, 2008

Google, Wikipedia - seeing RE-INTERMEDIATION in action!

One of the concepts I've tried to advocate (pretty much futilely) against the web evangelists who blather on about the buzzword "disintermediation", is that they are talking nonsense. My counter-buzzphrasing is, "There is re-intermediation". That is new centralization (new gatekeepers), new centers of power.

Nick Carr is now making this point better heard in The centripetal web

Technorati just couldn't compete with Google's resources. But it wasn't just a matter of responsiveness and reliability. As a web-services conglomerate, Google made it easy to enter one keyword and then do a series of different searches from its site. ... Google offered the path of least resistance, and I happily took it. ... I thought of this today as I read, ... a report that people seem to be abandoning Bloglines, the popular online feed reader, and that many of them are coming to use Google Reader instead. ...

By coincidence, Philipp Lenssen just posted about Google Now Allows Sites to Serve Content to Them While Showing a Registration Box to Non-Google Users, noting one implication being:

the barrier for competing search engines, existing and future ones, being raised... because Google may now be offered a key by some sites, something which the same site may not bother implementing for the new engine on the block (if that other engine would also suggest a first click policy). If this policy would ever become wide-spread, the next Larry and Sergeys of today writing a web crawler would face a lot of new dead ends: "Google exclusive" crawl territory, a place where newcomers need to ask permission first.

One the biggest examples of re-intermediation (driven by Google) has of course been Wikipedia, and Nick Carr observes in his post:

One of the untold stories of Wikipedia is the way it has siphoned traffic from small, specialist sites, even though those sites often have better information about the topics they cover

I actually argued this point in an academic discussion thread over a legal case, where I pointed out the process in action. That a link/attention to a poorly-fitting Wikipedia article supplanted attention and ranking from specialists who were experts on the topic (n.b. I didn't mean me, but real lawyers who were on top of the legal issues). The dream of blogging was that such specialists would supplant the superficial "MSM" ("Mainstream Media"), but instead we're just getting the potentially worse Wikipedia. But I just ended up getting flamed, maybe Nick Carr will do better (centralization of critics? 1/2 :-)).

Posted by Seth Finkelstein at 08:09 AM | Comments (3)

October 15, 2008

My _Guardian_ column on truth, Tim Berners-Lee, World Wide Web Foundation

Please, Sir Tim Berners Lee: try investigating how corporations rule the net
http://www.guardian.co.uk/technology/2008/oct/16/censorship-timbernerslee

"Seth Finkelstein on Tim Berners-Lee who raises the issue of separating truth from fiction on the internet"

[Note: I didn't pick that title - my own suggested title was 'Tim Berners-Lee takes on "The Net of a Million Lies"']

Here I discuss the ever-popular topic of finding truth among the lies. But I hope I acknowledged some of the cliches about the subject, and got beyond them a little.

Long-time net civil-liberties people might enjoy the references to the old "PICS" ("Platform for Internet Content Selection") proposals, which I sardonically note were derided as "Platform for Internet Censorship System".

I also weave in the effect of Google, from the uncommon angle that it's algorithmic technology wasn't very successful until the company turned into a advertising-selling platform. There's a profound lesson there.

[For all columns, see the page Seth Finkelstein | guardian.co.uk.]

Posted by Seth Finkelstein at 08:20 PM | Comments (2)

October 05, 2008

Google / NSA Freedom-Of-Information Act Request Yields Little

I should have mentioned this earlier, but Philipp Lenssen at Google Blogoscoped has posted (with permission) an analysis I did when he asked about the meaning of a journalist's freedom-of-information act request case file regarding the National Security Agency (NSA) and its connections to Google. Basically, nothing came of the journalist's request. All that was released was that the NSA bought 4 Google search appliances, a 2-year warranty covering replacement on all of them, and 100 hours worth of support consulting.

Any deep connections that Google has to the NSA and the CIA are not going to be found so easily. Find that sort of stuff requires either a huge amount of legwork with the right sources in the intelligence community, or a whistleblower.

Posted by Seth Finkelstein at 11:59 PM

September 15, 2008

Google effects as Digital Sharecroppers leave Wikia's Electronic Plantation

The Transformers (shape-changing toy robots) fan wiki-community, which I wrote about in my Guardian article concerning Wikia digital sharecroppers leaving the electronic plantation, has now completed their site emigration away from the mandatory ad-farm that forms Wikia's business plan. I wish them well. Now one interesting question is what happens in terms of Google rankings for the two sites.

Notably, the process of moving the site involved stripping out automatically inserted backlinks to Wikia in the pages generated to move the site, as explained in the post "The last helicopter out of Wikia (filtering page text)"

Wikia has inserted an extra link back to itself! in the exported text! Don't believe me? Check it out! How obnoxious! That's at the bottom of every page! ...
But gosh, it sure makes it harder for us to leave, doesn't it? And when we do - why there's millions of links from us back to Wikia's near-identical content! Links that improve their Google ratings... and harm ours. (Google looks down on re-presented content.)

[That "millions" is definitely an over-estimate, since all the history versions of wiki pages are not indexed, but the main idea still stands]

Anyway, does community win over inertia and cross-promotion? This is a fascinating test. Good luck climbing Mount PageRank ...

Posted by Seth Finkelstein at 11:59 PM | Comments (3)

September 04, 2008

Google Results Domination by Wikipedia - another study

Another study about Google's high ranking of Wikipedia articles:

"Just How Powerful Is Wikipedia?"

Well, with 70% of the US using Google (and that number is similarly high is many countries around the world) to find information, it would definitely be important if Google was very often sending us to Wikipedia. ....
... we found that an amazing 50.2% of the top 1000 searches had a Wikipedia result on the first page. (That's 502 out of 1000 for the math challenged.) We theorized that many of the "no" results likely came from the large number of porn terms on the list, and a cleaner list of family friendly terms might favor Wikipedia even more.

This overall result is of course not new - but I think it shows one reason why the extensive law / policy marketing of Wikipedia is a cause for concern.

Posted by Seth Finkelstein at 11:59 PM | Comments (2)

September 02, 2008

Obligatory "Google Chrome" Browser Blog Useless Cloudy Post

Sniff:

Note: There is no working Chromium-based browser on Linux. Although many Chromium submodules build under Linux and a few unit tests pass, all that runs is a command-line "all tests pass" executable.

Windows-only. That says something. I'm not sure exactly what. But then again, everything else has been said by everyone else anyway, so I'm sure what it says has already been said by someone else.

I wish people would stop using the word "cloud" for "remote services". Do you get electricity from the "cloud"? Does food come from the "cloud"? It's the same fogginess (pun unintended) as "cyberspace".

Google's one of the biggest "remote-services" companies in existence, so it's developing a browser which is optimized for those applications. Simple. Got it. Call it "mist wisp aura-laden computing', and suddenly it sounds far more opaque.

Posted by Seth Finkelstein at 11:38 PM

August 09, 2008

Google Knol Ranking issues revisited

There's been an ongoing thread of extensive Knol rankings discussion between search expert Danny Sullivan and Google oracle Matt Cutts, as to the issue of whether Google favors Knol in ranking. Key questions asked by Danny Sullivan:

* Are some domains seen as "trusted" by Google, so that any page within them gains some of that trust in ranking mechanisms.
* If so, is this trust transmitted to subdomains of a domain?
I think the answer to the first question is yes and the answer to the second is no. But I'd like Google to give us an on-the-record answer. It would help with the debunking.

I too suspect the answers are "yes" and "no" respectively. The poster-child for the "yes" answer to the first is Wikipedia, though other superlinked sites like Amazon or IMDB are evidence also. However, I could be convinced this is just a practical effect of trust flowing across pages.

The second question is trickier. It's something where the answer to it is not necessarily to the question which should be asked. That is, I can well imagine Matt Cutts hypothetically saying something like "No, knol.google.com is treated in the code exactly the same if were plain old knol.com - it has no ranking advantage from google.com". And that might be the absolute literal truth. However, for example, knol.google.com received at least one great link at launch from the front page of scholar.google.com (cached - it's gone from the current version). So, then, any site which got a front page link from a site as trusted and highly ranked as scholar.google.com would be treated the same. Matt Cutts again: "We try to rank all our content on a level playing field." (I miss vocal inflection - please try to read the quotation as relayed with a very dry tone).

Danny Sullivan also said:

I'm saying that knol.google.com seemed to have, when I wrote this, quickly gained enough authority ON ITS OWN that pages within it did better -- that a page I never mentioned, which seemed to have practically no links pointing it -- shot to 28 out of 755,000 pages. Sorry, that's just not something I think you'd see happen on most brand new sites. And again, not because Google did anything to favor itself. Just because the Knol site rapidly gained authority.

The problem is the words "on its own" well, they remind me of bloggers who leaped to the A-list due to being media quasi-celebrities or wealthy, and pontificate how it's a level playing field - meaning, anyone who is rich or famous could do the same thing (again, that's "democracy", web 2.0 style!). In essence, knol had a very (link)rich and (media)famous "parentage", so pages on it ranked - and will rank - accordingly.

Note the high-trust-centralization effect has some very under-examined implications, but there's little support to explore that :-(.

[Update: Memesterbation link]

Posted by Seth Finkelstein at 11:59 PM

July 29, 2008

The Hyperlinked Society :: Google, Links, and Popularity versus Authority

"The Hyperlinked Society: Questioning Connections in the Digital Age", just published, contains a chapter by me on "Google, Links, and Popularity versus Authority"

The entire book is on-line, and linkable, so you can read and link to it.

It's a good chapter, if I do say so myself, exploring the way search algorithms can embody various social values. And it's written in a style that liberal arts type should be able to handle, yet geeks should find tolerable (that is, the amount of information is high, the humanities jargon non-existent, and I endeavour to be clear and logical rather than obscure and verbose). I'm particularly proud of the part where I managed to weave in a several decades old news judgement description as an algorithmic determination.

The chapter is probably the best example of what I was considering writing a few years ago, when I had thoughts about what I described as doing Lawrence Lessig's "Code And Other Laws Of Cyberspace" from a technical perspective. But this is probably going to be my last "academic" writing. It's been evident for a while now that I have no future in that area. And creating this sort of material is not a very fulfilling hobby.

Anyway, read the chapter, read the book.

Posted by Seth Finkelstein at 06:05 PM

July 28, 2008

Google Knol Ranking - Google NOT favoring Knol (the way you think ...)

I've not bothered to write an obligatory "Google Knol" post, but I'm going to try to weigh in on the Great Google Knol Ranking Controversy, which is whether Google is artificially boosting the ranking of Knol articles. My contention:

1. Short answer: No

2. Slightly longer answer: Yes, but not in the way you think.

To begin with, simply as a historical evidence sanity-check, we've got many examples to consider as to whether Google gives its own properties any special favorable treatment. Google owns Blogger, but doesn't seem to give blogger posts any favored ranking over similar posts (favoring blog posts in general is another issue). Google owns YouTube, and yes, there have been rumors there, but YouTube is also essentially winner-take-all category dominator (self-reinforcing, true). Social network Orkut is an also-ran now. So there's no history of strong ranking promotion.

And critically: FAVORING KNOL WOULD BE STUPID. It's an unproven, profitless project at this point. Moreover, if Google was going to be evil here, the smart thing to do would be to turn up the crank slowly over months, like the boiling-a-frog cliche. Not hang a big sign out with an invitation of roughly "Sue us for anti-trust violations and abuse of monopoly power".

Matt Cutts, Google's most well-known blogger, has said

Hi Dare, as Ben Yates mentions, several of these knols were featured on the front page of Knol and therefore a lot of people writing about Knol were linking to these knols and passing PageRank and anchortext. I saw multiple people talking about and linking to Aaron's knol as well. It can sometimes take some time for our crawl/indexing system to determine how much trust or weight to assign to new web pages. As part of the process of Knol launching, I'm guessing our crawl/indexing system will continue to adjust appropriately over the next few days.

Which brings me to the "Yes" answer. Sure, anything featured by a megacorporation with lots of buzz around it will rank equally well. Anyone with the same amount of enormous power (in terms of attention and popularity) can expect the same result. That's "web 2.0" democracy for you!

The real thing to worry about is not some crude code like
"if ($site eq 'knol.google.com') { $ranking = $SUPERHIGH; }"

The real thing to worry about is if someone inside the search ranking group has told the Knol group the deep dark secret formula which has driven Wikipedia's metastasis throughout all of Google's results.
[note that's "secret" as in "trade secret", not conspiracy]

Posted by Seth Finkelstein at 02:32 PM | Comments (2)

June 09, 2008

Nick Carr: "Is Google Making Us Stupid?", and Man vs. Machine

Nick Carr has an essay "Is Google Making Us Stupid?", where he applies a certain effects-of-tech-on-humans framework to Google. I know Nick is a very smart and learned man, so I read his thoughts carefully. I suspect there will be a certain amount of noise in reaction to what he wrote, as some with less regard for him will take away a superficial impression and go into standard techno-utopian rants against that ("Luddite!" is a tip-off you're reading one of these).

However, I'll try to outline what I found unsatisfying, by talking a bit about some of the meta-issues (I have to spell out I'm deliberately doing this, otherwise the widely-varying contexts tend to make it look like I'm *only* talking about myself).

When I read articles such as the above, I'm very aware that there is indeed a science/humanities "Two Cultures" divide. And I'm on one side of it (science) while many pundits are on the other (humanities). One basic way to tell the difference is essentially when science types can extend "themselves" through technology, they think "This is cool! Wonderful! Great! More!", while humanities types angst about "How has the basic nature of our essential souls been corrupted?". Note this angst-ing effect generally applies only to technology they haven't grown up with - for example, you don't see a lot of articles bemoaning how the telephone disembodies us into ghostly vocal presences. Of course, the more intelligent humanities types, like Nick, know this history, and it's clear especially towards the end of his piece. But they write the angst-filled articles all the same.

To demonstrate, here's a paragraph shot through with those themes (my interpolations are in the brackets):

Still, their easy assumption that we'd all "be better off" if our brains were supplemented, or even replaced, by an artificial intelligence is unsettling [tech: "Neat!", lib-arts: "Scary!"]. It suggests a belief that intelligence is the output of a mechanical process, a series of discrete steps that can be isolated, measured, and optimized. [tech: "Yeah!", lib-arts: "My soul!"]. In Google's world, the world we enter when we go online, there's little place for the fuzziness of contemplation. Ambiguity is not an opening for insight but a bug to be fixed. [tech: "Math rules!", lib-arts: "Poetry rules!"]. The human brain is just an outdated computer that needs a faster processor and a bigger hard drive. [tech: "Humans are machines!". lib-arts: "Humans are divine!"]

So I've often wished there was more support for what I call "technology-positive social criticism". By which I mean that criticism of techo-hype and marketing hucksters often seems to end up couched in a certain type of fogeyism (which alienates tech types) because there's no other power-center supporting that criticism. I sometimes don't want to alienate those who write in this fogeyist idiom. But it's a struggle.

Posted by Seth Finkelstein at 11:59 PM | Comments (7)

April 24, 2008

Wikipedia as Google-Weapon

A huge hornet's nest has been stirred up with the posting of some private emails about supposed plans by a group, "isra-pedia", to use various tactics on Wikipedia to favor pro-Israel viewpoints in various disputes. There's alleged leaked group mail (the host website is untrustworthy, but Wikipedia administrative discussion provides some evidence that the group mail is authentic). My favorite part:

Every time you see a Hamas person makes an outragous statements (like Jews came from apes or kill the jews) you write a small article about that peroson (google his name to find more ) and bring the quote from memri.
why doing all that ?
because google is wikipedia friend - 3 days after you created the article google the person's name again and voila your article will be the #1 in google for that name.

It's by no means news that Wikipedia's Google rank can be used to go after people. But it's nice to have it stated so bluntly and with such obvious intent.

Now, the plans outlined seems to have been more somebody's idea of a good manipulation scheme than anything which they were able to do. But maybe this is merely amateurs who couldn't pull it off, and got caught.

Posted by Seth Finkelstein at 11:59 PM | Comments (1)

April 14, 2008

"Expelled" Google-lobbying

Expelled Exposed is a rebuttal to an anti-evolution films. There's what I'll call a "Google-lobbying" campaign around the search rankings:

We need to get the NCSE's [National Center for Science Education's] counter-site to the hideous little propaganda film, Expelled, to rank higher in the search engines. The way to do this is for lots and lots of you to link to the Expelled Exposed site with the word Expelled.

Note I'd say this isn't a "Google-bomb", since the target site wants the high ranking itself. And Google's algorithmic changes to defuse the bombs aren't applicable here, since the words appear extensively on the site.

On the other hand, I don't know if they'll be enough interest to have much impact, unless it becomes a cause-celebre. We'll see.

Posted by Seth Finkelstein at 11:59 PM

April 09, 2008

Google Free Hamsters Dance

If it mattered, I'd write about other stuff, but I saw this post by Jeneane Sessum on Google ranking for a post about hamsters:

... because you are a blogger of some renown, Google makes sure your free hamsters post comes up on the FIRST PAGE of google search results for the term Free Hamsters, and that the image of your free hamster babies (who are now long since gone, as Google's memory long outlives a hamster's puny 2-3 year lifespan) will remain forever in the number one spot for Google image results ...
No I am not kidding you. A near seven-year blogging legacy, and the most traction I've gotten on any one post [...] is my baby hamster post.

Though I suspect that over the whole English-speaking world, many more people are interested in hamsters than anything having to do with "Web 2.0"/blog-marketing/etc. :-). It sorts of puts it all in perspective ...

But that ranking is actually an interesting result. At #7 for [free hamsters], #2 for ["free hamsters"]. And the page itself has very few links. Somewhere in Google's mind, it thinks this is somehow very relevant to free hamsters. More so than many pet stores which naively might be thought to dominate such a search. Very strange.

Posted by Seth Finkelstein at 05:41 PM | Comments (1)

March 21, 2008

Google Roundup - Popularity, Books, Ads, Surveillance

Under-echoed Google items which have crossed my screen:

"Mr. Google's Guidebook" - A long post by Tom Slee explaining in literary-story style some of the problems with herd-mentality aspects of using link-popularity. Sites which are popular then become more popular, leading to a entrenched dominance of early winners (by the way, Google in particular and search experts in general do know about this issue, and try to add in some other factors, but that leads to other problems, etc.).

Competing books: What Would Google Do? (answer: index them and sell little ads on search) - Siva Vaidhyanathan notes that, including him, there's four books coming out on (my phrasing) the Google-and-society book bandwagon.

"Google's riches rely on ads, algorithms, and worldwide confusion - Cade Metz has an extensive irreverent piece on theses topics

The Externalities of Search 2.0: The Emerging Privacy Threats when the Drive for the Perfect Search Engine meets Web 2.0 - Michael Zimmer, "... this paper argues that the drive for Search 2.0 necessarily requires the widespread monitoring and aggregation of a users’ online personal and intellectual activities, bringing with it particular externalities, such as threats to informational privacy while online." (as I've put it: "The price of total personalization is total surveillance."). It's part of Special issue of First Monday: Critical Perspectives on Web 2.0, which is all probably of interest. And yes, I love the title of Søren Mørk Petersen's article there: Loser Generated Content: From Participation to Exploitation

Posted by Seth Finkelstein at 09:09 AM | Comments (2)

March 02, 2008

Google-Searching for "Barbie", or SEO as Socio-Economic Operation

Tom Slee on the [Barbie] Google results and how they've changed now:

... this search is basically owned by Mattel. Clicking the top link takes you to a pink page with "Think Pink" written in the middle of it, and the majority of the sites feature pink prominently.
No more defining the cultural symbols of our day for you, nine-year-old girl! Quit the self-aware political discourse and get back to dressing that doll in gender-appropriate colours (as selected for you by Mattel).

In other words, people who point to Google results as some sort of mass-mind or harbinger of popular will, often neglect (or wilfully ignore) that there's quite an industry around them. And that industry interacts with other senses of industry, bringing us back to where we were before in terms of corporate control of media.

Posted by Seth Finkelstein at 11:48 PM

February 13, 2008

The Cognitive Algorithm of When-Is-Google-Evil?

Shelley Powers on the spurious story claiming Google hijacks errors page

What really surprised me about this story, though, is that if people are so quick to accuse Google of 'evil' behavior in an innocuous situations like this, why was the idea of Google helping to bail out Yahoo to keep the latter out of the hands of Microsoft seen as a "good" thing? I would think a search engine monopoly in the hands of Google would be potentially more evil than Google providing useful features for default 404 error handling.
This environment is confusingly inconsistent at times.

It's a bit like how Libertarians will argue that the government is intrinsically incompetent and corrupt, but can be trusted with nuclear weapons which might literally destroy civilization as we know it. Or perhaps in general it's little things that people can see make for far better attention-getting articles than big abstract problems which are hard to conceptualize.

Also, connecting back to the "AutoLink" incident a while ago, I think there's a theme of "Don't Touch My Stuff!". You can take over the world, but don't touch the stuff. Which is actually a pretty common reaction.

Posted by Seth Finkelstein at 06:49 PM | Comments (2)

January 24, 2008

My _Guardian_ column on Wikia Search and Google-FUD

http://www.guardian.co.uk/technology/2008/jan/24/searchengines.wikipedia
Even search engines have an axe to grind

"Wikia Search tries to draw on the fear and doubt stemming from the dominance of Google"

I've tried to pack a lot into this column, everything from the $50K price for the "Grub" crawler" to pointing out how the politics of search can be used for free labor. I also bent over backwards not to even seem to be using the column to retaliate against Jimmy Wales's conduct, and he ends up only being mentioned in specific for identification (sadly, as far as I've ever seen, it's never done me any good to be morally better my attackers in terms of not abusing power, but I think I read too many comics books as a kid with Good triumphing over Evil - it doesn't work that way in real life).

Posted by Seth Finkelstein at 01:00 AM | Comments (1)

December 17, 2007

"Google Hijacked" debunking follow-up - guest post from ISP owner

[User Generated Content! Let's call this a guest-post, taken from the comments in the DEBUNKING "Google Hijacked" - The Sky, err, The Internet, Is NOT Falling! thread. Note the views and opinions expressed below are those of the writer, not me, though I am broadly in agreement on many points]

Brett Glass here; you may remember me as a long time columnist for magazines such as InfoWorld, BYTE, and PC World. I'm now (among other things) running an ISP, and think that people should think about what Rogers [ISP in Canada] is doing from an ISP's perspective. I've posted some of the text below to the comment sections of a few other blogs, but want to post it here too because it's relevant.

Network neutrality means not using one's control of the pipe to disadvantage competitive content or service providers. For example, if you're a cable company that offers VoIP, network neutrality means not blocking customers' use of other VoIP providers.

Network neutrality does NOT mean that a provider can't "frame" pages (as do many providers -- especially those like Juno which provide inexpensive or free service) or send them informative messages via their browser.

Let's step back and take a dispassionate look at what Rogers is really doing here. They need to get a message to a customer. Like any experienced ISP, they know that there's a good chance that e-mail won't be read in a timely way, if at all. (We, as an ISP, find that our customers constantly change their addresses -- often after revealing them online and exposing them to spammers -- without any notice, and often let the mailboxes that we give them fill up, unread, until they exceed their quotas and no more can be received.) The Windows Message Service once worked to send users messages, but only ran on Windows and is now routinely blocked because it's become an avenue for pop-up spam. Snail mail? Expensive and slow... and the whole point of the Internet is to do things faster and more efficiently than that. Give users an special program to display messages from the ISP? Users have too many things running in the background, cluttering their computers, already -- so no one could blame them if they didn't install it. (Also, many users won't install an application for fear of viruses, and alternative operating systems likely would not run the software.) Display a different page than the user requested? Perhaps, but that certainly comes much closer to "hijacking" than what Rogers is doing. Display a message in the user's browser window (where we know he or she is looking) along with the Web page, and let the user "dismiss" it as soon as it's noticed? Excellent idea. A wonderful, simple, unobtrusive, and (IMHO) elegant solution to the problem.

Now comes Lauren Weinstein -- known for drawing attention to himself by sensationalizing tempests in a teapot -- who has never run an ISP but seems to like to dictate what they do. Lauren claims that the sky will fall if ISPs use this nearly ideal way of communicating with their customers.

Contrary to the claims of Mr. Weinstein's "network neutrality squad" (who have expanded the definition of "network neutrality" to mean "ISPs not doing anything which we, as unappointed regulators, do not approve"), this means of communication does not violate copyrights. Why? First of all, the message from the ISP appears entirely above, and separate from, the content of the page in the browser window. It's not much different that displaying it in a different pane (which, by the way, the browser might also be able to do -- but this is better because it's less obtrusive and unlikely to fail for the lack of Javascript or distort the page below). The display can't be considered a derivative work, because no human is adding his own creative expression to someone else's creation. A machine -- which can't create copyrighted works or derivative ones -- is simply putting a message above the page in the same browser window.

It isn't defacement, because the original page appears exactly as it was intended -- just farther down in the window. And it isn't "hijacking," because the user is still getting the page he or she requested.

What's more, there's no way that it can be said to be "non-neutral." The proxy which inserts the message into the window doesn't know or care what content lies below. The screen capture in Weinstein's blog showed Google, but it just as easily could have been Yahoo!, or MySpace, or Slashdot. For the same reason, it can't be said to be an invasion of privacy, because the software isn't looking at the content of the page above which it is inserting the message.

In short, to complain that this practice is somehow injurious to the author of the original page is akin to an author complaining that his book has been injured by being displayed in a shop window along with another book by someone he didn't like. (Sorry, sir, but the merchant is allowed to do that.)

Nor is what Rogers is doing a violation of an ISP's "common carrier" obligations (even if they were considered to be common carriers, which under US law, at any rate, they are not). Common carriers have been injecting notices into communications streams since time immemorial ("Please deposit 50 cents for the next 3 minutes"). And television stations have been superimposing images on program content at least since the early 1960s, when (I'm dating myself here) Sandy Becker's "Max the burglar" dashed across the screen during kids' cartoon shows and the first caller to report his presence won a prize. (The game was called "Catch Max.") And in the US, Federal law -- in particular, Section 230 of the Communications Decency Act -- protects ISPs from liability for content they retransmit whether or not they are considered to be common carriers. They do not lose this protection if there happens to be other content from a different source in the same window on the user's PC.

There are sure to be some folks -- perhaps people who are frustrated with their ISPs for other reasons -- who will take this as an opportunity to lash out at ISPs. But most customers, I think, will recognize this as a good and sensible way for a company to contact its customers. Our small ISP is looking into it. In fact, because the issue is being raised, we're adding authorization to do it to our Terms of Service, so that users will be put on notice that they might receive a message through their browsers one day. I suppose it's possible that a customer might dislike this mode of communication and go elsewhere, but I suspect that most of them will appreciate it. In the meantime, let's just say "no" to regulation of the Internet.

Posted by Seth Finkelstein at 11:59 PM | Comments (19)

December 11, 2007

DEBUNKING "Google Hijacked" - The Sky, err, The Internet, Is NOT Falling!

[I wrote this for a mailing list, before the story started spreading all over the usual places. I didn't even get through there [sad face image] ]

Regarding Lauren Weinstein's post on "Google Hijacked -- Major ISP to Intercept and Modify Web Pages"

This is apparently not quite the danger it may appear at first glance.

The product at issue, PerfTech, seems to have been around AND USED for a while, for example:

http://www.codeamber.org/news/PR020205_2230_code_amber_perftech_press.html
Code Amber Utilizes PerfTech to Reach ISP Customers
February 2, 2005

"Code Amber (http://www.codeamber.org) and Wide Open West (WOW!) Internet and Cable last week delivered an Indiana Amber Alert to customers in the neighboring state of Ohio, enabled by a product deployed in WOW!'s network that allows the Internet provider to deliver bulletins directly to the screens of its browsing subscribers."

A look at http://www.perftech.com/press.html shows this is hardly a stealth application - they tout advertising-insertion as a *feature*, for subsidized ISP services.

Also, http://www.perftech.com/images/Press_Rls_5_26.pdf is one file with an example using *Google* ... dated March 26, *2004*.

Now, it strikes me as a very obnoxious product. But I'm so tired of the "The Sky, err, The Internet, Is Falling!" paranoia every time an ISP or teleco does something, anything, that can be twisted into service for the buzzwords of Net-you-know-what.
Again, can't we be better than that?

Posted by Seth Finkelstein at 01:48 AM | Comments (22)

December 02, 2007

"If you are inclined to trust Google as your source for news, Google yourself"

Echo: Not dead yet: the newspaper in the days of digital anarchy by Bill Keller, executive editor, New York Times. Key passage (my emphasis):

Google News and Wikipedia don't have bureaux in Baghdad, or anywhere else. With a few exceptions, they do not, in the cold terminology of the 21st-century media business, create content Wikipedia's policy actually forbids original material; it is a great mash-up of secondary sources. Wikipedia and Google aggregate information from, well, from us. From the Times, from the Guardian, and from a lot of less dependable sources. They can pool reporting from hundreds of news outlets but what if there aren't hundreds of news outlets? Or what if many of them are simply unreliable? And how would you know? Here's an experiment you can perform at home: If you are inclined to trust Google as your source for news, Google yourself.

He's been getting raked over the coals for not making nice with the web evangelists who want to sell data-mining the audience to his company. The point he's trying to make is that aggregation isn't magic, and garbage-in, garbage-out. But sadly, in the bogosphere, nobody (with a large following) wants to hear.

Posted by Seth Finkelstein at 11:59 PM | Comments (6)

November 17, 2007

Google penalties for link-selling, and A-listers vs. Z-listers

The latest Google slapping of paid links has generated an intriguing aspect of "class struggle", as the intermediaries from Z-listers complain it's unfair to penalize those blogs for selling links, while not penalizing A-lister blogs which having sponsors "thank you" posts with links, essentially also paid link selling.

While neither side of that battle cares what I have to say (and it's probably not the best idea for me to get between them), it's an interesting question - what's the difference between paying for posts, and posts with links to an A-lister blog's sponsors? Perhaps surprisingly, I actually do see a difference. While the thank-you link posts are by no means completely pure, there's a lesser level of search gaming there than the individual placements for paid links. As a minor detail, typically having several links on a page dilutes the PageRank being sold. It is indeed some selling of PageRank, but not as much as a post which is devoted to a specific advertiser.

But much more importantly, it's not just the PageRank being sold, but also the sale of keywords in the links. That is:

"We'd like to thank our sponsor, BigCo [link]" is one thing, but

"We'd like to thank our sponsor, BigCo [link], which sells uPods [link], Niagra [link], and mome hortages [link]" would be quite another.

Now, you can push this if the company is named "Buy Niagra", but in general, the difference works in practice. Companies ranking higher for their own name is not a big problem, and while the extra bit of PageRank to distribute over their site is indeed ill-gotten gains, it pales in comparison to the keyword link issues.

Besides, nothing stops Google from going after the A-listers selling PageRank at some future date, after they've worked out the bugs (which seems to be substantial) from handling the Z-listers' keyword-selling.

Posted by Seth Finkelstein at 11:50 PM | Comments (8)

November 08, 2007

Google New Pagerank In != Pagerank Out Changes And Google's Statements

Regarding Google's recent PageRank shake-up, where I conjectured that Pagerank In != Pagerank Out, I realized that an article a few weeks ago from Danny Sullivan (Official: Selling Paid Links Can Hurt Your PageRank Or Rankings On Google) had actually reported this effect from Google itself. I'd read the post at the time. But the implications weren't clear in the way they now make sense in retrospect (my habit of discounting oracular Googlese led me astray). Quoting the article, my emphasis:

More and more, I've been seeing people wondering if they've lost traffic on Google because they were detected to be selling paid links. However, Google's generally never penalized sites for link selling. If spotted, in most cases all Google would do is prevent links from a site or pages in a site from passing PageRank. Now that's changing. If you sell links, Google might indeed penalize your site plus drop the PageRank score that shows for it.

Note penalize is not the same as dropping the PageRank score that shows for it. So Google can drop the PageRank score that shows for it, WITHOUT penalizing the rankings of the site.

So I pinged Google, and they confirmed that PageRank scores are being lowered for some sites that sell links.
In addition, Google said that some sites that are selling links may indeed end up being dropped from its search engine or have penalties attached to prevent them from ranking well. [... snip]
By using PageRank decreases (something Google first experimented with in the SearchKing case in 2002), Google can hurt the perceived value of buying links from a particular site without harming core relevancy.

So "without harming core relevancy" apparently means what I've thought of as PageRank-In != PageRank-Out.

The market for paid links just got a whole lot more complicated :-).

Posted by Seth Finkelstein at 06:48 PM | Comments (12)

November 02, 2007

The TimesSelect Reader (Jon Garfunkel) - NYT paywall vs. Google and bloggers

The TimesSelect Reader is Jon Garfunkel's "8 parts and 21,000 words" examination of the New York Times having a premium, for-pay, service and

... whether the Times lost influence, or audience, or money over the last two years. Many entries in the blogs have been long on speculation and short on data. We have tried to fill in the data gaps here.

Readers here might particularly want to examine the section on TimesSelect, SEO, and Google:

Google may need the Times, but the Times is starting to rely on Google even more. Marshall Simmonds told me that 25% of the traffic to nytimes.com comes from all search engines.

Note also TimesSelect & Foreign Correspondence

Perhaps Friedman is more popular because he often tells us what we want to hear. Kristof tells us what we don't want to hear.

It's all an immense amount of work, deserving of extensive attention (which, not coming from an A-lister, it won't get - in fact, even if it did come from an A-lister, it probably wouldn't be read, though it'd be talked about).

Posted by Seth Finkelstein at 11:59 PM | Comments (1)

October 31, 2007

Google now has Pagerank In != Pagerank Out ?

I sat out the Great Google PageRank Massacre Of October 2007 during last week, where several sites, including some high-ranking blogs, saw their PageRank displayed as dramatically lower than usually (the best example was the front page of YouTube supposedly going down to a score of 3/10, a level which can usually easily be achieved by a minor blog - that was an amusing proof that at least some changes were not due to Google hand-editing results). I thought I'd wait for the data to settle before examining it. What was so interesting during the initial part of the uproar was The Silence Of The Googlers (i.e. the people who work for Google). Not a peep, and that spoke loudly.

Also significant, nobody seemed to reliably report any ill-effects from the change. Given that blogs were affected, there was of course plenty of noise, but nothing major.

Then the oracle of 'plex spoke, saying:

The partial update to visible PageRank that went out a few days ago was primarily regarding PageRank selling and the forward links of sites. So paid links that pass PageRank would affect our opinion of a site.
Going forward, I expect that Google will be looking at additional sites that appear to be buying or selling PageRank.

I speculate that Google has now formalized what they've been doing crudely before, and separated the quantities of PageRank-for-ranking and PageRank-for-transmitting. Before, if a site had a high "in" PageRank, that meant the site had a temptation to sell it. Now, a site's "out" PageRank may be minimal, now matter what the incoming linkage. As a bonus, displaying the "out" PageRank will make the displayed data even more confusing.

Posted by Seth Finkelstein at 09:06 AM | Comments (2)

October 08, 2007

Google Hand-Editing Results In Spirit Now? (to penalize link-selling)

Danny Sullivan - Official: Selling Paid Links Can Hurt Your PageRank Or Rankings On Google

"If you sell links, Google might indeed penalize your site plus drop the PageRank score that shows for it."

I've long defended the basic accuracy of the statement "Google doesn't hand-edit results". Now, that statement obviously can't be true in the most extreme sense, otherwise they couldn't ever throw out spammers. And certainly they'll country-blacklist illegal sites. But I've been against making an reductio ad absurdum interpretation of such a statement, and then knocking down a strawman. That's not useful.

There were also lesser spam penalties. Arguably, that was merely caught up in an algorithmic sweep. But now (my emphasis):

Google stressed, by the way, that the current set of PageRank decreases is not assigned completely automatically; the majority of these decreases happened after a human review. That should help prevent false matches from happening so easily

I don't want to create false incentives, and human review is good of course. Yet I can't help thinking that we've now crossed a line here. Perhaps with the best of intentions, for the most worthy of reasons. But still, we're now on the other side of some divide.

Now, there really is someone sitting in a room thinking along the lines of : "Hmm, the algorithm says you have Pagerank 9, but looking at your site, you're using your pagerank-powers for link-profit, so let's turn it down a few notches, perhaps to Pagerank 7, so it's not quite as attractive. If in the future you prove to be a more moral vessel of our power, we may restore you to full strength."

That's a change. Good or bad, it's different from what's been the case before.

Posted by Seth Finkelstein at 09:36 PM | Comments (4)

September 25, 2007

GooglizationOfEverything.com

Echo: www.googlizationofeverything.com

"Where is this book going?"

... we should ask some hard questions about how Google is not only "creatively destroying" established players in various markets, but is also altering the very ways we see our world and ourselves.

[Siva Vaidhyanathan]

For those who make a Google a god, recall the quote "[A computer is] like an Old Testament god, with a lot of rules and no mercy. -- Joseph Campbell". Google has many algorithmic rules, and I've seen too many people begging for mercy.

Posted by Seth Finkelstein at 05:35 PM

September 15, 2007

Google : Privacy :: Fox : Henhouse

ObPunditry: Google calls for web privacy laws.

Search site Google has called on governments and business to agree [to] a basic set of global privacy rules.

In other news, foxes have called on farmers to agree to a basic set of henhouse privacy rules. They propose to standardize on "APEC principles" (Association of Poultry Eating Carnivores).

Anyway, there's no point in me rewriting what others have said better:

Google and new, international privacy rules

Posted by Seth Finkelstein at 08:11 PM | Comments (1)

September 10, 2007

Franco Frattini EU Censorship Proposal - QUESTIONS AND ANSWERS

Franco Frattini, European Union Justice Commissioner, has set off a minor blogstorm from the following censorship proposal:

BRUSSELS (Reuters) - Internet searches for bomb-making instructions should be blocked across the European Union, the bloc's top security official said on Monday.
Internet providers should also prevent access to any site giving instructions on how to make a bomb, EU Justice and Security Commissioner Franco Frattini said in an interview.
"I do intend to carry out a clear exploring exercise with the private sector ... on how it is possible to use technology to prevent people from using or searching dangerous words like bomb, kill, genocide or terrorism," Frattini told Reuters.

Putting aside the phrasing silliness (I know, it's like blogger catnip, ha-ha-he's-so-dumb), he has been making the same noises for a while:

Voice Of America News August 2006

"But I think it is very important, for example, to explore further possibilities of blocking websites that concretely incite to commit terrorist actions or for example providing of the diffusion of expertise or knowledge about bomb making," said Frattini.

However, it turns out that an organization http://SpyBlog.org.uk has compiled a VERY LONG Q-and-A about such censorship proposals

Below is the first part of our letter to Franco Frattini, and the preliminary, general answer, by Jonathan Faull the Director General for Justice, Freedom and Security of the European Commission.
See subsequent blog postings for Questions and Answers numbers 1 to 17

I doubt my audience needs me to say anything more about the battles of censoring the Internet, and he certainly doesn't care what I think reading ...

Posted by Seth Finkelstein at 11:25 PM | Comments (2)

August 28, 2007

Kraus Celebre

Allen Kraus, a focus of the NY Times' Google-power article, already has a web page, as pointed out by Jon Garfunkel in his piece:

Search Engine Obfuscation.

Jack Shafer of Slate (Page Rank 6/10) tells Mr. Kraus to get a web page. But the man has a web page (which I linked to as my random act of charity for the day). It's just that nobody else linked to it [Ed. note: the back links feature of Google and Yahoo is well-known to be highly inexact, er, wrong, with Yahoo being slightly better, so the link is there for dramatic effect]. And as such, his page, and his company (ImplexHealth), have a PageRank of 0/10. ...

["My readers know more than I do" :-)]

So here's another link for it.

I suppose the web propagandists, I mean, evangelists, could object that they said to start begging A-listers for links, I mean, blogging - not just have a web page. But I think the above point is powerful evidence about the scamminess of that idea, if any more was needed.

Posted by Seth Finkelstein at 11:59 PM | Comments (3)

August 27, 2007

"When Bad News Follows You" - NY Times and Google's power

"When Bad News Follows You" is today's must-pundit article for SEO (Search Engine Optimization), about the power of top Google results to affect people's lives, even if it's misinformation (h/t RoughType).

Rather than rehash what everyone else is saying, I'll try to provide some value-adds:

I like Oliver Widder's cartoon:

Welcome to our SEO seminar - "The Truth Is On The First Page"

I normally don't like to talk about Bennett Haselton's writings due to conflict of interest issues, but this seems far enough away from any potential contention so I'll note his amusing site detailing his dispute over a New York Times article about him: PublicEditorMyAss.com

The New York Times Web site has been hosting an article since May 2000 claiming that I was fired from Microsoft in January of that year. I complained several times that this was wrong -- I wasn't fired, I quit in good standing (and, for the record, voluntarily, not some "quit now or you'll be fired" deal) -- and I showed the NYT editors a copy of my personnel file from Microsoft which has "Term. type: Voluntary" and "Term reason: Resignation" printed on it, but the paper has still not corrected the article. ...
... I also told them that recently one of my employers found the article by Googling my name and thought I had lied about my employment history, and I only dodged that bullet because my employer looked up my Microsoft reference and determined that I was telling the truth.

And sadly, I've seen many marketers pushing the response "Start a blog!". I have the impulse to tell(off) these hucksters, that ordinary people do not want to get on the blog-evangelism gatekeeper-begging attention-mongering digital-sharecropping rat-race. They have lives instead, and want to live them without (free) laboring endlessly to be manipulated and sold for the benefit of pyramid-schemers. But my saying that wouldn't be heard, so it wouldn't do any good :-(.

Posted by Seth Finkelstein at 11:58 PM | Comments (5)

August 15, 2007

Regulating Search Engines paper - detailed notes

I made some notes as I went through the "Federal Search Commission?" paper, and since I've already given an overview of my thoughts, I decided to post these for whatever value they have in terms of the specifics of the argument, and where I believe it doesn't work. Again, basically, I sympathize with the examination of the concentration of media power. But the claims as to why it's not like other media power simply don't seem to me to be valid.

The first dimension involves an important preliminary question: what exactly is the relevant speech in relation to which search engines assert first amendment rights?

This: "If you're looking for pages about "widgets", the most relevant page is this, the second most relevant page is that, the third, etc".

When, however, the frame of reference is the supposed speech embodied in rankings the claim that regulation of search results violates the first amendment becomes highly precarious. It is highly questionable that search results constitute the kind of speech recognized to be within the ambit of the first amendment by either existing doctrine or any of the common normative theories in the field. While having an undeniable expressive element, the prevailing character of such speech is performative rather than propositional.

Regrets, I don't buy it. I don't see a way you can claim "Vote for X" is "propositional" while "The most relevant page for X is Y" is "performative". This part in the reasoning seems flawed: "To use the terminology of Robert Post, the speech of search engines as embodied in rankings is not a form of social interaction that realizes first amendment values."

That claim is problematic in a very deep sense, because if search engines rankings embody social values, then they're a form of social interaction in the relevant sense. The argument can't have it both ways, that they're expressions of the algorithm-writer's bias and prejudice for the sake of criticizing them, but not social interaction when it comes to regulation.

After all, one could say everything from tabloid newspapers to book publishing is not social interaction, in that they're monologue or pontification, not a town hall meeting.

In short, extending the compelled speech rule to cover the mere observations on relevance implied in search engine rankings seems to take the doctrine to domains where it was never meant to go.

But the problem here is taking that view in the opposite direction, to wit:

The evaluation of the value of bonds which was found to be an "opinion" in that case, while not the strongest case of an expression subject to a dialogical relationship, still has some potentially-dialogical features. Listeners can agree or disagree with the evaluation, criticize or support it, and make arguments for or against it. Search engine rankings, by contrast, are not perceived by users as an expression with which they can interact in ways characteristic of what we usually refer to as an "opinion."

Again, this just doesn't seems correct to me. Generally we have as little ability to dialog with a statement like "Standard and Poors rated this bond as junk" as "Google blacklisted this site as spam". In both cases, the mechanism used to determine the result is proprietary, and the institution offers it on a take-it-or-leave-it basis.

As in the case of the compelled-speech rule, recognizing the incidental and limited form of "opinions" implicit in search results -- i.e. opinions about relevance to users -- might cause the doctrine to spin out of control.

Right, right, got it. This idea is seen (in the reverse) in a lot in net-ranting. You can't convert every statement into protected speech by the magic of prepending "It's my opinion that ...", and so it's an opinion, which is protected speech, ha-ha-ha gotcha. Calling every statement an opinion isn't a get-out-of-regulation-free card. Understood. However, trying to turn it around in the other direction is just as bad, in that there's a problem playing off the many senses of the word "opinion". A search engine result is more like a judicial "opinion", which doesn't map exactly to the most common use of the word either.

The Google does not need me to save it, and I certainly know how its results can be gamed. But I also don't think it can be so readily categorized as somehow apart from standard journalism.

Posted by Seth Finkelstein at 01:34 PM

August 13, 2007

Regulation Of Search Engines discussion - "Federal Search Commission?"

There's an interesting legal discussion concerning the paper "Federal Search Commission? Access, Fairness and Accountability in the Law of Search" by Frank Pasquale and Oren Bracha.

I find myself torn, as I'm very politically sympathetic to the issues raised by the authors. As they recognize, this is really about mass media and information gatekeeping in a democratic society. There's a whole genre of these types of paper. But they usually boil down to saying roughly the same basic things in a very elaborate way:

-1. An informed populace is important for a democratic society
0. The First Amendment forbids government regulation of political speech
1. These mass media institutions concentrate enormous political power in a few corporations, giving these businesses huge megaphones, without any effective reply by the citizenry
2. But the courts have ruled that under the First Amendment, at least for newspapers, that's just fine (e.g. the "Tornillo" case).
3. This institution is not like newspapers, because [fill in the blank].

The magic is in item #3, and sadly, I've yet to see one of these papers where I found the reasoning convincing there. The writer's problem (generically, not this paper in specific) is that they can't make it a general media analysis, since then they would be both on the wrong side of existing law, and would immediately lay themselves open to intense attack as censors. So they're forced to try to find some hairsplit, some key feature that they can claim gets them out from under that trap (myself, I think the intellectually consistent liberal solution is saying that corporations aren't persons, but that's a whole different topic).

Now, the above task isn't entirely impossible. For TV and radio, it's "spectrum scarcity" and "pervasiveness". Which supported the Fairness Doctrine, to counteract practical monopolization. However, that regulation has been gone for a long time, and any proposal to restore it brings instant oppositional targeting by professional propagandists. The only relevant TV/radio material regulation still in force - and even increased in some ways in modern times! - is prohibitions on sex and cursing (which tells you something ...).

But the authors' specific attempts to find a hairsplit for search engines (my paraphrase here) - secret algorithms, or overblown marketing claims, or Google-is-God perceptions, or defining it as not discussion among citizens - just seem to me to be playing to the discomfort that some liberal-arts types have with anything involving technology. If computer programs are covered by copyright (something that was not so evident years ago), then search engine ranking are "opinions". Arguments otherwise are easy to shoot down.

I'd suggest putting the advocacy energy into some sort of "Right Of Reply" argument - that might even be possible, though it's still very much bucking the trend.

Posted by Seth Finkelstein at 08:04 AM | Comments (2)

August 12, 2007

Copiepresse v. Google documents on Groklaw, from Sean Daly

Sean Daly sent me a notice about his Groklaw posting Update on Copiepresse v. Google. This is the case where Google News in Belgium was sued by newspapers over copyright violations. Along with analysis

... here's the official English translation of the ruling in Copiepresse vs. Google. I have linked to cited jurisprudence and essays where possible (the Belgian documents are in French, the European documents are in English and other languages).

I don't agree with many of positions taken by the person who introduces the article, but it's definitely yeoman's service to acquire and post the original sources.

Posted by Seth Finkelstein at 11:58 PM

August 08, 2007

Obligatory Google News "Comments" Post - Google is creating original reporting

So Google News has comments (for small values of comments), and it is incumbents upon everyone to comment.

Of course, this gives Google a huge amount of power in picking and choosing who will be allowed to comment. They state:

We'll be trying out a mechanism for publishing comments from a special subset of readers: those people or organizations who were actual participants in the story in question

Essentially, they're taking their function as an automatic aggregator, and adding some human ORIGINAL REPORTING in follow-up. Very minimal original reporting, but they are in effect generating their own follow-up reaction articles from the original aggregated articles.

And wow, does this create some perverse incentives that can lead to unintended consequences. I can think of one obvious result off the top of my head:

1) Get mentioned in a popular article for doing something outrageous

Then either

2) Google gives you a platform to say whatever you want

Or then

2) Scream GOOGLE IS CENSORING!!! as loud as you can, and watch the fireworks.

I'm sure there's plenty of devious schemes hatching in the minds of flacks. This is going to draw a huge amount of attention. And that draws people to manipulate it.

Posted by Seth Finkelstein at 11:22 AM | Comments (4)

August 07, 2007

Lauren Weinstein: "Call For Search Engine Issues, Complaints, Concerns"

Echo: http://lauren.vortex.com/archive/000266.html

Greetings. As part of my continuing research and an upcoming white paper focusing on policy and related technical issues associated with search engines and their impacts, I'd very much appreciate any examples of relevant specific situations, concerns, and any other positive or negative experiences with search engine operations and support personnel, with a particular emphasis on (but not limited to) the following categories: [read the post above]

Posted by Seth Finkelstein at 11:59 PM

July 18, 2007

"The Googlization of Everything" - Siva Vaidhyanathan

"The Googlization of Everything" is a new book in the works by Siva Vaidhyanathan. I'm going to get a jump by echoing it before the crowd (any resemblance between this post and Google manipulation is purely ironic ...).
[n.b. note the picture in the first link - "Snared in the Web 2.0 ... "User-generated content" is just another name for massive corporate data collection, mining, and profiling"]

Per the The Institute for the Future of the Book's fellow announcement:

Siva is one of just a handful of writers to have leveled a consistent and coherent critique of Google's expansionist policies, arguing not from the usual kneejerk copyright conservatism that has dominated the debate but from a broader cultural and historical perspective: what does it mean for one company to control so much of the world's knowledge?

As I keep saying, there's a shift, but it's from one set of gatekeepers to another set of gatekeepers.

Or, as put in a talk note

His premise was that we've come to talk about Google in theological terms, and that the Google folks themselves encourage this through their familiar "don't be evil"-type approach to their public communications. He thinks their stated aim to eventually provide universal access to all information is basically cynical at worst, unrealizable at best.

More talk elaboration:

Siva concludes his talk with a plea against technofundamentalism - the Google logic that you can always fix the problem by tweaking and innovating. This is also a plea against the myth of technological neutrality. Google is not neutral, he says, and politics are built into the black boxes of their search engines. Finally, this is a plea for Critical Information Studies - a nice start to the conference, then.

Shorter: You can't fix a social problem with a technological solution?

Posted by Seth Finkelstein at 11:59 PM

July 16, 2007

Google Roundup: Cookies, Getting Rid of Wikipedia (Results), Song

Links for the underheard, in a futile gesture to whip the Long Tail.

Did you hear? Google will lower, to two years, the expiration time of its universal spying device, I mean, cookie. It'll just link to Michael Zimmer on Google cookie expiration:

My hunch is that the brilliant data-mining minds at Google recognize that if someone hasn't searched on Google in two years, their past history probably isn't a good indicator of their current needs. So, if linking to two-year-old data isn't all that valuable, they might as well just dump the cookie altogether. It doesn't harm their data-mining needs - and it's good PR.

[See also "More of Peter Fleischer Misleading on Google Data Retention" - he said it, I didn't.]

From the everybody talks about Wikipedia taking over Google results but finally someone did something about it department:

Will Critchlow: Search Google without wikipedia - a Firefox search plugin

Here at Distilled, it's something that came up in conversation a few times, so we decided to do something about it - we have created a Firefox search plugin that enables you to search Google without getting wikipedia results

[See also the CustomizeGoogle solution]

Humor: Lauren Weinstein - "I Am the Very Model of a Modern Major Googler"

And if you're really good it seems to us that you at least possess,
The skill to quote from memory full source of the Linux OS.

[Rumor has it that this line is only a slight exaggeration of what they expect]

Posted by Seth Finkelstein at 11:59 PM | Comments (2)

July 13, 2007

Google Video Cache Bypasses YouTube Age Verification

Echo: http://lists.grok.org.uk/pipermail/full-disclosure/2007-July/064625.html

Youtube.com requires account creation and login before allowing visitors to view videos flagged by users as inappropriate.
Sample flagged video: http://www.youtube.com/watch?v=[video_id]
"This video or group may contain content that is inappropriate for some users, as flagged by YouTube's user community.
To view this video or group, please verify you are 18 or older by logging in or signing up."
.....alternatively, download the video directly from Google video
http://cache.googlevideo.com/get_video?video_id=[video_id]

[h/t Google Blogoscoped forum]

I've said it before, cache is the bigggest threat to censorware.

Posted by Seth Finkelstein at 11:59 PM

July 05, 2007

Anti-"Sicko" Google Search Ads and Google Policy

I stayed out of the blogstorm of a few days ago regarding Google [Health Advertising Blog] Criticizes Moore's "Sicko" - given the number of ultrahigh-attention sites echoing the story, anything I'd say would either be futile or (personally) dangerous.

In the aftermath, I've seen some suggestions that Google is violating its own policy by permitting critical ads to be run against a search on "Sicko", e.g.:

Sicko short on truth
Moore's movie profers a deadly Rx.
In the smart new business magazine
www.American.com

Checking Google's ad content policy, the relevant passage seems to be:

Ad text advocating against any organization or person (public, private, or protected) is not permitted. Stating disagreement with or campaigning against a candidate for public office, a political party or public administration is generally permissible.

The letter of the policy doesn't say anything either way about a movie. But the spirit seems to the be that "campaigning" is allowed, so they could argue it encompasses general political speech.

Frankly, I think using Google ads in a controversial political issue is just a bad idea. The following is not an implicit encouragement, but since the idea is utterly obvious, I don't think there's any reason to refrain from mentioning it - buying a political Google ad is an invitation for some militants to click them, solely to cost the advertiser money. Maybe Google doesn't care, since they'd make money too off such "protests" (on the other hand, dealing with the claims of click fraud can't be fun).

Posted by Seth Finkelstein at 12:34 AM | Comments (2)

June 28, 2007

Google Privacy Fluff

Philipp Lenssen asked Google about data restrictions, and received a statement concerning "We restrict access internally in a number of ways. [details]".

I left a comment in part:

There's never going to be an official answer which says "Security? What security? We believe in open sourcing our business records. We don't take any precautions, anyone whatsoever can traipse through them at will".

It's important to understand that there's a difference between privacy, and business confidential data. Google's logs fall under both regimes. In many instances, the same incentives apply. But what happens when there's a difference? This is the argument I keep having with some of Google defender's - the Google Search Subpoena case was NOT a privacy case. Google's objections were mainly about business confidential data, which they then "spun" as privacy. Posturing about the extensive procedures Google takes to protect its business records is not wrong, but it's not about privacy either.

We don't know about what happens in serious privacy challenges. There's no way to independently check on Google's statements.

To understand the difference, consider the AT&T wiretapping case

"The Electronic Frontier Foundation (EFF) filed a class-action lawsuit against AT&T on January 31, 2006, accusing the telecom giant of violating the law and the privacy of its customers by collaborating with the National Security Agency (NSA) in its massive, illegal program to wiretap and data-mine Americans' communications."

AT&T surely could have a spokesflacker say all sorts of things about how seriously they protect customer privacy. Without some independent checks, taking such statements on faith is not warranted (pun unintended but still relevant)

Posted by Seth Finkelstein at 11:59 PM | Comments (1)

June 21, 2007

My _Guardian_ column on Google and Privacy

http://technology.guardian.co.uk/opinion/story/0,,2107262,00.html

"The task is to prise out any abuses from behind the wall of corporate secrecy. Otherwise, we could end up with an unholy alliance between corporations and governments."

Posted by Seth Finkelstein at 01:05 AM | Comments (6)

June 20, 2007

Google: 1, Michael Gorman: 0

[I hate to do this to Michael Gorman, but I'm not above a little link-baiting myself. ]

In the Britannica Blog Link-Bait party, Gorman said:

"If you can't Google it, it doesn't exist" is a common saying of Jimmy Wales and his ilk - a remark that gives shallowness a bad name. It does, however, illustrate neatly a state of mind that has turned away from learning and scholarship and swallowed -- hook, line, and sinker -- every banal piece of digital hype. There are intellectual treasures of all kinds in libraries and archives throughout the world that are not available on Google, and, because of the defects of all search engines using free-text searching, would not be retrievable using Google even if every last word in them were digitized. Mr. Wales may place no importance on anything other than information in digital form, but we owe more than that to the young. There is a life beyond the search engine -- a life of richness and nuance undreamed of in Mr. Wales's philosophy -- and all teachers at all levels of education must insist that their students use primary sources and authoritative secondary sources in their papers and studies, regardless whether these sources are digitized. Further, they should emphasize the acquisition of research and critical thinking skills applied to the human record in all its variety.

Unfortunately, before we even get to the Googling, Michael Gorman fell down here on the critical thinking skills. While he certainly can't be expected to be a Jimmy Wales worshipper, hanging on the pronouncements of the guru of work-for-free, it's pretty easy to know that Wales doesn't believe something so strawmannish as the impression given above. If anything, his general line could be attacked as being much more slick, that this stuff is bad for you if you use it to the exclusion of everything else, but you shouldn't do that (and implicitly, if you do, it's your fault, don't go blaming the wonderful wisdom of crowds for steering you wrong, you should have checked anyway).

Anyway, Michael Gorman put a correction in the comments of the thread:

I have heard from Mr. (Jimmy) Wales himself, that he not only has not written "If you can't Google it, it doesn't exist" but also that this quotation is directly opposite to his actual views. I had read the quotation attributed to him in the New Yorker article by Stacy Schiff (July 31 2006) - "Wales, in his public speeches, cites the Google test: ``If it isn't on Google, it doesn't exist''" - and had not seen the attribution disputed. However, I was remiss in not checking further before I published this essay. I apologize to Mr. Wales unreservedly and wish, not for the first time, that the saying "A lie is half way around the world before the truth has its boots on" was not so spot on.

Which started the inevitable blog mockery

The best part of this whole stupid Gorman thing yet: in a blog post on shoddy research, he misquotes Jimmy Wales based on a printed source. And has to apologize. The irony! The laughs! The sheer idiocy of this whole exercise!

Michael Gorman rebuts

I did not "misquote" Mr. Wales. I read that he had said those words in public speeches in the New Yorker article. It's probably counter to the snide ethic of blogs, but I chose to accept his statement that, despite the unrefuted statement in the New Yorker, he had not said and did not believe those words.

Now comes the problem of who do you believe? One thread commenter:

Actually, Gorman cites the New Yorker article accurately, and the New Yorker does its homework and fact-checking and interviewed Wales extensively for the piece. Funny, Wales waits one year to complain about being misquoted? waits until he's on the hot seat and being criticized in this forum? ...but he had no problem with this quote when it merely was contained in the puff-ball New Yorker piece (that also contained the Essjay lies to boot)? Hmmm... .And this reflects badly on Gorman? How convenient for Wales to remember he never said this... .(Gorman is actually being gracious and letting Jimmy off the hook! I doubt I would if I were Gorman.)

New Yorker:

Part of the problem is provenance. The bulk of Wikipedia's content originates not in the stacks but on the Web, which offers up everything from breaking news, spin, and gossip to proof that the moon landings never took place. Glaring errors jostle quiet omissions. Wales, in his public speeches, cites the Google test: "If it isn't on Google, it doesn't exist." This position poses another difficulty: on Wikipedia, the present takes precedent over the past.

Sing: Which side are you on?

Well, it turns out this can be determined by ... THE GOOGLE. It's a little more difficult than is apparent, since it seems the reporter tightened the quote. There's no independent reference for "If it isn't on Google, it doesn't exist". What you have to search for is "it probably doesn't exist". And then one finds speech transcripts such as:

"But there are other cases where it's borderline. Where you might say, I'm not sure if this is a hoax, if this is real, is this not real, and the example here was a film called Twisted Issues, an obscure underground punk film from 1988. The funny thing is, I gave a talk just two days ago at the University of Florida, and the next day somebody wrote me and said, "Do you know I played on the soundtrack for Twisted Issues." I said, wow really, go ahead and edit the article, really, so anyway, so the first person says it's supposedly an underground punk film, but it miserably fails the Google test. So what's the Google test. You look something up in Google, and if you can't find it, then it probably doesn't exist. It's -- this is not a foolproof test, but it's pretty good. Right? There are still a few things on the planet that are not in Google. But it's pretty good. And so it fails the Google test, and it doesn't have any listing, so a couple people say, "delete, delete." And then somebody says "Hey wait wait wait wait, I found something. It's in the Film Threat Video Guide to 20 Underground Films You Must See. So maybe it has some notability. Next person down says, complete it. Next person says, it's a real movie, it's in IMDB, keep keep." So at the end of a discussion like this, this would have been kept. In fact it was kept, and the article's still there."

Verdict: From the full section above, I think Jimmy Wales is being taken out of context. He's clearly talking about a narrow circumstance of determining whether something is a hoax or not. And note in the debate Wales uses as an example, a print reference book is actually being cited as evidence.

It's all in how you use the Google, and think critically.

Posted by Seth Finkelstein at 12:18 AM | Comments (9)

June 18, 2007

Lauren Weinstein's Search Engine Dispute Notifications RFC

Echo: Search Engine Dispute Notifications: Request For Comments

Increasingly, cases are appearing of individuals and organizations being defamed or otherwise personally damaged -- lives sometimes utterly disrupted -- by purpose-built, falsified Web pages, frequently located in distant jurisdictions. ...
Question: Would it make sense for search engines, only in carefully limited, delineated, and serious situations, to provide on some search results a "Disputed Page" link to information explaining the dispute in detail, as an available middle ground between complete non-action and total page take downs?

In my view, it's a brave thought, but it won't happen. We've got to start thinking of search engines as media companies, because that's what they are (I don't claim this insight to be original - lots of people point it out in regard to their advertising business model). The search results are their content, and they do a very standard business model of selling targets ads around that content.

This then gets into the issue of speech and libel law for Internet service businesses, which is a very complicated topic. Can an algorithm output be libel, even if the human values which go into it don't contemplate the specific libel at issue? Good luck arguing that against Google's money and lawyer-buddies ...

Posted by Seth Finkelstein at 02:21 PM | Comments (1)

June 11, 2007

Privacy International vs Google, Privacy Report, and punditry explosion

The Privacy International "Race To The Bottom" Report touched off the expected punditry party:

Why Google?
We are aware that the decision to place Google at the bottom of the ranking is likely to be controversial, but throughout our research we have found numerous deficiencies and hostilities in Google's approach to privacy that go well beyond those of other organizations. While a number of companies share some of these negative elements, none comes close to achieving status as an endemic threat to privacy. This is in part due to the diversity and specificity of Google's product range and the ability of the company to share extracted data between these tools, and in part it is due to Google's market dominance and the sheer size of its user base.

I feel like someone should just set up some sort of system where one or two bloggers can be picked as the champion-of-battle of the inevitable reaction. As in, if you think Google is a poor misunderstood maligned gentle giant, go to Matt Cutts' Why I disagree with Privacy International. On the other hand, if you believe Google is an enormous corporation subject to all the negative aspects that come with being a huge business which has a deep interest in collecting personal data, read Shelley Powers On Privacy Redux. Danny Sullivan and Donna Bogatin can be the respective seconds.

Given that there's far more people saying things, than things to say, I'll leave it that.

Posted by Seth Finkelstein at 01:55 PM | Comments (3)

June 09, 2007

"We Googled You" - Harvard Business Review Interactive Case Study

Echo: "We Googled You"

Hathaway Jones's CEO has found a promising candidate to open the company's flagship store in Shanghai. Should a revelation on the Internet disqualify her now?

In brief: Managers are asked what they would do about hiring a job candidate where a Google search discloses some problematic college activism (h/t many-2-many). It's pretty interesting to read the responses ("I routinely Google people I'm going to interview or be interviewed by.").

I know what the typical Net evangelist would say, that we should all be forgiving, and get used to living in a goldfish-bowl. While that's one common sentiment, note it won't be the evangelist who suffers if they're wrong. It's far more interesting to see some of the negative thoughts of people who actually make such decisions.

Posted by Seth Finkelstein at 08:20 PM | Comments (2)

May 24, 2007

Google and "She"/"He" Spelling "Corrections"

A Google algorithmic quirk which spelling "corrected" searches like e.g. [he invents] to [she invents] recently got some attention, and Google has apparently now rolled out a fix for this problem.

I didn't chase after it at the time, since it seemed obviously an issue of statistics difference, and plenty of informed people were explaining that result to those who saw it as deliberate sexism. So I didn't see the need for me to say it too. There can be a long discussion of structural sexism, and the effects of the default English pronoun being "he", etc, but I had no special expertise to weigh in on the matter.

But the fix that Google has made is interesting for what it reveals about how their algorithm actually functions. As Philipp Lenssen said in the above:

(Note: no matter what Google tells you, algorithms are always influenced by those who design, write & test them)

So Google seems to have changed the way "she" is handled in their spelling suggestions.

But it turns out, from seeing what behavior remains, that Google does not do the obvious sort of simple correction algorithm one might initially think. That is, a search for ["she inventt"] still gets a suggestion of
Did you mean: "he invent".

Why is this significant?

Because "she" is a common English word, "inveent" is not a common English word, and the naive correction of "inveent" to "invent" should yield a suggestion of "she invent". But it seems to be doing some sort of statistical best-match for the phrase as a whole.

I supposed this is not surprising, even expected, in retrospect. But it shows it's harder than it might appear to remove all aspects of structural bias (which is not to trivialize addressing an obvious case).

Semi-digression: Google seems to special-case swear-words. A search of ["fcck you"] does NOT return the obvious correction! One rule seems to be that if the swear-word doesn't appear in the original search, it won't be suggested.

Posted by Seth Finkelstein at 11:52 AM | Comments (2)

May 21, 2007

"The Social, Political, Economic, and Cultural Dimensions of Search Engines"

Echo: "The Social, Political, Economic, and Cultural Dimensions of Search Engines"

The newest issue of the Journal of Computer-Mediated Communication, JCMC 12(3), is a double issue. It features a special theme section on the social and cultural implications of search engines, guest edited by Eszter Hargittai, and a special theme section on CMC and religion from cross-cultural perspectives, guest edited by Charles Ess and colleagues in Japan. The 18 articles brought together on these two diverse themes have in common that they inform and enlighten.

Introduction's Abstract:

Search engines are some of the most popular destinations on the Web - understandably so, given the vast amounts of information available to users and the need for help in sifting through online content. While the results of significant technical achievements, search engines are also embedded in social processes and institutions that influence how they function and how they are used. ...

[Disclosure: A few of the papers cite me in the references]

Posted by Seth Finkelstein at 05:12 PM

May 14, 2007

Google vs Privacy

Echoes:

Michael Zimmer: Google's Unsatisfying Explanation for Retaining User Search Data

In sum, I applaud Google for trying to be more transparent about why it collects user data and what it does with it, but they still keep much in the dark.

[compare Why does Google retain data? Because nonexistent laws tell it to]

Google's official statement about logs

Note: "In developing this policy, we spoke with various privacy advocates, regulators and others about how long they think the period should be."

Observe the rhetorical set-up, of taking a middle ground between zero and infinity. Somebody is sure to say "never keep logs". Somebody is sure to say "keep logs forever, some investigation might find them useful". By doing whatever they felt like doing in the first place, they are compromising between the two "extremes".

Posted by Seth Finkelstein at 11:57 PM | Comments (1)

April 08, 2007

Anti-Google-Bomb Algorithm Proved To Use Page Words, Not To Be Hand-Editing

George W. Bush: A Failure Once Again, According To Google, by Danny Sullivan at Searchengineland.com, points out that a Google search for "failure" (not "miserable failure") currently has a George Bush page at the top result, due to the page having the word "failure" in it. That happened because the http://www.whitehouse.gov/president/ page has "Latest Headlines", which then had this part of
http://www.whitehouse.gov/news/releases/2007/04/20070403.html

"President Bush Makes Remarks on the Emergency Supplemental President Bush on Tuesday said, "In a time of war, it's irresponsible for the... Democratic leadership in Congress to delay for months on end while our troops in combat are waiting for the funds. The bottom line is this: Congress's failure ..."

And so this shows the new Google defusing algorithm uses words on a page to determine in part what's a Google bomb.

Notably, in the comments, "RedCardinal" said: "Well I think we can safely dispel any theories about this being a handjob now."

While nobody who studies Google seriously thought that they hand-edited these problematic results, Google's secrecy breeds superstition, so it's worth placing extra emphasis on the evidence that the changes were not done by a simple blacklist, but were indeed an algorithmic change.

Note this should not be taken to assume that no search engine has ever hand-edited a problematic result! But the number of algorithmic quirks vastly outweighs the rare examples, due to sheer complexity.

Posted by Seth Finkelstein at 11:55 PM | Comments (1)

March 21, 2007

KinderStart v. Google - Google wins, KinderStart lawyer SANCTIONED

KinderStart v. Google, a lawswuit challenging Google's ranking algorithms, has been dismissed - hard and with sanctions against the KinderStart lawyer (h/t Eric Goldman). That last part, the with sanctions is a very significant part here. Essentially KinderStart's lawyer went so far out of bounds on some issue that the court imposed a punishment.

From a quick read of the judge's reasoning, it seems he really didn't like the charges of paid placement, and of political and religious discrimination in Google's search rankings. Google critics take note.

I know some people were rooting for KinderStart because they tried (unjustly, in my view) to position themselves as a focus of the fear of Google's power. But being the enemy of your enemy doesn't make them right.

[See also earlier post on previous dismissal here, Kinderstart vs Google lawsuit dismissed, and ranking on ranking]

Posted by Seth Finkelstein at 12:20 PM

March 13, 2007

The POWER of Google - Topix edition

Echo: Rick Skrenta of Topix, about worries regarding How Search-Engine Rules Cause Sites to Go Missing:

To say that a content site should not rely on search engine traffic -- most of which comes from Google -- is naive. The web is 10 billion pages now, with a single point of entry. That's the web the way works. If you want to have a web business, you have to acknowledge this reality. ...
Sometimes retailers get hosed because the city decides to re-pave the street their business is on. The street is infrastructure. Like it or not, Google is infrastructure on the net now. They're the source of all the foot traffic. The three words in retail are "location, location, location." The three words online are "search engine optimization." It means the same thing.

The point I want to make in echoing that, is both another proof (if any were needed) that the monopoly effect is quite real, and further that it has substantial implications way beyond web business, to what gets heard in society in general. This is repetitive, but it's worth emphasizing from the monetary angle to establish the reality.

Posted by Seth Finkelstein at 10:26 PM | Comments (1)

February 19, 2007

"Language Log" post examines old Google "Jew" Search Controversy

A writer at Language Log, a group linguistics site, just wrote a post motivated by the "Jew" search. This is the controversy well-known in search circles where the anti-Semitic site "Jew Watch" used to come up as the first result in a Google search for the word "Jew"

The post's an interesting window into what someone thinks when seeing the disclaimer Google displays for that search, yet not knowing at first the history of the controversy. He ask the obvious question about why Google displays a disclaimer for that term, but not for, say, hash slurs and racial epithets ("Meanwhile, other words that have uses as offensive epithets, or are used ONLY as offensive epithets, get no warning from Google.")

The answer to that is the disclaimer was prompted by bad publicity in the specific case, not linguistic offensiveness.

There's one small error in the post - the statement "And Google HAS meddled with the search results to some extent; the site's self-description" is noticing that the results display an Open Directory description rather than the site's own description. But it's not a change which was done to tone down the results for that site.

Posted by Seth Finkelstein at 11:59 PM

February 16, 2007

Copiepresse Google court decision online (though in French)

Via Eric Goldman, the "Google v. Copiepresse, No. 06/10.928/C" Google News case decision is available, though it's in French. Perhaps someone can translate it, for the joy and happiness and civic virtue thereof. Anyway, he has some interesting commentary:

1) As I've said before, I think Google treads a lot closer to copyright's boundaries than it publicly admits. Naturally, in public, it takes the advocacy position that its offerings are clearly within copyright law, but this is hard to distinguish from cheap rhetoric. Instead, I think it's fair to say that Google pushes the edge with a lot of its services. Therefore, it should not be surprising that, given enough data points, some judges will conclude that Google has gone too far.

In addition, there an English excerpt of the case's earlier, September, decision in a post back then, at SEO by the SEA.

I should also have mentioned earlier some actual reporting by Danny Sullivan at SearchEngineLand

Posted by Seth Finkelstein at 11:59 PM

February 14, 2007

My _Guardian_ Column about Blog PageRank-selling / Link-buying

I have a column in The Guardian about the issue of companies basically dealing in blog PageRank-selling and link-buying (the most well-known being PayPerPost, but I don't single it out, it's just one of many).

Key point: A-listers are being disintermediated in terms of being gatekeepers for advertisers, the agency has re-intermediation, and if a page gets to the top search result from purchasing attention, almost nobody who sees that top search ranking will even know about the blogger ethics debate.

And it's not about "conversation".

Posted by Seth Finkelstein at 08:24 PM | Comments (2)

February 09, 2007

Note: The Wikipedia-model Google-Killer Search Is Still Speculation

Since Google-killers are in the news today, for something original, let me note that despite the hype, the search project based on a Wikipedia model of user data-mining (whatever the thing is being called these days, I think the preceding phrase is clearer), has yet to even have a development machine installed. The project's mailing list has had lots of discussion about possible approaches, but no action.

The God-King of Wikipedia says:

No firm decisions have been made. We have the test servers scheduled for install on Friday, and then I want to turn people loose on them to start playing around and testing. We need to start talking about how that should happen and who wants to be directly involved.

I had the thought that finding good developers to work for free is different from finding wannabe literary types to work for free as copyeditors. But on reflection, I realized it won't be a problem here. There's plenty of programmers who would pay for a shot at being the guy who killed the fastest gun in the West, err, Google.

Posted by Seth Finkelstein at 05:01 PM | Comments (2)

February 07, 2007

RSS Advisory Board Needs Links

Like this: RSS

Rogers Cadenhead:

Searching for Ways to Move Up in Google
A year ago the RSS Advisory Board moved to its own domain, losing all Google juice associated with its old site. Because the search term RSS is enormously popular, we've found it difficult to attract search traffic and build a decent Google pagerank. It took nearly a year to crack the top 100 for that term on Google; we're currently up to the 80s.

I'm actually dubious they can get to the top ten in Google. Especially given that the old site has the Harvard name behind it (which works for search engines too, via "trust" algorithms ...). Just one interesting little example about how social power gets replicated in search power.

Interestingly, Yahoo gets this "right" in terms of a search on [RSS] giving rssboard.org's specification page the #3 spot. I suspect that's due to their similarity algorithm picking rssboard.org as the site to display rather than Harvard (which has the #3 spot on Google).

Note the implications here: It's a lot harder to establish an a newer project if Google keeps sending people to the old one.

Posted by Seth Finkelstein at 10:40 PM | Comments (3)

February 02, 2007

The Miserable Failure Of A Google Bombs Article Search Ranking

There's a "Back Off National Pork Board" controversy, where the National Pork Board is using a trademark claim to threaten a lawsuit against a breastfeeding activist for a T-shirt with the slogan "The Other White Milk."

But this post isn't about that.

Rather, in passing, in the SearchEngineLand article National Pork Board Goes After Breastfeeding Search Marketer, when discussing an earlier Google-Bomb article, it's noted that the post on SearchEngineLand.com about "miserable failure" DOESN'T SHOW UP (in the top 100 items) for a Google search on the terms [miserable failure].

Now, that's interesting (Danny, you've got to scream "I'M BEING CENSORED!", and get some A-list bloggers to theorize about how Google is suppressing you so as not to let out the secrets of Google bombing. Or maybe because comments in the article on how to re-ignite Google bombs are considered dangerous. Or Homeland Security had Google remove it because it was talking about bombs. Something like that ...). It's around #46 in Yahoo for [miserable failure], so some of the difference is legitimate outranking. But still, there's a divergence.

The article is in the Google index, since it comes up as #1 for the searches [Google Kills] and [Other Google Bombs]. Even #1 for [Bush Miserable] and #2 for [Failure Search].

But it's around #450 for [Google Bombs]. #390 for ["Google Bombs"].

I conclude [Miserable Failure] is in a general class of searches (like [Google Bombs]) where Google is doing something different from e.g. [Google Kills], and perhaps weighing age/trust more strongly. No reason all searches have to go through the exact same algorithm, we know that they don't. It's a coincidence this was noticed for "miserable failure" in specific.

Learn something new every day ...

Posted by Seth Finkelstein at 12:32 PM

January 26, 2007

Defusing The Google-Bomb - And Maybe Reigniting It

SearchEngineLand reports Google defusing Google-Bombs, with a case study of "miserable failure". Google has made an algorithm change "that minimizes the impact of many Googlebombs."

Let the reverse-engineering begin!

Just as a speculation, and not tested much, here's my guess at the algorithm, *something like*:

IF the links to the page contain [BOMB] and

0) There are lot of links with anchortext [BOMB]

1) [BOMB] does not appear on the page or metadata

2) [BOMB] is the most common anchortext in links to the page

3) There are "very few" links of the form [BOMB otherwords]

THEN ignore all links with [BOMB]

This would preserve the ranking of pages talking about it, since they'll have the words on the page, even in the title.

We can test this by adding lots of links with both the expected text and [BOMB]:
George Bush: "Miserable Failure"

Posted by Seth Finkelstein at 12:23 AM | Comments (2)

January 19, 2007

Martin Luther King hate-site and superficial Google search fooling people

I didn't think of the following in time. But Martin Luther King day would have been a perfect opportunity for publicizing the efforts to fix all the links people have mistakenly made to the hate-site (thinking it was a legitimate site). From Natasha Robinson: (h/t Google Blogoscoped forum):

Good to see that MTV's Rock the Vote site took my complaint to heart and removed the link. They also wrote an apology on their blog about the link: http://blog.rockthevote.com/2007/01/last-night-rock-vote-made-mistake.html
The part that disturbs me about the apology is that the webmaster simply used ranking in Google as a means to find an authoritative site.

My emphasis in the below:

Last night, Rock the Vote made a mistake. In honor of Rev. Martin Luther King's birthday, we created a tribute from the RtV front page, as we have done every year for quite some time. To identify the external link, our webmaster searched Google and chose one of the top results, a website that, at a quick glance, appears to be a tribute to Dr. King with speeches, photos and a special emphasis on the holiday (martinlutherking.org - but don't go there). But appearances (and, apparently, popular results on Google) are deceptive. The website is a racist site that disrespects Dr. King and insults all of us who cherish his advocacy for justice. On behalf of RtV, I would like to extend our deepest apologies for this mistake. The link was immediately corrected.

Remember, hate groups can do search engine optimization and marketing too!

Posted by Seth Finkelstein at 07:31 PM

January 15, 2007

Martin Luther King hate-site pushed down in Google by good sites

The Martin Luther King Google anti-hate campaign seems to be working.

By the way, people should be reminded, despite right-wing attempts to claim otherwise, Martin Luther King supported affirmative action.

Posted by Seth Finkelstein at 11:51 PM

January 14, 2007

Sex And The Missing PBS Link

The PBS MediaShift blog has an article discussing the Google / Sex Blogs incident. More interesting than the main article (summary: Bug. Fixed, they say. Google has lots of power.) is the controversy that erupted because of a decision not to link to the sex blogs quoted and discussed in the story.

A number of people have asked why there aren't links to the sex blogs mentioned in this post. If Google had been blocking the blogs, then there would have been links included. But because anyone can easily find the blogs through a search on Google, PBS.org felt it was not necessary to include the links here and risk offending some readers who might not expect to find links to explicit sites on PBS.org.
I ask that you as MediaShift readers please leave comments below explaining what you think the link policy should be here and elsewhere on mainstream media sites and blogs. Should we link to explicit material and how should that be handled? Should we include a warning before the links? Which links are OK and which are not? Your thoughts would be appreciated and I hope to return to this issue in a more in-depth way on the blog. PBS editors, who are involved in this issue, tell me they are very much open to your suggestions.

Now, as a statement of fact, "search on Google" is a cop-out here. Most of the time people don't even click on links right in front of them, much less do a search. And given that the article itself discusses the power that Google has over people being able to find sites, it's very ironic to be deferring to it after a long column about the consequences of a glitch.

Moreover, in the thread, people are pointing out examples where PBS.org did link to sexually explicit sites in other cases.

Look, if you didn't want to take the flak from right-wingers, that's understandable. Maybe not laudable, but comprehensible. Otherwise, I'd say standard web practice is unequivocal on the issue, that readers should be immediately referred to the sites discussed. I don't see any reason to override that convention - for sites that are trolling for traffic and manufacturing controversy, maybe you don't want to give it to them (but any real guilty parties in this case wouldn't care about a link from PBS.org). So put a warning about content next to the link if you must (though I think context makes it very clear). But otherwise, well, I'm now left with a lot of sympathy for why the sex bloggers tended to think Google was deliberately removing them.

Posted by Seth Finkelstein at 08:56 PM | Comments (2)

January 06, 2007

Compensations Of Having a Blog - Currently Google Result #4 for ...

Posted by Seth Finkelstein at 10:13 AM | Comments (2)

December 28, 2006

Google, Sex, Blogs, and really determining Pornography vs. Erotica

[I could not resist a chance to use that title]

I spent some time trying to figure out what caused the recent sexblog kerfuffle. I noticed affected sites all seemed to link to commercial erotic sites (for example Comstock Films?).

My speculation as to what happened, is that Google's anti-spam algorithm got set a little too aggressively in terms of what sites are considered porn-spam. The twist comes that this didn't hit the affected sexbloggers directly, but indirectly, as they then got hit by a linking-to-spam penalty. That is, it's not that they were marked as spam themselves, but rather that they were suddenly seen as closely associated with porn spam.

Such an indirect change wouldn't necessarily affect all blogs which link to the spam-false-positive commercial erotic sites. It's just one factor, and other factors could override any penalty. The actual calculation involved could be very complex. No way to prove this, just a theory.

It's an amusing thought that somewhere deep in the innards of Google's anti-spam algorithm, there might be an honest-to-Potter-Stewart (I-know-it-when-I-see-it) line between "pornography" and "erotica".

Regarding Valleywag.com's original article, which seems to have done a certain amount of poisoning the well:

Some word Violet [Blue] wrote probably triggered a Google ban, inadvertently, but the search engine's rules are opaque, as is the procedure for an appeal against deletion.

Never eat at a place called "Mom's", never play cards with a man named "Doc", and don't take search engine analysis from a site called "Valleywag". There's far more to Google's criteria than simple word counting.

Posted by Seth Finkelstein at 11:51 PM | Comments (3)

December 27, 2006

Note To SexBloggers: Google has no "porn clampdown" - IT'S A BUG!

"SEO superstition" strikes again:

Valleywag.com:

Chronicle writer disappears in porn clampdown
The personal blog of San Francisco's Violet Blue, a sex writer published in the San Francisco Chronicle and Valleywag's sister site, has been removed from the Google index, along with several other adult sites. Tiny Nibbles, which runs a well-known annual list of the year's sexiest geeks, does not show in Google's search results, even if filters are turned off. Other sites affected include ErosBlog, a sex news site, and Comstock Films, which makes adult movies of real-life couples. The content's all legal, and naughty, rather than degrading. Some word Violet wrote probably triggered a Google ban, inadvertently, but the search engine's rules are opaque, as is the procedure for an appeal against deletion. You think there are other search engines, so that's okay? There are no other search engines.

IT'S A BUG! The sites haven't been removed from the index. If you go further into the results, the sites are still there. They apparently got ~~"sandboxed"~~ [update: "marked as spam-like"] for some unknown reason, so they're showing up much lower than normal. Almost exactly 30 spots, in fact.

Google doesn't hate you. Really.

A good person to contact about these things is Matt Cutts.

[Sigh ... *why* *bother*? Who is going to hear me in the face of the sensationalism?]

[Update: Some of the sites are back, the Google people know about the issue. Again, this is really about spam false-positives, not censorship].

Posted by Seth Finkelstein at 11:40 PM | Comments (5)

December 20, 2006

Google "SOAP API" / "AJAX API" - replacement projects, and a Yahoo opportunity

The Google SOAP API, a system for getting Google search results in a way programmers can easily use them, is no longer being supported by Google (non-techies: SOAP is a protocol, like Java is a programming language), in favor of, essentially, a web ad box (aka "AJAX API"). The system hasn't been working well for a while now, and it looks like the plug is being pulled on it.

The basic meaning of this, is that Google is telling independent search developers to get lost, in favor of billboard displayers.

Everybody talks about search-as-a-service, but few people want to do something about it. I suspect this is one of those projects where the cost to run it exceeds what people will really pay for it. I've had ideas of my own in this direction, but the economics is daunting.

Anyway, in the ensuing discussion, there's been relatively little attention paid to the projects to reverse-engineer Google's "web ad box". This mention may be useless in terms of dissemination, but I'll do it anyway:

Cracking Google AJAX Search API
Written by Matthew Wilkinson
Monday 18 December 2006 20:20:09

Recently, Google disabled the use of it's Google Search SOAP API. It now recommends that you use the Google AJAX Search API, which displays a search box on your website, much like a widget. This of course denies developers the means by which to fetch Google Search results and use them in their website. However, me and Martin Porcheron over at MPWEBWIZARD, decided to crack this new API to get some search results out of it.

There's also a screen-scraping EvilAPI (via Google Blogoscoped).

Memo to any Yahoo corporate readers: I assume you already know this, but there's a golden opportunity to grab some of the "cool" from Google here. Set up a compatible server, so anyone who has a Google SOAP API application can switch over to using Yahoo just by switching servers. Yes, it's a lot of server work for no direct revenue, and Yahoo already has a search API, and Google may make threatening legal noises. But you'll rarely have a better opportunity to grab mindshare from developers than now: "Google doesn't want you - but we do!".

Posted by Seth Finkelstein at 11:59 PM

December 14, 2006

Gooptions As Case Study Of The Failure Of Journalism (MSM Or "Citizen")

The Google Employee Stock Options coverage has been a case study in uncritical thinking. I know, what else is new, but I'll say it anyway.

About the best other criticism I've found is an excellent post on SearchViews, doing time-value calculations, about the aspect of that the plan dramatically shortens the time of the option when the employee sells it.

Initially hailed as an innovative HR strategy, then called "good for investors", the option plan has received so much praise that Internet Outsider asks, "If anyone has figured out the drawbacks of Google's new transferable option plan, please weigh in, because at first glance it looks like a win all around." Though numerous 'draw backs' have been suggested, including "an employee rush for exits", "shareholder dilution" and "arrogance", I'm surprised that no one has pointed out the most important nugget from plan's fine print: [detailed calculation]

But it's almost all been echoing of Google's announcement, or confusion over what the "transfer"/sale system does - and what it does not do. For example, there is no innovation here in determining the value of a Google stock price option. There's already a big public market in trading such options. The auction is basically just to determine who is the low bidder for handling the employee option transaction, given there's some weird constraints in the process. Which bring me to one simple example, in discussing the program, where what should be grist for serious reporting has apparently passed unnoticed:

Institutional buyers, who will be invited by Google to participate, will not be able to resell the employee stock options.

No offense meant to any reporter, but what in the world does this sentence MEAN? That is, it should be a big red flag that something strange is going on. Options on a stock are bought and sold all the time. How is the institutional buyer intended to distinguish from "the employee stock options" and "the stock options bought from yesterday's sheep-shearing"?

And this connects to the earlier issue of why not just let employees sell their stock options on the open market? After thinking about it for a while, I *suspect* this has to with the connection between the options and the underlying stock, maybe that if employee options were released into the open market, they would have to be covered by the company issuing stock (or something similar). But if they're just "transfered" to an institution, they still exist in accounting format as options, so certain negative effects (from Google's point of view) are avoided.

Wouldn't you like to know what this is all about? I would. I'm sure there's a professor of finance out there somewhere, who could explain it all. And might even be *blogging* - to an audience of a few hundred people. But they definitely haven't been found by the big echo chambers. And if that person ever did receive a little attention, the blog-evangelists would shout from their hilltops, the bogosphere triumphs - there's a specialist somewhere on the planet, so "overall" - not counting the endless hype reverberating from the massive audience "blogs", and also discounting that "old media" includes small trade newsletters too - blogs win!

I really think it says something profound about the failure of journalism in terms of civic structure, that random unpaid volunteers are supposed to provide the work that isn't supported otherwise.

STEWART: ... it's not so much that it's bad, as it's hurting America.
STEWART: You're on CNN. The show that leads into me is puppets making crank phone calls.

[Update: Changed title from earlier version - share the blame]

Posted by Seth Finkelstein at 11:59 PM | Comments (2)

December 13, 2006

Google Transferable Stock Options (TSO's)

Google Transferable Stock Options (TSOs) are, in my cynical view, Google's way of solving the problem of having someone else take the fall for what happens to its employee's stock options when the Google stock price eventually takes a dive.

Disclaimer: I thought Google stock was a sucker bet all the way up, which may bear on how seriously to take this post. I have never owned it, and have no transaction involving it (i.e. am not "short").

Briefly: Google has a stock problem. A stock's price, even in a bubble, can't rise forever. Google gives employees "options", a right to buy the stock at a specific price. These take time to become active ("vest"). A look at the calendar shows there must be a lot coming due. And I suspect many employees thinking they should get out while the getting is good.

Employee options, to turn into money, normally first have to turn into stock. This dumps stock on the market. Which can drive down the stock price. Which will cause more employees to think they have to get out while the getting is good. Which will further drive down the stock price ...

See the problem?

Moreover, there's a complicated tax law "gotcha" which can hurt employees enormously with bubble-stock options. I won't explain it all, but if you wait to sell the stock, and the stock craters, you can basically end up owing huge taxes on profits which no longer exist. Which is another incentive to sell the stock as fast as possible.

Google employees who get caught in that trap would likely be very unhappy.

So, what to do? Google came up with a great solution: Pass the hot potato to the outsiders, the folks who are hearing the tale of the endless fountain of money. Let employees *sell* the options to outsiders.

At first glance, this sounds like a great idea. After all, options are bought and sold every day. Why should employees not be able to sell theirs? Well, in ordinary circumstances, it wouldn't be a problem. But in a situation like Google, it's a set-up to rip-off the ultimate buyers for the benefit of Google and its employees.

First, someone better versed in the technicalities of options mathematics should check me on this, but I believe many simple option valuation models will give a "wrong" answer for the value of an option in a situation like Google's stock, which is relatively new and has gone almost straight-up. That is, intuitively, the stock price behavior is going to change dramatically at some point, to leveling-off or dropping, and that's not accounted-for in any theory where the calculation has internally modelled an infinite time series dramatically different from the existing series (technically: "misestimation of implied volatility"?).

If everyone is using the same "wrong" rule for their cost, then it's all just a standard market game of Greater Fool. But if some sellers have a *zero* cost, to buyers at an "inflated" cost, that's taking the game to another level.

And more deeply, normally, a market in options is limited in the ability to cause a stability problem, because kind of like matter/anti-matter pairs, a normal option transaction has two people on opposite sides of that transaction, so "financial energy" is obviously conserved. An *exception* to this situation is company-granted options, like Google is doing now. Which is roughly comparable to energy creation (money) while shunting the corresponding equivalent anti-matter (stock) to a time-displaced future date. There's still conservation in an overall sense (Google is not God, so can't get around that constraint), but it can be very imbalancing to put off the day of reckoning. And more importantly, the people who get wrecked at that future date tend to be different from the people who make out like bandits at the time of creation.

Note - this is a complicated topic. I know, "stock options accounting" are fighting words to many. This is a blog post, not a financial treatise.

But here's the "beauty" of what Google is doing - by selling the employee options, the employees can take profits without the options turning into stock! Which keeps the stock price up. Which encourages the buyer to hold onto the options. Which further puts off the day they turn into stock ... Brilliant!

That is, it keeps the party going on by putting off one pressure to sell stock. And who pays the ultimate bill? The buyer of the employee option, who at some point eventually gets stuck paying a high price for something which (oversimplified) becomes worthless when the stock starts leveling-off / going-down.

Doing Evil? Well, depends on whether you're the seller (who's at Google) or the buyer (aka citizen-lunchmeat)...

Posted by Seth Finkelstein at 02:28 AM | Comments (2)

December 08, 2006

"Rankjacking" - Monetizing website cracking via theft of PageRank

While there's been much discussion that the Google PageRank of websites can lead to lots of shady deals around buying and selling links, it's been less remarked that this also provides a way to profit from cracking a website. It used to be that most websites just weren't that interesting. The sites that take credit-cards for data are comparatively few, often use a third-party service for the billing transaction, and redirecting an order page to steal that information will be noticed quickly.

But every site has has its position in the recommendation social network, its ability to link, its Pagerank and "trust".

Thus, if a bad guy finds a security flaw in some website software, rather than being reduced to writing "d00dz rul3z!" on a page, which is not profitable, there's now a brand-new way to make money off the cracking: Insert links to boost another site's search engine results.

One "advantage" of this scam is that sites of non-profit organizations are likely to have a lot of rank and trust, but overworked and underpaid webmasters, which makes such sites a "sweet spot" for exploitation.

So obscure, "hidden" links inserted in various places are not likely to be noticed, and finding someone to fix the page won't set off the sort of red alert reaction involved in credit-card theft.

The United Nations Educational, Scientific and Cultural Organization, UNESCO has now been hit by this scam, as well as many other sites.

At this point, the actual cracker is unclear, and whether or not the link-receiving sites knew about the cracking or were unaware of the criminality. The cracker seems to have exploited a bug in some forum and link-cataloging software.

I mailed the UNESCO webmaster about their site being cracked, and there's now some attention to this particular event. But the general problem is likely to get worse, as the potential becomes more exploited.

Posted by Seth Finkelstein at 12:57 PM | Comments (2)

December 03, 2006

Google Rankings Now Fixed For WikipediaWatch

Shortly after my previous post was published, and echoed at Google Blogoscoped (a popular Google-oriented blog, ranks #44 of all blogs on Technorati), whatever Google penalty flag which affected the Wikipedia Watch site was removed. Search position for relevant terms skyrocketed. It's clear this wasn't a transitory problem, as it had persisted for months. The most likely explanation is someone at Google who had the power to clear the flag, saw the Google Blogoscoped item, and fixed the false positive.

I will not flatter myself to think they saw my post! In terms of audience, the Google Blogoscoped echo only sent around 39 hits. Now, all readers gratefully accepted, but it was a revealing statistic. Another data-point in what I think of as The Meaning Of Exponential Distribution Of Attention.

From another angle, this case was an example of the problems of Google's spam algorithms, and needing to "know someone" to get a problem fixed.

Posted by Seth Finkelstein at 10:46 PM | Comments (1)

December 01, 2006

Wikipedia Watch - Google Spam Filter Victim?

Daniel Brandt claims that Google hates Wikipedia-Watch (his site critical of Wikipedia). He presents some very different ranking for various terms.

This turns out to be interesting, as I was able to refine the tests to a sharper outcome. Now, let's keep in mind the difference between the facts, and the theory to explain them. There's something I call "SEO superstition", which is the very understandable way random variations can mislead people to form bad theories. And "never attribute to malice that which can be explained by stupidity". So a particular pattern may not be even real, or if it is, that doesn't necessarily indicate that Google's editing search results to marginalize critics (we should be so threatening ...).

So, with that in mind, comparing search terms, I found the following rankings today for www.wikipedia-watch.org for the indicated strings of words (searching as a set of words, not a quoted phrase):

[can you sue Wikipedia]

Yahoo - #1 and #4
MSN - #1 and #2
Google - more than #300 (!)

[plagiarism Wikipedia]

Yahoo - #5
MSN - #3
Google - I had to go past #700 before I found a result for wikipedia-watch.org

[phenomenon of Wikipedia]

Yahoo - #3
MSN - #4
Google - somewhere around #80

Note Wikipedia Watch site ranks #1 in Google for the search [Wikipedia Watch], but I think that may be misleading.

Feel free to try to reproduce, it's not difficult.

Conclusion: This is a real differential. It's too much to be explained by various SEO factors. Something is amiss here.

I think wikipedia-watch.org has somehow tripped a spam penalty on Google. This is not necessarily Daniel Brandt's fault. But there is a downgrading of the site.

[PS: Invocation - Spammeister Matt Cutts, you might want to check this out. I know Daniel gives you a hard time about being an ex-NSA spook, but look at it this way - a bug's a bug].

[Update Sun Dec 3 09:30:41 EST - the site's issue has been fixed now, for reasons unknown, and the Wikipedia Watch page updated accordingly showing dramatic ranking increases]

Posted by Seth Finkelstein at 07:42 AM | Comments (2)

November 14, 2006

Quoted in Mercury News about COPA Censorware Report

Study: About 1 percent of Web pages have sexually explicit material

Seth Finkelstein, a programmer and civil-liberties activist, said Google's stance was "horribly self-serving."
"There were no privacy implications in the sense that the data was restricted to a very small set of researcher who were under various sets of protective orders," Finkelstein said.
Finkelstein said Stark's findings about the prevalence of pornography on the Internet are similar to other academic studies.
"What we are learning about the Internet is that it reflects life and that the Internet is not -- contrary to what some people might think -- more sexual than people are in general."

The quotes are accurate, though of course it was a small part of a much longer conversation.

I'm climbing the pundit-ladder! :-)

[h/t: Catherine Crump]

Posted by Seth Finkelstein at 05:49 AM | Comments (6)

November 13, 2006

"Google Subpoena"-related Expert Report On Censorware Now Released

[Small scoop, though several reporters will probably have items shortly - updated with full report ]

The expert witness censorware report which set off the media frenzy Google Subpoena has now been released, almost completely. There are only some small redactions having to do with specific numbers related to various sizes of search engine indexes, which the companies regard as proprietary information.

I was on a list of recipients who inquired and received the full text in a mailing when it was approved for release by the Department Of Justice. ~~As the report will probably show up on the big search blogs, I'll save my disk space and let them post it~~. It's not all that, err, sexy, anyway.

Money shot:

V. SUMMARY

This study reports on the Google and MSN indexes, on AOL, MSN and Yahoo! queries, and on the most popular Wordtracker queries. About 1 percent of the websites in the Google and MSN indexes are sexually explicit. About 6 percent of queries retrieve a sexually explicit website. Nearly 40 percent of the most popular queries retrieve a sexually explicit website. Close to 90 percent of the sexually explicit websites retrieved by queries are domestic. Filters that block more of the sexually explicit websites also block more of the clean websites. The most restrictive filter blocks about 94 percent of the sexually explicit search results, but also blocks about 13 percent of the clean results. Of the sexually explicit websites that get through the filters, 30 percent to 90 percent are domestic.
The number of sexually explicit websites is huge. Search results often include sexually explicit material. A lot of sexually explicit material is not blocked by filters. Of that, a substantial percentage is domestic.

[But we all knew that last paragraph already ...]

[Update: Looks like nobody else bothered:-), and it turns out I can host it, so here it is:]

Main Report:

http://sethf.com/infothought/blog/archives/copa-censorware-stark-report.pdf

Supplement:

http://sethf.com/infothought/blog/archives/copa-censorware-stark-supp.pdf

Supplement 2:

http://sethf.com/infothought/blog/archives/copa-censorware-stark-supp2.pdf

Posted by Seth Finkelstein at 01:33 AM | Comments (1)

November 09, 2006

User Search History Data Portability: Data Export Means Data IMPORT

I believe many commentators are being far too uncritical about the following statement (h/t: Michael Zimmer) of Google's CEO concerning proposed portability of user's search history data:

Making it simple for users to walk away from a Google service with which they are unhappy keeps the company honest and on its toes, and Google competitors should embrace this data portability principle, Eric Schmidt said at the Web 2.0 Conference in San Francisco.
"If you look at the historical large company behavior, they ultimately do things to protect their business practices or monopoly or what have you, against the choice of the users," he said. "The more we can, for example, let users move their data around, never trap the data of an end user, let them move it if they don't like us, the better."

While at face value, this is a praiseworthy statement, I am more cynical. Institutionally, Google is known for

1) A prodigious appetite for data 2) A maniacal secrecy 3) Good PR

Putting this all together, I don't think he wants to make it easy for users to move personal data away from Google. I think he wants to make it easy for users to move personal data away from Microsoft and Yahoo to Google. I suspect this is in fact an attack aimed at Microsoft, where he's going to wave the banner of "portability" against possible Microsoft operating system lock-in tactics.

Not that there's anything wrong with that.

But it's about Microsoft and what Google perceives as competitive advantage, not about "the choice of the users".

Posted by Seth Finkelstein at 02:30 PM | Comments (2)

October 30, 2006

PageRank/Link-Buying Doesn't Care About Blogger Ethics

Much discussion about "PayPerPost" (a service which pays bloggers for posting about products) has understandably focused on the social rules involved in distinguishing what's an acceptable high-class social exchange of mutual benefit, and what's a low-class tawdry selling yourself (hint: where people stand on this is very tightly correlated to where they sit, as in on conference panels vs below in the audience or worse).

However ... in terms of "Search Engine Optimization And The Commodification of Social Relationships", it doesn't matter. That is, Google PageRank does not care about "*disclosure*". I laughed about PayPerPost's latest stunt in paying bloggers to link to a page about "disclosurepolicy.org". There's something very recursively absurd about that.

I don't think PayPerPost advertisers are buying the few dozens to hundreds readers of a typical blog post. Maybe, but I just don't think it's cost-effective. Rather, they're buying more the organic-looking LINKING from the various blogs, which looks to search engines as if the page is legitimately popular. Think of it like Astroturf applied to link-campaigning rather than political campaigning.

And when a product then ranks highly on a search engine due to those paid links, the people seeing that rank are not going to have any idea whether or not the blog posts which contributed to that rank followed A-lister Approved Best Practices For Soul-Selling.

Ironically, to use a buzzword, the A-listers are being "disintermediated" for certain business purposes (and THERE IS RE-INTERMEDIATION by the advertising agency), which I suspect is part of the reason for their howls that standards are being breached. Part of the evangelism sales-pitch is that bloggers are "influencers" - so of course A-listers are the biggest influencers of all. Thus advertisers should cater to them with everything from product freebies to consulting gigs. But what happens when, through The Magic Of The Internet, advertisers bypass those "gatekeepers", and simply buy large amounts of Z-listers through blatant resellers, instead of going through the A-list as intermediaries? Well, you have A-listers obviously upset that their own business model is being undercut. But it can be misleading to see their unhappiness as the main story (though it can be an amusing sideline), rather than a reflection of the economic shifts and battles between commerce vs social, being played out in various ways.

Posted by Seth Finkelstein at 11:58 PM | Comments (3)

October 20, 2006

Search Engine Optimization And The Commodification of Social Relationships

Many media A-list bloggers have been in an uproar over a service that pays bloggers for posting about products. More than just payola, Doc Searls also brought up the connection to "SEO":

Somebody said to me recently that PayPerPost and others like it are just "the latest SEO moves". SEO is "Search Engine Optimization", or the practice of doing things to raise your PageRank and get more Google advertising money, basically.
There are two approaches to SEO. One is to raise your PageRank with tricks. The other is to write useful and interesting posts about subjects you know and care about. Show me a blog with a lot of Google juice and I'll show you a blog that didn't need SEO tricks.

As all students of Search Engine Optimization know, link buying and selling is a big issue. In theory, PageRank is supposed to be developed from social relationship ("organic links"), representing the true value of human interaction. It is not supposed to be a commercial relationship, to the highest bidder.

But this is interacting really badly with commercializing social relationships. There's deep problems, especially when new variations arise in commoditizing connections between people.

Are you allowed to hire people to write useful and interesting posts? That's got to be permitted, right? I haven't seen the blogs which are basically commercial magazines online, being kicked out of the warm-'n-fuzzy backscratching A-list club for having paid staff.

Are you allowed to parcel out the hiring in little bits of cheap labor on other people's sites? Why not? You know what the blog evangelists would say if they were in favor of this, hailing it as a marvelous disintermediation of the old monolithic priesthood of the high barrier to entry media payoffs, compared to the hip new democratized PEOPLE-POWERED PAYOLA.

There's an old joke which runs:

Billionaire to woman: "Would you have sex with me for a million dollars?"
Woman: "Well ... yes"
Billionaire to woman: "Would you have sex with me for ten dollars?"
Woman: "What kind of a girl do you think I am?"
Billionaire: "We've already determined that. Now we're just arguing over the price."

There's two aspects here: Commercial, and amount. The obvious aspect of the joke is that there's two categories of interactions, commercial and social, and there's never supposed to be any overlap between them, whatever the amount. A less often remarked aspect is that there is indeed a "class" division between high-priced commercial and low-priced commercial.

I think we're seeing a real life version of that joke, roughly:

Company to blogger: "Would you write about me for advisory board membership?
Blogger: "Well ... yes"
Company to blogger: "Would you write about me for ten dollars?"
Blogger: "What kind of a flack do you think I am?"
Company: "We've already determined that. Now we're just arguing over the price."

Is a few bucks just the same as an advisory board membership? No - there's a class division, in that an advisory board membership is high-class and expensive, while a few bucks is tawdry and cheap. But there's something a bit methinks-the-lady-doth-protest-too-much when we have the equivalent of executive "escorts" venomously criticizing street prostitutes for being so crude as to be selling it.

Posted by Seth Finkelstein at 04:29 PM | Comments (8)

October 16, 2006

"Wal-Marting Across America" - Did Googlewashing Work?

The "Wal-Marting Across America" story, where a Wal-Mart PR firm sponsored a "fake" blog ("walmartingacrossamerica.com") about a couple's trip involving various Walmart stores, contains this interesting Google aspect:

It was a great way to redefine the term Wal-Marting, which is mostly used pejoratively to mean, among other things, how big box retailers mow down small businesses.

I was interested if the Googlewashing, i.e. crowding out search results, worked here. So far, all it seems to have generated is very poor results (#2 hit now). And at the cost of much negative reaction .

The idea above seemed to be, in part, to use the blog and the link behavior of bloggers to get prominent placement. But - again, so far - the blog ranks very poorly on a search for "Wal-Marting", or "WalMarting". I think what's happened is that the PR people drank the blog-evangelism Kool-Aid, and were misled by hype about blogs. Blogs can in fact be obscure in Google, especially if they are new and have few links, which was the case for this "flog" (PR blog). A-lister's blogs, established and popular, tend to rank well. But that doesn't mean any blog is going to do well, which is the sales-pitch.

Amusingly, there's the inevitable trumpeting that the failure of this stunt proves how blogs are so authentic and sincere (Scott Karp: "And because blogging is not a control-based medium, Edelman couldn't make Wal-Mart appear to be something it's not. It rang false, and they got caught."). In fact, I'd say the stunt didn't work because blogging is a very control-based medium, and you usually won't get heard unless a gatekeeper high up the hierarchy directs attention to you (I know, I say this a lot, I'm proposing it as an alternative explanation for the stunt's failure - it's not that bloggers can't be fooled, but that to fool them, e.g. you have to suck up to A-listers, not just exist).

There's a certain unfalsifiability in the reaction. Exploitations which conveniently blow-up are going to be greeted with a chorus of Transparency!, Conversation!, bloggers are just so gosh darn smart and clever and real that they can't be taken. But successful exploitations which do not fit this storyline will of course not be fodder for more delusion.

Posted by Seth Finkelstein at 09:54 AM | Comments (1)

October 03, 2006

Martin Luther King and The Unpleasant Search Algorithm Result

I'm a bit late in punditing about the CNET article noting:

Using the keywords "Martin Luther King," the first result on Google and AOL--whose search is powered by Google--and the second result on Microsoft Windows Live search is a Web site created by a white supremacists group that purports to provide "a true historical examination" of the civil rights leader.

Microsoft says:

The results on Microsoft's search engine are "not an endorsement, in any way, of the viewpoints held by the owners of that content," said Justin Osmer, senior product manager for Windows Live Search. "The ranking of our results is done in an automated manner through our algorithm which can sometimes lead to unexpected results," he said. "We always work to maintain the integrity of our results to ensure that they are not editorialized."

And Nick Carr:

By "editorialized" he seems to mean "subjected to the exercise of human judgment." And human judgment, it seems, is an unfit substitute for the mindless, automated calculations of an algorithm. We are not worthy to question the machine we have made. It is so pure that even its corruption is a sign of its integrity.

This is a good jumping-off point to note why I don't have a "home", for what I think of as technology-positive social criticism. Because my instinct here is not to bemoan the corruption of the machine, but to say, "That's what you asked it [the machine] to do". That is, if you ask the machine to tell you, very roughly, "What's the most popular site for this phrase", and it tells you something you don't like, well, that's the exact opposite of corruption. Sure, it's possible to ignore it - but that opens up a whole range of problems. Such as, which results are now going to be deemed so offensive to public sensibility that they'll be suppressed?

More importantly, if you start playing favorites, people are going to wonder if every oddity is the result of pressure groups - or should be subjected to manual adjustment. And search engines have enough problems with people falsely believing their personal sites been censored.

Think of it as "Government of laws, not of men".

Posted by Seth Finkelstein at 09:30 AM | Comments (7)

September 22, 2006

Spurious Search Count War On Gore

Geoff Nunberg posts at Language Log regarding how the Al Gore / Internet story history is being subjected to a misleading attempt to minimize its impact via a spuriously low search engine count:

Of course counts of media stories are only a rough indication of how widely diffused a story is, but even if we restrict ourselves to print, the contrast between [Alan] Abramowitz's 19 stories and the actual figure of several thousand is pretty striking. But then anybody who lived through this period knows without having to check that the story was all over the place. Which leads me to ask, How could Abramowitz possibly have believed the number his search returned?

Heck [alert - information you won't find elsewhere here!], let's go the source, from the story's inventor (my emphasis):

Next came the media feeding frenzy. On 11 March, Wired News was the first to report Gore's remarks. Hundreds of articles were quick to appear, many drawing the inevitable comparisons to Gore's other gaffes.

Sigh. Why do I bother?

Posted by Seth Finkelstein at 11:58 PM | Comments (3)

September 06, 2006

Linkdump: Google U California contract, suing Wikipedia, "Where are the women?"

Cleaning out various bogosocial obligations from the last week:

New sucker in the multi-level-marketing scheme for attention, err, I mean, blogger, Karen Coyle has an extensive post analyzing the contract for Google's University of California library digitizing (gatekeepering: Walt Crawford). Amusingly, one can see this post diffuse through the library domain, but not (yet) the search domain.

Daniel Brandt at Wikipedia Watch has a post discussing "Can you sue Wikipedia?". I don't agree with all the legal reasoning in it, but I don't like the way too much discussion is being driven by dysfunctional dynamics between Kool-Aid drinkers and Kool-Aid pushers.

Bandwagon: Vote Aaron Swartz for Wikipedia Board Member (if you have 400 edits, otherwise you can't vote).

A conservative is a liberal who has been mugged.
A liberal is a conservative who has been arrested.
Somebody who has not been invited to a hot party is a discoverer of the power of social connections.
Or "Welcome to Foo [Camp|Party|Networking Session], you lucky few". The A-listers said it, I didn't.

Which is a good segue to note Sour Duck's Where Are The Women Redux (h/t Shelley Powers), making a point that "Technology conferences, newspaper articles, and the Supreme Court workforce are the latest three areas where women are notably absent, prompting bloggers to once again ask, "Where are the women?". Another proof that blogging (if one wants to be read, rather than "connect with people") is not effectively very open at all.

Posted by Seth Finkelstein at 01:01 AM | Comments (5)

September 03, 2006

Google Giving Brazil Personal Data From Orkut - Misleading Explanation

In Google to Give Data To Brazilian Court (Washington Post), describing Google turning over identifying data to the government of Brazil, the following statements are made:

The difference, it says, is scale and purpose.
The Justice Department wanted Google's entire search index, billions of pages and two months' worth of queries, for a broad civil case Brazil, by contrast, is looking for information in specific cases involving Google's social networking site, Orkut.
"What they're asking for is not billions of pages," said Nicole Wong, Google associate general counsel. "In most cases, it's relatively discrete -- small and narrow."

There are some very wrong and misleading aspects in the above paragraphs.

1) The Justice Department went down to 50,000 URLs and 5,000 queries:

http://www.epic.org/privacy/gmail/doj_court_order.pdf

"First, the subpoena requested "[a]ll URL's that are available to be located to a query on your company's search engine as of July 31, 2005." [...] In negotiations with Google, this request was later narrowed to a "multi-stage random" sampling of one million URLs in Google's indexed database. As represented to the Court at oral argument, the Government now seeks only 50,000 URLs from Google's search index. Second, the government also initially sought "[a]ll queries that have been entered on your company's search engine between June 1, 2005 and July 31, 2005 inclusive." (Subpoena at 4.) Following further negotiations with Google, the Government narrowed this request to all queries that have been entered on the Google search engine during a one-week period. During the course of the present Miscellaneous Action, the Government further restricted the scope of its request, and now represents that it only requires 5,000 entries from Google's query log in order to meet its discovery needs. Despite these modifications in the scope of the subpoena, Google maintained its objection to the Government's requests."

2) The information was being sought for a statistical study, and the data would be under a protective order, and not intended to identify any particular person (even if some identification would in theory be possible, nobody was going to try to do it). In contrast, the information being sought here is to identify and hopefully convict specific people of criminal charges.

Of course, Google is in a tough position here, some of the crimes alleged are very serious. This IS the sort of problematic action that people projected onto the DOJ's relatively insignificant study. But it's quite an odious spin to trivialize it as "small and narrow"!

Posted by Seth Finkelstein at 10:59 AM

August 11, 2006

"Race to the Bottom" - Corporate Complicity in Chinese Internet Censorship

Human Rights Watch has released a new report "Race to the Bottom" - Corporate Complicity in Chinese Internet Censorship

It's a thorough examination of the topic. I won't attempt to extensively summarize, since that'll be done by many others.

I'm mentioned (with regard to Google censorship) at the bottom of page 61, in very good company:

For more on this issue see Bill Thompson, "The billblog: Google censoring web content," BBC News, October 25, 2002 [online], http://news.bbc.co.uk/1/hi/technology/2360351.stm; Jonathan Zittrain and Benjamin Edelman, Berkman Center for Internet and Society, "Localized Google search result exclusions," October 26, 2002 [online], http://cyber.law.harvard.edu/filtering/google/; Seth Finkelstein, "Google Censorship - How It Works," Sethf.com, March 10, 2003, http://sethf.com/anticensorware/general/google-censorship.php; and Philipp Lenssen, "Sites Google Censors," Google Blogscoped, January 25, 2005, http://blog.outercourt.com/archive/2005-01-15-n50.html (all retrieved July 12, 2006).

[Hat tip: Philipp Lenssen

Posted by Seth Finkelstein at 09:36 PM | Comments (2)

August 09, 2006

AOL Data Real-World Logs Experiment Yields New York Times Privacy Proof

"A Face Is Exposed for AOL Searcher No. 4417749" is the New York Times' proof of concept of privacy invasion from search data:

Ms. Arnold, who agreed to discuss her searches with a reporter, said she was shocked to hear that AOL had saved and published three months' worth of them. "My goodness, it's my whole personal life," she said. "I had no idea somebody was looking over my shoulder."

You can just see the upper levels of the policy and punditry elite digesting this concept, as it becomes valid for them. There's a teachable moment happening right before our eyes, where conventional wisdom is being changed. Concerns about the implications of data retention, search logs, privacy invasion, etc, are suddenly moving from the outer reaches (ie. civil-libertarians) of polite society, to be respectable issues-of-the-day.

For unique material which is not being said dozens of times over by other people, I'll point out that Daniel Brandt at GoogleWatch has been making this case for years now, and even running "Scroogle", an anonymizing search proxy. This supports my points about activism - without media support, without a certain level of insiderness, you will talk forever about an issue, and not make any (or very little) progress.

Posted by Seth Finkelstein at 11:00 AM | Comments (4)

August 07, 2006

AOL Search Data Launches World's Biggest Experiment On Privacy Invasion

AOL Search Data has been released for more than half a million users:

This collection consists of ~20M web queries collected from ~650k users over three months. The data is sorted by anonymous user ID and sequentially arranged.

This is a privacy problem!

While it'll be well-discussed, I'll observee: AOL has just given us the world's biggest real-world experiment as to whether privacy invasion can be done from search-engine data. Previously, when discussing the Google Search subpoena, all people could do was speculate - the data might have this, it could include that, maybe possibly someone could do this from it. Now we have both a huge amount of data, and many interested geeks playing with it and mining it.

I joked we'll now see a huge distributed reverse-engineering collaborative effort to track down as many anonymous user ID's as possible. At least, I hope that was joke. Maybe it wasn't.

Note this data is being far, far, more widely released than the subpoena data, which would have been under confidentiality agreements and protective orders. Worrying about Big Government can be a distraction over far worse Big Corporations.

Posted by Seth Finkelstein at 10:29 AM | Comments (7)

July 25, 2006

Google Germany Censored Sites vs. Germany's Voluntary Self-Monitoring Blacklist

Philipp Lenssen asks Why Is Stormfront.org Missing in Google Germany?, discussing Google censorship:

How does Google know which sites they need to censor? One thing Google and others in Germany do is to access blacklist data on a server by the Association for the Voluntary Self-Monitoring of Multimedia Service Providers, FSM("Freiwillige Selbstkontrolle Multimedia-Diensteanbieter eV") ... Stormfront.org, however, is not on this BPjM blacklist module, according to the BPjM.

My comment on this was that he hasn't found a bug in Google's censorship, he's found a bug in the "BPjM blacklist" :-).

The response he got from Google was unhelpful as usual.

One of the reasons I've opposed censorware is that secret blacklists preclude judicial review. This may be a commonplace now, but it's acquiring new resonance with, let's say very prominent cases involving claims of secrecy and national security:

pp.39-40, "If the government's public disclosures have been truthful, revealing whether AT&T has received a certification to assist in monitoring communication content should not reveal any new information that would assist a terrorist and adversely affect national security. And if the government has not been truthful, the state secrets privilege should not serve as a shield for its false public statements. In short, the government has opened the door for judicial inquiry by publicly confirming and denying material information about its monitoring of communication content."

But then, we're back to the same problem - I'm preaching to choir here, and marginalized to anyone else :-(.

Posted by Seth Finkelstein at 11:59 PM | Comments (1)

July 22, 2006

Google News problem parsing "Techdirt" site

Techdirt:

Google News changed something on July 6th, so that all of our stories appearing in Google News now show up with the headline "Permalink to this story." We hadn't made any changes to our site, and Google News had always performed flawlessly in the past. ... What followed was a series of emails from Google staff (much of it sounding like boilerplate "canned" responses) that in almost every case blamed us for their own glitch. That's what we get for trying to point out a glitch to them.

This might be an interesting case study of algorithmic quirks. I assume Google isn't doing this deliberately. But I suspect there's a long list of trade-offs in Google News' ad-hoc parsing which is leading to a poor result for Techdirt. And that Google doesn't want to devote the time of a programmer skilled in debugging in order to diagnose the issue.

I sent Techdirt a suggestion: Put a class="permalink" attribute on your permalinks (<a class="permalink" href="...">). That *may* fix it.

If they try my idea, and I'm right, I'll write more about it.

[Update: I guess not. Google works in mysterious ways.]

Posted by Seth Finkelstein at 11:59 PM | Comments (3)

July 20, 2006

Google News and Infoworld not appearing

Jon Udell - News about Google News (about the site Infoworld.com not appearing, though its weblog subdomain is appearing):

According to Google News product manager Nathan Stoll, the omission is a technical problem rather than an editorial one. The Google News crawler, he says, is a very different beast from the regular Google crawler. And while the regular crawler happily includes our stuff, the news crawler -- for reasons as yet undetermined -- doesn't.
I was surprised to learn this because I've only ever been aware of three user-agent strings (i.e., crawler signatures) broadcast by Google bots:
1. GoogleBot (for the main index)
2. GoogleBot-Image (for images)
3. Feedfetcher-Google (for RSS feeds)
There's no separate signature for the news crawler. It identifies itself as GoogleBot too. Given that the main crawler and the news crawler use different algorithms for site traversal and page analysis, according to Stoll, I'd expect them to identify themselves differently. But perhaps for historical reasons, they don't.

Despite a tendency for snarky sites to play "Gotcha" with that explanation, it does seem to be true.

According to an older mailing list report,

Leaving out the version numbers, Google News user agent is "Mozilla (Googlebot)" whereas regular Google is just Googlebot.

I suspect that's slightly incorrect now, i.e. Google News has the "Mozilla (Googlebot)" signature, though not all instances of that signature are Google News (though it may have true at the time it was written, given various lag times in use of different code).

Given that Google News does include "www.infoworld.nl", my guess is that someone made a typo in the sources file somewhere for "www.infoworld.com", and the Google News crawler is mistakenly looking at a cybersquatted site (hence it wouldn't be reporting a can't-find-site error, but it wouldn't find any useful news content either).

Posted by Seth Finkelstein at 03:57 PM

July 14, 2006

Kinderstart vs Google lawsuit dismissed, and ranking on ranking

The Kinderstart vs Google search ranking lawsuit was dismissed, with a complete victory for Google. See Eric Goldman for legal commentary, Danny Sullivan at SearchEngineWatch.com.

This has set off another round of pedantic parsing and nitpicking over the marketing PR that Google issues about its algorithms. My take on it all is much shorter. Meaning no disrespect to anyone in particular, but attempting to convey the issue succinctly from my point of view:

Marketing PR does not fully explain algorithms - GET OVER IT!

If Google gave a halfway detailed overview of what it does, all non-geeks eyes would glaze over immediately. Yes, people don't understand it. Yes, Google's statements aren't that informative. Yes, explaining it is good. But Kinderstart's basic charge about unfair ranking in this case reminds me of nothing so much as the trivial argument: "The US Constitution's First Amendment says I have freedom of speech, and it doesn't say anything about except for libel or death threats or copyright infringement, so THE CONSTITUTION IS BEING VIOLATED".

And conversely, lawyers and free-speech activists spend a lot of time arguing about this, so why should Google's search results be any easier to explain to a non-specialist? How many people think copyright "fair use" means "I can copy it all and distribute it all as long as I don't charge for it"?

Right, Google's PR doesn't say that it penalizes the results for sites it considers web-spam gamers. Yeah, in theory, it should. But I really can't get worked up that it doesn't, especially when web-spam gamers are the ones doing the complaining.

Posted by Seth Finkelstein at 05:12 PM | Comments (1)

July 03, 2006

Kinderstart.com v. Google

The Kinderstart.com v. Google search-ranking court case progressed recently:

The search giant is being sued in California by a parenting website which claims it lost most of its traffic when its ranking dropped to zero. The site, Kinderstart.com, claims that it was downgraded because it is a competitor to Google. A motion by Google to dismiss the case was heard in California last Friday, where Kinderstart argued that it competed with Google because it also offers a search facility on its site.

Now, it is, err, my opinion, that Kinderstart.com was downgraded because the site looks like a spam-type directory, not because "it is a competitor to Google" (if it's a competitor, Google has nothing to worry about ...). But Kinderstart.com is riding the wave of Googlenoia, so it's getting a lot of coverage.

When Google defended its right to rank sites as an "opinion", in legal terms, it used the word "subjective", which is causing some discussion over the subjective meaning of "subjective". In English, though, the key part was that they're saying that if they think you're playing web-spam games, it's their right to throw you out of the search rankings with no notice or explanation, notwithstanding whatever else may be in their ranking algorithm. While that's not exactly a nice thing to do, it does seem to be a pretty solid legal right.

Update: More source documents from Eric Goldman

Posted by Seth Finkelstein at 11:58 PM | Comments (1)

June 27, 2006

Google Catawba County indexing of password-protected student information story

I did a bit of investigation on the "story about how the Catawba County Schools in North Carolina has gained a temporary injunction":

... for "Google to remove any information pertaining to Catawba County Schools Board of Education from its server and index and alleges conversion and trespass against the corporation." The school blames Google for some how getting into a password protected area and indexing the content.

I didn't find anything more than the general information that is outlined in the SearchEngineWatch article above. There's a server, it has password-protected pages, it's not clear how Google crawled it.

Just speculating, there might be a flaw in Google's crawler, where in order to be efficient it's keeping login/password credentials in effect over multiple page retrievals, whereas the correct (but much less efficient) behavior would be to re-establish the credentials for each retrieval. So if there was a link with the login/password to a private page on the server, but one without such sensitive data, those credentials might have gotten re-used for other pages with more sensitive data. Again, that's only a theory. But it would explain what the school saw:

We did troubleshoot this situation by searching for the students' information at Yahoo, Dogpile, and AltaVista. We did not find any information on these three search engine returns and we attempted the searches over a three-day period.

So that makes it unlikely that the issue was purely a matter of a misconfigured server, one left open in an area which should have been password-restricted.

More "amusingly":

We acted so aggressively with Google because, until the media got involved, we could not get beyond an operator at Google. We could not get operators to connect us with technical support, the legal department, or to anyone higher up in the organization. We were only given an email address to which we could submit a complain - which we did but got no response. ... Only when the news media submitted its own inquiry to Google did we get a call regarding the situation. ...

It's still who you are that determines if you get heard.

Posted by Seth Finkelstein at 04:48 PM

June 21, 2006

More China Google Censorship, Domains

Philipp Lenssen has compiled a Sample List of Censored Domains in Google.cn (I made a few suggestions as to sources, hence I'm graciously noted). I think there's something interesting in this data, but it would require substantial time to try to analyze it.

Some of the sites listed are free web hosting services (like Geocities or Angelfire), others are international news sites or human rights sites. With these it's kind of obvious why the Chinese gov't treats them as opponents. For other sites however, like those of music bands, I didn't see any obvious connection.

Sharing the same IP? (Virtual Hosting). Domain inheritance? Copyright list traps? Somebody in the censorship bureau hates the music band? The mind of a censor can be inscrutable.

Update: For some different but related info, see Nart Villeneuve on Keywords & Google.cn

Posted by Seth Finkelstein at 11:59 PM

June 18, 2006

Surprising Extent Of China Google Censorship

I've been commenting on and refining some of the analysis of Philipp Lenssen's tests of China censored Google pages. The basic result turns out to be that since China bans some very popular domains from Google (e.g. news.bbc.co.uk, geocities.com, angelfire.com) in their entirety, many Google searches have a least one result censored on the first page. The numbers of initial pages affected is quite large, when searching for common words.

Related material echo:

http://www.rsf.org/article.php3?id_article=18015"

15 June 2006 Reporters Without Borders / Internet Freedom desk CHINA
YAHOO! CLEAR WORST OFFENDER IN CENSORSHIP TESTS ON SEARCH ENGINES Reporters Without Borders said it found Yahoo! to be the clear worst offender in censorship tests the organisation carried out on Chinese versions of Internet search engines Yahoo!, Google, MSN as well as their local competitor Baidu.

Posted by Seth Finkelstein at 06:59 PM | Comments (3)

June 15, 2006

Google Groups Censorship in Germany - found two posts

Refining yesterday's work on Google Groups Censorship, I've managed to determine two of the specific posts which have been censored. Note these are censored over the world, not just within Germany. They're:

Message-ID "1147700268.906162.35470@j55g2000cwa.googlegroups.com"
and
Message-ID "1147883544.336917.193380@u72g2000cwu.googlegroups.com"

Obviously, I can't link to them in Google Groups. But it turns out that they're quoted later on in the "de.soc.politik.misc" newsgroup thread, at

http://groups.google.com/group/de.soc.politik.misc/msg/289f77fda42fa642
and
http://groups.google.com/group/de.soc.politik.misc/msg/b8656928a61937c8

This can be verified by constructing a WORD or AUTHOR search query which would return the post if it were not banned, but instead will return the censorship message. Don't try to search by message-id, that won't distinguish between normal missing posts and censored posts. And be sure to disable the similarity option ("we have omitted some entries very similar") - amusingly, self-referentially, the censored posts always count as omitted.

Posted by Seth Finkelstein at 02:54 AM | Comments (1)

June 14, 2006

Finding A Usenet Message Censored In Google Groups - "de.soc.politik.misc"

[Original research!!!]

Some Google Groups Posts Removed in Germany (Google Blogscoped):

Gary Price of ResourceShelf informs me [Philipp Lenssen] that there are new cases of censorship in Google Germany, but this time, in Google Groups, as Chilling Effects shows. Considering that Google has a quasi-monopoly on the Usenet archive, this is unsettling, especially as all of this happens in the background and we don't know which posts have been removed. (We do know the reason for the removal; it supposedly contained hate speech/ "Volksverhetzung," e.g. promoting Nazi opinions). In China, Google only blocks a path to the censored sites, and the Chinese gov't is responsible for blocking the sites themselves; with Google Groups, Google actually holds the newsgroup content on their servers.

Found it - at the thread level

http://groups.google.com/group/de.soc.politik.misc/browse_frm/thread/35488eb1d5d5da1f/f1b90810dd318edb

In response to a legal complaint we received, we have removed one or more messages. If you wish, you may read the legal complaint .

Someone who can read German and gets "de.soc.politik.misc" independently should be able to narrow it to the exact post.

Update1:

The censorship notice at the bottom of the Google Groups screen :

http://www.chillingeffects.org/notice.cgi?sID=1489

Turns out to refer to the same Chilling Effects document as one mentioned in the above post:

http://www.chillingeffects.org/international/notice.cgi?NoticeID=4249

Update 2:

Another: (not listed in the post!)

http://groups.google.com/group/de.soc.politik.misc/browse_frm/thread/89ebdac43bd3f7b0/7555566e41f75434

Update 3: Changed title to be more specific, and I found a source for "de.soc.politik.misc"

Update 4: Found two specific posts, see following entry

Posted by Seth Finkelstein at 02:40 PM

June 13, 2006

"Zen and the art of sexing hedgehogs"

http://rjwaldmann.blogspot.com/2006/06/zen-and-art-of-sexing-hedgehogs-or.html

"I am shocked, crushed and devastated that google can not help me find a photograph of a hedgehog penis. What is the world coming to ?"

[Via Michael Froomkin]

He later updates that one working search is for [hedgehog gender identification picture].

But I'm not sure this was a complete Google failure.

I was able to find a useful picture in a Google image search for [hedgehog sexing] (even with SafeSearch on!). And a Google web search for [hedgehog penis] yielded useful items in the first few results.

I suppose the underlying lesson is that this is a demonstration that image searching is much more difficult than text searching.

Also:

"I searched the web for photographs of hedgehog penises. It is actually possible that I am the first person to do so."

No way, for completely non-prurient reasons. For example, a Yahoo image search turned up photographs, related to a sad tale of the life and death of a pet hedgehog involving a related cancer tumor. And knowing male/female for pets is often very important if you have two or more of the same species.

Posted by Seth Finkelstein at 11:52 PM | Comments (2)

June 05, 2006

"Google Bombing for Alaa", revisited, and on-line activism analysis

Jon Garfunkel has an extensive article on "Constructive Activism", which discusses the activism techniques used in the case of imprisoned Egyptian blogger Alaa.

Two weeks ago, Alaa Ahmed Seif Al Islam-- an activist, blogger, Cairene, Drupal developer, Egyptian, and fairly good husband to his wife Manal (in alphabetical order)-- was beaten and arrested, along with ten other demonstrators, as part of ongoing protests in Egypt in suppport of an independent judiciary. What followed was a smattering of global protests, online and offline to free Alaa and other hundreds of jailed protestors. These helped in part to generate media stories, and even the U.S. State Department has called the actions of the Egyptian government were a "mistake."
Still, it remains difficult to judge the effectiveness of some of the new online activism tactics, particularly as they are ongoing and have have yet to achieve the ultimate goals, but in some quarters they've already been celebrated without qualification.

I'm late to the party, so I'm just going to use my puny platform to recommend it. Why? Because there's plenty of analysis about Google-bombing problems, Google Ads as activism tools, the consideration of site design, code for "badges", and more.

[Disclaimer: I'm mentioned favorably in the piece]

Posted by Seth Finkelstein at 01:00 AM | Comments (2)

May 15, 2006

10 Things You Might Not Know About Google

Philipp Lenssen

This article is written by Philipp Lenssen as part of the Blog Swap with Seth Finkelstein – Seth's article on 10 Things You Might Not Know About Censorware can be found at Philipp's blog.

Blog Swap

1. Google query syntax underwent some subtle changes over the years.

Not too long ago, you couldn't enter more than 10 words into the Google search box. Or to be more precisely, you *could*, but subsequent words were ignored. I bet the Google founders were thinking "10 words ought to be enough for everyone," and mostly there were right – but for some advanced uses, especially with the Google Search API, a little more is helpful. Then, a while ago, Google increased the words limit to 32 words. This is probably OK for a few more years!

Another change is that Google ignores stop words nowadays. Stop words in search engines are words like "the" or "a" which are too tiny or common to be useful additions to most searches. However, Google is now accepting them as semi-normal words (one remaining difference being that they're not highlighted, or linked to the dictionary). This means in Google.com, you get different results when search for [the tale of a cowboy] vs [* tale * * cowboy] vs [tale cowboy]. (I'll be using square brackets around search queries – they're not to be included in the search.)

Another operator changed its functionality during the years; a couple of years ago, you could only query Google for [site:something.com], but not [site:something.com/something/]. Today, you can add folders to the site operator.

2. Google itself was Beta.

These days, everyone puts a Beta tag on their 2.0-ish web app. But did you know back in 1998, when Google launched their search, it was also in Beta? Take a look at a copy stored in the WayBack Machine to see it. Be aware the page might look quite ugly by today's standards... heck, it was probably ugly even back in 1998 (then again, so was my homepage in 1998!).

3. PageRank more than 1-10 – maybe.

While no one outside Google knows for sure, it is often speculated that Google's PageRank value – the "authority rank" (or quantity of backlinks which themselves receive lots of backlinks) – is a much more precise number than the plain 1, 2, 3... 10 values. A float, not an integer, if you will.

So, for example, if you're looking at a site which shows a PageRank 8 in the Google Toolbar, its internal PageRank may be something like 8.355 (or however precise Google's number is). But we don't know for sure – maybe Google's algorithms prefer speed over quality when it comes to the recursive PR calculations of billions of pages. This calculation might not be a breeze even for Google's 10,000 - 200,000 computers (that's another number we can't be too sure of outside of Google).

4. Google's co-founders didn't like each other in the beginning.

I guess when you're an uber-geek, like Google founders Larry Page and Sergey Brin are, you are also very competitive (to the point of risk being arrogant towards slower thinkers, maybe). John Battelle in his book The Search (page 67/68), tells of how the two met at Stanford University in the summer of '95:

Like most schools, Stanford invites potential recruits to the campus for a tour. But it wasn't on the pastoral campus that Page met Brin – it was on the streets of San Francisco. Brin, a second-year student known to be gregarious, had signed up to be a student guide of sorts. His role that day was to show a group of prospective first-years around the City by the Bay.

Page ended up in Brin's group, but it wasn't exactly love at first sight. "Sergey is pretty social; he likes meeting people." Page recalls, contrasting that quality with his own reticence. "I thought he was pretty obnoxious. He had really strong opinions about things, and I guess I did, too."

"We both found each other obnoxious," Brin counters when I tell him of Page's response. "But we say it a little bit jokingly. Obviously we spent a lot of time talking to each other, so there was something there. We had a kind of bantering thing going."

5. Google has 16 official blogs.

You might have come across the official Google Blog. But did you know Google actually has 16 different – and all official – blogs (give or take one)? Here's the full list (I'm also collecting these all on one page):

Google Blog - googleblog.blogspot.com
Google Talkabout - googletalk.blogspot.com
Google Base Blog - googlebase.blogspot.com
Google Video - googlevideo.blogspot.com
Inside Google Desktop - googledesktop.blogspot.com
Google Code - code.google.com
Inside AdWords - adwords.blogspot.com
Inside AdSense - adsense.blogspot.com
Google Reader Blog - googlereader.blogspot.com
Blogger Buzz - buzz.blogger.com
AdWords API Blog - adwordsapi.blogspot.com
Google Enterprise Blog - googleenterprise.blogspot.com
Google Research - googleresearch.blogspot.com
Google Maps API Blog - googlemapsapi.blogspot.com
Google Writely - writely.blogspot.com
Inside Google Book Search - booksearch.blogspot.com

6. Google self-censors in several countries.

You heard about how Google self-censors in China (e.g. human rights sites top-ranked by Google in other countries are missing in Google.cn). But did you know that Google showed censored search results in other countries for years, sometimes even without showing a disclaimer that something was missing? In Germany and France, that was the case.

You can see this for yourself if you first search Google.com for [site:ety.com]. This will result in 9,940 results. Now if you do the same search on Google.fr – Google France – you get zero results. However, there's a disclaimer at the bottom:

"In response to a legal request submitted to Google, we have removed 260 result(s) from this page. If you wish, you may read more about the request at ChillingEffects.org."

Note Google's disclaimer is showing the wrong number of missing pages – it 1,000s, not 260. Following the link to Chilling Effects, we see this text:

Google received complaints prior to March 2005 about URLs that are alleged to be illegal under U.S. or local law. In response to these complaints, one or more URLs that would have appeared for this search were not displayed.

In other words, Google is not censoring this out of their own belief, but by following government requests. Now what's ety.com anyway, except being one of the many censored domains? A quick glance will show it's some kind of stupid Nazi propaganda site, illegal by some country's standards. But you know what Voltaire said... "I may disagree with what you say, but I will defend to the death your right to say it."

7. Google stopped counting their index size.

Since around 2001, Google on their front-page were proud to show off the number of pages they search through... a number that went from a billion and a half to over 8 billion (according to Google). Today, Google doesn't show this number anymore. Maybe Googlers – that's what Google employees are called – realized that results quality beats results quantity. Or maybe they just realized that by sheer numbers, competitors were winning. In August 2005, Yahoo in their blog announced:

As it turns out we have grown our index and just reached a significant milestone at Yahoo! Search – our index now provides access to over 20 billion items (...) [including] over 19.2 billion web documents

Today, when you want to find out about the Google index size, there's a workaround though: search Google for ["* *"] – that's a good estimate. Right now, it's displaying 25,270,000,000 pages. In a direct comparison, when we search for "the" on both Google and Yahoo, Google shows a couple of billion pages more. Then again, these numbers are hard to verify – Google only lets us see the first 1000 results for each query. And in the end, who wants to see more than that anyway? Most people don't even go beyond the first 10 results, and rather adjust their search query instead!

8. The Google API may offer over 1,000 requests.

If you're a developer utilizing the Google web search API, and you need way beyond the 1,000 requests per day Google offers by default, here's a tip: you can email the Google API support and request more hits for your API key. Depending on your projects and traffic needs, which you will have to outline, Google just might grant you the request!

9. Google comic book search.

While Google doesn't have its own comic book search engine, you can still achieve good results by going to Google Images, setting the file size to "Large images", and then searching for [comics]. Using this setting, you can also search for an artist's name, like ["john byrne"], ["john romita jr"], ["frank miller"] or ["daniel clowes"]. You might even have some fun adding your own speech bubbles to the comic book pages you find (use a free font like WebLetterer for best results)...

10. Google Writely is a multi-user chat.

OK, so Writely – which Google recently acquired – is not really a chat, but an online word processor. However, by inviting others to your Writely document, you can group-edit any document... and see the changes by others merged into the document as you type! This feature allows you to chat with a group, and you can have fun with positioning text on different places on the screen, wiki-editing what others wrote, or adding colors and images.

Posted by Seth Finkelstein at 10:26 AM | Comments (37)

May 14, 2006

"Google Bombing for Alaa"

Demoblog - Google-bombing for Alaa, press release:

On Sunday May 7, Alaa Ahmed Seif El Islam, a prominent Egyptian blogger and political activist, was detained in Cairo by the Egyptian authorities while protesting the earlier detention of political activists rallying for a free judiciary.
...
On Tuesday, a group of bloggers connected to the site Global Voices decided to launch a different kind of campaign, one that would use the mechanics of the internet itself to bring world-wide attention to Alaa's case. They launched a campaign called "Google bombing for Alaa," an effort to manipulate the ranking of the world's search engines so that a blog dedicated to freeing Alaa (http://freealaa.blogspot.com/) would be the first page displayed when a person searches for information on the word "Egypt".

(via Jon Lebkowsky)

This is interesting, for a few un-obvious reasons. "Egypt" is a word which has many, many, links. So I doubt it'll get much traction, certainly not for a long time. It then turns into a kind of meta-experiment, where media attention is obtained for the attempt itself.

Posted by Seth Finkelstein at 11:59 PM

May 03, 2006

More On MaineWebReport lawsuit, and Google Search Results Ranking Motivation

The Google aspects of the MaineWebReport.com lawsuit continue. AdAge has an article: Ad Agency Sues Blogger for Defamation:

Tom McCartin, president of WKPA, is most concerned about Mr. Dutson's public posts because if potential clients search for the agency online, they will likely see Mr. Dutson's critique-filled blog before the agency's own Web site. As a result, Mr. McCartin says his business, which sees capitalized billings in the $40 million range, has been hurt. And he wants to protect his reputation.

I'm dubious about the likelihood about appearing "before the agency's own Web site". Maybe that would be true for an A-list blogger. But for anyone else, that would be rare. Now, appearing on the first page, that would be possible in many cases.

The article cites a mainewebreport.com blog post from Feb 28 which I'll quote further:

I noticed in Maine Web Report's stats that someone found the site through a Google search for "Paino Advertising" ... this can't be good for the company's reputation. Sure enough, searching Google for Paino advertising brings up this site on page 2 (not great, I know! But we're moving up gradually). Not good at all for an ad firm.

Did an ad agency really sue over this (or at least have it be a major factor)? It would be notable if true.

Posted by Seth Finkelstein at 05:25 PM

May 01, 2006

When Does Google-Results Presence Matter?

Part of the rhetoric around the lawsuit against the MaineWebPeport blogger is a large amount of "Google-huffing". The plaintiff, an advertising agency, is going to have Google results for its name dominated by criticism of bloggers. Note while I think that in principle it's a good idea that the more powerful should need to consider a public backlash when suing the less powerful, there's an aspect of meet-the-new-boss-same-as-the-old-boss in the concept that a handful of bloggers have the ability to determine the public perception of an entity. After all, there can only be ten top-ten results (and sites can appear twice). So we're talking about an extremely small number of people.

One of the very few advantages of my having a blog is that it provides me a means of running Google experiments. Despite being a Z-lister, I have accumulated enough site PageRank and such in my weblife (from other work) that I often rank far higher than my lowly blog position would otherwise grant me.

And indeed, my earlier post on the case is now in the top ten Google results for the plaintiff's name. But there's only been around five hits to it from various searches. So, sorry to blog boosters, I'm not sure the Google-huffing is accurate here. The mainstream media coverage is likely going to have far more of an impact based on sheer numbers.

Obviously, there's instances where such effects would matter. But it's going to depend a lot of the status of the critics and the relative power of the entity being criticized.

Posted by Seth Finkelstein at 11:57 PM

April 29, 2006

Warren Kremer Paino v. Lance Dutson, and Google keyword matching

Warren Kremer Paino v. Lance Dutson is a lawsuit by an advertising agency against the writer of the blog Maine Web Report. (source: MBA, Boston Globe).

Some key issues of the dispute appear to revolve around actions of the Maine Office of Tourism, and its Pay-Per-Click (PPC) Google advertising campaign. Lance Dutson has been criticizing this campaign on various grounds, and agency Warren Kremer Paino Advertising has sued him for "copyright infringement ... defamation and trade libel/injurious falsehood".

Here's one aspect of the case I've dug though. When a search is done for words such as [Camden Maine Bad Lawyers], the Google advertising display algorithm might match on the keywords "Camden Maine", and display the ad for that. This would not mean that the person who was buying the keywords had any particular interest in targeting "Bad Lawyers". Or the algorithm might match on the words "Bad Lawyers", which would not imply that the buyer had any interest in "Camden, Maine". There are some very broad choices as to the extent of matching which can be made by the ad-buyer. This is the background to Lance Dutson's post:

Maine Office of Tourism Corners Smut Market
http://www.mainewebreport.com/2006/04/07/maine-office-of-tourism-corners-smut-market/

Well the ads aren't down, wishful thinking on my part.
But it appears the MOT is diversifying it's target audience, maybe to make sure more good folks come to see our state. These are screenshots from Google this morning:

Then he displayed screenshots of Google searches for [camden maine child pornography], [camden maine escort], [camden maine xxx], [camden maine swingers]. These matched the "camden maine" keywords, and hence had ads for the Maine Office of Tourism ("MOT")

In a later comment (April 28) to the post, he explained:

You are completely correct, these ads were a result of broad matching. That's what I'm trying to illustrate, the folly of broad matching, because the ads end up in stupid places, like I've shown here.

However, the Maine Office of Tourism seems to have taken that post as a literal accusation that they were intentionally advertising to pedophile tourists. From the lawsuit:

11. Dutson also claimed, falsely, that WKPA expended state tourism funds for the purpose of returning internet search results for non-tourism activity, such as pornography and pedophilia.

I am not a lawyer, so I won't comment on the legal merits of such a charge. Though socially, given the relative power of the parties involved, it strikes me as an extreme overreaction.

Posted by Seth Finkelstein at 09:39 AM

April 24, 2006

Chocolate Poker Chips, the Google Logo, and Search Relevance

A Google Blogoscoped post about Google chocolate poker chips caught my attention. The description says:

We love chocolate, and occasionally we're known to play a game of poker or two. Why not combine the two and offer this fun-but-odd treat, the Google Milk Chocolate Poker Chip. Sold individually.

But that's some very expensive chocolate!
At 75 cents per each "Chocolate Poker Chip", those are priced like Google's stock (it's a complete reverse of "this item not packaged for individual retail sale").

Now, I wondered just how much the logo is costing. I've seen that sort of chocolate novelty before, unbranded. It turns out the very same basic chocolate coin, without the logo, can be had for around 19 cents ($65/345 chips). And probably even cheaper at a discount store.

However, those chips wouldn't have the "Google" logo on them. So you're paying a lot for the designer label. And that's where things get interesting from another angle. I tried to search for how much it would cost to put a corporate logo on a chocolate poker chip (wonders of the Net). However, the resulting Google search was not a pretty sight. All the "poker" spammers reduced the Google search results to a very bad hand indeed. A Yahoo search seemed a little better, but not by much.

Clusty won the results relevance battle royally. There was still a huge load of spam, but searching ["chocolate poker chip"], and selecting the "Milk Chocolate Poker Chip" had a desired result on the first page. Item #8 pointed to an internal page of a company called A La Carte, and flipping through their catalog quickly showed logo prices for decorated poker chips.

So between the base price for the customization, multiple colors on the logo label, and whatever volume deal Google might have, it seems that Google wasn't ripping off people on the price of a chocolate chip.

But again, anyone buying them is paying a lot to have that brand on a piece of chocolate. And it's not even good chocolate.

Posted by Seth Finkelstein at 10:13 AM | Comments (3)

March 30, 2006

Google Finance, Blogs, and Politics - use COMPANY NAME, not TICKER SYMBOL

Note: Exposing Yahoo, Inc., or other companies, by having blog posts show up on a company's page in Google Finance, is accomplished by having the post rank highly for the company name, not company ticker symbol.

The algorithm used by Google to rank blog posts for Google Finance doesn't have anything to do with the company ticker symbol. It works off a search of the phrase of the full company name. It just looks like it's related to the ticker symbol, since that's how the Google finance page itself is organized, and many finance articles are written in a style of "[name] China Repression Associate Personnel [symbol] (Nasdaq: CRAP) ..."

This is very clear if you look at a company with a long name, such as e.g. "Check Point Software". You'll likely see many articles which make it clear that what matters is name, not symbol. So if you have a post about "Secure Computing (SCUR)", it's the "Secure Computing" part which matters, not the "(SCUR)" (the best phrase to use there is actually "Secure Computing Corporation". At best, the ticket symbol is a related word, but by no means the primary ranking factor.

Spammers have already discovered this use of blogs, so it'll probably be changed soon. But while the fun lasts, remember to Google-bomb with the correct target.

Posted by Seth Finkelstein at 06:47 AM | Comments (4)

March 17, 2006

http://www.google.ca/#terrorists - spam result

In Google Terrorists on Yahoo, Philipp Lenssen asks

When you search Yahoo UK/ Ireland for "google", the fifth result will be for... "www.google.ca/#terrorists". What's that, a Yahoobomb?
[Thanks Eric of Mechanical Turk Monitor.]
Update: Something similar happens when you search Yahoo.com for "mcdonaldsforshizzo" (the second result for me is "www.google.com/#mcdonaldsforshizzo"). [Thanks Maurizio M.]

(also Search Engine Watch)

These seem to be results of web-spamming, perpetuating itself by pages being scraped and re-scraped. It's not clear how it started. But I conjecture the tag after the "#" seems to be so that they can track which spamming pages are working.

Yahoo is being affected by the spam-pages.

Here's some examples (*not* linked, since these are spam!)

http://www.google.com/search?q=cache:7J67NvwZsgYJ:community.jyve.com/index.php

http://www.google.com/search?q=cache:oBjjhB7C9mMJ:www.selectedlink.com/worldweb/canada%2Bsite.html

http://www.google.com/search?q=cache:TDBiUu8a46AJ:198.65.112.65/search/results.php

http://www.jolt12.co.uk/google_search_engine.html
(note there's a tag "#darwin" there)

Posted by Seth Finkelstein at 07:38 PM

March 15, 2006

Google Subpoena Hearing Result - Of Mountains, Mice, and Elephants

Now we know the Google subpoena hearing result:

A lawyer for the Justice Department told [Judge] Ware that the government would like to have a random selection of 50,000 Web addresses and 5,000 random search requests from Google, a small fraction of the millions the government originally sought.

As the saying goes, the mountain has labored and brought forth a mouse.

To save myself typing, I'll just quote Andrew Orlowski's take:

For the hearing today was a charade in several ways. Google and Justice department attorneys had already agreed on the scope of the data to be transferred, in private negotiations before today's hearing - for which the Judge complemented both parties.
So why hold it at all?
Because the hearing allows both parties to clean up their tarnished public reputations.

It's all been a serious of misreporting, hype, and more importantly, a projection of people's worries onto a convenient target. Because, further, the real problem is not going to be so openly discussed, and elephant in the room.

Under the PATRIOT Act, Federal officials can undertake wide ranging data mining requests on Google's treasure trove of information. And not only is Google unable to refuse such requests - it can't even talk about them.

There's a deep issue with search engines as Big Brother's agents. But it's hard to get people's attention about that, given the secrecy surrounding the serious problem. So a relatively trivial searching data study was drafted as a stand-in for the, ahem, sexy topic.

Hopefully there's been consciousness-raising. We'll find out.

Posted by Seth Finkelstein at 09:02 AM

February 21, 2006

Perfect 10 v Google - Google Image Search Can Be Copyright Infringement

In Perfect 10 v Google, a judge has ruled (news report), in a preliminary injunction:

The Court now concludes that Google's creation and public display of "thumbnails" likely do directly infringe P10's copyrights. The Court also concludes, however, that P10 is not likely to succeed on its vicarious and contributory liability theories.

This is a quite unfavorable outcome for the dispute over Google Print: Copyright vs. Innovation vs. commercial value.

Some key elements:

i. Commercial Versus Noncommercial Use
In assessing whether a use is commercial, the focus here is not on the individuals who use Google Image Search to locate P10's adult images. Nor is it on whether their subsequent use of the images is noncommercial (e.g., titillation) or commercial (e.g., to print and sell). Rather, it is Google's use that the Court is to consider. That use, P10 contends, is commercial in nature. The Court agrees.
Courts have defined "commercial uses" extremely broadly. [...] Google unquestionably derives significant commercial benefit from Google Image Search in the form of increased user traffic--and, in turn, increased advertising revenue. The more people who view its pages and rely on its search capabilities, the more influence Google wields in the search engine market and (more broadly) in the web portal market. In turn, Google can attract more advertisers to its AdSense and AdWords programs.

Note this is very unfavorable for the Google Book fair-use argument. Because there, Google's use is also commercial in nature, under similar reasoning.

A distinguishing factor from an earlier, more favorable, decision (Kelly v. Arriba Soft):

But unlike Arriba, Google offers and derives commercial benefit from its AdSense program. AdSense allows third party websites "to carry Google-sponsored advertising and share revenue that flows from the advertising displays and click-throughs."

And regarding the factor of effect on potential markets:

On the other hand, Google's use of thumbnails likely does harm the potential market for the downloading of P10's reduced-size images onto cell phones. Google argues that because "P10 admits [that] this market is growing," its "delivery of thumbnail search results" must not be having a negative impact. Apart from being more relevant to the quantification of damages, this weak argument overlooks the fact that the cell phone image-download market may have grown even faster but for the fact that mobile users of Google Image Search can download the Google thumbnails at no cost. Commonsense dictates that such users will be less likely to purchase the downloadable P10 content licensed to Fonestarz.

That's a strong legal rebuff to a commonly-seen argument on these issues.

And more:

D. Public Interest
Google argues that the "value of facilitating and improving access to information on the Internet . . . counsels against an injunction here." This point has some merit. However, the public interest is also served when the rights of copyright holders are protected against acts likely constituting infringement. Furthermore, in this case a preliminary injunction can be carefully tailored to balance the competing interests described in the first paragraph of this Order: those of intellectual property rights on the one hand and those promoting access to information on the other.

Though this is just a preliminary injunction, it's a stark reminder that courts do not necessarily agree with arguments we echo.

[Update - clarified this is only a preliminary injunction]

Posted by Seth Finkelstein at 11:58 PM | Comments (2)

February 07, 2006

BMW.de, Google results, and punditry

BMW.de got "banned" from Google (temporarily) for search results manipulation, and this set-off an amazing amount of punditry. To sum it up briefly, the German website of the car-maker BMW hired a website firm that used fake pages ("doorway pages") to mislead search engines as to the content of the site. In my view, this is a very bad thing, basically corrupting search engine results, a kind of spam.

Google penalizes such actions by removal. This, again in my view, is a good thing. It is a use of power, but necessary and proper against spammer-like actions.

I originally thought there wasn't much to say - after the basic descriptions of what happened, what more is there? But, this just goes to show that I don't have what it takes to be a pundit. One must make controversy - e.g. accuse Google of enforcing "orthodoxy", of being "Orwellian" (the journomind has no good way to deal with thinking about private power - so it comes out in very strange ways, usually involving a few keywords). And that rhetoric brings in the all-important links.

So I am hereby going to stake out my punditry-point:

Death to spammers!

No, on second thought, Google-death is too good for them. A fate worse than death. No, a fate worse than a fate worse than death. ... Well, you get the idea.

How many Technorati-points is this worth?

[Update: Resurrected!. That was a short death. And they've probably got lots of links out of the publicity - SEO by infamy.]

Posted by Seth Finkelstein at 11:58 PM | Comments (5)

January 31, 2006

Lazy punditry: Seeking cheap irony in Google's US vs China actions

[I wrote this as a reply to a mailing-list message about Google "hypocrisy"]

The Boston Globe points out that Google achieves the height of hypocrisy in simultaneously fighting the COPA subpoena while caving in to China's censorship.

While the cheap irony seems irresistible to pundits, it's hardly a reasonable comparison:

1) Say what you will about its sorry state these days, the US government is far more amenable to legal challenges than the government of China.

2) The COPA subpoena is about one part (an expert's report) which is one part of an overall case, where Google is not a party. China apparently made censorship a condition of Google doing business in the country.

The Globe article also gets it wrong, repeating the mistake about "US Justice Department's investigation of online child pornography." The COPA case has nothing to do with an investigation of child pornography. It's basically about a certain type of sexual material legal for adults but not minors and the burdens of restricting access to it with regard to minors.

People have read all sorts of deep political significance into the COPA subpoena, that simply isn't there. In fact, from a business standpoint, Google's actions are far more consistent than hypocritical. That is, they'll make a fuss if it's good PR and relatively costless, but not make any real sacrifice. What incentive is there for any publicly-traded company to act differently?

Posted by Seth Finkelstein at 02:15 PM | Comments (3)

January 26, 2006

My articles on Google and privacy, elsewhere

I've got an article The Google Search Subpoena in Perspective as a guest-post on Google Blogscoped (a widely-read blog about Google). It's a longer version of the points I've made earlier.

I also seem to have done some good in the world, as comments I made about the issue at the popular liberal blog Hullabaloo ("Digby") were graciously incorporated into a post.

Posted by Seth Finkelstein at 09:06 PM

January 20, 2006

Google, Subpoena, and Privacy

[I wrote this as a contribution to the discussion on Dave Farber's mailing list, but I might as well shout to the wind here, as it may not make the moderation cut. The best documentation I've seen is Gary Price's summary at Blog.SearchEngineWatch.com, and their coverage]

Let's take a deep breath and step back for a minute, and recall that this all started from a statistics professor's bright idea of how to design a survey of search engines and measuring how many porn sites are in the average results. It's not Big Brother, NSA Echelon, Total Information Awareness, or any sort of attempt to snoop on individuals. The government narrowed the request down to a sample of one million URLs ~~[DEL]and "a random sampling of one million search queries submitted to www.google.com on a given day" (page 14, McElvain Declaration file). That's it.[DEL]~~ [CORRECTION UPDATE - it was brought to my attention that this was part of the negotiations, but seems to have been dropped - CORRECTION UPDATE]. And it's hedged with protective orders and presumably whatever non-disclosure agreements are necessary.

If I were to be utterly cynical, I'd conjecture that Google decided to make a big noise over this relatively trivial request as a PR strategy to counter the increasing criticism of its omnivorous database collection practices. Remember, there's fever-swamping wolf-criers who will hype an error in government website cookie settings into attacks on privacy laws, or a minor change in obscure harassment provisions to be the end of anonymous blog comment posting. So marketing a storyline of "Google Stands Up To THE FEDS To Protect YOUR PRIVACY" will be quite appealing to a certain mentality, even if the effects are insignificant in practice. Essentially, Google can't lose here. If the subpoena is quashed, it's a big hero for beating back Government Snooping. If not, Google gets to loudly divert attention to the terrible, terrible injustice of being forced by men with guns to produce some search strings for a survey. This will probably inoculate Google against much critical examination in the press, since it will point to how it Stood Up For Freedom.

Now, there's a way in which this could be consciousness-raising, regarding the privacy implications of the huge amount of personal data collected by search engines. But such examination would require journalists going beyond the PR-chow they'll be fed. And sadly, I doubt that will happen.

[CORRECTION UPDATE - it was brought to my attention that the narrowing to "a random sampling of one million search queries submitted to www.google.com on a given day" was part of the negotiations, but seems to have been dropped - CORRECTION UPDATE].

Posted by Seth Finkelstein at 09:12 PM | Comments (5)

January 19, 2006

Google searches and government investigation of pornography sites

It's the return of Free porn, Google, spam, Internet censorship, and the Supreme Court! (really)

Bush Lawyers Ask Judge To Make Google Hand Over Data; Google Promises A Fight:

The Bush administration on Wednesday asked a federal judge to order Google Inc. to turn over a broad range of material from its closely guarded databases.
The move is part of a government effort to revive an Internet child protection law struck down two years ago by the U.S. Supreme Court. The law was meant to punish online pornography sites that make their content accessible to minors. The government contends it needs the Google data to determine how often pornography shows up in online searches.
...
As a result, government lawyers said in court papers they are developing a defense of the 1998 law based on the argument that it is far more effective than software filters in protecting children from porn. To back that claim, the government has subpoenaed search engines to develop a factual record of how often Web users encounter online porn and how Web searches turn up material they say is ``harmful to minors.''

[via John Battelle's Searchblog]

I hope this doesn't lead to another round of touting censorware.

On the other hand, maybe I'll finally be hired as an expert-witness for a fat consulting fee :-).

Posted by Seth Finkelstein at 05:55 AM

January 17, 2006

British National Party (BNP) and Google News

Isn't it very old news that Google News includes the far-right British National Party (BNP)?

Compare:

BNP gets top news listing on Google
Julia Day
Monday January 16, 2006
Google has defended the integrity of its news service after it emerged that reports filed by the British National party are being listed as sources on its website. ...

With, more than a year ago:

http://www.theregister.co.uk/2004/11/07/google_bnp_news/

Tide of migrant BNP PR menaces Google News
News picks a bit illiberal?
By John Lettice
Published Sunday 7th November 2004 11:36 GMT
Google News, which last year accepted that press releases counted as news, now apparently thinks press releases from the far-right British National Party count too. ...

[Update: See also

http://www.journalism.co.uk/news/story1132.shtml

Google inflames critics
Posted: 8 November 2004 By: Jemima Kiss
Email: jemima[at]journalism.co.uk
The credibility of Google's news tool has come under fire again following its use of press releases from far-right UK political group the British National Party.

]

I guess the news is what the reporters or A-listers say is news :-)

Posted by Seth Finkelstein at 03:54 PM

November 20, 2005

Googlebombing FeministFemaleSexualDysfunction

Lis Riba asks:

Finally, as I mentioned in an earlier post, all the top hits when Googling "feminist and FSD" or "feminist and sexual dysfunction" are from the anti-FSD contingent. This page of my blog is actually within the top 100, but fairly low on the page.
I don't normally ask this, but if a few more of you would be willing to link to this week's archive using those keywords... well, at least that way other sufferers searching for help can more easily get this point of view as well for a more balanced picture.

That is, a feminist and female sexual dysfunction (FSD) Googlebomb?

Posted by Seth Finkelstein at 11:59 PM | Comments (3)

November 18, 2005

Google Print - Fair Use vs "Microsales"

The Google Print debate has gone another round. I think it's illuminating to approach it from a mirror-image of fair use:

It's about "microsales" (really, micro-commercial use)

What's new, in an evolutionary sense, is that Google has found a way to make large amounts of money off accumulated small sales. This has led to an argument I'll call the "willful ignorance of scaling differences".

The argument runs that if a single excerpt can be fair use in a vaguely commercial context (e.g. quoting a snippet in a review, even if it's a paid review), then an unlimited number of excerpts (scale in one direction) in a purely commercial context (scale in another direction) are theoretically identical.

This doesn't follow. The result is in fact, "undefined". Like the saying "The Constitution is not a suicide pact", it's arguable that fair use is not license for market-death by a thousand cuts.

The issue didn't arise before, because there wasn't a context where this sort of usage could be marketed in a large scale. But in retrospect, the problem arises very clearly from lowered transaction costs.

But it's not obvious that the authors and publishers are right either. Google's certainly providing a service where stifling it with rights clearances seems inadvisable. That's not going to benefit either authors or publishers - only lawyers!

Has anyone explored that some sort of mechanical license might be better than winner-take-all?

Posted by Seth Finkelstein at 02:51 PM | Comments (3)

November 16, 2005

GooglePrint EgoSurfing

Google Print now has enough items to make for an interesting print egosurf. Some numbers for a few websites:

37 pages on "peacefire.org"
6 pages on "censorware.org"
1 pages on "censorware.net"
4 pages on "sethf.com"

Posted by Seth Finkelstein at 11:42 PM | Comments (1)

October 31, 2005

"Regulating Search" Conference

"Regulating Search: A Symposium on Search Engines, Law, and Public Policy"

Search is big business, and search functionality increasingly shapes the information society. Yet how the law treats search is still up for grabs, and with it, the power to dominate the next generation of the online world. How will this potential to wield control affect search engine companies, their advertisers, their users, or the information they index? What will search engines look like in the future, and what is the role of regulators in this emerging market? This symposium will map out the terrain of search engine law & policy.

[via copyfight]

It's another Sign - if not Bubbleness, definitely much Hot Air. Which is not to say I'd be averse to having a share of any inflated values.

Posted by Seth Finkelstein at 11:51 PM

October 24, 2005

Google Print And Fair Use

Scrivener's Error has a series of posts, focusing on issues such as Google's digitization and Fair Use (via Derek Slater). Much substantive criticism:

Admittedly, this doesn't look a whole lot like the analysis [of thumbnail images]. That is primarily because, as I've tried to make clear, this case isn't [about thumbnail images]. It is not being heard in the Ninth Circuit, which (along with the Eleventh Circuit) has the least-stringent view of fair use; it is not based on materials merely gathered, but for which substantial and conscious copying must occur for any of the three "uses"; it is not based upon reuse of materials in exactly the same form, medium, and purpose/function as provided by the copyright holder; and does not concern a well-delineated final use and presentation.

It's good to get out of the echo chamber.

Posted by Seth Finkelstein at 09:14 AM

October 19, 2005

McGraw-Hill v. Google, another Google Library Project lawsuit

McGraw-Hill v. Google is the latest publisher lawsuit against the Google Library Project (via Copyfight). The complaint seems basic, again claiming the project is copyright infringement.

This lets me elaborate on a point I've been making, and was earlier quoted about Google Print (thanks, Andrew):

What both parties really mean is that Google has got stuff, if not for free, then at a bargain price. Libraries had to pay for licenses or physical material: Google only pays for the scanning - which is an extremely good deal for Google.
As Seth Finkelstein reminds us:
"Consider that this is not Google contributing to culture. It's Google trying to supplant the publishers as the middleman business between authors and readers," he wrote.
So what at first looks like a copyright issue on closer examination is really a compensation issue. ...

A copyright issue is virtually always a compensation issue (the exceptions, "moral rights", are very rare). Copyright functions as a RESTRICTION ON TECHNOLOGICAL INNOVATION to impose compensation issues. And I don't mean that's new. When the printing press was being developed, it was extensive technological innovation. And surely, at the time, it must have seemed as cool and "geeky" and rich with unbounded promise, as search engines do now.

Now, printing itself, is becoming even more of a commodity, and in some cases (E-books), being eliminated entirely. A publisher's role as a middleman in terms of arranging for the physical printing of a book is much diminished. Whatever printing is required can be contracted, and perhaps even done on-demand. Physical distribution is still required for the chunk of paper, but that is being shifted to mail-order from warehouses.

So what's left, of the publisher middleman function? Promotion. Marketing. Advertising. All of which are becoming more important from the shifts above.

Which is exactly what Google does, in terms of ads for search terms, and "snippets", and trying to match readers with products, err, search results.

That's why Google wants this business. It's not culture. It's the intermediation role between writers and readers.

Posted by Seth Finkelstein at 11:52 PM | Comments (10)

October 17, 2005

Siva Vaidhyanathan on Google: "They don't work for us"

I think this is worth echoing, from an "On The Media" segment on Google Print: (my emphasis below)

BOB GARFIELD: If not Google now, then who? And when? Who should be in charge of deciding which books get scanned?
SIVA VAIDHYANATHAN: Well, I actually think that this is the job of libraries. I think libraries should be doing this first and foremost. The Library of Congress should have identified this as a major public need and goal and pursued this sort of project years ago. Instead, they've outsourced it to a private corporation, and this corporation, as good as they like to make us think they are, is still operating by keeping us blind. Their technology is proprietary. Their algorithms for search are completely secret. We don't actually know what's going to generate a certain list of search results. They don't work for us.

Again - "They don't work for us". Whatever their cool geek-dream origin (and I share the fantasy!), Google is a now a very large corporation, accountable only the shareholders. It may seem overly critical to emphasize it, but that's reality.

Posted by Seth Finkelstein at 09:55 AM | Comments (4)

September 23, 2005

Google Print, Statutory Damages, And The Library Exception

Ed Felten ponders:

"... because if Google loses, it won't just have to reimburse the authors for the economic harm they have suffered. Instead, Google will have to pay statutory damages ... In light of the risk Google is facing, it's surprising that Google went ahead with the project."

Aha! Now it all falls into place!

In fact, Google WON'T necessarily have to pay ANY statutory damages. Because of an obscure part of the statutory damages provision:

The court shall remit statutory damages in any case where an infringer believed and had reasonable grounds for believing that his or her use of the copyrighted work was a fair use under section 107, if the infringer was (my emphasis):
(i) an employee or agent of a nonprofit educational institution, library, or archives acting within the scope of his or her employment who, or such institution, library, or archives itself, which infringed by reproducing the work in copies or phonorecords; or ...

Google has the lawyer-power where, even if it loses on legal principle, it can likely persuade the judge to let it off the hook for ANY damages because of the "agent of a ...library" exception.

That explains a lot which has been going on. Quite a lot. Truly, follow the money, and much is revealed.

Posted by Seth Finkelstein at 11:01 AM | Comments (2)

Google Print Is Not Copyright's Enemy-Of-My-Enemy-Is-My-Friend

Siva Vaidhyanathan makes excellent points about the Google Print Lawsuit:

The issue is the effect on the "potential" markets, not the established markets. Because a market exists (and a greater potential market lurks) for licensed digital images of published books, the library project is about that market (see Amazon and Google Print) rather than the market for the physical book. ...
Again, please don't misunderstand me. I am not cheering for the authors here. I am just worried that admiration for Google is clouding judgements. ...
The copyright issue at hand here is not really fair use. That's just trivia.
It is this: Will copyright remain a copy right or will it become a distribution right? Which is better? Which should it become? What are the gains and losses if we were to see such a shift? Would Time-Warner and Disney (both major book publishers) let that happen?

Google is using an "open" business model here: Use the content, or services built on the content, as a loss-leader to draw eyeballs and so sell advertising. This is a venerable, workable, business model. Thus, people then think that boosting Google's use of this business model is a blow against the copyright business model. Therefore, it's called "fair use", it seems to me often more on the basis of this policy advocacy, rather than any detailed legal analysis.

It's an appealing thought. But sadly, I have the sense that in this case we're just replacing one boss with another. This is not an altruistic act where Google is merely contributing to the Commons. Rather, it's strategic business positioning for them. There's nothing intrinsically wrong with that. It's a good move, leveraging their current strengths. However, there's no need to automatically imbue it with an enemy-of-my-enemy-is-my-friend aspect, which isn't necessarily there.

Posted by Seth Finkelstein at 09:11 AM | Comments (6) | Followups

September 21, 2005

Google Print Lawsuit

The inevitable Google Print Lawsuit has been filed, by the Author's Guild.

The complaint doesn't appear to argue much beyond a simple claim that Google's actions are copyright infringement, the core is:

39. Google has made and reproduced for its own commercial use a copy of some of the literary works contained in the University of Michigan library, which contains the Works that are the subject of this action, and intends to copy most of the literary works in the collection of that library.
40. Google's conduct is in violation of the copyrights held by the Named Plaintiffs and other members of the Class.

As I wrote earlier in Google Print: Copyright vs. Innovation vs. commercial value, I think there are some inherent conflicts here:

That is, the technology company can't be right every time, almost by definition. Because copyright as a limited monopoly fundamentally restricts innovation in some ways. That's the trade-off.

I'm not in the business of writing legal briefs, and I don't have any particular passion for or against Google Print, so I'm not going to go deeply into the fair-use arguments (no point for me in that ...). Anyway, I suspect that it's just going to come down to a whether the relevant judges believe the project is useful or not, which is leading to a perception/PR battle.

Posted by Seth Finkelstein at 02:31 PM | Comments (3)

September 09, 2005

"Searching for failure? Try George W. Bush"

The Register notes:

... Google is currently offering the Prez's biography as its top link for the search 'failure':

They don't seem to have realized that this is not a new Google-bomb, but rather some sort of effect of the old "miserable failure" Google-bomb (and possibly also due to transient effect of Google being in the middle of an update). Or maybe they did, but decided it was worth reporting anyway ...

The miserable failure root cause is apparent from George W. Bush's biography being the top result right now for a search for either failure or miserable (because of all the links for miserable failure).

And further proof, Michael Moore's site (similarly Google-bombed) appears as a result for all three searches.

~~I suspect some of this will change in a few days as the Google update completes~~. [Update 9/27 - It's a long-lasting failure]

Posted by Seth Finkelstein at 11:52 PM

August 23, 2005

Yahoo! Google Size Study Still Flawed

The study comparing sizes of Yahoo! and Google has attempted to address some issues, but is still flawed. Per Jean Véronis:

In the new study, the authors still draw two words at random in the ispell dictionary, but exclude a third, random word from the search (using the exclusion operator - ), in the hope of removing word lists and spam from results. For example, they will search for switchers trophoblast -agnus. They find that Google still returns more results (although less often than before).
Unfortunately, this new strategy doesn't remove the bias. Word lists and spam are still returned, as can be easily checked on any of the queries used, such as switchers trophoblast -agnus. Here are the results from a Google search this morning : all results but one are word lists and junk.

Let me further elaborate. The study's authors assume:

To deal with this problem we modified our original search parameters of searching for two random words from the commonly available English Ispell Wordlist (a total of 135,069 words) [4]. Instead, we searched for two random words and not a third random word. This method, we feel, helps to exclude the vast number of "dictionaries" and "wordlists" because those results should be filtered out by the "not a third random word" part of our search query.

The intent is clear. But the above statement is just not very true. In fact, it may not even exclude format variations of the original wordlist. For example, hypothetically, if there's a wordlist split into two files, one covering words starting with letters "a-n", and another for letters starting "o-z", then searching [alpha beta] will find the first file, yet searching [alpha beta -zebra] will still find the exact same file.

More importantly, all wordlists are not identical. A specific example in the "verification" study is searching [guck wheeze -prothrombin].

Terms: guck wheeze -prothrombin
Google totals:
Duplicates Omitted Estimate: 88
Duplicates Omitted Total: 56
Duplicates Included Estimate: 88
Duplicates Included Total: 83

Yahoo totals:
Duplicates Omitted Estimate: 30
Duplicates Omitted Total: 25
Duplicates Included Estimate: 29
Duplicates Included Total: 28

Ratio of Google/Yahoo for this query:
Duplicates Omitted Estimate: 2.933333
Duplicates Omitted Total: 2.240000
Duplicates Included Estimate: 3.034483
Duplicates Included Total: 2.964286

But "guck" and "wheeze" are common words, while "prothrombin" is much more obscure. So, per the search, there are still many wordlists which contain "guck" and "wheeze", but not "prothrombin" (as well as spam pages).

In general, sampling bias must be carefully examined, because extensive repetitions of a flawed procedure will still yield a fundamentally flawed outcome.

Posted by Seth Finkelstein at 02:01 PM | Comments (1)

August 16, 2005

"A Comparison of the Size of the Yahoo! and Google Indices"

The study "A Comparison of the Size of the Yahoo! and Google Indices" is being widely reported. On initial examination, I've found a bad problem with it.

The methodology is severely flawed, with a sampling-error bias.

In order to create a large number of queries that returned less than 1,000 results, we took the commonly available English Ispell Wordlist (a total of 135,069 words) [4] and wrote a PERL script to randomly select two words at a time from that list. The script then used those keywords to search both Yahoo! and Google and logged the number of results returned. For the purposes of this study we used a sample of 10,012 different searches of Yahoo! and Google using our randomly selected keywords.

By sampling random words, they biased the samples to files of LARGE WORDS LISTS!

And this effect applies, to a great or lesser extent, to EVERY SAMPLE.

One can see this in their log of search results.

First entry:

Terms: carbolization clambers
Google totals:
Duplicates Omitted Estimate: 7
Duplicates Omitted Total: 4
Duplicates Included Estimate: 7
Duplicates Included Total: 7

Yahoo totals:
Duplicates Omitted Estimate: 0
Duplicates Omitted Total: 0
Duplicates Included Estimate: 0
Duplicates Included Total: 0

Do the Google search

Every entry is a large word-list file. Some are presumably (near?) duplicates of the same file

And every search will have this problem, since every search will pick up files like those.

It's a severe systematic error.

Update [12:30 pm EST] - add search-engine spam to the sampling bias. Consider:

Terms: alkaloid's observance
Google totals:
Duplicates Omitted Estimate: 29
Duplicates Omitted Total: 15
Duplicates Included Estimate: 29
Duplicates Included Total: 29

Yahoo totals:
Duplicates Omitted Estimate: 0
Duplicates Omitted Total: 0
Duplicates Included Estimate: 0
Duplicates Included Total: 0

Look at the results. Every page is either a gibberish spam page or a wordlist.

Posted by Seth Finkelstein at 11:44 AM | Comments (8)

August 12, 2005

Google Print: Copyright vs. Innovation vs. commercial value

The recent Google Print debate has been far-reaching, e.g. Siva Vaidhyanathan: Google Avoids Copyright Meltdown:

If copyright is to mean anything at all, then corporations may not copy entire works that they have never purchased without permission for commercial gain. I can't imagine what sort of argument -- short of copyright nihilism -- would justify such a radical change in copyright law.

When discussing the implications of the copyright system, I sometimes try to point out that there are intrinsic conflicts inherent in it, especially in terms of technological advances.

Let's step back for a moment. Why is Google doing this book-scanning project? It's not because it's just so cool (even if it is). While coolness may justify a small-scale promotional project, the scanning efforts are expensive. So Google, as a company, obviously sees some value in the effort. This is not wrong. But it's also a direct conflict with the granted monopoly know as copyright. Whenever there is value, particularly commercial value, there is conflict over who should be able to receive it.

It's not hard at all to see potential returns here. Besides the obvious selling of ads from searches, consider that it positions Google to be a potential partner in any e-books venture. It's not a guarantee. But if a company already has a scanned, indexed, "production" version of the book, that's a good selling point. From this perspective, Google's interest in working with libraries can be seen as a way to do an end-run around contracts with publishers, and Amazon's own evident efforts (talking about doing well by doing good!)

That's just an example. Look at it this way. Google is saying, "Let us make e-books of all library content, and keep them - for copyright reasons we'll only display search results". That's clearly very dubious under copyright. But ... it's obviously an innovation. However, it's a very commercially valuable innovation. Which brings us back to copyright. A problem with the polarized debate over copyright is that it's often framed in terms of morality of property rights, opposed by individual usage rights (which leads to screaming of "monopolists" vs "thief"). But if the Google Print scanning project is viewed as a balance of economic interests - between one company that wants to leverage its search expertise into the e-book area, and other companies which want to maintain their limited monopoly on the potential market, then assuming one believes copyright properly grants some exclusive rights - it's not obvious which is correct here.

That is, the technology company can't be right every time, almost by definition. Because copyright as a limited monopoly fundamentally restricts innovation in some ways. That's the trade-off.

Posted by Seth Finkelstein at 10:47 PM | Comments (1)

June 04, 2005

Online Journalism Review - "Companies subvert search results to squelch criticism"

Companies subvert search results to squelch criticism
http://www.ojr.org/ojr/stories/050601glaser/

"It's not illegal, but it's SEO gone bad. Companies such as Quixtar are using Google-bombing, link farms and Web spam pages to place positive sites in the top search results -- which pushes the negative ones down."

Echoed for the following:

CNN has denied any wrongdoing. "There is absolutely no truth to any speculation that CNN was involved in blog spam," CNN spokeswoman Christa Robinson told me via e-mail. Programmer/blogger Seth Finkelstein theorizes that the person spamming the blogs was more likely trying to help get noticed by search engines by doing amateurish keyword stuffing, rather than an elaborate anti-optimization attack.
"The Net is filled with people who go around and spam blogs to get their message heard, with various degrees of skill at it," Finkelstein wrote on his Infothought blog. "So by the saying 'When you hear hoofbeats, think of horses before zebras,' when you see weird spam, think marginal people before elaborate PR campaigns. It's a much better fit."

[Which quoting, note, did not "just happen", but was due to the grace of an A-lister to whom I flacked my post, and found it worthy, so I was approved by a gatekeeper]

Posted by Seth Finkelstein at 10:13 PM

May 11, 2005

Google Web Accelerator As Circumventor Of China Censorware?

According to Fons Tuinstra at "China Herald" blog:

But a reader of this weblog suggested that in combination with IE Google's new tool would also beat the internet censor. Well, that was enough encouragement for me to have a try. Indeed, the web accelerator helps to beat our internet nanny, at least I got to the BBC news services very easy. I do not think that Google wanted to bring down the firewall (as far as it still is in place with so many proxies around) but they effectively did.

[Note the "obvious" solution is to censorware Google]

Posted by Seth Finkelstein at 11:42 AM | Followups

April 29, 2005

TrustRank

"TrustRank" is criticized by Jeff Jarvis:

Michael Zimmer points us to what I think is a fairly hair-brained scheme from Google that reveals its fetishistic prejudice in favor of machines and also its prejudice in favor of big, old media.
The search engine wants to come up with an algorithm to judge trust in news. They already have a trademarked name for it: TrustRank.

"Fetishistic prejudice"? No, no, no. Such algorithms are the missing piece of building a journalism data-mining business. That's what's needed to really turn the results into other than a list of items by keywords. Moreover, something useful would be the best thing ever to happen to "citizen journalism"!

Every once in a while, when I talk to Andrew Orlowski, about Google and society, I say there are deep, hard, computational problems in the world, and nobody has solved them. But in these efforts, sometimes someone comes up with just a little nibble at the solution, and the outcome can be extraordinary (of course, a lot else has to go right too, many businesses have had good technology and failed, that's another topic).

One big problem with "citizen journalism" is finding effective ways to sort through the piles of ranting and propaganda and echo-chambering, etc., in order to get something useful, at the limits an ordinary person can stand. Lists of articles where keywords appear, don't scale (a workable solution there, for web pages, was the original advantage of Google).

Of course any such algorithm will have certain values and prejudices. A whole book could be written on the problems of Google's algorithms. To be fetishistic about something being an algorithm is indeed a common sociological failing. And as noted, the algorithm itself could favor old vs new, big vs small etc (similar criticisms have been made of Google's web page ranking, and in fact there appear to be certain tweaks to deal with those issues).

But it seems likely that someone who develops a "trust" algorithm which is halfway functional - even if it's ponderous, flawed, prejudiced, biased (sound like something? e.g. criticism of journalism?) - will have an immense advantage in the race to exploit that commodification and de-professionalizing of journalism.

Maybe the best thing to do is to fund Google alternatives, to insure Google doesn't turn into the next Microsoft-like monopoly

[That wasn't a pitch, though it reminds me again I really should get back to analyzing Google. The relevant keepers of the gates are better for me, and there's money in it, in contrast to the horrible effects of fighting for net-freedom]

Posted by Seth Finkelstein at 11:59 PM | Comments (2)

March 31, 2005

Guilt by Blogroll Association, or Google-Abuse

Following up on Google-Newsbombing, where

Juan Cole discussed a "Google Smear" in the News search, the right-wing magazine had a rebuttal. Passing over all the Middle-East politics, which is outside the scope of my own blog, the quality can be inferred from the gem of "Google-Abuse" (emphasis in the original):

Cole clearly regards Raimondo as a legitimate, authoritative source of information, while complaining that his critics rely on dubious sources. We counted 14,400 web pages in which the names Juan Cole and Justin Raimondo appear together.

I can't even figure out where they got the 14,400 number.

But even despite the silliness of deriving much from a count of pages where two name appear together (Google-guilt by association), it's doubly dumb because the count itself will include many, many pages from bloggers who have both names on their individual post blogrolls. As well a duplicate pages, mirrored pages, not to mention the article itself (and now this page!).

Justin Raimondo pointed out that searching his name and "David Horowitz", (the infamous publisher of the right-wing magazine), yield many web pages too.

This inspires me to propose: Celebrity Google-Whack - find a search of just one hit with two quoted celebrity names (the more famous the celebrities, the better the whack).

Posted by Seth Finkelstein at 12:22 AM | Comments (2)

March 27, 2005

Juan Cole and Google-Newsbombing

Juan Cole discusses what he calls a "Google Smear":

The GoogleSmear as Political Tactic

The Google search has become so popular that prospective couples planning a date will google one another. Mark Levine, a historian at the University of California Irvine, tells the story of how a radio talk show host called him a liar because he referred to an incident that the host could not find on google. That is, if it isn't in google, it didn't happen. (Levine was able to retrieve the incident from Lexis Nexis, a restricted database).
It seems to me that David Horowitz and some far rightwing friends of his have hit upon a new way of discrediting a political opponent, which is the GoogleSmear. It is an easy maneuver for someone like Horowitz, who has extremely wealthy backers, to set up a web magazine that has a high profile and is indexed in google news. Then he just commissions persons to write up lies about people like me (leavened with innuendo and out-of-context quotes). Anyone googling me will likely come upon the smear profiles, and they can be passed around to journalists and politicians as though they were actual information.

The interesting Google aspect here is that what's going on is not the typical Google-bomb, of link-text words. It's a Google-NEWSBOMB. Which isn't primarily affected by link-text words, but rather the general "newsworthiness" of the site.

That is, currently a Juan Cole Google News search, brings up as its first value a right-wing hatchet job. Because, roughly, it's the most powerful "news site" which has the necessary factors (it's not PageRank alone, but PageRank here is the critical relevance factor which is extremely "expensive")

This is really quite interesting in terms of implications. Because, remember, the news sources in Google News are all selected by Google. And their process for selection is very opaque.

Again, the implications for the selection algorithm are different for the news search than for the web search. In the web search, the first item when searching for a person is typically that person's own site, if they have any web presence at all. It'll certainly be in the top ten. But on a news search, their home site will never appear. The top items will be the top "news" sites, roughly by "volume". Imagine if there was a "radio search" and it was ranked by number of stations. Thus, anything said by syndicated talk show hosts would come up on top. The effects would be worrisome. Because it's a lot easier to buy ranking in general that way.

Follow-up: See Guilt by Blogroll Association, or Google-Abuse

Posted by Seth Finkelstein at 11:59 PM | Comments (0)

March 07, 2005

Shelley Powers' Theory of Google AutoLink as Sexual Dimorphism

Google AutoLink has been prompting various negative reactions. But Shelley Powers has a hilarious satire/explanation, which warrants love for the linklorn:

When we women ask the power-linkers why they don't link to us more, what we're talking about is communication, and wanting a fair shot of being heard; but what the guys hear is a woman asking for a little link love. Hey lady, do you have what it takes? More important, are you willing to give what it takes?
Groupies and blogging babes, only, need apply.
And the phrases, "circle jerk" and "Google juice", take on new depth and sudden meaning in light of this discovery.
...
Yes, so much is explained now. Where I saw AutoLink as a relatively uninteresting and innocuous innovation, to some guys it was a way of dropping their pants and swinging what they got, while to others, it was a big metal Zipper, just waiting to catch the unwary.

But ... but ... isn't it just the territorial imperative? As men, we are culturally expected to be responsible for the defense of the community against invaders. Which, in cyberspace, then must translate into defending the HTML page against outsiders who might appropriate the link-resources for their own click-"progeny". So, from this perspective, we form into hunting bands to better make use of scarce energy resources. Hence ... both the A-list, and their reaction, is the inevitable neural programming of the sociobiology of blogolution.

No?

Posted by Seth Finkelstein at 11:59 PM | Comments (1)

December 19, 2004

Total Google Awareness, continued

Google-Watch appeals to the American Library Association
http://www.google-watch.org/appeal.html

I'm aware that the ALA is already involved with discovery and lobbying on this issue with the Justice Department over practices that grew out of the USA Patriot Act. But keep in mind that the scale of anything Google does is a million times larger than the scale of anything that involves discrete libraries, access to paper hard copy, and occasional subpoenas for specific information. Perhaps the scale of what Google does is even ten million times larger.

["me too", worth thinking about.]

Posted by Seth Finkelstein at 11:59 PM | Followups

December 15, 2004

The Dark Side Of Google

The Dark Side Of Google
http://www.interesting-people.org/archives/interesting-people/200412/msg00094.html

["me too" - via Discourse]

Posted by Seth Finkelstein at 11:59 PM | Followups

December 01, 2004

China and Censoring Google News

The following post is a service of "Almost everyone else is echoing it, but I'll add some value so people don't complain that I only talk about my struggles with activism".

The latest Google censorship story has its echo-chamber root in the following Reporters Without Borders story:

Google urged to react after Chinese authorities block Google News
http://www.rsf.org/article.php3?id_article=11968

"China is censoring Google News to force Internet users to use the Chinese version of the site which has been purged of the most critical news reports," Reporters Without Borders said. "By agreeing to launch a news service that excludes publications disliked by the government, Google has let itself be used by Beijing."

For some ground-level info, go read the comments on "Shanghai-based journalist" Fons Tuinstra's site (via Dan Gillmor)

"The current block on Google News is a strange one, since only a few of the Google IP addresses have been blocked. The discussion on what is going on is still in full swing, the association RWB makes with the Chinese google site seems one of the less likely ones." (Fons Tuinstra)

[I also did a quick check myself, the ban is sometimes there, but the small implementation seems an accurate description]

Posted by Seth Finkelstein at 07:48 PM | Followups

October 21, 2004

Mysteries of the Google Bomblet

A while back, I did a small experiment:

Let's say I linked a certain phrase "EBig EBrother" to somewhere (such as Google ...). I've used an uncommon phrase here, so as to make it easy. The words "Big Brother" have many hits, but there's no occurrence "EBig EBrother". Well, there wasn't until this post gets indexed.
What happens?

Time to try the experiment again. It turns out that small-scale Google-bombs in fact might be useful probes into the functioning of Google's index. We already know that large-scale Google bombs tell us something about how it functions (remember, popularity vs. authority). So I suspect there's also insight which can be gleaned from lesser efforts.

[Update: For variety: EBiggest EBrother]

Posted by Seth Finkelstein at 11:59 PM | Followups

September 25, 2004

Mercury News mentions in article about Google and China News Censorship

I'm mentioned today in a Mercury News article about Google and China censorship:

In many cases, links to the Web sites will appear in search engine results, but they return error messages when users click on them.
Seth Finkelstein, an Internet filtering authority, said he suspects Google agreed to remove the Web sites from its Chinese news service as part of a deal with the Chinese government to lift its 2002 ban.
But [Jonathan] Zittrain said he doubted there was any type of collusion and believes Google appears to have been forced into a difficult business decision.
``They have to face a decision if they want a footprint there,'' he said. ``If they offend the Chinese government, it will be hard to succeed there, which will make it impossible to offer the Chinese citizens any information.''

I like being described as "authority", and being in the same piece as Jonathan Zittrain (n.b. I don't think we're saying that much different). I need all the ego-boost I can get :-).

Posted by Seth Finkelstein at 09:08 PM | Comments (1) | Followups

September 18, 2004

Google Chinese News Censorship

[An echo, but I don't think this has been publicized much yet]

http://www.peacehall.com/news/gb/english/2004/09/200409180915.shtml

Google Chinese news censorship demonstrated(Pic)
(Sept. 18, 2004)2004 Sep. 16, Bill Xia, Dynamic Internet Technology Inc.
On Sept. 15, 2004, a DynaWeb volunteer reported that Google's Chinese news returned different results depending whether the search was conducted in China or in the U.S. Today, we were able to confirm this report through proxies in China. Search results inside China do not contain news from blocked sites such as www.epochtimes.com.au. (boxun.com)
...
On the first page, any entry from http://www.epochtimes.com.au or http://www7.chinesenewsnet.com are not shown when searching with a proxy inside China.

[This appears accurate. They give instructions in the article for reproducing the censorship. Though the examples involve keywords in Chinese, which makes it difficult to understand for non-Chinese speakers]

Posted by Seth Finkelstein at 12:35 PM | Followups

August 13, 2004

Google IEM Futures Market Data

I've been trying to figure out if there's some rational way to figure out what's a good bid for the Google IPO price. There's a real market even now, part of the "Iowa Electronic Markets":

The Iowa Electronic Markets are real-money futures markets in which contract payoffs depend on economic and political events such as elections. These markets are operated by faculty at the University of Iowa Tippie College of Business as part of our research and teaching mission.

They're running an IPO Google market

But unfortunately for fans of the idea of markets for everything, the results are very unimpressive.

For exact numbers, there doesn't seem to be much trading data (nothing on many days)

For ranges, there's still not much data. It's sparse and irregular.

As far as I can read it, the market oracle says "Future cloudy. Ask again later."

Posted by Seth Finkelstein at 12:12 AM | Comments (1) | Followups

August 03, 2004

Google Initial Public Offering - Where Are The Bubble-Blowers?

Given the Google Initial Public Offering site has launched, I expected the hype to kick into high gear.

But where is it? Where's all the articles proclaiming "Google has revolutionized the world - you must own this stock at any price!". Where are the stock market touts, arguing "Computer power doubles every 18 months, so Google's value should double every year and a half, so it'll be worth a dozen times more in a handful of years".

The absence is astonishing. All I've been able to read is well-argued evaluation and skepticism and even crabbiness.

I'm beginning to think this is a sobering demonstration of what happens when the marketing machine is turned off, or even reversed. Google did not make the standard IPO deal, where the investment banks fleece the small investors, generally by lying about the stock (just look at what came out in the scandals). This may not have been for for any particular moral reason. But rather because Google wanted any fool's money for itself, in a *self-fleecing* system via auction.

However, it seems that in retaliation, the moneybags are now dumping on Google. Not a scintilla of hype to be found. It's scary.

Posted by Seth Finkelstein at 11:52 PM | Comments (1) | Followups

July 27, 2004

Google and Supreme Court argument revisited

Walt Crawford has released yet another edition of his library 'zine (not blog) "Cites & Insights", for August 2004. He kindly mentions me, for Google and censorware discussion. I may write more later, but one quick note is apropos today regarding Google. In discussing my examination about the Google silliness of the Free Porn, err, Justice, Department "evidence" in the "COPA" Internet censorware law Supreme Court case, he notes:

These arguments took place in early March 2004. Solicitor General Theodore Olsen, arguing to overturn the injunction, used a web search (probably Google) to illustrate the extremity of "online smut." Type in the words "free porn" and you get a list of 6,230,000 websites, he said: "I didn't have time to go all the way through those sites."

The oral argument transcript confirmed that it definitely was Google being used:

I did the same, this again is outside the record, but I did this, anyone can do this, the same experiment over the weekend. I went to Google and I typed in disable filter and you push the button and you will get a screen full of programs that will tell you step by step how to dismantle the computer so your parents won't know about it. It is that easy, and you can put it back on.

Amusingly - or maybe not - Solicitor General Olsen is also wrong again here. While you can find instruction pages, they're way out of date, and so not exactly good evidence for anything. Don't believe everything you Google on the Web. Even if you're making an argument before the Supreme Court.

Posted by Seth Finkelstein at 11:59 PM | Followups

July 26, 2004

Google IPO price

Google's IPO price is now reported to be $108 - $135 per share (via John Battelle). This is a very high value. The price-to-earnings ratio is given as 329. Comparison: Microsoft - 56, Yahoo - 110, Ask Jeeves - 54.7, large capitalization stock average (S&P 500), about 20.

In the interests of not echoing what everyone else is saying (besides run away), I'll repost a very long mailing-list message below (by someone else) discussing the issues of IPO pricing. Everyone knows that Google's IPO price is going to be irrational. The issue is who is going to capture that irrationality premium - the underwriters or the company itself? Google is running things so that the bubble-juice goes to them, not the money-bags. I suppose it's as good a place as any.

http://www.interesting-people.org/archives/interesting-people/200303/msg00095.html

interesting-people message

Subject: [IP]"A few of my friends still believe it's good news whenever an
IPO skyrockets.It's not-- it's the best indication that people are getting ripped off."

* From: Dave Farber
* To: ip
* Date: Fri, 07 Mar 2003 15:06:31 -0500

------ Forwarded Message From: Peter Wayner Date: Fri, 07 Mar 2003 14:18:34 -0500 Subject: Re: [IP] Dan Gillmor: Quattrone clique disgraced Silicon Valley

Dave, I don't know if you've exhausted the IPO topic, but I spent some time recently working through the IPO process. A few of my friends still believe it's good news whenever an IPO skyrockets. It's not-- it's the best indication that people are getting ripped off.
http://www.wayner.org/modules.php?name=News&file=article&sid=12

Let me know what you think.

_Peter

IPO Fraud

Posted by admin on Friday, March 07 @ 14:14:22 EST

It looks like the SEC and the investment community may be going after Frank Quattrone. Some may see this as a case of no good deed going unpunished because the man brought such a flood of capital to Silicon Valley, but others probably see it as a chance to punish someone guilty for ripping off many people. I can't speak to the specifics of what Mr. Quattrone has done-- that's the job for the government, but I have watched the IPO world for long enough to complain about the process in general. Enough of my friends worked for these so-called hot companies and enough of them have been hurt by the IPO process. It's time that things change.

The big problem comes when the shares soar on the first day. In the past, most people have seen this as some wonderful, hypeworthy event. The hard workers are getting rich. The initial investors are reaping big rewards. Everyone should be happy and the press usually blathers away about the strength of the IPO market.

Alas, that's so far from the truth. Many of the people are getting ripped off because money meant to help the company achieve it's goals is heading for the pockets of insiders. Yes, many of the great Dot Com companies were silly and doomed to failure, but many of them never got all of the capital the market intended for them. We'll never really know what was supposed to happen because the money never reached the front line troops.

I believe that mispriced IPOs that skyrocketed on the first day are a real moral and legal challenge for the technology industry and the capital markets. While I'm sure that the laws have plenty of loopholes that may let everyone escape to a life a leisure, I'm convinced that capital was misdirected by this mechanism and this misdirection was one of the problems that led to some of our major market failures. There are plenty of people out of work because their companies didn't have the capital to complete their business plans.

It's easy to work through the math and discover that much of the money that investors intended to go toward a particular company went, instead, into the hands of investment bankers and their chums. Let me take you through the steps of some hypothetical company, call it ultrametabetaelectrotech or UMBET for short. The bankers at J. Plutocrat and Fils convince the umbet dudes that going public is the right step.

Here are key decision points along the journey:

* 1-- UMBET and Plutocrats file for an IPO including a document listing all sorts of reasons why this is a truly risky venture that no one in their right minds would ever buy.

* 2-- The marketplace ignores all of the warnings in (1) and concentrates on the upside. The marketplace floods the Plutocrat firm with buy orders at anywhere between 20 and 100 a share.

* 3-- The fils at Plutocrat look at this order book and decide that they could sell 1 million shares at $40.

* 4-- The fils at Plutocrat ignore this reality and price the shares at $20.

* 5-- The dudes at UMBET decide to go along with this pricing. The company will get $20 million minus the 7% commission paid to Plutocrat. Each of the managers at UMBET don't really care what the price is because they each have 10 million shares in their pocket. The market will assign a fair price afterwards no matter what the initial price happens to be.

* 6-- The fils at Plutocrat start allocating the 1 million shares to others. They know from their information about the order book that the price will pop up to at least $40/share because of the demand. It might go higher. That means that every share they give someone is probably equivalent to $20 bill. As such, they allocate shares to people in these classes:

* a-- Great guys and gals.
* b-- People they owe money or favors. Politicians have been some of the lucky to get these shares.
* c-- Mutual funds that want to move money between funds.
* d-- Investors who want technology stocks and are willing to pay roughly 30% of the first day return back to the Plutocrats in inflated "commissions".
* e-- Others with malice aforethought.

* 7-- The Plutocrats tell investors that one way to get 1000 shares at $20 is to make a commitment to buy 1000 shares on the first day in the aftermarket. Investment banks like deals like this because it shows "support" for the shares.

* 8-- The IPO day arrives. The shares begin at $20. Investors start buying and run up the price to $40. Everyone is happy. Everyone talks about what a success the IPO is.

Here's why I think this hypothetical IPO was a fraud:

* The company only ended up with $20 million in the bank to support expansion and the pursuit of the business success despite the fact that many of the investors thought it should be allocated $40 million.

* People who bought on the day of issue paid $40/share, but only $20/share went to the company. They lost 50% right off the bat. Other investors flipping their shares walked away with it.

* People who bought in step (7) paid an average of $30/share and only lost $10/share to the "inefficiencies" of the IPO process. They may think they've got a great 33% gain on their investment on the first day. ("Boss, we paid an average of $30/share for this and it now trades at $40." "Simpson, I like the cut of your jib.") This is a charade. There's only $20/share left in the company. They've really suffered at 33% loss!

* The mutual funds in (6c) are robbing the investors in one fund in a family to reward another. These funds often generate outstanding returns in new funds by allocating the underpriced shares to the new hot fund in the family. The investors in the old, tired fund end up with shares priced in the after market.

* The payoff recipients in (6b) are almost sure to recognize the $20/share they receive as a pure gift. If they don't count it as a debt, the Fils at Plutocrat aren't doing their jobs right.

* There may be some truly great folks buying at step (6a), but I'll let the readers decided the probable size of this group.

* The umbet management dudes in (5) just let someone walk away company assets (shares) worth $40 a piece for a price of only $20. If that's not a breach of fiduciary duty, I don't know what is. To make matters worse, some of the lucky folks are called "friends and family".

* The Fils in step (4) ignore good faith offers of a higher price. That sounds like a breach of fiduciary responsibility to me.

Recently, the news stories focus on prosecuting people for spinning or destroying documents. That may be the right legal strategy because the legal system probably protects the underwriter even when coming up with a completely bogus price.

The sad fact is that there is a better way. Bill Hambrecht has been pushing his OpenIPO concept for some time, but Wall Street resists it. Instead of allocating shares to the lucky insiders, the system chooses the people willing to pay the most money. The cash flowing into the company is maximized and the fraud is eliminated.

I've discussed this topic with others in the industry. Some people agree, but others tell me I'm flat out wrong. There needs to be the slop in the system to reward everyone or else people won't buy the shares at all. I don't believe this. It's been well known for some time that people who buy IPO shares in the aftermarket are usually net losers, at least on average. Most savvy investors know that the aftermarket is a sucker's game. The practice of allocating shares to trusted insiders and investment friends was hurting the aftermarket before the crash and now it's completely destroyed the IPO business. There are no IPOs is because no one trusts the mechanism anymore.

Archives at: http://www.interesting-people.org/archives/interesting-people/

Posted by Seth Finkelstein at 03:02 PM | Comments (1) | Followups

June 24, 2004

CMP Media, Google News, blocking report

There's a report, now A-listed, that:

CMP Media is blocking links from Google News. When you click on a link on Google News to a headline on one of CMP Media's technology publications, you get [a block page]

I can't reproduce this. I'm not saying it's wrong, but I can't get it to happen for me. It's of course possible that CMP changed it.

I can get that blocking to happen from a few domains. Anything at domains:

"com.com" (CNET/ZDNet), "cnet.com", "linuxtoday.com".

But not from news.google.com or similar.

It's working off the "Referer:" HTTP header, so any header-removing privacy program stops the effect.

[Yeah, I know I should be cheerleading today on the INDUCE Act. But I want to use my technical skills, that's my strength, and in addition Google is a much less dangerous topic for me]

Update: Seems not to be true, or a brief accident. There's commentary in the cyberjournalist post

Posted by Seth Finkelstein at 11:59 PM | Comments (1) | Followups

May 25, 2004

Nazi blog spam for Google PageRank

[Really! Shades of "Hitler's Doctor's Dog" again]

It turns out that Gary Lauck (remember, an honest-to-God(win) Nazi) has been posting his phony "Rabbi Jokes" protest message to more than the GrepLaw site. He's put the message on several small news sites and spammed by various blog comments. Just do a Google search for the words:

rabbi jokes removejewwatch

Moreover, for the website field in the comment entries, he's giving http://www.removejewwatch.com , to try to make it seem a legitimate activist protest. There's one phrase in the message that didn't appear in the version I saw yesterday:

Internet activists are urged to cut-and-paste this announcement and insert it into forums.

That is, we have a Nazi, masquerading as an anti-Nazi, going around blog-spamming supposed protests of his own site!

I suppose he thinks he'll win either way - if people get angry at the spamming, they'll blame http://www.removejewwatch.com , and if he can stir-up a protest, he'll get attention and status in the Nazi community.

It's common to use anonymity/pseudonymity to attack other people. I've heard of cases of people pseudonymously promoting themselves. But this is the first time I've ever truly seen someone engage in an extensive series of posts pseudonymously protesting himself!

Posted by Seth Finkelstein at 11:59 PM | Comments (1) | Followups

May 24, 2004

"Rabbi Jokes" and Anti-Semitic Google bombing jealousy

"Rabbi Jokes", as a phrase, is now being Google-bombed by Anti-Semites. I've known about this for a while, but didn't want to publicize it in terms of giving the campaign any attention. But it look like it's being promoted now, and in a very strange case of (I used this term literally) Nazi jealousy. Today GrepLaw has the following article, supposedly about "Jew Watch" and more:

posted by mpawlo on Monday May 24, @10:44AM
from the google-hack dept.
Anonymous Coward writes
In April 2004 "jewwatch" incident was widely publicized. When Steve Weinstock entered the search term "Jew" on Google, the world's leading search engine, the openly anti-Jewish web-site http://www.jewwatch.info appeared as the very first listing!
Mr. Weinstock launched a campaign - at http://www.removejewwatch.com - to remove that site from the top spot. He initially succeeded, but the site soon regained the first position before again losing it. It appears to be an ongoing battle.
Since May 2004 Google's #1 listing for "Rabbi Jokes" shows neo-nazi Gary Lauck's site http://www.nazi-lauck-nsdapao.com . The 20-language site has cartoon animations of leading Jewish figures, free nazi computer game downloads, holocaust denial books and the nazi file "The Eternal Jew".

What's wrong with this picture? JewWatch.COM, which was the site, isn't the same as jewwatch.INFO. The site jewwatch.INFO is a mirror run by Gary Lauck. That is, the above description is giving the wrong site. And doesn't the end part sound odd? Like a press release?

It turns out that the poster of this article left his email address. Which is the domain dnsb.org, registered to, drumroll, "Gary Lauck"! That is, the above is a self-promoting article, by one Nazi who is apparently jealous of all the press attention given to another Nazi.

The agendas are becoming farce, if they aren't already.

Posted by Seth Finkelstein at 11:59 PM | Followups

May 18, 2004

Google Ethics Committee

The Google Ethics Committee story has been popular recently (note I'm linked there :-)). Here's something adapted from part of an e-mail I wrote, elaborating on what I think was meant by the phrases used:

"We change PageRank[tm] when we find that spammers are abusing it, but we don't change it often."

By "change PageRank", he's undoubtedly referring to very deep changes to the algorithmic calculations, not specific site censorship. That is, an example of a deep change is what Google did some months ago, during the "Florida Update" spam-fighting upheaval. There was a complicated change to the display scoring system. Roughly, rules were added such that if a site had too many links with just one term, *and* that term was on a spammish terms dictionary, then either (it was unclear, maybe changed) those links wouldn't count for page rank calculation, or the site was marked as a spam site. I believe that's what he's referencing to in the phrase "when we find that spammers are abusing it".

In contrast, the suppression blacklist happens after all the PageRank calculations, and it's just a technically trivial tossing of a URL.

I have no doubt that they've been offered money by some sites to boost those sites' PageRank (and I assume have refused). Similar to the following, revealed last year (oh, tell me again how Big Bloggerdom is a meritocracy).

http://radio.weblogs.com/0001014/2003/07/07.html#a4052

Adam Curry's Weblog
Taking a stand on rss
Time to come clean on an investment I made a year and a half ago. At the time, UserLand software had released a Mac OSX version of Radio and I was totally digging the built in news aggregator. I came up with a cunning plan: I asked Userland if I could purchase a pre-installed feed on their aggregator, which supports RSS xml feeds. I paid $10,000 for a one year license. To date I've been delighted with my purchase and although I haven't checked recently, I'm pretty sure Userland still has me in the defaults.
...
So I'm invoking an age olde american tradition of letting my wallet do the talking. I will again invest $10k in aggregator default placements this year, but I will spread it around, to all developers who adhere to RSS2.0. Include (N)echo and you're out of luck."

[(N)echo was a previous name for the rival format now known as "Atom"]

Posted by Seth Finkelstein at 11:59 PM | Followups

May 05, 2004

JewWatch.com site back at #1 for Google "Jew" search

The JewWatch.com site has returned to the #1 spot on Google for a search for "Jew". This is breaking news, I don't see it reported yet elsewhere. Whatever the technical reasons ... It's b-a-c-k. I've updated my report which discusses these issues:

"Jew Watch", Google, and Search Engine Optimization
http://sethf.com/anticensorware/google/jew-watch.php

Abstract: This report examines issues surrounding the high ranking of an anti-semitic website, "JewWatch.com" for searches on the word "Jew". The search results present complex issues of unintended consequences and social dilemmas.

Posted by Seth Finkelstein at 12:06 AM | Comments (2) | Followups

May 03, 2004

Googlebay!

[Googlebay? Hmm, Google has long registered Googlebay.com!]

Remember when I recently wrote of Google IPO Auction as Ebay, for what that means when something is bid-up? Compare:

"Googlebay" renews dot.com craze
Seven tips before you consider bidding on this hot IPO
By Paul B. Farrell

http://cbs.marketwatch.com/news/print_story.asp?print=1&guid={E70EEA01-C5DF-438B-B428-A7633EFB4B97}&siteid=mktw, also http://www.investors.com/breakingnews.asp?journalid=21002980&brk=1

I don't like to link-and-run, but this one is worth it:

Will you be one of the millions of American investors to cross this line? Will you relapse into the irrational exuberance of the late 1990s? How will you know? You'll know when you bid on Googlebay!
Yes, Googlebay. That's my nickname for the new Web site that will handle bids for the upcoming Google initial public offering. It will launch soon ...
Absurd P/E ratios and silly valuations
Stop, dammit! Listen to yourself! It's happening again! Get a grip!
Not one of America's 94 million long-term buy-and-hold investors with an ounce of self-respect and a brain in their heads should participate in this idiotic bidding process.

Posted by Seth Finkelstein at 10:01 PM | Followups

April 30, 2004

Google IPO reading round-up

Goo-goo-g-g-g-g-l-e. Everyone has something to say, so why should I duplicate what's being done elsewhere? Here's pure aggregation from what I've been reading.

The Source: Google's SEC filing

Best business coverage: Gordon Smith / Venturpreneur (I recommend reading through it all)

Best coverage not repeated in a dozen ways elsewhere (and source of the above): Aaron Swartz / google.blogspace.com

Best text summary: John Battelle's Searchblog

While my summary of the letter may sound negative, it's my honest and initial response: to me, the letter comes off pretty strong, and likely will anger many on Wall Street. But I have to commend the founders for sticking to their beliefs, and using the IPO as something of a megaphone/soapbox. It is brave, unique, and rather commendable to very publicly state that the founders are controlling the company, and the founders will decide what is best for Google, not Wall Street. They've set themselves a very high long-term bar, claiming they will best the system, in essence. I think it will be very interesting to see how Wall Street responds. There is a chance, in the end, that the Street will feel slighted, and turn its back on the company.

Best bells-and-whistles summary: Danny Sullivan / searchenginewatch.com

Google Worries
Google's filing is full of many standard things that investors might be warned about. The company even addresses the issue of privacy concerns about its Gmail system or the controversy over anti-Semitic site ranking at the top of its results for the word "jew" might hurt its brand.

Funniest coverage: Andrew Orlowski: Google files Coca Cola jingle with SEC

"We'd like to build the world a home," write co-founders Sergey Brin and Larry Page. "And furnish it with love. Grow apple trees and honey bees, and snow-white turtle doves." The unconventional sentiments will puzzle Wall Street analysts, but delight Google's teenage fans - and children of all ages who make up its most ardent users.
"We'd like to teach the world to sing," they plead. "In perfect harmony."
We made that up, of course. But the real "Letter from the Founders" that introduces today's 26-page filing borrows as much from The New Seekers as it does from Warren Buffet.

Update: Best reverse-engineering of Google from the documents: Tristan Louis / TNL.net

Just for the sake of argument, let's go with 1 Gigaflop per processor. This means that the Google supercomputer has about 189 teraflops of power on the low end of my estimates, 253 teraflops on the middle end, and 316 teraflops on the high end. This would easily put it on top of the list of fastest computers in the world.
Any way you slice it, that's a lot of power.

Posted by Seth Finkelstein at 10:17 PM | Followups

April 29, 2004

Google IPO Auction as Ebay

Google IPO is here! Is there anything else in the tech world to discuss today?

One thing I've been wondering about in terms of the auction process, is that its effect may just be a way of differently allocating the inevitable irrationality. There's this gem in Google's SEC filing:

The auction process for our initial public offering may result in a phenomenon known as the "winner's curse." At the conclusion of the auction, bidders that receive allocations of shares in this offering (successful bidders) may infer that there is little incremental demand for our shares above or equal to the initial public offering price. As a result, successful bidders may conclude that they paid too much for our shares and could seek to immediately sell their shares to limit their losses should our stock price decline. In this situation, other investors that did not submit successful bids may wait for this selling to be completed, resulting in reduced demand for our Class A common stock in the public market and a significant decline in our stock price. Therefore, we caution investors that submitting successful bids and receiving allocations may be followed by a significant decline in the value of their investment in our Class A common stock shortly after our offering.

Remember what happens to items which get bid-up on Ebay during auction fever ...

Posted by Seth Finkelstein at 09:40 PM | Comments (1) | Followups

April 28, 2004

Hate-Site: "Proposal to place Jew Watch back to #1 on Google"

Anti-Semites are calling for a Google-bombing campaign to re-raise the Google-ranking of the "Jew Watch" hate-site. The discussion can be found at the following URL ("Stormfront White Nationalist Community")

http://www.stormfront.org/forum/showthread.php?t=129666&page=1&pp=10

(Not a link, for obvious reasons)

I would like to make a proposal to fellow Stormfront members in an honorable eActivism effort. ...
Please, this is an honorable thing to do, it takes a little effort, but with all the support of [Stormfront] and it's friends, we can do this to help spread the truth ! The Jews have the power of the media, so we need to take it back from them and defeat them. Out of 3.7 million pages that come up from typing the word "Jew" into Google, Jew Watch *used* to be #1 !! Let's take it back !

And another poster, in part:

They've used Blogs against us, if you know any friends that participate in "blogging" simply advocate them to our cause. This ball will get rolling.

Whenever anyone talks of Google representing the opinion of the web, it's important to keep in mind that such an opinion may represent only the activities of special-interest groups.

Posted by Seth Finkelstein at 11:59 PM | Comments (1) | Followups

April 26, 2004

JewWatch.com homepage back, for now

As I write this, the home page for jewwatch.com is back as the top page for a Google search for the words: Jew Watch. The homepage is also back in a search for the word Jew, but only at result #38.

But the homepage doesn't have a fresh date on it. And I can't find the new mirror at http://www.nazi-lauck-nsdapao.com/jew-watch/index.htm

Hmmm .. hard to say with this means. Older datacenter? But there's recent results. Older basic data plus a few fresh result?

Sometimes, it's very difficult to figure out what's going on. When I used the terms "an interesting combination of malice and stupidity" to describe Google's answers about policy, some people took offense. But again, it's very much a case of often they really don't know themselves (because a good answer requires a deep understanding of the ranking algorithm and system details), but they can't admit that, and even if they did know, they wouldn't say.

Public Relations isn't technical support.

[Technorati-bait: Anti-Semitic site drops off Google]

Posted by Seth Finkelstein at 11:59 PM | Comments (2) | Followups

April 25, 2004

JewWatch.com updates - Nazi mirror site

Gary Price, who runs The ResourceShelf tipped me that the first result for the words Jew Watch is now a JewWatch.com mirror. (easy come, easy go, for that top slot ...).

This is actually extremely interesting from a Google-analysis aspect, as though the page is on an existing Neo-Nazi site, it's a new page. The site, http://www.nazi-lauck-nsdapao.com/, currently proclaims

"The educational web-site http://www.jewwatch.com is under attack by the enemies of free speech. Free speech activist Gerhard Lauck is trying to help them. An (at least partial) mirror web-site has been established at http://www.jewwatch.info and at http://www.nazi-lauck-nsdapao.com/jew-watch/index.htm

[note: Neo-Nazis apparently have an inferior webmaster-race, since making a full mirror is not difficult these days.]

So, we have the following critical point: The first-place rank of that mirror page for a seach for the words Jew Watch, cannot be caused at the moment by any links. It exists solely because of search engine optimization factors (which do include the freshness of a page).

Posted by Seth Finkelstein at 12:40 AM | Comments (3) | Followups

April 24, 2004

"Jew Watch" Update: FRONT PAGE of site on blacklist (or not?)

I've updated my report with a new development, breaking news:

"Jew Watch", Google, and Search Engine Optimization
http://sethf.com/anticensorware/google/jew-watch.php

Google has now added the front page of the JewWatch.com site, that is, the url http://www.jewwatch.com/, to their internal blacklist. blacklist. The site itself has not been removed from Google's index. However, now the front page of the site will never appear in any Google search. So that front page will be gone when searching for the word "Jew"

This effect can be verified by doing a search for the words: Jew Watch. Ordinarily, the front page of the JewWatch.com site would appear in the top spot. But currently, other site pages appear further down in the search results. And in a wonderful twist of fate, since the site front page has been suppressed, [my] writing is in the top spot for that search! ("subtleties of language" can lead to unintended consequences ...).

[See also the coverage at searchenginewatch.com]

UPDATE 4/24 3:25pm: According to Google, the homepage is empty now, not because of blacklisting, but because the site was down for a time while changing servers.

As relayed by tripias.com: "Director of Corporate Communications David Krane replied to my initial email within a matter of hours (actually in the middle of the night on a Friday night, surprisingly), and had this to say:"

"No, Google did not blacklist or make any other manual change to intentionally remove the jewwatch.com website from our index. It does not currently appear in Google's search results because the website was offline for a number of days last week. In our most recent crawl of the web, we were unable to reach the jewwatch.com website, therefore it was not included in our index. Now that the site is back up again, it's likely that at some point soon, jewwatch.com will re-appear in Google."

Danny Sullivan at searchenginewatch.com has a similar update.

[Update 4/24 6pm: Tripias has the scoop that Jew Watch's New ISP to Stay ]

Posted by Seth Finkelstein at 01:15 AM | Followups

April 22, 2004

"Jew Watch", Google, and Search Engine Optimization

[The "Jew Watch" site is back, as I predicted - I've written up the issues in a new report]

"Jew Watch", Google, and Search Engine Optimization
http://sethf.com/anticensorware/google/jew-watch.php

Posted by Seth Finkelstein at 06:16 AM | Followups

April 21, 2004

Google Gmail Bill Now In CA Senate

[Scoop? I don't see this anywhere I've searched.]

The Google Gmail legislation proposed by California State Senator Liz Figueroa has now been formally introduced and released.

Link to Liz Figueroa press release:

FIGUEROA INTRODUCES BILL TO STOP GOOGLE FROM SECRETLY "OOGLING" PRIVATE E-MAILS
http://democrats.sen.ca.gov/servlet/gov.ca.senate.democrats.pub.members.memDisplayPress?district=sd10&ID=2102

SACRAMENTO - Responding to a world-wide outcry from privacy advocates, Senator Liz Figueroa (D-Fremont) today introduced a bill that would forbid Google from secretly scanning the actual content of e-mails for the purpose of placing targeted direct marketing ads. Instead, the Internet giant would be required to obtain the informed consent of every individual whose e-mails would be "oogled."

Link to text of Google Gmail legislation: http://democrats.sen.ca.gov/servlet/gov.ca.senate.democrats.pub.members.memDisplayBillDetail?district=sd10&bill_number=sb_1822&sess=CUR&house=B&site=

[Update 4/22 5:25pm Note there's some sort of legislative maneuver being used here, by modifying an older bill. Make sure to follow the "Amended" link above.]

Posted by Seth Finkelstein at 10:53 PM | Comments (3) | Followups

April 20, 2004

J-e-w-w-a-t-c-h.com, Act II coming up

I've been following the issue about the anti-semitic site and the high ranking it has on Google. Fox News ran a story on it today: Google in Middle of Anti-Semitic Flap (amusingly, my site traffic spiked through the roof the moment it hit the air, guess why :-). For a while, I was trying to figure out what was going on, but no harm done).

So far, a lobbying effort has gotten the site taken off its hosting company. I predict this is a sure prelude to "Act 2" of any censorship-style drama, where the site comes back with a new host and causes yet another wave of articles on the issue. You heard it here first.

As stated in one recent message dated April 19:

According to reports we've received, frustrated by their failure to shut us down through cyber-attacks, the censors began putting pressure on "Everyone's Internet" -- threatening to do severe damage to the economic interests of the firm and its clients if the 'offending' sites were not removed. So, we're moving to a new server -- and concurrently making plans to make our sites more nearly impregnable to both cyber-attacks -- and censorship -- in the future.
The efforts to shut down jewwatch.com are a matter of public record -- even Google has apologized (on its main search page[!], if you enter the word 'Jew') -- under this pressure, and the censors' efforts expanded when they discovered that nationalvanguard.org was truthfully covering the Jew Watch saga in its pages -- and was located at the same server facility.

Stay tuned ...

Posted by Seth Finkelstein at 11:59 PM | Comments (1) | Followups

April 15, 2004

Wall Street Journal on Google and JewWatch.com

[Update 4/22: New report:

"Jew Watch", Google, and Search Engine Optimization
http://sethf.com/anticensorware/google/jew-watch.php

]

The "Jew Watch" and Google controversy was in the Wall Street Journal section OpinionJournal - Best of the Web Today, and I received a link (thanks) - though for my earlier report, on "Chester":

Again, we were initially inclined to accept Google's explanation, but then we noted the New York Times' report that at the request of officials from Chester, England, Google had removed a page called "Chester's guide to molesting young girls" from its search results. Several readers faulted us for not noting part of Google's explanation for that change: that Google had "removed sites from its rankings that promote pedophilia, which is illegal."

This explanation, however, looks to us rather disingenuous. For one thing, although sexual relations with children obviously are illegal, "promoting pedophilia" probably is not. It's also not clear that the offending page--actually titled "Chester's guide to picking up little girls"--really was promoting pedophilia. According to "Chester's Guide to Molesting Google," which now is the first hit on a Google search for "Chester Guide," it was a satire. Having looked at the page, we tend to agree--though the satire is so unspeakably vile, we refuse to link to it.

It turns out, further, that Google has not removed the "Chester's guide" from its search engine altogether; it comes up at the top of a search for the phrase "picking up little girls."

Ah, that last point is notable. When there's a censorship blacklisting, Google doesn't remove material, it removes references (URL's). If the same material appears at a different location, Google will take no action until it receives a specific censorship directive. This is in contrast to search engine spam, which Google tries to remove as much as possible and even pre-emptively.

One thing I've found, is that Google's answers about policy have to be read very carefully and skeptically. It's an interesting combination of malice and stupidity. That is, someone might be trying to deflect you about something they don't really understand in the first place!

Posted by Seth Finkelstein at 11:59 PM | Comments (2) | Followups

April 10, 2004

"Mesothelioma", lawyer games, and Google

"Mesothelioma", a form of lung cancer induced by asbestos exposure, is apparently a top selling ad keyword. It seems there's an eBay-like effect where prices are bid up, at least according to a recent article making the rounds, on search engine ads and Mesothelioma lawyers. And this leads inevitably to "interesting" search engine optimization:

The high price of mesothelioma ads has had some unintended consequences as firms try other means to land mesothelioma patients. In particular, some firms are attempting to boost their Web sites' spot on search engines' so-called algorithmic, or nonpaid, listings by tweaking the content and links to get a higher ranking. These efforts can include using the desired keywords (like "mesothelioma") frequently near the top of their home page, and including them in the Web address.
Due to these efforts, eight of the top 10 nonpaid listings in a recent Google search of "mesothelioma" were for sites sponsored by law firms, pushing down nonlawyer sites such as the National Cancer Institute. By comparison, a search for "cancer" a tamer ad category produces the American Cancer Society as the top nonpaid result.

Yup. The National Cancer Institute's Mesothelioma: Questions and Answers page is around #12 for a Google search.

I've said this before, but I should make the following into a catch-phrase:

Google ranks popularity, not authority

Posted by Seth Finkelstein at 11:59 PM | Comments (7) | Followups

April 08, 2004

Google Gmail TOS vs. The Bloggerdom A-List

I've been trying to figure out why, basically, John Gilmore's screed on the Gmail terms-of-service, is getting such extensive notice from the bloggerdom A-list. So far I've seen it echoed by Dan Gillmor and Dave Winer, and from Boing Boing. Part of it is of course that the A-list echoes the A-list. But I don't think that's the complete answer. Now, John Gilmore does not care what I think, but his analysis is, well, weird. It's not so much that it's wrong, exactly, but that it reads if he's trying to hype-up a tone of THIS IS AN OUTRAGE, against terms-of-service clauses that have been around for many years. It may in fact be an outrage, but he - and the A-list - just discovered it all? It's as if, my god, are you sitting down, are you ready to hear this: Software-makers say they license their products, not sell them! And they claim shrink-wrap is a binding contract. Tell the world, let the protests begin.

Let's see what we could do in a similar vein with the plain old (not Gmail) Google Terms Of Service

We may modify or terminate our services from time to time, for any reason, and without notice,

[Parody] How arrogant! They say they can do what they want, when they want, however they want! Isn't this EVIL?

including the right to terminate with or without notice, without liability to you, any other user or any third party.

[Parody] And they can cut you off even if they don't like your face, too!

We reserve the right to modify these Terms of Service from time to time without notice. Please review these Terms of Service from time to time so that you will be apprised of any changes.

[Parody] Hey, get a load of this, they then expect YOU to keep up with their arbitrary and capricious changes!

And so on.

Now, why? This is the kicker.

I think what's happening is that Google's image is changing among (some) A-lister's. It's a way of dealing with the fact that Google may not be God after all.

Posted by Seth Finkelstein at 08:32 PM | Comments (5) | Followups

April 07, 2004

"Big Brother nominated for Google Award"

The Register has a funny/serious article by Andrew Orlowski: "Big Brother nominated for Google Award"

I'll say no more, since I'm quoted (thanks!):

"What seems to be missed is that the sheer scale of centralization of Google's service is frightening," writes Seth Finkelstein, who dubs it 'Total Information Awareness', after the DARPA data collection project led by convicted Iran Contra felon John Poindexter. "Every message you send, every message you receive, in ONE PLACE, tagged and sorted and indexed, with a history of who sent it to you and who you sent it to (traffic analysis!). And correlate it all with your web-searching, and your social network (Orkut) and your shopping (ads).

I was thinking of how to explain the problem of Gmail to people who conceive of it as simply another web-mail reading service. I came up with this:

Imagine Google-ing your mail. Great! Now imagine John Ashcroft Google-ing your mail. See the problem?

Posted by Seth Finkelstein at 01:35 PM | Comments (5) | Followups

April 03, 2004

Total Google Awareness

[This was accepted to the interesting-people list, replying to a news story regarding "Google's E-Mail Strategy Criticized"]

> ...Google records the numerical Internet addresses of the computers that
> request each of the Web searches the company performs. But it hasn't had
> names or other identifying information to link those addresses to specific
> people and learn who, for example, is searching for "Janet Jackson halftime
> show."

Not on a mass scale, no. But ... Orkut.

http://www.orkut.com/privacy.html
Orkut's privacy policy

"We may share both personally identifiable information about you and aggregate usage information that we collect with Google Inc. and agents of orkut in accordance to the terms and conditions of this Privacy Policy."

Google will now have the equivalent of a "mail cover" (tracking who is sending email to whom - and about what!), plus "friends" social data, plus web searching data ...

And they're going to scan all your mail for keywords, in order to better fight terrorism, I mean, serve ads ...

Forget about "Total Information Awareness". It would be cheaper for the Federal government to just buy into Google.

Posted by Seth Finkelstein at 09:13 AM | Comments (5) | Followups

April 02, 2004

Google Gmail and privacy

Google's Gmail service announcement is the buzz of the day. As I read about it, I can't help thinking:

Y'know, this is really scary.

Most of the articles I've seen take the view that, well, there's already credit-card companies and cellphone call records and all sorts of data collections already, etc. etc.

But what seems to be missed is that the sheer scale of centralization of Google's service is frightening. Every message you send, every message you receive, in ONE PLACE, tagged and sorted and indexed, with a history of who sent it to you and who you sent it to (traffic analysis!) ...

And correlate it all with your web-searching, and your social network (Orkut) and your shopping (ads).

From one point of view, this is great, think of the technical tricks that can be done with the data. From another point of view, this is a tracking horror waiting to happen, think of the technical tricks that can be done with the data.

I can recite the Libertarian line by heart, so save me the stock noise - it's a private company, it's not the government, don't like it don't use it, stop thinking about it.

But if the US Post Office offered such a service - or even if Microsoft offered such a service - I suspect that people would be willing to think more about it.

Posted by Seth Finkelstein at 06:38 AM | Comments (2) | Followups

March 30, 2004

"Jew Watch", Google, and Evil

[Update 4/22: New report

Jew Watch, Google, and Search Engine Optimization
http://sethf.com/anticensorware/google/jew-watch.php

]

Search "Jew"

As noted by http://www.jewishjournal.com/home/preview.php?id=11998 (via JOHO the blog):

"Online searchers punching the word "Jew" into the Google search engine may be surprised at the results they get.
In fact, the No. 1 result for the search entry "Jew" turns out to be www.jewwatch.com. The fanatically anti-Semitic hate site is ranked first in relevance of more than 1.72 million Web pages."

Hate groups are learning search engine optimization. That ranking is no accident.

The No. 1 ranking of Jew Watch came as a surprise to David Krane, the director of corporate communications for the San Mateo-based Web giant.
Such a page might not pop up for Google searchers in European countries, where Holocaust denial is illegal. But Krane adamantly stated that Google has no plans to manually alter the results of their ranking system to knock Jew Watch from its top spot.

Yup (to all).

Do a German search for "Jew", or French search "Jew", the hate site is not there. For exactly the Google censorship reason noted. This is well-known, from the first "Localized Google search result exclusions" report by Benjamin Edelman and Jonathan Zittrain.

But it's a legal site in the US, full protected under the First Amendment as political speech.

This is an excellent example for a many points I made, but in specific:

Google ranks popularity, not authority

Posted by Seth Finkelstein at 11:58 PM | Comments (58) | Followups

March 29, 2004

Google, image searching, and censorware circumvention

[I wrote this letter about news article regarding students using Google image search as a means of circumventing censorware]

Dear Annalee Newitz

I read with great interest your story on Google, censorware, and image searching, as a school censorware problem, at:
http://www.alternet.org/story.html?StoryID=18213

I've published much work about the issue of Google image searching and similar sites being a "loophole" for censorware. It even was referenced in the expert reports in the District Court decision on library censorware (unfortunately, it has been extremely poorly publicized and otherwise unreported). See, for example:

District Court CIPA decision
http://sethf.com/pipermail/infothought/2002-May/000010.html

BESS's Secret LOOPHOLE: (censorware vs. privacy and anonymity)
http://sethf.com/anticensorware/bess/loophole.php

BESS vs The Google Search Engine (Cache, Groups, Images)
http://sethf.com/anticensorware/bess/google.php

BESS vs Image Search Engines
http://sethf.com/anticensorware/bess/image.php

The Pre-Slipped Slope - censorware vs the Wayback Machine web archive
http://sethf.com/anticensorware/general/slip.php

But I noted one major error in your article, in this part:

> The second problem, which is strictly laughable, is that regular
> Google also has caching. When I recently did a Google search (not an
> image search) on "hot naked babes," I was able to retrieve images of
> naked people from the cache.

I don't think this is what happened. It just seemed that way. What really happened is that when you retrieved the text page from the Google cache, it had within it, image links to the naked people pictures at the non-Google sites. Since your computer was not censorware'd, you were able to retrieve those images. But again, that wouldn't have worked in the case where censorware prevented you from viewing anything on the non-Google image sites.

Note, however, the retrieval would work the way you described, with the Wayback Machine web archive:
http://www.archive.org/

Perhaps that will be the next site to become popular with students, and then prohibited.

Posted by Seth Finkelstein at 11:59 PM | Comments (1) | Followups

March 26, 2004

Google-bombing cannot be defused trivially

There's a proposed Google-bombing solution in the article "Five-domain Googlebomb explodes in boardroom":

"An easy fix for many bombs," explains Brandt "Google should not use terms in external links to boost the rank of a page on those terms, unless those terms are on the page itself. This is a no-brainer. But it means another CPU cycle per link, which is why Google won't do it."

Unfortunately, I have to disagree here. It's not so simple. In fact, the way it works now is ultimately the Right Thing from a technical point of view, in terms making relevancy inferences from a simple algorithm.

One nontrivial reason is misspellings. If many people make the same spelling error in linking (such as turning "Dan Gillmor" into "Dan Gilmore"), it's useful to return that linked page for the search, rather than ignoring it since the wrong spelling likely won't be on the target page.

There's also issues with robots.txt. The robots.txt file isn't for privacy, it's just an advisory to have search-spiders work more efficiently (think of how ill-considered it would be, to have a public file listing material which should not be viewed - "Do Not Look Here"). If the site doesn't want spidering, but many people link to it with certain words, it seems a reasonable thing to return that site for those words. The option of not returning the site isn't necessarily right, because sites often just use robots.txt to avoid the load of being spidered, rather than to hide in any way.

Many issues with Google, or any complex search system, are more subtle than they might appear at first glance.

Posted by Seth Finkelstein at 11:59 PM | Followups

March 25, 2004

I can bomb that Google in ...

The Register has an article today, "Five-domain Googlebomb explodes in boardroom", talking about connecting the phrase "out of touch executives" to Google.

As I've noted, e.g. in discussing the miserable failure Google-bomb, the key concept is the confusion between popularity and authority. But I'm not sure how far this can be pushed in terms of giving relevance to obscure phrases.

Perhaps an experiment is in order, to demonstrate a principle.

Let's say I linked a certain phrase "EBig EBrother" to somewhere (such as Google ...). I've used an uncommon phrase here, so as to make it easy. The words "Big Brother" have many hits, but there's no occurrence of "EBig EBrother". Well, there wasn't until this post gets indexed.

What happens?

Posted by Seth Finkelstein at 11:59 PM | Followups

March 03, 2004

Free porn, Google, spam, Internet censorship, and the Supreme Court

[Yes, this post really seriously concerns *all* the topics listed, it's truly that _tour de force_]

The Supreme Court just heard arguments on another Internet censorship law, "COPA", ( Ashcroft v. ACLU, 03-218). The Boston Globe reported:

Ordinarily, US Solicitor General Theodore B. Olson prepares for an appearance before the Supreme Court by acting out his argument before a pretend court. This time, for a case about the Internet, he added a new twist: searching online for free porn.
At his home last weekend, Olson told the justices yesterday, he typed in those two words in a search engine, and found that "there were 6,230,000 sites available."
The top lawyer who represents the Bush administration before the Supreme Court said the search's results illustrate how pornography on websites "is increasing enormously every day," a central point in his argument for saving an antipornography law that was enacted six years ago but has yet to go into effect.

Now, let's do something often unrewarded in this world - think. What search did he do exactly? It seems to be the following search in Google:

http://www.google.com/search?q=free+porn

That gives me now "about 6,320,000" results, close enough, the total number returned often varies a bit.

Now, what that search means is roughly the number of pages containing the words "free" and "porn" anywhere in the entire page (or links with those words). This blog entry will qualify as one of those results as soon as it is indexed. I don't think this blog entry is proof of how pornography on websites "is increasing enormously every day,", much less the need for an Internet censorship law.

I've written about the problems of Google and stupid journalism tricks before. But, sigh, nobody reads me, so this won't get reported. Anyway, the story gets even better.

I started digging down into the results to see if I could find some non-sex-site mentions before the Google 1000 results display limit (Yes, Mr. Olson, there are more than 1000 sites devoted to sex in the world, that's true). Google's display ~~crashed~~ stopped in the high 800's! That is, displayed at the bottom, for:

http://www.google.com/search?q=free+porn&num=100&start=900

In order to show you the most relevant results, we have omitted some entries very similar to the 876 already displayed.
If you like, you can repeat the search with the omitted results included.

The number varies, but it's been under 900.

Joke: Hear ye! Hear ye! Instead of "6,230,000 sites available", there's really uniquely less than 900! At least, according to Google.

Now, this is the Google display crash from bugs in the Google spam filtering. Google has cleaned-up their index so the crash is not happening on the first screen of results. But it's still in their results display code. Usually, people don't see the bug in practice, since the crash has now been pushed very far down in the sequence of results.

But here I had a reason to go looking out as far as I could, and ran into the crash in a bona-fide real-world situation. Not just a trivial query too, but one with profound implications for Censorship Of The Internet.

[Update 3/4: Michael Masnick brings to my attention that what I thought was the old Google spam crash is now reduced to duplicate-removal processing on the 1000 results display limit - the point is still that I can use fallacious superficial search "logic" to assert there's less than 900 sites, because Google "says" so. But the technical reason is not quite what I wrote originally]

Humor: If the evidence from a Google search was good enough to be used to justify censorship when it said "6.2 million", why isn't it good enough to justify no censorship if on further investigation it says less than 900? That is, if you thought it was valid before, with a big number, why isn't it valid now, with a small number? (garbage in, garbage out)

Look at me, I'm a journalist (or grandstanding lawyer) - Google says there's no practically no porn on the net!

Posted by Seth Finkelstein at 09:52 AM | Comments (9) | Followups

February 22, 2004

Ralph Nader "Meet The Press" candidacy announcement escapes via Google News?

Folks, do the following search on Google News:

source:alternet Nader "Meet The Press"
http://news.google.com/news?hl=en&ie=ISO-8859-1&edition=us&q=source%3Aalternet+Nader+%22Meet+The+Press%22

At 2:32 am EST Sunday, I get:

A Risk-Free 'Nader' in 2004 AlterNet - 57 minutes ago ... And Ralph Nader's candidacy as an Independent, announced on Meet The Press Sunday, only lessens the chances of success in November. ...

Trying to click through to the story, I get:

The story you have selected is only available to AlterNet Syndication Clients. If you are already a client, please sign in below.

Hmm ... it wasn't much of secret before, but it sure isn't now!

Posted by Seth Finkelstein at 02:38 AM | Followups

February 18, 2004

Google and stupid journalism tricks ("Lies, Damned Lies, and Google")

There's an interesting taking-to-task of lazy journalism in:

"Lies, Damned Lies, and Google"
http://www.mediabistro.com/articles/cache/a1217.asp
with, sadly, a few error itself. First, some goofs:

What's more, as you might remember from December news reports, the phrase "miserable failure" for a while directed searchers to the White House home page, and "French military victories" brought up zero pages.

The "miserable failure" Google-bomb went primarily to the "Biography of President George W. Bush" page, not the White House home page. But a howler, the "french military victories" Google-bomb never returned zero pages.. The top page was a joke which claimed there were zero pages, and the punchline was the suggestion
"Did you mean: french military defeats"?

A deeper flaw which caught my eye, is that all throughout this article, many reporters don't seem to realize that a search for words without quotes, is significantly different from searching for words as a phrase, i.e. with quotes. Given several words, Google will rank highly the results with the words next to each other, returning them at the top of the list. This seem to have misled many people at to what they're doing. That is, searching hot dog is not the same as "hot dog". The former is roughly any page with the words "hot" and "dog" related to it, while the latter is the phrase "hot dog" (this is an approximate description).

So many of the number reported are utterly and completely meaningless. They don't even do the silly measure of the phrase the journalist thinks they measure. That is, the journalist might believe they are doing something tangentially related to frankfurters by searching for the phrase "hot dog" (neglecting use as e.g. a surfing term or different product). But in fact, they're searching for everything up to "It was a hot day, my dog was unhappy".

The Spokesman Review, in Spokane, Washington, confirms that the phrase "build backyard ice rink" yields 5,400 Google hits. ... If you're Canadian and stuck on the wrong side of the border without proper ID, don't worry, Google will save you, reports the Canada's Times Colonist; the phrase "permanent resident cards CA" will bring you to a "staggering" 92,200 sites on the subject.

NO. The phrases return zero or a few hits. The words return that many hits, but having lot of pages with the four words "permanent" "resident" "cards" "CA" somewhere on them, is not "staggering".

Sigh. Flash - journalists write nonsense. Not news at 11.

Posted by Seth Finkelstein at 11:59 PM | Comments (1) | Followups

February 14, 2004

Royal Caribbean, Oceana, and Google Ads

"Royal Caribbean" is a cruise company, which is being criticized by the environmental group Oceana. Apparently, Google pulled advertising of the criticism. Quoth Oceana's press release:

Last week, Oceana placed two advertisements with Google, the first describing Oceana's mission and linking to the organization's website, www.oceana.org, the second focusing on Oceana's well-known campaign to stop cruise pollution. Google removed the ads after two days, citing the cruise pollution ad for "language that advocates against Royal Caribbean," and the general ad for using "language advocating against the cruise line industry and cruisers." Google's public editorial guidelines, however, make no mention of any such specific prohibition, stating only that the company reserves the right to exercise editorial discretion when it comes to the advertising it accepts.
"To exercise editorial discretion is one thing, but to stifle a message that the public needs and deserves to hear based on some secret criterion is quite another," said Sharpless. " ...

Now, Google doesn't really have a secret criterion. Just a policy which leaves a lot of room for "interpretation". Frankly, I'm a bit taken back that this story has attracted so much coverage and interest - the Oceana PR seemed to have worked well. This is by far not the first time Google has pulled an ad. There's cases such as:

Blather.Net and George W. Bush

Who would Jesus bomb?

Anita Roddick and the "vomitous worm" story

And LittleCubeNews.com on McDonald's , Quatloos.com , SeeYaGeorge.com , etc.

Anyway, Oceana's now getting more exposure than they ever would from the ad. And there's other ways to make the point (I wonder how high this post will rank on a search for: Royal Caribbean)

Posted by Seth Finkelstein at 11:59 PM | Comments (4) | Followups

February 06, 2004

Google, Orkut, Personal Data Sharing, and Privacy Policy

There's an interesting _Register_ article by Andrew Orlowski "Google revives discredited Microsoft privacy policy for Friendster clone" which discuss the privacy policy of the social site Orkut. Jeremy Zawodny started the issue, asking Why Google needs Orkut. He (Zawodny) speculates:

Let's assume that Google internationalizes Orkut and lets it run to the point that it has millions of users registered and active. That's not an unreasonable thing to expect. Then, one day down the road, they quietly decide to "better integrate" Orkut with Google and start redirecting all Orkut requests to orkut.google.com.
Bingo!
Suddenly they're able to set a *.google.com cookie that contains a bit of identifying data (such as your Orkut id) and that would greatly enhance their ability to mine useful and profitable data from the combination of your profile and daily searches.

Why the conjecture?

Has anyone pointed out yet that Orkut outright says that it "may" share information with Google?

Orkut's privacy policy states: (emphasis mine)

We may share both personally identifiable information about you and aggregate usage information that we collect with Google Inc. and agents of orkut in accordance to the terms and conditions of this Privacy Policy. We will never rent, sell, or share your personal information with any third party for marketing purposes without your express permission.

Part of the "in affiliation with Google" obviously means that Google is not considered a third party. In fact, later on (emphasis mine):

Personal information collected on this site may be stored and processed in the United States or any other country in which orkut.com or Google Inc. or agents maintain facilities, and by using this site, you consent to any such transfer of information outside of your country.

Whatever the eventual result, the data-sharing connection sure isn't unclear.

Posted by Seth Finkelstein at 11:58 PM | Followups

February 04, 2004

Google, Joe Trippi, and me - or, I Want To Start A Google-Bomb!

Rather than continuing the political punditry, let me segue into an intriguing Google result my recent posts have generated.

It turns out that, right now, my post Howard Dean, Joe Trippi, and Bubble Valuation has the #9 position for a Google search for the phrase "Joe Trippi" .

I'm not worthy!

Or am I? :-)

This intrigues me. FastCompany.com, MSNBC.com, CNN.com, SethF.com ...
Which one of these is out of place? (Granted, there's a joke here about it not being difficult to be better than the media in terms of reporting, but still ...). I assume it won't last, but I wonder how long I'll be on the top ten page there.

I'm NOT A-List. I'm at the 100-150 readers level (I also get more readers sometimes from Google searches than I do directly!). Hmm ...

Set Us Up The Bomb.

No, not linking "Joe Trippi" to my post Howard Dean, Joe Trippi, and Bubble Valuation. That's mean. Rather, I'm wondering just how much Google can be gamed (apologies, it's in the name of science).

Proposal: Link the term "bubble" to "www.deanforamerica.com". As in, the bubble which is/was the Dean campaign. This is a bit different than ordinary, as "bubble" is not an obscure term. It also seems to be a term on Google's dictionary for its Bayesian spam filter, so that's a confounding factor. On the other hand, people might actually do it, a key ingredient in a successful Google-bomb.

The power, the power ...

[I was going to put a screenshot of the search results here, since someone said I needed pictures, but it's a very boring picture, especially to devote a whole screen. Maybe I need to get a cat for my blog.]

Posted by Seth Finkelstein at 11:59 PM | Comments (0) | Followups

January 23, 2004

"Miserable Failure" and Google

The "Miserable Failure" Google-bomb, of linking the keywords "Miserable Failure" to the White House page "Biography of President George W. Bush", was recently covered in the New York Times. It's now becoming a tactic in political campaigns.

Google-bombing, as I think of it, demonstrates the conflict between *popularity* and *authority* for search engines. As the article notes: about searching "miserable failure":

The more high-traffic sites that link a Web page to a particular phrase, the more Google tends to associate that page with the phrase - even if, as in the case of the president's official biography, the term does not occur on the destination site.

It's an illustration of many people repeating something (popularity) for purposes of having it accepted as meaningful (authority). This leads to obvious concerns as to just how much neutral authority can be corrupted by partisan popularity (note this assumes for the sake of discussion that course there's a neutral authority in the first place - a very arguable assumption). To wit (the link below is my own, for humor):

Google plays down the significance of Google bombing, saying the search results merely reflect what is actually happening on the Web.
"We're only seeing it with obscure queries where there's really not that much action on the Web about them," said Craig Silverstein, Google's director of technology. "I don't think it's possible to do this sort of thing on queries with well-defined results like I.B.M.' So given that it only affects one query out of the more than 200 million a day we handle, it's hard to see it becoming much of a problem."

I'm actually a little puzzled by that statement. What does he mean by "well-defined results"? Maybe "results which have many links already". Then it looks like he's basically right. You can't capture a term which already has a strong meaning. But even so, there's a still a lot of search-space in which to play.

[Update September 15 2005 - This has led to a "failure" Google bomb]

Posted by Seth Finkelstein at 11:58 PM | Comments (1)

November 26, 2003

Google Bayesian Spam Filtering Problem?

New Google report from Seth Finkelstein:

Google Bayesian Spam Filtering Problem?
http://sethf.com/anticensorware/google/bayesian-spam.php

Abstract: This report describes a possible explanation for recent
changes in Google search results, where long-time high-ranking sites have disappeared. It is hypothesized that the changes are a result of the implementation of a "Bayesian spam filtering" algorithm, which is producing unintended consequences.

Posted by Seth Finkelstein at 09:07 AM | Followups

November 15, 2003

Google Deskbar

Google Deskbar is the latest little tool from Google. It's a self-contained searching program, which is very lightweight and fits snugly in a desktop screen (PR: "Google Deskbar enables you to search with Google from any application without lifting your fingers from the keyboard. Installs easily in your Windows taskbar.")

I was poking around at its innards in order to see if there was anything interesting inside. Internally, it seems to be a "microbrowser". That is, I think it hooks into Windows/Internet Explorer services in order to do a search, exactly as if you had typed it into the Internet Explorer browser. And then uses the Windows Operating System display routines to present the results.

On the one hand, that makes it heavily operating-system dependent in terms of code. On the other hand, it's extremely cheap in terms of development, a neat little hack.

The most socially interesting thing about it, is that given it's tying into Windows/Internet Explorer services, it appears to share the Google cookie with Internet Explorer, and use the Google cookie itself in all searching. That's not obvious, though it makes sense in retrospect.

It's actually a little strange, in terms coming full circle with applications, to realize it's a microbrowser. That is, the original web browsers were simple programs devoted to rendering simple code. Then the inevitable "creeping-featurism" took over ("2. More generally, the tendency for anything complicated to become even more complicated because people keep saying "Gee, it would be even better if it had this feature too"."). So the browser became a behemoth, of often not-quite-working plugins, handling sound and video and cascades of style bleats. It's now so bloated that writing a small and fast program to do one common operation and display the results quickly, is some sort of innovation. Somewhere there's a lesson in that.

Update: I should have mentioned Dave's Quick Search Taskbar Toolbar Deskbar, thanks to LISnews

Posted by Seth Finkelstein at 11:56 PM | Followups

October 19, 2003

"Watch Baseball London"

"Watch Baseball London" is another phrase which is problematic for Google Spam Filtering Gone Bad. There's another Register article (by Andrew Orlowski) "Options dwindle for London baseball mavens" noting the problem.

Posted by Seth Finkelstein at 03:03 AM | Comments (3) | Followups

October 14, 2003

OS X Panther discussion

"OS X Panther discussion" isn't really the topic of this post. Rather, this is about Google's algorithms. Andrew Orlowski has an interesting Register article today, Blog noise achieves Google KO. He discusses a situation where several blog "TrackBack" pages fill the results of a Google search for OS X Panther Discussion.

In what must be a record, Google is - at time of writing - returning empty Trackback pages as No.1, No.2, No.3 and No.4 positions. No.5 gets you to a real web page - an Apple Insider bulletin board. Then it's back to empty Trackback pages for results No.6, No.7 and No.10. In short, Google returns blog-infested blanks for seven of the top entries.

Honestly, I think this isn't too much of a problem. I believe it's a confluence of at least three different major rules being triggered, having to do with fresh pages (some of the results are very recent), authority pages (Mac stuff, such as OS X Panther, is very popular with some well-linked bloggers), and trying to find a "best" match for all keywords. Here, in particular, the TrackBacks label themselves "Discussion", so Google is putting much weight on that word. Google's a complex system, and algorithmic oddities will happen.

Now, this post is going to trigger some of those rules as well. Sometimes the best way to make a point is to demonstrate something directly :-).

The Trackback creators are aware of the issue too, and seem to be working on fixing it.

Disclaimer: Andrew Orlowski has covered my Google writing before, most recently in "Google bug blocks thousands of sites" for my report last week:

Google Spam Filtering Gone Bad
http://sethf.com/anticensorware/general/google-spam.php

So I hope he won't be angry at me for writing this. Sigh, politics.

Update: I had the number-1 spot on Google, for a day, for those search terms. It's nice to be right :-). More significantly, I'm learning interesting bits about how the freshness rule functions. I could even see when certain search indexes were swapped out or in, as the hit-flow to my website would drop off or pick up again. The implications are staggering ...

See also the follow-up article: Emergency fixes for blog-clogged Google.

Posted by Seth Finkelstein at 11:12 PM | Comments (5)

October 12, 2003

Iraq "astroturf" letters, and Googling

I just sent this to Dave Farber's list, as a supplement for investigating the Olympian story where US newspapers barraged with same letter from different soldiers.

From: Seth Finkelstein
To: Dave Farber
Subject: Re: [IP] US newspapers barraged with same letter from different soldiers

On Sun, Oct 12, 2003 at 09:26:57AM -0400, Dave Farber wrote:
> Original URL: http://www.theinquirer.net/?article=12049
> A Google search by the INQ shows only two online newspapers so far
> including one of the key phrases: "The quality of life and security
> for the citizens has been largely restored, and we are a large part of
> why that has happened." ...
>
> The Google links are to the Register-Herald and the Pittsburgh Daily
> Courier.

I just replicated what they did. They made a searching procedure mistake. They searched the phrase as normal, but forgot or didn't know that by default, Google doesn't show "very similar" results. In this case, "very similar" results are exactly what they wanted. They should have followed the link on the search page which says:

"If you like, you can repeat the search with the omitted results included."

Many more online newspapers with the astroturf are then visible, e.g.

http://www.mvtelegraph.com/opinion/83932mtnview09-11-03.htm http://www.dailymail.com/news/Opinion/200309105/ http://www.uticaod.com/archive/2003/09/11/opinion/14782.html
http://www.heraldnet.com/Stories/03/9/6/17425574.cfm

--
Seth Finkelstein Consulting Programmer sethf[at-sign]sethf.com http://sethf.com

P.S.: Hmm, maybe I should go into "Google studies", market myself as a "Google-expert consultant" :-)

That PS is a reference to a recent report I wrote, which is currently a story.

Google Spam Filtering Gone Bad
http://sethf.com/anticensorware/general/google-spam.php

Google studies. Definitely, Google studies. Google may have warts, but they aren't evil. They don't threaten or sue. They don't send out PR smears. One day I might even conceivably get paid for expertise here, which has never happened with my censorware research. Google, google, google ...

Posted by Seth Finkelstein at 07:21 PM | Followups

October 07, 2003

Google Spam Filtering Gone Bad

I believe I've uncovered the cause of the "Google NACK", a problem where Google is returning no or very few results for certain combinations of search terms. I conjecture it is a consequence of trying to eliminate spam search results, but instead wrongly eliminating all subsequent results. Read:

Google Spam Filtering Gone Bad
http://sethf.com/anticensorware/general/google-spam.php

Abstract: This report describes a problem which caused Google to return very few, or no, results for particular combinations of search terms. It is almost certain this is a consequence of search results being post-processed by spam-defense which has gone awry.

Feel free to verify my methodology. Google has an incentive to rapidly patch any publicized examples.

[Hmm, maybe I should go into "Google studies", Google doesn't sue people!]

Posted by Seth Finkelstein at 03:37 PM | Followups

May 20, 2003

Googlewash, Nunberg, Orlowski

[Semi-name-dropping disclaimer - I like Andrew Orlowski's articles, and think they're asking good questions even if not immediately having the best answer to the question. I've even been quoted, willingly, in one Register Google piece. I've never talked to Nunberg, but I believe he's used some of my censorware investigations research in his CIPA expert testimony, so I also have incentive to favor him.]

I was puzzled recently when Edward W. Felten wrote:

Sunday's New York Times ran a piece by Geoffrey Nunberg complaining about (among other things) the relative absence of major-press articles from the top ranks of Google search results. ...

The real explanation is simpler : The Times forbids Google to index its site.

Huh? This took me aback. I couldn't even find that "complaining" in the piece at first. Some digging, via John Palfrey to Doc Searls finally let me figure it out. I believe what's fueling a certain reaction is this:

People think that the Nunberg/New York Times article is in part complaining about their Google PageRank - because that is what concerns net-writers!

No, folks. New York Times writers don't care about their PageRank. They don't need it!. They're heard already. By people who read short briefing papers prepared by staff. The New York Times is at the top, and it's a very diferent world up there, from down here.

If anything, I read Nunberg as being ever so slightly critical of Orlowski, and quite accepting of the Google results. I think he was saying very roughly that Google returns what people were talking about, and more people were talking about a "blog" topic than a "major-press" topic here, so that's what you get. Then people viewed this as somehow being a "complaint". But I didn't see Nunberg as complaining, so much as stating that chatter may be popular, but it isn't authoritative, and shouldn't be expected to be so. The same sentiment I express as "Google is good, but not God."

Posted by Seth Finkelstein at 03:45 PM | Followups

May 10, 2003

Crash update, Google insight

It's now been a little less than two weeks from the website crash. The new installation of the blog is debugged, and I've fixed any critical file-not-found errors. Old provider http://www.phpwebhosting.com/ remains very apologetic that they didn't have one bit of backup. But sorry and lost mailing-list data still leaves me with painful loss of mailing-list data. That's another personal discouragement.

The other notable after-effect was that I was no longer being visited by My Friend The Freshbot (the Google crawler which checks certain sites for daily updates). It turns out that the daily Google crawler still thinks that my site is hosted on the old location (http://www.phpwebhosting.com/), even though it's now been moved for many days.

That's interesting, as it indicates that the daily Google crawler is rather slow to update its DNS. I've got a log full of errors. That log shows a pattern which seems to confirm that the highest PR or most-linked pages are what forms the basis of the daily crawl (which makes sense).

I'm seeing a brief daily visit from a Google crawler on my new location (Project Geek). But it's just checking the front page and robots.txt. This is probably Google's general crawler to keep track of what websites exist and what shouldn't be searched on them (robots.txt).

As an interim measure, I put some pages back on the old host, so hopefully Google/Freshbot will find them soon. I'm going to keep track of when Freshbot Comes Home (new home, that is).

Posted by Seth Finkelstein at 11:53 PM | Followups

"Google is good, but not God"

I'm quoted (accurately!) in an article in The Register:
"Google to fix blog noise problem"

Or as Seth Finkelstein reminds us,"Google is good, but not God."

Posted by Seth Finkelstein at 12:00 AM | Followups

May 01, 2003

"French Military Victories" and Google

I've been looking at How bloggers game Google by Google Watch. It describes the "French Military Victories" prank. A key part of the article is the claim

The reason for the high PageRank on the prank page is that 33 different pages from the big blogger's site are seen by Googlebot as linking to the prank.

I don't believe this claim is correct. There has to be a limit on how much PageRank a single site can contribute for a link. For example, it's a frequent practice for all pages on a site, to link to the root URL of the site. This doesn't generate an astronomical PageRank. Moreover, even if the own site is an exception, it's a very common to have a page structure where there's a frame or table of associated sites, on each page of the site. A "blogroll" is just one example of this structure. I just looked at Privacy.org, for an example, and note the "Privacy Resources" table on every page, for all thousand or so Privacy.org articles.

Now, there's a deep social issue about information here, which I don't mean to dismiss. But the explanation given by Google Watch for the effect is not right. And in fact, it muddies the issue. It implies a kind of technical bug: "But only about one-third of the page is duplicated in this case, so Google thinks they're all worth indexing.". The problem isn't that Google's duplicate-detection algorithm was fooled. Rather, it's a social "bug", in that the ranking algorithm produces results which are in some ways problematic.

Posted by Seth Finkelstein at 05:04 PM | Comments (77) | Followups

April 16, 2003

"Googlewashed" as a concept

I've been thinking about Andrew Orlowski's article on "Googlewashed" (the fact that it mentions me is not a coincidence, but not the main factor :-))

Aside from the specifics of the story, it seems to me there's something very subtle going on here.

Which means that Google is being "gamed" - and the language perverted - by what in statistical terms in an extremely small fraction indeed.

Hmm, an extremely small fraction of people who influence meaning, sometimes not for the better, for perhaps insular and pack-style conceptions - now, where have I hear this before - journalists!

I don't mean that too sarcastically. The subtle factor is that all of classic media analysis seems importable to blogs + Google. That's interesting.

Posted by Seth Finkelstein at 11:58 PM | Followups

April 15, 2003

Google and superstition as to disappearing site results

Maybe I should move entirely into "Google studies" instead of censorware (at least Google doesn't sue people). I've noticed there's a great deal of superstition generated by quirks in the page-rank algorithm. The following mailing-list message by Danny Yee deserves greater propagation:

Date: Wed, 16 Apr 2003 11:53:52 +1000
From: Danny Yee <danny[at=sign]anatomy.usyd.edu.au>
Subject: Re: [STOP] Googlewashed

Google updates its index monthly, which means it can take up to 70 days for pages on even high profile sites to get into the main index. In between updates, it has a freshbot that grabs some sites on a faster (daily) basis. But things drop in and out of that as it tries to get the freshest news. You will have noticed the "left" sites disappearing because that's what you read.

This is not to say Google is perfect, or doesn't have censorship issues, but I think this particular concern is mis-placed.

For a longer explanation, read Brett Tabke's letter to the Register about their Googlewashing story - http://www.webmasterworld.com/forum3/11518.htm

Danny.

http://dannyreviews.com/ - over six hundred book reviews
http://danny.oz.au/ - free speech, free software, travel

Posted by Seth Finkelstein at 11:56 PM | Followups

March 17, 2003

My Google report makes the front page of Slashdot!

http://slashdot.org/article.pl?sid=03/03/17/2046246

Amazing ... for so many reasons ... Thanks, Timothy!

Posted by Seth Finkelstein at 11:57 PM | Followups

March 15, 2003

The Register publishes my Google report letter

Letter: Molesting Google
http://www.theregister.co.uk/content/35/29773.html

Many hits.

Thank you.

Posted by Seth Finkelstein at 11:58 PM | Followups

February 28, 2003

The Register covers my Google removal report

Be careful what you wish for, you might get it ... or not quite what you want ...

Google in paedo censorship debacle
http://www.theregister.co.uk/content/6/29531.html

It could've been worse, but there were some severe errors here. I've made some updates to:

Chester's Guide to Molesting Google
http://sethf.com/anticensorware/general/chester.php

Posted by Seth Finkelstein at 12:56 AM | Followups

October 24, 2002

How-it-works note for "Localized Google search result exclusions" report

In a fascinating report:
"Localized Google search result exclusions Statement of issues and call for data"
http://cyber.law.harvard.edu/filtering/google/
authors Jonathan Zittrain and Benjamin Edelman examine sites excluded by Google from localized country-specific searching. In discussing results, they conjecture:

The implication of these results -- confirmed in our subsequent searches on google.com versus google.fr and .de for the terms at issue -- is that the French and German versions of Google simply omit search results from the sites excluded from their respective versions of Google.

This implication can be refined and clearly demonstrated by observation of more sophisticated searching. The following example uses the "allinurl" syntax of Google, which searches for URLs which have the given components (note the separate components can appear anywhere in the URL, so "allinurl:stormfront.org" is "stormfront" and "org" in the URL, not just the string "stormfront.org" as might be naively thought).
See http://www.google.com/help/operators.html#allinurl

Consider the following US search:
http://www.google.com/search?q=allinurl:stormfront.org&num=100&hl=en
This returned: Results 1 - 25 of about 1,670.

Now compare with the German counterpart:
http://www.google.de/search?q=allinurl:stormfront.org&num=100&hl=en
This returned: Results 1 - 9 of about 1,670.

Immediate observation: The rightmost (total) number is identical. So identical results are in the Google database. It's simply not displaying them. How is it determining which domain results to display?

Note which "stormfront.org" site URLs are visible on the German page:

www4.stormfront.org:81/guest/RemoteListSummary/NNA
irc.stormfront.org:8000/
lists.stormfront.org:81/guest/remoteavailablelists

What do these all have in common?
They all have a port number after the host name.
The exclusion pattern obviously isn't matching the :number part of the URL.
It's matching a pattern of "*.stormfront.org/", as in the following which are displayed the US search, but not the German search.

kids.stormfront.org/
nna.stormfront.org/
www4.stormfront.org/
www2.stormfront.org/
women.stormfront.org/
www.hessmemorial.stormfront.org/
www3.stormfront.org/
ldf.stormfront.org/

Thus, the restrictions appear to be implemented as a post-processing step using very simple patterns of prohibited results.

Update: See also my explanation "Google Censorship - How It Works"

Posted by Seth Finkelstein at 10:28 AM | Followups

Reason note for "Localized Google search result exclusions" report

In a fascinating report:
"Localized Google search result exclusions - Statement of issues and call for data"
http://cyber.law.harvard.edu/filtering/google/
authors Jonathan Zittrain and Benjamin Edelman examine sites excluded by Google from localized country-specific searching. In discussing the data, they state:

Many such sites seem to offer Neo-Nazi, white supremacy, or other content objectionable or illegal in France and Germany, though other affected sites are more difficult to cleanly categorize.

The purpose of this note is to point out that one reason for certain sites being affected, is that they were formerly in such an objectionable category. Even though the domain has changed owners since then, they apparently remained blacklisted. For example, from http://cyber.law.harvard.edu/filtering/google/results1.html
consider the site:

1488.com - "Chinese Legal Consultation Network"

However, years ago, this domain was apparently a Neo-Nazi site:

http://web.archive.org/web/19980421022707/http://www.1488.com/
Look at the upper-left-hand corner, "The Swastika Homepage"
http://web.archive.org/web/19980421023027/www.1488.com/svastika/

Then the domain went up for sale:
http://web.archive.org/web/20000511232352/http://www.1488.com/
"This Domain - for sale"

Then it became the current Chinese site:
http://web.archive.org/web/20011005165901/www.1488.com/gb/

Similarly www.14words.com was once a White-supremacist site:
http://web.archive.org/web/19980421023659/http://www.14words.com/

But it is now just an empty domain. That's why it comes up with nothing but a homepage for a hosting company.

The implications here are that the blacklist is not re-examined or updated with any particular care, if at all.

Update: See also my explanation "Google Censorship - How It Works".

Posted by Seth Finkelstein at 07:51 AM | Followups