The Nitke v. Gonzales case, which challenges the conflict between obscenity, "community standards", and the Internet, is being appealed to the Supreme Court, in response to an unfavorable lower court ruling.
"The CDA contains provisions that ban speech and images from the Internet that any local community in the U.S. could deem obscene, even though that speech would be fully protected elsewhere. The CDA also contains a provision that states that it's illegal to put any obscene material on the web in such a way that minors can access it. However since the Internet can be accessed by anyone with a computer, anything on the web can be accessed by a minor as previously held by the Supreme Court in Reno v. ACLU."
In the new study, the authors still draw two words at random in the ispell dictionary, but exclude a third, random word from the search (using the exclusion operator - ), in the hope of removing word lists and spam from results. For example, they will search for switchers trophoblast -agnus. They find that Google still returns more results (although less often than before).
Unfortunately, this new strategy doesn't remove the bias. Word lists and spam are still returned, as can be easily checked on any of the queries used, such as switchers trophoblast -agnus. Here are the results from a Google search this morning : all results but one are word lists and junk.
Let me further elaborate. The study's authors assume:
To deal with this problem we modified our original search parameters of searching for two random words from the commonly available English Ispell Wordlist (a total of 135,069 words) . Instead, we searched for two random words and not a third random word. This method, we feel, helps to exclude the vast number of "dictionaries" and "wordlists" because those results should be filtered out by the "not a third random word" part of our search query.
The intent is clear. But the above statement simply isn't true. In fact, the method may not even exclude format variations of the original wordlist. For example, hypothetically, if there's a wordlist split into two files, one covering words starting with the letters "a-n", and another covering words starting with "o-z", then searching [alpha beta] will find the first file, yet searching [alpha beta -zebra] will still find the exact same file.
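Here's a minimal Python sketch of that hypothetical. The two file names and their contents are invented for illustration, and the matching rule (a page matches if it contains every required word and none of the excluded words) is my simplified assumption about how the exclusion operator behaves, not a description of how Google actually implements it.

```python
def matches(page_words, required, excluded=()):
    """Simplified boolean match: every required word present, no excluded word present."""
    words = set(page_words)
    return all(w in words for w in required) and not any(w in words for w in excluded)

# Hypothetical wordlist split into two files by first letter (names and contents invented).
pages = {
    "wordlist-a-n.txt": ["alpha", "beta", "gamma", "nu"],
    "wordlist-o-z.txt": ["omega", "theta", "zebra"],
}

def search(required, excluded=()):
    """Return the names of all pages that match the query, in sorted order."""
    return sorted(name for name, words in pages.items()
                  if matches(words, required, excluded))

plain = search(["alpha", "beta"])                               # [alpha beta]
with_exclusion = search(["alpha", "beta"], excluded=["zebra"])  # [alpha beta -zebra]

print(plain)
print(with_exclusion)
```

Both queries return the same "a-n" file, because the excluded word only ever appeared in the other half of the list.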
More importantly, not all wordlists are identical. A specific example in the "verification" study is searching [guck wheeze -prothrombin].
Terms: guck wheeze -prothrombin
Google:
Duplicates Omitted Estimate: 88
Duplicates Omitted Total: 56
Duplicates Included Estimate: 88
Duplicates Included Total: 83
Yahoo:
Duplicates Omitted Estimate: 30
Duplicates Omitted Total: 25
Duplicates Included Estimate: 29
Duplicates Included Total: 28
Ratio of Google/Yahoo for this query:
Duplicates Omitted Estimate: 2.933333
Duplicates Omitted Total: 2.240000
Duplicates Included Estimate: 3.034483
Duplicates Included Total: 2.964286
But "guck" and "wheeze" are common words, while "prothrombin" is much more obscure. So, per the search, there are still many wordlists which contain "guck" and "wheeze", but not "prothrombin" (as well as spam pages).
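As a sanity check, the ratios in the log above can be reproduced from the raw counts. This is just a sketch of the arithmetic; it assumes the first block of counts is Google's and the second is Yahoo's, which is the reading consistent with the "Ratio of Google/Yahoo" lines. The dictionary keys are my own labels, not names from the study.

```python
# Counts copied from the log for [guck wheeze -prothrombin].
# Assumption: first block = Google, second block = Yahoo.
google = {
    "omitted_estimate": 88,
    "omitted_total": 56,
    "included_estimate": 88,
    "included_total": 83,
}
yahoo = {
    "omitted_estimate": 30,
    "omitted_total": 25,
    "included_estimate": 29,
    "included_total": 28,
}

# Google/Yahoo ratio for each of the four counts.
ratios = {key: google[key] / yahoo[key] for key in google}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.6f}")
```

Note how small the underlying numbers are. A "3x" ratio here rests on a few dozen pages, so a handful of wordlists or spam pages indexed by one engine and not the other is enough to produce it.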
In general, sampling bias must be carefully examined, because extensive repetitions of a flawed procedure will still yield a fundamentally flawed outcome.
I'm on blog-vacation, probably until September (not a physical vacation, but my free time and net access are minimal because of home repairs). I'll be back at least once for an essay regarding the vacation, and then we'll see.
Update 8/23 - I did a quick post on the size study, since a new version was in the news, and I'm checking email, but again, posting will be very, very light at least until Labor Day.
The study "A Comparison of the Size of the Yahoo! and Google Indices" is being widely reported. On initial examination, I've found a bad problem with it.
The methodology is severely flawed, with a sampling-error bias.
In order to create a large number of queries that returned less than 1,000 results, we took the commonly available English Ispell Wordlist (a total of 135,069 words)  and wrote a PERL script to randomly select two words at a time from that list. The script then used those keywords to search both Yahoo! and Google and logged the number of results returned. For the purposes of this study we used a sample of 10,012 different searches of Yahoo! and Google using our randomly selected keywords.
By sampling random words, they biased the samples to files of LARGE WORD LISTS!
And this effect applies, to a greater or lesser extent, to EVERY SAMPLE.
One can see this in their log of search results.
Terms: carbolization clambers
Google:
Duplicates Omitted Estimate: 7
Duplicates Omitted Total: 4
Duplicates Included Estimate: 7
Duplicates Included Total: 7
Yahoo:
Duplicates Omitted Estimate: 0
Duplicates Omitted Total: 0
Duplicates Included Estimate: 0
Duplicates Included Total: 0
Do the Google search.
Every entry is a large word-list file. Some are presumably (near?) duplicates of the same file.
And every search will have this problem, since every search will pick up files like those.
It's a severe systematic error.
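To illustrate why this is systematic rather than random noise, here's a toy simulation in Python. Everything in it is an assumption made for illustration: the corpus sizes, the idea that ordinary pages use only a small "common" slice of the dictionary, and the five wordlist files that each contain the whole dictionary. It is not a model of either search engine's index.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

dictionary = [f"word{i}" for i in range(2000)]  # stand-in for the ispell list
common = dictionary[:200]                        # assumed: ordinary pages use only common words

# Ordinary pages: small samples of the common vocabulary.
ordinary_pages = [set(rng.sample(common, 40)) for _ in range(1000)]
# Wordlist pages: a handful of files that each contain the entire dictionary.
wordlist_pages = [set(dictionary) for _ in range(5)]

def hits(pages, a, b):
    """Count the pages that contain both query words."""
    return sum(1 for page in pages if a in page and b in page)

n_queries = 500
with_results = 0
wordlist_only = 0
for _ in range(n_queries):
    a, b = rng.sample(dictionary, 2)  # two random dictionary words, as in the study
    from_wordlists = hits(wordlist_pages, a, b)
    from_ordinary = hits(ordinary_pages, a, b)
    if from_wordlists + from_ordinary > 0:
        with_results += 1
        if from_ordinary == 0:
            wordlist_only += 1

corpus_share = len(wordlist_pages) / (len(wordlist_pages) + len(ordinary_pages))
query_share = wordlist_only / with_results

print(f"wordlist pages as share of corpus: {corpus_share:.3%}")
print(f"queries answered only by wordlist pages: {query_share:.1%}")
```

In this toy setup the wordlist files are a tiny fraction of the corpus, yet they are the only results for the great majority of random two-word queries. Running more queries doesn't help: it just measures the same skewed thing more precisely.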
Update [12:30 pm EST] - add search-engine spam to the sampling bias. Consider:
Terms: alkaloid's observance
Google:
Duplicates Omitted Estimate: 29
Duplicates Omitted Total: 15
Duplicates Included Estimate: 29
Duplicates Included Total: 29
Yahoo:
Duplicates Omitted Estimate: 0
Duplicates Omitted Total: 0
Duplicates Included Estimate: 0
Duplicates Included Total: 0
Look at the results. Every page is either a gibberish spam page or a wordlist.
The recent Google Print debate has been far-reaching, e.g. Siva Vaidhyanathan: Google Avoids Copyright Meltdown:
If copyright is to mean anything at all, then corporations may not copy entire works that they have never purchased without permission for commercial gain. I can't imagine what sort of argument -- short of copyright nihilism -- would justify such a radical change in copyright law.
When discussing the implications of the copyright system, I sometimes try to point out that there are intrinsic conflicts inherent in it, especially in terms of technological advances.
Let's step back for a moment. Why is Google doing this book-scanning project? It's not because it's just so cool (even if it is). While coolness may justify a small-scale promotional project, the scanning efforts are expensive. So Google, as a company, obviously sees some value in the effort. This is not wrong. But it's also a direct conflict with the granted monopoly known as copyright. Whenever there is value, particularly commercial value, there is conflict over who should be able to receive it.
It's not hard at all to see potential returns here. Besides the obvious selling of ads from searches, consider that it positions Google to be a potential partner in any e-books venture. It's not a guarantee. But if a company already has a scanned, indexed, "production" version of the book, that's a good selling point. From this perspective, Google's interest in working with libraries can be seen as a way to do an end-run around contracts with publishers, and around Amazon's own evident efforts (talk about doing well by doing good!).
That's just an example. Look at it this way. Google is saying, "Let us make e-books of all library content, and keep them - for copyright reasons we'll only display search results". That's clearly very dubious under copyright. But ... it's obviously an innovation. However, it's a very commercially valuable innovation. Which brings us back to copyright. A problem with the polarized debate over copyright is that it's often framed in terms of the morality of property rights, opposed by individual usage rights (which leads to screaming of "monopolist" vs "thief"). But if the Google Print scanning project is viewed as a balance of economic interests - between one company that wants to leverage its search expertise into the e-book area, and other companies which want to maintain their limited monopoly on the potential market - then, assuming one believes copyright properly grants some exclusive rights, it's not obvious which is correct here.
That is, the technology company can't be right every time, almost by definition. Because copyright as a limited monopoly fundamentally restricts innovation in some ways. That's the trade-off.
The latest is Adam Curry's Podshow, taking nearly $9 million from Kleiner Perkins and Sequoia (more details). No matter how you feel about podcasts -- from horribly overhyped to the next great communications medium -- it's hard to see how any company in the space needs that much cash today. John Doerr has said he's interested in podcasting in the past -- so it's not entirely surprising that he'd be involved in such an investment. However, it's difficult to figure out what a company like Podshow could possibly do productively with $9 million at this stage of its development.
NOW I understand why Dave W-ner was so [redacted] over the falling-out with Adam Curry over the podcasting business.
And $9 million will buy a lot of hype.
Think for a moment. VC's want a big payout for their invested money; they don't do it for emergent punditocracy and self-expression. Which means a sucker downstream investor has to be found who is going to pay even more at the exit.
Get set to hear how you should be PODCASTING, it's so great, because it's the highest form of expression of your voice, and now you have a voice (buy buy buy ...)
And remember the VC's behind the curtain. Only a heretical killjoy would imply they might not have your, yes your, best interests in mind.
Walt Crawford's publication Cites & Insights 5:9, July/August 2005 was released a while back, and I kept putting off writing about it. It covers the Grokster case extensively, DVD-bowdlerizers, conference commentary, etc. All worth reading.
I get mentioned a few times, which warms my heart. There are matters about which I'd want to clarify or expand my views - but on the other hand, it's not worth typing pages about it, especially during the middle of summer. There's one portion where I can add particular value. The Guns-vs-iPods issue has in fact been in the news, in terms of the various standards of liability for different types of products (remember Andrew Orlowski's joke: "It may soon be possible to carry around an AK-47 assault rifle and an iPod with you down the street - and be arrested for carrying the iPod." - we aren't quite there yet, but that definitely sums up one potential future).
I believe a handgun company that advertises its products as "Perfect for taking out your old lady" and bases its business model on an increased rate of homicide should be liable, regardless of the Second Amendment. (That's a hypothetical case!)
Interestingly, that's not such a hypothetical case. For example, there's a discussion of "Merrill v. Navegar":
"The TEC-9/DC9 was designed to be fired from the shooter's hip; the barrel of the gun was threaded to accommodate silencers and flash suppressors; and Navegar advertised the assault weapon as having excellent resistance to fingerprints. Navegar's director of national sales and marketing testified that he welcomed negative news stories about the TEC-9/DC9s because "whenever anything negative has happened, sales have gone tremendously high."
It turns out the debate corresponds very deeply, with inferences from design, proposed technology mandates, making inducement arguments, and so on. It's surprisingly similar.
I'm not going to say any more. But the analogy turns out to be provocative on many levels.
Perhaps my bubble-prayer is starting to be answered. A few days ago, the search engine Baidu did an IPO-party like it's 1999! This connects very interestingly with all the recent examination of blog search.
Back at the start of the development of censorware, people would sometimes say to me: "Seth, if you think censorware programs are so bad, why don't you work on making better censorware?". I always opposed that line of thought, for reasons others may have thought dogma or abstract morality, but I thought were predictable bad consequences. Since the problem itself was fundamentally flawed, it was just going to further entangle civil-libertarians in touting censorware (I turned out to be right, but that did me little good).
I didn't want to sell snake-oil to people, even if it was slightly less toxic snake-oil than other brands. But search algorithms aren't snake-oil. A small incremental improvement has value. Sometimes much value.
And remember, it's pretty apparent these days that I don't have much of a future in activism/law/policy. Nor will I ever get out of the Z-list of blogging. I can't sell that snake-oil either (link omitted out of self-preservation ...).
I was briefly quoted in CNET's "Blogma" regarding the issue that the first rule of A-List Club is you do not talk about A-List Club, and discussions about the BlogHer Conference meritocracy implications:
In these circles, apparently, BlogHer represents a form of gender-based politics that is a product of older generations and antithetical to the utopian libertarianism espoused so often in cyberspace. Yet as one observer noted in response to an essay that conveyed this point of view: "There's a difference between an ideal and a delusion. I think you have confused the two."
To me, the post-conference debate is self-proving. Consider the mathematics:
There were a few hundred people who attended the BlogHer conference. Which leads to a few hundred direct opinions from attendees about how it went. Add indirect opinions from interested readers too. Now, of this melange of viewpoints and conversations, which ones were amplified overall and then retailed to thousands of people not involved? Simple:
THE OPINIONS OF THE A-LISTERS!
So, if you believe all that matters is socializing, you can dismiss everything else, since it doesn't affect whatever socializing happened. If you believe being heard and having an influence matters, well, the fact that a handful of rich/connected ranty A-listers (some of whom weren't even there) are basically defining the issues to everyone else should be a sterling disproof of meritocracy.
Of course, that also implies this post doesn't matter, but it has an individual purpose in noting I'd been quoted :-).
Back from the cliffhanger, with everything from comic relief ("... I definitely used the word "sleazy" more than once") to madcap confusion ("... crowd is filled with both conspiracy theorists and reporters, and sometimes the two types overlap. So all the hens were clucking, passing stories to each other ..."), lawyer Jennifer Granick concludes the exciting saga with the FBI chapter. My two take-away parts of general interest:
I notified the agent in charge that I represented Mike Lynn and that he was asserting his Fifth and Sixth Amendment rights not to be questioned outside my presence. (Tip: Always assert both your right to remain silent and your right to have an attorney present.)
(Another tip: Don't try to convince law enforcement of your own innocence. Get a lawyer. Really.)
[Indeed, I know people have messed up here (though I don't want to seem to criticize anyone by a link)]
To me, the most interesting thing about this chapter (putting aside the human drama) is that apparently neither Cisco lawyers nor ISS lawyers called in the FBI. If it wasn't those lawyers, who was it? Some other part of Cisco or ISS? Doing it without telling the legal department? Looks like she can't say, even if known. But it's disturbing to see that the FBI can be invoked so readily, with all the problems that causes.
Read Jennifer Granick's inside account on the Mike Lynn case, explaining the legal issues regarding his disclosure of Cisco router security problems. Money quote (pun intended):
At the point that you get sued, or even charged with a crime, it matters less what actually happened and whether you did something wrong and more what it takes to get out of the case as unscathed as possible. It's sad, but true, that our legal system can often be more strategy than justice.
Core of the "interesting" legal controversy:
It seemed that Cisco was claiming that Mike's actions were improper because he violated the End User License Agreement (EULAs), which prohibited reverse engineering. So now I was having fun. I'm totally interested in EULAs and the circumstances under which they take away public rights that are otherwise guaranteed us. Usually, a breach of contract is no big deal. But increasingly in the tech field, we're seeing big penalties for what's essentially a contract violation. Under the Computer Fraud and Abuse Act, if you exceed your authorization to access a computer, you've committed a crime. Cases have said you exceed authorization when you breach a EULA, terms of service, or employment contract. Other cases have said that EULAs can waive fair use rights and other rights guaranteed under copyright law. Lynn's case presented the question of whether EULAs could subvert the legislature's express desire to allow people to reverse engineer trade secrets.
[Note - I've said this before, many times, but once again, here's more evidence that the types of legal risks I faced myself in investigating censorware were severe, and it was a very serious matter of extensive attacks combined with lack of support which made me quit censorware decryption research]
New Third Way Report Finds Children Are Major Users Of Internet Pornography; Porn Sites Target Kids
Group Endorses New Bill To Require Age Verification and Impose "Smut Tax"
The first point of this "report" was:
# Online pornography is proliferating online at an alarming rate - from 14 million web pages in 1998 to 420 million today.
This is lying with statistics, which I've debunked before. The misdirection is to neglect that the web itself has grown since 1998.
But it's been widely echoed credulously by the press:
Report says porn Web sites "exploding" as Internet goes unchecked
July 27, 2005, 5:57 PM
LITTLE ROCK, Ark. (AP) -- A report released Wednesday by a group of Democrats seeking a moral authority some say their party has lost says the number of pornographic Web pages has grown 3,000 percent since 1998 and federal laws must be changed to keep children away from them.
The think tank Third Way says there were 14 million pornographic Web pages in 1998 and 420 million today. Even amid broad discussion of morality issues, politicians are surprised by the sudden growth that has allowed adult Web sites to dominate the Internet almost unchecked, Third Way spokesman Matt Bennett said.
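The arithmetic behind the headline number is easy to check, and it shows why a raw page count says nothing by itself. The sketch below uses only the two figures quoted in the report. The total web sizes in it are HYPOTHETICAL placeholders chosen purely to illustrate the point; they are not measurements of the actual web in 1998 or 2005.

```python
# Figures as quoted by the report.
pages_1998 = 14_000_000
pages_2005 = 420_000_000

growth_factor = pages_2005 / pages_1998        # how many times larger
percent_increase = (growth_factor - 1) * 100   # increase expressed as a percentage

def share(part, total):
    """Fraction of the whole that the part represents."""
    return part / total

# HYPOTHETICAL totals, for illustration only (not real web-size data):
# suppose the whole web grew by the same factor over the same period.
web_1998_hypothetical = 1_000_000_000
web_2005_hypothetical = web_1998_hypothetical * growth_factor

share_1998 = share(pages_1998, web_1998_hypothetical)
share_2005 = share(pages_2005, web_2005_hypothetical)

print(f"growth factor: {growth_factor:.0f}x")
print(f"percent increase: {percent_increase:.0f}%")
print(f"hypothetical share in 1998: {share_1998:.2%}")
print(f"hypothetical share in 2005: {share_2005:.2%}")
```

The count grew 30-fold, an increase of 2,900 percent, which the AP story reports as 3,000 percent. But if the denominator grew by a comparable factor, the proportion didn't change at all. Without the total, "14 million to 420 million" can't tell you whether anything is "exploding" relative to the rest of the web.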
So, as a good little blogger, I have fact-checked the press, and exposed the censorware company sleaze.
But - WHO CARES! It shows the structural bankruptcy of blog evangelism. What good does it do for me to say this to a tiny audience, of the choir (and opposition-researchers!)? The only way this could ever possibly make a difference is if an A-lister or similar echoed it. Otherwise, I'm just shouting to the wind. It's a meager pleasure to indulge myself with ineffectual ranting.
(image from Jonathon Delacour)
Every time this goes around, I think of writing a FAQ ("Frequently Asserted Querulousness") on A-list issues. Then I remember how many times it's been done before, and that self-referentially, almost nobody would read it (since I am very far from the necessary status).
A long time ago, on certain USENET newsgroups, a guy named Carl Kadie used to do something very useful. For certain debates, he'd do a lengthy post along the lines of (roughly) "The last time this topic came up, X people said [THIS]. Y people said [THAT]. Z people said ...". Maybe it was my imagination, but it seemed to me that it helped. Then again, times have changed, and that might not do any good now.