"The Book Club Blog" has a great collection of information as to whether the supposed "diary of a london call girl", the Belle de Jour blog, is a hoax.
I remain with the skeptics. "She" recently "said" (my emphasis):
Unfortunately for the conspiracy theorists, there is no conspiracy. I am a young woman, I have sex for money, and I love to read and write. My taste in books shouldn't come as a surprise. After all, this job affords more spare time than most. Think of Occam's razor, the principle of parsimony: what would be simpler - that I am who I say I am, and write about, or that I am a famous author living a double life, unable to tell anyone and having a joke at the expense of my agent, publisher and readers? What does bother me is the presumption that a person's occupation is a reflection of their intelligence or value to society:
Let me reframe:
"... that I am a real well-read call-girl who instantly writes award-winning polished prose, or that I am a not-so-famous author who would like to be more famous, and saw an opportunity to do so by writing a fake blog and feeding on the media appetite for sex and the Internet and blogs and selling papers via titillation and scandal?"
When this question is put forth, there's almost a lawyer-trick of deflecting the suspicion by pounding the table and accusing the skeptic of bigotry: You think prostitutes can't be smart! Sexist! Classist!
No. I think writing is hard work for anyone. And Occam's razor, the principle of parsimony, suggests that an established writer posing as a media-attention draw is very likely indeed - much more likely than such a real person winning awards and book deals. It's just ghost-writing taken one step further, where the writer starts by creating the celebrity in the first place (rather a clever idea, in retrospect).
Given the forthcoming "Belle de Jour" book, I was tempted to suggest turning its Amazon book reviews section into a hoax-information discussion forum. But that's probably playing into the book's buzz-hype. Still, it was an appealing thought.
[Update 4/22: New report
Jew Watch, Google, and Search Engine Optimization
http://sethf.com/anticensorware/google/jew-watch.php
Abstract: This report examines issues surrounding the high ranking of an anti-semitic website, "JewWatch.com" for searches on the word "Jew". The search results present complex issues of unintended consequences and social dilemmas.
]

Search "Jew"
As noted by http://www.jewishjournal.com/home/preview.php?id=11998 (via JOHO the blog):
"Online searchers punching the word "Jew" into the Google search engine may be surprised at the results they get.
In fact, the No. 1 result for the search entry "Jew" turns out to be www.jewwatch.com. The fanatically anti-Semitic hate site is ranked first in relevance of more than 1.72 million Web pages."
Hate groups are learning search engine optimization. That ranking is no accident.
The No. 1 ranking of Jew Watch came as a surprise to David Krane, the director of corporate communications for the San Mateo-based Web giant.
Such a page might not pop up for Google searchers in European countries, where Holocaust denial is illegal. But Krane adamantly stated that Google has no plans to manually alter the results of their ranking system to knock Jew Watch from its top spot.
Yup (to all).
Do a German search for "Jew", or a French search for "Jew", and the hate site is not there - for exactly the Google censorship reason noted. This is well-known, from the first "Localized Google search result exclusions" report by Benjamin Edelman and Jonathan Zittrain.
But it's a legal site in the US, fully protected under the First Amendment as political speech.
This is an excellent example of many points I've made, but in particular:
Google ranks popularity, not authority
[I wrote this letter about a news article regarding students using Google image search as a means of circumventing censorware]
Dear Annalee Newitz,
I read with great interest your story on Google, censorware,
and image searching, as a school censorware problem, at:
http://www.alternet.org/story.html?StoryID=18213
I've published much work about the issue of Google image searching and similar sites being a "loophole" for censorware. It was even referenced in the expert reports in the District Court decision on library censorware (unfortunately, it has been extremely poorly publicized and otherwise unreported). See, for example:
District Court CIPA decision
http://sethf.com/pipermail/infothought/2002-May/000010.html
BESS's Secret LOOPHOLE: (censorware vs. privacy and anonymity)
http://sethf.com/anticensorware/bess/loophole.php
BESS vs The Google Search Engine (Cache, Groups, Images)
http://sethf.com/anticensorware/bess/google.php
BESS vs Image Search Engines
http://sethf.com/anticensorware/bess/image.php
The Pre-Slipped Slope - censorware vs the Wayback Machine web archive
http://sethf.com/anticensorware/general/slip.php
But I noted one major error in your article, in this part:
> The second problem, which is strictly laughable, is that regular

I don't think this is what happened. It just seemed that way. What really happened is that when you retrieved the text page from the Google cache, it had within it image links to the naked-people pictures at the non-Google sites. Since your computer was not censorware'd, you were able to retrieve those images. But again, that wouldn't have worked in the case where censorware prevented you from viewing anything on the non-Google image sites.
Note, however, the retrieval would work the way you
described, with the Wayback Machine web archive:
http://www.archive.org/
Perhaps that will be the next site to become popular with students, and then prohibited.
There's a proposed Google-bombing solution in the article "Five-domain Googlebomb explodes in boardroom":
"An easy fix for many bombs," explains Brandt "Google should not use terms in external links to boost the rank of a page on those terms, unless those terms are on the page itself. This is a no-brainer. But it means another CPU cycle per link, which is why Google won't do it."
Unfortunately, I have to disagree here. It's not so simple. In fact, the way it works now is ultimately the Right Thing from a technical point of view, in terms of making relevancy inferences from a simple algorithm.
One nontrivial reason is misspellings. If many people make the same spelling error in linking (such as turning "Dan Gillmor" into "Dan Gilmore"), it's useful to return that linked page for the search, rather than ignoring it since the wrong spelling likely won't be on the target page.
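To illustrate the point, here's a toy sketch (emphatically not Google's actual algorithm; the link data and names are hypothetical): a page can be returned for a term that appears only in the anchor text of links pointing at it, which is exactly what makes the misspelling case work.

```python
from collections import defaultdict

# Hypothetical link graph: (anchor text of the link, page it points at).
# Several people misspell "Gillmor" as "Gilmore" in their links.
links = [
    ("dan gilmore", "gillmor-blog"),   # common misspelling
    ("dan gillmor", "gillmor-blog"),
    ("dan gilmore", "gillmor-blog"),
]

# Index each anchor-text term to the set of pages linked with that term,
# regardless of whether the term appears on the target page itself.
anchor_index = defaultdict(set)
for anchor, target in links:
    for term in anchor.split():
        anchor_index[term].add(target)

def search(term):
    """Return pages linked-to with this term in anchor text."""
    return sorted(anchor_index.get(term, set()))
```

A search for the misspelling "gilmore" still finds the right page, even though that spelling never occurs on the page - which is the useful behavior Brandt's proposed fix would destroy.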
There are also issues with robots.txt. The robots.txt file isn't for privacy; it's just an advisory to have search-spiders work more efficiently (think of how ill-considered it would be to have a public file listing material which should not be viewed - "Do Not Look Here"). If the site doesn't want spidering, but many people link to it with certain words, it seems a reasonable thing to return that site for those words. The option of not returning the site isn't necessarily right, because sites often just use robots.txt to avoid the load of being spidered, rather than to hide in any way.
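For context, a robots.txt is nothing more than a plain, publicly readable advisory file at the site root (the paths here are illustrative) - note it openly names the very directories it asks spiders to skip:

```
User-agent: *
Disallow: /members/
Disallow: /cgi-bin/
```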
Many issues with Google, or any complex search system, are more subtle than they might appear at first glance.
The Register has an article today, "Five-domain Googlebomb explodes in boardroom", talking about connecting the phrase "out of touch executives" to Google.
As I've noted, e.g. in discussing the miserable failure Google-bomb, the key concept is the confusion between popularity and authority. But I'm not sure how far this can be pushed in terms of giving relevance to obscure phrases.
Perhaps an experiment is in order, to demonstrate a principle.
Let's say I linked a certain phrase "EBig EBrother" to somewhere (such as Google ...). I've used an uncommon phrase here, so as to make it easy. The words "Big Brother" have many hits, but there's no occurrence of "EBig EBrother". Well, there won't be any until this post gets indexed.
What happens?
Walt Crawford has a special "Broadcast Flag" edition of his library 'zine (not blog) "Cites & Insights":
On November 4, 2003, the Federal Communications Commission (FCC) adopted a Report and Order and Further Notice of Proposed Rulemaking in the Matter of: Digital Broadcast Content Protection, MB Docket 02-230. In English, the FCC adopted the Broadcast Flag. You can find the lengthy report (72 pages single-spaced, plus four appendices) on the web. This commentary may be long but it's far from comprehensive--and certainly not final, since the rulemaking is only a first step. My aim here is to provide a reasonable sampling of background, direct documents, and apparent consequences--and to give you some reason to believe that librarians, and those concerned with the future of digital technology in the U.S., should be concerned about the Broadcast Flag and its implications.
All worth reading, and recommended. I've not been much involved in that battle, though I've mentioned some "Broadcast Flag" strategies.
I do have one note of commentary (emphasis mine):
Paragraph 41 is also interesting as it cites limits within DMCA: "nothing in this section shall require that the design of, or the design and selection of parts and components for, a consumer electronics, telecommunications, or computing product provide for a response to any particular technological measure, so long as such part or component, or the product in which such part or component is integrated, does not otherwise fall within the provisions..." In other words, DMCA doesn't require new technological measures. Does that call into question the FCC's ability to impose such measures? Not according to the FCC: They limit the significance of the emphasized section to one subsection of DMCA, and deem it as not in any way limiting the FCC from imposing such requirements.
Well, sadly, basically, the FCC is right on this point (in my nonlawyer but DMCA studied view). The DMCA does not require a broadcast flag. But there's no pre-emption or affirmative limit there. That is, even though the DMCA doesn't mandate it, some other law or regulation could give the FCC the power to impose this, and that would not be a conflict. That's what the FCC is saying.
The FCC's claim to have authority over equipment-makers strikes me as broad, but there might actually be some precedent for it. But even if so, it would be on a very different basis from the DMCA.
Checking other "Belle de Jour" articles, I found one which argued for skepticism based on the "Gender Genie", an algorithm for allegedly determining male or female authorship. Comments pointed out that the statistics are unimpressive.
So I tried testing the infamous book review, the (female author) passage of text which supposedly formed the basis of the recent identity hunt.
In the results below, there's a caveat "(NOTE: The genie works best on texts of more than 500 words.)". All book reviews were given as "nonfiction" category writing.
Words: 256
Female Score: 74
Male Score: 346
The Gender Genie thinks the author of this passage is: male!
Amusingly, when I clicked on feedback submission ("Am I right? The author of this passage is actually ..."), the results were:
That is one butch chick.
According to Koppel and Argamon, the algorithm should predict the gender of the author approximately 80% of the time.
Accuracy Results
Am I right?
yes 129165 (63.72%)
no 73542 (36.28%)
Note that coin-flipping will be right 50% of the time. So 80% is interesting, but not all that amazing. And 63%, for this implementation, seems only a slight improvement on the coin-flipping algorithm.
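As a quick check of that arithmetic, using the feedback tallies quoted above:

```python
# Feedback tallies reported on the Gender Genie site (quoted above).
yes_votes = 129165
no_votes = 73542

# Overall accuracy of the implementation, per its own user feedback.
accuracy = 100.0 * yes_votes / (yes_votes + no_votes)

# A coin flip scores 50% on a balanced sample, so the improvement
# over chance is the more meaningful figure.
improvement_over_chance = accuracy - 50.0
```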
Testing a second review:
Words: 143
Female Score: 172
Male Score: 192
The Gender Genie thinks the author of this passage is: male!
Testing a third review:
Words: 261
Female Score: 337
Male Score: 280
The Gender Genie thinks the author of this passage is: female!
One out of three is bad (though granted, these are small word-count samples).
So, now testing the "Belle de Jour" first month archive:
Considered as category "fiction" or "nonfiction":
Words: 1785
Female Score: 2138
Male Score: 1936
The Gender Genie thinks the author of this passage is: female!
Considered as category "blog entry" (apparently different keywords are used):
Words: 1785
Female Score: 2326
Male Score: 3384
The Gender Genie thinks the author of this passage is: male!
I can't see these results as worth much at all.
The Belle de Jour blog is supposedly a "diary of a london call girl", written by an anonymous prostitute. Given that "she" has landed an award and a book deal, there's been (a PR stunt? interest? a journalistic pack-story?) over her identity. The funniest part is the suggestion that it's Andrew Orlowski (this falls into the class of things which, if they aren't true, should be :-)). The original suspect has denied it.
After reading and hearing about all this, I did a little digging myself. Now, literary forensics is harder than it looks. It's the practice of determining authorship from quirks, styles, idiosyncrasies, etc. I've played around with it, and been wrong. My speculations, which again, might certainly be wrong:
1) The "Belle de Jour" blog is a fake, written by at least two people, one starting it, then another taking over later.
2) At least the second person, the one who took over, is a journalist.
I'm more certain of #1 than #2.
Here's why - look at the use of the singlequote character. As Don Foster claimed originally, there's a style of singlequote for phrases, doublequote for conversation. But, as I've found, in the first month archive, there are NO - none - zero - singlequote usages at all. Load the archive file http://belledejour-uk.blogspot.com/2003_10_01_belledejour-uk_archive.html into a text editor, and search for the two-character sequence of singlequote and space. Nothing in the text. Now repeat the search with the third month archive file http://belledejour-uk.blogspot.com/2003_12_01_belledejour-uk_archive.html. Many, many, such usages (e.g.: descriptors 'It Girl' and 'double-barrelled' apply).
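That check is trivial to script. A minimal sketch (the function name is mine; feed it the saved archive HTML rather than fetching the URLs here):

```python
def count_singlequote_space(text):
    """Count occurrences of a singlequote immediately followed by a space -
    the closing quote of the 'quoted phrase' style discussed above."""
    return text.count("' ")

# Example, using the third-month phrasing quoted in the text:
sample = "descriptors 'It Girl' and 'double-barrelled' apply"
```

On the first-month archive this count should come out zero; on the third-month archive, large.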
Now, this is the sort of observation where someone can sneer - "Look, he's talking about a quotemark, how silly!". And it can be wrong, a writer might just have a new computer, or use a new composition procedure, or something similar. But fingerprints themselves are just smudges made by oily skin ridges, and have to be interpreted with care too.
I'm not sure if there's significance that some of the line break HTML has the sequence period-space-br-tag while others just period-br-tag (no space). That's not 100% consistent, very attackable, but also suggestive of two different origins (which could be either people or procedures, note!). Also sometimes the quoting of conversation is only in singlequotes.
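The line-break comparison can be scripted the same way (a sketch assuming the raw Blogger archive HTML as input; the `<br>` form is what I saw, and the pattern would need adjusting for variants like `<br />`):

```python
import re

def br_styles(html):
    """Return (count of '. <br' breaks, count of '.<br' breaks) -
    period-space-br versus period-br with no space."""
    with_space = len(re.findall(r'\. <br', html))
    no_space = len(re.findall(r'\.<br', html))
    return with_space, no_space
```

A lopsided split between the two counts across the monthly archives would be the (weak, attackable) signal of two origins.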
But that "second" person's style sure looks journalistic. It's not that a call-girl can't be literate and write well. Rather, look at it this way - between a real prostitute imagining being a journalist, and a real journalist imagining being a prostitute, which sounds more likely? Which profession is better equipped to exploit the other?
Today I attended the debate "New Media Forums and the First Amendment", featuring Bruce Taylor of the "Free Porn" department, err, Department of Justice, on one side, and Shari Steele of the EFF on the other. Aside from the deep issues of the debate, I was able to satisfy my curiosity regarding one little mystery involving Bruce Taylor (recounted with permission).
As a bit of free, unpaid, working-for-nothing, voice-in-the-wilderness reporting, I mean "citizen journalism", I checked with him on whether he had really said something a recent Declan McCullagh CNET article quoted him again as saying:
It's not personal. Taylor relishes the chance to clash with First Amendment lawyers. "Every year we'll put a bill in there, every other year, just to keep the ACLU in business," he told me a few years ago, talking about his efforts to lobby Congress. "They should send me Christmas presents instead of hate mail. I'm putting their rotten little kids through private school."
So, knowing Declan McCullagh, I asked Bruce Taylor if he had actually said it.
He responded that he hadn't said it about the ACLU, he had been joking with Declan about porn-site lawyers, and didn't mean civil-liberties lawyers. He had written a letter of apology to various ACLU lawyers explaining he hadn't meant what was reported. I asked if the letter was available. He said, without irony, that Declan had a copy, and then listed to which ACLU lawyers he had sent it (I decided not to pursue this). These events happened years ago, note (so even if the first use of the quote, at the time, could be pleaded to be honest error, the second use, now, is surely deliberate).
Frankly, this story made much more sense. Porn-site lawyers, e.g. those who represent Larry Flynt/"Hustler" personally, often do make good money. In contrast, civil-liberties lawyers, those who bring legal cases on principle, are typically very poorly paid. Bruce Taylor surely knows this (as does Declan McCullagh). Thus a putting-kids-through-private-school joke works far better about Larry Flynt's lawyers than ACLU's lawyers. So, especially given Declan, I thought Bruce Taylor's explanation had the ring of truth. Declan McCullagh's "journalism" modus operandi is to fabricate meanings more than words.
[Jay Rosen just wrote a long article focusing on "two-way" journalism. Against my better judgment, I wrote this comment in response]
"Journalism may be a lot more interesting once it gets interested in the benefits of going both ways."
Jay, can I ask a puzzled question, illustrating my very non-journalist perspective? Honestly, I think I'm missing something in grasping the worldview of this subculture (very foreign to me).
Does the average journalist - pre-blog, pre-Internet, pre-New-Era, pre-this-changes-everything - really ordinarily think no readers can have something to say? Something intelligent to say?
I read you. I read Jeff Jarvis. I read Dan Gillmor. Etc. I keep getting the image of a scene that would fit in the old Planet Of The Apes movie, where the sentient apes in a Council are expressing their astonishment at the existence of an intelligent human species:
"What manner of a creature is this? It talks! It expresses itself in coherent sentences! But it's still a reader. How can this be? We've never seen anything like it before. Does it do tricks? Can it be trained for more complex labor? Of course, whatever higher attributes it may have, it's still a dangerous beast. But maybe it can serve us better in the future if we carefully (always maintaining ultimate control) allow it to use more of its capabilities, at our direction."
In specific, I feel like I'm looking at an article by one of those chimpanzee factions who were in favor of the utility of the humans.
Am I wrong? Or has my long acquaintance with, e.g. the "work" of Declan McCullagh, given me a skewed perspective? (maybe that's the orangutan faction, which knows the truth, but suppresses it for their religious ends?)
Or, to turn it around completely, you're claiming yourself that journalism as a whole has *never* *before* cared what readers say? (which is the logical equivalent of your original statement!)
[Update: Jay Rosen responded, in comments:
Xian has part of the answer, Seth. Someone else who does is Tim Porter at First Draft. Follow the link to some of his better posts. My short answer is this: it's not that newspapers and journalists were uninterested in "readers" or had no contact with an alien species.
The rhetoric of "serving readers" was everywhere in the industry from the late 1980s on. The Reader was constantly invoked in journalism discussions, too, but this is different from having a lot of human contact with actual readers, listening to what they say, or dealing with what they write.
Prior to the Internet, metropolitan daily journalism was pretty insulated from readers and their complaints, let alone their ideas. You have to grasp how extreme this isolation could be. A team of journalists might work for weeks on a large story, and be pleased to get three or four letters and a couple of phone calls as their total reaction. The normal condition was to hear nothing from anybody after a story.
For hard data, there was market research that told something about readers; there was also the journalist's disdain for marketing (editing by the numbers), which led to fears of "caving in" to readers. That gives you some sense of the factors that were operating... then.
]
Jay Rosen has an interesting post "The Legend of Trent Lott and the Weblogs" discussing the Shorenstein Center report "Big Media" Meets the "Bloggers" (and one post of mine is mentioned in the collection at the end - thanks much, am I blog-royalty yet?). I think I should comment at length about one particular sentence:
The Harvard study has gotten notice in Blogistan, but its stingy formatting (the pdf is encrypted and won't allow you to cut and paste) has been discussed in greater depth than the story it tells, perhaps because we think the events are well known.
That's because, recursively, the A-list hasn't been pushing it 1/2 :-). But with regard to myself, I thought I had a much higher chance of someone, somewhere, actually caring about what I had to say concerning the encryption/fair-use formatting aspect, than Yet More Punditry About Pundit Pack Propoundings.
I mean, what I wrote about "the pdf is encrypted and won't allow you to cut and paste" might realistically hope to affect the world in some small way, since I do have some small measure of expertise and status regarding the DMCA. But on the press topic itself, I'm just part of the "bunch of people ranting away on the Internet, which is nothing new".
One commenter recently noted that though I intended a blogging slowdown, I've written much recently. Mea culpa. The free-porn, Google, spam, Internet censorship, and Supreme Court combination was too tempting to let pass: Google *and* porn *and* stupid journalism tricks *and* Internet censorship laws ...
And then I wanted to capitalize on whatever traffic it might have brought me.
But looking at the numbers for that post since it was written, the audience doesn't seem to have been spectacular.
Total specific readers (unique IPs): 514
No referer: 105
librarian.net: 131
Google: 139
other search sites (may contain rebranded Google): 82
bloglines.com subscribers: 27
and miscellaneous sources
So, observe the importance of the gatekeeper - in this case, the librarian.net reference. That reference alone was comparable to my total average readership.
Too many people are looking for Free Porn in all the wrong places (though, to be fair, they may not know it).
Sigh. Again, gatekeepers or nothing. Meet the new boss, same as the old boss.
Jeneane Sessum announced "PhoneCon", "where we'll be bringing together some of the smartest minds from across the Harbor to talk about talking on the Telephone" (via JOHO).
Any similarity to "BloggerCon" is purely intentional. My favorite:
KEYNOTE ADDRESS
Tilden for America -- How the Telephone Will Affect the Hayes-Tilden Presidential Campaign of 1876. Keynote Speaker: M. SO. Trippy.
I contributed the following to the various comments:
I've heard that because anyone can become a phoner, we are all orators now. We can route around Big Podium. But the people saying this all seem to be on the A-Directory.
Hope shown at one time in history (not to draw exact parallels!):
In my view the strongest force of all, one which grew and took fresh shapes and forms every day, was the force not of any one individual, but that unmistakable sense of unanimity among the peoples of the world that war must somehow be averted.
News yesterday:
In an attempt to lasso support from Google, a key proponent of the syndication format RSS has proposed that it merge with its challenger under the auspices of an Internet standards body.
I'm not taking any sides in the "RSS Wars". I don't have a dog in that fight, and it'd be much too dangerous for me.
I just hope I can get away with one observation, a generic consideration for all who over-rate the supposed revolutionary effects of blogs and such:
How can you route around big media, revolutionize society, create new forms of participatory democracy, solve deeply complicated social problems ... when "we" CAN'T EVEN AGREE ON A FORMAT FOR WEB SITE CONTENT SYNDICATION?!
Really. Site syndication is a "little" problem. Nobody is going to literally die over it. Not like access to health care, or poverty, or world wars.
But there's no popularity points for me in saying that. No gain, no win, no benefit. It will not be amplified, megaphoned, echoed. Which in a way, is a relevant statement itself.
Anyway, I wish the peacemaking efforts well.
Given that I spent so much effort on the ability to make fair use of the Shorenstein Center report "Big Media" Meets the "Bloggers", I should use that ability myself for some commentary.
The strongest passage, which leapt out at me, is the conclusion, and he said it, not me (emphasis mine):
But if blogs offered "big media" a rich vein and a testing ground for potential story ideas, it in turn conferred legitimacy on the blogosphere, and provided the "bigger megaphones," as Atrios puts it, that the young medium needed to be heard. "Weblogs," Atrios observed, "still need the validation of print and television media--otherwise it's just a bunch of people ranting away on the Internet, which is nothing new."
Elsewhere:
"For the most part," Atrios maintains, "the influence of blogs is limited to the degree to which they have influence on the rest of the media. Except for the very top hit-getting sites, blogs need to be amplified by media with bigger megaphones."
And:
Many in the press and in the blog world gave Marshall credit for "pushing the Lott story to the forefront," as one observer wrote, "with more vigor than any other online pundit."58 Atrios, too, was credited by some with being "nearly as influential" as Marshall in calling attention to what Lott had said.59 But Atrios himself argues that Glenn Reynolds played a key role in elevating the story out of the blogosphere and into the mainstream. "The truth is," Atrios maintains, "if Glenn Reynolds hadn't taken a stand on this story, then no one would have considered the role of bloggers in [it]. ... It isn't because Glenn was the first or the most vocal. Rather it was because he has a big megaphone and real media connections."
Now, this is of course coming from these people:
This case was written by Esther Scott for Alex Jones, Director of the Joan Shorenstein Center on the Press, Politics and Public Policy, for use at the John F. Kennedy School of Government, Harvard University.
So, considering the source, it must be noted they aren't going to produce a study hyping blog-triumphalism. But the observations are very useful to those who take "everyone's a journalist" too seriously (everyone's a journalist like everyone's a potential candidate for California Governor).
There's other great stuff, such as fascinating sections which show the psychology of "pack journalism":
O'Keefe remembers that an employee of another network "had one of their producers in their [Washington] bureau look at it and later came back and said, `No, I don't think it's anything.'" This gave O'Keefe some pause, causing him to second-guess his judgment. "I think there is something to the [notion] of pack journalism," he reflects, "of individuals believing that if something is noteworthy, ... everyone will get it. ... If they didn't all get it, then it couldn't possibly be a newsworthy item."
And the mechanics of reportorial sausage-making (emphasis mine):
O'Keefe quickly contacted Linda Douglass, ABC's congressional correspondent, who began making phone calls "to a lot of different interest groups and folks" to seek a response to what Lott had said. Douglass was "trolling for reaction," as O'Keefe puts it, which was standard journalistic practice when someone had made a possibly controversial statement. The press, Halperin notes, "is usually not in the business of saying, `Oh my God, this is outrageous,' but rather of asking someone else [to express an opinion]."
In other words, if you're a journalist, and you want to write "This is an outrage!", you don't come right out and write "This is an outrage!". Rather, you call around to the various groups you know, and see if you can "troll" someone to say it. So you can write, that in reaction to X's remark, Y said "This is an outrage!". There seems to be something wrong with a system where disguising the editorializing via a straw-mouthpiece is acceptable.
I'm reminded, to connect to a different story that has some parallels, that I've seen this as part of Declan McCullagh's technique in proselytizing Libertarianism. For example, in the Al Gore Internet hit piece, he studiously avoided asking anyone who had actually been involved in technologically inventing the Internet, and got reactions only from right-wing and Libertarian-type flacks. No accident; he knew exactly what they would say.
Anyway, the whole report strikes me as an interesting view into the perspective of insiders as they work out how to place the new niche into the predator-prey-fodder foodchain.
Tech Law Advisor (Kevin Heller) very kindly noted my previous "fair use" post, but the summary was just a little bit off:
Seth F. makes fair use of the report on "Big Media" Meets The "Bloggers" [pdf] by printing to file and removing some text in a nicely marked tag that says "Do not remove this tag under penalty of (DMCA) law".
Umm ... no offense meant. But the whole point of my postings is to avoid removing anything from that tag, because to do so is arguably a DMCA violation. And Adobe does not play nice with programmers who decrypt PDF's (note the Tech Law Advisor item was written before I updated my post with a procedure that could more closely be misdescribed per above).
[Not-a-digression: People don't understand why I worry so much about the impact of, e.g., a hatchet-job from Declan McCullagh, or a Slashdot-smear given the de facto support of "editor" Michael Sims. If I get into serious DMCA trouble, I'm never going to be able to defend myself from malicious press attacks.]
Anyway, if one prints (with the Adobe Acrobat reader) a usage-restricted PDF document to a file, that file begins with the following almost literal "Do Not Remove This Tag Under Penalty Of Law (DMCA)" notice:
% Removing the following eight lines is illegal, subject to the Digital Copyright Act of 1998.
mark currentfile eexec
54dc5232e897cbaaa7584b7da7c23a6c59e7451851159cdbf40334cc2600
30036a856fabb196b3ddab71514d79106c969797b119ae4379c5ac9b7318
33471fc81a8e4b87bac59f7003cddaebea2a741c4e80818b4b136660994b
18a85d6b60e3c6b57cc0815fe834bc82704ac2caf0b6e228ce1b2218c8c7
67e87aef6db14cd38dda844c855b4e9c46d510cab8fdaa521d67cbb83ee1
af966cc79653b9aca2a5f91f908bbd3f06ecc0c940097ec77e210e6184dc
2f5777aacfc6907d43f1edb490a2a89c9af5b90ff126c0c3c5da9ae99f59
d47040be1c0336205bf3c6169b1b01cd78f922ec384cd0fcab955c0c20de
000000000000000000000000000000000000000000000000000000000000
cleartomark
Is it actually illegal to remove the lines under the DMCA? Maybe. Again, talk about "Do Not Remove This Tag Under Penalty Of Law"!
Now, this is well-known, and the decryption of it can even be found in a PDF FAQ
But my observation was that for the particular document under discussion the relevant text was already in the clear at this point. So one didn't have to circumvent the PDF restrictions, merely extract the available text.
I hope the previous many lines are not illegal, subject to the Digital Copyright Act of 1998. I hope.
Dowbrigade has sad comments on difficulty in making fair use of the Shorenstein Center report "Big Media" Meets the "Bloggers": (link credit Dave Winer)
The weird thing is the extent to which the authors have gone to make sure this milestone article in the academic history of the Blogosphere is unbloggable. Excerpts or selections of the text cannot be saved, or copied and pasted. The document cannot be converted to another format or saved as anything else. ... The selection below were typed out by the Dowbrigade, letter by letter.
It takes a very twisted view for a court to believe things like this do not impinge on fair use rights ...
The encryption used here is well-known, and trivially within my technical ability to decrypt. But given what happened to the last guy who programmed about PDF files and decryption (the name Dmitry Sklyarov might ring a bell), I'll let someone else take the risk of an unquestioned DMCA 1201(a)(2) violation.
Instead, I'll note a very simple way to get usable text from the restricted file. Observe that printing is allowed. Now, one does not have to get fancy with OCR or images. Simply do a version of the "analog hole". The document can be printed. The printing process has the ability to print to a file. Use that option. That is, print the document to a file instead of directly to a printer. This produces a file in a different format.
There's a "Do not remove this tag under penalty of (DMCA) law" bit of code in that file, which handles the security for usage restrictions. HOWEVER, the text of the document itself is in the clear at this point! All that's needed is to make it more usable. So extract the text chunk from any line in the file that starts with a left parenthesis or ends with a right parenthesis (no text chunk spans more than two lines).
That is, cough, I meant to say,
perl -n -e 'print $1 if (/^\(([^)]+)/ || /([^)]+)\)$/);' < shorenstein.ps
[I think I'm allowed to write the English statement, but in peril with the Perl statement, at least under current court precedents]
All done. You now have a file of text which, though not all that pretty in formatting, is quite amenable to cut-and-paste.
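For readers who don't speak Perl one-liners, the same extraction can be sketched in Python. The function name is mine, and like the one-liner it deliberately ignores nested parentheses and PostScript escape sequences, matching the simple situation described above rather than general PostScript:

```python
import re

def extract_text_chunks(ps_lines):
    """Collect text from a print file where each text chunk appears as a
    parenthesized string that opens at the start of a line or closes at
    the end of one. Mirrors the Perl one-liner above; not a general
    PostScript parser (no nesting, no escape handling)."""
    pieces = []
    for line in ps_lines:
        m = re.match(r'\(([^)]+)', line)        # chunk opens on this line
        if not m:
            m = re.search(r'([^)]+)\)$', line)  # chunk closes on this line
        if m:
            pieces.append(m.group(1))
    return ''.join(pieces)
```

Run over the lines of the printed-to-file output, it yields the same rough-but-pasteable text stream.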
Does even this post violate the DMCA? Is it trafficking in "technology" that "is marketed by that person ... for use in circumventing a technological measure that effectively controls access to a work protected under this title"?
You guys at Harvard will defend me, right? Right? Right? ...
Disclaimer: No encryptions were broken in the making of this post.
[UPDATE: I found a simpler, better procedure (all of the following are standard Linux programs). Use the program xpdf to generate the PostScript print file. This program obeys the usage restrictions itself, but does NOT insert the usage-restriction code in the generated print output.
Then use ps2pdf13 to generate a PDF file from the print file (the default 1.2 version didn't work well; 1.3 works better).
This new PDF file is not usage restricted!
Then run pdftotext over this new file ... and presto, a pretty text version!
I'm really worried now ...
]
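The update's three steps can be scripted. This sketch is mine, not the post's: it assumes the xpdf suite's command-line pdftops tool stands in for printing to a file from the xpdf viewer, with Ghostscript's ps2pdf13 for the PostScript-to-PDF step, and the filenames are illustrative:

```python
import subprocess

def unrestrict_commands(pdf_in, txt_out, run=False):
    """Return (and optionally run) the three-step pipeline described in
    the update. Tool names are assumptions about the reader's system:
    pdftops (xpdf suite), ps2pdf13 (Ghostscript), pdftotext (xpdf suite)."""
    cmds = [
        ["pdftops", pdf_in, "print.ps"],        # print file; no restriction code emitted
        ["ps2pdf13", "print.ps", "clean.pdf"],  # rebuild as PDF 1.3; restrictions gone
        ["pdftotext", "clean.pdf", txt_out],    # presto, a pretty text version
    ]
    if run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds
```

By default it only returns the command list; pass run=True to actually execute the tools on a system that has them.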
I probably shouldn't waste my time writing these posts, but the recent Supreme Court argument over net censorship struck a deep chord with me:
Ms. Beeson argued that there were less restrictive alternatives to the pornography law: parents could now take matters into their own hands by using Internet filtering software and configuring it to reflect their own values. Congress already requires that schools and libraries use filters.
Chief Justice William H. Rehnquist and Justice Antonin Scalia seemed skeptical of that argument, however, and both noted that the civil liberties union had opposed the library filtering bill. Mr. Olson also noted that a number of Web sites gave step-by-step instructions on defeating the technology.
Here - not ancient history, not years ago, but in this week's Supreme Court arguments over an Internet censorship law - is an illustration of the problem I faced for so many years. Because part of the civil-liberties strategy was, and remains, arguing favorably about censorware in this legal context. See Peter Junger's "least restrictive means" message for the best legal analysis (in my view).
I never opposed this as a legal argument. But for too long, for too many prominent people, that legal argument turned into a social argument for touting censorware. And so ...
If you said censorware didn't work, you were going against the strategy.
And that was bad. And thus the censorware critics had to be discredited. And here my trouble began.
In 1995, when I first decrypted censorware, I called my then-friend Mike Godwin, famous net.legend Internet civil-liberties lawyer, for help. Well, at that time, he was making policy advocacy statements such as:
This is why I believe that the right role for Congress to play is to encourage the development of software filters that prevent my child and others from being harmed in the first place.
Recall that the basic technology we're talking about here is the computer -- the most flexible, programmable, "intelligent" technology we build and market.
-- Mike Godwin, 1995 Congressional testimony
Thus he was not pleased to be informed about censorware's lack of "intelligent" technology. And I got an earful of all the (my description) dirty deals that people were trying to cut behind the scenes. I suppose now it's no secret that the ACLU blew me off when I tried to get their help (I still have the messages). But they didn't go on a personal attack-campaign about it.
Anyway, much has happened since then. However, some of the fundamental paradoxes are still in evidence - this week, in the Supreme Court.
I note this in an attempt at a "teachable moment". When I try to explain the background of censorware politics, the factors which caused things to evolve as they did, I often get trivialization and dismissiveness ("Petty bickering! Size measuring! Pissing contest!"). It's so easy to scream "EGO!", which means you don't have to think about anything.
There were, and are, reasons which drove it all, and still matter right now. But looking back on how it affected me, over nearly a decade: If I had to do it all over again, I wouldn't. Personally, it wasn't worth it.
[Yes, this post really seriously concerns *all* the topics listed, it's truly that _tour de force_]
The Supreme Court just heard arguments on another Internet censorship law, "COPA", ( Ashcroft v. ACLU, 03-218). The Boston Globe reported:
Ordinarily, US Solicitor General Theodore B. Olson prepares for an appearance before the Supreme Court by acting out his argument before a pretend court. This time, for a case about the Internet, he added a new twist: searching online for free porn.
At his home last weekend, Olson told the justices yesterday, he typed in those two words in a search engine, and found that "there were 6,230,000 sites available."
The top lawyer who represents the Bush administration before the Supreme Court said the search's results illustrate how pornography on websites "is increasing enormously every day," a central point in his argument for saving an antipornography law that was enacted six years ago but has yet to go into effect.
Now, let's do something often unrewarded in this world - think. What search did he do exactly? It seems to be the following search in Google:
http://www.google.com/search?q=free+porn
That now gives me "about 6,320,000" results - close enough; the total number returned often varies a bit.
Now, what that search means is roughly the number of pages containing the words "free" and "porn" anywhere in the entire page (or in links with those words). This blog entry will qualify as one of those results as soon as it is indexed. I don't think this blog entry is proof that pornography on websites "is increasing enormously every day," much less of the need for an Internet censorship law.
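The flaw is easy to see with a toy example (the corpus and page names here are invented for illustration): a word-match query counts every page that merely mentions both words, not just the sex sites.

```python
# Hypothetical three-page "web": only one page is actually a porn site,
# but all three contain the words "free" and "porn" somewhere.
pages = {
    "blog-post-about-the-copa-case": "the solicitor general searched for free porn",
    "library-filtering-faq": "filters block free speech as well as porn",
    "actual-porn-site": "free porn here",
}

query = {"free", "porn"}

# A word-match "search engine": a page matches if it contains every query word.
matches = [name for name, text in pages.items()
           if query <= set(text.split())]

# All three pages match the query, though only one is a porn site -
# the count measures word co-occurrence, not pornography.
```

Scale the same overcounting up to the whole web and you get "6,230,000 sites available."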
I've written about the problems of Google and stupid journalism tricks before. But, sigh, nobody reads me, so this won't get reported. Anyway, the story gets even better.
I started digging down into the results to see if I could find some non-sex-site mentions before the Google 1000-results display limit (yes, Mr. Olson, there are more than 1000 sites devoted to sex in the world, that's true). Google's display crashed in the high 800's! That is, for:
http://www.google.com/search?q=free+porn&num=100&start=900
it displayed at the bottom:
In order to show you the most relevant results, we have omitted some entries very similar to the 876 already displayed.
If you like, you can repeat the search with the omitted results included.
The number varies, but it's been under 900.
Joke: Hear ye! Hear ye! Instead of "6,230,000 sites available", there are really fewer than 900 unique results! At least, according to Google.
Now, this is the Google display crash from bugs in the Google spam filtering. Google has cleaned up their index, so the crash is not happening on the first screen of results. But it's still in their results-display code. Usually, people don't see the bug in practice, since the crash has now been pushed very far down in the sequence of results.
But here I had a reason to go looking as far out as I could, and I ran into the crash in a bona-fide real-world situation. And not a trivial query either, but one with profound implications for Censorship Of The Internet.
[Update 3/4: Michael Masnick brings to my attention that what I thought was the old Google spam crash is now reduced to duplicate-removal processing on the 1000-results display limit - the point is still that I can use fallacious, superficial search "logic" to assert there are fewer than 900 sites, because Google "says" so. But the technical reason is not quite what I wrote originally]
Humor: If the evidence from a Google search was good enough to be used to justify censorship when it said "6.2 million", why isn't it good enough to justify no censorship if on further investigation it says less than 900? That is, if you thought it was valid before, with a big number, why isn't it valid now, with a small number? (garbage in, garbage out)
Look at me, I'm a journalist (or grandstanding lawyer) - Google says there's practically no porn on the net!