January 14, 2005

Making Fair Use of cut-and-paste restricted PDF files

PDF files with usage restrictions often pose a problem regarding how to exercise one's fair-use right to quote excerpts. Back last March, I wrote about how to do "permission arbitrage", in a post "Making Fair Use of the Report on "Big Media" Meets The "Bloggers"" (there's a certain amount of irony there ...).

It seems as relevant now as it was then, so I'll repost it today.


Dowbrigade has sad comments on the difficulty of making fair use of the Shorenstein Center report "Big Media" Meets the "Bloggers": (link credit Dave Winer)

The weird thing is the extent to which the authors have gone to make sure this milestone article in the academic history of the Blogosphere is unbloggable. Excerpts or selections of the text cannot be saved, or copied and pasted. The document cannot be converted to another format or saved as anything else. ... The selection below were typed out by the Dowbrigade, letter by letter.

It takes a very twisted view for a court to believe things like this do not impinge on fair-use rights ...

The encryption used here is well-known, and trivially within my technical ability to decrypt. But given what happened to the last guy who programmed about PDF files and decryption (the name Dmitry Sklyarov might ring a bell), I'll let someone else take the risk of an unquestioned DMCA 1201(a)(2) violation.

Instead, I'll note a very simple way to get usable text from the restricted file. Observe that printing is allowed. Now, one does not have to get fancy with OCR or images. Simply do a version of the "analog hole". The document can be printed. The printing process has the ability to print to a file. Use that option. That is, print the document to a file instead of directly to a printer. This produces a file in a different format.

There's a "Do not remove this tag under penalty of (DMCA) law" bit of code in that file, which handles the security for usage restrictions. HOWEVER, the text of the document itself is in the clear here! All that's needed is to make it more usable. So extract the text chunk from every line in the file that starts with a left parenthesis or ends with a right parenthesis (no text chunk has a segment spanning more than two lines).

That is, cough, I meant to say,

perl -n -e 'print $1 if (/^\(([^)]+)/ || /([^)]+)\)$/);' < shorenstein.ps

[I think I'm allowed to write the English statement, but in peril with the Perl statement, at least under current court precedents]
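
For readers who don't read Perl, here is a rough Python sketch of the same filter. It reuses the two patterns from the one-liner; the function name and the sample lines are made up for illustration and are not taken from the actual report. One small difference: this version strips the trailing newline before matching, whereas the Perl capture can include it.

```python
import re

# Same two patterns as the Perl one-liner: a line that starts with "("
# or a line that ends with ")".
STARTS = re.compile(r"^\(([^)]+)")
ENDS = re.compile(r"([^)]+)\)$")


def extract_text(lines):
    """Join the text captured from lines that open or close a (...) string."""
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        # Like the Perl "||", the second pattern is tried only if the first fails.
        match = STARTS.search(line) or ENDS.search(line)
        if match:
            chunks.append(match.group(1))
    return "".join(chunks)


# Hypothetical print-file lines: one complete string, one drawing command,
# and one string split across two lines.
sample = [
    "(Big Media ) show\n",
    "72 720 moveto\n",
    "(Meets the Blog\n",
    "gers)\n",
]
print(extract_text(sample))  # prints: Big Media Meets the Bloggers
```

As with the Perl version, the result is one run of text with no formatting, which is all cut-and-paste needs.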

All done. You now have a file of text which, though not all that pretty in formatting, is quite amenable to cut-and-paste.

Does even this post violate the DMCA? Is it trafficking in "technology" that "is marketed by that person ... for use in circumventing a technological measure that effectively controls access to a work protected under this title."?

You guys at Harvard will defend me, right? Right? Right? ...

Disclaimer: No encryptions were broken in the making of this post.

[UPDATE (from March 2004): I found a simpler, better procedure (all of the following are standard Linux programs):

Use the program xpdf to generate the PostScript print file. This program obeys the usage restrictions itself, but does NOT insert the usage-restriction code in the generated print output.

Then use ps2pdf13 to generate a PDF file from the print file (the default 1.2 version didn't work well, 1.3 works better).

This new PDF file is not usage restricted!

Then run pdftotext over this new file ... and presto, a pretty text version!
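
The three steps above can be strung together roughly as follows. This is a sketch under assumptions, not a tested recipe: it assumes the PostScript step is done with pdftops (the command-line converter shipped with xpdf; the steps above only say "xpdf", which could equally mean printing to a file from the viewer), that the PostScript-to-PDF step is Ghostscript's ps2pdf13 wrapper, and that the input is named shorenstein.pdf. The function names and output file names are invented for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path):
    """Return the three command lines for the print-to-file route."""
    src = Path(pdf_path)
    ps_file = src.with_suffix(".ps")                    # print file
    new_pdf = src.with_name(src.stem + "-printed.pdf")  # regenerated PDF
    txt_file = src.with_suffix(".txt")                  # final text
    return [
        ["pdftops", str(src), str(ps_file)],        # assumed xpdf CLI tool
        ["ps2pdf13", str(ps_file), str(new_pdf)],   # PDF 1.3 output
        ["pdftotext", str(new_pdf), str(txt_file)],
    ]


def run_pipeline(pdf_path):
    """Run each step in order, stopping at the first failure."""
    for cmd in build_commands(pdf_path):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands; call run_pipeline() to actually execute them.
    for cmd in build_commands("shorenstein.pdf"):
        print(" ".join(cmd))
```

Since printing is the one thing the file permits, the first step should go through if the tool honors that permission; if it refuses, nothing later in the chain will run.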

I'm really worried now ...]

By Seth Finkelstein | posted in dmca , security | on January 14, 2005 09:32 AM (Infothought permalink) | Followups


Seth, you rock.

I'm still mildly pissed because a "book note" *I* published in the Harvard Journal of Law and Technology is (a) online, but (b) in unprintable PDF. I remember flinging copious techie measures at it, eventually producing (with the aid of a cracker-esque friend) a printable version -- albeit ballooned to a 10+ MB file. I think ghostscript was heavily involved. I did so (far too stubborn to call and explain to whoever is running that journal now that they really ought to send me an unrestricted version) in full awareness that I was violating the DMCA even though I actually own the copyright to the damn article.

I think those of us on the left need to start thinking more critically about fair use. It SHOULD be a "right," but I don't think it really is. It's more a "defense." I.e., "They can't punish you for doing it, but they're not required to make it easy."

As for your "circumvention" -- again, good for you. I really wouldn't worry though. I think they have better things to do than come after you for merely explaining what their code does. Really, that's all you did. (Don't take this as formal magic legal advice, of course, but you know that...) I really think that's different, as a matter of First Amendment principle, though I can't cite a case to support it. You didn't publish your own product, you just said "hey, public, here's the line of code in their software which blocks access!"

That's a bit of a fine distinction, but if it's not meaningful, I think all really is lost.

Posted by: Paul Gowder at January 14, 2005 10:59 AM

OMG, this should be so obvious.
You rock!
Let me see if my windows tools can do this.

Posted by: Firas at January 14, 2005 04:13 PM

Ah.. failed.. got a non-restricted PDF but with nothing in it except an error message when printing to a (non-adobe) freeware PDF maker.

Posted by: Firas at January 14, 2005 04:18 PM

xpdf + ghostscript did the trick. Nice.

By the way, by "well-known", do you mean a standard algorithm? (e.g. DES etc.?)

Posted by: Firas at January 14, 2005 08:29 PM

Even easier if you're on a Mac. Grab a copy of Trapeze, drag and drop the "copy-restricted" PDF to it, and Presto! it's converted into your choice of text, rtf, or html. Easy.

Posted by: Clark at January 21, 2005 10:16 PM