PDF files with usage restrictions often raise the problem of how to exercise one's fair-use right to quote excerpts. Last March, I wrote about how to do "permission arbitrage" in a post titled "Making Fair Use of the Report on "Big Media" Meets The "Bloggers"" (there's a certain amount of irony there ...).
It seems as relevant now as it was then, so I'll repost it today.
The weird thing is the extent to which the authors have gone to make sure this milestone article in the academic history of the Blogosphere is unbloggable. Excerpts or selections of the text cannot be saved, or copied and pasted. The document cannot be converted to another format or saved as anything else. ... The selection below were typed out by the Dowbrigade, letter by letter.
It takes a very twisted view for a court to believe things like this do not impinge on fair-use rights ...
The encryption used here is well-known, and trivially within my technical ability to decrypt. But given what happened to the last guy who programmed about PDF files and decryption (the name Dmitry Sklyarov might ring a bell), I'll let someone else take the risk of an unquestioned DMCA 1201(a)(2) violation.
Instead, I'll note a very simple way to get usable text from the restricted file. Observe that printing is allowed. One does not have to get fancy with OCR or images; simply do a version of the "analog hole". The printing process has the ability to print to a file, so use that option: print the document to a file instead of directly to a printer. This produces a file in a different format (here, PostScript).
There's a "Do not remove this tag under penalty of (DMCA) law" bit of code in that file, which handles the security for usage restrictions. HOWEVER, the text of the document itself is in the clear here! All that's needed is to make it more usable. So extract the text chunk from any line in the file that starts with a left parenthesis or ends with a right parenthesis (no text chunk spans more than two lines).
That is, cough, I meant to say,
perl -n -e 'print $1 if (/^\(([^)]+)/ || /([^)]+)\)$/);' < shorenstein.ps
[I think I'm allowed to write the English statement, but in peril with the Perl statement, at least under current court precedents]
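For anyone who finds the one-liner terse, here is the same line filter spelled out as a minimal Python sketch. It is a paraphrase of the statement above, not a different method: the function name extract_text is my own invention, and like the original it assumes each text chunk sits on a line that starts with "(" or ends with ")", and it makes no attempt to handle escaped parentheses.

```python
import re

# The same two patterns as the Perl one-liner, tried in the same order.
STARTS_WITH_PAREN = re.compile(r"^\(([^)]+)")   # line begins with "("
ENDS_WITH_PAREN = re.compile(r"([^)]+)\)$")     # line ends with ")"


def extract_text(lines):
    """Collect the text chunks from lines of a print file.

    `lines` is any iterable of lines with their trailing newlines intact,
    for example an open file object. Chunks are joined with no separator,
    just as the one-liner prints them back to back.
    """
    chunks = []
    for line in lines:
        match = STARTS_WITH_PAREN.search(line) or ENDS_WITH_PAREN.search(line)
        if match:
            chunks.append(match.group(1))
    return "".join(chunks)
```

Calling extract_text on the opened print file (shorenstein.ps in my case) and writing out the result should give the same rough-but-pasteable text.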
All done. You now have a file of text which, though not all that pretty in formatting, is quite amenable to cut-and-paste.
Does even this post violate the DMCA? Is it trafficking in "technology" that "is marketed by that person ... for use in circumventing a technological measure that effectively controls access to a work protected under this title."?
Disclaimer: No encryptions were broken in the making of this post.
[UPDATE (from March 2004): I found a simpler, better procedure (all of the following are standard Linux programs):
Use the program xpdf to generate the PostScript print file. This program obeys the usage restrictions itself, but does NOT insert the usage restriction code in the generated print output.
Then use ps2pdf13 to generate a PDF file from the print file (the default 1.2 version didn't work well, 1.3 works better).
This new PDF file is not usage restricted!
Then run pdftotext over this new file ... and presto, a pretty text version!
I'm really worried now ...]