Data Redaction: You're Doing it Wrong
PDF files are a common way to distribute documents on the Internet, including documents with redacted (removed) content. However, when you distribute a redacted document, make sure that the data you don't want out there isn't, in fact, still in the file.
Case in point: the upcoming trial of former Governor Rod Blagojevich. He just submitted a motion to force President Obama to testify during his criminal trial. As you can imagine, there is sensitive information in the motion. You can read the motion here. The areas that are redacted are pretty obvious. Now, hit Control-A, then Control-C. Open a text editor or Microsoft Word (or the like) and paste with Control-V.
Hello, Mr. Face. Meet Mr. Palm. This particular mistake isn't new. There was a well-publicized SNAFU in which the US Department of Defense published a redacted document containing classified information, which was happily leaked on the Internet using the same method.
If the data is important enough to redact, it is probably important enough to verify that the data is actually gone. Of course, this is a problem for more than just PDF documents. An amusing HR trick is to take a look at Microsoft Word resumes, particularly the "Track Changes" history.
The takeaway is to use commercial tools (or tools specifically designed for the task) to delete, not just mask, redacted information, and to check that the redacted information is not easily retrievable... especially with something as trivial as "Copy-Paste". If you are too stingy for a commercial software package, just print the document with the redacted portions and re-scan it as PDF to ensure the text is gone.
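The "check before you publish" step can be scripted. What follows is a minimal sketch, not a vetted redaction audit: it assumes the `pdftotext` command-line tool from Poppler is installed, and the function names (`extract_text`, `find_leaks`) are made up for this illustration. It pulls the text layer out of the PDF and reports which of the phrases you meant to redact still appear in it.

```python
import re
import subprocess
from typing import List


def extract_text(pdf_path: str) -> str:
    """Return the PDF's text layer via the pdftotext CLI (assumed to be installed)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def _normalize(text: str) -> str:
    # Collapse whitespace and ignore case so line wraps do not hide a match.
    return re.sub(r"\s+", " ", text).strip().lower()


def find_leaks(extracted_text: str, redacted_phrases: List[str]) -> List[str]:
    """Return the phrases that are still present in the extracted text."""
    haystack = _normalize(extracted_text)
    return [
        phrase
        for phrase in redacted_phrases
        if _normalize(phrase) and _normalize(phrase) in haystack
    ]
```

Typical use would be `find_leaks(extract_text("motion.pdf"), phrases)`, where the file name and the phrase list are yours to supply. An empty result only means those phrases are not in the extractable text layer; it does not prove the data is absent from the file.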
(You can read about the issue in this article, which is heavy on the facts of the particular trial in question.)
--
John Bambenek
bambenek at gmail /dot/ com
Comments
My take on this has always been to use print-to-PDF functionality at the time of publication.
This way, the PDF mimics paper and contains only the information wanted and nothing else.
Prontissimo
Apr 23rd 2010
1 decade ago
But with a simple copy/paste the report was published in clear text (you can read the full story here: http://www.voltairenet.org/article30249.html)
So it seems that almost 5 years later someone forgot the lesson.
Sticky
Apr 23rd 2010
1 decade ago
I tested print to PDF on the subject document from Evince on Linux, and the copy/paste of redacted sections revealed nothing. I'd be dang cautious about assuming that it's not there at all, though. I have no idea how the viewing app & print driver conspire to compose the output based on the input. I'd not be surprised if results could vary from software to software, too.
Ken
Apr 23rd 2010
1 decade ago
Open a vector drawing program, type in some text, then draw a white rectangle over some of the text. Print through a PDF queue. Voila - you can still select and copy the overlaid text.
When Word prints black text on a black background to a PDF queue, it converts it to PostScript (still black text on a black background), which gets cleansed a little and compressed and turned into PDF. But the PostScript instructions inside the PDF file still say, "Print this text in black with a black background". Which means the content of the text is still there.
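To make that concrete, here is a small sketch of what such a page can look like internally. PDF page content is written with PostScript-like drawing operators; the content stream below is a hypothetical, uncompressed example (real files usually compress it), and the regular expression is a simplification that only handles plain literal strings. The text is drawn first, then a filled black rectangle is painted over it, and the text is still trivially recoverable.

```python
import re

# Hypothetical, uncompressed page content stream:
#   BT ... Tj ET  draws a line of text
#   0 0 0 rg      sets the fill color to black
#   ... re f      draws and fills a rectangle over the text
CONTENT_STREAM = b"""
BT /F1 12 Tf 72 700 Td (Name of the witness) Tj ET
0 0 0 rg
70 695 180 16 re f
"""

# A literal string operand followed by the Tj (show text) operator.
# Simplified: nested parentheses and hex strings are not handled.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_text(stream: bytes) -> list:
    """Return every string passed to Tj, whether or not something is painted over it."""
    return [m.group(1).decode("latin-1") for m in SHOW_TEXT.finditer(stream)]


print(shown_text(CONTENT_STREAM))  # ['Name of the witness']
```

The rectangle changes what you see, not what the file says; a copy/paste or text extraction reads the `Tj` operands and ignores the paint on top.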
As for metadata, Properties on the Blagojevich document shows Aaron Goldstein listed as the author and the title as "Microsoft Word - motion to subpoena president redacted", which supports my assertion that they probably just highlighted the text with a black background in Microsoft Word.
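The same Properties fields can be pulled out with a few lines of code. This is a rough sketch under stated assumptions: it only works when the document information dictionary is stored uncompressed, unencrypted, and as plain literal strings, and the `SAMPLE` bytes are a hand-written stand-in built from the values quoted above, not bytes taken from the actual file. The helper name `info_fields` is made up for this example.

```python
import re

# Matches entries such as "/Author (Some Name)" in an uncompressed Info dictionary.
# Simplified: nested parentheses, hex strings, and UTF-16 strings are not handled.
INFO_FIELD = re.compile(
    rb"/(Author|Title|Creator|Producer)\s*\(((?:\\.|[^\\()])*)\)"
)

# Hand-written stand-in for an Info dictionary (an assumption for illustration).
SAMPLE = (
    b"<< /Title (Microsoft Word - motion to subpoena president redacted) "
    b"/Author (Aaron Goldstein) >>"
)


def info_fields(raw_pdf: bytes) -> dict:
    """Return the Info dictionary fields found in the raw bytes of a PDF."""
    return {
        m.group(1).decode("ascii"): m.group(2).decode("latin-1")
        for m in INFO_FIELD.finditer(raw_pdf)
    }


print(info_fields(SAMPLE))
```

For a real file you would pass in the result of `open(path, "rb").read()`; an empty dictionary means only that this simple pattern found nothing, not that the file carries no metadata.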
I imagine that if you were experienced with a PDF inspection tool (which I am not), you might be able to find even more interesting things.
One tool worth keeping around is the Remove Hidden Data tool for Office XP/2003 (the same thing is built into Office 2007). I can't vouch for it being perfect, but it's probably better than nothing.
Also, one other approach similar to the print and scan solution would be to print through a TIF print queue (like the Microsoft Office Document Image Writer queue that comes with Office 2003) and then PDF the resultant TIF file. Still worth looking for potentially incriminating metadata. This approach avoids the noise and alignment issues in the print and scan approach.
anonymous
Apr 23rd 2010
1 decade ago
grimmfarmer
Apr 23rd 2010
1 decade ago
Why not just delete it totally?
They could have been real clever and made it white on white.
m4tt
Apr 23rd 2010
1 decade ago
Tisiphone
Apr 23rd 2010
1 decade ago
http://www.nsa.gov/ia/_files/support/I733-028R-2008.pdf
phred
Apr 23rd 2010
1 decade ago
Mark Miller
VP Sales
www.extractsystems.com
Mark Miller
Apr 26th 2010
1 decade ago