Automating PDF Redaction using the Datalogics PDF Java Toolkit

Sample of the Week:

Joel Geraci

In an earlier article on this blog, I discussed why and how to redact PDF files using the Datalogics PDF Java Toolkit. This article is about the RedactionAnnotDemo sample, but it covers a lot more than redaction. The sample demonstrates several concepts that are key to working with PDF files in the Toolkit: text extraction, getting word locations, and creating annotations. Once you understand these concepts, it’s relatively easy to create an application that searches for and redacts specific words and phrases in just about any PDF file.

One of the idiosyncrasies of PDF, and part of its brilliance, is that the instructions that place a group of letters close enough together for the human eye to read them as a word can be spread out all over the file. The two lines below both represent the word “Datalogics.” The only difference is the tracking that was applied in Adobe Illustrator. On the page, the two words look almost identical; the average person would need a light table to see the difference in character spacing.

[(Da)4(talog)6(ics)]TJ
[(D)1.6(a)4.5(t)-8.5(al)-4.4(o)-9.9(g)0.5(i)-1.2(c)-13(s)]TJ
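
To see why both lines read as the same word, here is a toy sketch in Java. It is not how the Toolkit’s TextExtractor works; the class name TjArrayText and its join method are made up for illustration, and the sketch assumes plain literal strings with no escapes and a font whose character codes map straight to ASCII. It simply ignores the spacing numbers and joins the strings that remain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration only: joins the string operands of a TJ array. */
final class TjArrayText {
    // Matches one literal string operand such as (Da). Escapes and nested
    // parentheses are deliberately not handled in this sketch.
    private static final Pattern LITERAL = Pattern.compile("\\(([^()\\\\]*)\\)");

    static String join(String tjArray) {
        StringBuilder text = new StringBuilder();
        Matcher m = LITERAL.matcher(tjArray);
        while (m.find()) {
            // The numbers between the strings only adjust spacing, so skip them.
            text.append(m.group(1));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String first = "[(Da)4(talog)6(ics)]TJ";
        String second = "[(D)1.6(a)4.5(t)-8.5(al)-4.4(o)-9.9(g)0.5(i)-1.2(c)-13(s)]TJ";
        System.out.println(join(first));
        System.out.println(join(second));
    }
}
```

Running main prints Datalogics twice, once for each line.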

Unlike computers, humans are very good at recognizing words; that’s why CAPTCHAs work as a way to keep robots from signing up for their own Facebook accounts. To extract text from a PDF file, whatever library you use has to find the instructions that draw each word, wherever they sit in the file, and reassemble them into that word. The Datalogics PDF Java Toolkit contains a TextExtractor class that extracts text from the content streams of a PDF document. Developers can then iterate over the words, pick out the ones they are interested in, get their locations and bounding boxes, and add redaction marks to them.
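
Once the words have been extracted, the search step can be as simple as comparing each word against a list of terms. The sketch below uses plain strings as a stand-in for the words the Toolkit returns; RedactionTermMatcher, normalize, and findMatches are hypothetical names, not Toolkit API. It assumes the terms are single words stored in lower case, and it ignores punctuation attached to a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: finds which extracted words match a redaction term. */
final class RedactionTermMatcher {
    /** Trims leading and trailing non-letter, non-digit characters, then lower-cases. */
    static String normalize(String word) {
        return word.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "")
                   .toLowerCase(Locale.ROOT);
    }

    /** Returns the indices of the extracted words that match any term. */
    static List<Integer> findMatches(List<String> extractedWords, Set<String> lowerCaseTerms) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < extractedWords.size(); i++) {
            if (lowerCaseTerms.contains(normalize(extractedWords.get(i)))) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

The returned indices identify which words to mark. In a real application you would keep the Toolkit’s own word objects rather than strings, so that each match still carries the quads needed for the annotation.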

Word locations are expressed as QuadPoints, or “quads”: the coordinates of the lower-left, lower-right, upper-left, and upper-right corners of the bounding rectangle. For most words, even rotated words like the one in the illustration below, a single quad is all that’s required to define the rectangle that encompasses the entire word.

[Illustration: the quads of a rotated word]

Because PDF annotations also use quads, it’s easy to use the quads from a word to create an annotation that fits precisely on top of it.

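// Create a redaction annotation in the document.
PDFAnnotationRedaction annot = PDFAnnotationRedaction.newInstance(pdfDoc);
// Reuse the word's quads as the area to be redacted.
annot.setQuads(word.getBoundingQuads());
// Set the annotation rectangle to the bounding box of that area.
annot.setRect(annot.getRedactionAreaBBox());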
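
The last call above sets the annotation rectangle from the bounding box of the redaction area. For a rotated word, that box is larger than the quad itself. As a rough sketch of the geometry, and not the Toolkit’s own implementation, the hypothetical QuadBounds.boundingBox helper below computes the axis-aligned box around a single quad given as eight numbers.

```java
/** Hypothetical helper: axis-aligned bounding box of one quad. */
final class QuadBounds {
    /**
     * Takes a quad as eight numbers x1, y1, ..., x4, y4 and returns
     * {llx, lly, urx, ury}. The order of the corners does not matter here.
     */
    static double[] boundingBox(double[] quad) {
        if (quad.length != 8) {
            throw new IllegalArgumentException("a quad has 8 coordinates");
        }
        double minX = quad[0], maxX = quad[0];
        double minY = quad[1], maxY = quad[1];
        for (int i = 2; i < 8; i += 2) {
            minX = Math.min(minX, quad[i]);
            maxX = Math.max(maxX, quad[i]);
            minY = Math.min(minY, quad[i + 1]);
            maxY = Math.max(maxY, quad[i + 1]);
        }
        return new double[] {minX, minY, maxX, maxY};
    }
}
```

For a word rotated by roughly 37 degrees with corners at (100, 100), (140, 130), (94, 108), and (134, 138), the result is {94, 100, 140, 138}.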

With these simple methods, you can build applications that search and redact massive numbers of documents. This sample only adds redaction annotations to the document and does not apply them, so that a human can examine the documents before the redactions are applied, but even this simple step can save a lot of time.

View and download the RedactionAnnotDemo sample or get all the samples and documentation by requesting an evaluation of the Datalogics PDF Java Toolkit.
