PDF Java Toolkit Sample of the Week: Quads!

No… not those quads… QuadPoints in PDF.

Even among developers who have experience working with PDF, there’s still a lot of confusion around what exactly QuadPoints are and why they’re useful. While this article references a Gist that uses the Datalogics PDF Java Toolkit, the discussion is more generalized and can be applied to any PDF developer tool.

If you are a developer who is trying to make sense of the text in a PDF file… especially one without structure tags, knowing the locations of the characters that constitute a word is critical. Knowing only the bounding rectangle of the word is helpful, but when you have rotated text, text on a curve, or hyphenated text, the bounding rectangle is not enough; you need the QuadPoints.

QuadPoints or “quads” represent the coordinates of the lower-left, lower-right, upper-left, and upper-right corners of the bounding rectangle. For most words, even rotated words like the one in the illustration below, one set of quads is all that’s required to define the rectangle that encompasses the entire word.

[Illustration: a single set of quads enclosing a rotated word]

For hyphenated text or text on a curve, you’ll get an array of quads because, potentially, each character may require its own bounding box, as in the illustration below.

[Illustration: text on a curve, with a separate quad around each character]

So, given the image above, which is actually captured from the output file of the Gist for this article, if a developer wanted to redact that word and relied only on the word’s bounding rectangle to create the redaction annotation, a lot of other content below the arc of the word might also get redacted. That would be bad. But using all 25 of the quads for the word allows the developer to add annotations exactly where they need to be, and only where they need to be.
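To put a number on how much extra area a plain bounding rectangle can cover, here is a minimal sketch in plain Java. It does not use the Datalogics API; the class `QuadCoverage` and its methods are hypothetical, and it assumes each quad arrives as eight numbers in the corner order described earlier (lower-left, lower-right, upper-left, upper-right).

```java
import java.util.List;

public final class QuadCoverage {
    private QuadCoverage() {}

    /** Axis-aligned box {minX, minY, maxX, maxY} enclosing every corner of every quad. */
    public static double[] boundingBox(List<double[]> quads) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] q : quads) {
            for (int i = 0; i < 8; i += 2) {
                minX = Math.min(minX, q[i]);
                maxX = Math.max(maxX, q[i]);
                minY = Math.min(minY, q[i + 1]);
                maxY = Math.max(maxY, q[i + 1]);
            }
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    /** Area of one quad given as LL, LR, UL, UR (assumed order), via the shoelace formula. */
    public static double quadArea(double[] q) {
        // x-index of each corner when walking the perimeter: LL, LR, UR, UL
        int[] order = {0, 2, 6, 4};
        double sum = 0;
        for (int k = 0; k < 4; k++) {
            int a = order[k];
            int b = order[(k + 1) % 4];
            sum += q[a] * q[b + 1] - q[b] * q[a + 1];
        }
        return Math.abs(sum) / 2.0;
    }

    public static void main(String[] args) {
        // A made-up quad rotated 45 degrees: LL(0,0), LR(4,4), UL(-1,1), UR(3,5)
        double[] rotated = {0, 0, 4, 4, -1, 1, 3, 5};
        double[] box = boundingBox(List.of(rotated));
        double boxArea = (box[2] - box[0]) * (box[3] - box[1]);
        System.out.println("quad area = " + quadArea(rotated));
        System.out.println("box area  = " + boxArea);
    }
}
```

For this made-up quad the bounding rectangle covers roughly three times the area of the quad itself, and the gap grows when a word follows a curve and the rectangle has to enclose every character’s quad at once.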

Now onto the sample…

To add Polygon annotations around each character, I needed to discover its quad. I did this by using the TextExtractor service in the Datalogics PDF Java Toolkit and then iterating over each Word object. A Word object contains a Unicode string representing the word along with a list of bounding boxes, as many as are needed to enclose all of the characters. If the characters of the word are collinear, then there might be only one bounding box covering the entire word. If the characters follow a curved path, then a list of bounding boxes, potentially one for each character, will be returned.

I can then use the array of QuadPoints of the Word to create the right vertices to draw my rectangle Polygon. But there’s a problem: the QuadPoints contain only 4 sets of x,y coordinates. My rectangular Polygon requires 5.

Five?… But the rectangle has only 4 corners.

Five…the rectangle Polygon annotation is drawn using lines that connect five vertices.

  1. Start at the lower left corner
  2. Draw a line to the upper left corner
  3. Draw a line to the upper right corner
  4. Draw a line to the lower right corner
  5. Draw a line back to the lower left corner to close the Polygon

Four lines but five vertices.

In code, this is represented by creating an array from the quad values, appending the first x and y coordinates to it, and then using the longer array as the vertices in the annotation dictionary.
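A minimal sketch of that step in plain Java follows. It is not the code from the Gist and it does not use the Datalogics API; the class `QuadVertices` and its methods are hypothetical. It assumes the quad arrives as eight numbers in the corner order given earlier (lower-left, lower-right, upper-left, upper-right), so it first rearranges the corners into the drawing order from the list above and then appends the first point. If your quads already arrive in perimeter order, `closePath` alone is enough.

```java
import java.util.Arrays;

public final class QuadVertices {
    private QuadVertices() {}

    /** Appends the first x,y pair to the end so the outline finishes on its starting corner. */
    public static double[] closePath(double[] points) {
        if (points.length < 2 || points.length % 2 != 0) {
            throw new IllegalArgumentException("expected a non-empty list of x,y pairs");
        }
        double[] closed = Arrays.copyOf(points, points.length + 2);
        closed[points.length] = points[0];
        closed[points.length + 1] = points[1];
        return closed;
    }

    /**
     * Turns a quad given as LL, LR, UL, UR (assumed order) into the five
     * vertices of the walk LL, UL, UR, LR, LL.
     */
    public static double[] toPolygonVertices(double[] quad) {
        if (quad.length != 8) {
            throw new IllegalArgumentException("a quad needs exactly 8 values");
        }
        double[] walk = {
            quad[0], quad[1], // lower left
            quad[4], quad[5], // upper left
            quad[6], quad[7], // upper right
            quad[2], quad[3]  // lower right
        };
        return closePath(walk);
    }

    public static void main(String[] args) {
        // A made-up, unrotated quad: LL(10,20), LR(110,20), UL(10,40), UR(110,40)
        double[] quad = {10, 20, 110, 20, 10, 40, 110, 40};
        System.out.println(Arrays.toString(toPolygonVertices(quad)));
    }
}
```

The reordering matters: connecting the corners in lower-left, lower-right, upper-left, upper-right order would draw an outline that crosses itself instead of tracing the rectangle.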

To get started working with PDF, download this Gist and request an evaluation copy of the Datalogics PDF Java Toolkit.
