Searching PDFs: Desperately Seeking the Right Phrases

In a previous blog article, Joel demonstrated how to use the PDF Java Toolkit WordFinder with regular expressions to find phone numbers. Regular expressions are useful, but they are more useful when you can match more than one word at a time. So, let’s examine how to do this with Adobe PDF Library’s WordFinder using the DotNet interface.

The phrase ‘Linearized PDF’ found and highlighted.

If you are familiar with our TextExtract C# sample app, then the following, based on the ExtractTextUntagged() method, will look rather familiar:

Some differences are:

Here, ‘textToExtract’ is a StringBuilder object rather than a string, and two new variables, mapPosToWord and wordOffset, appear. These two are key: we want to map character positions in textToExtract to the Word objects in pageWords.

So the next step is:

After that, if we add a white-space character, we need to increment our wordOffset to keep things consistent.
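To make the bookkeeping concrete, here is a minimal, library-free sketch of the idea. It is not the sample's actual code: pageWords is a plain list of strings standing in for the Word objects that WordFinder returns, the BuildText helper is a hypothetical name, and a single space is always appended after each word, which is a simplification.

```csharp
using System.Collections.Generic;
using System.Text;

public static class PositionMapSketch
{
    // Appends every word to one buffer and records, for each character
    // position that belongs to a word, the index of that word.
    // Assumption: pageWords holds plain strings, not real Word objects.
    public static string BuildText(IList<string> pageWords,
                                   IDictionary<int, int> mapPosToWord)
    {
        var textToExtract = new StringBuilder();
        int wordOffset = 0; // position in textToExtract where the next word starts

        for (int wordIndex = 0; wordIndex < pageWords.Count; wordIndex++)
        {
            string wordText = pageWords[wordIndex];
            textToExtract.Append(wordText);

            // Every character of this word maps back to the same word index.
            for (int i = 0; i < wordText.Length; i++)
            {
                mapPosToWord[wordOffset + i] = wordIndex;
            }
            wordOffset += wordText.Length;

            // Simplification: always add one separating space. The space is
            // not mapped to any word, but wordOffset must still advance so
            // later positions stay aligned with the buffer.
            textToExtract.Append(' ');
            wordOffset++;
        }
        return textToExtract.ToString();
    }
}
```

With the words "Linearized", "PDF" and "files", positions 0 through 9 map to word 0, position 10 (the space) is unmapped, and positions 11 through 13 map to word 1.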

Now here is where the magic happens, and where things start diverging from the TextExtract sample:

We collapse textToExtract to a string and pass it, along with the regular expression string parameter, to Regex.Matches, which returns zero or more matches for us to examine.
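As a rough illustration of that step, using only the standard .NET Regex class, the sketch below collapses a StringBuilder and collects the start and length of each match. The FindSpans helper and the pattern in the usage note are assumptions made for this example, not part of the sample.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class PhraseMatchSketch
{
    // Runs the pattern over the collapsed text and returns each match
    // as a (start position, length) pair in that text.
    public static List<(int Start, int Length)> FindSpans(
        StringBuilder textToExtract, string pattern)
    {
        var spans = new List<(int Start, int Length)>();
        string text = textToExtract.ToString(); // collapse the buffer once

        foreach (Match match in Regex.Matches(text, pattern))
        {
            spans.Add((match.Index, match.Length));
        }
        return spans;
    }
}
```

A pattern such as `Linearized\s+PDF` matches the phrase even when the two words are separated by more than one white-space character, which is exactly the multi-word case a single-word search cannot handle.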

We then travel the entire length of the match’s span, adding all of the word indices from the map of text positions to the HashSet object. The HashSet object will keep only the unique word indices.
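A minimal sketch of that walk might look like the following. It assumes mapPosToWord is a dictionary from character position to word index in which white-space positions are simply absent; the WordIndicesFor helper is a hypothetical name for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchToWordsSketch
{
    // Visits every character position covered by the match and collects
    // the index of the word found at that position. The HashSet discards
    // the repeats produced by the many characters of a single word.
    public static HashSet<int> WordIndicesFor(
        Match match, IDictionary<int, int> mapPosToWord)
    {
        var wordIndices = new HashSet<int>();
        int end = match.Index + match.Length;

        for (int pos = match.Index; pos < end; pos++)
        {
            // Unmapped positions (white space in this sketch) are skipped.
            if (mapPosToWord.TryGetValue(pos, out int wordIndex))
            {
                wordIndices.Add(wordIndex);
            }
        }
        return wordIndices;
    }
}
```

A fourteen-character match over "Linearized PDF" therefore yields just two entries, one per word, rather than thirteen.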

Each index value gives us a Word object, from which we collect its one or more quadrangle objects and gather them into a list.
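The gathering step can be sketched without the library as well. WordStub and QuadStub below are hypothetical stand-ins for the library's word and quadrangle types, and their members are assumptions made only so the example compiles on its own.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a quadrangle; only a label is kept here.
public sealed class QuadStub
{
    public string Label { get; }
    public QuadStub(string label) { Label = label; }
}

// Hypothetical stand-in for a word that carries one or more quadrangles.
public sealed class WordStub
{
    public List<QuadStub> Quads { get; } = new List<QuadStub>();
}

public static class QuadCollectorSketch
{
    // Looks up each matched word by index and appends all of its
    // quadrangles to a single flat list.
    public static List<QuadStub> Collect(
        IEnumerable<int> wordIndices, IList<WordStub> pageWords)
    {
        var quads = new List<QuadStub>();
        foreach (int index in wordIndices)
        {
            quads.AddRange(pageWords[index].Quads);
        }
        return quads;
    }
}
```

Keeping the result as one flat list means a single annotation can cover the whole phrase, however many words or quadrangles it spans.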

With that list we create Highlight Annotations.

From there, it’s all over but for the clean-up:

Note that the same approach could be used with Redaction Annotations rather than Highlight Annotations. The same technique could also be implemented in Java using either Adobe PDF Library’s Java interface or PDF Java Toolkit, as in Joel’s sample.

If you have any questions, comment below or contact us!
