Experimenting with Text Extraction

It is a truth universally acknowledged that a single useful sample app is the basis of hundreds of variants, as code is copied, tweaked, and re-used all over the place. That’s certainly the case with our TextExtract sample, for which I’ve implemented my share of variations. My most useful variant grabs all of the Quads, WordAttributes, and Style changes and dumps them to an XML file, which makes it relatively straightforward to investigate text extraction issues.

But all of my TextExtraction variations have essentially followed the same pattern:
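The original listing is not reproduced here, so what follows is only a minimal sketch of the list-based pattern. The types `IListWordFinder` and `FakeListWordFinder` are hypothetical stand-ins invented for illustration, not the library's classes, and the assumption that a word can be treated as a plain string is mine.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for the library's word finder; not the real APDFL type.
public interface IListWordFinder
{
    int PageCount { get; }

    // Returns every word on the page at once, in the spirit of GetWordList.
    IList<string> GetWordList(int pageNumber);
}

// Tiny in-memory implementation so the sketch runs without a PDF.
public sealed class FakeListWordFinder : IListWordFinder
{
    private readonly string[][] pages;

    public FakeListWordFinder(params string[][] pages) { this.pages = pages; }

    public int PageCount => pages.Length;

    public IList<string> GetWordList(int pageNumber) => pages[pageNumber];
}

public static class ListExtraction
{
    // The familiar pattern: build the page's word list, then walk it.
    public static void ExtractAll(IListWordFinder finder, TextWriter output)
    {
        for (int page = 0; page < finder.PageCount; page++)
        {
            IList<string> words = finder.GetWordList(page);
            foreach (string word in words)
            {
                output.Write(word);
                output.Write(' ');
            }
        }
    }
}
```

Every page pays for a complete list before the first word is written, which is the cost the rest of this post tries to avoid.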

At the heart of that pattern is that old GetWordList, which wraps around APDFL’s PDWordFinderAcquireWordList function. Its documentation reads:

Finds all words on the specified page and returns one or more tables containing the words. One table contains the words sorted in the order in which they appear in the PDF file, while the other contains the words sorted by their x- and y-coordinates on the page.

Only words within or partially within the page’s crop box (see PDPageGetCropBox()) are enumerated. Words outside the crop box are skipped.

There can be only one word list in existence at a time; clients must release the previous word list, using PDWordFinderReleaseWordList(), before creating a new one.

Use PDWordFinderEnumWords() instead of this method, if you wish to find one word at a time instead of obtaining a table containing all words on a page.

Though I had never seen any code that makes use of it, that PDWordFinderEnumWords functionality is also exposed in the Java and .NET interfaces. So I thought I’d adapt the TextExtract (C#) sample app to use it:

Basically, create a wordFinder object and, for every page of the document, call its EnumWords method with a callback object. The TextExtractProc callback class implements the WordProc interface, so it needs to provide a Call method. But first, we are going to use the callback object to keep some state that will persist between calls to its Call method. (Who came up with these names, anyway?)
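The shape of that loop can be sketched as follows. This is not the library's API: `IWordProc` and `FakeWordFinder` are hypothetical stand-ins, the `Call(string word, int pageNumber)` signature is an assumption, and so is the idea that returning false stops the enumeration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the WordProc callback interface.
public interface IWordProc
{
    // Assumed contract: return true to keep enumerating, false to stop.
    bool Call(string word, int pageNumber);
}

// Hypothetical stand-in for the word finder; pages are just arrays of strings.
public sealed class FakeWordFinder
{
    private readonly string[][] pages;

    public FakeWordFinder(params string[][] pages) { this.pages = pages; }

    public int PageCount => pages.Length;

    // Hands the page's words to the callback one at a time.
    // Returns false if the callback asked to stop.
    public bool EnumWords(int pageNumber, IWordProc proc)
    {
        foreach (string word in pages[pageNumber])
        {
            if (!proc.Call(word, pageNumber)) return false;
        }
        return true;
    }
}

// Smallest useful callback: remember each word it is handed.
public sealed class CollectingProc : IWordProc
{
    public List<string> Words { get; } = new List<string>();

    public bool Call(string word, int pageNumber)
    {
        Words.Add(word);
        return true;
    }
}

public static class EnumExtraction
{
    // One callback object is reused for every page, so its state carries over.
    public static void ExtractAll(FakeWordFinder finder, IWordProc proc)
    {
        for (int page = 0; page < finder.PageCount; page++)
        {
            if (!finder.EnumWords(page, proc)) break;
        }
    }
}
```

Because the same callback instance is passed for every page, anything it stores in its fields survives from one word to the next and from one page to the next.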

The state we need to persist is the StreamWriter object that writes the words to an output stream, whether the file is tagged, and the last page we were on.

We keep the last page because we are going to write out where the page breaks occur:

We keep whether the file is tagged because we are going to use the same code path either way, and we want to handle the quirks of both flavors appropriately. And I like reducing boolean logic when appropriate:

And sometimes, I don’t really care:

Finally, we write out the text, and return true to indicate we can continue with the next word.
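Putting the pieces together, a stateful callback along these lines might look like the sketch below. It is an illustration under assumptions rather than the sample's actual code: `IWordProc` is a hypothetical stand-in interface, the `Call` signature and zero-based page numbers are assumed, and the page marker format is made up. The tagged-file flag is left out because the text does not spell out what it changes.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the WordProc callback interface.
public interface IWordProc
{
    bool Call(string word, int pageNumber);
}

public sealed class TextExtractProc : IWordProc
{
    private readonly TextWriter output; // where the words go
    private int lastPage = -1;          // -1 means no page seen yet

    public TextExtractProc(TextWriter output)
    {
        this.output = output;
    }

    public bool Call(string word, int pageNumber)
    {
        // A change in page number means we crossed a page break.
        if (pageNumber != lastPage)
        {
            if (lastPage != -1) output.Write("\n");
            // Marker format is invented for this sketch; pages assumed zero-based.
            output.Write("--- page " + (pageNumber + 1) + " ---\n");
            lastPage = pageNumber;
        }

        output.Write(word);
        output.Write(' ');

        // true tells the enumerator to carry on with the next word.
        return true;
    }
}
```

Calling `Call("Hello", 0)`, `Call("world", 0)`, and `Call("again", 1)` on one instance writes a page 1 marker, the two words, a blank separator, a page 2 marker, and then the third word.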

And voila!

We now have code that doesn’t require creating a list of words and freeing it after we’re done with it.

So, how does EnumWords compare to old faithful, GetWordList?

Interestingly, while I had expected EnumWords to be somewhat more memory-efficient than GetWordList, since it doesn’t require building up and tearing down the word list, Process Explorer indicated that it actually used a bit more memory. However, EnumWords was consistently about 5% faster, at least when extracting the text from the PDF Reference.
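For anyone who wants to repeat the speed comparison, one possible approach is a small Stopwatch harness like the sketch below. How the timings above were actually taken is not stated, so the helper `ExtractionTimer`, the warm-up run, and the averaging are all my assumptions.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Runs the action `runs` times and returns the average elapsed milliseconds.
    public static double AverageMilliseconds(Action action, int runs)
    {
        if (runs <= 0) throw new ArgumentOutOfRangeException(nameof(runs));

        action(); // warm-up run, not timed

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            action();
        }
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds / runs;
    }
}
```

Wrap each extraction routine in a lambda, pass both through `AverageMilliseconds` with the same input file and run count, and compare the two averages. Memory use still needs an external tool such as Process Explorer.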

Any questions? Comment below or contact us.
