DLE using Python

I’ve always had an appreciation for higher-level languages, the ones that make life easier and let you code rather than worry about the housekeeping. C# is an improvement over coding in C or C++, since it relieves you of many of the burdens of tracking pointers and object ownership. But you still have to compile the program before you can run it.

Scripting languages like Python give you the best of both worlds. Programs don’t require compilation before being run, and in fact you can type commands into an interactive console, just like in the old days of BASIC.

I’ve been something of a Pythonista for a long time now, and I’ve always wanted to access the PDF Library from Python. With DLE we can.

Before you go digging in the distribution to find the secret Python bindings, I’ll tell you there aren’t any. We’re going to use a little trick. There are versions of Python that run on some of the major VMs out there. One of them is IronPython, which runs on .NET, and the other is Jython, which runs on the JVM.

Both mix the ease of use of Python with direct access to the features of the underlying VM. Generally, Python objects and Java or .NET objects can be freely mixed, and Python code doesn’t really need to know in which language a class or object was declared.

For this article, I’m going to focus on Jython.

Getting started with Jython

I’ll start by presuming that you’ve installed Jython according to the installation instructions.

Change to the directory that contains your DLE installed files, and make sure that Jython can find the components of DLE.

On Macintosh, we set a few environment variables:

[sourcecode language="bash"]
$ export DYLD_FRAMEWORK_PATH=$PWD
$ export DYLD_LIBRARY_PATH=$PWD
$ export JYTHONPATH=$PWD/com.datalogics.PDFL.jar
[/sourcecode]

On Windows, it’s sufficient to set JYTHONPATH:

[sourcecode]
> set JYTHONPATH=com.datalogics.PDFL.jar
[/sourcecode]
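If Jython later fails to import the DLE classes, a quick sanity check of the path variable can help. The following is only a sketch in plain Python: `jar_on_path` is a hypothetical helper written for this article, not part of DLE or Jython, and it only tests whether a path entry ends in the expected jar file name.

```python
import os


def jar_on_path(jar_name, path_value):
    """Return True if any entry of a path-style string names the given jar.

    Hypothetical helper for illustration; it does not check that the
    file actually exists on disk.
    """
    # Split on the platform's path separator (":" on Mac, ";" on Windows).
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.basename(entry) == jar_name for entry in entries)


# Check the variable that was set in the shell commands above.
current = os.environ.get("JYTHONPATH", "")
print(jar_on_path("com.datalogics.PDFL.jar", current))
```

Run it in the same shell session where you set the variable, since exported values are not visible to other sessions.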

And we start up Jython:

[sourcecode]
$ jython
*sys-package-mgr*: processing new jar, 'E:DatalogicsAPDFL10.1B1a-x64DLEcom.datalogics.PDFL.jar'
Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[Java HotSpot(TM) 64-Bit Server VM (Sun Microsystems Inc.)] on java1.5.0_19
Type "help", "copyright", "credits" or "license" for more information.
>>>
[/sourcecode]

On Mac OS X, because DLE is 32-bit only, you’ll have to make sure to invoke the 32-bit version of the JVM by using the -J-d32 option:

[sourcecode language="bash"]
$ jython -J-d32
Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[Java HotSpot(TM) Client VM (Apple Inc.)] on java1.6.0_35
Type "help", "copyright", "credits" or "license" for more information.
>>>
[/sourcecode]

Let’s import the DLE classes for ease of use:

[sourcecode language="python"]
>>> from com.datalogics import *
[/sourcecode]

If you got this far, then the DLE classes are imported into your namespace. Where did that com.datalogics.PDFL module come from? From the .jar file! All the Java classes in that namespace are now imported into our Python interpreter.

It’s important to initialize and terminate the Library on the main thread, so the first thing we do is initialize:

[sourcecode language="python"]
>>> lib = PDFL.Library()
[/sourcecode]

Now let’s open the sample.pdf document (in the sample data that comes with DLE), and see what kinds of attributes a document object has:

[sourcecode language="python"]
>>> doc = PDFL.Document('Samples/Data/sample.pdf')
>>> dir(doc)
['ALL_PAGES', 'BEFORE_FIRST_PAGE', 'LAST_PAGE', 'XMPMetadata', '__class__', '__copy__', '__deepcopy__', '__delattr__', '__doc__', '__eq__', '__getattribute__', '__hash__', '__init__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__str__', '__unicode__', 'author', 'baseURI', 'bookmarkRoot', 'class', 'close', 'compressionLevel', 'countXMPMetadataArrayItems', 'createNameTree', 'createPage', 'creator', 'defaultOptionalContentConfig', 'delete', 'deleteOnClose', 'deletePages', 'embedFonts', 'enumIndirectPDFObjects', 'equals', 'fileName', 'findBookmark', 'findLabelForPageNum', 'findPDFObjectByID', 'findPageNumForLabel', 'flattenOptionalContent', 'flattenTransparency', 'getAuthor', 'getBaseURI', 'getBookmarkRoot', 'getClass', 'getCompressionLevel', 'getCreator', 'getDefaultOptionalContentConfig', 'getDeleteOnClose', 'getFileName', 'getFonts', 'getInfo', 'getInfoDict', 'getInstanceID', 'getIsEmbedded', 'getIsLinearized', 'getIsModified', 'getIsOptimized', 'getIsPxDF', 'getKeywords', 'getLoadedFonts', 'getMajorVersion', 'getMajorVersionIsNewerThanCurrentLibrary', 'getMergedXMPKeywords', 'getMinorVersion', 'getMinorVersionIsNewerThanCurrentLibrary', 'getNameTree', 'getNeedsSave', 'getNumPages', 'getOptionalContentConfigs', 'getOptionalContentContext', 'getOptionalContentGroups', 'getPage', 'getPageLabels', 'getPageMode', 'getPermanentID', 'getProducer', 'getRequiresFullSave', 'getRoot', 'getSubject', 'getSuppressErrors', 'getTitle', 'getVersionIsOlderThanCurrentLibrary', 'getVersionString', 'getWasRepaired', 'getXMPMetadata', 'getXMPMetadataArrayItem', 'getXMPMetadataProperty', 'hashCode', 'infoDict', 'insertPages', 'instanceID', 'isEmbedded', 'isLinearized', 'isModified', 'isOptimized', 'isPxDF', 'keywords', 'loadedFonts', 'majorVersion', 'majorVersionIsNewerThanCurrentLibrary', 'mergeXMPKeywords', 'mergedXMPKeywords', 'minorVersion', 'minorVersionIsNewerThanCurrentLibrary', 'movePage', 'needsSave', 'notify', 'notifyAll', 'numPages',
'optionalContentConfigs', 'optionalContentContext', 'optionalContentGroups', 'pageLabels', 'pageMode', 'permRequest', 'permanentID', 'print', 'printToFile', 'producer', 'removeNameTree', 'removeOCG', 'replacePages', 'requiresFullSave', 'root', 'save', 'secure', 'setAuthor', 'setBaseURI', 'setCreator', 'setDeleteOnClose', 'setInfo', 'setIsEmbedded', 'setIsOptimized', 'setKeywords', 'setMinorVersion', 'setNeedsSave', 'setPageLabels', 'setPageMode', 'setProducer', 'setRequiresFullSave', 'setSubject', 'setSuppressErrors', 'setTitle', 'setXMPMetadata', 'setXMPMetadataArrayItem', 'setXMPMetadataProperty', 'subject', 'suppressErrors', 'title', 'toString', 'unsecure', 'versionIsOlderThanCurrentLibrary', 'versionString', 'wait', 'wasRepaired', 'watermark']
[/sourcecode]

Jython infers attributes from Java get and set methods, so getNumPages is also available as the attribute numPages, and we can read the number of pages without a function call. There is no setNumPages in the list above, so the attribute is read-only and we can’t assign to it.

[sourcecode language="python"]
>>> doc.numPages
2
>>> doc.numPages=2
Traceback (innermost last):
File "<stdin>", line 1, in ?
AttributeError: read-only attr: numPages
[/sourcecode]
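As a rough analogy in plain Python, and not a description of how Jython implements it internally, a getter without a matching setter behaves like a property defined with no setter. FakeDocument below is a hypothetical stand-in written for this illustration; it is not a DLE class.

```python
class FakeDocument(object):
    """Hypothetical stand-in for a DLE Document; not part of the real API."""

    def __init__(self, num_pages):
        self._num_pages = num_pages

    def getNumPages(self):
        # Java-style getter, as it would appear on the Java class.
        return self._num_pages

    # Expose the getter as an attribute. No setter is supplied, so any
    # assignment to numPages raises AttributeError.
    numPages = property(getNumPages)


doc = FakeDocument(2)
print(doc.numPages)

read_only = False
try:
    doc.numPages = 3
except AttributeError:
    read_only = True
print(read_only)
```

The attribute and the getter stay in sync because the attribute is only a view onto the getter, which matches what the session above suggests about numPages and getNumPages.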

You might also have noticed that there are no declarations of types. That’s because Python is duck-typed: if it looks like a duck, and quacks like a duck, it’s a duck. Every Python object has a specific type, but that type is a property of the object, not of the name that is used to reference it.
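A small plain-Python sketch makes both points visible. The Report and Booklet classes and the describe function are invented for this example and have nothing to do with DLE.

```python
def describe(thing):
    # No type declaration: any object with a numPages attribute will do.
    return "%d page(s)" % thing.numPages


class Report(object):
    numPages = 2


class Booklet(object):
    numPages = 16


value = Report()
first_type = type(value).__name__

# The same name can be rebound to an object of a different type,
# because the type belongs to the object rather than to the name.
value = Booklet()
second_type = type(value).__name__

print(describe(Report()))
print(describe(Booklet()))
print("%s then %s" % (first_type, second_type))
```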

Time to try something fun, like extracting some text from a PDF file. First, let’s get a word finder:

[sourcecode language="python"]
>>> wf = PDFL.WordFinder(doc, PDFL.WordFinderVersion.LATEST, PDFL.WordFinderConfig())
[/sourcecode]

Now, since we’re in Python, we can get a list of words just by calling getWordList, and it will act like a Python list. So let’s map the unbound method Word.getText onto it, and see what we get.

[sourcecode language="python"]
>>> map(PDFL.Word.getText, wf.getWordList(0))
[u'National', u'Weather', u'Service', u'Zone', u'Forecast', u'http://', u'www.', u'crh.', u'noaa.' …
[/sourcecode]

There’s a lot of power packed into this one line. First, DLE returns native list types where appropriate, so the list is a java.util.ArrayList. Jython extends Python sequence semantics to Java list types, so we can treat that ArrayList like any other list.

Second, Python’s functional programming tools (here, map) turn loops into one-liners, applying a function to each item of a sequence and returning a sequence of results.

Third, Java strings and Python strings are transparently converted, which is why the results come back as Python unicode strings.
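To see what map is doing without a PDF at hand, here is a minimal sketch in plain Python. FakeWord is a hypothetical stand-in for the DLE Word class, and the word list is made up. The sketch compares map with the explicit loop and the list comprehension it replaces.

```python
class FakeWord(object):
    """Hypothetical stand-in for a DLE Word; not part of the real API."""

    def __init__(self, text):
        self._text = text

    def getText(self):
        return self._text


words = [FakeWord(text) for text in ("National", "Weather", "Service")]

# map calls FakeWord.getText once per item, passing the item as self.
# In Python 3, map returns a lazy iterator, so list() collects the
# results; in Jython 2.5, map already returns a list.
via_map = list(map(FakeWord.getText, words))

# The equivalent explicit loop.
via_loop = []
for word in words:
    via_loop.append(word.getText())

# The equivalent list comprehension.
via_comprehension = [word.getText() for word in words]

print(via_map)
```

Which form to use is largely a matter of taste; the list comprehension avoids naming the class, which can be convenient when the items are not all of one type.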

One thing to remember: The Library object always has to be cleaned up. In the Java-based version of DLE, use the delete method:

[sourcecode language="python"]
>>> lib.delete()
[/sourcecode]

Once an object in DLE has been deleted, dependent objects become invalid. So, if we delete our Document object, the Pages we get from it are no longer usable, and so forth. Deleting the Library object cleans everything up, so we can no longer use our Document.

[sourcecode language="python"]
>>> doc.numPages
File "<stdin>", line 1, in <module>
java.lang.RuntimeException: Object is no longer valid (perhaps a parent object was already destroyed) …
[/sourcecode]
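In a script, as opposed to an interactive session, one way to make sure the cleanup always happens is a try/finally block. This is a sketch of the pattern only: FakeLibrary is a hypothetical stand-in with a delete method, used here so the example runs without DLE installed.

```python
class FakeLibrary(object):
    """Hypothetical stand-in for the DLE Library object."""

    def __init__(self):
        self.deleted = False

    def delete(self):
        self.deleted = True


# Normal case: do the work, then clean up.
lib = FakeLibrary()
try:
    page_count = 2  # placeholder for real work, such as reading doc.numPages
finally:
    # Runs whether or not the work above raised an exception.
    lib.delete()

# Failure case: the library is still deleted when the work raises.
failing = FakeLibrary()
try:
    try:
        raise RuntimeError("simulated failure while processing")
    finally:
        failing.delete()
except RuntimeError:
    pass

print(lib.deleted)
print(failing.deleted)
```

Because deleting the Library invalidates everything that depends on it, the delete call belongs at the very end, after all work with documents and pages is finished.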

Scripting languages offer the ability to explore an API interactively. Code can be tested before it is placed in a complete application, without requiring compilation. Jython makes an excellent exploration tool for the Java version of DLE.

A future installment will show how Python programs can be written for DLE using Jython.

Other resources

  • Jython Console is a wrapper around a Python console prompt that offers code completion.