Comparing PDFs: How To Spot the Differences

Comparing PDFs: How To Spot the Differences

You may have, on occasion, seen a newspaper or magazine feature where there are two pictures or illustrations side by side that are very similar. And the instructions direct you to spot the differences between the two. Invariably, the first few differences are quickly and easily spotted, drawing you into its trap and wasting precious minutes of your life which you will never, ever get back again. You’re trying to pinpoint the last few differences which you know are there, hiding in plain sight, but which are causing you to go cross-eyed minutely comparing the two pictures sector by sector as if you were playing Battleship with an aircraft carrier facing off against a submarine…or maybe, that’s just me.

That said, there are some PDF issues that require comparing their structure to diagnose. You would get too many false-positives from Indirect Object IDs being different or font-names having a subset-prefix which (by design!) does not match between the two, et cetera.

And yet, PDFs do have a tree structure (with root grafts run amok, but tree structure nonetheless). Which means that this problem can be conquered by the Power Of GraySku…sorry, by the power of recursion.

Enough Blather, on to the Code

So here’s what I came up with:

Parsing the arguments

Please excuse the abused Backus–Naur form, but basically, I wanted a way to compare two documents at the same time. But not just two documents, I also wanted to compare pages or objects either in separate documents or in the same document. And if I want to compare a Page to a Form XObject, then hey, why not?

I sort out the arguments with a little Finite State Machine:

From which we get two, or possibly one, documents to compare:

Comparing PDF Objects

Now, let’s consider the case where we are comparing Pages or Objects first:

Here, for each ID we are using the compare enum to retrieve the initial PDFObject using either Document.GetPage().PDFDict or Document.FindPDFObjectByID() based on the compare value associated with each ID. We essentially use the same technique to create the root label passed into our compareObjs() call.

The compareObjs() routine

And now, let’s look at the heart of the program:

The first thing we do here is check if both left and right are indirect objects that we have already visited before. If so, no need to compare them again. For the rest of compareObjs, we will need to know the type of PDFObject we are dealing with. So, to simplify things, we map the object type to an enumeration:

The next part of CompareObjs, we check for mismatching object types. Generally, if they don’t match, then we can describe the two objects we are comparing and call it done.

Describing PDFObjects is done with this:

Which is fairly straight forward: we describe objects by their values if they are simple enough, and by their type if they are more complicated like dictionaries or streams, or large arrays. For small arrays, we recursively describe each of their objects.

There is, however, one complication where the objects are different – if we are comparing a PDFInteger to a PDFReal, or vice-versa. We still want to compare their values to see if there truly is a difference. So, we convert the integer value to a double in order to compare like-to-like and ensure we only describe the objects if there is actually a difference between the two values:

For the rest of CompareObjs, we are now comparing like-to-like, and for simple object types, the code is straightforward enough:

With arrays, however, we start getting into recursion; we can only compare up until the end of the shortest array, and then we describe the remaining entries in the longer array.

Conceptually, comparing dictionaries is similar to comparing arrays, but the indices are string tokens rather than a continuous range of integer tokens. We basically want to separate the dictionary keys into three piles: keys used in both dictionaries, keys used only in one dictionary, and keys used only in the other dictionary. This could have been painful to implement, but the HashSet class makes this almost trivial. First, to get the keys from the dictionaries:

From this set of keys, we want to first remove any keys that we don’t want to compare. Keys like ‘Parent’, for example, that are are effectively the back-pointing reference of a double-linked-list node. Then, we are going to make another copy of the keySet from the left and intersect with the keys from the right, which yields a set with only keys that are in both. Next, we remove those keys from the left and right key sets and we now have our three piles of keys. We can then describe entries only in one, only in the other and recursively compare the entries for keys that are in both dictionaries:

The code for handling streams is currently virtually identical as for handling dictionaries.

I could potentially send the (uncompressed) stream data through a cryptographic hash function to check if both blocks of data are identical or not. It’s on the to-do-list.

Otherwise, that’s basically all there is to CompareObjs(). So, let’s go back to the scenario where we are comparing not just two PDFObjects, but two PDF Documents:

Comparing PDF Documents

Here, we are comparing first the document’s root catalogs, excluding the page tree because we iterate over the pages sequentially rather than walking the page tree ourselves. Tree structures within PDFs aren’t necessarily comparable (Depending on how they were created, they may be balanced differently). Which brings up the NameTrees TODO.

Although NameTree data in a PDF are stored in tiered arrays of key-value pairs, within Adobe PDF Library, a Nametree is effectively a PDFDictionary on steroids and able grow much larger than a single PDFDictionary. Which means that you would want to compare it like a dictionary, including extracting all the keys from the Nametree. The NameTree class has a new callback recently added to it to enumerate over all of its items. That can be used to pick up all of the key names. Then, the same logic as was used for comparing dictionaries and streams can be used for NameTrees.  It’s also on the to-do list. Then there’s the NumberTree class, which currently lacks the enumeration callback. Fortunately, there are only two NumberTrees in the PDF Specification; PageLabels and the parentTree of the StructTreeRoot dictionary. We can always ignore those by tossing them on to the skiplist.

In summary

This is a nifty little diagnostic tool. I listed a couple, but more are certainly possible! Please share any suggestions you may have for enhancing this tool.

The full code is here.

Leave a Reply

Your email address will not be published. Required fields are marked *