Saving PDF Files using the Datalogics PDF Java Toolkit

Sample of the Week:

In my previous post I talked about using the PDFSaveIncrementalOptions to save Certified PDF files. In the case of a Certified Document, incremental saves are the only useful save operation because performing a full save on a PDF document that contains a digital signature invalidates the signature by rewriting every part of the file. But with regular PDF files, you may very well need to completely rewrite the file.

What You Need to Know First:
PDF is old… really old… really, really old. I know because I was hired by Adobe just after the launch of version 1.0 and I was young then. Now I’m not. It’s still the only reliable way to communicate a document across multiple platforms and have it look like the author intended… but it is old. And like all evolved organisms, it carries with it some artifacts of its past that have become useful only in cases other than their original purpose. Incremental Save is one such feature of the PDF Specification.

When PDF and Acrobat were first introduced, computers were slow and didn’t have much memory, hard drives were slow and floppy drives were even slower, networks were slow and Wi-Fi wasn’t even heard of. Acrobat and PDF had to work well in this environment if Adobe wanted any chance of establishing PDF as a standard. So for PDF, incremental saves were an easy way to quickly save the modifications to a large PDF file by simply appending the changes to the end of the file without rewriting the whole thing. For example, if you added a few bookmarks or sticky notes to a 10,000 page document, rewriting the entire file carried a high computational cost and a high write cost to your media of choice. Incremental save lowered that cost because only the new bookmarks or notes had to be written.

But… there’s always a “but”. Incremental save left open the possibility that a user would delete massive parts of the document, save the file, and then… it got bigger.

Why? Because only the changes were appended to the end. The deleted content was never removed from the file. The incremental changes just told the viewer to ignore those objects.
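To make the append-only behavior concrete, here is a minimal toy model in plain Java. It does not use the Datalogics PDF Java Toolkit and does not parse real PDF; the class `IncrementalSaveSketch`, its methods, and the 1-byte cost of a deletion marker are all hypothetical names and numbers chosen for illustration. It only shows why deleting objects with an incremental save makes the file grow, while a full rewrite shrinks it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an append-only file. Not the Toolkit's API; an illustration only. */
class IncrementalSaveSketch {

    /** One entry written to the file: an object id and its bytes (null marks a deletion). */
    static final class Entry {
        final int id;
        final byte[] data;

        Entry(int id, byte[] data) {
            this.id = id;
            this.data = data;
        }
    }

    // Everything ever written, in file order. Nothing is removed from this list.
    private final List<Entry> file = new ArrayList<>();

    /** Incremental save of a new or changed object: append it, leave earlier bytes alone. */
    void appendObject(int id, byte[] data) {
        file.add(new Entry(id, data));
    }

    /** Incremental delete: append a small marker telling readers to ignore the object. */
    void appendDeletion(int id) {
        file.add(new Entry(id, null));
    }

    /** Size on disk: every entry counts, live or not. A deletion marker is assumed to cost 1 byte. */
    long sizeOnDisk() {
        long total = 0;
        for (Entry e : file) {
            total += (e.data == null) ? 1 : e.data.length;
        }
        return total;
    }

    /** Full save: rewrite the file keeping only the latest live version of each object. */
    IncrementalSaveSketch fullSave() {
        Map<Integer, byte[]> live = new LinkedHashMap<>();
        for (Entry e : file) {
            if (e.data == null) {
                live.remove(e.id);
            } else {
                live.put(e.id, e.data);
            }
        }
        IncrementalSaveSketch rewritten = new IncrementalSaveSketch();
        live.forEach(rewritten::appendObject);
        return rewritten;
    }

    public static void main(String[] args) {
        IncrementalSaveSketch doc = new IncrementalSaveSketch();
        for (int page = 1; page <= 100; page++) {
            doc.appendObject(page, new byte[1000]); // 100 "pages" of 1,000 bytes each
        }
        long before = doc.sizeOnDisk();

        for (int page = 11; page <= 100; page++) {
            doc.appendDeletion(page); // delete all but the first 10 pages, incrementally
        }
        long afterIncremental = doc.sizeOnDisk();
        long afterFull = doc.fullSave().sizeOnDisk();

        System.out.println("before deleting:        " + before);
        System.out.println("after incremental save: " + afterIncremental);
        System.out.println("after full save:        " + afterFull);
    }
}
```

In this model the incremental save ends up slightly larger than the starting file, because the 90 deletion markers are added on top of bytes that never go away. Only the full rewrite drops the unreferenced objects.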

The Gist referenced below demonstrates this issue pretty clearly and also shows how to avoid this pitfall.

The original version of the JavaScript for Acrobat API Reference is 779 pages long and 4.375 MB in size. In the Gist, we open the file, delete all but the first 10 pages and then save the file using the default set of PDFSaveIncrementalOptions.

The resulting 10 page file is 4.54 MB. Bigger… by quite a bit.

Again, only the changes were appended to the end of the file; the objects that are no longer referenced were not removed from it. This is exactly what you want to happen if the file is certified and you want the original content to remain accessible in the future in order to compare versions, but it’s the opposite of what you’d want if you just deleted a bunch of pages.

Further along in the Gist, we save the file with the default set of PDFSaveFullOptions, yielding a file size of about 0.5 MB, and then we save again, cranking up the compression to get a final file size of about 75,000 bytes. In all three cases, the PDF file looks exactly the same to the end user: no images were downsampled, no fonts were removed, no bookmarks or named destinations were deleted. Functionally the files are identical. They’re just dramatically different sizes because of the save method used.
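The reason a file can shrink this much without anything looking different is that this kind of compression is lossless. PDF streams are commonly compressed with the Flate (zlib/deflate) filter, and the bytes come back exactly as they went in. The sketch below uses only `java.util.zip` from the JDK, not the Toolkit, and the class `LosslessCompressionSketch` and its sample data are made up for illustration; the Gist's actual compression settings are not reproduced here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Lossless round trip with deflate. An illustration only, not the Toolkit's API. */
class LosslessCompressionSketch {

    /** Hypothetical sample: a repetitive block of text standing in for page content. */
    static byte[] sample() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            sb.append("BT /F1 12 Tf (Hello) Tj ET\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    /** Compress the input with the given deflate level (0 to 9). */
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompress bytes produced by deflate(); fails loudly on bad input. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = sample();
        byte[] packed = deflate(original, Deflater.BEST_COMPRESSION);
        byte[] restored = inflate(packed);

        System.out.println("original bytes:   " + original.length);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("identical after round trip: " + Arrays.equals(original, restored));
    }
}
```

Real documents compress less neatly than this repetitive sample, so the ratio here says nothing about what to expect from a particular PDF. The point is only that the restored bytes match the originals.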

So, today, with hard drives being faster and solid state drives being super fast, incremental saves are rarely needed outside of certification and signature workflows. Incremental save has evolved.

To get started working with PDF, download this Gist and request an evaluation copy of the Datalogics PDF Java Toolkit.
