PDF Optimization: A Lose-Lose Proposition?

PDF files are too big.

PDF is too slow.

When I click on a link to a PDF, my screen goes blank.

These are the most common complaints about PDF that I’ve heard over the years, and all three are real problems with PDF. But they don’t have to be. PDF is one of the most robust file formats for communicating final-form documents ever created in human history. PDF is used by nearly every government agency in the world, nearly every regulated industry, court, manufacturer, bank, and commercial printer; the list goes on. But each of these groups uses PDF for slightly different reasons and in slightly different ways.

For example, a PDF file created by a user in a government agency might contain structure information that is essentially invisible to sighted readers but absolutely essential for people using a screen reader. However, that same structure information is unnecessary for printing the document. Structure information in a PDF generally doesn’t take up much room, but the reverse case is a different story: image data needed to print a document at high resolution and in full color can take up a lot of space, yet most of that data is thrown away when pages are rendered to the screen.

The root of this problem is that most PDF files created by individuals are created the same way: using the application defaults of their PDF tool of choice. While Adobe Acrobat does provide a very good set of defaults, even it can’t guess what you might be using the PDF file for further along in its life cycle, so it can’t streamline the PDF for a specific use case without the user telling it to. And then there are the really bad PDF tools. These tools don’t necessarily create invalid PDF; they conform to the specification but do things in really bizarre ways, which is understandable given the complexity of the PDF specification. Typically, these tools target only the visual representation, with no regard for the underlying structure that allows for content reuse, or for much of anything other than printing the file. And finally there is that whole big mess in the middle: PDF tools that get most things right, or close to right, and rely on Adobe Reader to fix the file up automatically before displaying it.

So… out of this swirl of creation tools and use cases has emerged a sort of PDF aftermarket toolset designed to take the standard output of the various PDF creation tools and optimize it. This article is the first in a series that discusses various aspects of PDF Optimization. There’s a lot of PDF expertise here at Datalogics, and the goal of this series is to share that expertise with you: to help you better understand what is involved in PDF Optimization, set expectations for what can be optimized, and debunk some myths.

What You Can Look Forward To:

The next article will discuss one of the hard facts of PDF Optimization: it’s lossy. You’re going to remove data, and with that you’re going to limit what the PDF file is useful for. The term PDF Optimization is generally used… well… generally… too generally. People who say it know what they mean, but the people who hear it don’t necessarily hear the same thing. You can’t have a reasonable discussion about optimization without knowing what you’re optimizing for: what application or use case are you targeting? The target use case largely determines how big is “too big” and informs how much data loss can be tolerated. And you are going to lose data. But what might you gain? Faster download times? Faster rendering? If you’re just removing unused named destinations, you may effectively be losing nothing.

Is PDF Optimization lose-lose or can something be gained by intelligently streamlining the file?

The remaining articles will discuss…

  • Image downsampling, which is probably the easiest optimization to get right.
  • Coalescing font subsets, which is probably the hardest to get right and, for certain files, may be impossible.
  • Is Refrying even an option? Sometimes it seems like you just have to print the PDF to PostScript and convert it back to PDF. Yes – it’s ugly, but does it work?
  • How to manage the user’s expectations. We’ll discuss what to do when “as small as you can go” is still too big.

These are the topics on my list but we’d like to hear from you as well. Send us your use cases for PDF Optimization. How big is too big? How helpful would auditing the space usage be? Leave a comment and share your thoughts.

6 thoughts on “PDF Optimization: A Lose-Lose Proposition?”

  1. Would like to know how you can help minimize the DPI of a PDF. I would like the PDF to have as low a DPI as possible, for online viewing and fast downloading. Appreciate your valuable advice. Thanks.

  2. We are struggling with customers and Bad PDF. We have started looking at the “Is Refrying even an option? Sometimes it seems like you just have to print the PDF to PostScript and convert it back to PDF. Yes – it’s ugly, but does it work?” Would love to understand the correct way to “fix” PDF. The comment we always get is “If we open and save it in Acrobat, the PDF is fine”

    1. Marq:

      “If we open and save it in Acrobat, the PDF is fine” is, in fact, a pretty good way to fix a bad PDF. Unfortunately, it’s not scalable. It works because Acrobat has code in it that will fix many common problems in PDF files created by non-Adobe technology, but Acrobat isn’t licensed to run on a server, so you can’t rely on that method to fix bad PDF files in an automated workflow.

      However, both the Adobe PDF Library and the Datalogics PDF Java Toolkit do have similar code that will allow them to read through a bad PDF file and fix some of the most common problems.

      That said, there is no “correct” way to fix a bad PDF, there is only a correct way to fix a given problem in a particular PDF file. The fun part is diagnosing the problem… this is where a good pre-flight tool can come in handy.

      1. We have the datalogics Java Toolkit. What are the parts of the code we should look at to correct some common problems?

        1. Marq:

          Since you have the Datalogics PDF Java Toolkit, I’d recommend that you bring these issues to our Support team for your specific needs. But to answer your question in a general sense: you can use the PDFDocument.wasRepaired() method to detect whether the underlying COS document was repaired to recover from damage, and if so, do a full or linear save operation on it so that the cross-reference table and all the objects get rewritten. A bad or slightly-off cross-reference table is one of the more common problems with PDF.

          The SanitizationService will examine the file and make some pretty significant changes while attempting to maintain the visual fidelity. This may not be what you want since it will flatten forms but it does examine each object in the file and rewrite it completely, so that may solve some problems as well.

          The PDFXObjectMap will allow you to iterate over and examine all of the XObjects in the PDF and possibly resolve any issues there.

          And finally, you may be able to use the RasterCallBackInterface to get in-depth information on different content items being rasterized. This might be a way to see where things are going wrong in the file but I’ve never used it for that purpose so I’m just guessing here.
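The check-then-resave flow described in this thread can be sketched as follows. This is a hypothetical, self-contained sketch: `RepairableDocument` is a stand-in interface invented here to mirror the two operations discussed above (`wasRepaired()` plus a full save), not the real Toolkit API, whose actual class names and signatures should be taken from the Toolkit’s own documentation.

```java
/**
 * Hypothetical stand-in for a PDF document type. The real PDFDocument
 * class in the Datalogics PDF Java Toolkit has a different, richer API;
 * this interface is purely illustrative.
 */
interface RepairableDocument {
    /** True if the underlying COS document was repaired while being read. */
    boolean wasRepaired();

    /** Full save: rewrites the cross-reference table and all objects. */
    void saveFull(String outputPath);
}

final class RepairFlow {
    /**
     * If the document was repaired on open, persist the fix with a full
     * save so the cross-reference table and every object are rewritten.
     * Returns true when a save was performed, false otherwise.
     */
    static boolean resaveIfRepaired(RepairableDocument doc, String outputPath) {
        if (doc.wasRepaired()) {
            doc.saveFull(outputPath);
            return true;
        }
        return false;
    }
}
```

The point of the pattern is simply that repairs made in memory are lost unless you follow up with a save that rewrites the file; how you detect the repair and which save mode you choose depend on the toolkit you are using.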

  3. Thomas:

    If you’re referring to the DPI of images in a PDF, then the question of minimizing is really a matter of use case. The Adobe PDF Library, the Datalogics PDF Java Toolkit, and most other PDF tools will allow you to downsample images in place fairly easily. If you ever think your target reader will need to print the file or may want to OCR it, you may want to downsample only to 300 DPI, which can still be kind of big for a mobile download.
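The arithmetic behind that reply is simple: an image’s effective DPI is its pixel width divided by its placed width in inches (72 points per inch), and downsampling shrinks the pixel data by the square of the DPI ratio. A minimal, self-contained sketch of that math, not tied to any particular PDF library:

```java
/** DPI arithmetic for images placed on a PDF page. */
final class ImageDpi {
    /**
     * Effective DPI of a placed image: pixel width divided by the
     * placed width in inches (1 inch = 72 PDF points).
     */
    static double effectiveDpi(int pixelWidth, double placedWidthPoints) {
        return pixelWidth / (placedWidthPoints / 72.0);
    }

    /**
     * Approximate factor by which uncompressed pixel data shrinks when
     * downsampling from srcDpi to dstDpi (pixel count scales with the
     * square of the resolution).
     */
    static double sizeReductionFactor(double srcDpi, double dstDpi) {
        double ratio = srcDpi / dstDpi;
        return ratio * ratio;
    }
}
```

For example, a 3000-pixel-wide image placed across a full US Letter width (612 points, i.e. 8.5 inches) renders at roughly 353 DPI; downsampling it to 150 DPI cuts its uncompressed pixel data by about (353/150)², roughly 5.5 times, before compression is even considered.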
