Note: This is the second in a series of my articles addressing PDF Optimization. You can read the first one here.
If I wasn’t clear in my first article, I’ll be far more terse here in the hope of promoting understanding… PDF Optimization is complicated. Personally, I don’t even like the term “Optimization,” but it’s the term people in the PDF world use, so I’ll keep using it in an attempt to appease the SEO gods. I prefer the term “Repurpose.” As I mentioned in my previous article, you can’t have a reasonable discussion about Optimization without knowing what you’re optimizing for. Manipulating the content of a PDF file to target one specific use case, while removing data that other use cases may require, seems to me more like repurposing than optimizing. But again… like my Facebook relationship status, it’s complicated.
One area of PDF Optimization is images. Optimizing image data is quite possibly the easiest thing to get right. In fact, if many of your PDF files were created a long time ago or by some of the more basic tools, simply recompressing the file using a modern library, without changing the image data, could save you a lot of disk space per file. Forcing recompression of uncompressed streams in a PDF isn’t really an image manipulation, so I won’t cover it here, but trust me: if your PDF files seem too big, try recompressing them and see if that works for you. With a lot of older PDF files, that may be all you need.
Assuming you are not targeting pre-press applications, there are a few simple methods of removing massive amounts of image data from a PDF without significantly affecting the visual fidelity of the pages. I’ll cover just a few.
Downsampling Images:
Image downsampling is possibly the most effective way to reduce PDF file size but also the most subjective. How low can you go without the image quality suffering? Well… that depends. If you think that the file might require OCR, you may not want to downsample to anything lower than 400 PPI, but anything over 600 PPI won’t significantly improve the OCR results… so pick something in between. After that, I honestly don’t have any advice on downsampling images. What you decide on is highly subjective and really does depend on the target application. What you definitely want to do is strike a balance between file size and pixel density while also thinking about the future. I remember a particular customer from a few years ago who had optimized their files for use on iPads. The files were sized perfectly for the iPad: small, efficient, quick to download, and beautiful to look at… then the Retina displays came out… I’ll let you finish that thought yourselves.
The thing is, image downsampling is lossy; once the pixels are gone, you can’t get them back. How low you can go may not be how low you should go. Consider what your PDF may be required to do in the future.
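To make the arithmetic concrete, here is a minimal Python sketch of how a target pixel size could be worked out. It does not touch a real PDF. The function names are mine, and it assumes you already know the image’s pixel dimensions and how wide it is placed on the page in points (72 points to the inch).

```python
POINTS_PER_INCH = 72.0

def effective_ppi(pixel_width, placed_width_pt):
    """Pixel density of an image as it is actually placed on the page."""
    return pixel_width / (placed_width_pt / POINTS_PER_INCH)

def downsampled_size(pixel_w, pixel_h, placed_width_pt, target_ppi):
    """Return the (width, height) in pixels that yields target_ppi.

    Images already at or below the target are left alone; upsampling
    would add bytes without adding any detail.
    """
    current = effective_ppi(pixel_w, placed_width_pt)
    if current <= target_ppi:
        return pixel_w, pixel_h
    new_w = max(1, round(pixel_w * target_ppi / current))
    new_h = max(1, round(pixel_h * target_ppi / current))
    return new_w, new_h

# A 3600 x 2400 pixel scan placed 6 inches (432 pt) wide is 600 PPI.
new_w, new_h = downsampled_size(3600, 2400, 432, 400)
kept = (new_w * new_h) / (3600 * 2400)
print(new_w, new_h)                         # 2400 1600
print(f"{kept:.0%} of the pixels remain")   # 44% of the pixels remain
```

Going from 600 PPI to 400 PPI throws away more than half of the pixels before compression even enters the picture, which is why this is both so effective and so easy to overdo.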
Removing Duplicate Images:
There are plenty of really sloppy PDF creator tools out there that support PDF output but were built on top of print drivers. When you’re printing to paper, the printer must receive all the data required to produce each page… obviously. So these drivers that produce PDF just send the same data to the PDF file. If you have a corporate logo on every page of a document, you’ll get that logo stored in the PDF once for every page. That’s a lot of duplicated data. If the logo is grayscale… or color… that’s even more duplicated data.
A good PDF library tool can scan through the images in the file, detect the duplicates, and then set the pages up to refer to a single instance of that particular object. Of course, with the right tool, this could work for all types of PDF Cos objects, not just images… it’s just that images tend to take up the most space.
By removing duplicate images, you can remove significant amounts of data without changing the visual fidelity of the file at all… less data, same appearance.
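The idea can be sketched in a few lines of Python. This is a simplified model, not any particular library’s API: I’m assuming each image has already been pulled out of the file as a hypothetical object number plus the bytes of the whole object (its dictionary and its stream data), so that two images only count as duplicates if their width, height, color space, and pixel data all match.

```python
import hashlib

def deduplicate(objects: dict[int, bytes]) -> tuple[dict[int, bytes], dict[int, int]]:
    """Collapse byte-identical objects into a single surviving instance.

    objects maps a (hypothetical) object number to that object's bytes.
    Returns (kept, remap): kept holds the surviving objects, and remap
    sends every original object number to the number it should now
    refer to.
    """
    first_seen: dict[bytes, int] = {}   # digest -> surviving object number
    kept: dict[int, bytes] = {}
    remap: dict[int, int] = {}
    for num in sorted(objects):
        data = objects[num]
        digest = hashlib.sha256(data).digest()
        survivor = first_seen.get(digest)
        # Compare the bytes as well, so a hash collision can never merge
        # two images that are actually different.
        if survivor is not None and objects[survivor] == data:
            remap[num] = survivor
        else:
            first_seen.setdefault(digest, num)
            kept[num] = data
            remap[num] = num
    return kept, remap

# Three pages, each carrying its own copy of the same logo, plus one photo.
logo = b"logo-image-object" * 1000
photo = b"photo-image-object" * 1000
objects = {10: logo, 11: logo, 12: logo, 13: photo}
page_images = {1: 10, 2: 11, 3: 12}     # page number -> logo object number

kept, remap = deduplicate(objects)
page_images = {page: remap[num] for page, num in page_images.items()}
print(sorted(kept))     # [10, 13]
print(page_images)      # {1: 10, 2: 10, 3: 10}
```

Every page still draws the logo; they all just point at object 10 now. A real tool also has to rewrite the references inside each page’s resources, which I’ve reduced here to the `page_images` lookup.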
Cropping Clipped Images:
Another method of removing data from images without changing the visual fidelity is to crop away the image area outside of the clipping path. Page layout software that outputs PDF or allows you to save as PDF may insert an entire image file into the PDF even though you’ve cropped it down to only a small portion. The part of the image outside the visible area on the page is “clipped” rather than cropped. All the image data is there; you just can’t see it. By removing the hidden areas of the image, you can, again, remove significant amounts of data without changing the visual fidelity of the file at all.
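Working out which pixels survive is mostly coordinate bookkeeping. The sketch below is a simplification under a few assumptions of my own: the image is placed without rotation or skew, the clipping path is a plain rectangle, both rectangles are given in page points with y increasing upward, and image rows are stored top to bottom. The function name is hypothetical.

```python
import math

def visible_pixel_box(img_w_px, img_h_px, placed, clip):
    """Return (left, top, right, bottom) pixel bounds of the visible area.

    placed and clip are (x0, y0, x1, y1) rectangles in page points.
    right and bottom are exclusive. Returns None if the clip hides the
    whole image.
    """
    px0, py0, px1, py1 = placed
    # The visible region is the intersection of the placement and the clip.
    vx0, vy0 = max(px0, clip[0]), max(py0, clip[1])
    vx1, vy1 = min(px1, clip[2]), min(py1, clip[3])
    if vx0 >= vx1 or vy0 >= vy1:
        return None
    sx = img_w_px / (px1 - px0)     # pixels per point, horizontally
    sy = img_h_px / (py1 - py0)     # pixels per point, vertically
    # Round outward so no partially visible pixel is cut off.
    left = max(0, math.floor((vx0 - px0) * sx))
    right = min(img_w_px, math.ceil((vx1 - px0) * sx))
    # Flip y: page coordinates grow upward, image rows grow downward.
    top = max(0, math.floor((py1 - vy1) * sy))
    bottom = min(img_h_px, math.ceil((py1 - vy0) * sy))
    return left, top, right, bottom

# A 3000 x 2000 pixel image placed at 600 x 400 pt, clipped to its middle.
box = visible_pixel_box(3000, 2000, (0, 0, 600, 400), (150, 100, 450, 300))
print(box)   # (750, 500, 2250, 1500)
```

In this example only 1500 x 1000 of the original 3000 x 2000 pixels are ever visible, so three quarters of the image data can go. After cropping, the image’s placement on the page has to shrink to match the visible rectangle, or the smaller image will be stretched across the old area.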
I’m sure there are other image optimizations that can be used to streamline PDF files. What have you had success with? Comment and let us know.