Note: This is the third in a series of my articles addressing PDF Optimization. You can read the prior articles at the links below.
The PDF specification was made available by Adobe Systems in 1993 and it’s still the only file format that allows authors to publish documents that contain text, line art, images, audio, video, 3D models and interactive form fields with any expectation that their audience will be able to view them with a reasonable degree of fidelity. No other file format capable of displaying fully formatted documents has stood the test of time in the way that PDF has. Because of its longevity and the fact that multiple ISO standards for long-term storage have used PDF as their starting point, PDF, PDF viewers, and PDF developer tools have had to ride wave after wave after wave of technology changes. Just as the rings of a tree can tell us a lot about environmental changes year after year, far into the past, slicing through the PDF specification can tell us a lot about the technologies that came and went while PDF was adapting to its environment… Nothing shows this more clearly than fonts.
I’m not even going to try to explain how fonts are handled in PDF. I’ll let Wikipedia do it…
A font object in PDF is a description of a digital typeface. It may either describe the characteristics of a typeface, or it may include an embedded font file. The latter case is called an embedded font while the former is called an unembedded font. The font files that may be embedded are based on widely used standard digital font formats: Type 1 (and its compressed variant CFF), TrueType, and (beginning with PDF 1.6) OpenType. Additionally, PDF supports the Type 3 variant in which the components of the font are described by PDF graphic operators.
And there’s font encodings… more from Wikipedia…
Within text strings, characters are shown using character codes (integers) that map to glyphs in the current font using an encoding. There are a number of predefined encodings, including WinAnsi, MacRoman, and a large number of encodings for East Asian languages, and a font can have its own built-in encoding. (Although the WinAnsi and MacRoman encodings are derived from the historical properties of the Windows and Macintosh operating systems, fonts using these encodings work equally well on any platform.) PDF can specify a predefined encoding to use, the font’s built-in encoding or provide a lookup table of differences to a predefined or built-in encoding. The encoding mechanisms in PDF were designed for Type 1 fonts, and the rules for applying them to TrueType fonts are complex. For large fonts or fonts with non-standard glyphs, the special encodings Identity-H (for horizontal writing) and Identity-V (for vertical) are used. With such fonts it is necessary to provide a ToUnicode table if semantic information about the characters is to be preserved.
The emphasis was mine.
To say that the rules for encoding fonts are complex is an understatement. I remember the fun we all had at Adobe when certain viewer optimizations ran headfirst into the introduction of the Euro symbol to system-level TrueType fonts. A quick Google search shows that many PDF libraries and tools are still having trouble handling the Euro… insert your own Brexit joke here.
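For the curious, here is a rough Python sketch of the “lookup table of differences” idea from the quote above. The base table is a made-up three-entry stand-in, not a real predefined encoding, and `apply_differences` is a name I invented for illustration. The only part taken from the PDF specification is the shape of the Differences array: a number sets the current character code, and each glyph name that follows takes the next code in sequence.

```python
def apply_differences(base, differences):
    """Return a copy of base (code -> glyph name) with a Differences array applied."""
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            # A number resets the code that the next glyph name will occupy.
            code = item
        else:
            if code is None:
                raise ValueError("Differences array must start with a character code")
            encoding[code] = item
            code += 1  # consecutive names take consecutive codes
    return encoding


# Toy base table (an assumption for illustration, not WinAnsi or MacRoman).
base = {65: "A", 66: "B", 67: "C"}

# "Starting at code 66, use the glyphs Euro and fi instead."
custom = apply_differences(base, [66, "Euro", "fi"])
print(custom)  # {65: 'A', 66: 'Euro', 67: 'fi'}
```

Even in this tiny model, the same byte in a text string can mean “B” in one font and “Euro” in another, which is roughly where the trouble starts.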
Font subsetting adds another layer of complexity. Fonts are big, so one way to optimize a file is to embed only the characters that are used in the file; only the ones necessary to reproduce the document on screen or in print. Generally, this isn’t a problem. Assuming you are using a good PDF tool to create your files, there’s nothing you can do to a PDF with a fully embedded font that you can’t do with a PDF containing fonts that have been properly subset. But…
…there’s always a “but.”
When developers try to combine PDF files that contain different subsets of the same font, the impulse is to combine those subsets to save space. Sometimes this will work, sometimes it won’t… and sometimes it will appear to work in that the tool you are using doesn’t throw an exception, but the results are undesirable. All too often the only measure of the quality of a PDF creation tool is whether the resulting file looks good in Adobe Reader. Oftentimes, the developers of these tools cut corners where the dictionary descriptions in the PDF specification have the term “optional” in them. Technically, the optional dictionaries are, in fact, optional… if all you want to do with the file is view it. If you want to perform other operations like merging it with another file, extracting the text to convert it to another format, creating a search index, or redacting the file, you really need more information about the characters than just what is strictly “required”.
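Here is a toy Python model of why combining subsets is harder than it looks. Everything in it is an assumption made for illustration: a “subset” is just a dictionary from character code to character, the pretend subsetter hands out codes in order of first use, and the function names are mine. Real fonts and real tools are far messier, but the shape of the problem is the same.

```python
def make_subset(text):
    """Toy subsetter: number each distinct character 1, 2, 3... in order of first use."""
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
    return {code: ch for ch, code in code_for.items()}


def find_conflicts(a, b):
    """Codes that both subsets use, but for different characters."""
    return {code for code in a.keys() & b.keys() if a[code] != b[code]}


def merge_subsets(a, b):
    """Merge b into a. Returns (merged, remap), where remap says how each
    code in b's document must be rewritten to stay correct after the merge."""
    merged = dict(a)
    code_for = {ch: code for code, ch in merged.items()}
    next_code = max(merged, default=0) + 1
    remap = {}
    for code, ch in sorted(b.items()):
        if ch not in code_for:
            # Character is new to the merged font: give it a fresh code.
            code_for[ch] = next_code
            merged[next_code] = ch
            next_code += 1
        remap[code] = code_for[ch]
    return merged, remap


doc_a = make_subset("PDF")   # {1: 'P', 2: 'D', 3: 'F'}
doc_b = make_subset("FDA")   # {1: 'F', 2: 'D', 3: 'A'}

conflicts = find_conflicts(doc_a, doc_b)       # {1, 3}
merged, remap = merge_subsets(doc_a, doc_b)
print(merged)  # {1: 'P', 2: 'D', 3: 'F', 4: 'A'}
print(remap)   # {1: 3, 2: 2, 3: 4}
```

In this toy case, codes 1 and 3 mean different characters in the two files, so simply pooling the glyphs would scramble the second document’s text. The safe merge only works because every code can be traced back to a character, and that is exactly the kind of information that tends to live in the “optional” parts of a file.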
Fonts are complicated. If your PDF Optimization tool is capable of combining font subsets and you’re not getting the results you expect, it may, in fact, be a problem with the software… but it might be a problem with the input files… it might even be both. And because fonts are complicated, it might take a hardcore PDF expert to determine where the problem lies… luckily, I have access to several here at Datalogics.
Do you have a font horror story related to combining documents? Did it ever get resolved? Tell us your story in the comments section below. And we’re always looking for really odd test files; share them with us if you can… not in the comments section, though.