I remember the first time I heard about PDF. A co-worker dropped a stack of 3.5 inch floppies on my desk and said “Adobe’s about to change the world. Install this and let me know what you think.” I worked for Xerox at the time, and the idea of sending electronic paper rather than FedEx-ing paper copies or faxes was going to be a problem for us. So I installed the software (it was called Carousel at the time, but we now know it as Adobe Acrobat). I didn’t have a manual or any kind of guide, but I eventually discovered that I could convert a PostScript file to PDF using the Distiller application. So I opened Word, found a document I had been working on, printed it to the PostScript driver as a file and then ran that through the Distiller. I opened the file in the viewer and… well… to be frank… I was unimpressed. “Why would anyone do this?”
Note that several months ago, I wrote up a sample app which recreated Acrobat’s Font list in the Document Properties dialog. That post is now part of a series in which we use APDFL to extract or recreate the information contained in Adobe Acrobat’s Document Properties tab.
I’m going to skip going over how to extract File(name), Location and File Size as you don’t really need APDFL to get that information. But otherwise, let’s proceed from the top down in order, starting with the first four at the top:
Nothing too difficult here. I use PDDocGetInfoASText mainly because these fields could contain Unicode text, and shoving it into an ASText variable makes it easier to handle, even if all I’m doing is converting it to UTF-8 for extraction purposes.
If you don’t need Unicode text extraction (for example, when you are extracting date properties), the following also works:
Note that I could have parsed the date string and formatted it per the current locale, but I’m lazy (or rather, that’s outside the scope of APDFL per se).
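For anyone who does want to go further with the date string, here is a minimal sketch in plain C of parsing the PDF date format (`D:YYYYMMDDHHmmSSOHH'mm'`, where everything after the year is optional). It doesn’t call APDFL at all. `PdfDate`, `read_digits` and `parse_pdf_date` are hypothetical names made up for illustration, and treating a missing time zone as UTC is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical container for the parsed fields; not an APDFL type. */
typedef struct {
    int year, month, day, hour, minute, second;
    int utc_offset_minutes;   /* signed offset of local time from UTC */
} PdfDate;

/* Reads exactly `digits` decimal digits at *p; advances *p only on success. */
static bool read_digits(const char **p, int digits, int *value)
{
    int v = 0;
    for (int i = 0; i < digits; i++) {
        if (!isdigit((unsigned char)(*p)[i]))
            return false;
        v = v * 10 + ((*p)[i] - '0');
    }
    *p += digits;
    *value = v;
    return true;
}

/* Parses "D:YYYYMMDDHHmmSSOHH'mm'". Returns false on malformed input. */
bool parse_pdf_date(const char *s, PdfDate *out)
{
    assert(out != NULL);
    if (s == NULL)
        return false;
    if (strncmp(s, "D:", 2) == 0)
        s += 2;

    /* Defaults for omitted fields: January 1st, midnight. */
    PdfDate d = { 0, 1, 1, 0, 0, 0, 0 };
    if (!read_digits(&s, 4, &d.year))
        return false;

    /* Later fields are optional; stop at the first one that is absent. */
    if (read_digits(&s, 2, &d.month) && read_digits(&s, 2, &d.day) &&
        read_digits(&s, 2, &d.hour) && read_digits(&s, 2, &d.minute))
        (void)read_digits(&s, 2, &d.second);

    if (d.month < 1 || d.month > 12 || d.day < 1 || d.day > 31 ||
        d.hour > 23 || d.minute > 59 || d.second > 59)
        return false;

    if (*s == '+' || *s == '-') {
        int sign = (*s == '-') ? -1 : 1;
        int hh = 0, mm = 0;
        s++;
        if (!read_digits(&s, 2, &hh))
            return false;
        if (*s == '\'')
            s++;
        (void)read_digits(&s, 2, &mm);   /* offset minutes are optional */
        d.utc_offset_minutes = sign * (hh * 60 + mm);
    }
    /* 'Z' or no suffix leaves the offset at 0 (assumed UTC in this sketch). */

    *out = d;
    return true;
}
```

From there, the fields could be copied into a `struct tm` and handed to `strftime` for locale-aware formatting.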
For Application and PDF Producer, I’m pulling these properties directly out of the XMP metadata embedded in the file using the PDDocGetXAPMetadataProperty call… if the metadata stream is actually in the file. The reason you might want to use this call instead of PDDocGetInfoASText is to extract other metadata that PDDocGetInfo doesn’t know about, such as the PDF/A or PDF/UA flags.
Next up is checking the PDF version, the corresponding version of Acrobat that can open that file, and the Adobe Extension Level. The PDF format has been stuck at version 1.7 for the past decade waiting for Godot (that is, for the ISO 32000 committee to finalize PDF version 2.0), so Adobe snuck a few new features into PDF by declaring them to be Adobe extensions. The extension levels map to unofficial PDF versions. The code below matches Acrobat’s secret decoder ring:
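As a rough illustration of that kind of lookup, here is a small sketch in plain C that does not call APDFL. The function names are hypothetical, and the table is an assumption based on the commonly documented pairings (PDF 1.0 through 1.7 with Acrobat 1 through 8, Extension Level 3 and 5 with Acrobat 9, Extension Level 8 with Acrobat X), so it may not match Acrobat’s own strings exactly. It also assumes the extension’s base version is 1.7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns the Acrobat major version usually associated
   with a PDF version and Adobe Extension Level, or 0 when this table has
   no entry. The pairings are an assumption, not Acrobat's actual lookup. */
int acrobat_major_for_pdf(int major, int minor, int ext_level)
{
    if (major != 1 || minor < 0 || minor > 7)
        return 0;                 /* e.g. PDF 2.0: unknown to this table */
    if (minor == 7 && ext_level >= 8)
        return 10;                /* Extension Level 8: Acrobat X */
    if (minor == 7 && ext_level >= 3)
        return 9;                 /* Levels 3 and 5: Acrobat 9.0 and 9.1 */
    return minor + 1;             /* PDF 1.0..1.7: Acrobat 1..8 */
}

/* Hypothetical helper: writes a display line such as "1.4 (Acrobat 5.x)"
   into `out` and returns the snprintf result. */
int format_pdf_version(int major, int minor, int ext_level,
                       char *out, size_t out_size)
{
    int acrobat = acrobat_major_for_pdf(major, minor, ext_level);

    assert(out != NULL && out_size > 0);
    if (acrobat == 0)
        return snprintf(out, out_size, "%d.%d", major, minor);
    if (ext_level > 0)
        return snprintf(out, out_size,
                        "%d.%d, Adobe Extension Level %d (Acrobat %d.x)",
                        major, minor, ext_level, acrobat);
    return snprintf(out, out_size, "%d.%d (Acrobat %d.x)",
                    major, minor, acrobat);
}
```

The version numbers and extension level would come from the document itself; here they are just integer parameters so the mapping can be read on its own.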
A little-known feature of the Description tab is that it will provide page size information about the current page. While you could calculate this from the page CropBox, there are a couple of other factors that could come into play. In the code below, since we don’t have a current page, we’ll just grab the information from the first page.
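To make the arithmetic concrete, here is a sketch in plain C with no APDFL calls. It assumes the “other factors” are the page’s Rotate value and its UserUnit, which is my guess rather than something stated above, and the `Box` and `PageSizeInches` types and the `page_size_inches` function are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not APDFL structures. */
typedef struct { double left, bottom, right, top; } Box;
typedef struct { double width_in, height_in; } PageSizeInches;

/* Converts a crop box in default user space units to a displayed page size
   in inches. Assumes Rotate and UserUnit are the extra factors that matter. */
void page_size_inches(const Box *crop, int rotate_degrees, double user_unit,
                      PageSizeInches *out)
{
    assert(crop != NULL && out != NULL);

    double w = crop->right - crop->left;
    double h = crop->top - crop->bottom;
    if (w < 0.0) w = -w;          /* tolerate boxes with swapped corners */
    if (h < 0.0) h = -h;

    if (user_unit <= 0.0)
        user_unit = 1.0;          /* UserUnit defaults to 1.0 (1/72 inch) */

    /* Normalize the rotation to 0..359, then swap sides for 90 and 270. */
    int r = ((rotate_degrees % 360) + 360) % 360;
    if (r == 90 || r == 270) {
        double tmp = w;
        w = h;
        h = tmp;
    }

    out->width_in = w * user_unit / 72.0;
    out->height_in = h * user_unit / 72.0;
}
```

A US Letter crop box of 0 0 612 792 with no rotation works out to 8.5 by 11 inches, and the same box rotated 90 degrees reports 11 by 8.5.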
Grabbing the number of pages is one call:
Determining whether the document is a tagged PDF, however, is a bit more complicated. I didn’t find a good call or flag for it, so I had to drop to the Cos level to find the information. It’s also slightly more complicated than the PDF Reference makes it out to be, because the document needs both a StructTreeRoot and a MarkInfo:Marked entry set to true in order for Acrobat to consider it a tagged PDF:
Lastly, Fast Web View means that the document is linearized: the first page that gets opened when viewing the document (which isn’t necessarily page 1) sits at the very beginning of the file along with all the resources it needs to display, so that page could be displayed while the rest of the document was slowly being downloaded over a 56k modem.
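Outside of APDFL, a quick heuristic for this is to look for the linearization dictionary, which the PDF specification places within the first 1024 bytes of a linearized file. The sketch below is plain C, not the library’s method, and `looks_linearized` is a hypothetical name. It only checks that the `/Linearized` key is present and does not verify that the recorded file length still matches, so a file that was linearized and later incrementally updated would still pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true when the first 1024 bytes of a PDF
   (passed in `head`, with `len` bytes available) contain "/Linearized".
   This is a heuristic sketch, not a full linearization check. */
bool looks_linearized(const unsigned char *head, size_t len)
{
    static const char key[] = "/Linearized";
    const size_t key_len = sizeof key - 1;
    const size_t limit = len < 1024 ? len : 1024;

    assert(head != NULL || len == 0);
    if (limit < key_len)
        return false;

    /* Plain byte scan; PDF headers may contain binary comment bytes. */
    for (size_t i = 0; i + key_len <= limit; i++) {
        if (memcmp(head + i, key, key_len) == 0)
            return true;
    }
    return false;
}
```

In practice you would read the first kilobyte of the file into a buffer and pass it in.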
And that’s that. Full code is available here.
Sample of the Week:
I have a real love/hate relationship with PDF Portfolios. On the one hand, they are a brilliant way to package multiple files, PDF or otherwise, in a single secure package and send them around. On the other hand, you can’t rely on the recipients having a consistent experience… even across the Adobe viewers. It’s really annoying. PDF’s popularity and longevity are due in large part to its ability to reliably communicate documents across platforms and viewers and maintain the visual fidelity; it looks like what the author intended. Even in the worst PDF viewers, the visual fidelity is preserved even if you can’t work with a form or comment on the file or add the signature that you were requested to add. But there is a way out… PDF Portfolios are really just an extension of the old style PDF Packages which were much simpler but far more consistent in their behavior.
Note: This is the third in a series of my articles addressing PDF Optimization. You can read the prior articles at the links below.
The PDF specification was made available by Adobe Systems in 1993 and it’s still the only file format that allows authors to publish documents that contain text, line art, images, audio, video, 3D models and interactive form fields with any expectation that their audience will be able to view them with a reasonable degree of fidelity. No other file format capable of displaying fully formatted documents has stood the test of time in the way that PDF has. Because of its longevity and the fact that multiple ISO standards for long-term storage have used PDF as their starting point, PDF, PDF viewers, and PDF developer tools have had to ride wave after wave after wave of technology changes. Just as the rings of a tree can tell us a lot about environmental changes far into the past, year after year, slicing through the PDF specification can tell us a lot about the technologies that came and went while PDF was adapting to its environment… Nothing shows this more clearly than fonts.