Parsing Adobe PDF Library Header Files for Documentation (4 of 4)

In this final installment, we put everything together and actually generate a PDF document from the Adobe PDF Library’s header files.

So here it is:

There’s more to it, of course. In part 1 we focused on lexing APDFL’s C header files. Then in part 2, we parsed the results into a set of related classes. And in part 3, we took those related classes and normalized them into a nested table structure in preparation for this last part: committing the information to a PDF page.

However, while we’ve been building up the pieces to place marks on a PDF page, we still need to add navigational aids to all of this information: PDF bookmarks, a Table of Contents, page headers, and page labels. If we’re going to do this, we might as well pull out all of the stops. So let’s take a look at the code that does all of that: generatePDF().

generatePDF() has five parts:

I. Initialization

Initialization consists mostly of setting document metadata and instantiating the fonts that are going to be used later.

And we create the title page as the first page of the document.

II. Nested Iteration

Here we do two things as we iteratively delve deeper through the API layers and Object types down to the individual pieces of the API. We create a bookmark tree for the individual pieces of the API, and we insert front matter before each layer (a title page for the layer) or subsection (a summary of the Definitions/Types/Functions of an APDFL Object).

LayoutSummary is essentially a variation of the Table of Contents layout procedure discussed below and in a previous blog article, but without a practice pass. The call fits entries into a multi-column layout.

Because this summary is laid out before the content it points to, we use named destinations, with the actual page destinations to be determined later. Because named destinations aren’t directly supported, we create a GoToAction with a dummy ViewDestination, which we immediately overwrite on the next line in the Action’s dictionary.
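To make the late-binding idea concrete, here is a minimal sketch that models the action as a plain Python dictionary rather than through the APDFL interface. The `S` and `D` keys follow the PDF GoTo action layout; the function names, the destination name, and the page numbers are made up for illustration and are not the code from generatePDF().

```python
def make_summary_link(dest_name):
    """Build a GoTo action that points at a destination by name (hypothetical helper)."""
    # Start with a throwaway explicit destination (page 0, XYZ view) ...
    action = {"S": "GoTo", "D": [0, "XYZ", 0, 792, None]}
    # ... then overwrite the D entry with the destination's name.
    action["D"] = dest_name
    return action


def resolve(action, named_destinations):
    """Look up where an action lands, following a name if D holds one."""
    dest = action["D"]
    if isinstance(dest, str):
        return named_destinations[dest]
    return dest


# The summary is laid out first, so the link exists before its target does.
link = make_summary_link("PDDocCreate")
named_destinations = {}

# Later, when the entry itself is placed, the name finally gets a real page.
named_destinations["PDDocCreate"] = [57, "XYZ", 36, 700, None]
```

The point of the indirection is that the link never has to be revisited: only the name table changes once the entry's final page is known.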

The key to the multi-column layout is the rectlist parameter, which represents areas of the page to sequentially fill with content. Once we get to the bottom of one rect, we start with the top of the next one until we run out of boxes. At that point, we are done with the page and restart with a new page and from the top of the list.
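The rect-filling behavior described above can be sketched roughly as follows. This is plain Python with no APDFL calls; `Rect`, `flow_entries`, and the idea of knowing each entry's height up front are assumptions made for the sketch, not the actual LayoutSummary signature.

```python
from collections import namedtuple

# Page coordinates with the origin at the bottom-left, as in PDF.
Rect = namedtuple("Rect", "left bottom right top")


def flow_entries(entry_heights, rectlist):
    """Return (page, rect index, top y) for each entry, filling rects in order."""
    tallest = max(r.top - r.bottom for r in rectlist)
    placements = []
    page, rect_index = 0, 0
    cursor = rectlist[0].top
    for height in entry_heights:
        if height > tallest:
            raise ValueError("entry is taller than every rect")
        # Move on until the entry fits above the bottom of the current rect.
        while cursor - height < rectlist[rect_index].bottom:
            rect_index += 1
            if rect_index == len(rectlist):
                # Out of boxes: new page, back to the top of the list.
                page += 1
                rect_index = 0
            cursor = rectlist[rect_index].top
        placements.append((page, rect_index, cursor))
        cursor -= height
    return placements


# Two 100-point-tall columns; five 40-point entries.
columns = [Rect(36, 36, 300, 136), Rect(312, 36, 576, 136)]
placements = flow_entries([40, 40, 40, 40, 40], columns)
```

With these numbers, two entries land in each column and the fifth spills onto the second page, back in the first column.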

Once we are done with the section summary, it’s time to lay out the actual content. The big concern here is whether the entry will fit into the space we have left on the page. First we check whether the entry can fit on a full page. If it is too tall, we split the entry into a piece that will fit and the rest. We place the piece that fits, along with a bookmark, and start a new page for the rest. Lather, rinse, repeat until we run out of pieces that don’t fit. If the entry (or the remaining piece of it) is larger than the space that remains, it gets a new page. Lather, rinse, repeat with the next entry.
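A rough sketch of that fit-or-split decision, reduced to heights only. It assumes the first piece of an over-tall entry fills whatever space is left on the current page, which is one reading of the description; `paginate` and its return shape are hypothetical.

```python
def paginate(entry_heights, page_height):
    """Return a list of pages; each page is a list of (entry index, piece height)."""
    pages = [[]]
    space_left = page_height
    for index, height in enumerate(entry_heights):
        # Too tall for even a full page: peel off pieces that fit.
        while height > page_height:
            if space_left <= 0:
                pages.append([])
                space_left = page_height
            pages[-1].append((index, space_left))
            height -= space_left
            pages.append([])
            space_left = page_height
        # The entry (or what remains of it) now fits on a full page,
        # but maybe not in the space that is left on this one.
        if height > space_left:
            pages.append([])
            space_left = page_height
        pages[-1].append((index, height))
        space_left -= height
    return pages


pages = paginate([30, 250, 80], 100)
```

Here the 250-point entry is split into pieces of 70, 100, and 80 points across three pages, and the last entry gets a page of its own because only 20 points remain.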

Unlike when we are creating the subsection summaries, we have a landing page for the bookmark, so no need to use the NameTree, but note that a NameTree bookmark will also be created as part of laying out the entry’s tokens (we skipped over this detail in part 3).

III. Page Headers

After all the entries have been laid out, then we are going to place information at the top of each page to help a reader orient themselves as to what layer and subsection they are in. For example:

That header information is captured in the branches of the bookmark tree we’ve been creating as the entries were being laid out.

So let’s delve a bit more into how we do that. We start from a flattened list of bookmarks in which each entry remembers its depth in the tree. The key to understanding this code is that most bookmark entries have a depth of 3 and are skipped over. It’s when we head back up to an indent level of 2 or less that we create a page header and place it on all the pages that we skipped over since the last time we placed a page header.

However, we don’t want to place a redundant page header on a layer title page. So if the indent level indicates a bookmark to a layer title page, we bump up by one the start of the page sequence to apply the page header to.
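The walk over the flattened list might look something like this. The sketch assumes depth 1 is a layer, depth 2 a subsection, and depth 3 an individual entry, and that each bookmark carries a zero-based page index; the tuple layout, the `assign_headers` name, and the header text format are all invented for illustration.

```python
def assign_headers(bookmarks, page_count):
    """Map page index -> header text from (depth, title, page) bookmarks."""
    headers = {}
    layer, subsection, start = None, None, None
    # A sentinel at the end flushes the final run of pages.
    for depth, title, page in bookmarks + [(0, None, page_count)]:
        if depth > 2:
            continue  # individual entries: skipped over
        if start is not None:
            text = layer if subsection is None else layer + " : " + subsection
            for skipped in range(start, page):
                headers[skipped] = text
        if depth == 1:
            layer, subsection = title, None
            start = page + 1  # no header on the layer title page itself
        elif depth == 2:
            subsection = title
            start = page
    return headers


bookmarks = [
    (1, "PD Layer", 0),
    (2, "PDDoc Definitions", 1),
    (3, "PDDocFlags", 1),
    (3, "PDDocInsertPagesFlags", 2),
    (2, "PDDoc Functions", 3),
    (3, "PDDocCreate", 3),
    (1, "Cos Layer", 5),
    (2, "CosObj Types", 6),
    (3, "CosObj", 6),
]
headers = assign_headers(bookmarks, 8)
```

Pages 0 and 5 are the layer title pages in this toy data, so they end up with no header at all.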

Creating the page header’s Form isn’t all that exciting but one interesting detail is that we are actually wrapping it in a Container, which we are marking as an Artifact.
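At the content-stream level, marking something as an Artifact amounts to bracketing its drawing operators in a marked-content sequence. The sketch below builds that text by hand purely to show the shape; it is not how the Container object is used in generatePDF(), and the `/HeaderForm` resource name and the helper name are assumptions.

```python
def wrap_as_artifact(content_ops, subtype="Header"):
    """Bracket drawing operators in an /Artifact marked-content sequence."""
    return "/Artifact <</Type /Pagination /Subtype /%s>> BDC\n%s\nEMC" % (
        subtype,
        content_ops,
    )


# Paint a (hypothetical) header Form XObject near the top of a Letter page.
header_ops = "q 1 0 0 1 36 750 cm /HeaderForm Do Q"
marked = wrap_as_artifact(header_ops)
```

Tagging the header this way tells consumers such as screen readers that it is page furniture rather than part of the document's real content.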


IV. Table of Contents

Conceptually, this is essentially a re-implementation of the Table Of Contents from Bookmarks code I previously blogged about, but in IronPython rather than C#.

I did tweak the implementation by splitting the bookmark title into words, using space delimiters, before the loops that position the individual entry titles. Instead of searching for the IndexOf a space and extracting words with substrings, all that work is delegated to a builtin function. It also helps that most entries are going to be a single word.
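As a small illustration of that tweak, here is a sketch of word-based line breaking built on `str.split()`. It uses a character count as a stand-in for measured text width, which the real layout code would get from font metrics; `wrap_title` is a hypothetical name.

```python
def wrap_title(title, max_chars):
    """Break a bookmark title into lines of at most max_chars where possible."""
    words = title.split()  # the builtin does all the space hunting
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if current and len(candidate) > max_chars:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


single = wrap_title("PDDocCreate", 10)
multi = wrap_title("Page Object Definitions and Types", 16)
```

A single long word is never broken; it simply occupies a line of its own, which matches the common case where an entry title is one identifier.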

So let’s take a look at the result.

This is a page from the Table of Contents showing the PDDoc Definitions:

Note that each line is a hyperlink to the corresponding entry. Also note that the page numbers have a ‘PD-’ prefix; we’ll discuss that below.

This is the summary of the PDDoc Definitions:

Only the first term, in blue, is a hyperlink to the entry. But it also contains a summary sentence describing the entry (if available).  That can be handy when trying to figure out what pieces you need. Sometimes.

Finally, we have the individual entries, with full information, a half dozen of which fit onto a single page in this case.

V. Page Labels

As part of pulling out all of the stops, every layer gets its own prefix, except for the table of contents, which uses roman numerals only.
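To show how such labels resolve, here is a sketch that models each page-label range as a (first page index, style, prefix) tuple, with "r" for lowercase roman and "D" for decimal as in the PDF page-label dictionary. The range boundaries and the ‘Cos-’ prefix are invented; only ‘PD-’ comes from the generated document.

```python
import bisect


def to_roman(number):
    """Lowercase roman numeral for a positive integer."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = ""
    for value, symbol in numerals:
        while number >= value:
            out += symbol
            number -= value
    return out


def page_label(ranges, page_index):
    """Label for a zero-based page index, given ranges sorted by first page."""
    starts = [first for first, _style, _prefix in ranges]
    position = bisect.bisect_right(starts, page_index) - 1
    if position < 0:
        raise ValueError("no page-label range covers this page")
    first, style, prefix = ranges[position]
    number = page_index - first + 1  # each range restarts at 1
    return prefix + (to_roman(number) if style == "r" else str(number))


ranges = [(0, "r", ""), (50, "D", "PD-"), (120, "D", "Cos-")]
labels = [page_label(ranges, i) for i in (3, 50, 119, 121)]
```

Each range restarts its numbering at 1, so the first page of the PD layer reads ‘PD-1’ regardless of where it sits in the file.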

We created the page labels as part of the nested iteration phase. The one gotcha is that after inserting a whole bunch of pages for the table of contents, the page on which each page-label range starts is off (by about 50 pages or so). So we have to go back and bump up the start page of each page label entry.
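The fix-up itself is simple arithmetic. A minimal sketch, again with ranges modeled as (first page index, style, prefix) tuples and with made-up page counts and helper name:

```python
def shift_page_labels(ranges, insert_at, inserted_pages):
    """Bump the start of every range that begins at or after the insertion point."""
    return [
        (first + inserted_pages if first >= insert_at else first, style, prefix)
        for first, style, prefix in ranges
    ]


# Ranges as recorded during layout, before the table of contents existed.
before = [(0, "D", ""), (1, "D", "PD-"), (71, "D", "Cos-")]

# Fifty table-of-contents pages are inserted right after the title page.
after = shift_page_labels(before, insert_at=1, inserted_pages=50)
```

Ranges that start before the insertion point, such as the title page here, are left alone.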


Thank you for your patience as we slowly built up a large-ish program for generating a PDF document. The documentation that I generated with this program has helped me understand APDFL’s structure a bit better. It has also helped me answer questions that would have been more challenging to answer without it.

The ability to have APDFL documented in a single, organized, but also easily searchable file is very powerful. At minimum, it helps expand the horizons of what is possible with APDFL.
