Parsing Adobe PDF Library Header Files for Documentation (3 of 4)

In our last two installments, we first lexed APDFL’s C headers so that we could then parse the C declarations and the C Preprocessor defines and macros into subclasses of a ‘Defn’ class. Then, we sorted them into logical groupings through the use of magical heuristics, with a brute-force override for those cases where the magic didn’t work.

The next step is to organize the collected information so that it can be laid out on a page.

Now, the easiest way to do this would be a single-column layout, as Adobe did, but that comes at the cost of wasted space on the page. One of the design goals is to reduce the wasted space compared to Adobe’s documentation, because we are going to have a lot of pages. Plus, some of the documentation blocks contain (mostly simple) HTML tables, so we are going to need to be able to do a little bit of table layout. Since we are in for an inch, we might as well be in for a mile and make everything a nested table.

To do that, we are going to define three classes: LayoutTable, which is a two-dimensional collection of LayoutCells; LayoutCell; and LayoutToken. LayoutTable is a subclass of LayoutCell. If a LayoutCell is not a LayoutTable, then it’s a one-dimensional collection of LayoutTokens.
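
To make that class relationship concrete, here is a minimal Python sketch of the three classes. The field names (text, font, size, link, tokens, rows) and the add/add_row helpers are assumptions made up for illustration, not the actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LayoutToken:
    """One run of text in a single font; may link to another entry."""
    text: str
    font: str = "body"          # hypothetical font name
    size: float = 10.0
    link: Optional[str] = None  # name of the Defn this token points at, if any


@dataclass
class LayoutCell:
    """A one-dimensional collection of layout tokens."""
    tokens: List[LayoutToken] = field(default_factory=list)

    def add(self, token: LayoutToken) -> "LayoutCell":
        self.tokens.append(token)
        return self


@dataclass
class LayoutTable(LayoutCell):
    """A two-dimensional collection of cells. It is itself a cell, so tables nest."""
    rows: List[List[LayoutCell]] = field(default_factory=list)

    def add_row(self, *cells: LayoutCell) -> "LayoutTable":
        self.rows.append(list(cells))
        return self


# A table whose second row holds another table.
inner = LayoutTable().add_row(
    LayoutCell().add(LayoutToken("a")),
    LayoutCell().add(LayoutToken("b")),
)
outer = LayoutTable().add_row(LayoutCell().add(LayoutToken("Header"))).add_row(inner)
```

Because a LayoutTable is a LayoutCell, anything that accepts a cell (a row, for instance) also accepts a whole table, which is what lets everything become a nested table.
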
So let’s generate one table for each Defn object:

Each of the six functions called is going to have some variation based on what type of Defn is being processed, but they all have some commonality. So, let’s take a look at the simplest type first, which will generate output like this:

Basically, we create a common preface, a syntax entry, and some additional information detailing where the type is used and returned from. Working backwards:

Here we look at three dictionaries and if they have any entries, then we add another two rows to the DefnEntry Table. Note that DefnEntry is a single-column table. Process_funcNameList, however, returns a table that can have multiple columns.
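
The "two more rows per non-empty dictionary" idea can be sketched roughly as follows. Plain lists stand in for the table classes, and the captions, the lookup dictionaries and the names func_name_grid and add_usage_rows are all hypothetical.

```python
def func_name_grid(names, columns=3):
    """Arrange names into rows of at most `columns` entries (a multi-column table)."""
    names = sorted(names)
    return [names[i:i + columns] for i in range(0, len(names), columns)]


def add_usage_rows(entry_rows, type_name, lookups):
    """Append a caption row and a name-list row for every lookup that mentions the type."""
    for caption, lookup in lookups:
        users = lookup.get(type_name)
        if not users:
            continue  # nothing recorded for this type: add no rows
        entry_rows.append([caption])                # row 1: what the list means
        entry_rows.append([func_name_grid(users)])  # row 2: one cell holding a nested table
    return entry_rows


# Hypothetical lookups; the real dictionaries and captions are not shown here.
lookups = [
    ("Used as a parameter in", {"MyType": ["FuncC", "FuncA", "FuncD", "FuncB"]}),
    ("Returned from", {}),
    ("Member of", {"MyType": ["StructA"]}),
]
rows = add_usage_rows([["preface"], ["syntax"]], "MyType", lookups)
```

Every row of the entry still has exactly one cell, so the entry stays a single-column table even though one of those cells contains a multi-column table.
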

Each name in the list will be a hyperlink to another entry in the documentation. Also, note that the creation of table rows of data cells is HTML-like.

Backtracking a bit, processSyntax is basically about formatting with a proportional font. If we encounter a token that is the name of another Defn, we make it a hyperlink. Lastly, we tweak the spacing after punctuation marks.
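
As a rough illustration of those two jobs (linking known names and adjusting spacing around punctuation), here is a small sketch. The spacing rules, the link_map shape and the name process_syntax_sketch are assumptions, not the real processSyntax.

```python
NO_SPACE_AFTER = {"(", "*"}              # never follow these with a space
NO_SPACE_BEFORE = {",", ";", ")", "("}   # never put a space in front of these


def process_syntax_sketch(tokens, link_map):
    """Return (text, link) runs: known Defn names get a link, spaces are inserted by rule."""
    out = []
    for i, tok in enumerate(tokens):
        out.append((tok, link_map.get(tok)))  # link is None for ordinary tokens
        if i + 1 == len(tokens):
            break
        nxt = tokens[i + 1]
        if tok not in NO_SPACE_AFTER and nxt not in NO_SPACE_BEFORE:
            out.append((" ", None))
    return out


# Hypothetical declaration and link map.
tokens = ["MyBool", "DoThing", "(", "MyDoc", "doc", ",", "int", "n", ")", ";"]
runs = process_syntax_sketch(tokens, {"MyBool": "#MyBool", "MyDoc": "#MyDoc"})
text = "".join(t for t, _ in runs)
linked = [t for t, link in runs if link]
```

Here text comes out as "MyBool DoThing(MyDoc doc, int n);" and only MyBool and MyDoc carry links.
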

Now the Full Preface consists of a header row with the name of the entry and the header it came from, then any Deprecation Notes, then the entry’s description, then any special notes. If there are any related methods listed for this entry, they are placed beside the description in a second column on the right side.

Note the process_htmBlock call:

Handling HTML

process_htmBlock is a fairly long and intricate Finite State Machine for processing sequences of text interspersed with a subset of simple HTML tags.

That list of tags expands to a list of tokens that the FSM knows about; any additional HTML tags would have to be worked into this routine. Otherwise, we are going to assume that all that follows will fit into an ordinary LayoutCell (until we come across token tags for a table).

Now we deal with Text tokens. Some Text tokens will have names of functions along with suffixes like parentheses or ending punctuation. We still want those function names to be links. So we look for those sorts of suffixes with regular expressions in such a way that we can remove the suffixes and use the remaining term to search our link map. The other thing to look out for is prefixes, which have white-space implications. And there are certain non-text tags which we actually treat as text.
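
The suffix handling could look something like this sketch. The exact pattern, the link_map dictionary and the function name split_linkable are illustrative assumptions; the real regular expressions may differ.

```python
import re

# An identifier, optionally followed by "()" and then trailing punctuation.
SUFFIX_RE = re.compile(r"^(?P<name>[A-Za-z_]\w*)(?P<suffix>(?:\(\))?[.,;:!?)]*)$")


def split_linkable(text, link_map):
    """Return (name, suffix, link) when the stripped name is a known entry,
    otherwise (text, "", None)."""
    m = SUFFIX_RE.match(text)
    if m and m.group("name") in link_map:
        name = m.group("name")
        return name, m.group("suffix"), link_map[name]
    return text, "", None


link_map = {"DoThing": "#DoThing"}           # hypothetical link map
hit = split_linkable("DoThing().", link_map)
miss = split_linkable("nothing,", link_map)
```

The link covers only the name; the suffix is emitted right after it as ordinary, unlinked text.
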

I’m going to skip over tags which essentially modify the font and font size being used. I’m assuming that they are properly paired and just pushing and popping the font information, treating such tags as being LIFO. The interesting part of the code is dealing with HTML Table tags.
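
For completeness, the push-and-pop treatment of font tags can be sketched in a few lines. The tag set, the font names and the name apply_font_tags are assumptions, and the sketch relies on the same premise that tags are properly paired.

```python
# How each opening tag changes the current (family, size); all values are hypothetical.
OPEN_TAGS = {
    "<b>": lambda family, size: (family + "-Bold", size),
    "<code>": lambda family, size: ("Courier", size),
    "<small>": lambda family, size: (family, size - 2),
}
CLOSE_TAGS = {"</b>", "</code>", "</small>"}


def apply_font_tags(tokens, base=("Helvetica", 10)):
    """Pair every text token with the font in effect, using a LIFO stack."""
    stack = [base]
    out = []
    for tok in tokens:
        if tok in OPEN_TAGS:
            stack.append(OPEN_TAGS[tok](*stack[-1]))  # push the modified font
        elif tok in CLOSE_TAGS:
            if len(stack) > 1:                        # never pop the base font
                stack.pop()
        else:
            out.append((tok, stack[-1]))
    return out


styled = apply_font_tags(["See", "<code>", "foo", "</code>", "now"])
nested = apply_font_tags(["<b>", "<small>", "x", "</small>", "y", "</b>"])
```

A closing tag simply restores whatever font was in effect before the matching opening tag, which is all the LIFO assumption buys us.
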

When a Table tag is encountered, the current entry is shoved into the entryQueue and a new entry object is created consisting of a TableCell. The TableCell interface is essentially modeled after the HTML Table tags, so handling the consequent tags is straightforward. This FSM can’t handle an HTML table within an HTML table, but we don’t expect such horrors to be in code documentation. In the end, if we didn’t encounter a table in the token array, we have a layoutCell, and if we did encounter a table, we have a list of layoutCells (one or more of which might be a LayoutTable), which we turn into a single-column table.
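
Stripped of everything except the table handling, that flow might look like the sketch below. Lists and dicts stand in for the cell and table classes, the tokens are assumed to be well-formed and non-nested, and the name group_entries is hypothetical.

```python
def group_entries(tokens):
    """Collect plain text into cells and <table> runs into tables.
    Returns a single cell if no table was seen, else a single-column table."""
    entry_queue = []
    current = []      # the ordinary cell being filled
    table = None      # the table being filled, if we are inside one
    saw_table = False
    for tok in tokens:
        if tok == "<table>":
            if current:
                entry_queue.append(current)  # shove the current entry into the queue
            current = []
            table = {"rows": []}
            saw_table = True
        elif tok == "</table>":
            entry_queue.append(table)
            table = None
        elif table is not None:
            if tok == "<tr>":
                table["rows"].append([])
            elif tok in ("<td>", "<th>"):
                table["rows"][-1].append([])
            elif tok not in ("</td>", "</th>", "</tr>"):
                table["rows"][-1][-1].append(tok)  # text inside the open data cell
        else:
            current.append(tok)
    if current:
        entry_queue.append(current)
    if not saw_table:
        return current
    return {"rows": [[entry] for entry in entry_queue]}  # one entry per row


tokens = ["Intro", "text", "<table>", "<tr>", "<td>", "a", "</td>",
          "<td>", "b", "</td>", "</tr>", "</table>", "After"]
result = group_entries(tokens)
```

For the sample tokens, the result is a three-row, single-column table: the intro cell, the two-column table, and the trailing cell.
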

for output like this:

All well and good, but we need these nested tables to fit onto pages which have definite dimensions. We have to figure out what the dimensions of these tables are going to be, and we are going to have to split them up so that they fit into the space we have available. With a table essentially being a collection of cells, to know the dimensions of a table we need to know the dimensions of its cells. So let’s look at how we dimension LayoutCells first.

Tables and Cells

The first step would be to see if the cell can fit in a given width. From the width, we create an absurdly long box and then we practice laying out the text into lines that fit inside that box. The height of the line is the biggest font size used – unless the text token is the beginning of a paragraph, in which case, we double the height. The sum of all of the line heights is then the height of the cell.
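
Here is a small sketch of that measuring pass. It assumes every character is char_width times the font size wide, which a real implementation would replace with actual font metrics, and the token shape (text, size, starts_paragraph) and the name measure_cell_height are made up for the example.

```python
def measure_cell_height(tokens, width, char_width=0.5):
    """Lay words into lines no wider than `width` and return the summed line heights.
    A line costs its biggest font size, doubled if it begins a paragraph."""
    height = 0.0
    x = 0.0            # pen position on the current line
    line_size = 0.0    # biggest font size seen on the current line
    line_para = False  # does the current line begin a paragraph?

    for text, size, starts_para in tokens:
        word_w = len(text) * size * char_width
        gap = size * char_width if x > 0 else 0.0
        # Start a new line at a paragraph start or when the word will not fit.
        if x > 0 and (starts_para or x + gap + word_w > width):
            height += line_size * (2 if line_para else 1)
            x, line_size, line_para, gap = 0.0, 0.0, False, 0.0
        x += gap + word_w
        line_size = max(line_size, size)
        line_para = line_para or starts_para
    if x > 0:  # account for the last, unfinished line
        height += line_size * (2 if line_para else 1)
    return height


sample = [("aaaa", 10, True), ("bbbb", 10, False), ("cccc", 10, False), ("dd", 12, True)]
total = measure_cell_height(sample, width=50)
```

With a width of 50, the sample breaks into three lines costing 20, 10 and 24, for a total height of 54.
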

The actual layout of text follows the exact same algorithm as the measuring pass, but it does a bit of additional work for horizontal alignment and creating links.

Likewise, for tables, we check the size of a table within a given width by making a really long box. But then we split the table into columns, and then into rows, and the farthest edges of all of the cells determine the width and height of the table. Again, the actual layout follows exactly the same logic, but with a bit of additional work, like table rules between cells. Note, however, that a Table doesn’t really position text; it delegates that work to its individual cells.
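
A sketch of the table version, under the simplifying assumption that the available width is divided into equal columns (the real column logic is not shown here). The measure_cell callback stands in for whatever measures an individual cell, and the name measure_table is hypothetical.

```python
def measure_table(rows, width, measure_cell):
    """Return (width, height) of a table from the farthest edges of its cells.
    measure_cell(cell, max_width) must return that cell's (width, height)."""
    n_cols = max(len(row) for row in rows)
    col_width = width / n_cols  # assumption: equal-width columns
    right = 0.0
    bottom = 0.0
    for row in rows:
        row_height = 0.0
        for col, cell in enumerate(row):
            w, h = measure_cell(cell, col_width)      # delegate to the cell
            right = max(right, col * col_width + w)   # farthest right edge so far
            row_height = max(row_height, h)           # tallest cell sets the row height
        bottom += row_height
    return right, bottom


# Stand-in cells are (natural_width, height) pairs clipped to the column width.
def clip(cell, max_width):
    return min(cell[0], max_width), cell[1]


size = measure_table([[(40, 12), (30, 24)], [(50, 12)]], width=100, measure_cell=clip)
```

Since a nested table is just another cell, measure_cell can call measure_table again when it meets one, which is where the recursion comes from.
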

Splitting tables

We split tables recursively because of nested tables. The first step is checking to see if the table needs to be split at all. If it does, we remove table rows from the bottom of the table until the remaining table fits. In some cases, we’ll end up with a one-row table that’s larger than the area we have available. In that case, we need to try to split up the cells to fit in that area… and re-bundle them together into two tables afterwards. And in other cases, while we can split a table at a row boundary, we have so much remaining white space that it makes sense to split the cells of the first row of the overflow table, and tack the top halves of the cells as another row to the bottom of the first table. As complicated as it sounds, this algorithm is basically as demonstrated by Chico and Groucho.
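
The row-removal step on its own can be sketched as below, working only with row heights. The function name split_rows and the needs_cell_split flag are assumptions, and the white-space heuristic for pulling part of the overflow back up is left out.

```python
def split_rows(row_heights, limit):
    """Peel rows off the bottom until what is left fits within `limit`.
    Returns (kept, overflow, needs_cell_split)."""
    total = sum(row_heights)
    if total <= limit:
        return list(row_heights), [], False  # no split needed at all
    keep = len(row_heights)
    while keep > 1 and total > limit:
        keep -= 1
        total -= row_heights[keep]           # move the bottom row to the overflow
    # A single remaining row that is still too tall has to be split cell by cell.
    return list(row_heights[:keep]), list(row_heights[keep:]), total > limit


fits = split_rows([30, 40, 50], limit=80)
too_tall = split_rows([30, 40, 50], limit=20)
```

In the first call the bottom row moves to the overflow table; in the second even the first row is too tall, so the flag tells the caller to fall back to splitting cells.
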

For splitting a LayoutCell, we just do a layout of the text of that cell until its height exceeds the height limit. Then, we cleave the cell into two at that point. Some split cells will have all of their original text in one cell and the new cell will be empty.
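
Assuming the cell has already been broken into lines with known heights, the cleaving step reduces to something like this sketch (the name split_cell and the (text, height) line shape are hypothetical):

```python
def split_cell(lines, limit):
    """Cleave a cell at the first line that would push its height past `limit`.
    `lines` is a list of (text, height) pairs; returns (top, bottom)."""
    used = 0.0
    for i, (_, line_height) in enumerate(lines):
        if used + line_height > limit:
            return lines[:i], lines[i:]
        used += line_height
    return lines, []  # everything fit: the new cell is empty


lines = [("first", 10), ("second", 10), ("third", 10)]
top, bottom = split_cell(lines, limit=25)
whole, empty = split_cell(lines, limit=100)
```

The second call shows the empty-cell case: all of the original text stays in the first cell and the new one has nothing in it.
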

These are the major pieces needed to lay out all the documentation entries. Tune in next month for our final episode, where we’ll put all the pieces together and add on the final bells and whistles.

