Parsing Adobe PDF Library Header Files for Documentation (2 of 4)

In our last episode, we introduced the lexing needed to parse the Adobe PDF Library’s C headers. In this installment, we are going to start parsing for documentation blocks interwoven with the C header declarations.

From those headers, we now want to extract six different kinds of information: functions, callback functions, structures, enumerations, other typedefs, and C preprocessor defines, along with any blocks of documentation text that may be associated with them. For this task, we are going to make use of three Python classes: Defn, its subclass Proc for functions, and Proc's subclass callbackDefn for callback functions. Everything else gets shoved into Defn, which is unofficially subclassed through the defnType member values of ‘Function’, ‘FuncTypedef’, ‘structTypedef’, ‘structMember’, ‘enum’, ‘enumConst’, ‘simpleType’, and ‘define’.
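As a rough sketch of how that hierarchy might hang together (the constructor arguments, the `summary` helper, and which defnType value each subclass sets are my assumptions for illustration, not the project's actual code):

```python
class Defn:
    """Base class for anything extracted from a header (fields are hypothetical)."""

    def __init__(self, name, defn_type, docblock=""):
        self.name = name
        self.defnType = defn_type   # e.g. 'enum', 'define', 'structTypedef'
        self.docblock = docblock

    def summary(self):
        """Return the first sentence of the docblock, with whitespace collapsed."""
        text = " ".join(self.docblock.split())
        head, sep, _ = text.partition(". ")
        return head + "." if sep else text


class Proc(Defn):
    """A function declaration: adds a return type and a parameter list."""

    def __init__(self, name, return_type, params, docblock=""):
        super().__init__(name, "Function", docblock)
        self.return_type = return_type
        self.params = params        # list of (type, name) tuples


class callbackDefn(Proc):
    """A callback typedef: reuses the function parsing, but is tagged separately."""

    def __init__(self, name, return_type, params, docblock=""):
        super().__init__(name, return_type, params, docblock)
        self.defnType = "FuncTypedef"


# Example entry; the docblock text here is made up.
proc = Proc("PDDocGetNumPages", "ASInt32", [("PDDoc", "doc")],
            "Gets the number of pages. More detail follows.")
callback = callbackDefn("MyEnumProc", "ASBool", [("void *", "clientData")])
```

Because callbackDefn inherits from Proc, anything that formats a function signature can also format a callback, while the defnType tag still lets the two be listed separately.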

If you’ve noticed the mismatch between defnTypes, Defn subclasses, and the items we are trying to extract, there’s a reason for that. This project grew organically, starting from parsing Proc header files for functions and using Defn for everything else… until I got to callback functions, where I wanted to re-use my function-parsing code. But I also wanted to keep the callbacks separate: the way that functions and callback functions are mixed together in Adobe’s APDFL documentation is one of the things I wanted to fix. So, I subclassed Proc.

The Defn class is where the Docblock gets parsed, along with a little bit of processing of the first (summary) sentence of an entry.

I’m not going to dwell on the specifics of parsing the header files, but it’s basically a big ol’ finite state machine (FSM) that invokes other FSMs.

To give you a flavor of what’s involved, here’s the main loop for parsing Defns:

And here is the function triggered by the typedef token, line 578 above:

This code was organically grown to match the input code base rather than strictly based on any semblance of the C standard, so let’s not think too hard about it.
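To make the state-machine idea concrete, here is a minimal sketch, assuming a much simpler token stream than the real lexer produces. The names `tokenize`, `parse_typedef`, and `parse_header` are hypothetical, and this is not the project's parser: it handles only struct, enum, and simple typedefs plus one-line defines, and it ignores function-pointer typedefs entirely.

```python
import re

# Doc comments and #define lines are kept whole; everything else is split
# into identifiers and single punctuation characters.
TOKEN_RE = re.compile(r"/\*\*.*?\*/|#define[^\n]*|\w+|[{};,()*=]", re.S)
KINDS = {"struct": "structTypedef", "enum": "enum"}


def tokenize(src):
    return TOKEN_RE.findall(src)


def parse_typedef(tokens, i):
    """Sub-machine: consume from 'typedef' to the ';' that closes it."""
    i += 1                      # skip the 'typedef' token itself
    depth, body, kind = 0, [], "simpleType"
    while i < len(tokens):
        tok = tokens[i]
        if tok == ";" and depth == 0:
            break               # end of this declaration
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
        elif not body and tok in KINDS:
            kind = KINDS[tok]   # first token decides struct/enum/simple
        body.append(tok)
        i += 1
    name = body[-1] if body else ""   # declared name is the last token
    return {"name": name, "defnType": kind}, i + 1


def parse_header(src):
    """Main loop: dispatch on the current token, carrying any pending docblock."""
    tokens, defns, pending_doc, i = tokenize(src), [], "", 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("/**"):
            pending_doc = tok
            i += 1
        elif tok == "typedef":
            defn, i = parse_typedef(tokens, i)
            defn["doc"], pending_doc = pending_doc, ""
            defns.append(defn)
        elif tok.startswith("#define"):
            parts = tok.split()
            name = parts[1] if len(parts) > 1 else ""
            defns.append({"name": name, "defnType": "define", "doc": pending_doc})
            pending_doc = ""
            i += 1
        else:
            i += 1              # anything unrecognized is skipped
    return defns


SAMPLE = """
/** Page rotation values. */
typedef enum { rotate0, rotate90 } Rotation;
#define MAX_PAGES 100
typedef struct _t_Rect { int left; int top; } Rect;
typedef int PageCount;
"""
defns = parse_header(SAMPLE)
```

The main loop only decides which sub-machine to hand control to; each sub-machine returns the finished entry together with the index where the main loop should resume.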

Organizing the information

The next challenge is organizing these items into layers and objects. I was able to map header files to layers with this configuration:

Which I then parse with:

Adobe doesn’t consider the APDFL Plugins to be their own layer. And their documentation certainly doesn’t have Datalogics-added functionality to work into their model. But having those as their own layers makes sense from a conceptual perspective.
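For a sense of what a header-to-layer mapping can look like, here is a small sketch. The configuration format, the layer names, and the filename patterns below are all invented for illustration and are not the configuration used in this project.

```python
import fnmatch

# Hypothetical configuration: "layer: pattern[, pattern...]" per line.
# Rules are tried in order, so more specific patterns should come first.
LAYER_CONFIG = """
# layer      header filename patterns
Plugins:     *Plugin*.h
Datalogics:  DL*.h
AS:          AS*.h
Cos:         Cos*.h
PD:          PD*.h
"""


def parse_layer_config(text):
    """Turn the config text into an ordered list of (pattern, layer) rules."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()    # drop comments and blanks
        if not line:
            continue
        layer, patterns = line.split(":", 1)
        for pattern in patterns.split(","):
            rules.append((pattern.strip(), layer.strip()))
    return rules


def layer_for(header, rules, default="Unassigned"):
    """Return the layer of the first rule whose pattern matches the filename."""
    for pattern, layer in rules:
        if fnmatch.fnmatchcase(header, pattern):
            return layer
    return default


rules = parse_layer_config(LAYER_CONFIG)
```

Keeping the mapping in data rather than code makes it easy to add layers that the vendor's own model doesn't have, such as separate Plugins and Datalogics layers.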

Mapping functions to objects is mostly straightforward. If the return-type or the first parameter’s type matches the function’s prefix, then that’s a good indication as to the function’s object grouping.
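A minimal sketch of that prefix heuristic, assuming each function is described by its name, return type, and a list of parameter types (the helper names `base_type` and `guess_object_group` are hypothetical):

```python
def base_type(c_type):
    """Strip qualifiers and pointer stars: 'const PDDoc *' -> 'PDDoc'."""
    words = c_type.replace("*", " ").split()
    words = [w for w in words if w not in ("const", "struct", "volatile")]
    return words[-1] if words else ""


def guess_object_group(name, return_type, param_types):
    """Return the type that the function name starts with, or None.

    Only the return type and the first parameter's type are considered.
    If both match, the longer type name wins.
    """
    candidates = [return_type] + param_types[:1]
    matches = [t for t in map(base_type, candidates) if t and name.startswith(t)]
    return max(matches, key=len) if matches else None


# PDDocAcquirePage returns a PDPage, but its first parameter is a PDDoc.
group = guess_object_group("PDDocAcquirePage", "PDPage", ["PDDoc", "ASInt32"])
```

Here `group` comes out as `"PDDoc"`: the return type `PDPage` is not a prefix of the function name, but the first parameter's type is.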

But add in structs, enums, callbacks, and C preprocessor macros, and things get a bit more complicated. So I try a number of heuristics, and if those don’t work the way I want, then I can manually override them. When the heuristics work, they’re a bit magic.

For example: in the code above we look at object groups of unassigned typedefs. If a type is used by or returned only from functions of the same object group, then we consider that typedef to be part of that object group. Pure magic, when it works.

And purely maddening when it doesn’t work.
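The typedef heuristic can be sketched roughly as follows. The data layout, the function name `assign_typedefs`, and the choice to ignore functions that have no group yet are assumptions made for this example. The `overrides` argument stands in for the manual escape hatch mentioned earlier.

```python
def assign_typedefs(functions, unassigned, overrides=None):
    """Assign each typedef to an object group if only one group uses it.

    functions:  list of dicts with 'group', 'return_type', 'param_types'
    unassigned: iterable of typedef names that have no group yet
    overrides:  optional {typedef: group} applied last, winning over heuristics
    """
    users = {}                              # typedef name -> set of groups
    for fn in functions:
        if fn["group"] is None:
            continue                        # assumption: ungrouped functions don't vote
        for type_name in [fn["return_type"], *fn["param_types"]]:
            users.setdefault(type_name, set()).add(fn["group"])

    assigned = {}
    for type_name in unassigned:
        groups = users.get(type_name, set())
        if len(groups) == 1:                # used by exactly one object group
            assigned[type_name] = next(iter(groups))
    assigned.update(overrides or {})
    return assigned


# All function and type names below are made up for the example.
functions = [
    {"group": "PDDoc", "return_type": "DocFlags", "param_types": ["PDDoc"]},
    {"group": "PDDoc", "return_type": "void", "param_types": ["PDDoc", "DocFlags"]},
    {"group": "PDDoc", "return_type": "BoxKind", "param_types": ["PDDoc"]},
    {"group": "PDPage", "return_type": "BoxKind", "param_types": ["PDPage"]},
]
result = assign_typedefs(functions, ["DocFlags", "BoxKind", "Orphan"],
                         overrides={"Orphan": "PDDoc"})
```

In this example `DocFlags` lands in the `PDDoc` group because only `PDDoc` functions touch it, `BoxKind` stays unassigned because two groups use it, and `Orphan` is placed by hand.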

A titular bonus

Finally, I had said I wouldn’t get to using the DotNet interface until the 4th installment, but if you’ve made it this far, here’s a little reward. A bit of code for creating a title page for the documentation we are going to be generating:

In the next episode, we are going to start formatting these Defn objects into something that can be laid out on a page. Or multiple pages, as needed.
