Parsing Adobe PDF Library Header Files for Documentation (1 of 4)

When I first started doing technical support for APDFL, my supervisor pointed me to an out-of-date PDF version of Adobe’s Acrobat and PDF Library Reference, describing it as easier to navigate and more useful than the HTML version of the documentation that has been used since.

The documentation document model

I still prefer it to the HTML documentation. Don’t get me wrong, though: that document is a 43 MB, 5937-page monster, and separating the wheat from the chaff in it takes a bit of work. More importantly, it’s not being updated with the latest APIs being added to the PDF Library. So I’ve been taking a close look at it and its source material, and I’ve been thinking that I could improve on it a bit.

Take, for example, the function summary pages in that document. Some of the entries are not strictly API function calls but callback function signature declarations. These are two similar but different things, and they should be in two different lists. Also, the Javadoc convention being borrowed here is to put a summary in the first sentence, not an entire paragraph. I wanted to fit more summaries on the page, and a two-column layout seemed like it would fit more entries compactly while being more legible.
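To make the first-sentence convention concrete, here is a minimal sketch of how a summary could be pulled out of a docblock’s text. It is not taken from my HeaderDoc program; the function name first_sentence is hypothetical, and it assumes the comment delimiters have already been stripped. It treats the summary as ending at the first period followed by whitespace, or at the first @tag, whichever comes first.

```python
import re

def first_sentence(doc_text):
    """Return the Javadoc-style summary sentence of a docblock's text.

    Hypothetical helper: assumes the /** and */ delimiters are already gone.
    """
    # Collapse line breaks and runs of spaces into single spaces.
    text = " ".join(doc_text.split())
    # Anything from the first block tag (e.g. "@param") onward is not summary.
    text = re.split(r"(?:^|\s)@[A-Za-z]+", text, maxsplit=1)[0].strip()
    # The sentence ends at the first period followed by whitespace or the end.
    # A period inside "1.5" is skipped because a digit follows it.
    m = re.search(r"\.(?=\s|$)", text)
    return text[:m.end()] if m else text
```

With this rule, a long paragraph in the header still contributes only its opening sentence to the summary list, which is what makes a compact two-column layout feasible.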

Looking at an entry for an individual function, it seemed to me that there was a lot of wasted horizontal space on the right. So I reorganized things a bit. Specifically, I removed some entries that weren’t relevant to PDF Library users.

The Adobe document had some returned-from and used-by information at the object group level. I thought that information was more relevant at the typedef level, and since I also needed to parse the structs for documentation, why not add a used-in entry as well?

And finally, there is one thing that was never going to show up in Adobe’s documentation:

First steps: Lex-ing

The first step was getting a tool to help with the lex-ing. I went with David Beazley’s PLY.

Fortunately for me, PLY has an example app for parsing ANSI C, which I used as the starting basis for my HeaderDoc program, since APDFL presents itself as a C library.

The C parsing largely stayed in place but was supplemented by parsing for C pre-processor directives.

I also had to finesse the distinction between what is considered a C comment (/* this is a comment*/, /**** and so is this! ***/) and what is the start of a block of documentation (/** this is a sentence intended for documentation. */). Oh, and hex numbers and C++-style comments were thrown in as well.
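The comment-versus-docblock distinction can be sketched with nothing but Python’s standard re module. This is not the PLY lexer from my HeaderDoc program; the token names and the tokenize function are hypothetical, and the rule that a docblock is "/**" followed by anything other than "*" or "/" is my reading of the examples above. The important part is the ordering: the docblock pattern has to be tried before the plain comment pattern, and the hex pattern before the plain integer pattern.

```python
import re

# Patterns are tried in order, so the more specific ones come first.
TOKEN_SPEC = [
    # "/**" NOT followed by "*" or "/" opens a documentation block.
    ("DOCBLOCK",   r"/\*\*(?![*/])(?:.|\n)*?\*/"),
    # Everything else between /* and */ is an ordinary C comment.
    ("COMMENT",    r"/\*(?:.|\n)*?\*/"),
    ("CPPCOMMENT", r"//[^\n]*"),
    ("PREPROC",    r"#[ \t]*[A-Za-z_]+"),
    ("HEX",        r"0[xX][0-9A-Fa-f]+"),
    ("INT",        r"\d+"),
    ("ID",         r"[A-Za-z_]\w*"),
    ("WS",         r"\s+"),
    ("OTHER",      r"."),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Return (kind, lexeme) pairs, dropping whitespace."""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(text)
            if m.lastgroup != "WS"]
```

Running tokenize over the three comment examples above classifies the first two as COMMENT and only the third as DOCBLOCK.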

I also made extensive use of PLY’s conditional lexing features. Outside of a docblock, the headers contain fairly conventional C statements, with heavy use of the C pre-processor to smooth over cross-platform issues. Within a docblock, though, there is a mix of HTML (largely a limited subset of HTML 2.0) and Javadoc-inspired tags.
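The idea behind conditional lexing is that the lexer switches between sets of rules depending on where it is in the input. The sketch below imitates that with the standard re module rather than PLY itself, so it runs without any third-party install; the state names, token names, and the lex function are all hypothetical, and the real rule set is considerably larger.

```python
import re

def compile_rules(rules):
    return re.compile("|".join("(?P<%s>%s)" % r for r in rules))

# One rule set per lexer state.
RULES = {
    "code": compile_rules([
        ("DOC_START", r"/\*\*(?![*/])"),   # switch into the docblock state
        ("ID",        r"[A-Za-z_]\w*"),
        ("WS",        r"\s+"),
        ("PUNCT",     r"."),
    ]),
    "docblock": compile_rules([
        ("DOC_END", r"\*/"),               # switch back to the code state
        ("TAG",     r"@[A-Za-z]+"),        # Javadoc-inspired tag
        ("HTML",    r"</?[A-Za-z][^>]*>"), # simple HTML element
        ("TEXT",    r"[^@<*]+"),
        ("STRAY",   r"."),                 # lone "@", "<" or "*"
    ]),
}

def lex(text):
    """Return (kind, lexeme) pairs, switching rule sets at docblock borders."""
    state, pos, out = "code", 0, []
    while pos < len(text):
        m = RULES[state].match(text, pos)
        if m is None:
            raise ValueError("no rule matches at offset %d" % pos)
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group().strip()
        if kind == "DOC_START":
            state = "docblock"
        elif kind == "DOC_END":
            state = "code"
        if kind != "WS" and lexeme:
            out.append((kind, lexeme))
    return out
```

The same word is tokenized differently depending on the state: "path" inside the docblock is part of a TEXT run, while "path" in the C declaration is an ID.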


I should be clear here that I’m not running PLY in ordinary CPython, but in IronPython. The eventual goal here is to use the DotNet interface to generate PDF pages containing APDFL documentation. I could have used Jython and the Java interface, but I’ve used IronPython with the DotNet interface before. It’s just been a while.

I won’t actually get to producing PDF pages until part 4. Next up, I’ll be parsing random header files to extract summary information.

