Marked Content to PDF/UA Compliant Tagged PDF

Matt Kuznicki’s recent Feature Exploration of Marked Content inspired me to work on a sample app idea I’ve had on the back burner for a while. Can I use Adobe PDF Library (APDFL) to modify an existing PDF document to make it PDF/UA compliant?

Give or take a heaping handful of caveats, yes I can.

Screenshot showing PDF/UA requirements fulfilled.

The Caveats

The main caveat was that I had the source code to generate the brochure and could (and did) tweak how the brochure was generated to make it easier for my little sample app to process. The bulk of the changes were to add the elements to a Container’s Content instead of adding the Elements directly to the Page’s Content. I did change the order of some text content so that it would be added to the page in logical reading order. I also had to redo code that generated a simulated text shadow so that the text shadow could be marked as a meaningless artifact while the foreground was marked as an H1 header. Lastly, I added a LinkAnnotation to the logo for testing purposes. Apart from that, my input document should be identical to this document as described here.

But tweaking the page’s Content Stream is not sufficient to make a document a tagged PDF, much less a PDF/UA compliant document. For that, I turned to APDFL’s PDSEdit layer. Let’s take a look at the code.

The Code

This first step is just to make sure we are starting from an un-tagged file, to avoid complications. We create a new Structure Tree. Next, we add the first node on that tree.

The Document node’s title we steal from the PDF’s metadata. We’re also hard-coding the language here so that all the child nodes inherit this property. Those child nodes are added in the next block of code:

Here we iterate over the pages. If the pages contain any annotations, we set the tabs property to indicate that they should be processed in the order found in the structure tree. The tagElements procedure is where the action is.

Aside from some initializations, we’re essentially just iterating over the elements of the PDEContent passed in. The most important element we are looking for is the kPDEContainer type:

From the Container, we extract the Marked Content Tag. If the tag is an Artifact, then we skip adding it to the Structure tree. Otherwise, we use that tag as the Type for the Structure element we create. To that Structure Element, we add the container as a child before adding the newStructElem as a child of the rootElem. Note that the tag types are actually from section 14.8.4 of the ISO 32000-1:2008 PDF specification. Tag types don’t necessarily have to be those Standard Structure types, but that’s a topic for another day. Next, we recursively descend into the content of the container, and the newStructElem becomes the parent rootElem.
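The walk described above can be sketched without APDFL at all. The following is a minimal Python model, not the sample app's code: `Element` and `StructElem` are hypothetical stand-ins for page content elements and structure tree nodes, and only the control flow (skip artifacts, mirror the tag, add the container, recurse) is taken from the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a page content element (not an APDFL type)."""
    kind: str                      # e.g. "container", "text", "image"
    tag: Optional[str] = None      # marked-content tag; only set on containers
    children: List["Element"] = field(default_factory=list)


@dataclass
class StructElem:
    """Hypothetical stand-in for a structure tree node."""
    type: str
    kids: List[object] = field(default_factory=list)


def tag_elements(content: List[Element], root: StructElem) -> None:
    """Mirror tagged containers into the structure tree, skipping artifacts."""
    for elem in content:
        if elem.kind != "container":
            continue               # this sketch only handles containers
        if elem.tag == "Artifact":
            continue               # artifacts stay out of the structure tree
        new_elem = StructElem(type=elem.tag)
        new_elem.kids.append(elem)   # the container itself is the first child
        root.kids.append(new_elem)
        # Descend: the new structure element becomes the parent.
        tag_elements(elem.children, new_elem)


page = [
    Element("container", "Artifact", [Element("text")]),
    Element("container", "H1", [Element("text")]),
    Element("container", "Sect", [Element("container", "P", [Element("text")])]),
]
document = StructElem("Document")
tag_elements(page, document)
print([kid.type for kid in document.kids])  # ['H1', 'Sect']
```

The Artifact container never reaches the tree, and the nested P ends up under Sect rather than directly under Document.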

If the Tag is a Figure, meaning an image or illustration, then we need to add some attributes to the Structure Element – the element’s bounding box and height and width essentially.
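As a rough illustration of what those attributes amount to, here is a small Python sketch that derives them from a bounding box. It assumes the attributes live in an attribute object owned by `Layout` with `BBox`, `Width`, and `Height` entries, as in the tagged PDF layout attributes; the dictionary is only a model, and how APDFL attaches it to the Structure Element is not shown.

```python
from typing import Dict, Tuple

Rect = Tuple[float, float, float, float]  # left, bottom, right, top


def figure_layout_attributes(bbox: Rect) -> Dict[str, object]:
    """Build a model of the Layout attribute object for a Figure."""
    left, bottom, right, top = bbox
    return {
        "O": "Layout",                       # attribute owner
        "BBox": [left, bottom, right, top],  # figure bounds in page space
        "Width": right - left,
        "Height": top - bottom,
    }


print(figure_layout_attributes((10.0, 20.0, 110.0, 95.0)))
```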

But most importantly, we also need to add some alternative text to the Structure Element. Each Figure requires its own alternate description of the visual representation. I’m reading that information from a tab-separated values (TSV) file, which contains the page number, approximate bounding box information for each figure, and its description:

When looking for the alternate text for a Figure at a particular location on a particular page, I simply iterate over each record from the TSV file until I find a match. For a larger file, I would break up the records into a page-based map, but for a reasonably small proof-of-concept document, this works well enough:
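A minimal Python sketch of that lookup follows. The column order (page, left, bottom, right, top, description), the tolerance used for the approximate match, and the sample rows are all assumptions made for illustration; the real file layout and matching rule are not shown in this article.

```python
import csv
import io
from typing import List, NamedTuple, Optional, Tuple

Rect = Tuple[float, float, float, float]  # left, bottom, right, top


class AltTextRecord(NamedTuple):
    page: int
    bbox: Rect
    text: str


def load_records(tsv_text: str) -> List[AltTextRecord]:
    """Parse rows of: page, left, bottom, right, top, description (assumed layout)."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [
        AltTextRecord(int(r[0]), tuple(float(v) for v in r[1:5]), r[5])
        for r in rows
        if r
    ]


def find_alt_text(records: List[AltTextRecord], page: int, bbox: Rect,
                  tolerance: float = 2.0) -> Optional[str]:
    """Linear scan: first record on the page whose box is within tolerance."""
    for rec in records:
        if rec.page != page:
            continue
        if all(abs(a - b) <= tolerance for a, b in zip(rec.bbox, bbox)):
            return rec.text
    return None


# Made-up sample rows, not the brochure's real data.
SAMPLE = (
    "1\t36\t600\t236\t756\tPlaceholder description A\n"
    "1\t300\t100\t560\t400\tPlaceholder description B\n"
)
records = load_records(SAMPLE)
print(find_alt_text(records, 1, (36.4, 599.2, 236.0, 755.9)))
```

For a larger file, the same records could be grouped into a dictionary keyed by page number so that each lookup only scans the figures on one page.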

Anyway, we’re going to re-use that logic for finding alternate text for Link Annotations.

If we find a Link Annotation that overlaps the Container’s bounding box, we add a new Structure element of type Link as a child of the newStructElem. The annotation’s Cos Object is added as a child of the link structure element. This is similar to how we handle Image elements:
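The overlap test that link matching relies on is plain rectangle geometry, so it can be sketched independently of APDFL. In this Python sketch the rectangles are (left, bottom, right, top) tuples, and the function names are hypothetical; boxes that merely touch along an edge are treated as not overlapping, which is a choice made here rather than something the sample app is known to do.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # left, bottom, right, top


def rects_overlap(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share some interior area."""
    a_left, a_bottom, a_right, a_top = a
    b_left, b_bottom, b_right, b_top = b
    return (a_left < b_right and b_left < a_right
            and a_bottom < b_top and b_bottom < a_top)


def overlapping_links(container_bbox: Rect, link_rects: List[Rect]) -> List[int]:
    """Indices of the link annotation rectangles that overlap the container."""
    return [i for i, rect in enumerate(link_rects)
            if rects_overlap(container_bbox, rect)]


logo = (36.0, 700.0, 136.0, 756.0)
links = [(30.0, 690.0, 140.0, 760.0), (300.0, 100.0, 400.0, 120.0)]
print(overlapping_links(logo, links))  # [0]
```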

This proof-of-concept code doesn’t actually handle any other Elements.

So it’s not robust against everything that might be thrown at it. But let’s go back to the main method for the final steps we need to take to make this a PDF/UA compliant document.

We mark the document as being tagged.

Next, we set the DisplayDocTitle viewer Preference.

Then we set the Document’s language, so that it also applies to any text that might be in Artifacts, which are not in the Structure Tree.

And finally:

We mark this document as being a PDF/UA document and save it.
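To recap those final steps, here is a small Python checklist over a dictionary that models the document catalog. It is a sketch under assumptions, not a validator and not APDFL code: the keys are the standard PDF catalog entries these steps correspond to, `pdfuaid:part` is the PDF/UA identifier stored in the XMP metadata (assumed to be 1, for PDF/UA-1), and the function name is made up for illustration.

```python
from typing import Any, Dict, List, Optional


def missing_pdfua_markers(catalog: Dict[str, Any],
                          pdfuaid_part: Optional[int]) -> List[str]:
    """Return the document-level markers from the steps above that are missing."""
    missing = []
    if "StructTreeRoot" not in catalog:
        missing.append("StructTreeRoot")                     # no structure tree
    if catalog.get("MarkInfo", {}).get("Marked") is not True:
        missing.append("MarkInfo/Marked")                    # not marked as tagged
    if catalog.get("ViewerPreferences", {}).get("DisplayDocTitle") is not True:
        missing.append("ViewerPreferences/DisplayDocTitle")
    if not catalog.get("Lang"):
        missing.append("Lang")                               # document language
    if pdfuaid_part != 1:
        missing.append("pdfuaid:part")                       # XMP identifier
    return missing


catalog = {
    "StructTreeRoot": {"K": []},
    "MarkInfo": {"Marked": True},
    "ViewerPreferences": {"DisplayDocTitle": True},
}
print(missing_pdfua_markers(catalog, pdfuaid_part=None))  # ['Lang', 'pdfuaid:part']
```

Passing this checklist says nothing about the quality of the tags themselves; it only confirms that the document-level switches are set.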

Closing Thoughts

This project is akin to my earlier PDF/A from scratch article; it’s a proof-of-concept demonstrating how to get started. While getting started on this project myself, I found the Access for All PDF Accessibility Checker to be quite helpful, particularly in combination with the PDF Association’s PDF/UA reference suite.

Interested in seeing what Adobe PDF Library can do for you? Click here for a free trial.



