Datalogics at the 2016 Callas PDF Hackathon

GLS Berlin Campus: Callas PDF Hackathon.

It was my privilege to represent Datalogics at the PDF Hackathon that our partner company Callas Software organized in Berlin on April 11 and 12. This was Callas’s second Hackathon; the first focused on Callas’s new and impressive pdfChip technology, but this iteration was broadened in scope to any PDF-related topic.

Olaf Drümmer explaining the Hackathon process.

With most of the attendees being Callas customers, the topics they wanted to hack tended towards pre-press issues with Callas tools. But a small group of topics caught my fancy: an issue where converting emails to PDF caused images to be split up, a request for how to mask an image with a vector path, and an interesting request from a gentleman from a Finnish newspaper. He wanted to extract an article from one or more PDFs for a reprint service, given an XML file that describes where the components of the article are on the page. All other ancillary material had already been destroyed, as archiving the InDesign files generated on a daily basis would overwhelm their resources.

Challenge: Extract the article from the newspaper page’s PDF file.

My initial thought was to wonder whether InDesign might include Article and Bead information (a PDF 1.1 feature; section 8.3.2 of the PDF Reference, version 1.7) in the document when it generated the PDF, and whether that could be used to extract the relevant article. Alas, a small bit of sleuthing revealed that this little-used PDF feature is seemingly not used by the product most likely to populate that information in a PDF.

Reducing the problem to its fundamentals.

Turning to the example XML file and examining its elements and attributes, we convinced ourselves, because it would be much easier for us if it were true, that its article coordinates were likely to be desktop publishing points (1/72 in.). Then I noticed that the xgeometry coordinate in our sample XML file lay beyond the right edge of our sample PDF. Oops. Our newspaper man then turned to InDesign to determine the position of the article on the page in points, and I turned to the XML coordinates to work out how to convert them to match those point-based coordinates.

Creating a pdfChip program to create an article reprint PDF from a newspaper page PDF.

Once I had figured out those formulas (yay for algebra!), we fed them into the JavaScript program for pdfChip that Olaf put together to create a new PDF page, which essentially clipped away the rest of the page contents to display only the article.
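
The formulas themselves are not reproduced in this post. As a rough illustration of the kind of algebra involved, here is a minimal sketch that assumes the XML coordinates relate to PDF points by a simple linear mapping on each axis (a scale plus an offset), which two known reference positions are enough to pin down. The function names and the sample numbers are made up for this example and are not taken from the Hackathon code.

```javascript
// Hypothetical helper: fit pts = scale * xml + offset for one axis,
// given two positions known in both XML units and PDF points.
function fitAxis(xmlA, ptsA, xmlB, ptsB) {
  if (xmlA === xmlB) {
    throw new Error("reference values must differ to fit a line");
  }
  const scale = (ptsB - ptsA) / (xmlB - xmlA);
  const offset = ptsA - scale * xmlA;
  return { scale, offset };
}

// Apply a fitted axis mapping to a single XML coordinate.
function toPoints(axis, xmlValue) {
  return axis.scale * xmlValue + axis.offset;
}

// Made-up reference pairs, for illustration only.
// The y axis is fitted with a negative scale, which is how a layout
// measured from the top of the page flips into PDF's bottom-left origin.
const xAxis = fitAxis(0, 0, 100, 50);
const yAxis = fitAxis(0, 800, 200, 700);

console.log(toPoints(xAxis, 40), toPoints(yAxis, 100));
```

Fitting each axis separately keeps the sketch small; it would not cover a rotated or skewed page, which is an assumption that would need checking against real data.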

Other team members worked on a JavaScript routine to calculate the minimum-size page/bounding box for articles with more than one element on the page. Sadly, this code did not make it into the final program: our one sample PDF/XML file set had only one element, so it wasn’t required for what we could demonstrate.
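
That routine is not shown here, but the idea can be sketched as the union of the element rectangles. The version below is an assumption about how it might look, not the team's code: the `{ x, y, width, height }` rectangle shape and the name `unionBox` are hypothetical, with all values taken to be in points.

```javascript
// Hypothetical sketch: smallest axis-aligned box enclosing every element.
// Each rectangle is { x, y, width, height }, with (x, y) its lower-left corner.
function unionBox(rects) {
  if (rects.length === 0) {
    throw new Error("need at least one rectangle");
  }
  let left = Infinity;
  let bottom = Infinity;
  let right = -Infinity;
  let top = -Infinity;
  for (const r of rects) {
    left = Math.min(left, r.x);
    bottom = Math.min(bottom, r.y);
    right = Math.max(right, r.x + r.width);
    top = Math.max(top, r.y + r.height);
  }
  return { x: left, y: bottom, width: right - left, height: top - bottom };
}
```

With a single element, as in our sample file set, the result is simply that element's own rectangle, which is why the final program could do without it.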

During a lull, I took a small break to put together a quick DLE program demonstrating how to mask an image with a vector path, which I’ll discuss in a follow-up article.
