Back in March of this year, I had an opportunity to try my hand at being a Product Manager for a pair of products from Datalogics, PDF Java Toolkit and PDF WebAPI. It has certainly been an interesting journey going from the engineering management side to the product management side.
In September, I began an experiment and set up a feedback forum through uservoice to collect feedback from our customers about our current set of products. We have casually mentioned this in a few of the posts I have written, mainly in posts about PDF WebAPI. Today I wanted to take the time to write something specific about uservoice.
As we move our products forward, we will post ideas to our feedback forum and update the status of these as we go through and implement them (or put them on ice for the winter).
If you have thoughts or suggestions for what Datalogics could do to improve its current products or develop new ones to help you, post them on http://feedback.datalogics.com and help us build solutions for the pain points you may have when working with eBook or PDF technologies!
PDF is widely accepted by courts across the world for submitting documents. In most cases there are size restrictions, but the courts do allow you to e-file documents in sections, and they suggest dividing the document in logical places, for example between chapters or sections.
Acrobat allows you to do this fairly easily by letting you split the document based on the number of pages, by file size, or by top-level bookmarks. But if you want to automate the process on a server, or integrate the process into a document management system, Acrobat isn’t the right tool; its automation features are designed for end-user interaction and it isn’t licensed for this type of server use. However, the Datalogics PDF Java Toolkit can be used to replicate the Acrobat functionality, and this week’s post is the first of a new three-part series that will discuss how to programmatically split a document in the same way that Acrobat does.
What You Need to Know First: What we generally call bookmarks are referred to in the PDF Specification as the Document Outline. The outline is stored in the document Catalog and consists of a tree-structured hierarchy of outline nodes, which serve as a visual table of contents that displays the document’s structure to the user. In the PDF, the nodes at each level of the hierarchy form a linked list, chained together through their Prev and Next entries and accessed through the First and Last entries in the parent node. For the sake of consistency with the PDF Java Toolkit API, we’ll use the term “bookmark” to refer to an outline item from here on.
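To make the linked-list structure a little more concrete, here is a minimal sketch in plain Java. The OutlineNode and OutlineWalker classes are hypothetical stand-ins that only mirror the First, Last, Prev and Next entries described above; they are not part of the PDF Java Toolkit API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an outline item; NOT a PDF Java Toolkit class.
final class OutlineNode {
    final String title;
    OutlineNode first, last, prev, next, parent;

    OutlineNode(String title) {
        this.title = title;
    }

    // Append a child while keeping the First/Last/Prev/Next links consistent.
    OutlineNode addChild(OutlineNode child) {
        child.parent = this;
        child.prev = last;
        if (last == null) {
            first = child;      // first child at this level
        } else {
            last.next = child;  // chain onto the current last sibling
        }
        last = child;
        return child;
    }
}

final class OutlineWalker {
    // Collect the top-level titles by starting at First and following Next.
    static List<String> topLevelTitles(OutlineNode root) {
        List<String> titles = new ArrayList<>();
        for (OutlineNode n = root.first; n != null; n = n.next) {
            titles.add(n.title);
        }
        return titles;
    }
}
```

Walking only First and Next at the root gives the top-level bookmarks, which is the level this series splits on; deeper levels would be reached by repeating the same walk on each child.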
Bookmarks are not necessarily created in the order needed to split a document without overlapping pages, but most tools that create PDF from well-structured source documents do create them in that order. While it’s not a guarantee, it’s a safe bet that this code will work for many typical PDF files with bookmarks.
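If you would rather check than bet, one possible guard is to confirm that the destination pages of the top-level bookmarks increase from one bookmark to the next before splitting. The sketch below is plain Java; the DestinationOrder class and the idea of collecting the destinations into an int array of zero-based page indices are assumptions made for illustration, not Toolkit API.

```java
// Hypothetical helper; the page indices are assumed to have been read
// from the top-level bookmark destinations in outline order.
final class DestinationOrder {
    // True when every destination page index is greater than the one before it,
    // so the resulting sections cannot overlap or be empty.
    static boolean isStrictlyAscending(int[] destinationPages) {
        for (int i = 1; i < destinationPages.length; i++) {
            if (destinationPages[i] <= destinationPages[i - 1]) {
                return false;
            }
        }
        return true;
    }
}
```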
The Process: As Adobe was developing the “Gibson” library, the original version of the Datalogics PDF Java Toolkit, one of the design concepts was that “you should be able to open the Putty Book and know how to use it.” The Putty Book is how Adobe folks referred to the first published edition of the PDF Specification; the color on the cover, as you can see, is primarily putty.
Once we have the List of top-level bookmarks, we can make some decisions about exactly how we want to split the document, especially if the first bookmark does not point to the first page. We also need to decide what to do with any pages that come after the last bookmark. In this Gist, I’ve chosen to break up the document in a process involving three steps.
Extract the pages from the first page to the page prior to the destination of the second bookmark. This will give me a file that contains the front matter of the document as well as any pages that are in the first section.
Then I split the document at each of the remaining bookmarks creating a new file that contains pages from just that section.
Finally, I create a new file that contains pages from the start of the last section (the last bookmark destination) to the last page. This may not be what you want for your documents but the Gist is easy enough to modify to make it do what you prefer.
The second step is shown below. Each bookmark has a destination and each destination refers to a page in the document.
We can use the destination page numbers to calculate the starting page and number of pages to extract to a new file using the PMMService.
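The page arithmetic behind the three steps can be sketched in plain Java as follows. This is only an illustration: the SplitPlanner class is hypothetical, the destinations are assumed to be zero-based page indices in ascending order, and treating a document with fewer than two bookmarks as a single range is my own assumption. The actual extraction with the PMMService is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; NOT part of the PDF Java Toolkit.
final class SplitPlanner {
    // Returns one {startPage, pageCount} pair per output file (zero-based pages).
    static List<int[]> planRanges(int[] dests, int totalPages) {
        List<int[]> ranges = new ArrayList<>();
        if (dests.length < 2) {
            // Assumption: with fewer than two bookmarks there is nothing to split.
            ranges.add(new int[] {0, totalPages});
            return ranges;
        }
        // Step 1: first page up to the page before the second bookmark.
        ranges.add(new int[] {0, dests[1]});
        // Step 2: each remaining bookmark up to the page before the next one.
        for (int i = 1; i < dests.length - 1; i++) {
            ranges.add(new int[] {dests[i], dests[i + 1] - dests[i]});
        }
        // Step 3: last bookmark destination through the last page.
        int lastStart = dests[dests.length - 1];
        ranges.add(new int[] {lastStart, totalPages - lastStart});
        return ranges;
    }
}
```

For a 12-page document with top-level bookmarks pointing at pages 2, 5 and 9, this yields the ranges {0, 5}, {5, 4} and {9, 3}, which together cover every page exactly once.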
Part of my job as the Evangelist at Datalogics is to scour the Acrobat SDK Forums and try to help people understand how to use the Adobe PDF Library instead of Adobe Acrobat or Reader when they need Acrobat functionality… but they need it to run on a server or via the command line.
One of the questions that I still come across with alarming frequency, and one that causes some of the most frustrated complaints aimed at Adobe, relates to command line printing from Reader. It seems that many enterprises have built mission-critical, server-based functionality around unsupported features in Adobe Reader and then get in trouble when Reader updates or Adobe changes how Reader works.
Adobe Reader just wasn’t engineered for this use case.
Instead, use the Adobe PDF Library, the same technology that Adobe uses to build Acrobat and Reader. The Gist below demonstrates the use of the Adobe PDF Library with Datalogics Extensions to print a PDF file without any user dialogs. The code contains comments on how to set some common print parameters.
One of the most useful features of PDF for navigating long documents is bookmarks. Bookmarks allow you to quickly move from one part of a document to another… and when the PDF is Optimized for “Fast Web View,” Adobe Reader can skip the download of all the pages in between. Bookmarks make browsing PDF files far more efficient for both the user and the internet connection.
Many interactive PDF authoring tools like Microsoft Word or Adobe InDesign will create bookmarks for you automatically based on headings and styles but most PDF Library tools don’t… unless they are also creating the PDF file. That said, unfortunately, there are still a lot of PDF files that were created without bookmarks and could really benefit from having them… but adding bookmarks to an existing PDF can be a bit tricky.
Ok – That’s not exactly true. Just adding bookmarks is easy… discovering their destinations… that’s the tricky part. Fortunately, the ReadingOrderTextExtractor class in the Datalogics PDF Java Toolkit makes it easy to find paragraphs that match certain criteria that can be interpreted as a Heading or Subheading and use the information it provides to create a proper bookmark destination…
In this example input file, the top-level headings, or H1, are in 21 point, MinionPro-Bold, the H2 headings are 18 point, MyriadPro-Bold and the H3 headings are in 14 point, MyriadPro-Bold.
Knowing this, we can iterate through each paragraph of the PDF file and use this style information to locate headings and then set the destinations of the bookmarks to the coordinates of the words we find with those styles. By extracting the text in reading order, we know that the bookmarks will also be in the correct order and nested properly regardless of the length of the document.
In the Gist below, I’ve set up ranges of sizes for the heading fonts we are interested in; this should make it easier to modify the code to fit your particular needs.
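As a rough illustration of the idea (not the Gist’s actual code), heading detection by size range might look like the plain Java sketch below. The HeadingClassifier class, the one-point tolerance around each size, and the check for a “-Bold” suffix on the font name are all assumptions chosen to match the example input file described above.

```java
// Hypothetical classifier; NOT part of the PDF Java Toolkit.
final class HeadingClassifier {
    // Inclusive {min, max} point sizes; index 0 is H1, 1 is H2, 2 is H3.
    // The one-point tolerance around 21, 18 and 14 is illustrative only.
    private static final double[][] SIZE_RANGES = {
        {20.0, 22.0},
        {17.0, 19.0},
        {13.0, 15.0},
    };

    // Returns 1, 2 or 3 for a heading level, or 0 when the text is not a heading.
    static int headingLevel(String fontName, double fontSize) {
        if (fontName == null || !fontName.endsWith("-Bold")) {
            return 0; // headings in the example file are all set in bold faces
        }
        for (int i = 0; i < SIZE_RANGES.length; i++) {
            if (fontSize >= SIZE_RANGES[i][0] && fontSize <= SIZE_RANGES[i][1]) {
                return i + 1;
            }
        }
        return 0;
    }
}
```

Keeping the ranges in one table means that adapting the sketch to a different document is mostly a matter of editing those numbers.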
The first step in reading the text of the PDF is to set up the text extractor. In order for the text extractor to interpret the text correctly, it needs to know what fonts are available. Once you’ve loaded them, setting up the ReadingOrderTextExtractor class is easy.
From there you can easily iterate over each paragraph.
Each “paragraph” is an ArrayList of ArrayLists, which are the sentences, each of which is in turn an ArrayList of Word objects. The Word objects can be used to discover the position of the word on the page, and we can use the characters in the Word to discover its font name and size. Once we’ve determined we have a Heading paragraph, we can then add the bookmark in its appropriate place in the bookmark tree and set its destination.
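Finding the “appropriate place” in the tree is the part that benefits most from an example. One common approach, sketched here in plain Java, is to keep a stack of the currently open headings and attach each new heading under the nearest one with a smaller level. The BookmarkNode and BookmarkTreeBuilder classes are hypothetical and stand in for the Toolkit’s own bookmark objects; destinations are left out to keep the sketch short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical bookmark node; NOT the Toolkit's bookmark class.
final class BookmarkNode {
    final int level; // 0 = root, 1 = H1, 2 = H2, ...
    final String title;
    final List<BookmarkNode> children = new ArrayList<>();

    BookmarkNode(int level, String title) {
        this.level = level;
        this.title = title;
    }
}

final class BookmarkTreeBuilder {
    private final BookmarkNode root = new BookmarkNode(0, "root");
    private final Deque<BookmarkNode> open = new ArrayDeque<>();

    BookmarkTreeBuilder() {
        open.push(root);
    }

    // Call once per heading paragraph, in reading order.
    void addHeading(int level, String title) {
        if (level < 1) {
            throw new IllegalArgumentException("heading level must be 1 or greater");
        }
        // Close any open heading at the same or a deeper level.
        while (open.peek().level >= level) {
            open.pop();
        }
        BookmarkNode node = new BookmarkNode(level, title);
        open.peek().children.add(node); // nearest shallower heading is the parent
        open.push(node);
    }

    BookmarkNode root() {
        return root;
    }
}
```

Because the text is extracted in reading order, feeding the headings to addHeading as they are found is enough to produce a correctly nested tree.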
The images below are before and after screen captures of the input file being displayed in Adobe Acrobat DC.
You can see that an entire tree of bookmarks has been added to the PDF file perfectly reflecting the heading styles and how they are nested. To get started with adding bookmarks to PDF files, download this Gist and request an evaluation copy of The Datalogics PDF Java Toolkit.