Splitting a PDF Document by Top Level Bookmarks using the Datalogics PDF Java Toolkit

Sample of the Week:

Courts across the world widely accept PDF for document submissions. In most cases there are size restrictions, but the courts do allow you to e-file documents in sections, and they suggest dividing the document in logical places, for example between chapters or sections.

Acrobat allows you to do this fairly easily by letting you split the document based on the number of pages, by file size, or by top-level bookmarks. But if you want to automate the process on a server, or integrate it into a document management system, Acrobat isn’t the right tool; its automation features are designed for end-user interaction and it isn’t licensed for this type of server use. However, the Datalogics PDF Java Toolkit can be used to replicate the Acrobat functionality, and this week is the first of a new three-part series that will discuss how to programmatically split a document in the same way that Acrobat does.

What You Need to Know First:
What we generally refer to as bookmarks is called the Document Outline in the PDF Specification. The outline is stored in the document Catalog and consists of a tree-structured hierarchy of outline nodes, which serve as a visual table of contents that displays the document’s structure to the user. In the PDF, the nodes at each level of the hierarchy form a linked list, chained together through their Prev and Next entries and accessed through the First and Last entries in the parent node. For the sake of consistency with the PDF Java Toolkit API, we’ll use the term “bookmark” to refer to an outline item from here on.
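To make the linked-list layout concrete, here is a minimal sketch in plain Java. The OutlineNode and OutlineWalk classes are hypothetical stand-ins written for this illustration; they are not the Toolkit's bookmark classes. The sketch only shows how the top-level items are reached by starting at the root's First entry and following Next.

```java
import java.util.ArrayList;
import java.util.List;

final class OutlineWalk {
    /** Collects the titles of the top-level items: start at First, follow Next. */
    static List<String> topLevelTitles(OutlineNode root) {
        List<String> titles = new ArrayList<>();
        for (OutlineNode n = root.first; n != null; n = n.next) {
            titles.add(n.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        OutlineNode root = new OutlineNode("(outline root)");
        OutlineNode chapter1 = new OutlineNode("Chapter 1")
                .addChild(new OutlineNode("Section 1.1")); // nested, not top level
        root.addChild(chapter1).addChild(new OutlineNode("Chapter 2"));
        System.out.println(topLevelTitles(root)); // prints [Chapter 1, Chapter 2]
    }
}

/** Hypothetical model of an outline item; NOT a Toolkit class. */
final class OutlineNode {
    final String title;
    OutlineNode first, last, prev, next; // mirror the First, Last, Prev, Next entries

    OutlineNode(String title) {
        this.title = title;
    }

    /** Appends a child and keeps the sibling chain consistent. Returns this node. */
    OutlineNode addChild(OutlineNode child) {
        if (first == null) {
            first = child;
        } else {
            last.next = child;
            child.prev = last;
        }
        last = child;
        return this;
    }
}
```

Nested items such as “Section 1.1” hang off their parent's own First entry, so a walk along the root's sibling chain never visits them.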

Bookmarks are not necessarily created in the order needed to split a document without overlapping pages, but most tools that create PDF from well-structured source documents do create them in page order. While it’s not a guarantee, it’s a safe bet that this code will work for many typical PDF files with bookmarks.
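If you would rather verify the ordering than bet on it, a small check before splitting is enough. The sketch below assumes you have already collected the destination page index of each top-level bookmark into an int array; BookmarkOrderCheck is a hypothetical helper, not part of the Toolkit. It requires strictly increasing pages, which also rejects two bookmarks that point at the same page.

```java
final class BookmarkOrderCheck {
    /** Returns true when every destination page comes after the previous one. */
    static boolean isStrictlyIncreasing(int[] destinationPages) {
        for (int i = 1; i < destinationPages.length; i++) {
            if (destinationPages[i] <= destinationPages[i - 1]) {
                return false; // out of order, or two bookmarks share a page
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isStrictlyIncreasing(new int[] {2, 10, 25, 40})); // prints true
        System.out.println(isStrictlyIncreasing(new int[] {2, 25, 10, 40})); // prints false
    }
}
```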

The Process:
As Adobe was developing the “Gibson” library, the original version of the Datalogics PDF Java Toolkit, one of the design concepts was that “you should be able to open the Putty Book and know how to use it.” The Putty Book is how Adobe folks referred to the first published edition of the PDF Specification, whose cover is primarily putty colored.

Once we have the List of top-level bookmarks, we can make some decisions about exactly how we want to split the document, especially if the first bookmark does not point to the first page. We also need to decide what to do with any pages that come after the last bookmark. In this Gist, I’ve chosen to break up the document in a process involving three steps.

  1. Extract the pages from the first page to the page prior to the destination of the second bookmark. This will give me a file that contains the front matter of the document as well as any pages that are in the first section.
  2. Then I split the document at each of the remaining bookmarks creating a new file that contains pages from just that section.
  3. Finally, I create a new file that contains pages from the start of the last section (the last bookmark destination) to the last page. This may not be what you want for your documents but the Gist is easy enough to modify to make it do what you prefer.
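The three steps above boil down to simple arithmetic on the bookmark destination pages. The following is a sketch of that arithmetic only, not the Gist itself, and it makes no Toolkit calls. SplitPlan and PageRange are hypothetical names, page indices are assumed to be zero-based, the destinations are assumed to be in increasing order, and treating a document with fewer than two bookmarks as a single range is my own assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPlan {
    /** A run of consecutive pages: zero-based start index plus page count. */
    static final class PageRange {
        final int start;
        final int count;

        PageRange(int start, int count) {
            this.start = start;
            this.count = count;
        }

        @Override
        public String toString() {
            return "start=" + start + ", count=" + count;
        }
    }

    /** dest holds the destination page of each top-level bookmark, in order. */
    static List<PageRange> plan(int[] dest, int pageCount) {
        List<PageRange> ranges = new ArrayList<>();
        if (dest.length < 2) {
            // Assumption: nothing to split on, so keep the document whole.
            ranges.add(new PageRange(0, pageCount));
            return ranges;
        }
        // Step 1: first page up to the page before the second bookmark.
        ranges.add(new PageRange(0, dest[1]));
        // Step 2: one range per remaining bookmark except the last.
        for (int i = 1; i < dest.length - 1; i++) {
            ranges.add(new PageRange(dest[i], dest[i + 1] - dest[i]));
        }
        // Step 3: last bookmark destination through the final page.
        int last = dest[dest.length - 1];
        ranges.add(new PageRange(last, pageCount - last));
        return ranges;
    }

    public static void main(String[] args) {
        // Four bookmarks on pages 2, 10, 25, 40 of a 50-page document.
        plan(new int[] {2, 10, 25, 40}, 50).forEach(System.out::println);
    }
}
```

For the sample input this yields four ranges covering pages 0 to 9, 10 to 24, 25 to 39, and 40 to 49, so every page lands in exactly one output file. Each start and count pair is what you would then hand to the page extraction call.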

The second step works as follows. Each bookmark has a destination, and each destination refers to a page in the document. We can use the destination page numbers to calculate the starting page and the number of pages to extract to a new file using the PMMService.

If the Gist runs correctly, you end up with a new file for each top-level bookmark in the source file. To get started with splitting PDF files, download this Gist and request an evaluation copy of the Datalogics PDF Java Toolkit.

