Splitting PDF Documents: Let’s Count the Ways

Recently, I’ve had the opportunity to discuss three different PDF Library workflow applications that needed to split PDF documents. Of course, each application had its own criteria for how to split the PDF, which required a bit of code beyond the simple split methods. I’ve packaged up all three into a single piece of code that I thought would be useful to review:

  • The first application needed to split a PDF based on pre-defined intervals, perhaps every page or every other page.
  • The second needed to split a document based on pre-existing bookmarks (outlines). Consider a book with chapter and subchapter bookmarks or perhaps a bank statement with bookmarks that point to the start of each individual statement.
  • The third required splitting the document based on search hits.  Again, consider a financial document with individual statements of varying length, but where you know that at the start of each statement, a unique phrase will be found.

Oh! It was also mentioned that there does not seem to be an API named “split” in the PDF Library documentation. True enough; there is no split API!  That’s because the splitting process is typically implemented by creating one or more new, empty documents and inserting pages from the source PDF into the target PDFs.  In other words, to split a PDF, we use the same API that we use to merge PDFs together.

Let’s take a look at SplitPDFVariations.cs, available on GitHub. To simplify things for the demo code, I’ve set up some simple booleans to choose the option:

For the splitByPageInterval and splitByTextString options, additional settings let you set the page interval and the search string. Each option’s function keeps track of where to split the input document and passes listOfPageNumsToSplit to the final function that performs the work.

Splitting by page interval only requires knowing the number of pages in the input document and performing a simple division, plus a modulo to catch the case where the number of pages is not evenly divisible by the interval.
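The division-plus-modulo arithmetic can be sketched without any PDF library at all. The sample itself is C#; the Python below is only an illustration, and the function name, the 0-based page numbering, and the convention that each split point is the first page of an output document are all assumptions, not taken from SplitPDFVariations.cs.

```python
from typing import List


def split_points_by_interval(num_pages: int, interval: int) -> List[int]:
    """Return the 0-based first page of each output document (assumed convention)."""
    if num_pages < 0:
        raise ValueError("num_pages must not be negative")
    if interval < 1:
        raise ValueError("interval must be at least 1")
    # Division gives the number of full chunks; the modulo tells us whether
    # a shorter final chunk is needed for the leftover pages.
    full_chunks, remainder = divmod(num_pages, interval)
    chunk_count = full_chunks + (1 if remainder else 0)
    return [i * interval for i in range(chunk_count)]


print(split_points_by_interval(10, 3))  # [0, 3, 6, 9]
print(split_points_by_interval(10, 5))  # [0, 5]
```

With 10 pages and an interval of 3, the remainder of 1 produces a fourth, single-page chunk starting at page 9.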

Splitting by bookmarks first requires getting the root bookmark, followed by a call to a recursive function that enumerates the bookmarks and gets each page destination. Note that a bookmark may have children, so we need to call the function recursively.
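The recursive walk can be illustrated with a stand-in tree. The `Bookmark` class below is hypothetical and is not the PDF Library's bookmark type; it only models a title, an optional 0-based destination page, and child bookmarks. Collapsing duplicate pages and sorting the result are also assumptions about what a split list would need.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Bookmark:
    """Hypothetical stand-in for a PDF bookmark (outline) node."""
    title: str
    page: Optional[int] = None  # None when the bookmark has no page destination
    children: List["Bookmark"] = field(default_factory=list)


def collect_pages(bookmark: Bookmark, pages: Set[int]) -> None:
    """Record this bookmark's destination page, then recurse into its children."""
    if bookmark.page is not None:
        pages.add(bookmark.page)
    for child in bookmark.children:
        collect_pages(child, pages)


def split_points_from_bookmarks(root: Bookmark) -> List[int]:
    """Return the distinct destination pages in ascending order."""
    pages: Set[int] = set()
    collect_pages(root, pages)
    return sorted(pages)


root = Bookmark("root", children=[
    Bookmark("Chapter 1", 0, [Bookmark("1.1", 0), Bookmark("1.2", 4)]),
    Bookmark("Chapter 2", 9),
])
print(split_points_from_bookmarks(root))  # [0, 4, 9]
```

Chapter 1 and subchapter 1.1 both point at page 0, so the set keeps a single split point there. If only top-level bookmarks should trigger a split, the recursion would simply stop at the first level.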

Finally, splitting by text search strings requires running a WordFinder. The WordFinder and its options are demonstrated elsewhere in the TextExtract and ListWords samples, so we’ll just list a brief snippet here:
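Independent of the WordFinder API itself, the page-selection step can be sketched in plain Python. The input here is an assumption: one list of words per page, standing in for whatever a word finder returns. The function name and the exact, case-sensitive matching are likewise illustrative choices, not the sample's behavior.

```python
from typing import List


def split_points_by_phrase(pages_words: List[List[str]], phrase: str) -> List[int]:
    """Return the 0-based pages on which the phrase appears as consecutive words."""
    target = phrase.split()
    if not target:
        raise ValueError("phrase must contain at least one word")
    n = len(target)
    hits = []
    for page_index, words in enumerate(pages_words):
        # Slide a window of n words across the page and compare it to the phrase.
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            hits.append(page_index)
    return hits


pages = [
    ["Statement", "of", "Account", "for", "Alice"],
    ["continued", "transactions"],
    ["Statement", "of", "Account", "for", "Bob"],
]
print(split_points_by_phrase(pages, "Statement of Account"))  # [0, 2]
```

Each page is reported at most once, which fits the statement use case where the phrase marks the first page of each statement.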

And finally, once you’ve collected your list of split locations, it’s time to perform the work.
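Whichever option produced the list, the last step reduces to turning split points into page runs. The sketch below assumes each split point is the 0-based first page of an output document and returns hypothetical (start, count) pairs; each pair would then drive the library's page-insertion call that copies that run into a new, empty document, which is not shown here.

```python
from typing import Iterable, List, Tuple


def page_ranges(split_points: Iterable[int], num_pages: int) -> List[Tuple[int, int]]:
    """Convert split points into (start_page, page_count) pairs."""
    # Drop out-of-range values and duplicates, then order the starts.
    starts = sorted({p for p in split_points if 0 <= p < num_pages})
    ranges = []
    for i, start in enumerate(starts):
        # A run ends where the next one begins, or at the end of the document.
        end = starts[i + 1] if i + 1 < len(starts) else num_pages
        ranges.append((start, end - start))
    return ranges


print(page_ranges([0, 3, 6, 9], 10))  # [(0, 3), (3, 3), (6, 3), (9, 1)]
print(page_ranges([0, 4, 9], 12))     # [(0, 4), (4, 5), (9, 3)]
```

Pages before the first split point are not assigned to any output in this sketch; adding page 0 to the list first would keep them.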

The full code can be downloaded here. Until next time, with a final shoutout to Patrick for the bookmark enumeration code.
