Detecting PDF Form Types using the Datalogics PDF Java Toolkit

Joel Geraci

If you’re a US citizen, it’s that time of year again… time to fill out a bunch of forms… or at least sign a bunch of forms that someone else has already filled out for you. The United States Internal Revenue Service (IRS) does provide other formats, but these forms are almost invariably in PDF format and have fillable fields.

PDF forms come in two basic varieties: the ones from prior to Adobe’s acquisition of Accelio (JetForm)… and the ones that came after… in more common terms, that’s AcroForms and XFA, respectively. But, to the end user, a PDF form is just a PDF form. As long as users can add the information that is required of them, they don’t care very much how the form was created or what’s going on under the hood. Adobe Acrobat and Reader do a great job of hiding the complexities of the various PDF forms technologies from the user.

But developers of server-based forms processing applications need to know exactly what type of PDF form they are dealing with because each form type has its own idiosyncrasies. The way that XFA forms organize data inside a PDF file is completely different from how AcroForms organize data, even if the field values are exactly the same. If the developer is in control of the input forms, this obviously isn’t an issue. But when the developer isn’t in control or inherits the input PDF forms from an earlier process or provider, they need a way to detect the type of form to determine which workflow to invoke.

This issue is particularly important for IRS tax forms.

The PDF forms on the IRS Forms and Publications site were created using Adobe LiveCycle Designer ES 9.0 and are a kind of hybrid between a PDF file and XFA. This type of form is referred to as XFA Foreground (XFAF) or Static XFA… depending on who you’re talking to. As I mentioned above, AcroForms and XFA forms store data differently in a PDF file; Static XFA forms store data in yet a third way… ok… not a third way… both ways. Static XFA forms store the form data both the way that AcroForms do and as an XML dataset in the XFA portion of the file. Adobe Acrobat and Reader know how to handle these types of files so that there are no conflicts. Most other PDF viewers don’t even have a chance of getting it right because they don’t understand what to do with XFA forms… most PDF developer tools don’t either.

However, the Datalogics PDF Java Toolkit provides the FormTypeEvaluator sample, which shows developers how to use the getDocumentType() method of the XFAService to detect the type of the form that is passed in as a parameter. The sample prints the form type of the document, such as:

  • Flat – The PDF contains no form fields. It might look like a form, but it isn’t interactive.
  • Acroform – This is the original interactive forms technology introduced by Adobe in 1996. Internally, AcroForm fields are defined through dictionaries of PDF objects and can be created in Acrobat.
  • StaticNonShellXFA – Internally, form objects are represented in XML, and creating XFA forms for use in Adobe Reader requires Adobe LiveCycle Designer. Static XFA forms contain a predefined set of form fields which display in a predetermined fashion.
  • DynamicShellXFA – The arrangement and appearance of the form can be made dependent upon the data as it is entered. For example, a parts list table can “grow” additional rows as data is entered, or additional form fields can appear as a result of data entered earlier in the form. Dynamic XFA Forms were introduced in the XFA 2.1 Specification.
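
Once you know the form type, picking a workflow is just a lookup. Here is a minimal sketch of that routing step in plain Java. It does not call the toolkit at all: the FormType and Workflow enums, the workflow names, and the parse() and route() helpers are all hypothetical, made up for illustration, and the only thing taken from the list above is the four type names. It also assumes the detected type reaches your code as one of those printed names.

```java
import java.util.Arrays;
import java.util.List;

public class FormRouter {

    // Hypothetical stand-in for the four form types listed above.
    // This is NOT the toolkit's own type; it only mirrors the printed names.
    enum FormType { FLAT, ACROFORM, STATIC_NON_SHELL_XFA, DYNAMIC_SHELL_XFA }

    // Hypothetical workflow names, invented for this illustration.
    enum Workflow { NOT_FILLABLE, ACROFORM_WORKFLOW, STATIC_XFA_WORKFLOW, DYNAMIC_XFA_WORKFLOW }

    // Maps a printed type name to the stand-in enum; unknown names are rejected
    // rather than silently guessed at.
    static FormType parse(String printedName) {
        switch (printedName.trim()) {
            case "Flat": return FormType.FLAT;
            case "Acroform": return FormType.ACROFORM;
            case "StaticNonShellXFA": return FormType.STATIC_NON_SHELL_XFA;
            case "DynamicShellXFA": return FormType.DYNAMIC_SHELL_XFA;
            default:
                throw new IllegalArgumentException("Unrecognized form type: " + printedName);
        }
    }

    // Picks the workflow to invoke for a detected form type.
    static Workflow route(FormType type) {
        switch (type) {
            case FLAT:
                return Workflow.NOT_FILLABLE;          // no fields, nothing to extract
            case ACROFORM:
                return Workflow.ACROFORM_WORKFLOW;     // data lives in field dictionaries
            case STATIC_NON_SHELL_XFA:
                return Workflow.STATIC_XFA_WORKFLOW;   // data is stored both ways
            case DYNAMIC_SHELL_XFA:
                return Workflow.DYNAMIC_XFA_WORKFLOW;  // layout depends on the data
            default:
                throw new IllegalStateException("Unhandled form type: " + type);
        }
    }

    public static void main(String[] args) {
        List<String> printed = Arrays.asList("Flat", "Acroform", "StaticNonShellXFA", "DynamicShellXFA");
        for (String name : printed) {
            System.out.println(name + " -> " + route(parse(name)));
        }
    }
}
```

Failing loudly on an unrecognized name is deliberate: if a type shows up that your pipeline has never seen, you probably want to know about it before the form gets pushed through the wrong workflow.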

You can test your own forms to see what type they are by checking out our PDF Forms Detector. This PDF Forms Detector is a simple online tool which allows you to upload your PDF and see what type of form exists within it. It is built upon the Datalogics PDF Java Toolkit and demonstrates some of the capabilities of that SDK.

