How to Find and Remove Watermarks Using the Datalogics PDF Java Toolkit

Sample of the Week:

This post is the direct result of a specific customer question. The developer was looking for a way to remove existing watermarks from a set of PDF files and needed to understand how “Watermark” annotations work.

As it turns out, the way that watermarks and backgrounds get stored in the PDF varies depending on the use case. If you are trying to remove all of the existing watermarks or backgrounds, understanding the Watermark annotation will only get you halfway to the solution.

What You Need to Know First:

When you create a watermark in Adobe Acrobat using the dialog box below, Acrobat uses one of two methods. Which method it uses depends on the state of the checkbox at the bottom center of the dialog: “Keep position and size of watermark text constant when printing on different page sizes”.

[Screenshot: the Acrobat watermark dialog box]

When the checkbox is on, Acrobat adds a “Watermark” annotation to the page, allowing the watermark to scale independently of the underlying content. When the checkbox is off, Acrobat adds an XObject to the page content, so the watermark scales as the page scales. These are the two different use cases that developers must look for when attempting to remove existing watermarks.

Note: because Acrobat allows for multiple watermarks and backgrounds to be added to the same page, it’s conceivable, though unlikely, that both methods may be used on the same page.

This sample looks for and removes watermarks created using both methods.

The Process:

You begin by iterating over the pages, checking each page for both types of watermark or background. The first step of this process is relatively simple: iterate over the page’s annotations and test whether each one has the subtype “Watermark”. When you find one, remove it. Very straightforward.
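The annotation pass described above can be sketched in plain Java. This is only an illustration of the filtering logic: the `Annot` record and the `removeWatermarkAnnotations` method are hypothetical stand-ins, not PDF Java Toolkit classes, and the real sample works against the Toolkit's own page and annotation objects.

```java
import java.util.ArrayList;
import java.util.List;

class WatermarkAnnotationSketch {
    // Hypothetical stand-in for a page annotation; not a PDF Java Toolkit class.
    record Annot(String subtype, String contents) {}

    // Removes every annotation whose subtype is "Watermark" and
    // returns the number of annotations that were removed.
    static int removeWatermarkAnnotations(List<Annot> annotations) {
        int before = annotations.size();
        annotations.removeIf(a -> "Watermark".equals(a.subtype()));
        return before - annotations.size();
    }

    public static void main(String[] args) {
        // A mutable list standing in for one page's annotation list.
        List<Annot> page = new ArrayList<>(List.of(
                new Annot("Link", "table of contents"),
                new Annot("Watermark", "DRAFT"),
                new Annot("Text", "reviewer note")));
        int removed = removeWatermarkAnnotations(page);
        System.out.println(removed + " removed, " + page.size() + " kept");
    }
}
```

In a real document the same test would run once per page, since each page carries its own annotation list.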

The second step is considerably less straightforward. Detecting XObjects that are being used as Watermarks or Backgrounds and removing them requires a bit of PDF surgery. The methods in this Gist can help with that process though.

A PDFXObjectMap can be created from the PDFResources object so that the XObjects associated with the page can be iterated over and examined. One feature of watermarks and backgrounds is that they can be set to appear on screen or not, and to print or not, in any combination. Optional Content Groups (OCGs) are used to implement this feature. Membership in an OCG with a usage type of “BG” (background) or “FG” (foreground) is what tells us that the XObject in question is in fact a watermark or background.
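As a rough sketch of that detection test, the following models the XObject map as an ordinary Java `Map` from resource name to XObject. The `Ocg` and `XObj` records, the `usageType` field, and the sample resource names are all assumptions made for illustration; they are not the Toolkit's PDFXObjectMap API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WatermarkXObjectSketch {
    // Hypothetical stand-ins; not PDF Java Toolkit classes.
    // usageType models the OCG's usage type ("BG", "FG", ...), or null if it has none.
    record Ocg(String name, String usageType) {}
    // ocg is the optional content group the XObject belongs to, or null if it has none.
    record XObj(String kind, Ocg ocg) {}

    static final Set<String> WATERMARK_USAGES = Set.of("BG", "FG");

    // True when the XObject belongs to an OCG whose usage type is BG or FG.
    static boolean isWatermarkOrBackground(XObj x) {
        return x.ocg() != null
                && x.ocg().usageType() != null
                && WATERMARK_USAGES.contains(x.ocg().usageType());
    }

    // Returns the resource names of the XObjects that look like watermarks or backgrounds.
    static List<String> findCandidates(Map<String, XObj> xobjects) {
        return xobjects.entrySet().stream()
                .filter(e -> isWatermarkOrBackground(e.getValue()))
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        Map<String, XObj> xobjects = new LinkedHashMap<>();
        xobjects.put("Fm0", new XObj("Form", new Ocg("Watermark", "FG")));
        xobjects.put("Fm1", new XObj("Form", new Ocg("Background", "BG")));
        xobjects.put("Im0", new XObj("Image", null)); // ordinary page image, no OCG
        System.out.println(findCandidates(xobjects));
    }
}
```

Collecting the names first, rather than deleting during iteration, keeps the later content-stream pass simple: it only needs the list of names to look for.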

Once we’ve detected an XObject that we want to remove, we need to actually remove it. However, it is not enough to simply remove the XObject from the resources. To prevent the viewer from throwing errors and telling you that there are page rendering problems, you must also remove references to the XObject from the page content. In PDF, any XObject can be painted as part of another content stream by means of the Do operator. The syntax is the same in all cases, although details of the operator’s behavior differ depending on the type of XObject. The Do operator takes a single operand, and its value is the ASName of the XObject. So, to locate and remove references to the XObject, we need to iterate through the page content’s painting instructions, looking for “Do” operators and examining their operands for one that matches the XObject’s name. When an instruction is not a match, we write it to a new ContentWriter, creating a new set of instructions that is identical to the original except without references to the XObject. The result becomes our new PDFContents object for that page.
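The copy-everything-but-the-match idea can be sketched as follows. The `Instruction` record is a hypothetical model of one parsed content-stream instruction (operator plus operands, with names stored without their leading slash); it stands in for the Toolkit's own instruction and ContentWriter types, which this sketch does not use.

```java
import java.util.List;

class ContentFilterSketch {
    // Hypothetical model of one content-stream instruction; not a Toolkit class.
    // Name operands are stored without the leading "/", so "/Fm0 Do" becomes ("Do", ["Fm0"]).
    record Instruction(String operator, List<String> operands) {}

    // True when the instruction is a Do operator whose single operand is the given name.
    static boolean paints(Instruction i, String xobjectName) {
        return "Do".equals(i.operator())
                && i.operands().size() == 1
                && xobjectName.equals(i.operands().get(0));
    }

    // Returns a new instruction list that omits every Do referencing xobjectName.
    // All other instructions are copied through unchanged and in order.
    static List<Instruction> withoutXObject(List<Instruction> content, String xobjectName) {
        return content.stream()
                .filter(i -> !paints(i, xobjectName))
                .toList();
    }

    public static void main(String[] args) {
        List<Instruction> content = List.of(
                new Instruction("q", List.of()),
                new Instruction("Do", List.of("Fm0")),  // the watermark
                new Instruction("Q", List.of()),
                new Instruction("Do", List.of("Im1"))); // an unrelated image
        for (Instruction i : withoutXObject(content, "Fm0")) {
            System.out.println(String.join(" ", i.operands()) + " " + i.operator());
        }
    }
}
```

Note that this sketch leaves the surrounding `q` and `Q` instructions in place. An empty save/restore pair is harmless to a viewer, so the filter only needs to drop the Do itself.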

Once we’ve removed the references to the XObject, we can safely remove the XObject itself from the page resources.

At this point, you’re pretty much done; both types of watermarks and backgrounds have been removed and the page content has been scrubbed so that the viewers won’t throw errors because of missing references.

As further cleanup, we can, though we are not required to, remove the OCG artifacts that were created for the watermarks and backgrounds. If the Optional Content Properties dictionary ends up empty after we clean it, we can remove the entire dictionary as well.
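The optional cleanup step amounts to "remove the watermark OCGs, and drop the container if nothing is left". A minimal sketch, assuming the document's OCGs are modeled as a plain list of names (the `cleanOcgs` method and the sample names are hypothetical, not Toolkit API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class OcgCleanupSketch {
    // Returns the OCG names that remain after removing the watermark OCGs,
    // or Optional.empty() when none remain, signalling that the whole
    // Optional Content Properties dictionary can be removed.
    static Optional<List<String>> cleanOcgs(List<String> ocgs, Set<String> watermarkOcgs) {
        List<String> remaining = new ArrayList<>(ocgs);
        remaining.removeAll(watermarkOcgs);
        return remaining.isEmpty() ? Optional.empty() : Optional.of(remaining);
    }

    public static void main(String[] args) {
        Set<String> toRemove = Set.of("Watermark", "Background");
        // A user-created layer survives, so the dictionary is kept.
        System.out.println(cleanOcgs(List.of("Watermark", "Layer 1"), toRemove));
        // Only watermark OCGs were present, so the dictionary can go.
        System.out.println(cleanOcgs(List.of("Watermark", "Background"), toRemove));
    }
}
```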

To get started, download this Gist and request an evaluation copy of the Datalogics PDF Java Toolkit.

 
