Capturing Text from Target Regions and Links

I recently helped out on a project that needed to capture building operation and maintenance data for construction projects and then transfer that information to a database. Unfortunately, the originating system and the new repository came from entirely separate vendors, so the only available transport mechanism was a set of master PDF reports created by the originating system that contained links to other data – one for each construction project. While the PDF reports themselves were untagged (unstructured), we were fortunate that all of the reports were formatted consistently. That meant we were able to infer some basic structure based on the location of various items. For example, the heading and subheadings for the data columns were always found in specific locations, and the pointers to individual resource material were always captured within links.

In essence, to collect the desired data for this project, we needed to capture the text from specific locations on the page and also capture any text that fell within a link’s boundaries. PDF links are a type of annotation rather than part of the page content stream. In other words, the text that you see when you click on a PDF link is not part of the link itself, so the “link text” needs to be captured using the same type of process that we use to capture text from a specific target region. In addition, the project needed to capture the link destination and the link coordinates because those were part of the data hierarchy. As a result, the code will examine any link that points to another page (GoTo), points to another document (GoToR or Launch), or points to a web page (URI).

For reference, our test document is a single-page PDF containing some basic text and a small set of link annotations.

We create a pair of classes: one for tracking user-specified target locations and another for tracking links. The links are tracked on a page-by-page basis, while the user-specified target locations apply to all pages, though of course we could make them page-specific as well. Each instance contains the x and y coordinates of the lower-left and upper-right corners, stored as the Left, Bottom, Right, and Top entries.
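As a rough, library-independent illustration of that pair of classes, here is a minimal Python sketch. The class and field names (`Region`, `TargetRegion`, `LinkInfo`) are hypothetical and are not taken from the sample or from the Adobe PDF Library; the header rectangle is made up, while the link rectangle reuses the first link from the sample output.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Rectangle in PDF user space: lower-left and upper-right corners."""
    left: float
    bottom: float
    right: float
    top: float


@dataclass
class TargetRegion(Region):
    """User-specified area; applies to every page in this sketch."""
    description: str = ""
    text: str = ""


@dataclass
class LinkInfo(Region):
    """One link annotation; tracked per page."""
    action_type: str = ""
    destination: str = ""
    text: str = ""


# Hypothetical header rectangle, shared by all pages.
user_targets = [TargetRegion(36.0, 720.0, 200.0, 756.0, "Upper Left header")]

# Links are keyed by page number because each page has its own annotations.
links_by_page = {
    1: [LinkInfo(243.44, 612.48, 367.88, 632.88, "Goto page", "0")],
}
```

Sharing one rectangle base class is only a convenience here: it lets the same containment test serve both target regions and links.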

Before we perform any processing, we set up the WordFinderConfig and call the WordFinder. As the WordFinder is already covered in the TextExtract and ListWords samples, there’s no need to elaborate other than to point out the PreciseQuads setting, a feature that was added in the Adobe PDF Library v15.0.1 release.

By default, the bounding rectangle around text includes any surrounding whitespace that is part of the typeface glyph design. In documents where the text is densely packed, especially those with little or no leading between lines, the bounding box of some Words may overlap those of other Words.

The PreciseQuads option enables the WordFinder to follow the boundaries of the text glyphs themselves more narrowly, minimizing the quad dimensions to enclose the text with as little border whitespace as possible. As an example, note the differences in the Word quad values for the first word in our sample document, set in an eleven-point Calibri font. When PreciseQuads is enabled, the Bottom Y value is more than one point higher (707.715 vs 706.291) and the Top Y value is more than three points lower (717.231 vs 720.322), a significant difference for a commonly used font size.
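To put those four Y values in perspective, the following short Python sketch works out the quad height in each mode. It uses only the numbers quoted above; the variable names are illustrative.

```python
# Y values quoted above for the first word of the sample document.
default_quad = {"bottom": 706.291, "top": 720.322}
precise_quad = {"bottom": 707.715, "top": 717.231}

# Height of the word's bounding quad in each mode, in points.
default_height = round(default_quad["top"] - default_quad["bottom"], 3)
precise_height = round(precise_quad["top"] - precise_quad["bottom"], 3)

# Vertical whitespace removed by PreciseQuads.
trimmed = round(default_height - precise_height, 3)

print(default_height, precise_height, trimmed)
```

The default quad is roughly 14 points tall for eleven-point text, while the precise quad is about 9.5 points tall, so around 4.5 points of vertical padding disappear.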

Capturing the text within a link rectangle uses the same general process as capturing text from specific target areas, so we’ll just review the one section of code. We first iterate through the links and check what type of Action the link specifies. We are only interested in GoTo, RemoteGoTo (GoToR), Launch, and URI actions.
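The filtering step can be sketched independently of any PDF library. In the Python sketch below, each annotation is a plain dictionary, which is an assumption made for illustration and not how the Adobe PDF Library represents annotations. The action names are the action type names used in the PDF format itself.

```python
# Action types the project cares about (PDF action type names).
WANTED_ACTIONS = {"GoTo", "GoToR", "Launch", "URI"}


def select_links(annotations):
    """Keep link annotations whose action is one of the wanted types."""
    return [
        annot
        for annot in annotations
        if annot.get("subtype") == "Link"
        and annot.get("action") in WANTED_ACTIONS
    ]


# Hypothetical annotation list for one page.
annotations = [
    {"subtype": "Link", "action": "GoTo"},
    {"subtype": "Link", "action": "JavaScript"},  # skipped: unwanted action
    {"subtype": "Highlight"},                     # skipped: not a link
    {"subtype": "Link", "action": "URI"},
]

kept = select_links(annotations)
```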

Then we add the coordinates, the action type and the destination to our link info.

We then start the process of comparing the link coordinates to the Word coordinates. We’ve added a couple of fudge factors that allow us to adjust for any links that are mispositioned. For example, note that the Launch link in the sample document does not properly enclose the text.
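A minimal Python sketch of such a containment test with fudge factors follows. The function name, the `x_slack` and `y_slack` parameters, and the word box are all hypothetical; the link rectangle is the Launch link from the sample output.

```python
def word_in_rect(word_box, rect, x_slack=0.0, y_slack=0.0):
    """Return True if word_box lies inside rect, grown by the slack values.

    Both arguments are (left, bottom, right, top) tuples in PDF user space.
    """
    w_left, w_bottom, w_right, w_top = word_box
    left, bottom, right, top = rect
    return (
        w_left >= left - x_slack
        and w_right <= right + x_slack
        and w_bottom >= bottom - y_slack
        and w_top <= top + y_slack
    )


launch_link = (283.20, 480.44, 337.22, 494.58)

# Hypothetical word box that dips just below the link rectangle.
word_box = (284.0, 479.5, 310.0, 493.0)

strict = word_in_rect(word_box, launch_link)
relaxed = word_in_rect(word_box, launch_link, x_slack=2.0, y_slack=2.0)
```

With no slack the word is rejected because its bottom edge sits below the link rectangle; a two-point tolerance is enough to accept it. Larger tolerances risk picking up words from neighboring lines, so the values are worth tuning per document.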

We must iterate through all of the text and compare the coordinates of each Word with the coordinates of our target locations. If the Word fits within the target area, we add it to our extraction.
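The loop itself can be sketched as follows, assuming the words arrive in reading order as `(text, box)` pairs. That representation and the word coordinates are assumptions for illustration; the real sample works with the WordFinder's Word objects instead.

```python
def text_in_region(words, rect):
    """Concatenate the words whose boxes fall entirely inside rect.

    words is a list of (text, (left, bottom, right, top)) pairs.
    """
    left, bottom, right, top = rect
    parts = []
    for text, (w_left, w_bottom, w_right, w_top) in words:
        inside = (
            w_left >= left and w_right <= right
            and w_bottom >= bottom and w_top <= top
        )
        if inside:
            parts.append(text + " ")  # trailing space separates words
    return "".join(parts)


# Hypothetical word boxes for a few words on the page.
words = [
    ("TopLeft", (40.0, 740.0, 80.0, 752.0)),
    ("GoTo", (245.0, 615.0, 270.0, 630.0)),
    ("link", (273.0, 615.0, 292.0, 630.0)),
]

header_text = text_in_region(words, (36.0, 730.0, 200.0, 760.0))
link_text = text_in_region(words, (243.44, 612.48, 367.88, 632.88))
```

Every region scans the full word list in this sketch, which is fine for a handful of regions per page.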

We will also pull out the link coordinates and Action information that we previously stored.

For the database import requirement of the original project, it would be more appropriate to create an XML stream or a CSV file, but for this demo, we will just create a simple text file. When we are done, the output looks something like this:

<page 1> has 2 user target regions, 4 links (4 annotations) and 22 Words
User target area #1 Description: [Upper Left header] Text: [TopLeft ]
User target area #2 Description: [Lower Right footer] Text: [BottomRight ]
Link #1 Coordinates: [243.44,612.48 367.88,632.88] ActionType: [Goto page] Destination: [0] Text: [GoTo link to current page ]
Link #2 Coordinates: [225.81,570.57 387.61,586.48] ActionType: [Remote goto] Destination: [AnotherPDF.pdf] Text: [GoToR link to another document ]
Link #3 Coordinates: [251.87,524.42 362.88,540.13] ActionType: [URI] Destination: [] Text: [URI link to a webpage ]
Link #4 Coordinates: [283.20,480.44 337.22,494.58] ActionType: [Launch] Destination: [AnotherPDF.pdf] Text: [Launch link ]
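For a database import, the same link records could be written as CSV instead. The Python sketch below is one possible layout, not part of the sample: the column names are hypothetical, and the two rows reuse values from the output above.

```python
import csv
import io

FIELDS = ["page", "left", "bottom", "right", "top",
          "action", "destination", "text"]


def links_to_csv(rows):
    """Serialize link records (dicts keyed by FIELDS) to a CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"page": 1, "left": 243.44, "bottom": 612.48, "right": 367.88,
     "top": 632.88, "action": "Goto page", "destination": "0",
     "text": "GoTo link to current page"},
    {"page": 1, "left": 283.20, "bottom": 480.44, "right": 337.22,
     "top": 494.58, "action": "Launch", "destination": "AnotherPDF.pdf",
     "text": "Launch link"},
]

csv_text = links_to_csv(rows)
```

Using the `csv` module rather than manual string joins means that any commas or quotes in the link text are escaped correctly.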

You can download TextExtract-FromRegionsAndLinks.cs from GitHub.

For more questions or more information, contact us or comment below.
