Easily Extract Text, Data, and Tables from PDFs with PDF Alchemist 2.3

Datalogics is proud to announce the release of PDF Alchemist 2.3. We’ve been hard at work on new features and improvements during 2018 and the beginning of this year, and we’re eager to show people what we’ve been working on. Here are the biggest things we’ve addressed:

OCR Support

This release introduces initial OCR support for retrieving text from pictures within PDF files. In most PDF files, the text we read is stored as actual text placed at fixed positions on a page. However, some text is locked inside pictures on those pages, where it is represented not as words or letters but as pictures of words and letters. With the addition of optical character recognition (OCR), PDF Alchemist can now look inside pictures, find text rendered as pixels, and turn it into actual output text. As a user, you can retrieve the OCR text as alternate text for pictures in your output. Or, you can have PDF Alchemist remove images and replace them with their textual equivalents. Of course, you can also leave pictures alone and not search for text inside them.
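To make the three image-handling choices above concrete, here is a minimal Python sketch of how each one might shape HTML output. It is an illustration only and not PDF Alchemist's API: the `ImageTextMode` enum, the `render_image` function, and the exact HTML produced are all hypothetical names and assumptions.

```python
from enum import Enum
from html import escape


class ImageTextMode(Enum):
    """Hypothetical names for the three choices described above."""
    ALT_TEXT = "alt_text"  # keep the picture, attach OCR text as alternate text
    REPLACE = "replace"    # drop the picture, emit its text instead
    IGNORE = "ignore"      # keep the picture, do not use OCR text at all


def render_image(src: str, ocr_text: str, mode: ImageTextMode) -> str:
    """Return an HTML fragment for one picture (assumed output shape)."""
    safe_src = escape(src, quote=True)
    if mode is ImageTextMode.ALT_TEXT:
        return f'<img src="{safe_src}" alt="{escape(ocr_text, quote=True)}">'
    if mode is ImageTextMode.REPLACE:
        return f"<p>{escape(ocr_text)}</p>"
    return f'<img src="{safe_src}">'
```

For example, `render_image("logo.png", "Sale & Save", ImageTextMode.REPLACE)` would yield a paragraph holding the recognized words, while `ImageTextMode.IGNORE` would leave the picture untouched.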

Unlike pure OCR-based solutions, PDF Alchemist analyzes text within documents as text, rather than taking a picture of the page and trying to re-create the text. This gives PDF Alchemist much better performance with multilingual text and avoids the errors that can creep in when all content is pushed through a pixel-focused OCR workflow. For maximum accuracy and performance, PDF Alchemist uses OCR only for content that is already represented as pictures in your PDFs.

Structured Content Reconstruction Improvements

We’ve spent a lot of time working on PDF Alchemist’s table and content reconstruction capabilities. PDF Alchemist 2.3 contains a plethora of fixes and enhancements in these areas, designed to make structured information extraction from PDFs more accurate. We’ve specifically worked towards better table reconstruction, including improved creation of HTML and XML tables from borderless tables! This release also features better preservation of address blocks and small chunks of text, better relative alignment of text and images (left / center / right), and fixes that more accurately preserve images perceived as background or page artifacts when you want to keep them.

Tagged Text Span Support

Most PDF files contain text with unstructured word placement sequences, which is a primary reason why text extraction from PDF files is so difficult. Some PDF files contain information that marks specific character drawing sequences with a text equivalent to be used for text extraction. Sometimes this text equivalent is different from what is drawn on the page. PDF Alchemist can now read these text equivalents (/ActualText entries in Tagged PDF spans) and outputs them in place of the drawn characters when they are present.
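As a rough illustration of the idea, the Python sketch below models a span as the characters drawn on the page plus an optional text equivalent, and prefers the equivalent when one exists. The `Span` class and `extract_text` function are hypothetical stand-ins for this post, not PDF Alchemist's internals, and a real extractor would read /ActualText from the PDF's marked-content properties rather than from a hand-built list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One run of drawn characters, with an optional text equivalent."""
    drawn: str                         # what the page content actually draws
    actual_text: Optional[str] = None  # stand-in for an /ActualText entry


def extract_text(spans: List[Span]) -> str:
    """Join spans, using the text equivalent wherever one is present."""
    return "".join(
        s.actual_text if s.actual_text is not None else s.drawn
        for s in spans
    )


# The word "office" drawn with a single "fi" ligature glyph (U+FB01),
# tagged with the two-letter equivalent "fi".
spans = [Span("of"), Span("\ufb01", actual_text="fi"), Span("ce")]
print(extract_text(spans))
```

Without the text equivalent, the output would contain the ligature character itself, which can defeat a plain search for "office".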

With the advanced capabilities of PDF Alchemist, you can use the extracted data in many different applications, including artificial intelligence and machine learning, eDiscovery, data science and data analytics, and information management and data processing. You can also simply reconstruct PDF content to ensure optimal device-responsive content delivery and consumption for your end users.

Of course, we’re always happy to hear your feedback and suggestions, and to engage with you to showcase how PDF Alchemist can help you do more! If you’re interested in learning more or trying out PDF Alchemist for yourself, please stop by the Datalogics store.
