PDF Alchemist: The OCRMode Option

In PDF Alchemist 2.3, Datalogics added a new Optical Character Recognition (OCR) feature that retrieves text from images within PDF files. With it, you can have PDF Alchemist return image text as alternate text for pictures in the output, remove images and replace them with their text equivalents, or simply leave pictures alone and not search for text inside them. You can learn more about the new features in this blog article.

Two new options control the OCR feature:

-ocrMode [tag | replace | off]: specifies the desired output characteristics for OCR purposes.
    Default: off
-ocrLanguage [deu | eng | fra | ita | nld | por | spa]: specifies the language to use during OCR.
    Default: eng
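
If you drive PDF Alchemist from a script, the two options above can be checked before the tool is launched. The following is only a minimal Python sketch: the helper names build_command and run_conversion are hypothetical, it assumes the PDFAlchemist executable is on your PATH, and it assumes both options can be passed together on one command line.

```python
import subprocess

# Allowed values as listed in the option descriptions above.
OCR_MODES = ("tag", "replace", "off")
OCR_LANGUAGES = ("deu", "eng", "fra", "ita", "nld", "por", "spa")


def build_command(pdf_path, output_dir, ocr_mode="off", ocr_language="eng",
                  executable="PDFAlchemist"):
    """Return the argument list for one PDF Alchemist run (hypothetical helper)."""
    if ocr_mode not in OCR_MODES:
        raise ValueError(f"ocr_mode must be one of {OCR_MODES}, got {ocr_mode!r}")
    if ocr_language not in OCR_LANGUAGES:
        raise ValueError(
            f"ocr_language must be one of {OCR_LANGUAGES}, got {ocr_language!r}"
        )
    # Assumption: -ocrMode and -ocrLanguage may appear on the same command line.
    return [
        executable, pdf_path, output_dir,
        "-ocrMode", ocr_mode,
        "-ocrLanguage", ocr_language,
    ]


def run_conversion(pdf_path, output_dir, **options):
    """Run the conversion; assumes the executable can be found on PATH."""
    return subprocess.run(build_command(pdf_path, output_dir, **options), check=True)
```

Calling run_conversion("scan.pdf", "out", ocr_mode="tag") would then be roughly equivalent to typing the command by hand, with typos in the mode or language caught before the tool starts.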

Let’s focus on the -ocrMode option and look at an example:

PDF with text of images

The above PDF contains an image of text, but does not contain any text content.

-ocrMode tag

If the -ocrMode option is set to “tag,” PDF Alchemist uses OCR to scan images when converting PDF files. Any text found within an image is embedded in the <img> element’s alt attribute. For example, I used this simple syntax: PDFAlchemist pdf_filepath output_dir -ocrMode tag

Open the output page1.html in a text editor and note that any text retrieved by the OCR process appears in the alt attribute of the <img> element.

Tag HTML and Source
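
Instead of reading page1.html by eye, you could also pull the recognized text out of the alt attributes with a short script. This is a sketch using only Python's standard html.parser module. The class and function names are hypothetical, and the sample markup in the usage note is made up for illustration, not actual PDF Alchemist output.

```python
from html.parser import HTMLParser


class ImgAltCollector(HTMLParser):
    """Collect the non-empty alt text of every <img> element."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags also reach this method, because the
        # default handle_startendtag calls handle_starttag.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_texts.append(alt)


def collect_alt_texts(html):
    """Return the alt texts found in an HTML string, in document order."""
    collector = ImgAltCollector()
    collector.feed(html)
    collector.close()
    return collector.alt_texts
```

For example, collect_alt_texts('<img src="a.png" alt="Hello OCR"/>') returns a list containing the single string "Hello OCR". To check a real conversion, read page1.html from your output directory and pass its contents to the function.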

-ocrMode replace

If the -ocrMode option is set to “replace,” the OCR feature is turned on and the OCR text replaces the original image in the output file. The process creates selectable text in the HTML or XML output and removes the source image. The text is also tagged as OCR text within the export file. This tells the person reviewing the output file where the text came from, and also serves as a warning that the OCR text might not be rendered perfectly. For an HTML or EPUB output file, any text generated from an image in the PDF input file using OCR is marked with this attribute: data-ocr-text="true"

For example, I used this simple syntax: PDFAlchemist pdf_filepath output_dir -ocrMode replace

Again, open the output page1.html in a text editor and note that any text retrieved from the OCR process will be marked with a data-ocr-text="true" attribute.

Replace Source
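
Because OCR text might not be rendered perfectly, a reviewer may want a list of every passage that came from OCR. The sketch below gathers the text of all elements marked with data-ocr-text="true". It again relies only on Python's standard html.parser module and uses hypothetical names. It assumes reasonably well-formed markup in which every non-void element is explicitly closed, and the sample HTML in the usage note is invented for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OcrTextCollector(HTMLParser):
    """Collect the text inside elements carrying data-ocr-text="true"."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._depth = 0    # nesting depth inside the current marked element
        self._buffer = []  # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("data-ocr-text") == "true":
            self._depth = 1
            self._buffer = []

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags contain no text and leave the depth unchanged.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace so each fragment is one clean line.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.fragments.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def collect_ocr_text(html):
    """Return the OCR-marked text fragments of an HTML string, in order."""
    collector = OcrTextCollector()
    collector.feed(html)
    collector.close()
    return collector.fragments
```

Given '<p data-ocr-text="true">Scanned <b>invoice</b> text</p><p>Native text</p>', the function returns only the first paragraph's text, which is the part a reviewer should double-check against the original image.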

You can test the OCR capability discussed here by downloading a free evaluation of PDF Alchemist. As always, feel free to leave us your feedback; we are happy to hear from our audience.
