JBIG2 Compression: Lossy vs. Lossless

Recently it was reported that the JBIG2 compression implementation in some Xerox scanners has the undesired side-effect of changing some characters under specific circumstances to other characters when scanning documents. Datalogics has received inquiries about the JBIG2 compression that is accessible in the Adobe PDF Library. I’d like to provide a bit of information about JBIG2 and how it may interact with your application.

Overview: JBIG2 (http://en.wikipedia.org/wiki/JBIG2) is a standard compression algorithm for bitonal (1 bit, black & white) raster images that can be used in either a lossless or a lossy mode. In both modes, JBIG2 works by searching the raster image for repeated areas, creating reusable compression dictionary entries for them, and then re-composing the image as a series of compression dictionary references. In lossless mode, every raster image area is reproduced exactly by the dictionary entry it references. In lossy modes, a JBIG2 compressor may instead substitute a dictionary entry that is merely close in appearance to a given raster image area. This improves compression by allowing more references to fewer entries. However, because these substitutions are closest-match references rather than exact ones, reusing an existing entry instead of creating a new one can cause subtle changes in image appearance. Higher levels of lossy compression are achieved by accepting matches that are further from exact, which means more references to even fewer dictionary entries. In some limited circumstances, this can cause characters to change appearance: portions of characters (letters or numbers) are replaced with references to close visual approximations that, unfortunately, add up to the visual appearance of a different character.
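The dictionary matching described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the JBIG2 bitstream or any real encoder: glyphs are tiny fixed-size bitmaps, "closeness" is a plain count of differing pixels, and the names `hamming`, `encode`, `decode` and `max_distance` are hypothetical. A `max_distance` of 0 stands in for lossless mode; anything larger stands in for a lossy mode.

```python
from typing import List, Sequence, Tuple

# A glyph is a tiny bitmap: one string per row, '#' = black, '.' = white.
Glyph = Tuple[str, ...]


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs: Sequence[Glyph], max_distance: int) -> Tuple[List[Glyph], List[int]]:
    """Build a dictionary and a list of references into it.

    max_distance == 0 only reuses exact matches (lossless in this toy model);
    larger values reuse the closest entry within that many differing pixels.
    """
    dictionary: List[Glyph] = []
    refs: List[int] = []
    for glyph in glyphs:
        # Find the closest existing entry, if the dictionary has any.
        best = min(
            range(len(dictionary)),
            key=lambda i: hamming(glyph, dictionary[i]),
            default=None,
        )
        if best is None or hamming(glyph, dictionary[best]) > max_distance:
            # No acceptable match: store the glyph as a new entry.
            dictionary.append(glyph)
            best = len(dictionary) - 1
        refs.append(best)
    return dictionary, refs


def decode(dictionary: Sequence[Glyph], refs: Sequence[int]) -> List[Glyph]:
    """Re-compose the page from dictionary references."""
    return [dictionary[i] for i in refs]


# Two 3x5 glyphs that differ by a single pixel.
SIX: Glyph = ("###", "#..", "###", "#.#", "###")
EIGHT: Glyph = ("###", "#.#", "###", "#.#", "###")
PAGE = [SIX, EIGHT, SIX]

if __name__ == "__main__":
    for threshold in (0, 1):
        dictionary, refs = encode(PAGE, threshold)
        unchanged = decode(dictionary, refs) == PAGE
        print(f"max_distance={threshold}: {len(dictionary)} entries, unchanged={unchanged}")
```

With a threshold of 0 the "8" gets its own dictionary entry and the page decodes unchanged. With a threshold of 1 the "8" is encoded as a reference to the "6" entry, so the decoded page shows a different character, which is the kind of substitution described above.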

Impact: the Adobe PDF Library can be used to compress bitonal images in PDF files with the JBIG2 compressor. When compressing, callers control the level of compression applied to an image, ranging from aggressive lossy settings that search widely for close compression dictionary matches up to completely lossless compression. Callers that specify the most aggressive JBIG2 compression levels might, in theory, see issues similar to those reported against Xerox scanners, though Datalogics has never replicated the specific problem seen with these scanners. While examples of JBIG2 compression for PDF images that Datalogics has distributed in the past have used aggressive compression, the default mode of the JBIG2 compressor as implemented in the PDF Library is lossless, and it will never change the appearance of compressed images.

Recommendation: for archival purposes, Datalogics recommends always using lossless compression when applying JBIG2 compression to images. This can be ensured either by leaving the JB2Quality value in the compression encoding dictionary unset, or by setting it to 10 or greater. For long-term readability, or for extra assurance that future readers will be able to retrieve and decode bitonal images correctly, consider using CCITT G4 encoding. While CCITT G4 encoding does not compress as well as JBIG2, it always compresses losslessly and is supported by a wider variety of current PDF file consumers.
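The JB2Quality rule in the recommendation above can be written as a small pre-flight check. This is a hedged sketch, not the Adobe PDF Library API: the encoding parameters are modeled as a plain Python mapping, and the helper name `is_lossless_jbig2` and the constant `LOSSLESS_THRESHOLD` are hypothetical. Only the key name JB2Quality and the "unset, or 10 or greater" rule come from the text.

```python
from typing import Mapping

# Per the recommendation above: a JB2Quality of 10 or greater is lossless.
LOSSLESS_THRESHOLD = 10


def is_lossless_jbig2(encode_params: Mapping[str, int]) -> bool:
    """Return True if these (hypothetical) parameters select lossless JBIG2.

    Lossless means JB2Quality is either absent (the default mode)
    or set to 10 or greater.
    """
    quality = encode_params.get("JB2Quality")
    if quality is None:
        return True  # value not set: default lossless mode
    return quality >= LOSSLESS_THRESHOLD


if __name__ == "__main__":
    for params in ({}, {"JB2Quality": 10}, {"JB2Quality": 3}):
        print(params, "lossless" if is_lossless_jbig2(params) else "lossy")
```

An archival workflow could run a check like this before compressing and refuse, or fall back to CCITT G4, whenever it reports a lossy setting.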
