PDF Feature Exploration: Marked Content

This month marks a quarter-century of PDF! Since June 1993, PDF has provided a rich and powerful format for portable document creation. From the moment PDF was introduced until ISO made it an open, worldwide standard in 2008, the growth in its features and capabilities was rapid and revolutionary. The past decade of relative stability has allowed the use of PDF, and the availability of PDF tools, to proliferate. Others have looked back on the history and ubiquity of PDF: The Register and Motherboard have both spent some good letters and pixels on it. Much has been said about the fundamentals and foundations of PDF as a page description language. But PDF is so much more than just markings on a page. Today, I’m going to take you on a journey through an underappreciated but foundational capability in PDF files.

Exploring marked content

What is marked content? At its simplest, marked content allows a series of drawing operations to be grouped together into a collection. For example, the words of a sentence or the lines that make a shape can be thought of as a single collection of content. These collections are denoted in PDF with marked content sequences. Marked content sequences have been a part of PDF for much of its lifetime. Introduced with PDF 1.2 in 1996, marked content was added to let Acrobat plugins associate proprietary information with portions of pages. This was the beginning of adding structure to the page description model on which PDF was founded.

Marked content has a very simple representation in PDF files. A specific point in a page content stream may be marked with a tag using the MP operator. Likewise, a run of drawing operators may be bracketed between the BMC (Begin Marked Content) and EMC (End Marked Content) operators to group the content, with a tag on BMC denoting the role or significance of the marked content, like so:

/Collection_1 BMC
  % PDF content stream operators: text or path drawing operations
  % to collect up into one entity
EMC
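
For comparison, a point marked with MP stands alone and needs no closing operator. A minimal sketch, where the tag name /Marker_1 is made up for illustration:

% designate a single point in the content stream; there is no EMC
/Marker_1 MP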

More useful is the ability to associate information with page content in the form of property lists. The DP operator marks a specific point in page content, and the BDC operator starts a section of marked content that is terminated by a corresponding EMC operator. Both take a tag plus a list of properties, which may include references to other objects in the PDF file:

/Collection_2 << /PropertyName (Name for this property) >> BDC
  % PDF content stream operators grouped, where these all share
  % a common property or set of properties
EMC
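
Two variations are worth sketching here; the tag and resource names (/Collection_3, /Props_1) and the object number are made up for illustration. DP attaches a property list to a single point. A property list can also be referred to by name through the Properties subdictionary of the page's resource dictionary, which is required when any of its values are indirect references to other objects:

/Collection_3 << /PropertyName (Name for this property) >> DP

% in the page's resource dictionary:
%   /Properties << /Props_1 12 0 R >>
/Collection_3 /Props_1 BDC
  % PDF content stream operators that share the named property list
EMC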

Very quickly, many people saw the potential in being able to group page drawing elements together. Within the PDF standard and within applications, marked content enabled key capabilities:

Element grouping

Grouping drawing elements together to be worked with as a collection. Without marked content, line art can only be represented with its constituent drawing operations: lines, paths, and very simple shapes. Marked content allows grouping these together to form more complex line art, such as charts and graphs. With marked content, charts and graphs and other image items can be moved, copied, and otherwise manipulated as one item – instead of requiring users to tediously select the multiple pieces that make these up in a PDF file.

Logical structure

Describing the logical structure of PDF content. Without marked content, PDF page content can only represent visual elements. Logical structure came into PDF in 1999 with Reader & Acrobat 4.0 and PDF 1.3. With support for properties and property lists, marked content forms the foundation for grouping visual page parts by the document concepts these represent. Because marked content containers are nestable, hierarchical concepts such as chapters in a document can be created. And because logical structure and page drawing content are stored separately in a PDF, users who are interested in the logical structure of a PDF are not required to parse the graphical elements of pages in order to understand the structure.
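
Nesting can be sketched as follows. The tag names /Chapter and /Paragraph are hypothetical, and in a structured document the hierarchy itself is recorded in the separate structure tree rather than by nesting alone:

/Chapter BMC
  /Paragraph BMC
    % operators drawing the first paragraph
  EMC
  /Paragraph BMC
    % operators drawing the second paragraph
  EMC
EMC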

Tagged PDF

Standardizing semantic content representation. Marked content and logical structure were brought together in PDF 1.4 in 2001, and a standard set of structure tags was created, to form what is known as Tagged PDF. Tagged PDF and its Standard Structure Tags (SSTs) extend the ability to group and mark up page content into a portable, interoperable set of document semantic concepts. Complex PDF content spanning dozens or hundreds of different image and text operations, such as tables, now has a way to be specified, exported, and re-imported into different applications, as well as transcoded into different formats.
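
As a rough sketch of how the pieces connect, a marked content sequence in a page's content stream carries a marked-content identifier (MCID), and a structure element elsewhere in the file points back to it. The object numbers, font name, and text below are made up for illustration:

% in the page content stream
/P << /MCID 0 >> BDC
  BT /F1 12 Tf 72 720 Td (A tagged paragraph.) Tj ET
EMC

% in the structure tree
20 0 obj
<< /Type /StructElem
   /S /P          % structure type: paragraph
   /P 19 0 R      % parent structure element
   /Pg 5 0 R      % page whose content stream holds the marked content
   /K 0           % MCID of the marked content sequence above
>>
endobj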

Tagged PDF forms the foundation for the interoperability and conversion of concepts, not just of graphical content. Building on marked content sequences, Tagged PDF provides the basis for many PDF capabilities, including:

  • Content accessibility for screen readers and text-to-speech applications
  • Content extraction for automated content ingestion, in applications such as natural language processing (NLP)
  • Content transformation for non-layout markup languages such as XML

Optional content

The ability of PDF 1.5 and later files to have different layers and layer groups is built on the marked content facility. Using marked content operators to group portions of page content, and PDF arrays and dictionaries to collect these groups, PDF lets users define entire sets of content that can be displayed or hidden together as one. Layers are a powerful application of marked content and other PDF features, and have several common uses:

  • Allowing the inclusion of multiple languages of text in one PDF document. Users can toggle between different languages to match their preference.
  • Collecting multiple components of a drawing into one PDF file and controlling the view of these. In files such as architectural drawings, different elements such as wiring, plumbing, HVAC and structural components can all be included, and their display switched on and off individually, as needed by different viewers.
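
A minimal sketch of one such layer, assuming hypothetical object numbers and the made-up resource name /OC_Wiring. The content is bracketed with the OC tag, which names an optional content group (OCG) through the page's Properties resources, and the document catalog lists the group:

% an optional content group
30 0 obj
<< /Type /OCG /Name (Wiring) >>
endobj

% in the document catalog:
%   /OCProperties << /OCGs [ 30 0 R ] /D << /Order [ 30 0 R ] >> >>
% in the page's resource dictionary:
%   /Properties << /OC_Wiring 30 0 R >>

% in the page content stream
/OC /OC_Wiring BDC
  100 100 m 300 100 l S   % wiring line art, drawn only when the layer is visible
EMC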

From humble beginnings as a means to allow Acrobat plugins to store their own private information in PDF files, content grouping via marked content has provided the basis for many useful capabilities in PDF. We hope you’ve enjoyed this brief look into marked content in PDF files, as well as the different capabilities in PDF that build upon the simple but powerful concept of grouping visual elements together into common structures. If you have any questions, comment below or contact us.
