Microsoft Office to PDF: Automating Conversion

Let’s talk about creating PDF files from Microsoft Office documents. Microsoft Office is the leading suite for creating and editing business documents, presentations, and spreadsheets. While PDF is the ideal format for sharing and archiving these sorts of documents once they are finished, it’s usually best to do the actual writing and editing in the Office environment. We’re often asked how to turn Office files into PDFs for reliable distribution and future-proof storage. There are several ways to do so using tools that are free or that you may already have. For example, you can use Adobe Acrobat, Google Docs, or Microsoft Word to open .docx files and save or export them as PDF files. Each program has strengths and weaknesses in its conversion. The key problem is that these programs are not well suited to automation. Microsoft Word can be automated, but Microsoft recommends against running Office unattended, for example from a server-side process. So let’s cast our net a bit wider…

Automating conversion with LibreOffice

A very popular way to automate conversion of Office files to PDF is with LibreOffice. LibreOffice is an open source suite of document tools. It provides a word processor, spreadsheet, presentation maker, and other programs that are able to import and export documents from popular Office formats. LibreOffice can also be run non-interactively in headless mode. Let’s take advantage of this to automate document conversions to PDF. Here, my examples will be on Windows 10 (x64) – these should be easy to adapt to other environments.

First, install LibreOffice, which you can download from the LibreOffice website. The default installation location is “C:\Program Files\LibreOffice”. Make a note of where the installer put LibreOffice, as you’ll need this path to run the program.

The syntax for running LibreOffice in headless mode to convert Office files to PDF format is

[installation location]\program\soffice.exe --headless --convert-to pdf:writer_pdf_Export --outdir <output directory> <input file>
  • [installation location] is the path where you installed LibreOffice;
  • --headless tells LibreOffice to run in “headless” mode. In this mode, no user interface is shown and control returns directly to the command prompt. LibreOffice is started asynchronously to convert the input document, and once the conversion is complete, the program shuts down automatically;
  • --convert-to specifies that the input should be converted into an output form. The output form syntax is a bit tricky in spots (see below);
  • <output directory> is the path where you’d like the PDF written to. This is optional and will default to your current working directory if not specified;
  • <input file> is the name of the file to convert
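If you’re driving this from a script rather than typing it by hand, the command line above can be assembled programmatically. Here is a minimal Python sketch. The function names build_convert_command and convert_to_pdf are my own, and SOFFICE assumes the default installation location mentioned earlier, so adjust it for your machine.

```python
import subprocess
from pathlib import Path

# Assumption: LibreOffice is installed in the default Windows location.
SOFFICE = Path(r"C:\Program Files\LibreOffice\program\soffice.exe")


def build_convert_command(input_file, outdir, soffice=SOFFICE):
    """Return the argument list for a headless conversion to PDF."""
    return [
        str(soffice),
        "--headless",
        "--convert-to", "pdf:writer_pdf_Export",
        "--outdir", str(outdir),
        str(input_file),
    ]


def convert_to_pdf(input_file, outdir, soffice=SOFFICE):
    """Launch LibreOffice; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(input_file, outdir, soffice), check=True)
```

Passing the arguments as a list (rather than one long string) lets Python handle the quoting, which matters here because the default install path contains a space.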

In headless mode, LibreOffice will start a background process that converts the input document asynchronously. LibreOffice will write a lock file and a temporary file to the output directory as part of document conversion. These are replaced with the output PDF once the process completes successfully. More recent versions of LibreOffice do not require “--headless” to be given explicitly if “--convert-to” is also specified, though for clarity we recommend specifying it anyway.
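Because the conversion can finish after the command has returned, a calling script usually needs some way to wait for the PDF to show up. The sketch below simply polls the output directory. It assumes the output file takes the input file’s base name with a .pdf extension; the name wait_for_pdf and the timeout values are hypothetical, so treat this as a starting point rather than a robust completion check.

```python
import time
from pathlib import Path


def wait_for_pdf(input_file, outdir, timeout=60.0, poll=0.5):
    """Poll outdir until the expected PDF exists and is non-empty."""
    # Assumption: output is named <input base name>.pdf inside outdir.
    expected = Path(outdir) / (Path(input_file).stem + ".pdf")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if expected.is_file() and expected.stat().st_size > 0:
            return expected
        time.sleep(poll)
    raise TimeoutError(f"{expected} did not appear within {timeout} seconds")
```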

The convert-to syntax lets you specify both the output format and the filter used to write the output. Several filters in the LibreOffice installation can write PDF files; in our experience, the “writer_pdf_Export” filter seems to be the most reliable for conversion from Office formats.

Specifying conversion options

Ok, this is where things get tricky…

There are a lot of options for PDF export in the LibreOffice GUI, but unfortunately no way to pass these options to the headless LibreOffice program directly. Instead, these need to be specified in a file called registrymodifications.xcu. This is an XML file that, on Windows, can be found for a given user in

  %APPDATA%\LibreOffice\4\user\

This registry modifications file holds every preference the user has changed in LibreOffice, so any edit you make for conversion will also change what that user sees when running LibreOffice interactively. Unless that is what you want, make a backup copy of the file, modify it, run the conversion, and then restore the original.
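The backup, modify, convert, restore routine is easy to get wrong when a conversion fails halfway, so it’s worth wrapping in something that always restores the original. Here is a minimal Python sketch using a context manager. The name temporary_profile_edit and the “.bak” suffix are my own choices, not anything LibreOffice defines.

```python
import contextlib
import os
import shutil
from pathlib import Path


@contextlib.contextmanager
def temporary_profile_edit(xcu_path):
    """Back up registrymodifications.xcu, then restore it on exit."""
    xcu_path = Path(xcu_path)
    backup = xcu_path.with_name(xcu_path.name + ".bak")
    shutil.copy2(xcu_path, backup)
    try:
        # The caller edits the file and runs the conversion here.
        yield xcu_path
    finally:
        # os.replace overwrites the modified file with the backup.
        os.replace(backup, xcu_path)
```

You would use it as `with temporary_profile_edit(path): ...`, doing both the edit and the conversion inside the block. The original file comes back even if the conversion raises an exception.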

PDF export options are not well documented. A complete and up-to-date list of valid XML options is best obtained from the LibreOffice GitHub repo, at https://github.com/LibreOffice/core/blob/master/officecfg/registry/schema/org/openoffice/Office/Common.xcs. Look for the PDF group starting with the XML tag

  <group oor:name="PDF">

However, this XML might be tricky to read on its own. An older list of options on the OpenOffice wiki – https://wiki.openoffice.org/wiki/API/Tutorials/PDF_export – gives an easier to read list of a good number of these options.

Example: specifying PDF/A output

By default, LibreOffice 6 will write PDF 1.5 files. PDF/A is a common requirement for archiving and long-term storage of documents. In the Common.xcs file above, the SelectPdfVersion property sets which PDF version (PDF or PDF/A) to write:

  <!-- PDF Version selection -->
    <prop oor:name="SelectPdfVersion" oor:type="xs:int" oor:nillable="false">
      <info>
        <desc>Specifies the version of PDF to emit.</desc>
      </info>
      <constraints>
        <enumeration oor:value="0">
          <info>
            <desc>PDF 1.5 (default selection).</desc>
          </info>
        </enumeration>
        <enumeration oor:value="1">
          <info>
            <desc>PDF/A-1 (ISO 19005-1:2005)</desc>
          </info>
        </enumeration>
      </constraints>
      <value>0</value>
    </prop>
  <!-- END PDF Version selection -->

The default value is 0, specifying PDF 1.5 output as the default. To change this, add the following item to your registrymodifications.xcu file:

  <item oor:path="/org.openoffice.Office.Common/Filter/PDF/Export">
    <prop oor:name="SelectPdfVersion" oor:op="fuse">
      <value>1</value>
    </prop>
  </item>

If you’re making this change just once (for example, to switch permanently to PDF/A output on a computer where you’ll only be running LibreOffice for headless conversions), I recommend adding it as the last item in the registry modifications file, just before the closing </oor:items> tag. On the other hand, if you’re writing a process that changes this file depending on user conversion options, you’ll very much want to edit it with an XML DOM-aware package or API rather than with plain text manipulation.
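As one illustration of the DOM-aware approach, here is a minimal sketch using Python’s standard xml.etree.ElementTree module. The function name set_pdf_version is my own. The sketch assumes the file’s root is an oor:items element in the http://openoffice.org/2001/registry namespace, and that LibreOffice is not running while the file is edited. ElementTree may also drop unused namespace declarations and change whitespace when it rewrites the file, so test this against a backup copy first.

```python
import xml.etree.ElementTree as ET

OOR_NS = "http://openoffice.org/2001/registry"
EXPORT_PATH = "/org.openoffice.Office.Common/Filter/PDF/Export"
PATH_ATTR = f"{{{OOR_NS}}}path"
NAME_ATTR = f"{{{OOR_NS}}}name"
OP_ATTR = f"{{{OOR_NS}}}op"

# Keep the familiar prefixes when the file is written back out.
ET.register_namespace("oor", OOR_NS)
ET.register_namespace("xs", "http://www.w3.org/2001/XMLSchema")
ET.register_namespace("xsi", "http://www.w3.org/2001/XMLSchema-instance")


def set_pdf_version(xcu_path, version):
    """Set SelectPdfVersion (0 = PDF 1.5, 1 = PDF/A-1) in an .xcu file."""
    tree = ET.parse(xcu_path)
    root = tree.getroot()

    # Reuse an existing SelectPdfVersion entry if the file already has one.
    prop = None
    for item in root.findall("item"):
        if item.get(PATH_ATTR) != EXPORT_PATH:
            continue
        for candidate in item.findall("prop"):
            if candidate.get(NAME_ATTR) == "SelectPdfVersion":
                prop = candidate

    # Otherwise append a new item as the last child of the root element.
    if prop is None:
        item = ET.SubElement(root, "item", {PATH_ATTR: EXPORT_PATH})
        prop = ET.SubElement(
            item, "prop", {NAME_ATTR: "SelectPdfVersion", OP_ATTR: "fuse"}
        )

    value = prop.find("value")
    if value is None:
        value = ET.SubElement(prop, "value")
    value.text = str(version)
    tree.write(xcu_path, encoding="UTF-8", xml_declaration=True)
```

Calling set_pdf_version(path, 1) switches to PDF/A-1 output, and calling it again with 0 updates the same entry rather than adding a duplicate.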

unoconv: an easier way (for some users)

unoconv (Universal Office Converter) is a well-regarded Python script that takes care of a lot of the tasks involved in using LibreOffice in a fully automated way, including letting users pass filter options on the command line and performing the necessary configuration changes automatically. unoconv is GPL v2 licensed, however, and therefore not suitable for everyone and every use case out there. If your situation allows you to use unoconv, I highly recommend checking it out.
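For comparison, here is roughly what the PDF/A example could look like through unoconv. This sketch assumes unoconv is on your PATH and relies on its -f (output format) and -e (export filter option) flags as described in the unoconv documentation. The helper names build_unoconv_command and convert_with_unoconv are made up for illustration.

```python
import subprocess


def build_unoconv_command(input_file, export_options=None):
    """Return an unoconv argument list that converts input_file to PDF."""
    cmd = ["unoconv", "-f", "pdf"]
    # Each export filter option is passed as its own -e NAME=VALUE pair.
    for name, value in (export_options or {}).items():
        cmd += ["-e", f"{name}={value}"]
    cmd.append(str(input_file))
    return cmd


def convert_with_unoconv(input_file, export_options=None):
    """Run unoconv; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_unoconv_command(input_file, export_options), check=True)
```

With this, build_unoconv_command("report.docx", {"SelectPdfVersion": 1}) requests PDF/A-1 output without touching registrymodifications.xcu yourself.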

Yes! I have a PDF now

Now that you’ve converted your Office document into a PDF, you have a copy that is well-suited for distribution to readers across a variety of platforms, as well as for archiving and preservation. If you’re interested in additional PDF functionality, Datalogics products and technologies can help you go further:

  • Make sure that your PDF is free of common issues that can lead to problems down the road with our free PDF CHECKER tool
  • Ensure your PDF is optimized for fast downloading and sharing – reduce its size with our PDF OPTIMIZER
  • Combine, continue, or transform your PDF with our PDF developer SDKs and APIs, including the Adobe PDF Library and Datalogics PDF Java Toolkit

Happy trails!
