Tuesday, July 14, 2009

CF9 Beta: Convert .DOC files to PDF (..if you have OpenOffice)

While I knew about the spreadsheet integration in CF9, I was very surprised to discover handling for other Office formats, specifically the new cfdocument feature that can convert .doc files to .pdf format. Now for the catch: it requires an installation of OpenOffice. But if you can live with that caveat, it is a pretty sweet feature.

I already had some familiarity with OpenOffice, having looked into it a few months back. In summary, I was using JODConverter + OpenOffice + CF8 to convert Office documents to PDF. Essentially, OpenOffice runs as a server, and with the help of JODConverter it receives incoming requests and returns the converted documents to CF. Of course CF9's integration with OpenOffice makes the process a whole lot easier.
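
For comparison, the older route looked roughly like the sketch below. It is not the exact code I used at the time. It assumes the JODConverter 2.x jars are on the ColdFusion classpath and that OpenOffice was already started as a headless service listening on port 8100; the file paths are hypothetical.

```cfm
<!--- Sketch only: assumes JODConverter 2.x is on the classpath and
      OpenOffice is already listening on port 8100. Paths are hypothetical. --->
<cfset inputFile  = createObject("java", "java.io.File").init("c:\myFiles\testWord2003.doc")>
<cfset outputFile = createObject("java", "java.io.File").init("c:\myFiles\testWord2003_Converted.pdf")>

<!--- Open a socket connection to the running OpenOffice service --->
<cfset connection = createObject("java", "com.artofsolving.jodconverter.openoffice.connection.SocketOpenOfficeConnection").init(javaCast("int", 8100))>
<cfset connection.connect()>

<cftry>
    <!--- The output format is inferred from the .pdf extension --->
    <cfset converter = createObject("java", "com.artofsolving.jodconverter.openoffice.converter.OpenOfficeDocumentConverter").init(connection)>
    <cfset converter.convert(inputFile, outputFile)>
    <cfset connection.disconnect()>
    <cfcatch type="any">
        <!--- Always release the connection, then let the error surface --->
        <cfset connection.disconnect()>
        <cfrethrow>
    </cfcatch>
</cftry>
```

Compared with that, a single cfdocument tag is obviously a lot less ceremony.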

Configuring OpenOffice is very simple. Just install OpenOffice and then make sure the directory of the OpenOffice program is correctly entered in the ColdFusion Administrator.

That is about all there is to it. Once the setup is complete, you can convert documents to your heart's content using the same old cfdocument syntax:

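<!--- The srcfile path is an assumed example; point it at your own .doc file --->
<cfdocument format="pdf"
srcfile="c:\myFiles\testWord2003.doc"
filename="c:\myFiles\testWord2003_Converted.pdf" />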
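
If you have a whole folder of documents to convert, the same tag can be wrapped in a loop. The following is only a minimal sketch: the folder path is hypothetical, it relies on the srcfile attribute (which CF9 uses to name the source document), and the cftry is there because some documents may fail to convert.

```cfm
<!--- Hypothetical folder; adjust to your own location --->
<cfset sourceDir = "c:\myFiles">

<!--- List only files (not subfolders) whose names match *.doc --->
<cfdirectory action="list" directory="#sourceDir#" filter="*.doc" type="file" name="docFiles">

<cfloop query="docFiles">
    <!--- Swap the trailing .doc extension for .pdf --->
    <cfset pdfName = reReplaceNoCase(docFiles.name, "\.doc$", ".pdf")>
    <cftry>
        <cfdocument format="pdf"
            srcfile="#sourceDir#\#docFiles.name#"
            filename="#sourceDir#\#pdfName#"
            overwrite="true" />
        <cfoutput>Converted #docFiles.name#<br></cfoutput>
        <cfcatch type="any">
            <!--- Report the failure and move on to the next file --->
            <cfoutput>Failed on #docFiles.name#: #cfcatch.message#<br></cfoutput>
        </cfcatch>
    </cftry>
</cfloop>
```

Catching errors per file means one problem document does not stop the rest of the batch.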

So far the results for the .doc files I have tested have been very good. Cfdocument did a fairly good job converting images and links, and it even created internal page links when converting a table of contents. Not surprisingly, though, it did nothing with embedded objects (like an embedded .zip file). The resulting pdf contained only the icon of the .zip file, not the binary data itself. Of course there are still a bunch of new attributes to look at, like formfields, formsType and of course the new in-memory files feature.

One interesting thing I noticed in the current documentation is that it says:
... the cfdocument tag lets you read, write, and process Word documents and PowerPoint presentations. All versions of Microsoft office applications from 97 to 2003 are supported.

While it does not mention anything about 2007, OpenOffice is capable of converting .docx files. So I decided to test the theory and feed in a .docx file. Sure enough, it successfully converted the file to .pdf format. Though after testing several files, I began to suspect the omission of 2007 may have been deliberate.

OpenOffice can do a fairly good job of converting .docx files, but it occasionally choked on some of the more elaborate ones. Curiously, one of the files it garbled was one I used when testing the OpenXMLViewer program. I am not sure if it was the VML in the source document, or something else. But with that particular document, OpenXMLViewer and docx4j did a much better job at the conversion than OpenOffice. So perhaps 2007 was not listed as "supported" because that format has a larger number of quirks. Still, it would seem to be an option. Though probably an unsupported one ;)

The Man Who Fell to Earth
When I first started testing Word converters a few months ago, I think I had some naive idea that there was a magical program out there that could instantaneously transform any file into any format. Obviously there is no such thing, in either the commercial or the open source realm. So if you are envisioning a world of perfect peace, where all file formats exist in complete harmony, one hundred percent of the time .. go back to bed. You are still dreaming. But if you take a more level-headed approach, and keep in mind some of the practical limitations, you may find that the available tools can get you pretty darn close.

Note, my comments above are not intended to disparage the new cfdocument features, or any of the tools I have tested. But the fact of the matter is the Word format is incredibly complex and, more importantly, fluid. In some ways, even more so than Excel spreadsheets. Add in the conversion to a completely different file format (pdf) and there is definitely room for some "creative interpretations". So my recommendation would be: do not make yourself crazy over every minor formatting difference, or you will end up in a padded cell in no time.

Now, back to our regularly scheduled program.

Related Entries:
CF9 Beta: CFDOCUMENT + OpenOffice Can Convert Any Format to PDF? (Documents,RTF's and Excel Sheets .. oo.Oh!)


Russ Michaels October 19, 2009 at 6:27 AM  

PrimoPDF, a free application for windows that prints anything to PDF format, works flawlessly so far for me.

Ben Nadel February 6, 2010 at 3:22 PM  

That's very interesting. Document updates in CF9 are something that I have hardly looked into yet. I was very curious to know how pixel-perfect the conversions would be. Knowing that there is a 3rd party service underneath makes sense. Good to know.

cfSearching February 6, 2010 at 5:16 PM  

If you run into any issues, you may want to check the OOO bug database first. A few of the issues I encountered (and listed in the release notes as known issues) are actually OpenOffice bugs ;-)
