Tuesday, June 30, 2009

ColdFusion: Experiment Converting MS Word to HTML/PDF (At Last) - Part 1

So a while back I was looking into open source tools that could be used to convert RTF or MS Word files to HTML and/or PDF format. Eventually, I stumbled across two projects with good potential, at least for the newer .docx format.

I decided to focus on the newer format only, and not the binary .doc format, primarily because there was better support for OOXML. Plus, there are several tools available for converting binary documents to OOXML, so the newer format seemed the better option. I am still in the experimental stages, but so far the results have been pretty good.


OpenXMLViewer Project
License: Microsoft Public License (Ms-PL) and Related licenses
Base Language: C++
The OpenXMLViewer project focuses on "showing how documents created using Open XML Format can be translated to HTML". The OpenXML Document Viewer project provides both a browser plugin and a command line program for server side conversion of .docx files to HTML.

docx4j Project
License: Apache License (v2)
Base Language: Java
The docx4j project has a broader scope. It is "an open source Java library for manipulating OpenXML WordprocessingML documents", including conversion to HTML and PDF formats.

Six Degrees of Separation
Though I came across the two projects through separate routes, surprisingly they have something in common. One of the tools utilized by the OpenXMLViewer is XSLT. The OpenXMLViewer program uses a file named DocX2Html.xslt to perform a large chunk of the conversion from OOXML to HTML. Now the docx4j project also contains a docx to HTML converter. Technically it has two different implementations, and the later one is based on the DocX2Html.xslt file from the OpenXMLViewer project. So even though the two projects are based in different languages, it turns out they were not as far apart as I thought. At least in one respect.
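Since an XSLT stylesheet is just data, the same file can be driven from C++, Java or ColdFusion. Here is a minimal sketch of the Java side, using only the JDK's built-in javax.xml.transform classes. The tiny stylesheet and the class and method names are hypothetical stand-ins for illustration, and I have not verified that the real DocX2Html.xslt runs unmodified under the JDK's bundled processor.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    // Hypothetical stand-in stylesheet (NOT DocX2Html.xslt):
    // it turns each <p> under <doc> into an HTML paragraph.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\"/>"
      + "<xsl:template match=\"/doc\">"
      + "<html><body><xsl:apply-templates select=\"p\"/></body></html>"
      + "</xsl:template>"
      + "<xsl:template match=\"p\">"
      + "<p><xsl:value-of select=\".\"/></p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Compile the stylesheet text and apply it to the XML text,
    // returning the transformed output as a string.
    static String transform(String xml, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<doc><p>Hello</p><p>World</p></doc>", STYLESHEET));
    }
}
```

Swapping the stand-in stylesheet for a real one only changes the Source handed to newTransformer. The calling code stays the same, which is roughly why two projects in different languages can share one .xslt file.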

A Thin Line ..
Like anyone who is not an xslt guru, I have a stormy relationship with it. On the one hand, trying to debug the type of transformations needed to process something as complex as the Microsoft Office File Format is enough to make my head implode. But it is also amazing just how much you can do with it.

For example, you can get a rough picture of how much of the conversion is achieved with XSLT by downloading the DocX2Html.xslt file and running a simple transform. Just use cfzip to extract the main document content (word/document.xml), then run it through XmlTransform.

<!--- initialize file paths --->
<cfset pathToXSLT = ExpandPath("DocX2Html_V2.xslt")>
<cfset inputPath  = "c:\test\docs\Introduction to Microsoft .NET Services.docx">
<cfset outputPath = ExpandPath("DocXToHTML-RoughVersion.html")>

<!--- read the document.xml file into a variable --->
<cfzip action="read"
    entrypath="word/document.xml"
    file="#inputPath#"
    variable="docXML">

<!--- transform the content to html and save to disk --->
<cfset htmlDoc = XmlTransform(docXML, pathToXSLT)>
<cfset FileWrite(outputPath, htmlDoc)>

<!--- Display raw results --->
<cfoutput>
<a href="#GetFileFromPath(outputPath)#">Display as HTML</a><br><br>
Generated HTML:
<pre>
#HTMLEditFormat(htmlDoc)#
</pre>
</cfoutput>
You can see from the results that it handles most of the formatting. What it does not handle are things like list numbering, links and images. Those are handled by other areas of the OpenXMLViewer program and the docx4j jar (as are those crazy MS quote characters we all know and love). Obviously more than a simple transform is needed to fully convert documents, but it does give you a glimpse of what the final output might look like.
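Links and images are a good illustration of why document.xml alone is not enough. Inside document.xml a hyperlink or picture is only referenced by a relationship id, and the actual target lives in a separate package part, word/_rels/document.xml.rels. As a rough sketch in plain Java (JDK classes only, and the class and method names are hypothetical, not taken from OpenXMLViewer or docx4j), the id-to-target lookup could be read like this:

```java
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocxRelationships {

    // Package part holding the relationships of the main document
    static final String RELS_ENTRY = "word/_rels/document.xml.rels";
    static final String RELS_NS =
        "http://schemas.openxmlformats.org/package/2006/relationships";

    // Returns relationship id -> target (a URL, media/image1.png, ...).
    // An empty map means the package has no relationships part.
    static Map<String, String> readTargets(Path docx) throws Exception {
        Map<String, String> targets = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry(RELS_ENTRY);
            if (entry == null) {
                return targets;
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document doc = factory.newDocumentBuilder().parse(in);
                NodeList rels = doc.getElementsByTagNameNS(RELS_NS, "Relationship");
                for (int i = 0; i < rels.getLength(); i++) {
                    Element rel = (Element) rels.item(i);
                    targets.put(rel.getAttribute("Id"), rel.getAttribute("Target"));
                }
            }
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: DocxRelationships <file.docx>");
            return;
        }
        readTargets(Paths.get(args[0]))
            .forEach((id, target) -> System.out.println(id + " -> " + target));
    }
}
```

A full converter would then have to feed these targets back into the generated HTML, which is the kind of work the surrounding program code does outside the stylesheet.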


Source file 1: Copy of Pete Freitag's Cfscript Cheat Sheet found on Google.








Continued in Part 2

Update: I finally had to give up on using docx4j with Adobe ColdFusion. It is nothing against docx4j. But there were just too many "jar hell" type conflicts with CF's own internal jars.

