Tuesday, June 30, 2009

ColdFusion: Experiment Converting MS Word to HTML/PDF (At Last) - Part 2

In Part 1, I mentioned two interesting projects for converting docx files. The first one is the OpenXML Document Viewer Project, which can be used to convert docx files to html. The server version is a small executable that can be run with cfexecute. It accepts three arguments:

  • source_file - absolute path to the docx file to convert
  • dest_path - folder to place the converted files (html and images)
  • browser_type - used to generate a browser-specific html file (IE, Firefox or Opera)

There is not much more to using the program than that. The installation is equally simple.

1. Download the command line version for your operating system.
Example: OpenXMLViewer_Win_Cmd.zip

2. Unzip the files and copy the entire OpenXMLViewer subfolder to the desired location.
Example: I copied the subfolder to: c:\tools\OpenXMLViewer

3. If you are on Windows, you must add the directory from step #2 to your PATH variable (or simply call the program from a .bat file instead). Due to some issues with the PATH value, I ultimately ended up using a .bat file.

To use the converter, just initialize a few path variables then run the program with cfexecute.


<!--- Initialize file paths --->
<cfset inputFilePath = ExpandPath("Introduction to Microsoft .NET Services.docx")>
<cfset outputFolder = ExpandPath("test")>
<cfset pathToProgram = "C:\tools\OpenXMLViewer\OpenXMLViewer.exe">
<cfset browserType = "FIREFOX">

<!---
OpenXMLViewer does not seem to create the output
folder if it does not exist. So ensure it exists
before doing the conversion.
--->
<cfif NOT DirectoryExists(outputFolder)>
<cfdirectory action="create" directory="#outputFolder#">
</cfif>

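<!---
Do the conversion
arguments='/c "#pathToProgram#" "#inputFilePath#" "#outputFolder#" #browserType# 2>&1'
--->
<!--- A timeout makes cfexecute wait for the program so its output lands in "result"; 120 seconds is an arbitrary value --->
<cfexecute name="c:\windows\system32\cmd.exe"
arguments='/c #pathToProgram# "#inputFilePath#" "#outputFolder#" #browserType# 2>&1'
variable="result"
timeout="120" />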

<!--- Display the generated html file --->
<cfdirectory action="list" directory="#outputFolder#" name="getHTMLFiles" filter="*html*">

<h3>Generated HTML</h3>
<cfoutput query="getHTMLFiles">
<a href="#ListLast(directory, '\/')#/#Name#">Display HTML (#Name#)</a>
</cfoutput>

<cfdump var="#result#" label="Results from Cfexecute">

I was pretty pleased with the results and was actually able to feed the html content into cfdocument to produce a reasonable facsimile in pdf format, though that did take a while. I also noticed a few issues with some of those funky MS Word characters I know and love. I am still looking into how to fix that.

<!---
Convert to PDF:
because the .cfm script is not in the same directory
as the html file, first correct the relative image
paths for cfdocument
--->
<cfset pathToFile = ExpandPath("test/Introduction to Microsoft .NET Services.xhtml")>
<cfset content = Replace(FileRead(pathToFile), "word//media/", "test/word/media/", "all")>

<cfdocument format="pdf" filename="#ExpandPath('./convertedDocument.pdf')#" overwrite="true">
<cfoutput>#content#</cfoutput>
</cfdocument>

But overall the results were very good.

PDF - With Funky Characters

Final Notes/Quirks:
  1. OpenXMLViewer generates one type of file for Internet Explorer and another for Firefox and Opera. I believe the primary reason is that Word documents can contain VML. Internet Explorer is capable of displaying VML, but Firefox and Opera are not. So if you select either of the latter browser types, the VML is converted to SVG.

  2. From what I can tell, OpenXMLViewer does not allow you to specify the name of the output file or the path to the image directories. So unfortunately that means you must output the generated files to separate directories to avoid naming conflicts.

  3. OpenXMLViewer does not seem to create the output folder if it does not exist. So you must ensure it exists before calling the program.
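One way to handle quirks 2 and 3 together is to give every conversion its own freshly created output folder. This is only a rough sketch: the "converted" base folder name and the idea of naming each sub-folder with CreateUUID() are my own assumptions for illustration, not something the tool requires.

```cfm
<!---
Sketch only. The base folder name "converted" and the
UUID-per-conversion naming scheme are assumptions.
--->
<cfset baseFolder = ExpandPath("converted")>
<cfset outputFolder = ExpandPath("converted/" & CreateUUID())>

<!--- Create the base folder first, in case it is not there yet --->
<cfif NOT DirectoryExists(baseFolder)>
    <cfdirectory action="create" directory="#baseFolder#">
</cfif>

<!--- One sub-folder per conversion, so generated html and image files cannot collide --->
<cfif NOT DirectoryExists(outputFolder)>
    <cfdirectory action="create" directory="#outputFolder#">
</cfif>
```

The resulting outputFolder value can then be passed to the program as the dest_path argument. A folder name derived from the document name would be easier to read, but would collide again if the same document is converted twice.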

More about OpenXMLViewer and docx4j in Part 3.


Todd Sharp July 1, 2009 at 10:30 AM  

Does this utility only handle docx, or can it handle xlsx and pptx too? I'd figure it should handle the others since the name is OpenXMLViewer, but I'm not seeing anything that indicates supported formats.

cfSearching July 1, 2009 at 10:46 AM  


No, I have looked at the source code and I believe this particular tool is for docx only.

I agree the project naming is a bit ambiguous. The home site is "OpenXML Viewer", but only the subproject "OpenXML Document Viewer" is mentioned. So perhaps they have plans for other formats. That said, I was only interested in docx, so I did not look that deeply ;-)


Todd Sharp July 1, 2009 at 10:49 AM  

Thanks - good to know. That said, have you looked at using OpenOffice (and the ODF Converter Integrator - http://katana.oooninja.com/w/odf-converter-integrator) to do what you're trying to do?

I've only messed with the PPTX stuff, and it was a bit lacking IMO, but it is another option.

Looking forward, as always, to your next post.

cfSearching July 1, 2009 at 11:05 AM  


Hmm, I do not know if I came across that one before. There was one odf converter I was looking at, but I got sidetracked with docx4j. Thanks, I will check that one out and see how it does with docx files.

Yes, some of the tools are better than others. Every one I have tested so far has "some" issues, even the better ones. I suspect the commercial tools are no different. So I guess the motto here is "manage your expectations". ;-)

