ColdFusion: Experiment Converting MS Word to HTML/PDF (At Last) - Part 2
In Part 1 I mentioned two interesting projects for converting docx files. The first is the OpenXML Document Viewer project, which converts docx files to HTML. The server version is a small executable that can be run with cfexecute. It accepts three arguments: the path to the input docx file, the output folder, and the target browser type (FIREFOX in the example below).
There is not much more to using the program than that. The installation is equally simple.
Installation:
1. Download the command line version for your operating system.
Example: OpenXMLViewer_Win_Cmd.zip
2. Unzip the files and copy the entire OpenXMLViewer subfolder to the desired location.
Example: I copied the subfolder to: c:\tools\OpenXMLViewer
3. If you are on Windows, you must add the directory from step #2 to your PATH variable (or simply call the program from a .bat file instead). Due to some issues with the PATH value, I ultimately ended up using a .bat file.
Usage:
To use the converter, initialize a few path variables and then run the program with cfexecute.
Code:
<!---
Initialize file paths
--->
<cfset inputFilePath = ExpandPath("Introduction to Microsoft .NET Services.docx")>
<cfset outputFolder = ExpandPath("test")>
<cfset pathToProgram = "C:\tools\OpenXMLViewer\OpenXMLViewer.exe">
<cfset browserType = "FIREFOX">
<!---
OpenXMLViewer does not seem to create the output
folder if it does not exist. So ensure it exists
before doing the conversion.
--->
<cfif NOT DirectoryExists(outputFolder)>
    <cfdirectory action="create" directory="#outputFolder#">
</cfif>
<!---
Do the conversion
arguments='/c "#pathToProgram#" "#inputFilePath#" "#outputFolder#" #browserType# 2>&1'
--->
<cfexecute name="c:\windows\system32\cmd.exe"
    arguments='/c #pathToProgram# "#inputFilePath#" "#outputFolder#" #browserType# 2>&1'
    timeout="120"
    variable="result" />
<!---
Display the generated html file
--->
<cfdirectory action="list" directory="#outputFolder#" name="getHTMLFiles" filter="*html*">
<h3>Generated HTML</h3>
<cfoutput query="getHTMLFiles">
    <a href="#ListLast(directory, '\/')#/#Name#">Display HTML (#Name#)</a>
</cfoutput>
<cfdump var="#result#" label="Results from Cfexecute">
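Since cfexecute only hands back whatever the program printed, a couple of simple checks around the conversion can make failures easier to spot. The following is a minimal sketch, not part of the original experiment. It reuses the same paths as the code above; the exception type names (Converter.MissingProgram, Converter.MissingInput) are made up for illustration.

```cfml
<!--- Sketch only: same paths as in the listing above --->
<cfset inputFilePath = ExpandPath("Introduction to Microsoft .NET Services.docx")>
<cfset pathToProgram = "C:\tools\OpenXMLViewer\OpenXMLViewer.exe">

<!--- Place before cfexecute: stop early if either file is missing.
      The exception type names are hypothetical. --->
<cfif NOT FileExists(pathToProgram)>
    <cfthrow type="Converter.MissingProgram"
             message="OpenXMLViewer.exe was not found at #pathToProgram#">
</cfif>
<cfif NOT FileExists(inputFilePath)>
    <cfthrow type="Converter.MissingInput"
             message="Input document was not found at #inputFilePath#">
</cfif>

<!--- Place after the cfdirectory listing: if no html files were found,
      show the text captured from the program instead of an empty list --->
<cfif IsDefined("getHTMLFiles") AND getHTMLFiles.recordCount EQ 0>
    <cfoutput>
        <p>No HTML files were generated. Output from the converter:</p>
        <pre>#HTMLEditFormat(result)#</pre>
    </cfoutput>
</cfif>
```

The first two checks belong ahead of the cfexecute call, and the last one after the cfdirectory listing, where both getHTMLFiles and result already exist.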
I was pretty pleased with the results, and was actually able to feed the HTML content into cfdocument to produce a reasonable facsimile in PDF format, though that did take a while. I also noticed a few issues with some of those funky MS Word characters I know and love. I am still looking into how to fix that.
Convert to PDF:
<!---
because the .cfm script is not in the same directory
as the html file, first correct the relative image
paths for cfdocument
--->
<cfset pathToFile = ExpandPath("test/Introduction to Microsoft .NET Services.xhtml")>
<cfset content = Replace(FileRead(pathToFile), "word//media/", "test/word/media/", "all")>
<cfdocument format="pdf" filename="#ExpandPath('./convertedDocument.pdf')#" overwrite="true">
    <cfoutput>#content#</cfoutput>
</cfdocument>
But overall the results were very good.
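One thing that might be worth trying for the character issues is reading the file with an explicit charset. This is only a guess: it assumes the generated .xhtml file is UTF-8 encoded and that the odd characters come from an encoding mismatch, neither of which was confirmed in this experiment. A minimal sketch of the same PDF step with that one change:

```cfml
<!--- Assumption: the generated .xhtml file is UTF-8 encoded --->
<cfset pathToFile = ExpandPath("test/Introduction to Microsoft .NET Services.xhtml")>

<!--- Pass the charset explicitly rather than relying on the server default --->
<cfset content = FileRead(pathToFile, "utf-8")>

<!--- Same relative image path correction as in the code above --->
<cfset content = Replace(content, "word//media/", "test/word/media/", "all")>

<cfdocument format="pdf" filename="#ExpandPath('./convertedDocument.pdf')#" overwrite="true">
    <cfoutput>#content#</cfoutput>
</cfdocument>
```

If the characters still come out wrong, the cause is probably elsewhere, for example in how the converter itself writes the file.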
Final Notes/Quirks:
More about OpenXMLViewer and docx4j in Part 3.