ColdFusion: Experiment Converting MS Word to HTML/PDF (At Last) - Part 2
In Part 1 I mentioned two interesting projects for converting docx files. The first one is the OpenXML Document Viewer Project, which can be used to convert docx files to html. The server version is a small executable that can be used with cfexecute. The small program accepts three arguments: the path to the input docx file, the output folder, and the target browser type (for example, FIREFOX).
There is not much more to using the program than that. The installation is equally simple.
Installation:
1. Download the command line version for your operating system.
Example: OpenXMLViewer_Win_Cmd.zip
2. Unzip the files and copy the entire OpenXMLViewer subfolder to the desired location.
Example: I copied the subfolder to: c:\tools\OpenXMLViewer
3. If you are on Windows, you must add the directory from step #2 to your PATH variable (or simply call the program from a .bat file instead). Due to some issues with the PATH value, I ultimately ended up using a .bat file.
Usage:
To use the converter, just initialize a few path variables, then run the program with cfexecute.
Code:
<!---
Initialize file paths
--->
<cfset inputFilePath = ExpandPath("Introduction to Microsoft .NET Services.docx")>
<cfset outputFolder = ExpandPath("test")>
<cfset pathToProgram = "C:\tools\OpenXMLViewer\OpenXMLViewer.exe">
<cfset browserType = "FIREFOX">
<!---
OpenXMLViewer does not seem to create the output
folder if it does not exist. So ensure it exists
before doing the conversion.
--->
<cfif NOT DirectoryExists(outputFolder)>
    <cfdirectory action="create" directory="#outputFolder#">
</cfif>
<!---
Do the conversion
arguments='/c "#pathToProgram#" "#inputFilePath#" "#outputFolder#" #browserType# 2>&1'
--->
<cfexecute name="c:\windows\system32\cmd.exe"
    arguments='/c #pathToProgram# "#inputFilePath#" "#outputFolder#" #browserType# 2>&1'
    timeout="120"
    variable="result" />
<!---
Display the generated html file
--->
<cfdirectory action="list" directory="#outputFolder#" name="getHTMLFiles" filter="*html*">
<h3>Generated HTML</h3>
<cfoutput query="getHTMLFiles">
<a href="#ListLast(directory, '\/')#/#Name#">Display HTML (#Name#)</a>
</cfoutput>
<cfdump var="#result#" label="Results from Cfexecute">
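If the conversion fails, the listing above simply shows no links. As a minimal sketch (assuming the same "test" output folder and the result variable captured by cfexecute, and not part of the original code), something like the following could surface the captured program output when nothing was generated:

```cfml
<!--- Sketch only: reuses the "test" output folder and the "result" variable from the example above --->
<cfset outputFolder = ExpandPath("test")>
<!--- Default to an empty string in case cfexecute was not run or captured nothing --->
<cfparam name="result" default="">
<!--- Assumes the output folder already exists, as ensured before the conversion --->
<cfdirectory action="list" directory="#outputFolder#" name="getHTMLFiles" filter="*html*">
<cfif getHTMLFiles.recordCount EQ 0>
    <cfoutput>
    <p>No HTML files were found in #HTMLEditFormat(outputFolder)#.</p>
    <pre>#HTMLEditFormat(result)#</pre>
    </cfoutput>
</cfif>
```

Because the arguments end with 2>&1, any error text the program writes should end up in the same result variable, so a single check covers both streams.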
I was pretty pleased with the results, and was actually able to feed the html content into cfdocument to produce a reasonable facsimile in pdf format, though that did take a while. I also noticed a few issues with some of those funky MS Word characters I know and love. I am still looking into how to fix that.
Convert to PDF:
<!---
because the .cfm script is not in the same directory
as the html file, first correct the relative image
paths for cfdocument
--->
<cfset pathToFile = ExpandPath("test/Introduction to Microsoft .NET Services.xhtml")>
<cfset content = Replace(FileRead(PathToFile), "word//media/", "test/word/media/", "all")>
<cfdocument format="pdf" filename="#ExpandPath('./convertedDocument.pdf')#" overwrite="true">
<cfoutput>#content#</cfoutput>
</cfdocument>
But overall the results were very good.
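One possible cause of the special-character issues is an encoding mismatch when the file is read. This is only a guess: the sketch below assumes the converter writes its xhtml output as UTF-8, which has not been confirmed here, and passes an explicit charset to FileRead instead of relying on the server default. The rest mirrors the PDF code above.

```cfml
<!--- Assumption (unconfirmed): the generated xhtml file is UTF-8 encoded --->
<cfset pathToFile = ExpandPath("test/Introduction to Microsoft .NET Services.xhtml")>
<!--- Pass an explicit charset so FileRead does not fall back to the default encoding --->
<cfset content = FileRead(pathToFile, "utf-8")>
<!--- Same relative image path correction as in the original example --->
<cfset content = Replace(content, "word//media/", "test/word/media/", "all")>
<cfdocument format="pdf" filename="#ExpandPath('./convertedDocument.pdf')#" overwrite="true">
    <cfoutput>#content#</cfoutput>
</cfdocument>
```

If the characters still come out wrong, the problem probably lies elsewhere, for example in the fonts available to cfdocument or in the generated markup itself.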
Final Notes/Quirks:
More about OpenXMLViewer and docx4j in Part 3.
4 comments:
Does this utility only handle docx, or can it handle xlsx and pptx too? I'd figure it should handle the others since the name is OpenXMLViewer, but I'm not seeing anything that indicates supported formats.
@Todd,
No, I have looked at the source code and I believe this particular tool is for docx only.
I agree the project naming is a bit ambiguous. The home site is "OpenXML Viewer", but only the subproject "OpenXML Document Viewer" is mentioned. So perhaps they have plans for other formats. That said, I was only interested in docx, so I did not look that deeply ;-)
-Leigh
Thanks - good to know. That said, have you looked at using OpenOffice (and the ODF Converter Integrator - http://katana.oooninja.com/w/odf-converter-integrator) to do what you're trying to do?
I've only messed with the PPTX stuff, and it was a bit lacking IMO, but it is another option.
Looking forward, as always, to your next post.
@Todd,
Hmm, I do not know if I came across that one before. There was one odf converter I was looking at, but I got sidetracked with docx4j. Thanks, I will check that one out and see how it does with docx files.
Yes, some of the tools are better than others. Every one I have tested so far has "some" issues, even the better ones. I suspect the commercial tools are no different. So I guess the motto here is "manage your expectations". ;-)
-Leigh