Tuesday, June 30, 2009

ColdFusion: Experiment Converting MS Word to HTML/PDF (At Last) - Part 2

In Part 1 I mentioned two interesting projects for converting docx files. The first is the OpenXML Document Viewer project, which can be used to convert docx files to html. The server version is a small executable that can be used with cfexecute. The program accepts three arguments:

  • source_file: absolute path to the docx file to convert
  • dest_path: folder to place the converted files (html and images)
  • browser_type: used to generate a browser specific html file (IE, Firefox or Opera)

There is not much more to using the program than that. The installation is equally simple.

1. Download the command line version for your operating system.
Example: OpenXMLViewer_Win_Cmd.zip

2. Unzip the files and copy the entire OpenXMLViewer subfolder to the desired location.
Example: I copied the subfolder to: c:\tools\OpenXMLViewer

3. If you are on Windows, you must add the directory from step #2 to your PATH variable (or simply call the program from a .bat file instead). Due to some issues with the PATH value, I ultimately ended up using a .bat file.

To use the converter, just initialize a few path variables then run the program with cfexecute.


<!--- Initialize file paths --->
<cfset inputFilePath = ExpandPath("Introduction to Microsoft .NET Services.docx")>
<cfset outputFolder = ExpandPath("test")>
<cfset pathToProgram = "C:\tools\OpenXMLViewer\OpenXMLViewer.exe">
<cfset browserType = "FIREFOX">

<!---
    OpenXMLViewer does not seem to create the output
    folder if it does not exist. So ensure it exists
    before doing the conversion.
--->
<cfif NOT DirectoryExists(outputFolder)>
    <cfdirectory action="create" directory="#outputFolder#">
</cfif>

<!--- Do the conversion --->
<cfexecute name="c:\windows\system32\cmd.exe"
    arguments='/c #pathToProgram# "#inputFilePath#" "#outputFolder#" #browserType# 2>&1'
    variable="result" />

<!--- Display the generated html file --->
<cfdirectory action="list" directory="#outputFolder#" name="getHTMLFiles" filter="*html*">

<h3>Generated HTML</h3>
<cfoutput query="getHTMLFiles">
    <a href="#ListLast(directory, '\/')#/#Name#">Display HTML (#Name#)</a>
</cfoutput>

<cfdump var="#result#" label="Results from Cfexecute">

I was pretty pleased with the results and was actually able to feed the html content into cfdocument to produce a reasonable facsimile in pdf format, though that did take a while. I also noticed a few issues with some of those funky MS Word characters I know and love. I am still looking into how to fix that.

<!---
    Convert to PDF:
    because the .cfm script is not in the same directory
    as the html file, first correct the relative image
    paths for cfdocument
--->
<cfset pathToFile = ExpandPath("test/Introduction to Microsoft .NET Services.xhtml")>
<cfset content = Replace(FileRead(pathToFile), "word//media/", "test/word/media/", "all")>

<cfdocument format="pdf" filename="#ExpandPath('./convertedDocument.pdf')#" overwrite="true">
    <cfoutput>#content#</cfoutput>
</cfdocument>

But overall the results were very good.

PDF - With Funky Characters

Final Notes/Quirks:
  1. OpenXMLViewer generates one type of file for Internet Explorer and another for Firefox and Opera. I believe the primary reason is that Word documents can contain vml. Internet Explorer is capable of displaying vml, but Firefox and Opera are not. So if you select one of the latter browser types, the vml is converted to svg.

  2. From what I can tell, OpenXMLViewer does not allow you to specify the name of the output file or the path to image directories. So unfortunately that means you must output the generated files to separate directories to avoid naming conflicts.

  3. OpenXMLViewer does not seem to create the output folder if it does not exist. So you must ensure it exists before calling the program.
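
Given notes 2 and 3, one way to keep conversions from colliding is to give each document its own output folder and create it up front. This is only a rough sketch, not something built into OpenXMLViewer: the "docs" and "converted" folder names are made up for illustration, and the timeout value is an assumption you should adjust.

```cfml
<!--- Hypothetical folder names; adjust to your own layout --->
<cfset pathToProgram = "C:\tools\OpenXMLViewer\OpenXMLViewer.exe">
<cfset sourceFolder  = ExpandPath("docs")>
<cfset outputRoot    = ExpandPath("converted")>

<!--- make sure the parent output folder exists --->
<cfif NOT DirectoryExists(outputRoot)>
    <cfdirectory action="create" directory="#outputRoot#">
</cfif>

<cfdirectory action="list" directory="#sourceFolder#" name="getDocs" filter="*.docx">

<cfloop query="getDocs">
    <!--- one output folder per document, named after the file (minus extension) --->
    <cfset baseName     = ListDeleteAt(getDocs.name, ListLen(getDocs.name, "."), ".")>
    <cfset outputFolder = outputRoot & "\" & baseName>
    <cfif NOT DirectoryExists(outputFolder)>
        <cfdirectory action="create" directory="#outputFolder#">
    </cfif>

    <!--- timeout is an assumed value; it makes cfexecute wait for the program to finish --->
    <cfexecute name="c:\windows\system32\cmd.exe"
        arguments='/c #pathToProgram# "#getDocs.directory#\#getDocs.name#" "#outputFolder#" FIREFOX 2>&1'
        timeout="120"
        variable="result" />
</cfloop>
```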

More about OpenXMLViewer and docx4j in Part 3.


ColdFusion: Experiment Converting MS Word to HTML/PDF (At Last) - Part 1

So a while back I was looking into open source tools that could be used to convert RTF or MS Word files to HTML and/or PDF format. Eventually, I stumbled across two projects with some good potential, at least for the newer .docx format.

I decided to focus on the newer format only, and not the binary .doc format, primarily because there was better support for ooxml. Plus, there are several tools available for converting binary documents to ooxml format. So the newer format seemed the better option. I am still in the experimental stages, but so far the results have been pretty good.

OpenXMLViewer Project
License: Microsoft Public License (Ms-PL) and Related licenses
Base Language: C++
The OpenXMLViewer project focuses on "showing how documents created using Open XML Format can be translated to HTML". The OpenXML Document Viewer project provides both a browser plugin and a command line program for server side conversion of .docx files to HTML.

docx4j Project
License: Apache License (v2)
Base Language: Java
The docx4j project has a broader scope. It is "an open source Java library for manipulating OpenXML WordprocessingML documents", including conversion to HTML and PDF formats.

Six Degrees of Separation
Though I came across the two projects through separate routes, surprisingly they have something in common. One of the tools utilized by the OpenXMLViewer is xslt. The OpenXMLViewer program uses a file named DocX2Html.xslt to perform a large chunk of the conversion from ooxml to html. Now the docx4j project also contains a docx to html converter. Technically it has two different implementations. The later one is based on the DocX2Html.xslt file from the OpenXMLViewer project. So even though the two projects are based in different languages, it turns out they were not as far apart as I thought, at least in one respect.

A Thin Line ..
Like anyone who is not an xslt guru, I have a stormy relationship with it. On the one hand, trying to debug the type of transformations needed to process something as complex as the Microsoft Office File Format is enough to make my head implode. But it is also amazing just how much you can do with it.

For example, you can get a rough picture of how much of the conversion is achieved with xslt by downloading the DocX2Html.xslt file and running a simple transform. Just use cfzip to extract the main document content (word/document.xml), then run it through XmlTransform.

<!--- initialize file paths --->
<cfset pathToXSLT = ExpandPath("DocX2Html_V2.xslt")>
<cfset inputPath  = "c:\test\docs\Introduction to Microsoft .NET Services.docx">
<cfset outputPath = ExpandPath("DocXToHTML-RoughVersion.html")>

<!--- read the document.xml file into a variable --->
<cfzip action="read" file="#inputPath#" entrypath="word/document.xml" variable="docXML">

<!--- transform the content to html and save to disk --->
<cfset htmlDoc = XmlTransform(docXML, pathToXSLT)>
<cfset FileWrite(outputPath, htmlDoc)>

<!--- Display raw results --->
<a href="#outputPath#">Display as HTML</a><br><br>
Generated HTML:
You can see from the results that it handles most of the formatting. What it does not handle are things like list numbering, links, images, etcetera (as well as those crazy MS quote characters we all know and love). Those are handled by other areas of the OpenXMLViewer program and the docx4j jar. Now obviously more than a simple transform is needed to fully convert documents. But it does give you a glimpse of what the final output might look like.

Source file 1: Copy of Pete Freitag's Cfscript Cheat Sheet found on Google.

Continued in Part 2

Update: I finally had to give up on using docx4j with Adobe ColdFusion. It is nothing against docx4j. But there were just too many "jar hell" type conflicts with CF's own internal jars.


ColdFusion: X-Files, Trust No One (Mime Type Security Issues)

If you do any file uploads on your site, a recent entry on Raymond Camden's blog is a must read on mime type security holes. The issue has been around forever, but people are often unaware of it. When it comes to uploading, remember the X-Files: Trust No One ;)



Monday, June 29, 2009

That which we call a structure, by any other name does not smell as sweet

I came across an interesting question on the forums today about a quirk of the FORM scope. On the off chance you have not already encountered this one, the poster discovered that StructCount() does not return the correct value after keys are deleted using StructDelete.

To demonstrate, create a simple form with one or more fields. When the form is submitted, delete the keys one by one and check the StructCount() value on each iteration. You will notice the resulting counts are incorrect. The size of the FORM structure always remains the same, even after keys are deleted.

<cfif structKeyExists(FORM, "testThis")>
    <cfloop list="#form.fieldNames#" index="key">
        <cfset StructDelete(FORM, key)>
        <cfoutput>Deleted key #key#. New StructCount(FORM) = #StructCount(FORM)#<br></cfoutput>
    </cfloop>
</cfif>

<h3>Test Form</h3>
<cfoutput>
<form action="#CGI.SCRIPT_NAME#" method="post">
    <cfloop from="1" to="10" index="x">
        <input type="text" name="field#x#" value="#x#">
    </cfloop>
    <input type="submit" name="testThis">
</form>
</cfoutput>

Interestingly, if you copy the form structure with duplicate(), then run the same code on the copied object, the results are correct. The reason is duplicate() returns a slightly different type of object. If you check the underlying class names, you will see that duplicate returns a coldfusion.runtime.Struct object. Whereas form is actually a coldfusion.filter.FormScope object.

<cfset copy = duplicate(FORM)>
<cfoutput>
FORM object type = #form.getClass().name#<br>
COPY object type = #copy.getClass().name#<br><br>
</cfoutput>

Out of curiosity, I used several different methods to verify the structure counts (some less than efficient). Since ColdFusion structures are java.util.Map objects, I also used the undocumented size() method to confirm the results. It too returned the wrong value. Though I do not know how StructCount() works internally, I suspect it uses size(), which would explain the StructCount() bug.

It turns out the same problem applies to the URL scope, at least with ColdFusion 8. So while we tend to think of URL and FORM as indistinguishable from "regular" CF structures .. they are not. In this case, neither one came out smelling like a rose ;)
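
To check the URL scope the same way, a small sketch along these lines can be used. It assumes the page is requested with a few query string parameters (for example ?a=1&b=2&c=3, names chosen purely for illustration) and prints StructCount() next to the number of keys still listed after each delete, so the two can be compared.

```cfml
<!--- copy the key list first, so the loop is not affected by the deletes --->
<cfset keyList = StructKeyList(URL)>

<cfoutput>
<cfloop list="#keyList#" index="key">
    <cfset StructDelete(URL, key)>
    Deleted key #key#.
    StructCount(URL) = #StructCount(URL)#,
    keys remaining = #ListLen(StructKeyList(URL))#<br>
</cfloop>
</cfoutput>
```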

Complete Test Code
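<cfif structKeyExists(FORM, "testThis")>
    <!--- create a deep copy of the FORM scope --->
    <cfset copy = duplicate(FORM)>

    <cfoutput>
    <!--- Display the object types --->
    <h3>Object Types:</h3>
    FORM object type = #form.getClass().name#<br>
    COPY object type = #copy.getClass().name#<br><br>

    <table>
    <tr><th rowspan="2">Action</th>
        <th colspan="4" class="one">FORM Object</th>
        <th colspan="4" class="two">Copy</th>
    </tr>
    <tr>
        <td class="one">StructCount</td>
        <td class="one">Size()</td>
        <td class="one">ArrayLen()</td>
        <td class="one">ListLen()</td>
        <td class="two">StructCount</td>
        <td class="two">Size()</td>
        <td class="two">ArrayLen()</td>
        <td class="two">ListLen()</td>
    </tr>
    <!--- Display the counts as each key is deleted --->
    <cfloop list="#form.fieldNames#" index="key">
        <cfset StructDelete(COPY, key)>
        <cfset StructDelete(FORM, key)>
        <tr>
            <td>Deleted key #key#</td>
            <!--- the key array / key list expressions are assumed; the original cells were lost --->
            <td class="one">#StructCount(FORM)#</td>
            <td class="one">#FORM.size()#</td>
            <td class="one">#ArrayLen(StructKeyArray(FORM))#</td>
            <td class="one">#ListLen(StructKeyList(FORM))#</td>
            <td class="two">#StructCount(COPY)#</td>
            <td class="two">#COPY.size()#</td>
            <td class="two">#ArrayLen(StructKeyArray(COPY))#</td>
            <td class="two">#ListLen(StructKeyList(COPY))#</td>
        </tr>
    </cfloop>
    </table>
    </cfoutput>
</cfif>

<h3>Test Form</h3>
<cfoutput>
<form action="#CGI.SCRIPT_NAME#" method="post">
    <cfloop from="1" to="10" index="x">
        <input type="text" name="field#x#" value="#x#">
    </cfloop>
    <input type="submit" name="testThis">
</form>
</cfoutput>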


Saturday, June 27, 2009

CFPDF: Problems with addWatermark foreground="false"

Another interesting issue with cfpdf and watermarks came up on the Adobe forums last week. A poster mentioned having problems using cfpdf to apply a watermark to the background of a pdf. Whenever they tried using foreground="false", a white rectangle always obscured the watermark.

<cfpdf action="addwatermark"

I ran a few tests and surprisingly my attempts to apply the background watermark using ddx and iText both failed. But they did reveal something strange: the problem only seems to apply to pdfs created with cfdocument. The same code worked with similar files created by Acrobat. So it definitely seems to be an issue with cfdocument.

However, a post on houseoffusion.com, by Randi Knutson, mentions a work-around using css. He was able to apply a background watermark using the css background-image property. So at least there is one way around this particular issue. For those that like one-stop-shopping, here is a quick example using Randi's code:

<cfdocument format="pdf" filename="simulateForegroundEqualsFalse.pdf" overwrite="true">
<style type="text/css">
    body { background-image: url(/images/myWatermarkLetterSize.gif); }
</style>
<cfloop from="1" to="30" index="r">
    <p>The only way to comprehend what mathematicians mean by Infinity is to contemplate the extent of human stupidity.</p>
</cfloop>
</cfdocument>

Update: A helpful Adobe rep. pointed out a simpler fix that works with the CF9 Beta. When creating the pdf with cfdocument, simply save the results to a variable. Then use the variable as the pdf "source" instead of a file path.
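
A rough sketch of that approach is shown here. The variable name pdfContent and the image and output file names are placeholders of my own, not taken from the Adobe rep's suggestion.

```cfml
<!--- save the generated pdf to a variable instead of a file --->
<cfdocument format="pdf" name="pdfContent">
    <p>Sample content</p>
</cfdocument>

<!--- use the variable (not a file path) as the source of the watermark operation --->
<cfpdf action="addwatermark"
    source="pdfContent"
    image="#ExpandPath('myWatermark.gif')#"
    foreground="false"
    destination="#ExpandPath('watermarked.pdf')#"
    overwrite="true">
```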


CFPDF - Issues When Using Transparent Images as a Watermark

I saw an interesting question on the Adobe forums yesterday, about problems with watermarks and cfpdf. The issue involved using transparent pngs or gifs as a watermark. The transparent parts of the image seem to be rendered as white, instead of maintaining their transparency.

As I was curious, I tried a number of different things but nothing seemed to work except a bit of iText magic. The work-around comes from an adaptation of two great iText examples. The code is very simple. It uses PdfGState to set the watermark to 50% opacity, but you can change that (and other properties like blendMode) as well.

If anyone knows a way around this issue (using cfpdf or ddx), I would love to hear it.

Update July 13, 2009: This issue appears to be fixed in CF9 beta.

iText Example Java Source:

<!--- Add a centered watermark with 50% opacity --->
<cfscript>
savedErrorMessage = "";

fullPathToInputFile = ExpandPath("mySourceFile.pdf");
fullPathToWatermark = ExpandPath("myTransparentImage.png");
fullPathToOutputFile = ExpandPath("mySourceFile_Watermarked.pdf");

try {
    // create PdfReader instance to read in source pdf
    pdfReader = createObject("java", "com.lowagie.text.pdf.PdfReader").init(fullPathToInputFile);
    totalPages = pdfReader.getNumberOfPages();

    // create PdfStamper instance to create new watermarked file
    outStream = createObject("java", "java.io.FileOutputStream").init(fullPathToOutputFile);
    pdfStamper = createObject("java", "com.lowagie.text.pdf.PdfStamper").init(pdfReader, outStream);

    // Read in the watermark image
    img = createObject("java", "com.lowagie.text.Image").getInstance(fullPathToWatermark);

    // Use PdfGState to change fill, blendMode, etcetera as needed
    gState = createObject("java", "com.lowagie.text.pdf.PdfGState").init();
    gState.setFillOpacity( javacast("float", 0.5) );

    // adding content to each page
    p = 0;
    while (p LT totalPages) {
        p = p + 1;
        // Prepare to place image on OVERcontent
        content = pdfStamper.getOverContent( javacast("int", p) );
        // Only needed if you are changing the opacity, blending, etcetera ..
        content.setGState(gState);

        // Center the watermark. Note - using deprecated methods for CF8/iText 1.4 compatibility
        rectangle = pdfStamper.getReader().getPageSizeWithRotation( javacast("int", p) );
        x = rectangle.left() + (rectangle.width() - img.plainWidth()) / 2;
        y = rectangle.bottom() + (rectangle.height() - img.plainHeight()) / 2;
        img.setAbsolutePosition(x, y);
        content.addImage(img);

        WriteOutput("Watermarked page "& p &"<hr>");
    }
}
catch (java.lang.Exception e) {
    savedErrorMessage = e;
}
// closing PdfStamper will generate the new PDF file
if (IsDefined("pdfStamper")) {
    pdfStamper.close();
}
if (IsDefined("outStream")) {
    outStream.close();
}
</cfscript>

<!--- show any errors (a caught exception is not a simple value, so check that before len) --->
<cfif NOT IsSimpleValue(savedErrorMessage) OR len(savedErrorMessage) gt 0>
    ERROR - Unable to create document
    <cfdump var="#savedErrorMessage#">
</cfif>


Monday, June 1, 2009

MS SQL / ALTER TABLE: Add a new column with default values

While not an everyday occurrence, I often need to add columns to an existing sql server table. But it is a bit of a pain if the new column should not allow nulls. Fortunately, sql server provides the WITH VALUES clause for just such a situation. Because I can never remember the exact syntax (when I most need it), here is an example to jog my memory next time ;)

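--- Change "BIT" to whatever data type is needed
--- and "0" to whatever default value is desired
--- (MyTable and MyColumn are placeholder names)
ALTER TABLE MyTable ADD MyColumn BIT NOT NULL
CONSTRAINT MyConstraintName
DEFAULT 0 WITH VALUES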

