Tuesday, June 30, 2009

ColdFusion: Experiment Converting MS Word to HTML/PDF (At Last) - Part 2

In Part 1 I mentioned two interesting projects for converting docx files. The first one is the OpenXML Document Viewer Project, which can be used to convert docx files to html. The server version is a small executable that can be used with cfexecute. The program accepts three arguments:

  • source_file absolute path to the docx file to convert
  • dest_path folder to place the converted files (html and images)
  • browser_type used to generate a browser specific html file (IE, Firefox or Opera)

There is not much more to using the program than that. The installation is equally simple.

Installation:
1. Download the command line version for your operating system.
Example: OpenXMLViewer_Win_Cmd.zip

2. Unzip the files and copy the entire OpenXMLViewer subfolder to the desired location.
Example: I copied the subfolder to: c:\tools\OpenXMLViewer

3. If you are on Windows, you must add the directory from step #2 to your PATH variable (or simply call the program from a .bat file instead). Due to some issues with the PATH value, I ultimately ended up using a .bat file.


Usage:
To use the converter, just initialize a few path variables then run the program with cfexecute.

Code:

<!---
Initialize file paths
--->
<cfset inputFilePath = ExpandPath("Introduction to Microsoft .NET Services.docx")>
<cfset outputFolder = ExpandPath("test")>
<cfset pathToProgram = "C:\tools\OpenXMLViewer\OpenXMLViewer.exe">
<cfset browserType = "FIREFOX">

<!---
OpenXMLViewer does not seem to create the output
folder if it does not exist. So ensure it exists
before doing the conversion.
--->
<cfif NOT DirectoryExists(outputFolder)>
<cfdirectory action="create" directory="#outputFolder#">
</cfif>

<!---
Do the conversion
arguments='/c "#pathToProgram#" "#inputFilePath#" "#outputFolder#" #browserType# 2>&1'
--->
<cfexecute name="c:\windows\system32\cmd.exe"
arguments='/c #pathToProgram# "#inputFilePath#" "#outputFolder#" #browserType# 2>&1'
timeout="120"
variable="result" />

<!---
Display the generated html file
--->
<cfdirectory action="list" directory="#outputFolder#" name="getHTMLFiles" filter="*html*">

<h3>Generated HTML</h3>
<cfoutput query="getHTMLFiles">
<a href="#ListLast(directory, '\/')#/#Name#">Display HTML (#Name#)</a>
</cfoutput>

<cfdump var="#result#" label="Results from Cfexecute">


I was pretty pleased with the results and was actually able to feed the html content into cfdocument to produce a reasonable facsimile in pdf format, though that did take a while. I also noticed a few issues with some of those funky MS Word characters I know and love. I am still looking into how to fix that.

Convert to PDF:
<!---
because the .cfm script is not in the same directory
as the html file, first correct the relative image
paths for cfdocument
--->
<cfset pathToFile = ExpandPath("test/Introduction to Microsoft .NET Services.xhtml")>
<cfset content = Replace(FileRead(PathToFile), "word//media/", "test/word/media/", "all")>

<cfdocument format="pdf" filename="#ExpandPath('./convertedDocument.pdf')#" overwrite="true">
<cfoutput>#content#</cfoutput>
</cfdocument>


But overall the results were very good.



PDF - With Funky Characters
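
One possible way to tame those characters is to swap them for plain ASCII equivalents before the content reaches cfdocument. This is only a sketch under two assumptions: that the troublesome characters are the usual Word "smart" quotes, dashes and ellipsis, and that the generated xhtml file is UTF-8 encoded. The replaceSmartChars function is a hypothetical helper, not something provided by OpenXMLViewer or ColdFusion.

```cfm
<cfscript>
    // Hypothetical helper (an assumption, not part of the original experiment):
    // replaces common Word "smart" punctuation with plain ASCII equivalents.
    function replaceSmartChars(text) {
        var cleaned = arguments.text;
        cleaned = Replace(cleaned, Chr(8216), "'", "all");   // left single quote
        cleaned = Replace(cleaned, Chr(8217), "'", "all");   // right single quote
        cleaned = Replace(cleaned, Chr(8220), '"', "all");   // left double quote
        cleaned = Replace(cleaned, Chr(8221), '"', "all");   // right double quote
        cleaned = Replace(cleaned, Chr(8211), "-", "all");   // en dash
        cleaned = Replace(cleaned, Chr(8212), "--", "all");  // em dash
        cleaned = Replace(cleaned, Chr(8230), "...", "all"); // ellipsis
        return cleaned;
    }
</cfscript>

<!--- read the generated file as UTF-8 (an assumption) and clean it --->
<cfset pathToFile = ExpandPath("test/Introduction to Microsoft .NET Services.xhtml")>
<cfset content = replaceSmartChars(FileRead(pathToFile, "utf-8"))>
```

If the characters are mangled because the file is read with the wrong charset, the explicit "utf-8" argument to FileRead may be all that is needed, and the replacements can be dropped.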


Final Notes/Quirks:
  1. OpenXMLViewer generates one type of file for Internet Explorer and another for Firefox and Opera. I believe the primary reason is that Word documents can contain vml. Internet Explorer is capable of displaying vml, but Firefox/Opera are not. So if you select one of the latter browser types, the vml is converted to svg.


  2. From what I can tell, OpenXMLViewer does not allow you to specify the name of the output file or the path to image directories. So unfortunately that means you must output the generated files to separate directories to avoid naming conflicts.

  3. OpenXMLViewer does not seem to create the output folder if it does not exist. So you must ensure it exists before calling the program.
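
Quirks 2 and 3 can be handled together by giving each conversion its own freshly created folder. A minimal sketch, assuming a UUID is an acceptable folder name; the "myDocument.docx" file name and the "converted_" prefix are placeholders, not anything the program requires:

```cfm
<!--- hypothetical input file; substitute your own --->
<cfset inputFilePath = ExpandPath("myDocument.docx")>

<!--- one folder per conversion, so generated files never collide --->
<cfset outputFolder = ExpandPath("converted_" & CreateUUID())>

<!--- the converter will not create the folder, so do it up front --->
<cfif NOT DirectoryExists(outputFolder)>
    <cfdirectory action="create" directory="#outputFolder#">
</cfif>

<cfoutput>Output folder: #HTMLEditFormat(outputFolder)#</cfoutput>
```

The resulting outputFolder value can then be passed to cfexecute as the dest_path argument, exactly as in the usage example.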



More about OpenXMLViewer and docx4j in Part 3.


ColdFusion: Experiment Converting MS Word to HTML/PDF (At Last) - Part 1

So a while back I was looking into open source tools that could be used to convert RTF or MS Word files to HTML and/or PDF format. Eventually, I stumbled across two projects with some good potential, at least for the newer .docx format.

I decided to focus on the newer format only, and not the binary .doc format, primarily because there was better support for ooxml. Plus, there are several tools available for converting binary documents to ooxml format. So the newer format seemed the better option. I am still in the experimental stages, but so far the results have been pretty good.

<!--- initialize file paths --->
<cfset pathToXSLT = ExpandPath("DocX2Html_V2.xslt")>
<cfset inputPath  = "c:\test\docs\Introduction to Microsoft .NET Services.docx">
<cfset outputPath = ExpandPath("DocXToHTML-RoughVersion.html")>

<!--- read the document.xml file into a variable --->
<cfzip action = "read"
entrypath="word\document.xml"
file="#inputPath#"
variable="docXML">

<!--- transform the content to html and save to disk --->
<cfset htmlDoc = XmlTransform(docXML, pathToXSLT)>
<cfset FileWrite(outputPath, htmlDoc)>

<!--- Display raw results --->
<cfoutput>
<a href="#outputPath#">Display as HTML</a><br><br>
Generated HTML:
<pre>
#HTMLEditFormat(htmlDoc)#
</pre>
</cfoutput>
You can see from the results, it handles most of the formatting. What it does not handle are things like list numbering, links, images, etcetera. Those are handled by other areas of the OpenXMLViewer program and docx4J jar. (As well as those crazy MS quote characters we all know and love.) Now obviously more than a simple transform is needed to fully convert documents. But it does give you a glimpse of what the final output might look like.
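
Since a .docx file is just a zip archive, it can help to look at what else is inside besides word/document.xml. The sketch below lists the archive entries with cfzip. In a typical document, parts such as word/numbering.xml, word/_rels/document.xml.rels and the word/media folder hold the list numbering, link targets and images that a transform of document.xml alone cannot see. The file path is a placeholder.

```cfm
<!--- hypothetical path; substitute any .docx file --->
<cfset inputPath = "c:\test\docs\myDocument.docx">

<!--- list every entry stored inside the archive --->
<cfzip action="list" file="#inputPath#" name="docxEntries">

<cfoutput query="docxEntries">
    #HTMLEditFormat(docxEntries.name)# (#docxEntries.size# bytes)<br>
</cfoutput>
```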


Source file 1: Copy of Pete Freitag's Cfscript Cheat Sheet found on google.








Continued in Part 2

Update: I finally had to give up on using docx4j with Adobe ColdFusion. It is nothing against docx4j. But there were just too many "jar hell" type conflicts with CF's own internal jars.


ColdFusion: X-Files, Trust No One (Mime Type Security Issues)

If you do any file uploads on your site, a recent entry on Raymond Camden's blog is a must read on mime type security holes. The issue has been around forever, but people are often unaware of it. When it comes to uploading, remember the X-Files: Trust No One ;)

http://www.coldfusionjedi.com/index.cfm/2009/6/30/Are-you-aware-of-the-MIMEFile-Upload-Security-Issue

Monday, June 29, 2009

That which we call a structure, by any other name does not smell as sweet

I came across an interesting question on the forums today about a quirk of the FORM scope. On the off chance you have not already encountered this one, the poster discovered that StructCount() does not return the correct value after keys are deleted using StructDelete.


<cfif structKeyExists(FORM, "testThis")>
<cfoutput>
<cfloop list="#form.fieldNames#" index="key">
<cfset StructDelete(FORM, key)>
Deleted key #key#. New StructCount(FORM) = #StructCount(FORM)#<br>
</cfloop>
</cfoutput>
</cfif>

<cfoutput>
<h3>Test Form</h3>
<form action="#CGI.SCRIPT_NAME#" method="post">
<cfloop from="1" to="10" index="x">
<input type="text" name="field#x#" value="#x#">
</cfloop>
<input type="submit" name="testThis">
</form>
</cfoutput>
Interestingly, if you copy the form structure with duplicate(), then run the same code on the copied object, the results are correct. The reason is that duplicate() returns a slightly different type of object. If you check the underlying class names, you will see that duplicate() returns a coldfusion.runtime.Struct object, whereas FORM is actually a coldfusion.filter.FormScope object.

<cfset copy = duplicate(FORM)>
<cfoutput>
FORM object type = #form.getClass().name#<br>
COPY object type = #copy.getClass().name#<br><br>
</cfoutput>

Out of curiosity, I used several different methods to verify the structure counts (some less than efficient). Since ColdFusion structures are java.util.Map objects, I also used the undocumented size() method to confirm the results. It too returned the wrong value. Though I do not know how StructCount() works internally, I suspect it uses size(). Which would explain the StructCount() bug.



It turns out the same problem applies to the URL scope. At least with ColdFusion 8. So while we tend to think of URL and FORM as indistinguishable from "regular" CF structures .. they are not. In this case, neither one came out smelling like a rose ;)
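
Given the behavior described above, one possible workaround is to do the deleting and counting on a duplicate() of the scope rather than on FORM itself. This is only a sketch: the formCopy variable name is arbitrary, and it assumes the code runs on a form post, so that form.fieldNames exists.

```cfm
<cfif structKeyExists(FORM, "fieldNames")>
    <!--- work on a copy, which is a regular coldfusion.runtime.Struct --->
    <cfset formCopy = duplicate(FORM)>

    <cfloop list="#form.fieldNames#" index="key">
        <cfset StructDelete(formCopy, key)>
    </cfloop>

    <cfoutput>
        Keys remaining in the copy = #StructCount(formCopy)#<br>
    </cfoutput>
</cfif>
```

The original FORM scope is left untouched, so anything later in the request that expects the submitted fields will still find them.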

Complete Test Code
<cfif structKeyExists(FORM, "testThis")> 
<!--- create a deep copy of the FORM scope --->
<cfset copy = duplicate(FORM)>

<cfoutput>
<!--- Display the object types --->
<h3>Object Types:</h3>
FORM object type = #form.getClass().name#<br>
COPY object type = #copy.getClass().name#<br><br>

<table>
<tr><th rowspan="2">Action</th>
<th colspan="4" class="one">FORM Object</th>
<th colspan="4" class="two">Copy</th>
</tr>
<tr>
<td class="one">StructCount</td>
<td class="one">Size()</td>
<td class="one">ArrayLen()</td>
<td class="one">ListLen()</td>
<td class="two">StructCount</td>
<td class="two">Size()</td>
<td class="two">ArrayLen()</td>
<td class="two">ListLen()</td>
</tr>
<!--- Display the counts as each key is deleted --->
<cfloop list="#form.fieldNames#" index="key">
<cfset StructDelete(COPY, key)>
<cfset StructDelete(FORM, key)>
<tr>
<td>Deleted key #key#</td>
<td>#StructCount(FORM)#</td>
<td>#FORM.size()#</td>
<td>#ArrayLen(StructKeyArray(FORM))#</td>
<td>#ListLen(StructKeyList(FORM))#</td>
<td>#StructCount(COPY)#</td>
<td>#COPY.size()#</td>
<td>#ArrayLen(StructKeyArray(COPY))#</td>
<td>#ListLen(StructKeyList(COPY))#</td>
</tr>
</cfloop>
</table>
</cfoutput>
</cfif>


<cfoutput>
<h3>Test Form</h3>
<form action="#CGI.SCRIPT_NAME#" method="post">
<cfloop from="1" to="10" index="x">
<input type="text" name="field#x#" value="#x#">
</cfloop>
<input type="submit" name="testThis">
</form>
</cfoutput>


Saturday, June 27, 2009

CFPDF: Problems with addWatermark foreground="false"

Another interesting issue with cfpdf and watermarks came up on the Adobe forums last week. A poster mentioned having problems using cfpdf to apply a watermark to the background of a pdf. Whenever they tried using foreground="false", a white rectangle always obscured the watermark.


<cfpdf action="addwatermark"
image="myWatermarkImage.gif"
foreground="false"
source="test.pdf"
destination="test_Watermarked.pdf"
overwrite="yes">

I ran a few tests and, surprisingly, my attempts to apply the background watermark using ddx and iText both failed. But they did reveal something strange: the problem only seems to apply to pdf's created with cfdocument. The same code worked with similar files created by Acrobat. So it definitely seems to be an issue with cfdocument.




Update: A helpful Adobe rep. pointed out a simpler fix that works with the CF9 Beta. When creating the pdf with cfdocument, simply save the results to a variable. Then use the variable as the pdf "source" instead of a file path.
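
A rough sketch of what that fix might look like, assuming the CF9 Beta mentioned in the update. The pdfContent variable name and the sample html are placeholders, and the image and destination file names are simply reused from the example above.

```cfm
<!--- generate the pdf into a variable instead of a file --->
<cfdocument format="pdf" name="pdfContent">
    <h1>Sample Document</h1>
    <p>Some content to appear on top of the watermark.</p>
</cfdocument>

<!--- use the variable, not a file path, as the source --->
<cfpdf action="addwatermark"
    image="myWatermarkImage.gif"
    foreground="false"
    source="pdfContent"
    destination="test_Watermarked.pdf"
    overwrite="yes">
```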


CFPDF - Issues When Using Transparent Images as a Watermark

I saw an interesting question on the Adobe forums yesterday, about problems with watermarks and cfpdf. The issue involved using transparent png's or gif's as a watermark. The transparent parts of the image seem to be rendered as white, instead of maintaining their transparency.



<!---
Add a centered watermark with 50% opacity
--->
<cfscript>
savedErrorMessage = "";

fullPathToInputFile = ExpandPath("mySourceFile.pdf");
fullPathToWatermark = ExpandPath("myTransparentImage.png");
fullPathToOutputFile =  ExpandPath("mySourceFile_Watermarked.pdf");

try {
    // create PdfReader instance to read in source pdf
    pdfReader = createObject("java", "com.lowagie.text.pdf.PdfReader").init(fullPathToInputFile);
    totalPages = pdfReader.getNumberOfPages();

    // create PdfStamper instance to create new watermarked file
    outStream = createObject("java", "java.io.FileOutputStream").init(fullPathToOutputFile);
    pdfStamper = createObject("java", "com.lowagie.text.pdf.PdfStamper").init(pdfReader, outStream);

    // Read in the watermark image
    img = createObject("java", "com.lowagie.text.Image").getInstance(fullPathToWatermark);

    // Use PdfGState to change fill,blendMode, etcetera as needed
    gState = createObject("java", "com.lowagie.text.pdf.PdfGState").init();
    gState.setFillOpacity(0.5);

    // adding content to each page
    p = 0;
    while (p LT totalPages) {
        p = p + 1;
        // Prepare to place image on OVERcontent
        content = pdfStamper.getOverContent( javacast("int", p) );
        // Only needed if you are changing the opacity, blending, etcetera ..
        content.setGState(gState);

        // Center the watermark. Note - using deprecated methods for CF8/iText 1.4 compatibility
        rectangle = pdfStamper.getReader().getPageSizeWithRotation( javacast("int", p) );
        x = rectangle.left() + (rectangle.width() - img.plainWidth()) / 2;
        y = rectangle.bottom() + (rectangle.height() - img.plainHeight()) / 2;
        img.setAbsolutePosition(x, y);

        content.addImage(img);
        WriteOutput("Watermarked page "& p &"<hr>");
    }

    WriteOutput("Finished!");
}
catch (java.lang.Exception e) {
    savedErrorMessage = e;
}
// closing PdfStamper will generate the new PDF file
if (IsDefined("pdfStamper")) {
    pdfStamper.close();
}
if (IsDefined("outStream")) {
    outStream.close();
}
</cfscript>

<!--- show any errors --->
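<!--- on failure savedErrorMessage holds the exception object, which len() cannot handle --->
<cfif NOT IsSimpleValue(savedErrorMessage) OR len(savedErrorMessage) gt 0>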
    ERROR - Unable to create document
    <cfdump var="#savedErrorMessage#">
</cfif>



Monday, June 1, 2009

MS SQL / ALTER TABLE: Add a new column with default values

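While not an everyday occurrence, I often need to add columns to an existing sql server table. But it is a bit of a pain if the new column should not allow nulls. Fortunately, sql server lets you supply a DEFAULT constraint, along with the WITH VALUES clause, for just such a situation. (Strictly speaking, WITH VALUES only matters when the new column allows nulls. A NOT NULL column with a default is populated in existing rows either way, so the clause is harmless here.) Because I can never remember the exact syntax (when I most need it), here is an example to jog my memory next time ;)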

  
--- Change "BIT" to whatever data type is needed
--- and "0" to whatever default value is desired
ALTER TABLE MyTable ADD MyNewColumName BIT NOT NULL
CONSTRAINT MyConstraintName
DEFAULT 0 WITH VALUES
