Wednesday, January 23, 2008

MS Word metadata with POI and ColdFusion

I have been playing around with POI's HWPF library (Horrible Word Processor Format) and found it provides an easy way to extract metadata from an MS Word file. Now CF and Java gurus were probably aware of this already ;) but it was a cool find for me.

I used JavaLoader.cfc and POI 3.0.1 from poi.apache.org. HWPF returns the key summary information: everything from the subject and comments to the number of words in the document.



Here is the sample code I used. If you already have the right version of POI on your classpath, you can simply replace the javaLoader.create(..) statement with a call to the createObject(..) function.

Time to see what else POI can do ;)

Code


<!--- NOTE I am storing my javaLoader in the server scope --->
<!--- read why here; http://www.compoundtheory.com/?action=displayPost&ID=212 --->
<cfset javaLoader = server[MyUniqueKeyForJavaLoader]>

<!--- open a word document with POI and get the summary information --->
<cfset inputFilePath = ExpandPath('fromMSWord.doc')>
<cfset inputStream = createObject("java", "java.io.FileInputStream").init( inputFilePath )>
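<cfset document = javaLoader.create("org.apache.poi.hwpf.HWPFDocument").init( inputStream )>
<!--- the constructor reads the whole file, so the stream can be closed here --->
<cfset inputStream.close()>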
<cfset summary = document.getSummaryInformation()>

<b>HWPF Summary Information:</b><br>
<cfoutput>
getSubject = #summary.getSubject()# <br>
getTemplate = #summary.getTemplate()# <br>
getAuthor = #summary.getAuthor()# <br>
getTitle = #summary.getTitle()# <br>
getSecurity = #summary.getSecurity()# <br>
getApplicationName = #summary.getApplicationName()# <br>
getKeywords = #summary.getKeywords()# <br>
getComments = #summary.getComments()# <br>
getLastAuthor = #summary.getLastAuthor()# <br>
getRevNumber = #summary.getRevNumber()# <br>
getEditTime = #summary.getEditTime()# <br>
getLastPrinted = #summary.getLastPrinted()# <br>
getCreateDateTime = #summary.getCreateDateTime()# <br>
getLastSaveDateTime = #summary.getLastSaveDateTime()# <br>
getPageCount = #summary.getPageCount()# <br>
getWordCount = #summary.getWordCount()# <br>
getCharCount = #summary.getCharCount()# <br>
</cfoutput>
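
For a server where POI is already on the ColdFusion classpath, the same lookup can be sketched without JavaLoader. This is only a minimal variant under that assumption: it reuses the fromMSWord.doc file name from the sample, prints just two of the properties, and the explicit close() call is an addition for tidiness rather than something POI requires.

```cfm
<!--- ASSUMPTION: the POI 3.0.1 jars are already visible to createObject --->
<cfset inputFilePath = ExpandPath('fromMSWord.doc')>
<cfset inputStream = createObject("java", "java.io.FileInputStream").init( inputFilePath )>
<cfset document = createObject("java", "org.apache.poi.hwpf.HWPFDocument").init( inputStream )>
<!--- the constructor reads the whole file, so the stream is no longer needed --->
<cfset inputStream.close()>
<cfset summary = document.getSummaryInformation()>

<cfoutput>
getTitle = #summary.getTitle()# <br>
getAuthor = #summary.getAuthor()# <br>
</cfoutput>
```

If a different POI version is also on the classpath, createObject may pick up the wrong classes, which is the main reason to stay with JavaLoader.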

2 comments:

Tad May 18, 2010 at 11:46 AM  

Do you have similar code that you've written to deal with DOCX files? Do you recommend using the XWPF methods directly, or using Extractor?

Trying to pull a bunch of docx summary info out of a batch of files, and struggling with such.

cfSearching May 18, 2010 at 3:16 PM  

@Tad,

Yes, you can use the ExtractorFactory. But there are different properties for doc and docx files.

Rather than duplicating the code, see my response to your comment here:

http://www.coldfusionjedi.com/index.cfm/2009/2/6/Working-with-Office-Metadata#cC6AF9B2C-A112-9E47-F2766C09664266C1


HTH
-Leigh
