Open XML and Java

Note: this code has been updated for the final Ecma schemas and the RTM version of Office. If you notice any mistakes or run into issues, please report them to julien@chable.net

This post is a summary translation of articles by Julien Chable that are available (in French) on MSDN France.

Retrieve a document’s main part

public final static String NS_CORE_DOCUMENT =
"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

...

final String APP_ROOT = System.getProperty("user.dir") + File.separator;
ZipFile zipFile = null;

try {
    zipFile = new ZipFile(APP_ROOT + "sample.docx");
} catch (IOException e) {
    e.printStackTrace();
}

Package p = Package.open(zipFile, PackageAccess.Read);

// Retrieve the core document relationship by its type
PackageRelationship coreDocRelationship =
    p.getRelationshipsByType(
        PackageRelationshipConstants.NS_CORE_DOCUMENT).getRelationship(0);

// Get the main document part from the relationship
PackagePart coreDocument = p.getPart(coreDocRelationship);
System.out.println(coreDocument.getUri() + " -> "
    + coreDocument.getContentType());

Listing 1

Listing 1 output for several types of documents :

  • Word :
    word/document.xml -> application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
  • Excel :
    xl/workbook.xml -> application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml
  • PowerPoint :
    ppt/presentation.xml -> application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml

Here are the extensions and the URI of the main part for several types of documents :

  • WordProcessingML (.docx) : word/document.xml
  • SpreadsheetML (.xlsx) : xl/workbook.xml
  • PresentationML (.pptx) : ppt/presentation.xml
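The lookup in Listing 1 relies on the package-level relationships part, `_rels/.rels`, which maps relationship types to target parts. As a rough illustration of what happens underneath, here is a minimal sketch that uses only the JDK. The class `RelationshipLookup` and its method are hypothetical names, not part of the framework used in this article, and the sample XML is made up for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of the article's framework).
final class RelationshipLookup {
    static final String NS_RELATIONSHIPS =
        "http://schemas.openxmlformats.org/package/2006/relationships";
    static final String TYPE_OFFICE_DOCUMENT =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    /** Returns the Target of the first relationship of the given type, or null if none. */
    static String findTarget(InputStream relsXml, String type) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(relsXml);
            NodeList rels = doc.getElementsByTagNameNS(NS_RELATIONSHIPS, "Relationship");
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                if (type.equals(rel.getAttribute("Type"))) {
                    return rel.getAttribute("Target");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read relationships", e);
        }
    }

    public static void main(String[] args) {
        // Made-up content standing in for the _rels/.rels entry of a .docx file
        String sample = "<Relationships xmlns=\"" + NS_RELATIONSHIPS + "\">"
            + "<Relationship Id=\"rId1\" Type=\"" + TYPE_OFFICE_DOCUMENT
            + "\" Target=\"word/document.xml\"/>"
            + "</Relationships>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(findTarget(in, TYPE_OFFICE_DOCUMENT));
    }
}
```

With a real file, the input stream could come from `zipFile.getInputStream(zipFile.getEntry("_rels/.rels"))` instead of the in-memory sample.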

How to get a document’s properties

The following sample demonstrates how to get the core properties part of a document:

public final static String NS_CORE_PROPERTIES =
"http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties";
...

Package p = Package.open(zipFile, PackageAccess.Read);

// Get core properties part relationship
PackageRelationship corePropertiesRelationship =
    p.getRelationshipsByType(
        PackageRelationshipConstants.NS_CORE_PROPERTIES).getRelationship(0);

// Get core properties part from the previous relationship
PackagePart corePropertiesPart = p.getPart(corePropertiesRelationship);
System.out.println(corePropertiesPart.getUri() + " -> "
    + corePropertiesPart.getContentType());

Listing 2

The output displays :

docProps/core.xml -> application/vnd.openxmlformats-package.core-properties+xml

With the OpenXMLDocument class, only a few lines are needed to read a document’s properties:

...
OpenXMLDocument docx = new OpenXMLDocument(Package.open(zipFile,
PackageAccess.Read));
System.out.println(docx.getCoreProperties().getCreator());
System.out.println(docx.getCoreProperties().getTitle());
System.out.println(docx.getCoreProperties().getSubject());

Listing 3

The output displays :

Julien CHABLE

Lorem Ipsum

Sample document

How to change a document’s properties

Changing a property is as simple as reading one:

// Destination file
File destFile = new File(APP_ROOT + "sample_out.docx");

// Open the document
Package pack = Package.open(zipFile, PackageAccess.ReadWrite);
OpenXMLDocument docx = new OpenXMLDocument(pack);

CoreProperties coreProps = docx.getCoreProperties();
coreProps.setCreator("OpenXMLDeveloper.org powa");
coreProps.setDescription("A new description");
coreProps.setTitle("SampleListing4");

// Save document
docx.save(destFile);

Listing 4

Extended properties

The small framework that accompanies this article doesn’t provide any class or method to access extended properties, so this sample uses the DOM API to extract information from the extended properties part:

...

// Open the package
Package p = Package.open(..., PackageAccess.Read);

// Get extended properties relationship
PackageRelationship extendedPropertiesRelationship = p
.getRelationshipsByType(
PackageRelationshipConstants.NS_EXTENDED_PROPERTIES)
.getRelationship(0);

// Get extended properties part from the previous relationship
PackagePart extPropsPart = p.getPart(extendedPropertiesRelationship);
System.out.println(extPropsPart.getUri() + " -> "
+ extPropsPart.getContentType());

// Extract content (try-with-resources, Java 7+, closes the stream even on failure)
try (InputStream inStream = extPropsPart.getInputStream()) {
    // Create DOM parser
    DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory
            .newInstance();
    documentBuilderFactory.setNamespaceAware(true);
    documentBuilderFactory.setIgnoringElementContentWhitespace(true);

    DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

    // Parse XML content
    Document extPropsDoc = documentBuilder.parse(inStream);

    // Extract the name and the version of the Open XML file generator
    System.out.println("Document generated with "
            + extPropsDoc.getElementsByTagName("Application").item(0)
                    .getTextContent()
            + " vers. "
            + extPropsDoc.getElementsByTagName("AppVersion").item(0)
                    .getTextContent());

    // Extract statistics about this document
    System.out.println("This document contains "
            + extPropsDoc.getElementsByTagName("Words").item(0)
                    .getTextContent()
            + " words and is composed of "
            + extPropsDoc.getElementsByTagName("Characters").item(0)
                    .getTextContent()
            + " characters and "
            + extPropsDoc.getElementsByTagName("Lines").item(0)
                    .getTextContent() + " lines");
} catch (Exception e) {
    // Report the cause instead of silently discarding it
    System.err.println("Failed to extract extended properties ! :( " + e);
}

Listing 5

Output of Listing 5 :

docProps/app.xml -> application/vnd.openxmlformats-officedocument.extended-properties+xml

Document generated with Microsoft Office Word vers. 12.0000

This document contains 262 words and is composed of 1444 characters and 12 lines
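Listing 5 calls item(0).getTextContent() directly, which throws a NullPointerException when an element is missing from the extended properties part (statistics such as Words may not be present in every document). A minimal sketch of a more defensive lookup, using only the JDK, could look like the following. The class `ExtendedProps` and its methods are hypothetical names introduced for this illustration, and the XML in `main` is a made-up sample.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper (not part of the article's framework).
final class ExtendedProps {
    static final String NS_EXTENDED =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    /** Parses an extended properties XML string with a namespace-aware DOM parser. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse extended properties", e);
        }
    }

    /** Returns the text of the first matching element, or the fallback if it is absent. */
    static String textOrDefault(Document doc, String localName, String fallback) {
        Node node = doc.getElementsByTagNameNS(NS_EXTENDED, localName).item(0);
        return node == null ? fallback : node.getTextContent();
    }

    public static void main(String[] args) {
        Document doc = parse("<Properties xmlns=\"" + NS_EXTENDED + "\">"
            + "<Application>Microsoft Office Word</Application>"
            + "<Words>262</Words></Properties>");
        System.out.println(textOrDefault(doc, "Application", "unknown"));
        // "Lines" is absent from the sample, so the fallback is returned
        System.out.println(textOrDefault(doc, "Lines", "n/a"));
    }
}
```

Matching on the namespace URI and local name, rather than on the raw tag name, also keeps the lookup working if a producer writes the elements with a prefix.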

Thumbnail

Many OpenXML documents, for example those created with PowerPoint 2007, contain a thumbnail of the document. This part is referenced through the following relationship type: http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail.

The following listing uses two methods, getThumbnails() and extractParts(), to extract the thumbnail of the document and put it into the ‘export’ directory:

final String APP_ROOT = System.getProperty("user.dir") + File.separator;
ZipFile zipFile = null; // Source file
try {
zipFile = new ZipFile(APP_ROOT + "sample.pptx");
} catch (IOException e) {
...
}

// Destination folder
File destFile = new File(APP_ROOT + "export");

// Open the package
OpenXMLDocument docx = OpenXMLDocument.open(zipFile, PackageAccess.Read);

// Extract thumbnails
docx.extractParts(docx.getThumbnails(), destFile);

Listing 6

Here are the details of the getThumbnails() and extractParts() methods :

public final static String NS_THUMBNAIL_PART =
"http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail";
...

// Retrieve all thumbnails contained in the document.
public ArrayList<PackagePart> getThumbnails() {
return container.getPartByRelationshipType(
PackageRelationshipConstants.NS_THUMBNAIL_PART);
}

Listing 6-1 (class OpenXMLDocument)

/**
 * Extract part content into the specified folder.
 *
 * @param parts
 *            Parts to extract.
 * @param destFolder
 *            Destination folder.
 */
public void extractParts(ArrayList<PackagePart> parts, File destFolder) {
    for (PackagePart part : parts) {
        String filename = PackageURIHelper.getFilename(part.getUri());
        File target = new File(destFolder, filename);
        // try-with-resources (Java 7+) closes both streams even on failure
        try (InputStream ins = part.getInputStream();
                FileOutputStream fw = new FileOutputStream(target)) {
            byte[] buff = new byte[512];
            int read;
            // Write only the bytes actually read; read() returns -1 at end of stream
            while ((read = ins.read(buff)) != -1) {
                fw.write(buff, 0, read);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Listing 6-2 (class OpenXMLDocument)

Listing 6 result :

Word document basic creation

To keep this example simple, we create the document by modifying the content of a blank one, which is easier to follow than building a package from scratch. To add paragraphs to a document, the ParagraphBuilder, Paragraph and Run classes do most of the work:

// Creation of a paragraph builder
ParagraphBuilder paraBuilder = new ParagraphBuilder();
paraBuilder.setAlignment(ParagraphAlignment.CENTER);

Listing 7-1

Once the ParagraphBuilder is ready, you can create a new paragraph by using the newParagraph() method:

// We create the first paragraph
Paragraph par1 = paraBuilder.newParagraph();

Listing 7-2

The following example creates two paragraphs: ‘Hello Office Open XML’, with a different style on each word, and ‘www.openxmldeveloper.org’, in bold with a large font size:

...

Package pack = Package.open(zipFile, PackageAccess.ReadWrite);
WordDocument docx = new WordDocument(pack);

// Creation of a paragraph builder
ParagraphBuilder paraBuilder = new ParagraphBuilder();
paraBuilder.setAlignment(ParagraphAlignment.CENTER);

// We create the first paragraph
Paragraph par1 = paraBuilder.newParagraph();

// Add runs to modify the style
Run r1 = new Run("Hello");
r1.setBold(true);

Run r2 = new Run(" Office");
r2.setItalic(true);

Run r3 = new Run(" Open");
r3.setUnderline(UnderlineStyle.SINGLE);

Run r4 = new Run(" XML");
r4.setVerticalAlignement(VerticalAlignment.SUPERSCRIPT);

// Add previous runs to the first paragraph
par1.addRun(r1);
par1.addRun(r2);
par1.addRun(r3);
par1.addRun(r4);

// Add the first paragraph to the document’s content
docx.appendParagraph(par1);

// Creation of a second paragraph
paraBuilder.setBold(true);

Paragraph par2 = paraBuilder.newParagraph();

Run r21 = new Run("www.openxmldeveloper.org");
r21.setFontSize(55);
par2.addRun(r21);

// Append the second paragraph to content
docx.appendParagraph(par2);

// Save the document
docx.save(destFile);

Listing 8
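The article does not show the markup that the framework writes for Listing 8. As a rough guide, a centered paragraph with a bold run and an italic run corresponds to WordprocessingML elements such as w:p, w:pPr, w:jc, w:r, w:rPr, w:b, w:i and w:t. The sketch below builds such a fragment with the JDK's DOM API only. `ParagraphMarkupSketch` and its methods are hypothetical names, and the output is an approximation of the structure, not the framework's actual output.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration (not part of the article's framework).
final class ParagraphMarkupSketch {
    static final String NS_W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Builds a run (w:r) with one run property, e.g. "w:b" for bold. */
    static Element run(Document doc, String text, String propertyName) {
        Element r = doc.createElementNS(NS_W, "w:r");
        Element rPr = doc.createElementNS(NS_W, "w:rPr");
        rPr.appendChild(doc.createElementNS(NS_W, propertyName));
        Element t = doc.createElementNS(NS_W, "w:t");
        // Text with leading or trailing spaces would also need xml:space="preserve"
        t.setTextContent(text);
        r.appendChild(rPr);
        r.appendChild(t);
        return r;
    }

    /** Returns the XML of a centered paragraph with a bold and an italic run. */
    static String centeredParagraph() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            Element p = doc.createElementNS(NS_W, "w:p");
            doc.appendChild(p);

            // Paragraph properties: centered alignment
            Element pPr = doc.createElementNS(NS_W, "w:pPr");
            Element jc = doc.createElementNS(NS_W, "w:jc");
            jc.setAttributeNS(NS_W, "w:val", "center");
            pPr.appendChild(jc);
            p.appendChild(pPr);

            p.appendChild(run(doc, "Hello", "w:b"));
            p.appendChild(run(doc, "Office", "w:i"));

            // Serialize the DOM tree to a string
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build paragraph markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(centeredParagraph());
    }
}
```

These are the same element names that the XSLT stylesheet in the HTML conversion section matches on.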

The result :

Convert a Word document to HTML

Because OpenXML documents are made of XML parts, converting a basic document to HTML is fairly simple with XSLT.

In this example, we’ll transform the simple document generated by the previous listing with the following XSLT stylesheet:

<?xml version="1.0" encoding="utf-8"?>

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

  <xsl:output method="html" />

  <!-- Document root -->
  <xsl:template match="/w:document">
    <xsl:apply-templates select="w:body" />
  </xsl:template>

  <!-- Body and paragraphs -->
  <xsl:template match="w:body">
    <html>
      <body>
        <xsl:for-each select="w:p">
          <p>
            <xsl:apply-templates select="w:pPr" />
            <xsl:apply-templates select="w:r" />
          </p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>

  <!-- Paragraph properties -->
  <xsl:template match="w:pPr">
    <xsl:attribute name="style">
      <xsl:apply-templates />
    </xsl:attribute>
  </xsl:template>

  <!-- Text alignment -->
  <xsl:template match="w:jc">text-align:<xsl:value-of select="@w:val" />;</xsl:template>

  <!-- Run -->
  <xsl:template match="w:r">
    <span>
      <xsl:apply-templates select="w:rPr" />
      <xsl:value-of select="w:t" />
    </span>
  </xsl:template>

  <!-- Run properties -->
  <xsl:template match="w:rPr">
    <xsl:attribute name="style">
      <xsl:apply-templates />
    </xsl:attribute>
  </xsl:template>

  <!-- Font size (w:sz is expressed in half-points) -->
  <xsl:template match="w:sz">font-size:<xsl:value-of select="@w:val div 2" />pt;</xsl:template>

  <!-- Vertical alignment -->
  <xsl:template match="w:vertAlign">
    <xsl:variable name="jcVal" select="@w:val" />
    <xsl:if test="$jcVal = 'superscript'">font-size:33%;position:relative;bottom:0.5em;</xsl:if>
    <xsl:if test="$jcVal = 'subscript'">font-size:33%;position:relative;bottom:-0.5em;</xsl:if>
  </xsl:template>

  <!-- Bold -->
  <xsl:template match="w:b">font-weight:bold;</xsl:template>

  <!-- Italic -->
  <xsl:template match="w:i">font-style:italic;</xsl:template>

  <!-- Underline -->
  <xsl:template match="w:u">text-decoration:underline;</xsl:template>

</xsl:stylesheet>


We’ll use the class WordToHTMLTransformer and the associated method transform() to convert our OpenXML document to HTML :

WordDocument docx = new WordDocument(...);
...
WordToHTMLTransformer wt = new WordToHTMLTransformer();
InputStream transformStream = wt.transform(docx);

Listing 9-1
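The internals of WordToHTMLTransformer are not shown in this article. For readers who want to see the general mechanism, here is a minimal sketch that applies a stylesheet to an XML string with the JDK's javax.xml.transform API. The class `XsltSketch`, its `transform` method and the tiny stylesheet in `main` are hypothetical, made up for this illustration, and much simpler than the stylesheet above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration (not the article's WordToHTMLTransformer).
final class XsltSketch {
    /** Applies the given XSLT stylesheet to the given XML and returns the result. */
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        // Made-up, minimal main document part
        String xml = "<w:document xmlns:w=\"" + ns + "\"><w:body><w:p><w:r>"
            + "<w:t>Hello</w:t></w:r></w:p></w:body></w:document>";
        // Reduced stylesheet: one HTML paragraph per w:p
        String xslt = "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
            + " xmlns:w=\"" + ns + "\" exclude-result-prefixes=\"w\">"
            + "<xsl:output method=\"html\"/>"
            + "<xsl:template match=\"/w:document\"><html><body>"
            + "<xsl:for-each select=\"w:body/w:p\"><p>"
            + "<xsl:value-of select=\"w:r/w:t\"/></p></xsl:for-each>"
            + "</body></html></xsl:template></xsl:stylesheet>";
        System.out.println(transform(xml, xslt));
    }
}
```

Note that the w prefix in a stylesheet must be bound to the same namespace URI as the one used by the document part; otherwise none of the templates match.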

The complete example :

final String APP_ROOT = System.getProperty("user.dir") + File.separator;
ZipFile zipFile = null; // Source file

try {
    zipFile = new ZipFile(APP_ROOT + "sample_out.docx");
} catch (IOException e) {
    e.printStackTrace();
}

// Destination of the output file
File destFile = new File(APP_ROOT + "output.html");

Package pack = Package.open(zipFile, PackageAccess.ReadWrite);
WordDocument docx = new WordDocument(pack);

WordToHTMLTransformer wt = new WordToHTMLTransformer();
try {
    InputStream transformStream = wt.transform(docx);
    BufferedWriter outStream = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(destFile)));

    BufferedReader br = new BufferedReader(new InputStreamReader(
            transformStream));

    String buff;
    while ((buff = br.readLine()) != null) {
        outStream.write(buff);
        // readLine() strips the line terminator, so write it back
        outStream.newLine();
    }
    outStream.close();

    br.close();
} catch (Exception e) {
    e.printStackTrace();
}

Listing 10

The HTML file generated by Listing 10 in Internet Explorer 7 :

Author

Julien Chable, a student at EFREI in France and a Microsoft Student Partner, writes articles about Java and .NET for several magazines and websites. He can be contacted via his website http://julien.chable.net or his blog http://blogs.developpeur.org/neodante/

Published Tuesday, November 21, 2006 7:25 AM by dmahugh
Attachment(s): sources.zip

Comments

 

TechiJon said:

Does any such API exists for .Net
April 9, 2007 8:43 PM
 

loeppel said:

For .NET there is Open XML SDK CTP April 2008.
But at the moment i don't like it cause of its very complicated!

greets,
loeppel
May 19, 2008 10:49 AM
 

broshni said:

hi folks
i m new to openxml
and i m getting through it recently
i hav a docx file which I am trying to parse into an html file.
i hav some images in the docx fiel itself. i m using java nd xslt to serve the purpose. i solved somehow the issues with text but still struggling with image part.
the image part is linked like
<pic:blipFill>
  <a:blip r:embed="rId4" />
</pic:blipFill>
where rId4 is being stored in another xml file-
/word/_rels/document.xml.rels
i m trying to iterate though this xml file using java and get the target of rId4 relaltionship.
and i come to know that all the images are stored in media folder as image1.jpeg, image2.jpeg etc
please do help me out.
any comments would be highly appreciative
with regards?
broshni
September 1, 2008 9:41 AM
 

jasonh said:

This article is getting a bit old now, but Google still ranks it highly, so I thought I'd post some other relevant info.

You might be interested in docx4j, which is open source and focused on making common use cases easy (disclaimer: its my project).

As an alternative, Apache's POI is now using OpenXML4J and XmlBeans to provide an object model. POI's main focus has been on spreadsheets, but the project also covers Word.
December 17, 2008 9:26 PM