Extracting texts from text frames in OpenOffice.org Writer using the Java UNO interface

I recently faced the challenge of using OpenOffice.org to convert a three-digit number of old Microsoft Word documents into plain text files.

Under normal circumstances this is no big deal: write a program that uses the UNO interface to start OpenOffice.org, loads the documents one after the other into OpenOffice.org Writer (conversion to the OpenOffice.org document format is done automatically) and saves them as plain text documents.
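The bookkeeping around such a batch run can be sketched in plain Java. This is only a minimal sketch: the class and method names (BatchPlanSketch, isWordDocument, textTargetFor, toFileUrl) are made up for illustration, and the actual UNO calls for loading and storing a document are deliberately left out. It only shows how to pick the Word files, derive the name of the text file and build the file URL that UNO expects instead of a system path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical helper class, not part of the UNO API.
final class BatchPlanSketch {

    // Decide by file name only whether a file is an old Word document.
    static boolean isWordDocument(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".doc");
    }

    // letter.doc -> letter.txt in the same directory.
    static Path textTargetFor(Path wordFile) {
        String name = wordFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return wordFile.resolveSibling(base + ".txt");
    }

    // UNO works with file URLs (file:///...), not with system paths.
    static String toFileUrl(Path file) {
        return file.toAbsolutePath().toUri().toString();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(Files::isRegularFile)
                 .filter(BatchPlanSketch::isWordDocument)
                 .sorted()
                 // The real program would load and store the document here.
                 .forEach(doc -> System.out.println(toFileUrl(doc) + " -> " + textTargetFor(doc)));
        }
    }
}
```

Run it with a directory as its only argument to print one "source URL -> target file" line per Word document.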

It was, alas, not that simple. For some reason the texts were contained in text frames, which seem to be the only text components that get lost completely when you save an OpenOffice.org Writer document as a text file. That meant I had to write some code to extract the text from the text frames myself.

As I could not find a code snippet for this anywhere, I thought I would post what I finally came up with. Just in case someone else runs into a similar challenge …

The code assumes you have an object oDocument for the document that contains the text frame, obtained for example with the loadComponentFromURL method of the XComponentLoader interface. One way to accomplish this is explained here.

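import com.sun.star.container.XNameAccess;
import com.sun.star.text.XText;
import com.sun.star.text.XTextCursor;
import com.sun.star.text.XTextFrame;
import com.sun.star.text.XTextFramesSupplier;
import com.sun.star.uno.UnoRuntime;

// Get the list of names of all text frames of the document

XTextFramesSupplier xFrameSupplier = (XTextFramesSupplier) UnoRuntime.queryInterface(XTextFramesSupplier.class, oDocument);
XNameAccess xTextFrames = xFrameSupplier.getTextFrames();
if (xTextFrames.hasElements()) {
    String[] elementNames = xTextFrames.getElementNames();

    // Create a text cursor for the first text frame
    // To access the other text frames use elementNames[1], elementNames[2], ...

    Object oTextFrame = xTextFrames.getByName(elementNames[0]);
    XTextFrame xTextFrame = (XTextFrame) UnoRuntime.queryInterface(XTextFrame.class, oTextFrame);
    XText xText = xTextFrame.getText();
    XTextCursor xTextCursor = xText.createTextCursor();

    // Select the frame's whole content explicitly, so that getString()
    // does not depend on where the new cursor starts
    xTextCursor.gotoStart(false);
    xTextCursor.gotoEnd(true);

    // Extract the text

    String text = xTextCursor.getString();
}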


Finding the file path to a database file embedded in an OpenOffice.org extension (.oxt)

As already mentioned in my previous post, one of the advantages of an OpenOffice.org extension from a user perspective is that you install everything from one file, including a database file if one is needed, as is the case for the Jitenize extension.

OpenOffice.org unpacks the contents of an .oxt file and puts them in a separate directory before executing the extension. The location of this directory depends on the operating system and on whether the extension is installed for a single user or for all users, and the path contains some random directory names. For example on my Ubuntu installation the Jitenize extension can be found at


For writing a platform-independent OpenOffice.org extension you therefore need a way to

  1. find the path to the directory in which the extension sits
  2. extend the path with the file name of the database file using the correct (platform-dependent) path separator (“/”, “\” or “:”)

Fortunately the UNO framework has some support for doing this, but I had to make intensive use of Google to find out how, and some of the code snippets that I found simply did not work. So I thought it would be a good idea to publish my final solution here:

import com.sun.star.deployment.PackageInformationProvider;
import com.sun.star.deployment.XPackageInformationProvider;
import com.sun.star.ucb.XFileIdentifierConverter;
import com.sun.star.uno.UnoRuntime;

// Create the path to the dictionary database in the UNO component package cache in a platform-independent way
// The dictionary database of the Jitenize extension is always in the root directory of the UNO extension package
String databasePath;
try {
   XPackageInformationProvider xPackageInformationProvider = PackageInformationProvider.get(m_xContext);
   String packageLocation = xPackageInformationProvider.getPackageLocation("com.fuyosoft.jitenize");
   XFileIdentifierConverter xFileConverter = (XFileIdentifierConverter) UnoRuntime.queryInterface(XFileIdentifierConverter.class,
         m_xContext.getServiceManager().createInstanceWithContext("com.sun.star.ucb.FileContentProvider", m_xContext));
   // Convert the file URL of the database into a path in the notation of the operating system
   databasePath = xFileConverter.getSystemPathFromFileURL(packageLocation + "/dict.sqlite");
} catch (com.sun.star.uno.Exception ex) {
   databasePath = "Exception during construction of path to dictionary database!";
}

com.fuyosoft.jitenize is the identifier of the extension specified in the description.xml file as explained in this post.
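To make the second step (file URL to system path) easier to try out without a running office, here is a minimal plain-Java sketch of the same idea. It is not the UNO way described in this post: it uses java.net.URI and java.nio.file.Paths instead of XFileIdentifierConverter, the class and method names are made up, and it assumes the package location is a regular file: URL and that the file name needs no URL escaping.

```java
import java.net.URI;
import java.nio.file.Paths;

// Hypothetical helper class, plain JDK only (no UNO involved).
final class PackagePathSketch {

    // Append fileName to the package location URL and convert the result
    // into a path in the notation of the current operating system.
    static String toSystemPath(String packageLocationUrl, String fileName) {
        // Make sure the base URL ends with "/" so that resolve() appends
        // the file name instead of replacing the last path segment.
        String base = packageLocationUrl.endsWith("/")
                ? packageLocationUrl
                : packageLocationUrl + "/";
        URI fileUri = URI.create(base).resolve(fileName);
        return Paths.get(fileUri).toString();
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by getPackageLocation():
        // the URL of the current working directory.
        String packageLocation = Paths.get("").toAbsolutePath().toUri().toString();
        System.out.println(toSystemPath(packageLocation, "dict.sqlite"));
    }
}
```

The URL always uses “/”, and the conversion at the very end is the only place where the platform-dependent separator comes into play.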

This has been tested with Ubuntu 10.04 and Windows XP. Feedback from any other operating system platform is highly welcome!

Writing an OpenOffice.org UNO service in Java with Eclipse

UNO is the component model of OpenOffice.org. It is a small application server integrated into the OpenOffice.org software whose services can be accessed either internally by macros contained in documents or externally by other applications. That way it is possible, for example, for an application to load a document in OpenOffice.org Writer, change its formatting and save it.

Writing a UNO service in Java (C++ and other languages are possible as well but not treated in this post) means:

  • Define the functionality of the service in a language-independent way by using the interface definition language of UNO (UNOIDL).
  • Write the implementation of the service by implementing the interface of the UNOIDL definition plus a few other standard interfaces that every UNO service is required to implement.
  • Compile the UNOIDL specifications and Java classes into one JAR file and put this with the type database and the other files already described here into an .oxt file (of course you can add other things like macros that make use of the UNO service as well).
  • Deploy the service by loading the .oxt file into the Extension Manager of OpenOffice.org.

Chapter 2 of the OpenOffice.org developer's manual explains this in full detail. Fortunately there is a plugin for Eclipse whose wizard generates an Eclipse project with a complete skeleton for a UNO service. There is a tutorial for installation and usage of this plugin, and from my own experience I recommend reading the tutorial and following its instructions step by step.

The plugin has some limitations though, so for Jitenize I had to do some manual changes to the project:

  • The plugin wizard only lets you define interfaces with simple or already defined data types. For Jitenize I needed my own record structures. I used the plugin wizard to define the functions of the service with simple data types, added a new UNOIDL description file containing the record structure and changed the function definition afterwards. By the way, adding a new UNOIDL file did not work; I always got the “NOT A UNOIDL CAPABLE FOLDER” error. But adding a simple text file with the .idl suffix worked, and this file was compiled automatically with idlc like all other .idl files.
  • After changing the UNOIDL description of the service, the parameters of the implementation function of the service of course needed to be changed as well. Here I recommend reading the sections called “Type mapping of xxx” in Chapter 2 of the OpenOffice.org developer's manual, especially “Mapping of Interface Types”, which explains very well how IN, OUT and INOUT parameters in UNOIDL are mapped to Java.
  • What I did not accomplish was to generate a complete .oxt file with the plugin. The showstoppers were the .jar files for the dictionary lookup that I had added to the project and that did not show up in the .oxt file. I finally gave up and wrote my own shell script that created the .jar file with all .class files needed and created an .oxt file with this .jar file and the other files that the plugin had created for the .oxt file. If you try the same thing, make sure you add the .class files in the /bin and in the /build directory of the project’s workspace to the jar file.
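The parameter mapping mentioned in the second point can be illustrated with a small plain-Java sketch. In the Java language binding, OUT and INOUT parameters arrive as arrays whose first element carries the value, because Java cannot pass arguments by reference. The method below is a made-up example and not the real Jitenize interface; the UNOIDL signature in the comment is equally hypothetical.

```java
// Hypothetical UNOIDL declaration that the Java method below would implement:
//   boolean lookup([in] string word, [out] string reading, [inout] long hits);
// A UNOIDL long is a 32-bit value and therefore becomes a Java int.
final class ParameterMappingSketch {

    static boolean lookup(String word, String[] reading, int[] hits) {
        if (word.isEmpty()) {
            return false;
        }
        // OUT parameter: the callee fills element 0, the old content is ignored.
        reading[0] = "reading of " + word; // placeholder for a real dictionary lookup
        // INOUT parameter: the callee reads element 0 and writes it back.
        hits[0] = hits[0] + 1;
        return true;
    }

    public static void main(String[] args) {
        String[] reading = new String[1]; // caller provides a one-element array
        int[] hits = {41};
        boolean found = lookup("jiten", reading, hits);
        System.out.println(found + " / " + reading[0] + " / " + hits[0]);
    }
}
```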


Components of an OpenOffice.org extension (.oxt)

From a user perspective OpenOffice.org extensions are really easy to handle: Download the extension as one single .oxt file, start any application of the suite, go to the extension manager, load the .oxt file, accept the license agreement if available and you are done.

For developers there are quite a few components to prepare before you can create the .oxt file. To see what is needed let me show you the contents of the Jitenize extension file.

Like .jar files the .oxt files are actually archives in ZIP file format and can be opened and generated with every archive tool that can handle ZIP files, so let’s have a look at the Jitenize extension using the jar command line tool:

> jar tf jitenize_0.8.3.en.oxt
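Because an .oxt file is a plain ZIP archive, the same listing can also be produced from Java with java.util.zip. The following is a minimal sketch under a few assumptions: the class and method names are made up, and it first writes a small stand-in archive with placeholder contents so that it runs without the real Jitenize file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, plain JDK only.
final class OxtArchiveSketch {

    // Write a stand-in .oxt file that contains one placeholder entry per name.
    static Path writeSampleOxt(String... entryNames) {
        try {
            Path oxt = Files.createTempFile("sample", ".oxt");
            try (OutputStream out = Files.newOutputStream(oxt);
                 ZipOutputStream zip = new ZipOutputStream(out)) {
                for (String name : entryNames) {
                    zip.putNextEntry(new ZipEntry(name));
                    zip.write(("placeholder for " + name).getBytes(StandardCharsets.UTF_8));
                    zip.closeEntry();
                }
            }
            return oxt;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Return the entry names of the archive, like "jar tf" prints them.
    static List<String> listEntries(Path oxt) {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(oxt.toFile())) {
            zip.stream().forEach(entry -> names.add(entry.getName()));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }

    public static void main(String[] args) {
        Path oxt = writeSampleOxt("description.xml", "types.rdb", "jitenize.jar", "jitenize.xcu");
        listEntries(oxt).forEach(System.out::println);
    }
}
```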

The central information that tells an OpenOffice.org application how to load an extension is in the manifest.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
 <manifest:file-entry manifest:full-path="types.rdb" manifest:media-type="application/vnd.sun.star.uno-typelibrary;type=RDB"/>
 <manifest:file-entry manifest:full-path="jitenize.jar" manifest:media-type="application/vnd.sun.star.uno-component;type=Java"/>
 <manifest:file-entry manifest:full-path="Jitenize/" manifest:media-type="application/vnd.sun.star.basic-library"/>
 <manifest:file-entry manifest:full-path="jitenize.xcu" manifest:media-type="application/vnd.sun.star.configuration-data" />
</manifest:manifest>
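To check which file is registered under which media type, the manifest can be read with the XML parser that ships with the JDK. This is a minimal sketch: the class and method names are made up, and it assumes the entries live in the manifest namespace http://openoffice.org/2001/manifest declared on the root element of the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper class, plain JDK only.
final class ManifestEntriesSketch {

    static final String MANIFEST_NS = "http://openoffice.org/2001/manifest";

    // Map each full-path of the manifest to its media-type, in document order.
    static Map<String, String> readEntries(String manifestXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the manifest: prefix
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(manifestXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(MANIFEST_NS, "file-entry");
            Map<String, String> entries = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element entry = (Element) nodes.item(i);
                entries.put(entry.getAttributeNS(MANIFEST_NS, "full-path"),
                            entry.getAttributeNS(MANIFEST_NS, "media-type"));
            }
            return entries;
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse manifest", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<manifest:manifest xmlns:manifest=\"" + MANIFEST_NS + "\">"
                + "<manifest:file-entry manifest:full-path=\"jitenize.jar\""
                + " manifest:media-type=\"application/vnd.sun.star.uno-component;type=Java\"/>"
                + "<manifest:file-entry manifest:full-path=\"Jitenize/\""
                + " manifest:media-type=\"application/vnd.sun.star.basic-library\"/>"
                + "</manifest:manifest>";
        readEntries(xml).forEach((path, type) -> System.out.println(path + " : " + type));
    }
}
```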

jitenize.jar and types.rdb contain the main code of the extension written as a UNO service in Java. More on this in a later post.

The Jitenize subdirectory contains the user interface written in BASIC that calls the UNO service. The files were generated by writing a macro library in OpenOffice.org Writer and exporting it as a BASIC library (and not as an extension!).

jitenize.xcu contains the information on how to extend the menu structure of OpenOffice.org Writer and how to connect the menu items to the BASIC macros. The easiest way that I found to create this file is to use the macro contained in chapter 2.2 of this document. By the way, “addon” is an old name for what is nowadays called an extension; the document is quite outdated, but the macro still works well.

In addition to these four central files a few other files are available:

dict_en.sqlite, sqljet-1.0.7.jar and antlr-runtime-3.1.3.jar contain the dictionary database and the free SQLite database engine written in Java from TMate Software. The engine is used by the UNO service, to which it is made available by adding the jar files to the classpath defined in the manifest of jitenize.jar.
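The classpath entry in the manifest of a jar file is the Class-Path attribute, a space-separated list of jar files relative to the location of the jar itself. As a minimal sketch (the class and method names are made up, and the real manifest of jitenize.jar contains further entries that are not shown here), such a manifest can be built with java.util.jar.Manifest:

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper class, plain JDK only.
final class JarClassPathSketch {

    // Build a manifest whose Class-Path lists the given jar files.
    static Manifest manifestWithClassPath(String... jars) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Without Manifest-Version the main attributes are not written out.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(Attributes.Name.CLASS_PATH, String.join(" ", jars));
        return manifest;
    }

    public static void main(String[] args) throws IOException {
        // Prints the manifest text as it would appear in META-INF/MANIFEST.MF.
        manifestWithClassPath("sqljet-1.0.7.jar", "antlr-runtime-3.1.3.jar").write(System.out);
    }
}
```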

Finally, description.xml contains information about the extension that shows up in the extension manager (display name, version, etc.), a link to the license file COPYRIGHTS and an identifier for the extension that must be unique worldwide (com.fuyosoft.jitenize). The extension manager uses this identifier to decide whether a newly loaded extension is an update to an already existing extension or not.

<?xml version="1.0" encoding="UTF-8"?>
<description xmlns="http://openoffice.org/extensions/description/2006" xmlns:d="http://openoffice.org/extensions/description/2006" xmlns:xlink="http://www.w3.org/1999/xlink">
 <identifier value="com.fuyosoft.jitenize" />
 <version value="0.8.3.en" />
  <OpenOffice.org-minimal-version value="2.2" name="OpenOffice.org 2.2"/>
  <name xlink:href="http://fuyosoft.com">fuyosoft.com</name>
  <simple-license accept-by="admin" suppress-on-update="true" >
   <license-text xlink:href="COPYRIGHTS"/>