Extracting texts from text frames in OpenOffice.org Writer using the Java UNO interface

I recently faced the task of using OpenOffice.org to convert a three-digit number of old Microsoft Word documents into plain text files.

Under normal circumstances this is no big deal: write a program that uses the UNO interface to start OpenOffice, load the documents one after the other into OpenOffice Writer (conversion to the OpenOffice.org document format happens automatically), and save each document as a plain text file.
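One detail of that workflow is easy to trip over: loadComponentFromURL and the matching store call take file URLs, not plain file system paths. The following is a minimal sketch of how the source and target URLs for each document could be derived in plain Java. The class name ConversionPlan and its two methods are hypothetical helpers made up for this illustration; they are not part of the UNO API.

```java
import java.nio.file.Path;

/** Hypothetical helper: derives the names and URLs a batch conversion would need. */
final class ConversionPlan {

    /** Returns the path of the text file to write for a given Word document. */
    static Path targetFor(Path source) {
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Strip the old extension if there is one, then append ".txt".
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return source.resolveSibling(base + ".txt");
    }

    /** Converts a path to an absolute file URL (file:///...). */
    static String toFileUrl(Path path) {
        return path.toAbsolutePath().normalize().toUri().toString();
    }
}
```

For every Word document, toFileUrl(source) would then be the URL passed to loadComponentFromURL, and toFileUrl(targetFor(source)) the URL passed to the store call, presumably XStorable.storeToURL with a plain text export filter selected.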

It was, alas, not that simple. For some reason the text was contained in text frames, which seem to be the only text components that get lost completely when you save an OpenOffice Writer document as a text file. That meant I had to write some code to extract the text from the text frames myself.

As I could not find a code snippet for this anywhere, I thought I would post what I finally came up with, just in case someone else runs into a similar challenge.

The code assumes you have an object oDocument for the document that contains the text frame, obtained for example with the loadComponentFromURL method of the XComponentLoader interface. One way to accomplish this is explained here.

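// Imports needed by the snippet
import com.sun.star.container.XNameAccess;
import com.sun.star.text.XText;
import com.sun.star.text.XTextCursor;
import com.sun.star.text.XTextFrame;
import com.sun.star.text.XTextFramesSupplier;
import com.sun.star.uno.UnoRuntime;

// Get the list of names of all text frames of the document
// (getByName throws checked UNO exceptions, so the enclosing method has to declare or catch them)

XTextFramesSupplier xFrameSupplier = (XTextFramesSupplier) UnoRuntime.queryInterface(XTextFramesSupplier.class, oDocument);
XNameAccess xTextFrames = xFrameSupplier.getTextFrames();
if (xTextFrames.hasElements()) {
    String[] elementNames = xTextFrames.getElementNames();

    // Create a text cursor for the first text frame
    // To access the other text frames use elementNames[1], elementNames[2], ...

    Object oTextFrame = xTextFrames.getByName(elementNames[0]);
    XTextFrame xTextFrame = (XTextFrame) UnoRuntime.queryInterface(XTextFrame.class, oTextFrame);
    XText xText = xTextFrame.getText();
    XTextCursor xTextCursor = xText.createTextCursor();

    // Select the whole content of the frame: getString() returns only the
    // selected text, and a freshly created cursor may have nothing selected
    xTextCursor.gotoStart(false);
    xTextCursor.gotoEnd(true);

    // Extract the text

    String text = xTextCursor.getString();
}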
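
The snippet reads only the first frame. To gather every frame, the same lookup can be repeated for each entry of elementNames and the results joined. Below is a rough sketch of that joining step in plain Java. The names FrameTextCollector, collect and textOf are hypothetical and chosen only for this illustration; the frame lookup is passed in as a function so that the logic can be tried out without a running office instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper: gathers the text of several frames into one string. */
final class FrameTextCollector {

    /**
     * Looks up the text of each named frame and joins the non-empty results,
     * separated by a blank line, in the order the names were given.
     */
    static String collect(String[] frameNames, Function<String, String> textOf) {
        List<String> parts = new ArrayList<>();
        for (String name : frameNames) {
            String text = textOf.apply(name);
            // Skip frames that are missing or hold only whitespace.
            if (text != null && !text.trim().isEmpty()) {
                parts.add(text.trim());
            }
        }
        return String.join("\n\n", parts);
    }
}
```

With UNO, frameNames would be the array returned by getElementNames(), and textOf would wrap the getByName and queryInterface calls from the snippet above, handling the checked UNO exceptions as needed. The sketch keeps the order of the names it is given; it does not assume that this matches the order of the frames in the document.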

