SMC has project proposals for Google Summer of Code 2009, and the initial ideas are up on the wiki page.
A lot of documents, books and other information in India exist in the form of ASCII data. Quite a lot of it is in proprietary formats, usually Microsoft Word. Migrating it to Unicode is a daunting task, even with the existing tools to help. We already have Payyans, which does a decent job of converting ASCII data in Malayalam (in either text or PDF format). The project plan is to enhance Payyans so that it handles:
- A bunch of Indic languages. We need to incorporate the language-specific grammatical rules (prebase, postbase, etc.) wherever they differ from the generic implementation.
- All the document formats supported by OpenOffice. Be it .DOC, .ODT, .DOCX… Payyans should be able to read them and convert them to Unicode.
I did a feasibility study and some research on the second feature. As Payyans is written in Python, interaction with OpenOffice can be implemented using PyUNO. What we need to do is load an input file in .DOC or .ODT format, extract the text, and convert it to Unicode based on the ASCII font map.
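To give a rough idea of what "convert it to Unicode based on the ASCII font map" means, here is a minimal sketch. It is not Payyans' actual code: the FONT_MAP entries, the PREBASE_SIGNS set and the function name are made up for illustration, and the prebase handling simply swaps a sign with the single unit that follows it, which a real converter would need to refine for conjuncts.

```python
# Hypothetical font map: glyph code in the ASCII font -> Unicode text.
# These entries are illustrative only, not taken from a real Payyans map file.
FONT_MAP = {
    "I": "\u0d15",  # MALAYALAM LETTER KA
    "e": "\u0d32",  # MALAYALAM LETTER LA
    "s": "\u0d46",  # MALAYALAM VOWEL SIGN E (drawn before the consonant)
    "m": "\u0d3e",  # MALAYALAM VOWEL SIGN AA (drawn after the consonant)
}

# Vowel signs that ASCII fonts place before the consonant,
# but which Unicode stores after it.
PREBASE_SIGNS = {"\u0d46", "\u0d47", "\u0d48"}


def ascii_to_unicode(text, font_map=FONT_MAP):
    """Map ASCII-font glyph codes to Unicode, then reorder prebase signs."""
    keys = sorted(font_map, key=len, reverse=True)  # try the longest codes first
    units = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                units.append(font_map[key])
                i += len(key)
                break
        else:
            units.append(text[i])  # unmapped characters pass through unchanged
            i += 1

    # Move each prebase sign behind the unit that follows it.
    j = 0
    while j < len(units) - 1:
        if units[j] in PREBASE_SIGNS:
            units[j], units[j + 1] = units[j + 1], units[j]
            j += 2
        else:
            j += 1
    return "".join(units)
```

With this toy map, ascii_to_unicode("sI") puts the vowel sign after the consonant, which is the order Unicode expects.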
There are some examples of how to interact with OpenOffice using Python: how to start OpenOffice in listening mode, how to connect to the running instance, how to load a document, and how to write some text to the current document or search and replace a phrase.
Disappointingly, though, there is no reference on how to extract only the text, stripping all the tags and formatting. Something that simple is not available anywhere in the wiki or forums! There are one or two code snippets that are supposed to do this, but they didn’t work. I tried using the Enumeration technique, but the output text was cryptic. No luck.
After a couple of days of exhaustive searching through the reference documents, I found a way to convert the files to text files. The code snippet below does exactly that. First, an instance of OpenOffice has to be started. It can be started in “headless” mode, where you won’t see the window. Perfect. Do it this way:
openoffice.org "-accept=socket,host=localhost,port=2002;urp;StarOffice.ServiceManager" -nologo -headless &
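If you would rather start the listener from Python, a sketch along these lines may help. It only rebuilds the same command line shown above and the matching connection URL from one host and port, so the two cannot drift apart. The helper names are my own, and it assumes the openoffice.org binary is on the PATH.

```python
import subprocess


def accept_arg(host="localhost", port=2002):
    """Build the -accept option that puts OpenOffice in listening mode."""
    return "-accept=socket,host=%s,port=%d;urp;StarOffice.ServiceManager" % (host, port)


def resolver_url(host="localhost", port=2002):
    """Build the URL to hand to UnoUrlResolver.resolve() for the same listener."""
    return "uno:socket,host=%s,port=%d;urp;StarOffice.ComponentContext" % (host, port)


def start_listener(host="localhost", port=2002, binary="openoffice.org"):
    """Start a headless OpenOffice instance; 'binary' is assumed to be on PATH."""
    return subprocess.Popen([binary, accept_arg(host, port), "-nologo", "-headless"])
```

Calling start_listener() returns the Popen object, so the instance can be terminated once the conversion is done.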
Pass the input file name (the file can be in any format OpenOffice understands) and the output file name (the output file will be in plain text format) as command-line arguments to this Python program:
# Copyright (c) 2009 Rajeesh K Nambiar <firstname.lastname@example.org>
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# This hack is required since pyuno installed is of standalone OOO-3.0 instead of Fedora's
# This bloody hack is required due to the PyUno bug
import os
import sys
import uno
# get the uno component context from the PyUNO runtime
localContext = uno.getComponentContext()
# create the UnoUrlResolver
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
# connect to the running office
ctx = resolver.resolve( "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext" )
smgr = ctx.ServiceManager
# get the central desktop object
desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",ctx)
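# access the current writer document (not used here, we load the input file instead)
#model = desktop.getCurrentComponent()
# build file URLs from the input and output file names given on the command line
infile = "file://" + os.path.abspath(sys.argv[1])
outfile = "file://" + os.path.abspath(sys.argv[2])
# load the input document
document = desktop.loadComponentFromURL(infile, "_blank", 0, ())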
# Needed for FilterName - to export to TXT
from com.sun.star.beans import PropertyValue
TXT = PropertyValue()
TXT.Name = "FilterName"
TXT.Value = "Text"
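# Export the document to the output file using the Text filter
document.storeToURL(outfile, (TXT,))
# Close the document
document.dispose()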
# Do a nasty thing before exiting the python process. In case the
# last call is a oneway call (e.g. see idl-spec of insertString),
# it must be forced out of the remote-bridge caches before python
# exits the process. Otherwise, the oneway call may or may not reach
# the target object.
# I do this here by calling a cheap synchronous call (getPropertyValue).
ctx.ServiceManager
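One caveat about the file URLs in the program above: gluing "file://" onto an absolute path is fine for simple names, but a path with spaces or non-ASCII characters (a Malayalam file name, say) should be percent-encoded. A small sketch using only the standard library, assuming Python 3 on a POSIX system; the helper name is my own:

```python
import os
from pathlib import Path


def to_file_url(path):
    """Turn a local path into a percent-encoded file:// URL."""
    return Path(os.path.abspath(path)).as_uri()
```

PyUNO also provides uno.systemPathToFileUrl() for the same job, which could be used in place of the string concatenation.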
I need a break!