Greenstone2 – Greenstone Blog

docx processing without libreoffice: UnknownConverterPlugin + Apache Tika

2020-08-07T07:13:42Z

I didn’t get to fix the bug found last week: problems transferring files with non-ASCII filenames between a Windows client-GLI and an (Ubuntu) Linux remote GS3 server. My supervisor said this may be a bigger issue than I had thought, because the filesystems of different operating systems (OS) may use different encodings; for example, Ubuntu uses UTF-8 and Windows UTF-16, I believe. Such differences can cause characters in the names of files transferred from one OS to another to be lost or altered. My supervisor thinks the problem needs deeper thought and suggested that I log a ticket on trac for when we can spend more time investigating whether it can be solved at all.
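To make the encoding concern concrete, here is a minimal sketch of how one filename turns into different bytes under different encodings, and how it gets garbled when those bytes are read back with the wrong one. The filename and the cp1252 misreading are assumptions chosen purely for illustration; this is not a diagnosis of the actual GLI bug.

```python
# Illustration only: the filename and the encodings compared are assumptions.
name = "café.docx"

utf8_bytes = name.encode("utf-8")       # byte form a UTF-8 system would use
utf16_bytes = name.encode("utf-16-le")  # UTF-16 code units, little-endian


def misread(raw: bytes, wrong_encoding: str) -> str:
    """Decode bytes with an encoding other than the one that produced them."""
    return raw.decode(wrong_encoding, errors="replace")


# Reading the UTF-8 bytes as Windows-1252 silently alters the name.
garbled = misread(utf8_bytes, "cp1252")

print(utf8_bytes)
print(utf16_bytes)
print(garbled)
```

The same characters occupy 10 bytes in UTF-8 and 18 in UTF-16, so any step in the transfer that guesses the encoding wrongly will change the name.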

Other than that failed attempt at a bugfix, I did not contribute any Greenstone work for the overall community of GS users this week. Some work from back in June is worth writing up though.

Greenstone plugins often use existing open-source software to process complex document types like PDFs and older versions of Word documents, so that they become searchable in the built collection. However, for Greenstone to process newer Word documents, which have the docx file extension, users have so far needed to have either the free LibreOffice or the payware Microsoft Word of the Office Suite installed. LibreOffice is large, and it can sometimes be hard to get it to run headless successfully or to restart it in headless mode. After I saw some messages on the mailing list about this, I searched online to see whether new command line tools had appeared in the meantime that could convert docx files to html or text. If so, Greenstone’s existing UnknownConverterPlugin could be used to run that command line tool to extract text from docx files, so that they’ll get indexed and become searchable in the built collection.

To gain a better understanding of generally using the UnknownConverterPlugin, refer to the tutorial at http://files.greenstone.org/tutorial/gs3-current/en/unknown_converter_plugin.htm

Apache Tika is Apache’s open-source software for extracting text from countless different (textual) document types, one of which is docx. While one can write code that calls Apache Tika’s API, their ready-made jar file contained everything that we needed to get Greenstone to index text in docx files.

All I had to do was configure the UnknownConverterPlugin to make use of the Apache Tika jar dropped into GS3’s extension folder, so that docx files could be processed and indexed (for searchability) by Greenstone without requiring users to install LibreOffice.
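As a rough sketch of what running the Tika jar as a command line tool looks like, independent of Greenstone: the tika-app jar accepts `--text` and `--html` to choose the output form and writes the result to standard output. The helper names and paths below are hypothetical, and this is not the UnknownConverterPlugin configuration itself.

```python
import subprocess


def tika_command(jar_path, document, as_html=False):
    """Build the command line that asks tika-app to extract a document's content."""
    output_flag = "--html" if as_html else "--text"
    return ["java", "-jar", str(jar_path), output_flag, str(document)]


def extract(jar_path, document, as_html=False):
    """Run Tika and return what it wrote to standard output.

    Assumes a 'java' executable is on the PATH; decoding the output as
    UTF-8 is also an assumption, since it depends on the platform.
    """
    result = subprocess.run(
        tika_command(jar_path, document, as_html),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Hypothetical paths, shown only to illustrate the shape of the command.
print(tika_command("gs2build/ext/tika/tika-app-1.24.1.jar", "report.docx"))
```

Keeping the command construction separate from its execution makes it easy to inspect the exact command before handing an equivalent one to a converter plugin.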

The UnknownConverterPlugin has been officially available since Greenstone 3.09, so 3.09 users can also start using Tika with the plugin by

1. creating a subfolder called “tika” inside their GS3-install-dir/gs2build/ext,

2. downloading the Apache-Tika binary jar file from https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.24.1.jar (or by visiting http://trac.greenstone.org/browser/main/trunk/greenstone2/ext/tika/tika-app-1.24.1.jar and clicking the link labelled “downloading” there), then dropping the downloaded jar file into GS3/gs2build/ext/tika

3. and then configuring an UnknownConverterPlugin instance for any collection that needs docx processing as follows:

All 3 of the above steps are already set up for you in the GS3 binaries generated every night and available from http://www.greenstone.org/caveat-emptor/

Untried: Greenstone 2 users can try grabbing a nightly GS2 binary from http://www.greenstone.org/caveat-emptor/ (as it should also come with an UnknownConverterPlugin). The nightly GS2 binaries should already have an ext/tika subfolder within the GS2 installation folder, containing the tika jar file. Otherwise you can create this folder yourself and download the tika jar file into that location as in step 2. Next, configure your UnknownConverterPlugin as in step 3 above before building your GS2 collection containing docx files.

You’re not limited to processing docx files by using UnknownConverterPlugin with Tika. You can process other textual doc types, whether already supported by existing Greenstone plugins or not, by configuring a new instance of UnknownConverterPlugin and setting the mime_type, srcicon, process_extension (and file_format) fields appropriately for that doctype.

The above instructions on using the UnknownConverterPlugin have now also been added to the Greenstone wiki, so you don’t have to track down this blog page if you want to revisit the instructions for using Tika with the UnknownConverterPlugin. Simply search the Greenstone wiki at wiki.greenstone.org for “UnknownConverterPlugin” or “Tika”.

Greenstone 2.87 released

2017-10-02T06:16:00Z

We recommend that Greenstone users shift to Greenstone 3, as that’s the future of Greenstone.

But for those who really need Greenstone 2, we’ve now brought out a maintenance release of it with Greenstone 2.87.

You can grab the new GS 2.87 binaries, and the source component and source distribution packages, from the downloads page.

The release notes include the installation instructions. And here are the Greenstone 2 tutorials; make sure to select the “Greenstone 2” tab at the top.

For those who need to work from source, the release notes also contain links to pages detailing the steps for compiling the source components and distributions.

As usual, write to us if you encounter any issues with GS2.87.

Greenstone sourceforge projects merged

2015-06-14T23:52:15Z

I have tidied up the greenstone project on sourceforge: http://sourceforge.net/projects/greenstone/.

The Greenstone3 project has been merged into the main Greenstone project, and Greenstone3 binaries are now the default downloads. To download Greenstone2 binaries, click “Browse All Files” and navigate to the binaries you want.

Note, for now I have left the Greenstone3 project in place, but it will eventually be deleted.

Greenstone updates

2014-05-09T06:05:04Z

It’s been a very long time since we blogged our progress. Rest assured we’ve been working hard to improve Greenstone 2 and 3, and it’s only the blogging about it that fell by the wayside. Here are some of the things that I’ve been working on in the last few months:

Securing Greenstone 2 pages: this involved significant changes to both the code and the macros files
Updating the GTI Greenstone installation. Owing to the security changes for Greenstone 2, the Greenstone Translation Interface on nzdl needed to have the latest Greenstone2 to work again
Fixing up a Remote Greenstone 3 authentication issue
The FormatConversion wizard dialog. This completes the process of opening a Greenstone 2 collection in Greenstone 3. The format conversion dialog automatically generates Greenstone 3 equivalents for Greenstone 2 format statements behind-the-scenes, before presenting these to you. You can then interact with the wizard to improve any of the Greenstone 3 format statements that have been generated.

Anu’s entry for the week of 7-11 May 2012

2012-05-11T11:09:21Z

Over the week, have been working on the activate.pl script (and things that it needs). The details are at http://trac.greenstone.org/ticket/825

For the latest changes made today, need to retest these changes against GS3 on Windows.

Still need to test the entire process on Linux.

Anu’s entry for the weeks of 23 Apr – 4 May 2012

2012-05-04T07:54:04Z

At the start of last week, finished off the task of the GS3 “debuginfo” button that now appears next to the login button.
The Greenstone tutorial xml files can now include a MajorVersion element with number attribute to specify if the instructions are for GS3 or GS2 and will get processed by the XSLT to display or hide such elements depending on the active version.
Joshua Scarsbrook discovered two bugs compiling GS3 on a Mac and has helped us fix these (but one of the fixes still needs to be tested on his machine). Unfortunately there were some issues with setting the Java preferences on my account on the Mac here. At present, GS3 can’t be compiled there because it requires Java 1.6.
After Dr Bainbridge fixed error handling and display of the PDFBox Extension, it became easier to debug a PDFBox Extension bug discovered by a member on the mailing list. She helped us to track it down and it turned out that the PDFBox extension did not try to first look for and use any JRE included in a GS2 binary when running the java -version test.
While trying to work out why searching 3 digit numbers crashed the server (when Diego wanted to try the ifl=1 parameter to the GS2 URL), I first found and tracked down a very troublesome bug that I had accidentally introduced into GS2. The documents in browse or search results would not display and their URLs looked strange (with the word handle in their path). It turned out that in January, I’d committed the -DDOCHANDLE option to CXXFLAGS in a win32.mak file that was meant for the experimental work Dr Bainbridge and Diego had been doing with REST URLs. I meant to commit only the RSS support code they had written. Dr Bainbridge then fixed the bug Diego had originally noticed to do with the ifl parameter.
Some translation work and looked at a few mailing list questions.
Currently started work on activate.pl which should perform in perl the task that GLI currently does of stopping the GS2 or GS3 server while moving the building to index and restarting the server again.
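The activation step described in the last item (stop the server, move building to index, restart the server) can be sketched roughly as follows. This is a Python illustration and not the actual activate.pl; the function name and the stop/start callbacks are hypothetical stand-ins for whatever controls the GS2 or GS3 server.

```python
import shutil
from pathlib import Path


def activate(collection_dir, stop_server, start_server):
    """Replace a collection's 'index' folder with its freshly built 'building' folder.

    stop_server and start_server are hypothetical callbacks supplied by the caller.
    """
    building = Path(collection_dir) / "building"
    index = Path(collection_dir) / "index"
    if not building.is_dir():
        raise FileNotFoundError(f"nothing to activate: {building} does not exist")

    stop_server()  # the server must not hold the index files open during the move
    try:
        if index.exists():
            shutil.rmtree(index)  # discard the previous index
        building.rename(index)    # the new build becomes the live index
    finally:
        start_server()            # restart even if the move failed
```

Putting the restart in a `finally` clause is a design choice of this sketch: a failed move then still leaves a running server rather than a stopped one.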

Anu’s blog entry for 5 March – 23 March

2012-03-23T08:34:20Z

The first two weeks involved:

generating some files for translation of the Greenstone interface (Mongolian, Bhutanese) and committing changes translators had submitted (Laotian)
fixing up the GS2 CORBA code, including bringing it up to speed with the rest of GS2’s runtime code, so that CORBA works again: it can now compile once more, and the corbaserver and corbarecptldd client program run well against each other when on the same machine. Running the server against the client in a remote situation does not yet work, but it did not work in the demo/hello-1 example of the now-updated MICO package either.
there was still a small error in the way the PDFBox extension tests for Java when Java is version 1.7 that made the extension not work with JDK1.7. The test for the presence of Java now has to run java -version rather than just java, since the return value in Java 7 is different from that in Java 6.
when testing the Powerpoint plugin, it was found that the OpenOffice extension needed to be corrected to make jodconverter use the same port as that which OpenOffice is run on. It was moreover discovered that users can’t already have the graphical user interface of OO running in the background, nor can they start this, during Greenstone’s processing of documents using the OO extension.
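The Java presence test mentioned above can be sketched like this: run `java -version`, treat a zero exit status as "Java is present", and read the version string from the banner, which Java prints to standard error. The function names are hypothetical and this is not the PDFBox extension's own code.

```python
import re
import subprocess


def parse_java_version(banner):
    """Pull the quoted version string out of a 'java -version' banner, if any."""
    match = re.search(r'version "([^"]+)"', banner)
    return match.group(1) if match else None


def java_available(java="java"):
    """Return (found, version) for the given java executable."""
    try:
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
    except OSError:
        # The executable could not be started at all.
        return False, None
    # The banner normally goes to stderr; stdout is included just in case.
    return result.returncode == 0, parse_java_version(result.stderr + result.stdout)
```

Testing with an explicit `-version` argument avoids depending on what a bare `java` invocation returns, which is the difference between Java 6 and Java 7 that the entry describes.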

This week:

there was some issue with Greenstone 3’s tomcat server crashing on 64 bit Linux owing to a Java segmentation fault created by an error in the JNI code. Dr Bainbridge found out that the number of bytes to store pointers to data structures shared between Java and C++ code needed to be long rather than int, so MG’s and MGPP’s JNI code was updated. The error has not returned since, but debugging code has been left in for future debugging if required.
Dr Te Taka hoped to update the Maori translations for Greenstone’s interface using Google’s Translator Toolkit (GTT), and suggested that Greenstone’s translation process be expanded to allow this so that other translators too could benefit from the toolkit for translation if they wanted. He found out that the toolkit accepted an open XML format called TMX, Translation Memory eXchange, and thus would need the strings that required translation to be converted into the TMX XML format (rather than into the usual spreadsheet versions of the .excel.xml format which we currently generate). Two new XSLT files have been written which Te Taka may kindly be testing for us. The first generates the TMX translation files that translators can load into Google’s Translator Toolkit. The second XSLT takes translated TMX files and converts them into an intermediary format that can be processed in the usual manner when submitting new and updated translations back into Greenstone.
currently looking at usersDB in GS3 having the correct values on startup.
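On the int-versus-long pointer issue in the JNI code above: a JNI `jint` is 32 bits wide and a `jlong` is 64 bits, so a 64-bit address only survives the trip through Java if it is held in a long. The sketch below imitates the two storage widths with bit masks; the address is a made-up value and none of this is MG or MGPP code.

```python
def store_in_32_bits(address):
    """Keep only the low 32 bits, as a 32-bit integer slot would."""
    return address & 0xFFFFFFFF


def store_in_64_bits(address):
    """Keep the low 64 bits, as a 64-bit integer slot would."""
    return address & 0xFFFFFFFFFFFFFFFF


# Hypothetical address of the kind a 64-bit Linux process might hand out.
pointer = 0x00007F3A12345678

print(hex(store_in_32_bits(pointer)))  # upper half of the address is gone
print(hex(store_in_64_bits(pointer)))  # address preserved
```

On a 32-bit system every address fits in 32 bits, which is why storing pointers in an int works there and only fails after a move to 64-bit.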

Update: did not get much further with the GS3 usersDB as there was a lot more to be done with the translation files for GTT and their processing. The process became clearer thanks to Te Taka’s explanations and his testing at each stage. TMX files will only be needed the first time a translator migrates from GS’s usual translation procedure, which makes use of excel spreadsheet files, to Google’s toolkit. The TMX file will start them up with all the up-to-date translated strings that are available so far in GS3 for the selected language. For the strings that need to be translated and updated, the translator will get a text file that contains the unicode spreadsheet data (as comma separated values, but the file will have a .txt extension instead of .csv in order to preserve the unicode). The translator will then copy the English and columns of the spreadsheet into the GTT. Once their translation work is done, they can send these same columns back by way of the same spreadsheet.
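For readers who have not seen TMX, the following sketch builds a minimal TMX document from pairs of English and translated strings. It is written in Python for illustration and is not the XSLT described above; the header values, the sample strings and the function name are placeholders.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the 'xml:lang' attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, source_lang, target_lang):
    """Return a TMX document (as a string) with one translation unit per pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",   # placeholder value
        "creationtoolversion": "0.1",       # placeholder value
        "segtype": "sentence",
        "o-tmf": "unknown",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source_text, target_text in pairs:
        unit = ET.SubElement(body, "tu")
        for lang, text in ((source_lang, source_text), (target_lang, target_text)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Sample strings are illustrative only, not taken from Greenstone's interface files.
document = build_tmx([("Search", "Rapu"), ("Help", "Āwhina")], "en", "mi")
print(document)
```

Each `tu` element holds one string in every language as a `tuv`, which is what allows a translation memory tool to start from all the existing translations at once.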

Sam’s Greenstone Blog 2/3/2012

2012-03-02T02:21:02Z

This week has had a rather exciting development that several people have been wanting for quite a long time. The 64-bit compatible versions of MG, MGPP and GDBM have been added to the main code, meaning that Greenstone 2 and 3 can now compile successfully on 64-bit systems. The reason this has taken a long time is that the 32-bit and 64-bit versions of MG and MGPP produced seemingly different files when run over the same documents. This was concerning for us, as people might want to move their 32-bit MG/MGPP collections over to a 64-bit Greenstone installation and we suspected that this might not work given the different files. This week we discovered the cause of the difference and are now reassured that files from 32-bit and 64-bit installations can be interchanged without issue.

This week has seen more upgrades to Greenstone 3 as well. One of the features we have been working on for the Pei Jones collection is the ability to zoom “screen” images by using the mouse like a magnifying glass. We have added this into the default Greenstone 3 capabilities. In order for this to work however there needs to be a “screen” (small) and “source” (usually larger) version of the same image.

In general Greenstone 3 now handles paged-images much better. They are now properly displayed at the top of their specific sections. There is also an option to change between text-only, image-only and the default text and image modes, which is available in both the paged style collections as well as normal hierarchy style collections.

Next week will most likely involve more improvements like this as we continue to prepare Greenstone 3 for release.

Anu’s entry for the week ending 2 Dec 2011

2011-12-02T05:38:59Z

Continued on the problem that I thought had been almost resolved last week: getting the batch files in GS2 to handle not only spaces but also brackets in the Greenstone filepaths. The batch files were done, but the perl code needed some correcting too. After inspecting many files in order to see whether they needed correcting, the GS2 code seems to work well on Windows even where Greenstone is installed in a path containing brackets.

This week, I was finally able to return to the problem of jodconverter not interacting well with LibreOffice on Ubuntu 11, whereas the same worked perfectly against OpenOffice on the CentOS machine. We decided that perhaps OpenOffice behaved differently in response to the signals sent by jodconverter. Installing OpenOffice turned out harder than expected and I think I botched it. I ended up having to uninstall all OpenOffice and LibreOffice files and then reinstall all of LibreOffice. At this stage, upon trying jodconverter again, it was found to work fine each time. This seemed to confirm the suspicion that some updates to Ubuntu may have messed up some libraries or something, breaking LibreOffice a little.

However, despite things now working again, Sam wondered, very correctly, whether a user’s experience would be this convoluted or whether it would work straight away for them. He suggested trying out a VM of Ubuntu 11. Which is what I did. It was my first VM installation, and after installing an Ubuntu 11.10 VM (which comes with LibreOffice) on Sam’s Windows 7 machine, Greenstone with the open-office extension fortunately worked fine on a sequence of Word documents.

On Friday, got round to Diego’s long-standing question at last: about the possibility of a single metadata.xml at the import level which defines the metadata for all files in import’s subfolders. Dr Bainbridge had already confirmed earlier that this was indeed possible, but the question was how the metadata.xml ought to specify the path to the files in the subfolders, especially if there were spaces in the path. After a series of incremental tests, it was found to be possible even then, and the solution rather straightforward. Hopefully it will work for Diego also.

There was some translation work, and a few further questions on the mailing list to look at, before I finally got round to considering Michael Goodwin’s complex question on the setup.exe generated by an Export To CD-ROM operation failing on Windows 7 on 64 bit. A preliminary successful test on a Windows 7 machine turned out to be misleading: I had assumed it was a 64 bit machine but it turned out to be 32 bit after all. I will have to get back to trying this out next week. All this fine-tuning is bound to pay off in the upcoming perfected release of Greenstone 2: version 2.86.

Anu’s entry for week ending 26 Nov 2011

2011-11-28T01:13:38Z

For the last two weeks, I was mainly learning the practical side of how to handle the Greenstone translations. Mainly how to generate the spreadsheets for translators to use, though there was also the opportunity to learn how to handle translated spreadsheets. Besides that, I had a go at answering some questions on the mailing list and uploaded the updates to the ACKU and AREU collections.

On the final 3 days, got round to working on getting the batch files in GS2 to handle not only spaces but also brackets in the Greenstone filepaths. There is still a final problem to resolve before the changes can be committed, but the Greenstone web server is now back to working again, despite Greenstone being installed in a path with brackets (and spaces). There’s even some allowance made in the makegs2.bat script–which is used to compile up GS2–to get apache to compile up even in those instances of there being spaces or brackets in the filepaths it works with. Fortunately, the change could be made in the makegs2.bat itself: it sets the command prompt in which Greenstone is being compiled up to be in short-filenames mode. This then is the situation that the apache compile scripts inherit also, making any space/bracket in the long pathname irrelevant.