docx processing without libreoffice: UnknownConverterPlugin + Apache Tika

ak19. Friday, August 7th, 2020

I didn’t get to fix the bug found last week: problems transferring files with non-ASCII filenames between a Windows client-GLI and a remote GS3 server on (Ubuntu) Linux. My supervisor said this may be a bigger issue than I thought, since filesystems on different operating systems (OS) may use different encodings: for example, I believe Ubuntu uses UTF-8 and Windows UTF-16. Such differences may cause characters in filenames to be lost or altered when files are transferred from one OS to another. My supervisor thinks the problem needs deeper thought and suggested that I log a ticket on trac for when we can spend more time investigating whether it can be solved in the first place.
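As a rough illustration of how an encoding mismatch can mangle a filename, here is a minimal Python sketch. It is not Greenstone code and not necessarily the cause of this particular bug: it just turns a name into bytes with one encoding and reads it back with another. The filename and the choice of cp1252 as the mismatched encoding are assumptions made for the example.

```python
# Illustrative only: shows how a filename can be altered or lose characters
# when the sending and receiving sides disagree about the encoding.

def roundtrip(name: str, write_encoding: str, read_encoding: str) -> str:
    """Encode name as the sender would, then decode it as the receiver would."""
    raw = name.encode(write_encoding)
    # errors="replace" substitutes U+FFFD for bytes the receiver cannot decode
    return raw.decode(read_encoding, errors="replace")

original = "caf\u00e9.docx"  # hypothetical filename containing an e-acute

same = roundtrip(original, "utf-8", "utf-8")      # both sides agree
altered = roundtrip(original, "utf-8", "cp1252")  # receiver guesses wrongly
# Forcing the name into ASCII loses the accented character altogether
lossy = original.encode("ascii", errors="replace").decode("ascii")

# ascii() keeps the output printable on any console
print(ascii(same))
print(ascii(altered))
print(ascii(lossy))
```

In the mismatched case the single accented character comes back as two unrelated characters, and in the ASCII case it is replaced by a question mark, which matches the "lost or altered" symptoms described above.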

Other than that failed attempt at a bugfix, I did not contribute any Greenstone work for the overall community of GS users this week. Some work from back in June is worth writing up though.

Greenstone plugins often use existing open-source software to process complex document types like PDFs and older versions of Word documents, so that they become searchable in the built collection. However, for Greenstone to process newer Word documents, which have the docx file extension, users have so far needed either the free LibreOffice or the payware Microsoft Word (part of the Office suite) installed. LibreOffice is large, and it can sometimes be hard to get it to run headless successfully or to restart it in headless mode. After I saw some messages on the mailing list about this, I searched online to see whether new command line tools had appeared in the meantime that could convert docx files to HTML or text. If so, Greenstone’s existing UnknownConverterPlugin could be used to run that command line tool to extract text from docx files, so that they get indexed and become searchable in the built collection.

To gain a better understanding of generally using the UnknownConverterPlugin, refer to the tutorial at http://files.greenstone.org/tutorial/gs3-current/en/unknown_converter_plugin.htm

Apache Tika is Apache’s open-source software for extracting text from a wide range of (textual) document types, one of which is docx. While one can write code that calls Apache Tika’s API, the ready-made jar file contained everything we needed to get Greenstone to index the text in docx files.

All I had to do was configure the UnknownConverterPlugin to make use of the Apache Tika jar dropped into GS3’s extension folder, so that docx files could be processed and indexed (for searchability) by Greenstone without requiring users to install LibreOffice.

The UnknownConverterPlugin has been officially available since Greenstone 3.09, so that 3.09 users can also start using Tika with the plugin, by

1. creating a subfolder called “tika” inside their GS3-install-dir/gs2build/ext,

2. downloading the Apache-Tika binary jar file from https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.24.1.jar (or by visiting http://trac.greenstone.org/browser/main/trunk/greenstone2/ext/tika/tika-app-1.24.1.jar and clicking the link labelled “downloading” there), then dropping the downloaded jar file into GS3/gs2build/ext/tika

3. and then configuring an UnknownConverterPlugin instance for any collection that needs docx processing as follows:

All 3 of the above steps are already set up for you in the GS3 binaries generated every night and available from http://www.greenstone.org/caveat-emptor/
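To get a feel for what the conversion step does before wiring it into Greenstone, Tika can be run on a docx file directly. The Python sketch below is not part of Greenstone and is not the plugin configuration itself. It builds and runs a "java -jar tika-app-1.24.1.jar --text FILE" command, where --text is tika-app's plain-text output option and --html is the XHTML alternative. The install path and document name are hypothetical placeholders.

```python
import subprocess
from pathlib import Path
from typing import List

def tika_command(jar: Path, document: Path, as_html: bool = False) -> List[str]:
    """Build the tika-app command line that writes extracted text to stdout."""
    output_flag = "--html" if as_html else "--text"
    return ["java", "-jar", str(jar), output_flag, str(document)]

def extract_text(jar: Path, document: Path) -> str:
    """Run Tika and return what it printed (requires Java and the jar)."""
    result = subprocess.run(
        tika_command(jar, document),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Hypothetical locations: adjust to your own install and document.
    jar = Path("GS3/gs2build/ext/tika/tika-app-1.24.1.jar")
    doc = Path("example.docx")
    if jar.exists() and doc.exists():
        print(extract_text(jar, doc))
    else:
        # Nothing to run here, so just show the command that would be used.
        print(" ".join(tika_command(jar, doc)))
```

If the command prints sensible text for one of your docx files, the jar and your Java installation are working, which narrows down any later problems to the plugin configuration.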

Untried: Greenstone 2 users can try grabbing a nightly GS2 binary from http://www.greenstone.org/caveat-emptor/ (as it should also come with an UnknownConverterPlugin). The nightly GS2 binaries should already have an ext/tika subfolder within the GS2 installation folder, containing the tika jar file. Otherwise you can create this folder yourself and download the tika jar file into that location as in step 2. Next, configure your UnknownConverterPlugin as in step 3 above before building your GS2 collection containing docx files.

You’re not limited to processing docx files by using UnknownConverterPlugin with Tika. You can process other textual doc types, whether already supported by existing Greenstone plugins or not, by configuring a new instance of UnknownConverterPlugin and setting the mime_type, srcicon, process_extension (and file_format) fields appropriately for that doctype.

The above instructions on using the UnknownConverterPlugin have now also been added to the Greenstone wiki, so you don’t have to track down this blog page if you want to revisit the instructions for using Tika with the UnknownConverterPlugin. Simply search the Greenstone wiki at wiki.greenstone.org for “UnknownConverterPlugin” or “Tika”.

Update for week ended 31 July 2020

ak19. Friday, July 31st, 2020

Hi again GS blog readers,

This week has been eventful but only involved a little work on general Greenstone code. That was mainly in the form of a fix to allow the recently introduced “Export/Convert (GLI) metadata to CSV” GLI feature to work on Java 6 and 7 too, and not just Java 8. This was a necessary change, as our nightly builds are on systems where Java 6 and 7 are installed. Thanks to Kathy’s keen eye, which caught the cause of the nightly Linux GS3 binaries not going through, this was solvable and solved.

There’s a new bug that I discovered this week when working with someone who uses client-GLI to connect to their remote server. Unfortunately, this is a dreaded encoding bug and may take some time to resolve (or, if I’m really lucky, will be easy and quick to solve!) I will be looking at it as soon as this blog post is done.

During the past few months and ongoing, I’ve been assigned to help an institution that hired the university’s CS department to set up their own Digital Library. So I’ve been learning hands-on what it is like to use Greenstone for a real-world purpose, rather than just using Greenstone when testing it for a release or investigating bugs discovered by mailing list users, or just using the basic collection design. It’s been quite full-on work, but it simultaneously benefits the larger Greenstone community: all bugs discovered so far have been fixed for everyone and are available in the nightly releases. Most of the bugs discovered had to do with client-GLI and its interaction with the remote Greenstone 3 server. As a result of this process, client-GLI has become far more robust to use.

A consequence of working with our colleagues at that institution to set up a GS3 DL for them is that I had to understand format statements and collection design, so I could pass my understanding on to them. While most of our remote conferencing sessions mixed general needs with needs specific to the institution, I’d created 2 general pre-recorded tutorials covering all those major topics, which I believe could be of use to anyone learning to become a Greenstone librarian. The first video covers collection design from scratch: configuring browsing classifiers and search indexes, configuring the UnknownPlugin for MP4s, associating files/equivalent documents, and understanding and working with format statements, including creating basic reusable templates in the global format statement. The second video covers using MetadataCSVPlugin with document metadata in the form of a CSV spreadsheet and ends by going over how to add collection/document level security (password protection).

The first video is 3.5 hours long and at present shows some sensitive content, in the form of passwords and private server connection details, that would need editing out before it can be made public. The second video clocks in at 1 hour and is more or less ready for viewing. However, both suffer from my embarrassing grating voice and from format statements being very slow to edit back then, and I’m not comfortable that there’s a brief shot of my head in the first video. After editing the unwanted segments out of the first video, a task I have no experience in yet, I’d ideally like to subtitle the videos and then strip out the audio altogether, which would also allow the subs to be translated, assuming the videos are found useful. I don’t know when I’ll have the time for this, but I’ll do my best to work on it.

That about covers what I wanted to blog on this week. Until next time!

Improvements to remote GS3 and client-GLI

ak19. Friday, July 24th, 2020

In the past several weeks, using client-GLI running against a remote Greenstone 3 in a real-world setting allowed many bugs to be found and fixed and some hopefully-useful new features to be added to (client-)GLI.

No official release of GS3 containing these features and bugfixes is available yet, but those described below are/will be available in the nightly GS3 binaries at http://www.greenstone.org/caveat-emptor/ from today onward. The Linux nightly binaries are temporarily down and we’ll try to get them back up.

Among the work done:

(1) Better remote GS3/client-GLI support for different sites and servlet names.

Once you’ve adjusted GS3/web/WEB-INF/servlets.xml as in the GS3 customisation tutorials and set the default servlet in GS3/build.properties, in client-GLI, go to File > Preferences > Connection and choose the Site then Servlet name. Click Apply and OK. Now once you go to File > Open collection, you will see all the collections available in this site and previewing will use the correct servlet.

Rebuilding will activate the collection on the selected servlet so that previewing will now at last work for non-default site and servlet.

Fixed a bug when swapping between different remote GS3 sites that client-GLI can connect to: in the past, client-GLI would get stuck trying to load the previous session’s site and servlet, even if they didn’t exist in the remote GS3 that client-GLI was attempting to connect to. Now it will resort to the default site and servlet if the stored one is not present in the remote GS3 server the client is connecting to.
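The fallback behaviour described above can be pictured roughly as follows. This Python sketch is not GLI's actual code; the function name, the dictionary of sites to servlets, and the sample values are all hypothetical.

```python
def choose_connection(stored, available, default):
    """Pick the (site, servlet) pair to use when connecting.

    stored    -- (site, servlet) remembered from the previous session, or None
    available -- dict mapping each site on the remote server to its servlets
    default   -- (site, servlet) to fall back on
    """
    if stored is not None:
        site, servlet = stored
        if servlet in available.get(site, []):
            return stored
    # Stored pair is missing on this server: fall back rather than getting stuck.
    return default

# Made-up server layout and defaults, for illustration only
remote = {"localsite": ["library", "halftone-library"]}
fallback = ("localsite", "library")

print(choose_connection(("othersite", "mylib"), remote, fallback))
print(choose_connection(("localsite", "halftone-library"), remote, fallback))
```

The first call uses a stored pair that the server does not offer, so the default is returned; the second call keeps the stored pair because it is still valid.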

(2) Improvements to working with collectionConfig.xml through (client-)GLI:

- client-GLI (and GLI) now properly saves edits made to the collectionConfig.xml file through the Edit > Edit collectionConfig.xml menu. Furthermore, these changes are immediately reflected in the (client-)GLI interface, instead of GLI reloading the collection as before (which used to take especially long in client-GLI)

- proper support for HTML formatted text in the “about” page description for a collection: Format > General > Format Description field

Now, when you edit the collectionConfig.xml file through the Edit > Edit collectionConfig.xml menu, any HTML in your collection description is still preserved as before. And when you preview, the GS3 reader interface also preserves it as you intended.

(3) Can successfully create new and edit existing Metadata Sets through client-GLI now.

In the past it would let you create a metadata set, but then there were issues when you tried to edit an existing one. Also in the past, creating a new metadata set would cause subtle issues that you’d only actually notice if you tried to visit the File > Preferences > Connection tab afterward (when client-GLI would freeze).

(4) More complete and improved support for Metadata spreadsheet CSV files:

- Dr Bainbridge extended MetadataCSVPlugin to allow multi-valued metadata fields, and his improvements have now been incorporated into the current Greenstone. To use this in GS3, configure the plugin with the new “metadata_value_separator” field set to “\|”. Then, in your CSV metadata spreadsheet cells, use the vertical bar (“|”) to separate multiple metadata values in a column denoting a metadata field.

- Fixed bugs in the (client-)GLI rightclick > Replace feature that occurred when you attempted to replace an existing file with another file of the same name. Although this fixes the feature in general, it is particularly useful when you want to update your metadata CSV spreadsheet.

Update from a week later: When you’re using Replace to replace a file with an updated identically named one, GLI always popped up a message allowing you to cancel. However, in the past, even if you cancelled, client-GLI would continue to upload the replacement file to the GS3 server where the replacement would be performed. The remote GS3 files and local files on the client machine would then get out of sync. But with this bug fixed, if you now cancel the Replace operation on replacing a file with an identically named one, client-GLI will no longer send the replacement file to the remote server.
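For the multi-valued metadata convention in (4), here is a minimal Python sketch of how spreadsheet cells separated by a vertical bar could be split into individual values. It is not the MetadataCSVPlugin's own code, and the column names and sample rows are made up. The separator is written as "\|" on the assumption that it is treated as a regular expression, where an unescaped "|" would mean alternation; that reading is inferred from the escaped form given above, not confirmed.

```python
import csv
import io
import re

SEPARATOR = r"\|"  # same escaped form as the plugin option; used here as a regex

def read_multivalued(csv_text: str, separator: str = SEPARATOR) -> list:
    """Return one dict per row, mapping each column to a list of values."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            # Split each cell on the separator and drop empty pieces
            field: [v.strip() for v in re.split(separator, cell) if v.strip()]
            for field, cell in row.items()
        })
    return rows

# Hypothetical spreadsheet content: a filename plus two metadata columns.
sample = (
    "Filename,dc.Title,dc.Creator\n"
    "report.docx,Annual Report,Alice Smith|Bob Jones\n"
    "notes.docx,Meeting Notes,Carol White\n"
)

for record in read_multivalued(sample):
    print(record)
```

Cells without a vertical bar simply become single-item lists, so single-valued and multi-valued columns can be handled the same way.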

(5) New (client-)GLI features:

a. Metadata to CSV options:

- File > Export to metadata CSV: for a collection you have open, this option creates a metadata.csv file in a location of your choosing containing all the metadata you can see in GLI, including inherited metadata. If the metadata.csv file you selected already exists, then the metadata you see in GLI is amalgamated with the selected CSV file. This option allows you to back up your collection’s metadata to a spreadsheet file. There is NO RECONVERT feature to convert back from metadata CSV format to metadata.xml files. But you can build your collection with metadata from the CSV spreadsheet. See the next option, which explains how to redo your collection to work with metadata from a spreadsheet instead of using metadata in GLI/metadata.xml files.

- File > Convert to metadata CSV: for the collection you have open, this option creates a metadata.csv file in your collection’s “import” folder by default (which is the best location), by destructively removing all the metadata from the collection’s metadata.xml files (in other words, by removing the metadata you see in GLI) and shifting them out into the selected metadata.csv file. If you selected an existing metadata.csv file, then any metadata you currently see in GLI is amalgamated with the selected CSV file, before it gets removed from GLI/the metadata.xml files. Selecting this option prepares your collection so that you can switch over to using a MetadataCSVPlugin, configured with metadata_value_separator field set to “\|”, to then rebuild your collection producing the same results as before.

b. Collection security skeleton elements, as discussed at http://wiki.greenstone.org/doku.php?id=en:user_advanced:security, can now be added through (client-)GLI’s Edit > Edit collectionConfig.xml menu option. At the bottom of the Config File Editor dialog that appears, you will find a small toolbar that allows you to choose which (skeleton) XML <security> element to add:

- to hide the current collection,

- to add the appropriate <security> element to make the entire collection private except for one or more groups you specify,

- to add the appropriate <security> element to make all the docs in the collection private except for the groups you specify (adds a <security> element),

- to add the appropriate <security> element to make select docs in the collection private except for the groups you specify (where you can then specify which docs as explained on the wiki link already provided),

- to add a further <documentSet> element to the existing <security> element

One issue with this remains: if you want to undo the addition of a security element, you have to press the Undo button twice at present. I haven’t yet figured out why this is. (If you press Undo once, the entire XML content of your collectionConfig.xml becomes empty, so you’ll naturally press Undo again in alarm and then it will look right again.)

(6) Possibly one of the best client-GLI improvements of all: Editing format statements in client-GLI is no longer excruciatingly slow

In the past you had to wait several seconds for every character you entered, so back then it was better to edit your format statements outside client-GLI and paste them back in. With the bugfix in place, you can finally edit format statements easily and directly in client-GLI.

Other: 

* incorporated perl 5.30 support needed for Ubuntu 20.04 LTS

* bundling CGI.pm perl module now, so hopefully no more “ERROR: Can’t locate CGI.pm in @INC (you may need to install the CGI module)” messages

* I believe I’ve now also fixed a bug that caused deadlocks in client-GLI which could occur with some popups when some remote Greenstone3 action goes wrong

* better error display when something on remote GS3 goes wrong: instead of a giant window holding a giant error message, which could exceed your screen size, you get a decent-sized dialog with a scrollable pane

* bugfix to “replace srcdoc with html” feature available on rightclicking a doc in GLI so that it now works again in (client-)GLI.

* TextPlugin will now properly preserve manually formatted text when it embeds text content in <pre> tags. Leading whitespace, even inside pre tags, used to get clobbered, so tab spacing was lost.

Greenstone 3.09 out now!

ak19. Thursday, May 30th, 2019

At long last, a new and improved version of Greenstone 3 is available for all to download from http://www.greenstone.org/download

Greenstone 3.09 binaries are available for Windows, 32 and 64 bit Linux, and a single Mac binary for High Sierra/10.13 and Mojave/10.14.

The up-to-date tutorials are at http://wiki.greenstone.org/doku.php?id=en:tutorials (click the Greenstone 3 tab)

The release notes are at http://wiki.greenstone.org/doku.php?id=en:release:3.09_release_notes

The release notes will also cover some of the major areas of Greenstone 3 that have been changed and improved with this release.

We welcome all to download the Greenstone 3.09 binary for your OS today!

Greenstone 2.87 released

ak19. Monday, October 2nd, 2017

We recommend that Greenstone users shift to Greenstone 3, as that’s the future of Greenstone.

But for those who really need Greenstone 2, we’ve now brought out a maintenance release of it with Greenstone 2.87.

You can grab the new GS 2.87 binaries, source components and source distribution packages from the downloads page.

The release notes include the installation instructions. And here are the Greenstone 2 tutorials; make sure to select the “Greenstone 2” tab at the top.

For those who need to work from source, the release notes also contain links to pages detailing the steps for compiling the source components and distributions.

As usual, write us if you encounter any issues with GS2.87.

Greenstone 3.08 released

kjdon. Thursday, May 4th, 2017

Greenstone 3.08 was released November 2016 (oops, what a late blog entry!)

You can grab the binaries, source components and source distributions from the downloads page.

The release notes include the installation instructions. There are instructions for helping you port Greenstone2 collections to Greenstone3.

To familiarise yourself with how things are done in Greenstone3, follow along with the Greenstone 3 tutorials.

For those who need to work from source, the release notes also contain links detailing the steps for compiling the source components and distributions.

Have fun with Greenstone 3.08 and write us if you discover any bugs or have any problems.

GTI has moved

kjdon. Thursday, July 7th, 2016

The GTI (Greenstone Translator Interface) is a Greenstone installation providing a web-based facility to translate the various Greenstone interfaces and websites, including the Greenstone2 and Greenstone3 runtime web interfaces, GLI’s interface, the installer interface, and the greenstone.org website. This facility has been moved to a faster and more reliable server. You can access the new interface at http://gti.greenstone.org. The status page showing the translation status of each module, for each language, is now available at http://gti.greenstone.org/etc/status.html.

TWSO library moved

kjdon. Thursday, July 7th, 2016

The TWSO library is a collection of concert programmes from the Trust Waikato Symphony Orchestra. This has been moved to a new server, so update your bookmarks if you use this collection. It is now available at http://community.nzdl.org/greenstone3/twso/collection/twso/page/about. Note, the http://nzdl.org/twso shortcut now points to the new site. The new server is much faster and more reliable than the previous one.

Greenstone 3.07 released

ak19. Wednesday, September 9th, 2015

We’ve finally released the binaries, source components and source distributions for Greenstone 3.07.

You can grab them from the downloads page.

The release notes include the installation instructions. There are instructions for helping you port Greenstone2 collections to Greenstone3.

To familiarise yourself with how things are done in Greenstone3, follow along with the Greenstone 3 tutorials.

For those who need to work from source, the release notes also contain links detailing the steps for compiling the source components and distributions.

Have fun with Greenstone 3.07 and write us if you discover any bugs or have any problems.

GS 3.07rc2 out now: Mac Mountain Lion binary now runs on Mac Yosemite out of the box

ak19. Wednesday, August 26th, 2015

Good news for Mac Yosemite users out there (and possibly Mac Mavericks users too): the current Greenstone 3.07 release-candidate binaries now include a JRE with the Mac Mountain Lion binary. This allows the Greenstone 3.07-rc2 Mountain Lion installer to run out of the box on Mac Yosemite machines. You may still need to change the security settings on your Mac to allow you to open dmg binary files not created by Apple, as this is a new security feature on Macs, but otherwise the Greenstone installer and the Greenstone applications, once installed, should run fine on Yosemite too.

We’ve decided to come out with another intermediate release candidate, because the major changes we’ve made are specific to the Mac binary this time. That means there is still time for Greenstone users to find and report bugs, and for translators to add further translations and send these back to us for inclusion in the upcoming official 3.07 release. Therefore, please do try out the current 3.07 rc2 release by heading on over to the Snapshots page, and let us know how the binaries fare, so we can improve them for the official release.