Archive for the ‘Greenstone3’ Category

docx processing without libreoffice: UnknownConverterPlugin + Apache Tika

ak19. Friday, August 7th, 2020.

I didn’t get to fix the bug found last week: issues transferring files with non-ASCII filenames between a Windows client-GLI and a remote GS3 server on (Ubuntu) Linux. My supervisor said this may be a bigger issue than I thought, since the filesystems of different operating systems (OS) may use different encodings: I believe Ubuntu uses UTF-8 and Windows UTF-16, for example. Such differences can cause characters in the names of files transferred from one OS to another to be lost or altered. My supervisor thinks the problem needs deeper thought and suggested that I log a ticket on trac, for when we can spend more time investigating whether it can be solved at all.
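To illustrate the kind of damage an encoding mismatch can do to a filename, here is a minimal Python sketch. The filename is made up, and the choice of Latin-1 as the "wrong" encoding is only an assumption for demonstration; it is not a diagnosis of what client-GLI or the GS3 server actually does.

```python
# A made-up non-ASCII filename ("café.txt"), written with an escape so the
# example does not depend on the encoding of this source file.
filename = "caf\u00e9.txt"

# The same name serialised under two different encodings.
utf8_bytes = filename.encode("utf-8")
utf16_bytes = filename.encode("utf-16-le")


def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw filename bytes, substituting a marker for undecodable input."""
    return raw.decode(encoding, errors="replace")


# Decoding with the encoding that produced the bytes round-trips cleanly.
round_trip = decode_as(utf8_bytes, "utf-8")

# Decoding UTF-8 bytes with a different encoding (Latin-1 here, as an
# illustrative assumption) silently alters the accented character.
mangled = decode_as(utf8_bytes, "latin-1")

print(len(utf8_bytes), len(utf16_bytes))
print(round_trip, mangled)
```

The byte sequences differ in length and content, so a receiver that guesses the wrong encoding ends up with a different filename from the one that was sent.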

Other than that failed attempt at a bugfix, I did not contribute any Greenstone work for the overall community of GS users this week. Some work from back in June is worth writing up though.

Greenstone plugins often use existing open-source software to process complex document types like PDFs and older versions of Word documents, so that they become searchable in the built collection. However, for Greenstone to process newer Word documents, which have the docx file extension, users have so far needed to have either the free LibreOffice or the payware Microsoft Word of the Office Suite installed. LibreOffice is large, and it can sometimes be hard to get it to run headless successfully or to restart it in headless mode. After I saw some messages on the mailing list about this, I searched online to see whether new command line tools had appeared in the meantime that could convert docx files to html or text. If so, Greenstone’s existing UnknownConverterPlugin could be used to run that command line tool to extract text from docx files, so that they get indexed and become searchable in the built collection.

To gain a better understanding of generally using the UnknownConverterPlugin, refer to the tutorial at http://files.greenstone.org/tutorial/gs3-current/en/unknown_converter_plugin.htm

Apache Tika is Apache’s open-source software for extracting text from a wide range of (textual) document types, one of which is docx. While one can write code that calls Apache Tika’s API, their ready-made jar file contained everything we needed to get Greenstone to index the text in docx files.

All I had to do was configure the UnknownConverterPlugin to make use of the Apache Tika jar dropped into GS3’s extension folder, so that docx files could be processed and indexed (for searchability) by Greenstone without requiring users to install LibreOffice.

The UnknownConverterPlugin has been officially available since Greenstone 3.09, so 3.09 users can also start using Tika with the plugin, by

1. creating a subfolder called “tika” inside their GS3-install-dir/gs2build/ext,

2. downloading the Apache-Tika binary jar file from https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.24.1.jar (or by visiting http://trac.greenstone.org/browser/main/trunk/greenstone2/ext/tika/tika-app-1.24.1.jar and clicking the link labelled “downloading” there), then dropping the downloaded jar file into GS3/gs2build/ext/tika

3. and then configuring an UnknownConverterPlugin instance for any collection that needs docx processing as follows:

All 3 of the above steps are already set up for you in the GS3 binaries generated every night and available from http://www.greenstone.org/caveat-emptor/
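For a rough idea of what a converter command built on the Tika jar does, here is a minimal Python sketch that runs tika-app on one document and saves the HTML it writes to standard output. It is not the UnknownConverterPlugin configuration itself: the function names, the relative jar path and the input filename are hypothetical, and the only Tika feature assumed is the tika-app command line option --html.

```python
import subprocess
from pathlib import Path


def build_tika_command(tika_jar: Path, source_file: Path) -> list[str]:
    """Build the argument list for running tika-app on one document.

    The "--html" option asks tika-app to write the document's content
    as HTML to standard output.
    """
    return ["java", "-jar", str(tika_jar), "--html", str(source_file)]


def convert_to_html(tika_jar: Path, source_file: Path, output_file: Path) -> None:
    """Run Tika and save its standard output as an HTML file (needs Java installed)."""
    command = build_tika_command(tika_jar, source_file)
    result = subprocess.run(command, capture_output=True, check=True)
    output_file.write_bytes(result.stdout)


# Hypothetical locations, for illustration only.
jar = Path("gs2build") / "ext" / "tika" / "tika-app-1.24.1.jar"
cmd = build_tika_command(jar, Path("report.docx"))
print(" ".join(cmd))
```

Calling convert_to_html(jar, Path("report.docx"), Path("report.html")) would then produce an HTML file, provided Java and the jar are actually present at those locations.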

Untried: Greenstone 2 users can try grabbing a nightly GS2 binary from http://www.greenstone.org/caveat-emptor/ (as it should also come with an UnknownConverterPlugin). The nightly GS2 binaries should already have an ext/tika subfolder within the GS2-installation folder, containing the tika jar file. Otherwise you can create this folder yourself and download the tika jar file into that location as in step 2. Next configure your UnknownConverterPlugin as in step 3 above before building your GS2 collection containing docx files.

You’re not limited to processing docx files by using UnknownConverterPlugin with Tika. You can process other textual doc types, whether already supported by existing Greenstone plugins or not, by configuring a new instance of UnknownConverterPlugin and setting the mime_type, srcicon, process_extension (and file_format) fields appropriately for that doctype.

The above instructions on using the UnknownConverterPlugin have now also been added to the Greenstone wiki, so you don’t have to track down this blog page if you want to revisit the instructions for using Tika with the UnknownConverterPlugin. Simply search the Greenstone wiki at wiki.greenstone.org for “UnknownConverterPlugin” or “Tika”.

Improvements to remote GS3 and client-GLI

ak19. Friday, July 24th, 2020.

In the past several weeks, using client-GLI running against a remote Greenstone 3 in a real-world setting allowed many bugs to be found and fixed and some hopefully-useful new features to be added to (client-)GLI.

No official release of GS3 containing these features and bugfixes is available yet, but those described below are/will be available in the nightly GS3 binaries at http://www.greenstone.org/caveat-emptor/ from today onward. The Linux nightly binaries are temporarily down and we’ll try to get them back up.

Among the work done:

(1)  Better remote GS3/client-GLI support for different sites and servlet names.

Once you’ve adjusted GS3/web/WEB-INF/servlets.xml as in the GS3 customisation tutorials and set the default servlet in GS3/build.properties, in client-GLI, go to File > Preferences > Connection and choose the Site then Servlet name. Click Apply and OK. Now once you go to File > Open collection, you will see all the collections available in this site and previewing will use the correct servlet.

Rebuilding will activate the collection on the selected servlet so that previewing will now at last work for non-default site and servlet.

Fixed a bug with swapping between the different remote GS3 sites that client-GLI can connect to: in the past, client-GLI would get stuck trying to load the previous session’s site and servlet, even if these didn’t exist in the remote GS3 that client-GLI was attempting to connect to. Now it falls back to the default site and servlet if the stored ones are not present in the remote GS3 server the client is connecting to.

(2) Improvements to working with collectionConfig.xml through (client-)GLI:

– client-GLI (and GLI) now properly saves edits made to the collectionConfig.xml file through the Edit > Edit collectionConfig.xml menu. Furthermore, these changes are immediately reflected in the (client-)GLI interface, without GLI having to reload the collection as before (which used to take especially long in client-GLI)

– proper support for HTML formatted text in the “about” page description for a collection: Format > General > Format Description field

Now, when you edit the collectionConfig.xml file through the Edit > Edit collectionConfig.xml menu, any HTML in your collection description is still preserved as before. And when you preview, the GS3 reader interface also preserves it as you intended.

(3) Can successfully create new and edit existing Metadata Sets through client-GLI now.

In the past it would let you create a metadata set, but then there were issues when you tried to edit an existing one. Also in the past, creating a new metadata set would cause subtle issues that you’d only actually notice if you tried to visit the File > Preferences > Connection tab afterward (when client-GLI would freeze).

(4) More complete and improved support for Metadata spreadsheet CSV files:

– MetadataCSVPlugin was extended to allow multi-valued metadata fields by Dr Bainbridge and his improvements to the plugin have now been incorporated into the current Greenstone. The MetadataCSVPlugin included in GS3 allows multi-valued metadata fields as follows: configure the plugin now with the new “metadata_value_separator” field set to “\|”. Then in your CSV metadata spreadsheet cells, use the vertical bar (“|”) to separate multiple metadata values for a particular column denoting a metadata field.

– Fixed bugs in the (client-)GLI rightclick > Replace feature on a document, which occurred when you attempted to replace an existing file with another file of the same name. Although this fixes the feature in general, it is also useful for when you want to update your metadata CSV spreadsheet.

Update from a week later: When you’re using Replace to replace a file with an updated identically named one, GLI always popped up a message allowing you to cancel. However, in the past, even if you cancelled, client-GLI would continue to upload the replacement file to the GS3 server where the replacement would be performed. The remote GS3 files and local files on the client machine would then get out of sync. But with this bug fixed, if you now cancel the Replace operation on replacing a file with an identically named one, client-GLI will no longer send the replacement file to the remote server.
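As a small illustration of the multi-valued metadata cells described under (4), here is a minimal Python sketch that reads a CSV spreadsheet and splits each cell on the vertical bar. The spreadsheet contents and column names are made up, and treating the separator “\|” as a regular expression is my assumption about why the bar is escaped; this is not MetadataCSVPlugin’s own code.

```python
import csv
import io
import re

# Made-up metadata spreadsheet; the column names are illustrative only.
csv_text = (
    "Filename,dc.Title,dc.Subject\n"
    "report.docx,Annual Report,Finance|Planning|2020\n"
)

# Same pattern as the metadata_value_separator setting: an escaped vertical bar,
# assumed here to be interpreted as a regular expression.
separator = re.compile(r"\|")


def split_values(cell: str) -> list[str]:
    """Split one spreadsheet cell into its individual metadata values."""
    return [value.strip() for value in separator.split(cell) if value.strip()]


rows = list(csv.DictReader(io.StringIO(csv_text)))
subjects = split_values(rows[0]["dc.Subject"])
titles = split_values(rows[0]["dc.Title"])

print(subjects)
print(titles)
```

A cell without a vertical bar simply yields a single value, so single-valued and multi-valued columns can sit side by side in the same spreadsheet.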

(5) New (client-)GLI features:

a. Metadata to CSV options:

– File > Export to metadata CSV: for a collection you have open, this option creates a metadata.csv file in a location of your choosing, containing all the metadata you can see in GLI, including inherited metadata. If the metadata.csv file you selected already exists, then the metadata you see in GLI is amalgamated with the selected CSV file. This option allows you to back up your collection’s metadata to a spreadsheet file. There is NO RECONVERT feature to convert back to metadata.xml files from metadata csv format. But you can build your collection with metadata from the CSV spreadsheet. See the next option below, which explains how to redo your collection to work with metadata from a spreadsheet instead of using metadata in GLI/metadata.xml files.

– File > Convert to metadata CSV: for the collection you have open, this option creates a metadata.csv file in your collection’s “import” folder by default (which is the best location), by destructively removing all the metadata from the collection’s metadata.xml files (in other words, by removing the metadata you see in GLI) and shifting them out into the selected metadata.csv file. If you selected an existing metadata.csv file, then any metadata you currently see in GLI is amalgamated with the selected CSV file, before it gets removed from GLI/the metadata.xml files. Selecting this option prepares your collection so that you can switch over to using a MetadataCSVPlugin, configured with metadata_value_separator field set to “\|”, to then rebuild your collection producing the same results as before.

b. Collection security skeleton elements, as discussed at http://wiki.greenstone.org/doku.php?id=en:user_advanced:security, can now be added through (client-)GLI’s Edit > Edit collectionConfig.xml menu option. At the bottom of the Config File Editor dialog that appears, you will find a small toolbar that allows you to choose which (skeleton) XML <security> element to add:

– to hide the current collection,

– to add the appropriate <security> element to make the entire collection private except for one or more groups you specify,

– to add the appropriate <security> element to make all the docs in the collection private except for the groups you specify (adds a <security> element),

– to add the appropriate <security> element to make select docs in the collection private except for the groups you specify (where you can then specify which docs as explained on the wiki link already provided),

– to add a further <documentSet> element to the existing <security> element

One issue with this remains: if you want to undo the addition of a security element, you have to press the Undo button twice at present. I haven’t yet figured out why this is. (If you press Undo once, the entire XML content of your collectionConfig.xml becomes empty, so you’ll naturally press Undo again in alarm and then it will look right again.)

(6) Possibly one of the best client-GLI improvements of all: Editing format statements in client-GLI is no longer excruciatingly slow

In the past you had to wait several seconds for every character you entered, so back then it was better to edit your format statements outside client-GLI and paste them back in. With the bugfix in place, you can finally edit format statements directly in client-GLI with ease.

Other: 

* incorporated perl 5.30 support needed for Ubuntu 20.04 LTS

* bundling CGI.pm perl module now, so hopefully no more “ERROR: Can’t locate CGI.pm in @INC (you may need to install the CGI module)” messages

* I believe I’ve now also fixed a bug that caused deadlocks in client-GLI which could occur with some popups when some remote Greenstone3 action goes wrong

* better error display when something on the remote GS3 goes wrong: instead of a giant window containing a giant error message, which could potentially exceed your screen size, you get a decent-sized dialog with a scrollable pane

* bugfix to the “replace srcdoc with html” feature, available on rightclicking a doc in GLI, so that it now works again in (client-)GLI.

* TextPlugin will now properly preserve manually formatted text when it embeds text content in <pre> tags. Leading whitespace, even inside pre tags, used to get clobbered before, so tab spaces were lost.

Greenstone 3.09 out now!

ak19. Thursday, May 30th, 2019.

At long last, a new and improved version of Greenstone 3 is available for all to download from http://www.greenstone.org/download

Greenstone 3.09 binaries are available for Windows, 32 and 64 bit Linux, and a single Mac binary for High Sierra/10.13 and Mojave/10.14.

The up to date tutorials are at http://wiki.greenstone.org/doku.php?id=en:tutorials (click the Greenstone 3 tab)

The release notes are at http://wiki.greenstone.org/doku.php?id=en:release:3.09_release_notes

The release notes will also cover some of the major areas of Greenstone 3 that have been changed and improved with this release.

We welcome all to download the Greenstone 3.09 binary for your OS today!

Greenstone 3.08 released

kjdon. Thursday, May 4th, 2017.

Greenstone 3.08 was released in November 2016 (oops, what a late blog entry!)

You can grab the binaries, source components and source distributions from the downloads page.

The release notes include the installation instructions. There are instructions for helping you port Greenstone2 collections to Greenstone3.

To familiarise yourself with how things are done in Greenstone3, follow along with the Greenstone 3 tutorials.

For those who need to work from source, the release notes also contain links detailing the steps for compiling the source components and distributions.

Have fun with Greenstone 3.08 and write us if you discover any bugs or have any problems.

Greenstone 3.07 released

ak19. Wednesday, September 9th, 2015.

We’ve finally released the binaries, source components and source distributions for Greenstone 3.07.

You can grab them from the downloads page.

The release notes include the installation instructions. There are instructions for helping you port Greenstone2 collections to Greenstone3.

To familiarise yourself with how things are done in Greenstone3, follow along with the Greenstone 3 tutorials.

For those who need to work from source, the release notes also contain links detailing the steps for compiling the source components and distributions.

Have fun with Greenstone 3.07 and write us if you discover any bugs or have any problems.

GS 3.07rc2 out now: Mac Mountain Lion binary now runs on Mac Yosemite out of the box

ak19. Wednesday, August 26th, 2015.

Good news for Mac Yosemite users out there (and possibly Mac Mavericks users too): the current Greenstone 3.07 release-candidate binaries now include a JRE with the Mac Mountain Lion binary. This allows the Greenstone 3.07-rc2 Mountain Lion installer to now run out of the box on Mac Yosemite machines. You may still need to change the security settings on your Mac to allow you to open dmg binary files not created by Apple, as this is a new security feature on Macs, but otherwise the Greenstone installer and the Greenstone applications once installed should run fine on Yosemite too.

We’ve decided to come out with another intermediate release candidate, because the major change we’ve made is specific to a Mac binary this time. That means there is still time for Greenstone users to find and report bugs, and for translators to add further translations and send these back in to us for inclusion in the upcoming official 3.07 release. Therefore, please do try out the current 3.07 rc2 release by heading on over to the Snapshots page, and let us know how the binaries fare, so we can improve them for the official release.

Greenstone 3.07 release candidate out (03 Aug 2015)

ak19. Saturday, August 15th, 2015.

Hello all,

The GS3.07 release candidate 1 (3.07 rc1) was released on 03 Aug 2015 and can be downloaded from the snapshots page. There have been quite a few changes and improvements. These have been documented in the release notes.

The 3.07 rc1 binaries available are for Windows, Linux (32 and 64 bit) and Mac (Snow) Leopard and (Mountain) Lion. The Mountain Lion release can be used with some modification on Mac Mavericks and Yosemite machines.

We’re still working on the final, official release of 3.07. It will include a Java Runtime Environment with the Mac Lion release, so that it can be run on later releases of the Mac OS that do not include Java, such as Yosemite, without Greenstone Mac users having to download Java. In all, we’d like to make the experience for Mac Mavericks and Yosemite users much smoother than it’s been so far.

In the meantime, we would like to continue to encourage Greenstone users to try out the current 3.07 rc1 release and report back their experiences, in particular any bugs or issues they detected.

We’ll keep you posted about the upcoming 3.07 release.

Greenstone sourceforge projects merged

kjdon. Monday, June 15th, 2015.

I have tidied up the greenstone project on sourceforge: http://sourceforge.net/projects/greenstone/.

The Greenstone3 project has been merged into the main Greenstone project, and Greenstone3 binaries are now the default downloads. To download Greenstone2 binaries, click “Browse All Files” and navigate to the binaries you want.

Note, for now I have left the Greenstone3 project in place, but it will eventually be deleted.

Greenstone 3.06 release out now

ak19. Thursday, November 13th, 2014.

Hello all!

On 6 Nov we released the final binaries of Greenstone 3.06. There are binaries available for the following operating systems:

  • Windows,
  • Linux 32-bit and 64-bit machines and
  • Mac OS Leopard, for Leopards and Snow Leopards
  • Mac OS Mountain Lion for Mountain Lion, Mavericks (and possibly the older Lion, but untested on there)

For those who want to compile 3.06 up from source, there is the 3.06 “source distribution” package.
Those who have installed the binary and who eventually want to recompile can top up their binary installation with the 3.06 “source component”.

You can get the binary for your operating system from the Greenstone 3 home page’s Download section at

http://www.greenstone.org/greenstone3-home

The source component and source distribution zip files are available from the same page as well.

The 3.06 Release Notes are at

http://wiki.greenstone.org/doku.php?id=en:release:3.06_release_notes

These contain instructions on how to install the binaries, or compile with the source distribution or source component for your operating system, as well as information on the basics of using Greenstone and its new features. The release notes may get modified and expanded over time, as questions appear.

For those who want to follow along with the Greenstone 3 tutorials, these are at

http://wiki.greenstone.org/doku.php?id=en:tutorials

Make sure to select the “Greenstone 3” tab.

An example of the kind of interfaces now possible with Greenstone 3 can be viewed at

http://www.music-ir.org/gc14ux-ex/thankyou-library/collection/basic-implementation/page/about

If you’re wondering about how to port Greenstone 2 collections to Greenstone 3, then have a look at the page

http://wiki.greenstone.org/doku.php?id=en:user:gs2_to_gs3

It explains the use of the new Format Conversion Wizard that’s now part of Greenstone 3.06’s GLI and which automates some of the conversion process, while allowing you to interactively modify its suggested Greenstone 3 format statements. Therefore, copy a Greenstone 2 collection into your Greenstone 3.06 installation’s web/sites/localsite/collect/ folder, open it in 3.06 GLI and try out the Format Conversion Wizard.

If you discover bugs or encounter any issues, join the mailing list at greenstone-users @ list.waikato.ac.nz and drop us a message, and we’ll try to get it fixed.

Greenstone updates

ak19. Friday, May 9th, 2014.

It’s been a very long time since we blogged our progress. Rest assured we’ve been working hard to improve Greenstone 2 and 3, and it’s only the blogging about it that fell by the wayside. Here are some of the things that I’ve been working on in the last few months:

  • Securing Greenstone 2 pages: this involved significant changes to both the code and the macros files
  • Updating the GTI Greenstone installation. Owing to the security changes for Greenstone 2, the Greenstone Translation Interface on nzdl needed to have the latest Greenstone2 to work again
  • Fixing up a Remote Greenstone 3 authentication issue
  • The FormatConversion wizard dialog. This completes the process of opening a Greenstone 2 collection in Greenstone 3. The format conversion dialog automatically generates Greenstone 3 equivalents for Greenstone 2 format statements behind-the-scenes, before presenting these to you. You can then interact with the wizard to improve any of the Greenstone 3 format statements that have been generated.