Other than that failed attempt at a bugfix, I did not contribute any Greenstone work for the overall community of GS users this week. Some work from back in June is worth writing up though.
Greenstone plugins often use existing open-source software to process complex document types like PDFs and older versions of Word documents, to get them to be searchable in the built collection. However, for Greenstone to be able to process newer Word documents, that have the docx file extension, users so far needed to have either the free LibreOffice or payware Microsoft Word of the Office Suite installed. LibreOffice is large in size and sometimes it can be hard to get it to successfully run headless or to restart it in headless mode. After I saw some messages in the mailing list about this, I searched online for if new command line tools had appeared in the meantime that could convert docx files to html or text. Because then Greenstone’s existing UnknownConverterPlugin can be used to run that command line tool to extract text from docx files, so that they’ll get indexed and become searchable in the built collection.
To gain a better understanding of generally using the UnknownConverterPlugin, refer to the tutorial at http://files.greenstone.org/tutorial/gs3-current/en/unknown_converter_plugin.htm
Apache Tika is Apache’s open-source software to extract text from countless different (textual) document types, one of which is docx. While one can write code to make calls on Apache-Tika’s API, their ready made jar file contained everything that we needed to get Greenstone to index text in docx files.
All I had to do was configure the UnknownConverterPlugin to make use of the Apache-Tika jar dropped into GS3’s extension folder, such that docx files could be processed and indexed (for searchability) by Greenstone without requiring users to install libreoffice.
The UnknownConverterPlugin has been officially available since Greenstone 3.09, so that 3.09 users can also start using Tika with the plugin, by
1. creating a subfolder called “tika” inside their GS3-install-dir/gs2build/ext,
2. downloading the Apache-Tika binary jar file from https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.24.1.jar (or by visiting http://trac.greenstone.org/browser/main/trunk/greenstone2/ext/tika/tika-app-1.24.1.jar and clicking the link labelled “downloading” there), then dropping the downloaded jar file into GS3/gs2build/ext/tika
3. and then configuring an UnknownConverterPlugin instance for any collection that needs docx processing as follows:
All 3 of the above steps are already setup for you in the GS3 binaries generated every night and available from http://www.greenstone.org/caveat-emptor/
Untried: Greenstone 2 users can try a grabbing a nightly GS2 binary from http://www.greenstone.org/caveat-emptor/ as it should also come with an UnknownConverterPlugin). The nightly GS2 binaries should already have an ext/tika subfolder within the GS2-installation folder, containing the tika jar file. Otherwise you can create this folder yourself and download the tika jar file into that location as in step 2. Next configure your UnknownConverterPlugin as in step 3 above before building your GS2 collection containing docx files.
You’re not limited to processing docx files by using UnknownConverterPlugin with Tika. You can process other textual doc types, whether already supported by existing Greenstone plugins or not, by configuring a new instance of UnknownConverterPlugin and setting the mime_type, srcicon, process_extension (and file_format) fields appropriately for that doctype.
The above instructions on using the UnknownConverterPlugin has now also been added to the Greenstone wiki, so you don’t have to track down this blog page if you want to revisit the instructions of using Tika with the UnknownConverterPlugin. Simply search the Greenstone wiki at wiki.greenstone.org for “UnknownConverterPlugin” or “Tika”.
]]>But for those who really need Greenstone 2, we’ve now brought out a maintenance release of it with Greenstone 2.87.
You can grab the new GS 2.87 binaries, and source component and source distributions packages from the downloads page.
The release notes include the installation instructions. And here are the Greenstone 2 tutorials, make sure to select the “Greenstone 2” tab at the top.
For those who need to work from source, the release notes also contain links to pages detailing the steps for compiling the source components and distributions.
As usual, write us if you encounter any issues with GS2.87.
]]>The Greenstone3 project has been merged into the main Greenstone project, and Greenstone3 binaries are now the default downloads. To download Greenstone2 binaries, click “BrowseAll Files” and navigate to the binaries you want.
Note, for now I have left the Greenstone3 project in place, but it will eventually be deleted.
]]>For the latest changes made today, need to retest these changes against GS3 on Windows.
Still need to test the entire process on Linux.
]]>This week:
Update: did not get much further with the GS3 usersDB as there was a lot more to be done with the translation files for GTT and their processing. The process became clearer thanks to Te Taka’s explanations and his testing at each stage. TMX files will only be needed the first time a translator migrates from GS’s usual translation procedure, which makes use of excel spreadsheet files, to Google’s toolkit. The TMX file will start them up with all the up-to-date translated strings that are available so far in GS3 for the selected language. For the strings that need to be translated and updated, the translator will get a text file that contains the unicode spreadsheet data (as comma separated values, but the file will have a .txt extension instead of .csv in order to preserve the unicode). The translator will then copy the English and <Language> columns of the spreadsheet into the GTT. Once their translation work is done, they can send these same columns back by way of the same spreadsheet.
]]>This week has seen more upgrades to Greenstone 3 as well. One of the features we have been working on for the Pei Jones collection is the ability to zoom “screen” images by using the mouse like a magnifying glass. We have added this into the default Greenstone 3 capabilities. In order for this to work however there needs to be a “screen” (small) and “source” (usually larger) version of the same image.
In general Greenstone 3 now handles paged-images much better. They are now properly displayed at the top of their specific sections. There is also an option to change between text-only, image-only and the default text and image modes, which is available in both the paged style collections as well as normal hierarchy style collections.
Next week will most likely involve more improvements like this as we continue to prepare Greenstone 3 for release.
]]>This week, I was able to finally return to the problem of jodconverter not interacting well with the LibreOffice on the Ubuntu 11 whereas the same worked perfectly against an OpenOffice on the CentOs machine. We decided that perhaps OpenOffice had different behaviour for the signals sent by jodconverter. Installing OpenOffice turned out harder than expected and I think I botched it. I ended up having to uninstall all openoffice files and libreoffice files and then reinstalled all of libreoffice. At this stage, upon trying jodconverter again, it was found to work fine each time. This seemed to confirm the suspicion that some updates to Ubuntu may have messed up some libraries or something, breaking LibreOffice a little.
However, despite things now working again, Sam wondered, very correctly, whether a user’s experience would be this convoluted or whether it would work straight away for them. He suggested trying out a VM of Ubuntu 11. Which is what I did. It was my first VM installation and after installing a Ubuntu 11.10 VM on Sam’s Windows 7 (which comes with LibreOffice), Greenstone with the open-office extension fortunately worked fine on a sequence of word documents.
On Friday, got round to Diego’s long-standing question at last: about the possibility of a single metadata.xml at the import level which defines the metadata for all files in import’s subfolders. Dr Bainbridge had already confirmed earlier that this was indeed possible, but the question was of how the metadata.xml out to specify the path to the files in the subfolders, especially if there were spaces in the path. After a series of incremental tests, it was found out to be still possible and the solution rather straightforward. Hopefully it will work for Diego also.
There was some translation work, and a few further questions on the mailing list to look at, before I finally got round to considering Michael Goodwin’s complex question on the setup.exe generated by an Export To CD-ROM operation failing on Windows 7 on 64 bit. A preliminary successful test on a Windows 7 machine turned out to be misleading: I had assumed it was a 64 bit machine but it turned out to be 32 bit after all. I will have to get back to trying this out next week. All this fine-tuning is bound to pay off in the upcoming perfected release of Greenstone 2: version 2.86.
]]>On the final 3 days, got round to working on getting the batch files in GS2 to handle not only spaces but also brackets in the Greenstone filepaths. There is still a final problem to resolve before the changes can be committed, but the Greenstone web server is now back to working again, despite Greenstone being installed in a path with brackets (and spaces). There’s even some allowance made in the makegs2.bat script–which is used to compile up GS2–to get apache to compile up even in those instances of there being spaces or brackets in the filepaths it works with. Fortunately, the change could be made in the makegs2.bat itself: it sets the command prompt in which Greenstone is being compiled up to be in short-filenames mode. This then is the situation that the apache compile scripts inherit also, making any space/bracket in the long pathname irrelevant.
]]>