Archive for June, 2011

Sam’s Greenstone Blog 27/6/2011

sjm84. Monday, June 27th, 2011.

Last week I spent a fair amount of time helping out other people in the Greenstone lab. One of our recent additions is a Master’s student from India named Papitha. Her project involves using Greenstone 3 to create a framework that lets scholars work with large sets of images and OCRed text and organise them into cohesive collections. Examples of potential functionality include dynamically creating new documents and merging and/or splitting existing documents. So I have been showing her the ins and outs of Greenstone 3, as well as helping to set up a platform for her to start working from.

There is also another Sam in the lab, who had been working for us alongside his PhD. He took some time off to finish it, and now that it is done he is working for us part time again. He has been working on a way to customise the format of Greenstone 3 more easily. As I’ve mentioned before, with Greenstone 3 we use XSL stylesheets to control the formatting of the various pages, instead of the format statements and macros that Greenstone 2 used. XSL stylesheets are good in that they give us a large amount of flexibility, but this can also make them difficult to understand (especially for people who don’t have any experience with XML), so Sam has been working on a way to hide a lot of the underlying complexity by presenting a simple interface on the pages themselves. I have been helping him with some of the XSL coding as well as some of the run-time code.

Sam’s Greenstone Blog 20/6/2011

sjm84. Monday, June 20th, 2011.

One of the things we are working on at the moment in Greenstone is 64-bit compatibility (especially on Linux and Mac). Last week while David was away at JCDL I decided to take a deeper look at the issue.

At the most basic level there are two problems. The first is that the C and C++ programming languages have a data type known as a “long”, which is usually used to store a long integer (a large whole number). On a 32-bit Linux or Mac machine a long is represented using 32 bits (32 ones or zeros), but on a 64-bit Linux or Mac machine a long is represented using 64 bits. This may seem harmless enough, but unfortunately some of our older code was written assuming that a long is 32 bits, and this can stop the code compiling successfully or (in the worst case) cause some fairly hard-to-track-down crashes.

The second problem is fairly similar. Pointers (which are basically big numbers that represent memory addresses) are like longs in that they are 32 bits long on a 32-bit machine and 64 bits long on a 64-bit machine. This is only ever an issue when we try to convert a pointer to a number. For example, in some places we try to convert a pointer to an int (a 32-bit integer on both 32-bit and 64-bit machines), which doesn’t even compile on 64-bit machines.
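To make the two problems a bit more concrete, here is a minimal sketch in C++. The function names are made up for illustration and are not taken from the Greenstone source. It reports how wide a long is on the machine it is compiled for, and converts a pointer to a number using std::uintptr_t, a standard integer type that is wide enough to hold a pointer wherever it is available (which includes the usual Linux and Mac compilers).

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Width of a long, in bits, on the platform this code is compiled for.
// Typically 32 on 32-bit Linux/Mac and 64 on 64-bit Linux/Mac.
constexpr std::size_t long_bits() {
    return sizeof(long) * CHAR_BIT;
}

// True when an int is too narrow to hold a pointer, which is the case
// on 64-bit Linux and Mac (32-bit int, 64-bit pointer).
constexpr bool int_too_small_for_pointer() {
    return sizeof(int) < sizeof(void*);
}

// Convert a pointer to a number without losing any bits.
// (Hypothetical helper name, for illustration only.)
std::uintptr_t pointer_to_number(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

// Convert the number back into the original pointer.
const void* number_to_pointer(std::uintptr_t n) {
    return reinterpret_cast<const void*>(n);
}
```

Code that casts a pointer to an int instead is the sort of thing a 64-bit compiler rejects, because half of the address would be thrown away.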

There are three different ways that we have tried to solve this problem. The first method (the one that we currently use) is to enable an option during compilation that compiles the code as if it were on a 32-bit computer. This method works, but it has a massive downside: in order for our code libraries to work together, we need to compile most of the rest of the code as 32-bit as well.

The second method is to change all occurrences of long to a data type that is always 32 bits. This effectively makes the code behave exactly like the 32-bit version, but without the downside of the first method. The downside of making the program run as if it were 32-bit, however, is that you lose the advantages of 64-bit (such as a bigger memory space).

The third method is to change only the areas that cause problems on 64-bit and to leave the rest of the code as it is. This gives us the advantages of 64-bit and lets the rest of our code remain 64-bit as well. The only downside is the potential to miss problematic areas that might only show up at a later stage.
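To give a feel for what a targeted fix under the third method can look like, here is a minimal sketch in C++. It assumes, purely for illustration, that the problem spot is a routine that stores a number in exactly four bytes and used to rely on long being that size; the helper names are hypothetical and this is not the actual Greenstone code. Only that routine switches to std::uint32_t, a type that is 32 bits wide on both kinds of machine, and everything else is left alone.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: pack a 32-bit value into exactly four bytes,
// most significant byte first, regardless of how wide a long is.
std::array<unsigned char, 4> encode_u32(std::uint32_t value) {
    return {{
        static_cast<unsigned char>((value >> 24) & 0xFF),
        static_cast<unsigned char>((value >> 16) & 0xFF),
        static_cast<unsigned char>((value >> 8) & 0xFF),
        static_cast<unsigned char>(value & 0xFF)
    }};
}

// Hypothetical helper: rebuild the 32-bit value from the four bytes.
std::uint32_t decode_u32(const std::array<unsigned char, 4>& bytes) {
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
           static_cast<std::uint32_t>(bytes[3]);
}
```

Because the width is spelled out in the type, the same four bytes come out whether the code is built as 32-bit or 64-bit.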

So last week I spent a lot of time debugging the various errors I encountered trying to implement the third method and for the most part it has been successful.

Anu’s entry for the week of Mon 13 Jun 2011

ak19. Friday, June 17th, 2011.

Since Dr Bainbridge was away this week and because I was at an impasse regarding my last ticket for GS3, I had already decided last Friday to consider two requests that seemed feasible and just required a bit of investigation:

1. Diego noticed that use_sections didn’t work with the PDFBox extensions. So some changes were made to the PDFBox code to generate the HTML in a paged fashion, and to PDFPlugin to handle it. It turned out that changes to the PDFBox code were unnecessary after all, and the latest binary was all that was needed to work with the updated PDFPlugin: the latest PDFBox jar file was already inserting page separator elements and PDFPlugin merely needed to pick up on that.

2. Professor Witten had suggested that if someone had Word or Office 2007 installed on their Windows machine, windows_scripting should be able to convert docx to HTML for us without requiring Open Office. Last week I had tried out whether this was already possible, but word2html, which uses native Windows scripting, cropped the “docx” extension down to “doc” and declared that the file couldn’t be found. It was not possible to get at the VB source code to modify it, so the next idea was to find some WSH scripts for converting docx (WSH tends to be switched on by default on Windows). There was a WSH script on the web for extracting all the *text* from a docx, but it wasn’t quite what we wished for, since formatting would be lost.

Fortunately for us, Veronica recalled the existence of a VBScript for WSH that promised to do just what we needed: docx to HTML. After she located it for us, all that was required were some modifications to integrate it and get things to work in the default case, where Office/Word 2007 is installed. It worked fine on XP. Then some further error handling needed to be added for the case where Word is not installed on a machine. Having got the error output from the VBScript to go to STDERR, it now did pretty well on the Vista machine, where there was no Word either.

It still needs to be tested to see how it behaves on a Windows machine that has a version of Word predating the docx format.

3. The idea is to expand the VBScript to have subroutines to handle xlsx and pptx files as well. Part of the code for pptx is already working (opening the document), but there may be some differences between opening or saving things in Word and PowerPoint, as the universal Office SaveAs method wasn’t working for me.

4. Temporarily fixed a bug in GS’s classifiers which was noticed on the mailing list and sent the tentative fix to the notifier:

When a user enters non-English characters for a buttonname, Perl does not preserve them and so they display wrongly in the browser. The fix required me to assume that the user would have input this in UTF-8, and I got the Perl code to work on that basis. But I need to talk to Dr Bainbridge about whether my assumption was reasonable before committing the code for all.

There were some questions on the mailing list which I finally got round to answering. There is still Diego’s request for implementing the “allvalues” option in the List classifier to look at, and number 3 above.

Anu’s entry for week of 6 June 2011

ak19. Monday, June 13th, 2011.

Mainly small odds and ends: making sure that the GS2 OAI server validated against a new online validator (at which point the resumptionToken functionality was retested), very minor bug fixes such as making sure images in PagedImage collections built with XML item files won’t get reprocessed by ImagePlugin, and answering some other questions on the mailing list. Spent time investigating how to implement use_sections with PDFBox and PDFPlugin (can try updating the PDFBox code to split on a page at a time), and on Friday was (still unsuccessfully) trying to figure out how to circumvent hard-coded GS2 {If} format statements in metadata so that things still work with GS3, as in ticket

Sam’s Greenstone Blog 10/6/2011

sjm84. Friday, June 10th, 2011.

This week was mostly spent producing test CD versions of 2.84 for Professor Ian Witten. Usually Greenstone CDs are designed so that they can be installed on Windows, Mac OS, or Linux, but due to size constraints (the total size of the necessary files was greater than 700MB, which is as much as a standard CD can hold) we were forced to remove the Mac installer, as it is unlikely to be necessary in the workshops that Ian will be hosting. Removing the Mac component reduced the size of the ISO to 697MB, which was just under the limit. After fixing up a few minor problems with the CD installer, it was decided that it would be better if the tutorial sample files were unzipped by default (usually they are compressed in zip format to save space) so that they were more accessible. The problem with this, however, is that it would have taken us over our 700MB limit.

The documented examples collection (basically a large set of example collections) is usually stored uncompressed on the CD and is copied directly to the Greenstone folder during installation. It is made up of thousands of small files, and we hypothesise that this makes it one of the biggest contributors to the slow CD installations. By compressing the documented examples into a single file and uncompressing it on installation we can both save space and (potentially) increase the speed of installations. After implementing this (and having the tutorial sample files unzipped) the CD size was reduced to about 650MB, which may be enough to reintroduce the Mac installer. We will have to wait till next week to find out.

I have also added authentication pages to the new Greenstone 3 skin (these existed in the old default skin but not in the new skin). Next week I plan to produce a final version of the CD and continue work on Greenstone 3.

Sam’s Greenstone Blog 3/6/2011

sjm84. Tuesday, June 7th, 2011.

This week was mostly spent working on Greenstone 2 for a change.

I upgraded the ExifTool Perl module to the latest version, which removed some weirdness we were experiencing with embedded metadata in different character encodings. We use this Perl module to extract embedded metadata in formats such as XMP (Extensible Metadata Platform) and Exif (Exchangeable image file format), as well as metadata from file formats like PDF that have their own embedded metadata formats.

I have also been familiarising myself again with the code we use to generate versions of Greenstone that we can put onto CDs. Professor Ian Witten is heading overseas in the near future and will be running several Greenstone workshops, so he wants a batch of CDs to give out to the participants. Everything appeared to be running smoothly - I was able to install Greenstone off an ISO (basically a file that represents a CD) mounted in a virtual CD drive - but when we then burn that ISO to a real disc the installer does not even start. We get a not-very-helpful error message telling us that there was an error with the error reporting, which certainly doesn’t make for easy debugging.

So next week I will be working on this and probably more Greenstone 3 touch ups.

Anu’s entry for week beginning 30 May 2011

ak19. Friday, June 3rd, 2011.

Fixed a server-crashing bug reported on the mailing list (the bug was traced to GSDLQueryLex.cpp). Finally worked out a rudimentary way to get an Exact Phrase option in the GS interface for the Web Administrator of our university library, who was faced with this problem. Got GS2 to pass all the remaining OAI validation tests at last. And also got the earliestDatestamp working for GS2’s OAI server as it should (it works out the earliestDatestamp in the manner that GS3 was changed to do it). This means it no longer always returns the Unix epoch time of 1970 for all Greenstone OAI repositories, as it used to.

Sam’s Greenstone Blog 27/5/2011

sjm84. Thursday, June 2nd, 2011.

At the end of last week we discovered that - unlike the current skin - the new skin did not correctly allow for custom user templates for things like browsing classifiers and search results. We were originally unsure why it was not working, but we tracked it down to XSL template priorities. For those of you who are unaware, Greenstone 3 makes heavy use of XML to produce the information necessary to serve pages to the end-user. This XML is then transformed into an actual web page by performing several XSL transformations. XSL (Extensible Stylesheet Language) is basically made up of a set of rules (templates) that say what to do when you encounter a given piece of XML. For example, there might be a template that means: when you see a documentNode in the XML, replace it with a book icon (HTML tag) followed by a link (HTML tag) to the page for that document. These rules are stored in various files in the web/interfaces/{interface name} folder. To support format statements like those Greenstone 2 has, we also allow users to write format statements in the collectionConfig.xml file, and these are added to the other templates used to transform the page.

So what happens when a user wants to override a pre-existing template (like the one I mentioned before)? Well, we deliberately give the user’s template a higher priority than the default one, so it will do that transformation instead. We found that the default templates in the new skin had deliberately been designed to be more important than the user’s templates, meaning that the user’s templates weren’t showing up at all. So this has now been changed to be more user-template friendly.

Next week I will be adding the finishing touches to the new skin (mostly organising it so that it is easy to modify), and assuming that we can clear up the remaining tickets, I predict I will be spending time testing Greenstone 3 to make sure it is ready for release.