Archive for the ‘Greenstone2’ Category


Official Greenstone 2.85 released!

ak19. Friday, November 4th, 2011.

At last, we did it. After a lot of testing, bug discovery and fixing, we’ve finally released Greenstone 2.85. It should be much improved over 2.84. There were also some last-minute changes since release candidate 2.

Please do grab a binary for your operating system by visiting the download page at http://www.greenstone.org/download and start using it!

The Release Notes can be found at http://wiki.greenstone.org/wiki/index.php/2.85_Release_Notes

Greenstone 2.85rc2 (release candidate 2) released

ak19. Friday, October 28th, 2011.

There was a lot of testing going on in the last 2 months, and I forgot all about writing blog entries.

The first stage of testing was to go through the Greenstone tutorials on Windows (Vista), Linux (Ubuntu) and Mac (Leopard). Some bugs were discovered and fixed, and after that RC1 of GS2.85 could be released.

Thereafter, further tests were conducted on all three operating systems. These covered combinations of the 3 indexers and 3 database types; processing of a range of file types, including the use of Greenstone’s PDFBox and OpenOffice extensions; filenames with different encodings and HTML files that interlink with each other using different encodings; the remote Greenstone server and the GLI applet; and spaces in the filepath on Windows. This time, the tests were conducted on Windows XP, Linux CentOS and Mac Leopard again. A lot of bugs had still slipped through the net after the first stage of testing, but they were caught this time around and fixed for the release of GS2.85 RC2.

Greenstone 2.85 RC2 was finally released on Wednesday 26 October 2011. The Greenstone Team invites all those interested to test the new release binaries, which can be obtained from http://www.greenstone.org/snapshots, and to report back on any bugs or issues encountered. The updated release notes are at http://wiki.greenstone.org/wiki/index.php/2.85_Release_Notes

The release notes already contain instructions on a patch for a minor issue that Diego discovered in the earlier release and which had persisted into the current one.

Sam’s Greenstone Blog 3/10/2011

sjm84. Monday, October 3rd, 2011.

Those who are eagerly awaiting the release of the final version of 2.85 will not have to wait much longer. Anu has been working hard testing it on each of the platforms we support and for the most part things are looking good. Any assistance in testing is always greatly appreciated and if you would like to help us out then please download the 2.85 release candidate which is available here. If you find any problems then join the mailing list to email us at <greenstone-users @ list.waikato.ac.nz> and let us know. The more you can tell us about the issue the better.

Work on the Document Basket functionality continues to go well. I am in the initial stages of connecting the front-end JavaScript to the Java back-end. To transmit the operations we are using JSON (rather than XML), as it is very simple to write in JavaScript and we have found a good Java library (gson) that converts JSON back into an object. So hopefully this week we will start seeing some promising results.
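For readers who have not used JSON this way, here is a rough sketch of the round trip: an operation is serialised to a JSON string on one side and turned back into a typed object on the other. The real code is JavaScript on the front end and Java with gson on the back end; Python is used here only for illustration, and the class name and its fields (`action`, `section_id`, `position`) are hypothetical, not the actual Document Basket message format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BasketOperation:
    # Hypothetical fields, chosen only to illustrate the idea.
    action: str       # e.g. "move", "duplicate", "remove"
    section_id: str   # identifier of the section being operated on
    position: int     # target position for the section


def encode(op):
    """Serialise an operation to a JSON string (the sending side)."""
    return json.dumps(asdict(op))


def decode(text):
    """Rebuild an operation object from a JSON string (the receiving side)."""
    return BasketOperation(**json.loads(text))


if __name__ == "__main__":
    message = encode(BasketOperation("move", "section-3", 1))
    print(message)
    print(decode(message))
```

The point of the design is that both ends share one small, readable text format, so neither side needs an XML parser just to pass a handful of fields.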

Sam’s Greenstone Blog 26/9/2011

sjm84. Monday, September 26th, 2011.

Hello again, and sorry about the big delay between posts. Things have been pretty busy here recently and remembering to write this often slips my mind.

We’ve released our first release candidate for 2.85 so please try it out and let us know if there are any issues. You can find it at http://www.greenstone.org/snapshots.

The front-end for the Document Basket (formerly the Document Maker) is looking really good now: sections can be added, moved around, duplicated and removed. The text of each section can also be easily edited thanks to Brook Novak’s Seaweed (Seamless Web Editing) technology, which was developed here at the University of Waikato. For those of you who haven’t seen this yet, it basically allows you to click text on a web page and start editing it right there without the need for any complicated text boxes, buttons etc. Very cool.

We have yet to connect the front-end Document Basket interface to the back-end and we are still working on features such as undo, so that is what I will be working on this week.

Anu’s entry for the month of Aug 2011

ak19. Friday, August 26th, 2011.

It’s been about 4 weeks since I wrote an entry. In the meantime we’ve been tidying up the last of the To Do list items for the upcoming GS2 release and several of the To Do list items for the GS3 release. Sam is now hard at work on the GS3 interface alongside his other work on the Document Maker. It now looks like GS3 may be released separately, after GS2.

Some of the more involved things that required doing were:

  • testing OAI (dc.Resource Identifier issues) and downloading over OAI
  • The extracted embedded metadata, ex.*.metadata (e.g. ex.dc.* prefixes), needed to be handled differently from ex.metadata. This required some changes in various files and a lot of testing.
  • Conflicts between EmbeddedMetadataPlugin and some of the existing plugins in the pipeline (OAI, DSpace, PDF plugins). Fortunately, Dr Bainbridge came up with fixes. After some testing, the known problems with these plugins no longer exist. With the tutorials we will continue to investigate how well other plugins interact with the EmbeddedMetadataPlugin.
  • The OAI validator at openarchives now had a test where GS2’s OAI server failed and a different one where the GS3 OAI server failed. These have been fixed up.
  • The GS3 installer needed to have an admin page, like the GS2 installer does, where the user can enable admin pages and provide a password.
  • wvware.pl is a new intermediary script to launch wvware in its own particular environment. This script is necessary so that wvware’s required environment is not set globally (which would tamper with Linux’ windowing/GUI libraries).
  • At the moment, following John Rose’s request, we’re in the process of merging the two server configuration files (glisite.cfg and llssite.cfg), so we can have just one, with some properties qualified by a “gli” prefix. The Server.jar code, the GS2 C++ code, the startup scripts and config files have been sufficiently modified to work with the work-in-progress GLI code, while still working with the stable GLI. Changing the GLI code was tricky two years ago, and made the code’s behaviour rather complex. Now that I’m testing the latest overhaul to it, the changes I’ve just made to what was stable are still very buggy, and reproducing the bugs takes some time. Fortunately, without the changes to the GLI code, everything else committed works as accurately as before, so if I break anything, only the LocalLibraryServer.java GLI code will need to be reverted once committed.
  • The above task has now been completely resolved, and changes committed after being tested thoroughly on both Windows and Linux.
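To give a feel for the config-file merge mentioned above, here is a minimal sketch in Python. It assumes both files are simple key=value text and that a GLI-side property only gets the “gli” prefix when its value differs from the server-side one; that rule, the function names, and the sample property names are my assumptions for illustration, not a description of what the committed Greenstone code does.

```python
def parse_cfg(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def merge_cfg(server_props, gli_props, prefix="gli."):
    """Merge two property sets into one.

    Assumed rule: server values win; a GLI value that conflicts with
    the server value is kept under a prefixed key instead.
    """
    merged = dict(server_props)
    for key, value in gli_props.items():
        if key in merged and merged[key] != value:
            merged[prefix + key] = value
        else:
            merged.setdefault(key, value)
    return merged


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    server = parse_cfg("portnumber=8282\nautoenter=1\n")
    gli = parse_cfg("# GLI copy\nportnumber=8383\nautoenter=1\nstart_browser=0\n")
    for key, value in sorted(merge_cfg(server, gli).items()):
        print(f"{key}={value}")
```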

Minor issues also kept popping up over the last month.

  • There was a Z3950 “issue” that sidetracked me and which turned out not to be an issue after all: the Library of Congress’ Z3950 address seems to return SRU data. The fix is simply for the user to use the right module of the download pane.
  • A bug in starting and stopping GS3 via GLI on Windows.
  • One Greenstone member encountered a Unicode issue that I wasn’t able to reproduce after initial investigations.
  • Minor but frustrating bugs with the GLI for GS3 have been resolved (an extra nested <format/> tag appearing when all format statements have been removed, and the preview button activating itself when editing format statements in an unbuilt GS3 collection)
  • Fixed GS3’s way of handling the port in the GSI application, so that it is no longer arbitrarily modified. The Do Not Modify port is still available.
  • Some requests on the mailing list, such as porting indexed databases from one GS2 version to the next, since changes had been made to the name of an ex.metadata field.

Anu’s entry for 25-29 July

ak19. Monday, August 1st, 2011.

Last week started off with fixing a bug introduced during recent GS3 code changes: suddenly metadata and titles were no longer being retrieved for normal search and browse operations. Then Sam’s recent improvement to GS3’s GLI, starting the Tomcat server upon GLI startup, was expanded to also stop the Tomcat server on GLI’s exit.

Then it was time to move back to GS3 XSLT files once more. Recently, changes were made to GS3’s old standard skin (gs3library) XSLT files, so that the features exhibited in the DSpace Tutorial would work for GS3 as well. These changes still needed to be ported over to the new standard skin for GS3, currently called “oran” (its servlet is called “dev”). However, in trying to make sense of how to do this, it was discovered that the default dev servlet was not set to use Sam’s excellent default GS3 interface for dev. Because GS3’s format features need to be customisable, having any format statements in a collection’s configuration file would bypass Sam’s interface and show a default one instead. However, this default one was not working at this stage. It was therefore fixed up to get back some rudimentary behaviour not unlike what GS2’s interface offers for hierarchical browsing and search results. To use Sam’s interface, all users need to do is use GLI to delete any format statements in a collection’s config file.

In looking into this matter, a further minor bug was discovered in classifier.xsl that was also fixed.

Porting the GS3 changes made for the DSpace tutorial into the new default skin had to be postponed, since there was some incomplete work awaiting finishing: the week ended with continuing work on embedded metadata (such as of the form ex.dc.*).

Sam’s Greenstone Blog 27/7/2011

sjm84. Wednesday, July 27th, 2011.

It’s been a while since I last wrote, so I’ll fill you all in on what has been happening.

We have been working fairly solidly on some improvements for 2.85. One thing we have been aiming to do is improve the use of PDF files with complex embedded metadata. We have added several options to the EmbeddedMetadataPlugin that allow more advanced manipulation of metadata arrays (metadata values that have multiple entries like ex.PDF.Keywords).

We have also fixed several issues that arise when 2 similar documents are put into Greenstone (for example, two identical PDF documents that have different embedded metadata).

In other news, we are currently taking another look at the way we encode PDF files. As some of you may know, we introduced the PDFBox extension along with 2.84 as a way of converting the latest PDF formats to HTML (pdftohtml only allows conversion of the earlier PDF formats). PDFBox works well, except that it does not also get the images out of the PDF like pdftohtml does. It is also fairly large, which is why we need to keep it as an extension rather than bundle it with Greenstone. Unfortunately for us, the pdftohtml utility has not been in active development for quite a while now, so it has not been upgraded to deal with the more recent PDF versions. However, the Xpdf library that pdftohtml uses is still in active development, so we have been exploring the viability of upgrading pdftohtml ourselves.

Alongside this I am continuing to work on the Document Maker for Greenstone 3. I have a skeleton of the program in place and have started filling it out.

Anu’s entry for weeks 11-22 July 2011

ak19. Friday, July 22nd, 2011.
  • Week starting 11 July: Closed ticket 770 to do with multiple pieces of metadata for the same metadata name in GS3. GS3 was previously not consulting the mdoffset field in the index database to work out which of multiple assigned metadata values to display for a particular metadata field. When browsing on that metadata field, it used to display only the first each time, but now displays all values in turn.
  • For the rest of that week and the start of the week thereafter, worked on some items discovered by John Rose and Luigi. They found a bug in the GS2 OAI server that manifested when a GS2 client tried to download docs from it over OAI. The bug had to do with an incorrect URL being generated for the dc.Resource Identifier field. They also requested a minor improvement to the button layout in GLI’s OAI download panel and needed some clarifications on the GS2 OAI server’s behaviour.
  • Continuing on in the week of 18 July: On GLI startup, an information dialog box will show up if the user does not have the PDFBox extension installed (telling them how to get it if they want newer PDF versions processed). A dialog will also appear on startup if the user’s collect home was set to be somewhere outside its default location inside the GS2 installation.
  • In implementing the last, a bug was discovered that had been introduced when implementing the reset-gsdlhome target of the gsicontrol script. The bug interfered with the proper behaviour of setting and loading a custom collecthome when using GLI. It’s now been fixed in such a manner that there’s the added advantage that the intensive operations of the reset-gsdlhome task will not be carried out anymore each time the GS2-server is launched. Instead, the relocation-specific operations are only performed when GSDLHOME has in fact changed since the previous time the GS2-server was launched.
  • The pdfbox-app.jar executable file was changed again: it was returned to being the plain, official 1.5.0 release, without the Greenstone-specific changes regarding the line-separator that had thereafter been committed. Instead, the line.separator is now set as a command-line property when launching the pdfbox-app.jar, as suggested by Dr. Bainbridge, since it was no more than a Java System property that needed to be adjusted for GS’ customisation of PDFBox anyway.
  • Changes have been made to modelcol’s config.cfg (and related changes in runtime-src) to deal with embedded metadata, so that it will now handle the “ex.” prefix of metadata already qualified by a set name, such as ex.dc.something. Further changes were made to runtime-src’s code to not always remove the ex. prefix, since this should be retained for embedded metadata. The handling of embedded metadata by the DSpacePlugin was also slightly modified so that DC metadata in the dublin_core.xml files of DSpace documents get prefixed with “ex.”. This allows these metadata fields to be visible in GLI, while yet being unmodifiable, as they are still extracted (ex) metadata.
  • Tried to reproduce some issues noticed by members of the mailing list.
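The GSDLHOME change described in the list above boils down to “remember the value from the last launch and only redo the expensive work when it differs”. A minimal Python sketch of that idea follows; the marker file and the function name are hypothetical, and the actual gsicontrol script may record the previous location quite differently.

```python
from pathlib import Path


def gsdlhome_changed(current_home, marker_file):
    """Return True if GSDLHOME differs from the value stored at the last launch.

    The marker file (a hypothetical name and location) holds the GSDLHOME
    seen previously. It is rewritten whenever a change is detected, so the
    relocation work runs once per move rather than on every server launch.
    """
    marker = Path(marker_file)
    previous = marker.read_text(encoding="utf-8").strip() if marker.exists() else None
    if previous == current_home:
        return False
    marker.write_text(current_home, encoding="utf-8")
    return True


if __name__ == "__main__":
    import os
    import tempfile

    marker = os.path.join(tempfile.mkdtemp(), "last_gsdlhome.txt")
    for home in ["/opt/greenstone2", "/opt/greenstone2", "/home/me/greenstone2"]:
        if gsdlhome_changed(home, marker):
            print("relocated to", home, "- run the reset-gsdlhome operations")
        else:
            print("unchanged - skip the reset-gsdlhome operations")
```

Note that in this sketch a first launch (no marker file yet) counts as a change, which is an assumption on my part.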

Blog entry for 19 June - 1st of July

ak19. Tuesday, July 5th, 2011.

I forgot to write entries for the last two weeks.

- A lot of time was devoted to ticket 449: after Dr Bainbridge’s initial solution to the problem in JavaScript, Sam and Veronica spent a lot of their time on it just so that we could get it to do the same in XSLT, and so at last (yesterday, 4 July) this was finished.

- Sam and Dr Bainbridge noticed the GS2 server’s port number would keep incrementing at times if the chosen port was unavailable at that moment. Their ticket specified a way to request preserving the chosen port. So that was implemented some time last week.

- Investigated PDF to text conversion on Windows. Ghostscript seems to support ASCII conversion, but Greenstone would need Unicode to be preserved. There were Perl solutions as well as open source programs to do this on Windows. For now, PDFBox has been tweaked to use its inbuilt ability to convert PDF to text when this is specified. Also looked into the latest version of AbiWord, which Max pointed out as a free and small-sized alternative to MS Office and Open Office for converting docx files.

- The latest updates to the acku and areu collections were uploaded.
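The port behaviour from the second item above can be sketched roughly as follows. This is Python rather than the actual server code, and the names (`choose_port`, `keep_port`, `max_tries`) are hypothetical. In particular, failing with an error when a preserved port is busy is my assumption; the ticket only asked for a way to keep the chosen port.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False


def choose_port(preferred, keep_port=False, max_tries=20, is_free=port_is_free):
    """Pick the port to start the server on.

    By default, step upwards from the preferred port until a free one is
    found (the incrementing behaviour). With keep_port=True the preferred
    port is never changed; if it is busy we raise instead (an assumption).
    """
    if keep_port:
        if is_free(preferred):
            return preferred
        raise RuntimeError(f"port {preferred} is unavailable and must not be modified")
    for candidate in range(preferred, preferred + max_tries):
        if is_free(candidate):
            return candidate
    raise RuntimeError(f"no free port found starting from {preferred}")


if __name__ == "__main__":
    print(choose_port(8282))
```

The `is_free` parameter exists only so the selection logic can be exercised without opening real sockets.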

Anu’s entry for the week of Mon 13 Jun 2011

ak19. Friday, June 17th, 2011.

Since Dr Bainbridge was away this week and because I was at an impasse regarding my last ticket for GS3, I had already decided last Friday to consider two requests that seemed feasible and just required a bit of investigation:

1. Diego noticed that use_sections didn’t work with the PDFBox extension. So some changes were made to the PDFBox code to generate the HTML in a paged fashion, and PDFPlugin was adjusted to handle it. It turned out that changes to the PDFBox code were unnecessary after all, and the latest binary was all that was needed to work with the updated PDFPlugin: the latest PDFBox jar file was inserting page separator elements already and the PDFPlugin merely needed to pick up on that.

2. Professor Witten had suggested that if someone had Word or Office 2007 installed on their Windows machine, windows_scripting should be able to convert docx to html for us without requiring Open Office. Last week I had tried out whether this was already possible, but word2html, which used native Windows scripting, cropped the “docx” extension down to “doc” and declared that the file couldn’t be found. It was not possible to get at the VB source code to modify it, so the next idea was to find some WSH scripts for converting docx (WSH tends to be switched on by default on Windows). There was a WSH script on the web for extracting all the *text* from a docx. It wasn’t quite what we wanted, since formatting would be lost.

Fortunately for us, Veronica recalled the existence of a VBScript for WSH that promised to do just what we needed: docx to html. After she located it for us, all that was required were some modifications to integrate it with gsConvert.pl to get things to work in the default case, where Office/Word 2007 is installed. It worked fine on XP. Then some further changes for error handling needed to be inserted for the case where Word is not installed on a machine. Having got the error output to go to STDERR from the VBScript, it now did pretty well on the Vista machine, where there was no Word either.

It still needs to be tested for how it acts on a Windows which has a version of Word predating the docX format.

3. The idea is to expand the VBScript to have subroutines to handle xlsx and pptx files as well. Part of the code for pptx is already working (opening the document), but there may be some differences between opening or saving things in Word and PowerPoint, as the universal Office SaveAs method wasn’t working for me.

4. Temporarily fixed a bug in GS’s classifiers which was noticed on the mailing list and sent the tentative fix to the notifier:

When a user enters non-English characters for a buttonname, Perl does not preserve them and so they display incorrectly in the browser. The fix required me to assume that the user would have input this in UTF-8, and I have now got the Perl code to work with that. But I need to talk to Dr Bainbridge about whether my assumption was reasonable before committing the code for all.
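The assumption in question can be illustrated with a few lines of Python (the actual fix is in Perl, and this is not that code). The sketch treats the incoming buttonname as UTF-8 bytes; the Latin-1 fallback is purely my own addition, to show one way of hedging if the UTF-8 assumption turns out to be wrong for some users.

```python
def decode_buttonname(raw, fallback="latin-1"):
    """Decode a buttonname supplied as bytes.

    Assumes UTF-8 input, as in the tentative fix. If the bytes are not
    valid UTF-8, fall back to another encoding (hypothetical behaviour,
    not part of the fix described above).
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


if __name__ == "__main__":
    print(decode_buttonname("Títulos".encode("utf-8")))
    print(decode_buttonname("Títulos".encode("latin-1")))
```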

There were some questions in the mailing list which I finally got round to answering. There is still Diego’s request for implementing the “allvalues” option in the List classifier to look at, and number 3 above.
