Archive for July, 2011

Sam’s Greenstone Blog 27/7/2011

admin. Wednesday, July 27th, 2011.

It’s been a while since I last wrote, so I’ll fill you all in on what has been happening.

We have been working fairly solidly on some improvements for 2.85. One thing we have been aiming to do is improve support for PDF files with complex embedded metadata. We have added several options to the EmbeddedMetadataPlugin that allow more advanced manipulation of metadata arrays (metadata fields that have multiple values, such as ex.PDF.Keywords).

We have also fixed several issues that arise when two similar documents are put into Greenstone (for example, two identical PDF documents that have different embedded metadata).

In other news, we are currently taking another look at the way we convert PDF files. As some of you may know, we introduced the PDFBox extension along with 2.84 as a way of converting the latest PDF formats to HTML (pdftohtml can only convert the earlier PDF formats). PDFBox works well, except that it does not extract the images from the PDF like pdftohtml does. It is also fairly large, which is why we need to keep it as an extension rather than bundle it with Greenstone. Unfortunately for us, the pdftohtml utility has not been in active development for quite a while now, so it has not been upgraded to deal with the more recent PDF versions. However, the Xpdf library that pdftohtml uses is still in active development, so we have been exploring the viability of upgrading pdftohtml ourselves.

Alongside this I am continuing to work on the Document Maker for Greenstone 3. I have a skeleton of the program in place and have started filling it out.

Anu’s entry for weeks 11-22 July 2011

ak19. Friday, July 22nd, 2011.
  • Week starting 11 July: Closed ticket 770, which concerned multiple pieces of metadata for the same metadata name in GS3. GS3 was previously not consulting the mdoffset field in the index database to work out which of multiple assigned metadata values to display for a particular metadata field. When browsing on that metadata field, it used to display only the first value each time, but it now displays all values in turn.
  • For the rest of that week and the start of the week thereafter, worked on some items discovered by John Rose and Luigi. They found a bug in the GS2 OAI server that manifested when a GS2 client tried to download docs from it over OAI. The bug had to do with an incorrect URL being generated for the dc.Resource Identifier field. They also requested a minor improvement to the button layout in GLI’s OAI download panel and needed some clarifications on the GS2 OAI server’s behaviour.
  • Continuing on in the week of 18 July: On GLI startup, an information dialog box will show up if the user does not have the PDFBox extension installed (telling them how to get it if they want newer PDF versions processed). A dialog will also appear on startup if the user’s collect home was set to be somewhere outside its default location inside the GS2 installation.
  • In implementing the last, a bug was discovered that had been introduced when implementing the reset-gsdlhome target of the gsicontrol script. The bug interfered with the proper behaviour of setting and loading a custom collecthome when using GLI. It’s now been fixed in such a manner that there’s the added advantage that the intensive operations of the reset-gsdlhome task will not be carried out anymore each time the GS2-server is launched. Instead, the relocation-specific operations are only performed when GSDLHOME has in fact changed since the previous time the GS2-server was launched.
  • The pdfbox-app.jar executable file was changed again: it was returned to being the plain, official 1.5.0 release, without the Greenstone-specific changes regarding the line-separator that had thereafter been committed. Instead, the line.separator is now set as a command-line property when launching the pdfbox-app.jar, as suggested by Dr. Bainbridge, since it was no more than a Java System property that needed to be adjusted for GS’ customisation of PDFBox anyway.
  • Changes have been made to modelcol’s config.cfg (and related changes in runtime-src) to deal with embedded metadata, so that it will now handle the “ex.” prefix of metadata already qualified by a set name, such as ex.dc.something. Further changes were made to runtime-src’s code to not always remove the ex. prefix, since this should be retained for embedded metadata. The handling of embedded metadata by the DSpacePlugin was also slightly modified so that DC metadata in the dublin_core.xml files of DSpace documents gets prefixed with “ex.”. This allows these metadata fields to be visible in GLI while remaining unmodifiable, as they are still extracted (ex) metadata.
  • Tried to reproduce some issues noticed by members of the mailing list.
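
The reset-gsdlhome change described above (only doing the relocation work when GSDLHOME has changed since the previous launch) can be illustrated with a small sketch. This is not the gsicontrol implementation: the record file, the function names and the relocate callback are all hypothetical, and it assumes that a first launch with no recorded value counts as a change.

```python
from pathlib import Path


def gsdlhome_changed(current_home: str, record_file: Path) -> bool:
    """Return True if current_home differs from the value recorded at the last launch.

    The record file (a hypothetical one-line text file) is updated on change,
    so the next launch compares against this one.
    """
    previous = None
    if record_file.exists():
        previous = record_file.read_text(encoding="utf-8").strip()
    if previous == current_home:
        return False  # nothing moved: the expensive relocation work can be skipped
    record_file.write_text(current_home, encoding="utf-8")
    return True


def launch_server(current_home: str, record_file: Path, relocate) -> None:
    """Run the (hypothetical) relocation step only when the home directory moved."""
    if gsdlhome_changed(current_home, record_file):
        relocate(current_home)
    # ... the server itself would be started here ...
```

With this shape, launching twice from the same location calls relocate once, and moving the installation triggers it again on the next launch.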

Sam’s Greenstone Blog 11/7/2011

admin. Tuesday, July 12th, 2011.

Looks like I have some catching up to do.

My time is still mostly being spent on Greenstone 3, tidying up loose ends and making sure we haven’t forgotten anything. One thing I fixed up was what Greenstone 3 does when GLI is started. As most Greenstone 2 users will know, when you start up GLI the Greenstone 2 server window also starts. Previously in Greenstone 3 nothing happened when GLI started (if the server wasn’t running, it stayed that way), but I have modified it so that on Windows the Tomcat window launches as GLI is launched, and on Linux the server runs silently in the background.

I have also been spending some time working on the API for the new Document Maker facility that will no doubt make it into the public release of Greenstone at some stage (not 3.05 but maybe 3.06? It’s probably too early to say). Dr. David Bainbridge and I have been discussing the API in detail and I think we are close to finalising what needs to be included to support all of the operations we are planning. The next stage is figuring out how to do the things we want and then implementing them.

Blog entry for 19 June – 1st of July

ak19. Tuesday, July 5th, 2011.

Forgot to write entries for the last two weeks.

– A lot of time was devoted to ticket 449: after Dr Bainbridge’s initial solution to the problem in JavaScript, Sam and Veronica spent a lot of their time on it so that we could get it to do the same in XSLT, and at last (yesterday, 4 July) this was finished.

– Sam and Dr Bainbridge noticed that the GS2 server’s port number would keep incrementing at times if the chosen port was unavailable at that moment. Their ticket specified a way to request preserving the chosen port, so that was implemented some time last week.
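
As a rough illustration of the two behaviours (incrementing past a busy port versus preserving the chosen one), here is a minimal sketch. It is not the GS2 server’s code: the function names and the keep_port flag are made up for the example, and raising an error when a preserved port is busy is an assumption, since the ticket’s exact behaviour isn’t described here.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Check availability by trying to bind a TCP socket to the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def choose_port(preferred: int, keep_port: bool,
                is_free=port_is_free, max_tries: int = 100) -> int:
    """Pick the port to listen on (hypothetical helper).

    keep_port=False: step upwards from the preferred port until a free one is found.
    keep_port=True: never move off the preferred port (assumed here to fail instead).
    """
    if is_free(preferred):
        return preferred
    if keep_port:
        raise RuntimeError(f"port {preferred} is unavailable and must be preserved")
    # Valid TCP ports stop at 65535, so cap the search there.
    for candidate in range(preferred + 1, min(preferred + 1 + max_tries, 65536)):
        if is_free(candidate):
            return candidate
    raise RuntimeError(f"no free port found above {preferred}")
```

The availability check is passed in as a parameter so the selection logic can be tried out without opening real sockets.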

– Investigated PDF to text conversion on Windows. Ghostscript seems to support ASCII conversion, but Greenstone would need Unicode to be preserved. There were Perl solutions as well as open source programs to do this on Windows. For now, PDFBox has been tweaked to use its inbuilt ability to convert PDF to text when this is specified. Also looked into the latest version of AbiWord, which Max pointed out as a free and small-sized alternative to MS Office and Open Office for converting docX files.

– The latest updates to the acku and areu collections were uploaded.