Sam’s Greenstone Blog 27/7/2011

sjm84. Wednesday, July 27th, 2011.

It’s been a while since I last wrote, so I’ll fill you all in on what has been happening.

We have been working fairly solidly on some improvements for 2.85. One thing we have been aiming to do is improve the handling of PDF files with complex embedded metadata. We have added several options to the EmbeddedMetadataPlugin that allow more advanced manipulation of metadata arrays (metadata values that have multiple entries, like ex.PDF.Keywords).

We have also fixed several issues that arise when two similar documents are put into Greenstone (for example, two identical PDF documents that have different embedded metadata).

In other news, we are currently taking another look at the way we process PDF files. As some of you may know, we introduced the PDFBox extension along with 2.84 as a way of converting the latest PDF formats to HTML (pdftohtml can only convert the earlier PDF formats). PDFBox works well, except that it does not extract the images from the PDF as pdftohtml does. It is also fairly large, which is why we need to keep it as an extension rather than bundle it with Greenstone. Unfortunately for us, the pdftohtml utility has not been in active development for quite a while now, so it has not been upgraded to deal with the more recent PDF versions. However, the Xpdf library that pdftohtml uses is still in active development, so we have been exploring the viability of upgrading pdftohtml ourselves.

Alongside this, I am continuing to work on the Document Maker for Greenstone 3. I have a skeleton of the program in place and have started filling it out.
