Anu’s entry for the week of Mon 13 Jun 2011

ak19. Friday, June 17th, 2011.

Since Dr Bainbridge was away this week and because I was at an impasse regarding my last ticket for GS3, I had already decided last Friday to consider two requests that seemed feasible and just required a bit of investigation:

1. Diego noticed that use_sections didn’t work with the PDFBox extensions. So some changes were made to the PDFBox code to generate the HTML page by page, and PDFPlugin was adjusted to handle it. It turned out that changes to the PDFBox code were unnecessary after all: the latest PDFBox jar file was already inserting page separator elements, so the latest binary was all that was needed, and the updated PDFPlugin merely had to pick up on them.

2. Professor Witten had suggested that if someone had Word or Office 2007 installed on their Windows machine, windows_scripting should be able to convert docx to HTML for us without requiring OpenOffice. Last week I tested whether this was already possible, but word2html, which uses native Windows scripting, cropped the “docx” extension down to “doc” and declared that the file couldn’t be found. It was not possible to get at the VB source code to modify it, so the next idea was to find WSH (Windows Script Host) scripts for converting docx, since WSH tends to be enabled on Windows by default. There was a WSH script on the web for extracting all the *text* from a docx, but that wasn’t quite what we wanted, since formatting would be lost.

Fortunately for us, Veronica recalled the existence of a VBScript for WSH that promised to do just what we needed: docx to HTML. After she located it for us, all that was required were some modifications to integrate it with gsConvert.pl and get things working in the default case, where Office/Word 2007 is installed. It worked fine on XP. Some further changes were then needed to handle the error case where Word is not installed on the machine. Once the VBScript’s error output went to STDERR, it also behaved well on the Vista machine, which has no Word installed.

It still needs to be tested on a Windows machine that has a version of Word predating the docx format.

3. The idea is to expand the VBScript to have subroutines that handle xlsx and pptx files as well. Part of the code for pptx is already working (opening the document), but there may be differences between how Word and PowerPoint open or save files, as the universal Office SaveAs method wasn’t working for me.

4. Temporarily fixed a bug in GS’s classifiers that was reported on the mailing list, and sent the tentative fix to the person who reported it:

When a user enters non-English characters for a buttonname, Perl does not preserve them, so they display incorrectly in the browser. The fix required me to assume that the user would have entered the buttonname in UTF-8, and I have got the Perl code to work on that basis. I still need to talk to Dr Bainbridge about whether my assumption is reasonable before committing the code for everyone.
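To illustrate the kind of problem involved, here is a minimal sketch in Python rather than the actual Perl fix. It shows how UTF-8 bytes get garbled when they are treated as single-byte characters, and how decoding them as UTF-8 preserves them. The function name decode_buttonname is hypothetical, and the Latin-1 fallback is my own illustrative assumption, not part of the tentative fix.

```python
def decode_buttonname(raw: bytes) -> str:
    """Decode a user-supplied button name, assuming it was entered as UTF-8.

    Hypothetical helper for illustration only; not Greenstone code.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 maps every byte to a character,
        # so this branch never raises.
        return raw.decode("latin-1")


# Bytes as they might arrive from a configuration file saved as UTF-8.
raw = "Année".encode("utf-8")

# Treating each byte as one character garbles the accented letter.
garbled = raw.decode("latin-1")   # 'AnnÃ©e'

# Decoding under the UTF-8 assumption preserves it.
fixed = decode_buttonname(raw)    # 'Année'
```

If the assumption is wrong and the input is not valid UTF-8, the sketch falls back to a byte-per-character reading instead of failing, which is one possible way to keep the display from breaking altogether.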

There were some questions on the mailing list which I finally got round to answering. Still to look at are Diego’s request for implementing the “allvalues” option in the List classifier, and number 3 above.
