Greenstone Archives collection


This is a collection of email messages from the Greenstone mailing list archives for November and December 2008.

How the collection works

The Greenstone Archives collection uses the Email plugin, which parses files in email formats. In this case, there is one file per month per mailing list, and each file contains many email messages. The Email plugin splits these files into individual documents and produces Title, Subject, From, FromName, FromAddr, Date, DateText, and InReplyTo metadata, plus Headers metadata as an option.
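The splitting step can be illustrated with a minimal Python sketch. This is not the Email plugin's own code: it assumes mbox-style input in which each message is preceded by a separator line beginning with "From ", and it assumes that Date holds a sortable yyyymmdd value while DateText keeps the original header string. The function names split_messages and extract_metadata, and the sample addresses, are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def split_messages(archive_text):
    """Split mbox-style text into one string per message.

    Assumption: every message is preceded by a separator line that
    starts with 'From ' (body lines are assumed to be escaped).
    """
    messages = []
    current = None
    for line in archive_text.splitlines():
        if line.startswith("From "):
            # A separator line closes the previous message, if any.
            if current is not None:
                messages.append("\n".join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append("\n".join(current))
    return messages


def extract_metadata(raw_message):
    """Return a dict of metadata named after the fields in the text."""
    msg = message_from_string(raw_message)
    from_header = msg.get("From", "")
    name, addr = parseaddr(from_header)
    date_text = msg.get("Date", "")
    try:
        # Assumed format for Date: yyyymmdd.
        date = parsedate_to_datetime(date_text).strftime("%Y%m%d")
    except (TypeError, ValueError):
        date = ""
    return {
        "Subject": msg.get("Subject", ""),
        "From": from_header,
        "FromName": name,
        "FromAddr": addr,
        "Date": date,
        "DateText": date_text,
        "InReplyTo": msg.get("In-Reply-To", ""),
    }


SAMPLE = """From alice@example.org Mon Nov  3 10:00:00 2008
From: Alice Example <alice@example.org>
Subject: Building a collection
Date: Mon, 3 Nov 2008 10:00:00 +1300

How do I rebuild?

From bob@example.org Tue Nov  4 09:30:00 2008
From: Bob Example <bob@example.org>
Subject: Re: Building a collection
Date: Tue, 4 Nov 2008 09:30:00 +1300
In-Reply-To: <1@example.org>

Use the build command.
"""

documents = [extract_metadata(m) for m in split_messages(SAMPLE)]
```

Each entry in documents corresponds to one email message, which is the unit that the classifiers and indexes described in this section operate on.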

The collection configuration file, etc/collectionConfig.xml, specifies <importOption name="groupsize" value="200"/>. This option stores documents together in groups of 200. Email collections typically have many small documents, and grouping them prevents Greenstone's internal file structures from becoming bloated and occupying more disk space than necessary. Notice that the Email plugin first splits the input files into individual emails, and groupsize then groups them together again. This gives the collection designer control over how many documents are stored in each group, independently of how the messages were packaged in the input files.
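The effect of groupsize can be pictured as simple chunking. The sketch below is only an illustration of the idea, not Greenstone's import code, and group_documents is a hypothetical name.

```python
def group_documents(docs, groupsize=200):
    """Chunk a list of documents into consecutive groups.

    Every group holds `groupsize` documents except possibly the last,
    which holds whatever remains.
    """
    if groupsize < 1:
        raise ValueError("groupsize must be at least 1")
    return [docs[i:i + groupsize] for i in range(0, len(docs), groupsize)]


# 450 individual messages end up in three groups.
groups = group_documents(list(range(450)), groupsize=200)
group_sizes = [len(g) for g in groups]
```

With 450 messages and a group size of 200, the last group holds the remaining 50 documents.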

The indexes line specifies three searchable indexes, which can be seen by clicking beside the word "Messages" on the search page to reveal a drop-down menu. The first (called Messages) is created from the document text, while the other two are built from the From and Subject metadata.

There are three classifiers, based on Subject, FromName, and Date metadata. The AZCompactList classifier used for the first two is like AZList but generates a bookshelf for duplicate items. Each classifier is represented by a tree structure whose nodes are either leaf nodes, which represent documents, or internal nodes. A metadata item called numleafdocs gives the total number of documents below an internal node. The format statement for the first classifier, called CL1Vlist, checks whether this item exists. If it does, the node must be an internal one, and it is labeled by its Title. Otherwise the node's label starts with the Subject, which links to the document, then gives the FromName metadata with a link to "Search by Sender", followed by the DateText.
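The branching in the CL1Vlist format statement can be sketched in Python. The real format statement is written in the collection configuration, not in Python; the Node class, the add_numleafdocs and label helpers, and the plain-text label layout are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A classifier tree node: a leaf (document) or an internal node."""
    metadata: dict
    children: list = field(default_factory=list)


def add_numleafdocs(node):
    """Store the number of documents below each internal node.

    Leaves get no numleafdocs item, so its presence marks a node
    as internal. Returns the number of leaf documents under `node`.
    """
    if not node.children:
        return 1
    total = sum(add_numleafdocs(child) for child in node.children)
    node.metadata["numleafdocs"] = total
    return total


def label(node):
    """Mimic the first classifier's check on numleafdocs."""
    m = node.metadata
    if "numleafdocs" in m:
        # Internal node: labeled by its Title.
        return m["Title"]
    # Document node: Subject, then FromName with a search link, then DateText.
    return f"{m['Subject']}, {m['FromName']} (Search by Sender), {m['DateText']}"
```

The second classifier differs only in what each branch prints, so the same existence check on numleafdocs applies there too.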

The second classifier (CL2Vlist) is similar but shows slightly different information. For internal nodes, the number of leaf documents (numleafdocs) is given in parentheses after the Title. For document nodes, the FromName (with a link to "Search by Sender"), the Subject (linked to the document), and the DateText metadata are shown.

The third classifier is a DateList, which allows selection by month and year.
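Selection by month and year amounts to bucketing documents on the leading part of their date. The following sketch assumes that Date metadata is a yyyymmdd string; group_by_month is a hypothetical helper, not part of the DateList classifier.

```python
from collections import defaultdict


def group_by_month(docs):
    """Bucket documents by (year, month), in chronological order.

    Assumption: each document's Date metadata is a 'yyyymmdd' string.
    """
    buckets = defaultdict(list)
    for doc in docs:
        date = doc["Date"]
        buckets[(date[:4], date[4:6])].append(doc)
    # Sorting the keys orders the buckets by year, then month.
    return dict(sorted(buckets.items()))


docs = [
    {"Subject": "Re: Building", "Date": "20081204"},
    {"Subject": "Install", "Date": "20081103"},
    {"Subject": "Re: Install", "Date": "20081117"},
]
by_month = group_by_month(docs)
```

Here the two November messages fall into one bucket and the December message into another.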

Finally, the documentHeading is overridden to show the header fields FromName, DateText, Subject, and InReplyTo, because the default documentHeading would neither show the InReplyTo field nor label the fields. The default documentContent already displays the message text (through its call to <xsl:call-template name="documentNodeText"/>). FromName is linked to a search on that name, while InReplyTo links to the email message that it refers to.