Saturday, 27 September 2008

XML Schema for archiving email accounts


I attended several great sessions at the Society of American Archivists conference last month. There is a wiki for the conference, but very few of the presentations have been posted so far...

One session I particularly enjoyed addressed the archiving of email - 'Capturing the E-Tiger: New Tools for Email Preservation'. Archiving email is challenging for many reasons, which were very well put by the session speakers.

Both the EMCAP and CERP projects were introduced in the session.

EMCAP is a collaboration between state archives in North Carolina, Kentucky, and Pennsylvania to develop means to archive email. In the past, the archives have typically received email on CDs from a variety of systems, including MS Exchange, Novell GroupWise and Lotus Notes. One of the interesting outcomes of this work is software (an extension of the hmail software - see SourceForge) that enables ongoing capture of email from user systems, with users selecting the messages to be archived. Email identified for archiving is normalised to an XML format and can be transformed to HTML for access. The software supports open email standards (POP3, SMTP, and IMAP4) as well as MySQL and MS SQL Server. The effort has been underway for five years and the software continues to be tested and refined.

CERP is a collaboration between the Smithsonian Institution Archives and the Rockefeller Archive Center. Their situation has more in common with archiving email at the Bodleian, where an email account is more likely to be accessioned from its owner in bulk than cumulatively. Ricc Ferrante gave an overview of the issues encountered, which were similar to our experiences on the Paradigm project and in working with creators more generally.

CERP has worked with EMCAP to publish an XML schema for preserving email accounts. Email is first normalised to mbox format and then converted to this XML standard using a prototype parser built in Squeak Smalltalk, which also has a web interface (Seaside/Comanche). The result of the transformation is a single XML file that represents an entire email account in its original arrangement. Attachments can be embedded in the XML file, or externally referenced if on the larger side (over 25 KB). If I remember rightly, the largest email account that has been processed so far is c. 1.5 GB; we have one at the Library that's significantly larger and I'd like to see how the parser handles this. It will be interesting to compare the schema/parser with The National Archives of Australia's Xena. The developers are keen to receive comments on the schema, which is available here.
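
To make the shape of that conversion concrete, here is a rough sketch in Python of the general idea: read an mbox file, write one XML tree for the whole account, embed small attachments and reference larger ones externally. This is not the CERP parser (which is written in Smalltalk), and the element names, attribute names and the `attachments/` path are invented for illustration rather than taken from the CERP/EMCAP schema. The 25 KB cut-off is simply the figure quoted above.

```python
import base64
import mailbox
import xml.etree.ElementTree as ET

# Assumed cut-off, based on the "over 25 KB" figure quoted above.
EXTERNAL_THRESHOLD = 25 * 1024


def account_to_xml(mbox_path, threshold=EXTERNAL_THRESHOLD):
    """Return (root, external) for every message in an mbox file.

    root     -- an ElementTree element holding the whole account
    external -- dict mapping a relative path to the bytes of each
                attachment that was too large to embed

    Element and attribute names are hypothetical, not the real schema.
    """
    root = ET.Element("Account")
    external = {}
    # create=False: fail on a wrong path instead of creating an empty file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box):
            msg_el = ET.SubElement(root, "Message", id=str(index))
            for header in ("From", "To", "Date", "Subject"):
                value = message.get(header)
                if value is not None:
                    ET.SubElement(msg_el, header).text = str(value)
            for part_no, part in enumerate(message.walk()):
                if part.is_multipart():
                    continue  # containers carry no content of their own
                payload = part.get_payload(decode=True) or b""
                filename = part.get_filename()
                if filename is None and part.get_content_maintype() == "text":
                    # Treat unnamed text parts as message body.
                    charset = part.get_content_charset() or "utf-8"
                    try:
                        text = payload.decode(charset, errors="replace")
                    except LookupError:  # unknown charset label
                        text = payload.decode("utf-8", errors="replace")
                    ET.SubElement(msg_el, "Body").text = text
                    continue
                filename = filename or f"part-{part_no}"
                if len(payload) > threshold:
                    # Large attachment: keep the bytes outside the XML.
                    href = f"attachments/{index}-{part_no}-{filename}"
                    external[href] = payload
                    ET.SubElement(msg_el, "ExternalAttachment",
                                  filename=filename, href=href)
                else:
                    # Small attachment: embed as base64 text.
                    att = ET.SubElement(msg_el, "Attachment", filename=filename)
                    att.text = base64.b64encode(payload).decode("ascii")
    finally:
        box.close()
    return root, external
```

The tree could then be written out with `ET.ElementTree(root).write("account.xml", encoding="utf-8", xml_declaration=True)`, and each entry in `external` saved to its own file. A real tool would also need to preserve folder structure and the full set of headers, and to cope with malformed messages, none of which this sketch attempts.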

5 comments:

Eric Lease Morgan said...

As a librarian who has systematically applied the processes of librarianship against email for about fifteen years, I found this posting very interesting. [1]

I would certainly advocate an XML schema used to encode email, but I got a chuckle when the posting alluded to mbox. It is the MARC of the SMTP world.

Creating XML versions of mbox data would go a long way in the collection, re-distribution, and additional functionality of emailed content. And email content of today is the archival letter content of tomorrow.

Is there a freely available and relatively easy-to-use mbox to email XML parser?

[1] http://serials.infomotions.com/

--
Eric Lease Morgan
University Libraries of Notre Dame

Susan Thomas said...

I think the idea is that the CERP tool will be available, and XENA is available already (via sourceforge). I would be interested to know of others that might be out there.

Maureen said...

Hi Susan,

Thanks for flagging this up. I don't know of any other tools working on mbox files (other than XENA) - as you know, the work we did on Testbed was on the scale of individual email messages rather than an entire inbox. The MS Outlook email to XML converter is still available at http://www.digitaleduurzaamheid.nl/index.cfm?paginakeuze=299 .

Do you know if there is a related CERP/EMCAP project that's looking at preservation of attachments (rather than just encoding them in the XML file or referencing them externally)?

I'd also be interested to know why some (admittedly a very small number) of the messages in their pilots couldn't be converted - any ideas?

Anonymous said...

Most of the emails we had difficulties with had to do with date formats or similar things, where either the original message didn't actually comply with the Internet Message Format, RFC 2822, or the standard is so vague as to allow for many different interpretations and implementations of structured content. As we encountered these we were able to enhance the parser to recognize and handle most situations. However, we know that the parser, as with any other software, matures with continued attention. A first step would be testing it with more and more diverse content. We're working through some open source licensing steps so that we can make the parser available.

Susan Thomas said...

Hi Ricc,

That's really useful - thanks! Look forward to seeing the parser.

S