Friday, 19 June 2009

Friday Post

Couple of weeks ago now, in a developer meeting, Neil (Jefferies) mentioned the "Island of Unscalable Complexity". Even if I could remember the context, I probably wouldn't tell, but it seemed such a cool place to go visit I tried to make it real... Look close and you'll see our boat, on the way, grappling hooks at hand...

Never say "unscalable" and expect no one to try! :-)

Monday, 15 June 2009

CAIRO Content Model: A noob's overview (Part 2) Accessions

I-E-O-2. It is a curious one and no mistake. The idea of an accession wasn't something I'd come across before (though alien concepts are no surprise to me now having worked in libraries for some time) - and if you really want to confuse yourself about what it is, try this Wikipedia article on 'accession'.

All clear now? Didn't think so! (Though the bit about the museum is probably closest). For the purpose of ingest, I'm not too worried about legal ownership (this must be assumed) so my working definition - and I make no claims that it is good or accurate - is:

"an accession is a unit of stuff that arrives at a given time, at a given archive, and needs to be added to that archive"

An accession then is defined by its origin, date of arrival and the collection to which it belongs.

Like IE01, IE02 uses the METS header element to record agent data - who did what and when to this record. This will usually be "created by ingest process", "updated by CAIRO tool by user X", etc.
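As a rough sketch of what that agent data might look like when built programmatically: the snippet below uses Python's standard library to assemble a METS header with one agent per event. The element and attribute names (metsHdr, agent, name, note, ROLE, TYPE) come from the METS schema, but the role values chosen and the helper name build_mets_header are assumptions for illustration, not something the CAIRO model specifies.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def build_mets_header(events):
    """Build a metsHdr element with one agent per (role, name, note) event."""
    hdr = ET.Element(f"{{{METS_NS}}}metsHdr")
    for role, name, note in events:
        # ROLE and TYPE are METS agent attributes; the values used here are
        # illustrative guesses, not taken from the CAIRO documentation.
        agent = ET.SubElement(
            hdr, f"{{{METS_NS}}}agent", {"ROLE": role, "TYPE": "OTHER"}
        )
        ET.SubElement(agent, f"{{{METS_NS}}}name").text = name
        ET.SubElement(agent, f"{{{METS_NS}}}note").text = note
    return hdr


header = build_mets_header([
    ("CREATOR", "ingest process", "created by ingest process"),
    ("EDITOR", "CAIRO tool", "updated by CAIRO tool by user X"),
])
print(ET.tostring(header, encoding="unicode"))
```

A real record would probably also carry dates on the header, which this sketch leaves out.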

In IE02, the descriptive metadata is kept deliberately Spartan. This is because the concept of the accession remains largely transparent to researchers. While it is important to the archivist to know the source of the collection's parts, the researcher really just needs the collection and the items. So, dmdSec gives us space for an accession identifier (unitId) and title (unitTitle) (using controlled formats - which may or may not be specified yet - I'll find out as I move through the model!) and then a minimal EAD/DC/MODS description. Since we're an archive, I focussed on the EAD, but cross-walking to the others would be possible.

The dmdSec EAD description is quite minimal, listing just origination information (principal creator), a physical description (extent, in MBs), a description of the formats and a description of the software/hardware environment used. All these are given as free-text, manual fields, but I wonder how far we can get offering automated "tips" for these elements.
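One possible shape for such automated "tips", sketched under assumptions: walk the accession folder, total the file sizes for the extent, and count file extensions as a crude hint at formats. The function name suggest_description and the returned keys are hypothetical, and a file extension is only a stand-in for proper format identification.

```python
import os
from collections import Counter


def suggest_description(accession_dir):
    """Walk an accession directory and suggest extent and format values."""
    total_bytes = 0
    extensions = Counter()
    for root, _dirs, files in os.walk(accession_dir):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
            # The extension is only a rough hint at the real file format.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            extensions[ext] += 1
    return {
        "extent_mb": round(total_bytes / (1024 * 1024), 2),
        "formats": dict(extensions),
    }
```

The result could pre-fill the extent and format fields, leaving the archivist to correct or expand them by hand.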

There is also room for a pointer to some rights statement (amdSec) (as with IE01) and also a pointer to a log file that records activity over this accession (fileSec). The format of that log file remains undefined, but it'll record things like "checked all files for viruses", "identified 15 obsolete files", "transformed obsolete formats to ODT", etc. We'll have to think about how we implement this linkage. The model says use a URN to point at the log file, but I wonder if we could use a URI that points to a canned search of a general logging service. That way we can dynamically generate log reports for each accession. Needs thinking about anyways.
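To make the canned-search idea concrete, here is a minimal sketch of building such a URI. Everything specific in it is an assumption: the host logs.example.org, the parameter names accession and sort, and the helper log_search_uri are all hypothetical, since no logging service is defined yet.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: neither the host nor the query parameters
# come from the CAIRO model.
LOG_SERVICE = "http://logs.example.org/search"


def log_search_uri(accession_id, base=LOG_SERVICE):
    """Return a URI that runs a canned log search for one accession."""
    query = urlencode({"accession": accession_id, "sort": "date"})
    return f"{base}?{query}"


print(log_search_uri("ACC 2009/01"))
```

Because the identifier is URL-encoded, accession identifiers containing spaces or slashes would still produce a valid link.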

Finally there is, in the structMap, a map to the other "main divisions" of the accession. The model suggests that these are either folders or files (different types are given - like subject folder or email directory), so the accession structMap could be used to reflect the entire accession structure or just the top level directories. Which is better is unclear, but I suspect we will be adding a manifest to the accession that does list all the files (so the structure can be browsed without getting near the real data objects and, if we do it that way, without having to parse the IE02 too much).

We could put a pointer to the manifest alongside the log file pointer in fileSec.
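A manifest of that kind could be as simple as a sorted list of relative paths and sizes. The sketch below assumes a tab-separated text file; that layout and the function names build_manifest and write_manifest are illustrative choices, not anything the model prescribes.

```python
import os


def build_manifest(accession_dir):
    """Return sorted (relative_path, size_in_bytes) pairs for every file."""
    entries = []
    for root, _dirs, files in os.walk(accession_dir):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, accession_dir)
            # Forward slashes keep the manifest identical across platforms.
            entries.append((rel.replace(os.sep, "/"), os.path.getsize(full)))
    return sorted(entries)


def write_manifest(entries, manifest_path):
    """Write one 'size<TAB>path' line per file."""
    with open(manifest_path, "w", encoding="utf-8") as out:
        for rel, size in entries:
            out.write(f"{size}\t{rel}\n")
```

Browsing the accession structure then only means reading this file, not touching the data objects themselves.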

I guess that is probably no clearer than the aforementioned Wikipedia page but at least I never used the word "Inaedificatio"!

Wednesday, 10 June 2009

Our Cunning Plan

We've had lots of long discussions here about what BEAM will look like and back in April (I think) we drew our first thoughts onto our newly arrived whiteboard. The diagram remains to remind us of where we are going, though I've already got some refinements to make. However, if you've ever doubted our sanity, here is some proof that you were right! :-)

Tuesday, 9 June 2009

Presenting email archives

Just a quick thought. Am wondering whether some of the tools for presenting mailing list archives might be adapted to present personal or organisational mailboxes. Maybe something like MHonArc?

Monday, 8 June 2009

CAIRO Content Model: A noob's overview (Part 1)

I have started to take a look at the CAIRO content model to see how this will be used for futureArch and, later, BEAM. Creation of the model predates me joining the team here at Oxford so it is possible I'll entirely misrepresent the work, so feel free to ask questions or shout! :-)

As there is quite a lot to the content model, I'll cover it in parts as I read them. This is to allow my brain to digest each bit rather than try to cram the entire thing in one go. Hopefully the same idea will help anyone trying to read about it here.

So without further ado, I present Intellectual Entity 01: Collection.

Affectionately known as "I-E-O-1" (the O is really a 0 but we don't say zero), this is described in the documentation as the "descriptive overview of the collection". The metadata here is designed to enable a curator to respond to researcher enquiries and freedom of information requests and also to provide the foundation for the rest of the collection's metadata (which is broken down - as we'll see in future posts - into accessions and items).

IE01 relates to both digital and physical (really need a better word - analogue?) components of the collection, but it is important to note that it does not replace the EAD record created by the archivist - though it links to it. This begs the question: are we currently able to link to our EAD records? (I'll need to find out the answer).

Like all of our entities, the object is specified using METS. From my perspective of creating digital object ingest tools, the fields I'm worrying about are (in no particular order):

  • A list of agents, found in the metsHdr element, and recording who did what with this record.
  • An embedded EAD and/or DC and/or MODS record used to describe the collection. Nb. this is NOT the archival EAD, but rather a subset of the given schema (EAD/DC/MODS) to record a minimal amount of metadata about this aspect of the entire collection - archive identifier, country code, dates, formats, scope and access.
  • A link to a further entity (PR01) describing preservation rights of the collection.
There is also room to record a bit of structural information about the collection - the accessions that make it up for example - in the structMap.
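For illustration only, the structural part might be built like this: a structMap with a collection-level div and one child div per accession. structMap and div (with TYPE and LABEL attributes) are METS constructs, but the TYPE values, the example labels and the helper collection_struct_map are assumptions rather than what the CAIRO model actually defines.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def collection_struct_map(collection_label, accession_ids):
    """Build a structMap listing the accessions that make up a collection."""
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "logical"})
    top = ET.SubElement(
        struct_map,
        f"{{{METS_NS}}}div",
        {"TYPE": "collection", "LABEL": collection_label},
    )
    for acc_id in accession_ids:
        # One child div per accession; TYPE values here are illustrative.
        ET.SubElement(
            top, f"{{{METS_NS}}}div", {"TYPE": "accession", "LABEL": acc_id}
        )
    return struct_map


example = collection_struct_map("Example collection", ["accession-1", "accession-2"])
print(ET.tostring(example, encoding="unicode"))
```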

That is a fairly sketchy outline of IE01 and there is a lot more to it than that of course, but that is my first impression. I hope it is useful to someone, and useful to me when I've recycled my paper notes! :-)