Thursday 11 March 2010

scat @ Gloucestershire archives

Yesterday, I enjoyed a very scenic drive through the Cotswolds to attend a workshop at Gloucestershire Archives (GA). GA have been working on digital curation for a few years now, using their historical digital archives - a discrete and reasonably unproblematic set of data - as a testbed for developing approaches that might help them with the more modern digital records created by the local authority in due course.

Yesterday's event was the culmination of a six-month project 'Digital curation: from ingest to trusted storage' and was designed to provide attendees with lots of hands-on time, using the 'SCAT' tool developed through the project. The project was funded by the Society of Archivists' research fund, but builds on previous work funded by CyMAL to develop GAip - a software tool which packages digital objects ready for ingest to preservation storage. There is not much in the way of a web presence for GAip or its successor project just yet, but the slides from Viv's presentation at the Society of Archivists' digital preservation road show give a flavour of GAip at least.

SCAT
The main output of the recent project is a tool called SCAT (Scat is Curation And Trust). The tool is written in Perl and currently runs only in Linux environments; with a few modifications it should also run on a Windows platform. SCAT provides an interface to a number of open source digital curation tools that exist out in the wild; once a file or directory has been loaded into SCAT, these curation tools can be applied to it. Among the tools represented are:
  • Bagit - the Library of Congress tool mentioned elsewhere on this blog
  • GAip - GA's own packaging tool, which creates a Bagit-conforming package. This is used by GA to package its digital archives.
  • DROID - The National Archives' tool for file format identification
  • Jhove - the tool developed by JSTOR and Harvard for object identification, validation and metadata extraction
  • NLNZ metadata extraction tool - extracts basic metadata from some popular formats
  • FITS - identifies and validates files, and extracts technical metadata. It is a wrapper for a number of third-party tools (Jhove, EXIFtool, NLNZ, DROID, Ffident and the file utility). The interesting thing about FITS is that it also attempts to normalise and consolidate the metadata output from these tools. Pete is using FITS at the moment to generate certain file-level metadata for dissemination purposes.
  • Antiword - a reader for Word files
  • Imagemagick - a tool that can do many wonderful things with image files
  • xmllint - used for validating XML files against their schemas, etc.
  • Unix's file utility.
  • SWORD deposit to repository (GA have been experimenting with an eprints instance in this project)
  • tools for fixity checking employing the MD5 and SHA1 algorithms.
  • document conversion tool (destination formats odt and pdf - possibly openoffice.org?)
This list isn't comprehensive, but it's enough to show you that there is a strong open source philosophy underpinning the project.
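
To give a flavour of what the fixity-checking tools in the list above do, here is a minimal Python sketch that computes MD5 and SHA1 digests for a file and later checks them again. This is my own illustration, not SCAT's code (SCAT is written in Perl); the function names `fixity` and `verify` are made up for the example.

```python
import hashlib


def fixity(path, algorithms=("md5", "sha1"), chunk_size=65536):
    """Return {algorithm name: hex digest} for the file at `path`."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def verify(path, expected):
    """True if every digest recorded in `expected` still matches the file."""
    return fixity(path, algorithms=tuple(expected)) == expected
```

Typical use would be to store the result of `fixity(path)` at ingest and call `verify(path, stored)` on a schedule; a `False` result means the file's bytes have changed since the digests were recorded.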

SCAT is still very much alpha code, and Viv Cothey (its developer) intends to do a bit of tidying up before putting it out on the web. It's really been designed to provide a hands-on learning space for archivists, and is not conceived as a ready-to-use system for digital curation. As a learning environment, I think it is very effective, providing a workbench which can call up a whole host of tools that the archivist can experiment with.

GAip packages
It's worth saying a little something about the package produced by GAip. Using the tool, the archivist can decide to package:
  • a single item
  • a collection of materials as single items
  • a collection of materials as a bundle, retaining their directory structure.
In a GAip package you will find:
  • a copy of the original source data
  • a sidecar metadata record for each data item (GAip uses XMP for this, and the metadata includes Dublin Core, and other, metadata expressed in RDF)
  • an inventory of the files contained in the package, including their hash values
The package is a compressed tar file (.tar.gz, although GAip uses a .gaip file extension). Each package is identified by a unique timestamp.
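
The general shape of such a package can be sketched in a few lines of Python. This is only an illustration of the ingredients described above (a copy of the data, one sidecar per item, an inventory with hash values, a gzipped tar named by timestamp) and not GAip's actual layout: the `data/` and `metadata/` directory names, the `inventory.txt` file, the `.xmp` sidecar naming and the use of SHA1 are all my assumptions, and the sidecar is a bare RDF/XML stub with a Dublin Core title standing in for a real XMP packet.

```python
import hashlib
import io
import tarfile
import time
from pathlib import Path
from xml.sax.saxutils import escape

# Hypothetical sidecar: a bare RDF/XML stub, not a complete XMP packet.
SIDECAR_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:title>{title}</dc:title>
  </rdf:Description>
</rdf:RDF>
"""


def _add_bytes(archive, name, payload):
    """Add an in-memory byte string to the tar archive under `name`."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    info.mtime = int(time.time())
    archive.addfile(info, io.BytesIO(payload))


def build_package(source_dir, out_dir):
    """Bundle `source_dir`, keeping its directory structure; return the package path."""
    source_dir = Path(source_dir)
    # List the files before the package exists, so it can never include itself.
    items = sorted(p for p in source_dir.rglob("*") if p.is_file())
    # The package is named by a UTC timestamp (assumed format).
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    package_path = Path(out_dir) / f"{stamp}.gaip"
    inventory = []
    with tarfile.open(package_path, "w:gz") as archive:
        for item in items:
            relative = item.relative_to(source_dir).as_posix()
            payload = item.read_bytes()
            # Copy of the original source data.
            _add_bytes(archive, f"data/{relative}", payload)
            # One sidecar metadata record per data item.
            sidecar = SIDECAR_TEMPLATE.format(title=escape(item.name))
            _add_bytes(archive, f"metadata/{relative}.xmp", sidecar.encode("utf-8"))
            # Inventory line: hash value, two spaces, path inside the package.
            inventory.append(f"{hashlib.sha1(payload).hexdigest()}  data/{relative}")
        _add_bytes(archive, "inventory.txt", ("\n".join(inventory) + "\n").encode("utf-8"))
    return package_path
```

Calling `build_package("some_collection", "packages")` (both directories must already exist) writes one timestamped `.gaip` file that can be opened with any tool that reads .tar.gz. A timestamp with one-second resolution is only unique if packages are not created more often than that, which is one reason identifiers come up again under 'Building on SCAT'.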

This is not a million miles away from our approach with the BEAM ingest tool.

Building on SCAT
Could SCAT be developed into something more? There are a number of areas that would need to be addressed. Some of the ones raised in discussions yesterday include:
  • the need to support multiple users (both GAip and SCAT are conceived as single-user software at present)
  • a better approach to unique, and persistent, identifiers
  • a method by which data objects can accrue additional metadata in their XMP sidecars beyond that supplied when the GAip package is created
  • workflow
If any of this is of interest, then I'm sure Viv Cothey would be pleased to hear from potential collaborators.

3 comments:

Simon Spero said...

It might be worth considering renaming the product, as there are some senses of the term that may trigger the "Scunthorpe Problem" in some poorly written censorware.

Simon

Seth said...

Do you have any contact information for Viv?

Susan Thomas said...

You can find Viv's email in the slides here.