Wednesday, 15 December 2010

Digital forensics and digital archives

If you're interested in how digital forensics methodologies and tools could be, and are being, applied to digital archives then you might like to take a look at a new report published by CLIR. It also includes a sidebar outlining how forensics tools are incorporated into our workflow.

Friday, 26 November 2010

What I learned from the word clouds...

Now, word clouds are probably a bit out of fashion these days. Like a Google Map, they just seem shiny but most of the time quite useless. Still, that hasn't stopped us trying them out in the interface - because I'm curious to see what interesting (and simple to gather) metadata n-grams and their frequencies can suggest.

Take for instance the text of "Folk-Lore and Legends of Scotland" [from Project Gutenberg] (I'm probably not allowed to publish stuff from a real collection here and chose this text because I'm pining for the mountains). It generates a "bi-gram"-based word cloud that looks like this:

Names (of both people and places) quickly become obvious to human readers, as do some subjects ("haunted ships" is my favourite). To make it more useful to machines, I'm pretty sure someone has already tried cross-referencing bi-grams with name authority files. I also imagine someone has used bi-grams as facets. Theoretically a bi-gram like "Winston Churchill" may well turn up in manuscripts from multiple collections. (Anyone know of any successes doing these things?)
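For the curious, gathering the bi-grams themselves needs nothing as clever as Jonathan Feinberg's tokenizer. A minimal Java sketch (my own simplification for illustration, not the code on GitHub) might look like this:

```java
import java.util.HashMap;
import java.util.Map;

// Count bi-gram frequencies in a text - the raw material for a word cloud.
public class BigramCounter {
    public static Map<String, Integer> count(String text) {
        // Crude tokenisation: lowercase, split on anything that isn't a letter
        // or apostrophe (a real tokenizer would handle hyphens, accents, etc.)
        String[] words = text.toLowerCase().split("[^a-z']+");
        Map<String, Integer> freqs = new HashMap<>();
        for (int i = 0; i < words.length - 1; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) continue;
            String bigram = words[i] + " " + words[i + 1];
            freqs.merge(bigram, 1, Integer::sum);
        }
        return freqs;
    }
}
```

Feed the resulting map into the cloud renderer, scaling font size by frequency, and you have the picture above.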

Still, for now I'll probably just add the word clouds of the full texts to the interface, including one as a "summary" of each shelfmark, and then see what happens!

I made the (very simple) Java code available on GitHub, but I take no credit for it! It is simply a Java reworking of Jim Bumgardner's word cloud article using Jonathan Feinberg's tokenizer (part of Wordle).

Wednesday, 10 November 2010

The as yet unpaved publication pathway...

It has been a while since we had a whiteboard post, so I thought it was high time we had one! This delightful picture is the result of trying to explain the "Publication Pathway" - Susan's term for making our content available - to a new member of staff at the Library...

Nothing too startling here really - take some disparate sources of metadata, add a sprinkling of auto-gen'd metadata (using the marvelous FITS and the equally marvelous tools it wraps), migrate the arcane input formats to something useful, normalise and publish! (I'm thinking I might get "Normalise and Publish!" printed on a t-shirt! :-))

The blue box CollectionBuilder is what does most of the work - it constructs an in-memory tree of "components" from the EAD, tags the items onto the right shelfmarks, augments the items with additional metadata, and writes the whole lot out in a tidy directory structure that even includes a FOXML file with DC, PREMIS and RDF datastreams (the RDF is used to maintain the hierarchical relationships in the EAD). That all sounds a lot neater than it currently is but, like all computer software, it is a work in progress that works, rather than a perfect end result! :-)

After that, we (will - it ain't quite there yet) push the metadata parts into the Web interface and from there index it and present it to our lovely readers!


The four boxes at the bottom are the "vhysical" layout - it's a new word I made up to describe what is essentially a physical (machine) architecture, but is in fact a bunch of virtual machines...

For the really attentive among you, this shot is of the whiteboard in its new home on the 2nd floor of Osney One, where Renhart and I have moved following a fairly major building renovation. Clearly we were too naughty to remain with the archivists! ;-)

Wednesday, 29 September 2010

(ZX) spectrums & stained glass

Only slightly off topic, but I was pointed at a Java-based ZX Spectrum emulator site this morning. I can't decide what is more impressive: the site itself, that Java can be used to emulate a ZX Spectrum, or that there are games listed from 2010!

Slightly more on topic, I was curious to read this article about stained glass, or is it file formats. Skip over the rather flowery bit and start reading from:

"The archive problem is one of format" and concludes "Then in five hundred years time... pictures of the Auch windows will be stored and accessible in the cloud. But not one of them...will have the jaw-dropping impact of people seeing them for the very first time...and realising that humans could do wondrous things for themselves."

I read another article, this time from New View (not something I'd usually read, but an article about the "origins of computing" was recommended to me), which likened stained glass windows to computer screens - mostly because of the back lighting - and suggested, a little like Chris Mellor in the Register, that the cultural experience of seeing stained glass is a bit like experiencing computers for the first time.

(Another article berated digital technology, almost branding it evil, and certainly bad enough to make the author ill from using it. I was reading this at DrupalCon, where 90% of the people were welded to their MacBooks, and wondered how bad using a plough might make these people feel! ;-))

I guess many folks don't have the "big bang" experience of the digital world. It crept up slowly - a bit like watching the cathedral and its windows being built would probably reduce its wonder - going from a ZX81 to a Pet to a BBC B to an Amstrad 1640 to a... well, you get the picture...

But does that make it any less astonishing? It shouldn't. If anything, it is all about helping people do "wondrous things for themselves", just like the stained glass.

Friday, 17 September 2010

Media players and the reader interface...

This is quite a long post, so I'm going to put the final line at the top too in case you don't read that far... ;-)

Your thoughts on media players would be most welcome!

The trouble, um, I mean, beauty of digital collections is that they redefine what a "manuscript" is. This is nothing new. Once upon a time someone somewhere probably upset the apple cart when they arrived at the hallowed doors with a basket full of photographs. Now we have video, audio and images, all of which can be encoded in any number of "standard" ways. (Not to mention a zillion different binary formats for just about any purpose you can imagine from sheet music to the latest car designs, which may well require more than just document-like presentation too - 3D models for example). These new manuscripts bring challenges for preservation, of course, but they also present challenges for presentation.

To address this, I've been learning more about media players in browsers with a view to picking one for the reader interface. I'm no expert in this field, so here is my layman's consideration of what I've found out.

The traditional method to render audio/video in browsers, which pre-dates their ability to handle video themselves, is to use a browser plug-in, either directly (for example the VLC plugin) or (more commonly) to build on top of Flash (eg. Flowplayer) or Java (eg. Cortado). The exact mark-up required to use these players varies. Some simply use the "embed" tag and others have JavaScript libraries to simplify their usage and allow for graceful degradation in the event that the browser does not have the correct plug-in or cannot run JavaScript. (This may be an issue when we deploy the interface into a reading room with machines whose configuration we do not control.)

But the times, they are a-changin'. Just as old browsers knew what to do when presented with an "img" tag, most modern browsers are beginning to support HTML5's "video" and "audio" tags, allowing the browser itself to handle the playback rather than farming this out to a plug-in. (For more on HTML5 generally see this presentation - the video tag is mentioned at about 58 minutes in). An added bonus of bringing video into the browser in this way is that it has inspired folks to build media players that manipulate the Web page to add the correct mark-up - be it a video tag, an embed, or whatever - to play the media. This is currently being used to generate some nice media players that'll use the browser, the Flash plug-in, or whatever is available (see OpenVideoPlayer and OSMPlayer).
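To make that concrete, here is a minimal sketch of the kind of mark-up such players end up generating - browsers that understand HTML5 try the sources in turn, while older ones fall through to the Flash object (the file names and player paths here are invented for illustration):

```html
<!-- HTML5-capable browsers pick the first source they can decode;
     anything older ignores the video tag's innards except the fallback. -->
<video controls width="480">
  <source src="clip.ogv" type="video/ogg">
  <source src="clip.mp4" type="video/mp4">
  <!-- Fallback for browsers without HTML5 video: a Flash player -->
  <object type="application/x-shockwave-flash" data="player.swf"
          width="480" height="360">
    <param name="movie" value="player.swf">
    <param name="flashvars" value="file=clip.flv">
    Sorry, your browser cannot play this video.
  </object>
</video>
```

The JavaScript-driven players mentioned above essentially write this cascade into the page for you, choosing the branches based on what they detect.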

So now we get to the crux of it. What should we do for the reader interface? Go old-school (and annoy Steve Jobs) and use a Flash-based player? Adopt the new ways of HTML5? Insist on an Open Source player? Buy something in?

To work out the answer I did a bit of investigating and have installed most of the players mentioned thus far in this post - Flowplayer, OSMPlayer, video-tag only, VLC and Cortado, as well as JWPlayer.

Flowplayer uses the Flash-plugin to play Flash video (and, with an additional plug-in, MP3 audio) - it does not support Ogg. It is very simple to use and very slick to look at. It is open source, released under GPL3 with an additional (and reasonable) "attribution clause" which basically means the Flowplayer logo must appear on the player unless you pay extra.

JWPlayer works much like Flowplayer (though there is also a beta HTML5 video player in the making) and seems pretty good. While the source code is available, it is not clear if this is an open source product or otherwise - the source files do not include a LICENSE.txt or any boilerplate. Probably I'm just missing something there though, and JWPlayer seems a good choice if you don't mind Flash.

OSMPlayer is also open source and has numerous options for installation including a Drupal module (untested), a PHP library and a "stand-alone" configuration. In theory it supports lots of different audio and video formats and uses several divs to create a nice browser based player. Unfortunately, following the guidelines for both PHP and stand-alone configurations, I could not get it to work on my test server.

Video-tag only works pretty well with Firefox 3.6 on Ubuntu 10.04 and is very easy to include in a Web-page. Unfortunately it isn't nearly as slick at playback as Flowplayer - there is a delay in starting the video and it is unclear what is going on.

The VLC plug-in is also open source and seems to work pretty well and should be able to handle many different formats, but it isn't nearly as refined as other players and the provided example code fails to stop the video or make it full-screen. The VLC desktop player is wonderful, but I'm not convinced by the Firefox plug-in.

Cortado is a Java applet provided to play Ogg Theora among other things. Usage is very simple - you just add an applet tag to the page - but playback was jerky, slow and lacked sound. I do not know if my machine is to blame for this or if it is the player itself, so I will have to investigate further.

Were I sat down and forced to make a choice, I think I'd struggle. Flowplayer is slick to use and easy to implement, but requires we convert everything to Flash video or MP3 (mind you, most media will arrive in suitable formats I imagine). JWPlayer is very similar in this regard. I'd like to adopt the video tag as it supports a wide range of formats, including open ones, but currently the experience is not very smooth, and refinements in this area provided by things like OSMPlayer are still in their early stages of development - JWPlayer's HTML5 offering is still beta, for example.

I guess my feeling for now is to either go with Flowplayer (and swallow the conversions required - actually pretty easy with ffmpeg) or spend a bit of time with OpenVideoPlayer's HTML5 work and the video tag. At this stage I think we probably need both working in the interface and see where the better user experience is...
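As an aside, that conversion step really is easy to script. A hedged sketch in Java, assuming ffmpeg is on the PATH - the flag values here are illustrative defaults, not settings we've actually settled on:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a Flash-video conversion step driven from Java via ffmpeg.
public class FlvConvert {
    // Build the ffmpeg command line (kept separate so it can be inspected).
    public static List<String> command(String input, String output) {
        return Arrays.asList(
            "ffmpeg",
            "-i", input,      // source file, in whatever format it arrived
            "-ar", "44100",   // resample audio to a rate FLV players expect
            "-b:v", "700k",   // a modest video bitrate for in-browser playback
            output            // e.g. "item-0042.flv"
        );
    }

    // Run the conversion, inheriting stdout/stderr so ffmpeg's progress shows.
    public static int run(String input, String output) throws Exception {
        Process p = new ProcessBuilder(command(input, output))
                .inheritIO()
                .start();
        return p.waitFor(); // 0 on success
    }
}
```

Wrap that in a loop over an accession's media files and the "swallow the conversions" option starts to look fairly painless.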

I should throw one more thing into the pot - the problem of formats. Video and audio files are complicated beasts consisting of containers and tracks and such - a bit like cassettes! The contents of these containers are encoded in a variety of ways, each requiring different software to decode and render their content. We have the same problem with documents and we solve that by converting all the text-based materials we get into PDFs (for presentation before anyone starts worrying about the preservation implications of PDF!) and use a PDF plug-in to display them.

Can we do the same with our audio/video material and, if we can, what format (I'm using "format" as a general term to mean "container/encoding"!) do we use? (Victoria has already done some work along these lines, creating WAVs for storage and MP3s for presentation, from audio CDs). Are there any additional concerns given that most born-digital video/audio is likely to arrive at our doors in a compressed format? Should we uncompress it? Is such a thing even possible? Should we (and do we have the processing power to) convert all audio/video materials to open formats for both preservation and presentation purposes?

We're going to raise this final question at our next Library developer meeting and see what folks think. In theory we can delay the decision because most browsers and their plug-ins handle multiple formats, but perhaps we should have a standard delivery format much like we currently have PDF?

Oh dear. I started writing this post with the hope of finding all the answers! I have found out a lot about media players at least, which can only be a good thing, and I've also found out that the state of the art is not quite as far along as the proponents of HTML5 killing Flash would like us to believe - though there is good work going on here and this is the future. I'm also unclear just how much my experience of these things is hindered by using Ubuntu - I often wrestle with the playback of media files under Linux! :-)

Still, I think we're further along, nearer an answer and at least in a place to know where to start testing...

Your thoughts on media players would be most welcome! :-)

Friday, 3 September 2010

Wot I Lernd At DrupalCon

I spent last week in the lovely city of Copenhagen immersed in all things Drupal. It was a great experience, not just because of the city (so many happy cyclists!), but because I'd not seen a large scale Open Source project up close before and it is a very different and very interesting world!

I'm going to pick out some of my highlights here as to cover it all would take days, but if you want to know more I'd encourage you to check out the conference Web site and the presentation videos online.

So, wot did I lernd?

Drupal Does RDF
OK, so I knew that already, but I didn't know that from Drupal 7 (release pending) RDF support will be part of the Drupal core, showing a fairly significant commitment in this area. Even better, there is an active Semantic Web Drupal group working on this stuff. While "linked data" remains something of an aside for us (99.9% of our materials will not make their way to the Web any time soon) the "x has relationship y with z" structure of RDF is still useful when building the BEAM interfaces - for example Item 10 is part of shelfmark MS Digital 01, etc. There is also no harm in trying to be future proof (assuming the Semantic Web is indeed the future of the Web! ;-)) for when the resources are released into the wild.
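As a sketch of what that "x has relationship y with z" structure looks like in RDF (Turtle syntax here; the namespace and identifiers are placeholders I've invented for illustration, not our actual URIs):

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix beam:    <http://example.org/beam/> .

# "Item 10 is part of shelfmark MS Digital 01"
beam:item-10        dcterms:isPartOf  beam:ms-digital-01 .
beam:ms-digital-01  dcterms:title     "MS Digital 01" .
```

Exactly the kind of triple an interface can walk to rebuild the collection hierarchy.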

Projects like Islandora and discussions like this suggest growing utility in the use of Drupal as an aspect of institutional repositories, archives or even Library catalogues (this last one being my (pxuxp) experiment with Drupal 6 and RDF).

Speaking of IRs...

Drupal Does Publishing
During his keynote, Dries Buytaert (the creator of Drupal) mentioned "distributions". Much like Linux distributions, these are custom builds of Drupal for a particular market or function. (It is testament to the software's flexibility that this is possible!) Such distributions already exist and I attended a session on OpenPublish because I wondered what the interface would look like and also thought it might be handy if you wanted to build, for instance, an Open Access Journal on top of an institutional repository. Mix in the RDF mentioned above and you've a very attractive publishing platform indeed!

Another distro that might be of interest is OpenAtrium which bills itself as an Intranet in a Box.

Drupal Does Community
One of my motivations in attending the conference was to find out about Open Source development and communities. One of the talks was entitled "Come for the Software, Stay for the Community" and I think part of Drupal's success is its drive to create and maintain a sharing culture - the code is GPL'd for example. It was a curious thing to arrive into this community, an outsider, and feel completely on the edge of it all. That said, I met some wonderful people, spent a productive day finding my way around the code at the "sprint" and think that a little effort to contribute will go a long way. This is a good opportunity to engage with a real life Open Source community. All I need to do is work out what I have to offer!

Drupal Needs to Get Old School
There were three keynotes in total, and the middle one was by Rasmus Lerdorf of PHP fame, scaring the Web designers in the audience with a technical performance analysis of the core Drupal code. I scribbled down the names of various debugging tools, but what struck me the most was the almost bewildered look on Rasmus' face when considering that PHP had been used to build a full-scale Web platform. He even suggested at one point that parts of the Drupal core should be migrated to C rather than remain as "a PHP script". There is something very cool about C. I should dig my old books out! :-)

HTML5 is Here!
Jeremy Keith gave a wonderful keynote on HTML5, why it is like it is and what happened to xhtml 2.0. Parts were reminiscent of the RSS wars, but mostly I was impressed by the HTML 5 Design Principles which favour a working Web rather than a theoretically pure (XML-based) one. The talk is well worth a watch if you're interested in such things and I felt reassured and inspired by the practical and pragmatic approach outlined. I can't decide if I should start to implement HTML5 in our interface or not, but given that 5 is broadly compatible with the hotchpotch of HTMLs we all code in now, I suspect this migration will be gentle and as required rather than a brutal revolution.

Responsive Design
I often feel I'm a little slow at finding things out, but I don't think I was the only person in the audience to have never heard about responsive Web design, though when you know what it is, it seems the most obvious thing in the world! The problem with the Web has long been the variation in technology used to render the HTML. Different browsers react differently and things can look very different on different hardware - from large desktop monitors, through smaller screens to phones. Adherence to standards like HTML5 and CSS3 will go a long way to solving the browser problem, but what of screen size? One way would be to create a site for each screen size. Another way would be to make a single design that scales well, so things like images disappear on narrower screens, multiple columns become one, etc.

Though not without its problems, this is the essence of responsive design and CSS3 makes it all possible. Still not sure what I'm on about? dconstruct was given as a good example. Using a standards compliant browser (ie. not IE! (yet)) shrink the browser window so it is quite narrow. See what happens? This kind of design, along with the underlying technology and frameworks, will be very useful to our interface so I probably need to look more into it. Currently we're working with a screen size in mind (that of the reading room laptop) but being more flexible can only be a good thing!

There were so many more interesting things but I hope this has given you a flavour of what was a grand conference.

Wednesday, 1 September 2010

The Case for Digital Preservation

Now, I'm pretty sure there is no need for me to make the case to the good readers of this blog, but if you're ever stuck for something to say about why your work is important - for example at parties - then the demise of the print edition of the OED seems a good candidate!

OK, so no one is about to ditch the Pocket version, or even the Shorter (I got one of those for a graduation present from my Grandma!), but even so...

The last print OED was published in 1989. I imagine, given the regular updates to the OED online, that there has been a substantial influx of words since 1989 and I guess (given how Chaucer looks now) English will undergo some significant changes in the future. Unless we (the DP community) decide to preserve the digital OED, we will condemn readers of 2489 to struggle on with an antique 1989 print copy and much will they wonder when they don't find things like "Internet"...

(Mind you, the electricity might have all run out by then so it won't really matter...)

On the flip side, and no doubt something someone at the party will point out, this is also a case for continuing to print the OED - at least a few copies, kept in safe places... ;-)

Friday, 27 August 2010

Homes for old software

The more hybrid archives we work with, the more obvious it becomes that we need access to repositories of older software (or 'abandonware'). For older formats you often find that not only is the creating software obsolete, but any migration tool you can dig up is pretty out-of-date too. Recently I used oldversion.com to source older versions of CompuServe and Eudora to transform an old CompuServe account to mbox format with CS2Eudora. The oldversion site is really valuable and we could use more like it, and more in it. The trouble is, collecting and publishing proprietary 'abandonware' seems to be a bit of a grey area.

In 2003, the Internet Archive obtained some exemptions from the Digital Millennium Copyright Act (DMCA) that have allowed them to archive software, but this has to be done privately, with the software being made available after copyright expiry. Not much help now, but promising for the long-term. The best thing that could happen (from an archivist's point of view) is that individuals and companies formally rescinded their interests in older software and put it in the public domain. Ideally they would put an expiry date into the initial licence, after which the software becomes abandonware.

I'm curious to hear about other good abandonware sites, especially ones that include 'productivity software' (our focus is here rather than gaming!). The Macintosh Garden is a good one, and Apple themselves also provide access to some older software, like ClarisWorks. What else is out there that we should know about?

Tuesday, 17 August 2010

Balisage 2010: The Markup Conference

Balisage 2010, The Markup Conference, was preceded by the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, which opened with:

A brief history of markup of social science data: from punched cards to “the life cycle” approach, covering the “25-year process of historical evolution leading to DDI, the Data Documentation Initiative, which unites several levels of metadata in one emerging standard.”

Sustainability of linguistic resources revisited looked at some of the difficulties facing language resources over the long-term.

Report from the field: PubMed Central, an XML-based archive of life science journal articles provided insight into the processes deployed to give public access to the full text of more than two million articles.

Portico: A case study in the use of XML for the long-term preservation of digital artifacts discussed some practices that can help assure the semantic stability of digital assets.

The Sustainability of the Scholarly Edition in a Digital World explored the need for “tools to make XML encoding easier, to encourage collaboration, to exploit social media, and to separate transcriptions of texts from the editorial scholarship applied to…”

A formal approach to XML semantics: implications for archive standards examined whether “The application of Montague semantics to markup languages may make it possible to distinguish vocabularies that can last from those which will not last”.

Metadata for long term preservation of product data discussed the “valuable lessons to be learned from the library metadata and packaging standards and how they relate to product metadata”.

The day concluded with Beyond eighteen wheels: Considerations in archiving documents represented using the Extensible Markup Language (XML) which contemplated “strategies for extending the useful life of archived documents”.

Sessions in the main conference covered topics such as:

gXML, a new approach to cultivating XML trees in Java which proposed “A single unified Java-based API, gXML, can provide a programming platform for all tree models for which a “bridge” has been developed. gXML exploits the Handle/Body design pattern and supports the XQuery Data Model (XDM)”.

Java integration of XQuery — an information unit oriented approach explored a novel pattern of cooperation between XQuery and Java developers. A new API, XQJPLUS, makes it possible to let XQuery build “information units” collected into “information trays”.

XML pipeline processing in the browser discussed the benefits of providing XProc as a JavaScript-based implementation, which would offer comprehensive client-side portability for XML pipelines specified in XProc.

Where XForms meets the glass: Bridging between data and interaction design explored using XForms - which offers a model-view framework for XML - within the conventions of existing Ajax frameworks such as Dojo, as a way to bridge differing development approaches: data-centric versus starting from the user interface.

A packaging system for EXPath demonstrated how to adapt conventional ideas of packaging to work well in the EXPath environment. “EXPath provides a framework for collaborative community-based development of extensions to XPath and XPath-based technologies (including XSLT and XQuery)”.

In A streaming XSLT processor, Michael Kay (editor of the XSLT 2.1 specification) showed how he has been implementing streaming features in his Saxon XSLT processor.

Processing arbitrarily large XML using a persistent DOM covered moving the DOM out of memory into persistent storage, offering another processing option for large documents by utilising an efficient binary representation of the XML document, with a supporting Java API.

Scripting documents with XQuery: virtual documents in TNTBase presented a virtual-document facility integrated into TNTBase, an XML database with support for versioning. The virtual documents can be edited, with changes propagated automatically back to the elements of the underlying XML repository.

XQuery design patterns illustrated the benefits that might come from the application of meta design patterns to XQuery.

Monday, 9 August 2010

Any old tapes - a true story - part 1

My neighbour is a self-employed architect. He has worked digitally for at least ten years and now most of his work is done on either his old (but still perfectly serviceable) ThinkPad or a shiny new desktop PC. He works with a couple of different CAD packages along with some tax software and MS Office, all on Windows XP.

Recently, knowing what I do for a living, he asked if I could help with a problem he was having retrieving files from an external hard drive and, being easily persuaded by the promise of food and wine, I agreed to try to help (with all the usual caveats about probably not knowing anything about it all!).

We got the disk drive working quickly (this is often the way when solving other people's computer issues - sit with them and they'll solve it themselves!) and so he asked me about his backups too, which should have been happening regularly to another external drive, but were not. I checked out the drive and found an old directory with a very uninformative name that contained some data files and a few manifests that didn't make much sense. I've forgotten the name already, but he told me it was the name of the backup software. A search showed this software was not on the PC. The new PC had recently been built, on the basis of the old one, by outsourced IT support. They'd done a good job restoring the software, etc., but this one backup program (a commercial one) was missing.

The consequences were two-fold:

1) No backup was running
2) The data files (about 1.4GB worth) and manifests were, without the software, entirely unreadable.

My neighbour thought the backup software was probably still about somewhere, so he'd ask the IT support to install and configure it. I fired up MS Windows Backup (the first time I've ever used it - it seems OK) and ran a one-off backup of his work, just to be on the safe side, and suggested he ask his support about that too (one thing you must never do is undo or override the work of the real support person!) - it required a password to add it to the Windows scheduler.

After it completed, he astutely asked where the files had gone, so I showed him on the external drive - and was dismayed to find that Windows Backup had also dumped all the files into a 1.4GB (proprietary?) container. I wondered if we'd ever have to extract files from Windows Backup files and made a mental note to keep a copy of the software (bundled with XP) in the cupboard just in case! Worse, it was then impossible to reassure him that the files were there without a crash course in Windows Restore. Still, I remember MS Backup and Restore being a pain way back in the MS-DOS days! :-)

As we finished our wine and talked about these things, he seemed to suddenly remember my job, jumped up and rummaged in a cupboard. He pulled out an old tape cartridge:

Once his main backup media but, like the files on the external drive, no longer usable. This time both the hardware and the software were long gone. He didn't seem worried - the files had probably been migrated off his old machine to the new one at some point - but still he wondered what was on it and said "I don't suppose it is readable now, is it?". He hadn't meant it as a challenge, but I couldn't resist! I convinced him to let me take the tape with me and try to recover his data - all in the name of digital archaeology, of course!

My next post will be my first adventures in the land of the Travans...

Friday, 16 July 2010


I'm very excited! I just looked at a Web site that lists all the "high profile" folks using Drupal for their Web sites. I'm also nervous. The excitement and the nerves are linked: I told Susan a few weeks ago that I would commit to Drupal as the front-end for our archival materials. So, there is a lot to live up to and I'm also stuck with a decision I made, so it is all my fault! No pressure! :-)

Why Drupal?

Well, when I was building the second incarnation of the archive interface (the first was a prototype put together by Susan), it started out as a bunch of Web pages and a Solr-based search engine. The back-end data was created using a combination of source data and metadata gleaned from the EAD catalogue, the output of FTK and a spreadsheet that was the result of some appraisal work by the archivist, all munged together by some Java code that did the transformations, created the thumbnails, etc.

As time moved on it became apparent that additional features would be nice to build into the interface. At least one of the Project Advisory Board members suggested it would be nice to see more Web 2.0-like features, and I've long thought that having reader-generated tags and (perhaps) comments attached to the manuscripts might be a nice idea. Other features also arose, and soon I realised that I'd have to either build a database-driven site to make all this happen (which I suspect would've been rather ropey) or, far more sensibly, use one that already existed.

By wonderful coincidence (though the kind of thing that often happens) I saw some emails on the Fedora lists about Islandora. Secretly harbouring a desire to visit Prince Edward Island ;-), I took a closer look and it was there that I chanced on Drupal, which seemed to fit the bill quite nicely, offering comments, tagging, types of content, and user management. Further, it is extensible, has a bewildering (if promising) API, and will hopefully mean I can build a "publication pathway" that interfaces with the preservation store (indirectly) and can be managed by the archivists in a nice Web-friendly way.

Does the excitement and the nerves start to make more sense now?

It is still early days, but I have re-factored the Java code to output content suitable for import into Drupal (fixing a major memory leak in the process!) and have developed a module that imports that content, including the structure of the collection as Collection - Shelfmarks - Items. It ain't much to look at just yet, but it is getting there.

As I have further adventures in Drupal-land I'll keep you updated!

Have a lovely weekend!

Friday, 9 July 2010

Graduate Trainee Presentations - Archiving Digital Audio

Since I am the graduate trainee for the futureArch project I also participated in Oxford's Graduate Library Trainee Scheme. During our year at Oxford all the trainees have to undertake a project. There aren't many restrictions on what this project can and can't be about; the only prerequisites are that it is additional to our day-to-day duties and is useful to each trainee's library. After talking it over with Susan I decided to research digital audio files and produce a guide to archiving digital audio.

On 7th July the trainees held a Project Showcase where each trainee gave a five-minute presentation on their project. For anyone interested, most of the presentations (including mine) are now on SlideShare. Five minutes is really not very long for a presentation, so I had to severely condense mine, although I've expanded my presentation notes to include more detail - these are also on SlideShare.

Friday, 2 July 2010

Our first local 'dead' hard disk acquisition

We've imaged lots of removable media over the past year (~400, according to Victoria's stats), and I've also done a fair amount of forensic imaging of material on-site with donors (live acquisition). One aspect of our 'forensic' armoury that has not been subject to so much testing is the imaging of whole hard disks at BEAM. So-called 'dead' acquisitions.

In the past few months two new accessions have presented us with an additional four hard disks. This is excellent news, as I have finally had the chance to use our forensic computer's Ultrabay (write-blocking device) to image a real 'collection hard disk'. Everything went smoothly. So far so good.

Monday, 21 June 2010

Vintage Computing Festival

Yesterday I took a trip to the first official Vintage Computing Festival in Britain. I was a little surprised to hear that it was the first, but I imagine that there are plenty of 'unofficial' gatherings too. This event was held by the National Museum of Computing in Bletchley Park, which warrants a visit in its own right.

For the weekend's festival, Bletchley was transformed into vintage computing heaven: a couple of marquees and the ground floor of the house were packed with computers of all makes and models, each one up and running and ready for some hands-on time. The vast majority were being used for gaming - Chuckie Egg was all over the place - but I did spot the odd word-processing application here and there.

I thought I'd post some pictures from two exhibits that really caught my eye.
First was the BBC playing the 1980s BBC Domesday Project from laserdisc. Look right and you'll see some video footage that we found by searching for 'falklands'. I've read quite a bit about the BBC Domesday laserdiscs over the years (after the CAMiLEON project they've become digital preservation folklore), but seeing the content itself, and interacting with it on a contemporary platform, is something quite special. I also suffer from BBC Micro nostalgia (though this is a Master).

This other one I'm including partly for nostalgic reasons (I loved my Spectrums, and so did my sister and my grandfather :-) ), and partly because it amused me. Twittering from a Spectrum! Whatever next?!


This is probably an old and battered hat for you good folks (seeing as the Web site's last "announcement" was in 2004!), but most days I still feel pretty new to this whole digital archiving business - not just the "archive" bit, but also the "digital preservation", um, bit - so it was news to me... ;-)

Perusing the latest Linux Format at the weekend, I chanced on an article by Ben Martin (I couldn't find a Web site for him...) about parchive and specifically par2cmdline.

Par-what? I hear you ask? (Or perhaps "oh yeah, that old thing" ;-))

Par2 files are what the article calls "error correcting files". A bit like checksums, only once created they can be used to repair the original file in the event of bit/byte level damage.


So I duly installed par2 - did I mention how wonderful Linux (Ubuntu in this case) is? - the install was simple:

sudo apt-get install par2

Then I tried it out on a 300MB Mac disk image - the new Doctor Who game from the BBC - and guess what? It works! Do some damage to the file with dd, run the verify again, and it says "the file is damaged, but I can fix it" in a reassuring HAL-like way (that could be my imagination - it didn't really talk, and if it did, probably best not to trust it to fix the file, right?)

The par2 files totalled around 9MB at "5% redundancy" - not quite sure what that means - which isn't much of an overhead for some extra data security... I think, though I've not tried it, that par2 is integrated into KDE4 too for a little bit of personal file protection.

The interesting thing about par2 is that it comes from an age when bandwidth was limited. If you downloaded a large file and it was corrupt, rather than have to download it again, you simply downloaded the (much smaller) par2 file that had the power to fix your download.

This got me thinking. Is there then any scope for archives to share par2 files with each other? (Do they already?) We cannot exchange confidential data, but perhaps we could share the par2 files - a little like a pseudo-mini-LOCKSS?

All that said, I'm not quite sure we will use parchive here, though it'd be pretty easy to create the par2 files on ingest. In theory our use of ZFS, RAID, etc. should be covering this level of data security for us, but I guess it remains an interesting question - would anything be gained by keeping par2 data alongside our disk images? And, after Dundee, would smaller archives be able to get some of the protection offered by things like ZFS, but in a smaller, lighter way?

Oh, and Happy Summer Solstice!

Thursday, 10 June 2010

OSS projects for accessing data held in .pst format

Thanks to Neil Jefferies for a link to this article in The Register, which tells us that MS has begun two open source projects that will make it possible for developers to create tools to 'browse, read and extract emails, calendar, contacts and events information' which live in MS Outlook's .pst file format. These tools are the PST Data Structure View Tool and the PST File Format SDK, and both are to be Apache-licensed.

Wednesday, 2 June 2010

Developing & Implementing Tools for Managing Hybrid Archives

As previously blogged, we were invited to talk at the University of Dundee's Centre for Archive and Information Studies seminar. I understand that the presentations, along with a set of notes, will be made available shortly, but in the meantime I thought I'd let you know my slides and notes are available on SlideShare and also on my rather hastily thrown-together home page! :-)

Monday, 24 May 2010

#4n6 event, and CLIR report on digital forensics as applied to cultural materials

For a couple of days the week before last I was at a meeting which went by the name of Computer Forensics and Born-Digital Content in Cultural Heritage Collections. The meeting was in support of a report bearing the same name (for now at least) which is currently being written by Matt Kirschenbaum, Richard Ovenden and Gabby Redwine. The final day of the workshop was dedicated to reviewing the first draft of the report, and the finalised version should be published by CLIR later this year.

We've been adapting digital forensics tools and techniques within BEAM (Bodleian Electronic Archives and Manuscripts) for a few years now, and this meeting was a useful event to talk about how we do this, and some of the issues (process, technical and ethical) it raises.

It was a good meeting, and I very much enjoyed hearing from other digital archivists and *real* forensics practitioners (they have rather different objectives to ours, but their tools are still useful!). Another highlight for me was Stephen Ennis' framing thoughts, presented in the first session. Ennis grounded the discussion with three key - and very practical - points that should be important to any archivist:

1) What is the hard-cash value of born-digital archives?
Ennis contends that monetary value has been a preservation agent for literary manuscripts. If disks and digital data are of no value, their survival rate is likely to be poor. He cited the example of John Updike's archive (at Harvard), which contained software disks but no related data disks. It's worrying that dealers don't/won't appraise born-digital material, but this will surely change. Another issue is that we need dealers to be able to appraise digital archives without altering what they are appraising. Will they have to adopt digital forensic techniques too?

2) Are the steps that seem justified for celebrity authors justified for others?
This question is very important and equally applicable to 'papers', of course. In the digital domain, the obvious 'celebrity' example is the work Emory's MARBL have done to make one of Salman Rushdie's hard disks accessible to scholars through an emulator and a searchable database. We certainly won't be processing every digital archive submission at this level, and I suspect MARBL won't either. Where it's justified, I think it's a very good thing.

3) What is the researcher's object of study? Are we promoting new and different forms of enquiry?
This question, perhaps, gets closer to exploring our simultaneous excitement and concern when we consider the potential of combining scholarly enquiry and digital forensic tools in relation to born-digital archives. There's a good deal we need to learn about scholars' requirements and I'm looking forward to the day that we have more case studies so we can move this discussion beyond conjecture!

If you're interested in finding out more in advance of the report, you'll probably find that some of the slides will be published in due course at the event's website. You can also take a look at some photos and tweets.

I may extend this post with some of the more interesting tidbits if I find a moment.

Thursday, 29 April 2010

Passwords you never created and never knew

Every so often the more technically-savvy in a family are called on to help set up a new computer when an old one begins to fail. Experience tells you that there are a number of things you'll need to do as part of this process, but there's generally one or more things you forget to check for and have to fix later. It's seldom a single-session process.

Last weekend the main problem was an unknown password for an email account. In a scenario which can't be that uncommon, an email account had been established by a friend and the password for it remembered by the email client but no human being. Luckily we were able to salvage the password using one of these tools and restore access to the email via a new client on the new computer.

It seems all too possible that we will encounter this scenario with a depositor at some stage, so it's handy to have an easy fix for it. On the other hand, it's a little worrying how easy a fix it is...

Wednesday, 28 April 2010

So long floppy, hello retro cool!

If you've been following Victoria's rather brilliant posts about media, you'll be sad (or perhaps glad) to hear that the demise of the floppy draws ever closer now that Sony are discontinuing floppy disks. I suspect everyone has a story to tell that involves a floppy disk: the fear, the sheer agony of that lost essay, the relief at the kindness of the geek who saved the file. These stories will become a thing of the past.

To balance this bad news, I also wanted to flag up the Vintage Computer Festival up the road at Bletchley Park. Let's hope they raise a glass to deprecated storage devices and their tales!

Friday, 23 April 2010

Do you know the way to Dundee?

The Centre for Archive and Information Studies at the University of Dundee is putting on what looks to be a very interesting seminar entitled "Practical Approaches to Electronic Records: the Academy and Beyond". And I'm not just saying it'll be very interesting because we're talking at it either. Just take a look at the packed programme and I'm sure you'll agree.

I'll be covering the workflow we're adopting here at futureArch and hopefully demo part of it, as well as discussing our digital asset management system, the foundation for our archive and how those ideas may scale to smaller systems.

Hope to see you there and if not I'm sure we'll be reporting back right here so stay tuned!

(Also a bit (um, I mean big) thank you to Jennifer Johnstone for helping me find my way to Dundee! :-))

Wednesday, 14 April 2010

Using a D-Link DGE-530T Gigabit Network adapter in ESX 4.

For our developers' ESX testbed/playground I wanted to install two D-Link DGE-530T Gigabit PCI Desktop network adapters; unfortunately, they do not appear to be on the ESX supported hardware list. These are the steps I took to get them recognised by ESX:

1. Acquire the skge.o driver which supports the Marvell Yukon 88E0001 chipset

The discussion Using a Marvell LAN card with ESXi 4 contains a link to a tarball sky2-and-skge-for-esxi4-0.02.tar.gz containing both the sky2 and skge driver

2. login to ESX 4.0 as root and copy the skge.o driver to /usr/lib/vmware/vmkmod

2.1 download sky2-and-skge-for-esxi4-0.02.tar.gz

2.2 tar xvzf ../sky2-and-skge-for-esxi4-0.02.tar.gz

2.3 cp vmtest/usr/lib/vmware/vmkmod/skge.o /usr/lib/vmware/vmkmod

3. run 'lspci' and identify the NIC's location (the xx:xx.x number in front of the description)

03:00.0 Ethernet controller: D-Link System Inc Unknown device 4b01 (rev 11)

4. run 'lspci -n' and determine the vendor and device IDs (for D-Link it should be 1186:xxxx)

lspci -n
00:00.0 0600: 8086:29b0 (rev 02)
03:00.0 0200: 1186:4b01 (rev 11)
03:02.0 0200: 8086:1026 (rev 04)

5. create the vmware pciid file '/etc/vmware/pciid/skge.xml'; here's a listing of mine

cat /etc/vmware/pciid/skge.xml

<?xml version='1.0' encoding='iso-8859-1'?>
<vendor id="1186">
  <short>D-Link System Inc</short>
  <name>D-Link System Inc</name>
  <device id="4b01">
    <vmware label="nic">
      <name>DGE-530T Ethernet NIC</name>
      <table file="pcitable" module="ignore" />
      <table file="pcitable.Linux" module="skge">
        <desc>D-Link System|DGE-530T Ethernet NIC</desc>
      </table>
    </vmware>
  </device>
</vendor>

6. create the file /etc/vmware/init/manifests/ which contains a single line, as shown
cat /etc/vmware/init/manifests/
copy /usr/lib/vmware/vmkmod/skge.o

7. reboot the server; checking /var/log/vmware/esxcfg-boot.log should confirm that the esxcfg boot process has loaded the skge.xml metafile, constructed the new file and included the skge.o driver in the initramfs image.

8. running 'lspci' after adding a second DGE-530T card now shows

03:00.0 Ethernet controller: D-Link System Inc DGE-530T Ethernet NIC (rev 11)
03:02.0 Ethernet controller: D-Link System Inc DGE-530T Ethernet NIC (rev 11)
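If it's useful, here's the quick sanity check I'd suggest from the service console after the reboot - a sketch only: the exact log wording varies between builds, and the esxcfg-nics listing will of course differ on your hardware:

```shell
# confirm the boot process picked up the pciid metafile and driver
grep -i skge /var/log/vmware/esxcfg-boot.log

# list the NICs the vmkernel can see - the DGE-530T cards should appear as vmnicN
esxcfg-nics -l
```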

Of course the normal caveats and disclaimers apply - this is not supported by VMware, etc.

Monday, 12 April 2010

Want to be our new graduate trainee?

We are now advertising for our second graduate traineeship post within the project. This one-year post is intended to provide pre-course experience to a graduate prior to undergoing professional training on one of the recognised archive courses. Being based within futureArch, it particularly suits an applicant wishing to develop an understanding of how the shift to digital communications impacts the work of archivists. 

The postholder will support the curatorial and technical work of the futureArch project, while sampling a variety of more traditional archival work, including providing services to researchers in the Special Collections Reading Room. The postholder will also participate in activities organised through the OWL Graduate Trainee Scheme.

Further details and application forms are available here. The closing date for applications is 10 May 2010 and we expect to interview on 1 June. For a flavour of some of the work Victoria has done during her time as a trainee take a look at some of her posts to this blog and to the Bodleian graduate trainee scheme blog.

Wednesday, 31 March 2010

Microsoft Works library

Quick future aide-mémoire for accessing Works files. Available at the libwps site. Implemented in some openoffice variants.

Tuesday, 30 March 2010

Disk imaging for older floppies

Thanks to Michael Olson for the link to KryoFlux, which is currently being developed by the Software Preservation Society (an organisation established to preserve disk-based computer games). Stanford are also using the Catweasel floppy disk controller; see Stanford's post on Catweasel and the Catweasel site itself. These could be handy to have around when we receive more in the way of unusual floppy formats.

Monday, 22 March 2010

Media Recognition Guide - Flash Media

Flash memory is a solid-state alternative to the magnetic storage used by hard, floppy and Zip disks. It is much less expensive, meaning large-capacity devices are economically viable, and it has faster access times and much better shock resistance and durability. Altogether this makes it particularly suitable for use in portable storage devices. Flash memory does have a finite number of write-erase cycles, but manufacturers can guarantee at least 100,000 cycles.

USB Flash Drive

Description: Flash memory data storage device with a USB interface.

Introduced: 2000, though which company invented the device is the subject of a legal dispute.

Capacity: The first drive had a capacity of 8 MB, but the latest versions can have capacities as large as 256 GB.

Compatibility: Widely supported by modern operating systems, including Windows, Mac OS, Linux and Unix systems.

Usage: Broad. Has replaced the 3.5” floppy disk as the preferred device for individuals and small organisations for personal data storage, transfer and backup.

File Systems: FAT, NTFS, HFS+, ext2, ext3

Common manufacturers: Many manufacturers and brands, including SanDisk, Integral, HP, Kingston Technology and Sony.


USB flash drives come in a range of shapes and sizes, but as a general rule they measure somewhere in the region of 70mm x 20mm x 10mm, and all have a male USB connector at one end. Capacity also varies widely, though the majority of manufacturers specify it either by printing the information on the casing or etching it onto the connector.

The word ‘drive’ is misleading, as nothing moves mechanically in a USB flash drive. However, computers read and write to them in the same way as they read and write to disk drives, so operating systems refer to them as ‘drives’.

The only visible component is the male USB connector, often with a protective cap. Inside the plastic casing is a USB mass storage controller, a NAND flash memory chip and a crystal oscillator to control data output. Some drives also include jumpers and LEDs, and a few have a write-protect switch.

High Level Formatting

USB drives use many of the same file systems as hard disk drives, though it is rare to find one containing a file system version that pre-dates the drive's creation; USB drives are therefore most likely to contain FAT32 rather than FAT16 or FAT12. FAT32 is the file system most commonly found on USB drives due to its broad compatibility with all major operating systems. NTFS can be used, but it is not as reliable on operating systems other than Windows. If a drive is intended for a specific operating system, you can expect to find either HFS+ (for Macs) or ext2 or ext3 (for Linux).

Formatting a USB drive is done in the same way as formatting a floppy disk. On a Windows operating system, for example, the only difference is that you right-click on the USB drive icon rather than the floppy drive.

FireWire Flash Drive

Description: Flash memory data storage device with a FireWire interface.

Capacity: Either 4, 8 or 16 GB.

Compatibility: Compatible with any computer with a FireWire connector.

Usage: Limited. Never achieved the same popularity as USB flash drives; they come in smaller capacities and have slower memory.

File Systems: FAT, NTFS, HFS+, ext2, ext3

Common manufacturers:

FireWire flash drives look similar, and are similar in construction, to USB drives, the one difference being that they use a FireWire connector rather than a USB one. Because of this they have different data transfer rates and capacities from USB drives. Depending on which version of FireWire the drive has been manufactured with, it has a transfer rate of either 49.13, 98.25 or 393 MB/s. With the exception of 49.13 MB/s, these rates exceed that of the latest USB version; however, FireWire drives have much smaller capacities. Furthermore, they are heavier and more expensive, and fewer computers have the appropriate FireWire connectors compared to those with USB ports. Thus, FireWire flash drives have never dominated the market and are fairly rare.

High Level Formatting

FireWire drives differ from USB drives only in their type of connector; they contain the same file systems and can be formatted in the same way.