Tuesday, 18 October 2011

What is ‘The Future of the Past of the Web’?

‘The Future of the Past of the Web’,
Digital Preservation Coalition Workshop
British Library, 7 October 2011

Chrissie Webb and Liz McCarthy

In his keynote address to this event – organised by the Digital Preservation Coalition, the Joint Information Systems Committee and the British Library – Herbert Van de Sompel described the purpose of web archiving as combating the internet’s ‘perpetual now’. Stressing the importance to researchers of establishing the ‘temporal context’ of publications and information, he explained how the framework of his Memento Project uses a ‘timegate’ implemented via web plugins to show what a resource was like at a particular date in the past. There is a danger, however, that not enough is being archived to provide the temporal context; for instance, although DOIs provide stable identifiers for documents, the resources those documents link to may disappear (‘link rot’).
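As a rough illustration of the ‘timegate’ idea, the sketch below builds (but does not send) the kind of request a Memento client makes: the desired date travels in the Memento protocol's Accept-Datetime request header. The TimeGate base URL and the helper names are assumptions made for this example, not details given in the talk.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate base URL, for illustration only; substitute a real one.
TIMEGATE_BASE = "http://timegate.example.org/timegate/"


def accept_datetime(when):
    """Format an aware datetime as an HTTP-style date in GMT."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def timegate_request(resource_url, when):
    """Return the URL and headers for asking a TimeGate about resource_url."""
    return TIMEGATE_BASE + resource_url, {"Accept-Datetime": accept_datetime(when)}


url, headers = timegate_request(
    "http://www.bl.uk/", datetime(2011, 10, 7, 12, 0, tzinfo=timezone.utc)
)
print(url)
print(headers)
```

A client would send this request and follow the TimeGate's response to the archived copy closest to the requested date.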

The Memento Project Firefox plugin uses a sliding timeline (here, just below the Google search box) to let users choose an archived date
A session on using web archives picked up on the theme of web continuity in a presentation by The National Archives on the UK Government Web Archive, where a redirection solution using open source software helps tackle the problems that occur when content is moved or removed and broken links result. Current projects are looking at secure web archiving, capturing internal (e.g. intranet) sources, social media capture and a semantic search tool that helps to tag ‘unstructured’ material. In a presentation that reinforced the reason for the day’s ‘use and impact’ theme, Eric Meyer of the Oxford Internet Institute wondered whether web archives were in danger of becoming the ‘dusty archives’ of the future, contrasting their lack of use with the mass digitisation of older records to make them accessible. Is this due to a lack of engagement with researchers, their lack of confidence with the material or the lingering feeling that a URL is not a ‘real’ source? Archivists need to interrupt the momentum of ‘learned’ academic behaviour, engaging researchers with new online material and developing archival resources in ways that are relevant to real research – for instance, by helping set up mechanisms for researchers to trigger archiving activity around events or interests, or making more use of server logs to help them understand use of content and web traffic.

One of the themes of the second session on emerging trends was the shift from a ‘page by page’ approach to the concept of ‘data mining’ and large scale data analysis. Some of the work being done in this area is key to addressing the concerns of Eric Meyer’s presentation; it has meant working with researchers to determine what kinds and sources of data they could really use in their work. Representatives of the UK Web Archive and the Internet Archive described their innovations in this field, including visualisation and interactive tools. Archiving social networks was also a major theme, and Wim Peters outlined the challenges of the ARCOMEM project, a collaboration between Sheffield and Hanover Universities that is tackling the problems of archiving ‘community memory’ through the social web, confronting extremely diverse and volatile content of varying quality for which future demand is uncertain. Richard Davis of the University of London Computer Centre spoke about the BlogForever project, a multi-partner initiative to preserve blogs, while Mark Williamson of Hanzo Archives spoke about web archiving from a commercial perspective, noting that companies are very interested in preserving the research opportunities online information offers.

The final panel session raised the issue of the changing face of the internet, as blogs replace personal websites and social media rather than discrete pages are used to create records of events. The notion of ‘web pages’ may eventually disappear, and web archivists must be prepared to manage the dispersed data that will take (and is taking) their place. Other points discussed included the need for advocacy and better articulation of the demand for web archiving (proposed campaign: ‘Preserve!: Are you saving your digital stuff?’), duplication and deduplication of content, the use of automated selection for archiving and the question of standards.

Thursday, 6 October 2011

Day of Digital Archives, 2011

Today is officially 'Day of Digital Archives' 2011! Well, it's been quite a busy week on the digital archives front here at the Bodleian...

The week began with the arrival of our new digital archives graduate trainee, Rebecca Nielsen. During her year here with us, the majority of Rebecca's work will be on digital archives of one kind or another; she'll be archiving all sorts, from materials arriving on old floppies to web sites on the live web.

Another of my colleagues, Matthew Neely, has been spending quite a bit of time this week working on the archive of Oxford don, John Barton. The archive includes over 150 floppies and a hard disk as well as hard-copy papers and photographs.

Barton's digital material was captured in our processing lab back in the Spring of 2010, and now Matthew is busy using Forensic Toolkit software to appraise, arrange and describe the digital content alongside the papers. There are a few older word-processing formats in the collection, but all things that we can handle.

We've also been having conversations with quite a few archive depositors this week, about scoping collections and transfer mechanisms, among other things. There has been some planning work too, while we consider the requirements for processing the archive of Sir Walter Bodmer, which includes around 300 disks (3.5" and 5.25"). For more on the Bodmer archive see the Library's Special Collections blog, The Conveyor.

Today, I've spent a little time looking at our 'Publication Pathway' and thinking about where we need a few tweaks. This is the process and toolset that we are building to publish our digital archives to users (Pete called it CollectionBuilder, and you can have a look at a slightly out-of-date version of it here: http://sourceforge.net/projects/beamcollectionb/). We have a bit more work to do on this and our user interface, but quite a bit of material in the pipeline waiting to get out to our users.

To close out the week, two of our web archiving pilot group are heading off to the DPC's The Future of the Past of the Web event tomorrow, to learn more about the state of the art in web archiving.

Lastly, I can't resist returning to the start of the week. On Monday, we had a power cut and temporarily lost access to Bodleian Electronic Archives and Manuscripts (BEAM) services. An unsubtle reminder that digital archives require lots of things to remain accessible, power being one of them!

Thursday, 8 September 2011

Monday, 22 August 2011

Comparing software tools

While looking at software relating to digital video earlier today I came across a handy website called AlternativeTo. It's a useful means of comparing software applications and getting an idea of the tools that are out there to help perform a particular task. AlternativeTo gives a brief summary of each piece of software along with screenshots of the software in action. Another useful feature is that searches can be filtered by whether the tools are free or open source.

Wednesday, 10 August 2011

Mobile forensics

Anyone who heard Brad Glisson's talk at the DPC event on digital forensics may be interested in the paper 'A comparison of forensic evidence recovery techniques for a windows mobile smart phone' published in Digital Investigation Volume 8, Issue 1, July 2011, Pages 23-36. Interesting to see what was recovered, and how the different tools behave. 

Wednesday, 27 July 2011

Preserving born-digital video - what are good practices?

Interesting to see Killian Escobedo's post on digital video preservation over at the Smithsonian Archives' visual archives blog. Our trainee, Emma, is working on questions of this sort at the moment as we start to develop strategies for preserving the vast amount of born-digital video being deposited in our archive collections. While there's quite a lot of material out there on digitising analogue video, we've found a real shortage of guidance on the management of born-digital video collections. With that in mind I'd be really interested in hearing how other folks are dealing with this kind of material. Can you give us any pointers? At the moment we're particularly interested in learning more about existing practices, good tools, realistic workflows, and preservation-grade standards (for metadata and content - which ones and why?).

So, what kind of digital video do we have? It's a good question, and one I can't answer fully for the moment. What I can say is that our collections include digital video deposited on CDs, DVDs, Blu-ray discs, miniDV and mediumDV cassettes, and hard disks. Much of this material has yet to be captured from its original media so we don't have that inventory of codecs, wrapper formats, frame rates, metadata, etc. that Killian talks about. This kind of detailed survey work is a next step for us, but one that will have to wait until we have developed a workflow for initial capture (bit-level preservation comes first). I wonder if we'll see the same diversity of technical characteristics present in the Smithsonian's materials. It seems likely.

Wednesday, 11 May 2011

Hidden Pages

Yesterday I foolishly uploaded a Pages document to my work machine (that isn't a Mac) before heading into the office. I needed the content because I was due to give it to Susan that morning. Luckily I stumbled upon this tip and thought I'd share it in case you ever find yourself faced with a Mac disk full of documents and no Mac to read them on...

I guess I shouldn't have been surprised to discover that a Pages document is in fact a zip file (like Word docs). If you unpack it, you find an XML representation of the document (which would let you get at the text - run it through tidy first though, as there aren't any line breaks!) and, neater still, a PDF of the document in the QuickLook directory (file reports PDF 1.3).
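If you'd rather script the rescue than unzip by hand, something like the following should do it. It is a minimal sketch that assumes the layout described above (a zip archive with a PDF somewhere under a QuickLook directory); the function names are made up for this example, and it looks for any PDF rather than assuming a particular file name.

```python
import zipfile


def find_quicklook_pdfs(pages_path):
    """List the PDF members stored under QuickLook/ in a Pages zip archive."""
    with zipfile.ZipFile(pages_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith("QuickLook/") and name.lower().endswith(".pdf")
        ]


def extract_preview(pages_path, out_dir):
    """Extract the first QuickLook PDF found; return its path, or None."""
    names = find_quicklook_pdfs(pages_path)
    if not names:
        return None
    with zipfile.ZipFile(pages_path) as archive:
        # extract() returns the path of the file it wrote under out_dir
        return archive.extract(names[0], path=out_dir)
```

Calling extract_preview("report.pages", "rescued") would write the preview PDF beneath the rescued directory, or return None if the document carries no preview.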

Day saved! Phew!

Wednesday, 20 April 2011

Media recognition - Floppy Disks part 3

3 inch Disks (Mitsumi 'Quick Disk')

Magnetic storage media
Used in the 1980s.
?128KB - 256KB
Requires a 3” drive appropriate to the manufacturer's specifications.
Likely to have been individual users and small organisations. Used for word-processing, music and gaming.
File Systems:
Unknown. May vary according to use. The disks were manufactured by Mitsumi and offered as OEM to resellers and used in a range of contexts including Nintendo (Famicom), various MIDI keyboards/samplers (Roland) and the Smith Corona Personal Word Processor (PWP).
Common Manufacturers:
Disks: Mitsumi appear to have made the magnetic disk (the innards), while other manufacturers made the cases. This resulted in different case shapes and labelling. For example, Smith Corona labelled the disks as DataDisk 2.8".
Drives: Mitsumi?

The Smith Corona Personal Word Processor (PWP) variant of the disk is double sided with one side being labelled ‘A’ and the other ‘B’. Each side also had a dedicated write-protect hole, known as a 'breakout lug'.

2.8" Smith Corona 'Quick Disk'
3.5" floppy side-by-side with a 2.8" Smith Corona 'Quick Disk'
Nintendo Famicom disk
Some rights reserved by bochalla

High Level Formatting
Unknown. Possibly varied according to use.

3 Inch Disk Drives
Varied according to disk. The Smith Corona word processing disks are most likely to turn up in an archival collection. These were used in a Smith Corona PWP and possible model nos. include: 3, 5, 6, 6BL, 7, X15, X25, 40, 50LT, 55D, 60, 65D, 75D, 80, 85DLT, 100, 100C, 220, 230, 250, 270LT, 300, 350, 355, 960, 990, 2000, 2100, 3000, 3100, 5000, 5100, 7000LT, DeVille 3, DeVille 300, Mark X, Mark XXX, Mark XL LT.

Lego mockup of a Nintendo Famicom drive
Some rights reserved by kelvin255

Wednesday, 13 April 2011

Preserving Digital Sound and Vision: A Briefing 8th April 2011

Last Friday I went along to the DPC briefing Preserving Digital Sound and Vision. I was particularly interested in the event because of digital video files currently held on DVD media at the Bodleian.

After arriving at the British Library and collecting my very funky Save the Bits DPC badge I sat down to listen to a packed programme of speakers. The morning talks gave an overview of issues associated with preserving audio-visual resources. We began with Nicky Whitsed from the Open University who spoke about the nature of the problem of preserving audio-visual content; a particularly pertinent issue for the OU who have 40 years of audio-visual teaching resources to deal with. Richard Ranft then gave a fascinating insight into the history and management of the British Library Sound Archive. He played a speech from Nelson Mandela’s 1964 trial to emphasise the value of audio preservation. Next Stephen Gray from JISC Digital Media spoke about how students are using audio-visual content in their research. He mentioned the difficulties researchers find when citing videos, especially those on YouTube that may disappear at any time! To round off the morning John Zubrycki from BBC R and D spoke about Challenges and Solutions in Broadcast Archives. One of the many interesting facts that he mentioned was that subtitle files originally produced by the BBC for broadcast have been used as a tool for search and retrieval of video content.

After enjoying lunch and the beautiful sunny weather on the British Library terrace we moved onto the afternoon programme based on specific projects and tools. Richard Wright of the BBC spoke about the Presto Centre and the tools it has developed to help with audio-visual preservation. He also spoke about the useful digital preservation tools available online via Presto Space. Sue Allcock and James Alexander then discussed the Outcomes and Lessons learnt from the Access to Video Assets Project at the Open University which makes past video content from the Open University’s courses available to OU staff through a Fedora repository. As at the BBC, discovering subtitle files has allowed the OU to index their audio-visual collections. Finally Simon Dixon from the Centre for Digital Music at Queen Mary, University of London spoke about emerging tools for digital sound.

A final wide ranging discussion about collaboration and next steps followed which included discussion about storage as well as ideas for a future event addressing the contexts of audio-visual resources. I left the event with my mind full of new information and lots of pointers for places to look to help me consider the next steps for our digital video collections… watch this space.

Tuesday, 12 April 2011

Sharp font writer files

Not a format I'd come across before, but we now have files of this type in the collections. They were written on something like this. Luckily someone has written a migration tool, and it seems to work. See fwwputils.

Anyone know of other tools for this format?

Thursday, 7 April 2011

Got any older?

Interesting article about the "oldest working Seagate drive in the UK". When we talk about storage here, eventually someone says "storage is getting cheaper". If you ever needed concrete proof, this is it!

Thursday, 31 March 2011

World backup day 2011

Today is officially 'world backup day', apparently. Time to back up your files...

I'm sure you know why, and perhaps even how. There are a few tips here if you want them: http://www.worldbackupday.net/

Thursday, 24 March 2011

Advisory Board Meeting, 18 March 2011

Our second advisory board meeting took place on Friday. Thanks to everyone who came along and contributed to what was a very useful meeting. For those of you who weren't there here is a summary of the meeting and our discussions.

The first half of the afternoon took the form of an overview of different aspects of the project.

Overview of futureArch's progress

Susan Thomas gave us a reminder of the project's aims and objectives and the progress being made to meet them. After an overview of the percentage of digital accessions coming into the library since 2004 and the remaining storage space we currently have, we discussed the challenge of predicting the size of future digital accessions and collecting digital material. We also discussed what we think researcher demand for born-digital material is now and will be in the future.

Born Digital Archives Case Studies

Bill Stingone presented a useful case study about what the New York Public Library has learnt from the process of making born-digital materials from the Straphangers Campaign records available to researchers.

After this Dave Thompson spoke about some of the technicalities of making all content in the Wellcome Library (born-digital, analogue and digitised) available through the Wellcome Digital Library. Since the project is so wide reaching a number of questions followed about the practicalities involved.

Web archiving update

Next we returned to the futureArch project and Susan gave an overview of the scoping, research and decisions that have been made regarding the web archiving pilot since the last meeting. I then gave an insight into how the process of web archiving will be managed using a tracking database. Some very helpful discussions followed about the practicalities of obtaining permission for archiving websites and the legal risks involved.

After breaking for a well earned coffee we reconvened to look at systems.

Systems for Curators

Susan explained how the current data capture process works for digital collections at the Bodleian including an overview of the required metadata which we enter manually at the moment. Renhart moved on to talk about our intention to use a web-based capture workbench in the future and to give us a demo of the RAP workbench. Susan also showed us how FTK is used for appraisal, arrangement and description of collections and the directions we would like to take in the future.

Researcher Interface

To conclude the systems part of the afternoon, Pete spoke about how the BEAM researcher interface has developed since the last advisory board meeting, the experience of the first stage of testing the interface and the feedback gained so far. He then encouraged everyone to get up and have a go at using the interface for themselves and to comment on it.

Training the next generation of archivists?

With the end of the meeting fast approaching, Caroline Brown from the University of Dundee gave our final talk. She addressed the extent to which different archives courses in the UK cover digital curation and the challenges faced by course providers aiming to include this kind of content in their modules.

With the final talk over we moved onto some concluding discussions around the various skills that digital archivists need. Those of us who were able to stay continued our discussions over dinner.

Thursday, 17 March 2011


I know it is a bit sad, but I couldn't help but feel a little flutter of happiness when I read about TZX. Just wanted to share that.

Friday, 18 February 2011

what have the Romans ever Done For us?

Today I presented an internal seminar on RDF to the Bodleian Library developers, the first in a series of (hopefully) regular R&D meetings. This one was to provide a practical introduction to RDF to give us a baseline to build from when we start building models of our content (at least one Library project requires we generate RDF). I called it "what have the Romans ever Done For us?":

The title is a line from the Monty Python film "The Life of Brian" and I chose it not just because I've been looking into Python (the language) but also because I can imagine a future where people ask "What has RDF ever done for us?" in a disgruntled way. In the film people suggest the Romans did quite a lot - bits of public infrastructure, like aqueducts and roads, alongside useful services like wine and medicine. I think RDF is a bit like that. Done well, it creates a solid infrastructure from which useful services will be built, but it is also likely to be invisible, if not taken for granted, like sanitation and HTML. :-)

The seminar itself seemed to go well - though you'd have to ask the attendees rather than the presenter to get the real story! We started with some slides that outline the basics of RDF, using Dean Allemang & Jim Hendler's nice method of distributing responsibility for tabular data and ending up at RDF (see pg. 32 onwards in the book Semantic Web for the Working Ontologist), and leapt straight in with LIBRIS (for example Neverwhere as RDF) as a case study. In the resultant discussion we looked at notation, RDFS, and linked data.
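To give a flavour of the table-to-RDF step, here is a minimal sketch that is not taken from the slides: each row's identifier becomes a subject, each column name a predicate and each cell a literal object, written out as N-Triples-style lines. The example.org namespace, the sample row and the function names are all made up for illustration, and the literal escaping only covers backslashes and double quotes.

```python
# Hypothetical namespace and sample data, for illustration only.
BASE = "http://example.org/"

rows = [
    {"id": "book1", "title": "Neverwhere", "author": "Neil Gaiman"},
]


def literal(value):
    """Quote a cell value as a plain literal (backslash and quote escaped)."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + text + '"'


def row_to_triples(row, key="id"):
    """Turn one table row into (subject, predicate, object) triples."""
    subject = f"<{BASE}{row[key]}>"
    return [
        (subject, f"<{BASE}{column}>", literal(value))
        for column, value in row.items()
        if column != key
    ]


def to_ntriples(triples):
    """Serialise triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(to_ntriples(row_to_triples(rows[0])))
```

Once every cell stands on its own as a triple, rows from different tables (or different libraries) can be merged simply by pooling their triples, which is the point of the exercise.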

The final half of the seminar was a workshop in which we split into two groups: data providers and data consumers, and then considered what resources at the Bodleian might be suitable for publication as RDF (and linked data) and what services we might build using data from elsewhere.

The data providers discussed how there were probably quite a lot of resources in the Library that we could publish, or become the authority on - members of the University for example, or any of the many wonders we have in the collections. To make this manageable, it was felt it would be sensible to break this task up, probably by project. This would also allow for specific models to be identified and/or developed for each set of resources.

There was some concern among the providers regarding how to "sell" the benefits of RDF and linked data to management. The concerns paralleled what I imagine happened with the emergent Web. Is this kind of data publication giving away valuable information assets, for little or no return? At worst this leads people away from the Library to viewing and using information via aggregation services. Of course, there is a flip side to this argument. Serendipitous discovery of a Bodleian resource via a third party is essentially free advertising and may drive users through our (virtual or real) door.

The consumers seemed to realise early on that one of the big problems was the lack of usable data. Indeed, for the talk, I scoured datasets trying to find a decent match of data to augment Library catalogues and found it quite hard. That isn't to say there is not a lot of data available (though of course there could be more); it is just that, without an application for it, shoehorning data into a novel use remains a novelty item. However, one possible example was the Legislation API. The group suggested that reviews from other sites could be used to augment Library catalogue results. They also suggested that the people data the Library published could be very useful and Monica talked about a suggestion she had heard at Dev8D for a developer expertise database (data Web?).

All in all lots of very useful discussion and I hope everyone went away with a good idea of what RDF was and what it might do for us and what we might do with it. There (justifiably) remains some scepticism, mostly because without the Web, linked data is simply data and we've all got our own neat ways to handle data already, be it a SQL database, XML & Solr, or whatever. Without the Web of Data the question "What do we gain?" remains.

It is a bit chicken and egg and the answer will eventually become clear as more and more people create machine processable data on the Web.

For the next meeting we'll be modelling people. I'm going to bring the clay! :-)

Slides and worksheets are available, with the source OpenOffice documents also published, on this (non-RDF) page!

Monday, 24 January 2011

Migrating documents

We have a collection that consists of several thousand documents in various archaic (well, 1980s/90s) word processor formats including Ami Professional and (its predecessor) Samna Word. Perhaps of interest to folks intent on discussing the implications of migration for authenticity of the items, some of those Ami Pro files contain the (automatically generated) line:

"File ... was converted from Samna Word on ..."

So which is the original now?

Migrating these file formats has not been straightforward. This is because it has proved remarkably tricky to ascertain a key piece of information - the file format of the original. This is not the fault of file format tools (I'm using FITS, which itself wraps the usual suspects JHOVE & DROID), but the broader problem that the files have multiple formats. Ami Pro files are correctly identified as "text/plain". The command file reports them as "ASCII English text". Some (not all) have a file extension ".sam" which is usually Ami Pro, but the ".sam" files are not all the same format.

Yet this small piece of metadata is essential because without it it is very difficult to identify the correct tool to perform the migration. For example, if I run my usual text to PDF tool - which is primed to leap into action on arrival of a "text/plain" document - the resultant PDF shows the internals of an Ami Pro file, not the neatly laid out document the creator saw. We have a further piece of information available too, and curiously it is the most useful. This is the "Category" from FTK - which correctly sorts the Ami Pros from the Samna Words.

This leads to a complex migration machine that needs to be capable of collating file format information from disparate sources and making sense of the differences, all within the context of the collection itself. If I know that creator X used Ami Pro a lot, then I can guess that "text/plain" & ".sam" means an Ami Pro document, for example. This approach is not without problems however, not least of which is that it requires a lot of manual input into what should ultimately be an automated and unwatched process. (One day, when it works better, I'll try to share this code!)
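This is not the code I'm using, but the collating step might be sketched along these lines. It assumes just four signals: the reported MIME type, the file extension, the FTK category where there is one, and a list of formats the creator is already known to have used. The function name, the signal names and the order of trust are all assumptions made for this example.

```python
from collections import Counter


def guess_format(mime, extension, ftk_category=None, creator_history=None):
    """Combine several weak signals into a (format, reason) best guess.

    Assumed order of trust: the FTK category first, then the creator's
    known habits for ambiguous ".sam" text files, then the bare MIME type.
    """
    if ftk_category:
        return ftk_category, "ftk-category"
    if mime == "text/plain" and extension.lower() == ".sam":
        if creator_history:
            # Pick the format this creator has used most often so far.
            most_common, _count = Counter(creator_history).most_common(1)[0]
            return most_common, "creator-history"
        # No supporting evidence: flag for a human to look at.
        return "unknown", "manual-review"
    return mime, "mime-only"


print(guess_format("text/plain", ".sam", ftk_category="Samna Word"))
print(guess_format("text/plain", ".SAM", creator_history=["Ami Pro", "Ami Pro", "Samna Word"]))
print(guess_format("text/plain", ".sam"))
```

Returning the reason alongside the guess keeps a record of how each decision was reached, which matters when the guesses need checking by hand later.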

Sometimes you get lucky, and the tool to do the migration offers an "auto" mode for input. For this collection I am using a trial copy of FileMerlin to perform the migration and evaluate it. It actually works better if you let it guess the input format rather than attempt to tell it. Other tools, such as JODConverter, like to know the input format and here you have a similar problem - you need to know what JODConverter is happy to accept rather than the real format - for example, send it a file with a content type of "application/rtf" and it responds with an internal server error. Send the same file with a content type of "application/msword" and the PDF is generated and returned to you.

Then there is a final problem - sometimes you have to make several steps to get the file into shape. For this collection, FileMerlin should be able to migrate Ami Pro and Samna Word into PDFs. In practice, it crashes on a very small sub-set of the documents. To overcome this, I migrate these same documents to "rich text format" (which FileMerlin seems OK with) and then to PDF with JODConverter - sending the aforementioned "application/msword" content type. I had a similar problem with WordPerfect files where using JOD directly changed the formatting of the original files. Using libwpd to create ODTs and then converting them to PDFs generated more accurate PDFs. (This is strange behaviour since OpenOffice itself uses libwpd!) Every time I hit a new (old) file format, the process of identifying it and generating a heuristic for handling it starts over.
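The "try the direct route, then fall back to a multi-step one" pattern can be sketched as below. The step functions are stand-ins that only manipulate file names; they do not call FileMerlin, JODConverter or libwpd, whose real interfaces are not shown here, and every name in the block is made up for illustration.

```python
def migrate(path, routes):
    """Try each named route in order; return (route name, output path).

    A route is a list of steps. Each step is a callable that takes a file
    path and returns the path of its output, raising an exception on failure.
    """
    errors = []
    for name, steps in routes:
        current = path
        try:
            for step in steps:
                current = step(current)
            return name, current
        except Exception as exc:  # one crashing tool should not stop the batch
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all routes failed for {path}: {errors}")


# Stand-in steps, for illustration only.
def direct_to_pdf(path):
    raise ValueError("converter crashed")


def to_rtf(path):
    return path + ".rtf"


def rtf_to_pdf(path):
    return path + ".pdf"


routes = [
    ("direct", [direct_to_pdf]),
    ("via-rtf", [to_rtf, rtf_to_pdf]),
]
print(migrate("letter.sam", routes))
```

Keeping the routes as data rather than code means a new (old) format only needs a new entry in the list, not a new branch in the migration logic.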

I'm starting to think I need a neural network! That really would be putting the AI in OAIS!