Wednesday, 15 December 2010
Friday, 26 November 2010
Take for instance the text of "Folk-Lore and Legends of Scotland" [from Project Gutenberg] (I'm probably not allowed to publish stuff from a real collection here and chose this text because I'm pining for the mountains). It generates a "bi-gram"-based word cloud that looks like this:
Names (of both people and places) quickly become obvious to human readers, as do some subjects ("haunted ships" is my favourite). To make it more useful to machines, I'm pretty sure someone has already tried cross-referencing bi-grams with name authority files. I also imagine someone has used the bi-grams as facets. Theoretically a bi-gram like "Winston Churchill" may well turn up in manuscripts from multiple collections. (Anyone know of any successes with these things?)
Still, for now I'll probably just add the word clouds of the full-texts to the interface, including a "summary" of a shelfmark, and then see what happens!
I made the (very simple) Java code available on GitHub, but I take no credit for it! It is simply a Java reworking of Jim Bumgardner's word cloud article using Jonathan Feinberg's tokenizer (part of Wordle).
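The bi-gram counting at the heart of it is tiny. Here is the idea sketched in Python (the real code is Java, reworking Jim Bumgardner's approach with Jonathan Feinberg's tokenizer; the function name and the crude tokenizing below are just mine for illustration):

```python
import re
from collections import Counter

def bigrams(text, top=10):
    # Lowercase and split on anything that isn't a letter or apostrophe -
    # a crude stand-in for a proper tokenizer like Feinberg's
    words = re.findall(r"[a-z']+", text.lower())
    # Pair each word with its neighbour and count the pairs
    pairs = Counter(zip(words, words[1:]))
    return pairs.most_common(top)

sample = "the haunted ships and the haunted ships and more haunted ships"
print(bigrams(sample, top=1))  # [(('haunted', 'ships'), 3)]
```

Feed it a whole Gutenberg text and the frequent pairs - names, places, "haunted ships" - float to the top, ready for sizing in a cloud.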
Wednesday, 10 November 2010
It has been a while since we had a whiteboard post, so I thought it was high time we had one! This delightful picture is the result of trying to explain the "Publication Pathway" - Susan's term for making our content available - to a new member of staff at the Library...
Nothing too startling here really - take some disparate sources of metadata, add a sprinkling of auto-gen'd metadata (using the marvellous FITS and the equally marvellous tools it wraps), migrate the arcane input formats to something useful, normalise and publish! (I'm thinking I might get "Normalise and Publish!" printed on a t-shirt! :-))
The blue box CollectionBuilder is what does most of the work - constructs an in memory tree of "components" from the EAD, tags the items onto the right shelfmarks, augments the items with additional metadata, and writes the whole lot out in a tidy directory structure that even includes a foxml file with DC, PREMIS and RDF data streams (the RDF is used to maintain the hierarchical relationships in the EAD). That all sounds a lot neater than it currently is, but, like all computer software, it is a work in progress that works, rather than a perfect end result! :-)
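To make that "tree of components" step a bit more concrete, here is a toy sketch in Python (every name here is hypothetical - the real CollectionBuilder is Java and does far more, including writing the FOXML):

```python
# Toy sketch: build a tree of components from the EAD hierarchy,
# then tag items onto the right shelfmark. Names are illustrative only.
class Component:
    def __init__(self, shelfmark):
        self.shelfmark = shelfmark
        self.children = []   # sub-components from the EAD hierarchy
        self.items = []      # digital items tagged onto this shelfmark

def attach(root, shelfmark, item):
    """Walk the tree and add the item to the matching component."""
    if root.shelfmark == shelfmark:
        root.items.append(item)
        return True
    return any(attach(child, shelfmark, item) for child in root.children)

collection = Component("MS. Digital 01")
collection.children.append(Component("MS. Digital 01/1"))
attach(collection, "MS. Digital 01/1", "item-10.pdf")
```

From a tree like that, writing out a tidy directory structure (one directory per component, items inside) is a simple walk.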
After that, we (will - it ain't quite there yet) push the metadata parts into the Web interface and from there index it and present it to our lovely readers!
The four boxes at the bottom are the "vhysical" layout - it's a new word I made up to describe what is essentially a physical (machine) architecture, but is in fact a bunch of virtual machines...
For the really attentive among you, this shot is of the whiteboard in its new home on the 2nd floor of Osney One, where Renhart and I have moved following a fairly major building renovation. Clearly we were too naughty to remain with the archivists! ;-)
Wednesday, 29 September 2010
Slightly more on topic, I was curious to read this article about stained glass - or is it file formats? Skip over the rather flowery bit and start reading from:
"The archive problem is one of format", and concludes: "Then in five hundred years time... pictures of the Auch windows will be stored and accessible in the cloud. But not one of them...will have the jaw-dropping impact of people seeing them for the very first time...and realising that humans could do wondrous things for themselves."
I read another article, this time from New View (not something I'd usually read, but an article about the "origins of computing" was recommended to me), which likened stained glass windows to computer screens - mostly because of the back lighting - and suggested, a little like Chris Mellor in the Register, that the cultural experience of seeing stained glass is a bit like experiencing computers for the first time.
(Another article berated digital technology, almost branding it evil, and certainly bad enough to make the author ill from using it. I was reading this at DrupalCon, where 90% of the people there were welded to their MacBooks, and wondered how bad using a plough might make these people feel! ;-))
I guess many folks don't have the "big bang" experience of the digital world. It crept up slowly - a bit like watching the cathedral and its windows being built would probably reduce its wonder - going from a ZX81 to a PET to a BBC B to an Amstrad 1640 to a... well, you get the picture...
But does that make it any less astonishing? It shouldn't. If anything, it is all about helping people do "wondrous things for themselves", just like zxspectrum.net...
Friday, 17 September 2010
The trouble, um, I mean, beauty of digital collections is that they redefine what a "manuscript" is. This is nothing new. Once upon a time someone somewhere probably upset the apple cart when they arrived at the hallowed doors with a basket full of photographs. Now we have video, audio and images, all of which can be encoded in any number of "standard" ways. (Not to mention a zillion different binary formats for just about any purpose you can imagine from sheet music to the latest car designs, which may well require more than just document-like presentation too - 3D models for example). These new manuscripts bring challenges for preservation, of course, but they also present challenges for presentation.
To address this, I've been learning more about media players in browsers with a view to picking one for the reader interface. I'm no expert in this field, so here is my layman's consideration of what I've found out and if you want to read more then this is great!
But the times, they are a-changin'. Just as old browsers knew what to do when presented with an "img" tag, most modern browsers are beginning to support HTML5's "video" and "audio" tags, allowing the browser itself to handle the playback rather than farming this out to a plug-in. (For more on HTML5 generally see this presentation - the video tag is mentioned at about 58 minutes in). An added bonus of bringing video into the browser in this way is that it has inspired folks to build media players that manipulate the Web page to add the correct mark-up, be it a video tag, an embed, or whatever, to play the media. This is currently being used to generate some nice media players that'll use the browser, the Flash-plugin, or whatever is available (see OpenVideoPlayer and OSMPlayer).
So now we get to the crux of it. What should we do for the reader interface? Go old-school (and annoy Steve Jobs) and use a Flash-based player? Adopt the new ways of HTML5? Insist on an Open Source player? Buy something in?
To work out the answer I did a bit of investigating and have installed most of the players mentioned thus far in this post - Flowplayer, OSMPlayer, video-tag only, VLC and Cortado, as well as JWPlayer.
Flowplayer uses the Flash-plugin to play Flash video (and, with an additional plug-in, MP3 audio) - it does not support Ogg. It is very simple to use and very slick to look at. It is open source, released under GPL3 with an additional (and reasonable) "attribution clause" which basically means the Flowplayer logo must appear on the player unless you pay extra.
JWPlayer works much like Flowplayer (though there is also a beta HTML5 video player in the making) and seems pretty good. While the source code is available, it is not clear if this is an open source product or otherwise - the source files do not include a LICENSE.txt or any boilerplate. Probably I'm just missing something there though, and JWPlayer seems a good choice if you don't mind Flash.
OSMPlayer is also open source and has numerous options for installation including a Drupal module (untested), a PHP library and a "stand-alone" configuration. In theory it supports lots of different audio and video formats and uses several divs to create a nice browser based player. Unfortunately, following the guidelines for both PHP and stand-alone configurations, I could not get it to work on my test server.
Video-tag only works pretty well with Firefox 3.6 on Ubuntu 10.04 and is very easy to include in a Web-page. Unfortunately it isn't nearly as slick at playback as Flowplayer - there is a delay in starting the video and it is unclear what is going on.
The VLC plug-in is also open source and seems to work pretty well and should be able to handle many different formats, but it isn't nearly as refined as other players and the provided example code fails to stop the video or make it full-screen. The VLC desktop player is wonderful, but I'm not convinced by the Firefox plug-in.
Cortado is a Java-applet provided to play Ogg Theora among other things. Usage is very simple - you just add an applet tag to the page - but playback was jerky, slow and lacked sound. I do not know if my machine is to blame for this or if it is the player itself, so will have to investigate further.
Were I sat down and forced to make a choice I think I'd struggle. Flowplayer is slick to use and easy to implement, but requires we convert everything to Flash video or MP3 (mind you, most media will arrive in suitable formats I imagine). JWPlayer is very similar in this regard. I'd like to adopt the video-tag as this supports a wide range of formats, including open ones, but currently the experience is not very smooth and refinements in this area provided by things like OSMPlayer are still in their early stages of development - JWPlayer's HTML5 offering is still beta, for example.
I guess my feeling for now is to either go with Flowplayer (and swallow the conversions required - actually pretty easy with ffmpeg) or spend a bit of time with OpenVideoPlayer's HTML5 work and the video tag. At this stage I think we probably need both working in the interface and see where the better user experience is...
I should throw one more thing into the pot - the problem of formats. Video and audio files are complicated beasts consisting of containers and tracks and such - a bit like cassettes! The contents of these containers are encoded in a variety of ways, each requiring different software to decode and render their content. We have the same problem with documents and we solve that by converting all the text-based materials we get into PDFs (for presentation, before anyone starts worrying about the preservation implications of PDF!) and use a PDF plug-in to display them.
Can we do the same with our audio/video material and, if we can, what format (I'm using "format" as a general term to mean "container/encoding"!) do we use? (Victoria has already done some work along these lines, creating WAVs for storage and MP3s for presentation, from audio CDs). Are there any additional concerns given that most born-digital video/audio is likely to arrive at our doors in a compressed format? Should we uncompress it? Is such a thing even possible? Should we (and do we have the processing power to) convert all audio/video materials to open formats for both preservation and presentation purposes?
We're going to raise this final question at our next Library developer meeting and see what folks think. In theory we can delay the decision because most browsers and their plug-ins handle multiple formats, but perhaps we should have a standard delivery format much like we currently have PDF?
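For what it's worth, the conversion step really is cheap per file: ffmpeg infers suitable codecs from the output extension, so deriving an MP3 presentation copy from a WAV master is a one-liner. A batch sketch in Python (filenames hypothetical; only the command lines are built here - hand them to subprocess.run to actually convert, assuming ffmpeg is installed):

```python
# Build ffmpeg command lines for deriving MP3 presentation copies from WAVs.
# ffmpeg picks suitable codecs from the output file's extension.
def ffmpeg_cmd(source, target):
    return ["ffmpeg", "-i", source, target]

masters = ["interview.wav", "lecture.wav"]  # hypothetical WAV masters
jobs = [ffmpeg_cmd(src, src.rsplit(".", 1)[0] + ".mp3") for src in masters]
# e.g. subprocess.run(jobs[0], check=True) would run the first conversion
```

The same shape works for video - swap the target extension (`.flv`, `.ogv`, ...) and ffmpeg does the rest, which is why the delivery-format question is more about policy than processing.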
Oh dear. I started writing this post with the hope of finding all the answers! I have found out a lot about media players at least, which can only be a good thing, and I've also found out that the state of the art is not quite as far along as the proponents of HTML5 killing Flash would like us to believe - though there is good work going on here and this is the future. I'm also unclear just how much my experience of these things is hindered by using Ubuntu - I often wrestle with the playback of media files under Linux! :-)
Still, I think we're further along, nearer an answer and at least in a place to know where to start testing...
Your thoughts on media players would be most welcome! :-)
Friday, 3 September 2010
I'm going to pick out some of my highlights here as to cover it all would take days, but if you want to know more I'd encourage you to check out the conference Web site and the presentation videos on archive.org.
So, wot did I lernd?
Drupal Does RDF
OK, so I knew that already, but I didn't know that from Drupal 7 (release pending) RDF support will be part of the Drupal core, showing a fairly significant commitment in this area. Even better, there is an active Semantic Web Drupal group working on this stuff. While "linked data" remains something of an aside for us (99.9% of our materials will not make their way to the Web any time soon) the "x has relationship y with z" structure of RDF is still useful when building the BEAM interfaces - for example Item 10 is part of shelfmark MS Digital 01, etc. There is also no harm in trying to be future-proof (assuming the Semantic Web is indeed the future of the Web! ;-)) for when the resources are released into the wild.
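That "x has relationship y with z" structure is just subject-predicate-object triples, and the shelfmark example falls out of it naturally. A tiny Python sketch (dcterms:isPartOf is a real Dublin Core term; the item and shelfmark names are made up):

```python
# Triples as plain tuples: (subject, predicate, object)
triples = [
    ("Item 10", "dcterms:isPartOf", "MS Digital 01"),
    ("MS Digital 01", "dcterms:isPartOf", "Example Collection"),
]

def parts_of(container):
    """Everything directly contained in the given shelfmark/collection."""
    return [s for (s, p, o) in triples
            if p == "dcterms:isPartOf" and o == container]

print(parts_of("MS Digital 01"))  # ['Item 10']
```

One predicate is enough to rebuild the whole EAD hierarchy from a flat pile of statements - which is exactly what the RDF datastream in the FOXML is doing for us.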
Projects like Islandora and discussions like this suggest growing utility in the use of Drupal as an aspect of an institutional repository, archives or even Library catalogues (this last one being my (pxuxp) experiment with Drupal 6 and RDF).
Speaking of IRs...
Drupal Does Publishing
During his keynote, Dries Buytaert (the creator of Drupal) mentioned "distributions". Much like Linux distributions, these are custom builds of Drupal for a particular market or function. (It is testament to the software's flexibility that this is possible!) Such distributions already exist and I attended a session on OpenPublish because I wondered what the interface would look like and also thought it might be handy if you wanted to build, for instance, an Open Access Journal over institutional repositories. Mix in the RDF mentioned above and you've a very attractive publishing platform indeed!
Another distro that might be of interest is OpenAtrium which bills itself as an Intranet in a Box.
Drupal Does Community
One of my motivations in attending the conference was to find out about Open Source development and communities. One of the talks was entitled "Come for the Software, Stay for the Community" and I think part of Drupal's success is its drive to create and maintain a sharing culture - the code is GPL'd for example. It was a curious thing to arrive into this community, an outsider, and feel completely on the edge of it all. That said, I met some wonderful people, spent a productive day finding my way around the code at the "sprint" and think that a little effort to contribute will go a long way. This is a good opportunity to engage with a real life Open Source community. All I need to do is work out what I have to offer!
Drupal Needs to Get Old School
There were three keynotes in total, and the middle one was by Rasmus Lerdorf of PHP fame, scaring the Web designers in the audience with a technical performance analysis of the core Drupal code. I scribbled down the names of various debugging tools, but what struck me the most was the almost bewildered look on Rasmus' face when considering that PHP had been used to build a full-scale Web platform. He even suggested at one point that parts of the Drupal core should be migrated to C rather than remain as "a PHP script". There is something very cool about C. I should dig my old books out! :-)
HTML5 is Here!
Jeremy Keith gave a wonderful keynote on HTML5, why it is like it is and what happened to XHTML 2.0. Parts were reminiscent of the RSS wars, but mostly I was impressed by the HTML5 Design Principles which favour a working Web rather than a theoretically pure (XML-based) one. The talk is well worth a watch if you're interested in such things and I felt reassured and inspired by the practical and pragmatic approach outlined. I can't decide if I should start to implement HTML5 in our interface or not, but given that HTML5 is broadly compatible with the hotchpotch of HTMLs we all code in now, I suspect this migration will be gentle and as required rather than a brutal revolution.
I often feel I'm a little slow at finding things out, but I don't think I was the only person in the audience to have never heard about responsive Web design, though when you know what it is, it seems the most obvious thing in the world! The problem with the Web has long been the variation in technology used to render the HTML. Different browsers react differently and things can look very different on different hardware - from large desktop monitors, through smaller screens to phones. Adherence to standards like HTML5 and CSS3 will go a long way to solving the browser problem, but what of screen size? One way would be to create a site for each screen size. Another way would be to make a single design that scales well, so things like images disappear on narrower screens, multiple columns become one, etc.
Though not without its problems, this is the essence of responsive design and CSS3 makes it all possible. Still not sure what I'm on about? dconstruct was given as a good example. Using a standards-compliant browser (i.e. not IE! (yet)) shrink the browser window so it is quite narrow. See what happens? This kind of design, along with the underlying technology and frameworks, will be very useful to our interface so I probably need to look more into it. Currently we're working with a screen size in mind (that of the reading room laptop) but being more flexible can only be a good thing!
There were so many more interesting things but I hope this has given you a flavour of what was a grand conference.
Wednesday, 1 September 2010
OK, so no one is about to ditch the Pocket version, or even the Shorter (I got one of those for a graduation present from my Grandma!), but even so...
The last print OED was published in 1989. I imagine, given the regular updates to the OED online, that there has been a substantial influx of words since 1989 and I guess (given how Chaucer looks now) English will undergo some significant changes in the future. Unless we (the DP community) decide to preserve the digital OED, we will condemn readers of 2489 to struggle on with an antique 1989 print copy and much will they wonder when they don't find things like "Internet"...
(Mind you, the electricity might have all run out by then so it won't really matter...)
On the flip side, and no doubt something someone at the party will point out, this is also a case for continuing to print the OED - at least a few copies, kept in safe places... ;-)
Friday, 27 August 2010
In 2003, the Internet Archive obtained some exemptions from the Digital Millennium Copyright Act (DMCA) that have allowed them to archive software, but this has to be done privately, with the software being made available after copyright expiry. Not much help now, but promising for the long-term. The best thing that could happen (from an archivist's point of view) is that individuals and companies formally rescinded their interests in older software and put it in the public domain. Ideally they would put an expiry date into the initial licence before the software becomes abandonware.
I'm curious to hear about other good abandonware sites, especially ones that include 'productivity software' (our focus is here rather than gaming!). The Macintosh Garden is a good one, and Apple themselves also provide access to some older software, like ClarisWorks. What else is out there that we should know about?
Saturday, 21 August 2010
Tuesday, 17 August 2010
Balisage 2010: The Markup Conference was preceded by the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, which opened with:
A brief history of markup of social science data: from punched cards to “the life cycle” approach covering the “25-year process of historical evolution leading to DDI, the Data Documentation Initiative, which unites several levels of metadata in one emerging standard.”
Sustainability of linguistic resources revisited looked at some of the difficulties facing language resources over the long-term.
Report from the field: PubMed Central, an XML-based archive of life science journal articles provided insight into the processes deployed to give public access to the full text of more than two million articles.
Portico: A case study in the use of XML for the long-term preservation of digital artifacts discussed some practices that can help assure the semantic stability of digital assets.
The Sustainability of the Scholarly Edition in a Digital World explored the need for “tools to make XML encoding easier, to encourage collaboration, to exploit social media, and to separate transcriptions of texts from the editorial scholarship applied to…”.
A formal approach to XML semantics: implications for archive standards examined whether “The application of Montague semantics to markup languages may make it possible to distinguish vocabularies that can last from those which will not last”.
Metadata for long term preservation of product data discussed the “valuable lessons to be learned from the library metadata and packaging standards and how they relate to product metadata”.
The day concluded with Beyond eighteen wheels: Considerations in archiving documents represented using the Extensible Markup Language (XML) which contemplated “strategies for extending the useful life of archived documents”.
Sessions in the main conference covered topics such as:
gXML, a new approach to cultivating XML trees in Java which proposed “A single unified Java-based API, gXML, can provide a programming platform for all tree models for which a “bridge” has been developed. gXML exploits the Handle/Body design pattern and supports the XQuery Data Model (XDM)”.
Java integration of XQuery — an information unit oriented approach explored “a novel pattern of cooperation between XQuery and Java developers”. A new API, XQJPLUS, makes it possible to let XQuery build “information units” collected into “information trays”.
Where XForms meets the glass: Bridging between data and interaction design explored using XForms, which offers a model-view framework for XML, whilst working within the conventions of existing Ajax frameworks such as Dojo, as a way to bridge differing development approaches: data-centric versus starting from the user interface.
A packaging system for EXPath demonstrated how to adapt conventional ideas of packaging to work well in the EXPath environment. “EXPath provides a framework for collaborative community-based development of extensions to XPath and XPath-based technologies (including XSLT and XQuery)”.
Processing arbitrarily large XML using a persistent DOM covered moving the DOM out of memory and into persistent storage, offering another processing option for large documents by utilising an efficient binary representation of the XML document, with a supporting Java API.
Scripting documents with XQuery: virtual documents in TNTBase presented a virtual-document facility integrated into TNTBase, an XML database with support for versioning. The virtual documents can be edited, and changes to elements in the underlying XML repository are propagated automatically back to the database.
XQuery design patterns illustrated the benefits that might extend from the application of meta design patterns to XQuery.
Monday, 9 August 2010
Recently, knowing what I do for a living, he asked if I could help with a problem he was having retrieving files from an external hard drive and, being easily persuaded by the promise of food and wine, I agreed to try to help (with all the usual caveats about probably not knowing anything about it all!).
We got the disk drive working quickly (this is often the way when solving other people's computer issues. Sit with them and they'll solve it themselves!) and so he asked me about his backups too - which should have been happening regularly to another external drive, but were not. I checked out the drive and found an old directory with a very uninformative name that contained some data files and a few manifests that didn't make much sense. I've forgotten the name already, but he told me this was the name of the backup software. A quick search showed the software was not on the PC. The new PC had been recently built on the basis of the old one by an outsourced IT support. They'd done a good job restoring the software, etc. but this one backup program (a commercial one) was missing.
The consequences were two-fold:
1) No backup was running
2) The data files (about 1.4GB worth) and manifests were, without the software, entirely unreadable.
My neighbour thought perhaps the backup software was about somewhere, so he'd ask the IT support to install and configure it. I fired up MS Windows Backup (the first time I've ever used it - it seems OK) and ran a one-off backup of his work, just to be on the safe side, and suggested he ask his support about that too (one thing you must never do is undo or override the work of the real support person!) - it required a password to add it to the Windows scheduler.
After it completed, he astutely asked where the files had gone, and so I showed him on the external drive - and was dismayed to find that Windows Backup had also dumped all the files into a 1.4GB (proprietary?) container. I wondered if we'd ever have to extract files from Windows Backup files and made a mental note to keep a copy of the software (bundled with XP) in the cupboard just in case! Worse, it was then impossible to reassure him that the files were there without a crash course in Windows Restore. Still, I remember MS Backup and Restore being a pain as far back as MS-DOS! :-)
As we finished our wine and talked about these things, he seemed to suddenly remember my job, jumped up and rummaged in a cupboard. He pulled out an old tape cartridge:
Once his main backup media, but, like the files on the external drive, no longer usable. This time both the hardware and the software were long gone. He didn't seem worried - the files had probably been migrated off his old machine to the new one at some point - but still he wondered what was on it and said "I don't suppose it is readable now is it?". He hadn't meant it as a challenge, but I couldn't resist! I convinced him to let me take the tape with me and try to recover his data - all in the name of digital archaeology, of course!
My next post will be my first adventures in the land of the Travans...
Friday, 16 July 2010
Well, when I was building the second incarnation of the archive interface (the first was a prototype put together by Susan), it started out as a bunch of Web pages and a Solr-based search engine. The back-end data was created using a combination of source data and metadata gleaned from the EAD catalogue, the output of FTK and a spreadsheet that was the result of some appraisal work by the archivist, all munged together by some Java code that did the transformations, created the thumbnails, etc.
As time moved on it became apparent that additional features would be nice to build into the interface. At least one of the Project Advisory Board members suggested it would be nice to see more Web 2.0-like features and I've long thought that having reader-generated tags and (perhaps) comments attached to the manuscripts might be a nice idea. Other features also arose, and soon I realised that I'd have to either build a database-driven site to make all this happen (which I suspect would've been rather ropey) or, far more sensibly, use one that already existed.
By wonderful coincidence (though the kind of thing that often happens) I saw some emails on the Fedora lists about Islandora. Secretly harbouring a desire to visit Prince Edward Island ;-), I took a closer look and it was there that I chanced on Drupal, which seemed to fit the bill quite nicely, offering comments, tagging, types of content, and user management. Further, it is extensible, has a bewildering, if promising, API, and will hopefully mean I can build a "publication pathway" that interfaces with the preservation store (indirectly) and can be managed by the archivists in a nice Web-friendly way.
Does the excitement and the nerves start to make more sense now?
It is still early days, but I have re-factored the Java code to output content (fixing a major memory leak in the process!) suitable for import into Drupal and have developed a module that imports that content, including the structure of the collection as Collection - Shelfmarks - Items. It ain't much to look at just yet, but it is getting there.
As I have further adventures in Drupal-land I'll keep you updated!
Have a lovely weekend!
Friday, 9 July 2010
On 7th July the trainees held a Project Showcase where each trainee gave a five minute presentation on their project. For anyone interested, most of the presentations (including mine) are now on slideshare. Five minutes is really not very long for a presentation and so I had to severely condense mine, although I've expanded my presentation notes to include more detail - these are also on slideshare.
Friday, 2 July 2010
In the past few months two new accessions have presented us with an additional four hard disks. This is excellent news, as I have finally had the chance to use our forensic computer's Ultrabay (write-blocking device) to image a real 'collection hard disk'. Everything went smoothly. So far so good.
Monday, 21 June 2010
For the weekend's festival, Bletchley was transformed into vintage computing heaven: a couple of marquees and the ground floor of the house were packed with computers of all makes and models, each one up and running and ready for some hands-on time. The vast majority were being used for gaming - Chuckie Egg was all over the place - but I did spot the odd word-processing application here and there.
I thought I'd post some pictures from two exhibits that really caught my eye.
CAMiLEON project they've become digital preservation folklore), but seeing the content at stake, and interacting with it on a contemporary platform is something quite special. I also suffer from BBC Micro nostalgia (though this is a Master).
Perusing the latest Linux Format at the weekend, I chanced on an article by Ben Martin (I couldn't find a Web site for him...) about parchive and specifically par2cmdline.
Par-what? I hear you ask? (Or perhaps "oh yeah, that old thing" ;-))
Par2 files are what the article calls "error correcting files". A bit like checksums, only once created they can be used to repair the original file in the event of bit/byte level damage.
So I duly installed par2 - did I mention how wonderful Linux (Ubuntu in this case) is? - the install was simple:
sudo apt-get install par2
Then I tried it out on a 300MB Mac disk image - the new Doctor Who game from the BBC - and guess what? It works! Do some damage to the file with dd, run the verify again and it says "the file is damaged, but I can fix it" in a reassuring HAL-like way (that could be my imagination, it didn't really talk - and if it did, probably best not to trust it to fix the file right...)
The par2 files totalled around 9MB at "5% redundancy" - not quite sure what that means - which isn't much of an overhead for some extra data security... I think, though I've not tried it, that it is integrated into KDE4 too for a little bit of personal file protection.
The interesting thing about par2 is that it comes from an age when bandwidth was limited. If you downloaded a large file and it was corrupt, rather than have to download it again, you simply downloaded the (much smaller) par2 file that had the power to fix your download.
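Par2 actually uses Reed-Solomon codes, but the flavour of "a small extra file that can repair the big one" can be seen with plain XOR parity. A toy Python sketch (emphatically not what par2 does internally, just the underlying idea):

```python
# Toy parity: XOR all data blocks together into one parity block.
# Any ONE lost block can then be rebuilt from the parity plus the
# surviving blocks. par2's Reed-Solomon codes generalise this so that
# several recovery blocks can repair several damaged blocks.
from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

blocks = [b"ABCD", b"EFGH", b"IJKL"]
parity = reduce(xor_blocks, blocks)

# Simulate losing block 1, then recover it from parity + survivors:
survivors = [blocks[0], blocks[2]]
recovered = reduce(xor_blocks, survivors, parity)
print(recovered)  # b'EFGH'
```

The parity block here is the same size as one data block; par2's "5% redundancy" trades a smaller overhead for repairing proportionally less damage.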
This got me thinking. Is there then any scope for archives to share par2 files with each other? (Do they already?) We cannot exchange confidential data but perhaps we could share the par2 files, a little like a pseudo-mini-LOCKSS?
All that said, I'm not quite sure we will use parchive here, though it'd be pretty easy to create the par2 files on ingest. In theory our use of ZFS, RAID, etc. should be covering this level of data security for us, but I guess it remains an interesting question - would anything be gained by keeping par2 data alongside our disk images? And, after Dundee, would smaller archives be able to get some of the protection offered by things like ZFS, but in a smaller, lighter way?
Oh, and Happy Summer Solstice!
Thursday, 10 June 2010
Wednesday, 2 June 2010
Monday, 24 May 2010
We've been adapting digital forensics tools and techniques within BEAM (Bodleian Electronic Archives and Manuscripts) for a few years now, and this meeting was a useful event to talk about how we do this, and some of the issues (process, technical and ethical) it raises.
It was a good meeting, and I very much enjoyed hearing from other digital archivists and *real* forensics practitioners (they have rather different objectives to ours, but their tools are still useful!). Another highlight for me was Stephen Ennis' framing thoughts, presented in the first session. Ennis grounded the discussion with three key - and very practical - points that should be important to any archivist:
1) What is the hard-cash value of born-digital archives?
Ennis contends that monetary value has been a preservation agent for literary manuscripts. If disks and digital data are of no value, their survival rate is likely to be poor. He cited the example of John Updike's archive (at Harvard), which contained software disks but no related data disks. It's worrying that dealers don't/won't appraise born-digital material, but this will surely change. Another issue is that we need dealers to be able to appraise digital archives without altering what they are appraising. Will they have to adopt digital forensic techniques too?
2) Are the steps that seem justified for celebrity authors justified for others?
This question is very important and equally applicable to 'papers', of course. In the digital domain, the obvious 'celebrity' example is the work Emory's MARBL have done to make one of Salman Rushdie's hard disks accessible to scholars through an emulator and a searchable database. We certainly won't be processing every digital archive submission at this level, and I suspect MARBL won't either. Where it's justified, I think it's a very good thing.
3) What is the researcher's object of study? Are we promoting new and different forms of enquiry?
This question, perhaps, gets closer to exploring our simultaneous excitement and concern when we consider the potential of combining scholarly enquiry and digital forensic tools in relation to born-digital archives. There's a good deal we need to learn about scholars' requirements and I'm looking forward to the day that we have more case studies so we can move this discussion beyond conjecture!
If you're interested in finding out more in advance of the report, you'll probably find that some of the slides will be published in due course at the event's website. You can also take a look at some photos and tweets.
I may extend this post with some of the more interesting tidbits if I find a moment.
Thursday, 29 April 2010
Last weekend the main problem was an unknown password for an email account. In a scenario which can't be that uncommon, an email account had been established by a friend and the password for it remembered by the email client but no human being. Luckily we were able to salvage the password using one of these tools and restore access to the email via a new client on the new computer.
It seems all too possible that we will encounter this scenario with a depositor at some stage, so it's handy to have an easy fix for it. On the other hand, it's a little worrying how easy a fix it is...
Wednesday, 28 April 2010
To balance this bad news, I also wanted to flag up the Vintage Computer Festival up the road at Bletchley Park. Let's hope they raise a glass to deprecated storage devices and their tales!
Friday, 23 April 2010
I'll be covering the workflow we're adopting here at futureArch and hopefully demo part of it, as well as discussing our digital asset management system - the foundation of our archive - and how those ideas may scale to smaller systems.
Hope to see you there and if not I'm sure we'll be reporting back right here so stay tuned!
(Also a bit (um, I mean big) thank you to Jennifer Johnstone for helping me find my way to Dundee! :-))
Wednesday, 14 April 2010
1. Acquire the skge.o driver which supports the Marvell Yukon 88E0001 chipset
The discussion Using a Marvell LAN card with ESXi 4 contains a link to a tarball sky2-and-skge-for-esxi4-0.02.tar.gz containing both the sky2 and skge driver
2. login to ESX 4.0 as root and copy the skge.o driver to /usr/lib/vmware/vmkmod
2.1 download sky2-and-skge-for-esxi4-0.02.tar.gz
2.2 tar xvzf ../sky2-and-skge-for-esxi4-0.02.tar.gz
2.3 cp vmtest/usr/lib/vmware/vmkmod/skge.o /usr/lib/vmware/vmkmod
3. run 'lspci' and identify the NIC's location (the xx:xx.x number in front of the description)
03:00.0 Ethernet controller: D-Link System Inc Unknown device 4b01 (rev 11)
4. run 'lspci -n' and determine the vendor and device IDs (for D-Link it should be 1186:xxxx)
00:00.0 0600: 8086:29b0 (rev 02)
03:00.0 0200: 1186:4b01 (rev 11)
03:02.0 0200: 8086:1026 (rev 04)
5. create the vmware pciid file '/etc/vmware/pciid/skge.xml'; here's a listing of mine:
<?xml version='1.0' encoding='iso-8859-1'?>
<pcitable>
  <vendor id="1186">
    <short>D-Link System Inc</short>
    <name>D-Link System Inc</name>
    <device id="4b01">
      <name>DGE-530T Ethernet NIC</name>
      <table file="pcitable" module="ignore" />
      <table file="pcitable.Linux" module="skge">
        <desc>D-Link System|DGE-530T Ethernet NIC</desc>
      </table>
    </device>
  </vendor>
</pcitable>
6. create the file /etc/vmware/init/manifests/vmware-skge.mf, which contains a single line
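The manifest line itself seems to have gone missing from this post. By analogy with the sky2 instructions in the same thread, it most likely just tells the boot process to copy the driver module into the boot image - this is an assumption, so do check it against the original guide:

```
copy /usr/lib/vmware/vmkmod/skge.o
```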
7. reboot the server; checking /var/log/vmware/esxcfg-boot.log should confirm that the esxcfg boot process has loaded the skge.xml metafile, constructed the new vmware-devices.map file, and included the skge.o driver in the initramfs image.
8. running 'lspci' after adding a second DGE-530T card now shows
03:00.0 Ethernet controller: D-Link System Inc DGE-530T Ethernet NIC (rev 11)
03:02.0 Ethernet controller: D-Link System Inc DGE-530T Ethernet NIC (rev 11)
Of course the normal caveats and disclaimers apply, as in: not supported by VMware, etc.
Monday, 12 April 2010
Wednesday, 31 March 2010
Tuesday, 30 March 2010
Monday, 22 March 2010
Flash memory is the alternative to the byte-programmable storage used by hard, floppy and Zip disks. It is much less expensive, meaning large-capacity devices are economically viable, and it has faster access times and much better shock resistance and durability. Altogether this makes it particularly suitable for use in portable storage devices. Flash memory does have a finite number of write-erase cycles, but manufacturers can guarantee at least 100,000 cycles, which is a much larger number than with byte-programmable memory.
Flash memory data storage device with USB interface
Introduced: 2000, though which company invented the device remains a matter of legal dispute.
Capacity: The first drive had a capacity of 8 MB, but the latest versions can have capacities as large as 256 GB.
Operating system support: Widely supported by modern operating systems, including Windows, Mac OS, Linux and Unix systems.
Market penetration: Broad. Has replaced 3.5” floppy disks as the preferred device for individuals and small organisations for personal data storage, transfer and backup.
File systems: FAT, NTFS, HFS+, ext2, ext3
Manufacturers: Many manufacturers and brands, including SanDisk, Integral, HP, Kingston Technology and Sony.
USB flash drives can come in a range of shapes and sizes, but as a general rule they measure somewhere in the region of 70mm x 20mm x 10mm and all have a male USB connector at one end. Capacity also varies widely, though the majority of manufacturers specify this either by printing the information on the casing or etching it onto the connector.
Using the word ‘drive’ is misleading, as nothing moves mechanically in a USB flash drive. However, computers read and write to them in the same way as they do to disk drives, so operating systems refer to them as ‘drives’.
The only visible component is the male USB connector, often with a protective cap. Inside the plastic casing is a USB mass storage controller, a NAND flash memory chip and a crystal oscillator to control data output. Some drives also include jumpers and LEDs, and a few also have a write-protect switch.
High Level Formatting
USB drives use many of the same file systems as hard disk drives, though it is rare to find a drive containing a file system version that pre-dates the drive's manufacture; USB drives are therefore most likely to contain FAT32 rather than FAT16 or FAT12. FAT32 is the file system most commonly found on USB drives due to its broad compatibility with all major operating systems. NTFS can be used, but it is less reliable on operating systems other than Windows. If a drive is intended for a specific operating system, you can expect to find either HFS+ (for Macs) or ext2 or ext3 (for Linux).
Formatting a USB drive is done in the same way as formatting a floppy disk. On a Windows operating system, for example, the only difference is that you right-click on the USB drive icon rather than the floppy drive.
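On Linux the same high-level format can be done from the command line. A minimal sketch using dosfstools - the sparse image file is a stand-in of my own; on a real stick you would point mkfs.vfat at a device node (something like /dev/sdb1, which is hypothetical here) instead:

```shell
# Create a 256MB sparse file standing in for a USB stick's partition.
truncate -s 256M usbstick.img

# Format it as FAT32 - the file system most widely supported across
# operating systems - with a volume label. mkfs.vfat is in dosfstools.
mkfs.vfat -F 32 -n MYSTICK usbstick.img
```

Note that forcing `-F 32` fails on very small volumes, since FAT32 requires a minimum cluster count; for tiny images mkfs.vfat would need `-F 16` instead.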
Flash memory data storage device with FireWire interface
Capacity: Either 4, 8 or 16 GB
Operating system support: Compatible with any computer with a FireWire connector
Market penetration: Limited. Never achieved the same popularity as USB flash drives; they come in smaller sizes and have slower memory.
File systems: FAT, NTFS, HFS+, ext2, ext3
FireWire flash drives look similar, and are similar in construction, to USB drives, the one difference being that they use a FireWire connector rather than a USB one. Because of this they have different data transfer rates and capacities from USB drives. Depending on which version of FireWire the drive was manufactured with, it has a transfer rate of either 49.13, 98.25 or 393 MB/s. With the exception of 49.13 MB/s, these rates exceed that of the latest USB version; however, FireWire drives have much smaller capacities. Furthermore, they are heavier and more expensive, and fewer computers have the appropriate FireWire connectors compared to those with USB ports. Thus, FireWire flash drives have never dominated the market and are fairly rare.
High Level Formatting
FireWire drives differ from USB drives only in their type of connector; they therefore contain the same file systems and can be formatted in the same way.