Wednesday, 29 July 2009

Carved in Silicon

Just found an article on the BBC site that is of interest - 'Rosetta stone' offers digital lifeline. Some nice pointers to research on the life of CDs and DVDs with numbers to use at presentations and it is comforting to read that what we're trying to do Just Aint Easy(tm).

I rather like this too:

"...the digital data archivists' arch enemy: magnetic polarity"

(I added the bold!)

Does that make digital archivists like the X-Men ? ;-)


I imagine nothing new to you folks, but handy all the same.

and in particular:

Monday, 20 July 2009

Thursday, 16 July 2009

open source forever, right?

As much a note to self as anything, but also a cautionary tale...

Open source libraries (I mean software libraries rather than big buildings with books - so apologies for the non-technical readers!) - are very useful. Sometimes they can vanish - projects go under, people stop being interested, and soon a code base is "unsupported" and maybe, one day, might vanish from the Web. Take, for example, the Trilead Java SSH Library, the demise of which I think must be fairly recent.

A quick Google search suggests the following:

Which helpfully says:

"Trilead SSH for Java The freely available open-source library won't be anymore developed nore supported by Trilead." (sic.)

Unsupported, in this case, also means unavailable and there are no links to any code from here.

Other sites link to:

which gives a 404.

None of which is very helpful when your code is telling you:

Exception in thread "main" java.lang.NoClassDefFoundError: com/trilead/ssh2/InteractiveCallback

(Should any non-technical types still be reading, that means "I have no idea what you are talking about when you refer to a thing called com.trilead.ssh2.InteractiveCallback so I'm not going to work, no way, not a chance, so there. Ner.").

Now, had I been more awake, I probably would have noticed a sneaky little file by the name "trilead.jar" in the SVNKit directory. I would have also duely added it to the the classpath. But I wasn't and I didn't and then got into a panic searching for it.

But, and here is the moral of the tale, I did find this:

"Also, in the meantime we will put Trilead SSH library source code into our repository (it is also distributed under BSD license) so it will remain available for the community and we still will be able to make minor bugfixes in it when necessary." [SVNKit User Mailing List, 18th May 2009]

Hooray for Open Source!

The open source project SVNKit, which made use of the open source library, was able - due to the open licensing - absorb the SSH library and make it available along with the SVNKit code. Even though the Trilead SSH Library is officially defunct, it lives on in the hands of its users. Marvellous eh?

All which is to say: 1) check the classpath and include all the jars and 2) open licensing means that something at least has a chance of being preserved by someone other than the creator who got fed up with all the emails asking how it worked... :-)

Thursday, 9 July 2009

Automatic Metadata Workshop (Long post - sorry!)

Sometimes I wonder if automatic metadata generation is viewed a little like the Industrial Revolution; which is to say that it is replacing skills and individuals with large scale industry.

I do not think it is really like that at all, being much more about enabling people to manage the ever increasing waves of information. It isn't saying to a weaver, "we can do what you can only faster, better and cheaper"; it is saying "here is something to help you make fabrics from this intangible intractable ether".

What got on this philosophical tract? The answer, as ever, is a train journey - in this case the ride home from Leicester, having attended a JISC-funded workshop on Automatic Metadata Generation. Subtitled "Use Cases" the workshop presented a series of reports outlining potential scenarios in which automatic metadata generation could be used to support the activities of researchers and, on occasion, curators/managers.

The reports have been collated by Charles Duncan and Peter Douglas at Intrallect Ltd. and the final report is due at the end of July.

The day started well as I approached the rather lovely Beaumont Hall at the University of Leicester and noted with a smile the acronym on a sign - "AMG".

Now, I'm from Essex so it is in my genes to know that AMG is the "performance" wing of Mercedes and looking just now at the AMG site, it says:

"Experience the World of Hand Crafted Performance"

a slogan any library or archive could (and should) use!

(Stick with me as I tie my philosophising with my serendipitous discovery of the AMG slogan)

I couldn't help but think a AMG-enabled (our sort, not the car sort) Library or Archive is like hand crafting finding aids, taking advantage of new technology, for better performance. I also thought that most AMG drivers don't care about the science behind getting a faster car, but just that it is faster - think about it...

Where was I? Oh yes.

The workshop!

It was a very interesting day. The format was for Charles, Peter or, occasionally the scenario author, to present the scenarios followed by an opportunity for discussion. This seemed to work well, but it was unfortunate that more of the authors of the scenarios themselves were unable to attend and give poor Charles and Peter a break from the presenting!

The scenarios themselves were around eight metadata themes:
  1. Subject-based
  2. Geographic
  3. Person-related
  4. Usage-related
  5. File Formats
  6. Factual
  7. Bibliographic
  8. Multilingual/Translated
I'll not cover all the scenarios here, but you are encouraged to visit the project Wiki where you can find more information and look out for the final report, but here are some things I got from the day:
  1. AMG to enhance discovery through automatic classification, recommendations on the basis of "similar users" activity ("also bought" function), etc. Note that this is not "by enhancing text-based searching".
  2. AMG could encourage more people to self-deposit (to Institutional Repositories) by automatically filling in the metadata fields in submission forms (now probably isn't the time to discuss the burden of metadata not being the only reason people don't self-deposit! :-)).
  3. AMG to help produce machine-to-machine data and facilitate queries. The big example of this was generating coordinates for place names to enable people with just place names to do geospacial searches, but there are uses here for generating Semantic Web-like links between items.
  4. AMG for preservation - the one I guess folks still reading are most familiar with. Identifying file formats, using PRONOM, DROID & JHOVE, etc. to identify risks, etc.
  5. AMG at creation. Metadata inserted into the digital object by the thing used to create it - iTunes grabbing data from Gracenote and poplating ID3 tags in its own sweet way, a digital camera recording shutter speed and appeture size, time of day and even location and embedding that data into the photo.
  6. The de facto method of AMG was to use Web services - with a skew towards REST-based services - which probably brings us back to cars - REST being nearer the sleek interior of a car than SOAP which exposes its innards to its users.
  7. Just in time AMG (JIT AMG - now there's a project acronym). When something like a translation service is expensive why pay to have all your metadata translated to a different language when you may be able to just do the titles and give your users the option to request (and get the result instantly) a translation if they think it useful.
  8. You might extend JIT AMG and wonder if it is worth pushing the AMG into the search engine? Text search engines already do that - the full-text being the bulk of the metadata - so what if a search engine were also enabled to "read" a music manuscript (a PDF or a Sibelius file for example) and you search for a sequence of notes. Would there be any need to put that sequence of notes into a metadata record if the object itself can function as the record (if you'll forgive the pun!)?
So what does all that mean for us?

Well, it is pretty clear that futureArch must rely on automatic metadata creation at all stages in the archival life cycle and a tool-chain to process items is a feature on diagrams Renhart has shown me since I arrived. It just would not be possible to manage a digital accession without some form of AMG - anyone fancy hand-crafting records for 11,000 random computer files? (Which are, of course, not random at all - representing as they do an individuals own private "order").

I worry slightly about the Web service stuff. For a tool to be useful to futureArch we need a copy here on our servers. First and foremost this ensures the privacy of our data and secondly we have the option then of preserving the service.

(Not to mention that a Web service probably wouldn't want us bombarding it with classification requests!)

(Fortunately the likes of DROID have already gone down the "engine" and "datafile" route favoured by anti-virus companies and let us hope that pattern remains!).

I quite like the idea of resource as metadata object, but I suspect it remains mostly unworkable. It was by accident rather than design that text-based documents, by virtue of their format, contain a body of available metadata. Still, I imagine image search engines are already extracting EXiF data and how many record companies check MP3s ID3 tags to trace their origins...? ;-)

At the end of the workshop we talked a bit about how AMG can scare people too - the Industrial Revolution where I started. To sell AMG technologists talk of how it "reduces cataloging effort", but in an economic climate looking for "reductions in cost" it is easy for management to assume the former implies the later, not realising that while the effort per item may go down, there are much more items!

Whether or not this is true remains to be seen, but early indications suggest AMG isn't any cheaper - just as any new technology isn't. It is just a different tool, designed to cope with a different information world; an essential part of managing digital information.

Yep, it is out of necessity that we will become the Automatic Metadata generation... :-)

Wednesday, 1 July 2009

Waxwork Accessions

I decided the other day that it would be useful to have a representative accession or two to play with. This way we could test for scalability and robustness (in dealing with different file formats, crazy filenames, and the like) of the various tools that will make up BEAM and also try out some of our ideas regarding packaging, disk images and such.

It isn't really possible to use a real accession for this purpose, mostly due to the confidentiality of some of our content. But I did want the accession to be as genuine as possible and here is how I did it. Any ideas for alternatives would be great!

The way I saw it, I needed three things to create the accession:
  1. A list of files and folders that formed a real accession
  2. A set of data that could be used - real documents, images, sound files, system files, etc.
  3. Some way of tying these together to create an accession modelled on a real one but containing public data
Fortunately Susan already had a list of files that made up the a 2GB hard drive from a donor - created from the forensic machine - which I thought would be a good starting point. Point 1 covered!

Next question was where to get the data. My first thought was to use public (open licensed) content from the Web - obtaining images when required through Flickr, getting documents via Scribd, etc. This is still a good approach for certain accessions. However, looking at my file list I quickly realised I wasn't just dealing with nice, obvious "documents". The accession contained a working copy of Windows 95 for example, jammed full of DLLs and EXEs. It also contained files pulled from old PCW disks by the owner, with no extension, applications from older systems, and all manner of oddities - "~$oto ", "~~S", "!.bk!" are just some examples.

It occured to me that I needed a more diverse source of files - most likely a real live system that could meet my request for a DLL while not revealing much about the original file. Where would I find this source? My own PC of course!

In theory my PC is dual-boot, running Ubuntu and Windows XP. The Windows XP partition is rarely used (I've nothing against XP, it just isn't so good a software development environment as Linux), but it struck me it'd make an excellent source of files, even if it was a version of Windows some way down the tracks from 95. By pulling files from my Windows disk (mounted by Ubuntu) I could, hopefully, create a more representative accession with a few more problems to solve than just "document" content.

(I also thought I could try creating a file system with a representative set of files to choose from - dlls from Windows 95 disks, etc. - but that would mean some manual collation of said files. This may be where I go next!).

So, 1 and 2 covered, what I needed next was a way to tie the file list to the data. I decided to use the file extension for this. For example, if the file list contains:


I wanted to grab any file with a ".DLL" extension from my data source (the XP disk). Any random file, rather than one that matched the accession, because the random here is likely to cause problems when it comes to processing this artificial accession and problems is what we really need to test something.

This suggested I needed a way to ask my file system "What do you have that has '.dll' at the end of the path?". There were lots of ways to do this - and here is where Linux shines. We have 'find', 'locate', 'which', etc. on the command line to discover files. There is also 'tracker' that I could have set indexing the XP filesystem. In the end I opted for Solr.

Solr provides a very quick and easy way to build an index of just about anything - it uses Lucene behind the scenes. (I like the way that almost rhymes!) If you're unfamiliar with either, then find out all about them quickly! In short, you tell it which fields you want to index (and how you want them indexed) and create XML documents that contain those fields. POST these to the Solr update service and it indexes them there and then.

I installed the Solr Web app., tweaked the configuration (including setting the data directory because no matter what I did with the environment and JNDI variables, it kept writing data to the directory from which Tomcat was started!), and then started posting documents for indexing to it. The document creation and POSTs were done with a simple Java "program" (really a script, and I could've just just about any language, but we're mostly using Java and I'm trying to de-rust my Java skills, I figured why not do this with Java too). The index of around 140,000 files took about 15 minutes (I've no idea if that is good or not).

(Renhart suggested an offshoot of this indexing too - namely the creation of a set of OS profiles, so that we can have a service that can be asked things like "What OS(es) does a file with SHA-1 hash XYZ belong to?" - enabling us to profile OSes and remove duplicates from our accessions).

The final step was to use another Java "program" to cludge the list of files in the accession with a lookup on the Solr index and some file copying. Then it is done - one accession that mirrors a real live file structure, contains real live files, but none of those files are "private" or a problem if they're lost. Even better, because we used more recent files, the accession is now 8GB rather than 2GB, aligning more with what we'd expect to get in the future.

Hooray! Now gotta pack it into disk images and start exploring processing!

Should anyone be interested, the source code is available for download.