As much a note to self as anything, but also a cautionary tale...
Open source libraries (I mean software libraries rather than big buildings with books - so apologies to the non-technical readers!) are very useful. Sometimes they can vanish - projects go under, people stop being interested, and soon a code base is "unsupported" and maybe, one day, might vanish from the Web. Take, for example, the Trilead Java SSH Library, the demise of which I think must be fairly recent.
A quick Google search suggests the following:
http://www.trilead.com/SSH_Library/
Which helpfully says:
"Trilead SSH for Java The freely available open-source library won't be anymore developed nore supported by Trilead." (sic.)
Unsupported, in this case, also means unavailable and there are no links to any code from here.
Other sites link to:
http://www.trilead.com/Products/Trilead-SSH-2-Java/
which gives a 404.
None of which is very helpful when your code is telling you:
Exception in thread "main" java.lang.NoClassDefFoundError: com/trilead/ssh2/InteractiveCallback
(Should any non-technical types still be reading, that means "I have no idea what you are talking about when you refer to a thing called com.trilead.ssh2.InteractiveCallback so I'm not going to work, no way, not a chance, so there. Ner.").
Now, had I been more awake, I probably would have noticed a sneaky little file by the name "trilead.jar" in the SVNKit directory. I would have also duly added it to the classpath. But I wasn't and I didn't and then got into a panic searching for it.
But, and here is the moral of the tale, I did find this:
"Also, in the meantime we will put Trilead SSH library source code into our repository (it is also distributed under BSD license) so it will remain available for the community and we still will be able to make minor bugfixes in it when necessary." [SVNKit User Mailing List, 18th May 2009]
Hooray for Open Source!
The open source project SVNKit, which made use of the open source library, was able - due to the open licensing - to absorb the SSH library and make it available along with the SVNKit code. Even though the Trilead SSH Library is officially defunct, it lives on in the hands of its users. Marvellous eh?
All of which is to say: 1) check the classpath and include all the jars and 2) open licensing means that something at least has a chance of being preserved by someone other than the creator who got fed up with all the emails asking how it worked... :-)
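For point 1, a quick sanity check can save the panic. The following is only a minimal sketch: the class name `ClasspathCheck` and its method are my own invention for illustration, and it assumes nothing beyond the standard `Class.forName` call. It simply asks the JVM whether a named class can be located before anything tries to use it.

```java
// Hypothetical helper (not part of SVNKit or Trilead): reports whether a
// class can be located on the current classpath.
final class ClasspathCheck {

    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initialisers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "com.trilead.ssh2.InteractiveCallback";
        System.out.println(name + (isAvailable(name)
                ? " found"
                : " missing - is trilead.jar on the classpath?"));
    }
}
```

Run with the same classpath as the failing program, it should say straight away whether the jar was picked up.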
Thursday, 16 July 2009
Thursday, 9 July 2009
Automatic Metadata Workshop (Long post - sorry!)
Sometimes I wonder if automatic metadata generation is viewed a little like the Industrial Revolution; which is to say that it is replacing skills and individuals with large scale industry.
I do not think it is really like that at all, being much more about enabling people to manage the ever increasing waves of information. It isn't saying to a weaver, "we can do what you can only faster, better and cheaper"; it is saying "here is something to help you make fabrics from this intangible intractable ether".
What got me on this philosophical track? The answer, as ever, is a train journey - in this case the ride home from Leicester, having attended a JISC-funded workshop on Automatic Metadata Generation. Subtitled "Use Cases", the workshop presented a series of reports outlining potential scenarios in which automatic metadata generation could be used to support the activities of researchers and, on occasion, curators/managers.
The reports have been collated by Charles Duncan and Peter Douglas at Intrallect Ltd. and the final report is due at the end of July.
The day started well as I approached the rather lovely Beaumont Hall at the University of Leicester and noted with a smile the acronym on a sign - "AMG".
Now, I'm from Essex so it is in my genes to know that AMG is the "performance" wing of Mercedes and looking just now at the AMG site, it says:
"Experience the World of Hand Crafted Performance"
a slogan any library or archive could (and should) use!
(Stick with me as I tie my philosophising with my serendipitous discovery of the AMG slogan)
I couldn't help but think an AMG-enabled (our sort, not the car sort) Library or Archive is like hand crafting finding aids, taking advantage of new technology, for better performance. I also thought that most AMG drivers don't care about the science behind getting a faster car, but just that it is faster - think about it...
Where was I? Oh yes.
The workshop!
It was a very interesting day. The format was for Charles, Peter or, occasionally, the scenario author to present the scenarios, followed by an opportunity for discussion. This seemed to work well, but it was unfortunate that more of the authors of the scenarios themselves were unable to attend and give poor Charles and Peter a break from the presenting!
The scenarios themselves were around eight metadata themes:
- Subject-based
- Geographic
- Person-related
- Usage-related
- File Formats
- Factual
- Bibliographic
- Multilingual/Translated
- AMG to enhance discovery through automatic classification, recommendations on the basis of "similar users" activity ("also bought" function), etc. Note that this is not "by enhancing text-based searching".
- AMG could encourage more people to self-deposit (to Institutional Repositories) by automatically filling in the metadata fields in submission forms (now probably isn't the time to discuss the burden of metadata not being the only reason people don't self-deposit! :-)).
- AMG to help produce machine-to-machine data and facilitate queries. The big example of this was generating coordinates for place names to enable people with just place names to do geospatial searches, but there are uses here for generating Semantic Web-like links between items.
- AMG for preservation - the one I guess folks still reading are most familiar with. Identifying file formats, using PRONOM, DROID & JHOVE, etc. to identify risks, etc.
- AMG at creation. Metadata inserted into the digital object by the thing used to create it - iTunes grabbing data from Gracenote and populating ID3 tags in its own sweet way, a digital camera recording shutter speed and aperture size, time of day and even location and embedding that data into the photo.
- The de facto method of AMG was to use Web services - with a skew towards REST-based services - which probably brings us back to cars - REST being nearer the sleek interior of a car than SOAP which exposes its innards to its users.
- Just in time AMG (JIT AMG - now there's a project acronym). When something like a translation service is expensive, why pay to have all your metadata translated to a different language when you may be able to just do the titles and give your users the option to request (and get the result instantly) a translation if they think it useful?
- You might extend JIT AMG and wonder if it is worth pushing the AMG into the search engine? Text search engines already do that - the full-text being the bulk of the metadata - so what if a search engine were also enabled to "read" a music manuscript (a PDF or a Sibelius file for example) and you search for a sequence of notes. Would there be any need to put that sequence of notes into a metadata record if the object itself can function as the record (if you'll forgive the pun!)?
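The JIT AMG idea in the list above is, in code terms, lazy evaluation with a cache. Here is a minimal sketch of that reading; `JitTranslator` is a name I have made up, and the "expensive service" is just a function passed in, standing in for whatever translation service might really be used. Nothing is translated until someone asks, and nothing is translated twice.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical sketch of "just in time" metadata translation:
// call the costly service only on request and remember the answer.
final class JitTranslator {

    private final UnaryOperator<String> expensiveService; // stand-in for a real translation service
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    JitTranslator(UnaryOperator<String> expensiveService) {
        this.expensiveService = expensiveService;
    }

    // Translates on first request; later requests for the same text hit the cache.
    String translate(String text) {
        return cache.computeIfAbsent(text, expensiveService);
    }
}
```

Titles could be pushed through `translate` up front, with abstracts and the rest left until a user clicks "translate this".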
Well, it is pretty clear that futureArch must rely on automatic metadata creation at all stages in the archival life cycle and a tool-chain to process items is a feature on diagrams Renhart has shown me since I arrived. It just would not be possible to manage a digital accession without some form of AMG - anyone fancy hand-crafting records for 11,000 random computer files? (Which are, of course, not random at all - representing as they do an individual's own private "order").
I worry slightly about the Web service stuff. For a tool to be useful to futureArch we need a copy here on our servers. First and foremost this ensures the privacy of our data and secondly we have the option then of preserving the service.
(Not to mention that a Web service probably wouldn't want us bombarding it with classification requests!)
(Fortunately the likes of DROID have already gone down the "engine" and "datafile" route favoured by anti-virus companies and let us hope that pattern remains!).
I quite like the idea of resource as metadata object, but I suspect it remains mostly unworkable. It was by accident rather than design that text-based documents, by virtue of their format, contain a body of available metadata. Still, I imagine image search engines are already extracting EXIF data and how many record companies check MP3s' ID3 tags to trace their origins...? ;-)
At the end of the workshop we talked a bit about how AMG can scare people too - the Industrial Revolution where I started. To sell AMG, technologists talk of how it "reduces cataloging effort", but in an economic climate looking for "reductions in cost" it is easy for management to assume the former implies the latter, not realising that while the effort per item may go down, there are many more items!
Whether or not this is true remains to be seen, but early indications suggest AMG isn't any cheaper - just as any new technology isn't. It is just a different tool, designed to cope with a different information world; an essential part of managing digital information.
Yep, it is out of necessity that we will become the Automatic Metadata generation... :-)
Wednesday, 1 July 2009
Waxwork Accessions
I decided the other day that it would be useful to have a representative accession or two to play with. This way we could test for scalability and robustness (in dealing with different file formats, crazy filenames, and the like) of the various tools that will make up BEAM and also try out some of our ideas regarding packaging, disk images and such.
It isn't really possible to use a real accession for this purpose, mostly due to the confidentiality of some of our content. But I did want the accession to be as genuine as possible and here is how I did it. Any ideas for alternatives would be great!
The way I saw it, I needed three things to create the accession:
- A list of files and folders that formed a real accession
- A set of data that could be used - real documents, images, sound files, system files, etc.
- Some way of tying these together to create an accession modelled on a real one but containing public data
Next question was where to get the data. My first thought was to use public (open licensed) content from the Web - obtaining images when required through Flickr, getting documents via Scribd, etc. This is still a good approach for certain accessions. However, looking at my file list I quickly realised I wasn't just dealing with nice, obvious "documents". The accession contained a working copy of Windows 95 for example, jammed full of DLLs and EXEs. It also contained files pulled from old PCW disks by the owner, with no extension, applications from older systems, and all manner of oddities - "~$oto ", "~~S", "!.bk!" are just some examples.
It occurred to me that I needed a more diverse source of files - most likely a real live system that could meet my request for a DLL while not revealing much about the original file. Where would I find this source? My own PC of course!
In theory my PC is dual-boot, running Ubuntu and Windows XP. The Windows XP partition is rarely used (I've nothing against XP, it just isn't so good a software development environment as Linux), but it struck me it'd make an excellent source of files, even if it was a version of Windows some way down the tracks from 95. By pulling files from my Windows disk (mounted by Ubuntu) I could, hopefully, create a more representative accession with a few more problems to solve than just "document" content.
(I also thought I could try creating a file system with a representative set of files to choose from - dlls from Windows 95 disks, etc. - but that would mean some manual collation of said files. This may be where I go next!).
So, 1 and 2 covered, what I needed next was a way to tie the file list to the data. I decided to use the file extension for this. For example, if the file list contains:
C:\WINDOWS\SYSTEM32\ABC.DLL
I wanted to grab any file with a ".DLL" extension from my data source (the XP disk). Any random file, rather than one that matched the accession, because the randomness here is likely to cause problems when it comes to processing this artificial accession, and problems are what we really need to test something.
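Pulling the extension out of a Windows-style path on a Linux box needs a little care, since `\` is not a path separator there. The following is a sketch of one way to do it rather than the code I actually ran; the class name `ExtensionKey` is made up for illustration, and the decision to treat names with no dot (or only a leading or trailing dot) as having no extension is an assumption.

```java
import java.util.Locale;

// Hypothetical helper: derives a lower-case extension from a path taken from
// the accession's file list, whichever slash style it uses.
final class ExtensionKey {

    static String of(String path) {
        // Strip everything up to the last separator of either kind
        int slash = Math.max(path.lastIndexOf('\\'), path.lastIndexOf('/'));
        String name = path.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        // No dot, a leading dot only, or a trailing dot: treat as "no extension"
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

Files that come back with an empty extension (the "~~S" kind) need some other matching rule, which is exactly the sort of problem the test accession is meant to flush out.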
This suggested I needed a way to ask my file system "What do you have that has '.dll' at the end of the path?". There were lots of ways to do this - and here is where Linux shines. We have 'find', 'locate', 'which', etc. on the command line to discover files. There is also 'tracker' that I could have set indexing the XP filesystem. In the end I opted for Solr.
Solr provides a very quick and easy way to build an index of just about anything - it uses Lucene behind the scenes. (I like the way that almost rhymes!) If you're unfamiliar with either, then find out all about them quickly! In short, you tell it which fields you want to index (and how you want them indexed) and create XML documents that contain those fields. POST these to the Solr update service and it indexes them there and then.
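To give a flavour of what gets POSTed, here is a minimal sketch that builds one Solr XML update message. The `<add><doc><field name="...">` shape is Solr's XML update format, but the field names `id` and `extension` are assumptions for illustration and would have to match whatever the schema defines; the class name `SolrDocSketch` is likewise made up. The sketch only builds the string and leaves the HTTP POST out.

```java
// Hypothetical sketch: builds a Solr XML <add> message for one file.
// Field names ("id", "extension") are assumptions and must match the schema in use.
final class SolrDocSketch {

    // Escape the characters that would otherwise break element content
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String addDoc(String path, String extension) {
        return "<add><doc>"
                + "<field name=\"id\">" + escape(path) + "</field>"
                + "<field name=\"extension\">" + escape(extension) + "</field>"
                + "</doc></add>";
    }

    public static void main(String[] args) {
        System.out.println(addDoc("/mnt/xp/WINDOWS/system32/abc.dll", "dll"));
    }
}
```

Depending on configuration, a `<commit/>` message usually has to follow before newly added documents show up in searches.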
I installed the Solr Web app., tweaked the configuration (including setting the data directory because no matter what I did with the environment and JNDI variables, it kept writing data to the directory from which Tomcat was started!), and then started posting documents for indexing to it. The document creation and POSTs were done with a simple Java "program" (really a script, and I could've used just about any language, but since we're mostly using Java and I'm trying to de-rust my Java skills, I figured why not do this with Java too). Indexing around 140,000 files took about 15 minutes (I've no idea if that is good or not).
(Renhart suggested an offshoot of this indexing too - namely the creation of a set of OS profiles, so that we can have a service that can be asked things like "What OS(es) does a file with SHA-1 hash XYZ belong to?" - enabling us to profile OSes and remove duplicates from our accessions).
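The OS profile idea boils down to a lookup from file hash to the set of operating systems that ship that file. A minimal sketch, assuming an in-memory map rather than a real service; `OsProfileSketch`, `record` and `osesFor` are names invented here, and only the standard `MessageDigest` API is relied on.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of an "OS profile" lookup keyed on SHA-1 hashes.
final class OsProfileSketch {

    private final Map<String, Set<String>> osesByHash = new HashMap<>();

    // Hex-encoded SHA-1 digest of the given bytes
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Note that a file with this content ships with the named OS
    void record(String os, byte[] fileContent) {
        osesByHash.computeIfAbsent(sha1Hex(fileContent), k -> new TreeSet<>()).add(os);
    }

    // "What OS(es) does a file with this SHA-1 hash belong to?"
    Set<String> osesFor(String sha1) {
        return osesByHash.getOrDefault(sha1, Collections.emptySet());
    }
}
```

Any file in an accession whose hash turns up in a profile is a candidate for being treated as stock OS content rather than something the depositor created.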
The final step was to use another Java "program" to kludge together the list of files in the accession with a lookup on the Solr index and some file copying. Then it is done - one accession that mirrors a real live file structure, contains real live files, but none of those files are "private" or a problem if they're lost. Even better, because we used more recent files, the accession is now 8GB rather than 2GB, aligning more with what we'd expect to get in the future.
Hooray! Now gotta pack it into disk images and start exploring processing!
Should anyone be interested, the source code is available for download.
Labels:
accession,
accessioning,
java,
solr
Friday, 19 June 2009
Friday Post
A couple of weeks ago now, in a developer meeting, Neil (Jefferies) mentioned the "Island of Unscalable Complexity". Even if I could remember the context, I probably wouldn't tell, but it seemed such a cool place to go visit I tried to make it real... Look close and you'll see our boat, on the way, grappling hooks at hand... Never say "unscalable" and expect no one to try! :-)
Monday, 15 June 2009
CAIRO Content Model: A noob's overview (Part 2) Accessions
I-E-O-2. It is a curious one and no mistake. The idea of an accession wasn't something I'd come across before (though alien concepts are no surprise to me now having worked in libraries for some time) - and if you really want to confuse yourself about what it is, try this Wikipedia article on 'accession'.
All clear now? Didn't think so! (Though the bit about the museum is probably closest). For the purpose of ingest, I'm not too worried about legal ownership (this must be assumed) so my working definition - and I make no claims that it is a good or accurate one - is:
"an accession is a unit of stuff that arrives at a given time, at a given archive, and needs to be added to that archive"
An accession then is defined by its origin, date of arrival and the collection to which it belongs.
Like IE01, IE02 uses the METS header element to record agent data - who did what and when to this record. This will usually be "created by ingest process", "updated by CAIRO tool by user X", etc.
In IE02, the descriptive metadata is kept deliberately Spartan. This is because the concept of the accession remains largely transparent to researchers. While it is important to the archivist to know the source of the collection's parts, the researcher really just needs the collection and the items. So, dmdSec gives us space for an accession identifier (unitId) and title (unitTitle) (using controlled formats - which may or may not be specified yet - I'll find out as I move through the model!) and then a minimal EAD/DC/MODS description. Since we're an archive, I focussed on the EAD, but cross-walking to the others would be possible.
The dmdSec EAD description is quite minimal, listing just origination information (principal creator), a physical description (extent, in MBs), a description of the formats and a description of the software/hardware environment used. All these are given as free-text, manual fields, but I wonder how far we can get offering automated "tips" for these elements.
There is also room for a pointer to some rights statement (amdSec) (as with IE01) and also a pointer to a log file that records activity over this accession (fileSec). The format of that log file remains undefined, but it'll record things like "checked all files for viruses", "identified 15 obsolete files", "transformed obsolete formats to ODT", etc. We'll have to think about how we implement this linkage. The model says use a URN to point at the log file, but I wonder if we could use a URI that points to a canned search of a general logging service - something along the lines of beam.ouls.ox.ac.uk/audit?accessionID=12345. That way we can dynamically generate log reports for each accession. Needs thinking about anyways.
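The canned-search idea amounts to generating the link from the accession identifier instead of storing a fixed file location. A tiny sketch of that follows; the audit service does not exist yet, so the `http` scheme, the class name `AuditLink` and the method are all assumptions, with only the host, path and `accessionID` parameter taken from the example above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: builds the "canned search" link for an accession's audit log.
// The service is imagined; the scheme and URL layout are assumptions.
final class AuditLink {

    private static final String BASE = "http://beam.ouls.ox.ac.uk/audit";

    static String forAccession(String accessionId) {
        // Encode the identifier so odd characters cannot break the query string
        return BASE + "?accessionID=" + URLEncoder.encode(accessionId, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(forAccession("12345"));
    }
}
```

The IE02 record would then only need to hold the accession identifier, with the link derived whenever it is wanted.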
Finally there is, in the structMap, a map to the other "main divisions" of the accession and the model suggests that these are either folders or files (different types are given - like subject folder or email directory) and so it is apparent that the accession structMap could be used to reflect the entire accession structure or just the top level directories. Which is better is unclear, but I suspect we will be adding a manifest to the accession that does list all the files (so the structure can be browsed without getting near the real data objects and, if we do it that way, without having to parse the IE02 too much).
We could put a pointer to the manifest alongside the log file pointer in fileSec.
I guess that is probably no clearer than the aforementioned Wikipedia page but at least I never used the word "Inaedificatio"!
Wednesday, 10 June 2009
Our Cunning Plan

We've had lots of long discussions here about what BEAM will look like and back in April (I think) we drew our first thoughts onto our newly arrived whiteboard. The diagram remains to remind us of where we are going, though I've already got some refinements to make. However, if you've ever doubted our sanity, here is some proof that you were right! :-)
Labels:
BEAM architecture,
cunning plan
Tuesday, 9 June 2009
Presenting email archives
Just a quick thought. Am wondering whether some of the tools for presenting mailing list archives might be adapted to present personal or organisational mailboxes. Maybe something like MHonArc?