Thursday, 9 July 2009

Automatic Metadata Workshop (Long post - sorry!)

Sometimes I wonder if automatic metadata generation is viewed a little like the Industrial Revolution; which is to say that it is replacing skills and individuals with large scale industry.

I do not think it is really like that at all, being much more about enabling people to manage the ever increasing waves of information. It isn't saying to a weaver, "we can do what you can only faster, better and cheaper"; it is saying "here is something to help you make fabrics from this intangible intractable ether".

What got on this philosophical tract? The answer, as ever, is a train journey - in this case the ride home from Leicester, having attended a JISC-funded workshop on Automatic Metadata Generation. Subtitled "Use Cases" the workshop presented a series of reports outlining potential scenarios in which automatic metadata generation could be used to support the activities of researchers and, on occasion, curators/managers.

The reports have been collated by Charles Duncan and Peter Douglas at Intrallect Ltd. and the final report is due at the end of July.

The day started well as I approached the rather lovely Beaumont Hall at the University of Leicester and noted with a smile the acronym on a sign - "AMG".

Now, I'm from Essex so it is in my genes to know that AMG is the "performance" wing of Mercedes and looking just now at the AMG site, it says:

"Experience the World of Hand Crafted Performance"

a slogan any library or archive could (and should) use!

(Stick with me as I tie my philosophising with my serendipitous discovery of the AMG slogan)

I couldn't help but think a AMG-enabled (our sort, not the car sort) Library or Archive is like hand crafting finding aids, taking advantage of new technology, for better performance. I also thought that most AMG drivers don't care about the science behind getting a faster car, but just that it is faster - think about it...

Where was I? Oh yes.

The workshop!

It was a very interesting day. The format was for Charles, Peter or, occasionally the scenario author, to present the scenarios followed by an opportunity for discussion. This seemed to work well, but it was unfortunate that more of the authors of the scenarios themselves were unable to attend and give poor Charles and Peter a break from the presenting!

The scenarios themselves were around eight metadata themes:
  1. Subject-based
  2. Geographic
  3. Person-related
  4. Usage-related
  5. File Formats
  6. Factual
  7. Bibliographic
  8. Multilingual/Translated
I'll not cover all the scenarios here, but you are encouraged to visit the project Wiki where you can find more information and look out for the final report, but here are some things I got from the day:
  1. AMG to enhance discovery through automatic classification, recommendations on the basis of "similar users" activity ("also bought" function), etc. Note that this is not "by enhancing text-based searching".
  2. AMG could encourage more people to self-deposit (to Institutional Repositories) by automatically filling in the metadata fields in submission forms (now probably isn't the time to discuss the burden of metadata not being the only reason people don't self-deposit! :-)).
  3. AMG to help produce machine-to-machine data and facilitate queries. The big example of this was generating coordinates for place names to enable people with just place names to do geospacial searches, but there are uses here for generating Semantic Web-like links between items.
  4. AMG for preservation - the one I guess folks still reading are most familiar with. Identifying file formats, using PRONOM, DROID & JHOVE, etc. to identify risks, etc.
  5. AMG at creation. Metadata inserted into the digital object by the thing used to create it - iTunes grabbing data from Gracenote and poplating ID3 tags in its own sweet way, a digital camera recording shutter speed and appeture size, time of day and even location and embedding that data into the photo.
  6. The de facto method of AMG was to use Web services - with a skew towards REST-based services - which probably brings us back to cars - REST being nearer the sleek interior of a car than SOAP which exposes its innards to its users.
  7. Just in time AMG (JIT AMG - now there's a project acronym). When something like a translation service is expensive why pay to have all your metadata translated to a different language when you may be able to just do the titles and give your users the option to request (and get the result instantly) a translation if they think it useful.
  8. You might extend JIT AMG and wonder if it is worth pushing the AMG into the search engine? Text search engines already do that - the full-text being the bulk of the metadata - so what if a search engine were also enabled to "read" a music manuscript (a PDF or a Sibelius file for example) and you search for a sequence of notes. Would there be any need to put that sequence of notes into a metadata record if the object itself can function as the record (if you'll forgive the pun!)?
So what does all that mean for us?

Well, it is pretty clear that futureArch must rely on automatic metadata creation at all stages in the archival life cycle and a tool-chain to process items is a feature on diagrams Renhart has shown me since I arrived. It just would not be possible to manage a digital accession without some form of AMG - anyone fancy hand-crafting records for 11,000 random computer files? (Which are, of course, not random at all - representing as they do an individuals own private "order").

I worry slightly about the Web service stuff. For a tool to be useful to futureArch we need a copy here on our servers. First and foremost this ensures the privacy of our data and secondly we have the option then of preserving the service.

(Not to mention that a Web service probably wouldn't want us bombarding it with classification requests!)

(Fortunately the likes of DROID have already gone down the "engine" and "datafile" route favoured by anti-virus companies and let us hope that pattern remains!).

I quite like the idea of resource as metadata object, but I suspect it remains mostly unworkable. It was by accident rather than design that text-based documents, by virtue of their format, contain a body of available metadata. Still, I imagine image search engines are already extracting EXiF data and how many record companies check MP3s ID3 tags to trace their origins...? ;-)

At the end of the workshop we talked a bit about how AMG can scare people too - the Industrial Revolution where I started. To sell AMG technologists talk of how it "reduces cataloging effort", but in an economic climate looking for "reductions in cost" it is easy for management to assume the former implies the later, not realising that while the effort per item may go down, there are much more items!

Whether or not this is true remains to be seen, but early indications suggest AMG isn't any cheaper - just as any new technology isn't. It is just a different tool, designed to cope with a different information world; an essential part of managing digital information.

Yep, it is out of necessity that we will become the Automatic Metadata generation... :-)

No comments: