Friday, 18 December 2009
Virtually JODConverter II
Customization
All the hard work has been done by good people in the open source community, so all we do is plug some bits together. Simple!
The bits:
1) JODConverter as a Web app.
2) OpenOffice 3 as Debian packages. (We *could* use apt-get, but Turnkey is based on Ubuntu 8.04 LTS, which as far as I can tell doesn't include OO3.)
3) An OpenOffice startup script. (There are others out there too, or make your own using /etc/init.d/skeleton.)
Having downloaded these files, you need to get them to JOD1. You could attach JOD1 to the world (via NAT on NIC3 or something) and download them directly, but I preferred to keep JOD1 as clean as possible - either that or I like making work for myself - and so I downloaded the files onto MON1 and SCP'd them over to JOD1.
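The copying itself is nothing special - plain scp from MON1 - but for completeness it looks something like this (the IP address and filenames are just placeholders for whatever you downloaded):

# run on MON1; JOD1's address and the filenames will be whatever you have
scp jodconverter-webapp-2.2.2.zip root@192.168.1.50:/root/
scp OOo_3_LinuxIntel_install_en-US_deb.tar.gz root@192.168.1.50:/root/
scp soffice-init-script root@192.168.1.50:/root/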
With the bits to hand, the steps are (all on JOD1 & as root):
1) Install the JODConverter WAR in JOD1's webapps directory:
- Unpack JODConverter and find the ".war" file. If you want, rename it to something else - this will be the path in the URL to JODConverter so I kept it simple and called it jodconv2.war.
- Move jodconv2.war to /var/lib/tomcat5.5/webapps/. (You could also "deploy" the WAR file via the Tomcat Web Admin interface.)
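Or, from the shell (the exact archive and directory names will vary with the version you grabbed, so treat the paths below as placeholders):

cd /root
unzip jodconverter-webapp-2.2.2.zip
# locate the WAR inside whatever was unpacked
find . -name '*.war'
# rename it on its way into Tomcat's webapps directory
mv path/to/the/jodconverter.war /var/lib/tomcat5.5/webapps/jodconv2.war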
Now, by default Turnkey (wisely) runs Tomcat with a SecurityManager enabled. This limits what servlets can do. If you restart Tomcat now (/etc/init.d/tomcat5.5 restart) and visit /jodconv2/ you'll probably find it isn't running. This puzzled me for a while, but it turns out the SecurityManager is to blame. I tried granting JODConverter several individual permissions, to no avail, so in the end I gave it blanket rights by adding:
grant codeBase "file:/var/lib/tomcat5.5/webapps/jodconv2/WEB-INF/-"
{
permission java.security.AllPermission;
};
to the file:
/etc/tomcat5.5/policy.d/04webapps.policy
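If you'd rather make that change from the shell, appending the grant (exactly as above) can be done in one go with a heredoc:

# append the blanket grant to Tomcat's webapps policy
cat >> /etc/tomcat5.5/policy.d/04webapps.policy <<'EOF'
grant codeBase "file:/var/lib/tomcat5.5/webapps/jodconv2/WEB-INF/-"
{
permission java.security.AllPermission;
};
EOF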
Restart Tomcat and with a bit of luck you'll get the jodconv2 page - a clean form that suggests you upload a document.
Try it and it fails! Why? Because you need to do step 2!
2) Install OpenOffice 3
- Unpack the OpenOffice tar.gz. Inside there is a directory DEBS/. I'm sure they're not all needed, but rather than work out which ones to keep and which not, I installed them all (might want to revisit that one day). Install the contents of DEBS:
- cd [OO_DIR]/DEBS/
- ls *.deb | xargs dpkg -i
3) Install the init.d script - there are instructions at the link above, but in short:
- Cut and paste the script into a new file: /etc/init.d/soffice
- Edit the script to point to the right place for OO:
OOo_HOME=/usr/bin
to
OOo_HOME=/opt/openoffice.org3/program/
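In shell terms, step 3 amounts to something like this (assuming you saved the script as /etc/init.d/soffice; the update-rc.d call is the same one the TKLPatch conf script runs later):

# make the init script executable
chmod +x /etc/init.d/soffice
# point it at the OpenOffice 3 install
sed -i 's|^OOo_HOME=.*|OOo_HOME=/opt/openoffice.org3/program/|' /etc/init.d/soffice
# have it start on boot, and start it now
update-rc.d soffice defaults
/etc/init.d/soffice start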
And, in theory, you're done!
Trying it out!
Shut down and restart JOD1 and, using MON1, connect to:
http://yourIPofJOD1/jodconv2/
Upload a sample document and hopefully you'll get a nice fancy PDF back.
If you don't, I suspect it is because I missed some vital step along the way! Sorry about that!
Feel free to comment or email if you need a hand! :-)
Packaging the appliance
So we're done, right? Well, mostly. But what if we now want to deploy this appliance? Don't we need it neatly wrapped up and ready to roll?
Yep. I guess we do!
Since we didn't do anything "static" to JOD1 (set the IP, for example), it is fairly simple to export it as OVF. It can then be imported into any virtualization system and will run just fine, assuming it is connected to a real or virtual network with a DHCP server.
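In VirtualBox the export is a single command; something along these lines (the VM and output names are just the ones I've been using here):

VBoxManage export JOD1 --output jodconv2-appliance.ovf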
You may also want to create an ISO of JOD1 so it can be deployed by simple installation. This is made pretty easy with TKLPatch - a set of scripts that automate the process of creating an ISO from an OS patch.
The patch I created looks like this:
jodconv2/debs/*openoffice*.deb - ie. all the OpenOffice debs
jodconv2/overlay/ contains the following:
|-- overlay
| |-- etc
| | |-- init.d
| | | `-- soffice
| | `-- tomcat5.5
| | `-- policy.d
| | `-- 04webapps.policy
| `-- var
| `-- lib
| `-- tomcat5.5
| `-- webapps
| `-- jodconv2.war
and finally
jodconv2/conf is a simple one-liner:
update-rc.d soffice defaults
Armed with those details and the TKLPatch guide you should have no worries making an ISO of a JODConverter appliance. However, there are a few caveats with TKLPatch.
Firstly, you might notice that Turnkey (and I guess Ubuntu) spreads Tomcat all over the OS - in /etc/, in /var/, etc. If you make a change (say, to the policy files) and want to put that in the overlay, be sure you use the path to the *real* file rather than the symlinked path (i.e. /etc/tomcat5.5 rather than /var/lib/tomcat5.5/conf).
Secondly, build the patch and create the ISO on a Turnkey Linux machine - I used JOD1 in the end. (This is mentioned in the TKL support forum).
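For completeness, with the patch directory laid out as above the build itself should be a single command along these lines (the ISO name is a placeholder for whichever Turnkey Tomcat image you downloaded):

tklpatch turnkey-tomcat-x86.iso jodconv2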
Well, that is one long post! Sorry about that! I hope someone will find it useful one day. I suspect I will in the New Year when I've forgotten just where this jodconv2 VM came from! :-)
Thursday, 17 December 2009
Virtually JODConverter I
Why bother? Well, we've committed to a virtual architecture and one of the things we gain is the ability to add and remove appliances as the need arises - meeting the changing needs of the Digital Asset Management System at any point - and so having a few appliances we can throw up at the drop of a hat (someone phones and says "I need to do a huge deposit of items and I need to do it yesterday, can you handle the extra load?") will be very useful. (There are other gains too - mostly the consolidation of space and energy use - you'll find lots on all that out there in Web land!)
That said, you might just want to run JODConverter on your desktop machine. If you do, this'll help too. Just make the virtual appliance, run it on your desktop and use NAT & port mapping to connect to it. Voila! Your own personal copy of JODConverter as a Web service! :-)
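With a reasonably recent VirtualBox, that NAT port mapping is a couple of VBoxManage calls; a sketch (the VM name, host port and NIC number are assumptions - adjust to taste):

VBoxManage modifyvm "JOD1" --nic1 nat
VBoxManage modifyvm "JOD1" --natpf1 "jodconv,tcp,,8080,,80"
# then point your desktop browser at http://localhost:8080/jodconv2/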
Back in 2008, on the Google Code home of JODConverter, some folks suggested that a virtual appliance with JODConverter & OpenOffice would be a Good Thing(tm) :-). About a week ago, quite independently, we also decided it would be a Good Thing(tm) and I set about making it - and that, in spite of this preamble, is what I really wanted to write about! :-)
So, here are the simple steps I took to make a JODConverter Virtual Appliance. Note that I used JODConverter 2.2.2, which seems more stable than version 3 at the moment.
Preliminaries
1) Get yourself some virtualization software - I use VirtualBox.
2) (Optional, but I'll assume you did) Create, or reuse, a regular desktop VM (I used a standard Xubuntu install) - MON1 - and attach it to an internal (virtual) network on NIC0 (it's useful to attach another NIC to the world via NAT too). Also add a folder share with the host (your desktop PC). This will be handy later for moving ISOs, patches, etc. into and out of the virtual world.
3) Create a new (small) VM and install a Turnkey appliance of choice - I'll call this JOD1.
I used Turnkey's Tomcat but you might be trying to do something different. :-) I opted for a small and simple configuration (512MB RAM, 1 x 2GB disk and nothing fancy). Remember that appliances don't have fancy "desktops" so graphics capability isn't really a requirement! :-)
4) (Optional, see 2) Attach JOD1 to the same internal network as MON1 (the VM created in step 2) - there's a VBoxManage sketch at the end of this post.
We do this so that you can check open ports on JOD1, test if JODConverter is running, OpenOffice service is up, etc.
5) Start both machines.
You should now have a running Tomcat VM & a way of seeing it - open a browser on the test machine and try JOD1's IP (port 80). You should see the Web admin interface; if you don't, check all the network connections, check that JOD1 started OK, and so on.
Now would probably be a good time to change the Web admin password!
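For reference, the VirtualBox wiring for steps 2 and 4 boils down to a few VBoxManage calls; a sketch (the VM names, the internal network name "beamnet" and the host path are all assumptions):

VBoxManage modifyvm "MON1" --nic1 intnet --intnet1 beamnet --nic2 nat
VBoxManage modifyvm "JOD1" --nic1 intnet --intnet1 beamnet
# the shared folder for shuffling ISOs and patches about
VBoxManage sharedfolder add "MON1" --name hostshare --hostpath /path/on/your/desktop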
Continued
Antagonistic Books
http://www.instructables.com/id/ANTAGONISTIC-BOOKS-Danger-How-To-make-a-book-th/
and
http://www.instructables.com/id/ANTAGONISTIC-BOOKS-Curiosity-How-To-make-a-book/
Just something else to watch out for in archives I guess! (I can't wait until the second one turns up only someone turned the ratchet around so it'll only close...)
There is something about them both that reminds me of William Gibson's Agrippa (a book of the dead) and that is something that has puzzled some of the archivists I've spoken with! :-)
Wednesday, 9 December 2009
Open Development: Building an Engaged Community
The day's aims were:
- Understand how open development works and know the common community structures
- Be familiar with the skills and processes that encourage community participation
- Develop ideas for improving the community friendliness of a specific project
From the outset, Steve Lee of OSS Watch, in his introduction to the day, made it clear that open development practice is key to open source software, rather than simply access to the source code. (A couple of the presenters said "don't just throw it over the wall", referring to the practice of putting your source code somewhere public and walking away - a very common practice in our field - as this will not lead to a sustainable software product.)
The rest of the day supported this ideal... Sebastian Brännström of the Symbian Foundation spoke of how Symbian hoped to make as much of the Symbian operating system (for phones) open source as soon as possible, and outlined the large (and quite formal) organisational structure required to support its 40 million lines of code. For a software project that large this shouldn't come as a surprise, but it clearly shows that "open sourcing" (I mean the process of making software open source, rather than sourcing work in an open way - though both are valid!) might not always be cheap or a free (beer) option. Indeed, he hoped that there would be a full-time, paid community leader whose sole role would be to maintain and manage one of the 134 software packages that make up the Symbian OS.
Next up, Sander van der Waal of OSS Watch took us through the developer experience of taking part in an open source project - both as part of a commercial company in the Netherlands and while working on the OSS Watch project SIMAL. It was very interesting to hear how his team had gone about contributing to Apache Felix & Jackrabbit (two products very much of interest to our community!). He suggested it was very important to make use of the usual cluster of open source development tools - not just version management, but also mailing lists, bug-tracking systems, wikis and the like - whether you are a "one man band" developer or a whole team. In many ways his experience here helped ease my nerves about contributing to projects.
The final speaker was Mark Johnson, of Taunton's College, giving his experiences and tips on being involved with the open source course management system, Moodle. In a past life I've developed for Moodle, so this was interesting to hear about. His advice was broadly similar to that of the other two speakers, though from a different perspective and here there was evidence of useful reinforcement of ideas rather than repetition, which is always a good thing.
A workshop isn't complete without a bit of group work and we were asked to complete a questionnaire designed, I think, to get us thinking about the sustainability of our open source projects by highlighting areas we should be considering - licensing, use of standards, documentation, etc.
This was a very useful tool and the questions got me thinking about all sorts of things. The results for futureArch were bad - all "red" (for danger) except the section on use of standards - but that didn't come as much of a surprise. I think it would be fair to say that futureArch isn't an "Open Source Project" per se. Rather we're avid users of open source software. We, like many, do not have the resources to run a community around anything we build (who has funding for a full-time community manager?) and it would probably be inappropriate to try. But we can and will contribute to other projects and the workshop helped me see that this was both pretty easy (assuming everyone is nice) and desirable.
And, of course, anything we build here - the ingest tool, for example, or the metadata manager - will probably be "thrown over the wall". People will be able to find it and others, if they get the urge, will be able to found a community around it - which I guess shows there is value in simple publication of source code, in addition to the (far more preferable and more likely to succeed) development of a community around a product. (The revelation that community building is essential for a sustained software product, probably so obvious to many, sheds light on the reasons behind things like Dev8D too.)
Just some final thoughts then, as it grows ever darker and it is good not to cycle home too late!
Firstly, it struck me as people talked that, while open source could be seen as less formal than closed software development, it clearly is not. Developing communities, and the subsequent control and management of those communities, requires formal structures, making open source anything but an easy option.
Secondly, the reasons given for contributing to an open source project were fascinating. Someone mentioned how taking part meant you felt you were not alone, but the overwhelming reason given was "recognition". By contributing you can get your name (and that of your employer) in lights; participating in a community could lead to job offers or other personal success. As most projects run on a meritocratic basis - the more good you do, the more say you have - that success could be to become the community leader or at least one of the controllers of the code, the fabled "committers". This is a curious thing - the reason to participate in a "community" is the "selfish" urge to self-promote. Something jars there, but I'm not quite sure what.
Wednesday, 18 November 2009
Data Liberation: Google's mission
"Users should be able to control the data they store in any of Google's products. Our team's goal is to make it easier to move data in and out."
Yay! Loyalty is best achieved through great products, not data lock-in. As an individual who uses online data services this approach makes me very happy. As an archivist I'm ecstatic. Thanks guys.
More about how to get data in and out of Google's many services at the DLF's blog.
Tuesday, 17 November 2009
building castles 1: the problem
- A collection of things.
- A set of born digital items - mostly documents in antique formats.
- EAD for the collection - hierarchical according to local custom and ISAD(G).
- A spreadsheet - providing additional information about the digital items, including digests.
- We could build a database and put all the metadata into it and run the site off that
- We could build a set of resources (the items, the sub[0,*]series, the collection, the people), link all that data together and run the site off that.
- We could build a bunch of flat pages which, while generated dynamically once, don't change once the collection is up.
Thursday, 5 November 2009
Note to self...
http://www.stfj.net/art/2009/loselose/
(Needless to say we wouldn't anyway!)
This, from the site:
"As technology grows, our understanding of it diminishes, yet, at the same time, it becomes increasingly important in our lives. At what point does our virtual data become as important to us as physical possessions? If we have reached that point already, what real objects do we value less than our data? What implications does trusting something so important to something we understand so poorly have?"
Wednesday, 21 October 2009
Bendy ePaper
http://www.reghardware.co.uk/2009/10/20/auo_epaper/
Thursday, 8 October 2009
Investigating Terms of Service
Each Terms of Service is basically the same, though a few are a bit more specific about what is and is not allowed. Here’s a basic table where you can see briefly what each ToS contains (sorry it's a bit small):
A second problem is that most sites restrict data harvesting. Facebook bans it outright, while Twitter only prohibits scraping; crawling is allowed “if done in accordance with the provisions of the robots.txt file” (which aren't stated). Myspace, meanwhile, only prohibits automated harvesting of data “for the purposes of sending unsolicited or unauthorised material”. This implies that harvesting data for archival purposes is allowed. However, this isn't stated directly, and since some stipulations are quite specific I'd be inclined to check with the service provider rather than rely on assumptions.
Wednesday, 30 September 2009
Advisory board meeting, 24 Sept. 2009
Introductions
We started with some introductory discussions around the Library's hybrid collections and the futureArch project's aims and activities. This discussion was wide ranging, touching on a number of subjects including the potential content sources for 'digital manuscripts': from mobile phones, to digital media, to cloud materials.
Systems
In the past year, we've made progress on developing, and beginning to implement, the technical architecture for BEAM (Bodleian Electronic Archives & Manuscripts). Pete Cliff (futureArch Software Engineer) kicked off our session on 'systems' with an overview of the architecture, drawing on some particular highlights; it's worth a look at his slides if you're interested in finding out more.
1. Renhart Gittens demonstrated the BEAM ingester, our means of committing accessions (under a collection umbrella) to BEAM's preservation storage.
2. Dave Thompson (Wellcome Library Digital Curator) demonstrated the XIP creator. This tool does a similar job to the BEAM Ingester and forms part of the Tessella digital preservation system being implemented at the Wellcome Library.
Keeping with technical architecture, Neil Jefferies (OULS R&D Project Officer) introduced Oxford University Library Service's Digital Asset Management System (or DAMS, as we've taken to calling it). This is the resilient preservation store upon which BEAM, and other digital repositories, will sit.
How will researchers use hybrid archives?
Next we turned our attention to the needs of the researchers who will use the Library's hybrid archives. Matt Kirschenbaum (Assoc. Prof. of English & Assoc. Director of MITH at the University of Maryland) got us off to a great start with an overview of his work as a researcher working with born-digital materials. Matt's talk emphasised digital archives as 'material culture', an aspect of digital manuscripts that can be overlooked when the focus becomes overly content-driven. Some researchers want to explore the writer's writing environment; this includes seeing the writer's desktop, and looking at their MP3 playlist, as much as examining the word-processed files generated on a given computer. Look out for the paper Matt has co-authored for iPRES this year.
Next we broke into groups to critique the 'interim interface' which will serve as a temporary access mechanism for digital archives while a more sophisticated interface is developed for BEAM. Feedback from the advisory board critique session was helpful and we've come away with a to-do list of bug fixes and enhancements for the interim interface as well as ideas for developing BEAM's researcher interfaces. We expect to take work on researcher requirements further next year (2010) through workshops with researchers.
Finally, we heard from Helen Hockx-Yu (the British Library's Web Archiving Programme Manager) on the state of the art in web archiving. Helen kindly agreed to give us an overview of web archiving processes and the range of web archiving solutions available. Her talk covered all the options, from implementing existing tool suites in-house to outsourcing some or all of the activity. This was enormously useful and should inform conversations about the desired scope of web archiving activity at the Bodleian and the most appropriate means by which this could be supported.
Some of us continued the conversation into a sunny autumn evening on the terrace of the Fellows' Garden of Exeter College, and then over dinner.
Monday, 14 September 2009
OS recovery tool
Friday, 21 August 2009
Monday, 17 August 2009
Balisage: The Markup Conference 2009
Performance of XML-based applications
One of the components of their XML-based publishing system discussed was the Schema, Addressing, and Storage System (SASS) - a data store that provides a unified view of metadata and content for the publications they host. It relies on storing metadata and related resources as a file-system-based set of XML files, with XQuery used to feed these resources to an XSLT-based display layer.
XML in the browser
Explored how new XML vocabularies could be integrated into the browser, thereby providing a new way forward for XML in the browser.
Towards markup support for full GODDAGs and beyond: the EARMARK approach
Examined overlapping markup and the use of the standoff approach to markup to address the issues created by a purely hierarchical approach.
TNTBase: Versioned Storage for XML
Presented an open-source versioned XML database created by integrating Berkeley DB XML into the Subversion server.
Agile Business Objects Management Application for Electronic Records Archive Transfer Process
How XForms and Genericode are assisting the National Archives and Records Administration (NARA) in their goal to provide archivists with a modernised system with automatic workflow for the digital archive business process.
"A practical introduction to EXPath: Collaboratively Defining Open Standards for Portable XPath Extensions"
Covered the benefits of a collaborative approach to the definition and implementation of standardised XPath extensions which core XML technologies such as XQuery, XSLT, XProc and XForms would be able to use in a uniform way.
Automatic XML Namespaces
Explored the difficulties with namespaces and proposed ways of addressing them.
Describing agents: EAC-CPF
Wednesday, 12 August 2009
The digital estate
http://lawprofessors.typepad.com/trusts_estates_prof/2009/03/planning-for-digital-estates-updated-company-list.html
http://edition.cnn.com/2009/TECH/05/18/death.online/index.html?iref=t2test_techmon
Many of us have digital materials online (or offline) that we want to pass on to friends and family. By outlining where materials are and supplying account credentials, we can ensure this happens. It could be done in regular estate planning, or simply keeping a note somewhere safe (and telling the right person where to find it!). There are also online services springing up to help people make arrangements for passing on the relevant information. Research libraries have an obvious interest here; digital material is an important part of a person's archive, and unless someone knows it exists, where to find it, and how to access it there is little prospect of saving it for future generations.
Thursday, 6 August 2009
Secure Delete
Wednesday, 29 July 2009
Carved in Silicon
I rather like this too:
"...the digital data archivists' arch enemy: magnetic polarity"
(I added the bold!)
Does that make digital archivists like the X-Men ? ;-)
ntfsprogs
http://man.linux-ntfs.org/
and in particular:
http://man.linux-ntfs.org/ntfsclone.8.html
Monday, 20 July 2009
Thursday, 16 July 2009
open source forever, right?
Open source libraries (I mean software libraries rather than big buildings with books - so apologies for the non-technical readers!) - are very useful. Sometimes they can vanish - projects go under, people stop being interested, and soon a code base is "unsupported" and maybe, one day, might vanish from the Web. Take, for example, the Trilead Java SSH Library, the demise of which I think must be fairly recent.
A quick Google search suggests the following:
http://www.trilead.com/SSH_Library/
Which helpfully says:
"Trilead SSH for Java The freely available open-source library won't be anymore developed nore supported by Trilead." (sic.)
Unsupported, in this case, also means unavailable and there are no links to any code from here.
Other sites link to:
http://www.trilead.com/Products/Trilead-SSH-2-Java/
which gives a 404.
None of which is very helpful when your code is telling you:
Exception in thread "main" java.lang.NoClassDefFoundError: com/trilead/ssh2/InteractiveCallback
(Should any non-technical types still be reading, that means "I have no idea what you are talking about when you refer to a thing called com.trilead.ssh2.InteractiveCallback so I'm not going to work, no way, not a chance, so there. Ner.").
Now, had I been more awake, I probably would have noticed a sneaky little file by the name of "trilead.jar" in the SVNKit directory. I would have also duly added it to the classpath. But I wasn't and I didn't, and then got into a panic searching for it.
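(For anyone landing here with the same NoClassDefFoundError: the cure really is just putting trilead.jar on the classpath next to the SVNKit jar - something like the line below, where the jar names and the main class are placeholders for your own.)

java -cp svnkit.jar:trilead.jar:. com.example.MySvnClient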
But, and here is the moral of the tale, I did find this:
"Also, in the meantime we will put Trilead SSH library source code into our repository (it is also distributed under BSD license) so it will remain available for the community and we still will be able to make minor bugfixes in it when necessary." [SVNKit User Mailing List, 18th May 2009]
Hooray for Open Source!
The open source project SVNKit, which made use of the open source library, was able - thanks to the open licensing - to absorb the SSH library and make it available along with the SVNKit code. Even though the Trilead SSH Library is officially defunct, it lives on in the hands of its users. Marvellous, eh?
All of which is to say: 1) check the classpath and include all the jars, and 2) open licensing means that something at least has a chance of being preserved by someone other than the creator who got fed up with all the emails asking how it worked... :-)
Thursday, 9 July 2009
Automatic Metadata Workshop (Long post - sorry!)
I do not think it is really like that at all, being much more about enabling people to manage the ever-increasing waves of information. It isn't saying to a weaver, "we can do what you can, only faster, better and cheaper"; it is saying "here is something to help you make fabrics from this intangible, intractable ether".
What got me onto this philosophical track? The answer, as ever, is a train journey - in this case the ride home from Leicester, having attended a JISC-funded workshop on Automatic Metadata Generation. Subtitled "Use Cases", the workshop presented a series of reports outlining potential scenarios in which automatic metadata generation could be used to support the activities of researchers and, on occasion, curators/managers.
The reports have been collated by Charles Duncan and Peter Douglas at Intrallect Ltd. and the final report is due at the end of July.
The day started well as I approached the rather lovely Beaumont Hall at the University of Leicester and noted with a smile the acronym on a sign - "AMG".
Now, I'm from Essex so it is in my genes to know that AMG is the "performance" wing of Mercedes and looking just now at the AMG site, it says:
"Experience the World of Hand Crafted Performance"
a slogan any library or archive could (and should) use!
(Stick with me as I tie my philosophising with my serendipitous discovery of the AMG slogan)
I couldn't help but think an AMG-enabled (our sort, not the car sort) Library or Archive is like hand-crafting finding aids, taking advantage of new technology, for better performance. I also thought that most AMG drivers don't care about the science behind getting a faster car, just that it is faster - think about it...
Where was I? Oh yes.
The workshop!
It was a very interesting day. The format was for Charles, Peter or, occasionally, the scenario author to present the scenarios, followed by an opportunity for discussion. This seemed to work well, but it was unfortunate that more of the scenario authors were unable to attend and give poor Charles and Peter a break from the presenting!
The scenarios themselves were around eight metadata themes:
- Subject-based
- Geographic
- Person-related
- Usage-related
- File Formats
- Factual
- Bibliographic
- Multilingual/Translated
- AMG to enhance discovery through automatic classification, recommendations on the basis of "similar users" activity ("also bought" function), etc. Note that this is not "by enhancing text-based searching".
- AMG could encourage more people to self-deposit (to Institutional Repositories) by automatically filling in the metadata fields in submission forms (now probably isn't the time to discuss the burden of metadata not being the only reason people don't self-deposit! :-)).
- AMG to help produce machine-to-machine data and facilitate queries. The big example of this was generating coordinates for place names to enable people with just place names to do geospatial searches, but there are uses here for generating Semantic Web-like links between items.
- AMG for preservation - the one I guess folks still reading are most familiar with. Identifying file formats, using PRONOM, DROID & JHOVE, etc. to identify risks, etc.
- AMG at creation. Metadata inserted into the digital object by the thing used to create it - iTunes grabbing data from Gracenote and populating ID3 tags in its own sweet way, a digital camera recording shutter speed and aperture size, time of day and even location, and embedding that data into the photo.
- The de facto method of AMG was to use Web services - with a skew towards REST-based services - which probably brings us back to cars - REST being nearer the sleek interior of a car than SOAP which exposes its innards to its users.
- Just-in-time AMG (JIT AMG - now there's a project acronym). When something like a translation service is expensive, why pay to have all your metadata translated into a different language when you may be able to just do the titles and give your users the option to request (and instantly get) a translation if they think it useful?
- You might extend JIT AMG and wonder if it is worth pushing the AMG into the search engine? Text search engines already do that - the full-text being the bulk of the metadata - so what if a search engine were also enabled to "read" a music manuscript (a PDF or a Sibelius file for example) and you search for a sequence of notes. Would there be any need to put that sequence of notes into a metadata record if the object itself can function as the record (if you'll forgive the pun!)?
Well, it is pretty clear that futureArch must rely on automatic metadata creation at all stages in the archival life cycle, and a tool-chain to process items is a feature of the diagrams Renhart has shown me since I arrived. It just would not be possible to manage a digital accession without some form of AMG - anyone fancy hand-crafting records for 11,000 random computer files? (Which are, of course, not random at all - representing as they do an individual's own private "order".)
I worry slightly about the Web service stuff. For a tool to be useful to futureArch we need a copy here on our servers. First and foremost this ensures the privacy of our data and secondly we have the option then of preserving the service.
(Not to mention that a Web service probably wouldn't want us bombarding it with classification requests!)
(Fortunately the likes of DROID have already gone down the "engine" and "datafile" route favoured by anti-virus companies and let us hope that pattern remains!).
I quite like the idea of resource as metadata object, but I suspect it remains mostly unworkable. It was by accident rather than design that text-based documents, by virtue of their format, contain a body of available metadata. Still, I imagine image search engines are already extracting EXiF data and how many record companies check MP3s ID3 tags to trace their origins...? ;-)
At the end of the workshop we talked a bit about how AMG can scare people too - the Industrial Revolution where I started. To sell AMG, technologists talk of how it "reduces cataloging effort", but in an economic climate looking for "reductions in cost" it is easy for management to assume the former implies the latter, not realising that while the effort per item may go down, there are many more items!
Whether or not this is true remains to be seen, but early indications suggest AMG isn't any cheaper - just as any new technology isn't. It is just a different tool, designed to cope with a different information world; an essential part of managing digital information.
Yep, it is out of necessity that we will become the Automatic Metadata generation... :-)
Wednesday, 1 July 2009
Waxwork Accessions
It isn't really possible to use a real accession for this purpose, mostly due to the confidentiality of some of our content. But I did want the accession to be as genuine as possible and here is how I did it. Any ideas for alternatives would be great!
The way I saw it, I needed three things to create the accession:
- A list of files and folders that formed a real accession
- A set of data that could be used - real documents, images, sound files, system files, etc.
- Some way of tying these together to create an accession modelled on a real one but containing public data
Next question was where to get the data. My first thought was to use public (open licensed) content from the Web - obtaining images when required through Flickr, getting documents via Scribd, etc. This is still a good approach for certain accessions. However, looking at my file list I quickly realised I wasn't just dealing with nice, obvious "documents". The accession contained a working copy of Windows 95 for example, jammed full of DLLs and EXEs. It also contained files pulled from old PCW disks by the owner, with no extension, applications from older systems, and all manner of oddities - "~$oto ", "~~S", "!.bk!" are just some examples.
It occurred to me that I needed a more diverse source of files - most likely a real live system that could meet my request for a DLL while not revealing much about the original file. Where would I find this source? My own PC of course!
In theory my PC is dual-boot, running Ubuntu and Windows XP. The Windows XP partition is rarely used (I've nothing against XP, it just isn't so good a software development environment as Linux), but it struck me it'd make an excellent source of files, even if it was a version of Windows some way down the tracks from 95. By pulling files from my Windows disk (mounted by Ubuntu) I could, hopefully, create a more representative accession with a few more problems to solve than just "document" content.
(I also thought I could try creating a file system with a representative set of files to choose from - dlls from Windows 95 disks, etc. - but that would mean some manual collation of said files. This may be where I go next!).
So, 1 and 2 covered, what I needed next was a way to tie the file list to the data. I decided to use the file extension for this. For example, if the file list contains:
C:\WINDOWS\SYSTEM32\ABC.DLL
I wanted to grab any file with a ".DLL" extension from my data source (the XP disk). Any random file, rather than one that matched the accession, because the random here is likely to cause problems when it comes to processing this artificial accession and problems is what we really need to test something.
This suggested I needed a way to ask my file system "What do you have that has '.dll' at the end of the path?". There were lots of ways to do this - and here is where Linux shines. We have 'find', 'locate', 'which', etc. on the command line to discover files. There is also 'tracker', which I could have set to index the XP filesystem. In the end I opted for Solr.
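(For a one-off question the command line would have done the job - something like the line below, assuming the XP partition is mounted at /mnt/winxp.)

find /mnt/winxp -type f -iname '*.dll'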
Solr provides a very quick and easy way to build an index of just about anything - it uses Lucene behind the scenes. (I like the way that almost rhymes!) If you're unfamiliar with either, then find out all about them quickly! In short, you tell it which fields you want to index (and how you want them indexed) and create XML documents that contain those fields. POST these to the Solr update service and it indexes them there and then.
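To illustrate the idea (this isn't my Java program, just the same interaction done with curl, and it assumes a schema with "path" and "extension" fields plus Solr's example port rather than wherever your Tomcat install ends up):

# index one file's details, then commit
curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary \
  '<add><doc><field name="id">1</field><field name="path">C:/WINDOWS/SYSTEM32/ABC.DLL</field><field name="extension">dll</field></doc></add>'
curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'
# later: "what do you have with a .dll extension?"
curl 'http://localhost:8983/solr/select?q=extension:dll&rows=1'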
I installed the Solr Web app, tweaked the configuration (including setting the data directory, because no matter what I did with the environment and JNDI variables it kept writing data to the directory from which Tomcat was started!), and then started posting documents to it for indexing. The document creation and POSTs were done with a simple Java "program" (really a script - I could've used just about any language, but we're mostly using Java and I'm trying to de-rust my Java skills, so I figured why not do this with Java too). The index of around 140,000 files took about 15 minutes (I've no idea if that is good or not).
(Renhart suggested an offshoot of this indexing too - namely the creation of a set of OS profiles, so that we can have a service that can be asked things like "What OS(es) does a file with SHA-1 hash XYZ belong to?" - enabling us to profile OSes and remove duplicates from our accessions).
The final step was to use another Java "program" to cludge together the list of files in the accession with a lookup on the Solr index and some file copying. Then it is done - one accession that mirrors a real live file structure and contains real live files, but none of those files are "private" or a problem if they're lost. Even better, because we used more recent files, the accession is now 8GB rather than 2GB, aligning more with what we'd expect to get in the future.
Hooray! Now gotta pack it into disk images and start exploring processing!
Should anyone be interested, the source code is available for download.
Friday, 19 June 2009
Friday Post
Never say "unscalable" and expect no one to try! :-)
Monday, 15 June 2009
CAIRO Content Model: A noob's overview (Part 2) Accessions
All clear now? Didn't think so! (Though the bit about the museum is probably closest.) For the purposes of ingest, I'm not too worried about legal ownership (this must be assumed), so my working definition - and I make no claims that it is a good or accurate one - is:
"an accession is a unit of stuff that arrives at a given time, at a given archive, and needs to be added to that archive"
An accession then is defined by its origin, date of arrival and the collection to which it belongs.
Like IE01, IE02 uses the METS header element to record agent data - who did what, and when, to this record. This will usually be "created by ingest process", "updated by CAIRO tool by user X", etc.
In IE02, the descriptive metadata is kept deliberately Spartan. This is because the concept of the accession remains largely transparent to researchers. While it is important to the archivist to know the source of the collection's parts, the researcher really just needs the collection and the items. So, dmdSec gives us space for an accession identifier (unitId) and title (unitTitle) (using controlled formats - which may or may not be specified yet - I'll find out as I move through the model!) and then a minimal EAD/DC/MODS description. Since we're an archive, I focussed on the EAD, but cross-walking to the others would be possible.
The dmdSec EAD description is quite minimal, listing just origination information (principal creator), a physical description (extent, in MB), a description of the formats and a description of the software/hardware environment used. All of these are given as free-text, manual fields, but I wonder how far we can get offering automated "tips" for these elements.
There is also room for a pointer to some rights statement (amdSec) (as with IE01) and also a pointer to a log file that records activity over this accession (fileSec). The format of that log file remains undefined, but it'll record things like "checked all files for viruses", "identified 15 obsolete files", "transformed obsolete formats to ODT", etc. We'll have to think about how we implement this linkage. The model says use a URN to point at the log file, but I wonder if we could use a URI that points to a canned search of a general logging service - something along the lines of beam.ouls.ox.ac.uk/audit?accessionID=12345. That way we can dynamically generate log reports for each accession. Needs thinking about anyway.
Finally there is, in the structMap, a map to the other "main divisions" of the accession. The model suggests that these are either folders or files (different types are given - like subject folder or email directory), so the accession structMap could be used to reflect the entire accession structure or just the top-level directories. Which is better is unclear, but I suspect we will be adding a manifest to the accession that does list all the files (so the structure can be browsed without getting near the real data objects and, if we do it that way, without having to parse the IE02 too much).
We could put a pointer to the manifest alongside the log file pointer in fileSec.
I guess that is probably no clearer than the aforementioned Wikipedia page, but at least I never used the word "Inaedificatio"!
Wednesday, 10 June 2009
Our Cunning Plan
We've had lots of long discussions here about what BEAM will look like, and back in April (I think) we drew our first thoughts onto our newly arrived whiteboard. The diagram remains, to remind us of where we are going, though I've already got some refinements to make. However, if you've ever doubted our sanity, here is some proof that you were right! :-)
Tuesday, 9 June 2009
Presenting email archives
Monday, 8 June 2009
CAIRO Content Model: A noob's overview (Part 1)
As there is quite a lot to the content model, I'll cover it in parts as I read them. This is to allow my brain to digest each bit rather than try to cram the entire thing in one go. Hopefully the same idea will help anyone trying to read about it here.
So without further ado, I present Intellectual Entity 01: Collection.
Affectionately known as "I-E-O-1" (the O is really a 0 but we don't say zero), this is described in the documentation as the "descriptive overview of the collection". The metadata here is designed to enable a curator to respond to researcher enquiries and freedom of information requests and also to provide the foundation for the rest of the collection's metadata (which is broken down - as we'll see in future posts - into accessions and items).
IE01 relates to both digital and physical (really need a better word - analogue?) components of the collection, but it is important to note that it does not replace the EAD record created by the archivist - though it links to it. This begs the question: are we currently able to link to our EAD records? (I'll need to find out the answer.)
Like all of our entities, the object is specified using METS. From my perspective of creating digital object ingest tools, the fields I'm worrying about are (in no particular order):
- A list of agents, found in the metsHdr element, and recording who did what with this record.
- An embedded EAD and/or DC and/or MODS record used to describe the collection. Nb. this is NOT the archival EAD, but rather a subset of the given schema (EAD/DC/MODS) to record a minimal amount of metadata about this aspect of the entire collection - archive identifier, country code, dates, formats, scope and access.
- A link to a further entity (PR01) describing preservation rights of the collection.
That is a fairly sketchy outline of IE01 and there is a lot more to it than that of course, but that is my first impression. I hope it is useful to someone, and useful to me when I've recycled my paper notes! :-)
Tuesday, 26 May 2009
Geocities being rescued by Archive Team
Thursday, 21 May 2009
Whitepaper from MITH/HRC/Emory born-digital literary mss. project
Monday, 18 May 2009
Digital Repository Workshop at Oxford
Details on the day are available and I've also put my talk on slideshare.
Tuesday, 12 May 2009
Standing on the shoulders of Giants?
Just attended the Repositories and Preservation Programme meeting at Aston, Birmingham. I would really recommend the talk 'The Institutional Perspective - How can institutions most effectively exploit work in the Repositories and Preservation field?', given by Jeff Haywood, University of Edinburgh.
I would like to think this may kickstart a process to find methods by which current projects could more easily use and build on the outputs of previous projects and create a framework to more easily exchange code and ideas.
Jeff's talk was given even more currency as, in the afternoon's Repositories Roadmap session, Rachel Heery gave a presentation (it's on slideshare) launching her just-published Digital Repositories Roadmap Review: towards a vision for research and learning in 2013.
The question however remains: by then will Standing on the shoulders of Giants still be a distant concept?
Monday, 11 May 2009
What is this thing anyway?
Thursday, 7 May 2009
The best introduction to digital preservation ever?
Monday, 27 April 2009
Wahcade Emulator Front-End
Imagine that a collection is a "rom" (the rom being the image of a chip containing, in Wahcade's case, a game, but for us it could be a disk image from a donor's PC). The user picks from a list of roms and then the reading room "arcade" starts an emulator and away you go. Before you know it, the dumb terminal is a replica Mac System 6 desktop complete with donor file system, etc.
Be neat wouldn't it?
Friday, 17 April 2009
Terrabyte Terror
From my early days watching my Dad teach electronics, I've loved the smell of soldering, the look of components and the idea that you can make your own set of LEDs flash just for the fun of it. Thumbing through the Maplin catalogue with a cup of tea was once one of my favourite pastimes. But these days, more and more, I get a sense of dread as I check out the special offers.
Why?
Let me give you an example: 1TB External Drive, £99.
Here is another: 1TB Internal Drive, £89.
You read that right - 1 terabyte of storage for under £100! Doesn't that make you quake? Probably not, but I can't help but wonder how long it will be before we have to accession a 1TB drive. What do we do with it? Do we even understand that amount of detritus, accumulated over - well, how long? A lifetime? A couple of evenings with iTunes? We don't know how long it'll take the average person to fill up a 1TB drive. Do we have the capacity to store 1TB of data, and even if we do, how sustainable is that?
You could argue that since storage like this is so cheap, we can rest assured that our own storage costs will be lower too, so we will always keep up with the growth of consumer storage. It is a fair point, but how many preservation-grade storage devices can manage 10p a GB? None, I imagine, and for good reason. There is a whole lot more to a preservation system than a disk and a plastic case - it takes more than 1TB to keep 1TB safe, for a start! (Mind you, I couldn't help but smile at Maplin's promise of "Peace of mind with 5 year limited warranty".)
If we cannot keep up with the storage then, what do we do? A brute-force method would be to compress the data, but then bit rot becomes a much more worrying issue (and it is pretty worrying already). We could look for duplicates - how many MP3 collections will include the same songs, for instance, and should we keep them all (if any)? What if it is the same song with a different encoding/bitrate/whatever? What about copies of OSs - all those i386 directories? (Though arguably an external drive will not contain an OS, so we won't save space there.)
We probably don't need or want to keep all of those 1000GBs, but how will we identify what to preserve? Susan and Renhart came up with some answers to this with their brilliant Paradigm project - which I'll paraphrase as "encourage the creators to curate their own data" - and I'm hopeful that will happen, but what if it doesn't? Will we see "personal data curation" and "managing information overload" added to the National Curriculum anytime soon? I hope so!
All of which finally gives me reason to stop worrying about cheap terabytes! Data is going to keep growing and someone is going to have to help manage all that stuff. I guess that is where we fit in.
Monday, 6 April 2009
Validating normalised dates in XML
Friday, 3 April 2009
Draft data dictionary and schema for document significant properties
The designers, from the California Digital Library and Harvard's University Library, are seeking comments from the digital preservation community. Semantic units are: PageCount, WordCount, CharacterCount, ParagraphCount, Line Count, TableCount, GraphicsCount, Language, Fonts, FontName, IsEmbedded, Features. You can see the current schema in full at http://www.fcla.edu/dls/md/docmd.xsd
This looks like a useful addition to preservation metadata, provided tool support for extracting the information and populating metadata records follows. I think the list of values for 'Features' - isTagged, hasLayers, hasTransparancy, hasOutline, hasThumbnails, hasAttachments, hasForms, hasAnnotations - may need extending (hasFootnotes, hasEndnotes?), and it would be good to see some definitions and examples of the existing values.
I wonder if we need a different data dictionary and schema for slideshows? This one might be adequate with some additions to cover things like animations, timings, etc. Seeing this data dictionary also reminds me that we need to look at where the Planets folk are up to on their significant properties work (XCDL/XCEL).
Thursday, 2 April 2009
Digital preservation for individuals and small organisations
Tuesday, 17 March 2009
Shared marginalia for any webpage
Tuesday, 24 February 2009
Shoot those files!
Odds and ends from day one of the digital lives conference
The digital lives conference provided a space to digest some of the findings of the AHRC-funded digital lives project, and also to bring together other perspectives on the topic of personal digital archives. At the proposal stage, the conference was scheduled to last just a day; in the event one day came to be three, which demonstrates how much there is to say on the subject.
Day one was titled 'Digital Lifelines: Practicalities, Professionalities and Potentialities'. This day was intended mostly for institutions that might archive digital lives for research purposes. Cathy Marshall of Microsoft Research gave the opening talk, which explored some personal digital archiving myths on the basis of her experiences interviewing real-life users about their management of personal digital information.
Next came a series of four short talks on 'aspects of digital curation'.
- Cal Lee, of UNC Chapel Hill, emphasised the need for combining professional skills in order to undertake digital curation successfully. Archives and libraries need to have the right combination of skills to be trusted to do this work.
- Naomi Nelson of MARBL, Emory University, told a tale of two donors. The first donor being the entity that gives/sells an archive to a library and the second being the academic researcher. Libraries need to have a dialogue with donors of the first type about what a digital archive might contain; this goes beyond the 'files' that they readily conceive as components of the archive, and includes several kinds of 'hidden' data that may be unknown to them. The second donor, 'the researcher', becomes a donor by virtue of the information that the research library can collect about their use of an archive. Naomi raised interesting questions about how we might be able to collect this kind of data and make it available to other researchers, perhaps at a time of the original researcher's choosing.
- Michael Olson of Stanford University Libraries spoke of their digital collections and programmes of work. Some mention of work on the fundamentals - the digital library architecture (equivalent to our developing Digital Asset Management System - DAMS - which will provide us with resilient storage, object management and tools and services that can be shared with other library applications). Their digital collections include a software collection of some 5000 titles, containing games and other software. I think that sparked some interest from many in the audience!
- Ludmilla Pollock, Cold Spring Harbor Laboratory, told us about an extensive oral history programme giving rise to much digital data requiring preservation. The collection contains videos of the scientists talking about their memories and has a dedicated interface.
Inevitably, questions of value were a feature of the session. The dealers suggest that archives and libraries are not willing to pay for born-digital archives yet; perhaps this stems from concerns about uniqueness and authenticity, and the lack of facilities to preserve, curate and provide access. It's not like there's actually much on the market at the moment, so perhaps it's a matter of supply as much as demand? Comparisons with 'traditional' materials were also made using Larkin's magic/meaningful values:
The 'meaningful' aspects of digital archives are apparent enough, but what of the 'magical'? Most, if not all, contributors to the discussion saw 'artifactual' value in digital media that had an obvious personal connection, whether Barack Obama's Blackberry or J.K. Rowling's laptop. What wasn't discussed so much was the potential magical value of seeing a digital manuscript being rendered in its original environment. I find that quite magical, myself. I think more people will come to see it this way in time.
"All literary manuscripts have two kinds of value: what might be called the magical value and the meaningful value. The magical value is the older and more universal: this is the paper [the writer] wrote on, these are the words as he wrote them, emerging for the first time in this particular magical combination. We may feel inclined to be patronising about this Shelley-plain, Thomas-coloured factor, but it is a potent element in all collecting, and I doubt if any librarian can be a successful manuscript collector unless he responds to it to some extent. The meaningful value is of much more recent origin, and is the degree to which a manuscript helps to enlarge our knowledge and understanding of a writer’s life and work. A manuscript can show the cancellations, the substitutions, the shifting towards the ultimate form and the final meaning. A notebook, simply by being a fixed sequence of pages, can supply evidence of chronology. Unpublished work, unfinished work, even notes towards unwritten work all contribute to our knowledge of a writer’s intentions; his letters and diaries add to what we know of his life and the circumstances in which he wrote.”
Philip Larkin 'A Neglected Responsibility: Contemporary Literary Manuscripts', Encounter, July 1979, pp. 33-41.
Delegates were then able to visit the digital scriptorium and audiovisual studio at the British Library.
After lunch, we resumed with a view of the 'Digital Economy and Philosophy' from Annamaria Carusi of the Oxford e-Research Centre. Some interesting thoughts about trust and technology, referring back to Plato's Phaedrus and the misgivings that an oral culture had about writing. New technologies can be disruptive and it takes time for them to be generally accepted and trusted.
Next, four talks under the theme of digital preservation.
- First an overview of the history of personal films from Luke McKernan, a curator at the British Library. This included changes in use and physical format, up to the current rise of online video populating YouTube, and its even more prolific Chinese equivalents. Luke also talked about 'lifecasting', pointing to JenniCam (now a thing of the past, apparently), and also to folk who go so far as to install movement sensors and videos throughout their homes. Yikes!
- We also heard from the British Library's digital preservation team about their work on risk assessment for the Library's digital collections (if memory serves, about 3% of the CDs they sampled in a recent survey had problems). Their current focus is getting material off vulnerable media and into the Library's preservation system; this is also a key aim in our first phase of futureArch. Also mention of the Planets and LIFE projects. Between project and permanent posts, the BL have some 14 people working on digital preservation. If you count those working in web archiving, audiovisual collections, digitisation, born-digital manuscripts, digital legal deposit and similar areas, who also have knowledge of this field, it's probably rather more.
- William Prentice offered an enjoyable presentation on audio archiving, which had some similar features to Luke's talk on film. It always strikes me that audiovisual archiving is very similar to digital archiving in many respects, especially when there's a need to do digital archaeology that involves older hardware and software that itself requires management.
- Juan-José Boté of the University of Barcelona spoke to us about a number of projects he had been working on. These were very definitely hybrid archives and interesting for that reason.
Next, I chaired a panel of 'Practical Experiences'. Being naturally oriented toward the practical, there was lots for me here.
- John Blythe, University of North Carolina, spoke about the Southern Historical Collection at the Wilson Library, including the processes they are using for digital collections. Interestingly, they have use of a digital accessioning tool created by their neighbours at Duke University.
- Erika Farr, Emory University, talked about the digital element of Salman Rushdie's papers. Interesting to note that there was overlap of data between PCs, where the creator has migrated material from one device to another; this is something we've found in digital materials we've processed too. I also found Rushdie's filenaming and foldering conventions curious. When working with personal archives, you come to know the ways people have of doing things. This applies equally to the digital domain - you come to learn the creator's style of working with the technology.
- Gabby Redwine of the Harry Ransom Center, University of Texas at Austin gave a good talk about the HRC's experiences so far. HRC have made some of their collections accessible in the reading room and in exhibition spaces, and are doing some creative things to learn what they can from the process. Like us, they are opting for the locked down laptop approach as an interim means of researcher access to born-digital material.
- William Snow of Stanford University Libraries spoke to us about SALT, or the Self Archiving Legacy Toolkit. This does some very cool things using semantic technologies, though we would need to look at technologies that can be implemented locally (much of SALT functionality is currently achieved using third-party web services). Stanford are looking to harness creators' knowledge of their own lives, relationships, and stuff, to add value to their personal archives using SALT. I think we might use it slightly differently, with curators (perhaps mediating creator use, or just processing?) and researchers being the most likely users. I really like the richness in the faceted browser (they are currently using flamenco) - some possibilities for interfaces here. Their use of Freebase for authority control was also interesting; at the Bod, we use The National Register of Archives (NRA) for this and would be reluctant to change all our legacy finding aids and place our trust in such a new service! If the NRA could add some freebase-like functionality, that would be nice. Some other clever stuff too, like term extraction and relationship graphs.
The day concluded with a little discussion, mainly about where digital forensics and legal discovery tools fit into digital archiving. My feeling is that they are useful for capture and exploration. Less so for the work needed around long-term preservation and access.
Thursday, 12 February 2009
MinivMac
There are pros and cons to both and the best thing will be to do both. Some readers will want, for example, to experience the pain of using Word for DOS, others will care only for the content of the document and want to read it with their new personal computer (we have to spell it out now since the great PC/Mac debate - folks, a Mac IS a PC!).
Why am I saying all this? Mostly because I sit opposite a wall of shelves that will one day form a museum of old kit, and those old machines have kept the subject on my mind for a bit. I have also been experimenting with virtual machines (for reasons beyond emulation) and emulators. Finally, I'm saying all this because Susan tells me this blog is the place to keep and share things that might be useful, and so I wanted to log that Apple make their old software available, including the OSes, and that MinivMac and this Mac-On-A-Stick project look like they may one day be useful to us. (And if you're a Mac user, check out System 7 via Mac-On-A-Stick - it really isn't much different! :-))