It isn't really possible to use a real accession for this purpose, mostly due to the confidentiality of some of our content. But I did want the accession to be as genuine as possible and here is how I did it. Any ideas for alternatives would be great!
The way I saw it, I needed three things to create the accession:
- A list of files and folders that formed a real accession
- A set of data that could be used - real documents, images, sound files, system files, etc.
- Some way of tying these together to create an accession modelled on a real one but containing public data
Next question was where to get the data. My first thought was to use public (open licensed) content from the Web - obtaining images when required through Flickr, getting documents via Scribd, etc. This is still a good approach for certain accessions. However, looking at my file list I quickly realised I wasn't just dealing with nice, obvious "documents". The accession contained a working copy of Windows 95 for example, jammed full of DLLs and EXEs. It also contained files pulled from old PCW disks by the owner, with no extension, applications from older systems, and all manner of oddities - "~$oto ", "~~S", "!.bk!" are just some examples.
It occurred to me that I needed a more diverse source of files - most likely a real live system that could meet my request for a DLL while not revealing much about the original file. Where would I find this source? My own PC of course!
In theory my PC is dual-boot, running Ubuntu and Windows XP. The Windows XP partition is rarely used (I've nothing against XP, it just isn't so good a software development environment as Linux), but it struck me it'd make an excellent source of files, even if it was a version of Windows some way down the tracks from 95. By pulling files from my Windows disk (mounted by Ubuntu) I could, hopefully, create a more representative accession with a few more problems to solve than just "document" content.
(I also thought I could try creating a file system with a representative set of files to choose from - dlls from Windows 95 disks, etc. - but that would mean some manual collation of said files. This may be where I go next!).
So, with the first two items on that list covered, what I needed next was a way to tie the file list to the data. I decided to use the file extension for this. For example, if the file list contains:
C:\WINDOWS\SYSTEM32\ABC.DLL
I wanted to grab any file with a ".DLL" extension from my data source (the XP disk). Any random file would do, rather than one that matched the original, because the randomness is likely to cause problems when it comes to processing this artificial accession, and problems are what we really need to test something.
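To make the matching rule concrete, here is a minimal sketch of it. It is in Python rather than the Java I actually used, and it keeps the "index" in memory instead of in Solr; the function names and the /mnt/xp mount point are made up for illustration.

```python
import random
from collections import defaultdict
from pathlib import PureWindowsPath


def extension_of(path):
    """Lower-cased extension without the dot; '' when there is none.

    PureWindowsPath accepts both '\\' and '/' separators, so this works for
    the Windows paths in the file list and the Linux paths of the mounted disk.
    """
    return PureWindowsPath(path).suffix.lower().lstrip(".")


def build_extension_index(source_paths):
    """Group the data-source paths by extension."""
    index = defaultdict(list)
    for path in source_paths:
        index[extension_of(path)].append(path)
    return index


def pick_substitute(listed_path, index, rng=random):
    """Return a random source file sharing the listed file's extension, or None."""
    candidates = index.get(extension_of(listed_path))
    return rng.choice(candidates) if candidates else None


if __name__ == "__main__":
    # Hypothetical source paths on the mounted XP partition.
    index = build_extension_index([
        "/mnt/xp/WINDOWS/system32/kernel32.dll",
        "/mnt/xp/WINDOWS/system32/user32.dll",
    ])
    print(pick_substitute(r"C:\WINDOWS\SYSTEM32\ABC.DLL", index))
```

Files with no extension all end up under the empty string, so they are swapped for other extensionless files.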
This suggested I needed a way to ask my file system "What do you have that has '.dll' at the end of the path?". There were lots of ways to do this - and here is where Linux shines. We have 'find', 'locate', 'which', etc. on the command line to discover files. There is also 'tracker' that I could have set indexing the XP filesystem. In the end I opted for Solr.
Solr provides a very quick and easy way to build an index of just about anything - it uses Lucene behind the scenes. (I like the way that almost rhymes!) If you're unfamiliar with either, then find out all about them quickly! In short, you tell it which fields you want to index (and how you want them indexed) and create XML documents that contain those fields. POST these to the Solr update service and it indexes them there and then.
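For anyone who hasn't seen one, a Solr XML update message is just an "add" element wrapping one "doc" per item, with a "field" element for each indexed value. The sketch below (Python, not my Java) builds such a message and POSTs it. The URL and the field names "id", "path" and "extension" are assumptions; they have to match whatever your Solr install and schema actually use.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed location of the update handler; adjust for your own install.
SOLR_UPDATE_URL = "http://localhost:8080/solr/update"


def build_add_message(docs):
    """Serialise a list of {field name: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(body, url=SOLR_UPDATE_URL):
    """POST an XML update message and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "text/xml; charset=utf-8"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    message = build_add_message(
        [{"id": "1", "path": "/mnt/xp/WINDOWS/system32/kernel32.dll", "extension": "dll"}]
    )
    print(message.decode("utf-8"))
    # post_to_solr(message)       # needs a running Solr
    # post_to_solr(b"<commit/>")  # make the added documents searchable
```

Note that added documents normally only become searchable after a commit, hence the "commit" message at the end (or an autocommit setting in the Solr configuration).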
I installed the Solr Web app., tweaked the configuration (including setting the data directory because no matter what I did with the environment and JNDI variables, it kept writing data to the directory from which Tomcat was started!), and then started posting documents for indexing to it. The document creation and POSTs were done with a simple Java "program" (really a script, and I could've used just about any language, but we're mostly using Java and I'm trying to de-rust my Java skills, so I figured why not do this with Java too). Indexing around 140,000 files took about 15 minutes (I've no idea if that is good or not).
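The document-creation half of that script amounts to walking the mounted disk and emitting one record per file. A rough Python equivalent (again, not the Java I wrote; the field names and the choice of the full path as the unique id are assumptions) looks like this:

```python
import os


def iter_index_docs(root):
    """Yield one dict of index fields per file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Lower-cased extension without the dot; '' when there is none.
            extension = os.path.splitext(name)[1].lower().lstrip(".")
            yield {"id": full_path, "path": full_path, "extension": extension}


if __name__ == "__main__":
    # "/mnt/xp" is a hypothetical mount point for the Windows partition.
    for doc in iter_index_docs("/mnt/xp"):
        print(doc)
```

Posting the records in batches rather than one request per file would probably cut the indexing time, though I haven't measured it.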
(Renhart suggested an offshoot of this indexing too - namely the creation of a set of OS profiles, so that we can have a service that can be asked things like "What OS(es) does a file with SHA-1 hash XYZ belong to?" - enabling us to profile OSes and remove duplicates from our accessions).
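We haven't built this, but the core of such a profile service could be as small as a map from hash to the set of OSes a file was seen in. A hypothetical sketch in Python (the class and method names are invented here):

```python
import hashlib
from collections import defaultdict


def sha1_of_file(path, chunk_size=1 << 20):
    """Hex SHA-1 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class OsProfiles:
    """Remembers which OS profiles each file hash has been seen in."""

    def __init__(self):
        self._by_hash = defaultdict(set)

    def add(self, os_name, sha1_hex):
        self._by_hash[sha1_hex].add(os_name)

    def lookup(self, sha1_hex):
        """Answer 'what OS(es) does a file with this hash belong to?'"""
        return set(self._by_hash.get(sha1_hex, ()))
```

A file in an accession whose hash turns up in lookup() is a candidate for removal as a stock OS file.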
The final step was to use another Java "program" to kludge together the list of files in the accession, a lookup on the Solr index and some file copying. Then it is done - one accession that mirrors a real live file structure and contains real live files, but none of those files are "private" or a problem if they're lost. Even better, because we used more recent files, the accession is now 8GB rather than 2GB, aligning more with what we'd expect to get in the future.
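Roughly, that last step does the following. This is a Python sketch rather than the Java itself; the lookup is passed in as a function (standing in for the Solr query), and skipping and reporting files with no match is just one way of handling them.

```python
import shutil
from pathlib import Path, PureWindowsPath


def local_target(listed_path, dest_root):
    """Map a listed Windows path to a path under dest_root, dropping the drive."""
    windows_path = PureWindowsPath(listed_path)
    parts = windows_path.parts[1:] if windows_path.anchor else windows_path.parts
    return Path(dest_root).joinpath(*parts)


def build_accession(listed_paths, find_substitute, dest_root):
    """Copy a substitute file into place for every listed path.

    find_substitute(listed_path) returns the path of a public source file, or
    None when nothing suitable exists. Returns the listed paths left unmatched.
    """
    unmatched = []
    for listed in listed_paths:
        source = find_substitute(listed)
        if source is None:
            unmatched.append(listed)
            continue
        target = local_target(listed, dest_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
    return unmatched
```

So C:\WINDOWS\SYSTEM32\ABC.DLL ends up as WINDOWS/SYSTEM32/ABC.DLL under the destination directory, keeping its original name but holding the bytes of whichever DLL the lookup returned.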
Hooray! Now gotta pack it into disk images and start exploring processing!
Should anyone be interested, the source code is available for download.
2 comments:
Very interesting process! I have a few test folders representing different disk scenarios that I will use for testing; although using disk images is smarter.
I agree with Renhart that OS profiles would be good to build. I have considered doing this for some time but it never made it up the priority list.
One potential problem with using the SHA-1 indicator is that these files change update-to-update, so a single file may have several different hashes depending on the update. File path and naming regexes would probably be better for broad profiling, with hashes for more fine-grained analysis when needed.
I often use hash (MD5 in my case) sorting (via perl scripts) to help identify duplicate material between disks within an accession. Normally OS files have been purged before this happens.
Hey Seth! Thanks for the comment and for reading that long post! :-)
I like the idea of profiling by path in addition to some kind of hash value. More to think about, but pretty straightforward with Solr.