Monday 24 January 2011

Migrating documents

We have a collection that consists of several thousand documents in various archaic (well, 1980s/90s) word processor formats including Ami Professional and (its predecessor) Samna Word. Perhaps of interest to folks intent on discussing the implications of migration for authenticity of the items, some of those Ami Pro files contain the (automatically generated) line:

"File ... was converted from Samna Word on ..."

So which is the original now?

Migrating these file formats has not been straightforward. This is because it has proved remarkably tricky to ascertain a key piece of information - the file format of the original. This is not the fault of the file format tools (I'm using FITS, which itself wraps the usual suspects, JHOVE & DROID), but of the broader problem that the files have multiple formats. Ami Pro files are correctly identified as "text/plain". The file command reports them as "ASCII English text". Some (not all) have a file extension ".sam", which usually means Ami Pro, but the ".sam" files are not all the same format.

Yet this small piece of metadata is essential because without it, it is very difficult to identify the correct tool to perform the migration. For example, if I run my usual text to PDF tool - which is primed to leap into action on arrival of a "text/plain" document - the resultant PDF shows the internals of an Ami Pro file, not the neatly laid out document the creator saw. We have a further piece of information available too, and curiously it is the most useful. This is the "Category" from FTK, which correctly sorts the Ami Pros from the Samna Words.

This leads to a complex migration machine that needs to be capable of collating file format information from disparate sources and making sense of the differences, all within the context of the collection itself. If I know that creator X used Ami Pro a lot, then I can guess that "text/plain" & ".sam" means an Ami Pro document, for example. This approach is not without problems, however, not least that it requires a lot of manual input into what should ultimately be an automated and unwatched process. (One day, when it works better, I'll try to share this code!)
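
To make the idea concrete, here is a minimal sketch of that kind of collating rule. It is not the actual migration code: the function name, the creator table and the exact category strings ("Ami Pro", "Samna Word") are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical per-collection knowledge: the word processor each creator favoured.
CREATOR_DEFAULTS = {"creator-x": "Ami Pro"}

# Assumed category labels; the real strings reported by FTK may differ.
TRUSTED_CATEGORIES = {"Ami Pro", "Samna Word"}


def guess_format(mime: str, extension: str,
                 ftk_category: Optional[str] = None,
                 creator: Optional[str] = None) -> Optional[str]:
    """Return a best-guess format name, or None to flag the file for a human."""
    # The FTK category was the most useful signal, so it wins when present.
    if ftk_category in TRUSTED_CATEGORIES:
        return ftk_category
    # Otherwise fall back on collection context: "text/plain" plus ".sam"
    # is only a guess, and only if we know something about the creator.
    ext = extension.lower().lstrip(".")
    if mime == "text/plain" and ext == "sam":
        return CREATOR_DEFAULTS.get(creator)
    return None
```

Returning None rather than a default keeps the uncertain files visible, which matters while so much of the process still needs manual input.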

Sometimes you get lucky, and the tool to do the migration offers an "auto" mode for input. For this collection I am using a trial copy of FileMerlin to perform the migration and evaluate it. It actually works better if you let it guess the input format rather than attempt to tell it. Other tools, such as JODConverter, like to know the input format and here you have a similar problem - you need to know what JODConverter is happy to accept rather than the real format - for example, send it a file with a content type of "application/rtf" and it responds with an internal server error. Send the same file with a content type of "application/msword" and the PDF is generated and returned to you.
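
One way to keep that quirk in one place is a small lookup from the real content type to the one the converter is happy with. This is only a sketch: the table holds the single case described above, and the names are hypothetical rather than part of JODConverter.

```python
# Real content type -> content type the converter accepts (assumed table,
# seeded with the one case observed so far).
CONVERTER_ALIASES = {
    "application/rtf": "application/msword",
}


def content_type_to_send(real_type: str) -> str:
    """Return the content type to declare when submitting a file."""
    # Unknown types pass through unchanged.
    return CONVERTER_ALIASES.get(real_type, real_type)
```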

Then there is a final problem - sometimes you have to take several steps to get the file into shape. For this collection, FileMerlin should be able to migrate Ami Pro and Samna Word into PDFs. In practice, it crashes on a very small subset of the documents. To overcome this, I migrate these same documents to "rich text format" (which FileMerlin seems OK with) and then to PDF with JODConverter - sending the aforementioned "application/msword" content type. I had a similar problem with WordPerfect files, where using JOD directly changed the formatting of the original files. Using libwpd to create ODTs and then converting them to PDFs generated more accurate PDFs. (This is strange behaviour since OpenOffice itself uses libwpd!) Every time I hit a new (old) file format, the process of identifying it and generating a heuristic for handling it starts over.
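
The "try the direct route, then fall back to a multi-step one" pattern can be sketched as below. The converters are passed in as plain functions, so nothing here assumes how FileMerlin, JODConverter or libwpd are actually invoked; all names are hypothetical.

```python
from typing import Callable, List, Sequence

# A step takes the document bytes and returns the converted bytes.
Step = Callable[[bytes], bytes]


def migrate(data: bytes, routes: Sequence[Sequence[Step]]) -> bytes:
    """Try each route in order; a route is a chain of conversion steps."""
    errors: List[Exception] = []
    for route in routes:
        try:
            result = data
            for step in route:
                result = step(result)
            return result
        except Exception as exc:
            # A crash in any step abandons this route and tries the next one.
            errors.append(exc)
    raise RuntimeError(f"all {len(routes)} routes failed: {errors}")
```

For the Ami Pro case the routes would be something like `[[direct_to_pdf], [to_rtf, rtf_to_pdf]]`, with the direct route listed first so that the detour is only taken for the handful of documents that crash.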

I'm starting to think I need a neural network! That really would be putting the AI in OAIS!