Tuesday 13 November 2012

Transcribe at the arcHIVE

I do worry from time to time that textual analogue records will come to suffer from their lack of searchability when compared with their born-digital peers. For those records that have been digitised, crowd-sourced transcription could be an answer. A rather neat example of just that is the arcHIVE platform from the National Archives of Australia. arcHIVE is a pilot from NAA's labs which allows anyone to contribute to the transcription of records. To get started they have chosen a selection of records from their Brisbane office which are 'known to be popular'. Not too many of them just yet, but at this stage I guess they're just trying to prove the concept works. All the items have been OCR-ed, and users can choose to improve or overwrite the results of the OCR process. There are lots of nice features here, including the ability to choose documents by difficulty rating (easy, medium or hard) or by type (a description of the series, by the looks of it). The competitive may be inspired by the presence of a leader board, while the more collaborative may appreciate the ability to do as much as you can and leave the transcription for someone else to finish up later. You can register for access to some features, but you don't have to. Very nice.

Friday 19 October 2012

Atlas of digital damages

An Atlas of Digital Damages has been created on Flickr, which will provide a handy resource for illustrating where digital preservation has failed. Perhaps 'failed' is a little strong; in some cases the imperfection may be an acceptable trade-off. A nice, and useful, idea. Contribute here.

Saturday 13 October 2012

DayOfDigitalArchives 2012

Yesterday was Day of Digital Archives 2012! (And yes, I'm a little late posting...)

This 'Day' was initiated last year to encourage those working with digital archives to use social media to raise awareness of their work: "By collectively documenting what we do, we will be answering questions like: What are digital archives? Who uses them? How are they created and managed? Why are they important?". So in that spirit, here is a whizz through my week.

Coincidentally, not only does this week include the Day of Digital Archives, but it's also the week that the Digital Preservation Coalition (or DPC) celebrated its 10th birthday. On Monday afternoon I went to the reception at the House of Lords to celebrate that landmark anniversary. A lovely event, during which the shortlist for the three digital preservation awards was announced. It's great to see three award categories this time around, including one that takes a longer view: 'the most outstanding contribution to digital preservation in the last decade'. That's quite an accolade.

On the train journey home from the awards I found some quiet time to review a guidance document on the subject of acquiring born-digital materials. There is something about being on a train that puts my brain in the right mode for this kind of work. Nearing its final form, this guidance is the result of a collaboration between colleagues from a handful of archive repositories. The document will be out for further review before too long, and if we've been successful in our work it should prove helpful to creators, donors, dealers and repositories.

I spent part of Tuesday reviewing oral history guidance drafted by a colleague to support the efforts of Oxford Medical Alumni in recording interviews with significant figures in the world of Oxford medicine. Oral histories come to us in both analogue and digital formats these days, and we try to digitise the former as and when we can. The guidance has been developed in the context of our Saving Oxford Medicine initiative to capture important sources for the recent history of medicine in Oxford. One of the core activities of this initiative is survey work, and it is notable that many of the archives surveyed include plenty of digital material. Web archiving is another element of the 'capturing' work that the Saving Oxford Medicine team has been doing, and you can see what has been archived to date via Archive-It, our web archiving service provider.

Much of Wednesday morning was given over to a meeting of our building committee, which had very little to do with digital archives! In the afternoon, however, we were pleased to welcome visitors from MIT - Nancy McGovern and Kari Smith. I find visits like these are one of the most important ways of sharing information, experiences and know-how, and as always I got a lot out of it. I hope Nancy and Kari did too! That same afternoon, colleagues returned from a trip to London to collect another tranche of a personal archive. I'm not sure if this instalment contains much in the way of digital material, but previous ones have included hundreds of floppies and optical media, some zip discs and two hard disks. Also arriving on Wednesday, some digital Library records courtesy of our newly retired Executive Secretary; these supplement materials uploaded to BEAM (our digital archives repository) last week.

On Thursday, I found some time to work with developer Carl Wilson on our SPRUCE-funded project. Becky Nielsen (our recent trainee, now studying at Glasgow) kicked off this short project with Carl, following on from her collaboration with Peter May at a SPRUCE mashup in Glasgow. I'm picking up some of the latter stages of testing and feedback work now that Becky's started her studies. The development process has been an agile one, with lots of chat and testing. I've found this very productive - it's motivating to see things evolving, and to be able to provide feedback early and often. For now you can see what's going on at GitHub here, but this link will likely change once we settle on a name that's more useful than 'spruce-beam' (doesn't tell you much, does it?! Something to do with trees...). One of the primary aims of this tool is to facilitate collection analysis, so we know better what our holdings are in terms of format and content. We expect that it will be useful to others, and more information on it will be available soon.

Friday was more SPRUCE work with Carl, among other things. There were also a few meetings today - one around funding and service models for digital archiving, and a meeting of the Bodleian's eLegal Deposit Group (where my special interest is web archiving). The curious can read more about e-legal deposit at the DCMS website. One fun thing that came out of the day was that the Saving Oxford Medicine team decided to participate in a Women in Science Wikipedia editathon. This will be hosted by the Radcliffe Science Library on 26 October as part of a series of 'Engage' events on social media organised by the Bodleian and the University's Computing Services. It's fascinating to contemplate how the range and content of Wikipedia articles change over time, something a web archive would help us trace, perhaps.

For more on working with digital archives, go take a look at the great posts at the Day of Digital Archives blog!

Friday 8 June 2012

Sprucing up the TikaFileIdentifier

As it's International Archives Day tomorrow, I thought it would be nice to quickly share some news of a project we are working on, which should help us (and others!) to carry out digital preservation work a little bit more efficiently.

Following the SPRUCE mashup I attended in April, we are very pleased to be one of the organizations granted a SPRUCE Project funding award, which will allow us to 'spruce' up the TikaFileIdentifier tool. (Paul has written more about these funding awards on the OPF site.)

TikaFileIdentifier is the tool which was developed at the mashup to address a problem several of us were having: extracting metadata from batches of files, in our case within ISO images. Due to the nature of the mashup event the tool is still a bit rough around the edges, and this funding will allow us to improve on it. We aim to create a user interface and a simpler install process, and to carry out performance improvements. Plus, if resources allow, we hope to scope some further functionality improvements.

This is really great news: with the improvements that this funding allows us to make, TikaFileIdentifier will provide us with better metadata for our digital files, far more efficiently than our current system of manually checking each file in a disk image. Hopefully the simpler user interface and other improvements mean that other repositories will want to make use of it as well; I certainly think it will be very useful!

Friday 20 April 2012

SPRUCE Mashup: 16th-18th April 2012


Earlier this week I attended a 3-day mashup event in Glasgow, organised as part of the SPRUCE project. SPRUCE aims to enable Higher Education Institutions to address preservation gaps and to articulate the business case for digital preservation, and the mashup serves as a way to bring practitioners and developers together to work on these problems. Practitioners took along a collection which they were having issues with, and were paired off with a developer who could work on a tool to provide a solution.

Day 1
After some short presentations on the purpose of SPRUCE and the aims of the mashup, the practitioners presented some lightning talks on our collections and problems. These included dealing with email attachments, preserving content off Facebook, software emulation, black areas in scanned images, and identifying file formats with incorrect extensions, amongst others. I took along some disk images, as we find it very time-consuming to find out date ranges, file types and content of the files in the disk image, and we wanted a more efficient way to get this metadata. More information on the collections and issues presented can be found at the wiki.

After a short break for coffee (and excellent cakes and biscuits) we were sorted into small groups of collection owners and developers to discuss our issues in more detail. In my group this led to conversations about natural language processing, and the possibility of using predefined subjects to identify files as being about a particular topic, which we thought could be really helpful, but somewhat impossible to create in a couple of days! We were then allocated our developers. As a few of us had file identification problems, we were assigned to the same developer, Peter May from the BL. The day ended with a short presentation from William Kilbride on the value of digital collections and Neil Beagrie's benefits framework.

Day 2
The developers were packed off to another room to work on coding, while we collection owners started to look into the business case for digital preservation. We used Beagrie’s framework to consider the three dimensions of benefits (direct or indirect, near- or long-term, and internal or external), as they apply to our institutions. When we reported back, it was interesting to see how different organisations benefit in different ways. We also looked at various stakeholders and how important or influential they are to digital preservation. Write-ups of these sessions are also available at the wiki.
 
The developers came back at several points throughout the day to share their progress with us, and by lunchtime the first solution had been found! The first steps towards solving our problem were being made: Peter had found a program, Apache Tika, which can parse a file and extract metadata (it can also identify the content type of files with incorrect extensions), and had written a script so that it could work through a directory of files and output the information to a CSV spreadsheet. This was a really promising start, especially given the amount of metadata that could potentially be extracted (provided it exists within the file), and the ability to identify file types with incorrect extensions.
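This isn't Peter's script, but a minimal sketch (in Python) of that kind of Tika wrapper, assuming a local copy of Apache Tika's command-line jar (tika-app.jar); the directory and file names here are purely illustrative:

import csv
import subprocess
from pathlib import Path

TIKA_JAR = "tika-app.jar"   # assumed: a local copy of the Apache Tika CLI jar

def tika_metadata(path):
    """Run Tika in metadata-only mode (-m) on one file and return its key/value pairs."""
    output = subprocess.run(
        ["java", "-jar", TIKA_JAR, "-m", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    metadata = {}
    for line in output.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            metadata[key] = value
    return metadata

def directory_to_csv(root, csv_path):
    """Walk a directory tree and write one CSV row of Tika metadata per file."""
    rows = [{"file": str(p), **tika_metadata(p)}
            for p in Path(root).rglob("*") if p.is_file()]
    fieldnames = ["file"] + sorted({k for row in rows for k in row if k != "file"})
    with open(csv_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)

directory_to_csv("extracted_files", "tika_metadata.csv")

Tika's type detection looks at a file's contents as well as its name, which is what helps with the incorrectly-labelled files mentioned above.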

Day 3
We had another catch-up with the developers on their overnight progress. Peter had written a script that took the information from the CSV file and summarised it into one row, so that it would fit into the spreadsheets we use at BEAM. Unfortunately, mounting the ISO image so that it could be checked with Apache Tika was slightly more complicated than anticipated, so our disk images couldn't be checked this way without further work.
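Again purely by way of illustration (not Peter's actual code), here is a sketch of that summarising step; the 'Content-Type' and 'Creation-Date' column names are assumptions about what Tika reported:

import csv
from collections import Counter

def summarise(csv_path):
    """Collapse the per-file metadata CSV into one summary row: file count, format breakdown and date range."""
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    formats = Counter(row.get("Content-Type", "unknown") for row in rows)
    dates = sorted(row["Creation-Date"] for row in rows if row.get("Creation-Date"))
    return {
        "file count": len(rows),
        "formats": "; ".join(f"{fmt} ({n})" for fmt, n in formats.most_common()),
        "earliest": dates[0] if dates else "",
        "latest": dates[-1] if dates else "",
    }

print(summarise("tika_metadata.csv"))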

While the developers set about finalizing their solutions, we continued to work on the business case, doing a skills gap analysis to consider whether our institutions had the skills and resources to carry out digital preservation. Reporting back, we had a very interesting discussion on skills gaps within the broader archives sector, and the need to provide digital preservation training to students as well as existing professionals. We then had to prepare an ‘elevator pitch’ for those occasions when we find ourselves in a lift with senior management. This neatly brought together all the things we had discussed, as we had to explain the specific benefits of digital preservation to our institution, and our goals, in about a minute.

To wrap up, the developers presented their solutions, which solved many of the problems we had arrived with. A last-minute breakthrough in mounting ISO images using WinCDEmu and running scripts on them meant that we are now able to use the Tika script on our disk images. However, because we were so short on time, there are still some small problems that need addressing. I'm really happy with our solution, and I was very impressed by all the developers and how much they were able to get done in such a short space of time.


I felt that this event was a very useful way to get thinking about the business case for what we do, and to see what other people within the sector are doing and what problems they are facing. It was also really helpful, as a non-techie, to talk with developers and get an idea of what tools it is possible to build (and to get them built!). I would definitely recommend this type of event – in fact, I’d love to go along again if I get the opportunity!


Monday 26 March 2012

Media Recognition: DV part 3

DVCAM (encoding)
Type:
Digital videotape cassette encoding
Introduced:
1996
Active:
Yes, but few new camcorders are being produced.
Cessation:
-
Capacity:
184 minutes (large), 40 minutes (MiniDV).
Compatibility:
DVCAM is an enhancement of the widely adopted DV format, and uses the same encoding.
Cassettes recorded in DVCAM format can be played back in DVCAM VTRs (Video Tape Recorders), newer DV VTRs (made after the introduction of DVCAM), and DVCPRO VTRs, as long as the correct settings are specified (this resamples the signal to 4:1:1). DVCAM can also be played back in compatible HDV players.
Users:
Professional / Industrial.
File Systems:
-
Common Manufacturers:
Sony, Ikegami.

DVCAM is Sony’s enhancement of the DV format for the professional market. DVCAM uses the same encoding as DV, although it records ‘locked’ rather than ‘unlocked’ audio. It also differs from DV in having a wider track (15 microns) and a faster tape speed (28.215 mm/sec), which make it more robust. Any DV cassette can contain DVCAM format video, but some are sold with DVCAM branding on them.

Recognition
DVCAM labelled cassettes come in large (125.1 x 78 x 14.6 mm) or MiniDV (66 x 48 x 12.2 mm) sizes. Tape width is ¼”. Large cassettes are used in editing and recording decks, while the smaller cassettes are used in camcorders. They are marked with the DVCAM logo, usually in the upper right-hand corner.


HDV (encoding)

Type:
Digital videotape cassette encoding
Introduced:
2003
Active:
Yes, although industry experts do not expect many new HDV products.
Cessation:
-
Capacity:
1 hour (MiniDV), up to 4.5 hours (large)
Compatibility:
Video is recorded in the popular MPEG-2 video format. Files can be transferred to computers without loss of quality using an IEEE 1394 connection.
There are two types of HDV, HDV 720p and HDV 1080, which are not cross-compatible.
HDV can be played back in HDV VTRs. These are often able to support other formats such as DV and DVCAM.
Users:
Amateur/Professional
File Systems:
-
Common Manufacturers:
Format developed by JVC, Sony, Canon and Sharp.


Unlike the other DV enhancements, HDV uses MPEG-2 compression rather than DV encoding. Any DV cassette can contain HDV format video, but some are sold with HDV branding on them. 

There are two different types of HDV: HDV 720p (HD1, made by JVC) and HDV 1080 (HD2, made by Sony and Canon). HDV 1080 devices are not generally compatible with HDV 720p devices. The type of HDV used is not always identified on the cassette itself, as it depends on the camcorder used rather than the cassette.

Recognition 
HDV is a tape-only format which can be recorded on normal DV cassettes. Some MiniDV cassettes with lower dropout rates are indicated as being for HDV, either with text or the HDV logo. These are not essential for recording HDV video.


Media Recognition: DV part 2

DV (encoding)
Type:
Digital videotape cassette encoding
Introduced:
1995
Active:
Yes, but tapeless formats such as MPEG-1, MPEG-2 and MPEG-4 are becoming more popular.
Cessation:
-
Capacity:
MiniDV cassettes can hold up to 80/120 minutes SP/LP. Medium-size cassettes can hold up to 3.0/4.6 hrs SP/LP. File sizes can be up to 1GB per 4 minutes of recording.
Compatibility:
DV format is widely adopted.
Cassettes recorded in the DV format can be played back on DVCAM, DVCPRO and HDV replay devices. However, LP recordings cannot be played back in these machines.
Users:
DV is aimed at a consumer market – it may also be used by ‘prosumer’ film makers.
File Systems:
-
Common Manufacturers:
A consortium of over 60 manufacturers including Sony, Panasonic, JVC, Canon, and Sharp.

DV has a track width of 10 microns and a tape speed of 18.81mm/sec. It can be found on any type of DV cassette, regardless of branding, although most commonly it is the format used on MiniDV cassettes. 
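As a very rough sanity check on the 'up to 1GB per 4 minutes' capacity figure quoted above, some back-of-the-envelope arithmetic using DV's commonly quoted stream rates (roughly 25 Mbit/s of video plus audio, subcode and overhead; these are general DV figures rather than anything from the table above):

\[
25\ \text{Mbit/s (video)} + {\sim}3.8\ \text{Mbit/s (audio, subcode, overhead)} \approx 28.8\ \text{Mbit/s} \approx 3.6\ \text{MB/s}
\]
\[
3.6\ \text{MB/s} \times 240\ \text{s} \approx 864\ \text{MB} \approx 1\ \text{GB per 4 minutes.}
\]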

Recognition
DV cassettes are usually found in the small size, known as MiniDV. Medium size (97.5 × 64.5 × 14.6 mm) DV cassettes are also available, although these are not as popular as MiniDV. DV cassettes are labelled with the DV logo.

DVCPRO (encoding)
Type:
Digital videotape cassette encoding
Introduced:
1995 (DVCPRO), 1997 (DVCPRO 50), 2000 (DVCPRO HD)
Active:
Yes, but few new camcorders are being produced.
Cessation:
-
Capacity:
126 minutes (large), 66 minutes (medium).
Compatibility:
DVCPRO is an enhancement of the widely adopted DV format, and uses the same encoding.
Cassettes recorded in DVCPRO format can be played back only in DVCPRO Video Tape Recorders (VTRs) and some DVCAM VTRs.
Users:
Professional / Industrial; designed for electronic news gathering
File Systems:
-
Common Manufacturers:
Panasonic, also Philips, Ikegami and Hitachi.

DVCPRO is Panasonic’s enhancement of the DV format, which is aimed at a professional market. DVCPRO uses the same encoding as DV, but it features ‘locked’ audio, and uses 4:1:1 sampling instead of 4:2:0. It has an 18 micron track width and a tape speed of 33.82 mm/sec, which makes it more robust. DVCPRO uses Metal Particle (MP) tape rather than Metal Evaporated (ME) tape to improve durability.

DVCPRO 50 and DVCPRO HD are further developments of DVCPRO, which use the equivalent of 2 or 4 DV codecs in parallel to increase the video data rate.

Any DV cassette can contain DVCPRO format video, but some are sold with DVCPRO branding on them.

Recognition
DVCPRO branded cassettes come in medium (97.5 × 64.5 × 14.6mm) or large (125 × 78 × 14.6mm) cassette sizes. The medium size is for use in camcorders, and the large size in editing and recording decks. DVCPRO 50 and DVCPRO HD branded cassettes are extra-large cassettes (172 x 102 x 14.6mm). Tape width is ¼”.

DVCPRO labelled cassettes have different coloured tape doors depending on their type; DVCPRO has a yellow tape door, DVCPRO50 has a blue tape door, and DVCPRO HD has a red tape door.

Images of DVCPRO cassettes are available at the Panasonic website.

Media Recognition: DV part 1

DV can be used to refer to both a digital tape format and a codec for digital video. DV tape usually carries video encoded with the DV codec, although it can hold any type of data. The DV format was developed in the mid-1990s by a consortium of video manufacturers, including Sony, JVC and Panasonic, and quickly became the de facto standard for home video production after its introduction in 1995. Videos are recorded in .dv or .dif formats, or wrapped in an AVI, QuickTime or MXF container. These can be easily transferred to a computer with no loss of data over an IEEE 1394 (FireWire) connection.

DV tape is ¼ inch (6.35mm) wide. DV cassettes come in four different sizes: Small, also known as MiniDV (66 x 48 x 12.2 mm), medium (97.5 × 64.5 × 14.6 mm), large (125.1 x 78 x 14.6 mm), and extra-large (172 x 102 x 14.6 mm). MiniDV is the most popular cassette size.

DV cassettes can be encoded with one of four formats: DV, DVCAM, DVCPRO, or HDV. DV is the original encoding, and is used in consumer devices. DVCPRO and DVCAM were developed by Panasonic and Sony respectively as enhancements of DV, and are aimed at a professional market. The basic encoding algorithm is the same as DV’s, but a wider track (18 and 15 microns versus DV’s 10 microns) and a faster tape speed mean that these formats are more robust and better suited to professional users. HDV is a high-definition variant, aimed at both professionals and consumers, which uses MPEG-2 compression rather than DV encoding.

Depending on the recording device, any of the four DV encodings can be recorded on any size DV cassette. However, due to different recording speeds, the formats are not always backwards compatible. A cassette recorded in an enhanced format, such as HDV, DVCAM or DVCPRO, will not play back on a standard DV player. Also, as they are supported by different companies, there are some issues with playing back a DVCPRO cassette on DVCAM equipment, and vice versa.

Although all DV cassette sizes can record any format of DV, some are marketed specifically as being of a certain type; e.g. DVCAM. The guide below looks at some of the most common varieties of DV cassette that might be encountered, and the encodings that may be used with them. It is important to remember that any type of encoding may be found on any kind of cassette, depending on what system the video was recorded on.

MiniDV (cassette)
Type:
Digital videotape cassette
Introduced:
1995
Active:
Yes, but it is being overtaken in popularity by hard disk and flash memory recording. No camcorders that record on tape were presented at the 2011 International Consumer Electronics Show.
Cessation:
-
Capacity:
Up to 80 minutes SP / 120 minutes LP, depending on the tape used; 60/90 minutes SP/LP is standard. Capacity can also depend on the encoding used (see further entries). File sizes can be up to 1GB per 4 minutes of recording.
Compatibility:
The DV file format is widely adopted. A FireWire (IEEE 1394) port is required for best transfer.
Users:
Consumer and ‘Prosumer’ film makers, some professionals.
File Systems:
-
Common Manufacturers:
A consortium of over 60 manufacturers including Sony, Panasonic, JVC, Canon, and Sharp

MiniDV refers to the size of the cassette; as noted above, it can come with any encoding, although as a consumer format MiniDV cassettes generally use DV encoding. DVCAM and HDV cassettes also come in MiniDV size.

MiniDV is the most popular DV cassette, and is used for consumer and semi-professional (‘prosumer’) recordings due to its high quality.

Recognition

These cassettes are the small cassette size, measuring 66 x 48 x 12.2 mm. Tape width is ¼”. They carry the MiniDV logo.

Monday 30 January 2012

Digital Preservation: What I Wish I Knew Before I Started

Tuesday 24th January, 2012

Last week I attended a student conference, hosted by the Digital Preservation Coalition, on what digital preservation professionals wished they had known before they started. The event covered many of the challenges faced by those involved in digital preservation, and the skills required to deal with them.

The similarities between traditional archiving and digital preservation were highlighted at the beginning of the afternoon, when Sarah Higgins translated terms from the OAIS model into more traditional ‘archive speak’. Dave Thompson also emphasized this connection, arguing that digital data “is just a new kind of paper”, and that trained archivists already have 85-90% of the skills needed for digital preservation.

Digital preservation was shown to be a human rather than a technical challenge. Adrian Brown argued that much of the preservation process (the "boring stuff") can be automated. Dave Thompson stated that many of the technical issues of digital preservation, such as migration, have been solved, and that the challenge we now face is to retain the context and significance of the data. The point made throughout the afternoon was that you don’t need to be a computer expert in order to carry out effective digital preservation.

The urgency of intervention was another key lesson of the afternoon. As William Kilbride put it: digital preservation won’t do itself, it won’t go away, and we shouldn't wait for perfection before we begin to act. Access to data in the future is not guaranteed without input now, and digital data is particularly intolerant of gaps in preservation. Andrew Fetherstone added to this argument, noting that doing something is (usually) better than doing nothing, and that even if you are not in a position to carry out the whole preservation process, it is better to follow the guidelines as far as you can than to wait and create a backlog.

The scale of digital preservation was another point illustrated throughout the afternoon. William Kilbride suggested that the days of manual processing are over, due to the sheer amount of digital data being created (estimated to reach 35ZB by 2020!). He argued that the ability to process this data is more important to the future of digital preservation than the risks of obsolescence. The impossibility of preserving all of this data was illustrated by Helen Hockx-Yu, who offered the statistic that the UK Web Archive and the National Archives Web Archive combined have archived less than 1% of UK websites. Adrian Brown also pointed out that as we move towards dynamic, individualised content on the web, we must decide exactly what information it is that we are trying to preserve. During the Q&A session, it was argued that the scale of digital data means we have to accept that we can’t preserve everything, that not everything needs to be preserved, and that there will be data loss.

The importance of collaboration was another theme which was repeated by many speakers. Collaboration between institutions on a local, national and even international level was encouraged, as by sharing solutions to problems and implementing common standards we can make the task of digital preservation easier.

This is only a selection of the points covered in a very engaging afternoon of discussion. Overall, the event showed that, despite the scale of the task, digital preservation needn't be a frightening prospect, as archivists already have many of the necessary skills.

The DPC have uploaded the slides used during the event, and the event was also live-tweeted using the hashtag #dpc_wiwik, if you are interested in finding out more.