A data dictionary and related schema have been drafted for documents that are largely text, but where creators can specify formatting, such as fonts, colours, text size and page layout; where they can embed images and other items; and where they might take advantage of application features, such as the ability to create annotations or page thumbnails. The specifically targeted formats are OpenDocument Text, PDF, StarOffice, MS Works, MS Word and WordPerfect. Significant properties relating to appearance, behaviour, content and structure are recorded, and it's anticipated that this metadata could be plugged into PREMIS 2.0's objectCharacteristicsExtension.
The designers, from the California Digital Library and Harvard University Library, are seeking comments from the digital preservation community. Semantic units are: PageCount, WordCount, CharacterCount, ParagraphCount, LineCount, TableCount, GraphicsCount, Language, Fonts, FontName, IsEmbedded, Features. You can see the current schema in full at http://www.fcla.edu/dls/md/docmd.xsd
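To give a feel for what a populated record might look like, here is a rough Python sketch that assembles a few of the semantic units into XML. The element names come from the list above, but the root element name (`document`), the nesting of FontName and IsEmbedded under Fonts, the absence of a namespace, and the sample values are all my assumptions for illustration, not taken from the schema itself; check the XSD before relying on any of it.

```python
import xml.etree.ElementTree as ET


def build_record(counts, language, fonts, features):
    """Build an illustrative metadata record.

    The root name and the nesting are assumptions, not the real schema.
    """
    root = ET.Element("document")  # hypothetical root element name
    # Simple count properties, e.g. PageCount, WordCount
    for name, value in counts.items():
        ET.SubElement(root, name).text = str(value)
    ET.SubElement(root, "Language").text = language
    # Assumed structure: Fonts wraps FontName / IsEmbedded pairs
    fonts_el = ET.SubElement(root, "Fonts")
    for font_name, embedded in fonts:
        ET.SubElement(fonts_el, "FontName").text = font_name
        ET.SubElement(fonts_el, "IsEmbedded").text = "true" if embedded else "false"
    # One Features element per recorded feature value
    for feature in features:
        ET.SubElement(root, "Features").text = feature
    return root


# Sample values are invented for the example
record = build_record(
    counts={"PageCount": 12, "WordCount": 3450},
    language="en",
    fonts=[("Times New Roman", True)],
    features=["hasAnnotations"],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

In practice the values would come from a characterisation tool rather than being typed in by hand, and the resulting fragment would sit inside the PREMIS extension container.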
This looks like a useful addition to preservation metadata, provided tool support for extracting the information and populating metadata records follows. I think the list of values for 'Features' - isTagged, hasLayers, hasTransparancy, hasOutline, hasThumbnails, hasAttachments, hasForms, hasAnnotations - may need extending (hasFootnotes, hasEndnotes?), and it would be good to see some definitions and examples of the existing values.
I wonder if we need a different data dictionary and schema for slideshows? This one might be adequate with some additions to cover things like animations, timings, etc. Seeing this data dictionary also reminds me that we need to look at where the Planets folk are up to on their significant properties work (XCDL/XCEL).