Post on 28-Nov-2014
Supporting SPs in a working archive: Software Tools
Challenge
Reality: Manual maintenance of a large number of objects is infeasible. Software is required that can extract & maintain SPs for large numbers of objects.
Requirements:
1. Object analysis tools
   • Support requisite formats
   • Identify all/some SPs
   • Support batch analysis
   • Ideally well supported and documented
2. Description schemas to record SPs
   • Flexible
   • Machine and format independent
3. Conversion/emulation tools capable of maintaining SPs
Format identification
• File identification through magic number and ‘light touch’ scan of encoding structure.
• Recognise 100s (potentially 1000s) of formats
• Provide basic encoding info, but not detailed structure
• Examples:
   • File (1): Free version created in 1986 & available for all operating systems. http://gnuwin32.sourceforge.net/packages/file.htm (Windows)
   • DROID: Java app developed by TNA. Integration with PRONOM. Format ID & assignment of PUID, which can be linked to preservation planning. http://droid.sourceforge.net/
   • FFIdent: Java library to ID and extract basic information. Recognizes 27 encoding formats using header information (magic number & common structural information)
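Magic-number identification can be sketched in a few lines. The example below is illustrative only: the signature table is a small hand-picked assumption, and the names `SIGNATURES` and `identify` are hypothetical and do not come from File (1), DROID or FFIdent, which use much larger signature registries and additional structural checks.

```python
import io

# Hypothetical, deliberately tiny signature table: (magic bytes, label).
# Real tools hold hundreds or thousands of entries.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"fLaC", "FLAC"),
]


def identify(stream, signatures=SIGNATURES):
    """Return a format label for a binary stream, or None if nothing matches."""
    # Read only as many bytes as the longest signature needs ('light touch').
    longest = max(len(magic) for magic, _ in signatures)
    header = stream.read(longest)
    for magic, label in signatures:
        if header.startswith(magic):
            return label
    return None


def identify_path(path):
    """Convenience wrapper: identify a file on disk by its leading bytes."""
    with open(path, "rb") as handle:
        return identify(handle)


# Example with in-memory data standing in for real files.
pdf_label = identify(io.BytesIO(b"%PDF-1.4\n%..."))
unknown_label = identify(io.BytesIO(b"plain text"))
```

A sketch like this gives a format label and nothing more, which matches the point above: identification provides basic encoding information but not detailed structure.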
Detailed Analysis
• Email:
   • Aperture: Java framework able to decode structured text and convert to other format
   • ReadPST: Open source tool for processing Outlook PSTs
   • XENA: Java tool developed by NAA
• Audio:
   • MP3Info: technical info viewer and ID3 1.x tag editor that supports the MP3 file format.
   • SoX/SOXI (Sound eXchange): extracts descriptive MD and technical info
   • MetaFlac: Extractor tool for FLAC audio.
• Images:
   • TiffInfo
   • ImageMagick
   • JHOVE
Perform detailed analysis of internal structure of one or more files.
See InSPECT Testing Reports available at http://www.significantproperties.org.uk/ for further info on these tools
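As a rough illustration of what detailed analysis of an audio file's internal structure yields, the sketch below reads technical properties from a WAV file using Python's standard `wave` module. It is not how SoX/SOXI or any tool listed above works internally; the function names and the property names in the returned dictionary are assumptions made for this example.

```python
import io
import wave


def wav_properties(source):
    """Read technical properties from a WAV file path or binary file object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bits": wav.getsampwidth() * 8,
            "sample_rate_hz": rate,
            "frame_count": frames,
            "duration_s": frames / rate if rate else 0.0,
        }


def make_test_wav(seconds=1, rate=8000):
    """Build a silent mono 16-bit WAV in memory so the sketch is self-contained."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
    buffer.seek(0)
    return buffer


properties = wav_properties(make_test_wav(seconds=2, rate=8000))
```

Batch analysis would amount to calling `wav_properties` over a list of paths and collecting the dictionaries.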
JHOVE 1/2
JHOVE (http://hul.harvard.edu/jhove/)
• Format-specific digital object validation API written in Java
• Functionality: Format identification, Format validation, Format Characterisation
• Supports: AIFF, ASCII, Bytestream, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, UTF-8, WAV, and XML.
JHOVE2 (https://confluence.ucop.edu/display/JHOVE2Info/Home)
• Supports: JPEG 2000, PDF, SGML, Shapefile, TIFF, ASCII & UTF-8 encoded text, WAVE, XML, ICC color profile
• Functionality: Format identification, validation, feature extraction & policy-based assessment
JHOVE Demo
XCL (eXtensible Characterization Language)
• Content extraction
   • Extracts content & tech properties through use of XCEL and saves them as XCDL.
• Format support:
   • PNG, TIFF, GIF, BMP, JPEG, JP2, PBM, PCD, PCX, PICT, PPM, PSD, SVG, TGA, XBM and XPM, MS DOC, DocX, PDF
• Content comparison
   • Compare 2 objects e.g. TIFF & PNG, PDF & Doc
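The idea of comparing two objects in different formats can be sketched as a comparison of their extracted properties once both are expressed in a common, format-independent form. The example below is a simplified assumption: the function `compare_properties`, the flat dictionaries and the property names are hypothetical and do not reflect the real XCDL schema or the XCL comparator's interface.

```python
def compare_properties(props_a, props_b, tolerance=0.0):
    """Compare two flat property dicts and return a verdict per property.

    Verdicts: 'preserved' (equal, or numerically within tolerance),
    'changed' (present in both but different), 'missing' (in only one).
    """
    report = {}
    for name in sorted(set(props_a) | set(props_b)):
        if name not in props_a or name not in props_b:
            report[name] = "missing"
            continue
        a, b = props_a[name], props_b[name]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (a, b)
        )
        same = abs(a - b) <= tolerance if numeric else a == b
        report[name] = "preserved" if same else "changed"
    return report


# Hypothetical extracted properties for a TIFF and its PNG conversion.
tiff_props = {"width": 640, "height": 480, "bits_per_sample": 8, "compression": "LZW"}
png_props = {"width": 640, "height": 480, "bits_per_sample": 8, "interlace": "none"}

report = compare_properties(tiff_props, png_props)
```

Here `width`, `height` and `bits_per_sample` come back as `preserved`, while `compression` and `interlace` are reported as `missing` because each exists on only one side. Whether a missing property matters is a policy decision that sits outside this sketch.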
XCL Extract & compare
[Diagram. Labels: Object A, Object B; Format A XCEL, Format B XCEL; Conversion, Extractor, Comparator; Object A XCDL, Object B XCDL]
XCL Demo
Final thoughts
• Analysis tools useful, but have problems:
   • Limited format support
   • Variable access methods (GUI, CLI, APIs)
   • Inconsistent reporting process
   • Different metrics (e.g. text vs. number)
   • Metric variations (e.g. milliseconds)
• Partial solution: Wrap tools into services
   • PLANETS Interoperability Framework
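One way to picture wrapping tools into services is a thin adapter that hides each tool's reporting quirks behind a common record. The sketch below assumes two stand-in tools that report the same duration differently, one as text and one as a number of milliseconds. All names here (`duration_to_seconds`, `wrap_tool`, the stand-in tools and file name) are hypothetical and are not part of the PLANETS Interoperability Framework.

```python
def duration_to_seconds(value, unit="s"):
    """Normalise a duration given as a number or 'HH:MM:SS(.fff)' text to seconds."""
    if isinstance(value, str) and ":" in value:
        seconds = 0.0
        for part in value.split(":"):
            seconds = seconds * 60 + float(part)
        return seconds
    divisor = {"s": 1, "ms": 1000}[unit]
    return float(value) / divisor


def wrap_tool(name, run, duration_unit="s"):
    """Wrap a tool callable so that every tool returns the same record shape."""
    def service(path):
        raw = run(path)
        return {
            "tool": name,
            "path": path,
            "duration_s": duration_to_seconds(raw["duration"], duration_unit),
        }
    return service


# Stand-in tools (assumptions for illustration, not real tool output).
def text_tool(path):
    return {"duration": "00:01:30.5"}


def millisecond_tool(path):
    return {"duration": 90500}


services = [
    wrap_tool("text-tool", text_tool),
    wrap_tool("ms-tool", millisecond_tool, duration_unit="ms"),
]
records = [service("example.wav") for service in services]
```

Both records end up with `duration_s` equal to 90.5, so downstream code can compare results without knowing which tool produced them.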