File formats and registries
Manfred Thaller, University of Cologne
October 2nd, 2007
PART I – Formats and Registries
EXERCISE I – Evaluate some
PART II – Formats in PLANETS
EXERCISE II – A bit of modelling
An image
6 rows, 5 columns
5 rows, 6 columns
An image
1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1
1 == yellow, 0 == red
An image
1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1
1 == violet, 0 == green
An image
1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1
Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1
An image
1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1
Store: 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1
An image
1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1
Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1
Uncompressed
An image
1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1
Store: 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1
(Compressed) Run Length Encoded
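The run-length encoded stream above can be reproduced with a short sketch. It assumes the stream is a flat sequence of (count, value) pairs read row by row, which is what the numbers on the slide suggest; the function names are illustrative only.

```python
from itertools import groupby

# The 6 x 5 image from the slide, read row by row.
PIXELS = [
    1, 1, 1, 1, 1,
    1, 0, 0, 0, 1,
    1, 1, 0, 1, 1,
    1, 1, 0, 1, 1,
    1, 1, 0, 1, 1,
    1, 1, 1, 1, 1,
]


def rle_encode(values):
    """Return a flat list: count, value, count, value, ..."""
    encoded = []
    for value, run in groupby(values):
        encoded.extend([len(list(run)), value])
    return encoded


def rle_decode(encoded):
    """Invert rle_encode by expanding each (count, value) pair."""
    decoded = []
    for count, value in zip(encoded[0::2], encoded[1::2]):
        decoded.extend([value] * count)
    return decoded


stored = rle_encode(PIXELS)
print(stored)
```

Because runs continue across row boundaries, the decoder still needs the dimensions (6 rows, 5 columns) to rebuild the grid from the 30 decoded values.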
An image
1,1 2,1 3,1 4,1 5,1
1,2 2,2 3,2 4,2 5,2
1,3 2,3 3,3 4,3 5,3
1,4 2,4 3,4 4,4 5,4
1,5 2,5 3,5 4,5 5,5
1,6 2,6 3,6 4,6 5,6
Store:
SetSize: 5 by 6
SetBackgroundColor: Blue
SetForegroundColor: Red
SetLetterHeight: 4
MoveTo: 3,5
DrawLetter: T
An image
6 rows, 5 columns
1 == yellow, 0 == red
Uncompressed
An image
dimensions
1 == yellow, 0 == red
Uncompressed
An image
dimensions
photometric interpretation
Uncompressed
An image
dimensions
photometric interpretation
compression
An image
<basic information>
<rendering information>
<storage information>
An image
<basic information> (implicit / explicit)
<rendering information> (implicit / explicit)
<storage information> (implicit / explicit)
… and the data?
An image
1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1
Data either as data stream
1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1
An image
1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1
Data either as data stream or as processing instructions
SetSize: 5 by 6
SetBackgroundColor: Yellow
SetForegroundColor: Red
SetLetterHeight: 4
MoveTo: 3,5
DrawLetter: T
File format
<basic information> What to do?
<rendering information> How to do it?
<storage information> How to move it from persistent to deployed form?
<data> What to deploy?
File format
<basic information> Mandatory
<rendering information> Useful
<storage information> Historical
<data> Mandatory
File format
A deterministic specification of how the properties of a digital object can be reversibly converted into a linear bytestream (bitstream).
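As a minimal sketch of "deterministic" and "reversible", the toy format below writes the four parts named on the earlier slides (basic, rendering and storage information, then data) into a bytestream and reads them back. The byte layout is invented for this illustration and does not correspond to any real format.

```python
import struct

# Hypothetical toy layout (an assumption, not a real format):
#   2 bytes  rows, cols                   -> basic information
#   6 bytes  RGB triples for values 0, 1  -> rendering information
#   1 byte   0 = uncompressed             -> storage information
#   n bytes  one byte per pixel           -> data
HEADER = struct.Struct("<BB6sB")


def to_bytes(rows, cols, palette, pixels):
    """Convert the object's properties into a linear bytestream."""
    return HEADER.pack(rows, cols, bytes(palette), 0) + bytes(pixels)


def from_bytes(blob):
    """Recover the same properties from the bytestream."""
    rows, cols, palette, compression = HEADER.unpack_from(blob)
    if compression != 0:
        raise ValueError("this sketch only handles uncompressed data")
    start = HEADER.size
    pixels = list(blob[start:start + rows * cols])
    return rows, cols, list(palette), pixels


pixels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
          1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
palette = [255, 0, 0, 255, 255, 0]   # 0 == red, 1 == yellow
blob = to_bytes(6, 5, palette, pixels)
print(from_bytes(blob) == (6, 5, palette, pixels))
```

The round trip returning the original properties is what "reversibly" means here; the same input always yielding the same bytes is what "deterministic" means.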
File format: TIFF
File format: PDF
1 0 obj
<< /Type /Page /Parent 281 0 R /Resources 2 0 R /Contents 3 0 R /StructParents 2 /MediaBox [ 0 0 612 792 ] /CropBox [ 0 0 612 792 ] /Rotate 0 >>
endobj
File format: PDF
2 0 obj
<< /ProcSet [ /PDF /Text ] /Font << /TT2 292 0 R /TT4 288 0 R >> /ExtGState << /GS1 300 0 R >> /ColorSpace << /Cs6 289 0 R >> >>
endobj
File format: PDF
3 0 obj
<< /Length 4605 /Filter /FlateDecode >>
stream
[Flate-compressed binary data, not human-readable]
… and about 4000 bytes more
endstream
endobj
File format: XML (here: SVG)
<?xml version="1.0" encoding="UTF-16"?>
<svg:svg width="800" height="1000" xmlns:svg="http://www.w3.org ...
  <svg:rect x="0" y="0" width="800" height="1000" fill="white" />
  <svg:g transform="translate(-140,0)">
    <svg:line x1="600" y1="20" x2="500" y2="20" stroke="black" …
    <svg:text x="600" y="28.8" font-size="6" fill="black" …
  </svg:g>
  <svg:g transform="translate(-140,0)">
    <svg:text x="500" y="24.4">
      <svg:tspan font-size="4" fill="black">Leiste</svg:tspan>
    </svg:text>
  </svg:g>
  <svg:defs>
    <svg:g id="halbeSaeuleLeiste0">
File format: XML (here SVG)
File format: XML (ETH: “column XML”)
<?xml version="1.0" encoding="UTF-8"?>
<Autor name="Vitruv">
  <Ordnung name="Ionisch" THz="" THn="" MH="" TBz="" TBn="" …
    <Element name="Gebaelk" original="" THz="" THn="" MH="" …
      <Element name="Gesims" original="corona" THz="" THn="" MH="" …
        <Element name="Leiste" original="" THz="" THn="" MH="0.03" …
        <Element name="Kyma" original="sima" THz="" THn="" …
        <Element name="Leiste" original="" THz="" THn="" MH="0.017" …
        <Element name="Kyma_reversa" original="cymatium" THz="" …
        <Element name="Platte" original="corona" THz="" THn="" …
        <Element name="Leiste" original="" THz="" THn="" MH="0.017" …
        <Element name="Kyma_reversa" original="cymatium" THz="" …
          <hElement name="Band" typ="1" dx="0.048" r="0.019"/>
          <hElement name="Band" typ="1" dx="0.048" r="0.019"/>
        </Element>
Files and Preservation
1. Bit rot.
2. Obsolescence of software.
Bit rot
An image file before …
Bit rot
... and after one byte is changed.
Undetectable by software.
Bit rot
002 004
234 123
234 156
127 178
221 221
Processing dictionary
Payload
Bit rot
002 004
234 123
234 156
127 xxx
221 221
One byte is damaged, one byte cannot be displayed correctly.
Bit rot
002 xxx
234 123
234 156
127 178
221 221
One byte is damaged, ten bytes cannot be displayed correctly.
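The contrast between the two cases can be sketched in a few lines. The slide does not say what its "processing dictionary" bytes encode, so this example assumes a simple colour lookup table followed by one index per pixel; the layout, the names and the choice of damaged bytes are illustrative assumptions.

```python
# Assumed layout: a two-entry lookup table ("processing dictionary")
# followed by 30 pixel indices ("payload").
DICTIONARY = ["red", "yellow"]            # index 0 == red, 1 == yellow
PAYLOAD = [1, 1, 1, 1, 1,
           1, 0, 0, 0, 1,
           1, 1, 0, 1, 1,
           1, 1, 0, 1, 1,
           1, 1, 0, 1, 1,
           1, 1, 1, 1, 1]


def render(dictionary, payload):
    """Look up the colour of every pixel."""
    return [dictionary[index] for index in payload]


def damaged_pixels(dictionary, payload):
    """Count pixels that differ from the undamaged rendering."""
    original = render(DICTIONARY, PAYLOAD)
    return sum(a != b for a, b in zip(original, render(dictionary, payload)))


# Case 1: one payload byte is damaged.
rotten_payload = list(PAYLOAD)
rotten_payload[7] = 1
payload_damage = damaged_pixels(DICTIONARY, rotten_payload)

# Case 2: one dictionary entry is damaged.
rotten_dictionary = ["red", "green"]
dictionary_damage = damaged_pixels(rotten_dictionary, PAYLOAD)

print(payload_damage, dictionary_damage)
```

A single damaged payload byte affects one pixel, while a single damaged dictionary entry affects every pixel that refers to it.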
Result:
http://www.cflr.beniculturali.it/Progetti/Fixit.php
www.cflr.beniculturali.it
Franco Liberati
Università di Roma “La Sapienza”, Dipartimento Informatica
Centro Fotoriproduzione Legatoria e Restauro
Paolo Buonora
Paolo on JPEG
JPEG2000 more robust against bit rot than TIFF.
So, to stimulate more empiricism …
Obsolescence
1. Software able to read the format does not exist any more.
2. Format specification lost.
3. Implied algorithm lost.
4. Required object lost.
Recommended formats: text
High confidence: Plain text (encoding: ISO8859-1 - 9, UTF-8, UTF-16 with BOM); XML (includes XSD/XSL/XHTML, etc.; with included or accessible schema and character encoding explicitly specified); PDF/A-1 (ISO 19005-1)
Medium confidence: Cascading Style Sheets (*.css); DTD (*.dtd); PDF (*.pdf) (embedded fonts); Rich Text Format 1.x (*.rtf); HTML 4.x (include a DOCTYPE declaration); SGML (*.sgml); Open Office (*.sxw/*.odt); Office Open XML (*.docx)
Low confidence: PDF (*.pdf) (encrypted); Microsoft Word (*.doc); WordPerfect (*.wpd); DVI (*.dvi); All other text formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
Recommended formats: bitmap / raster image
High confidence: TIFF (uncompressed); PNG (*.png)
Medium confidence: BMP (*.bmp); JPEG/JFIF (*.jpg); JPEG2000 (prefer lossless or uncompressed) (*.jp2); TIFF (compressed); GIF (*.gif)
Low confidence: MrSID (*.sid); TIFF (in Planar format); FlashPix (*.fpx); PhotoShop (*.psd); All other raster image formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
Recommended formats: vector graphics
High confidence: SVG 1.1 (no Java binding) (*.svg)
Medium confidence: Computer Graphic Metafile (CGM, WebCGM) (*.cgm)
Low confidence: Encapsulated Postscript (EPS); Macromedia Flash (*.swf); All other vector image formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
Recommended formats: audio
High confidence: AIFF (PCM) (*.aif, *.aiff); WAV (PCM) (*.wav)
Medium confidence: SUN Audio (uncompressed) (*.au); Standard MIDI (*.mid, *.midi); Ogg Vorbis (*.ogg); Free Lossless Audio Codec (*.flac); Advanced Audio Coding (*.mp4, *.m4a, *.aac); MP3 (MPEG-1/2, Layer 3) (*.mp3)
Low confidence: AIFC (compressed) (*.aifc); NeXT SND (*.snd); RealNetworks 'Real Audio' (*.ra, *.rm, *.ram); Windows Media Audio (*.wma); WAV (compressed) (*.wav); All other audio formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
Recommended formats: video
High confidence: Motion JPEG 2000 (ISO/IEC 15444-4) (*.mj2); AVI (uncompressed) (*.avi); QuickTime Movie (uncompressed) (*.mov); Motion JPEG (*.avi, *.mov)
Medium confidence: Ogg Theora (*.ogg); MPEG-1, MPEG-2 (*.mpg, *.mpeg); MPEG-4 (*.mp4)
Low confidence: AVI (compressed) (*.avi); QuickTime Movie (compressed) (*.mov); RealNetworks 'Real Video' (*.rv); Windows Media Video (*.wmv); All other video formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
Recommended formats: “data base”
High confidence: Delimited Text (*.txt, *.csv); SQL DDL
Medium confidence: DBF (*.dbf); OpenOffice (*.sxc/*.ods); Office Open XML (*.xlsx)
Low confidence: Excel (*.xls); All other spreadsheet / database formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
Recommended formats: 3D (“virtual reality”)
High confidence / Medium confidence: X3D (*.x3d); VRML (*.wrl, *.vrml); U3D (Universal 3D file format)
Low confidence: All other virtual reality formats not listed here
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
What kind of file is this?
Two ways to identify a file:
(a) By extension.
(b) By internal characteristics (“magic number”, “signature”).
What kind of file is this?
Two ways to identify a file:
(a) By extension.
“Each file ending with *.doc is a MS Word document”
What kind of file is this?
Two ways to identify a file:
(b) By internal characteristics (“magic number”, “signature”).
A TIFF file begins with …
Bytes 0-1: The byte order used within the file. Legal values are: “II” (4949.H) / “MM” (4D4D.H)
Bytes 2-3: An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file.
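The signature check described above can be sketched as follows. It reads the byte-order mark first and then interprets bytes 2-3 in that byte order; the function names are illustrative, and the sketch checks only these four bytes, not whether the rest of the file is valid.

```python
import struct


def looks_like_tiff(header: bytes) -> bool:
    """Check the first four bytes against the TIFF signature."""
    if len(header) < 4:
        return False
    byte_order = header[:2]
    if byte_order == b"II":       # 0x4949: little-endian
        fmt = "<H"
    elif byte_order == b"MM":     # 0x4D4D: big-endian
        fmt = ">H"
    else:
        return False
    (magic,) = struct.unpack(fmt, header[2:4])
    return magic == 42


def file_looks_like_tiff(path: str) -> bool:
    """Apply the same check to the beginning of a file on disk."""
    with open(path, "rb") as handle:
        return looks_like_tiff(handle.read(4))


print(looks_like_tiff(b"II*\x00"), looks_like_tiff(b"%PDF"))
```

Unlike the extension, this test still works when a file has been renamed, which is why identification tools tend to rely on such signatures.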
What kind of file is this?
The need to identify files led to three developments:
(a) “Clever software” – inspects files to decide how to process them.
(b) MIME types.
(c) Format registries.
What kind of file is this?
The following 4 transparencies are a quotation from http://hul.harvard.edu/gdfr
(see below).
Global Digital Format Registry DSpace User Group, March 2004
Why Do We Need a Registry?
• Repository functions are performed on a format-specific basis
• Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is represented
• Interchange requires mutual agreement of format syntax and semantics
Potential Use Cases
• Identification – “I have a digital object; what format is it?”
• Validation – “I have an object purportedly of format F; is it?”
• Transformation – “I have an object of format F, but need G; how can I produce it?”
• Characterization – “I have an object of format F; what are its significant properties?”
• Risk assessment – “I have an object of format F; is it at risk of obsolescence?”
• Delivery – “I have an object of format F; how can I render it?”
Repository Format Dependencies Using the OAIS Reference Model
[Diagram: OAIS functional model with the format registry. Labels: SIP, AIP, DIP; Ingest (validate SIP, QA, transform SIP-to-AIP, generate AIP); Data Management (descriptive metadata; content and representation information); Archival storage; Administer; Manage; Preservation (strategies, monitoring, migration, emulation; metadata for encapsulation/archaeology); Access (discovery, generate DIP, transform AIP-to-DIP, delivery).]
What’s Wrong with MIME Types?
• Insufficient depth of detail
– No requirements regarding syntax and semantic description
– No requirement for complete disclosure, especially of proprietary formats
• Insufficient granularity
– Both tiled RGB GeoTIFF with LZW and striped bi-tonal TIFF-FX with Group 4 are typed as “image/tiff”
– All of PDF 1.0 – 1.4, PDF/X-1, X-2, X-3, and PDF/A are typed as “application/pdf”
– These variants might require radically different workflows
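The granularity problem can be seen with Python's standard `mimetypes` module, which maps file names to MIME types by extension. The file names below are hypothetical stand-ins for the variants named on the slide; the point is only that the resulting type label cannot tell them apart.

```python
import mimetypes

# Hypothetical file names for very different variants of two formats.
EXAMPLES = [
    "geotiff_rgb_tiled_lzw.tif",
    "fax_bitonal_group4.tif",
    "report_pdf_1_0.pdf",
    "report_pdfa.pdf",
]


def guess(name: str):
    """Return the MIME type that the standard library guesses for a name."""
    mime_type, _encoding = mimetypes.guess_type(name)
    return mime_type


for name in EXAMPLES:
    print(name, "->", guess(name))
```

Both TIFF variants come back with one label and both PDF variants with another, so a workflow that needs to distinguish them requires more detailed format information than the MIME type carries.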
File format registries - URLs
PRONOM: http://www.nationalarchives.gov.uk/pronom/ (does not only rely on extensions)
Global Digital Format Registry: http://hul.harvard.edu/gdfr (predominantly project description)
FileExt: http://filext.com (predominantly links to software)
Exercise I: A few experiments
Group 1
Aistė Abromaitytė
Tomasz Jablonski
Aadi Kaljuvee
Juratė Kuprienė
Violeta Meiliūnaitė
Exercise I: A few experiments
Group 2
Libor Coufal
Edvardas Germanas
Hamid Rofoogaran
Laima Šiudikiene
Eglė Žvinytė
Exercise I: A few experiments
Group 3
Renata Balandienė
Thomas Guignard
Edgars Jekabsons
Elona Malaiškienė
Bjorn Ragnolf Ronning
Exercise I: A few experiments
Group 4
Gražina Deveikytė
Raimondas Malaiška
Filip Kwiatek
Marija Prokopčik
Piret Randmae
Jelena Saikovič
PART II – Formats in PLANETS: File characteristics
Based on two formal languages:
(1) eXtensible Characterisation Extraction Language (= XCEL)
(2) eXtensible Characterisation Description Language (= XCDL)
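Tooth of Time
[Diagram: a file in 2007 and in 2017, each passed through the Extractor (format specified in XCEL); the Comparer compares XCDL 2007 with XCDL 2017. Result shown: 0,99%]
Migrator
[Diagram: the Migrator converts tiff to png; the Extractor, using tiff XCEL and png XCEL (and ... XCEL), produces tiff XCDL and png XCDL; the Comparer compares them. Result shown: 0,93%]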
<XCELDocument...> ...
<formatDescription>
  ....
  <symbol identifier="ID01_I01_I01_S02" originalName="height" interpretation="uint32">
    <range>
      <startposition xsi:type="sequential"></startposition>
      <length xsi:type="fixed">4</length>
    </range>
    <name>height</name>
  </symbol>
  <symbol identifier="ID01_I01_I01_S04" originalName="colourType">
    <range>
      <startposition xsi:type="sequential"></startposition>
      <length xsi:type="fixed">1</length>
    </range>
    <valueInterpretation>
      <valueLabel>greyscale</valueLabel>
      <value>0</value>
      ...
    <name>imageType</name>
  </symbol>
  <symbol identifier="ID01_I01_I01_S05" originalName="compressionMethod">
    <range>
      <startposition xsi:type="sequential"></startposition>
      <length xsi:type="fixed">1</length>
    </range>
    <valueInterpretation>
      <valueLabel>zlibDeflateInflate</valueLabel>
      <value>0</value>
    </valueInterpretation>
    <name>compression</name>
  </symbol>
  ...
<xcdl>
  <object id="o1" >
    <normData id="nd1" > ... </normData>
    <property id="p1" source="raw" cat="descr" >
      <name>compression</name>
      <valueSet id="i_i1_s6" >
        <rawValue>0 </rawValue>
        <labValue>...</labValue>
        <dataRef ind="normAll" />
        <propRel/>
      </valueSet>
    </property>
    <property id="p2" source="raw" cat="descr" >
      <name>height</name>
      <valueSet id="i_i1_s3" >
        <rawValue>0 0 1 ad </rawValue>
        <labValue>
          <val>429</val>
          <type>uint32</type>
        </labValue>
        <dataRef ind="normAll" />
        <propRel/>
      </valueSet>
    </property>
    <property id="p3" source="raw" cat="descr" >
      <name>imageType</name>
      .....
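The relation between the raw and the labelled value of the height property can be checked by hand. The sketch below assumes that the four raw bytes are hexadecimal and stored most significant byte first; the slide does not state the byte order, and the function name is illustrative.

```python
def raw_to_uint32(raw_value: str) -> int:
    """Interpret four space-separated hex bytes as a big-endian unsigned integer."""
    data = bytes(int(token, 16) for token in raw_value.split())
    if len(data) != 4:
        raise ValueError("expected exactly four bytes for a uint32")
    return int.from_bytes(data, byteorder="big")


height = raw_to_uint32("0 0 1 ad")
print(height)
```

Under that assumption the raw value 0 0 1 ad yields 1 × 256 + 173 = 429, the number shown in the labelled value.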
<xcdl> <object id="o1" > <normData id="nd1" > ff ff ff ff ff fe ff ff fd ff ff fc ff ff fb ff ff fa ff ff f9 ff ff f8 ff ff f7 ff ff f6 ff ff f5 ff ff f4 ff ff f3 ff ff f2 ff ff f1 ff ff f0 ff ff ef ff ff ee ff ff ed ff ff ec ff ff eb ff ff ea ff ff e9 ff ff e8 ff ff e7 ff ff e6 ff ff e5 ff ff e4 ff ff e3 ff ff e2 ff ff e1 ff ff e0 ff ff df ff ff de ff ff dd ff ff dc ff ff db ff ff da ff ff d9 ff ff d8 ff ff d7 ff ff d6 ff ff d5 ff ff d4 ff ff d3 ff ff d2 ff ff d1 ff ff d0 ff ff cf ff ff ce ff ff cd ff ff cc ff ff cb ff ff ca ff ff c9 ff ff c8 ff ff c7 ff ff c6 ff ff c5 ff ff c4 ff ff c3 ff ff c2 ff ff c1 ff ff c0 ff ff bf ff ff be ff ff bd ff ff bc ff ff bb ff ff ba ff ff b9 ff ff b8 ff ff b7 ff ff b6 ff ff b5 ff ff b4 ff ff b3 ff ff b2 ff ff b1 ff ff b0 ff ff af ff ff ae ff ff ad ff ff ac ff ff ab ff ff aa ff ff a9 ff ff a8 ff ff a7 ff ff a6 ff ff a5 ff ff a4 ff ff a3 ff ff a2 ff ff a1 ff ff a0 ff ff 9f ff ff 9e ff ff 9d ff ff 9c ff ff 9b ff ff 9a ff ff 99 ff ff 98 ff ff 97 ff ff 96 ff ff 95 ff ff 94 ff ff 93 ff ff 92 ff ff 91 ff ff 90 ff ff 8f ff ff 8e ff ff 8d ff ff 8c ff ff 8b ff ff 8a ff ff 89 ff ff 88 ff ff 87 ff ff 86 ff ff 85 ff ff 84 ff ff 83 ff ff 82 ff ff 81 ff ff 80 ff ff 7f ff ff 7e ff ff 7d ff ff 7c ff ff 7b ff ff 7a ff ff 79 ff ff 78 ff ff 77 ff ff 76 ff ff 75 ff ff 74 ff ff 73 ff ff 72 ff ff 71 ff ff 70 ff ff 6f ff ff 6e ff ff 6d ff ff 6c ff ff 6b ff ff 6a ff ff 69 ff ff 68
…
Confession
Computer science does not really know what information is.
It is pretty good at representing and processing it, though.
Representations & migrations
III == 3 == γ‘ == ●●●
Four representations of the idea / concept / model “three”
Representations & migrations
I divided by III == 1 / 3 == 0.3333?
I divided by III == 1 / 3 == 0.3 periodic
Some ideas are handled more precisely by some thinkers than others.
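The difference between a truncated and an exact representation of one third can be shown with Python's standard library. This is only an illustration of the general point; the four-digit precision is an arbitrary choice for the example.

```python
from decimal import Decimal, localcontext
from fractions import Fraction

# Exact representation: numerator and denominator are kept as integers.
exact = Fraction(1, 3)

# Truncated representation: only four significant decimal digits survive.
with localcontext() as ctx:
    ctx.prec = 4
    truncated = Decimal(1) / Decimal(3)
    back = truncated * 3

print(exact * 3 == 1)    # the exact form multiplies back to one
print(truncated, back)   # the truncated form does not
```

Migrating the value from the fraction to the four-digit decimal loses information that no later step can recover.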
Representations & migrations
48-bit images on 24-bit and on 48-bit graphics cards.
Some data is processed more adequately by some equipment than others.
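What a 24-bit device can preserve of a 48-bit pixel can be sketched as follows. The example assumes three channels of 16 bits each, reduced to 8 bits by keeping the most significant byte; real display pipelines may round or dither differently, and the sample values are arbitrary.

```python
def to_8_bit(sample_16: int) -> int:
    """Keep the 8 most significant bits of a 16-bit channel value."""
    return sample_16 >> 8


def to_16_bit(sample_8: int) -> int:
    """Scale an 8-bit value back to the 16-bit range 0..65535."""
    return sample_8 * 257


pixel_48 = (65535, 32768, 300)                    # three 16-bit channels
pixel_24 = tuple(to_8_bit(c) for c in pixel_48)   # what a 24-bit card shows
restored = tuple(to_16_bit(c) for c in pixel_24)  # best-effort way back

print(pixel_24, restored)
```

The restored pixel differs from the original, so the 24-bit environment displays the image but cannot stand in for the 48-bit data.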
Representations & migrations
A model for information before and after a migration must therefore be able to represent all the information present, irrespective of whether it can be processed in a given environment.
XCEL / XCDL
Languages are being processed …
… development focus currently: dynamic handling of format-specific algorithms.
XCEL / XCDL: image model (1)
A pixel cube …
Each pixel:
MSB (channel 1), … LSB (channel 1),
…
MSB (channel n), … LSB (channel n),
MSB (aux 1), … LSB (aux 1),
…
MSB (aux m), … LSB (aux m)
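The bit order listed above can be sketched for a single pixel. The example assumes that every channel and auxiliary value has the same bit depth; the function name and the sample values are illustrative and not taken from the XCDL specification.

```python
def pixel_bits(channels, aux=(), bits_per_value=8):
    """Serialise one pixel as a bit string, most significant bit first.

    Channels come first, auxiliary values (for example alpha) afterwards.
    """
    parts = []
    for value in (*channels, *aux):
        if not 0 <= value < 2 ** bits_per_value:
            raise ValueError("value does not fit into the bit depth")
        parts.append(format(value, f"0{bits_per_value}b"))
    return "".join(parts)


bits = pixel_bits((255, 0, 128), aux=(1,))
print(bits)
```

With three 8-bit channels and one auxiliary value, each pixel occupies 32 positions along the depth axis of the cube.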
XCEL / XCDL: image model (2)
A pixel cube …
Accompanied by rendering info plus deployment info.
XCEL / XCDL: image model - example
<property id="p4" source="raw" cat="descr" >
  <name>imageType</name>
  <valueSet id="i_i1_s5" >
    <rawValue>2</rawValue>
    <labValue>
      <val>truecolour</val>
      <type>fixedLabel</type>
    </labValue>
    <dataRef ind="normAll" />
    <propRel/>
  </valueSet>
</property>
XCEL / XCDL: text model
A text (= <object>) is composed of
- data (<normData>) plus
- interpretations of data according to the underlying format specification (= properties; <property>).
XCEL / XCDL: text model - example
This is a text
<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
…
<property>
  <name>fontsize</name>
  <rawVal>
    <val>00 18</val>
    <type>unsignedInt8</type>
  </rawVal>
  <dataRef> <!-- property refers to discrete part of reference data -->
    <ref id="1" start="0" end="3"/>
    <ref id="1" start="10" end="12"/>
  </dataRef>
</property>
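How the reference data and the `<ref>` ranges relate can be sketched as follows. The example assumes that the reference data is ASCII and that `start` and `end` are zero-based and both inclusive; the slide does not say whether `end` is inclusive, so the extracted spans depend on that assumption.

```python
# Reference data copied from the <refData> element.
ref_data = bytes.fromhex("54 68 69 73 20 69 73 20 61 20 74 65 78 74")
text = ref_data.decode("ascii")


def span(data: bytes, start: int, end: int) -> str:
    """Return the bytes from start to end, assuming both are inclusive."""
    return data[start:end + 1].decode("ascii")


print(text)
print(span(ref_data, 0, 3), span(ref_data, 10, 12))
```

The property (here the font size) is stored once and points into the data, rather than the data being interleaved with formatting codes.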
Exercise II: Abstract modelling
Group 1: maps
Group 2: music
Group 3: Excel sheets
Group 4: “books” … ever heard of FRBR?