Analysing the Impact of File Formats on Data Integrity
Volker Heydegger
University of Cologne
Archiving 2008
Bern, 23rd – 27th June 2008
Overview
Volker Heydegger | Archiving 2008 | 25th June 2008 | Bern, Switzerland
• Introduction
• File format data and information loss
    What happens if data is corrupted in files?
    Categories of file format data
• Measuring information loss
    Robustness indicators
    Study results for different file formats
Background
• EU-funded project “Planets”
characterisation of file format content
www.planets-project.eu
University of Cologne, Computer Science for the Humanities
(Historisch-Kulturwissenschaftliche Informationsverarbeitung (HKI))
Planets partner
www.hki.uni-koeln.de
Context
• Long-term preservation of digital information
Which file format to choose?
Criteria, e.g.:
Open standard
Spread of usage
Hard-/Software-Dependencies
Authenticity
…
Robustness
Robustness ::= error resilience of file formats against bit-stream corruption
Issues / Research topics
• Is there any correlation between file format and data integrity?
• If so, do file formats differ in their degree of robustness?
• Which file-format-based factors are responsible for varying degrees of robustness?
• How can we improve the robustness of file formats?
Benefits
• Digital preservation: decision support for choosing a file format for long-term preservation
• Contribution to file format research
• Improvement of existing file formats
• Design of future file formats
File Format Data and Information loss
What is “File Format” in our context?
• Set of rules constituting the logical organisation of data
• Set of rules indicating how to interpret data
• The set of rules is laid down in the file format specification
• File Format Data ::= binary data, formatted according to the rules of a file format
What happens if data is corrupted in files?
Test image: TIF, greyscale, 32x32 pixels, 8 bits per pixel
[Figure: first 224 bytes of the test file; one byte highlighted with the value FF]
Plain information loss: 1 byte of data == 1 pixel
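The plain case can be illustrated with a minimal Python sketch. It assumes an uncompressed 8-bit greyscale payload in which each byte is exactly one pixel; the mid-grey pixel values, the byte position 100 and the helper name changed_bytes are hypothetical and not taken from the test file.

```python
def changed_bytes(before: bytes, after: bytes) -> int:
    """Count the positions at which two equal-length byte strings differ."""
    if len(before) != len(after):
        raise ValueError("buffers must have the same length")
    return sum(a != b for a, b in zip(before, after))


WIDTH = HEIGHT = 32
# Hypothetical payload: a uniform mid-grey image, one byte per pixel.
pixels = bytes([128] * (WIDTH * HEIGHT))

# Corrupt a single payload byte (position chosen arbitrarily).
corrupted = bytearray(pixels)
corrupted[100] = 0xFF

lost = changed_bytes(pixels, bytes(corrupted))
print(f"{lost} of {len(pixels)} pixels changed")
```

Because the damaged byte is payload data, the loss stays local: only the one pixel stored in that byte is affected.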
[Figure: part of the TIF Image File Directory, tag PhotometricInterpretation; highlighted value 00]
Conditional information loss: 1 bit changed == 100% of the information changed
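A minimal Python sketch of the conditional case follows. It relies on the TIFF convention that PhotometricInterpretation 0 means WhiteIsZero and 1 means BlackIsZero; the pixel values and the helper name displayed_values are hypothetical, and the sketch models only how 8-bit greyscale values are interpreted, not a full TIFF reader.

```python
WHITE_IS_ZERO = 0  # TIFF PhotometricInterpretation value 0
BLACK_IS_ZERO = 1  # TIFF PhotometricInterpretation value 1


def displayed_values(pixels: bytes, photometric: int) -> bytes:
    """Map stored 8-bit samples to brightness values (0 = black, 255 = white)."""
    if photometric == BLACK_IS_ZERO:
        return bytes(pixels)
    if photometric == WHITE_IS_ZERO:
        return bytes(255 - p for p in pixels)
    raise ValueError(f"unsupported photometric interpretation: {photometric}")


# Hypothetical 32x32 payload; the pixel bytes themselves are never touched.
pixels = bytes((i * 7) % 256 for i in range(32 * 32))

original = displayed_values(pixels, BLACK_IS_ZERO)

# A single flipped bit in the technical data turns the tag value 1 into 0.
corrupted_tag = BLACK_IS_ZERO ^ 0b1
corrupted = displayed_values(pixels, corrupted_tag)

changed = sum(a != b for a, b in zip(original, corrupted))
share = changed / len(pixels) * 100
print(f"{changed} of {len(pixels)} pixels changed ({share:.0f} %)")
```

Every pixel changes because p and 255 - p can never be equal for an integer p, so a one-bit error in technical data affects the whole payload even though no payload byte was damaged.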
Categories of File Format Data
• Technical data (data for processing):
Image width: 277
Image length: 339
Compression: uncompressed
• “Payload” data (basic data of usage):
Pixel data, starting from byte #0x008
Robustness Indicators
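(1) RB = Σ Δ (b0 ,b1) / m
where
i. b0 is the basic data of usage (payload data) before being corrupted,
ii. b1 is the basic data of usage after being corrupted,
iii. Δ (b0 ,b1) is the number of bytes that differ between b0 and b1 after one corruption procedure,
iv. m is the number of corruption procedures; the sum runs over all m procedures.
RB indicates the average information loss per corruption procedure.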
Example
A file X has 2000 bytes of payload data. Suppose the file is corrupted 3 times and the number of bytes changed in each corruption procedure is
1. Δ (b0 ,b1) = 200 bytes
2. Δ (b0 ,b1) = 150 bytes
3. Δ (b0 ,b1) = 250 bytes
The average information loss for file X based on 3 corruption procedures is then
RB = 600 / 3 = 200
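The same arithmetic as a short Python sketch. The function name rb and the list of per-procedure losses are illustrative choices; the numbers are the ones from the example.

```python
def rb(losses: list[int]) -> float:
    """Average number of changed payload bytes over all corruption procedures."""
    if not losses:
        raise ValueError("at least one corruption procedure is required")
    return sum(losses) / len(losses)


# Changed bytes per corruption procedure, as in the example for file X.
losses = [200, 150, 250]
average_loss = rb(losses)
print(average_loss)
```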
RB related to the total amount of payload data:
(2) RBt = RB / n
where n is the total amount of basic data of usage (payload data).
(3) RBt = RB / n * 100
(RBt expressed as a percentage)
Interpretation: RBt = 0 %: max. robustness (min. information loss)
Example (continued)
(2) RBt= 200 / 2000 = 0.1
(3) RBt= 200 / 2000 * 100 = 10 (%)
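Formulas (2) and (3) can be sketched in Python as follows. The function name rbt and its as_percentage flag are hypothetical conveniences, not part of the original study.

```python
def rbt(rb_value: float, payload_size: int, as_percentage: bool = False) -> float:
    """Relate the average loss RB to the total amount n of payload data."""
    if payload_size <= 0:
        raise ValueError("payload size must be positive")
    ratio = rb_value / payload_size                  # formula (2)
    return ratio * 100 if as_percentage else ratio   # formula (3)


ratio = rbt(200, 2000)
percent = rbt(200, 2000, as_percentage=True)
print(ratio, percent)
```

A value of 0 means that no payload data changed on average, which is the case of maximum robustness.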
Study on Robustness for various file formats: Example Results
TIF
- uncompressed
- LZW
- JPEG (2 different compression levels)
- ZIP
PNG (filtered, unfiltered)
JPEG2000 (lossless, lossy)
BMP (uncompressed)
Method
- simulation of file corruption: every file is corrupted up to 3000 times (3000 corruption procedures)
- applying 3-5 different corruption ratios: less than 0.01%, 0.01%, 0.1%, 1.0%, more than 1.0%
- compressed payload data is decompressed
- the original payload data is compared with the corrupted payload data
- Robustness Indicator values are computed
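The method can be sketched in Python under several assumptions that are not stated in the slides: corruption is modelled as overwriting randomly chosen bytes with a different value, a file that can no longer be decoded counts as a complete loss, and zlib stands in for a compressed format. The names corrupt and simulate_rbt are hypothetical, and the tooling used in the study may differ.

```python
import random
import zlib
from typing import Callable


def corrupt(data: bytes, ratio: float, rng: random.Random) -> bytes:
    """Change about `ratio` of all bytes (at least one) at random positions."""
    out = bytearray(data)
    count = max(1, round(len(out) * ratio))
    for pos in rng.sample(range(len(out)), count):
        out[pos] ^= rng.randrange(1, 256)  # XOR with non-zero: byte always changes
    return bytes(out)


def simulate_rbt(file_data: bytes, decode: Callable[[bytes], bytes],
                 ratio: float, runs: int, seed: int = 0) -> float:
    """Return RBt in percent, averaged over `runs` corruption procedures."""
    rng = random.Random(seed)
    original = decode(file_data)
    total = 0
    for _ in range(runs):
        try:
            damaged = decode(corrupt(file_data, ratio, rng))
        except Exception:
            damaged = b""  # assumption: an undecodable file is a complete loss
        # Count original payload bytes that are missing or different.
        total += sum(1 for i, b in enumerate(original)
                     if i >= len(damaged) or damaged[i] != b)
    return total / runs / len(original) * 100


payload = bytes(range(256)) * 4            # hypothetical 1024-byte payload
raw_file = payload[:1000]                  # uncompressed: file equals payload
packed_file = zlib.compress(payload)       # compressed stand-in

rbt_raw = simulate_rbt(raw_file, lambda d: d, ratio=0.01, runs=100)
rbt_packed = simulate_rbt(packed_file, zlib.decompress, ratio=0.01, runs=100)
print(f"uncompressed: {rbt_raw:.2f} %, compressed stand-in: {rbt_packed:.2f} %")
```

For the uncompressed case the indicator equals the corruption ratio, since every damaged byte is exactly one lost payload byte. The compressed stand-in shows how the same ratio can be measured after decompression; its result depends on the assumption made for undecodable data.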
Example: JP2-formatted image, corruption of 1 byte, “bad case”
Example: JP2-formatted image, corruption of 1 byte, “good case”
Example: JP2-formatted image, corruption of 1 byte, “good case” with visualized differences in pixel data
Thank you very much!