Analysing the Impact of File Formats on Data Integrity Volker Heydegger University of Cologne Archiving 2008 Bern, 23rd – 27th June 2008

Analysing the Impact of File Formats on Data Integrity

Volker Heydegger

University of Cologne

Archiving 2008

Bern, 23rd – 27th June 2008

Overview

Volker Heydegger | Archiving 2008 | 25th June 2008 | Bern, Switzerland

• Introduction

• File format data and information loss

- What happens if data is corrupted in files?

- Categories of file format data

• Measuring information loss

- Robustness indicators

- Study results for different file formats

Background

• EU-funded project “Planets”

characterisation of file format content

www.planets-project.eu

University of Cologne, Computer Science for the Humanities

(Historisch-Kulturwissenschaftliche Informationsverarbeitung (HKI))

Planets partner

www.hki.uni-koeln.de

Context

• Long-term preservation of digital information

Which file format to choose?

Criteria, e.g.:

Open standard

Spread of usage

Hardware/software dependencies

Authenticity

Robustness

Robustness ::= error resilience of file formats against bit-stream corruption

Issues / Research topics

• Is there any correlation between file format and data integrity?

• If so, are there any differences among file formats concerning the degree of robustness?

• Which file-format-based factors are responsible for varying degrees of robustness?

• How can we improve the robustness of file formats?

Benefits

• Digital preservation: decision support for choosing a file format for long-term preservation

• Contribution to file format research

• Improvement of existing file formats

• Design of future file formats

File Format Data and Information loss

What is “File Format” in our context?

• Set of rules constituting the logical organisation of data

• Set of rules indicating how to interpret data

• Set of rules laid down in the file format specification

• File Format Data ::= binary data, formatted according to the rules of a file format

What happens if data is corrupted in files?

Test image: TIF, greyscale, 32x32 pixels, 8 bits per pixel

First 224 bytes of the test file (figure: hex dump; byte value shown: FF)

Plain information loss: 1 byte of data == 1 pixel

Part of the TIF Image File Directory, tag: PhotometricInterpretation (figure: hex dump; byte value shown: 00)

Conditional information loss: 1 bit changed == 100% of the information changed
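The contrast between plain and conditional information loss can be sketched with a toy model in Python. This is not a TIF parser: the image is just a flat byte string plus one photometric flag, and the names `render` and `changed_pixels` are made up for this illustration. The only fact taken from the TIFF specification is that PhotometricInterpretation 0 means "white is zero" and 1 means "black is zero", so changing that value inverts every grey value.

```python
# Toy model of a 32x32, 8-bit greyscale image (hypothetical helper names).
WIDTH, HEIGHT = 32, 32

def render(pixels: bytes, photometric: int) -> bytes:
    """Return the grey values a viewer would display."""
    if photometric == 1:                      # BlackIsZero: values shown as stored
        return bytes(pixels)
    return bytes(255 - v for v in pixels)     # WhiteIsZero: every value inverted

def changed_pixels(before: bytes, after: bytes) -> int:
    """Count the pixels whose displayed grey value differs."""
    return sum(1 for a, b in zip(before, after) if a != b)

# A simple horizontal gradient as sample payload data.
pixels = bytes(x * 8 for _y in range(HEIGHT) for x in range(WIDTH))
original = render(pixels, photometric=1)

# Plain information loss: one corrupted payload byte damages one pixel.
corrupted = bytearray(pixels)
corrupted[100] ^= 0xFF
plain_loss = changed_pixels(original, render(bytes(corrupted), 1))

# Conditional information loss: one changed bit in the technical data
# (photometric flag 1 -> 0) changes how every pixel is interpreted.
conditional_loss = changed_pixels(original, render(pixels, 0))

print(plain_loss, "of", WIDTH * HEIGHT, "pixels changed (payload byte)")
print(conditional_loss, "of", WIDTH * HEIGHT, "pixels changed (photometric flag)")
```

The payload bytes are identical in the second case; only their interpretation changes, which is why the loss is conditional on the technical data.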

Categories of File Format Data

• Technical data (data for processing):

Image width: 277

Image length: 339

Compression: uncompressed

• “Payload” data (basic data of usage):

Pixel data, starting from byte #0x008
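To make the two categories concrete, here is a minimal Python sketch that separates technical data from payload data in a TIF-like byte string. The 8-byte header layout, the 12-byte directory entries and the tag numbers (256, 257, 259, 262, 273) follow the TIFF 6.0 specification. The helper names `read_technical_data` and `build_example` and the all-zero sample image are assumptions made for illustration. The synthetic file is deliberately simplified (it omits required tags) and is not a complete, valid TIF.

```python
import struct

# Tag numbers and field types as defined in the TIFF 6.0 specification.
TAGS = {256: "ImageWidth", 257: "ImageLength", 259: "Compression",
        262: "PhotometricInterpretation", 273: "StripOffsets"}
SHORT, LONG = 3, 4

def read_technical_data(data: bytes) -> dict:
    """Read single-valued SHORT/LONG tags from the first Image File Directory."""
    order = {b"II": "<", b"MM": ">"}[data[:2]]            # byte-order mark
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF header")
    (n_entries,) = struct.unpack_from(order + "H", data, ifd_offset)
    fields = {}
    for i in range(n_entries):
        entry = ifd_offset + 2 + 12 * i                   # each entry is 12 bytes
        tag, ftype, count = struct.unpack_from(order + "HHI", data, entry)
        if tag not in TAGS or count != 1 or ftype not in (SHORT, LONG):
            continue
        fmt = "H" if ftype == SHORT else "I"
        (fields[TAGS[tag]],) = struct.unpack_from(order + fmt, data, entry + 8)
    return fields

def build_example(width: int = 277, length: int = 339) -> bytes:
    """Build a simplified little-endian sample: header, pixel data at byte 8, IFD."""
    payload = bytes(width * length)                       # all-zero pixels, 8 bits each
    entries = [(256, SHORT, width), (257, SHORT, length), (259, SHORT, 1),
               (262, SHORT, 1), (273, LONG, 8)]
    ifd = struct.pack("<H", len(entries))
    for tag, ftype, value in entries:
        # In little-endian order a SHORT followed by two zero bytes has the
        # same bytes as a 32-bit value, so one format covers both types.
        ifd += struct.pack("<HHII", tag, ftype, 1, value)
    ifd += struct.pack("<I", 0)                           # no further IFD
    header = b"II" + struct.pack("<HI", 42, 8 + len(payload))
    return header + payload + ifd

data = build_example()
technical = read_technical_data(data)
start = technical["StripOffsets"]
payload = data[start:start + technical["ImageWidth"] * technical["ImageLength"]]
print(technical, len(payload))
```

Compression value 1 stands for "uncompressed" in the TIFF specification, which matches the example values on the slide.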

Robustness Indicators

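(1) RB = Σ Δ(b0, b1) / m

where

i. b0 is the basic data of usage before being corrupted,

ii. b1 is the basic data of usage after being corrupted,

iii. Δ(b0, b1) is the amount of basic data of usage changed by one corruption procedure, summed (Σ) over all procedures,

iv. m is the number of corruption procedures.

RB indicates the average information loss per corruption procedure.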

Example

A file X may have 2000 bytes of payload data. Suppose the file is corrupted 3 times and the number of bytes changed by each corruption procedure is:

1. Δ(b0, b1) = 200 bytes

2. Δ(b0, b1) = 150 bytes

3. Δ(b0, b1) = 250 bytes

The average information loss for file X based on 3 corruption procedures is then

RB = 600 / 3 = 200
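The same calculation as a short Python sketch. The function name `rb` is an assumption for illustration; the input is the list of changed-byte counts, one per corruption procedure.

```python
def rb(deltas: list[int]) -> float:
    """Formula (1): average number of changed payload bytes per corruption procedure."""
    if not deltas:
        raise ValueError("at least one corruption procedure is required")
    return sum(deltas) / len(deltas)

# Values from the example for file X: three corruption procedures.
rb_x = rb([200, 150, 250])
print(rb_x)
```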

RB in relation to the total amount of payload data:

(2) RBt = RB / n

where n is the total amount of basic data of usage (payload data).

(3) RBt = RB / n * 100 (RBt expressed as a percentage)

Interpretation: RBt = 0 % means maximum robustness (minimum information loss)

Example (continued)

(2) RBt = 200 / 2000 = 0.1

(3) RBt = 200 / 2000 * 100 = 10 (%)
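Formulas (2) and (3) can be written as one small Python function. The name `rbt` and the `percent` switch are assumptions for illustration; the numbers are those of the continued example (RB = 200, n = 2000).

```python
def rbt(rb: float, n: int, percent: bool = False) -> float:
    """Formulas (2) and (3): RB relative to the total amount of payload data n."""
    if n <= 0:
        raise ValueError("n must be positive")
    ratio = rb / n
    return ratio * 100 if percent else ratio

fraction = rbt(200, 2000)                      # formula (2)
percentage = rbt(200, 2000, percent=True)      # formula (3)
print(fraction, percentage)
```

A value of 0 corresponds to maximum robustness, and larger values mean that a larger share of the payload was lost on average.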

Study on Robustness for various file formats: Example Results

TIF

- uncompressed

- LZW

- JPEG (2 different compression levels)

- ZIP

PNG (filtered, unfiltered)

JPEG2000 (lossless, lossy)

BMP (uncompressed)

Method

- simulation of file corruption: every file is corrupted up to 3000 times (3000 corruption procedures)

- applying 3-5 different corruption ratios: less than 0.01%, 0.01%, 0.1%, 1.0%, more than 1.0%

- compressed payload data is decompressed

- the original payload data is compared with the corrupted payload data

- the Robustness Indicator values are computed
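The method steps can be sketched as a small simulation in Python. Several things here are assumptions rather than the study's actual setup: corruption is modelled as overwriting randomly chosen bytes, zlib stands in for a compressed file format, a file that can no longer be decoded is counted as a loss of all payload data, and the number of runs is reduced to keep the example fast. All function names are hypothetical.

```python
import random
import zlib

def corrupt(stored: bytes, ratio: float, rng: random.Random) -> bytes:
    """Change round(ratio * size) distinct bytes (at least one) of the stored file."""
    n = max(1, round(ratio * len(stored)))
    out = bytearray(stored)
    for pos in rng.sample(range(len(out)), n):
        out[pos] ^= rng.randrange(1, 256)   # XOR with a non-zero value always changes the byte
    return bytes(out)

def delta(b0: bytes, b1: bytes) -> int:
    """Number of bytes of b0 that are missing or different in b1."""
    return len(b0) - sum(x == y for x, y in zip(b0, b1))

def safe_inflate(stored: bytes) -> bytes:
    """Decompress; assumption: an undecodable file yields no payload at all."""
    try:
        return zlib.decompress(stored)
    except zlib.error:
        return b""

def rbt_percent(payload, encode, decode, ratio, runs=200, seed=1):
    """Corrupt the encoded file `runs` times and return RBt in percent."""
    rng = random.Random(seed)
    stored = encode(payload)
    total = sum(delta(payload, decode(corrupt(stored, ratio, rng)))
                for _ in range(runs))
    return total / runs / len(payload) * 100

payload = bytes(range(256)) * 16            # 4096 bytes of sample payload data
raw = rbt_percent(payload, bytes, bytes, ratio=0.001)
packed = rbt_percent(payload, zlib.compress, safe_inflate, ratio=0.001)
print(f"uncompressed: {raw:.4f} %   zlib: {packed:.4f} %")
```

For the uncompressed case every corrupted byte damages exactly one payload byte, so RBt equals the share of bytes that were changed. For the compressed case the result depends on how the decoder reacts to damaged input, which is exactly the kind of format-dependent behaviour the indicators are meant to capture.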

Example: JP2-formatted image, corruption of 1 byte, “bad case”

Example: JP2-formatted image, corruption of 1 byte, “good case”

Example: JP2-formatted image, corruption of 1 byte, “good case” with visualized differences in pixel data

Thank you very much!

