Presented at: 12th International Conference on Digital Preservation (iPRES 2015)
One Core Preservation System for All your Data. No Exceptions!

Marco Klindt, Kilian Amrhein Zuse Institute Berlin (ZIB)

November 3, 2015

Frameworks for Digital Preservation

One Core Preservation System for All your Data.

Some assembly required.

Why?

•  Berlin funds digitization projects for cultural heritage institutions (LAM)

•  Service Center for Digitization (digiS) @ ZIB supports these projects with training and technical solutions

•  Sustainability demands digital preservation service for digitization outputs

ZIB what?

•  Zuse Institute Berlin: Research Institute for Applied Mathematics and Computer Science
•  Namesake Konrad Zuse: an inventor of the computer (Z1, 1938, 22-bit floating-point processing... [German view])
•  National Tier 2 Supercomputer: Konrad

Supercomputing Storage

•  Tape libraries (2x StorageTek 8500, enterprise grade)
•  Climate-controlled, fire-resistant vault
•  ~100 PB (petabyte = 10^15 bytes)
•  400 TB (net) reserved for LAM

[Figure: facility map showing the data vault, the supercomputer, and cooling.]

Preservation is hard

•  Digital Preservation as well
•  Not feasible for smaller Institutions
•  Provide Preservation as a Service utilizing ZIB infrastructure and expertise

Even as a service

•  Community effort (learn from each other)
•  Depends on multiple Communities:
   –  Preservation
   –  IT
   –  Cultural Heritage

Architectural Requirements

1.  Self-contained, self-documented Information Packages (Intellectual Entities)
2.  Anticipate obsolescence of formats, software tools, hardware, and organisation
3.  Loosely coupled Components with defined Responsibilities
4.  Use community (OSS) tools and standards

Open (Source)

•  Do not reinvent the Wheel (it still remains hard to maintain)
•  Data Workflow: iRODS (Chapel Hill, USA)
•  Ingest Workflow: Archivematica (Canada)
•  Access/Management Workflow: Fedora / Islandora (USA/Canada)
•  Glue code: in-house

Architectural Design (Overview)

[Diagram: architecture overview]

•  Deposit: Submission Manifest (Contract #, Submission ID, ...), Binary Payload, Descriptive Metadata (Dublin Core (DC), LIDO, MODS, EAD ...)
•  Preingest: SIP
•  Ingest (Archivematica): Microservices (identification, characterisation, normalization), FPR; AIP and DIP
•  Data Management (iRODS)
•  Archival Storage (Online & Tape)
•  Management / Access (Fedora / Islandora): Repository, Landing Pages, Admin Access Compound Object, Content Access Compound Object
•  Package contents: Submission PDI, Mapped DC, Content Information, PDI (PREMIS), Derivatives

Preingest

Data Agnostic Service

•  Problem: Highly heterogeneous data (formats and metadata) + We accept everything (digital)

Deposit/Transfer Components

•  Administrative Information (Submission Manifest): to find stuff (the archive)
•  Descriptive Information (Metadata Formats DC, MODS, EAD, LIDO): to find stuff (the depositor)
•  Content Information (Binary or textual data): the stuff
•  Context Information (Submission Documentation): stuff (maybe useful for users)

[Figure: example deposit with item counts 5x, 1x, 1x, 314,159x, and 271,828x, and the labels 000000 and 000001.]

Preingest: SIP Creation

[Diagram: Content Information. Unstructured data (text, emails, ...) and text, XML, and binary files are sorted into Content and Submission Documentation.]

•  Normalize Structure

•  Primary Binary/Textual Content

•  Submission Documentation

Descriptive Metadata

•  Original description in domain-specific metadata formats
•  Community standards

[Diagram: descriptive metadata (DC, EAD, LIDO, MODS) added next to the Content Information (content and submission documentation).]

Submission Metadata

•  Submission Manifest
•  YAML or METS
   –  Rights Information
   –  Contract and Contact Information of Depositor
   –  ...
•  (nearly) complete SIP

[Diagram: SIP with Administrative Description Information (Submission Manifest), descriptive metadata (DC, EAD, LIDO, MODS), and Content Information (content and submission documentation).]

MD Mapping

•  Subset MD Mapping to Dublin Core
   –  Metadata Object Description Schema (MODS)
   –  Encoded Archival Description (EAD)
   –  Lightweight Information Describing Objects (LIDO)
•  SIP now ready for Ingest.

[Diagram: SIP as before, with the descriptive metadata (DC, EAD, LIDO, MODS) mapped to Dublin Core as Description Information (DI).]
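The subset mapping to Dublin Core can be pictured as a small crosswalk table. The following is a minimal sketch, assuming a MODS record as input; the four element pairs in `MODS_TO_DC` and the function name `mods_to_dc` are illustrative assumptions, not the mapping used at ZIB.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Illustrative subset of a MODS -> Dublin Core crosswalk (an assumption,
# not the slides' actual mapping): path inside a MODS record -> DC element.
MODS_TO_DC = {
    "mods:titleInfo/mods:title": "title",
    "mods:name/mods:namePart": "creator",
    "mods:abstract": "description",
    "mods:identifier": "identifier",
}


def mods_to_dc(mods_xml):
    """Map one MODS record (XML string) to {DC element: [values]}."""
    root = ET.fromstring(mods_xml)
    dc = {}
    for path, dc_element in MODS_TO_DC.items():
        for node in root.findall(path, MODS_NS):
            text = (node.text or "").strip()
            if text:  # skip empty elements instead of mapping blanks
                dc.setdefault(dc_element, []).append(text)
    return dc
```

Elements without a counterpart in the table are simply left out of the Dublin Core view; the original record stays in the package untouched.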

SIP Rejection

•  We only require Submission Manifest (Administrative) Metadata and DI

•  If incomplete -> Reject

[Diagram: minimal SIP with Administrative Description Information (Dublin Core), Content Information, and Description Information (DI, Dublin Core).]
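The rejection rule can be sketched as a completeness check on the two required parts. This is a minimal sketch: the `SIP` class, the field names in `REQUIRED_MANIFEST_FIELDS`, and the function `rejection_reasons` are hypothetical; the slides only state that the manifest carries a contract number and a submission ID, among other things.

```python
from dataclasses import dataclass, field

# Hypothetical field names for the Submission Manifest.
REQUIRED_MANIFEST_FIELDS = ("contract_number", "submission_id")


@dataclass
class SIP:
    manifest: dict = field(default_factory=dict)     # administrative metadata
    dublin_core: dict = field(default_factory=dict)  # mapped description information (DI)
    payload: list = field(default_factory=list)      # content files, not checked here


def rejection_reasons(sip):
    """Return the reasons to reject a SIP; an empty list means it is accepted."""
    reasons = [
        f"manifest field missing: {name}"
        for name in REQUIRED_MANIFEST_FIELDS
        if not sip.manifest.get(name)
    ]
    if not sip.dublin_core:
        reasons.append("description information (Dublin Core) missing")
    return reasons
```

Returning the list of reasons rather than a bare boolean lets the depositor see what to fix before resubmitting.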

Ingest

•  Automated Ingest Workflow System
   –  Identification of File Formats
   –  Characterization (Technical MD Extraction)
   –  Normalization (Migration on Ingest)
   –  Creation of Access Derivatives
   –  Fixity Hashes
•  Microservices and Best-Practice Tools
•  Every transformation based upon Rules in Format Policy Registry (FPR)
•  One workflow (No Exceptions!)
•  One FPR Ruleset (No Exceptions!)
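The idea that every transformation follows from one ruleset can be sketched as a lookup table. The rule entries, format labels, and the function `plan_transformations` below are hypothetical and do not reproduce Archivematica's FPR data model.

```python
# Hypothetical ruleset: (identified format, purpose) -> target format.
FPR_RULES = {
    ("JPEG", "preservation"): "TIFF",
    ("JPEG", "access"): "JPEG",
    ("WAV", "access"): "MP3",
}


def plan_transformations(identified_format, rules=FPR_RULES):
    """Return {purpose: target format} for one file; empty if no rule applies."""
    return {
        purpose: target
        for (fmt, purpose), target in rules.items()
        if fmt == identified_format
    }
```

Because the plan is derived only from the identified format and the ruleset, two files of the same format are always treated the same way, which is the point of "One workflow, one ruleset".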


AIP

•  Technical MD (PREMIS)
•  Identification:
   –  Known Knowns
   –  Known Unknowns
   –  No Unknown Unknowns (after D. Rumsfeld via Matthew Addis)

[Diagram: AIP with Administrative Description Information (Dublin Core), Content Information (normalized binary and text files), Description Information (DI, Dublin Core), and Preservation Description Information (PDI, PREMIS).]

Preservation Levels

•  Preservation level is perceived, not assigned:
   –  Passive (Known Unknown)
   –  Active (Known Known)
•  Core Preservation Actions:
   –  Re-Identification
   –  Migration scheduling based on FPR changes
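A level that is "perceived, not assigned" can be read as a value computed from what is currently known about a file. The sketch below assumes that "Known Known" means the format was identified and is covered by an FPR rule, and that everything else counts as "Known Unknown"; both function names and this reading are assumptions.

```python
def preservation_level(identified_format, fpr_formats):
    """Derive the level from the current identification and ruleset."""
    if identified_format is not None and identified_format in fpr_formats:
        return "active"   # known known
    return "passive"      # known unknown


def migration_candidates(holdings, old_fpr_formats, new_fpr_formats):
    """AIP IDs whose format gained rule coverage after an FPR change.

    holdings maps AIP ID -> identified format (None if unidentified).
    """
    newly_covered = set(new_fpr_formats) - set(old_fpr_formats)
    return sorted(
        aip_id for aip_id, fmt in holdings.items() if fmt in newly_covered
    )
```

Since nothing is stored per object, a change to the ruleset or a successful re-identification changes the level without touching the AIP itself.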

[Diagram: Beholder: Rules + Technical MD + Schedule.]

Data Management

•  Federated Object Store
   –  Object = File + Metadata (incl. Checksum)
   –  Rule-based Synchronization across abstracted physical storage locations
   –  1 AIP = 1 Object
•  Horizontal replication across data-centers
•  Metadata stored for AIP objects used for Management & Retrieval

AVUs defined for dataObj bac186cd-4d11-48ac-bb1d-4ab2cd7593cc.tar:
attribute: uuid
value: bac186cd-4d11-48ac-bb1d-4ab2cd7593cc
units:
----
attribute: producerID
value: DE-MUS-019910
units:
----
attribute: submissionID
value: DE-MUS-019910-201505131006
units:
----
attribute: checksum
value: sha2:E4dMTd7/J4z9qg36CSjSzdXXIa4ltgAak+MKfSuPKww=
units:
----
attribute: lastFixityCheck
value: 2015-05-13T08:06:16Z
units:
----
attribute: type
value: AIP
units:
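A fixity check against the recorded checksum attribute could look like the following sketch. It assumes the value follows the iRODS convention of a `sha2:` prefix followed by the base64-encoded SHA-256 digest; the function names `irods_style_sha2` and `fixity_ok` are hypothetical.

```python
import base64
import hashlib


def irods_style_sha2(path, chunk_size=1024 * 1024):
    """Return 'sha2:' + base64(SHA-256 digest) for the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large AIP tar files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha2:" + base64.b64encode(digest.digest()).decode("ascii")


def fixity_ok(path, recorded_checksum):
    """Compare a freshly computed checksum with the recorded attribute value."""
    return irods_style_sha2(path) == recorded_checksum
```

A scheduled job would call `fixity_ok` for each AIP and update the lastFixityCheck attribute when the comparison succeeds.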


Hierarchical Storage Management (HSM)

•  Vertical replication:
   –  Staging Area (Disk Cache)
   –  Automatic Archiving onto two Tapes in different Tape Libraries
   –  Checksums per data block
•  Staging time offline -> online < 90s (nearline)

Management/Access

•  Access and Management System
•  Dark Archive: only for Admins and Depositors
•  One Object (AIP), Two Views:
   –  Admin Access View (us)
   –  Content Access View (them)

Content Access Compound Object (CACO)

•  Mapped Descriptive Metadata (Dublin Core)
•  Derivatives (access copies)

[Diagram: AIP (Administrative Description Information, Content Information, Description Information (DI), Preservation Description Information (PDI, PREMIS), normalized binary and text files, submission documentation) next to the CACO with mapped Dublin Core and derivatives / access copies.]

Admin Access Compound Object (AACO)

•  Administrative Description for Access and Management
•  Reference to the AIP
•  Admin Access to: iRODS Metadata, Data Object (AIP)

[Diagram: AIP (Administrative Description Information, Content Information, Description Information (DI), Preservation Description Information (PDI, PREMIS), normalized binary and text files, submission documentation) next to the AACO, which holds a reference.]

AIP Data Model

[Diagram: complete AIP data model. Submission Manifest; descriptive metadata (DC, EAD, LIDO, MODS) mapped to DC; content (text, XML, binary files); submission documentation (unstructured data: text, emails, ...); normalized binary and text files; derivatives / access copies; Preservation Description Information (PDI, PREMIS). The Admin Access Compound Object (AACO) holds a reference; the Content Access Compound Object (CACO) holds the mapped Dublin Core and access copies.]

Preservation Actions

[Diagram: architecture overview (Deposit, Preingest, Ingest with Archivematica, Data Management with iRODS, Archival Storage, Management / Access with Fedora / Islandora), with a Migration step added.]

AIP Reingest

•  Sponsored Feature (Q1/2016)
•  Allows for
   –  Metadata addition
   –  Re-Identification
   –  Re-Normalization

Contingency – Exit Strategy

•  Archivematica: only find new ingest workflow
•  iRODS: use filesystem
•  Islandora: reingest into other repository
•  Organisation: Self-contained AIP (transformation req'd)
•  No Strategy against Evil, yet!

•  Single Pipeline
•  OAIS-aligned
•  Modular
•  Data Deposits: Cultural heritage and research data w/ binary content, descriptive metadata, and submission information
•  Policies:
   –  control metadata mapping
   –  normalization through Format Policy Registry (FPR)

