12th International Conference on Digital Preservation (iPRES 2015)
One Core Preservation System for All your Data. No Exceptions!
Marco Klindt, Kilian Amrhein Zuse Institute Berlin (ZIB)
November 3, 2015
Frameworks for Digital Preservation
Why?
• Berlin funds digitization projects for cultural heritage institutions (LAM: libraries, archives, museums)
• Service Center for Digitization (digiS) @ ZIB supports these projects with training and technical solutions
• Sustainability demands a digital preservation service for digitization outputs
ZIB what?
• Zuse Institute Berlin: Research Institute for Applied Mathematics and Computer Science
• Namesake Konrad Zuse: an inventor of the computer (Z1, 1938, 22-bit floating point processing... [German view])
Supercomputing Storage
• Tape libraries (2x StorageTek 8500, enterprise grade)
• Climate-controlled, fire-resistant vault
• ~100 PB (Petabyte = 10^15 Bytes)
• 400 TB (net) reserved for LAM
Preservation is hard
• Digital Preservation as well
• Not feasible for smaller Institutions
• Provide Preservation as a Service utilizing ZIB infrastructure and expertise
Even as a service
• Community effort (learn from each other)
• Depends on multiple Communities:
– Preservation
– IT
– Cultural Heritage
Architectural Requirements
1. Self-contained, self-documented Information Packages (Intellectual Entities)
2. Anticipate obsolescence of formats, software tools, hardware, and organisation
3. Loosely coupled Components with defined Responsibilities
4. Use community (OSS) tools and standards
Open (Source)
• Do not reinvent the Wheel (it still remains hard to maintain)
• Data Workflow: Chapel Hill, USA
• Ingest Workflow: Canada
• Access/Management Workflow: USA/Canada
• Code Glue: in-house
Architectural Design (Overview)
[Diagram: architecture overview, in pipeline order. Deposit (Submission Manifest with Contract #, Submission ID, ...; Binary Payload; Descriptive Metadata in Dublin Core (DC), LIDO, MODS, EAD, ...), Preingest (SIP), Ingest with Archivematica (AIP; microservices: identification, characterisation, normalization; FPR), Data Management with iRODS, Archival Storage (Online & Tape), Management/Access with Fedora/Islandora (DIP; Landing Pages, Repository; Admin Access Compound Object and Content Access Compound Object).]
Preingest
Data Agnostic Service
• Problem: Highly heterogeneous data (formats and metadata) + We accept everything (digital)
Deposit/Transfer Components
• Administrative Information (Submission Manifest)
• Descriptive Information (Metadata Formats DC, MODS, EAD, LIDO)
• Content Information (Binary or textual data)
• Context Information (Submission Documentation)
[Figure: example deposit with item counts 5x, 1x, 1x, 314,159x, 271,828x and folders 000000, 000001]
Deposit/Transfer Components, by purpose
• Administrative Information (Submission Manifest): To find Stuff (the archive)
• Descriptive Information (Metadata): To find Stuff (the depositor)
• Content Information (Binary or textual data): The Stuff
• Context Information (Submission Documentation): Stuff (maybe useful for users)
Preingest: SIP Creation
[Figure: Content Information, made up of Content (text, XML, binary files) and Submission Documentation (unstructured data: text, emails, ...)]
• Normalize Structure
• Primary Binary/Textual Content
• Submission Documentation
Descriptive Metadata
• Original description in domain specific Metadata formats
• Community standards
[Figure: Metadata (DC, EAD, LIDO, MODS) added next to the Content Information]
Submission Metadata
• Submission Manifest
• YAML or METS
– Rights Information
– Contract and Contact Information of Depositor
– ...
• (nearly) complete SIP
[Figure: SIP with Administrative Information (Submission Manifest), Description Information (Dublin Core; original Metadata in DC, EAD, LIDO, MODS) and Content Information (content and submission documentation)]
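A minimal sketch of what such a Submission Manifest could look like, modelled in Python and rendered as YAML-compatible text. The slides only name a contract number, a submission ID, rights information and contact information; every field name and every value below is an assumption made for illustration, not the digiS schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SubmissionManifest:
    # All field names are hypothetical; the slides do not define a schema.
    contract_number: str
    submission_id: str
    rights: str
    contact: str


def to_yaml(manifest: SubmissionManifest) -> str:
    """Render the manifest as flat `key: "value"` lines.

    JSON double-quoted strings are also valid YAML scalars, so json.dumps
    is used to quote and escape the values.
    """
    return "\n".join(
        f"{key}: {json.dumps(value)}" for key, value in asdict(manifest).items()
    )


example = SubmissionManifest(
    contract_number="C-0001",                    # made-up value
    submission_id="DE-MUS-019910-201505131006",  # shape copied from the AVU listing in these slides
    rights="in copyright",                       # made-up value
    contact="depositor@example.org",             # made-up value
)
print(to_yaml(example))
```

A flat key/value layout keeps the manifest readable by hand, which fits the requirement of self-documented packages; a METS serialization would carry the same fields.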
MD Mapping
• Subset MD Mapping to Dublin Core
– Metadata Object Description Schema (MODS)
– Encoded Archival Description (EAD)
– Lightweight Information Describing Objects (LIDO)
• SIP now ready for Ingest.
[Figure: SIP as before, with the original Metadata (DC, EAD, LIDO, MODS) mapped to DC as Description Information (DI)]
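The subset mapping can be pictured as a small lookup table from source paths to Dublin Core elements. The sketch below does this for MODS only; the four mapped fields are an illustrative choice and not the mapping used at ZIB, and the sample record is invented.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Illustrative subset only: (path inside the MODS record, target DC element).
MODS_TO_DC = [
    ("m:titleInfo/m:title", "title"),
    ("m:name/m:namePart", "creator"),
    ("m:originInfo/m:dateIssued", "date"),
    ("m:identifier", "identifier"),
]


def mods_to_dc(mods_xml: str) -> ET.Element:
    """Copy the mapped subset of a MODS record into a flat Dublin Core record."""
    mods = ET.fromstring(mods_xml)
    record = ET.Element("record")
    for path, dc_name in MODS_TO_DC:
        for node in mods.findall(path, {"m": MODS_NS}):
            if node.text and node.text.strip():
                ET.SubElement(record, f"{{{DC_NS}}}{dc_name}").text = node.text.strip()
    return record


# Invented sample record; the abstract is deliberately outside the mapped subset.
sample = f"""
<mods xmlns="{MODS_NS}">
  <titleInfo><title>Example object</title></titleInfo>
  <name><namePart>Example, Person</namePart></name>
  <originInfo><dateIssued>1938</dateIssued></originInfo>
  <abstract>Stays only in the original MODS file.</abstract>
</mods>
"""
dc = mods_to_dc(sample)
for element in dc:
    print(element.tag.split("}")[1], "=", element.text)
```

Because only a subset is mapped, the original MODS, EAD or LIDO file stays in the SIP next to the mapped Dublin Core record, so nothing is lost by the mapping.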
SIP Rejection
• We only require Submission Manifest (Administrative) Metadata and Description Information (DI)
• If incomplete -> Reject
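The rejection rule is simple enough to sketch: check the manifest and the DI, and nothing else. The `Sip` class and the required field names below are hypothetical; the slides only state that manifest and DI are mandatory.

```python
from dataclasses import dataclass, field


@dataclass
class Sip:
    # Hypothetical in-memory view of a SIP; keys and structure are assumptions.
    manifest: dict = field(default_factory=dict)
    description_information: dict = field(default_factory=dict)  # mapped Dublin Core
    content_files: list = field(default_factory=list)


# Assumed minimal fields, chosen for illustration.
REQUIRED_MANIFEST_FIELDS = ("contract_number", "submission_id")
REQUIRED_DI_FIELDS = ("title",)


def rejection_reasons(sip: Sip) -> list:
    """Return the reasons for rejecting the SIP; an empty list means accept."""
    reasons = []
    for name in REQUIRED_MANIFEST_FIELDS:
        if not sip.manifest.get(name):
            reasons.append(f"manifest field missing: {name}")
    for name in REQUIRED_DI_FIELDS:
        if not sip.description_information.get(name):
            reasons.append(f"description information missing: {name}")
    return reasons


complete = Sip(
    manifest={"contract_number": "C-0001", "submission_id": "S-0001"},
    description_information={"title": "Example object"},
    content_files=["000000/scan.tif"],
)
incomplete = Sip(manifest={"contract_number": "C-0001"})

print(rejection_reasons(complete))    # empty list: accepted
print(rejection_reasons(incomplete))  # two reasons: rejected
```

The content files are deliberately not inspected here, in line with accepting everything digital.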
Ingest
• Automated Ingest Workflow System
– Identification of File Formats
– Characterization (Technical MD Extraction)
– Normalization (Migration on Ingest)
– Creation of Access Derivatives
– Fixity Hashes
• Microservices and Best-Practice Tools
• Every transformation based upon Rules in Format Policy Registry (FPR)
• One workflow (No Exceptions!)
• One FPR Ruleset (No Exceptions!)
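To illustrate the "one ruleset" idea, the sketch below looks up transformations purely by identified format, with no per-depositor branches. It is not Archivematica's FPR data model; the `Rule` fields, the format labels and the rule entries are all invented for the example.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    purpose: str        # e.g. "preservation" or "access"
    source_format: str  # identified format of the incoming file
    target_format: str  # format to normalize to


# One ruleset for every deposit. The entries are invented for illustration.
RULESET = [
    Rule("preservation", "jpeg", "tiff"),
    Rule("access", "tiff", "jpeg"),
    Rule("preservation", "mp3", "wav"),
]


def plan(identified_format: Optional[str]) -> list:
    """Return the rules that apply to a file, looked up only by its format."""
    if identified_format is None:
        # Assumption: files that could not be identified are kept unchanged.
        return []
    return [rule for rule in RULESET if rule.source_format == identified_format]


print(plan("jpeg"))  # one preservation rule
print(plan("docx"))  # no rule in this toy ruleset
print(plan(None))    # unidentified file
```

Since the lookup key is the format alone, changing a policy means editing the ruleset, never the workflow.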
AIP
• Technical MD (PREMIS)
• Identification:
– Known Knowns
– Known Unknowns
– No Unknown Unknowns (after D. Rumsfeld via Matthew Addis)
[Figure: AIP containing Administrative Information, Description Information (DI, Dublin Core), Content Information (normalized binary and text files) and Preservation Description Information (PDI, PREMIS)]
Preservation Levels
• Preservation level is perceived, not assigned:
– Passive (Known Unknown)
– Active (Known Known)
• Core Preservation Actions:
– Re-Identification
– Migration scheduling based on FPR changes
[Figure: Beholder: Rules + Technical MD + Schedule]
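One way to read "perceived, not assigned" is that the level follows from what identification found, and that migration candidates follow from comparing two versions of the ruleset. The sketch below works under that reading; the function names, the holdings structure and the format labels are assumptions, not part of the system described.

```python
def perceived_level(identified_format):
    """Active if the format is a known known, passive if it is a known unknown."""
    return "active" if identified_format else "passive"


def changed_formats(old_rules: dict, new_rules: dict) -> set:
    """Formats whose normalization target differs between two ruleset versions."""
    return {
        fmt
        for fmt in old_rules.keys() | new_rules.keys()
        if old_rules.get(fmt) != new_rules.get(fmt)
    }


def schedule_migration(holdings: dict, old_rules: dict, new_rules: dict) -> list:
    """Return ids of AIPs holding at least one file in a format whose rule changed."""
    changed = changed_formats(old_rules, new_rules)
    return sorted(aip for aip, formats in holdings.items() if changed & set(formats))


# Invented holdings: AIP id -> identified formats of its files (None = unidentified).
holdings = {"aip-1": ["jpeg", "tiff"], "aip-2": ["wav"], "aip-3": [None]}
old = {"jpeg": "tiff"}
new = {"jpeg": "jp2", "wav": "flac"}

print([perceived_level(f) for f in holdings["aip-1"] + holdings["aip-3"]])
print(schedule_migration(holdings, old, new))
```

Re-identification fits the same picture: once a formerly unknown file is identified, its perceived level changes without anyone reassigning it.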
Data Management
• Federated Object Store
– Object = File + Metadata (incl. Checksum)
– Rule-based Synchronization across abstracted physical storage locations
– 1 AIP = 1 Object
• Horizontal replication across data-centers
Metadata stored for AIP objects used for Management & Retrieval
AVUs defined for dataObj bac186cd-4d11-48ac-bb1d-4ab2cd7593cc.tar:
attribute: uuid
value: bac186cd-4d11-48ac-bb1d-4ab2cd7593cc
units:
----
attribute: producerID
value: DE-MUS-019910
units:
----
attribute: submissionID
value: DE-MUS-019910-201505131006
units:
----
attribute: checksum
value: sha2:E4dMTd7/J4z9qg36CSjSzdXXIa4ltgAak+MKfSuPKww=
units:
----
attribute: lastFixityCheck
value: 2015-05-13T08:06:16Z
units:
----
attribute: type
value: AIP
units:
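A fixity check against the `checksum` attribute can be sketched as follows, assuming the `sha2:` prefix denotes a base64-encoded SHA-256 digest (the 44-character value in the listing is consistent with that). The function names are hypothetical, and an in-memory stream stands in for the AIP tar file.

```python
import base64
import hashlib
import io


def sha2_checksum(stream, chunk_size: int = 1 << 20) -> str:
    """Return 'sha2:' followed by the base64-encoded SHA-256 digest of the stream."""
    digest = hashlib.sha256()
    # Read in chunks so that large AIP files do not need to fit into memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return "sha2:" + base64.b64encode(digest.digest()).decode("ascii")


def fixity_ok(stream, stored_checksum: str) -> bool:
    """Compare a freshly computed checksum with the one stored as metadata."""
    return sha2_checksum(stream) == stored_checksum


payload = b"stand-in for the bytes of an AIP tar file"
stored = sha2_checksum(io.BytesIO(payload))

print(stored)
print(fixity_ok(io.BytesIO(payload), stored))
print(fixity_ok(io.BytesIO(payload + b"bit rot"), stored))
```

After a successful check, a value such as `lastFixityCheck` would be updated with the current UTC timestamp.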
Hierarchical Storage Management (HSM)
• Vertical replication:
– Staging Area (Disk Cache)
– Automatic Archiving onto two Tapes in different Tape Libraries
– Checksums per data block
• Staging time offline -> online < 90s (nearline)
Management/Access
• Access and Management System
• Dark Archive – only for Admins and Depositors
• One Object (AIP) – Two Views:
– Admin Access View (us)
– Content Access View (them)
Content Access Compound Object (CACO)
• Mapped Descriptive Metadata (Dublin Core)
• Derivatives / Access copies
[Figure: CACO shown next to the AIP (Administrative Information, Description Information (DI), Preservation Description Information (PDI, PREMIS), normalized binary and text files, submission documentation)]
Admin Access Compound Object (AACO)
• Administrative Description for Access and Management
• Reference to the AIP
[Figure: AACO shown next to the AIP (Administrative Information, Description Information (DI), Preservation Description Information (PDI, PREMIS), normalized binary and text files, submission documentation)]
AIP Data Model
[Figure: the AIP holds the Submission Manifest, the original Metadata (DC, EAD, LIDO, MODS), Description Information (DI) mapped to DC, Preservation Description Information (PDI, PREMIS), the content (text, XML, binary files), normalized binary and text files, derivatives / access copies and the submission documentation (unstructured data: text, emails, ...). The Admin Access Compound Object (AACO) and the Content Access Compound Object (CACO) are shown alongside, with a Reference to the AIP.]
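The "one object, two views" idea can be sketched as two projections of the same AIP. The `Aip` class, its fields and the exact content of each view are assumptions for illustration; the slides only say that the CACO carries mapped descriptive metadata and derivatives, and that the AACO carries an administrative description and a reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aip:
    # Hypothetical, heavily simplified stand-in for one stored AIP.
    uuid: str
    mapped_dc: dict          # Description Information mapped to Dublin Core
    premis: dict             # Preservation Description Information
    manifest: dict           # administrative Submission Manifest
    derivatives: tuple = ()  # names of access copies


def content_access_view(aip: Aip) -> dict:
    """View for depositors: description and access copies, plus a reference."""
    return {"aip": aip.uuid, "dc": aip.mapped_dc, "derivatives": list(aip.derivatives)}


def admin_access_view(aip: Aip) -> dict:
    """View for archive staff: administrative and preservation metadata."""
    return {"aip": aip.uuid, "manifest": aip.manifest, "premis": aip.premis}


aip = Aip(
    uuid="bac186cd-4d11-48ac-bb1d-4ab2cd7593cc",  # uuid taken from the AVU listing in these slides
    mapped_dc={"title": "Example object"},         # made-up value
    premis={"events": ["ingestion"]},              # made-up value
    manifest={"submission_id": "DE-MUS-019910-201505131006"},
    derivatives=("example_access.jpg",),           # made-up value
)

print(content_access_view(aip))
print(admin_access_view(aip))
```

Both views only reference the AIP by its identifier, so the archived package itself is never modified by access or management.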
Preservation Actions
[Architecture overview diagram, with Migration added]
AIP Reingest
• Sponsored Feature (Q1/2016)
• Allows for
– Metadata addition
– Re-Identification
– Re-Normalization
Contingency – Exit Strategy
• Archivematica: only find new ingest workflow
• iRODS: use filesystem
• Islandora: reingest into other repository
• Organisation: Self-contained AIP (transformation req'd)
• No Strategy against Evil, yet!
[Architecture overview diagram, with Migration, annotated:]
• Single Pipeline, OAIS-aligned, Modular
• Data Deposits: Cultural heritage and research data w/ binary content, descriptive metadata and submission information.
• Policies
– control metadata mapping
– normalization through Format Policy Registry (FPR)