XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions
Peter Wittek
Swedish School of Library and Information Science
University of Boras
16/05/11
Outline
1 Workflows and Digital Preservation
2 Computational Requirements of Digital Preservation
3 Preservation Workflow in the Cloud
4 Experimental Results
5 Open Issues
6 Conclusions
Workflows and Digital Preservation
Fundamental Issues in Digital Preservation
Digital objects remain authentic and accessible
Component and management failures
Natural disasters
Attacks
Materials resulting from digital reformatting
Information that is born-digital and has no analog counterpart
Migration, Enrichment, and Other Approaches
Keeping the content of legacy file formats accessible
Most prominent with proprietary file formats
Infrastructure-independent rendering of content
Migration (legal issues)
Dynamic collections: scalability
Reuse
Exploitation with a novel purpose
Sufficient metadata at document and collection level
An Example of Enrichment: ToC Extraction
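As an illustration of what a table-of-contents enrichment step might look like, here is a minimal sketch. It assumes a document whose structure is expressed as nested `section` elements with `title` children; these element names, the sample document, and the functions `extract_toc` and `enrich_with_toc` are hypothetical and not taken from the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are assumptions.
SAMPLE = """<document>
  <section><title>Introduction</title>
    <section><title>Background</title></section>
  </section>
  <section><title>Methods</title></section>
</document>"""


def extract_toc(node, depth=1):
    """Return (depth, title) pairs for nested <section> elements, in document order."""
    entries = []
    for section in node.findall("section"):
        title = section.findtext("title", default="(untitled)").strip()
        entries.append((depth, title))
        # Recurse into subsections, one level deeper.
        entries.extend(extract_toc(section, depth + 1))
    return entries


def enrich_with_toc(root):
    """Insert a <toc> element as the first child of the root (the enrichment)."""
    toc = ET.Element("toc")
    for depth, title in extract_toc(root):
        entry = ET.SubElement(toc, "entry", level=str(depth))
        entry.text = title
    root.insert(0, toc)
    return root


if __name__ == "__main__":
    enriched = enrich_with_toc(ET.fromstring(SAMPLE))
    print(ET.tostring(enriched, encoding="unicode"))
```

The original content is left untouched; the extracted structure is added alongside it as document-level metadata.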
Preserving the Pipeline
Reuse of digital content asks for metadata on both the content and how it was transformed to its most recent form
Document process preservation helps
Architecture-independent description of the intent behind a document process
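One way to picture metadata on how content was transformed is a pipeline runner that records a provenance entry for every step. The sketch below is only an assumption about how this could be done; the function `run_pipeline`, the record fields, and the use of SHA-256 checksums are hypothetical and do not describe the SHAMAN implementation.

```python
import hashlib


def sha256(text):
    """Checksum used to identify the content before and after a step."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_pipeline(content, steps):
    """Apply named steps in order and record what each one did.

    steps is a list of (name, function) pairs; each function maps str -> str.
    Returns the final content and a list of provenance records.
    """
    provenance = []
    for name, func in steps:
        result = func(content)
        provenance.append({
            "step": name,
            "input_sha256": sha256(content),
            "output_sha256": sha256(result),
        })
        content = result
    return content, provenance


if __name__ == "__main__":
    final, records = run_pipeline(
        "  Some Legacy Text  ",
        [("strip", str.strip), ("lowercase", str.lower)],
    )
    for record in records:
        print(record["step"], record["output_sha256"][:12])
```

Because the step names rather than the executing platform are recorded, the log stays meaningful when the same process is later run on a different architecture.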
An XML Processing Pipeline
Deployment
Translation of abstract description of workflow
Eclipse Modeling Framework generates Python source code
Grid implementation using iRODS
Integrated Rule-Oriented Data System
Policy-based data grid software system
Current experiment using Amazon Web Services
Computational Requirements of Digital Preservation
Conversion
Steps of a workflow are computationally expensive
XSLT processors
Processing a single large document tree can take hours
Deep parsing and named entity recognition
May involve high-complexity natural language processing
Ad-hoc computations
Learning
A step towards digital curation
SaaS approach to digital curation
Indexing by Lucene/Nutch
Collection-level metadata extraction by Mahout
Preservation Workflow in the Cloud
MapReduce and Deployment
No internal dependencies for the processes
Designed process is exported via the EMF interface to Python
Simple MapReduce driver to execute the process on individual documents
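The idea of a simple driver that runs the process on individual documents can be sketched as follows. This is a single-machine illustration under stated assumptions, not the driver used in the experiments: `process_document` is a hypothetical stand-in for the exported process, and in a real deployment a MapReduce framework would distribute the map calls across workers.

```python
from collections import Counter
from functools import reduce
import xml.etree.ElementTree as ET


def process_document(xml_text):
    """Hypothetical stand-in for the exported process: count element tags."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


def map_phase(documents):
    """Process every document on its own; no document depends on another,
    so each call could run on a separate worker."""
    return [(doc_id, process_document(text)) for doc_id, text in documents.items()]


def reduce_phase(mapped):
    """Merge the per-document results into collection-level totals."""
    return reduce(lambda total, item: total + item[1], mapped, Counter())


if __name__ == "__main__":
    collection = {
        "a.xml": "<doc><p/><p/></doc>",
        "b.xml": "<doc><p/></doc>",
    }
    print(reduce_phase(map_phase(collection)))
```

Since the map step has no internal dependencies, adding workers changes only how the documents are partitioned, not the code of the process itself.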
The Proposed Architecture
Experimental Results
Cost
Figure: Comparison of average cost of computations with different collection sizes (x-axis: Number of Processing Cores, 1 to 80; y-axis: Average Cost in USD, 0 to 0.08; series: collection sizes 100, 1000, and 10000)
Running time
Figure: Comparison of running times with different collection sizes (x-axis: Number of Processing Cores, 1 to 80; y-axis: Running Time (Mins), 0 to 8000; series: collection sizes 100, 1000, and 10000)
Open Issues
Obstacles to Adoption
Persistence and high-reliability
MapReduce
Not just a technological issue
Service-level agreement
Particularly problematic
Another EU FP7 project working on it: SLA@SOI
Niche for alternative cloud providers
Conclusions
Acknowledgment
Work has been funded by Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN), an EU FP7 large integrated project
http://shaman-ip.eu/shaman/
Summary
Digital preservation is an attractive area to be offered as SaaS
Computational needs
Expertise
Complexity
Since persistence requires architecture-independence, cloud adoption is straightforward
High-reliability can be an issue
Service-level agreements need further research