SCAPE
Dr. Rainer Schmidt, AIT Austrian Institute of Technology GmbH
APA 2011 Conference, London, 8-9 November 2011
The SCAPE Project: Overview, Objectives, and Approaches
SCAPE – what is it about?
• Planning and managing resource-intensive (digital) preservation processes, such as large-scale ingestion, analysis, or modification of digital data sets
• Focus on scalability, robustness, and automation
SCAPE is a follow-up to the highly successful FP6 IP Planets.
SCAPE Project Data
• Project instrument: FP7 Integrated Project (6th Call)
• Objective ICT-2009.4.1: Digital Libraries and Digital Preservation
• Target outcome (a): Scalable systems and services for preserving digital content
• Duration: 42 months (February 2011 – July 2014)
• Budget: 11.3 million Euro (EU funding: 8.6 million Euro)
SCAPE Consortium
Number  Partner name  Short name  Country  ((crd.) = coordinator)
1 (crd.) AIT Austrian Institute of Technology GmbH AIT AT
2 British Library BL UK
3 Internet Memory Foundation IMF NL
4 Ex Libris Ltd EXL IL
5 Fachinformationszentrum Karlsruhe FIZ DE
6 Koninklijke Bibliotheek KB NL
7 KEEP Solutions KEEPS PT
8 Microsoft Research MSR UK
9 Österreichische Nationalbibliothek ONB AT
10 Open Planets Foundation OPF UK
11 Statsbiblioteket Aarhus SB DK
12 Science and Technology Facilities Council STFC UK
13 Technische Universität Berlin TUB DE
14 Technische Universität Wien TUW AT
15 University of Manchester UNIMAN UK
16 Pierre & Marie Curie Université Paris 6 UPMC FR
SCAPE Project Overview
SCAPE will enhance the state of the art in digital preservation in three ways:
• A scalable infrastructure and tools for preservation actions
• Automated, quality-assured preservation workflows
• Integration of these components with policy-based automated preservation planning and watch
SCAPE results will be validated in three large-scale testbeds:
• Digital Repositories
• Web Content
• Research Data Sets
The SCAPE Consortium brings together a broad spectrum of expertise from
• Memory institutions
• Data centres
• Research labs
• Universities
• Industrial firms
Selected SCAPE Data Collections
• Data collections provided by 6 institutions
• Complete Web archives and snapshots of public domains (.dk, .it, .eu, gov.uk, …)
• Millions of digitised newspapers, posters, law gazettes, and 16th–19th century broadsheets
• Collections of multi-file objects such as books, papyri, and incunabula (up to 230 MB/object)
• 100,000 images of East Asian manuscripts at different quality levels
• TBs of voluntary deposit in a wide variety of formats
• 500 TB of broadcast radio and TV output (up to 73 GB/object)
• Many hundreds of thousands of data sets from synchrotron, neutron, and muon instruments
• 30,000 items from a selection of open access journal articles
Selected SCAPE Testbed Scenarios
• Characterise large video files
  • The master MPEG2 files are so large that it is difficult to apply JHOVE, and the detail it provides is insufficient. A detailed characterisation of the MPEG2 streams is needed in order to identify technical dependencies for extracting from or rendering the stream. This would allow preservation risks related to current access services to be monitored, and action taken as necessary to ensure continued access and preservation.
• Carry out large-scale migrations
  • Migrating from one format to another introduces the possibility of damaging the content, or of failing to capture significant properties of the original in the resulting destination format.
  • Specific requirements include:
    • Tools that operate reliably at scale (80 TB, 2 million pages)
    • Automated QA, ideally with no manual intervention on a file-by-file basis
    • QA performed by a process independent of the migration
    • Strong evidence that significant properties are captured in the destination format
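The requirement that QA run independently of the migration can be sketched as a separate check that re-extracts a few significant properties from both source and destination and compares them. This is only an illustrative sketch: `extract_properties` is a hypothetical stand-in for a real characterisation step, and the documents are toy dictionaries rather than files.

```python
def extract_properties(doc: dict) -> dict:
    # Hypothetical stand-in for tool-based characterisation;
    # a real check would parse the actual file.
    return {
        "page_count": doc.get("page_count"),
        "text_checksum": hash(doc.get("text", "")),
    }

def qa_check(source: dict, migrated: dict) -> list:
    """Compare significant properties of source and destination;
    return the names of properties that do not match."""
    src, dst = extract_properties(source), extract_properties(migrated)
    return [k for k in src if src[k] != dst[k]]

# Toy documents standing in for a source file and two migration outputs:
source   = {"page_count": 12, "text": "Lorem ipsum"}
good_out = {"page_count": 12, "text": "Lorem ipsum"}
bad_out  = {"page_count": 11, "text": "Lorem ipsu"}
```

Because the check only reads the two files and shares no code with the migration tool, a systematic migration bug cannot silently pass its own QA.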
• Quality assurance in web harvesting
  • For large-scale crawls, automating the quality control process is a necessary requirement. Currently, this process relies on random sampling and very basic quantitative checks.
[Illustration: digitalbevaring.dk]
Selected SCAPE Challenges
• Bridging the gap between experimental workflows and production scenarios
  • e.g. coping with the amount and size of payload data
• Employing data-intensive technologies
  • for processing binary content
  • for generating and evaluating workflow results
• Exploiting data locality
  • avoiding data transfer by placing processing next to the data
• Repository integration
• Horizontal scalability
  • scalable ingest/access
• Preservation planning
  • automation of monitoring and decision processes
• Automated quality assurance
  • advanced image processing
• Scientific data
  • how to preserve contextual information?
SCAPE Solutions
• SCAPE Platform
  • Environment for carrying out preservation workflows at scale
  • Software package and shared deployment (the Central Instance)
  • Dynamic deployment of environments
    • virtualisation and cloud-based technologies
    • support for native tools and environments
  • Builds upon a data-centric execution platform (Hadoop/Stratosphere)
  • Simple and natural tool support, and automated mapping of graphical (Taverna-based) workflows to a parallel programming model
  • Three levels of parallelisation
    • distribution of files
    • splitting content
    • parallel query execution
  • Repository integration based on two open reference implementations
[Diagram: PPL dataflow program mapped to a multi-stage M/R (MapReduce) flow]
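The first parallelisation level above, distribution of files, can be sketched in the style of a Hadoop Streaming mapper: each input record names a file, and the mapper emits one tab-separated characterisation result per file. This is a minimal sketch, not the SCAPE implementation; `characterise` is a stand-in for a real tool such as JHOVE.

```python
import hashlib

def characterise(path: str) -> dict:
    # Stand-in for a real characterisation tool (e.g. JHOVE);
    # here we only derive cheap "properties" from the file name.
    return {
        "path": path,
        "format": path.rsplit(".", 1)[-1] if "." in path else "unknown",
        "id": hashlib.md5(path.encode()).hexdigest()[:8],
    }

def run_mapper(lines):
    # Hadoop Streaming contract: one input record per line,
    # tab-separated key/value output, one result per file.
    results = []
    for line in lines:
        path = line.strip()
        if not path:
            continue
        props = characterise(path)
        results.append("\t".join([props["path"], props["format"], props["id"]]))
    return results

# Each mapper instance would process its own slice of the file list;
# the framework handles splitting the list and scheduling the instances.
records = run_mapper(["video.mpg", "page.tiff", ""])
```

Because each file is processed independently, the framework can run as many mapper instances as there are cluster slots, which is exactly what makes file-level distribution the simplest of the three levels.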
SCAPE Solutions
• OPF Result Evaluation Framework (REF)
  • Large RDF quadstore for storing SCAPE workflow results
    • developed in cooperation with the University of Southampton
  • Shared database to publish and query these results
  • Supports progress tracking and monitoring over time
  • Input for Preservation Planning and Watch
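An RDF quadstore holds (subject, predicate, object, graph) tuples and answers pattern queries over them. A minimal in-memory sketch of the idea follows; the quads shown are invented examples, not real REF data or vocabulary.

```python
Quad = tuple  # (subject, predicate, object, graph)

class QuadStore:
    """Toy quadstore: stores (s, p, o, g) tuples, queries by pattern."""

    def __init__(self):
        self.quads = []

    def add(self, s, p, o, g):
        self.quads.append((s, p, o, g))

    def match(self, s=None, p=None, o=None, g=None):
        # None acts as a wildcard, like an unbound SPARQL variable.
        return [q for q in self.quads
                if all(want is None or want == got
                       for want, got in zip((s, p, o, g), q))]

store = QuadStore()
# Hypothetical workflow-result quads (illustrative only):
store.add("run:42", "ex:tool", "jhove-1.6", "graph:results")
store.add("run:42", "ex:outcome", "success", "graph:results")
store.add("run:43", "ex:outcome", "failure", "graph:results")

# "Which runs failed?" - the kind of query Planning and Watch would issue:
failures = store.match(p="ex:outcome", o="failure")
```

The fourth element, the named graph, is what distinguishes a quadstore from a triplestore: it lets results from different workflows or time periods be kept apart and queried separately, which is what enables progress tracking over time.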
SCAPE Solutions
• Context-aware Planning and Watch
  • Automated watch monitoring
    • trends in web harvests and repositories
    • linked with the Result Evaluation Framework (REF) database
  • Formalised policy model and representation
    • using semantic technologies
  • Automated Planning
    • building on the Planets PLATO tool
    • key factors and decision criteria
    • automated policy-driven planning
SCAPE Solutions
• Automated Quality Assurance
  • QA in web harvesting and digitisation through automated comparison of rendered pages
  • Characterisation – feature extraction
    • Level 1 – metadata information, using characterisation components
    • Level 2 – global content description: discriminant global features for individual media types
    • Level 3 – structural content description: detecting structural similarities in images
  • Comparison
    • discrete solutions and smart metrics (levels 2+3)
    • development of metrics and measures of similarity, quality, and relationship to user perception
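A Level-2 comparison of the kind described above can be sketched with a global grey-level histogram and an intersection score. Real SCAPE QA components use far richer features and perceptual metrics; this toy version, with 2x2 "images" as nested lists, only illustrates the principle of comparing global content descriptions.

```python
def histogram(image, bins=4, max_val=255):
    """Global grey-level histogram of a 2-D image (list of pixel rows),
    normalised so the bins sum to 1.0 - a crude Level-2 feature."""
    counts = [0] * bins
    total = 0
    for row in image:
        for px in row:
            idx = min(px * bins // (max_val + 1), bins - 1)
            counts[idx] += 1
            total += 1
    return [c / total for c in counts]

def similarity(img_a, img_b, bins=4):
    """Histogram intersection: 1.0 for identical distributions,
    lower when the global content differs."""
    ha, hb = histogram(img_a, bins), histogram(img_b, bins)
    return sum(min(a, b) for a, b in zip(ha, hb))

# Toy images: an original, a faithful re-render, and a corrupted dark copy.
original = [[200, 210], [190, 205]]
rerender = [[201, 209], [191, 204]]
darker   = [[10, 20], [15, 5]]
```

In a QA pipeline, a score near 1.0 would let the pair pass automatically, while a low score would flag the object for attention, replacing the random sampling mentioned earlier with a check on every object.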
Selected Achievements
• Public website: http://www.scape-project.eu/
• Development infrastructure hosted by the Open Planets Foundation and GitHub: http://wiki.opf-labs.org/display/SP/Home
• First deliverables available for download
• Publications
  • 13 in the first nine months, including 6 at iPres last week
  • Report: Comparative analysis of identification tools
  • Report: Analysis of scalability challenges for Digital Object Repositories – classification and design of approaches
• Platform infrastructure
  • 10-node (dual-core), 20 TB experimental cluster hosted by AIT
  • virtualisation based on Xen + Eucalyptus
  • hardware for the Platform's Central Instance currently being set up within the data centre at IMF
Thank you for your attention!