SCAPE
Dr. Rainer Schmidt, AIT Austrian Institute of Technology GmbH
APA 2011 Conference, London, 8-9 November 2011
The SCAPE Project: Overview, Objectives, and Approaches
SCAPE – what is it about?
• Planning and managing resource-intensive (digital) preservation processes, such as large-scale ingestion, analysis, or modification of digital data sets
• Focus on scalability, robustness, and automation
SCAPE is a follow-up to the highly successful FP6 IP Planets.
SCAPE Project Data
• Project instrument: FP7 Integrated Project (6th Call)
• Objective ICT-2009.4.1: Digital Libraries and Digital Preservation
• Target outcome (a): Scalable systems and services for preserving digital content
• Duration: 42 months (February 2011 – July 2014)
• Budget: 11.3 million Euro (EU funding: 8.6 million Euro)
SCAPE Consortium
Number  Partner name  Short name  Country  ((crd.) = coordinator)
1 (crd.) AIT Austrian Institute of Technology GmbH AIT AT
2 British Library BL UK
3 Internet Memory Foundation IMF NL
4 Ex Libris Ltd EXL IL
5 Fachinformationszentrum Karlsruhe FIZ DE
6 Koninklijke Bibliotheek KB NL
7 KEEP Solutions KEEPS PT
8 Microsoft Research MSR UK
9 Österreichische Nationalbibliothek ONB AT
10 Open Planets Foundation OPF UK
11 Statsbiblioteket Aarhus SB DK
12 Science and Technology Facilities Council STFC UK
13 Technische Universität Berlin TUB DE
14 Technische Universität Wien TUW AT
15 University of Manchester UNIMAN UK
16 Pierre & Marie Curie Université Paris 6 UPMC FR
SCAPE Project Overview
SCAPE will enhance the state of the art in digital preservation in three ways:
• A scalable infrastructure and tools for preservation actions
• Automated, quality-assured preservation workflows
• Integration of these components with policy-based automated preservation planning and watch
SCAPE results will be validated in three large-scale testbeds:
• Digital Repositories
• Web Content
• Research Data Sets
The SCAPE Consortium brings together a broad spectrum of expertise from
• Memory institutions
• Data centres
• Research labs
• Universities
• Industrial firms
Selected SCAPE Data Collections
• Data collections provided by 6 institutions
• Complete Web archives and snapshots of public domains (.dk, .it, .eu, gov.uk, …)
• Millions of digitised newspapers, posters, law gazettes, and 16th–19th century broadsheets
• Collections of multi-file objects such as books, papyri, and incunabula (up to 230 MB/object)
• 100,000 images of East Asian manuscripts at different quality levels
• TBs of voluntary deposit in a wide variety of formats
• 500 TB of broadcast radio and TV output (up to 73 GB/object)
• Many hundreds of thousands of data sets from synchrotron, neutron, and muon instruments
• 30,000 items from a selection of open access journal articles
Selected SCAPE Testbed Scenarios
• Characterise large video files
  • The master MPEG2 files are so large that it is difficult to apply JHOVE, and the detail it provides is insufficient. A detailed characterisation of the MPEG2 streams is needed in order to identify technical dependencies for extracting from or rendering the stream. This would allow preservation risks related to current access services to be monitored, and action taken as necessary to ensure continued access and preservation.
• Carry out large-scale migrations
  • Migrating from one format to another introduces the possibility of damaging the content, or of failing to capture significant properties of the original in the resulting destination format.
  • Specific requirements include:
    • Tools that operate reliably at scale (80 TB, 2 million pages)
    • Automated QA, ideally with no manual intervention on a file-by-file basis
    • QA performed by a process independent of the migration
    • Strong evidence that significant properties are captured in the destination format
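The requirement that QA run independently of the migration can be sketched as a separate check that re-extracts a few significant properties from both source and destination and compares them. This is only an illustrative sketch: `extract_properties` is a hypothetical stand-in for a real characterisation step, and the documents are toy dictionaries rather than files.

```python
def extract_properties(doc: dict) -> dict:
    # Hypothetical stand-in for tool-based characterisation;
    # a real check would parse the actual file.
    return {
        "page_count": doc.get("page_count"),
        "text_checksum": hash(doc.get("text", "")),
    }

def qa_check(source: dict, migrated: dict) -> list:
    """Compare significant properties of source and destination;
    return the names of properties that do not match."""
    src, dst = extract_properties(source), extract_properties(migrated)
    return [k for k in src if src[k] != dst[k]]

# Toy documents standing in for a source file and two migration outputs:
source   = {"page_count": 12, "text": "Lorem ipsum"}
good_out = {"page_count": 12, "text": "Lorem ipsum"}
bad_out  = {"page_count": 11, "text": "Lorem ipsu"}
```

Because the check only reads the two files and shares no code with the migration tool, a systematic migration bug cannot silently pass its own QA.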
• Quality assurance in web harvesting
  • For large-scale crawls, automating the quality control process is a necessary requirement. Currently, this process relies on random sampling and very basic quantitative checks.
[Illustration: digitalbevaring.dk]
Selected SCAPE Challenges
• Bridging the gap between experimental workflows and production scenarios
  • e.g. coping with the amount and size of payload data
• Employing data-intensive technologies
  • for processing binary content
  • for generating and evaluating workflow results
• Exploiting data locality
  • avoiding data transfer by placing processing next to the data
• Repository integration
• Horizontal scalability
  • scalable ingest/access
• Preservation planning
  • automation of monitoring and decision processes
• Automated quality assurance
  • advanced image processing
• Scientific data
  • how to preserve contextual information?
SCAPE Solutions
• SCAPE Platform
  • Environment for carrying out preservation workflows at scale
  • Software package and shared deployment (the Central Instance)
  • Dynamic deployment of environments
    • virtualisation and cloud-based technologies
    • support for native tools and environments
  • Builds upon a data-centric execution platform (Hadoop/Stratosphere)
  • Simple and natural tool support, and automated mapping of graphical (Taverna-based) workflows to a parallel programming model
  • Three levels of parallelisation
    • distribution of files
    • splitting content
    • parallel query execution
  • Repository integration based on two open reference implementations
[Diagram: PPL dataflow program mapped to a multi-stage M/R (MapReduce) flow]
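The first parallelisation level above, distribution of files, can be sketched in the style of a Hadoop Streaming mapper: each input record names a file, and the mapper emits one tab-separated characterisation result per file. This is a minimal sketch, not the SCAPE implementation; `characterise` is a stand-in for a real tool such as JHOVE.

```python
import hashlib

def characterise(path: str) -> dict:
    # Stand-in for a real characterisation tool (e.g. JHOVE);
    # here we only derive cheap "properties" from the file name.
    return {
        "path": path,
        "format": path.rsplit(".", 1)[-1] if "." in path else "unknown",
        "id": hashlib.md5(path.encode()).hexdigest()[:8],
    }

def run_mapper(lines):
    # Hadoop Streaming contract: one input record per line,
    # tab-separated key/value output, one result per file.
    results = []
    for line in lines:
        path = line.strip()
        if not path:
            continue
        props = characterise(path)
        results.append("\t".join([props["path"], props["format"], props["id"]]))
    return results

# Each mapper instance would process its own slice of the file list;
# the framework handles splitting the list and scheduling the instances.
records = run_mapper(["video.mpg", "page.tiff", ""])
```

Because each file is processed independently, the framework can run as many mapper instances as there are cluster slots, which is exactly what makes file-level distribution the simplest of the three levels.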
SCAPE Solutions
• OPF Result Evaluation Framework (REF)
  • Large RDF quadstore for storing SCAPE workflow results
    • developed in cooperation with the University of Southampton
  • Shared database to publish and query these results
  • Supports progress tracking and monitoring over time
  • Input for Preservation Planning and Watch
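An RDF quadstore holds (subject, predicate, object, graph) tuples and answers pattern queries over them. A minimal in-memory sketch of the idea follows; the quads shown are invented examples, not real REF data or vocabulary.

```python
Quad = tuple  # (subject, predicate, object, graph)

class QuadStore:
    """Toy quadstore: stores (s, p, o, g) tuples, queries by pattern."""

    def __init__(self):
        self.quads = []

    def add(self, s, p, o, g):
        self.quads.append((s, p, o, g))

    def match(self, s=None, p=None, o=None, g=None):
        # None acts as a wildcard, like an unbound SPARQL variable.
        return [q for q in self.quads
                if all(want is None or want == got
                       for want, got in zip((s, p, o, g), q))]

store = QuadStore()
# Hypothetical workflow-result quads (illustrative only):
store.add("run:42", "ex:tool", "jhove-1.6", "graph:results")
store.add("run:42", "ex:outcome", "success", "graph:results")
store.add("run:43", "ex:outcome", "failure", "graph:results")

# "Which runs failed?" - the kind of query Planning and Watch would issue:
failures = store.match(p="ex:outcome", o="failure")
```

The fourth element, the named graph, is what distinguishes a quadstore from a triplestore: it lets results from different workflows or time periods be kept apart and queried separately, which is what enables progress tracking over time.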
SCAPE Solutions
• Context-aware Planning and Watch
  • Automated watch monitoring
    • trends in web harvests and repositories
    • linked with the Result Evaluation Framework (REF) database
  • Formalised policy model and representation
    • using semantic technologies
  • Automated Planning
    • building on the Planets PLATO tool
    • key factors and decision criteria
    • automated policy-driven planning
SCAPE Solutions
• Automated Quality Assurance
  • QA in web harvesting and digitisation through automated comparison of rendered pages
  • Characterisation – feature extraction
    • Level 1 – metadata information, using characterisation components
    • Level 2 – global content description: discriminant global features for individual media types
    • Level 3 – structural content description: detecting structural similarities in images
  • Comparison
    • discrete solutions and smart metrics (levels 2+3)
    • development of metrics and measures of similarity, quality, and relationship to user perception
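A Level-2 comparison of the kind described above can be sketched with a global grey-level histogram and an intersection score. Real SCAPE QA components use far richer features and perceptual metrics; this toy version, with 2x2 "images" as nested lists, only illustrates the principle of comparing global content descriptions.

```python
def histogram(image, bins=4, max_val=255):
    """Global grey-level histogram of a 2-D image (list of pixel rows),
    normalised so the bins sum to 1.0 - a crude Level-2 feature."""
    counts = [0] * bins
    total = 0
    for row in image:
        for px in row:
            idx = min(px * bins // (max_val + 1), bins - 1)
            counts[idx] += 1
            total += 1
    return [c / total for c in counts]

def similarity(img_a, img_b, bins=4):
    """Histogram intersection: 1.0 for identical distributions,
    lower when the global content differs."""
    ha, hb = histogram(img_a, bins), histogram(img_b, bins)
    return sum(min(a, b) for a, b in zip(ha, hb))

# Toy images: an original, a faithful re-render, and a corrupted dark copy.
original = [[200, 210], [190, 205]]
rerender = [[201, 209], [191, 204]]
darker   = [[10, 20], [15, 5]]
```

In a QA pipeline, a score near 1.0 would let the pair pass automatically, while a low score would flag the object for attention, replacing the random sampling mentioned earlier with a check on every object.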
Selected Achievements
• Public website: http://www.scape-project.eu/
• Development infrastructure hosted by the Open Planets Foundation and GitHub: http://wiki.opf-labs.org/display/SP/Home
• First deliverables available for download
• Publications
  • 13 in the first nine months, including 6 at iPres last week
  • Report: Comparative analysis of identification tools
  • Report: Analysis of scalability challenges for Digital Object Repositories – classification and design of approaches
• Platform infrastructure
  • 10-node (dual-core), 20 TB experimental cluster hosted by AIT
  • virtualisation based on Xen + Eucalyptus
  • hardware for the Platform's Central Instance currently being set up within the data centre at IMF
Thank you for your attention!