Digital preservation: caring for our data to foster knowledge discovery and dissemination
Claudia Bauzer Medeiros
Institute of Computing
UNICAMP
Prae-servare
(Before) – (Save)
= save before it disappears
Maintain
Manu-tenere
= being able to get/find it
Dec 2008
Feb 2010
Data deluge
• By the end of 2011 – info created and replicated > 1.8 zettabytes
• 90% of data created in the last 2 years
• 5-hour flight – 240 Tbytes
• Facebook – 200 million users, >70 languages
• Each person in England is filmed 300 times/day
• Teenagers in the US send an average of 110 phone text messages a day
=> We need to build arks during the deluge - PRESERVATION
Outline
• Why preserve?
• What to preserve?
• How to preserve?
• Where to preserve?
And a few associated challenges
WHY PRESERVE
• Costly to produce
– Infrastructure, power, software, models, visualization, people
– Hardware, Software, Peopleware
• Contribute to progress of science
– Reproducibility and reusability
– Publication and sharing
– Quality
• Intrinsic value: culture/science/sustainability
– Digital humanities
– Domesday project
– Fonoteca Neotropical Jacques Vielliard
The Domesday Project 1086-1986
• Digital decay
• Equipment obsolescence
• Software obsolescence
Domesday reloaded
Fonoteca Neotropical Jacques Vielliard
What to preserve?
• Data
• BUT what is “data”?
– Files and records
– Models, documentation, annotations, sketches, experiments, recordings
• Only data?
– How it was produced: workflows, devices, methodologies, materials and methods, reasoning, logs (provenance)
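As an illustration of storing provenance next to the data, here is a minimal Python sketch. The `ProvenanceRecord` class, its field names and every sample value are hypothetical and were chosen only for this example; they are not taken from any standard or from the talk.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ProvenanceRecord:
    """Hypothetical, minimal description of how one data file was produced."""
    data_file: str
    workflow: str
    devices: List[str] = field(default_factory=list)
    methods: str = ""
    log: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Plain-text JSON stays readable even if this program is lost.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only (made up for this sketch).
record = ProvenanceRecord(
    data_file="recording_0001.wav",
    workflow="field recording, then noise filtering, then annotation",
    devices=["portable recorder", "directional microphone"],
    methods="Recorded in the field and annotated by a curator",
    log=["step 1: recorded", "step 2: annotated"],
)

print(record.to_json())
```

Keeping the record as plain JSON beside the data file is one possible design choice: the description then does not depend on the software that produced it.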
What to preserve?
• Data
• Environment in which it was produced
• The data needed for preservation occupies more space than the data itself
• Preservation means storing more than the object itself
What about our research data? (slide adapted from Jim Gray)
[Diagram labels: Questions, Answers; “Collaboratory”, data-driven science; sources of DATA: models, simulations, papers, files, experiments, instruments]
Data sources?
Table of Product Characteristics
id          Property name   Value
MilkProd    productsrep     MilkA
MilkProd    quantity        10000
MilkProd    validity date   10/06/2006
CheeseProd  productsrep     Minas
CheeseProd  quantity        2000
CheeseProd  validity date   12/02/2006
CheeseProd  shape           Circular
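The table stores one property per row, so different products can carry different properties (only CheeseProd has a shape). A minimal Python sketch of regrouping such rows into one record per product follows; the `pivot` helper is a hypothetical name introduced here, and the rows are copied from the table.

```python
from collections import defaultdict

# Rows copied from the Table of Product Characteristics: (id, property name, value).
rows = [
    ("MilkProd", "productsrep", "MilkA"),
    ("MilkProd", "quantity", "10000"),
    ("MilkProd", "validity date", "10/06/2006"),
    ("CheeseProd", "productsrep", "Minas"),
    ("CheeseProd", "quantity", "2000"),
    ("CheeseProd", "validity date", "12/02/2006"),
    ("CheeseProd", "shape", "Circular"),
]


def pivot(rows):
    """Group (id, property, value) rows into one dictionary per id."""
    products = defaultdict(dict)
    for product_id, prop, value in rows:
        products[product_id][prop] = value
    return dict(products)


products = pivot(rows)
print(products["CheeseProd"]["shape"])   # Circular
print("shape" in products["MilkProd"])   # False
```

Values are kept as strings on purpose: the table does not say whether the dates are day/month or month/day, so the sketch does not guess.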
eEnvironmental Science
• Direct and indirect observations
Data sources
We are DATASCOPE engineers
Software is the device/tool
How to preserve?
How to construct the ark during the deluge?
Praeservare, manu tenere and share
How to preserve?
• To ensure retrievability and sharing
– Index structures
– Ontologies, metadata, keywords, standards
– Workflows
• To ensure longevity
– Media decay, software decay, hardware decay
– PEOPLE DECAY
• To ensure quality
– Curation procedures, metadata, standards
• To afford maintenance costs
– Cloud? CAP theorem? => WHERE
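One common way to detect media decay is a periodic fixity check: store a checksum for each file, then recompute and compare later. The sketch below is only an illustration using Python's standard `hashlib`; the function names `sha256_of`, `build_manifest` and `changed_files`, and the sample files, are assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict:
    """Map each file name in the folder to its checksum."""
    return {p.name: sha256_of(p) for p in sorted(folder.iterdir()) if p.is_file()}


def changed_files(folder: Path, manifest: dict) -> list:
    """List files whose current checksum differs from the manifest (or that are missing)."""
    current = build_manifest(folder)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)


# Demonstration on a temporary folder with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp)
    (archive / "a.dat").write_bytes(b"observations")
    (archive / "b.dat").write_bytes(b"model output")
    manifest = build_manifest(archive)
    intact = changed_files(archive, manifest)
    (archive / "b.dat").write_bytes(b"model outpuT")  # simulate silent corruption
    damaged = changed_files(archive, manifest)

print(intact, damaged)  # [] ['b.dat']
```

A checksum only reveals that a file changed. Repairing it still requires a second copy stored elsewhere.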
Sharing and open access
NSF Data Management Policy
Paper and data publication
Sharing of Data Leads to Progress on Alzheimer’s
By Gina Kolata, The New York Times
Published: August 12, 2010
In 2003, a group of scientists and executives from the National Institutes of Health, the Food and
Drug Administration, the drug and medical-imaging industries, universities and nonprofit groups
joined in a project that experts say had no precedent: a collaborative effort to find the biological
markers that show the progression of Alzheimer’s disease in the human brain.
share all the data, making every single
finding public immediately, available to
anyone with a computer anywhere in the
world
=> AVAILABILITY and REUSE
• Data must be properly curated throughout its
life-cycle and released with the appropriate
high-quality metadata.
• Medical Research Council UK
• Research data should be made available for
use by other researchers. Researchers must
retain research data, including electronic data,
in a durable, indexed and retrievable form.
• Australian Government National Health and Medical Research Council
Microsoft Academic Search
40M publications
19M authors
75 publishers (Wiley, Springer, ACM, IEEE …)
Google Scholar Citations
• Citing data is as important as citing papers
• For researchers, publishers, data centers
• Over 1M DOIs, several major national research libraries
– Germany, France, Korea, Netherlands, Australia,
USA...
• Present manager – German National Library of
Science and Technology
Publish on the Cloud
Add metadata
Pre-print sharing
FNJV
proj.lis.ic.unicamp.br/fnjv
• Sharing by publishing on the Web
• Retrievability by extending metadata
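To illustrate why extending metadata improves retrievability, here is a small Python sketch of a keyword index over metadata records. The record identifiers, titles and keywords are invented for the example and do not describe the actual FNJV catalogue; `build_index` and `search` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical metadata records; all fields and values are illustrative only.
records = {
    "rec-1": {"title": "Dawn chorus", "keywords": ["bird", "forest"]},
    "rec-2": {"title": "Frog call", "keywords": ["amphibian", "wetland"]},
    "rec-3": {"title": "Night song", "keywords": ["bird", "wetland"]},
}


def build_index(records):
    """Inverted index: keyword -> set of record ids carrying that keyword."""
    index = defaultdict(set)
    for record_id, metadata in records.items():
        for keyword in metadata["keywords"]:
            index[keyword.lower()].add(record_id)
    return index


def search(index, *keywords):
    """Return the ids of records that carry every requested keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*sets)) if sets else []


index = build_index(records)
before = search(index, "night")                # nothing is tagged "night" yet

records["rec-3"]["keywords"].append("night")   # extend the metadata
index = build_index(records)
after = search(index, "night")

print(before, after)                           # [] ['rec-3']
print(search(index, "bird", "wetland"))        # ['rec-3']
```

The recording itself did not change. Only its description did, and that alone made it findable.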
CURATION AND USE OF STANDARDS
Workflows and model preservation
[Figure: Comb-e-Chem. Labels: X-Ray e-Lab, Analysis, Properties, Properties e-Lab, Simulation, Video, Diffractometer, Grid Middleware, Structures Database]
The cloud and CAP
PRE-SAVE and MANU-TENERE
Outline
• Why preserve?
– Costly to produce (hardware, software, peopleware)
– Contribute to progress of science
– Value – culture, science, sustainability
• What to preserve?
– Data [WHAT IS DATA?]
– Context of production and use
• How to preserve?
– Accessibility and sharing – standards, metadata, ontologies
– Integrity and quality – context to use (hw, sw), standards
References
NSF – CISE Data management policy
The Domesday Project
http://www.atsf.co.uk/dottext/domesday.html
The CLARIN Project (languages)
Eigenfactor.org
Altmetrics movement
Thank you!!!!