VO Sandpit, November 2009
Openness, trust, transparency: access during the data deluge
Sarah Callaghan
@sorcha_ni
Standing on the Digits of Giants: Research data, preservation and innovation, ALPSP Seminar, London, 8 March 2016
The Data Deluge
http://www.economist.com/node/21521549
http://www.leadformix.com/blog/2013/02/the-big-data-deluge/
Example Big Data: CMIP5
CMIP5: Fifth Coupled Model Intercomparison Project
• Global community activity under the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP)
• Aim:
– to address outstanding scientific questions that arose as part of the 4th Assessment Report process,
– to improve understanding of climate, and
– to provide estimates of future climate change that will be useful to those considering its possible consequences.
Many distinct experiments, with very different characteristics, which influence the configuration of the models (what they can do, and how they should be interpreted).
CMIP5 numbers!
Simulations:
~90,000 years
~60 experiments
~20 modelling centres (from around the world) using
~30 major(*) model configurations
~2 million output “atomic” datasets
~10s of petabytes of output
~2 petabytes of CMIP5 requested output
~1 petabyte of CMIP5 “replicated” output, which is replicated at a number of sites (including ours)
Major international collaboration!
Funded by EU FP7 projects (IS-ENES2, Metafor) and US (ESG) and other national sources (e.g. NERC for the UK)
Big Data: the Human Genome
Hard copy of the Human Genome at the Wellcome Collection
Most people have an idea of what a publication is
Some examples of data (just from the Earth Sciences)
1. Time series, some still being updated, e.g. meteorological measurements
2. Large 4D synthesised datasets, e.g. Climate, Oceanographic, Hydrological and Numerical Weather Prediction model data generated on a supercomputer
3. 2D scans, e.g. satellite data, weather radar data
4. 2D snapshots, e.g. cloud camera
5. Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature
6. Datasets consisting of data from multiple instruments as part of the same measurement campaign
7. Physical samples, e.g. fossils
The Understandability Challenge: Article
The Understandability Challenge: Data
What the data set looks like on disk
What the raw data files look like.
I could make these files open easily, but no one would have a clue how to use them!
It’s not just data!
• Experimental protocols
• Workflows
• Software code
• Metadata
• Things that went wrong!
• …
Creating a dataset is hard work!
"Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com
Documenting a dataset so that it is usable and understandable by others is extra work!
Open is not enough!
“When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.”
- http://ivory.idyll.org/blog/data-management.html https://flic.kr/p/awnCQu
It’s ok, I’ll just put it out there and if it’s important other people will figure it out
These documents have been preserved for thousands of years! But they’ve both been translated many times, with different meanings each time.
We need Metadata to preserve Information!
Phaistos Disk, 1700BC
Usability, trust, metadata
http://trollcats.com/2009/11/im-your-friend-and-i-only-want-whats-best-for-you-trollcat/
When you read a journal paper, it’s easy to read and get a quick understanding of the quality of the paper.
You don’t want to be downloading many GB of dataset to open it and see if it’s any use to you.
Need to use proxies for quality:
• Do you know the data source/repository? Can you trust it?
• Is there enough metadata so that you can understand and/or use the data?
In the same way that not all journal publishers are created equal, not all data repositories are created equal
Example metadata from a published dataset:
“rain.csv contains rainfall in mm for each month at Marysville, Victoria from January 1995 to February 2009”
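The one-line description above could also be written down in a machine-readable form, so that a reader can check it before downloading anything. A minimal sketch, assuming a made-up set of field names (they are illustrative only and do not follow any particular metadata standard):

```python
import json

# Hypothetical metadata record for the rain.csv example; the field names
# are illustrative assumptions, not a published metadata standard.
metadata = {
    "file": "rain.csv",
    "variable": "rainfall",
    "units": "mm",
    "temporal_resolution": "monthly",
    "location": "Marysville, Victoria",
    "start": "1995-01",
    "end": "2009-02",
}

# Fields a reader would need before deciding whether to download the file
# (an assumed minimum, chosen for this sketch).
REQUIRED = ("file", "variable", "units", "location", "start", "end")


def missing_fields(record, required=REQUIRED):
    """Return the required fields that are absent or empty in the record."""
    return [name for name in required if not record.get(name)]


def months_covered(start, end):
    """Count months from start to end inclusive, both given as 'YYYY-MM'."""
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


if __name__ == "__main__":
    print(json.dumps(metadata, indent=2))
    print("missing:", missing_fields(metadata))
    print("months:", months_covered(metadata["start"], metadata["end"]))
```

Even a record this small answers the proxy questions: what was measured, in which units, where, and over which period.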
Should ALL data be open?
Most data produced through publicly funded research should be open.
But!
• Confidentiality issues (e.g. named persons’ health records)
• Conservation issues (e.g. maps of locations of rare animals at risk from poachers)
• Security issues (e.g. data and methodologies for building biological weapons)
There should be a very good reason for publicly funded data to not be open.
Open/Closed/Published/Unpublished
[Figure: axes Openness and Quality; labelled items: CD, Webpage, OA journal, Subs journal, Data repository, Shared work space]
We want to encourage researchers to make their data:
• Open
• Persistent
• Quality assured:
– through scientific peer review
– or repository-managed processes
Unless there’s a very good reason not to!
Publishing = making something public after some formal process which adds value for the consumer (e.g. peer review) and provides commitment to persistence
Peer review, data and data journals
• Peer-review of a scientific publication is generally only applied to analysis, interpretation and conclusions, and not the underlying data.
• But if the conclusions are valid, the data must be of good quality.
• We need quality assurance of the data underlying research publications – either through peer-review or data repository checking.
• Researchers need credit for creating, managing and opening their data.
• Data journals provide that credit in an environment where academic status is solely based on publication record.
http://libguides.luc.edu/content.php?pid=5464&sid=164619
Faster horses?
• With all the innovation that the Web offers us, journal papers on-line still look substantially the same as print versions (with some exceptions)
• Are we just using the web as the same technology as print, but with faster horses?
https://flic.kr/p/2UZkgn
Redo from Start?
What would happen if we started again?
https://www.force11.org/group/scholarly-commons-working-group
Workshop held in Madrid, 29-30 February
https://www.force11.org/group/scholarly-commons-working-group/madrid-workshop
Coming up with new ideas to re-invent the scholarly commons!
https://docs.google.com/document/d/1ye2v0jN8uBpQy0etfD0FxOc-CdjEhtok0EV3wr95CB4/edit
https://miriamsmuse.wordpress.com/2012/08/29/out-of-cheese-error-redo-from-start/
Workshop visualisation
http://variable.io/force11/futurecommons/
Summary and maybe conclusions?
• We need to open the products of research
– to encourage innovation and collaboration
– to give credit to the people who’ve created them
– to be transparent and trustworthy
• Openness does come at a cost!
• It’s not enough for data to be open
– it needs to be usable and understandable too
• Data citation and publication are ways of encouraging researchers to make their data open
– or at least tell the world that their data exists!
• We need a culture change – but it’s already happening!
http://www.keepcalm-o-matic.co.uk/default.aspx#createposter
Thanks!
Any questions?
@sorcha_ni
http://citingbytes.blogspot.co.uk/
“Publishing research without data is simply advertising, not science” - Graham Steel
http://blog.okfn.org/2013/09/03/publishing-research-without-data-is-simply-advertising-not-science/
http://heywhipple.com/dont-show-me-a-something-about-show-me-something/