VO Sandpit, November 2009

Openness, trust, transparency: access during the data deluge

Sarah Callaghan [email protected]

@sorcha_ni

Standing on the Digits of Giants: Research data, preservation and innovation, ALPSP Seminar, London, 8 March 2016


The Data Deluge

http://www.economist.com/node/21521549

http://www.leadformix.com/blog/2013/02/the-big-data-deluge/


Example Big Data: CMIP5

CMIP5: Fifth Coupled Model Intercomparison Project

• Global community activity under the World Meteorological Organisation (WMO) via the World Climate Research Programme (WCRP)

• Aim:

– to address outstanding scientific questions that arose as part of the 4th Assessment Report process,

– improve understanding of climate, and

– to provide estimates of future climate change that will be useful to those considering its possible consequences.

Many distinct experiments, with very different characteristics, which influence the configuration of the models (what they can do, and how they should be interpreted).


Simulations:

~90,000 years

~60 experiments

~20 modelling centres (from around the world) using

~30 major(*) model configurations

~2 million output “atomic” datasets

~10s of petabytes of output

~2 petabytes of CMIP5 requested output

~1 petabyte of CMIP5 “replicated” output, which is replicated at a number of sites (including ours)

Major international collaboration!

Funded by EU FP7 projects (IS-ENES2, Metafor) and US (ESG) and other national sources (e.g. NERC for the UK)

CMIP5 numbers!


Big Data: the Human Genome

Hard copy of the Human Genome at the Wellcome Collection


Most people have an idea of what a publication is


Some examples of data (just from the Earth Sciences)

1. Time series, some still being updated e.g. meteorological measurements

2. Large 4D synthesised datasets, e.g. Climate, Oceanographic, Hydrological and Numerical Weather Prediction model data generated on a supercomputer

3. 2D scans e.g. satellite data, weather radar data

4. 2D snapshots, e.g. cloud camera

5. Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature

6. Datasets consisting of data from multiple instruments as part of the same measurement campaign

7. Physical samples, e.g. fossils


The Understandability Challenge: Article


What the data set looks like on disk

What the raw data files look like.

I could make these files open easily, but no one would have a clue how to use them!

The Understandability Challenge: Data


It’s not just data!

• Experimental protocols

• Workflows

• Software code

• Metadata

• Things that went wrong!

• …


Creating a dataset is hard work!

"Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com

Documenting a dataset so that it is usable and understandable by others is extra work!


Open is not enough!

“When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.”

- http://ivory.idyll.org/blog/data-management.html https://flic.kr/p/awnCQu


It’s ok, I’ll just put it out there and if it’s important other people will figure it out

These documents have been preserved for thousands of years! But they’ve both been translated many times, with different meanings each time.

We need Metadata to preserve Information!

Phaistos Disk, 1700BC


Usability, trust, metadata

http://trollcats.com/2009/11/im-your-friend-and-i-only-want-whats-best-for-you-trollcat/

When you read a journal paper, it’s easy to get a quick understanding of its quality.

You don’t want to download many GB of data just to open it and see if it’s any use to you.

Need to use proxies for quality:

• Do you know the data source/repository? Can you trust it?

• Is there enough metadata so that you can understand and/or use the data?

In the same way that not all journal publishers are created equal, not all data repositories are created equal

Example metadata from a published dataset:

“rain.csv contains rainfall in mm for each month at Marysville, Victoria from January 1995 to February 2009”
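
The one-line description above already contains most of what a reader needs before downloading anything. As a rough sketch only, the same facts could be held in a structured record and checked for gaps. The class name, field names and helper functions below are hypothetical, chosen for illustration; they do not follow any particular metadata standard or repository schema.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class DatasetDescription:
    """Hypothetical minimal metadata record; field names are illustrative only."""
    filename: str
    variable: str
    units: str
    time_step: str
    location: str
    start: date   # first month covered (day set to 1 by convention here)
    end: date     # last month covered (day set to 1 by convention here)


def missing_fields(record):
    """Return the names of fields that were left empty (None or blank string)."""
    return [f.name for f in fields(record)
            if getattr(record, f.name) in (None, "")]


def expected_monthly_rows(record):
    """Number of monthly values implied by the stated start and end months, inclusive."""
    years = record.end.year - record.start.year
    months = record.end.month - record.start.month
    return years * 12 + months + 1


# Values taken directly from the quoted description of rain.csv.
rain = DatasetDescription(
    filename="rain.csv",
    variable="rainfall",
    units="mm",
    time_step="monthly",
    location="Marysville, Victoria",
    start=date(1995, 1, 1),
    end=date(2009, 2, 1),
)

print(missing_fields(rain))         # no gaps in this record
print(expected_monthly_rows(rain))  # months from January 1995 to February 2009
```

A check like this says nothing about whether the values in the file are correct; it only shows whether the description is complete enough to judge if the data might be usable.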


Should ALL data be open?

Most data produced through publicly funded research should be open.

But!

• Confidentiality issues (e.g. named persons’ health records)

• Conservation issues (e.g. maps of locations of rare animals at risk from poachers)

• Security issues (e.g. data and methodologies for building biological weapons)

There should be a very good reason for publicly funded data to not be open.


Open/Closed/Published/unpublished

[Figure: diagram with axes “Openness” and “Quality”; labels: CD, Webpage, OA journal, Subs journal, Data repository, Shared work space]

We want to encourage researchers to make their data:

• Open

• Persistent

• Quality assured:

– through scientific peer review

– or repository-managed processes

Unless there’s a very good reason not to!

Publishing = making something public after some formal process which adds value for the consumer (e.g. peer review) and provides commitment to persistence


Peer review, data and data journals

• Peer-review of a scientific publication is generally only applied to analysis, interpretation and conclusions, and not the underlying data.

• But if the conclusions are valid, the data must be of good quality.

• We need quality assurance of the data underlying research publications – either through peer-review or data repository checking.

• Researchers need credit for creating, managing and opening their data.

• Data journals provide that credit in an environment where academic status is solely based on publication record.

http://libguides.luc.edu/content.php?pid=5464&sid=164619


Faster horses?

• With all the innovation that the Web offers us, journal papers on-line still look substantially the same as print versions (with some exceptions)

• Are we just using the web as the same technology as print, but with faster horses?

https://flic.kr/p/2UZkgn


Redo from Start?

What would happen if we started again?

https://www.force11.org/group/scholarly-commons-working-group

Workshop held in Madrid, 29-30 February

https://www.force11.org/group/scholarly-commons-working-group/madrid-workshop

Coming up with new ideas to re-invent the scholarly commons!

https://docs.google.com/document/d/1ye2v0jN8uBpQy0etfD0FxOc-CdjEhtok0EV3wr95CB4/edit

https://miriamsmuse.wordpress.com/2012/08/29/out-of-cheese-error-redo-from-start/


Workshop visualisation

http://variable.io/force11/futurecommons/


Summary and maybe conclusions?

• We need to open the products of research

• to encourage innovation and collaboration

• to give credit to the people who’ve created them

• to be transparent and trustworthy

• Openness does come at a cost!

• It’s not enough for data to be open

• it needs to be usable and understandable too

• Data citation and publication are ways of encouraging researchers to make their data open

• or at least tell the world that their data exists!

• We need a culture change – but it’s already happening!

http://www.keepcalm-o-matic.co.uk/default.aspx#createposter


Thanks!

Any questions?

[email protected]

@sorcha_ni

http://citingbytes.blogspot.co.uk/

“Publishing research without data is simply advertising, not science” - Graham Steel

http://blog.okfn.org/2013/09/03/publishing-research-without-data-is-simply-advertising-not-science/

http://heywhipple.com/dont-show-me-a-something-about-show-me-something/
