
TWC Why Data Science Matters

Xiaogang (Marshall) Ma

Tetherless World Constellation, Rensselaer Polytechnic Institute

Email: [email protected]; Twitter: @MarshallXMa

ICSU-WDS Data Stewardship Award Lecture

SciDataCon 2014, New Delhi, India, Nov. 02-05

Acknowledgements

• Dr. Mustapha Mokrane and Dr. Simon Hodson

• Colleagues at TWC/RPI, CODATA-ECDP, ESIP, CGI-IUGS, AGU/ESSI, ICSU-WDS, RDA, ITC, and more

• My mentor Prof. Peter Fox

• My family

• All of you

Outline

• Technical trends
  – Data management, publication & citation

• Methodology
  – Interoperability & provenance

• Data management is just a start
  – Data analysis
  – Semantic eScience


Data Management

[Cartoon on data work; image courtesy Randy Glasbergen]

Data Management Plan

• Data Management Plan
  – A formal document that outlines what you will do with your data during and after you complete your research

• Resources and tools that help create DMPs:
  – NSF Data Management Plan Requirements: http://www.nsf.gov/eng/general/dmp.jsp
  – DCC Data Management Plans: http://www.dcc.ac.uk/resources/data-management-plans
  – DMPTool: https://dmptool.org
  – DCC DMPOnline: https://dmponline.dcc.ac.uk


Data Publication

• Data as first-class products of research
  – e.g., NSF bio-sketches can include data publications

Image from j4h.net

See: http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp


“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”

“…authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications.”

“…authors must make materials, data, and associated protocols available to readers.”

“…it is a condition of publication that authors make available the data and research materials supporting the results in the article.”

“…require authors to make all data underlying the findings described in their manuscript fully available without restriction…”

“Earth and space science data should be widely accessible in multiple formats and long‐term preservation of data is an integral responsibility of scientists and sponsoring institutions.”

“…support the principle that research data should be made freely available to all researchers…”

“…recommends depositing data that correspond to journal articles in reliable data repositories…”

• Ways of data publication
  – Data as supplemental material of a paper
  – Standalone data
  – Data paper: data in a repository + descriptive ‘data paper’


Strasser, GeoData 2014 Workshop Presentation (2014)

Examples:

• Standalone data journals: Nature Scientific Data, Geoscience Data Journal, Ecological Archives, Data in Brief …

• Journals that publish data papers: Earth and Space Science, GigaScience, F1000 Research, Internet Archaeology …


An isolated data island?!

Image from nature.com

Data Citation

• Data Citation Index
  – Indexes the world's leading data repositories
  – Connects datasets to related refereed literature indexed in the Web of Science™
  – Efficient access to data across subjects and regions

Image courtesy http://wokinfo.com

Data interoperability

Ma et al., Nature Geoscience (2011)

Interoperability: “Data should be discoverable, accessible, decodable, understandable and usable, and data sharing should be legal and ethical for all participants.”

Original image from: http://ehna.org

Provenance of research

Image from nature.com

Ma et al., Nature Climate Change (2014)

http://data.globalchange.gov

Provenance documentation: “Linking a range of observations and model outputs, research activities, people and organizations involved in the production of scientific findings with the supporting data sets and methods used to generate them”
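The linking idea in this definition can be illustrated with a small graph of statements. The sketch below is only an illustration, not the data model behind data.globalchange.gov: every identifier is invented, and the relation names are borrowed loosely from the W3C PROV vocabulary as an assumption.

```python
from collections import defaultdict

# Hypothetical provenance statements as (subject, relation, object) triples.
# All identifiers are invented; relation names loosely follow W3C PROV.
TRIPLES = [
    ("finding:warming_trend", "wasDerivedFrom", "dataset:station_temps"),
    ("finding:warming_trend", "wasGeneratedBy", "activity:trend_analysis"),
    ("activity:trend_analysis", "used", "dataset:station_temps"),
    ("activity:trend_analysis", "wasAssociatedWith", "person:analyst_a"),
    ("dataset:station_temps", "wasAttributedTo", "org:example_agency"),
]


def build_graph(triples):
    """Index the triples by subject so each node lists its outgoing links."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def trace(graph, start):
    """Return every node reachable from start: the data sets, activities,
    people and organizations that the starting item depends on."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _relation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


graph = build_graph(TRIPLES)
lineage = trace(graph, "finding:warming_trend")
print(sorted(lineage))
```

Tracing from the finding returns the analysis activity, the data set, the analyst and the agency, which is the kind of chain a reader would follow to assess a published result.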

• IPython Notebook: a web-based interactive computational environment

Di Stefano et al., ESIP 2014 Summer Meeting Presentation (2014)

Codes, APIs, datasets, text…

PDF document

• We extended the IPython Notebook environment to enable automatic provenance capture during a scientific workflow
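The slide does not describe how the extension works, so the following is not that extension. It is a minimal sketch of one way automatic capture could look in plain Python, assuming a hypothetical `capture_provenance` decorator and an in-memory `PROVENANCE_LOG`; both names are invented for illustration.

```python
import functools
import hashlib
import json
from datetime import datetime, timezone

PROVENANCE_LOG = []  # hypothetical in-memory store for captured records


def _fingerprint(value):
    """Short content hash so a record can refer to a value without storing it."""
    text = json.dumps(value, sort_keys=True, default=repr)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def capture_provenance(func):
    """Record each call of func: its name, start time, inputs and output."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        inputs = [_fingerprint(arg) for arg in args]
        inputs += [_fingerprint(kwargs[key]) for key in sorted(kwargs)]
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "started": started,
            "inputs": inputs,
            "output": _fingerprint(result),
        })
        return result
    return wrapper


@capture_provenance
def clean(values):
    """Drop missing observations."""
    return [v for v in values if v is not None]


@capture_provenance
def mean(values):
    return sum(values) / len(values)


average = mean(clean([1.0, None, 2.0, 3.0]))
for record in PROVENANCE_LOG:
    print(record["activity"], record["inputs"], "->", record["output"])
```

Because each argument is fingerprinted separately, the output hash recorded for `clean` matches the input hash recorded for `mean`, so the two steps can be chained into a workflow trace without any manual bookkeeping.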


Semantic eScience

• Artificial Intelligence accelerates scientific discovery
  – Data search, synthesis and hypothesis representation
  – Data analysis: reasoning with models of the data

Gil et al., Science (2014)

Image from science.com

A state-of-the-art example: Hanalyzer (high-throughput analyzer)

• Uses natural language processing to automatically extract a semantic network from all PubMed papers relevant to a scientist

• Uses Semantic Web technology to integrate assertions from other biomedical sources

• Reasons about the network to find new correlations that suggest new genes to investigate


Leach et al., PLoS Comput Bio (2009)
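The three steps listed for Hanalyzer (extract a network from text, integrate assertions from other sources, reason over the network) can be mimicked at toy scale. The sketch below is not Hanalyzer's algorithm: simple co-mention matching stands in for natural language processing, a shared-neighbor rule stands in for reasoning, and the gene symbols, abstracts and curated edges are all invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical inputs, invented for illustration only.
GENES = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
ABSTRACTS = [
    "GENE_A and GENE_B are co-expressed in developing tissue.",
    "Loss of GENE_B alters GENE_C expression.",
]
# Hypothetical assertions from a second, curated source.
CURATED_EDGES = [("GENE_C", "GENE_D")]


def edges_from_text(abstracts, genes):
    """Stand-in for NLP: link genes that are mentioned in the same abstract."""
    edges = set()
    for text in abstracts:
        tokens = {token.strip(".,;") for token in text.split()}
        mentioned = sorted(tokens & genes)
        edges.update(combinations(mentioned, 2))
    return edges


def build_network(*edge_sets):
    """Merge edges from several sources into one undirected network."""
    network = defaultdict(set)
    for edges in edge_sets:
        for a, b in edges:
            network[a].add(b)
            network[b].add(a)
    return network


def candidate_links(network):
    """Suggest pairs that are not linked directly but share a neighbor."""
    candidates = {}
    for a, b in combinations(sorted(network), 2):
        shared = network[a] & network[b]
        if b not in network[a] and shared:
            candidates[(a, b)] = sorted(shared)
    return candidates


network = build_network(edges_from_text(ABSTRACTS, GENES), CURATED_EDGES)
for pair, via in candidate_links(network).items():
    print(pair, "via", via)
```

The GENE_B and GENE_D suggestion only appears once the text-derived and curated edges are combined, which is the point of integrating sources before reasoning.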

Deep Carbon Virtual Observatory

Fox, RDA Fourth Plenary Meeting Presentation (2014)

http://deepcarbon.net

A cyber-enabled platform for linked science

Summary

• Data as first class products of research

• eScience: the digital or electronic facilitation of science

• Semantic eScience
  – A virtuous circle between science and semantic technologies
  – Data driven + Knowledge driven?

Image courtesy @WileyExchanges


More information: Marshall X [email protected]

Thank you!
